Spillovers of Program Benefits with Mismeasured Networks
Lina Zhang
Job Market Paper
September 22, 2020
Abstract
In studies of program evaluation under network interference, correctly measuring spillovers of the intervention is crucial for making appropriate policy recommendations. However, increasing empirical evidence has shown that network links are often measured with errors. This paper explores the identification and estimation of treatment and spillover effects when the network is mismeasured. I propose a novel method to nonparametrically point-identify the treatment and spillover effects when two network observations are available. The method can deal with a large network with missing or misreported links and possesses several attractive features: (i) it allows heterogeneous treatment and spillover effects; (ii) it does not rely on modelling network formation or its misclassification probabilities; and (iii) it accommodates samples that are correlated in overlapping ways. A semiparametric estimation approach is proposed, and the analysis is applied to study the spillover effects of an insurance information program on insurance adoption decisions.
Keywords: Causal Inference; Spillover; Heterogeneity; Random Experiments; Mismeasured Networks.

Department of Econometrics and Business Statistics, Monash University ([email protected]). I am grateful to David Frazier, Donald Poskitt and Xueyan Zhao for their invaluable guidance and support. I would like to thank Jiti Gao, Didier Nibbering, Bing Peng, Denni Tommasi, Takuya Ura and Benjamin Wong for helpful comments. All errors are mine.
Introduction
Correctly measuring the spillovers of a program intervention is essential for understanding whether and how the intervention influences individuals' outcomes through their social interactions, and for providing meaningful policy advice aimed at effective treatment allocation (Angelucci and Di Maro, 2016; Viviano, 2019). Existing methods studying spillover effects typically assume that accurate network links of all sampled units are available (see Leung, 2020b; Ma, Wang, and Tresp, 2020; Vazquez-Bare, 2019; Viviano, 2019, for example). However, such an assumption is hard to verify in practice and questionable in many settings (Sävje, 2019). For example, Angelucci et al. (2010) point out that when constructing the generational family network via respondents' surnames using the PROGRESA data, false connections exist between unrelated families sharing the same surnames. In the study of technology diffusion among pineapple farmers in Ghana, Conley and Udry (2010) notice the potential misclassification of information neighbors, due to the lack of a precise definition of the information network and the existence of multi-contextual social connections. Comola and Fafchamps (2017) also document a massive discrepancy (about 73%) between the inter-household transfers reported by givers and receivers in a village in Tanzania. Ignoring or mis-connecting either side of the responses may lead to misclassified risk-sharing networks. (The ratio 73% is computed as the number of reported transfers coming from only the giver or only the receiver (1250) over the total number of reported transfers (1721); see Comola and Fafchamps (2017), pages 560-561. Similar non-reciprocity problems have been found in other survey data: e.g., 40% of the risk-sharing network links from rural Philippines (Fafchamps and Lund, 2003) and more than 10% of the friendships among adolescents in the Add Health dataset (Calvó-Armengol, Patacchini, and Zenou, 2009; Patacchini, Rainone, and Zenou, 2017) are non-reciprocal.)

This paper investigates the identification and estimation of the treatment and spillover effects of a program intervention with mismeasured network data. There are several attractive features of the method proposed in this paper. First, it allows flexible forms of heterogeneity in the treatment and spillover effects, which is important for understanding how treatment response varies across the population (Manski, 2001). In addition, the analysis can be applied to settings with a large network that need not be block-diagonal and that contains missing or misreported links. Moreover, modeling of the network formation or its misclassification probability is not required to implement the proposed method.

I focus on a randomized program intervention and a superpopulation model studied in Leung (2020b). If the network is correctly measured, the direct treatment effect can be identified from the variation of the ego unit's own treatment status, and the spillover effect can be identified via the variation of a statistic summarizing the exposure to the treated peers. However, the network measurement errors introduced in this paper complicate the identification by contaminating the true channels of the network interference; ignoring those errors will therefore lead to biased estimation. The measurement errors considered in this paper are nonclassical; that is, they depend on the network interactions. In addition, the measurement errors are assumed to be independent of the treatment assignment. The identification strategy consists of two steps. First, the distribution of the true number of network neighbors (hereafter degree) is identified, with the help of the instrument network proxy.
Secondly, the distribution involving the true number of treated network neighbors, which measures the exposure to the treated peers, is identified. The identification in the second step relies on the observation that network proxies in some studies may satisfy one assumption: there exists only one type of measurement error. It means that the network proxy either includes no false links while allowing missing ones ("no false positive"), or includes no missing links while allowing false ones ("no false negative"). The one-type-of-measurement-error assumption dramatically simplifies the interdependence of the observed network-based variables with their latent counterparts, which is the main difficulty of the identification. A testable implication of this assumption is also available.

Inference in network settings is nonstandard due to the data correlation induced by the network interaction. In particular, outcomes of two units are correlated if they are connected or if they share common network neighbors (Leung, 2020b). In this paper, the mismeasured network adds to the complication by introducing an extra source of correlation, through the spillover of the measurement errors. Such spillover occurs because a false network connection between two units will alter both of their observable exposures to the treated neighbors. In addition to the above network-induced correlation, this paper also considers data dependency due to general forms of heteroscedasticity, autocorrelation and clustering, so that units who are neither friends nor share common friends may also be correlated with each other. Such correlation may be caused by, for instance, family background, school culture, or community diversity. All sources of data correlation described above generate distinct technical issues for the causal inference.

I propose a semiparametric estimation approach, which overcomes the difficulty caused by the spillover of the measurement errors, and the resulting estimator is shown to be consistent and asymptotically normal. To derive limit theorems, I extend the univariate central limit theorem (CLT) of Chandrasekhar and Jackson (2016) to multivariate settings. The estimation approach in this paper possesses several advantages: (i) it fits situations where there may exist no clear spatial or ordered structure; (ii) it does not require a large number of independent subnetworks. (Similar to this paper, Sävje (2019) also finds that misspecification of the treatment exposures will cause extra data dependence.)
Spillover effects of the treatment via network interactions have been documented in many applications, e.g., cash transfer programs (Barrera-Osorio et al., 2011), health programs (Dupas, 2014), public policy programs (Kremer and Miguel, 2007), education programs (Opper, 2019), and information diffusion (Banerjee et al., 2013). Misclassification is a pervasive problem of network data and has been noticed in, e.g., Advani and Malde (2018), Chandrasekhar and Lewis (2011), De Paula (2017) and Kossinets (2006).

This paper is one of the few that study the spillover effect of a program intervention with a mismeasured network. Hardy, Heath, Lee, and McCormick (2019) consider a parametric model for the potential outcomes and for the network misclassifications, and use a likelihood-based approach to estimate the spillover effect. In a nonparametric setting, when only a network proxy is available, He and Song (2018) provide a lower bound for the spillover effect under the restriction that the spillover is nonnegative. This paper is substantially different from the papers above, because it does not rely on modeling the network misclassifications, and more importantly, it provides a formal solution for the nonparametric point-identification of the spillover effect when two network proxies are available. (Sävje, Aronow, and Hudgens (2017) find that when there is a limited or moderate degree of network interaction, ignoring the network interference would not affect the asymptotic properties of the average treatment effect estimators. Chin (2018) studies average treatment effects under unmodeled network interference. However, neither of them explores the spillover effect.)

Gao and Li (2019) explore the endogenous and exogenous peer effects via the linear-in-means model with two mismeasured network proxies. Their identification result depends on three key assumptions. First, there exist two different latent network structures for the same group of individuals. Second, the error-contaminated network-based variables are assumed to be independent conditional on their latent counterparts, which implicitly requires the networks to be non-stochastic. Last, a copula is used to capture the dependence between the mismeasured network effects. Like this paper, Gao and Li (2019) exploit the matrix diagonalization method; however, my analysis focuses on the reduced-form treatment response function, which is modeled nonparametrically, enabling flexible forms of heterogeneous treatment and spillover effects. In addition, my identification strategy does not require different network structures for the same set of individuals, a non-stochastic network, or a copula structure for the network-based variables. Instead, the identification in this paper is achieved via restricting the network measurement error.

Consequences of, and solutions for, misclassified networks in estimating network statistics or network formation are discussed by, e.g., Breza, Chandrasekhar, McCormick, and Pan (2020), Candelaria and Ura (2020), Comola and Fafchamps (2017), Kossinets (2006), Liu (2013) and Thirkettle (2019). However, it is not clear how to apply these methods to identify treatment and spillover effects in a causal setting.

The literature exploring limit theorems using network dependent data is developing rapidly. Some papers assume that the social network can be partitioned into a large number of disjoint and independent subnetworks (see Lewbel et al., 2019; Vazquez-Bare, 2019, for example).
However, this independence assumption may not be plausible in practice, because it ignores the links across subnetworks. Chandrasekhar and Lewis (2011) adopt mixing conditions to restrict the dependence of network links, while, in many contexts, there is no underlying metric space to define the standard "mixing" forms of dependence. Leung (2020b) introduces the notion of a "dependence graph" to capture the network-correlated effects, and derives limit theorems under conditional local dependence; that is, the outcomes of two units are independent if they are neither friends nor share common friends. However, in the setting considered in this paper, the measurement errors disrupt the true network dependence structure, so that some seemingly uncorrelated units may actually be correlated with each other due to the latent network connections, and vice versa. Therefore, an alternative data dependence structure is needed. I adopt the "dependency neighborhoods" structure proposed by Chandrasekhar and Jackson (2016) to control the data correlation, which does not require observing the correct network links and imposes fewer restrictions on the dependence. (Other related papers are, e.g., Chandrasekhar and Lewis (2011), De Paula, Rasul, and Souza (2018a), Goldsmith-Pinkham and Imbens (2013), Lewbel, Qu, and Tang (2019) and Sojourner (2013). See also Hardy et al. (2019), Leung (2019a) and Manski (2013), which emphasize the difference between the structural model of social interactions and the reduced-form model focusing on the treatment response function.)

Let $D = \{D_i\}_{i\in\mathcal{P}}$ and $Z = \{Z_i\}_{i\in\mathcal{P}}$ denote vectors consisting of units' (or individuals', nodes', agents') treatment status and observable characteristics for the super-population $\mathcal{P}$, respectively. Denote by $A^*$ the true, latent and binary adjacency matrix, corresponding to an unweighted and undirected random network over the super-population $\mathcal{P}$. Each row of $A^*$, denoted by $A^*_i$, represents unit $i$'s network connections: $A^*_{ij} = 1$ if $i$ and $j$ are linked (or equivalently, network neighbors), and $A^*_{ij} = 0$ otherwise. As a convention, self links are ruled out, i.e. $A^*_{ii} = 0$ for all $i \in \mathcal{P}$. Given the adjacency matrix $A^*$, define the set of unit $i$'s first-degree network neighbors by $\mathcal{N}^*_i = \{j \in \mathcal{P} : A^*_{ij} = 1\}$. Denote by $|\mathcal{N}^*_i| = \sum_{j\in\mathcal{P}} A^*_{ij}$ the cardinality of $\mathcal{N}^*_i$; $|\mathcal{N}^*_i|$ is usually referred to as the "degree" of unit $i$. For each $i \in \mathcal{P}$, define the outcome $Y_i$ as
$$Y_i = \tilde{r}(i, D, A^*, Z, \varepsilon_i), \qquad (1)$$
where $\tilde{r}$ is an unknown real-valued function and $\varepsilon_i$ is an unobservable error term. The outcome in (1) acknowledges that one unit's outcome depends not only on his or her own treatment status, but also on the treatments assigned to other units, i.e. the spillover effect. I impose the assumption below to restrict the dependence of the outcome $Y_i$ on $(i, D, A^*, Z, \varepsilon_i)$.

Assumption 3.1 (Network Interference)
For all $i, k \in \mathcal{P}$, all $(D, A^*, Z)$ and all $(\tilde{D}, \tilde{A}^*, \tilde{Z})$,
$$\tilde{r}(i, D, A^*, Z, e) = \tilde{r}(k, \tilde{D}, \tilde{A}^*, \tilde{Z}, e), \quad \text{for all } e \in \Omega_{\varepsilon_i} \cup \Omega_{\varepsilon_k},$$
if the following conditions hold simultaneously: (i) $D_i = \tilde{D}_k$; (ii) $\sum_{j\in\mathcal{P}} A^*_{ij} = \sum_{j\in\mathcal{P}} \tilde{A}^*_{kj}$; (iii) $\sum_{j\in\mathcal{P}} A^*_{ij} D_j = \sum_{j\in\mathcal{P}} \tilde{A}^*_{kj} \tilde{D}_j$; (iv) $Z_i = \tilde{Z}_k$.

Assumption 3.1 states that the outcome is fully determined by (i) the unit's own treatment status; (ii) the unit's degree; (iii) the number of treated network neighbors $S^*_i := \sum_{j\in\mathcal{P}} A^*_{ij} D_j$; and (iv) the unit's own covariates. The same assumption is used in Leung (2020b). (Other papers studying limit theorems for network dependent data include Chin (2018), Kojevnikov, Marmer, and Song (2019), Kuersteiner (2019), Lee and Ogburn (2020), Leung and Moon (2019), Leung (2019b, 2020a), Liu and Hudgens (2014), Song (2018), van der Laan (2014) and references therein. The vectors of treatment status and observable characteristics, and the adjacency matrix, are infinite-dimensional; we follow Leung (2020b) and omit further details to ease the illustration. It is also worth noting that two different definitions of neighbors are used in this paper: the first, referred to as "network neighbors", is defined by the network links; the second, referred to as "dependent neighbors", is defined via the dependency neighborhoods and is used to characterize correlations of the random variables of interest; see Section 5.1 for more details.)

Assumption 3.1 substantially reduces the dimensionality of the outcome and reveals two crucial features of the network interactions. Firstly, the interference occurs locally, only among the first-order network neighbors. Thus, $(D_i, S^*_i)$ can be viewed as the "effective treatment" of Manski (2013). Secondly, the outcome is invariant to any permutation of the treatments received by the first-order network neighbors, meaning that the interactions are anonymous. Anonymous interaction is also referred to as "stratified interference"; see Baird et al. (2018), Basse and Feller (2018) and Hudgens and Halloran (2008) among others. Under Assumption 3.1, equation (1) can be simplified to
$$Y_i = r(D_i, S^*_i, Z_i, |\mathcal{N}^*_i|, \varepsilon_i), \quad \text{for all } i \in \mathcal{P}, \qquad (2)$$
where $r$ is an unknown real-valued function. Such an outcome structure permits adequate controls for the observable and unobservable heterogeneity of the treatment response. Given (2), it is easy to see that unit $i$'s outcome $Y_i$ is directly affected by his or her own treatment status $D_i$ (treatment effect), and is also affected by $S^*_i$ because of the exposure to the treated peers (spillover effect). The network $\mathcal{N}^*_i$ affects the outcome via two pathways: the degree $|\mathcal{N}^*_i|$ and the treated network neighbors incorporated in $S^*_i$. The degree $|\mathcal{N}^*_i|$ is a critical attribute because it quantifies the influence of each unit in the social network. Controlling for $|\mathcal{N}^*_i|$ in (2) enables us to target subpopulations based on different levels of influence, and it also acts as a control variable, as it allows correlation between the degree and the unobservables.
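To fix ideas, both network statistics entering (2) are simple functions of the adjacency matrix and the treatment vector. The sketch below (illustrative only, with assumed toy values; not part of the paper's procedure) builds a small undirected network and computes the degree $|\mathcal{N}^*_i|$ and the exposure $S^*_i$ for each unit.

```python
# A minimal sketch, assuming a toy population of 8 units; all names are
# illustrative. Degree and treated-neighbor exposure follow the definitions
# |N*_i| = sum_j A*_ij and S*_i = sum_j A*_ij D_j.
import numpy as np

rng = np.random.default_rng(0)
P = 8

# Undirected binary adjacency matrix with no self links.
A = rng.integers(0, 2, size=(P, P))
A = np.triu(A, k=1)
A = A + A.T                      # symmetric, zero diagonal

D = rng.integers(0, 2, size=P)   # randomized binary treatment status

degree = A.sum(axis=1)           # |N*_i|
S_star = A @ D                   # S*_i, the number of treated neighbors

for i in range(P):
    print(f"unit {i}: degree={degree[i]}, treated neighbors={S_star[i]}")
```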
Throughout the paper, the following notation is used. For any generic random variables $X$ and $Y$, denote by $f_X$ and $f_{X|Y}$ the probability function of $X$ and the conditional probability function of $X$ given $Y$, respectively. Let $\Omega_X$ denote the support of the random variable $X$. By abuse of notation, let $|B|$ denote the cardinality of any set $B$, or the absolute value of any scalar $B$. For any vector $a \in \mathbb{R}^p$, let $\|a\|_1 = \sum_{i=1}^p |a_i|$ be its $L_1$ norm, $\|a\|_2 = (a'a)^{1/2}$ its Euclidean norm, and $\|a\|_\infty = \max_{1\le i\le p} |a_i|$. Given a matrix $A = (a_{ij})$, we set $\|A\|_2 = [tr(A'A)]^{1/2}$ and $\|A\|_\infty = \max_{1\le i,j\le p} |a_{ij}|$. More generally, for an array (or a vector) of functions, say $a = \{a_i\}$ with $a_i : \Omega_X \to \mathbb{R}$, denote $\|a\|_\infty = \sup_{x\in\Omega_X} \sup_i |a_i(x)|$, where $i$ may stand for a multiple index. For an arbitrary parameter $\beta$, denote $d_\beta = \dim(\beta)$. $\perp$ denotes statistical independence.

To motivate the potential identification issues, let me begin by defining key concepts and introducing basic assumptions. (Aronow and Samii (2017), Leung (2019a) and Sävje (2019) consider possible misspecification of models similarly defined by Assumption 3.1, and tests for Assumption 3.1 are feasible following Athey et al. (2018). A similar control variable method is used in, e.g., Johnsson and Moon (2015).)

Definition 1 (CASF) For all $(d, s, z, n) \in \{0,1\} \times \Omega_{S^*, Z, |\mathcal{N}^*|}$, the conditional average structural function (CASF) is defined as
$$m^*(d, s, z, n) = E\left[ r(d, s, Z_i, |\mathcal{N}^*_i|, \varepsilon_i) \,\middle|\, Z_i = z, |\mathcal{N}^*_i| = n \right].$$

In this paper, I focus on the treatment and spillover effects, measuring the average change in potential outcomes in response to the counterfactual manipulation of the treatment assigned to the ego unit and to the network peers, respectively.

Definition 2 (Treatment and Spillover Effects)
For all $(s, z, n) \in \Omega_{S^*, Z, |\mathcal{N}^*|}$, define
$$\text{treatment effect:} \quad \tau_d(s, z, n) = m^*(1, s, z, n) - m^*(0, s, z, n),$$
$$\text{spillover effect:} \quad \tau_s(s, z, n) = m^*(0, s, z, n) - m^*(0, 0, z, n).$$

The assumption below introduces the ignorability conditions accounting for the network interference, based on which the causal effects of interest can be recovered if the actual network data are available.
Assumption 3.2 (a) (Randomized Treatment) $\{D_i\}_{i\in\mathcal{P}}$ are i.i.d. and $\{D_i\}_{i\in\mathcal{P}} \perp \{\varepsilon_j, Z_j, \mathcal{N}^*_j\}_{j\in\mathcal{P}}$. (b) (Unconfounded Network) For all $i \in \mathcal{P}$, $\varepsilon_i \perp (\mathcal{N}^*_i, \{D_j\}_{j\in\mathcal{N}^*_i}) \mid Z_i, |\mathcal{N}^*_i|$.

Assumption 3.2 (a) states that the treatment is randomly assigned, does not impact the network, and is independent of the potential outcomes. The randomized intervention is relevant for a wide range of experimental contexts, including Miguel and Kremer (2004), Aral and Walker (2012), Oster and Thornton (2012) and Cai et al. (2015b), to name a few; see Athey and Imbens (2017) for a review. Therefore, Assumption 3.2 (a) is a straightforward starting point for the analysis. Assumption 3.2 (b) requires an unconfounded network, which is weaker than a fully exogenous network, by allowing correlation between the degree $|\mathcal{N}^*_i|$ and the unobservable characteristics, for example, through the spillovers of unobservables. See Leung (2020b) for a similar assumption and supportive examples. The network unconfoundedness with respect to the treatment and the potential outcomes is likely to hold in randomized experiments where the network data are collected before the intervention. (Similar definitions measuring the direct effect and the spillover effect of treatment are also introduced in Hudgens and Halloran (2008) and Sobel (2006), to name a few. See Tchetgen and VanderWeele (2012) for a discussion of the relationships between various notions of causal effects in the presence of network interference. The analysis in this paper can be straightforwardly extended to studies dealing with other notions of treatment effect estimands. It is also worth noting that although Assumption 3.2 (b) allows dependence between the network $\mathcal{N}^*_i$ and the unobservable $\varepsilon_i$ through $|\mathcal{N}^*_i|$ and observable exogenous characteristics, it does not allow unobserved homophily in network formation where the unobservables are correlated with the potential outcomes.)

Assumption 3.3 (Distribution) (a) $\{Z_i\}_{i\in\mathcal{P}}$ are i.i.d. and $|\mathcal{N}^*_i|$ given $Z_i$ is identically distributed across $i \in \mathcal{P}$. (b) For all $i \in \mathcal{P}$, $\varepsilon_i$ given $(Z_i, |\mathcal{N}^*_i|)$ is identically distributed.

Assumption 3.3 (a) implies that the covariate $Z_i$ comes from randomly drawn samples, which is standard in the network effects literature, e.g. Johnsson and Moon (2015) and Auerbach (2019). Condition (a) also requires the conditional distribution of the degree to be invariant across units. An example of dyadic network formation in Appendix A can be used to verify the existence of such an identical distribution; see also a strategic network formation model in Leung (2020b) that satisfies (a). Moreover, the identical distribution of the error term $\varepsilon_i$ given $(Z_i, |\mathcal{N}^*_i|)$ in condition (b) ensures that the expressions of the CASF, the treatment effect $\tau_d$ and the spillover effect $\tau_s$ are identical for every unit $i \in \mathcal{P}$.

If the actual network $\mathcal{N}^*_i$ is correctly observed, then under the assumptions introduced so far the CASF can be identified by
$$m^*(d, s, z, n) = E\left[ Y_i \,\middle|\, D_i = d, S^*_i = s, Z_i = z, |\mathcal{N}^*_i| = n \right],$$
which ensures that the treatment and spillover effects are also identifiable. However, in many applications we fail to obtain fully accurate network information. Ignoring the missing or misclassified network links may lead to biased estimation and misleading causal implications.

In this subsection, I present the potential bias of the CASF identified from the mismeasured network data.
Assume that researchers randomly draw $N$ units from the population $\mathcal{P}$, and collect their outcomes of interest, treatment status, covariates, network information and the treatment assignments of their network neighbors. Thus, researchers observe
$$(Y_i, D_i, Z_i, \mathcal{N}_i, \{D_j\}_{j\in\mathcal{N}_i}), \quad \text{for } i = 1, 2, \ldots, N,$$
where $\mathcal{N}_i$ denotes the observed identities of unit $i$'s network neighbors, with cardinality $|\mathcal{N}_i|$. (In the analysis of this paper, it is feasible to relax the i.i.d. assumption on $Z_i$ and allow it to possess a dependence structure under the framework described in Section 5.1; we maintain the i.i.d. assumption only for simplicity of illustration. Also, for all $(d, s, z, n) \in \{0,1\} \times \Omega_{S^*, Z, |\mathcal{N}^*|}$, it can be shown that
$$E[Y_i \mid D_i = d, S^*_i = s, Z_i = z, |\mathcal{N}^*_i| = n] = E[r(D_i, S^*_i, Z_i, |\mathcal{N}^*_i|, \varepsilon_i) \mid D_i = d, S^*_i = s, Z_i = z, |\mathcal{N}^*_i| = n] = E[r(d, s, z, n, \varepsilon_i) \mid Z_i = z, |\mathcal{N}^*_i| = n] = m^*(d, s, z, n),$$
where the second equality is due to the unconfoundedness of $(D_i, S^*_i)$ in Lemma B.1 and the last equality is by Definition 1.) The convention of no self connections is maintained, i.e. $i
\notin \mathcal{N}_i$. Note that there are no restrictions on the sampling scheme of the network data. Namely, $\mathcal{N}_i$ can be obtained from a single and fully observed network, or from a (possibly partially observed) sampled network. In addition, $\mathcal{N}_i$ can be self-reported, acquired from administrative data, or constructed by researchers based on specific rules. Throughout the paper, $\mathcal{N}_i$ is referred to as the "network proxy". Given $\mathcal{N}_i$, the number of observed treated network neighbors is denoted by $S_i = \sum_{j\in\mathcal{N}_i} D_j$.

The assumption below extends Assumption 3.2 to accommodate the observable network proxy by restricting the misclassification of the network links.

Assumption 3.4 (Nondifferential Misclassification) (a) $\{D_i\}_{i\in\mathcal{P}} \perp \{\varepsilon_j, Z_j, \mathcal{N}^*_j, \mathcal{N}_j\}_{j\in\mathcal{P}}$. (b) For all $i \in \mathcal{P}$, $\varepsilon_i \perp (\mathcal{N}^*_i, \{D_j\}_{j\in\mathcal{N}^*_i}, \mathcal{N}_i, \{D_j\}_{j\in\mathcal{N}_i}) \mid Z_i, |\mathcal{N}^*_i|$. (c) For all $i \in \mathcal{P}$, $|\mathcal{N}_i|$ given $(Z_i, |\mathcal{N}^*_i|)$ is identically distributed.

Assumption 3.4 (a) and (b) indicate that, given the actual network information and individuals' characteristics, the observed proxy $\mathcal{N}_i$ does not contain relevant information for predicting the outcome, which is often referred to as "nondifferential misclassification" in the measurement error literature, e.g. Battistin and Sianesi (2011), Hu (2008) and Lewbel (2007). In addition, Assumption 3.4 (c) holds in many contexts, for example, when units fail to respond with probability proportional to their actual degrees ("the load effect"), or inversely proportional to their actual degrees ("the periphery effect"); see Kossinets (2006). I also provide one set of sufficient conditions for Assumption 3.4 (c) in Appendix A.

Now, denote the conditional mean function of the outcome given the observables as
$$m_i(d, s, z, n) = E[Y_i \mid D_i = d, S_i = s, Z_i = z, |\mathcal{N}_i| = n],$$
where the subscript $i$ of $m_i$ reflects the possibly non-identical conditional mean of the outcome given the observables, caused by the unknown dependence between the error-contaminated network variables $(S_i, |\mathcal{N}_i|)$ and their latent counterparts $(S^*_i, |\mathcal{N}^*_i|)$. The relationship between $m_i$ and $m^*$ is given by the proposition below.

Proposition 3.1
Under Assumptions 3.1-3.4, for all $i \in \mathcal{P}$ and all $(d, s, z, n) \in \{0,1\} \times \Omega_{S, Z, |\mathcal{N}|}$,
$$m_i(d, s, z, n) = \sum_{(s^*, n^*) \in \Omega_{S^*, |\mathcal{N}^*|}} m^*(d, s^*, z, n^*) \, f_{S^*_i, |\mathcal{N}^*_i| \mid D_i = d, S_i = s, Z_i = z, |\mathcal{N}_i| = n}(s^*, n^*).$$

Proposition 3.1 characterizes the bias in the CASF estimand when the measurement errors of the network links are ignored. The expression of $m_i$ makes it clear that the bias of $m_i$ is governed by the latent distribution of the actual network-based variables $(S^*_i, |\mathcal{N}^*_i|)$ given their observed proxies $(S_i, |\mathcal{N}_i|)$. The bias is larger when the misclassification probability of $(S_i, |\mathcal{N}_i|)$ is higher. Importantly, due to the nonparametric setting of $m^*$, simply differencing $m_i(1, s, z, n)$ and $m_i(0, s, z, n)$ in general does not recover the treatment effect $\tau_d(s, z, n)$, even though the treatment is randomized and correctly observed; it would do so only if the response to the variation in the ego unit's own treatment status were homogeneous in both the observables and unobservables, a strong structural restriction. Similar weighted-average expressions of the identifiable parameter are found in Gao and Li (2019) for the endogenous peer effects and in Hardy et al. (2019) for the treatment spillover effects. (In order to consistently estimate the CASF $m^*$, practitioners should either be able to restrict the degree of mismeasurement in the observed network, or identify and consistently estimate the intermediate latent distribution $f_{S^*_i, |\mathcal{N}^*_i| \mid D_i, S_i, Z_i, |\mathcal{N}_i|}$. The first solution can be accomplished with a single network proxy if the probability of misclassification decreases as the sample size increases; a rigorous study along this line is provided in the supplemental material.)
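To see the mechanics of Proposition 3.1, the following numeric sketch (all values are assumed for illustration, not taken from the paper) evaluates the observable regression $m_i$ as a mixture of CASF values over an assumed latent distribution of $S^*_i$ given the observables, making the bias from ignoring misclassification explicit.

```python
# Illustrative numerical check (assumed values) of Proposition 3.1: the
# observable regression m_i is a mixture of the CASF m* over the latent
# (S*_i, |N*_i|) given the observed proxies, so naively reading m_i as m*
# is biased whenever misclassification is present.
import numpy as np

# CASF on a grid of s* = 0, 1, 2 for a fixed (d, z, n*): assumed values.
m_star = np.array([1.0, 2.0, 4.0])

# Assumed conditional pmf of the latent S*_i given observed S_i = 1,
# e.g. some observed treated links are false and some true ones missing.
f_latent_given_obs = np.array([0.2, 0.6, 0.2])   # over s* = 0, 1, 2

m_obs = f_latent_given_obs @ m_star              # Proposition 3.1 mixture
print("m_i(d, s=1, ...) =", m_obs)               # 2.2
print("m*(d, s*=1, ...) =", m_star[1])           # 2.0
print("bias from ignoring misclassification:", m_obs - m_star[1])
```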
Suppose that two network proxies are available for each sampled individual $i \in \{1, 2, \ldots, N\}$, denoted by $\mathcal{N}_i$ and $\tilde{\mathcal{N}}_i$. These two proxies may come from repeated observations of a sampled network over time, different dimensions of connections (e.g. kinship and borrowing-lending), multi-contextual interactions (e.g. various social events or affiliations), or self-reported and administrative network data. Intuitively, the additional network proxy $\tilde{\mathcal{N}}_i$ can be understood as an instrument for the true latent network, as in Hu (2008) and Hu and Schennach (2008), conceptually similar to the instruments utilized in conventional instrumental variable methods. Following the same construction as before, for $\tilde{\mathcal{N}}_i$, denote its degree by $|\tilde{\mathcal{N}}_i|$ and the number of treated network neighbors by $\tilde{S}_i = \sum_{j\in\tilde{\mathcal{N}}_i} D_j$.

Let me first introduce a useful lemma which plays a key role in decomposing the latent distribution $f_{S^*_i, |\mathcal{N}^*_i| \mid D_i, S_i, Z_i, |\mathcal{N}_i|}$ into identifiable components. Without loss of generality, I state the result for one network proxy $\mathcal{N}_i$; the same result holds for $\tilde{\mathcal{N}}_i$.

Lemma 4.1 Under Assumptions 3.2 (a) and 3.4 (a), we have that
(a) $\mathcal{N}^*_i \perp S^*_i \mid Z_i, |\mathcal{N}^*_i|$ and $|\mathcal{N}_i| \perp S^*_i \mid Z_i, |\mathcal{N}^*_i|$;
(b) $\mathcal{N}_i \perp S_i \mid Z_i, |\mathcal{N}_i|$ and $|\mathcal{N}^*_i| \perp S_i \mid Z_i, |\mathcal{N}_i|$;
(c) for all $(s, n) \in \Omega_{S, |\mathcal{N}|}$,
$$f_{S^*_i \mid Z_i, |\mathcal{N}^*_i| = n}(s) = f_{S_i \mid Z_i, |\mathcal{N}_i| = n}(s) = C^s_n f_D(1)^s f_D(0)^{n-s},$$
where $f_D(d) := Pr(D_i = d)$ for $d \in \{0, 1\}$.

Lemma 4.1 (a) delivers two implications. The distribution of the exposure to treated peers, $S^*_i$, is fully determined by the degree and the exogenous covariates, rather than by the identities of the interacted units; by (b), the same holds for the observed exposure $S_i$. In addition, the identifiability of $f_{S^*_i \mid Z_i, |\mathcal{N}^*_i|}$ in (c) is intuitive, because the sum of any given $n$ i.i.d. treatment statuses follows a binomial distribution.
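Lemma 4.1 (c) can be checked directly by simulation. The snippet below (with illustrative parameters) draws i.i.d. treatments for the $n$ neighbors of a unit and compares the empirical distribution of the treated-neighbor count with the binomial formula $C^s_n f_D(1)^s f_D(0)^{n-s}$.

```python
# A quick simulation (illustrative) of Lemma 4.1(c): with i.i.d. randomized
# treatments, the number of treated neighbors of a unit with degree n follows
# Binomial(n, f_D(1)), regardless of which particular units are its neighbors.
import numpy as np
from math import comb

rng = np.random.default_rng(1)
n, p, reps = 5, 0.4, 200_000

# Treat the n network neighbors independently and count the treated ones.
S = rng.binomial(1, p, size=(reps, n)).sum(axis=1)

for s in range(n + 1):
    analytic = comb(n, s) * p**s * (1 - p) ** (n - s)
    print(f"s={s}: simulated {np.mean(S == s):.4f} vs analytic {analytic:.4f}")
```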
Given Proposition 3.1 and Lemma 4.1, to identify the CASF $m^*$ we can first decompose the latent distribution function $f_{S^*_i, |\mathcal{N}^*_i| \mid D_i, S_i, Z_i, |\mathcal{N}_i|}$, for one of the two network proxies, as follows.

Proposition 4.2 Under Assumptions 3.2 and 3.4, we have that
$$f_{S^*_i, |\mathcal{N}^*_i| \mid D_i, S_i, Z_i, |\mathcal{N}_i|} = \frac{f_{S_i \mid S^*_i, Z_i, |\mathcal{N}^*_i|, |\mathcal{N}_i|} \times f_{S^*_i \mid Z_i, |\mathcal{N}^*_i|} \times f_{|\mathcal{N}_i| \mid Z_i, |\mathcal{N}^*_i|} \times f_{|\mathcal{N}^*_i| \mid Z_i}}{f_{S_i \mid Z_i, |\mathcal{N}_i|} \times f_{|\mathcal{N}_i| \mid Z_i}}. \qquad (3)$$

From Proposition 4.2, the latent distribution of interest can be expressed as a product of six distributions, among which $f_{|\mathcal{N}_i| \mid Z_i}$, $f_{S^*_i \mid Z_i, |\mathcal{N}^*_i|}$ and $f_{S_i \mid Z_i, |\mathcal{N}_i|}$ can be identified directly from the observables under the assumptions introduced in the previous sections.

In what follows, I deal with the identification of the remaining distributions in the decomposition above in two steps. First, given the two observed networks $\mathcal{N}_i$ and $\tilde{\mathcal{N}}_i$, the method of matrix diagonalization of Hu (2008) is applied to identify $f_{|\mathcal{N}_i| \mid Z_i, |\mathcal{N}^*_i|}$ and $f_{|\mathcal{N}^*_i| \mid Z_i}$. Due to the complex and unconstrained interdependence between the observed $(S_i, |\mathcal{N}_i|)$, $(\tilde{S}_i, |\tilde{\mathcal{N}}_i|)$ and their latent counterpart $(S^*_i, |\mathcal{N}^*_i|)$ through the underlying network $\mathcal{N}^*_i$, even with two network proxies it is not feasible to identify the latent distribution $f_{S_i \mid S^*_i, Z_i, |\mathcal{N}^*_i|, |\mathcal{N}_i|}$ by simply repeating the matrix diagonalization approach. Therefore, in the second step, I introduce a crucial assumption on the network measurement error which dramatically simplifies the interdependence and ensures the identification of $f_{S_i \mid S^*_i, Z_i, |\mathcal{N}_i|, |\mathcal{N}^*_i|}$.

Assumptions 4.1 to 4.4 below are crucial for establishing the identification results via the matrix diagonalization technique similar to that used in Hu (2008). Modifications to the assumptions and to the method are made accordingly, to fit the network setting considered in this paper.
Assumption 4.1 (Exclusion Restriction) $|\mathcal{N}_i| \perp |\tilde{\mathcal{N}}_i| \mid Z_i, |\mathcal{N}^*_i|$.

Assumption 4.1 can be interpreted as a standard exclusion restriction: $|\tilde{\mathcal{N}}_i|$ provides no extra information about $|\mathcal{N}_i|$ beyond what the actual degree $|\mathcal{N}^*_i|$ already provides. It can also be understood as requiring the instrumental variable $\tilde{\mathcal{N}}_i$ to be conditionally independent of the measurement errors contained in the proxy $\mathcal{N}_i$. It rules out situations where both network proxies are mismeasured due to random omission of the same group of units when constructing the networks. One set of sufficient conditions for Assumption 4.1 is given in Appendix A. The exclusion restriction is the key to implementing the matrix diagonalization method.

Assumption 4.2 (Sparsity) $\Omega_{|\tilde{\mathcal{N}}|} = \Omega_{|\mathcal{N}|} = \Omega_{|\mathcal{N}^*|}$ with finite cardinality $K_{|\mathcal{N}|}$, and $\Omega_{\tilde{S}} = \Omega_S = \Omega_{S^*}$.

To illustrate the basic idea of the matrix diagonalization technique, let me introduce the following notation. Without loss of generality, set $\Omega_{|\mathcal{N}^*|} = \Omega_{|\mathcal{N}|} = \Omega_{|\tilde{\mathcal{N}}|} = \{0, 1, \ldots, K_{|\mathcal{N}|} - 1\}$. Denote the $K_{|\mathcal{N}|} \times K_{|\mathcal{N}|}$ matrix $F_{|\mathcal{N}| \mid Z, |\mathcal{N}^*|}$ as
$$F_{|\mathcal{N}| \mid Z, |\mathcal{N}^*|} = \begin{pmatrix} f_{|\mathcal{N}_i| \mid Z_i, |\mathcal{N}^*_i| = 0}(0) & \cdots & f_{|\mathcal{N}_i| \mid Z_i, |\mathcal{N}^*_i| = K_{|\mathcal{N}|}-1}(0) \\ \vdots & \ddots & \vdots \\ f_{|\mathcal{N}_i| \mid Z_i, |\mathcal{N}^*_i| = 0}(K_{|\mathcal{N}|}-1) & \cdots & f_{|\mathcal{N}_i| \mid Z_i, |\mathcal{N}^*_i| = K_{|\mathcal{N}|}-1}(K_{|\mathcal{N}|}-1) \end{pmatrix}. \qquad (4)$$
In a similar vein, define $F_{|\tilde{\mathcal{N}}| \mid Z, |\mathcal{N}^*|}$ by replacing $f_{|\mathcal{N}_i| \mid Z_i, |\mathcal{N}^*_i|}$ with $f_{|\tilde{\mathcal{N}}_i| \mid Z_i, |\mathcal{N}^*_i|}$. In addition, define two observable $K_{|\mathcal{N}|} \times K_{|\mathcal{N}|}$ matrices
$$F_{|\tilde{\mathcal{N}}|, |\mathcal{N}| \mid Z} = \left\{ f_{|\tilde{\mathcal{N}}_i|, |\mathcal{N}_i| \mid Z_i}(i, j) \right\}, \quad E_{|\tilde{\mathcal{N}}|, |\mathcal{N}|, Y \mid Z} = \left\{ \int_{y \in \Omega_Y} y \, f_{|\tilde{\mathcal{N}}_i|, |\mathcal{N}_i|, Y_i \mid Z_i}(i, j, y) \, dy \right\},$$
with $i, j = 0, 1, \ldots, K_{|\mathcal{N}|} - 1$, and define a $K_{|\mathcal{N}|} \times K_{|\mathcal{N}|}$ diagonal matrix
$$T_{Y \mid Z, |\mathcal{N}^*|} = diag\left( E[Y_i \mid Z_i, |\mathcal{N}^*_i| = 0], E[Y_i \mid Z_i, |\mathcal{N}^*_i| = 1], \ldots, E[Y_i \mid Z_i, |\mathcal{N}^*_i| = K_{|\mathcal{N}|} - 1] \right).$$

The main idea of the matrix diagonalization method is to identify the latent distributions of interest by diagonalizing the directly observable matrix $E_{|\tilde{\mathcal{N}}|, |\mathcal{N}|, Y \mid Z} \times F^{-1}_{|\tilde{\mathcal{N}}|, |\mathcal{N}| \mid Z}$:
$$T_{Y \mid Z, |\mathcal{N}^*|} = F^{-1}_{|\tilde{\mathcal{N}}| \mid Z, |\mathcal{N}^*|} \times \left( E_{|\tilde{\mathcal{N}}|, |\mathcal{N}|, Y \mid Z} \times F^{-1}_{|\tilde{\mathcal{N}}|, |\mathcal{N}| \mid Z} \right) \times F_{|\tilde{\mathcal{N}}| \mid Z, |\mathcal{N}^*|}.$$
The latent distributions in $F_{|\mathcal{N}| \mid Z, |\mathcal{N}^*|}$ and $F_{|\tilde{\mathcal{N}}| \mid Z, |\mathcal{N}^*|}$ are then recovered via an eigen-decomposition: the columns of $F_{|\tilde{\mathcal{N}}| \mid Z, |\mathcal{N}^*|}$ are the eigenvectors of the matrix $E_{|\tilde{\mathcal{N}}|, |\mathcal{N}|, Y \mid Z} \times F^{-1}_{|\tilde{\mathcal{N}}|, |\mathcal{N}| \mid Z}$, and the diagonal elements of $T_{Y \mid Z, |\mathcal{N}^*|}$ are the corresponding eigenvalues.
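The following stylized sketch (toy inputs assumed; it is not the paper's estimator) mimics the diagonalization step: it forms the observable matrix $F_{|\tilde{\mathcal{N}}| \mid Z, |\mathcal{N}^*|} T_{Y \mid Z, |\mathcal{N}^*|} F^{-1}_{|\tilde{\mathcal{N}}| \mid Z, |\mathcal{N}^*|}$, eigen-decomposes it, normalizes each eigenvector to sum to one so that columns are conditional pmfs, and orders the columns by their modal row, in the spirit of Assumption 4.4 (b).

```python
# A stylized sketch (assumed toy inputs) of the matrix diagonalization step:
# the latent misclassification matrix is recovered from the eigenvectors and
# the conditional outcome means from the eigenvalues.
import numpy as np

# Assumed latent objects for K = 3 degree values (columns sum to one).
F_tilde = np.array([[0.8, 0.1, 0.1],
                    [0.1, 0.8, 0.1],
                    [0.1, 0.1, 0.8]])   # f_{|~N_i| | Z_i, |N*_i|}
T = np.diag([1.0, 2.5, 4.0])            # distinct E[Y_i | Z_i, |N*_i| = n*]

observable = F_tilde @ T @ np.linalg.inv(F_tilde)  # what the data identify

eigvals, eigvecs = np.linalg.eig(observable)

# Normalize each eigenvector to sum to one so columns are proper pmfs, then
# order columns by their modal row, which Assumption 4.4(b) makes valid.
eigvecs = eigvecs / eigvecs.sum(axis=0, keepdims=True)
order = np.argsort(np.argmax(eigvecs, axis=0))
F_hat = eigvecs[:, order].real
T_hat = eigvals[order].real

print(np.allclose(F_hat, F_tilde))   # True: misclassification pmfs recovered
print(np.round(T_hat, 6))            # [1.0, 2.5, 4.0]
```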
Note that the discussion above presumes the invertibility of $F_{|\mathcal{N}| \mid Z, |\mathcal{N}^*|}$ and $F_{|\tilde{\mathcal{N}}| \mid Z, |\mathcal{N}^*|}$, which is formalized by Assumption 4.3 below.

Assumption 4.3 (Rank Condition) The ranks of $F_{|\mathcal{N}| \mid Z, |\mathcal{N}^*|}$ and $F_{|\tilde{\mathcal{N}}| \mid Z, |\mathcal{N}^*|}$ are both $K_{|\mathcal{N}|}$.

The next assumption is key to identifying the latent probabilities via the eigen-decomposition.
Assumption 4.4 (Eigen-decomposition) (a) For all $n, n' \in \Omega_{|\mathcal{N}^*|}$ such that $n \neq n'$, we have $E[Y_i \mid Z_i, |\mathcal{N}^*_i| = n] \neq E[Y_i \mid Z_i, |\mathcal{N}^*_i| = n']$. (b) For all $n^* \in \Omega_{|\mathcal{N}^*|}$ and any $n \neq n^*$, we have
$$f_{|\mathcal{N}_i| \mid Z_i, |\mathcal{N}^*_i| = n^*}(n^*) > f_{|\mathcal{N}_i| \mid Z_i, |\mathcal{N}^*_i| = n^*}(n), \quad f_{|\tilde{\mathcal{N}}_i| \mid Z_i, |\mathcal{N}^*_i| = n^*}(n^*) > f_{|\tilde{\mathcal{N}}_i| \mid Z_i, |\mathcal{N}^*_i| = n^*}(n).$$

Assumption 4.4 (a) is a sufficient condition ruling out duplicate eigenvalues, so that the eigen-decomposition is unique. It is automatically satisfied if $E[Y_i \mid Z_i, |\mathcal{N}^*_i|]$ is monotone in $|\mathcal{N}^*_i|$, and it also holds in more general scenarios. Note that condition (a) is a special case of the more general condition $E[\varpi(Y_i) \mid Z_i, |\mathcal{N}^*_i| = n] \neq E[\varpi(Y_i) \mid Z_i, |\mathcal{N}^*_i| = n']$, where the transformation function $\varpi(\cdot)$ can be user-specified, such as $\varpi(y) = (y - E[Y_i])^2$ (variance) or $\varpi(y) = 1[y \le y_0]$ (quantile) for some given $y_0$. Assumption 4.4 (b) ensures that the order of the eigenvectors is identifiable. It indicates that the observed degrees are informative proxies for the latent degree, which implicitly assumes that the probability of correct reporting is higher than that of misreporting. Similar restrictions are widely invoked in the measurement error literature; see e.g. Battistin and Sianesi (2011), Battistin, De Nadai, and Sianesi (2014), Chen, Hong, and Nekipelov (2011), Hu and Schennach (2008), Lewbel (2007) and Mahajan (2006).

Theorem 4.3
Suppose Assumption 3.4 is satisfied by $\tilde{\mathcal{N}}_i$ and $\mathcal{N}_i$. Under Assumptions 3.1-3.3 and 4.1,
(a) $f_{|\mathcal{N}_i| \mid Z_i}$, $f_{|\tilde{\mathcal{N}}_i|, |\mathcal{N}_i| \mid Z_i}$ and $f_{|\tilde{\mathcal{N}}_i|, |\mathcal{N}_i|, Y_i \mid Z_i}$ are identical across $i \in \mathcal{P}$.
(b) If, in addition, Assumptions 4.2-4.4 hold, then $f_{|\mathcal{N}^*_i| \mid Z_i}$, $f_{|\mathcal{N}_i| \mid Z_i, |\mathcal{N}^*_i|}$ and $f_{|\tilde{\mathcal{N}}_i| \mid Z_i, |\mathcal{N}^*_i|}$ are nonparametrically identified.

Next, let me proceed with the identification of $f_{S_i \mid S^*_i, Z_i, |\mathcal{N}_i|, |\mathcal{N}^*_i|}$. The matrix diagonalization method is infeasible in this step, because of the violation of an exclusion restriction analogous to Assumption 4.1. In other words, the conditional independence $S_i \perp \tilde{S}_i$ given $(S^*_i, \mathcal{Z}_i)$ with $\mathcal{Z}_i = (Z_i, |\mathcal{N}^*_i|)$ does not hold. (Intuitively, $\mathcal{Z}_i = (Z_i, |\mathcal{N}^*_i|)$ as a whole can be regarded as a covariate, because its distribution is identifiable based on Theorem 4.3.) To be more specific, consider the expression of $S_i$ in terms of $S^*_i$:
$$S_i = S^*_i - \sum_{j \in \mathcal{N}^*_i / \mathcal{N}_i} D_j + \sum_{j \in \mathcal{N}_i / \mathcal{N}^*_i} D_j, \qquad (5)$$
where for any sets $A$ and $B$, $A/B := A \cap B^c$ with $B^c$ the complement of $B$. The set $\mathcal{N}^*_i / \mathcal{N}_i$ contains all the missing network links of $i$ (false negatives), and the set $\mathcal{N}_i / \mathcal{N}^*_i$ includes all the false network links (false positives). Similarly, $\tilde{S}_i = S^*_i - \sum_{j \in \mathcal{N}^*_i / \tilde{\mathcal{N}}_i} D_j + \sum_{j \in \tilde{\mathcal{N}}_i / \mathcal{N}^*_i} D_j$. Consider an extreme case where the super-population $\mathcal{P}$ is a finite set and unit $i$ connects to all other units, i.e. $\mathcal{N}^*_i = \mathcal{P} \setminus \{i\}$. Then the only possible misclassification for $\mathcal{N}_i$ and $\tilde{\mathcal{N}}_i$ is underreporting, leading to both $S_i$ and $\tilde{S}_i$ being smaller than $S^*_i$. Apparently, $S_i$ and $\tilde{S}_i$ are positively correlated in this case, contradicting the exclusion restriction.

Based on the discussion above, the main issue in identifying $f_{S_i \mid S^*_i, Z_i, |\mathcal{N}_i|, |\mathcal{N}^*_i|}$ arises from the dependence between $(S_i, |\mathcal{N}_i|)$ and $(S^*_i, |\mathcal{N}^*_i|)$. Such dependence is not easy to characterize, because $(S_i, |\mathcal{N}_i|)$ and $(S^*_i, |\mathcal{N}^*_i|)$ relate to each other via the underlying network $\mathcal{N}^*_i$, which is unobservable, and the arbitrary measurement error further complicates their relationship. The latter is because, without imposing any constraint on the measurement error, given $(S^*_i = s^*, |\mathcal{N}^*_i| = n^*, |\mathcal{N}_i| = n)$ there are various realizations of $\mathcal{N}_i$ and $\mathcal{N}^*_i$, each of which may lead to a substantially different $S_i$. For example, when $n = n^*$, it is possible that all network links are classified correctly, so that $\mathcal{N}_i = \mathcal{N}^*_i$; if so, $S_i$ is entirely determined by its latent counterpart $S^*_i$. But it is also possible that not a single element of $\mathcal{N}_i$ and $\mathcal{N}^*_i$ coincides, even though they have the same cardinality; in that case $S_i$ is solely governed by the treatment status of the misreported false network neighbors, $\sum_{j \in \mathcal{N}_i} D_j$, and does not depend on $(S^*_i, |\mathcal{N}^*_i|)$ at all. Therefore, without further restricting the measurement error, there is too little information and too much uncertainty to pin down $f_{S_i \mid S^*_i, Z_i, |\mathcal{N}_i|, |\mathcal{N}^*_i|}$.

For any given $n \in \Omega_{|\mathcal{N}|}$ and $n^* \in \Omega_{|\mathcal{N}^*|}$, the $(n+1) \times (n^*+1)$ unknown conditional probabilities of $S_i$, which characterize the dependence structure between $(S_i, |\mathcal{N}_i|)$ and $(S^*_i, |\mathcal{N}^*_i|)$, can be collected in the $(n+1) \times (n^*+1)$ matrix
$$F_{S \mid S^*, |\mathcal{N}|, |\mathcal{N}^*|, Z} = \begin{pmatrix} f_{S_i \mid S^*_i = 0, Z_i, |\mathcal{N}_i| = n, |\mathcal{N}^*_i| = n^*}(0) & \cdots & f_{S_i \mid S^*_i = n^*, Z_i, |\mathcal{N}_i| = n, |\mathcal{N}^*_i| = n^*}(0) \\ \vdots & \ddots & \vdots \\ f_{S_i \mid S^*_i = 0, Z_i, |\mathcal{N}_i| = n, |\mathcal{N}^*_i| = n^*}(n) & \cdots & f_{S_i \mid S^*_i = n^*, Z_i, |\mathcal{N}_i| = n, |\mathcal{N}^*_i| = n^*}(n) \end{pmatrix}. \qquad (6)$$
Denote the $(n+1) \times 1$ vector $F_{S \mid Z, |\mathcal{N}|}$ and the $(n^*+1) \times 1$ vector $F_{S^* \mid Z, |\mathcal{N}^*|}$ by
$$F_{S \mid Z, |\mathcal{N}|} = [f_{S_i \mid Z_i, |\mathcal{N}_i| = n}(0), \ldots, f_{S_i \mid Z_i, |\mathcal{N}_i| = n}(n)]', \quad F_{S^* \mid Z, |\mathcal{N}^*|} = [f_{S^*_i \mid Z_i, |\mathcal{N}^*_i| = n^*}(0), \ldots, f_{S^*_i \mid Z_i, |\mathcal{N}^*_i| = n^*}(n^*)]',$$
where both vectors are identifiable. Lemma 4.1 (b) and the law of total probability then yield a system of $(n+1)$ linear equations with $(n+1) \times (n^*+1)$ unknowns:
$$F_{S \mid Z, |\mathcal{N}|} = F_{S \mid S^*, |\mathcal{N}|, |\mathcal{N}^*|, Z} \times F_{S^* \mid Z, |\mathcal{N}^*|}, \qquad (7)$$
which, however, is underdetermined because there are fewer equations than unknowns. (Equation (7) follows because $f_{S_i \mid Z_i, |\mathcal{N}_i| = n}(s) = f_{S_i \mid Z_i, |\mathcal{N}_i| = n, |\mathcal{N}^*_i| = n^*}(s) = \sum_{s^* \in \Omega_{S^*}} f_{S_i \mid Z_i, S^*_i = s^*, |\mathcal{N}_i| = n, |\mathcal{N}^*_i| = n^*}(s) \times f_{S^*_i \mid Z_i, |\mathcal{N}_i| = n, |\mathcal{N}^*_i| = n^*}(s^*) = \sum_{s^* \in \Omega_{S^*}} f_{S_i \mid Z_i, S^*_i = s^*, |\mathcal{N}_i| = n, |\mathcal{N}^*_i| = n^*}(s) \times f_{S^*_i \mid Z_i, |\mathcal{N}^*_i| = n^*}(s^*)$.) Therefore, it is necessary to impose restrictions that reduce the number of unknown parameters in order to obtain
a unique solution for the system (7). Fortunately, this goal is achieved if the possibility of either false negatives or false positives can be ruled out.

Without loss of generality, suppose no false negative holds, i.e. $\mathcal{N}^*_i \subset \mathcal{N}_i$. Firstly, $\mathcal{N}^*_i \subset \mathcal{N}_i$ enforces a sparsity constraint on the unknowns: given $S^*_i = s^*$, the probability of $S_i = s$ with $s < s^*$ must be zero, since the only source of misclassification in $S_i$ is the false connections. Therefore, the elements above the diagonal of the matrix $F_{S \mid S^*, |\mathcal{N}|, |\mathcal{N}^*|, Z}$ are all zero. Secondly, $\mathcal{N}^*_i \subset \mathcal{N}_i$ also dramatically simplifies the dependence structure between $(S_i, |\mathcal{N}_i|)$ and $(S^*_i, |\mathcal{N}^*_i|)$ by limiting the possible realizations of $\mathcal{N}_i$ and $\mathcal{N}^*_i$: the elements in each $k$-diagonal ($k = -1, -2, \ldots, -n$) of the matrix $F_{S \mid S^*, |\mathcal{N}|, |\mathcal{N}^*|, Z}$ are the same. This is because, under no false negative, no matter what the number of actual treated friends $S^*_i$ is, the conditional distribution of $S_i$ is the same as long as its increase relative to the truth, $S_i - S^*_i$, is the same. Intuitively, $f_{S_i \mid S^*_i = s^*, Z_i, |\mathcal{N}_i| = n, |\mathcal{N}^*_i| = n^*}(s)$ equals the probability of randomly choosing $s - s^*$ units out of $n - n^*$ units, which does not vary with the realizations of $S^*_i$, $|\mathcal{N}_i|$ and $|\mathcal{N}^*_i|$.

Now, under no false negative, for any $n^* \le n$, let $\Delta S_i := S_i - S^*_i$, $\Delta n := |\mathcal{N}_i / \mathcal{N}^*_i| = n - n^*$, and abbreviate $g(\cdot) := f_{\Delta S_i \mid Z_i, |\mathcal{N}_i / \mathcal{N}^*_i| = \Delta n}(\cdot)$. The matrix $F_{S \mid S^*, |\mathcal{N}|, |\mathcal{N}^*|, Z}$ then simplifies to the banded lower-triangular form
$$\begin{pmatrix} g(0) & 0 & \cdots & 0 \\ g(1) & g(0) & \ddots & \vdots \\ \vdots & g(1) & \ddots & 0 \\ g(\Delta n) & \vdots & \ddots & g(0) \\ 0 & g(\Delta n) & \ddots & g(1) \\ \vdots & \ddots & \ddots & \vdots \\ 0 & \cdots & 0 & g(\Delta n) \end{pmatrix}, \qquad (8)$$
with only the $(\Delta n + 1)$ distinct unknowns $g(0), \ldots, g(\Delta n)$, no more than the $(n+1)$ equations, ensuring a unique solution for (7) and the identification of $f_{S_i \mid S^*_i, Z_i, |\mathcal{N}_i|, |\mathcal{N}^*_i|}$. The discussion above only requires one of the two network proxies to satisfy the desired property, and imposes no restriction on the measurement error of the other proxy except for those assumed previously. Without loss of generality, hereafter $\mathcal{N}_i$ denotes the proxy that satisfies the requirement.

Assumption 4.5 (One Type of Measurement Error)
For each unit $i \in \mathcal{P}$, the proxy $\mathcal{N}_i$ satisfies either no false positive, i.e. $\mathcal{N}_i \subset \mathcal{N}^*_i$, or no false negative, i.e. $\mathcal{N}^*_i \subset \mathcal{N}_i$.

(Under no false negative, we do not consider the case $n < n^*$, because $\mathcal{N}^*_i \subset \mathcal{N}_i$ implies that the event $(|\mathcal{N}^*_i|, |\mathcal{N}_i|) = (n^*, n)$ with $n < n^*$ has zero probability, and a conditional probability given a zero-probability event is undefined. Similarly, under no false positive, we do not consider the case $n > n^*$. It is also worth noting the equivalence between $f_{S_i \mid S^*_i, Z_i, |\mathcal{N}_i|, |\mathcal{N}^*_i|}$ and $f_{S^*_i \mid S_i, Z_i, |\mathcal{N}_i|, |\mathcal{N}^*_i|}$ via rescaling:
$$f_{S_i \mid S^*_i = s^*, Z_i, |\mathcal{N}_i| = n, |\mathcal{N}^*_i| = n^*}(s) = f_{S^*_i \mid S_i = s, Z_i, |\mathcal{N}_i| = n, |\mathcal{N}^*_i| = n^*}(s^*) \, f_{S_i \mid Z_i, |\mathcal{N}_i| = n}(s) / f_{S^*_i \mid Z_i, |\mathcal{N}^*_i| = n^*}(s^*),$$
where the equality is based on Lemma 4.1. Similar arguments apply to the case where no false positive holds.)

Borrowing the terminology from Calvi et al. (2018), Assumption 4.5 is referred to as "one type of measurement error". As the next lemma shows, Assumption 4.5 buys significant simplicity in the interdependence between the observable $(S_i, |\mathcal{N}_i|)$ and the latent $(S^*_i, |\mathcal{N}^*_i|)$, dramatically reducing the number of unknown probabilities.

Lemma 4.4
Suppose Assumptions 3.2, 3.4 and 4.5 hold. Let $\Delta s = |s - s^*|$ and $\Delta n = |n - n^*|$. For all $(s^*, n^*) \in \Omega_{S^*, |\mathcal{N}^*|}$ and all $(s, n) \in \Omega_{S, |\mathcal{N}|}$, $f_{S^*_i \mid S_i, |\mathcal{N}^*_i|, |\mathcal{N}_i|, Z_i}$ is identical across $i \in \mathcal{P}$. (The conditions $(s^*, n^*) \in \Omega_{S^*, |\mathcal{N}^*|}$ and $(s, n) \in \Omega_{S, |\mathcal{N}|}$ implicitly imply that $0 \le s \le n$ and $0 \le s^* \le n^*$.)
(a) If no false negative holds, $\mathcal{N}^*_i \subset \mathcal{N}_i$, then for $n^* \le n$,
$$f_{S_i \mid S^*_i = s^*, |\mathcal{N}^*_i| = n^*, |\mathcal{N}_i| = n, Z_i}(s) = \begin{cases} C^{\Delta s}_{\Delta n} f_D(1)^{\Delta s} f_D(0)^{\Delta n - \Delta s}, & \text{if } s^* \le s \text{ and } \Delta s \le \Delta n, \\ 0, & \text{otherwise}. \end{cases}$$
(b) If no false positive holds, $\mathcal{N}_i \subset \mathcal{N}^*_i$, then for $n \le n^*$,
$$f_{S^*_i \mid S_i = s, |\mathcal{N}^*_i| = n^*, |\mathcal{N}_i| = n, Z_i}(s^*) = \begin{cases} C^{\Delta s}_{\Delta n} f_D(1)^{\Delta s} f_D(0)^{\Delta n - \Delta s}, & \text{if } s \le s^* \text{ and } \Delta s \le \Delta n, \\ 0, & \text{otherwise}. \end{cases}$$

It is perhaps not surprising that $S_i$ conditional on $(S^*_i, |\mathcal{N}^*_i|, |\mathcal{N}_i|, Z_i)$ follows a binomial distribution, given the equivalence of $f_{S_i \mid S^*_i = s^*, Z_i, |\mathcal{N}_i| = n, |\mathcal{N}^*_i| = n^*}(s)$ to the probability of randomly assigning treatment to $\Delta s$ out of $\Delta n$ units. The result in Lemma 4.4 enables a faster and easier way to identify $f_{S^*_i \mid S_i, |\mathcal{N}^*_i|, |\mathcal{N}_i|, Z_i}$ without solving the linear system. Nevertheless, the linear system greatly facilitates the identification analysis and determines the identification status of $f_{S^*_i \mid S_i, |\mathcal{N}^*_i|, |\mathcal{N}_i|, Z_i}$, and the solution of the system coincides with the result obtained by simply exploiting the binomial distribution.
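The banded structure and the solution of (7) can be illustrated numerically. Under no false negative, $S_i = S^*_i + \Delta S_i$ with $\Delta S_i \sim \mathrm{Binomial}(\Delta n, f_D(1))$ independent of $S^*_i$, so stacking the shifted binomial pmf into the matrix (8) reproduces the identified marginal $\mathrm{Binomial}(n, f_D(1))$ exactly. The check below uses assumed values of $(n^*, n, f_D(1))$.

```python
# An illustrative check (assumed parameters) of Lemma 4.4 and system (7) under
# "no false negative": the matrix (8) is banded with binomial entries, and
# F_{S|Z,|N|} = F_{S|S*,...} x F_{S*|Z,|N*|} reduces to the convolution
# identity Binomial(n*, p) + Binomial(dn, p) = Binomial(n, p).
import numpy as np
from math import comb

def binom_pmf(n, p):
    return np.array([comb(n, s) * p**s * (1 - p) ** (n - s)
                     for s in range(n + 1)])

p, n_star, n = 0.3, 2, 5
dn = n - n_star                       # number of false links |N_i / N*_i|

# Matrix (8): entry (s, s*) = P(Delta S = s - s*), zero outside the band.
delta_pmf = binom_pmf(dn, p)
F = np.zeros((n + 1, n_star + 1))
for s_star in range(n_star + 1):
    F[s_star:s_star + dn + 1, s_star] = delta_pmf

lhs = binom_pmf(n, p)                 # identified f_{S_i | Z_i, |N_i| = n}
rhs = F @ binom_pmf(n_star, p)        # right-hand side of system (7)
print(np.allclose(lhs, rhs))          # True
```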
Theorem 4.5 Under Assumptions 3.2-3.4 and 4.5, $f_{S^*_i, |\mathcal{N}^*_i| \mid D_i, S_i, Z_i, |\mathcal{N}_i|}$ is identical across $i \in \mathcal{P}$ and nonparametrically identified.

"No false positive" is satisfied in many situations, for instance, when the mismeasurement is caused by sampling-induced error, such as missing links ("induced subgraph" in Kossinets, 2006), restricting the network to within a village (Angelucci et al., 2010), or limiting the maximum number of nominated friends (Cai et al., 2015b). It is also satisfied when non-sampling-induced error arises, for example, when a survey respondent loses interest in naming the full list of his or her friends due to survey fatigue; when abstract but meaningful connections, e.g. esteem or authority, are hard to measure; or when the network is constructed by intersecting repeated network observations.

If $\mathcal{N}_i$ satisfies the one-type-of-measurement-error assumption, the matrix $F_{|\mathcal{N}| \mid Z, |\mathcal{N}^*|}$ in (4) should be upper triangular under no false positive, and lower triangular under no false negative. Based on Theorem 4.3, since $F_{|\mathcal{N}| \mid Z, |\mathcal{N}^*|}$ is identifiable, it is possible to test the one-type-of-measurement-error assumption via the null hypothesis that all elements in either the upper or the lower triangle of $F_{|\mathcal{N}| \mid Z, |\mathcal{N}^*|}$ are zero. One possible testing approach is the subsampling or bootstrap method proposed by Romano and Shaikh (2012), with proper adjustments to accommodate the network data. A formal test is left for future research. (Other possible testing approaches may be established following Leung (2020a) if a $\sqrt{N}$ convergence rate of the estimator of $F_{|\mathcal{N}| \mid Z, |\mathcal{N}^*|}$ is attainable. This may be the case when the outcome $Y_i$ and the covariates in $Z_i$ are discrete, so that smooth kernel estimation is not needed and the $\sqrt{N}$ rate can be achieved based on the proof of Theorem 5.2 in Section 5.)

Given the results in Theorems 4.3 and 4.5, identification of the CASF and of the treatment and spillover effect estimands can be achieved.
Theorem 4.6 (Identification)
Suppose Assumption 3.4 is satisfied by $\tilde{\mathcal{N}}_i$ and $\mathcal{N}_i$, and let Assumptions 3.1-3.3 and 4.1-4.5 hold.
(a) For all $(d, s, z, n) \in \{0,1\} \times \Omega_{S, Z, |\mathcal{N}|}$ such that $f_{|\mathcal{N}_i| \mid Z_i = z}(n) > 0$, $m_i(d, s, z, n) = E[Y_i \mid D_i = d, S_i = s, Z_i = z, |\mathcal{N}_i| = n]$ is identical for all $i \in \mathcal{P}$.
(b) The CASF $m^*$, the treatment effect estimand $\tau_d$ and the spillover effect estimand $\tau_s$ are nonparametrically identifiable wherever they are well-defined.

4.3 Discussion and Extension

As implied by Lemma 4.1, the anonymity of interactions, $S^*_i \perp \mathcal{N}^*_i \mid Z_i, |\mathcal{N}^*_i|$, is critical for accomplishing the identification of $m^*$. The key factor ensuring anonymous interactions is that, for any given unit $i$, the treatment assignments to units other than $i$, $\{D_j\}_{j\in\mathcal{P}, j \neq i}$, conditional on $(Z_i, |\mathcal{N}^*_i|)$, are i.i.d. across $j$. This may be violated if some exogenous covariates enter the network formation process and also influence the treatment assignment. The reason is that, if one believes in homophily effects in network formation, i.e. that individuals are more likely to establish a link if they are similar, then unit $i$'s characteristics and her peers' identities together reveal relevant information about the characteristics of her peers and non-peers. Conditioning on the covariate $Z_i$, the i.i.d. property of $\{D_j\}_{j\in\mathcal{P}, j \neq i}$ would then fail to hold.

Given the discussion in Section 4.3.1, there are two settings where the fully randomized treatment can be relaxed to allow the treatment to be randomized based on individuals' characteristics. The first setting accounts for homophily effects but requires the existence of a subset of individual characteristics $Z_{1,i} \subset Z_i$ such that $Z_{1,i}$ does not affect the network formation. Then, the treatments can be randomly assigned based on $Z_{1,i}$. For example, in a microfinance program, interventions can be allocated randomly given participants' social status (such as occupation), which is unlikely to determine a network measured by "going to pray together", as the people with whom an individual goes to pray should depend closely on religion, gender and caste, rather than on social status. The second setting suits situations where it is reasonable to believe that the network is formed following the random graph model of Erdős and Rényi (1959), i.e. each link is formed independently with the same probability. Then, the treatments can be randomly assigned based on $Z_i$. Studying the consequences of relaxing this condition to allow more general unconfounded treatment assignments is an interesting area for future exploration.

Readers may observe that the analysis so far does not actually require the network $\mathcal{N}^*_i$ to be undirected, and the generalization to directed networks is straightforward. If the unweighted restriction is also relaxed, then the spillover effects can be captured by $S^*_i = \sum_{j\in\mathcal{N}^*_i} \pi(Z_{1,j}) D_j$, with $Z_{1,j}$ being a subset of $Z_j$ and $\pi(\cdot)$ a known weighting function. For the same reason discussed in Section 4.3.1, it is required that $Z_{1,j}$ not impact the network formation.
For example, in the microfinance program, a unit with a higher degree of financial literacy might be assigned a higher weight, while financial literacy is unlikely to have a direct impact on the network of women from South India, which is collected before the microfinance program is implemented.

This section is organized as follows. Subsection 5.1 introduces the notion of the dependency neighborhood, which is the stepping stone for establishing the asymptotic properties of the estimation approach proposed in this paper. Subsection 5.2 presents the nonparametric kernel estimation, and Subsection 5.3 discusses the semiparametric estimation procedure.
Let $W_i$ be an observable random variable or vector. For each unit $i$ and sample size $N$, the dependency neighborhood of unit $i$, denoted $\Delta(i, N)$, is a set such that $\Delta(i, N) \subset \{1, 2, \ldots, N\}$, $i \in \Delta(i, N)$, and the conditions in Assumption 5.1 below hold. Any unit $j \in \Delta(i, N)$ is referred to as a dependent neighbor of unit $i$; a dependent neighbor is not necessarily a network neighbor. Following Chandrasekhar and Jackson (2016), I define the dependency neighborhood by restricting the relative correlation of $\{W_i\}_{i=1}^N$ inside and outside of $\{\Delta(i, N)\}_{i=1}^N$. For any integrable function $b$, denote the sum of covariances over all pairs of units in each other's dependency neighborhoods as
$$\Sigma^b_N = \sum_{i=1}^N \sum_{j \in \Delta(i, N)} Cov(b(W_i), b(W_j)), \qquad (9)$$
which captures the variation of $b(W_i)$ for all $N$ units and the dependence across all pairs $(b(W_i), b(W_j))$ with $j$ a dependent neighbor of $i$.
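As a concrete illustration of (9), the sketch below (a hypothetical clustered design, not from the paper) takes the dependency neighborhoods to be clusters sharing a common shock and evaluates $\Sigma^b_N$ for $b(W_i) = W_i$ using the covariances implied by that design.

```python
# A small sketch (illustrative) of the dependency-neighborhood covariance sum
# in (9), for scalar b(W_i) = W_i and assumed cluster-style neighborhoods;
# Delta(i, N) always contains i itself.
import numpy as np

rng = np.random.default_rng(2)
N, cluster_size = 200, 5

# Assumed dependency neighborhoods: units in the same cluster of 5.
clusters = np.repeat(np.arange(N // cluster_size), cluster_size)
common = rng.normal(size=N // cluster_size)[clusters]  # shared cluster shock
W = common + rng.normal(size=N)                        # within-cluster correlation

Sigma = 0.0
for i in range(N):
    nbrs = np.flatnonzero(clusters == clusters[i])     # Delta(i, N), includes i
    # In practice the covariances are unknown; here we use the ones implied by
    # the design: Cov(W_i, W_j) = 1 (shared shock) + 1{i = j} (idiosyncratic).
    Sigma += sum(1.0 + (j == i) for j in nbrs)

print("Sigma_N^b =", Sigma)  # N * (cluster_size + 1) = 1200 under this design
```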
The assumption below characterizes two principal properties of the dependency neighborhood.

Assumption 5.1 (Dependency Neighborhood) For any integrable function $b : \Omega_W \to \mathbb{R}^{d_b}$,
(a) $\Sigma^b_N \to \infty$ as $N \to \infty$;
(b) $\sum_{i=1}^N \sum_{j \notin \Delta(i, N)} Cov(b(W_i), b(W_j)) = o(\Sigma^b_N)$.

Condition (a) ensures that the dependence among units in each other's dependency neighborhoods does not vanish and contains sufficient information, which is necessary for deriving asymptotic properties of statistics constructed from these dependent variables. Intuitively, condition (b) requires that $\Delta(i, N)$ collect the units that have relatively high correlation with the ego unit $i$ compared to those in its complement. The sets $\Delta(i, N)$ may not be unique, because they are defined asymptotically. In addition, the size of each $\Delta(i, N)$ may change (generally expand) as the sample size increases.

As mentioned in Chandrasekhar and Jackson (2016), there is substantial freedom in constructing these sets in different studies. For example, the dependency neighborhoods can be defined based on individuals' participation in common actions, affiliations, and social events regardless of their network interactions; on individuals' identities (groups) that lead to strong social norms and clear barriers across groups, such as caste, tribe or race (Currarini et al., 2009, 2010); or simply on social or geographical location, like occupation, class, school, village or community. Essentially, the dependency neighborhoods $\{\Delta(i, N)\}_{i=1}^N$ can be understood as defined by individuals' exogenous attributes, and the analysis in this paper is conducted conditional on these attributes: that is, the dependency neighborhoods are treated as non-stochastic.

The nonparametric kernel estimation of density functions has been extensively studied; see Newey and McFadden (1994), Newey (1994) and Li and Racine (2007) among others. To ease illustration, denote the observable variable by $W_i = (W^{c\prime}_i, W^{d\prime}_i)'$, where $W^c_i$ is the vector of continuous variables and $W^d_i$ the vector of discrete variables, with supports $\Omega_{W^c}$ and $\Omega_{W^d}$, respectively. Note that $W_i$ may denote different observable variables in different places. For a bandwidth $h > 0$ and for all $w = (w^{c\prime}, w^{d\prime})' \in \Omega_{W^c, W^d}$, denote
$$K(W^c_i, w^c) = \frac{1}{h^Q} \prod_{q=1}^{Q} \kappa\left(\frac{W^c_{i,q} - w^c_q}{h}\right),$$
with $\kappa(\cdot)$ a univariate kernel function and $Q$ the dimension of $W^c_i$. (For expositional simplicity, we restrict the bandwidth for all continuous variables to be the same. In practice, our method also allows different bandwidths, but a data-driven method for bandwidth selection is not the focus of this paper.) The nonparametric kernel estimator of the probability function of interest is
$$\hat{f}_{W_i}(w) = \frac{1}{N} \sum_{i=1}^N K(W^c_i, w^c) \, 1\left[W^d_i = w^d\right]. \qquad (10)$$
Given (10), the estimator of the nuisance parameter is
$$\hat{\gamma}_N = \left[ \hat{f}_{|\tilde{\mathcal{N}}_i|, |\mathcal{N}_i|, Y_i, Z_i}, \; \hat{f}_{|\tilde{\mathcal{N}}_i|, |\mathcal{N}_i|, Z_i}, \; \hat{f}_{S_i, |\mathcal{N}_i|, Z_i}, \; \hat{f}_{|\mathcal{N}_i|, Z_i}, \; \hat{f}_{Z_i} \right]'.$$
Let $\gamma_0$ be the true value of $\hat{\gamma}_N$.
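A minimal implementation sketch of the estimator (10) for one continuous and one discrete component is given below; the Gaussian kernel, the bandwidth rule and the data-generating design are illustrative choices rather than the paper's.

```python
# A minimal sketch of the mixed discrete-continuous kernel estimator (10).
import numpy as np

def kappa(v):
    return np.exp(-0.5 * v**2) / np.sqrt(2 * np.pi)  # univariate kernel

def f_hat(wc, wd, Wc, Wd, h):
    """Estimator (10): smooth the continuous part, match the discrete part."""
    K = kappa((Wc - wc) / h) / h                 # product kernel with Q = 1
    return np.mean(K * (Wd == wd))

rng = np.random.default_rng(3)
N = 5000
Wd = rng.integers(0, 2, size=N)                  # discrete component
Wc = rng.normal(loc=Wd, scale=1.0)               # continuous given discrete

h = N ** (-1 / 5)                                # a conventional rate for Q = 1
est = f_hat(0.0, 1, Wc, Wd, h)
truth = 0.5 * kappa(0.0 - 1.0)                   # P(Wd=1) * N(1,1) density at 0
print(f"estimate {est:.4f} vs truth {truth:.4f}")
```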
The assumption below provides sufficient conditions for the uniform convergence of the nonparametric kernel estimation.

Assumption 5.2 Let $W^c_i = (Y_i, Z^{c\prime}_i)'$ and $W^d_i = (D_i, Z^{d\prime}_i, S_i, |\mathcal{N}_i|, \tilde{S}_i, |\tilde{\mathcal{N}}_i|)'$.
(a) $\Omega_{W^c} \subset \mathbb{R}^Q$ is a compact and convex set and the cardinality of $\Omega_{W^d}$ is finite.
(b) Each element of $\gamma_0$ is bounded and continuously differentiable in $w^c$ to order two, with bounded derivatives on an open set containing $\Omega_{W^c}$.
(c) $\kappa(\cdot)$ is a nonnegative kernel function and is differentiable with uniformly bounded first derivative. In addition, for some constants $K_1$ and $K_2$,
$$\int \kappa(v) \, dv = 1, \quad \kappa(v) = \kappa(-v), \quad \int v^2 \kappa(v) \, dv = K_1, \quad \int \kappa(v)^2 \, dv = K_2.$$
(d) $h \to 0$, $N h^Q \to \infty$, and $\ln(N)/(N h^Q) \to 0$ as $N \to \infty$.
(e) Let $\bar{r}_N = \sup_{1 \le i \le N} |\Delta(i, N)|$. The cardinalities of the dependency neighborhoods satisfy
$$\bar{r}_N \left[ \ln(N)/(N h^Q) \right]^{1/2} = O(1), \quad \frac{1}{N} \sum_{i=1}^N |\Delta(i, N)| = O(1).$$

Conditions (a) and (b) state regularity conditions on the support and the data distribution. Conditions (c) and (d) describe features of the kernel function and the bandwidth, which are standard for nonparametric kernel estimation. In addition, to accommodate the dependence across units, we impose restrictions on the dependency neighborhoods. Condition (e) allows a sufficiently large number of units to possess an increasing number of dependent neighbors, say $O([\ln(N) N / h^Q]^{1/2})$ units having $O([N h^Q / \ln(N)]^{1/2})$ dependent neighbors each, with the rest having a bounded number of dependent neighbors. Note that although we require a sparse network, the number of dependent neighbors may increase with the sample size.

To address issues arising from the dependence between observations, we adapt the method of Masry (1996), which is based on the approximation theorems developed by Bradley et al. (1983), to approximate dependent random variables by independent ones. For any given sample size $N$, partition the index set $\{1, 2, \ldots, N\}$ into $q_N$ mutually exclusive subsets $S_1, \ldots, S_{q_N}$ with $\bigcup_{1 \le l \le q_N} S_l = \{1, 2, \ldots, N\}$. The subscript of $q_N$ indicates that it may go to infinity as $N \to \infty$. (For any given sample size $N$, the partition exists and every sampled unit is included in exactly one of the sets $S_1, \ldots, S_{q_N}$; the largest possible value of $q_N$ is $N$.) Let $i_1 = 1$ and $S_1 = \Delta(i_1, N)$. For any given sample size $N$, let $i_2$ be the unit that is not a dependent neighbor of $i_1$ but has the largest number of common dependent neighbors with $i_1$, i.e. $i_2$ satisfying
$$\left\{ i_2 \notin \Delta(i_1, N), \;\; \left| \Delta(i_1, N) \cap \Delta(i_2, N) \right| \ge \left| \Delta(i_1, N) \cap \Delta(j, N) \right| \text{ for all } j \notin \Delta(i_1, N) \right\}.$$
Then $i_2$ can be understood as the most correlated non-dependent neighbor of $i_1$. Given $i_2$, set $S_2 = \Delta(i_2, N) / \Delta(i_1, N)$. If more than one unit satisfies the above condition, choose any one of them. Proceeding iteratively, define
Given the above partition, define the dependence coefficient α k = sup A ∈F k − ,B ∈F kk | P r ( A, B ) − P r ( A ) P r ( B ) | , where F k − = σ (cid:0) { W i , i ∈ S ≤ l ≤ k − S l } (cid:1) and F kk = σ ( { W i , i ∈ S k } ) for k = 1 , , ..., q N . Thecoefficient α k measures the dependence strength between observations in the two sets S ≤ l ≤ k − S l and S k . Without loss of generality, suppose q N is an even integer. Assumption 5.3
Let L N = [ N/ (ln( N ) h Q +2 )] Q/ . The dependence coefficient α k satisfies Ψ N := L N (cid:18) N ln( N ) (cid:19) / q N X k =1 α k , ∞ X N =1 Ψ N < ∞ . Assumption 5.3 controls the asymptotic dependence among observables and is akin to the mixingcoefficient decaying condition in a setting with network-induced data dependence. It ensures thatuniform convergence still holds even when the large scale of dependency among samples existsand it allows for nonzero dependence outside the dependency neighborhood. Similar assumptionis exploited in Masry (1996) to restrict the time series data, and in S¨avje (2019) to control thedependence between network measurement errors.Lemma 5.1 provides two sufficient conditions under which Assumption 5.3 holds
Lemma 5.1
Assumption 5.3 is satisfied, if either one of the following two conditions holds.(i) For any i ∈ N , W j ⊥ W l if j ∈ ∆( i, N ) and l ∆( i, N ) ;(ii) For i , i , ..., i k constructed as described above, W j ⊥ W l if j ∈ S kr =1 ∆( i r , N ) and l nS kr =1 ∆( i r , N ) S ∆( i k +1 , N ) o , where i k +1 is the unit having the largest number of commondependent-neighbors with units i , i , ..., i k . The proof of Lemma 5.1 is trivial therefore omitted. Condition (i) indicates that Assumption5.3 holds if dependence neighborhoods are independent clusters. Suppose j ∈ ∆( i k , N ), then22ondition (ii) requires that W j and W l are independent, if unit l is not a dependent neighbor,firstly, of unit i k ; secondly, of the units i , ..., i k − with whom unit i k has the largest number ofcommon dependent neighbors; lastly, of the unit i k +1 who has the largest number of commondependent neighbors with units i , ..., i k . Condition (ii) is much weaker than condition (i) and itdoes not contradict Assumption 5.2 (e). Because such an independence in (ii) is required acrossdependent neighborhoods, and it imposes no restrictions on the number of units in each dependentneighborhood. Theorem 5.2
Let Assumptions 5.2 and 5.3 hold, then k ˆ γ N − γ k ∞ = O p (cid:16)(cid:2) ln( N ) / ( N h Q ) (cid:3) / + h (cid:17) . The uniform convergence rate of the kernel estimation in Theorem 5.2 is consistent with that ofthe conventional kernel estimation under i.i.d. or strong mixing data settings. See e.g. Newey(1994), Li and Racine (2007) and Masry (1996).Let ˆ φ N := φ (ˆ γ N ) represent the estimator of the latent distribution function f S ∗ , |N ∗ i || D i ,S i ,Z i , |N i | .According to Proposition 4.2, we can obtain a plug-in estimator ˆ φ N via replacing the distributionson the right hand side of (3) by their kernel estimators based on ˆ γ N in (10). Denote φ = φ ( γ )as the true latent distribution function. Given the uniform convergence of ˆ γ N in Theorem 5.2, weonly need to consider the convergence of ˆ φ N in a small neighborhood of γ . Corollary 5.3
Let Assumption 3.1-3.4 and 4.1-4.5 hold. Under assumptions in Theorem 5.2,suppose that there exists a constant ǫ > such that f |N i || Z i > ǫ . Then, for η → as N → ∞ , sup k ˆ γ N − γ k ∞ ≤ η (cid:13)(cid:13)(cid:13) ˆ φ N − φ (cid:13)(cid:13)(cid:13) ∞ = O p ( (cid:13)(cid:13) ˆ γ N − γ (cid:13)(cid:13) ∞ ) . In this subsection, we study the estimation of CASF m ∗ by simplifying m ∗ = m ∗ ( · ; θ ) as knownfunction up to unknown parameter θ ∈ Θ ⊂ R d θ . Consequently, m i ( · ) = m ( · ) = m ( · ; θ, φ ) is alsoknown up to ( θ, φ ). Note that the identification of m ∗ studied in Section 4 does not rely onsuch an simplification. More importantly, imposing such a parametric structure on m ∗ still allowsflexible heterogeneity of the treatment and spillover effects, which can be captured by interactionsof D i and S i , with covariate Z i and network degree |N i | , as well as using their polynomials. For notational simplicity, let X ∗ i := ( D i , S ∗ i , Z i , |N ∗ i | ) ′ and X i := ( D i , S i , Z i , |N i | ) ′ with supportΩ X ∗ and Ω X , respectively. In addition, denote T ∗ i = ( S ∗ i , |N ∗ i | ) ′ . Let x ∗ j := ( d, s ∗ j , z, n ∗ j ) with Based on Theorem 4.5, we know that m i ( · ) is identical across all i . Thus, we can suppress the subscript i , i.e. m i ( · ) = m ( · ). In addition, m ( · ) = m ( · ; θ, φ ) is because of m ( · ) being a function of the CASF m ∗ ( · ; θ ) and nuisanceparameter φ . ∗ j = ( s ∗ j , n ∗ j ) ∈ Ω S ∗ , |N ∗ | , and j ∈ { , , ..., K T } represents the lexicographical ordering of the possiblevalues of T ∗ i as described in (B.29). Similarly, let x j := ( d, s j , z, n j ) with t j := ( s j , n j ) ∈ Ω S, |N | .By definition of m ( · ; θ, φ ), the following moment condition holds: E (cid:2) Y i − m ( X i ; θ, φ ) (cid:12)(cid:12) X i (cid:3) = 0 . From Proposition 3.1, m ( · ; θ, φ ) and the CASF m ∗ ( · ; θ ) are linked through the formula m ( x ; θ, φ ) = K T P j =1 m ∗ ( x ∗ j ; θ ) f T ∗ i | X i = x ( t ∗ j ). Recall that X i is identically distributed for all i under assumptions inSection 3. Denote the objective function and its sample analogue as L ( θ, φ ) = E (cid:8) τ i [ Y i − m ( X i ; θ, φ )] (cid:9) , and L N ( θ, φ ) = 1 N N X i =1 τ i [ Y i − m ( X i ; θ, φ )] , where τ i := τ ( X i ) is non-negative weight. Following Newey (1994), we use the weight function τ to focus the optimization problem on regions where the kernel estimation is relatively reliable. Then, θ can be estimated by minimizing L N ( θ, ˆ φ N ) given the estimator ˆ φ N from Theorem 5.2:ˆ θ N = arg min θ ∈ Θ L N ( θ, ˆ φ N ) . (11)Let W i = ( Y i , X ′ i ) ′ be the vector containing all the observed variables and w = ( y, x ′ ) ′ ∈ Ω W . Assumption 5.4 (i) Θ ⊂ R d θ is compact, θ ∈ int (Θ) and θ is identifiable from the weighted conditional momentfunction L ( θ, φ ) = 0 .(ii) τ ( · ) is nonnegative and sup x ∈ Ω X | τ ( x ) | < C for some constant C > .(iii) m ∗ ( x ; θ ) is continuous in θ for all x ∈ Ω X , and is an integrable function of X i for all θ ∈ Θ .(iv) Denote the random variable x ∗ i,j = ( D i , s ∗ j , Z i , n ∗ j ) with t ∗ j = ( s ∗ j , n ∗ j ) ∈ Ω T ∗ and j = 1 , , ..., K T .There exists a function h ( x ) such that | m ∗ ( x ; θ ) | ≤ h ( x ) for all θ ∈ Θ , and E [ h ( x ∗ i,j )] < ∞ for all j = 1 , , ..., K T .(v) Let e ( w, θ ) := τ ( x )[ y − m ( x ; θ, φ )] and e i ( θ ) := e ( W i , θ ) . For any given constant η > ,denote U i ( θ, η ) = sup θ ′ ∈ Θ , k θ ′ − θ k <η | e i ( θ ′ ) − e i ( θ ) | . 
There exists a function h ( w ) such that | e ( w, θ ) | ≤ h ( w ) for all θ ∈ Θ and E [ h ( W i )] < ∞ . In addition, sup θ ∈ Θ E [ | e i ( θ ) | δ ] < C for some constants δ > and C > . Hu (2008) also adopts the weight function and set it to be a fixed trimming τ ( x ) = 1[ x ∈ X ] with X ⊂ Ω X a fixed set. In this paper, we follow Hu (2008) to use the fixed trimming weight function. Other types of weightfunctions such as data-driven weight functions or methods for selection of weight functions are out of the scope ofthis paper. heorem 5.4 (Consistency) Let assumptions in Theorem 4.6 hold. Under Assumptions 5.1-5.4, we have k ˆ θ N − θ k = o p (1) . To show asymptotic normality of the estimator ˆ θ N , we need to account for the presence of thenuisance parameter φ and the various forms of dependence arising from the mismeasured networkdata, which requires a significant generalization of the classical CLT. In particular, the oft-usedCLT developed for mixing process does not work for our purpose, as they rely on some particularordering structure to measure the “distance” between units. Therefore, I adopt and extend theunivariate CLT for network data proposed by Chandrasekhar and Jackson (2016) to multivariatesetting, see Lemma E.6 in the Appendix, which will be applied in this section to derive theasymptotic normality for ˆ θ N .Let g ( W i ; θ, φ ) = τ i [ Y i − m ( X i ; θ, φ )] ∂m ( X i ; θ,φ ) ∂θ . Then, from the first order condition of theoptimization problem (11), ˆ θ N solves N P Ni =1 g ( W i ; ˆ θ N , ˆ φ N ) = 0 . Then, by the mean value theoremwe can obtain0 = 1 N N X i =1 g ( W i ; ˆ θ N , ˆ φ N ) = 1 N N X i =1 g ( W i ; θ , ˆ φ N ) + 1 N N X i =1 ∂g ( W i ; ˜ θ N , ˆ φ N ) ∂θ ′ (ˆ θ N − θ ) , (12)where ˜ θ N is between ˆ θ N and θ . If N P Ni =1 ∂g ( W i ;˜ θ N , ˆ φ N ) ∂θ ′ is invertible, rearranging (12) leads to √ N (ˆ θ N − θ ) = " N N X i =1 ∂g ( W i ; ˜ θ N , ˆ φ N ) ∂θ ′ − √ N N X i =1 g ( W i ; θ , ˆ φ N ) . Let us introduce some useful notations. Recall that φ ( · ) = φ ( · ; γ ). We set t := ( t , ..., t K T ) ′ and φ ( t ; γ ) = [ f T ∗ i | X i ( t ) , ..., f T ∗ i | X i ( t K T )] ′ . Let d γ be a d γ × ν ( w ; θ, γ ) = E (cid:20) τ ( X i ) ∂∂θ R ( W i ; θ, φ ) ∂φ ( t ; γ ) ∂γ ′ (cid:12)(cid:12)(cid:12) γ = γ ( w ) d γ (cid:12)(cid:12)(cid:12) w (cid:21) and δ ( W i ; θ, γ ) := ν ( W i ; θ, γ ) − E [ ν ( W i ; θ, γ )], where R ( W i ; θ, φ ) = [ Y i − m ( X i ; θ, φ )] m ∗ ( x ∗ i, ; θ )...[ Y i − m ( X i ; θ, φ )] m ∗ ( x ∗ i, K T ; θ ) ′ . To simplify notation, denote ν ( W i ) := ν ( W i ; θ , γ ) and δ ( W i ) := δ ( W i ; θ , γ ). Assumption 5.5 (i) m ∗ ( x ; θ ) is continuously differentiable in θ up to order three with bounded third order deriva- ive uniformly in x , i.e. for any r, q = 1 , , ..., d θ , sup x ∈ Ω X (cid:12)(cid:12)(cid:12)(cid:12) ∂∂θ (cid:18) ∂ m ∗ ( x ; θ ) ∂θ r ∂θ q (cid:19)(cid:12)(cid:12)(cid:12)(cid:12) < C, for all θ ∈ Θ . (ii) There exist functions H ( x ) and H ( x ) such that (cid:13)(cid:13)(cid:13) d m ∗ ( x ; θ ) dθdθ ′ (cid:13)(cid:13)(cid:13) ≤ H ( x ) , (cid:13)(cid:13)(cid:13) dm ∗ ( x ; θ ) dθ (cid:13)(cid:13)(cid:13) ≤ H ( x ) for all θ ∈ Θ and E [ H ( x i,j )] < ∞ , E [ H ( x i,j )] < ∞ for all j = 1 , , ..., K T .(iii) E h ∂g ( W i ; θ ,φ ) ∂θ ′ i exists and is nonsingular. In addition, E (cid:20)(cid:13)(cid:13)(cid:13) ∂g ( W i ; θ ,φ ) ∂θ ′ (cid:13)(cid:13)(cid:13) (cid:21) < ∞ . 
Assumption 5.5 (i) and (ii) introduce regularity conditions on the smoothness of the CASF m ∗ ( · , θ ).Condition (iii) ensures that the limit of the Hessian matrix exists and is invertible. Assumption 5.6 (i) N / [ln( N ) / ( N h Q )] → and N h → as N → ∞ .(ii) ν ( w ; θ, γ ) = ν ( w c , w d ; θ, γ ) is continuously differentiable in w c almost everywhere and satisfies P w d ∈ Ω Wd R k ν ( w ) k dw c < ∞ . In addition, k V ar [ ν ( W i )] k < ∞ . Assumption 5.6 implies that the convergence rate of ˆ γ N is faster than N / . It is a typical restrictionon the bandwidth to guarantee the asymptotic normality for semiparametric two-step estimatorsthat depend on kernel density, e.g. Newey and MacFadden (1994).We first show that the d θ × d θ Hessian matrix N P Ni =1 ∂g ( W i ;˜ θ N , ˆ φ N ) ∂θ ′ converges in probabilityuniformly. Lemma 5.5
Let the assumptions in Theorem 5.4 hold.(a) Under Assumption 5.5, for a small enough η → as N → ∞ , we have sup k ˆ γ N − γ k ∞ <η (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) N N X i =1 ∂g ( W i ; ˜ θ N , ˆ φ N ) ∂θ ′ − E (cid:20) ∂g ( W i ; θ , φ ) ∂θ ′ (cid:21)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) = o p (1) . (b) Under Assumption 5.6, we can get √ N N X i =1 g ( W i ; θ , ˆ φ N ) = 1 √ N N X i =1 (cid:2) g ( W i ; θ , φ ) + δ ( W i ) (cid:3) + o p (1) . Denote the dependence neighborhoods covariance matrixΣ ˜ gN = N X i =1 X j ∈ ∆( i,N ) E n (cid:2) g ( W i ; θ , φ ) + δ ( W i ) (cid:3) (cid:2) g ( W j ; θ , φ ) + δ ( W j ) (cid:3) ′ o .
26o ease the notations, denote the d θ × g i = g ( W i ; θ , φ ) + δ ( W i ) with ˜ g i = (˜ g i, , ..., ˜ g i,d θ ) ′ .Then, Σ ˜ gN = P Ni =1 P j ∈ ∆( i,N ) E [˜ g i ˜ g ′ j ]. In addition, by notation abuse, let S ci = P j ∆( i,N ) ˜ g j be thecovariance outside the dependency neighborhoods. For any vector a , let a ≥ A = { a ij } , vec ( A ) denotes the vectorization of A and | A | = {| a ij |} . Assumption 5.7 (i) For all i ∈ P , ∆( i, N ) is symmetric such that j ∈ ∆( i, N ) if and only if i ∈ ∆( j, N ) .(ii) There exists a finite, strictly positive-definite and symmetric matrix Ω ∈ R d θ × R d θ such that k N Σ ˜ gN − Ω k → as N → ∞ .(iii) The following conditions hold for { ˜ g i } Ni =1 .(a) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) N P i =1 P j,k ∈ ∆( i,N ) E (cid:2)(cid:12)(cid:12) vec (˜ g i ˜ g ′ j )˜ g ′ k (cid:12)(cid:12)(cid:3)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∞ = o (cid:16)(cid:13)(cid:13)(cid:13) [Σ ˜ gN ] / (cid:13)(cid:13)(cid:13) ∞ (cid:17) ;(b) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) N P i,k =1 P j ∈ ∆( i,N ) P l ∈ ∆( k,N ) E h (cid:0) ˜ g i ˜ g ′ j − E [˜ g i ˜ g ′ j ] (cid:1) ′ (˜ g k ˜ g ′ l − E [˜ g k ˜ g ′ l ]) i(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∞ = o (cid:16)(cid:13)(cid:13)(cid:13) [Σ ˜ gN ] (cid:13)(cid:13)(cid:13) ∞ (cid:17) ;(c) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) N P i =1 P j ∆( i,N ) Cov (˜ g i , ˜ g j ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∞ = o (cid:16)(cid:13)(cid:13)(cid:13) Σ ˜ gN (cid:13)(cid:13)(cid:13) ∞ (cid:17) ;(d) E (cid:2) ˜ g i S ci (cid:12)(cid:12) S ci (cid:3) ≥ for all i ∈ P . Assumption 5.7 (i) guarantees that the covariance matrix Σ ˜ gN is symmetric. Condition (ii) ensuresthat the samples possess sufficiently large variation so that the CLT holds. Meanwhile, it requiresthe limit of Σ ˜ gN /N being a constant matrix Ω, instead of varying with sample size, which imposesrestriction on the allowable divergence rate of Σ ˜ gN to some degree. Similar assumptions are used tostudy asymptotic properties of covariance matrix estimator, e.g. in White and Domowitz (1984).Moreover, Assumption 5.7 (iii) is crucial for multivariate normal approximation under thedependency neighborhood structure, which further guarantees the CLT. Similar assumption isused in Chandrasekhar and Jackson (2016) to establish the asymptotic normality. We extend theassumption to accommodate general multivariate random vectors without imposing any restrictionson their support. In particular, conditions (iii) (a) and (b) restrict the rate of dependency betweenthe dependency sets, while (c) limits the rate of dependency outside the dependence sets. Besides,condition (d) states that on average, units outside each other’s dependency neighborhood do nottend to interact negatively. Chandrasekhar and Jackson (2016) also use condition that is similar to Assumption 5.7 (iii) (d) to ease theirproof. We note that the condition (iii) (d) is not necessary for the asymptotic normality in this paper and can bereplaced by more primitive assumptions. heorem 5.6 (Asymptotic Normality) Suppose assumptions in Theorem 5.4, Assumptions5.5-5.7 hold. Then √ N (ˆ θ N − θ ) d → N (0 , H − Ω H − ) , where H = E [ ∂g ( W i ; θ , φ ) /∂θ ′ ] and N represents normal distribution. Notably, the consistency and asymptotic normality of ˆ θ N only require the existence of depen-dency neighborhoods, rather than the accurate knowledge of them. 
Nevertheless, if the knowledgeof dependency neighborhoods { ∆( i, N ) } Ni =1 is available, it suffices a consistent variance estimator.Given that the form of δ ( w ) is known, following Newey and MacFadden (1994), we construct theestimator of δ ( W i ) by substituting (ˆ θ N , ˆ γ N ) for ( θ , φ ), i.e. ˆ δ ( W i ) := δ ( W i ; ˆ θ N , ˆ γ N ). The corol-lary below provides a consistent estimator of the variance-covariance matrix H − Ω H − , which isessential when constructing asymptotic confidence intervals and conducting hypothesis tests. Corollary 5.7 (Variance Estimator)
Under assumptions in Theorem 5.6, we can get (cid:13)(cid:13)(cid:13) ˆ H − N ˆΩ N ˆ H − N − H − Ω H − (cid:13)(cid:13)(cid:13) p → as N → ∞ , where ˆ H N = 1 N N X i =1 ∂g ( W i ; ˆ θ N , ˆ φ N ) ∂θ ′ , ˆΩ N = 1 N N X i =1 X j ∈ ∆( i,N ) h g ( W i ; ˆ θ N , ˆ φ N ) + ˆ δ ( W i ) i h g ( W j ; ˆ θ N , ˆ φ N ) + ˆ δ ( W j ) i ′ . It is worth to note that the consistent variance estimator ˆ H − N ˆΩ N ˆ H − N is robust to mild degree ofmisspecification of the dependency neighborhoods. For example, if there is only finite units whosedependency neighborhoods are misspecified, the variance estimator is still consistent due to theconsistency of (ˆ θ N , ˆ φ N , ˆ γ N ) and Assumption 5.7 (iii)(c). Moreover, if the knowledge of dependencyneighborhoods is not available at all, one may resort to the resampling method proposed by Leung(2020a) to conduct inference for the parameter of interest. Rigourous study is left for futureresearch. In this section, I illustrate the estimation performance by Monte Carlo simulations. A similardata generating process (DGP) and network formation design to those used by Leung (2020b) is The resampling method of Leung (2020a) requires that the first-stage kernel estimation to be √ N -consistent,which is satisfied if outcome and covariates ( Y i , Z i ) are discrete random variables. Y i = θ + θ D i + θ D i |N ∗ i | Z i + θ S ∗ i + θ S ∗ i + θ S ∗ i Z i + θ S ∗ i |N ∗ i | + ε i , (13)where the i.i.d. treatment D i is independently generated with the probability of being treated equal0.3, and the covariate Z i i.i.d. ∼ Bernoulli (0 . ε i = ε idioi + ε peeri with ε idioi being the idiosyncratic disturbance and ε peeri = P j ∈P A ∗ ij v j capturing the unobservable peer effects,where v j i.i.d. ∼ N (0 , . θ = ( θ , θ , θ , θ , θ , θ , θ ) ′ = (0 , , / , , − , − / , ′ .Throughout this section, we aim to estimate the treatment effects τ d (0 , ,
3) and τ d (0 , ,
3) (withtrue value 1 and 2, respectively), and the spillover effects τ s (1 , ,
3) and τ s (1 , ,
3) (with true values3 and 2.5, respectively).Suppose the actual network is formed as following. We randomly allocate units on a [0 , × [0 , ρ i ∈ [0 , × [0 ,
1] are correctly observed.Assume the links satisfying A ∗ ij =1[ β + β ( Z i + Z j ) + β d ( ρ i , ρ j ) + ζ ij > × i = j ] , where ζ ij = ζ ji is a random shock that is i.i.d. across dyads with distribution N (0 ,
1) and indepen-dent of Z i and ρ i for all i and j . d ( ρ i , ρ j ) indicates the distance between two units d ( ρ i , ρ j ) = , if r − k ρ i − ρ j k ≤ ∞ , otherwise , with the scaling constant r = ( r deg /N ) / to guarantee the network sparsity and the parameter r deg controls the average degree: the larger value of r deg is, the larger average degree E [ |N ∗ i | ] willbe. We consider two levels of sparsity r deg = 5 and 8. Set β = ( β , β , β ) ′ = ( − . , . , − A ∗ = { A ∗ ij } Ni,j =1 are summarized in Table 1. Given the truenetwork formed as described above, suppose two self-reported and mismeasured network proxiesare available for all units: for i, j = 1 , , ..., N , A ij = ω i (cid:2) U ij A ∗ ij + V ij (1 − A ∗ ij ) (cid:3) + (1 − ω i ) A ∗ ij , ˜ A ij = ˜ ω i (cid:2) ˜ U ij A ∗ ij + ˜ V ij (1 − A ∗ ij ) (cid:3) + (1 − ˜ ω i ) A ∗ ij , where ω i , U ij , V ij , ˜ ω i , ˜ U ij and ˜ V ij are mutually independent and randomly generated binaryindicators, taking value one with probabilities p ω , p U , p V , p ˜ ω , p ˜ U , p ˜ V , respectively. In particular,taking A ij as an example, ω i indicates whether unit i ever misreports his or her links, and p ω captures the overall level of misreporting. If unit i misreports, there are two types of classification29rrors: if U ij = 0 then a actually linked pair ( i, j ) with A ∗ ij = 1 is misclassified as unlinked (i.e.false negative), and if V ij = 1 then a actually unlinked pair ( i, j ) with A ∗ ij = 0 is treated as linked(false positive). Therefore, 1 − p U and p V are the probability of false negative and false positive,respectively.Following the design of Leung (2020b), assume the full network is collected for both proxieswhen they are available, meaning that P = { , , ..., N } and |N i | = P Ni =1 A ij , S i = P Ni =1 A ij D j , | ˜ N i | = P Ni =1 ˜ A ij and ˜ S i = P Ni =1 ˜ A ij D j . Given the DGP design, the dependency neighborhood ofeach unit i can be set as a collection of units that are located close to unit i with distance lessthan r , equivalently, ∆( i, N ) = { j ∈ { , , ..., N } , k ρ i − ρ j k ≤ r } .We generate data using sample size N ∈ { , , } with replications M = 1000. In thefirst-step kernel estimation, bandwidth is set to be h = N − / . In what follows, we estimate thetreatment effects of interest under two scenarios based on the availability of network information.Table 1: Statistics of Latent Links r deg = 5 r deg = 8 |N i | S i total |N i | S i total N avg. max avg. max avg. max avg. max1k 5.65 15.52 1.69 7.31 5649 8.92 21.45 2.68 9.53 89192k 5.73 16.36 1.72 7.90 11458 9.08 22.54 2.72 10.26 181675k 5.80 17.39 1.74 8.55 29018 9.23 23.78 2.77 11.07 46165 Note: statistics reported in this table are the average over 1000 replications.
Set the overall misclassification rates p ω = p ˜ ω = 0 .
6. For the first proxy, let 1 − p U ∈ { . , . } and p V = δ V /N with δ V ∈ { . , . } to permit the network sparsity. For the second proxy, set1 − p ˜ U ∈ { . , . } and p ˜ V = 0. Then, the first proxy possesses both false negative and falsepositive classification errors, and the second one contains no false positive. Table 3 reports thestatistics of the two mismeasured network proxies for different misclassification rates. We can seethat when p U or p ˜ U is 0.2, the misclassification rates is relatively low, varying around 12% to 17%.While when p U or p ˜ U is set to be 0.4, the misclassification rates become quite high, varying around24% to 29%. We conduct and compare three estimations:(1) SPE : the semiparametric estimation studied in Section 5.3 using two proxies;and two naive parametric estimations via ordinary least square (OLS) and ignoring potentialmisclassification errors:(2)
Naive 1 : regression of Y i on (1 , D i , D i |N i | Z i , S i , S i , S i Z i , S i |N i | );303) Naive 2 : regression of Y i on (1 , D i , D i | ˜ N i | Z i , ˜ S i , ˜ S i , ˜ S i Z i , ˜ S i | ˜ N i | ).Table 4 to Table 7 display the estimation results for the treatment and the spillover effects viathe above three approaches: SPE, Naive 1 and Naive 2. The bias, the standard deviation (sd), themean squared error (mse), and the coverage rate of the 95% confidence interval for the true valueof the causal parameter (cr) are reported.For the treatment effect τ d (0 , ,
3) (Table 4), the estimation of three methods are comparable interms of the mse and the coverage rate (cr). It is reasonable because the treatment status of eachego unit is correctly observed and the network measurement error does not impact the estimationof τ d (0 , ,
3) via the naive methods for units with Z i = 0.For the treatment effect τ d (0 , ,
3) (Table 5), the spillover effects τ s (1 , ,
3) (Table 6) and τ s (1 , ,
3) (Table 7), their finite sample estimation reveals several patterns. First and most im-portantly, the bias of the SPE is remarkably lower than the bias of the two naive estimations inmost cases, especially when the network degree is relatively small ( r deg = 5), the misclassificationrate is relatively low (1 − p U = 1 − p ˜ U = 0 . N = 5000).In addition, as expected, the bias of the SPE is decreasing as sample size increases for mostcases. While, the two naive estimations are biased in all settings, and the bias is quite severewhen the misclassification rate is relatively high (1 − p U = 1 − p ˜ U = 0 .
4) or the network degreeis relatively large ( r deg = 8). Increasing the sample size fails to mitigate the bias of the two naiveestimations. For instance, consider the estimation of the spillover τ s (1 , ,
3) under r deg = 8 in panel(b) of Table 6. Under low misclassification rate 1 − p U = 1 − p ˜ U = 0 . δ V = 0 . N = 1000,the bias of SPE (0.076) is 11.6% of the bias of Naive 1 (0.653), and is 9.7% of the bias of Naive 2(0.780). When sample size increases to N = 5000, the bias of SPE (-0.034) becomes to 5.2% of thebias of Naive 1 (0.650) and 4.5% of the bias of Naive 2 (0.753). While, under high misclassificationrate 1 − p U = 1 − p ˜ U = 0 . δ V = 0 .
1, the naive estimations suffer even severer bias: the biasesof both Naive 1 and Naive 2 become doubled compared to those under the low misclassificationrate. Although the bias of SPE also increases in cases with high misclassification rate compared tothat in cases with low misclassification rate, it diminishes with sample size. Hence, the simulationsverify that ignoring the network classification errors would result in non-negligible bias whichcannot be eliminated via increasing sample size.In addition, we can see that the sd and the mse of SPE decrease with sample size. The mse ofSPE outperforms the those of Naive 1 and Naive 2 in most cases. Moreover, it is apparent thatthe coverage rate of the SPE performs better, because it is not only closer to the nominal rate 95%compared to the rates of two naive methods, but also approaching to the nominal rate as samplesizes becomes larger. However, for the two naive approaches, their coverage rates drops rapidlywhen samples size increases or when the misclassification become worse. Take the spillover effect τ s (1 , ,
3) under r deg = 8 (panel (b) in Table 7) as an example, when the misclassification rate islow 1 − p U = 1 − p ˜ U = 0 . δ V = 0 .
1, the coverage rate is 11.9% for Naive 1 and is 6.2% for31aive 2, while it is 93.1% for SPE. When N = 5000, it goes down to 0% for both Naive 1 andNaive 2, but increases to 93.8% for SPE.At last, the accuracy of the SPE decreases as r deg increases, or as the misclassification rateincreases. To sum up, the SPE works significantly better than the naive estimations neglectingthe network misclassification, especially when the sample size is relatively large. In addition, theasymptotic properties in Section 5 are verified by the simulation results. Two key identification assumptions, i.e. the exclusion restriction and the one type of measurementerror, may be violated in some applications. In this section, I consider more empirically importantquestions: Is SPE robust to the violation of these two assumptions? Does SPE still perform betterthan the naive estimation if any violation is present? To answer these questions, consider thefollowing two scenarios where the observable networks are generated for sensible departures fromeither of the two identification conditions:(i) violating “exclusion restriction”: generate random error ( U ∗ ij , V ∗ ij , ˜ U ∗ ij , ˜ V ∗ ij ) ′ from a joint normaldistribution for all i, j = 1 , , ..., N , U ∗ ij V ∗ ij ˜ U ∗ ij ˜ V ∗ ij = N , ̺
00 1 0 ̺̺ ̺ , U ij = 1[Φ( U ∗ ij ) < − p U ] , V ij = 1[Φ( V ∗ ij ) < p V ]˜ U ij = 1[Φ( ˜ U ∗ ij ) < − p ˜ U ] , ˜ V ij = 1[Φ( ˜ V ∗ ij ) < p ˜ V ] . where ̺ ∈ { . , . } controls the correlation between the misclassification errors;(ii) violating “one type of measurement error”: generate ˜ V ij via p ˜ V = δ ˜ V /N with δ ˜ V ∈ { . , . } ;while keeping anything else the same with the design in Section 6.1. Results for the three ap-proaches are reported in Table 8 and Table 9. To check the robustness of the SPE method when the exclusion restriction is violated, let uscompare the results in Table 4 to Table 7 with their counterparts in Table 8 and Table 9. Wecan see that the violation of either assumptions aggravates the performance of SPE in most of thecases, but only at a limited degree.Take the spillover τ s (1 , ,
3) as an example. When r deg = 5, N = 5000 and misclassificationrate is relatively low (1 − p U = 1 − p ˜ U = 0 . , δ V = 0 . Due to the space limitation, Table 9 only displays the results for cases with relatively large sample size ( N =5000), which is sufficient to illustrate the asymptotic performance of the SPE relative to the naive approaches. ̺ = 0 . δ ˜ V = 0 . τ d (0 , , − p U = 1 − p ˜ U = 0 . − p U = 1 − p ˜ U = 0 . τ d (0 , , τ s (1 , ,
3) and τ s (1 , , r deg = 8with low misclassification rate (1 − p U = 1 − p ˜ U = 0 . τ s (1 , ,
3) and τ s (1 , ,
3) obtained by the SPE method lies inthe range of 93.0% to 94.3%, while it drops down dramatically to less than 6% for τ s (1 , ,
3) andeven becomes to 0% for τ s (1 , ,
3) when using native estimations. If the one type of measurementerror assumption fails, the coverage rate of τ s (1 , ,
3) and τ s (1 , ,
3) computed via the SPE variesfrom 94.9% to 95.5%, while it varies from 0% to a little below 4% for the naive estimations.The results in this section show that (i) the SPE approach is robust to mild violation of theone type of measurement error assumption; and (ii) the SPE is still superior to the naive methodsexcept in rare cases, in the sense that the bias reduction provided by the SPE is substantial andits causal inference is much more reliable.
This section applies the proposed SPE method to data on social network of rice farmers from185 villages of rural China. The data is collected by Cai, De Janvry, and Sadoulet (2015b) toinvestigate the take-up decisions of weather insurance, which is typically adopted with low rateseven when the government provides heavy subsidies. The primary interest of Cai et al. (2015b)is to study whether and how the diffusion of insurance knowledge through social network affectsthe insurance take-up rate. Thus, two rounds of sessions are offered with a three-days gap toallow information sharing by the first round participants. In each round, there are two typesof sessions held simultaneously: the 20 minutes simple session where only contract is discussed,and the 45 minutes intensive session where details of how the insurance operates and the expectedbenefits are explained. About 5000 rice-producing households from those 185 villages are randomly Data is available at Cai, De Janvry, and Sadoulet (2015a) https://doi.org/10.3886/E113593V1.
T akeup ig = θ + θ Intensive ig + θ N etwork ig + θ Cov ig + θ N etSize ig + η g + ε ig , (14)where T akeup ig is a binary indicator of whether the household i in village g decide to buy theinsurance, Intensive ig is a dummy variable taking value one if the household is invited to in-tensive session, N etwork ig is the fraction of household i ’ friends who have been invited to thefirst round intensive session, N etSize ig is a set of dummies indicating network degree, Cov ig in-cludes household characteristics and η g represents village fixed effect. Household characteristicsin
Cov ig include gender, age and education of household head, rice production area, risk aversionand perceived probability of future disasters. Five Dummies in N etSize ig are indicators of thenumber of nominated friends equal to one to five, where the dummy of zero nominated friends isdropped to avoid collinearity. Instead of the baseline model (14), I also consider an alternativemodel specification where the interaction term Intensive ig ∗ N etwork ig is included. Data from the social network survey is used to construct the household-level network mea-sures. The social network survey requires the sampled household heads to nominate five friendswith whom they discuss rice production or financial issues, while not all the respondents list fivefriends. No geographical restriction is imposed, which means the nominated friends can eitherlive in the same village with the respondent or outside the village. This network measure is non-reciprocal and is referred to as “general measure” in Cai et al. (2015b). The general measure maycontain two types of measurement error: those with less than five friends are likely to report falsefriends (false positive) and those with more than five friends may censor the number of networklinks (false negative). Another household-level network measure used in Cai et al. (2015b), referredto as “strong measure”, is defined as the bilaterally linked friends (reciprocal) using the same in-formation from the social network survey. The social network survey is conducted before theexperiment, therefore the network formation should not be affected by the treatment assignmentsnor the take-up decisions.The analysis in this section utilizes both these two measures, and assumes that the strongmeasure includes only false negative links. It is worth noting that although the two networkmeasures are probably correlated even conditional on the true network information, according to If household i nominates zero friends, then N etwork ig is set to be zero. In the same spirit of Cai et al. (2015b), because the treatment
Intensive ig is whether household is invited toan intensive session or not, the treatment and spillover effects are studied from an intention-to-treat perspective.Nevertheless, almost 90% of households who are invited to one of the sessions actually attend. Therefore, thedropout is not a main concern. About 95% of respondents report five friends, the rest report less than five friends. i are those from the same village with i . Two further remarks are worth noticing. Firstly, the second round participants are not impactedby the take-up decisions made by the first round participants if this information is not revealedto them (see Table 6 column 7 and Table 7 column 6 of Cai et al., 2015b). In addition, accordingto the survey, there is only 9% of the households who are not informed of any first round take-upinformation know at least one of their friends’ decision. Thus, the endogenous peer effects, i.e.the spillovers of friends’ take-up decisions, should not be of major concern in this application.Secondly, the first round simple session also exhibits no significant spillover effects to the secondround participants (see Table 2 column 3 of Cai et al., 2015b).Table 2: Effect of Social Networks on Insurance Take-upNaive SPE Naive SPEGeneral Strong General Strong(1) (2) (3) (4) (5) (6)Intensive 0 . . ∗∗ . ∗∗∗ . ∗ . ∗∗∗ . ∗∗∗ − . ∗∗ − . ∗∗ -0.106(0.161) (0.111) (0.189) η g Yes Yes Yes Yes Yes Yes
Cov ig Yes Yes Yes Yes Yes Yes
Note: Samples are from the second round sessions “Simple2-NoInfo” and “Intensive2-NoInfo” as defined and usedby Cai et al. (2015b). Number of observations is 1255. Standard error (se) is reported in the parenthesis. For thenaive method, column “General” shows the result using the general measure of the network and column “Strong”display the result using the strong measure of the network. The SPE method is implemented by assuming theclassification error is correlated to literacy. The se of the naive method is computed using clustered standard errorwith villages as clusters. The se of the SPE method is calculated based on Corollary 5.7 with villages as dependencyneighborhoods.
Estimation results are summarized in Table 2. The baseline model (columns (1) to (3)) andthe alternative model with interaction term of the treatment and the network exposure (columns(4) to (6)) are estimated using the household-level samples from the second round sessions, whereno overall attendance/take-up rate nor individual insurance purchase results at the first roundsessions in their village are revealed to the participants. Results for the Naive method usinggeneral measure of the network data in columns (1) and (4) in Table 2 are the same to those To mitigate estimation error arising from small sample size, the first step estimation uses samples from boththe first and the second rounds and their network data based on the social network survey, with sample size 4588.
35n Table 2 columns (2) and (4) of Cai et al. (2015b), based on which they draw two conclusions.First, the spillover effect on insurance take-up is significantly positive. For example, column (1)(or column (2)) reveals that having additional 20% increase in the ratio of friends attending thefirst round intensive session will lead to a 29.1% × × × × × × × × × Motivated by applications of program evaluation under network interference, this paper studies theidentification and estimation of treatment and spillover effects when the network is mismeasured.The novel identification strategy proposed in this paper utilizes two network proxies, where oneof them is used as an instrumental variable for the latent network and the other is assumedto contain only one type of measurement error. A semiparametric estimation approach for the36ffects of interest is also provided. The simulation results confirm that the proposed estimation(i) outperforms the naive estimation neglecting the network misclassification, and (ii) remains tobe a preferred alternative to the naive estimation, even if its key assumption is mildly violated.Therefore, the proposed estimation serves as an effective way to reduce the bias caused by thenetwork measurement errors, and provide reliable causal inference.The proposed semiparametric estimation approach exploits a parametric structural assumptionof the outcome variable to avoid the curse of dimensionality, which opens new questions on thetrade-off between the potential model misspecification and the network mismeasurement-robustestimation. It is also meaningful and feasible to investigate the estimation in a more flexiblesemiparametric setup, including partially linear model, index model and random-coefficient model.This paper is particularly suited for studies where the treatment is randomly assigned withperfect compliance. While, for some empirical studies, it is reasonable to allow for non-compliance(Vazquez-Bare, 2020). Future research could further explore the impacts of relaxing the perfectcompliance, and develop methods for the identification and estimation of spillover effects to ac-commodate the non-compliance.Finally, this paper assumes that the specification of how the exposure to the treated peersaffecting the outcome is correct, meaning that the spillover effect is local through the first-orderpeers. The literature on network effects often stresses the existence of the higher-order interfer-ence, i.e. the interference with friends of friends. It sophisticates the analysis in this paper byintroducing the spillover of the treatment and of the measurement errors from higher-order in-teractions. Moreover, it also makes the dependence structure among the observable and latentnetwork-based variables more complicated. It is nontrivial how the analysis of this paper can beextended to deal with higher-order interference. However, for studies where the treatment responseis primarily governed by the first-order spillover, it is possible to apply the analysis of this papervia assuming the higher-order interference effects are omitted to the unobservables. The rationaleis that, based on the study of Leung (2019a) and S¨avje (2019), the exposure misspecification ig-noring higher-order interference does not alter the estimation results, if the specification errors arewell counterbalanced by the deceasing data correlation as the order of the interference increases.Rigorous exploration along this direction is worthwhile in the future research.
References
Advani, A. and B. Malde (2018): “Credibly identifying social effects: Accounting for networkformation and measurement error,”
Journal of Economic Surveys , 32, 1016–1044.
Andrew, A. L., K.-W. E. Chu, and P. Lancaster (1993): “Derivatives of eigenvaluesand eigenvectors of matrix functions,”
SIAM journal on matrix analysis and applications , 14,903–926. 37 ngelucci, M., G. De Giorgi, M. A. Rangel, and I. Rasul (2010): “Family networks andschool enrolment: Evidence from a randomized social experiment,”
Journal of public Economics ,94, 197–221.
Angelucci, M. and V. Di Maro (2016): “Programme evaluation and spillover effects,”
Journalof Development Effectiveness , 8, 22–43.
Aral, S. and D. Walker (2012): “Identifying influential and susceptible members of socialnetworks,”
Science , 337, 337–341.
Aronow, P. M. and C. Samii (2017): “Estimating average causal effects under general inter-ference, with application to a social network experiment,”
The Annals of Applied Statistics , 11,1912–1947.
Athey, S., D. Eckles, and G. W. Imbens (2018): “Exact p-values for network interference,”
Journal of the American Statistical Association , 113, 230–240.
Athey, S. and G. W. Imbens (2017): “The econometrics of randomized experiments,” in
Handbook of economic field experiments , Elsevier, vol. 1, 73–140.
Auerbach, E. (2019): “Identification and estimation of a partially linear regression model usingnetwork data,” arXiv preprint arXiv:1903.09679 . Baird, S., J. A. Bohren, C. McIntosh, and B. ¨Ozler (2018): “Optimal design of experi-ments in the presence of interference,”
Review of Economics and Statistics , 100, 844–860.
Banerjee, A., A. G. Chandrasekhar, E. Duflo, and M. O. Jackson (2013): “Thediffusion of microfinance,”
Science , 341, 1236498.
Barrera-Osorio, F., M. Bertrand, L. L. Linden, and F. Perez-Calle (2011): “Im-proving the design of conditional transfer programs: Evidence from a randomized educationexperiment in Colombia,”
American Economic Journal: Applied Economics , 3, 167–95.
Basse, G. and A. Feller (2018): “Analyzing two-stage experiments in the presence of inter-ference,”
Journal of the American Statistical Association , 113, 41–55.
Battistin, E., M. De Nadai, and B. Sianesi (2014): “Misreported schooling, multiple mea-sures and returns to educational qualifications,”
Journal of Econometrics , 181, 136–150.
Battistin, E. and B. Sianesi (2011): “Misclassified treatment status and treatment effects: anapplication to returns to education in the United Kingdom,”
Review of Economics and Statistics ,93, 495–509.
Bound, J., C. Brown, and N. Mathiowetz (2001): “Measurement error in survey data,” in
Handbook of econometrics , Elsevier, vol. 5, 3705–3843.
Bradley, R. C. et al. (1983): “Approximation theorems for strongly mixing random variables.”
The Michigan Mathematical Journal , 30, 69–81.
Breza, E., A. G. Chandrasekhar, T. H. McCormick, and M. Pan (2020): “Using Aggre-gated Relational Data to Feasibly Identify Network Structure without Network Data,”
AmericanEconomic Review , 110, 2454–2484. 38 ai, J., A. De Janvry, and E. Sadoulet (2015a): “Replication data for: Socialnetworks and the decision to insure,” Nashville, TN: American Economic Association,Ann Arbor, MI: Inter-university Consortium for Political and Social Research, 2019-10-12. https://doi.org/10.3886/E113593V1 .——— (2015b): “Social networks and the decision to insure,”
American Economic Journal: Ap-plied Economics , 7, 81–108.
Calvi, R., A. Lewbel, and D. Tommasi (2018): “Women’s Empowerment and Family Health:Estimating LATE with Mismeasured Treatment,”
Available at SSRN 2980250 . Calv´o-Armengol, A., E. Patacchini, and Y. Zenou (2009): “Peer effects and social net-works in education,”
The Review of Economic Studies , 76, 1239–1267.
Candelaria, L. E. and T. Ura (2020): “Identification and inference of network formationgames with misclassified links,” Tech. rep., University of Warwick, Department of Economics.
Chandrasekhar, A. (2016): “Econometrics of network formation,”
The Oxford Handbook of theeconomics of networks , 303–357.
Chandrasekhar, A. and R. Lewis (2011): “Econometrics of sampled networks,”
WorkingPaper . Chandrasekhar, A. G. and M. O. Jackson (2016): “A network formation model based onsubgraphs,”
Available at SSRN 2660381 . Chen, L. H., L. Goldstein, and Q.-M. Shao (2010):
Normal approximation by Steins method ,Springer Science & Business Media.
Chen, X., H. Hong, and D. Nekipelov (2011): “Nonlinear models of measurement errors,”
Journal of Economic Literature , 49, 901–37.
Chin, A. (2018): “Central limit theorems via Stein’s method for randomized experiments underinterference,” arXiv preprint arXiv:1804.03105 . Comola, M. and M. Fafchamps (2017): “The missing transfers: Estimating misreporting indyadic data,”
Economic Development and Cultural Change , 65, 549–582.
Conley, T. G. and C. R. Udry (2010): “Learning about a new technology: Pineapple inGhana,”
American Economic Review , 100, 35–69.
Currarini, S., M. O. Jackson, and P. Pin (2009): “An economic model of friendship:Homophily, minorities, and segregation,”
Econometrica , 77, 1003–1045.——— (2010): “Identifying the roles of race-based choice and chance in high school friendshipnetwork formation,”
Proceedings of the National Academy of Sciences , 107, 4857–4861.
De Paula, A. (2017): “Econometrics of network models,” in
Advances in economics and econo-metrics: Theory and applications, eleventh world congress , Cambridge University Press Cam-bridge, 268–323. 39 e Paula, A., I. Rasul, and P. Souza (2018a): “Recovering social networks from paneldata: identification, simulations and an application,”
LACEA Working Paper Series No. 0001,Available at SSRN: http://dx.doi.org/10.2139/ssrn.3322049 . De Paula, ´A., S. Richards-Shubik, and E. Tamer (2018b): “Identifying preferences innetworks with bounded degree,”
Econometrica , 86, 263–288.
Dupas, P. (2014): “Short-run subsidies and long-run adoption of new health products: Evidencefrom a field experiment,”
Econometrica , 82, 197–228.
Erd¨os, P. and A. R´enyi (1959): “On random graphs,”
Mathematicae Debrecen , 6, 290–297.
Fafchamps, M. and S. Lund (2003): “Risk-sharing networks in rural Philippines,”
Journal ofDevelopment Economics , 71, 261–287.
Gao, W. and S. Li (2019): “Identification and Estimation of Peer Effects in Latent Networks,”
Working Paper . Goldsmith-Pinkham, P. and G. W. Imbens (2013): “Social networks and the identificationof peer effects,”
Journal of Business & Economic Statistics , 31, 253–264.
Goldstein, L. and Y. Rinott (1996): “Multivariate normal approximations by Stein’s methodand size bias couplings,”
Journal of Applied Probability , 33, 1–17.
Hardy, M., R. M. Heath, W. Lee, and T. H. McCormick (2019): “Estimating spilloversusing imprecisely measured networks,” arXiv preprint arXiv:1904.00136 . He, X. and K. Song (2018): “Measuring diffusion over a large network,” arXiv preprintarXiv:1812.04195 . Hu, Y. (2008): “Identification and estimation of nonlinear models with misclassification errorusing instrumental variables: A general solution,”
Journal of Econometrics , 144, 27–61.
Hu, Y. and S. M. Schennach (2008): “Instrumental variable treatment of nonclassical mea-surement error models,”
Econometrica , 76, 195–216.
Hudgens, M. G. and M. E. Halloran (2008): “Toward causal inference with interference,”
Journal of the American Statistical Association , 103, 832–842.
Imbens, G. W. (2000): “The role of the propensity score in estimating dose-response functions,”
Biometrika , 87, 706–710.
Johnsson, I. and H. R. Moon (2015): “Estimation of peer effects in endogenous social networks:control function approach,”
Review of Economics and Statistics , 1–51.
Kojevnikov, D., V. Marmer, and K. Song (2019): “Limit theorems for network dependentrandom variables,” arXiv preprint arXiv:1903.01059 . Kossinets, G. (2006): “Effects of missing data in social networks,”
Social networks , 28, 247–268.
Kremer, M. and E. Miguel (2007): “The illusion of sustainability,”
The Quarterly Journal ofEconomics , 122, 1007–1065. 40 uersteiner, G. M. (2019): “Limit Theorems for Data with Network Structure,” arXiv preprintarXiv:1908.02375 . Lee, Y. and E. L. Ogburn (2020): “Network Dependence Can Lead to Spurious Associationsand Invalid Inference,”
Journal of the American Statistical Association , 1–31.
Leung, M. P. (2019a): “Causal Inference Under Approximate Neighborhood Interference,”
Avail-able at SSRN 3479902 .——— (2019b): “A weak law for moments of pairwise stable networks,”
Journal of Econometrics ,210, 310–326.——— (2020a): “Dependence-Robust Inference Using Resampled Statistics,” arXiv preprintarXiv:2002.02097 .——— (2020b): “Treatment and spillover effects under network interference,”
Review of Economicsand Statistics , 102, 368–380.
Leung, M. P. and H. R. Moon (2019): “Normal Approximation in Large Network Models,” arXiv preprint arXiv:1904.11060 . Lewbel, A. (2007): “Estimation of average treatment effects with misclassification,”
Economet-rica , 75, 537–551.
Lewbel, A., X. Qu, and X. Tang (2019): “Social networks with misclassified or unobservedlinks,” Tech. rep., Boston College Department of Economics.
Li, Q. and J. S. Racine (2007):
Nonparametric econometrics: theory and practice , PrincetonUniversity Press.
Liu, L. and M. G. Hudgens (2014): “Large sample randomization inference of causal effects inthe presence of interference,”
Journal of the American Statistical Association , 109, 288–301.
Liu, X. (2013): “Estimation of a local-aggregate network model with sampled networks,”
Eco-nomics Letters , 118, 243–246.
Ma, Y., Y. Wang, and V. Tresp (2020): “Causal Inference under Networked Interference,” arXiv preprint arXiv:2002.08506 . Mahajan, A. (2006): “Identification and estimation of regression models with misclassification,”
Econometrica , 74, 631–665.
Manski, C. F. (2001): “Designing programs for heterogeneous populations: The value of covariateinformation,”
American Economic Review , 91, 103–106.——— (2013): “Identification of treatment response with social interactions,”
The EconometricsJournal , 16, S1–S23.
Masry, E. (1996): “Multivariate local polynomial regression for time series: uniform strongconsistency and rates,”
Journal of Time Series Analysis , 17, 571–599.41 eckes, E. et al. (2009): “On Steins method for multivariate normal approximation,” in
Highdimensional probability V: the Luminy volume , Institute of Mathematical Statistics, 153–178.
Miguel, E. and M. Kremer (2004): “Worms: identifying impacts on education and health inthe presence of treatment externalities,”
Econometrica , 72, 159–217.
Newey, W. and D. MacFadden (1994): “Large Sample Estimation and Hypoth0 esis Testing,Chapter 36,”
Handbook of Econometrics Vol , 4.
Newey, W. K. (1994): “Kernel estimation of partial means and a general variance estimator,”
Econometric Theory , 10, 1–21.
Opper, I. M. (2019): “Does helping john help sue? evidence of spillovers in education,”
AmericanEconomic Review , 109, 1080–1115.
Oster, E. and R. Thornton (2012): “Determinants of technology adoption: Peer effects inmenstrual cup take-up,”
Journal of the European Economic Association , 10, 1263–1293.
Patacchini, E., E. Rainone, and Y. Zenou (2017): “Heterogeneous peer effects in education,”
Journal of Economic Behavior & Organization , 134, 190–227.
Qu, X. and L.-f. Lee (2015): “Estimating a spatial autoregressive model with an endogenousspatial weight matrix,”
Journal of Econometrics , 184, 209–232.
Romano, J. P. and A. M. Shaikh (2012): “On the uniform asymptotic validity of subsamplingand the bootstrap,”
The Annals of Statistics , 40, 2798–2822.
Ross, N. (2011): “Fundamentals of Stein’s method,”
Probability Surveys , 8, 210–293.
S¨avje, F. (2019): “Causal inference with misspecified exposure mappings,” Tech. rep., YaleUniversity.
S¨avje, F., P. M. Aronow, and M. G. Hudgens (2017): “Average treatment effects in thepresence of unknown interference,” arXiv preprint arXiv:1711.06399 . Sobel, M. E. (2006): “What do randomized studies of housing mobility demonstrate? Causalinference in the face of interference,”
Journal of the American Statistical Association , 101, 1398–1407.
Sojourner, A. (2013): “Identification of peer effects with missing peer data: Evidence fromProject STAR,”
The Economic Journal , 123, 574–605.
Song, K. (2018): “Measuring the graph concordance of locally dependent observations,”
Reviewof Economics and Statistics , 100, 535–549.
Stein, C. (1986): “Approximate computation of expectations,”
Lecture Notes-Monograph Series ,7, i–164.
Tauchen, G. (1985): “Diagnostic testing and evaluation of maximum likelihood models,”
Journalof Econometrics , 30, 415–443. 42 chetgen, E. J. T. and T. J. VanderWeele (2012): “On causal inference in the presenceof interference,”
Statistical Methods in Medical Research , 21, 55–75.
Thirkettle, M. (2019): “Identification and Estimation of Network Statistics with Missing LinkData,” Tech. rep., Working Paper. van der Laan, M. J. (2014): “Causal inference for a population of causally connected units,”
Journal of Causal Inference , 2, 13–74.
Vazquez-Bare, G. (2019): “Identification and estimation of spillover effects in randomizedexperiments,” arXiv preprint arXiv:1711.02745 .——— (2020): “Causal spillover effects using instrumental variables,” arXiv preprintarXiv:2003.06023 . Viviano, D. (2019): “Policy targeting under network interference,” arXiv preprintarXiv:1906.10258 . White, H. and I. Domowitz (1984): “Nonlinear regression with dependent observations,”
Econometrica: Journal of the Econometric Society , 143–161.43 ppendix
We first introduce notations used in the Appendix. I K is the K × K identity matrix. λ max ( A ) and λ min ( A ) denote the largest and the smallest eigenvalues of a square matrix A , respectively. We use C to represent some positive constant and its value may be different at different uses. s.o. denotesthe terms of smaller order. Appendix Section E contains some useful lemmas and is relegated tothe supplementary material. A Examples
This section provides sufficient conditions or examples for the assumptions in the main text.
Example 1 (Assumption 3.3 (a))
Suppose the network links follow the dyadic formation below: A ∗ ij = 1[ ω ( Z i , Z j ) > η ij ] · i = j ] , with i, j ∈ P where ω : Ω Z R and the unobserved link specific error term η ij is independent to { Z i } i ∈P andis i.i.d. across ( i, j ) . Then, A ∗ ij given Z i is a function of ( Z j , η ij ) , which is i.i.d. across j and |N ∗ i | = P j ∈P A ∗ ij is identically distributed following the binomial distribution. Such a networkformation is considered in e.g. Johnsson and Moon (2015). Example 2 (Assumption 3.4 (c))
For any given latent A ∗ ij , consider the following data gener-ating process of the observable A ij A ij = U ij A ∗ ij + V ij (1 − A ∗ ij ) , with i, j ∈ P (A.1) where N i = { j ∈ P : A ij = 1 } and the classification errors ( U ij , V ij ) are random indicators takingvalues from { } . From (A.1) , we can obtain that |N i | = X j ∈P A ij = X j ∈P ( U ij − V ij ) A ∗ ij + X j ∈P V ij = X j ∈N ∗ i ( U ij − V ij ) + X j ∈P V ij . Let two vectors U i = { U ij } j ∈P and V i = { V ij } j ∈P . If the random vector ( U i , V i ) is conditionallyindependent to N ∗ i and identically distributed across i ∈ P given ( Z i , |N ∗ i | ) , then the identicaldistribution of |N i | given ( Z i , |N ∗ i | ) holds. Example 3 (Assumption 4.1)
For each i ∈ P and any given latent A ∗ ij , suppose the observablelinks are generated as A ij = ω j (cid:2) U ij A ∗ ij + V ij (1 − A ∗ ij ) (cid:3) , ˜ A ij = ˜ ω j (cid:2) ˜ U ij A ∗ ij + ˜ V ij (1 − A ∗ ij ) (cid:3) , with j ∈ P (A.2) with U ij , V ij , ˜ U ij , ˜ V ij , ω j and ˜ ω j are all binary random variables taking values from { } . ω j and ˜ ω j can be understood as indicators of sampling-induced errors, e.g. ω j = 0 means unit j isnot sampled when constructing N i , while only links among pairs of sampled units are accountedfor. ( U ij , V ij ) and ( ˜ U ij , ˜ V ij ) can be understood as indicators of non-sampling-induced errors, e.g. U ij = 0 represents unit i ’s misreporting of her link with unit j when constructing ˜ N i . Then, theobserved sets of links are N i = { j ∈ P : A ij = 1 } and ˜ N i = { j ∈ P : ˜ A ij = 1 } . Therefore, |N i | = X j ∈P A ij = X j ∈P ( U ij − V ij ) ω j A ∗ ij + X j ∈P ω j V ij = X j ∈N ∗ i ( U ij − V ij ) ω j + X j ∈P V ij ω j , | ˜ N i | = X j ∈P ˜ A ij = X j ∈P ( ˜ U ij − ˜ V ij )˜ ω j A ∗ ij + X j ∈P ˜ ω j ˜ V ij = X j ∈N ∗ i ( ˜ U ij − ˜ V ij )˜ ω j + X j ∈P ˜ V ij ˜ ω j . (A.3) Then, one set of sufficient conditions for Assumption 4.1 is provided by the lemma below.
Lemma A.1
Let Assumption 3.4 (a) holds for both N i and ˜ N i . Suppose the random vector ( U ij , V ij , ˜ U ij , ˜ V ij , ω j , ˜ ω j ) given ( Z i , |N ∗ i | ) is i.i.d. across j for all i ∈ P . If(a) { U ij , V ij , ω j } j ∈P ⊥ { ˜ U ik , ˜ V ik , ˜ ω k } k ∈P (cid:12)(cid:12) Z i , N ∗ i ;(b) { U ij , V ij , ˜ U ij , ˜ V ij , ω j , ˜ ω j } j ∈P ⊥ N ∗ i (cid:12)(cid:12) Z i , |N ∗ i | ;then Assumption 4.1 is satisfied by |N i | and | ˜ N i | given in (A.3) . Proof of Lemma A.1. (i) From condition (a) that { U ij , V ij , ω j } j ∈P ⊥ { ˜ U ik , ˜ V ik , ˜ ω k } k ∈P (cid:12)(cid:12) Z i , N ∗ i ,we have N i ⊥ ˜ N i (cid:12)(cid:12) Z i , N ∗ i , which implies |N i | ⊥ | ˜ N i | (cid:12)(cid:12) Z i , N ∗ i . If we can further show that |N i | ⊥N ∗ i (cid:12)(cid:12) Z i , |N ∗ i | and | ˜ N i | ⊥ N ∗ i (cid:12)(cid:12) Z i , |N ∗ i | hold, then the desired result follows, because f |N i || Z i , |N ∗ i | , | ˜ N i | ( n ) = X J ∈ Ω N∗ f |N i || Z i , |N ∗ i | , | ˜ N i | , N ∗ i = J ( n ) × f N ∗ i | Z i , |N ∗ i | , | ˜ N i | ( J )= X J ∈ Ω N∗ f |N i || Z i , |N ∗ i | , N ∗ i = J ( n ) × f N ∗ i | Z i , |N ∗ i | ( J )= f |N i || Z i , |N ∗ i | ( n ) , which indicates |N i | ⊥ | ˜ N i | (cid:12)(cid:12) Z i , |N ∗ i | .Given the expressions in (A.3), based on the i.i.d. of ( U ij , V ij , ω j ) across j , applying the samearguments used to prove Lemma 4.1 (a), we can show that given ( Z i , |N ∗ i | ), the distribution of P j ∈N ∗ i ( U ij − V ij ) ω j does not depend on N ∗ i , i.e. P j ∈N ∗ i ( U ij − V ij ) ω j ⊥ N ∗ i (cid:12)(cid:12) Z i , |N ∗ i | . In addition,from condition (b) we can obtain the independence of P j ∈P V ij ω j to N ∗ i given Z i , |N ∗ i | . Thus, itfollows from the above results and (A.3) that |N i | ⊥ N ∗ i (cid:12)(cid:12) Z i , |N ∗ i | . Similarly, | ˜ N i | ⊥ N ∗ i (cid:12)(cid:12) Z i , |N ∗ i | also holds. B Proofs
B Proofs

B.1 Proofs of Section 3
Lemma B.1
Under Assumption 3.2, for all $i \in \mathcal{P}$, $\varepsilon_i \perp (D_i, S^*_i) \mid Z_i, |\mathcal{N}^*_i|$.

Proof of Lemma B.1.
If we can show that
\[
\Pr(\varepsilon_i < e \mid D_i, S^*_i, Z_i, |\mathcal{N}^*_i|) = \Pr(\varepsilon_i < e \mid Z_i, |\mathcal{N}^*_i|),
\]
then the stated result follows. By the law of total probability, for all $e \in \Omega_\varepsilon$,
\[
\Pr(\varepsilon_i < e \mid D_i, S^*_i, Z_i, |\mathcal{N}^*_i|)
= E\Big[\Pr\big(\varepsilon_i < e \mid D_i, S^*_i, Z_i, |\mathcal{N}^*_i|, \mathcal{N}^*_i, \{D_j\}_{j\in\mathcal{N}^*_i}\big) \,\Big|\, D_i, S^*_i, Z_i, |\mathcal{N}^*_i|\Big], \tag{B.1}
\]
where the expectation is with respect to $f_{\mathcal{N}^*_i, \{D_j\}_{j\in\mathcal{N}^*_i} \mid D_i, S^*_i, Z_i, |\mathcal{N}^*_i|}$. By definition, $S^*_i = \sum_{j\in\mathcal{N}^*_i} D_j$, so $S^*_i$ is fixed given $(\mathcal{N}^*_i, \{D_j\}_{j\in\mathcal{N}^*_i})$. In addition, Assumption 3.2 (a) implies that $D_i$ is independent of $(\varepsilon_i, Z_i, |\mathcal{N}^*_i|, \mathcal{N}^*_i, \{D_j\}_{j\in\mathcal{N}^*_i})$. Hence $(D_i, S^*_i)$ can be eliminated from the conditional probability of $\varepsilon_i < e$ in (B.1):
\[
\Pr(\varepsilon_i < e \mid D_i, S^*_i, Z_i, |\mathcal{N}^*_i|)
= E\big[\Pr(\varepsilon_i < e \mid Z_i, |\mathcal{N}^*_i|, \mathcal{N}^*_i, \{D_j\}_{j\in\mathcal{N}^*_i}) \mid D_i, S^*_i, Z_i, |\mathcal{N}^*_i|\big]
= E\big[\Pr(\varepsilon_i < e \mid Z_i, |\mathcal{N}^*_i|) \mid D_i, S^*_i, Z_i, |\mathcal{N}^*_i|\big]
= \Pr(\varepsilon_i < e \mid Z_i, |\mathcal{N}^*_i|), \tag{B.2}
\]
where the second line follows from Assumption 3.2 (b) and the last line implies $\varepsilon_i \perp (D_i, S^*_i) \mid Z_i, |\mathcal{N}^*_i|$.

Lemma B.2
Under Assumptions 3.2 and 3.4, $\varepsilon_i \perp (D_i, S^*_i, S_i, |\mathcal{N}_i|) \mid Z_i, |\mathcal{N}^*_i|$ for all $i \in \mathcal{P}$.

Proof of Lemma B.2.
Denote $P^*_i = (\mathcal{N}^*_i, \{D_j\}_{j\in\mathcal{N}^*_i})$ and $P_i = (\mathcal{N}_i, \{D_j\}_{j\in\mathcal{N}_i})$. From Assumptions 3.2 (a) and 3.4 (a), $D_i \perp (P^*_i, P_i)$, because $i \notin \mathcal{N}^*_i$, $i \notin \mathcal{N}_i$, $D_i \perp (\{D_j\}_{j\in\mathcal{N}^*_i}, \{D_j\}_{j\in\mathcal{N}_i}) \mid \mathcal{N}^*_i, \mathcal{N}_i$, and $\{D_i\}_{i\in\mathcal{P}} \perp (\mathcal{N}^*_i, \mathcal{N}_i)$. Moreover, since $S^*_i$ and $S_i$ are functions of $P^*_i$ and $P_i$, respectively, we have $D_i \perp (S^*_i, S_i)$. By the law of total probability, for all $e \in \Omega_\varepsilon$,
\[
\Pr(\varepsilon_i < e \mid D_i, S^*_i, S_i, |\mathcal{N}_i|, Z_i, |\mathcal{N}^*_i|)
= \Pr(\varepsilon_i < e \mid S^*_i, S_i, |\mathcal{N}_i|, Z_i, |\mathcal{N}^*_i|)
= E\big[\Pr(\varepsilon_i < e \mid S^*_i, S_i, |\mathcal{N}_i|, Z_i, |\mathcal{N}^*_i|, P^*_i, P_i) \mid S^*_i, S_i, |\mathcal{N}_i|, Z_i, |\mathcal{N}^*_i|\big], \tag{B.3}
\]
where the expectation is with respect to $f_{P^*_i, P_i \mid S^*_i, S_i, |\mathcal{N}_i|, Z_i, |\mathcal{N}^*_i|}$. Since $S^*_i$, $S_i$ and $|\mathcal{N}_i|$ are fixed given $(P^*_i, P_i)$, equation (B.3) becomes
\[
\Pr(\varepsilon_i < e \mid D_i, S^*_i, S_i, |\mathcal{N}_i|, Z_i, |\mathcal{N}^*_i|)
= E\big[\Pr(\varepsilon_i < e \mid Z_i, |\mathcal{N}^*_i|, P^*_i, P_i) \mid S^*_i, S_i, |\mathcal{N}_i|, Z_i, |\mathcal{N}^*_i|\big]
= E\big[\Pr(\varepsilon_i < e \mid Z_i, |\mathcal{N}^*_i|) \mid S^*_i, S_i, |\mathcal{N}_i|, Z_i, |\mathcal{N}^*_i|\big]
= \Pr(\varepsilon_i < e \mid Z_i, |\mathcal{N}^*_i|),
\]
where the second equality is due to Assumption 3.4 (b).

Proof of Proposition 3.1.
By Assumption 3.1 and the law of iterated expectations,
\[
m_i(d, s, z, n) = E\big[r(D_i, S^*_i, Z_i, |\mathcal{N}^*_i|, \varepsilon_i) \mid D_i = d, S_i = s, Z_i = z, |\mathcal{N}_i| = n\big]
= \sum_{(s^*, n^*)\in\Omega_{S^*, |\mathcal{N}^*|}} E\big[r(d, s^*, z, n^*, \varepsilon_i) \mid D_i = d, S_i = s, Z_i = z, |\mathcal{N}_i| = n, S^*_i = s^*, |\mathcal{N}^*_i| = n^*\big] \times f_{S^*_i, |\mathcal{N}^*_i| \mid D_i = d, S_i = s, Z_i = z, |\mathcal{N}_i| = n}(s^*, n^*). \tag{B.4}
\]
Based on Lemma B.2, $\varepsilon_i \perp (D_i, S^*_i, S_i, |\mathcal{N}_i|) \mid Z_i, |\mathcal{N}^*_i|$, so (B.4) becomes
\[
m_i(d, s, z, n)
= \sum_{(s^*, n^*)\in\Omega_{S^*, |\mathcal{N}^*|}} E\big[r(d, s^*, z, n^*, \varepsilon_i) \mid Z_i = z, |\mathcal{N}^*_i| = n^*\big]\, f_{S^*_i, |\mathcal{N}^*_i| \mid D_i = d, S_i = s, Z_i = z, |\mathcal{N}_i| = n}(s^*, n^*)
= \sum_{(s^*, n^*)\in\Omega_{S^*, |\mathcal{N}^*|}} m^*(d, s^*, z, n^*)\, f_{S^*_i, |\mathcal{N}^*_i| \mid D_i = d, S_i = s, Z_i = z, |\mathcal{N}_i| = n}(s^*, n^*), \tag{B.5}
\]
where the last equality is by Definition 1.
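As a toy numerical illustration of the mixture representation (B.5), not part of the formal argument, the observed conditional mean blends the true CASF across latent cells, so ignoring the misclassification weights biases the estimand toward the wrong cells. All numbers below are made up.

```python
# Toy illustration of the mixture representation (B.5); numbers are made up.
import numpy as np

m_star = np.array([1.0, 1.5, 2.3])    # m*(d, s*, z, n*) over three latent cells
weights = np.array([0.7, 0.2, 0.1])   # f_{S*,|N*| | D,S,Z,|N|} for one observed cell
print(m_star @ weights)               # m(d, s, z, n) = 1.23, a blend of the three cells
```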
B.2 Proofs of Section 4

Lemma B.3
Under Assumption 3.2 (a), suppose Assumption 3.4 (a) and (b) are satisfied by both $\mathcal{N}_i$ and $\tilde{\mathcal{N}}_i$. Then $Y_i \perp (|\mathcal{N}_i|, |\tilde{\mathcal{N}}_i|) \mid Z_i, |\mathcal{N}^*_i|$ holds.

Proof of Lemma B.3.
First, the same arguments used in the proof of Lemma 4.1 (a) can be applied to show that $S^*_i \perp (|\mathcal{N}_i|, |\tilde{\mathcal{N}}_i|) \mid Z_i, |\mathcal{N}^*_i|$. Second, rewrite $Y_i$ in terms of the potential outcomes:
\[
Y_i = \sum_{(d, s)\in\{0,1\}\times\Omega_{S^*}} \mathbb{1}[D_i = d, S^*_i = s]\; r(d, s, Z_i, |\mathcal{N}^*_i|, \varepsilon_i),
\]
where, by the randomness of the treatment assignment and $S^*_i \perp (|\mathcal{N}_i|, |\tilde{\mathcal{N}}_i|) \mid Z_i, |\mathcal{N}^*_i|$, we know that $\mathbb{1}[D_i = d, S^*_i = s] \perp (|\mathcal{N}_i|, |\tilde{\mathcal{N}}_i|) \mid Z_i, |\mathcal{N}^*_i|$. Then, because Assumption 3.4 (b) implies that $(|\mathcal{N}_i|, |\tilde{\mathcal{N}}_i|)$ is independent of the potential outcome $r(d, s, Z_i, |\mathcal{N}^*_i|, \varepsilon_i)$ given $(Z_i, |\mathcal{N}^*_i|)$, we conclude that $Y_i \perp (|\mathcal{N}_i|, |\tilde{\mathcal{N}}_i|) \mid Z_i, |\mathcal{N}^*_i|$.

Proof of Lemma 4.1. (a) If we can show that for any $(s, J) \in \Omega_{S^*, \mathcal{N}^*}$ the equation below holds,
\[
f_{S^*_i, \mathcal{N}^*_i \mid Z_i, |\mathcal{N}^*_i| = n}(s, J) = f_{S^*_i \mid Z_i, |\mathcal{N}^*_i| = n}(s)\, f_{\mathcal{N}^*_i \mid Z_i, |\mathcal{N}^*_i| = n}(J), \tag{B.6}
\]
then the desired result follows. First of all, if either $s > n$ or $|J| \ne n$, (B.6) holds trivially. Therefore, consider $(s, J)$ such that $s \le n$ and $|J| = n$. Because for any fixed $J$, $\{D_j\}_{j\in J}$ is independent of $(Z_i, |\mathcal{N}^*_i|, \mathcal{N}^*_i)$ by Assumption 3.2 (a), the i.i.d. of $D_i$ gives
\[
f_{S^*_i \mid Z_i, |\mathcal{N}^*_i| = n, \mathcal{N}^*_i = J}(s) = f_{\sum_{j\in J} D_j \mid Z_i, |\mathcal{N}^*_i| = n, \mathcal{N}^*_i = J}(s) = f_{\sum_{j\in J} D_j}(s) = C^s_n\, f_D(1)^s f_D(0)^{n-s}, \tag{B.7}
\]
where $f_D(d) = \Pr(D_i = d)$ for $d \in \{0, 1\}$. On the other hand, by the law of total probability,
\[
f_{S^*_i \mid Z_i, |\mathcal{N}^*_i| = n}(s)
= \sum_{J\in\Omega_{\mathcal{N}^*}:\, |J| = n} f_{\sum_{j\in J} D_j \mid Z_i, |\mathcal{N}^*_i| = n, \mathcal{N}^*_i = J}(s)\, f_{\mathcal{N}^*_i \mid Z_i, |\mathcal{N}^*_i| = n}(J)
= \sum_{J\in\Omega_{\mathcal{N}^*}:\, |J| = n} f_{\sum_{j\in J} D_j}(s)\, f_{\mathcal{N}^*_i \mid Z_i, |\mathcal{N}^*_i| = n}(J)
= C^s_n\, f_D(1)^s f_D(0)^{n-s}. \tag{B.8}
\]
Therefore, (B.7) and (B.8) lead to
\[
f_{S^*_i, \mathcal{N}^*_i \mid Z_i, |\mathcal{N}^*_i| = n}(s, J) = f_{S^*_i \mid Z_i, |\mathcal{N}^*_i| = n, \mathcal{N}^*_i = J}(s)\, f_{\mathcal{N}^*_i \mid Z_i, |\mathcal{N}^*_i| = n}(J) = f_{S^*_i \mid Z_i, |\mathcal{N}^*_i| = n}(s)\, f_{\mathcal{N}^*_i \mid Z_i, |\mathcal{N}^*_i| = n}(J).
\]
In addition, since $S^*_i = \sum_{j\in\mathcal{N}^*_i} D_j$ and by Assumption 3.4 (a), it is easy to see that $|\mathcal{N}_i| \perp S^*_i \mid Z_i, \mathcal{N}^*_i$. Thus, arguments similar to those used to show (B.10) below give us
\[
f_{S^*_i \mid Z_i, |\mathcal{N}_i| = n, |\mathcal{N}^*_i| = n^*}(s)
= \sum_{J\in\Omega_{\mathcal{N}^*}:\, |J| = n^*} f_{S^*_i \mid Z_i, |\mathcal{N}_i| = n, |\mathcal{N}^*_i| = n^*, \mathcal{N}^*_i = J}(s) \times f_{\mathcal{N}^*_i \mid Z_i, |\mathcal{N}_i| = n, |\mathcal{N}^*_i| = n^*}(J)
= \sum_{J\in\Omega_{\mathcal{N}^*}:\, |J| = n^*} f_{S^*_i \mid Z_i, |\mathcal{N}^*_i| = n^*, \mathcal{N}^*_i = J}(s) \times f_{\mathcal{N}^*_i \mid Z_i, |\mathcal{N}_i| = n, |\mathcal{N}^*_i| = n^*}(J)
= f_{S^*_i \mid Z_i, |\mathcal{N}^*_i| = n^*}(s) \sum_{J\in\Omega_{\mathcal{N}^*}:\, |J| = n^*} f_{\mathcal{N}^*_i \mid Z_i, |\mathcal{N}_i| = n, |\mathcal{N}^*_i| = n^*}(J)
= f_{S^*_i \mid Z_i, |\mathcal{N}^*_i| = n^*}(s), \tag{B.9}
\]
where the second equality is due to $|\mathcal{N}_i| \perp S^*_i \mid Z_i, \mathcal{N}^*_i$, and the third is because $\mathcal{N}^*_i \perp S^*_i \mid Z_i, |\mathcal{N}^*_i|$ from part (a). Hence, (B.9) shows that $|\mathcal{N}_i| \perp S^*_i \mid Z_i, |\mathcal{N}^*_i|$.

(b) Given $S_i = \sum_{j\in\mathcal{N}_i} D_j$, Assumptions 3.2 (a) and 3.4 (a) imply that $\{D_i\}_{i\in\mathcal{P}}$ are i.i.d. and independent of $(Z_i, \mathcal{N}_i)$. Thus, applying the same arguments used to show (a), we obtain for $s \le n$
\[
f_{S_i \mid Z_i, |\mathcal{N}_i| = n, \mathcal{N}_i}(s) = f_{S_i \mid Z_i, |\mathcal{N}_i| = n}(s) = C^s_n\, f_D(1)^s f_D(0)^{n-s},
\]
leading to $\mathcal{N}_i \perp S_i \mid Z_i, |\mathcal{N}_i|$. Moreover, because $S_i = \sum_{j\in\mathcal{N}_i} D_j$ is a function of $(\mathcal{N}_i, \{D_j\}_{j\in\mathcal{N}_i})$, the randomness of $S_i$ given $\mathcal{N}_i$ comes only from $D_j$ for $j \in \mathcal{N}_i$. In addition, since $\{D_j\}_{j\in\mathcal{P}}$ are independent of $(Z_i, \mathcal{N}^*_i, \mathcal{N}_i)$ as in Assumption 3.4 (a), it follows that $|\mathcal{N}^*_i| \perp S_i \mid Z_i, \mathcal{N}_i$. Hence, for all $(s, n, n^*) \in \Omega_{S, |\mathcal{N}|, |\mathcal{N}^*|}$,
\[
f_{S_i \mid Z_i, |\mathcal{N}_i| = n, |\mathcal{N}^*_i| = n^*}(s)
= \sum_{J\in\Omega_{\mathcal{N}}:\, |J| = n} f_{S_i \mid Z_i, |\mathcal{N}_i| = n, |\mathcal{N}^*_i| = n^*, \mathcal{N}_i = J}(s) \times f_{\mathcal{N}_i \mid Z_i, |\mathcal{N}_i| = n, |\mathcal{N}^*_i| = n^*}(J)
= \sum_{J\in\Omega_{\mathcal{N}}:\, |J| = n} f_{S_i \mid Z_i, |\mathcal{N}_i| = n, \mathcal{N}_i = J}(s) \times f_{\mathcal{N}_i \mid Z_i, |\mathcal{N}_i| = n, |\mathcal{N}^*_i| = n^*}(J)
= f_{S_i \mid Z_i, |\mathcal{N}_i| = n}(s) \sum_{J\in\Omega_{\mathcal{N}}:\, |J| = n} f_{\mathcal{N}_i \mid Z_i, |\mathcal{N}_i| = n, |\mathcal{N}^*_i| = n^*}(J)
= f_{S_i \mid Z_i, |\mathcal{N}_i| = n}(s), \tag{B.10}
\]
where the second equality is because $|\mathcal{N}^*_i| \perp S_i \mid Z_i, \mathcal{N}_i$, and the third is due to $\mathcal{N}_i \perp S_i \mid Z_i, |\mathcal{N}_i|$ as shown at the beginning of this part. Therefore, $S_i \perp |\mathcal{N}^*_i| \mid Z_i, |\mathcal{N}_i|$ from (B.10).

(c) The proof of this part follows directly from the proofs of (a) and (b).
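The binomial law in (B.7)-(B.8) is easy to verify by simulation. Below is a quick check under illustrative values of the treatment probability and degree; it is a sanity check of the lemma's conclusion, not part of the paper's procedure.

```python
# Simulation check of Lemma 4.1: S_i given degree n is Binomial(n, f_D(1)).
from math import comb
import numpy as np

rng = np.random.default_rng(1)
p, n, reps = 0.3, 8, 200_000                          # illustrative treatment rate, degree
S = rng.binomial(1, p, size=(reps, n)).sum(axis=1)    # S_i = sum of i.i.d. D_j over neighbors
emp = np.bincount(S, minlength=n + 1) / reps
pmf = np.array([comb(n, s) * p**s * (1 - p)**(n - s) for s in range(n + 1)])
print(np.abs(emp - pmf).max())                        # ~ simulation noise
```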
Proof of Proposition 4.2. By Bayes' theorem,
\[
f_{S^*_i, |\mathcal{N}^*_i| \mid D_i, S_i, Z_i, |\mathcal{N}_i|}
= \frac{f_{S_i, |\mathcal{N}_i| \mid D_i, S^*_i, Z_i, |\mathcal{N}^*_i|} \times f_{S^*_i, |\mathcal{N}^*_i| \mid D_i, Z_i}}{f_{S_i, |\mathcal{N}_i| \mid D_i, Z_i}}. \tag{B.11}
\]
In what follows, we rewrite the distributions in the numerator and the denominator to obtain the desired result. Based on Assumptions 3.2 and 3.4, $\{D_i\}_{i\in\mathcal{P}}$ is i.i.d. and independent of $\{Z_i, \mathcal{N}^*_i, \mathcal{N}_i\}_{i\in\mathcal{P}}$. Thus, from the fact that $i \notin \mathcal{N}^*_i$ and $i \notin \mathcal{N}_i$, we conclude that
\[
D_i \perp (S^*_i, S_i, Z_i, \mathcal{N}^*_i, \mathcal{N}_i), \qquad \text{for all } i \in \mathcal{P}. \tag{B.12}
\]
This further yields $D_i \perp S_i \mid (S^*_i, Z_i, |\mathcal{N}^*_i|, |\mathcal{N}_i|)$ and $D_i \perp |\mathcal{N}_i| \mid (S^*_i, Z_i, |\mathcal{N}^*_i|)$. Therefore, for the first term in the numerator, for any $(s, n) \in \Omega_{S, |\mathcal{N}|}$,
\[
f_{S_i, |\mathcal{N}_i| \mid D_i, S^*_i, Z_i, |\mathcal{N}^*_i|}(s, n)
= f_{S_i \mid D_i, S^*_i, Z_i, |\mathcal{N}^*_i|, |\mathcal{N}_i| = n}(s) \times f_{|\mathcal{N}_i| \mid D_i, S^*_i, Z_i, |\mathcal{N}^*_i|}(n)
= f_{S_i \mid S^*_i, Z_i, |\mathcal{N}^*_i|, |\mathcal{N}_i| = n}(s) \times f_{|\mathcal{N}_i| \mid S^*_i, Z_i, |\mathcal{N}^*_i|}(n)
= f_{S_i \mid S^*_i, Z_i, |\mathcal{N}^*_i|, |\mathcal{N}_i| = n}(s) \times f_{|\mathcal{N}_i| \mid Z_i, |\mathcal{N}^*_i|}(n), \tag{B.13}
\]
where the last equality holds because $|\mathcal{N}_i| \perp S^*_i \mid Z_i, |\mathcal{N}^*_i|$ by Lemma 4.1. Besides, again by (B.12), we have $D_i \perp S^*_i \mid Z_i, |\mathcal{N}^*_i|$ and $D_i \perp |\mathcal{N}^*_i| \mid Z_i$. For the second term in the numerator,
\[
f_{S^*_i, |\mathcal{N}^*_i| \mid D_i, Z_i}(s, n) = f_{S^*_i \mid D_i, Z_i, |\mathcal{N}^*_i| = n}(s) \times f_{|\mathcal{N}^*_i| \mid D_i, Z_i}(n) = f_{S^*_i \mid Z_i, |\mathcal{N}^*_i| = n}(s) \times f_{|\mathcal{N}^*_i| \mid Z_i}(n). \tag{B.14}
\]
Similarly, by (B.12), we can rewrite the denominator as
\[
f_{S_i, |\mathcal{N}_i| \mid D_i, Z_i}(s, n) = f_{S_i \mid Z_i, |\mathcal{N}_i| = n}(s) \times f_{|\mathcal{N}_i| \mid Z_i}(n). \tag{B.15}
\]
Substituting (B.13), (B.14) and (B.15) into (B.11) leads to the stated result.

Proof of Theorem 4.3. (a) By Assumption 3.3, it is clear that
\[
f_{Z_i}, \quad f_{|\mathcal{N}^*_i| \mid Z_i}, \quad f_{|\tilde{\mathcal{N}}_i| \mid Z_i, |\mathcal{N}^*_i|}, \quad f_{|\mathcal{N}_i| \mid Z_i, |\mathcal{N}^*_i|} \quad \text{are all identical across } i \in \mathcal{P}. \tag{B.16}
\]
Now, according to $|\tilde{\mathcal{N}}_i| \perp |\mathcal{N}_i| \mid Z_i, |\mathcal{N}^*_i|$ in Assumption 4.1, we obtain
\[
f_{|\mathcal{N}^*_i|, |\tilde{\mathcal{N}}_i|, |\mathcal{N}_i|, Z_i} = f_{|\tilde{\mathcal{N}}_i|, |\mathcal{N}_i| \mid Z_i, |\mathcal{N}^*_i|} \times f_{|\mathcal{N}^*_i|, Z_i} = f_{|\tilde{\mathcal{N}}_i| \mid Z_i, |\mathcal{N}^*_i|} \times f_{|\mathcal{N}_i| \mid Z_i, |\mathcal{N}^*_i|} \times f_{|\mathcal{N}^*_i|, Z_i}.
\]
Because all terms on the right-hand side of this equation are identical across $i$, $f_{|\mathcal{N}^*_i|, |\tilde{\mathcal{N}}_i|, |\mathcal{N}_i|, Z_i}$ is identical across $i$, and so are all marginal and conditional distributions of $(|\mathcal{N}^*_i|, |\tilde{\mathcal{N}}_i|, |\mathcal{N}_i|, Z_i)$, including $f_{|\mathcal{N}_i| \mid Z_i}$ and $f_{|\tilde{\mathcal{N}}_i|, |\mathcal{N}_i| \mid Z_i}$.

In addition, recall that $Y_i = r(D_i, S^*_i, Z_i, |\mathcal{N}^*_i|, \varepsilon_i)$ as in Assumption 3.1. By Lemma B.2 and (B.12), we have $(\varepsilon_i, D_i) \perp (S^*_i, |\tilde{\mathcal{N}}_i|, |\mathcal{N}_i|) \mid Z_i, |\mathcal{N}^*_i|$. Moreover, from Lemma 4.1 we know that $S^*_i \perp (|\tilde{\mathcal{N}}_i|, |\mathcal{N}_i|) \mid Z_i, |\mathcal{N}^*_i|$. Therefore,
\[
f_{|\mathcal{N}^*_i|, |\tilde{\mathcal{N}}_i|, |\mathcal{N}_i|, S^*_i, \varepsilon_i, D_i, Z_i}
= f_{|\tilde{\mathcal{N}}_i|, |\mathcal{N}_i|, S^*_i, \varepsilon_i, D_i \mid Z_i, |\mathcal{N}^*_i|} \times f_{|\mathcal{N}^*_i|, Z_i}
= f_D \times f_{\varepsilon_i \mid Z_i, |\mathcal{N}^*_i|} \times f_{S^*_i \mid Z_i, |\mathcal{N}^*_i|} \times f_{|\tilde{\mathcal{N}}_i| \mid Z_i, |\mathcal{N}^*_i|} \times f_{|\mathcal{N}_i| \mid Z_i, |\mathcal{N}^*_i|} \times f_{|\mathcal{N}^*_i|, Z_i}.
\]
By the identical distribution of $\varepsilon_i$ given $(Z_i, |\mathcal{N}^*_i|)$ in Assumption 3.3, the fact that $f_{S^*_i \mid Z_i, |\mathcal{N}^*_i| = n^*}(s^*) = C^{s^*}_{n^*} f_D(1)^{s^*} f_D(0)^{n^* - s^*}$, and (B.16), we conclude that $(|\mathcal{N}^*_i|, |\tilde{\mathcal{N}}_i|, |\mathcal{N}_i|, S^*_i, \varepsilon_i, D_i, Z_i)$ is identically distributed, and hence so is $(|\mathcal{N}^*_i|, |\tilde{\mathcal{N}}_i|, |\mathcal{N}_i|, Y_i, Z_i)$.

(b) In this part, we first show that $f_{|\tilde{\mathcal{N}}_i| \mid Z_i, |\mathcal{N}^*_i|}$ and $f_{|\mathcal{N}_i| \mid Z_i, |\mathcal{N}^*_i|}$ are identified. We then verify the identification of $f_{|\mathcal{N}^*_i| \mid Z_i}$.
By the law of total probability, for any $(\tilde n, n, y) \in \Omega_{|\tilde{\mathcal{N}}|, |\mathcal{N}|, Y}$,
\[
f_{|\tilde{\mathcal{N}}_i|, |\mathcal{N}_i|, Y_i \mid Z_i}(\tilde n, n, y)
= \sum_{n^*\in\Omega_{|\mathcal{N}^*|}} f_{|\tilde{\mathcal{N}}_i|, |\mathcal{N}_i|, Y_i \mid Z_i, |\mathcal{N}^*_i| = n^*}(\tilde n, n, y) \times f_{|\mathcal{N}^*_i| \mid Z_i}(n^*)
= \sum_{n^*\in\Omega_{|\mathcal{N}^*|}} f_{Y_i \mid Z_i, |\mathcal{N}^*_i| = n^*, |\tilde{\mathcal{N}}_i| = \tilde n, |\mathcal{N}_i| = n}(y) \times f_{|\tilde{\mathcal{N}}_i|, |\mathcal{N}_i| \mid Z_i, |\mathcal{N}^*_i| = n^*}(\tilde n, n) \times f_{|\mathcal{N}^*_i| \mid Z_i}(n^*)
= \sum_{n^*\in\Omega_{|\mathcal{N}^*|}} f_{Y_i \mid Z_i, |\mathcal{N}^*_i| = n^*}(y) \times f_{|\tilde{\mathcal{N}}_i| \mid Z_i, |\mathcal{N}^*_i| = n^*}(\tilde n) \times f_{|\mathcal{N}_i| \mid Z_i, |\mathcal{N}^*_i| = n^*}(n) \times f_{|\mathcal{N}^*_i| \mid Z_i}(n^*), \tag{B.17}
\]
where the last equality is due to Assumption 4.1 and Lemma B.3. Integrating both sides of (B.17) against $y$ gives
\[
\int_{\Omega_Y} y\, f_{|\tilde{\mathcal{N}}_i|, |\mathcal{N}_i|, Y_i \mid Z_i}(\tilde n, n, y)\, dy
= \sum_{n^*\in\Omega_{|\mathcal{N}^*|}} E[Y_i \mid Z_i, |\mathcal{N}^*_i| = n^*] \times f_{|\tilde{\mathcal{N}}_i| \mid Z_i, |\mathcal{N}^*_i| = n^*}(\tilde n) \times f_{|\mathcal{N}_i| \mid Z_i, |\mathcal{N}^*_i| = n^*}(n) \times f_{|\mathcal{N}^*_i| \mid Z_i}(n^*). \tag{B.18}
\]
Besides, for any $(\tilde n, n) \in \Omega_{|\tilde{\mathcal{N}}|, |\mathcal{N}|}$, because of Assumption 4.1,
\[
f_{|\tilde{\mathcal{N}}_i|, |\mathcal{N}_i| \mid Z_i}(\tilde n, n)
= \sum_{n^*\in\Omega_{|\mathcal{N}^*|}} f_{|\tilde{\mathcal{N}}_i| \mid Z_i, |\mathcal{N}^*_i| = n^*}(\tilde n) \times f_{|\mathcal{N}_i| \mid Z_i, |\mathcal{N}^*_i| = n^*}(n) \times f_{|\mathcal{N}^*_i| \mid Z_i}(n^*). \tag{B.19}
\]
Recall the notation from the main text: the $K_{|\mathcal{N}|} \times K_{|\mathcal{N}|}$ matrices
\[
E_{|\tilde{N}|, |N|, Y|Z} = \Big[\int_{\Omega_Y} y\, f_{|\tilde{\mathcal{N}}_i|, |\mathcal{N}_i|, Y_i \mid Z_i}(\tilde n, n, y)\, dy\Big]_{\tilde n, n = 0, \dots, K_{|\mathcal{N}|}-1}, \qquad
F_{|\tilde{N}|, |N||Z} = \big[f_{|\tilde{\mathcal{N}}_i|, |\mathcal{N}_i| \mid Z_i}(\tilde n, n)\big]_{\tilde n, n = 0, \dots, K_{|\mathcal{N}|}-1},
\]
and the two $K_{|\mathcal{N}|} \times K_{|\mathcal{N}|}$ diagonal matrices
\[
T_{Y|Z, |N^*|} = \mathrm{diag}\big(E[Y_i \mid Z_i, |\mathcal{N}^*_i| = 0], E[Y_i \mid Z_i, |\mathcal{N}^*_i| = 1], \dots, E[Y_i \mid Z_i, |\mathcal{N}^*_i| = K_{|\mathcal{N}|}-1]\big),
\]
\[
T_{|N^*||Z} = \mathrm{diag}\big(f_{|\mathcal{N}^*_i| \mid Z_i}(0), f_{|\mathcal{N}^*_i| \mid Z_i}(1), \dots, f_{|\mathcal{N}^*_i| \mid Z_i}(K_{|\mathcal{N}|}-1)\big).
\]
With this notation, (B.18) and (B.19) can be rewritten as
\[
E_{|\tilde{N}|, |N|, Y|Z} = F_{|\tilde{N}||Z, |N^*|} \times T_{Y|Z, |N^*|} \times T_{|N^*||Z} \times F'_{|N||Z, |N^*|}, \tag{B.20}
\]
\[
F_{|\tilde{N}|, |N||Z} = F_{|\tilde{N}||Z, |N^*|} \times T_{|N^*||Z} \times F'_{|N||Z, |N^*|}, \tag{B.21}
\]
where $F_{|\tilde{N}||Z, |N^*|}$ and $F_{|N||Z, |N^*|}$ are defined in the main text. By Assumption 4.3, $F_{|\tilde{N}||Z, |N^*|}$ and $F_{|N||Z, |N^*|}$ are invertible. In addition, by Assumption 4.4 (b), $f_{|\mathcal{N}^*_i| \mid Z_i}(n) > \eta > 0$ for all $n \in \Omega_{|\mathcal{N}^*|}$, which implies the invertibility of $T_{|N^*||Z}$. Hence, (B.21) implies that $F_{|\tilde{N}|, |N||Z}$ is also invertible. It then follows from (B.20) and (B.21) that the square matrix $E_{|\tilde{N}|, |N|, Y|Z} \times F^{-1}_{|\tilde{N}|, |N||Z}$ can be factorized as
\[
E_{|\tilde{N}|, |N|, Y|Z} \times F^{-1}_{|\tilde{N}|, |N||Z} = F_{|\tilde{N}||Z, |N^*|} \times T_{Y|Z, |N^*|} \times F^{-1}_{|\tilde{N}||Z, |N^*|}, \tag{B.22}
\]
where the matrix $E_{|\tilde{N}|, |N|, Y|Z} \times F^{-1}_{|\tilde{N}|, |N||Z}$ on the left-hand side is identifiable from the observed data, and the right-hand side corresponds to its eigendecomposition, whose eigenvalues are the diagonal entries of $T_{Y|Z, |N^*|}$. By Assumption 4.4 (a), all $K_{|\mathcal{N}|}$ eigenvalues in the diagonal matrix $T_{Y|Z, |N^*|}$ are strictly positive and distinct.
Thus, given the eigendecomposition of the matrix $E_{|\tilde{N}|, |N|, Y|Z} \times F^{-1}_{|\tilde{N}|, |N||Z}$ in (B.22), its $K_{|\mathcal{N}|}$ eigenvectors are linearly independent and correspond to the $K_{|\mathcal{N}|}$ columns of $F_{|\tilde{N}||Z, |N^*|}$. By simple algebra, we can solve for the $K_{|\mathcal{N}|}$ eigenvectors, meaning that the columns of $F_{|\tilde{N}||Z, |N^*|}$ are identifiable. Moreover, Assumption 4.4 (b) ensures that each eigenvector has a unique maximum entry, whose location reveals which eigenvalue it corresponds to. For example, if the largest value in some eigenvector appears in its first entry, then this eigenvector gives the latent probabilities $[f_{|\tilde{\mathcal{N}}_i| \mid Z_i, |\mathcal{N}^*_i| = 1}(1), f_{|\tilde{\mathcal{N}}_i| \mid Z_i, |\mathcal{N}^*_i| = 1}(2), \dots, f_{|\tilde{\mathcal{N}}_i| \mid Z_i, |\mathcal{N}^*_i| = 1}(K_{|\mathcal{N}|})]'$ and corresponds to the eigenvalue $E[Y_i \mid Z_i, |\mathcal{N}^*_i| = 1]$. Because the sum of each column of the matrix $F_{|\tilde{N}_i||Z_i, |N^*_i|}$ is naturally normalized to one, there is a unique solution for each eigenvector. The above discussion verifies that $F_{|\tilde{N}_i||Z_i, |N^*_i|}$ is nonparametrically identified. The same arguments can be used to show the identification of $F_{|N_i||Z_i, |N^*_i|}$.

Next, consider $f_{|\mathcal{N}^*_i| \mid Z_i}$. Define the two $K_{|\mathcal{N}|} \times 1$ vectors
\[
F_{|N^*||Z} = \big[f_{|\mathcal{N}^*_i| \mid Z_i}(0)\ \ f_{|\mathcal{N}^*_i| \mid Z_i}(1)\ \cdots\ f_{|\mathcal{N}^*_i| \mid Z_i}(K_{|\mathcal{N}|}-1)\big]', \qquad
F_{|N||Z} = \big[f_{|\mathcal{N}_i| \mid Z_i}(0)\ \ f_{|\mathcal{N}_i| \mid Z_i}(1)\ \cdots\ f_{|\mathcal{N}_i| \mid Z_i}(K_{|\mathcal{N}|}-1)\big]'.
\]
By the law of total probability, $F_{|N||Z} = F_{|N||Z, |N^*|} \times F_{|N^*||Z}$. Since $F_{|N||Z, |N^*|}$ is invertible, multiplying both sides by $F^{-1}_{|N||Z, |N^*|}$ gives
\[
F_{|N^*||Z} = F^{-1}_{|N||Z, |N^*|} \times F_{|N||Z}, \tag{B.23}
\]
which establishes the identifiability of $F_{|N^*||Z}$.
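The constructive identification in (B.20)-(B.23) can be replicated numerically. The sketch below generates made-up "true" objects satisfying Assumptions 4.3-4.4 (invertible pmf matrices with diagonally located column maxima and distinct positive eigenvalues), forms the two observable matrices, and recovers the latent objects exactly. It is an illustration of the argument, not the paper's estimator.

```python
# Numerical sketch of (B.20)-(B.23); the "true" objects below are made up.
import numpy as np

rng = np.random.default_rng(2)
K = 4

def pmf_matrix():
    # columns f(.|n*) with a unique maximum on the diagonal (Assumption 4.4 (b))
    M = rng.uniform(0.05, 0.3, (K, K)) + 2.0 * np.eye(K)
    return M / M.sum(axis=0)

F1 = pmf_matrix()                          # F_{|N~| | Z, |N*|}
F2 = pmf_matrix()                          # F_{|N|  | Z, |N*|}
T = np.array([1.0, 2.0, 3.5, 5.0])         # E[Y | Z, |N*|], distinct (Asm 4.4 (a))
pi = np.array([0.4, 0.3, 0.2, 0.1])        # f_{|N*| | Z}

E_mat = F1 @ np.diag(T * pi) @ F2.T        # "observable" matrix in (B.20)
F_mat = F1 @ np.diag(pi) @ F2.T            # "observable" matrix in (B.21)

def recover(A, B):
    # eigendecomposition of A @ B^{-1}; columns rescaled to sum to one, as in (B.22)
    lam, vec = np.linalg.eig(A @ np.linalg.inv(B))
    lam, vec = lam.real, vec.real
    vec = vec / vec.sum(axis=0)
    order = np.argsort(vec.argmax(axis=0))  # match columns via the max-entry location
    return vec[:, order], lam[order]

F1_hat, T_hat = recover(E_mat, F_mat)            # columns of F1 and E[Y | Z, |N*|]
F2_hat, _ = recover(E_mat.T, F_mat.T)            # transposed system recovers F2
pi_hat = np.linalg.solve(F2_hat, F_mat.sum(0))   # (B.23): marginal of |N| equals F2 @ pi
print(np.allclose(F1_hat, F1), np.allclose(T_hat, T), np.allclose(pi_hat, pi))
```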
Proof of Lemma 4.4. Recall that $\Delta S_i := S_i - S^*_i$. (a) Consider the case $\mathcal{N}^*_i \subset \mathcal{N}_i$. For all $(s, n) \in \Omega_{S, |\mathcal{N}|}$ and $(s^*, n^*) \in \Omega_{S^*, |\mathcal{N}^*|}$ such that $n^* \le n$,
\[
f_{S_i \mid S^*_i = s^*, Z_i, |\mathcal{N}_i| = n, |\mathcal{N}^*_i| = n^*}(s)
= f_{\Delta S_i \mid S^*_i = s^*, Z_i, |\mathcal{N}_i| = n, |\mathcal{N}^*_i| = n^*}(s - s^*)
= \sum_{(J^*, J):\, J^* \subset J,\, |J^*| = n^*,\, |J| = n} f_{\Delta S_i \mid S^*_i = s^*, Z_i, |\mathcal{N}_i| = n, |\mathcal{N}^*_i| = n^*, \mathcal{N}^*_i = J^*, \mathcal{N}_i = J}(s - s^*) \times f_{\mathcal{N}^*_i, \mathcal{N}_i \mid S^*_i = s^*, Z_i, |\mathcal{N}_i| = n, |\mathcal{N}^*_i| = n^*}(J^*, J), \tag{B.24}
\]
where the last line is based on the law of total probability. Because $\mathcal{N}^*_i \subset \mathcal{N}_i$, the set $\mathcal{N}^*_i \setminus \mathcal{N}_i$ is empty and $\Delta S_i = \sum_{j\in\mathcal{N}_i\setminus\mathcal{N}^*_i} D_j$. In addition, $\mathcal{N}_i \setminus \mathcal{N}^*_i$ and $\mathcal{N}^*_i$ are mutually exclusive; i.e., if $j \in \mathcal{N}_i \setminus \mathcal{N}^*_i$ then $j \notin \mathcal{N}^*_i$. Due to the i.i.d. of $\{D_i\}_{i\in\mathcal{P}}$ (Assumption 3.2) and the independence between $\{D_i\}_{i\in\mathcal{P}}$ and $(Z_i, \mathcal{N}^*_i, \mathcal{N}_i)$ (Assumption 3.4), we have $\Delta S_i \perp S^*_i \mid Z_i, \mathcal{N}^*_i, \mathcal{N}_i$. Therefore,
\[
f_{\Delta S_i \mid S^*_i = s^*, Z_i, |\mathcal{N}_i| = n, |\mathcal{N}^*_i| = n^*, \mathcal{N}^*_i = J^*, \mathcal{N}_i = J}(s - s^*) = f_{\Delta S_i \mid Z_i, |\mathcal{N}_i| = n, |\mathcal{N}^*_i| = n^*, \mathcal{N}^*_i = J^*, \mathcal{N}_i = J}(s - s^*). \tag{B.25}
\]
Again by the independence of $\{D_i\}_{i\in\mathcal{P}}$, once we condition on $\mathcal{N}_i \setminus \mathcal{N}^*_i = J \setminus J^*$, $\Delta S_i = \sum_{j\in J\setminus J^*} D_j$ follows a binomial distribution if $s^* \le s$ and $\Delta s \le \Delta n$, and it is independent of the identities of the network neighbors contained in $(\mathcal{N}^*_i, \mathcal{N}_i)$. Then (B.25) becomes
\[
f_{\Delta S_i \mid Z_i, |\mathcal{N}_i| = n, |\mathcal{N}^*_i| = n^*, \mathcal{N}^*_i = J^*, \mathcal{N}_i = J}(s - s^*)
= f_{\Delta S_i \mid Z_i, |\mathcal{N}^*_i| = n^*, |\mathcal{N}_i\setminus\mathcal{N}^*_i| = n - n^*, \mathcal{N}^*_i = J^*, \mathcal{N}_i\setminus\mathcal{N}^*_i = J\setminus J^*}(s - s^*)
= f_{\Delta S_i \mid Z_i, |\mathcal{N}_i\setminus\mathcal{N}^*_i| = \Delta n}(\Delta s), \tag{B.26}
\]
where the last equality follows the same arguments used to show Lemma 4.1 (a). Substituting (B.26) into (B.24) gives the desired result.

(b) Arguments similar to those used for the case $\mathcal{N}^*_i \subset \mathcal{N}_i$ apply to the case $\mathcal{N}_i \subset \mathcal{N}^*_i$, so we omit the proof.

Proof of Theorem 4.5. (a) From Proposition 4.2, we know that
\[
f_{S^*_i, |\mathcal{N}^*_i| \mid D_i, S_i, Z_i, |\mathcal{N}_i|}
= \frac{f_{S_i \mid S^*_i, Z_i, |\mathcal{N}^*_i|, |\mathcal{N}_i|} \times f_{S^*_i \mid Z_i, |\mathcal{N}^*_i|} \times f_{|\mathcal{N}_i| \mid Z_i, |\mathcal{N}^*_i|} \times f_{|\mathcal{N}^*_i| \mid Z_i}}{f_{S_i \mid Z_i, |\mathcal{N}_i|} \times f_{|\mathcal{N}_i| \mid Z_i}}. \tag{B.27}
\]
Based on Lemma 4.1, $f_{S^*_i \mid Z_i, |\mathcal{N}^*_i| = n^*}(s^*) = C^{s^*}_{n^*} f_D(1)^{s^*} f_D(0)^{n^* - s^*}$ and $f_{S_i \mid Z_i, |\mathcal{N}_i| = n}(s) = C^s_n f_D(1)^s f_D(0)^{n - s}$. Similarly, from Lemma 4.4,
\[
f_{S_i \mid S^*_i = s^*, Z_i, |\mathcal{N}^*_i| = n^*, |\mathcal{N}_i| = n}(s) = C^{\Delta s}_{\Delta n} f_D(1)^{\Delta s} f_D(0)^{\Delta n - \Delta s}. \tag{B.28}
\]
Because $D_i$ is i.i.d., and by Assumptions 3.3 and 3.4, $f_{|\mathcal{N}^*_i| \mid Z_i}$, $f_{|\mathcal{N}_i| \mid Z_i, |\mathcal{N}^*_i|}$ and $f_{|\mathcal{N}_i| \mid Z_i}$ are identical across $i$. Therefore, all distributions on the right-hand side of (B.27) are identical across $i$ and, together with Theorem 4.3, $f_{S_i \mid S^*_i = s^*, Z_i, |\mathcal{N}_i| = n, |\mathcal{N}^*_i| = n^*}(s)$ is nonparametrically identified.

Proof of Theorem 4.6. (a) From Theorem 4.5, we know that $f_{S^*_i, |\mathcal{N}^*_i| \mid D_i, S_i, Z_i, |\mathcal{N}_i|}$ is identical across $i$; together with Proposition 3.1, $m_i$ is also identical for all $i \in \mathcal{P}$.

(b) To ease notation, denote $T_i = (S_i, |\mathcal{N}_i|)'$ and $T^*_i = (S^*_i, |\mathcal{N}^*_i|)'$. By Assumption 4.2, the supports of $T_i$ and $T^*_i$ are the same; denote it $\Omega_T = \{t_1, t_2, \dots, t_{K_T}\}$ with $t_k = (s_k, n_k) \in \Omega_{S, |\mathcal{N}|}$. Rank the possible values in $\Omega_T$ by the lexicographical ordering, according to the natural order of the integers in $\Omega_{S, |\mathcal{N}|}$; i.e.,
\[
t_1 = (0, 0),\ t_2 = (0, 1),\ t_3 = (1, 1),\ t_4 = (0, 2),\ t_5 = (1, 2),\ t_6 = (2, 2),\ \dots,\ t_{(K_{|\mathcal{N}|}-1)K_{|\mathcal{N}|}/2 + 1} = (0, K_{|\mathcal{N}|}-1),\ \dots,\ t_{(K_{|\mathcal{N}|}-1)(K_{|\mathcal{N}|}+2)/2 + 1} = (K_{|\mathcal{N}|}-1, K_{|\mathcal{N}|}-1). \tag{B.29}
\]
Because, by the result in (a), $m_i(\cdot)$ is identical across $i$, we suppress the subscript $i$ and write $m(d, s, z, n) := E[Y_i \mid D_i = d, S_i = s, Z_i = z, |\mathcal{N}_i| = n]$. With a slight abuse of notation, we suppress the arguments $(d, z)$ in the functions $m$ and $m^*$ and introduce the following notation.
For any $(d, z) \in \{0, 1\} \times \Omega_Z$, denote by $M_{Y|T, D=d, Z=z}$ and $M_{Y|T^*, D=d, Z=z}$ the two $K_T \times 1$ vectors
\[
M_{Y|T, D=d, Z=z}(m) = [m(t_1), m(t_2), \dots, m(t_{K_T})]', \tag{B.30}
\]
\[
M_{Y|T^*, D=d, Z=z}(m^*) = [m^*(t_1), m^*(t_2), \dots, m^*(t_{K_T})]', \tag{B.31}
\]
where $m(t_k)$ represents the mean function $m(d, s_k, z, n_k) = E[Y_i \mid D_i = d, S_i = s_k, Z_i = z, |\mathcal{N}_i| = n_k]$. Define the $K_T \times K_T$ matrix
\[
F_{T^*|T, D=d, Z=z} = \big[f_{T^*_i \mid T_i = t_k, D_i = d, Z_i = z}(t_l)\big]_{k, l = 1, \dots, K_T}. \tag{B.32}
\]
From Proposition 3.1 and the notation in (B.30)-(B.32), we have, for any $(d, z) \in \{0, 1\} \times \Omega_Z$,
\[
M_{Y|T, D=d, Z=z}(m) = F_{T^*|T, D=d, Z=z} \times M_{Y|T^*, D=d, Z=z}(m^*). \tag{B.33}
\]
Given Proposition 4.2, for all $(d, z) \in \{0, 1\} \times \Omega_Z$, the elements on the main diagonal of $F_{T^*|T, D=d, Z=z}$ are
\[
f_{S^*_i, |\mathcal{N}^*_i| \mid D_i = d, S_i = s, Z_i = z, |\mathcal{N}_i| = n}(s, n)
= \frac{f_{S_i \mid S^*_i = s, |\mathcal{N}^*_i| = n, |\mathcal{N}_i| = n, Z_i = z}(s) \times f_{S^*_i \mid Z_i = z, |\mathcal{N}^*_i| = n}(s) \times f_{|\mathcal{N}_i| \mid Z_i = z, |\mathcal{N}^*_i| = n}(n) \times f_{|\mathcal{N}^*_i| \mid Z_i = z}(n)}{f_{S_i \mid Z_i = z, |\mathcal{N}_i| = n}(s) \times f_{|\mathcal{N}_i| \mid Z_i = z}(n)}
= \frac{f_{|\mathcal{N}_i| \mid Z_i = z, |\mathcal{N}^*_i| = n}(n) \times f_{|\mathcal{N}^*_i| \mid Z_i = z}(n)}{f_{|\mathcal{N}_i| \mid Z_i = z}(n)},
\]
where the second equality follows from Lemma 4.1 and Lemma 4.4. In addition, by Assumption 4.4 (b), we know that $f_{|\mathcal{N}^*_i| \mid Z_i = z, |\mathcal{N}_i| = n}(n) > 0$, which also implies $f_{|\mathcal{N}^*_i| \mid Z_i = z}(n) > 0$. Therefore, under the maintained assumption that $f_{|\mathcal{N}_i| \mid Z_i = z}(n) > 0$, we conclude that
\[
f_{S^*_i, |\mathcal{N}^*_i| \mid D_i = d, S_i = s, Z_i = z, |\mathcal{N}_i| = n}(s, n) > 0 \qquad \text{for all } (s, n) \in \Omega_{S^*, |\mathcal{N}^*|}. \tag{B.34}
\]
In what follows, we prove the desired result in two steps. First, we show that the square matrix $F_{T^*|T, D=d, Z=z}$ is invertible. Second, we show that the CASF $m^*$ is identifiable from (B.33).

Step 1. Consider any $t^* = (s^*, n^*)$ and $t = (s, n)$ such that $0 \le s^* \le n^*$ and $0 \le s \le n$. Under Assumption 4.5, two cases arise. First, suppose $\mathcal{N}^*_i \subset \mathcal{N}_i$ holds. Then $S^*_i \le S_i$ and $|\mathcal{N}^*_i| \le |\mathcal{N}_i|$, so $f_{T^*_i \mid T_i = t, D_i = d, Z_i = z}(t^*) = 0$ if at least one of the restrictions $s^* \le s$ and $n^* \le n$ is violated. Similarly, when $\mathcal{N}_i \subset \mathcal{N}^*_i$ holds, we have $S_i \le S^*_i$ and $|\mathcal{N}_i| \le |\mathcal{N}^*_i|$, so $f_{T^*_i \mid T_i = t, D_i = d, Z_i = z}(t^*) = 0$ if at least one of the restrictions $s \le s^*$ and $n \le n^*$ is violated. Given the lexicographical ordering of the elements of $\Omega_T$, it is easy to see that the matrix $F_{T^*|T, D=d, Z=z}$ is lower triangular if $\mathcal{N}^*_i \subset \mathcal{N}_i$, and upper triangular if $\mathcal{N}_i \subset \mathcal{N}^*_i$. Moreover, (B.34) implies that all elements on the main diagonal of the triangular matrix $F_{T^*|T, D=d, Z=z}$ are strictly positive. Since the eigenvalues of a triangular matrix are its diagonal entries, the matrix $F_{T^*|T, D=d, Z=z}$ is invertible.

Step 2. Next, we show that the CASF $m^*$ is identifiable. Suppose $m^*$ is not identifiable; then there exists $\tilde m^* \ne m^*$ such that $\tilde m^*$ is observationally equivalent to $m^*$, in the sense that (B.33) also holds for $\tilde m^*$:
\[
M_{Y|T, D=d, Z=z}(m) = F_{T^*|T, D=d, Z=z}\, M_{Y|T^*, D=d, Z=z}(\tilde m^*). \tag{B.35}
\]
It then follows from (B.33) and (B.35) that
\[
0 = F_{T^*|T, D=d, Z=z}\big[M_{Y|T^*, D=d, Z=z}(m^*) - M_{Y|T^*, D=d, Z=z}(\tilde m^*)\big]. \tag{B.36}
\]
Since $F_{T^*|T, D=d, Z=z}$ is invertible, (B.36) implies $M_{Y|T^*, D=d, Z=z}(m^*) = M_{Y|T^*, D=d, Z=z}(\tilde m^*)$, meaning that $\tilde m^*(t_k) = m^*(t_k)$ for all $k = 1, 2, \dots, K_T$, which contradicts $\tilde m^* \ne m^*$. Therefore, we conclude that $m^*$ is identifiable.
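The two steps can be mimicked numerically: under nested errors (Assumption 4.5) the matrix in (B.32) is triangular with a strictly positive diagonal, so the linear system (B.33) has a unique solution. A minimal sketch with made-up values follows.

```python
# Sketch of Theorem 4.6, Steps 1-2: triangular F_{T*|T} makes (B.33) uniquely solvable.
import numpy as np

K = 4
F = np.tril(np.full((K, K), 0.1)) + 0.5 * np.eye(K)  # lower triangular (case N* in N)
F = F / F.sum(axis=1, keepdims=True)                 # rows are pmfs f_{T*|T = t_k}
m_star = np.array([0.5, 1.0, 1.8, 2.5])              # true CASF on the ordered support
m_obs = F @ m_star                                   # observed means, as in (B.33)
print(np.linalg.solve(F, m_obs))                     # recovers m_star exactly
```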
B.3 Proofs of Section 5

Proof of Theorem 5.2.
For simplicity of exposition and with a slight abuse of notation, let $W_i$ denote any generic vector of observable variables of interest, where $W_i = (W_i^{c\prime}, W_i^{d\prime})' \in \Omega_{W^c} \times \Omega_{W^d}$, with the $Q \times 1$ vector $W_i^c := (W_{i1}^c, \dots, W_{iQ}^c)'$ containing the continuous variables and the vector $W_i^d$ containing the discrete variables. In this proof, we focus on the uniform convergence rate of the kernel estimator $\hat f_{W_i}(w)$; replacing $W_i$ by the observable variables of interest gives the stated results.

Denote $w = (w^{c\prime}, w^{d\prime})'$ with $w^c = (w_1^c, \dots, w_Q^c)'$, and $\hat f_{W_i}(w) = N^{-1}\sum_{i=1}^N \hat f_i^{\mathrm{ker}}(w)$, where
\[
\hat f_i^{\mathrm{ker}}(w) := K(W_i^c, w^c)\, \mathbb{1}\big[W_i^d = w^d\big], \tag{B.37}
\]
with $K(W_i^c, w^c) = h^{-Q}\prod_{q=1}^Q \kappa\big((W_{iq}^c - w_q^c)/h\big)$. Let $f_{W_i}(w)$ be the true distribution of $W_i$. For any $w \in \Omega_W$,
\[
\big|\hat f_{W_i}(w) - f_{W_i}(w)\big| \le \big|\hat f_{W_i}(w) - E[\hat f_{W_i}(w)]\big| + \big|E[\hat f_{W_i}(w)] - f_{W_i}(w)\big|.
\]
Given the inequality above, we prove the uniform convergence of $|\hat f_{W_i}(w) - f_{W_i}(w)|$ and its rate in two steps. In Step 1, we show that the bias of $\hat f_{W_i}(w)$, i.e. $|E[\hat f_{W_i}(w)] - f_{W_i}(w)|$, is $O(h^2)$ uniformly. In Step 2, we show the uniform convergence of $\hat f_{W_i}(w)$ to $E[\hat f_{W_i}(w)]$ and establish its convergence rate.

Step 1. Let $w^{d*}$ and $w^{c*} := (w_1^{c*}, \dots, w_Q^{c*})'$ be generic elements of $\Omega_{W^d}$ and $\Omega_{W^c}$, respectively. Then, for $w = (w^{c\prime}, w^{d\prime})'$,
\[
E\big[\hat f_i^{\mathrm{ker}}(w)\big] = h^{-Q}\sum_{w^{d*}\in\Omega_{W^d}} \mathbb{1}[w^{d*} = w^d] \int \prod_{q=1}^Q \kappa\Big(\frac{w_q^{c*} - w_q^c}{h}\Big) f_{W_i^c, W_i^d}(w^{c*}, w^{d*})\, dw^{c*};
\]
by the change of variables $v = (v_1, \dots, v_Q)'$ with $v_q = (w_q^{c*} - w_q^c)/h$, $q = 1, \dots, Q$,
\[
E\big[\hat f_i^{\mathrm{ker}}(w)\big] = \sum_{w^{d*}\in\Omega_{W^d}} \mathbb{1}[w^{d*} = w^d] \int \prod_{q=1}^Q \kappa(v_q)\, f_{W_i^c, W_i^d}(w^c + hv, w^{d*})\, dv
= \int f_{W_i^c, W_i^d}(w^c + hv, w^d)\prod_{q=1}^Q \kappa(v_q)\, dv, \tag{B.38}
\]
where we use the shorthand $w^c + hv := (w_1^c + hv_1, \dots, w_Q^c + hv_Q)$. Let the $Q \times 1$ vector $f_c^{(1)}(w) := \partial f_{W_i}(w)/\partial w^c$ represent the first-order derivative of $f_{W_i}(w)$ with respect to $w^c$, and let the $Q \times Q$ matrix $f_c^{(2)}(w) := \partial^2 f_{W_i}(w)/\partial w^c \partial w^{c\prime}$ be the second-order derivative of $f_{W_i}$ with respect to $w^c$. Consider the Taylor expansion of $f_{W_i^c, W_i^d}(w^c + hv, w^d)$ around $w$:
\[
f_{W_i^c, W_i^d}(w^c + hv, w^d) - f_{W_i^c, W_i^d}(w^c, w^d) = h f_c^{(1)}(w)'v + \tfrac{h^2}{2}\, v' f_c^{(2)}(\tilde w)\, v, \tag{B.39}
\]
where $\tilde w$ is between $(w^c + hv, w^d)$ and $(w^c, w^d)$. Since $W_i$ is identically distributed by Theorems 4.3 and 4.5, we have $E[\hat f_{W_i}(w)] = E[\hat f_i^{\mathrm{ker}}(w)]$. Plugging (B.39) into (B.38) gives
\[
\big|E[\hat f_{W_i}(w)] - f_{W_i}(w)\big|
= \Big|\int \big[h f_c^{(1)}(w)'v + \tfrac{h^2}{2} v' f_c^{(2)}(\tilde w) v\big]\prod_{q=1}^Q \kappa(v_q)\, dv\Big|
= \Big|h f_c^{(1)}(w)'\int v \prod_{q=1}^Q \kappa(v_q)\, dv + \tfrac{h^2}{2}\int v' f_c^{(2)}(\tilde w) v \prod_{q=1}^Q \kappa(v_q)\, dv\Big|
\le C h^2 \int v'v \prod_{q=1}^Q \kappa(v_q)\, dv
= C h^2 \sum_{q=1}^Q \int v_q^2\, \kappa(v_q)\, dv_q, \tag{B.40}
\]
where the inequality holds because each element of $f_c^{(2)}$ is bounded uniformly in $w^c$, and the symmetric kernel $\kappa(\cdot)$ in Assumption 5.2 (c) implies $\int \kappa(v_q) v_q\, dv_q = 0$, so that $\int v \prod_q \kappa(v_q)\, dv = (\int v_1 \kappa(v_1) dv_1, \dots, \int v_Q \kappa(v_Q) dv_Q)' = (0, \dots, 0)'$. From (B.40), we get
\[
\sup_{w\in\Omega_W} \big|E[\hat f_{W_i}(w)] - f_{W_i}(w)\big| \le \sup_{w\in\Omega_W}\Big|C h^2 \sum_{q=1}^Q \int \kappa(v_q) v_q^2\, dv_q\Big| \le C K_2 Q h^2 = O(h^2). \tag{B.41}
\]
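The estimator in (B.37) is straightforward to compute. Below is a minimal implementation with a Gaussian kernel; the simulated data, bandwidth rule, and evaluation point are illustrative assumptions, not the paper's empirical choices.

```python
# Minimal implementation of the mixed-data kernel estimator (B.37); Gaussian kernel.
import numpy as np

def f_hat(w_c, w_d, Wc, Wd, h):
    # Wc: (N, Q) continuous covariates; Wd: (N,) discrete covariate
    u = (Wc - w_c) / h
    kern = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)   # kappa applied coordinatewise
    K = kern.prod(axis=1) / h ** Wc.shape[1]          # product kernel times h^{-Q}
    return np.mean(K * (Wd == w_d))                   # indicator handles the discrete part

rng = np.random.default_rng(3)
N, Q = 5_000, 2
Wc, Wd = rng.normal(size=(N, Q)), rng.integers(0, 2, size=N)
h = N ** (-1 / (Q + 4))                               # a rule-of-thumb rate assumption
print(f_hat(np.zeros(Q), 1, Wc, Wd, h))               # truth: 0.5/(2*pi) ~ 0.0796
```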
Step 2. Next, we show the uniform convergence of $|\hat f_{W_i}(w) - E[\hat f_{W_i}(w)]|$. Since $\Omega_{W^c}$ is compact and $\Omega_{W^d}$ has finite dimension (Assumption 5.2 (a)), for some constant $C > 0$, $\Omega_W$ can be covered by fewer than $L_N = C l_N^{-Q}$ open balls of radius $l_N$, where for any $w = (w^{c\prime}, w^{d\prime})'$ and $\tilde w = (\tilde w^{c\prime}, \tilde w^{d\prime})'$ in the same ball we have $w^d = \tilde w^d$. Denote the centers of these open balls by $\bar w_{j\epsilon}$, $j = 1, 2, \dots, J(\epsilon)$, with $J(\epsilon) \le L_N$. For any $w, \tilde w$ in the same ball, the mean value theorem implies that
\[
\sup_{\|w - \tilde w\| < \epsilon}\big|\hat f_{W_i}(w) - \hat f_{W_i}(\tilde w)\big|
\le \sup_{\|w - \tilde w\| < \epsilon}\frac{1}{N}\sum_{i=1}^N \big|K(W_i^c, w^c) - K(W_i^c, \tilde w^c)\big|
= \sup_{\|w - \tilde w\| < \epsilon}\frac{1}{N h^Q}\sum_{i=1}^N \Big|\prod_{q=1}^Q \kappa\Big(\frac{W_{iq}^c - w_q^c}{h}\Big) - \prod_{q=1}^Q \kappa\Big(\frac{W_{iq}^c - \tilde w_q^c}{h}\Big)\Big|
\le \sup_{\|w - \tilde w\| < \epsilon}\frac{1}{N h^{Q+1}}\sum_{i=1}^N |\tilde\kappa'(w_h^{c*})|\,\|w^c - \tilde w^c\|
\le C l_N h^{-(Q+1)}, \tag{B.42}
\]
where $w_h^{c*}$ denotes an intermediate value between $(W_i^c - w^c)/h$ and $(W_i^c - \tilde w^c)/h$, and $\tilde\kappa'(v)$ represents the first-order derivative of $\prod_{q=1}^Q \kappa(v_q)$ with respect to $v = (v_1, \dots, v_Q)'$. The last line of (B.42) follows from the boundedness of $\kappa(\cdot)$ and the uniform boundedness of its first-order derivative (Assumption 5.2). Let $\bar w_{j\epsilon}$ denote the center of an open ball containing $w$. Then
\[
\sup_{w\in\Omega_W}\big|\hat f_{W_i}(w) - E[\hat f_{W_i}(w)]\big|
\le \max_{1\le j\le L_N}\sup_{\|w - \bar w_{j\epsilon}\| < \epsilon}\big|\hat f_{W_i}(w) - \hat f_{W_i}(\bar w_{j\epsilon})\big|
+ \max_{1\le j\le L_N}\big|\hat f_{W_i}(\bar w_{j\epsilon}) - E[\hat f_{W_i}(\bar w_{j\epsilon})]\big|
+ \max_{1\le j\le L_N}\sup_{\|w - \bar w_{j\epsilon}\| < \epsilon}\big|E[\hat f_{W_i}(w)] - E[\hat f_{W_i}(\bar w_{j\epsilon})]\big|
:= R_1 + R_2 + R_3. \tag{B.43}
\]
By (B.42), we find immediately that $R_1$ and $R_3$ can be bounded as
\[
R_1 \le C_1 l_N h^{-(Q+1)}, \qquad R_3 \le C_3 l_N h^{-(Q+1)}, \tag{B.44}
\]
for some constants $C_1, C_3$. The main task is then to find the convergence rate of $R_2$. Denote $Q_{N,i} := Q_{N,i}(w) = \big(\hat f_i^{\mathrm{ker}}(w) - E[\hat f_i^{\mathrm{ker}}(w)]\big)/N$, where, to ease notation, we suppress the argument $w$ in $Q_{N,i}(w)$. Then $\hat f_{W_i}(w) - E[\hat f_{W_i}(w)] = \sum_{i=1}^N Q_{N,i}$. Following the method of Masry (1996), which approximates dependent random variables by independent ones, we divide the treatment of $R_2$ into two parts:

• Step 2.1 constructs the approximation process;
• Step 2.2 shows that the independent random variable approximation converges uniformly and verifies the uniform convergence of the remainder term.
Step 2.1. Recall that $S_1, \dots, S_{q_N}$ are the mutually exclusive partitions of the index set $\{1, 2, \dots, N\}$ with $\bigcup_{l=1,\dots,q_N} S_l = \{1, 2, \dots, N\}$. Define $V_N(k) = \sum_{i\in S_k} Q_{N,i}$ for $k = 1, \dots, q_N$, and
\[
W_N' = \sum_{k=1}^{q_N/2} V_N(2k-1), \quad W_N'' = \sum_{k=1}^{q_N/2} V_N(2k) \quad \text{if } q_N \text{ is even}; \qquad
W_N' = \sum_{k=1}^{(q_N+1)/2} V_N(2k-1), \quad W_N'' = \sum_{k=1}^{(q_N-1)/2} V_N(2k) \quad \text{if } q_N \text{ is odd},
\]
so that $\hat f_{W_i}(w) - E[\hat f_{W_i}(w)] = W_N' + W_N''$, where $W_N'$ and $W_N''$ are the sums of $Q_{N,i}$ over the odd-numbered subsets $\{S_{2k-1}\}$ and the even-numbered subsets $\{S_{2k}\}$, respectively. Then, for any $\eta > 0$,
\[
\Pr(R_2 > \eta) \le \Pr\Big(\max_{1\le j\le L_N}|W_N'(\bar w_{j\epsilon})| > \eta/2\Big) + \Pr\Big(\max_{1\le j\le L_N}|W_N''(\bar w_{j\epsilon})| > \eta/2\Big)
\le 2 L_N \sup_{w\in\Omega_W}\Pr(|W_N'(w)| > \eta/2). \tag{B.45}
\]
Next, we bound $\Pr(|W_N'(w)| > \eta/2)$ by applying Lemma E.3 and approximating the odd-numbered $\{V_N(2k-1)\}$ series by independent random variables. Enlarging the probability space if necessary, introduce a sequence $\{U_1, U_2, \dots\}$ of mutually independent uniform $[0, 1]$ random variables, independent of $\{V_N(2k-1)\}$. Define $V_N^*(0) = 0$ and $V_N^*(1) = V_N(1)$. Then, by Lemma E.3, for each $k \ge 2$ there is a random variable $V_N^*(2k-1)$ that is a measurable function of $\{V_N(1), V_N(3), \dots, V_N(2k-1), U_k\}$ satisfying the three conditions below:

(a) $V_N^*(2k-1)$ is independent of $\{V_N(1), V_N(3), \dots, V_N(2k-3)\}$;

(b) $V_N^*(2k-1)$ has the same distribution as $V_N(2k-1)$;

(c) for any $\mu$ such that $0 < \mu \le \|V_N(2k-1)\|_\infty < \infty$,
\[
\Pr\big(|V_N^*(2k-1) - V_N(2k-1)| > \mu\big) \le 18\big(\|V_N(2k-1)\|_\infty/\mu\big)^{1/2} \sup|\Pr(AB) - \Pr(A)\Pr(B)|, \tag{B.46}
\]
where the inequality follows by setting the $\gamma$ in Lemma E.3 to infinity, and the supremum is over all possible sets $A$ and $B$ in the $\sigma$-fields of events generated by $\{V_N(1), V_N(3), \dots, V_N(2k-3)\}$ and by $V_N(2k-1)$, respectively. The construction of $V_N^*(2k-1)$ guarantees that $V_N^*(1), V_N^*(3), \dots, V_N^*(2k-1)$ are mutually independent of each other, based on condition (a) above. Up to here, we have established the approximation of the dependent random sequence $\{V_N(2k-1)\}$ by the independent one $\{V_N^*(2k-1)\}$.

Step 2.2. Without loss of generality, let $q_N$ be even. Then
\[
\Pr(|W_N'(w)| > \eta/2)
= \Pr\Big(\Big|\sum_{k=1}^{q_N/2}[V_N(2k-1) - V_N^*(2k-1)] + \sum_{k=1}^{q_N/2} V_N^*(2k-1)\Big| > \eta/2\Big)
\le \Pr\Big(\Big|\sum_{k=1}^{q_N/2} V_N^*(2k-1)\Big| > \eta/4\Big) + \Pr\Big(\Big|\sum_{k=1}^{q_N/2}[V_N(2k-1) - V_N^*(2k-1)]\Big| > \eta/4\Big)
:= R_4(w) + R_5(w). \tag{B.47}
\]
First, we bound $R_4(w)$. Denote $r_i = |\Delta(i, N)|$ and $\bar r_N = \sup_{1\le i\le N} r_i$. Since $\kappa(\cdot)$ is bounded, let $\sup_{w^c\in\Omega_{W^c}}|\prod_{q=1}^Q \kappa(v_q)| = A_1$ for some constant $A_1 > 0$. Then, by construction,
\[
|Q_{N,i}(w)| \le 2 A_1 (N h^Q)^{-1}, \qquad |V_N(k)| \le r_k\, 2 A_1 (N h^Q)^{-1} \le \bar r_N\, 2 A_1 (N h^Q)^{-1}. \tag{B.48}
\]
Let $\lambda_N = C[N h^Q \ln(N)]^{1/2}$; then, for $N$ large enough and by choosing $C$ properly,
\[
\lambda_N |V_N(k)| \le 2 C A_1 \bar r_N \Big(\frac{\ln(N)}{N h^Q}\Big)^{1/2} \le 1/2,
\]
since $\bar r_N[\ln(N)/(N h^Q)]^{1/2} = O(1)$ by Assumption 5.2. By the inequality $\exp(x) \le 1 + x + x^2$ for $|x| \le 1/2$, we can get
\[
\exp(\pm\lambda_N V_N(2k-1)) \le 1 \pm \lambda_N V_N(2k-1) + \lambda_N^2 V_N(2k-1)^2.
\]
Thus, since $E[\lambda_N V_N(2k-1)] = 0$ and $V_N^*(2k-1)$ and $V_N(2k-1)$ have the same distribution,
\[
E[\exp(\pm\lambda_N V_N^*(2k-1))] = E[\exp(\pm\lambda_N V_N(2k-1))] \le 1 + \lambda_N^2 E[V_N(2k-1)^2]. \tag{B.49}
\]
Moreover, because $1 + x \le \exp(x)$ for $x \ge 0$, letting $x = \lambda_N^2 E[V_N(2k-1)^2]$ gives
\[
E[\exp(\pm\lambda_N V_N^*(2k-1))] \le \exp\big(\lambda_N^2 E[V_N(2k-1)^2]\big) = \exp\big(\lambda_N^2 E[V_N^*(2k-1)^2]\big). \tag{B.50}
\]
From the Markov inequality, for any generic random variable $X$ and constants $c$ and $a > 0$, we have $\Pr(X > c) \le E[\exp(aX)]/\exp(ac)$. Consequently, based on the independence of $\{V_N^*(2k-1)\}_{k=1}^{q_N/2}$ and (B.50),
\[
R_4(w) = \Pr\Big(\sum_{k=1}^{q_N/2} V_N^*(2k-1) > \eta/4\Big) + \Pr\Big(-\sum_{k=1}^{q_N/2} V_N^*(2k-1) > \eta/4\Big)
\le \Big\{E\Big[\exp\Big(\lambda_N\sum_{k=1}^{q_N/2} V_N^*(2k-1)\Big)\Big] + E\Big[\exp\Big(-\lambda_N\sum_{k=1}^{q_N/2} V_N^*(2k-1)\Big)\Big]\Big\}\Big/\exp(\lambda_N\eta/4)
= \Big\{\prod_{k=1}^{q_N/2} E[\exp(\lambda_N V_N^*(2k-1))] + \prod_{k=1}^{q_N/2} E[\exp(-\lambda_N V_N^*(2k-1))]\Big\}\Big/\exp(\lambda_N\eta/4)
\le 2\exp\Big(-\lambda_N\eta/4 + \lambda_N^2\sum_{k=1}^{q_N/2} E[V_N^*(2k-1)^2]\Big), \tag{B.51}
\]
where the first inequality is obtained by letting $a = \lambda_N$ and $c = \eta/4$. Because $\{V_N(2k-1)\}$ and $\{V_N^*(2k-1)\}$ have identical distributions and $V_N(k) = \sum_{i\in S_k} Q_{N,i}$,
\[
\sum_{k=1}^{q_N/2} E[V_N^*(2k-1)^2] = \sum_{k=1}^{q_N/2} E[V_N(2k-1)^2] = \sum_{k=1}^{q_N/2}\sum_{i,j\in S_{2k-1}}\mathrm{Cov}(Q_{N,i}, Q_{N,j}).
\]
Given that the density $f_{W_i^c}$ is uniformly bounded (Assumption 5.2 (b)), there exists a constant $A_2$ such that $|f_{W_i^c}| < A_2$. Then, because $Q_{N,i} = N^{-1}\{K(W_i^c, w^c)\mathbb{1}[W_i^d = w^d] - E[K(W_i^c, w^c)\mathbb{1}[W_i^d = w^d]]\}$,
\[
\mathrm{Var}[Q_{N,i}] = E[Q_{N,i}^2] \le \frac{1}{N^2} E[K(W_i^c, w^c)^2]
= \frac{1}{N^2 h^{2Q}}\int \prod_{q=1}^Q \kappa^2\Big(\frac{w_q^{c*} - w_q^c}{h}\Big) f_{W_i^c}(w^{c*})\, dw^{c*}
\le \frac{A_2}{N^2 h^Q}\prod_{q=1}^Q \int \kappa^2(v_q)\, dv_q = \frac{A_3}{N^2 h^Q}, \tag{B.52}
\]
with $A_3 = A_2 K_2^Q$ and $A_3 < \infty$, since $\int \kappa^2(v)\, dv = K_2 < \infty$. Recall that $r_i = |\Delta(i, N)|$. By the Cauchy-Schwarz inequality and (B.52),
\[
\Big|\sum_{k=1}^{q_N/2}\sum_{i,j\in S_{2k-1}}\mathrm{Cov}(Q_{N,i}, Q_{N,j})\Big|
\le \sum_{k=1}^{q_N/2}\sum_{i,j\in S_{2k-1}}|\mathrm{Cov}(Q_{N,i}, Q_{N,j})|
\le \sum_{k=1}^{q_N/2}\sum_{i,j\in S_{2k-1}}\mathrm{Var}[Q_{N,i}]
\le \frac{A_3}{N^2 h^Q}\sum_{k=1}^{q_N/2}|S_{2k-1}|(|S_{2k-1}| - 1);
\]
substituting $\sum_{k=1}^{q_N/2}|S_{2k-1}| \le N$ and $|S_{2k-1}| \le r_{i_{2k-1}}$ into the above inequality,
\[
\Big|\sum_{k=1}^{q_N/2}\sum_{i,j\in S_{2k-1}}\mathrm{Cov}(Q_{N,i}, Q_{N,j})\Big| \le \frac{A_3}{N^2 h^Q}\Big(\sum_{k=1}^{q_N/2} r_{i_{2k-1}}^2 + N\Big) = \frac{A_4}{N h^Q}, \tag{B.53}
\]
for some constant $A_4 > 0$, because $\sum_{k=1}^{q_N/2} r_{i_{2k-1}}^2 \le \sum_{k=1}^{q_N} r_{i_k}^2 \le \sum_{i=1}^N |\Delta(i, N)|^2 = O(N)$ (Assumption 5.2). Given (B.53), (B.51) becomes
\[
R_4(w) \le 2\exp\Big(-\frac{\lambda_N\eta}{4} + \lambda_N^2\frac{A_4}{N h^Q}\Big) = 2\exp\Big(-\frac{\lambda_N\eta}{4} + A_5\ln(N)\Big). \tag{B.54}
\]
Let $\eta = 4 A_6[\ln(N)/(N h^Q)]^{1/2}$ for some constant $A_6 > 0$. Then $\lambda_N\eta = 4 C A_6\ln(N)$, and we can bound $R_4(w)$ uniformly as
\[
\sup_{w\in\Omega_W} R_4(w) \le 2\exp\big(-(C A_6 - A_5)\ln(N)\big) = 2 N^{-\alpha_0}, \tag{B.55}
\]
where we choose $A_6$ large enough that $\alpha_0 = C A_6 - A_5 > 0$.

Finally, we deal with $R_5(w)$. Let $B_{2k-3} \in \sigma\{V_N(1), V_N(3), \dots, V_N(2k-3)\}$, $B'_{2k-1} \in \sigma\{V_N(2k-1)\}$, and $\alpha_{2k-1} = \sup_{B_{2k-3}, B'_{2k-1}}|\Pr(B_{2k-3}, B'_{2k-1}) - \Pr(B_{2k-3})\Pr(B'_{2k-1})|$. Then
\[
R_5(w) = \Pr\Big(\Big|\sum_{k=1}^{q_N/2}[V_N(2k-1) - V_N^*(2k-1)]\Big| > \eta/4\Big)
\le \sum_{k=1}^{q_N/2}\Pr\Big(|V_N(2k-1) - V_N^*(2k-1)| > \frac{\eta}{2 q_N}\Big)
\le \sum_{k=1}^{q_N/2} 18\Big(\frac{2 q_N\|V_N(2k-1)\|_\infty}{\eta}\Big)^{1/2}\alpha_{2k-1}. \tag{B.56}
\]
Furthermore, applying (B.48) and $\eta = 4 A_6[\ln(N)/(N h^Q)]^{1/2}$ to the above inequality,
\[
R_5(w) \le \sum_{k=1}^{q_N/2} 18\Big(\frac{q_N A_7 r_{i_{2k-1}}}{\eta N h^Q}\Big)^{1/2}\alpha_{2k-1}
\le A_8\Big(\frac{q_N \bar r_N}{[\ln(N) N h^Q]^{1/2}}\Big)^{1/2}\sum_{k=1}^{q_N/2}\alpha_{2k-1}
\le A_8\Big(\frac{N}{\ln(N)}\Big)^{1/2}\sum_{k=1}^{q_N/2}\alpha_{2k-1} \tag{B.57}
\]
uniformly in $w$, for some constant $A_8 > 0$, where the last line is due to $\bar r_N = O([N h^Q/\ln(N)]^{1/2})$ and $q_N \le N$. Now, substituting (B.55) and (B.57) into (B.47),
\[
\sup_{w\in\Omega_W}\Pr(|W_N'(w)| > \eta/2) \le 2 N^{-\alpha_0} + A_8\Big(\frac{N}{\ln(N)}\Big)^{1/2}\sum_{k=1}^{q_N/2}\alpha_{2k-1},
\]
which, together with (B.45), further implies that
\[
\Pr(R_2 > \eta) \le 4 L_N N^{-\alpha_0} + 2 A_8 L_N\Big(\frac{N}{\ln(N)}\Big)^{1/2}\sum_{k=1}^{q_N/2}\alpha_{2k-1}. \tag{B.58}
\]
Let $l_N = [\ln(N) h^{Q+2}/N]^{1/2} = \eta h^{Q+1} \to 0$; then $L_N = 1/l_N^Q = 1/[\eta h^{Q+1}]^Q \to \infty$ as $N \to \infty$. By properly choosing $\alpha_0$, we can obtain that $L_N N^{-\alpha_0}$ is summable, i.e. $\sum_{N=1}^\infty L_N N^{-\alpha_0} < \infty$. In addition, by Assumption 5.3, $L_N(N/\ln(N))^{1/2}\sum_{k=1}^{q_N/2}\alpha_{2k-1}$ is also summable. It then follows from the Borel-Cantelli lemma that
\[
R_2 = O(\eta) = O\Big(\Big[\frac{\ln(N)}{N h^Q}\Big]^{1/2}\Big) \quad \text{almost surely}. \tag{B.59}
\]
Together with (B.41) and (B.44), we arrive at the conclusion that
\[
\sup_{w\in\Omega_W}\big|\hat f_{W_i}(w) - f_{W_i}(w)\big| = O_p\big([\ln(N)/(N h^Q)]^{1/2} + h^2\big). \tag{B.60}
\]

Proof of Corollary 5.3. We prove the desired result in two steps. Step 1 establishes the uniform convergence of $\hat f_{|\mathcal{N}_i| \mid Z_i, |\mathcal{N}^*_i|}$. Step 2 completes the proof by establishing the uniform convergence of $\hat f_{S^*_i, |\mathcal{N}^*_i| \mid D_i, S_i, Z_i, |\mathcal{N}_i|}$.

Step 1. From (B.22) and its transposed counterpart, we know that $E'_{|\tilde{N}|, |N|, Y|Z} \times (F'_{|\tilde{N}|, |N||Z})^{-1} = F_{|N||Z, |N^*|} \times T_{Y|Z, |N^*|} \times F^{-1}_{|N||Z, |N^*|}$. Denote $B(\gamma) := E'_{|\tilde{N}|, |N|, Y|Z} \times (F'_{|\tilde{N}|, |N||Z})^{-1}$, and let $\lambda(\gamma)$ and $\psi(\gamma)$ represent the eigenvalues and eigenvectors of $B(\gamma)$. Then
\[
\big(B(\gamma) - \lambda(\gamma) I_{K_{|\mathcal{N}|}}\big)\psi(\gamma) = 0.
\]
Furthermore, recall that $T_{Y|Z, |N^*|}$ is a diagonal matrix with all diagonal entries strictly positive. It then follows from the eigendecomposition that, for the eigenvalue $\lambda(\gamma) = E[Y_i \mid Z_i, |\mathcal{N}^*_i| = n^*]$, the corresponding eigenvector is $\psi(\gamma) = [f_{|\mathcal{N}_i| \mid Z_i, |\mathcal{N}^*_i| = n^*}(1), \dots, f_{|\mathcal{N}_i| \mid Z_i, |\mathcal{N}^*_i| = n^*}(K_{|\mathcal{N}|})]'$. Andrew et al. (1993) show the existence of a neighborhood of $\gamma_0$ in the parameter space, denoted $M$, such that for any $\gamma \in M$ there exist an eigenvalue function $\lambda(\gamma)$ and an eigenvector function $\psi(\gamma)$ that are both analytic functions of $\gamma$. Given the uniform convergence of $\hat\gamma_N$ to $\gamma_0$ proved in Theorem 5.2, we only need to consider the convergence of $\psi(\gamma)$ over a small neighborhood of $\gamma_0$ such that $\|\gamma - \gamma_0\|_\infty \le \eta$ with $\eta = o(1)$. The rest of the argument is exactly the same as the proof of Lemma 3 in Hu (2008), and is therefore omitted to save space. Let $\hat\psi_N := \psi(\hat\gamma_N)$ and $\psi_0 := \psi(\gamma_0)$; then we can show the uniform convergence
\[
\sup_{\|\hat\gamma_N - \gamma_0\|_\infty \le \eta}\big\|\hat\psi_N - \psi_0\big\|_\infty = O_p\big(\|\hat\gamma_N - \gamma_0\|_\infty\big), \qquad
\sup_{\|\hat\gamma_N - \gamma_0\|_\infty \le \eta}\Big\|\hat\psi_N - \psi_0 - \frac{\partial\psi(\gamma_0)}{\partial\gamma'}(\hat\gamma_N - \gamma_0)\Big\|_\infty = O_p\big(\|\hat\gamma_N - \gamma_0\|_\infty^2\big).
\]

Step 2. Again, because of the uniform convergence of $\hat\gamma_N$, in Step 2 we consider only a small neighborhood of $\gamma_0$. Denote $\varphi = (\varphi_1, \dots, \varphi_6)'$, where each element represents one probability distribution on the right-hand side of equation (3):
\[
\varphi_1 = f_{S_i \mid S^*_i, Z_i, |\mathcal{N}^*_i|, |\mathcal{N}_i|}, \quad \varphi_2 = f_{S^*_i \mid Z_i, |\mathcal{N}^*_i|}, \quad \varphi_3 = f_{|\mathcal{N}_i| \mid Z_i, |\mathcal{N}^*_i|}, \quad \varphi_4 = f_{|\mathcal{N}^*_i| \mid Z_i}, \quad \varphi_5 = f_{S_i \mid Z_i, |\mathcal{N}_i|}, \quad \varphi_6 = f_{|\mathcal{N}_i| \mid Z_i},
\]
where in fact $\varphi_3 = \psi$. Given Proposition 4.2, $\phi = \phi(\varphi) = \varphi_1\varphi_2\varphi_3\varphi_4/(\varphi_5\varphi_6)$, which is a twice continuously differentiable function of $\varphi$ by Assumption 5.2. Its estimator is constructed as $\hat\phi_N = \phi(\hat\varphi_N)$, with true value $\phi_0 = \phi(\varphi_0)$. Let the true value of $\varphi$ be $\varphi_0 = \varphi(\gamma_0, \psi_0) = (\varphi_{01}, \dots, \varphi_{06})'$ and let its plug-in estimator be $\hat\varphi_N = \varphi(\hat\gamma_N, \hat\psi_N) = (\hat\varphi_{1,N}, \dots, \hat\varphi_{6,N})'$. Then
\[
\frac{d\phi(\varphi)}{d\varphi'} = \Big(\frac{\varphi_2\varphi_3\varphi_4}{\varphi_5\varphi_6},\ \frac{\varphi_1\varphi_3\varphi_4}{\varphi_5\varphi_6},\ \frac{\varphi_1\varphi_2\varphi_4}{\varphi_5\varphi_6},\ \frac{\varphi_1\varphi_2\varphi_3}{\varphi_5\varphi_6},\ -\frac{\varphi_1\varphi_2\varphi_3\varphi_4}{\varphi_5^2\varphi_6},\ -\frac{\varphi_1\varphi_2\varphi_3\varphi_4}{\varphi_5\varphi_6^2}\Big). \tag{B.61}
\]
Recall that there exists $\epsilon_1 > 0$ such that $\varphi_{06}$ is uniformly bounded from below by $\epsilon_1$, based on the condition stated in Corollary 5.3. We also know that $\varphi_{05} = C^s_n f_D(1)^s f_D(0)^{n-s} > \epsilon'$ uniformly over $\Omega_W$ for some constant $\epsilon'$. Moreover, since $\varphi_1$ to $\varphi_4$ are all conditional probabilities of discrete random variables, their true values $\varphi_{01}$ to $\varphi_{04}$ all lie in $[0, 1]$. Within an $o(1)$ neighborhood of $\gamma_0$, by the uniform convergence of $\hat\psi_N$ in Step 1, for a large enough sample size $\hat\varphi_{1,N}$ to $\hat\varphi_{4,N}$ are also uniformly bounded from above, and $\hat\varphi_{5,N}$ and $\hat\varphi_{6,N}$ are uniformly bounded from below. Therefore, any intermediate value $\tilde\varphi$ between $\varphi_0$ and $\hat\varphi_N$ is uniformly bounded. Thus, for the derivative in (B.61) evaluated at $\tilde\varphi$, there exists some constant $C_\phi$ such that $\|d\phi(\tilde\varphi)/d\varphi'\| \le C_\phi$ uniformly over $\Omega_W$. By the mean value theorem, we then have
\[
\sup_{\|\hat\gamma_N - \gamma_0\|_\infty < \eta}\|\hat\phi_N - \phi_0\|_\infty
= \sup_{\|\hat\gamma_N - \gamma_0\|_\infty < \eta}\|\phi(\hat\varphi_N) - \phi(\varphi_0)\|_\infty
\le \sup_{\|\hat\gamma_N - \gamma_0\|_\infty < \eta}\Big\|\frac{d\phi(\tilde\varphi)}{d\varphi'}\Big\|_\infty\|\hat\varphi_N - \varphi_0\|_\infty
\le C_\phi \sup_{\|\hat\gamma_N - \gamma_0\|_\infty < \eta}\|\hat\varphi_N - \varphi_0\|_\infty, \tag{B.62}
\]
where $\tilde\varphi$ is an intermediate vector between $\varphi_0$ and $\hat\varphi_N$. Besides, because $\hat\varphi_N = \varphi(\hat\gamma_N, \hat\psi_N)$ and $\varphi_0 = \varphi(\gamma_0, \psi_0)$, and $\varphi(\gamma, \psi)$ is continuously differentiable in $(\gamma, \psi)$ with uniformly bounded first-order derivative, (B.62) can be further bounded by
\[
\sup_{\|\hat\gamma_N - \gamma_0\|_\infty < \eta}\|\hat\phi_N - \phi_0\|_\infty
\le C'\Big(\sup_{\|\hat\gamma_N - \gamma_0\|_\infty < \eta}\|\hat\psi_N - \psi_0\|_\infty + \|\hat\gamma_N - \gamma_0\|_\infty\Big) = O_p\big(\|\hat\gamma_N - \gamma_0\|_\infty\big),
\]
for some constant $C' > 0$, where the last equality follows from Step 1. Furthermore, recall that $\phi = \phi(\psi)$, where $\psi = \psi(\gamma, \varphi)$ and $\varphi = \varphi(\gamma)$; thus $\phi$ can be regarded as a function of $\gamma$ only. Applying similar arguments, we can also obtain
\[
\sup_{w\in\Omega_W}\Big\|\hat\phi_N - \phi_0 - \frac{\partial\phi}{\partial\gamma}(\hat\gamma_N - \gamma_0)\Big\|_\infty = O_p\big(\|\hat\gamma_N - \gamma_0\|_\infty^2\big). \tag{B.63}
\]
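The derivative formula (B.61) that drives the plug-in bound (B.62) is easy to verify numerically; a minimal gradient check at an arbitrary interior point follows, with all values illustrative.

```python
# Numerical check of (B.61): gradient of phi(varphi) = v1 v2 v3 v4 / (v5 v6).
import numpy as np

phi = lambda v: v[0] * v[1] * v[2] * v[3] / (v[4] * v[5])
v0 = np.array([0.3, 0.4, 0.5, 0.6, 0.7, 0.8])            # arbitrary interior point
analytic = np.array([phi(v0) / v0[k] * (1 if k < 4 else -1) for k in range(6)])
eps = 1e-6
numeric = np.array([(phi(v0 + eps * np.eye(6)[k]) - phi(v0 - eps * np.eye(6)[k])) / (2 * eps)
                    for k in range(6)])
print(np.abs(analytic - numeric).max())                   # ~ 1e-10
```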
Proof of Theorem 5.4. From $m(x; \theta, \phi) = \sum_{j=1}^{K_T} m^*(x_j; \theta) f_{T^*_i \mid X_i = x}(t_j)$ with $x_j = (d, s_j, z, n_j)$ and $t_j = (s_j, n_j)$, we can write
\[
L_N(\theta, \hat\phi_N) - L_N(\theta, \phi_0)
= \frac{1}{N}\sum_{i=1}^N \tau_i\Big\{\big[Y_i - m(X_i; \theta, \hat\phi_N)\big]^2 - \big[Y_i - m(X_i; \theta, \phi_0)\big]^2\Big\}
= \frac{1}{N}\sum_{i=1}^N \tau_i\big[m(X_i; \theta, \hat\phi_N) - m(X_i; \theta, \phi_0)\big]^2
- \frac{2}{N}\sum_{i=1}^N \tau_i\big[Y_i - m(X_i; \theta, \phi_0)\big]\big[m(X_i; \theta, \hat\phi_N) - m(X_i; \theta, \phi_0)\big]
= \frac{1}{N}\sum_{i=1}^N \tau_i\Big\{\sum_{j=1}^{K_T} m^*(x_{i,j}; \theta)\big[\hat f_{T^*_i \mid X_i}(t_j) - f_{T^*_i \mid X_i}(t_j)\big]\Big\}^2
- \frac{2}{N}\sum_{i=1}^N \tau_i\big[Y_i - m(X_i; \theta, \phi_0)\big]\Big\{\sum_{j=1}^{K_T} m^*(x_{i,j}; \theta)\big[\hat f_{T^*_i \mid X_i}(t_j) - f_{T^*_i \mid X_i}(t_j)\big]\Big\}, \tag{B.64}
\]
where $x_{i,j} = (D_i, s_j, Z_i, n_j)$. Because of the uniform convergence of $\hat\gamma_N$, we only need to focus on a small neighborhood of $\gamma_0$. By the boundedness of $\tau(x)$ and the Cauchy-Schwarz inequality,
\[
\big|L_N(\theta, \hat\phi_N) - L_N(\theta, \phi_0)\big|
\le \frac{C}{N}\sum_{i=1}^N\Big\{\sum_{j=1}^{K_T} m^*(x_{i,j}; \theta)^2\Big\}\Big\{\sum_{j=1}^{K_T}\big[\hat f_{T^*_i \mid X_i}(t_j) - f_{T^*_i \mid X_i}(t_j)\big]^2\Big\}
+ \frac{2}{N}\sum_{i=1}^N\sum_{j=1}^{K_T}\tau_i\big|Y_i - m(X_i; \theta, \phi_0)\big|\,|m^*(x_{i,j}; \theta)|\,\big|\hat f_{T^*_i \mid X_i}(t_j) - f_{T^*_i \mid X_i}(t_j)\big|
\le C\Big(\sup_{\|\hat\gamma_N - \gamma_0\|_\infty \le \eta}\|\hat\phi_N - \phi_0\|_\infty\Big)^2\frac{1}{N}\sum_{j=1}^{K_T}\sum_{i=1}^N m^*(x_{i,j}; \theta)^2
+ 2\Big(\sup_{\|\hat\gamma_N - \gamma_0\|_\infty \le \eta}\|\hat\phi_N - \phi_0\|_\infty\Big)\sum_{j=1}^{K_T}\Big[\frac{1}{N}\sum_{i=1}^N\tau_i\big|Y_i - m(X_i; \theta, \phi_0)\big|^2\Big]^{1/2}\Big[\frac{1}{N}\sum_{i=1}^N|m^*(x_{i,j}; \theta)|^2\Big]^{1/2}. \tag{B.65}
\]
Because $(D_i, Z_i)$ is i.i.d., $x_{i,j} = (D_i, s_j, Z_i, n_j)$ is also i.i.d. for any given $j = 1, \dots, K_T$. Then, by Assumption 5.4 and the uniform convergence for i.i.d. samples (Lemma 2.4 of Newey and McFadden, 1994),
\[
\sup_{\theta\in\Theta}\frac{1}{N}\sum_{i=1}^N m^*(x_{i,j}; \theta)^2
\le \sup_{\theta\in\Theta}\Big|\frac{1}{N}\sum_{i=1}^N m^*(x_{i,j}; \theta)^2 - E[m^*(x_{i,j}; \theta)^2]\Big| + \sup_{\theta\in\Theta}\big|E[m^*(x_{i,j}; \theta)^2]\big| = O_p(1), \tag{B.66}
\]
because $\sup_{\theta\in\Theta} E[m^*(x_{i,j}; \theta)^2] \le E[h(x_{i,j})^2] < \infty$ by Assumption 5.4. Similarly, the uniform convergence of data with a dependency-neighborhood structure in Lemma E.2 leads to
\[
\sup_{\theta\in\Theta}\frac{1}{N}\sum_{i=1}^N\tau_i\big|Y_i - m(X_i; \theta, \phi_0)\big|^2 = O_p(1), \tag{B.67}
\]
because of Assumption 5.1 and Assumption 5.4 (v). Hence, we can conclude that
\[
\sup_{\theta\in\Theta}\big|L_N(\theta, \hat\phi_N) - L_N(\theta, \phi_0)\big| = O_p\Big(\sup_{\|\hat\gamma_N - \gamma_0\|_\infty \le \eta}\|\hat\phi_N - \phi_0\|_\infty\Big) = O_p\big(\|\hat\gamma_N - \gamma_0\|_\infty\big). \tag{B.68}
\]
Next, we show the uniform convergence of $L_N(\theta, \phi_0)$ to $L(\theta, \phi_0)$ by verifying the uniform law of large numbers for dependent data in Lemma E.2. First, conditions (i), (ii), (iii) and (iv)-(c) of Lemma E.2 are directly satisfied by Assumption 5.4 (i), (iii) and (v). Second, (iv)-(a) of Lemma E.2 holds because of Assumption 5.1. In addition, we have $N^{-1}\sum_{i=1}^N|\Delta(i, N)| \le N^{-1}\sum_{i=1}^N|\Delta(i, N)|^2 = O(1)$ as in Assumption 5.2. Hence, all required conditions of Lemma E.2 are satisfied, implying
\[
\sup_{\theta\in\Theta}\big|L_N(\theta, \phi_0) - L(\theta, \phi_0)\big|
= \sup_{\theta\in\Theta}\Big|\frac{1}{N}\sum_{i=1}^N\tau_i\big[Y_i - m(X_i; \theta, \phi_0)\big]^2 - E\Big[\tau_i\big[Y_i - m(X_i; \theta, \phi_0)\big]^2\Big]\Big| = o_p(1). \tag{B.69}
\]
Then, making use of (B.68), (B.69) and Theorem 5.2, we can bound
\[
\sup_{\theta\in\Theta}\big|L(\theta, \phi_0) - L_N(\theta, \hat\phi_N)\big|
= \sup_{\theta\in\Theta}\big|L(\theta, \phi_0) - L_N(\theta, \phi_0) + L_N(\theta, \phi_0) - L_N(\theta, \hat\phi_N)\big|
\le \sup_{\theta\in\Theta}\big|L(\theta, \phi_0) - L_N(\theta, \phi_0)\big| + \sup_{\theta\in\Theta}\big|L_N(\theta, \phi_0) - L_N(\theta, \hat\phi_N)\big|
= \sup_{\theta\in\Theta}\big|L(\theta, \phi_0) - L_N(\theta, \phi_0)\big| + O_p\big(\|\hat\gamma_N - \gamma_0\|_\infty\big) = o_p(1). \tag{B.70}
\]
As assumed in Assumption 5.4, $\theta_0$ uniquely minimizes the objective function $L(\theta, \phi_0)$ over $\Theta$. Then, for any $\delta > 0$, there exists an $\epsilon > 0$ such that $\|\hat\theta_N - \theta_0\| > \delta$ implies $L(\hat\theta_N, \phi_0) - L(\theta_0, \phi_0) > \epsilon$. Thus, by the definition of $\hat\theta_N$,
\[
\Pr\big(\|\hat\theta_N - \theta_0\| > \delta\big)
\le \Pr\big(L(\hat\theta_N, \phi_0) - L(\theta_0, \phi_0) > \epsilon\big)
\le \Pr\big(L(\hat\theta_N, \phi_0) - L_N(\hat\theta_N, \hat\phi_N) + L_N(\hat\theta_N, \hat\phi_N) - L(\theta_0, \phi_0) > \epsilon\big)
\le \Pr\big(L(\hat\theta_N, \phi_0) - L_N(\hat\theta_N, \hat\phi_N) + L_N(\theta_0, \hat\phi_N) - L(\theta_0, \phi_0) > \epsilon\big)
\le \Pr\Big(2\sup_{\theta\in\Theta}\big|L(\theta, \phi_0) - L_N(\theta, \hat\phi_N)\big| > \epsilon\Big) \to 0, \tag{B.71}
\]
where the last line is due to (B.70). It then follows from (B.71) that $\|\hat\theta_N - \theta_0\| = o_p(1)$.

Proof of Lemma 5.5. (a) Based on Theorem 5.2 and Corollary 5.3, we know that $\hat\phi_N = \phi(\hat\gamma_N)$ and $\hat\gamma_N \overset{p}{\to} \gamma_0$. Hence, in what follows, we can establish the consistency of $N^{-1}\sum_{i=1}^N \partial g(W_i; \tilde\theta_N, \hat\phi_N)/\partial\theta'$ in a small neighborhood of $\gamma_0$. For a small constant $\eta > 0$,
0, by the triangle inequality,
\[
\sup_{\|\hat\gamma_N-\gamma_0\|_\infty<\eta}\left\|\frac{1}{N}\sum_{i=1}^N\frac{\partial g(W_i;\tilde\theta_N,\hat\phi_N)}{\partial\theta'}-E\!\left[\frac{\partial g(W_i;\theta_0,\phi_0)}{\partial\theta'}\right]\right\|
\le \sup_{\|\hat\gamma_N-\gamma_0\|_\infty<\eta}\left\|\frac{1}{N}\sum_{i=1}^N\frac{\partial g(W_i;\tilde\theta_N,\hat\phi_N)}{\partial\theta'}-\frac{1}{N}\sum_{i=1}^N\frac{\partial g(W_i;\tilde\theta_N,\phi_0)}{\partial\theta'}\right\|
+\left\|\frac{1}{N}\sum_{i=1}^N\frac{\partial g(W_i;\tilde\theta_N,\phi_0)}{\partial\theta'}-\frac{1}{N}\sum_{i=1}^N\frac{\partial g(W_i;\theta_0,\phi_0)}{\partial\theta'}\right\|
+\left\|\frac{1}{N}\sum_{i=1}^N\frac{\partial g(W_i;\theta_0,\phi_0)}{\partial\theta'}-E\!\left[\frac{\partial g(W_i;\theta_0,\phi_0)}{\partial\theta'}\right]\right\|
:= H_1+H_2+H_3. \quad\text{(B.72)}
\]
Given (B.72), it suffices to show that $H_1$, $H_2$ and $H_3$ are all $o_p(1)$. In what follows, we divide the rest of the proof into three steps.

Step 1. First, consider $H_1$. By the definition of $g(W_i;\theta,\phi)$, and writing $m(\phi)$ as shorthand for $m(X_i;\tilde\theta_N,\phi)$, we have
\[
\frac{1}{N}\sum_{i=1}^N\frac{\partial g(W_i;\tilde\theta_N,\hat\phi_N)}{\partial\theta'}-\frac{1}{N}\sum_{i=1}^N\frac{\partial g(W_i;\tilde\theta_N,\phi_0)}{\partial\theta'}
=\frac{1}{N}\sum_{i=1}^N\tau_i\left\{\big[Y_i-m(\hat\phi_N)\big]\frac{d^2m(\hat\phi_N)}{d\theta\,d\theta'}-\big[Y_i-m(\phi_0)\big]\frac{d^2m(\phi_0)}{d\theta\,d\theta'}\right\}
-\frac{1}{N}\sum_{i=1}^N\tau_i\left[\frac{dm(\hat\phi_N)}{d\theta}\frac{dm(\hat\phi_N)}{d\theta'}-\frac{dm(\phi_0)}{d\theta}\frac{dm(\phi_0)}{d\theta'}\right]. \quad\text{(B.73)}
\]
Making use of the identity $\hat a\hat b-ab=(\hat a-a)b+a(\hat b-b)+(\hat a-a)(\hat b-b)$ and applying it to both terms on the right-hand side of (B.73) gives
\[
\frac{1}{N}\sum_{i=1}^N\frac{\partial g(W_i;\tilde\theta_N,\hat\phi_N)}{\partial\theta'}-\frac{1}{N}\sum_{i=1}^N\frac{\partial g(W_i;\tilde\theta_N,\phi_0)}{\partial\theta'}
=-\frac{1}{N}\sum_{i=1}^N\tau_i\big[m(\hat\phi_N)-m(\phi_0)\big]\frac{d^2m(\phi_0)}{d\theta\,d\theta'}
+\frac{1}{N}\sum_{i=1}^N\tau_i\big[Y_i-m(\phi_0)\big]\left[\frac{d^2m(\hat\phi_N)}{d\theta\,d\theta'}-\frac{d^2m(\phi_0)}{d\theta\,d\theta'}\right]
-\frac{1}{N}\sum_{i=1}^N\tau_i\big[m(\hat\phi_N)-m(\phi_0)\big]\left[\frac{d^2m(\hat\phi_N)}{d\theta\,d\theta'}-\frac{d^2m(\phi_0)}{d\theta\,d\theta'}\right]
-\frac{1}{N}\sum_{i=1}^N\tau_i\left[\frac{dm(\hat\phi_N)}{d\theta}-\frac{dm(\phi_0)}{d\theta}\right]\frac{dm(\phi_0)}{d\theta'}
-\frac{1}{N}\sum_{i=1}^N\tau_i\frac{dm(\phi_0)}{d\theta}\left[\frac{dm(\hat\phi_N)}{d\theta'}-\frac{dm(\phi_0)}{d\theta'}\right]
-\frac{1}{N}\sum_{i=1}^N\tau_i\left[\frac{dm(\hat\phi_N)}{d\theta}-\frac{dm(\phi_0)}{d\theta}\right]\left[\frac{dm(\hat\phi_N)}{d\theta'}-\frac{dm(\phi_0)}{d\theta'}\right]. \quad\text{(B.74)}
\]
Recall that $m(X_i;\theta,\phi)=\sum_{j=1}^{K_T}m^*(x_{i,j};\theta)\,f_{T_i^*|X_i}(t_j)$ and $x_{i,j}=(D_i,s_j,Z_i,n_j)$. Writing $\Delta_j:=\hat f_{T_i^*|X_i}(t_j)-f_{T_i^*|X_i}(t_j)$, we can further rewrite (B.74) as
\[
\frac{1}{N}\sum_{i=1}^N\frac{\partial g(W_i;\tilde\theta_N,\hat\phi_N)}{\partial\theta'}-\frac{1}{N}\sum_{i=1}^N\frac{\partial g(W_i;\tilde\theta_N,\phi_0)}{\partial\theta'}
=-\frac{1}{N}\sum_{i=1}^N\tau_i\Big\{\sum_{j=1}^{K_T}m^*(x_{i,j};\tilde\theta_N)\Delta_j\Big\}\sum_{j=1}^{K_T}\frac{d^2m^*(x_{i,j};\tilde\theta_N)}{d\theta\,d\theta'}f_{T_i^*|X_i}(t_j)
+\frac{1}{N}\sum_{i=1}^N\tau_i\big[Y_i-m(X_i;\tilde\theta_N,\phi_0)\big]\Big\{\sum_{j=1}^{K_T}\frac{d^2m^*(x_{i,j};\tilde\theta_N)}{d\theta\,d\theta'}\Delta_j\Big\}
-\frac{1}{N}\sum_{i=1}^N\tau_i\Big\{\sum_{j=1}^{K_T}m^*(x_{i,j};\tilde\theta_N)\Delta_j\Big\}\times\Big\{\sum_{j=1}^{K_T}\frac{d^2m^*(x_{i,j};\tilde\theta_N)}{d\theta\,d\theta'}\Delta_j\Big\}
-\frac{1}{N}\sum_{i=1}^N\tau_i\Big\{\sum_{j=1}^{K_T}\frac{\partial m^*(x_{i,j};\tilde\theta_N)}{\partial\theta}\Delta_j\Big\}\sum_{j=1}^{K_T}\frac{dm^*(x_{i,j};\tilde\theta_N)}{d\theta'}f_{T_i^*|X_i}(t_j)
-\frac{1}{N}\sum_{i=1}^N\tau_i\sum_{j=1}^{K_T}\frac{dm^*(x_{i,j};\tilde\theta_N)}{d\theta}f_{T_i^*|X_i}(t_j)\Big\{\sum_{j=1}^{K_T}\frac{\partial m^*(x_{i,j};\tilde\theta_N)}{\partial\theta'}\Delta_j\Big\}
-\frac{1}{N}\sum_{i=1}^N\tau_i\Big\{\sum_{j=1}^{K_T}\frac{\partial m^*(x_{i,j};\tilde\theta_N)}{\partial\theta}\Delta_j\Big\}\times\Big\{\sum_{j=1}^{K_T}\frac{\partial m^*(x_{i,j};\tilde\theta_N)}{\partial\theta'}\Delta_j\Big\}. \quad\text{(B.75)}
\]
Because for a $k\times k$ matrix $A=ab'$ with $a,b\in\mathbb{R}^k$ we have $\|A\|=\|a\|\|b\|$, the boundedness of $f_{T_i^*|X_i}$ and (B.75) give
\[
H_1\le C\sup\|\hat\phi_N-\phi_0\|_\infty\,\frac{1}{N}\sum_{j,l=1}^{K_T}\sum_{i=1}^N\big|m^*(x_{i,j};\tilde\theta_N)\big|\left\|\frac{d^2m^*(x_{i,l};\tilde\theta_N)}{d\theta\,d\theta'}\right\|
+C\sup\|\hat\phi_N-\phi_0\|_\infty\,\frac{1}{N}\sum_{j=1}^{K_T}\sum_{i=1}^N\tau_i\big|Y_i-m(X_i;\tilde\theta_N,\phi_0)\big|\left\|\frac{d^2m^*(x_{i,j};\tilde\theta_N)}{d\theta\,d\theta'}\right\|
+C\Big(\sup\|\hat\phi_N-\phi_0\|_\infty\Big)^2\frac{1}{N}\sum_{j,l=1}^{K_T}\sum_{i=1}^N\big|m^*(x_{i,j};\tilde\theta_N)\big|\left\|\frac{d^2m^*(x_{i,l};\tilde\theta_N)}{d\theta\,d\theta'}\right\|
+2C\sup\|\hat\phi_N-\phi_0\|_\infty\,\frac{1}{N}\sum_{j,l=1}^{K_T}\sum_{i=1}^N\left\|\frac{\partial m^*(x_{i,j};\tilde\theta_N)}{\partial\theta}\right\|\left\|\frac{dm^*(x_{i,l};\tilde\theta_N)}{d\theta'}\right\|
+C\Big(\sup\|\hat\phi_N-\phi_0\|_\infty\Big)^2\frac{1}{N}\sum_{j,l=1}^{K_T}\sum_{i=1}^N\left\|\frac{\partial m^*(x_{i,j};\tilde\theta_N)}{\partial\theta}\right\|\left\|\frac{dm^*(x_{i,l};\tilde\theta_N)}{d\theta'}\right\|
:=H_{11}+H_{12}+H_{13}+H_{14}+H_{15}, \quad\text{(B.76)}
\]
where all suprema are taken over $\|\hat\gamma_N-\gamma_0\|_\infty<\eta$. By the Cauchy-Schwarz inequality, we can further bound $H_{11}$ as
\[
H_{11}\le C\sup\|\hat\phi_N-\phi_0\|_\infty\sum_{j,l=1}^{K_T}\left[\frac{1}{N}\sum_{i=1}^N\big|m^*(x_{i,j};\tilde\theta_N)\big|^2\right]^{1/2}\left[\frac{1}{N}\sum_{i=1}^N\left\|\frac{d^2m^*(x_{i,l};\tilde\theta_N)}{d\theta\,d\theta'}\right\|^2\right]^{1/2}
\le O_p(1)\,\sup\|\hat\phi_N-\phi_0\|_\infty=o_p(1), \quad\text{(B.77)}
\]
where the second inequality is due to (B.66) and Lemma E.7, and the last equality is because of Corollary 5.3. For $H_{12}$, it follows again from the Cauchy-Schwarz inequality and Corollary 5.3 that
\[
H_{12}\le o_p(1)\sum_{j=1}^{K_T}\left[\frac{1}{N}\sum_{i=1}^N\tau_i\big|Y_i-m(X_i;\theta_0,\phi_0)\big|^2\right]^{1/2}\left[\frac{1}{N}\sum_{i=1}^N\left\|\frac{d^2m^*(x_{i,j};\tilde\theta_N)}{d\theta\,d\theta'}\right\|^2\right]^{1/2}=o_p(1), \quad\text{(B.78)}
\]
where the last equality is due to the uniform convergence in (B.69) and that proved in Lemma E.7. Given $H_{11}=o_p(1)$, it is apparent that $H_{13}$ is also $o_p(1)$; similarly, once we know that $H_{14}=o_p(1)$, then $H_{15}=o_p(1)$. Again, by the Cauchy-Schwarz inequality and Lemma E.7,
\[
H_{14}\le o_p(1)\sum_{j,l=1}^{K_T}\left[\frac{1}{N}\sum_{i=1}^N\left\|\frac{dm^*(x_{i,j};\tilde\theta_N)}{d\theta}\right\|^2\right]^{1/2}\left[\frac{1}{N}\sum_{i=1}^N\left\|\frac{dm^*(x_{i,l};\tilde\theta_N)}{d\theta'}\right\|^2\right]^{1/2}=o_p(1). \quad\text{(B.79)}
\]
Thus, based on (B.77), (B.78) and (B.79), we can conclude that $H_1=o_p(1)$.

Step 2. Consider the term inside the norm of $H_2$:
\[
\frac{1}{N}\sum_{i=1}^N\frac{\partial g(W_i;\tilde\theta_N,\phi_0)}{\partial\theta'}-\frac{1}{N}\sum_{i=1}^N\frac{\partial g(W_i;\theta_0,\phi_0)}{\partial\theta'}
=\frac{1}{N}\sum_{i=1}^N\tau_i\left\{\big[Y_i-m(X_i;\tilde\theta_N,\phi_0)\big]\frac{\partial^2m(X_i;\tilde\theta_N,\phi_0)}{\partial\theta\,\partial\theta'}-\big[Y_i-m(X_i;\theta_0,\phi_0)\big]\frac{\partial^2m(X_i;\theta_0,\phi_0)}{\partial\theta\,\partial\theta'}\right\}
+\frac{1}{N}\sum_{i=1}^N\tau_i\left[\frac{\partial m(X_i;\tilde\theta_N,\phi_0)}{\partial\theta}\frac{\partial m(X_i;\tilde\theta_N,\phi_0)}{\partial\theta'}-\frac{\partial m(X_i;\theta_0,\phi_0)}{\partial\theta}\frac{\partial m(X_i;\theta_0,\phi_0)}{\partial\theta'}\right]. \quad\text{(B.80)}
\]
Applying again the identity $\hat a\hat b-ab=(\hat a-a)b+a(\hat b-b)+(\hat a-a)(\hat b-b)$ to (B.80) and substituting $m(X_i;\theta,\phi)=\sum_{j=1}^{K_T}m^*(x_{i,j};\theta)f_{T_i^*|X_i}(t_j)$ give us
\[
H_2\le\frac{C}{N}\sum_{i=1}^N\sum_{j,l=1}^{K_T}\big|m^*(x_{i,j};\tilde\theta_N)-m^*(x_{i,j};\theta_0)\big|\left\|\frac{d^2m^*(x_{i,l};\theta_0)}{d\theta\,d\theta'}\right\|
+\frac{C}{N}\sum_{i=1}^N\sum_{j=1}^{K_T}\tau_i\big|Y_i-m(X_i;\theta_0,\phi_0)\big|\left\|\frac{d^2m^*(x_{i,j};\tilde\theta_N)}{d\theta\,d\theta'}-\frac{d^2m^*(x_{i,j};\theta_0)}{d\theta\,d\theta'}\right\|
+\frac{C}{N}\sum_{i=1}^N\sum_{j,l=1}^{K_T}\big|m^*(x_{i,j};\tilde\theta_N)-m^*(x_{i,j};\theta_0)\big|\left\|\frac{d^2m^*(x_{i,l};\tilde\theta_N)}{d\theta\,d\theta'}-\frac{d^2m^*(x_{i,l};\theta_0)}{d\theta\,d\theta'}\right\|
+\frac{2C}{N}\sum_{i=1}^N\sum_{j,l=1}^{K_T}\left\|\frac{dm^*(x_{i,j};\tilde\theta_N)}{d\theta}-\frac{dm^*(x_{i,j};\theta_0)}{d\theta}\right\|\left\|\frac{dm^*(x_{i,l};\theta_0)}{d\theta'}\right\|
+\frac{C}{N}\sum_{i=1}^N\sum_{j,l=1}^{K_T}\left\|\frac{dm^*(x_{i,j};\tilde\theta_N)}{d\theta}-\frac{dm^*(x_{i,j};\theta_0)}{d\theta}\right\|\left\|\frac{dm^*(x_{i,l};\tilde\theta_N)}{d\theta'}-\frac{dm^*(x_{i,l};\theta_0)}{d\theta'}\right\|
:=H_{21}+H_{22}+H_{23}+H_{24}+H_{25}. \quad\text{(B.81)}
\]
By the Cauchy-Schwarz inequality and Lemma E.7, it is easy to show that $H_{21}$ through $H_{25}$ are all $o_p(1)$. Consequently, $H_2=o_p(1)$.

Step 3. Next, consider $H_3$. Let $g_r(W_i;\theta,\phi)$ be the $r$-th element of the column vector $g(W_i;\theta,\phi)$. Then, we can rewrite $H_3$ as
\[
H_3=\left\|\frac{1}{N}\sum_{i=1}^N\frac{\partial g(W_i;\theta_0,\phi_0)}{\partial\theta'}-E\!\left[\frac{\partial g(W_i;\theta_0,\phi_0)}{\partial\theta'}\right]\right\|
=\left\{\sum_{r,q=1}^{d_\theta}\left[\frac{1}{N}\sum_{i=1}^N\left(\frac{\partial g_r(W_i;\theta_0,\phi_0)}{\partial\theta_q}-E\!\left[\frac{\partial g_r(W_i;\theta_0,\phi_0)}{\partial\theta_q}\right]\right)\right]^2\right\}^{1/2}. \quad\text{(B.82)}
\]
Because $E[|\partial g_r(W_i;\theta_0,\phi_0)/\partial\theta_q|^2]<\infty$ as in Assumption 5.5, the variance of $\partial g_r(W_i;\theta_0,\phi_0)/\partial\theta_q$ exists and is finite for all $r,q=1,\dots,d_\theta$. Then, Chebyshev's inequality implies
\[
Pr\left[\left|\frac{1}{N}\sum_{i=1}^N\frac{\partial g_r(W_i;\theta_0,\phi_0)}{\partial\theta_q}-E\!\left[\frac{\partial g_r(W_i;\theta_0,\phi_0)}{\partial\theta_q}\right]\right|>\epsilon\right]
\le\frac{1}{\epsilon^2}Var\!\left[\frac{1}{N}\sum_{i=1}^N\frac{\partial g_r(W_i;\theta_0,\phi_0)}{\partial\theta_q}\right]
=\frac{1}{\epsilon^2N^2}\sum_{i=1}^N\sum_{j\in\Delta(i,N)}Cov\!\left(\frac{\partial g_r(W_i;\theta_0,\phi_0)}{\partial\theta_q},\frac{\partial g_r(W_j;\theta_0,\phi_0)}{\partial\theta_q}\right)+s.o.
\le\frac{C}{\epsilon^2N^2}\sum_{i=1}^N|\Delta(i,N)|+s.o.=O\!\left(\frac{1}{\epsilon^2N}\right),
\]
where the second equality comes from Assumption 5.1, and the last line holds because $\frac{1}{N}\sum_{i=1}^N|\Delta(i,N)|=O(1)$ (Assumption 5.2); choosing $\epsilon=\epsilon_N$ such that $\epsilon_N\to0$ and $\epsilon_N^2N\to\infty$ as $N\to\infty$ then gives
\[
\frac{1}{N}\sum_{i=1}^N\frac{\partial g_r(W_i;\theta_0,\phi_0)}{\partial\theta_q}-E\!\left[\frac{\partial g_r(W_i;\theta_0,\phi_0)}{\partial\theta_q}\right]\overset{p}{\to}0,\quad\text{for all }r,q=1,\dots,d_\theta, \quad\text{(B.83)}
\]
leading to $H_3=o_p(1)$.
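The variance calculation in Step 3 is easy to visualize numerically. Below is a minimal sketch, assuming a simple m-dependent sequence as a stand-in for neighborhood-dependent data (the moving-average design and all names are illustrative, not part of the paper):

```python
import numpy as np

# Minimal sketch: for m-dependent data, |Delta(i, N)| <= 2m + 1, so the
# variance of the sample mean should shrink at rate O(1/N), as in Step 3.
rng = np.random.default_rng(0)
m, reps = 3, 2000
for N in (500, 2000, 8000):
    eps = rng.standard_normal((reps, N + m))
    # X_i = eps_i + ... + eps_{i+m}: each X_i depends on at most 2m+1 neighbors
    X = sum(eps[:, k:k + N] for k in range(m + 1))
    var_of_mean = X.mean(axis=1).var()
    print(N, var_of_mean, N * var_of_mean)  # N * Var stabilizes around a constant
```

Under this design the printed product $N\cdot Var$ settles near a constant, which is exactly the $O(1/N)$ behavior the Chebyshev argument exploits.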
Based on the results in the above three steps, we conclude that the stated result holds.

(b) This proof is analogous to the proof of Theorem 8.1 in Newey and McFadden (1994). All the sufficient conditions are verified in Lemmas E.8, E.9 and E.10. Recall that $\tilde F_W(w)=\frac{1}{N}\sum_{i=1}^N 1[W_i\le w]$ represents the empirical distribution, so that $\int\delta(w)\,d\tilde F_W(w)=\frac{1}{N}\sum_{i=1}^N\delta(W_i)$. By the triangle inequality, we have
\[
\left\|\frac{1}{\sqrt N}\sum_{i=1}^N\big[g(W_i;\theta_0,\hat\phi_N)-g(W_i;\theta_0,\phi_0)-\delta(W_i)\big]\right\|
\le\left\|\frac{1}{\sqrt N}\sum_{i=1}^N\big[g(W_i;\theta_0,\hat\phi_N)-g(W_i;\theta_0,\phi_0)-G(W_i;\hat\gamma_N-\gamma_0)\big]\right\|
+\left\|\frac{1}{\sqrt N}\sum_{i=1}^N\left[G(W_i;\hat\gamma_N-\gamma_0)-\int G(w;\hat\gamma_N-\gamma_0)\,dF_W(w)\right]\right\|
+\left\|\sqrt N\left[\int G(w;\hat\gamma_N-\gamma_0)\,dF_W(w)-\int\delta(w)\,d\hat F_W(w)\right]\right\|
+\left\|\sqrt N\left[\int\delta(w)\,d\hat F_W(w)-\int\delta(w)\,d\tilde F_W(w)\right]\right\|
=o_p(1), \quad\text{(B.84)}
\]
where the last line follows from Lemmas E.8, E.9 and E.10.
Proof of Theorem 5.6. By Assumption 5.4 and the construction of $\delta(w)$, we know that $E[\tilde g_i]=0$. Since the dependency neighborhood $\Delta(i,N)$ is symmetric as in Assumption 5.7, $\Sigma_N^{\tilde g}$ is symmetric: for all $r,q=1,\dots,d_\theta$, its $(r,q)$-th entry satisfies
\[
\sum_{i=1}^N\sum_{j\in\Delta(i,N)}E[\tilde g_{i,r}\tilde g_{j,q}]=\sum_{j=1}^N\sum_{i\in\Delta(j,N)}E[\tilde g_{j,r}\tilde g_{i,q}]=\sum_{i=1}^N\sum_{j\in\Delta(i,N)}E[\tilde g_{i,q}\tilde g_{j,r}],
\]
where the first equality follows from a change of index and the second equality is due to the symmetry of $\Delta(i,N)$. Under Assumption 5.7, the sufficient conditions for the CLT for neighborhood-dependent data required in Lemma E.6 are satisfied. Thus, we can show that $[\Sigma_N^{\tilde g}]^{-1/2}S_N^{\tilde g}\overset{d}{\to}N(0,I_{d_\theta})$.

Next, we show the asymptotic normality of $\sqrt N(\hat\theta_N-\theta_0)$. From (12) and Lemma 5.5(b), we have
\[
-\left[\frac{1}{\sqrt N}\sum_{i=1}^N\tilde g_i+o_p(1)\right]=\left[\frac{1}{N}\sum_{i=1}^N\frac{\partial g(W_i;\tilde\theta_N,\hat\phi_N)}{\partial\theta'}\right]\sqrt N(\hat\theta_N-\theta_0).
\]
From Lemma 5.5(a), $\frac{1}{N}\sum_{i=1}^N\partial g(W_i;\tilde\theta_N,\hat\phi_N)/\partial\theta'\overset{p}{\to}E[\partial g(W_i;\theta_0,\phi_0)/\partial\theta']$, where by Assumption 5.5 the limit $E[\partial g(W_i;\theta_0,\phi_0)/\partial\theta']$ is invertible. Thus, $[\frac{1}{N}\sum_{i=1}^N\partial g(W_i;\tilde\theta_N,\hat\phi_N)/\partial\theta']^{-1}$ exists for $N$ large enough. Moreover, recall that $\Omega_N$ is symmetric and $\Omega_N\overset{p}{\to}\Omega$ with $\Omega$ positive definite and nonsingular, which indicates that $\Omega_N^{-1/2}$ also exists for $N$ large enough. Then, because $\|\Omega_N^{-1/2}\|=O(1)$ and $\Omega_N^{-1/2}=\sqrt N[\Sigma_N^{\tilde g}]^{-1/2}$, we can obtain
\[
\sqrt N(\hat\theta_N-\theta_0)
=-\left[\frac{1}{N}\sum_{i=1}^N\frac{\partial g(W_i;\tilde\theta_N,\hat\phi_N)}{\partial\theta'}\right]^{-1}\left[\frac{1}{\sqrt N}\sum_{i=1}^N\tilde g_i+o_p(1)\right]
=-\left[\frac{1}{N}\sum_{i=1}^N\frac{\partial g(W_i;\tilde\theta_N,\hat\phi_N)}{\partial\theta'}\right]^{-1}\Omega_N^{1/2}\left[\Omega_N^{-1/2}\frac{1}{\sqrt N}\sum_{i=1}^N\tilde g_i+\Omega_N^{-1/2}o_p(1)\right]
=-\left[\frac{1}{N}\sum_{i=1}^N\frac{\partial g(W_i;\tilde\theta_N,\hat\phi_N)}{\partial\theta'}\right]^{-1}\Omega_N^{1/2}\left[[\Sigma_N^{\tilde g}]^{-1/2}S_N^{\tilde g}+o_p(1)\right]
\overset{d}{\to}N(0,H^{-1}\Omega H^{-1}),
\]
where the last line holds because
\[
\left[\frac{1}{N}\sum_{i=1}^N\frac{\partial g(W_i;\tilde\theta_N,\hat\phi_N)}{\partial\theta'}\right]\overset{p}{\to}E\!\left[\frac{\partial g(W_i;\theta_0,\phi_0)}{\partial\theta'}\right]=:H
\quad\text{and}\quad
[\Sigma_N^{\tilde g}]^{-1/2}S_N^{\tilde g}\overset{d}{\to}N(0,I_{d_\theta}).
\]
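For implementation, the sandwich form $H^{-1}\Omega H^{-1}$ together with the plug-in $\hat\Omega_N$ of Corollary 5.7 suggests the following minimal sketch; the inputs g_tilde (scores), Delta (dependency neighborhoods) and H_hat (averaged Jacobian at the estimates) are hypothetical names, not objects defined in the paper:

```python
import numpy as np

# Sketch of the plug-in sandwich covariance: Omega_hat as in Corollary 5.7
# and the limiting variance H^{-1} Omega H^{-1} from Theorem 5.6.
def sandwich_cov(g_tilde, Delta, H_hat):
    N, d = g_tilde.shape
    omega = np.zeros((d, d))
    for i in range(N):
        for j in Delta[i]:                 # only neighborhood pairs contribute
            omega += np.outer(g_tilde[i], g_tilde[j])
    omega /= N                             # Omega_hat = N^{-1} sum_i sum_{j in Delta(i,N)}
    H_inv = np.linalg.inv(H_hat)
    return H_inv @ omega @ H_inv.T         # estimate of H^{-1} Omega H^{-1}
```

Restricting the double sum to the neighborhoods $\Delta(i,N)$ is the design choice that makes the estimator feasible under network dependence, mirroring the covariance truncation used throughout this appendix.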
Proof of Corollary 5.7. To simplify notation, denote $\hat{\tilde g}_i=g(W_i;\hat\theta_N,\hat\phi_N)+\hat\delta(W_i)$. Then,
\[
\|\hat\Omega_N-\Omega\|=\left\|\frac{1}{N}\sum_{i=1}^N\sum_{j\in\Delta(i,N)}\big(\hat{\tilde g}_i\hat{\tilde g}_j'-E[\tilde g_i\tilde g_j']\big)\right\|
\le\left\|\frac{1}{N}\sum_{i=1}^N\sum_{j\in\Delta(i,N)}\big(\hat{\tilde g}_i\hat{\tilde g}_j'-\tilde g_i\tilde g_j'\big)\right\|+\left\|\frac{1}{N}\sum_{i=1}^N\sum_{j\in\Delta(i,N)}\big(\tilde g_i\tilde g_j'-E[\tilde g_i\tilde g_j']\big)\right\|
:=\Delta\Omega_1+\Delta\Omega_2. \quad\text{(B.85)}
\]
Step 1. Consider $\Delta\Omega_1$; by simple algebra,
\[
\Delta\Omega_1\le\left\|\frac{1}{N}\sum_{i=1}^N\sum_{j\in\Delta(i,N)}\Big[(\hat{\tilde g}_i-\tilde g_i)(\hat{\tilde g}_j-\tilde g_j)'+\tilde g_i(\hat{\tilde g}_j-\tilde g_j)'+(\hat{\tilde g}_i-\tilde g_i)\tilde g_j'\Big]\right\|
\le\frac{1}{N}\sum_{i=1}^N\sum_{j\in\Delta(i,N)}\Big[\|\hat{\tilde g}_i-\tilde g_i\|\,\|\hat{\tilde g}_j-\tilde g_j\|+\|\tilde g_i\|\,\|\hat{\tilde g}_j-\tilde g_j\|+\|\hat{\tilde g}_i-\tilde g_i\|\,\|\tilde g_j\|\Big]. \quad\text{(B.86)}
\]
Given (B.86), it suffices for $\Delta\Omega_1=o_p(1)$ to verify that (a) $\tilde g_i$ and $\hat{\tilde g}_i$ are bounded, and (b) $\frac{1}{N}\sum_{i=1}^N\sum_{j\in\Delta(i,N)}\|\hat{\tilde g}_i-\tilde g_i\|=o_p(1)$.

Firstly, (a) is satisfied if $|g(w;\theta,\phi)+\delta(w;\theta,\phi)|$ is uniformly bounded over $\Omega_W$ and $\Theta\times[0,1]^{K_T}$. Since $m^*(x;\theta)$ is continuously differentiable in $\theta$ up to order three (Assumption 5.5) and $\Theta$ is compact, for every $x\in\Omega_X$,
\[
|m^*(x;\theta)|,\ \left\|\frac{\partial m^*(x;\theta)}{\partial\theta}\right\|,\ \left\|\frac{\partial^2m^*(x;\theta)}{\partial\theta\,\partial\theta'}\right\|\ \text{are bounded uniformly over }\Theta. \quad\text{(B.87)}
\]
Furthermore, since $\nu(w;\theta,\gamma)$ is almost everywhere (a.e.) continuously differentiable in $w_c$ (Assumption 5.6), it follows from the definition of $\nu(w;\theta,\gamma)$ that $m^*(x;\theta)$ and $\partial m^*(x;\theta)/\partial\theta$ are also continuous in $w_c$ a.e. within the compact $\Omega_{W_c}$. Therefore, for every $\theta\in\Theta$,
\[
|m^*(x;\theta)|,\ \left\|\frac{\partial m^*(x;\theta)}{\partial\theta}\right\|,\ \left\|\frac{\partial^2m^*(x;\theta)}{\partial\theta\,\partial\theta'}\right\|\ \text{are bounded uniformly over }\Omega_X. \quad\text{(B.88)}
\]
Then, (B.87) and (B.88) together indicate the uniform boundedness of $|m^*(x;\theta)|$ and its first and second derivatives over $\Omega_X$ and $\Theta$. Thus,
\[
\sup_{w\in\Omega_W,(\theta,\phi)\in\Theta\times[0,1]^{K_T}}|g(w;\theta,\phi)|
=\sup_{w\in\Omega_W,(\theta,\phi)\in\Theta\times[0,1]^{K_T}}\left|\tau(x)\big(y-m(x;\theta,\phi)\big)\frac{\partial m(x;\theta,\phi)}{\partial\theta}\right|
\le C\sup_{w\in\Omega_W,(\theta,\phi)\in\Theta\times[0,1]^{K_T}}\left\|\frac{\partial m^*(x;\theta)}{\partial\theta}\right\|\le C,
\]
where the first inequality holds because $y$ and $m(x;\theta,\phi)$ are bounded ($\Omega_{W_c}$ is compact) and $\tau(\cdot)$ is bounded (Assumption 5.2).

For $\delta(W_i;\theta,\phi)=\nu(W_i;\theta,\phi)-E[\nu(W_i;\theta,\phi)]$, with $\nu(W_i;\theta,\phi)=\tau(X_i)\frac{\partial R(W_i;\theta,\phi)}{\partial\theta}\frac{\partial\phi(t;\gamma)}{\partial\gamma'}\,d_\gamma$ and the $d_\theta\times K_T$ matrix
\[
\frac{\partial R(W_i;\theta,\phi)}{\partial\theta}=\left[-\frac{\partial m(X_i;\theta,\phi)}{\partial\theta}m^*(x_{i,1};\theta)+\big(Y_i-m(X_i;\theta,\phi)\big)\frac{\partial m^*(x_{i,1};\theta)}{\partial\theta},\ \dots,\ -\frac{\partial m(X_i;\theta,\phi)}{\partial\theta}m^*(x_{i,K_T};\theta)+\big(Y_i-m(X_i;\theta,\phi)\big)\frac{\partial m^*(x_{i,K_T};\theta)}{\partial\theta}\right],
\]
it is easy to see that $\delta(W_i;\theta,\phi)$ is a function of $m^*(x;\theta)$, $\partial m^*(x;\theta)/\partial\theta$ and $\partial\phi(\gamma)/\partial\gamma$, and that it is linear in $\phi$. Moreover, $\phi$ is the probability function of discrete random variables and therefore lies strictly in $[0,1]^{K_T}$; combining this with the uniform boundedness of $\partial\phi(\gamma)/\partial\gamma$ provided in the proof of Corollary 5.3 leads to $\sup_{w\in\Omega_W,(\theta,\phi)\in\Theta\times[0,1]^{K_T}}|\delta(w;\theta,\phi)|\le C$ for some constant $C>0$.
So far, we have established that (a) holds.

Secondly, move on to (b). For $\theta_N^*$ between $\theta_0$ and $\hat\theta_N$, the triangle inequality and the mean value theorem lead to
\[
\|\hat{\tilde g}_i-\tilde g_i\|\le\|g(W_i;\hat\theta_N,\hat\phi_N)-g(W_i;\theta_0,\hat\phi_N)\|+\|g(W_i;\theta_0,\hat\phi_N)-g(W_i;\theta_0,\phi_0)\|
+\|\delta(W_i;\hat\theta_N,\hat\phi_N)-\delta(W_i;\theta_0,\hat\phi_N)\|+\|\delta(W_i;\theta_0,\hat\phi_N)-\delta(W_i;\theta_0,\phi_0)\|
\le\left\|\frac{\partial g(W_i;\theta_N^*,\hat\phi_N)}{\partial\theta'}\right\|\|\hat\theta_N-\theta_0\|+\|g(W_i;\theta_0,\hat\phi_N)-g(W_i;\theta_0,\phi_0)\|
+\left\|\frac{\partial\delta(W_i;\theta_N^*,\hat\phi_N)}{\partial\theta'}\right\|\|\hat\theta_N-\theta_0\|+\|\delta(W_i;\theta_0,\hat\phi_N)-\delta(W_i;\theta_0,\phi_0)\|. \quad\text{(B.89)}
\]
Start from the first term of (B.89): when the sample size is large enough (i.e., $\hat\phi_N$ is close to $\phi_0$),
\[
\left\|\frac{\partial g(W_i;\theta_N^*,\hat\phi_N)}{\partial\theta'}\right\|
=\left\|\tau(X_i)\left[-\frac{\partial m(X_i;\theta_N^*,\hat\phi_N)}{\partial\theta}\frac{\partial m(X_i;\theta_N^*,\hat\phi_N)}{\partial\theta'}+\big(Y_i-m(X_i;\theta_N^*,\hat\phi_N)\big)\frac{\partial^2m(X_i;\theta_N^*,\hat\phi_N)}{\partial\theta\,\partial\theta'}\right]\right\|
\le C\left(\sum_{j,l=1}^{K_T}\left\|\frac{\partial m^*(x_{i,j};\theta_N^*)}{\partial\theta}\right\|\left\|\frac{\partial m^*(x_{i,l};\theta_N^*)}{\partial\theta'}\right\|+\sum_{j=1}^{K_T}\left\|\frac{\partial^2m^*(x_{i,j};\theta_N^*)}{\partial\theta\,\partial\theta'}\right\|\right)\le C, \quad\text{(B.90)}
\]
where the last inequality is because of (B.87) and (B.88). For the second term of (B.89), it follows from the calculation in (E.17) that
\[
\|g(W_i;\theta_0,\hat\phi_N)-g(W_i;\theta_0,\phi_0)\|
\le\|\hat\phi_N-\phi_0\|_\infty\left[\sum_{j,l=1}^{K_T}|m^*(x_{i,j};\theta_0)|\left\|\frac{\partial m^*(x_{i,l};\theta_0)}{\partial\theta}\right\|+\tau_i|Y_i-m(X_i;\theta_0,\phi_0)|\sum_{j=1}^{K_T}\left\|\frac{\partial m^*(x_{i,j};\theta_0)}{\partial\theta}\right\|\right]+s.o.
\le C\|\hat\phi_N-\phi_0\|_\infty\sum_{j=1}^{K_T}\left\|\frac{\partial m^*(x_{i,j};\theta_0)}{\partial\theta}\right\|+s.o.
\le C\|\hat\phi_N-\phi_0\|_\infty, \quad\text{(B.91)}
\]
where the second inequality holds because the compactness of $\Omega_Y$ implies that both $m^*(x_{i,j};\theta_0)$ and $|Y_i-m(X_i;\theta_0,\phi_0)|$ are bounded, and the last inequality is due to (B.87) and (B.88). To bound the third term of (B.89), by the dominated convergence theorem, we have
\[
\frac{\partial\delta(W_i;\theta_N^*,\hat\phi_N)}{\partial\theta'}
=\tau(X_i)\frac{\partial}{\partial\theta'}\!\left(\frac{\partial R(W_i;\theta_N^*,\hat\phi_N)}{\partial\theta}\frac{\partial\phi(t;\hat\gamma_N)}{\partial\gamma'}\,d_\gamma\right)
-E\!\left[\tau(X_i)\frac{\partial}{\partial\theta'}\!\left(\frac{\partial R(W_i;\theta_N^*,\hat\phi_N)}{\partial\theta}\frac{\partial\phi(t;\hat\gamma_N)}{\partial\gamma'}\,d_\gamma\right)\right].
\]
Based on arguments similar to those used to obtain (B.90), together with the uniform boundedness of $\partial\phi(\gamma)/\partial\gamma'$ over $\gamma\in[0,1]^{K_T}$ provided in the proof of Corollary 5.3, we can get $\|\partial\delta(W_i;\theta_N^*,\hat\phi_N)/\partial\theta'\|\le C$ for some constant $C>0$.
At last,
\[
\|\delta(W_i;\theta_0,\hat\phi_N)-\delta(W_i;\theta_0,\phi_0)\|
\le|\tau(X_i)|\left\|\frac{\partial R(W_i;\theta_0,\hat\phi_N)}{\partial\theta}\frac{\partial\phi(t;\hat\gamma_N)}{\partial\gamma'}-\frac{\partial R(W_i;\theta_0,\phi_0)}{\partial\theta}\frac{\partial\phi(t;\gamma_0)}{\partial\gamma'}\right\|\|d_\gamma\|
\le C\left(\left\|\frac{\partial R(W_i;\theta_0,\hat\phi_N)}{\partial\theta}-\frac{\partial R(W_i;\theta_0,\phi_0)}{\partial\theta}\right\|+\left\|\frac{\partial\phi(t;\hat\gamma_N)}{\partial\gamma'}-\frac{\partial\phi(t;\gamma_0)}{\partial\gamma'}\right\|\right)
\le C\|\hat\gamma_N-\gamma_0\|_\infty.
\]
Collecting the four bounds on the terms of (B.89) then delivers (b):
\[
\frac{1}{N}\sum_{i=1}^N\sum_{j\in\Delta(i,N)}\|\hat{\tilde g}_i-\tilde g_i\|\le\frac{C}{N}\sum_{i=1}^N|\Delta(i,N)|\Big(\|\hat\theta_N-\theta_0\|+\|\hat\gamma_N-\gamma_0\|_\infty\Big)=o_p(1),
\]
based on the consistency of $\hat\theta_N$ and $\hat\gamma_N$, and the fact that $\frac{1}{N}\sum_{i=1}^N|\Delta(i,N)|=O(1)$.

Step 2. Next, let us deal with $\Delta\Omega_2$. Based on (E.7) and Assumption 5.7(v),
\[
E[\|\Delta\Omega_2\|^2]\le\frac{d_\theta^2}{N^2}\left\|\sum_{i,k=1}^N\sum_{j\in\Delta(i,N)}\sum_{l\in\Delta(k,N)}E\Big[\big(\tilde g_i\tilde g_j'-E[\tilde g_i\tilde g_j']\big)'\big(\tilde g_k\tilde g_l'-E[\tilde g_k\tilde g_l']\big)\Big]\right\|_\infty
\le\frac{d_\theta^2}{N^2}\,o\big(\|[\Sigma_N^{\tilde g}]\|_\infty^2\big)=o(1), \quad\text{(B.92)}
\]
where the last equality comes from (E.10), which shows that $o(\|[\Sigma_N^{\tilde g}]\|_\infty^2/N^2)=o(1)$. Hence, $\|\hat\Omega_N-\Omega\|=o_p(1)$.

C Tables and Figures

Table: Network summary statistics and misclassified links (p_ω = 0.…, p_V = δ_V/N).

(a) r_deg = 5

δ_V  1−p_U(%)  p_V(%)  N   |N_i| avg  max    S_i avg  max   1 to 0  0 to 1  total   ratio(%)
0    20        0       1k  4.97       14.84  1.49     6.95  677.5   0       677.5   12.01
0    20        0       2k  5.04       15.63  1.51     7.51  1371    0       1371    12.00
0    20        0       5k  5.11       16.66  1.53     8.12  3482    0       3482    12.01
0.1  20        0.010   1k  5.03       14.87  1.51     6.97  677.5   59.64   737.2   13.05
0.1  20        0.005   2k  5.10       15.67  1.53     7.56  1371    119.1   1490    13.01
0.1  20        0.002   5k  5.17       16.66  1.55     8.17  3482    299.5   3782    13.03
0.5  20        0.050   1k  5.27       15.02  1.58     7.08  677.5   298.2   975.7   17.28
0.5  20        0.025   2k  5.34       15.81  1.60     7.65  1371    596.3   1968    17.18
0.5  20        0.010   5k  5.41       16.81  1.62     8.28  3482    1501    4983    17.17
0    40        0       1k  4.29       14.70  1.29     6.81  1356    0       1356    24.03
0    40        0       2k  4.35       15.53  1.31     7.37  2746    0       2746    24.00
0    40        0       5k  4.41       16.56  1.32     7.99  6961    0       6961    24.02
0.1  40        0.010   1k  4.35       14.72  1.31     6.79  1356    59.64   1416    25.07
0.1  40        0.005   2k  4.42       15.55  1.32     7.38  2746    119.1   2865    25.01
0.1  40        0.002   5k  4.47       16.52  1.34     7.99  6961    299.5   7260    25.02
0.5  40        0.050   1k  4.59       14.74  1.38     6.84  1356    298.2   1654    29.29
0.5  40        0.025   2k  4.65       15.56  1.40     7.42  2746    596.3   3343    29.18
0.5  40        0.010   5k  4.71       16.54  1.41     8.03  6961    1501    8462    29.16

(b) r_deg = 8

δ_V  1−p_U(%)  p_V(%)  N   |N_i| avg  max    S_i avg  max    1 to 0  0 to 1  total   ratio(%)
0    20        0       1k  7.85       20.50  2.36     9.08   1069    0       1069    12.02
0    20        0       2k  7.99       21.55  2.40     9.75   2177    0       2177    12.01
0    20        0       5k  8.12       22.83  2.44     10.58  5540    0       5540    12.01
0.1  20        0.010   1k  7.91       20.52  2.37     9.07   1069    59.45   1128    12.65
0.1  20        0.005   2k  8.05       21.55  2.42     9.81   2177    118.9   2296    12.64
0.1  20        0.002   5k  8.18       22.82  2.45     10.55  5540    299.3   5839    12.65
0.5  20        0.050   1k  8.15       20.60  2.45     9.15   1069    297.3   1366    15.32
0.5  20        0.025   2k  8.29       21.64  2.49     9.88   2177    595.3   2772    15.26
0.5  20        0.010   5k  8.43       22.90  2.53     10.62  5540    1500    7039    15.25
0    40        0       1k  6.78       20.40  2.03     8.92   2139    0       2139    24.03
0    40        0       2k  6.90       21.48  2.07     9.62   4356    0       4356    24.01
0    40        0       5k  7.02       22.75  2.10     10.45  11078   0       11078   24.02
0.1  40        0.010   1k  6.84       20.43  2.05     8.87   2139    59.45   2199    24.65
0.1  40        0.005   2k  6.96       21.48  2.09     9.66   4356    118.9   4475    24.63
0.1  40        0.002   5k  7.08       22.77  2.12     10.41  11078   299.3   11377   24.64
0.5  40        0.050   1k  7.08       20.43  2.12     8.90   2139    297.3   2437    27.32
0.5  40        0.025   2k  7.20       21.48  2.16     9.67   4356    595.3   4951    27.25
0.5  40        0.010   5k  7.32       22.77  2.19     10.43  11078   1500    12577   27.24

Note: The results in this table can be applied to both (|N_i|, S_i) and (|Ñ_i|, S̃_i).
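The misclassified-link columns of the table above can be reproduced directly from a pair of adjacency matrices. Below is a small sketch assuming an undirected network stored as symmetric 0/1 arrays (hence the division by two); the function name is illustrative:

```python
import numpy as np

def link_errors(A_star, A):
    # "1 to 0": existing links missed; "0 to 1": nonexisting links misreported
    one_to_zero = int(((A_star == 1) & (A == 0)).sum() // 2)
    zero_to_one = int(((A_star == 0) & (A == 1)).sum() // 2)
    total = one_to_zero + zero_to_one
    ratio = 100.0 * total / (A_star.sum() / 2)   # % of the actual links
    return one_to_zero, zero_to_one, total, ratio
```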
Table: Estimation results for the direct effect τ_d(0,·,3) (p_ω = p_ω̃ = 0.…, p_V = δ_V/N).

(a) r_deg = 5

δ_V  (1−p_U, p_V)%  (1−p_Ũ, p_Ṽ)%  N   SPE: bias sd mse cr        Naive 1: bias sd mse cr    Naive 2: bias sd mse cr
0.1  (20, 0.010)    (20, 0)        1k  -0.060 0.349 0.125 0.931   -0.073 0.292 0.091 0.943   -0.063 0.294 0.090 0.935
0.1  (20, 0.005)    (20, 0)        2k  -0.027 0.245 0.061 0.941   -0.077 0.206 0.048 0.933   -0.071 0.208 0.048 0.931
0.1  (20, 0.002)    (20, 0)        5k  -0.016 0.133 0.018 0.937   -0.060 0.132 0.021 0.924   -0.063 0.130 0.021 0.916
0.5  (20, 0.050)    (20, 0)        1k  -0.053 0.354 0.128 0.941   -0.076 0.319 0.108 0.942   -0.061 0.284 0.084 0.946
0.5  (20, 0.025)    (20, 0)        2k  -0.032 0.243 0.060 0.941   -0.097 0.219 0.057 0.925   -0.061 0.205 0.046 0.939
0.5  (20, 0.010)    (20, 0)        5k  -0.028 0.133 0.019 0.942   -0.083 0.139 0.026 0.909   -0.062 0.133 0.022 0.922
0.1  (40, 0.010)    (40, 0)        1k   0.075 0.538 0.296 0.950   -0.035 0.405 0.165 0.952   -0.016 0.390 0.153 0.948
0.1  (40, 0.005)    (40, 0)        2k   0.051 0.384 0.150 0.942   -0.018 0.276 0.076 0.948   -0.019 0.273 0.075 0.955
0.1  (40, 0.002)    (40, 0)        5k   0.038 0.236 0.057 0.938   -0.013 0.173 0.030 0.948    0.004 0.182 0.033 0.945
0.5  (40, 0.050)    (40, 0)        1k   0.059 0.547 0.303 0.940   -0.040 0.398 0.160 0.958   -0.012 0.399 0.160 0.950
0.5  (40, 0.025)    (40, 0)        2k   0.040 0.368 0.137 0.941   -0.047 0.280 0.081 0.954   -0.015 0.283 0.080 0.953
0.5  (40, 0.010)    (40, 0)        5k   0.022 0.219 0.048 0.940   -0.052 0.189 0.038 0.944    0.012 0.175 0.031 0.952

(b) r_deg = 8

δ_V  (1−p_U, p_V)%  (1−p_Ũ, p_Ṽ)%  N   SPE: bias sd mse cr        Naive 1: bias sd mse cr    Naive 2: bias sd mse cr
0.1  (20, 0.010)    (20, 0)        1k   0.060 0.574 0.334 0.953   -0.161 0.527 0.304 0.941   -0.142 0.507 0.277 0.949
0.1  (20, 0.005)    (20, 0)        2k  -0.048 0.284 0.083 0.940   -0.140 0.359 0.148 0.929   -0.130 0.392 0.170 0.928
0.1  (20, 0.002)    (20, 0)        5k  -0.020 0.180 0.033 0.954   -0.141 0.237 0.076 0.910   -0.139 0.233 0.074 0.908
0.5  (20, 0.050)    (20, 0)        1k  -0.016 0.535 0.287 0.930   -0.170 0.518 0.298 0.938   -0.170 0.522 0.302 0.938
0.5  (20, 0.025)    (20, 0)        2k   0.019 0.399 0.160 0.954   -0.141 0.394 0.175 0.935   -0.155 0.361 0.154 0.934
0.5  (20, 0.010)    (20, 0)        5k  -0.019 0.169 0.029 0.963   -0.162 0.241 0.084 0.899   -0.144 0.243 0.080 0.902
0.1  (40, 0.010)    (40, 0)        1k   0.383 0.792 0.774 0.934   -0.119 0.776 0.617 0.946   -0.120 0.756 0.585 0.942
0.1  (40, 0.005)    (40, 0)        2k   0.356 0.569 0.451 0.933   -0.118 0.574 0.343 0.946   -0.103 0.560 0.325 0.945
0.1  (40, 0.002)    (40, 0)        5k   0.280 0.343 0.196 0.897   -0.101 0.354 0.135 0.935   -0.086 0.354 0.133 0.938
0.5  (40, 0.050)    (40, 0)        1k   0.367 0.794 0.765 0.919   -0.184 0.757 0.607 0.948   -0.121 0.749 0.575 0.949
0.5  (40, 0.025)    (40, 0)        2k   0.323 0.556 0.413 0.934   -0.148 0.552 0.326 0.937   -0.115 0.550 0.316 0.950
0.5  (40, 0.010)    (40, 0)        5k   0.211 0.342 0.162 0.910   -0.154 0.348 0.145 0.928   -0.103 0.362 0.141 0.945

Note: SPE lists the semiparametric estimation results proposed in Section 5.3. Estimates of Naive 1 are computed using OLS with {Y_i, D_i, S_i, Z_i, |N_i|}_{i=1}^N; estimates of Naive 2 are computed using OLS with {Y_i, D_i, S̃_i, Z_i, |Ñ_i|}_{i=1}^N. True value of the treatment effect: τ_d(0,·,3) = 1.
Table: Estimation results for the direct effect τ_d(0,·,3) (p_ω = p_ω̃ = 0.…, p_V = δ_V/N).

(a) r_deg = 5

δ_V  (1−p_U, p_V)%  (1−p_Ũ, p_Ṽ)%  N   SPE: bias sd mse cr        Naive 1: bias sd mse cr    Naive 2: bias sd mse cr
0.1  (20, 0.010)    (20, 0)        1k  -0.068 0.313 0.102 0.948    0.131 0.279 0.095 0.921    0.120 0.268 0.086 0.936
0.1  (20, 0.005)    (20, 0)        2k  -0.059 0.215 0.050 0.939    0.121 0.195 0.053 0.897    0.140 0.190 0.056 0.883
0.1  (20, 0.002)    (20, 0)        5k  -0.052 0.126 0.019 0.930    0.132 0.122 0.032 0.815    0.133 0.126 0.034 0.818
0.5  (20, 0.050)    (20, 0)        1k  -0.066 0.323 0.108 0.944    0.078 0.283 0.086 0.941    0.133 0.270 0.090 0.920
0.5  (20, 0.025)    (20, 0)        2k  -0.059 0.209 0.047 0.946    0.075 0.201 0.046 0.943    0.136 0.195 0.057 0.892
0.5  (20, 0.010)    (20, 0)        5k  -0.057 0.114 0.016 0.931    0.081 0.124 0.022 0.907    0.135 0.115 0.031 0.778
0.1  (40, 0.010)    (40, 0)        1k   0.040 0.528 0.281 0.953    0.287 0.405 0.247 0.885    0.318 0.408 0.268 0.882
0.1  (40, 0.005)    (40, 0)        2k   0.007 0.350 0.123 0.949    0.299 0.293 0.175 0.834    0.303 0.291 0.176 0.825
0.1  (40, 0.002)    (40, 0)        5k   0.001 0.209 0.044 0.957    0.305 0.181 0.126 0.600    0.316 0.183 0.133 0.581
0.5  (40, 0.050)    (40, 0)        1k   0.054 0.522 0.276 0.946    0.255 0.393 0.219 0.898    0.303 0.411 0.261 0.892
0.5  (40, 0.025)    (40, 0)        2k   0.027 0.325 0.106 0.953    0.252 0.286 0.145 0.863    0.325 0.275 0.181 0.788
0.5  (40, 0.010)    (40, 0)        5k   0.003 0.196 0.039 0.952    0.248 0.181 0.094 0.730    0.322 0.185 0.138 0.590

(b) r_deg = 8

δ_V  (1−p_U, p_V)%  (1−p_Ũ, p_Ṽ)%  N   SPE: bias sd mse cr        Naive 1: bias sd mse cr    Naive 2: bias sd mse cr
0.1  (20, 0.010)    (20, 0)        1k  -0.024 0.516 0.267 0.953    0.071 0.445 0.203 0.941    0.083 0.446 0.205 0.944
0.1  (20, 0.005)    (20, 0)        2k  -0.061 0.323 0.108 0.951    0.084 0.319 0.109 0.937    0.078 0.335 0.118 0.941
0.1  (20, 0.002)    (20, 0)        5k  -0.089 0.190 0.044 0.939    0.077 0.200 0.046 0.936    0.087 0.202 0.048 0.938
0.5  (20, 0.050)    (20, 0)        1k   0.062 0.629 0.400 0.966    0.062 0.448 0.204 0.955    0.075 0.454 0.212 0.943
0.5  (20, 0.025)    (20, 0)        2k  -0.054 0.373 0.142 0.960    0.058 0.330 0.112 0.945    0.082 0.305 0.100 0.943
0.5  (20, 0.010)    (20, 0)        5k  -0.096 0.177 0.041 0.937    0.044 0.208 0.045 0.942    0.078 0.210 0.050 0.927
0.1  (40, 0.010)    (40, 0)        1k   0.329 0.813 0.768 0.932    0.267 0.703 0.565 0.938    0.279 0.709 0.581 0.933
0.1  (40, 0.005)    (40, 0)        2k   0.299 0.571 0.416 0.932    0.300 0.511 0.351 0.908    0.306 0.502 0.346 0.901
0.1  (40, 0.002)    (40, 0)        5k   0.173 0.336 0.143 0.916    0.272 0.318 0.175 0.877    0.285 0.322 0.185 0.851
0.5  (40, 0.050)    (40, 0)        1k   0.300 0.814 0.752 0.934    0.244 0.655 0.488 0.939    0.298 0.700 0.579 0.933
0.5  (40, 0.025)    (40, 0)        2k   0.256 0.538 0.355 0.925    0.240 0.474 0.282 0.912    0.298 0.500 0.339 0.903
0.5  (40, 0.010)    (40, 0)        5k   0.119 0.327 0.121 0.937    0.218 0.316 0.147 0.888    0.284 0.330 0.190 0.863

Note: SPE lists the semiparametric estimation results proposed in Section 5.3. Estimates of Naive 1 are computed using OLS with {Y_i, D_i, S_i, Z_i, |N_i|}_{i=1}^N; estimates of Naive 2 are computed using OLS with {Y_i, D_i, S̃_i, Z_i, |Ñ_i|}_{i=1}^N. True value of the treatment effect: τ_d(0,·,3) = 2.
Table: Estimation results for the spillover effect τ_s(1,·,3) (p_ω = p_ω̃ = 0.…, p_V = δ_V/N).

(a) r_deg = 5

δ_V  (1−p_U, p_V)%  (1−p_Ũ, p_Ṽ)%  N   SPE: bias sd mse cr        Naive 1: bias sd mse cr    Naive 2: bias sd mse cr
0.1  (20, 0.010)    (20, 0)        1k   0.035 0.488 0.240 0.957    0.270 0.214 0.119 0.749    0.366 0.215 0.180 0.582
0.1  (20, 0.005)    (20, 0)        2k   0.040 0.373 0.141 0.961    0.254 0.160 0.090 0.628    0.351 0.155 0.147 0.365
0.1  (20, 0.002)    (20, 0)        5k   0.073 0.209 0.049 0.945    0.252 0.102 0.074 0.306    0.342 0.100 0.127 0.074
0.5  (20, 0.050)    (20, 0)        1k   0.050 0.543 0.297 0.947   -0.096 0.220 0.058 0.924    0.355 0.211 0.170 0.596
0.5  (20, 0.025)    (20, 0)        2k   0.054 0.354 0.128 0.952   -0.091 0.159 0.033 0.919    0.352 0.151 0.147 0.360
0.5  (20, 0.010)    (20, 0)        5k   0.082 0.209 0.051 0.930   -0.104 0.105 0.022 0.841    0.346 0.102 0.130 0.089
0.1  (40, 0.010)    (40, 0)        1k   0.051 0.750 0.565 0.945    0.436 0.283 0.270 0.650    0.533 0.301 0.375 0.562
0.1  (40, 0.005)    (40, 0)        2k   0.079 0.607 0.375 0.949    0.432 0.209 0.230 0.441    0.532 0.216 0.330 0.299
0.1  (40, 0.002)    (40, 0)        5k   0.165 0.351 0.150 0.923    0.415 0.138 0.192 0.144    0.507 0.138 0.276 0.046
0.5  (40, 0.050)    (40, 0)        1k   0.037 0.753 0.568 0.952    0.082 0.289 0.090 0.948    0.519 0.309 0.364 0.611
0.5  (40, 0.025)    (40, 0)        2k   0.086 0.557 0.317 0.942    0.079 0.212 0.051 0.936    0.517 0.204 0.309 0.285
0.5  (40, 0.010)    (40, 0)        5k   0.175 0.369 0.167 0.927    0.057 0.138 0.022 0.932    0.505 0.138 0.274 0.044

(b) r_deg = 8

δ_V  (1−p_U, p_V)%  (1−p_Ũ, p_Ṽ)%  N   SPE: bias sd mse cr        Naive 1: bias sd mse cr    Naive 2: bias sd mse cr
0.1  (20, 0.010)    (20, 0)        1k   0.076 0.861 0.748 0.935    0.653 0.367 0.561 0.544    0.780 0.354 0.733 0.392
0.1  (20, 0.005)    (20, 0)        2k   0.051 0.675 0.459 0.947    0.651 0.283 0.504 0.358    0.769 0.269 0.664 0.176
0.1  (20, 0.002)    (20, 0)        5k  -0.034 0.403 0.163 0.945    0.650 0.171 0.451 0.040    0.753 0.167 0.595 0.010
0.5  (20, 0.050)    (20, 0)        1k   0.119 0.891 0.809 0.943    0.265 0.368 0.205 0.885    0.774 0.372 0.737 0.459
0.5  (20, 0.025)    (20, 0)        2k   0.029 0.746 0.557 0.941    0.263 0.263 0.138 0.833    0.764 0.267 0.655 0.181
0.5  (20, 0.010)    (20, 0)        5k  -0.013 0.375 0.141 0.951    0.245 0.176 0.091 0.730    0.745 0.174 0.586 0.014
0.1  (40, 0.010)    (40, 0)        1k   1.270 1.023 2.659 0.789    1.260 0.535 1.872 0.335    1.348 0.517 2.085 0.254
0.1  (40, 0.005)    (40, 0)        2k   1.040 0.831 1.772 0.796    1.244 0.367 1.683 0.104    1.314 0.379 1.869 0.063
0.1  (40, 0.002)    (40, 0)        5k   0.743 0.602 0.915 0.787    1.179 0.253 1.454 0.008    1.269 0.255 1.676 0.001
0.5  (40, 0.050)    (40, 0)        1k   1.171 1.045 2.462 0.823    0.873 0.506 1.019 0.581    1.356 0.516 2.106 0.231
0.5  (40, 0.025)    (40, 0)        2k   0.993 0.854 1.715 0.843    0.814 0.368 0.798 0.401    1.298 0.369 1.822 0.053
0.5  (40, 0.010)    (40, 0)        5k   0.653 0.620 0.811 0.845    0.803 0.241 0.702 0.090    1.270 0.253 1.678 0.002

Note: SPE lists the semiparametric estimation results proposed in Section 5.3. Estimates of Naive 1 are computed using OLS with {Y_i, D_i, S_i, Z_i, |N_i|}_{i=1}^N; estimates of Naive 2 are computed using OLS with {Y_i, D_i, S̃_i, Z_i, |Ñ_i|}_{i=1}^N. True value of the treatment effect: τ_s(1,·,3) = 3.
Table: Estimation results for the spillover effect τ_s(1,·,3) (p_ω = p_ω̃ = 0.…, p_V = δ_V/N).

(a) r_deg = 5

δ_V  (1−p_U, p_V)%  (1−p_Ũ, p_Ṽ)%  N   SPE: bias sd mse cr        Naive 1: bias sd mse cr    Naive 2: bias sd mse cr
0.1  (20, 0.010)    (20, 0)        1k   0.060 0.846 0.719 0.942    0.617 0.243 0.440 0.272    0.709 0.246 0.564 0.172
0.1  (20, 0.005)    (20, 0)        2k   0.028 0.522 0.273 0.957    0.603 0.173 0.393 0.064    0.693 0.181 0.513 0.029
0.1  (20, 0.002)    (20, 0)        5k   0.021 0.284 0.081 0.954    0.605 0.112 0.379 0.001    0.695 0.112 0.496 0.000
0.5  (20, 0.050)    (20, 0)        1k   0.127 0.928 0.878 0.941    0.289 0.246 0.144 0.768    0.690 0.246 0.537 0.203
0.5  (20, 0.025)    (20, 0)        2k   0.090 0.555 0.316 0.952    0.299 0.186 0.124 0.633    0.704 0.173 0.525 0.025
0.5  (20, 0.010)    (20, 0)        5k   0.047 0.306 0.096 0.954    0.294 0.117 0.100 0.275    0.703 0.114 0.507 0.000
0.1  (40, 0.010)    (40, 0)        1k   0.619 0.979 1.342 0.897    1.273 0.343 1.739 0.037    1.386 0.359 2.050 0.030
0.1  (40, 0.005)    (40, 0)        2k   0.377 0.864 0.889 0.915    1.288 0.242 1.718 0.000    1.406 0.249 2.038 0.000
0.1  (40, 0.002)    (40, 0)        5k   0.232 0.591 0.403 0.937    1.277 0.167 1.658 0.000    1.385 0.164 1.944 0.000
0.5  (40, 0.050)    (40, 0)        1k   0.564 1.016 1.351 0.905    0.865 0.344 0.866 0.297    1.370 0.369 2.013 0.041
0.5  (40, 0.025)    (40, 0)        2k   0.272 0.869 0.829 0.926    0.875 0.247 0.828 0.053    1.387 0.249 1.986 0.000
0.5  (40, 0.010)    (40, 0)        5k   0.207 0.588 0.388 0.949    0.868 0.156 0.777 0.000    1.389 0.167 1.957 0.000

(b) r_deg = 8

δ_V  (1−p_U, p_V)%  (1−p_Ũ, p_Ṽ)%  N   SPE: bias sd mse cr        Naive 1: bias sd mse cr    Naive 2: bias sd mse cr
0.1  (20, 0.010)    (20, 0)        1k   0.382 1.416 2.150 0.931    1.320 0.419 1.919 0.119    1.456 0.403 2.281 0.062
0.1  (20, 0.005)    (20, 0)        2k   0.139 1.055 1.133 0.950    1.345 0.321 1.912 0.013    1.480 0.304 2.282 0.007
0.1  (20, 0.002)    (20, 0)        5k  -0.053 0.621 0.389 0.938    1.369 0.191 1.912 0.000    1.489 0.196 2.256 0.000
0.5  (20, 0.050)    (20, 0)        1k   0.498 1.411 2.239 0.940    0.921 0.412 1.019 0.396    1.474 0.424 2.353 0.066
0.5  (20, 0.025)    (20, 0)        2k   0.029 1.003 1.007 0.944    0.928 0.293 0.947 0.111    1.470 0.304 2.255 0.004
0.5  (20, 0.010)    (20, 0)        5k  -0.019 0.627 0.393 0.950    0.931 0.201 0.908 0.002    1.476 0.201 2.220 0.000
0.1  (40, 0.010)    (40, 0)        1k   2.673 1.634 9.812 0.647    3.000 0.618 9.379 0.001    3.120 0.631 10.136 0.001
0.1  (40, 0.005)    (40, 0)        2k   2.185 1.432 6.826 0.704    2.974 0.443 9.041 0.000    3.113 0.449 9.892 0.000
0.1  (40, 0.002)    (40, 0)        5k   1.437 1.152 3.391 0.779    2.997 0.301 9.072 0.000    3.125 0.294 9.851 0.000
0.5  (40, 0.050)    (40, 0)        1k   2.540 1.698 9.335 0.724    2.482 0.607 6.526 0.012    3.125 0.613 10.140 0.001
0.5  (40, 0.025)    (40, 0)        2k   2.135 1.460 6.692 0.727    2.459 0.423 6.225 0.000    3.109 0.441 9.860 0.000
0.5  (40, 0.010)    (40, 0)        5k   1.262 1.175 2.974 0.837    2.489 0.279 6.274 0.000    3.120 0.295 9.821 0.000

Note: SPE lists the semiparametric estimation results proposed in Section 5.3. Estimates of Naive 1 are computed using OLS with {Y_i, D_i, S_i, Z_i, |N_i|}_{i=1}^N; estimates of Naive 2 are computed using OLS with {Y_i, D_i, S̃_i, Z_i, |Ñ_i|}_{i=1}^N. True value of the treatment effect: τ_s(1,·,3) = 2.5.
Table: Estimation results by ϱ (p_ω = p_ω̃ = 0.…, p_V = δ_V/N, p_Ṽ = 0, δ_V = 0.1, N = 5k). Each row reports bias, sd, mse and cr for SPE, Naive 1 and Naive 2; true values in brackets.

(a) r_deg = 5

ϱ = 0.05, (1−p_U, p_V) = (20, 0.002)%, (1−p_Ũ, p_Ṽ) = (20, 0)%
  τ_d(0,·,3) [1]    -0.020 0.118 0.014 0.915   -0.067 0.128 0.021 0.910   -0.054 0.135 0.021 0.929
  τ_d(0,·,3) [2]    -0.053 0.118 0.017 0.931    0.123 0.118 0.029 0.834    0.140 0.124 0.035 0.800
  τ_s(1,·,3) [3]     0.145 0.184 0.055 0.865    0.249 0.097 0.072 0.273    0.347 0.099 0.130 0.063
  τ_s(1,·,3) [2.5]   0.095 0.301 0.100 0.935    0.604 0.111 0.377 0.000    0.701 0.114 0.504 0.000
ϱ = 0.1, (20, 0.002)%, (20, 0)%
  τ_d(0,·,3) [1]    -0.021 0.110 0.013 0.934   -0.065 0.129 0.021 0.917   -0.062 0.131 0.021 0.923
  τ_d(0,·,3) [2]    -0.059 0.112 0.016 0.910    0.131 0.120 0.032 0.802    0.132 0.120 0.032 0.815
  τ_s(1,·,3) [3]     0.157 0.200 0.065 0.869    0.247 0.098 0.070 0.296    0.346 0.099 0.129 0.060
  τ_s(1,·,3) [2.5]   0.101 0.295 0.097 0.920    0.596 0.109 0.368 0.000    0.696 0.115 0.497 0.000
ϱ = 0.05, (40, 0.002)%, (40, 0)%
  τ_d(0,·,3) [1]     0.049 0.219 0.051 0.917   -0.019 0.175 0.031 0.950   -0.001 0.180 0.032 0.949
  τ_d(0,·,3) [2]     0.005 0.210 0.044 0.945    0.306 0.186 0.128 0.620    0.318 0.184 0.135 0.606
  τ_s(1,·,3) [3]     0.245 0.329 0.168 0.877    0.424 0.130 0.196 0.101    0.513 0.136 0.281 0.032
  τ_s(1,·,3) [2.5]   0.347 0.545 0.417 0.910    1.290 0.155 1.687 0.000    1.397 0.164 1.979 0.000
ϱ = 0.1, (40, 0.002)%, (40, 0)%
  τ_d(0,·,3) [1]     0.070 0.225 0.055 0.930    0.005 0.182 0.033 0.956    0.014 0.174 0.030 0.951
  τ_d(0,·,3) [2]     0.010 0.208 0.043 0.954    0.307 0.187 0.129 0.613    0.313 0.182 0.131 0.586
  τ_s(1,·,3) [3]     0.236 0.330 0.165 0.882    0.413 0.139 0.190 0.153    0.499 0.134 0.267 0.053
  τ_s(1,·,3) [2.5]   0.344 0.509 0.377 0.886    1.282 0.161 1.670 0.000    1.388 0.157 1.951 0.000

(b) r_deg = 8

ϱ = 0.05, (20, 0.002)%, (20, 0)%
  τ_d(0,·,3) [1]    -0.049 0.192 0.039 0.959   -0.143 0.243 0.080 0.908   -0.141 0.241 0.078 0.923
  τ_d(0,·,3) [2]    -0.093 0.193 0.046 0.943    0.078 0.211 0.050 0.930    0.089 0.208 0.051 0.922
  τ_s(1,·,3) [3]    -0.026 0.394 0.156 0.939    0.647 0.183 0.452 0.059    0.751 0.173 0.594 0.007
  τ_s(1,·,3) [2.5]  -0.050 0.612 0.377 0.930    1.364 0.206 1.904 0.000    1.482 0.203 2.236 0.000
ϱ = 0.1, (20, 0.002)%, (20, 0)%
  τ_d(0,·,3) [1]    -0.048 0.177 0.034 0.960   -0.144 0.241 0.079 0.914   -0.143 0.243 0.079 0.911
  τ_d(0,·,3) [2]    -0.102 0.191 0.047 0.929    0.080 0.213 0.052 0.929    0.082 0.207 0.050 0.926
  τ_s(1,·,3) [3]    -0.030 0.356 0.128 0.943    0.634 0.180 0.434 0.069    0.754 0.179 0.601 0.015
  τ_s(1,·,3) [2.5]  -0.046 0.591 0.351 0.937    1.352 0.201 1.868 0.000    1.488 0.203 2.256 0.000
ϱ = 0.05, (40, 0.002)%, (40, 0)%
  τ_d(0,·,3) [1]     0.284 0.341 0.197 0.888   -0.098 0.349 0.132 0.943   -0.073 0.354 0.131 0.952
  τ_d(0,·,3) [2]     0.184 0.343 0.151 0.933    0.274 0.323 0.180 0.863    0.300 0.322 0.193 0.851
  τ_s(1,·,3) [3]     0.786 0.597 0.975 0.774    1.180 0.250 1.454 0.002    1.269 0.255 1.675 0.003
  τ_s(1,·,3) [2.5]   1.581 1.114 3.741 0.749    2.997 0.298 9.071 0.000    3.121 0.299 9.829 0.000
ϱ = 0.1, (40, 0.002)%, (40, 0)%
  τ_d(0,·,3) [1]     0.310 0.312 0.193 0.855   -0.086 0.346 0.127 0.944   -0.065 0.349 0.126 0.942
  τ_d(0,·,3) [2]     0.195 0.325 0.144 0.917    0.285 0.333 0.192 0.861    0.306 0.325 0.199 0.851
  τ_s(1,·,3) [3]     0.781 0.618 0.992 0.796    1.178 0.249 1.449 0.004    1.270 0.249 1.674 0.000
  τ_s(1,·,3) [2.5]   1.498 1.131 3.522 0.776    2.995 0.290 9.057 0.000    3.119 0.293 9.813 0.000
Table: Estimation results with both network proxies mismeasured (p_ω = p_ω̃ = 0.…, p_V = δ_V/N, p_Ṽ = δ_Ṽ/N, N = 5k). Each row reports bias, sd, mse and cr for SPE, Naive 1 and Naive 2; true values in brackets.

(a) r_deg = 5

δ_V = 0.1, δ_Ṽ = 0.05, (1−p_U, p_V) = (20, 0.002)%, (1−p_Ũ, p_Ṽ) = (20, 0.001)%
  τ_d(0,·,3) [1]    -0.027 0.124 0.016 0.930   -0.069 0.136 0.023 0.908   -0.063 0.132 0.021 0.928
  τ_d(0,·,3) [2]    -0.055 0.115 0.016 0.923    0.125 0.122 0.031 0.818    0.133 0.119 0.032 0.803
  τ_s(1,·,3) [3]     0.125 0.204 0.057 0.880    0.248 0.102 0.072 0.312    0.298 0.107 0.100 0.200
  τ_s(1,·,3) [2.5]   0.104 0.303 0.103 0.933    0.600 0.116 0.373 0.001    0.646 0.119 0.432 0.002
δ_V = 0.1, δ_Ṽ = 0.1, (20, 0.002)%, (20, 0.002)%
  τ_d(0,·,3) [1]    -0.027 0.129 0.017 0.934   -0.062 0.134 0.022 0.925   -0.059 0.132 0.021 0.920
  τ_d(0,·,3) [2]    -0.053 0.116 0.016 0.927    0.124 0.123 0.031 0.825    0.131 0.122 0.032 0.809
  τ_s(1,·,3) [3]     0.098 0.221 0.058 0.904    0.247 0.099 0.071 0.289    0.251 0.101 0.073 0.303
  τ_s(1,·,3) [2.5]   0.104 0.306 0.104 0.942    0.600 0.108 0.372 0.000    0.602 0.111 0.374 0.001
δ_V = 0.1, δ_Ṽ = 0.05, (40, 0.002)%, (40, 0.001)%
  τ_d(0,·,3) [1]     0.049 0.231 0.056 0.941   -0.018 0.180 0.033 0.958   -0.003 0.178 0.032 0.945
  τ_d(0,·,3) [2]    -0.007 0.208 0.043 0.959    0.295 0.177 0.119 0.634    0.306 0.186 0.128 0.617
  τ_s(1,·,3) [3]     0.183 0.342 0.150 0.910    0.415 0.136 0.191 0.157    0.466 0.135 0.235 0.066
  τ_s(1,·,3) [2.5]   0.320 0.540 0.394 0.909    1.278 0.156 1.656 0.000    1.342 0.159 1.826 0.000
δ_V = 0.1, δ_Ṽ = 0.1, (40, 0.002)%, (40, 0.002)%
  τ_d(0,·,3) [1]     0.026 0.232 0.054 0.939   -0.013 0.184 0.034 0.947   -0.019 0.175 0.031 0.945
  τ_d(0,·,3) [2]    -0.012 0.211 0.045 0.954    0.293 0.186 0.120 0.634    0.290 0.184 0.118 0.663
  τ_s(1,·,3) [3]     0.159 0.348 0.146 0.927    0.424 0.134 0.198 0.121    0.412 0.135 0.188 0.132
  τ_s(1,·,3) [2.5]   0.282 0.524 0.354 0.926    1.293 0.159 1.698 0.000    1.281 0.156 1.666 0.000

(b) r_deg = 8

δ_V = 0.1, δ_Ṽ = 0.05, (20, 0.002)%, (20, 0.001)%
  τ_d(0,·,3) [1]    -0.052 0.202 0.044 0.956   -0.141 0.239 0.077 0.911   -0.129 0.244 0.076 0.922
  τ_d(0,·,3) [2]    -0.097 0.201 0.050 0.943    0.074 0.205 0.047 0.934    0.090 0.216 0.055 0.926
  τ_s(1,·,3) [3]    -0.040 0.419 0.177 0.955    0.641 0.178 0.443 0.058    0.697 0.178 0.517 0.035
  τ_s(1,·,3) [2.5]  -0.027 0.669 0.448 0.949    1.355 0.206 1.878 0.000    1.417 0.202 2.048 0.000
δ_V = 0.1, δ_Ṽ = 0.1, (20, 0.002)%, (20, 0.002)%
  τ_d(0,·,3) [1]    -0.061 0.180 0.036 0.959   -0.147 0.225 0.072 0.910   -0.153 0.237 0.080 0.899
  τ_d(0,·,3) [2]    -0.105 0.186 0.046 0.934    0.066 0.194 0.042 0.936    0.069 0.206 0.047 0.934
  τ_s(1,·,3) [3]    -0.049 0.379 0.146 0.955    0.642 0.172 0.442 0.041    0.644 0.168 0.443 0.039
  τ_s(1,·,3) [2.5]  -0.026 0.644 0.415 0.949    1.358 0.196 1.883 0.000    1.363 0.194 1.896 0.000
δ_V = 0.1, δ_Ṽ = 0.05, (40, 0.002)%, (40, 0.001)%
  τ_d(0,·,3) [1]     0.264 0.338 0.184 0.903   -0.077 0.347 0.126 0.941   -0.068 0.335 0.117 0.943
  τ_d(0,·,3) [2]     0.169 0.318 0.130 0.917    0.305 0.320 0.195 0.843    0.311 0.316 0.196 0.825
  τ_s(1,·,3) [3]     0.728 0.646 0.947 0.817    1.178 0.243 1.447 0.001    1.227 0.251 1.568 0.002
  τ_s(1,·,3) [2.5]   1.359 1.150 3.168 0.802    2.988 0.291 9.011 0.000    3.047 0.292 9.368 0.000
δ_V = 0.1, δ_Ṽ = 0.1, (40, 0.002)%, (40, 0.002)%
  τ_d(0,·,3) [1]     0.254 0.349 0.187 0.909   -0.094 0.351 0.132 0.943   -0.108 0.360 0.141 0.932
  τ_d(0,·,3) [2]     0.165 0.330 0.136 0.916    0.286 0.322 0.185 0.852    0.284 0.325 0.187 0.868
  τ_s(1,·,3) [3]     0.737 0.636 0.948 0.831    1.176 0.246 1.443 0.004    1.184 0.250 1.465 0.002
  τ_s(1,·,3) [2.5]   1.446 1.134 3.377 0.787    2.987 0.292 9.009 0.000    2.992 0.295 9.038 0.000

Supplement to “Spillovers of Program Benefits with Mismeasured Networks”

Lina Zhang
Job Market Paper (Click here for the latest version). September 22, 2020
In this supplemental material, Section D provides sufficient conditions under which the existing nonparametric estimation of Leung (2020b) remains consistent when no instrumental variable for the true network is available. This is important for practitioners: if one is unaware of potential network mismeasurement in a setting where the misclassification rate is limited, the findings in this section ensure that the standard nonparametric estimators are nevertheless likely to be consistent once the sample size is sufficiently large. Section E introduces some useful lemmas that are used in the proofs in the Appendix of the main text.
D Single Network Proxy
Denote the observed $N\times N$ adjacency matrix by $A$, with its $ij$-th entry $A_{ij}=1$ if $j\in\mathcal N_i$. For $(d,s,z,n)\in\{0,1\}\times\Omega_{S,Z,|\mathcal N|}$, define the conditional propensity score
\[
e(d,s,z,n)=Pr\big(D_i=d,\ S_i=s\ \big|\ Z_i=z,\ |\mathcal N_i|=n\big).
\]
The definition of $e(d,s,z,n)$ is akin to the propensity score of a multi-valued treatment (Imbens, 2000). Notably, $e(d,s,z,n)$ does not depend on the index $i$ even if the network is mismeasured, implying that it is identifiable from the observables: under Assumptions 3.2(a) and 3.4(a),
\[
e(d,s,z,n)=f_D(1)^d f_D(0)^{1-d}\,f_{S_i|Z_i=z,|\mathcal N_i|=n}(s)=f_D(1)^d f_D(0)^{1-d}\times C_n^s\,f_D(1)^s f_D(0)^{n-s},
\]
where $f_D(d):=Pr(D_i=d)$ and $C_n^s$ denotes the binomial coefficient; this expression is identical across all units in $\mathcal P$. Other studies using a single and imperfectly measured network include, e.g., Chandrasekhar and Lewis (2011), He and Song (2018), Lewbel, Qu, and Tang (2019), Sävje (2019) and Leung (2019a).
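Since $e(d,s,z,n)$ reduces to a product of Bernoulli and binomial probabilities, it can be computed directly. A minimal sketch, assuming a Bernoulli(p) treatment design with $p=f_D(1)$ supplied by the user (the function name is illustrative):

```python
from math import comb

def propensity(d, s, n, p):
    # own treatment: Bernoulli(p); treated peers: S_i | |N_i| = n ~ Binomial(n, p)
    own = p if d == 1 else 1.0 - p
    peers = comb(n, s) * p**s * (1.0 - p)**(n - s)
    return own * peers

# e(d=1, s=2, z, n=5) at p = 0.3; the value depends on neither z nor i
print(propensity(1, 2, 5, 0.3))
```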
Theorem D.1 Under the assumptions in Proposition 3.1, suppose the following conditions are satisfied.

(a) (Strict Overlap) There exists a constant $\xi\in(0,1/2)$ such that $e(d,s,z,n)\in(\xi,1-\xi)$ for all $(d,s,z,n)\in\{0,1\}\times\Omega_{S,Z,|\mathcal N|}$.

(b) (Lipschitz Condition) For any $(d,z)\in\{0,1\}\times\Omega_Z$ and all $(s,n),(s',n')\in\Omega_{S^*,|\mathcal N^*|}$, there exist two Lipschitz constants $L_S$ and $L_{|\mathcal N|}$ such that
\[
|m^*(d,s,z,n)-m^*(d,s',z,n)|<L_S|s-s'|,\qquad |m^*(d,s,z,n)-m^*(d,s,z,n')|<L_{|\mathcal N|}|n-n'|.
\]

(c) (Misclassification Rate) There exists some constant $\delta>0$ such that
\[
\sup_{i\in\mathcal P,\ (z,n)\in\Omega_{Z,|\mathcal N|}}E\big[\|A_i^*-A_i\|_1\ \big|\ Z_i=z,\ |\mathcal N_i|=n\big]=O(N^{-\delta}).
\]

Then, we have $\|m_i-m^*\|_\infty=O(N^{-\delta})$.

Theorem D.1 states that the bias of the naive estimand $m_i$ relative to the CASF $m^*$ is negligible when the sample size is sufficiently large, provided the observed network links suffer from only a limited or mild level of mismeasurement. Such a condition holds, for example, if the misclassification of the network occurs only in a subset of individuals growing at rate $O(N^{\kappa-\delta})$ ($\kappa>\delta$) while the misclassification probability converges to zero uniformly, $\sup_{i,j\in\mathcal P}E[|A_{ij}^*-A_{ij}|]=O(N^{-\kappa})$. It also holds where links are misclassified only among a shrinking share of individuals, for example at rate $O(N^{-\delta})$, while the misclassification probability is fixed: $\sup_{i,j\in\mathcal P}E[|A_{ij}^*-A_{ij}|]=p$ for a constant $p\in[0,1]$.

This result is of practical interest, because it applies to many previous studies where network misclassification may have been present although it was assumed not to be. The key assumption on the misclassification rate can be verified in some situations, for example when an upper bound on the extent to which a certain form of measurement error occurs is known and small.

Next, I show that the consistency of the nonparametric estimation of the CASF $m^*$ in Leung (2020b) is maintained, while its asymptotic convergence rate depends on the average network misclassification rate. For all $(d,s,z,n)\in\{0,1\}\times\Omega_{S,Z,|\mathcal N|}$, let
\[
\hat m(d,s,z,n):=\frac{\sum_{i=1}^N Y_i\,\hat f_{D_i,S_i,Z_i,|\mathcal N_i|}(d,s,z,n)}{\sum_{i=1}^N\hat f_{D_i,S_i,Z_i,|\mathcal N_i|}(d,s,z,n)}, \quad\text{(D.1)}
\]
where $\hat f_{D_i,S_i,Z_i,|\mathcal N_i|}$ is the kernel density.
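A direct implementation of (D.1) is straightforward. Below is a minimal sketch assuming a Gaussian product kernel over the continuous covariates and exact matching on the discrete ones; the function and argument names (m_hat, D, S, Z, Nn, Y, h) are illustrative, not the paper's notation:

```python
import numpy as np

def m_hat(d, s, z, n, D, S, Z, Nn, Y, h):
    # indicator part for the discrete covariates (D_i, S_i, |N_i|)
    disc = (D == d) & (S == s) & (Nn == n)
    # product kernel over the Q continuous covariates in Z (array of shape N x Q);
    # the normalizing constant cancels in the ratio, so it is omitted here
    kern = np.exp(-0.5 * ((Z - z) / h) ** 2).prod(axis=1) / h ** Z.shape[1]
    w = disc * kern
    return (Y * w).sum() / w.sum()      # ratio form of (D.1)
```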
Theorem D.2 (Consistency) Let the assumptions in Theorem D.1 and Assumptions 5.1-5.3 hold. Suppose $m^*(d,s,z,n)$ is twice continuously differentiable in its continuous arguments. Then,
\[
\|\hat m-m^*\|_\infty=O_p\big(N^{-\delta}+h^2+(Nh^Q)^{-1/2}\big).
\]
Three remarks are in order. First, the strict overlap assumption rules out cluster randomized trials in which groups are randomly assigned to be treated; it is likely to hold when the network is sparse and may be violated for dense networks. In the proof of Theorem D.1 (see the Appendix), I also provide an alternative sufficient assumption under which the result of Theorem D.1 still holds without relying on the strict overlap of the propensity score $e(d,s,z,n)$. Second, the large sample properties of the estimator for $m_i$ are deferred to Section 5. Third, the Lipschitz condition is satisfied by any bounded function $m^*$, since $s$ and $n$ take discrete values.
It is worth noticing that if all covariates are discrete, we can simply replace the kernel density $\hat f_{D_i,S_i,Z_i,|\mathcal N_i|}(d,s,z,n)$ in (D.1) by the indicator $1_i(d,s,z,n):=1[D_i=d,S_i=s,Z_i=z,|\mathcal N_i|=n]$, and the consistency still holds with the convergence rate $N^{-\min\{\delta,1/2\}}$. See more details in the proof of Theorem D.2.
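In the all-discrete case the estimator is just a cell mean. A minimal sketch, with illustrative names:

```python
import numpy as np

def m_hat_discrete(d, s, z, n, D, S, Z, Nn, Y):
    # average Y over units whose observed (D, S, Z, |N|) equal the evaluation point
    cell = (D == d) & (S == s) & (Z == z) & (Nn == n)
    return Y[cell].mean() if cell.any() else np.nan
```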
D.1 Simulations for Nonparametric Estimation with a Single Network Proxy

The data generating process is described in Section 6. Set the overall misclassification rate $p_\omega=0.…$, the false negative rate $1-p_U=N^{-\delta}$ and the false positive rate $p_V=50N^{-\delta-1}$. Thus, the larger the value of $\delta$, the less misclassification of the network links. Notice that by setting $p_V=50N^{-\delta-1}$ we let the misreported links maintain a sparsity similar to that of the actual network, which is common in many empirical applications; a similar design of the measurement errors is used in Lewbel et al. (2019) to study parametric estimation of endogenous peer effects in a linear model. Table 10 reports, for different values of $\delta$, the corresponding misclassification probabilities $1-p_U$ and $p_V$, the average numbers of the observed degree and of the observed treated friends, and the average numbers of misclassified links. The total number of misclassified links decreases as $\delta$ increases. In addition, the ratio between the number of misclassified links and the number of actual links decreases with the sample size, varying from 295% to less than 1% when the network degree is relatively small ($r_{deg}=5$), and from 197% to less than 1% when the network degree is relatively large ($r_{deg}=8$).

Because the covariate $Z_i$ is binary, we proceed with the nonparametric estimation of the CASF via
\[
\hat m(d,s,z,n):=\sum_{i=1}^N Y_i 1_i(d,s,z,n)\Big/\sum_{i=1}^N 1_i(d,s,z,n), \quad\text{(D.2)}
\]
where recall that $1_i(d,s,z,n):=1[D_i=d,S_i=s,Z_i=z,|\mathcal N_i|=n]$. Table 11 presents the nonparametric estimation results, including bias, standard deviation (sd), mean squared error (mse), and the ratio between the mse of the feasible nonparametric estimation in (D.2) using the observed data $\{Y_i,D_i,S_i,Z_i,|\mathcal N_i|\}_{i=1}^N$ and the mse of the infeasible nonparametric estimation using the latent data $\{Y_i,D_i,S_i^*,Z_i,|\mathcal N_i^*|\}_{i=1}^N$, obtained by replacing $1_i(d,s,z,n)$ in (D.2) with $1[D_i=d,S_i^*=s,Z_i=z,|\mathcal N_i^*|=n]$. (If the ratio between the two mean squared errors is one, the feasible nonparametric estimation performs as well as the infeasible one; a ratio larger (smaller) than one means that the feasible estimation produces a larger (smaller) mean squared error than the infeasible one.) Several patterns are easy to observe from the results. First, for low misclassification rates ($\delta\ge0.7$) the mse ratios are close to one, so the feasible estimator performs almost as well as the infeasible one, whereas for higher misclassification rates ($\delta\le0.5$) the ratios rise well above one, especially for the spillover effects. In addition, as $r_{deg}$ increases from 5 to 8, i.e., the average degree increases, the sd becomes larger. This is intuitive, because a larger average degree leaves a smaller effective sample size for the nonparametric estimator in (D.2) at any given $(d,s,z,n)$.
Table 10: Summary statistics of the observed network under the single-proxy design (p_ω = 0.…).

(a) r_deg = 5

N   δ    1−p_U(%)  p_V(%)  |N_i| avg  max   S_i avg  max   1 to 0  0 to 1  total   ratio(%)
1k  0.1  50.1      2.51    18.9       45.2  5.7      18.4  1698    14954   16652   295
1k  0.3  12.6      0.63    9.0        23.3  2.7      10.3  427.1   3757    4184    74.2
1k  0.5  3.16      0.16    6.5        17.4  1.9      8.0   106.5   943.6   1050    18.6
1k  0.7  0.79      0.04    5.9        15.9  1.8      7.5   26.80   236.1   262.9   4.66
1k  0.9  0.20      0.01    5.7        15.6  1.7      7.4   6.812   59.56   66.37   1.18
2k  0.1  46.8      1.17    18.1       45.0  5.4      18.6  3218    27970   31188   272
2k  0.3  10.2      0.26    8.4        22.9  2.5      10.4  702.6   6125    6828    59.6
2k  0.5  2.24      0.06    6.3        17.7  1.9      8.4   153.4   1340    1494    13.0
2k  0.7  0.49      0.01    5.9        16.6  1.8      8.0   33.70   292.5   326.2   2.85
2k  0.9  0.11      0.00    5.8        16.5  1.7      7.9   7.524   63.94   71.46   0.62
5k  0.1  42.7      0.43    17.1       44.1  5.1      18.7  7433    63914   71347   245
5k  0.3  7.77      0.08    7.9        22.5  2.4      10.4  1354    11646   13000   44.8
5k  0.5  1.41      0.01    6.2        18.2  1.9      8.8   246.1   2119    2365    8.15
5k  0.7  0.26      0.00    5.9        17.5  1.8      8.6   44.66   386.7   431.3   1.49
5k  0.9  0.05      0.00    5.8        17.4  1.7      8.5   8.222   70.61   78.83   0.27

(b) r_deg = 8

N   δ    1−p_U(%)  p_V(%)  |N_i| avg  max   S_i avg  max   1 to 0  0 to 1  total   ratio(%)
1k  0.1  50.1      2.51    21.1       47.4  6.4      19.1  2682    14904   17587   197
1k  0.3  12.6      0.63    12.0       27.8  3.6      12.0  676.6   3744    4421    49.6
1k  0.5  3.16      0.16    9.7        22.9  2.9      10.2  168.3   940.4   1109    12.4
1k  0.7  0.79      0.04    9.1        21.7  2.7      9.7   42.36   235.3   277.7   3.11
1k  0.9  0.20      0.01    9.0        21.5  2.7      9.6   10.79   59.37   70.17   0.79
2k  0.1  46.8      1.17    20.5       47.4  6.1      19.5  5101    27923   33025   182
2k  0.3  10.2      0.26    11.6       27.9  3.5      12.3  1114    6115    7229    39.8
2k  0.5  2.24      0.06    9.6        23.6  2.9      10.7  243.2   1338    1581    8.70
2k  0.7  0.49      0.01    9.2        22.8  2.8      10.4  53.34   292.0   345.3   1.90
2k  0.9  0.11      0.00    9.1        22.6  2.7      10.3  11.81   63.83   75.64   0.42
5k  0.1  42.7      0.43    19.6       46.9  5.9      19.7  11827   63869   75697   164
5k  0.3  7.77      0.08    11.1       28.0  3.3      12.5  2156    11638   13794   29.9
5k  0.5  1.41      0.01    9.6        24.5  2.9      11.3  392.0   2117    2509    5.44
5k  0.7  0.26      0.00    9.3        23.9  2.8      11.1  71.26   386.4   457.6   0.99
5k  0.9  0.05      0.00    9.2        23.8  2.8      11.0  12.88   70.57   83.44   0.18

Note: All the statistics are obtained by averaging over the 1000 replications; “1 to 0” is the total number of missing links (false negatives); “0 to 1” lists the total number of misreported nonexisting links (false positives); “total” displays the total number of misclassified links, including missing an existing link (1 to 0) and misreporting a nonexisting link (0 to 1); “ratio” is the ratio between the total number of misclassified links and the number of total links.
Table 11: Nonparametric estimation with a single network proxy (p_ω = 0.…). Each row reports bias, sd, mse and ratio for the four estimands, ordered as in the note below.

(a) r_deg = 5

N   δ    τ_d(0,·,3) [1]            τ_d(0,·,3) [2]            τ_s(1,·,3) [3]            τ_s(1,·,3) [2.5]
1k  0.1  0.04 1.04 1.08 2.91       0.09 1.39 1.94 2.06       0.03 0.70 0.50 2.72      -0.04 1.11 1.23 2.87
1k  0.3  -0.04 0.99 0.99 2.67      0.09 1.39 1.95 2.07      -0.15 0.75 0.59 3.23      -0.03 1.11 1.23 2.87
1k  0.5  -0.08 0.78 0.62 1.67     -0.10 1.16 1.35 1.44      -0.56 0.57 0.64 3.52      -0.39 0.92 0.99 2.31
1k  0.7  -0.05 0.65 0.43 1.17     -0.02 1.01 1.03 1.09      -0.25 0.51 0.32 1.77      -0.09 0.72 0.52 1.22
1k  0.9  -0.01 0.64 0.40 1.09      0.00 0.98 0.97 1.03      -0.07 0.44 0.20 1.11       0.00 0.67 0.45 1.06
2k  0.1  0.01 0.70 0.49 2.74       0.04 1.04 1.08 2.44       0.01 0.49 0.24 2.71       0.01 0.75 0.56 2.60
2k  0.3  -0.01 0.63 0.40 2.23      0.01 1.06 1.12 2.53      -0.20 0.48 0.27 3.14      -0.12 0.76 0.60 2.76
2k  0.5  -0.02 0.49 0.24 1.37     -0.05 0.80 0.65 1.46      -0.44 0.38 0.34 3.84      -0.26 0.58 0.40 1.85
2k  0.7  0.02 0.43 0.19 1.05      -0.01 0.69 0.48 1.08      -0.12 0.31 0.11 1.31      -0.07 0.50 0.25 1.16
2k  0.9  0.01 0.41 0.17 0.96       0.01 0.66 0.44 0.99      -0.03 0.32 0.10 1.15      -0.04 0.46 0.21 0.98
5k  0.1  0.02 0.43 0.19 2.81       0.00 0.69 0.48 3.04       0.02 0.32 0.10 2.50       0.01 0.47 0.22 2.61
5k  0.3  -0.02 0.38 0.14 2.11     -0.03 0.61 0.37 2.37      -0.35 0.32 0.22 5.58      -0.22 0.46 0.26 3.09
5k  0.5  0.01 0.30 0.09 1.31      -0.04 0.46 0.22 1.37      -0.31 0.21 0.14 3.56      -0.21 0.34 0.16 1.96
5k  0.7  0.00 0.27 0.07 1.05      -0.01 0.41 0.17 1.08      -0.07 0.20 0.04 1.10      -0.05 0.31 0.10 1.20
5k  0.9  0.00 0.26 0.07 1.02      -0.02 0.41 0.17 1.05      -0.01 0.20 0.04 0.96      -0.02 0.29 0.09 1.04

(b) r_deg = 8

N   δ    τ_d(0,·,3) [1]            τ_d(0,·,3) [2]            τ_s(1,·,3) [3]            τ_s(1,·,3) [2.5]
1k  0.1  0.07 1.60 2.55 1.78       0.03 1.53 2.33 0.96       0.07 1.42 2.02 2.36      -0.13 1.65 2.74 1.28
1k  0.3  0.05 1.57 2.46 1.71       0.32 1.78 3.27 1.35      -0.01 1.39 1.93 2.25       0.18 1.83 3.39 1.58
1k  0.5  -0.02 1.48 2.19 1.52     -0.19 1.79 3.25 1.34      -0.42 1.20 1.60 1.87      -0.32 2.01 4.13 1.92
1k  0.7  0.04 1.33 1.77 1.23       0.02 1.89 3.57 1.47      -0.18 1.01 1.05 1.23      -0.09 1.62 2.64 1.23
1k  0.9  -0.10 1.21 1.48 1.03      0.07 1.63 2.67 1.10      -0.01 0.94 0.89 1.04       0.01 1.57 2.48 1.15
2k  0.1  0.02 1.34 1.80 2.44       0.05 1.77 3.12 1.22      -0.04 1.02 1.04 2.74       0.02 1.73 3.01 1.59
2k  0.3  -0.07 1.40 1.97 2.67      0.12 1.90 3.61 1.41      -0.10 1.10 1.22 3.22       0.06 1.87 3.50 1.85
2k  0.5  0.01 1.10 1.20 1.63      -0.06 1.76 3.10 1.21      -0.32 0.80 0.74 1.96      -0.04 1.60 2.55 1.34
2k  0.7  -0.01 0.91 0.83 1.13      0.04 1.55 2.41 0.94      -0.08 0.66 0.45 1.18      -0.12 1.42 2.03 1.07
2k  0.9  0.04 0.87 0.76 1.03       0.03 1.54 2.39 0.93       0.01 0.62 0.39 1.03       0.01 1.35 1.81 0.96
5k  0.1  0.01 0.95 0.90 2.88      -0.19 1.63 2.71 1.71       0.02 0.66 0.44 2.58      -0.12 1.42 2.03 2.18
5k  0.3  0.00 0.93 0.87 2.78       0.01 1.78 3.16 1.99      -0.23 0.67 0.50 2.94      -0.12 1.65 2.73 2.93
5k  0.5  -0.01 0.66 0.44 1.42     -0.03 1.49 2.21 1.39      -0.19 0.48 0.27 1.59      -0.19 1.21 1.50 1.61
5k  0.7  0.02 0.55 0.30 0.97      -0.05 1.32 1.75 1.10      -0.03 0.41 0.17 1.01      -0.07 0.99 0.99 1.06
5k  0.9  0.00 0.56 0.31 1.01       0.02 1.29 1.66 1.04      -0.01 0.40 0.16 0.96       0.01 0.98 0.97 1.04
Note: The true values are τ_d(0,·,3) = 1, τ_d(0,·,3) = 2, τ_s(1,·,3) = 3 and τ_s(1,·,3) = 2.5. Column “ratio” lists the ratio between the mse of the feasible estimation using the observed data {Y_i, D_i, S_i, Z_i, |N_i|}_{i=1}^N and that of the infeasible estimation using the latent data {Y_i, D_i, S_i^*, Z_i, |N_i^*|}_{i=1}^N.

D.2 Proofs of Section D
Proof of Theorem D.1. We prove this theorem in three steps. First, we show that the absolute difference between $m_i(d,s,z,n)$ and $m^*(d,s,z,n)$ is proportional to
\[
\Delta_A:=E\big[\|A_i^*-A_i\|_1\ \big|\ D_i=d,\ S_i=s,\ Z_i=z,\ |\mathcal N_i|=n\big].
\]
Second, we verify that condition (c) restricts the uniform convergence rate of $\Delta_A$. Last, we present the alternative assumption that relaxes the strict overlap condition on the propensity score.

Step 1. Recall that Proposition 3.1 demonstrates
\[
m_i(d,s,z,n)=E\big[m^*(d,S_i^*,z,|\mathcal N_i^*|)\ \big|\ D_i=d,S_i=s,Z_i=z,|\mathcal N_i|=n\big]. \quad\text{(D.3)}
\]
In addition, we can get the following conditional expectation:
\[
E\big[m^*(d,S_i,z,|\mathcal N_i|)\ \big|\ D_i=d,S_i=s,Z_i=z,|\mathcal N_i|=n\big]
=E\big[m^*(d,s,z,n)\ \big|\ D_i=d,S_i=s,Z_i=z,|\mathcal N_i|=n\big]=m^*(d,s,z,n). \quad\text{(D.4)}
\]
Based on (D.3) and (D.4),
\[
m_i(d,s,z,n)-m^*(d,s,z,n)=E\big[m^*(d,S_i^*,z,|\mathcal N_i^*|)-m^*(d,S_i,z,|\mathcal N_i|)\ \big|\ D_i=d,S_i=s,Z_i=z,|\mathcal N_i|=n\big]. \quad\text{(D.5)}
\]
By the Lipschitz condition on $m^*$, we know that
\[
\big|m^*(d,S_i^*,z,|\mathcal N_i^*|)-m^*(d,S_i,z,|\mathcal N_i|)\big|
\le\big|m^*(d,S_i^*,z,|\mathcal N_i^*|)-m^*(d,S_i,z,|\mathcal N_i^*|)\big|+\big|m^*(d,S_i,z,|\mathcal N_i^*|)-m^*(d,S_i,z,|\mathcal N_i|)\big|
\le L_S|S_i^*-S_i|+L_{|\mathcal N|}\big||\mathcal N_i^*|-|\mathcal N_i|\big|. \quad\text{(D.6)}
\]
For any generic set $\mathcal A$, let $\mathcal A^c$ be its complement. Then,
\[
|S_i^*-S_i|=\Big|\sum_{j\in\mathcal N_i^*}D_j-\sum_{j\in\mathcal N_i}D_j\Big|
=\Big|\sum_{j\in\mathcal N_i^*\cap\mathcal N_i^c}D_j-\sum_{j\in\mathcal N_i\cap(\mathcal N_i^*)^c}D_j\Big|
\le\Big|\sum_{j\in\mathcal N_i^*\cap\mathcal N_i^c}D_j\Big|+\Big|\sum_{j\in\mathcal N_i\cap(\mathcal N_i^*)^c}D_j\Big|
\le\big|\mathcal N_i^*\cap\mathcal N_i^c\big|+\big|\mathcal N_i\cap(\mathcal N_i^*)^c\big|
=\big|(\mathcal N_i^*\cap\mathcal N_i^c)\cup(\mathcal N_i\cap(\mathcal N_i^*)^c)\big|,
\]
where the second inequality holds because $D_j$ can only take values in $\{0,1\}$, and the last equality is due to the sets $\mathcal N_i^*\cap\mathcal N_i^c$ and $\mathcal N_i\cap(\mathcal N_i^*)^c$ being mutually exclusive. Moreover, because the cardinality of the set $(\mathcal N_i^*\cap\mathcal N_i^c)\cup(\mathcal N_i\cap(\mathcal N_i^*)^c)$ is $\sum_{j\in\mathcal P}|A_{ij}^*-A_{ij}|$, we have that
\[
|S_i^*-S_i|\le\sum_{j\in\mathcal P}|A_{ij}^*-A_{ij}|=\|A_i^*-A_i\|_1. \quad\text{(D.7)}
\]
In addition, by the definitions of $|\mathcal N_i^*|$ and $|\mathcal N_i|$, we know that
\[
\big||\mathcal N_i^*|-|\mathcal N_i|\big|=\Big|\sum_{j\in\mathcal P}A_{ij}^*-\sum_{j\in\mathcal P}A_{ij}\Big|\le\sum_{j\in\mathcal P}|A_{ij}^*-A_{ij}|=\|A_i^*-A_i\|_1. \quad\text{(D.8)}
\]
Substituting (D.6), (D.7) and (D.8) into (D.5), for some constant $C>0$,
\[
\big|m_i(d,s,z,n)-m^*(d,s,z,n)\big|
\le E\big[L_S|S_i^*-S_i|+L_{|\mathcal N|}\big||\mathcal N_i^*|-|\mathcal N_i|\big|\ \big|\ D_i=d,S_i=s,Z_i=z,|\mathcal N_i|=n\big]
\le C\,E\big[\|A_i^*-A_i\|_1\ \big|\ D_i=d,S_i=s,Z_i=z,|\mathcal N_i|=n\big]. \quad\text{(D.9)}
\]

Step 2. By the law of iterated expectations,
\[
E\big[\|A_i^*-A_i\|_1\ \big|\ Z_i=z,|\mathcal N_i|=n\big]
=\sum_{(d,s)\in\{0,1\}\times\Omega_S}E\big[\|A_i^*-A_i\|_1\ \big|\ D_i=d,S_i=s,Z_i=z,|\mathcal N_i|=n\big]f_{D_i,S_i|Z_i=z,|\mathcal N_i|=n}(d,s). \quad\text{(D.10)}
\]
Besides, under strict overlap, we have $e(d,s,z,n)=f_{D_i,S_i|Z_i=z,|\mathcal N_i|=n}(d,s)>\xi>0$ for all $(d,s,z,n)\in\{0,1\}\times\Omega_{S,Z,|\mathcal N|}$. Consequently, from (D.10),
\[
0\le\xi\sum_{(d,s)\in\{0,1\}\times\Omega_S}E\big[\|A_i^*-A_i\|_1\ \big|\ D_i=d,S_i=s,Z_i=z,|\mathcal N_i|=n\big]\le E\big[\|A_i^*-A_i\|_1\ \big|\ Z_i=z,|\mathcal N_i|=n\big],
\]
which, together with condition (c), implies that
\[
\sup_{i\in\mathcal P,\ (d,s,z,n)\in\{0,1\}\times\Omega_{S,Z,|\mathcal N|}}E\big[\|A_i^*-A_i\|_1\ \big|\ D_i=d,S_i=s,Z_i=z,|\mathcal N_i|=n\big]=O(N^{-\delta}). \quad\text{(D.11)}
\]
Therefore, we can conclude from (D.9) that
\[
\sup_{i\in\mathcal P,\ (d,s,z,n)\in\{0,1\}\times\Omega_{S^*,Z,|\mathcal N^*|}}\big|m_i(d,s,z,n)-m^*(d,s,z,n)\big|=O(N^{-\delta}),
\]
which, by the definition of $\|\cdot\|_\infty$, implies $\|m_i-m^*\|_\infty=O(N^{-\delta})$.
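The counting bounds (D.7) and (D.8) are easy to verify numerically. A small self-contained check on simulated data (all objects below are simulated stand-ins, not the paper's data):

```python
import numpy as np

# Check |S_i^* - S_i| <= ||A_i^* - A_i||_1 and ||N_i^*| - |N_i|| <= ||A_i^* - A_i||_1
rng = np.random.default_rng(1)
N = 200
A_star = np.triu((rng.random((N, N)) < 0.03).astype(int), 1)
A_star = A_star + A_star.T                       # true undirected network
F = np.triu((rng.random((N, N)) < 0.01).astype(int), 1)
F = F + F.T                                      # symmetric link flips
A = np.abs(A_star - F)                           # observed proxy with flipped links
D = rng.integers(0, 2, N)                        # random treatment assignment
l1 = np.abs(A_star - A).sum(axis=1)              # ||A_i^* - A_i||_1, row by row
assert (np.abs(A_star @ D - A @ D) <= l1).all()              # bound (D.7)
assert (np.abs(A_star.sum(1) - A.sum(1)) <= l1).all()        # bound (D.8)
print("bounds (D.7) and (D.8) hold on this draw")
```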
Step 3. Now, consider the following assumption.

Assumption D.1 (Misclassification Rate) There exists some constant $\delta>0$ such that
\[
\sup_{i\in\mathcal P,\ (d,s,z,n)\in\{0,1\}\times\Omega_{S,Z,|\mathcal N|}}E\big[\|A_i^*-A_i\|_1\ \big|\ D_i=d,S_i=s,Z_i=z,|\mathcal N_i|=n\big]=O(N^{-\delta}).
\]
Assumption D.1 directly restricts the conditional misclassification rate of $\|A_i^*-A_i\|_1$ given $(D_i=d,S_i=s,Z_i=z,|\mathcal N_i|=n)$. Then, under Assumption D.1 and (D.9), we can get the desired result without relying on the strict overlap of the propensity score.
Denote X i = ( D i , S i , Z i , |N i | ) and x = ( d, s, z, n ). Define u i = Y i − E [ Y i | X i ] = Y i − m i ( X i ). Without loss of generality, suppose Z i is continuous. Let ˆ f keri ( x ) :=1[ D i = d, S i = s, |N i | = n ] K (cid:0) Z i − zh (cid:1) and ˆ f X i ( x ) := 1 /N P Ni =1 ˆ f keri ( x ). Then, by Theorem 5.2, weknow that | ˆ f X i ( x ) − f X i ( x ) | = o p (1). To establish the consistency, we rewriteˆ m ( x ) − m ∗ ( x ) = N P Ni =1 [ Y i − m ∗ ( x )] ˆ f keri ( x ) N P Ni =1 ˆ f keri ( x ) = ˆ M ( x ) + ˆ M ( x )ˆ f X i ( x ) , where ˆ M ( x ) = 1 N N X i =1 [ m i ( X i ) − m ∗ ( x )] ˆ f keri ( x ) , ˆ M ( x ) = 1 N N X i =1 u i ˆ f keri ( x ) . By notation abuse, let Q denote the dimension of Z i and let κ ( z/h ) = Q Qq =1 κ ( z q /h ), where z = ( z , ..., z Q ) ∈ R Q . In addition, since m ∗ ( x ) and f X i ( x ) are twice continuously differentiable inthe argument z , by Taylor expansion, for v ∈ R Q f X i ( d, s, z + hv, n ) = f X i ( x ) + h ∂f X i ( x ) ∂z ′ v + h v ′ ∂ f X i (˜ x ) ∂z∂z ′ v,m ∗ ( d, s, z + hv, n ) = m ∗ ( x ) + h ∂m ∗ ( x ) ∂z ′ v + h v ′ ∂ m ∗ (˜ x ) ∂z∂z ′ v, (D.12)where ˜ z is between z and z + hv , and ˜ x = ( d, s, ˜ z, n ). Let x = ( d , s , z , n ). Due that X i isidentically distributed, E h ˆ M ( x ) i = E n [ m i ( X i ) − m ∗ ( x )] ˆ f keri ( x ) o = 1 h Q X ( d ,s ,n ) ∈{ , }× Ω S, |N| Z [ m i ( x ) − m ∗ ( x )]1[ d = d, s = s, n = n ] κ (cid:18) z − zh (cid:19) f X i ( x ) dz = 1 h Q Z [ m i ( d, s, z , n ) − m ∗ ( x )] κ (cid:18) z − zh (cid:19) f X i ( d, s, z , n ) dz = Z [ m i ( d, s, z + hv, n ) − m ∗ ( x )] κ ( v ) f X i ( d, s, z + hv, n ) dv = Z [ m i ( d, s, z + hv, n ) − m ∗ ( d, s, z + hv, n )] κ ( v ) f X i ( d, s, z + hv, n ) dv + Z [ m ∗ ( d, s, z + hv, n ) − m ∗ ( x )] κ ( v ) f X i ( d, s, z + hv, n ) dv, (D.13)where the first term on the right hand side of (D.13) can be bounded as below: (cid:12)(cid:12)(cid:12)(cid:12)Z [ m i ( d, s, z + hv, n ) − m ∗ ( d, s, z + hv, n )] κ ( v ) f X i ( d, s, z + hv, n ) dv (cid:12)(cid:12)(cid:12)(cid:12) Z | m i ( d, s, z + hv, n ) − m ∗ ( d, s, z + hv, n ) | κ ( v ) f X i ( d, s, z + hv, n ) dv ≤ k m i − m ∗ k ∞ (cid:18) f X i ( x ) Z κ ( v ) dv + O ( h ) (cid:19) = O ( N − δ ) , (D.14)with the second inequality follows from the expansion of f X i in (D.12), R vκ ( v ) dv = 0 and theboundedness of the second derivative of f X i ( x ) to z , and the last equality is due to R κ ( v ) dv = 1and Theorem D.1. In addition, for the second term of (D.13), by the expansions in (D.12) Z [ m ∗ ( d, s, z + hv, n ) − m ∗ ( x )] κ ( v ) f X i ( d, s, z + hv, n ) dv = h Z (cid:20) v ′ ∂f X i ( x ) ∂z ∂m ∗ ( x ) ∂z ′ v + f X i ( x ) v ′ ∂ m ∗ (˜ x ) ∂z∂z ′ v (cid:21) κ ( v ) dv + O ( h )= O ( h ) , (D.15)where the last line O ( h ) comes from the boundedness of f X i ( x ) and its first order derivative (As-sumption 5.2), the compactness of Ω X and the continuity of the first and second order derivativesof m ∗ in z . Given (D.13), (D.14) and (D.15), E h ˆ M ( x ) i = O ( N − δ + h ) . (D.16)Next, we tackle the variance of ˆ M ( x ). For notation simplicity, let ˆ M ,i ( x ) := [ m i ( X i ) − m ∗ ( x )] ˆ f keri ( x ).Then, ˆ M ( x ) = 1 /N P Ni =1 ˆ M ,i ( x ) and by Assumption 5.1 V ar [ ˆ M ( x )] = 1 N " N X i =1 X j ∈ ∆( i,N ) Cov (cid:16) ˆ M ,i ( x ) , ˆ M ,j ( x ) (cid:17) + N X i =1 X j ∆( i,N ) Cov (cid:16) ˆ M ,i ( x ) , ˆ M ,j ( x ) (cid:17) = 1 N N X i =1 X j ∈ ∆( i,N ) Cov (cid:16) ˆ M ,i ( x ) , ˆ M ,j ( x ) (cid:17) + s.o., where we use s.o. 
to denote terms of smaller order. By Assumption 5.2, since $\frac{1}{N}\sum_{i=1}^N |\Delta(i,N)| \le \frac{1}{N}\sum_{i=1}^N |\Delta(i,N)|^2 = O(1)$, we can get
$$\frac{1}{N^2} \sum_{i=1}^N \sum_{j \in \Delta(i,N)} Cov\big(\hat M_{1,i}(x), \hat M_{1,j}(x)\big) \le \frac{1}{N^2} \sum_{i=1}^N \sum_{j \in \Delta(i,N)} Var\big[\hat M_{1,i}(x)\big] \le \frac{1}{N} Var\big[\hat M_{1,i}(x)\big] \cdot \frac{1}{N} \sum_{i=1}^N |\Delta(i,N)| = \frac{1}{N} Var\big[\hat M_{1,i}(x)\big]\, O(1). \tag{D.17}$$
Moreover, by a change of variables and simple algebra,
$$\frac{1}{N} Var\big[\hat M_{1,i}(x)\big] \le \frac{1}{N} E\Big[ [m_i(X_i) - m^*(x)]^2 \big(\hat f^{ker}_i(x)\big)^2 \Big] = \frac{1}{N h^{2Q}} \int [m_i(d, s, z_0, n) - m^*(x)]^2 K^2\!\left(\frac{z_0 - z}{h}\right) f_{X_i}(d, s, z_0, n)\, dz_0$$
$$= \frac{1}{N h^Q} \int [m_i(d, s, z + hv, n) - m^*(x)]^2 K^2(v)\, f_{X_i}(d, s, z + hv, n)\, dv$$
$$= \frac{1}{N h^Q} \int [m_i(d, s, z + hv, n) - m^*(d, s, z + hv, n)]^2 K^2(v)\, f_{X_i}(d, s, z + hv, n)\, dv + \frac{1}{N h^Q} \int [m^*(d, s, z + hv, n) - m^*(x)]^2 K^2(v)\, f_{X_i}(d, s, z + hv, n)\, dv$$
$$+ \frac{2}{N h^Q} \int [m_i(d, s, z + hv, n) - m^*(d, s, z + hv, n)]\, [m^*(d, s, z + hv, n) - m^*(x)]\, K^2(v)\, f_{X_i}(d, s, z + hv, n)\, dv := V_{1M} + V_{2M} + V_{3M}.$$
Let us start from
$V_{1M}$. By Assumption 5.2, $\int K^2(v)\, dv < \infty$, so together with Theorem D.1,
$$V_{1M} \le \frac{C}{N h^Q} \|m_i - m^*\|_\infty^2 \int K^2(v)\, dv = O\big( (N h^Q)^{-1} N^{-2\delta} \big). \tag{D.18}$$
Next, for $V_{2M}$, based on (D.12) and the boundedness of the derivatives,
$$V_{2M} = \frac{1}{N h^Q} \int \left[ h \frac{\partial m^*(x)}{\partial z'} v + \frac{h^2}{2} v' \frac{\partial^2 m^*(\tilde x)}{\partial z' \partial z} v \right]^2 \left[ f_{X_i}(x) + h \frac{\partial f_{X_i}(x)}{\partial z'} v + \frac{h^2}{2} v' \frac{\partial^2 f_{X_i}(\tilde x)}{\partial z' \partial z} v \right] K^2(v)\, dv$$
$$= \frac{h^2}{N h^Q} f_{X_i}(x) \int v' \frac{\partial m^*(x)}{\partial z} \frac{\partial m^*(x)}{\partial z'} v\, K^2(v)\, dv + s.o. = O\big( (N h^Q)^{-1} h^2 \big). \tag{D.19}$$
According to the Cauchy–Schwarz inequality, we can then get $V_{3M} = O\big( (N h^Q)^{-1} N^{-\delta} h \big)$, which together with (D.18) and (D.19) implies that
$$Var\big[\hat M_1(x)\big] = O\!\left( \frac{(N^{-\delta} + h)^2}{N h^Q} \right). \tag{D.20}$$
Thus, based on (D.16), (D.20) and the fact that $N h^Q \to \infty$ as $N \to \infty$ (Assumption 5.2),
$$\hat M_1(x) = O_p\big( N^{-\delta} + h^2 + (N^{-\delta} + h)(N h^Q)^{-1/2} \big) = O_p\big( N^{-\delta} + h^2 + h (N h^Q)^{-1/2} \big). \tag{D.21}$$
Next, let us deal with $\hat M_2(x)$. Observe that $E[\hat M_2(x)] = 0$ and, by Assumption 5.1,
$$E\big[\hat M_2(x)^2\big] = \frac{1}{N^2} E\left[ \left( \sum_{i=1}^N u_i \hat f^{ker}_i(x) \right)^2 \right] = \frac{1}{N^2} \sum_{i=1}^N \sum_{j \in \Delta(i,N)} Cov\big( u_i \hat f^{ker}_i(x), u_j \hat f^{ker}_j(x) \big) + s.o.$$
Although $u_i$ may not be identically distributed,
$$\frac{1}{N^2} \sum_{i=1}^N \sum_{j \in \Delta(i,N)} Cov\big( u_i \hat f^{ker}_i(x), u_j \hat f^{ker}_j(x) \big) \le \frac{1}{N} \max_{i \in \mathcal{P}} Var\big[ u_i \hat f^{ker}_i(x) \big]\, O(1).$$
For all $i \in \mathcal{P}$, let $\sigma^2_i(X_i) := Var[u_i \mid X_i]$, where the subscript $i$ is used to capture the possibly non-identical distribution of $Y_i$ given $X_i$. By the law of iterated expectations and (D.12),
$$\frac{1}{N} Var\big[ u_i \hat f^{ker}_i(x) \big] \le \frac{1}{N h^{2Q}} E\left[ u_i^2\, \mathbb{1}[D_i = d, S_i = s, |\mathcal{N}_i| = n]\, K^2\!\left(\frac{Z_i - z}{h}\right) \right] = \frac{1}{N h^{2Q}} E\left[ \sigma^2_i(X_i)\, \mathbb{1}[D_i = d, S_i = s, |\mathcal{N}_i| = n]\, K^2\!\left(\frac{Z_i - z}{h}\right) \right]$$
$$= \frac{1}{N h^{2Q}} \int \sigma^2_i(d, s, z_0, n)\, K^2\!\left(\frac{z_0 - z}{h}\right) f_{X_i}(d, s, z_0, n)\, dz_0 = \frac{1}{N h^Q} \int \sigma^2_i(d, s, z + hv, n)\, K^2(v)\, f_{X_i}(d, s, z + hv, n)\, dv$$
$$= \frac{1}{N h^Q} f_{X_i}(x) \int \sigma^2_i(d, s, z + hv, n)\, K^2(v)\, dv + s.o. \le \frac{C}{N h^Q}, \tag{D.22}$$
where the last line is by Assumption 5.2 that $E[Y_i^2 \mid X_i = x] < \infty$ as $\Omega_W$ is compact, and $f_{X_i}(x)$ is bounded for all $x \in \Omega_X$. Thus, according to (D.22), we know that $\hat M_2(x) = O_p\big( (N h^Q)^{-1/2} \big)$, which together with (D.21) indicates that
$$\hat m(x) - m^*(x) = \frac{\hat M_1(x) + \hat M_2(x)}{f_{X_i}(x) + o_p(1)} = O_p\big( N^{-\delta} + h^2 + (N h^Q)^{-1/2} \big).$$
The above discussion completes the proof of Theorem D.2.

In what follows, we consider the case when the covariates in $Z_i$ are all discrete. We want to show that replacing the kernel density in $\hat m(x)$ by the indicator $\mathbb{1}_i(x) = \mathbb{1}[X_i = x]$ gives a consistent estimator of $m^*(x)$. Firstly, for the numerator, because of Assumption 5.1 we have
$$Var\left[ \frac{1}{N} \sum_{i=1}^N Y_i \mathbb{1}_i(x) \right] = \frac{1}{N^2} \sum_{i=1}^N \sum_{j \in \Delta(i,N)} Cov\big( Y_i \mathbb{1}_i(x), Y_j \mathbb{1}_j(x) \big) + s.o.$$
By the compactness of $\Omega_{W^c}$, we know that $Var[Y_i \mathbb{1}_i(x)] < \infty$ for any $x$ and all $i \in \mathcal{P}$, so
$$Var\left[ \frac{1}{N} \sum_{i=1}^N Y_i \mathbb{1}_i(x) \right] \le \frac{C}{N^2} \sum_{i=1}^N |\Delta(i,N)| = O(N^{-1}),$$
since $\frac{1}{N}\sum_{i=1}^N |\Delta(i,N)| = O(1)$. Then, by Chebyshev's inequality, for any $\epsilon > 0$,
$$Pr\left( \left| \frac{1}{N}\sum_{i=1}^N Y_i \mathbb{1}_i(x) - \frac{1}{N}\sum_{i=1}^N E[Y_i \mathbb{1}_i(x)] \right| > \epsilon \right) \le Var\left[ \frac{1}{N}\sum_{i=1}^N Y_i \mathbb{1}_i(x) \right] \Big/ \epsilon^2 = O(N^{-1}),$$
which implies that
$$\left| \frac{1}{N}\sum_{i=1}^N Y_i \mathbb{1}_i(x) - \frac{1}{N}\sum_{i=1}^N E[Y_i \mathbb{1}_i(x)] \right| = O_p(N^{-1/2}). \tag{D.23}$$
Recall that $X_i$ is identically distributed.
Then, Theorem D.1 leads to
$$\left| \frac{1}{N}\sum_{i=1}^N E[Y_i \mathbb{1}_i(x)] - m^*(x) f_{X_i}(x) \right| = \left| \frac{1}{N}\sum_{i=1}^N E[Y_i \mid X_i = x]\, f_{X_i}(x) - m^*(x) f_{X_i}(x) \right| \le \frac{1}{N}\sum_{i=1}^N |m_i(x) - m^*(x)|\, f_{X_i}(x) = O(N^{-\delta}). \tag{D.24}$$
Given (D.23) and (D.24), we have
$$\left| \frac{1}{N}\sum_{i=1}^N Y_i \mathbb{1}_i(x) - m^*(x) f_{X_i}(x) \right| \le \left| \frac{1}{N}\sum_{i=1}^N Y_i \mathbb{1}_i(x) - \frac{1}{N}\sum_{i=1}^N E[Y_i \mathbb{1}_i(x)] \right| + \left| \frac{1}{N}\sum_{i=1}^N E[Y_i \mathbb{1}_i(x)] - m^*(x) f_{X_i}(x) \right| = O_p(N^{-1/2} + N^{-\delta}). \tag{D.25}$$
Because $E[\mathbb{1}_i(x)] = f_{X_i}(x)$, replacing $Y_i$ in (D.23) by a constant one gives
$$\left| \frac{1}{N}\sum_{i=1}^N \mathbb{1}_i(x) - f_{X_i}(x) \right| = O_p(N^{-1/2}). \tag{D.26}$$
Let $\hat\beta(x) = \frac{1}{N}\sum_{i=1}^N Y_i \mathbb{1}_i(x)$, $\beta(x) = m^*(x) f_{X_i}(x)$, and $\hat f_{X_i}(x) = \frac{1}{N}\sum_{i=1}^N \mathbb{1}_i(x)$. Based on (D.25) and (D.26),
$$\frac{\sum_{i=1}^N Y_i \mathbb{1}_i(x)}{\sum_{i=1}^N \mathbb{1}_i(x)} - m^*(x) = \frac{\hat\beta(x)}{\hat f_{X_i}(x)} - \frac{\beta(x)}{f_{X_i}(x)} = \frac{(\hat\beta(x) - \beta(x)) f_{X_i}(x) + \beta(x)(f_{X_i}(x) - \hat f_{X_i}(x))}{\hat f_{X_i}(x)\, f_{X_i}(x)} = \frac{(\hat\beta(x) - \beta(x)) f_{X_i}(x) + \beta(x)(f_{X_i}(x) - \hat f_{X_i}(x))}{f_{X_i}(x)^2}\, (1 + o_p(1)) = O_p(N^{-1/2} + N^{-\delta}).$$
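The two estimators analyzed above are standard plug-in objects, so a brief numerical sketch may help fix ideas. The snippet below is a minimal illustration, not the paper's implementation: the data-generating process, variable names and bandwidth choice are all hypothetical, with a Gaussian kernel standing in for $\kappa$.

```python
# A minimal numerical sketch (not the paper's code) of the two conditional-mean
# estimators analyzed in Theorem D.2: a kernel estimator when Z_i is continuous,
# and the indicator (frequency) estimator when all covariates are discrete.
# The data-generating process below is hypothetical.
import numpy as np

rng = np.random.default_rng(0)
N = 5000
D = rng.integers(0, 2, N)            # binary treatment D_i
S = rng.integers(0, 3, N)            # discrete exposure statistic S_i
Z = rng.normal(size=N)               # one continuous covariate Z_i (Q = 1)
Y = 1.0 + D + 0.5 * S + np.sin(Z) + rng.normal(scale=0.3, size=N)

def m_hat_kernel(d, s, z, h):
    """Indicator for the discrete covariates times a Gaussian kernel in Z,
    mirroring the weights f_hat_i^ker(x) used in the proof."""
    w = (D == d) * (S == s) * np.exp(-0.5 * ((Z - z) / h) ** 2)
    return np.sum(w * Y) / np.sum(w)

def m_hat_indicator(d, s):
    """Frequency estimator when all covariates are discrete: a cell mean."""
    cell = (D == d) & (S == s)
    return Y[cell].mean()

h = N ** (-1 / 5)                    # any h with N * h^Q -> infinity works
print(m_hat_kernel(1, 2, 0.0, h))    # approximately 1 + 1 + 0.5 * 2 = 3
print(m_hat_indicator(1, 2))         # cell mean, averaging sin(Z) over the cell
```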
Useful Lemmas

This section presents several lemmas that are used in the proofs of Appendix Section B.
Lemma E.1
Denote by $\mathcal{H}$ a set of measurable functions such that $|h| \le 1$ for all $h \in \mathcal{H}$, and denote $sign(x) = \mathbb{1}[x \ge 0] - \mathbb{1}[x < 0]$ for any real value $x$. For any random variables $X$ and $Z$, a solution to $\max_{h \in \mathcal{H}} |E[X h(Z)]|$ is $h_0(Z) = sign(E[X \mid Z])$, and $\max_{h \in \mathcal{H}} |E[X h(Z)]| = E[X\, sign(E[X \mid Z])]$.

Proof of Lemma E.1.
By the law of iterated expectations,
$$|E[X h(Z)]| = \left| \int E[X \mid Z]\, h(Z)\, dPr(Z) \right| \le \int |E[X \mid Z]\, h(Z)|\, dPr(Z) \le \int |E[X \mid Z]|\, dPr(Z).$$
Then, by $|E[X \mid Z]| = E[X \mid Z]\, sign(E[X \mid Z])$, it is clear that $h_0(Z) = sign(E[X \mid Z])$ attains this upper bound.
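Lemma E.1 is easy to probe numerically. The following minimal Monte Carlo sketch, under a hypothetical joint law for $(X, Z)$ with $E[X \mid Z] = \sin(Z)$, checks that no competitor $h$ with $|h| \le 1$ beats $h_0(Z) = sign(E[X \mid Z])$.

```python
# Monte Carlo check of Lemma E.1 under a hypothetical law: among functions
# bounded by one, h0(Z) = sign(E[X|Z]) maximizes |E[X h(Z)]|, and the maximum
# equals E[X sign(E[X|Z])] = E[|E[X|Z]|].
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
Z = rng.normal(size=n)
X = np.sin(Z) + rng.normal(size=n)   # so that E[X|Z] = sin(Z)

h0 = np.sign(np.sin(Z))              # the claimed maximizer sign(E[X|Z])
value_h0 = np.mean(X * h0)

# Compare against arbitrary competitors h with |h| <= 1.
competitors = [np.cos(Z), np.tanh(Z), np.sign(Z), np.full(n, 1.0)]
for h in competitors:
    assert abs(np.mean(X * h)) <= value_h0 + 3 / np.sqrt(n)  # MC tolerance

print(value_h0, np.mean(np.abs(np.sin(Z))))  # both estimate E[|E[X|Z]|]
```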
Lemma E.2 (Uniform Law of Large Numbers under Dependency Neighborhoods)

For any function $b: \Omega_W \times \Theta \to \mathbb{R}^p$, suppose the following conditions hold:
(i) $\Theta$ is compact;
(ii) $b(w; \theta)$ is continuous in $\theta$ over $\Theta$;
(iii) there exists a function $h(w)$ with $\|b(w; \theta)\| \le h(w)$ for all $\theta \in \Theta$ and $E[h(W_i)] < \infty$;
(iv) for some constant $\eta > 0$, define
$$u(w; \theta, \eta) = \sup_{\theta' \in \Theta,\, \|\theta' - \theta\| < \eta} \|b(w; \theta') - b(w; \theta)\|, \qquad \Sigma^b_N(\theta) = \sum_{i=1}^N \sum_{j \in \Delta(i,N)} Cov\big(b(W_i; \theta), b(W_j; \theta)\big), \qquad \Sigma^u_N(\theta, \eta) = \sum_{i=1}^N \sum_{j \in \Delta(i,N)} Cov\big(u(W_i; \theta, \eta), u(W_j; \theta, \eta)\big),$$
and assume:
(a) for all $\theta \in \Theta$ and any fixed $\eta$,
$$\left\| \sum_{i=1}^N \sum_{j \notin \Delta(i,N)} Cov\big(b(W_i; \theta), b(W_j; \theta)\big) \right\| = o\big( \|\Sigma^b_N(\theta)\| \big), \qquad \sum_{i=1}^N \sum_{j \notin \Delta(i,N)} Cov\big(u(W_i; \theta, \eta), u(W_j; \theta, \eta)\big) = o\big( \Sigma^u_N(\theta, \eta) \big);$$
(b) $\frac{1}{N}\sum_{i=1}^N |\Delta(i,N)| = O(1)$;
(c) $\sup_{\theta \in \Theta} E\big[ \|b(W_i; \theta)\|^{2+\delta} \big] < C$ for some constants $\delta > 0$ and $C > 0$, and all $i$.
Then $E[b(W_i; \theta)]$ is continuous in $\theta$ and $\sup_{\theta \in \Theta} \big\| \frac{1}{N}\sum_{i=1}^N \{ b(W_i; \theta) - E[b(W_i; \theta)] \} \big\| \xrightarrow{p} 0$.

Proof of Lemma E.2.
This proof is based on the proof of Lemma 1 in Tauchen (1985). Let $b_r(W_i; \theta)$ be the $r$-th element of the vector $b(W_i; \theta)$, $r = 1, 2, \dots, p$. Define a matrix $\Lambda_{ij}(\theta)$ such that its $rq$-th entry is $corr(b_r(W_i; \theta), b_q(W_j; \theta))$, $r, q = 1, 2, \dots, p$, and denote the diagonal matrix $V_i(\theta) = diag(Var[b_1(W_i; \theta)], \dots, Var[b_p(W_i; \theta)])$. By condition (iv)(c), for all $i$ and given $\eta$, there exist constants $C_1, C_2 > 0$ such that $\sup_{\theta \in \Theta} Var[b_r(W_i; \theta)] < C_1$ for all $r = 1, \dots, p$, and $\sup_{\theta \in \Theta} Var[u(W_i; \theta, \eta)] < C_2$. Then,
$$\big\|\Sigma^b_N(\theta)\big\| \le \sum_{i=1}^N \sum_{j \in \Delta(i,N)} \big\| Cov\big(b(W_i; \theta), b(W_j; \theta)\big) \big\| \le \sum_{i=1}^N \sum_{j \in \Delta(i,N)} \big\| V_i(\theta)^{1/2} \Lambda_{ij}(\theta) V_j(\theta)^{1/2} \big\| \le C_1 p \sum_{i=1}^N |\Delta(i,N)| = O(N),$$
where the last step follows from $\frac{1}{N}\sum_{i=1}^N |\Delta(i,N)| = O(1)$ in condition (iv)(b). Similarly, $\Sigma^u_N(\theta, \eta) = O(N)$. Applying Chebyshev's inequality, we have that for any $\epsilon > 0$,
$$Pr\left( \left\| \frac{1}{N}\sum_{i=1}^N \{ b(W_i; \theta) - E[b(W_i; \theta)] \} \right\| > \epsilon \right) \le \frac{1}{\epsilon^2 N^2}\, E\left\| \sum_{i=1}^N \{ b(W_i; \theta) - E[b(W_i; \theta)] \} \right\|^2$$
$$= \frac{1}{\epsilon^2 N^2}\, tr\left( \sum_{i=1}^N \sum_{j \in \Delta(i,N)} Cov\big(b(W_i; \theta), b(W_j; \theta)\big) + \sum_{i=1}^N \sum_{j \notin \Delta(i,N)} Cov\big(b(W_i; \theta), b(W_j; \theta)\big) \right) \le \frac{p}{\epsilon^2 N^2} \big( \|\Sigma^b_N(\theta)\| + s.o. \big) = O\left( \frac{1}{\epsilon^2 N} \right),$$
where the second step uses $tr(A) \le p \|A\|_\infty \le p \|A\|$ for any $p \times p$ square matrix $A$, and the third step is due to condition (iv)(a). By choosing $\epsilon$ such that $\epsilon \to 0$ and $\epsilon^2 N \to \infty$ as $N \to \infty$, we get
$$\left\| \frac{1}{N}\sum_{i=1}^N \{ b(W_i; \theta) - E[b(W_i; \theta)] \} \right\| = o_p(1).$$
Similar arguments can be used to show that $\frac{1}{N}\sum_{i=1}^N \{ u(W_i; \theta, \eta) - E[u(W_i; \theta, \eta)] \} = o_p(1)$. By the continuity of $b(w; \theta)$ in $\theta$ (condition (ii)), we have that, for fixed $\theta$, $\lim u(w; \theta, \eta) = 0$ as $\eta \to 0$. Hence, for any $\epsilon > 0$,
there exists an $\bar\eta(\theta)$ such that
$$E[u(W_i; \theta, \eta)] \le \epsilon \quad \text{whenever } \eta \le \bar\eta(\theta). \tag{E.1}$$
Let $B(\theta)$ be an open ball of radius $\bar\eta(\theta)$ about $\theta$. Due to the compactness of $\Theta$, there exists a finite sequence of open balls $B_k := B(\theta_k)$ with $k = 1, 2, \dots, K$ such that $\Theta \subset \bigcup_{k=1}^K B_k$. Let $\eta_k = \bar\eta(\theta_k)$ and $u_k = E[u(W_i; \theta_k, \eta_k)]$. By (E.1) and the dominated convergence theorem, if $\theta \in B_k$ then $u_k \le \epsilon$ and $\|E[b(W_i; \theta)] - E[b(W_i; \theta_k)]\| \le \epsilon$. Next, for any $\theta \in \Theta$, there exists a $k$ such that $\theta \in B_k$, and then
$$\left\| \frac{1}{N}\sum_{i=1}^N b(W_i; \theta) - E[b(W_i; \theta)] \right\| \le \frac{1}{N}\sum_{i=1}^N \|b(W_i; \theta) - b(W_i; \theta_k)\| + \left\| \frac{1}{N}\sum_{i=1}^N b(W_i; \theta_k) - E[b(W_i; \theta_k)] \right\| + \big\| E[b(W_i; \theta_k)] - E[b(W_i; \theta)] \big\|$$
$$\le \frac{1}{N}\sum_{i=1}^N u(W_i; \theta_k, \eta_k) - u_k + u_k + \left\| \frac{1}{N}\sum_{i=1}^N b(W_i; \theta_k) - E[b(W_i; \theta_k)] \right\| + \epsilon \le 4\epsilon \tag{E.2}$$
whenever $N \ge \bar N_k(\epsilon)$, by $u_k \le \epsilon$. Thus, whenever $N \ge \max_k \bar N_k(\epsilon)$, we have that
$$\sup_{\theta \in \Theta} \left\| \frac{1}{N}\sum_{i=1}^N \{ b(W_i; \theta) - E[b(W_i; \theta)] \} \right\| \le 4\epsilon.$$
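The variance calculation that drives the Chebyshev step above can be illustrated with a simple locally dependent process. The sketch below is hypothetical: a 2-dependent moving average, so that $\Delta(i,N) = \{j : |i - j| \le 2\}$ and $\frac{1}{N}\sum_i |\Delta(i,N)| = O(1)$; the rescaled variance $N \cdot Var$ of the sample mean then stays bounded, as the argument requires.

```python
# Illustrative sketch (hypothetical m-dependent process, m = 2) of the variance
# bound behind Lemma E.2: each unit covaries only with its dependency
# neighborhood, so the sample mean has variance of order 1/N.
import numpy as np

rng = np.random.default_rng(2)
m, N, reps = 2, 2000, 500

def m_dependent_sample():
    eps = rng.normal(size=N + m)
    # moving average of m + 1 innovations: W_i depends only on eps_i..eps_{i+m},
    # hence Delta(i, N) = {j : |i - j| <= m} and |Delta(i, N)| <= 2m + 1
    return np.convolve(eps, np.ones(m + 1), mode="valid")

means = np.array([m_dependent_sample().mean() for _ in range(reps)])
print(N * means.var())   # stays bounded, close to (m + 1)**2 = 9 here
```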
Lemma E.3 (Theorem 3 of Bradley et al. (1983))

Suppose $X$ and $Y$ are random variables taking their values on a Borel space $\Gamma$ and $\mathbb{R}$, respectively. Suppose $U$ is a uniform $[0,1]$ random variable independent of $(X, Y)$. Suppose $\mu$ and $\gamma$ are positive numbers such that $\mu \le \|Y\|_\gamma < \infty$, where $\|Y\|_\gamma = (E[|Y|^\gamma])^{1/\gamma}$. Then there exists a real-valued random variable $Y^* = g(X, Y, U)$, where $g$ is a measurable function from $\Gamma \times \mathbb{R} \times [0,1]$ into $\mathbb{R}$, such that
(i) $Y^*$ is independent of $X$;
(ii) the probability distributions of $Y^*$ and $Y$ are identical;
(iii) $Pr(|Y^* - Y| \ge \mu) \le 18\, (\|Y\|_\gamma / \mu)^{\gamma/(2\gamma+1)}\, [\alpha(\mathcal{B}(X), \mathcal{B}(Y))]^{2\gamma/(2\gamma+1)}$,
where for any two $\sigma$-fields $\mathcal{B}_1, \mathcal{B}_2$, $\alpha(\mathcal{B}_1, \mathcal{B}_2) = \sup_{B_1 \in \mathcal{B}_1,\, B_2 \in \mathcal{B}_2} |Pr(B_1 \cap B_2) - Pr(B_1) Pr(B_2)|$.

The following lemmas were pioneered by Stein (1986) and are utilized in, e.g., Chen et al. (2010), Ross (2011) and Goldstein and Rinott (1996), among others, to derive central limit theorems for dependency graphs. We re-state them here so that the proofs are self-contained.
Lemma E.4 (Meckes et al. (2009) Lemma 1)
Let $Z \in \mathbb{R}^p$ be a standard normal random vector with mean zero and covariance matrix $I_p$.
(i) If a function $f: \mathbb{R}^p \to \mathbb{R}$ is twice continuously differentiable with compact support, then
$$E\left[ tr\left( \frac{d^2 f(Z)}{dz\, dz'} \right) - Z' \frac{d f(Z)}{dz} \right] = 0.$$
(ii) If a random vector $X \in \mathbb{R}^p$ is such that
$$E\left[ tr\left( \frac{d^2 f(X)}{dx\, dx'} \right) - X' \frac{d f(X)}{dx} \right] = 0$$
for every twice continuously differentiable $f \in C^2(\mathbb{R}^p)$ with finite absolute mean value $E\big[ |tr(d^2 f(X)/dx\, dx') - X' df(X)/dx| \big] < \infty$, then $X$ is a standard normal random vector.

Lemma E.5 (Goldstein and Rinott (1996) Lemma 3.1)

Let $Z \in \mathbb{R}^p$ be a standard normal random vector and let $h: \mathbb{R}^p \to \mathbb{R}$ have three bounded derivatives. Define $(T_u h)(x) = E[h(x e^{-u} + \sqrt{1 - e^{-2u}}\, Z)]$ for $x \in \mathbb{R}^p$. Then $f(x) = -\int_0^\infty [T_u h(x) - E[h(Z)]]\, du$ solves
$$tr\left( \frac{d^2 f(x)}{dx\, dx'} \right) - x' \frac{d f(x)}{dx} = h(x) - E[h(Z)].$$
In addition, for any $k$-th partial derivative we have that
$$\left| \frac{\partial^k f(x)}{\prod_{j=1}^k \partial x_j} \right| \le \frac{1}{k} \sup_{x \in \Omega_X} \left| \frac{\partial^k h(x)}{\prod_{j=1}^k \partial x_j} \right|.$$
Further, for any $\lambda \in \mathbb{R}^p$ and positive definite $p \times p$ matrix $\Sigma$, the function $f^*$ defined by the change of variables $f^*(x) := f(\Sigma^{-1/2}(x - \lambda))$ solves
$$tr\big( \Sigma\, \nabla^2 f^*(x) \big) - (x - \lambda)' \nabla f^*(x) = h\big( \Sigma^{-1/2}(x - \lambda) \big) - E[h(Z)],$$
and
$$\left| \frac{\partial^k f^*(x)}{\prod_{j=1}^k \partial x_j} \right| \le \frac{p^k}{k} \big\| \Sigma^{-1/2} \big\|_\infty^k\, \|\nabla^k h\|_\infty.$$
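The characterization in Lemma E.4(i) can be verified by simulation for any sufficiently smooth test function. The sketch below uses the hypothetical choice $p = 2$ and $f(x) = \exp(-\|x\|^2)$, for which the gradient and Hessian are available in closed form.

```python
# Monte Carlo check of the Stein identity in Lemma E.4(i) for one hypothetical
# smooth test function: p = 2 and f(x) = exp(-|x|^2).
import numpy as np

rng = np.random.default_rng(3)
Z = rng.normal(size=(1_000_000, 2))             # standard normal in R^2
sq = (Z ** 2).sum(axis=1)                       # |Z|^2

z_dot_grad = -2.0 * sq * np.exp(-sq)            # Z' df(Z)/dz for this f
trace_hessian = (4.0 * sq - 4.0) * np.exp(-sq)  # tr(d^2 f(Z) / dz dz')

# Lemma E.4(i): the expectation of the difference is zero for Gaussian Z.
print(np.mean(trace_hessian - z_dot_grad))      # approx 0 up to MC error
```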
The lemma below is based on Theorem 1.4 of Goldstein and Rinott (1996), which provides a bound on the distance to normality for a sum of dependent random vectors whose dependence structure is formed via dependency neighborhoods.

Lemma E.6 (Multivariate CLT under Dependency Neighborhoods)
Let $\{W_i\}_{i=1}^N$ be random vectors in $\mathbb{R}^p$ with $E[W_i] = 0$ and let $Z \in \mathbb{R}^p$ be a standard normal random vector. Denote
$$S_N = \sum_{i=1}^N W_i \qquad \text{and} \qquad \Sigma_N = \sum_{i=1}^N \sum_{j \in \Delta(i,N)} E[W_i W_j'].$$
In addition, denote $S^c_i = \sum_{j \notin \Delta(i,N)} W_j$. Assume $\Sigma_N$ is symmetric positive definite. Suppose the following conditions hold:
(i) there exists a finite, strictly positive-definite and symmetric $p \times p$ matrix $\Omega$ such that $\|N^{-1} \Sigma_N - \Omega\| \to 0$ as $N \to \infty$;
(ii) (a) $\left\| \sum_{i=1}^N \sum_{j,k \in \Delta(i,N)} E\big[ |vec(W_i W_j')\, W_k'| \big] \right\|_\infty = o\big( \|\Sigma_N^{3/2}\|_\infty \big)$;
(b) $\left\| \sum_{i,k=1}^N \sum_{j \in \Delta(i,N)} \sum_{l \in \Delta(k,N)} E\big[ (W_i W_j' - E[W_i W_j'])' (W_k W_l' - E[W_k W_l']) \big] \right\|_\infty = o\big( \|\Sigma_N\|^2_\infty \big)$;
(c) $\left\| \sum_{i=1}^N \sum_{j \notin \Delta(i,N)} Cov(W_i, W_j) \right\|_\infty = o\big( \|\Sigma_N\|_\infty \big)$;
(d) $E[W_{i,r} \mid S^c_i]\, S^c_{i,q} \ge 0$ for all $i \in \mathcal{P}$ and $r, q = 1, \dots, p$.
Then $\Sigma_N^{-1/2} S_N \xrightarrow{d} N(0, I_p)$.

Proof of Lemma E.6.
Denote $S^c_{i,q}$ as the $q$-th element of $S^c_i$. Let $h: \mathbb{R}^p \to \mathbb{R}$ be a function with bounded mixed partial derivatives up to order three. Denote by $\nabla^k h$ the $k$-th derivative of $h$, and let $\nabla_r f^*(x) = \partial f^*(x)/\partial x_r$ and $\nabla_{rq} f^*(x) = \partial^2 f^*(x)/\partial x_r \partial x_q$. It follows directly from the proof of Theorem 1.4 in Goldstein and Rinott (1996) that
$$\left| E\big[ h(\Sigma_N^{-1/2} S_N) \big] - E[h(Z)] \right| \le p_1 \big\|\Sigma_N^{-1/2}\big\|^2_\infty \|\nabla^2 h\|_\infty \sum_{r,q=1}^p \sqrt{ E\left[ \left( \sum_{i=1}^N \sum_{j \in \Delta(i,N)} \big( W_{i,r} W_{j,q} - E[W_{i,r} W_{j,q}] \big) \right)^2 \right] }$$
$$+ \left| \sum_{r=1}^p \sum_{i=1}^N E\big[ W_{i,r} \nabla_r f^*(S^c_i) \big] \right| + p_2 \big\|\Sigma_N^{-1/2}\big\|^3_\infty \|\nabla^3 h\|_\infty \sum_{r,q,u=1}^p \sum_{i=1}^N E\left[ \left| W_{i,r} \sum_{j \in \Delta(i,N)} W_{j,q} \sum_{k \in \Delta(i,N)} W_{k,u} \right| \right], \tag{E.3}$$
where $f^*$ is defined as in Lemma E.5. Consider the second term on the right-hand side of (E.3):
$$\left| \sum_{r=1}^p \sum_{i=1}^N E\big[ W_{i,r} \nabla_r f^*(S^c_i) \big] \right| \le \left| \sum_{r=1}^p \sum_{i=1}^N E\big\{ W_{i,r} [\nabla_r f^*(S^c_i) - \nabla_r f^*(0)] \big\} \right| + \left| \sum_{r=1}^p \nabla_r f^*(0) \sum_{i=1}^N E[W_{i,r}] \right| = \left| \sum_{r,q=1}^p \sum_{i=1}^N E\big[ W_{i,r} S^c_{i,q} \nabla_{rq} f^*(\tilde S^c_i) \big] \right|, \tag{E.4}$$
where $\tilde S^c_i$ is between $S^c_i$ and $0$, and the last equality comes from the mean value theorem and the fact that $E[W_{i,r}] = 0$. Without loss of generality, suppose there exists a function $\tilde f$ such that $\tilde S^c_i = \tilde f(S^c_i)$. Then, we can further bound (E.4) as follows:
$$\left| \sum_{r,q=1}^p \sum_{i=1}^N E\big[ W_{i,r} S^c_{i,q} \nabla_{rq} f^*\big( \tilde f(S^c_i) \big) \big] \right| \le p_2 \big\|\Sigma_N^{-1/2}\big\|^2_\infty \|\nabla^2 h\|_\infty \sum_{r,q=1}^p \sum_{i=1}^N E\big\{ W_{i,r} S^c_{i,q}\, sign\big( E[W_{i,r} \mid S^c_i]\, S^c_{i,q} \big) \big\}, \tag{E.5}$$
where the inequality holds because of Lemma E.1 and $|\nabla_{rq} f^* \circ \tilde f| \le p \|\Sigma_N^{-1/2}\|^2_\infty \|\nabla^2 h\|_\infty$ by Lemma E.5. Therefore, we have that
$$\left| E\big[ h(\Sigma_N^{-1/2} S_N) \big] - E[h(Z)] \right| \le p_0 \Bigg\{ \big\|\Sigma_N^{-1/2}\big\|^2_\infty \|\nabla^2 h\|_\infty \sum_{r,q=1}^p \sqrt{ E\left[ \left( \sum_{i=1}^N \sum_{j \in \Delta(i,N)} \big( W_{i,r} W_{j,q} - E[W_{i,r} W_{j,q}] \big) \right)^2 \right] }$$
$$+ \big\|\Sigma_N^{-1/2}\big\|^2_\infty \|\nabla^2 h\|_\infty \sum_{r,q=1}^p \sum_{i=1}^N E\big\{ W_{i,r} S^c_{i,q}\, sign\big( E[W_{i,r} \mid S^c_i]\, S^c_{i,q} \big) \big\} + \big\|\Sigma_N^{-1/2}\big\|^3_\infty \|\nabla^3 h\|_\infty \sum_{r,q,u=1}^p \sum_{i=1}^N E\left[ \left| W_{i,r} \sum_{j \in \Delta(i,N)} W_{j,q} \sum_{k \in \Delta(i,N)} W_{k,u} \right| \right] \Bigg\}, \tag{E.6}$$
for some constant $p_0 > 0$.
Let us start with the first term. By the Cauchy–Schwarz inequality,
$$\sum_{r,q=1}^p \sqrt{ E\left[ \left( \sum_{i=1}^N \sum_{j \in \Delta(i,N)} \big( W_{i,r} W_{j,q} - E[W_{i,r} W_{j,q}] \big) \right)^2 \right] } \le \left( \sum_{r,q=1}^p E\left[ \left( \sum_{i=1}^N \sum_{j \in \Delta(i,N)} \big( W_{i,r} W_{j,q} - E[W_{i,r} W_{j,q}] \big) \right)^2 \right] \right)^{1/2} \left( \sum_{r,q=1}^p 1 \right)^{1/2}$$
$$= p\, E\left[ \left\| \sum_{i=1}^N \sum_{j \in \Delta(i,N)} \big( W_i W_j' - E[W_i W_j'] \big) \right\|^2 \right]^{1/2},$$
where
$$E\left[ \left\| \sum_{i=1}^N \sum_{j \in \Delta(i,N)} \big( W_i W_j' - E[W_i W_j'] \big) \right\|^2 \right] = E\left[ tr\left( \left( \sum_{i=1}^N \sum_{j \in \Delta(i,N)} \big( W_i W_j' - E[W_i W_j'] \big) \right)' \left( \sum_{i=1}^N \sum_{j \in \Delta(i,N)} \big( W_i W_j' - E[W_i W_j'] \big) \right) \right) \right]$$
$$= tr\left( \sum_{i,k=1}^N \sum_{j \in \Delta(i,N)} \sum_{l \in \Delta(k,N)} E\Big[ \big( W_i W_j' - E[W_i W_j'] \big)' \big( W_k W_l' - E[W_k W_l'] \big) \Big] \right) \le p \left\| \sum_{i,k=1}^N \sum_{j \in \Delta(i,N)} \sum_{l \in \Delta(k,N)} E\Big[ \big( W_i W_j' - E[W_i W_j'] \big)' \big( W_k W_l' - E[W_k W_l'] \big) \Big] \right\|_\infty. \tag{E.7}$$
Besides, since $E[W_{i,r} \mid S^c_i]\, S^c_{i,q} \ge 0$ for all $i = 1, \dots, N$ and $r, q = 1, \dots, p$, the second term becomes
$$\sum_{r,q=1}^p \sum_{i=1}^N E\big[ W_{i,r} S^c_{i,q} \big] = \sum_{r,q=1}^p \sum_{i=1}^N \sum_{j \notin \Delta(i,N)} Cov(W_{i,r}, W_{j,q}) \le p^2 \left\| \sum_{i=1}^N \sum_{j \notin \Delta(i,N)} Cov(W_i, W_j) \right\|_\infty. \tag{E.8}$$
For the last term, we can obtain
$$\sum_{r,q,u=1}^p \sum_{i=1}^N E\left[ \left| W_{i,r} \sum_{j \in \Delta(i,N)} W_{j,q} \sum_{k \in \Delta(i,N)} W_{k,u} \right| \right] \le \sum_{r,q,u=1}^p \sum_{i=1}^N \sum_{j,k \in \Delta(i,N)} E\big[ |W_{i,r} W_{j,q} W_{k,u}| \big] \le p^3 \left\| \sum_{i=1}^N \sum_{j,k \in \Delta(i,N)} E\big[ |vec(W_i W_j')\, W_k'| \big] \right\|_\infty. \tag{E.9}$$
Moreover, since $\|N^{-1} \Sigma_N - \Omega\| \to 0$,
there exist constants $\epsilon_1, \epsilon_2$ such that, for $N$ large enough,
$$0 < \epsilon_1 \le N^{-1} \lambda_{min}(\Sigma_N) \le N^{-1} \lambda_{max}(\Sigma_N) < \epsilon_2 < \infty.$$
In addition, by the properties of norms and the symmetry of $\Sigma_N$, we have that
$$\big\|\Sigma_N^{-1/2}\big\|^2_\infty \le \big\|\Sigma_N^{-1/2}\big\|^2 = tr\big( \Sigma_N^{-1} \big) = \sum_{r=1}^p \lambda_r^{-1}(\Sigma_N) \le p\, \lambda_{min}^{-1}(\Sigma_N) = O(N^{-1}),$$
where $\lambda_r(\Sigma_N)$ denotes the $r$-th largest eigenvalue of the matrix $\Sigma_N$. Similarly,
$$\|\Sigma_N\|_\infty = O(N), \qquad \big\|\Sigma_N^{1/2}\big\|_\infty = O(N^{1/2}), \qquad \big\|\Sigma_N^{3/2}\big\|_\infty = O(N^{3/2}). \tag{E.10}$$
Now, plugging (E.7), (E.8) and (E.9) into (E.6) gives us
$$\left| E\big[ h(\Sigma_N^{-1/2} S_N) \big] - E[h(Z)] \right| \le C_1 \big\|\Sigma_N^{-1/2}\big\|^2_\infty \left\| \sum_{i,k=1}^N \sum_{j \in \Delta(i,N)} \sum_{l \in \Delta(k,N)} E\Big[ \big( W_i W_j' - E[W_i W_j'] \big)' \big( W_k W_l' - E[W_k W_l'] \big) \Big] \right\|_\infty^{1/2}$$
$$+ C_2 \big\|\Sigma_N^{-1/2}\big\|^2_\infty \left\| \sum_{i=1}^N \sum_{j \notin \Delta(i,N)} Cov(W_i, W_j) \right\|_\infty + C_3 \big\|\Sigma_N^{-1/2}\big\|^3_\infty \left\| \sum_{i=1}^N \sum_{j,k \in \Delta(i,N)} E\big[ |vec(W_i W_j')\, W_k'| \big] \right\|_\infty$$
$$= \big\|\Sigma_N^{-1/2}\big\|^2_\infty\, o\big( \|\Sigma_N\|_\infty \big) + \big\|\Sigma_N^{-1/2}\big\|^3_\infty\, o\big( \big\|\Sigma_N^{3/2}\big\|_\infty \big) = o(1),$$
implying that $\Sigma_N^{-1/2} S_N \xrightarrow{d} N(0, I_p)$.
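The conclusion of Lemma E.6 can be visualized with a small simulation. The design below is hypothetical: $1$-dependent mean-zero vectors in $\mathbb{R}^2$, with $\Sigma_N$ approximated by the Monte Carlo second moment of $S_N$ (which equals $\Sigma_N$ when covariances vanish outside the dependency neighborhoods) and the inverse Cholesky factor playing the role of $\Sigma_N^{-1/2}$.

```python
# Simulation sketch (hypothetical design) of Lemma E.6: for locally dependent
# mean-zero vectors, the whitened sum is approximately standard normal.
import numpy as np

rng = np.random.default_rng(4)
p, m, N, reps = 2, 1, 1000, 2000

def one_draw():
    eps = rng.normal(size=(N + m, p))
    return eps[:-m] + 0.5 * eps[m:]   # 1-dependent vectors with E[W_i] = 0

# Monte Carlo estimate of Sigma_N = E[S_N S_N'], which here equals
# sum_i sum_{j in Delta(i,N)} E[W_i W_j'] under local dependence.
Sigma_N = np.zeros((p, p))
for _ in range(200):
    S = one_draw().sum(axis=0)
    Sigma_N += np.outer(S, S) / 200

L = np.linalg.cholesky(Sigma_N)       # Sigma_N = L L'; L^{-1} whitens S_N
draws = np.array([np.linalg.solve(L, one_draw().sum(axis=0))
                  for _ in range(reps)])
print(draws.mean(axis=0))             # approximately (0, 0)
print(np.cov(draws.T))                # approximately the identity matrix
```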
In what follows, we first present several lemmas that will be used to establish the asymptotic properties of the Jacobian and Hessian matrices of the objective function.

Lemma E.7

Under Assumptions 5.4 and 5.5, and with $x_{i,j}$ i.i.d. across $i$ for any given $j = 1, 2, \dots, K_T$, we have that
$$\sup_{\theta \in \Theta} \frac{1}{N} \sum_{i=1}^N \left\| \frac{d^2 m^*(x_{i,j}; \theta)}{d\theta\, d\theta'} \right\| = O_p(1), \qquad \sup_{\theta \in \Theta} \frac{1}{N} \sum_{i=1}^N \left\| \frac{d m^*(x_{i,j}; \theta)}{d\theta} \right\| = O_p(1),$$
and, for any $\tilde\theta_N \xrightarrow{p} \theta_0$,
$$\frac{1}{N} \sum_{i=1}^N \big| m^*(x_{i,j}; \tilde\theta_N) - m^*(x_{i,j}; \theta_0) \big| = o_p(1), \qquad \frac{1}{N} \sum_{i=1}^N \left\| \frac{d m^*(x_{i,j}; \tilde\theta_N)}{d\theta} - \frac{d m^*(x_{i,j}; \theta_0)}{d\theta} \right\| = o_p(1), \qquad \frac{1}{N} \sum_{i=1}^N \left\| \frac{d^2 m^*(x_{i,j}; \tilde\theta_N)}{d\theta\, d\theta'} - \frac{d^2 m^*(x_{i,j}; \theta_0)}{d\theta\, d\theta'} \right\| = o_p(1).$$

Proof of Lemma E.7.
By Assumption 5.5 and the uniform convergence of i.i.d. samples (Lemma 2.4 of Newey and McFadden (1994)),
$$\sup_{\theta \in \Theta} \frac{1}{N} \sum_{i=1}^N \left\| \frac{d^2 m^*(x_{i,j}; \theta)}{d\theta\, d\theta'} \right\| \le \sup_{\theta \in \Theta} \left| \frac{1}{N} \sum_{i=1}^N \left\| \frac{d^2 m^*(x_{i,j}; \theta)}{d\theta\, d\theta'} \right\| - E\left[ \left\| \frac{d^2 m^*(x_{i,j}; \theta)}{d\theta\, d\theta'} \right\| \right] \right| + \sup_{\theta \in \Theta} \left| E\left[ \left\| \frac{d^2 m^*(x_{i,j}; \theta)}{d\theta\, d\theta'} \right\| \right] \right| = o_p(1) + \sup_{\theta \in \Theta} \left| E\left[ \left\| \frac{d^2 m^*(x_{i,j}; \theta)}{d\theta\, d\theta'} \right\| \right] \right|. \tag{E.11}$$
Because $\sup_{\theta \in \Theta} E\big[ \|d^2 m^*(x_{i,j}; \theta)/d\theta\, d\theta'\| \big] \le E[H(x_{i,j})] < \infty$ by Assumption 5.5, (E.11) becomes
$$\sup_{\theta \in \Theta} \frac{1}{N} \sum_{i=1}^N \left\| \frac{d^2 m^*(x_{i,j}; \theta)}{d\theta\, d\theta'} \right\| = O_p(1). \tag{E.12}$$
Similar arguments can be used to show that $\sup_{\theta \in \Theta} \frac{1}{N} \sum_{i=1}^N \|d m^*(x_{i,j}; \theta)/d\theta\| = O_p(1)$. Besides, the mean value theorem gives
$$\frac{1}{N} \sum_{i=1}^N \big| m^*(x_{i,j}; \tilde\theta_N) - m^*(x_{i,j}; \theta_0) \big| = \frac{1}{N} \sum_{i=1}^N \left| \frac{\partial m^*(x_{i,j}; \bar\theta_N)}{\partial \theta'} (\tilde\theta_N - \theta_0) \right| \le \sup_{\theta \in \Theta} \frac{1}{N} \sum_{i=1}^N \left\| \frac{\partial m^*(x_{i,j}; \theta)}{\partial \theta} \right\| \big\| \tilde\theta_N - \theta_0 \big\| = o_p(1), \tag{E.13}$$
for $\bar\theta_N$ between $\tilde\theta_N$ and $\theta_0$. Similarly, we can also obtain that
$$\frac{1}{N} \sum_{i=1}^N \left\| \frac{\partial m^*(x_{i,j}; \tilde\theta_N)}{\partial \theta} - \frac{\partial m^*(x_{i,j}; \theta_0)}{\partial \theta} \right\| \le \sup_{\theta \in \Theta} \frac{1}{N} \sum_{i=1}^N \left\| \frac{\partial^2 m^*(x_{i,j}; \theta)}{\partial \theta\, \partial \theta'} \right\| \big\| \tilde\theta_N - \theta_0 \big\| = o_p(1). \tag{E.14}$$
Moreover, since
$$\frac{d^2 m^*(x_{i,j}; \tilde\theta_N)}{d\theta_r\, d\theta_q} - \frac{d^2 m^*(x_{i,j}; \theta_0)}{d\theta_r\, d\theta_q} = \frac{\partial}{\partial \theta'} \left( \frac{d^2 m^*(x_{i,j}; \bar\theta_N)}{d\theta_r\, d\theta_q} \right) (\tilde\theta_N - \theta_0)$$
by the smoothness of $m^*(x; \theta)$,
$$\frac{1}{N} \sum_{i=1}^N \left\| \frac{d^2 m^*(x_{i,j}; \tilde\theta_N)}{d\theta\, d\theta'} - \frac{d^2 m^*(x_{i,j}; \theta_0)}{d\theta\, d\theta'} \right\| \le \frac{1}{N} \sum_{r,q=1}^{d_\theta} \sum_{i=1}^N \left| \frac{d^2 m^*(x_{i,j}; \tilde\theta_N)}{d\theta_r\, d\theta_q} - \frac{d^2 m^*(x_{i,j}; \theta_0)}{d\theta_r\, d\theta_q} \right| \le \frac{1}{N} \sum_{r,q=1}^{d_\theta} \sum_{i=1}^N \left\| \frac{\partial}{\partial \theta'} \left( \frac{d^2 m^*(x_{i,j}; \bar\theta_N)}{d\theta_r\, d\theta_q} \right) \right\| \big\| \tilde\theta_N - \theta_0 \big\| = O_p\big( \|\tilde\theta_N - \theta_0\| \big) = o_p(1). \tag{E.15}$$
Lemmas E.8 to E.10 present the key steps for establishing the asymptotics of the Jacobian of the objective function. The proofs are based on Section 8 of Newey and McFadden (1994) and extended to accommodate data with a dependency-neighborhood structure.

Lemma E.8 (Linearization)
Under the assumptions in Lemma 5.5 (b), there exists a function $G(\cdot; \gamma): \Omega_W \to \mathbb{R}^{d_\theta}$ which is linear in $\gamma$ and satisfies
$$\left\| \frac{1}{\sqrt{N}} \sum_{i=1}^N \big[ g(W_i; \theta_0, \hat\phi_N) - g(W_i; \theta_0, \phi_0) - G(W_i; \hat\gamma_N - \gamma_0) \big] \right\| = o_p(1).$$

Proof of Lemma E.8.
Recall that $g(W_i; \theta, \phi) = \tau_i [Y_i - m(X_i; \theta, \phi)] \frac{\partial m(X_i; \theta, \phi)}{\partial \theta}$. Then,
$$\frac{1}{\sqrt{N}} \sum_{i=1}^N g(W_i; \theta_0, \hat\phi_N) - \frac{1}{\sqrt{N}} \sum_{i=1}^N g(W_i; \theta_0, \phi_0) = \frac{1}{\sqrt{N}} \sum_{i=1}^N \tau_i \left[ [Y_i - m(X_i; \theta_0, \hat\phi_N)] \frac{\partial m(X_i; \theta_0, \hat\phi_N)}{\partial \theta} - [Y_i - m(X_i; \theta_0, \phi_0)] \frac{\partial m(X_i; \theta_0, \phi_0)}{\partial \theta} \right], \tag{E.16}$$
where making use of the identity $\hat a \hat b - ab = (\hat a - a) b + a (\hat b - b) + (\hat a - a)(\hat b - b)$ leads to
$$\frac{1}{\sqrt{N}} \sum_{i=1}^N g(W_i; \theta_0, \hat\phi_N) - \frac{1}{\sqrt{N}} \sum_{i=1}^N g(W_i; \theta_0, \phi_0)$$
$$= -\frac{1}{\sqrt{N}} \sum_{i=1}^N \tau_i \big[ m(X_i; \theta_0, \hat\phi_N) - m(X_i; \theta_0, \phi_0) \big] \frac{\partial m(X_i; \theta_0, \phi_0)}{\partial \theta} + \frac{1}{\sqrt{N}} \sum_{i=1}^N \tau_i \big[ Y_i - m(X_i; \theta_0, \phi_0) \big] \left[ \frac{\partial m(X_i; \theta_0, \hat\phi_N)}{\partial \theta} - \frac{\partial m(X_i; \theta_0, \phi_0)}{\partial \theta} \right]$$
$$- \frac{1}{\sqrt{N}} \sum_{i=1}^N \tau_i \big[ m(X_i; \theta_0, \hat\phi_N) - m(X_i; \theta_0, \phi_0) \big] \left[ \frac{\partial m(X_i; \theta_0, \hat\phi_N)}{\partial \theta} - \frac{\partial m(X_i; \theta_0, \phi_0)}{\partial \theta} \right]$$
$$= -\frac{1}{\sqrt{N}} \sum_{i=1}^N \tau_i \sum_{j=1}^{K_T} m^*(x_{i,j}; \theta_0) \big[ \hat f_{T^*_i | X_i}(t_j) - f_{T^*_i | X_i}(t_j) \big] \frac{\partial m(X_i; \theta_0, \phi_0)}{\partial \theta}$$
$$+ \frac{1}{\sqrt{N}} \sum_{i=1}^N \tau_i \big[ Y_i - m(X_i; \theta_0, \phi_0) \big] \sum_{j=1}^{K_T} \frac{\partial m^*(x_{i,j}; \theta_0)}{\partial \theta} \big[ \hat f_{T^*_i | X_i}(t_j) - f_{T^*_i | X_i}(t_j) \big]$$
$$- \frac{1}{\sqrt{N}} \sum_{i=1}^N \tau_i \left( \sum_{j=1}^{K_T} m^*(x_{i,j}; \theta_0) \big[ \hat f_{T^*_i | X_i}(t_j) - f_{T^*_i | X_i}(t_j) \big] \right) \left( \sum_{j=1}^{K_T} \frac{\partial m^*(x_{i,j}; \theta_0)}{\partial \theta} \big[ \hat f_{T^*_i | X_i}(t_j) - f_{T^*_i | X_i}(t_j) \big] \right) := \mathcal{G}_1 + \mathcal{G}_2 + \mathcal{G}_3. \tag{E.17}$$
Firstly, consider $\mathcal{G}_3$. By the Cauchy–Schwarz inequality, (B.66) and Lemma E.7,
$$\|N^{-1/2} \mathcal{G}_3\| \le \frac{C}{N} \sum_{i=1}^N \sum_{j,l=1}^{K_T} \left\| m^*(x_{i,j}; \theta_0) \big[ \hat f_{T^*_i | X_i}(t_j) - f_{T^*_i | X_i}(t_j) \big] \frac{\partial m^*(x_{i,l}; \theta_0)}{\partial \theta} \big[ \hat f_{T^*_i | X_i}(t_l) - f_{T^*_i | X_i}(t_l) \big] \right\|$$
$$\le \left( \sup_{\|\hat\gamma_N - \gamma_0\|_\infty < \eta} \big\| \hat\phi_N - \phi_0 \big\|^2_\infty \right) \frac{1}{N} \sum_{j,l=1}^{K_T} \sum_{i=1}^N \big| m^*(x_{i,j}; \theta_0) \big| \left\| \frac{\partial m^*(x_{i,l}; \theta_0)}{\partial \theta} \right\| \le \left( \sup_{\|\hat\gamma_N - \gamma_0\|_\infty < \eta} \big\| \hat\phi_N - \phi_0 \big\|^2_\infty \right) \sum_{j,l=1}^{K_T} \left[ \frac{1}{N} \sum_{i=1}^N m^*(x_{i,j}; \theta_0)^2 \right]^{1/2} \left[ \frac{1}{N} \sum_{i=1}^N \left\| \frac{\partial m^*(x_{i,l}; \theta_0)}{\partial \theta} \right\|^2 \right]^{1/2} = O_p\big( \|\hat\gamma_N - \gamma_0\|^2_\infty \big). \tag{E.18}$$
Thus, given (E.18), we can get that $\|\mathcal{G}_3\| = O_p\big( N^{1/2} \|\hat\gamma_N - \gamma_0\|^2_\infty \big) = o_p(1)$ by Assumption 5.6. Next, let us consider $\mathcal{G}_1 + \mathcal{G}_2$. Recall that the $1 \times K_T$ row vector $\mathcal{R}(W_i; \theta, \phi)$ is defined as
$$\mathcal{R}(W_i; \theta, \phi) = \big( [Y_i - m(X_i; \theta, \phi)]\, m^*(x_{i,1}; \theta), \;\dots,\; [Y_i - m(X_i; \theta, \phi)]\, m^*(x_{i,K_T}; \theta) \big).$$
Denote $\phi(t; \hat\gamma_N) = [\hat f_{T^*_i | X_i}(t_1), \dots, \hat f_{T^*_i | X_i}(t_{K_T})]'$ and $\phi(t; \gamma_0) = [f_{T^*_i | X_i}(t_1), \dots, f_{T^*_i | X_i}(t_{K_T})]'$. Then, simple calculations yield that
$$\mathcal{G}_1 + \mathcal{G}_2 = \frac{1}{\sqrt{N}} \sum_{i=1}^N \tau_i \left[ \sum_{j=1}^{K_T} \left( \big[ Y_i - m(X_i; \theta_0, \phi_0) \big] \frac{\partial m^*(x_{i,j}; \theta_0)}{\partial \theta} - m^*(x_{i,j}; \theta_0) \frac{\partial m(X_i; \theta_0, \phi_0)}{\partial \theta} \right) \big[ \hat f_{T^*_i | X_i}(t_j) - f_{T^*_i | X_i}(t_j) \big] \right]$$
$$= \frac{1}{\sqrt{N}} \sum_{i=1}^N \tau_i \left[ \frac{\partial}{\partial \theta} \mathcal{R}(W_i; \theta_0, \phi_0) \big( \phi(t; \hat\gamma_N) - \phi(t; \gamma_0) \big) \right] = \frac{1}{\sqrt{N}} \sum_{i=1}^N \tau_i \left[ \frac{\partial}{\partial \theta} \mathcal{R}(W_i; \theta_0, \phi_0) \frac{\partial \phi(t; \gamma_0)}{\partial \gamma'} \big( \hat\gamma_N - \gamma_0 \big) \right] + \mathcal{G}_R, \tag{E.19}$$
where the remainder term
$$\mathcal{G}_R := \frac{1}{\sqrt{N}} \sum_{i=1}^N \tau_i \frac{\partial}{\partial \theta} \mathcal{R}(W_i; \theta_0, \phi_0) \left[ \phi(t; \hat\gamma_N) - \phi(t; \gamma_0) - \frac{\partial \phi(t; \gamma_0)}{\partial \gamma'} \big( \hat\gamma_N - \gamma_0 \big) \right] = \frac{1}{\sqrt{N}} \sum_{i=1}^N \tau_i [\nabla \mathcal{R}_1 + \nabla \mathcal{R}_2] \left[ \phi(t; \hat\gamma_N) - \phi(t; \gamma_0) - \frac{\partial \phi(t; \gamma_0)}{\partial \gamma'} \big( \hat\gamma_N - \gamma_0 \big) \right],$$
with $\frac{\partial}{\partial \theta} \mathcal{R}(W_i; \theta_0, \phi_0) := \nabla \mathcal{R}_1 + \nabla \mathcal{R}_2$ and
$$\nabla \mathcal{R}_1 = \big[ Y_i - m(X_i; \theta_0, \phi_0) \big] \left[ \frac{\partial m^*(x_{i,1}; \theta_0)}{\partial \theta} \;\cdots\; \frac{\partial m^*(x_{i,K_T}; \theta_0)}{\partial \theta} \right], \qquad \nabla \mathcal{R}_2 = -\frac{\partial m(X_i; \theta_0, \phi_0)}{\partial \theta} \big[ m^*(x_{i,1}; \theta_0) \;\cdots\; m^*(x_{i,K_T}; \theta_0) \big].$$
Next, we show that $\mathcal{G}_R = o_p(1)$.
Due to Theorem 5.2, we can focus on a small neighborhood of $\gamma_0$ and bound the remainder term as follows:
$$\|N^{-1/2} \mathcal{G}_R\| \le \sup_{\|\hat\gamma_N - \gamma_0\|_\infty < \eta} \left\| \big( \hat\phi_N - \phi_0 \big) - \frac{\partial \phi(\gamma_0)}{\partial \gamma'} \big( \hat\gamma_N - \gamma_0 \big) \right\|_\infty \frac{1}{N} \sum_{i=1}^N \tau_i \|\nabla \mathcal{R}_1 + \nabla \mathcal{R}_2\| \le O_p\big( \|\hat\gamma_N - \gamma_0\|^2_\infty \big) \left[ \frac{1}{N} \sum_{i=1}^N \tau_i \|\nabla \mathcal{R}_1\| + \frac{1}{N} \sum_{i=1}^N \tau_i \|\nabla \mathcal{R}_2\| \right],$$
where the $O_p(\|\hat\gamma_N - \gamma_0\|^2_\infty)$ is due to (B.63), and applying the Cauchy–Schwarz inequality to each of the terms inside the bracket leads to
$$\frac{1}{N} \sum_{i=1}^N \tau_i \|\nabla \mathcal{R}_1\| \le \frac{1}{N} \sum_{i=1}^N \tau_i \big| Y_i - m(X_i; \theta_0, \phi_0) \big| \left\| \left[ \frac{\partial m^*(x_{i,1}; \theta_0)}{\partial \theta} \;\cdots\; \frac{\partial m^*(x_{i,K_T}; \theta_0)}{\partial \theta} \right] \right\| \le C \left[ \frac{1}{N} \sum_{i=1}^N \tau_i \big( Y_i - m(X_i; \theta_0, \phi_0) \big)^2 \right]^{1/2} \left[ \frac{1}{N} \sum_{j=1}^{K_T} \sum_{i=1}^N \left\| \frac{\partial m^*(x_{i,j}; \theta_0)}{\partial \theta} \right\|^2 \right]^{1/2} = O_p(1),$$
where the last step follows from (B.67) and Lemma E.7. Similarly, from the Cauchy–Schwarz inequality and Lemma E.7, we can also get that
$$\frac{1}{N} \sum_{i=1}^N \tau_i \|\nabla \mathcal{R}_2\| \le \frac{C}{N} \sum_{i=1}^N \left\| \frac{\partial m(X_i; \theta_0, \phi_0)}{\partial \theta} \right\| \big\| \big[ m^*(x_{i,1}; \theta_0) \;\cdots\; m^*(x_{i,K_T}; \theta_0) \big] \big\| \le C \left[ \frac{1}{N} \sum_{i=1}^N \left\| \frac{\partial m(X_i; \theta_0, \phi_0)}{\partial \theta} \right\|^2 \right]^{1/2} \left[ \frac{1}{N} \sum_{j=1}^{K_T} \sum_{i=1}^N m^*(x_{i,j}; \theta_0)^2 \right]^{1/2} = O_p(1),$$
so that
$$\|\mathcal{G}_R\| = O_p\big( N^{1/2} \|\hat\gamma_N - \gamma_0\|^2_\infty \big) = o_p(1). \tag{E.20}$$
To complete this proof and exhibit the function $G$, let $\tilde\nu(W_i) := \tau_i \left[ \frac{\partial}{\partial \theta} \mathcal{R}(W_i; \theta_0, \phi_0) \frac{\partial \phi(t; \gamma_0)}{\partial \gamma'} \right]$ and
$$G(W_i; \gamma) = \tilde\nu(W_i)\, \gamma = \tau_i \left[ \frac{\partial}{\partial \theta} \mathcal{R}(W_i; \theta_0, \phi_0) \frac{\partial \phi(t; \gamma_0)}{\partial \gamma'} \right] \gamma,$$
so that by construction $G(W_i; \gamma)$ is linear in $\gamma$. Moreover, based on (E.18) and (E.20),
$$\left\| \frac{1}{\sqrt{N}} \sum_{i=1}^N \big[ g(W_i; \theta_0, \hat\phi_N) - g(W_i; \theta_0, \phi_0) - G(W_i; \hat\gamma_N - \gamma_0) \big] \right\| \le \|\mathcal{G}_3\| + \|\mathcal{G}_R\| = o_p(1). \tag{E.21}$$
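The decomposition in (E.16)–(E.17) rests on an elementary identity, which the following scalar sketch (with hypothetical values) makes concrete: the cross term, the analogue of $\mathcal{G}_3$, is quadratic in the first-step error, which is why $N^{1/2}\|\hat\gamma_N - \gamma_0\|^2_\infty = o_p(1)$ suffices.

```python
# A scalar sketch (hypothetical values) of the expansion behind (E.16)-(E.17):
# a_hat*b_hat - a*b = (a_hat - a)*b + a*(b_hat - b) + (a_hat - a)*(b_hat - b).
a, b = 1.7, -0.4
for err in [1e-1, 1e-2, 1e-3]:       # shrinking first-step estimation error
    a_hat, b_hat = a + err, b + err
    lhs = a_hat * b_hat - a * b
    first_order = (a_hat - a) * b + a * (b_hat - b)
    print(err, lhs - first_order)    # the remainder equals err**2 exactly
```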
Lemma E.9 (Stochastic Equicontinuity)

Let $F_W(w)$ be the true probability distribution function of $W_i$. Suppose the assumptions in Lemma 5.5 (b) hold. Then
$$\frac{1}{\sqrt{N}} \sum_{i=1}^N \left[ G(W_i; \hat\gamma_N - \gamma_0) - \int G(w; \hat\gamma_N - \gamma_0)\, dF_W(w) \right] = o_p(1).$$

Proof of Lemma E.9.
By the linearity of $G(w; \gamma) = \tilde\nu(w)\, \gamma$ in $\gamma$, we can get
$$\left\| \frac{1}{\sqrt{N}} \sum_{i=1}^N \left[ G(W_i; \hat\gamma_N - \gamma_0) - \int G(w; \hat\gamma_N - \gamma_0)\, dF_W(w) \right] \right\| = \left\| \frac{1}{\sqrt{N}} \sum_{i=1}^N \big[ \tilde\nu(W_i) - E[\tilde\nu(W_i)] \big] (\hat\gamma_N - \gamma_0) \right\| \le C \left\| \frac{1}{\sqrt{N}} \sum_{i=1}^N \big[ \tilde\nu(W_i) - E[\tilde\nu(W_i)] \big] \right\| \big\| \hat\gamma_N - \gamma_0 \big\|_\infty. \tag{E.22}$$
Denote $\tilde\nu_r(W_i)$ as the $r$-th entry of the vector $\tilde\nu(W_i)$. Then,
$$E\left\| \frac{1}{\sqrt{N}} \sum_{i=1}^N \big[ \tilde\nu(W_i) - E[\tilde\nu(W_i)] \big] \right\|^2 = \frac{1}{N} E\left[ \left( \sum_{i=1}^N \big( \tilde\nu(W_i) - E[\tilde\nu(W_i)] \big) \right)' \left( \sum_{i=1}^N \big( \tilde\nu(W_i) - E[\tilde\nu(W_i)] \big) \right) \right]$$
$$= \frac{1}{N} \sum_{r=1}^{d_\theta} \left[ \sum_{i=1}^N \sum_{j \in \Delta(i,N)} Cov\big( \tilde\nu_r(W_i), \tilde\nu_r(W_j) \big) + \sum_{i=1}^N \sum_{j \notin \Delta(i,N)} Cov\big( \tilde\nu_r(W_i), \tilde\nu_r(W_j) \big) \right] = \frac{1}{N} \sum_{r=1}^{d_\theta} \sum_{i=1}^N \sum_{j \in \Delta(i,N)} Cov\big( \tilde\nu_r(W_i), \tilde\nu_r(W_j) \big) + s.o., \tag{E.23}$$
where the last line comes from Assumption 5.1. Note that, due to $Var[\tilde\nu_r(W_i)] < \infty$ as in Assumption 5.6 and $\frac{1}{N}\sum_{i=1}^N |\Delta(i,N)| = O(1)$,
$$\frac{1}{N} \sum_{i=1}^N \sum_{j \in \Delta(i,N)} Cov\big( \tilde\nu_r(W_i), \tilde\nu_r(W_j) \big) \le \frac{C}{N} \sum_{i=1}^N |\Delta(i,N)| = O(1). \tag{E.24}$$
Given (E.24), together with the consistency $\|\hat\gamma_N - \gamma_0\|_\infty = o_p(1)$, we know that
$$\left\| \frac{1}{\sqrt{N}} \sum_{i=1}^N \left[ G(W_i; \hat\gamma_N - \gamma_0) - \int G(w; \hat\gamma_N - \gamma_0)\, dF_W(w) \right] \right\| = o_p(1). \tag{E.25}$$

Lemma E.10 (Mean-square Differentiability)
Under the assumptions in Lemma 5.5 (b), there exists a function $\delta: \Omega_W \to \mathbb{R}^{d_\theta}$ such that
$$\int G(w; \hat\gamma_N - \gamma_0)\, dF_W(w) = \int \delta(w)\, d\hat F_W(w), \qquad \sqrt{N}\, E\left[ \left\| \int \delta(w)\, d\hat F_W(w) - \int \delta(w)\, d\tilde F_W(w) \right\| \right] = o(1),$$
where $\hat F_W(w)$ is the kernel estimator of $F_W(w)$ and $\tilde F_W(w) := \frac{1}{N}\sum_{i=1}^N \mathbb{1}[W_i \le w]$ is the empirical distribution of $W_i$.

Proof of Lemma E.10.
Following the derivations of Theorem 8.1 (or Theorem 8.11) in Newey and McFadden (1994), it is apparent from the linearity of $G(w; \gamma)$ in $\gamma$ and the law of iterated expectations that
$$\int G(w; \gamma)\, dF_W(w) = \int \nu(w)\, \gamma(w)\, dw,$$
where the $d_\theta \times d_\gamma$ matrix $\nu(w)$ is defined as
$$\nu(w) = E\left[ \tau(X_i)\, \frac{\partial}{\partial \theta} \mathcal{R}(W_i; \theta_0, \phi_0)\, \frac{\partial \phi(t; \gamma)}{\partial \gamma'} \Big|_{\gamma = \gamma_0} \,\middle|\, W_i = w \right].$$
In addition, letting $\delta(w) := \nu(w) - E[\nu(W_i)]$, we have
$$\int G(w; \hat\gamma_N - \gamma_0)\, dF_W(w) = \int \delta(w)\, d\hat F_W(w),$$
with $\hat F_W(w)$ being the kernel estimator of the distribution of $W_i$. At last, recall the empirical distribution $\tilde F_W(w) = \frac{1}{N}\sum_{i=1}^N \mathbb{1}[W_i \le w]$. By an abuse of notation, we denote $\kappa\big( \frac{w^c - \tilde w^c}{h} \big) := \prod_{q=1}^Q \kappa\big( \frac{w^c_q - \tilde w^c_q}{h} \big)$. Consider the difference between the two integrals, $\delta_0(F)$, defined below, which can be interpreted as a smoothing bias term:
$$\delta_0(F) := \int \delta(w)\, d\hat F_W(w) - \int \delta(w)\, d\tilde F_W(w) = \frac{1}{N} \sum_{i=1}^N \left[ \sum_{w^d \in \Omega_{W^d}} \int \nu(w)\, \hat f^{ker}_i(w)\, dw^c - \nu(W_i) \right]$$
$$= \frac{1}{N} \sum_{i=1}^N \left[ \sum_{w^d \in \Omega_{W^d}} \int \nu(w)\, \frac{1}{h^Q}\, \mathbb{1}[W^d_i = w^d] \prod_{q=1}^Q \kappa\!\left( \frac{w^c_q - W^c_{iq}}{h} \right) dw^c - \nu(W_i) \right] = \frac{1}{N} \sum_{i=1}^N \left[ \int \nu(w^c, W^d_i)\, \frac{1}{h^Q} \prod_{q=1}^Q \kappa\!\left( \frac{w^c_q - W^c_{iq}}{h} \right) dw^c - \nu(W_i) \right]$$
$$= \frac{1}{N} \sum_{i=1}^N \int \big[ \nu(W^c_i + hv, W^d_i) - \nu(W_i) \big] \prod_{q=1}^Q \kappa(v_q)\, dv := \frac{1}{N} \sum_{i=1}^N \delta_i(F). \tag{E.26}$$
Because of the identical distribution of $W_i$ across $i$, it follows from (E.26) that
$$\sqrt{N}\, E[\delta_0(F)] = \sqrt{N}\, E[\delta_i(F)] = \sqrt{N}\, E\left[ \int \nu(W^c_i + hv, W^d_i) \prod_{q=1}^Q \kappa(v_q)\, dv - \nu(W_i) \right]$$
$$= \sqrt{N} \int\!\!\int \nu(\tilde w^c + hv, \tilde w^d) \prod_{q=1}^Q \kappa(v_q)\, dv\, dF_W(\tilde w) - \sqrt{N} \int \nu(w)\, dF_W(w) = \sqrt{N} \int\!\!\int \nu(\tilde w^c, \tilde w^d) \prod_{q=1}^Q \kappa(v_q)\, dv\, dF_W(\tilde w^c - hv, \tilde w^d) - \sqrt{N} \int \nu(w)\, dF_W(w)$$
$$= \sqrt{N} \left\{ \int\!\!\int \nu(\tilde w^c, \tilde w^d) \prod_{q=1}^Q \kappa(v_q)\, dv\, dF_W(\tilde w^c - hv, \tilde w^d) - \int\!\!\int \nu(w) \prod_{q=1}^Q \kappa(v_q)\, dv\, dF_W(w) \right\}$$
$$= \sqrt{N} \sum_{w^d \in \Omega_{W^d}} \int\!\!\int \nu(w) \big[ f_{W^c_i, W^d_i}(w^c - hv, w^d) - f_{W^c_i, W^d_i}(w^c, w^d) \big] \prod_{q=1}^Q \kappa(v_q)\, dv\, dw^c, \tag{E.27}$$
which together with (B.39) and Assumption 5.6 implies that
$$\sqrt{N}\, \big\| E[\delta_0(F)] \big\| \le \sqrt{N} \sum_{w^d \in \Omega_{W^d}} \int \|\nu(w)\| \left\| \int \big[ f_{W^c_i, W^d_i}(w^c - hv, w^d) - f_{W^c_i, W^d_i}(w^c, w^d) \big] \prod_{q=1}^Q \kappa(v_q)\, dv \right\| dw^c \le C \sqrt{N}\, h^2 \sum_{w^d \in \Omega_{W^d}} \int \|\nu(w)\|\, dw^c = o(1). \tag{E.28}$$
Next, let $\delta_0(F) = (\delta_{0,1}(F), \dots, \delta_{0,d_\theta}(F))'$ with $\delta_{0,r}(F) = \frac{1}{N}\sum_{i=1}^N \delta_{r,i}(F)$, and consider
$$E\left[ \big\| \sqrt{N}\, \delta_0(F) - \sqrt{N}\, E[\delta_0(F)] \big\|^2 \right] = \sum_{r=1}^{d_\theta} E\left[ \big| \sqrt{N}\, \delta_{0,r}(F) - \sqrt{N}\, E[\delta_{0,r}(F)] \big|^2 \right] = N \sum_{r=1}^{d_\theta} E\left[ \left| \frac{1}{N} \sum_{i=1}^N \big( \delta_{r,i}(F) - E[\delta_{r,i}(F)] \big) \right|^2 \right] = \frac{1}{N} \sum_{r=1}^{d_\theta} \sum_{i=1}^N \sum_{j \in \Delta(i,N)} Cov\big( \delta_{r,i}(F), \delta_{r,j}(F) \big) + s.o., \tag{E.29}$$
where the last line follows from Assumption 5.1. Due to the identical distribution of $W_i$ and (E.26), we can bound the covariance in (E.29) by
$$\big| Cov\big( \delta_{r,i}(F), \delta_{r,j}(F) \big) \big| \le Var[\delta_{r,i}(F)] \le E\big[ |\delta_{r,i}(F)|^2 \big] = E\left[ \left( \int \big[ \nu_r(W^c_i + hv, W^d_i) - \nu_r(W_i) \big] \prod_{q=1}^Q \kappa(v_q)\, dv \right)^2 \right]. \tag{E.30}$$
From Assumption 5.2 we know that $\int v \kappa(v)\, dv = 0$ and $\int v^2 \kappa(v)\, dv = K_2$, and from Assumption 5.6 that $\nu(w)$ is twice continuously differentiable in $w^c$. Expanding $\nu_r(W^c_i + hv, W^d_i)$ around $W^c_i$, there exists a constant $C > 0$ such that
$$\big| Cov\big( \delta_{r,i}(F), \delta_{r,j}(F) \big) \big| \le h^4\, E\left[ \left( \int \frac{1}{2}\, v' \frac{\partial^2 \nu_r(W^c_i + w^*, W^d_i)}{\partial w^c\, \partial (w^c)'}\, v \prod_{q=1}^Q \kappa(v_q)\, dv \right)^2 \right] \le C h^4. \tag{E.31}$$
Substituting (E.31) into (E.29), since $\frac{1}{N}\sum_{i=1}^N |\Delta(i,N)| = O(1)$ as in Assumption 5.2,
$$E\left[ \big\| \sqrt{N}\, \delta_0(F) - \sqrt{N}\, E[\delta_0(F)] \big\|^2 \right] = O(h^4) = o(1). \tag{E.32}$$
Based on (E.28) and (E.32), since both the mean and the variance of $\sqrt{N}\, \delta_0(F)$ are $o(1)$, it follows directly that $\sqrt{N}\, E\big[ \|\delta_0(F)\| \big] = o(1)$, which completes the proof of Lemma E.10.
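The smoothing-bias bound (E.28) can be illustrated in one dimension. The sketch below is hypothetical: a smooth bounded $\nu$, a Gaussian kernel simulated by Monte Carlo draws $W_i + hv$, and a sequence of halving bandwidths, so that the gap between the kernel-smoothed and empirical integrals shrinks at the rate $h^2$, and $\sqrt{N}$ times the gap vanishes whenever $\sqrt{N}\, h^2 \to 0$.

```python
# One-dimensional sketch of the smoothing bias in Lemma E.10: for a smooth nu
# and a second-order kernel, the gap between the integral of nu against the
# kernel estimate F_hat and against the empirical distribution F_tilde is
# O(h^2). All design choices below are illustrative.
import numpy as np

rng = np.random.default_rng(5)
N = 50_000
W = rng.normal(size=N)
nu = np.cos                              # a smooth, bounded nu(w)

v = rng.normal(size=(N, 50))             # Gaussian kernel via MC draws W + h*v
for h in [0.5, 0.25, 0.125]:
    smoothed = np.mean(nu(W[:, None] + h * v))  # integral against F_hat
    empirical = np.mean(nu(W))                  # integral against F_tilde
    print(h, smoothed - empirical)              # shrinks like h**2
```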