Causal Gradient Boosting: Boosted Instrumental Variable Regression
Edvard Bakhitov (Department of Economics, University of Pennsylvania; [email protected]) and Amandeep Singh (The Wharton School, University of Pennsylvania; [email protected])
January 18, 2021
Abstract
Recent advances in the literature have demonstrated that standard supervised learning algorithms are ill-suited for problems with endogenous explanatory variables. To correct for the endogeneity bias, many variants of nonparametric instrumental variable regression methods have been developed. In this paper, we propose an alternative algorithm called boostIV that builds on the traditional gradient boosting algorithm and corrects for the endogeneity bias. The algorithm is very intuitive and resembles an iterative version of the standard 2SLS estimator. Moreover, our approach is data driven, meaning that the researcher does not have to take a stance on either the form of the target function approximation or the choice of instruments. We demonstrate that our estimator is consistent under mild conditions. We carry out extensive Monte Carlo simulations to demonstrate the finite sample performance of our algorithm compared to other recently developed methods. We show that boostIV is at worst on par with the existing methods and on average significantly outperforms them.
Keywords:
Causal Learning, Boosting, Instrumental Variables, Gradient Descent, Nonparametric
Gradient boosting is considered one of the leading machine learning (ML) algorithms for supervised learning with structured data. There is a large body of evidence showing that gradient boosting dominates in a significant number of ML competitions conducted on Kaggle. However, recent literature (e.g., see Hartford et al., 2017) has shown that traditional supervised machine learning methods do not perform well in the presence of endogeneity in the explanatory variables.

A common approach to correct for the endogeneity bias is to use instrumental variables (IVs). Nonparametric instrumental variables (NPIV) methods have regained popularity among applied researchers over the last decade as they do not require imposing (possibly) implausible parametric assumptions on the target function. On the other hand, existing nonparametric estimation techniques require the researcher to specify a target function approximation (ideally driven by some ex-ante understanding of the data generating process), e.g. a sieve space, which in turn drives the choice of unconditional moment restrictions (or simply put, the choice of IV basis functions). If the approximation is bad, it will lead to misspecification issues, and if the IVs are "weak", most likely the standard NPIV asymptotic techniques will no longer be valid. (By "weak" we mean that the IV basis functions are weakly correlated with the target function basis functions. It is hard to define a notion of weak instruments in the NPIV set-up of Newey and Powell (2003), since there is no explicit reduced form. In triangular simultaneous equations models, where an explicit reduced form exists, Han (2014) defines weak IVs as a sequence of reduced-form functions whose associated rank shrinks to zero.) Moreover, the complexity of both modelling and estimation explodes when there are more than a handful of inputs.

In this paper, we introduce an algorithm that learns the target function in the presence of endogenous explanatory variables in a data-driven way, meaning that the researcher does not have to take a stance on either the form of the target function approximation or the choice of instruments. We build on the gradient boosting algorithm to transform the standard NPIV problem into a learning problem that accounts for endogeneity in the explanatory variables, and thus we call our algorithm boostIV. We also consider a couple of extensions to the boostIV algorithm that might improve its finite sample performance. First, we show how to incorporate optimal IVs, i.e. IVs that achieve the lowest asymptotic variance (Chamberlain, 1987). Second, we augment the boostIV algorithm with a post-processing step where we re-estimate the weights on the learnt basis functions; we call this algorithm post-boostIV. The idea is based on Friedman and Popescu (2003), who propose to learn an ensemble of basis functions and then apply the lasso to perform basis function selection.

To avoid potentially severe finite sample bias due to the double use of data, we resort to the cross-fitting idea of Chernozhukov et al. (2017). For the boostIV algorithm we split the data to learn instruments and basis functions on different data folds. We add an additional layer of cross-fitting to the post-boostIV algorithm to update the weights on the learnt basis functions.

Our method has a number of advantages over the standard NPIV approach. First, our approach allows the researcher to be completely agnostic about the choice of basis functions and IVs.
Both basis functions and instruments are learnt in a data-driven way which picks up the underlying data structure. Second, the method becomes even more attractive when the dimensionality of the problem grows, as the standard NPIV methods suffer greatly from the curse of dimensionality. Intuitively, learning via boosting should be able to construct basis functions that approximately represent the underlying low-dimensional data features. However, our approach does not work in purely high-dimensional settings where the number of regressors exceeds the number of observations.

We study the performance of the boostIV and post-boostIV algorithms in a series of Monte Carlo experiments. We compare the performance of our algorithms to both the standard sieve NPIV estimator and a variety of modern ML estimators. Our results demonstrate that boostIV performs at worst on par with the state of the art ML estimators. Moreover, we find no empirical evidence that post-boostIV achieves superior performance compared to boostIV and vice versa. However, adding the post-processing step reduces the number of boosting iterations needed for the algorithm to converge, rendering it (potentially) computationally more efficient. (To be more precise, there is a trade-off at play. One boostIV iteration takes less time than one post-boostIV iteration, as the latter algorithm includes an additional estimation step plus one more layer of cross-fitting. As a result, if adding the post-processing step reduces the number of boosting iterations significantly, then we achieve computational gains. It might not be the case otherwise.)

This paper brings together two strands of literature. First, our approach contributes to the literature on nonparametric instrumental variables modeling. Newey and Powell (2003) propose to replace the linear relationships in standard linear IV regression with linear projections on a series of basis functions (also see Blundell et al. (2007) for an application to Engel-curve estimation). Darolles et al. (2011) and Hall, Horowitz, et al. (2005) suggest nonparametrically estimating the conditional distribution of the endogenous regressors given the instruments, $F(x \mid z)$, using kernel density estimators. However, despite their simplicity and flexibility, both approaches are subject to the curse of dimensionality. The machine learning literature has recently also contributed to the nonparametric IV literature. Hartford et al. (2017) propose the DeepIV estimator, which first estimates $F(x \mid z)$ with a mixture of deep generative models, on which the structural function is then learned with another deep neural network. The Kernel IV estimator of Singh et al. (2019) exploits a conditional mean embedding of $F(x \mid z)$, which is then used in a second stage kernel ridge regression. Muandet et al. (2019) avoid the traditional two stage procedure by focusing on the dual problem and fitting just a single kernel ridge regression.

Second, we exploit insights from the boosting literature. Originally boosting came out as an ensemble method for classification in computational learning theory (Schapire, 1990; Freund, 1995; Freund and Schapire, 1997). Later on, Friedman et al. (2000) drew connections between boosting and statistical learning theory by viewing boosting as an approximation to additive modeling. A different perspective on boosting as a gradient descent algorithm in a function space connects boosting to the more common optimization view of statistical inference (Breiman, 1998; Breiman, 1999; Friedman, 2001).
$L_2$-boosting, introduced by Bühlmann and Yu (2003), provides a powerful tool for learning regression functions. A comprehensive boosting review can be found in Bühlmann and Hothorn (2007).

The remainder of the paper is organized as follows. Section 2 briefly introduces the NPIV framework. Section 3 describes the standard boosting procedure. We present boostIV and post-boostIV in Section 4. Section 5 talks about hyperparameter tuning. Section 6 discusses consistency. We illustrate the numerical performance of our algorithms in Section 7. Section 8 concludes. All the proofs and mathematical details are left for the Appendix.

Consider the standard conditional mean model of Newey and Powell (2003),
$$y = g(x) + \varepsilon, \qquad \mathbb{E}[\varepsilon \mid z] = 0, \qquad (1)$$
where $y$ is a scalar random variable, $g$ is an unknown structural function of interest, $x$ is a $d_x \times 1$ vector of explanatory variables, $z$ is a $d_z \times 1$ vector of instruments, and $\varepsilon$ is an error term. (The approach can easily be extended to cases where only some of the regressors are endogenous. Suppose $x = (x_1, x_2)$, where $x_1$ consists of endogenous regressors and $x_2$ is a vector of exogenous regressors. Let $w$ be a vector of excluded instruments and set $z = (w, x_2)$. This perfectly fits into the model described by (1).) Suppose that the model is identified and the completeness condition holds, i.e. for all measurable real functions $\delta$ with finite expectation,
$$\mathbb{E}[\delta(x) \mid z] = 0 \;\Rightarrow\; \delta(x) = 0.$$
Intuitively, this condition implies that there is enough variation in the instruments to explain the variation in $x$.

The conditional expectation of (1) yields the integral equation
$$\mathbb{E}[y \mid z] = \int g(x)\, dF(x \mid z), \qquad (2)$$
where $F$ denotes the conditional cdf of $x$ given $z$. Solving for $g$ directly is an ill-posed problem as it involves inverting linear compact operators (see e.g. Kress (1989)). Note that the model in (1) does not have an explicit reduced form, i.e. a functional relationship between endogenous and exogenous variables; however, it is implicitly embedded in $F$. Thus, from the estimation perspective we have two objects to estimate: (i) the conditional cdf $F(x \mid z)$ and (ii) the structural function $g$.

A common approach in applied work is to assume that the relationships between $y$ and $x$ as well as between $x$ and $z$ are linear, which leads to the standard 2SLS estimator. However, this can be a very restrictive assumption in practice, which can result in misspecification bias. Many more flexible nonparametric extensions of 2SLS have been developed in the econometrics literature. The standard approach is to use the series estimator of Newey and Powell (2003), who propose to replace the linear relationships with linear projections on a series of basis functions.

To illustrate the approach, let us approximate $g$ with a series expansion
$$g(x) \approx \sum_{\ell=1}^{L} \gamma_\ell\, p_\ell(x),$$
where $p^L(x) = (p_1(x), \ldots, p_L(x))$ is a series of basis functions. This allows us to rewrite the conditional expectation of $y$ given $z$ as
$$\mathbb{E}[y \mid z] \approx \sum_{\ell=1}^{L} \gamma_\ell\, \mathbb{E}[p_\ell(x) \mid z]. \qquad (3)$$
Let $q^K(z) = (q_1(z), \ldots, q_K(z))$ be a series of IV basis functions. This implies a 2SLS type estimator of $\gamma$,
$$\hat{\gamma} = \left(\hat{\mathbb{E}}[p^L(x) \mid z]'\, \hat{\mathbb{E}}[p^L(x) \mid z]\right)^{-1} \hat{\mathbb{E}}[p^L(x) \mid z]'\, y, \qquad (4)$$
where $\hat{\mathbb{E}}[p^L(x) \mid z] = q^K(z)\left(q^K(z)' q^K(z)\right)^{-1} q^K(z)'\, p^L(x)$. Given $L, K \to \infty$ as $n \to \infty$, asymptotically one can recover the true structural function. However, in finite samples one has to truncate the sieve at some value. Despite that, the performance of the estimator hinges crucially on the choice of the approximating space, especially in high dimensions.
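For concreteness, the following is a minimal sketch of the sieve 2SLS estimator in (4) for scalar $x$ and $z$, using plain polynomial bases; the basis choices and helper names are illustrative and not part of the paper.

```python
import numpy as np

def poly_basis(v, degree):
    """Polynomial sieve basis (1, v, v^2, ..., v^degree) for a 1-D array v."""
    return np.column_stack([v ** d for d in range(degree + 1)])

def sieve_2sls(y, x, z, L=4, K=6):
    """Sieve NPIV estimator of eq. (4): project p^L(x) on q^K(z), then regress y."""
    P = poly_basis(x, L)                                # p^L(x), n x (L+1)
    Q = poly_basis(z, K)                                # q^K(z), n x (K+1)
    # first stage: E_hat[p^L(x) | z] = Q (Q'Q)^{-1} Q' P
    P_hat = Q @ np.linalg.lstsq(Q, P, rcond=None)[0]
    # second stage: gamma_hat = (P_hat' P_hat)^{-1} P_hat' y
    gamma = np.linalg.lstsq(P_hat, y, rcond=None)[0]
    return lambda x_new: poly_basis(x_new, L) @ gamma   # estimated g(.)
```

The sketch makes the dependence on the two tuning choices explicit: the target sieve (here, degree `L`) and the instrument sieve (degree `K`), which is precisely the specification burden the data-driven approach below removes.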
Moreover, NPIV estimators suffer greatly from the curse of dimensionality, which renders them inapplicable in many applications. Alternatively, we propose a data-driven approach which is agnostic to the choice of sieve/approximating functions.

Boosting is a greedy algorithm to learn additive basis function models of the form
$$f(x) = \alpha_0 + \sum_{m=1}^{M} \alpha_m \phi(x; \theta_m), \qquad (5)$$
where the $\phi(\cdot; \theta_m)$ are generated by a simple algorithm called a weak learner or base learner. The weak learner can be any classification or regression algorithm, such as a regression tree, a random forest, a simple single-layer neural network, etc. One can boost the performance (on the training set) of any weak learner arbitrarily high, provided the weak learner can always perform slightly better than chance (Schapire, 1990; Freund and Schapire, 1996). (This statement is relevant when applied to classification problems. For regression problems any simple method, such as a least squares regression, a regression stump, or a one- or two-layered neural network, will work.) This is a very convenient feature, since the only thing we need to take a stance on is the form of the weak learner, which is much less restrictive than choosing a sieve.

The goal of boosting is to solve the following optimization problem:
$$\min_f \sum_{i=1}^{N} L(y_i, f(x_i)), \qquad (6)$$
where $L(y, y')$ is a loss function and $f$ is defined by (5). Since the boosting estimator depends on the choice of the loss function, the algorithm to solve (6) should be adjusted for a particular choice. Instead, one can use a generic version called gradient boosting (Friedman, 2001; Mason et al., 2000), which works for an arbitrary loss function.

Breiman (1998) showed that boosting can be interpreted as a form of the gradient descent algorithm in function space. This idea was then further extended by Friedman (2001), who presented the following functional gradient descent or gradient boosting algorithm:

1. Given data $\{(y_i, x_i)\}_{i=1}^{n}$, initialize the algorithm with some starting value. Common choices are
$$f_0(x) \equiv \operatorname*{argmin}_{c} \sum_{i=1}^{N} L(y_i, c),$$
which is simply $\bar{y}$ under the squared loss, or $f_0(x) \equiv 0$. Set $m = 0$.

2. Increase $m$ by 1. Compute the negative gradient vector and evaluate it at $f_{m-1}(x_i)$:
$$r_{im} = -\left.\frac{\partial L(y_i, f)}{\partial f}\right|_{f = f_{m-1}(x_i)}, \qquad i = 1, \ldots, n.$$

3. Use the weak learner to compute $(\alpha_m, \theta_m)$ which minimize $\sum_{i=1}^{N} (r_{im} - \alpha \phi(x_i; \theta))^2$.

4. Update
$$f_m(x) = f_{m-1}(x) + \alpha_m \phi(x; \theta_m),$$
that is, proceed along an estimate of the negative gradient vector. In practice, better (test set) performance can be obtained by performing "partial updates" of the form
$$f_m(x) = f_{m-1}(x) + \nu\, \alpha_m \phi(x; \theta_m),$$
where $0 \leq \nu \leq 1$ is a step-size parameter.

5. Continue until $m = M$ for some stopping iteration $M$.

The key point is that we do not go back and adjust earlier parameters. The resulting basis functions learnt from the data are $\phi(x) = (\phi(x; \theta_1), \ldots, \phi(x; \theta_M))$. The number of iterations $M$ is a tuning parameter, which can be optimally tuned via cross-validation or some model selection criterion (see Section 5 for more details).

The main complication in the NPIV set-up is that $x$ is potentially endogenous; otherwise learning the structural function via boosting would be straightforward. Moreover, we cannot learn basis functions in the first step and then construct IVs in the second. The dependence of the basis functions for the structural equation on the instruments, and vice versa, suggests an iterative algorithm.

Before we introduce the algorithm, we need to set up the boosting IV framework first. Combining (1) and (5) gives
$$y = \alpha_0 + \sum_{m=1}^{M} \alpha_m \phi(x; \theta_m) + \varepsilon. \qquad (7)$$
Hence, the conditional expectation of $y$ given $z$ becomes
$$\mathbb{E}[y \mid z] = \alpha_0 + \sum_{m=1}^{M} \alpha_m \mathbb{E}[\phi(x; \theta_m) \mid z]. \qquad (8)$$
Note that (7) and (8) closely resemble their standard NPIV counterparts (1) and (3). The only difference is that the form of the basis functions for boosting must be estimated, while for the standard NPIV it has to be specified ex ante. Unlike standard boosting, where the goal is to learn $\mathbb{E}[y \mid x]$, in the presence of endogeneity we want to match $\mathbb{E}[y \mid z]$, implying that in each boosting iteration we have to learn the conditional expectation of the weak learner given the IVs.

To keep things clear and simple, we focus on $L_2$-boosting, which assumes the squared loss function. Bühlmann and Yu (2003) show that $L_2$-boosting is equivalent to iterative fitting of residuals. In the IV context, it means that at step $m$ the loss has the form
$$L(y, g_{m-1}(x) + \alpha\, \mathbb{E}[\phi(x; \theta) \mid z]) = (r_m - \alpha\, \mathbb{E}[\phi(x; \theta) \mid z])^2,$$
where $r_m \equiv y - g_{m-1}(x)$ is the current residual. Thus, at step $m$ the optimal parameters minimize the loss between the residuals and the conditional expectation of the weak learner given the instruments,
$$(\alpha_m, \theta_m) = \operatorname*{argmin}_{\alpha, \theta} \sum_{i=1}^{N} (r_{im} - \alpha\, \mathbb{E}[\phi(x_i; \theta) \mid z_i])^2. \qquad (9)$$
However, the conditional expectation $\mathbb{E}[\phi(x; \theta) \mid z]$ is unknown and has to be estimated.

A simple way to estimate the conditional expectation in (9) is to project the weak learner on the space spanned by the IVs,
$$\hat{\mathbb{E}}[\phi(x; \theta) \mid z] = P_Z\, \phi(x; \theta),$$
where $P_A = A(A'A)^{-1}A'$ is a projection matrix. The exogeneity condition in (1) implies that any function of $z$ can serve as an instrument. However, we do not need just any function; we need a transformation of $z$ that will give us strong instruments, i.e. instruments that explain the majority of the variation in the endogenous variables. We follow Gandhi et al. (2019) and introduce an additional step on which we learn the instruments. Let $\mathcal{H}(\cdot; \eta)$ be a class of IV functions parameterized by $\eta$. This formulation allows us to use various off-the-shelf algorithms such as neural networks, random forests, etc. to learn $\mathcal{H}(\cdot; \eta)$. Given the learnt IV transformation $\mathcal{H}(\cdot; \eta)$, we can rewrite (9) as
$$(\alpha_m, \theta_m) = \operatorname*{argmin}_{\alpha, \theta} \sum_{i=1}^{N} \left(r_{im} - \alpha\, P_{\mathcal{H}(z_i; \eta)}\, \phi(x_i; \theta)\right)^2.$$
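As an illustration, here is a minimal sketch of one such update under the squared loss, taking the learnt instrument matrix as given and using a shallow regression tree as the weak learner. The joint minimization over $(\alpha, \theta)$ is approximated in a simple two-step way (fit the learner to the residuals, then project and choose the step size); this is a heuristic reading, not necessarily the exact implementation used in the paper, and all helper names are ours.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def boostiv_step(r, x, H, max_depth=2):
    """One (approximate) boostIV update.
    r: current residuals, shape (n,); x: regressors, shape (n, d_x);
    H: learnt instrument features H(z; eta), shape (n, q)."""
    tree = DecisionTreeRegressor(max_depth=max_depth).fit(x, r)
    phi = tree.predict(x)                               # phi(x; theta_m)
    # P_H phi = H (H'H)^{-1} H' phi  -- projection on the learnt instruments
    phi_iv = H @ np.linalg.lstsq(H, phi, rcond=None)[0]
    alpha = float(phi_iv @ r / (phi_iv @ phi_iv))       # least-squares step size
    return tree, alpha
```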
Since the basis function parameters $(\alpha, \theta)$ depend on the IV transformation parameters $\eta$ and vice versa, we propose an algorithm that iterates between two steps. At the first step we learn the instruments, i.e. $\eta_m$, given the basis function parameter estimates from the previous iteration, $(\alpha_{m-1}, \theta_{m-1})$; then at the second step we learn new parameter estimates $(\alpha_m, \theta_m)$ given the instruments from the first step. We can draw an analogy with the canonical two-stage least squares, where we estimate the reduced form in the first stage and the structural equation in the second. The details are provided in Algorithm 1.
Algorithm 1: Naive boostIV
  Initialize basis functions: $\phi_0 = \bar{y}$
  for iteration $m$ do
    First stage: given $\phi(x; \theta_{m-1})$, estimate $\mathcal{H}(z; \eta_m)$
    Second stage: given $\mathcal{H}(z; \eta_m)$, solve
      $(\alpha_m, \theta_m) = \operatorname*{argmin}_{\alpha, \theta} \sum_{i=1}^{N} \left(r_{im} - \alpha\, P_{\mathcal{H}(z_i; \eta_m)}\, \phi(x_i; \theta)\right)^2$
    Update: $g_m(x) = g_{m-1}(x) + \alpha_m \phi(x; \theta_m)$
  end
  Stop at iteration $M$

We call this algorithm the naive boostIV, since we use the same data to learn both the instruments and the basis functions. (In general, we do not have to use a projection in the second stage; we can use a more complex model to estimate the conditional expectation.) Asymptotically this will not affect the properties of the estimator; however, in finite samples biases from the first stage will propagate to the second. This issue can be especially severe if we use regularized estimators in the first stage, as the regularization bias will heavily affect the second stage estimates. To get around this issue we resort to cross-fitting.

Let $\mathcal{D} = \{y_i, x_i, z_i\}_{i=1}^{n}$ be our data set, where the $\mathcal{D}_i$ are iid. Split the data set into a $K$-fold partition, such that each partition $\mathcal{D}_k$ has size $\lfloor n/K \rfloor$, and let $\mathcal{D}_k^c$ be the excluded data. The boostIV procedure with cross-fitting is described in Algorithm 2.

Algorithm 2: boostIV with cross-fitting
  Folds $\{\mathcal{D}_1, \ldots, \mathcal{D}_K\} \leftarrow$ Partition($\mathcal{D}$, $K$)
  Initialize basis functions: $\phi_0^k = \bar{y}$ for $k = 1, \ldots, K$
  for iteration $m$ do
    for fold $k$ do
      First stage:
      • given $\phi(x_k^c; \theta_{m-1}^k)$ and $z_k^c$, estimate $\mathcal{H}(\cdot; \eta_m^k)$
      • apply the learnt transformation to generate IVs $\mathcal{H}(z_k; \eta_m^k)$
      Second stage: given $\mathcal{H}(z_k; \eta_m^k)$, solve
        $(\alpha_m^k, \theta_m^k) = \operatorname*{argmin}_{\alpha, \theta} \sum_{i \in \mathcal{D}_k} \left(r_{im} - \alpha\, P_{\mathcal{H}(z_i; \eta_m^k)}\, \phi(x_i; \theta)\right)^2$
      Update: $g_m^k(x_k) = g_{m-1}^k(x_k) + \alpha_m^k \phi(x_k; \theta_m^k)$
    end
  end
  Stop at iteration $M$
  Output: $\hat{g}(x) = \frac{1}{K} \sum_{k=1}^{K} g_M^k(x)$
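A compact sketch of the cross-fitted loop in Algorithm 2 is given below, reusing the `boostiv_step` helper from above. We read the first stage as learning $\mathbb{E}[\phi_{m-1}(x) \mid z]$ by regressing the previous basis function on the instruments using the complement fold; this is one admissible choice among the off-the-shelf learners the algorithm allows, the random forest is picked purely for illustration, and all names are ours.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

def boostiv_cross_fit(y, x, z, M=100, K=5, max_depth=2):
    """Sketch of Algorithm 2: instruments learnt on the complement of each fold,
    weak learners and step sizes fitted on the fold itself.
    y: (n,), x: (n, d_x), z: (n, d_z)."""
    folds = list(KFold(n_splits=K, shuffle=True, random_state=0).split(x))
    ensembles = [[] for _ in range(K)]                       # per-fold (tree, alpha)
    phi_prev = [np.full(len(comp), y[comp].mean()) for comp, _ in folds]
    g_fold = [np.full(len(fold), y.mean()) for _, fold in folds]
    for m in range(M):
        for k, (comp, fold) in enumerate(folds):
            # first stage: learn H(.; eta_m) ~ E[phi_{m-1}(x) | z] on D_k^c
            iv = RandomForestRegressor(n_estimators=100, random_state=m)
            iv.fit(z[comp], phi_prev[k])
            H = iv.predict(z[fold]).reshape(-1, 1)           # IVs on fold k
            # second stage: one projected residual-fitting step on D_k
            r = y[fold] - g_fold[k]
            tree, alpha = boostiv_step(r, x[fold], H, max_depth)
            g_fold[k] += alpha * tree.predict(x[fold])
            ensembles[k].append((tree, alpha))
            phi_prev[k] = tree.predict(x[comp])              # input to the next first stage

    def g_hat(x_new):                                        # average the fold fits
        preds = [y.mean() + sum(a * t.predict(x_new) for t, a in ens)
                 for ens in ensembles]
        return np.mean(preds, axis=0)
    return g_hat
```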
Our boostIV algorithm also allows us to incorporate optimal instruments in the sense of Chamberlain (1987), i.e. instruments that achieve the smallest asymptotic variance. Assuming conditional homoskedasticity, the optimal instrument vector of Chamberlain (1987) at step $m$ is
$$\mathcal{H}(z; \eta_m) = D_m(z)\, \sigma_m^{-2}, \qquad (10)$$
where
$$D_m(z) = \mathbb{E}\!\left[\left.\frac{\partial \varepsilon(\gamma_m)}{\partial \gamma_m'}\,\right|\, z\right], \qquad \gamma_m = (\alpha_m, \theta_m')' \qquad (11)$$
is the conditional expectation of the derivative of the conditional moment restriction with respect to the boosting parameters, and $\sigma_m^2 = \mathbb{E}[r_m^2 \mid z]$ is the conditional variance of the error term at step $m$. Thus, the IV transformation parameters $\eta_m$ are implicitly embedded in the particular approximation used to estimate $D_m(z)$.

The main complication with using optimal IVs is that they are generally unknown; hence, the common approach is to consider approximations. The parametrization in (10)-(11) allows us to use any off-the-shelf statistical/ML method to estimate the optimal functional form for the instruments. Moreover, the iterative nature of the algorithm allows us to use the estimates from step $m-1$ to approximate the optimal instruments at step $m$.

An important feature of forward stage-wise additive modeling is that we do not go back and adjust earlier parameters. However, we might want to revisit the weights on the learnt basis functions to achieve a better fit. This can be seen as a way of post-processing our boostIV procedure. The whole procedure can be broken down into two stages:

1. Apply the boostIV algorithm to learn basis functions $\hat{\phi}_m(x) = \frac{1}{K} \sum_{k=1}^{K} \phi(x; \theta_m^k)$ for $m = 1, \ldots, M$;

2. Estimate the weights
$$\hat{\beta} = \operatorname*{argmin}_{\beta} \sum_{i=1}^{n} \left(y_i - \beta_0 - \sum_{m=1}^{M} \beta_m \hat{\phi}_m(x_i)\right)^2. \qquad (12)$$

Note that the basis functions $(\hat{\phi}_1(x), \ldots, \hat{\phi}_M(x))$ are causal in the sense that they are constructed using estimated parameters $\theta$ that identify a causal relationship between $x$ and $y$.

Boosting is an example of an ensemble method which combines various predictions with appropriate weights to get a better prediction. In the context of boostIV it works in the following way. We exploit the variation in the IVs to get causal parameters $\theta$. Given the estimated parameters, we can treat each learnt basis function $\phi(x; \theta_m)$ as a separate prediction obtained by fitting a base learner. Then the post-processing step in (12) can simply be seen as model averaging.

We can estimate the optimal weights $\hat{\beta}$ by simply running a least squares regression as in (12) or use any other method such as random forests, neural networks, boosting, etc. To avoid carrying over any biases from the estimation of $(\hat{\phi}_1(x), \ldots, \hat{\phi}_M(x))$ into the choice of $\beta$, we use cross-fitting once again, which is a generalization of the stacking idea of Wolpert (1992). The details are provided in Algorithm 3.
Algorithm 3: post-boostIV
  Folds $\{\mathcal{D}_1, \ldots, \mathcal{D}_L\} \leftarrow$ Partition($\mathcal{D}$, $L$)
  for fold $\ell$ do
    1. apply boostIV to $\mathcal{D}_\ell^c$ and estimate basis functions $(\hat{\phi}_1^\ell(x), \ldots, \hat{\phi}_M^\ell(x))$
    2. estimate post-boosting weights
       $\hat{\beta}^\ell = \operatorname*{argmin}_{\beta} \sum_{i \in \mathcal{D}_\ell} \left(y_i - \beta_0 - \sum_{m=1}^{M} \beta_m \hat{\phi}_m^\ell(x_i)\right)^2$
    3. fold fit at point $x$: $g_\ell(x) = \hat{\beta}_0^\ell + \sum_{m=1}^{M} \hat{\beta}_m^\ell \hat{\phi}_m^\ell(x)$
  end
  Output: $\hat{g}(x) = \frac{1}{L} \sum_{\ell=1}^{L} g_\ell(x)$
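A minimal sketch of the weight-estimation stage on one fold is given below, treating the cross-fitted basis functions as fixed features and re-estimating the weights by least squares as in (12); the array layout and names are ours, and any other regression method could be substituted here.

```python
import numpy as np

def post_boostiv_weights(y, basis_values):
    """Stage 2 of post-boostIV on one fold: regress y on an intercept and the
    M learnt basis functions evaluated on the held-out fold (eq. 12).
    basis_values: (n_fold, M) matrix with column m = phi_hat_m(x_i)."""
    Phi = np.column_stack([np.ones(len(y)), basis_values])   # [1, phi_1, ..., phi_M]
    beta, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return beta                                               # (beta_0, beta_1, ..., beta_M)
```

The fold fit in step 3 is then simply `Phi_new @ beta` for new points, and the final estimator averages these fold fits.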
Boosting performance crucially depends on the number of boosting iterations; in other words, $M$ is a tuning parameter. A common way to tune any ML algorithm is cross-validation (CV). The most popular type of CV is $k$-fold CV. The idea behind $k$-fold CV is to create a number of partitions (validation datasets) from the training dataset and fit the model to the training dataset (sans the validation data). The model is then evaluated against each validation dataset and the results are averaged to obtain the cross-validation error. In application to boosting, we can estimate the CV error for a grid of candidate tuning parameters (numbers of iterations) and pick the $M^*$ that minimizes the CV error. Alternatively, Bühlmann and Hothorn (2007) show how to apply AIC and BIC criteria to boosting in the exogenous case. However, it is not clear how to adjust those criteria for the presence of endogeneity.

Both standard $k$-fold cross-validation and the model selection criteria considered in Bühlmann and Hothorn (2007) can be computationally costly, as it is necessary to compute all boosting iterations under consideration for the training data. To sidestep this issue, we apply early stopping to $k$-fold CV. The idea behind early stopping is to monitor the behavior of the CV error and stop as soon as the performance starts deteriorating, i.e. the CV error goes up.

Algorithm 4 provides implementation details for $k$-fold CV with early stopping for either the boostIV or the post-boostIV procedure. The early stopping criterion compares the CV error evaluated for the model based on $M_j$ boosting iterations to the CV error evaluated for the model based on $M_i$, $M_i < M_j$. If $CV_{err}(M_j) > CV_{err}(M_i) + \epsilon$, where $\epsilon > 0$ is a tolerance parameter, we set $M^* = M_i$; otherwise, we continue the search. If the criterion is not met for any of the candidate tuning parameters, we pick the largest value $M^* = \bar{M}$.

An alternative solution would be to use a slice of the dataset as the validation sample and tune the number of iterations using the observations from the validation sample. We actually use this approach in our simulations since it significantly reduces the computational burden.
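The search over the grid can be sketched as follows (the full cross-fitted version is formalized in Algorithm 4 below); `cv_error` stands for any routine that returns the $k$-fold CV error of a (post-)boostIV fit with a given number of iterations, and is our own shorthand rather than a function defined in the paper.

```python
def tune_iterations(cv_error, grid, eps=1e-3):
    """Pick the number of boosting iterations M* by early stopping: walk along
    the increasingly sorted grid and stop once the CV error rises by more than eps."""
    prev_err, prev_m, m_star = float("inf"), grid[0], grid[-1]
    for m in grid:
        err = cv_error(m)                 # k-fold CV error for m iterations
        if err > prev_err + eps:          # early stopping criterion
            m_star = prev_m
            break
        prev_err, prev_m = err, m
    return m_star
```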
Algorithm 4: $k$-fold CV with early stopping
  Folds $\{\mathcal{D}_1, \ldots, \mathcal{D}_k\} \leftarrow$ Partition($\mathcal{D}$, $k$)
  Set of indices $\mathcal{I}_M$ corresponding to a sorted grid of tuning parameters $\mathcal{M} = \{1, \ldots, \bar{M}\}$
  while $\mathcal{M}[i] \leq \bar{M}$ for $i \in \mathcal{I}_M$ do
    for fold $\kappa = 1, \ldots, k$ do
      1. training set $\mathcal{T}_\kappa = \mathcal{D}_\kappa^c$ → apply (post-)boostIV($\mathcal{T}_\kappa$, $\mathcal{M}[i]$) → $g^{boost}_{\mathcal{M}[i], \kappa}(x)$
      2. validation set $\mathcal{V}_\kappa = \mathcal{D}_\kappa$ → $CV_{err,\kappa}(\mathcal{M}[i]) = \frac{1}{|\mathcal{V}_\kappa|} \sum_{i \in \mathcal{V}_\kappa} \left(y_i - g^{boost}_{\mathcal{M}[i], \kappa}(x_i)\right)^2$
    end
    calculate $CV_{err}(\mathcal{M}[i]) = \frac{1}{k} \sum_{\kappa=1}^{k} CV_{err,\kappa}(\mathcal{M}[i])$
    if $CV_{err}(\mathcal{M}[i]) > CV_{err}(\mathcal{M}[i-1]) + \epsilon$ then   // Early stopping criterion
      $M^* = \mathcal{M}[i-1]$
      break   // Break the while loop if the criterion is met
    else
      $i = i + 1$
    end
  end
  $M^* = \bar{M}$
  Output: $g^{boost}_{M^*}(x) \leftarrow$ (post-)boostIV($\mathcal{D}$, $M^*$)

In this section, we show that under mild conditions boostIV is consistent. Theoretical properties of post-boostIV are beyond the scope of the paper and are left for future research.

We borrow the main idea from Zhang and Yu (2005) and modify it accordingly to apply it to the GMM criterion. Let $g(W_i, f) = (y_i - f(x_i)) z_i$ denote a $k \times 1$ moment function, $g_0(f) = \mathbb{E}[g(W_i, f)]$ the population moment function, and $\hat{g}(f) = n^{-1} \sum_{i=1}^{n} g(W_i, f)$ its sample analog. Also let $\Omega$ denote a $k \times k$ positive semi-definite weight matrix and $\hat{\Omega}$ be its sample analog. Thus, the population GMM criterion and its sample analog are
$$Q(f) = g_0(f)'\, \Omega\, g_0(f), \qquad \hat{Q}(f) = \hat{g}(f)'\, \hat{\Omega}\, \hat{g}(f). \qquad (13)$$
The form of the GMM criterion in (13) corresponds to the form of the empirical objective function in Zhang and Yu (2005) with the loss function replaced by the moment function.

We follow Zhang and Yu (2005) and replace the functional gradient descent step (9), which leads to the 2SLS fitting procedure on every iteration, with an approximate minimization involving a GMM criterion. We can do that since the 2SLS solution is a special case of a GMM solution with an appropriate weighting matrix.

Assumption 1. Approximate Minimization.
On each iteration step $m$ we find $\bar{\alpha}_m \in \Lambda_m$ and $\bar{g}_m \in \mathcal{S}$ such that
$$\hat{Q}(f_m + \bar{\alpha}_m \bar{g}_m) \leq \inf_{\alpha_m \in \Lambda_m,\, g_m \in \mathcal{S}} \hat{Q}(f_m + \alpha_m g_m) + \epsilon_m, \qquad (14)$$
where $\epsilon_m$ is a sequence of non-negative numbers that converge to 0.

As Zhang and Yu (2005) show, the consistency of the boosting procedure consists of two parts: (i) numerical convergence of the procedure itself, i.e. the algorithm achieves the true minimum of the objective function, and (ii) statistical convergence that ensures the uniform convergence of the sample criterion to its population analog. We treat these two steps separately in the following subsections, and then combine them to demonstrate consistency of boostIV.

To demonstrate numerical convergence, we first have to verify that the sample GMM criterion in (13) satisfies Assumption 3.1 from Zhang and Yu (2005). Following Zhang and Yu (2005), we introduce some additional notation. Let $\mathcal{S}$ be a set of real-valued functions and define
$$\operatorname{span}(\mathcal{S}) = \left\{ \sum_{j=1}^{J} w_j f_j : f_j \in \mathcal{S},\; w_j \in \mathbb{R},\; J \in \mathbb{Z}^+ \right\},$$
which forms a linear function space. Also, for all $f \in \operatorname{span}(\mathcal{S})$ define the 1-norm with respect to the basis $\mathcal{S}$ as
$$\|f\|_1 = \inf\left\{ \|w\|_1 : f = \sum_{j=1}^{J} w_j f_j,\; f_j \in \mathcal{S},\; J \in \mathbb{Z}^+ \right\}.$$

Assumption 2.
A convex function $A(f)$ defined on $\operatorname{span}(\mathcal{S})$ should satisfy the following conditions:
1. The functional $A$ satisfies the following Fréchet-like differentiability condition:
$$\lim_{h \to 0} \frac{1}{h}\left(A(f + h\phi) - A(f)\right) = \nabla A(f)'\phi.$$
2. For all $f \in \operatorname{span}(\mathcal{S})$ and $\phi \in \mathcal{S}$, the real-valued function $A_{f,\phi}(h) = A(f + h\phi)$ is second-order differentiable (as a function of $h$) and the second derivative satisfies
$$A''_{f,\phi}(0) \leq M(\|f\|_1),$$
where $M(\cdot)$ is a nondecreasing real-valued function.

Lemma 1.
Let (i) the basis functions $\phi$ be bounded, $\sup_x |\phi(x)| = C < \infty$, (ii) the maximal eigenvalue $\lambda_{max}$ of the weighting matrix $\Omega$ be bounded from above, $\lambda_{max}(\Omega) < \infty$, and (iii) $\mathbb{E}[|z_i' z_i|] \leq B < \infty$. Then the population GMM criterion defined in (13) satisfies Assumption 2.

Assumption 3. Step size. (a) Let $\Lambda_m \subset \mathbb{R}$ be such that $0 \in \Lambda_m$ and $\Lambda_m = -\Lambda_m$. (b) Let $h_m = \sup \Lambda_m$ satisfy the conditions
$$\sum_{j=0}^{\infty} h_j = \infty, \qquad \sum_{j=0}^{\infty} h_j^2 < \infty. \qquad (15)$$
Then we can bound the step size, $|\bar{\alpha}_m| \leq h_m$.

Note that Assumption 3(a) restricts the step size $\alpha_m$. Friedman (2001) argues that restricting the step size is always preferable in practice; thus, we restrict our attention to this case. (Zhang and Yu (2005) provide a short discussion on how to deal with the unrestricted step size; however, the argument relies on exact minimization, which greatly complicates the analysis.) Moreover, $\Lambda_m$ is allowed to depend on the previous steps of the algorithm. Assumption 3(b) requires the step sizes $h_j$ to be small ($\sum_{j=0}^{\infty} h_j^2 < \infty$), preventing large oscillations, but not too small ($\sum_{j=0}^{\infty} h_j = \infty$), ensuring that $f_m$ can cover the whole $\operatorname{span}(\mathcal{S})$. The following theorem establishes the main numerical convergence result.

Theorem 1. Assume that we choose quantities $f_0$, $\epsilon_m$ and $\Lambda_m$ independent of the sample $W$. Given the results of Lemma 1, as long as there exist $h_j$ satisfying Assumption 3 and $\epsilon_j$ such that $\sum_{j=0}^{\infty} \epsilon_j < \infty$, we have the following convergence result:
$$\lim_{m \to \infty} \hat{Q}(f_m) = \inf_{f \in \operatorname{span}(\mathcal{S})} \hat{Q}(f).$$

We now need to show that the sample GMM criterion uniformly converges to its population analog; then under proper regularity conditions we will be able to ensure consistency of boostIV. To show this uniform convergence, we first bound the moment function and then show that this is sufficient to put a bound on the criterion function.

Assumption 4.
Assume the following conditions:
1. The class of weak learners $\mathcal{S}$ is closed under negation, i.e. $f \in \mathcal{S} \Rightarrow -f \in \mathcal{S}$.
2. The moment function is Lipschitz, with each component $j = 1, \ldots, k$ satisfying: there exists $\gamma_j(\beta)$ such that for all $\|f_1\|_1, \|f_2\|_1 \leq \beta$,
$$|g_j(f_1) - g_j(f_2)| \leq \gamma_j(\beta)\, |f_1 - f_2|,$$
implying that
$$\|g(f_1) - g(f_2)\| \leq \gamma(\beta)\, |f_1 - f_2|, \qquad \gamma(\beta) = \sqrt{\sum_{j=1}^{k} \gamma_j(\beta)^2}.$$

To bound the rate of uniform convergence of the moment function, we appeal to the concept of Rademacher complexity. Let $\mathcal{H} = \{h(w)\}$ be a set of real-valued functions. Let $\{\zeta_i\}_{i=1}^{n}$ be a sequence of binary random variables such that $\zeta_i$ takes values in $\{-1, 1\}$ with equal probabilities. Then the sample or empirical Rademacher complexity of class $\mathcal{H}$ is given by
$$\hat{R}(\mathcal{H}) = \mathbb{E}_\zeta\left[\sup_{h \in \mathcal{H}} n^{-1} \sum_{i=1}^{n} \zeta_i h(W_i)\right]. \qquad (16)$$
We also denote $R(\mathcal{H}) = \mathbb{E}_W \hat{R}(\mathcal{H})$ to be the expected Rademacher complexity, where $\mathbb{E}_W$ is the expectation with respect to the sample $W = (W_1, \ldots, W_n)$. Note that the definition in (16) differs from the standard definition of Rademacher complexity, where there is an absolute value under the supremum sign (van der Vaart and Wellner, 1996). The current version of Rademacher complexity has the merit that it vanishes for function classes consisting of a single constant function, and it is always dominated by the standard Rademacher complexity. Both definitions agree for function classes which are closed under negation (Meir and Zhang, 2003).

Lemma 2.
Under Assumption 4, for all $j = 1, \ldots, k$,
$$\mathbb{E}_W \sup_{\|f\|_1 \leq \beta} |g_{0,j}(f) - \hat{g}_j(f)| \leq \gamma_j(\beta)\, \beta\, R(\mathcal{S}).$$

For many classes the Rademacher complexity can be calculated directly; however, to obtain a more general result we need to bound $R(\mathcal{S})$. Using the results from Section 4.3 in Zhang and Yu (2005), we can bound the expected Rademacher complexity of the weak learner class by
$$R(\mathcal{S}) \leq \frac{C_{\mathcal{S}}}{\sqrt{n}}, \qquad (17)$$
where $C_{\mathcal{S}}$ is a constant that solely depends on $\mathcal{S}$. Zhang and Yu (2005) also show that popular weak learners such as two-level neural networks and tree basis functions satisfy the requirements. However, Zhang and Yu (2005) point out that in general the bound may be slower than root-$n$. In Appendix C we derive an alternative bound on $R(\mathcal{S})$ that works for any class with finite VC dimension. The derived VC bound is slower by the factor of $\log(n)$ that appears in a lot of ML algorithms.

Condition (17) allows us to bound the moment function, which leads to a bound on the rate of uniform convergence of the GMM criterion. The formal statements of the results are presented below.

Lemma 3.
Suppose that condition (17) holds; then under Assumption 4,
$$\sup_{\|f\|_1 \leq \beta} \|g_0(f) - \hat{g}(f)\| \xrightarrow{p} 0.$$

Theorem 2.
Suppose that (i) the data $W = (W_1, \ldots, W_n)$ are i.i.d., (ii) $\hat{\Omega} \xrightarrow{p} \Omega$, (iii) Assumption 4 is satisfied, and (iv) $\mathbb{E}_W\left[\sup_{\|f\|_1 \leq \beta} \|g(W_i, f)\|\right] < \infty$. Then
$$\sup_{\|f\|_1 \leq \beta} |\hat{Q}(f) - Q(f)| \xrightarrow{p} 0.$$

In this section we put together the arguments for numerical and statistical convergence presented in the previous subsections to prove consistency of the boostIV algorithm. We start with a general decomposition illustrating the proof strategy and highlighting where exactly numerical and statistical convergence step in.

Suppose that we run the boostIV algorithm and stop at an early stopping point $\hat{m}$ that satisfies $\mathbb{P}(\|\hat{f}_{\hat{m}}\|_1 \leq \beta_n) = 1$ for some sample-independent $\beta_n \geq 0$.
Let $f^*$ be a unique minimizer of the population criterion, i.e. $Q(f^*) = \inf_{f \in \operatorname{span}(\mathcal{S})} Q(f)$. By the triangle inequality, we get the following decomposition:
$$\left|Q(\hat{f}_{\hat{m}}) - Q(f^*)\right| \leq \left|Q(\hat{f}_{\hat{m}}) - \hat{Q}(\hat{f}_{\hat{m}})\right| + \left|\hat{Q}(\hat{f}_{\hat{m}}) - \hat{Q}(f^*)\right| + \left|\hat{Q}(f^*) - Q(f^*)\right| \leq 2\sup_{\|f\|_1 \leq \beta}\left|\hat{Q}(f) - Q(f)\right| + \left|\hat{Q}(\hat{f}_{\hat{m}}) - \hat{Q}(f^*)\right|.$$
We can bound the first term using the uniform bound on the sample GMM criterion in Theorem 2; this is the statistical convergence argument. In order to bound the second term, we have to appeal to the numerical convergence argument in Theorem 1. As a result, since $Q(\hat{f}_{\hat{m}}) \to Q(f^*)$ as $n \to \infty$, it follows that $\hat{f}_{\hat{m}} \xrightarrow{p} f^*$. The following theorem formalizes the result.

Theorem 3.
Suppose that the assumptions of Theorems 1 and 2 hold. Consider two sequences $m_n$ and $\beta_n$ such that $\lim_{n \to \infty} m_n = \infty$ and $\lim_{n \to \infty} \gamma(\beta_n)\, \beta_n\, R(\mathcal{S}) = 0$. Then as long as we stop the algorithm at a step $\hat{m}$ based on $W$ such that $\hat{m} \geq m_n$ and $\|\hat{f}_{\hat{m}}\|_1 \leq \beta_n$, we have the consistency result $\hat{f}_{\hat{m}} \xrightarrow{p} f^*$.

To begin with, we consider a simple low-dimensional scenario with one endogenous variable and two instruments:
$$y = g(x) + \rho e + \delta, \qquad x = z_1 + z_2 + e + \gamma,$$
where the instruments are $z_j \sim U[-3, 3]$ for $j = 1, 2$, $e \sim \mathcal{N}(0, 1)$ is the confounder, $\delta, \gamma \sim \mathcal{N}(0, 0.1)$ are additional noise components, and $\rho$ is the parameter measuring the degree of endogeneity, which we set to 0.5 in the simulations. We focus on four specifications of the structural function:

• abs: $g(x) = |x|$
• log: $g(x) = \log(|x - \cdot| + 1)\,\operatorname{sign}(x - \cdot)$
• sin: $g(x) = \sin(x)$
• step: $g(x) = \mathbf{1}\{x < \cdot\} + 2.\cdot \times \mathbf{1}\{x \geq \cdot\}$

We compare the performance of boostIV and post-boostIV with the standard NPIV estimator using the cubic polynomial basis, the Kernel IV (KIV) regression of Singh et al. (2019), the DeepIV estimator of Hartford et al. (2017), and the DeepGMM estimator of Bennett et al. (2019). We use 1,000 observations for both the train and test sets and 500 observations for the validation set. Our results are based on 200 simulations for each scenario.
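As an illustration, the univariate design can be simulated as follows, with the structural function passed in as an argument. The symmetric uniform support and the reading of $\mathcal{N}(0, 0.1)$ as a variance are our interpretation of the text; the helper name is ours.

```python
import numpy as np

def simulate_univariate(n, g, rho=0.5, seed=None):
    """Simulate y = g(x) + rho*e + delta, x = z1 + z2 + e + gamma,
    with z_j ~ U[-3, 3], e ~ N(0, 1), delta, gamma ~ N(0, 0.1) (variance)."""
    rng = np.random.default_rng(seed)
    z = rng.uniform(-3, 3, size=(n, 2))
    e = rng.normal(0.0, 1.0, size=n)
    delta, gamma = rng.normal(0.0, np.sqrt(0.1), size=(2, n))
    x = z[:, 0] + z[:, 1] + e + gamma
    y = g(x) + rho * e + delta
    return y, x, z

# e.g. the 'sin' specification
y, x, z = simulate_univariate(1000, np.sin, seed=0)
```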
Table 1. Univariate design: Out-of-sample MSE.
        NPIV     KIV      DeepIV   DeepGMM   boostIV   post-boostIV
abs     0.1916   0.0564   0.1347    1.2717    0.0348    0.0217
log     0.6936   0.3367   1.2708   14.4615    0.3173    0.0930
sin     0.1837   0.0217   0.2798    0.8595    0.0292    0.0124
step    0.1267   0.0972   0.1756    0.9796    0.1027    0.0546

We plot our results in Figure 1, which shows the average out-of-sample fit across simulations (orange line) compared to the true target function (black line). Table 1 presents the out-of-sample MSE across simulations. The first thing to notice is that NPIV fails to capture the different functional form subtleties. Second, DeepIV's performance does not improve upon that of NPIV. Moreover, even though the DeepGMM estimates have lower bias than those of NPIV and DeepIV (except for the log function), they are quite volatile across simulations, leading to higher MSE. BoostIV performs on par with KIV both in terms of bias, as they are able to recover the underlying structural relation, and in terms of variance, leading to low MSE. Finally, the post-processing step helps to further improve upon boostIV's performance by reducing bias. On top of that, post-boostIV requires fewer iterations to converge: we use 5,000 iterations for boostIV, while post-boostIV uses on average 50 iterations. (In this experiment we do not tune boostIV; we just pick a large enough number of iterations for it to converge. However, we do tune post-boostIV. For the competing methods we use the following implementations: KIV, https://github.com/r4hu1-5in9h/KIV; DeepIV, the latest implementation of the econML package, https://github.com/microsoft/EconML; DeepGMM, https://github.com/CausalML/DeepGMM.)

Figure 1. Out-of-sample average fit across simulations. The black line is the true function, the orange line is the fit.

Consider the following data generating process:
$$y_i = h(x_i) + \varepsilon_i, \qquad x_{i,k} = g_k(z_i) + v_{i,k}, \quad k = 1, \ldots, d_x,$$
where $y_i \in \mathbb{R}$ is the response variable, $x_i \in \mathbb{R}^{d_x}$ is the vector of potentially endogenous variables, $z_i \in \mathbb{R}^{d_z}$ is the vector of instruments, $\varepsilon_i \in \mathbb{R}$ is the structural error term, and $v_i \in \mathbb{R}^{d_x}$ is the vector of the reduced form errors. The function $h(\cdot)$ is the structural function of interest, and the function $g(\cdot)$ governs the reduced form relationship between the endogenous regressors and the instrumental variables.

The instruments are drawn from a multivariate normal distribution, $z_i \sim \mathcal{N}(0, \Sigma_z)$, where $\Sigma_z$ is the identity matrix. The error terms are described by the following relationship:
$$\varepsilon \sim \mathcal{N}(0, 1), \qquad v \sim \mathcal{N}\left(\rho \varepsilon,\, (1 - \rho^2)\mathcal{I}\right),$$
where $\rho$ is the correlation between $\varepsilon$ and the elements of $v$, which controls the degree of endogeneity. We consider two structural function specifications:

1. a simpler design where the structural function is proportional to a multivariate normal density, i.e. $h(x) = \exp\{-0.5\, x'x\}$; we will further refer to this specification as Design 1;

2. a more challenging design where the structural function is $h(x) = \sum_{k=1}^{d_x} \sin(10\, x_k)$; we will further refer to this specification as Design 2.

We also consider two different choices of the reduced form function $g(\cdot)$:

(a) linear: $g(Z_i) = Z_i' \Pi$, where $\Pi \in \mathbb{R}^{d_x \times d_z}$ is a matrix of reduced form parameters;

(b) non-linear: $g_k(Z_i) = G(Z_i; \theta_k)$ for $k = 1, \ldots, d_x$, where $G(Z_i; \theta_k)$ is a multivariate normal density parameterized by the mean vector $\theta_k$ (for simplicity, we use the identity covariance matrix).

We use
1,000 observations for the train set and 500 observations for both the validation and test sets. We run 200 simulations for each scenario. The results are summarized in Tables 2 and 3.
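Before turning to the results, here is a sketch of the Design 1 data generating process with a linear first stage; the particular draw of the reduced-form matrix $\Pi$ and the way the correlated errors are generated are our own illustrative choices consistent with the stated correlation structure.

```python
import numpy as np

def simulate_design1(n, d_x=5, d_z=7, rho=0.25, seed=None):
    """Design 1, linear first stage: h(x) = exp(-0.5 x'x), x = z'Pi + v,
    z ~ N(0, I), eps ~ N(0, 1), corr(eps, v_k) = rho with unit marginal variances."""
    rng = np.random.default_rng(seed)
    z = rng.normal(size=(n, d_z))
    Pi = rng.normal(size=(d_z, d_x)) / np.sqrt(d_z)          # illustrative Pi
    eps = rng.normal(size=n)
    v = rho * eps[:, None] + np.sqrt(1 - rho ** 2) * rng.normal(size=(n, d_x))
    x = z @ Pi + v
    y = np.exp(-0.5 * np.sum(x ** 2, axis=1)) + eps
    return y, x, z
```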
Table 2. Design 1: Out-of-sample MSE.

dx  dz  IV type  ρ     NPIV     KIV     DeepIV  DeepGMM  boostIV  post-boostIV
 5   7  lin      0.25   4.9535  0.0147  0.0497  0.2234   0.0213   1.3306
 5   7  lin      0.75   6.8889  0.0286  0.054   0.1655   0.0603   0.7249
 5   7  nonlin   0.25   4.0548  0.017   0.0757  0.3262   0.0875   0.3457
 5   7  nonlin   0.75   1.9932  0.0516  0.1188  0.7265   0.4287   0.9128
10  12  lin      0.25  23.1025  0.0024  0.0867  0.3089   0.0084   1.0953
10  12  lin      0.75  39.6902  0.0108  0.0884  0.251    0.0427   0.6347
10  12  nonlin   0.25   6.4842  0.0038  0.05    0.4908   0.0525   0.8137
10  12  nonlin   0.75   2.53    0.0147  0.0691  0.822    0.3332   0.8937
Table 3. Design 2: Out-of-sample MSE.

dx  dz  IV type  ρ     NPIV      KIV     DeepIV  DeepGMM  boostIV  post-boostIV
 5   7  lin      0.25   21.5854  2.4983  2.5484  2.9105   2.5105   3.5498
 5   7  lin      0.75   23.2413  2.5043  2.5358  2.792    2.5351   3.5081
 5   7  nonlin   0.25   19.0871  2.5118  2.5415  2.9867   2.5707   3.0303
 5   7  nonlin   0.75   22.1192  2.5367  2.5523  3.4188   2.891    3.7043
10  12  lin      0.25  147.984   5.0047  5.1383  5.9647   5.0209   5.9435
10  12  lin      0.75  241.56    5.0326  5.1698  5.7147   5.0781   6.3259
10  12  nonlin   0.25   61.0328  5.0103  5.0636  6.2145   5.0713   5.9001
10  12  nonlin   0.75  112.785   4.9799  5.0631  6.193    5.3172   6.4674
In this section we apply our algorithms to a more economically driven example of demand estimation. Demand estimation is a cornerstone of modern industrial organization and marketing research. Besides its practical importance, it poses a challenging estimation problem to which modern econometric and statistical tools can be applied.

We consider the nonparametric demand estimation framework of Gandhi et al. (2020) (hereafter, GNT). GNT is a flexible framework that combines the nonparametric identification arguments of Berry and Haile (2014) with the dimensionality reduction techniques of Gandhi and Houde (2019).

In market $t$, $t = 1, \ldots, T$, there is a continuum of consumers choosing from a set of products $\mathcal{J} = \{0, 1, \ldots, J\}$ which includes the outside option, e.g. not buying any good. The choice set in market $t$ is characterized by a set of product characteristics $\chi_t$ partitioned as follows:
$$\chi_t \equiv (x_t, p_t, \xi_t),$$
where $x_t \equiv (x_{1t}, \ldots, x_{Jt})$ is a vector of exogenous observable characteristics (e.g. exogenous product characteristics or market-level income), $p_t \equiv (p_{1t}, \ldots, p_{Jt})$ are observable endogenous characteristics (typically, market prices), and $\xi_t \equiv (\xi_{1t}, \ldots, \xi_{Jt})$ represent unobservables potentially correlated with $p_t$ (e.g. unobserved product quality). Let $\mathcal{X}$ denote the support of $\chi_t$. Then the structural demand system is given by
$$\sigma : \mathcal{X} \mapsto \Delta^J,$$
where $\Delta^J$ is a unit $J$-simplex. The function $\sigma$ gives, for every market $t$, the vector $s_t$ of shares for the $J$ goods.

Following Berry and Haile (2014), we partition the exogenous characteristics as $x_t = \left(x_t^{(1)}, x_t^{(2)}\right)$, where $x_t^{(1)} \equiv \left(x_{1t}^{(1)}, \ldots, x_{Jt}^{(1)}\right)$, $x_{jt}^{(1)} \in \mathbb{R}$ for $j \in \mathcal{J} \setminus \{0\}$, and define the linear indices
$$\delta_{jt} = x_{jt}^{(1)} \beta_j + \xi_{jt}, \qquad j \in \mathcal{J} \setminus \{0\},$$
and let $\delta_t \equiv (\delta_{1t}, \ldots, \delta_{Jt})$. Without loss of generality, we can normalize $\beta_j = 1$ for all $j$ (see Berry and Haile (2014) for more details). Given the definition of the demand system, for every market $t$,
$$\sigma(\chi_t) = \sigma\left(\delta_t, p_t, x_t^{(2)}\right).$$
Following Berry et al. (2013) and Berry and Haile (2014), we can show that there exists at most one vector $\delta_t$ such that $s_t = \sigma\left(\delta_t, p_t, x_t^{(2)}\right)$, meaning that we can write
$$\delta_{jt} = \sigma_j^{-1}\left(s_t, p_t, x_t^{(2)}\right), \qquad j \in \mathcal{J} \setminus \{0\}. \qquad (18)$$
(We also need the completeness condition to be satisfied; see Berry and Haile (2014) for more details.) We can rewrite (18) in a more convenient form to get the following estimation equation:
$$x_{jt}^{(1)} = \sigma_j^{-1}\left(s_t, p_t, x_t^{(2)}\right) - \xi_{jt}. \qquad (19)$$
Note that in (19) the inverse demand is indexed by $j$, meaning that we have to estimate $J$ inverse demand functions. To circumvent this problem, Gandhi and Houde (2019) suggest transforming the input vector space under the linear utility specification to get rid of the $j$ subscript. GNT follow this idea and show that equation (19) can be rewritten as
$$\log\left(\frac{s_{jt}}{s_{0t}}\right) = x_{jt}^{(1)} + g(\omega_{jt}) + \xi_{jt}, \qquad (20)$$
where $g$ is such that
$$\sigma_j^{-1}\left(s_t, p_t, x_t^{(2)}\right) = \log\left(\frac{s_{jt}}{s_{0t}}\right) - g(\omega_{jt}),$$
and $\omega_{jt} \equiv \left(s_{jt}, \{s_{kt}, d_{jkt}\}_{j \neq k}\right)$, where $d_{jkt} = \tilde{x}_{jt} - \tilde{x}_{kt}$ and $\tilde{x}_t \equiv \left(p_t, x_t^{(2)}\right)$.

Let $y_{jt} \equiv \log(s_{jt}/s_{0t}) - x_{jt}^{(1)}$; then we can rewrite equation (20) in a more convenient form:
$$y_{jt} = g(\omega_{jt}) + \xi_{jt}. \qquad (21)$$
Thus, (21) is our structural equation, where $\omega_{jt}$ contains endogenous variables. Assume we have a cost shifter $w_{jt}$ that is exogenous; then, given $\mathbb{E}[\xi_{jt} \mid x_t, w_t] = 0$ for $j \in \mathcal{J} \setminus \{0\}$, we can estimate the inverse demand function $g$. To construct instruments, GNT transform the input space $(x_t, w_t)$ similarly to $\omega_{jt}$. Let $\zeta_{jt} \equiv \{\Delta_{jkt}\}_{j \neq k}$, where $\Delta_{jkt} = z_{jt} - z_{kt}$ and $z_t = (x_t, w_t)$.
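For intuition, here is a small sketch of how the pairwise-difference features $\omega_{jt}$ and $\zeta_{jt}$ can be assembled for one market from product-level arrays; the array layout and argument names are our own choices, not part of the GNT framework.

```python
import numpy as np

def demand_features(s_t, p_t, x1_t, x2_t, w_t):
    """Build, for one market t, omega_jt = (s_jt, {s_kt, d_jkt}_{k != j}) with
    d_jkt = x~_jt - x~_kt, x~_t = (p_t, x2_t), and the instrument features
    zeta_jt = {Delta_jkt}_{k != j} with Delta_jkt = z_jt - z_kt, z_t = (x_t, w_t).
    s_t, p_t, x1_t, w_t: (J,) arrays; x2_t: (J, K) array."""
    J = len(s_t)
    x_tilde = np.column_stack([p_t, x2_t])       # endogenous price + exogenous characteristics
    z = np.column_stack([x1_t, x2_t, w_t])       # exogenous variables and the cost shifter
    omega, zeta = [], []
    for j in range(J):
        rivals = [k for k in range(J) if k != j]
        d_jk = x_tilde[j] - x_tilde[rivals]      # d_jkt, one row per rival product
        omega.append(np.concatenate([[s_t[j]], s_t[rivals], d_jk.ravel()]))
        zeta.append((z[j] - z[rivals]).ravel())  # Delta_jkt
    return np.array(omega), np.array(zeta)
```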
Thus, we can perform estimation based on $\mathbb{E}[\xi_{jt} \mid \zeta_{jt}] = 0$.

Table 4. Inverse demand fit.
Bias
T    J   K   KIV      DeepGMM  boostIV   post-boostIV
100  10  10   5.3307   0.9852   0.1917   0.9847
100  10  20   6.5053   1.6591  -0.3053   0.9100
100  20  10   5.4312   1.6979   0.1016   0.8351
100  20  20   6.6666   3.2286  -0.4047   0.9389

MSE
T    J   K   KIV      DeepGMM  boostIV   post-boostIV
100  10  10  45.5991  36.4565  15.1358    7.5426
100  10  20  73.2391  62.9117  26.1670   15.5126
100  20  10  47.4954  40.8320  15.8077    6.5199
100  20  20  76.8815  70.0830  27.4539   14.2017

In our model design there are $T = 100$ markets with $J \in \{10, 20\}$ products and $K \in \{10, 20\}$ nonlinear characteristics besides the price. We compare the performance of boostIV and post-boostIV with KIV and DeepGMM. We drop NPIV since in our design it fails due to the curse of dimensionality. We also drop DeepIV as it suffers from the exploding gradient problem.

Table 4 summarizes the results. The first thing to notice is that KIV performs the worst, while in the previous experiments it was one of the best performing estimators: it has both high bias and high variance. DeepGMM has smaller bias, but the variance is still large. Our algorithms clearly dominate KIV and DeepGMM, with post-boostIV delivering the best MSE results while having slightly higher bias compared to boostIV.

In this paper we have introduced a new boosting algorithm called boostIV that allows one to learn the target function in the presence of endogenous regressors. The algorithm is very intuitive as it resembles an iterative version of the standard 2SLS regression. We also study several extensions, including the use of optimal instruments and a post-processing step.

We show that boostIV is consistent and demonstrates an outstanding finite sample performance in a series of Monte Carlo experiments. It performs especially well in the nonparametric demand estimation example, which is characterized by a complex nonlinear relationship between the target function and the explanatory features.

Despite all the advantages of boostIV, the algorithm does not allow for high-dimensional settings where the number of regressors and/or instruments exceeds the number of observations. We also believe it is possible to extend our algorithm in a spirit similar to XGBoost (Chen and Guestrin, 2016), which could decrease the computation time taken by the algorithm. These would be interesting directions for future research.

Appendix A  Auxiliary Lemmas

Lemma A1.
Assume the assumptions of Lemma 1 are satisfied. Consider $h_m$ that satisfies Assumption 3. Let $\bar{f}$ be an arbitrary reference function in $\mathcal{S}$. Also, define $s_m = \|f_0\|_1 + \sum_{i=0}^{m-1} h_i$, and
$$\Delta \hat{Q}(f) = \max\left(0,\; \hat{Q}(f) - \hat{Q}(\bar{f})\right), \qquad (22)$$
$$\bar{\epsilon}_m = h_m^2 M + \epsilon_m. \qquad (23)$$
Then after $m$ steps the following bound holds for $f_{m+1}$:
$$\Delta \hat{Q}(f_{m+1}) \leq \left(1 - \frac{h_m}{s_m + \|\bar{f}\|_1}\right) \Delta \hat{Q}(f_m) + \bar{\epsilon}_m. \qquad (24)$$
The result follows directly from Lemma 1 and Lemma 4.1 in Zhang and Yu (2005). ∎
Lemma A2.
Under the assumptions of Lemma A1, we have
$$\Delta \hat{Q}(f_m) \leq \frac{\|f_0\|_1 + \|\bar{f}\|_1}{s_m + \|\bar{f}\|_1}\, \Delta \hat{Q}(f_0) + \sum_{j=1}^{m} \frac{s_j + \|\bar{f}\|_1}{s_m + \|\bar{f}\|_1}\, \bar{\epsilon}_{j-1}. \qquad (25)$$
The above lemma follows directly from the repeated application of Lemma A1. For a detailed proof see Zhang and Yu (2005). ∎
Lemmas A1 and A2 are direct counterparts of Lemmas 4.1 and 4.2 in Zhang and Yu (2005) with $M(s_{m+1})$ replaced by $M$. Therefore, the main numerical convergence result below follows as well (see Corollary 4.1).

Appendix B  Proofs
B.1 Proof of Lemma 1
First, $Q(\cdot)$ is convex in $f$, hence it is convex differentiable. Now we have to bound the second derivative with respect to $h$. Note that the second derivative of $Q_{f,\phi}(h)$ does not even depend on $h$:
$$Q''_{f,\phi}(h) = \mathbb{E}[\phi(x_i) z_i]'\, \Omega\, \mathbb{E}[\phi(x_i) z_i] \leq \lambda_{max}(\Omega)\, \|\mathbb{E}[\phi(x_i) z_i]\|^2 \leq \lambda_{max}(\Omega)\, \mathbb{E}[|\phi(x_i)|^2]\, \mathbb{E}[|z_i' z_i|] \leq \lambda_{max}(\Omega)\, C^2 B \equiv M < \infty,$$
where the second inequality is by the Cauchy-Schwarz inequality, and the last inequality comes from the assumptions of the lemma. Thus, the second derivative has a fixed bound $M < \infty$. ∎

B.2 Proof of Theorem 1
The result follows directly from Lemmas A1 and A2. For a detailed proof see Zhang and Yu (2005). ∎
B.3 Proof of Lemma 2
Follows directly from Lemma 4.3 in Zhang and Yu (2005). ∎

B.4 Proof of Lemma 3

It follows from Lemma 2 and condition (17) that for all $j = 1, \ldots, k$,
$$\mathbb{E}_W \sup_{\|f\|_1 \leq \beta} |g_{0,j}(f) - \hat{g}_j(f)| \leq \gamma_j(\beta)\, \beta\, R(\mathcal{S}) \leq \gamma_j(\beta)\, \beta\, \frac{C_{\mathcal{S}}}{\sqrt{n}} = O(n^{-1/2}).$$
Thus, by the Markov inequality,
$$\sup_{\|f\|_1 \leq \beta} |g_{0,j}(f) - \hat{g}_j(f)| \xrightarrow{p} 0, \qquad j = 1, \ldots, k.$$
Since every coordinate of the sample moment function converges uniformly to its population analog, we can bound the norm as well:
$$\|g_0(f) - \hat{g}(f)\| = \left(\sum_{j=1}^{k} |g_{0,j}(f) - \hat{g}_j(f)|^2\right)^{1/2} \leq \sqrt{k}\, O_p(n^{-1/2}),$$
which combined with the Markov inequality completes the proof. ∎

B.5 Proof of Theorem 2
By the triangle and Cauchy-Schwarz inequalities,
$$\left|\hat{Q}(f) - Q(f)\right| \leq \left|[\hat{g}(f) - g_0(f)]'\, \hat{\Omega}\, [\hat{g}(f) - g_0(f)]\right| + \left|g_0(f)'(\hat{\Omega} + \hat{\Omega}')[\hat{g}(f) - g_0(f)]\right| + \left|g_0(f)'(\hat{\Omega} - \Omega)\, g_0(f)\right| \leq \|\hat{g}(f) - g_0(f)\|^2 \|\hat{\Omega}\| + 2\|g_0(f)\|\, \|\hat{g}(f) - g_0(f)\|\, \|\hat{\Omega}\| + \|g_0(f)\|^2 \|\hat{\Omega} - \Omega\|.$$
Using Lemma 3, (ii), and (iv), and taking the supremum of both sides of the inequality completes the proof. ∎
B.6 Proof of Theorem 3
By Theorem 2, the first term converges in probability to zero, and the second term converges to zero according to the arguments from the proof of Theorem 3.1 in Zhang and Yu (2005), which completes the proof. ∎
Appendix C Alternative bound on the Rademacher complexity
To derive an alternative bound on the Rademacher complexity, we introduce the following lemma(Massart’s lemma).
Lemma C1.
For any $A \subseteq \mathbb{R}^n$, let $M = \sup_{a \in A} \|a\|$. Then
$$\hat{R}(A) = \mathbb{E}_\sigma\left[\sup_{a \in A} \frac{1}{n} \sum_{i=1}^{n} \sigma_i a_i\right] \leq \frac{M \sqrt{2 \log |A|}}{n}.$$
This lemma can be applied to any finite class of functions.
Example 1.
Consider a set of binary classifiers
$\mathcal{H} \subseteq \{h: W \mapsto \{-1, 1\}\}$. Given a sample $W = (W_1, \ldots, W_n)$, we can take $A = \{(h(W_1), \ldots, h(W_n)) \mid h \in \mathcal{H}\}$. Then $|A| = |\mathcal{H}|$ and $M = \sqrt{n}$. Massart's lemma gives
$$\hat{R}(\mathcal{H}) \leq \sqrt{\frac{2 \log |\mathcal{H}|}{n}}.$$
In general, Massart's lemma can also be applied to infinite function classes with a finite shattering coefficient. Notice that Massart's finite lemma places a bound on the empirical Rademacher complexity that depends only on $n$ data points. Therefore, all that matters as far as empirical Rademacher complexity is concerned is the behavior of a function class on those data points. We can define the empirical Rademacher complexity in terms of the shattering coefficient.

Lemma C2.
Let
𝒴 ⊂ R be a finite set of real numbers of modulus at most 𝐶 >
0. Given a sample $W = (W_1, \ldots, W_n)$, the Rademacher complexity of any function class $\mathcal{H} \subseteq \{h: W \mapsto \mathcal{Y}\}$ can be bounded in terms of its shattering coefficient $s(\mathcal{H}, n)$ by
$$\hat{R}(\mathcal{H}) \leq C\sqrt{\frac{2 \log s(\mathcal{H}, n)}{n}}.$$
Let $A = \{(h(W_1), \ldots, h(W_n)) \mid h \in \mathcal{H}\}$; then $M = \sup_{a \in A} \|a\| = C\sqrt{n}$ and $|A| = s(\mathcal{H}, n)$. Applying Massart's lemma gives
$$\hat{R}(\mathcal{H}) = \mathbb{E}_\sigma\left[\sup_{h \in \mathcal{H}} \frac{1}{n}\sum_{i=1}^{n} \sigma_i h(W_i)\right] \leq \frac{M\sqrt{2 \log |A|}}{n} = C\sqrt{\frac{2 \log s(\mathcal{H}, n)}{n}}. \qquad ∎$$
Note that we apply Massart's lemma conditional on the sample; hence, we can use the same bound for $\hat{R}(\mathcal{H})$. We can loosen the bound by applying Sauer's lemma, which says that $s(\mathcal{H}, n) \leq n^d$, where $d$ is the VC dimension of $\mathcal{H}$. This simplifies the result of Lemma C2 to
$$\hat{R}(\mathcal{H}) \leq C\sqrt{\frac{2 d \log(n)}{n}} = O\!\left(\sqrt{\frac{\log(n)}{n}}\right). \qquad (26)$$
The bound in (26) is valid for any class with finite VC dimension, which is coherent with the results of Zhang and Yu (2005). However, the VC bound is slower than the bound in (17) by the factor of $\log(n)$ that appears in a lot of ML algorithms. Note that the bound in (26) is still a valid bound for the main results in the text; it only affects the rate of convergence.

References
Bennett, Andrew, Nathan Kallus, and Tobias Schnabel (2019). “Deep generalized method of momentsfor instrumental variable analysis”. In:
Advances in Neural Information Processing Systems , pp. 3564–3574.Berry, Steven T and Philip A Haile (2014). “Identification in di ff erentiated products markets usingmarket level data”. In: Econometrica
Econometrica
Econometrica
The annals of statistics
Neural computation
Statistical Science
Journal of the American Statistical Association. Chamberlain, Gary (1987). "Asymptotic efficiency in estimation with conditional moment restrictions". In: Journal of Econometrics
Proceedings ofthe 22nd acm sigkdd international conference on knowledge discovery and data mining , pp. 785–794.Chernozhukov, Victor, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, and WhitneyNewey (2017). “Double/debiased/neyman machine learning of treatment e ff ects”. In: AmericanEconomic Review
Econometrica
Information and computation
Journal of computer and system sciences icml .Vol. 96. Citeseer, pp. 148–156.Friedman, Jerome H (2001). “Greedy function approximation: a gradient boosting machine”. In:
Annalsof statistics , pp. 1189–1232.Friedman, Jerome H and Bogdan Popescu (2003). “Importance sampled learning ensembles”. In:
Journalof Machine Learning Research
The annals ofstatistics
Available at SSRN 3352957 .Gandhi, Amit and Jean-François Houde (2019).
Measuring substitution patterns in di ff erentiated productsindustries . Tech. rep. National Bureau of Economic Research.Gandhi, Amit, Aviv Nevo, and Jing Tao (2020). Flexible Estimation of Di ff erentiated Product DemandModels Using Aggregate Data . Tech. rep. Working paper.Hall, Peter, Joel L Horowitz, et al. (2005). “Nonparametric methods for inference in the presence ofinstrumental variables”. In: The Annals of Statistics
Nonparametric estimation of triangular simultaneous equations models under weakidentification . Tech. rep.Hartford, Jason, Greg Lewis, Kevin Leyton-Brown, and Matt Taddy (2017). “Deep IV: A flexibleapproach for counterfactual prediction”. In:
Proceedings of the 34th International Conference onMachine Learning-Volume 70 . JMLR. org, pp. 1414–1423.Kress, Rainer (1989).
Linear integral equations . Vol. 3. Springer.Mason, Llew, Jonathan Baxter, Peter L Bartlett, and Marcus R Frean (2000). “Boosting algorithms asgradient descent”. In:
Advances in neural information processing systems , pp. 512–518.Meir, Ron and Tong Zhang (2003). “Generalization error bounds for Bayesian mixture algorithms”. In:
Journal of Machine Learning Research arXiv preprint arXiv:1910.12358 .Newey, Whitney K and James L Powell (2003). “Instrumental variable estimation of nonparametricmodels”. In:
Econometrica
Machine learning
Advances in Neural Information Processing Systems , pp. 4595–4607.van der Vaart, Aad W and Jon A Wellner (1996).
Weak convergence and empirical processes: with applica-tions to statistics . Springer.Wolpert, David H (1992). “Stacked generalization”. In:
Neural networks