Minimax Estimation of Conditional Moment Models
Nishanth Dikkala
Greg Lewis, Microsoft Research
Lester Mackey, Microsoft Research
Vasilis Syrgkanis, Microsoft Research
Abstract
We develop an approach for estimating models described via conditional moment restrictions, with a prototypical application being non-parametric instrumental variable regression. We introduce a min-max criterion function, under which the estimation problem can be thought of as solving a zero-sum game between a modeler who is optimizing over the hypothesis space of the target model and an adversary who identifies violating moments over a test function space. We analyze the statistical estimation rate of the resulting estimator for arbitrary hypothesis spaces, with respect to an appropriate analogue of the mean squared error metric for ill-posed inverse problems. We show that when the minimax criterion is regularized with a second moment penalty on the test function and the test function space is sufficiently rich, then the estimation rate scales with the critical radius of the hypothesis and test function spaces, a quantity which typically gives tight fast rates. Our main result follows from a novel localized Rademacher analysis of statistical learning problems defined via minimax objectives. We provide applications of our main results for several hypothesis spaces used in practice, such as: reproducing kernel Hilbert spaces, high-dimensional sparse linear functions, spaces defined via shape constraints, ensemble estimators such as random forests, and neural networks. For each of these applications we provide computationally efficient optimization methods for solving the corresponding minimax problem (e.g. stochastic first-order heuristics for neural networks). In several applications, we show how our modified mean squared error rate, combined with conditions that bound the ill-posedness of the inverse problem, leads to mean squared error rates. We conclude with an extensive experimental analysis of the proposed methods.
A very preliminary version of this work appeared as Adversarial Generalized Method of Moments (see https://arxiv.org/abs/1803.07164).

Understanding how policy choices affect social systems requires an understanding of the underlying causal relationships between them. To measure these causal relationships, social scientists look to either field experiments or quasi-experimental variation in observational data. Most observational studies rely on assumptions that can be formalized as moment conditions. This is the basis of the estimation approach known as the generalized method of moments (GMM) [Hansen, 1982].

While GMM is an incredibly flexible estimation approach, it suffers from some drawbacks. The underlying independence (randomization) assumptions often imply an infinite number of moment conditions. Imposing all of them is infeasible with finite data, but it is hard to know which ones to select. For some special cases, asymptotic theory provides some guidance, but it is not clear that this guidance translates well when the data is finite and/or the models are non-parametric. Given the increasing availability of data and new machine learning approaches, researchers and data scientists may want to apply adaptive non-parametric learners such as reproducing kernel Hilbert spaces, high-dimensional regularized linear models, neural networks and random forests to these GMM estimation problems. This requires a way of finding solutions to the moment conditions within the complex hypothesis classes imposed by the learner, and of selecting moment conditions that are adapted to the hypothesis class of the learner.

Most recent theoretical developments in machine learning and high-dimensional statistics are founded on statistical learning theory: formulate a loss function (typically strongly convex with respect to the output of the hypothesis) whose minimizer over the hypothesis space is the desired solution; the result is typically referred to as an M-estimator. Framing the problem as M-estimation with a strongly convex loss leads to many desirable properties: i) tight generalization bounds and mean squared error rates based on localized notions of statistical complexity can be invoked to provide tight and fast finite sample rates under minimal assumptions [Bartlett et al., 2005, Wainwright, 2019]; ii) regularization can be invoked to make the estimation adaptive to the complexity of the true hypothesis, without knowledge of that complexity [Lecué and Mendelson, 2018, 2017, Negahban et al., 2012]; iii) the computational problem can typically be solved efficiently via first-order methods that scale massively [Agarwal et al., 2014, Rahimi and Recht, 2008, Le, 2013, Sra et al., 2012, Bottou et al., 2007]. This formulation is seemingly at odds with the method of moments language, since the moment conditions often do not correspond to the gradient of a loss function, and the problem is exacerbated in non-parametric endogenous regression problems (i.e. when the instruments in the observational study do not coincide with the treatments). The problem we address is: Can we develop an analogue of modern statistical learning theory for M-estimators in non-parametric problems defined via moment restrictions?

Our starting point is a set of conditional moment restrictions:
\[ \mathbb{E}[\, y - h_0(x) \mid z\,] = 0 \qquad (1) \]
where $y$ is an outcome of interest, $x$ is a vector of treatments and $z$ is a vector of instruments. To obtain a criterion function, we first move to an unconditional moment formulation, where the unconditional moments are products of the conditional moment and test functions of the instruments. We then take as our criterion the maximum moment deviation over the set of test functions, where the set of test functions is potentially infinite:
\[ \hat{h} = \arg\inf_{h \in \mathcal{H}} \sup_{f \in \mathcal{F}} \mathbb{E}[(y - h(x))\, f(z)] =: \Psi(h, f). \]
We show that as long as the set of test functions $\mathcal{F}$ contains all functions of the form $f(z) = \mathbb{E}[h(x) - h'(x) \mid z]$ for $h, h' \in \mathcal{H}$, then such an estimator achieves a projected MSE rate that scales with the critical radius of the function classes $\mathcal{F}$, $\mathcal{H}$ and their tensor product class (i.e. functions of the form $f(z) \cdot h(x)$, with $f \in \mathcal{F}$ and $h \in \mathcal{H}$).
The critical radius captures information-theoretically optimal rates for many function classes of interest, so this main theorem can be used to derive tight estimation rates for many hypothesis spaces. Moreover, if the regularization terms are the squared norms of $h$ and $f$ in their corresponding spaces, then the estimation error scales with the norm of the true hypothesis, without knowledge of this norm.

We offer applications of our main theorems for several hypothesis spaces of practical interest, such as reproducing kernel Hilbert spaces (RKHS), sparse linear functions, functions defined via shape restrictions, neural networks and random forests. For many of these estimators, we offer optimization algorithms with performance guarantees. As we illustrate in extensive simulation studies, different estimators are best in different regimes.

Related work
The non-parametric IV problem has a long history in econometrics [Newey and Powell, 2003, Blundell et al., 2007, Chen and Pouzo, 2012, Chen and Christensen, 2018, Hall et al., 2005, Horowitz, 2007, 2011, Darolles et al., 2011, Chen and Pouzo, 2009]. Arguably the closest to our work is that of Chen and Pouzo [2012], who consider estimation over non-parametric function classes via the method of sieves and a penalized minimum distance estimator of the form $\min_{h \in \mathcal{H}} \mathbb{E}[\mathbb{E}[y - h(x) \mid z]^2] + \lambda R(h)$, where $R(h)$ is a regularizer. As we show in Appendix A, our estimator can be interpreted asymptotically as a minimum distance estimator, albeit our estimation method applies to arbitrary function classes and not just linear sieves. There is also a growing body of work in the machine learning literature on the non-parametric instrumental variable regression problem [Hartford et al., 2017, Bennett et al., 2019, Singh et al., 2019, Muandet et al., 2019, 2020]. Our work has several features that draw connections to each of these works: e.g. Bennett et al. [2019], Muandet et al. [2019, 2020] also use a minimax criterion, and Bennett et al. [2019], Muandet et al. [2019] also impose some form of variance penalty on the test function. We discuss subtle differences in Appendix A. Moreover, Singh et al. [2019], Muandet et al. [2019] also study RKHS hypothesis spaces, and Hartford et al. [2017], Bennett et al. [2019] also study neural net hypothesis spaces. None of these prior works provide finite sample estimation error rates for arbitrary hypothesis spaces, and they typically only show consistency for the particular hypothesis space analyzed (with the exception of Singh et al. [2019], who provide finite sample rates for RKHS spaces under further conditions on the smoothness of the true hypothesis). In Appendix A we offer a more detailed exposition of the related work and how it relates to our main results.

We consider the problem of estimating a flexible econometric model that satisfies the set of conditional moment restrictions in (1) (see also Appendix B), where $z \in \mathcal{Z} \subseteq \mathbb{R}^d$, $x \in \mathcal{X} \subseteq \mathbb{R}^p$, $y \in \mathbb{R}$, and $h \in \mathcal{H} \subseteq (\mathcal{X} \to \mathbb{R})$ for $\mathcal{H}$ a hypothesis space. For simplicity of notation we will also write $\psi(y; h(x)) = y - h(x)$. The truth is some model $h_0$ that satisfies all the moment restrictions. We assume we have access to a set of $n$ i.i.d. sample points $\{v_i := (y_i, x_i, z_i)\}_{i=1}^n$ drawn from some unknown distribution $D$ that satisfies the moment condition in Equation (1). We will analyze estimators that optimize an empirical analogue of the minimax objective presented in the introduction, potentially adding norm-based penalties
$\Phi: \mathcal{F} \to \mathbb{R}_+$ and $R: \mathcal{H} \to \mathbb{R}_+$:
\[ \hat{h} := \arg\min_{h \in \mathcal{H}} \sup_{f \in \mathcal{F}} \Psi_n(h, f) - \lambda\, \Phi(f) + \mu\, R(h), \qquad \text{where } \Psi_n(h, f) := \frac{1}{n} \sum_{i=1}^{n} \psi(y_i; h(x_i))\, f(z_i). \]
We assume that $\mathcal{H}$ and $\mathcal{F}$ are classes of bounded functions on their corresponding domains and, without loss of generality, that their image is a subset of $[-1, 1]$. Similarly, we will also assume that $y \in [-1, 1]$. The results of this section hold for a general bounded range $[-b, b]$ via standard re-scaling arguments, at the cost of an extra multiplicative factor of $b$. Moreover, we will assume that $\mathcal{F}$ is a symmetric class, i.e. if $f \in \mathcal{F}$ then $-f \in \mathcal{F}$, and that $\mathcal{H}$ and $\mathcal{F}$ are equipped with norms $\|\cdot\|_{\mathcal{H}}$ and $\|\cdot\|_{\mathcal{F}}$. For any function class $\mathcal{G}$ we let $\mathcal{G}_B = \{g \in \mathcal{G} : \|g\|_{\mathcal{G}} \leq B\}$ be the $B$-bounded-norm subset of the class.

Our estimation target is good generalization performance with respect to the projected residual mean squared error (RMSE), defined as the RMSE projected onto the space of instruments:
\[ \|T(\hat{h} - h_0)\|_2 := \sqrt{\mathbb{E}\Big[\big(\mathbb{E}[\hat{h}(x) - h_0(x) \mid z]\big)^2\Big]} \qquad \text{(Projected RMSE)} \]
where $T: \mathcal{H} \to \mathcal{F}$ is the linear operator defined as
$T h := \mathbb{E}[h(x) \mid z = \cdot]$. This performance metric is appropriate given the ill-posedness problem well known in this setting; imposing further conditions on the strength of the correlation between the treatments and instruments (instrument strength) allows one to translate bounds on the projected RMSE into bounds on the RMSE (see e.g. Chen and Pouzo [2012] and other references in the applications below).

We start by defining some preliminary notions from empirical process theory that are required to state our main results. Let $\mathcal{G}$ be a class of uniformly bounded functions $g: \mathcal{V} \to [-1, 1]$ from some domain $\mathcal{V}$ to $[-1, 1]$. The localized Rademacher complexity of the function class is defined as
\[ \mathcal{R}_n(\delta; \mathcal{G}) = \mathbb{E}_{\{\epsilon_i\}_{i=1}^n, \{v_i\}_{i=1}^n}\Bigg[ \sup_{g \in \mathcal{G}:\, \|g\|_2 \leq \delta} \Big| \frac{1}{n} \sum_{i=1}^{n} \epsilon_i\, g(v_i) \Big| \Bigg], \]
where $\{v_i\}_{i=1}^n$ are i.i.d. samples from some distribution $D$ on $\mathcal{V}$ and $\{\epsilon_i\}_{i=1}^n$ are i.i.d. Rademacher random variables taking values equiprobably in $\{-1, 1\}$. We will also denote with $\mathcal{R}_n(\mathcal{G})$ the un-restricted Rademacher complexity, i.e. $\delta = \infty$.

We denote with $\|\cdot\|_2$ the $\ell_2$-norm with respect to the distribution $D$, i.e. $\|g\|_2 = \sqrt{\mathbb{E}_{v \sim D}[g(v)^2]}$, and analogously we define the empirical $\ell_2$-norm as $\|g\|_{2,n} = \sqrt{\frac{1}{n}\sum_i g(v_i)^2}$. In our context, where $v = (y, x, z)$, when functions take as input subsets of the vector $v$, we will overload notation and let $\|\cdot\|_2$ and $\|\cdot\|_{2,n}$ denote the population and sample $\ell_2$ norms with respect to the marginal distribution of the corresponding input; e.g., if $h$ is a function of $x$ alone and $f$ a function of $z$ alone, we write $\|h\|_2 = \sqrt{\mathbb{E}_x[h(x)^2]}$, $\|f\|_2 = \sqrt{\mathbb{E}_z[f(z)^2]}$, and $\|h f\|_2 = \sqrt{\mathbb{E}_{x,z}[h(x)^2 f(z)^2]}$. A function class $\mathcal{G}$ is said to be symmetric if $g \in \mathcal{G} \Rightarrow -g \in \mathcal{G}$. Moreover, it is said to be star-convex if $g \in \mathcal{G} \Rightarrow r\, g \in \mathcal{G}$ for all $r \in [0, 1]$. The critical radius $\delta_n$ of the function class $\mathcal{G}$ is any solution to the inequality $\mathcal{R}_n(\delta; \mathcal{G}) \leq \delta^2$.

We show that, if the function space $\mathcal{F}_U$ contains the projected differences $T(h - h_*)$ of hypotheses $h \in \mathcal{H}_B$ with some benchmark hypothesis $h_* \in \mathcal{H}_B$, i.e. $T(h - h_*) \in \mathcal{F}_U$, then a regularized minimax estimator can achieve estimation rates that are of the order of the projected root-mean-squared-error of the benchmark hypothesis $h_*$ and the critical radii of (i) the function class $\mathcal{F}_U$ and (ii) a function class $\mathcal{G}$ that consists of functions of the form $q(x) \cdot T q(z)$, for $q = h - h_*$. The projected root mean squared error of the benchmark can be understood as the approximation error or bias of the hypothesis space $\mathcal{H}_B$, and the critical radius can be understood as the sampling error or variance of the estimate. If $h_0 \in \mathcal{H}_B$, then the approximation error is zero. We present a slightly more general statement, where we also allow $\mathcal{F}_U$ not to exactly include $T(h - h_*)$ but rather functions that are close to it with respect to the $\ell_2$ norm. For this reason, we need the following slightly more complex hypothesis space in order to state our main theorem:
\[ \hat{\mathcal{G}}_{B,U} := \big\{ (x, z) \to r\, (h(x) - h_*(x))\, f^U_h(z) : h \in \mathcal{H} \text{ s.t. } h - h_* \in \mathcal{H}_B,\ r \in [0, 1] \big\} \qquad (2) \]
where $f^U_h = \arg\min_{f \in \mathcal{F}_U} \|f - T(h - h_*)\|_2$.
If $T(h - h_*) \in \mathcal{F}_U$, then this simplifies to the class of functions of the form $(h - h_*)(x) \cdot T(h - h_*)(z)$.

Theorem 1.
Let $\mathcal{F}$ be a symmetric and star-convex set of test functions and consider the estimator:
\[ \hat{h} = \arg\min_{h \in \mathcal{H}} \sup_{f \in \mathcal{F}} \Psi_n(h, f) - \lambda \Big( \|f\|_{\mathcal{F}}^2 + \frac{U}{\delta^2} \|f\|_{2,n}^2 \Big) + \mu\, \|h\|_{\mathcal{H}}^2 \qquad (3) \]
Let $h_* \in \mathcal{H}$ be any fixed hypothesis (independent of the samples) and let $h_0$ be any hypothesis (not necessarily in $\mathcal{H}$) that satisfies the conditional moment restriction (1), and suppose that:
\[ \forall h \in \mathcal{H}: \quad \min_{f \in \mathcal{F}_{L \|h - h_*\|_{\mathcal{H}}}} \|f - T(h - h_*)\|_2 \leq \eta_n \qquad (4) \]
Assume that functions in $\mathcal{H}_B$ and $\mathcal{F}_U$ have uniformly bounded ranges in $[-1, 1]$ and that $\delta := \delta_n + c_0 \sqrt{\frac{\log(c_1/\zeta)}{n}}$, for universal constants $c_0, c_1$, and $\delta_n$ an upper bound on the critical radii of $\mathcal{F}_U$ and $\hat{\mathcal{G}}_{B, LB}$. If $\lambda \geq \delta^2/U$ and $\mu \geq \lambda\,(4 L^2 + 27 U/B)$, then $\hat{h}$ satisfies w.p. $1 - \zeta$:
\[ \|T(\hat{h} - h_*)\|_2,\ \|T(\hat{h} - h_0)\|_2 \leq O\Big( \delta + \eta_n + \|h_*\|_{\mathcal{H}}^2\, \frac{\lambda + \mu}{\delta} + \|T(h_* - h_0)\|_2 + \frac{\|T(h_* - h_0)\|_2^2}{\delta} \Big) \]
If further $\lambda, \mu = O(\delta^2)$ and $\delta \geq \|T(h_* - h_0)\|_2$, then:
\[ \|T(\hat{h} - h_*)\|_2,\ \|T(\hat{h} - h_0)\|_2 \leq O\big( \delta\, \max\{1, \|h_*\|_{\mathcal{H}}^2\} + \eta_n + \|T(h_* - h_0)\|_2 \big) \]

Observe that if the classes $\mathcal{H}$, $\mathcal{F}$ are already norm-constrained, then the theorem directly applies to the estimator that solely penalizes the $\ell_{2,n}$ norm of $f$, i.e.:
\[ \hat{h} := \arg\min_{h \in \mathcal{H}} \sup_{f \in \mathcal{F}} \Psi_n(h, f) - \|f\|_{2,n}^2 \qquad (5) \]
However, as we show below, imposing norm regularization as opposed to hard norm constraints leads to adaptivity properties of the estimator. The estimator in (5) is obtained from (3) by setting $\lambda = \delta^2/U$, $\mu = 2\lambda\,(4L^2 + 27 U/B)$, using an $\ell_\infty$ norm in both function spaces and taking $U, B \to \infty$. Observe that we can also take $L = 1$, since $\|T h\|_\infty \leq \|h\|_\infty$ for any $T$.
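To make the simplified estimator in Equation (5) concrete, the following is a minimal sketch for the special case of finite-dimensional linear hypothesis and test classes, $h_\theta(x) = \langle\theta, x\rangle$ and $f_\beta(z) = \langle\beta, z\rangle$, where the inner supremum is a concave quadratic solved in closed form. The parameterization, the ridge penalty on $\theta$, and the use of scipy's generic minimizer are illustrative choices, not the paper's implementation.

```python
import numpy as np
from scipy.optimize import minimize

def minimax_objective(y, X, Z, mu):
    """Outer objective: theta -> sup_beta Psi_n(theta, beta) - ||f_beta||_{2,n}^2 + mu * ||theta||^2,
    where Psi_n(theta, beta) = (1/n) * sum_i (y_i - <theta, x_i>) * <beta, z_i>."""
    n = len(y)

    def value(theta):
        resid = y - X @ theta
        # inner problem: max_beta (1/n) resid' Z beta - (1/n) ||Z beta||^2 is a concave
        # quadratic; its first-order condition is 2 Z'Z beta = Z' resid
        beta = np.linalg.lstsq(2 * Z.T @ Z, Z.T @ resid, rcond=None)[0]
        fz = Z @ beta
        return resid @ fz / n - fz @ fz / n + mu * theta @ theta

    return value

# illustrative usage on synthetic data
rng = np.random.default_rng(0)
n, p, d = 500, 3, 3
Z = rng.normal(size=(n, d))
X = Z + 0.1 * rng.normal(size=(n, p))
y = X @ np.array([1.0, 0.0, -1.0]) + 0.1 * rng.normal(size=n)
theta_hat = minimize(minimax_objective(y, X, Z, mu=1e-3), x0=np.zeros(p)).x
```

For richer hypothesis spaces the inner maximization no longer has this closed form, which is why the later sections develop dedicated optimization methods.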
Adaptivity of the regularized estimator. Suppose that we know that, for $B, U = 1$, functions in $\mathcal{H}_B$, $\mathcal{F}_U$ have ranges in $[-1, 1]$ as their inputs range in $\mathcal{X}$ and $\mathcal{Z}$ correspondingly. Then our theorem requires that we set $\lambda \geq \delta^2$ and $\mu \geq \lambda\,(4 L^2 + 27)$, where $\delta$ depends on the critical radius of the function classes $\mathcal{F}$ and $\mathcal{G}$. Observe that none of these values depend on the norm of the benchmark hypothesis $\|h_*\|_{\mathcal{H}}$, which can be arbitrary and is not constrained by our theorem (see also Appendix C.1).

For some function classes $\mathcal{H}$ that admit sparse representations, we can get improved performance if, instead of testing against classes of functions $\mathcal{F}$ that contain $T(h - h_*)$, we test against functions whose linear span contains $T(h - h_*)$, i.e. $T(h - h_*) = \sum_i w_i f_i$, assuming the weights required in this linear span have small $\ell_1$ norm. The reason is that the generalization error of linear spans with bounded $\ell_1$ norm can be prohibitively large for fast error rates: the Rademacher complexity of the span of $\mathcal{F}$ can be much larger than that of $\mathcal{F}$, thereby introducing large sampling variance into our sup-loss objective. To state the improved result, we define for any function space $\mathcal{F}$:
\[ \mathrm{span}_\kappa(\mathcal{F}) := \Big\{ \sum_{i=1}^{p} w_i f_i : f_i \in \mathcal{F},\ \|w\|_1 \leq \kappa,\ p \leq \infty \Big\}, \]
i.e. the set of functions that consist of linear combinations of a finite set of elements of $\mathcal{F}$, with the $\ell_1$ norm of the weights bounded by $\kappa$. To get fast rates in this second result, we will require that the $\ell_2$-normalized $T(h - h_*)$ belongs to the span. We present the theorem in the well-specified setting, but a similar result holds in the case where $h_0 \notin \mathcal{H}_B$, with the extra modification of adding a second moment penalty on $f$.

Theorem 2.
Consider a set of test functions $\mathcal{F} := \cup_{i=1}^{d} \mathcal{F}^i$ that is decomposable as a union of $d$ symmetric test function spaces $\mathcal{F}^i$, and let $\mathcal{F}^i_U = \{f \in \mathcal{F}^i : \|f\|_{\mathcal{F}} \leq U\}$. Consider the estimator:
\[ \hat{h} = \arg\min_{h \in \mathcal{H}} \sup_{f \in \mathcal{F}_U} \Psi_n(h, f) + \lambda\, \|h\|_{\mathcal{H}} \qquad (6) \]
Let $h_0 \in \mathcal{H}_B$ be any fixed hypothesis (independent of the samples) that satisfies the conditional moment restriction (1). Let $\delta_{n,\zeta} := 2 \max_{i=1}^{d} \mathcal{R}_n(\mathcal{F}^i_U) + c_0 \sqrt{\frac{\log(c_1 d/\zeta)}{n}}$, for some universal constants $c_0, c_1$, and $B_{n,\lambda,\zeta} := \|h_0\|_{\mathcal{H}} + \delta_{n,\zeta}/\lambda$. Suppose that:
\[ \forall h \in \mathcal{H}_{B_{n,\lambda,\zeta}}: \qquad \frac{T(h - h_0)}{\|T(h - h_0)\|_2} \in \mathrm{span}_\kappa(\mathcal{F}_U). \]
Then if $\lambda \geq \delta_{n,\zeta}$, $\hat{h}$ satisfies, for some universal constants $c_2, c_3$, w.p. $1 - \zeta$:
\[ \|T(h_0 - \hat{h})\|_2 \leq \kappa \Big( c_2 (B + 1)\, \mathcal{R}_n(\mathcal{H}_1) + c_3\, \delta_{n,\zeta} + \lambda \big( \|h_0\|_{\mathcal{H}} - \|\hat{h}\|_{\mathcal{H}} \big) \Big) \]

In Appendix C we provide further discussion related to our main theorems: i) we provide further discussion on the adaptivity of our estimators; ii) we provide connections between the critical radius and the entropy integral, and show how to bound the critical radius via covering arguments; iii) we provide generic approaches to solving the optimization problem; iv) we show how to combine our main theorem on the projected MSE with bounds on the ill-posedness of the inverse problem in order to achieve MSE rates; v) we offer a discussion on the optimality of our estimation rate.
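As a small illustration of the critical radius computation referenced in point (ii) above, the sketch below solves the critical inequality $\mathcal{R}_n(\delta; \mathcal{G}) \leq \delta^2$ by bisection given any upper bound on the localized Rademacher complexity; the parametric form used in the example (of order $\delta\sqrt{d/n}$ for a $d$-dimensional linear class) is an illustrative assumption.

```python
import numpy as np

def critical_radius(rademacher_fn, lo=1e-6, hi=1.0, tol=1e-6):
    """Approximate critical radius: the smallest delta with rademacher_fn(delta) <= delta**2,
    found by bisection, assuming rademacher_fn(delta)/delta is non-increasing (star-convexity)."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if rademacher_fn(mid) <= mid ** 2:
            hi = mid
        else:
            lo = mid
    return hi

# example: R_n(delta) ~ delta * sqrt(d / n) for a d-dimensional linear class,
# which yields a critical radius of order sqrt(d / n)
d, n = 10, 1000
print(critical_radius(lambda delta: delta * np.sqrt(d / n)))
```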
In this section we describe how Theorem 1 applies to the case where $h_0$ lies in a reproducing kernel Hilbert space (RKHS) with kernel $K_{\mathcal{H}}: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, denoted $\mathcal{H}_K$, and $T h$ lies in another RKHS with kernel $K_{\mathcal{F}}: \mathcal{Z} \times \mathcal{Z} \to \mathbb{R}$ (see Appendix E for more details). We outline here the main ideas behind the three components required to apply our general theory and defer the full discussion to Appendix E.

First, we characterize the set of test functions that are sufficient to satisfy the requirement that $T(h - h_0) \in \mathcal{F}_U$. We show (see Lemma 7) that if the conditional density function $p(x \mid z)$ is such that $p(x \mid \cdot)$ falls in an RKHS $\mathcal{H}_{K_{\mathcal{F}}}$, then $T h \in \mathcal{H}_{K_{\mathcal{F}}}$. Moreover, we show under the stronger conditions (see Lemma 8) that $p(x \mid z) = \rho(x - z)$ and $K_{\mathcal{H}}(x, y) = k(x - y)$, for $k$ positive definite and continuous, that $T h \in \mathcal{H}_K$, i.e. $T h$ falls in the same RKHS as $h$. These two results give concrete guidance, in terms of primitive assumptions, on what RKHS should be used as a test function space so that the condition $T(h - h_0) \in \mathcal{F}$ is satisfied.

Second, by recent results in statistical learning theory, the critical radius of any RKHS-norm-constrained subset of an RKHS class with kernel $K$ and norm bound $B$ can be characterized as a function of the eigen-decay of the empirical kernel matrix $K$ defined as $K_{ij} = K(x_i, x_j)/n$. More concretely, it is the solution to
\[ B\, \sqrt{\frac{2}{n}} \sqrt{\sum_{j=1}^{n} \min\{\lambda_j, \delta^2\}} \leq \delta^2, \]
where the $\lambda_j$ are the empirical eigenvalues. In the worst case this is of the order of $n^{-1/4}$. In the context of Theorem 1, the function classes $\mathcal{F}$ and $\mathcal{G}_B$ are kernel classes, with kernels $K_{\mathcal{F}}$ and $K_\times((x, z), (x', z')) = K_{\mathcal{H}}(x, x') \cdot K_{\mathcal{F}}(z, z')$. Thus we can bound the critical radius required in the theorem as a function of the eigendecay of the corresponding empirical kernel matrices, which are data-dependent quantities.

Combining these two facts, we can then apply Theorem 1 to get a bound on the estimation error of the minimax or regularized minimax estimator. Moreover, we show that for this set of test functions and hypothesis spaces, the empirical min-max optimization problem can be solved in closed form. In particular, the estimator in Equation (3) takes the form $\hat{h} = \sum_{i=1}^{n} \alpha_i K_{\mathcal{H}}(x_i, \cdot)$ with
\[ \alpha := \big( K_{\mathcal{H},n}\, M\, K_{\mathcal{H},n} + 4\lambda\mu\, K_{\mathcal{H},n} \big)^{\dagger}\, K_{\mathcal{H},n}\, M\, y, \qquad M = K_{\mathcal{F},n}^{1/2} \Big( \frac{U}{n \delta^2} K_{\mathcal{F},n} + I \Big)^{-1} K_{\mathcal{F},n}^{1/2}, \]
where $K_{\mathcal{H},n} = (K_{\mathcal{H}}(x_i, x_j))_{i,j=1}^{n}$ and $K_{\mathcal{F},n} = (K_{\mathcal{F}}(z_i, z_j))_{i,j=1}^{n}$ are empirical kernel matrices and $A^{\dagger}$ is the Moore-Penrose pseudoinverse of $A$. Moreover, in Appendix E.3, we discuss how ideas from low-rank kernel matrix approximation (such as the Nystrom method) can avoid the $O(n^3)$ running time of the matrix inverse computation in the latter closed form. Finally, we show (see Appendix E.4) that if we make further assumptions on the rate at which the operator $T$ distorts the orthonormality of the eigenfunctions of the kernel $K_{\mathcal{H}}$, then our estimator also implies mean-squared-error rates.
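A minimal numpy sketch of this closed-form computation is given below, assuming the empirical kernel matrices for the two RKHSs have been precomputed; the constant factors mirror the structure of the expressions above, whose exact scaling depends on the chosen $\lambda, \mu, \delta, U$, so treat them as illustrative rather than definitive.

```python
import numpy as np

def rkhs_minimax_fit(KH, KF, y, lam, mu, delta, U=1.0):
    """Closed-form kernelized minimax estimator (sketch).
    KH, KF: n x n empirical kernel matrices for the hypothesis and test RKHSs.
    Returns dual coefficients alpha so that h_hat(x) = sum_i alpha_i K_H(x_i, x)."""
    n = len(y)
    # symmetric square root of the test-space kernel matrix via eigendecomposition
    w, V = np.linalg.eigh(KF)
    KF_half = V @ np.diag(np.sqrt(np.clip(w, 0, None))) @ V.T
    M = KF_half @ np.linalg.solve(U / (n * delta ** 2) * KF + np.eye(n), KF_half)
    alpha = np.linalg.pinv(KH @ M @ KH + 4 * lam * mu * KH) @ (KH @ M @ y)
    return alpha

def predict(KH_test_train, alpha):
    """KH_test_train[i, j] = K_H(x_test_i, x_train_j)."""
    return KH_test_train @ alpha
```

The Nystrom-style approximations discussed in Appendix E.3 would replace the exact eigendecomposition and pseudoinverse here with low-rank surrogates.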
In this section we deal with high-dimensional linear function classes, i.e. the case where $\mathcal{X}, \mathcal{Z} \subseteq \mathbb{R}^p$ for $p \gg n$ and $h_0(x) = \langle \theta_0, x \rangle$ (see Appendix F for more details). We address the case where the parameter $\theta_0$ is assumed to be sparse, i.e. $\|\theta_0\|_0 := |\{j \in [p] : |\theta_{0,j}| > 0\}| \leq s$. We denote with $S$ the subset of coordinates of $\theta_0$ that are non-zero and with $S^c$ its complement. For simplicity of exposition we will also assume that $\mathbb{E}[x_i \mid z] = \langle \beta_i, z \rangle$, though most of the results of this section also extend to the case where $\mathbb{E}[x_i \mid z] \in \mathcal{F}_i$ for some $\mathcal{F}_i$ with small Rademacher complexity. Variants of this setting have been analyzed in the prior works of Gautier et al. [2011] and Fan and Liao [2014]. We focus on the case where the covariance matrix $V := \mathbb{E}[\,\mathbb{E}[x \mid z]\, \mathbb{E}[x \mid z]^\top\,]$ has a restricted minimum eigenvalue of $\gamma$ and apply Theorem 2. We note that, without the minimum eigenvalue condition, our Theorem 1 provides slow rates of the order of $n^{-1/4}$ for computationally efficient estimators that replace the hard sparsity constraint with an $\ell_1$-norm constraint.

Corollary 3.
Suppose that $h_0(x) = \langle \theta_0, x \rangle$ with $\|\theta_0\|_0 \leq s$, $\|\theta_0\|_1 \leq B$ and $\|\theta_0\|_\infty \leq 1$. Moreover, suppose that $\mathbb{E}[x_i \mid z] = \langle \beta_i, z \rangle$ with $\beta_i \in \mathbb{R}^p$ and $\|\beta_i\|_1 \leq U$, and that the covariance matrix $V$ satisfies the following restricted eigenvalue condition:
\[ \forall \nu \in \mathbb{R}^p \ \text{s.t.}\ \|\nu_{S^c}\|_1 \leq \|\nu_S\|_1 + 2\,\delta_{n,\zeta}: \qquad \nu^\top V \nu \geq \gamma\, \|\nu\|_2^2. \]
Let $\mathcal{H} = \{x \to \langle \theta, x \rangle : \theta \in \mathbb{R}^p\}$ with $\|\langle \theta, \cdot \rangle\|_{\mathcal{H}} = \|\theta\|_1$, and $\mathcal{F}_U = \{z \to \langle \beta, z \rangle : \beta \in \mathbb{R}^p, \|\beta\|_1 \leq U\}$ with $\|\langle \beta, \cdot \rangle\|_{\mathcal{F}} = \|\beta\|_1$. Then the estimator presented in Equation (6) with $\lambda \leq \gamma/s$ satisfies w.p. $1 - \zeta$:
\[ \|T(\hat{h} - h_0)\|_2 \leq O\left( \max\Big\{1, \frac{\lambda s}{\gamma}\Big\} \sqrt{\frac{s}{\gamma}} \left( (B + 1)\sqrt{\frac{\log(p)}{n}} + U \sqrt{\frac{\log(p)}{n}} + \sqrt{\frac{\log(p/\zeta)}{n}} \right) \right). \]
If instead we assume that $\|\beta_i\|_2 \leq U$ and $\sup_{z \in \mathcal{Z}} \|z\|_2 \leq R$, then by setting $\mathcal{F}_U = \{z \to \langle \beta, z \rangle : \|\beta\|_2 \leq U\}$ and $\|\langle \beta, \cdot \rangle\|_{\mathcal{F}} = \|\beta\|_2$, the latter rate holds with $U \sqrt{\frac{\log(p)}{n}}$ replaced by $\frac{U R}{\sqrt{n}}$.

Notably, in the case of $\|\beta_i\|_2 \leq U$, if one wants to learn the true $\beta$ with respect to the $\ell_2$ norm, or the functions $\mathbb{E}[x_i \mid z]$ with respect to the RMSE, then the best rate one can achieve (by standard results for statistical learning with the square loss), even when one assumes that $\sup_{z \in \mathcal{Z}} \|z\|_2 \leq R$ and that $\mathbb{E}[z z^\top]$ has minimum eigenvalue at least $\gamma$, is $\min\big\{\sqrt{p/n},\ (U^2 R^2/n)^{1/4}\big\}$. For large $p \gg n$ the first rate is vacuous. Thus we see that even though we cannot accurately learn the conditional expectation functions at a $1/\sqrt{n}$ rate, we can still estimate $h_0$ at a $1/\sqrt{n}$ rate, assuming that $h_0$ is sparse. Therefore, the minimax approach offers some form of robustness to nuisance parameters, reminiscent of Neyman orthogonal methods (see e.g. Chernozhukov et al. [2018]).

In Appendix F.3 we also provide iterative, computationally efficient first-order algorithms with provable guarantees for solving the optimization problem. Moreover, we show that recent advances in online learning theory can be utilized to get fast iteration complexity, i.e. achieve error $\epsilon$ after $O(1/\epsilon)$ iterations (instead of the typical rate of $O(1/\epsilon^2)$ for non-smooth objectives). Finally, in Appendix F.4, we also show that if we assume the minimum eigenvalue of $V$ is at least $\gamma$ and the maximum eigenvalue of $\Sigma = \mathbb{E}[x x^\top]$ is at most $\sigma$, then the same rate as the one presented in Corollary 3 holds for the MSE, multiplied by the constant $\sqrt{\sigma/\gamma}$.
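The following is a minimal sketch of a first-order solver for the $\ell_1$-constrained version of this minimax problem: simultaneous projected gradient descent/ascent on the bilinear payoff, returning averaged iterates. The step size, iteration count, and the plain (non-optimistic) updates are illustrative simplifications of the algorithms analyzed in Appendix F.3; the $\ell_1$ projection follows Duchi et al. [2008].

```python
import numpy as np

def project_l1(v, radius):
    """Euclidean projection of v onto the l1 ball of the given radius (Duchi et al., 2008)."""
    if np.abs(v).sum() <= radius:
        return v
    u = np.sort(np.abs(v))[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > (css - radius))[0][-1]
    theta = (css[rho] - radius) / (rho + 1.0)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def sparse_linear_minimax(y, X, Z, B, U, T=2000, eta=0.05):
    """Simultaneous projected gradient descent/ascent on the bilinear objective
    Psi_n(theta, beta) = (1/n) sum_i (y_i - <theta, x_i>) <beta, z_i>,
    with ||theta||_1 <= B and ||beta||_1 <= U; returns averaged iterates."""
    n, p = X.shape
    theta, beta = np.zeros(p), np.zeros(Z.shape[1])
    theta_avg, beta_avg = np.zeros_like(theta), np.zeros_like(beta)
    for _ in range(T):
        resid = y - X @ theta
        grad_theta = -X.T @ (Z @ beta) / n      # d Psi / d theta
        grad_beta = Z.T @ resid / n             # d Psi / d beta
        theta = project_l1(theta - eta * grad_theta, B)
        beta = project_l1(beta + eta * grad_beta, U)
        theta_avg += theta / T
        beta_avg += beta / T
    return theta_avg, beta_avg
```

Averaging the iterates is what yields an approximate equilibrium for the convex-concave game; the optimistic variants in the appendix improve the iteration complexity but follow the same projected-update template.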
In this section we describe how one can apply the theoretical findings from the previous sections to train neural networks that solve the conditional moment problem. We consider the case where the true function $h_0$ can be represented (or well-approximated) by a deep neural network function of $x$, for some given domain-specific network architecture, and we represent it as $h_0(x) = h_{\theta_0}(x)$, where $\theta_0$ are the weights of the neural net (see Appendix H for more details). Moreover, we will assume that the linear operator $T$ satisfies that, for any set of weights $\theta$, the function $T h_\theta$ belongs to a set of functions that can be represented (or well-approximated) by another deep neural network architecture, and we denote these functions as $f_w(z)$, where $w$ are the weights of the adversary neural net.

Adversarial GMM Networks (AGMM).
Thus we can apply our general approach presented in Theorem 1 (simplified for the case when $U = B = 1$, $\lambda = \delta^2$, $\mu = 2\delta^2(4L^2 + 27)$, where $L$ is a bound on the Lipschitzness of the operator $T$ with respect to the two function space norms and $\delta$ is a bound on the critical radius of the function spaces $\mathcal{F}$ and $\hat{\mathcal{G}}_{1,L}$):
\[ \hat{\theta} = \arg\min_{\theta} \sup_{w}\ \mathbb{E}_n[\psi(y_i; h_\theta(x_i))\, f_w(z_i)] - \delta^2\, \|f_w\|_{\mathcal{F}}^2 - \frac{1}{n}\sum_i f_w(z_i)^2 + c\, \delta^2\, \|h_\theta\|_{\mathcal{H}}^2 \]
for some constant $c > 0$ that depends on the Lipschitzness of the operator $T$. The AGMM criterion for training neural networks is closely related to the work of Bennett et al. [2019]. However, the regularization presented in Bennett et al. [2019] is not a simple second moment penalization. Here we show that such re-weighting is not required if one simply wants fast projected MSE rates (in Appendix H we provide further discussion). Moreover, in Appendix H.1, we show how to use intuition from our RKHS analysis to develop an architecture for the test function network that, under conditions, is guaranteed to contain the set of functions of the form $T h$. This leads to an MMD-GAN style adversarial GMM approach, where we consider test functions of the form $f(z) = \frac{1}{s}\sum_{i=1}^{s} \beta_i K(c_i, g_w(z))$, where the $c_i$ are parameters that could also be trained via gradient descent. The latter essentially corresponds to adding what is known as an RBF layer at the end of the adversary neural net (denoted as KLayerTrained in the experiments). Finally, in Appendix H.2, we provide heuristic methods for solving the non-convex/non-concave zero-sum game using first-order dynamics.
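Below is a minimal PyTorch sketch of this training procedure with simple alternating gradient updates; the two-layer architectures, Adam optimizers, full-batch updates, and the use of weight decay as a stand-in for the function-space norm penalties are all illustrative assumptions (Appendix H.2 discusses optimistic simultaneous first-order dynamics instead).

```python
import torch
import torch.nn as nn

def agmm_train(X, Z, y, epochs=500, lr=1e-3, weight_decay=1e-3):
    """Alternating first-order sketch of the AGMM objective:
    min_theta max_w  E_n[(y - h_theta(x)) f_w(z)] - E_n[f_w(z)^2]  (+ norm penalties).
    X, Z: (n, d) float tensors; y: (n, 1) float tensor."""
    learner = nn.Sequential(nn.Linear(X.shape[1], 64), nn.ReLU(), nn.Linear(64, 1))
    adversary = nn.Sequential(nn.Linear(Z.shape[1], 64), nn.ReLU(), nn.Linear(64, 1))
    opt_h = torch.optim.Adam(learner.parameters(), lr=lr, weight_decay=weight_decay)
    opt_f = torch.optim.Adam(adversary.parameters(), lr=lr, weight_decay=weight_decay)
    for _ in range(epochs):
        # adversary ascent step: maximize the moment minus the second-moment penalty
        test = adversary(Z)
        resid = (y - learner(X)).detach()
        adv_obj = (resid * test).mean() - (test ** 2).mean()
        opt_f.zero_grad()
        (-adv_obj).backward()
        opt_f.step()
        # learner descent step: minimize the moment against the (frozen) adversary
        test = adversary(Z).detach()
        resid = y - learner(X)
        opt_h.zero_grad()
        (resid * test).mean().backward()
        opt_h.step()
    return learner
```

The second-moment penalty on the adversary is what makes the game easier for the learner as it approaches the truth, which is the key ingredient behind the fast-rate analysis.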
We will show that we can reduce the problem presented in (5) to a regression oracle over the function space $\mathcal{F}$ and a classification oracle over the function space $\mathcal{H}$ (see Appendix I for more details). We will assume that we have a regression oracle that solves the square loss problem over $\mathcal{F}$: for any set of features and labels $z^n, u^n$ it returns
\[ \mathrm{Oracle}_{\mathcal{F}}(z^n, u^n) = \arg\min_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} (u_i - f(z_i))^2. \]
Moreover, we assume that we have a classification oracle that solves the weighted binary classification problem over $\mathcal{H}$ w.r.t. the accuracy criterion: for any set of sample weights $w^n$, binary labels $v^n$ in $\{0, 1\}$ and features $x^n$:
\[ \mathrm{Oracle}_{\mathcal{H}}(x^n, v^n, w^n) = \arg\max_{h \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^{n} w_i\, \Pr_{z_i \sim \mathrm{Bernoulli}\left(\frac{1 + h(x_i)}{2}\right)}[v_i = z_i]. \]

Theorem 4. Consider the algorithm where, for $t = 1, \ldots, T$:
\[ u^t_i = y_i - \frac{1}{t-1} \sum_{\tau=1}^{t-1} h_\tau(x_i) \ \ \text{(with the convention $u^1_i = y_i$)}, \quad f_t = \mathrm{Oracle}_{\mathcal{F}}(z^n, u^n_t), \quad v^t_i = 1\{f_t(z_i) > 0\}, \quad w^t_i = |f_t(z_i)|, \quad h_t = \mathrm{Oracle}_{\mathcal{H}}(x^n, v^n_t, w^n_t). \]
Suppose that the set $A = \{(f(z_1), \ldots, f(z_n)) : f \in \mathcal{F}\}$ is a convex set. Then the ensemble $\bar{h} = \frac{1}{T} \sum_{t=1}^{T} h_t$ is an $O\left(\frac{\log(T) + 1}{T}\right)$-approximate solution to the minimax problem in Equation (5).

In practice, we will consider a random forest regression method as the oracle over $\mathcal{F}$ and a binary decision tree classification method as the oracle for $\mathcal{H}$ (which we will refer to as RFIV). Prior work on random forests for causal inference has focused primarily on learning forests that capture the heterogeneity of treatment effects, but did not account for non-linear relationships between the treatment and the outcome variable. The method proposed in this section makes this possible. Observe that the convexity of the set $A$ is violated by the random forest function class with a bounded set of trees. In practice, however, this non-convexity can be alleviated by growing a large set of trees on bootstrap sub-samples or by using gradient boosted forests as oracles for $\mathcal{F}$. Moreover, observe that we solely addressed the optimization problem; we postpone the statistical analysis of random forests (e.g. their critical radius) to future work (see also Appendix I).

In the appendix we also provide further applications of our main theorems. In Appendix D we show how our theorems apply to the case where $\mathcal{H}$ and $\mathcal{F}$ are growing linear sieves, which is a typical approach to non-parametric estimation in the econometrics literature (see e.g. Chen and Pouzo [2012]). In Appendix G we analyze the case where $\mathcal{H}$ and $\mathcal{F}$ are function classes defined via shape constraints; we analyze total variation bound constraints and convexity constraints. This application provides analogues of convex regression and isotonic regression for the endogenous regression setting, and draws connections to recent work in econometrics on estimation subject to monotonicity constraints [Chetverikov and Wilhelm, 2017].
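The following scikit-learn based sketch illustrates the oracle reduction of Theorem 4, with a random forest as the regression oracle over $\mathcal{F}$ and a depth-limited decision tree as the weighted classification oracle over $\mathcal{H}$ (the RFIV estimator described above); the hyperparameters, the small weight offset, and the mapping of $\{0,1\}$ class labels back to $[-1,1]$ predictions are illustrative choices.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeClassifier

def rfiv_fit(X, Z, y, T=50):
    """Oracle-reduction sketch (cf. Theorem 4): alternate a regression oracle over F
    (random forest on z) with a weighted classification oracle over H (tree on x),
    and return the averaged ensemble mapped back to [-1, 1]."""
    learners = []
    ensemble_pred = np.zeros(len(y))
    for t in range(1, T + 1):
        # regression oracle over F: fit the current average residual on z
        u = y - ensemble_pred / max(t - 1, 1)
        f_t = RandomForestRegressor(n_estimators=100).fit(Z, u)
        fz = f_t.predict(Z)
        # weighted classification oracle over H: labels are the sign of f_t(z),
        # weights its magnitude (small offset avoids an all-zero weight vector)
        v = (fz > 0).astype(int)
        w = np.abs(fz) + 1e-12
        h_t = DecisionTreeClassifier(max_depth=5).fit(X, v, sample_weight=w)
        learners.append(h_t)
        ensemble_pred += 2 * h_t.predict(X) - 1   # map {0,1} labels to {-1,+1}
    def h_bar(X_new):
        return np.mean([2 * h.predict(X_new) - 1 for h in learners], axis=0)
    return h_bar
```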
Experimental Design. We consider the following data generating processes. For $n_x = 1$ and $n_z \geq 1$:
\[ y = h_0(x[0]) + e + \delta, \qquad x = \gamma\, z[0] + (1 - \gamma)\, e + \eta, \qquad z \sim N(0, I_{n_z}), \]
where $e$ is a Gaussian confounder and $\delta, \eta$ are small independent Gaussian noise terms. When $n_x = n_z > 1$, we consider the modified treatment equation $x = \gamma\, z + (1 - \gamma)\, e + \eta$. We consider several functional forms for $h_0$, including absolute value, sigmoid and sin functions (more details in Appendix J), and several ranges of the number of samples $n$, number of treatments $n_x$, number of instruments $n_z$ and instrument strength $\gamma$. As classic benchmarks we consider 2SLS with polynomial features of a fixed degree (2SLS) and a regularized version of 2SLS where ElasticNetCV is used in both stages (Reg2SLS).

In addition to these regimes, we consider high-dimensional experiments with images, following the scenarios proposed in Bennett et al. [2019], where either the instrument $z$, the treatment $x$, or both are images from the MNIST dataset, consisting of grayscale images of $28 \times 28$ pixels. We compare the performance of our approaches to that of Bennett et al. [2019], using their code. A full description of the DGP is given in the supplementary material.
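A minimal generator for the low-dimensional designs is sketched below; the specific noise scales are illustrative assumptions, since the exact variances are specified in the supplementary material rather than here.

```python
import numpy as np

def generate_data(n, n_z, gamma, h0=np.abs, seed=0):
    """Toy IV data in the spirit of the design above (illustrative noise scales)."""
    rng = np.random.default_rng(seed)
    z = rng.normal(0, 1, size=(n, n_z))          # instruments
    e = rng.normal(0, 1, size=n)                 # unobserved confounder
    x = gamma * z[:, 0] + (1 - gamma) * e + rng.normal(0, 0.1, size=n)
    y = h0(x) + e + rng.normal(0, 0.1, size=n)
    return z, x.reshape(-1, 1), y

# example usage with a moderate instrument strength
z, x, y = generate_data(2000, 3, gamma=0.6)
```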
Results. The main findings are: i) for a small number of treatments, the RKHS method with a Nystrom approximation (NystromRKHS) outperforms all methods (Figure 1); ii) for a moderate number of instruments and treatments, Random Forest IV (RFIV) significantly outperforms most methods, with the second best being the neural network methods (AGMM, KLayerTrained) (Figure 2); iii) the estimator for sparse linear hypotheses can handle an ultra-high-dimensional regime (Figure 3); iv) the neural network methods (AGMM, KLayerTrained) outperform the state of the art in prior work [Bennett et al., 2019] for tasks that involve images (Figure 4). The figures below present the average MSE across experiments and two times the standard error of the average MSE.

[Figure 1: average MSE of NystromRKHS, 2SLS, Reg2SLS and RFIV across functional forms (including abs and step), for n = 300, n_z = 1, n_x = 1.]

[Figure 2: average MSE of NystromRKHS, 2SLS, Reg2SLS, RFIV, AGMM and KLayerTrained across functional forms (including abs and band), for n = 2000, n_z = 10, n_x = 10.]

[Figure 3: average MSE of the sparse linear estimator for n = 400, n_z = n_x := p and h_0(x[0]) = x[0], across several values of p.]

[Figure 4: MSE on the high-dimensional MNIST DGPs (MNIST z, MNIST x, MNIST xz) for DeepGMM (Bennett et al. [2019]), AGMM and KLayerTrained.]
References
Alekh Agarwal, Olivier Chapelle, Miroslav Dudík, and John Langford. A reliable effective terascalelinear learning system.
The Journal of Machine Learning Research , 15(1):1111–1133, 2014.Zeyuan Allen-Zhu, Yuanzhi Li, and Yingyu Liang. Learning and Generalization in Overparam-eterized Neural Networks, Going Beyond Two Layers. arXiv e-prints , art. arXiv:1811.04918,November 2018.Martin Anthony and Peter L Bartlett.
Neural network learning: Theoretical foundations . cambridgeuniversity press, 2009.Francis R Bach and Michael I Jordan. Predictive low-rank decomposition for kernel methods. In
Proceedings of the 22nd international conference on Machine learning , pages 33–40, 2005.Krishnakumar Balasubramanian, Tong Li, and Ming Yuan. On the optimality of kernel-embeddingbased goodness-of-fit tests. arXiv preprint arXiv:1709.08148 , 2017.Peter L Bartlett, Olivier Bousquet, Shahar Mendelson, et al. Local rademacher complexities.
TheAnnals of Statistics , 33(4):1497–1537, 2005.Peter L Bartlett, Dylan J Foster, and Matus J Telgarsky. Spectrally-normalized margin bounds forneural networks. In
Advances in Neural Information Processing Systems, pages 6240–6249, 2017.
Andrew Bennett, Nathan Kallus, and Tobias Schnabel. Deep generalized method of moments for instrumental variable analysis. In
Advances in Neural Information Processing Systems , pages3559–3569, 2019.Mikolaj Binkowski, Dougal J. Sutherland, Michael Arbel, and Arthur Gretton. Demystifying MMDGANs. In
International Conference on Learning Representations , 2018.Richard Blundell, Xiaohong Chen, and Dennis Kristensen. Semi-nonparametric iv estimation ofshape-invariant engel curves.
Econometrica , 75(6):1613–1669, 2007.Léon Bottou, Olivier Chapelle, Dennis DeCoste, and Jason Weston.
Large-Scale Kernel Machines(Neural Information Processing) . The MIT Press, 2007. ISBN 0262026252.Stephen Boyd and Lieven Vandenberghe.
Convex optimization . Cambridge University Press, 2004.EM Bronshtein. ε -entropy of convex sets and functions. Siberian Mathematical Journal , 17(3):393–398, 1976.Andrea Caponnetto and Ernesto De Vito. Optimal rates for the regularized least-squares algorithm.
Foundations of Computational Mathematics , 7(3):331–368, 2007.Sabyasachi Chatterjee, Adityanand Guntuboyina, and Bodhisattva Sen. On risk bounds in isotonicand other shape restricted regression problems.
Ann. Statist. , 43(4):1774–1800, 08 2015. doi:10.1214/15-AOS1324. URL https://doi.org/10.1214/15-AOS1324 .Xiaohong Chen and Timothy M Christensen. Optimal sup-norm rates and uniform inference onnonlinear functionals of nonparametric iv regression.
Quantitative Economics , 9(1):39–84, 2018.Xiaohong Chen and Demian Pouzo. Efficient estimation of semiparametric conditional momentmodels with possibly nonsmooth residuals.
Journal of Econometrics , 152(1):46–60, 2009.Xiaohong Chen and Demian Pouzo. Estimation of nonparametric conditional moment models withpossibly nonsmooth generalized residuals.
Econometrica , 80(1):277–321, 2012.Victor Chernozhukov, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, Whit-ney Newey, and James Robins. Double/debiased machine learning for treatment and structuralparameters.
The Econometrics Journal , 21(1):C1–C68, 2018. doi: 10.1111/ectj.12097. URL https://onlinelibrary.wiley.com/doi/abs/10.1111/ectj.12097 .Denis Chetverikov and Daniel Wilhelm. Nonparametric instrumental variable estimation un-der monotonicity.
Econometrica , 85(4):1303–1320, 2017. doi: 10.3982/ECTA13639. URL https://onlinelibrary.wiley.com/doi/abs/10.3982/ECTA13639 .Serge Darolles, Yanqin Fan, Jean-Pierre Florens, and Eric Renault. Nonparametric instrumentalregression.
Econometrica , 79(5):1541–1565, 2011.Constantinos Daskalakis, Andrew Ilyas, Vasilis Syrgkanis, and Haoyang Zeng. Training gans withoptimism.
CoRR , abs/1711.00141, 2017. URL http://arxiv.org/abs/1711.00141 .Miguel del Álamo and Axel Munk. Total variation multiscale estimators for linear inverse problems. arXiv preprint arXiv:1905.08515 , 2019.Simon S Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably optimizesover-parameterized neural networks. arXiv preprint arXiv:1810.02054 , 2018.John Duchi and Yoram Singer. Efficient online and batch learning using forward backward splitting.
Journal of Machine Learning Research , 10(99):2899–2934, 2009. URL http://jmlr.org/papers/v10/duchi09a.html .John Duchi, Shai Shalev-Shwartz, Yoram Singer, and Tushar Chandra. Efficient projections onto thel1-ball for learning in high dimensions. In
Proceedings of the 25th International Conference onMachine Learning , ICML08, pages 272–279, New York, NY, USA, 2008. doi: 10.1145/1390156.1390191.Jianqing Fan and Yuan Liao. Endogeneity in high dimensions.
Annals of Statistics, 42(3):872, 2014.
Dylan J. Foster and Vasilis Syrgkanis. Orthogonal Statistical Learning. arXiv e-prints, art. arXiv:1901.09036, January 2019.
Yoav Freund and Robert E. Schapire. Adaptive game playing using multiplicative weights.
Games and Economic Behavior, 29(1):79–103, 1999. ISSN 0899-8256. doi: https://doi.org/10.1006/game.1999.0738.
Eric Gautier, Alexandre Tsybakov, and Christiern Rose. High-dimensional instrumental variables regression and confidence sets. arXiv preprint arXiv:1105.2454, 2011.
Evarist Gine and Richard Nickl.
Mathematical Foundations of Infinite-Dimensional Statistical Mod-els . Cambridge University Press, USA, 1st edition, 2015. ISBN 1107043166.Noah Golowich, Alexander Rakhlin, and Ohad Shamir. Size-independent sample complexity ofneural networks. In Sébastien Bubeck, Vianney Perchet, and Philippe Rigollet, editors,
Proceed-ings of the 31st Conference On Learning Theory , volume 75 of
Proceedings of Machine LearningResearch , pages 297–299. PMLR, 06–09 Jul 2018. URL http://proceedings.mlr.press/v75/golowich18a.html .Adityanand Guntuboyina and Bodhisattva Sen. Covering numbers for convex functions.
IEEETransactions on Information Theory , 59(4):1957–1965, 2012.Peter Hall, Joel L Horowitz, et al. Nonparametric methods for inference in the presence of instru-mental variables.
The Annals of Statistics , 33(6):2904–2929, 2005.Lars Peter Hansen. Large sample properties of generalized method of moments estimators.
Econometrica, 50(4):1029–1054, 1982. ISSN 00129682, 14680262.
Jason Hartford, Greg Lewis, Kevin Leyton-Brown, and Matt Taddy. Deep IV: A flexible approach for counterfactual prediction. In Doina Precup and Yee Whye Teh, editors,
Proceedings of the 34thInternational Conference on Machine Learning , volume 70 of
Proceedings of Machine Learn-ing Research , pages 1414–1423, International Convention Centre, Sydney, Australia, 06–11 Aug2017. PMLR. URL http://proceedings.mlr.press/v70/hartford17a.html .Joel L Horowitz. Asymptotic normality of a nonparametric instrumental variables estimator.
Inter-national Economic Review , 48(4):1329–1349, 2007.Joel L Horowitz. Applied nonparametric instrumental variables estimation.
Econometrica , 79(2):347–394, 2011.Yu-Guan Hsieh, Franck Iutzeler, Jérôme Malick, and Panayotis Mertikopoulos. On the convergenceof single-call stochastic extra-gradient methods. arXiv e-prints , art. arXiv:1908.08465, August2019.Chi Jin, Praneeth Netrapalli, and Michael I. Jordan. Minmax optimization: Stable limit points ofgradient descent ascent are locally optimal.
CoRR , abs/1902.00618, 2019. URL http://arxiv.org/abs/1902.00618 .Sham M Kakade, Varun Kanade, Ohad Shamir, and Adam Kalai. Efficient learning of generalizedlinear and single index models with isotonic regression. In J. Shawe-Taylor, R. S. Zemel, P. L.Bartlett, F. Pereira, and K. Q. Weinberger, editors,
Advances in Neural Information ProcessingSystems 24 , pages 927–935. Curran Associates, Inc., 2011.Sanjiv Kumar, Mehryar Mohri, and Ameet Talwalkar. Sampling methods for the nyström method.
Journal of Machine Learning Research , 13(Apr):981–1006, 2012.John Langford, Lihong Li, and Tong Zhang. Sparse online learning via truncated gradient. InD. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors,
Advances in Neural Information Processing Systems 21, pages 905–912. Curran Associates, Inc., 2009.
Quoc V Le. Building high-level features using large scale unsupervised learning. In , pages 8595–8598. IEEE, 2013.
Guillaume Lecué and Shahar Mendelson. Regularization and the small-ball method ii: complexity dependent error rates.
The Journal of Machine Learning Research , 18(1):5356–5403, 2017.Guillaume Lecué and Shahar Mendelson. Regularization and the small-ball method i: Sparserecovery.
Ann. Statist. , 46(2):611–641, 04 2018. doi: 10.1214/17-AOS1562. URL https://doi.org/10.1214/17-AOS1562 .Qi Lei, Jason D. Lee, Alexandros G. Dimakis, and Constantinos Daskalakis. SGD Learns One-LayerNetworks in WGANs. arXiv e-prints , art. arXiv:1910.07030, October 2019.Chun-Liang Li, Wei-Cheng Chang, Yu Cheng, Yiming Yang, and Barnabás Póczos. Mmd gan:Towards deeper understanding of moment matching network. In
Advances in Neural InformationProcessing Systems , pages 2203–2213, 2017.Tianyi Lin, Chi Jin, Michael Jordan, et al. Near-optimal algorithms for minimax optimization. arXivpreprint arXiv:2002.02417 , 2020.Feng Liu, Wenkai Xu, Jie Lu, Guangquan Zhang, Arthur Gretton, and DJ Sutherland. Learning deepkernels for non-parametric two-sample tests. arXiv preprint arXiv:2002.09116 , 2020.Yishay Mansour and David A. McAllester. Generalization bounds for decision trees. In
Proceedingsof the Thirteenth Annual Conference on Computational Learning Theory , COLT00, pages 69–74,San Francisco, CA, USA, 2000. Morgan Kaufmann Publishers Inc. ISBN 155860703X.Pascal Massart. Some applications of concentration inequalities to statistics.
Annales de la Faculté des sciences de Toulouse : Mathématiques, Ser. 6, 9(2):245–303, 2000.
Andreas Maurer. A vector-contraction inequality for rademacher complexities. In
InternationalConference on Algorithmic Learning Theory , pages 3–17. Springer, 2016.Brendan McMahan. Follow-the-regularized-leader and mirror descent: Equivalence theorems andl1 regularization. In
Proceedings of the Fourteenth International Conference on Artificial Intelli-gence and Statistics , pages 525–533, 2011.Panayotis Mertikopoulos, Houssam Zenati, Bruno Lecouat, Chuan-Sheng Foo, Vijay Chan-drasekhar, and Georgios Piliouras. Mirror descent in saddle-point problems: Going the extra(gradient) mile.
CoRR , abs/1807.02629, 2018. URL http://arxiv.org/abs/1807.02629 .Konstantin Mishchenko, Dmitry Kovalev, Egor Shulgin, Peter Richtárik, and Yura Malitsky. Revis-iting Stochastic Extragradient. arXiv e-prints , art. arXiv:1905.11373, May 2019.Aryan Mokhtari, Asuman Ozdaglar, and Sarath Pattathil. A unified analysis of extra-gradient andoptimistic gradient methods for saddle point problems: Proximal point approach. arXiv preprintarXiv:1901.08511 , 2019.Krikamol Muandet, Arash Mehrjou, Si Kai Lee, and Anant Raj. Dual iv: A single stage instrumentalvariable regression. arXiv preprint arXiv:1910.12358 , 2019.Krikamol Muandet, Wittawat Jitkrittum, and Jonas Kübler. Kernel conditional moment test viamaximum moment restriction. arXiv preprint arXiv:2002.09225 , 2020.Cameron Musco and Christopher Musco. Recursive sampling for the nystrom method. In
Advancesin Neural Information Processing Systems , pages 3833–3845, 2017.Sahand N. Negahban, Pradeep Ravikumar, Martin J. Wainwright, and Bin Yu. A unified frame-work for high-dimensional analysis of m -estimators with decomposable regularizers. Statist.Sci. , 27(4):538–557, 11 2012. doi: 10.1214/12-STS400. URL https://doi.org/10.1214/12-STS400 .Arkadi Nemirovski. Prox-method with rate of convergence o(1/t) for variational inequalities withlipschitz continuous monotone operators and smooth convex-concave saddle point problems.
SIAM Journal on Optimization, 15(1):229–251, 2004. doi: 10.1137/S1052623403425629. URL https://doi.org/10.1137/S1052623403425629.
Yu Nesterov. Smooth minimization of non-smooth functions.
Mathematical programming , 103(1):127–152, 2005.Whitney K Newey and James L Powell. Instrumental variable estimation of nonparametric models.
Econometrica , 71(5):1565–1578, 2003.Maher Nouiehed, Maziar Sanjabi, Tianjian Huang, Jason D Lee, and Meisam Razaviyayn. Solvinga class of non-convex min-max games using iterative first order methods. In
Advances in NeuralInformation Processing Systems 32 , pages 14934–14942. Curran Associates, Inc., 2019.Dino Oglic and Thomas Gärtner. Nyström method with kernel k-means++ samples as landmarks. In
Proceedings of the 34th International Conference on Machine Learning-Volume 70 , pages 2652–2660. JMLR. org, 2017.Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In
Advances inneural information processing systems , pages 1177–1184, 2008.Alexander Rakhlin, Karthik Sridharan, and Alexandre B. Tsybakov. Empirical entropy, minimaxregret and minimax risk.
Bernoulli , 23(2):789–824, 05 2017. doi: 10.3150/14-BEJ679. URL https://doi.org/10.3150/14-BEJ679 .Sasha Rakhlin and Karthik Sridharan. Optimization, learning, and games with predictable se-quences. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger,editors,
Advances in Neural Information Processing Systems 26 , pages 3066–3074. Curran Asso-ciates, Inc., 2013.Bernhard Schölkopf, Ralf Herbrich, and Alex J Smola. A generalized representer theorem. In
International conference on computational learning theory , pages 416–426. Springer, 2001.Shai Shalev-Shwartz and Shai Ben-David.
Understanding machine learning: From theory to algo-rithms . Cambridge university press, 2014.Shai Shalev-Shwartz and Yoram Singer. Convex repeated games and fenchel duality. In
Advancesin neural information processing systems , pages 1265–1272, 2007.Rahul Singh, Maneesh Sahani, and Arthur Gretton. Kernel instrumental variable regression. In
Advances in Neural Information Processing Systems , pages 4595–4607, 2019.M. Soltanolkotabi, A. Javanmard, and J. D. Lee. Theoretical insights into the optimization landscapeof over-parameterized shallow neural networks.
IEEE Transactions on Information Theory , 65(2):742–769, 2019.Suvrit Sra, Sebastian Nowozin, and Stephen J Wright.
Optimization for machine learning . MitPress, 2012.Vasilis Syrgkanis, Alekh Agarwal, Haipeng Luo, and Robert E Schapire. Fast convergence of regu-larized learning in games. In
Advances in Neural Information Processing Systems , pages 2989–2997, 2015.Kiran K Thekumparampil, Prateek Jain, Praneeth Netrapalli, and Sewoong Oh. Efficient algorithmsfor smooth minimax optimization. In
Advances in Neural Information Processing Systems 32 ,pages 12680–12691. Curran Associates, Inc., 2019.A. W. Van Der Vaart and J. A. Wellner.
Weak Convergence and Empirical Processes: With Applica-tions to Statistics . Springer Series, March 1996.Martin J Wainwright.
High-dimensional statistics: A non-asymptotic viewpoint , volume 48. Cam-bridge University Press, 2019.Holger Wendland.
Scattered data approximation, volume 17. Cambridge University Press, 2004.
Junchi Yang, Negar Kiyavash, and Niao He. Global convergence and variance-reduced optimization for a class of nonconvex-nonconcave minimax problems. arXiv preprint arXiv:2002.09621, 2020.
L. Yeganova and W. J. Wilbur. Isotonic regression under lipschitz constraint.
Journal of Optimiza-tion Theory and Applications , 141(2):429–443, 2009. doi: 10.1007/s10957-008-9477-0. URL https://doi.org/10.1007/s10957-008-9477-0 .Yuchen Zhang, Martin J Wainwright, and Michael I Jordan. Lower bounds on the performanceof polynomial-time algorithms for sparse linear regression. In
Conference on Learning Theory, pages 921–948, 2014.

Supplementary Material: Minimax Estimation of Conditional Moment Models

Contents
A Further Discussion on Related Work
B Beyond the IV Moments
C Supplementary Discussion of Main Theorems
  C.1 Adaptivity of Regularized Estimator
  C.2 Critical Radius and Rademacher Complexity via Covering
  C.3 Solving the Min-Max Optimization Problem
  C.4 From Projected MSE to MSE: Measure of Ill-Posedness
  C.5 Minimax Optimality of Estimation Rate
D Application: Growing Linear Sieves
E Application: Reproducing Kernel Hilbert Spaces
  E.1 Characterization of Sufficient Test Functions
  E.2 Critical Radius of F_U and G_B
  E.3 Closed-Form Solution to Optimization Problem
  E.4 Bounds on Ill-Posedness Measure
F Application: High-Dimensional Sparse Linear Function Spaces
  F.1 Hard Sparsity Constraints without Minimum Eigenvalue
  F.2 ℓ1-Relaxation under Minimum Eigenvalue Condition
  F.3 Solving the ℓ1-Relaxation Optimization Problem via First-Order Methods
  F.4 Bounds on Ill-Posedness Measure
G Application: Shape Constrained Functions
  G.1 Monotone functions and functions with small total variation
  G.2 Convex functions
H Neural Networks
  H.1 MMD-GMM: A Neural Network Architecture for Adversarial GMM
  H.2 Adversarial Training: Simultaneous Optimistic First-Order Stochastic Optimization
I Random Forests via a Reduction Approach
J Experimental Analysis
  J.1 Experiments with Image Data
K Proofs from Section 3 and Appendix C
  K.1 Preliminary Lemmas
  K.2 Proof of Theorem 1
  K.3 Proof of Theorem 2
  K.4 Proof of Theorem 6
L Proofs from Section 4 and Appendix E
  L.1 Proof of Proposition 9
  L.2 Proof of Proposition 10
  L.3 Proof of Lemma 11
M Proofs from Section 5 and Appendix F
  M.1 Proof of Corollary 3
  M.2 Proof of Propositions 13 and 14
N Proofs from Section 7 and Appendix I
  N.1 Proof of Theorem 4
A Further Discussion on Related Work
The non-parametric IV problem has a long history in econometrics [Newey and Powell, 2003, Blundell et al., 2007, Chen and Pouzo, 2012, Chen and Christensen, 2018, Hall et al., 2005, Horowitz, 2007, 2011, Darolles et al., 2011, Chen and Pouzo, 2009]. Arguably the closest to our work is that of Chen and Pouzo [2012] (in particular their Theorem 4.1), who consider estimation of non-parametric function classes via the method of sieves and a penalized minimum distance estimator of the form $\min_{h \in \mathcal{H}} \mathbb{E}[\mathbb{E}[y - h(x) \mid z]^2] + \lambda R(h)$, where $R(h)$ is a regularizer. The authors approximate the function class $\mathcal{H}$ by linear functions in a growing feature space. Subsequently, they also estimate the function $m(z) = \mathbb{E}[y - h(x) \mid z]$ based on another growing sieve.

Though it may seem at first that the approach in that paper and ours are quite distinct, the population limit of our objective function coincides with theirs. To see this, consider the simplified version of our estimator presented in (5), where the function classes are already norm-constrained and no norm-based regularization is imposed, and consider for a moment the population version of this estimator:
\[ \min_{h \in \mathcal{H}} \max_{f \in \mathcal{F}} \Psi(h, f) - \|f\|_2^2 = \min_{h \in \mathcal{H}} \max_{f \in \mathcal{F}} \mathbb{E}\big[(y - h(x))\, f(z) - f(z)^2\big]. \]
Observe that if $\mathcal{F}$ is expressive enough (if $T(h - h_0) \in \mathcal{F}$), then the maximizing test function is $\frac{1}{2}\mathbb{E}[y - h(x) \mid z] = \frac{1}{2}\mathbb{E}[h_0(x) - h(x) \mid z]$. Then by the law of iterated expectations, the population criterion becomes
\[ \min_{h \in \mathcal{H}} \mathbb{E}\Big[ (y - h(x))\, \tfrac{1}{2}\, \mathbb{E}[y - h(x) \mid z] - \tfrac{1}{4}\, \mathbb{E}[y - h(x) \mid z]^2 \Big] = \tfrac{1}{4} \min_{h \in \mathcal{H}} \mathbb{E}\big[ \mathbb{E}[y - h(x) \mid z]^2 \big]. \]
Thus in the population limit, and without norm regularization on the test function $f$, our criterion is equivalent to the minimum distance criterion analyzed in Chen and Pouzo [2012]. Another point of similarity is that we prove convergence of the estimator in terms of the pseudo-metric, the projected MSE defined in Section 4 of Chen and Pouzo [2012], and like that paper we require additional conditions to relate the pseudo-metric to the true MSE.

The present paper differs in a number of ways: (i) the finite sample criterion is different; (ii) we prove our results using localized Rademacher analysis, which allows for weaker assumptions; (iii) we consider a broader range of estimation approaches than linear sieves, necessitating more of a focus on optimization.

Digging into the second point, Chen and Pouzo [2012] take a more traditional parameter recovery approach which requires several minimum eigenvalue conditions and several regularity conditions to be satisfied for their estimation rate to hold (see e.g. their Assumptions 3.1, 3.2, 3.3, 4.1 and C.1). This is like a mean squared error proof in an exogenous linear regression setting that requires the minimum eigenvalue of the feature covariance to be bounded. Moreover, such parameter recovery methods seem limited to the growing sieve approach, since only then does one have a clear finite dimensional parameter vector to work with for each fixed $n$.

In contrast, we work with infinite dimensional parameter spaces directly, and our analysis makes no further assumptions other than boundedness of the random variables and the conditional moment restriction in order to provide a projected MSE rate. We do not require that the hypothesis space be a convex set, nor that the moment is path-wise differentiable with respect to $h$.
Relaxing these assumptions is important, since they are violated in three of our leading examples: linear hypothesis spaces with hard sparsity constraints, neural network spaces, and tree-based regressors. Another benefit of the localized Rademacher analysis is that we do not require a preliminary proof of consistency, which is typical of more classical approaches to MSE rates. Such proofs typically require that $n$ be larger than some constant before the convergence rate kicks in, so that the estimator is within some small ball around the truth. This constant can sometimes be prohibitively large. Our convergence rate is global and holds without any lower bound condition on $n$. The sieve method is most closely related to our RKHS section (and the expository sieve Appendix D), where essentially we consider infinite dimensional linear function spaces. However, unlike the sieve method, we do not clip the eigenfunctions to a growing finite set, but rather impose an RKHS penalty. We show that this approach has advantages in auto-tuning to the ill-posedness of the problem. Finally, we do not require a bound on the ill-posedness of the problem in order to prove convergence rates in terms of the pseudo-metric; this bound is only needed in post-processing to relate the pseudo-metric to the MSE. By contrast, Chen and Pouzo [2012] use the bounded ill-posedness condition (their Assumption 4.1) to prove convergence in the pseudo-metric.

As a concrete example of the differences in the analysis, we apply our main Theorem 1 to the case where $\mathcal H$ and $\mathcal F$ are growing sieves, equipped with the parameter $\ell_2$ norms, i.e.
$$\mathcal H = \{\langle\theta,\phi_n(\cdot)\rangle:\theta\in\mathbb R^{k_n}\},\quad \mathcal F = \{\langle\beta,\psi_n(\cdot)\rangle:\beta\in\mathbb R^{m_n}\},\quad \|\langle\theta,\phi_n(\cdot)\rangle\|_{\mathcal H} = \|\theta\|_2,\quad \|\langle\beta,\psi_n(\cdot)\rangle\|_{\mathcal F} = \|\beta\|_2,$$
for some fixed and growing feature maps $\phi_n(\cdot),\psi_n(\cdot)$. In that case $\eta_n$ will correspond to the approximation error of the sieve $\psi_n$ that is used for the test function space and, if we choose $h_* = \arg\min_{h\in\mathcal H}\|h_0-h\|_2$, then $\|T(h_*-h_0)\|_2\le\|h_*-h_0\|_2 =: \epsilon_n$ will correspond to the approximation error of the sieve $\phi_n$ that is used for approximating the model $h_0$. In that case, Theorem 1 gives a bound of $O(\delta_n\|\theta_*\|_2 + \eta_n + \epsilon_n)$, where $\|\theta_*\|_2$ is the $\ell_2$ norm of the parameter of the projection of $h_0$ on the sieve space for the model, i.e. $\arg\min_{\theta\in\mathbb R^{k_n}}\|\langle\theta,\phi_n(\cdot)\rangle - h_0\|_2$. Moreover, $\delta_n$ is a bound on the critical radius of $\mathcal F_U$ and $\mathcal G_{B,U}$. Since both are finite dimensional linear function classes, via standard covering arguments (see Corollary 5), we can bound $\delta_n = O\big(\sqrt{\max\{k_n,m_n\}\log(n)/n}\big)$. Combined with ill-posedness conditions provided in [Chen and Pouzo, 2012], our results can thus give an alternative proof of the results in [Chen and Pouzo, 2012] that i) does not make minimum eigenvalue conditions, and ii) provides adaptivity to $\|\theta_*\|_2$ without knowledge of it, thereby justifying theoretically the use of the regularization term $R(h)$, which was mostly proposed for experimental improvement in [Chen and Pouzo, 2012].
We provide a more thorough exposition of how our main theorem applies to the case of growing sieves in Appendix D. (The $\log(n)$ factor above can also be saved with a more careful analysis of the critical radius for finite dimensional linear function spaces; see Appendix D.)

The localized Rademacher analysis also allows us to consider hypothesis spaces that are not linear sieves, such as neural nets and random forests. This introduces some new optimization difficulties, as the estimator cannot be written in closed form (as it can for linear sieves). Our work gives several solutions for these difficulties, via iterative first order algorithms. Intuitively, our optimization algorithms gradually and iteratively make gradient steps towards solving both optimization problems (of regressing $y-h(x)$ on $z$ and minimizing $\mathbb E[\mathbb E[y-h(x)\mid z]^2]$ over $\mathcal H$), as opposed to calculating full solutions of either problem. This formulation allows us to work with arbitrary hypothesis spaces and not just linear sieves.

There is also a growing body of work in the machine learning literature on the non-parametric instrumental variable regression problem [Hartford et al., 2017, Bennett et al., 2019, Singh et al., 2019, Muandet et al., 2019, 2020]. In particular, Singh et al. [2019] assume that $h_0$ falls in an RKHS and that the conditional distribution of $X$ given $Z$ is represented via a conditional kernel mean embedding. They offer very strong finite-sample RKHS-norm rates on the estimated $\hat h$, which typically imply sup-norm rates for the recovered function. However, we focus on projected MSE and MSE rates and achieve faster rates as a function of the eigendecay of the kernel and the degree of ill-posedness. Moreover, the work of Singh et al. [2019] makes several stronger prior assumptions that control the smoothness of the function within the kernel, assumptions that are typical of RKHS-norm guarantees in kernel ridge regression [Caponnetto and De Vito, 2007], but which are not required for the weaker MSE metric. Muandet et al. [2019] also propose a method that is very related to the second-moment penalized method that we propose, albeit the motivation stems from a different dual formulation of the two-stage-least-squares problem presented in [Hartford et al., 2017] and, similar to [Bennett et al., 2019], they only offer asymptotic consistency of the estimator and only focus on RKHS function spaces. Finally, Muandet et al. [2020] consider the version of the minimax criterion that does not impose the second moment penalty on $f$, and make the important observation that for RKHS function spaces the internal maximization takes a closed form, leading to a pairwise sample criterion (see Equation (11) and Equation (14)). Moreover, they focus primarily on hypothesis testing as opposed to estimation. The un-penalized criterion can have sub-optimal convergence guarantees, as it does not possess the property that as the hypothesis of the learner gets close to the truth, the adversary is testing smaller functions in terms of variance. The inability to achieve the fast rates attained via the critical radius was the main reason why we introduced the second moment penalty. The sub-optimality of the un-penalized kernel based criterion was also proven in the context of hypothesis testing by Balasubramanian et al. [2017], who also show that a form of second moment penalization can yield hypothesis tests with optimal power when the alternative is very close to the null.
Moreover, for RKHS hypothesis spaces, we show that the penalized method still admits a closed form solution, albeit the closed form now depends on the inverse of a kernel matrix, which makes it less amenable to gradient training, as we discuss in Section 6.

B Beyond the IV Moments
Our results easily extend to arbitrary moments that are linear in $h$, which can capture several other problems in econometrics and causal inference, but for simplicity of exposition we focus on the case of moments of the form $y-h(x)$. Moreover, our results can also be extended to non-linear and non-smooth moments $\psi(y;h(x))$, albeit in that case our convergence rates will be with respect to the distance metric $d(\hat h, h_0) = \sqrt{\mathbb E\big[\mathbb E[\psi(y;\hat h(x)) - \psi(y;h_0(x))\mid z]^2\big]}$ as opposed to the projected MSE distance. For instance, in the case of $\alpha$-quantile IV regression, $\psi(y;h(x)) = \alpha - 1\{y\le h(x)\}$ and the distance metric corresponds to $d(\hat h, h_0) = \sqrt{\mathbb E\big[\mathbb E[1\{y\le\hat h(x)\} - \alpha\mid z]^2\big]}$.
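To make the substitution concrete, the following minimal Python sketch (ours, purely illustrative; the function names are not part of the paper) shows how the quantile moment replaces the residual $y-h(x)$ inside the empirical criterion $\Psi_n(h,f) = \frac1n\sum_i\psi(y_i;h(x_i))f(z_i)$:

```python
import numpy as np

def quantile_moment(y, h_x, alpha=0.5):
    """psi(y; h(x)) = alpha - 1{y <= h(x)}, the alpha-quantile IV moment."""
    return alpha - (y <= h_x).astype(float)

def empirical_minimax_criterion(psi_vals, f_z):
    """Psi_n(h, f) = (1/n) sum_i psi(y_i; h(x_i)) f(z_i)."""
    return np.mean(psi_vals * f_z)

# toy usage; for the basic IV moment one would instead pass psi_vals = y - h_x
y = np.array([0.1, 1.2, -0.3, 0.7])
h_x = np.array([0.0, 1.0, 0.0, 1.0])
f_z = np.array([1.0, -0.5, 0.2, 0.3])
print(empirical_minimax_criterion(quantile_moment(y, h_x), f_z))
```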
C Supplementary Discussion of Main Theorems

C.1 Adaptivity of Regularized Estimator
Suppose that we know that for $B, U = 1$, the functions in $\mathcal H_B$, $\mathcal F_U$ have ranges in $[-1,1]$ as their inputs range over $\mathcal X$ and $\mathcal Z$ correspondingly. Then our theorem requires that we set $\lambda\ge\delta^2$ and $\mu\ge 2\lambda(4L^2+27)$, where $\delta$ depends on the critical radius of the function classes $\mathcal F$ and $\mathcal G$. Observe that none of these values depend on the norm of the benchmark hypothesis $\|h_*\|_{\mathcal H}$, which can be arbitrary and is not constrained by our theorem. For instance, if we knew that the true model $h_0\in\mathcal H$ and $T(h-h_0)\in\mathcal F_{L\|h-h_0\|_{\mathcal H}}$, then we can apply the latter theorem to get rates of the form $O\big(\delta\max\{1,\|h_0\|_{\mathcal H}\}\big)$ by setting $\lambda = \delta^2$ and $\mu = 2\delta^2(4L^2+27)$. This hyperparameter tuning only requires knowledge of the critical radius of the function classes $\mathcal F$ and $\mathcal H$ and the Lipschitz constant $L$ of the operator $T$, but does not require knowledge of the norm of the true model $\|h_0\|_{\mathcal H}$, nor upper bounds on it. If the true model does not fall in the hypothesis space $\mathcal H$, then observe that we also require knowledge of the unconstrained approximation error, i.e. if we knew that $\inf_{h\in\mathcal H}\|h-h_0\|_2\le\epsilon_n$ and that $T(h-h_0)\in\mathcal F_{L\|h-h_0\|_{\mathcal H}}$, then we can choose $\delta\ge\epsilon_n$ to get rates of the form $O\big(\delta\max\{1,\|h_*\|_{\mathcal H}\} + \epsilon_n\big)$, where $h_* = \arg\inf_{h\in\mathcal H}\|h-h_0\|_2$. Again we do not require knowledge of the norm of the unconstrained projection $\|h_*\|_{\mathcal H}$, just bounds on the approximation error of the unconstrained function space. Thus the regularized estimator adapts to the norm of the projection of the true model on $\mathcal H$. These results are in line with recent work in statistical learning theory [Lecué and Mendelson, 2017, 2018] for square losses and extend these qualitative insights to the minimax objectives that we deal with.

C.2 Critical Radius and Rademacher Complexity via Covering
The critical radius of a function class is characterized to within a constant factor by its empirical localized Rademacher critical radius, which in turn is characterized by the empirical entropy integral. The empirical Rademacher complexity of a function class $\mathcal G:\mathcal V\to[-1,1]$, for a given set of samples $S = \{v_i\}_{i=1}^n$, is defined as:
$$\mathcal R_S(\delta;\mathcal G) = \mathbb E_{\{\epsilon_i\}_{i=1}^n}\bigg[\sup_{g\in\mathcal G:\|g\|_{2,n}\le\delta}\ \frac1n\sum_{i=1}^n\epsilon_i\,g(v_i)\bigg]$$
where the $\epsilon_i$ are i.i.d. Rademacher signs. The empirical critical radius is defined as any solution $\hat\delta_n$ to $\mathcal R_S(\delta;\mathcal G)\le\delta^2$. Proposition 14.1 of Wainwright [2019] shows that w.p. $1-\zeta$,
$$\delta_n = O\bigg(\hat\delta_n + \sqrt{\frac{\log(1/\zeta)}{n}}\bigg). \qquad (7)$$
Thus we can choose $\delta$ in our main theorems based on the empirical critical radius $\hat\delta_n$.

Moreover, an upper bound on the empirical critical radius can be obtained via the empirical covering integral, defined as follows. An empirical $\epsilon$-cover of $\mathcal G$ is any function class $\mathcal G_\epsilon$ such that for all $g\in\mathcal G$, $\inf_{g_\epsilon\in\mathcal G_\epsilon}\|g_\epsilon-g\|_{2,n}\le\epsilon$. We denote with $N(\epsilon,\mathcal G,S)$ the size of the smallest empirical $\epsilon$-cover of $\mathcal G$. The empirical metric entropy of $\mathcal G$ is $H(\epsilon,\mathcal G,S) = \log(N(\epsilon,\mathcal G,S))$. An empirical $\delta$-slice of $\mathcal G$ is defined as $\mathcal G_{S,\delta} = \{g\in\mathcal G:\|g\|_{2,n}\le\delta\}$. Then the empirical critical radius of $\mathcal G$ is upper bounded by any solution to the inequality:
$$\int_{\delta^2/4}^{\delta}\sqrt{\frac{H(\epsilon,\mathcal G_{S,\delta},S)}{n}}\,d\epsilon \le \delta^2 \qquad (8)$$
Observe that a conservative upper bound on $\hat\delta_n$ comes from replacing $\mathcal G_{S,\delta}$ inside the integral with $\mathcal G$, i.e. when we do not restrict the function class to an empirical $\delta$-slice when calculating its empirical metric entropy. For many function classes (e.g. parametric $\ell_2$-balls, RKHSs, high-dimensional sparse parametric spaces, VC-subgraph classes) this still yields tight results. For some other cases, such as $\ell_1$-balls centered around a sparse parameter, this can be loose.

When we make this relaxation, observe that we can derive an upper bound on the critical radius of $\mathcal G_{B,U}$ as a function of the empirical metric entropies of $\mathcal H$ and $\mathcal F$. Observe that if $\mathcal H_\epsilon$ is an empirical $\epsilon$-cover of $\mathcal H_B$ and $\mathcal F_\epsilon$ is an empirical $\epsilon$-cover of $\mathcal F_U$, then since these classes contain functions uniformly bounded in $[-1,1]$, we have that:
$$\inf_{h_\epsilon\in\mathcal H_\epsilon,\ f_\epsilon\in\mathcal F_\epsilon}\ \big\|(h_\epsilon-h_*)f_\epsilon - (h-h_*)f^U_h\big\|_{2,n} \le \|h_\epsilon - h\|_{2,n} + 2\,\|f_\epsilon - f^U_h\|_{2,n} \le 3\epsilon$$
i.e. the products of the cover elements form a $3\epsilon$-cover of the function class $\mathcal G_{B,U}$ defined in Equation (2). Hence, the empirical metric entropy of $\mathcal G_{B,U}$ satisfies:
$$H(\epsilon,\mathcal G_{B,U},S)\le H(\epsilon/3,\mathcal H_B,S) + H(\epsilon/3,\mathcal F_U,S)$$
Thus by applying Proposition 14.1 of Wainwright [2019] we get the following corollary.
Corollary 5.
Suppose that $\hat\delta_n$ satisfies the inequality:
$$\int_{\delta^2/4}^{\delta}\sqrt{\frac{H(\epsilon/3,\mathcal H_B,S) + H(\epsilon/3,\mathcal F_U,S)}{n}}\,d\epsilon \le \delta^2$$
Then w.p. $1-\zeta$, $\delta_n\le O\big(\hat\delta_n + \sqrt{\log(1/\zeta)/n}\big)$, where $\delta_n$ is the maximum of the critical radii of $\mathcal F_U$, $\mathcal G_{B,U}$ and $\hat{\mathcal G}_{B,U}$.

For instance, if $\mathcal H$ and $\mathcal F$ are VC-subgraph classes with constant VC dimension, then the above is satisfied for $\hat\delta_n = O\big(\sqrt{\log(n)/n}\big)$.
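As a concrete illustration, the following Python sketch (ours; the entropy function is a stand-in for a VC-type class and the constants follow the covering-integral inequality as reconstructed above) numerically finds the smallest $\delta$ satisfying the inequality on a grid:

```python
import numpy as np
from scipy.integrate import quad

def entropy_vc(eps, d=5):
    """Illustrative metric entropy of a class of 'dimension' d (up to constants)."""
    return d * np.log(1.0 + 2.0 / max(eps, 1e-12))

def critical_radius(entropy, n, grid=np.logspace(-4, 0, 400)):
    """Smallest delta on the grid satisfying
       int_{delta^2/4}^{delta} sqrt(entropy(eps)/n) d eps <= delta^2."""
    for delta in grid:
        lhs, _ = quad(lambda e: np.sqrt(entropy(e) / n), delta**2 / 4, delta)
        if lhs <= delta**2:
            return delta
    return grid[-1]

for n in [100, 1000, 10000]:
    print(n, critical_radius(entropy_vc, n))  # shrinks roughly like sqrt(log(n)/n)
```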
C.3 Solving the Min-Max Optimization Problem

In this section we outline some strategies for addressing the empirical min-max problem required by the estimators described in Equations (3) and (6). In subsequent sections, we will present instances of these optimization approaches for each of the function classes that we consider.

First observe that if the hypothesis space can be parameterized as $h(x;\theta)$, such that the moment $\psi(y;h(x;\theta))$ is convex in $\theta$, and the inner optimization problem is solvable in closed form, then we can solve the empirical problem via subgradient descent, i.e. letting
$$f_*(\cdot;h) := \arg\sup_{f\in\mathcal F}\ \Psi_n(h,f) - \lambda\,\Phi(f), \qquad \theta_{t+1} := \theta_t - \eta\big(\mathbb E_n\big[f_*(z;h(\cdot;\theta_t))\,\nabla_\theta h(x;\theta_t)\big] + \mu\,\nabla_\theta R(h(\cdot;\theta_t))\big)$$
where $\Phi, R$ are the regularizers on $f$ and $h$ correspondingly. After $T$ iterations, the average parameter $\bar\theta = \frac1T\sum_{t=1}^T\theta_t$ will correspond to an $O(T^{-1/2})$-approximate solution to the min-max problem. This approximate solution will satisfy the same guarantees as $\hat h$ presented in Theorem 1 and Theorem 2, augmented by an extra $O(T^{-1/2})$ additive factor.

Many times, even if the hypothesis space is not parameterizable by a finite dimensional parameter vector $\theta$, we can invoke characterizations (typically referred to as representer theorems) which show that the empirical solution can always be expressed in terms of a finite set of parameters (often of the order of the number of samples). This is for instance the case when $\mathcal F$ and $\mathcal H$ belong to a reproducing kernel Hilbert space, as we will see in Section 4. In such settings, we will see that even the overall min-max optimization problem can be expressed in closed form, involving only matrix inversions and multiplications, with matrices of size of the order of $n$.

Since the min-max problem does not have a smooth gradient, one can also benefit by invoking algorithms that are tailored to saddle point problems. These improvements typically assume some structure on the inner optimization problem. For instance, if the function $f$ can be parameterized as $f(\cdot;w)$ such that the inner maximization problem is concave in $w$, then faster than $T^{-1/2}$ optimization rates can be achieved. We will see examples of such settings in the high-dimensional linear function class setting in Section 5. The following papers provide examples of algorithms that achieve $T^{-1}$ approximation rates (see e.g. Nesterov [2005], Nemirovski [2004], Rakhlin and Sridharan [2013], Mokhtari et al. [2019]).

One simple such algorithm is the simultaneous optimistic mirror descent algorithm proposed in Rakhlin and Sridharan [2013], and also recently analyzed by several papers, both theoretically and empirically, in the context of non-convex optimization problems (see e.g. Daskalakis et al. [2017], Mertikopoulos et al. [2018]). In this algorithm, instead of fully solving the internal optimization problem, we only take gradient steps. However, it modifies the gradient descent algorithm to incorporate a notion of optimism (i.e. that the next gradient will look similar to the last gradient).
In particular, if we use the short-hand notation $\Psi_n(\theta,w) := \Psi_n(h(\cdot;\theta),f(\cdot;w))$, then in the simplified setting where we have no regularization on $\theta, w$, the algorithm is described via the following update dynamics:
$$\theta_{t+1} = \theta_t - 2\eta\,\nabla_\theta\Psi_n(\theta_t,w_t) + \eta\,\nabla_\theta\Psi_n(\theta_{t-1},w_{t-1})$$
$$w_{t+1} = w_t + 2\eta\,\nabla_w\Psi_n(\theta_t,w_t) - \eta\,\nabla_w\Psi_n(\theta_{t-1},w_{t-1})$$
Convex constraints on $\theta$ and $w$ can be easily incorporated via projection steps, and we defer to Rakhlin and Sridharan [2013] for the formal definition of the algorithm in that setting. Similarly, for the regularized versions one would simply replace $\Psi_n$ with its regularized counterparts.

Unlike the sub-gradient descent approach, the simultaneous optimistic gradient dynamics, with the regularized version of our estimator, can also be implemented in a stochastic gradient manner, where a mini-batch of samples is drawn at each step (with replacement) from the empirical set of samples and $\Psi_n$ is replaced with the empirical expectation over that sub-sample. This enables applications where storing the whole dataset in memory is prohibitive. Moreover, this algorithm has variants that have proven beneficial for neural nets (see, e.g., the Optimistic Adam algorithm of Daskalakis et al. [2017], also used in the related work of Bennett et al. [2019] in a generalized method of moments setup). Properties of simultaneous gradient dynamics in non-convex/non-concave settings have also been a topic of recent interest in the machine learning community, and recent techniques from this line of work can be invoked to empirically solve the optimization problem (see e.g. Jin et al. [2019], Nouiehed et al. [2019], Thekumparampil et al. [2019], Yang et al. [2020], Lin et al. [2020]).
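To make the update rule concrete, here is a minimal, self-contained Python sketch (ours, purely illustrative; the toy data generating process and step sizes are our own choices) applying the simultaneous optimistic gradient dynamics to a one-dimensional linear IV problem with a second-moment penalty on the test function:

```python
import numpy as np

def ogda_minimax(grad_theta, grad_w, theta0, w0, eta=0.05, T=5000):
    """Simultaneous optimistic gradient descent/ascent for min_theta max_w Psi_n(theta, w):
    each player steps using twice the current gradient minus the previous one."""
    theta, w = theta0, w0
    g_th_prev, g_w_prev = grad_theta(theta, w), grad_w(theta, w)
    theta_sum = 0.0
    for _ in range(T):
        g_th, g_w = grad_theta(theta, w), grad_w(theta, w)
        theta = theta - eta * (2 * g_th - g_th_prev)   # descent step with optimism
        w = w + eta * (2 * g_w - g_w_prev)             # ascent step with optimism
        g_th_prev, g_w_prev = g_th, g_w
        theta_sum += theta
    return theta_sum / T                               # averaged iterate

# Toy linear IV model: h(x; theta) = theta * x, f(z; w) = w * z, and the penalized criterion
# Psi_n(theta, w) = E_n[(y - theta * x) * w * z] - E_n[(w * z)^2].
rng = np.random.default_rng(0)
z = rng.normal(size=500); u = rng.normal(size=500)
x = 0.6 * z + 0.4 * u
y = 1.5 * x + u
g_theta = lambda th, w: -w * np.mean(x * z)
g_w = lambda th, w: np.mean((y - th * x) * z) - 2 * w * np.mean(z ** 2)
print(ogda_minimax(g_theta, g_w, 0.0, 0.0))  # close to the IV slope 1.5, up to sampling noise
```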
C.4 From Projected MSE to MSE: Measure of Ill-Posedness

If we want to get a bound on the RMSE of $\hat h$, i.e. $\|\hat h - h_0\|_2$, then we need to bound the quantity:
$$\tau_*(\delta) = \sup_{h\in\mathcal H_B:\ \|T(h-h_*)\|_2\le\delta}\ \|h-h_*\|_2$$
In fact, it suffices to bound the measure of ill-posedness of the operator $T$ with respect to the function class $\mathcal H_B$, defined as:
$$\tau := \sup_{h\in\mathcal H_B}\ \frac{\|h-h_*\|_2}{\|T(h-h_*)\|_2}.$$
Both of these measures have been used in the literature on conditional moment models. For instance, Chen and Pouzo [2012] define both of these measures for the case where $\mathcal H_B$ is a space of growing linear sieves. In that case, the second measure $\tau$ is typically referred to as the sieve measure of ill-posedness. Then observe that Theorem 1 implies that:
$$\|\hat h - h_*\|_2 \le \tau\,\|T(\hat h - h_*)\|_2 \le O\big(\tau\,\delta_n + \tau\,\|T(h_*-h_0)\|_2\big) \le O\big(\tau\,\delta_n + \tau\,\|h_*-h_0\|_2\big)$$
which by a triangle inequality also implies that:
$$\|\hat h - h_0\|_2 \le O\big(\tau\,\delta_n + (\tau+1)\,\|h_*-h_0\|_2\big)$$
Choosing $h_* = \arg\min_{h\in\mathcal H:\|h\|_{\mathcal H}\le B}\|h_0-h\|_2$ yields the bound:
$$\|\hat h - h_0\|_2 \le O\Big(\tau\,\delta_n + (\tau+1)\inf_{h\in\mathcal H:\|h\|_{\mathcal H}\le B}\|h-h_0\|_2\Big)$$
Subsequently one can appropriately choose $\mathcal H$ and $B$ so as to trade off the ill-posedness constant and the bias term.

Moreover, we show that when we have a bounded ill-posedness measure, then we can prove a more convenient version of Theorem 1, which only requires bounds on the critical radius of the centered function classes $\mathrm{star}(\mathcal H_B - h_*) = \{r(h-h_*): h\in\mathcal H_B, r\in[0,1]\}$ and $\mathrm{star}(T(\mathcal H_B - h_*)) = \{r\,T(h-h_*): h\in\mathcal H_B, r\in[0,1]\}$, as opposed to the space $\mathcal G$ that contains products of these functions.
Theorem 6. Let $\mathcal F$ be a symmetric and star-convex set of test functions and consider the estimator in Equation (3). Let $h_0$ be any hypothesis (not necessarily in $\mathcal H$) that satisfies the conditional moment restriction (1), suppose that $\mathcal H$ satisfies $\inf_{h\in\mathcal H}\|h-h_0\|_2\le\epsilon_n$, and let $h_* = \arg\inf_{h\in\mathcal H}\|h-h_0\|_2$. Moreover, suppose that:
$$\forall h\in\mathcal H:\quad \min_{f\in\mathcal F_{L\|h-h_*\|_{\mathcal H}}}\|f - T(h-h_*)\|_2\le\eta_n$$
Assume that functions in $\mathcal H_B$ and $\mathcal F_U$ have uniformly bounded ranges in $[-1,1]$ and that:
$$\delta := \delta_n + \eta_n + \epsilon_n + c_0\sqrt{\frac{\log(c_1/\zeta)}{n}}$$
for universal constants $c_0,c_1$, with $\delta_n$ an upper bound on the critical radii of the classes $\mathcal F_U$ and
$$\mathrm{star}(\mathcal H_B - h_*) := \{r(h-h_*): h\in\mathcal H_B, r\in[0,1]\}, \qquad \mathrm{star}(T(\mathcal H_B - h_*)) := \{r\,f_h: h\in\mathcal H_B, r\in[0,1]\}$$
where $f_h = \arg\min_{f\in\mathcal F_U}\|f - T(h-h_*)\|_2$. If $O(\delta^2)\ge\lambda\ge\delta^2/U$ and $O(\delta^2)\ge\mu\ge 2\lambda(4L^2 + 27U/B)$, then $\hat h$ satisfies w.p. $1-\zeta$:
$$\|\hat h - h_0\|_2 = O\big(\tau\,\delta\,\max\{1,\|h_*\|_{\mathcal H}\}\big)$$
C.5 Minimax Optimality of Estimation Rate

In this section we take the viewpoint of establishing minimax optimal rates for the estimation problem of interest and discuss under which circumstances the upper bound we provide will typically be tight (i.e. achieve the statistically best possible projected RMSE). Suppose that the only prior assumptions we are willing to make about our data generating process are that it satisfies the moment condition, that $h_0\in\mathcal H$ and that $T\in\mathcal T$ for some function class $\mathcal H$ and linear operator class $\mathcal T$. Moreover, let $\mathcal F := \{Th: T\in\mathcal T, h\in\mathcal H\}$. What is the minimax estimation rate, with respect to the projected MSE norm, achievable in this setting? More concretely, let $D(h_0,T)$ be any distribution consistent with function $h_0$, linear operator $T$ and the conditional moment condition $Th_0 = \mathbb E[y\mid z]$. Then for any estimator $\hat h$ that takes as input a training sample $S$ of size $n$, drawn i.i.d. from $D(h_0,T)$, and returns a function $\hat h_S$, we want to lower bound the minimax optimal rate:
$$\min_{\hat h}\ \max_{h_0\in\mathcal H,\,T\in\mathcal T}\ \mathbb E_{S\sim D(h_0,T)^n}\Big[\|T(\hat h_S - h_0)\|_2^2\Big]$$
If the space $\mathcal T$ contains the identity, then this is lower bounded by the minimax rate of a non-parametric regression problem over hypothesis space $\mathcal H$. Thus, by standard results on regression problems, the critical radius of $\mathcal H$ is insurmountable for many classes $\mathcal H$ of interest (see e.g. Massart [2000], Bartlett et al. [2005], Rakhlin et al. [2017]).

Moreover, suppose that there exists a $T_0\in\mathcal T$ such that, for all $f\in\mathcal F$, there exists $h\in\mathcal H$ with $T_0h = f$, i.e. $T_0$ is a worst-case operator that allows one to span all of $\mathcal F$. Then even if we knew $T = T_0$, we could not bypass the critical radius of $\mathcal F$ for many classes $\mathcal F$ of interest (see e.g. Bartlett et al. [2005], Rakhlin et al. [2017]). More generally, we can lower bound the minimax risk as:
$$\max_{T\in\mathcal T}\ \min_{\hat h}\ \max_{h_0\in\mathcal H}\ \mathbb E_{S\sim D(h_0,T)^n}\Big[\|T(\hat h_S - h_0)\|_2^2\Big]$$
Let $\mathcal F_T = \{Th: h\in\mathcal H\}$. Then the above can be re-written:
$$\max_{T\in\mathcal T}\ \min_{\hat f\in\mathcal F_T}\ \max_{f_0\in\mathcal F_T}\ \mathbb E_{S\sim D(f_0)^n}\Big[\|\hat f_S - f_0\|_2^2\Big]$$
where $D(f_0)$ is any distribution that satisfies $\mathbb E[y\mid z] = f_0(z)$. This is the minimax lower bound for the regression problem of predicting $y$ from $z$, assuming that $\mathbb E[y\mid z]\in\mathcal F_T$. Thus we have that the minimax rate is at least $\max_T\delta(\mathcal F_T)$. If we knew that there was a finite set of $k$ representative linear operators $T_1,\ldots,T_k$ in $\mathcal T$, such that $\mathcal F = \mathcal F_{T_1}\cup\ldots\cup\mathcal F_{T_k}$, then observe that the critical radius of $\mathcal F$ is at most $O(\log(k))$ larger than the maximum critical radius of the $\mathcal F_{T_i}$. Thus the only case that remains open, where our upper bound might not provide tight results, is when there is no such small finite set of representative operators in $\mathcal T$. In many of our settings, we will have that $\delta(\mathcal F)\sim\delta(\mathcal H)$, which is achieved for the single identity operator $T = I$. The case where our upper bound is loose is essentially the case where knowing the operator, or some equivalence class of the operator, can significantly reduce the sample complexity of the problem. Potentially, in such settings, fitting a first stage model of $T$ to identify the equivalence class or a finite number of viable equivalence classes, and focusing only on a remaining set of $k$ candidate classes $\mathcal F_{T_1}\cup\ldots\cup\mathcal F_{T_k}$ in a second stage, can be beneficial. However, in most of our applications this setting does not arise.
One can for instance follow techniques similar to the aggregation algorithms of Rakhlin et al. [2017], which apply our minimax estimator on an $\epsilon$-partition of the original hypothesis space $\mathcal H$ and then aggregate the resulting winning hypotheses from each partition. However, this would typically be a computationally inefficient algorithm.

D Application: Growing Linear Sieves
Consider the case where $\mathcal H$ and $\mathcal F$ are growing linear sieves, i.e.
$$\mathcal H = \mathcal H_n := \{\langle\theta,\phi_n(\cdot)\rangle:\theta\in\mathbb R^{k_n}\}, \qquad \mathcal F = \mathcal F_n := \{\langle\beta,\psi_n(\cdot)\rangle:\beta\in\mathbb R^{m_n}\},$$
equipped with norms $\|\langle\theta,\phi_n(\cdot)\rangle\|_{\mathcal H} = \|\theta\|_2$ and $\|\langle\beta,\psi_n(\cdot)\rangle\|_{\mathcal F} = \|\beta\|_2$, for some known and growing feature maps $\phi_n(\cdot),\psi_n(\cdot)$.

Moreover, we denote with $\eta_n$ the approximation error of the sieve $\psi_n$ that is used for the test function space, i.e. for all $h,h_*\in\mathcal H$:
$$\inf_{f\in\mathcal F}\|f - T(h-h_*)\|_2\le\eta_n$$
and let $\epsilon_n$ be the approximation error of the sieve $\phi_n$ used for the model, i.e.:
$$\inf_{h\in\mathcal H}\|h-h_0\|_2\le\epsilon_n$$
In that case, applying Theorem 1 with $h_* = \arg\inf_{h\in\mathcal H}\|h-h_0\|_2$ gives a bound w.p. $1-\zeta$ of:
$$\|T(\hat h - h_0)\|_2 \le O\Bigg(\bigg(\delta_n + \epsilon_n + \sqrt{\frac{\log(1/\zeta)}{n}}\bigg)\max\{1,\|\theta_*\|_2\} + \eta_n\Bigg)$$
where $\theta_*$ is the parameter vector that corresponds to $h_*$.

Moreover, $\delta_n$ is a bound on the critical radius of $\mathcal F_U$ and $\mathcal G_{B,U}$. Since both are finite dimensional linear function classes, via standard covering arguments (see Corollary 5), we can bound $\delta_n = O\big(\sqrt{\max\{k_n,m_n\}\log(n)/n}\big)$. We now provide a more intricate argument that removes the $\log(n)$ from this rate. Observe that $\mathcal F_U$ is a simple linear model space and therefore existing results directly apply to show that its critical radius is at most $\sqrt{m_n/n}$ (see e.g. Example 13.5 of Wainwright [2019]). The function space $\mathcal G_{B,U}$ is a bit more subtle. We will in fact bound the critical radius of the following larger class:
$$\tilde{\mathcal G}_{B,U} = \big\{(x,z)\to\langle\theta-\theta_*,\phi_n(x)\rangle\,\langle\beta,\psi_n(z)\rangle:\ \theta\in\mathbb R^{k_n},\ \beta\in\mathbb R^{m_n},\ \|\theta-\theta_*\|_2\le B,\ \|\beta\|_2\le U\big\}$$
We will use the empirical covering integral bound on the critical radius presented in Equation (8). Thus we need to bound the metric entropy of the function class $\tilde{\mathcal G}_{B,U}(\delta) = \{g\in\tilde{\mathcal G}_{B,U}:\|g\|_{2,n}\le\delta\}$. Let $\Psi_n$ denote the $n\times m_n$ matrix whose $i$-th row is the vector $\psi_n(z_i)$, and similarly let $\Phi_n$ denote the $n\times k_n$ matrix with rows $\phi_n(x_i)$. Observe that the empirical $\ell_{2,n}$ norm can then be written as:
$$\big\|\langle\theta-\theta_*,\phi_n(\cdot)\rangle\,\langle\beta,\psi_n(\cdot)\rangle\big\|_{2,n} = \frac{\|\Phi_n(\theta-\theta_*)\circ\Psi_n\beta\|_2}{\sqrt n}$$
Thus $\ell_{2,n}$ defines a norm on the space of Hadamard (coordinate-wise) products $v_1\circ v_2$ of two vectors $v_1\in\mathrm{range}(\Phi_n)$ and $v_2\in\mathrm{range}(\Psi_n)$, namely $\|v_1\circ v_2\|_2/\sqrt n$. Moreover, $\tilde{\mathcal G}_{B,U}(\delta)$ is isomorphic to a $\delta$-ball in this space. Moreover, observe that the dimension of the space $\{v_1\circ v_2: v_1\in\mathrm{range}(\Phi_n), v_2\in\mathrm{range}(\Psi_n)\}$ is at most $\mathrm{rank}(\Phi_n)\,\mathrm{rank}(\Psi_n)\le k_n\cdot m_n$. Therefore, by the volumetric argument presented in Example 5.4 of Wainwright [2019], we get that for any set of samples $S$ of size $n$, $H(\epsilon,\tilde{\mathcal G}_{B,U}(\delta),S)\le k_nm_n\log\big(\frac{3\delta}{\epsilon}\big)$. Moreover, observe that:
$$\int_0^\delta\sqrt{\frac{H(\epsilon,\tilde{\mathcal G}_{B,U}(\delta),S)}{n}}\,d\epsilon \le \sqrt{\frac{k_nm_n}{n}}\int_0^\delta\sqrt{\log\Big(\frac{3\delta}{\epsilon}\Big)}\,d\epsilon \le \delta\sqrt{\frac{k_nm_n}{n}}\int_0^1\sqrt{\log\Big(\frac3u\Big)}\,du = c\,\delta\sqrt{\frac{k_nm_n}{n}}$$
for some constant $c$. Thus Equation (8) is satisfied for $\delta = O\big(\sqrt{k_nm_n/n}\big)$.
Combining all of these, we get a projected MSE rate w.p. $1-\zeta$ of:
$$\|T(\hat h - h_0)\|_2 = O\Bigg(\bigg(\sqrt{\frac{k_nm_n}{n}} + \eta_n + \epsilon_n + \sqrt{\frac{\log(1/\zeta)}{n}}\bigg)\max\{1,\|\theta_*\|_2\}\Bigg)$$
Invoking standard bounds on the approximation error of classical sieves (e.g. wavelets) and optimally balancing $k_n, m_n$ yields concrete rates (see e.g. Chen and Pouzo [2012] for approximation rates of particular known sieves).

Combined with ill-posedness conditions provided in [Chen and Pouzo, 2012], our results can thus give an alternative proof of the results in [Chen and Pouzo, 2012] that i) does not make minimum eigenvalue conditions, and ii) provides adaptivity to $\|\theta_*\|_2$ without knowledge of it, thereby justifying theoretically the use of the regularization term $R(h)$, which was mostly proposed for experimental improvement in [Chen and Pouzo, 2012]. For instance, one concrete ill-posedness condition is that $\lambda_{\min}\big(\mathbb E\big[\mathbb E[\phi_n(x)\mid z]\,\mathbb E[\phi_n(x)\mid z]^\top\big]\big)\ge\gamma_n$ and $\lambda_{\max}\big(\mathbb E[\phi_n(x)\phi_n(x)^\top]\big)\le\sigma_n$. Then the ill-posedness constant is upper bounded by $\tau_n = \sqrt{\sigma_n/\gamma_n}$. Moreover, if one assumes a bound on ill-posedness, then Theorem 6 only requires $\delta$ to upper bound the critical radii of simpler function spaces, all of which are simple linear function spaces in finite dimensions. Thus a smaller bound of $O\big(\sqrt{\max\{k_n,m_n\}/n}\big)$ suffices, leading to an error w.p. $1-\zeta$ of the form:
$$\|\hat h - h_0\|_2 = O\Bigg(\tau_n\bigg(\sqrt{\frac{\max\{k_n,m_n\}}{n}} + \eta_n + \epsilon_n + \sqrt{\frac{\log(1/\zeta)}{n}}\bigg)\max\{1,\|\theta_*\|_2\}\Bigg)$$
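As a rough illustration of how this ill-posedness bound could be evaluated from data, the following Python sketch (ours, purely illustrative; the linear first stage used to approximate $\mathbb E[\phi_n(x)\mid z]$ and the toy design are our own assumptions) computes the plug-in estimate $\sqrt{\hat\sigma_n/\hat\gamma_n}$:

```python
import numpy as np

def sieve_ill_posedness(phi_x, psi_z):
    """Plug-in estimate of the sieve ill-posedness bound sqrt(sigma_n / gamma_n).
    phi_x: (n, k_n) model sieve features; psi_z: (n, m_n) instrument sieve features.
    E[phi_n(x)|z] is approximated by a linear projection of phi_x on psi_z."""
    n = phi_x.shape[0]
    coef, *_ = np.linalg.lstsq(psi_z, phi_x, rcond=None)   # first-stage projection
    cond_mean = psi_z @ coef                               # approx E[phi_n(x) | z]
    gamma_n = np.linalg.eigvalsh(cond_mean.T @ cond_mean / n)[0]   # lambda_min
    sigma_n = np.linalg.eigvalsh(phi_x.T @ phi_x / n)[-1]          # lambda_max
    return np.sqrt(sigma_n / gamma_n)

# toy usage: a weaker instrument (smaller first-stage coefficient) gives a larger tau_n
rng = np.random.default_rng(1)
z = rng.normal(size=(2000, 1))
for strength in [0.9, 0.3]:
    x = strength * z + np.sqrt(1 - strength**2) * rng.normal(size=(2000, 1))
    phi = np.hstack([x, x**2]); psi = np.hstack([z, z**2])
    print(strength, sieve_ill_posedness(phi, psi))
```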
E Application: Reproducing Kernel Hilbert Spaces

In this section we deal with the case where $h_0$ lies in a reproducing kernel Hilbert space (RKHS) with kernel $K_{\mathcal H}:\mathcal X\times\mathcal X\to\mathbb R$, denoted $\mathcal H_K$, and $Th$ lies in another RKHS with kernel $K_{\mathcal F}:\mathcal Z\times\mathcal Z\to\mathbb R$. We present the three components required to apply our general theory. First, we characterize a set of test functions that is sufficient to satisfy the requirement that $T(h-h_0)\in\mathcal F_U$; under non-parametric assumptions on the conditional density $p(x\mid z)$ we can take $K_{\mathcal H} = K_{\mathcal F}$. Second, by recent results in statistical learning theory, the critical radius of the function classes $\mathcal F$ and $\mathcal G$ can be characterized as a function of the eigendecay of the kernel $K$ and of the product kernel $K_\times((x,z),(x',z')) = K(x,x')\cdot K(z,z')$, and in the worst case is of the order of $n^{-1/4}$. Combining these two facts, we can then apply Theorem 1 to get a bound on the estimation error of the minimax or regularized minimax estimator. Finally, we show that for this set of test functions and hypothesis spaces, the empirical min-max optimization problem can be solved in closed form; in particular the inner maximization problem can be shown to correspond roughly to a regularized version of a pairwise criterion of the form $\sum_{i,j}\psi_iK(z_i,z_j)\psi_j$, where $\psi_i = \psi(y_i;h(x_i))$.

E.1 Characterization of Sufficient Test Functions
In general, it suffices to assume that the linear operator $T$ is regular enough that, for any $h\in\mathcal H$, we have $Th\in\mathcal H_{K_{\mathcal F}}$ for some known kernel $K_{\mathcal F}$, and that $T$ is an $L$-Lipschitz operator with respect to the pair of RKHS norms $\|\cdot\|_{\mathcal H},\|\cdot\|_{K_{\mathcal F}}$. Then observe that we satisfy the requirement that $T(h-h_*)\in\mathcal F_{L\|h-h_*\|_{\mathcal H}}$ if we take $\mathcal F = \mathcal H_{K_{\mathcal F}}$. We now present two complementary sets of sufficient conditions under which the aforementioned property holds. The first set of conditions applies to a generic function class $\mathcal H$ and asks principally that $p(x\mid\cdot)$ belongs to a common RKHS for each $x$.

Lemma 7.
Suppose that, for each $x$, $p(x\mid\cdot)$ is an element of an RKHS $\mathcal H_{K_{\mathcal F}}$, and that every $h\in\mathcal H$ satisfies $|h(x)|\le\kappa(x)\|h\|_{\mathcal H}$ for some $\kappa:\mathcal X\to\mathbb R$. If $L\triangleq\int\kappa(x)\,\|p(x\mid\cdot)\|_{K_{\mathcal F}}\,dx<\infty$, then $Th\in\mathcal H_{K_{\mathcal F}}$ with $\|Th\|_{K_{\mathcal F}}\le L\|h\|_{\mathcal H}$.

Proof. For any nonnegative $h$, Jensen's inequality implies that
$$\|Th\|_{K_{\mathcal F}} = \Big\|\int h(x)\,p(x\mid\cdot)\,dx\Big\|_{K_{\mathcal F}} \le \int|h(x)|\,\|p(x\mid\cdot)\|_{K_{\mathcal F}}\,dx. \qquad (9)$$
The same result (9) holds for arbitrary signed $h$ due to the decomposition $h = h_+ - h_-$ for $h_+(x) = \max(h(x),0)$ and $h_-(x) = \max(-h(x),0)$, the identity $|h(x)| = |h_+(x)| + |h_-(x)|$, and the triangle inequality $\|Th\|_{K_{\mathcal F}}\le\|Th_+\|_{K_{\mathcal F}} + \|Th_-\|_{K_{\mathcal F}}$.

Now consider any $h\in\mathcal H$ satisfying $|h(x)|\le\kappa(x)\|h\|_{\mathcal H}$ for some $\kappa:\mathcal X\to\mathbb R$. By our inequality (9), we have
$$\|Th\|_{K_{\mathcal F}} \le \|h\|_{\mathcal H}\int\kappa(x)\,\|p(x\mid\cdot)\|_{K_{\mathcal F}}\,dx = L\|h\|_{\mathcal H}.$$

The second set of conditions applies when $h$ belongs to a translation-invariant RKHS and ensures that $Th$ belongs to the same RKHS. Suppose that the kernel $K_{\mathcal H}(x,y) = k(x-y)$. Moreover, suppose that $p(x\mid z) = \rho(x-z)$. Then the following lemma states that $Th\in\mathcal H_{K_{\mathcal H}}$, and hence also $T(h-h_*)\in\mathcal H_{K_{\mathcal H}}$, for any $h,h_*\in\mathcal H_{K_{\mathcal H}}$.

Lemma 8.
Suppose the conditional distribution of $X$ given $Z=z$ has continuous density $p(x\mid z) = \rho(x-z)$ and that $K_{\mathcal H}(x,y) = k(x-y)$ for $k$ positive definite and continuous. If the generalized Fourier transform of $k$ is continuous on $\mathbb R^d\setminus\{0\}$, then $Th\in\mathcal H_{K_{\mathcal H}}$ for all $h\in\mathcal H_{K_{\mathcal H}}$, with $\|Th\|_{K_{\mathcal H}}\le L\|h\|_{K_{\mathcal H}}$ for $L = \|\hat\rho\|_\infty$.

Proof. Fix any $h\in\mathcal H_K$. By [Wendland, 2004, Thm. 10.21], $\|h\|_{K_{\mathcal H}}^2 = \|\hat h/\sqrt{\hat k}\|_2^2<\infty$. Moreover, since $\rho$ is in $L_1$, the Hausdorff-Young inequality implies that $\hat\rho\in L_\infty$. Hence, since $Th = h*\rho$,
$$\|Th\|_{K_{\mathcal H}}^2 = \int\frac{\widehat{Th}(\omega)^2}{\hat k(\omega)}\,d\omega = \int\frac{\hat h(\omega)^2\,\hat\rho(\omega)^2}{\hat k(\omega)}\,d\omega \le \|\hat\rho\|_\infty^2\,\big\|\hat h/\sqrt{\hat k}\big\|_2^2 = L^2\|h\|_{K_{\mathcal H}}^2 < \infty,$$
so that $Th\in\mathcal H_{K_{\mathcal H}}$ by [Wendland, 2004, Thm. 10.21].

Thus in Theorem 1 we can use $\mathcal H = \mathcal F = \mathcal H_K$ for $K = K_{\mathcal H}$. Moreover, we can set $B$ to be an upper bound on the RKHS norm of $h_0$, i.e. $\|h_0\|_{\mathcal H}\le B$, so that we can take $h_* = h_0$ and have $\|T(h_*-h_0)\|_2 = 0$, i.e. zero bias. Moreover, by Lemma 8 we also know that $\|Th_0\|_{\mathcal F}\le LB$ for some constant $L$. Thus we can set $U = 2LB$ in Theorem 1 and have that Equation (4) holds with $\eta_n = 0$. Thus by Theorem 1, we get that the estimator in Equation (3) satisfies w.p. $1-\zeta$:
$$\|T(\hat h - h_0)\|_2 \le \delta_n + c_0\sqrt{\frac{\log(c_1/\zeta)}{n}}$$
where $\delta_n$ is an upper bound on the critical radii of $\mathcal F_{2LB}$ and $\mathcal G_B$, which simplify to:
$$\mathcal F_U := \{f\in\mathcal H_K:\ \|f\|_K\le 2LB\}, \qquad \mathcal G_B := \{(x,z)\to(h(x)-h_0(x))\,T(h-h_0)(z):\ h\in\mathcal H_K,\ \|h-h_0\|_K\le 2B\}$$
Similar rates can also be established for the regularized estimator analogue in Theorem 2, without explicit knowledge of $B$.

E.2 Critical Radius of $\mathcal F_U$ and $\mathcal G_B$

We now turn to analyzing the critical radii of $\mathcal F_U$ and $\mathcal G_B$. We first show that these function spaces are also RKHSs with appropriate kernels and have bounded RKHS norms. This is trivial for $\mathcal F_U$. Moreover, observe that the space $\mathcal G_B$ contains products of two functions $h\,f$, where $h:\mathcal X\to[-1,1]$ and $f:\mathcal Z\to[-1,1]$, with $h\in\mathcal H$ and $f = Th\in\mathcal F$. Thus the space $\mathcal G$, with inner product $\langle hf,h'f'\rangle_{\mathcal G} = \langle h,h'\rangle_{\mathcal H}\langle f,f'\rangle_{\mathcal F}$, also admits a reproducing kernel, defined as (see Proposition 12.2 of Wainwright [2019]):
$$K_{\mathcal G}((x;z),(x';z')) = K_{\mathcal H}(x,x')\,K_{\mathcal F}(z,z')$$
Moreover, $\|hf\|_{\mathcal G} = \|h\|_{\mathcal H}\|f\|_{\mathcal F}$. Thus if $h$ satisfies $\|h\|_{\mathcal H}\le B$, then by Lemma 8, $\|Th\|_{\mathcal F}\le L\|h\|_K\le LB$ for some constant $L$, and hence $\|h\,Th\|_{\mathcal G}\le LB^2$.

Assuming that the RKHS spaces $\mathcal F$ and $\mathcal G$ also have a sufficiently fast eigendecay, existing results in statistical learning theory also bound the generalization error [Wainwright, 2019].
In particular, Corollary 14.2 of Wainwright [2019] shows that for any RKHS $\mathcal H_K$, if we let $\mathcal H_K^B := \{h\in\mathcal H_K:\|h\|_K\le B\}$, then we can bound the localized Rademacher and empirical Rademacher complexities as:
$$\mathcal R(\delta;\mathcal H_K^B)\le B\sqrt{\frac2n}\sqrt{\sum_{j=1}^\infty\min\{\lambda_j,\delta^2\}}, \qquad \mathcal R_S(\delta;\mathcal H_K^B)\le B\sqrt{\frac2n}\sqrt{\sum_{j=1}^n\min\{\lambda_j^S,\delta^2\}}$$
where $\lambda_j$ are the eigenvalues of the kernel and $\lambda_j^S$ are the empirical eigenvalues of the empirical kernel matrix $K$ defined as $K_{ij} = K(x_i,x_j)/n$. Moreover, the unrestricted Rademacher complexity is upper bounded as (see Lemma 26.10 of Shalev-Shwartz and Ben-David [2014]):
$$\mathcal R(\mathcal H_K^B)\le O\bigg(B\sqrt{\frac{\max_{x\in\mathcal X}K(x,x)}{n}}\bigg)$$
Thus in the worst case we can take $\delta_n = O\Big(\sqrt B\big(\tfrac{\max_{x\in\mathcal X}K(x,x)}{n}\big)^{1/4}\Big)$ to get a non-parametric rate of convergence. (Observe that $K(x,x) = \sum_{j=1}^\infty\lambda_je_j(x)^2$ and therefore $\sum_{j=1}^\infty\lambda_j = \sum_{j=1}^\infty\lambda_j\,\mathbb E_x[e_j(x)^2] = \mathbb E_x\big[\sum_{j=1}^\infty\lambda_je_j(x)^2\big] = \mathbb E_x[K(x,x)]\le\max_{x\in\mathcal X}K(x,x)$. Thus in the worst case, when $\lambda_j\ge\delta^2$ for most $j$, we still recover the non-localized from the localized bounds.) However, for many kernels the eigendecay will be sufficiently fast that $\delta^2$ will not be binding in the minimum. For instance, for the Gaussian kernel in one dimension on the domain $[0,1]$, with bandwidth $1$, i.e. $K(x,x') = e^{-(x-x')^2/2}$, we have that $\delta_n = O\big(B\sqrt{\log(n+1)/n}\big)$ (see Example 14.4 of Wainwright [2019]).
Data-adaptive estimation. Moreover, by Equation (7), we can choose $\delta$ in Theorem 1 based on the empirical critical radius. Observe that the empirical eigenvalues are directly computable from the data, and hence we can calculate a data-adaptive quantity $\hat\delta_n$ and choose $\delta$ in Theorem 1 based on this data-adaptive quantity plus an $O\big(\sqrt{\log(1/\zeta)/n}\big)$ term. Moreover, if we use the regularized estimator, then we also do not require knowledge of $B$, which leads to a very data-adaptive estimation scheme. The only thing required is knowledge of an upper bound on the Lipschitz constant $L$ of the operator $T$ with respect to the RKHS norm.
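For instance, the following Python sketch (ours, illustrative; it uses the empirical critical radius inequality in the form stated in Appendix C.2) computes such a data-adaptive $\hat\delta_n$ from the eigenvalues of the empirical kernel matrix:

```python
import numpy as np

def empirical_critical_radius(K, B=1.0, grid=np.logspace(-3, 0, 300)):
    """Data-adaptive critical radius for an RKHS ball of norm B:
    smallest delta on the grid with B * sqrt(2/n * sum_j min(lambda_j, delta^2)) <= delta^2,
    where lambda_j are eigenvalues of the normalized empirical kernel matrix K / n."""
    n = K.shape[0]
    lam = np.clip(np.linalg.eigvalsh(K / n), 0.0, None)
    for delta in grid:
        rad = B * np.sqrt(2.0 / n * np.sum(np.minimum(lam, delta**2)))
        if rad <= delta**2:
            return delta
    return grid[-1]

# toy usage with a Gaussian kernel on n points in [0, 1]
rng = np.random.default_rng(0)
for n in [200, 2000]:
    z = rng.uniform(size=(n, 1))
    K = np.exp(-0.5 * (z - z.T) ** 2)
    print(n, empirical_critical_radius(K))  # shrinks roughly like sqrt(log(n)/n)
```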
E.3 Closed-Form Solution to Optimization Problem

Finally, we show that the optimization problem that defines the estimator in Equation (3) can be computed in closed form. We present the results for the constrained estimator, but exact analogues also hold for the regularized version. The proof can be found in Appendix L.1.

Proposition 9 (Closed-form maximization). Suppose $\mathcal F$ is an RKHS with kernel $K$ equipped with the canonical RKHS norm $\|\cdot\|_{\mathcal F} = \|\cdot\|_K$. Then for any $h$:
$$\sup_{f\in\mathcal F}\ \Psi_n(h,f) - \lambda\Big(\|f\|_{K_{\mathcal F}}^2 + \frac{U}{2\delta^2}\|f\|_{2,n}^2\Big) = \frac{1}{4\lambda}\,\psi_n^\top K_n^{1/2}\Big(\frac{U}{2\delta^2n}K_n + I\Big)^{-1}K_n^{1/2}\psi_n = \frac{1}{4\lambda}\,\psi_n^\top K_n\Big(\frac{U}{2\delta^2n}K_n + I\Big)^{-1}\psi_n \qquad (10)$$
where $K_n = (K(z_i,z_j))_{i,j=1}^n$ is the empirical kernel matrix and $\psi_n = \big(\tfrac1n\psi(y_i;h(x_i))\big)_{i=1}^n$.

We note that if we did not enforce the extra $\ell_{2,n}$ norm penalty on $f$ (i.e. $\delta\to\infty$), then the above inner optimization problem simplifies to:
$$\sup_{f\in\mathcal F}\ \Psi_n(h,f) - \lambda\|f\|_{K_{\mathcal F}}^2 = \frac{1}{4\lambda}\psi_n^\top K_n\psi_n = \frac{1}{4\lambda n^2}\sum_{i,j}\psi(y_i;h(x_i))\,K(z_i,z_j)\,\psi(y_j;h(x_j)) \qquad (11)$$
i.e. we get a pairwise residual loss, weighted by a kernel matrix that is only a function of the conditioning set $z$.

Thus the solution $\hat h$ of the estimator in Equation (3) is equivalent to:
$$\hat h = \arg\min_{h\in\mathcal H}\ \frac{1}{4\lambda}\psi_n^\top M\psi_n + \mu\|h\|_{\mathcal H}^2 = \arg\min_{h\in\mathcal H}\ \psi_n^\top M\psi_n + 4\lambda\mu\|h\|_{\mathcal H}^2$$
where $M := K_n^{1/2}\big(\frac{U}{2\delta^2n}K_n + I\big)^{-1}K_n^{1/2}$. Finally, we show that this outer minimization also has a closed form solution. See Appendix L.2 for the proof.

Proposition 10 (Closed-form minimization). Suppose that $\mathcal H$ and $\mathcal F$ are the RKHSs of the kernels $K_{\mathcal H}$ and $K_{\mathcal F}$, equipped with the canonical RKHS norms $\|\cdot\|_{\mathcal H} = \|\cdot\|_{K_{\mathcal H}}$ and $\|\cdot\|_{\mathcal F} = \|\cdot\|_{K_{\mathcal F}}$. Define the empirical kernel matrices $K_{\mathcal H,n} = (K_{\mathcal H}(x_i,x_j))_{i,j=1}^n$ and $K_{\mathcal F,n} = (K_{\mathcal F}(z_i,z_j))_{i,j=1}^n$. Then the following estimator is an optimizer of Equation (3):
$$\hat h = \sum_{i=1}^n\alpha_{\lambda,i}\,K_{\mathcal H}(x_i,\cdot), \qquad \alpha_\lambda := \big(K_{\mathcal H,n}\,M\,K_{\mathcal H,n} + 4\lambda\mu\,K_{\mathcal H,n}\big)^\dagger\,K_{\mathcal H,n}\,M\,y$$
for $M = K_{\mathcal F,n}^{1/2}\big(\frac{U}{2\delta^2n}K_{\mathcal F,n} + I\big)^{-1}K_{\mathcal F,n}^{1/2}\equiv K_{\mathcal F,n}\big(\frac{U}{2\delta^2n}K_{\mathcal F,n} + I\big)^{-1}$, where $A^\dagger$ is the Moore-Penrose pseudoinverse of a matrix $A$.
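For concreteness, here is a minimal Python sketch of this closed form (ours, illustrative only; the RBF kernel, the toy data generating process and the hyperparameter values are our own choices, and the constants follow Propositions 9 and 10 as stated above):

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.1):
    """Gaussian (RBF) kernel matrix between rows of A and rows of B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def kernel_minimax_iv(x, z, y, lam=0.1, mu=0.1, U=1.0, delta=0.1, gamma=0.1):
    """Regularized minimax IV estimator in the closed form of Proposition 10.
    Returns a function mapping new x values to h_hat(x)."""
    n = len(y)
    KH = rbf_kernel(x, x, gamma)          # kernel on treatments
    KF = rbf_kernel(z, z, gamma)          # kernel on instruments
    M = KF @ np.linalg.inv(U / (2 * delta**2 * n) * KF + np.eye(n))
    alpha = np.linalg.pinv(KH @ M @ KH + 4 * lam * mu * KH) @ KH @ M @ y
    return lambda x_new: rbf_kernel(x_new, x, gamma) @ alpha

# toy endogenous design: x = 0.5 z + 0.5 u, y = h0(x) + u with h0(x) = x
rng = np.random.default_rng(0)
n = 400
z = rng.normal(size=(n, 1)); u = rng.normal(size=n)
x = 0.5 * z[:, 0] + 0.5 * u
y = x + u
h_hat = kernel_minimax_iv(x.reshape(-1, 1), z, y)
# approximately recovers h0 at -1, 0, 1; a direct regression of y on x would be biased upward
print(h_hat(np.array([[-1.0], [0.0], [1.0]])))
```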
Hyper-parameter tuning. Observe that Theorem 1 states that as long as the regularization strength satisfies $\lambda\mu = \Theta(\delta^4L^2)$, then this estimator provides results that automatically scale with the RKHS norm of the true hypothesis $h_0$. Moreover, the regularization hyperparameter $\lambda\cdot\mu$ can also be tuned in practice by evaluating the loss function $\psi_n^\top M\psi_n$ on a left-out sample, with the parameters $n,\delta$ set to the appropriate ones for the size of that sample.

Low-Rank Approximation and Nystrom's Method
The solution to the empirical optimization problem requires inverting an $n\times n$ kernel matrix, which takes time $O(n^3)$. This can be prohibitive for moderate sample sizes of the order of tens of thousands. We note here that one can construct very good approximations to the solution in Proposition 10 by considering low-rank approximations of the kernel matrix $K$. We present here one such low-rank approximation, based on Nystrom's method, but we note that the plethora of recent literature on low-rank kernel approximation methods is applicable to our problem too (see e.g. Kumar et al. [2012], Bach and Jordan [2005], Musco and Musco [2017], Oglic and Gärtner [2017]).

Suppose that we can express our kernel matrices as $K_{\mathcal F,n} = DD^\top$ and $K_{\mathcal H,n} = VV^\top$, where $D$ and $V$ are of dimension $n\times r$, and such that we can express the kernel row of any new test sample as $(K_{\mathcal H}(x_1,x),\ldots,K_{\mathcal H}(x_n,x)) = V\phi(x)$ for some $r$-dimensional vector $\phi(x)$. Then we can express $h(x) = \phi(x)^\top V^\top\alpha_\lambda$. If we then define $\gamma = V^\top\alpha_\lambda$, we can re-write the closed form solutions to the min and max problems as follows:
$$\sup_{f\in\mathcal F}\ \Psi_n(h,f) - \lambda\Big(\|f\|_{K_{\mathcal F}}^2 + \frac{U}{2\delta^2}\|f\|_{2,n}^2\Big) = \frac{1}{4\lambda}\psi_n^\top D\Big(\frac{U}{2\delta^2n}D^\top D + I\Big)^{-1}D^\top\psi_n$$
and, if we let $Q := \big(\frac{U}{2\delta^2n}D^\top D + I\big)^{-1}$ and $A = V^\top D$, then:
$$\gamma := \big(AQA^\top + 4\lambda\mu\,I\big)^{-1}AQD^\top y, \qquad \hat h(x) := \phi(x)^\top\gamma$$
Observe that every matrix calculation in the above expressions requires time at most $O(nr^2)$. Thus if $r\ll n$, we have massively reduced the computation time from $\Theta(n^3)$ to $O(nr^2)$, making the method practical even in very large data regimes.

[Figure 5: Estimated functions based on our minimax estimator for different true functions (e.g. a quadratic, a step function $1\{x>0\}$, $\sin(x)$, $|x|$, and a cubic), using an RBF kernel; the regularization hyper-parameter is chosen via k-fold cross-validation, and the data generating process has the form $x = a\,z + b\,u + \delta$, $y = h_0(x) + u + \epsilon$ with Gaussian $z, u, \epsilon, \delta$.]

[Figure 6: Estimates based on the Nystrom approximation, for the same data generating process and parameter setup as in Figure 5.]

Even though $r$ in the worst case can be of size $n$, we can typically well-approximate the kernel matrices with $r\ll n$. One popular approach for achieving this is Nystrom's method, which essentially sub-samples a set of $r$ points and uses the normalized kernel distances with respect to this subset of points as $D$ and $V$, respectively. (Several sampling strategies have been proposed in the literature to improve upon pure uniform sampling; see e.g. Kumar et al. [2012], Musco and Musco [2017], Oglic and Gärtner [2017]. One popular, practical and simple method is to perform some version of unsupervised clustering of the samples, such as k-means clustering, and choose the points to be the cluster centroids.) In particular, let $S$ denote an $n\times r$ matrix whose $i$-th column contains a $1$ in position $j$ for some randomly sampled index $j$. Then $KS$ is an $n\times r$ sub-matrix of $K$, where a subset $S$ of the columns of $K$ is chosen at random. Then we can approximate $K$ via $VV^\top$, where $V = KSM^{1/2}$ and $M = (S^\top KS)^+$ (i.e. $V$ contains normalized kernel-based similarities to the subset $S$ of $r$ randomly chosen points). Moreover, for any new test point $x$, we can set $\phi(x) = M^{1/2}(K_{\mathcal H}(x_i,x))_{i\in S}$.
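The following Python sketch (ours, illustrative; it uses plain uniform column sampling and the same constants as the previous sketch) shows the corresponding low-rank computation; it can be fed the kernel matrices produced, for instance, by the rbf_kernel helper above:

```python
import numpy as np

def nystrom_factor(K, r, rng):
    """Nystrom low-rank factor: K ~= V V^T with V = K S M^{1/2}, M = (S^T K S)^+,
    where S selects r columns of K uniformly at random."""
    idx = rng.choice(K.shape[0], size=r, replace=False)
    C = K[:, idx]                                  # K S
    W = K[np.ix_(idx, idx)]                        # S^T K S
    evals, evecs = np.linalg.eigh(W)               # symmetric pseudo-inverse square root of W
    evals = np.clip(evals, 0.0, None)
    inv_sqrt = evecs @ np.diag([1/np.sqrt(e) if e > 1e-10 else 0.0 for e in evals]) @ evecs.T
    return C @ inv_sqrt, idx                       # V (n x r), selected indices

def nystrom_minimax_iv(KH, KF, y, lam=0.1, mu=0.1, U=1.0, delta=0.1, r=50, seed=0):
    """Low-rank version of the closed form: every operation costs O(n r^2)."""
    n = len(y)
    rng = np.random.default_rng(seed)
    V, _ = nystrom_factor(KH, r, rng)              # K_H ~= V V^T
    D, _ = nystrom_factor(KF, r, rng)              # K_F ~= D D^T
    Q = np.linalg.inv(U / (2 * delta**2 * n) * D.T @ D + np.eye(r))
    A = V.T @ D
    gamma = np.linalg.solve(A @ Q @ A.T + 4 * lam * mu * np.eye(r), A @ Q @ D.T @ y)
    return V @ gamma                               # in-sample fitted values h_hat(x_i)
```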
E.4 Bounds on Ill-Posedness Measure

The results so far in this section provide bounds on the projected RMSE. In this last subsection, we show that under further assumptions on the strength of the instrument (i.e. the correlation of $x$ and $z$), the projected RMSE rates also imply rates for the RMSE. We give one such set of conditions, mostly as an example of a sufficient set of assumptions that lead to RMSE rates, and in order to provide qualitative insights on what RMSE rates one can expect in different regimes of instrument strength and kernel eigendecay. In this section we will assume that the space $\mathcal H$ is also augmented with a hard constraint on the RKHS norm, i.e. $\mathcal H = \mathcal H_K^B = \{h\in\mathcal H_K:\|h\|_K\le B\}$. Assuming $\|h_0\|_K\le B$, this does not change the statistical guarantees, and moreover the closed form optimization theorems can easily be amended to incorporate a hard constraint on top of the regularization (due to the equivalence between hard constraints and regularization). Imposing this hard constraint will simplify the analysis of this section. (We note that the proof of Theorem 1 implies that even without a hard constraint, with high probability $\|\hat h\|_K\le\|h_0\|_K + \delta + \lambda U/\mu$. Thus the results of this section hold for $B = \|h_0\|_K + \delta + \lambda U/\mu$ even without the extra hard constraint.)

By Mercer's theorem we can express any function in the RKHS $\mathcal H_K^B$ in terms of the eigenfunctions of the kernel: $h = \sum_{j\in J}a_je_j$, with $e_j:\mathcal X\to\mathbb R$ such that $\mathbb E[e_j(x)^2] = 1$ and $\mathbb E[e_i(x)e_j(x)] = 0$ for $i\ne j$, and $J$ a countable set. Moreover, we have $\|h\|_2^2 = \sum_{j\in J}a_j^2$ and $\|h\|_K^2 = \sum_{j\in J}a_j^2/\lambda_j\le B^2$. Thus $\|h\|_K\le B$ implies that for all $m\in\mathbb N_+$: $\sum_{j\ge m}a_j^2\le\lambda_mB^2$. Moreover, we have:
$$\|Th\|_2^2 = \sum_{i,j\in J}a_ia_j\,\mathbb E\big[\mathbb E[e_i(x)\mid z]\,\mathbb E[e_j(x)\mid z]\big].$$
For any $m\in\mathbb N_+$, let $I := \{1,\ldots,m\}$, $e_I = (e_1,\ldots,e_m)$, $a_I = (a_1,\ldots,a_m)$ and:
$$V_m := \mathbb E\big[\mathbb E[e_I(x)\mid z]\,\mathbb E[e_I(x)\mid z]^\top\big]$$
and suppose that $\lambda_{\min}(V_m)\ge\tau_m$, i.e. that these finitely many eigenfunctions maintain some fraction of their independent components, even when they are smoothed through the conditional expectation $p(x\mid z)$. Furthermore, suppose that for all $i\le m<j$: $\big|\mathbb E[\mathbb E[e_i(x)\mid z]\,\mathbb E[e_j(x)\mid z]]\big|\le\gamma_m\le c\,\tau_m$ (for some constant $c$), i.e. the smoothing performed by the conditional expectation does not ruin by much the orthogonality of the first $m$ eigenfunctions with the eigenfunctions of index larger than $m$. (Potentially the strongest of these assumptions is that $\gamma_m\le c\,\tau_m$. This could be avoided by restricting the hypothesis space $\mathcal H_B$ to only be supported on the first $m$ eigenfunctions. However, this would require being able to diagonalize the kernel and also to tune the estimator to the unknown parameters $\tau_m$.)

Observe that if we had a perfect instrument, i.e. $z$ perfectly correlated with $x$, then $V_m = I_m$ and $\mathbb E[\mathbb E[e_i(x)\mid z]\,\mathbb E[e_j(x)\mid z]] = \mathbb E[e_i(x)e_j(x)] = 0$. Thus for a perfect instrument $\tau_m = 1$ and $\gamma_m = 0$. Therefore the latter requirements are implicit assumptions on the strength of the instrument. We show that under these assumptions we can bound the measure of ill-posedness as follows.
Lemma 11.
Suppose that $\lambda_{\min}(V_m)\ge\tau_m$ and that, for some constant $c>0$ and for all $i\le m<j$,
$$\big|\mathbb E[\mathbb E[e_i(x)\mid z]\,\mathbb E[e_j(x)\mid z]]\big|\le c\,\tau_m$$
Then:
$$\tau_*(\delta) := \max_{h\in\mathcal H_K^B:\ \|Th\|_2\le\delta}\|h\|_2 \le \min_{m\in\mathbb N_+}\sqrt{\frac{\delta^2}{\tau_m} + (4c+1)\,B^2\lambda_{m+1}}$$
The optimal choice of $m_*$ roughly solves the equation $\tau_m\lambda_{m+1} = \delta^2/B^2$. If, for instance, $\lambda_m\le m^{-b}$ and $\tau_m\ge m^{-a}$ for some constants $a,b>0$, then $m_*\sim\delta^{-2/(a+b)}$, leading to a rate of:
$$\|\hat h - h_*\|_2 = O\big(\delta^{b/(a+b)}\big)$$
We see that the RMSE rate is of a slower order than the projected RMSE rate. If $\lambda_m$ has an exponential eigendecay, i.e. $\lambda_m\sim e^{-m}$ (e.g. as in the case of a Gaussian kernel), and $\tau_m\ge m^{-a}$, then $m_*\sim\log(1/\delta)$ and we get:
$$\|\hat h - h_*\|_2 = O\big(\delta\,(\log(1/\delta))^{a/2}\big)$$
Thus we only get a logarithmic increase in the RMSE rate as compared to the projected RMSE rate. However, we note that if also $\tau_m\sim e^{-am}$ and $\lambda_m\sim e^{-bm}$, then we get rates of $O\big(\delta^{b/(a+b)}\big)$ by setting $m_*\sim\log\big(1/\delta^{2/(a+b)}\big)$. Finally, in the severely ill-posed setup, where $\tau_m\sim e^{-m}$ and $\lambda_m\sim m^{-b}$, we have $m_*\sim\log(1/\delta)$ and:
$$\|\hat h - h_*\|_2 = O\bigg(\frac{1}{(\log(1/\delta))^{b/2}}\bigg)$$
leading to a very slow rate of convergence that will typically be of the order of $1/\log(n)$. Observe that we achieve this rate for the optimal choice of $m$, without the need to tune our algorithm. The RKHS norm penalty implicitly clips the weight that our functions can put on eigenfunctions with large index and hence controls the measure of ill-posedness, whatever the decay rates of the eigenvalues $\lambda_m$ and $\tau_m$.

F Application: High-Dimensional Sparse Linear Function Spaces
In this section we deal with high-dimensional linear function classes, i.e. the case when $\mathcal X,\mathcal Z\subseteq\mathbb R^p$ for $p\gg n$ and $h_0(x) = \langle\theta_0,x\rangle$. We will address the case when $\theta_0$ is assumed to be sparse, i.e. $\|\theta_0\|_0 := |\{j\in[p]:|\theta_{0,j}|>0\}|\le s$. We will denote with $S$ the subset of coordinates of $\theta_0$ that are non-zero and with $S^c$ its complement. For simplicity of exposition we will also assume that $\mathbb E[x_i\mid z] = \langle\beta_i,z\rangle$, though most of the results of this section also extend to the case where $\mathbb E[x_i\mid z]\in\mathcal F_i$ for some $\mathcal F_i$ with small Rademacher complexity. We provide two sets of results, depending on whether we make further minimum eigenvalue assumptions on the covariance matrix of the random variables $\mathbb E[x_i\mid z]$.

F.1 Hard Sparsity Constraints without Minimum Eigenvalue
In the first result, we apply Theorem 1 to show that even without any further assumptions on the eigenvalues of the covariance matrix $V := \mathbb E\big[\mathbb E[x\mid z]\,\mathbb E[x\mid z]^\top\big]$, we can attain fast rates of the order of $n^{-1/2}$ that are logarithmic in $p$ and only linear in the sparsity $s$ of $h_0$ and the sparsity $r$ of the conditional expectation functions $\mathbb E[x_i\mid z]$, albeit the optimization problem we need to solve to get these rates is non-convex and has running time that is exponential in $r, s$. This setting covers and extends the linear moment case of the setting analyzed in [Fan and Liao, 2014], albeit we only provide RMSE and projected RMSE rates.

Corollary 12.
Suppose that $h_0(x) = \langle\theta_0,x\rangle$ with $\|\theta_0\|_0\le s$ and $\mathbb E[x_i\mid z] = \langle\beta_i,z\rangle$ with $\|\beta_i\|_0\le r$. Let $\mathcal H$ consist of all $s$-sparse linear functions of $x$ and $\mathcal F$ consist of all $(s\cdot r)$-sparse linear functions of $z$ with coefficients in $[-1,1]$, i.e. $\mathcal H$ consists of linear functions in $p$ dimensions with only $s$ non-zero coefficients and $\mathcal F$ consists of linear functions in $q$ dimensions with $s\cdot r$ non-zero coefficients. Then the estimator presented in Equation (3) satisfies w.p. $1-\zeta$:
$$\|T(\hat h - h_0)\|_2 \le O\bigg(\sqrt{\frac{r\,s\,\log(p\,n)}{n}} + \sqrt{\frac{\log(1/\zeta)}{n}}\bigg)$$
The proof follows immediately from the fact that the metric entropy of $r\,s$-sparse linear functions in $p$ dimensions, with coefficients in $[-1,1]$, is of the order of $O(r\,s\log(p/\epsilon))$. Thus we can invoke Corollary 5 to get a bound of $O\big(\sqrt{r\,s\log(p\,n)/n}\big)$ on the critical radii of the classes $\mathcal F_U$ and $\mathcal G_{B,U}$ and apply Theorem 1.

F.2 $\ell_1$-Relaxation under Minimum Eigenvalue Condition

In the second set of results we assume a restricted minimum eigenvalue of $\gamma$ for the matrix $V$ and apply Theorem 2 to get fast rates of the order of $n^{-1/2}$ that also scale logarithmically in $p$, linearly in $r, s$, and with $\gamma^{-1}$. Moreover, the optimization problem required is now a convex problem, as we replace the hard sparsity constraint with an $\ell_1$ constraint. This dichotomy of computationally efficient versus computationally hard estimation, depending on whether we make minimum eigenvalue assumptions, is a well established result in exogenous regression problems [Zhang et al., 2014], and hence we provide here analogous positive results for the endogenous regression setup. We also note that without the minimum eigenvalue condition, our Theorem 1 still provides slow rates of the order of $n^{-1/4}$ for computationally efficient estimators that replace the hard sparsity constraint with an $\ell_1$-norm constraint. Our results based on the $\ell_1$ constraint are also closely related to the work of Gautier et al. [2011], who analyze an endogenous analogue of the Dantzig selector. Our work proposes an alternative to the Dantzig selector that enjoys similar estimation rate guarantees.

Corollary 13.
Suppose that $h_0(x) = \langle\theta_0,x\rangle$ with $\|\theta_0\|_0\le s$, $\|\theta_0\|_1\le B$ and $\|\theta_0\|_\infty\le 1$. Moreover, suppose that $\mathbb E[x_i\mid z] = \langle\beta_i,z\rangle$, with $\beta_i\in\mathbb R^p$ and $\|\beta_i\|_1\le U$, and that the covariance matrix $V$ satisfies the following restricted eigenvalue condition:
$$\forall\nu\in\mathbb R^p\ \text{s.t.}\ \|\nu_{S^c}\|_1\le\|\nu_S\|_1 + 2\delta_{n,\zeta}:\qquad \nu^\top V\nu\ge\gamma\,\|\nu\|_2^2$$
Then let $\mathcal H = \{x\to\langle\theta,x\rangle:\theta\in\mathbb R^p\}$ with $\|\langle\theta,\cdot\rangle\|_{\mathcal H} = \|\theta\|_1$, and $\mathcal F_U = \{z\to\langle\beta,z\rangle:\beta\in\mathbb R^p,\|\beta\|_1\le U\}$ with $\|\langle\beta,\cdot\rangle\|_{\mathcal F} = \|\beta\|_1$. Then the estimator presented in Equation (6), with $\lambda\le\frac{\gamma}{s}$, satisfies w.p. $1-\zeta$:
$$\|T(\hat h - h_0)\|_2 \le O\Bigg(\max\Big\{1,\frac{\lambda}{\gamma s}\Big\}\sqrt{\frac{s}{\gamma}}\bigg((B+U+1)\sqrt{\frac{\log(p)}{n}} + \sqrt{\frac{\log(p/\zeta)}{n}}\bigg)\Bigg)$$
If instead we assume that $\|\beta_i\|_2\le U$ and $\sup_{z\in\mathcal Z}\|z\|_2\le R$, then by setting $\mathcal F_U = \{z\to\langle\beta,z\rangle:\|\beta\|_2\le U\}$ and $\|\langle\beta,\cdot\rangle\|_{\mathcal F} = \|\beta\|_2$, we have:
$$\|T(\hat h - h_0)\|_2 \le O\Bigg(\max\Big\{1,\frac{\lambda}{\gamma s}\Big\}\sqrt{\frac{s}{\gamma}}\bigg((B+1)\sqrt{\frac{\log(p)}{n}} + \frac{U\,R}{\sqrt n} + \sqrt{\frac{\log(p/\zeta)}{n}}\bigg)\Bigg)$$

Second order influence from the $\mathbb E[x_i\mid z]$ model complexity. Notably, observe that in the case of $\|\beta_i\|_2\le U$, if one wants to learn the true $\beta$ with respect to the $\ell_2$ norm, or the functions $\mathbb E[x_i\mid z]$ with respect to the RMSE, then the best rate one can achieve (by standard results for statistical learning with the square loss), even when one assumes that $\sup_{z\in\mathcal Z}\|z\|_2\le R$ and that $\mathbb E[zz^\top]$ has minimum eigenvalue at least $\gamma$, is $\min\big\{\sqrt{p/n},\ (U^2R^2/n)^{1/4}\big\}$. For large $p\gg n$ the first rate is vacuous. Thus we see that even though we cannot accurately learn the conditional expectation functions at a $1/\sqrt n$ rate, we can still estimate $h_0$ at a $1/\sqrt n$ rate, assuming that $h_0$ is sparse. Therefore, the minimax approach offers some form of robustness to nuisance parameters, reminiscent of the robustness of Neyman orthogonal methods (see e.g. [Chernozhukov et al., 2018]).

F.3 Solving the $\ell_1$-Relaxation Optimization Problem via First-Order Methods

The estimators presented in Corollary 13 require solving optimization problems of the form:
$$\min_{\theta:\|\theta\|_1\le B}\ \max_{\beta:\|\beta\|\le U}\ \big\langle\mathbb E_n[(y-\langle\theta,x\rangle)z],\beta\big\rangle + \mu\|\theta\|_1 \qquad (12)$$
for some $B,\mu$, and for $\|\cdot\|$ either the $\ell_1$ or the $\ell_2$ norm (in the constrained estimator $\mu = 0$, while in the regularized one $B = \infty$, though in practice we can set $B$ to some large value for stability of the optimization process). Observe that, after maximizing over $\beta$ and dividing through by $U$, the problem simplifies to:
$$\min_{\theta:\|\theta\|_1\le B}\ \big\|\mathbb E_n[(y-\langle\theta,x\rangle)z]\big\|_* + \frac{\mu}{U}\|\theta\|_1$$
where $\|\cdot\|_*$ is the dual norm of $\|\cdot\|$ (i.e. the $\ell_\infty$ norm in the case where $\|\cdot\|$ is the $\ell_1$ norm, and the $\ell_2$ norm in the case where $\|\cdot\|$ is the $\ell_2$ norm).
One approach to solving these optimization problems is projected sub-gradient descent:
$$\beta_t = \arg\max_{\beta : \|\beta\| \leq U} \langle \mathbb{E}_n[(y - \langle \theta_t, x\rangle) z], \beta\rangle$$
$$\theta_{t+1} = \Pi\Big(\theta_t + \eta\big(\mathbb{E}_n[x z^\top] \beta_t - \mu U\, \mathrm{sign}(\theta_t)\big)\Big), \qquad \Pi(\theta) = \arg\min_{\theta' : \|\theta'\|_1 \leq B} \|\theta - \theta'\|_2$$
Moreover, for both the $\ell_1$ and the $\ell_2$ norm constraint on $\beta$, the solution $\beta_t$ can be easily found in closed form. After $O(1/\epsilon^2)$ iterations and for $\eta = \Theta(\epsilon)$, the average $\bar{\theta} = \frac{1}{T}\sum_{t=1}^T \theta_t$ is an $\epsilon$-approximate solution to the optimization problem.

Improved Iteration Complexity with Optimistic FTRL Dynamics
The sub-gradient descent approach has two caveats: i) the rate of $1/\epsilon^2$ is considerably slow and would require a large number of iterations to converge to a reasonable solution, ii) the gradient does not admit an unbiased stochastic version (due to the non-linearity introduced by the $\arg\max$ operation that defines $\beta_t$), and therefore the algorithm does not admit a stochastic variant, which is useful for large samples. (Recall that the best response has a closed form: for the $\ell_1$-norm constraint, $\beta_t = U e_{i_t} \mathrm{sign}(\mathbb{E}_n[(y - \langle\theta_t, x\rangle) z_{i_t}])$ with $i_t = \arg\max_i |\mathbb{E}_n[(y - \langle\theta_t, x\rangle) z_i]|$; for the $\ell_2$-norm constraint, $\beta_t = U\, \mathbb{E}_n[(y - \langle\theta_t, x\rangle) z] / \|\mathbb{E}_n[(y - \langle\theta_t, x\rangle) z]\|_2$.) We can improve the error rate by invoking algorithms that address non-smooth optimization problems that take the form of a min-max objective of some underlying smooth loss.

First, we show that we can remove the non-smoothness of the $\ell_1$-regularization by lifting the parameter $\theta$ to a $2p$-dimensional positive orthant. Consider two vectors $\rho^+, \rho^- \geq 0$ and set $\theta = \rho^+ - \rho^-$, with $\rho = (\rho^+; \rho^-)$ and $\|\rho\|_1 \leq B$. Observe that for any feasible $\theta$, the solution $\rho^+_i = \theta_i 1\{\theta_i > 0\}$ and $\rho^-_i = -\theta_i 1\{\theta_i \leq 0\}$ is still feasible and achieves the same objective. Moreover, any solution $\rho$ maps to a feasible solution $\theta$ (since $\|\theta\|_1 = \|\rho^+ - \rho^-\|_1 \leq \|\rho^+\|_1 + \|\rho^-\|_1 \leq B$), and thus the two optimization programs have the same optimal solutions. Then, if we define $v = (x; -x)$, the optimization problem can be re-stated as:
$$\min_{\rho \geq 0,\, \|\rho\|_1 \leq B}\ \max_{\beta : \|\beta\| \leq U}\ \ell(\rho, \beta), \qquad \text{where } \ \ell(\rho, \beta) := \beta^\top \mathbb{E}_n[z y] - \beta^\top \mathbb{E}_n[z v^\top] \rho + \mu \sum_{i=1}^{2p} \rho_i$$
This falls exactly into the class of problems analyzed in a line of work on bi-linear minimax optimization, starting from the seminal work of Nesterov [2005]. For instance, we can view the problem as a two-player bi-linear zero-sum game and invoke the Optimistic Follow-the-Regularized-Leader (OFTRL) or Optimistic Mirror Descent (OMD) paradigm of Rakhlin and Sridharan [2013], Syrgkanis et al. [2015], to find an $\epsilon$-approximate solution for $\rho$ in $O(1/\epsilon)$ iterations. The algorithm repeats for $T$ iterations the updates:
$$\rho_{t+1} = \arg\min_{\rho \geq 0,\, \|\rho\|_1 \leq B} \sum_{\tau \leq t} \ell(\rho, \beta_\tau) + \ell(\rho, \beta_t) + \frac{1}{\eta} R_{\min}(\rho)$$
$$\beta_{t+1} = \arg\max_{\beta : \|\beta\| \leq U} \sum_{\tau \leq t} \ell(\rho_\tau, \beta) + \ell(\rho_t, \beta) - \frac{1}{\eta} R_{\max}(\beta)$$
and returns $\bar{\rho} = \frac{1}{T}\sum_{t=1}^T \rho_t$, $\bar{\beta} = \frac{1}{T}\sum_{t=1}^T \beta_t$. We note that if we did not double count the last period's loss and we used $R_{\min}(x) = R_{\max}(x) = \|x\|_2^2$, then this would correspond to running standard (non-optimistic) FTRL for $\rho, \beta$. Finally, if we want to compare with $s$-sparse solutions and want to enhance the sparsity of the returned solution, we can always truncate to zero, at the end of training, any coordinate of $\bar{\theta} = \bar{\rho}^+ - \bar{\rho}^-$ that is below a sufficiently small threshold. This introduces only a lower-order approximation error in our projected MSE theorem, since by this shrinkage procedure the error with respect to a sparse solution $\theta$ can only increase on the non-zero entries of $\theta$, and only by at most the truncation threshold on each such entry.
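To make the lifted bilinear game and these optimistic dynamics concrete, here is a minimal numpy sketch using an entropic (multiplicative) step for ρ and a projected gradient step for β; the step size, initialization and normalizations are simplifying assumptions, and the precise updates and step-size choices analyzed in the paper are the ones given in Propositions 13 and 14 below.

```python
import numpy as np

def lift(X):
    """v = (x; -x): lift treatments so that theta = rho_plus - rho_minus with rho >= 0."""
    return np.hstack([X, -X])

def optimistic_sparse_iv(X, Z, y, B, U, mu, eta, T):
    n, p = X.shape
    V = lift(X)                               # shape (n, 2p)
    c = Z.T @ y / n                           # E_n[z y]
    M = Z.T @ V / n                           # E_n[z v^T], shape (q, 2p)
    rho = np.full(2 * p, 1.0 / np.e)          # lifted non-negative parameters
    beta = np.zeros(Z.shape[1])
    g_rho_prev = np.zeros(2 * p)
    g_beta_prev = np.zeros(Z.shape[1])
    rho_avg = np.zeros(2 * p)
    for _ in range(T):
        g_rho = -M.T @ beta + mu              # d ell / d rho (bilinear term + l1 penalty)
        g_beta = c - M @ rho                  # d ell / d beta
        # optimistic steps: move along the extrapolated gradient 2*g_t - g_{t-1}
        rho = rho * np.exp(-eta * (2 * g_rho - g_rho_prev))   # entropic/multiplicative descent step
        rho *= min(1.0, B / rho.sum())                        # keep ||rho||_1 <= B
        beta = beta + eta * (2 * g_beta - g_beta_prev)        # gradient ascent step for the adversary
        nrm = np.linalg.norm(beta)
        if nrm > U:
            beta *= U / nrm                                   # keep ||beta||_2 <= U
        g_rho_prev, g_beta_prev = g_rho, g_beta
        rho_avg += rho / T
    return rho_avg[:p] - rho_avg[p:]          # map back: theta = rho_plus - rho_minus
```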
Moreover, the parameters ¯ ρ, ¯ β can be thought as primal and dual solutions and we can use the duality gap as a certificate forconvergence of the algorithm. tol = max β : (cid:107) β (cid:107)≤ U (cid:96) (¯ ρ, β ) − min ρ : (cid:107) ρ (cid:107) ≤ B (cid:96) ( ρ, ¯ β ) This approach addresses both problems with projected sub-gradient descent: i) as we will showbelow, the iteration complexity is O (cid:0) ( B + U ) log( B p ) /(cid:15) (cid:1) , instead of /(cid:15) , ii) the per-iterationlosses (cid:96) ( ρ, β t ) , (cid:96) ( ρ t , β ) in the FTRL formulation can be replaced with unbiased estimates, whilestill maintaining theoretical guarantees and therefore the algorithm admits a stochastic analoguewhich makes it scalable to very large data sets. To instantiate this paradigm we need to find appropriate regularizers for the strategy spaces of thetwo players. Below we outline two concrete such algorithms for the two cases of the norm of β andprovide worst-case convergence rates. (cid:96) -ball adversary For the case when (cid:107) β (cid:107) = (cid:107) β (cid:107) , we can further simplify the problem byshowing that the inner optimization can be performed over a p -dimensional simplex. If we let u = ( z ; − z ) , then we can re-write the optimization problem as: (cid:96) ( ρ, w ) := w (cid:62) E n [ uy ] − w (cid:62) E n [ uv (cid:62) ] ρ + µU p (cid:88) i =1 ρ i min ρ ≥ (cid:107) ρ (cid:107) ≤ B max w : (cid:107) w (cid:107) =1 (cid:96) ( ρ, w ) Since both player strategies ρ , w are constrained to be in an (cid:96) -ball, we can get iteration complexitythat only grows logarithmically with the dimension p , if for each player we use OFTRL with anentropic regularizer: i.e. R min ( x ) = R max ( x ) = (cid:80) pi =1 x i log( x i ) , denotes the negative entropy. Proposition 13.
Consider the algorithm that for t = 1 , . . . , T , sets: ˜ ρ i,t +1 = ˜ ρ i,t e − ηB ( − E n [ v i u (cid:62) w t ]+ µU ) + ηB ( − E n [ v i u (cid:62) w t − ]+ µU ) ρ t +1 = ˜ ρ t +1 min (cid:26) , B (cid:107) ˜ ρ t +1 (cid:107) (cid:27) ˜ w i,t +1 = w i,t e η E n [( y − ρ (cid:62) t v ) u i ] − η E n [( y − ρ (cid:62) t − v ) u i ] w t +1 = ˜ w t +1 (cid:107) ˜ w t +1 (cid:107) with ˜ ρ i, − = ˜ ρ i, = 1 /e and ˜ w i, − = ˜ w i, = 1 / (2 p ) and returns ¯ ρ = T (cid:80) Tt =1 ρ t . Then for η = (cid:107) E n [ vu (cid:62) ] (cid:107) ∞ , after T = 16 (cid:107) E n [ vu (cid:62) ] (cid:107) ∞ B log( B ∨
1) + ( B + 1) log(2 p ) (cid:15) iterations, the parameter ¯ θ = ¯ ρ + − ¯ ρ − is an (cid:15) -approximate solution to the minimax problem inEquation (12) . Moreover, every update step requires computation time O (min { n p, p } ) . Using techniques forsparse gradient updates, one could also potentially improve the iteration complexity to not dependlinearly on the dimension p (see e.g. Langford et al. [2009], Duchi et al. [2008], Duchi and Singer[2009], McMahan [2011]), but we defer such approaches to future work. In particular, ¯ ρ and ¯ β are an (cid:15) -equilibrium of the zero-sum game. We note that the fast rate of /(cid:15) will deteriorate with the size of the mini-batch, but a /(cid:15) rate is alwaysachievable and the step-size η should be appropriately tuned to account for the mini-batch sampling noise. For a matrix A , we denote with (cid:107) A (cid:107) ∞ = max i,j | A ij | If p ≥ n , then at every iteration we can calculate m ( j ) = v ( j ) · w t , for each sample v ( j ) ; which takes O ( n · p ) time; and then update each ˜ ρ i,t +1 based on the quantity E n [ v i u (cid:62) w t ] = n (cid:80) j v ( j ) i m ( j ) . If p < n ,then we can calculate Σ n = E n [ vu (cid:62) ] ahead of time and at each period calculate E n [ v i u (cid:62) w t ] = (Σ w t ) i ;which would require O ( p ) time. a) true vs. est. θ ( n = 600 ) (b) true vs. est. θ ( n = 1000 ) (c) dual variables w + − w − Figure 7: Estimates based on minimax estimator proposed in Proposition 13. The left figure de-picts the p = 2000 estimated coefficients compared to the true coefficients; we also include thecoefficients of i) a direct lasso regression to portray the importance of dealing with the endo-geneity problem (Lasso), ii) a two-stage lasso regression where we regress each x i on z and thenregress y on E [ x | z ] , all regressions performed with lasso where the first stage regularizationwas fixed to . and the final stage was chosen via cross-validation (2SLasso), iii) the algorithmin Proposition 13 (SparseIV), iv) a stochastic variant of the algorithm in Proposition 13 where amini-batch of samples is used at each iteration (StochasticSparseIV). The right pictures depictsthe coefficients of the dual test function learned by the adversary at equilibrium, which is of theform: f ( z ) = (cid:80) pi =1 ( w + i − w − i ) z i . The data generating process was: x, z, u ∈ R p , x = z + u , y = (cid:104) x + u, θ (cid:105) , z, u ∼ N (0 , I d ) , θ = (1 , − , , . . . , , p = 2000 . (cid:96) -ball adversary For the case when (cid:107) β (cid:107) = (cid:107) β (cid:107) , then we can use R max ( β ) = (cid:107) β (cid:107) , whichleads to an alternative update rule for the maximizing player. In this case, the update of the maxi-mizing player is essentially optimistic gradient descent, modulo the normalization so as to respectthe (cid:96) -norm constraint. Proposition 14.
Consider the algorithm that for t = 1 , . . . , T , sets: ˜ ρ i,t +1 = ˜ ρ i,t e − ηB ( − E n [ v i z (cid:62) β t ]+ µU ) + ηB ( − E n [ v i z (cid:62) β t − ]+ µU ) ρ t +1 = ˜ ρ t +1 min (cid:26) , B (cid:107) ˜ ρ t +1 (cid:107) (cid:27) ˜ β t +1 = ˜ β t +1 + 2 η E n [( y − ρ (cid:62) t v ) z ] − η E n [( y − ρ (cid:62) t − v ) z ] β t +1 = ˜ β t +1 min (cid:26) , U (cid:107) ˜ β t +1 (cid:107) (cid:27) with ˜ ρ i, − = ˜ ρ i, = 1 /e and ˜ β − = ˜ β = 0 . Then for η = (cid:107) E n [ zv (cid:62) (cid:107) , ∞ , after T = 16 (cid:107) E n [ zv (cid:62) ] (cid:107) , ∞ B log( B ∨
1) + B log(2 p ) + U / (cid:15) . iterations, the parameter ¯ θ = ¯ ρ + − ¯ ρ − is an (cid:15) -approximate solution to the minimax problem inEquation (12) . Observe that if v j ∈ [ − H, H ] then the quantity (cid:107) E n [ zv (cid:62) ] (cid:107) , ∞ (cid:107) can be upper bounded by H (cid:112) E n [ (cid:107) z (cid:107) ] , which under the assumptions of Corollary 3 is at most a constant. F.4 Bounds on Ill-Posedness Measure
Let $h(x) = \langle\theta, x\rangle$, $h_0(x) = \langle\theta_0, x\rangle$ and $\nu = \theta - \theta_0$. Then observe that we have:
$$\|T(h - h_0)\|_2^2 = \nu^\top \mathbb{E}\big[\mathbb{E}[x \mid z]\, \mathbb{E}[x \mid z]^\top\big] \nu = \nu^\top V \nu \geq \lambda_{\min}(V)\, \|\nu\|_2^2$$
where we remind that $V := \mathbb{E}\big[\mathbb{E}[x \mid z]\, \mathbb{E}[x \mid z]^\top\big]$ and $\lambda_{\min}(V)$ denotes the minimum eigenvalue of $V$. Moreover, if we let $\Sigma = \mathbb{E}[x x^\top]$ then:
$$\|h - h_0\|_2^2 = \nu^\top \mathbb{E}[x x^\top] \nu \leq \lambda_{\max}(\Sigma)\, \|\nu\|_2^2$$
Thus we see that the measure of ill-posedness can be upper bounded as:
$$\tau \leq \sqrt{\frac{\lambda_{\max}(\Sigma)}{\lambda_{\min}(V)}}$$
(For a matrix $A$, we denote $\|A\|_{2,\infty} = \sqrt{\sum_i \max_j A_{ij}^2}$.) Hence a bound on the ill-posedness measure allows us to obtain RMSE guarantees for $\hat{h}$, and not just projected RMSE guarantees, at the cost of an extra multiplicative factor of $\tau$. Moreover, we note that in both our hard sparsity and $\ell_1$-relaxed estimators we have further constraints on the vector $\nu$, and thus we only require the minimum and maximum eigenvalues to be bounded subject to these constraints. For instance, in the case of hard sparsity, we know that $\nu$ is an $s$-sparse vector. Thus it suffices to require the minimum eigenvalue of $V$ and the maximum eigenvalue of $\Sigma$ to be bounded only for such $s$-sparse vectors (i.e. the bounds should hold for all $s \times s$ square sub-matrices of $\Sigma$ and $V$). Similarly, for the $\ell_1$-based estimators we know that the vector $\nu$ falls in a restricted cone, such that most of the $\ell_1$ norm of $\nu$ is concentrated on the $s$ coordinates of the true coefficient $\theta_0$. Thus we solely need the $\lambda_{\min}$ and $\lambda_{\max}$ constraints to be valid in this restricted cone of vectors.

G Application: Shape Constrained Functions
In this section, we consider the case when x ∈ [0 , and we make shape constraints on h . We lookat both monotonicity/total variation bound constraints and convexity constraints. G.1 Monotone functions and functions with small total variation
Consider the case when $h_0$ is a function with range in $[0, 1]$ and of bounded total variation, $BV(h_0) \leq 1$. We let $\mathcal{H} := BV(1)$ denote the latter class of functions. Moreover, we assume that the operator $T$ satisfies that $Th$ is a monotone non-decreasing (resp. non-increasing) function of $z$ for any monotone non-decreasing (resp. non-increasing) function $h$ of $x$. Total variation function classes in linear inverse problems with a known linear operator have also been recently analyzed by del Álamo and Munk [2019], where a minimax loss based estimator was also considered, similar in spirit to our general framework.

Observe that any function $h$ with range in $[0, 1]$ and total variation at most $1$ can be written as the difference of two non-decreasing functions $h^+, h^-$ with ranges in $[0, 1]$, i.e. $h = h^+ - h^-$. Thus our assumption on $T$ implies that if $h \in BV(1)$, then $Th = Th^+ - Th^- = f^+ - f^-$, where $f^+$ and $f^-$ are monotone non-decreasing functions with range in $[0, 1]$. Thus $Th \in BV(1)$ and $T(h - h_0) \in BV(2)$. Hence, in order to apply our main theorems, it suffices to take $\mathcal{F} = BV(2)$, i.e. the class of functions that can be expressed as the difference of two monotone non-decreasing functions with range in $[0, 1]$. Alternatively, we could also define the norm of a function in the classes $\mathcal{F}$ and $\mathcal{H}$ as its total variation, which would enable the regularized estimator to adapt to the total variation of the true hypothesis. For simplicity, we assume a known upper bound.

Furthermore, we note by standard results in statistical learning theory (see e.g. exercise 18, p. 153 of Vaart and Wellner [1996] or exercise 3.6.7 of Gine and Nickl [2015]) that the class of monotone functions with range in $[0, 1]$ has metric entropy of the order of $O(1/\epsilon)$. Thus the same holds for the class $BV(2)$, leading to a critical radius of $\delta_n = O(n^{-1/3})$, by invoking Corollary 5. Thus by applying our Theorem 1, we get that the corresponding estimators presented in these sections, when $\mathcal{H} = BV(1)$ and $\mathcal{F} = BV(2)$ (and no norm constraints, which can be emulated by setting $B = U = \infty$), satisfy w.p. $1 - \zeta$:
$$\|T(\hat{h} - h_0)\|_2 = O\left(n^{-1/3} + \sqrt{\frac{\log(1/\zeta)}{n}}\right)$$
The latter rate matches known lower bounds on the achievable RMSE for monotone functions even in the case of exogenous regression problems [Chatterjee et al., 2015].
Efficiently solving the optimization problem
We can solve the empirical optimization problemby using piece-wise constant monotone functions (or piece-wise linear), i.e. when running the es-timator on n samples, we can describe the function h via a n -dimensional vector θ = ( θ + ; θ − ) , Our results easily extend to arbitrary intervals x ∈ [ a, b ] and ranges [ − H, H ] , though we restrict to [0 , for simplicity of exposition. ≥ θ +1 ≥ . . . ≥ θ + n ≥ and ≥ θ − ≥ . . . ≥ θ − n ≥ . Let Θ describe the set of θ that satisfy these constraints. Similarly, we can describe f via a vector w = ( w + ; w − ) , such that ≥ w +1 ≥ . . . ≥ w + n ≥ and ≥ w − ≥ . . . ≥ w − n ≥ . Let W describe the set of w that satisfythese constraints.Then for every sample i , if we let q x ( i ) be the rank of sample i (i.e. sample i has the q x ( i ) highest x ),when we order all samples based on x , we can set h ( x i ) = θ + q x ( i ) − θ − q x ( i ) . Similarly, if we let q z ( i ) be the rank of sample i , when we order all samples based on z , we can set f ( z i ) = w + q z ( i ) − w − q z ( i ) .For simplicity of exposition and w.l.o.g. we will assume that samples are ordered in terms of x , i.e. q x ( i ) = i . Thus we can simplify the optimization problem in Theorem 1 as: min θ ∈ Θ max w ∈ W (cid:88) i ( y i − ( θ + i − θ − i ))( w + q z ( i ) − w − q z ( i ) ) − λ n (cid:88) i =1 ( w + i − w − i ) where the conclusions of the theorem hold if λ ≥ . Since the loss: (cid:96) ( θ, w ) = (cid:88) i ( y i − ( θ + i − θ − i ))( w + q z ( i ) − w − q z ( i ) ) − λ n (cid:88) i =1 ( w + i − w − i ) is convex in θ and concave in w and the spaces Θ , W are convex sets, we can solve this problem byrunning simultaneous projected gradient descent for θ and w separately and returning the averagesolutions, i.e.: for t = 1 , . . . , T : θ t = Π Θ ( θ t − − η ∇ θ (cid:96) ( θ t − , w t − )) w t = Π W ( w t − + η ∇ w (cid:96) ( θ t − , w t − )) and return ¯ θ = T (cid:80) Tt =1 θ t . After O ( n/(cid:15) ) iterations this would return an (cid:15) -approximate solution tothe minimax problem. Each iteration step would require running a projection on the spaces Θ , W .If we let ˜ θ ∈ R n , then we need to find a solution to the problem: min θ ∈ Θ n (cid:88) i (˜ θ + i − θ + i ) + (˜ θ − i − θ − i ) Since the objective and the constraints decompose for the two parts of the vector, this correspondsto running two isotonic regressions for θ + i and θ − i with observations ˜ θ + i and ˜ θ − i . Thus each problemcan be solved via the well-known Pool-Adjacent-Violator (PAV) algorithm, which requires O ( n ) computation time. Similarly, we can deal with the projection of w . Thus each iteration of the simul-taneous projected gradient descent algorithm requires four calls to the PAV algorithm. If we furtherwant to impose Lipschitzness constraints on our estimates, then we can instead use the Lipschitz-PAV algorithm (see Yeganova and Wilbur [2009], Kakade et al. [2011]) to project onto spaces Θ and W that are augmented with lipschitzness constraints, e.g. ≤ θ + i − θ + j ≤ L ( x i − x j ) for all i ≤ j .Albeit the LPAV algorithm requires computation of O ( n ) . Generality of computational approach
We note that the above approach of solving the endoge-nous regression problem with shape constraints via our minimax estimator essentially applies to anytype of shape constraints and reduces the minimax problem to a standard square loss problem subjectto the same shape constraints (assuming that both H and F satisfy the same shape constraints; i.e.that these constraints are invariant to the application of the operator T ). Thus to solve the minimaxproblem we simply require an oracle for the square loss problem. In the the setting described in thissection we used the PAV and LPAV algorithm as such oracles. In the next section we will be usinga quadratic optimization subject to linear constraints solver as our oracle. Ill-posedness
We note that the recent work of Chetverikov and Wilhelm [2017], shows that when x, z ∈ [0 , and the distributions of x and z have full support and lower-bounded density, then forany function h , that is α -approximately monotone and continuously differentiable, then (cid:107) T h (cid:107) ≥ τ (cid:107) h (cid:107) ,t , where (cid:107) h (cid:107) ,t = ´ x x h ( x ) dx , for some < x < x < . The result requires several more If we want to enforce a monotone non-decreasing h , then we can set θ − = 0 and similarly, for a monotonenon-increasing algorithm θ + = 0 . a) Isotonic Regression y ∼ x (b) Isotonic IV (c) Lipschitz Isotonic IV Figure 8: Estimated functions based on our minimax estimator under monotonicity constraints. Thefirst figure depicts a direct isotonic regression that ignores endogeneity. The second figure depics ourisotonic IV regression, without any lipschitz constraints and the final figure depicts our isotonic IVregression with Lipschitzness constraints. The data generating process was: h ( x ) = x { x > } , x = . z + . u + δ and y = h ( x ) + u + (cid:15) and z, u ∼ N (0 , and (cid:15), δ ∼ N (0 , . . ( n = 1000 )regularity conditions on the operator T and the constant τ depends on constants in these regularityconditions (e.g. the lower bound on the density, the quantities x and − x , the constant α , etc).Thus under these further regularity conditions, we have that for any h ∗ that is α -approximatelyconstant and for h being a monotone function (cid:107) T ( h − h ∗ ) (cid:107) ≥ τ (cid:107) h (cid:107) ,t . Thus our bound on (cid:107) T ( h − h ∗ ) (cid:107) also implies a bound on (cid:107) h − h ∗ (cid:107) ,t . This claim, roughly recovers the main estimationrate result of Chetverikov and Wilhelm [2017]. G.2 Convex functions
In this section we consider the case when h is assumed to be a convex function in [0 , , Γ -Lipschitzand with range in [0 , . Moreover, we asusme that the linear operator T satisfies that for any convex Γ -Lipschitz function h , T h is also convex and Γ -Lipschitz. Observe that if T is a symmetric density,i.e. T h = h (cid:63) ρ (where (cid:63) denotes the convolution operator), for some conditional density function ρ , then we have ( T h ) (cid:48)(cid:48) ( z ) = ( h (cid:48)(cid:48) ) (cid:63) ρ ≥ , since h (cid:48)(cid:48) ( x ) ≥ and ρ ( x ) ≥ for all x . Thus any suchsymmetric density satisfies our constraints.The work of Bronshtein [1976] shows that the metric entropy this function class, even in the d -dimensional hypercube, with respect to the (cid:96) ∞ norm, and therefore also with respect to the (cid:96) ,n norm, is of the order of (cid:15) − d/ (see also the recent work of Guntuboyina and Sen [2012]). Thus weget that by invoking Corollary 5, for d = 1 , we can choose δ n in Theorem 1 in the order of O ( n / ) ,leading to the corollary that the estimator in Theorem 1, for the case when H is the space of convex, Γ -Lipscthiz functions with range in [0 , and F is the space of differences of two convex functions,each Γ -Lipschitz and with range in [0 , , then w.p. − ζ : (cid:107) T (ˆ h − h ) (cid:107) = O (cid:32) n / + (cid:114) log(1 /ζ ) n (cid:33) Solving the optimization problem
Moreover, we can address the optimization problem in man-ner similar to the previous section. We can choose estimators that optimize over piece-wise linearfunctions and hence can be uniquely determined by their values on the n samples, i.e. we candescribe h by a n -dimensional vector θ , such that h ( x i ) = θ q x ( i ) (where q x ( i ) as defined in theprevious section). Similarly, we can descirbe f ∈ F via a n -dimensional vector w = ( w + ; w − ) ,such that f ( z i ) = w + q z ( i ) − w − q z ( i ) . Subsequently, we can apply the simultaneous projected gradi-ent descent approach, which reduces the minimax optimization problem to solving the projectionproblem. Observe that we can describe the constraints that describe the vectors θ and w as linearconstraints. Using the same idea as the one described in Example 13.4 of Wainwright [2019], we canexpress the convexity constraint as the existence of a subgradient, i.e. there must exist sub-gradients u, µ + , µ − ∈ R n such that for all i, j ∈ [ n ] : θ j ≥ θ i + (cid:104) u i , x q − x ( j ) − x q − x ( i ) (cid:105) w + j ≥ w + i + (cid:104) µ + i , z q − z ( j ) − z q − z ( i ) (cid:105) w − j ≥ w − i + (cid:104) µ − i , z q − z ( j ) − z q − z ( i ) (cid:105) a) Bounded TV (b) Bounded TV and -Lipschitz (c) Convex and -Lipschitz Figure 9: Estimated functions based on our minimax estimator for different sets of shape constraints.In the last figure we also depict the direct regression estimate subject to the same constraints, i.e.if we regressed y on x , ignoring endogeneity. The data generating process was: h ( x ) = | x | and x = . z + . u + δ and y = h ( x ) + u + (cid:15) and z, u ∼ N (0 , and (cid:15), δ ∼ N (0 , . . ( n = 1000 )This is a set of linear constraints of θ, w + , w − , u, µ + , µ − . Moreover, the lipschitz constraints corre-sponds to another set of linear constraints, for all i ∈ [ n ] : − Γ( x q − x ( i +1) − x q − x ( i ) ) ≤ θ i +1 − θ i ≤ Γ( x q − x ( i +1) − x q − x ( i ) ) and similarly for w + , w − . Thus projecting onto onto Θ or W , corresponds to a convex quadraticoptimization problem with n variables and O ( n ) linear constraints. Therefore, we can computesuch projections in polynomial time at every iteration of the simultaneous projected gradient descentalgorithm. In practice, one can achieve substantial speedup by subsampling a set of s (cid:28) n pointsand restricting the curve to a piece-wise linear function in between these points. This would reducethe number of variables and constraints to s and O ( s ) , correspondingly. H Neural Networks
In this section we describe how one can apply the theoretical findings from the previous sectionsto understand how to train neural networks that solve the conditional moment problem. We willconsider the case when our true function h can be represented (or well-approximated) by a deepneural network function of x , for some given domain specific network architecture, and we willrepresent it as h ( x ) = h θ ( x ) , where θ are the weights of the neural net. Moreover, we willassume that the linear operator T , satisfies that for any set of weights θ , we have that T h θ belongsto a set of functions that can be represented (or well-approximated) as another deep neural networkarchitecture, and we will denote these functions as f w ( z ) , where w are the weights of the neural net. Adversarial GMM Networks (AGMM)
Thus we can apply our general approach presented inTheorem 1 and consider the estimator: ˆ θ = arg min θ sup w E n [ ψ ( y i ; h θ ( x i )) f w ( z )] − λ (cid:32) (cid:107) f w (cid:107) F + Unδ (cid:88) i f w ( z i ) (cid:33) + µ (cid:107) h θ (cid:107) H (13)where λ, µ, U, δ are hyperparameters that need to satisfy the conditions of the theorem. In particular,if we know that the neural nets h θ , f w output functions in [0 , , then we can choose U = B = 1 , λ = δ , µ = 2 δ (4 L + 27) , where L is a bound on the lipschitzness of the operator T with respectto the two function space norms and δ is a bound on the critical radius of the function spaces F and ˆ G ,L . Then problem takes the form: ˆ θ = arg min θ sup w E n [ ψ ( y i ; h θ ( x i )) f w ( z )] − δ (cid:107) f w (cid:107) F − n (cid:88) i f w ( z i ) + c δ (cid:107) h θ (cid:107) H for some constant c > that depends on the lipschitzness of the operator T . Moreover, theoreticallywe can set the critical radius δ by invoking Corollary 5, and using existing results on the pseudo-dimension of the neural network architecture, for which there exist known bounds Anthony andBartlett [2009] that scale with the number of nodes and edges of the neural net. Moreover, one canalso use the recent work of Bartlett et al. [2017], Golowich et al. [2018], to provide size independent38ounds on the critical radius of these classes, that only depend on spectral properties of the learnedweight matrices of the neural nets.The work of Bennett et al. [2019] also proposed the use of second moment penalization of the testfunction, albeit from a different perspective. In particular, their approach stems from a reasoningbased on the optimally weighted GMM estimator. In this work we show that second moment pe-nalization arises also when one wants to achieve fast rates of convergence in terms of mean squarederror of the learned function. Moreover, the regularization presented in Bennett et al. [2019] isnot a simple second moment penalization, but the second moment of each sample is re-weightedbased on the moment evaluated at a preliminary estimate of θ , i.e. (cid:80) i f w ( z i ) ψ ( y i ; h ˜ θ ( x i )) . Thepreliminary estimate of ˜ θ is an extra burden and typically requires sample splitting and first stageestimation. Here we show that such re-weighting is not required if one simply wants fast projectedMSE rates. Moreover, this alternative penalty has the property that as the model h becomes veryaccurate, then ψ ( y i ; h ( x i )) ≈ and hence the penalty vanishes as the model becomes accurate. Thisis a big qualitative difference of the two penalties and it is not clear that the penalty that rescaleswith the moment enjoys the same theoretical guarantees in terms of projected MSE as the simplersecond moment penalty.In the remainder of the section, we will mostly focus on the practical aspect of training neuralnetworks, such as what would be appropriate architectures for the test function space, based on theintuition developed in the prior theoretical developments of the paper and what would be appropriateoptimization algorithms for solving the optimization problem. H.1 MMD-GMM: A Neural Network Architecture for Adversarial GMMMaximum Mean Discrepancy GMM Networks (MMD-GMM).
Our results for RKHS functionspaces, suggest that one class of test functions are functions that fall in an RKHS. Observe thatLemma 7 shows that, even when h is an arbitrary function represented by a neural network, as longas p ( x | · ) is a function that belongs to an RKHS H K , with some kernel K , then T h ∈ H K . Thuswe can choose test functions in H K .In many neural network applications, we might have that p ( x | · ) is not in an RKHS (or might havevery large RKHS norm), when we use the raw instrument z , as z might be very high-dimensionaland structured (e.g. an image). However, it might be natural to assume that there is some latentrepresentation g ( z ) of the instrument z , such that: p ( x | z ) = ρ ( x | g ( z )) and such that ρ ( x | · ) isin an RKHS.Thus we will generalize our RKHS approach to augment the adversary with the ability to simul-taneously learn the representation g w (represented as a neural network with weights w ), and alsochoose the best function in the RKHS of the implied kernel K w ( z, z (cid:48) ) := K ( g w ( z ) , g w ( z (cid:48) )) . Withthis generalization, we are still guaranteeing that T ( h − h ) ∈ F , whenever p ( x | · ) = ρ ( x | g ( · )) and ρ ( x | · ) is in H K .Using the variational characterization of the best function in the RKHS presented in Equation (10)we get that the optimization of the adversary can be rephrased as optimizing over test functions ofthe form f ( z ) = n (cid:80) ni =1 β i K w ( z i , z ) , leading to an objective for the adversary of the form: sup β,w n (cid:88) i,j (cid:0) ψ ( y i ; h θ ( x i )) K w ( z i , z j ) β j − δ β i K w ( z i , z j ) β j (cid:1) − n (cid:88) i (cid:88) j β j n K w ( z i , z j ) which can be written as an average over triplets of samples: n (cid:88) i,j,k (cid:0) ψ ( y i ; h θ ( x i )) K w ( z i , z j ) β j − β i (cid:0) δ K w ( z i , z j ) + K w ( z i , z k ) K w ( z k , z j ) (cid:1) β j (cid:1) Kernels applied to learned representations have been applied in the context of distribution learning(see e.g. the work on MMD-GANs Li et al. [2017], Binkowski et al. [2018]) and distribution testing(see the recent work of Liu et al. [2020]). 39igure 10: MMD-GMM architecture of adversary’s test function.
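To make the adversary concrete, here is a minimal PyTorch sketch of a test function that applies an RBF kernel to a learned representation g_w(z), together with the regularized kernelized criterion; the network widths, normalization constants and function names are illustrative assumptions rather than the paper's implementation.

```python
import torch
import torch.nn as nn

class LearnedKernel(nn.Module):
    """RBF kernel on a learned representation: K_w(z, z') = exp(-gamma ||g_w(z) - g_w(z')||^2)."""
    def __init__(self, z_dim, rep_dim=10, gamma=0.1):
        super().__init__()
        self.g = nn.Sequential(nn.Linear(z_dim, 64), nn.ReLU(), nn.Linear(64, rep_dim))
        self.gamma = gamma

    def forward(self, z1, z2):
        r1, r2 = self.g(z1), self.g(z2)
        return torch.exp(-self.gamma * torch.cdist(r1, r2) ** 2)

def adversary_criterion(kernel, beta, psi, z, delta):
    """Kernelized adversary objective: moment term minus RKHS-norm and second-moment penalties.
    psi = y - h_theta(x) are the learner's residuals; the normalizations are illustrative."""
    n = z.shape[0]
    K = kernel(z, z)                      # n x n learned kernel matrix
    f_vals = K @ beta / n                 # f(z_i) = (1/n) sum_j beta_j K_w(z_i, z_j)
    moment = (psi * f_vals).mean()        # E_n[psi(y; h_theta(x)) f(z)]
    rkhs_pen = delta ** 2 * (beta @ (K @ beta)) / n ** 2
    second_moment_pen = (f_vals ** 2).mean()
    return moment - rkhs_pen - second_moment_pen
```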
Unregularized MMD-GMM.
When we omit the (cid:96) ,n regularization then the optimal solution for β can be found in closed form (see Proposition 9) and the MMD-GMM simplifies to: arg min θ sup w n (cid:88) i,j ψ ( y i ; h θ ( x i )) K w ( z i , z j ) ψ ( y j ; h θ ( x j )) + cδ (cid:107) h θ (cid:107) H (14)This version (without fixed kernel parameters w ) was also independently analyzed from the per-spective of testing by Muandet et al. [2020]. However, the (cid:96) ,n penalty is crucial for obtaining fastrates (e.g. rates that adapt to the eigendecay in the case of RKHS spaces). On the other hand, theunregularized MMD-GMM admits a much easier implementation as we do not need to deal withthe n parameters β and in the case where we use fixed kernel parameters w we don’t even needadversarial training. Kernel Approximation
Moreover, as we saw in the RKHS section, it can be beneficial froma computational perspective to approximate the kernel function by sampling a set of trainingpoints (either at random or more cleverly based on either leverage scores or k-means cluster-ing) and restrict the space of functions to be supported only on this subset of the points, i.e. f ( z ) = s (cid:80) si =1 β i K ( g w ( z ∗ i ) , g w ( z )) , where z ∗ i is a set of representative samples and approximatingthe RKHS norm penalty with (cid:80) i,j ∈ S β i K w ( z ∗ i , z ∗ j ) β j . This has the benefit of only depending on an | S | -dimensional vector β , that the adversary needs to optimize over, as opposed to n -dimensional.Moreover, in practice, instead of constraining the centers to be of the form g w ( z ∗ i ) , we could insteadconsider arbitrary centers c i in the space of the output of g w and consider test functions of the form: f ( z ) = s (cid:80) si =1 β i K ( c i , g w ( z )) , where c i are parameters that could also be trained via gradientdescent. The latter essentially corresponds to adding what is known as an RBF layer at the end ofthe adversary neural net. This simplified architecture seems the most appealing from a practicalpoint of view (as it does not require any pre-selection of representative samples z ∗ i ) and is depictedin Figure 11. Multi-Kernel MMD-GMM.
The case of sparse linear representations portrays that it might beimportant to test many different classes of functions, each potentially trained on a separate part ofthe input space, since different instruments might be correlated with different treatments and manyof these treatments can be irrelevant. sup w ,...,w m ,t ∈ [ m ] E n [ ψ ( y i ; h ( x i )) f w t ( z S t )] − δ (cid:107) f w t (cid:107) F − n (cid:88) i f w t ( z S t ,i ) where S t are pre-defined subsets of the instruments and z S t corresponds to the sub-vector of instru-ments. Each of these functions f w t corresponding to a neural net.One can also combine the above approaches and set f w t ( z S t ) = n (cid:80) j β tj K w t ( z S t ,j , z S t ) , i.e.allow for the test function that takes as input the subset of the instruments S t to be in an RKHSof a learned kernel w t . This leads to taking a supremum over a set of kernels in the MMD-GMMobjective, where each kernel calculates similarity based on a subset of the input instruments, i.e.: sup β,w,t n (cid:88) i,j (cid:0) ψ ( y i ; h θ ( x i )) K tw ( z i , z j ) β tj − δ β ti K tw ( z i , z j ) β tj (cid:1) − n (cid:88) i (cid:88) j β tj n K tw ( z i , z j ) K tw ( z i , z j ) is shorthand notation for K w t ( z S t ,i , z S t ,j ) . The adverary’s objective can also bewritten as choosing a distribution p t over the t kernels, leading to an adversary objective of: n (cid:88) i,j,k (cid:32) ψ ( y i ; h θ ( x i )) (cid:88) t p t K tw ( z i , z j ) β j − β i β j (cid:88) t p t ( δ K tw ( z i , z j ) + K tw ( z i , z k ) K tw ( z k , z j )) (cid:33) We can again reduce the complexity of the optimization problem by restricting to a subset of samplesto represent the test functions.This combined method targets settings where different instruments are correlated with different la-tent “treatment factors”, treatment factors are high-dimensional but only a small subset of themhaving a large and additively separable effect on the outcome and the relationship between the treat-ment factor and the instrument is non-linear. Thus it tackles several sources of high-dimensionalityin the instrumental variable regression problem.
H.2 Adversarial Training: Simultaneous Optimistic First-Order Stochastic Optimization
The optimization problem that we are facing is similar to the optimization problem that is encountered in training Generative Adversarial Networks, i.e. we need to solve a non-convex, non-concave zero-sum game, where the strategies of the two players are the parameters of a neural net. This is obviously a computationally intractable problem from a worst-case perspective. However, typical instances are far from worst-case, and there has been a surge of recent work proposing iterative optimization algorithms inspired by convex-concave zero-sum game theory (see, e.g., the Optimistic Adam algorithm of Daskalakis et al. [2017]). For instance, one can expect that in practice most early layers of a neural net will change very slowly or will not have a phase transition in their non-linearities. In that case, the main parameters that matter are the parameters of the final layers of the two neural nets. However, the zero-sum game is convex-concave in these parameters. Hence, assuming that the features constructed in the final layers of the two neural nets change slowly, one should expect convex-concave zero-sum game optimization theory to apply. Such arguments have been recently exploited in the case of square loss minimization with deep over-parameterized neural networks (see e.g. Allen-Zhu et al. [2018], Du et al. [2018], Soltanolkotabi et al. [2019]). It is highly plausible, and an interesting question for future research, whether such guarantees extend to the minimax problem that we are facing here. For instance, recent work of Lei et al. [2019] provides an instance of a minimax objective, related to training Wasserstein GANs, where stochastic iterative optimization of neural nets provably converges to an optimal solution. In our implementation and experiments we used the optimistic Adam algorithm, as was also proposed in Bennett et al. [2019]. Other algorithms that could prove useful for our problem are the extra-gradient or stochastic extra-gradient algorithms (see e.g. Hsieh et al. [2019], Mishchenko et al. [2019]).
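As an illustration of the simultaneous optimistic first-order scheme, here is a plain-SGD sketch of optimistic gradient descent-ascent on the minimax objective; the experiments in the paper use an Optimistic Adam variant instead, and the function and parameter names here are illustrative assumptions.

```python
import torch

def optimistic_step(params, grads, prev_grads, lr, maximize=False):
    """One optimistic first-order step: move along the extrapolated gradient 2*g_t - g_{t-1}."""
    sign = 1.0 if maximize else -1.0
    with torch.no_grad():
        for p, g, g_prev in zip(params, grads, prev_grads):
            p.add_(sign * lr * (2 * g - g_prev))

def train_minimax(learner, adversary, criterion, loader, lr=1e-3, epochs=50):
    """Simultaneous optimistic descent (learner) / ascent (adversary) on a scalar minimax
    objective, e.g. E_n[psi(y; h_theta(x)) f_w(z)] minus the adversary penalties."""
    prev_l = [torch.zeros_like(p) for p in learner.parameters()]
    prev_a = [torch.zeros_like(p) for p in adversary.parameters()]
    for _ in range(epochs):
        for x, z, y in loader:
            obj = criterion(learner, adversary, x, z, y)       # differentiable scalar objective
            g_l = torch.autograd.grad(obj, list(learner.parameters()), retain_graph=True)
            g_a = torch.autograd.grad(obj, list(adversary.parameters()))
            optimistic_step(list(learner.parameters()), g_l, prev_l, lr, maximize=False)
            optimistic_step(list(adversary.parameters()), g_a, prev_a, lr, maximize=True)
            prev_l = [g.detach() for g in g_l]
            prev_a = [g.detach() for g in g_a]
    return learner, adversary
```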
I Random Forests via a Reduction Approach
Figure 12: Estimated functions based on our minimax estimator with neural networks, plotted against the relevant treatment, for p = 1 and p = 50 (n = 4000). Panels: (a, d) AGMM, (b, e) MMD-GMM, (c, f) Learned Kernel MMD-GMM. The hypothesis $h_\theta$ was a two-layer neural net. The MMD-GMM panels use test functions of the form $f_\beta(z) = \sum_{i=1}^s \beta_i K_\gamma(c_i, z)$, with $c_i$ a fixed grid of test points and $K$ the RBF kernel $K(z, z') = \exp(-\gamma\|z - z'\|^2)$; the learned-kernel panels use $f_{w,\beta}(z) = \sum_{i=1}^s \beta_i K_\gamma(c_i, g_w(z))$ with $g_w(z) = \mathrm{relu}(Az + b)$ and all parameters $A, b, \beta, c_i, \gamma$ trained. The networks were trained via the simultaneous Optimistic Adam algorithm. Data generating process: $h_0(x) = |x[0]|$, with $x$ a linear combination of the instrument $z$ and the confounder $u$ plus noise, $y = h_0(x) + u + \epsilon$, and $z \sim N(0, I_p)$.

Figure 13: Same setup as Figure 12, but with a very weak instrument (p = 2, n = 4000) and $h_0(x) = |x[0]| \cdot 1\{x[0] > 0\}$.

In this section we deal with the problem of training random forests that solve the non-parametric IV problem. In particular, we aim to develop a learning procedure that learns a hypothesis h, represented as an ensemble of regression trees, that solves the conditional moment restriction (1). Prior work on random forests for causal inference problems has primarily focused on learning forests that capture the heterogeneity of the effect of a treatment, but did not account for non-linear relationships between the treatment and the outcome variable. We will provide a theoretical foundation for the proposed method by taking a reductions approach to the minimax problem defined by our estimator.

For simplicity, throughout this section we will assume that the hypothesis spaces $\mathcal{H}$ and $\mathcal{F}$ are bounded and have bounded critical radius, and we will impose no further norm constraints. Thus the estimator proposed in Theorem 1 takes the simple form:
$$\hat{h} = \arg\min_{h \in \mathcal{H}} \sup_{f \in \mathcal{F}} \mathbb{E}_n[\psi(y_i; h(x_i)) f(z_i)] - \mathbb{E}_n[f(z_i)^2]$$
Since the statistical properties of random forests are an active area of investigation, we will solely focus on the optimization problem and leave the statistical properties (e.g. bounding the critical radius or bias of random forest methods) to future work. Our goal is to reduce the aforementioned optimization problem to classification and regression oracles over arbitrary hypothesis spaces. Subsequently, in practice, we can use random forests as oracles (a sketch of the resulting procedure, made precise in Theorem 4 below, follows).
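A minimal scikit-learn-based sketch of the oracle reduction, with a random forest regressor as the adversary's oracle and a shallow classification tree as the learner's oracle; the residual recursion, the tree hyper-parameters and the helper names are assumptions for illustration, not the exact algorithm or constants of Theorem 4 and the experiments.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeClassifier

def _h_value(tree, X):
    """Map a classification tree to a [-1, 1]-valued hypothesis via h(x) = 2 * P(label=1 | x) - 1."""
    proba = tree.predict_proba(X)
    classes = list(tree.classes_)
    p1 = proba[:, classes.index(1)] if 1 in classes else np.zeros(len(X))
    return 2 * p1 - 1

def rfiv_fit(X, Z, y, T=200, max_depth=2):
    """Oracle-based reduction for the minimax IV objective: at each round, the adversary
    regresses a residual of the running ensemble on the instruments, and the learner
    best-responds via a weighted binary classification tree on the treatments."""
    trees = []
    for _ in range(T):
        h_bar = np.mean([_h_value(t, X) for t in trees], axis=0) if trees else np.zeros(len(X))
        u = 0.5 * (y - h_bar)                                   # targets for the regression oracle
        f_t = RandomForestRegressor(n_estimators=50, max_depth=max_depth).fit(Z, u)
        f_vals = f_t.predict(Z)
        v = (f_vals > 0).astype(int)                            # labels: sign of the test function
        w = np.abs(f_vals) + 1e-12                              # weights: magnitude of the test function
        trees.append(DecisionTreeClassifier(max_depth=max_depth).fit(X, v, sample_weight=w))
    return lambda Xnew: np.mean([_h_value(t, Xnew) for t in trees], axis=0)
```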
Reducing the Optimization to Regression and Classification Oracles
To achieve this reductionwe will make the assumption that the space F defines a convex image set on the samples, i.e. the set A = { ( f ( z ) , . . . , f ( z n )) : f ∈ F} is a convex set. This can potentially be violated for tree basedmethods, but in practice will be alleviated when training a forest with a large set of trees.We will show that we can reduce the problem to a regression oracle over the function space F and aclassification oracle over the function space B . We will assume that we have a regression oracle thatsolves the square loss problem over F : for any set of labels and features z n , u n it returnsOracle F ( z n , u n ) = arg min f ∈F n n (cid:88) i =1 ( u i − f ( z i )) Moreover, we will assume that we have a classification oracle that solves the weighted binary classi-fication problem over B : for any set of sample weights w n , binary labels v n in { , } and features x n : Oracle H ( x n , v n , w n ) = arg max h ∈H n n (cid:88) i =1 w i Pr z i ∼ Bernoulli (cid:16) h ( xi )2 (cid:17) [ v i = z i ] Observe that the objective in the equation above is equivalent to a classification accuracy objective,assuming that h outputs values in [ − , and it corresponds to an expected accuracy objective if oneinterprets ( h ( x ) + 1) / as the probability of label conditional on x . Having access to these oracleswe can then show the following computational result: Theorem 4.
Consider the algorithm where for t = 1 , . . . , T : let u ti = 12 (cid:32) y i − t − t − (cid:88) τ =1 h τ ( x i ) (cid:33) , f t = Oracle F (cid:0) z n , u t n (cid:1) v ti = 1 { f t ( z i ) > } , w ti = | f t ( z i ) | h t = Oracle H (cid:0) x n , v t n , w t n (cid:1) Then the ensemble hypothesis: ¯ h = T (cid:80) Tt =1 h t , is a T )+1) T -approximate solution to the mini-max problem in Equation (5) . In practice, we will consider a random forest regression method as the oracle over F and a binarydecision tree classification method as the oracle for H .Moreover, we observe that if the hypothesis space H can be expressed as linear span of base hy-pothesis, i.e. H = { (cid:80) i w i b i : b i ∈ B } , then observe that because the best-response problem of thelearner is linear in the output of the hypothesis, it suffices to optimize only over the space of basehypothesis. Then the algorithm will return a linear span, supported on T base hypothesis that solvesthe minimax problem over the whole linear span. This improvement can also lead to statistical rateimprovements. For instance, if the base hypothesis B is a VC class with VC dimension d (e.g. a By setting λ = δ /U , µ = 2 λ (cid:0) L + 27 U/B (cid:1) using an (cid:96) ∞ norm in both function spaces and taking U, B → ∞ . Observe that we can also take L = 1 , since (cid:107) T h (cid:107) ∞ ≤ (cid:107) h (cid:107) ∞ for any T . T base hypothesis, which has VC dimension at most d T [Shalev-Shwartz and Ben-David, 2014]. Thus the entropy integral of H is of the order of (cid:113) T d log( n ) n . If wefurther have that the entropy integral of F is at most κ ( F ) , then we get a final rate of the order of: (cid:114) T d log( n ) n + κ ( F ) + log( T ) T Setting, T = O ( n / ) , one can achieve rates of the order of n − / + κ ( F ) .In practice, we will leverage the above observation and train a single binary classification tree ateach period of the algorithm, as our Oracle H . In the end the final prediction will be the predictionof the random forest represented by the ensemble of the T trees trained at each period. We refer tothis algorithm as Random Forest IV (RFIV). 44 Experimental Analysis
We consider the following data generating processes: for n x = 1 and n z ≥ y = h ( x [0]) + e + δ, δ ∼ N (0 , . x = γ z [0] + (1 − γ ) e + γ, z ∼ N (0 , I n z ) , e ∼ N (0 , , γ ∼ N (0 , . While, when n x = n z > , then we consider the following modified treatment equation: x = γ z + (1 − γ ) e + γ, We consider several ranges of the number of samples n , number of treatments n x , number of instru-ments n z and instrument strength γ and the following functional forms for h :1. abs: h ( x ) = | x |
2. 2dpoly: h ( x ) = − . x + . x
3. sigmoid: h ( x ) = e − x
4. sin: h ( x ) = sin( x )
5. frequentsin: h ( x ) = sin(3 x )
6. abssqrt: h ( x ) = (cid:112) | x |
7. step: h ( x ) = 1 { x < } + 2 . { x ≥ }
8. 3dpoly: h ( x ) = − . x + . x + x
9. linear: h ( x ) = x
10. randpw: piecewise linear function drawn at random
11. abspos: h ( x ) = x · 1 { x ≥ 0 }
12. sqrpos: h ( x ) = x² · 1 { x ≥ 0 }
13. band: h ( x ) = 1 {− . ≤ x ≤ . }
14. invband: h ( x ) = 1 − {− . ≤ x ≤ . }
15. steplinear: h ( x ) = 2 1 { x ≥ } − x
16. pwlinear: h ( x ) = ( x + 1) 1 { x ≤ − } + ( x −
1) 1 { x > = 1 } We consider as classic benchmarks 2SLS with a polynomial features of degree (2SLS) and aregularized version of 2SLS where ElasticNetCV is used in both stages (Reg2SLS). We have imple-mented several of the algorithms described in the paper:1. NystromRKHS: The method described in Appendix E, with the Nystrom approximationdescribed in Appendix E.3. We used Nystrom samples for the approximation.2. ConvexIV: The variant of the method described in Appendix G.2 with both lipscthiz andconvexity constraints (lipschitz bound of L = 2 ).3. TVIV: The variant of the method described in Appendix G.1 without a lipschitz constraintand only total variation constraint.4. LipTVIV: The variant of the method described in Appendix G.1 with lipscthiz constraintand total variation constraint (lipscthiz bound of L = 2 )5. RFIV: The method described in Appendix I, where a Random Forest Regressor is used asan oracle for the adversary (with trees, max depth , bootstrap sub-sampling enabled,and minimum leaf size of ) and Random Forest Classifier (with trees, max depth ,minimum leaf size of and bootstrap subsampling disabled) was used as an oracle for thelearner. The optimization was run for T = 200 iterations.6. SpLin: The method described in Appendix F.2 with the specific optimization method de-scribed in Proposition 13.7. StSpLin: A stochastic gradient descent variant of SpLin, where a mini-batch of samplesis used at every step to calculate the co-variance matrices.45. AGMM: The method described in Equation (13). A two-layer neural net with hiddenunits at each layer and leaky ReLU units was used for both the learner and the adversaryarchitecture. Optimization was done via the Optimistic Adam.9. KLayerFixed: The variant of the method described in Appendix H.1, where an RBF layeris attached at the end of the adversary’s architecture with fixed centers, i.e. testing functionsof the form: f ( z ) = (cid:80) n centers j =1 K ( c j , g w ( z )) β j , with n centers = 100 . The centers c j are placedin a dimensional feature space and the function g w is a two-layer neural net with hidden units in each layer.10. KLayerTrained: The same as KLayerFixed, but the centers of the RBF layer are trained.11. CentroidMMD: The version of the MMMD-GMM in Appendix H.1, where we select asubset of the data points to use as centers in the Kernel approximation, i.e. testing functionsof the form: f ( z ) = (cid:80) n centers j =1 K ( g w ( z ∗ j , g w ( z )) β j . z ∗ j are chosen as the centroids of aKMeans clustering and n centers = 100 . g w is the same architecture as in KLayerFixed.12. KLossMMD: The method described in Equation (14), where no (cid:96) ,n penalty is imposed onthe adversary test function. g w is the same architecture as in KLayerFixed.In addition to these regimes, we consider high-dimensional experiments with images, following thescenarios proposed in Bennett et al. [2019] where either the instrument z or treatment x or both areimages from the MNIST dataset consisting of grayscale images of × pixels. We compare theperformance of our approaches to that of Bennett et al. [2019], using their code. A full descriptionof the DGP is given in Appendix J.1. Results.
The main findings are: i) for a small number of treatments, the RKHS method with a Nystrom approximation (NystromRKHS) outperforms all methods (Figure 1), the only exception being functions that are highly non-smooth or non-continuous, in which case the methods based on shape constraints (ConvexIV, TVIV, LipTVIV) are better; ii) for a moderate number of instruments and treatments, Random Forest IV (RFIV) significantly outperforms most methods, with second best being neural networks (AGMM, KLayerTrained) (Figure 2); iii) the estimator for sparse linear hypotheses can handle an ultra-high dimensional regime (Figure 3); iv) neural network methods (AGMM, KLayerTrained) outperform the state of the art in prior work [Bennett et al., 2019] for tasks that involve images (Figure 4). The figures below present the average MSE across experiments ( experiments for Figure 4) and two times the standard error of the average MSE.

[Tables of MSE results; numeric entries not recovered. Rows are indexed by the functional forms of h listed above.]
Figure 14: n = 300, n_z = 1, n_x = 1. Methods compared: NystromRKHS, 2SLS, Reg2SLS, ConvexIV, TVIV, LipTVIV, RFIV.
Figure 15: n = 2000, n_z = 1, n_x = 1. Same methods as Figure 14.
Figure 16: n = 2000, n_z = 1, n_x = 1. Same methods as Figure 14.
Figure 17: n = 2000, n_z = 5, n_x = 1. Methods compared: NystromRKHS, 2SLS, Reg2SLS, RFIV.
Figure 18: n = 2000, n_z = 10, n_x = 1. Methods compared: NystromRKHS, 2SLS, Reg2SLS, RFIV.
Figure 19: n = 2000, n_z = 5, n_x = 5. Methods compared: NystromRKHS, 2SLS, Reg2SLS, RFIV.
Figure 20: n = 2000, n_z = 10, n_x = 10. Methods compared: NystromRKHS, 2SLS, Reg2SLS, RFIV.
Figure 21: n = 2000, n_z = 10, n_x = 10. Methods compared: AGMM, KLayerFixed, KLayerTrained, CentroidMMD, KLossMMD.
[Tables of MSE results; numeric entries not recovered.]
Figure 22: n = 400, n_z = n_x := p, h(x[0]) = x[0]; rows indexed by the dimension p.
Figure 23: MSE on the high-dimensional MNIST DGPs. Methods compared: DeepGMM (Bennett et al. [2019]), AGMM, KLayerTrained; rows: MNIST Z, MNIST X, MNIST XZ.
J.1 Experiments with Image Data
In this section, we describe the experimental setup for our experiments with high-dimensional data using the MNIST dataset. We replicate the data-generating process of Bennett et al. [2019]. We present a full description here for completeness.
The Data-Generating Process
We begin by describing a low-dimensional DGP which will define a mapping for x, or z, or both, to MNIST images. The data-generating process is:
$$y = g(x_{\text{low}}) + e + \delta, \qquad z_{\text{low}} \sim \mathrm{Uniform}([-3, 3]), \qquad x_{\text{low}} = z_{\text{low}} + e + \gamma,$$
$$e \sim N(0, 1), \qquad \delta, \gamma \sim N(0, 0.1).$$
Let $\pi(x) = \mathrm{round}(\min(\max(1.5\, x + 5, 0), 9))$; $\pi$ is a transformation function that maps inputs to an integer between 0 and 9. Let RandomImage(d) be a function which selects a random MNIST image from the class of images corresponding to digit d. The three high-dimensional scenarios are:
MNIST Z: x = x_low, z = RandomImage(π(z_low))
MNIST X: x = RandomImage(π(x_low)), z = z_low
MNIST XZ: x = RandomImage(π(x_low)), z = RandomImage(π(z_low)).
We use the function g(x) = |x| to compare with Bennett et al. [2019], but in general the other functional forms described above can also be used. Similar to Bennett et al. [2019], we normalize the data so that y has zero mean and unit standard deviation. We evaluate the performance of our AGMM and KLayerTrained estimators on these 3 data-generating processes with 20,000 train samples and 2,000 test samples and compare their performance to that achieved when we evaluate Bennett et al. [2019]'s code (performance is measured by the average mean squared error of the predictions on test data).

Setup We describe more details about our experimental setup for the MNIST experiments here. We run 10 Monte-Carlo runs of each experiment and report the average MSE and the standard deviation in the MSE achieved.
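For concreteness, a numpy sketch of this DGP; the interval endpoints, noise scales and the 1.5 coefficient in π follow the replicated DGP of Bennett et al. [2019] as described above and should be treated as assumptions, as should the helper names.

```python
import numpy as np

def mnist_iv_dgp(n, images_by_digit, scenario="MNIST_Z", rng=np.random.default_rng(0)):
    """Low-dimensional IV DGP mapped to MNIST images. images_by_digit[d] is an array of images of digit d."""
    e = rng.normal(0, 1, n)
    delta = rng.normal(0, 0.1, n)
    gamma = rng.normal(0, 0.1, n)
    z_low = rng.uniform(-3, 3, n)
    x_low = z_low + e + gamma
    y = np.abs(x_low) + e + delta                   # g(x) = |x| as in the comparison with DeepGMM

    def pi(t):                                      # maps a real number to a digit class in {0, ..., 9}
        return np.clip(np.round(1.5 * t + 5), 0, 9).astype(int)

    def random_image(digits):
        return np.stack([images_by_digit[d][rng.integers(len(images_by_digit[d]))] for d in digits])

    if scenario == "MNIST_Z":
        x, z = x_low, random_image(pi(z_low))
    elif scenario == "MNIST_X":
        x, z = random_image(pi(x_low)), z_low
    else:                                           # "MNIST_XZ"
        x, z = random_image(pi(x_low)), random_image(pi(z_low))
    return x, z, y
```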
Architectures
We use a 4-layer convolutional architecture in all cases where the input to the net-work is an image. This consists of 2 convolutional layers with a 3x3 kernel followed by two fullyconnected layers with 9216 and 512 hidden units respectively. A ReLU activation is applied aftereach layer. Along with that, a max-pooling operation is applied after the first two convolutionallayers and a dropout operation (with dropout probability 0.1) is applied before each fully connectedlayer. When the instrument or treatment is low-dimensional we use a 2 layer fully connected neu-ral network with 200 neurons in the hidden layer along with the dropout function as before. Allnetworks use ReLU as the activation function.
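For concreteness, a PyTorch sketch of the described architectures; the convolutional channel widths (32, 64) and the output dimension are not stated in the text and are assumptions (the 9216-dimensional flattened feature is consistent with two 3x3 convolutions and a 2x2 max-pool on 28x28 inputs).

```python
import torch.nn as nn

class ImageFeatureNet(nn.Module):
    """Convolutional net for image-valued instruments/treatments, following the description above."""
    def __init__(self, out_dim=1):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3), nn.ReLU(),
            nn.MaxPool2d(2),                   # 28x28 -> 26 -> 24 -> 12, so 64 * 12 * 12 = 9216 features
        )
        self.head = nn.Sequential(
            nn.Dropout(0.1), nn.Linear(9216, 512), nn.ReLU(),
            nn.Dropout(0.1), nn.Linear(512, out_dim),
        )

    def forward(self, x):
        return self.head(self.conv(x).flatten(start_dim=1))

class LowDimNet(nn.Module):
    """Two-layer fully connected net with 200 hidden units for low-dimensional inputs."""
    def __init__(self, in_dim, out_dim=1):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 200), nn.ReLU(), nn.Dropout(0.1), nn.Linear(200, out_dim))

    def forward(self, x):
        return self.net(x)
```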
Early Stopping
We utilize the early stopping procedure proposed in Bennett et al. [2019], which works as follows. In addition to the 20,000 training samples, 10,000 samples are used for preparing a set of candidate adversary functions prior to training. During training, at each epoch, the maximum error incurred by the learner against the candidates in this pre-computed list is recorded. The early stopping selects the model whose maximum error, as computed above, is the smallest.
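A minimal sketch of the early-stopping score, assuming the candidate adversary functions are callables evaluated on a held-out set; the function names and the exact notion of "error" (here, absolute empirical moment violation) are illustrative assumptions.

```python
import numpy as np

def early_stopping_score(learner, candidate_adversaries, x_val, z_val, y_val):
    """Maximum absolute moment violation of the current learner against the pre-computed
    candidate adversary functions; the epoch with the smallest score is kept."""
    psi = y_val - learner(x_val)                      # residuals psi(y; h(x)) = y - h(x)
    return max(abs(float(np.mean(psi * f(z_val)))) for f in candidate_adversaries)
```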
Hyper-Parameters
We use a batch size of 100 samples, and run for 200 epochs, where an epoch is defined as one full pass over the train set. We have as hyper-parameters the learning rates for the learner and adversary networks, the regularization terms for the weights of the learner and the adversary, and a regularization term on the norm of the output of the adversary network. For the MNIST X experiment, we saw the best results when the weight penalizations on both the learner and the adversary were set to very small values as compared to the other two experiments.

K Proofs from Section 3 and Appendix C
K.1 Preliminary Lemmas

Lemma 15.
Let f h , be any test function that satisfies: (cid:107) f h − T ( h ∗ − h ) (cid:107) ≤ (cid:15) and let Ψ( h, f ) := E [ ψ ( y ; h ( x )) f ( z )] . Then: (cid:107) f h (cid:107) (Ψ ( h, f h ) − Ψ( h ∗ , f h )) ≥ (cid:107) T ( h − h ∗ ) (cid:107) − (cid:15) n Proof.
Let f ∗ h = T ( h ∗ − h ) and observe that by the tower law of expectations: (cid:107) f h (cid:107) (Ψ( h, f h ) − Ψ( h ∗ , f h )) = E [( h ∗ − h )( x ) f h ( z )] (cid:107) f h (cid:107) = E [ f ∗ h ( z ) f h ( z )] (cid:107) f h (cid:107) However, observe that by the Cauchy-Schwarz inequality we have: E [ f ∗ h ( Z ) f h ( Z )] = E [ f h ( Z ) ] + E [ f h ( Z )( f ∗ h ( Z ) − f h ( Z ))] ≥ (cid:107) f h (cid:107) − | E [ f h ( Z )( f ∗ h ( Z ) − f h ( Z ))] |≥ (cid:107) f h (cid:107) − (cid:112) E [ f h ( Z ) ] (cid:113) E [( f ∗ h ( Z ) − f h ( Z )) ] ≥ (cid:107) f h (cid:107) − (cid:107) f h (cid:107) (cid:107) f ∗ h − f h (cid:107) ≥ (cid:107) f h (cid:107) − (cid:15) n (cid:107) f h (cid:107) Thus we have: (cid:107) f h (cid:107) (Ψ( h, f h ) − Ψ( h ∗ , f h )) ≥ (cid:107) f h (cid:107) − (cid:15) n Finally, by a triangle inequality, (cid:107) f h (cid:107) ≥ (cid:107) f ∗ h (cid:107) − (cid:107) f ∗ h − f h (cid:107) ≥ (cid:107) f ∗ h (cid:107) − (cid:15) n . Hence, we can conclude that: (cid:107) f h (cid:107) (Ψ( h, f h ) − Ψ( h ∗ , f h )) ≥ (cid:107) f ∗ h (cid:107) − (cid:15) n = (cid:107) T ( h − h ∗ ) (cid:107) − (cid:15) n K.2 Proof of Theorem 1
Proof.
For convenience let: Ψ( h, f ) := E [ ψ ( y ; h ( x )) f ( z )] = E [ T ( h − h )( z ) f ( z )] (by conditional moment restriction) Ψ n ( h, f ) := 1 n n (cid:88) i =1 ψ ( y i ; h ( x i )) f ( z i ) Moreover, for our choice of δ as described in the statement of the theorem, let: H B := (cid:8) h ∈ H : (cid:107) h (cid:107) H ≤ B (cid:9) F U := (cid:8) f ∈ F : (cid:107) f (cid:107) F ≤ U (cid:9) Moreover, let: Ψ λn ( h, f ) = Ψ n ( h, f ) − λ (cid:18) (cid:107) f (cid:107) F + Uδ (cid:107) f (cid:107) ,n (cid:19) Ψ λ ( h, f ) = Ψ( h, f ) − λ (cid:18) (cid:107) f (cid:107) F + Uδ (cid:107) f (cid:107) (cid:19) Thus our estimate can be written as: ˆ h := arg min h ∈H sup f ∈F Ψ λn ( h, f ) + µ (cid:107) h (cid:107) H elating empirical and population regularization. As a preliminary observation, we have thatby Theorem 14.1 of Wainwright [2019], w.p. − ζ : ∀ f ∈ F U : (cid:12)(cid:12) (cid:107) f (cid:107) n, − (cid:107) f (cid:107) (cid:12)(cid:12) ≤ (cid:107) f (cid:107) + δ for our choice of δ := δ n + c (cid:113) log( c /ζ ) n , where δ n upper bounds the critical radius of F U and c , c are universal constants. Moreover, for any f , with (cid:107) f (cid:107) F ≥ U , we can consider the function f √ U / (cid:107) f (cid:107) F , which also belongs to F U , since F is star-convex. Thus we can apply the abovelemma to this re-scaled function and multiply both sides by (cid:107) f (cid:107) F / (3 U ) , leading to: ∀ f ∈ F s.t. (cid:107) f (cid:107) F ≥ U : (cid:12)(cid:12) (cid:107) f (cid:107) n, − (cid:107) f (cid:107) (cid:12)(cid:12) ≤ (cid:107) f (cid:107) + δ (cid:107) f (cid:107) F U Thus overall, we have: ∀ f ∈ F : (cid:12)(cid:12) (cid:107) f (cid:107) n, − (cid:107) f (cid:107) (cid:12)(cid:12) ≤ (cid:107) f (cid:107) + δ max (cid:26) , (cid:107) f (cid:107) F U (cid:27) (15)Thus we have that w.p. − ζ : ∀ f ∈ F : (cid:107) f (cid:107) F + Uδ (cid:107) f (cid:107) ,n ≥ (cid:107) f (cid:107) F + Uδ (cid:18) (cid:107) f (cid:107) − δ max (cid:26) , (cid:107) f (cid:107) F U (cid:27)(cid:19) ≥ (cid:107) f (cid:107) F + Uδ (cid:107) f (cid:107) − max (cid:26) U, (cid:107) f (cid:107) F (cid:27) ≥ (cid:107) f (cid:107) F + Uδ (cid:107) f (cid:107) − U (16) Upper bounding centered empirical sup-loss.
We now argue that the centered empirical sup-loss: sup f ∈F (Ψ n (ˆ h, f ) − Ψ n ( h ∗ , f )) is small. By the definition of ˆ h : sup f ∈F Ψ λn (ˆ h, f ) ≤ sup f ∈F Ψ λn ( h ∗ , f ) + µ (cid:16) (cid:107) h ∗ (cid:107) H − (cid:107) ˆ h (cid:107) H (cid:17) (17)By Lemma 7 of Foster and Syrgkanis [2019], the fact that φ ( y ; h ∗ ( x )) f ( z ) is -Lipschitz with re-spect to f ( z ) (since y ∈ [ − , and (cid:107) h ∗ (cid:107) ∞ ∈ [ − , ) and by our choice of δ := δ n + c (cid:113) log( c /ζ ) n ,where δ n is an upper bound on the critical radius of F U , w.p. − ζ : ∀ f ∈ F U : | Ψ n ( h ∗ , f ) − Ψ( h ∗ , f ) | ≤ δ (cid:107) f (cid:107) + 36 δ Thus, if (cid:107) f (cid:107) F ≥ √ U , we can apply the latter inequality for the function f √ U / (cid:107) f (cid:107) F , which fallsin F U , and then multiply both sides by (cid:107) f (cid:107) F / √ U to get: ∀ f ∈ F : | Ψ n ( h ∗ , f ) − Ψ( h ∗ , f ) | ≤ δ (cid:107) f (cid:107) + 36 δ max (cid:26) , (cid:107) f (cid:107) F √ U (cid:27) (18)By Equations (16) and (18), we have that w.p. − ζ : sup f ∈F Ψ λn ( h ∗ , f ) = sup f ∈F (cid:18) Ψ n ( h ∗ , f ) − λ (cid:18) (cid:107) f (cid:107) F + Uδ (cid:107) f (cid:107) ,n (cid:19)(cid:19) ≤ sup f ∈F (cid:18) Ψ( h ∗ , f ) + 36 δ + 36 δ √ U (cid:107) f (cid:107) F + 36 δ (cid:107) f (cid:107) − λ (cid:18) (cid:107) f (cid:107) F + Uδ (cid:107) f (cid:107) ,n (cid:19)(cid:19) ≤ sup f ∈F (cid:18) Ψ( h ∗ , f ) + 36 δ + 36 δ √ U (cid:107) f (cid:107) F + 36 δ (cid:107) f (cid:107) − λ (cid:18) (cid:107) f (cid:107) F + Uδ (cid:107) f (cid:107) (cid:19) + λU (cid:19) ≤ sup f ∈F Ψ λ/ ( h ∗ , f ) + 36 δ + λU + sup f ∈F (cid:18) δ √ U (cid:107) f (cid:107) F − λ (cid:107) f (cid:107) F (cid:19) + sup f ∈F (cid:18) δ (cid:107) f (cid:107) − λ Uδ (cid:107) f (cid:107) (cid:19) (cid:107) · (cid:107) and any constants a, b > : sup f ∈F (cid:0) a (cid:107) f (cid:107) − b (cid:107) f (cid:107) (cid:1) ≤ a b Thus if we assume that λ ≥ δ /U , we have: sup f ∈F (cid:18) δ √ U (cid:107) f (cid:107) F − λ (cid:107) f (cid:107) F (cid:19) ≤ δ U λ ≤ δ sup f ∈F (cid:18) δ (cid:107) f (cid:107) − λ Uδ (cid:107) f (cid:107) (cid:19) ≤ δ λU ≤ δ Thus we have: sup f ∈F Ψ λn ( h ∗ , f ) ≤ sup f ∈F Ψ λ/ ( h ∗ , f ) + λU + O ( δ ) Moreover: sup f ∈F Ψ λn (ˆ h, f ) = sup f ∈F (cid:18) Ψ n (ˆ h, f ) − Ψ n ( h ∗ , f ) + Ψ n ( h ∗ , f ) − λ (cid:18) (cid:107) f (cid:107) F + Uδ (cid:107) f (cid:107) ,n (cid:19)(cid:19) ≥ sup f ∈F (cid:18) Ψ n (ˆ h, f ) − Ψ n ( h ∗ , f ) − λ (cid:18) (cid:107) f (cid:107) F + Uδ (cid:107) f (cid:107) ,n (cid:19)(cid:19) + inf f ∈F (cid:18) Ψ n ( h ∗ , f ) + λ (cid:18) (cid:107) f (cid:107) F + Uδ (cid:107) f (cid:107) ,n (cid:19)(cid:19) = sup f ∈F (cid:18) Ψ n (ˆ h, f ) − Ψ n ( h ∗ , f ) − λ (cid:18) (cid:107) f (cid:107) F + Uδ (cid:107) f (cid:107) ,n (cid:19)(cid:19) − sup f ∈F Ψ λn ( h ∗ , f ) Combining this with Equation (17) yields: sup f ∈F (cid:18) Ψ n (ˆ h, f ) − Ψ n ( h ∗ , f ) − λ (cid:18) (cid:107) f (cid:107) F + Uδ (cid:107) f (cid:107) ,n (cid:19)(cid:19) ≤ f ∈F Ψ λn ( h ∗ , f ) + µ (cid:16) (cid:107) h ∗ (cid:107) H − (cid:107) ˆ h (cid:107) H (cid:17) ≤ O ( δ ) + λU + 2 sup f ∈F Ψ λ/ ( h ∗ , f )+ µ (cid:16) (cid:107) h ∗ (cid:107) H − (cid:107) ˆ h (cid:107) H (cid:17) Lower bounding centered empirical sup-loss.
For any h , let f h := arg inf f ∈F L (cid:107) h − h ∗(cid:107) H (cid:107) f − T ( h ∗ − h ) (cid:107) . and observe that by our assumption, for any h ∈ H : (cid:107) f h − T ( h ∗ − h ) (cid:107) ≤ η n .Suppose that (cid:107) f ˆ h (cid:107) ≥ δ and let r = δ (cid:107) f ˆ h (cid:107) ∈ [0 , / . Then observe that since f ˆ h ∈ F L (cid:107) h − h ∗ (cid:107) H and F is star-convex, we also have that rf h ∈ F L (cid:107) h − h ∗ (cid:107) H . Thus we can lower bound the supremumby its evaluation at r f h : sup f ∈F (cid:18) Ψ n (ˆ h, f ) − Ψ n ( h ∗ , f ) − λ (cid:18) (cid:107) f (cid:107) F + Uδ (cid:107) f (cid:107) ,n (cid:19)(cid:19) ≥ r (Ψ n (ˆ h, f ˆ h ) − Ψ n ( h ∗ , f ˆ h )) − λr (cid:18) (cid:107) f ˆ h (cid:107) F + Uδ (cid:107) f ˆ h (cid:107) ,n (cid:19) Moreover, since δ n upper bounds the critical radius of F U , (cid:107) f ˆ h (cid:107) F ≤ L (cid:107) ˆ h − h ∗ (cid:107) H and by Equa-tion (15): r (cid:18) (cid:107) f ˆ h (cid:107) F + Uδ (cid:107) f ˆ h (cid:107) ,n (cid:19) ≤ (cid:107) f ˆ h (cid:107) F + Uδ r (cid:107) f ˆ h (cid:107) ,n ≤ (cid:107) f ˆ h (cid:107) F + Uδ r (cid:18) (cid:107) f ˆ h (cid:107) + δ + δ (cid:107) f ˆ h (cid:107) F U (cid:19) ≤ L (cid:107) h − h ∗ (cid:107) H + U U ≤ L (cid:107) h − h ∗ (cid:107) H + U sup f ∈F (cid:18) Ψ n (ˆ h, f ) − Ψ n ( h ∗ , f ) − λ (cid:18) (cid:107) f (cid:107) F + Uδ (cid:107) f (cid:107) ,n (cid:19)(cid:19) ≥ r (Ψ n (ˆ h, f ˆ h ) − Ψ n ( h ∗ , f ˆ h )) − λL (cid:107) h − h ∗ (cid:107) H − λU Observe that: Ψ n ( h, f h ) − Ψ n ( h ∗ , f h ) = 1 n n (cid:88) i =1 ( h ∗ ( x i ) − h ( x i )) f h ( h ∗ − h )( z i )Ψ( h, f h ) − Ψ( h ∗ , f h ) = E [( h ∗ ( x i ) − h ( x i )) f h ( z i )] By Lemma 7 of Foster and Syrgkanis [2019], and by our choice of δ := δ n + c (cid:113) log( c /ζ ) n , where δ n upper bounds the critical radius of G , we have that w.p. − ζ : ∀ h , such that h − h ∗ ∈ H B | (Ψ n ( h, f h ) − Ψ n ( h ∗ , f h )) − (Ψ( h, f h ) − Ψ( h ∗ , f h )) | ≤ δ (cid:112) E [( h ∗ ( X ) − h ( X )) f h ( Z ) ] + 18 δ ≤ δ (cid:112) E [ f h ( Z ) ] + 18 δ = 18 δ (cid:107) f h (cid:107) + 18 δ (19)where in the second inequality we used the fact that h − h ∗ has range in [ − , , when (cid:107) h − h ∗ (cid:107) H ≤ B . 
If h − h ∗ has (cid:107) h − h ∗ (cid:107) H ≥ B , we can apply the latter for ( h − h ∗ ) √ B/ (cid:107) h − h ∗ (cid:107) H and multiplyboth sides by (cid:107) h − h ∗ (cid:107) H /B : | (Ψ n ( h, f h ) − Ψ n ( h ∗ , f h )) − (Ψ( h, f h ) − Ψ( h ∗ , f h )) | ≤ δ (cid:107) f h (cid:107) (cid:107) h − h ∗ (cid:107) H √ B + 18 δ (cid:107) h − h ∗ (cid:107) H B Thus we have that for all h ∈ H : | (Ψ n ( h, f h ) − Ψ n ( h ∗ , f h )) − (Ψ( h, f h ) − Ψ( h ∗ , f h )) | ≤ (cid:0) δ (cid:107) f h (cid:107) + 18 δ (cid:1) max (cid:26) , (cid:107) h − h ∗ (cid:107) H B (cid:27) Applying the latter bound for h := ˆ h and multiplying by r := δ (cid:107) f ˆ h (cid:107) ∈ [0 , / , yields: r (Ψ n (ˆ h, f ˆ h ) − Ψ n ( h ∗ , f ˆ h )) ≥ r (Ψ(ˆ h, f ˆ h ) − Ψ( h ∗ , f ˆ h )) − δ max (cid:26) , (cid:107) h − h ∗ (cid:107) H B (cid:27) Moreover, observe that by Lemma 15 and the fact that (cid:107) f ˆ h − T ( h ∗ − ˆ h ) (cid:107) ≤ η n , we have: r (Ψ(ˆ h, f ˆ h ) − Ψ( h ∗ , f ˆ h )) ≥ δ (cid:107) T ( h ∗ − ˆ h ) (cid:107) − δη n Thus we have: sup f ∈F (cid:18) Ψ n (ˆ h, f ) − Ψ n ( h ∗ , f ) − λ (cid:18) (cid:107) f (cid:107) F + Uδ (cid:107) f (cid:107) ,n (cid:19)(cid:19) ≥ δ (cid:107) T ( h ∗ − h ) (cid:107) − δη n − δ max (cid:26) , (cid:107) h − h ∗ (cid:107) H B (cid:27) − λL (cid:107) h − h ∗ (cid:107) H − λU Combining upper and lower bound.
Combining the upper and lower bound on the centeredpopulation sup-loss we get that w.p. − ζ : either (cid:107) f ˆ h (cid:107) ≤ δ or: δ (cid:107) T (ˆ h − h ∗ ) (cid:107) ≤ O ( δ + δη n + λU ) + 2 sup f ∈F Ψ λ/ ( h ∗ , f )+ 27 δ (cid:107) ˆ h − h ∗ (cid:107) H B + 4 λL (cid:107) ˆ h − h ∗ (cid:107) H + µ (cid:16) (cid:107) h ∗ (cid:107) H − (cid:107) ˆ h (cid:107) H (cid:17)
54e now control the last part. Since λ ≥ δ /U , the latter is upper bounded by: λ (cid:18) UB + 4 L (cid:19) (cid:107) ˆ h − h ∗ (cid:107) H + µ (cid:16) (cid:107) h ∗ (cid:107) H − (cid:107) ˆ h (cid:107) H (cid:17) ≤ λ (cid:18) UB + 4 L (cid:19) (cid:16) (cid:107) ˆ h (cid:107) H + (cid:107) h ∗ (cid:107) H (cid:17) + µ (cid:16) (cid:107) h ∗ (cid:107) H − (cid:107) ˆ h (cid:107) H (cid:17) Since µ ≥ λ (cid:0) UB + 4 L (cid:1) , the latter is upper bounded by: (cid:18) λ (cid:18) UB + 4 L (cid:19) + µ (cid:19) (cid:107) h ∗ (cid:107) H Thus as long as µ ≥ λ (cid:0) UB + 4 L (cid:1) and λ ≥ δ /U , we have: δ (cid:107) T (ˆ h − h ∗ ) (cid:107) ≤ O ( δ + δη n + λU ) + 2 sup f ∈F Ψ λ/ ( h ∗ , f ) + (cid:18) λ (cid:18) UB + 4 L (cid:19) + µ (cid:19) (cid:107) h ∗ (cid:107) H Dividing over by δ and treating L, U, B as constants, we get: (cid:107) T (ˆ h − h ∗ ) (cid:107) ≤ O ( δ + η n + (cid:107) h ∗ (cid:107) H ( λ/δ + µ/δ )) + 2 δ sup f ∈F Ψ λ/ ( h ∗ , f ) Thus either (cid:107) f ˆ h (cid:107) ≤ δ or the latter inequality holds. However, in the case when (cid:107) f ˆ h (cid:107) ≤ δ , we haveby a triangle inequality that: (cid:107) T (ˆ h − h ∗ ) (cid:107) ≤ δ + η n . Thus in any case the latter inequality holds. Upper bounding population sup-loss at minimum.
Let $f_0 = T(h_0 - h^*)$ and observe that:
\[ \sup_{f \in \mathcal{F}} \Psi^{\lambda/2}(h^*, f) = \sup_{f \in \mathcal{F}} \mathbb{E}[f_0(z) f(z)] - \frac{\lambda}{2}\Big(\|f\|_{\mathcal{F}}^2 + \frac{U}{\delta^2}\|f\|_2^2\Big) \le \sup_{f \in \mathcal{F}} \mathbb{E}[f_0(z) f(z)] - \frac{\lambda}{2}\,\frac{U}{\delta^2}\|f\|_2^2. \]
Then by the Cauchy-Schwarz inequality and since $\lambda \ge \delta^2/U$:
\[ \sup_{f \in \mathcal{F}} \mathbb{E}[f_0(z) f(z)] - \frac{\lambda}{2}\,\frac{U}{\delta^2}\|f\|_2^2 \le \sup_{f \in \mathcal{F}} \|f_0\|_2\, \|f\|_2 - \frac{\lambda U}{2\delta^2}\|f\|_2^2 \le \frac{\|f_0\|_2^2\, \delta^2}{2\lambda U} \le \frac{\|f_0\|_2^2}{2}. \]

Concluding.
Concluding, we get that w.p. $1 - \zeta$:
\[ \|T(\hat h - h^*)\|_2 \le O\Big(\delta + \eta_n + \|h^*\|_{\mathcal{H}}^2 \big(\lambda/\delta + \mu/\delta\big)\Big) + \frac{\|T(h^* - h_0)\|_2^2}{\delta}. \]
By a triangle inequality:
\[ \|T(\hat h - h_0)\|_2 \le \|T(\hat h - h^*)\|_2 + \|T(h^* - h_0)\|_2 \le O\Big(\delta + \eta_n + \|h^*\|_{\mathcal{H}}^2 \big(\lambda/\delta + \mu/\delta\big)\Big) + \frac{\|T(h^* - h_0)\|_2^2}{\delta} + \|T(h^* - h_0)\|_2. \]

K.3 Proof of Theorem 2
Proof.
By the definition of ˆ h : ≤ sup f Ψ n (ˆ h, f ) ≤ sup f Ψ n ( h , f ) + λ (cid:16) (cid:107) h (cid:107) H − (cid:107) ˆ h (cid:107) H (cid:17) Let F iU = { f ∈ F i : (cid:107) f (cid:107) F ≤ U } and δ n,ζ = max di =1 R ( F iU ) + c (cid:113) log( c /ζ ) n for some universalconstants c , c . By Theorem 26.5 and 26.9 of Shalev-Shwartz and Ben-David [2014], and since F iU is a symmetric class and sup y ∈Y ,x ∈X | y − h ( x ) | ≤ , w.p. − ζ : f ∈ F iU | Ψ n ( h , f ) − Ψ( h , f ) | ≤ δ n,ζ Since Ψ( h , f ) = 0 for all f , we have that, w.p. − ζ : (cid:107) ˆ h (cid:107) H ≤ (cid:107) h (cid:107) H + δ n,ζ /λ B n,λ,ζ = ( (cid:107) h (cid:107) H + δ n,ζ /λ ) . Then if we let (cid:15) n,λ,ζ = max i R ( H B n,λ,ζ · F iU ) + c (cid:113) log( c /ζ ) n for some universal constants c , c . ∀ h ∈ H B n,λ,ζ , f ∈ F iU | Ψ n ( h, f ) − Ψ( h, f ) | ≤ δ n,ζ By a union bound over the d function classes composing F , we have that w.p. − ζ : sup f ∈F U Ψ n ( h , f ) ≤ sup f ∈F U Ψ( h , f ) + δ n,ζ/d = δ n,ζ/d and sup f ∈F U Ψ n (ˆ h, f ) ≥ sup f ∈F U Ψ(ˆ h, f ) − (cid:15) n,ζ/d Since, by assumption, for any h ∈ H B n,λ,ζ , T ( h − h ) (cid:107) T ( h − h ) (cid:107) ∈ span R ( F U ) , we have T ( h − h ) (cid:107) T ( h − h ) (cid:107) = (cid:80) pi =1 w i f i , with p < ∞ , (cid:107) w (cid:107) ≤ κ and f i ∈ F U . Thus we have: sup f ∈F U Ψ(ˆ h, f ) ≥ κ p (cid:88) i =1 w i Ψ(ˆ h, f i ) = 1 κ Ψ (cid:32) ˆ h, (cid:88) i w i f i (cid:33) = 1 κ (cid:107) T ( h − ˆ h ) (cid:107) Ψ(ˆ h, T ( h − ˆ h ))= 1 κ (cid:107) T ( h − ˆ h ) (cid:107) E [ T ( h − ˆ h )( z ) ]= 1 κ (cid:107) T ( h − ˆ h ) (cid:107) Combining all the above we have: (cid:107) T ( h − ˆ h ) (cid:107) ≤ κ (cid:16) (cid:15) n,λ,ζ/d + δ n,ζ/d + λ (cid:16) (cid:107) h (cid:107) H − (cid:107) ˆ h (cid:107) H (cid:17)(cid:17) Moreover, since functions in H and F are bounded in [ − , , we have that the function h · f is -Lipschitz with respect to the vector of functions ( h, f ) . Thus we can apply a vector version of thecontraction inequality Maurer [2016] to get that: R ( H B n,λ,z · F iU ) ≤ (cid:0) R ( H B n,λ,z ) + R ( F iU ) (cid:1) Finally, we have that since H is star-convex: R ( H B n,λ,z ) ≤ (cid:112) B n,λ,z R ( H ) Leading the final bound of: (cid:107) T ( h − ˆ h ) (cid:107) ≤ κ (cid:32) (cid:107) h (cid:107) H + δ n,ζ /λ ) R ( H ) + 2 d max i =1 R ( F iU ) + c (cid:114) log( c d/ζ ) n + λ (cid:16) (cid:107) h (cid:107) H − (cid:107) ˆ h (cid:107) H (cid:17)(cid:33) Since (cid:107) h (cid:107) H ≤ R and λ ≥ δ n,ζ , we get the result. K.4 Proof of Theorem 6
The proof is identical to that of Theorem 1 with small modifications. Hence we solely mention thesemodifications and omit the full proof.The only part that we change is instead of the set of Equations (19), we instead view ψ ( y ; h ( x )) f h ( z ) as a function of the vector valued function ( x, z ) → ( h ( x ) , f h ( z )) . Then wenote that since h, f take values in [ − , and y ∈ [ − , , we note that this function -Lipschitzwith respect to this vector. Then we can apply Lemma 7 of Foster and Syrgkanis [2019], and by ourchoice of δ := δ n + c (cid:113) log( c /ζ ) n , where δ n upper bounds the critical radius of star ( H B − h ∗ ) andstar ( T ( H B − h ∗ )) , we have that w.p. − ζ : ∀ h ∈ H B : | (Ψ n ( h, f h ) − Ψ n ( h ∗ , f h )) − (Ψ( h, f h ) − Ψ( h ∗ , f h )) | ≤ δ ( (cid:107) h − h ∗ (cid:107) + (cid:107) f h (cid:107) ) + 18 δ − ζ , either (cid:107) f ˆ h (cid:107) ≤ δ or: (cid:107) T (ˆ h − h ) (cid:107) ≤ O (cid:32) δ + δ (cid:107) ˆ h − h ∗ (cid:107) (cid:107) f h (cid:107) + η n + (cid:107) h ∗ (cid:107) H ( λ/δ + µ/δ ) + (cid:107) T ( h ∗ − h ) (cid:107) δ (cid:33) Subsequently, by the measure of ill-posedness we have: (cid:107) ˆ h − h ∗ (cid:107) ≤ τ (cid:107) T (ˆ h − h ∗ ) (cid:107) Moreover, observe that when (cid:107) f ˆ h (cid:107) ≥ δ ≥ η n , then we have by a triangle inequality that: (cid:107) T ( h − h ∗ ) (cid:107) ≥ (cid:107) f ˆ h (cid:107) − η n ≥ η n and: (cid:107) f ˆ h (cid:107) ≥ (cid:107) T ( h − h ∗ ) (cid:107) − η n ≥ (cid:107) T ( h − h ∗ ) (cid:107) Thus we get that: (cid:107) ˆ h − h ∗ (cid:107) (cid:107) f h (cid:107) ≤ τ (cid:107) T ( h − h ∗ ) (cid:107) (cid:107) f ˆ h (cid:107) ≤ τ Thus overall we have that either (cid:107) f ˆ h (cid:107) ≤ δ or: (cid:107) ˆ h − h ∗ (cid:107) ≤ O (cid:18) τ (cid:18) τ δ + η n + (cid:107) h ∗ (cid:107) H ( λ/δ + µ/δ ) + (cid:107) T ( h ∗ − h ) (cid:107) δ (cid:19)(cid:19) ≤ O (cid:18) τ (cid:18) τ δ + η n + (cid:107) h ∗ (cid:107) H ( λ/δ + µ/δ ) + (cid:107) h ∗ − h (cid:107) δ (cid:19)(cid:19) (20)where the last inequality follows by that fact that Jensen’s inequality implies that (cid:107) T ( h ∗ − h ) (cid:107) ≤(cid:107) h ∗ − h (cid:107) . Moreover, if (cid:107) f ˆ h (cid:107) ≤ δ , then by a triangle inequality that (cid:107) T (ˆ h − h ∗ ) (cid:107) ≤ δ + η n , which,subsquently implies by invoking the bound on the ill-posedness measure that: (cid:107) ˆ h ∗ − h (cid:107) ≤ τ ( δ + η n ) .Thus in any case the bound in Equation (20) holds. Choosing h ∗ := arg inf h ∈H B (cid:107) h − h (cid:107) , yieldsthe result. L Proofs from Section 4 and Appendix E
L.1 Proof of Proposition 9
Proof.
Since $\|f\|_{2,n}^2$ depends on $f$ only through the values $f(z_1), \ldots, f(z_n)$, and the maximization over $f$ in (10) is the penalized problem
\[ \sup_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^n \psi(y_i; h(x_i)) f(z_i) - \lambda\Big(\frac{U}{\delta^2}\|f\|_{2,n}^2 + \|f\|_{\mathcal{K}}^2\Big) \]
for some choice of $\lambda \ge 0$, the generalized representer theorem of [Schölkopf et al., 2001, Thm. 1] implies that an optimal solution of the constrained problem in (10) takes the form
\[ f^*(z) = \sum_{i=1}^n \alpha^*_i K(z_i, z) \]
for some weight vector $\alpha^* \in \mathbb{R}^n$. Now consider a function
\[ f(z) = \sum_{i=1}^n \alpha_i K(z_i, z) \]
for any $\alpha \in \mathbb{R}^n$. We have $\|f\|_{\mathcal{K}}^2 = \alpha^\top K_n \alpha$, $f(z_i) = e_i^\top K_n \alpha$, and
\[ \|f\|_{2,n}^2 = \frac{1}{n}\sum_{i=1}^n f(z_i)^2 = \frac{1}{n}\sum_{i=1}^n \alpha^\top K_n e_i e_i^\top K_n \alpha = \frac{1}{n}\alpha^\top K_n^2 \alpha. \]
The problem is therefore equivalent to
\[ \sup_{\alpha \in \mathbb{R}^n}\; \psi_n^\top K_n \alpha - \lambda\, \alpha^\top \Big(\frac{U}{n\delta^2} K_n + I\Big) K_n \alpha. \]
By taking the first order condition, the latter has a closed form optimizer of:
\[ \alpha^* = \frac{1}{2\lambda}\Big(\frac{U}{n\delta^2} K_n + I\Big)^{-1} \psi_n \]
and optimal value of:
\[ \frac{1}{4\lambda}\, \psi_n^\top K_n \Big(\frac{U}{n\delta^2} K_n + I\Big)^{-1} \psi_n = \frac{1}{4\lambda}\, \psi_n^\top K_n^{1/2} \Big(\frac{U}{n\delta^2} K_n + I\Big)^{-1} K_n^{1/2} \psi_n, \]
where in the last equality we used a classic matrix inverse identity for kernel matrices.
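To make the closed form concrete, here is a small NumPy sketch of the adversary's best response from the display above. The names are illustrative: `K` stands for the kernel matrix $K_n$ and `psi_n` for the vector $\psi_n$ appearing in the derivation.

```python
import numpy as np

def adversary_best_response(K, psi_n, lam, U, delta):
    """Closed-form maximizer alpha* = (1/(2*lam)) (U/(n*delta^2) K + I)^{-1} psi_n,
    together with the optimal value of the penalized inner maximization."""
    n = K.shape[0]
    A = (U / (n * delta ** 2)) * K + np.eye(n)
    alpha_star = np.linalg.solve(A, psi_n) / (2 * lam)
    value = psi_n @ K @ np.linalg.solve(A, psi_n) / (4 * lam)
    return alpha_star, value
```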
Proof.
By Proposition 9,
\[ \hat h = \arg\min_{h \in \mathcal{H}} \frac{1}{4\lambda}\, \psi_n^\top M \psi_n + \mu \|h\|_{\mathcal{K}_{\mathcal{H}}}^2 = \arg\min_{h \in \mathcal{H}} \psi_n^\top M \psi_n + 4\lambda\mu \|h\|_{\mathcal{K}_{\mathcal{H}}}^2 \tag{21} \]
where $\psi_n = (\psi(y_i; h(x_i)))_{i=1}^n$. Since the objective of (21) depends on $h$ only through the values $h(x_1), \ldots, h(x_n)$, the generalized representer theorem of [Schölkopf et al., 2001, Thm. 1] implies that an optimal solution of the problem (21) takes the form
\[ h^*(x) = \sum_{i=1}^n \alpha^*_i K_{\mathcal{H}}(x_i, x) \]
for some weight vector $\alpha^* \in \mathbb{R}^n$. Now consider a function
\[ h(x) = \sum_{i=1}^n \alpha_i K_{\mathcal{H}}(x_i, x) \]
for any $\alpha \in \mathbb{R}^n$. We have $\|h\|_{\mathcal{K}_{\mathcal{H}}}^2 = \alpha^\top K_{\mathcal{H},n} \alpha$, $h(x_i) = e_i^\top K_{\mathcal{H},n}\alpha$, and $\psi_n = y - K_{\mathcal{H},n}\alpha$. The problem (21) is therefore equivalent to
\[ \min_{\alpha \in \mathbb{R}^n}\; \alpha^\top K_{\mathcal{H},n} M K_{\mathcal{H},n} \alpha - 2\, y^\top M K_{\mathcal{H},n} \alpha + 4\lambda\mu\, \alpha^\top K_{\mathcal{H},n} \alpha. \]
By [Boyd and Vandenberghe, 2004, Ex. 4.22], this problem is solved by:
\[ \alpha^* := \big(K_{\mathcal{H},n} M K_{\mathcal{H},n} + 4\lambda\mu K_{\mathcal{H},n}\big)^{\dagger} K_{\mathcal{H},n} M y. \]
Here we also used the fact that for any matrix $X$: $X(X^\top X + \lambda I)^{-1} = (XX^\top + \lambda I)^{-1} X$, and that $K_n = K_n^{1/2} K_n^{1/2}$ with $K_n^{1/2}$ symmetric.
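Correspondingly, a minimal NumPy sketch of this closed-form learner update: `K_h` plays the role of $K_{\mathcal{H},n}$ and `M` is the matrix appearing in (21) (coming from Proposition 9's optimal value); the names are illustrative.

```python
import numpy as np

def kernel_learner_closed_form(K_h, M, y, lam, mu):
    """Closed-form minimizer alpha* = (K_h M K_h + 4*lam*mu*K_h)^+ K_h M y,
    so that h(x) = sum_i alpha*_i K_H(x_i, x)."""
    A = K_h @ M @ K_h + 4 * lam * mu * K_h
    alpha_star = np.linalg.pinv(A) @ K_h @ M @ y
    return alpha_star
```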
L.3 Proof of Lemma 11

Proof.

Under these assumptions we have: (cid:107)
T h (cid:107) = a (cid:62) I V m a I − (cid:88) i ≤ m
T h (cid:107) ≤ δ , then by solving the above quadratic inequality and using the fact that ( a + b ) ≤ a + 2 b , we have for all m : (cid:107) a I (cid:107) ≤ δ τ m + 4 c λ m +1 B Moreover, observe that by the RKHS norm bound: (cid:107) h (cid:107) = (cid:88) j ∈ J a j ≤ (cid:107) a I (cid:107) + λ m B Thus we can bound: τ ∗ ( δ ) = min h : (cid:107) T h (cid:107) ≤ δ (cid:107) h (cid:107) ≤ min m ∈ N + δ τ m + (4 c + 1) λ m +1 B M Proofs from Section 5 and Appendix F
M.1 Proof of Corollary 3
Proof.
Let H = {(cid:104) θ, x (cid:105) : θ ∈ R d } and (cid:107) h (cid:107) H = (cid:107) θ (cid:107) . Moreover, suppose that h is s -sparse. Thenif h ∈ H B n,λ,ζ , then: δ n,ζ /λ + (cid:107) θ (cid:107) ≥ (cid:107) ˆ θ (cid:107) = (cid:107) θ + ν (cid:107) = (cid:107) θ + ν S (cid:107) + (cid:107) ν S c (cid:107) ≥ (cid:107) θ (cid:107) − (cid:107) ν S (cid:107) + (cid:107) ν S c (cid:107) Thus: (cid:107) ν (cid:107) ≤ (cid:107) ν S (cid:107) + δ n,ζ /λ ≤ √ s (cid:107) ν S (cid:107) + δ n,ζ /λ ≤ √ s (cid:107) ν (cid:107) + δ n,ζ /λ ≤ (cid:114) sγ ν (cid:62) V ν + δ n,ζ /λ Moreover, observe that: (cid:107) T ( h − h ) (cid:107) = (cid:112) E [ (cid:104) ν, E [ x | z ] (cid:105) ] = √ ν (cid:62) V ν
Thus we have: T ( h − h ) (cid:107) T ( h − h ) (cid:107) = p (cid:88) i =1 ν i √ ν (cid:62) V ν E [ x i | z ] T ( h − h ) (cid:107) T ( h − h ) (cid:107) as (cid:80) pi =1 w i f i , with f i ∈ F U and: (cid:107) w (cid:107) = (cid:107) ν (cid:107) √ ν (cid:62) V ν ≤ (cid:114) sγ + δ n,ζ λ (cid:107) T ( h − h ) (cid:107) . Thus: T ( h − h ) (cid:107) T ( h − h ) (cid:107) ∈ span κ ( F U ) for κ = 2 (cid:113) sγ + δ n,ζ λ (cid:107) T ( h − h ) (cid:107) .Moreover, observe that by the triangle inequality: (cid:107) h (cid:107) H − (cid:107) ˆ h (cid:107) H = (cid:107) θ (cid:107) − (cid:107) ˆ θ (cid:107) ≤ (cid:107) θ − ˆ θ (cid:107) = (cid:107) ν (cid:107) ≤ (cid:114) sγ ν (cid:62) V ν + δ n,ζ /λ Moreover, by standard results on the Rademacher complexity of linear function classes(see e.g. Lemma 26.11 of [Shalev-Shwartz and Ben-David, 2014]), we have R ( H B ) ≤ B (cid:113) p ) n max x ∈X (cid:107) x (cid:107) ∞ and R ( F U ) ≤ U (cid:113) p ) n max z ∈Z (cid:107) z (cid:107) ∞ for F U = { z → (cid:104) β, z (cid:105) : β ∈ R p , (cid:107) β (cid:107) ≤ U } . Thus invoking Theorem 2: (cid:107) T (ˆ h − h ) (cid:107) ≤ (cid:18) (cid:114) sγ + δ n,ζ λ (cid:107) T ( h − h ) (cid:107) (cid:19) · (cid:32) B + 1) (cid:114) log(2 p ) n + δ n,ζ + λ (cid:114) sγ (cid:107) T ( h − h ) (cid:107) (cid:33) The right hand side is upper bounded by the sum of the following four terms: Q := 2 (cid:114) sγ (cid:32) B + 1) (cid:114) log(2 p ) n + δ n,ζ (cid:33) Q := (cid:18) δ n,ζ λ (cid:107) T ( h − h ) (cid:107) (cid:19) (cid:32) B + 1) (cid:114) log(2 p ) n + δ n,ζ (cid:33) Q := 2 λ sγ (cid:107) T ( h − h ) (cid:107) Q := δ n,ζ (cid:114) sγ If (cid:107) T ( h − h ) (cid:107) ≥ (cid:113) sγ δ n,ζ and setting λ ≤ γ s , yields: Q ≤ λ (cid:114) γs (cid:32) B + 1) (cid:114) log(2 p ) n + δ n,ζ (cid:33) Q ≤ (cid:107) T ( h − h ) (cid:107) Thus bringing Q on the left-hand-side and dividing by / , we have: (cid:107) T ( h − h ) (cid:107) ≤
43 ( Q + Q + Q ) = 43 max (cid:26)(cid:114) sγ , λ (cid:114) γs (cid:27) (cid:32)
20 ( B + 1) (cid:114) log(2 p ) n + 11 δ n,ζ (cid:33) The result for the case when sup z ∈Z (cid:107) z (cid:107) ≤ R and F U = { z → (cid:104) β, z (cid:105) : (cid:107) β (cid:107) ≤ U } , follows alongthe exact same lines, but invoking the Lemma 26.10 of [Shalev-Shwartz and Ben-David, 2014],instead of Lemma 26.11, in order to get that R ( F U ) ≤ U R √ n . M.2 Proof of Propositions 13 and 14Proposition 16.
Consider an online linear optimization algorithm over a convex strategy space $S$ and consider the OFTRL algorithm with a 1-strongly convex regularizer $R$ with respect to some norm $\|\cdot\|$ on the space $S$:
\[ f_t = \arg\min_{f \in S}\; f^\top \Big(\sum_{\tau \le t-1} \ell_\tau + \ell_{t-1}\Big) + \frac{1}{\eta} R(f). \]
Let $\|\cdot\|_*$ denote the dual norm of $\|\cdot\|$ and $R = \sup_{f \in S} R(f) - \inf_{f \in S} R(f)$. Then for any $f^* \in S$:
\[ \sum_{t=1}^T (f_t - f^*)^\top \ell_t \ \le\ \frac{R}{\eta} + \eta \sum_{t=1}^T \|\ell_t - \ell_{t-1}\|_*^2 - \frac{1}{4\eta}\sum_{t=1}^T \|f_t - f_{t-1}\|^2.\]

Proof.
The proof follows by observing that Proposition 7 in Syrgkanis et al. [2015] holds verbatimfor any convex strategy space S and not necessarily the simplex. Proposition 17.
Consider a minimax objective: $\min_{\theta \in \Theta}\max_{w \in W} \ell(\theta, w)$. Suppose that $\Theta, W$ are convex sets and that $\ell(\theta, w)$ is convex in $\theta$ for every $w$ and concave in $w$ for every $\theta$. Let $\|\cdot\|_\Theta$ and $\|\cdot\|_W$ be arbitrary norms in the corresponding spaces. Moreover, suppose that the following Lipschitzness properties are satisfied:
\begin{align*}
\forall \theta \in \Theta,\ w, w' \in W &: \quad \|\nabla_\theta \ell(\theta, w) - \nabla_\theta \ell(\theta, w')\|_{\Theta,*} \le L \|w - w'\|_W,\\
\forall w \in W,\ \theta, \theta' \in \Theta &: \quad \|\nabla_w \ell(\theta, w) - \nabla_w \ell(\theta', w)\|_{W,*} \le L \|\theta - \theta'\|_\Theta,
\end{align*}
where $\|\cdot\|_{\Theta,*}$ and $\|\cdot\|_{W,*}$ correspond to the dual norms of $\|\cdot\|_\Theta, \|\cdot\|_W$. Consider the algorithm where at each iteration each player updates their strategy based on:
\begin{align*}
\theta_{t+1} &= \arg\min_{\theta \in \Theta}\; \theta^\top \Big(\sum_{\tau \le t} \nabla_\theta \ell(\theta_\tau, w_\tau) + \nabla_\theta \ell(\theta_t, w_t)\Big) + \frac{1}{\eta} R_{\min}(\theta)\\
w_{t+1} &= \arg\max_{w \in W}\; w^\top \Big(\sum_{\tau \le t} \nabla_w \ell(\theta_\tau, w_\tau) + \nabla_w \ell(\theta_t, w_t)\Big) - \frac{1}{\eta} R_{\max}(w)
\end{align*}
such that $R_{\min}$ is 1-strongly convex in the set $\Theta$ with respect to norm $\|\cdot\|_\Theta$ and $R_{\max}$ is 1-strongly convex in the set $W$ with respect to norm $\|\cdot\|_W$, and with any step-size $\eta \le \frac{1}{4L}$. Then the parameters $\bar\theta = \frac{1}{T}\sum_{t=1}^T \theta_t$ and $\bar w = \frac{1}{T}\sum_{t=1}^T w_t$ correspond to a $\frac{4 R_*}{\eta\, T}$-approximate equilibrium, and hence $\bar\theta$ is a $\frac{4 R_*}{\eta T}$-approximate solution to the minimax objective, where $R_*$ is defined as:
\[ R_* := \max\Big\{ \sup_{\theta \in \Theta} R_{\min}(\theta) - \inf_{\theta \in \Theta} R_{\min}(\theta),\ \sup_{w \in W} R_{\max}(w) - \inf_{w \in W} R_{\max}(w) \Big\}. \]
The proposition is essentially a re-statement of Theorem 25 of Syrgkanis et al. [2015] (which in turn is an adaptation of Lemma 4 of Rakhlin and Sridharan [2013]), specialized to the case of the OFTRL algorithm and to the case of a two-player convex-concave zero-sum game, together with the fact that if the sum of the players' regrets is at most $\epsilon$, then the pair of average solutions corresponds to an $(\epsilon/T)$-approximate equilibrium (see e.g. Freund and Schapire [1999] and Lemma 4 of Rakhlin and Sridharan [2013]).
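To illustrate the update scheme in Proposition 17, here is a hedged NumPy sketch of Optimistic FTRL for a generic smooth convex-concave game. The Euclidean regularizer and the ball-projection helpers are illustrative choices made for this sketch; the proofs of Propositions 13 and 14 below instead use entropic and squared-norm regularizers, which yield the closed-form updates stated in those propositions.

```python
import numpy as np

def project_l2_ball(v, radius):
    """Euclidean projection onto the l2 ball of a given radius."""
    nrm = np.linalg.norm(v)
    return v if nrm <= radius else v * (radius / nrm)

def oftrl_minimax(grad_theta, grad_w, d_theta, d_w, eta, T,
                  radius_theta=1.0, radius_w=1.0):
    """Optimistic FTRL for min_theta max_w ell(theta, w) with squared-norm
    regularizers over l2 balls (one illustrative instantiation of Proposition 17).

    grad_theta(theta, w) and grad_w(theta, w) return the partial gradients of ell.
    Returns the average iterates (theta_bar, w_bar)."""
    theta, w = np.zeros(d_theta), np.zeros(d_w)
    G_theta, G_w = np.zeros(d_theta), np.zeros(d_w)  # cumulative gradients
    thetas, ws = [], []
    for _ in range(T):
        g_t, g_w = grad_theta(theta, w), grad_w(theta, w)
        G_theta += g_t
        G_w += g_w
        # OFTRL with R(x) = ||x||^2/2: minimize x.(cumulative + last gradient) + R(x)/eta,
        # which over an l2 ball amounts to projecting -eta*(G + last gradient).
        theta = project_l2_ball(-eta * (G_theta + g_t), radius_theta)
        w = project_l2_ball(eta * (G_w + g_w), radius_w)
        thetas.append(theta)
        ws.append(w)
    return np.mean(thetas, axis=0), np.mean(ws, axis=0)
```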
Proof of Proposition 13: (cid:96) -ball adversary Let R E ( x ) = (cid:80) pi =1 x i log( x i ) . For the space Θ := { ρ ∈ R p : ρ ≥ , (cid:107) ρ (cid:107) ≤ B } , the entropic regularizer is B -strongly convex with respect to the (cid:96) norm and hence we can set R min ( ρ ) = B R E ( ρ ) . Similarly, for the space W := { w ∈ R p : w ≥ , (cid:107) w (cid:107) = 1 } , the entropic regularizer is -strongly convex with respect to the (cid:96) norm andthus we can set R max ( w ) = R E ( w ) . For this choice of regularizers, the update rules can be easilyverified to have a closed form solution provided in Proposition 13, by writing the Lagrangian of eachOFTRL optimization problem and invoking strong duality. Further, we can verify the lipschitznessconditions. Since the dual of the (cid:96) norm is the (cid:96) ∞ norm, ∇ ρ (cid:96) ( ρ, w ) = E n [ vu (cid:62) ] w + µW and thus: (cid:107)∇ ρ (cid:96) ( ρ, w ) − ∇ ρ (cid:96) ( ρ, w (cid:48) ) (cid:107) ∞ = (cid:107) E n [ vu (cid:62) ]( w − w (cid:48) ) (cid:107) ∞ ≤ (cid:107) E n [ vu (cid:62) ] (cid:107) ∞ (cid:107) w − w (cid:48) (cid:107) (cid:107)∇ w (cid:96) ( ρ, w ) − ∇ w (cid:96) ( ρ (cid:48) , w ) (cid:107) ∞ = (cid:107) E n [ uv (cid:62) ]( ρ − ρ (cid:48) ) (cid:107) ∞ ≤ (cid:107) E n [ vu (cid:62) ] (cid:107) ∞ (cid:107) ρ − ρ (cid:48) (cid:107) Thus we have L = (cid:107) E n [ uv (cid:62) ] (cid:107) ∞ . Finally, observe that: sup ρ ∈ Θ B R E ( ρ ) − inf ρ ∈ Θ B R E ( ρ ) = B log( B ∨
1) + B log(2 p )sup w ∈ W R E ( w ) − inf w ∈ W R E ( w ) = log(2 p ) R ∗ = B log( B ∨
1) + ( B + 1) log(2 p ) . Thus if we set η = (cid:107) E n [ vu (cid:62) ] (cid:107) ∞ , then wehave that after T iterations, ¯ θ = ¯ ρ + − ¯ ρ − is an (cid:15) ( T ) -approximate solution to the minimax problem,with (cid:15) ( T ) = 16 (cid:107) E n [ vu (cid:62) ] (cid:107) ∞ B log( B ∨
1) + ( B + 1) log(2 p ) T .
Combining all the above with Proposition 17 yields the proof of Proposition 13.
Proof of Proposition 14: (cid:96) -ball adversary For the case when W := { β ∈ R p : (cid:107) β (cid:107) ≤ U } ,then we have that the squared norm regularizer R max ( β ) = (cid:107) β (cid:107) is -strongly convex with respectto the (cid:96) norm and we can use (cid:107) · (cid:107) W = (cid:107) · (cid:107) . The choice of R min is the same as in the case ofan (cid:96) adversary, as detailed in the previous paragraph. For this choice of regularizers, the updaterules can be easily verified to have a closed form solution provided in Proposition 14, by writingthe Lagrangian of each OFTRL optimization problem and invoking strong duality. Moreover, theLipschitzness conditions become: (cid:107)∇ ρ (cid:96) ( ρ, β ) − ∇ ρ (cid:96) ( ρ, β (cid:48) ) (cid:107) ∞ = (cid:107) E n [ vz (cid:62) ]( β − β (cid:48) ) (cid:107) ∞ ≤ (cid:107) E n [ vz (cid:62) ] (cid:107) ∞ , (cid:107) β − β (cid:48) (cid:107) (cid:107)∇ β (cid:96) ( ρ, β ) − ∇ β (cid:96) ( ρ (cid:48) , β ) (cid:107) = (cid:107) E n [ zv (cid:62) ]( ρ − ρ (cid:48) ) (cid:107) ≤ (cid:107) E n [ zv (cid:62) ] (cid:107) , ∞ (cid:107) ρ − ρ (cid:48) (cid:107) where (cid:107) A (cid:107) ∞ , = max i (cid:113)(cid:80) j A ij and (cid:107) A (cid:107) , ∞ = (cid:113)(cid:80) i max j A ij . Thus we can take L = max max i (cid:115)(cid:88) j E n [ v i z j ] + (cid:115)(cid:88) i max j E n [ z i v j ] ≤ (cid:115)(cid:88) i max j E n [ z i v j ] = (cid:107) E n [ zv T ] (cid:107) , ∞ Finally, we also have that: sup β ∈ W R max ( β ) − inf β ∈ W R max ( β ) ≤ U Thus we can take R ∗ = B log( B ∨
1) + B log(2 p ) + U . Thus if we set η = (cid:107) E n [ zv (cid:62) ] (cid:107) , ∞ ,then we have that after T iterations, ¯ θ = ¯ ρ + − ¯ ρ − is an (cid:15) ( T ) -approximate solution to the minimaxproblem, with (cid:15) ( T ) = 16 (cid:107) E n [ zv (cid:62) ] (cid:107) , ∞ B log( B ∨
1) + B log(2 p ) + U / T .
Combining all the above with Proposition 17 yields the proof of Proposition 14.
N Proofs from Section 7 and Appendix I
N.1 Proof of Theorem 4
Observe that we can view the minimax problem as the solution to a convex-concave zero-sum game,where the strategy of each player is a vector in an n -dimensional space, subject to complex con-straints imposed by the corresponding hypothesis. In particular, let A = { ( f ( z ) , . . . , f ( z n )) : f ∈F} and B = { ( h ( x ) , . . . , h ( z n )) : h ∈ H} . Then the minimax problem can be phrased as: min b ∈ B max a ∈ A n (cid:88) i (( y i − b i ) a i − a i ) = max b ∈ B min a ∈ A n (cid:88) i ( a i − ( y i − b i ) a i ) Moreover, we will denote with (cid:96) ( a, b ) := n (cid:80) i ( a i − ( y i − b i ) a i ) , which is a loss that is concave(in fact linear) in b and convex in a . Moreover, our assumption on F implies that A is a convex set.Then the algorithm described in the statement of the theorem corresponds to solving this zero-sumgame via the following iterative algorithm: at every period t = 1 , . . . , T , the adversary chooses avector a t based on the the follow the leader (FTL) algorithm, i.e.: a t = arg min a ∈ A t − t − (cid:88) τ =1 (cid:96) ( a, b τ ) b t by best-responding to the current test function, i.e.: b t = arg max b ∈ B (cid:96) ( a t , b ) The equivalent stems from the following two observations: First, for the adversary we can re-writethe FTL algorithm by completing the square as: a t = arg min a ∈ A n (cid:88) i t − t − (cid:88) τ =1 ( a i − ( y i − b it ) a i )= arg min a ∈ A n (cid:88) i (cid:32) a i − (cid:32) y i − t − t − (cid:88) τ =1 b it (cid:33) a i (cid:33) = arg min a ∈ A n (cid:88) i (cid:32) a i − (cid:32) y i − t − t − (cid:88) τ =1 b it (cid:33)(cid:33) which then is equivalent to the oracle call described in the statement of the theorem. Second for thelearner we have: b t = arg max b ∈ B (cid:96) ( a t , b )= arg max b ∈ B n (cid:88) i b i a it = arg max b ∈ B n (cid:88) i b i | a it | sign ( a it )= arg max b ∈ B n (cid:88) i | a it | E z ∼ Bernoulli ( bi +12 ) [(2 z i − sign ( a it )]= arg max b ∈ B n (cid:88) i | a it | (cid:16) Pr z ∼ Bernoulli ( bi +12 ) [(2 z i −
1) = sign ( a it )] − Pr z ∼ Bernoulli ( bi +12 ) [(2 z i − (cid:54) = sign ( a it )] (cid:17) = arg max b ∈ B n (cid:88) i | a it | (cid:16) z ∼ Bernoulli ( bi +12 ) [(2 z i −
1) = sign ( a it )] − (cid:17) = arg max b ∈ B n (cid:88) i | a it | Pr z ∼ Bernoulli ( bi +12 ) [(2 z i −