Inference on Heterogeneous Quantile Treatment Effects via Rank-Score Balancing
Alexander Giessing ∗ Jingshen Wang †‡§
February 4, 2021
Abstract
Understanding treatment effect heterogeneity in observational studies is of great practical importance to many scientific fields because the same treatment may affect different individuals differently. Quantile regression provides a natural framework for modeling such heterogeneity. In this paper, we propose a new method for inference on heterogeneous quantile treatment effects that incorporates high-dimensional covariates. Our estimator combines a debiased ℓ_1-penalized regression adjustment with a quantile-specific covariate balancing scheme. We present a comprehensive study of the theoretical properties of this estimator, including weak convergence of the heterogeneous quantile treatment effect process to the sum of two independent, centered Gaussian processes. We illustrate the finite-sample performance of our approach through Monte Carlo experiments and an empirical example, dealing with the differential effect of mothers' education on infant birth weights.

Keywords:
Quantile Regression; De-biased Inference; High-dimensional Data; Causal Inference.
∗ Department of ORFE, Princeton University. E-mail: [email protected].
† Division of Biostatistics, University of California, Berkeley. E-mail: [email protected].
‡ Jingshen Wang acknowledges the support of the National Science Foundation (DMS 2015325).
§ Corresponding author.

1 Introduction

Given the prevalence of high-dimensional data in observational studies, many scientific fields require an understanding of treatment effect heterogeneity, simply because the same treatment may affect different individuals differently. For example, in modern drug development, it is important to test for the existence (or the lack) of treatment effect heterogeneity and to identify subpopulations for which a treatment is most beneficial (or harmful) (Lipkovich et al., 2017; Ma and Huang, 2017). Similarly, in precision medicine, it is essential to be able to generalize causal effect estimates from a small experimental sample to a target population (Kern et al., 2016; Coppock et al., 2018). Since quantile regression (Koenker, 2005a) models the effect of covariates on the conditional distribution of the response variable, it provides a natural framework for studying treatment heterogeneity.

In this paper, we propose a new procedure for inference on the heterogeneous quantile treatment effects (HQTE) curve in the presence of high-dimensional covariates. The HQTE curve is defined as the difference between the quantiles of the conditional distributions of the treatment and control groups:

α(τ; z) := Q_1(τ; z) − Q_0(τ; z), (1)

where Q_1(τ; z) (Q_0(τ; z)) is the conditional quantile curve of the potential outcome of the treated group (the control group) evaluated at a quantile level τ ∈ (0,
1) and covariate z ∈ R^p.

The HQTE is a natural quantity to model treatment effect heterogeneity and has three benefits: First, the HQTE curve provides information about the treatment effect at every quantile level. It therefore allows for a more nuanced analysis of treatment effects than the average treatment effect, which only provides information about the mean of the outcome variable. In the canonical study of infant birth weights, the literature documents that maternal hypertension is a risk factor for lower birth weight and that the effect of hypertension on birth weight is greater on the left tail of the birth weight distribution (Bowers et al., 2011; Mhanna et al., 2015). In this case, a statistical procedure aimed at detecting treatment effects at the lower quantiles of the birth weight distribution is more useful than a statistical procedure designed to estimate the average treatment effect. Second, when the outcome variable has a skewed distribution (such as survival times), the HQTE curve is a more informative quantity than the classical average treatment effect. For example, Wang et al. (2018) show that mean-optimal treatment regimes that maximize the average treatment effect can be harmful to individuals who differ significantly from the typical individual in the sample. In such a case, a more adaptive quantile-optimal treatment regime based on the HQTE is clearly preferred. Third, the HQTE curve can be used to define new measures of treatment effects such as the integrated treatment effect, ∫_T α(τ; z) dτ, and the maximal treatment effect, sup_{τ∈T} |α(τ; z)|. These quantities can be particularly helpful in identifying subgroups in precision medicine.

The primary contribution of this article is the novel rank-score balanced estimator of the HQTE and a comprehensive study of its theoretical properties.
We break down our contribution as follows:

On the methodological side, we show how to combine a high-dimensional quantile regression adjustment with inverse-density weighted regression rank-scores to approximately balance covariates. Specifically, we demonstrate that regression rank scores and the conditional densities are important quantities for quantile balancing in causal inference (Section 3.2). Moreover, our procedure also provides new insights into the debiasing of ℓ_1-penalized estimates of high-dimensional quantile regression vectors (Section 3.3).

On the practical side, we propose a systematic way of selecting the tuning parameters of the rank-score balanced estimator. Conventional covariate balancing procedures are rather sensitive to the choice of tuning parameters. In contrast, our systematic procedure makes use of the dual formulation of the rank-score balancing program and is fully automatic (Section 5.2). We illustrate the finite-sample performance of our approach through Monte Carlo experiments (Section 6) and an empirical example, dealing with the differential effect of mothers' education on infant birth weight (Section 7).

On the theoretical side, we make the following three contributions. First, we establish weak convergence of the entire rank-score balanced HQTE process to the weighted sum of two independent and centered Gaussian processes in ℓ^∞(T). The large sample properties of this process are needed whenever one would like to conduct inference on the HQTE curve on more than just one quantile at a time. More precisely, this result allows us to conduct simultaneous inference on the continuum of quantile levels T ⊂ (0, 1). To our knowledge, this result follows neither from existing results on ℓ_1-penalized quantile regression (Belloni et al., 2019b; Belloni and Chernozhukov, 2011) nor from results on quantile regression processes in growing dimension (Belloni et al., 2019a; Chao et al., 2017).
While we are able to borrow some ideas from this literature, most of our proofs are original. Among the technical auxiliary results, the dual formulation of the rank-score balancing program and the Bahadur-type representation for the rank-score balanced estimator are particularly important and useful beyond the scope of this paper (Giessing and Wang, 2020, Supplementary Materials).

Treatment effect heterogeneity is of significant interest in causal inference and has been analyzed from many different angles. Imai and Ratkovic (2013) formulate the estimation of heterogeneous mean treatment effects as a variable selection problem. Angrist (2004) studies mean treatment effect heterogeneity through instrumental variables. In recent publications, Künzel et al. (2019) and Nie and Wager (2019) propose several new meta-learners to estimate conditional average treatment effects. Firpo (2007); Frölich and Melly (2013); Cattaneo (2010) study (marginal) quantile treatment effects through modeling inverse propensity scores. Chernozhukov and Hansen (2005) and Abadie et al. (2002) show how instrumental variables can be helpful in identifying conditional quantile treatment effects in the presence of unmeasured confounding. Belloni et al. (2019b) develop an estimator for the coefficient of a treatment variable in a high-dimensional quantile regression model which is robust to model misspecification.

Our paper is related to a particular subset of this literature through our methodology based on covariate weighting and regression adjustment. These two techniques are commonly used in observational studies to mitigate confounding biases. Common weighting procedures for average treatment effect estimation include inverse propensity score weighting (Rosenbaum and Rubin, 1983; Tan, 2010), matching (Rosenbaum, 1989, 2020), and covariate balancing (Hainmueller, 2012; Imai and Ratkovic, 2014; Zubizarreta, 2015; Wang and Zubizarreta, 2017). Examples of regression adjustments include Rubin (1979); Belloni et al.
(2014); Wang et al. (2020). Recently, Athey et al. (2018) have shown that when the goal is to estimate average treatment effects with high-dimensional covariates, it can be beneficial to combine both techniques. Our rank-score balancing procedure mirrors this idea by combining a quantile regression estimate of the conditional quantile function with a (quantile-specific) weighting scheme.

Our paper also makes a novel contribution to the literature on inference for high-dimensional quantile regression. Up until now, this literature has been limited to different approaches for debiasing ℓ_1-penalized estimates of the quantile regression vector when the response is homoscedastic (Zhao et al., 2019; Bradic and Kolar, 2017). Our approach differs in two aspects from this literature: First, our inference procedure allows for heteroscedastic responses. This is of great practical importance since the ability to model heteroscedasticity is a key reason for using quantile regression in the first place. Second, we directly debias the scalar estimate of the conditional quantile function Q_d(τ; z). This is conceptually very different from debiasing a regression vector, and the resulting estimator has distinct theoretical properties. In particular, we can derive the limit distribution of the estimated conditional quantile function, whereas this is not possible when using the results from Zhao et al. (2019); Bradic and Kolar (2017).

Throughout this paper, Y ∈ R denotes the response variable, D ∈ {0, 1} a binary treatment variable, and X ∈ R^p a vector of covariates. Following the framework of Rubin (1974), we define the causal effect of interest in terms of so-called potential outcomes: Potential outcomes describe counterfactual states of the world, i.e. possible responses if certain treatments were administered. More formally, we index the outcomes of the response variable Y by the treatment variable D and write Y_D for the potential outcomes of Y.
With this notation, the potential outcome Y_d corresponds to the response that we would observe if treatment D = d was assigned. The causal quantity of interest in this paper is the heterogeneous quantile treatment effect (HQTE) curve evaluated at covariates z ∈ R^p,

α(τ; z) := Q_1(τ; z) − Q_0(τ; z), (2)

where Q_d(τ; z) = inf{y ∈ R : F_{Y_d|X}(y | z) ≥ τ} is the conditional quantile function (CQF) of the potential outcome Y_d | X = z at a quantile level τ ∈ (0,
1) and F_{Y_d|X} denotes the corresponding conditional distribution function.

The key challenge in causal inference is that for each individual we only observe its potential outcome Y_D under one of the two possible treatment assignments D ∈ {0, 1} but never under both. In other words, the observed response variable is given as Y = DY_1 + (1 − D)Y_0. Since the potential outcomes Y_1 and Y_0 are never jointly observed, a priori, it is unclear how to estimate Q_1(τ; z) and Q_0(τ; z). To make headway, we introduce the following condition:

Condition 1 (Unconfoundedness). (Y_0, Y_1) is independent of D given X, i.e. (Y_0, Y_1) ⊥⊥ D | X.

Colloquially speaking, this condition guarantees that after controlling for relevant covariates the treatment assignment is completely randomized. Under this condition, Q_d(τ; z) is identifiable and can be recast as the solution to the following program:

Q_d(τ; ·) ∈ arg min_{q(·)} E[ρ_τ(Y − q(X)) − ρ_τ(Y) | D = d], (3)

where ρ_τ(u) = u(τ − 1{u ≤ 0}) is the so-called check-loss and the minimum is taken over all measurable functions q(·) of X (Koenker, 2005b; Angrist et al., 2006). While unconfoundedness of treatment assignments is a standard condition in the literature on causal inference, it cannot be verified from the data alone. Rubin (2009) argues that unconfoundedness is more plausible when X is a rich set of covariates. This motivates us to frame our problem as a high-dimensional statistical problem with predictors X ∈ R^p whose dimension p exceeds the sample size n.

The convex optimization program (3) already poses a formidable challenge in low dimensions, and to make it tractable in high dimensions we need to impose further structural constraints:

Condition 2 (Sparse linear quantile regression function). Let T be a compact subset of (0, 1). The CQF of Y_d | X = z is given by Q_d(τ; z) = z'θ_d(τ) and sup_{τ∈T} ‖θ_d(τ)‖_0 ≪ p ∧ n.
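The characterization in (3) rests on the fact that the τ-th quantile minimizes the expected check-loss. This can be illustrated numerically with a minimal sketch (the function name, the grid search, and all values below are ours, for illustration only):

```python
import numpy as np

def check_loss(u, tau):
    """Check loss rho_tau(u) = u * (tau - 1{u <= 0})."""
    return u * (tau - (u <= 0))

rng = np.random.default_rng(0)
y = rng.normal(size=5000)
tau = 0.75
# Empirical check-loss risk over a grid of candidate quantile values q
grid = np.linspace(-3, 3, 601)
risks = np.array([check_loss(y - q, tau).mean() for q in grid])
minimizer = grid[risks.argmin()]   # close to the tau-th sample quantile of y
```

The grid minimizer of the empirical check-loss risk coincides (up to the grid spacing) with the τ-th sample quantile, which is the empirical analogue of program (3).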
In principle, this condition can be relaxed to approximate linearity and approximate sparsity similar to Belloni et al. (2019b), but we do not pursue the technical refinements in this direction. Under Conditions 1 and 2, the program (3) reduces to the linear quantile regression program

θ_d(τ) ∈ arg min_{θ∈R^p} E[ρ_τ(Y − X'θ) − ρ_τ(Y) | D = d], (4)

and the HQTE curve is identified as

α(τ; z) = z'θ_1(τ) − z'θ_0(τ). (5)

Despite the linearity condition, the HQTE curve in (5) is remarkably flexible and captures three different aspects of treatment heterogeneity: First, by keeping z ∈ R^p fixed and varying only the quantile levels τ ∈ T we can investigate treatment effect heterogeneity across different quantile levels. Second, by keeping τ ∈ T fixed and varying z ∈ R^p we can analyze individual treatment effects for individuals characterized by different covariates z. Third, by keeping τ ∈ T fixed and letting z ∈ R^p be a sparse contrast we can identify differential effects of treatments in different sub-populations characterized by a few pre-treatment covariates (e.g. race, marriage status, gender, socioeconomic status, etc.).

3 Methodology
In this section, we first introduce our rank-score balancing procedure for estimating the HQTE curve. We then discuss its intuitive connection to classical covariate weighting procedures in causal inference. Lastly, we show heuristically that the estimator solves a bias-variance trade-off problem in high-dimensional inference.
Let {(Y_i, D_i, X_i)}_{i=1}^n be a random sample of response variable Y, treatment indicator D, and covariates X. Denote by f_{Y_d|X} the conditional density of Y_d | X, d ∈ {0, 1}. To simplify notation, write f_i(τ) = f_{Y_{D_i}|X}(X_i'θ_{D_i}(τ) | X_i), i = 1, ..., n. Moreover, assume that the first n_0 observations belong to the control group and the remaining n_1 = n − n_0 observations to the treatment group.

Step 1.
For d ∈ {0, 1}, compute pilot estimates of θ_d(τ) as the solution of the ℓ_1-penalized quantile regression program,

θ̂_d(τ) ∈ arg min_{θ∈R^p} Σ_{i: D_i=d} ρ_τ(Y_i − X_i'θ) + λ_d ‖θ‖_1, (6)

where λ_d > 0 is a regularization parameter. Use the pilot estimates θ̂_d(τ + h) and θ̂_d(τ − h) to estimate the conditional densities f_i(τ) as

f̂_i(τ) := 2h / (X_i'θ̂_1(τ + h) − X_i'θ̂_1(τ − h)) for i ∈ {j : D_j = 1},
f̂_i(τ) := 2h / (X_i'θ̂_0(τ + h) − X_i'θ̂_0(τ − h)) for i ∈ {j : D_j = 0}, (7)

where h > 0 is a bandwidth. We discuss the choices of λ_d and h in Sections 5.1 and 5.3.

Step 2.
Solve the rank-score balancing program with plug-in estimates of the conditional densities from Step 1,

ŵ(τ; z) ∈ arg min_{w∈R^n} { Σ_{i=1}^n w_i² f̂_i^{-2}(τ) : ‖z − (1/√n) Σ_{i: D_i=d} w_i X_i‖_∞ ≤ γ_d, d ∈ {0, 1} }, (8)

where γ_d > 0 is a tuning parameter. We discuss the choice of γ_d in Section 5.2.

Step 3.
Define the rank-score balanced estimator of the CQF as

Q̂_d(τ; z) := z'θ̂_d(τ) + (1/√n) Σ_{i: D_i=d} ŵ_i(τ; z) f̂_i^{-1}(τ) (τ − 1{Y_i ≤ X_i'θ̂_d(τ)}). (9)

Step 4.
Define the rank-score balanced estimator of the HQTE curve as α̂(τ; z) := Q̂_1(τ; z) − Q̂_0(τ; z).

Step 5. Construct an asymptotic 95% confidence interval for α(τ; z) as

[ α̂(τ; z) ± 1.96 × ( τ(1 − τ) n^{-1} Σ_{i=1}^n ŵ_i²(τ; z) f̂_i^{-2}(τ) )^{1/2} ].

Steps 2 and 3 constitute the core of the rank-score balancing procedure. In Step 2 we compute quantile-specific balancing weights and in Step 3 we augment the estimated conditional quantile function z'θ̂_d(τ) with a bias correction based on these weights. This bias correction addresses two sources of bias in z'θ̂_d(τ): First, the ℓ_1-penalty introduces a regularization bias by shrinking coefficients in θ̂_d(τ) towards zero. Second, since the quantile regression vector θ̂_d(τ) is based on the observed covariates {X_i : D_i = d} alone, estimating Q_d(τ; z) as z'θ̂_d(τ) introduces a sort of mismatch bias. The more z differs from a typical covariate in {X_i : D_i = d}, the larger is this bias. We refer to our estimator as the rank-score balanced estimator because its key component is a weighted sum of quantile regression rank scores with weights that approximately balance the covariates.

The specific form of the bias correction in Step 3 is inspired by weighting and residual balancing procedures used to estimate average treatment effects. In this section we explain how it relates to and differs from these procedures. As in conventional residual balancing, the bias correction term of the rank-score balancing procedure is based on the residuals of the fitted quantile regression model. While the fitted quantile regression model z'θ̂_d(τ) captures the bulk of information about the conditional quantile function, the residuals contain additional information about biases and weak signals. By augmenting the model-based prediction z'θ̂_d(τ) with a weighted sum of (transformations of) the residuals we aim to improve the predictive accuracy.
The quantile treatment effect framework requires two substantial changes to this idea, pertaining to the choice of the transformation of the residuals and the choice of the balancing weights.

First, by the very fact that we deal with a conditional quantile regression model, only the rank-scores {(τ − 1{Y_i ≤ X_i'θ̂_d(τ)}) : D_i = d} (i.e. the weighted signs of the residuals) and not the magnitudes of the residuals are informative (Koenker, 2005b). However, since the rank-scores are dimensionless, they cannot be directly compared to the quantile regression estimate z'θ̂_d(τ). Thus, to put both terms on the same scale, we divide the rank-scores by the estimates of the conditional densities f̂_i(τ), which are (approximately) proportional to the asymptotic standard deviation of z'θ̂_d(τ).

Second, we choose the balancing weights according to a quantile-specific re-weighting scheme. Assume, for a moment, that z is a linear combination of the X_i's, i.e. z = Σ_i a_i X_i. Then, a natural way of associating z with the re-scaled rank-scores is to weight the i-th re-scaled rank-score (τ − 1{Y_i ≤ X_i'θ̂_d(τ)})/f̂_i(τ) with the i-th coefficient a_i of the linear combination. The rationale is that covariates X_i that contribute significantly to z (via coefficients a_i with large absolute values) also entail large biases in the pilot estimate z'θ̂_d(τ). Since in high dimensions one cannot hope to match z exactly with a linear combination of the X_i's, the rank-score balancing problem (8) aims at matching z only approximately while minimizing the inverse-density weighted sum of squared balancing weights. The inverse-density weighting ensures that observations associated with low density at the τ-th quantile are given smaller balancing weights. This guards against overconfident bias correction based on unrepresentative observations.
The rank-score balanced estimator can also be motivated from a more technical standpoint. This perspective offers a first glimpse at its theoretical properties.

Let θ ∈ R^p and w ∈ R^n be arbitrary. Recall that we write f_i(τ) = f_{Y_{D_i}|X}(X_i'θ_{D_i}(τ) | X_i), i = 1, ..., n. Define φ_i(θ) := 1{Y_i ≤ X_i'θ} − 1{Y_i ≤ X_i'θ_d(τ)} and note that

E[f_i^{-1}(τ) φ_i(θ) | X_i] = f^{-1}_{Y_{D_i}|X}(X_i'θ_{D_i}(τ) | X_i) (F_{Y_{D_i}|X}(X_i'θ | X_i) − F_{Y_{D_i}|X}(X_i'θ_{D_i}(τ) | X_i)).

A first-order Taylor approximation at θ = θ_{D_i}(τ) yields

(1/√n) Σ_{i: D_i=d} w_i E[f_i^{-1}(τ) φ_i(θ) | X_i] = (1/√n) Σ_{i: D_i=d} w_i X_i' (θ − θ_d(τ)) + a_n(θ),

where a_n(θ) := ‖(1/√n) Σ_{i: D_i=d} w_i f_i^{-1}(τ) ξ_{i,τ} X_i X_i'‖_op ‖θ − θ_d(τ)‖² and ξ_{i,τ} = f_{Y_{D_i}|X}(X_i'ξ | X_i) with ξ an intermediate point between θ and θ_{D_i}(τ). Suppose that this identity remains (approximately) true for θ = θ̂_d(τ).
Then, re-arranging this expansion leads to

z'θ̂_d(τ) + (1/√n) Σ_{i: D_i=d} w_i f_i^{-1}(τ) (τ − 1{Y_i ≤ X_i'θ̂_d(τ)})
= z'θ_d(τ) + (1/√n) Σ_{i: D_i=d} w_i f_i^{-1}(τ) (τ − 1{Y_i ≤ X_i'θ_d(τ)})
+ (z − (1/√n) Σ_{i: D_i=d} w_i X_i)' (θ̂_d(τ) − θ_d(τ)) + a_n(θ̂_d(τ)) + b_n(θ̂_d(τ)), (10)

where b_n(θ) := (1/√n) Σ_{i: D_i=d} w_i (f_i^{-1}(τ) φ_i(θ) − E[f_i^{-1}(τ) φ_i(θ) | X_i]).

If θ̂_d(τ) is consistent for θ_d(τ), the remainder terms a_n(θ̂_d(τ)) and b_n(θ̂_d(τ)) can be shown to be asymptotically negligible and the statistical behavior of the left hand side of (10) is governed by the first three terms on the right hand side. In particular, the first term on the right hand side, z'θ_d(τ), is deterministic, the second term has mean zero and variance τ(1 − τ) n^{-1} Σ_{i: D_i=d} w_i² f_i^{-2}(τ) (expectations taken conditionally on the X_i's), and the third term can be upper bounded by ‖z − (1/√n) Σ_{i: D_i=d} w_i X_i‖_∞ ‖θ̂_d(τ) − θ_d(τ)‖_1. Since the weights w are arbitrary, we can choose them to fine-tune the statistical behavior of the left hand side of (10). Given the above observations, it is natural to seek weights w that minimize the variance τ(1 − τ) n^{-1} Σ_{i: D_i=d} w_i² f_i^{-2}(τ) while controlling the bias term ‖z − (1/√n) Σ_{i: D_i=d} w_i X_i‖_∞. The rank-score balancing program (8) with plug-in estimates f̂_i(τ) can be viewed as a feasible sample version of this constrained minimization problem. Since the weights are chosen to minimize the variance of the right hand side, we expect that the rank-score balanced estimator is more efficient than other debiasing procedures.
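The mean and variance claims for the second term on the right hand side of (10) can be checked by simulation: under a correctly specified model, the indicator 1{Y_i ≤ X_i'θ_d(τ)} is Bernoulli(τ) conditionally on X_i, so for fixed weights and densities the term has mean zero and variance τ(1 − τ) n^{-1} Σ_i w_i² f_i^{-2}(τ). A minimal numerical check with arbitrary stand-in values for w and f (not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n, tau, reps = 50, 0.3, 20000
w = rng.uniform(0.5, 1.5, size=n)   # arbitrary fixed weights, illustration only
f = rng.uniform(0.5, 2.0, size=n)   # arbitrary stand-in conditional densities

# Under correct specification, 1{Y_i <= X_i' theta_d(tau)} ~ Bernoulli(tau).
ind = rng.uniform(size=(reps, n)) <= tau
draws = ((w / f) * (tau - ind)).sum(axis=1) / np.sqrt(n)

theory_var = tau * (1 - tau) / n * np.sum(w ** 2 / f ** 2)
print(draws.mean(), draws.var(), theory_var)
```

The empirical mean of the simulated term is near zero and its empirical variance matches the theoretical expression up to Monte Carlo error.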
Our theoretical and empirical results in Sections 4 and 6 confirm this intuition.

In this section, we establish joint asymptotic normality of the HQTE process, propose consistent estimators of its asymptotic covariance function, and discuss the duality theory of the rank-score balancing program which underlies the theoretical results.
Throughout, we assume that {(Y_i, D_i, X_i)}_{i=1}^n are i.i.d. copies of (Y, D, X). Recall that Y = DY_1 + (1 − D)Y_0 ∈ R, where Y_0 and Y_1 are potential outcomes, D ∈ {0, 1}, and X ∈ R^p. For examples of quantile regression models that satisfy the conditions below, we refer to Section 2.5 in Belloni and Chernozhukov (2011).

Condition 3 (Sub-Gaussian predictors). X ∈ R^p is a sub-Gaussian vector, i.e. ‖X'u − E[X'u]‖_{ψ_2} ≲ (E[(X'u)²])^{1/2} for all u ∈ R^p.

Condition 4 (Sparsity of dual solution). Let T be a compact subset of (0, 1) and v_d(τ; z) := −(E[f_{Y_d|X}(X'θ_d(τ) | X) XX' 1{D = d}])^{-1} z. There exists s_v ≥ 1 such that sup_{d∈{0,1}} sup_{τ∈T} |T_{v_d}(τ)| ≤ s_v for T_{v_d}(τ) = support(v_d(τ; z)).

Condition 5 (Sparsity and Lipschitz continuity of τ ↦ θ_d(τ)). Let T be a compact subset of (0, 1). (i) There exists s_θ ≥ 1 such that sup_{d∈{0,1}} sup_{τ∈T} |T_{θ_d}(τ)| ≤ s_θ for T_{θ_d}(τ) = support(θ_d(τ)); (ii) There exists L_θ ≥ 0 such that sup_{d∈{0,1}} ‖θ_d(τ) − θ_d(τ')‖_1 ≤ L_θ |τ − τ'| for all τ, τ' ∈ T.

Condition 6 (Lipschitz continuity and boundedness of f_{Y_d|X}). Let a, b, x ∈ R^p be arbitrary. (i) There exists L_f ≥ 0 such that sup_{d∈{0,1}} |f_{Y_d|X}(x'a | x) − f_{Y_d|X}(x'b | x)| ≤ L_f |x'a − x'b|; (ii) There exists f̄ ≥ 0 such that sup_{d∈{0,1}} f_{Y_d|X}(a | x) ≤ f̄.

Condition 7 (Differentiability of τ ↦ Q_d(τ; X)). Let T be a compact subset of (0, 1). The CQF Q_d(τ; X) is three times boundedly differentiable on T, i.e. there exists C_Q ≥ 0 such that sup_{d∈{0,1}} |Q'''_d(τ; x)| ≤ C_Q for all x ∈ R^p and τ ∈ T.

Condition 3 is standard in high-dimensional statistics.
We introduce it to analyze the rank-score balancing program (8), but it also simplifies the theoretical analysis of the quantile regression program (6). The specific formulation of sub-Gaussianity is convenient because it allows us to relate higher moments of (sparse) linear combinations X'u to (sparse) eigenvalues of their covariance and second moment matrix. Notably, we do not impose any moment conditions on the response variable Y ∈ R.

Condition 4 is a technical condition pertaining to the estimation of the rank-score balancing weights. For example, if the density weighted covariates f_{Y_d|X}(X'θ_d(τ) | X) X_j, j = 1, ..., p, follow an AR(q)-process and ‖z‖_0 ≤ s, then s_v ≤ (2q − 1)s. In principle, this condition can be relaxed to approximate sparsity. Conditions 5 and 6 are common in the literature on high-dimensional quantile regression (Belloni and Chernozhukov, 2011; Chao et al., 2017; Belloni et al., 2019c). They are relevant for establishing weak convergence of the rank-score balanced HQTE process to a Gaussian process in ℓ^∞(T). Condition 7 was introduced only recently in Belloni et al. (2019b) as part of the sufficient conditions for establishing uniform (in τ ∈ T) consistency of the non-parametric estimates of the conditional densities in (7). This condition can be relaxed to Q_d(τ; X) belonging to a Hölder class of functions, which is a common assumption in non-parametric (quantile) spline estimation (He and Shi, 1994; He et al., 2013).

We now collect definitions and conditions related to the population and sample design matrices.

Definition 1 (s-sparse maximum eigenvalues).
We define the s-sparse maximum eigenvalues of the population and sample design matrices by

φ_max,d(s) := sup_{u: ‖u‖_0 ≤ s} E[(X'u)² 1{D = d}] / ‖u‖_2²  and  φ̂_max,d(s) := sup_{u: ‖u‖_0 ≤ s} n^{-1} Σ_{i: D_i=d} (X_i'u)² / ‖u‖_2²,

and φ_max(s) := sup_{d∈{0,1}} φ_max,d(s) and φ̂_max(s) := sup_{d∈{0,1}} φ̂_max,d(s).

Definition 2 (Cone of (
J, ϑ)-dominated coordinates). For J ⊆ {1, ..., p} and ϑ ∈ [0, ∞] we define the cone of (J, ϑ)-dominated coordinates by C_p(J, ϑ) := {u ∈ R^p : ‖u_{J^c}‖_1 ≤ ϑ ‖u_J‖_1}.

In the above definition, the parameter ϑ ∈ [0, ∞] controls the (approximate) sparsity level of vectors u ∈ C_p(J, ϑ): If ϑ = 0, then vectors u ∈ C_p(J, 0) are |J|-sparse and only entries u_j with indices j ∈ J are non-zero. In contrast, if ϑ = ∞, then C_p(J, ∞) = R^p.

Definition 3 ((ω, ϑ)-restricted minimum eigenvalue of the density weighted design matrix). Let T be a compact subset of (0, 1), ω, ϑ ≥ 0, and J ⊆ {1, ..., p}. Set A(ϑ, J) := C_p(J, ϑ) ∩ ∂B_p(0, 1) and κ(ω, ϑ, J) := inf_{u∈A(ϑ,J)} E[f^ω_{Y_d|X}(X'θ_d(τ) | X)(X'u)² 1{D = d}]. We define the (ω, ϑ)-restricted minimum eigenvalue of the density weighted design matrix as

κ(ω, ϑ) := inf_{d∈{0,1}} inf_{τ∈T} κ(ω, ϑ, T_{v_d}(τ) ∪ T_{θ_d}(τ)),

where T_{v_d}(τ) and T_{θ_d}(τ) are as in Conditions 4 and 5, respectively.

Condition 8 (Bounds on sparse maximum eigenvalues). Recall s_v, s_θ ≥ 1 from Conditions 4 and 5, respectively. There exists an absolute constant φ̄_max ≥ 1 such that φ_max(2s_v) ∨ φ_max(2s_θ) ≤ φ̄_max.

Condition 9 (Bounds on restricted minimum eigenvalues of the density weighted design matrix). There exists an absolute constant κ > 0 such that κ(2, ∞) ∧ κ(2, 2) ∧ κ(1, 2) ≥ κ.

Condition 10 (Empirical sparsity bounds). There exists an absolute constant φ̄_emp ≥ 1 such that with probability 1 − o(1), φ_max(n_d / log(n_d p)) ∨ φ̂_max(n_d / log(n_d p)) ≤ φ̄_emp, d ∈ {0, 1}.

Condition 11 (ϱ-restricted identifiability). Let T be a compact subset of (0, 1) and ϱ ≥ 0. Set A_d(τ, 2) := C_p(T_{θ_d}(τ), 2) ∩ ∂B_p(0, 1), where T_{θ_d}(τ) = support(θ_d(τ)) as in Condition 5. The quantile regression vectors θ_0(τ) and θ_1(τ) are ϱ-restricted identifiable if

inf_{d∈{0,1}} inf_{τ∈T} inf_{u∈A_d(τ,2)} (E[f_{Y_d|X}(X'θ_d(τ) | X)(X'u)² 1{D = d}])^{3/2} / (L_f E[|X'u|³ 1{D = d}]) ≳ ϱ.

Bounds on maximum and minimum sparse eigenvalues are standard in high-dimensional statistics. Concerning Condition 8, the upper bound on φ_max(2s_v) is used in the analysis of the rank-score balancing program (8), while the upper bound on φ_max(2s_θ) is relevant in the analysis of the ℓ_1-penalized quantile regression program (6). Condition 9 is the analogue of the restricted eigenvalue condition in least squares (Candes and Tao, 2007; Bickel et al., 2009). We use the lower bounds on κ(2, ∞) and κ(2, 2) in the analysis of the rank-score balancing program and the lower bound on κ(1, 2) in the analysis of the ℓ_1-penalized quantile regression program. Condition 10 is needed to control the empirical sparsity of the ℓ_1-penalized quantile regression vectors. For example, centered Gaussian covariates X_j, j = 1, ..., p, that follow an AR(p)-process satisfy this condition (Belloni and Chernozhukov, 2011, 2013). Condition 11 guarantees that the objective of the ℓ_1-penalized quantile regression program (6) can be locally minorized by a quadratic function. It is slightly more stringent than the restricted identifiability and nonlinearity condition D.5 in Belloni and Chernozhukov (2011). However, Belloni and Chernozhukov (2011) require an additional upper bound on the growth rate of the sparsity level s_θ (eq. (3.9)). Together, these two conditions are equivalent to Condition 11 (in terms of implied growth rates on s_θ, n, p).

Remark 1.
To simplify the presentation of the theoretical results in the next sections, we suppress the precise nature of how regularization parameters and upper bounds depend on the quantities f̄, L_f, L_θ, C_Q, φ̄_max, κ, and φ̄_emp. For more detailed (non-asymptotic) statements of the theorems we refer to the supplementary materials. As far as asymptotic results are concerned, we will assume that all suppressed quantities are bounded away from zero and infinity as n, p → ∞.

In this section we establish weak convergence of the rank-score balanced CQF and HQTE processes,

{√n (Q̂_d(τ; z) − Q_d(τ; z)) : τ ∈ T}  and  {√n (α̂(τ; z) − α(τ; z)) : τ ∈ T}.

The large sample properties of these processes are needed whenever one would like to conduct inference on the HQTE curve on more than just one quantile at a time. For example, statistical comparisons of the HQTE across different quantiles require simultaneous confidence bands that hold for all quantiles under consideration. Similarly, testing hypotheses about subsets of quantiles requires constructing rejection regions that hold across these quantiles. In both cases, process methods provide a natural way of addressing these problems. We provide concrete examples below.

In order to formulate the theoretical results we introduce the following operator. For τ_1, τ_2 ∈ T arbitrary, define H_d(τ_1, τ_2; z) := lim_{n→∞} (τ_1 ∧ τ_2 − τ_1 τ_2) H_d^{(n)}(τ_1, τ_2; z), where

H_d^{(n)}(τ_1, τ_2; z) := v_d'(τ_1; z) E[f_{Y_d|X}(X'θ_d(τ_1) | X) f_{Y_d|X}(X'θ_d(τ_2) | X) XX' 1{D = d}] v_d(τ_2; z),

and v_d(τ; z) := −(E[f_{Y_d|X}(X'θ_d(τ) | X) XX' 1{D = d}])^{-1} z. Note that H_d^{(n)}(τ_1, τ_2; z) depends on n since we assume that the dimension p grows with the sample size n.
It is easy to verify that $H_d(\tau_1, \tau_2; z) < \infty$ for all $\tau_1, \tau_2 \in \mathcal{T}$ whenever Conditions 6(ii) and 9 hold and $\|z\| = O(1)$. The following theorem establishes joint asymptotic normality of the entire rank-score balanced CQF process.

Theorem 1 (Weak convergence of the rank-score balanced CQF process). Let $\mathcal{T}$ be a compact subset of $(0, 1)$. Suppose that Conditions 1–10 and Condition 11 with $\varrho = \sqrt{(s_v + s_\theta)\log(np)/n}$ hold, and $(s_v + s_\theta)\log(np)\log(n) = o(nh)$, $h s_v = o(1)$, and $\|z\| = O(1)$. If $\lambda_d \asymp \sqrt{\varphi_{\mathrm{emp}}}\sqrt{n\log(np)}$ and $\gamma_d \asymp \|z\|\big(h^{-1} s_\theta \log(np) + \sqrt{n}\, h\big)\sqrt{n}$, then
$$\sqrt{n}\big(\widehat Q_d(\cdot; z) - Q_d(\cdot; z)\big) \rightsquigarrow G_d(\cdot; z) \quad \text{in } \ell^\infty(\mathcal{T}),$$
where $G_d(\cdot; z)$ is a centered Gaussian process with covariance function $(\tau_1, \tau_2) \mapsto H_d(\tau_1, \tau_2; z)$.

Remark 2 (On the proof strategy). The key to this result is a non-asymptotic Bahadur-type representation based on the dual formulation of the rank-score balancing program. Establishing the non-asymptotic Bahadur-type representation is non-trivial and we elaborate on this in greater detail in Section 4.3. With this Bahadur-type representation we can establish asymptotic equicontinuity of $\tau \mapsto \sqrt{n}(\widehat Q_d(\tau; z) - Q_d(\tau; z))$ and finite dimensional convergence of $\{\sqrt{n}(\widehat Q_d(\tau_j; z) - Q_d(\tau_j; z)) : 1 \le j \le K\}$ to a Gaussian random vector. The conclusion of the theorem follows by combining these two facts. While these facts are conceptually straightforward, the proofs require some new empirical process techniques since $p \ge n$ (Giessing, 2020).

Suppose, for the moment, that $p$ is fixed. Then Theorem 1 implies that
$$\sqrt{n}\big(\widehat Q_d(\tau; z) - Q_d(\tau; z)\big) \rightsquigarrow N\Big(0,\; \tau(1-\tau)\, z'\big(E[f_{Y_d|X}(X'\theta_d(\tau) \mid X)\, XX'\,\mathbf{1}\{D = d\}]\big)^{-1} z\Big).$$
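In the intercept-only special case ($X \equiv 1$, $z = 1$), the display above reduces to the classical result $\sqrt{n}(\hat q_\tau - q_\tau) \rightsquigarrow N(0, \tau(1-\tau)/f^2(q_\tau))$ for the sample quantile. A quick Monte Carlo sanity check of this special case (illustrative, not the authors' code; standard normal data, so $f(q_{0.5}) = 1/\sqrt{2\pi}$):

```python
import numpy as np

rng = np.random.default_rng(1)
tau, n, reps = 0.5, 2000, 500
f_q = 1.0 / np.sqrt(2.0 * np.pi)          # N(0,1) density at the median
target_var = tau * (1 - tau) / f_q**2     # asymptotic variance = pi/2

# sqrt(n) * (sample median - true median), replicated `reps` times
stats = np.array([np.sqrt(n) * np.quantile(rng.normal(size=n), tau)
                  for _ in range(reps)])

assert abs(stats.mean()) < 0.25                   # centered at zero
assert abs(stats.var() / target_var - 1) < 0.25   # variance ~ tau(1-tau)/f^2
```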
What is of interest here is that the asymptotic variance $\tau(1-\tau)\, z'(E[f_{Y_d|X}(X'\theta_d(\tau) \mid X)\, XX'\,\mathbf{1}\{D = d\}])^{-1} z$ is known to be the semi-parametric efficiency bound for all estimators of the linear conditional quantile function (Newey and Powell, 1990). In particular, the rank-score balanced estimator of the CQF is as efficient as the estimate of the CQF based on the weighted quantile regression program (Koenker, 2005b; Koenker and Zhao, 1994; Zhao, 2001). This lends further support to the heuristic arguments made in Section 3.3.

Since the rank-score balanced estimates of $Q_1(\tau; z)$ and $Q_0(\tau; z)$ are asymptotically independent, Theorem 1 and the Continuous Mapping Theorem yield the following result for the HQTE process.

Theorem 2 (Weak convergence of the rank-score balanced HQTE process). Let $\mathcal{T}$ be a compact subset of $(0, 1)$. Under the conditions of Theorem 1,
$$\sqrt{n}\big(\widehat\alpha(\cdot; z) - \alpha(\cdot; z)\big) \rightsquigarrow G_1(\cdot; z) + G_0(\cdot; z) \quad \text{in } \ell^\infty(\mathcal{T}),$$
where $G_1(\cdot; z)$, $G_0(\cdot; z)$ are independent, centered Gaussian processes with covariance functions $(\tau_1, \tau_2) \mapsto H_d(\tau_1, \tau_2; z)$, $d \in \{0, 1\}$.

The takeaway from Theorem 2 is that the HQTE process converges weakly to the sum of two independent centered Gaussian processes. We illustrate Theorem 2 with four examples; for more elaborate applications of process weak convergence in the context of quantile regression we refer to Belloni et al. (2019a); Chao et al. (2017); Angrist et al. (2006); Chernozhukov and Fernández-Val (2005).
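All of the examples that follow reduce to simulating the limit process $G_1 + G_0$ on a grid of quantile levels. A minimal sketch (illustrative; the Brownian-bridge kernel below is a hypothetical stand-in for plug-in estimates of the $H_d$'s) that approximates the quantile of the supremum statistic used for simultaneous inference:

```python
import numpy as np

def sup_stat_quantile(taus, K1, K0, alpha=0.95, reps=2000, seed=2):
    """Approximate the alpha-quantile of sup_tau |G1 + G0|, where G1, G0
    are independent centered Gaussian processes with covariance matrices
    K1, K0 on the grid `taus` (plug-in estimates in practice)."""
    rng = np.random.default_rng(seed)
    m = len(taus)
    G1 = rng.multivariate_normal(np.zeros(m), K1, size=reps)
    G0 = rng.multivariate_normal(np.zeros(m), K0, size=reps)
    sups = np.abs(G1 + G0).max(axis=1)
    return np.quantile(sups, alpha)

taus = np.linspace(0.1, 0.9, 17)
# Toy covariances: the Brownian-bridge kernel min(t1,t2) - t1*t2.
K = np.minimum.outer(taus, taus) - np.outer(taus, taus)
kappa = sup_stat_quantile(taus, K, K)
assert 1.0 < kappa < 4.0
```

In practice $K_1$ and $K_0$ would be matrices of uniformly consistent covariance estimates evaluated on the grid, as constructed in Section 4.4.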
Example 1 (Asymptotic normality of the HQTE estimator). For a fixed quantile $\tau \in \mathcal{T}$, Theorem 2 implies that $\sqrt{n}(\widehat\alpha(\tau; z) - \alpha(\tau; z))$ is asymptotically normal with mean zero and variance $\sigma^2(\tau; z) := \lim_{n\to\infty} \sigma^2_{(n)}(\tau; z)$, where
$$\sigma^2_{(n)}(\tau; z) := \tau(1-\tau)\, z'\Big[\big(\pi_1 E[f_{Y_1|X}(X'\theta_1(\tau) \mid X)\, XX' \mid D = 1]\big)^{-1} + \big(\pi_0 E[f_{Y_0|X}(X'\theta_0(\tau) \mid X)\, XX' \mid D = 0]\big)^{-1}\Big] z,$$
where $0 < \pi_1 = 1 - \pi_0 = P\{D = 1\} < 1$.

Example 2 (Joint asymptotic normality of the HQTE estimator at finitely many quantiles). Consider a finite collection of quantile levels $\{\tau_1, \ldots, \tau_K\} \subset \mathcal{T}$. Theorem 2 implies that the collection $\sqrt{n}(\widehat\alpha(\tau_j; z) - \alpha(\tau_j; z))$, $j = 1, \ldots, K$, is jointly asymptotically normal with mean zero and covariance matrix $\Sigma = (H_1(\tau_j, \tau_k; z) + H_0(\tau_j, \tau_k; z))_{j,k=1}^K$.

Example 3 (Uniform confidence bands for the HQTE curve). Define $K(z) := \sup_{\tau\in\mathcal{T}} |G_1(\tau; z)/\sigma(\tau; z) + G_0(\tau; z)/\sigma(\tau; z)|$, where $\sigma^2(\tau; z)$ is the variance from Example 1. Let $\hat\kappa(\alpha; z)$ and $\hat\sigma_n(\tau; z)$ be (uniformly) consistent estimates of the $\alpha$ quantile of $K(z)$ and of $\sigma(\tau; z)$, respectively. Then,
$$\lim_{n\to\infty} P\Big\{\alpha(\tau; z) \in \Big[\widehat\alpha(\tau; z) \pm \hat\kappa(\alpha; z)\,\frac{\hat\sigma_n(\tau; z)}{\sqrt{n}}\Big],\ \forall\, \tau \in \mathcal{T}\Big\} = \alpha.$$
A consistent estimate $\hat\kappa(\alpha; z)$ can be obtained via a simulation-based bootstrap, i.e. by sampling from $\widehat K(z) = \sup_{\tau\in\mathcal{T}} |\widetilde G_1(\tau; z) + \widetilde G_0(\tau; z)|$, where $\widetilde G_1(\tau; z)$ and $\widetilde G_0(\tau; z)$ are independent centered Gaussian processes with covariance functions based on uniformly consistent plug-in estimates of the operators $(\tau_1, \tau_2) \mapsto H_d(\tau_1, \tau_2; z)/(\sigma(\tau_1; z)\sigma(\tau_2; z))$, $d \in \{0, 1\}$.

Example 4 (Asymptotic theory for the integrated HQTE curve).
Assessing the HQTE at a specific quantile is often less relevant than assessing the average HQTE over a continuum of quantile levels $\mathcal{T}$ (e.g., lower, middle, or upper quantiles). In such cases, it is natural to consider the integrated HQTE. Theorem 2 and the continuous mapping theorem imply that $\sqrt{n}\int_{\mathcal{T}}\big(\widehat\alpha(\tau; z) - \alpha(\tau; z)\big)\, d\tau \rightsquigarrow I(z)$, where $I(z) := \int_{\mathcal{T}} G_1(\tau; z)\, d\tau + \int_{\mathcal{T}} G_0(\tau; z)\, d\tau$. While the random variable $I(z)$ is not distribution-free, its distribution can be approximated via re-sampling techniques (Chernozhukov and Fernández-Val, 2005).

4.3 The dual of the rank-score balancing program

In this section we introduce the dual to the rank-score balancing program (8) and explain its pivotal role in the proofs of the weak convergence results in Section 4.2. The dual program is also important for constructing uniformly consistent estimates of the covariance function in Section 4.4.

Observe that the solution to the rank-score balancing program (8) can be written as $\widehat w(\tau; z) = \widehat w_1(\tau; z) + \widehat w_0(\tau; z)$, with the $\widehat w_d(\tau; z)$'s being the solutions to two independent optimization problems:
$$\widehat w_d(\tau; z) \in \arg\min_{w \in \mathbb{R}^n} \Big\{ \sum_{i=1}^n w_i^2\, \hat f_i^{-2}(\tau) \;:\; \Big\| z - \frac{1}{\sqrt{n}} \sum_{i: D_i = d} w_i X_i \Big\|_\infty \le \frac{\gamma_d}{n} \Big\}, \quad d \in \{0, 1\}. \tag{11}$$
These two optimization problems have the following two duals:
$$\hat v_d(\tau; z) \in \arg\min_{v \in \mathbb{R}^p} \frac{1}{4n} \sum_{i: D_i = d} \hat f_i^2(\tau)\,(X_i' v)^2 + z'v + \frac{\gamma_d}{n} \|v\|_1, \quad d \in \{0, 1\}. \tag{12}$$
Provided that strong duality holds, we can estimate the rank-score balancing weights $\widehat w_d(\tau; z)$ either by solving the primal problems (11) or by solving the dual problems (12) and exploiting the explicit relationship between primal and dual solutions. To be precise, we have the following result:

Lemma 1 (Dual characterization of the rank-score balancing program).
(i) Programs (11) and (12) form a primal-dual pair.

(ii) Let $\delta \in (0, 1)$ and $d \in \{0, 1\}$. Suppose that Condition 3 holds. There exists an absolute constant $c > 0$ such that for all $\gamma_d > 0$ that satisfy $\gamma_d \ge c\, \bar\varphi_{\max}\, \bar\kappa^{-1}\, \bar f\, \|z\| \sqrt{n \log(p/\delta)}$, we have with probability at least $1 - \delta$, for all $1 \le i \le n$ and $\tau \in \mathcal{T}$,
$$\widehat w_{d,i}(\tau; z) = \begin{cases} -\dfrac{\hat f_i^2(\tau)}{2\sqrt{n}}\, X_i' \hat v_d(\tau; z), & i \in \{j : D_j = d\}, \\[4pt] 0, & i \notin \{j : D_j = d\}, \end{cases}$$
where $\widehat w_d(\tau; z)$ and $\hat v_d(\tau; z)$ are the solutions to the programs (11) and (12), respectively.

The important takeaway from Lemma 1 is that, with high probability, for $\gamma_d > 0$,
$$\widehat Q_d(\tau; z) = z'\hat\theta_d(\tau) - \frac{1}{n} \sum_{i: D_i = d} \hat f_i(\tau)\big(\tau - \mathbf{1}\{Y_i \le X_i'\hat\theta_d(\tau)\}\big)\, X_i' \hat v_d(\tau; z). \tag{13}$$
Thus, while the original formulation of the rank-score balanced estimator involves a complicated sum over the rank-score balancing weights $\widehat w_1(\tau; z), \ldots, \widehat w_n(\tau; z)$, the dual formulation is a simple linear function of the dual solution $\hat v_d(\tau; z) \in \mathbb{R}^p$. Therefore, we can expect that (at least for fixed $\tau$ and $p$) the rank-score balanced estimator can be approximated by a sum of $n$ independent and identically distributed random variables. The following non-asymptotic Bahadur-type representation is a significantly refined version of this statement (holding uniformly in $\tau \in \mathcal{T}$ and for $p \ge n$). It is key to the weak convergence results in Section 4.2.

Lemma 2 (Bahadur-type representation). Let $\mathcal{T}$ be a compact subset of $(0, 1)$, $\delta \in (0, 1)$, $\hat s_d := \sup_{\tau\in\mathcal{T}} \|\hat\theta_d(\tau)\|_0$, $\hat r := \sqrt{(s_v + s_\theta + \hat s_d)\log(np/\delta)/n}$ and $r := \sqrt{(s_v + s_\theta)\log(np/\delta)/n}$. Suppose that Conditions 1–9 and Condition 11 with $\varrho = r$ hold, $r\sqrt{\log n} \lesssim 1$, and $\sqrt{n}\sqrt{s_v}\, h^{-1} r^2 + \sqrt{s_v}\, h \lesssim 1$.
If $\lambda_d \asymp \sqrt{n\log(p/\delta)}$ and $\gamma_d \asymp \|z\|\big(h^{-1} s_\theta \log(np/\delta) + \sqrt{n}\, h\big)\sqrt{n}$, then
$$\widehat Q_d(\tau; z) - Q_d(\tau; z) = -\frac{1}{n} \sum_{i: D_i = d} f_{Y_d|X}(X_i'\theta_d(\tau) \mid X_i)\big(\tau - \mathbf{1}\{Y_i \le X_i'\theta_d(\tau)\}\big)\, X_i' v_d(\tau; z) + e_d(\tau; z),$$
and, with probability at least $1 - \delta$,
$$\sup_{\tau\in\mathcal{T}} |e_d(\tau; z)| \le C\big(\|z\| \vee 1\big)\big(\widehat R_n \vee R_n\big),$$
where $v_d(\tau; z) = -\big(E[f_{Y_d|X}(X'\theta_d(\tau) \mid X)\, XX'\,\mathbf{1}\{D = d\}]\big)^{-1} z$, $\widehat R_n = \hat r^2 \log(n) + \hat r^{3/2}$, $R_n = \sqrt{n}\, h^{-1} r^2 + hr + h^2$, and $C > 0$ depends on $\bar f$, $L_f$, $L_\theta$, $C_Q$, $\bar\varphi_{\max}$, and $\kappa$ only.

Remark 3 (Empirical sparsity $\hat s_d$). A distinctive feature of this Bahadur-type representation is that the upper bound on the remainder term $|e_d(\tau; z)|$ depends on the empirical sparsity $\hat s_d = \sup_{\tau\in\mathcal{T}} \|\hat\theta_d(\tau)\|_0$. This implies that the upper bound is itself random. The proof of this representation is thus quite delicate and requires uniform bounds over sparsity levels $1 \le s \le n \wedge p$.

Remark 4 (Magnitude of $\widehat R_n$). For $\lambda_d > 0$ large enough, $\hat s_d \lesssim s_\theta$ almost surely and we can substitute $r$ for $\hat r$. We thus obtain $\widehat R_n \lesssim (s_v + s_\theta)^{3/4} \log^{3/4}(np/\delta) \log^{1/2}(n)/n^{3/4}$. Up to the log-factors this rate matches the optimal rate of the residuals of the Bahadur representation for classical quantile regression estimators in low dimensions (Zhou and Portnoy, 1996).

Remark 5 (Magnitude of $R_n$). Recall that $h > 0$ is the bandwidth of the non-parametric density estimators given in (7). The particular way in which the bandwidth enters $R_n$ is the result of the two-fold dependence of the rank-score balanced estimator on the non-parametric density estimates: a direct dependence via $\hat f_i(\tau)$ and an indirect dependence via $\hat v_d(\tau; z)$.
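Since the dual (12) is an $\ell_1$-regularized quadratic program, it can be solved by simple first-order methods. The following proximal-gradient (ISTA) sketch mirrors the generic form of (12), with hypothetical inputs standing in for the rank scores $\hat f_i(\tau)$; it is an illustration, not the authors' ADMM implementation:

```python
import numpy as np

def solve_dual(X, f_hat, z, gamma, steps=1000):
    """Proximal gradient (ISTA) for the l1-regularized quadratic program
        min_v (1/(4n)) * sum_i f_hat[i]^2 (x_i'v)^2 + z'v + (gamma/n) ||v||_1."""
    n, p = X.shape
    Q = (X * (f_hat ** 2)[:, None]).T @ X / (2.0 * n)  # Hessian of quadratic part
    step = 1.0 / np.linalg.eigvalsh(Q).max()
    v = np.zeros(p)
    for _ in range(steps):
        u = v - step * (Q @ v + z)                      # gradient step
        v = np.sign(u) * np.maximum(np.abs(u) - step * gamma / n, 0.0)  # prox
    return v

rng = np.random.default_rng(3)
n, p, gamma = 200, 10, 5.0
X = rng.normal(size=(n, p))
f_hat = 0.5 + rng.random(n)        # hypothetical estimated rank scores
z = np.eye(p)[0]
v_hat = solve_dual(X, f_hat, z, gamma)

def objective(v):
    return (f_hat**2 @ (X @ v)**2) / (4 * n) + z @ v + gamma / n * np.abs(v).sum()

# At the minimizer, no small coordinate perturbation improves the objective.
perturbs = np.vstack([np.eye(p), -np.eye(p)])
assert all(objective(v_hat) <= objective(v_hat + 1e-3 * d) + 1e-8 for d in perturbs)
```

The soft-thresholding step is the proximal operator of the $\ell_1$ penalty; the step size $1/\lambda_{\max}(Q)$ guarantees convergence for the smooth part.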
4.4 Uniformly consistent estimation of the covariance function

The weak convergence results and examples from Section 4.2 are only practically relevant together with an estimator of the asymptotic covariance function that is uniformly consistent in $\tau_1, \tau_2 \in \mathcal{T}$. Here, we show how to exploit the duality formalism from Section 4.3 to construct such estimators. An estimate of the function $(\tau_1, \tau_2) \mapsto H_d(\tau_1, \tau_2; z)$ is given by
$$\widehat H_d(\tau_1, \tau_2; z) := (\tau_1 \wedge \tau_2 - \tau_1\tau_2)\, \hat v_d'(\tau_1; z)\, \frac{1}{4n} \sum_{i: D_i = d} \hat f_i(\tau_1)\, \hat f_i(\tau_2)\, X_i X_i'\, \hat v_d(\tau_2; z),$$
where $\hat v_d(\tau; z)$ is the solution to the dual program (12) (see Section 4.3). By the following lemma this estimate is uniformly consistent in $\tau_1, \tau_2 \in \mathcal{T}$.

Lemma 3.
Recall the setup of Theorem 1 and set $r = \sqrt{(s_v + s_\theta)\log(np)/n}$. The following holds:
$$\sup_{\tau_1, \tau_2 \in \mathcal{T}} \big| \widehat H_d(\tau_1, \tau_2; z) - H_d(\tau_1, \tau_2; z) \big| = O_p\Big( \big(\|z\|^2 \vee 1\big)\big( \sqrt{n}\sqrt{s_v}\, h^{-1} r^2 + \sqrt{s_v}\, h \big) \Big).$$
As a consequence, a uniformly consistent estimate of the asymptotic variance $\sigma^2(\tau; z)$ of the HQTE process at a single quantile $\tau \in \mathcal{T}$ (see Example 1) is given by
$$\hat\sigma^2(\tau; z) := \frac{\tau(1-\tau)}{4n} \sum_{i: D_i = 1} \hat f_i^2(\tau)\big(X_i'\hat v_1(\tau; z)\big)^2 + \frac{\tau(1-\tau)}{4n} \sum_{i: D_i = 0} \hat f_i^2(\tau)\big(X_i'\hat v_0(\tau; z)\big)^2, \tag{14}$$
where $\hat v_1(\tau; z)$ and $\hat v_0(\tau; z)$ are the solutions to the dual problems (12). The duality formalism from Section 4.3 implies that another uniformly consistent estimate of $\sigma^2(\tau; z)$ is given by
$$\hat\sigma^2(\tau; z) := \tau(1-\tau) \sum_{i=1}^n \widehat w_i^2(\tau; z)\, \hat f_i^{-2}(\tau), \tag{15}$$
where the $\widehat w_i(\tau; z)$'s are the rank-score balancing weights. Neither of the two estimates requires inverting a (high-dimensional) matrix, which may be surprising given the form of the target $\sigma^2(\tau; z)$.

5 A practical guide to the rank-score balancing procedure
In the following, we explain how we implement the rank-score balancing procedure with the help of the dual problem. The rank-score balancing estimator of the HQTE depends on the four regularization parameters $\lambda_1, \lambda_0, \gamma_1, \gamma_0 > 0$ and on the bandwidth $h > 0$ of the density estimator; we discuss the selection of each in turn.

Selecting $\lambda_d$ for the $\ell_1$-penalized quantile regression program. To select $\lambda_d > 0$ we follow the proposal for the $\ell_1$-penalized quantile regression problem by Belloni and Chernozhukov (2011). That is, we compute the pilot estimate of $\theta_d(\tau)$ as
$$\hat\theta_d(\tau) \in \arg\min_{\theta \in \mathbb{R}^p} \sum_{i: D_i = d} \rho_\tau(Y_i - X_i'\theta) + \lambda_d \sqrt{\tau(1-\tau)} \sum_{k=1}^p \hat\sigma_{d,k} |\theta_k|, \tag{16}$$
with $\hat\sigma_{d,k}^2 = n^{-1} \sum_{i: D_i = d} X_{ik}^2$ and $\lambda_d = 1.1 \cdot \Lambda_d(0.9 \mid X_1, \ldots, X_n)$, where $\Lambda_d(0.9 \mid X_1, \ldots, X_n)$ is the 90%-quantile of $\Lambda_d \mid X_1, \ldots, X_n$ and
$$\Lambda_d := \sup_{\tau\in\mathcal{T}} \max_{1 \le k \le p} \Big| \sum_{i: D_i = d} \frac{(\tau - \mathbf{1}\{U_i \le \tau\})\, X_{ik}}{\hat\sigma_{d,k} \sqrt{\tau(1-\tau)}} \Big|,$$
with $U_1, \ldots, U_n$ i.i.d. Uniform(0,1) random variables, independent of $X_1, \ldots, X_n$.

Selecting $\gamma_d$ for the rank-score balancing program. Recall the primal and dual programs (11) and (12), respectively. Provided that strong duality holds, we can estimate the rank-score balancing weights $\widehat w(\tau; z)$ by solving either of the two problems. However, from a statistical and computational point of view, it is preferable to solve the dual problems. First, since the dual programs (12) are unconstrained optimization problems, they allow us to choose the tuning parameter $\gamma_d > 0$ data-dependently, by selecting a $\gamma_d > 0$ whose implied imbalance $\|z - n^{-1/2}\sum_{i: D_i = d} \widehat w_i X_i\|_\infty$ yields weights whose risk is comparable to the one of the optimal weights.
A smaller $\gamma_d > 0$ enforces a tighter balance of the covariates at the price of more variable weights. Second, both problems can be solved with standard convex optimization software: we solve the primal problem using the R package CVXR (Fu et al., 2017), and the dual problem by applying the Alternating Direction Method of Multipliers (ADMM) for $\ell_1$-regularized quadratic programs (Wahlberg et al., 2012) via the R package accSDA (Atkins et al., 2017). Third, since the dual programs do not involve the inverses of the estimated densities $\hat f_i(\tau)$, they are numerically more stable than the primal problems. Therefore, in the simulation studies and the real data analysis we only report results obtained via the dual problem.

Selecting the bandwidth $h$ for the non-parametric density estimator. To stabilize the density estimator (7), we replace the $\ell_1$-penalized quantile regression estimates with refitted quantile regression estimates. The refitted estimates are obtained by fitting a quantile regression model to the data using only the covariates in the support set of $\hat\theta_d(\tau)$. As this density estimator takes a similar form as the one in Belloni et al. (2019c), we follow their advice on the bandwidth and set $h = \min\{n^{-1/6}, \tau(1-\tau)/2\}$.

6 Simulation studies

We carry out simulation studies to investigate the performance of the rank-score balanced estimator. The goals of the simulation studies are to: (1) illustrate that our rank-score balanced estimator provides a consistent estimate of the HQTE with nominal-level coverage probabilities, (2) showcase that the rank-score balanced estimator is more efficient than the unweighted quantile regression estimator, and (3) provide numerical evidence supporting the theoretical results from Section 4.2.
Our simulation design mimics high-dimensional observational studies in which treatments are assigned based on covariates. We consider the following generative model:
$$Y_1 = X'\theta_1 + \varepsilon\,\sigma_1(X), \quad Y_0 = X'\theta_0 + \varepsilon\,\sigma_0(X), \quad X \perp\!\!\!\perp \varepsilon, \quad \varepsilon \sim N(0, 1),$$
$$D \mid X \sim \mathrm{Bernoulli}\Big(\frac{e^{-X_2 + X_3}}{1 + e^{-X_2 + X_3}}\Big), \quad Y = DY_1 + (1 - D)Y_0,$$
with $X_1 = 1$, $X_2 = |W_1| + 0.{\cdot}$, $X_3 = W_2 + 0.{\cdot}$, $X_j = W_{j-1}$ for $4 \le j \le p$, and $W \sim N(0, \Sigma)$, where $\Sigma = (\Sigma_{jk})_{j,k=1}^{p-1}$ and $\Sigma_{jk} = 0.{\cdot}^{|j-k|}$. We set $\theta_0 = (0.{\cdot}, {\cdot}, {\cdot}, -{\cdot}, {\cdot}, 0, \ldots, 0)' \in \mathbb{R}^p$ and consider the following three scenarios for $\theta_1$: sparse ($\theta_1 \propto (1, 1, \ldots, 1, 0, \ldots, 0)'$), dense ($\theta_1 \propto (1, 1/\sqrt{2}, \ldots, 1/\sqrt{p})'$) and pseudo-dense ($\theta_1 \propto (1, 1/2, \ldots, 1/p)'$). We consider three different signal strengths $\|\theta_1\| \in \{1, 2, 4\}$. We choose the sample size $n$ and the dimension of the covariates $p$ from $(n, p) \in \{(600, 400), (1000, {\cdot})\}$. As we estimate the CQF separately by using the observed data in the treated and control groups, the effective sample size for our rank-score balancing program, $n_d$, is approximately half of the sample size. Thus, the effective sample size is always less than $p$. For the noise level $\sigma_d(X)$ we consider two scenarios: homoscedastic noise, $\sigma_1(X) = \sigma_0(X) = 1$, and heteroscedastic noise, $\sigma_d(X) = (1-d)X_2 + dX_3$ for $d \in \{0, 1\}$. Lastly, we set $z = (0, 1/\sqrt{2}, 1/\sqrt{2}, 0, \ldots, 0)'$. Under this data generating process, the HQTE at $z$ is the linear function $\alpha(\tau; z) = z'(\theta_1(\tau) - \theta_0(\tau))$.

We implement the rank-score balanced estimator as discussed in Section 5. In particular, this means that even in the case of homoscedastic noise we do not use a specialized density estimator that could exploit this extra information. Since in practice homoscedasticity may be difficult to detect, we do not want to rely on the validity of the homoscedasticity assumption.
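The generative model above can be sketched as follows (illustrative; the offsets `c2`, `c3` and the correlation parameter `rho` are hypothetical stand-ins for constants that are not fully specified above):

```python
import numpy as np

def simulate(n, p, theta1, theta0, rho=0.5, c2=0.1, c3=0.1, seed=6):
    """Draw one sample: Gaussian W with correlation rho^|j-k|, covariates X,
    logistic treatment assignment based on X2, X3, and outcomes Y1, Y0."""
    rng = np.random.default_rng(seed)
    idx = np.arange(p - 1)
    Sigma = rho ** np.abs(idx[:, None] - idx[None, :])
    W = rng.multivariate_normal(np.zeros(p - 1), Sigma, size=n)
    X = np.empty((n, p))
    X[:, 0] = 1.0
    X[:, 1] = np.abs(W[:, 0]) + c2
    X[:, 2] = W[:, 1] + c3
    X[:, 3:] = W[:, 2:]
    eps = rng.normal(size=n)
    Y1, Y0 = X @ theta1 + eps, X @ theta0 + eps     # homoscedastic case
    e = np.exp(-X[:, 1] + X[:, 2])
    D = rng.binomial(1, e / (1 + e))                # logistic propensity
    return X, D, np.where(D == 1, Y1, Y0)

theta1 = np.zeros(50); theta1[:5] = 1.0             # sparse signal
theta0 = np.zeros(50); theta0[0] = 0.5
X, D, Y = simulate(600, 50, theta1, theta0)
assert X.shape == (600, 50) and set(np.unique(D)) <= {0, 1}
```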
To illustrate the bias-variance trade-off that underlies the tuning parameter $\gamma_d$ we report results not just for the "1SE" rule ("Rank-1SE") but also for a "2SE" rule ("Rank-2SE"). The "2SE" rule chooses the smallest $\gamma_d > 0$ whose cross-validated risk is within two standard errors of the minimum. As a benchmark we also compute an (unweighted) "Oracle" estimate, obtained by fitting a quantile regression model with the covariates in the support set of $\theta_d(\tau)$ only. The (unweighted) oracle estimate of the HQTE is $\widehat\alpha_{\mathrm{oracle}}(\tau; z) = z'(\hat\theta_1^{\mathrm{oracle}}(\tau) - \hat\theta_0^{\mathrm{oracle}}(\tau))$. We compute this (unweighted) oracle estimate only in the scenario with sparse $\theta_d(\tau)$. The "Refit" method is the following two-step procedure: we first obtain estimates $\hat\theta_d(\tau)$ by solving the weighted $\ell_1$-penalized quantile regression program (16). Then, we compute a refitted estimate $\hat\theta_d^{\mathrm{refit}}(\tau)$ by fitting a quantile regression model based only on the covariates in the support set of $\hat\theta_d(\tau)$. The refitted estimate of the HQTE is $\widehat\alpha_{\mathrm{refit}}(\tau; z) = z'(\hat\theta_1^{\mathrm{refit}}(\tau) - \hat\theta_0^{\mathrm{refit}}(\tau))$. The "Lasso" method refers to simply using the estimates $\hat\theta_d(\tau)$ from the $\ell_1$-penalized quantile regression program (16) without any adjustments. The Lasso estimate of the HQTE is thus $\widehat\alpha_{\mathrm{lasso}}(\tau; z) = z'(\hat\theta_1^{\mathrm{lasso}}(\tau) - \hat\theta_0^{\mathrm{lasso}}(\tau))$.

Confidence intervals for the rank-score balanced estimator are based on the asymptotic normality results in Section 4.2 and hold under the mild regularity conditions stated in Section 4.1. Confidence intervals for the (unweighted) oracle method are constructed using standard large sample theory (Angrist et al., 2006). Confidence intervals for the Lasso and the Refit method are constructed assuming that the selected models equal the true model, i.e. that the support sets of $\hat\theta_d^{\mathrm{lasso}}(\tau)$, $\hat\theta_d^{\mathrm{refit}}(\tau)$ equal the support set of $\theta_d(\tau)$. This assumption is satisfied under strong oracle conditions (Fan and Li, 2001).

Figure 1: Simulation results for homoscedastic data with $(n, p) = (1000, {\cdot})$ and sparse $\theta_1$ with $\|\theta_1\| = 4$. Panel (A): bias comparison. Panel (B): variance comparison. Panel (C): coverage probability of the confidence intervals, with nominal coverage probability 95%. Panel (D): histograms of the standardized estimates of the rank-score balanced estimator displayed in (17), with the density of $N(0, 1)$ in red.
We measure the performance of the estimators in terms of their biases (computed as the differences between the mean of the Monte Carlo estimates of $\alpha(\tau; z)$ and the true HQTE), variances (computed as the variances of the Monte Carlo estimates of $\alpha(\tau; z)$), and coverage probabilities of the confidence intervals with the nominal coverage probability of 95%. We provide finite sample comparisons through Table 1 and Figure 1 for homoscedastic data, and Table 2 and Figure 2 for heteroscedastic data. Details about the model parameters are given in the captions of these tables and figures. Our simulation results are evaluated through 1,000 Monte Carlo samples.

The main takeaway from the simulation study is that the rank-score balanced estimator with $\gamma_d$ selected by the 1SE rule outperforms the Refit and Lasso estimators in terms of bias, variance, and the validity of inference in most scenarios. In the following we highlight only three important conclusions. First, as expected, the rank-score balanced estimator performs better in sparse than in dense models. Second, the rank-score balanced estimator often has a smaller variance than the Oracle estimator, a phenomenon that is more obvious in the heteroscedastic than the homoscedastic case.

Figure 2: Simulation results for heteroscedastic data with $(n, p) = (1000, {\cdot})$ and sparse $\theta_1$ with $\|\theta_1\| = 4$. Panel (A): bias comparison. Panel (B): variance comparison. Panel (C): coverage probability of the confidence intervals, with nominal coverage probability 95%. Panel (D): histograms of the standardized estimates of the rank-score balanced estimator displayed in (17), with the density of $N(0, 1)$ in red.

Third, the asymptotic normality results from Section 4.2 continue to hold reasonably well in finite samples. This can be deduced from Figures 1(D) and 2(D), in which we provide histograms of the standardized estimates of the rank-score balanced estimator,
$$\hat\sigma^{-1}(\tau; z)\cdot\sqrt{n}\big(\widehat\alpha(\tau; z) - \alpha(\tau; z)\big), \tag{17}$$
where $\hat\sigma^2(\tau; z)$ is the estimate defined in eq. (14). These histograms neatly fit the overlaid N(0,1) densities. In contrast, the Lasso estimator is clearly biased (Figures 1(A) and 2(B)) and so is the Refit estimator in scenarios with smaller signal-to-noise ratio, i.e. scenarios with small $\|\theta_1(\tau)\|$ (Tables 1 and 2). These biases suggest that the oracle condition is violated and hence the finite sample distributions of these estimators may not be approximated by a standard normal distribution.

[Table 1: Bias, $n\times$variance, and coverage probabilities (nominal level 95%) of the Unweighted Oracle, Refit, Lasso, Rank-1SE, and Rank-2SE estimators for homoscedastic data with $n = 600$, $p = 400$ under sparse, pseudo-sparse, and dense signals; Monte Carlo standard errors in parentheses.]

[Table 2: Bias, $n\times$variance, and coverage probabilities (nominal level 95%) of the same estimators for heteroscedastic data with $n = 600$, $p = 400$ under sparse, pseudo-sparse, and dense signals; Monte Carlo standard errors in parentheses.]

7 Empirical application: the differential effect of mothers' education on infant birthweights

To illustrate the advantages of considering the HQTE, we apply the proposed method to the 2018 Natality Data published by the National Center for Health Statistics. We consider live, singleton births to Asian and Black mothers aged 40 to 45. This results in 832 Asian and 948 Black mothers. On average, we observe that the birthweight of infants born to Black mothers is 98.4 grams lower than the birthweight of infants born to Asian mothers. Low birthweight is known to be associated with a wide range of subsequent health problems. Consequently, there has been considerable interest in identifying effective public policy initiatives that reduce the incidence of low-birthweight infants. It is well documented in the literature (Bross and Shapiro, 1982; Gage et al., 2013) that education beyond high school is associated with a modest increase in birthweights. Focusing directly on the low-birthweight infants, our analysis suggests that higher maternal education reduces the birthweight difference between infants born to Asian and Black mothers, and that its effect is strongest in the lower tail of the birthweight distribution.

To this end, we define the treatment indicator $D_i = 1$ if mother $i$ has some college education and $D_i = 0$ if she has no more than a high school degree. This results in a data set of $n_0 = \sum_{i=1}^n (1 - D_i) = 570$ births to mothers with no more than a high school degree, and $n_1 = \sum_{i=1}^n D_i = 1,
210 births to mothers with some college degree. The response variable $Y_i$ is the individual infant birthweight. As for the covariates $X_i$, to improve the model fit, we replace six continuous variables (mother's age, mother's height, mother's weight gain during pregnancy, mother's pre-pregnancy weight, father's age, and number of cigarettes smoked during/before pregnancy) with their spline basis functions. In addition to these continuous variables, we include the following categorical variables: father's education level, mother's marital status, usage of induction/augmentation labor, usage of antibiotics/anesthesia during labor, gender of the infant, and a few of the mother's complications during pregnancy. To avoid handpicking important interaction terms to be included in the model, we introduce all possible $p = 1,
328 interaction terms as covariates via the R function model.matrix. The first component of $X_i$ equals 1; the second component, $X_{i2}$, is an indicator variable with $X_{i2} = 1$ if mother $i$ is Black and $X_{i2} = 0$ if she is Asian. Therefore, the coefficient of $X_{i2}$, $\theta_{d,2}(\tau)$, captures the differential effect of race in the treatment group $d$ after adjusting for the confounding effects of the other covariates. Since our goal is to investigate whether attending college can reduce the birthweight difference between infants born to Asian and Black mothers, we estimate
$$\alpha(\tau; z) = \theta_{1,2}(\tau) - \theta_{0,2}(\tau) = z'\big(\theta_1(\tau) - \theta_0(\tau)\big), \quad \text{where } z = (0, 1, 0, \ldots, 0)' \in \mathbb{R}^p.$$

Figure 3: Panel (A): infant birthweight difference between Asian and Black mothers in two groups: mothers with some college degree and mothers with no more than a high school degree. Panel (B): differential effect of attending college on infant birthweights between Asian and Black mothers. Robust simultaneous 95% confidence bands, as discussed in Example 3, are given by the shaded regions.

Figure 3(A) shows the estimated quantile regression coefficients $\hat\theta_{1,2}(\tau)$ (green curve) and $\hat\theta_{0,2}(\tau)$ (red curve) along with estimated robust simultaneous 95% confidence bands (see Example 3 for details). We observe that for mothers with (only) high school education, the birthweight difference between Asian and Black mothers is more pronounced in the lower tail, decreasing from 350 grams in the lower tail to about 200 grams in the upper tail. For mothers with some college education, the birthweight difference between infants born to Asian and Black mothers is roughly a quadratic function of the quantile level $\tau$. The minimal difference is about 75 grams at around the 0.
75 quantile level and the maximal difference is about 200 grams in the lower quantiles. Figure 3(B) shows the estimate of $\alpha(\tau; z)$. We see that education not only has a substantial effect in reducing the birthweight difference between Asian and Black mothers, but also has differential effects across the quantiles of infant birthweight: its influence is most pronounced in the left tail of the distribution.

8 Discussion

In this article, we have introduced a new procedure to study treatment effect heterogeneity based on quantile regression modeling and rank-score balancing. While our rank-score balanced estimator is easy to implement and enjoys strong theoretical guarantees, the following points merit future research.

First, it is desirable to study the asymptotic efficiency of the rank-score balancing procedure. Since the rank-score balanced estimate is a scalar-valued functional of a high-dimensional (nuisance) parameter, this requires a semi-parametric point of view. However, the existing semiparametric efficiency bounds for quantile regression apply only to fixed-dimensional settings (Newey and Powell, 1990; Zhao, 2001). The treatment of high-dimensional quantile regression models requires a more elaborate analysis since the models change with the sample size and the sparsity level. One question we hope to answer in future work is whether the rank-score balanced estimator is also asymptotically optimal in high-dimensional settings.

Second, it is worthwhile to relax the unconfoundedness assumption, simply because unmeasured confounding presents a critical challenge to causal inference from observational studies. A classical approach to mitigating the confounding bias is to use instrumental variable methods. In this context, the identification Condition 1 can be modified similarly to that of Chernozhukov and Hansen (2005). In future work, we therefore plan to investigate how to combine instrumental variables with covariate balancing.
Supplementary Materials for "Inference on Heterogeneous Quantile Treatment Effects via Rank-Score Balancing"
Alexander Giessing ∗ Jingshen Wang †‡§
February 4, 2021
A Preliminaries
In these supplementary materials we develop the theoretical backbone of the results in the main text. The key to the weak convergence results in the main text is the Bahadur-type representation of the rank-score balanced estimate of the CQF. Establishing this Bahadur-type representation with a reasonable non-asymptotic bound on the remainder term is very involved. In particular, the proof requires the following non-trivial auxiliary results:

• uniform consistency of the unweighted $\ell_1$-penalized quantile regression vector;

• uniform control over the empirical sparsity of the unweighted $\ell_1$-penalized quantile regression vector;

• uniform consistency of the solution to the dual program of the rank-score balancing program.

All these results are new. The existing results and proofs pertaining to uniform consistency and empirical sparsity of the weighted $\ell_1$-penalized quantile regression vector (Belloni and Chernozhukov, 2011; Belloni et al., 2019c) do not apply to the case of unweighted $\ell_1$-penalized quantile regression. The reason for this is that the empirical processes associated with the weighted $\ell_1$-penalty are self-normalized, whereas those corresponding to the unweighted $\ell_1$-penalty are unbounded. The self-normalized processes can be analyzed via conventional tools in empirical process theory, while the unbounded empirical processes require new tools developed in Giessing (2020). Similarly, the dual of the rank-score balancing program is a non-standard problem that cannot be recast as a standard regression problem. Hence, its analysis requires a new approach and new tools as well.

In order to make these auxiliary results as transparent and accessible as possible, we prove them under slightly weaker conditions than stated in the main paper and provide explicit constants whenever possible. The results from the main text are therefore simple corollaries of the more refined results provided in Sections D and E.
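As a concrete illustration of the first object in this list: the unweighted $\ell_1$-penalized quantile regression program (eq. (18) in Section D) can be solved exactly as a linear program. The following sketch is ours and purely illustrative — it is not the authors' implementation, and the toy data and all names are hypothetical:

```python
import numpy as np
from scipy.optimize import linprog

def l1_penalized_qr(X, y, tau, lam):
    """Solve min_theta  sum_i rho_tau(y_i - x_i'theta) + lam * ||theta||_1
    as a linear program, splitting theta = theta_plus - theta_minus and the
    residuals y - X theta = u - v with u, v >= 0."""
    n, p = X.shape
    # objective: lam*1'(theta+ + theta-) + tau*1'u + (1 - tau)*1'v
    c = np.concatenate([lam * np.ones(2 * p),
                        tau * np.ones(n), (1.0 - tau) * np.ones(n)])
    # equality constraints: X theta+ - X theta- + u - v = y
    A_eq = np.hstack([X, -X, np.eye(n), -np.eye(n)])
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=(0, None), method="highs")
    sol = res.x
    return sol[:p] - sol[p:2 * p]

# toy example: sparse linear model, median regression (tau = 0.5)
rng = np.random.default_rng(0)
n, p = 200, 10
X = rng.standard_normal((n, p))
theta_true = np.zeros(p)
theta_true[:2] = [1.0, -2.0]
y = X @ theta_true + 0.1 * rng.standard_normal(n)
theta_hat = l1_penalized_qr(X, y, tau=0.5, lam=1.0)
```

The split into positive and negative parts is the standard LP reformulation of the pinball loss; at the optimum, $u$ and $v$ are the positive and negative parts of the residuals.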
B Notation
In addition to the notation of the main text we introduce the following conventions.

Unless otherwise stated, we denote by $\{X_i\}_{i \in \mathbb{N}}$ a sequence of independent $S$-valued random elements with common law $P$, i.e. $X_i: S^{\mathbb{N}} \to S$, $i \in \mathbb{N}$, are the coordinate projections of the infinite product probability space $(\Omega, \mathcal{A}, \mathbb{P}) = (S^{\mathbb{N}}, \mathcal{S}^{\mathbb{N}}, P^{\mathbb{N}})$. If auxiliary variables independent of the $X_i$'s are involved, the underlying probability space is assumed to be of the form $(\Omega, \mathcal{A}, \mathbb{P}) = (S^{\mathbb{N}}, \mathcal{S}^{\mathbb{N}}, P^{\mathbb{N}}) \times (Z, \mathcal{Z}, Q)$. We write $\mathbb{E}$ for the expectation with respect to $\mathbb{P}$, and $\mathbb{P}_X$, $\mathbb{E}_X$ for the partial integration with respect to the joint law of $\{X_i\}_{i \in \mathbb{N}}$ only. For events $A \in \mathcal{A}$ with $\mathbb{P}\{A\} > 0$ and integrable random variables $Y$ on $(\Omega, \mathcal{A}, \mathbb{P})$ we define the conditional expectation of $Y$ given $A$ as $\mathbb{E}[Y \mid A] := (\int_A Y \, d\mathbb{P}) / \mathbb{P}\{A\}$. For any measure $Q$ and any real-valued $Q$-integrable function $f$ on $(S, \mathcal{S})$ we write $Qf = Q(f) := \int f \, dQ$. We denote by $L^p(S, \mathcal{S}, Q)$, $p \in [1, \infty)$, the space of all real-valued measurable functions $f$ on $(S, \mathcal{S})$ with finite $L^p(Q)$-norm, i.e. $\|f\|_{Q,p} := (\int |f|^p \, dQ)^{1/p} < \infty$. For a random variable $\xi$ on $(\Omega, \mathcal{A}, \mathbb{P})$ we set $\|\xi\|_p := (\mathbb{E}|\xi|^p)^{1/p}$. We define the empirical measures $P_n$ associated with observations $\{X_i\}_{i=1}^n$, $n \in \mathbb{N}$, as random measures on $(S, \mathcal{S})$ given by $P_n(\omega) := n^{-1} \sum_{i=1}^n \delta_{X_i(\omega)}$, where $\delta_x$ is the Dirac measure at $x$. We denote the empirical processes indexed by a class $\mathcal{F} \subset L^2(S, \mathcal{S}, P)$ by $G_n(f) := \sqrt{n}(P_n - P)(f) := n^{-1/2} \sum_{i=1}^n (f(X_i) - Pf)$, $f \in \mathcal{F}$, and the corresponding symmetrized empirical processes by $G_n^{\circ}(f) := n^{-1/2} \sum_{i=1}^n \varepsilon_i f(X_i)$, $f \in \mathcal{F}$, where $\{\varepsilon_i\}_{i=1}^n$ is a sequence of i.i.d. Rademacher random variables independent of $\{X_i\}_{i=1}^n$. For probability measures $Q$ on $(S, \mathcal{S})$ we set $\|Q\|_{\mathcal{F}} = \sup\{|Qf| : f \in \mathcal{F}\}$.
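A toy numerical illustration of these conventions (ours, not part of the original notation): for $S = \mathbb{R}$, $P = N(0,1)$, and $f(x) = x^2$ (so that $Pf = 1$), the empirical measure $P_n f$, the empirical process $G_n(f)$, and its symmetrized version $G_n^{\circ}(f)$ can be computed directly:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
X = rng.standard_normal(n)          # X_1, ..., X_n i.i.d. with law P = N(0, 1)

def f(x):                           # a P-square-integrable function with Pf = E[X^2] = 1
    return x ** 2

Pn_f = f(X).mean()                  # empirical measure:   P_n f = n^{-1} sum_i f(X_i)
Gn_f = np.sqrt(n) * (Pn_f - 1.0)    # empirical process:   G_n(f) = sqrt(n) (P_n - P)(f)
eps = rng.choice([-1.0, 1.0], n)    # i.i.d. Rademacher signs, independent of the X_i
Gn_sym = eps @ f(X) / np.sqrt(n)    # symmetrized process: G_n°(f) = n^{-1/2} sum_i eps_i f(X_i)
```

By the central limit theorem both $G_n(f)$ and $G_n^{\circ}(f)$ are $O_p(1)$, which is why the symmetrized process can stand in for the original one in the chaining arguments used below.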
To avoid distracting measurability questions regarding the quantity $\|Q\|_{\mathcal{F}}$, we assume that the function class $\mathcal{F}$ is either countable or that the (symmetrized) empirical process indexed by $\mathcal{F}$ is separable. The latter is true for all instances in this paper, since all function classes considered are indexed by vectors that live in $\mathbb{R}^p$, $p \in [1, \infty)$. Given a pseudometric space $(T, d)$ and $\varepsilon > 0$, $N(\varepsilon, T, d)$ denotes the $\varepsilon$-covering number of $T$ with respect to $d$. If $d$ is induced by a norm $n$ on $T$, we also write $N(\varepsilon, T, n)$.

For non-negative real-valued sequences $\{a_n\}_{n \ge 1}$ and $\{b_n\}_{n \ge 1}$, the relation $a_n \lesssim b_n$ means that there exists an absolute constant $c > 0$ (independent of $n$, $d$, $p$) and an integer $n_0 \in \mathbb{N}$ such that $a_n \le c\, b_n$ for all $n \ge n_0$. We write $a_n \asymp b_n$ if $a_n \lesssim b_n$ and $b_n \lesssim a_n$. We define $a_n \vee b_n = \max\{a_n, b_n\}$ and $a_n \wedge b_n = \min\{a_n, b_n\}$. For a vector $a \in \mathbb{R}^d$ and $p \in [1, \infty)$ we write $\|a\|_p = (\sum_{k=1}^d |a_k|^p)^{1/p}$. Also, we write $\|a\|_\infty = \max_{1 \le k \le d} |a_k|$. For a scalar random variable $\xi$ and $\alpha \in (0, 2]$ we define the $\psi_\alpha$-Orlicz norm by $\|\xi\|_{\psi_\alpha} = \inf\{t > 0 : \mathbb{E}[\exp(|\xi|^\alpha / t^\alpha)] \le 2\}$. For a sequence of scalar random variables $\{\xi_n\}_{n \ge 1}$ we write $\xi_n = O_p(a_n)$ if $\xi_n / a_n$ is stochastically bounded. For any matrix $M \in \mathbb{R}^{d \times d}$ we denote by $\|M\|_{op}$ its operator norm (its largest singular value).

C Proofs of the results in the main text
Proof of Theorem 1. The claim follows from Theorem 7 upon noting that Conditions 1–10 and Condition 11 with $\varrho = \sqrt{(s_v + s_\theta)\log(np)/n}$ guarantee that $\lambda_d, \gamma_d > 0$.

Proof of Theorem 2. By Theorem 1, $\sqrt{n}\big(\widehat{Q}_d(\cdot; z) - Q_d(\cdot; z)\big) \rightsquigarrow G_d(\cdot; z)$ in $\ell^\infty(T)$ for $d \in \{0, 1\}$, where $G_d(\cdot; z)$ is a centered Gaussian process with covariance function $(\tau_1, \tau_2) \mapsto H_d(\tau_1, \tau_2; z)$. The processes $\{\sqrt{n}(\widehat{Q}_1(\cdot; z) - Q_1(\cdot; z)) : \tau \in T\}$ and $\{\sqrt{n}(\widehat{Q}_0(\cdot; z) - Q_0(\cdot; z)) : \tau \in T\}$ are asymptotically independent. Hence, by Corollary 1.4.5 (Example 1.4.6) in van der Vaart and Wellner (1996), $\big(\sqrt{n}(\widehat{Q}_1(\cdot; z) - Q_1(\cdot; z)),\ \sqrt{n}(\widehat{Q}_0(\cdot; z) - Q_0(\cdot; z))\big) \rightsquigarrow (G_1, G_0)$. The Continuous Mapping Theorem (e.g. van der Vaart and Wellner, 1996, Theorem 1.11.1) then yields the claim of the theorem.

Proof of Lemma 1. Apply Lemma 14.
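The key step in the proof of Theorem 2 — joint weak convergence of the two asymptotically independent processes, followed by the continuous mapping theorem — implies that the limit covariance of the HQTE process is the sum of the treatment- and control-group covariances. A toy Monte Carlo check of this additivity (our illustration only; the standard deviations below are arbitrary stand-ins):

```python
import numpy as np

rng = np.random.default_rng(2)
m = 200_000
sig1, sig0 = 1.5, 0.7               # stand-ins for the std. devs. of G_1 and G_0 at a fixed tau
G1 = sig1 * rng.standard_normal(m)  # draws from the treated-group limit G_1(tau; z)
G0 = sig0 * rng.standard_normal(m)  # independent draws from the control-group limit G_0(tau; z)
hqte_limit = G1 - G0                # continuous mapping: the HQTE limit at tau
var_emp = hqte_limit.var()
var_theory = sig1 ** 2 + sig0 ** 2  # independence => the variances add
```

The empirical variance of the difference matches the sum of the individual variances up to Monte Carlo error, which is exactly the structure of the limit covariance $H_1 + H_0$ in Theorem 2.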
Proof of Lemma 2. The claim follows from Corollary 3. As in the proof of Theorem 1, Conditions 1–9 and Condition 11 with $\varrho = \sqrt{(s_v + s_\theta)\log(np/\delta)/n}$ guarantee that $\lambda_d, \gamma_d > 0$.

Proof of Lemma 3. Apply Lemma 17.
D The $\ell_1$-penalized quantile regression problem

D.1 Setting
We consider a high-dimensional parametric quantile regression model with continuous response $Y \in \mathbb{R}$ and predictors $X \in \mathbb{R}^p$, where the number of parameters $p$ diverges with (and possibly exceeds) the sample size $n$. Let $F$ be the joint distribution of $(Y, X)$, and let $F_{Y|X}(\cdot|x)$ and $f_{Y|X}(\cdot|x)$ be the conditional distribution and conditional density of $Y \mid X = x$. Recall that the $\tau$th conditional quantile function (CQF) of $Y$ given $X$ is $Q(\tau; X) = \inf\{y : F_{Y|X}(y|X) \ge \tau\}$.

We assume that, at least over a compact subset $T \subset (0,1)$ of quantile levels, the true CQF is a linear function of only a few predictor variables, i.e. $Q(\tau; X) = X'\theta(\tau)$. Given a random sample $\{(Y_i, X_i)\}_{i=1}^n$ we therefore estimate $\theta(\tau)$ as the solution $\hat{\theta}_\lambda(\tau)$ to the (unweighted) $\ell_1$-penalized quantile regression problem,
$$\min_{\theta \in \mathbb{R}^p} \Big\{ \sum_{i=1}^n \rho_\tau(Y_i - X_i'\theta) + \lambda \|\theta\|_1 \Big\}. \quad (18)$$

D.2 Assumptions
The following assumptions partially refine the conditions of the main text. Throughout we assume that $\{(Y_i, X_i)\}_{i=1}^n$ is a random sample of independent and identically distributed random variables with joint distribution $F$.

Assumption 1 (Sub-Gaussian predictors). The random vector $X \in \mathbb{R}^p$ has positive definite covariance matrix $\Sigma$ and satisfies $\|(X - E[X])'u\|_{\psi_2}^2 \lesssim u'\Sigma u$ for all $u \in \mathbb{R}^p$.

Assumption 2 (Linear conditional quantile function). Let $T$ be a compact subset of $(0,1)$. For all $\tau \in T$, the $\tau$th conditional quantile of $Y$ given $X$ is a linear function in $X$, i.e. $Q_Y(\tau | X) = X'\theta(\tau)$, $\theta(\tau) \in \mathbb{R}^p$.

Assumption 3 (Sparsity of $\tau \mapsto \theta(\tau)$). Let $T$ be a compact subset of $(0,1)$. Set $T_\theta(\tau) := \mathrm{support}(\theta(\tau))$ and $s_\theta := \sup_{\tau \in T} \|\theta(\tau)\|_0 \le n \wedge p$.

Assumption 4 (Lipschitz continuity of $\tau \mapsto \theta(\tau)$). Let $T$ be a compact subset of $(0,1)$. There exists a constant $L_\theta \ge 0$ such that for all $s, \tau \in T$, $\|\theta(\tau) - \theta(s)\|_2 \le L_\theta |\tau - s|$.

Assumption 5 (Lipschitz continuity of the conditional density). The conditional density of $Y$ given $X$, $f_{Y|X}$, exists and is Lipschitz continuous, i.e. there exists a constant $L_f \ge 0$ such that for all $a, b, x \in \mathbb{R}^p$, $|f_{Y|X}(x'a|x) - f_{Y|X}(x'b|x)| \le L_f |x'a - x'b|$.

Assumption 6 (Growth condition for consistency). The parameters $\delta \in (0,1)$ and $s_\theta, p, n \ge 1$ satisfy $s_\theta \log(ep/s_\theta) = o(n)$, $\log(1/\delta) = O\big(\log(ep/s_\theta) \wedge \log n\big)$, and $s_\theta + 2 \le p$.

Next we introduce several definitions and assumptions pertaining to the population and sample covariance and Gram matrices of the predictors.
Definition 4 ($s$-sparse maximum eigenvalues of the covariance matrices). Let $s \in \{1, \ldots, p\}$ and define the $s$-sparse maximum eigenvalues of the population and sample covariance matrices $\Sigma$ and $\widehat{\Sigma}$, respectively, by
$$\phi_{\max}(s) := \sup_{u: \|u\|_0 \le s} \frac{u'\Sigma u}{\|u\|_2^2} \quad \text{and} \quad \widehat{\phi}_{\max}(s) := \sup_{u: \|u\|_0 \le s} \frac{u'\widehat{\Sigma} u}{\|u\|_2^2}.$$

Definition 5 (Cone of $(J, \vartheta)$-dominated coordinates). For $J \subseteq \{1, \ldots, p\}$ and $\vartheta \in [0, \infty]$ define the cone of $(J, \vartheta)$-dominated coordinates by $C_p(J, \vartheta) := \{u \in \mathbb{R}^p : \|u_{J^c}\|_1 \le \vartheta \|u_J\|_1\}$.

Remark 6.
Observe that $\vartheta \in [0, \infty]$ controls the (approximate) sparsity level of the vectors in $C_p(J, \vartheta)$. Indeed, if $\vartheta = 0$, then the vectors in $C_p(J, 0)$ are $|J|$-sparse and only entries with index in $J$ are non-zero. In contrast, if $\vartheta = \infty$, then $C_p(J, \infty) = \mathbb{R}^p$.

Definition 6 ($(\omega, \vartheta)$-restricted minimum eigenvalue of the weighted design matrix). Let $T$ be a compact subset of $(0,1)$ and $\omega, \vartheta \ge 0$. Define the $(\omega, \vartheta)$-restricted minimum eigenvalue of the weighted design matrix as
$$\kappa_\omega(\vartheta) := \inf_{\tau \in T}\ \inf_{u \in C_p(T_\theta(\tau), \vartheta) \cap \partial B_p(0,1)} E\big[f_{Y|X}^{\omega}(X'\theta(\tau)|X)(X'u)^2\big].$$

Remark 7.
Note that $\kappa_\omega(\infty)$ is simply the minimum eigenvalue of $E\big[f_{Y|X}^{\omega}(X'\theta(\tau)|X)\, XX'\big]$.

Assumption 7 ($(\vartheta, \varrho)$-restricted identifiability). Let $T$ be a compact subset of $(0,1)$ and $\vartheta, \varrho \ge 0$. The quantile regression vector $\theta(\tau)$ is $(\vartheta, \varrho)$-restricted identifiable for all $\tau \in T$, i.e. $\kappa_1(\vartheta) > 0$ and
$$q_\varrho(\vartheta) := \inf_{\tau \in T}\ \inf_{u \in C_p(T_\theta(\tau), \vartheta) \cap \partial B_p(0,1)} \Big( E\big[f_{Y|X}(X'\theta(\tau)|X)(X'u)^2\big] - \varrho\, L_f\, E\big[|X'u|^3\big] \Big) > 0.$$

D.3 Consistency

Theorem 3 (Consistency). Let $T$ be a compact subset of $(0,1)$, $\delta \in (0,1)$, $c > 1$, and $\lambda > 0$. Set $\bar{c} := (c+1)/(c-1)$ and
$$r_\theta := \left( \frac{\bar{c}\, \phi_{\max}^{1/2}(2 s_\theta)\, L_\theta}{\kappa_1(\bar{c})} \vee 1 \right) \sqrt{\frac{s_\theta \log(ep/s_\theta) + \log n + \log(1/\delta)}{n}} \;\bigvee\; \frac{\bar{c}}{\kappa_1(\bar{c})}\, \frac{\lambda \sqrt{s_\theta}}{n}. \quad (19)$$
Suppose that Assumptions 1–6 and Assumption 7 with $(\vartheta, \varrho) = (\bar{c}, r_\theta)$ hold and that $\lambda > 0$ satisfies eq. (22). With probability at least $1 - \delta$, $\sup_{\tau \in T} \|\hat{\theta}_\lambda(\tau) - \theta(\tau)\|_2 \lesssim r_\theta$.

Corollary 1.
Let $T$ be a compact subset of $(0,1)$, $\delta \in (0,1)$, and $c > 1$. Set $\bar{c} := (c+1)/(c-1)$ and $\tilde{r}_\theta := \sqrt{(s_\theta \log(p/\delta) + \log n)/n}$. Suppose that Assumptions 1–6 and Assumption 7 with $(\vartheta, \varrho) = (\bar{c}, \tilde{r}_\theta)$ hold. Then there exists an absolute constant $c_0 > c$ such that for all $C_\lambda \ge c_0$ and
$$\lambda := C_\lambda\, \varphi_{\max}^{1/2}(1) \sqrt{n \log(p/\delta)}, \quad (20)$$
with probability at least $1 - \delta$,
$$\sup_{\tau \in T} \|\hat{\theta}_\lambda(\tau) - \theta(\tau)\|_2 \lesssim \left( \frac{\bar{c}\, \phi_{\max}^{1/2}(2 s_\theta)\, L_\theta}{\kappa_1(\bar{c})} \vee \frac{C_\lambda\, \bar{c}\, \varphi_{\max}^{1/2}(1)}{\kappa_1(\bar{c})} \right) \sqrt{\frac{s_\theta \log(p/\delta) + \log n}{n}}.$$

D.4 Empirical sparsity
We introduce the following notation:
Definition 7 ($s$-sparse maximum eigenvalues of the Gram matrices). Let $s \in \{1, \ldots, p\}$ and define the $s$-sparse maximum eigenvalues of the population and sample Gram matrices by
$$\varphi_{\max}(s) := \sup_{u: \|u\|_0 \le s} \frac{E[(X'u)^2]}{\|u\|_2^2} \quad \text{and} \quad \widehat{\varphi}_{\max}(s) := \sup_{u: \|u\|_0 \le s} \frac{n^{-1} \sum_{i=1}^n (X_i'u)^2}{\|u\|_2^2}.$$

Theorem 4 (Empirical sparsity). Let $T$ be a compact subset of $(0,1)$.

(i) Let $1 \le m \le p$ be arbitrary. If $\lambda \ge n\sqrt{2}\, \sqrt{\widehat{\varphi}_{\max}(m)/m}$, then $\hat{s}_\lambda := \sup_{\tau \in T} \|\hat{\theta}_\lambda(\tau)\|_0 \le m \wedge n \wedge p$ with probability one.

(ii) Let $\delta \in (0,1)$ and $c > 1$. Set $\bar{c} := (c+1)/(c-1)$ and $\bar{r}_\theta := \sqrt{s_\theta \log(np/\delta)/n}$. Suppose that Assumptions 1–6 and Assumption 7 with $(\vartheta, \varrho) = (\bar{c}, \bar{r}_\theta)$ hold. There exists an absolute constant $c_0 > c \vee \sqrt{2}$ such that for all $C_\lambda \ge c_0$ and
$$\lambda := C_\lambda \sqrt{\varphi_{\max}\big(n/\log(np/\delta)\big) \vee \widehat{\varphi}_{\max}\big(n/\log(np/\delta)\big)}\, \sqrt{n \log(np/\delta)}, \quad (21)$$
with probability at least $1 - \delta$, $\sup_{\tau \in T} \|\hat{\theta}_\lambda(\tau) - \theta(\tau)\|_2 \lesssim C_1 \bar{r}_\theta$ and $\hat{s}_\lambda := \sup_{\tau \in T} \|\hat{\theta}_\lambda(\tau)\|_0 \lesssim C_1 C_2\, s_\theta$, where
$$C_1 := \left( \frac{\bar{c}\, \phi_{\max}^{1/2}(2 s_\theta)\, L_\theta}{\kappa_1(\bar{c})} \vee \frac{C_\lambda\, \bar{c}\, \varphi_{\max}^{1/2}\big(n/\log(np/\delta)\big)}{\kappa_1(\bar{c})} \vee \frac{C_\lambda\, \bar{c}\, \widehat{\varphi}_{\max}^{1/2}\big(n/\log(np/\delta)\big)}{\kappa_1(\bar{c})} \vee 1 \right),$$
$$C_2 := \left( \frac{\bar{c}(2 + \bar{c})\, \phi_{\max}(s_\theta)\, L_f}{C_\lambda} \vee \frac{\bar{c}(2 + \bar{c})\, \varphi_{\max}^{1/2}(s_\theta)}{C_\lambda} \right).$$

D.5 Auxiliary results
Lemma 4 (Restricted cone property). Let $T$ be a compact subset of $(0,1)$ and $c > 1$. Set $\bar{c} := (c+1)/(c-1)$. Suppose that Assumption 3 holds and
$$\lambda c^{-1} \ge \sup_{\tau \in T} \Big\| \sum_{i=1}^n X_i \big(\tau - 1\{Y_i \le X_i'\theta(\tau)\}\big) \Big\|_\infty. \quad (22)$$
Then, for all $\tau \in T$, $\hat{\theta}_\lambda(\tau) - \theta(\tau) \in C_p\big(T_\theta(\tau), \bar{c}\big)$.

Remark 8.
This lemma states that $\hat{\theta}_\lambda(\tau) - \theta(\tau)$ lies in a cone of dominated coordinates. This is instrumental for establishing (uniform) consistency of the $\ell_1$-penalized quantile regression vector.

Lemma 5 (A new look at Knight's identity). Let $s, \tau \in T$, $y \in \mathbb{R}$, and $\theta, x \in \mathbb{R}^p$ be arbitrary. Define
$$\phi_{\tau,x,y}(z) := \int_0^z 1\{y \le x'\theta(\tau) + u\}\, du.$$
The following holds true:

(i) $\phi_{\tau,x,y}$ is a contraction and $\phi_{\tau,x,y}(0) = 0$;

(ii) $\phi_{\tau,x,y}\big(x'\theta - x'\theta(\tau)\big) = \phi_{s,x,y}\big(x'\theta - x'\theta(s)\big) - \phi_{s,x,y}\big(x'\theta(\tau) - x'\theta(s)\big)$;

(iii) $\rho_\tau(y - x'\theta) - \rho_\tau\big(y - x'\theta(\tau)\big) = -\tau\big(x'\theta - x'\theta(\tau)\big) + \phi_{\tau,x,y}\big(x'\theta - x'\theta(\tau)\big)$.

Remark 9.
The first property allows us to apply the contraction principle for conditional Rademacher averages; the second property helps us when using (simple) chaining arguments over quantile levels $\tau \in T$. While the first and second properties appear to be new, the third property is a simple consequence of Knight's identity. Implicitly, Belloni and Chernozhukov (2011) use the same properties in their proof of Lemma 5.

Lemma 6 (Locally quadratic minorization). Let $T$ be a compact subset of $(0,1)$, $c_0 \ge 0$, and $r_0 > 0$. Suppose that Assumptions 5 and 7 with $(\vartheta, \varrho) = (c_0, r_0)$ hold. There exists an absolute constant $c_1 > 0$ (independent of $T$, $c_0$, $r_0$) such that for all points $(\theta, \tau) \in \mathbb{R}^p \times T$ with $\theta - \theta(\tau) \in C_p(T_\theta(\tau), c_0) \cap \partial B_p(0, r_0/c_1)$, we have
$$E\big[\rho_\tau(Y - X'\theta) - \rho_\tau\big(Y - X'\theta(\tau)\big)\big] \gtrsim \kappa_1(c_0)\, r_0^2.$$

Lemma 7 (Size of cones of dominated coordinates; Lemma 7.1, Koltchinskii, 2011). Let $\vartheta \in [0, \infty]$, $J \subseteq \{1, \ldots, p\}$, $s = \mathrm{card}(J)$, and $p \ge s + 2$. Define
$$M = \bigcup_{I \subset \{1, \ldots, p\},\ \mathrm{card}(I) \le s} N_I,$$
where $N_I$ is the minimal $1/2$-net of $B_I = \big\{\{u_i\}_{i \in I} : \sum_{i \in I} |u_i| \le 1\big\}$. The following holds true:

(i) $C_p(J, \vartheta) \cap B_p(0,1) \subset 2(1 + \vartheta)\, \mathrm{conv}(M)$;

(ii) $\mathrm{card}(M) \le \big(ep/s\big)^{s}$;

(iii) $\forall u \in M$: $\|u\|_0 \le s$.

Lemma 8 (Maxima of a biconvex function). Let $f: \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ be a biconvex function. Then,
$$\sup_{x \in \mathrm{conv}(\mathcal{X})} \sup_{y \in \mathrm{conv}(\mathcal{Y})} f(x,y) = \sup_{x \in \mathcal{X}} \sup_{y \in \mathcal{Y}} f(x,y).$$
Moreover, the identity remains true if $f$ is replaced by $|f|$.

Lemma 9.
Let $\delta \in (0,1)$ be arbitrary, $T$ be a compact subset of $(0,1)$, $c_0 \ge 0$, $r_0 > 0$. Suppose that Assumptions 1–4 hold. Define
$$\mathcal{G} = \big\{ g: \mathbb{R}^{p+1} \to \mathbb{R} : g(X, Y) = \rho_\tau\big(Y - X'\theta(\tau)\big) - \rho_\tau(Y - X'\theta),\ \theta - \theta(\tau) \in C_p(T_\theta(\tau), c_0) \cap B_p(0, r_0),\ \tau \in T \big\}.$$
With probability at least $1 - \delta$,
$$\|G_n\|_{\mathcal{G}} \lesssim (2 + c_0)\, \phi_{\max}^{1/2}(2 s_\theta)\, r_0 \sqrt{s_\theta \log(ep/s_\theta) + \log(1 + L_\theta/r_0) + \log(1/\delta)}.$$

Lemma 10.
Let $T$ be a compact subset of $(0,1)$. Let $\delta \in (0,1)$ be arbitrary and $\vartheta_k \in [0, \infty]$, $J_k(\tau) \subseteq \{1, \ldots, p\}$ for $\tau \in T$, $s_k = \sup_{\tau \in T} \mathrm{card}(J_k(\tau))$, $p \ge s_k + 2$, and $k \in \{1, 2\}$. Let $\{(X_i, \xi_i)\}_{i=1}^n$ be a sequence of i.i.d. random vectors. Suppose that Assumption 1 holds and $|\xi_i| \le 1$ a.s. for all $1 \le i \le n$. Define
$$\mathcal{G} = \big\{ g: \mathbb{R}^p \times [-1,1] \to \mathbb{R} : g(X, \xi) = \xi\, (X'u_1)(X'u_2),\ u_k \in C_p(J_k(\tau), \vartheta_k) \cap B_p(0,1),\ k \in \{1,2\},\ \tau \in T \big\}.$$
With probability at least $1 - \delta$,
$$\|G_n\|_{\mathcal{G}} \lesssim (2 + \vartheta_1)(2 + \vartheta_2)\, \varphi_{\max}^{1/2}(s_1)\, \varphi_{\max}^{1/2}(s_2) \Big( \sqrt{s_1 \log(ep/s_1) + s_2 \log(ep/s_2) + \log(1/\delta)} + n^{-1/2} \big( s_1 \log(ep/s_1) + s_2 \log(ep/s_2) + \log(1/\delta) \big) \Big).$$

Lemma 11.
Let $T$ be a compact subset of $(0,1)$. Let $\delta \in (0,1)$ be arbitrary and $\vartheta \in [0, \infty]$, $J(\tau) \subseteq \{1, \ldots, p\}$ for $\tau \in T$, $s = \sup_{\tau \in T} \mathrm{card}(J(\tau))$, and $p \ge s + 2$. Let $\{(X_i, Y_i, \xi_i)\}_{i=1}^n$ be a sequence of i.i.d. random vectors. Suppose that Assumptions 1–3 hold and $|\xi_i| \le 1$ a.s. for all $1 \le i \le n$. Define
$$\mathcal{G} = \big\{ g: \mathbb{R}^{p+1} \times [-1,1] \to \mathbb{R} : g(X, Y, \xi) = \xi \big(\tau - 1\{Y \le X'\theta(\tau)\}\big) X'v,\ v \in C_p(J(\tau), \vartheta) \cap B_p(0,1),\ \tau \in T \big\}.$$
With probability at least $1 - \delta$,
$$\|G_n\|_{\mathcal{G}} \lesssim (2 + \vartheta)\, \varphi_{\max}^{1/2}(s) \sqrt{s \log(ep/s) + \log(1/\delta)}\, \sqrt{\pi_{n,1}\big(s \log(ep/s) + \log(1/\delta)\big)},$$
where $\pi_{n,1}(z) = \sqrt{z/n} + z/n$ for $z \ge 0$.

The next two lemmata provide bounds that hold uniformly over collections of empirical processes.
Lemma 12.
Let $\delta \in (0,1)$ be arbitrary and $T$ be a compact subset of $(0,1)$. Let $\{(X_i, Y_i, \xi_i)\}_{i=1}^n$ be a sequence of i.i.d. random vectors. Suppose that Assumptions 1–3 hold and $|\xi_i| \le 1$ a.s. for all $1 \le i \le n$. Define
$$\mathcal{G} = \big\{ g: \mathbb{R}^{p+1} \times [-1,1] \to \mathbb{R} : g(X, Y, \xi) = \xi \big(1\{Y \le X'\theta\} - 1\{Y \le X'\theta(\tau)\}\big) X'v,\ v, \theta \in \mathbb{R}^p,\ \|v\|_2 \le 1,\ \|v\|_0 \le n,\ \|\theta\|_0 \le n,\ \tau \in T \big\}.$$
The following holds true:

(i) With probability at least $1 - \delta$, for all $g_{v,\theta,\tau} \in \mathcal{G}$,
$$|G_n(g_{v,\theta,\tau})| \lesssim \varphi_{\max}^{1/2}(\|v\|_0) \sqrt{t_{\|v\|_0, \|\theta\|_0, n, \delta}}\, \sqrt{\pi_{n,1}\big(t_{\|v\|_0, \|\theta\|_0, n, \delta}\big)},$$
where $t_{k,\ell,n,\delta} = k \log(ep/k) + \ell \log(ep/\ell) + \log(n/\delta)$ and $\pi_{n,1}(z) = \sqrt{z/n} + z/n$ for $z \ge 0$;

(ii) Let $\mathcal{G}(m) = \{g_{v,\theta,\tau} \in \mathcal{G} : \|v\|_0 \le m,\ \|\theta\|_0 \le m\}$. With probability at least $1 - \delta$, for all $m \le n \wedge p$,
$$\|G_n\|_{\mathcal{G}(m)} \lesssim \varphi_{\max}^{1/2}(m) \sqrt{m \log(ep/m) + \log(n/\delta)}\, \sqrt{\pi_{n,1}\big(m \log(ep/m) + \log(n/\delta)\big)},$$
where $\pi_{n,1}(z) = \sqrt{z/n} + z/n$ for $z \ge 0$.

Lemma 13.
Let $\delta \in (0,1)$ be arbitrary. Suppose that Assumption 1 holds. With probability at least $1 - \delta$, for all $k \le n$,
$$\sup_{\mathrm{card}(I) \le k}\ \sup_{\|u\|_2 \le 1,\, \|u\|_0 \le k} \Big| \frac{1}{k} \sum_{i \in I} \big( (X_i'u)^2 - E[(X_i'u)^2] \big) \Big| \lesssim \varphi_{\max}(k) \left( \sqrt{\frac{k \log(epn/k) + \log(1/\delta)}{k}} + \frac{k \log(epn/k) + \log(1/\delta)}{k} \right).$$

E The rank-score balanced quantile regression problem
E.1 Setting
We define the rank-score balanced estimator of the CQF at covariate $z \in \mathbb{R}^p$ as
$$\widehat{Q}_{\lambda,\gamma}(\tau; z) := z'\hat{\theta}_\lambda(\tau) + \frac{1}{\sqrt{n}} \sum_{i=1}^n \widehat{w}_{\gamma,i}(\tau; z)\, \hat{f}_i^{-1}(\tau)\big(\tau - 1\{Y_i \le X_i'\hat{\theta}_\lambda(\tau)\}\big) \quad \forall \tau \in T, \quad (23)$$
where the $\hat{f}_i$'s are estimates of $f_{Y|X}(X_i'\theta(\tau)|X_i)$ and the vector $\widehat{w}_\gamma(\tau; z) \in \mathbb{R}^n$ is the solution to the rank-score balancing program
$$\min_{w \in \mathbb{R}^n}\ \sum_{i=1}^n w_i^2\, \hat{f}_i^{-1}(\tau) \quad \text{s.t.} \quad \|z - n^{-1/2} X'w\|_\infty \le \gamma/n, \quad (24)$$
where $X' = [X_1, \ldots, X_n] \in \mathbb{R}^{p \times n}$.

E.2 Assumptions
The following assumptions partially refine the conditions in the main text.
Assumption 8 (Relative consistency of the density estimator). The estimates $\{\hat{f}_i(\cdot)\}_{i=1}^n$ of the conditional densities $\{f_{Y|X}(X_i'\theta(\cdot)|X_i)\}_{i=1}^n$ are relatively consistent in the following sense: there exists $r_f > 0$ such that, with probability at least $1 - \eta$,
$$\sup_{\tau \in T} \max_{1 \le i \le n} \left| \frac{\hat{f}_i(\tau)}{f_{Y|X}(X_i'\theta(\tau)|X_i)} - 1 \right| \lesssim r_f.$$

Assumption 9 (Boundedness of the conditional density). The conditional density of $Y$ given $X$ is bounded, i.e. there exists an absolute constant $0 \le \bar{f} < \infty$ such that $\sup_{z \in \mathbb{R}} \sup_{x \in \mathbb{R}^p} f_{Y|X}(z|x) \le \bar{f}$.

Definition 8 (Population dual solution). For $z \in \mathbb{R}^p$ and $\tau \in T$ define the population dual solution by
$$v(\tau; z) := -E\big[f_{Y|X}\big(X'\theta(\tau)|X\big)\, XX'\big]^{-1} z. \quad (25)$$

Assumption 10 (Sparsity of $v(\tau; z)$). Let $v(\tau; z) \in \mathbb{R}^p$ be sparse in the sense that $T_v(\tau; z) := \mathrm{support}\big(v(\tau; z)\big)$ and $s_v(z) := \sup_{\tau \in T} \|v(\tau; z)\|_0 \le n \wedge p$.

Assumption 11 (Growth condition for the dual solution). The parameters $\delta \in (0,1)$, $r_f \ge 0$, and $z$, $s_v(z), p, n \ge 1$ satisfy $s_v(z) \log\big(ep/s_v(z)\big) = o(n)$, $\log(1/\delta) = O\big(\log(ep/s_v(z)) \wedge \log n\big)$, $s_v(z) + 2 \le p$, and $r_f = O(1)$.

Assumption 12 (Differentiability of $\tau \mapsto Q_Y(\tau; X)$). Let $T$ be a compact subset of $(0,1)$. The conditional quantile function of $Y$ given $X$, $\tau \mapsto Q_Y(\tau; X)$, is three times boundedly differentiable on $T$, i.e. there exists a constant $C_Q \ge 0$ such that $\sup_{\tau \in T} \sup_{x \in \mathbb{R}^p} \big| \frac{d^3}{d\tau^3} Q_Y(\tau; x) \big| \le C_Q$.

E.3 The dual problem
Consider the following convex optimization problem:
$$\min_{v \in \mathbb{R}^p}\ \sum_{i=1}^n \hat{f}_i(\tau)(X_i'v)^2 + n\, z'v + \gamma \|v\|_1. \quad (26)$$
The next lemma establishes that this program is the dual program to the rank-score balancing program (24).

Lemma 14 (Dual of the rank-score balancing program). (i) Programs (26) and (24) are a primal-dual pair.

(ii) Let $\delta \in (0,1)$. Suppose that Assumption 1 holds. There exists an absolute constant $c_0 > 0$ such that for all $\gamma > 0$ that satisfy
$$\gamma c_0^{-1} \ge \frac{\bar{f}\, \varphi_{\max}^{1/2}(1)\, \varphi_{\max}^{1/2}\big(s_v(z)\big)}{\kappa_1(\infty)} \sqrt{\log(p/\delta)}\, \|z\|_2\, \sqrt{n}, \quad (27)$$
with probability at least $1 - \delta$,
$$\widehat{w}_{\gamma,i}(\tau; z) = -\frac{\hat{f}_i(\tau)}{2\sqrt{n}}\, X_i'\hat{v}_\gamma(\tau; z), \quad 1 \le i \le n,\ \forall \tau \in T, \quad (28)$$
where $\widehat{w}_\gamma(\tau; z)$ and $\hat{v}_\gamma(\tau; z)$ are the unique solutions to COP (24) and COP (26), respectively.

Theorem 5 (Consistency). Let $T$ be a compact subset of $(0,1)$, $\delta, \eta \in (0,1)$, $c > 1$, and $\gamma > 0$. Set $\bar{c} := (c+1)/(c-1)$ and
$$r_v := C_0 \left( \frac{\|z\|_2}{\kappa_1(\infty)} \vee 1 \right) \sqrt{\frac{s_v(z) \log\big(ep/s_v(z)\big) + s_\theta \log(ep/s_\theta) + \log(n/\delta)}{n}} \vee r_f \;\bigvee\; \frac{\bar{c}}{\kappa_1(\bar{c})}\, \frac{\gamma \sqrt{s_v(z)}}{n}, \quad (29)$$
where $C_0 := \bar{c}\, \bar{f}\, L_f\, L_\theta\, \varphi_{\max}^{1/2}(2 s_\theta)\, \varphi_{\max}\big(s_v(z)\big) / \kappa_1(\bar{c})$. Suppose that Assumptions 1–6 and 8–11 hold and that $\gamma > 0$ satisfies eq. (35). With probability at least $1 - \eta - \delta$, $\sup_{\tau \in T} \|\hat{v}_\gamma(\tau; z) - v(\tau; z)\|_2 \lesssim r_v$.

Remark 10.
Note that the rate of consistency of $\hat{v}_\gamma(\tau; z)$ for $v(\tau; z)$ is at most as fast as the rate of consistency of the estimated conditional density, i.e. $r_f$.

Corollary 2.
Let $T$ be a compact subset of $(0,1)$, $\delta, \eta \in (0,1)$, and $c > 1$. Set $\bar{c} := (c+1)/(c-1)$. Suppose that Assumptions 1–6 and 8–11 hold. There exists an absolute constant $c_0 > c$ such that for all $C_\gamma \ge c_0$ and
$$\gamma := C_\gamma\, C_0\, \frac{\bar{f}\, \kappa_1(\bar{c})}{\kappa_1(\infty)} \Big( \sqrt{\log(np/\delta)} + \sqrt{n}\, r_f \Big) \|z\|_2\, \sqrt{n}, \quad (30)$$
with probability at least $1 - \eta - \delta$,
$$\sup_{\tau \in T} \|\hat{v}_\gamma(\tau; z) - v(\tau; z)\|_2 \lesssim C_0 \left( \frac{\|z\|_2}{\kappa_1(\infty)} \vee \frac{C_\gamma\, \bar{c}\, \bar{f}\, \|z\|_2}{\kappa_1(\infty)} \vee 1 \right) \left( \sqrt{\frac{s_v(z) \log(np/\delta) + s_\theta \log(ep/s_\theta)}{n}} \;\bigvee\; r_f \sqrt{s_v(z)} \right).$$

Remark 11.
Note that for $C_\gamma \ge c_0$ large enough, any $\gamma > 0$ that satisfies eq. (30) also satisfies the inequalities in eq. (27) and eq. (35).

E.4 Bahadur-type representation

In this section, we consider a specific estimator for the conditional density. Following Koenker (2005b) we observe that $1/f_{Y|X}(X_i'\theta(\tau)|X_i) = \frac{d}{d\tau} Q(\tau; X_i)$. Thus, we estimate the conditional densities $f_{Y|X}(X_i'\theta(\tau)|X_i)$, $1 \le i \le n$, by
$$\hat{f}_i(\tau) := \frac{2h}{X_i'\hat{\theta}_\lambda(\tau + h) - X_i'\hat{\theta}_\lambda(\tau - h)}, \quad \forall \tau \in T,\ 1 \le i \le n, \quad (31)$$
where $h > 0$ is a bandwidth parameter.

Lemma 15.
Let $T$ be a compact subset of $(0,1)$, $\delta \in (0,1)$. Let $\{\hat{f}_i(\tau)\}_{i=1}^n$ be as defined in eq. (31). Set $\bar{c} := (c+1)/(c-1)$, let $r_\theta > 0$ be as in eq. (19), let $\lambda > 0$ satisfy eq. (22), and define
$$r_f := \bar{f}\, \sqrt{n}\, h^{-1} r_\theta^2 + \bar{f}\, C_Q\, h^2. \quad (32)$$
Suppose that Assumptions 1–6, 9–12, and Assumption 7 with $(\vartheta, \varrho) = (\bar{c}, r_\theta)$ hold. If $r_f = o(1)$, then, with probability at least $1 - \delta$,
$$\sup_{\tau \in T} \max_{1 \le i \le n} \left| \frac{\hat{f}_i(\tau)}{f_{Y|X}(X_i'\theta(\tau)|X_i)} - 1 \right| \lesssim r_f.$$

Theorem 6 (Bahadur-type representation). Let $T$ be a compact subset of $(0,1)$, $\delta \in (0,1)$, $c > 1$, and $h, \lambda, \gamma > 0$. Let $r_\theta, r_v, r_f > 0$ be as in eq. (19), eq. (29), and eq. (32), respectively. Set $\bar{c} := (c+1)/(c-1)$, $\hat{s}_\lambda := \sup_{\tau \in T} \|\hat{\theta}_\lambda(\tau)\|_0$, and
$$\hat{r}_B := \sqrt{\frac{(s_v(z) + s_\theta + \hat{s}_\lambda) \log(np/\delta)}{n}}. \quad (33)$$
Suppose that Assumptions 1–6, 9–12, and Assumption 7 with $(\vartheta, \varrho) = (\bar{c}, r_\theta)$ hold. Further, suppose that $\lambda, \gamma > 0$ satisfy eq. (22) and eq. (35), respectively. If $\big(s_v(z) + s_\theta\big) \log^2(np/\delta) \log^2(n) = o(n)$ and $r_f = o(1)$, then, for all $\tau \in T$ and $z \in \mathbb{R}^p$, with probability at least $1 - \delta$,
$$\widehat{Q}_{\lambda,\gamma}(\tau; z) - Q(\tau; z) = -\frac{1}{2n} \sum_{i=1}^n f_{Y|X}(X_i'\theta(\tau)|X_i)\big(\tau - 1\{Y_i \le X_i'\theta(\tau)\}\big) X_i'v(\tau; z) + e_n(\tau; z),$$
and
$$\sup_{\tau \in T} |e_n(\tau; z)| \lesssim C_1 \left( r_v + \frac{\|z\|_2}{\kappa_1(\infty)} \right) \big( r_\theta + \hat{r}_B (\log n)^2 + \sqrt{r_\theta\, \hat{r}_B} \big) + C_2\, r_v (\hat{r}_B + r_\theta) + C_3 \left( r_v + \frac{\|z\|_2}{\kappa_1(\infty)} \right) \Big( \big( \hat{r}_B (\log n)^2 + \sqrt{r_\theta\, \hat{r}_B} + \hat{r}_B + r_\theta \big) h^{-1} r_\theta + h^{-1} r_f\, r_\theta + h^2 \Big),$$
where $C_1, C_2, C_3 \ge 0$ are defined in eq. (103).

Remark 12.
The upper bound on the remainder term $e_n(\tau; z)$ is complicated; however, it has a simple explanation: by the duality result of Lemma 14 the rank-score balanced estimator can be formulated as
$$\widehat{Q}_{\lambda,\gamma}(\tau; z) = z'\hat{\theta}_\lambda(\tau) - \frac{1}{2n} \sum_{i=1}^n \hat{f}_i(\tau)\big(\tau - 1\{Y_i \le X_i'\hat{\theta}_\lambda(\tau)\}\big) X_i'\hat{v}_\gamma(\tau; z).$$
We note that the bias-correction term of the rank-score balanced estimator is a function of $\hat{v}_\gamma(\tau; z)$, $\{\hat{f}_i(\tau)\}_{i=1}^n$, and $\hat{\theta}_\lambda(\tau)$. To establish the Bahadur-type representation we therefore expand the bias-correction term by adding and subtracting the corresponding population versions of these parameters. We then derive individual bounds on the resulting seven terms. With high probability, each of these seven terms can be upper bounded by the supremum of an empirical process indexed by some function class with vanishing $L^2(P)$-diameter (due to the consistency results of Theorems 3 and 5). The bound on the remainder term $e_n(\tau; z)$ then follows from a careful application of empirical process techniques. Hereby, the specific choice of the estimates $\{\hat{f}_i(\tau)\}_{i=1}^n$ is crucial.

In Theorem 6 the upper bound on the remainder term $e_n(\tau; z)$ is a random variable: it depends on the number of non-zero coefficients of the estimated quantile regression vector, i.e. $\hat{s}_\lambda := \sup_{\tau \in T} \|\hat{\theta}_\lambda(\tau)\|_0$. We know from Theorem 4 that for $\lambda > 0$ as in eq. (21), $\hat{s}_\lambda \lesssim s_\theta$ with high probability. This leads to several simplifications, which are the content of the following corollary:

Corollary 3.
Let $T$ be a compact subset of $(0,1)$, $\delta \in (0,1)$, and $c > 1$. Set $\bar{c} := (c+1)/(c-1)$ and
$$\bar{r}_B := \sqrt{\frac{(s_v(z) + s_\theta) \log(np/\delta)}{n}}. \quad (34)$$
Let $\lambda, \gamma > 0$ be as in eq. (21) and eq. (30), respectively. Let $\{\hat{f}_i(\tau)\}_{i=1}^n$ be as defined in eq. (31). Suppose that Assumptions 1–6, 9–12, and Assumption 7 with $(\vartheta, \varrho) = (\bar{c}, \bar{r}_B)$ hold. If $(s_v(z) + s_\theta) \log^2(np/\delta) \log^2(n) = o(nh^2)$ and $h^2 s_v(z) = o(1)$, then, for all $\tau \in T$ and $z \in \mathbb{R}^p$, with probability at least $1 - \delta$,
$$\widehat{Q}_{\lambda,\gamma}(\tau; z) - Q(\tau; z) = -\frac{1}{2n} \sum_{i=1}^n f_{Y|X}(X_i'\theta(\tau)|X_i)\big(\tau - 1\{Y_i \le X_i'\theta(\tau)\}\big) X_i'v(\tau; z) + e_n(\tau; z),$$
and
$$\sup_{\tau \in T} |e_n(\tau; z)| \lesssim C \left( \frac{\|z\|_2}{\kappa_1(\infty)} \vee 1 \right) \Big( \bar{r}_B (\log n)^2 + \sqrt{n}\, \bar{r}_B^2\, h^{-1} + \sqrt{n}\, \bar{r}_B\, h^2 \Big),$$
where $C > 0$ is a constant depending on $c$, $\bar{f}$, $L_f$, $L_\theta$, $C_Q$, $\kappa_1(\bar{c})$, $\varphi_{\max}(p)$ only.

E.5 Weak convergence
Lemma 16 (Asymptotic equicontinuity of the leading term in the Bahadur-type representation). Let $T$ be a compact subset of $(0,1)$. Consider
$$\mathcal{G} = \big\{ g: \mathbb{R}^{p+1} \to \mathbb{R} : g(X, Y) = f_{Y|X}(X'\theta(\tau)|X)\big(\tau - 1\{Y \le X'\theta(\tau)\}\big) X'v(\tau; z),\ \tau \in T \big\},$$
and $\mathcal{G}_\xi = \{g_\tau - h_{\tau'} : g_\tau, h_{\tau'} \in \mathcal{G},\ |\tau - \tau'| \le \xi,\ \tau, \tau' \in T\}$. If $\varphi_{\max}(p) = O(1)$ and $\|z\|_2\, \kappa_1^{-1}(\infty) = O(1)$, then the following holds true:

(i) $\mathcal{G}$ is totally bounded with respect to the standard deviation metric;

(ii) $\mathcal{G}$ is asymptotically equicontinuous, i.e. $\lim_{\xi \downarrow 0} \limsup_{n \to \infty} \mathbb{P}\big\{\|G_n\|_{\mathcal{G}_\xi} > \varepsilon\big\} = 0$ for all $\varepsilon > 0$.

The next result is the main theorem of this section.
Theorem 7 (Weak convergence of the rank-score balanced estimator). Let $T$ be a compact subset of $(0,1)$ and $c > 1$. Set $\bar{c} := (c+1)/(c-1)$ and $\bar{r}_B := \sqrt{(s_v(z) + s_\theta) \log(np)/n}$. Let $\lambda, \gamma > 0$ be as in eq. (21) and eq. (30), respectively. Let $\{\hat{f}_i(\tau)\}_{i=1}^n$ be as defined in eq. (31). Suppose that Assumptions 1–6, 9–12, and Assumption 7 with $(\vartheta, \varrho) = (\bar{c}, \bar{r}_B)$ hold. Moreover, suppose that $(s_v(z) + s_\theta) \log^2(np) \log^2(n) = o(nh^2)$, $h^2 s_v(z) = o(1)$, $\varphi_{\max}(p) = O(1)$, $\kappa_1(\bar{c}) = O(1)$, and $\|z\|_2\, \kappa_1^{-1}(\infty) = O(1)$. If the limit
$$H(\tau_1, \tau_2; z) := \lim_{n \to \infty} \big(\tau_1 \wedge \tau_2 - \tau_1 \tau_2\big)\, v'(\tau_1; z)\, E\big[f_{Y|X}(X'\theta(\tau_1)|X)\, f_{Y|X}(X'\theta(\tau_2)|X)\, XX'\big]\, v(\tau_2; z)$$
exists for all $\tau_1, \tau_2 \in T$, then
$$\sqrt{n}\Big(\widehat{Q}_{\lambda,\gamma}(\cdot; z) - Q(\cdot; z)\Big) \rightsquigarrow G(\cdot; z) \quad \text{in } \ell^\infty(T),$$
where $G(\cdot; z)$ is a centered Gaussian process with covariance function $(\tau_1, \tau_2) \mapsto H(\tau_1, \tau_2; z)$.

Lemma 17.
*Recall the setup of Corollary 3. With probability at least $1 - \delta$,
$$\sup_{\tau_1, \tau_2 \in \mathcal{T}} \big| \widehat{H}(\tau_1, \tau_2; z) - H(\tau_1, \tau_2; z) \big| \lesssim C\big( \|z\|_1 \kappa(\infty) \vee 1 \big)\Big( \sqrt{n}\,\sqrt{s_v(z)}\, h^{-1}\, \bar{r}_B^2 + \sqrt{s_v(z)}\, h \Big),$$
where $C > 0$ is a constant depending on $c$, $\bar{f}$, $L_f$, $L_\theta$, $C_Q$, $\kappa(\bar{c})$, $\varphi_{\max}(p)$ only.*

E.6 Auxiliary results
Lemma 18 (Restricted cone property). *For $c > 1$ set $\bar{c} = \frac{c+1}{c-1}$. Suppose that Assumption 10 holds and
$$\gamma c^{-1} \ge \sup_{\tau\in\mathcal{T}} \Bigg\| \sum_{i=1}^n \hat{f}_i(\tau) X_i X_i' v_0(\tau; z) + nz \Bigg\|_\infty. \tag{35}$$
Then, for all $\tau \in \mathcal{T}$, $\hat{v}_\gamma(\tau; z) - v_0(\tau; z) \in C_p\big(T_v(\tau; z), \bar{c}\big)$.*

Lemma 19 (Lipschitz continuity of $\tau \mapsto v_0(\tau; z)$). *Let $\mathcal{T}$ be a compact subset of $(0,1)$. Suppose that Assumptions 4, 5, 9, and 10 hold. There exists an absolute constant $C_v \ge 1$ such that for all $s, \tau \in \mathcal{T}$,
$$\|v_0(s; z) - v_0(\tau; z)\|_1 \le C_v\, \bar{f} L_f L_\theta\, \varphi^{1/2}_{\max}(2 s_\theta)\, \varphi_{\max}(p)\, \kappa(\infty)\, \|z\|_1\, |s - \tau|.$$*

Lemma 20 (Maxima of sums of block multi-convex functions). *Let $f_1, \ldots, f_N : \mathcal{X}_1 \times \cdots \times \mathcal{X}_K \to \mathbb{R}$ be block multi-convex functions. Then,
$$\sup_{x_j \in \operatorname{conv}(\mathcal{X}_j),\, 1 \le j \le K}\ \sum_{i=1}^N f_i(x_1, \ldots, x_K) = \sup_{x_j \in \mathcal{X}_j,\, 1 \le j \le K}\ \sum_{i=1}^N f_i(x_1, \ldots, x_K).$$
Moreover, the identity remains true if $\sum_{i=1}^N f_i$ is replaced by $\big|\sum_{i=1}^N f_i\big|$.*

Remark 13. This is a generalization of Lemma 8.
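Lemma 20 says that a sum of block multi-convex functions attains its supremum over the convex hulls $\operatorname{conv}(\mathcal{X}_j)$ already on the original sets $\mathcal{X}_j$. A minimal one-block numerical illustration (a convex function maximized over a cube versus over its vertex set; dimensions and seed are hypothetical):

```python
import itertools
import numpy as np

rng = np.random.default_rng(5)
K = 4
A = rng.normal(size=(6, K))

def f(x):
    """A convex function of x (sum of squared affine forms)."""
    return float(np.sum((A @ x) ** 2))

# The maximum over the cube conv({-1, 1}^K) is attained at an extreme point,
# so enumerating the 2^K vertices suffices; interior points never exceed it.
vertex_max = max(f(np.array(v)) for v in itertools.product([-1.0, 1.0], repeat=K))
interior = max(f(rng.uniform(-1, 1, size=K)) for _ in range(2_000))
print(interior <= vertex_max + 1e-9)  # True
```

This is the special case $K$-blocks-collapsed-to-one; the lemma extends the same extreme-point argument blockwise.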
Lemma 21.
*Let $\mathcal{T}$ be a compact subset of $(0,1)$. Let $\delta \in (0,1)$ be arbitrary and $\vartheta_k \in [0,\infty]$, $J_k(\tau) \subseteq \{1,\ldots,p\}$ for $\tau \in \mathcal{T}$, $s_k = \sup_{\tau\in\mathcal{T}} \operatorname{card}(J_k(\tau))$, $p \ge s_k + 2$, and $k \in \{1,2\}$. Suppose that Assumptions 1–5 and 9 hold. Define
$$\mathcal{G} = \big\{ g:\mathbb{R}^p \to \mathbb{R} : g(X) = f_{Y|X}\big(X'\theta_0(\tau)\mid X\big)(X'u_1)(X'u_2),\ u_k \in C_p(J_k(\tau), \vartheta_k)\cap B_p(0,1),\ k\in\{1,2\},\ \tau\in\mathcal{T} \big\}.$$
With probability at least $1-\delta$,
$$\|\mathbb{G}_n\|_{\mathcal{G}} \lesssim (2+\vartheta_1)(2+\vartheta_2)\,\bar{f}\,\varphi^{1/2}_{\max}(s_1)\,\varphi^{1/2}_{\max}(s_2)\big(1+\varphi^{1/2}_{\max}(2s_\theta)\big)\,\psi_n\big(t_{s_1,s_2,s_\theta,n,\delta}\big),$$
where $t_{s_1,s_2,s_\theta,n,\delta} = s_1\log(ep/s_1) + s_2\log(ep/s_2) + s_\theta\log(ep/s_\theta) + \log(nL_fL_\theta/\delta)$ and $\psi_n(z) = \sqrt{z}\big(n^{-1/2}\sqrt{z} + n^{-1/4}z^{1/4}\big)$ for $z \ge 0$.*

Lemma 22.
*Let $\mathcal{T}$ be a compact subset of $(0,1)$. Let $\delta \in (0,1)$ be arbitrary, $\vartheta \in [0,\infty]$, $J(\tau) \subseteq \{1,\ldots,p\}$ for $\tau\in\mathcal{T}$, $s = \sup_{\tau\in\mathcal{T}}\operatorname{card}(J(\tau))$, and $p \ge s+2$. Suppose that Assumptions 1–5 and 9 hold. Define
$$\mathcal{G} = \big\{ g:\mathbb{R}^{p+1}\to\mathbb{R} : g(X,Y) = f_{Y|X}\big(X'\theta_0(\tau)\mid X\big)\big(\tau - 1\{Y \le X'\theta_0(\tau)\}\big)X'v,\ v \in C_p(J(\tau),\vartheta)\cap B_p(0,1),\ \tau\in\mathcal{T}\big\}.$$
With probability at least $1-\delta$,
$$\|\mathbb{G}_n\|_{\mathcal{G}} \lesssim (2+\vartheta)\,\bar{f}\,\varphi^{1/2}_{\max}(s)\,\varphi^{1/2}_{\max}(2s_\theta)\,\psi_n\big(t_{s,s_\theta,n,\delta}\big),$$
where $t_{s,s_\theta,n,\delta} = s\log(ep/s) + s_\theta\log(ep/s_\theta) + \log(nL_fL_\theta/\delta)$ and $\psi_n(z) = \sqrt{z}\big(n^{-1/2}\sqrt{z} + n^{-1/4}z^{1/4}\big)$ for $z \ge 0$.*

Lemma 23.
*Let $\mathcal{T}$ be a compact subset of $(0,1)$. Let $\delta \in (0,1)$ be arbitrary and $\vartheta_k \in [0,\infty]$, $J_k(\tau) \subseteq \{1,\ldots,p\}$ for $\tau\in\mathcal{T}$, $s_k = \sup_{\tau\in\mathcal{T}}\operatorname{card}(J_k(\tau))$, $p \ge s_k+2$, and $k\in\{1,2\}$. Suppose that Assumptions 1–5 and 9 hold. Define
$$\mathcal{G} = \big\{ g:\mathbb{R}^{p+1}\to\mathbb{R} : g(X,Y) = f_{Y|X}\big(X'\theta_0(\tau)\mid X\big)\big(\tau - 1\{Y\le X'\theta_0(\tau)\}\big)(X'u_1)(X'u_2),\ u_k \in C_p(J_k(\tau),\vartheta_k)\cap B_p(0,1),\ k\in\{1,2\},\ \tau\in\mathcal{T}\big\}.$$
With probability at least $1-\delta$,
$$\|\mathbb{G}_n\|_{\mathcal{G}} \lesssim (2+\vartheta_1)(2+\vartheta_2)\,\bar{f}\,\varphi^{1/2}_{\max}(s_1)\,\varphi^{1/2}_{\max}(s_2)\,\varphi^{1/2}_{\max}(2s_\theta)\,\psi_n\big(t_{s_1,s_2,s_\theta,n,\delta}\big),$$
where $t_{s_1,s_2,s_\theta,n,\delta} = s_1\log(ep/s_1) + s_2\log(ep/s_2) + s_\theta\log(ep/s_\theta) + \log(nL_fL_\theta/\delta)$ and $\psi_n(z) = \sqrt{z}\big(n^{-1/2}\sqrt{z} + n^{-1/4}z^{1/4}\big)$ for $z \ge 0$.*

Lemma 24.
*Let $\mathcal{T}$ be a compact subset of $(0,1)$. Let $r_0 > 0$, $\delta \in (0,1)$ be arbitrary, $\vartheta \in [0,\infty]$, $J(\tau) \subseteq \{1,\ldots,p\}$ for $\tau\in\mathcal{T}$, $s = \sup_{\tau\in\mathcal{T}}\operatorname{card}(J(\tau))$, and $p \ge s+2$. Suppose that Assumptions 1–5 and 9 hold. Define
$$\mathcal{G} = \big\{ g:\mathbb{R}^{p+1}\to\mathbb{R} : g(X,Y) = f_{Y|X}\big(X'\theta_0(\tau)\mid X\big)\big(1\{Y\le X'\theta\} - 1\{Y\le X'\theta_0(\tau)\}\big)X'v,\ \theta\in\mathbb{R}^p,\ \|\theta\|_0 \le n,\ \|\theta-\theta_0(\tau)\|_2 \le r_0,\ v \in C_p(J(\tau),\vartheta)\cap B_p(0,1),\ \tau\in\mathcal{T}\big\}.$$
The following holds true:*

*(i) With probability at least $1-\delta$, for all $g_{v,\theta,\tau}\in\mathcal{G}$,
$$|\mathbb{G}_n(g_{v,\theta,\tau})| \lesssim (2+\vartheta)\,\bar{f}^{1/2}\,\varphi^{1/2}_{\max}(s)\,\varphi^{1/2}_{\max}(2s_\theta)\,\varphi^{1/2}_{\max}\big(\|\theta\|_0+s_\theta\big)\Big(\upsilon_{r_0,n}\big(\|\theta\|_0\log(1/r_0)\big) + \upsilon_{r_0,n}\big(t_{s,\|\theta\|_0,s_\theta,n,\delta}\big)\Big),$$
where $t_{s,k,s_\theta,n,\delta} = s\log(ep/s) + k\log(ep/k) + s_\theta\log(ep/s_\theta) + \log(L_fL_\theta n/\delta)$ and $\upsilon_{r_0,n}(z) = \sqrt{z}\big(\sqrt{r_0} + n^{-1/2}(\log n)\sqrt{z} + n^{-1}(\log n)^{3/2}z\big)$ for $z \ge 0$;*

*(ii) Let $\mathcal{G}(m) = \{g_{v,\theta,\tau}\in\mathcal{G} : \|\theta\|_0 \le m\}$. With probability at least $1-\delta$, for all $m \le n\wedge p$,
$$\|\mathbb{G}_n\|_{\mathcal{G}(m)} \lesssim (2+\vartheta)\,\bar{f}^{1/2}\,\varphi^{1/2}_{\max}(s)\,\varphi^{1/2}_{\max}(2s_\theta)\,\varphi^{1/2}_{\max}(m+s_\theta)\Big(\upsilon_{r_0,n}\big(m\log(1/r_0)\big) + \upsilon_{r_0,n}\big(t_{s,m,s_\theta,n,\delta}\big)\Big),$$
where $t_{s,m,s_\theta,n,\delta} = s\log(ep/s) + m\log(ep/m) + s_\theta\log(ep/s_\theta) + \log(L_fL_\theta n/\delta)$.*

Lemma 25.
*Let $\mathcal{T}$ be a compact subset of $(0,1)$. Let $r_0 > 0$, $\delta \in (0,1)$ be arbitrary and $\vartheta_k \in [0,\infty]$, $J_k(\tau) \subseteq \{1,\ldots,p\}$ for $\tau\in\mathcal{T}$, $s_k = \sup_{\tau\in\mathcal{T}}\operatorname{card}(J_k(\tau))$, $p \ge s_k+2$, and $k\in\{1,2\}$. Suppose that Assumptions 1–5 and 9 hold. Define
$$\mathcal{G} = \big\{ g:\mathbb{R}^{p+1}\to\mathbb{R} : g(X,Y) = f_{Y|X}\big(X'\theta_0(\tau)\mid X\big)\big(1\{Y\le X'\theta\} - 1\{Y\le X'\theta_0(\tau)\}\big)(X'u_1)(X'u_2),\ \theta\in\mathbb{R}^p,\ \|\theta\|_0\le n,\ \|\theta-\theta_0(\tau)\|_2\le r_0,\ u_k\in C_p(J_k(\tau),\vartheta_k)\cap B_p(0,1),\ k\in\{1,2\},\ \tau\in\mathcal{T}\big\}.$$
The following holds true:*

*(i) With probability at least $1-\delta$, for all $g_{u_1,u_2,\theta,\tau}\in\mathcal{G}$,
$$|\mathbb{G}_n(g_{u_1,u_2,\theta,\tau})| \lesssim (2+\vartheta_1)(2+\vartheta_2)\,\bar{f}^{1/2}\,\varphi^{1/2}_{\max}(s_1)\,\varphi^{1/2}_{\max}(s_2)\,\varphi^{1/2}_{\max}(2s_\theta)\,\varphi^{1/2}_{\max}\big(\|\theta\|_0+s_\theta\big)\Big(\upsilon_{r_0,n}\big(\|\theta\|_0\log(1/r_0)\big) + \upsilon_{r_0,n}\big(t_{s_1,s_2,\|\theta\|_0,s_\theta,n,\delta}\big)\Big),$$
where $t_{s_1,s_2,k,s_\theta,n,\delta} = s_1\log(ep/s_1) + s_2\log(ep/s_2) + k\log(ep/k) + s_\theta\log(ep/s_\theta) + \log(L_fL_\theta n/\delta)$ and $\upsilon_{r_0,n}(z) = \sqrt{z}\big(\sqrt{r_0} + n^{-1/2}(\log n)\sqrt{z} + n^{-1}(\log n)^{3/2}z\big)$ for $z \ge 0$;*

*(ii) Let $\mathcal{G}(m) = \{g_{u_1,u_2,\theta,\tau}\in\mathcal{G} : \|\theta\|_0 \le m\}$. With probability at least $1-\delta$, for all $m\le n\wedge p$,
$$\|\mathbb{G}_n\|_{\mathcal{G}(m)} \lesssim (2+\vartheta_1)(2+\vartheta_2)\,\bar{f}^{1/2}\,\varphi^{1/2}_{\max}(s_1)\,\varphi^{1/2}_{\max}(s_2)\,\varphi^{1/2}_{\max}(2s_\theta)\,\varphi^{1/2}_{\max}(m+s_\theta)\Big(\upsilon_{r_0,n}\big(m\log(1/r_0)\big) + \upsilon_{r_0,n}\big(t_{s_1,s_2,m,s_\theta,n,\delta}\big)\Big),$$
where $t_{s_1,s_2,m,s_\theta,n,\delta} = s_1\log(ep/s_1) + s_2\log(ep/s_2) + m\log(ep/m) + s_\theta\log(ep/s_\theta) + \log(L_fL_\theta n/\delta)$.*

F Results from empirical process theory
F.1 Maximal and deviation inequalities
In this section we collect maximal inequalities that we use throughout the proofs.
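A common feature of the inequalities collected here is that the supremum over (roughly) $N$ well-concentrated functions grows only like $\sqrt{\log N}$. A toy illustration of that scaling for maxima of independent standard Gaussians (sizes and seed hypothetical):

```python
import numpy as np

rng = np.random.default_rng(6)
N, reps = 10_000, 200
# Monte Carlo estimate of E[max_{i <= N} |g_i|] for i.i.d. standard Gaussians
maxes = np.abs(rng.normal(size=(reps, N))).max(axis=1)
# Maximal-inequality benchmark: the mean is of order sqrt(2 log N)
print(round(maxes.mean(), 2), round(float(np.sqrt(2 * np.log(N))), 2))
```

The simulated mean sits just below the $\sqrt{2\log N}$ benchmark, as the maximal inequalities predict.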
Definition 9 (Exponential Orlicz norms ($\psi_\alpha$-norms)). For $\alpha > 0$ set $\psi_\alpha(x) = \exp(x^\alpha) - 1$. We define the exponential Orlicz norm of a real-valued random variable $\xi$ on $(\Omega, \mathcal{A}, \mathbb{P})$ as
$$\|\xi\|_{\psi_\alpha} := \inf\big\{\lambda > 0 : \mathbb{E}\,\psi_\alpha(|\xi|/\lambda) \le 1\big\},$$
and the $\psi_\alpha$-norm of a real-valued $Q$-integrable function $f$ on $(S, \mathcal{S})$ as
$$\|f\|_{Q,\psi_\alpha} := \inf\Big\{\lambda > 0 : \int \psi_\alpha(|f|/\lambda)\, dQ \le 1\Big\}.$$

Remark 14.
For $\alpha \in (0,1)$ the map $t \mapsto \psi_\alpha(t)$ is not convex; hence, $\|\cdot\|_{\psi_\alpha}$ and $\|\cdot\|_{Q,\psi_\alpha}$ are only quasi-norms.

Theorem 8 (Giessing, 2020). Let
$\mathcal{F} \subset L_2(S,\mathcal{S},P)$ and $\rho$ be a pseudo-metric on $\mathcal{F}\times\mathcal{F}$. For $\delta > 0$ define $\mathcal{F}_\delta = \{f-g : f,g\in\mathcal{F},\ \rho(f,g)\le\delta\}$. Suppose that for $\alpha > 0$ there exists a constant $K > 0$ such that for all $f,g\in\mathcal{F}$,
$$\|(f - Pf) - (g - Pg)\|_{P,\psi_\alpha} \le K\rho(f,g). \tag{36}$$
The following holds true:

(i) If $\alpha \in (0,1]$, then for all events $A\in\mathcal{A}$ with $\mathbb{P}\{A\} > 0$,
$$\mathbb{E}\big[\|\mathbb{G}_n\|_{\mathcal{F}_\delta}\mid A\big] \le C_\alpha K\Bigg[\int_0^\delta \sqrt{\log\Big(\frac{N(\varepsilon,\mathcal{F},\rho)}{\mathbb{P}\{A\}}\Big)}\,d\varepsilon + n^{-1/2}\int_0^\delta\Big(\log\Big(\frac{N(\varepsilon,\mathcal{F},\rho)}{\mathbb{P}\{A\}}\Big)\Big)^{1/\alpha}d\varepsilon\Bigg],$$
where $C_\alpha > 0$ is a constant depending on $\alpha$ only.

(ii) If $\alpha \in (1,2]$, then for all events $A\in\mathcal{A}$ with $\mathbb{P}\{A\} > 0$,
$$\mathbb{E}\big[\|\mathbb{G}_n\|_{\mathcal{F}_\delta}\mid A\big] \le C_\alpha K\Bigg[\int_0^\delta \sqrt{\log\Big(\frac{N(\varepsilon,\mathcal{F},\rho)}{\mathbb{P}\{A\}}\Big)}\,d\varepsilon + n^{-1/2+1/\beta}\int_0^\delta\Big(\log\Big(\frac{N(\varepsilon,\mathcal{F},\rho)}{\mathbb{P}\{A\}}\Big)\Big)^{1/\alpha}d\varepsilon\Bigg],$$
where $C_\alpha > 0$ is a constant depending on $\alpha$ only and $1/\alpha + 1/\beta = 1$.

Remark 15.
The Lipschitz-type condition (36) on the centered individual increments can be substituted by a Lipschitz condition on the uncentered individual increments, i.e. cases (i) and (ii) continue to hold true (with a larger constant) if there exists $K > 0$ such that for all $f, g \in \mathcal{F}$, $\|f - g\|_{P,\psi_\alpha} \le K\rho(f,g)$. We establish this claim as a side result in the proof of Theorem 8.

Remark 16. If $\mathcal{F} = \{f_t : t \in T\}$ and $\|(f_s - Pf_s) - (f_t - Pf_t)\|_{P,\psi_\alpha} \le Kd(s,t)$ for all $f_s, f_t \in \mathcal{F}$ and a pseudo-metric $d$ on $T \times T$, then we can replace $N(\varepsilon, \mathcal{F}, \rho)$ by $N(\varepsilon, T, d)$.

Corollary 4 (Giessing, 2020). *Recall the setup of Theorem 8. The following holds true:*

*(i) If $\alpha \in (0,1]$, then, for all $t \ge 1$, with probability at least $1 - e^{-t}$,
$$\|\mathbb{G}_n\|_{\mathcal{F}_\delta} \le C_\alpha K\Bigg(\int_0^\delta\sqrt{\log N(\varepsilon,\mathcal{F},\rho)}\,d\varepsilon + n^{-1/2}\int_0^\delta\big(\log N(\varepsilon,\mathcal{F},\rho)\big)^{1/\alpha}d\varepsilon\Bigg) + C_\alpha K\delta\big(\sqrt{t} + n^{-1/2}t^{1/\alpha}\big),$$
where $C_\alpha > 0$ is a constant depending on $\alpha$ only.*

*(ii) If $\alpha \in (1,2]$, then, for all $t \ge 1$, with probability at least $1 - e^{-t}$,
$$\|\mathbb{G}_n\|_{\mathcal{F}_\delta} \le C_\alpha K\Bigg(\int_0^\delta\sqrt{\log N(\varepsilon,\mathcal{F},\rho)}\,d\varepsilon + n^{-1/2+1/\beta}\int_0^\delta\big(\log N(\varepsilon,\mathcal{F},\rho)\big)^{1/\alpha}d\varepsilon\Bigg) + C_\alpha K\delta\big(\sqrt{t} + n^{-1/2+1/\beta}t^{1/\alpha}\big),$$
where $C_\alpha > 0$ is a constant depending on $\alpha$ only and $1/\alpha + 1/\beta = 1$.*

Lemma 26 (Theorem 5.2, Chernozhukov et al., 2014). Let
$\mathcal{F} \subset L_2(S,\mathcal{S},P)$ with envelope $F \in L_2(S,\mathcal{S},P)$ and $0 \in \mathcal{F}$. Let $\sigma^2 > 0$ be any positive constant such that $\sup_{f\in\mathcal{F}}Pf^2 \le \sigma^2 \le \|F\|_{P,2}^2$. Set $\eta = \sigma/\|F\|_{P,2}$. Define $M = \max_{1\le i\le n}F(X_i)$. Then,
$$\mathbb{E}\|\mathbb{G}_n\|_{\mathcal{F}} \lesssim J(\eta,\mathcal{F})\|F\|_{P,2} + \frac{\|M\|_2\,J^2(\eta,\mathcal{F})}{\eta^2\sqrt{n}},$$
where
$$J(\eta,\mathcal{F}) = \int_0^\eta \sup_Q\sqrt{\log N\big(\varepsilon\|F\|_{Q,2},\mathcal{F},L_2(Q)\big)}\,d\varepsilon,$$
and the supremum is taken over all finitely discrete probability measures $Q$.

Lemma 27 (Corollary 5.1, Chernozhukov et al., 2014). Let
$\mathcal{F} \subset L_2(S,\mathcal{S},P)$ be a VC type class of functions with envelope $F \in L_2(S,\mathcal{S},P)$ and $0 \in \mathcal{F}$. Let $\sigma^2 > 0$ be any positive constant such that $\sup_{f\in\mathcal{F}}Pf^2 \le \sigma^2 \le \|F\|_{P,2}^2$. Set $\eta = \sigma/\|F\|_{P,2}$. Define $M = \max_{1\le i\le n}F(X_i)$. Then,
$$\mathbb{E}\|\mathbb{G}_n\|_{\mathcal{F}} \lesssim \sqrt{V\sigma^2\log\Big(\frac{A\|F\|_{P,2}}{\sigma}\Big)} + \frac{V\|M\|_2}{\sqrt{n}}\log\Big(\frac{A\|F\|_{P,2}}{\sigma}\Big).$$

Lemma 28 (Theorem 4, Adamczak, 2008). Let
$\mathcal{F} \subset L_2(S,\mathcal{S},P)$ with envelope $F \in L_2(S,\mathcal{S},P)$ and $0 \in \mathcal{F}$. Let $\sigma^2 > 0$ be any positive constant such that $\sup_{f\in\mathcal{F}}Pf^2 \le \sigma^2 \le \|F\|_{P,2}^2$. Define $M = \max_{1\le i\le n}F(X_i)$. For all $\alpha \in (0,1]$ and $t \ge 1$, with probability at least $1 - e^{-t}$,
$$\|\mathbb{G}_n\|_{\mathcal{F}} \lesssim \mathbb{E}\|\mathbb{G}_n\|_{\mathcal{F}} + \sigma\sqrt{t} + \|M\|_{\psi_\alpha}\,n^{-1/2}\,t^{1/\alpha}.$$

F.2 Auxiliary results
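Several results above (Definition 9, Theorem 8, Corollary 4, Lemma 28) are phrased in terms of $\psi_\alpha$ Orlicz norms. Under the empirical measure, the infimum in Definition 9 can be computed by bisection; a sketch (the function name, seed-free setup, and tolerances are our own, not from the paper):

```python
import numpy as np

def empirical_orlicz_norm(x, alpha=2.0, tol=1e-8):
    """Empirical psi_alpha norm: the smallest lam such that
    mean(exp((|x|/lam)**alpha) - 1) <= 1, i.e. Definition 9 with the
    empirical measure plugged in for Q."""
    x = np.abs(np.asarray(x, dtype=float))

    def feasible(lam):
        with np.errstate(over="ignore"):       # tiny lam overflows to inf -> infeasible
            return np.mean(np.expm1((x / lam) ** alpha)) <= 1.0

    lo, hi = tol, max(float(x.max()), tol)
    while not feasible(hi):                    # grow hi until it is feasible
        hi *= 2.0
    while hi - lo > tol * hi:                  # bisect down to the infimum
        mid = 0.5 * (lo + hi)
        lo, hi = (lo, mid) if feasible(mid) else (mid, hi)
    return hi

# Sanity check against a closed form: for the constant sample |x| = c the
# psi_2 norm is c / sqrt(log 2), since exp((c/lam)^2) <= 2 iff lam >= that.
print(round(empirical_orlicz_norm(np.full(100, 3.0)), 4))  # ~ 3.6034
```

The feasibility set is an up-set in $\lambda$ (the Orlicz expectation is decreasing in $\lambda$), which is what makes bisection valid.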
In this section we collect technical auxiliary results that we use throughout the proofs.
Corollary 5.
Let
$\mathcal{G} \subset L_2(S,\mathcal{S},P)$ be a finite collection of functions with $\operatorname{card}(\mathcal{G}) < \infty$. Let $\sigma^2 > 0$ be a positive constant such that $\sup_{g\in\mathcal{G}}Pg^2 \le \sigma^2$. Let $\rho$ be a semi-metric on $\mathcal{G}^2\times\mathcal{G}^2$ with $\mathcal{G}^2 = \{g^2 : g\in\mathcal{G}\}$, and let $\delta > 0$ be a positive constant such that $\sup_{g_1,g_2\in\mathcal{G}}\rho(g_1^2,g_2^2) \le \delta$. Let $\mathcal{H}\subset L_2(S,\mathcal{S},P)$ be a collection of VC subgraph functions with VC index $V(\mathcal{H})$ and absolute values bounded by one. Define $\mathcal{F} = \mathcal{G}\mathcal{H} = \{x\mapsto g(x)h(x) : g\in\mathcal{G},\ h\in\mathcal{H}\}$. If there exists a constant $K > 0$ such that for some $\alpha\in(0,1]$ and all $g_1,g_2\in\mathcal{G}$,
$$\|(g_1^2 - Pg_1^2) - (g_2^2 - Pg_2^2)\|_{P,\psi_\alpha} \le K\rho(g_1^2,g_2^2), \tag{37}$$
then there exist constants $c, c' > 0$ (depending on $\alpha$) such that for all $t \ge 1$, with probability at least $1 - ce^{-c't}$,
$$\|\mathbb{G}_n\|_{\mathcal{F}} \lesssim \sqrt{V(\mathcal{H}) + \log\operatorname{card}(\mathcal{G})}\,\sqrt{\sigma^2 + K\delta\,\pi_{n,\alpha}\big(\log\operatorname{card}(\mathcal{G})\big)} + \sqrt{t}\,\sqrt{\sigma^2 + K\delta\,\pi_{n,\alpha}(t)},$$
where $\pi_{n,\alpha}(z) = \sqrt{z/n} + z^{1/\alpha}/n$ for $z \ge 0$.

Remark 17.
A similar result holds for $\alpha \in (1,2]$ with explicit constants $c = 3e$, $c' = 1$; however, we do not need such a result in this paper.

Remark 18.
Note that the upper bound does not depend on the envelope of $\mathcal{F}$. The bound is therefore better by at least a $\log n$ factor than what a combination of Lemma 28 and Lemma 27 would give. Typically, the quantities $K\delta\,\pi_{n,\alpha}\big(\log\operatorname{card}(\mathcal{G})\big)$ and $K\delta\,\pi_{n,\alpha}(t)$ will be negligible compared to $\sigma^2$.

Remark 19.
We use this result to derive sharp bounds on the gradient of the check loss (i.e. the loss function of the quantile regression program). To establish the connection between this result and quantile regression, note that the (sub)gradient of the check loss, $\big(\tau - 1\{Y \le X'\theta_0(\tau)\}\big)X'v$, can be written as the product of two functions, $h(X,Y) = \tau - 1\{Y \le X'\theta_0(\tau)\} = \tau - 1\{F_{Y|X}(Y\mid X) \le \tau\}$ and $g(X) = X'v$. The function $h$ is bounded in absolute value by one and belongs to a class of VC subgraph functions with VC index at most 3 (the difference of two classes of VC subgraph functions with VC index at most 2), and $g$ is indexed by $v \in \mathbb{R}^p$. In our specific applications, we will be able to argue that $v \in M \subset \mathbb{R}^p$ with $\operatorname{card}(M) < \infty$.

Lemma 29 (Orlicz norm of products of sub-Gaussian random variables). *Let $X_1, \ldots, X_K \in \mathbb{R}$ be sub-Gaussian random variables. Then, for $K \ge 2$, $\big\|\prod_{k=1}^K X_k\big\|_{\psi_{2/K}} \le \prod_{k=1}^K\|X_k\|_{\psi_2}$.*

Lemma 30.
*Let $\phi, \varphi : \mathbb{R}_+ \to \mathbb{R}_+$ be increasing functions and $\xi$ be a random variable on $L_1(\Omega,\mathcal{A},\mathbb{P})$. If for all events $A\in\mathcal{A}$ with $\mathbb{P}\{A\} > 0$, $\mathbb{E}[\xi\mid A] \le \phi\big(\varphi^{-1}(1/\mathbb{P}\{A\})\big)$, then for all $u > 0$, $\mathbb{P}\{\xi > \phi(u)\} \le \big(\varphi(u)\big)^{-1}$.*

Lemma 31 (Useful generalization of Lemma 1, Panchenko, 2003). Let
$X, Y$ be random variables such that $\mathbb{E}[F(X)] \le \mathbb{E}[F(Y)]$ for every convex and increasing function $F$. Further, let $\varphi : \mathbb{R}_+ \to \mathbb{R}_+$ be a concave and strictly increasing function and $\alpha \in (0,\infty)$. If for some constants $c_1 > 0$, $c_2 > 0$, and for all $t \ge 0$, $\mathbb{P}\big\{Y \ge \varphi(t^{1/\alpha})\big\} \le c_1 e^{-c_2 t}$, then there exist constants $c_3 > 0$, $c_4 > 0$ (depending only on $c_1, c_2, \alpha$) such that, for all $t \ge 0$, $\mathbb{P}\big\{X \ge \varphi(t^{1/\alpha})\big\} \le c_3 e^{-c_4 t}$.

Remark 20.
It is possible to find explicit expressions for the constants $c_3, c_4$ in terms of $c_1, c_2, \alpha$. For our purposes the constants are irrelevant; only the nature of the inherited tail behavior is important. However, for completeness, we record that for $\alpha \in [1,\infty)$ the result holds with $c_3 = c_1 e$ and $c_4 = c_2$. The main use of this lemma is to deduce tail bounds for empirical processes based on tail bounds on the corresponding symmetrized empirical process.

Lemma 32.
*Let $\mathcal{F}, \mathcal{H} \subset L_2(S,\mathcal{S},P)$. Further, let $\varphi_i : \mathbb{R}\to\mathbb{R}$, $i \le n$, be contractions with $\varphi_i(0) = 0$, and let $F : \mathbb{R}_+\to\mathbb{R}_+$ be convex and increasing. Let $\{g_i\}_{i=1}^n$ be a sequence of i.i.d. standard Gaussian random variables independent of $\{X_i\}_{i=1}^n$. Then,
$$\mathbb{E}\Bigg[F\Bigg(\frac12\sup_{f\in\mathcal{F}}\sup_{h\in\mathcal{H}}\bigg|\sum_{i=1}^n g_i\varphi_i\big(f(X_i)\big)h(X_i)\bigg|\Bigg)\Bigg] \le \mathbb{E}\Bigg[F\Bigg(\sup_{f\in\mathcal{F}}\sup_{h\in\mathcal{H}}\bigg|\sum_{i=1}^n g_i f(X_i)h(X_i)\bigg|\Bigg)\Bigg].$$*

Remark 21. It would be desirable to derive an analogous result with Rademacher instead of standard Gaussian random variables. For bounded function classes such a result can be easily obtained via Theorem 2 in Maurer (2016); for unbounded function classes it is unclear whether such a result holds true.
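Remark 19's decomposition of the check-loss gradient into a bounded VC-subgraph factor $h(X,Y) = \tau - 1\{Y \le X'\theta_0(\tau)\}$ times a linear factor $g(X) = X'v$, with $v$ ranging over a finite set $M$, can be simulated directly. A toy sketch (model, cardinality of $M$, and seed are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, tau = 5_000, 10, 0.5
X = rng.normal(size=(n, p))
Y = rng.normal(size=n)               # toy model in which theta0(0.5) = 0

h = tau - (Y <= X @ np.zeros(p))     # bounded VC-subgraph factor, |h| <= 1
M = rng.normal(size=(50, p))         # finite set of directions v, card(M) = 50
# empirical process G_n evaluated at each product function h(X,Y) * X'v
Gn = np.abs((h[:, None] * (X @ M.T)).sum(axis=0)) / np.sqrt(n)
print(round(float(Gn.max()), 2))     # sup over M: moderate, ~ sqrt(log card(M)) scale
```

Each coordinate of `Gn` is a centered average because $\mathbb{E}[h\mid X] = 0$ at the true quantile, and the supremum over the finite class grows only logarithmically in $\operatorname{card}(M)$, which is the mechanism Corollary 5 exploits.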
G Proofs of the results in the supplementary materials
G.1 Proofs of Section D.3
Proof of Theorem 3 . Ansatz.
For $\tau \in \mathcal{T}$ and $r > 0$ set $B_p(0,r) := \{v\in\mathbb{R}^p : \|v\|_2 \le r\}$ and recall that $C_p(J,\vartheta) = \{\theta\in\mathbb{R}^p : \|\theta_{J^c}\|_1 \le \vartheta\|\theta_J\|_1\}$. Suppose that, with high probability,

(a) $\hat\theta(\tau) - \theta_0(\tau) \in C_p\big(T_\theta(\tau), \bar{c}\big)$ for all $\tau\in\mathcal{T}$; and

(b) for some absolute constant $c_0 > 0$, the regularized and centered objective function is strictly positive when evaluated at points $(\theta,\tau)\in\mathbb{R}^p\times\mathcal{T}$ satisfying $\theta - \theta_0(\tau) \in C_p(T_\theta(\tau),\bar{c})\cap\partial B_p\big(0, r/(2c_0)\big) =: K\big(r/(2c_0),\tau\big)$, i.e.
$$\inf_{\tau\in\mathcal{T}}\ \inf_{\theta-\theta_0(\tau)\in K(r/(2c_0),\tau)}\Bigg(\sum_{i=1}^n \rho_\tau(Y_i - X_i'\theta) - \rho_\tau\big(Y_i - X_i'\theta_0(\tau)\big) + \lambda\big(\|\theta\|_1 - \|\theta_0(\tau)\|_1\big)\Bigg) > 0.$$
Since the regularized and centered objective function is convex in $\theta$ and negative at $\hat\theta(\tau)$ for all $\tau\in\mathcal{T}$, it follows that $\sup_{\tau\in\mathcal{T}}\|\hat\theta(\tau) - \theta_0(\tau)\|_2 \lesssim r$. Thus, to establish the claim of the theorem, we only need to prove that statements (a) and (b) hold with high probability.

Verification of the high-probability statements.
By assumption, eq. (22) holds true. Thus, by Lemma 4, statement (a) holds for all $\tau\in\mathcal{T}$ with probability one. We are left with establishing statement (b). Suppose that $r > 0$ is such that Assumption 7 holds with $(\vartheta,\varrho) = (\bar{c}, r)$. Then $q_r(\bar{c}) \gtrsim 1$, i.e. there exists an absolute constant $c_0 \ge c\cdot q_r(\bar{c}) \ge 1$. Set
$$\mathcal{G} := \big\{g:\mathbb{R}^{p+1}\to\mathbb{R} : g(X,Y) = \rho_\tau\big(Y - X'\theta_0(\tau)\big) - \rho_\tau(Y - X'\theta),\ \theta\in\mathbb{R}^p,\ \tau\in\mathcal{T},\ \theta-\theta_0(\tau)\in K\big(r/(2c_0),\tau\big)\big\}.$$
Compute
$$\begin{aligned}&\inf_{\tau\in\mathcal{T}}\ \inf_{\theta-\theta_0(\tau)\in K(r/(2c_0),\tau)}\Bigg(\frac1n\sum_{i=1}^n \rho_\tau(Y_i-X_i'\theta) - \rho_\tau\big(Y_i-X_i'\theta_0(\tau)\big) + \frac{\lambda}{n}\big(\|\theta\|_1-\|\theta_0(\tau)\|_1\big)\Bigg)\\
&\quad\gtrsim -\sup_{g\in\mathcal{G}}\frac1n\sum_{i=1}^n g(X_i,Y_i) - \sup_{\tau\in\mathcal{T}}\sup_{\theta-\theta_0(\tau)\in K(r/(2c_0),\tau)}\frac{\lambda}{n}\big(\|\theta\|_1-\|\theta_0(\tau)\|_1\big)\\
&\quad\gtrsim -\sup_{g\in\mathcal{G}}\mathbb{E}\Bigg[\frac1n\sum_{i=1}^n g(X_i,Y_i)\Bigg] - n^{-1/2}\|\mathbb{G}_n\|_{\mathcal{G}} - \sup_{\tau\in\mathcal{T}}\sup_{\theta-\theta_0(\tau)\in K(r/(2c_0),\tau)}\frac{\lambda}{n}\big(\|\theta\|_1-\|\theta_0(\tau)\|_1\big).\end{aligned}\tag{38}$$
We bound the expressions on the far right-hand side of the above display. By Lemma 6,
$$-\sup_{g\in\mathcal{G}}\mathbb{E}\Bigg[\frac1n\sum_{i=1}^n g(X_i,Y_i)\Bigg] = \inf_{g\in\mathcal{G}}\mathbb{E}\Bigg[\frac1n\sum_{i=1}^n -g(X_i,Y_i)\Bigg] \gtrsim \kappa^2(\bar{c})\,r^2. \tag{39}$$
By Lemma 9, with probability at least $1-\delta$,
$$n^{-1/2}\|\mathbb{G}_n\|_{\mathcal{G}} \lesssim (1+\bar{c})\,\phi^{1/2}_{\max}(2s_\theta)\,r\sqrt{\frac{s_\theta\log(ep/s_\theta) + \log(1/\delta) + \log(1+L_\theta/r)}{n}}. \tag{40}$$
By the reverse triangle inequality and Lemma 4,
$$\begin{aligned}\sup_{\tau\in\mathcal{T}}\sup_{\theta-\theta_0(\tau)\in K(r/(2c_0),\tau)}\frac{\lambda}{n}\big(\|\theta\|_1-\|\theta_0(\tau)\|_1\big) &\le \sup_{\tau\in\mathcal{T}}\sup_{\theta-\theta_0(\tau)\in K(r/(2c_0),\tau)}\frac{\lambda}{n}\Bigg(\sum_{k\in T_\theta}\big(|\theta_k|-|\theta_{0,k}(\tau)|\big) + \sum_{k\in T_\theta^c}|\theta_k|\Bigg)\\
&\le \sup_{\tau\in\mathcal{T}}\sup_{\theta-\theta_0(\tau)\in K(r/(2c_0),\tau)}\frac{\lambda}{n}\Bigg(\sum_{k\in T_\theta}|\theta_k-\theta_{0,k}(\tau)| + \sum_{k\in T_\theta^c}|\theta_k|\Bigg)\\
&\le (1+\bar{c})\sup_{\tau\in\mathcal{T}}\sup_{\theta-\theta_0(\tau)\in K(r/(2c_0),\tau)}\frac{\lambda}{n}\sum_{k\in T_\theta}|\theta_k-\theta_{0,k}(\tau)| \le (1+\bar{c})\,\frac{\lambda}{n}\,s_\theta^{1/2}\,r.\end{aligned}\tag{41}$$
Combine eq. (39), (40), and (41), use Assumption 6, and observe that there exist absolute constants $c_1, c_2, c_3 > 0$ such that, with probability at least $1-\delta$, the expression in eq.
(38) can be lower bounded (up to a multiplicative constant) by
$$\kappa^2(\bar{c})\,r^2 - c_1(1+\bar{c})\,\phi^{1/2}_{\max}(2s_\theta)\,r\sqrt{\frac{s_\theta\log(ep/s_\theta) + \log(1+L_\theta/r) + \log(1/\delta)}{n}} - c_2(1+\bar{c})\,\frac{\lambda}{n}\,s_\theta^{1/2}\,r > 0,$$
whenever
$$r \ge c_3\Bigg(C_1\sqrt{\frac{s_\theta\log(ep/s_\theta) + \log n + \log(1/\delta)}{n}}\ \vee\ \frac{\bar{c}}{\kappa^2(\bar{c})}\,\frac{\lambda\sqrt{s_\theta}}{n}\Bigg),$$
where $C_1 := \bar{c}\,\phi^{1/2}_{\max}(2s_\theta)\,L_\theta/\kappa^2(\bar{c})$ (recall that $L_\theta \ge 1$) and $\bar{c} > 1$, and $(\bar{c}, r)$ are such that the $(\vartheta,\varrho)$-restricted identifiability Assumption 7 holds. This concludes the proof.
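Theorem 3 requires $\lambda$ to dominate the sup-norm of the quantile score gradient, cf. eq. (22), and $\sqrt{n\log(p/\delta)}$ is the relevant scale. A quick simulation of that gradient at a single quantile level under a toy pure-noise design (all names and values hypothetical):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, tau, delta = 2_000, 200, 0.5, 0.1
X = rng.normal(size=(n, p))                  # N(0,1) columns, so sigma_max = 1
Y = rng.normal(size=n)                       # toy model with theta0(tau) = 0

score = tau - (Y <= 0.0)                     # tau - 1{Y_i <= X_i' theta0(tau)}
grad_sup = np.abs(score @ X).max()           # || sum_i score_i X_i ||_inf
lam = np.sqrt(n * np.log(p / delta))         # the sqrt(n log(p/delta)) benchmark
print(grad_sup < 3 * lam)                    # True: gradient below a small multiple
```

Each coordinate of the gradient is a sum of $n$ bounded, mean-zero terms, so its sup-norm over $p$ coordinates concentrates at the $\sqrt{n\log p}$ scale that the choice of $\lambda$ targets.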
Proof of Corollary 1 . The idea is to derive an upper bound on the gradient of the quantile loss function.Any λ > J = (cid:8) { } , { } , . . . , { p } (cid:9) , C p ( J, ∩ ∂B p (0 ,
1) reduces to the set of standard unit vectors in R p . Hence, by Lemma 11, with probabilityat least 1 − δ sup τ ∈T (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n (cid:88) i =1 (cid:0) τ − { Y i ≤ X (cid:48) i θ ( τ ) } (cid:1) X i (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∞ = sup τ ∈T sup v ∈ C p ( J, ∩ B p (0 , (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n (cid:88) i =1 (cid:0) τ − { Y i ≤ X (cid:48) i θ ( τ ) } (cid:1) X (cid:48) i v (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (cid:46) σ max (cid:112) n log( p/δ ) (cid:113) π n, (log( p/δ )) , where π n, ( z ) = (cid:112) z/n + z/n for z ≥ σ max = ϕ / (1). Thus, by Assumption 6, there exists a constant > − δ ,sup τ ∈T (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n (cid:88) i =1 (cid:0) τ − { Y i ≤ X (cid:48) i θ ( τ ) } (cid:1) X i (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∞ ≤ C σ max (cid:112) n log( p/δ ) . Therefore, any λ ≥ C c σ max (cid:112) n log( p/δ ) satisfies eq. (22) with probability at least 1 − δ . Plug this lowerbound on λ > r > G.2 Proofs of Section D.4
Proof of Theorem 4 . Proof of statement (i).
Recall from Lemma 9 in Belloni and Chernozhukov (2011) that $\hat{s} \le n\wedge p$. From the complementary slackness characterization (C.3) in Belloni and Chernozhukov (2011) we obtain the following inequality:
$$\lambda\hat{s} = \lambda\sup_{\tau\in\mathcal{T}}\operatorname{sign}\big(\hat\theta_\lambda(\tau)\big)'\operatorname{sign}\big(\hat\theta_\lambda(\tau)\big) \le \sup_{\tau\in\mathcal{T}}\operatorname{sign}\big(\hat\theta_\lambda(\tau)\big)'\mathbf{X}'\hat{a}(\tau) \le n\sqrt{\hat{s}}\Bigg(\sup_{\|u\|_2\le1,\,\|u\|_0\le\hat{s}}\frac1n\sum_{i=1}^n(X_i'u)^2\Bigg)^{1/2} \le n\sqrt{\widehat\varphi_{\max}(\hat{s})\,\hat{s}}.$$
This implies that for all $\lambda > 0$,
$$\lambda \le n\sqrt{\widehat\varphi_{\max}(\hat{s})/\hat{s}}.$$
Combine the above inequality with the lower bound $\lambda \ge n\sqrt{\widehat\varphi_{\max}(m)/m}$ to obtain
$$\hat{s} \le m\,\frac{\widehat\varphi_{\max}(\hat{s})}{\widehat\varphi_{\max}(m)}.$$
We now proceed by contradiction as in the proof of Lemma 6 in Belloni and Chernozhukov (2011): Suppose that $\hat{s} > m$. Then there exists $\ell > 1$ such that $\hat{s} = \ell m$. Therefore, by Lemma 13 in Belloni and Chernozhukov (2011),
$$\hat{s} \le m\,\frac{\widehat\varphi_{\max}(\ell m)}{\widehat\varphi_{\max}(m)} \le m\lceil\ell\rceil < \ell m = \hat{s},$$
a contradiction. Thus $\hat{s} \le m$. This concludes the proof of statement (i).

Proof of statement (ii).
The following proof is modeled after the proof of Lemma 7 in Belloni and Chernozhukov (2011) with the necessary modifications to accommodate our setup.

First, since $s\mapsto\varphi_{\max}(s)\vee\widehat\varphi_{\max}(s)$ is non-decreasing, we have
$$\lambda = C_\lambda\sqrt{\varphi_{\max}\Big(\frac{n}{\log(np/\delta)}\Big)\vee\widehat\varphi_{\max}\Big(\frac{n}{\log(np/\delta)}\Big)}\,\sqrt{n\log(np/\delta)} \ge C_\lambda\sqrt{\varphi_{\max}(1)\vee\widehat\varphi_{\max}(1)}\,\sqrt{n\log(p/\delta)}.$$
Thus, arguing as in the proof of Corollary 1, we have $\hat\theta_\lambda(\tau)-\theta_0(\tau)\in C_p\big(T_\theta(\tau),\bar{c}\big)$ for all $\tau\in\mathcal{T}$, and, with probability at least $1-\delta$,
$$\sup_{\tau\in\mathcal{T}}\|\hat\theta_\lambda(\tau)-\theta_0(\tau)\|_2 \lesssim \widetilde{C}\sqrt{\frac{s_\theta\log(np/\delta)}{n}},$$
where
$$\widetilde{C} := \frac{\bar{c}\,\phi^{1/2}_{\max}(2s_\theta)\,L_\theta}{\kappa^2(\bar{c})} \vee \frac{C_\lambda\,\varphi^{1/2}_{\max}\big(n/\log(np/\delta)\big)}{\kappa^2(\bar{c})} \vee \frac{C_\lambda\,\widehat\varphi^{1/2}_{\max}\big(n/\log(np/\delta)\big)}{\kappa^2(\bar{c})} \vee 1. \tag{42}$$
In the following, to simplify the notation, we write $\widehat{T}_\theta(\tau) = \operatorname{support}\big(\hat\theta_\lambda(\tau)\big)$, $\mathbf{X} = [X_1,\ldots,X_n]$, and $\hat{s}(\tau) = \|\hat\theta_\lambda(\tau)\|_0$. Also, denote by $\hat{a}(\tau) = \big(\hat{a}_1(\tau),\ldots,\hat{a}_n(\tau)\big)$ the (vector of) dual optimal rank scores which solve the dual program (C.2) in Belloni and Chernozhukov (2011). From the complementary slackness characterization (C.3) and identity (C.4) in Belloni and Chernozhukov (2011) we obtain the following inequality:
$$\begin{aligned}\lambda\sqrt{\hat{s}} &\le \sup_{\tau\in\mathcal{T}}\Bigg\|\sum_{i=1}^n\big(\tau - 1\{Y_i\le X_i'\hat\theta_\lambda(\tau)\} - \hat{a}_i(\tau)\big)X_{i,\widehat{T}_\theta(\tau)}\Bigg\|_2 + \sup_{\tau\in\mathcal{T}}\Bigg\|\sum_{i=1}^n\big(1\{Y_i\le X_i'\theta_0(\tau)\} - 1\{Y_i\le X_i'\hat\theta_\lambda(\tau)\}\big)X_{i,\widehat{T}_\theta(\tau)}\Bigg\|_2\\
&\quad + \sup_{\tau\in\mathcal{T}}\Bigg\|\sum_{i=1}^n\big(\tau - 1\{Y_i\le X_i'\theta_0(\tau)\}\big)X_{i,\widehat{T}_\theta(\tau)}\Bigg\|_2 = \mathrm{I} + \mathrm{II} + \mathrm{III}. \end{aligned}\tag{43}$$

Bound on I.
Observe that $\hat{a}_i(\tau) \ne \tau - 1\{Y_i\le X_i'\hat\theta_\lambda(\tau)\}$ only if $Y_i = X_i'\hat\theta_\lambda(\tau)$. By Lemma 9 in Belloni and Chernozhukov (2011) the penalized quantile regression fit can interpolate at most $\hat{s}(\tau)\le\hat{s}$ points almost surely, uniformly over $\tau\in\mathcal{T}$. This implies that $\sum_{i=1}^n\big|\tau - 1\{Y_i\le X_i'\hat\theta_\lambda(\tau)\} - \hat{a}_i(\tau)\big| \le \hat{s}$. Therefore, by Hölder's inequality and Lemma 13, with probability at least $1-\delta$,
$$\begin{aligned}\sup_{\tau\in\mathcal{T}}\Bigg\|\sum_{i=1}^n\big(\tau-1\{Y_i\le X_i'\hat\theta_\lambda(\tau)\}-\hat{a}_i(\tau)\big)X_{i,\widehat{T}_\theta(\tau)}\Bigg\|_2 &\le \sup_{\operatorname{card}(I)\le\hat{s}}\ \sup_{\|u\|_2\le1,\,\|u\|_0\le\hat{s}}\Bigg|\sum_{i\in I}X_i'u\Bigg|\\
&\le \sup_{\operatorname{card}(I)\le\hat{s}}\ \sup_{\|u\|_2\le1,\,\|u\|_0\le\hat{s}}\hat{s}^{1/2}\Bigg(\sum_{i\in I}(X_i'u)^2 - \mathbb{E}\big[(X_i'u)^2\big]\Bigg)^{1/2} + \hat{s}\,\varphi^{1/2}_{\max}(\hat{s})\\
&\lesssim \varphi^{1/2}_{\max}(\hat{s})\,\hat{s}\sqrt{\log(epn/\hat{s})} + \varphi^{1/2}_{\max}(\hat{s})\,\hat{s}^{3/4}\big(\log(1/\delta)\big)^{1/4} + \varphi^{1/2}_{\max}(\hat{s})\,\hat{s}^{1/2}\sqrt{\log(1/\delta)} + \hat{s}\,\varphi^{1/2}_{\max}(\hat{s}),\end{aligned}\tag{44}$$
where we have used that $\phi_{\max}(\hat{s})\le\varphi_{\max}(\hat{s})$.

Bound on II.
Define
$$\mathcal{G} = \big\{g:\mathbb{R}^{p+1}\to\mathbb{R} : g(X,Y) = \big(1\{Y\le X'\theta\} - 1\{Y\le X'\theta_0(\tau)\}\big)X'v,\ \theta\in\mathbb{R}^p,\ \|v\|_2\le1,\ \|v\|_0\le n,\ \|\theta\|_0\le n,\ \tau\in\mathcal{T}\big\},$$
and, for $1\le s\le n\wedge p$ and $r_0 = \widetilde{C}\sqrt{s_\theta\log(np/\delta)/n}$ with $\widetilde{C}\ge1$ as in eq. (42),
$$\mathcal{G}(s) = \big\{g_{v,\theta,\tau}\in\mathcal{G} : \|v\|_0\le s,\ \|\theta\|_0\le s\big\}\bigcap\big\{\theta\in\mathbb{R}^p : \theta-\theta_0(\tau)\in C_p\big(T_\theta(\tau),\bar{c}\big)\cap B_p(0,r_0),\ \tau\in\mathcal{T}\big\}.$$
By the triangle inequality and Corollary 1, with probability at least $1-\delta$,
$$\sup_{\tau\in\mathcal{T}}\Bigg\|\sum_{i=1}^n\big(1\{Y_i\le X_i'\theta_0(\tau)\}-1\{Y_i\le X_i'\hat\theta_\lambda(\tau)\}\big)X_{i,\widehat{T}_\theta(\tau)}\Bigg\|_2 \le \sqrt{n}\,\|\mathbb{G}_n\|_{\mathcal{G}(\hat{s})} + n\sup_{g\in\mathcal{G}(\hat{s})}|Pg|. \tag{45}$$
By Lemma 12 (ii), with probability at least $1-\delta$,
$$\sqrt{n}\,\|\mathbb{G}_n\|_{\mathcal{G}(\hat{s})} \lesssim \varphi^{1/2}_{\max}(\hat{s})\sqrt{n}\sqrt{\hat{s}\log(ep/\hat{s})+\log(n/\delta)}\,\sqrt{\pi_{n,1}\big(\hat{s}\log(ep/\hat{s})+\log(n/\delta)\big)} \lesssim \varphi^{1/2}_{\max}(\hat{s})\sqrt{n}\sqrt{\hat{s}\log(ep/\hat{s})+\log(n/\delta)}, \tag{46}$$
where the second inequality holds since $\hat{s}\le n/\log(np/\delta)$ by choice of $\lambda > 0$. Moreover, for all $g\in\mathcal{G}(\hat{s})$,
$$\begin{aligned}nPg &= n\,\mathbb{E}\Bigg[X'v\int_{X'\theta_0(\tau)}^{X'\theta}\big(f_{Y|X}(z\mid X)-f_{Y|X}\big(X'\theta_0(\tau)\mid X\big)\big)\,dz\Bigg] + n\,v'\mathbb{E}\big[f_{Y|X}\big(X'\theta_0(\tau)\mid X\big)XX'\big]\big(\theta-\theta_0(\tau)\big)\\
&\le n\,L_f\,\mathbb{E}\Big[|X'v|\big(X'(\theta_0(\tau)-\theta)\big)^2\Big] + n\,v'\mathbb{E}\big[f_{Y|X}\big(X'\theta_0(\tau)\mid X\big)XX'\big]\big(\theta-\theta_0(\tau)\big).\end{aligned}\tag{47}$$
Since $\tau\mapsto f_{Y|X}\big(X'\theta_0(\tau)\mid X\big)$ is continuous on the compact set $\mathcal{T}$, it is necessarily bounded. Thus, by Lemmas 7 and 8,
$$n\,v'\mathbb{E}\big[f_{Y|X}\big(X'\theta_0(\tau)\mid X\big)XX'\big]\big(\theta-\theta_0(\tau)\big) \lesssim (2+\bar{c})\,n\,\varphi^{1/2}_{\max}(\hat{s})\,\varphi^{1/2}_{\max}(s_\theta)\,r_0. \tag{48}$$
By Hölder's inequality, Assumption 1, and Lemmas 7 and 8,
$$n\,L_f\,\mathbb{E}\Big[|X'v|\big(X'(\theta_0(\tau)-\theta)\big)^2\Big] \le n\,L_f\big(\mathbb{E}[|X'v|^2]\big)^{1/2}\Big(\mathbb{E}\big[\big(X'(\theta_0(\tau)-\theta)\big)^4\big]\Big)^{1/2} \lesssim (2+\bar{c})^2\,n\,L_f\,\varphi^{1/2}_{\max}(\hat{s})\,\phi_{\max}(s_\theta)\,r_0^2. \tag{49}$$
Combine eq. (45)–(49) and conclude that, with probability at least $1-\delta$,
$$\sup_{\tau\in\mathcal{T}}\Bigg\|\sum_{i=1}^n\big(1\{Y_i\le X_i'\theta_0(\tau)\}-1\{Y_i\le X_i'\hat\theta_\lambda(\tau)\}\big)X_{i,\widehat{T}_\theta(\tau)}\Bigg\|_2 \lesssim \varphi^{1/2}_{\max}(\hat{s})\sqrt{n}\sqrt{\hat{s}\log(ep/\hat{s})+\log(n/\delta)} + (2+\bar{c})^2\,nL_f\,\varphi^{1/2}_{\max}(\hat{s})\,\phi_{\max}(s_\theta)\,r_0^2 + (2+\bar{c})\,n\,\varphi^{1/2}_{\max}(\hat{s})\,\varphi^{1/2}_{\max}(s_\theta)\,r_0. \tag{50}$$

Bound on III. Since $\lambda > 0$ satisfies eq. (22), with probability at least $1-\delta$,
$$\sup_{\tau\in\mathcal{T}}\Bigg\|\sum_{i=1}^n\big(\tau-1\{Y_i\le X_i'\theta_0(\tau)\}\big)X_{i,\widehat{T}_\theta(\tau)}\Bigg\|_2 \le \sqrt{\hat{s}}\,\sup_{\tau\in\mathcal{T}}\Bigg\|\sum_{i=1}^n\big(\tau-1\{Y_i\le X_i'\theta_0(\tau)\}\big)X_{i,\widehat{T}_\theta(\tau)}\Bigg\|_\infty \le c^{-1}\lambda\sqrt{\hat{s}}. \tag{51}$$

Conclusion.
Combine eq. (43), (44), (50), and (51) to conclude that there exist absolute constants $c_1, c_2 > 0$ such that, with probability at least $1-\delta$,
$$\begin{aligned}\lambda\sqrt{\hat{s}} \le{}& c_1\Big(\varphi^{1/2}_{\max}(\hat{s})\,\hat{s}\sqrt{\log(epn/\hat{s})} + \varphi^{1/2}_{\max}(\hat{s})\,\hat{s}^{3/4}\big(\log(1/\delta)\big)^{1/4} + \varphi^{1/2}_{\max}(\hat{s})\,\hat{s}^{1/2}\sqrt{\log(1/\delta)} + \hat{s}\,\varphi^{1/2}_{\max}(\hat{s})\Big)\\
&+ c_2\,\varphi^{1/2}_{\max}(\hat{s})\sqrt{n\hat{s}\log(ep/\hat{s})+n\log(n/\delta)} + c_2(2+\bar{c})^2\,nL_f\,\varphi^{1/2}_{\max}(\hat{s})\,\phi_{\max}(s_\theta)\,r_0^2 + c_2(2+\bar{c})\,n\,\varphi^{1/2}_{\max}(\hat{s})\,\varphi^{1/2}_{\max}(s_\theta)\,r_0 + c^{-1}\lambda\sqrt{\hat{s}}.\end{aligned}$$
Set $m := n/\log(np/\delta)$. Recall that $s\mapsto\varphi_{\max}(s)$ is non-decreasing. By statement (i) and since $c > 1$,
$$\begin{aligned}\lambda\sqrt{\hat{s}} \le{}& \frac{c_1c}{c-1}\sqrt{\varphi_{\max}(m)}\,m\sqrt{\hat{s}}\Bigg(\sqrt{\log(ep/\hat{s})} + \Big(\frac{\log(1/\delta)}{m}\Big)^{3/4} + \sqrt{\frac{\log(1/\delta)}{m}} + 1\Bigg) + \frac{c_2c}{c-1}\sqrt{\varphi_{\max}(m)}\sqrt{\hat{s}}\Bigg(\sqrt{n\log(ep/\hat{s})} + \sqrt{\frac{n\log(n/\delta)}{\hat{s}}}\Bigg)\\
&+ \frac{c_2c}{c-1}\,\phi_{\max}(s_\theta)(2+\bar{c})^2L_f\sqrt{\varphi_{\max}(m)}\,nr_0^2 + \frac{c_2c}{c-1}\,\varphi^{1/2}_{\max}(s_\theta)(2+\bar{c})\sqrt{\varphi_{\max}(m)}\,nr_0.\end{aligned}\tag{52}$$
Divide the above inequality by $\lambda = C_\lambda\frac{n}{\sqrt{m}}\sqrt{\varphi_{\max}(m)\vee\widehat\varphi_{\max}(m)}$ and conclude that
$$\begin{aligned}\sqrt{\hat{s}} \lesssim{}& \frac{c_1c}{c-1}\Bigg(\sqrt{\frac{\log(ep/\hat{s})}{n}} + \Big(\frac{\log(1/\delta)}{m}\Big)^{3/4}n^{-1/2} + \sqrt{\frac{\log(1/\delta)}{nm}} + n^{-1/2}\Bigg)\sqrt{\frac{m}{n}}\,\frac{m\sqrt{\hat{s}}}{C_\lambda} + \frac{c_2c}{c-1}\sqrt{\frac{m\log(ep/\hat{s})}{n}}\,\frac{\sqrt{\hat{s}}}{C_\lambda} + \frac{c_2c}{c-1}\sqrt{\frac{m\log(n/\delta)}{C_\lambda^2\,n}}\\
&+ \frac{c_2c}{c-1}\,\frac{L_f}{C_\lambda}(2+\bar{c})^2\,\phi_{\max}(s_\theta)\sqrt{m}\,r_0^2 + \frac{c_2c}{c-1}\,\frac{1}{C_\lambda}(2+\bar{c})\,\varphi^{1/2}_{\max}(s_\theta)\sqrt{m}\,r_0.\end{aligned}$$
Since $m\log(p/\delta) = o(n)$, the above inequality simplifies to
$$\begin{aligned}\sqrt{\hat{s}} &\lesssim \frac{c_2c}{c-1}\,\frac{L_f}{C_\lambda}(2+\bar{c})^2\,\phi_{\max}(s_\theta)\sqrt{m}\,r_0^2 + \frac{c_2c}{c-1}\,\frac{\kappa^2(\bar{c})}{C_\lambda}\sqrt{m}\,r_0\\
&\lesssim \frac{c_2c}{c-1}\,\frac{\widetilde{C}^2}{C_\lambda}\,L_f(2+\bar{c})^2\,\phi_{\max}(s_\theta)\sqrt{\frac{s_\theta\log(np/\delta)}{n}}\sqrt{s_\theta} + \frac{c_2c}{c-1}\,\frac{\widetilde{C}}{C_\lambda}(2+\bar{c})\,\varphi^{1/2}_{\max}(s_\theta)\sqrt{s_\theta}\\
&\lesssim \frac{c_2c}{c-1}\,\frac{\widetilde{C}^2}{C_\lambda}\Big(L_f(2+\bar{c})^2\,\phi_{\max}(s_\theta) + (2+\bar{c})\,\varphi^{1/2}_{\max}(s_\theta)\Big)\sqrt{s_\theta}.\end{aligned}$$
This concludes the proof of the second statement.

G.3 Proofs of Section D.5
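The proofs in this section repeatedly use Knight's identity for the check loss, in the rearranged form $\rho_\tau(y-x'\theta)-\rho_\tau\big(y-x'\theta_0(\tau)\big) = -\tau\big(x'\theta-x'\theta_0(\tau)\big) + \phi_{\tau,x,y}\big(x'\theta-x'\theta_0(\tau)\big)$ with $\phi_{\tau,x,y}(v) = \int_0^v 1\{y\le x'\theta_0(\tau)+s\}\,ds$. A numerical check in scalar form, with $u = y-x'\theta_0(\tau)$ and $v = x'\theta-x'\theta_0(\tau)$ (grid size and draws hypothetical):

```python
import numpy as np

def rho(w, tau):
    """Check loss: rho_tau(w) = w * (tau - 1{w < 0})."""
    return w * (tau - (w < 0))

rng = np.random.default_rng(4)
tau, N = 0.25, 100_000
max_err = 0.0
for _ in range(100):
    u, v = rng.normal(size=2)
    s = (np.arange(N) + 0.5) * (v / N)   # midpoint grid from 0 to v (signed)
    phi = np.sum(u <= s) * (v / N)       # phi(v) = int_0^v 1{u <= s} ds
    lhs = rho(u - v, tau) - rho(u, tau)
    rhs = -tau * v + phi                 # rearranged Knight's identity
    max_err = max(max_err, abs(lhs - rhs))
print(max_err < 1e-3)                    # True up to grid discretization error
```

The midpoint sum handles negative $v$ correctly because the signed step $v/N$ yields $\int_0^v = -\int_v^0$.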
Proof of Lemma 4. The proof strategy is standard (e.g. Bickel et al., 2009; Belloni and Chernozhukov, 2011, 2013). By optimality of $\hat\theta_\lambda(\tau)$ and the premise, for all $\tau\in\mathcal{T}$,
$$\begin{aligned}0 &\ge \sum_{i=1}^n\rho_\tau\big(Y_i-X_i'\hat\theta_\lambda(\tau)\big) - \sum_{i=1}^n\rho_\tau\big(Y_i-X_i'\theta_0(\tau)\big) + \lambda\|\hat\theta_\lambda(\tau)\|_1 - \lambda\|\theta_0(\tau)\|_1\\
&\ge -\sum_{i=1}^n\big(\tau-1\{Y_i\le X_i'\theta_0(\tau)\}\big)X_i'\big(\hat\theta_\lambda(\tau)-\theta_0(\tau)\big) + \lambda\|\hat\theta_\lambda(\tau)\|_1 - \lambda\|\theta_0(\tau)\|_1\\
&\ge \lambda\Big(-c^{-1}\|\hat\theta_\lambda(\tau)-\theta_0(\tau)\|_1 + \|\hat\theta_\lambda(\tau)\|_1 - \|\theta_0(\tau)\|_1\Big).\end{aligned}$$
Thus,
$$c^{-1}\sum_{k=1}^p\big|\hat\theta_{\lambda,k}(\tau)-\theta_{0,k}(\tau)\big| \ge \sum_{k=1}^p\big|\hat\theta_{\lambda,k}(\tau)\big| - \sum_{k=1}^p\big|\theta_{0,k}(\tau)\big|. \tag{53}$$
By Assumption 3 and the reverse triangle inequality,
$$\sum_{k=1}^p\big|\hat\theta_{\lambda,k}(\tau)\big| - \sum_{k=1}^p\big|\theta_{0,k}(\tau)\big| \ge \sum_{k\in T_\theta^c(\tau)}\big|\hat\theta_{\lambda,k}(\tau)\big| - \sum_{k\in T_\theta(\tau)}\big|\hat\theta_{\lambda,k}(\tau)-\theta_{0,k}(\tau)\big|. \tag{54}$$
Combine eq. (53) and (54) to conclude that
$$\frac{c+1}{c-1}\sum_{k\in T_\theta(\tau)}\big|\hat\theta_{\lambda,k}(\tau)-\theta_{0,k}(\tau)\big| \ge \sum_{k\in T_\theta^c(\tau)}\big|\hat\theta_{\lambda,k}(\tau)\big|. \tag{55}$$
This concludes the proof.

Proof of Lemma 5. Obviously, $\phi_{\tau,x,y}(0) = 0$ and $|\phi_{\tau,x,y}(a)-\phi_{\tau,x,y}(b)| \le |a-b|$ for all $a, b \in \mathbb{R}$.
Moreover, by a change of variables,
\[
\begin{aligned}
\phi_{\tau,x,y}\big(x'\theta - x'\theta_0(\tau)\big) &= \int_0^{x'\theta - x'\theta_0(\tau)} 1\big\{y \le x'\theta_0(s) + u + \big(x'\theta_0(\tau) - x'\theta_0(s)\big)\big\}\, du \\
&= \int_{x'\theta_0(\tau) - x'\theta_0(s)}^{x'\theta - x'\theta_0(s)} 1\big\{y \le x'\theta_0(s) + u\big\}\, du \\
&= \phi_{s,x,y}\big(x'\theta - x'\theta_0(s)\big) - \phi_{s,x,y}\big(x'\theta_0(\tau) - x'\theta_0(s)\big).
\end{aligned}
\]
Lastly, re-arrange Knight's identity and obtain
\[
\rho_\tau(y - x'\theta) - \rho_\tau\big(y - x'\theta_0(\tau)\big) = -\tau\big(x'\theta - x'\theta_0(\tau)\big) + \phi_{\tau,x,y}\big(x'\theta - x'\theta_0(\tau)\big).
\]

Proof of Lemma 6. Define
\[
L(\theta \mid X) := E[\rho_\tau(Y - X'\theta) \mid X] = (\tau - 1)\int_{-\infty}^{X'\theta} (z - X'\theta)\, dF_{Y|X}(z) + \tau\int_{X'\theta}^{\infty} (z - X'\theta)\, dF_{Y|X}(z).
\]
Hence, by Leibniz' rule for differentiating parameter integrals,
\[
\frac{d}{d\theta} L(\theta \mid X) = (\tau - 1)\int_{-\infty}^{X'\theta} -X'\, dF_{Y|X}(z) + \tau\int_{X'\theta}^{\infty} -X'\, dF_{Y|X}(z) = \big(F_{Y|X}(X'\theta \mid X) - \tau\big)X', \quad (56)
\]
and, by differentiating eq. (56),
\[
\frac{d^2}{d\theta\, d\theta'} L(\theta \mid X) = f_{Y|X}(X'\theta \mid X)\, XX'. \quad (57)
\]
By optimality of $\theta_0(\tau)$ we have $\frac{d}{d\theta}L(\theta_0(\tau) \mid X) = 0$. Hence, eq. (56), eq. (57), and a second-order Taylor approximation around $\theta_0(\tau)$ yield
\[
E\big[\rho_\tau(Y - X'\theta) - \rho_\tau\big(Y - X'\theta_0(\tau)\big) \mid X\big] = \tfrac12\big(\theta - \theta_0(\tau)\big)'f_{Y|X}(X'\xi(\tau) \mid X)\, XX'\big(\theta - \theta_0(\tau)\big), \quad (58)
\]
where $\xi(\tau) = \lambda_0\big(\theta - \theta_0(\tau)\big) + \theta_0(\tau)$ for some $\lambda_0 \in [0,1]$. Since $f_{Y|X}$ is Lipschitz continuous with constant $L_f$,
\[
\big(\theta - \theta_0(\tau)\big)'f_{Y|X}(X'\xi(\tau) \mid X)\, XX'\big(\theta - \theta_0(\tau)\big) \ge \big(\theta - \theta_0(\tau)\big)'f_{Y|X}(X'\theta_0(\tau) \mid X)\, XX'\big(\theta - \theta_0(\tau)\big) - L_f\big|X'\big(\theta - \theta_0(\tau)\big)\big|^3. \quad (59)
\]
By Assumption 7 there exists an absolute constant $c_0 > 0$ such that $c_0\, q_r(\bar c) \ge r_0$.
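The rearranged Knight identity above can be verified numerically. The sketch below is a hypothetical illustration (not part of the proof), with scalars `b` and `c` standing in for $x'\theta$ and $x'\theta_0(\tau)$, and $\phi$ evaluated in closed form:

```python
import numpy as np

# Check: rho_tau(y - b) - rho_tau(y - c) = -tau*(b - c) + phi(b - c),
# where phi(a) = int_0^a 1{y <= c + u} du and rho_tau(u) = u*(tau - 1{u < 0}).
def rho(u, tau):
    return u * (tau - (u < 0))

def phi(a, y, c):
    # signed integral of the indicator 1{y <= c + u} over u in [0, a]
    lo, hi = min(0.0, a), max(0.0, a)
    length = max(0.0, hi - max(y - c, lo))
    return length if a >= 0 else -length

rng = np.random.default_rng(1)
max_err = 0.0
for _ in range(1000):
    tau = rng.uniform(0.05, 0.95)
    y, b, c = rng.normal(size=3)
    lhs = rho(y - b, tau) - rho(y - c, tau)
    rhs = -tau * (b - c) + phi(b - c, y, c)
    max_err = max(max_err, abs(lhs - rhs))
print("max deviation:", max_err)
```

Both sides are continuous, piecewise-linear functions of $b$ with identical slopes almost everywhere and identical value at $b = c$, so the identity holds exactly.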
Let $(\theta, \tau) \in \mathbb R^p \times \mathcal T$ satisfy $\theta - \theta_0(\tau) \in C^p(T_\theta(\tau), \bar c) \cap \partial B_2^p\big(0, r_0/(2c_0)\big)$. Then, eq. (58) and (59) imply
\[
\begin{aligned}
E\big[\rho_\tau(Y - X'\theta) - \rho_\tau\big(Y - X'\theta_0(\tau)\big)\big]
&\ge \inf_{\tau\in\mathcal T}\ \inf_{u \in C^p(T_\theta(\tau),\bar c) \cap \partial B_2^p(0,1)} \Big\{E\big[f_{Y|X}(X'\theta_0(\tau)|X)(X'u)^2\big] - L_f E\big[|X'u|^3\big]\, r_0/(2c_0)\Big\} \times r_0^2/(2c_0)^2/2 \\
&\ge \kappa_{\min}(\bar c)\, r_0^2/(4c_0^2) \gtrsim \kappa_{\min}(\bar c)\, r_0^2,
\end{aligned}
\]
where the last inequality follows again from Assumption 7 with $(\vartheta, \varrho) = (\bar c, r_0)$. This concludes the proof.

Proof of Lemma 7. By Lemma 7.1 (ii) in Koltchinskii (2011) the set
$M \subset B_2^p(0,1)$ is such that $C(J, \vartheta) \cap B_2^p(0,1) \subset (2+\vartheta)\,\mathrm{conv}(M)$, $\|u\|_0 \le s$ for all $u \in M$, and $\mathrm{card}(M) \le \sum_{k=0}^{s}\binom{p}{k}$. By Proposition 3.6.4 in Giné and Nickl (2015) we have $\sum_{k=0}^{s}\binom{p}{k} \le 2p^s/s! \le (ep/s)^s$ for $p \ge s+2$.

Proof of Lemma 8. For each $x \in \mathrm{conv}(\mathcal X)$ there exist $n \in \mathbb N$ and $\lambda_1, \ldots, \lambda_n \ge 0$ with $\sum_{i=1}^n \lambda_i = 1$, such that $x = \sum_{i=1}^n \lambda_i x_i$ for some $x_i \in \mathcal X$. Similarly, for each $y \in \mathrm{conv}(\mathcal Y)$ there exist $m \in \mathbb N$ and $\mu_1, \ldots, \mu_m \ge 0$ with $\sum_{j=1}^m \mu_j = 1$, such that $y = \sum_{j=1}^m \mu_j y_j$ for some $y_j \in \mathcal Y$. Thus, by biconvexity of $f$ and two applications of Jensen's inequality, for all $x \in \mathrm{conv}(\mathcal X)$ and for all $y \in \mathrm{conv}(\mathcal Y)$,
\[
f(x,y) \le \sum_{i=1}^n\sum_{j=1}^m \lambda_i\mu_j f(x_i, y_j) \le \sum_{i=1}^n\sum_{j=1}^m \lambda_i\mu_j \sup_{x\in\mathcal X}\sup_{y\in\mathcal Y} f(x,y) = \sup_{x\in\mathcal X}\sup_{y\in\mathcal Y} f(x,y).
\]
Thus, $\sup_{x\in\mathrm{conv}(\mathcal X)}\sup_{y\in\mathrm{conv}(\mathcal Y)} f(x,y) \le \sup_{x\in\mathcal X}\sup_{y\in\mathcal Y} f(x,y)$. The reverse inequality holds trivially since $\mathcal X \subseteq \mathrm{conv}(\mathcal X)$ and $\mathcal Y \subseteq \mathrm{conv}(\mathcal Y)$. The same arguments apply if $f$ is replaced by $|f|$. This concludes the proof.

Proof of Lemma 9. We plan to apply the maximal inequality from Theorem 8. However, instead of applying Theorem 8 to the original process, we apply it to several auxiliary processes. We then use Lemma 31 to assemble a bound on the original empirical process from the bounds on those auxiliary processes. The rationale behind this complicated approach is that a direct application of Theorem 8 leads to an upper bound involving the metric entropy integral over $C^p(\cup_{\tau\in\mathcal T} T_\theta(\tau), \bar c) \cap B_2^p(0, r_0)$, for which we lack the tools to obtain tight estimates. In contrast, the bounds on the auxiliary processes involve metric entropy integrals of finite sets only, which are easy to estimate.

Denote the underlying probability space by $(\Omega, \mathcal A, P)$. To simplify notation, write $K(r_0, \tau) = C^p(T_\theta(\tau), \bar c) \cap B_2^p(0, r_0)$. Let $\eta \in (0,1)$ be arbitrary and $T_\eta$ be an $\eta$-net of $\mathcal T$ with cardinality $\mathrm{card}(T_\eta) \le 1 + 1/\eta$. We have the following decomposition:
\[
\begin{aligned}
\|G_n\|_{\mathcal G} \overset{(a)}{\le}{}& \sup_{\tau\in\mathcal T}\sup_{u\in K(r_0,\tau)}\Big|\frac{1}{\sqrt n}\sum_{i=1}^n X_i'u - E[X_i'u]\Big| \\
&+ \sup_{\tau\in\mathcal T}\ \sup_{\theta-\theta_0(\tau)\in K(r_0,\tau)}\Big|\frac{1}{\sqrt n}\sum_{i=1}^n \phi_{\tau,X_i,Y_i}\big(X_i'\theta - X_i'\theta_0(\tau)\big) - E\big[\phi_{\tau,X_i,Y_i}\big(X_i'\theta - X_i'\theta_0(\tau)\big)\big]\Big| \\
\overset{(b)}{\le}{}& \sup_{\tau\in\mathcal T}\sup_{u\in K(r_0,\tau)}\Big|\frac{1}{\sqrt n}\sum_{i=1}^n X_i'u - E[X_i'u]\Big| \\
&+ \sup_{s\in T_\eta}\sup_{\tau:|\tau-s|\le\eta}\ \sup_{\theta-\theta_0(\tau)\in K(r_0,\tau)}\Big|\frac{1}{\sqrt n}\sum_{i=1}^n \phi_{s,X_i,Y_i}\big(X_i'\theta - X_i'\theta_0(s)\big) - E\big[\phi_{s,X_i,Y_i}\big(X_i'\theta - X_i'\theta_0(s)\big)\big]\Big| \\
&+ \sup_{s\in T_\eta}\sup_{\tau:|\tau-s|\le\eta}\Big|\frac{1}{\sqrt n}\sum_{i=1}^n \phi_{s,X_i,Y_i}\big(X_i'\theta_0(\tau) - X_i'\theta_0(s)\big) - E\big[\phi_{s,X_i,Y_i}\big(X_i'\theta_0(\tau) - X_i'\theta_0(s)\big)\big]\Big| \\
={}& \mathrm{I} + \mathrm{II} + \mathrm{III}, \quad (60)
\end{aligned}
\]
where (a) follows from Lemma 5 (iii) and (b) follows from Lemma 5 (ii).

Bound on I.
By Lemma 7 and Assumption 3 there exists $M \subset B_2^p(0,1)$ with cardinality $\mathrm{card}(M) \le (ep/s_\theta)^{s_\theta}$, $\|u\|_0 \le s_\theta$ for all $u \in M$, and $K(1,\tau) \subset (2+\bar c)\,\mathrm{conv}(M)$ for all $\tau \in \mathcal T$. Hence, for $A \in \mathcal A$ arbitrary,
\[
E\Bigg[\sup_{\tau\in\mathcal T}\sup_{u\in K(r_0,\tau)}\Big|\frac{1}{\sqrt n}\sum_{i=1}^n X_i'u - E[X_i'u]\Big| \,\Big|\, A\Bigg] \le (2+\bar c)\, r_0\, E\Bigg[\sup_{u\in\mathrm{conv}(M)}\Big|\frac{1}{\sqrt n}\sum_{i=1}^n X_i'u - E[X_i'u]\Big| \,\Big|\, A\Bigg].
\]
Note that $u \mapsto \frac{1}{\sqrt n}\sum_{i=1}^n (X_i - E[X_i])'u$ is linear. Hence, by Lemma 8 the term on the right hand side of the above display is equal to
\[
2(2+\bar c)\, r_0\, E\Bigg[\sup_{u\in M}\Big|\frac{1}{\sqrt n}\sum_{i=1}^n X_i'u - E[X_i'u]\Big| \,\Big|\, A\Bigg]. \quad (61)
\]
Now, note that for all $u_1, u_2 \in M$,
\[
\big\|(X - E[X])'u_1 - (X - E[X])'u_2\big\|_{\psi_2} \lesssim \phi^{1/2}(2s_\theta)\,\|u_1 - u_2\|_2.
\]
Thus, by Theorem 8 and Remark 16, eq. (61) can be upper bounded (up to a multiplicative constant) by
\[
2(2+\bar c)\, r_0\,\phi^{1/2}(2s_\theta)\sqrt{s_\theta\log(ep/s_\theta) + \log(1/P\{A\})}.
\]
Hence, by Lemma 30, for all $t \ge 0$, with probability at least $1 - e^{-t}$,
\[
\mathrm{I} \lesssim (2+\bar c)\, r_0\,\phi^{1/2}(2s_\theta)\Big(\sqrt{s_\theta\log(ep/s_\theta)} + \sqrt t\Big). \quad (62)
\]

Bound on II.
Let $\{\varepsilon_i\}_{i=1}^n$, $\{\tilde\varepsilon_i\}_{i=1}^n$ be two independent sequences of i.i.d. Rademacher random variables, independent of $\{(Y_i, X_i)\}_{i=1}^n$. For $A \in \mathcal A$ arbitrary consider
\[
\begin{aligned}
E\Bigg[\sup_{\tau:|\tau-s|\le\eta}\ \sup_{\theta-\theta_0(\tau)\in K(r_0,\tau)}\Big|\frac{1}{\sqrt n}\sum_{i=1}^n \tilde\varepsilon_i\big(\varepsilon_i X_i - E[X_i]\big)'\big(\theta - \theta_0(s)\big)\Big| \,\Big|\, A\Bigg]
\le{}& E\Bigg[\sup_{\tau\in\mathcal T}\sup_{u\in K(r_0,\tau)}\Big|\frac{1}{\sqrt n}\sum_{i=1}^n \tilde\varepsilon_i\big(\varepsilon_i X_i - E[X_i]\big)'u\Big| \,\Big|\, A\Bigg] \quad (63) \\
&+ E\Bigg[\sup_{\tau:|\tau-s|\le\eta}\Big|\frac{1}{\sqrt n}\sum_{i=1}^n \tilde\varepsilon_i\big(\varepsilon_i X_i - E[X_i]\big)'\big(\theta_0(\tau) - \theta_0(s)\big)\Big| \,\Big|\, A\Bigg]. \quad (64)
\end{aligned}
\]
Note that the "doubly symmetrized" summands $\tilde\varepsilon_i(\varepsilon_i X_i - E[X_i])'u$ have zero mean, are sub-Gaussian, and have the same second moments as $(X_i - E[X_i])'u$. Hence, $\|\tilde\varepsilon_i(\varepsilon_i X_i - E[X_i])'u\|_{P,\psi_2} \lesssim \phi^{1/2}(\|u\|_0)\|u\|_2$. Proceeding as in Step 1, we can therefore upper bound the term in eq. (63) (up to a multiplicative constant) by
\[
16(2+\bar c)\, r_0\,\phi^{1/2}(2s_\theta)\sqrt{s_\theta\log(ep/s_\theta) + \log(1/P\{A\})}.
\]
Note that $\sup_{\tau,s\in\mathcal T}\|\theta_0(\tau) - \theta_0(s)\|_0 \le 2s_\theta$. Hence, by Assumption 4, for all $\tau, s \in \mathcal T$,
\[
\big\|\varepsilon(X - E[X])'\theta_0(\tau) - \varepsilon(X - E[X])'\theta_0(s)\big\|_{\psi_2} \lesssim \phi^{1/2}(2s_\theta)\,\|\theta_0(\tau) - \theta_0(s)\|_2 \lesssim \phi^{1/2}(2s_\theta)\, L_\theta\, |\tau - s|.
\]
Therefore, by Theorem 8, Remark 15, and Remark 16, eq. (64) can be upper bounded (up to a multiplicative constant) by
\[
2(2+\bar c)\,\eta\,\phi^{1/2}(2s_\theta)\, L_\theta\sqrt{s_\theta\log(ep/s_\theta) + \log(1/P\{A\})}.
\]
Thus, by Lemma 30, for all $t \ge 0$, with probability at least $1 - e^{-t}$,
\[
\begin{aligned}
\sup_{\tau:|\tau-s|\le\eta}\ \sup_{\theta-\theta_0(\tau)\in K(r_0,\tau)}\Big|\frac{1}{\sqrt n}\sum_{i=1}^n \tilde\varepsilon_i\big(\varepsilon_i X_i - E[X_i]\big)'\big(\theta - \theta_0(s)\big)\Big|
\lesssim{}& (2+\bar c)\, r_0\,\phi^{1/2}(2s_\theta)\Big(\sqrt{s_\theta\log(ep/s_\theta)} + \sqrt t\Big) \\
&+ 16(2+\bar c)\,\eta\,\phi^{1/2}(2s_\theta)\, L_\theta\Big(\sqrt{s_\theta\log(ep/s_\theta)} + \sqrt t\Big). \quad (65)
\end{aligned}
\]
We now turn the bound on this doubly symmetrized process into a bound on II. For any increasing and convex function $F$ we have
\[
\begin{aligned}
&E\Bigg[F\Bigg(\sup_{\tau:|\tau-s|\le\eta}\ \sup_{\theta-\theta_0(\tau)\in K(r_0,\tau)}\Big|\frac{1}{\sqrt n}\sum_{i=1}^n \phi_{s,X_i,Y_i}\big(X_i'\theta - X_i'\theta_0(s)\big) - E\big[\phi_{s,X_i,Y_i}\big(X_i'\theta - X_i'\theta_0(s)\big)\big]\Big|\Bigg)\Bigg] \\
&\overset{(a)}{\le} E\Bigg[F\Bigg(2\sup_{\tau:|\tau-s|\le\eta}\ \sup_{\theta-\theta_0(\tau)\in K(r_0,\tau)}\Big|\frac{1}{\sqrt n}\sum_{i=1}^n \varepsilon_i\,\phi_{s,X_i,Y_i}\big(X_i'\theta - X_i'\theta_0(s)\big)\Big|\Bigg)\Bigg] \\
&\overset{(b)}{\le} E\Bigg[F\Bigg(4\sup_{\tau:|\tau-s|\le\eta}\ \sup_{\theta-\theta_0(\tau)\in K(r_0,\tau)}\Big|\frac{1}{\sqrt n}\sum_{i=1}^n \varepsilon_i\big(X_i'\theta - X_i'\theta_0(s)\big)\Big|\Bigg)\Bigg] \\
&\overset{(c)}{\le} E\Bigg[F\Bigg(8\sup_{\tau:|\tau-s|\le\eta}\ \sup_{\theta-\theta_0(\tau)\in K(r_0,\tau)}\Big|\frac{1}{\sqrt n}\sum_{i=1}^n \tilde\varepsilon_i\big(\varepsilon_i X_i - E[X_i]\big)'\big(\theta - \theta_0(s)\big)\Big|\Bigg)\Bigg], \quad (66)
\end{aligned}
\]
where (a) holds by Lemma 2.3.6 in van der Vaart and Wellner (1996), (b) holds by Theorem 4.12 in Ledoux and Talagrand (1996) and since by Lemma 5 (i) $\phi_{s,X,Y}$ is a contraction, and (c) holds again by Lemma 2.3.6 in van der Vaart and Wellner (1996).

By Lemma 31, eq. (65) and (66), and the union bound over $s \in T_\eta$, for all $t \ge 0$, with probability at least $1 - e^{-t}$,
\[
\begin{aligned}
\mathrm{II} \lesssim{}& (2+\bar c)\, r_0\,\phi^{1/2}(2s_\theta)\Big(\sqrt{s_\theta\log(ep/s_\theta)} + \sqrt{\log(1+1/\eta) + t}\Big) \\
&+ 16(2+\bar c)\,\eta\,\phi^{1/2}(2s_\theta)\, L_\theta\Big(\sqrt{s_\theta\log(ep/s_\theta)} + \sqrt{\log(1+1/\eta) + t}\Big). \quad (67)
\end{aligned}
\]

Bound on III.
We obtain a bound for this term using arguments similar to those in Step 2. We skip the redundant details and simply note that, for all $t \ge 0$, with probability at least $1 - e^{-t}$,
\[
\mathrm{III} \lesssim (2+\bar c)\,\eta\,\phi^{1/2}(2s_\theta)\, L_\theta\Big(\sqrt{s_\theta\log(ep/s_\theta)} + \sqrt{\log(1+1/\eta) + t}\Big). \quad (68)
\]

Conclusion.
Since $\eta \in (0,1)$ is arbitrary, we can choose $\eta \asymp r_0/L_\theta$. Combine the bounds in eq. (60), (62), (67) and (68), adjust the constants, and conclude that with probability at least $1-\delta$,
\[
\|G_n\|_{\mathcal G} \lesssim (2+\bar c)\,\phi^{1/2}(2s_\theta)\, r_0\sqrt{s_\theta\log(ep/s_\theta) + \log(1 + L_\theta/r_0) + \log(1/\delta)}.
\]

Proof of Lemma 10. To simplify notation, we write $K_k(\tau) = C^p(J_k(\tau), \vartheta_k) \cap B_2^p(0,1)$ for $k \in \{1,2\}$. By Lemma 7 there exist $M_1, M_2 \subset B_2^p(0,1)$ such that
\[
\mathrm{card}(M_k) \le \Big(\frac{ep}{s_k}\Big)^{s_k}, \qquad \forall u \in M_k : \|u\|_0 \le s_k, \qquad \forall \tau \in \mathcal T,\ k \in \{1,2\} : K_k(\tau) \subset (2+\vartheta_k)\,\mathrm{conv}(M_k).
\]
For these $M_1, M_2$ we define
\[
\mathcal G_{M_1,M_2} = \big\{g(X,\xi) = \xi(X'u_1)(X'u_2) : u_k \in M_k,\ k \in \{1,2\}\big\}.
\]
By Assumption 1 and Lemma 29, for all $v_1, u_1 \in M_1$ and $v_2, u_2 \in M_2$,
\[
\begin{aligned}
&\big\|\big(\xi(X'u_1)(X'u_2) - E[\xi(X'u_1)(X'u_2)]\big) - \big(\xi(X'v_1)(X'v_2) - E[\xi(X'v_1)(X'v_2)]\big)\big\|_{P,\psi_1} \\
&\quad\le 2\big\|\xi(X'u_1)(X'u_2) - \xi(X'v_1)(X'v_2)\big\|_{P,\psi_1} \le 2\sup_{w_1,w_2}\big\|(X'w_1)(X'w_2)\big\|_{P,\psi_1}\big(\|u_1 - v_1\|_2 + \|u_2 - v_2\|_2\big) \\
&\quad\lesssim \varphi^{1/2}(s_1)\,\varphi^{1/2}(s_2)\big(\|u_1 - v_1\|_2 + \|u_2 - v_2\|_2\big),
\end{aligned}
\]
where the supremum in the third expression is taken over all $w_1, w_2$ such that $\|w_k\|_2 \le 1$ and $\|w_k\|_0 \le s_k$ for $k \in \{1,2\}$. Thus, by Corollary 4, with probability at least $1-\delta$,
\[
\|G_n\|_{\mathcal G_{M_1,M_2}} \lesssim \varphi^{1/2}(s_1)\,\varphi^{1/2}(s_2)\Big(\sqrt{t_{s_1,s_2,\delta}} + n^{-1/2}\,t_{s_1,s_2,\delta}\Big),
\]
where $t_{s_1,s_2,\delta} = s_1\log(ep/s_1) + s_2\log(ep/s_2) + \log(1/\delta)$. Hence, by Lemma 8 and the construction of the $M_k$ we have, with probability at least $1-\delta$,
\[
\|G_n\|_{\mathcal G} \lesssim (2+\vartheta_1)(2+\vartheta_2)\,\varphi^{1/2}(s_1)\,\varphi^{1/2}(s_2)\Big(\sqrt{t_{s_1,s_2,\delta}} + n^{-1/2}\,t_{s_1,s_2,\delta}\Big).
\]

Proof of Lemma 11. To simplify notation, we write $K(\tau) = C^p(J(\tau), \vartheta) \cap B_2^p(0,1)$. By Lemma 7 there exists $M \subset B_2^p(0,1)$ such that
\[
\mathrm{card}(M) \le \Big(\frac{ep}{s}\Big)^{s}, \qquad \forall u \in M : \|u\|_0 \le s, \qquad \forall \tau \in \mathcal T : K(\tau) \subset (2+\vartheta)\,\mathrm{conv}(M).
\]
For this $M$ we define
\[
\mathcal G_M = \big\{g(X,Y,\xi) = \xi\big(\tau - 1\{Y \le X'\theta_0(\tau)\}\big)X'v : v \in M,\ \tau \in \mathcal T\big\}.
\]
We observe the following: First, $\mathcal G_M = \{hj : h \in \mathcal H,\ j \in \mathcal J_M\}$, where $\mathcal H = \{h(X,Y) = \tau - 1\{F_{Y|X}(Y|X) \le \tau\} : \tau \in \mathcal T\}$ and $\mathcal J_M = \{j(X,\xi) = \xi X'v : v \in M\}$. The set $\mathcal H$ is the difference of two VC-subgraph classes with VC-indices at most 2, respectively (van der Vaart and Wellner, 1996, Lemma 2.6.15 and Example 2.6.1). Thus, $\mathcal H$ is a VC-subgraph class with VC-index at most 3 (van der Vaart and Wellner, 1996, Lemma 2.6.18). The function class $\mathcal J_M$ is finite with $\mathrm{card}(\mathcal J_M) = \mathrm{card}(M)$. By Assumption 1 and Lemma 29, for $u_1, u_2, v_1, v_2 \in M$ arbitrary,
\[
\begin{aligned}
\big\|(\xi u_1'XX'u_2 - E[\xi u_1'XX'u_2]) - (\xi v_1'XX'v_2 - E[\xi v_1'XX'v_2])\big\|_{P,\psi_1}
&\le 2\big\|\xi u_1'XX'u_2 - \xi v_1'XX'v_2\big\|_{P,\psi_1} \\
&\le 2\sup_{w_1,w_2}\big\|(X'w_1)(X'w_2)\big\|_{P,\psi_1}\big(\|u_1 - v_1\|_2 + \|u_2 - v_2\|_2\big) \\
&\lesssim \varphi_{\max}(s)\big(\|u_1 - v_1\|_2 + \|u_2 - v_2\|_2\big),
\end{aligned}
\]
where the supremum in the second line is taken over all $w_1, w_2$ such that $\|w_1\|_2, \|w_2\|_2 \le 1$ and $\|w_1\|_0, \|w_2\|_0 \le s$. Therefore, for $j_v, j_u \in \mathcal J_M$, $\|(j_v - Pj_v) - (j_u - Pj_u)\|_{P,\psi_1} \lesssim \varphi_{\max}(s)\|v - u\|_2$. Thus, by Corollary 5, with probability at least $1-\delta$,
\[
\|G_n\|_{\mathcal G_M} \lesssim \varphi^{1/2}(s)\sqrt{t_{s,\delta}}\sqrt{\pi_{n,2}(t_{s,\delta})},
\]
where $t_{s,\delta} = s\log(ep/s) + \log(1/\delta)$ and $\pi_{n,2}(z) = \sqrt{z/n} + z/n$ for $z \ge 0$. Hence, by Lemma 8 and the construction of $M$ we have, with probability at least $1-\delta$,
\[
\|G_n\|_{\mathcal G} \lesssim (2+\vartheta)\,\varphi^{1/2}(s)\sqrt{t_{s,\delta}}\sqrt{\pi_{n,2}(t_{s,\delta})}.
\]

Proof of Lemma 12. Proof of Case (i). Let $S \subset \{1, \ldots, p\}$ be arbitrary. By Lemma 7 there exists $M \subset B_2^p(0,$
1) such that
\[
\mathrm{card}(M) \le \Big(\frac{ep}{|S|}\Big)^{|S|}, \qquad \forall u \in M : \|u\|_0 \le |S|, \qquad \{u \in B_2^p(0,1) : \mathrm{supp}(u) \subseteq S\} \subset 2\,\mathrm{conv}(M).
\]
In the following it is understood that $M$ is a function of $S$ and we will not make this dependence explicit. For $T \subset \{1, \ldots, p\}$ and the pair $(S, M)$ we define
\[
\mathcal G_{S,T,M} = \big\{g(X,Y,\xi) = \xi\big(1\{Y \le X'\theta\} - 1\{Y \le X'\theta_0(\tau)\}\big)X'v : \theta \in \mathbb R^p,\ \mathrm{supp}(\theta) = T,\ v \in M,\ \tau \in \mathcal T\big\}.
\]
We make the following observations: First, each $g \in \mathcal G_{S,T,M}$ is parameterized (uniquely) by the triple $(v, \theta, \tau)$. Hence, we write $g = g_{v,\theta,\tau}$ whenever we need to highlight the dependence on the parameters. Second, $\mathcal G_{S,T,M} = \{hj : h \in \mathcal H_T,\ j \in \mathcal J_{S,M}\}$, where $\mathcal H_T = \{h(X,Y) = 1\{Y \le X'\theta\} - 1\{F_{Y|X}(Y|X) \le \tau\} : \theta \in \mathbb R^p,\ \mathrm{supp}(\theta) = T,\ \tau \in \mathcal T\}$ and $\mathcal J_{S,M} = \{j(X,\xi) = \xi X'v : v \in M\}$. In particular, for every $g = g_{v,\theta,\tau} \in \mathcal G_{S,T,M}$ there exist unique $h_{\theta,\tau} \in \mathcal H_T$ and $j_v \in \mathcal J_{S,M}$ such that $g_{v,\theta,\tau} = h_{\theta,\tau} j_v$. The set $\mathcal H_T$ is the difference of two VC-subgraph classes with VC-indices at most $|T| + 3$ and 2, respectively (van der Vaart and Wellner, 1996, Lemma 2.6.15 and Example 2.6.1). Thus, $\mathcal H_T$ is a VC-subgraph class with VC-index at most $|T| + 4$ (van der Vaart and Wellner, 1996, Lemma 2.6.18). The function class $\mathcal J_{S,M}$ is finite with $\mathrm{card}(\mathcal J_{S,M}) = \mathrm{card}(M)$. By Assumption 1 and Lemma 29, for $u_1, u_2, v_1, v_2 \in M$ arbitrary,
\[
\begin{aligned}
\big\|(\xi u_1'XX'u_2 - E[\xi u_1'XX'u_2]) - (\xi v_1'XX'v_2 - E[\xi v_1'XX'v_2])\big\|_{P,\psi_1}
&\le 2\big\|\xi u_1'XX'u_2 - \xi v_1'XX'v_2\big\|_{P,\psi_1} \\
&\le 2\sup_{w_1,w_2}\big\|(X'w_1)(X'w_2)\big\|_{P,\psi_1}\big(\|u_1 - v_1\|_2 + \|u_2 - v_2\|_2\big) \\
&\lesssim \varphi_{\max}(|S|)\big(\|u_1 - v_1\|_2 + \|u_2 - v_2\|_2\big),
\end{aligned}
\]
where the supremum in the second line is taken over all $w_1, w_2$ such that $\|w_1\|_2, \|w_2\|_2 \le 1$ and $\|w_1\|_0, \|w_2\|_0 \le |S|$. Therefore, for $j_v, j_u \in \mathcal J_{S,M}$, $\|(j_v - Pj_v) - (j_u - Pj_u)\|_{P,\psi_1} \lesssim \varphi_{\max}(|S|)\|v - u\|_2$. Thus, by Corollary 5 there exists an absolute constant $C_1 > 0$ (independent of $S, T, n, p, M$) such that for all $t > 0$,
\[
P\Big\{\|G_n\|_{\mathcal G_{S,T,M}} > C_1\,\varphi^{1/2}(|S|)\,\Pi_{n,p}(|S|, |T|, t)\Big\} \le e^{-t},
\]
where
\[
\Pi_{n,p}(|S|, |T|, t) = \sqrt{|T| + |S|\log(ep/|S|)}\sqrt{\pi_{n,2}\big(|S|\log(ep/|S|)\big)} + \sqrt t\sqrt{\pi_{n,2}(t)}
\]
and $\pi_{n,2}(z) = \sqrt{z/n} + z/n$ for $z \ge 0$. Note that $\mathrm{card}\big(\{S \subseteq \{1,\ldots,p\} : |S| \le k\}\big) \le (ep/k)^k$ and $\mathrm{card}\big(\{T \subseteq \{1,\ldots,p\} : |T| \le \ell\}\big) = \sum_{i=1}^{\ell}\binom{p}{i} \le (ep/\ell)^{\ell}$.

Therefore, setting $t_{k,\ell,n,\delta} = k\log(ep/k) + \ell\log(ep/\ell) + 2\log n + \log(3e/\delta)$ and applying the union bound over the above tail probabilities gives
\[
P\Bigg\{\sup_{1\le k,\ell\le n}\ \sup_{\mathrm{card}(S)\le k}\ \sup_{\mathrm{card}(T)\le\ell} \frac{\|G_n\|_{\mathcal G_{S,T,M}}}{\varphi^{1/2}(k)\,\Pi_{n,p}(k,\ell,t_{k,\ell,n,\delta})} > C_1\Bigg\} \le e\sum_{k=1}^n\sum_{\ell=1}^n \Big(\frac{ep}{k}\Big)^k\Big(\frac{ep}{\ell}\Big)^{\ell} e^{-t_{k,\ell,n,\delta}} \le \delta. \quad (69)
\]
Next, for $S, T \subset \{1,\ldots,p\}$ arbitrary, define
\[
\mathcal G_{S,T} = \big\{g(X,Y,\xi) = \xi\big(1\{Y \le X'\theta\} - 1\{Y \le X'\theta_0(\tau)\}\big)X'v : \theta, v \in \mathbb R^p,\ \mathrm{supp}(\theta) = T,\ \mathrm{supp}(v) = S,\ \tau \in \mathcal T\big\}.
\]
By Lemma 8 and the construction of $M$ we have
\[
\|G_n\|_{\mathcal G_{S,T}} \le 2\,\|G_n\|_{\mathcal G_{S,T,M}}, \quad (70)
\]
and
\[
\mathcal G = \bigcup_{S,T\subset\{1,\ldots,p\},\ \mathrm{card}(S),\,\mathrm{card}(T)\le n} \mathcal G_{S,T}. \quad (71)
\]
Hence, eq. (69)–(71) imply
\[
P\Bigg\{\exists\, g_{v,\theta,\tau} \in \mathcal G : \frac{G_n(g_{v,\theta,\tau})}{\varphi^{1/2}(\|v\|_0)\,\Pi_{n,p}\big(\|v\|_0, \|\theta\|_0, t_{\|v\|_0,\|\theta\|_0,n,\delta}\big)} > 2C_1\Bigg\} \le \delta.
\]
Lastly, there exists an absolute constant $C_2 > 0$ such that
\[
\Pi_{n,p}\big(\|v\|_0, \|\theta\|_0, t_{\|v\|_0,\|\theta\|_0,n,\delta}\big) \le C_2\sqrt{t_{\|v\|_0,\|\theta\|_0,n,\delta}}\sqrt{\pi_{n,2}\big(t_{\|v\|_0,\|\theta\|_0,n,\delta}\big)}.
\]
This completes the proof of case (i).
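The union-bound step above relies on the classical entropy bound for binomial sums, $\sum_{i \le k}\binom{p}{i} \le (ep/k)^k$. A quick numeric check of this inequality (a hypothetical illustration, not part of the proof):

```python
from math import comb, e

# Check the binomial-sum bound used in the union bound:
# sum_{i=0}^{k} C(p, i) <= (e*p/k)^k  for 1 <= k <= p.
ok = all(
    sum(comb(p, i) for i in range(0, k + 1)) <= (e * p / k) ** k
    for p in (10, 50, 200, 1000)
    for k in range(1, min(p, 30) + 1)
)
print(ok)   # True
```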
Proof of Case (ii). Observe that $s \mapsto s\log(ep/s)$ and $s \mapsto \varphi_{\max}(s)$ are monotone increasing on $[1, p]$. Thus, the bound of case (i) for $g_{v,\theta,\tau} \in \mathcal G$ with $\|v\|_0 = \|\theta\|_0 = m$ also holds for all $g_{v',\theta',\tau'} \in \mathcal G$ with $\|v'\|_0, \|\theta'\|_0 \le m$. To conclude, adjust some constants.

Proof of Lemma 13. Fix $k \le n \wedge p$, fix $I \subseteq \{1, \ldots, n\}$ with $\mathrm{card}(I) = k$, and fix the support set of $u \in \mathbb R^p$ with $\|u\|_0 = \ell$. By Lemma 10 with $\vartheta_1 = \vartheta_2 = 0$, there exists an absolute constant $C > 0$ such that with probability at least $1 - e^{-t}$,
\[
\sup_{\|u\|_2 \le 1}\Bigg|\frac1k\sum_{i\in I}(X_i'u)^2 - E[(X_i'u)^2]\Bigg| \le C\,\varphi_{\max}(\ell)\Bigg(\sqrt{\frac{\ell + t}{k}} + \frac{\ell + t}{k}\Bigg).
\]
Note that $\mathrm{card}\big(\{I \subseteq \{1,\ldots,n\} : \mathrm{card}(I) \le k\}\big) = \sum_{i=1}^k\binom{n}{i} \le \big(\frac{en}{k}\big)^k$ and $\mathrm{card}\big(\{u \in \{0,1\}^p : \|u\|_0 \le k\}\big) = \sum_{i=1}^k\binom{p}{i} \le \big(\frac{ep}{k}\big)^k$. Therefore, setting $t_{k,\ell,n} = k\log(ep/k) + k\log(en/k) + \log(2n/\delta)$ and applying the union bound gives
\[
P\Bigg\{\exists\, k \le n : \sup_{\mathrm{card}(I)\le k}\ \sup_{\ell\le k}\ \sup_{\|u\|_2\le 1,\,\|u\|_0=\ell} \frac{\big|\frac1k\sum_{i\in I}(X_i'u)^2 - E[(X_i'u)^2]\big|}{\varphi_{\max}(\ell)\Big(\sqrt{\frac{\ell + t_{k,\ell,n}}{k}} + \frac{\ell + t_{k,\ell,n}}{k}\Big)} \ge C\Bigg\} \le \sum_{k=1}^n\Big(\frac{ep}{k}\Big)^k\Big(\frac{en}{k}\Big)^k e^{-t_{k,\ell,n}} \le \delta.
\]
Upper bound $\ell$ by $k$, simplify the expression, and conclude.

G.4 Proofs of Section E.3
Proof of Lemma 14. Proof of statement (i). To simplify the discussion we introduce the following matrices and vectors:
\[
\widehat\Psi(\tau) = 2\,\mathrm{diag}\big(\hat f_1^{-2}(\tau), \ldots, \hat f_n^{-2}(\tau)\big) \in \mathbb R^{n\times n}, \qquad
A = n^{-1/2}[-X, X]' \in \mathbb R^{2p\times n}, \qquad
b = \big[(\gamma/n)\iota_p' - z',\ (\gamma/n)\iota_p' + z'\big]' \in \mathbb R^{2p},
\]
where $\iota_p$ denotes the $p$-vector of ones, and rewrite the convex optimization problem (24) in standard matrix form as
\[
\min_{w\in\mathbb R^n} \tfrac12 w'\widehat\Psi(\tau)w \quad \text{s.t.} \quad Aw \preceq b.
\]
Recall that the dual of the above optimization problem is given by
\[
\max_{\lambda\in\mathbb R^{2p}} -\tfrac12\lambda' A\widehat\Psi^{-1}(\tau)A'\lambda - \lambda'b \quad \text{s.t.} \quad \lambda \succeq 0.
\]
By assumption the primal problem is feasible. Thus, strong duality holds and the solution to the primal, $\widehat w(\tau; z) \in \mathbb R^n$, and the solution to the dual, $\hat\lambda(\tau) \in \mathbb R^{2p}$, satisfy
\[
\widehat w(\tau; z) = -\widehat\Psi^{-1}(\tau)A'\hat\lambda(\tau). \quad (72)
\]
Since $\hat\lambda(\tau) \in \mathbb R^{2p}$ is just the vector of optimal Lagrange multipliers associated with the original problem formulation in (24), it also satisfies the complementary slackness conditions
\[
\hat\lambda_k(\tau) \ge 0 \quad \text{and} \quad \big[A\widehat w(\tau)\big]_k < b_k \implies \hat\lambda_k(\tau) = 0, \qquad k = 1, \ldots, 2p.
\]
Write $\hat\lambda(\tau) = (\mu_1', \mu_2')'$, where $\mu_1, \mu_2 \in \mathbb R^p_+$ are the optimal Lagrange multipliers associated with the box constraints of COP (24). The complementary slackness conditions imply that
\[
\mu_{1,k}\,\mu_{2,k} = 0, \qquad k = 1, \ldots, p.
\]
Set $\hat v(\tau; z) := -\mu_1 + \mu_2 \in \mathbb R^p$. The above identity implies that $|\hat v_k(\tau; z)| = \mu_{1,k} + \mu_{2,k}$, $k = 1, \ldots, p$, and hence
\[
-\tfrac12\hat\lambda'(\tau)A\widehat\Psi^{-1}(\tau)A'\hat\lambda(\tau) - \hat\lambda'(\tau)b = -\frac{1}{2n}\hat v'(\tau;z)X'\widehat\Psi^{-1}(\tau)X\hat v(\tau;z) - z'\hat v(\tau;z) - \frac{\gamma}{n}\|\hat v(\tau;z)\|_1. \quad (73)
\]
Since every $v \in \mathbb R^p$ can be written as $v = -\mu_1 + \mu_2$ for some $\mu_1, \mu_2 \in \mathbb R^p_+$, identity (73) implies that $\hat v(\tau; z)$ is the solution to the unconstrained COP
\[
\min_{v\in\mathbb R^p}\ \frac{1}{2n}v'X'\widehat\Psi^{-1}(\tau)Xv + z'v + \frac{\gamma}{n}\|v\|_1.
\]
Lastly, note that
\[
\frac{1}{2n}v'X'\widehat\Psi^{-1}(\tau)Xv = \frac{1}{4n}\sum_{i=1}^n \hat f_i^2(\tau)(X_i'v)^2.
\]
This concludes the proof of the first statement.
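The unconstrained COP derived above is an $\ell_1$-penalized quadratic program and can be solved by standard proximal-gradient (ISTA-type) iterations. The following sketch is a hypothetical illustration with simulated data and generic positive weights standing in for the estimated density terms; it is not the implementation used in the paper:

```python
import numpy as np

def soft_threshold(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def dual_cop(X, weights, z, gamma, n_iter=500):
    """Proximal gradient for min_v (1/(4n)) sum_i w_i (X_i'v)^2 + z'v + (gamma/n)||v||_1."""
    n, p = X.shape
    Q = (X * weights[:, None]).T @ X / (2.0 * n)   # Hessian of the smooth part
    step = 1.0 / np.linalg.eigvalsh(Q)[-1]         # 1/L, L the largest eigenvalue
    v = np.zeros(p)
    for _ in range(n_iter):
        v = soft_threshold(v - step * (Q @ v + z), step * gamma / n)
    return v

rng = np.random.default_rng(2)
n, p = 200, 10
X = rng.normal(size=(n, p))
weights = rng.uniform(0.5, 1.5, size=n)            # hypothetical density weights
z = rng.normal(size=p)
v_hat = dual_cop(X, weights, z, gamma=1.0)

def objective(v):
    return ((weights * (X @ v) ** 2).sum() / (4 * n) + z @ v
            + np.abs(v).sum() / n)
print(objective(v_hat) <= objective(np.zeros(p)))   # True
```

Starting from $v = 0$ with step size $1/L$, proximal gradient decreases the objective monotonically, which is what the final check confirms.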
Proof of statement (ii). Note that the primal problem (24) is convex. To establish the claim it is therefore sufficient to verify that Slater's condition holds with probability at least $1-\delta$. To this end, define
\[
w^*(z) = n^{-1/2}X E[XX']^{-1}z.
\]
Note that $n^{-1/2}E[X'w^*(z)] = z$. Hence, by Assumption 1 the
\[
z_1 - n^{-1/2}e_1'X'w^*(z), \ \ldots, \ z_p - n^{-1/2}e_p'X'w^*(z)
\]
are centered (non-identical and dependent) sub-exponential random variables. (Here, $e_k$ denotes the $k$th standard unit vector in $\mathbb R^p$.) Moreover, $z_k - n^{-1/2}e_k'X'w^*(z) = -\frac1n\sum_{i=1}^n\big(X_{ik}X_i'E[XX']^{-1}z - z_k\big)$ for all $1 \le k \le p$. By Lemma 29 and Assumption 1, for all $1 \le i \le n$,
\[
\max_{1\le k\le p}\big\|X_{ik}X_i'E[XX']^{-1}z - z_k\big\|_{\psi_1} \le 2\max_{1\le k\le p}\big\|X_{ik}X_i'E[XX']^{-1}z\big\|_{\psi_1} \le 2\max_{1\le k\le p}\max_{1\le i\le n}\|X_{ik}\|_{\psi_2}\big\|X_i'E[XX']^{-1}z\big\|_{\psi_2} \lesssim \bar f\,\varphi^{1/2}(1)\,\varphi^{1/2}(s_v)\,\kappa(\infty)\|z\|_2.
\]
Hence, the union bound followed by Bernstein's inequality implies that for all $t \ge 0$ there exists an absolute constant $C > 0$ such that
\[
P\Big\{\big\|z - n^{-1/2}X'w^*(z)\big\|_\infty > t\Big\} \le p\max_{1\le k\le p} P\Bigg\{\Bigg|\frac1n\sum_{i=1}^n\big(X_{ik}X_i'E[XX']^{-1}z - z_k\big)\Bigg| > t\Bigg\} \le p\exp\Bigg(-C\min\Bigg\{\frac{t}{\bar f\,\varphi^{1/2}(1)\varphi^{1/2}(s_v)\,\kappa(\infty)\|z\|_2},\ \frac{t^2}{\bar f^2\varphi_{\max}(1)\varphi(s_v)\,\kappa^2(\infty)\|z\|_2^2}\Bigg\}\, n\Bigg).
\]
Set $t^* = \max\{s, s^2\}$ with $s = \bar f\,\varphi^{1/2}(1)\varphi^{1/2}(s_v)\,\kappa(\infty)\|z\|_2\big(\frac{\log 2p + \log(C/\delta)}{n}\big)^{1/2}$ and conclude that with probability at least $1-\delta$ the constraint set is non-empty whenever $\gamma > t^*$, i.e. Slater's condition holds. Lastly, observe that by eq. (72),
\[
\widehat w(\tau; z) = -n^{-1/2}\widehat\Psi^{-1}(\tau)X\hat v(\tau; z).
\]
This concludes the proof of the second statement.
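The Slater point $w^*(z)$ can be illustrated numerically. The sketch below is a hypothetical simulation (not from the paper): with $E[XX'] = I_p$, the balancing error $z - n^{-1/2}X'w^*(z)$ is an average of centered sub-exponential terms and is small in sup-norm for moderate $n$, so the box constraint $\|z - n^{-1/2}X'w\|_\infty \le \gamma/n$-type condition is feasible at $w^*(z)$ for the stated choice of the tuning parameter.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 20000, 50
X = rng.normal(size=(n, p))                  # design with E[XX'] = I_p
z = rng.normal(size=p)
z /= np.linalg.norm(z)                       # unit-norm query direction
w_star = X @ z / np.sqrt(n)                  # n^{-1/2} X E[XX']^{-1} z
err = np.max(np.abs(z - X.T @ w_star / np.sqrt(n)))
print(err < 0.2)   # True: the balancing error concentrates near zero
```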
Proof of Theorem 5. Ansatz. For $\tau \in \mathcal T$ and $r > 0$ define $B_2^p(0, r) := \{u \in \mathbb R^p : \|u\|_2 \le r\}$ and recall that $C^p(J, \vartheta) = \{u \in \mathbb R^p : \|u_{J^c}\|_1 \le \vartheta\|u_J\|_1\}$. To simplify notation, we write $v_0(\tau)$, $\hat v(\tau)$, $T_v(\tau)$, and $s_v$ for $v_0(\tau; z)$, $\hat v(\tau; z)$, $T_v(\tau; z)$, and $s_v(z)$, respectively. Suppose that, with high probability,

(a) $\hat v(\tau) - v_0(\tau) \in C^p\big(T_v(\tau), \bar c\big)$ for all $\tau \in \mathcal T$; and

(b) the centered dual objective function is strictly positive when evaluated at points $(v, \tau) \in \mathbb R^p \times \mathcal T$ satisfying $v - v_0(\tau) \in C^p(T_v(\tau), \bar c) \cap \partial B_2^p(0, r) =: K(r, \tau)$, i.e.
\[
\inf_{\tau\in\mathcal T}\ \inf_{v - v_0(\tau)\in K(r,\tau)}\Bigg(\frac14\sum_{i=1}^n\Big(\hat f_i^2(\tau)(X_i'v)^2 - \hat f_i^2(\tau)\big(X_i'v_0(\tau)\big)^2\Big) + nz'\big(v - v_0(\tau)\big) + \gamma\big(\|v\|_1 - \|v_0(\tau)\|_1\big)\Bigg) > 0.
\]
Since the centered dual objective function is convex in $v$ and non-positive at $\hat v(\tau)$ for all $\tau \in \mathcal T$, it then follows that $\sup_{\tau\in\mathcal T}\|\hat v_\gamma(\tau) - v_0(\tau)\|_2 \lesssim r$. Thus, to establish the claim of the theorem, we only need to prove that statements (a) and (b) hold with high probability.

Verification of the high probability statements. By assumption, eq. (35) holds true. Thus, by Lemma 18, statement (a) holds for all $\tau \in \mathcal T$ with probability one. We now establish statement (b). To this end, we define
\[
\mathcal G_1 = \big\{g : \mathbb R^p \to \mathbb R : g(X) = f_{Y|X}^2(X'\theta_0(\tau)|X)\, v_0'(\tau)XX'v,\ v \in K(r,\tau),\ \tau\in\mathcal T\big\},
\]
\[
\mathcal G_2 = \big\{g : \mathbb R^p \to \mathbb R : g(X) = f_{Y|X}^2(X'\theta_0(\tau)|X)(X'v)^2,\ v \in K(r,\tau),\ \tau\in\mathcal T\big\}.
\]
Also, note that by Assumptions 8 and 9, with probability at least $1-\eta$,
\[
\big|\hat f_i^2(\tau) - f_i^2(\tau)\big| = \big|\big(\hat f_i(\tau) - f_i(\tau)\big)\big(\hat f_i(\tau) + f_i(\tau)\big)\big| \le \big(\hat f_i(\tau) - f_i(\tau)\big)^2 + 2\big|\hat f_i(\tau) - f_i(\tau)\big| f_i(\tau) \le \Bigg|\frac{\hat f_i(\tau)}{f_i(\tau)} - 1\Bigg|^2 f_i^2(\tau) + 2\Bigg|\frac{\hat f_i(\tau)}{f_i(\tau)} - 1\Bigg| f_i^2(\tau) \lesssim r_f\, f_i^2(\tau). \quad (74)
\]
Therefore, by a second-order Taylor approximation we have, with probability at least $1-\eta$, for all $v \in K(r,\tau)$ and $\tau \in \mathcal T$,
\[
\begin{aligned}
&\frac{1}{4n}\sum_{i=1}^n \hat f_i^2(\tau)(X_i'v)^2 - \hat f_i^2(\tau)\big(X_i'v_0(\tau)\big)^2 + z'\big(v - v_0(\tau)\big) \\
&\quad= \frac{1}{2n}\sum_{i=1}^n f_i^2(\tau)\, v_0'(\tau)X_iX_i'\big(v - v_0(\tau)\big) + \frac{1}{4n}\sum_{i=1}^n f_i^2(\tau)\big(v - v_0(\tau)\big)'X_iX_i'\big(v - v_0(\tau)\big) + z'\big(v - v_0(\tau)\big) \\
&\qquad+ \frac{1}{2n}\sum_{i=1}^n\big(\hat f_i^2(\tau) - f_i^2(\tau)\big)v_0'(\tau)X_iX_i'\big(v - v_0(\tau)\big) + \frac{1}{4n}\sum_{i=1}^n\big(\hat f_i^2(\tau) - f_i^2(\tau)\big)\big(v - v_0(\tau)\big)'X_iX_i'\big(v - v_0(\tau)\big) \\
&\quad\gtrsim E\Bigg[\frac{1}{4n}\sum_{i=1}^n f_i^2(\tau)\big(v - v_0(\tau)\big)'X_iX_i'\big(v - v_0(\tau)\big)\Bigg] - n^{-1/2}\|G_n\|_{\mathcal G_1} - n^{-1/2}\|G_n\|_{\mathcal G_2} - r_f\sup_{g\in\mathcal G_1}\|g\|_{P_n,1} - r_f\sup_{g\in\mathcal G_2}\|g\|_{P_n,1},
\end{aligned}
\]
where the last step also uses the population first-order condition for $v_0(\tau)$, i.e. $\tfrac12 E[f_i^2(\tau)X_iX_i']v_0(\tau) + z = 0$. Thus, with probability at least $1-\eta$,
\[
\begin{aligned}
&\inf_{\tau\in\mathcal T}\ \inf_{v - v_0(\tau)\in K(r,\tau)}\Bigg(\frac{1}{4n}\sum_{i=1}^n\Big(\hat f_i^2(\tau)(X_i'v)^2 - \hat f_i^2(\tau)\big(X_i'v_0(\tau)\big)^2\Big) + z'\big(v - v_0(\tau)\big) + \frac{\gamma}{n}\big(\|v\|_1 - \|v_0(\tau)\|_1\big)\Bigg) \\
&\quad\gtrsim \inf_{\tau\in\mathcal T}\ \inf_{v - v_0(\tau)\in K(r,\tau)} E\Bigg[\frac{1}{4n}\sum_{i=1}^n f_i^2(\tau)\big(v - v_0(\tau)\big)'X_iX_i'\big(v - v_0(\tau)\big)\Bigg] - n^{-1/2}\|G_n\|_{\mathcal G_1} - n^{-1/2}\|G_n\|_{\mathcal G_2} \\
&\qquad- r_f\sup_{g\in\mathcal G_1}\|g\|_{P_n,1} - r_f\sup_{g\in\mathcal G_2}\|g\|_{P_n,1} - \sup_{\tau\in\mathcal T}\ \sup_{v - v_0(\tau)\in K(r,\tau)}\frac{\gamma}{n}\big(\|v\|_1 - \|v_0(\tau)\|_1\big) \\
&\quad= \mathrm{I} - \mathrm{II} - \mathrm{III} - \mathrm{IV} - \mathrm{V} - \mathrm{VI}. \quad (75)
\end{aligned}
\]
We now bound the expressions on the far right hand side of eq. (75).

Bound on I.
By definition of K ( r, τ ),inf τ ∈T inf v − v ( τ ) ∈ K ( r,τ ) E (cid:34) n n (cid:88) i =1 f i ( τ ) (cid:0) v − v ( τ ) (cid:1) (cid:48) X i X (cid:48) i (cid:0) v − v ( τ ) (cid:1)(cid:35) ≥ κ (¯ c ) r . (76) Bound on II.
By Lemma 21 and Assumptions 6 and 11, with probability at least 1 − δ , n − / (cid:107) G n (cid:107) G (cid:46) (2 + ¯ c ) ¯ f (1 + ϕ / (2 s θ )) ϕ max ( s v ) κ ( ∞ ) (cid:107) z (cid:107) × (cid:114) s v ( z ) log( ep/s v ) + s θ log( ep/s θ ) + log( nL f L θ ) + log(1 /δ ) n r (77) Bound on III.
By Lemma 21 and Assumptions 6 and 11, with probability at least 1 − δ , n − / (cid:107) G n (cid:107) G (cid:46) (2 + ¯ c ) ¯ f (1 + ϕ / (2 s θ )) ϕ max ( s v ) × (cid:114) s v log( ep/s v ) + s θ log( ep/s θ ) + log( nL f L θ ) + log(1 /δ ) n r . (78) Bound on IV.
We introduce the following two function classes: G = { g : R p → R : g ( X ) = f Y | X ( X (cid:48) θ ( τ ) | X )( X (cid:48) v ) , v ∈ K (1 , τ ) , τ ∈ T } , G = { g : R p → R : g ( X ) = f Y | X ( X (cid:48) θ ( τ ) | X )( X (cid:48) v ) , v ∈ R p , (cid:107) v (cid:107) ≤ , (cid:107) v (cid:107) ≤ s v , τ ∈ T } . Then, by the triangle inequality, r f sup g ∈G (cid:107) g (cid:107) P n , (cid:46) r f n − / (cid:107) G n (cid:107) |G | + (2 + ¯ c ) ¯ f ϕ max ( s v ) (cid:107) z (cid:107) κ ( ∞ ) r f r (cid:46) r f n − / (cid:107) z (cid:107) rκ ( ∞ ) (cid:16) (cid:107) G n (cid:107) G + (cid:107) G n (cid:107) G (cid:17) + (2 + ¯ c ) ¯ f ϕ max ( s v ) (cid:107) z (cid:107) κ ( ∞ ) r f r. hus, by Lemma 21 and Assumptions 6 and 11, with probability at least 1 − δ , r f sup g ∈G (cid:107) g (cid:107) P n , (cid:46) (2 + ¯ c ) ¯ f ϕ max ( s v ) (cid:107) z (cid:107) κ ( ∞ ) r f r + (2 + ¯ c ) ¯ f (1 + ϕ / (2 s θ )) ϕ max ( s v ) κ ( ∞ ) (cid:107) z (cid:107) × (cid:114) s v ( z ) log( ep/s v ) + s θ log( ep/s θ ) + log( nL f L θ ) + log(1 /δ ) n r f r (79) Bound on V.
By the triangle inequality, r f sup g ∈G (cid:107) g (cid:107) P n , (cid:46) r f n − / (cid:107) G n (cid:107) |G | + (2 + ¯ c ) ¯ f ϕ max ( s v ) r f r (cid:46) r f n − / (cid:107) G n (cid:107) G + (2 + ¯ c ) ¯ f ϕ max ( s v ) r f r . (80) Bound on VI.
By Lemma 18 and arguments as in the proof of Theorem 3,
\[
\sup_{\tau\in\mathcal{T}}\ \sup_{v - v(\tau;z)\in K(r,\tau)} \frac{\gamma}{n}\big(\|v\|_1 - \|v(\tau)\|_1\big) \lesssim (2+\bar c)\,\frac{\gamma}{n}\, s_v^{1/2}\, r. \tag{81}
\]
Conclusion.
Combine Assumptions 6 and 11, eq. (76)–(81), and observe that there exist absoluteconstants c , c , c , c , c , c > − η − δ , the expression in eq. (75) canbe lower bounded (up to a multiplicative constant) by κ (¯ c ) r − c (2 + ¯ c ) ¯ f (1 + ϕ / (2 s θ )) ϕ max ( s v ) κ ( ∞ ) (cid:107) z (cid:107) (1 + r f ) × (cid:114) s v log( ep/s v ) + s θ log( ep/s θ ) + log( nL f L θ ) + log(1 /δ ) n r − c (2 + ¯ c ) ¯ f (1 + ϕ / (2 s θ )) ϕ max (cid:0) s v ( z ) (cid:1) (1 + r f ) × (cid:114) s v log( ep/s v ) + s θ log( ep/s θ ) + log( nL f L θ ) + log(1 /δ ) n r − c (2 + ¯ c ) ¯ f ϕ max ( s v ) κ ( ∞ ) (cid:107) z (cid:107) r f r − c (2 + ¯ c ) ¯ f ϕ max ( s v ) r f r − c (2 + ¯ c ) γn s / v r> , whenever r ≥ c C (cid:18) (cid:107) z (cid:107) κ ( ∞ ) ∨ (cid:19) (cid:32)(cid:114) s v log( ep/s v ) + s θ log( ep/s θ ) + log( nL f L θ ) + log(1 /δ ) n ∨ r f (cid:33) (cid:95) c ¯ cκ (¯ c ) γ √ s v n , where C = (1 + ϕ / (2 s θ ))¯ c ¯ f ϕ max ( s v ) /κ (¯ c ). To conclude, adjust constants. Proof of Corollary 2 . The idea is to derive an upper bound on the gradient of the objective function.Any γ > v ( τ ), ˆ v ( τ ), T v ( τ ), and s v for v ( τ ; z ), ˆ v ( τ ; z ), T v ( τ ; z ), and s v ( z ), respectively. Let ζ ∈ (0 ,
1) be arbitrary and T ζ be an ζ -net with cardinality card( T ζ ) ≤ /ζ . y the definition of v ( τ ) and repeated applications of the triangle inequality,sup τ ∈T (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n (cid:88) i =1 ˆ f i ( τ ) X i X (cid:48) i v ( τ ) + nz (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∞ ≤ sup τ ∈T (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n (cid:88) i =1 (cid:0) f i ( τ ) X i X (cid:48) i − E [ f i ( τ ) X i X (cid:48) i ] (cid:1) v ( τ ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∞ + sup τ ∈T (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n (cid:88) i =1 (cid:0) ˆ f i ( τ ) − f i ( τ ) (cid:1) X i X (cid:48) i v ( τ ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∞ ≤ sup s ∈ T ζ sup s (cid:48) ∈T (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n (cid:88) i =1 (cid:0) f i ( s (cid:48) ) X i X (cid:48) i − E [ f i ( s (cid:48) ) X i X (cid:48) i ] (cid:1) v ( s ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∞ + sup s ∈ T ζ sup τ : | τ − s |≤ ζ sup s (cid:48) ∈T (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n (cid:88) i =1 (cid:0) f i ( s (cid:48) ) X i X (cid:48) i − E [ f i ( s (cid:48) ) X i X (cid:48) i ] (cid:1)(cid:0) v ( τ ) − v ( s ) (cid:1)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∞ + sup s ∈ T ζ sup s (cid:48) ∈T (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n (cid:88) i =1 (cid:0) ˆ f i ( s (cid:48) ) − f i ( s (cid:48) ) (cid:1) X i X (cid:48) i v ( s ) (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∞ + sup s ∈ T ζ sup τ : | τ − s |≤ ζ sup s (cid:48) ∈T (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n (cid:88) i =1 (cid:0) ˆ f i ( s (cid:48) ) − f i ( s (cid:48) ) (cid:1) X i X (cid:48) i (cid:0) v ( τ ) − v ( s ) (cid:1)(cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∞ = I + II + III + IV . (82)We now bound the four terms on the far right hand side in above display. Bound on I.
By Lemma 21, Assumptions 6 and 11, and the union bound over $s\in T_\zeta$, with probability at least $1-\delta$,
\[
\mathrm{I} \lesssim \bar c\,\bar f\,\varphi^{1/2}(s_v)\,\varphi^{1/2}(1)\,\varphi^{1/2}(2s_\theta)\,\frac{\|z\|_2}{\kappa(\infty)}\,\sqrt n\,\sqrt{\log p + \log(nL_fL_\theta) + \log(1+1/\zeta) + \log(1/\delta)}. \tag{83}
\]
Bound on II.
By Lemmas 19 and 21, Assumptions 6 and 11, and the union bound over $s\in T_\zeta$, with probability at least $1-\delta$,
\[
\mathrm{II} \lesssim \bar c\,\bar f\,L_fL_\theta\,\varphi^{1/2}(2s_v)\,\varphi^{1/2}(1)\,\varphi^{1/2}(2s_\theta)\,\frac{\|z\|_2}{\kappa(\infty)}\,\varphi^{1/2}(2s_\theta)\,\varphi_{\max}(p) \times \sqrt n\,\sqrt{s_v\log(ep/s_v) + \log p + \log(nL_fL_\theta) + \log(1+1/\zeta) + \log(1/\delta)}\;\zeta. \tag{84}
\]
Bound on III.
We introduce the following two function classes: G = { g : R p → R : g ( X ) = f Y | X ( X (cid:48) θ ( s (cid:48) ) | X ) X k , ≤ k ≤ p, s (cid:48) ∈ T } , G = { g : R p → R : g ( X ) = f Y | X ( X (cid:48) θ ( s (cid:48) ) | X )( X (cid:48) v ) , v ∈ (cid:8) v ( s ) κ ( ∞ ) / (cid:107) z (cid:107) : s ∈ T ζ (cid:9) , s (cid:48) ∈ T } . Recall that by eq. (74), with probability at least 1 − η , we have | ˆ f i ( τ ) − f i ( τ ) | (cid:46) r f f i ( τ ) . Thus, withprobability at least 1 − η , III (cid:46) r f sup s ∈ T ζ sup s (cid:48) ∈T max ≤ k ≤ p (cid:32) n (cid:88) i =1 f i ( s (cid:48) ) (cid:1) | X ik || X (cid:48) i v ( s ) | (cid:33) (cid:46) r f √ n (cid:107) z (cid:107) κ ( ∞ ) (cid:0) (cid:107) G n (cid:107) G + (cid:107) G n (cid:107) G (cid:1) + r f n ¯ f ϕ max ( s v ) (cid:107) z (cid:107) κ ( ∞ ) . Therefore, by Lemma 21, Assumptions 6 and 11, and the union bound over s ∈ T ζ , we have, with probability t least 1 − η − δ , III (cid:46) ¯ c ¯ f ϕ max ( s v ) (cid:0) ϕ / (2 s θ ) (cid:1) (cid:107) z (cid:107) κ ( ∞ ) √ n (cid:113) log p + log( nL f L θ ) + log(1 + 1 /ζ ) + log(1 /δ ) r f + ¯ c ¯ f ϕ max ( s v ) (cid:107) z (cid:107) κ ( ∞ ) nr f . (85) Bound on IV.
Analogous to the bounds of II and III , we have, with probability at least 1 − η − δ , IV (cid:46) ¯ c ¯ f L f L θ ϕ max (2 s v ) (cid:0) ϕ / (2 s θ ) (cid:1) (cid:107) z (cid:107) κ ( ∞ ) ϕ / (2 s θ ) ϕ max ( p ) × √ n (cid:113) s v log( ep/s v ) + log p + log( nL f L θ ) + log(1 + 1 /ζ ) + log(1 /δ ) r f ζ + ¯ c ¯ f L f L θ ϕ max (2 s v ) (cid:107) z (cid:107) κ ( ∞ ) ϕ / (2 s θ ) ϕ max ( p ) nr f ζ. (86) Conclusion.
Since ζ ∈ (0 ,
1) is arbitrary, we can choose ζ (cid:16) / (cid:0) L f L θ ϕ / (2 s θ ) ϕ max ( p ) √ p (cid:1) . Combinethe bounds in eq. (82)–(86), adjust some constants, and conclude that with probability at least 1 − η − δ ,sup τ ∈T (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) n (cid:88) i =1 ˆ f i ( τ ) X i X (cid:48) i v ( τ ) + nz (cid:13)(cid:13)(cid:13)(cid:13)(cid:13) ∞ (cid:46) ¯ c ¯ f ϕ max (2 s v ) (cid:0) ϕ / (2 s θ ) (cid:1) (cid:107) z (cid:107) κ ( ∞ ) (cid:18)(cid:113) log( np/δ ) + log (cid:0) L f L θ ϕ max (2 s θ ) ϕ max ( p ) (cid:1) + √ nr f (cid:19) √ n. We further upper bound (and simplify) the term on the right hand side in above display. Recall fromTheorem 5 C = ¯ c ¯ f L f L θ (cid:0) ϕ / (2 s θ ) (cid:1) ϕ max ( s v ) /κ (¯ c ). Then,¯ c ¯ f ϕ max (2 s v ) (cid:0) ϕ / (2 s θ ) (cid:1) (cid:107) z (cid:107) κ ( ∞ ) (cid:18)(cid:113) log( np/δ ) + log (cid:0) L f L θ ϕ max (2 s θ ) ϕ max ( p ) (cid:1) + √ nr f (cid:19) √ n ( a ) (cid:46) ¯ c ¯ f L f L θ ϕ max (2 s v ) (cid:0) ϕ max (2 s θ ) (cid:1) (cid:107) z (cid:107) κ ( ∞ ) (cid:16)(cid:112) log( np/δ ) + √ nr f (cid:17) √ n (cid:46) C ¯ f ϕ max (2 s v ) ϕ max ( s v ) κ (¯ c ) (cid:107) z (cid:107) κ ( ∞ ) ( b ) (cid:46) C ¯ f κ (¯ c ) (cid:107) z (cid:107) κ ( ∞ ) , where (a) and (b) follow from Lemma 13 in Belloni and Chernozhukov (2011) (more specifically, (a) holdssince log (cid:0) ϕ max (2 s θ ) ϕ max ( p ) (cid:1) (cid:46) (cid:0) ϕ max (2 s θ ) (cid:1) + log p by Lemma 13 in Belloni and Chernozhukov (2011)and (b) holds since ϕ max (2 s v ) /ϕ max ( s v ) ≤ ϕ max (1) /ϕ max ( s v ) ≤ C > γ ≥ Cc C ¯ f κ (¯ c ) (cid:107) z (cid:107) κ ( ∞ ) (cid:16)(cid:112) log( np/δ ) + √ nr f (cid:17) √ n, eq. (35) holds with probability at least 1 − η − δ . Plug this lower bound on γ > r > .5 Proofs of Section E.4 Proof of Lemma 15 . To simplify notation we write f i ( τ ) for f Y | X ( X (cid:48) i θ ( τ ) | X i ). 
We have
\[
\big|\hat f_i(\tau) - f_i(\tau)\big| = \Big|\frac{1}{\hat f_i(\tau)} - \frac{1}{f_i(\tau)}\Big|\,\hat f_i(\tau)\, f_i(\tau)
\le \Big|\frac{1}{\hat f_i(\tau)} - \frac{1}{f_i(\tau)}\Big|\,\big|\hat f_i(\tau) - f_i(\tau)\big|\, f_i(\tau) + \Big|\frac{1}{\hat f_i(\tau)} - \frac{1}{f_i(\tau)}\Big|\, f_i^2(\tau).
\]
If $\sup_{\tau\in\mathcal{T}}\max_{1\le i\le n} \big|1/\hat f_i(\tau) - 1/f_i(\tau)\big|\, f_i(\tau) < 1/2$, this can be rearranged to yield
\[
\sup_{\tau\in\mathcal{T}}\max_{1\le i\le n} \Big|\frac{\hat f_i(\tau)}{f_i(\tau)} - 1\Big| \le 2\bar f \sup_{\tau\in\mathcal{T}}\max_{1\le i\le n} \Big|\frac{1}{\hat f_i(\tau)} - \frac{1}{f_i(\tau)}\Big|. \tag{87}
\]
Thus, the claim of the lemma follows if $\sup_{\tau\in\mathcal{T}}\max_{1\le i\le n} |1/\hat f_i(\tau) - 1/f_i(\tau)| \to 0$. For $\tau\in\mathcal{T}$ and $h>0$, a third-order Taylor expansion gives
\begin{align*}
Q_Y(\tau+h;X) &= Q_Y(\tau;X) + Q_Y'(\tau;X)h + \tfrac{1}{2}Q_Y''(\tau;X)h^2 + \tfrac{1}{6}Q_Y'''(\zeta_+;X)h^3,\\
Q_Y(\tau-h;X) &= Q_Y(\tau;X) - Q_Y'(\tau;X)h + \tfrac{1}{2}Q_Y''(\tau;X)h^2 - \tfrac{1}{6}Q_Y'''(\zeta_-;X)h^3,
\end{align*}
where $\zeta_+\in(\tau,\tau+h)$ and $\zeta_-\in(\tau-h,\tau)$. Combine both expansions and conclude that
\[
Q_Y'(\tau;X) = \frac{Q_Y(\tau+h;X) - Q_Y(\tau-h;X)}{2h} - \frac{h^2}{12}\big(Q_Y'''(\zeta_+;X) + Q_Y'''(\zeta_-;X)\big). \tag{88}
\]
Recall the identity $Q_Y'(\tau;X) = 1/f_{Y|X}(X'\theta(\tau)\,|\,X)$ and invoke Assumption 12 to arrive at
\[
\sup_{\tau\in\mathcal{T}}\max_{1\le i\le n} \Big|\frac{1}{\hat f_i(\tau)} - \frac{1}{f_i(\tau)}\Big| \le \sup_{\tau\in\mathcal{T}}\max_{1\le i\le n} \Big|\frac{X_i'\hat\theta_\lambda(\tau+h) - X_i'\hat\theta_\lambda(\tau-h) - Q(\tau+h;X_i) + Q(\tau-h;X_i)}{2h}\Big| + C_Q h^2. \tag{89}
\]
By Assumption 1 and Lemmas 7 and 8,
\[
\Big\| \max_{1\le i\le n}\ \sup_{\tau\in\mathcal{T}}\ \sup_{u\in C_p(T_\theta(\tau),\bar c)\cap B_p(0,1)} X_i'u \Big\|_{\psi_2} \lesssim \frac{\sqrt{\log n + s_\theta\log(ep/s_\theta)}}{\varphi^{1/2}(s_\theta)},
\]
and by Theorem 3, with probability at least $1-\delta$,
\[
\sup_{\tau\in\mathcal{T}} \big\|\hat\theta_\lambda(\tau) - \theta(\tau)\big\|_2 \lesssim \bigg( \frac{\bar c\,\varphi^{1/2}(2 s_\theta) L_\theta}{\kappa^2(\bar c)} \vee 1 \bigg) \sqrt{\frac{s_\theta\log(ep/s_\theta) + \log n + \log(1/\delta)}{n}} \;\bigvee\; \frac{\bar c}{\kappa^2(\bar c)}\,\frac{\lambda\sqrt{s_\theta}}{n} =: r_\theta.
\]
Hence, with probability at least $1-\delta$,
\[
\sup_{\tau\in\mathcal{T}}\max_{1\le i\le n} \Big|\frac{1}{\hat f_i(\tau)} - \frac{1}{f_i(\tau)}\Big| \lesssim \sqrt n\, h^{-1} r_\theta^2 + C_Q h^2.
\]
By assumption the right-hand side in the above display vanishes as $n,p\to\infty$. Combine this with eq. (87) to conclude the proof.
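The proof rests on the second-order accuracy of the central difference quotient in eq. (88). Below is a minimal numerical sketch of that step: we plug in the exact Exponential(1) quantile function as a hypothetical stand-in for the estimated conditional quantile, so only the $C_Q h^2$ discretization bias survives and its $O(h^2)$ decay can be checked directly.

```python
import math

def quantile_exp(tau):
    """Quantile function of the Exponential(1) distribution: Q(tau) = -log(1 - tau)."""
    return -math.log(1.0 - tau)

def sparsity_central_diff(Q, tau, h):
    """Central-difference estimate of the sparsity function
    Q'(tau) = 1/f(Q(tau)), as in eq. (88): (Q(tau+h) - Q(tau-h)) / (2h)."""
    return (Q(tau + h) - Q(tau - h)) / (2.0 * h)

tau = 0.5
exact = 1.0 / (1.0 - tau)  # for Exp(1): Q'(tau) = 1/(1 - tau)

err_h  = abs(sparsity_central_diff(quantile_exp, tau, 0.01)  - exact)
err_h2 = abs(sparsity_central_diff(quantile_exp, tau, 0.005) - exact)

# Halving h should shrink the bias by roughly a factor of 4 (it is O(h^2)).
print(err_h, err_h2, err_h / err_h2)
```

In the lemma the estimated quantile process $X_i'\hat\theta_\lambda(\cdot)$ replaces the exact quantile function, which adds the $\sqrt n\,h^{-1}r_\theta^2$ estimation term on top of this bias.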
Proof of Theorem 6 . To simplify notation, we write f i ( τ ), v ( τ ), and T v ( τ ) instead of f Y | X ( X (cid:48) i θ ( τ ) | X i ), ( τ ; z ), and T v ( τ ; z ), respectively. We have the following expansion of the error term:sup τ ∈T | e n ( τ ; z ) | = sup τ ∈T (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) z (cid:48) ˆ θ λ ( τ ) − z (cid:48) θ ( τ ) − n n (cid:88) i =1 ˆ f i ( τ ) (cid:0) τ − { Y i ≤ X (cid:48) i ˆ θ λ ( τ ) } (cid:1) X (cid:48) i ˆ v γ ( τ )+ 12 n n (cid:88) i =1 f i ( τ ) (cid:0) τ − { Y i ≤ X (cid:48) i θ ( τ ) } (cid:1) X (cid:48) i v ( τ ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ sup τ ∈T (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n n (cid:88) i =1 f i ( τ ) (cid:0) { Y i ≤ X (cid:48) i ˆ θ λ ( τ ) } − { Y i ≤ X (cid:48) i θ ( τ ) } (cid:1) X (cid:48) i v ( τ ) + z (cid:48) (cid:0) ˆ θ λ ( τ ) − θ ( τ ) (cid:1)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + sup τ ∈T (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n n (cid:88) i =1 ˆ f i ( τ ) (cid:0) τ − { Y i ≤ X (cid:48) i ˆ θ λ ( τ ) } (cid:1) X (cid:48) i ˆ v γ ( τ ) − n n (cid:88) i =1 f i ( τ ) (cid:0) τ − { Y i ≤ X (cid:48) i ˆ θ λ ( τ ) } (cid:1) X (cid:48) i v ( τ ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ sup τ ∈T (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n n (cid:88) i =1 f i ( τ ) (cid:0) { Y i ≤ X (cid:48) i ˆ θ λ ( τ ) } − { Y i ≤ X (cid:48) i θ ( τ ) } (cid:1) X (cid:48) i v ( τ ) + z (cid:48) (cid:0) ˆ θ λ ( τ ) − θ ( τ ) (cid:1)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + sup τ ∈T (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n n (cid:88) i =1 (cid:0) ˆ f i ( τ ) − f i ( τ ) (cid:1)(cid:0) { Y i ≤ X (cid:48) i θ ( τ ) } − { Y i ≤ X (cid:48) i ˆ θ λ ( τ ) } (cid:1) X (cid:48) i (cid:0) ˆ v γ ( τ ) − v ( τ ) (cid:1)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + sup τ ∈T (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n n (cid:88) i =1 f i ( τ ) (cid:0) { Y i ≤ X (cid:48) i θ ( τ ) } − { Y i ≤ X (cid:48) i ˆ θ λ ( τ ) } (cid:1) X (cid:48) i (cid:0) ˆ v γ ( τ ) − v ( τ ) 
(cid:1)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + sup τ ∈T (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n n (cid:88) i =1 (cid:0) ˆ f i ( τ ) − f i ( τ ) (cid:1)(cid:0) { Y i ≤ X (cid:48) i θ ( τ ) } − { Y i ≤ X (cid:48) i ˆ θ λ ( τ ) } (cid:1) X (cid:48) i v ( τ ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + sup τ ∈T (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n n (cid:88) i =1 (cid:0) ˆ f i ( τ ) − f i ( τ ) (cid:1)(cid:0) τ − { Y i ≤ X (cid:48) i θ ( τ ) } (cid:1) X (cid:48) i (cid:0) ˆ v γ ( τ ) − v ( τ ) (cid:1)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + sup τ ∈T (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n n (cid:88) i =1 f i ( τ ) (cid:0) τ − { Y i ≤ X (cid:48) i θ ( τ ) } (cid:1) X (cid:48) i (cid:0) ˆ v γ ( τ ) − v ( τ ) (cid:1)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + sup τ ∈T (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n n (cid:88) i =1 (cid:0) ˆ f i ( τ ) − f i ( τ ) (cid:1)(cid:0) τ − { Y i ≤ X (cid:48) i θ ( τ ) } (cid:1) X (cid:48) i v ( τ ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) = I + II + III + IV + V + VI + VII . (90)We now bound the seven terms on the far right hand side in above display. Bound on I.
Let δ ∈ (0 , λ > c >
1, ¯ c := ( c + 1) / ( c − s λ := sup τ ∈T (cid:107) ˆ θ λ ( τ ) (cid:107) , r θ := C (cid:32) ¯ cφ / (2 s θ ) L θ κ (¯ c ) ∨ (cid:33) (cid:114) s θ log( ep/s θ ) + log n + log(1 /δ ) n (cid:95) ¯ cκ (¯ c ) λ √ s θ n , where C > τ ∈T (cid:107) ˆ θ λ ( τ ) − θ ( τ ) (cid:107) ≤ r θ withprobability at least 1 − δ . Define G = { g : R p +1 → R : g ( X, Y ) = f Y | X ( X (cid:48) θ ( τ ) | X ) (cid:0) { Y ≤ X (cid:48) θ ( τ ) } − { Y ≤ X (cid:48) θ } (cid:1) X (cid:48) v, θ ∈ R p , (cid:107) θ (cid:107) ≤ n, (cid:107) θ − θ ( τ ) (cid:107) ≤ r θ , v ∈ C p ( T v ( τ ) , ∩ B p (0 , , τ ∈ T } . Every g ∈ G is uniquely determined by a triplet ( v, θ, τ ); hence, we may write g = g v,θ,τ . Define G (ˆ s λ ) = g v,θ,τ ∈ G : (cid:107) θ (cid:107) ≤ ˆ s λ } . Thus, by Theorem 3, with probability at least 1 − δ , I (cid:46) n − / (cid:107) z (cid:107) κ ( ∞ ) (cid:107) G n (cid:107) G (ˆ s λ ) + sup τ ∈T sup θ (cid:12)(cid:12)(cid:12)(cid:12) E (cid:2) X (cid:48) v ( τ ) f Y | X ( X (cid:48) θ ( τ ) | X ) (cid:0) { Y ≤ X (cid:48) θ } − { Y ≤ X (cid:48) θ ( τ ) } (cid:1)(cid:3) + z (cid:48) θ − z (cid:48) θ ( τ ) (cid:12)(cid:12)(cid:12)(cid:12) , (91)where the supremum in θ is taken over (cid:107) θ (cid:107) ≤ ˆ s λ and (cid:107) θ − θ ( τ ) (cid:107) ≤ r θ .By Lemma 24 (ii), with probability at least 1 − δ , (cid:107) G n (cid:107) G (ˆ s λ ) (cid:46) ¯ f / ϕ / ( s v ) (cid:0) ϕ / (2 s θ ) (cid:1)(cid:0) ϕ / (ˆ s λ + s θ ) (cid:1) × (cid:0) υ r θ ,n (cid:0) ˆ s λ log(1 /r θ ) (cid:1) + υ r θ ,n ( t s v , ˆ s λ ,s θ ,n,δ ) (cid:1) , (92)where t s v , ˆ s λ ,s θ ,n,δ = s v log( ep/s v ) + ˆ s λ log( ep/ ˆ s λ ) + s θ log( ep/s θ ) + log( L f L θ n/δ ) and υ r θ ,n ( z ) = √ z (cid:0) √ r θ + n − / (log n ) √ z + n − (log n ) / z (cid:1) for z ≥ F Y | X in X (cid:48) θ ( τ ) (with Peano’s remainder term) and by thedefinition of v ( τ ), the second term in eq. 
(91) can be upper bounded bysup τ ∈T sup θ (cid:12)(cid:12)(cid:12)(cid:12) E (cid:2) X (cid:48) v ( τ ) f Y | X ( X (cid:48) θ ( τ ) | X ) (cid:0) F Y | X ( X (cid:48) θ | X ) − F Y | X ( X (cid:48) θ ( τ ) | X ) (cid:1)(cid:3) + z (cid:48) θ − z (cid:48) θ ( τ ) (cid:12)(cid:12)(cid:12)(cid:12) ≤ sup τ ∈T sup θ (cid:12)(cid:12)(cid:12)(cid:12) E (cid:2) X (cid:48) v ( τ ) f Y | X ( X (cid:48) θ ( τ ) | X ) X (cid:48) (cid:0) θ − θ ( τ ) (cid:1) + z (cid:48) θ − z (cid:48) θ ( τ ) (cid:12)(cid:12)(cid:12)(cid:12) + sup τ ∈T sup θ (cid:12)(cid:12)(cid:12)(cid:12) E (cid:2) X (cid:48) v ( τ ) f Y | X ( X (cid:48) θ ( τ ) | X ) L f (cid:0) X (cid:48) ( θ − θ ( τ )) (cid:1) (cid:3)(cid:12)(cid:12)(cid:12)(cid:12) = L f τ ∈T sup θ (cid:12)(cid:12)(cid:12) E (cid:2) X (cid:48) v ( τ ) f Y | X ( X (cid:48) θ ( τ ) | X ) (cid:0) X (cid:48) ( θ − θ ( τ )) (cid:1) (cid:3)(cid:12)(cid:12)(cid:12) (cid:46) ¯ f L f ϕ / ( s v ) ϕ / (ˆ s λ + s θ ) (cid:107) z (cid:107) κ ( ∞ ) r θ . (93)Combine eq. (91)–(93), use that ( s v + s θ ) log ( np/δ ) log ( n ) = o ( n ) and r f = o (1), and conclude that, withprobability at least 1 − δ , I (cid:46) L f ¯ f (cid:107) z (cid:107) κ ( ∞ ) ϕ / ( s v ) (cid:0) ϕ / (2 s θ ) (cid:1)(cid:0) ϕ / (ˆ s λ + s θ ) (cid:1) × (cid:0) r θ + ˆ r B (log n ) + √ r θ ˆ r B (cid:1) . (94) Bound on II.
Note thatˆ f i ( τ ) f i ( τ ) − (cid:32) f i ( τ ) − f i ( τ ) (cid:33) (cid:32) ˆ f i ( τ ) f i ( τ ) − (cid:33) f i ( τ ) + (cid:32) f i ( τ ) − f i ( τ ) (cid:33) f i ( τ ) . Recall the identity Q (cid:48) Y ( τ ; X ) = 1 /f Y | X ( X (cid:48) θ ( τ ) | X ) and eq. (88) from the proof of Lemma 15, namely, for h > Q (cid:48) Y ( τ ; X ) = Q Y ( τ + h ; X ) − Q Y ( τ − h ; X )2 h + ( Q (cid:48)(cid:48)(cid:48) Y ( ζ + ; X ) − Q (cid:48)(cid:48)(cid:48) Y ( ζ − ; X )) h , hese three identities yield the following expansion:ˆ f i ( τ ) − f i ( τ )= (cid:32) ˆ f i ( τ ) f i ( τ ) − (cid:33) f i ( τ ) (cid:0) Q (cid:48)(cid:48)(cid:48) Y ( ζ + ; X ) − Q (cid:48)(cid:48)(cid:48) Y ( ζ − ; X ) (cid:1) h + f i ( τ ) (cid:0) Q (cid:48)(cid:48)(cid:48) Y ( ζ + ; X ) − Q (cid:48)(cid:48)(cid:48) Y ( ζ − ; X ) (cid:1) h + (cid:32) ˆ f i ( τ ) f i ( τ ) − (cid:33) f i ( τ ) (cid:32) X (cid:48) i (cid:0) θ ( τ + h ) − ˆ θ λ ( τ + h ) (cid:1) h (cid:33) + f i ( τ ) (cid:32) X (cid:48) i (cid:0) θ ( τ + h ) − ˆ θ λ ( τ + h ) (cid:1) h (cid:33) + (cid:32) ˆ f i ( τ ) f i ( τ ) − (cid:33) f i ( τ ) (cid:32) X (cid:48) i (cid:0) θ ( τ − h ) − ˆ θ λ ( τ − h ) (cid:1) h (cid:33) + f i ( τ ) (cid:32) X (cid:48) i (cid:0) θ ( τ − h ) − ˆ θ λ ( τ − h ) (cid:1) h (cid:33) . (95)Define r f := C ¯ f ϕ / ( s θ ) s θ log( np/δ ) h √ n + C Q h where C ≥ δ, η ∈ (0 , γ > c >
1, ¯ c := ( c + 1) / ( c − r v := C (cid:18) (cid:107) z (cid:107) κ ( ∞ ) ∨ (cid:19) (cid:32)(cid:114) s v log( ep/s v ) + s θ log( ep/s θ ) + log( n/δ ) n ∨ r f (cid:33) (cid:95) ¯ cκ (¯ c ) γ √ s v n , where C := ¯ c ¯ f L f L θ (cid:0) ϕ / (2 s θ ) (cid:1) ϕ max ( s v ) /κ (¯ c ) ∨ G = { g : R p +1 → R : g ( X, Y ) = f Y | X ( X (cid:48) θ ( τ ) | X ) (cid:0) { Y ≤ X (cid:48) θ ( τ ) } − { Y ≤ X (cid:48) θ } (cid:1) ( X (cid:48) v )( X (cid:48) u ) , θ ∈ R p , (cid:107) θ (cid:107) ≤ n, (cid:107) θ − θ ( τ ) (cid:107) ≤ r θ , v ∈ C p ( T v ( τ ) , ¯ c ) ∩ B p (0 , , u ∈ C p ( T θ ( τ ) , ¯ c ) ∩ B p (0 , , τ ∈ T } . Every g ∈ G is uniquely determined by a quadruple ( v, u, θ, τ ); hence, we may write g = g v,u,θ,τ . Define G (ˆ s λ ) = { g v,u,θ,τ ∈ G : (cid:107) θ (cid:107) ≤ ˆ s λ } .By expansion eq. (95), Theorem 3, and 5, and Lemma 15 we have, with probability at least 1 − δ , II (cid:46) h r v (1 + r f ) ¯ f C Q sup τ ∈T sup v ∈ C p ( T v ( τ ) , ¯ c ) ∩ B p (0 , n n (cid:88) i =1 | X (cid:48) i v | + h − r v r θ r f ¯ f sup τ ∈T sup v ∈ C p ( T v ( τ ) , ¯ c ) ∩ B p (0 , sup u ∈ C p ( T θ ( τ ) , ¯ c ) ∩ B p (0 , n n (cid:88) i =1 | X (cid:48) i u || X (cid:48) i v | + h − r v r θ (cid:107) P n (cid:107) G (ˆ s λ ) = II + II + II . (96) Bound on II . By Lemma 10, with probability at least 1 − δ , II (cid:46) h r v (1 + r f ) ¯ f C Q sup v (cid:32)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n n (cid:88) i =1 ( X (cid:48) i v ) − E [( X (cid:48) i v ) ] (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:33) / + h r v (1 + r f ) ¯ f (2 + ¯ c ) C Q ϕ / ( s v ) (cid:46) h r v (1 + r f ) ¯ f (2 + ¯ c ) C Q ϕ / ( s v ) (cid:16) n − / t / s v ,δ + n − / t / s v ,δ (cid:17) , where the supremum is taken over v ∈ C p ( T v ( τ ) ∩ B p (0 , , ¯ c ) and τ ∈ T , and t s v ,δ = s v log( ep/s v )+log(1 /δ ). ound on II . 
By an easy modification of Lemma 10, with probability at least 1 − δ , II (cid:46) h − r v r θ r f ¯ f sup v sup u (cid:32)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n n (cid:88) i =1 ( X (cid:48) i v ) ( X (cid:48) i u ) − E [( X (cid:48) i v ) ( X (cid:48) i u ) ] (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:33) / + h − r v r θ r f ¯ f (2 + ¯ c ) ϕ / ( s v ) ϕ / ( s θ ) (cid:46) h − r v r θ r f ¯ f (2 + ¯ c ) ϕ / ( s v ) ϕ / ( s θ ) (cid:16) n − / t / s v ,s θ ,δ + n − / t s v ,s θ ,δ (cid:17) , where the suprema are taken over v ∈ C p ( T v ( τ ) ∩ B p (0 , , ¯ c ) and τ ∈ T and over u ∈ C p ( T θ ( τ ) ∩ B p (0 , , ¯ c )and τ ∈ T , and t s v ,s θ ,δ = s v log( ep/s v ) + s θ log( ep/s θ ) + log(1 /δ ). Bound on II . By Lemma 25, with probability at least 1 − δ , II (cid:46) r v r θ h √ n (cid:107) G n (cid:107) G (ˆ s λ ) + h − r v r θ sup g ∈G (ˆ s λ ) | P g | (cid:46) (2 + ¯ c ) ¯ f / ϕ / ( s θ ) ϕ / ( s v ) (cid:0) ϕ / (2 s θ ) (cid:1)(cid:0) ϕ / (ˆ s λ + s θ ) (cid:1) × r v r θ h √ n (cid:16) υ r θ ,n (cid:0) ˆ s λ log(1 /r θ ) (cid:1) + υ r θ ,n ( t s v , ˆ s λ ,s θ ,n,δ ) (cid:17) + h − r v r θ (2 + ¯ c ) ¯ f L f ϕ / ( s θ ) ϕ / ( s v ) ϕ / (ˆ s λ ) , where t s v , ˆ s λ ,s θ ,n,δ = s v log( ep/s v ) + ˆ s λ log( ep/ ˆ s λ ) + s θ log( ep/s θ ) + log( L f L θ n/δ ) and υ r θ ,n ( z ) = √ z (cid:0) √ r θ + n − / (log n ) √ z + n − / (log n ) z / (cid:1) for z ≥ I , II , and III , use that ( s v + s θ ) log ( np/δ ) log ( n ) = o ( n ) and r f = o (1),and conclude that, with probability at east 1 − δ , II (cid:46) (2 + ¯ c ) ¯ f C Q ϕ / ( s v ) × h r v + (2 + ¯ c ) L f L θ ¯ f / ϕ / ( s θ ) ϕ / ( s v ) (cid:0) ϕ / (2 s θ ) (cid:1)(cid:0) ϕ / (ˆ s λ + s θ ) (cid:1) × h − r v r θ (cid:16) ˆ r B (log n ) + r / θ ˆ r B + r θ + r f (cid:17) . (97) Bound on III.
Recall the definition of function class G . Since ( s v + s θ ) log ( np/δ ) log ( n ) = o ( n ) and r f = o (1) we have by Lemma 24 (ii), Theorems 3 and 5 that, with probability at least 1 − δ , III (cid:46) r v √ n (cid:107) G n (cid:107) G (ˆ s λ ) + sup τ ∈T sup θ sup v (cid:12)(cid:12)(cid:12)(cid:12) E (cid:2) f Y | X ( X (cid:48) θ ( τ ) | X ) (cid:0) { Y ≤ X (cid:48) θ ( τ ) } − { Y ≤ X (cid:48) θ } (cid:1) X (cid:48) v (cid:3)(cid:12)(cid:12)(cid:12)(cid:12) r v (cid:46) r v √ n (cid:107) G n (cid:107) G (ˆ s λ ) + L f ¯ f ϕ / ( s v ) ϕ / ( s θ + ˆ s λ ) r v r θ (cid:46) L f ¯ f ϕ / ( s v ) (cid:0) ϕ / (2 s θ ) (cid:1)(cid:0) ϕ / (ˆ s λ + s θ ) (cid:1) × r v (cid:0) ˆ r B (log n ) + √ r θ ˆ r B + r θ (cid:1) . (98) Bound on IV.
We can bound this term in the same way in which we bounded term II . We concludethat, with probability at east 1 − δ , IV (cid:46) ¯ f C Q ϕ / ( s v ) (cid:107) z (cid:107) κ ( ∞ ) × h + (2 + ¯ c ) L f L θ ¯ f / ϕ / ( s θ ) ϕ / ( s v ) (cid:0) ϕ / (2 s θ ) (cid:1)(cid:0) ϕ / (ˆ s λ + s θ ) (cid:1) (cid:107) z (cid:107) κ ( ∞ ) × h − r θ (cid:16) ˆ r B (log n ) + r / θ ˆ r B + r θ + r f (cid:17) . (99) Bound on V.
The strategy is similar to how we bounded term II , except that we will use Lemma 23 nstead of Lemma 25. Define G = { g : R p +1 → R : g ( X, Y ) = f Y | X ( X (cid:48) θ ( τ ) | X ) (cid:0) τ − { Y ≤ X (cid:48) θ ( τ ) } (cid:1) ( X (cid:48) v )( X (cid:48) u ) ,v ∈ C p ( T v ( τ ) , ¯ c ) ∩ B p (0 , , u ∈ C p ( T θ ( τ ) , ¯ c ) ∩ B p (0 , , τ ∈ T } . By expansion eq. (95), Theorem 3, and 5, and Lemma 15 we have, with probability at least 1 − δ , V (cid:46) h r v (1 + r f ) ¯ f C Q sup τ ∈T sup v ∈ C p ( T v ( τ ) ∩ B p (0 , , ¯ c ) n n (cid:88) i =1 | X (cid:48) i v | + h − r v r θ r f ¯ f sup τ ∈T sup v ∈ C p ( T v ( τ ) , ¯ c ) ∩ B p (0 , sup u ∈ C p ( T θ ( τ ) , ¯ c ) ∩ B p (0 , n n (cid:88) i =1 | X (cid:48) i u || X (cid:48) i v | + h − r v r θ (cid:107) P n (cid:107) G = V + V + V . Clearly, terms V and V can be bounded as II and II . We only need to bound term V . By Lemma 23,with probability at least 1 − δ , V (cid:46) h − r v r θ (cid:107) P n (cid:107) G = r v r θ h √ n (cid:107) G n (cid:107) G (cid:46) r v r θ h √ n (2 + ¯ c ) ¯ f ϕ / ( s v ) ϕ / ( s θ ) (cid:0) ϕ / (2 s θ ) (cid:1) ψ n (cid:0) t s v ,s θ ,n,δ (cid:1) , where t s v ,s θ ,n,δ = s v log( ep/s v ) + s θ log( ep/s θ ) + log( nL f L θ /δ ) and ψ n ( z ) = √ z (cid:0) n − / √ z + n − / z (cid:1) for z ≥ s v + s θ ) log ( np/δ ) log ( n ) = o ( n ) and r f = o (1) we conclude that, with probability at least 1 − δ , V (cid:46) (2 + ¯ c ) ¯ f C Q ϕ / ( s v ) × h r v + (2 + ¯ c ) ¯ f L f L θ ϕ / ( s v ) ϕ / ( s θ ) (cid:0) ϕ / (2 s θ ) (cid:1) × h − r v r θ ˆ r B . (100)(Clearly, this bound is not tight, but it is reasonable and matches the other bounds well). Bound on VI.
Let c >
1, set ¯ c := ( c + 1) / ( c − G = (cid:8) g : R p +1 → R : g ( X, Y ) = f Y | X ( X (cid:48) θ ( τ ) | X ) (cid:0) τ − (cid:8) Y ≤ X (cid:48) θ ( τ ) } (cid:1) X (cid:48) v,v ∈ C p ( T v ( τ ) , ¯ c ) ∩ B p (0 , , τ ∈ T } . By Lemma 22 and Theorem 5, with probability at least 1 − δ , VI (cid:46) r v √ n (cid:107) G n (cid:107) G (cid:46) (2 + ¯ c ) ¯ f ϕ / ( s v ) (cid:0) ϕ / (2 s θ ) (cid:1) × r v ˆ r B . (101) Bound on VII . We can bound this term in the same way in which we bounded term V . We concludethat, with probability at east 1 − δ , VII (cid:46) ¯ f C Q ϕ / ( s v ) (cid:107) z (cid:107) κ ( ∞ ) × h + (2 + ¯ c ) ¯ f L f L θ ϕ / ( s v ) ϕ / ( s θ ) (cid:0) ϕ / (2 s θ ) (cid:1) (cid:107) z (cid:107) κ ( ∞ ) × h − r θ ˆ r B . (102) onclusion. Combining eq. (90) and the bounds on I − VII we obtain, with probability at least 1 − δ ,sup τ ∈T | e n ( τ ; z ) | (cid:46) C (cid:18) r v + (cid:107) z (cid:107) κ ( ∞ ) (cid:19) (cid:0) r θ + ˆ r B (log n ) + √ r θ ˆ r B (cid:1) + C r v (ˆ r B + r θ )+ C (cid:18) r v + (cid:107) z (cid:107) κ ( ∞ ) (cid:19) (cid:16) ˆ r B (log n ) + √ r θ ˆ r B + ˆ r B + r θ (cid:17) h − r θ + C (cid:18) r v + (cid:107) z (cid:107) κ ( ∞ ) (cid:19) (cid:0) r f h − r θ + h (cid:1) where C := L f ¯ f ϕ / ( s v ) (cid:0) ϕ / (2 s θ ) (cid:1)(cid:0) ϕ / (ˆ s λ + s θ ) (cid:1) C := (2 + ¯ c ) L f L θ ¯ f / ϕ / ( s θ ) ϕ / ( s v ) (cid:0) ϕ / (2 s θ ) (cid:1)(cid:0) ϕ / (ˆ s λ + s θ ) (cid:1) C := (2 + ¯ c ) ¯ f C Q ϕ / ( s v ) + C ¯ f (103) Proof of Corollary 3 . The claim follows by combining Theorem 6 with Theorem 4 (ii) and Corollary 2.The rate on the remainder term can be simplified by exploiting that, by assumption, ( s v + s θ ) log ( np/δ ) log ( n ) = o ( nh ) and h s v = o (1). G.6 Proofs of Section E.5
Proof of Lemma 16. To simplify notation, we write $f_i(\tau)$, $v(\tau)$, and $T_v(\tau)$ instead of $f_{Y|X}(X_i'\theta(\tau)\,|\,X_i)$, $v(\tau;z)$, and $T_v(\tau;z)$, respectively. Proof of (i).
By Lemma 19, for all $g_\tau, g_{\tau'}\in\mathcal{G}$,
\begin{align*}
\big\|(g_\tau - Pg_\tau) - (g_{\tau'} - Pg_{\tau'})\big\|_{P,\psi_2} &= \|g_\tau - g_{\tau'}\|_{P,\psi_2}\\
&\le \big\|\big(f_{Y|X}(X'\theta(\tau)|X) - f_{Y|X}(X'\theta(\tau')|X)\big)\big(\tau - 1\{Y\le X'\theta(\tau)\}\big) X'v(\tau)\big\|_{\psi_2}\\
&\quad+ \big\|f_{Y|X}(X'\theta(\tau)|X)\big(\tau - 1\{Y\le X'\theta(\tau)\}\big) X'\big(v(\tau) - v(\tau')\big)\big\|_{\psi_2}\\
&\quad+ \big\|f_{Y|X}(X'\theta(\tau')|X)\big(\tau - \tau' - 1\{Y\le X'\theta(\tau)\} + 1\{Y\le X'\theta(\tau')\}\big) X'v(\tau)\big\|_{\psi_2}\\
&\le L_fL_\theta\,\varphi^{1/2}(2s_\theta)\,\varphi^{1/2}(s_v)\,\frac{\|z\|_2}{\kappa(\infty)}\,|\tau-\tau'| + \bar f L_fL_\theta\,\varphi^{1/2}(s_v)\,\varphi^{1/2}(2s_\theta)\,\varphi_{\max}(p)\,\frac{\|z\|_2}{\kappa(\infty)}\,|\tau-\tau'|\\
&\quad+ \bar f\,\varphi^{1/2}(s_v)\,\frac{\|z\|_2}{\kappa(\infty)}\,|\tau-\tau'| + \bar f L_fL_\theta\,\varphi^{1/2}(s_v)\,\varphi^{1/2}(2s_\theta)\,\frac{\|z\|_2}{\kappa(\infty)}\,|\tau-\tau'|\\
&\equiv K\rho(\tau,\tau'),
\end{align*}
where $\rho(\tau,\tau') = |\tau-\tau'|$. Total boundedness in the standard deviation metric follows, since the $\psi_2$-norm dominates the $L^2(P)$-norm and $\mathcal{T}$ is totally bounded with respect to $\rho$. Proof of (ii).
By Theorem 8 and Remark 15,
\begin{align*}
\mathbb{E}\big[\|\mathbb{G}_n\|_{\mathcal{G}_\xi}\big] &\lesssim K\int_0^\xi \sqrt{\log N(\varepsilon,\mathcal{T},\rho)}\,d\varepsilon + Kn^{-1/2}\int_0^\xi \log N(\varepsilon,\mathcal{T},\rho)\,d\varepsilon\\
&\lesssim K\int_0^\xi \sqrt{\log(1+1/\varepsilon)}\,d\varepsilon + Kn^{-1/2}\int_0^\xi \log(1+1/\varepsilon)\,d\varepsilon\\
&\lesssim K\bigg(\int_0^\xi d\varepsilon\bigg)^{1/2}\bigg(\int_0^\xi \log(1+1/\varepsilon)\,d\varepsilon\bigg)^{1/2} + Kn^{-1/2}\int_0^\xi \log(1+1/\varepsilon)\,d\varepsilon\\
&\lesssim K\xi^{1/2}\big((1+\xi)\log(1+\xi) - \xi\log\xi\big)^{1/2} + Kn^{-1/2}\big((1+\xi)\log(1+\xi) - \xi\log\xi\big) \to 0 \quad \text{as } \xi\to 0.
\end{align*}
Now, asymptotic equicontinuity follows from Markov's inequality and the assumptions that $\varphi_{\max}(p) = O(1)$ and $\|z\|_2\,\kappa^{-1}(\infty) = O(1)$.

Proof of Theorem 7. By Theorem 1.5.7 and Addendum 1.5.8 in van der Vaart and Wellner (1996) it suffices to establish asymptotic equicontinuity of the leading term of the Bahadur-type representation from Theorem 6, total boundedness of $\mathcal{G}$ as defined in Lemma 16, and finite dimensional convergence to a Gaussian random vector. Asymptotic equicontinuity and total boundedness are established in Lemma 16; hence we are left to show the finite dimensional convergence.

Finite dimensional convergence to a Gaussian random vector follows from the strong moment conditions (Assumption 1), the assumption that the limit covariance function exists, and the Lindeberg-Feller CLT. For details see the proof of Theorem 2.1 on p. 3292 in Chao et al. (2017). We are left to show that the asymptotic mean equals zero and that the (finite dimensional) covariance is as stated in the theorem. First, for all $\tau\in\mathcal{T}$,
\[
\mathbb{E}\bigg[ -\frac{1}{2\sqrt n}\sum_{i=1}^n f_{Y|X}(X_i'\theta(\tau)|X_i)\big(\tau - 1\{Y_i\le X_i'\theta(\tau)\}\big) X_i'v(\tau;z) \bigg] = \mathbb{E}\bigg[ \frac{1}{2\sqrt n}\sum_{i=1}^n f_{Y|X}(X_i'\theta(\tau)|X_i)\big(F_{Y|X}(X_i'\theta(\tau)|X_i) - \tau\big) X_i'v(\tau;z) \bigg] = 0. \tag{104}
\]
Second, for arbitrary $\tau_1,\ldots$
, $\tau_K\in\mathcal{T}$, the asymptotic covariance matrix is
\begin{align*}
&\lim_{n\to\infty}\Bigg( \mathbb{E}\bigg[ \frac{1}{2\sqrt n}\sum_{i=1}^n f_{Y|X}(X_i'\theta(\tau_j)|X_i)\big(\tau_j - 1\{Y_i\le X_i'\theta(\tau_j)\}\big)\big(X_i'v(\tau_j;z)\big)\\
&\qquad\qquad\times \frac{1}{2\sqrt n}\sum_{i=1}^n f_{Y|X}(X_i'\theta(\tau_k)|X_i)\big(\tau_k - 1\{Y_i\le X_i'\theta(\tau_k)\}\big)\big(X_i'v(\tau_k;z)\big) \bigg] \Bigg)_{j,k=1}^K\\
&\quad= \lim_{n\to\infty}\Big( \big(\tau_j\wedge\tau_k - \tau_j\tau_k\big)\,\tfrac{1}{4}\, v'(\tau_j;z)\,\mathbb{E}\big[f_{Y|X}(X'\theta(\tau_j)|X)\, f_{Y|X}(X'\theta(\tau_k)|X)\, XX'\big]\, v(\tau_k;z) \Big)_{j,k=1}^K,
\end{align*}
where we have used the definition of $v(\tau;z)$, Assumption 2, and eq. (104).

Proof of Lemma 17. We prove only the special case $\tau_1 = \tau_2 = \tau$. The general case follows by the obvious generalization of the arguments; however, making this explicit would be extremely tedious. For notational convenience, we write $v(\tau)$ for $v(\tau;z)$, $\hat v(\tau)$ for $\hat v_\gamma(\tau;z)$, $f(\tau)$ for $f_{Y|X}(X'\theta(\tau)|X)$, and $f_i(\tau)$ for $f_{Y|X}(X_i'\theta(\tau)|X_i)$. We also introduce
\[
D(\tau) := \frac{1}{4}\,\mathbb{E}\big[f(\tau)XX'\big] \quad\text{and}\quad \widehat D(\tau) := \frac{1}{4n}\sum_{i=1}^n \hat f_i(\tau) X_i X_i'.
\]
We have the following decomposition:
\begin{align}
\big| \hat v'(\tau)\widehat D(\tau)\hat v(\tau) - v'(\tau)D(\tau)v(\tau) \big| &\le \big| \big(\hat v(\tau) - v(\tau)\big)'\big(\widehat D(\tau) - D(\tau)\big)\big(\hat v(\tau) - v(\tau)\big) \big| \tag{105}\\
&\quad+ \big| v'(\tau)\big(\widehat D(\tau) - D(\tau)\big)\big(\hat v(\tau) - v(\tau)\big) \big| \tag{106}\\
&\quad+ \big| \big(\hat v(\tau) - v(\tau)\big)'\big(\widehat D(\tau) - D(\tau)\big) v(\tau) \big| \tag{107}\\
&\quad+ \big| v'(\tau)\big(\widehat D(\tau) - D(\tau)\big) v(\tau) \big| \tag{108}\\
&\quad+ \big| \big(\hat v(\tau) - v(\tau)\big)' D(\tau)\big(\hat v(\tau) - v(\tau)\big) \big| \tag{109}\\
&\quad+ \big| \big(\hat v(\tau) - v(\tau)\big)' D(\tau) v(\tau) \big| \tag{110}\\
&\quad+ \big| v'(\tau) D(\tau)\big(\hat v(\tau) - v(\tau)\big) \big|. \tag{111}
\end{align}
The first four terms, eq. (105)–(108), can be upper bounded uniformly in $\tau\in\mathcal{T}$ by
\[
\sup_{\tau\in\mathcal{T}}\ \sup_{\|u\|_2\le 1} \bigg| \frac{1}{4n}\sum_{i=1}^n \big(\hat f_i(\tau) - f_i(\tau)\big)(X_i'u)^2 \bigg| \frac{\|z\|_2^2}{\kappa^2(\infty)} + \sup_{\tau\in\mathcal{T}}\ \sup_{\|u\|_2\le 1} \bigg| \frac{1}{4n}\sum_{i=1}^n f_i(\tau)(X_i'u)^2 - \frac{1}{4}\mathbb{E}\big[f(\tau)(X'u)^2\big] \bigg| \frac{\|z\|_2^2}{\kappa^2(\infty)} = \mathrm{I} + \mathrm{II}.
\]
Set $r_f = \bar f\,\sqrt n\, h^{-1} r_\theta^2 + \bar f\, C_Q h^2$. Arguing as in the proof of Theorem 5, by eq. (79), with probability at least $1-\delta$,
\[
\mathrm{I} \lesssim r_f\,\frac{\|z\|_2^2}{\kappa^2(\infty)}\, L_fL_\theta\Big( (2+\bar c)\,\bar f\,\varphi_{\max}(s_v) + \bar r_B\,(2+\bar c)\,\bar f\,\big(1+\varphi^{1/2}(2s_\theta)\big)\,\varphi_{\max}(s_v) \Big),
\]
and, by eq.
(77), with probability at least 1 − δ,

II ≲ r_f r̄_B ‖z‖ κ(∞) L_f L_θ (2 + c̄) f̄ (1 + ϕ_max^{1/2}(2s_θ)) ϕ_max(s_v).

The last three terms, eq. (109)–(111), can be jointly upper bounded via Corollary 2 and Lemma 15, with probability at least 1 − δ, by

(2 + c̄) f̄ ϕ_max(p) ( ‖z‖ κ(∞) ∨ 1 ) ( r̄_B + f̄ √n h^{-1} r̄_B √s_v + f̄ C_Q h √s_v ).

Combine the above bounds to conclude.
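The seven-term bound in eq. (105)–(111) is just the triangle inequality applied to an exact expansion of the quadratic-form difference: writing v̂ = v + d and D̂ = D + E, one has v̂'D̂v̂ − v'Dv = d'Ed + v'Ed + d'Ev + v'Ev + d'Dd + d'Dv + v'Dd. A minimal numerical sanity check of this decomposition (generic symmetric matrices and vectors stand in for D(τ), D̂(τ), v(τ), v̂(τ); nothing here depends on the estimator itself):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 5
# Illustrative stand-ins (not the estimator itself): D ~ D(tau), Dh ~ D_hat(tau),
# v ~ v(tau; z), vh ~ v_hat(tau; z).
D = rng.standard_normal((p, p))
D = (D + D.T) / 2
Dh = D + 0.1 * rng.standard_normal((p, p))
Dh = (Dh + Dh.T) / 2
v = rng.standard_normal(p)
vh = v + 0.1 * rng.standard_normal(p)

d, E = vh - v, Dh - D
diff = vh @ Dh @ vh - v @ D @ v

# Exact expansion behind eq. (105)-(111): seven signed cross terms.
signed = (d @ E @ d) + (v @ E @ d) + (d @ E @ v) + (v @ E @ v) \
         + (d @ D @ d) + (d @ D @ v) + (v @ D @ d)
assert np.isclose(diff, signed)

# Triangle inequality: |diff| is at most the sum of the seven absolute terms.
terms = [abs(d @ E @ d), abs(v @ E @ d), abs(d @ E @ v), abs(v @ E @ v),
         abs(d @ D @ d), abs(d @ D @ v), abs(v @ D @ d)]
assert abs(diff) <= sum(terms) + 1e-12
```

The proof then bounds each of the seven terms separately, uniformly in τ.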
G.7 Proofs of Section E.6
Proof of Lemma 18. The claim follows from the same arguments as the proof of Lemma 4 and is hence omitted.

Proof of Lemma 19. Let s, τ ∈ T be arbitrary. For notational convenience, we introduce D(τ) := E[f_{Y|X}(X'θ(τ)|X) XX'] and write s_v for s_v(z). Then,

‖v(s; z) − v(τ; z)‖ ≤ ‖D^{-1}(s)‖_op sup_{‖u‖≤1, ‖u‖_0≤s_v} ‖(D(τ) − D(s))u‖ ‖D^{-1}(τ)z‖
≤ ‖z‖ κ(∞) sup_{‖u‖≤1, ‖u‖_0≤s_v} sup_{‖v‖≤1} | E[ v'( f_{Y|X}(X'θ(s)|X) − f_{Y|X}(X'θ(τ)|X) ) XX' u ] |
≤ f̄ L_f L_θ ‖z‖ κ(∞) sup_{‖u‖≤1, ‖u‖_0≤s_v} sup_{‖v‖≤1} sup_{‖w‖≤1, ‖w‖_0≤2s_θ} E[ |X'w| |u'XX'v| ] |s − τ|
≲ f̄ L_f L_θ ‖z‖ κ(∞) ϕ_max^{1/2}(p) ϕ_max^{1/2}(s_v) ϕ_max^{1/2}(2s_θ) |s − τ|.

To conclude, simplify the bound.
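The first inequality in the proof above rests on the resolvent-type identity D^{-1}(s)z − D^{-1}(τ)z = D^{-1}(s)(D(τ) − D(s))D^{-1}(τ)z, which turns a difference of solutions into the difference D(τ) − D(s). A minimal numerical sketch (generic positive-definite matrices stand in for D(s) and D(τ); the names are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(1)
p = 4
# Illustrative stand-ins: Ds ~ D(s), Dt ~ D(tau), both positive definite.
A = rng.standard_normal((p, p))
Ds = A @ A.T + p * np.eye(p)
Dt = Ds + 0.05 * np.eye(p)      # small perturbation, as for nearby s and tau
z = rng.standard_normal(p)

vs = np.linalg.solve(Ds, z)     # plays the role of v(s; z)
vt = np.linalg.solve(Dt, z)     # plays the role of v(tau; z)

# Resolvent identity: Ds^{-1} z - Dt^{-1} z = Ds^{-1} (Dt - Ds) Dt^{-1} z.
ident = np.linalg.solve(Ds, (Dt - Ds) @ vt)
assert np.allclose(vs - vt, ident)

# First step of the Lipschitz bound: ||vs - vt|| <= ||Ds^{-1}||_op ||(Dt - Ds) vt||.
op_norm = np.linalg.norm(np.linalg.inv(Ds), 2)
assert np.linalg.norm(vs - vt) <= op_norm * np.linalg.norm((Dt - Ds) @ vt) + 1e-12
```

The remaining steps of the proof then bound ‖(D(τ) − D(s))u‖ by the Lipschitz continuity of the conditional density and of τ ↦ θ(τ).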
Proof of Lemma 20 . We generalize the proof of Lemma 8 from K = 2 to arbitrary K ∈ N .For each x j ∈ conv( X j ), 1 ≤ j ≤ K , there exist n j ∈ N and λ j , . . . , λ nj ≥ (cid:80) n j i j =1 λ i j j = 1, such that x j = (cid:80) n j i j =1 λ i j j x i j j for some x i j j ∈ X j . Thus, by the multi-convexity of the f i ’s and N × K applications ofJensen’s inequality, for all x j ∈ conv( X j ), 1 ≤ j ≤ K , N (cid:88) (cid:96) =1 f (cid:96) ( x , . . . , x K ) ≤ K (cid:88) j =1 n j (cid:88) i j =1 λ i · · · λ i K K (cid:32) N (cid:88) (cid:96) =1 f (cid:96) ( x i , . . . , x i K K ) (cid:33) ≤ K (cid:88) j =1 n j (cid:88) i j =1 λ i · · · λ i K K (cid:32) sup x j ∈X j , ≤ j ≤ K N (cid:88) (cid:96) =1 f (cid:96) ( x , . . . , x K ) (cid:33) ≤ sup x j ∈X j , ≤ j ≤ K N (cid:88) (cid:96) =1 f (cid:96) ( x , . . . , x K ) . Thus, sup x j ∈ conv( X j ) , ≤ j ≤ K (cid:80) Ni =1 f i ( x , . . . , x K ) ≤ sup x j ∈X j , ≤ j ≤ K (cid:80) Ni =1 f i ( x , . . . , x K ). The reverse in-equality holds trivially true since X j ⊆ conv( X j ), 1 ≤ j ≤ K . The same arguments hold if (cid:80) Ni =1 f i isreplaced by (cid:12)(cid:12)(cid:12)(cid:80) Ni =1 f i (cid:12)(cid:12)(cid:12) . This concludes the proof. Proof of Lemma 21 . To simplify notation, we write K k ( τ ) = C p ( J k ( τ ) , ϑ k ) ∩ B p (0 ,
1) for k ∈ {1, 2} and f_i(τ) = f_{Y|X}(X_i'θ(τ)|X_i) for 1 ≤ i ≤ n. Let η ∈ (0,
1) be arbitrary and T η be an η -net with cardinalitycard( T η ) ≤ /η . We have the following decomposition: (cid:107) G n (cid:107) G ≤ sup s ∈T η sup s (cid:48) ∈T sup u k ∈ K k ( s (cid:48) ) k ∈{ , } (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) √ n n (cid:88) i =1 f i ( s )( X (cid:48) i u )( X (cid:48) i u ) − E [ f i ( s )( X (cid:48) i u )( X (cid:48) i u )] (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + sup s ∈ T η sup τ : | τ − s |≤ η sup s (cid:48) ∈T sup u k ∈ K k ( s (cid:48) ) k ∈{ , } (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) √ n n (cid:88) i =1 (cid:0) f i ( τ ) − f i ( s ) (cid:1) ( X (cid:48) i u )( X (cid:48) i u ) − E (cid:2)(cid:0) f i ( τ ) − f i ( s ) (cid:1) ( X (cid:48) i u )( X (cid:48) i u ) (cid:3)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) = I + II . (112) ound on I. By Lemma 10 (ii) and the union bound over s ∈ T η , with probability at least 1 − δ , I (cid:46) ϑ )(2 + ϑ ) ¯ f ϕ / ( s ) ϕ / ( s ) π n, (cid:0) t s ,s ,η,δ (cid:1) , (113)where t s ,s ,η,δ = s log( ep/s ) + s log( ep/s ) + log(1 + 1 /η ) + log(1 /δ ) and π n, ( z ) = (cid:112) z/n + z/n for z ≥ Bound on II.
We begin with deriving an upper bound on the supremum of an empirical processesindexed by an auxiliary function class. By Lemma 7 there exist M , M , W ⊂ B p (0 ,
1) such thatcard( M k ) ≤ (cid:18) eps k (cid:19) s k , ∀ u ∈ M k : (cid:107) u (cid:107) ≤ s k , ∀ τ ∈ T , k ∈ { , } : K k ( τ ) ⊂ ϑ k )conv( M k );card( W ) ≤ (cid:18) ep s θ (cid:19) s θ , ∀ w ∈ W : (cid:107) w (cid:107) ≤ s θ , ∀ τ, τ (cid:48) ∈ T : C p ( T θ ( τ ) ∪ T θ ( τ (cid:48) ) , ∩ B p (0 , ⊂ W ) . Let { g i } ni =1 be a sequence of i.i.d. standard normal random variables independent of { X i } ni =1 . Define F = (cid:8) f ( X, g ) = g ( X (cid:48) u )( X (cid:48) u ) X (cid:48) w : u k ∈ M k , w ∈ W , k ∈ { , } (cid:9) Next, we verify that F satisfies the Lipschitz-type condition eq. (36). For α ∈ (0 ,
1) the ψ α -norm is a quasi-norm since x (cid:55)→ exp( x α ) − x . However, by Assumption 1 and Lemma 29there exists an absolute constant C > u , v ∈ M , u , v ∈ M , and w , w ∈ W , (cid:107) gX (cid:48) u X (cid:48) v X (cid:48) w − εX (cid:48) u X (cid:48) v X (cid:48) w (cid:107) P,ψ / ≤ C sup u ∈M sup v ∈M sup w ∈W (cid:107) X (cid:48) u (cid:107) P,ψ (cid:107) X (cid:48) u (cid:107) P,ψ (cid:107) X (cid:48) w (cid:107) P,ψ (cid:0) (cid:107) u − u (cid:107) + (cid:107) v − v (cid:107) + (cid:107) w − w (cid:107) (cid:1) (cid:46) Cϕ / (2 s θ ) ϕ / ( s ) ϕ / ( s ) (cid:0) (cid:107) u − u (cid:107) + (cid:107) v − v (cid:107) + (cid:107) w − w (cid:107) (cid:1) . Whence, by Corollary 4, with probability at least 1 − δ , (cid:107) G n (cid:107) F (cid:46) ϕ / (2 s θ ) ϕ / ( s ) ϕ / ( s ) (cid:112) s θ log( ep/s θ ) + s log( ep/s ) + s log( ep/s )+ ϕ / (2 s θ ) ϕ / ( s ) ϕ / ( s ) n − / (cid:16) s θ log( ep/s θ ) + s log( ep/s ) + s log( ep/s ) (cid:17) + ϕ / (2 s θ ) ϕ / ( s ) ϕ / ( s ) (cid:16)(cid:112) log(1 /δ ) + n − / (cid:0) log(1 /δ ) (cid:1) (cid:17) . (114)We now turn the bound on (cid:107) G n (cid:107) F into a bound on II . 
For any increasing and convex function F wehave E F sup τ : | τ − s |≤ η sup s (cid:48) ∈T sup u k ∈ K k ( s (cid:48) ) k ∈{ , } (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) √ n n (cid:88) i =1 (cid:0) f i ( τ ) − f i ( s ) (cid:1) ( X (cid:48) i u )( X (cid:48) i u ) − E (cid:2)(cid:0) f i ( τ ) − f i ( s ) (cid:1) ( X (cid:48) i u )( X (cid:48) i u ) (cid:3)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ( a ) ≤ E F √ π sup τ : | τ − s |≤ η sup s (cid:48) ∈T sup u k ∈ K k ( s (cid:48) ) k ∈{ , } (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) √ n n (cid:88) i =1 g i (cid:0) f i ( τ ) − f i ( s ) (cid:1) ( X (cid:48) i u )( X (cid:48) i u ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ( b ) ≤ E F √ π ¯ f L f sup τ : | τ − s |≤ η sup s (cid:48) ∈T sup u k ∈ K k ( s (cid:48) ) k ∈{ , } (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) √ n n (cid:88) i =1 g i (cid:0) X (cid:48) i θ ( τ ) − X (cid:48) i θ ( s ) (cid:1) ( X (cid:48) i u )( X (cid:48) i u ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ( c ) ≤ E (cid:104) F (cid:16) √ π (2 + ϑ )(2 + ϑ ) ¯ f L f L θ η (cid:107) G n (cid:107) F (cid:17)(cid:105) (115) here ( a ) holds by Lemma 6.3 followed by Lemma 4.5 in Ledoux and Talagrand (1996), ( b ) holds byLemma 32 (note that the map z (cid:55)→ f Y | X ( z − X (cid:48) θ ( s ) | X ) − f Y | X ( X (cid:48) θ ( s ) | X ) is Lipschitz-continuous andvanishes at 0), and ( c ) holds by Assumption 4, Lemma 20, and construction of M , M , and W .By Lemma 31 (with α = 1 / ϕ ( t ) = t / + n − / t ), eq. (114) and (115), and the union bound over η ∈ T η , with probability at least 1 − δ , II (cid:46) (2 + ϑ )(2 + ϑ ) ¯ f L f L θ ϕ / (2 s θ ) ϕ / ( s ) ϕ / ( s ) × (cid:16)(cid:112) s θ log( ep/s θ ) + s log( ep/s ) + s log( ep/s )+ n − / (cid:16) s θ log( ep/s θ ) + s log( ep/s ) + s log( ep/s ) (cid:17) + (cid:112) log(1 /δ ) + n − / (cid:0) log(1 /δ ) (cid:1) (cid:17) × η. (116) Conclusion.
Since η ∈ (0 ,
1) is arbitrary, we can choose η (cid:16) / ( L f L θ n ). Combine the bounds ineq. (112), (113), and (116), adjust the constants, and conclude that with probability at least 1 − δ , (cid:107) G n (cid:107) G (cid:46) (2 + ϑ )(2 + ϑ ) ¯ f ϕ / ( s ) ϕ / ( s )(1 + ϕ / (2 s θ )) ψ n (cid:0) t s θ ,s ,s ,n,δ (cid:1) , where t s ,s ,s θ ,n,δ = s log( ep/s ) + s log( ep/s ) + s θ log( ep/s θ ) + log( nL f L θ /δ ) and ψ n ( z ) = √ z (cid:0) n − / √ z + n − / z / (cid:1) for z ≥ Proof of Lemma 22 . The proof strategy is identical to the one of Lemma 21. To simplify notation, wewrite K ( τ ) = C p ( J ( τ ) , ϑ ) ∩ B p (0 ,
1) and f_i(τ) = f_{Y|X}(X_i'θ(τ)|X_i) for 1 ≤ i ≤ n. Let η ∈ (0,
1) be arbitraryand T η be an η -net with cardinality card( T η ) ≤ /η . We have the following decomposition: (cid:107) G n (cid:107) G ≤ sup τ ∈T η sup τ (cid:48) ∈T sup v ∈ K ( τ (cid:48) ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) √ n n (cid:88) i =1 f i ( τ ) (cid:0) τ (cid:48) − { Y i ≤ X (cid:48) i θ ( τ (cid:48) ) } X (cid:48) i v (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + sup τ ∈T η sup τ (cid:48) : | τ (cid:48) − τ |≤ η sup τ (cid:48)(cid:48) ∈T sup v ∈ K ( τ (cid:48)(cid:48) ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) √ n n (cid:88) i =1 (cid:0) f i ( τ ) − f i ( τ (cid:48) ) (cid:1)(cid:0) τ (cid:48)(cid:48) − { Y i ≤ X (cid:48) i θ ( τ (cid:48)(cid:48) ) } X (cid:48) i v (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) = I + II . (117) Bound on I.
By Lemma 11 and the union bound over τ ∈ T_η, with probability at least 1 − δ,

I ≲ ϕ_max^{1/2}(s) f̄ √(t_{s,η,δ}) √(π_{n,1}(t_{s,η,δ})),   (118)

where t_{s,η,δ} = s log(ep/s) + log(1 + 1/η) + log(1/δ) and π_{n,1}(z) = √(z/n) + z/n for z ≥ 0.

Bound on II.
We begin with deriving an upper bound on the supremum of an empirical processesindexed by an auxiliary function class. By Lemma 7 there exist M , W ⊂ B p (0 ,
1) such thatcard( M ) ≤ (cid:18) eps (cid:19) s , ∀ u ∈ M : (cid:107) u (cid:107) ≤ s, ∀ τ ∈ T : K ( τ ) ⊂ ϑ )conv( M );card( W ) ≤ (cid:18) ep s θ (cid:19) s θ , ∀ w ∈ W : (cid:107) w (cid:107) ≤ s θ , ∀ τ, τ (cid:48) ∈ T : C p ( T θ ( τ ) ∪ T θ ( τ (cid:48) ) , ∩ B p (0 , ⊂ W ) . et { g i } ni =1 be a sequence of i.i.d. standard normal random variables independent of { ( X i , Y i ) } ni =1 . Define F = (cid:8) f ( X, Y, g ) = g ( X (cid:48) v )( X (cid:48) w ) (cid:0) τ − { Y ≤ X (cid:48) θ ( τ ) } (cid:1) : v ∈ M , w ∈ W , τ ∈ T (cid:9) . Note the following: First, F = { hj : h ∈ H , j ∈ J } , where H = (cid:8) h ( X, Y ) = (cid:0) τ − { F Y | X ( Y | X ) ≤ τ } (cid:1) : τ ∈ T (cid:9) and J = { j ( X, g ) = g ( X (cid:48) v )( X (cid:48) w ) : v ∈ M , w ∈ W} . The set H is the difference of two VC-subgraph classes with VC-indices at most 2, respectively (van der Vaart and Wellner, 1996, Lemma 2.6.15and Example 2.6.1). Thus, H is VC-subgraph class with VC-index at most 3 (van der Vaart and Wellner,1996, Lemma 2.6.18). The function class J is finite with card( J M ) = card( M ) × card( W ). By Assumption 1and Lemma 29, for v , v ∈ M , w , w ∈ W arbitrary, (cid:13)(cid:13) g ( X (cid:48) v ) ( X (cid:48) w ) − E [ g ( X (cid:48) v ) ( X (cid:48) w ) ] − (cid:0) g ( X (cid:48) v ) ( X (cid:48) w ) − E [ g ( X (cid:48) v ) ( X (cid:48) w ) ] (cid:1)(cid:13)(cid:13) P,ψ / ≤ (cid:13)(cid:13) g ( X (cid:48) v ) ( X (cid:48) w ) − g ( X (cid:48) v ) ( X (cid:48) w ) (cid:13)(cid:13) P,ψ / ≤ u ,u (cid:13)(cid:13) g (cid:107) ψ (cid:107) ( X (cid:48) u ) (cid:107) ψ (cid:107) ( X (cid:48) u ) (cid:13)(cid:13) P,ψ ( (cid:107) v − v (cid:107) + (cid:107) w − w (cid:107) ) (cid:46) ϕ max ( s ) ϕ max (2 s θ ) ( (cid:107) v − v (cid:107) + (cid:107) w − w (cid:107) ) , where the supremum in the third line is taken over all u , u such that (cid:107) u (cid:107) , (cid:107) u (cid:107) ≤ (cid:107) u (cid:107) ≤ s and (cid:107) u (cid:107) ≤ s θ . 
Therefore, for j v , j u ∈ J , (cid:107) ( j v − P j v ) − ( j u − P j u ) (cid:107) P,ψ (cid:46) ϕ max ( s ) ϕ max (2 s θ ) (cid:107) v − u (cid:107) . Thus, by Corollary 5, with probability at least 1 − δ , (cid:107) G n (cid:107) F (cid:46) ϕ / ( s ) ϕ / (2 s θ ) (cid:112) t s,s θ ,δ (cid:113) π n, / ( t s,s θ ,δ ) , (119)where t s,s θ ,δ = s log( ep/s ) + s θ log( ep/s θ ) + log(1 /δ ) and π n, / ( z ) = (cid:112) z/n + z /n for z ≥ II . For any increasing and convex function F we have, for τ ∈ T η arbitrary, E (cid:34) F (cid:32) sup τ (cid:48) : | τ (cid:48) − τ |≤ η sup τ (cid:48)(cid:48) ∈T sup v ∈ K ( τ (cid:48)(cid:48) ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) √ n n (cid:88) i =1 (cid:0) f i ( τ ) − f i ( τ (cid:48) ) (cid:1)(cid:0) τ (cid:48)(cid:48) − { Y i ≤ X (cid:48) i θ ( τ (cid:48)(cid:48) ) } X (cid:48) i v (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:33)(cid:35) ( a ) ≤ E (cid:34) F (cid:32) √ π sup τ (cid:48) : | τ (cid:48) − τ |≤ η sup τ (cid:48)(cid:48) ∈T sup v ∈ K ( τ (cid:48)(cid:48) ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) √ n n (cid:88) i =1 g i (cid:0) f i ( τ ) − f i ( τ (cid:48) ) (cid:1)(cid:0) τ (cid:48)(cid:48) − { Y i ≤ X (cid:48) i θ ( τ (cid:48)(cid:48) ) } X (cid:48) i v (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:33)(cid:35) ( b ) ≤ E (cid:34) F (cid:32) √ πL f sup τ (cid:48) : | τ (cid:48) − τ |≤ η sup τ (cid:48)(cid:48) ∈T sup v ∈ K ( τ (cid:48)(cid:48) ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) √ n n (cid:88) i =1 g i (cid:0) X (cid:48) i θ ( τ ) − X (cid:48) i θ ( τ (cid:48) ) (cid:1)(cid:0) τ (cid:48)(cid:48) − { Y i ≤ X (cid:48) i θ ( τ (cid:48)(cid:48) ) } X (cid:48) i v (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:33)(cid:35) ( c ) ≤ E (cid:104) F (cid:16) ϑ ) √ πL f L θ η (cid:107) G n (cid:107) F (cid:17)(cid:105) , (120)where ( a ) holds by Lemma 6.3 followed by Lemma 4.5 in Ledoux and Talagrand (1996), ( b ) holds byLemma 32 (note that the map z (cid:55)→ f Y | X ( z − X (cid:48) θ ( τ ) | X ) − f Y | X ( X 
(cid:48) θ ( τ ) | X ) is Lipschitz-continuous andvanishes at 0), and ( c ) holds by Assumption 4, Lemma 8, and construction of M and W .Set t s,s θ ,η,δ = s log(5 ep/s ) + 2 s θ log(5 ep/s θ ) + log(1 + 1 /η ) + log(1 /δ ). By Lemma 31, eq. (119) and (120), nd the union bound over τ ∈ T η we have, with probability at least 1 − δ , II (cid:46) L f L θ (2 + ϑ ) ϕ / ( s ) ϕ / (2 s θ ) (cid:112) t s,s θ ,η,δ (cid:113) π n, / ( t s,s θ ,η,δ ) η. (121) Conclusion.
Since η ∈ (0 ,
1) is arbitrary, we can choose η (cid:16) / ( L f L θ n ). Combine the bounds ineq. (117), (119), and (121), adjust the constants, and conclude that with probability at least 1 − δ , (cid:107) G n (cid:107) G (cid:46) (2 + ϑ ) ¯ f ϕ / ( s ) (cid:0) ϕ / (2 s θ ) (cid:1) ψ n (cid:0) t s,s θ ,n,δ (cid:1) , where t s,s θ ,n,δ = s log( ep/s ) + s θ log( ep/s θ ) + log( nL f L θ /δ ) and ψ n ( z ) = √ z (cid:0) n − / √ z + n − / z / (cid:1) for z ≥ Proof of Lemma 23 . The proof is an easy modification of the proof of Lemma 22. We only sketch thedetails:To simplify notation, we write K k ( τ ) = C p ( J k ( τ ) , ϑ k ) ∩ B p (0 ,
1) for k ∈ {1, 2} and f_i(τ) = f_{Y|X}(X_i'θ(τ)|X_i) for 1 ≤ i ≤ n. Let η ∈ (0,
1) be arbitrary and T η be an η -net with cardinality card( T η ) ≤ /η . We havethe following decomposition: (cid:107) G n (cid:107) G ≤ sup s ∈T η sup s (cid:48) ∈T sup u k ∈ K k ( s (cid:48) ) k ∈{ , } (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) √ n n (cid:88) i =1 f i ( s ) (cid:0) s (cid:48) − { Y i ≤ X (cid:48) i θ ( s (cid:48) ) } ( X (cid:48) i u )( X (cid:48) i u ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + sup s ∈ T η sup τ : | τ − s |≤ η sup s (cid:48) ∈T sup u k ∈ K k ( s (cid:48) ) k ∈{ , } (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) √ n n (cid:88) i =1 (cid:0) f i ( τ ) − f i ( s ) (cid:1)(cid:0) s (cid:48) − { Y i ≤ X (cid:48) i θ ( s (cid:48) ) } ( X (cid:48) i u )( X (cid:48) i u ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) = I + II . (122)A trivial modification of Lemma 11 and the union bound over τ ∈ T η yield, with probability at least1 − δ , I (cid:46) (2 + ϑ )(2 + ϑ ) ¯ f ϕ / ( s ) ϕ / ( s ) (cid:112) t s ,s ,η,δ (cid:113) π n, / ( t s ,s ,η,δ ) , where t s ,s ,η,δ = s log( ep/s ) + s log( ep/s ) + log(1 + 1 /η ) + log(1 /δ ) and π n, / ( z ) = (cid:112) z/n + z /n for z ≥ − δ , II (cid:46) (2 + ϑ )(2 + ϑ ) ϕ / ( s ) ϕ / ( s ) ϕ / (2 s θ ) (cid:112) t s ,s ,s θ ,δ (cid:113) π n, / ( t s ,s ,s θ ,δ ) , where t s ,s ,s θ ,δ = s log( ep/s ) + s log( ep/s ) + s θ log( ep/s θ ) + log(1 /δ ) and π n, / ( z ) = (cid:112) z/n + z /n for z ≥ z (cid:55)→ f Y | X ( z − X (cid:48) θ ( τ ) | X ) − f Y | X ( X (cid:48) θ ( τ ) | X ) has Lipschitz constant 2 L f ¯ f and vanishes at 0.Hence, adapting the arguments in eq. (120) and eq. (121), we conclude that, with probability at least 1 − δ , II (cid:46) L f L θ ¯ f (2 + ϑ )(2 + ϑ ) ϕ / ( s ) ϕ / ( s ) ϕ / (2 s θ ) (cid:112) t s ,s ,s θ ,η,δ (cid:113) π n, / ( t s ,s ,s θ ,η,δ ) η, where t s ,s ,s θ ,η,δ = s log( ep/s ) + s log( ep/s ) + s θ log( ep/s θ ) + log(1 /η ) + log(1 /δ ). ince η ∈ (0 ,
1) is arbitrary, we can choose η (cid:16) / ( L f L θ n ). Combine the bounds in on I and II toconclude that with probability at least 1 − δ , (cid:107) G n (cid:107) G (cid:46) (2 + ϑ )(2 + ϑ ) ¯ f ϕ / ( s ) ϕ / ( s ) ϕ / ( s θ ) (cid:0) ϕ / (2 s θ ) (cid:1) ψ n (cid:0) t s ,s ,s θ ,n,δ (cid:1) , where t s ,s ,s θ ,n,δ = s log( ep/s ) + s log( ep/s ) + s θ log( ep/s θ ) + log( nL f L θ /δ ) and ψ n ( z ) = √ z (cid:0) n − / √ z + n − / z (cid:1) for z ≥ Proof of Lemma 24 . The proof strategy is similar to the one of Lemma 21. However, instead of simplyapplying Corollary 4, we us a combination of Lemmas 27 and 28. This allows us to leverage the fact that (cid:107) θ − θ ( τ ) (cid:107) ≤ r for all τ ∈ T at the cost of additional (log n )-factors. Proof of Case (i).
To simplify notation, we write K ( τ ) = C p ( J ( τ ) , ϑ ) ∩ B p (0 ,
1) and f_i(τ) = f_{Y|X}(X_i'θ(τ)|X_i) for 1 ≤ i ≤ n. By Lemma 7 there exists M ⊂ B_p(0,
1) such that

card(M) ≤ (ep/s)^s,  ∀v ∈ M: ‖v‖_0 ≤ s,  ∀τ ∈ T: K(τ) ⊂ (2 + ϑ) conv(M).

For S ⊆ {1, . . . , p}, v ∈ M, and τ ∈ T we define the following function classes:

H_{S,v} = { h(X, Y) = ( 1{Y ≤ X'θ} − 1{Y ≤ X'θ(τ)} ) X'v : θ ∈ R^p, supp(θ) = S, ‖θ − θ(τ)‖ ≤ r, τ ∈ T },
G_{S,v,τ} = { g(X, Y) = f_{Y|X}(X'θ(τ)|X) h(X, Y) : h ∈ H_{S,v} },
G_{S,v} = ∪_{τ∈T} G_{S,v,τ}.

Each g ∈ G_{S,v,τ} and h ∈ H_{S,v} is uniquely determined by the triplet (v, θ, τ). We therefore also write g_{v,θ,τ} and h_{v,θ,τ} whenever we need to identify a function via its parameters. Let η ∈ (0,
1) be arbitrary and T η bean η -net with cardinality card( T η ) ≤ /η . We have the following decomposition: (cid:107) G n (cid:107) G S,v ≤ sup τ ∈T η (cid:107) G n (cid:107) G S,v,τ + sup τ ∈T η sup τ (cid:48) : | τ (cid:48) − τ |≤ η sup h ∈H S,v (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) √ n n (cid:88) i =1 (cid:0) f i ( τ (cid:48) ) − f i ( τ ) (cid:1) h ( X i , Y i ) − E (cid:2)(cid:0) f i ( τ (cid:48) ) − f i ( τ ) (cid:1) h ( X i , Y i ) (cid:3)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) = I + II . (123)In the following we first derive an upper bound on (cid:107) G n (cid:107) G S,v and then assemble a bound on (cid:107) G n (cid:107) G . Bound on I.
It is standard to verify that G_{S,v,τ} is a VC-subgraph class with VC-index at most a constant multiple of |S| + 1 (see proof of Lemma 12). Also, G_v(X) = f̄ |X'v| is an envelope of G_{S,v,τ}, ‖max_{1≤i≤n} G_v(X_i)‖_{ψ_2} ≲ (log n) f̄ ϕ_max^{1/2}(s), and for all g ∈ G_{S,v,τ},

P g² ≤ f̄ E[ (X'v)² |F_{Y|X}(X'θ(τ)|X) − F_{Y|X}(X'θ|X)| ] ≤ f̄² E[ (X'v)² |X'(θ(τ) − θ)| ] ≲ r f̄² ϕ_max(s) ϕ_max(|S| + s_θ) ≲ r ϕ_max(s) ( f̄ ϕ_max(|S| + s_θ) ).

Thus, by Lemma 28, there exists an absolute constant C₁ > 0, independent of (S, v, τ), such that

P{ ‖G_n‖_{G_{S,v,τ}} ≥ C₁ E‖G_n‖_{G_{S,v,τ}} + C₁ ϕ_max^{1/2}(s) ( f̄^{1/2} ϕ_max^{1/2}(|S| + s_θ) ) υ_{r,n,1}(t) } ≤ e^{−t},   (124)

and by Lemma 27, there exists an absolute constant C₂ > 0, independent of (S, v, τ), such that

E‖G_n‖_{G_{S,v,τ}} ≤ C₂ ϕ_max^{1/2}(s) ( f̄^{1/2} ϕ_max^{1/2}(|S| + s_θ) ) υ_{r,n,1}( |S| log(1/r) ),

where υ_{r,n,γ}(z) = √(r z) + √(log^γ n / n) z^γ for z ≥ 0 and γ >
0. Set t η,δ = log(1 + 1 /η ) + log(1 /δ ). Now, eq. (124)and the union bound over τ ∈ T η yield, with probability at least 1 − δ , I (cid:46) ϕ / ( s ) (cid:0) f / ϕ / ( | S | + s θ ) (cid:1)(cid:0) υ r ,n, (cid:0) | S | log(1 /r ) (cid:1) + υ r ,n, ( t η,δ ) (cid:1) . (125) Bound on II.
Let { g i } ni =1 be a sequence of i.i.d. standard normal random variables independent of { ( X i , Y i ) } ni =1 . By Lemma 7 there exist W ⊂ B p (0 ,
1) such thatcard( W ) ≤ (cid:18) eps θ (cid:19) s θ , ∀ w ∈ W : (cid:107) w (cid:107) ≤ s θ , B p (0 , ⊂ W ) . For S ⊆ { , . . . , p } , v ∈ M , and w ∈ W consider J S,v,w = (cid:8) j ( X, Y, g ) = gh ( X, Y ) X (cid:48) w : h ∈ H S,v (cid:9) . Again, we easily verify that J S,v,w is a VC-subgraph class with VC-index at most a constant multi-ple of | S | + 2 (see proof of Lemma 12). Also, J v,w ( X, g ) = | g || ( X (cid:48) v )( X (cid:48) w ) | is an envelope of J S,v,w , (cid:107) max ≤ i ≤ n J v,w ( X i , g i ) (cid:107) ψ / (cid:46) (log n ) / ϕ / ( s ) ϕ / (2 s θ ), and for all j ∈ J S,v,w , P j ≤ E (cid:2) g ( X (cid:48) v ) ( X (cid:48) w ) (cid:12)(cid:12) F Y | X ( X (cid:48) θ ( τ ) | X ) − F Y | X ( X (cid:48) θ | X ) (cid:12)(cid:12)(cid:3) ≤ f E (cid:2) g ( X (cid:48) v ) ( X (cid:48) w ) | X (cid:48) ( θ ( τ ) − θ ) | (cid:3) (cid:46) r ¯ f ϕ max ( s ) ϕ max (2 s θ ) ϕ max ( | S | + s θ ) (cid:46) r ϕ max ( s ) ϕ max (2 s θ ) (cid:0) f ϕ max ( | S | + s θ ) (cid:1) . Thus, by Lemma 28, there exists an absolute constant C > S, v, w ) such that P (cid:110) (cid:107) G n (cid:107) J S,v,w ≥ C E (cid:107) G n (cid:107) J S,v,w + C ϕ / ( s ) ϕ / (2 s θ ) (cid:0) f / ϕ / ( | S | + s θ ) (cid:1) υ r ,n, ( t ) (cid:111) ≤ e − t , (126)and by Lemma 27, there exists an absolute constant C > S, v, w ) such that E (cid:107) G n (cid:107) J S,v,w ≤ C ϕ / ( s ) ϕ / (2 s θ ) (cid:0) f / ϕ / ( | S | + s θ ) (cid:1) υ r ,n, (cid:0) | S | log(1 /r ) (cid:1) , where υ r ,n,γ ( z ) = √ r z + (cid:113) log γ nn z γ for z ≥ γ >
0. Set t s θ ,δ = 2 s θ log(5 ep/s θ ) + log(1 /δ ). (Note thatthe upper bound on E (cid:107) G n (cid:107) G S,v,τ is not tight, but it is a convenient choice since it matches with the otherterms in eq. (126).) Now, by eq. (126) and the union bound over w ∈ W we have, with probability at least1 − δ ,sup w ∈W (cid:107) G n (cid:107) J S,v,w (cid:46) ϕ / ( s ) ϕ / (2 s θ ) (cid:0) f / ϕ / ( | S | + s θ ) (cid:1)(cid:0) υ r ,n, (cid:0) | S | log(1 /r ) (cid:1) + υ r ,n, ( t s θ ,δ ) (cid:1) . (127) e now turn the bound on this process into a bound on II . For any increasing and convex function F wehave, for τ ∈ T η arbitrary, E (cid:34) F (cid:32) sup τ (cid:48) : | τ (cid:48) − τ |≤ η sup h ∈H S,v (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) √ n n (cid:88) i =1 (cid:0) f i ( τ (cid:48) ) − f i ( τ ) (cid:1) h ( X i , Y i ) − E (cid:2)(cid:0) f i ( τ (cid:48) ) − f i ( τ ) (cid:1) h ( X i , Y i ) (cid:3)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:33)(cid:35) ( a ) ≤ E (cid:34) F (cid:32) √ π sup τ (cid:48) : | τ (cid:48) − τ |≤ η sup h ∈H S,v (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) √ n n (cid:88) i =1 g i (cid:0) f i ( τ (cid:48) ) − f i ( τ ) (cid:1) h ( X i , Y i ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:33)(cid:35) ( b ) ≤ E (cid:34) F (cid:32) √ πL f sup τ (cid:48) : | τ (cid:48) − τ |≤ η sup h ∈H S,v (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) √ n n (cid:88) i =1 g i (cid:0) X (cid:48) i θ ( τ (cid:48) ) − X (cid:48) i θ ( τ ) (cid:1) h ( X i , Y i ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:33)(cid:35) ( c ) ≤ E (cid:20) F (cid:18) √ πL f L θ η sup w ∈W (cid:107) G n (cid:107) J S,v,w (cid:19)(cid:21) , (128)where ( a ) holds by Lemma 6.3 followed by Lemma 4.5 in Ledoux and Talagrand (1996), ( b ) holds byLemma 32 (note that the map z (cid:55)→ f Y | X ( z − X (cid:48) θ ( τ ) | X ) − f Y | X ( X (cid:48) θ ( τ ) | X ) is Lipschitz-continuous andvanishes at 0), and ( c ) holds by Assumption 4, Lemma 8, and construction of W .Set t s θ ,η,δ = 2 s θ log(5 ep/s 
θ ) + log(1 + 1 /η ) + log(1 /δ ). By Lemma 31, eq. (127) and (128), and the unionbound over τ ∈ T η we have, with probability at least 1 − δ , II (cid:46) L f L θ ϕ / ( s ) ϕ / (2 s θ ) (cid:0) f / ϕ / ( | S | + s θ ) (cid:1)(cid:0) υ r ,n, (cid:0) | S | log(1 /r ) (cid:1) + υ r ,n, ( t s θ ,η,δ (cid:1) η. (129) Conclusion.
Since η ∈ (0 ,
1) is arbitrary, we can choose η (cid:16) / ( L f L θ n / ). Combining eq. (123), (125),and (129) yields, with probability at least 1 − δ , (cid:107) G n (cid:107) G S,v (cid:46) ϕ / ( s ) (cid:0) ϕ / (2 s θ ) (cid:1)(cid:0) f / ϕ / ( | S | + s θ ) (cid:1)(cid:16) υ r ,n (cid:0) | S | log(1 /r ) (cid:1) + υ r ,n ( t s θ ,n,δ ) (cid:17) , (130)where t s θ ,n,δ = 2 s θ log(5 ep/s θ )+log( L f L θ n )+log(1 /δ ) and υ r ,n ( z ) = √ z (cid:0) √ r + n − / (log n ) √ z + n − (log n ) / z (cid:1) for z ≥ (cid:0) { S ⊆ { , . . . , p } : | S | ≤ k } (cid:1) ≤ (cid:80) ki =1 (cid:0) pi (cid:1) ≤ ( ep/k ) k . Set t s,k,s θ ,n,δ = s log(5 ep/s )+ k log( ep/k ) + 2 s θ log(5 ep/s θ ) + log( L f L θ n ) + log(1 /δ ). By eq. (130) and the union bound over v ∈ M and S ⊂ { , . . . p } with card( S ) ≤ k there exists an absolute constant C > P (cid:40) sup v ∈M sup ≤ k ≤ n sup card( S ) ≤ k (cid:107) G n (cid:107) G S,v Θ( s, k, s θ ) (cid:0) υ r ,n (cid:0) k log(1 /r ) (cid:1) + υ r ,n ( t s,k,s θ ,n,δ ) (cid:1) > C (cid:41) ≤ n (cid:88) k =1 (cid:16) epk (cid:17) k e − k log( ep/k ) − log n − log(1 /δ ) ≤ δ, (131)where Θ( s, k, s θ ) = ¯ f / ϕ / ( s ) (cid:0) ϕ / (2 s θ ) (cid:1)(cid:0) ϕ / ( k + s θ ) (cid:1) . For S ⊂ { , . . . , p } define G S = { g ∈ G S,v : v ∈ C p ( J, ϑ ) ∩ B p (0 , } . By Lemma 8 and construction of M , for all S ⊂ { , . . . , p } , (cid:107) G n (cid:107) G S ≤ ϑ ) sup v ∈M (cid:107) G n (cid:107) G S,v , nd G = (cid:91) S ⊂{ ,...p } , card( S ) ≤ n G S . Hence, by eq. (131), with probability at least 1 − δ , ∀ g v,θ,τ ∈ G : | G n ( g v,θ,τ ) | (cid:46) ϑ )Θ (cid:0) s, (cid:107) θ (cid:107) , s θ (cid:1)(cid:0) υ r ,n (cid:0) (cid:107) θ (cid:107) log(1 /r ) (cid:1) + υ r ,n ( t s, (cid:107) θ (cid:107) ,s θ ,n,δ ) (cid:1) . Proof of Case (ii).
Observe that s ↦ s log(ep/s) and s ↦ ϕ_max(s) are monotone increasing on [1, p]. Thus, the bound of case (i) for g_{v,θ,τ} ∈ G with ‖θ‖_0 = m also holds for all g_{v,θ',τ} ∈ G with ‖θ'‖_0 ≤ m. To conclude, adjust some constants.

Proof of Lemma 25. The proof is an easy modification of the proof of Lemma 24. We only sketch the details:
Proof of Case (i).
To simplify notation, we write K k ( τ ) = C p ( J k ( τ ) , ϑ k ) ∩ B p (0 ,
1) for k ∈ {1, 2} and f_i(τ) = f_{Y|X}(X_i'θ(τ)|X_i) for 1 ≤ i ≤ n. By Lemma 7 there exist M₁, M₂ ⊂ B_p(0,
1) such thatcard( M k ) ≤ (cid:18) eps k (cid:19) s k , ∀ u ∈ M k : (cid:107) u (cid:107) ≤ s k , ∀ τ ∈ T , k ∈ { , } : K k ( τ ) ⊂ ϑ k )conv( M k ) . For S ⊆ { , . . . , p } , u ∈ M , u ∈ M , and τ ∈ T we define the following function classes: H S,u ,u = (cid:8) h ( X, Y ) = (cid:0) (cid:8) Y ≤ X (cid:48) θ (cid:9) − (cid:8) Y ≤ X (cid:48) θ ( τ ) } (cid:1) ( X (cid:48) u )( X (cid:48) u ) ,θ ∈ R p , supp( θ ) = S, (cid:107) θ − θ ( τ ) (cid:107) ≤ r , τ ∈ T (cid:9) , G S,u ,u ,τ = (cid:8) g ( X, Y ) = f Y | X ( X (cid:48) θ ( τ ) | X ) h ( X, Y ) : h ∈ H S,u ,u }G S,u ,u = (cid:91) τ ∈T G S,u ,u ,τ . Let η ∈ (0 ,
1) be arbitrary and T η be an η -net with cardinality card( T η ) ≤ /η . We have the followingdecomposition: (cid:107) G n (cid:107) G S,u ,u ≤ sup τ ∈T η (cid:107) G n (cid:107) G S,u ,u ,τ + sup τ ∈T η sup τ (cid:48) : | τ (cid:48) − τ |≤ η sup h ∈H S,u ,u (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) √ n n (cid:88) i =1 (cid:0) f i ( τ (cid:48) ) − f i ( τ ) (cid:1) h ( X i , Y i ) − E (cid:2)(cid:0) f i ( τ (cid:48) ) − f i ( τ ) (cid:1) h ( X i , Y i ) (cid:3)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) = I + II . Bound on I.
It is standard to verify that G S,u ,u τ is a VC-subgraph class with VC-index at most aconstant multiple of | S | + 1 (see proof of Lemma 12). Also, G u ,u ( X ) = ¯ f | X (cid:48) u || X (cid:48) u | is an envelope of G S,u ,u τ , (cid:107) max ≤ i ≤ n G u ,u ( X i ) (cid:107) ψ (cid:46) (log n ) ¯ f ϕ / ( s ) ϕ / ( s ), and for all g ∈ G S,u ,u ,τ , P g (cid:46) r ϕ max ( s ) ϕ max ( s ) (cid:0) f ϕ max ( | S | + s θ ) (cid:1) . Thus, by Lemma 28, there exists an absolute constant C > S, u , u , τ ) such that P (cid:110) (cid:107) G n (cid:107) G S,u ,u ,τ ≥ C E (cid:107) G n (cid:107) G S,u ,u ,τ + C ϕ / ( s ) ϕ / ( s ) (cid:0) f / ϕ / ( | S | + s θ ) (cid:1) υ r ,n, ( t ) (cid:111) ≤ e − t , nd by Lemma 27, there exists an absolute constant C > S, u , u , τ ) such that E (cid:107) G n (cid:107) G S,u ,u ,τ ≤ C ϕ / ( s ) ϕ / ( s ) (cid:0) f / ϕ / ( | S | + s θ ) (cid:1) υ r ,n, (cid:0) | S | log(1 /r ) (cid:1) , where υ r ,n,γ ( z ) = √ r z + (cid:113) log γ nn z γ for z ≥ γ >
0. (Compared with Lemma 24 the dependence ofthe upper bound on | S | has not changed. This is because using Lemma 28 with the ψ -norm is sub-optimalin the proof of Lemma 24, whereas, in the present case, using Lemma 28 with the ψ -norm is optimal.)Set t η,δ = log(1 + 1 /η ) + log(1 /δ ). Above two inequalities and the union bound over τ ∈ T η yield, withprobability at least 1 − δ , I (cid:46) ϕ / ( s ) ϕ / ( s ) (cid:0) f / ϕ / ( | S | + s θ ) (cid:1)(cid:0) υ r ,n, (cid:0) | S | log(1 /r ) (cid:1) + υ r ,n, ( t η,δ ) (cid:1) . Bound on II.
Let { g i } ni =1 be a sequence of i.i.d. standard normal random variables independent of { ( X i , Y i ) } ni =1 . By Lemma 7 there exist W ⊂ B p (0 ,
1) such thatcard( W ) ≤ (cid:18) eps θ (cid:19) s θ , ∀ w ∈ W : (cid:107) w (cid:107) ≤ s θ , B p (0 , ⊂ W ) . For S ⊆ { , . . . , p } , u ∈ M , u ∈ M , and w ∈ W consider J S,u ,u ,w = (cid:8) j ( X, Y, g ) = gh ( X, Y ) X (cid:48) w : h ∈ H S,u ,u (cid:9) . Obviously, J S,u ,u ,w is a VC-subgraph class with VC-index at most a constant multiple of | S | +2 (see proof ofLemma 12). Also, J u ,u ,w ( X, g ) = | g || ( X (cid:48) u )( X (cid:48) u )( X (cid:48) w ) | is an envelope of J S,u ,u ,w , (cid:107) max ≤ i ≤ n J u ,u ,w ( X i , g i ) (cid:107) ψ / (cid:46) (log n ) ϕ / ( s ) ϕ / ( s ) ϕ / (2 s θ ), and for all j ∈ J S,u ,u ,w , P j (cid:46) r ϕ max ( s ) ϕ max ( s ) ϕ max (2 s θ ) (cid:0) f ϕ max ( | S | + s θ ) (cid:1) . Thus, by Lemma 28, there exists an absolute constant C > S, u , u , w ) such that P (cid:8) (cid:107) G n (cid:107) J S,u ,u ,w ≥ C E (cid:107) G n (cid:107) J S,u ,u ,w + C ϕ / ( s ) ϕ / ( s ) ϕ / (2 s θ ) (cid:0) f / ϕ / ( | S | + s θ ) (cid:1) υ r ,n, ( t ) (cid:111) ≤ e − t , and by Lemma 27, there exists an absolute constant C > S, u , u , w ) such that E (cid:107) G n (cid:107) J S,u ,u ,w ≤ C ϕ / ( s ) ϕ / ( s ) ϕ / (2 s θ ) (cid:0) f / ϕ / ( | S | + s θ ) (cid:1) υ r ,n, (cid:0) | S | log(1 /r ) (cid:1) , where υ r ,n,γ ( z ) = √ r z + (cid:113) log γ nn z γ for z ≥ γ >
0. Set t s θ ,δ = 2 s θ log(5 ep/s θ ) + log(1 /δ ). Above twoinequalities and the union bound over w ∈ W imply that, with probability at least 1 − δ ,sup w ∈W (cid:107) G n (cid:107) J S,u ,u ,w (cid:46) ϕ / ( s ) ϕ / ( s ) ϕ / (2 s θ ) (cid:0) f / ϕ / ( | S | + s θ ) (cid:1)(cid:0) υ r ,n, (cid:0) | S | log(1 /r ) (cid:1) + υ r ,n, ( t s θ ,δ ) (cid:1) . Note that z (cid:55)→ f Y | X ( z − X (cid:48) θ ( τ ) | X ) − f Y | X ( X (cid:48) θ ( τ ) | X ) has Lipschitz constant 2 L f ¯ f and vanishes at0. Set t s θ ,η,δ = 2 s θ log(5 ep/s θ ) + log(1 + 1 /η ) + log(1 /δ ). Then, adapting the arguments in eq. (128) and q. (129), we conclude that, with probability at least 1 − δ , II (cid:46) L f L θ ϕ / ( s ) ϕ / ( s ) ϕ / (2 s θ ) (cid:0) f / ϕ / ( | S | + s θ ) (cid:1)(cid:0) υ r ,n, (cid:0) | S | log(1 /r ) (cid:1) + υ r ,n, ( t s θ ,η,δ (cid:1) η. Conclusion.
Since η ∈ (0, 1) is arbitrary, we can choose η ≍ 1/(L_f L_θ n). Combining the bounds on I and II yields, with probability at least 1 − δ,

‖G_n‖_{G_{S,u_1,u_2}} ≲ ϕ_max^{1/2}(s_1) ϕ_max^{1/2}(s_2) ϕ_max^{1/2}(2s_θ) ( f̄^{1/2} ϕ_max^{1/2}(|S| + s_θ) ) ( υ_{r^2,n}( |S| log(1/r) ) + υ_{r^2,n}(t_{s_θ,n,δ}) ),

where t_{s_θ,n,δ} = 2s_θ log(5ep/s_θ) + log(L_f L_θ n) + log(1/δ) and υ_{r^2,n}(z) = √z ( √(r^2) + n^{−1/2}(log n)√z + n^{−1/2}(log n) z^{3/2} ) for z ≥ 1. Set

t_{s_1,s_2,k,s_θ,n,δ} = s_1 log(5ep/s_1) + s_2 log(5ep/s_2) + k log(ep/k) + 2s_θ log(5ep/s_θ) + log(L_f L_θ n) + log(1/δ).

Proceed as in the proof of Lemma 24 and conclude that, with probability at least 1 − δ,

∀ g_{u_1,u_2,θ,τ} ∈ G: |G_n(g_{u_1,u_2,θ,τ})| ≲ (2 + ϑ_1)(2 + ϑ_2) Θ( s_1, s_2, ‖θ‖_0, s_θ ) × ( υ_{r^2,n}( ‖θ‖_0 log(1/r) ) + υ_{r^2,n}(t_{s_1,s_2,‖θ‖_0,s_θ,n,δ}) ),

where Θ( s_1, s_2, ‖θ‖_0, s_θ ) = f̄^{1/2} ϕ_max^{1/2}(s_1) ϕ_max^{1/2}(s_2) ϕ_max^{1/2}(2s_θ) ϕ_max^{1/2}(k + s_θ).

Proof of Case (ii).
Observe that s ↦ s log(ep/s) and s ↦ ϕ_max(s) are monotone increasing on [1, p]. Thus, the bound of case (i) for g_{u_1,u_2,θ,τ} ∈ G with ‖θ‖_0 = m also holds for all g_{u_1,u_2,θ′,τ} ∈ G with ‖θ′‖_0 ≤ m. To conclude, adjust some constants.

G.8 Proofs of Section F.2
Proof of Corollary 5. We first derive a bound on the symmetrized process ‖G°_n‖_F. Then, using Lemma 31, we deduce a bound for the original process ‖G_n‖_F. In the following, we slightly abuse notation and write ‖‖g‖_{P,2}‖_G for sup_{g∈G} √(P g^2). For t, s > 0,

P{ ‖G°_n‖_F > t + √(st) }
≤ P{ (‖G°_n‖_F / ‖‖g‖_{P_n,2}‖_G) ‖‖g‖_{P,2}‖_G | ‖‖g‖_{P_n,2}‖_G / ‖‖g‖_{P,2}‖_G − 1 | + (‖G°_n‖_F / ‖‖g‖_{P_n,2}‖_G) ‖‖g‖_{P,2}‖_G > t + √(st) }
≤ P{ (‖‖g‖_{P,2}‖_G / ‖‖g‖_{P_n,2}‖_G) ‖G°_n‖_F > t } + P{ | ‖‖g‖_{P_n,2}‖_G / ‖‖g‖_{P,2}‖_G − 1 | > √s }
≤ P{ ‖G°_n‖_F / ‖‖g‖_{P_n,2}‖_G > t / ‖‖g‖_{P,2}‖_G } + P{ ‖‖g‖²_{P_n,2} − ‖g‖²_{P,2}‖_G > s ‖‖g‖_{P,2}‖²_G }
= I + II, (132)

where the second inequality holds since, for all a, b ∈ R and s, t > 0,

|ab| > st ⟹ |a| > s or |b| > t, and |a + b| > s + t ⟹ |a| > s or |b| > t,

and the third inequality holds since |√a − √b| ≤ √|a − b| for all a, b ≥ 0.

Bound on I.
Define the following (data-dependent) classes: F_g = { hg/‖g‖_{P_n,2} : h ∈ H }, g ∈ G. For all f ∈ F_g and any e_1, e_2 ∈ {−1, 1} we have |f e_1 − f e_2| ≤ 2|g|/‖g‖_{P_n,2}. Thus, conditionally on {X_i}_{i=1}^n, the map (ε_1, …, ε_n) ↦ ‖G°_n‖_{F_g} is a function of bounded differences with constant c = 4 n^{−1} ∑_{i=1}^n g^2(X_i)/‖g‖²_{P_n,2} = 4. Hence, for all t ≥ 0,

P_ε{ ‖G°_n‖_{F_g} ≥ E_ε ‖G°_n‖_{F_g} + √t } ≤ e^{−t/c} = e^{−t/4}. (133)

By construction of F_g, the envelope F_g = sup_{f∈F_g} |f| has L^2(P_n)-seminorm bounded by one and F_g is VC-subgraph with VC-index V(H). Thus, by Dudley's maximal inequality applied conditionally on {X_i}_{i=1}^n and Theorem 2.6.7 in van der Vaart and Wellner (1996),

E_ε ‖G°_n‖_{F_g} ≲ ‖F_g‖_{P_n,2} ∫_0^1 √( log N( ε‖F_g‖_{P_n,2}, F_g, L^2(P_n) ) ) dε ≲ √(V(H)). (134)

Combine eq. (133) and (134) with the union bound over g ∈ G to conclude that there exists an absolute constant C > 0 (independent of {X_i}_{i=1}^n, n, F_g, G) such that, for all t ≥ 0,

P_ε{ sup_{g∈G} ‖G°_n‖_{F_g} ≥ C √(V(H)) + √( t + log card(G) ) } ≤ e^{−t/4}.

Clearly, sup_{g∈G} ‖G°_n‖_{F_g} ≥ ‖G°_n‖_F / ‖‖g‖_{P_n,2}‖_G. Thus, in the above display, take the expectation with respect to {X_i}_{i=1}^n and conclude that, for all t ≥ 0,

P{ ‖G°_n‖_F / ‖‖g‖_{P_n,2}‖_G ≥ C ( √(V(H)) + √(log card(G)) + √t ) } ≤ e^{−t/4}. (135)

Bound on II.
Recall that G² = { g^2 : g ∈ G }. Hence, ‖G_n‖_{G²} = √n ‖‖g‖²_{P_n,2} − ‖g‖²_{P,2}‖_G. Therefore, by eq. (37) and Corollary 4, there exists an absolute constant C > 0 such that, for all s ≥ 0,

P{ ‖‖g‖²_{P_n,2} − ‖g‖²_{P,2}‖_G > C K δ π_{n,α}( log card(G) ) + C K δ π_{n,α}(s) } ≤ e^{−s}, (136)

where δ = sup_{g_1,g_2∈G} ρ(g_1, g_2) and π_{n,α}(z) = √(z/n) + z^{1/α}/n for z ≥ 0.

Conclusion.
Combine eq. (132), (135), and (136) and conclude that there exists an absolute constant C > 0 such that, for all s, t > 0,

P{ ‖G°_n‖_F > C ( √(V(H)) + √(log card(G)) ) ( σ + √( K δ π_{n,α}( log card(G) ) ) ) + C √t ( σ + √( K δ π_{n,α}(s) ) ) } ≤ e^{−t} + e^{−s}. (137)

Since, for any convex and increasing function F: R_+ → R_+, E[F(‖G_n‖_F)] ≤ E[F(2‖G°_n‖_F)], Lemma 31 and eq. (137) imply that there exist constants c, c′ ≥ 1 (depending on α only) such that, with probability at least 1 − c e^{−c′ t},

‖G_n‖_F ≲ ( √(V(H)) + √(log card(G)) ) ( σ + √( K δ π_{n,α}( log card(G) ) ) ) + √t ( σ + √( K δ π_{n,α}(t) ) ).

To conclude, adjust some absolute constants.
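The splitting of the tail event in eq. (132) rests on three elementary facts recorded there. As an illustration only (not part of the argument), the sketch below spot-checks them on random inputs:

```python
# Spot-check (illustrative, not part of the proof) of the elementary
# inequalities behind the decomposition in eq. (132):
#   (1) |sqrt(a) - sqrt(b)| <= sqrt(|a - b|) for a, b >= 0,
#   (2) |xy| > s*t  implies  |x| > s or |y| > t,
#   (3) |x + y| > s + t  implies  |x| > s or |y| > t.
import math
import random

random.seed(0)

for _ in range(10_000):
    a, b = random.uniform(0.0, 100.0), random.uniform(0.0, 100.0)
    assert abs(math.sqrt(a) - math.sqrt(b)) <= math.sqrt(abs(a - b)) + 1e-12  # (1)

    s, t = random.uniform(0.1, 10.0), random.uniform(0.1, 10.0)
    x, y = random.uniform(-10.0, 10.0), random.uniform(-10.0, 10.0)
    if abs(x * y) > s * t:                 # (2), contrapositive of |x|<=s, |y|<=t
        assert abs(x) > s or abs(y) > t
    if abs(x + y) > s + t:                 # (3)
        assert abs(x) > s or abs(y) > t

print("all sampled cases consistent")
```

The implications (2) and (3) hold deterministically (they are contrapositives of obvious monotonicity statements); the random search merely illustrates that no counterexample arises.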
Proof of Lemma 29. Recall Young's inequality: ∏_{i=1}^K x_i^{α_i} ≤ ∑_{i=1}^K α_i x_i for all x_i, α_i ≥ 0, i = 1, …, K, with ∑_{i=1}^K α_i = 1. Without loss of generality, we can assume that ‖X_i‖_{ψ_1} = 1 for all i = 1, …, K. Thus, the claim of the lemma follows if we can show the following: If E[exp(|X_i|)] ≤ 2 for all i = 1, …, K, then E[exp( ∏_{i=1}^K |X_i|^{1/K} )] ≤
2. This assertion follows from straightforward calculations:

E[ ψ_{1/K}( ∏_{i=1}^K X_i ) ] = E[ exp( ∏_{i=1}^K |X_i|^{1/K} ) ] ≤ E[ exp( (1/K) ∑_{i=1}^K |X_i| ) ] = E[ ∏_{i=1}^K exp( |X_i|/K ) ] ≤ (1/K) ∑_{i=1}^K E[ exp( |X_i| ) ] ≤ 2,

where in the first and second inequalities we have used Young's inequality.

Proof of Lemma 30. Let t ≥ 0 and A := {ω ∈ Ω : ξ(ω) > φ(t)}. By the premise, φ(t) P{A} ≤ ∫_A ξ dP ≤ φ( ϕ^{−1}(1/P{A}) ) P{A}. Thus, φ(t) ≤ φ( ϕ^{−1}(1/P{A}) ). Solving for P{A} yields the claim.

Proof of Lemma 31. We split the proof into two cases with similar yet distinct proofs.
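The Young's-inequality chain in the proof of Lemma 29 above can be spot-checked by simulation. In the sketch below (illustrative only), the choice X_i ~ Exponential(rate 2) is our assumption for the demonstration; it satisfies E[exp(X_i)] = 2/(2 − 1) = 2, i.e. exactly the moment condition in the displayed claim:

```python
# Monte Carlo illustration (not part of the proof) of the chain in Lemma 29's
# proof: for K nonnegative variables with E[exp(X_i)] <= 2, Young's inequality
# gives prod_i X_i^(1/K) <= (1/K) * sum_i X_i pointwise (AM-GM), and hence
# E[exp(prod_i X_i^(1/K))] <= 2.
import math
import random

random.seed(1)
K, n = 3, 100_000

acc = 0.0
for _ in range(n):
    xs = [random.expovariate(2.0) for _ in range(K)]  # E[exp(X_i)] = 2 exactly
    gm = math.prod(x ** (1.0 / K) for x in xs)        # geometric-mean term
    am = sum(xs) / K                                  # arithmetic-mean term
    assert gm <= am + 1e-12                           # Young / AM-GM, pointwise
    acc += math.exp(gm)

print(f"MC estimate of E[exp(prod |X_i|^(1/K))] = {acc / n:.3f} (bound: 2)")
```

The pointwise assertion never fails, and the Monte Carlo mean lands comfortably below the bound of 2, in line with the display above.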
Case α ∈ (0, 1). Denote by ϕ^{−1} the inverse of ϕ and note that ϕ^{−1} is convex and increasing. Therefore, F(z) = max{ϕ^{−1}(z) − t, 0} is also convex and increasing. Hence, by assumption, for all t ≥ 0,

∫_t^∞ P{ ϕ^{−1}(X) ≥ s } ds = E[F(X)] ≤ E[F(Y)] = ∫_t^∞ P{ ϕ^{−1}(Y) ≥ s } ds ≤ ∫_t^∞ c_1 e^{−c_2 s^α} ds = (c_1 / (α c_2^{1/α})) ∫_{c_2 t^α}^∞ u^{1/α − 1} e^{−u} du = (c_1 / (α c_2^{1/α})) Γ(1/α, c_2 t^α), (138)

where Γ(a, z) = ∫_z^∞ u^{a−1} e^{−u} du is the incomplete Gamma function.

By Theorem 1.1 and Proposition 2.10 in Pinelis (2020), for all z ≥ 0,

Γ(a, z) ≤ z^{a−1} e^{−z} + (a − 1) G_{a−1}(z) if 1 < a < 2, and Γ(a, z) ≤ G_a(z) if a ≥ 2, (139)

where G_a(z) = ((z + b_a)^a − z^a) / (a b_a) e^{−z} and b_a = Γ(a + 1)^{1/(a−1)}. Thus, by eq. (139) there exist constants c, c′, c″, c‴ > 0 depending only on a > 1 such that, for all z ≥ 0,

Γ(a, z) ≤ (c z^a + c′ z^{a−1} + c″) e^{−z} ≤ c‴ e^{−z/a}. (140)

Combine eq. (140) and (138) to conclude that there exists a constant c_3 > 0 (depending on α, c_1, c_2 only) such that, for all t ≥ 0 and 0 < u ≤ t,

P{ ϕ^{−1}(X) ≥ t } ≤ (1/u) ∫_{t−u}^t P{ ϕ^{−1}(X) ≥ s } ds ≤ (c_3 e^{α c_2 u^α} / (c_2^{1/α} u)) e^{−α c_2 t^α}.

Optimizing over u yields u* = (1/(α² c_2))^{1/α} and, hence, for all t ≥ (1/(α² c_2))^{1/α},

P{ ϕ^{−1}(X) ≥ t } ≤ c_4 e^{1/α} e^{−α c_2 t^α}.

Since c_4 e^{1/α} e^{−α c_2 t^α} ≥ 1 for t ≤ (1/(α² c_2))^{1/α}, this bound also holds for all 0 ≤ t < (1/(α² c_2))^{1/α}.

Case α ∈ [1, ∞). The proof strategy is the same as for case (i), but the calculations are simpler. Denote by ϕ^{−1} the inverse of ϕ and note that z ↦ (ϕ^{−1}(z))^α is convex and increasing. Therefore, F(z) = max{ (ϕ^{−1}(z))^α − t, 0 } is also convex and increasing. We have, for all t ≥ 0,

∫_t^∞ P{ (ϕ^{−1}(X))^α ≥ s } ds = E[F(X)] ≤ E[F(Y)] = ∫_t^∞ P{ (ϕ^{−1}(Y))^α ≥ s } ds ≤ (c_1/c_2) e^{−c_2 t}.
Thus, we have, for all t ≥ 0 and 0 < u ≤ t,

P{ (ϕ^{−1}(X))^α ≥ t } ≤ (1/u) ∫_{t−u}^t P{ (ϕ^{−1}(X))^α ≥ s } ds ≤ (c_1 e^{c_2 u} / (c_2 u)) e^{−c_2 t}.

Optimizing over u yields, for all t ≥ 1/c_2,

P{ (ϕ^{−1}(X))^α ≥ t } ≤ c_3 e^{−c_2 t}.

Again, since c_3 e^{−c_2 t} ≥ 1 for t ≤ 1/c_2, this bound also holds for all 0 ≤ t < 1/c_2.

Proof of Lemma 32. The proof is a straightforward modification of the classical proof of Corollary 3.17 in Ledoux and Talagrand (1996). Let f̃ ∈ F be arbitrary. We have the following:
E[ F( (1/2) sup_{f∈F} sup_{h∈H} | ∑_{i=1}^n g_i ϕ_i(f(X_i)) h(X_i) | ) ]
(a) ≤ (1/2) E[ F( sup_{f∈F} sup_{h∈H} | ∑_{i=1}^n g_i ( ϕ_i(f(X_i)) − ϕ_i(f̃(X_i)) ) h(X_i) | ) ] + (1/2) E[ F( sup_{h∈H} | ∑_{i=1}^n g_i ϕ_i(f̃(X_i)) h(X_i) | ) ]
(b) ≤ E[ F( sup_{f,f′∈F} sup_{h∈H} | ∑_{i=1}^n g_i ( ϕ_i(f(X_i)) − ϕ_i(f′(X_i)) ) h(X_i) | ) ],

where (a) holds by convexity of F and (b) holds since ϕ_i(0) = 0 and by symmetry of the g_i, 1 ≤ i ≤ n. The expression in the above display can be further upper bounded by

E[ F( sup_{f,f′∈F} sup_{h∈H} | ∑_{i=1}^n g_i ( f(X_i) − f′(X_i) ) h(X_i) | ) ], (141)

since, for all f, f′ ∈ F and h ∈ H,

∑_{i=1}^n ( ϕ_i(f(X_i)) − ϕ_i(f′(X_i)) )² h²(X_i) ≤ ∑_{i=1}^n ( f(X_i) − f′(X_i) )² h²(X_i),

and hence the Gaussian contraction theorem (e.g. Ledoux and Talagrand, 1996, Theorem 3.15) applies. To conclude the proof, upper bound eq. (141) by

E[ F( 2 sup_{f∈F} sup_{h∈H} | ∑_{i=1}^n g_i f(X_i) h(X_i) | ) ].

References

Abadie, A., Angrist, J., and Imbens, G. (2002).
Instrumental variables estimates of the effect of subsidized training on the quantiles of trainee earnings. Econometrica, 70(1):91–117.
Adamczak, R. (2008). A tail inequality for suprema of unbounded empirical processes with applications to Markov chains. Electronic Journal of Probability, 13:1000–1034.
Angrist, J., Chernozhukov, V., and Fernández-Val, I. (2006). Quantile regression under misspecification, with an application to the U.S. wage structure. Econometrica, 74(2):539–563.
Angrist, J. D. (2004). Treatment effect heterogeneity in theory and practice. The Economic Journal, 114(494):C52–C83.
Athey, S., Imbens, G. W., and Wager, S. (2018). Approximate residual balancing: debiased inference of average treatment effects in high dimensions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 80(4):597–623.
Atkins, S., Einarsson, G., Ames, B., and Clemmensen, L. (2017). Proximal methods for sparse optimal scoring and discriminant analysis. arXiv preprint arXiv:1705.07194.
Belloni, A. and Chernozhukov, V. (2011). ℓ1-penalized quantile regression in high-dimensional sparse models. The Annals of Statistics, 39(1):82–130.
Belloni, A. and Chernozhukov, V. (2013). Least squares after model selection in high-dimensional sparse models. Bernoulli, 19(2):521–547.
Belloni, A., Chernozhukov, V., Chetverikov, D., and Fernández-Val, I. (2019a). Conditional quantile processes based on series or many regressors. Journal of Econometrics, 213(1):4–29. Annals: In Honor of Roger Koenker.
Belloni, A., Chernozhukov, V., and Hansen, C. (2014). Inference on treatment effects after selection among high-dimensional controls. The Review of Economic Studies, 81(2):608–650.
Belloni, A., Chernozhukov, V., and Kato, K. (2019b). Valid post-selection inference in high-dimensional approximately sparse quantile regression models. Journal of the American Statistical Association, 114(526):749–758.
Belloni, A., Chernozhukov, V., and Kato, K. (2019c). Valid post-selection inference in high-dimensional approximately sparse quantile regression models. Journal of the American Statistical Association, 114(526):749–758.
Bickel, P. J., Ritov, Y., and Tsybakov, A. B. (2009). Simultaneous analysis of Lasso and Dantzig selector. The Annals of Statistics, 37(4):1705–1732.
Bowers, K., Liu, G., Wang, P., Ye, T., Tian, Z., Liu, E., Yu, Z., Yang, X., Klebanoff, M., and Yeung, E. (2011). Birth weight, postnatal weight change, and risk for high blood pressure among Chinese children. Pediatrics, 127(5):e1272–e1279.
Bradic, J. and Kolar, M. (2017). Uniform inference for high-dimensional quantile regression: linear functionals and regression rank scores. arXiv preprint arXiv:1702.06209.
Breiman, L., Friedman, J., Stone, C. J., and Olshen, R. A. (1984). Classification and Regression Trees. CRC Press.
Bross, D. S. and Shapiro, S. (1982). Direct and indirect associations of five factors with infant mortality. American Journal of Epidemiology, 115(1):78–91.
Candes, E. and Tao, T. (2007). The Dantzig selector: Statistical estimation when p is much larger than n. Annals of Statistics, 35(6):2313–2351.
Cattaneo, M. D. (2010). Efficient semiparametric estimation of multi-valued treatment effects under ignorability. Journal of Econometrics, 155(2):138–154.
Chao, S.-K., Volgushev, S., and Cheng, G. (2017). Quantile processes for semi and nonparametric regression. Electronic Journal of Statistics, 11(2):3272–3331.
Chernozhukov, V., Chetverikov, D., and Kato, K. (2014). Gaussian approximation of suprema of empirical processes. The Annals of Statistics, 42(4):1564–1597.
Chernozhukov, V. and Fernández-Val, I. (2005). Subsampling inference on quantile regression processes. Sankhyā: The Indian Journal of Statistics (2003–2007), 67(2):253–276.
Chernozhukov, V. and Hansen, C. (2005). An IV model of quantile treatment effects. Econometrica, 73(1):245–261.
Coppock, A., Leeper, T. J., and Mullinix, K. J. (2018). Generalizability of heterogeneous treatment effect estimates across samples. Proceedings of the National Academy of Sciences, 115(49):12441–12446.
Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456):1348–1360.
Firpo, S. (2007). Efficient semiparametric estimation of quantile treatment effects. Econometrica, 75(1):259–276.
Frölich, M. and Melly, B. (2013). Unconditional quantile treatment effects under endogeneity. Journal of Business & Economic Statistics, 31(3):346–357.
Fu, A., Narasimhan, B., and Boyd, S. (2017). CVXR: An R package for disciplined convex optimization. arXiv preprint arXiv:1711.07582.
Gage, T. B., Fang, F., O'Neill, E., and DiRienzo, G. (2013). Maternal education, birth weight, and infant mortality in the United States. Demography, 50(2):615–635.
Giessing, A. (2020). Inequalities for suprema of unbounded empirical processes. Technical Report.
Giessing, A. and Wang, J. (2020). Inference on heterogeneous quantile treatment effects via rank-score balancing. Supplementary Materials.
Giné, E. and Nickl, R. (2015). Mathematical Foundations of Infinite-Dimensional Statistical Models. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press.
Hainmueller, J. (2012). Entropy balancing for causal effects: A multivariate reweighting method to produce balanced samples in observational studies. Political Analysis, pages 25–46.
He, X. and Shi, P. (1994). Convergence rate of B-spline estimators of nonparametric conditional quantile functions. Journal of Nonparametric Statistics, 3(3–4):299–308.
He, X., Wang, L., and Hong, H. G. (2013). Quantile-adaptive model-free variable screening for high-dimensional heterogeneous data. Annals of Statistics, 41(1):342–369.
Imai, K. and Ratkovic, M. (2013). Estimating treatment effect heterogeneity in randomized program evaluation. The Annals of Applied Statistics, 7(1):443–470.
Imai, K. and Ratkovic, M. (2014). Covariate balancing propensity score. Journal of the Royal Statistical Society: Series B (Statistical Methodology).
Kern, H. L., Stuart, E. A., Hill, J., and Green, D. P. (2016). Assessing methods for generalizing experimental impact estimates to target populations. Journal of Research on Educational Effectiveness, 9(1):103–127.
Koenker, R. (2005a). Quantile Regression (Econometric Society Monographs; No. 38). Cambridge University Press.
Koenker, R. (2005b). Quantile Regression. Econometric Society Monographs. Cambridge University Press, Cambridge.
Koenker, R. and Zhao, Q. (1994). L-estimation for linear heteroscedastic models. Journal of Nonparametric Statistics, 3(3–4):223–235.
Koltchinskii, V. (2011). Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems: École d'Été de Probabilités de Saint-Flour XXXVIII-2008. Lecture Notes in Mathematics. Springer Verlag, New York.
Künzel, S. R., Sekhon, J. S., Bickel, P. J., and Yu, B. (2019). Metalearners for estimating heterogeneous treatment effects using machine learning. Proceedings of the National Academy of Sciences, 116(10):4156–4165.
Ledoux, M. and Talagrand, M. (1996). Probability in Banach Spaces: Isoperimetry and Processes. Springer-Verlag, Berlin.
Lipkovich, I., Dmitrienko, A., and D'Agostino Sr., R. B. (2017). Tutorial in biostatistics: data-driven subgroup identification and analysis in clinical trials. Statistics in Medicine, 36(1):136–196.
Ma, S. and Huang, J. (2017). A concave pairwise fusion approach to subgroup analysis. Journal of the American Statistical Association, 112(517):410–423.
Maurer, A. (2016). A vector-contraction inequality for Rademacher complexities. In Ortner, R., Simon, H. U., and Zilles, S., editors, Algorithmic Learning Theory, pages 3–17, Cham. Springer.
Mhanna, M., Iqbal, A., and Kaelber, D. (2015). Weight gain and hypertension at three years of age and older in extremely low birth weight infants. Journal of Neonatal-Perinatal Medicine, 8(4):363–369.
Newey, W. K. and Powell, J. L. (1990). Efficient estimation of linear and type I censored regression models under conditional quantile restrictions. Econometric Theory, pages 295–317.
Nie, X. and Wager, S. (2019). Quasi-oracle estimation of heterogeneous treatment effects. arXiv preprint arXiv:1712.04912.
Panchenko, D. (2003). Symmetrization approach to concentration inequalities for empirical processes. The Annals of Probability, 31(4):2068–2081.
Pinelis, I. (2020). Exact lower and upper bounds on the incomplete gamma function.
Rosenbaum, P. R. (1989). Optimal matching for observational studies. Journal of the American Statistical Association, 84(408):1024–1032.
Rosenbaum, P. R. (2020). Modern algorithms for matching in observational studies. Annual Review of Statistics and Its Application, 7:143–176.
Rosenbaum, P. R. and Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1):41–55.
Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66(5):688.
Rubin, D. B. (1979). Using multivariate matched sampling and regression adjustment to control bias in observational studies. Journal of the American Statistical Association, 74(366a):318–328.
Rubin, D. B. (2009). Should observational studies be designed to allow lack of balance in covariate distributions across treatment groups? Statistics in Medicine, 28(9):1420–1423.
Tan, Z. (2010). Bounded, efficient and doubly robust estimation with inverse weighting. Biometrika, 97(3):661–682.
van der Vaart, A. W. and Wellner, J. (1996). Weak Convergence and Empirical Processes: With Applications to Statistics. Springer Series in Statistics. Springer-Verlag, New York.
Wahlberg, B., Boyd, S., Annergren, M., and Wang, Y. (2012). An ADMM algorithm for a class of total variation regularized estimation problems. IFAC Proceedings Volumes, 45(16):83–88.
Wang, J., He, X., and Xu, G. (2020). Debiased inference on treatment effect in a high-dimensional model. Journal of the American Statistical Association, 115(529):442–454.
Wang, L., Zhou, Y., Song, R., and Sherwood, B. (2018). Quantile-optimal treatment regimes. Journal of the American Statistical Association, 113(523):1243–1254.
Wang, Y. and Zubizarreta, J. (2017). Minimal dispersion approximately balancing weights: Asymptotic properties and practical considerations. Biometrika, 103(1):1–29.
Zhao, Q. (2001). Asymptotically efficient median regression in the presence of heteroskedasticity of unknown form. Econometric Theory, 17(4):765–784.
Zhao, W., Zhang, F., and Lian, H. (2019). Debiasing and distributed estimation for high-dimensional quantile regression. IEEE Transactions on Neural Networks and Learning Systems.
Zhou, K. Q. and Portnoy, S. L. (1996). Direct use of regression quantiles to construct confidence sets in linear models. Annals of Statistics, 24(1):287–306.
Zubizarreta, J. R. (2015). Stable weights that balance covariates for estimation with incomplete outcome data.