Convex Regression in Multidimensions: Suboptimality of Least Squares Estimators
Gil Kur §, Fuchang Gao ∗, Adityanand Guntuboyina †, and Bodhisattva Sen ‡
32 Vassar St, Cambridge, MA 02139. e-mail: [email protected]
875 Perimeter Drive, MS 1103, Moscow, ID 83844. e-mail: [email protected]
423 Evans Hall, Berkeley, CA 94720. e-mail: [email protected], [email protected]
Abstract:
The least squares estimator (LSE) is shown to be suboptimal in squared error loss in the usual nonparametric regression model with Gaussian errors for d ≥ 5: its worst-case risk is of order n^{-2/d} (up to logarithmic factors) while the minimax risk is n^{-4/(d+4)}. In addition, the first rate of convergence results (worst case and adaptive) for the full convex LSE are established for polytopal domains for all d ≥ 1. Some new metric entropy results for convex functions are also proved which are of independent interest.
MSC 2010 subject classifications:
Primary 62G08.
Keywords and phrases:
Adaptive risk bounds, bounded convex regression, Dudley's entropy bound, Lipschitz convex regression, lower bounds on the risk of least squares estimators, metric entropy, nonparametric maximum likelihood estimation, Sudakov minoration.

∗ Supported by NSF Grant OCA-1940270. † Supported by NSF CAREER Grant DMS-1654589. ‡ Supported by NSF Grant DMS-1712822. § Supported by the Center for Minds, Brains and Machines, funded by NSF award CCF-12312161.
1. Introduction
We consider the following nonparametric regression problem:

Y_i = f(X_i) + ξ_i,  for i = 1, …, n,  (1)

where f : Ω → R is an unknown convex function on a full-dimensional compact convex domain Ω ⊆ R^d (d ≥ 1), X_1, …, X_n are design points that may be fixed in Ω or i.i.d. from the uniform distribution P on Ω, and ξ_1, …, ξ_n are i.i.d. unobserved errors having the N(0, σ²) distribution with σ > 0. Given the data {(Y_i, X_i)}_{i=1}^n, the goal is to recover f. This is the classical problem of convex regression, which has a long history in statistics and related fields. Standard references include Hildreth [32], Hanson and Pledger [31], Groeneboom et al. [23], Groeneboom and Jongbloed [22], Seijo and Sen [42], Kuosmanen [34], Lim and Glynn [38] and Balázs [3]. Applications of convex regression can be found in Varian [45], Varian [46], Allon et al. [2], Matzkin [40], Aït-Sahalia and Duarte [1], Keshavarz et al. [33] and Toriello et al. [43].

A natural way to estimate f in (1) is the method of least squares, i.e., minimize the sum of squared errors subject to the convexity constraint. Formally, the convex least squares estimator (LSE) is defined as

f̂_n ∈ argmin_{f ∈ C(Ω)} Σ_{i=1}^n (Y_i − f(X_i))²  (2)

where the minimization is over C(Ω), defined as the class of all real-valued convex functions on Ω. f̂_n coincides with the maximum likelihood estimator as we have assumed that the errors ξ_1, …, ξ_n are normally distributed. f̂_n is uniquely defined at the design points X_1, …, X_n and can be extended to other points in Ω in a piecewise affine fashion. The convex LSE does not involve any tuning parameters. Seijo and Sen [42] performed a detailed study of the characterization and computation of f̂_n (see also Kuosmanen [34] and Lim and Glynn [38]) and Mazumder et al. [41] (see also Chen and Mazumder [13]) demonstrated that it can be efficiently computed for fairly large values of the dimension d and the sample size n.

The theoretical properties of f̂_n are fairly well understood in the univariate case d = 1. Hanson and Pledger [31] proved (uniform) consistency on compact subintervals contained in the interior of Ω and Dümbgen et al. [17] strengthened these results by proving rates of convergence. Groeneboom et al. [23] proved pointwise rates of convergence and asymptotic distributions under smoothness assumptions and these results were extended by Chen and Wellner [14] and Ghosal and Sen [21]. Guntuboyina and Sen [27], Chatterjee et al. [12], Bellec [7] and Chatterjee [11] proved risk bounds for the convex LSE under the equally spaced fixed design setting. These results imply that the univariate convex LSE achieves the minimax rate n^{-4/5} for estimating general convex functions while also achieving faster parametric rates (up to logarithmic multiplicative factors) for estimating piecewise affine convex functions.

In the multivariate case d ≥
2, consistency of the convex LSE was proved by Seijo and Sen [42] (see also Lim and Glynn [38]). However, rates of convergence have not been proved previously and one of the main goals of the present paper is to fill this gap. Rates of convergence are available, however, for certain alternative estimators such as the Lipschitz convex LSE and the bounded convex LSE. The Lipschitz convex LSE is defined as the LSE over C_L(Ω):

f̂_n(C_L(Ω)) ∈ argmin_{f ∈ C_L(Ω)} Σ_{i=1}^n (Y_i − f(X_i))²  (3)

where C_L(Ω) is the class of all convex functions on Ω that are L-Lipschitz. The bounded convex LSE is defined as the LSE over C^B(Ω):

f̂_n(C^B(Ω)) ∈ argmin_{f ∈ C^B(Ω)} Σ_{i=1}^n (Y_i − f(X_i))²  (4)

where C^B(Ω) is the class of all convex functions on Ω that are uniformly bounded by B. Rates of convergence for the Lipschitz convex LSE can be found in Balázs et al. [4], Lim [37] and Mazumder et al. [41] while rates for the bounded convex LSE are in Han and Wellner [30]. It should be noted that these alternative estimators f̂_n(C_L(Ω)) and f̂_n(C^B(Ω)) crucially depend on tuning parameters (specifically L and B) while the convex LSE is tuning parameter free.

In Section 2, we provide the first rate of convergence results for the convex LSE for d ≥ 2. Let us describe these results at a high level here (we ignore logarithmic multiplicative factors in this Introduction; see the actual theorems for the full bounds). We assume that Ω is a polytope and that the design points X_1, …, X_n form a fixed regular rectangular grid intersected with Ω. As is common in fixed design analysis, we work with the loss function

ℓ²_{P_n}(f, g) := ∫ (f − g)² dP_n = (1/n) Σ_{i=1}^n (f(X_i) − g(X_i))²  (5)

where P_n is the empirical distribution of the design points X_1, …, X_n (note that P_n is non-random here as we are working in fixed design). The risk of f̂_n is defined as E_f ℓ²_{P_n}(f̂_n, f) and we prove bounds for this risk that hold for finite samples. Our first main result is Theorem 2.1 which proves that the risk of the convex LSE is bounded from above, up to logarithmic multiplicative factors, by

r_{n,d} := { n^{-4/(d+4)} : d ≤ 4;  n^{-2/d} : d ≥ 5 }.  (6)

We also prove adaptive risk bounds for the convex LSE which imply that the convex LSE converges at rates faster than r_{n,d} when f is a piecewise affine convex function. Specifically we prove in Theorem 2.3 that when f is a piecewise affine convex function with k affine pieces, the risk of f̂_n is bounded from above by

a_{k,n,d} := { k/n : d ≤ 4;  (k/n)^{4/d} : d ≥ 5 }  (7)

up to a logarithmic term which depends on the number of facets of each affine piece. It is interesting that the rate a_{k,n,d} switches from a parametric rate for d ≤ 4 to a nonparametric rate for d ≥ 5.

We also prove lower bounds which show that r_{n,d} and a_{k,n,d} are not merely loose upper bounds but accurately describe the behavior of f̂_n. These lower bound results are only interesting for d ≥ 5 because for d ≤ 4, r_{n,d} is already the minimax rate (see (14)) and a_{k,n,d} is the parametric rate. In Theorem 2.2, we prove the existence of a bounded, Lipschitz convex function f on Ω where the risk of the convex LSE is bounded from below by n^{-2/d} (note that logarithmic factors are being ignored in this Introduction).
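The boundary d = 4 in the displays above can be checked directly from the exponents. A small sketch (the function names are ours; only the exponents come from (6) and (14)) tabulates the squared-loss rate exponents and confirms that the LSE and minimax exponents agree for d ≤ 4 and separate exactly at d = 5:

```python
# Squared-loss rate exponents: risk ~ n^(-exponent), logarithmic factors ignored.

def lse_exponent(d):
    """Worst-case rate exponent r_{n,d} of the convex LSE."""
    return 4 / (d + 4) if d <= 4 else 2 / d

def minimax_exponent(d):
    """Minimax rate exponent over bounded Lipschitz convex functions."""
    return 4 / (d + 4)

for d in range(1, 9):
    suboptimal = minimax_exponent(d) > lse_exponent(d)
    print(d, lse_exponent(d), minimax_exponent(d), suboptimal)
# The two exponents agree for d <= 4 and separate exactly at d = 5,
# which is where the suboptimality results of this paper begin.
```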
This function where the LSE is shown to achieve the n^{-2/d} rate will be a piecewise affine convex function whose number of affine pieces is of the order √n (see Lemma 2.4). This proves that r_{n,d} correctly describes the worst-case risk behavior of the convex LSE. Moreover, in Theorem 2.5, we show, for every 1 ≤ k ≤ √n, the existence of a convex function that is piecewise affine with ∼ k affine pieces where the risk of the convex LSE is bounded from below by (k/n)^{4/d}. This shows that a_{k,n,d} correctly describes the adaptive behavior of the convex LSE for d ≥ 5 and 1 ≤ k ≤ √n. Note that a_{k,n,d} cannot be expected to be a tight bound for k ≫ √n as then (k/n)^{4/d} would dominate the worst-case risk bound of n^{-2/d}.

Our results imply the minimax suboptimality of the convex LSE for d ≥ 5. Indeed, the minimax risk for the class of all bounded, Lipschitz convex functions is of the order n^{-4/(d+4)}. From a comparison of r_{n,d} with the minimax risk n^{-4/(d+4)}, we can immediately conclude that the convex LSE is minimax suboptimal for d ≥ 5.

In Section 3, we deal with the random design setting where X_1, …, X_n are independently distributed according to the uniform distribution P on Ω and consider the loss function

ℓ²_P(f, g) := ∫_Ω (f − g)² dP.  (8)

For the bounded convex LSE, it was proved in Han and Wellner [30] that its risk is bounded from above by r_{n,d} (defined in (6)) when Ω is a polytope. In Theorem 3.1, we prove that there exists a bounded, Lipschitz convex function on Ω where the risk of the bounded convex LSE is bounded from below by n^{-2/d}. This implies that the bounded convex LSE is minimax suboptimal in random design when the domain Ω is a polytope. This contrasts intriguingly with the recent result of Kur et al. [35] who proved that the bounded convex LSE is minimax optimal when Ω is a smooth convex body (such as the unit ball). Some insight into this is given in Section 6.

For the Lipschitz convex LSE, it was proved, in Balázs et al. [4], Lim [37] and Mazumder et al. [41], that its risk is bounded again by r_{n,d} for all convex bodies Ω (regardless of whether Ω is polytopal or smooth). In Theorem 3.2, we prove the existence of a bounded, Lipschitz convex function on Ω where the risk of the Lipschitz convex LSE is bounded from below by n^{-2/d}. This implies that the Lipschitz convex LSE is minimax suboptimal when d ≥ 5 for every such convex body Ω, whether polytopal or smooth. In a recent paper Kur et al. [36] involving two of the authors of the present paper, it was shown that the LSE over the class of support functions of all convex bodies is suboptimal for d ≥ 6 while being optimal when d is small.

Our risk results required proving novel metric entropy bounds for convex functions. These results are given in Section 4. For our fixed design risk bounds, we prove, in Theorem 4.1, a metric entropy upper bound for the class {f ∈ C(Ω) : ℓ_{P_n}(f, f_0) ≤ t} for affine functions f_0 under the ℓ_{P_n} pseudometric. This result is different from existing metric entropy results in [9, 15, 20, 25, 26] for convex functions in that it deals with the discrete ℓ_{P_n} pseudometric while all existing results deal with continuous L_p metrics. Also the constraint ℓ_{P_n}(f, f_0) ≤ t on the convex functions in the above class is comparatively weak. For our random design results, we prove, in Theorem 4.4, bracketing L²(P) metric entropy bounds for

{ f ∈ C(Ω) : ℓ_P(f, f_0) ≤ t, sup_{x ∈ Ω} |f(x)| ≤ B }

for polytopal Ω, piecewise affine f_0 ∈ C(Ω) and t > 0. This result also improves existing results in certain aspects; see Section 4 for details.

Let us now quickly summarize the contents of the rest of the paper. The results for the convex LSE in fixed design are given in Section 2. Results for the bounded convex LSE and Lipschitz convex LSE in random design are in Section 3. Metric entropy results are in Section 4. The main technical ideas behind the proofs are briefly described in Section 5. Section 6 contains a discussion of issues related to our main results. Section 7 contains the proofs of the main results from Section 2. Section 8 contains the proofs of the results from Section 3. Section 9 contains the proofs of the metric entropy results from Section 4. Some additional proofs are relegated to Section 10.
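To make the discussion of computation concrete, here is a minimal sketch (ours, not code from any of the papers cited above) of the convex LSE (2) in the simplest case d = 1, where convexity of the fitted values on a sorted, equally spaced design reduces to nonnegative second differences, and the resulting quadratic program is handed to an off-the-shelf solver:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 8)                      # design points X_1, ..., X_n
y = x**2 + 0.05 * rng.standard_normal(x.size)     # model (1) with f(x) = x^2

def objective(theta):
    """Sum of squared errors at the candidate fitted values theta."""
    return float(np.sum((y - theta) ** 2))

# Discrete convexity: theta[i-1] - 2*theta[i] + theta[i+1] >= 0 (interior i).
constraints = [
    {"type": "ineq", "fun": (lambda t, i=i: t[i - 1] - 2.0 * t[i] + t[i + 1])}
    for i in range(1, x.size - 1)
]

# Start from the least squares affine fit, which is feasible
# (affine functions have identically zero second differences).
theta0 = np.polyval(np.polyfit(x, y, 1), x)
res = minimize(objective, theta0, method="SLSQP", constraints=constraints)
fhat = res.x    # fitted values of the convex LSE at the design points
```

In d ≥ 2 the analogous finite-dimensional characterization (studied by Seijo and Sen [42]) introduces subgradient variables g_i ∈ R^d and imposes θ_j ≥ θ_i + g_i·(X_j − X_i) for all pairs (i, j).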
2. Risk bounds for the convex LSE
In this section, we prove rates of convergence for the convex LSE f̂_n (defined as in (2)). These are the first rate of convergence results for the convex LSE for d ≥ 2. Throughout the paper, c_d, C_d, κ_d etc. denote constants that depend on the dimension d alone and their exact value might change from appearance to appearance. We sometimes refer to these constants as dimensional constants ("dimensional" here refers to the dependence on d).

Let us first describe our assumptions on Ω that we use throughout this section. We assume that the domain Ω is a polytope, which can be written in the form

Ω = { x ∈ R^d : a_i ≤ v_i^T x ≤ b_i for i = 1, …, F }  (9)

for some positive integer F, unit vectors v_1, …, v_F and real numbers a_1, …, a_F, b_1, …, b_F. We also assume that Ω is contained in the unit ball (this can be achieved by scaling). Our bounds will be nonasymptotic and hold even when Ω changes with n. We assume however that F is bounded from above by a constant depending on d alone. We also assume that the volume of Ω is bounded from below by a constant depending on d alone. We do not address the problem of finding rates of convergence of the convex LSE in the non-polytopal case where Ω is a smooth convex body. Rates of convergence in this case will be quite different from the rates derived in this section. See Section 6 for more details.

In this section, we work with the fixed design setting where X_1, …, X_n form a fixed regular rectangular grid in Ω and Y_1, …, Y_n are generated according to (1). Specifically, for δ > 0, let

S := { (k_1 δ, …, k_d δ) : k_i ∈ Z, 1 ≤ i ≤ d }  (10)

denote the regular d-dimensional δ-grid in R^d. We assume that X_1, …, X_n are an enumeration of the points in S ∩ Ω with n denoting the cardinality of S ∩ Ω. By the usual volumetric argument and the assumption that the volume of Ω is bounded from above and below by constants depending on d alone, there exists a small enough constant κ_d such that whenever δ ≤ κ_d, we have

2 ≤ c_d δ^{-d} ≤ n ≤ C_d δ^{-d}  (11)

for dimensional constants c_d and C_d. We shall assume throughout this section that δ ≤ κ_d so that the above inequality holds.

We study the performance of the LSE f̂_n under the loss function (5). The next couple of results prove upper bounds for the risk E_f ℓ²_{P_n}(f̂_n, f) of the convex LSE. Let C(Ω) denote the class of all real-valued convex functions on Ω and let A(Ω) denote the class of all affine functions on Ω. For each f ∈ C(Ω), let

L(f) := inf_{g ∈ A(Ω)} ℓ_{P_n}(f, g)

where, it may be recalled, ℓ_{P_n}(f, g) is our loss function defined via (5). The following result proves an upper bound involving L(f) on E_f ℓ²_{P_n}(f̂_n, f) for arbitrary f ∈ C(Ω). Note that the number F appearing in the bound (12) below is the number of parallel halfspaces or slabs defining Ω (see (9)) and this number is assumed to be bounded by a constant depending on d alone.

Theorem 2.1.
Fix f ∈ C(Ω) with L := L(f). There exist positive constants C_d and γ_d depending only on d such that

E_f ℓ²_{P_n}(f̂_n, f) ≤
  C_d max{ L^{2d/(4+d)} ( (σ²/n) (log n)^F )^{4/(d+4)}, (σ²/n) (log n)^F }  for d ≤ 3,
  C max{ (σL/√n) (log n)^{F/2}, (σ²/n) (log n)^F }  for d = 4,
  C_d max{ σL ( (log n)^F / n )^{2/d}, σ² ( (log n)^F / n )^{4/d} }  for d ≥ 5.  (12)

When L = L(f) and σ are fixed positive constants (not changing with n) and n is sufficiently large, the leading terms in the right hand side of (12) are the first terms inside the maxima. More precisely, from (12), it easily follows that for every L > 0 and σ > 0, there exist constants C_d (depending on d alone) and N_{d,σ/L} (depending only on d and σ/L) such that

sup_{f ∈ C(Ω): L(f) ≤ L} E_f ℓ²_{P_n}(f̂_n, f) ≤
  C_d L^{2d/(4+d)} ( (σ²/n) (log n)^F )^{4/(d+4)}  for d = 1, 2, 3,
  C (σL/√n) (log n)^{F/2}  for d = 4,
  C_d σL ( (log n)^F / n )^{2/d}  for d ≥ 5,  (13)

for n ≥ N_{d,σ/L}. The risk upper bound obtained above can be compared with the following minimax risk characterization (which can be proved by routine arguments; see e.g., [24, Proof of Theorem 3.2]). Let C_L^L(Ω) denote the class of all convex functions on Ω that are L-Lipschitz and uniformly bounded by L. Then there exist constants c_d, C_d and N_{d,σ/L} such that

c_d L^{2d/(4+d)} (σ²/n)^{4/(d+4)} ≤ inf_{f̃_n} sup_{f ∈ C_L^L(Ω)} E_f ℓ²_{P_n}(f̃_n, f) ≤ C_d L^{2d/(4+d)} (σ²/n)^{4/(d+4)}  (14)

for n ≥ N_{d,σ/L}. Letting C^L(Ω) be the class of all convex functions on Ω that are uniformly bounded by L, it is easy to see that

C_L^L(Ω) ⊆ C^L(Ω) ⊆ { f ∈ C(Ω) : L(f) ≤ L }

which implies that the minimax lower bound in (14) also holds for the larger classes C^L(Ω) and { f ∈ C(Ω) : L(f) ≤ L }.

A comparison of (13) and (14) implies that the convex LSE f̂_n is nearly minimax optimal (up to logarithmic multiplicative factors) over the class { f ∈ C(Ω) : L(f) ≤ L } (or over the smaller classes C^L(Ω) or C_L^L(Ω)) for d ≤ 4. However, the rate given by (13) is strictly suboptimal compared to the minimax rate for d ≥ 5. Indeed, we prove in Theorem 2.2 below that, for d ≥ 5, there exists a bounded Lipschitz convex function f for which E_f ℓ²_{P_n}(f̂_n, f) is bounded from below by n^{-2/d} up to a logarithmic multiplicative factor. Comparing this to (14) (and noting that n^{-2/d} ≫ n^{-4/(d+4)} for d ≥ 5), it follows that, for d ≥ 5, the convex LSE is minimax suboptimal over C_L^L(Ω) (or over the larger classes C^L(Ω) or { f ∈ C(Ω) : L(f) ≤ L }).

Theorem 2.2.
Fix d ≥ 5, L > 0 and σ > 0. There exist constants c_d and C_{d,σ/L} such that

sup_{f ∈ C_L^L(Ω)} E_f ℓ²_{P_n}(f̂_n, f) ≥ c_d σL n^{-2/d} (log n)^{-2(d+1)/d}  for n ≥ C_{d,σ/L}.  (15)

We shall now present risk bounds when f is a piecewise affine convex function. To motivate these results, let us first examine inequality (12) when f is an affine function. Here L = L(f) = 0 so we have

sup_{f ∈ A(Ω)} E_f ℓ²_{P_n}(f̂_n, f) ≤
  C_d (σ²/n) (log n)^F  for d = 1, 2, 3,
  C (σ²/n) (log n)^F  for d = 4,
  C_d σ² ( (log n)^F / n )^{4/d}  for d ≥ 5.  (16)

The bounds given in (16) are of a smaller order than those given by (13) which means that the LSE f̂_n adapts to affine functions by converging to them at faster rates compared to other convex functions in { f ∈ C(Ω) : L(f) ≤ L }. In the next result, we prove that a similar adaptation holds for a larger class of piecewise affine convex functions.

For k ≥ 1 and h ≥ 1, let C_{k,h}(Ω) denote all functions f ∈ C(Ω) for which there exist k convex subsets Ω_1, …, Ω_k satisfying the following properties:
1. f is affine on each Ω_i,
2. each Ω_i can be written as an intersection of at most h slabs (i.e., as in (9) with F = h), and
3. Ω_1 ∩ S, …, Ω_k ∩ S are disjoint with ∪_{i=1}^k (Ω_i ∩ S) = Ω ∩ S.

Theorem 2.3.
For every k ≥ 1 and h ≥ 1, we have

sup_{f ∈ C_{k,h}(Ω)} E_f ℓ²_{P_n}(f̂_n, f) ≤
  C_d σ² (k/n) (log n)^h  for d = 1, 2, 3,
  C_d σ² (k/n) (log n)^{h+2}  for d = 4,
  C_d σ² ( k (log n)^h / n )^{4/d}  for d ≥ 5,  (17)

for a constant C_d depending on d alone.

Remark 2.1. Note that Theorem 2.3 generalizes the bound (16) because A(Ω) = C_{1,F}(Ω). The logarithmic factors in (17) have powers involving h which means that they cannot be ignored when h grows with n. Thus Theorem 2.3 only gives something useful when h is either constant or grows very slowly with n.

If we ignore the logarithmic factors in (17), we can see that the risk bound in (17) switches from a parametric rate for d ≤ 4 to the nonparametric rate (k/n)^{4/d} for d ≥ 5. Also, for d ≥ 5, the bound in (17) is larger than that given by (13) when k > √n σ^{-d/2}, so Theorem 2.3 is only interesting (for d ≥ 5) for k in the range 1 ≤ k ≤ √n σ^{-d/2}. In the next result, we show that for every k satisfying 1 ≤ k ≤ √n σ^{-d/2}, there exists a piecewise affine convex function on Ω with no more than C_d k affine pieces where the risk of the LSE is bounded from below by (k/n)^{4/d} (up to logarithmic multiplicative factors). The function where the rate (k/n)^{4/d} is achieved can be taken to be a piecewise affine approximation to a smooth convex function such as the quadratic function. Its existence is guaranteed by the following lemma.

Lemma 2.4.
Let f(x) := ‖x‖². There exists a positive constant C_d (depending on the dimension alone) such that the following is true. For every k ≥ 1, there exist m ≤ C_d k d-simplices ∆_1, …, ∆_m and a convex function f̃_k such that
1. Ω = ∪_{i=1}^m ∆_i,
2. ∆_i ∩ ∆_j is contained in a facet of ∆_i and a facet of ∆_j for each i ≠ j,
3. f̃_k is affine on each ∆_i, i = 1, …, m,
4. sup_{x ∈ Ω} |f(x) − f̃_k(x)| ≤ C_d k^{-2/d},
5. f̃_k ∈ C_{C_d}^{C_d}(Ω).

The next result shows that the LSE achieves the rate (k/n)^{4/d} for the functions f̃_k given by the above lemma.

Theorem 2.5. Fix d ≥ 5. There exist positive constants c_d and N_d such that for n ≥ N_d and

1 ≤ k ≤ min( √n σ^{-d/2}, c_d n ),  (18)

we have

E_{f̃_k} ℓ²_{P_n}(f̂_n, f̃_k) ≥ c_d σ² (k/n)^{4/d} (log n)^{-2(d+1)/d}  (19)

where f̃_k is the function from Lemma 2.4.

The above result immediately implies that the adaptive risk bounds in (17) cannot be improved for all k satisfying (18) (note that min(√n σ^{-d/2}, c_d n) will equal √n σ^{-d/2} unless σ is of smaller order than n^{-1/d}). This implies, in particular, that the LSE cannot adapt at near parametric rates for affine functions for d ≥ 5. Also note that the lower bound for k = √n σ^{-d/2} is of the same order as that given by Theorem 2.2. In other words, the adaptive lower bound in Theorem 2.5 implies minimax suboptimality of the convex LSE.
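The k^{-2/d} approximation rate of Lemma 2.4 can be sanity-checked numerically in the simplest case d = 1, where the piecewise affine interpolant of f(x) = x² with k equal pieces on [0, 1] has sup error exactly 1/(4k²) (this toy computation is ours and is only meant as a consistency check):

```python
import numpy as np

def interp_error(k, grid=10001):
    """Sup-norm error of the k-piece affine interpolant of x^2 on [0, 1]."""
    knots = np.linspace(0.0, 1.0, k + 1)
    x = np.linspace(0.0, 1.0, grid)
    fk = np.interp(x, knots, knots**2)   # convex and affine on each piece
    return float(np.max(fk - x**2))

# The chord over an interval of length h exceeds x^2 by at most h^2/4
# (attained at the midpoint), so the error is 1/(4 k^2):
print(interp_error(4), 1.0 / (4 * 4**2))   # both approximately 0.015625
```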
3. Suboptimality of constrained convex LSEs in two settings
The highlight of the results of Section 2 is the minimax suboptimality of the convex LSE in the fixed gridded design setting when d ≥ 5. In this section, we show that this suboptimality also extends to the bounded convex LSE (when Ω is a polytope) and the Lipschitz convex LSE (for general Ω). We consider here the random design setting where the observed data are (X_1, Y_1), …, (X_n, Y_n) with X_1, …, X_n being independent having the uniform distribution P on Ω and Y_1, …, Y_n being generated according to (1) for independent N(0, σ²) errors ξ_1, …, ξ_n. We also assume that ξ_1, …, ξ_n, X_1, …, X_n are independent random variables and work with the ℓ²_P loss function (8).

Let us first state our result for the bounded convex LSE (defined as in (4)). This result assumes that the domain Ω is a polytope. The risk E_f ℓ²_P(f̂_n(C^B(Ω)), f) of the bounded convex LSE was studied in Han and Wellner [30] who proved matching upper and lower bounds (up to logarithmic factors in n) for d ≤ 4. For d ≥ 5, it was proved in Han and Wellner [30, Theorem 3.6] that sup_{f ∈ C^B(Ω)} E_f ℓ²_P(f̂_n(C^B(Ω)), f) is bounded from above by n^{-2/d} (ignoring multiplicative factors that are logarithmic in n and that depend on B and σ). On the other hand, the minimax rate over C^B(Ω) under the ℓ²_P loss function equals n^{-4/(d+4)} for all d ≥ 1, leaving a gap between the upper bound n^{-2/d} and the minimax risk n^{-4/(d+4)} for d ≥ 5. The next result shows that, for d ≥ 5, there exist functions f in C^B(Ω) where the risk of f̂_n(C^B(Ω)) is bounded from below by n^{-2/d} (up to logarithmic multiplicative factors) thereby proving that the bounded convex LSE is minimax suboptimal. The fact that Ω is polytopal is crucial here, for the LSE becomes optimal when Ω is the unit ball as recently shown in Kur et al. [35]. We provide an explanation of this in Section 6.

Theorem 3.1.
Let Ω be a polytope whose number of facets is bounded by a constant depending on d alone. Assume also that Ω is contained in the unit ball and contains a ball of constant (depending on d alone) radius. Fix d ≥ 5. There exist constants c_d and N_{d,σ/B} such that for every B > 0 and σ > 0, we have

sup_{f ∈ C^B(Ω)} E_f ℓ²_P(f̂_n(C^B(Ω)), f) ≥ c_d σB n^{-2/d} (log n)^{-2(d+1)/d}  whenever n ≥ N_{d,σ/B}.  (20)

The next result is for the Lipschitz convex LSE (defined as in (3)). It shows that the same lower bound n^{-2/d} (up to logarithmic factors) holds for the Lipschitz convex LSE for essentially every convex domain Ω (regardless of whether Ω is polytopal or smooth).

Theorem 3.2. Suppose Ω is a convex body that is contained in the unit ball and contains a ball centered at zero of constant (depending on d alone) radius. Fix d ≥ 5. There exist positive constants c_d and N_{d,σ/L} such that for every L > 0 and σ > 0, we have

sup_{f ∈ C_L^L(Ω)} E_f ℓ²_P(f̂_n(C_L(Ω)), f) ≥ c_d σL n^{-2/d} (log n)^{-2(d+1)/d}  whenever n ≥ N_{d,σ/L}.  (21)
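As a concrete illustration of the random design setting of this section (the specific domain and pair of convex functions below are our choices, purely illustrative), the loss (8) can be estimated by Monte Carlo with Ω a triangle and P uniform on Ω:

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_uniform_triangle(m):
    """Uniform draws from Omega = {x1, x2 >= 0, x1 + x2 <= 1} by
    rejection sampling from the unit square."""
    pts = []
    while len(pts) < m:
        u = rng.random(2)
        if u.sum() <= 1.0:
            pts.append(u)
    return np.array(pts)

def loss_lP2(f, g, n_mc=20000):
    """Monte Carlo estimate of the squared loss (8): integral of (f-g)^2 dP."""
    X = sample_uniform_triangle(n_mc)
    diff = f(X) - g(X)
    return float(np.mean(diff**2))

f = lambda X: np.sum(X**2, axis=1) + 1.0   # convex on Omega
g = lambda X: np.sum(X**2, axis=1)         # f - g is identically 1 on Omega
print(loss_lP2(f, g))                      # exactly 1.0, since f - g = 1
```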
4. Metric entropy results
Our risk results from the previous two sections are based on new metric entropy results for convex functions. Specifically, the risk bounds for the convex LSE in Section 2 are proved via a metric entropy bound for convex functions satisfying a discrete L² constraint and the risk lower bounds in Section 3 are proved via a bracketing entropy bound for bounded convex functions with an additional L² constraint. The goal of this section is to describe these entropy results. We would like to start however with a brief description of existing entropy results for convex functions.

Bronšteĭn [9] proved that the metric entropy of bounded Lipschitz convex functions defined on a fixed convex body Ω in R^d is of the order ε^{-d/2} under the supremum (L_∞) metric. The Lipschitz constraint can be removed if one is only interested in L_p metrics for p < ∞. Indeed, it was shown (by Gao [18] and Dryanov [16] for d = 1 and by Guntuboyina and Sen [26] for d ≥ 2) that the metric entropy of bounded convex functions on Ω = [0, 1]^d is of the order ε^{-d/2} under the L_p metric for every 1 ≤ p < ∞. The boundedness constraint can further be relaxed to an L_q-norm constraint (1 ≤ q ≤ ∞) in which case the aforementioned result holds in the L_p metric for 1 ≤ p < q (see Guntuboyina [25]). The case of more general convex bodies Ω was considered by Gao and Wellner [20] who proved that the same bounds hold when Ω is an arbitrary polytope. Gao and Wellner [20] also studied the case where Ω is not a polytope. For example, when Ω is the unit ball, they showed that the metric entropy of bounded convex functions on Ω is of the order ε^{-(d-1)} (which is larger than ε^{-d/2}) in the L_1 metric when d ≥ 3.

The ε-covering number of a set S under a pseudometric d will be denoted by N(ε, S, d). Also, the ε-bracketing number of a set S of functions under a pseudometric d will be denoted by N_{[ ]}(ε, S, d).

Our first main metric entropy result is the following. We use notation that is similar to that in Section 2. Recall that S is the regular d-dimensional δ-grid defined in (10). The resolution δ of the grid will appear in the bounds below (note that, by (11), δ will be of order n^{-1/d}). Let Ω be a convex body such that Ω ∩ S ≠ ∅. For 1 ≤ p < ∞ and a function f on Ω, we define the quasi-norm

ℓ_S(f, Ω, p) = ( (1/card(Ω ∩ S)) Σ_{s ∈ Ω ∩ S} |f(s)|^p )^{1/p},

where card(Ω ∩ S) denotes the cardinality of Ω ∩ S. Furthermore, for any fixed function f_0 on Ω and any t > 0, denote

B_S^p(f_0; t; Ω) = { f : Ω → R | f is convex on Ω, ℓ_S(f − f_0, Ω, p) ≤ t }.

We are interested in the metric entropy of B_S^p(f_0; t; Ω) under the ℓ_S(·, Ω, p) quasi-metric. The following result deals with the case when f_0 ∈ A(Ω) (recall that A(Ω) denotes the class of all affine functions on Ω). We state and prove this for every 1 ≤ p < ∞ (as the result could be of independent interest) even though we only require the case p = 2 for proving the risk bounds of Section 2.

Theorem 4.1. Let Ω be a d-dimensional convex polytope that equals the intersection of at most F pairs of halfspaces with distance no larger than 1 (i.e., as in (9) with max_{1 ≤ i ≤ F}(b_i − a_i) ≤ 1). There exists a constant c_{d,p} depending only on d and p such that for every f_0 ∈ A(Ω), ε > 0 and t > 0, we have

log N(ε, B_S^p(f_0; t; Ω), ℓ_S(·, Ω, p)) ≤ [c_{d,p} log(1/δ)]^F (t/ε)^{d/2}.  (22)

Theorem 4.1 differs from existing entropy results for convex functions in the following ways. First, it deals with the discrete ℓ_S(·, Ω, p) metric while all previous results have studied the continuous L_p metrics. Second, the constraint on functions in B_S^p(f_0; t; Ω) is

(1/card(Ω ∩ S)) Σ_{s ∈ Ω ∩ S} |f(s) − f_0(s)|^p ≤ t^p

which is much weaker than imposing uniform boundedness on the class. When δ ↓ 0, one might view the above constraint as a constraint on the continuous L_p norm of f, but it must be noted that the L_p metric entropy under such an L_p constraint equals ∞ (more generally, L_p metric entropy under an L_q norm constraint is finite if and only if p < q; see [20, 25]). The bound (22) also approaches infinity as δ ↓ 0, but only logarithmically in 1/δ, and this only leads to additional logarithmic terms in our risk bounds.

Theorem 4.1 implies, via the triangle inequality, bounds for log N(ε, B_S^p(f_0; t; Ω), ℓ_S(·, Ω, p)) for arbitrary, not necessarily affine, f_0. Indeed, the triangle inequality gives

B_S^p(f_0; t; Ω) ⊆ B_S^p(f_1; t + ℓ_S(f_0 − f_1, Ω, p); Ω)

for every f_0, f_1. Applying this for affine functions f_1, we obtain from Theorem 4.1 that

log N(ε, B_S^p(f_0; t; Ω), ℓ_S(·, Ω, p)) ≤ [c_{d,p} log(1/δ)]^F ( (t + inf_{f_1 ∈ A(Ω)} ℓ_S(f_0 − f_1, Ω, p)) / ε )^{d/2}.  (23)

While the above inequality is useful (we use it in the proof of Theorem 2.1), it is loose in the case when f_0 is piecewise affine and t is small. For piecewise affine f_0, we use instead the following two corollaries of Theorem 4.1. Corollary 4.2 will be used to prove Theorem 2.3 while Corollary 4.3 will be used in the proof of Theorem 2.5.

Corollary 4.2.
Suppose f_0 is a piecewise affine convex function on Ω. Suppose Ω_1, …, Ω_k are convex subsets of Ω such that
1. f_0 is affine on each Ω_i,
2. each Ω_i equals an intersection of at most s pairs of parallel halfspaces, and
3. Ω_1 ∩ S, …, Ω_k ∩ S are disjoint with ∪_{i=1}^k (Ω_i ∩ S) = Ω ∩ S.
Then

log N(ε, B_S^p(f_0; t; Ω), ℓ_S(·, Ω, p)) ≤ k (t/ε)^{d/2} ( c_{d,p} log(1/δ) )^s

for a constant c_{d,p} that depends on d and p alone.

The next result can be seen as a consequence of the above corollary when each polytope Ω_i is a d-simplex.

Corollary 4.3.
Suppose f is a piecewise affine convex function on Ω . Suppose that Ω can be written as the union of k d -simplices ∆ , . . . , ∆ k such that f is affine on each ∆ i and such that ∆ i ∩ ∆ j is contained in a facet of ∆ i and a facet of ∆ j for each i = j .Then log N ( ǫ, B p S ( f ; t ; Ω) , ℓ S ( · , Ω , p )) ≤ C d,p k (cid:18) tǫ (cid:19) d/ (cid:18) log 1 δ (cid:19) d +1 for a constant C d,p that depends only on d and p . We next state our main bracketing entropy result. This is crucial for our risk lowerbounds in Section 3. Recall that C Γ (Ω) denotes the class of all convex functions on Ωthat are uniformly bounded by Γ. We state the next result for every 1 ≤ p < ∞ forcompleteness although we only use it for p = 2. Theorem 4.4.
Let Ω be a convex body in R^d with volume bounded by 1. Let f_0 be a convex function on Ω that is bounded by Γ. For a fixed 1 ≤ p < ∞ and t > 0, let

B^Γ_p(f_0; t; Ω) := { f ∈ C^Γ(Ω) : ∫_Ω |f(x) − f_0(x)|^p dx ≤ t^p }.   (24)

Suppose Δ_1, ..., Δ_k ⊆ Ω are d-simplices with disjoint interiors such that f_0 is affine on each Δ_i. Then for every 0 < ε < Γ and t > 0, we have

log N_[](ε, B^Γ_p(f_0; t; Ω), ‖·‖_{p, ∪_{i=1}^k Δ_i}) ≤ C_{d,p} k (log(Γ/ε))^{d+1} (t/ε)^{d/2}   (25)

for a constant C_{d,p} that depends on p and d alone. The left hand side above denotes bracketing entropy with respect to the L_p metric on the set Δ_1 ∪ · · · ∪ Δ_k.

To see how Theorem 4.4 compares to existing bracketing entropy results, consider the special case when Ω is a d-simplex and when f_0 ≡ 0. In that case, the conclusion of Theorem 4.4 (for k = 1 and Δ_1 = Ω) becomes:

log N_[](ε, { f ∈ C^Γ(Ω) : ∫_Ω |f(x)|^p dx ≤ t^p }, ‖·‖_p) ≤ C_{d,p} (log(Γ/ε))^{d+1} (t/ε)^{d/2}.   (26)

The class of convex functions above has both an L_∞ constraint (uniform boundedness) as well as an L_p constraint, and (26) does not hold if either of the two constraints is dropped. Indeed, the entropy becomes infinite if the L_∞ constraint is dropped. On the other hand, if the L_p constraint is dropped, then the bracketing entropy is of the order (Γ/ε)^{d/2}, as proved by Gao and Wellner [20] (see also Doss [15]). In contrast to (Γ/ε)^{d/2}, the bound (26) has only a logarithmic dependence on Γ and is much smaller when t is small. Han and Wellner [30, Lemma 3.3] proved a weaker bound for the left hand side of (25) having additional multiplicative factors involving k (these factors cannot be neglected since we care about the regime k ∼ √n).
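To get a concrete feel for the gap just described, the two right-hand sides can be compared numerically. The sketch below drops the constants C_{d,p} and uses hypothetical parameter values; it only illustrates that for small t the L_p-constrained bound (26) is far smaller than the (Γ/ε)^{d/2} bound that holds without the L_p constraint.

```python
import math

# Hypothetical parameter values for illustration; the theorem fixes these
# bounds only up to constants depending on d and p.
d, Gamma, eps = 6, 1.0, 1e-3

def bound_with_Lp(t):
    # Right-hand side of (26), constants dropped:
    # (log(Gamma/eps))^(d+1) * (t/eps)^(d/2)
    return (math.log(Gamma / eps)) ** (d + 1) * (t / eps) ** (d / 2)

def bound_without_Lp():
    # Gao-Wellner order without the L_p constraint: (Gamma/eps)^(d/2)
    return (Gamma / eps) ** (d / 2)

# For t much smaller than Gamma, the L_p-constrained bound wins by a wide margin.
print(bound_with_Lp(t=0.005), bound_without_Lp())
```

The t-dependence (t/ε)^{d/2} is what makes (26) useful for localized arguments, since t shrinks with the localization radius while Γ stays fixed.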
5. Proof ideas
We briefly describe here the key ideas underlying the proofs of the main results of the paper. The risk upper bounds for the convex LSE (Theorem 2.1 and Theorem 2.3) in Section 2 are based on standard techniques [10] for analyzing LSEs combined with our metric entropy results of Section 4. The main novelty here is in the metric entropy results. The worst case risk lower bound for the convex LSE in Theorem 2.2 follows from the adaptive lower bound in Theorem 2.5 by taking k ∼ √n. The main ideas behind the proof of Theorem 2.5 are as follows. Chatterjee [10] proved that the risk of the LSE at f̃_k (this is the function given by Lemma 2.4) behaves as t²_{f̃_k}, where t_{f̃_k} is the maximizer of the function

t ↦ H_{f̃_k}(t) := E sup_{f ∈ C(Ω) : ℓ_{P_n}(f̃_k, f) ≤ t} (1/n) Σ_{i=1}^n ξ_i (f(X_i) − f̃_k(X_i)) − t²/2   (27)

over t ∈ [0, ∞). The task then boils down to proving that t_{f̃_k} is bounded from below by (k/n)^{2/d} up to logarithmic factors. This requires proving upper bounds and lower bounds for the function H_{f̃_k}(t). We prove upper bounds using Dudley's entropy bound and our metric entropy result of Corollary 4.3. The lower bounds are proved by Sudakov minoration as well as a metric entropy lower bound for local balls around f_0(x) := ‖x‖² (see Lemma 7.3).

The proof of Theorem 3.1 uses the same basic strategy as that of Theorem 2.5 but is technically more involved because of the random design setting. We use conditional versions of many arguments used in the proof of Theorem 2.5 including the conditional version of the result of Chatterjee [10]. The bracketing entropy upper bound from Theorem 4.4 as well as the metric entropy lower bound from Lemma 8.2 are crucial for this proof.

The proof of Theorem 3.2 involves taking a large polytopal region S inside the general domain Ω, using ideas from the proof of Theorem 3.1 on the subset S, and using the Lipschitz constraint to deal with the relatively small set Ω \ S.
The Lipschitz constraint is crucial here as it allows the use of covering numbers in the supremum (L_∞) metric due to Bronšteĭn [9].

Our ideas behind the proofs for the lower bounds on the performance of the LSEs have been used in a simpler setting in Kur et al. [36]. Specifically, the fixed design lower bound in Kur et al. [36] only works in the regime k ∼ √n and so it does not yield the adaptive lower bounds in Theorem 2.5. The random design lower bound in Kur et al. [36] uses an assumption on the Koltchinskii-Pollard entropy (or ∞-covering) which is not available in the present setting.

The main proof ideas for the metric entropy results are as follows. Let us start with Theorem 4.4 because its proof is technically simpler. The key is to consider a polytopal domain of the form

Ω := { x ∈ R^d : a_i ≤ v_i^T x ≤ b_i, i = 1, ..., d+1 }

for some unit vectors v_1, ..., v_{d+1} and prove the bound (26). Results of Gao and Wellner [20] can be used to show this if we consider the L_p norm on the smaller set

Ω̃ := { x ∈ R^d : a_i + η(b_i − a_i) ≤ v_i^T x ≤ b_i − η(b_i − a_i), i = 1, ..., d+1 }

for a fixed η > 0. The main task is then to extend the bound from Ω̃ to all of Ω. We do this via induction by sequentially extending to each domain

T_r(Ω) := { x ∈ R^d : a_i ≤ v_i^T x ≤ b_i for 1 ≤ i ≤ r, and a_i + η(b_i − a_i) ≤ v_i^T x ≤ b_i − η(b_i − a_i) for r < i ≤ d+1 }

for r = 0, ..., d+1 (note that T_0(Ω) = Ω̃ and T_{d+1}(Ω) = Ω). Details can be found in the statement and proof of Lemma 9.1.

The ideas behind the proof of Theorem 4.1 are similar but more technically involved because of the discrete metric and the lack of any uniform boundedness. Even the first step of proving the result in the strict interior (such as Ω̃ above) of the full domain Ω is challenging as there are no prior results (such as those in Gao and Wellner [20]) in this discrete unbounded setting.
This result is proved in Proposition 9.5. The inductionstep (carried out in Proposition 9.11 and Lemma 9.12) is also more delicate.
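The role of the maximizer t_f in the lower-bound strategy above can be illustrated with a stylized calculation. Assuming, purely for illustration, that the localized Gaussian complexity is exactly linear in t, G(t) = A·t with A = σ(k/n)^{2/d} (the true G is only sandwiched between such linear functions up to logarithmic factors), the maximizer of H(t) = G(t) − t²/2 is t = A, which a grid search recovers:

```python
import numpy as np

# Stylized localized Gaussian complexity: G(t) = A * t with A = sigma*(k/n)^(2/d).
# The parameter values below are hypothetical.
sigma, k, n, d = 1.0, 100, 10_000, 6
A = sigma * (k / n) ** (2 / d)

ts = np.linspace(0.0, 1.0, 200_001)
H = A * ts - ts ** 2 / 2          # H(t) = G(t) - t^2 / 2, a concave parabola
t_star = ts[np.argmax(H)]

print(t_star, A)  # the maximizer of A*t - t^2/2 is t = A
```

The squared maximizer t_star² is then the risk scale, which is why lower-bounding t_{f̃_k} by (k/n)^{2/d} (times σ, up to logs) yields the σ²(k/n)^{4/d} risk lower bound.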
6. Discussion
This section has some high-level remarks on the results of the paper. Our minimax suboptimality results (for d ≥ 5) are based on constructions involving piecewise affine functions. Specifically, we prove that the suboptimal rate n^{−2/d} is realized at the piecewise affine function f̃_k for k ∼ √n. One might wonder if the convex LSE achieves the same rate n^{−2/d} or a faster rate (such as the minimax rate n^{−4/(d+4)}) at smooth convex functions such as f_0(x) = ‖x‖². This appears to be challenging to resolve. The main difficulty stems from the fact that for such f_0, the function t ↦ H_{f_0}(t) (defined as in (27) with f̃_k replaced by f_0) will take values of the same order (n^{−2/d} up to logarithmic factors) for n^{−2/d} ≲ t ≲ n^{−1/d}. Indeed, the upper bound of order n^{−2/d} (up to a logarithmic factor) can be proved by the arguments involved in the proof of Theorem 2.1, and the lower bound of n^{−2/d} follows from Lemma 7.3 or Lemma 8.2 via Sudakov minoration (Lemma 7.4). The (square of the) maximizer of H_{f_0}(·) determines the rate of convergence of the LSE (by Chatterjee [10, Theorem 1.1]), and the fact that H_{f_0}(·) takes values of the same order in the large interval n^{−2/d} ≲ t ≲ n^{−1/d} makes it difficult to accurately pin down the location of its maximizer.

As already mentioned previously, our minimax suboptimality result for the bounded convex LSE when the domain is a polytope (for d ≥ 5) contrasts with the recent result of Kur et al. [35] who proved that the bounded convex LSE is minimax optimal when Ω = B_d (B_d is the unit ball in R^d) for all d. Why is the LSE suboptimal over C^B([0,1]^d) but optimal over C^B(B_d)? The class C^B(B_d) is much larger than C^B([0,1]^d) in the sense of metric entropy under the L_2 norm with respect to the Lebesgue measure. Indeed, the ε-entropy of C^B([0,1]^d) is of the order ε^{−d/2} while that of C^B(B_d) is of the order ε^{−(d−1)} (see Gao and Wellner [20]). This increased metric entropy of C^B(B_d) is driven by the curvature (of the boundary) of B_d. Specifically, one can obtain disjoint spherical caps S_1, ..., S_N (with N ∼ ε^{−(d−1)}) of height ε² such that the indicators of ∪_{i∈H} S_i for sufficiently separated subsets H ⊆ {1, ..., N} form an ε-packing subset of C^B(B_d) in the L_2 metric with respect to Lebesgue measure (strictly speaking, these indicator functions are not convex but they can be approximated by piecewise affine convex functions). In other words, the complexity of C^B(B_d) is driven by the complexity of these well-separated subsets (unions of spherical caps) of B_d. This aspect of C^B(B_d) is crucially used in Kur et al. [35] to prove the optimality of the LSE for C^B(B_d). In contrast, the complexity of C^B([0,1]^d) (or more generally C^B(Ω) when Ω is a polytope) is not driven by indicator-like functions of subsets of the domain.
Here ε-packing sets can be constructed by local perturbations of a smooth convex function such as f_0(x) := ‖x‖². This seems to be the main difference between C^B(B_d) and C^B([0,1]^d) which is causing the LSE to switch from minimax optimality to suboptimality.

An interesting observation is that, in both the polytopal and the smooth cases, the worst case risk of the LSE over C^B(Ω) equals, up to logarithmic factors, the global Gaussian complexity:

E sup_{f ∈ C^B(Ω)} (1/n) Σ_{i=1}^n ξ_i f(X_i)   (28)

where ξ_1, ..., ξ_n, X_1, ..., X_n are independent with ξ_1, ..., ξ_n distributed as normal with mean zero and variance σ² and X_1, ..., X_n distributed according to the uniform distribution P on Ω. When Ω is a polytope, (28) is of the order n^{−2/d}. To see this, one can upper bound (28) by using standard empirical process bounds via L_2(P) bracketing entropy bounds (see e.g., van de Geer [44, Theorem 5.11], restated as inequality (44)) and lower bound (28) by Sudakov minoration along with the metric entropy lower bound in Lemma 8.2. When Ω = B_d, this strategy of upper bounding (28) via L_2(P) bracketing entropy gives a suboptimal upper bound, as explained by Kur et al. [35]. The reason is that the L_2(P_n) bracketing ε-entropy (here P_n is the empirical measure of X_1, ..., X_n) is different from the L_2(P) bracketing ε-entropy for ε smaller than n^{−1/(d−1)}. Thus, to prove the sharp bound for (28) in the ball case, Kur et al. [35] resort to a different technique via level sets and chaining using L_2(P_n) bracketing numbers.

Isotonic regression is another shape constrained regression problem where the LSE is known to be minimax optimal for all dimensions (see Han et al. [29]). The class of coordinatewise monotone functions on [0,1]^d is similar to C^B(B_d) in that its metric entropy is driven by well-separated subsets of [0,1]^d (see Gao and Wellner [19, Proof of Proposition 2.1]).
Other examples of such classes where the LSE is optimal for all dimensions can be found in Han [28].

We proved our fixed-design risk bounds for the full convex LSE (in Section 2) in the case where the domain Ω is a polytope. A natural question is to extend these to the case where Ω is a smooth convex body such as the unit ball. Based on the results of Kur et al. [35], it is reasonable to conjecture that the convex LSE will be minimax optimal in fixed design when the domain is the unit ball. However, it appears nontrivial to prove this as the level set reduction employed in Kur et al. [35] cannot be used in the absence of uniform boundedness. We hope to address this in future work.
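For concreteness, the gap between the two rates discussed in this paper can be tabulated numerically (the sample size n = 10^6 below is arbitrary): the LSE's worst-case rate n^{−2/d} and the minimax rate n^{−4/(d+4)} coincide at d = 4 and separate for every d ≥ 5.

```python
# LSE worst-case rate n^(-2/d) versus minimax rate n^(-4/(d+4)).
n = 10 ** 6
for d in (4, 5, 8, 12):
    lse, minimax = n ** (-2 / d), n ** (-4 / (d + 4))
    # For d = 4 the two exponents agree (2/d = 4/(d+4)); for d >= 5 the LSE
    # rate is strictly slower (larger) than the minimax rate.
    print(d, lse, minimax, lse > minimax)
```

The exponents agree exactly at d = 4 because 2/d = 4/(d+4) there, which is why the suboptimality phenomenon only appears in dimension five and above.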
7. Proofs of results from Section 2
This section has proofs for Theorems 2.1, 2.2, 2.3 and 2.5 (Lemma 2.4 is proved in Section 10). The metric entropy results stated in Section 4 (inequality (23), Corollary 4.2 and Corollary 4.3) are crucial for these proofs. Let us also recall here some general results that will be used in these proofs, starting with the following result of Chatterjee [10]. We use the following notation. For a function f on Ω, a class of functions F on Ω and t > 0, let

B^F_{P_n}(f, t) := { g ∈ F : ℓ_{P_n}(f, g) ≤ t },   (29)

where ℓ_{P_n} is given in (5).

Theorem 7.1 (Chatterjee). Consider data generated according to the model: Y_i = f(X_i) + ξ_i for i = 1, ..., n, where X_1, ..., X_n are fixed deterministic design points in a convex body X ⊆ R^d, f belongs to a convex class of functions F and ξ_1, ..., ξ_n are independently distributed according to the normal distribution with mean 0 and variance σ². Consider the LSE

f̂_n(F) ∈ argmin_{g ∈ F} Σ_{i=1}^n (Y_i − g(X_i))²

and define t_f := argmax_{t ≥ 0} H_f(t) where

H_f(t) := E sup_{g ∈ B^F_{P_n}(f, t)} (1/n) Σ_{i=1}^n ξ_i (g(X_i) − f(X_i)) − t²/2

and B^F_{P_n}(f, t) is defined in (29). Then H_f(·) is a concave function on [0, ∞), t_f is unique and the following pair of inequalities hold for positive constants c and C:

P{ 0.5 t_f ≤ ℓ_{P_n}(f̂_n(F), f) ≤ 2 t_f } ≥ 1 − 3 exp(−c n t_f² / σ²)   (30)

and

0.5 t_f² − C σ²/n ≤ E ℓ²_{P_n}(f̂_n(F), f) ≤ 2 t_f² + C σ²/n.

Upper bounds for t_f can be obtained via:

t_f ≤ inf{ r > 0 : H_f(r) ≤ 0 }   (31)

and lower bounds for t_f can be obtained via:

t_f ≥ r_1 if 0 ≤ r_1 < r_2 are such that H_f(r_1) ≤ H_f(r_2).   (32)

Let us also recall the Dudley metric entropy bound for the supremum of a Gaussian process.
Theorem 7.2 (Dudley). Let ξ_1, ..., ξ_n be independently distributed according to the normal distribution with mean 0 and variance σ². Then for every deterministic X_1, ..., X_n ∈ X, every class of functions F, f ∈ F and t ≥ 0, we have

E sup_{g ∈ B^F_{P_n}(f,t)} (1/n) Σ_{i=1}^n ξ_i (g(X_i) − f(X_i)) ≤ σ inf_{0 < θ ≤ t/2} ( (1/√n) ∫_θ^{t/2} √(log N(ε, B^F_{P_n}(f, t), ℓ_{P_n})) dε + 2θ ).

Proof of Theorem 2.1. We start by using the metric entropy bound (23). Indeed, using (23) for p = 2 (and noting that log(1/δ) ≤ c_d log n because of (11) and the fact that F is a constant depending on d alone), we get

log N(ε, B^{C(Ω)}_{P_n}(f_0, t), ℓ_{P_n}) ≤ C_d (log n)^F ((t + L)/ε)^{d/2}   (33)

for every t > 0 and ε > 0, where L = L(f_0) := inf_{f ∈ A(Ω)} ℓ_{P_n}(f_0, f). We now control

G(t) := E sup_{g ∈ C(Ω) : ℓ_{P_n}(f_0, g) ≤ t} (1/n) Σ_{i=1}^n ξ_i (g(X_i) − f_0(X_i))   (34)

so we can bound E_{f_0} ℓ²_{P_n}(f̂_n, f_0) by Theorem 7.1. Theorem 7.2 along with (33) gives

G(t)/σ ≤ (C_d (log n)^{F/2} / √n) ∫_θ^{t/2} ((t + L)/ε)^{d/4} dε + 2θ
      ≤ (C_d 2^{d/4}) ((log n)^{F/2} / √n) { ∫_θ^{t/2} (t/ε)^{d/4} dε + ∫_θ^{t/2} (L/ε)^{d/4} dε } + 2θ   (35)

for every 0 < θ ≤ t/2. Below, we replace C_d 2^{d/4} by just C_d. Before proceeding further, it is convenient to split into the three cases d ≤ 3, d = 4 and d ≥ 5.

When d ≤ 3, we take θ = 0 to get

G(t) ≤ C_d (σ/√n) (log n)^{F/2} ( t + L^{d/4} t^{1−d/4} ).

Because

C_d (σ/√n) (log n)^{F/2} t ≤ t²/4 if and only if t ≥ 4 C_d (σ/√n) (log n)^{F/2}

and

C_d (σ/√n) (log n)^{F/2} L^{d/4} t^{1−d/4} ≤ t²/4 if and only if t ≥ (4C_d)^{4/(d+4)} ( σ (log n)^{F/2} / √n )^{4/(d+4)} L^{d/(d+4)},

we deduce that

H(t) := G(t) − t²/2 ≤ 0 for t ≥ C_d max( ( σ (log n)^{F/2} / √n )^{4/(d+4)} L^{d/(d+4)}, (σ/√n) (log n)^{F/2} ).

It follows from (31) that

t_{f_0} ≤ C_d max( ( σ (log n)^{F/2} / √n )^{4/(d+4)} L^{d/(d+4)}, (σ/√n) (log n)^{F/2} ),

so Theorem 2.1 holds for d ≤ 3.

For d = 4, (35) leads to

G(t) ≤ C_d (σ/√n) (log n)^{F/2} (t + L) log(t/θ) + 2σθ.

Choosing θ = t/(2√n), we obtain

G(t) ≤ C_d (σ/√n) (t + L) (log n)^{(F/2)+1},

from which we can deduce as before that

H(t) = G(t) − t²/2 ≤ 0 for t ≥ C_d max( √(σL) (log n)^{(F+2)/4} / n^{1/4}, (σ/√n) (log n)^{(F/2)+1} ),

which proves Theorem 2.1 for d = 4.

Finally, for d ≥ 5, (35) leads to the bound

G(t) ≤ C_d σ ((log n)^{F/2} / √n) { ∫_θ^∞ (t/ε)^{d/4} dε + ∫_θ^∞ (L/ε)^{d/4} dε } + 2σθ
     ≤ C_d σ ((log n)^{F/2} / √n) (t + L)^{d/4} θ^{−(d/4 − 1)} + 2σθ

for every θ > 0. The choice

θ = ( C_d (log n)^{F/2} / √n )^{4/d} (t + L)

gives

G(t) ≤ C_d σ ( (log n)^{F/2} / √n )^{4/d} (t + L),

from which it follows that

H(t) = G(t) − t²/2 ≤ 0 for t ≥ C_d max( √(σL) ( (log n)^{F/2} / √n )^{2/d}, σ ( (log n)^{F/2} / √n )^{4/d} ),

which concludes the proof of Theorem 2.1.

Proof of Theorem 2.2. This basically follows from Theorem 2.5. Let c_d and N_d be as given by Theorem 2.5. Letting k = √n σ^{−d/4} and assuming that n ≥ max(N_d, c_d^{−1} σ^{−d/2}), we obtain from Theorem 2.5 that

sup_{f ∈ C^{C'_d}_{C'_d}(Ω)} E_f ℓ²_{P_n}(f̂_n, f) ≥ c_d σ n^{−2/d} (log n)^{−4(d+1)/d},

where C'_d is such that f̃_k ∈ C^{C'_d}_{C'_d}(Ω) (the existence of such a C'_d is guaranteed by Lemma 2.4). The required lower bound (15) on the class C^L_L(Ω) for an arbitrary L ≥ C'_d then follows from the inclusion C^{C'_d}_{C'_d}(Ω) ⊆ C^L_L(Ω).
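The choice of the truncation level θ in the d ≥ 5 argument above is a standard balancing step: the truncated entropy integral blows up as θ decreases while the remainder term 2σθ grows with θ. A small numerical sketch (with A standing in for C_d(log n)^{F/2}/√n, L set to 0, and all values hypothetical) confirms that θ = t·A^{4/d} minimizes A t^{d/4} θ^{1−d/4} + 2θ up to a constant factor:

```python
import numpy as np

# Hypothetical values: d >= 5 so the entropy integral diverges at zero.
d, t, A = 6, 0.5, 1e-3

def bound(theta):
    # Truncated Dudley bound (constants dropped, sigma factored out):
    # entropy-integral term + remainder term.
    return A * t ** (d / 4) * theta ** (1 - d / 4) + 2 * theta

thetas = np.logspace(-8, 0, 20_000)
best = thetas[np.argmin(bound(thetas))]
theta_formula = t * A ** (4 / d)   # the choice made in the proof

print(best, theta_formula)  # same order of magnitude
```

Since both terms equal a multiple of t·A^{4/d} at this θ, the whole bound is linear in t, which is exactly the form needed to solve G(t) ≤ t²/4 for the critical radius.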
Proof of Theorem 2.3. Theorem 2.3 follows from a straightforward application of the metric entropy bound in Corollary 4.2 and the general results Theorem 7.1 and Theorem 7.2. Indeed, combining Corollary 4.2 and Theorem 7.2, we get

G(t) ≤ C_d σ √(k/n) (log n)^{h/2} ∫_θ^{t/2} (t/ε)^{d/4} dε + 2σθ   (36)

for every 0 < θ ≤ t/2, where G(t) is as in (34). We again split into the three cases d ≤ 3, d = 4 and d ≥ 5.

When d ≤ 3, we take θ = 0 to obtain

G(t) ≤ C_d t σ √(k/n) (log n)^{h/2},

so that G(t) ≤ t²/4 for t ≥ C_d σ √(k/n) (log n)^{h/2}. This proves Theorem 2.3 for d ≤ 3.

For d = 4, we get

G(t) ≤ C_d σ √(k/n) (log n)^{h/2} t log(t/θ) + 2σθ.

Choosing θ := t/(2√n), we get

G(t) ≤ C_d σ t √(k/n) (log n)^{(h/2)+1}.

This gives G(t) ≤ t²/4 for t ≥ C_d σ √(k/n) (log n)^{(h/2)+1}, which proves Theorem 2.3 for d = 4.

For d ≥ 5, we get

G(t) ≤ C_d σ √(k/n) (log n)^{h/2} ∫_θ^∞ (t/ε)^{d/4} dε + 2σθ ≤ C_d σ √(k/n) (log n)^{h/2} t^{d/4} θ^{−(d/4 − 1)} + 2σθ.

Take θ = t ( C_d (log n)^{h/2} √(k/n) )^{4/d} to get

G(t) ≤ C_d σ t ( (log n)^{h/2} √(k/n) )^{4/d}.   (37)

This clearly implies that G(t) ≤ t²/4 for

t ≥ C_d σ ( (log n)^{h/2} √(k/n) )^{4/d},

which completes the proof of Theorem 2.3.

The proof of Theorem 2.5 will need, in addition to Theorem 7.1, Theorem 7.2, Lemma 2.4 and Corollary 4.3, the following two results. The proof of the first result below is given in Section 10 while the second result is standard. Recall the notation (29).
Lemma 7.3.
Let Ω be a convex body contained in the unit ball whose volume is bounded from below by a constant depending on d alone. Let f_0(x) := ‖x‖². There exist two positive constants c_1 and c_2 depending on d alone such that

log N( c_2 n^{−1/d}, B^{C(Ω)}_{P_n}(f_0, t), ℓ_{P_n} ) ≥ c_1 n^{1−2/d} for t ≥ c_2 n^{−1/d}.

Lemma 7.4 (Sudakov minoration). Let ξ_1, ..., ξ_n be independently distributed according to the normal distribution with mean 0 and variance σ². Then for every deterministic X_1, ..., X_n ∈ X, every class of functions F, f ∈ F and t ≥ 0, we have

E sup_{g ∈ B^F_{P_n}(f,t)} (1/n) Σ_{i=1}^n ξ_i (g(X_i) − f(X_i)) ≥ (β σ / √n) sup_{ε > 0} { ε √(log N(ε, B^F_{P_n}(f, t), ℓ_{P_n})) }.

Proof of Theorem 2.5.
By Lemma 2.4, f̃_k satisfies

ℓ_{P_n}(f_0, f̃_k) ≤ sup_{x ∈ Ω} | f_0(x) − f̃_k(x) | ≤ C_d k^{−2/d},   (38)

where f_0(x) := ‖x‖². Theorem 7.1 says that E_{f̃_k} ℓ²_{P_n}(f̂_n, f̃_k) can be bounded from below by lower bounding t_{f̃_k}, where t_{f̃_k} is the maximizer of ˜G(t) − t²/2 over t ≥ 0 with

˜G(t) := E sup_{g ∈ C(Ω) : ℓ_{P_n}(f̃_k, g) ≤ t} (1/n) Σ_{i=1}^n ξ_i ( g(X_i) − f̃_k(X_i) ).

Note that we are working in the fixed design setting so X_1, ..., X_n are non-random and the expectation above is being taken with respect to the randomness in ξ_1, ..., ξ_n.

We shall lower bound t_{f̃_k} by proving suitable upper and lower bounds for the function ˜G(·). Note first that by the properties of f̃_k given in Lemma 2.4, we can apply the metric entropy bound in Corollary 4.3 along with Theorem 7.2 to get (similar to the calculation underlying (37)) a constant Υ_d such that

˜G(t) ≤ Υ_d σ t (log n)^{2(d+1)/d} (k/n)^{2/d} for every t > 0.   (39)

We next prove a lower bound for ˜G(t): there exist positive constants γ_d and Γ_d depending on d alone such that

˜G(t) ≥ γ_d σ t (k/n)^{2/d} for all t ≤ γ_d k^{−2/d},   (40)

provided k ≤ Γ_d √n. To prove this, suppose first that t = 2 C_d k^{−2/d} where C_d is the constant from (38). For this choice of t, it follows from the triangle inequality (and (38)) that

B^{C(Ω)}_{P_n}(f̃_k, t) ⊇ B^{C(Ω)}_{P_n}(f_0, C_d k^{−2/d}),

where, it may be recalled, B^{C(Ω)}_{P_n}(f, s) := { g ∈ C(Ω) : ℓ_{P_n}(f, g) ≤ s }. This immediately implies

˜G(t) ≥ G(C_d k^{−2/d}),

where G(t) is defined as in (34). Lemma 7.4 now gives

˜G(t) ≥ G(C_d k^{−2/d}) ≥ (β σ / √n) sup_{ε > 0} { ε √(log N(ε, B_{P_n}(f_0, C_d k^{−2/d}), ℓ_{P_n})) }.

Using the lower bound on the metric entropy given by Lemma 7.3 for ε = c_2 n^{−1/d} gives

˜G(t) ≥ (β σ / √n) (c_2 n^{−1/d}) √(c_1 n^{1−2/d}) = β c_2 √c_1 σ n^{−2/d}, provided k ≤ (C_d / c_2)^{d/2} √n.   (41)

The condition above is necessary for the inequality c_2 n^{−1/d} ≤ C_d k^{−2/d} which is required for the application of Lemma 7.3. This gives

˜G(t) ≥ β c_2 √c_1 σ n^{−2/d} for t = 2 C_d k^{−2/d}.

Now for t ≤ 2 C_d k^{−2/d}, we use the fact that x ↦ ˜G(x) is concave on [0, ∞) (and that ˜G(0) = 0) to deduce that

˜G(t)/t ≥ ˜G(2 C_d k^{−2/d}) / (2 C_d k^{−2/d}) ≥ σ (k/n)^{2/d} ( β c_2 √c_1 / (2 C_d) ) for all t ≤ 2 C_d k^{−2/d}.
0, yield γ d σ (cid:18) kn (cid:19) /d ≤ sup t> (cid:18) ˜ G ( t ) − t (cid:19) = ˜ G ( t ˜ f k ) − t f k ≤ ˜ G ( t ˜ f k ) ≤ Υ d σt ˜ f k (log n ) d +1) /d (cid:18) kn (cid:19) /d . This implies t ˜ f k ≥ γ d d σ (cid:18) kn (cid:19) /d (log n ) − d +1) /d . Theorem 7.1 then gives E ˜ f k ℓ P n (cid:16) ˆ f n , ˜ f k (cid:17) ≥ γ d d σ (cid:18) kn (cid:19) /d (log n ) − d +1) /d − Cσ n . It is now clear that the first term on the right hand side above dominates the secondterm when n is larger than a constant depending on d alone. This completes the proofof Theorem 2.5.
8. Proofs of results from Section 3
We provide here the proofs for Theorem 3.1 and Theorem 3.2. These proofs are similar in spirit to that of Theorem 2.5 with some differences that are necessary to deal with the random design setting. Let us first state some general results that will be used in these proofs.

The proof of Theorem 2.5 needed the ingredients: Theorem 7.1, Theorem 7.2, Lemma 7.3, Lemma 2.4 and Lemma 7.4. Modified forms of these ingredients to cover the random design setting (as described below) are used for the proof of Theorem 3.1 and Theorem 3.2.

As in the proof of Theorem 2.5, a key role will be played by Theorem 7.1 of Chatterjee [10]. Theorem 7.1 holds in the fixed design setting with no restriction on the design points, which means that it also applies to the random design setting provided we condition on the design points X_1, ..., X_n. In particular, for our random design setting with ℓ_{P_n} defined as in (5), inequality (30) becomes:

P{ 0.5 t_f ≤ ℓ_{P_n}(f̂_n, f) ≤ 2 t_f | X_1, ..., X_n } ≥ 1 − 3 exp( −c n t_f² / σ² ),   (42)

where

t_f = t_f(X_1, ..., X_n) := argmax_{t ≥ 0} H_f(t)   (43)

with

H_f(t) := E[ sup_{g ∈ B^F_{P_n}(f,t)} (1/n) Σ_{i=1}^n ξ_i (g(X_i) − f(X_i)) | X_1, ..., X_n ] − t²/2.

Here t_f = t_f(X_1, ..., X_n) is random as it depends on the random design points X_1, ..., X_n.

Instead of Dudley's theorem (Theorem 7.2), we shall use the following theorem on the suprema of empirical processes. The first conclusion of the theorem below is taken from van de Geer [44, Theorem 5.11] while the second conclusion essentially follows from van de Geer [44, Proof of Lemma 5.16].

Theorem 8.1.
Suppose X_1, ..., X_n are independently distributed according to a distribution P on Ω. Suppose F is a class of real-valued functions on Ω that are uniformly bounded by Γ > 0. Then the following two statements are true:

1. There exists a positive constant C such that

E sup_{f ∈ F} | P_n f − P f | ≤ C inf{ a ≥ Γ/√n : a ≥ (C/√n) ∫_a^Γ √(log N_[](u, F, L_2(P))) du }.   (44)
2. There exists a positive constant C such that

P{ sup_{f,g ∈ F} ( ℓ_P(f, g) − ℓ_{P_n}(f, g) ) > C a } ≤ C exp( −n a² / (C Γ²) )   (45)

and

P{ sup_{f,g ∈ F} ( ℓ_{P_n}(f, g) − ℓ_P(f, g) ) > C a } ≤ C exp( −n a² / (C Γ²) )   (46)

provided

n a² ≥ C log N_[](a, F, L_2(P)).   (47)

Instead of Lemma 7.3, we shall use the following result which proves the same lower bound as in Lemma 7.3 in the random design setting with high probability. Recall that C^L_L(Ω) denotes the class of all convex functions on Ω that are L-Lipschitz and uniformly bounded by L.

Lemma 8.2.
Let Ω be a convex body that contains a ball of constant (depending on d alone) radius. Let f_0(x) := ‖x‖². Then there exist positive constants c_1, c_2, c_3, c_4 and C depending on d alone such that

P{ log N(ε, B^{C^L_L(Ω)}_{P_n}(f_0, t), ℓ_{P_n}) ≥ c_1 ε^{−d/2} } ≥ 1 − exp(−c_2 n)   (48)

for each fixed ε, t, L satisfying L ≥ C and c_3 n^{−2/d} ≤ ε ≤ min(c_4, t/2).

Lemma 2.4 will also be crucially used in the proof of Theorem 3.1. The following analogue of Lemma 2.4 for the case when Ω is not necessarily a polytope will be used in the proof of Theorem 3.2.
Lemma 8.3.
Suppose Ω is a convex body that is contained in the unit ball and contains a ball of constant (depending on d alone) radius centered at zero. Let f_0(x) := ‖x‖². There exists a positive constant C_d (depending on dimension alone) such that the following is true. For every k ≥ 1, there exist m ≤ C_d k d-simplices Δ_1, ..., Δ_m ⊆ Ω having disjoint interiors and a convex function f̃_k such that:

1. (1 − C_d k^{−2/d}) Ω ⊆ ∪_{i=1}^m Δ_i ⊆ Ω,
2. f̃_k is affine on each Δ_i, i = 1, ..., m,
3. sup_{x ∈ Ω} | f_0(x) − f̃_k(x) | ≤ C_d k^{−2/d},
4. f̃_k ∈ C^{C_d}_{C_d}(Ω).

Lemma 7.4 will be used in the proofs of Theorem 3.1 and Theorem 3.2 in the following conditional form:

E[ sup_{g ∈ B^F_{P_n}(f,t)} (1/n) Σ_{i=1}^n ξ_i (g(X_i) − f(X_i)) | X_1, ..., X_n ] ≥ (β σ / √n) sup_{ε > 0} { ε √(log N(ε, B^F_{P_n}(f, t), ℓ_{P_n})) }.   (49)

We are now ready for the proofs of Theorem 3.1 and Theorem 3.2.

Proof of Theorem 3.1.
It is enough to prove (20) when B is a fixed dimensional constant; the inequality for an arbitrary larger B then follows since C^B(Ω) only grows as B increases. Let f_0(x) := ‖x‖² and f̃_k be as given by Lemma 2.4. Below we shall assume that B is a large enough dimensional constant so that f̃_k ∈ C^B(Ω). The main task in this proof will be to bound the quantity t_{f̃_k} (defined via (43)) from below, where t_{f̃_k} maximizes

H_{f̃_k}(t) := G_{f̃_k}(t) − t²/2

over all t ≥ 0, with

G_{f̃_k}(t) := E[ sup_{g ∈ B^{C^B(Ω)}_{P_n}(f̃_k, t)} (1/n) Σ_{i=1}^n ξ_i ( g(X_i) − f̃_k(X_i) ) | X_1, ..., X_n ].   (50)

We shall prove a lower bound for t_{f̃_k} that holds with high probability over the randomness in X_1, ..., X_n. Specifically, we shall prove the existence of three constants γ_d, c_d and C_d depending on d alone and a constant N_{d,σ} which depends on d and σ such that

P{ t_{f̃_k} ≥ c_d n^{−1/d} √σ (log n)^{−2(d+1)/d} } ≥ 1 − C_d exp( −n^{(d−4)/d} / C_d )   (51)

for k = γ_d √n σ^{−d/4} and n ≥ N_{d,σ}.

Before proceeding with the proof of (51), let us first show how (51) completes the proof of Theorem 3.1. Note first that

sup_{f ∈ C^B(Ω)} E_f ℓ²_P(f̂_n(C^B(Ω)), f) ≥ E_{f̃_k} ℓ²_P(f̂_n(C^B(Ω)), f̃_k),

so it is enough to prove that the right hand side of (20) is a lower bound for E_{f̃_k} ℓ²_P(f̂_n(C^B(Ω)), f̃_k). We shall assume therefore that the data have been generated from the true function f̃_k. Let ρ_n be the lower bound on t_{f̃_k} given by (51), i.e.,

ρ_n := c_d n^{−1/d} √σ (log n)^{−2(d+1)/d}.   (52)

Inequality (42) clearly implies

P{ ℓ_{P_n}(f̂_n(C^B(Ω)), f̃_k) ≥ 0.5 t_{f̃_k} | X_1, ..., X_n } ≥ 1 − 3 exp( −c n t²_{f̃_k} / σ² ).
As a result,

P{ ℓ_{P_n}(f̂_n(C^B(Ω)), f̃_k) ≥ 0.5 ρ_n }
  ≥ P{ ℓ_{P_n}(f̂_n(C^B(Ω)), f̃_k) ≥ 0.5 t_{f̃_k}, t_{f̃_k} ≥ ρ_n }
  = E[ I{ t_{f̃_k} ≥ ρ_n } P{ ℓ_{P_n}(f̂_n(C^B(Ω)), f̃_k) ≥ 0.5 t_{f̃_k} | X_1, ..., X_n } ]
  ≥ E[ I{ t_{f̃_k} ≥ ρ_n } ( 1 − 3 exp( −c n t²_{f̃_k} / σ² ) ) ]
  ≥ ( 1 − 3 exp( −c n ρ_n² / σ² ) ) P{ t_{f̃_k} ≥ ρ_n }.

We can now use (51) to obtain

P{ ℓ_{P_n}(f̂_n(C^B(Ω)), f̃_k) ≥ 0.5 ρ_n } ≥ ( 1 − 3 exp( −c n ρ_n² / σ² ) ) ( 1 − C_d exp( −n^{(d−4)/d} / C_d ) )
  ≥ 1 − 3 exp( −c n ρ_n² / σ² ) − C_d exp( −n^{(d−4)/d} / C_d ).

Clearly if N_{d,σ} is chosen appropriately then, for n ≥ N_{d,σ},

n ρ_n² / σ² = (c_d² / σ) n^{(d−2)/d} (log n)^{−4(d+1)/d}

will be larger than any constant multiple of n^{(d−4)/d}, which gives

P{ ℓ_{P_n}(f̂_n(C^B(Ω)), f̃_k) ≥ 0.5 ρ_n } ≥ 1 − C_d exp( −n^{(d−4)/d} / C_d ).   (53)

We shall now argue that a similar inequality also holds for ℓ_P(f̂_n(C^B(Ω)), f̃_k). For every a > 0, we have

P{ ℓ_P(f̂_n(C^B(Ω)), f̃_k) ≥ 0.5 ρ_n − a }
  ≥ P{ ℓ_{P_n}(f̂_n(C^B(Ω)), f̃_k) ≥ 0.5 ρ_n, sup_{f,g ∈ C^B(Ω)} ( ℓ_{P_n}(f, g) − ℓ_P(f, g) ) ≤ a }
  ≥ P{ ℓ_{P_n}(f̂_n(C^B(Ω)), f̃_k) ≥ 0.5 ρ_n } + P{ sup_{f,g ∈ C^B(Ω)} ( ℓ_{P_n}(f, g) − ℓ_P(f, g) ) ≤ a } − 1
  ≥ P{ sup_{f,g ∈ C^B(Ω)} ( ℓ_{P_n}(f, g) − ℓ_P(f, g) ) ≤ a } − C_d exp( −n^{(d−4)/d} / C_d ).

To bound the probability above, we use (46). Gao and Wellner [20, Theorem 1.5] gives

log N_[](ε, C^B(Ω), ℓ_P) ≤ C_d (B/ε)^{d/2}.   (55)

The requirement (47) is therefore satisfied when a is n^{−2/(d+4)} B^{d/(d+4)} multiplied by a large enough dimensional constant. Inequality (46) then gives

P{ sup_{f,g ∈ C^B(Ω)} ( ℓ_{P_n}(f, g) − ℓ_P(f, g) ) ≤ C_d n^{−2/(d+4)} B^{d/(d+4)} } ≥ 1 − C exp( −n^{d/(d+4)} / (C_d B^{8/(d+4)}) ),   (56)

which then implies that

P{ ℓ_P(f̂_n(C^B(Ω)), f̃_k) ≥ 0.5 c_d n^{−1/d} √σ (log n)^{−2(d+1)/d} − C_d n^{−2/(d+4)} B^{d/(d+4)} }
  ≥ 1 − C exp( −n^{d/(d+4)} / (C_d B^{8/(d+4)}) ) − C_d exp( −n^{(d−4)/d} / C_d ).

Because n^{−2/(d+4)} is of a smaller order than n^{−1/d} and n^{d/(d+4)} is of a larger order than n^{(d−4)/d} (and B is a dimensional constant), we obtain

P{ ℓ_P(f̂_n(C^B(Ω)), f̃_k) ≥ c_d n^{−1/d} √σ (log n)^{−2(d+1)/d} / 4 } ≥ 1 − C_d exp( −n^{(d−4)/d} / C_d )

provided n ≥ N_{d,σ}, where N_{d,σ} is a constant depending on d and σ alone. Finally, note that N_{d,σ} can be chosen so that

E_{f̃_k} ℓ²_P(f̂_n(C^B(Ω)), f̃_k) ≥ c_d σ n^{−2/d} (log n)^{−4(d+1)/d}.   (57)

This completes the proof of Theorem 3.1 assuming that (51) is true. Let us now start the proof of (51).
For this purpose, we shall require both upper andlower bounds for G ˜ f k ( t ) (defined in (50)) for appropriate values of t . We start to proveupper bounds. Note first that B C B (Ω) P n ( ˜ f k , t ) ⊆ B C B (Ω) P ( ˜ f k , t + C d n − / ( d +4) B d/ ( d +4) ) (58)with probability at least 1 − C exp (cid:18) − n d/ ( d +4) C d B / ( d +4) (cid:19) . (59)Here we are using the notation B F P ( f, t ) := { g ∈ F : ℓ P ( f, g ) ≤ t } . (60)where ℓ P is given in (8). (58) is a consequence of P ( sup f,g ∈C B (Ω) ( ℓ P ( f, g ) − ℓ P n ( f, g )) ≤ C d n − / ( d +4) B d/ ( d +4) ) ≥ − C exp (cid:18) − n d/ ( d +4) C d B / ( d +4) (cid:19) (61)whose proof follows from the same argument as the proof of (56). Thus for t ≥ C d n − / ( d +4) B d/ ( d +4) , (62)we get B C B (Ω) P n ( ˜ f k , t ) ⊆ B C B (Ω) P ( ˜ f k , t )with probability at least (59). We deduce consequently that, for a fixed t satisfying (62),the event: G ˜ f k ( t ) ≤ G ˜ f k (3 t ) := E sup g ∈ B C B (Ω) P ( ˜ f k , t ) n n X i =1 ξ i (cid:16) g ( X i ) − ˜ f k ( X i ) (cid:17) (cid:12)(cid:12)(cid:12)(cid:12) X , . . . , X n ur, G., Gao, F., Guntuboyina, A. and Sen, B./Convex Regression in multidimensions holds with probability at least (59). It is easy to see that the function T ( x , . . . , x n ) := E sup g ∈ B C B (Ω) P ( ˜ f k , t ) n n X i =1 ξ i (cid:16) g ( X i ) − ˜ f k ( X i ) (cid:17) (cid:12)(cid:12)(cid:12)(cid:12) X = x , . . . , X n = x n satisfies the bounded differences condition: | T ( x , . . . , x n ) − T ( x ′ , . . . , x ′ n ) | ≤ Bσn n X i =1 I { x i = x ′ i } and the bounded differences concentration inequality consequently gives P (cid:8) G ˜ f k (3 t ) ≤ E G ˜ f k (3 t ) + x (cid:9) ≥ − exp (cid:18) − nx B σ (cid:19) (63)for every x >
0. We next control E G ˜ f k (3 t ) = E sup g ∈ B C B (Ω) P ( ˜ f k , t ) n n X i =1 ξ i (cid:16) g ( X i ) − ˜ f k ( X i ) (cid:17) where the expectation on the left hand side is with respect to X , . . . , X n while theexpectation on the right hand side is with respect to ξ , . . . , ξ n , X , . . . , X n . Clearly E G ˜ f k (3 t ) = E sup h ∈H ( Q n h − Q h ) (64)where H consists of all functions of the form ( ξ, x ) ξ (cid:16) g ( x ) − ˜ f k ( x ) (cid:17) as g varies over B C B (Ω) P ( ˜ f k , t ), Q n is the empirical measure corresponding to ( ξ i , X i ) , i = 1 , . . . , n , and Q is the distribution of ( ξ, X ) where ξ and X are independent with ξ ∼ N (0 , σ ) and X ∼ P .We now use the bound (44) requires us to control N [ ] ( ǫ, H , L ( Q )). This is done byTheorem 4.4 which states thatlog N [ ] ( ǫ, B C B (Ω) P ( ˜ f k , t ) , L ( P )) ≤ C d k (cid:18) log C d Bǫ (cid:19) d +1 (cid:18) tǫ (cid:19) d/ . (65)Theorem 4.4 is stated under the unnormalized integral constraint R Ω ( f − ˜ f k ) ≤ t andfor bracketing numbers under the unnormalized Lebesgue measure but this implies (65)as the volume of Ω is assumed to be bounded on both sides by dimensional constants.We now claim that N [ ] ( ǫ, H , L ( Q )) ≤ N [ ] ( ǫσ − , B C B (Ω) P ( ˜ f k , t ) , L ( P )) . (66) ur, G., Gao, F., Guntuboyina, A. and Sen, B./Convex Regression in multidimensions Inequality (66) is true because of the following. Let { [ g L , g U ] , g ∈ G } be a set of coveringbrackets for the set B C B (Ω) P ( ˜ f k , t ). For each bracket [ g L , g U ], we associate a correspond-ing bracket [ h L , h U ] for H as follows: h L ( ξ, x ) := ξ (cid:16) g L ( x ) − ˜ f k ( x ) (cid:17) I { ξ ≥ } + ξ (cid:16) g U ( x ) − ˜ f k ( x ) (cid:17) I { ξ < } and h U ( ξ, x ) := ξ (cid:16) g U ( x ) − ˜ f k ( x ) (cid:17) I { ξ ≥ } + ξ (cid:16) g L ( x ) − ˜ f k ( x ) (cid:17) I { ξ < } . 
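The claim (used just below) that $g_L \le g \le g_U$ implies $h_L \le h_g \le h_U$, where $h_g(\xi, x) = \xi(g(x) - \tilde f_k(x))$, is a two-case check on the sign of $\xi$:

```latex
% Case \xi \ge 0: multiplying g_L - \tilde f_k \le g - \tilde f_k \le g_U - \tilde f_k
% by \xi \ge 0 preserves the order; on {\xi \ge 0} we have h_L = \xi(g_L - \tilde f_k)
% and h_U = \xi(g_U - \tilde f_k), so h_L \le h_g \le h_U.
%
% Case \xi < 0: multiplication by \xi reverses the order,
\xi\,(g_U - \tilde f_k) \;\le\; \xi\,(g - \tilde f_k) \;\le\; \xi\,(g_L - \tilde f_k),
% and on {\xi < 0} the definitions of h_L and h_U swap the roles of g_L and g_U
% precisely so that h_L \le h_g \le h_U holds again.
```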
It is now easy to check that whenever g L ≤ g ≤ g U , we have h L ≤ h g ≤ h U where h g ( ξ, x ) = ξ (cid:16) g ( x ) − ˜ f k ( x ) (cid:17) . Further, h U − h L = | ξ | ( g U − g L ) and thus Q ( h U − h L ) = σ P ( g U − g L ) which proves (66). Inequality (65) then gives that for every a ≥ B/ √ n ,we have Z Ba q log N [ ] ( u, H , L ( Q )) du ≤ C d √ k Z Ba (cid:18) log C d Bσu (cid:19) ( d +1) / (cid:18) tσu (cid:19) d/ du ≤ C d √ k ( tσ ) d/ (cid:18) log C d Bσa (cid:19) ( d +1) / Z ∞ a u − d/ du ≤ C d √ k ( tσ ) d/ (cid:18) log C d Bσa (cid:19) ( d +1) / a − ( d/ ≤ C d √ k ( tσ ) d/ (cid:0) log( C d σ √ n ) (cid:1) ( d +1) / a − ( d/ where, in the last inequality, we used a ≥ B/ √ n . The inequality a ≥ Cn − / R Ba p log N [ ] ( u, H , L ( Q )) du will therefore be satisfied for a ≥ C d tσ (cid:18) kn (cid:19) /d (cid:0) log( C d σ √ n ) (cid:1) d +1) /d for an appropriate constant C d . The bound (44) then gives E G ˜ f k (3 t ) ≤ C d tσ (cid:18) kn (cid:19) /d (cid:0) log( C d σ √ n ) (cid:1) d +1) /d + B √ n . Assuming now that t ≥ Bk − /d σ n (4 − d ) / (2 d ) , (67)we deduce E G ˜ f k (3 t ) ≤ (1 + C d ) tσ (cid:18) kn (cid:19) /d (cid:0) log( C d σ √ n ) (cid:1) d +1) /d (68)Putting the above steps together, we obtain that for every x >
0, the inequality G ˜ f k ( t ) ≤ G ˜ f k (3 t ) ≤ E G ˜ f k ( t ) + x ≤ C d tσ (cid:18) kn (cid:19) /d (cid:0) log( C d σ √ n ) (cid:1) d +1) /d + x ur, G., Gao, F., Guntuboyina, A. and Sen, B./Convex Regression in multidimensions holds with probability at least1 − C exp (cid:18) − n d/ ( d +4) C d B / ( d +4) (cid:19) − exp (cid:18) − nx B σ (cid:19) for every fixed t satisfying (62) and (67). We take x = x ( t ) := C d tσ (cid:18) kn (cid:19) /d (cid:0) log( C d σ √ n ) (cid:1) d +1) /d to deduce that H ˜ f k ( t ) ≤ G ˜ f k ( t ) ≤ C d tσ (cid:18) kn (cid:19) /d (cid:0) log( C d σ √ n ) (cid:1) d +1) /d (69)with probability at least1 − C exp (cid:18) − n d/ ( d +4) C d B / ( d +4) (cid:19) − exp (cid:18) − nx ( t )2 B σ (cid:19) (70)provided t satisfies (62) and (67).We shall next prove a lower bound for H ˜ f k ( t ). The key ingredients here are Lemma8.2 and the conditional form (49) of Sudakov’s minoration. Let us first prove lowerbounds for G ˜ f k ( t ). Note that B C B (Ω) P n ( ˜ f k , t ) ⊇ B C B (Ω) P n ( f , t/
2) whenever ℓ P n ( f , ˜ f k ) ≤ t/ . Because ℓ P n ( f , ˜ f k ) ≤ sup x ∈ Ω | f ( x ) − ˜ f k ( x ) | ≤ L d k − /d , the condition ℓ P n ( f , ˜ f k ) ≤ t/ t ≥ L d k − /d . (71)Thus for t satisfying the above, G ˜ f k ( t ) ≥ E sup g ∈ B C B (Ω) P n ( f ,t/ n n X i =1 ξ i (cid:16) g ( X i ) − ˜ f k ( X i ) (cid:17) (cid:12)(cid:12)(cid:12)(cid:12) X , . . . , X n = E sup g ∈ B C B (Ω) P n ( f ,t/ n n X i =1 ξ i ( g ( X i ) − f ( X i )) (cid:12)(cid:12)(cid:12)(cid:12) X , . . . , X n . Inequality (49) then gives G ˜ f k ( t ) ≥ βσ √ n sup ǫ> (cid:26) ǫ q log N ( ǫ, B C B (Ω) P n ( f , t/ , P n ) (cid:27) We now use Lemma 8.2 with ǫ = c n − /d to claim that for B ≥ C and t ≥ c n − /d , (72) ur, G., Gao, F., Guntuboyina, A. and Sen, B./Convex Regression in multidimensions we have G ˜ f k ( t ) ≥ β √ c σc − ( d/ n − /d with probability at least 1 − exp( − c n ). This gives the following lower bound on H ˜ f k ( t ): H ˜ f k ( t ) ≥ β √ c σc − ( d/ n − /d − t . Taking t = t where t = β √ c σc − ( d/ n − /d (73)gives us that H ˜ f k ( t ) ≥ t β √ c σc − ( d/ n − /d (74)with probability at least 1 − exp( − c n ) provided t = t satisfies (71) and (72). Thecondition (71) is equivalent to k ≥ L d β √ c c − ( d/ ! d/ √ nσ − d/ (75)and (72) is equivalent to n ≥ c β √ c c − ( d/ ! d/ σ − d/ . (76)We shall now combine (69) and (74). Suppose t > C d t σ (cid:18) kn (cid:19) /d (cid:0) log( C d σ √ n ) (cid:1) d +1) /d = β √ c σc − ( d/ n − /d where C d is as in (69) and the other constants ( β , c and c ) are from (74). The aboveequality is the same as t = β √ c c − ( d/ C d k − /d (cid:0) log( C d σ √ n ) (cid:1) − d +1) /d . In that case, (69) and (74) together imply that H ˜ f k ( t ) ≤ H ˜ f k ( t )with probability at least1 − C exp (cid:18) − n d/ ( d +4) C d B / ( d +4) (cid:19) − exp (cid:18) − nx ( t )2 B σ (cid:19) − exp( − c n ) . (77) ur, G., Gao, F., Guntuboyina, A. 
and Sen, B./Convex Regression in multidimensions If we now assume that k ≥ (2 C d ) − d/ (cid:16) β √ c c − ( d/ (cid:17) d/ √ nσ − d/ , (78)then t < t . Inequality (32) then gives that t ˜ f k ≥ t = β √ c c − ( d/ C d k − /d (cid:0) log( C d σ √ n ) (cid:1) − d +1) /d . with probability at least (77).We shall now take k = γ d √ nσ − d/ where γ d is the larger of the two dimensionalconstants on the right hand sides of (75) and (78) and this will obviously ensure thatboth (75) and (78) are satisfied. The quantity t then equals t = c d n − /d √ σ (cid:0) log( C d nσ − ( d/ ) (cid:1) − d +1) /d (79)for a specific c d and x ( t ) = C d c d γ /dd σn − /d . The probability in (77) then equals1 − C exp (cid:18) − n d/ ( d +4) C d B / ( d +4) (cid:19) − exp (cid:18) − C d c d ( γ d ) /d n ( d − /d B (cid:19) − exp( − c n ) . Now if we assume that B ≥
1, then n d/ ( d +4) C d B / ( d +4) ≥ n d/ ( d +4) C d B ≥ n ( d − /d B and also n ≥ n ( d − /d /B . We thus deduce that the probability in (77) is bounded frombelow by 1 − C d exp (cid:18) − n ( d − /d C d B (cid:19) . which can be further simplified to1 − C d exp (cid:18) − n ( d − /d C d (cid:19) . as B is a constant that only depends on d . If we now take n to be larger than a constant N d,σ depending on d and σ alone, then the conditions (62) and (71) will be satisfiedfor t = t and (76) will also be satisfied. Finally the logarithmic term in (79) can befurther simplified by the bound log( C d nσ − ( d/ ) ≤ c d log n . This completes the proofof (51) and consequently Theorem 3.1. ur, G., Gao, F., Guntuboyina, A. and Sen, B./Convex Regression in multidimensions Proof of Theorem 3.2.
It is enough to prove (21) when L is a fixed dimensional constant.From here, the inequality for arbitrary L > f ( x ) := k x k and ˜ f k be as given by Lemma 8.3. Let L be a dimensional constantlarge enough so that ˜ f k ∈ C LL (Ω). As in the proof of Theorem 3.1, the key is to prove(51) where t ˜ f k is defined as the maximizer of H ˜ f k ( t ) := G ˜ f k ( t ) − t t ≥ G ˜ f k ( t ) := E sup g ∈ B C L (Ω) P n ( ˜ f k ,t ) n n X i =1 ξ i (cid:16) g ( X i ) − ˜ f k ( X i ) (cid:17) (cid:12)(cid:12)(cid:12)(cid:12) X , . . . , X n . (80)Before proceeding with the proof of (51), let us first show how (51) completes the proofof Theorem 3.2. Because ˜ f k ∈ C LL (Ω), it is enough to prove that the right hand side of(21) is a lower bound for E ˜ f k ℓ P ( ˆ f n ( C L (Ω)) , ˜ f k ). We shall assume therefore that the datahave been generated from the true function ˜ f k . Note first that, as shown in the proofof inequality (53) in Theorem 3.1, inequality (51) leads to P (cid:26) ℓ P n ( ˆ f n ( C L (Ω) , ˜ f k ) ≥ ρ n (cid:27) ≥ − C d exp (cid:18) − n ( d − /d C d (cid:19) where ρ n is given by (52) and n ≥ N d,σ for a large enough constant N d,σ depending onlyon d and σ . A similar inequality will be shown below for ℓ P ( ˆ f n ( C L (Ω) , ˜ f k ). For every a >
0, we write P (cid:26) ℓ P ( ˆ f n ( C L (Ω)) , ˜ f k ) ≥ (cid:18) ρ n √ − a (cid:19) , ˆ f n ( C L (Ω)) ∈ C LL (Ω) (cid:27) = P (cid:26) ℓ P ( ˆ f n ( C LL (Ω)) , ˜ f k ) ≥ (cid:18) ρ n √ − a (cid:19) , ˆ f n ( C L (Ω)) ∈ C LL (Ω) (cid:27) ≥ P ( ℓ P n ( ˆ f n ( C LL (Ω)) , ˜ f k ) ≥ ρ n , sup f,g ∈C LL (Ω) ( ℓ P n ( f, g ) − ℓ P ( f, g )) ≤ a, ˆ f n ( C L (Ω)) ∈ C LL (Ω) ) ≥ P (cid:26) ℓ P n ( ˆ f n ( C LL (Ω)) , ˜ f k ) ≥ ρ n , ˆ f n ( C L (Ω)) ∈ C LL (Ω) (cid:27) − P ( sup f,g ∈C LL (Ω) ( ℓ P n ( f, g ) − ℓ P ( f, g )) ≤ a ) . (81)We now bound the probability: P (cid:26) ℓ P ( ˆ f n ( C L (Ω)) , ˜ f k ) ≥ (cid:18) ρ n √ − a (cid:19) , ˆ f n ( C L (Ω)) / ∈ C LL (Ω) (cid:27) . ur, G., Gao, F., Guntuboyina, A. and Sen, B./Convex Regression in multidimensions For this, we first make the following claim: f ∈ C L (Ω) , f / ∈ C LL (Ω) = ⇒ min (cid:16) ℓ P n ( f, ˜ f k ) , ℓ P ( f, ˜ f k ) (cid:17) > L. (82)To see (82), note that assumptions f ∈ C L (Ω) and f / ∈ C LL (Ω) together imply that f ( x ) > L for some x ∈ Ω. By the Lipschitz property of f , the fact that Ω has diameter ≤ f k is bounded by L , we have f ( y ) − ˜ f k ( y ) ≥ f ( x ) − L k y − x k − L > L for all y ∈ Ωwhich clearly implies that both ℓ P n ( f, ˜ f k ) and ℓ P ( f, ˜ f k ) are larger than L . This proves(82).Assume now that N d,σ is large enough so that ρ n is at most L for n ≥ N d,σ . The fact(82) clearly implies that P (cid:26) ℓ P ( ˆ f n ( C L (Ω)) , ˜ f k ) ≥ (cid:18) ρ n √ − a (cid:19) , ˆ f n ( C L (Ω)) / ∈ C LL (Ω) (cid:27) = P n ˆ f n ( C L (Ω)) / ∈ C LL (Ω) o = P (cid:26) ℓ P n ( ˆ f n ( C L (Ω)) , ˜ f k ) ≥ ρ n , ˆ f n ( C L (Ω)) / ∈ C LL (Ω) (cid:27) Combining the above with (81), we get P (cid:26) ℓ P ( ˆ f n ( C L (Ω)) , ˜ f k ) ≥ (cid:18) ρ n √ − a (cid:19)(cid:27) ≥ P (cid:26) ℓ P n ( ˆ f n ( C L (Ω) , ˜ f k ) ≥ ρ n (cid:27) + P ( sup f,g ∈C LL (Ω) ( ℓ P n ( f, g ) − ℓ P ( f, g )) ≤ a ) − . This inequality is analogous to inequality (54) in the proof of Theorem 3.1. 
From here,one can deduce E ˜ f k ℓ P ( ˆ f n ( C L (Ω)) , ˜ f k ) ≥ c d σn − /d (log n ) − d +1) /d . (83)in the same way that (57) was derived from (54). The only difference is that, insteadof (55), we now use the following result due to Bronˇste˘ın [9]:log N [ ] ( ǫ, C LL (Ω) , ℓ P ) ≤ log N ( ǫ, C LL (Ω) , ℓ ∞ ) ≤ C d (cid:18) Lǫ (cid:19) d/ (84)where, of course, ℓ ∞ refers to the metric ( f, g ) sup x ∈ Ω | f ( x ) − g ( x ) | (recall that ℓ ∞ covering numbers dominate bracketing numbers with respect to L ( P ) for everyprobability measure P ). (83) clearly completes the proof of Theorem 3.2.Let us now provide the proof of (51). This was proved in Theorem 3.1 on the basisof the inequalities (69) and (74). Below we shall establish (69) and (74) in the presentsetting with slight modification. From these, (51) will follow via the same argument ur, G., Gao, F., Guntuboyina, A. and Sen, B./Convex Regression in multidimensions used in Theorem 3.1. Note that the difference between the current proof and the proofof Theorem 3.1 is that G ˜ f k ( t ) is now defined as in (80) involving a supremum of g ∈ B C L (Ω) P n ( ˜ f k , t ) while, in the proof of Theorem 3.1, G ˜ f k ( t ) was defined as in (50) involvinga supremum of g ∈ B C L (Ω) P n ( ˜ f k , t ).Let us start with the proof of (69). For this, note first that (82) immediately implies B C L (Ω) P n ( ˜ f k , t ) = B C LL (Ω) P n ( ˜ f k , t ) for all t ≤ L. Let us assume therefore that t ≤ L so we can work with the class of bounded Lipschitzconvex functions C LL (Ω). We write G ˜ f k ( t ) ≤ G I ˜ f k ( t ) + G II ˜ f k ( t )where G I ˜ f k ( t ) := E sup g ∈ B C LL (Ω) P n ( ˜ f k ,t ) n n X i =1 I { X i ∈ ∪ mi =1 ∆ i } ξ i (cid:16) g ( X i ) − ˜ f k ( X i ) (cid:17) (cid:12)(cid:12)(cid:12)(cid:12) X , . . . , X n and G II ˜ f k ( t ) := E sup g ∈ B C LL (Ω) P n ( ˜ f k ,t ) n X i : X i / ∈∪ mi =1 ∆ i ξ i (cid:16) g ( X i ) − ˜ f k ( X i ) (cid:17) (cid:12)(cid:12)(cid:12)(cid:12) X , . . . , X n . Here ∆ , . . 
. , ∆ m are the d -simplices given by Lemma 8.3. We shall provide upperbounds for both G I ˜ f k ( t ) and G II ˜ f k ( t ). The bound for G I ˜ f k ( t ) is very similar to the bound(69) obtained for G ˜ f k ( t ) in the proof of Theorem 3.1 with the following two differences.Instead of the metric entropy bound (55), we use the result (84) due to Bronˇste˘ın [9]. In-equality (84) allows us to replace B C LL (Ω) P n ( ˜ f k , t ) by B C LL (Ω) P ( ˜ f k , t ) with high probability.Also, instead of (65), we shall use (which also follows from Theorem 4.4) N [ ] ( ǫ, n x g ( x ) I { x ∈ ∪ mi =1 ∆ i } : g ∈ B C LL (Ω) P ( ˜ f k , t ) o , L ( P )) ≤ C d k (cid:18) log C d Lǫ (cid:19) d +1 (cid:18) tǫ (cid:19) d/ . With these changes, following the proof of inequality (69) in Theorem 3.1 with B replaced by 4 L allows us to deduce that G I ˜ f k ( t ) ≤ C d tσ (cid:18) kn (cid:19) /d (cid:0) log( C d σ √ n ) (cid:1) d +1) /d (85)with probability at least1 − C exp (cid:18) − n d/ ( d +4) C d L / ( d +4) (cid:19) − exp − nt C d L (cid:18) kn (cid:19) /d (cid:0) log( C d σ √ n ) (cid:1) d +1) /d ! ur, G., Gao, F., Guntuboyina, A. and Sen, B./Convex Regression in multidimensions for every 0 < t ≤ L satisfying t ≥ C d n − / ( d +4) L d/ ( d +4) and t ≥ Lk − /d σ n (4 − d ) / (2 d ) . (86)We shall now bound G II ˜ f k ( t ). For this, let ˜ n := P ni =1 I { X i / ∈ ∪ mi =1 ∆ i } and use Dudley’sbound (Theorem 7.2) and (84) to write G II ˜ f k ( t ) = ˜ nn E sup g ∈ B C LL (Ω) P n ( ˜ f k ,t ) n X i : X i / ∈∪ mi =1 ∆ i ξ i (cid:16) g ( X i ) − ˜ f k ( X i ) (cid:17) (cid:12)(cid:12)(cid:12)(cid:12) X , . . . , X n ≤ σ ˜ nn inf δ> (cid:18) √ ˜ n Z ∞ δ q log N ( ǫ, C LL (Ω) , ℓ ∞ ) dǫ + 2 δ (cid:19) ≤ C d σ ˜ nn inf δ> √ ˜ n Z ∞ δ (cid:18) Lǫ (cid:19) d/ dǫ + δ ! . The choice δ = L (˜ n ) − /d then gives G II ˜ f k ( t ) ≤ C d L σn (˜ n ) − (2 /d ) . 
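The choice $\delta = L \tilde n^{-2/d}$ in the display above balances the two terms of Dudley's bound; under our reading of the garbled exponents ($\sqrt{\log N(\epsilon)} \asymp (L/\epsilon)^{d/4}$ with $d \ge 5$, so the tail integral converges), the computation is:

```latex
\frac{1}{\sqrt{\tilde n}} \int_{\delta}^{\infty} \Big( \frac{L}{\epsilon} \Big)^{d/4} d\epsilon
  \;=\; \frac{1}{\sqrt{\tilde n}} \cdot \frac{L^{d/4}\, \delta^{1 - d/4}}{d/4 - 1}
  \;\asymp\; \frac{L\, \tilde n^{-2/d + 1/2}}{\sqrt{\tilde n}}
  \;=\; L\, \tilde n^{-2/d}
  \qquad \text{for } \delta = L \tilde n^{-2/d},
```

so both terms of the bound are of order $L \tilde n^{-2/d}$, and multiplying by the prefactor $\sigma \tilde n / n$ gives $G^{II}_{\tilde f_k}(t) \le C_d L \sigma n^{-1} \tilde n^{1 - 2/d}$.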
(87)˜ n is binomially distributed with parameters n and ˜ p := Vol(Ω \ ( ∪ mi =1 ∆ i )) / Vol(Ω).Because (1 − C d k − /d )Ω ⊆ ∪ mi =1 ∆ i ⊆ Ω (see Lemma 8.3), we have˜ p ≤ − (1 − C d k − /d ) d ≤ dC d k − /d because (1 − u ) d ≥ − du . Hoeffding’s inequality: P { Bin( n, p ) ≤ np + u } ≥ − exp (cid:18) − u n (cid:19) for every u ≥ C d is such that ˜ p ≤ C d k − /d ) P (cid:8) ˜ n ≤ C d nk − /d (cid:9) ≥ P (cid:8) ˜ n − n ˜ p ≤ C d nk − /d (cid:9) ≥ − exp (cid:18) − C d nk − /d (cid:19) . Combining the above with (87), we get that G II ˜ f k ( t ) ≤ C d Lσn − /d k − /d k /d with probability at least 1 − exp( − C d nk − /d ). Combining this bound with the bound(85) obtained for G I ˜ f k ( t ), we get (below H ˜ f k ( t ) := G ˜ f k ( t ) − t / H ˜ f k ( t ) ≤ G ˜ f k ( t ) ≤ C d tσ (cid:18) kn (cid:19) /d (cid:0) log( C d σ √ n ) (cid:1) d +1) /d + C d Lσn − /d k − /d k /d (88) ur, G., Gao, F., Guntuboyina, A. and Sen, B./Convex Regression in multidimensions with probability at least1 − C exp − n dd +4 C d L d +4 ! − exp − nt C d L (cid:18) kn (cid:19) d (cid:0) log( C d σ √ n ) (cid:1) d +1) d ! − exp( − C d nk − d )(89)for every 0 ≤ t ≤ L satisfying (86). Under the condition t (cid:0) log( C d σ √ n ) (cid:1) d +1) /d ≥ Lk /d k − /d , the second term on the right hand side of (88) is dominated by the first term leadingto inequality (69).The next step is to prove a lower bound for H ˜ f k ( t ). Here the same argument as inthe proof of Theorem 3.1 applies and we can deduce that inequality (74) holds withprobability at least 1 − exp( − c n ) provided the conditions (75) and (76) are satisfied(note that t in (74) is given by (73)).We have thus proved inequalities (69) and (74). From these, we can follow the sameargument as in that proof to deduce (51). 
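The binomial step above combines the inequality $(1-u)^d \ge 1 - du$ with the one-sided Hoeffding bound $P\{\mathrm{Bin}(n,p) \ge np + u\} \le \exp(-2u^2/n)$; with the constants under our reading:

```latex
% Volume estimate: since (1 - C_d k^{-1/d})\Omega \subseteq \cup_i \Delta_i \subseteq \Omega,
\tilde p \;\le\; 1 - \big(1 - C_d k^{-1/d}\big)^d \;\le\; d\, C_d\, k^{-1/d}.
% Hoeffding with u = C_d' n k^{-1/d} (any constant C_d' exceeding d C_d):
P\big\{ \tilde n \le 2 C_d'\, n k^{-1/d} \big\}
  \;\ge\; P\big\{ \tilde n \le n \tilde p + C_d'\, n k^{-1/d} \big\}
  \;\ge\; 1 - \exp\big( -2 (C_d')^2\, n\, k^{-2/d} \big).
```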
The following additional constraint (which was not present in the proof of Theorem 3.1) needs to be checked here:
$$L\, k^{2/d^2} k^{-3/d} \big(\log(ek\sigma\sqrt n)\big)^{-2(d+1)/d} \;\le\; t_0 \;\le\; L,$$
where $t_0$ is defined in (79) and $k = \gamma_d \sqrt n\, \sigma^{-d/2}$ (for a large enough $\gamma_d$). This holds as long as $n$ is larger than a constant $N_{d,\sigma}$ depending on $d$ and $\sigma$ alone (note that $L$ is a constant depending on $d$ alone). Note also that the probability (89) has an additional term compared to (70), but this additional term $\exp(-C_d\, n k^{-2/d})$ is easily seen to be bounded by $\exp(-n^{(d-4)/d}/C_d)$ for $k = \gamma_d \sqrt n\, \sigma^{-d/2}$, provided $n \ge N_{d,\sigma}$ for a large enough constant $N_{d,\sigma}$. The proof of (51) is thus complete.
9. Proofs of Metric Entropy Results
First, we state the key lemma; its proof is deferred to Subsection 9.2.
Lemma 9.1. Let $\Omega$ be a $d$-dimensional convex body of the form $\Omega = \{x \in \mathbb{R}^d : a_i \le v_i^\top x \le b_i,\ 1 \le i \le d+1\}$, where the $v_i$ are fixed unit vectors. Then for any $0 < \varepsilon < 1$, there exists a set $G$ consisting of no more than $\exp\big(C\, |\Omega|^{d/(2p)}\, [\log(\Gamma/\varepsilon)]^{d+1}\, (t/\varepsilon)^{d/2}\big)$ brackets such that for every
$$f \in B_p^\Gamma(\Omega, t) := \{f \text{ convex on } \Omega :\ \|f\|_p \le t,\ \|f\|_\infty \le \Gamma\},$$
there exists a bracket $[g, h] \in G$ such that $g(x) \le f(x) \le h(x)$ for all $x \in \Omega$, and
$$\int_\Omega |h(x) - g(x)|^p\, dx < \varepsilon^p.$$
Now we are ready to prove Theorem 4.4.
Proof of Theorem 4.4.
Assume $\Omega = \cup_{i=1}^m \Delta_i$, where $\Delta_i$, $1 \le i \le m$, are $d$-simplices. For each $f \in B_p^\Gamma(f_0, \Omega, t)$, we define $t_i(f)$ as the smallest positive integer $t_i$ such that
$$\int_{\Delta_i} |f(x) - f_0(x)|^p\, dx \le t_i^p\, t^p\, |\Delta_i|.$$
Because $|f - f_0| \le 2\Gamma$, we have $t_i \le 2\Gamma/t$. Thus, there are no more than $(2\Gamma/t)^m$ choices of the sequence $t_1, t_2, \ldots, t_m$. For every such sequence $T = \{t_1, t_2, \ldots, t_m\}$, we define
$$\mathcal{F}_T = \Big\{ f \in B_p^\Gamma(f_0, \Omega, t) : (t_i - 1)^p\, t^p\, |\Delta_i| \le \int_{\Delta_i} |f(x) - f_0(x)|^p\, dx \le t_i^p\, t^p\, |\Delta_i|,\ 1 \le i \le m \Big\}.$$
Thus, for every $f \in \mathcal{F}_T$, we have
$$\sum_{i=1}^m (t_i - 1)^p\, t^p\, |\Delta_i| \le \sum_{i=1}^m \int_{\Delta_i} |f(x) - f_0(x)|^p\, dx \le t^p,$$
i.e. $\sum_{i=1}^m (t_i - 1)^p |\Delta_i| \le$
1. Hence, using $t_i^p \le 2^{p-1}[(t_i - 1)^p + 1]$ and $\sum_i |\Delta_i| = |\Omega| \le 1$,
$$\sum_{i=1}^m t_i^p |\Delta_i| \le 2^{p-1} \sum_{i=1}^m [(t_i - 1)^p + 1]\, |\Delta_i| \le 2^p.$$
Furthermore, for each $f \in \mathcal{F}_T$ and $1 \le i \le m$, the restriction of $f - f_0$ to $\Delta_i$ belongs to $B_p(\Delta_i, t_i t |\Delta_i|^{1/p})$ (since $f_0$ is linear on each $\Delta_i$). Since each simplex can be written as an intersection of $d+1$ slabs, by Lemma 9.1 there exists a set $\mathcal{G}_i$ consisting of no more than $\exp\big(C(d,p)\, [\log(2\Gamma/\varepsilon_i)]^{d+1}\, t_i^{d/2}\, |\Delta_i|^{d/(2p)}\, t^{d/2}\, \varepsilon_i^{-d/2}\big)$ brackets, such that for each $f \in \mathcal{F}_T$, there exists a bracket $[g_i, h_i] \in \mathcal{G}_i$ with $g_i(x) + f_0(x) \le f(x) \le h_i(x) + f_0(x)$ for all $x \in \Delta_i$, and $\int_{\Delta_i} |h_i(x) - g_i(x)|^p\, dx \le \varepsilon_i^p$. Define $g(x) = g_i(x) + f_0(x)$ and $h(x) = h_i(x) + f_0(x)$ for $x \in \Delta_i$, $1 \le i \le m$. Then we clearly have $g(x) \le f(x) \le h(x)$ for all $x \in \Omega$, and
$$\int_\Omega |h(x) - g(x)|^p\, dx = \sum_{i=1}^m \int_{\Delta_i} |h_i(x) - g_i(x)|^p\, dx \le \sum_{i=1}^m \varepsilon_i^p.$$
We choose
$$\varepsilon_i = \max\big(2^{-1-2/p}\, t_i\, |\Delta_i|^{1/p},\ (4m)^{-1/p}\big) \cdot \varepsilon,$$
where we used the fact that if $|f - f_0| \le$
2Γ, then the terms for which the floor $(4m)^{-1/p}\varepsilon$ is active contribute at most $m \cdot (4m)^{-1} \varepsilon^p = \varepsilon^p/4$ in total. Consequently,
$$\sum_{i=1}^m \varepsilon_i^p \le \varepsilon^p \Big( \frac{1}{2^{p+2}} \sum_{i=1}^m t_i^p |\Delta_i| + \frac14 \Big) \le \varepsilon^p.$$
Thus, $[g, h]$ is an $\varepsilon$-bracket.

Note that for each fixed $T$, the total number of brackets $[g, h]$ is at most
$$N := \prod_{i=1}^m \exp\big(C(d,p)\, [\log(2\Gamma/\varepsilon_i)]^{d+1}\, t_i^{d/2}\, |\Delta_i|^{d/(2p)}\, t^{d/2}\, \varepsilon_i^{-d/2}\big) \le \prod_{i=1}^m \exp\Big(C(d,p)\, \big[\tfrac1p \log(4m) + \log\Gamma + \log(1/\varepsilon)\big]^{d+1} (2^{1+2/p}\, t/\varepsilon)^{d/2}\Big) \le \exp\big(C'(d,p)\, m\, [\log m + \log\Gamma + \log(1/\varepsilon)]^{d+1}\, (t/\varepsilon)^{d/2}\big).$$
Combining this with all the possible choices of $T$, the number of realizations of the brackets $[g, h]$ is at most
$$(2\Gamma/t)^m \cdot N \le \exp\big(C''(d,p)\, m\, [\log m + \log\Gamma + \log(1/\varepsilon)]^{d+1}\, (t/\varepsilon)^{d/2}\big),$$
and the claim follows.

Our starting point is the following two results, proved in Lemma 5 and Theorem 1(ii) of [20] respectively:
Proposition 9.2. If $\Omega$ is a convex body in $[0,1]^d$ with volume $|\Omega| \ge 1/d!$, then for any $0 < \delta < 1$, there exists a constant $\Lambda$ depending only on $d$, $p$ and $\delta$, such that $\mathcal{C}_p(\Omega) \subset \mathcal{C}_\infty(\Omega_\delta, \Lambda)$, where $\Omega_\eta = \{x \in \Omega : \mathrm{dist}(x, \partial\Omega) \ge \eta\}$ and $\mathcal{C}_p(\Omega) := \{f \text{ convex on } \Omega,\ \|f\|_p \le 1\}$.

Proposition 9.3. If $\Omega$ is a convex body that can be triangulated into $m$ simplices of dimension $d$, then there exists a constant $C$ depending only on $d$ and $p$ such that for all $0 < \varepsilon < 1$, we have
$$\log N_{[\,]}(\varepsilon, \mathcal{C}_\infty(\Omega), \|\cdot\|_p) \le C\, m\, |\Omega|^{d/(2p)}\, \varepsilon^{-d/2},$$
where $\|f\|_p = \big(\int_\Omega |f(x)|^p\, dx\big)^{1/p}$.

Using the last two propositions, we obtain:
Corollary 9.4.
Let $\Omega \subset [0,1]^d$ be a $d$-dimensional convex body defined by $\Omega = \{x \in \mathbb{R}^d : a_i \le v_i^\top x \le b_i,\ 1 \le i \le m\}$, where the $v_i$ are fixed unit vectors and $m \ge d+1$. Then for any $0 < \eta < 1/2$, $t > 0$ and any $0 < \varepsilon < t$, the following holds:
$$\log N_{[\,]}(\varepsilon, \mathcal{C}_p(\Omega, t), \|\cdot\|_{L_p(\Omega_\eta)}) \le C_0\, m\, (t/\varepsilon)^{d/2},$$
where $\mathcal{C}_p(\Omega, t) := \{f \text{ convex on } \Omega,\ \|f\|_p \le t\}$, $C_0$ is a constant depending only on $d$, $p$ and $\eta$, and
$$\Omega_\eta := \{x \in \mathbb{R}^d : a_i + \eta(b_i - a_i) \le v_i^\top x \le b_i - \eta(b_i - a_i),\ 1 \le i \le m\}.$$

Proof.
Observe that on $\Omega_\eta$ we have $\|f\|_\infty \le C(\eta, d)\, t\, |\Omega|^{-1/p}$. To see this, suppose for contradiction that this fails. Arguing as in the proof of Proposition 9.2, there is then a set of volume $c(\eta, d)\, |\Omega|$ (a "cap/corner") on which $f \ge t |\Omega|^{-1/p}$, contradicting the definition of $\mathcal{C}_p(\Omega, t)$. Applying Proposition 9.3, rescaled by $t |\Omega|^{-1/p}$, then gives the corollary.

We will prove that if we replace $\mathcal{C}_p(\Omega, t)$ by $B_p^\Gamma(\Omega, t) = \{f \in \mathcal{C}_p(\Omega, t) : \|f\|_\infty \le \Gamma\}$, then we can replace $\Omega_\eta$ by $\Omega$ at the cost of logarithmic factors in the rate of bracketing entropy.

For any domain $D$ given by an intersection of $d+1$ "slabs", $D := \{x \in \mathbb{R}^d : a_i \le v_i^\top x \le b_i,\ 1 \le i \le d+1\}$ with the $v_i$ fixed unit vectors, and for every $0 \le r \le d+1$, we define the operator $T_r$ by
$$T_r(D) = \{x \in D : a_i \le v_i^\top x \le b_i \text{ for } i \le r;\ a_j + \eta(b_j - a_j) \le v_j^\top x \le b_j - \eta(b_j - a_j) \text{ for } r < j \le d+1\}.$$
Thus $T_{d+1}(D) = D$, while $T_0(D)$ shrinks every slab, and the induction below interpolates between these two extremes. Now, we are ready to prove the lemma.
Proof of Lemma 9.1.
Because the desired inequality is invariant under affine transformations, we may assume that $\Omega$ is contained in $[0,1]^d$ and $|\Omega| \ge 1/d!$. Fix $0 < \eta < 1/5$. We prove the following: there exist two constants $C_1(d,p)$ and $C_2(d,p)$ such that for all $r = 0, 1, \ldots, d+1$,
$$\log N_{[\,]}(\varepsilon, B_p^\Gamma(\Omega, t), \|\cdot\|_{L_p(T_r(\Omega))}) \le C_1\, [C_2 \log(\Gamma/\varepsilon)]^r\, |\Omega|^{d/(2p)}\, t^{d/2}\, \varepsilon^{-d/2}.$$
We prove the statement by induction on $r$. Clearly,
$$T_0(\Omega) = \Omega_\eta := \{x \in \mathbb{R}^d : a_j + \eta(b_j - a_j) \le v_j^\top x \le b_j - \eta(b_j - a_j) \text{ for } 1 \le j \le d+1\}.$$
By Corollary 9.4 and the assumptions on $\Omega$, the statement is true when $r = 0$. Suppose the statement is true for $r = k -$
1. We define K = T k − (Ω) = { x ∈ T k (Ω) : a k + η ( b k − a k ) ≤ v Tk x ≤ b k − η ( b k − a k ) } . For s = 1 , , . . . , m define K s +1 = { x ∈ T k (Ω) : a k + 2 − s − η ( b k − a k ) ≤ v Tk x < a k + 2 − s η ( b k − a k ) } ,K s +2 = { x ∈ T k (Ω) : b k − − s η ( b k − a k ) < v Tk x ≤ b k − − s − η ( b k − a k ) } , Furthermore, we define K L = { x ∈ T k (Ω) : a k ≤ v Tk x < a k + 2 − m − η ( b k − a k ) } ,K R = { x ∈ T k (Ω) : b k − − m − η ( b k − a k ) < v Tk x ≤ b k } . ur, G., Gao, F., Guntuboyina, A. and Sen, B./Convex Regression in multidimensions Then, K , K L , K R , K s +1 and K s +2 , 1 ≤ s ≤ m form a partition of T k (Ω). Now, weaim to use the induction step. For this purpose we denote the inflated sets b K s +1 = { x ∈ Ω : a k + 2 − s − η ( b k − a k ) ≤ v Tk x < a k + 3 · − s − η ( b k − a k ) } , b K s +2 = { x ∈ Ω : b k − · − s − η ( b k − a k ) < v Tk x ≤ b k − − s − η ( b k − a k ) } . Now, we will apply the operator T k − on these sets: T k − ( b K s +1 ) = { x ∈ b K s +1 : a i ≤ v Ti x ≤ b i for i ≤ k − ,a j + η ( b j − a j ) ≤ v Tj x ≤ b j − η ( b j − a j ) for k + 1 ≤ j ≤ d + 1; a k + 2 − s − η ( b k − a k ) + η (5 · − s − ( b k − a k )) ≤ v Tk x< a k + 3 · − s − η ( b k − a k ) − η (5 · − s − ( b k − a k )) }⊃ K s +1 , provided that 5 η <
1. Similarly, T k − ( b K s +2 ) ⊃ K s +2 .We choose m so that Γ p | K L | ≤ − p ε p , and Γ p | K R | ≤ − p ε p in a way that ( R K L f ( x ) p dx ) /p and ( R K R f ( x ) p dx ) /p are negligible. This can be done by choosing m = C ( d, Γ) log(Γ /ε ).To see this, observe that K R , K L are slabs with width 5 η · − ( m − intersected with theunit cube. Thus, their volume can be bounded by C ( d ) · − ( m − η , which implies that (cid:18)Z K L f ( x ) p dx (cid:19) /p ≤ C ( d )2 − ( m − η Γ ≤ ǫ/ . For every f ∈ B Γ p (Ω , t ), we define t i as the smallest integer that satisfies the following R b K i | f ( x ) | p dx ≤ | b K i | t pi t p . Because any point in Ω is contained in b K i for at most threedifferent i , we have m +2 X i =0 ( t i − p t p | b K i | ≤ t p . (90)This implies that m +2 X i =0 t pi | b K i | ≤ p − m +2 X i =0 [( t i − p + 1] | b K i | ≤ · p , where we used the fact that P m +2 i =0 | b K i | ≤ | Ω | ≤
3. Since k f k ∞ ≤ Γ, we have t i ≤ Γ /t . Thus, the total number of choices of the sequence t , t , t , . . . , t m +2 is at most(Γ /t ) m +3 . For each ordered sequence T = { t , t , t , . . . , t m +2 } satisfying (90), we define F T = (cid:26) f ∈ B p (Ω , t ) : ∀ ≤ i ≤ m + 2 , ( t i − p t p | b K i | < Z b K i | f ( x ) | p dx ≤ t pi t p | b K i | (cid:27) . ur, G., Gao, F., Guntuboyina, A. and Sen, B./Convex Regression in multidimensions Then, f ∈ B p ( b K i , t i t | b K i | /p ). Clearly, for all 0 ≤ i ≤ m + 2, K i satisfies the inductionassumption. Therefore, for any 0 ≤ i ≤ m + 2, there exists two sets G i and G i ,each consisting of exp( C [ C log(Γ /ε )] k − | K i | d/ p t d/ i ε − d/ i ) elements, such that for every f ∈ B p ( b K i , t i t | b K i | /p ), there exists g i ∈ G i and g i ∈ G i such that g i ( x ) ≤ f ( x ) ≤ g i ( x )for all x ∈ K i , and Z K i | g i ( x ) − g i ( x ) | p ≤ ε pi . We define b f ( x ) = g i ( x ) if x ∈ K i , and b f ( x ) = − Γ for x ∈ K L ∪ K R . Similarly,we define b g ( x ) = g i ( x ) if x ∈ K i , and b g ( x ) = Γ for x ∈ K L ∪ K R . Then, we have b f ( x ) ≤ f ( x ) ≤ b g ( x ) for all x ∈ T k (Ω) and Z T k (Ω) | b g ( x ) − b f ( x ) | p dx ≤ m +2 X i =0 Z K i | g i ( x ) − g i ( x ) | p dx + Z K L ∪ K R | b g ( x ) − b f ( x ) | p dx ≤ m +2 X i =0 ε pi + 12 ε p . If we choose ε i = 12 · /p t i | b K i | /p ε, then, m +2 X i =0 ε pi = 16 · p m +2 X i =0 t pi | b K i | ε p ≤ ε p . This implies that Z T k (Ω) | b g ( x ) − b f ( x ) | p dx ≤ ε p . Now, let us count the number of possible realizations of the brackets [ b f , b g ]. For eachfixed T , the number of choices of the brackets [ b f , b g ] is at most N := m +2 Y i =0 exp( C [ C log(Γ /ε )] k − | c K i | d/ p t d/ i t d/ ε − d/ i )= exp( C [ C log(Γ /ε )] k − (2 m + 3)( t/ε ) d/ ) ≤ exp (cid:0) C [log(Γ /ε )] k ( t/ε ) d/ (cid:1) . 
The total number of ε -brackets under the L p ( T k (Ω)) distance needed to cover B Γ p (Ω , t )is then bounded by (Γ /t ) m +3 · N ≤ exp (cid:0) C [log(Γ /ε )] k ( t/ε ) d/ (cid:1) provided the constant C is large enough, and Lemma 9.1 follows. ur, G., Gao, F., Guntuboyina, A. and Sen, B./Convex Regression in multidimensions Let Ω be a convex body, and let c be the center of its John ellipsoid (i.e., the uniqueellipsoid of maximum volume contained in Ω). For any λ >
0, define $\Omega_\lambda = c + \lambda(\Omega - c)$. It is clear that $|\Omega_\lambda| = \lambda^d |\Omega|$, where $|\Omega|$ denotes the volume of $\Omega$. The goal of this subsection is to prove the following proposition, which is a "discrete" analogue of Corollary 9.4.
Proposition 9.5.
Let $S$ be the regular $d$-dimensional $\delta$-grid, and let $\Omega$ be a convex body in $\mathbb{R}^d$. For any $t > 0$ and any $0 < \varepsilon < 1$, there exists a set $\mathcal{N}$ consisting of $\exp(\gamma_d \cdot (t/\varepsilon)^{d/2})$ functions such that for every $f \in B_S^q(0; t; \Omega)$, there exists $g \in \mathcal{N}$ satisfying $|f(x) - g(x)| < \varepsilon$ for all $x \in \Omega_{0.9}$, where $\gamma_d$ is a constant depending only on $d$. To prove Proposition 9.5, we need some preparations.
Lemma 9.6.
Let $S$ be a regular $\delta$-grid on $\mathbb{R}^d$, and let $\Omega$ be a convex body containing a ball of radius $r \ge d^{3/2}\delta$. We have
$$\tfrac12\, |\Omega|\, \delta^{-d} \le \#(\Omega \cap S) \le 2\, |\Omega|\, \delta^{-d}.$$

Proof.
Let s , . . . , s n be the grid points contained in Ω. We have n [ i =1 ( s i + [ − δ/ , δ/ d ) ⊂ Ω + [ − δ/ , δ/ d ⊂ Ω + √ dδ B d . Note that when Ω contains a ball of radius r , | Ω + √ dδ B d | ≤ √ dδ r ! d | Ω | ≤ (cid:18) d (cid:19) /d | Ω | ≤ | Ω | . Volume comparison gives us n ≤ | Ω | δ − d . On the other hand, let U be the union of the cubes s i + [ − δ/ , δ/ d . The volume of U is nδ d . Since the union of s i + [ − δ, δ ] d covers Ω, we have U + [ − δ/ , δ/ d ⊃ Ω. Inparticular, U contains the set { x ∈ Ω : dist( x, ∂ Ω) ≥ √ dδ/ } . Since Ω contains a ball of radius r . If we let c be the center of this ball, and define b Ω = c + − √ dδ r ! (Ω − c ) , ur, G., Gao, F., Guntuboyina, A. and Sen, B./Convex Regression in multidimensions then the distance between any x ∈ b Ω and ∂ Ω is at least √ dδ/
2. Hence U ⊃ b Ω. Conse-quently n = | U | δ − d ≥ | b Ω | δ − d = − √ dδ r ! d | Ω | δ − d ≥ | Ω | δ − d . This finishes the proof of the lemma.
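The volume comparison $|\Omega + \sqrt d\,\delta B^d| \le (1 + \sqrt d\,\delta/r)^d\,|\Omega|$ implicitly used above follows from a containment argument; writing $B(c,r) \subseteq \Omega$ for the assumed ball and $\varepsilon = \sqrt d\,\delta$:

```latex
% Since \varepsilon B^d = (\varepsilon/r) \cdot r B^d \subseteq (\varepsilon/r)(\Omega - c),
\Omega + \varepsilon B^d \;\subseteq\; \Omega + \tfrac{\varepsilon}{r}(\Omega - c)
  \;\subseteq\; c + \big( 1 + \tfrac{\varepsilon}{r} \big)(\Omega - c),
% where the second inclusion holds because, for x, y \in \Omega,
% (x - c) + (\varepsilon/r)(y - c)
%   = (1 + \varepsilon/r)\big[ \tfrac{1}{1+\varepsilon/r}(x - c)
%     + \tfrac{\varepsilon/r}{1+\varepsilon/r}(y - c) \big] \in (1+\varepsilon/r)(\Omega - c)
% by convexity of \Omega. Taking volumes gives the stated bound.
```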
Lemma 9.7.
Let $S$ be a regular $d$-dimensional $\delta$-grid, and let $\Omega \subset [0,1]^d$ be a convex body that contains a ball of radius at least $d^{3/2}\delta$. Then for every $f \in B_S^q(0, t, \Omega)$, $f \ge -20\, d\, t$.

Proof. Let $x_0$ be the minimizer of $f$ on $\Omega$. If $f(x_0) \ge$
0, then there is nothing to prove;otherwise, the set K := { x ∈ Ω | f ( x ) ≤ } is a closed convex set containing x . Denote K t = x + t ( K − x ), and let b K = K σ \ K − σ , where σ = (10 d ) − . We show that for all x ∈ Ω \ b K , | f ( x ) | ≥ σ | f ( x ) | . Indeed, if we define a function g on Ω so that g ( x ) = f ( x ), g ( γ ) = f ( γ ) for all γ ∈ ∂K , and g is linear on L γ := { x = x + t ( γ − x ) ∈ Ω | t ≥ } .Then, by the convexity of f on each L γ , we have | f ( x ) | ≥ | g ( x ) | on Ω. Thus, for all x ∈ Ω \ b K , | f ( x ) | ≥ | g ( x ) | = | g ( γ ) | + k x − γ kk x − γ k | f ( x ) | ≥ σ | f ( x ) | . Next, we show that most of the grid points in Ω are outside b K . Indeed, If s is a gridpoint in b K , then s + [ − δ/ , δ/ d ⊂ K σ ∩ Ω + [ − δ/ , δ/ d and at least one half ofthe cube s + [ − δ/ , δ/ d lies outside K − σ . Thus, the number of grid points in b K isbounded by 2 | ( K σ ∩ Ω + [ − δ/ , δ/ d ) \ K − σ | δ − d . Since | ( A + B ) \ A | can be expressed as a sum of products of mixed volumes of A and B , and smaller sets have smaller mixed volumes, we have | ( A + B ) \ B | ≤ | ( C + D ) \ D | for all convex sets C ⊃ A and D ⊃ B . Applying this inequality for A = C =[ − δ/ , δ/ d , B = ( K σ ∩ Ω and D = [ − , d , we have | ( K σ ∩ Ω+[ − δ/ , δ/ d ) \ K σ ∩ Ω | ≤ | ([ − , d +[ − δ/ , δ/ d ) \ [ − , d | = (2+ δ ) d − d , while | K σ ∩ Ω \ K − σ | ≤ h − (cid:0) − σ σ (cid:1) d i | K σ ∩ Ω | , we have | ( K σ ∩ Ω+[ − δ/ , δ/ d ) \ K − σ | ≤ " − (cid:18) − σ σ (cid:19) d | K σ ∩ Ω | +(2+ δ ) d − d ≤ dσ | Ω | . Thus, the number of grid points in b K is bounded by6 dσ | Ω | δ − d ≤ dσ · · S ∩ Ω) ≤ dσ · S ∩ Ω) . ur, G., Gao, F., Guntuboyina, A. and Sen, B./Convex Regression in multidimensions Hence, S ∩ Ω) · t q ≥ X s ∈ S ∩ (Ω \ b K ) | f ( s ) | q ≥ (1 − dσ ) · S ∩ Ω) · ( σ | f ( x ) | ) q , which implies that f ( x ) ≥ − /q σ − t ≥ − dt by using σ = (10 d ) − . Lemma 9.8.
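The inequality $|f(x)| \ge \sigma |f(x_0)|$ for $x \in \Omega \setminus \widehat K$, asserted above (with $\widehat K = K_{1+\sigma} \setminus K_{1-\sigma}$ under our reading of the garbled definition), is a one-dimensional convexity computation along the ray from $x_0$ through $x$; write $x = x_0 + s(\gamma - x_0)$ with $\gamma \in \partial K$, so that $f(\gamma) = 0$ and $f(x_0) < 0$:

```latex
% If s \le 1 - \sigma (x inside K_{1-\sigma}): convexity on [x_0, \gamma] gives
f(x) \;\le\; (1 - s) f(x_0) + s f(\gamma) \;=\; (1 - s) f(x_0) \;\le\; \sigma f(x_0) \;<\; 0,
% hence |f(x)| \ge \sigma |f(x_0)|.
% If s \ge 1 + \sigma (x outside K_{1+\sigma}): now \gamma = x_0 + (1/s)(x - x_0)
% lies on the segment [x_0, x], so convexity gives
0 \;=\; f(\gamma) \;\le\; \big(1 - \tfrac1s\big) f(x_0) + \tfrac1s f(x)
\;\;\Longrightarrow\;\; f(x) \;\ge\; (s - 1)\,|f(x_0)| \;\ge\; \sigma\,|f(x_0)|.
```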
Suppose that a convex body $\Omega$ contains $n$ grid points of a regular $\delta$-grid in $\mathbb R^d$ and a ball of radius at least $400 d^{3/2}\delta$. Then, for any point $P$ on the boundary of $\Omega_{0.9}$, any hyperplane passing through $P$ cuts $\Omega$ into two parts, and the part that does not contain the center of the John ellipsoid of $\Omega$ as an interior point contains at least $(20d)^{-d-1}\, n$ grid points.

Proof. Fix a point $P$ on the boundary of $\Omega_{0.9}$ and a hyperplane passing through $P$; it cuts $\Omega$ into two parts. Let $L$ be a part that does not contain the center of the John ellipsoid of $\Omega$ in its interior. We first prove that $|L| \ge (20d)^{-d}|\Omega|$. Because the ratio $|L|/|\Omega|$ is invariant under affine transformations, we estimate $|TL|/|T\Omega|$, where $T$ is an affine transformation such that the John ellipsoid of $T\Omega$ is the unit ball $B^d$. It is then known that $T\Omega$ is contained in a ball of radius $d$. Because the distance from the boundary of $(T\Omega)_{0.9}$ to the boundary of $T\Omega$ is at least $0.1$, the set $TL$ contains half of the ball with center $TP$ and radius $0.1$. Thus, $|TL| \ge \tfrac12 (0.1)^d |B^d|$. Since $T\Omega$ is contained in the ball of radius $d$, we have $|T\Omega| \le d^d |B^d|$. This implies $|TL| \ge \tfrac12(10d)^{-d}|T\Omega| \ge (20d)^{-d}|T\Omega|$, and hence $|L| \ge (20d)^{-d}|\Omega|$.

Because the John ellipsoid of $\Omega$ contains a ball of radius at least $400 d^{3/2}\delta$, the distance from the boundary of $\Omega_{0.9}$ to the boundary of $\Omega$ is at least $40 d^{3/2}\delta$. Thus, $L$ contains a ball of radius at least $10 d^{3/2}\delta$. By Lemma 9.6, the number of grid points in $L$ is at least $\tfrac12|L|\delta^{-d} \ge \tfrac12(20d)^{-d}|\Omega|\delta^{-d}$. The statement of Lemma 9.8 then follows by using Lemma 9.6 one more time, to bound $|\Omega|\delta^{-d}$ from below by $n/2$.

Lemma 9.9.
Let $S$ be the regular $d$-dimensional $\delta$-grid, and let $\Omega$ be a convex body in $\mathbb R^d$ containing a ball of radius $400 d^{3/2}\delta$. For every $f \in B_S^q(0,t,\Omega)$, we have $f(x) \le (20d)^{(d+1)/q}\, t$ for all $x \in \Omega_{0.9}$.

Proof. Let $z$ be a maximizer of $f$ on $\Omega_{0.9}$. By the convexity of $f$, $z$ may be taken on the boundary of $\Omega_{0.9}$. If $f(z) \le 0$, there is nothing to prove, so we assume $f(z) > 0$. The convexity of $f$ implies that $z$ lies on the boundary of the convex set $K := \{x \in \Omega : f(x) \le f(z)\} \supset \Omega_{0.9}$. There exists a hyperplane supporting $K$ at $z$, so that $K$ lies entirely on one side of it. This hyperplane cuts $\Omega$ into two parts; let $L$ be the part that does not contain $K$. Since $\Omega_{0.9} \subset K$, the set $L$ does not contain the center of the John ellipsoid of $\Omega$ as an interior point, and $f(x) \ge f(z)$ for all $x \in L$. By Lemma 9.8,
\[ \#(L\cap S) \ge (20d)^{-d-1}\cdot \#(\Omega\cap S). \]
Since $f(x) \ge f(z) > 0$ for $x \in L$, we have
\[ \#(L\cap S)\cdot f(z)^q \le \sum_{s\in L\cap S} |f(s)|^q \le \sum_{s\in\Omega\cap S} |f(s)|^q \le \#(\Omega\cap S)\cdot t^q. \]
This implies that $f(z) \le (20d)^{(d+1)/q}\, t$.

Lemma 9.10 (Bronshtein). There exists a constant $\beta$ depending only on $d$ such that for any $\varepsilon > 0$, any $M > 0$, and any convex set $\Omega \subset B^d$, there exists a set $\mathcal G$ consisting of at most $\exp(\beta(M/\varepsilon)^{d/2})$ functions, such that for any convex function $f$ on $\Omega$ that is bounded by $M$ and has Lipschitz constant bounded by $M$, there exists $g \in \mathcal G$ with $|f(x) - g(x)| < \varepsilon$ for all $x \in \Omega$.

Proof of Proposition 9.5
Suppose first that $\Omega$ contains $N \le 400d$ grid points; denote them by $s_1, s_2, \ldots, s_N$. Then $\{(f(s_1), f(s_2), \ldots, f(s_N)) : f \in B_S^q(0,t,\Omega)\}$ is a subset of
\[ \{(x_1, x_2, \ldots, x_N) : |x_1|^q + |x_2|^q + \cdots + |x_N|^q \le N t^q\}, \]
i.e., of the $\ell_q^N$-ball of radius $N^{1/q}t$. By volume comparison, this ball can be covered by no more than $(1 + 2t/\varepsilon)^N$ $\ell_q^N$-balls of radius $N^{1/q}\varepsilon$. Thus, by choosing $\gamma_d \ge 400d$, the statement of Proposition 9.5 is true when $\Omega$ contains no more than $400d$ grid points.

The remaining case we prove by induction on the dimension. If $d = 1$ and $\Omega$ contains more than $400$ grid points, then by Lemma 9.7 and Lemma 9.9 we have $-20t \le f(x) \le 400t$ for all $x \in \Omega_{0.9}$. Let $T$ be a linear transformation that maps the interval $\Omega_{0.9}$ to the interval $[-1,1]$, so that $f\circ T^{-1}$ is a convex function on $[-1,1]$ satisfying $-20t \le f(T^{-1}x) \le 400t$ for all $x \in [-1,1]$. By the convexity of $f\circ T^{-1}$, for $x \in [-0.9, 0.9]$,
\[ |(f\circ T^{-1})'(x)| \le \max\left\{ \frac{|f\circ T^{-1}(1) - f\circ T^{-1}(0.9)|}{|1-0.9|},\ \frac{|f\circ T^{-1}(-1) - f\circ T^{-1}(-0.9)|}{|(-1)-(-0.9)|} \right\} \le 4200\, t. \]
By Lemma 9.10, there exists a set $\mathcal G$ consisting of no more than $\exp(\beta_1\sqrt{t/\varepsilon})$ functions such that for each $f \in B_S^q(0;t;\Omega)$ there exists $g \in \mathcal G$ satisfying $|f\circ T^{-1}(x) - g(x)| < \varepsilon$ for all $x \in [-0.9, 0.9]$. Since $T\Omega_{0.81} \subset (T\Omega_{0.9})_{0.9} = [-0.9, 0.9]$, we get $|f(z) - g\circ T(z)| < \varepsilon$ for all $z \in \Omega_{0.81}$. Thus, the statement of the proposition is true with $\mathcal N = \{g\circ T : g \in \mathcal G\}$ if we choose $\gamma_1 = \max(400, \beta_1)$.

Suppose now the statement is true in all dimensions $d < k$, and consider the case $d = k$. Suppose first that the minimum number of parallel hyperplanes needed to cover all the grid points in $\Omega$ is more than $400d$. Then the lattice width $w(\Omega, S)$ is at least $400d$. Let $\mu(\Omega, S)$ be the covering radius, i.e., the smallest number $\mu$ such that $\mu\Omega + S \supset \mathbb R^d$. By Khinchine's flatness theorem (cf. [5], [6]), we have $w(\Omega,S)\cdot\mu(\Omega,S) \le C d^{3/2}$, which implies that the covering radius of $\Omega$ is at most $Cd^{3/2}(400d)^{-1}$. Thus, $\Omega$ contains a cube of side length $C^{-1}d^{3/2}\delta$, and hence a ball of radius at least $400 d^{3/2}\delta$ by the choice of constants, so that all the previous lemmas are applicable to $\Omega$. By Lemma 9.7 and Lemma 9.9, we have $-20dt \le f(x) \le (20d)^{d+1}t$ for all $x \in \Omega_{0.9}$. Let $T$ be an affine transformation such that the John ellipsoid of $T\Omega_{0.9}$ is the unit ball $B^d$. Because $\Omega_{0.81} \subset (\Omega_{0.9})_{0.9}$, by the proof of Lemma 9.8 the distance between the boundary of $T(\Omega_{0.81})$ and the boundary of $T(\Omega_{0.9})$ is at least $0.1$. Define the convex function $\tilde f$ on $T(\Omega)$ by $\tilde f(y) = f(T^{-1}(y))$. Then $-20dt \le \tilde f(y) \le (20d)^{d+1}t$ for all $y \in T(\Omega_{0.9})$. For any $u, v \in T(\Omega_{0.81})$, assume without loss of generality that $\tilde f(u) \le \tilde f(v)$, and consider the half-line starting from $u$ and passing through $v$. Suppose the half-line intersects the boundary of $T(\Omega_{0.81})$ at $p$ and the boundary of $T(\Omega_{0.9})$ at $q$. By the convexity of $\tilde f$ on this half-line, we have
\[ 0 \le \frac{\tilde f(v) - \tilde f(u)}{\|v-u\|} \le \frac{|\tilde f(q) - \tilde f(p)|}{\|q-p\|} \le 10\big[(20d)^{d+1}t + 20dt\big] =: M. \]
This implies that $\tilde f$ is a convex function on $T(\Omega_{0.81})$ with Lipschitz constant at most $M$; of course, $\tilde f$ is also bounded by $M$ on $T(\Omega_{0.81})$. Thus, by Lemma 9.10, there exists a set of functions $\mathcal G$ consisting of at most $\exp(\beta(M/\varepsilon)^{d/2})$ functions such that for every $f \in B_S^q(0,t,\Omega)$ there exists $g \in \mathcal G$ with $|\tilde f(y) - g(y)| < \varepsilon$ for all $y \in T(\Omega_{0.81})$. This implies $|f(x) - g(Tx)| < \varepsilon$ for all $x \in \Omega_{0.81}$. Thus, by setting $\mathcal N = \{g\circ T : g \in \mathcal G\}$, the proposition follows with $\gamma_d \ge \beta(M/t)^{d/2}$.

If instead the minimum number of parallel hyperplanes needed to cover all the grid points in $\Omega$ is at most $400d$, then the claim follows by applying the statement in dimension $k-1$ to each of these hyperplane sections, with $\gamma_d \ge 400d\,\gamma_{d-1}$.

Now, we try to reach closer to the boundary of $\Omega$. More precisely, we will extend Proposition 9.5 from $\Omega_{0.81}$ to the set $\Omega_1$ defined below. Let $\Omega$ be a convex polytope with the center of its John ellipsoid at $c$. Then, we can describe $\Omega$ as $\{x \in \mathbb R^d : -a_i \le v_i^T(x-c) \le b_i,\ 1 \le i \le F\}$, where $a_i > 0$, $b_i > 0$, and the $v_i$ are unit vectors in $\mathbb R^d$. Let $m_i$ and $n_i$ be the smallest integers such that $2^{-m_i}a_i \le \delta$ and $2^{-n_i}b_i \le \delta$. Let
\[ \Omega_1 = \{x \in \mathbb R^d : -(1-2^{-m_i})a_i \le v_i^T(x-c) \le (1-2^{-n_i})b_i,\ 1 \le i \le F\}. \]
Then the Hausdorff distance between $\Omega$ and $\Omega_1$ is no larger than $\delta$; thus, $\Omega_1$ is indeed close to $\Omega$.

The following proposition shows that, to achieve our goal, we only need to decompose $\Omega_1$ properly.

Proposition 9.11. Suppose $D_i$, $1 \le i \le m$, is a sequence of convex subsets of $\Omega$ such that no point in $\Omega$ is contained in more than $M$ subsets in the sequence, and that $\Omega_1 \subset \cup_{i=1}^m (D_i)_{0.81}$. Then
\[ \log N(\varepsilon, B_S^p(0;t;\Omega), \ell_S^p(\cdot,\Omega_1)) \le c\, m\, M^{d/p}\,(t/\varepsilon)^{d/2}. \]
Proof.
Let $G_i$ be the set of grid points in $D_i$, and let $S_i$ be the set of grid points in $(D_i)_{0.81} \setminus \cup_{j<i}(D_j)_{0.81}$, so that the $S_i$ partition the grid points of $\cup_i (D_i)_{0.81} \supset \Omega_1$. For $f \in B_S^p(0;t;\Omega)$ and each $i$, the restriction of $f$ to $D_i$ satisfies $\sum_{s\in G_i}|f(s)|^p \le n t^p$, so Proposition 9.5 applies to $D_i$ with a suitably inflated radius: on $(D_i)_{0.81}$, the restriction of $f$ can be approximated to within $\varepsilon$ by one of at most $\exp(\gamma(2M)^{d/p}(t/\varepsilon)^{d/2})$ functions. Since no grid point belongs to more than $M$ of the sets $D_i$,
\[ \sum_{i=1}^m (|S_i| + 1) + \sum_{i=1}^m |G_i| \le Mn + Mn = 2Mn, \]
we can bound the total number of realizations of the approximating function $g$ by
\[ \exp\big(\gamma(2M)^{d/p}\, m\, (t/\varepsilon)^{d/2}\big). \]
Consequently, we have
\[ \log N(\varepsilon, B_S^p(0;t;\Omega), \ell_S^p(\cdot,\Omega_1)) \le \log\binom{Mn+m}{m} + \gamma(2M)^{d/p}\, m\, (t/\varepsilon)^{d/2} \le c\, m\, M^{d/p}(t/\varepsilon)^{d/2}. \]

Now, let us decompose $\Omega_1$.

Lemma 9.12.
There exist convex sets $\widehat D_i$, $1 \le i \le N := \prod_{i=1}^F(m_i+n_i)$, contained in $\Omega$, such that no point in $\Omega$ is contained in more than $3^F$ of these sets, and $\Omega_1 \subset \cup_{i=1}^N (\widehat D_i)_{0.81}$.

Proof. Let $\mathcal K = \{(k_1, k_2, \ldots, k_F) : -m_i \le k_i \le n_i - 1,\ 1 \le i \le F\}$. There are $\prod_{i=1}^F(m_i+n_i)$ elements in $\mathcal K$. For each $K = (k_1, k_2, \ldots, k_F) \in \mathcal K$, define
\[ D_K = \{x \in \mathbb R^d : \alpha_i(k_i) \le v_i^T(x-c) \le \alpha_i(k_i+1)\}, \quad\text{where}\quad \alpha_i(t) = \begin{cases} -(1-2^{t})\,a_i, & t \le 0, \\ (1-2^{-t})\,b_i, & t > 0. \end{cases} \]
Each $D_K$ is a convex set, and the union of all $D_K$, $K \in \mathcal K$, is the set $\{x \in \mathbb R^d : -(1-2^{-m_i})a_i \le v_i^T(x-c) \le (1-2^{-n_i})b_i\} = \Omega_1$. Similarly, we define
\[ \widehat D_K = \{x \in \mathbb R^d : \beta_i(k_i) \le v_i^T(x-c) \le \gamma_i(k_i)\}, \]
where
\[ \beta_i(k_i) = \alpha_i(k_i) - \tfrac14[\alpha_i(k_i+1) - \alpha_i(k_i)], \qquad \gamma_i(k_i) = \alpha_i(k_i+1) + \tfrac14[\alpha_i(k_i+1) - \alpha_i(k_i)]. \]
Let $c_K$ be the center of the John ellipsoid of $\widehat D_K$. We have
\[ \widehat D_K = \{x \in \mathbb R^d : \beta_i(k_i) - v_i^T(c_K-c) \le v_i^T(x-c_K) \le \gamma_i(k_i) - v_i^T(c_K-c)\}. \]
Thus,
\begin{align*}
(\widehat D_K)_{0.8} &= \{x \in \mathbb R^d : 0.8[\beta_i(k_i) - v_i^T(c_K-c)] \le v_i^T(x-c_K) \le 0.8[\gamma_i(k_i) - v_i^T(c_K-c)]\} \\
&= \{x \in \mathbb R^d : 0.8\beta_i(k_i) + 0.2 v_i^T(c_K-c) \le v_i^T(x-c) \le 0.8\gamma_i(k_i) + 0.2 v_i^T(c_K-c)\} \\
&\supset \{x \in \mathbb R^d : 0.8\beta_i(k_i) + 0.2\alpha_i(k_i+1) \le v_i^T(x-c) \le 0.8\gamma_i(k_i) + 0.2\alpha_i(k_i)\} \\
&= \{x \in \mathbb R^d : \alpha_i(k_i) \le v_i^T(x-c) \le \alpha_i(k_i+1)\} = D_K,
\end{align*}
where in the last equality we used the facts that $0.8\beta_i(k_i) + 0.2\alpha_i(k_i+1) = \alpha_i(k_i)$ and $0.8\gamma_i(k_i) + 0.2\alpha_i(k_i) = \alpha_i(k_i+1)$. In particular, $D_K \subset (\widehat D_K)_{0.81}$.

It is not difficult to check that the intervals $(\beta_i(k_i), \gamma_i(k_i))$ and $(\beta_i(j_i), \gamma_i(j_i))$ intersect only when $|k_i - j_i| \le 1$ (including the boundary case $k_i = 0$, $j_i = -1$). Hence every point of $\Omega$ belongs to at most $3^F$ different sets $\widehat D_K$. The lemma follows by renaming these sets as $\widehat D_i$, $1 \le i \le N$.

Proof of Theorem 4.1
By using Lemma 9.12 and Proposition 9.11, we have
\[ \log N(\varepsilon, B_S^p(0;t;\Omega), \ell_S^p(\cdot,\Omega_1)) \le c^{dF/p}\,[\log(1/\delta)]^F\,(t/\varepsilon)^{d/2}. \]
It remains to handle the boundary region $\Omega\setminus\Omega_1$, which can be decomposed into at most $2F$ slabs of width $\delta$. By Khinchine's flatness theorem, the grid points in $\Omega\setminus\Omega_1$ are contained in at most $cF$ hyperplanes for some constant $c$. The intersection of $\Omega$ with each of these hyperplanes is a $(d-1)$-dimensional convex polytope. This enables us to obtain covering number estimates on $\Omega\setminus\Omega_1$ by induction on the dimension. This concludes the proof of Theorem 4.1.

Proof of Corollary 4.2.
Let $n$ denote the cardinality of $\Omega\cap S$ and let $n_i$ denote the cardinality of $\Omega_i\cap S$ for each $i = 1, \ldots, k$. We can assume that each $n_i \ge 1$. For $f \in B_S^p(f_0;t;\Omega)$ and $1 \le i \le k$, let $\sigma_i(f)$ be the smallest positive integer for which
\[ \sum_{s\in\Omega_i\cap S} |f(s) - f_0(s)|^p \le n_i\,\sigma_i(f)\, t^p. \]
It is clear then that $1 \le \sigma_i(f) \le n$ for each $i$. Also, because $\sigma_i(f)$ is the smallest integer satisfying the above, we have
\[ \sum_{s\in\Omega_i\cap S} |f(s) - f_0(s)|^p \ge n_i(\sigma_i(f)-1)\, t^p, \]
which implies that
\[ \sum_{i=1}^k n_i(\sigma_i(f)-1)\, t^p \le \sum_{i=1}^k \sum_{s\in\Omega_i\cap S} |f(s) - f_0(s)|^p = \sum_{s\in\Omega\cap S} |f(s) - f_0(s)|^p \le n\, t^p, \]
leading to
\[ \sum_{i=1}^k n_i\sigma_i(f) \le \sum_{i=1}^k n_i + n = 2n. \]
Let $\Sigma := \{(\sigma_1(f), \ldots, \sigma_k(f)) : f \in B_S^p(f_0;t;\Omega)\}$, and note that the cardinality of $\Sigma$ is at most $n^k$ as $1 \le \sigma_i(f) \le n$ for each $i$. For every $(\sigma_1, \ldots, \sigma_k) \in \Sigma$, let
\[ \mathcal F_{\sigma_1,\ldots,\sigma_k} = \{f \in B_S^p(f_0;t;\Omega) : \sigma_i(f) = \sigma_i,\ 1 \le i \le k\}. \]
Observe now that if $\ell_S^p(f - f_0, \Omega_i) \le \varepsilon(\sigma_i/2)^{1/p}$ for $i = 1, \ldots, k$, then
\[ \ell_S^p(f - f_0, \Omega) = \Big(\frac1n \sum_{i=1}^k \sum_{s\in\Omega_i\cap S} |f(s) - f_0(s)|^p\Big)^{1/p} \le \Big(\frac{\varepsilon^p \sum_{i=1}^k n_i\sigma_i}{2n}\Big)^{1/p} \le \varepsilon. \]
This gives
\[ \log N(\varepsilon, \mathcal F_{\sigma_1,\ldots,\sigma_k}, \ell_S^p(\cdot,\Omega)) \le \sum_{i=1}^k \log N\big(\varepsilon(\sigma_i/2)^{1/p}, B_S^p(f_0; t\sigma_i^{1/p}; \Omega_i), \ell_S^p(\cdot,\Omega_i)\big). \]
Because $f_0$ is affine on each $\Omega_i$, we can apply Theorem 4.1 (absorbing the factor $2^{1/p}$ into the constant) to obtain
\[ \log N(\varepsilon, \mathcal F_{\sigma_1,\ldots,\sigma_k}, \ell_S^p(\cdot,\Omega)) \le k\Big(\frac t\varepsilon\Big)^{d/2}\Big(c_d\log\frac1\delta\Big)^s. \]
Because
\[ B_S^p(f_0;t;\Omega) = \bigcup_{(\sigma_1,\ldots,\sigma_k)\in\Sigma} \mathcal F_{\sigma_1,\ldots,\sigma_k} \]
and the cardinality of $\Sigma$ is at most $n^k$, we deduce that
\[ \log N(\varepsilon, B_S^p(f_0;t;\Omega), \ell_S^p(\cdot,\Omega)) \le k\Big(\frac t\varepsilon\Big)^{d/2}\Big(c_d\log\frac1\delta\Big)^s + k\log n. \]
Because $\log n \le C_d\log(1/\delta)$ (inequality (11)), the second term on the right-hand side above is dominated by the first term. This completes the proof of Corollary 4.2.

Proof of Corollary 4.3.
We use Corollary 4.2 with the following choice of $\Omega_1, \ldots, \Omega_k$. Take $\Omega_1 = \Delta_1$ and define $\Omega_i$ for $2 \le i \le k$ recursively as follows. Consider $\Delta_i$ and its facets that have a non-empty intersection with $\Delta_1, \ldots, \Delta_{i-1}$. If any of these facets contain points of $S$, we move those facets slightly inward so that they do not contain any grid points. This ensures that $\Omega_1, \ldots, \Omega_k$ are $d$-simplices satisfying the conditions of Corollary 4.2. The conclusion of Corollary 4.3 then follows directly from Corollary 4.2 (note that every $d$-simplex can be written as an intersection of $d+1$ halfspaces, so in particular it can be written as an intersection of at most $d+1$ pairs of halfspaces).
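The volume-comparison step used in the proof of Proposition 9.5 rests on the standard fact that any $\varepsilon$-separated subset of a ball of radius $t$ in $\mathbb R^N$ has at most $(1+2t/\varepsilon)^N$ elements (disjoint balls of radius $\varepsilon/2$ must fit inside the ball of radius $t+\varepsilon/2$), and a maximal such set is automatically an $\varepsilon$-covering. The following sketch, which is ours and not part of the proofs, checks this bound numerically for the Euclidean ball in $d = 2$:

```python
import math
import random

def greedy_packing(points, eps):
    """Greedily select a maximal eps-separated subset (a packing).

    A maximal packing is also an eps-covering of the input points."""
    centers = []
    for p in points:
        if all(math.dist(p, c) >= eps for c in centers):
            centers.append(p)
    return centers

random.seed(0)
d, t, eps = 2, 1.0, 0.25

# dense sample of the ball of radius t (rejection sampling)
pts = []
while len(pts) < 20000:
    p = [random.uniform(-t, t) for _ in range(d)]
    if math.dist(p, [0.0] * d) <= t:
        pts.append(p)

centers = greedy_packing(pts, eps)
# volume comparison: any eps-separated set in a radius-t ball
# has at most (1 + 2t/eps)^d points
bound = (1 + 2 * t / eps) ** d
print(len(centers), "<=", bound)
```

The greedy routine and the specific values of `t` and `eps` are illustrative; the inequality `len(centers) <= bound` is guaranteed by the volume argument for any choice of parameters.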
10. Additional proofs
This section contains proofs of Lemma 2.4, Lemma 7.3, Lemma 8.2 and Lemma 8.3.
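Both Lemma 2.4 and Lemma 8.3 rest on the elementary fact that the piecewise affine interpolant of $f_0(x) = \|x\|^2$ on cells of diameter $\eta$ approximates $f_0$ to within a constant times $\eta^2$. In one dimension the interpolation error of $x^2$ on an interval of length $\eta$ is exactly $\eta^2/4$, attained at the midpoint; the following sketch (ours, for illustration only) verifies this numerically:

```python
import numpy as np

knots = np.linspace(0.0, 1.0, 11)   # cell endpoints
eta = knots[1] - knots[0]           # cell width, 0.1
x = np.linspace(0.0, 1.0, 100001)
f = x ** 2
# piecewise-linear interpolant of f0(x) = x^2 through the knots
f_tilde = np.interp(x, knots, knots ** 2)
max_err = float(np.max(np.abs(f - f_tilde)))
# linear interpolation of x^2 on an interval of length eta has
# maximal error eta^2/4 = 0.0025, attained at each cell midpoint
print(max_err)
```

On a cell $[a, a+\eta]$ the error is $(x-a)(x-a-\eta)$, which is independent of $a$; this is why the bound in Lemma 2.4 depends only on the cell diameter and not on the location of the cell.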
Proof of Lemma 2.4.
For a fixed $\eta > 0$, let $\mathcal C_\eta$ be the collection of all cubes of the form
\[ [k_1\eta, (k_1+1)\eta] \times \cdots \times [k_d\eta, (k_d+1)\eta] \]
for $(k_1, \ldots, k_d) \in \mathbb Z^d$ which intersect $\Omega$. Because $\Omega$ is contained in the unit ball, there exists a dimensional constant $c_d$ such that the cardinality of $\mathcal C_\eta$ is at most $C_d\eta^{-d}$ for $\eta \le c_d$.

For each $B \in \mathcal C_\eta$, the set $B\cap\Omega$ is a polytope whose number of facets is bounded from above by a constant depending on $d$ alone. This polytope can therefore be triangulated into at most $C_d$ $d$-simplices. Let $\Delta_1, \ldots, \Delta_m$ be the collection obtained by taking all of the aforementioned simplices as $B$ varies over $\mathcal C_\eta$. These simplices clearly satisfy the first two requirements of Lemma 2.4. Moreover, $m \le C_d\eta^{-d}$ and the diameter of each simplex $\Delta_i$ is at most $C_d\eta$. Now define $\tilde f_\eta$ to be the piecewise affine convex function that agrees with $f_0(x) = \|x\|^2$ at each vertex of each simplex $\Delta_i$ and is defined by linearity everywhere else on $\Omega$. This function is clearly affine on each $\Delta_i$, belongs to $\mathcal C^{C_d}_{C_d}(\Omega)$ for a sufficiently large $C_d$, and satisfies
\[ \sup_{x\in\Delta_i} |f_0(x) - \tilde f_\eta(x)| \le C_d\,(\mathrm{diameter}(\Delta_i))^2 \le C_d\eta^2. \]
Now, given $k \ge 1$, let $\eta = c_d k^{-1/d}$ for a sufficiently small dimensional constant $c_d$, and let $\tilde f_k$ be the function $\tilde f_\eta$ for this $\eta$. The number of simplices is now $m \le C_d k$ and
\[ \sup_{x\in\Delta_i} |f_0(x) - \tilde f_k(x)| \le C_d k^{-2/d}, \]
which completes the proof of Lemma 2.4.

Proof of Lemma 7.3.
Let
\[ g(x_1, x_2, \ldots, x_d) = \begin{cases} \sum_{i=1}^d \cos^2(\pi x_i), & (x_1, \ldots, x_d) \in [-1/2, 1/2]^d, \\ 0, & (x_1, \ldots, x_d) \notin [-1/2, 1/2]^d. \end{cases} \]
Note that $g$ is smooth on $[-1/2,1/2]^d$, that $\partial^2 g/\partial x_i\partial x_j = 0$ for $i \ne j$, and that
\[ \Big|\frac{\partial^2 g}{\partial x_i^2}(x_1,\ldots,x_d)\Big| = 2\pi^2|\cos(2\pi x_i)| \le 2\pi^2, \]
which means that the Hessian of $g$ is dominated by $2\pi^2$ times the identity matrix. It is also easy to check that $g$ and its gradient vanish on the boundary of $[-1/2, 1/2]^d$.

Now, for every grid point $s := (k_1\delta, \ldots, k_d\delta)$ in $S\cap\Omega$, consider the function
\[ g_s(x_1, \ldots, x_d) := \delta^2\, g\Big(\frac{x_1 - k_1\delta}{\delta}, \ldots, \frac{x_d - k_d\delta}{\delta}\Big). \]
Clearly $g_s$ is supported on the cube
\[ [(k_1-1/2)\delta, (k_1+1/2)\delta] \times [(k_2-1/2)\delta, (k_2+1/2)\delta] \times \cdots \times [(k_d-1/2)\delta, (k_d+1/2)\delta] \tag{91} \]
and observe that these cubes for different grid points have disjoint interiors; the Hessian of each $g_s$ is also dominated by $2\pi^2$ times the identity.

We now consider binary vectors in $\{0,1\}^n$. We shall index each $\xi \in \{0,1\}^n$ by $\xi_s$, $s \in S\cap\Omega$. For every $\xi = (\xi_s, s\in S\cap\Omega) \in \{0,1\}^n$, consider the function
\[ G_\xi(x) = f_0(x) + \frac{3}{4\pi^2} \sum_{s\in S\cap\Omega} \xi_s\, g_s(x). \tag{92} \]
It can be verified that $G_\xi$ is convex because $f_0$ has constant Hessian equal to $2$ times the identity, the Hessian of each $g_s$ is dominated by $2\pi^2$ times the identity, and the supports of the $g_s$, $s \in S\cap\Omega$, have disjoint interiors, so the perturbation has Hessian bounded below by $-3/2$ times the identity. Note further that for $\xi, \xi' \in \{0,1\}^n$ and $s \in S\cap\Omega$,
\[ G_\xi(s) - G_{\xi'}(s) = \frac{3d\delta^2}{4\pi^2}(\xi_s - \xi'_s). \]
This implies that
\[ \ell_{P_n}(G_\xi, G_{\xi'}) = \frac{3d\delta^2}{4\pi^2}\sqrt{\frac{\Upsilon(\xi,\xi')}{n}}, \]
where $\Upsilon(\xi,\xi') := \sum_{s\in S\cap\Omega} I\{\xi_s \ne \xi'_s\}$ is the Hamming distance between $\xi$ and $\xi'$. The Varshamov--Gilbert lemma (see, e.g., Massart [39, Lemma 4.7]) asserts the existence of a subset $W$ of $\{0,1\}^n$ with cardinality $|W| \ge \exp(n/8)$ such that $\Upsilon(\xi,\xi') \ge n/4$ for all $\xi, \xi' \in W$ with $\xi \ne \xi'$. We then have
\[ \ell_{P_n}(G_\xi, G_{\xi'}) \ge \frac{3d\delta^2}{8\pi^2} \quad\text{for all } \xi, \xi' \in W \text{ with } \xi \ne \xi'. \]
Inequality (11) then gives
\[ \ell_{P_n}(G_\xi, G_{\xi'}) \ge c_1 n^{-2/d} \quad\text{for all } \xi, \xi' \in W \text{ with } \xi \ne \xi', \]
for a constant $c_1$ depending on $d$ alone. On the other hand, one can also check that
\[ \ell_{P_n}(G_\xi, f_0) \le \frac{3d\delta^2}{4\pi^2} \le c_2 n^{-2/d} \]
for another constant $c_2$ depending on $d$ alone. This completes the proof of Lemma 7.3.

Proof of Lemma 8.2.
We first claim the existence of constants $c_1, c_2, C$, all depending on $d$ alone, such that for every $\varepsilon \le c_1$ and $L \ge C$, there exist an integer $N$ with
\[ c_1\varepsilon^{-d/2} \le \log N \le c_2\varepsilon^{-d/2} \]
and functions $f_1, \ldots, f_N \in \mathcal C_L^L(\Omega)$ such that
\[ \min_{1\le i\ne j\le N} \ell_P(f_i, f_j) \ge \sqrt2\,\varepsilon \quad\text{and}\quad \max_{1\le i\le N} \sup_{x\in\Omega} |f_i(x) - f_0(x)| \le \varepsilon. \]
This basically follows from the same construction as in the proof of Lemma 7.3 (with $\delta$ of order $\sqrt\varepsilon$). To facilitate calculations with $\ell_P$, it is convenient to restrict the sum in (92) to those points $s = (k_1\delta, \ldots, k_d\delta)$ such that the cube (91) is fully contained in $\Omega$. The number of such grid points is also at least $c_d\delta^{-d}$, provided $\delta$ is smaller than a dimensional constant. Note also that each function (92) is bounded by $L_0$ and is $L_0$-Lipschitz for a dimensional constant $L_0$.

Lemma 8.2 follows from the above claim and Hoeffding's inequality (note that $\sup_{x\in\Omega}(f_j(x) - f_k(x))^2 \le 4\varepsilon^2$). Indeed, for every $t > 0$, Hoeffding's inequality followed by a union bound allows us to deduce that
\[ P\big\{\ell^2_{P_n}(f_j, f_k) - \ell^2_P(f_j, f_k) \ge -t n^{-1/2} \text{ for all } j, k\big\} \ge 1 - N^2\exp\Big(-\frac{t^2}{\Gamma\varepsilon^4}\Big) \]
for a universal constant $\Gamma$. Taking $t = \varepsilon^2\sqrt n$, we get
\[ P\{\ell_{P_n}(f_j, f_k) \ge \varepsilon \text{ for all } j, k\} \ge 1 - N^2\exp\Big(-\frac n\Gamma\Big) \ge 1 - \exp\Big(2c_2\varepsilon^{-d/2} - \frac n\Gamma\Big). \]
Assuming now that $\varepsilon \ge n^{-2/d}(8c_2\Gamma)^{2/d}$, we get
\[ P\{\ell_{P_n}(f_j, f_k) \ge \varepsilon \text{ for all } j, k\} \ge 1 - \exp\Big(-\frac n{2\Gamma}\Big). \]
Note finally that each $f_j$ belongs to $B^{\mathcal C_L^L(\Omega)}_{P_n}(f_0, t)$ for $t \ge \varepsilon$. This completes the proof of Lemma 8.2.

Proof of Lemma 8.3.
This proof is similar to that of Lemma 2.4. For a fixed $\eta > 0$, let $\mathcal D_\eta$ be the collection of all cubes of the form
\[ [k_1\eta, (k_1+1)\eta] \times \cdots \times [k_d\eta, (k_d+1)\eta] \]
for $(k_1, \ldots, k_d) \in \mathbb Z^d$ which are contained in the interior of $\Omega$. Because $\Omega$ is contained in the unit ball and contains a ball around zero of constant (depending on $d$ alone) radius, and the diameter of each $B$ is at most $\eta\sqrt d$, it follows that
\[ (1 - C_d\eta)\,\Omega \subseteq \bigcup_{B\in\mathcal D_\eta} B \subseteq \Omega. \]
It is also easy to see that the cardinality of $\mathcal D_\eta$ is at most $C_d\eta^{-d}$ for $\eta \le c_d$. We now triangulate each cube in $\mathcal D_\eta$ into a constant number of $d$-simplices. Let $\Delta_1, \ldots, \Delta_m$ be the collection obtained by taking all of the aforementioned simplices as $B$ varies over $\mathcal D_\eta$. These simplices clearly have disjoint interiors, the diameter of each simplex $\Delta_i$ is at most $C_d\eta$, and moreover $m \le C_d\eta^{-d}$.

Now define $\tilde f_\eta$ to be the piecewise affine convex function that agrees with $f_0(x) = \|x\|^2$ at each vertex of each simplex $\Delta_i$ and is defined by linearity everywhere else on $\Omega$. This function is clearly affine on each $\Delta_i$, belongs to $\mathcal C^{C_d}_{C_d}(\Omega)$ for a sufficiently large $C_d$, and satisfies
\[ \sup_{x\in\Delta_i} |f_0(x) - \tilde f_\eta(x)| \le C_d\,(\mathrm{diameter}(\Delta_i))^2 \le C_d\eta^2. \]
Now, for each fixed $k \ge 1$, we take $\eta = c_d k^{-1/d}$ for a sufficiently small dimensional constant $c_d$ and let $\tilde f_k$ be the function $\tilde f_\eta$ for this $\eta$. This completes the proof of Lemma 8.3.

References

[1] Aït-Sahalia, Y. and J. Duarte (2003). Nonparametric option pricing under shape restrictions.
J. Econometrics 116 (1-2), 9–47. Frontiers of financial econometrics and financial engineering.
[2] Allon, G., M. Beenstock, S. Hackman, U. Passy, and A. Shapiro (2007). Nonparametric estimation of concave production technologies by entropic methods. J. Appl. Econometrics 22 (4), 795–816.
[3] Balázs, G. (2016). Convex regression: theory, practice, and applications. Ph.D. thesis, University of Alberta.
[4] Balázs, G., A. György, and C. Szepesvári (2015). Near-optimal max-affine estimators for convex regression. In AISTATS.
[5] Banaszczyk, W., A. E. Litvak, A. Pajor, and S. J. Szarek (1999). The flatness theorem for nonsymmetric convex bodies via the local theory of Banach spaces. Mathematics of Operations Research 24 (3), 728–750.
[6] Bárány, I. and D. G. Larman (1998). The convex hull of the integer points in a large ball. Mathematische Annalen 312 (1), 167–181.
[7] Bellec, P. C. (2018). Sharp oracle inequalities for least squares estimators in shape restricted regression. Ann. Statist. 46 (2), 745–780.
[8] Birgé, L. and P. Massart (1993). Rates of convergence for minimum contrast estimators. Probab. Theory Related Fields 97 (1-2), 113–150.
[9] Bronšteĭn, E. M. (1976). ε-entropy of convex sets and functions. Sibirsk. Mat. Ž. 17 (3), 508–514, 715.
[10] Chatterjee, S. (2014). A new perspective on least squares under convex constraint. Ann. Statist. 42 (6), 2340–2381.
[11] Chatterjee, S. (2016). An improved global risk bound in concave regression. Electron. J. Stat. 10 (1), 1608–1629.
[12] Chatterjee, S., A. Guntuboyina, and B. Sen (2015). On risk bounds in isotonic and other shape restricted regression problems. Ann. Statist. 43 (4), 1774–1800.
[13] Chen, W. and R. Mazumder (2020). Multivariate convex regression at scale. arXiv preprint arXiv:2005.11588.
[14] Chen, Y. and J. A. Wellner (2016). On convex least squares estimation when the truth is linear. Electron. J. Stat. 10 (1), 171–209.
[15] Doss, C. R. (2020). Bracketing numbers of convex and m-monotone functions on polytopes. J. Approx. Theory 256, 105425.
[16] Dryanov, D. (2009). Kolmogorov entropy for classes of convex functions. Constr. Approx. 30 (1), 137–153.
[17] Dümbgen, L., S. Freitag, and G. Jongbloed (2004). Consistency of concave regression with an application to current-status data. Math. Methods Statist. 13 (1), 69–81.
[18] Gao, F. (2008). Entropy estimate for k-monotone functions via small ball probability of integrated Brownian motion. Electron. Commun. Probab. 13, 121–130.
[19] Gao, F. and J. A. Wellner (2007). Entropy estimate for high-dimensional monotonic functions. J. Multivariate Anal. 98 (9), 1751–1764.
[20] Gao, F. and J. A. Wellner (2017). Entropy of convex functions on R^d. Constr. Approx. 46 (3), 565–592.
[21] Ghosal, P. and B. Sen (2017). On univariate convex regression. Sankhya A 79 (2), 215–253.
[22] Groeneboom, P. and G. Jongbloed (2014). Nonparametric Estimation under Shape Constraints, Volume 38. Cambridge University Press.
[23] Groeneboom, P., G. Jongbloed, and J. A. Wellner (2001). Estimation of a convex function: characterizations and asymptotic theory. Ann. Statist. 29 (6), 1653–1698.
[24] Guntuboyina, A. (2012). Optimal rates of convergence for convex set estimation from support functions. Ann. Statist. 40 (1), 385–411.
[25] Guntuboyina, A. (2016). Covering numbers of Lp-balls of convex functions and sets. Constructive Approximation 43 (1), 135–151.
[26] Guntuboyina, A. and B. Sen (2013). Covering numbers for convex functions. IEEE Trans. Inf. Th. 59 (4), 1957–1965.
[27] Guntuboyina, A. and B. Sen (2015). Global risk bounds and adaptation in univariate convex regression. Probab. Theory Related Fields 163 (1-2), 379–411.
[28] Han, Q. (2019). Global empirical risk minimizers with "shape constraints" are rate optimal in general dimensions. arXiv preprint arXiv:1905.12823.
[29] Han, Q., T. Wang, S. Chatterjee, and R. J. Samworth (2019). Isotonic regression in general dimensions. Ann. Statist. 47 (5), 2440–2471.
[30] Han, Q. and J. A. Wellner (2016). Multivariate convex regression: global risk bounds and adaptation. arXiv preprint arXiv:1601.06844.
[31] Hanson, D. L. and G. Pledger (1976). Consistency in concave regression. Ann. Statist. 4 (6), 1038–1050.
[32] Hildreth, C. (1954). Point estimates of ordinates of concave functions. J. Amer. Statist. Assoc. 49, 598–619.
[33] Keshavarz, A., Y. Wang, and S. Boyd (2011). Imputing a convex objective function. In Intelligent Control (ISIC), 2011 IEEE International Symposium on, pp. 613–619. IEEE.
[34] Kuosmanen, T. (2008). Representation theorem for convex nonparametric least squares. The Econometrics Journal 11 (2), 308–325.
[35] Kur, G., Y. Dagan, and A. Rakhlin (2019). Optimality of maximum likelihood for log-concave density estimation and bounded convex regression. arXiv preprint arXiv:1903.05315.
[36] Kur, G., A. Guntuboyina, and A. Rakhlin (2020). On suboptimality of least squares with application to estimation of convex bodies. Accepted for presentation and publication in the 33rd Annual Conference on Learning Theory (COLT), July 9–12, 2020.
[37] Lim, E. (2014). On convergence rates of convex regression in multiple dimensions. INFORMS J. Comput. 26 (3), 616–628.
[38] Lim, E. and P. W. Glynn (2012). Consistency of multidimensional convex regression. Oper. Res. 60 (1), 196–208.
[39] Massart, P. (2007). Concentration Inequalities and Model Selection. Lecture Notes in Mathematics, Volume 1896. Berlin: Springer.
[40] Matzkin, R. L. (1991). Semiparametric estimation of monotone and concave utility functions for polychotomous choice models. Econometrica 59 (5), 1315–1327.
[41] Mazumder, R., A. Choudhury, G. Iyengar, and B. Sen (2019). A computational framework for multivariate convex regression and its variants. J. Amer. Statist. Assoc. 114 (525), 318–331.
[42] Seijo, E. and B. Sen (2011). Nonparametric least squares estimation of a multivariate convex regression function. Ann. Statist. 39 (3), 1633–1657.
[43] Toriello, A., G. Nemhauser, and M. Savelsbergh (2010). Decomposing inventory routing problems with approximate value functions. Naval Res. Logist. 57 (8), 718–727.
[44] van de Geer, S. A. (2000). Applications of Empirical Process Theory, Volume 6 of Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge.
[45] Varian, H. R. (1982). The nonparametric approach to demand analysis. Econometrica 50 (4), 945–973.
[46] Varian, H. R. (1984). The nonparametric approach to production analysis.