Convex Regression in Multidimensions: Suboptimality of Least Squares Estimators
Gil Kur, Fuchang Gao, Adityanand Guntuboyina, Bodhisattva Sen
aa r X i v : . [ m a t h . S T ] J un Convex Regression in Multidimensions:Suboptimality of Least Squares Estimators
Gil Kur § , Fuchang Gao ∗ , Adityanand Guntuboyina † , and Bodhisattva Sen ‡
32 Vassar StCambridge, MA 02139e-mail: [email protected]
875 Perimeter Drive, MS 1103Moscow, ID 83844e-mail: [email protected]
423 Evans HallBerkeley, CA 94720e-mail: [email protected] [email protected]
Abstract:
The least squares estimator (LSE) is shown to be suboptimal in squarederror loss in the usual nonparametric regression model with Gaussian errors for d ≥ n − /d (up to logarithmic factors) while the minimaxrisk is n − / ( d +4) , for d ≥
5. In addition, the first rate of convergence results (worstcase and adaptive) for the full convex LSE are established for polytopal domainsfor all d ≥
1. Some new metric entropy results for convex functions are also provedwhich are of independent interest.
MSC 2010 subject classifications:
Primary 62G08.
Keywords and phrases:
Adaptive risk bounds, bounded convex regression, Dud-ley’s entropy bound, Lipschitz convex regression, lower bounds on the risk of leastsquares estimators, metric entropy, nonparametric maximum likelihood estima-tion, Sudakov minoration. ∗ Supported by NSF Grant OCA-1940270 † Supported by NSF CAREER Grant DMS-1654589 ‡ Supported by NSF Grant DMS-1712822 § Supported by the Center for Minds, Brains and Machines, funded by NSF award CCF-12312161 ur, G., Gao, F., Guntuboyina, A. and Sen, B./Convex Regression in multidimensions
1. Introduction
We consider the following nonparametric regression problem: Y i = f ( X i ) + ξ i , for i = 1 , . . . , n, (1)where f : Ω → R is an unknown convex function on a full-dimensional compact convexdomain Ω ⊆ R d ( d ≥ X , . . . , X n are design points that may be fixed in Ω or i.i.d.from the uniform distribution P on Ω, and ξ , . . . , ξ n are i.i.d. unobserved errors havingthe N (0 , σ ) distribution with σ > { ( Y i , X i ) } ni =1 , thegoal is to recover f . This is the classical problem of convex regression which has along history in statistics and related fields. Standard references include Hildreth [32],Hanson and Pledger [31], Groeneboom et al. [23], Groeneboom and Jongbloed [22],Seijo and Sen [42], Kuosmanen [34], Lim and Glynn [38] and Bal´azs [3]. Applicationsof convex regression can be found in Varian [45], Varian [46], Allon et al. [2], Matzkin[40], A¨ıt-Sahalia and Duarte [1], Keshavarz et al. [33] and Toriello et al. [43].A natural way to estimate f in (1) is to use the method of least squares, i.e., minimizethe sum of squared errors subject to the convexity constraint. Formally, the convex leastsquares estimator (LSE) is defined asˆ f n ∈ argmin f ∈C (Ω) n X i =1 ( Y i − f ( X i )) (2)where the minimization is over C (Ω) which is defined as the class of all real-valuedconvex functions defined on Ω. ˆ f n coincides with the maximum likelihood estimatoras we have assumed that the errors ξ , . . . , ξ n are normally distributed. ˆ f n is uniquelydefined at the design points X , . . . , X n and can be extended to other points in Ω in apiecewise affine fashion. The convex LSE does not involve any tuning parameters. Seijoand Sen [42] performed a detailed study of the characterization and computation of ˆ f n (see also Kuosmanen [34] and Lim and Glynn [38]) and Mazumder et al. [41] (see alsoChen and Mazumder [13]) demonstrated that it can be efficiently computed for fairlylarge values of the dimension d and the sample size n .The theoretical properties of ˆ f n are fairly well-understood in the univariate case d = 1. Hanson and Pledger [31] proved (uniform) consistency on compact subintervalscontained in the interior of Ω and D¨umbgen et al. [17] strengthened these results byproving rates of convergence. Groeneboom et al. [23] proved pointwise rates of conver-gence and asymptotic distributions under smoothness assumptions and these resultswere extended by Chen and Wellner [14] and Ghosal and Sen [21]. Guntuboyina andSen [27], Chatterjee et al. [12], Bellec [7] and Chatterjee [11] proved risk bounds for theconvex LSE under the equally-spaced fixed design setting. These results imply that theunivariate convex LSE achieves the minimax rate n − / for estimating general convexfunctions while also achieving faster parametric rates (up to logarithmic multiplicativefactors) for estimating piecewise affine convex functions. ur, G., Gao, F., Guntuboyina, A. and Sen, B./Convex Regression in multidimensions In the multivariate case d ≥
2, consistency of the convex LSE was proved by Seijoand Sen [42] (see also Lim and Glynn [38]). However, rates of convergence have notbeen proved previously and one of the main goals of the present paper is to fill this gap.Rates of convergence are available, however, for certain alternative estimators such asthe
Lipschitz convex
LSE and the bounded convex
LSE. The Lipschitz convex LSE isdefined as the LSE over C L (Ω):ˆ f n ( C L (Ω)) ∈ argmin f ∈C L (Ω) n X i =1 ( Y i − f ( X i )) (3)where C L (Ω) is the class of all convex functions on Ω that are L -Lipschitz. The boundedconvex LSE is defined as the LSE over C B (Ω):ˆ f n ( C B (Ω)) ∈ argmin f ∈C B (Ω) n X i =1 ( Y i − f ( X i )) (4)where C B (Ω) is the class of all convex functions on Ω that are uniformly bounded by B . Rates of convergence for the Lipschitz convex LSE can be found in Bal´azs et al.[4], Lim [37] and Mazumder et al. [41] while rates for the bounded convex LSE are inHan and Wellner [30]. It should be noted that these alternative estimators ˆ f n ( C L (Ω))and ˆ f n ( C B (Ω)) crucially depend on tuning parameters (specifically L and B ) while theconvex LSE is tuning parameter free.In Section 2, we provide the first rate of convergence results for the convex LSEfor d ≥
2. Let us describe these results at a high-level here (we ignore logarithmicmultiplicative factors in this Introduction; see the actual theorems for the full bounds).We assume that Ω is a polytope and that the design points X , . . . , X n form a fixedregular rectangular grid intersected with Ω. As is common in fixed design analysis, wework with the loss function ℓ P n ( f, g ) := Z ( f − g ) d P n = 1 n n X i =1 ( f ( X i ) − g ( X i )) (5)where P n is the empirical distribution of the design points X , . . . , X n (note that P n is non-random here as we are working in fixed-design). The risk of ˆ f n is defined as E f ℓ P n ( ˆ f n , f ) and we prove bounds for E f ℓ P n ( ˆ f n , f ) that hold for finite samples. Ourfirst main result is Theorem 2.1 which proves that the risk of the convex LSE is boundedfrom above by r n,d := (cid:26) n − / ( d +4) : d ≤ n − /d : d ≥ adaptive risk bounds for the convex LSE which imply that the convexLSE converges at rates faster than r n,d when f is a piecewise affine convex function.Specifically we prove in Theorem 2.3 that when f is a piecewise affine convex functionwith k affine pieces, the risk of ˆ f n is bounded from above by a k,n,d := ( kn : d ≤ (cid:0) kn (cid:1) /d : d ≥ ur, G., Gao, F., Guntuboyina, A. and Sen, B./Convex Regression in multidimensions up to a logarithmic term which depends on the number of facets of each affine piece. Itis interesting that the rate a k,n,d switches from a parametric rate for d ≤ d ≥ r n,d and a k,n,d are not merely loose upper bounds but accurately describe thebehavior of ˆ f n . These lower bound results are only interesting for d ≥ d ≤ r n,d is already the minimax rate (see (14)) and a k,n,d is the parametric rate. InTheorem 2.2, we prove the existence of a bounded, Lipschitz convex function f on Ωwhere the risk of the convex LSE is bounded from below by n − /d (note that logarithmicfactors are being ignored in this Introduction). This function where the LSE is shown toachieve the n − /d rate will be a piecewise affine convex function whose number of affinepieces is of the order √ n (see Lemma 2.4). This proves that r n,d correctly describes theworst case risk behavior of the convex LSE. Moreover, in Theorem 2.5, we show, forevery 1 ≤ k ≤ √ n , the existence of a convex function that is piecewise affine with ∼ k affine pieces where the risk of the convex LSE is bounded from below by ( k/n ) /d . Thisshows that a k,n,d correctly describes the adaptive behavior of the convex LSE for d ≥ ≤ k ≤ √ n . Note that a k,n,d cannot be expected to be a tight bound for k ≫ √ n as then ( k/n ) /d will dominate the worst case risk bound of n − /d .Our results imply the minimax suboptimality of the convex LSE for d ≥
5. Indeed,the minimax risk for the class of all bounded, Lipschitz convex functions is of theorder n − / ( d +4) . From a comparison of r n,d with the minimax risk n − / ( d +4) , we canimmediately conclude that the convex LSE is minimax suboptimal for d ≥ X , . . . , X n are independently distributed according to the uniformdistribution P on Ω and consider the loss function ℓ P ( f, g ) := Z Ω ( f − g ) d P . (8)For the bounded convex LSE, it was proved in Han and Wellner [30] that its risk isbounded from above by r n,d (defined in (6)) when Ω is a polytope. In Theorem 3.1, weprove that there exists a bounded, Lipschitz convex function on Ω where the risk of thebounded convex LSE is bounded from below by n − /d . This implies that the boundedconvex LSE is minimax suboptimal in random design when the domain Ω is a polytope.This contrasts intriguingly with the recent result of Kur et al. [35] who proved that thebounded convex LSE is minimax optimal when Ω is a smooth convex body (such as theunit ball). Some insight into this is given in Section 6.For the Lipschitz convex LSE, it was proved, in Bal´azs et al. [4], Lim [37], Mazumderet al. [41], that its risk is bounded again by r n,d for all convex bodies Ω (regardlessof whether Ω is polytopal or smooth). In Theorem 3.2, we prove the existence of abounded, Lipschitz convex function on Ω where the risk of the Lipschitz convex LSE isbounded from below by n − /d . This implies that the Lipschitz convex LSE is minimax ur, G., Gao, F., Guntuboyina, A. and Sen, B./Convex Regression in multidimensions suboptimal when d ≥ d ≥
5. In a recent paper Kur et al.[36] involving two of the authors of the present paper, it was shown that the LSEover the class of support functions of all convex bodies was suboptimal for d ≥ d is small.Our risk results required proving novel metric entropy bounds for convex functions.These results are given in Section 4. For our fixed design risk bounds, we prove, inTheorem 4.1, a metric entropy upper bound for the class { f ∈ C (Ω) : ℓ P n ( f, f ) ≤ t } for affine functions f under the ℓ P n pseudometric. This result is different from existingmetric entropy results in [9, 15, 20, 25, 26] for convex functions in that it deals with thediscrete ℓ P n pseudometric while all existing results deal with continuous L p metrics. Alsothe constraint ℓ P n ( f, f ) ≤ t on the convex functions in the above class is comparativelyweak. For our random design results, we prove, in Theorem 4.4, bracketing L ( P ) metricentropy bounds for (cid:26) f ∈ C (Ω) : ℓ P ( f, f ) ≤ t, sup x ∈ Ω | f ( x ) | ≤ B (cid:27) for polytopal Ω, piecewise affine f ∈ C (Ω) and t >
0. This result also improves existingresults in certain aspects; seee Section 4 for details.Let us now quickly summarize the contents of the rest of the paper. The results forthe convex LSE in fixed design are given in Section 2. Results for the bounded convexLSE and Lipschitz convex LSE in random design are in Section 3. Metric entropy resultsare in Section 4. The main technical ideas behind the proofs are briefly described inSection 5. Section 6 contains a discusssion of issues related to our main results. Section7 contains the proofs of the main results from Section 2. Section 8 contains the proofsof the results from Section 3. Section 9 contains the proofs of the metric entropy resultsfrom Section 4. Some additional proofs are relegated to Section 10.
2. Risk bounds for the convex LSE
In this section, we prove rates of convergence for the convex LSE ˆ f n (defined as in (2)).These are the first rate of convergence results for the convex LSE for d ≥
2. Throughout ur, G., Gao, F., Guntuboyina, A. and Sen, B./Convex Regression in multidimensions the paper, c d , C d , κ d etc. denote constants that depend on the dimension d alone andtheir exact value might change from appearance to appearance. We sometimes refer tothese constants as dimensional constants (“dimensional” here refers to the dependenceon d ).Let us first describe our assumptions on Ω that we use throughout this section. Weassume that the domain Ω is a polytope. Ω can be written in the form:Ω = (cid:8) x ∈ R d : a i ≤ v Ti x ≤ b i for i = 1 , . . . , F (cid:9) (9)for some positive integer F , unit vectors v , . . . , v F and real numbers a , . . . , a F , b , . . . , b F .We also assume that Ω is contained in the unit ball (this can be achieved by scaling).Our bounds will be nonasymptotic and hold even when Ω changes with n . We assumehowever that F is bounded from above by a constant depending on d alone. We alsoassume that the volume of Ω is bounded from below by a constant depending on d alone.We do not address the problem of finding rates of convergence of the convex LSE inthe non-polytopal case where Ω is a smooth convex body. Rates of convergence in thiscase will be quite different from the rates derived in this section. See Section 6 for moredetails.In this section, we work with the fixed design setting where X , . . . , X n form a fixedregular rectangular grid in Ω and Y , . . . , Y n are generated according to (1). Specifically,for δ >
0, let S := { ( k δ, . . . , k d δ ) : k i ∈ Z , ≤ i ≤ d } (10)denote the regular d -dimensional δ -grid in R d . We assume that X , . . . , X n are an enu-meration of the points in S ∩
Ω with n denoting the cardinality of S ∩
Ω. By the usualvolumetric argument and the assumption that the volume of Ω is bounded from aboveand below by constants depending on d alone, there exists a small enough constant κ d such that whenever δ ≤ κ d , we have2 ≤ c d δ − d ≤ n ≤ C d δ − d (11)for dimensional constants c d and C d . We shall assume throughout this section that δ ≤ κ d so that the above inequality holds.We study the performance of the LSE ˆ f n under the loss function (5). The next coupleof results prove upper bounds for the risk E f ℓ P n ( ˆ f n , f ) of the convex LSE. Let C (Ω)denote the class of all real-valued convex functions on Ω and let A (Ω) denote the classof all affine functions on Ω. For each f ∈ C (Ω), let L ( f ) := inf g ∈A (Ω) ℓ P n ( f, g )where, it may be recalled, that ℓ P n ( f, g ) is our loss function defined via (5). The followingresult proves an upper bound involving L ( f ) for E f ℓ P n ( ˆ f n , f ) for arbitrary f ∈ C (Ω).Note that the number F appearing in the bound (12) below is the number of parallelhalfspaces or slabs defining Ω (see (9)) and this number is assumed to be bounded bya constant depending on d alone. ur, G., Gao, F., Guntuboyina, A. and Sen, B./Convex Regression in multidimensions Theorem 2.1.
Fix f ∈ C (Ω) with L := L ( f ) . There exists positive constants C d and γ d depending only on d such that E f ℓ P n ( ˆ f n , f ) ≤ C d max (cid:26) L d/ (4+ d ) (cid:16) σ n (log n ) F (cid:17) / ( d +4) , σ n (log n ) F (cid:27) for d ≤ C max n σ L √ n (log n ) F/ , σ n (log n ) F o for d = 4 C d max (cid:26) σ L (cid:16) (log n ) F n (cid:17) /d , σ (cid:16) (log n ) F n (cid:17) /d (cid:27) for d ≥ . (12)When L = L ( f ) and σ are fixed positive constants (not changing with n ) and n issufficiently large, the leading terms in the right hand side of (12) are the first termsinside the maxima. More precisely, from (12), it easily follows that for every L > σ >
0, there exist constants C d (depending on d alone) and N d,σ/ L (depending only on d and σ/ L ) such thatsup f ∈C (Ω): L ( f ) ≤ L E f ℓ P n ( ˆ f n , f ) ≤ C d L d/ (4+ d ) (cid:16) σ n (log n ) F (cid:17) / ( d +4) for d = 1 , , C σ L √ n (log n ) F/ for d = 4 C d σ L (cid:16) (log n ) F n (cid:17) /d for d ≥ n ≥ N d,σ/ L . The risk upper bound obtained above can be compared with the fol-lowing minimax risk characterization (which can be proved by routine arguments; seee.g., [24, Proof of Theorem 3.2]). Let C LL (Ω) denote the class of all convex functions onΩ that are L -Lipschitz and uniformly bounded by L . Then there exist constants c d , C d and N d,σ/L such that c d L d/ (4+ d ) (cid:18) σ n (cid:19) / ( d +4) ≤ inf ˜ f n sup f ∈C LL (Ω) E f ℓ P n ( ˜ f n , f ) ≤ C d L d/ (4+ d ) (cid:18) σ n (cid:19) / ( d +4) (14)for n ≥ N d,σ/L . Letting C L (Ω) be the class of all convex functions on Ω that are uniformlybounded by L , it is easy to see that C LL (Ω) ⊆ C L (Ω) ⊆ { f ∈ C (Ω) : L ( f ) ≤ L } which implies that the minimax lower bound in (14) also holds for the larger classes C L (Ω) and { f ∈ C (Ω) : L ( f ) ≤ L } .A comparison of (13) and (14) implies that the convex LSE ˆ f n is nearly minimaxoptimal (up to logarithmic multiplicative factors) over the class { f ∈ C (Ω) : L ( f ) ≤ L } (or over the smaller classes C L (Ω) or C LL (Ω)) for d ≤
4. However, the rate given by (13)is strictly suboptimal compared to the minimax rate for d ≥ d ≥
5, there exists a bounded Lipschitz convexfunction f for which E f ℓ P n ( ˆ f n , f ) is bounded from below by n − /d up to a logarithmicmultiplicative factor. Comparing this to (14) (and noting that n − /d ≫ n − / ( d +4) for ur, G., Gao, F., Guntuboyina, A. and Sen, B./Convex Regression in multidimensions d ≥ d ≥
5, the convexLSE is minimax suboptimal over C LL (Ω) (or over the larger classes C L (Ω) or { f ∈ C (Ω) : L ( f ) ≤ L } ). Theorem 2.2.
Fix d ≥ , L > and σ > . There exist constants c d and C d,σ/L suchthat sup f ∈C LL (Ω) E f ℓ P n ( ˆ f n , f ) ≥ c d σLn − /d (log n ) − d +1) /d for n ≥ C d,σ/L . (15)We shall now present risk bounds when f is a piecewise affine convex function. Tomotivate these results, let us first examine inequality (12) when f is an affine function.Here L = L ( f ) = 0 so we havesup f ∈A (Ω) E f ℓ P n ( ˆ f n , f ) ≤ C d σ n (log n ) F for d = 1 , , C σ n (log n ) F for d = 4 C d σ (cid:16) (log n ) F n (cid:17) /d for d ≥ . (16)The bounds given in (16) are of a smaller order than those given by (13) which meansthat the LSE ˆ f n adapts to affine functions by converging to them at faster rates com-pared to other convex functions in { f ∈ C (Ω) : L ( f ) ≤ L } . In the next result, we provethat a similar adaptation holds for a larger class of piecewise affine convex functions.For k ≥ h ≥
1, let C k,h (Ω) denote all functions f ∈ C (Ω) for which there exist k convex subsets Ω , . . . , Ω k satisfying the following properties:1. f is affine on each Ω i ,2. each Ω i can be written as an intersection of at most h slabs (i.e., as in (9) with F = h ), and3. Ω ∩ S , . . . , Ω k ∩ S are disjoint with ∪ ki =1 (Ω i ∩ S ) = Ω ∩ S . Theorem 2.3.
For every k ≥ and h ≥ , we have sup f ∈ C k,p (Ω) E f ℓ P n ( ˆ f n , f ) ≤ C d σ (cid:0) kn (cid:1) (log n ) h for d = 1 , , C d σ (cid:0) kn (cid:1) (log n ) h +2 for d = 4 C d σ (cid:16) k (log n ) h n (cid:17) /d for d ≥ for a constant C d depending on d alone. Remark 2.1.
Note that Theorem 2.3 generalizes the bound (16) because A (Ω) = C ,F (Ω) . The logarithmic factors in (17) have powers involving h which means that they cannotbe ignored when h grows with n . Thus Theorem 2.3 only gives something useful when h is either constant or grows very slowly with n .If we ignore the logarithmic factors in (17), we can see that the risk bound in (17)switches from a parametric rate for d ≤ d ≥ d ≥ ur, G., Gao, F., Guntuboyina, A. and Sen, B./Convex Regression in multidimensions larger than that given by (13) when k > √ nσ − d/ so Theorem 2.3 is only interesting(for d ≥
5) for k in the range 1 ≤ k ≤ √ nσ − d/ . In the next result, we show that forevery k satisfying 1 ≤ k ≤ √ nσ − d/ , there exists a piecewise affine convex functionon Ω with no larger than C d k affine pieces where the risk of the LSE is bounded frombelow by ( k/n ) /d (up to logarithmic multiplicative factors). The function where therate ( k/n ) /d is achieved can be taken to be a piecewise affine approximation to asmooth convex function such as the quadratic function. Its existence is guaranteed bythe following lemma. Lemma 2.4.
Let f ( x ) := k x k . There exists a positive constant C d (depending on thedimension alone) such that the following is true. For every k ≥ , there exist m ≤ C d kd -simplices ∆ , . . . , ∆ m and a convex function ˜ f k such that1. Ω = ∪ mi =1 ∆ i ,2. ∆ i ∩ ∆ j is contained in a facet of ∆ i and a facet of ∆ j for each i = j ,3. ˜ f k is affine on each ∆ i , i = 1 , . . . , m ,4. sup x ∈ Ω | f ( x ) − ˜ f k ( x ) | ≤ C d k − /d ,5. ˜ f k ∈ C C d C d (Ω) . The next result shows that the LSE achieves the rate ( k/n ) /d for the functions ˜ f k given by the above lemma. Theorem 2.5.
Fix d ≥ . There exist positive constants c d and N d such that for n ≥ N d and ≤ k ≤ min (cid:0) √ nσ − d/ , c d n (cid:1) , (18) we have E ˜ f k ℓ P n ( ˆ f n , ˜ f k ) ≥ c d σ (cid:18) kn (cid:19) /d (log n ) − d +1) /d (19) where ˜ f k is the function from Lemma 2.4. The above result immediately implies that the adaptive risk bounds in (17) cannotbe improved for all k satisfying (18) (note that min (cid:0) √ nσ − d/ , c d n (cid:1) will equal √ nσ − d/ unless σ is of smaller order than n − /d ). This implies, in particular, that the LSE cannotadapt at near parametric rates for affine functions for d ≥ k = √ nσ − d/ is of the same order as that given byTheorem 2.2. In other words, the adaptive lower bound in Theorem 2.5 implies minimaxsuboptimality of the convex LSE.
3. Suboptimality of constrained convex LSEs in two settings
The highlight of the results of Section 2 is the minimax suboptimality of the convexLSE in the fixed gridded design setting when d ≥
5. In this section, we show that thissuboptimality also extends to the bounded convex LSE (when Ω is a polytope) andthe Lipschitz convex LSE (for general Ω). We consider here the random design settingwhere the observed data are ( X , Y ) , . . . , ( X n , Y n ) with X , . . . , X n being independent ur, G., Gao, F., Guntuboyina, A. and Sen, B./Convex Regression in multidimensions having the uniform distribution P on Ω and Y , . . . , Y n being generated according to (1)for independent N (0 , σ ) errors ξ , . . . , ξ n . We also assume that ξ , . . . , ξ n , X , . . . , X n are independent random variables and work with the ℓ P loss function (8).Let us first state our result for the bounded convex LSE (defined as in (4)). This resultassumes that the domain Ω is a polytope. The risk E f ℓ P ( ˆ f n ( C B (Ω)) , f ) of the boundedconvex LSE was studied in Han and Wellner [30] who proved matching upper and lowerbounds (up to logarithmic factors in n ) for d ≤ d ≥ d ≥
5, it was proved in Han and Wellner [30, Theorem3.6] that sup f ∈C B (Ω) E f ℓ P ( ˆ f n ( C B (Ω)) , f ) is bounded from above by n − /d (ignoringmultiplicative factors that are logarithmic in n and that depend on B and σ ). On theother hand, the minimax rate over C B (Ω) under the ℓ P loss function equals n − / ( d +4) for all d ≥ n − /d and the minimax risk of n − / ( d +4) for d ≥
5. The next result shows that, for d ≥
5, there exist functions f in C B (Ω) wherethe risk of ˆ f n ( C B (Ω)) is bounded from below by n − /d (up to logarithmic multiplicativefactors) thereby proving that the bounded convex LSE is minimax suboptimal. Thefact that Ω is polytopal is crucial here for the LSE becomes optimal when Ω is the unitball as recently shown in Kur et al. [35]. We provide an explanation of this in Section6. Theorem 3.1.
Let Ω be a polytope whose number of facets is bounded by a constantdepending on d alone. Assume also that Ω is contained in the unit ball and contains aball of constant (depending on d alone) radius. Fix d ≥ . There exist constants c d and N d,σ/B such that for every B > and σ > , we have sup f ∈C B (Ω) E f ℓ P ( ˆ f n ( C B (Ω)) , f ) ≥ c d σBn − /d (log n ) − d +1) /d whenever n ≥ N d,σ/B . (20)The next result is for the Lipschitz convex LSE (defined as in (3)). The followingresult shows that the same lower bound n − /d (up to logarithmic factors) holds for theLipschitz convex LSE for essentially every convex domain Ω (regardless of whether Ωis polytopal or smooth). Theorem 3.2.
Suppose Ω is a convex body that is contained in the unit ball and containsa ball centered at zero of constant (depending on d alone) radius. Fix d ≥ . There existpositive constants c d and N d,σ/L such that for every L > and σ > , we have sup f ∈C LL (Ω) E f ℓ P ( ˆ f n ( C L (Ω)) , f ) ≥ c d σLn − /d (log n ) − d +1) /d whenever n ≥ N d,σ/L . (21)
4. Metric entropy results
Our risk results from the previous two sections are based on new metric entropy resultsfor convex functions. Specifically, the risk bounds for the convex LSE in Section 2 ur, G., Gao, F., Guntuboyina, A. and Sen, B./Convex Regression in multidimensions are proved via a metric entropy bound for convex functions satisfying a discrete L constraint and the risk lower bounds in Section 3 are proved via a bracketing entropybound for bounded convex functions with an additional L constraint. The goal of thissection is to describe these entropy results. We would like to start however with a briefdescription of existing entropy results for convex functions.Bronˇste˘ın [9] proved that the metric entropy of bounded Lipschitz convex functionsdefined on a fixed convex body Ω in R d is of the order ǫ − d/ under the supremum( L ∞ ) metric. The Lipschitz constraint can be removed if one is only interested in L p metrics for p < ∞ . Indeed, it was shown (by Gao [18] and Dryanov [16] for d = 1 andby Guntuboyina and Sen [26] for d ≥
2) that the metric entropy of bounded convexfunctions on Ω = [0 , d is of the order ǫ − d/ under the L p metric for every 1 ≤ p < ∞ .The boundedness constraint can further be relaxed to a L q -norm constraint (1 ≤ q ≤ ∞ )in which case the aforementioned result will hold in the L p metric for 1 ≤ p < q (seeGuntuboyina [25]). The case of more general convex bodies Ω was considered by Gaoand Wellner [20] who proved that the same bounds hold when Ω is an arbitrary polytope.Gao and Wellner [20] also studied the case where Ω is not a polytope. For example, whenΩ is the unit ball, they showed that the metric entropy of bounded convex functions onΩ is of the order ǫ − ( d − (which is larger than ǫ − d/ ) in the L metric when d ≥ ǫ -covering number of a set S under a pseudometric d will be denoted by N ( ǫ, S, d ).Also, the ǫ -bracketing number of a set S of functions under a pseudometric d will bedenoted by N [ ] ( ǫ, S, d ).Our first main metric entropy result is the following. We use notation that is similarto that in Section 2. Recall that S is the regular d -dimensional δ -grid defined in (10).The resolution of the grid δ will appear in the bounds below (note that, by (11), δ willbe of order n − /d ). Let Ω be a convex body such that Ω ∩ S 6 = ∅ . For 1 ≤ p < ∞ and afunction f on Ω, we define the quasi-norm ℓ S ( f, Ω , p ) = ∩ S ) X s ∈ Ω ∩ S | f ( s ) | p ! /p , where ∩ S ) denotes the cardinality of Ω ∩ S . Furthermore, for any fixed function f on Ω, and any t >
0, denote B p S ( f ; t ; Ω) = { f : Ω → R | f is convex on Ω , ℓ S ( f − f , Ω , p ) ≤ t } . We are interested in the metric entropy of B p S ( f ; t ; Ω) under the ℓ S ( · , Ω , p ) quasi-metric.The following result deals with the case when f ∈ A (Ω) (recall that A (Ω) denotes theclass of all affine functions on Ω). We state and prove this for every 1 ≤ p < ∞ (as theresult could be of independent interest) even though we only require the case p = 2 forproving the risk bounds of Section 2. Theorem 4.1.
Let Ω be a d -dimensional convex polytope that equals the intersectionof at most F pairs of halfspaces with distance no larger than 1 (i.e., as in (9) with ur, G., Gao, F., Guntuboyina, A. and Sen, B./Convex Regression in multidimensions max ≤ i ≤ F ( b i − a i ) ≤ ). There exists a constant c d,p depending only on d and p suchthat for every f ∈ A (Ω) , ε > and t > , we have log N ( ε, B p S ( f ; t ; Ω) , ℓ S ( · , Ω , p )) ≤ [ c d,p log(1 /δ )] F ( t/ε ) d/ . (22)Theorem 4.1 differs from existing entropy results for convex functions in the followingways. First, it deals with the discrete ℓ S ( · , Ω , p ) metric while all results previously havestudied the continuous L p metrics. Second, the constraint on functions in B p S ( f ; t ; Ω)is 1 ∩ S ) X s ∈ Ω ∩ S | f ( s ) − f ( s ) | p ≤ t p which is much weaker than imposing uniform boundedness on the class. When δ ↓
0, onemight view the above constraint as a constraint on the continuous L p norm of f , but itmust be noted that the L p metric entropy under such an L p constraint equals ∞ (moregenerally, L p metric entropy under an L q norm constraint is finite if and only if p < q ;see [20, 25]). The bound (22) also approaches infinity as δ ↓ /δ and this only leads to additional logarithmic terms in our risk bounds.Theorem 4.1 implies, via the triangle inequality, bounds for log N ( ε, B p S ( f ; t ; Ω) , ℓ S ( · , Ω , p ))for arbitrary not necessarily affine f . Indeed, the triangle inequality gives B p S ( f ; t ; Ω) ⊆ B p S ( f ; t + ℓ S ( f − f , Ω , p ); Ω)for every f, f . Applying this for affine functions f , we obtain from Theorem 4.1 thatlog N ( ε, B p S ( f ; t ; Ω) , ℓ S ( · , Ω , p )) ≤ [ c d,p log(1 /δ )] F (cid:18) t + inf f ∈A (Ω) ℓ S ( f − f , Ω , p ) ε (cid:19) d/ . (23)While the above inequality is useful (we use it in the proof of Theorem 2.1), it is loosein the case when f is piecewise affine and t is small. For piecewise affine f , we useinstead the following two corollaries of Theorem 4.1. Corollary 4.2 will be used to proveTheorem 2.3 while Corollary 4.3 will be used in the proof of Theorem 2.5. Corollary 4.2.
Suppose f is a piecewise affine convex function on Ω . Suppose Ω , . . . , Ω k are convex subsets of Ω such that1. f is affine on each Ω i .2. Each Ω i equals an intersection of at most s pairs of parallel halfspaces.3. Ω ∩ S , . . . , Ω k ∩ S are disjoint with ∪ ki =1 (Ω i ∩ S ) = Ω ∩ S .Then log N ( ǫ, B p S ( f ; t ; Ω) , ℓ S ( · , Ω , p )) ≤ k (cid:18) tǫ (cid:19) d/ (cid:18) c d,p log 1 δ (cid:19) s for a constant c d,p that depends on d and p alone. The next result can be seen as a consequence of the above Corollary when eachpolytope Ω i is a d -simplex. ur, G., Gao, F., Guntuboyina, A. and Sen, B./Convex Regression in multidimensions Corollary 4.3.
Suppose f is a piecewise affine convex function on Ω . Suppose that Ω can be written as the union of k d -simplices ∆ , . . . , ∆ k such that f is affine on each ∆ i and such that ∆ i ∩ ∆ j is contained in a facet of ∆ i and a facet of ∆ j for each i = j .Then log N ( ǫ, B p S ( f ; t ; Ω) , ℓ S ( · , Ω , p )) ≤ C d,p k (cid:18) tǫ (cid:19) d/ (cid:18) log 1 δ (cid:19) d +1 for a constant C d,p that depends only on d and p . We next state our main bracketing entropy result. This is crucial for our risk lowerbounds in Section 3. Recall that C Γ (Ω) denotes the class of all convex functions on Ωthat are uniformly bounded by Γ. We state the next result for every 1 ≤ p < ∞ forcompleteness although we only use it for p = 2. Theorem 4.4.
Let Ω be a convex body in R d with volume bounded by 1. Let f be aconvex function on Ω that is bounded by Γ . For a fixed ≤ p < ∞ and t > , let B Γ p ( f ; t ; Ω) = (cid:26) f ∈ C Γ (Ω) : Z Ω | f ( x ) − f ( x ) | p dx ≤ t p (cid:27) . (24) Suppose ∆ , . . . , ∆ k ⊆ Ω are d -simplices with disjoint interiors such that f is affine oneach ∆ i . Then for every < ǫ < Γ and t > , we have log N [ ] ( ε, B Γ p ( f ; t ; Ω) , k · k p, ∪ ki =1 ∆ i ) ≤ C d,p k (cid:18) log Γ ε (cid:19) d +1 (cid:18) tε (cid:19) d/ (25) for a constant C d,p that depends on p and d alone. The left hand side above denotesbracketing entropy with respect to the L p metric on the set ∆ ∪ · · · ∪ ∆ k . To see how Theorem 4.4 compares to existing bracketing entropy results, considerthe special case when Ω is a d -simplex and when f ≡
0. In that case, the conclusionof Theorem 4.4 (for k = 1 and ∆ = Ω) becomes:log N [ ] ( ε, { f ∈ C Γ (Ω) : Z Ω | f ( x ) | p dx ≤ t p } , k · k p ) ≤ C d,p (cid:18) log Γ ε (cid:19) d +1 (cid:18) tε (cid:19) d/ . (26)The class of convex functions above has both an L ∞ constraint (uniform boundedness)as well as an L p constraint and (26) does not hold if either of the two constraints aredropped. Indeed, the entropy becomes infinite if the L ∞ constraint is dropped. On theother hand, if the L p constraint is dropped, then the bracketing entropy is of the order(Γ /ǫ ) d/ as proved by Gao and Wellner [20] (see also Doss [15]). In contrast to (Γ /ǫ ) d/ ,(26) only has a logarithmic dependence on Γ and is much smaller when t is small. Hanand Wellner [30, Lemma 3.3] proved a weaker bound for the left hand side of (25)having additional multiplicative factors involving k (these factors cannot be neglectedsince we care about the regime k ∼ √ n ). ur, G., Gao, F., Guntuboyina, A. and Sen, B./Convex Regression in multidimensions
5. Proof ideas
We briefly describe here the key ideas underlying the proofs of the main results of thepaper. The risk upper bounds for the convex LSE (Theorem 2.1 and Theorem 2.3) inSection 2 are based on standard techniques [10] for analyzing LSEs combined with ourmetric entropy results of Section 4. The main novelty here is in the metric entropyresults. The worst case risk lower bound for the convex LSE in Theorem 2.2 followsfrom the adaptive lower bound in Theorem 2.5 by taking k ∼ √ n . The main ideasbehind the proof of Theorem 2.5 are as follows. Chatterjee [10] proved that the risk ofthe LSE at ˜ f k (this is the function given by Lemma 2.4) behaves as t f k where t ˜ f k is themaximizer of the function t H ˜ f k ( t ) := E sup f ∈C (Ω): ℓ P n ( ˜ f k ,f ) ≤ t n n X i =1 ξ i (cid:16) f ( X i ) − ˜ f k ( X i ) (cid:17) − t t ∈ [0 , ∞ ). The task then boils down to proving that t ˜ f k is bounded from belowby ( k/n ) /d up to logarithmic factors. This requires proving upper bounds and lowerbounds for the function H ˜ f k ( t ). We prove upper bounds using Dudley’s entropy boundand our metric entropy result of Corollary 4.3. The lower bounds are proved by Sudakovminoration as well as a metric entropy lower bound for local balls around f ( x ) := k x k (see Lemma 7.3).The proof of Theorem 3.1 uses the same basic strategy as that of Theorem 2.5 butis technically more involved because of the random design setting. We use conditionalversions of many arguments used in the proof of Theorem 2.5 including the conditionalversion of the result of Chatterjee [10]. The bracketing entropy upper bound fromTheorem 4.4 as well as the metric entropy lower bound from Lemma 8.2 are crucial forthis proof.The proof of Theorem 3.2 involves taking a large polytopal region S inside the generaldomain Ω, using ideas from the proof of Theorem 3.1 on the subset S , and using theLipschitz constraint to deal with the relatively small set Ω \ S . The Lipschitz constraintis crucial here as it allows the use of covering numbers in the supremum ( L ∞ ) metricdue to Bronˇste˘ın [9].Our ideas behind the proofs for the lower bounds on the performance of the LSEshave been used in a simpler setting in Kur et al. [36]. Specifically, the fixed design lowerbound in Kur et al. [36] only works in the regime k ∼ √ n and so it does not yield theadaptive lower bounds in Theorem 2.5. The random design lower bound in Kur et al.[36] uses an assumption on the Koltchinskii-Pollard entropy (or ∞ -covering) which isnot available in the present setting.The main proof ideas for the metric entropy results are as follows. Let us start withTheorem 4.4 because its proof is technically simpler. The key is to consider a polytopaldomain of the form Ω := (cid:8) x ∈ R d : a i ≤ v Ti x ≤ b i , i = 1 , . . . , d + 1 (cid:9) ur, G., Gao, F., Guntuboyina, A. and Sen, B./Convex Regression in multidimensions for some unit vectors v , . . . , v d +1 and prove the bound (26). Results of Gao and Wellner[20] can be used to show this if we consider the L p norm on the smaller setΩ := (cid:8) x ∈ R d : a i + η ( b i − a i ) ≤ v Ti x ≤ b i − η ( b i − a i ) , i = 1 , . . . , d + 1 (cid:9) for a fixed η > toall of Ω. We do this via induction by sequentially extending to each domain T r (Ω) := (cid:8) x ∈ R d : a i ≤ v Ti x ≤ b i , ≤ i ≤ r and a i + η ( b i − a i ) ≤ v Ti x ≤ b i − η ( b i − a i ) , r < i ≤ d + 1 (cid:9) for r = 0 , . . . , d + 1 (note that T (Ω) = Ω and T d +1 (Ω) = Ω). Details can be found inthe statement and proof of Lemma 9.1.The ideas behind the proof of Theorem 4.1 are similar but more technically involvedbecause of the discrete metric and the lack of any uniform boundedness. Even the firststep of proving the result in the strict interior (such as Ω above) of the full domain Ωis challenging as there are no prior results (such as those in Gao and Wellner [20]) inthis discrete unbounded setting. This result is proved in Proposition 9.5. The inductionstep (carried out in Proposition 9.11 and Lemma 9.12) is also more delicate.
6. Discusssion
This section has some high-level remarks on the results of the paper. Our minimaxsuboptimality results (for d ≥
5) are based on constructions involving piecewise affinefunctions. Specifically, we prove that the suboptimal rate n − /d is realized at the piece-wise affine function ˜ f k for k ∼ √ n . One might wonder if the convex LSE achieves thesame rate n − /d or a faster rate (such as the minimax rate n − / ( d +4) ) at smooth convexfunctions such as f ( x ) = k x k . This appears to be challenging to resolve. The maindifficulty stems from the fact that for such f , the function t H f ( t ) (defined as in(27) with ˜ f k replaced by f ) will take values of the same order ( n − /d up to logarith-mic factors) for n − /d . t . n − /d . Indeed, the upper bound of order n − /d (up to alogarithmic factor) can be proved by the arguments involved in the proof of Theorem2.1 and the lower bound of n − /d follows from Lemma 7.3 or Lemma 8.2 via Sudakovminoration (Lemma 7.4). The (square of the) maximizer of H f ( · ) determines the rateof convergence of the LSE (by Chatterjee [10, Theorem 1.1]) and the fact that H f ( · )takes values of the same order in the large interval n − /d . t . n − /d makes it difficultto accurately pin down the location of its maximizer.As already mentioned previously, our minimax suboptimality result for the boundedconvex LSE when the domain is a polytope (for d ≥
5) contrasts with the recent resultof Kur et al. [35] who proved that the bounded convex LSE is minimax optimal whenΩ = B d ( B d is the unit ball in R d ) for all d . Why is the LSE suboptimal over C B ([0 , d )but optimal over C B ( B d )? The class C B ( B d ) is much larger than C B ([0 , d ) in the senseof metric entropy under the L norm with respect to the Lebesgue measure. Indeedthe ǫ -entropy of C B ([0 , d ) is of the order ǫ − d/ while that of C B ( B d ) is of the order ur, G., Gao, F., Guntuboyina, A. and Sen, B./Convex Regression in multidimensions ǫ − ( d − (see Gao and Wellner [20]). This increased metric entropy of C B ( B d ) is drivenby the curvature (of the boundary) of B d . Specifically, one can obtain disjoint sphericalcaps S , . . . , S N (with N ∼ ǫ − ( d − ) of height ǫ such that the indicators of ∪ i ∈ H S i forsufficiently separated subsets H ⊆ { , . . . , N } form an ǫ -packing subset of C B ( B d ) in the L metric with respect to Lebesgue measure (strictly speaking, these indicator functionsare not convex but they can be approximated by piecewise affine convex functions).In other words, the complexity of C B ( B d ) is driven by the complexity of these well-separated subsets (unions of spherical caps) of B d . This aspect of C B ( B d ) is cruciallyused in Kur et al. [35] to prove the optimality of the LSE for C B ( B d ). In contrast,the complexity of C B ([0 , d ) (or more generally C B (Ω) when Ω is a polytope) is notdriven by indicator-like functions of subsets of the domain. Here ǫ -packing sets can beconstructed by local perturbations of a smooth convex function such as f ( x ) := k x k .This seems to be the main difference between C B ( B d ) and C B ([0 , d ) which is causingthe LSE to switch from minimax optimality to suboptimality.An interesting observation is that, in both the polytopal and the smooth cases, theworst case risk of the LSE over C B (Ω) equals, up to logarithmic factors, the globalGaussian complexity: E sup f ∈C B (Ω) n n X i =1 ξ i f ( X i ) (28)where ξ , . . . , ξ n , X , . . . , X n are independent with ξ , . . . , ξ n distributed as normal withmean zero and variance σ and X , . . . , X n distributed according to the uniform distri-bution P on Ω. When Ω is a polytope, (28) is of the order n − /d . To see this, one canupper bound (28) by using standard empirical process bounds via L ( P ) bracketing en-tropy bounds (see e.g., van de Geer [44, Theorem 5.11] restated as inequality (44)) andlower bound (28) by Sudakov minoration along with the metric entropy lower bound inLemma 8.2. When Ω = B d , this strategy of upper bounding (28) via L ( P ) bracketingentropy gives a suboptimal upper bound as explained by Kur et al. [35]. The reason isthat the L ( P n ) bracketing ǫ -entropy (here P n is the empirical measure of X , . . . , X n )is different from the L ( P ) bracketing ǫ -entropy for ǫ smaller than n − / ( d − . Thus, toprove the sharp bound for (28) in the ball case, Kur et al. [35] resort to a differenttechnique via level sets and chaining using L bracketing numbers.Isotonic regression is another shape constrained regression problem where the LSEis known to be minimax optimal for all dimensions (see Han et al. [29]). The class ofcoordinatewise monotone functions on [0 , d is similar to C B ( B d ) in that its metricentropy is driven by well-separated subsets of [0 , d (see Gao and Wellner [19, Proofof Proposition 2.1]). Other examples of such classes where the LSE is optimal for alldimensions can be found in Han [28].We proved our fixed-design risk bounds for the full convex LSE (in Section 2) in thecase where the domain Ω is a polytope. A natural question is to extend these to thecase where Ω is a smooth convex body such as the unit ball. Based on the results ofKur et al. [35], it is reasonable to conjecture that convex LSE will be minimax optimalin fixed design when the domain is the unit ball. However it appears nontrivial to prove ur, G., Gao, F., Guntuboyina, A. and Sen, B./Convex Regression in multidimensions this as the level set reduction employed in Kur et al. [35] cannot be used in the absenceof uniform boundedness. We hope to address this in future work.
7. Proofs of results from Section 2
This section has proofs for Theorems 2.1, 2.2, 2.3 and 2.5 (Lemma 2.4 is proved inSection 10). The metric entropy results: inequality (23), Corollary 4.2 and Corollary4.3 stated in Section 4 are crucial for these proofs. Let us also recall here some generalresults that will be used in these proofs starting with the folowing result of Chatterjee[10]. We use the following notation. For a function f on Ω, a class of functions F on Ωand t >
0, let B F P n ( f, t ) := { g ∈ F : ℓ P n ( f, g ) ≤ t } . (29)where ℓ P n is given in (5). Theorem 7.1 (Chatterjee) . Consider data generated according to the model: Y i = f ( X i ) + ξ i for i = 1 , . . . , n where X , . . . , X n are fixed deterministic design points in a convex body X ⊆ R d , f belongs to a convex class of functions F and ξ , . . . , ξ n are independently distributedaccording to the normal distribution with mean 0 and variance σ . Consider the LSE ˆ f n ( F ) ∈ argmin g ∈F n X i =1 ( Y i − g ( X i )) and define t f := argmax t ≥ H f ( t ) where H f ( t ) := E sup g ∈ B F P n ( f,t ) n n X i =1 ξ i ( g ( X i ) − f ( X i )) − t where B F P n ( f, t ) is defined in (29) . Then H f ( · ) is a concave function on [0 , ∞ ) , t f isunique and the following pair of inequalities hold for positive constants c and C : P n . t f ≤ ℓ P n ( ˆ f n ( F ) , f ) ≤ t f o ≥ − (cid:18) − cnt f σ (cid:19) (30) and . t f − Cσ n ≤ E ℓ P n ( ˆ f n ( F ) , f ) ≤ t f + Cσ n . Upper bounds for t f can be obtained via: t f ≤ inf { r > H f ( r ) ≤ } (31) and lower bounds for t f can be obtained via: t f ≥ r if ≤ r < r are such that H f ( r ) ≤ H f ( r ) . (32) ur, G., Gao, F., Guntuboyina, A. and Sen, B./Convex Regression in multidimensions Let us also recall the Dudley metric entropy bound for the supremum of a Gaussianprocess.
Theorem 7.2 (Dudley) . Let ξ , . . . , ξ n be independently distributed according to thenormal distribution with mean 0 and variance σ . Then for every deterministic X , . . . , X n ∈X , every class of functions F , f ∈ F and t ≥ , we have E sup g ∈ B F P n ( f,t ) n n X i =1 ξ i ( g ( X i ) − f ( X i )) ≤ σ inf <θ ≤ t/ √ n Z t/ θ q log N ( ǫ, B F P n ( f, t ) , ℓ P n ) dǫ + 2 θ ! . We start by using the metric entropy bound (23). Indeed, using (23) for p = 2 (andnoting that log(1 /δ ) ≤ c d log n because of (11) and the fact that F is a constantdepending on d alone), we getlog N ( ǫ, B C (Ω) P n ( f , t ) , ℓ P n ) ≤ C d (log n ) F (cid:18) t + L ǫ (cid:19) d/ . (33)for every t > ǫ > L = L ( f ) = inf f ∈A (Ω) ℓ P n ( f , f ). We now control G ( t ) := E sup g ∈C (Ω): ℓ P n ( f ,g ) ≤ t n n X i =1 ξ i ( g ( X i ) − f ( X i )) (34)so we can bound E f ℓ P n ( ˆ f n , f ) by Theorem 7.1. Theorem 7.2 along with (33) gives G ( t ) σ ≤ C d (log n ) F/ √ n Z t/ θ (cid:18) t + L ǫ (cid:19) d/ dǫ + 2 θ ≤ ( C d d/ )(log n ) F/ √ n (Z t/ θ (cid:18) tǫ (cid:19) d/ dǫ + Z t/ θ (cid:18) L ǫ (cid:19) d/ dǫ ) + 2 θ (35)for every 0 < θ ≤ t/
2. Below, we replace C d d/ by just C d .Before proceeding further, it is convenient to split into the three cases d ≤ , d = 4and d ≥
5. When d ≤
3, we take θ = 0 to get G ( t ) ≤ C d σ √ n (log n ) F/ (cid:0) t + L d/ t − d/ (cid:1) . Because C d σ √ n (log n ) F/ t ≤ t t ≥ C d σ √ n (log n ) F/ and C d σ √ n (log n ) F/ L d/ t − d/ ≤ t t ≥ (4 C d ) d +4 (cid:18) σ (log n ) F/ √ n (cid:19) d +4 L dd +4 , ur, G., Gao, F., Guntuboyina, A. and Sen, B./Convex Regression in multidimensions we deduce that H ( t ) := G ( t ) − t ≤ t ≥ C d max (cid:18) σ (log n ) F/ √ n (cid:19) d +4 L dd +4 , σ √ n (log n ) F/ ! . It follows from (31) that t f ≤ C d max (cid:18) σ (log n ) F/ √ n (cid:19) d +4 L dd +4 , σ √ n (log n ) F/ ! so Theorem 2.1 for d ≤ d = 4, (35) leads to G ( t ) ≤ C d σ √ n (log n ) F/ ( t + L ) log t θ + 2 σθ. Choosing θ = t/ (2 √ n ), we obtain G ( t ) ≤ C d σ √ n ( t + L ) (log n ) F/ from which we can deduce as before that H ( t ) = G ( t ) − t ≤ t ≥ C d max √ σ L (log n ) ( F/ / n / σ (log n ) F/ √ n ! which proves Theorem 2.1 for d = 4.Finally, for d ≥
5, (35) leads to the bound G ( t ) ≤ C d σ (log n ) F/ √ n (Z ∞ θ (cid:18) tǫ (cid:19) d/ dǫ + Z ∞ θ (cid:18) L ǫ (cid:19) d/ dǫ ) + 2 σθ ≤ C d σ (log n ) F/ √ n ( t + L ) d/ θ − ( d/ + 2 σθ for every θ >
0. The choice θ = (cid:18) C d (log n ) F/ √ n (cid:19) d/ ( t + L )gives G ( t ) ≤ σ (cid:18) C d (log n ) F/ √ n (cid:19) /d ( t + L )from which it follows that H ( t ) = G ( t ) − t ≤ t ≥ max √ σ L (cid:18) (log n ) F/ √ n (cid:19) /d , σ (cid:18) (log n ) F/ √ n (cid:19) /d ! which concludes the proof of Theorem 2.1. ur, G., Gao, F., Guntuboyina, A. and Sen, B./Convex Regression in multidimensions This basically follows from Theorem 2.5. Let c d and N d be as given by Theorem 2.5.Letting k = √ nσ − d/ and assuming that n ≥ max( N d , ( c − d ) σ − d/ ), we obtain fromTheorem 2.5 that sup f ∈C CdCd (Ω) E f ℓ P n ( ˆ f n , f ) ≥ c d σn − /d (log n ) − d +1) /d . where C d is such that ˜ f k ∈ C C d C d (Ω) (existence of such a C d is guaranteed by Lemma2.4). The required lower bound (15) on the class C LL (Ω) for an arbitrary L >
Theorem 2.3 follows from a straightforward application of the metric entropy bound inCorollary 4.2 and the general results Theorem 7.1 and Theorem 7.2. Indeed, combiningCorollary 4.2 and Theorem 7.2, we get G ( t ) ≤ C d σ r kn (log n ) h/ Z t/ θ (cid:18) tǫ (cid:19) d/ dǫ + 2 σθ (36)for every 0 < θ ≤ t/ G ( t ) is as in (34). We now split into the three cases d ≤ d = 4 and d ≥
5. When d ≤
3, we take θ = 0 to obtain G ( t ) ≤ C d tσ r kn (log n ) h/ so that G ( t ) ≤ t / t ≥ C d σ q kn (log n ) h/ . This proves Theorem 2.3 for d ≤ d = 4, we get G ( t ) ≤ C d σ r kn (log n ) h/ t log t θ + 2 σθ. Choosing θ := t/ (2 √ n ), we get G ( t ) ≤ C d σt r kn (log n ) h/ . This gives G ( t ) ≤ t / t ≥ C d σ q kn (log n ) h/ which proves Theorem 2.3 for d = 4.For d ≥
5, we get G ( t ) ≤ C d σ r kn (log n ) h/ Z ∞ θ (cid:18) tǫ (cid:19) d/ dǫ + 2 σθ ≤ C d σ r kn (log n ) h/ t d/ θ − ( d/ + 2 σθ. ur, G., Gao, F., Guntuboyina, A. and Sen, B./Convex Regression in multidimensions Take θ = t (cid:16) C d (log n ) h/ q kn (cid:17) /d to get G ( t ) ≤ C d σt (log n ) h/ r kn ! /d . (37)This clearly implies that G ( t ) ≤ t / t ≥ C d σ (log n ) h/ r kn ! /d which completes the proof of Theorem 2.3. In addition to Theorem 7.1, Theorem 7.2, Lemma 2.4 and Corollary 4.3, this proof willneed the following two results. The proof of the first result below is given in Section 10while the second result is standard. Recall the notation (29).
Lemma 7.3.
Let Ω be a convex body contained in the unit ball whose volume is boundedfrom below by a constant depending on d alone. Let f ( x ) := k x k . There exist twopositive constants c and c depending on d alone such that log N ( c n − /d , B C (Ω) P n ( f , t ) , ℓ P n ) ≥ n for t ≥ c n − /d . Lemma 7.4 (Sudakov minoration) . Let ξ , . . . , ξ n be independently distributed accord-ing to the normal distribution with mean 0 and variance σ . Then for every deterministic X , . . . , X n ∈ X , every class of functions F and t ≥ , we have E sup g ∈ B F P n ( f,t ) n n X i =1 ξ i ( g ( X i ) − f ( X i )) ≥ βσ √ n sup ǫ> (cid:26) ǫ q log N ( ǫ, B F P n ( f, t ) , ℓ P n ) (cid:27) . Proof of Theorem 2.5.
By Lemma 2.4, ˜ f k satisfies ℓ P n ( f , ˜ f k ) ≤ sup x ∈ Ω (cid:12)(cid:12)(cid:12) f ( x ) − ˜ f k ( x ) (cid:12)(cid:12)(cid:12) ≤ C d k − /d . (38)where f ( x ) := k x k . Theorem 7.1 says that E ˜ f k ℓ P n ( ˆ f n , ˜ f k ) can be bounded from belowby lower bounding t ˜ f k where t ˜ f k is the maximizer of ˜ G ( t ) − t / t ≥ G ( t ) := E sup g ∈C (Ω): ℓ P n ( ˜ f k ,g ) ≤ t n n X i =1 ξ i (cid:16) g ( X i ) − ˜ f k ( X i ) (cid:17) . Note that we are working in the fixed design setting so X , . . . , X n are non-random andthe expectation above is being taken with respect to the randomness in ξ , . . . , ξ n . ur, G., Gao, F., Guntuboyina, A. and Sen, B./Convex Regression in multidimensions We shall lower bound t ˜ f k by proving suitable upper and lower bounds for the function˜ G ( · ). Note first that by the properties of ˜ f k given in Lemma 2.4, we can apply the metricentropy bound in Corollary 4.3 along with Theorem 7.2 to get (similar to the calculationunderlying (37)) a constant Υ d such that˜ G ( t ) ≤ Υ d σt (log n ) d +1) /d (cid:18) kn (cid:19) /d for every t > G ( t ): there exist positive constants γ d andΓ d depending on d alone such that˜ G ( t ) ≥ γ d σt (cid:18) kn (cid:19) /d for all t ≤ γ d k − /d (40)provided k ≤ Γ d n . To prove this, suppose first that t = 2 C d k − /d where C d is theconstant from (38). For this choice of t , it follows from the triangle inequality (and(38)) that B C (Ω) P n ( ˜ f k , t ) ⊇ B C (Ω) P n ( f , C d k − /d )where, it may be recalled, B C (Ω) P n ( f, s ) := { g ∈ C (Ω) : ℓ P n ( f, g ) ≤ s } . This immediatelyimplies ˜ G ( t ) ≥ G ( C d k − /d ) . where G ( t ) is defined as in (34). Lemma 7.4 now gives˜ G ( t ) ≥ G ( C d k − /d ) ≥ βσ √ n sup ǫ> (cid:26) ǫ q log N ( ǫ, B P n ( f , C d k − /d ) , ℓ P n ) (cid:27) . Using the lower bound on the metric entropy given by Lemma 7.3 for ǫ = c n − /d gives˜ G ( t ) ≥ βσ √ n ( c n − /d ) r n βc √ σn − /d provided k ≤ (cid:18) C d c (cid:19) d/ n. (41)The condition above is necessary for the inequality c n − /d ≤ C d k − /d which is requiredfor the application of Lemma 7.3. This gives˜ G ( t ) ≥ βc √ σn − /d for t = 2 C d k − /d . Now for t ≤ C d k − /d , we use the fact that x ˜ G ( x ) is concave on [0 , ∞ ) (and that˜ G (0) = 0) to deduce that˜ G ( t ) t ≥ ˜ G (2 C d k − /d )2 C d k − /d ≥ σ (cid:18) kn (cid:19) /d βc √ C d for all t ≤ C d k − /d . ur, G., Gao, F., Guntuboyina, A. and Sen, B./Convex Regression in multidimensions This proves (40) for γ d = min (cid:18) βc √ C d , C d (cid:19) and Γ d = (cid:18) c C d c (cid:19) d/ . We shall now bound the quantity t f k (defined in Theorem 7.1) from below using (39)and (40). By the lower bound in (40), we getsup t> (cid:18) ˜ G ( t ) − t (cid:19) ≥ sup t ≤ γ d k − /d γ d σt (cid:18) kn (cid:19) /d − t ! . Taking t = γ d σ ( k/n ) /d and noting that t = γ d σ (cid:18) kn (cid:19) /d ≤ γ d k − /d if and only if k ≤ √ nσ − d/ , we get that sup t> (cid:18) ˜ G ( t ) − t (cid:19) ≥ γ d σ (cid:18) kn (cid:19) /d . The above inequality, combined with (39) and the fact that t ˜ f k maximizes ˜ G ( t ) − t / t >
0, yield γ d σ (cid:18) kn (cid:19) /d ≤ sup t> (cid:18) ˜ G ( t ) − t (cid:19) = ˜ G ( t ˜ f k ) − t f k ≤ ˜ G ( t ˜ f k ) ≤ Υ d σt ˜ f k (log n ) d +1) /d (cid:18) kn (cid:19) /d . This implies t ˜ f k ≥ γ d d σ (cid:18) kn (cid:19) /d (log n ) − d +1) /d . Theorem 7.1 then gives E ˜ f k ℓ P n (cid:16) ˆ f n , ˜ f k (cid:17) ≥ γ d d σ (cid:18) kn (cid:19) /d (log n ) − d +1) /d − Cσ n . It is now clear that the first term on the right hand side above dominates the secondterm when n is larger than a constant depending on d alone. This completes the proofof Theorem 2.5.
8. Proofs of results from Section 3
We provide here the proofs for Theorem 3.1 and Theorem 3.2. These proofs are similarin spirit to that of Theorem 2.5 with some differences that are necessary to deal withthe random design setting. Let us first state some general results that will be used inthese proofs. ur, G., Gao, F., Guntuboyina, A. and Sen, B./Convex Regression in multidimensions The proof of Theorem 2.5 needed the ingredients: Theorem 7.1, Theorem 7.2, Lemma7.3, Lemma 2.4 and Lemma 7.4. Modified forms of these ingredients to cover the randomdesign setting (as described below) are used for the proof of Theorem 3.1 and Theorem3.2.As in the proof of Theorem 2.5, a key role will be played by Theorem 7.1 of Chatterjee[10]. Theorem 7.1 holds for the fixed design setting with no restriction on the designpoints which means that it also applies to the random design setting provided wecondition on the design points X , . . . , X n . In particular, for our random design settingwith ℓ P n defined as in (5), inequality (30) becomes: P (cid:26) . t f ≤ ℓ P n ( ˆ f n , f ) ≤ t f (cid:12)(cid:12)(cid:12)(cid:12) X , . . . , X n (cid:27) ≥ − (cid:18) − cnt f σ (cid:19) (42)where t f = t f ( X , . . . , X n ) := argmax t ≥ H f ( t ) (43)with H f ( t ) := E " sup g ∈ B F P n ( f,t ) n n X i =1 ξ i ( g ( X i ) − f ( X i )) (cid:12)(cid:12)(cid:12)(cid:12) X , . . . , X n − t . Here t f = t f ( X , . . . , X n ) is random as it depends on the random design points X , . . . , X n .Instead of Dudley’s theorem (Theorem 7.2), we shall use the following theorem onthe suprema of empirical processes. The first conclusion of the theorem below is takenfrom van de Geer [44, Theorem 5.11] while the second conclusion essentially followsfrom van de Geer [44, Proof of Lemma 5.16]. Theorem 8.1.
Suppose X , . . . , X n are independently distributed according to a distri-bution P on Ω . Suppose F is a class of real-valued functions on Ω that are uniformlybounded by Γ > . Then the following two statements are true:1. There exists a positive constant C such that E sup f ∈F | P n f − P f | ≤ C inf (cid:26) a ≥ Γ √ n : a ≥ C √ n Z Γ a q log N [ ] ( u, F , L ( P )) du (cid:27) . (44)
2. There exists a positive constant C such that P (cid:26) sup f,g ∈F ( ℓ P ( f, g ) − ℓ P n ( f, g )) > Ca (cid:27) ≤ C exp (cid:18) − na C Γ (cid:19) (45) and P (cid:26) sup f,g ∈F ( ℓ P n ( f, g ) − ℓ P ( f, g )) > Ca (cid:27) ≤ C exp (cid:18) − na C Γ (cid:19) (46) provided na ≥ C log N [ ] ( a, F , L ( P )) . (47) ur, G., Gao, F., Guntuboyina, A. and Sen, B./Convex Regression in multidimensions Instead of Lemma 7.3, we shall use the following result which proves the same lowerbound as in Lemma 7.3 in the random design setting with high probability. Recall that C LL (Ω) denotes the class of all convex functions on Ω that are L -Lipschitz and uniformlybounded by L . Lemma 8.2.
Let Ω be a convex body that contains a ball of constant (depending on d alone) radius. Let f ( x ) := k x k . Then there exist positive constants c , c , c , c and C depending on d alone such that P n log N ( ǫ, B C LL (Ω) P n ( f , t ) , ℓ P n ) ≥ c ǫ − d/ o ≥ − exp( − c n ) (48) for each fixed ǫ, t, L satisfying L ≥ C and c n − /d ≤ ǫ ≤ min( c , t/ . Lemma 2.4 will also be crucially used in the proof of Theorem 3.1. The followinganalogue of Lemma 2.4 for the case when Ω is not necessarily a polytope will be usedin the proof of Theorem 3.2.
Lemma 8.3.
Suppose Ω is a convex body that is contained in the unit ball and contains aball of constant (depending on d alone) radius centered at zero. Let f ( x ) := k x k . Thereexists a positive constant C d (depending on dimension alone) such that the following istrue. For every k ≥ , there exist m ≤ C d k d -simplices ∆ , . . . , ∆ m ⊆ Ω having disjointinteriors and a convex function ˜ f k such that1. (1 − C d k − /d )Ω ⊆ ∪ mi =1 ∆ i ⊆ Ω ,2. ˜ f k is affine on each ∆ i , i = 1 , . . . , m ,3. sup x ∈ Ω | f ( x ) − ˜ f k ( x ) | ≤ C d k − /d ,4. ˜ f k ∈ C C d C d (Ω) . Lemma 7.4 will be used in the proofs of Theorem 3.1 and Theorem 3.2 in the followingconditional form: E " sup B F P n ( f,t ) n n X i =1 ξ i ( g ( X i ) − f ( X i )) (cid:12)(cid:12)(cid:12)(cid:12) X , . . . , X n ≥ βσ √ n sup ǫ> (cid:26) ǫ q log N ( ǫ, B F P n ( f, t ) , ℓ P n ) (cid:27) . (49)We are now ready for the proofs of Theorem 3.1 and Theorem 3.2. Proof of Theorem 3.1.
It is enough to prove (20) when B is a fixed dimensional con-stant. From here, the inequality for arbitrary B > f ( x ) := k x k and ˜ f k be as given by Lemma 2.4. Below we shall assume that B is a large enough dimensional constant so that ˜ f k ∈ C B (Ω). The main task in this proofwill be to bound the quantity t ˜ f k (defined via (43)) from below where t ˜ f k maximizes H ˜ f k ( t ) := G ˜ f k ( t ) − t ur, G., Gao, F., Guntuboyina, A. and Sen, B./Convex Regression in multidimensions over all t ≥ G ˜ f k ( t ) := E sup g ∈ B C B (Ω) P n ( ˜ f k ,t ) n n X i =1 ξ i (cid:16) g ( X i ) − ˜ f k ( X i ) (cid:17) (cid:12)(cid:12)(cid:12)(cid:12) X , . . . , X n . (50)We shall prove a lower bound for t ˜ f k that holds with high probability over the ran-domness in X , . . . , X n . Specifically, we shall prove the existence of three dimensionalconstants γ d , c d and C d and a constant N d,σ which depends on d and σ such that P (cid:8) t ˜ f k ≥ c d n − /d √ σ (log n ) − d +1) /d (cid:9) ≥ − C d exp (cid:18) − n ( d − /d C d (cid:19) (51)for k = γ d √ nσ − d/ and n ≥ N d,σ .Before proceeding with the proof of (51), let us first show how (51) completes theproof of Theorem 3.1. Note first thatsup f ∈C B (Ω) E f ℓ P ( ˆ f n ( C B (Ω)) , f ) ≥ E ˜ f k ℓ P ( ˆ f n ( C B (Ω)) , ˜ f k )so it is enough to prove that the right hand side of (20) is a lower bound for E ˜ f k ℓ P ( ˆ f n ( C B (Ω)) , ˜ f k ).We shall assume therefore that the data have been generated from the true function˜ f k . Let ρ n be the lower bound on t ˜ f k given by (51) i.e., ρ n := c d n − /d √ σ (log n ) − d +1) /d . (52)Inequality (42) clearly implies P (cid:26) ℓ P n ( ˆ f n ( C B (Ω)) , ˜ f k ) ≥ t f k (cid:12)(cid:12)(cid:12)(cid:12) X , . . . , X n (cid:27) ≥ − − cnt f k σ ! . As a result, P (cid:26) ℓ P n ( ˆ f n ( C B (Ω)) , ˜ f k ) ≥ ρ n (cid:27) ≥ P (cid:26) ℓ P n ( ˆ f n ( C B (Ω)) , ˜ f k ) ≥ t f k , t ˜ f k ≥ ρ n (cid:27) = E (cid:20) I (cid:8) t ˜ f k ≥ ρ n (cid:9) P (cid:26) ℓ P n ( ˆ f n ( C B (Ω)) , ˜ f k ) ≥ t f k (cid:12)(cid:12)(cid:12)(cid:12) X , . . . , X n (cid:27)(cid:21) ≥ E " I (cid:8) t ˜ f k ≥ ρ n (cid:9) − − cnt f k σ !! ≥ (cid:18) − (cid:18) − cnρ n σ (cid:19)(cid:19) P (cid:8) t ˜ f k ≥ ρ n (cid:9) . We can now use (51) to obtain P (cid:26) ℓ P n ( ˆ f n ( C B (Ω)) , ˜ f k ) ≥ ρ n (cid:27) ≥ (cid:18) − (cid:18) − cnρ n σ (cid:19)(cid:19) (cid:18) − C d exp (cid:18) − n ( d − /d C d (cid:19)(cid:19) ≥ − (cid:18) − cnρ n σ (cid:19) − C d exp (cid:18) − n ( d − /d C d (cid:19) . ur, G., Gao, F., Guntuboyina, A. and Sen, B./Convex Regression in multidimensions Clearly if N d,σ is chosen appropriately, then, for n ≥ N d,σ , nρ n σ = c d σ n ( d − /d (log n ) − d +1) /d will be larger than any constant multiple of n ( d − /d which gives P (cid:26) ℓ P n ( ˆ f n ( C B (Ω)) , ˜ f k ) ≥ ρ n (cid:27) ≥ − C d exp (cid:18) − n ( d − /d C d (cid:19) . (53)We shall now argue that a similar inequality also holds for ℓ P ( ˆ f n ( C B (Ω)) , ˜ f k ). For every a >
0, we have P (cid:26) ℓ P ( ˆ f n ( C B (Ω)) , ˜ f k ) ≥ (cid:18) ρ n √ − a (cid:19)(cid:27) ≥ P ( ℓ P n ( ˆ f n ( C B (Ω)) , ˜ f k ) ≥ ρ n , sup f,g ∈C B (Ω) ( ℓ P n ( f, g ) − ℓ P ( f, g )) ≤ a ) ≥ P (cid:26) ℓ P n ( ˆ f n ( C B (Ω)) , ˜ f k ) ≥ ρ n (cid:27) + P ( sup f,g ∈C B (Ω) ( ℓ P n ( f, g ) − ℓ P ( f, g )) ≤ a ) − ≥ P ( sup f,g ∈C B (Ω) ( ℓ P n ( f, g ) − ℓ P ( f, g )) ≤ a ) − C d exp (cid:18) − n ( d − /d C d (cid:19) . To bound the probability above, we use (46). Gao and Wellner [20, Theorem 1.5] giveslog N [ ] ( ǫ, C B (Ω) , ℓ P ) ≤ C d (cid:18) Bǫ (cid:19) d/ (55)The requirement (47) is therefore satisfied when a is n − / ( d +4) L d/ ( d +4) multiplied by alarge enough dimensional constant. Inequality (46) then gives P ( sup f,g ∈C B (Ω) ( ℓ P n ( f, g ) − ℓ P ( f, g )) ≤ C d n − / ( d +4) B d/ ( d +4) ) ≥ − C exp (cid:18) − n d/ ( d +4) C d B / ( d +4) (cid:19) (56)which then implies that P (cid:26) ℓ P ( ˆ f n ( C B (Ω)) , ˜ f k ) ≥ (cid:0) c d n − /d √ σ (log n ) − d +1) /d − C d n − / ( d +4) B d/ ( d +4) (cid:1)(cid:27) ≥ − C exp (cid:18) − n d/ ( d +4) C d B / ( d +4) (cid:19) − C d exp (cid:18) − n ( d − /d C d (cid:19) . Because n − / ( d +4) is of a smaller order than n − /d and n d/ ( d +4) is of a larger order than n ( d − /d (and B is a dimensional constant), we obtain P n ℓ P ( ˆ f n ( C B (Ω)) , ˜ f k ) ≥ c d n − /d √ σ (log n ) − d +1) /d o ≥ − C d exp (cid:18) − n ( d − /d C d (cid:19) . ur, G., Gao, F., Guntuboyina, A. and Sen, B./Convex Regression in multidimensions provided n ≥ N d,σ where N d,σ is a constant depending on d and σ alone. Finally, notethat N d,σ can be chosen so that P n ℓ P ( ˆ f n ( C B (Ω)) , ˜ f k ) ≥ c d n − /d √ σ (log n ) − d +1) /d o ≥ E ˜ f k ℓ P ( ˆ f n ( C B (Ω)) , ˜ f k ) ≥ c d σn − /d (log n ) − d +1) /d . (57)This completes the proof of Theorem 3.1 assuming that (51) is true.Let us now start the proof of (51). For this purpose, we shall require both upper andlower bounds for G ˜ f k ( t ) (defined in (50)) for appropriate values of t . We start to proveupper bounds. Note first that B C B (Ω) P n ( ˜ f k , t ) ⊆ B C B (Ω) P ( ˜ f k , t + C d n − / ( d +4) B d/ ( d +4) ) (58)with probability at least 1 − C exp (cid:18) − n d/ ( d +4) C d B / ( d +4) (cid:19) . (59)Here we are using the notation B F P ( f, t ) := { g ∈ F : ℓ P ( f, g ) ≤ t } . (60)where ℓ P is given in (8). (58) is a consequence of P ( sup f,g ∈C B (Ω) ( ℓ P ( f, g ) − ℓ P n ( f, g )) ≤ C d n − / ( d +4) B d/ ( d +4) ) ≥ − C exp (cid:18) − n d/ ( d +4) C d B / ( d +4) (cid:19) (61)whose proof follows from the same argument as the proof of (56). Thus for t ≥ C d n − / ( d +4) B d/ ( d +4) , (62)we get B C B (Ω) P n ( ˜ f k , t ) ⊆ B C B (Ω) P ( ˜ f k , t )with probability at least (59). We deduce consequently that, for a fixed t satisfying (62),the event: G ˜ f k ( t ) ≤ G ˜ f k (3 t ) := E sup g ∈ B C B (Ω) P ( ˜ f k , t ) n n X i =1 ξ i (cid:16) g ( X i ) − ˜ f k ( X i ) (cid:17) (cid:12)(cid:12)(cid:12)(cid:12) X , . . . , X n ur, G., Gao, F., Guntuboyina, A. and Sen, B./Convex Regression in multidimensions holds with probability at least (59). It is easy to see that the function T ( x , . . . , x n ) := E sup g ∈ B C B (Ω) P ( ˜ f k , t ) n n X i =1 ξ i (cid:16) g ( X i ) − ˜ f k ( X i ) (cid:17) (cid:12)(cid:12)(cid:12)(cid:12) X = x , . . . , X n = x n satisfies the bounded differences condition: | T ( x , . . . , x n ) − T ( x ′ , . . . , x ′ n ) | ≤ Bσn n X i =1 I { x i = x ′ i } and the bounded differences concentration inequality consequently gives P (cid:8) G ˜ f k (3 t ) ≤ E G ˜ f k (3 t ) + x (cid:9) ≥ − exp (cid:18) − nx B σ (cid:19) (63)for every x >
0. We next control E G ˜ f k (3 t ) = E sup g ∈ B C B (Ω) P ( ˜ f k , t ) n n X i =1 ξ i (cid:16) g ( X i ) − ˜ f k ( X i ) (cid:17) where the expectation on the left hand side is with respect to X , . . . , X n while theexpectation on the right hand side is with respect to ξ , . . . , ξ n , X , . . . , X n . Clearly E G ˜ f k (3 t ) = E sup h ∈H ( Q n h − Q h ) (64)where H consists of all functions of the form ( ξ, x ) ξ (cid:16) g ( x ) − ˜ f k ( x ) (cid:17) as g varies over B C B (Ω) P ( ˜ f k , t ), Q n is the empirical measure corresponding to ( ξ i , X i ) , i = 1 , . . . , n , and Q is the distribution of ( ξ, X ) where ξ and X are independent with ξ ∼ N (0 , σ ) and X ∼ P .We now use the bound (44) requires us to control N [ ] ( ǫ, H , L ( Q )). This is done byTheorem 4.4 which states thatlog N [ ] ( ǫ, B C B (Ω) P ( ˜ f k , t ) , L ( P )) ≤ C d k (cid:18) log C d Bǫ (cid:19) d +1 (cid:18) tǫ (cid:19) d/ . (65)Theorem 4.4 is stated under the unnormalized integral constraint R Ω ( f − ˜ f k ) ≤ t andfor bracketing numbers under the unnormalized Lebesgue measure but this implies (65)as the volume of Ω is assumed to be bounded on both sides by dimensional constants.We now claim that N [ ] ( ǫ, H , L ( Q )) ≤ N [ ] ( ǫσ − , B C B (Ω) P ( ˜ f k , t ) , L ( P )) . (66) ur, G., Gao, F., Guntuboyina, A. and Sen, B./Convex Regression in multidimensions Inequality (66) is true because of the following. Let { [ g L , g U ] , g ∈ G } be a set of coveringbrackets for the set B C B (Ω) P ( ˜ f k , t ). For each bracket [ g L , g U ], we associate a correspond-ing bracket [ h L , h U ] for H as follows: h L ( ξ, x ) := ξ (cid:16) g L ( x ) − ˜ f k ( x ) (cid:17) I { ξ ≥ } + ξ (cid:16) g U ( x ) − ˜ f k ( x ) (cid:17) I { ξ < } and h U ( ξ, x ) := ξ (cid:16) g U ( x ) − ˜ f k ( x ) (cid:17) I { ξ ≥ } + ξ (cid:16) g L ( x ) − ˜ f k ( x ) (cid:17) I { ξ < } . It is now easy to check that whenever g L ≤ g ≤ g U , we have h L ≤ h g ≤ h U where h g ( ξ, x ) = ξ (cid:16) g ( x ) − ˜ f k ( x ) (cid:17) . Further, h U − h L = | ξ | ( g U − g L ) and thus Q ( h U − h L ) = σ P ( g U − g L ) which proves (66). Inequality (65) then gives that for every a ≥ B/ √ n ,we have Z Ba q log N [ ] ( u, H , L ( Q )) du ≤ C d √ k Z Ba (cid:18) log C d Bσu (cid:19) ( d +1) / (cid:18) tσu (cid:19) d/ du ≤ C d √ k ( tσ ) d/ (cid:18) log C d Bσa (cid:19) ( d +1) / Z ∞ a u − d/ du ≤ C d √ k ( tσ ) d/ (cid:18) log C d Bσa (cid:19) ( d +1) / a − ( d/ ≤ C d √ k ( tσ ) d/ (cid:0) log( C d σ √ n ) (cid:1) ( d +1) / a − ( d/ where, in the last inequality, we used a ≥ B/ √ n . The inequality a ≥ Cn − / R Ba p log N [ ] ( u, H , L ( Q )) du will therefore be satisfied for a ≥ C d tσ (cid:18) kn (cid:19) /d (cid:0) log( C d σ √ n ) (cid:1) d +1) /d for an appropriate constant C d . The bound (44) then gives E G ˜ f k (3 t ) ≤ C d tσ (cid:18) kn (cid:19) /d (cid:0) log( C d σ √ n ) (cid:1) d +1) /d + B √ n . Assuming now that t ≥ Bk − /d σ n (4 − d ) / (2 d ) , (67)we deduce E G ˜ f k (3 t ) ≤ (1 + C d ) tσ (cid:18) kn (cid:19) /d (cid:0) log( C d σ √ n ) (cid:1) d +1) /d (68)Putting the above steps together, we obtain that for every x >
0, the inequality G ˜ f k ( t ) ≤ G ˜ f k (3 t ) ≤ E G ˜ f k ( t ) + x ≤ C d tσ (cid:18) kn (cid:19) /d (cid:0) log( C d σ √ n ) (cid:1) d +1) /d + x ur, G., Gao, F., Guntuboyina, A. and Sen, B./Convex Regression in multidimensions holds with probability at least1 − C exp (cid:18) − n d/ ( d +4) C d B / ( d +4) (cid:19) − exp (cid:18) − nx B σ (cid:19) for every fixed t satisfying (62) and (67). We take x = x ( t ) := C d tσ (cid:18) kn (cid:19) /d (cid:0) log( C d σ √ n ) (cid:1) d +1) /d to deduce that H ˜ f k ( t ) ≤ G ˜ f k ( t ) ≤ C d tσ (cid:18) kn (cid:19) /d (cid:0) log( C d σ √ n ) (cid:1) d +1) /d (69)with probability at least1 − C exp (cid:18) − n d/ ( d +4) C d B / ( d +4) (cid:19) − exp (cid:18) − nx ( t )2 B σ (cid:19) (70)provided t satisfies (62) and (67).We shall next prove a lower bound for H ˜ f k ( t ). The key ingredients here are Lemma8.2 and the conditional form (49) of Sudakov’s minoration. Let us first prove lowerbounds for G ˜ f k ( t ). Note that B C B (Ω) P n ( ˜ f k , t ) ⊇ B C B (Ω) P n ( f , t/
2) whenever ℓ P n ( f , ˜ f k ) ≤ t/ . Because ℓ P n ( f , ˜ f k ) ≤ sup x ∈ Ω | f ( x ) − ˜ f k ( x ) | ≤ L d k − /d , the condition ℓ P n ( f , ˜ f k ) ≤ t/ t ≥ L d k − /d . (71)Thus for t satisfying the above, G ˜ f k ( t ) ≥ E sup g ∈ B C B (Ω) P n ( f ,t/ n n X i =1 ξ i (cid:16) g ( X i ) − ˜ f k ( X i ) (cid:17) (cid:12)(cid:12)(cid:12)(cid:12) X , . . . , X n = E sup g ∈ B C B (Ω) P n ( f ,t/ n n X i =1 ξ i ( g ( X i ) − f ( X i )) (cid:12)(cid:12)(cid:12)(cid:12) X , . . . , X n . Inequality (49) then gives G ˜ f k ( t ) ≥ βσ √ n sup ǫ> (cid:26) ǫ q log N ( ǫ, B C B (Ω) P n ( f , t/ , P n ) (cid:27) We now use Lemma 8.2 with ǫ = c n − /d to claim that for B ≥ C and t ≥ c n − /d , (72) ur, G., Gao, F., Guntuboyina, A. and Sen, B./Convex Regression in multidimensions we have G ˜ f k ( t ) ≥ β √ c σc − ( d/ n − /d with probability at least 1 − exp( − c n ). This gives the following lower bound on H ˜ f k ( t ): H ˜ f k ( t ) ≥ β √ c σc − ( d/ n − /d − t . Taking t = t where t = β √ c σc − ( d/ n − /d (73)gives us that H ˜ f k ( t ) ≥ t β √ c σc − ( d/ n − /d (74)with probability at least 1 − exp( − c n ) provided t = t satisfies (71) and (72). Thecondition (71) is equivalent to k ≥ L d β √ c c − ( d/ ! d/ √ nσ − d/ (75)and (72) is equivalent to n ≥ c β √ c c − ( d/ ! d/ σ − d/ . (76)We shall now combine (69) and (74). Suppose t > C d t σ (cid:18) kn (cid:19) /d (cid:0) log( C d σ √ n ) (cid:1) d +1) /d = β √ c σc − ( d/ n − /d where C d is as in (69) and the other constants ( β , c and c ) are from (74). The aboveequality is the same as t = β √ c c − ( d/ C d k − /d (cid:0) log( C d σ √ n ) (cid:1) − d +1) /d . In that case, (69) and (74) together imply that H ˜ f k ( t ) ≤ H ˜ f k ( t )with probability at least1 − C exp (cid:18) − n d/ ( d +4) C d B / ( d +4) (cid:19) − exp (cid:18) − nx ( t )2 B σ (cid:19) − exp( − c n ) . (77) ur, G., Gao, F., Guntuboyina, A. and Sen, B./Convex Regression in multidimensions If we now assume that k ≥ (2 C d ) − d/ (cid:16) β √ c c − ( d/ (cid:17) d/ √ nσ − d/ , (78)then t < t . Inequality (32) then gives that t ˜ f k ≥ t = β √ c c − ( d/ C d k − /d (cid:0) log( C d σ √ n ) (cid:1) − d +1) /d . with probability at least (77).We shall now take k = γ d √ nσ − d/ where γ d is the larger of the two dimensionalconstants on the right hand sides of (75) and (78) and this will obviously ensure thatboth (75) and (78) are satisfied. The quantity t then equals t = c d n − /d √ σ (cid:0) log( C d nσ − ( d/ ) (cid:1) − d +1) /d (79)for a specific c d and x ( t ) = C d c d γ /dd σn − /d . The probability in (77) then equals1 − C exp (cid:18) − n d/ ( d +4) C d B / ( d +4) (cid:19) − exp (cid:18) − C d c d ( γ d ) /d n ( d − /d B (cid:19) − exp( − c n ) . Now if we assume that B ≥
1, then n d/ ( d +4) C d B / ( d +4) ≥ n d/ ( d +4) C d B ≥ n ( d − /d B and also n ≥ n ( d − /d /B . We thus deduce that the probability in (77) is bounded frombelow by 1 − C d exp (cid:18) − n ( d − /d C d B (cid:19) . which can be further simplified to1 − C d exp (cid:18) − n ( d − /d C d (cid:19) . as B is a constant that only depends on d . If we now take n to be larger than a constant N d,σ depending on d and σ alone, then the conditions (62) and (71) will be satisfiedfor t = t and (76) will also be satisfied. Finally the logarithmic term in (79) can befurther simplified by the bound log( C d nσ − ( d/ ) ≤ c d log n . This completes the proofof (51) and consequently Theorem 3.1. ur, G., Gao, F., Guntuboyina, A. and Sen, B./Convex Regression in multidimensions Proof of Theorem 3.2.
It is enough to prove (21) when L is a fixed dimensional constant.From here, the inequality for arbitrary L > f ( x ) := k x k and ˜ f k be as given by Lemma 8.3. Let L be a dimensional constantlarge enough so that ˜ f k ∈ C LL (Ω). As in the proof of Theorem 3.1, the key is to prove(51) where t ˜ f k is defined as the maximizer of H ˜ f k ( t ) := G ˜ f k ( t ) − t t ≥ G ˜ f k ( t ) := E sup g ∈ B C L (Ω) P n ( ˜ f k ,t ) n n X i =1 ξ i (cid:16) g ( X i ) − ˜ f k ( X i ) (cid:17) (cid:12)(cid:12)(cid:12)(cid:12) X , . . . , X n . (80)Before proceeding with the proof of (51), let us first show how (51) completes the proofof Theorem 3.2. Because ˜ f k ∈ C LL (Ω), it is enough to prove that the right hand side of(21) is a lower bound for E ˜ f k ℓ P ( ˆ f n ( C L (Ω)) , ˜ f k ). We shall assume therefore that the datahave been generated from the true function ˜ f k . Note first that, as shown in the proofof inequality (53) in Theorem 3.1, inequality (51) leads to P (cid:26) ℓ P n ( ˆ f n ( C L (Ω) , ˜ f k ) ≥ ρ n (cid:27) ≥ − C d exp (cid:18) − n ( d − /d C d (cid:19) where ρ n is given by (52) and n ≥ N d,σ for a large enough constant N d,σ depending onlyon d and σ . A similar inequality will be shown below for ℓ P ( ˆ f n ( C L (Ω) , ˜ f k ). For every a >
0, we write P (cid:26) ℓ P ( ˆ f n ( C L (Ω)) , ˜ f k ) ≥ (cid:18) ρ n √ − a (cid:19) , ˆ f n ( C L (Ω)) ∈ C LL (Ω) (cid:27) = P (cid:26) ℓ P ( ˆ f n ( C LL (Ω)) , ˜ f k ) ≥ (cid:18) ρ n √ − a (cid:19) , ˆ f n ( C L (Ω)) ∈ C LL (Ω) (cid:27) ≥ P ( ℓ P n ( ˆ f n ( C LL (Ω)) , ˜ f k ) ≥ ρ n , sup f,g ∈C LL (Ω) ( ℓ P n ( f, g ) − ℓ P ( f, g )) ≤ a, ˆ f n ( C L (Ω)) ∈ C LL (Ω) ) ≥ P (cid:26) ℓ P n ( ˆ f n ( C LL (Ω)) , ˜ f k ) ≥ ρ n , ˆ f n ( C L (Ω)) ∈ C LL (Ω) (cid:27) − P ( sup f,g ∈C LL (Ω) ( ℓ P n ( f, g ) − ℓ P ( f, g )) ≤ a ) . (81)We now bound the probability: P (cid:26) ℓ P ( ˆ f n ( C L (Ω)) , ˜ f k ) ≥ (cid:18) ρ n √ − a (cid:19) , ˆ f n ( C L (Ω)) / ∈ C LL (Ω) (cid:27) . ur, G., Gao, F., Guntuboyina, A. and Sen, B./Convex Regression in multidimensions For this, we first make the following claim: f ∈ C L (Ω) , f / ∈ C LL (Ω) = ⇒ min (cid:16) ℓ P n ( f, ˜ f k ) , ℓ P ( f, ˜ f k ) (cid:17) > L. (82)To see (82), note that assumptions f ∈ C L (Ω) and f / ∈ C LL (Ω) together imply that f ( x ) > L for some x ∈ Ω. By the Lipschitz property of f , the fact that Ω has diameter ≤ f k is bounded by L , we have f ( y ) − ˜ f k ( y ) ≥ f ( x ) − L k y − x k − L > L for all y ∈ Ωwhich clearly implies that both ℓ P n ( f, ˜ f k ) and ℓ P ( f, ˜ f k ) are larger than L . This proves(82).Assume now that N d,σ is large enough so that ρ n is at most L for n ≥ N d,σ . The fact(82) clearly implies that P (cid:26) ℓ P ( ˆ f n ( C L (Ω)) , ˜ f k ) ≥ (cid:18) ρ n √ − a (cid:19) , ˆ f n ( C L (Ω)) / ∈ C LL (Ω) (cid:27) = P n ˆ f n ( C L (Ω)) / ∈ C LL (Ω) o = P (cid:26) ℓ P n ( ˆ f n ( C L (Ω)) , ˜ f k ) ≥ ρ n , ˆ f n ( C L (Ω)) / ∈ C LL (Ω) (cid:27) Combining the above with (81), we get P (cid:26) ℓ P ( ˆ f n ( C L (Ω)) , ˜ f k ) ≥ (cid:18) ρ n √ − a (cid:19)(cid:27) ≥ P (cid:26) ℓ P n ( ˆ f n ( C L (Ω) , ˜ f k ) ≥ ρ n (cid:27) + P ( sup f,g ∈C LL (Ω) ( ℓ P n ( f, g ) − ℓ P ( f, g )) ≤ a ) − . This inequality is analogous to inequality (54) in the proof of Theorem 3.1. From here,one can deduce E ˜ f k ℓ P ( ˆ f n ( C L (Ω)) , ˜ f k ) ≥ c d σn − /d (log n ) − d +1) /d . (83)in the same way that (57) was derived from (54). The only difference is that, insteadof (55), we now use the following result due to Bronˇste˘ın [9]:log N [ ] ( ǫ, C LL (Ω) , ℓ P ) ≤ log N ( ǫ, C LL (Ω) , ℓ ∞ ) ≤ C d (cid:18) Lǫ (cid:19) d/ (84)where, of course, ℓ ∞ refers to the metric ( f, g ) sup x ∈ Ω | f ( x ) − g ( x ) | (recall that ℓ ∞ covering numbers dominate bracketing numbers with respect to L ( P ) for everyprobability measure P ). (83) clearly completes the proof of Theorem 3.2.Let us now provide the proof of (51). This was proved in Theorem 3.1 on the basisof the inequalities (69) and (74). Below we shall establish (69) and (74) in the presentsetting with slight modification. From these, (51) will follow via the same argument ur, G., Gao, F., Guntuboyina, A. and Sen, B./Convex Regression in multidimensions used in Theorem 3.1. Note that the difference between the current proof and the proofof Theorem 3.1 is that G ˜ f k ( t ) is now defined as in (80) involving a supremum of g ∈ B C L (Ω) P n ( ˜ f k , t ) while, in the proof of Theorem 3.1, G ˜ f k ( t ) was defined as in (50) involvinga supremum of g ∈ B C L (Ω) P n ( ˜ f k , t ).Let us start with the proof of (69). For this, note first that (82) immediately implies B C L (Ω) P n ( ˜ f k , t ) = B C LL (Ω) P n ( ˜ f k , t ) for all t ≤ L. Let us assume therefore that t ≤ L so we can work with the class of bounded Lipschitzconvex functions C LL (Ω). We write G ˜ f k ( t ) ≤ G I ˜ f k ( t ) + G II ˜ f k ( t )where G I ˜ f k ( t ) := E sup g ∈ B C LL (Ω) P n ( ˜ f k ,t ) n n X i =1 I { X i ∈ ∪ mi =1 ∆ i } ξ i (cid:16) g ( X i ) − ˜ f k ( X i ) (cid:17) (cid:12)(cid:12)(cid:12)(cid:12) X , . . . , X n and G II ˜ f k ( t ) := E sup g ∈ B C LL (Ω) P n ( ˜ f k ,t ) n X i : X i / ∈∪ mi =1 ∆ i ξ i (cid:16) g ( X i ) − ˜ f k ( X i ) (cid:17) (cid:12)(cid:12)(cid:12)(cid:12) X , . . . , X n . Here ∆ , . . . , ∆ m are the d -simplices given by Lemma 8.3. We shall provide upperbounds for both G I ˜ f k ( t ) and G II ˜ f k ( t ). The bound for G I ˜ f k ( t ) is very similar to the bound(69) obtained for G ˜ f k ( t ) in the proof of Theorem 3.1 with the following two differences.Instead of the metric entropy bound (55), we use the result (84) due to Bronˇste˘ın [9]. In-equality (84) allows us to replace B C LL (Ω) P n ( ˜ f k , t ) by B C LL (Ω) P ( ˜ f k , t ) with high probability.Also, instead of (65), we shall use (which also follows from Theorem 4.4) N [ ] ( ǫ, n x g ( x ) I { x ∈ ∪ mi =1 ∆ i } : g ∈ B C LL (Ω) P ( ˜ f k , t ) o , L ( P )) ≤ C d k (cid:18) log C d Lǫ (cid:19) d +1 (cid:18) tǫ (cid:19) d/ . With these changes, following the proof of inequality (69) in Theorem 3.1 with B replaced by 4 L allows us to deduce that G I ˜ f k ( t ) ≤ C d tσ (cid:18) kn (cid:19) /d (cid:0) log( C d σ √ n ) (cid:1) d +1) /d (85)with probability at least1 − C exp (cid:18) − n d/ ( d +4) C d L / ( d +4) (cid:19) − exp − nt C d L (cid:18) kn (cid:19) /d (cid:0) log( C d σ √ n ) (cid:1) d +1) /d ! ur, G., Gao, F., Guntuboyina, A. and Sen, B./Convex Regression in multidimensions for every 0 < t ≤ L satisfying t ≥ C d n − / ( d +4) L d/ ( d +4) and t ≥ Lk − /d σ n (4 − d ) / (2 d ) . (86)We shall now bound G II ˜ f k ( t ). For this, let ˜ n := P ni =1 I { X i / ∈ ∪ mi =1 ∆ i } and use Dudley’sbound (Theorem 7.2) and (84) to write G II ˜ f k ( t ) = ˜ nn E sup g ∈ B C LL (Ω) P n ( ˜ f k ,t ) n X i : X i / ∈∪ mi =1 ∆ i ξ i (cid:16) g ( X i ) − ˜ f k ( X i ) (cid:17) (cid:12)(cid:12)(cid:12)(cid:12) X , . . . , X n ≤ σ ˜ nn inf δ> (cid:18) √ ˜ n Z ∞ δ q log N ( ǫ, C LL (Ω) , ℓ ∞ ) dǫ + 2 δ (cid:19) ≤ C d σ ˜ nn inf δ> √ ˜ n Z ∞ δ (cid:18) Lǫ (cid:19) d/ dǫ + δ ! . The choice δ = L (˜ n ) − /d then gives G II ˜ f k ( t ) ≤ C d L σn (˜ n ) − (2 /d ) . (87)˜ n is binomially distributed with parameters n and ˜ p := Vol(Ω \ ( ∪ mi =1 ∆ i )) / Vol(Ω).Because (1 − C d k − /d )Ω ⊆ ∪ mi =1 ∆ i ⊆ Ω (see Lemma 8.3), we have˜ p ≤ − (1 − C d k − /d ) d ≤ dC d k − /d because (1 − u ) d ≥ − du . Hoeffding’s inequality: P { Bin( n, p ) ≤ np + u } ≥ − exp (cid:18) − u n (cid:19) for every u ≥ C d is such that ˜ p ≤ C d k − /d ) P (cid:8) ˜ n ≤ C d nk − /d (cid:9) ≥ P (cid:8) ˜ n − n ˜ p ≤ C d nk − /d (cid:9) ≥ − exp (cid:18) − C d nk − /d (cid:19) . Combining the above with (87), we get that G II ˜ f k ( t ) ≤ C d Lσn − /d k − /d k /d with probability at least 1 − exp( − C d nk − /d ). Combining this bound with the bound(85) obtained for G I ˜ f k ( t ), we get (below H ˜ f k ( t ) := G ˜ f k ( t ) − t / H ˜ f k ( t ) ≤ G ˜ f k ( t ) ≤ C d tσ (cid:18) kn (cid:19) /d (cid:0) log( C d σ √ n ) (cid:1) d +1) /d + C d Lσn − /d k − /d k /d (88) ur, G., Gao, F., Guntuboyina, A. and Sen, B./Convex Regression in multidimensions with probability at least1 − C exp − n dd +4 C d L d +4 ! − exp − nt C d L (cid:18) kn (cid:19) d (cid:0) log( C d σ √ n ) (cid:1) d +1) d ! − exp( − C d nk − d )(89)for every 0 ≤ t ≤ L satisfying (86). Under the condition t (cid:0) log( C d σ √ n ) (cid:1) d +1) /d ≥ Lk /d k − /d , the second term on the right hand side of (88) is dominated by the first term leadingto inequality (69).The next step is to prove a lower bound for H ˜ f k ( t ). Here the same argument as inthe proof of Theorem 3.1 applies and we can deduce that inequality (74) holds withprobability at least 1 − exp( − c n ) provided the conditions (75) and (76) are satisfied(note that t in (74) is given by (73)).We have thus proved inequalities (69) and (74). From these, we can follow the sameargument as in that proof to deduce (51). The following additional constraint (whichwas not present in the proof of Theorem 3.1) needs to be checked here: Lk /d k − /d (cid:0) log( ekσ √ n ) (cid:1) − d +1) /d ≤ t ≤ L where t is defined in (79) and k = γ d √ nσ − d/ (for a large enough γ d ). This holds aslong as n is larger than a constant N d,σ depending on d and σ alone (note that L is aconstant depending on d alone). Note also that the probability (89) has an additionalterm compared to (70) but this additional term exp( − C d nk − /d ) is easily seen to bebounded by exp( − n ( d − /d /C d ) for k = γ d √ nσ − d/ provided n ≥ N d,σ for a large enoughconstant N d,σ . The proof of (51) is thus complete.
9. Proofs of Metric Entropy Results
First, we state the key Lemma, and prove it later (in Subsection 9.2).
Lemma 9.1. If Ω is a d -dimensional convex body defined by Ω = { x ∈ R d : a i ≤ v Ti x ≤ b i , ≤ i ≤ d + 1 } , where v i are fixed unit vectors. Then for any < ε < , there exists aset G consisting of no more than exp( C | Ω | d/ p [log(Γ /ε )] d +1 ( t/ε ) d/ ) brackets such thatfor every f ∈ B Γ p (Ω , t ) := { f is convex in Ω , k f k p ≤ t, k f k ∞ ≤ Γ } , there exists a bracket [ g, h ] ∈ G such that g ( x ) ≤ f ( x ) ≤ h ( x ) for all x ∈ Ω , and Z Ω | h ( x ) − g ( x ) | p dx < ε p . ur, G., Gao, F., Guntuboyina, A. and Sen, B./Convex Regression in multidimensions Now we are ready to prove Theorem 4.4.
Proof of Theorem 4.4.
Assume Ω = ∪ mi =1 ∆ i , where ∆ i , 1 ≤ i ≤ m are d -simplices. Foreach f ∈ B Γ p ( f , Ω , t ), we define t i ( f ) as the smallest positive integer t i such that Z ∆ i | f ( x ) − f ( x ) | p dx ≤ t pi t p | ∆ i | . Because | f − f | ≤ t i ≤ /t . Thus, there are no more than (2Γ /t ) m choicesof the sequence t , t , . . . , t m . For every such sequence T = { t , t , . . . , t m } , we define F T = (cid:26) f ∈ B Γ p ( f , Ω , t ) : ( t i − p t p | ∆ i | ≤ Z ∆ i | f ( x ) − f ( x ) | p dx ≤ t pi t p | ∆ i | , ≤ i ≤ m (cid:27) . Thus, for every f ∈ F T , we have m X i =1 ( t i − p t p | ∆ i | ≤ m X i =1 Z ∆ i | f ( x ) − f ( x ) | p dx ≤ t p , i.e. P mi =1 ( t i − p | ∆ i | ≤
1. Hence, m X i =1 t pi | ∆ i | ≤ p − m X i =1 [( t i − p + 1] | ∆ i | ≤ p . Furthermore, for each f ∈ F T and 1 ≤ i ≤ m , the restriction of f − f to ∆ i belongsto B p (∆ i , t i t | ∆ i | /p ) (since f is linear on each ∆ i ). Since each simplex can be writtenas an intersection of d + 1 slabs, by Lemma 9.1 , there exists a set G i consisting of nomore than exp( C ( d, p )[log(2Γ /ε i )] d +1 t d/ i | ∆ i | d/ p t d/ ε − d/ i ) brackets, such that for each f ∈ F T , there exists a bracket [ g i , h i ] ∈ G i such that g i ( x )+ f ( x ) ≤ f ( x ) ≤ h i ( x )+ f ( x )for all x ∈ ∆ i , and R ∆ i | h i ( x ) − g i ( x ) | p dx ≤ ε pi . If we define g ( x ) = g i ( x ) and h ( x ) = h i ( x )for x ∈ ∆ i , 1 ≤ i ≤ m . Then, we clearly have g ( x ) ≤ f ( x ) ≤ h ( x ) for all x ∈ Ω, and Z [0 , d | h ( x ) − g ( x ) | p dx = m X i =1 Z ∆ i | h i ( x ) − g i ( x ) | p dx ≤ m X i =1 ε pi . We choose ε i = max(2 − − /p t i | ∆ i | /p , (4 m ) − /p ) · ε, where we used the fact that if | f − f | ≤
2Γ then R Ω ∞ | f ( x ) − f ( x ) | p ≤ ε p / m X i =1 ε pi ≤ ε p p +2 m X i =1 t pi | ∆ i | + 14 ε p ! ≤ ε p . Thus, [ g, h ] is an ε -bracket. ur, G., Gao, F., Guntuboyina, A. and Sen, B./Convex Regression in multidimensions Note that for each fixed T , the total number of brackets [ g, h ] is at most N := m Y i =1 exp( C ( d, p )[log(Γ /ε i )] d +1 t d/ i | ∆ i | d/ p t d/ ε − d/ i ) ≤ m Y i =1 exp( C ( d, p )[ 1 p log(4 m ) + log Γ + log(1 /ε )] d +1 (2 /p t/ε ) d/ ) ≤ exp (cid:0) C ′ ( d, p ) m [log m + log Γ + log(1 /ε )] d +1 ( t/ε ) d/ (cid:1) These with all the possible choices of T , the number of realizations of the brackets [ g, h ]is at most(2Γ /t ) m · N ≤ exp (cid:0) C ′′ ( d, p ) m [log m + log Γ + log(1 /ε )] d +1 ( t/ε ) d/ (cid:1) , and the claim follows. Our starting point is the following results proved in Lemma 5 and Theorem 1(ii) of [20]respectively:
Proposition 9.2. If Ω is a convex body in [0 , d with volume | Ω | ≥ /d ! , then forany < δ < , there exists a constant Λ depending only on d , p and δ , such that C p (Ω) ⊂ C ∞ (Ω δ , Λ) , where Ω η = { x ∈ Ω : dist( x, ∂ Ω) ≥ η } and C p (Ω) := { f is convex in Ω , k f k p ≤ } . Proposition 9.3. If Ω is a convex body that can be triangulated into m simplices ofdimension d , then there exists a constant C depending only on d and p such that for all < ε < , we have log N [ ] ( ε, C ∞ (Ω) , k · k p ) ≤ Cm | Ω | d/ p ε − d/ , where k f k p = (cid:0)R Ω | f ( x ) | p dx (cid:1) /p . Using the last two propositions:
Corollary 9.4.
Let Ω ⊂ [0 , d is a d -dimensional convex body which is defined by Ω = { x ∈ R d : a i ≤ v Ti x ≤ b i , ≤ i ≤ m } , where v i are fixed unit vectors, m ≥ d + 1 . Then for any < η < / , t > and any < ε < t , the following holds: log N [ ] ( ε, C p (Ω , t ) , k · k L p (Ω ) ) ≤ C m ( t/ε ) − d/ , where C p (Ω , t ) := { f is convex in Ω , k f k p ≤ t } , C is a constant depending only on d , p and η , and Ω := { x ∈ R d : a i + η ( b i − a i ) ≤ v Ti x ≤ b i − η ( b i − a i ) , ≤ i ≤ m } . ur, G., Gao, F., Guntuboyina, A. and Sen, B./Convex Regression in multidimensions Proof.
Observe that on Ω k f k ∞ ≤ C ( η, d ) t | Ω | /p . To see this, assume in contradictionthat it is not the case. Using similar arguments to the proof of Proposition 9.2, there isa set of volume of c ( η, d ) | Ω | (a ”cap/corner”) such that f ≥ t | Ω | /p which will contradictthe definition of C p (Ω , t ). Then, using the Proposition 9.3 scaled by t | Ω | /p , gives thecorollary.We will prove that if we replace C p (Ω , t ) by B Γ p (Ω , t ) = { f ∈ C p (Ω , t ) : k f k ∞ ≤ Γ } , then we can replace Ω by Ω at a cost of logarithmic factors in the rate of bracketingentropy.For any domain D of an intersection of d + 1 ”slabs” ( D := { x ∈ R d : a i ≤ v Ti x ≤ b i , ≤ i ≤ d + 1 } , where v i are fixed unit vectors) and for every 0 ≤ r ≤ d + 1 we definethe operator T r . T r ( D ) = { x ∈ D : a i ≤ v Ti x ≤ b i for i ≤ r, a j + η ( b j − a j ) ≤ v Tj x ≤ b j − η ( b j − a j ) for r < j ≤ d +1 } . Now, we are ready to prove the lemma.
Proof of Lemma 9.1.
Because the desired inequality is invariant under affine transfor-mation, we can assume Ω is contained in [0 , d and | Ω | ≥ d ! . Fix 0 < η < /
5. Weprove the following: There exist two constants C ( d, p ) and C ( d, p ) such that for all r = 1 , . . . , d + 1 the following holds:log N [ ] ( ε, B p (Ω , t ) , k · k L p ( T r (Ω)) ) ≤ C [ C log(Γ /ε )] r | Ω | d/ p t d/ ε − d/ . We prove the statement by induction on r . Clearly, T (Ω) = Ω := { x ∈ R d : a j + η ( b j − a j ) ≤ v Tj x ≤ b j − η ( b j − a j ) for 1 ≤ j ≤ d + 1 } . By Corollary 9.4 and the assumptions on Ω the statement is true when r = 0. Supposethe statement is true for r = k −
1. We define K = T k − (Ω) = { x ∈ T k (Ω) : a k + η ( b k − a k ) ≤ v Tk x ≤ b k − η ( b k − a k ) } . For s = 1 , , . . . , m define K s +1 = { x ∈ T k (Ω) : a k + 2 − s − η ( b k − a k ) ≤ v Tk x < a k + 2 − s η ( b k − a k ) } ,K s +2 = { x ∈ T k (Ω) : b k − − s η ( b k − a k ) < v Tk x ≤ b k − − s − η ( b k − a k ) } , Furthermore, we define K L = { x ∈ T k (Ω) : a k ≤ v Tk x < a k + 2 − m − η ( b k − a k ) } ,K R = { x ∈ T k (Ω) : b k − − m − η ( b k − a k ) < v Tk x ≤ b k } . ur, G., Gao, F., Guntuboyina, A. and Sen, B./Convex Regression in multidimensions Then, K , K L , K R , K s +1 and K s +2 , 1 ≤ s ≤ m form a partition of T k (Ω). Now, weaim to use the induction step. For this purpose we denote the inflated sets b K s +1 = { x ∈ Ω : a k + 2 − s − η ( b k − a k ) ≤ v Tk x < a k + 3 · − s − η ( b k − a k ) } , b K s +2 = { x ∈ Ω : b k − · − s − η ( b k − a k ) < v Tk x ≤ b k − − s − η ( b k − a k ) } . Now, we will apply the operator T k − on these sets: T k − ( b K s +1 ) = { x ∈ b K s +1 : a i ≤ v Ti x ≤ b i for i ≤ k − ,a j + η ( b j − a j ) ≤ v Tj x ≤ b j − η ( b j − a j ) for k + 1 ≤ j ≤ d + 1; a k + 2 − s − η ( b k − a k ) + η (5 · − s − ( b k − a k )) ≤ v Tk x< a k + 3 · − s − η ( b k − a k ) − η (5 · − s − ( b k − a k )) }⊃ K s +1 , provided that 5 η <
1. Similarly, T k − ( b K s +2 ) ⊃ K s +2 .We choose m so that Γ p | K L | ≤ − p ε p , and Γ p | K R | ≤ − p ε p in a way that ( R K L f ( x ) p dx ) /p and ( R K R f ( x ) p dx ) /p are negligible. This can be done by choosing m = C ( d, Γ) log(Γ /ε ).To see this, observe that K R , K L are slabs with width 5 η · − ( m − intersected with theunit cube. Thus, their volume can be bounded by C ( d ) · − ( m − η , which implies that (cid:18)Z K L f ( x ) p dx (cid:19) /p ≤ C ( d )2 − ( m − η Γ ≤ ǫ/ . For every f ∈ B Γ p (Ω , t ), we define t i as the smallest integer that satisfies the following R b K i | f ( x ) | p dx ≤ | b K i | t pi t p . Because any point in Ω is contained in b K i for at most threedifferent i , we have m +2 X i =0 ( t i − p t p | b K i | ≤ t p . (90)This implies that m +2 X i =0 t pi | b K i | ≤ p − m +2 X i =0 [( t i − p + 1] | b K i | ≤ · p , where we used the fact that P m +2 i =0 | b K i | ≤ | Ω | ≤
3. Since k f k ∞ ≤ Γ, we have t i ≤ Γ /t . Thus, the total number of choices of the sequence t , t , t , . . . , t m +2 is at most(Γ /t ) m +3 . For each ordered sequence T = { t , t , t , . . . , t m +2 } satisfying (90), we define F T = (cid:26) f ∈ B p (Ω , t ) : ∀ ≤ i ≤ m + 2 , ( t i − p t p | b K i | < Z b K i | f ( x ) | p dx ≤ t pi t p | b K i | (cid:27) . ur, G., Gao, F., Guntuboyina, A. and Sen, B./Convex Regression in multidimensions Then, f ∈ B p ( b K i , t i t | b K i | /p ). Clearly, for all 0 ≤ i ≤ m + 2, K i satisfies the inductionassumption. Therefore, for any 0 ≤ i ≤ m + 2, there exists two sets G i and G i ,each consisting of exp( C [ C log(Γ /ε )] k − | K i | d/ p t d/ i ε − d/ i ) elements, such that for every f ∈ B p ( b K i , t i t | b K i | /p ), there exists g i ∈ G i and g i ∈ G i such that g i ( x ) ≤ f ( x ) ≤ g i ( x )for all x ∈ K i , and Z K i | g i ( x ) − g i ( x ) | p ≤ ε pi . We define b f ( x ) = g i ( x ) if x ∈ K i , and b f ( x ) = − Γ for x ∈ K L ∪ K R . Similarly,we define b g ( x ) = g i ( x ) if x ∈ K i , and b g ( x ) = Γ for x ∈ K L ∪ K R . Then, we have b f ( x ) ≤ f ( x ) ≤ b g ( x ) for all x ∈ T k (Ω) and Z T k (Ω) | b g ( x ) − b f ( x ) | p dx ≤ m +2 X i =0 Z K i | g i ( x ) − g i ( x ) | p dx + Z K L ∪ K R | b g ( x ) − b f ( x ) | p dx ≤ m +2 X i =0 ε pi + 12 ε p . If we choose ε i = 12 · /p t i | b K i | /p ε, then, m +2 X i =0 ε pi = 16 · p m +2 X i =0 t pi | b K i | ε p ≤ ε p . This implies that Z T k (Ω) | b g ( x ) − b f ( x ) | p dx ≤ ε p . Now, let us count the number of possible realizations of the brackets [ b f , b g ]. For eachfixed T , the number of choices of the brackets [ b f , b g ] is at most N := m +2 Y i =0 exp( C [ C log(Γ /ε )] k − | c K i | d/ p t d/ i t d/ ε − d/ i )= exp( C [ C log(Γ /ε )] k − (2 m + 3)( t/ε ) d/ ) ≤ exp (cid:0) C [log(Γ /ε )] k ( t/ε ) d/ (cid:1) . The total number of ε -brackets under the L p ( T k (Ω)) distance needed to cover B Γ p (Ω , t )is then bounded by (Γ /t ) m +3 · N ≤ exp (cid:0) C [log(Γ /ε )] k ( t/ε ) d/ (cid:1) provided the constant C is large enough, and Lemma 9.1 follows. ur, G., Gao, F., Guntuboyina, A. and Sen, B./Convex Regression in multidimensions Let Ω be a convex body, and let c be the center of its John ellipsoid (i.e., the uniqueellipsoid of maximum volume contained in Ω). For any λ >
0, define Ω λ = c + λ (Ω − c ).It is clear that | Ω λ | = λ d | Ω | , where | Ω | denotes the volume of Ω. The goal of this sub-section is to prove the following proposition, which is the equivalent”discrete” version of Corollary 9.4.
Proposition 9.5.
Let S be the regular d -dimensional δ -grid, and let Ω be a convexbody in R d . For any t > and any < ε < , there exists a set N consisting of exp( γ d · ( t/ε ) d/ ) functions such that for every f ∈ B qS (0; t ; Ω) , there exists g ∈ N satisfying | f ( x ) − g ( x ) | < ε for all x ∈ Ω . , where γ d is a constant depending only on d . To prove Proposition 9.5, we need some preparations.
Lemma 9.6.
Let S be a regular δ -grid on R d , and let Ω be a convex body containing aball of radius r ≥ d / δ . We have | Ω | δ − d ≤ ∩ S ) ≤ | Ω | δ − d . Proof.
Let s , . . . , s n be the grid points contained in Ω. We have n [ i =1 ( s i + [ − δ/ , δ/ d ) ⊂ Ω + [ − δ/ , δ/ d ⊂ Ω + √ dδ B d . Note that when Ω contains a ball of radius r , | Ω + √ dδ B d | ≤ √ dδ r ! d | Ω | ≤ (cid:18) d (cid:19) /d | Ω | ≤ | Ω | . Volume comparison gives us n ≤ | Ω | δ − d . On the other hand, let U be the union of the cubes s i + [ − δ/ , δ/ d . The volume of U is nδ d . Since the union of s i + [ − δ, δ ] d covers Ω, we have U + [ − δ/ , δ/ d ⊃ Ω. Inparticular, U contains the set { x ∈ Ω : dist( x, ∂ Ω) ≥ √ dδ/ } . Since Ω contains a ball of radius r . If we let c be the center of this ball, and define b Ω = c + − √ dδ r ! (Ω − c ) , ur, G., Gao, F., Guntuboyina, A. and Sen, B./Convex Regression in multidimensions then the distance between any x ∈ b Ω and ∂ Ω is at least √ dδ/
2. Hence U ⊃ b Ω. Conse-quently n = | U | δ − d ≥ | b Ω | δ − d = − √ dδ r ! d | Ω | δ − d ≥ | Ω | δ − d . This finishes the proof of the lemma.
Lemma 9.7.
Let S be a regular d -dimensional δ -grid. Let Ω ⊂ [0 , d be a convexbody that contains a ball of radius at least d / δ . Then for every f ∈ B qS (0 , t, Ω) , f ≥ − dt .Proof. Let x be the minimizer of f on Ω. If f ( x ) ≥
0, then there is nothing to prove;otherwise, the set K := { x ∈ Ω | f ( x ) ≤ } is a closed convex set containing x . Denote K t = x + t ( K − x ), and let b K = K σ \ K − σ , where σ = (10 d ) − . We show that for all x ∈ Ω \ b K , | f ( x ) | ≥ σ | f ( x ) | . Indeed, if we define a function g on Ω so that g ( x ) = f ( x ), g ( γ ) = f ( γ ) for all γ ∈ ∂K , and g is linear on L γ := { x = x + t ( γ − x ) ∈ Ω | t ≥ } .Then, by the convexity of f on each L γ , we have | f ( x ) | ≥ | g ( x ) | on Ω. Thus, for all x ∈ Ω \ b K , | f ( x ) | ≥ | g ( x ) | = | g ( γ ) | + k x − γ kk x − γ k | f ( x ) | ≥ σ | f ( x ) | . Next, we show that most of the grid points in Ω are outside b K . Indeed, If s is a gridpoint in b K , then s + [ − δ/ , δ/ d ⊂ K σ ∩ Ω + [ − δ/ , δ/ d and at least one half ofthe cube s + [ − δ/ , δ/ d lies outside K − σ . Thus, the number of grid points in b K isbounded by 2 | ( K σ ∩ Ω + [ − δ/ , δ/ d ) \ K − σ | δ − d . Since | ( A + B ) \ A | can be expressed as a sum of products of mixed volumes of A and B , and smaller sets have smaller mixed volumes, we have | ( A + B ) \ B | ≤ | ( C + D ) \ D | for all convex sets C ⊃ A and D ⊃ B . Applying this inequality for A = C =[ − δ/ , δ/ d , B = ( K σ ∩ Ω and D = [ − , d , we have | ( K σ ∩ Ω+[ − δ/ , δ/ d ) \ K σ ∩ Ω | ≤ | ([ − , d +[ − δ/ , δ/ d ) \ [ − , d | = (2+ δ ) d − d , while | K σ ∩ Ω \ K − σ | ≤ h − (cid:0) − σ σ (cid:1) d i | K σ ∩ Ω | , we have | ( K σ ∩ Ω+[ − δ/ , δ/ d ) \ K − σ | ≤ " − (cid:18) − σ σ (cid:19) d | K σ ∩ Ω | +(2+ δ ) d − d ≤ dσ | Ω | . Thus, the number of grid points in b K is bounded by6 dσ | Ω | δ − d ≤ dσ · · S ∩ Ω) ≤ dσ · S ∩ Ω) . ur, G., Gao, F., Guntuboyina, A. and Sen, B./Convex Regression in multidimensions Hence, S ∩ Ω) · t q ≥ X s ∈ S ∩ (Ω \ b K ) | f ( s ) | q ≥ (1 − dσ ) · S ∩ Ω) · ( σ | f ( x ) | ) q , which implies that f ( x ) ≥ − /q σ − t ≥ − dt by using σ = (10 d ) − . Lemma 9.8.
Suppose that a convex body Ω contains n grid points of a regular δ -grid in R d , and a ball of radius at least d / δ . Then, at any point P on the boundary of Ω . ,any hyperplane passing through P cuts Ω into two parts. The part that does not containthe center of John ellipsoid of Ω as its interior point contains at least (20 d ) − d − · n gridpoints.Proof. For any point P on the boundary of Ω . . Any hyperplane passing through P cuts Ω into two parts. Suppose L is a part that does not contain the center of Johnellipsoid of Ω as its interior. We prove that | L | ≥ d ) d | Ω | . Because the ratio | L | / | Ω | isinvariant under affine transform, we estimate | T L | / | T Ω | , where T is an affine transformso that the John ellipsoid of T Ω is the unit ball B d . Then, it is known that T Ω iscontained in a ball of radius d . Because the distance from ( T Ω) . to the boundary of T Ω is at least , T L contain half of the ball with center at
T P and radius . Thus,
T L has volume at least − d | B d | . Since T Ω is contained in the ball of radius d , we have | T Ω | ≤ d d | B d | . This implies that | T L | ≥ d (20 d ) − d | T Ω | . Hence | L | ≥ d (20 d ) − d | Ω | .Because the John ellipsoid of Ω contains a ball of radius at least 400 d / δ , the distancefrom Ω . to the boundary of Ω is at least 20 d / δ . Thus, L contains a ball of radiusat least 10 d / δ . By Lemma 9.6, the number of grid points in it is at least | L | δ − d ≥ d (20 d ) − d | Ω | δ − d . The statement of Lemma 9.8 then follows by using Lemma 9.6 onemore time. Lemma 9.9.
Let S be the regular d -dimensional δ -grid. Let Ω be a convex body in R d containing a ball of radius d / δ . For every f ∈ B qS (0 , t, Ω) , f ( x ) ≤ (20 d ) d +1 q t for all x ∈ Ω . .Proof. Let z be the maximizer of f on Ω . . By the convexity of f , z must be on theboundary of Ω . . If f ( z ) ≤
0, there is nothing to prove. So we assume f ( z ) >
0. Theconvexity of f implies that z lies on the boundary of the convex set K : { x ∈ Ω : f ( x ) ≤ f ( z ) } ⊃ Ω η . There exists a hyperplane z so that the convex set { x | f ( x ) ≤ f ( z ) } liesentirely on one side of the hyperplane. Let L be the portion of Ω that lies on the otherside of the hyperplane that support K at z . This hyperplane cuts Ω into two parts. Let L be the part that does not contain K . Then, f ( x ) ≥ f ( z ) for all x ∈ L . By Lemma 9.8,we have L ∩ S ) ≥ (20 d ) − d − · ∩ S ) . Since f ( x ) ≥ f ( z ) > x ∈ L , we have L ∩ S ) · f ( z ) q ≤ X s ∈ L ∩ S | f ( s ) | q ≤ X s ∈ Ω ∩ S | f ( s ) | q ≤ ∩ S ) · t q , ur, G., Gao, F., Guntuboyina, A. and Sen, B./Convex Regression in multidimensions This implies that f ( z ) ≤ (20 d ) d +1 q t . Lemma 9.10 (Bronshtein) . There exists a constant β depending only on d such thatfor any ε > , any M > , and any convex set Ω ⊂ B d , there exists a set G consistingof exp( β ( M/ε ) d/ ) functions, such that for any convex function f on Ω that is boundedby M and has a Lipschitz constant bounded by M , there exists g ∈ G such that | f ( x ) − g ( x ) | < ε for all x ∈ Ω . Proof of Proposition 9.5
If Ω contains N ≤ d grid points. Denote these gridpoints by s , s , . . . s N . Then { ( f ( s ) , f ( s ) , . . . , f ( s N )) : f ∈ B qS (0 , t, Ω) } is a subset of { ( x , x , . . . x N ) : | x | q + | x | q + · · · + | x N | q ≤ N t q } , i.e., the ℓ Nq -ball of radius N /q t . By volume comparison, it can be covered by no morethan (1 + tε ) N ℓ Nq -balls of radius N /q ε . Thus, by choosing γ d ≥ d , the statementof Proposition 9.5 is true when Ω contains no more than 400 d grid points.For the remaining case, we prove by induction.If d = 1, and Ω contains more than 400 grid points, then by By Lemma 9.7 andLemma 9.9, we have − t ≤ f ( x ) ≤ t for all x ∈ Ω . . Let T be a linear transformthat maps the interval Ω . to the interval [ − , f ◦ T − are convex functionson [ − ,
1] satisfying − t ≤ f ( T − x ) ≤ t for all x ∈ [ − , f ◦ T − ,we have | ( f ◦ T − ( x )) ′ | ≤ max (cid:26) | f ◦ T − (1) − f ◦ T − (0 . || − . | , | f ◦ T − ( − − f ◦ T − ( − . || ( − − ( − . | (cid:27) ≤ t. By Lemma 9.10, there exists a set G consisting of no more than exp( β √ tε − / )functions such that for each f ∈ B qS (0; t ; Ω), there exists g ∈ G satisfying | f ◦ T − ( x ) − g ( x ) | < ε for all x ∈ [ − . , . T Ω . ⊂ ( T Ω . ) . = [ − . , . | f ( z ) − g ◦ T ( z ) | < ε for all z ∈ Ω . . Thus, the statement of the lemma is truewith N = { g ◦ T : g ∈ G} if we choose γ = max(400 , √ β ).Suppose the statement is true for d < k . Consider the case d = k . If the minimumnumber of parallel hyperplanes needed to cover all the grid points in Ω is more than400 d . Then the lattice width w (Ω , S ) is at least 400 d . Let µ (Ω , S ) be the coveringradius, i.e., the smallest number µ such that µ Ω + S ⊃ R d . By Khinchine’s flatnesstheorem,(c.f. [5],[6]) we have w (Ω , S ) · µ (Ω , S ) ≤ Cd / , which implies that the coveringradius of Ω is at most C (800 d / ) − . Thus, Ω contains a cube of side length C − d / δ ,and hence a ball of radius C − d / δ . Thus, all the previous lemmas are applicable toΩ. By Lemma 9.7 and Lemma 9.9, we have − dt ≤ f ( x ) ≤ (20 d ) d +1 t for all x ∈ Ω . .If T is an affine transformation so that the John ellipsoid of T Ω . is the unit ball B d (1). Because Ω . ⊂ (Ω . ) . , by the proof of Lemma 9.8, the distance between the ur, G., Gao, F., Guntuboyina, A. and Sen, B./Convex Regression in multidimensions boundary of T (Ω . ) and the boundary of T (Ω . ) is at least . If we define convexfunction e f on T (Ω) by e f ( y ) = f ( T − ( y )). Then, − dt ≤ e f ( y ) ≤ (20 d ) d +1 t for all y ∈ T (Ω . ). For any u, v ∈ T (Ω . ), without loss of generality we assume e f ( u ) ≤ e f ( v ).Consider the half-line starting from u and passing through v . Suppose the half-lineintersects the boundary of T (Ω . ) at p and the boundary of T (Ω . ) at q . By theconvexity of e f on this half-line, we have0 ≤ e f ( v ) − e f ( u ) k v − u k ≤ | e f ( q ) − e f ( p ) |k q − p k ≤ d ) d +1 t + 20 dt ] := M. This implies that e f is a convex function on T (Ω . ) that has a Lipschitz constant M .Of course f is also bounded by M on T (Ω . ). Thus, by Lemma 9.10, there existsa set of function G consisting of at most exp( β · ( M/ε ) d/ ) functions such that forevery f ∈ B qS (0 , t, K ), there exists a function g ∈ G , such that | e f ( y ) − g ( y ) | < ε forall y ∈ T (Ω . ). This implies | f ( x ) − g ( T x ) | < ε for all x ∈ Ω . . Thus, by setting N = { g ◦ T | g ∈ G} the lemma follows with γ d ≥ βM d/ .If the minimum number of parallel hyperplanes needed to cover all the grid points inΩ is less than 400 d . Then, by applying the lemma for d = k − γ d ≥ d γ d − . Now, we try to reach closer to the boundary of Ω. More precisely, we will extendProposition 9.5 from Ω . to the set Ω defined below.Let Ω be a convex polytope with the center of John ellipsoid at c . Then, we candescribe Ω as { x ∈ R d : − a i ≤ v Ti ( x − c ) ≤ b i , ≤ i ≤ F } , where a i > b i > v i are unit vectors in R d . Let m i and n i be the smallest integer such that 2 − m i a i ≤ δ and2 − n i b i ≤ δ . LetΩ = { x ∈ R d : − (1 − − m i ) a i ≤ v Ti ( x − c ) ≤ (1 − − n i ) b i , ≤ i ≤ F } . Then the Hausdorff distance between Ω and Ω is no larger than δ . Thus, Ω is indeedclose to Ω.The following proposition suggests that to achieve our goal, we only need to properlydecompose Ω . Proposition 9.11. If D i , ≤ i ≤ m is a sequence of convex subsets of Ω such thatno point in Ω is contained in more than M subsets in the sequence. Then, for Ω ⊂∪ mi =1 ( D i ) . , we have log N ( ε, B pS (0; t ; Ω) , ℓ pS ( · , Ω )) ≤ cmM d p ( t/ε ) d/ . Proof.
Let G i be the set of grid points in D i , and S i be the grid points in ( D i ) . \∪ j
1) + m X i =1 | G i | ≤ M n + M n = 2
M n, ur, G., Gao, F., Guntuboyina, A. and Sen, B./Convex Regression in multidimensions we can bound the total number of realizations of g byexp (cid:16) γ (2 M ) d p m ( t/ε ) d/ (cid:17) . Consequently, we havelog N ( ε, B pS (0; t ; Ω) , ℓ pS ( · , Ω )) ≤ log (cid:18) M n + mm (cid:19) + γ (2 M ) d p m ( t/ε ) d/ ≤ cmM d p ( t/ε ) d/ . Now, let us decompose Ω . Lemma 9.12.
There exists convex sets b D i , ≤ i ≤ N := Q Fi =1 ( m i + n i ) contained in Ω , such that no point in Ω is contained in more than F of these sets, and Ω ⊂ ∪ Ni =1 ( b D i ) . . Proof.
Let K = { ( k , k , . . . , k F ) : − m i ≤ k i ≤ n i − , ≤ i ≤ F } . There are Q Fi =1 ( m i + n i ) elements in K . For each K = ( k , k , . . . , k F ) ∈ K , define D K = { x ∈ R d : α i ( k i ) ≤ v Ti ( x − c ) ≤ α i ( k i + 1) } , where α i ( t ) = (cid:26) − (1 − t ) a i t ≤ − − t ) b i t > .D K is a convex set. The union of all D K , K ∈ K is the set { x ∈ R d : − (1 − − m i ) a i ≤ v Ti ( x − c ) ≤ (1 − − n i ) b i } . Similarly, we define b D K = { x ∈ R d : β i ( k i ) ≤ v Ti ( x − c ) ≤ γ i ( k i ) } , where β i ( k i ) = α i ( k i ) −
14 [ α i ( k i + 1) − α i ( k i )] , γa i ( k i ) = α i ( k i + 1) + 14 [ α i ( k i + 1) − α i ( k i )] . Let c K be the center of John ellipsoid of b D K , We have b D K = { x ∈ R d : β i ( k i ) − v Ti ( c K − c ) ≤ v Ti ( x − c K ) ≤ γ i ( k i ) − v Ti ( c K − c ) } . Thus,( b D K ) . = { x ∈ R d : 0 . β i ( k i ) − v Ti ( c K − c )] ≤ v Ti ( x − c K ) ≤ . γ i ( k i ) − v Ti ( c K − c )] } = { x ∈ R d : 0 . β i ( k i ) + 0 . v Ti ( c K − c ) ≤ v Ti ( x − c ) ≤ . γ i ( k i ) + 0 . v Ti ( c K − c ) }⊃{ x ∈ R d : 0 . β i ( k i ) + 0 . α i ( k i + 1) ≤ v Ti ( x − c K ) ≤ . γ i ( k i ) + 0 . α i ( k i ) } = { x ∈ R d : α i ( k i ) ≤ v Ti ( x − c K ) ≤ α i ( k i + 1) } = D K , ur, G., Gao, F., Guntuboyina, A. and Sen, B./Convex Regression in multidimensions where in the second to the last equality we used the fact that 0 . β i ( k i ) + 0 . α i ( k i + 1) = α i ( k i ) and 0 . γ i ( k i ) + 0 . α i ( k i ) = α i ( k i + 1).It is not difficult to check that if integer k i = 0 , −
1, the interval ( β i ( k i ) , γ i ( k i )) and( β i ( j i ) , γ i ( j i )) intersect only when | k i − t i | ≤
1, or one of the two cases: t i = 0 or t i = − F different sets d D K . The lemma follows by renaming these sets as b D i , 1 ≤ i ≤ N . Proof of Theorem 4.1
By using the lemma above and Proposition 9.11, we havelog N ( ε, B pS (0; t ; Ω) , ℓ pS ( · , Ω )) ≤ c dF p [log(1 /δ )] F ( t/ε ) d/ . Because the distance between the boundary of Ω \ Ω can be decomposed into at most 2 F piece of width δ . By Khinchine’s flatness theorem, the grid points in Ω \ Ω contained in cF hyperplanes for some constant c . The intersection of Ω and each of these hyperplanesis a ( d −
1) dimensional convex polytope. This enables us to obtain covering numberestimates on Ω \ Ω using induction on dimension. This concludes the proof of Theorem4.1. Proof of Corollary 4.2.
Let n denote the cardinality of Ω ∩ S and let n i denote thecardinality of Ω i ∩ S for each i = 1 , . . . , k . We can assume that each n i > i . For f ∈ B p S ( f ; t ; Ω) and 1 ≤ i ≤ k , let σ i ( f ) be the smallestpositive integer for which X s ∈ Ω i ∩S | f ( s ) − f ( s ) | p ≤ n i σ i ( f ) t p . It is clear then that 1 ≤ σ i ( f ) ≤ n for each i . Also because σ i ( f ) is the smallest integersatisfying the above, we have X s ∈ Ω i ∩S | f ( s ) − f ( s ) | p ≥ n i ( σ i ( f ) − t p which implies that k X i =1 n i ( σ i ( f ) − t p ≤ k X i =1 X s ∈ Ω i ∩S | f ( s ) − f ( s ) | p = X s ∈ Ω ∩S | f ( s ) − f ( s ) | p ≤ t , leading to k X i =1 n i σ i ( f ) ≤ k X i =1 n i = n. Let Σ := { ( σ ( f ) , . . . , σ k ( f )) : f ∈ B p S ( f ; t ; Ω) } . ur, G., Gao, F., Guntuboyina, A. and Sen, B./Convex Regression in multidimensions and note that the cardinality of Σ is at most n k as 1 ≤ σ i ( f ) ≤ n for each i . For every( σ , . . . , σ k ) ∈ Σ, let F σ ,...,σ k = { f ∈ B p S ( f ; t ; Ω) : σ i ( f ) = σ i , ≤ i ≤ k } . Observe now that if ℓ p S ( f − f , Ω i ) ≤ ǫσ /pi for i = 1 , . . . , k, then ℓ p S ( f − f , Ω) = n k X i =1 X s ∈ Ω i ∩S | f ( s ) − f ( s ) | p ! /p ≤ ǫ p P ki =1 n i σ i n ! /p ≤ ǫ. This giveslog N ( ǫ, F σ ,...,σ k , ℓ pS ( · , Ω)) ≤ k X i =1 log N ( ǫ √ σ i , B p S ( f ; t √ σ i ; Ω i ) , ℓ p S ( · , Ω i )) . Because f is affine on each Ω i , we can apply Theorem 4.1 to obtainlog N ( ǫ, F σ ,...,σ k , ℓ p S ( · , Ω)) ≤ k (cid:18) tǫ (cid:19) d/ (cid:18) c d log 1 δ (cid:19) s . Because B p S ( f ; t ; Ω) = [ ( σ ,...,σ k ) ∈ Σ F σ ,...,σ k and the cardinality of Σ is at most n k , we deduce thatlog N ( ǫ, B p S ( f ; t ; Ω) , ℓ p S ( · , Ω)) ≤ k (cid:18) tǫ (cid:19) d/ (cid:18) c d log 1 δ (cid:19) s + k log n Because log n ≤ C d log(1 /δ ) (inequality (11)), the second term on the right hand sideabove is dominated by the first term. This completes the proof of Corollary 4.2. Proof of Corollary 4.3.
We use Corollary 4.2 with the following choice of Ω , . . . , Ω k .Take Ω = ∆ and define Ω i for 2 ≤ i ≤ k recursively as follows. Consider ∆ i andits facets that have a non-empty intersection with ∆ , . . . , ∆ i − . If any of these facetscontain points in S , then we move the facets slightly inward so that they do not containany grid points. This will ensure that Ω , . . . , Ω k are d -simplices satisfying the conditionsof Corollary 4.2. The conclusion of Corollary 4.3 then directly follows from Corollary4.2 (note that every d -simplex can be written as an intersection of d + 1 halfspaces so inparticular it can be written as an intersection of at most d + 1 pairs of halfspaces). ur, G., Gao, F., Guntuboyina, A. and Sen, B./Convex Regression in multidimensions
10. Additional proofs
This section contains proofs of Lemma 2.4, Lemma 7.3, Lemma 8.2 and Lemma 8.3.
Proof of Lemma 2.4.
For a fixed η >
0, let C η be the collection of all cubes of the form[ k η, ( k + 1) η ] × · · · × [ k d η, ( k d + 1) η ]for ( k , . . . , k d ) ∈ Z d which intersect Ω. Because Ω is contained in the unit ball, thereexists a dimensional constant c d such that the cardinality of C η is at most C d η − d for η ≤ c d .For each B ∈ C η , the set B ∩ Ω is a polytope whose number of facets is bounded fromabove by a constant depending on d alone. This polytope can therefore be triangulatedinto at most C d number of d -simplices. Let ∆ , . . . , ∆ m be the collection obtained bythe taking the all of the aforementioned simplices as B varies over C η . These simplicesclearly satisfy the first two requirements of Lemma 2.4. Moreover m ≤ C d η − d and the diameter of each simplex ∆ i is at most C d η . Now define ˜ f η to be a piecewiseaffine convex function that agrees with f ( x ) = k x k for each vertex of each simplex ∆ i and is defined by linearity everywhere else on Ω. This function is clearly affine on each∆ i , belongs to C C d C d (Ω) for a sufficiently large C d and it satisfiessup x ∈ ∆ i | f ( x ) − ˜ f η ( x ) | ≤ C d (diameter(∆ i )) ≤ C d η . Now given k ≥
1, let η = c d k − /d for a sufficiently small dimensional constant c d andlet ˜ f k to be the function ˜ f η for this η . The number of simplices is now m ≤ C d k andsup x ∈ ∆ i | f ( x ) − ˜ f η ( x ) | ≤ C d k − /d which completes the proof of Lemma 2.4. Proof of Lemma 7.3.
Let g ( x , x , . . . , x d ) = (cid:26) P di =1 cos ( πx i ) ( x , x , . . . , x d ) ∈ [ − / , / d x , x , . . . , x d ) / ∈ [ − / , / d . Note that g is smooth, ∂ g∂x i ∂x j = 0 for i = j and (cid:12)(cid:12)(cid:12)(cid:12) ∂ g∂x i ( x , . . . , x d ) (cid:12)(cid:12)(cid:12)(cid:12) ≤ √ π which means that the Hessian of g is dominated by (4 √ π /
3) times the identity matrix.It is also easy to check that the Hessian of g equals zero on the boundary of [ − . , . d . ur, G., Gao, F., Guntuboyina, A. and Sen, B./Convex Regression in multidimensions Now for every grid point s := ( k δ, . . . , k d δ ) in S ∩
Ω, consider the function g s ( x , . . . , x d ) := δ g (cid:18) x − k δδ , . . . , x d − k d δδ (cid:19) . Clearly g s is supported on the cube[( k − / δ, ( k + 1 / δ ] × [( k − / δ, ( k + 1 / δ ] × · · · × [( k d − / δ, ( k d + 1 / δ ] (91)and observe that these cubes for different grid points have disjoint interiors.We now consider binary vectors in { , } n . We shall index each ξ ∈ { , } n by ξ s , s ∈S ∩ Ω. For every ξ = ( ξ s , s ∈ S ∩ Ω) ∈ { , } n , consider the function G ξ ( x ) = f ( x ) + 34 √ π X s ∈S∩ Ω ξ s g s ( x ) . (92)It can be verified that G ξ is convex because f has constant Hessian equal to 2 times theidentity, the Hessian of g s is bounded by (4 √ π /
3) and the supports of g s , s ∈ S ∩ Ωhave disjoint interiors. Note further that for ξ, ξ ′ ∈ { , } n and s ∈ S ∩ Ω, G ξ ( s ) − G ξ ′ ( s ) = 3 dδ √ π ( ξ s − ξ ′ s ) . This implies that ℓ P n ( G ξ , G ξ ′ ) = 3 dδ √ π r Υ( ξ, ξ ′ ) n where Υ( ξ, ξ ′ ) := P s ∈S∩ Ω I { ξ s = ξ ′ s } is the Hamming distance between ξ and ξ ′ . TheVarshamov-Gilbert lemma (see e.g., Massart [39, Lemma 4.7]) asserts the existence ofa subset W of { , } n with cardinality | W | ≥ exp( n/
8) such that Υ( ξ, ξ ′ ) ≥ n/ ξ, ξ ′ ∈ W with ξ = ξ ′ . We then have ℓ P n ( G ξ , G ξ ′ ) ≥ dδ √ π for all ξ, ξ ′ ∈ W with ξ = ξ ′ . Inequality (11) then gives ℓ P n ( G ξ , G ξ ′ ) ≥ c n − /d for all ξ, ξ ′ ∈ W with ξ = ξ ′ . for a constant c depending on d alone. On the other hand, one can also check that ℓ P n ( G ξ , f ) ≤ dδ √ π ≤ c n − /d for another constant c depending on d alone. This completes the proof of Lemma7.3. ur, G., Gao, F., Guntuboyina, A. and Sen, B./Convex Regression in multidimensions Proof of Lemma 8.2.
We first claim the existence of constants c , c , C all dependingon d alone such that for every ǫ ≤ c and L ≥ C , there exist an integer N with c ǫ − d/ ≤ log N ≤ c ǫ − d/ and functions f , . . . , f N ∈ C LL (Ω) such thatmin ≤ i = j ≤ N ℓ P ( f i , f j ) ≥ √ ǫ and max ≤ i ≤ N sup x ∈ Ω | f i ( x ) − f ( x ) | ≤ ǫ. This basically follows from a similar construction as in the proof of Lemma 7.3 (with δ = ǫ ). To facilitate calculations with ℓ P , it will be convenient to restrict the sumin (92) to all points s = ( k δ, . . . , k d δ ) such that the cube (91) is fully contained inΩ. The number of such grid points is also at least c d δ − d provided δ is smaller than adimensional constant. Note also that each function (92) is bounded by L and L -Lipschitzfor a dimensional constant L .Lemma 8.2 follows from the above claim and the Hoeffding inequality (note thatsup x ∈ Ω ( f j ( x ) − f k ( x )) ≤ ǫ ). Indeed, for every t >
0, Hoeffding inequality followedby a union bound allows us to deduce that P (cid:8) ℓ P n ( f j , f k ) − ℓ P ( f j , f k ) ≥ − tn − / for all j, k (cid:9) ≥ − N exp (cid:18) − t Γ ǫ (cid:19) . for a universal constant Γ. Taking t = ǫ √ n , we get P { ℓ P n ( f j , f k ) ≥ ǫ for all j, k } ≥ − N exp (cid:18) − n Γ (cid:19) ≥ − exp (cid:16) c ǫ − d/ − n Γ (cid:17) . Assuming now that ǫ ≥ n − /d (8 c Γ) /d , we get P { ℓ P n ( f j , f k ) ≥ ǫ for all j, k } ≥ − exp (cid:16) − n (cid:17) . Note finally that each f j belongs to B C LL (Ω) P n ( f , t ) for t ≥ ǫ . This completes the proofof Lemma 8.2. Proof of Lemma 8.3.
This proof is similar to that of Lemma 2.4. For a fixed η >
0, let D η be the collection of all cubes of the form[ k η, ( k + 1) η ] × · · · × [ k d η, ( k d + 1) η ]for ( k , . . . , k d ) ∈ Z d which are contained in the interior of Ω. Because Ω is contained inthe unit ball and contains a ball around zero of constant (depending on d alone) radiusand the diameter of B is at most η √ d , it follows that(1 − C d η ) Ω ⊆ ∪ B ∈ D η B ⊆ Ω . ur, G., Gao, F., Guntuboyina, A. and Sen, B./Convex Regression in multidimensions It is also easy to see that the cardinality of D η at most C d η − d for η ≤ c d . We nowtriangulate each cube in D η into a constant number of d -simplices. Let ∆ , . . . , ∆ m bethe collection obtained by taking all of the aforementioned simplices as B varies over D η . These simplices clearly have disjoint interiors and the diameter of each simplex ∆ i is at most C d η . Moreover m ≤ C d η − d . Now define ˜ f η to be a piecewise affine convex function that agrees with f ( x ) = k x k for each vertex of each simplex ∆ i and is defined by linearity everywhere else on Ω.This function is clearly affine on each ∆ i , belongs to C C d C d (Ω) for a sufficiently large C d and it satisfies sup x ∈ ∆ i | f ( x ) − ˜ f η ( x ) | ≤ C d (diameter(∆ i )) ≤ C d η . Now for each fixed k ≥
1, we take η = c d k − /d for a sufficiently small enough c d and let˜ f k to be the function ˜ f η for this η . This completes the proof of Lemma 8.3. References [1] A¨ıt-Sahalia, Y. and J. Duarte (2003). Nonparametric option pricing under shaperestrictions.
J. Econometrics 116 (1-2), 9–47. Frontiers of financial econometrics andfinancial engineering.[2] Allon, G., M. Beenstock, S. Hackman, U. Passy, and A. Shapiro (2007). Nonpara-metric estimation of concave production technologies by entropic methods.
J. Appl.Econometrics 22 (4), 795–816.[3] Bal´azs, G. (2016).
Convex regression: theory, practice, and applications . Ph. D.thesis, University of Alberta.[4] Bal´azs, G., A. Gy¨orgy, and C. Szepesv´ari (2015). Near-optimal max-affine estimatorsfor convex regression. In
AISTATS .[5] Banaszczyk, W., A. E. Litvak, A. Pajor, and S. J. Szarek (1999). The flatness the-orem for nonsymmetric convex bodies via the local theory of banach spaces.
Mathe-matics of operations research 24 (3), 728–750.[6] B´ar´any, I. and D. G. Larman (1998). The convex hull of the integer points in alarge ball.
Mathematische Annalen 312 (1), 167–181.[7] Bellec, P. C. (2018). Sharp oracle inequalities for least squares estimators in shaperestricted regression.
Ann. Statist. 46 (2), 745–780.[8] Birg´e, L. and P. Massart (1993). Rates of convergence for minimum contrast esti-mators.
Probab. Theory Related Fields 97 (1-2), 113–150.[9] Bronˇste˘ın, E. M. (1976). ε -entropy of convex sets and functions. Sibirsk. Mat.ˇZ. 17 (3), 508–514, 715.[10] Chatterjee, S. (2014). A new perspective on least squares under convex constraint.
Ann. Statist. 42 (6), 2340–2381.[11] Chatterjee, S. (2016). An improved global risk bound in concave regression.
Elec-tron. J. Stat. 10 (1), 1608–1629. ur, G., Gao, F., Guntuboyina, A. and Sen, B./Convex Regression in multidimensions [12] Chatterjee, S., A. Guntuboyina, and B. Sen (2015). On risk bounds in isotonicand other shape restricted regression problems. Ann. Statist. 43 (4), 1774–1800.[13] Chen, W. and R. Mazumder (2020). Multivariate convex regression at scale. arXivpreprint arXiv:2005.11588 .[14] Chen, Y. and J. A. Wellner (2016). On convex least squares estimation when thetruth is linear.
Electron. J. Stat. 10 (1), 171–209.[15] Doss, C. R. (2020). Bracketing numbers of convex and m -monotone functions onpolytopes. J. Approx. Theory 256 , 105425.[16] Dryanov, D. (2009). Kolmogorov entropy for classes of convex functions.
Constr.Approx. 30 (1), 137–153.[17] D¨umbgen, L., S. Freitag, and G. Jongbloed (2004). Consistency of concave re-gression with an application to current-status data.
Math. Methods Statist. 13 (1),69–81.[18] Gao, F. (2008). Entropy estimate for k -monotone functions via small ball proba-bility of integrated Brownian motion. Electron. Commun. Probab. 13 , 121–130.[19] Gao, F. and J. A. Wellner (2007). Entropy estimate for high-dimensional monotonicfunctions.
J. Multivariate Anal. 98 (9), 1751–1764.[20] Gao, F. and J. A. Wellner (2017). Entropy of convex functions on R d . Constr.Approx. 46 (3), 565–592.[21] Ghosal, P. and B. Sen (2017). On univariate convex regression.
Sankhya A 79 (2),215–253.[22] Groeneboom, P. and G. Jongbloed (2014).
Nonparametric estimation under shapeconstraints , Volume 38. Cambridge University Press.[23] Groeneboom, P., G. Jongbloed, and J. A. Wellner (2001). Estimation of a convexfunction: characterizations and asymptotic theory.
Ann. Statist. 29 (6), 1653–1698.[24] Guntuboyina, A. (2012). Optimal rates of convergence for convex set estimationfrom support functions.
Ann. Statist. 40 (1), 385–411.[25] Guntuboyina, A. (2016). Covering numbers of L p -balls of convex functions andsets. Constructive Approximation 43 (1), 135–151.[26] Guntuboyina, A. and B. Sen (2013). Covering numbers for convex functions.
IEEETrans. Inf. Th. 59 (4), 1957–1965.[27] Guntuboyina, A. and B. Sen (2015). Global risk bounds and adaptation in uni-variate convex regression.
Probab. Theory Related Fields 163 (1-2), 379–411.[28] Han, Q. (2019). Global empirical risk minimizers with ”shape constraints” are rateoptimal in general dimensions. arXiv preprint arXiv:1905.12823 .[29] Han, Q., T. Wang, S. Chatterjee, and R. J. Samworth (2019). Isotonic regressionin general dimensions.
Ann. Statist. 47 (5), 2440–2471.[30] Han, Q. and J. A. Wellner (2016). Multivariate convex regression: global riskbounds and adaptation. arXiv preprint arXiv:1601.06844 .[31] Hanson, D. L. and G. Pledger (1976). Consistency in concave regression.
Ann.Statist. 4 (6), 1038–1050.[32] Hildreth, C. (1954). Point estimates of ordinates of concave functions.
J. Amer.Statist. Assoc. 49 , 598–619. ur, G., Gao, F., Guntuboyina, A. and Sen, B./Convex Regression in multidimensions [33] Keshavarz, A., Y. Wang, and S. Boyd (2011). Imputing a convex objective function.In Intelligent Control (ISIC), 2011 IEEE International Symposium on , pp. 613–619.IEEE.[34] Kuosmanen, T. (2008). Representation theorem for convex nonparametric leastsquares.
The Econometrics Journal 11 (2), 308–325.[35] Kur, G., Y. Dagan, and A. Rakhlin (2019). Optimality of maximum likelihoodfor log-concave density estimation and bounded convex regression. arXiv preprintarXiv:1903.05315 .[36] Kur, G., A. Guntuboyina, and A. Rakhlin (2020). On suboptimality of least squareswith application to estimation of convex bodies. Accepted for presentation and pub-lication in the 33rd Annual Conference on Learning Theory (COLT) July 9-12, 2020.[37] Lim, E. (2014). On convergence rates of convex regression in multiple dimensions.
INFORMS J. Comput. 26 (3), 616–628.[38] Lim, E. and P. W. Glynn (2012). Consistency of multidimensional convex regres-sion.
Oper. Res. 60 (1), 196–208.[39] Massart, P. (2007).
Concentration inequalities and model selection. Lecture notesin Mathematics , Volume 1896. Berlin: Springer.[40] Matzkin, R. L. (1991). Semiparametric estimation of monotone and concave utilityfunctions for polychotomous choice models.
Econometrica 59 (5), 1315–1327.[41] Mazumder, R., A. Choudhury, G. Iyengar, and B. Sen (2019). A computationalframework for multivariate convex regression and its variants.
J. Amer. Statist.Assoc. 114 (525), 318–331.[42] Seijo, E. and B. Sen (2011). Nonparametric least squares estimation of a multi-variate convex regression function.
Ann. Statist. 39 (3), 1633–1657.[43] Toriello, A., G. Nemhauser, and M. Savelsbergh (2010). Decomposing inventoryrouting problems with approximate value functions.
Naval Res. Logist. 57 (8), 718–727.[44] van de Geer, S. A. (2000).
Applications of empirical process theory , Volume 6 of
Cambridge Series in Statistical and Probabilistic Mathematics . Cambridge UniversityPress, Cambridge.[45] Varian, H. R. (1982). The nonparametric approach to demand analysis.
Econo-metrica 50 (4), 945–973.[46] Varian, H. R. (1984). The nonparametric approach to production analysis.