The Optimality of Polynomial Regression for Agnostic Learning under Gaussian Marginals
Ilias Diakonikolas, Daniel M. Kane, Thanasis Pittas, Nikos Zarifis
aa r X i v : . [ c s . L G ] F e b The Optimality of Polynomial Regressionfor Agnostic Learning under Gaussian Marginals
Ilias Diakonikolas ∗ University of Wisconsin Madison [email protected]
Daniel M. Kane † University of California, San Diego [email protected]
Thanasis PittasUniversity of Wisconsin Madison [email protected]
Nikos Zarifis ‡ University of Wisconsin Madison [email protected]
February 9, 2021
Abstract
We study the problem of agnostic learning under the Gaussian distribution. We develop amethod for finding hard families of examples for a wide class of problems by using LP duality.For Boolean-valued concept classes, we show that the L -regression algorithm is essentially bestpossible, and therefore that the computational difficulty of agnostically learning a concept classis closely related to the polynomial degree required to approximate any function from the classin L -norm. Using this characterization along with additional analytic tools, we obtain optimalSQ lower bounds for agnostically learning linear threshold functions and the first non-trivial SQlower bounds for polynomial threshold functions and intersections of halfspaces. We also developan analogous theory for agnostically learning real-valued functions, and as an application provenear-optimal SQ lower bounds for agnostically learning ReLUs and sigmoids. ∗ Supported by NSF Award CCF-1652862 (CAREER), a Sloan Research Fellowship, and a DARPA Learning withLess Labels (LwLL) grant. † Supported by NSF Award CCF-1553288 (CAREER) and a Sloan Research Fellowship. ‡ Supported in part by a DARPA Learning with Less Labels (LwLL) grant.
Introduction
In Valiant’s Probably Approximately Correct (PAC) learning model [Val84], a learner is givenaccess to random examples that are consistently labeled according to an unknown function in thetarget concept class. Here we focus on the agnostic framework [Hau92, KSS94], which modelslearning in the presence of worst-case noise. Roughly speaking, in the agnostic PAC model, weare given i.i.d. samples from a joint distribution D on labeled examples ( x , y ), where x ∈ R n isthe example and y ∈ R is the corresponding label, and the goal is to compute a hypothesis that iscompetitive with the “best-fitting” function in the target class C . The notion of agnostic learningis meaningful both for learning Boolean-valued functions (under the 0-1 loss) and for learningreal-valued functions (typically, under the L -loss). For concreteness, we restrict the proceedingdiscussion to the Boolean-valued setting.In the distribution-independent setting, agnostic learning is known to be computationally hard,even for simple concept classes and weak learning [GR06, FGKP06, Dan16]. On the other hand,under distributional assumptions, efficient learning algorithms with worst-case noise are possible.A line of work [KKMS08, KLS09, ABL17, Dan15, DKS18, DKTZ20] has given efficient learningalgorithms in the agnostic model for natural concept classes and distributions with various time-accuracy tradeoffs. In this paper, we will focus on agnostic learning under the Gaussian distributionon examples. For Boolean-valued concept classes, we have the following definition. Definition 1.1 (Agnostic Learning Boolean-valued Functions with Gaussian Marginals) . Let C be a class of Boolean-valued concepts on R n . Given i.i.d. samples ( x , y ) from a distribution D on R n × {± } , where the marginal D x on R n is the standard Gaussian N n and no assumptionsare made on the labels y , the goal is to output a hypothesis h : R n → {± } such that with highprobability we have Pr ( x ,y ) ∼ D [ h ( x ) = y ] ≤ OPT + ǫ , where OPT = inf f ∈C Pr ( x ,y ) ∼ D [ f ( x ) = y ].The only known algorithmic technique for agnostic learning in the setting of Definition 1.1is the L -polynomial regression algorithm [KKMS08]. This algorithm uses linear programmingto compute a low-degree polynomial that minimizes the L -distance to the target function. Itsperformance hinges on how well the underlying concept class C can be approximated, in L -norm,by low-degree polynomials. In more detail, if d is the (minimum) degree such that any f ∈ C canbe ǫ -approximated (in L -norm) by a degree- d polynomial, the algorithm has sample complexityand running time n O ( d ) / poly( ǫ ) and outputs a hypothesis with misclassification error OPT + ǫ .For several natural concept classes and distributions on examples, the aforementioned degree d is independent of the dimension n , and only depends on the error ǫ (and potentially othersize parameters). For these settings, the L -regression algorithm can be viewed as a polynomial-time approximation scheme (PTAS) for agnostic learning. Examples of such concept classes in-clude Linear Threshold Functions (LTFs) [KKMS08, DGJ +
10, DKN10], Bounded Degree Poly-nomial Threshold Functions (PTFs) [DHK +
10, Kan10, DRST14, HKM14], Intersections of Halfs-paces [KKMS08, KOS08, Kan14], and other geometric concepts [KOS08]. Specifically, for the classof LTFs under the Gaussian distribution, the L -regression algorithm is known to have sample andcomputational complexity of n O (1 /ǫ ) .For each of the above concept classes, L -polynomial regression is the fastest (and, essentially,the only) known agnostic learner. It is natural to ask whether there exists an agnostic learner withsignificantly improved sample/computational complexity. Can we beat L -polynomial regression for agnostic learning under Gaussian marginals?
1s our first main contribution, we answer the above question in the negative for all concept classessatisfying some mild properties (including all the geometric concept classes mentioned above). Ourlower bound applies for the class of Statistical Query (SQ) algorithms. Statistical Query (SQ)algorithms are a class of algorithms that are allowed to query expectations of bounded functionsof the underlying distribution rather than directly access samples. Formally, an SQ algorithm hasaccess to the following oracle.
Definition 1.2 ( STAT
Oracle) . Let D be a distribution on labeled examples supported on X × [ − , X . A statistical query is a function q : X × [ − , → [ − , STAT ( τ ) to be the oracle that given any such query q ( · , · ) outputs a value v such that | v − E ( x ,y ) ∼ D [ q ( x , y )] | ≤ τ , where τ > +
13, FPV15, FGV17, Fel17] forsome recent references. The reader is referred to [Fel16] for a survey. The class of SQ algorithmsis fairly broad: a wide range of known algorithmic techniques in machine learning are known to beimplementable using SQs (see, e.g., [CKL +
06, FGR +
13, FGV17]).Returning to our agnostic learning setting, roughly speaking, we show that a lower bound of d on the degree of any L approximating polynomial can be translated to an SQ lower bound of n Ω( d ) for the agnostic learning problem. This lower bound is tight, since the L -regression algorithm canbe implemented in the SQ model with complexity n O ( d ) .We note that a similar characterization had been previously shown, under somewhat differentassumptions, for agnostic learning under the uniform distribution on the hypercube [DFT + R n under the Gaussian distribution — i.e., algorithms with error O (OPT) + ǫ — that runin poly( n/ǫ ) time. No polynomial time algorithm with such an error guarantee is known for anydiscrete distribution. At a high-level, known algorithms for these problems make essential use ofthe anti-concentration of the Gaussian distribution, which fails in the discrete setting. Similar algo-rithmic gaps exist for robustly learning low-degree PTFs and intersections of halfspaces [DKS18].Our generic lower bound result for the Boolean case (Theorem 1.4) reduces the problem ofproving explicit SQ lower bounds for agnostic learning to the structural question of proving lowerbounds on the L polynomial approximation degree (under the Gaussian measure). As our secondcontribution, we provide a toolkit to prove explicit degree lower bounds. As a corollary, we proveoptimal or near-optimal SQ lower bounds for various natural classes, including LTFs, PTFs, andintersections of halfspaces.Moving away from the Boolean-valued setting, an interesting direction is to understand thecomplexity of agnostic learning for real-valued function classes. In recent years, this broad questionhas been intensely investigated in learning theory, in part due to its connections to deep learning.Here we focus on agnostic learning under the L -loss. Definition 1.3 (Agnostic Learning Real-valued Functions with Gaussian Marginals) . Let C be aclass of real-valued concepts on R n . Given i.i.d. samples ( x , y ) from a distribution D on R n × R ,where the marginal D x on R n is the standard Gaussian N n and no assumptions are made on thelabels y , the goal is to output a hypothesis h : R n → R such that with high probability we have E ( x ,y ) ∼ D [( h ( x ) − y ) ] / ≤ OPT + ǫ , where OPT = inf f ∈C E ( x ,y ) ∼ D [( f ( x ) − y ) ] / .2 prototypical concept class of significant recent interest are Rectified Linear Units (ReLUs). AReLU is any real-valued function f : R n → R + of the form f ( x ) = ReLU ( h w , x i + θ ), w ∈ R n and θ ∈ R , where ReLU : R → R + is defined as ReLU( u ) = max { , u } . ReLUs are the most commonlyused activation functions in modern deep neural networks. The corresponding agnostic learningproblem is a fundamental primitive in the theory of neural networks that has been extensivelystudied in recent years [GKKT17, MR18, GKK19, DGK +
20, GGK20, DKZ20].Our techniques extend to real-valued concepts leading to improved and nearly tight SQ lowerbounds for natural concept classes. We describe our contributions in the following subsection.
Contributions for Boolean-valued Concepts
Our main general result for Boolean-valuedconcepts is the following:
Theorem 1.4 (Generic SQ Lower Bound, Boolean Case) . Let n, m ∈ Z + with m ≤ n a for anyconstant < a < / and ǫ ≥ n − c for some sufficiently small constant c > . Fix a function f : R m → {± } . Let d be the smallest integer such that there exists a degree at most d polynomial p : R m → R satisfying E x ∼N m [ | p ( x ) − f ( x ) | ] < ǫ . Let C be a class of Boolean-valued functions on R n which includes all functions of the form F ( x ) = f ( Px ) , for any P ∈ R m × n such that PP ⊺ = I m .Any SQ algorithm that agnostically learns C under N n to error OPT + ǫ either requires queries withtolerance at most n − Ω( d ) or makes at least n Ω(1) queries.
The L -polynomial regression algorithm and Theorem 1.4 characterize the complexity of agnosticlearning under the Gaussian distribution – within the class of SQ algorithms – for a range ofconcept classes. If d is the (minimum) degree for which any function in C can be ǫ -approximatedby a degree- d polynomial in L -norm, the complexity of agnostically learning C is, roughly, n Θ( d ) . Applications of Theorem 1.4.
Note that the above result does not tell us what the optimaldegree d is for any given concept class C . Using analytic techniques, we establish explicit lowerbounds on the L polynomial approximation degree for three fundamental concept classes: LinearThreshold Functions (LTFs), Polynomial Threshold Functions, and Intersections of Halfspaces. Asa corollary, we obtain explicit SQ lower bounds for these classes. Our applications are summarizedin Table 1. Concept Class
Lower Bound Upper BoundLTFs Ω (cid:0) /ǫ (cid:1) [Gan02] O (cid:0) /ǫ (cid:1) [Gan02, DKN10]Degree- k PTFs Ω (cid:0) k /ǫ (cid:1) (Thm 3.1) O (cid:0) k /ǫ (cid:1) [Kan10]Intersections of k Halfspaces ˜Ω (cid:0) √ log k/ǫ (cid:1) (Thm 3.5) O (cid:0) log k/ǫ (cid:1) [KOS08]Table 1: Bounds on the degree d of ǫ -approximating polynomials in L -error under the Gaussianmeasure. For each concept class, we obtain an SQ lower bound of n Ω( d ) .For the class of LTFs, using a known degree lower bound for the sign function [Gan02], we im-mediately obtain an SQ lower bound of n Ω(1 /ǫ ) . This bound is optimal (within polynomial factors),improving on the previous SQ lower bound of n Ω(1 /ǫ ) [GGK20, DKZ20]. Our approach is simplerand more general compared to these prior works, immediately extending to other families. For thebroader class of degree- k PTFs, we establish a degree lower bound of Ω( k /ǫ ) (Proposition 3.2),which yields an SQ lower bound of n Ω( k /ǫ ) for the agnostic learning problem.3ur third explicit degree lower bound is for intersections of k halfspaces. For this concept class,we prove a degree lower bound of d = ˜Ω( √ log k/ǫ ) which implies a corresponding SQ lower boundof n ˜Ω( √ log k/ǫ ) . In the process, we establish a new structural result translating lower bounds on theGaussian Noise Sensitivity (GNS) of any Boolean function to the L -polynomial approximationdegree of the same function.Recall that the Gaussian Noise Sensitivity (GNS) of a function f : R n → {± } is defined asGNS ρ ( f ) def = Pr ( x , y ) ∼N ρn [ f ( x ) = f ( y )], where N ρn is the distribution of a (1 − ρ )-correlated Gaussianpair (i.e., x and y are standard Gaussians with correlation (1 − ρ )). We show the following: Theorem 1.5 (Structural Result) . Let f : R n → {± } and p : R n → R be a degree at most d polynomial. Then, we have that E x ∼N n [ | f ( x ) − p ( x ) | ] ≥ Ω(1 / log( d ))GNS (log( d ) /d ) ( f ) . Furthermore,for any ǫ > , we have that E x ∼N n [ | f ( x ) − p ( x ) | ] ≥ GNS ǫ ( f ) / − O ( d √ ǫ ) . Contributions for Real-valued Concepts
For agnostically learning real-valued concepts, weprovide two generic lower bound results, analogous to Theorem 1.4, for Correlational SQ (CSQ)algorithms and general SQ algorithms respectively. A conceptual message of our results is that L regression is essentially optimal against CSQ algorithms, but not necessarily optimal againstgeneral SQ algorithms.Recall that Correlational SQ (CSQ) algorithms are a subclass of SQ algorithms, where thealgorithm is allowed to choose any bounded query function on the examples and obtain estimatesof its correlation with the labels. (See Appendix A.1 for a detailed description.) This class ofalgorithms is fairly broad, capturing many learning algorithms used in practice (including gradient-descent). For CSQ algorithms, we prove. Theorem 1.6 (Generic CSQ Lower Bound, Real-valued Case) . Let n, m ∈ Z + with m ≤ n a for anyconstant < a < / and ǫ ≥ n − c for some sufficiently small constant c > . Let f : R m → R with E x ∼N m [ f ( x )] = 1 and d be the smallest integer such that there exists a degree at most d polynomial p : R m → R satisfying k f − p k < ǫ . Let C be a class of real-valued functions on R n which includesall functions of the form F ( x ) = f ( Px ) , for any matrix P ∈ R m × n satisfying PP ⊺ = I m . Then,any CSQ algorithm that agnostically learns C over N n to L -error OPT + ǫ either requires querieswith tolerance at most n − Ω( d ) or makes at least n Ω(1) queries.
Our lower bound for the general SQ model is presented below. The difference between the twois that the latter uses the L -norm to measure the approximation of f by polynomials. Theorem 1.7 (Generic SQ Lower Bound, Real-valued Case) . Let n, m ∈ Z + with m ≤ n a for anyconstant < a < / and ǫ ≥ n − c for some sufficiently small constant c > . Let f : R m → R with E x ∼N m [ f ( x )] = 1 and d be the smallest integer such that there exists a degree at most d polynomial p : R m → R satisfying k f − p k < ǫ . Let C be a class of real-valued functions on R n which includesall functions of the form F ( x ) = f ( Px ) , for any matrix P ∈ R m × n satisfying PP ⊺ = I m . Then,any SQ algorithm that agnostically learns C over N n to L -error OPT + ǫ either requires querieswith tolerance at most n − Ω( d ) or makes at least n Ω(1) queries.
Applications of Theorems 1.6 and 1.7.
As in the Boolean-valued setting, obtaining explicit(C)SQ lower bounds for agnostically learning real-valued concepts requires analytic tools to es-tablish lower bounds on the degree of polynomial approximations. In this paper, we give suchlower bounds for two fundamental concept classes: ReLUs and sigmoids. Establishing degree lowerbounds for other non-linear activations is left as a question for future work.4 = 1 p = 2 Concept Class
Lower Bound Upper Bound Lower Bound Upper BoundReLUs Ω (1 /ǫ ) (Cor. 5.2) O (1 /ǫ ) Ω (cid:0) /ǫ / (cid:1) (Cor. 5.2) O (cid:0) /ǫ / (cid:1) Sigmoids Ω(log(1 /ǫ )) (Thm. 5.7) O (log (1 /ǫ )) Ω (cid:0) log (1 /ǫ ) (cid:1) (Cor. 5.4) O (cid:0) log (1 /ǫ ) (cid:1) Table 2: Bounds on the degree d of ǫ -approximating polynomials in L and L -error under theGaussian measure. For each concept class, we obtain a CSQ (resp. SQ) lower bound of n Ω( d ) ,where d is the L degree (resp. L degree).Our degree lower bounds applications for both L and L polynomial approximations are sum-marized in Table 2. Combining these degree lower bounds Theorems 1.6 and 1.7 implies explicitSQ lower bounds for ReLUs and sigmoids.Concretely, for agnostically learning ReLUs, we establish a CSQ lower bound of n Ω(1 /ǫ / ) (matching the n O (1 /ǫ / ) upper bound obtained via L -regression); and an SQ lower bound of n Ω(1 /ǫ ) , improving on the previous best bound of n Ω((1 /ǫ ) / ) [GGK20, DKZ20]. SQ Lower Bounds for Boolean-valued Functions
The starting point for our lower bounds isthe work of [DKS17], which shows that if D is a univariate distribution whose low-degree momentsmatch those of a standard Gaussian (and which satisfies some other mild niceness conditions), thenit is SQ-hard to distinguish between a standard multivariate Gaussian and a distribution that isa copy of D in a random direction and a standard Gaussian in the orthogonal directions. (Thisis shown in [DKS17] for D a 1-dimensional distribution, but it is not hard to generalize to higherdimensional distributions.)Note that the above setting is unsupervised. To go from distributions to functions, we will tryto produce a Boolean function f of a few variables such that the distributions of X conditionedon f ( X ) = 1 and on f ( X ) = − f embedded in a hidden low-dimensional subspace isSQ hard to distinguish from a random function. Our goal then is to find such a function f that is(1 / − ǫ )-close to a function in our family. Given this construction, learning the function to errorOPT + ǫ/ f from a random function.The aforementioned approach was recently used by [DKZ20]. However, while that work con-structs the function f somewhat directly, here we take a more general approach. In more detail,it is not hard to phrase the conditions that (1) f is bounded in [ − , L -norm. Thedegree of such a polynomial conveniently matches the parameter that determines the runtime ofthe L - polynomial regression algorithm. We can thus show that, for reasonable function families,the L -regression algorithm is in fact optimal, among SQ algorithms, up to polynomial factors.The above characterization allows us to determine the complexity of agnostically learning LTFs,by leverage tight degree lower bounds for the sign function. For the cases of degree- k PTFs andintersections of k halfspaces, we do not know what the correct answer is, but we are able to prove5on-trivial, and qualitatively close to optimal, lower bounds.We note that the L approximation theory for these functions is more challenging than the L approximation theory (which is entirely determined by the Fourier decay). To that end, we developnew techniques relating L approximability to the Gaussian Noise sensitivity (Theorem 1.5), whichallows us to prove the first non-trivial lower bounds. The proof of Theorem 1.5 works via asymmetrization technique. In particular, let θ = arccos(1 − ǫ ) and let X and Y be standardGaussians. Let F X,Y ( φ ) := f (sin( φ ) X + cos( φ ) Y ). Then we can write GNS ǫ ( f ) = Pr [ F X,Y ( φ ) = F X,Y ( φ + θ )]. On the other hand, k f − p k = E [ | F X,Y ( φ ) − p (sin( φ ) X + cos( φ ) Y ) | ]. Thus, it sufficesto show that if F is any Boolean function on the circle that the L approximation error of F bylow degree polynomials can be bounded below by Pr [ F ( φ ) = F ( φ + θ )]. To show this, we usebasic Fourier analysis to show that any low-degree polynomial with small L norm cannot have anylarge higher derivatives. This implies that if F transitions from being 0 to being 1 over some smallinterval, that any low-degree polynomial will not be able to match it very well in this interval. (C)SQ Lower Bounds for Real-valued Functions We now move to real-valued functions andsketch our CSQ and SQ lower bounds. For CSQ lower bounds, we obtain a similar characterization.The difference is that, in the real-valued setting, we need to find a real-valued function f whose low-degree moments vanish, and which is close to the function we are trying to learn in L norm . 
Thiscan be phrased as a similar LP and, applying duality, we find that the complexity is determined bythe degree needed to approximate the function we are trying to learn in L norm. For this particularsetting, the LP can actually be solved explicitly and the best possible approximation function isobtained by taking the high-degree Hermite component of f . This lower bound matches (up topolynomial factors in the final error) the upper bound coming from the L polynomial regressionalgorithm. This means that we can qualitatively characterize the complexity of agnostic learningusing CSQ algorithms. In particular, we use this characterization to obtain new CSQ lower boundson agnostically learning ReLUs and sigmoids.Our SQ lower bounds against learning real-valued functions are somewhat more challenging,since the approximating function f must have more than just vanishing moments. It must haveall its level-sets match low-degree moments with a standard Gaussian (which is equivalent onlyfor Boolean-valued functions). Because of this additional requirement, we restrict our “imitatingfunctions” to Boolean-valued functions. We can still find an LP defining f , however the dual givesus the relevant parameter of the degree needed to approximate the function we are trying to learnin L -norm (rather than L -norm) for which a matching upper bound is not known. So, in thiscase, while we can still obtain significantly improved SQ lower bounds for agnostically learning anumber of concept classes, we do not obtain optimal results. At the level of results, the most relevant prior works are the two independent works [DKZ20,GGK20], which established the previously best SQ lower bounds for LTFs, ReLUs, and sigmoidsunder the Gaussian distribution. We have already provided a technical comparison to [DKZ20]in the previous subsection. The work [GGK20] relies on a boosting procedure that translatesrecent SQ lower bounds for (non-agnostic) learning one-hidden-layer neural networks [DKKZ20] toagnostically learning simple concept classes.A useful point of technical comparison is the work [DFT + L -approximation degree — and the duality-based proof6echniques are similar. In particular, [DFT +
15] sets up a finite
LP to find a function f that hasvanishing Fourier coefficients but is close in L -norm to the target function. Due to the discrete na-ture of the setting they consider, [DFT +
15] avoids the functional analysis based arguments requiredto establish duality in our setting.A more significant difference with our framework is that the hard family of [DFT +
15] embedsa copy of f as a junta on a random subset of coordinates, while ours embeds it in a random low-dimensional subspace. This is a critical distinction and is necessary in the Gaussian setting toobtain our tight characterization and the associated applications to LTFs/PTFs and intersectionsof halfspaces. Finally, we remark that the appendix of [DFT +
15] sketches a generalization of theirresults to arbitrary product distributions (including the Gaussian distribution). We emphasize,however, that the lower bound obtained from their construction does not match the guaranteeof the L -regression algorithm [KKMS08] for the following reason: The exponent for their lowerbounds for the continuous setting have to do with the degree necessary to ǫ -approximate the hardfunction as a linear combination of d -juntas . On the other hand, the upper bound of [KKMS08]is related to the approximation by degree- d polynomials. Note that degree- d polynomials arealways linear combinations of d -juntas, and thus the approximation degree by linear combinationsof juntas is lower than the approximation degree by polynomials. In summary, while the lowerbound of [DFT +
15] is tight for discrete product distributions, this is not true in general.
Notation
For n ∈ Z + , we denote [ n ] def = { , . . . , n } . We typically use small letters to denoterandom variables when the underlying distribution is clear from the context. We use E [ x ] for theexpectation of the random variable x and Pr [ E ] for the probability of event E . We will use U ( S ) forthe uniform distribution on the set S . Let N denote the standard univariate Gaussian distributionand N n denote the standard n -dimensional Gaussian distribution. We use φ n to denote the pdf of N n . Sometimes we may use the same symbol for a distribution and its pdf, i.e., denote by D ( x )the density that the distribution D gives to the point x .Small boldface letters are used for vectors and capital boldface letters are used for matrices.Let k x k denote the L -norm of the vector x ∈ R n . We use h u , v i for the inner product of vectors u , v ∈ R n . For a matrix P ∈ R m × n , let k P k denote its spectral norm and k P k F denote itsFrobenius norm. We use I n to denote the n × n identity matrix. We denote by P nd the class ofall polynomials from R n to R with degree at most d . We sometimes use the notation ˜ O ( · ) (resp.˜Ω( · )), this is the same with O ( · ) (resp. Ω( · )), ignoring logarithmic factors, i.e., O ( d log k d ) = ˜ O ( d ). Statistical Query Dimension
To bound the complexity of SQ learning a concept class C , weuse the SQ framework for problems over distributions [FGR + Definition 1.8 (Decision Problem over Distributions) . Let D be a fixed distribution and D be adistribution family. We denote by B ( D , D ) the decision (or hypothesis testing) problem in whichthe input distribution D ′ is promised to satisfy either (a) D ′ = D or (b) D ′ ∈ D , and the goal is todistinguish between the two cases. Definition 1.9 (Pairwise Correlation) . The pairwise correlation of two distributions with proba-bility density functions D , D : R n → R + with respect to a distribution with density D : R n → R + , where the support of D contains the supports of D and D , is defined as χ D ( D , D ) def = R R n D ( x ) D ( x ) /D ( x ) d x − Definition 1.10.
We say that a set of s distributions D = { D , . . . , D s } over R n is ( γ, β )-correlatedrelative to a distribution D if | χ D ( D i , D j ) | ≤ γ for all i = j , and | χ D ( D i , D j ) | ≤ β for i = j .7 efinition 1.11 (Statistical Query Dimension) . For β, γ >
0, a decision problem B ( D , D ), where D is a fixed distribution and D is a family of distributions, let s be the maximum integer such thatthere exists a finite set of distributions D D ⊆ D such that D D is ( γ, β )-correlated relative to D and |D D | ≥ s. The
Statistical Query dimension with pairwise correlations ( γ, β ) of B is defined to be s ,and denoted by SD( B , γ, β ). Lemma 1.12.
Let B ( D , D ) be a decision problem, where D is the reference distribution and D isa class of distributions. For γ, β > , let s = SD( B , γ, β ) . For any γ ′ > , any SQ algorithm for B requires queries of tolerance at most √ γ + γ ′ or makes at least sγ ′ / ( β − γ ) queries. The idea of our construction is to find a function g : R m → [ − ,
1] whose low-degree momentsvanish and is non-trivially close to f . Our hard distribution will then embed g in a random m -dimensional subspace. Given this construction, we can apply Lemma 1.12 to prove Theorem 1.4.The following result establishes the existence of such a function g . Proposition 2.1.
Let f : R m → {± } be such that for any polynomial p : R m → R of degree atmost d − , it holds E x ∼N m [ | p ( x ) − f ( x ) | ] ≥ ǫ . There exists a function g : R m → [ − , such that:1. For any degree at most d − polynomial P : R m → R , we have that E x ∼N m [ P ( x ) g ( x )] = 0 ,i.e., g has zero low-degree moments, and,2. E x ∼N m [ | g ( x ) − f ( x ) | ] ≤ − ǫ , i.e., g is non-trivially close to f .Proof. Note that such a function g would be a solution to the following infinite linear program(LP): ( ∗ ) E x ∼N m [ | g ( x ) − f ( x ) | ] ≤ − ǫ E x ∼N m [ P ( x ) g ( x )] = 0 ∀ P ∈ P md − | g ( x ) | ≤ ∀ x ∈ R m We claim that the LP ( ∗ ) is equivalent to the following LP:( ∗∗ ) − E x ∼N m [ g ( x ) f ( x )] + 2 ǫ ≤ E x ∼N m [ P ( x ) g ( x )] = 0 ∀ P ∈ P md − E x ∼N m [ g ( x ) h ( x )] − k h k ≤ ∀ h ∈ L ( R m )We now show the equivalence between the two formulations. We claim that the third constraint of( ∗ ) is equivalent with the third constraint of ( ∗∗ ). This follows by introducing the “dual variable” h : R m → R . The forward direction follows from H¨older’s inequality and the inverse follows fromthe definition of dual norms as suprema. Finally, for the first constraints, note that since f isBoolean-valued and k g k ∞ ≤
1, we have that E x ∼N m [ | g ( x ) − f ( x ) | ] = 1 − E x ∼N m [ g ( x ) f ( x )].At this point, we would like to use “LP duality” to argue that ( ∗∗ ) is feasible if and only if its“dual LP” is infeasible. While such a statement turns out to be true, it requires some care to provesince we are dealing with infinite LPs (both in number of variables and constraints). The proofrequires a version of the geometric Hahn-Banach theorem from functional analysis.8 emma 2.2 (Informal) . The LP defined by ( ∗∗ ) is feasible if only if there is no conical combinationof the inequalities of ( ∗∗ ) that yields the contradictory inequality E x ∼N m [ g ( x ) ·
0] + 1 ≤ . A proof of this lemma can be found on Appendix C. Using Lemma 2.2, the LP defined by ( ∗∗ ) isfeasible if and only if the following “dual” LP is infeasible:( ∗∗ ′ ) k h k − λ ǫ < h ( x ) + P ( x ) − λ f ( x ) = 0 ∀ x ∈ R m λ ≥ , h ∈ L ( R m ) , P ∈ P md − Suppose that such a solution ( λ, h, P ) exists. We can assume that λ >
0, since otherwise the firstinequality is violated. Moreover, by scaling the solution, we can further assume λ = 1. Then, thesecond constraint becomes h = f − P and the first becomes k f − P k < ǫ . However, this cannothappen by the definition of the degree d (since, by assumption, there is no polynomial of degree lessthan d such that k f − P k < ǫ ). Therefore, the LP ( ∗∗ ) is feasible, which completes our proof.Our construction will use rotated versions of the function g from Proposition 2.1 to create afamily of distributions that is hard to distinguish from a fixed reference distribution. To boundthe SQ dimension of this hypothesis testing problem, we will need a generalization of Lemma 16in [DKKZ20], which bounds the correlation of two rotated versions of g . To formally state ourlemma, we will need one additional piece of terminology. If g ( x ) = P J ∈ N m ˆ g ( J ) H J ( x ) is theHermite expansion of g , the degree- t Hermite part of g is the sum of the terms corresponding tothe Hermite polynomials of degree exactly t . (For background in multilinear algebra and Hermiteanalysis, see Appendices A.2 and A.3.) Our main correlation lemma is the following. Lemma 2.3 (Correlation Lemma) . Let g : R m → R and U , V ∈ R m × n be linear maps such that UU ⊺ = VV ⊺ = I m . Then, we have that E x ∼N n [ g ( Ux ) g ( Vx )] ≤ ∞ X t =0 k UV ⊺ k t E x ∼N m [( g [ t ] ( x )) ] , where g [ t ] denotes the degree- t Hermite part of g .Proof. To simplify notation, write g ( x ) = g ( Ux ) and g ( x ) = g ( Vx ). Moreover, we will write g ( x ) ∼ P ∞ k =0 g [ k ]1 ( x ) and g ( x ) ∼ P ∞ k =0 g [ k ]2 ( x ). Using Fact A.5, we obtain E x ∼N n [ g ( x ) g ( x )] = ∞ X k =0 E x ∼N n [ g [ k ]1 ( x ) g [ k ]2 ( x )] = ∞ X k =0 k ! h∇ k g [ k ]1 ( x ) , ∇ k g [ k ]2 ( x ) i = ∞ X k =0 k ! h∇ k g [ k ] ( Ux ) , ∇ k g [ k ] ( Vx ) i . (1)Denote by U ⊆ R n the image of the linear map U ⊺ . Now observe that, using the chain rule, forany function h ( Ux ) : R n → R it holds ∇ h ( Ux ) = ∂ i h ( Ux ) U ij ∈ U , where we used Einstein’ssummation notation for repeated indices. Applying the above rule k times, we have that ∇ k h ( Ux ) = ∂ i k . . . ∂ i h ( Ux ) U i j . . . U i k j k ∈ U ⊗ k . We denote R = ∇ k g [ k ] ( x ) and observe that this tensor does not depend on x . Moreover, denote M = UV ⊺ , S = ∇ k g [ k ] ( Ux ) = ( U ⊺ ) ⊗ k R ∈ U ⊗ k , and T = ∇ k g [ k ] ( Vx ) = ( V ⊺ ) ⊗ k R ∈ V ⊗ k . We havethat h S , T i = h ( U ⊺ ) ⊗ k R , ( V ⊺ ) ⊗ k R i = h R , M ⊗ k R i ≤ (cid:13)(cid:13)(cid:13) M ⊗ k (cid:13)(cid:13)(cid:13) k R k = k ! k M k k E x ∼N n [( g [ k ] ( x )) ] , Corollary 2.4.
Let d ≥ and D be a distribution over R m such that the first ( d − moments of D match the corresponding moments of N m . Let G ( x ) = D ( x ) /φ m ( x ) be the ratio of the correspondingprobability density functions. For matrices U , V ∈ R m × n such that UU ⊺ = VV ⊺ = I m , define D U and D V to have probability density functions G ( Ux ) φ n ( x ) and G ( Vx ) φ n ( x ) , respectively. Then,we have that | χ N n ( D U , D V ) | ≤ k UV ⊺ k d χ ( D, N m ) .Proof. We compute χ N n ( D U , D V ) = E x ∼N n (cid:20) ( D U ( x ) − φ n ( x ))( D V ( x ) − φ n ( x )) φ n ( x ) (cid:21) = E x ∼N n [( G ( Ux ) − G ( Vx ) − . We then apply Lemma 2.3 to the function g ( x ) = G ( x ) −
1. Note that the assumption that D matches the first d − N m is equivalent to saying that g [ t ] = 0 for t < d . Thus,Lemma 2.3 implies that | χ N n ( D U , D V ) | ≤ k UV ⊺ k d ∞ X t =0 E x ∼N m [( g [ t ] ( x )) ] = k UV ⊺ k d E x ∼N m [ g ( x )] ≤ k UV ⊺ k d χ ( D, N m ) , where the equality is Parseval’s identity and in the last inequality we used the definition of G .Note that D U and D V are copies of D in the subspaces defined by U and V respectively, andindependent Gaussians in the orthogonal component.In order to create our hard family of distributions, we will need the following lemma which statesthat there exist exponentially many linear operators from R n to R m that are nearly orthogonal. Lemma 2.5.
Let < a, c < / and m, n ∈ Z + such that m ≤ n a . There exists a set S of Ω( n c ) matrices in R m × n such that every U ∈ S satisfies UU ⊺ = I m and every pair U , V ∈ S with U = V satisfies k UV ⊺ k F ≤ O ( n c − a ) .Proof. Our proof relies on the following fact that there exist exponentially many nearly orthogonalunit vectors.
Fact 2.6 (see, e.g., Lemma 3.7 of [DKS17]) . For any < c < / there exists a set S ′ of Ω( n c ) unit vectors in R n such that any pair u , v ∈ S ′ , with u = v , satisfies |h u , v i| < O ( n c − / ) . Let S ′ be the set of unit vectors that Fact 2.6 constructs. We group them into sets of size m anduse the vectors of each group as rows for each matrix that we make. Thus, we create at least | S ′ | /n a = 2 Ω( n c ) many matrices. Next, we ortho-normalize each matrix V ∈ S ′ using the Gram-Schmidt process, in order to get VV ⊺ = I m . In every row of V , the Gram-Schmidt algorithm addsat most m orthogonal vectors, each having norm O ( n c − / ). Thus, the total correction term foreach row has norm at most √ mO ( n c − / ). Putting everything together, we have that for all U , V obtained that way, k UV ⊺ k F ≤ (cid:16) m m O ( n c − / ) (cid:17) / = O (cid:0) n c − a (cid:1) .
10e now formally define the family of distributions that we use to prove our hardness result.
Definition 2.7.
Given a function g : R m → [ − , D g to be the class of distributionsover R n × {± } of the form ( x , y ) such that x ∼ N n and E [ y | x = z ] = g ( Uz ), where U ∈ R m × n with UU ⊺ = I m .In the following, we show that if g has low-degree moments equal to zero, then distinguishing D g from the distribution ( x , y ) with x ∼ N n , y ∼ U ( {± } ) is hard in the SQ model. Proposition 2.8.
Let g : R m → [ − , be such that E x ∼N m [ g ( x ) p ( x )] = 0 , for every polynomial p : R m → R of degree less than d , and D g be the class of distributions from Definition 2.7. Then, if m ≤ n a , for some constant a < / , any SQ algorithm that solves the decision problem B ( D g , N n ×U ( {± } )) must either use queries of tolerance n − Ω( d ) or make at least n Ω(1) queries.Proof.
Consider the set of matrices S of Lemma 2.5, for an appropriately small value of c > U ∈ S is associated with a unique element of D g . For every pair of distinct U , V ∈ S ,we have that k UV ⊺ k ≤ k UV ⊺ k F ≤ O ( n c − a ) ≤ n − Ω(1) , where for the last inequality we chose c to be a sufficiently small constant, e.g., c = (1 − a ) / D g associated to a matrix U has probability density (1+ g ( Ux )) φ n ( x )when conditioned on y = 1, and density (1 − g ( Ux )) φ n ( x ) when conditioned on y = −
1. Let D U be the distribution associated to U and D V the distribution associated to V . Denote by A U thedistribution D U conditioned on the event y = 1 and B U the same distribution conditioned on y = −
1. Similarly, let A V and B V denote the conditional distributions associated with V . Usingthe definition of pairwise correlation and the fact that y gets each label with equal probability, itfollows directly that χ N n ×U ( {± } ) ( D U , D V ) = 12 ( χ N n ( A U , A V ) + χ N n ( B U , B V )) . By Corollary 2.4 applied to A U , A V and B U , B V , we obtain χ N n ( A U , A V ) + χ N n ( B U , B V ) ≤ k UV ⊺ k d (cid:0) χ ( A, N m ) + χ ( B, N m ) (cid:1) , where A is the distribution of the random variable Ux for x ∼ A U (and similarly for B ). For the χ -divergence terms, we have that χ ( A, N m ) = Z R m A ( z ) φ m ( z ) d z − Z R m φ m ( z ) Pr [ y = 1 | x = z ] φ m ( z ) Pr [ y = 1] d z − ≤ Z R m φ m ( z ) Pr [ y = 1 | x = z ]d z − Pr [ y = 1] − , where we used the definition of A , Bayes’ rule and the fact that Pr [ y = 1] = 1 /
2. Combining theabove, we get that | χ N n ×U ( {± } ) ( D U , D V ) | ≤ n − Ω( d ) . This inequality implies that SD( B , γ, β ) =2 Ω( n c ) , for γ = n − Ω( d ) and β = O (1). Using Lemma 1.12, with γ ′ = γ , completes the proof. Proof of Theorem 1.4.
Let A be an agnostic learner for C . We use A to solve the decision problem B ( D g , N n ×U ( {± } )), where g : R m → [ − ,
1] is the function from Proposition 2.1 and D g the familyof Definition 2.7. Let D ′ be the target distribution, i.e., D ′ = N n × U ( {± } ) if the null hypothesisis true or D ′ ∈ D g otherwise. We feed A examples drawn from D ′ and it outputs a hypothesis h : R n → {± } such that Pr ( x ,y ) ∼ D ′ [ h ( x ) = y ] ≤ OPT+ ǫ . If D ′ ∈ D g , then for a matrix U ∈ R m × n UU ⊺ = I m , we have that OPT ≤ Pr ( x ,y ) ∼ D ′ [ f ( Ux ) = y ] = k f − g k ≤ (1 − ǫ ), where in theequality we used the fact that the expectation of y conditioned on x is g ( x ) and the last inequalityis due to Proposition 2.1. Combining the above, we get that Pr ( x ,y ) ∼ D ′ [ h ( x ) = y ] ≤ (1 − ǫ ) /
2, orequivalently that E ( x ,y ) ∼ D ′ [ h ( x ) y ] ≥ ǫ . On the other hand, if the labels were drawn uniformly atrandom, this correlation would be exactly 0. Therefore, we can distinguish between the two casesby performing a final query of tolerance ǫ/ h with y . k PTFs
Linear threshold functions (LTFs) are Boolean functions of the form F ( x ) = sign( h w , x i + θ ), where w ∈ R n and θ ∈ R . A degree- k PTF is any Boolean function of the form F ( x ) = sign( q ( x )), where q : R n → R is a real degree- k polynomial. In this section, we show: Theorem 3.1 (Degree Lower Bound for PTFs) . There exists a degree- k PTF f : R → {± } suchthat any degree- d polynomial p : R → R with k f − p k < ǫ must have d = Ω( k /ǫ ) . Theorems 1.4 and 3.1 imply that any SQ algorithm that agnostically learns the class of degree- k PTFs on R n under the Gaussian distribution must have complexity at least n Ω( k /ǫ ) .We now elaborate on these contributions. Lower Bound for LTFs
The L -regression algorithm [KKMS08] is known to be an agnosticlearner for LTFs under Gaussian marginals with complexity n O (1 /ǫ ) . This upper bound uses theknown fact that the L polynomial ǫ -approximate degree of LTFs under the Gaussian distribu-tion is d = O (1 /ǫ ) (see, e.g., [DKN10]). This upper bound is tight. Specifically, known resultsin approximation theory (see Appendix B.1) imply that, any polynomial that ǫ -approximates thefunction sign( t ) in L -norm, under the standard Gaussian distribution, requires degree Ω(1 /ǫ ).Given this structural result, an application of Theorem 1.4, for m = 1 and f ( t ) = sign( t ) givesthe tight SQ lower bound of n Ω(1 /ǫ ) . This bound improves on the best previous bound of n Ω(1 /ǫ ) [GGK20, DKZ20]. Importantly, our approach is much simpler and generalizes to any conceptclass satisfying the mild assumptions of Theorem 1.4. Lower Bound for Degree- k PTFs
The L -regression algorithm is known to be an agnosticlearner for degree- k PTFs under Gaussian marginals with complexity n O ( k /ǫ ) . This upper bounduses the known upper bound of O ( k √ ǫ ) on the Gaussian noise sensitivity of degree- k PTFs [Kan10],which implies an upper bound of O ( k /ǫ ) on the L polynomial ǫ -approximate degree, and thereforean upper bound of O ( k /ǫ ) on the L polynomial ǫ -approximate degree. This degree upper boundis not known to be optimal (in fact, it is provably sub-optimal for k = 1) and it is a plausibleconjecture that the right answer is Θ( k /ǫ ). Here we prove a lower bound of Ω( k /ǫ ), whichapplies even for the univariate case. Proposition 3.2.
There exists a ( k + 1) -piecewise-constant function f : R → { , } such that anydegree- d polynomial p : R → R that satisfies k f − p k < ǫ must have d = Ω( k /ǫ ) . An application of Theorem 1.4, for m = 1 and f ( t ) being the piecewise constant function ofProposition 3.2, implies an SQ lower bound of n Ω( k /ǫ ) .Before we provide the formal proof, we sketch the proof of Proposition 3.2. The hard function f consists of k/ f = 0 in the first half, and f = 1 in the second half. We construct a distribution D thatputs almost all of its mass in the first half of each interval, matches the first d moments with thestandard Gaussian, and D ( x ) ≤ φ ( x ) for all x ∈ R . Then, by construction E x ∼N [ f ( x )] is muchlarger than the same expectation under D . We show that, in fact, this difference bounds frombelow the error of any degree- d polynomial approximation to the function f .The main technical lemma we establish in this context is the following: Lemma 3.3.
There exists a univariate distribution D that (i) matches its first d moments with N ,(ii) the pdf of D is at most times the pdf of N pointwise in R , and (iii) for some α = Θ(1 / √ d ) it holds that Pr [( X mod a ) ∈ ( a/ , a )] = 2 − Ω( d ) . We defer the proof of Lemma 3.3 to Section 3.2 and show how it implies Proposition 3.2 below.
Proof of Proposition 3.2.
We can assume that k is even. Let f be 1 on the k/ ia + a/ , ( i +1) a ), for i = 0 , . . . , k/ −
1, and zero elsewhere. Denote by D the distribution of Lemma 3.3.From property (iii), we have that E x ∼ D [ f ( x )] = 2 − Ω( d ) k . On the other hand, assuming that k = O ( √ d ), we have that E x ∼N [ f ( x )] = Ω( k/ √ d ). This is because the regions where f is 1are contained in the interval [0 , Θ( k/ √ d )] ⊆ [0 , O (1)], where the pdf of the standard Gaussian isbounded below by some constant.Let D ( x ) and φ ( x ) denote the density on point x of the distribution D and N respectively. Forevery polynomial p : R → R of degree at most d , it holds E x ∼N [ f ( x )] − E x ∼ D [ f ( x )] = E x ∼N (cid:20) f ( x ) (cid:18) − D ( x ) φ ( x ) (cid:19)(cid:21) = E x ∼N (cid:20) ( f ( x ) − p ( x )) (cid:18) − D ( x ) φ ( x ) (cid:19)(cid:21) ≤ E x ∼N [ | f ( x ) − p ( x ) | ] , where the second equality follows from the fact that D matches its first d moments with N , andin the last inequality we used that 0 ≤ D ( x ) ≤ φ ( x ) for all x ∈ R . Thus, if f could be L -approximated to error ǫ by a degree- d polynomial, then E x ∼N [ f ( x )] − E x ∼ D [ f ( x )] would be atmost ǫ . But we already showed that this is Ω( k/ √ d ), which implies that d = Ω( k /ǫ ). First, we need the following lemma.
Lemma 3.4.
There is a d -wise independent family of t = O ( d ) standard Gaussians X , X , . . . , X t such that (cid:0)P ti =1 X i (cid:1) mod 1 ∈ [0 , / with probability − − Ω( d ) . Furthermore, such a distributioncan be obtained by rejection sampling a set of independent standard Gaussians, where a sample isrejected with probability / .Proof. The standard Gaussian distribution can be decomposed into a uniform component and aremaining term. That is, N = c U ([0 , − c ) E , where U ([0 , , E is another distribution, and c > t ∈ N such that t > d/c . We generatethis d -wise independent family X , . . . , X t as follows.First, we sample Y , . . . , Y t independent standard Gaussians, writing each Y i either as a samplefrom U ([0 , E . Then, two complementary cases are considered. Case 1 . The number of Y i ’s that came from U ([0 , d . In this case, the sampleis rejected with probability 1 / Case 2 . Otherwise, the sample is rejected if and only if ( P ti =1 Y i ) mod 1 ∈ (1 / , X , . . . , X t be the output of this rejection sampling procedure. The probability that the sampleis generated by the first case of the algorithm is exponentially small. To see this, define Z i ∈ { , } to be one if and only Y i is drawn from U ([0 , C denotes the event of being in Case 1, thenby standard Chernoff bounds we have that Pr [ C ] = Pr " t X i =1 Z i ≤ d = Pr " t X i =1 Z i ≤ E " t X i =1 Z i − (cid:18) − dtc (cid:19)(cid:19) ≤ exp (cid:18) − (1 − d/ ( tc )) tc (cid:19) = 2 − Ω( d ) , where we used that t > d/c . Therefore, the probability that ( P ti =1 X i ) mod 1 ∈ [0 , /
2] is 1 − − Ω( d ) .Moreover, the probability of accepting the sample is exactly 1 / Y i ’s. Tosee this, let C , C = C be the events of Case 1 and Case 2 being true respectively, and A be theevent of accepting the sample. For Case 1, we have Pr [ A | C ] = 1 /
2. In Case 2, we know that atleast one element is drawn from U ([0 , P ti =1 X i ) mod 1 is going to beuniform in [0 , Pr [ A | C ] = 1 /
2. Therefore, Pr [ C | A ] = Pr [ A | C ] Pr [ C ] / Pr [ A ] = Pr [ C ]and Pr [ C | A ] = Pr [ A | C ] Pr [ C ] / Pr [ A ] = Pr [ C ], i.e., accepting is independent of C , C , andthus independent of the sample itself. This means that the output X , . . . , X t remains Gaussian.For the d -wise independence of the variables X , . . . , X t , let I be an arbitrary set of at most d indices from { , . . . , t } . We claim that { X i } i ∈I are independent. Case 1 is trivial, since we acceptindependently of the values of the Y i ’s. For Case 2, note that in that case there are more than d Y i ’s drawn from U ([0 , j
6∈ I such that Y j is uniform andforces the ( P ti =1 X i ) to be uniform in [0 , P ti =1 X i ) ∈ [0 , /
2] is independent of { Y i } i ∈I , and therefore { X i } i ∈I is a set of independent random variables. Proof of Lemma 3.3.
Consider the random variable X = P ti =1 X i / √ t for the X i ’s of Lemma 3.4.For (i), note that the d -th moment involves the expectation of at most d of the X i ’s, which areindependent. Note that (ii) holds because the distribution of X puts almost all of its mass on halfof the real line, and (iii) follows from our scaling of 1 / √ t . An intersection of k halfspaces on R n is any function f : R n → {± } such that there exist k LTFs h i : R n → {± } , i ∈ [ k ], such that f ( x ) = 1 if and only if h i ( x ) = 1 for all i ∈ [ k ].The L -regression algorithm [KKMS08] is known to be an agnostic learner for intersection of k halfspaces on R n under Gaussian marginals with complexity n O ((log k ) /ǫ ) . This upper bounduses the known tight upper bound of O ( √ ǫ log k ) on the Gaussian noise sensitivity of this conceptclass [KOS08], which implies an upper bound of O (log k/ǫ ) on the L polynomial ǫ -degree. Thisdegree upper bound is not known to be optimal (in fact, it is provably suboptimal for k = 1) andit is a plausible conjecture that the right answer is Θ( √ log k/ǫ ). Here we prove a lower bound of˜Ω( √ log k/ǫ ), which applies even for k -dimensional functions. Theorem 3.5 (Degree Lower Bound for Intersections of Halfspaces) . There exists an intersectionof k halfspaces f on R k such that the following holds: Any degree- d polynomial p : R k → R thatsatisfies k f − p k < ǫ must have d = ˜Ω( √ log k/ǫ ) . m = k and f being the function fromTheorem 3.5, implies that any SQ algorithm that agnostically learns intersections of k halfspaceson R n under the Gaussian distribution must have complexity at least n ˜Ω( √ log k/ǫ ) .To prove Theorem 3.5, we make essential use of our structural result, Theorem 1.5, combinedwith the following tight lower bound on the Gaussian noise sensitivity of a well-chosen family ofintersection of halfspaces (see Appendix B.2 for the proof). Lemma 3.6.
There exists an intersection of k halfspaces on R k , f : R k → {± } , such that GNS ǫ ( f ) = Ω( √ ǫ log k ) . We require the following proposition.
Proposition 3.7.
Let p ( θ ) be a degree- d polynomial on the circle, i.e., a degree at most d polynomialin sin θ and cos θ , and let B ( θ ) be a Boolean-valued function that is periodic modulo π . Then, for t being a sufficiently small multiple of log d/d , it holds π Z π | p ( θ ) − B ( θ ) | d θ = ˜Ω(1 / log d ) Pr φ ∼U ([0 , π ]) [ B ( φ − t ) = B ( φ + t )] . Proof.
We can assume that π R π | p ( θ ) | d θ is at most 2, since otherwise the π R π | p ( θ ) − B ( θ ) | d θ is at least 1. Let k be an odd integer proportional to log d . We start with the following technicalclaim. Claim 3.8.
For any θ ∈ [0 , π ] , it holds | p ( k ) ( θ ) | = O ( d ) k .Proof. Using cos θ = (cid:0) e iθ + e − iθ (cid:1) / θ = (cid:0) e iθ − e − iθ (cid:1) /
2, we write p ( θ ) = P ∞ n = −∞ a n e niθ , forsome coefficients a n , where a n = π R π p ( φ ) e − niθ d φ . Since p has degree at most d , it holds that a n = 0, for all n > d and n < − d . Therefore, we have that p ( θ ) = P dn = − d π R π p ( φ ) e ni ( θ − φ ) d φ .Taking the k -th derivative (using Leibniz’s rule) gives p ( k ) ( θ ) = d X n = − d π Z π p ( φ )( ni ) k e ni ( θ − φ ) d φ . This implies that | p ( k ) ( θ ) | ≤ d X n = − d π Z π | p ( φ ) | n k d φ ≤ d X n = − d n k = O ( d k +1 ) . Moreover, k is proportional to log d , thus | p ( k ) ( θ ) | = O ( d ) k , for all θ ∈ [0 , π ].We next pick t to be a small multiple of log d/d and φ ∈ [0 , π ]. Let z m = t cos( πm/k ) + φ ,for m = 0 , , . . . , k , and let q ( z ) = P kj =0 c j z j be the unique degree- k polynomial such that q ( z m ) = p ( z m ), for m = 0 , , . . . , k . Observe that q − p has k + 1 zeroes. Therefore, iterating Rolle’stheorem, we obtain that there is a point φ − t ≤ z ≤ φ + t such that p ( k ) ( z ) = q ( k ) ( z ), and thus | q ( k ) ( z ) | = O ( d ) k , or equivalently c k = 2 k O ( d/k ) k .Let R ( θ ) = q ( t cos θ + φ ). For some constants b n (which depend on t and φ ), we have that R ( θ ) = P kn = − k b n e niθ . Since R ( θ ) is an even function, its Fourier coefficients are real numbers.The following claim provides an upper bound on the coefficient b k .15 laim 3.9. It holds that | b k | ≤ / (4 k ) .Proof. Note that b k = (1 / π ) R π R ( θ ) e − kiθ d θ . Using the orthogonality of the trigonometric poly-nomials, only terms containing cos( kθ ) are non-zero. Moreover, cos k θ = P kj =0 u j cos( jθ ) with u k = 2 − k +1 , which can be verified using the identity cos θ = ( e iθ + e − iθ ) /
2. Therefore, we have that b k = 12 π Z π R ( θ ) e − kiθ d θ = 12 π Z π c k u k t k cos( kθ ) e − kiθ d θ = c k u k t k π π = (cid:18) t (cid:19) k c k , where we used that R ( θ ) = P kj =0 c j ( t cos θ + φ ) j . Since c k = 2 k O ( d/k ) k , we have that b k = O ( td/k ) k ;this is at most 1 / (4 k ), if t is a small enough multiple of log d/d .On the other hand, by doing a filtering using the (2 k )-th roots of unity, we get that P k − m =0 R (2 πm/ (2 k )) =2 kb k , and this is equivalent to P km = − k +1 q ( t cos( πm/k ) + φ )( − m = 2 kb k . Therefore, b k = 12 k k X m = − k +1 q ( t cos( πm/k ) + φ )( − m = 12 k k X m = − k +1 p ( z | m | )( − m = 12 k (cid:16) k X m = − k +1 ( p ( z | m | ) − B ( z | m | ))( − m + k X m = − k +1 B ( z | m | )( − m + ( B ( φ + t ) − B ( φ − t )) (cid:17) . Since k − B is Boolean, 2 P k − m =1 B ( z m )( − m is a multiple of 4. If B ( φ + t ) = B ( φ − t ),the reverse triangle inequality gives (cid:12)(cid:12)(cid:12) B ( φ + t ) − B ( φ − t ) + 2 P k − m =1 B ( z m )( − m (cid:12)(cid:12)(cid:12) ≥
2. Therefore,in this case, we have that k > | b k | ≥ k (cid:16) − P km = − k +1 | p ( z | m | ) − B ( z | m | ) | (cid:17) , or in other words, k X m = − k +1 | p ( z | m | ) − B ( z | m | ) | ≥ { B ( φ + t ) = B ( φ − t ) } . Integrating this over φ from 0 to 2 π gives Z π | p ( θ ) − B ( θ ) | dθ ≥ πk Pr φ ∼U ([0 , π ]) [ B ( φ − t ) = B ( φ + t )] . The result follows from our assumption that k is proportional to log d .Using Proposition 3.7, we can prove the main theorem of this section. Proof of Theorem 1.5.
The latter statement follows from the fact that E x ∼N n [ | f ( x ) − p ( x ) | ] ≥ E x ∼N n [ | f ( x ) − sign( p ( x )) | / ǫ ( f ) − GNS ǫ (sign( p )) = Pr ( x , y ) ∼N ǫd [ f ( x ) = f ( y )] − Pr ( x , y ) ∼N ǫd [sign( p ( x )) = sign( p ( y ))] ≤ Pr x ∼N n [ f ( x ) = sign( p ( x ))] + Pr x ∼N n [ f ( y ) = sign( p ( y ))] = 2 E x ∼N n [ | f ( x ) − sign( p ( x )) | . Combining these, we find that E x ∼N n [ | f ( x ) − p ( x ) | ] ≥ GNS ǫ ( f ) / − GNS ǫ (sign( p )) /
4. The resultthen follows from noting that sign( p ) is a degree- d PTF, and therefore by [Kan10] it holds thatGNS ǫ (sign( p )) = O ( d √ ǫ ). 16or the first statement, let y and z be independent Gaussians and let x ( φ ) = cos φ y + sin φ z .Let a be a sufficiently small multiple of log d/d . For any φ ∈ [0 , π ], x ( φ − a ) and x ( φ + a ) are(1 − δ )-correlated Gaussian random variables, where δ = Θ(log d/d ) . We have that E x ∼N n [ | f ( x ) − p ( x ) | ] = E φ ∈U ([0 , π ]) (cid:20) E y , z ∼N n [ | f ( x ( φ )) − p ( x ( φ )) | ] (cid:21) = E y , z ∼N n (cid:20) E φ ∈U ([0 , π ]) [ | f ( x ( φ )) − p ( x ( φ )) | ] (cid:21) ≥ Ω(1 / log( d )) E y , z ∼N n [ Pr φ ∈U ([0 , π ]) [ f ( x ( φ − a )) = f ( x ( φ + a ))]] , where in the inequality we used Proposition 3.7. Moreover, using Fubini’s theorem, we have E x ∼N n [ | f ( x ) − p ( x ) | ] ≥ Ω(1 / log( d )) E φ ∈U ([0 , π ]) (cid:20) Pr y , z ∼N n [ f ( x ( φ − a )) = f ( x ( φ + a ))] (cid:21) = Ω(1 / log( d )) E φ ∈U ([0 , π ]) [GNS δ ( f )] = Ω(1 / log( d ))GNS δ ( f ) = Ω(1 / log( d ))GNS (log( d ) /d ) ( f ) . To prove our CSQ lower bound, we need to find a hard function g : R m → R that is uncorrelatedwith low-degree polynomials and, at the same time, is close to f in the L -sense. Instead of usingduality to establish the existence of such a function g , we let g be the orthogonal component of thetruncated Hermite expansion of f . Proof of Theorem 1.6.
Let an algorithm A that agnostically learns C up to L -error ǫ . Let g ( x ) = f ( x ) − P d − i =0 f [ i ] ( x ), i.e., g is the same as the function f without the low-degree moments up to d −
1. Note that k g k ≥ ǫ . Let C = 2 / ( ǫ k g k ) and let S be the set of nearly orthogonal matricesof Lemma 2.5. Consider the class C g that consists of all functions from R n to R of the form G V ( x ) = Cg ( Vx ), for any matrix V ∈ S . Every G V ∈ C g is orthogonal to all polynomials ofdegree less than d , and also k G V k = 2 /ǫ . We feed A with samples ( x , G V ( x )), where x ∼ N n , V ∈ S . Let ǫ ′ > A . Then, A returns a hypothesis h satisfying r E x ∼N n [( h ( x ) − G V ( x )) ] ≤ OPT + ǫ ′ . (2)For our choice of C , the optimal error becomesOPT ≤ r E x ∼N n [( f ( x ) − G V ( x )) ] = r C k g k − C E x ∼N m [ f ( x ) g ( x )] ≤ r ǫ − k g k ǫ ≤ r ǫ − ≤ ǫ r − ǫ ≤ ǫ − ǫ , where in the second inequality we used that 2 E x ∼N m [ f ( x ) g ( x )] ≥ k g k . By choosing ǫ ′ = ǫ/ k h − G V k ≤ /ǫ − ǫ/ C g . For any two different U , V ∈ S , we have that E x ∼N n [ G U ( x ) G V ( x )] ≤ C ∞ X t =0 k UV ⊺ k t E x ∼N m [( g [ t ] ( x )) ] ≤ C k UV ⊺ k d ∞ X t = d E x ∼N m [( g [ t ] ( x )) ] ≤ ǫ − k UV ⊺ k dF ≤ ǫ − n − Ω( d ) ≤ n − Ω( d ) , g isuncorrelated with all polynomials of degree less than d , the third inequality follows from Parseval’sidentity and the fact that k g k C = 2 /ǫ , the next one follows from Lemma 2.5, and the last onefrom our assumption ǫ > n − c for an appropriate constant c . As a note, we extend our class C g to include the identically zero function, which does not increase the pairwise correlations. UsingLemma A.3 with γ ′ = γ , we have that CSDA N n ( C g , γ ) = 2 n Ω(1) for γ = n − Ω( d ) . An application ofLemma A.4 with η = 2 /ǫ concludes the proof. To prove lower bounds for the general SQ model, we require our hard function g to be pointwisebounded. This allows us to define a learning problem with Boolean labels, for which we have SQlower bounds ready to be used. Because of our L ∞ constraint on g , the resulting lower bound isexpressed in terms of the degrees of polynomials that approximate f in L rather than L sense.Our duality argument will now use the pair of dual norms L , L ∞ . Proposition 4.1.
Proposition 4.1. Let $f \in L_1(\mathbb{R}^m)$ be such that for every polynomial $p : \mathbb{R}^m \to \mathbb{R}$ of degree at most $d-1$ it holds $\|f - p\|_1 \geq \epsilon$. Then there exists a function $g : \mathbb{R}^m \to [-1, 1]$ such that:

1. $\mathbf{E}_{x \sim \mathcal{N}_m}[f(x) g(x)] \geq \epsilon$, and
2. $\mathbf{E}_{x \sim \mathcal{N}_m}[P(x) g(x)] = 0$ for every polynomial $P : \mathbb{R}^m \to \mathbb{R}$ of degree less than $d$.

Proof. The function $g$ is a solution to the infinite linear system
\[
(*) \quad \mathbf{E}_{x \sim \mathcal{N}_m}[f(x) g(x)] \geq \epsilon, \qquad \mathbf{E}_{x \sim \mathcal{N}_m}[P(x) g(x)] = 0 \ \ \forall P \in \mathcal{P}^m_{d-1}, \qquad \|g\|_\infty \leq 1,
\]
which we rewrite in the equivalent form
\[
(**) \quad -\mathbf{E}_{x \sim \mathcal{N}_m}[f(x) g(x)] + \epsilon \leq 0, \qquad \mathbf{E}_{x \sim \mathcal{N}_m}[P(x) g(x)] = 0 \ \ \forall P \in \mathcal{P}^m_{d-1}, \qquad \mathbf{E}_{x \sim \mathcal{N}_m}[g(x) h(x)] \leq \|h\|_1 \ \ \forall h \in L_1(\mathbb{R}^m).
\]
From Corollary C.2, the above system is feasible unless the following system is feasible:
\[
(**') \quad \|h\|_1 - \lambda \epsilon < 0, \qquad h(x) + P(x) - \lambda f(x) = 0 \ \ \forall x \in \mathbb{R}^m, \qquad \lambda \geq 0,\ h \in L_1(\mathbb{R}^m),\ P \in \mathcal{P}^m_{d-1}.
\]
Let $(h, P, \lambda)$ be a solution to $(**')$. Note that we can assume that $\lambda = 1$, since all constraints are homogeneous (and $\lambda = 0$ would force $\|h\|_1 < 0$). Then the constraints become $h = f - P$ and $\|f - P\|_1 < \epsilon$, which is a contradiction. Therefore, the original system $(*)$ is feasible.

We conclude with the proof of the main theorem of this section.

Proof of Theorem 1.7. Suppose that we have such an agnostic learner $A$.
Let $g : \mathbb{R}^m \to [-1, 1]$ be the function given by Proposition 4.1, and let $\mathcal{D}_g$ be the family of distributions over $\mathbb{R}^n \times \{\pm 1\}$ from Definition 2.7. We use $A$ to solve the problem of distinguishing between a distribution from $\mathcal{D}_g$ and the distribution where the labels are drawn uniformly at random. That is, we convert $A$ into an algorithm for the decision problem $\mathcal{B}(\mathcal{D}_g, \mathcal{N}_n \times U(\{\pm 1\}))$, and the hardness result will follow from the hardness of that decision problem, as established by Proposition 2.8.

Let $D'$ be a distribution that is either $D' = \mathcal{N}_n \times U(\{\pm 1\})$ or $D' \in \mathcal{D}_g$. We feed $A$ a set of i.i.d. samples of the form $(x, Cy)$, where $(x, y) \sim D'$ and $C = 1/\mathbf{E}_{x \sim \mathcal{N}_m}[f(x) g(x)]$. Let $\epsilon' > 0$ be the accuracy to which we run $A$, and let $h$ be the returned hypothesis. We have that
\[
\sqrt{\mathbf{E}_{(x,y) \sim D'}[(h(x) - Cy)^2]} \leq \mathrm{OPT} + \epsilon'. \tag{3}
\]
If $D' \in \mathcal{D}_g$, for the optimal error we have that
\[
\mathrm{OPT} \leq \sqrt{\mathbf{E}_{(x,y) \sim D'}[(f(x) - Cy)^2]} = \sqrt{\|f\|_2^2 + C^2 - 2C\, \mathbf{E}_{x \sim \mathcal{N}_m}[f(x) g(x)]} \leq \sqrt{C^2 - 1} = C\sqrt{1 - 1/C^2} \leq C - 1/(2C).
\]
If we choose $\epsilon' = 1/(4C)$, Equation (3) becomes $\sqrt{\mathbf{E}_{(x,y) \sim D'}[(h(x) - Cy)^2]} \leq C - 1/(4C)$. On the other hand, we can write $\sqrt{\mathbf{E}_{(x,y) \sim D'}[(h(x) - Cy)^2]} \geq \sqrt{C^2 - 2C\, \mathbf{E}_{(x,y) \sim D'}[h(x) y]}$. Combining these two, we obtain $2C\, \mathbf{E}_{(x,y) \sim D'}[h(x) y] \geq C^2 - (C - 1/(4C))^2 \geq 1/3$, which gives that $\mathbf{E}_{(x,y) \sim D'}[h(x) y] \geq 1/(6C) = \mathbf{E}_{x \sim \mathcal{N}_m}[f(x) g(x)]/6 \geq \epsilon/6$.

If $D' = \mathcal{N}_n \times U(\{\pm 1\})$, then $\mathbf{E}_{(x,y) \sim D'}[h(x) y] = 0$. Therefore, by performing a query of tolerance $\Omega(\epsilon)$ for the correlation of $h$ with the labels, we can distinguish between the two cases of our hypothesis testing problem. By Proposition 2.8, this requires either $2^{n^{\Omega(1)}}$ queries or queries of tolerance $n^{-\Omega(d)}$.
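The distinguishing step at the end of this proof is conceptually simple; the following schematic sketch (ours) illustrates it in one dimension with stand-in choices of $g$ and $h$, neither of which is a hard instance from the paper: under a planted distribution the correlation estimate is bounded away from zero, while under uniform labels it concentrates around zero.

```python
# Schematic illustration (ours) of the distinguishing step in one dimension.
# The functions g (bounded, determining E[y | x]) and h (the learner's output)
# are stand-ins, not the hard instances constructed in the paper.
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

def noisy_labels(x, g):
    # Draw y in {-1, +1} with E[y | x] = g(x), as in the planted distributions D_g.
    return np.where(rng.random(x.shape) < (1 + g(x)) / 2, 1, -1)

g = np.tanh          # stand-in for the bounded hard function g of Proposition 4.1
h = np.sign          # stand-in for the hypothesis returned by the learner

x = rng.standard_normal(n)
corr_planted = np.mean(h(x) * noisy_labels(x, g))       # distribution from D_g
corr_null = np.mean(h(x) * rng.choice([-1, 1], n))      # uniform random labels
print(corr_planted, corr_null)   # bounded away from 0 vs concentrated around 0
```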
The class of Rectified Linear Unit (ReLU) functions consists of all functions of the form $\mathrm{ReLU}(\langle w, x \rangle)$, where $w \in \mathbb{R}^n$ is any vector with $\|w\|_2 = 1$ and $\mathrm{ReLU} : \mathbb{R} \to \mathbb{R}$ is defined as $\mathrm{ReLU}(t) = \max\{0, t\}$. Upper and lower bounds for agnostically learning ReLUs were given in [GGK20, DKZ20]. [DKZ20] established an SQ lower bound of $n^{\Omega(1/\epsilon^c)}$, for some constant $c > 0$. This constant $c$ was not explicitly calculated in [DKZ20], but can be shown to be approximately $1/40$.
[GGK20] gave an SQ lower bound of $n^{\Omega(1/\epsilon^{4/3})}$ for this problem. We note that [GGK20] considered a correlational type of guarantee, i.e., finding a hypothesis whose correlation with the labels is within $\epsilon$ of the optimal, as opposed to $L_2$-error. For this correlational guarantee, the upper bound of [GGK20] is an $L_2$-regression algorithm with complexity $n^{O(\epsilon^{-4/3})}$, and the lower bound states that any SQ algorithm needs to perform queries of tolerance $\tau < n^{-\Omega(\epsilon^{-4/3})}$ or at least $2^{n^{\Omega(1)}}$ queries. Furthermore, [GGK20] showed that any agnostic learner with the square-loss guarantee can be run with increased accuracy to satisfy the correlational guarantee. This reduction costs a "third root" in the exponent, yielding an $n^{\Omega(\epsilon^{-4/9})}$ SQ lower bound for the square-loss guarantee. As a note, [GGK20] assumes bounded labels. In this setting, agnostically learning within $L_2$-error $\mathrm{OPT} + \epsilon$ is equivalent to agnostically learning within squared $L_2$-error $\mathrm{OPT}^2 + \epsilon'$, for $\epsilon' = \Theta(\epsilon)$.

Given the context of prior work, we can now present our results. To apply our generic lower bound theorems, we bound from below the degree of any polynomial that $\epsilon$-approximates the univariate ReLU function. This can be done by appealing to a known powerful theorem from the approximation theory literature due to Ganzburg [Gan02, GR08]. This result can be used to derive tight polynomial degree lower bounds for the ReLU function and the sign function (see Appendix B.1). Let $A_\sigma(f)_p = \inf_{g \in B_\sigma} \|f - g\|_p$, where $B_\sigma$, $\sigma > 0$, is the class of entire functions of exponential type $\sigma$, i.e., the class consisting of every entire function $g$ such that for every $\epsilon > 0$ there exists a constant $C$ for which $|g(z)| \leq C e^{\sigma(1+\epsilon)|z|}$ for all $z \in \mathbb{C}$.
Fact 5.1. For any function $f : \mathbb{R} \to \mathbb{R}$ of polynomial growth,
\[
\lim_{n \to \infty} \left(\frac{b_n}{\sigma}\right)^{1/p} \inf_{q \in \mathcal{P}_n} \left\| f\!\left(\frac{b_n}{\sigma}\, x\right) - q(x) \right\|_p = A_\sigma(f)_p,
\]
where $b_n = 2\sqrt{n}$, $p \in [1, \infty)$, and $A_\sigma(f)_p$ is the error of the best approximation of $f$ by entire functions of exponential type $\sigma$ in $L_p(\mathbb{R})$.

As an immediate corollary, we obtain:
Corollary 5.2.
Let $f : \mathbb{R} \to \mathbb{R}$ be the ReLU function $\mathrm{ReLU}(t) = \max\{0, t\}$ and let $p \in [1, 2]$. The minimum integer $d$ for which there exists a degree-$d$ polynomial $P : \mathbb{R} \to \mathbb{R}$ such that $\|\mathrm{ReLU} - P\|_p \leq \epsilon$ is $d = \Theta\big(\epsilon^{-2p/(p+1)}\big)$.

Indeed, since the ReLU is positively homogeneous, $\mathrm{ReLU}((b_n/\sigma)\, x) = (b_n/\sigma)\, \mathrm{ReLU}(x)$, so Fact 5.1 gives $\inf_{q \in \mathcal{P}_n} \|\mathrm{ReLU} - q\|_p = \Theta\big(n^{-(p+1)/(2p)}\big)$, and inverting this relation yields the claim. Therefore, Theorems 1.6 and 1.7 imply a complexity of at least $n^{\Omega(\epsilon^{-4/3})}$ for any agnostic CSQ learner and $n^{\Omega(\epsilon^{-1})}$ for any agnostic SQ learner, respectively.
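The $L_2$ case of this degree bound can be checked numerically. The sketch below (ours) computes the Hermite coefficients of the ReLU, reads off the best degree-$d$ approximation error $\epsilon(d)$ from the coefficient tail, and verifies that $\epsilon(d)^{-4/3}$ grows linearly in $d$; grid sizes and degree ranges are arbitrary choices.

```python
# Numerical check (ours) of the p = 2 case of Corollary 5.2: the best degree-d
# L2-approximation error eps(d) of the ReLU under N(0,1) should satisfy
# d = Theta(eps^{-4/3}), i.e., eps(d)^{-4/3}/d should be roughly constant.
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

D = 128
x, w = hermegauss(4 * D)
w = w / np.sqrt(2 * np.pi)
f = np.maximum(x, 0.0)                       # the ReLU on the quadrature nodes

H = [np.ones_like(x), x.copy()]              # normalized Hermite recurrence
for i in range(1, D):
    H.append((x * H[i] - np.sqrt(i) * H[i - 1]) / np.sqrt(i + 1))

fhat = np.array([w @ (f * H[i]) for i in range(D + 1)])
tail = np.sqrt(np.cumsum((fhat ** 2)[::-1])[::-1])   # tail[d] ~ error of degree d-1

for deg in (8, 16, 32, 64):
    eps = tail[deg]
    print(deg, eps, eps ** (-4.0 / 3.0) / deg)       # last column roughly constant
```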
We now let $f$ be the standard sigmoid function, defined as $f(t) = 1/(1 + e^{-t})$, $t \in \mathbb{R}$. We first focus on bounding the degree of polynomials that approximate $f$ in $L_2$-norm. This can be done via Hermite analysis, in particular based on the fact that the degree-$d$ polynomial closest to $f$ in $L_2$-norm is the truncated Hermite expansion $p_d(t) = \sum_{i=0}^{d} \hat{f}(i) H_i(t)$. The error of this approximation is $\|p_d - f\|_2^2 = \sum_{i=d+1}^{\infty} \hat{f}(i)^2$. For the asymptotic behavior of the Hermite coefficients, we use the following fact (see [GGJ+20] and the references therein).
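Fact 5.3 below gives the asymptotics of these coefficients; they are also easy to compute numerically. The sketch below (ours) estimates $\hat f(i)$ by Gauss-Hermite quadrature and prints $-\log|\hat f(2i-1)|/\sqrt{i}$, which should be roughly constant, consistent with the $e^{-\Theta(\sqrt{i})}$ decay.

```python
# Numerical illustration (ours) of Fact 5.3 below: the odd Hermite coefficients
# of the sigmoid decay like exp(-Theta(sqrt(i))), so -log|fhat(2i-1)| / sqrt(i)
# should be roughly constant. Grid sizes are arbitrary.
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

D = 40
x, w = hermegauss(6 * D)
w = w / np.sqrt(2 * np.pi)
f = 1.0 / (1.0 + np.exp(-x))                 # the standard sigmoid

H = [np.ones_like(x), x.copy()]
for i in range(1, D):
    H.append((x * H[i] - np.sqrt(i) * H[i - 1]) / np.sqrt(i + 1))

fhat = np.array([w @ (f * H[i]) for i in range(D + 1)])
for i in range(1, 11):
    c = abs(fhat[2 * i - 1])
    print(2 * i - 1, c, -np.log(c) / np.sqrt(i))     # last column roughly constant
```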
Fact 5.3 (Lemma A.9 from [GGJ+20]). Let $f : \mathbb{R} \to \mathbb{R}$ be the standard sigmoid function $f(t) = 1/(1 + e^{-t})$ and let $\hat{f}(i)$ denote its Hermite coefficients for $i \in \mathbb{Z}_+$. Then $\hat{f}(0) = 0.5$, and for $i \geq 1$, $\hat{f}(2i) = 0$ and $\hat{f}(2i -
1) = e^{-\Theta(\sqrt{i})}$.

From this fact, we get a bound on the $L_2$-error of the best polynomial of degree $d$.

Corollary 5.4 ($L_2$-Degree Lower Bound for Sigmoid). Let $f : \mathbb{R} \to \mathbb{R}$ be the standard sigmoid function $f(t) = 1/(1 + e^{-t})$ and let $d$ be the smallest integer for which there exists a degree-$d$ polynomial $p : \mathbb{R} \to \mathbb{R}$ such that $\|f - p\|_2 < \epsilon$. Then $d = \Theta(\log^2(1/\epsilon))$.

Proof. Fix a degree $k$. From Fact 5.3, the best degree-$k$ polynomial $p_k$ achieves error
\[
\|f - p_k\|_2^2 = \sum_{i=k+1}^{\infty} \hat{f}(i)^2 = \sum_{i > k,\ i \text{ odd}} e^{-\Theta(\sqrt{i})} = \sqrt{k}\, e^{-\Theta(\sqrt{k})}.
\]
This quantity becomes $\epsilon^2$ when $k = \Theta(\log^2(1/\epsilon))$.

By Theorem 1.6, we get that any CSQ agnostic learner for sigmoids has complexity $n^{\Omega(\log^2(1/\epsilon))}$.

Our approach to deriving lower bounds on the degree of $L_1$-approximating polynomials will be to relate the $L_1$-norm to the $L_2$-norm and use the lower bounds for the latter. In particular, we will use the following fact about polynomials under the Gaussian measure.

Theorem 5.5 (Hypercontractivity [Bog98, Nel73]). If $p$ is a degree-$d$ polynomial and $t \geq 2$, then $\|p\|_t \leq (t-1)^{d/2} \|p\|_2$.
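A quick Monte Carlo check (ours) of the $t = 4$ case used below: for random polynomials of small degree, the ratio $\|p\|_4 / \|p\|_2$ indeed stays below $3^{d/2}$. The sample size and the degrees are arbitrary.

```python
# Monte Carlo sanity check (ours) of Theorem 5.5 with t = 4: for random
# degree-d polynomials p under the Gaussian, ||p||_4 <= 3^{d/2} ||p||_2.
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(2_000_000)

for d in (1, 2, 3, 5):
    p = np.polyval(rng.standard_normal(d + 1), x)    # a random degree-d polynomial
    ratio = np.mean(p ** 4) ** 0.25 / np.mean(p ** 2) ** 0.5
    print(d, round(ratio, 3), round(3 ** (d / 2), 3))  # ratio stays below 3^{d/2}
```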
Claim 5.6. Let $r \in L_4(\mathbb{R})$. Then $\|r\|_2 \leq \|r\|_1^{1/3} \|r\|_4^{2/3}$.

Proof. The proof follows from two applications of the Cauchy-Schwarz inequality:
\[
\mathbf{E}_{t \sim \mathcal{N}_1}[r(t)^2] \leq \mathbf{E}_{t \sim \mathcal{N}_1}[|r(t)|]^{1/2}\, \mathbf{E}_{t \sim \mathcal{N}_1}\big[|r(t)|^3\big]^{1/2} \leq \mathbf{E}_{t \sim \mathcal{N}_1}[|r(t)|]^{1/2}\, \mathbf{E}_{t \sim \mathcal{N}_1}\big[r(t)^2\big]^{1/4}\, \mathbf{E}_{t \sim \mathcal{N}_1}\big[r(t)^4\big]^{1/4}.
\]
Rearranging the above yields the claimed inequality.

We can now show our $L_1$ polynomial degree lower bound.

Theorem 5.7 ($L_1$-Degree Lower Bound for Sigmoid). Let $f : \mathbb{R} \to \mathbb{R}$ be the standard sigmoid function $f(t) = 1/(1 + e^{-t})$ and let $0 < \epsilon < 1$. Any degree-$d$ polynomial $p : \mathbb{R} \to \mathbb{R}$ that satisfies $\|f - p\|_1 < \epsilon$ must have $d = \Omega(\log(1/\epsilon))$.

Proof. Let $p : \mathbb{R} \to \mathbb{R}$ be a degree-$d$ polynomial such that $\|f - p\|_1 < \epsilon$. Using Theorem 5.5 with $t = 4$ and then Claim 5.6 with $r = p$, we get that
\[
\|p\|_2 \leq \|p\|_1^{1/3} \|p\|_4^{2/3} \leq \|p\|_1^{1/3}\, 3^{d/3}\, \|p\|_2^{2/3}.
\]
After dividing both sides by $\|p\|_2^{2/3}$, we have that $\|p\|_2 \leq 3^d \|p\|_1$. Furthermore, using the triangle inequality, $\|p\|_1 \leq \epsilon + \|f\|_1 = O(1)$. Therefore, $\|p\|_2 = O(3^d)$, and hence $\|f - p\|_4 \leq \|f\|_4 + \|p\|_4 \leq O(1) + 3^{d/2} \|p\|_2 = 2^{O(d)}$. Applying Claim 5.6 to $r = f - p$ then gives
\[
\|f - p\|_2 \leq \|f - p\|_1^{1/3} \|f - p\|_4^{2/3} \leq \epsilon^{1/3}\, 2^{O(d)}.
\]
On the other hand, for the $L_2$-error we have that $\|f - p\|_2^2 \geq \sqrt{d}\, e^{-\Theta(\sqrt{d})}$ (Corollary 5.4). Combining the two bounds, it follows that $d = \Omega(\log(1/\epsilon))$.

We note that [GGK20] showed an $n^{\Omega(\log^2(1/\epsilon))}$ SQ lower bound for the correlational guarantee.

References

[ABL17] P. Awasthi, M. F. Balcan, and P. M. Long. The power of localization for efficiently learning linear separators with noise. J. ACM, 63(6):50:1-50:27, 2017.
[Bog98] V. Bogachev. Gaussian Measures. Mathematical Surveys and Monographs, vol. 62. American Mathematical Society, 1998.

[CKL+06] C.-T. Chu, S. K. Kim, Y. A. Lin, Y. Yu, G. Bradski, A. Y. Ng, and K. Olukotun. Map-reduce for machine learning on multicore. In Proceedings of the 19th International Conference on Neural Information Processing Systems, NIPS'06, pages 281-288, Cambridge, MA, USA, 2006. MIT Press.

[Dan15] A. Daniely. A PTAS for agnostically learning halfspaces. In Proceedings of The 28th Conference on Learning Theory, COLT 2015, pages 484-502, 2015.

[Dan16] A. Daniely. Complexity theoretic limitations on learning halfspaces. In Proceedings of the 48th Annual Symposium on Theory of Computing, STOC 2016, pages 105-117, 2016.

[DFT+15] D. Dachman-Soled, V. Feldman, L.-Y. Tan, A. Wan, and K. Wimmer. Approximate resilience, monotonicity, and the complexity of agnostic learning. In Piotr Indyk, editor, Proceedings of the Twenty-Sixth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2015, pages 498-511. SIAM, 2015.

[DGJ+10] I. Diakonikolas, P. Gopalan, R. Jaiswal, R. Servedio, and E. Viola. Bounded independence fools halfspaces. SIAM J. on Comput., 39(8):3441-3462, 2010.

[DGK+20] I. Diakonikolas, S. Goel, S. Karmalkar, A. R. Klivans, and M. Soltanolkotabi. Approximation schemes for ReLU regression. In Jacob D. Abernethy and Shivani Agarwal, editors, Conference on Learning Theory, COLT 2020, volume 125 of Proceedings of Machine Learning Research, pages 1452-1485. PMLR, 2020.

[DHK+10] I. Diakonikolas, P. Harsha, A. Klivans, R. Meka, P. Raghavendra, R. A. Servedio, and L. Y. Tan. Bounding the average sensitivity and noise sensitivity of polynomial threshold functions. In STOC, pages 533-542, 2010.

[DJS+15] I. Diakonikolas, R. Jaiswal, R. A. Servedio, L. Y. Tan, and A. Wan. Noise stable halfspaces are close to very small juntas. Chicago Journal of Theoretical Computer Science, 4:1-13, 2015.

[DKKZ20] I. Diakonikolas, D. M. Kane, V. Kontonis, and N. Zarifis. Algorithms and SQ lower bounds for PAC learning one-hidden-layer ReLU networks. In Conference on Learning Theory, pages 1514-1539. PMLR, 2020.

[DKN10] I. Diakonikolas, D. M. Kane, and J. Nelson. Bounded independence fools degree-2 threshold functions. In FOCS, pages 11-20, 2010.

[DKS17] I. Diakonikolas, D. M. Kane, and A. Stewart. Statistical query lower bounds for robust estimation of high-dimensional Gaussians and Gaussian mixtures. In Proceedings of the 58th IEEE Annual Symposium on Foundations of Computer Science, FOCS 2017, pages 73-84, 2017. Full version at http://arxiv.org/abs/1611.03473.

[DKS18] I. Diakonikolas, D. M. Kane, and A. Stewart. Learning geometric concepts with nasty noise. In Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, STOC 2018, pages 1061-1073, 2018.

[DKTZ20] I. Diakonikolas, V. Kontonis, C. Tzamos, and N. Zarifis. Non-convex SGD learns halfspaces with adversarial label noise. CoRR, abs/2006.06742, 2020.

[DKZ20] I. Diakonikolas, D. Kane, and N. Zarifis. Near-optimal SQ lower bounds for agnostically learning halfspaces and ReLUs under Gaussian marginals. In Hugo Larochelle, Marc'Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin, editors, Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, 2020.

[DRST14] I. Diakonikolas, P. Raghavendra, R. A. Servedio, and L. Y. Tan. Average sensitivity and noise sensitivity of polynomial threshold functions. SIAM J. Comput., 43(1):231-253, 2014.

[Fan68] K. Fan. On infinite systems of linear inequalities. Journal of Mathematical Analysis and Applications, 21(3):475-478, 1968.

[Fel16] V. Feldman. Statistical query learning. In Encyclopedia of Algorithms, pages 2090-2095. 2016.

[Fel17] V. Feldman. A general characterization of the statistical query complexity. In Satyen Kale and Ohad Shamir, editors, Proceedings of the 30th Conference on Learning Theory, COLT 2017, volume 65 of Proceedings of Machine Learning Research, pages 785-830. PMLR, 2017.

[FGKP06] V. Feldman, P. Gopalan, S. Khot, and A. Ponnuswami. New results for learning noisy parities and halfspaces. In Proc. FOCS, pages 563-576, 2006.

[FGR+13] V. Feldman, E. Grigorescu, L. Reyzin, S. Vempala, and Y. Xiao. Statistical algorithms and a lower bound for detecting planted cliques. In Proceedings of STOC'13, pages 655-664, 2013. Full version in Journal of the ACM, 2017.

[FGV17] V. Feldman, C. Guzman, and S. S. Vempala. Statistical query algorithms for mean vector estimation and stochastic convex optimization. In Philip N. Klein, editor, Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA 2017, pages 1265-1277. SIAM, 2017.

[FPV15] V. Feldman, W. Perkins, and S. Vempala. On the complexity of random satisfiability problems with planted solutions. In Proceedings of the Forty-Seventh Annual ACM Symposium on Theory of Computing, STOC 2015, pages 77-86, 2015.

[Gan02] M. I. Ganzburg. Limit theorems for polynomial approximation with Hermite and Freud weights. Approximation Theory X: Abstract and Classical Analysis (C. K. Chui et al., eds.), pages 211-221, 2002.

[GGJ+20] S. Goel, A. Gollakota, Z. Jin, S. Karmalkar, and A. Klivans. Superpolynomial lower bounds for learning one-layer neural networks using gradient descent. In International Conference on Machine Learning, pages 3587-3596. PMLR, 2020.

[GGK20] S. Goel, A. Gollakota, and A. R. Klivans. Statistical-query lower bounds via functional gradients. In Hugo Larochelle, Marc'Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin, editors, Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, 2020.

[GKK19] S. Goel, S. Karmalkar, and A. R. Klivans. Time/accuracy tradeoffs for learning a ReLU with respect to Gaussian marginals. In Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, pages 8582-8591, 2019.

[GKKT17] S. Goel, V. Kanade, A. R. Klivans, and J. Thaler. Reliably learning the ReLU in polynomial time. In Proceedings of the 30th Conference on Learning Theory, COLT 2017, pages 1004-1042, 2017.

[GR06] V. Guruswami and P. Raghavendra. Hardness of learning halfspaces with noise. In Proc. 47th IEEE Symposium on Foundations of Computer Science (FOCS), pages 543-552. IEEE Computer Society, 2006.

[GR08] M. I. Ganzburg and J. Rognes. Limit Theorems of Polynomial Approximation with Exponential Weights. American Mathematical Soc., 2008.

[Hau92] D. Haussler. Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation, 100:78-150, 1992.

[HKM14] P. Harsha, A. R. Klivans, and R. Meka. Bounding the sensitivity of polynomial threshold functions. Theory of Computing, 10:1-26, 2014.

[Kan10] D. M. Kane. The Gaussian surface area and noise sensitivity of degree-d polynomial threshold functions. In CCC, pages 205-210, 2010.

[Kan14] D. M. Kane. The average sensitivity of an intersection of halfspaces. In Symposium on Theory of Computing, STOC 2014, pages 437-440, 2014.

[Kea98] M. J. Kearns. Efficient noise-tolerant learning from statistical queries. Journal of the ACM, 45(6):983-1006, 1998.

[KKMS08] A. Kalai, A. Klivans, Y. Mansour, and R. Servedio. Agnostically learning halfspaces. SIAM Journal on Computing, 37(6):1777-1805, 2008.

[KLS09] A. Klivans, P. Long, and R. Servedio. Learning halfspaces with malicious noise. To appear in Proc. 17th Internat. Colloq. on Algorithms, Languages and Programming (ICALP), 2009.

[KOS08] A. Klivans, R. O'Donnell, and R. Servedio. Learning geometric concepts via Gaussian surface area. In Proc. 49th IEEE Symposium on Foundations of Computer Science (FOCS), pages 541-550, 2008.

[KSS94] M. Kearns, R. Schapire, and L. Sellie. Toward efficient agnostic learning. Machine Learning, 17(2/3):115-141, 1994.

[MR18] P. Manurangsi and D. Reichman. The computational complexity of training ReLU(s). arXiv preprint arXiv:1810.04207, 2018.

[Nel73] E. Nelson. The free Markoff field. Journal of Functional Analysis, 12(2):211-227, 1973.

[O'D14] R. O'Donnell. Analysis of Boolean Functions. Cambridge University Press, 2014.

[Sze67] G. Szegő. Orthogonal Polynomials. Number 23 in American Mathematical Society Colloquium Publications. American Mathematical Society, 1967.

[Szö09] B. Szörényi. Characterizing statistical query learning: Simplified notions and proofs. In Ricard Gavaldà, Gábor Lugosi, Thomas Zeugmann, and Sandra Zilles, editors, Algorithmic Learning Theory, 20th International Conference, ALT 2009, volume 5809 of Lecture Notes in Computer Science, pages 186-200. Springer, 2009.

[Val84] L. G. Valiant. A theory of the learnable. In Proc. 16th Annual ACM Symposium on Theory of Computing (STOC), pages 436-445. ACM Press, 1984.
A Omitted Background
A.1 Correlational Statistical Query (CSQ) Model
For some of our lower bounds in the real-valued setting, we consider correlational or inner-product queries. The CSQ model is a restriction of the SQ model in which the algorithm may choose any bounded query function and obtain an estimate of its correlation with the labels.

Specifically, for $f, h : X \to \mathbb{R}$ and a distribution $D_x$ over the domain $X$, we denote by $\langle f, h \rangle_{D_x}$ the quantity $\mathbf{E}_{x \sim D_x}[f(x) h(x)]$ and refer to it as the correlation of $f$ and $h$ under $D_x$. While it is commonly assumed that the query function $h$ is pointwise bounded, it is in fact sufficient to assume that it has bounded $L_2$-norm. If $D$ is the joint distribution on points and labels, a correlational query takes a function $h$ and a tolerance $\tau > 0$, and outputs a value $v \in [\mathbf{E}_{(x,y) \sim D}[h(x) y] - \tau,\ \mathbf{E}_{(x,y) \sim D}[h(x) y] + \tau]$.
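Such an oracle is straightforward to simulate; the following minimal sketch (ours) implements a CSQ oracle that answers within tolerance $\tau$, with the adversary's freedom modeled as uniform noise. The toy distribution is illustrative.

```python
# Minimal simulation (ours) of a correlational statistical query oracle: given
# a query h, it returns E[h(x) y] up to an adversarial error of magnitude at
# most tau (modeled here as uniform noise). The toy distribution is illustrative.
import numpy as np

rng = np.random.default_rng(2)

def csq_oracle(h, xs, ys, tau):
    true_corr = np.mean(h(xs) * ys)
    return true_corr + rng.uniform(-tau, tau)   # any answer within +-tau is legal

xs = rng.standard_normal(100_000)
ys = np.sign(xs)                                # toy labeled distribution
print(csq_oracle(np.tanh, xs, ys, tau=0.01))
```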
Similarly to the general SQ model, we consider the following notions of statistical dimension.

Definition A.1 (Correlational Statistical Query Dimension). For $\beta, \gamma > 0$, a probability distribution $D_x$ over a domain $X$, and a family $\mathcal{C}$ of functions $f : X \to \mathbb{R}$, let $s$ be the maximum integer for which there exists a finite set of functions $\{f_1, \ldots, f_s\} \subseteq \mathcal{C}$ such that $\mathbf{E}_{x \sim D_x}[f_i(x)^2] \leq \beta$ for all $i \in [s]$, and $|\mathbf{E}_{x \sim D_x}[f_i(x) f_j(x)]| \leq \gamma$ for all $i, j \in [s]$ with $i \neq j$. We define the Correlational Statistical Query Dimension of $\mathcal{C}$ with pairwise correlations $(\gamma, \beta)$ to be $s$, and denote it by $\mathrm{CSD}_{D_x}(\mathcal{C}, \gamma, \beta)$.
Definition A.2 (Average Correlational Statistical Query Dimension). Let $\gamma > 0$, let $D_x$ be a probability distribution over some domain $X$, and let $\mathcal{C}$ be a family of functions $f : X \to \mathbb{R}$. We define the average pairwise correlation of the functions in $\mathcal{C}$ to be $\rho(\mathcal{C}) = \frac{1}{|\mathcal{C}|^2} \sum_{g, r \in \mathcal{C}} |\mathbf{E}_{x \sim D_x}[g(x) r(x)]|$. The Average Correlational Statistical Query Dimension of $\mathcal{C}$ relative to $D_x$ with parameter $\gamma$, denoted by $\mathrm{CSDA}_{D_x}(\mathcal{C}, \gamma)$, is defined to be the largest integer $s$ such that every subset $\mathcal{C}' \subseteq \mathcal{C}$ of size $|\mathcal{C}'| \geq |\mathcal{C}|/s$ satisfies $\rho(\mathcal{C}') \leq \gamma$.

In most cases, it suffices to bound the correlational statistical query dimension, since by a simple calculation this implies a bound on the average correlational statistical query dimension.
Lemma A.3. Let $\mathcal{C}$ be a class of functions and $D_x$ a distribution, and suppose that $\mathrm{CSD}_{D_x}(\mathcal{C}, \gamma, \beta) = d$ for some $\gamma, \beta > 0$. Then, for all $\gamma' > 0$, we have that $\mathrm{CSDA}_{D_x}(\mathcal{C}, \gamma + \gamma') \geq d\gamma'/(\beta - \gamma)$.
20] relates the Average Correlational SQ dimension of aconcept class with the complexity of any CSQ algorithm for the class.
Lemma A.4 (Theorem B.1 of [GGJ+20]). Let $D_x$ be a distribution over a domain $X$ and let $\mathcal{C}$ be a real-valued concept class over $X$ such that $0 \in \mathcal{C}$ and $\|f\|_2 \geq \eta$ for all $f \in \mathcal{C}$ with $f \neq 0$. Suppose that for some $\gamma > 0$ we have $s = \mathrm{CSDA}_{D_x}(\mathcal{C}, \gamma)$. Any CSQ algorithm that outputs a hypothesis $h$ such that $\|h - f\|_2 < \eta$ needs at least $s/2$ queries or queries of tolerance $\sqrt{\gamma}$.
A.2 Preliminaries: Multilinear Algebra

Here we introduce some multilinear algebra notation. An order-$k$ tensor $A$ is an element of a $k$-fold tensor product of subspaces, $A \in V_1 \otimes \cdots \otimes V_k$. We will be working exclusively with subspaces of $\mathbb{R}^d$, so a tensor $A$ can be represented by a sequence of coordinates $A_{i_1, \ldots, i_k}$. The tensor product of an order-$k$ tensor $A$ and an order-$m$ tensor $B$ is the order-$(k+m)$ tensor defined by $(A \otimes B)_{i_1, \ldots, i_k, j_1, \ldots, j_m} = A_{i_1, \ldots, i_k} B_{j_1, \ldots, j_m}$. We also use capital letters for multi-indices, that is, tuples of indices $I = (i_1, \ldots, i_k)$. We denote by $E_i$ the multi-index that has $1$ in its $i$-th coordinate and $0$ elsewhere. For example, the previous tensor product can be denoted $A_I B_J$.

To simplify notation, we also use Einstein's summation convention, where we sum over repeated indices in a product of tensors. For example, if $A \in \mathbb{R}^d \otimes \mathbb{R}^d$, $v \in \mathbb{R}^d$, and $u \in \mathbb{R}^d$, we have $\sum_{i,j=1}^{d} v_i u_j A_{ij} = v_i u_j A_{ij}$. We define the dot product of two tensors of the same order to be $\langle A, B \rangle = A_{i_1, \ldots, i_k} B_{i_1, \ldots, i_k} = A_I B_I$, and we denote the $\ell_2$-norm of a tensor by $\|A\|_2 = \sqrt{\langle A, A \rangle}$. We denote by $A(X)$ a function that maps the tensor $X$ to a tensor $A(X)$. Let $V$ be a vector space and let $A(x) : \mathbb{R}^d \to V^{\otimes k}$ be a tensor-valued function. We denote by $\partial_i A(x)$ the tensor of partial derivatives of $A(x)$: $\nabla A(x) = \partial_i A_J(x)$ is a tensor of order $k+1$ in $V^{\otimes k} \otimes \mathbb{R}^d$. Similarly, we define higher-order derivatives $\nabla^m A(x) = \partial_{i_1} \cdots \partial_{i_m} A_J(x) \in V^{\otimes k} \otimes (\mathbb{R}^d)^{\otimes m}$.
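The Einstein-summation conventions above map directly onto numpy's einsum; a small illustration (ours):

```python
# The Einstein-summation conventions above map directly onto np.einsum (ours).
import numpy as np

dim = 3
A = np.arange(dim * dim, dtype=float).reshape(dim, dim)   # order-2 tensor A_{ij}
v, u = np.ones(dim), np.arange(dim, dtype=float)

print(np.einsum("i,j,ij->", v, u, A))     # v_i u_j A_{ij}, summed over i and j
B = np.einsum("i,j->ij", v, u)            # tensor product: B_{ij} = v_i u_j
print(np.einsum("ij,ij->", A, B))         # dot product of tensors: <A, B> = A_I B_I
```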
A.3 Basics of Hermite Polynomials

We are also going to use the Hermite polynomials, which form an orthonormal system with respect to the Gaussian measure. While one usually considers the probabilists' or physicists' Hermite polynomials, in this work we define the normalized
Hermite polynomial of degree $i$ to be $H_0(x) = 1$, $H_1(x) = x$, $H_2(x) = \frac{x^2 - 1}{\sqrt{2}}, \ldots, H_i(x) = \frac{He_i(x)}{\sqrt{i!}}, \ldots$, where by $He_i(x)$ we denote the probabilists' Hermite polynomial of degree $i$. These normalized Hermite polynomials form a complete orthonormal basis for the one-dimensional version of the inner product space $L_2(\mathbb{R}, \mathcal{N}_1)$. To get an orthonormal basis for $L_2(\mathbb{R}^n, \mathcal{N}_n)$, we use a multi-index $J = (v_1, \ldots, v_n) \in \mathbb{N}^n$ to define the $n$-variate normalized Hermite polynomial $H_J(x) = \prod_{i=1}^{n} H_{v_i}(x_i)$. The total degree of $H_J$ is $|J| = \sum_{i=1}^{n} v_i$. Given a function $f \in L_2(\mathbb{R}^n, \mathcal{N}_n)$, we compute its Hermite coefficients as $\hat{f}(J) = \mathbf{E}_{x \sim \mathcal{N}_n}[f(x) H_J(x)]$ and express it uniquely as $f(x) = \sum_{J \in \mathbb{N}^n} \hat{f}(J) H_J(x)$. For more details on the Gaussian space and Hermite analysis (especially from the theoretical computer science perspective), we refer the reader to [O'D14]. Most of the facts about Hermite polynomials that we use in this work are well-known properties and can be found, for example, in [Sze67].

We denote by $f^{[k]}(x)$ the degree-$k$ part of the Hermite expansion of $f$, $f^{[k]}(x) = \sum_{|J| = k} \hat{f}(J) H_J(x)$. We say that a polynomial $q$ is harmonic of degree $k$ if it is a linear combination of degree-$k$ Hermite polynomials, that is, if $q$ can be written as $q(x) = q^{[k]}(x) = \sum_{J : |J| = k} c_J H_J(x)$.

For a one-dimensional Hermite polynomial, it holds that $H_m'(x) = \sqrt{m}\, H_{m-1}(x)$. Using this, we obtain that for a multivariate Hermite polynomial $H_M(x)$, where $M = (m_1, \ldots, m_n)$, it holds that
\[
\nabla H_M(x) = \big(\sqrt{m_i}\, H_{M - E_i}(x)\big)_{i \in [n]} \in \mathbb{R}^n, \tag{4}
\]
where $E_i = e_i$ is the multi-index that has $1$ in position $i$ and $0$ elsewhere. From this fact and the orthogonality of the Hermite polynomials, we obtain
\[
\mathbf{E}_{x \sim \mathcal{N}_n}[\langle \nabla H_M(x), \nabla H_L(x) \rangle] = |M|\, \delta_{M,L}. \tag{5}
\]
The following fact gives a formula for the inner product of iterated gradients of harmonic polynomials.
Fact A.5. Let $p, q : \mathbb{R}^n \to \mathbb{R}$ be harmonic polynomials of degree $k$. Then
\[
\mathbf{E}_{x \sim \mathcal{N}_n}\big[\langle \nabla^\ell p(x), \nabla^\ell q(x) \rangle\big] = k(k-1)\cdots(k-\ell+1)\, \mathbf{E}_{x \sim \mathcal{N}_n}[p(x) q(x)].
\]
In particular, $\langle \nabla^k p(x), \nabla^k q(x) \rangle = k!\, \mathbf{E}_{x \sim \mathcal{N}_n}[p(x) q(x)]$.

Proof. Write $p(x) = \sum_{M : |M| = k} b_M H_M(x)$ and $q(x) = \sum_{M : |M| = k} c_M H_M(x)$. Since the Hermite polynomials are orthonormal, we obtain $\mathbf{E}_{x \sim \mathcal{N}_n}[p(x) q(x)] = \sum_{M : |M| = k} c_M b_M$. Now, using Equation (4) iteratively, we obtain
\[
\mathbf{E}_{x \sim \mathcal{N}_n}\big[\langle \nabla^\ell H_M(x), \nabla^\ell H_L(x) \rangle\big] = k(k-1)\cdots(k-\ell+1)\, \delta_{M,L}.
\]
Using this equality, we obtain
\[
\mathbf{E}_{x \sim \mathcal{N}_n}\big[\langle \nabla^\ell p(x), \nabla^\ell q(x) \rangle\big] = \sum_{M,L} b_M c_L\, \mathbf{E}_{x \sim \mathcal{N}_n}\big[\langle \nabla^\ell H_M(x), \nabla^\ell H_L(x) \rangle\big] = \sum_{M,L} b_M c_L\, k(k-1)\cdots(k-\ell+1)\, \delta_{M,L} = k(k-1)\cdots(k-\ell+1)\, \mathbf{E}_{x \sim \mathcal{N}_n}[p(x) q(x)].
\]

Observe that for every harmonic polynomial $p(x)$ of degree $k$, $\nabla^k p(x)$ is a symmetric tensor of order $k$. Since the degree of the polynomial is $k$ and we differentiate $k$ times, this tensor no longer depends on $x$. By Fact A.5, this operation (modulo a division by $\sqrt{k!}$) preserves the $L_2$-norm of the harmonic polynomial $p$, that is, $\mathbf{E}_{x \sim \mathcal{N}_n}[p(x)^2] = \|\nabla^k p(x)\|_2^2 / k!$.
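The one-dimensional case of Fact A.5 is easy to verify numerically: for the normalized Hermite polynomial $H_k$, the $k$-th derivative is the constant $\sqrt{k!}$, so $\langle \nabla^k H_k, \nabla^k H_k \rangle = k! = k!\, \mathbf{E}[H_k^2]$. A sketch (ours):

```python
# Numerical check (ours) of the one-dimensional case of Fact A.5: the k-th
# derivative of the normalized Hermite polynomial H_k is the constant sqrt(k!),
# so <grad^k H_k, grad^k H_k> = k! = k! * E[H_k^2].
import math
import numpy as np
from numpy.polynomial import hermite_e as He

k = 6
c = np.zeros(k + 1)
c[k] = 1.0 / math.sqrt(math.factorial(k))   # H_k expressed in the He basis
deriv = He.hermeder(c, m=k)                 # differentiate the series k times
print(deriv, math.sqrt(math.factorial(k)))  # both equal sqrt(k!) ~ 26.83 for k = 6
```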
B Omitted Proofs from Section 3

B.1 Low-Degree Polynomial Approximation to the Sign Function
By selecting $f(t) = \mathrm{sign}(t)$ and $p = 1$ in Fact 5.1, we get that any polynomial that achieves error at most $\epsilon$ with respect to the $L_1$-norm must have degree at least $\Omega(1/\epsilon^2)$.

Corollary B.1.
Let $f : \mathbb{R} \to \{\pm 1\}$ with $f(t) = \mathrm{sign}(t)$. Any polynomial $p : \mathbb{R} \to \mathbb{R}$ satisfying $\|f - p\|_1 \leq \epsilon$ must have degree $d = \Omega(1/\epsilon^2)$.

B.2 Proof of Lemma 3.6
We restate the lemma below.
Lemma B.2.
There exists an intersection of $k$ halfspaces on $\mathbb{R}^k$, $f : \mathbb{R}^k \to \{\pm 1\}$, such that $\mathrm{GNS}_\epsilon(f) = \Omega(\sqrt{\epsilon \log k})$.

Proof. We will exhibit a family of $k$ halfspaces whose intersection has the claimed Gaussian noise sensitivity. In particular, these halfspaces will be orthogonal. For $i \in [k]$, let $f_i : \mathbb{R}^n \to \{\pm 1\}$ with $f_i(x) = \mathrm{sign}(-\langle e_i, x \rangle + \theta)$, where $e_i$ is the vector having $1$ in the $i$-th coordinate and $0$ elsewhere, and $\theta > 0$ is a threshold to be chosen below; that is, $f_i$ is $1$ if and only if the $i$-th coordinate is less than $\theta$.

Fix an index $i \in [k]$.
The Gaussian noise sensitivity of a single halfspace with threshold $\theta$ is $\mathrm{GNS}_\epsilon(f_i) = \Omega(e^{-\theta^2/2} \sqrt{\epsilon})$ (see, e.g., [DJS+15, Lemma 3.4] for a proof). Let $x, y$ be two $(1-\epsilon)$-correlated $n$-dimensional standard Gaussian random vectors. Then the inner products $\langle e_i, x \rangle$ and $\langle e_i, y \rangle$ are $(1-\epsilon)$-correlated univariate Gaussians. Since the Gaussian noise sensitivity of $f_i$ is proportional to the probability that $\langle e_i, x \rangle < \theta < \langle e_i, y \rangle$, we have that
\[
\Pr_{(x,y) \sim \mathcal{N}_n^{1-\epsilon}}[\langle e_i, x \rangle < \theta < \langle e_i, y \rangle] = \Omega(e^{-\theta^2/2} \sqrt{\epsilon}).
\]
Let $\theta$ be the threshold for which $\Pr_{x \sim \mathcal{N}_n}[\langle e_i, x \rangle > \theta] = 1/k$. The standard bound for the Gaussian tail is $\Pr_{x \sim \mathcal{N}_n}[\langle e_i, x \rangle > \theta] = \Theta(e^{-\theta^2/2}/\theta)$, so $\theta = \Theta(\sqrt{\log k})$. Therefore, for the $\theta$ that we selected, it holds that $\Pr_{(x,y) \sim \mathcal{N}_n^{1-\epsilon}}[\langle e_i, x \rangle < \theta < \langle e_i, y \rangle] = \Omega(\theta \sqrt{\epsilon}/k) = \Omega(\sqrt{\epsilon \log k}/k)$.

Let $f : \mathbb{R}^n \to \{\pm 1\}$ be $1$ if and only if $f_i$ is $1$ for all $i \in [k]$. Then we have that
\[
\mathrm{GNS}_\epsilon(f) = 2 \Pr_{(x,y) \sim \mathcal{N}_n^{1-\epsilon}}[f(x) = 1,\ f(y) = -1] = 2\Big(\Pr_{x \sim \mathcal{N}_n}[f(x) = 1] - \Pr_{(x,y) \sim \mathcal{N}_n^{1-\epsilon}}[f(x) = f(y) = 1]\Big) = 2\left(\Big(1 - \frac{1}{k}\Big)^{k} - \Big(1 - \frac{1}{k} - \Omega\Big(\frac{\sqrt{\epsilon \log k}}{k}\Big)\Big)^{k}\right),
\]
where the $k$-th powers are due to the fact that $\langle e_i, x \rangle$ and $\langle e_j, x \rangle$ are independent for $i \neq j$. We can use a Taylor expansion to show that the above difference is $\Omega(\sqrt{\epsilon \log k})$.
Let $h(t) = (1 - 1/k + t)^k$. By Taylor's theorem, $h(0) - h(t) = -h'(0)\, t - h''(\xi)\, t^2 / 2$ for some $\xi$ between $t$ and $0$. By calculating the derivatives, setting $t = -\Omega(\sqrt{\epsilon \log k}/k)$, and noting that the second term of the approximation is smaller than the first, we get that
\[
h(0) - h(t) = \Omega\Big(\frac{\sqrt{\epsilon \log k}}{k}\Big) \cdot k \cdot \Big(1 - \frac{1}{k}\Big)^{k-1},
\]
which is $\Omega(\sqrt{\epsilon \log k})$ since $(1 - 1/k)^{k-1} = \Theta(1)$, completing the proof.
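The $\Omega(\sqrt{\epsilon \log k})$ scaling can also be observed empirically; the following Monte Carlo sketch (ours) estimates the noise sensitivity of the construction above (the parameters $k$, $\epsilon$, and the sample size are arbitrary):

```python
# Monte Carlo estimate (ours) of the Gaussian noise sensitivity of the
# intersection of k orthogonal halfspaces constructed above; the estimate
# should be on the order of sqrt(eps * log k). Parameters are arbitrary.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
k, eps, n = 100, 0.01, 50_000
theta = norm.isf(1.0 / k)                 # threshold with Pr[N(0,1) > theta] = 1/k

x = rng.standard_normal((n, k))
z = rng.standard_normal((n, k))
y = (1 - eps) * x + np.sqrt(1 - (1 - eps) ** 2) * z   # (1-eps)-correlated copy

fx = np.all(x < theta, axis=1)            # f(x) = 1 iff every coordinate < theta
fy = np.all(y < theta, axis=1)
print(np.mean(fx != fy), np.sqrt(eps * np.log(k)))    # same order of magnitude
```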
C Duality in Infinite-Dimensional LP

We start with some basic definitions.

$L_p$ spaces. Let $(X, \mathcal{A}, \mu)$ be a measure space and $1 \leq p < \infty$. We will typically take $X = \mathbb{R}^n$, $n \in \mathbb{Z}_+$, and $\mu$ to be the Gaussian measure, unless otherwise specified. For a function $f : X \to \mathbb{R}$, the $L_p$-norm of $f$ is defined as $\|f\|_p := \big(\int_X |f|^p \, d\mu\big)^{1/p}$. For the special case $p = \infty$, the $L_\infty$-norm of $f$ is defined as the essential supremum of $f$ on $X$, i.e., $\|f\|_\infty := \inf\{a \in \mathbb{R} : \mu(\{x \in X : |f(x)| > a\}) = 0\}$. The vector space $L_p(X, \mu)$ consists of all functions $f : X \to \mathbb{R}$ with $\|f\|_p < \infty$. We will typically use the shortened notation $L_p(\mathbb{R}^n)$ for $L_p(\mathbb{R}^n, \mathcal{N}_n)$.
Dual Norms. Consider a vector space $V$ with an inner product $\langle \cdot, \cdot \rangle$ and a norm $\|\cdot\|$ on $V$. The dual norm $\|f\|_*$, $f \in V$, is defined as $\|f\|_* = \sup\{\langle f, h \rangle : \|h\| \leq 1\}$. Hölder's inequality states that for any $f, h \in V$ it holds that $\langle f, h \rangle \leq \|f\| \, \|h\|_*$.
Basics on Duality of Infinite-Dimensional LPs. For succinctness, we will use the following notation. We write $(\tilde{h}, t)$ for the inequality $\mathbf{E}_{x \sim \mathcal{N}_m}[g(x) \tilde{h}(x)] + t \leq 0$, where $\tilde{h} \in X$ and $t \in \mathbb{R}$; here $X$ is an appropriate space of functions, which in our context will be $L_p(\mathbb{R}^m)$, and $g$ is the unknown of the system. Let $S$ be the set of all such tuples that describe the target LP. For the set $S$, the closed convex cone over $X \times \mathbb{R}$ is the smallest closed set $S^+$ satisfying the following: if $A \in S^+$ and $B \in S^+$ then $A + B \in S^+$; and if $A \in S^+$ then $\lambda A \in S^+$ for all $\lambda \geq 0$. Note that $S^+$ has the same feasible solutions as $S$. In the setting of Proposition 4.1, we are looking for $g : \mathbb{R}^m \to \mathbb{R}$ such that $-\mathbf{E}_{x \sim \mathcal{N}_m}[g(x) f(x)] + c \leq 0$, $\mathbf{E}_{x \sim \mathcal{N}_m}[g(x) P(x)] \leq 0$ for every polynomial $P : \mathbb{R}^m \to \mathbb{R}$ of degree at most $d-1$ (which encodes the equalities, since $-P$ also ranges over this family), and $\mathbf{E}_{x \sim \mathcal{N}_m}[g(x) h(x)] - \|h\|_p \leq 0$ for every $h \in L_p(\mathbb{R}^m)$. That is, the target system is described by the set
\[
S = \{(h, -\|h\|_p) : h \in L_p(\mathbb{R}^m)\} \cup \{(P, 0) : P \in \mathcal{P}^m_{d-1}\} \cup \{(-f, c)\}.
\]

In the finite-dimensional case, we can always prove the feasibility of an LP by applying the standard Farkas' lemma (a.k.a. the theorem of the alternative). However, when the system is infinite-dimensional, Farkas' lemma does not hold in general. We are going to use the following result from [Fan68].

Lemma C.1 (Theorem 1 of [Fan68]). If $X$ is a locally convex, real separated vector space, then a linear system described by $S$ for which $S^+$ is closed is feasible (i.e., there exists a $g \in X^*$) if and only if $(0, 1) \notin S^+$.

One direction is trivial, but the other requires an application of the Hahn-Banach theorem, which is where the assumption that $X$ is a separated space is used.

Corollary C.2. If $X = L_p$ for $1 \leq p < \infty$, then the LP described by $S$ is feasible if and only if $(0, 1) \notin S^+$.

Proof. For $X = L_p$ with $1 \leq p < \infty$, $X$ is a locally convex, real separated vector space. It therefore remains to prove that the set $S^+$ is closed.

We begin by finding an explicit representation of $S^+$. It is not hard to see that
\[
S^+ = \{(P + h - y f,\ \|h\|_p + y c + t) : P \in \mathcal{P}^m_{d-1},\ h \in L_p(\mathbb{R}^m),\ y, t \in \mathbb{R},\ y, t \geq 0\}.
\]
We will show that this set is closed by showing that it is closed under limits. In particular, suppose that there is a sequence $(P_i, h_i, y_i, t_i)$ such that $(P_i + h_i - y_i f,\ \|h_i\|_p + y_i c + t_i)$ converges to some limit $(\tilde{h}, \tilde{t})$. We claim that $(\tilde{h}, \tilde{t}) \in S^+$.

To show this, we first note that for $i$ sufficiently large, $\|h_i\|_p \leq \|h_i\|_p + y_i c + t_i \leq \tilde{t} + 1$ and $y_i c \leq \|h_i\|_p + y_i c + t_i \leq \tilde{t} + 1$. Thus, for $i$ sufficiently large, $\|h_i\|_p \leq \tilde{t} + 1$ and $y_i \leq (\tilde{t} + 1)/c$. Furthermore, for $i$ sufficiently large, $\|P_i + h_i - y_i f\|_p \leq \|\tilde{h}\|_p + 1$. However, $\|P_i\|_p \leq \|P_i + h_i - y_i f\|_p + \|h_i\|_p + y_i \|f\|_p$, which is bounded for $i$ sufficiently large. Therefore, since an $L_p$-ball in $\mathcal{P}^m_{d-1}$ is compact, by restricting to a subsequence we can assume that the $P_i$ have some limit, say $P$. Furthermore, since $[0, (\tilde{t} + 1)/c]$ is compact, by restricting to a further subsequence we can assume that the $y_i$ have some limit $y$. Since $P_i + h_i - y_i f \to \tilde{h}$, the $h_i$ must approach the limit $h = \tilde{h} - P + y f$. Finally, we note that $\|h_i\|_p + y_i c + t_i \to \tilde{t}$, and thus the $t_i$ must approach the limit $t = \tilde{t} - y c - \|h\|_p$. In particular, we must have $t \geq 0$. Therefore, $(\tilde{h}, \tilde{t}) = (P + h - y f,\ \|h\|_p + y c + t) \in S^+$.