Positive spectrahedrons: Geometric properties, Invariance principles and Pseudorandom generators
Srinivasan Arunachalam
IBM Quantum, IBM T.J. Watson Research Center, Yorktown Heights, USA

Penghui Yao
State Key Laboratory for Novel Software Technology, Nanjing University
[email protected]
January 21, 2021
Abstract
In a recent work, O'Donnell, Servedio and Tan (STOC 2019) gave explicit pseudorandom generators (PRGs) for arbitrary m-facet polytopes in n variables with seed length poly-logarithmic in m, n, concluding a sequence of works in the last decade that was started by Diakonikolas, Gopalan, Jaiswal, Servedio, Viola (SICOMP 2010) and Meka, Zuckerman (SICOMP 2013) for fooling linear and polynomial threshold functions, respectively. In this work, we consider a natural extension of PRGs for intersections of positive spectrahedrons. A positive spectrahedron is a Boolean function f(x) = [x_1 A_1 + · · · + x_n A_n ⪯ B] where the A_i s are k × k positive semidefinite matrices. We construct explicit PRGs that δ-fool "regular" width-M positive spectrahedrons (i.e., when none of the A_i s are dominant) over the Boolean space with seed length poly(log k, log n, M, 1/δ).

Our main technical contributions are the following. We first prove an invariance principle for positive spectrahedrons via the well-known Lindeberg method; as far as we are aware, such a generalization of the Lindeberg method was unknown. Second, we prove various geometric properties of positive spectrahedrons, such as their noise sensitivity, Gaussian surface area, and a Littlewood-Offord theorem for positive spectrahedrons. Using these results, we give applications for constructing PRGs for positive spectrahedrons, learning theory, discrepancy sets for positive spectrahedrons (over the Boolean cube) and PRGs for intersections of structured polynomial threshold functions.
1 Introduction
Constructing explicit pseudorandom generators (PRGs) for a class of interesting Boolean functions has received tremendous attention in the last few decades. One particular class of functions that has seen a flurry of works is the class of halfspaces. A halfspace is a Boolean function f : {−1, 1}^n → {0, 1} that can be expressed as f(x) = sign(a_1 x_1 + · · · + a_n x_n − b) for some real values a_1, . . . , a_n, b ∈ R. Halfspaces arise naturally in many areas of theoretical computer science, including machine learning, communication complexity, circuit complexity and pseudorandomness. A successful line of work [Ser06, DHK+10, MZ13, KM15, GKM18] resulted in PRGs that ε-fool halfspaces with seed length poly-logarithmic in n/ε over the Boolean space.

Given the success in designing PRGs for single halfspaces (or linear threshold functions), two alternate lines of work received a lot of attention: polynomial threshold functions and intersections of halfspaces. A degree-d polynomial threshold function (PTF) is simply a function f(x) = sign(p(x)) where p is a degree-d polynomial. In this direction, a sequence of works [DGJ+10, DHK+10, Kan10, Kan11a, Kan11b, Kan11c, Kan14b, OST20] produced PRGs with seed length exponential in d over the Boolean space and quasi-polynomial in d over the Gaussian space. Alternatively, another line of work considered intersections of halfspaces (i.e., polytopes). In this direction, a sequence of works [GOWZ10, HKM13, ST17, CDS19, OST19] produced a PRG for m-facet polytopes in n variables with seed length poly-logarithmic in m, n. In this work, we initiate the construction of
PRGs for spectrahedrons: a natural generalization of halfspaces, polytopes and PTFs in one framework. A spectrahedron S ⊆ R^n is the feasible region of a semidefinite program. Namely,

S = { x ∈ R^n : Σ_i x_i A_i ⪯ B }

for some symmetric matrices A_1, . . . , A_n, B, where ⪯ is the standard Löwner ordering (in this ordering, we say A ⪯ B if B − A is positive semidefinite, i.e., all the eigenvalues of B − A are non-negative). We say S is a positive spectrahedron if either all A_i s are positive semidefinite (PSD) or all A_i s are negative semidefinite. Spectrahedrons are important basic objects in polynomial optimization and algebraic geometry [BPT12, Sch18]. Mathematically, spectrahedrons have rich and complicated structures and include well-known geometric objects like polytopes, cylinders, polyhedrons and elliptopes. Computationally, semidefinite programming has found many applications in theoretical computer science, in the fields of optimization [AK07], approximation theory [GW95, GM12] and algorithms [AHK05, JLL+15, LRS15]. The class of semidefinite programs that consists of only PSD matrices is an important class of SDPs, termed positive semidefinite programs, which has been used to characterize various quantum interactive proof systems [JUW09, JJUW11, GW13]. Their computational complexity has also received a lot of attention in the past decade [JY11, PT12, AZLO16, JLL+15]. In this work, we construct PRGs for regular positive spectrahedrons, which we define in Section 1.2. Before stating our main results, we briefly discuss the techniques developed by prior works to construct PRGs for polytopes, and then the challenges we need to handle here. (We remark that there is still room for improvement in the seed length of the PRG in [OST19].)

1.1 Prior work and conceptual challenges

One of the earliest works that considered fooling threshold functions was by Meka-Zuckerman [MZ13] and [DGJ+10], which construct PRGs for functions f via invariance principles. Roughly speaking, an invariance principle for a function f : {−1, 1}^n → {0, 1} states that the expected value of f(U_n) (where the input is uniformly random in {−1, 1}^n) is close to the expected value of f(G_n) (where the input is a standard Gaussian G_n = N(0, 1)^n). Invariance theorems are generalizations of the classic Berry-Esseen central limit theorem, and are generally proven using the well-known Lindeberg method [Lin22]. The versatile framework of [MZ13] allows one to use invariance principles along with a few more ingredients to construct
PRGs, so the technical challenge is in establishing invariance principles. Using this framework, Harsha, Klivans and Meka [HKM13] proved an invariance principle for regular polytopes (i.e., when the coefficients in all the halfspaces are "regular"). The main novelty in their work was the poly-logarithmic (in the input parameters) error dependence. In order to prove this, they first proved a general invariance principle for smooth functions (over polytopes). Subsequently, they instantiated their invariance principle for the so-called Bentkus mollifier [Ben90], crucially relying on the fact that the mollifier has derivatives that scale poly-logarithmically in the input size. Finally, in order to go from invariance principles (for the mollifier) to fooling regular polytopes, they need to prove an anti-concentration statement for polytopes in the Gaussian space. For this, they use (as a black-box) a well-known result of Nazarov [Naz03, KOS08], which bounds the Gaussian surface area (GSA) of polytopes. Putting together the invariance principle for smooth functions, the Bentkus mollifier and Nazarov's bound on GSA, [HKM13] obtained their main results for regular polytopes. We discuss this proof idea in more detail in Section 1.3.

Subsequently, Servedio and Tan [ST17] improved the results of [HKM13] by considering "low-weight" polytopes, which removes the regularity condition (albeit with the seed length of the PRG in [ST17] depending on the weight). Finally, O'Donnell, Servedio and Tan [OST19] showed how to fool arbitrary polytopes. In [OST19] they still proved a "Boolean invariance principle" for the Bentkus mollifier; however, they bypass the Gaussian space entirely (in fact, avoiding the Gaussian space is a necessity, since standard invariance principles do not hold for non-regular polytopes). Although they bypass the Gaussian intermediate (which is standard in invariance principles), their proof techniques still use the Lindeberg method. Additionally, a crucial tool introduced by them was Boolean anti-concentration of polytopes, since they can no longer use the GSA bound of Nazarov, which was used by [HKM13, ST17, CDS19] for Gaussian anti-concentration.
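The invariance-principle template described above is easy to see numerically in its simplest (single-halfspace, Berry-Esseen) case. The sketch below, with illustrative parameters of our own choosing, compares the acceptance probability of a regular halfspace under uniform ±1 inputs and under Gaussian inputs:

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 200, 20000

# A tau-regular halfspace: unit-norm weights, no dominant coordinate.
w = rng.normal(size=n)
w /= np.linalg.norm(w)          # ||w||_2 = 1, so each |w_i| is about 1/sqrt(n)
b = 0.3                          # arbitrary threshold

def accept_prob(samples):
    """Fraction of inputs z with <w, z> >= b, i.e. sign(<w,z> - b) = 1."""
    return np.mean(samples @ w >= b)

x = rng.choice([-1.0, 1.0], size=(trials, n))   # uniform Boolean inputs
g = rng.normal(size=(trials, n))                # i.i.d. standard Gaussians

p_bool, p_gauss = accept_prob(x), accept_prob(g)
gap = abs(p_bool - p_gauss)
print(p_bool, p_gauss, gap)     # gap should be small for a regular w
```

By Berry-Esseen the gap is O(max_i |w_i|) = O(1/√n) here; invariance principles extend this closeness from a single linear form to polytopes and, in this paper, to spectrahedrons.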
There are two straightforward approaches to constructing PRGs for positive spectrahedrons. The first is to write a spectrahedron as a linear program: naturally, one can approximate a positive-semidefinite constraint X ⪰ 0 on a k × k symmetric matrix with exponentially many constraints of the form z^T X z ≥ 0 for z ∈ R^k. However, the results of [HKM13, OST19] would be moot here, since the seed lengths of their PRGs are poly-logarithmic in the number of constraints, which is polynomial in the dimension k, while our goal is to have seed lengths poly-logarithmic in k. The second approach is to use Sylvester's criterion to write out k polynomials of degree at most k (corresponding to the determinantal representation of the k minors) and one could potentially use PRGs for polynomial threshold functions (PTFs). However, finding optimal PRGs for PTFs has remained open for years, and the best-known PRGs we have for degree-k PTFs over the Boolean space have seed length depending exponentially on k [MZ13]. This naturally motivates us to use the "eigenstructure" of X ⪰ 0, which raises the following challenges. (Throughout, recall that the Bentkus mollifier is a function which provides a "smooth" continuous approximation to the discrete multivariate indicator function, also referred to as an orthant function.)

1. Invariance principles: Since a spectrahedron naturally deals with eigenvalues of matrices, it is unclear if we could use known invariance principles for spectrahedrons. In fact, we are not even aware of a generalization of the Lindeberg-type argument to show an invariance principle for spectral functions (i.e., functions that act on the eigenspectrum of matrices).

2. Geometric properties: Prior works [KOS08, HKM13, ST17, CDS19] crucially used the work of Nazarov [Naz03], which bounds the Gaussian surface area of polytopes, in order to prove their anti-concentration. However, spectrahedrons are very poorly understood, and even more basic quantities such as their average sensitivity, noise sensitivity and surface area are unknown.

3. Anti-concentration: An important technique for constructing PRGs using invariance principles requires one to prove anti-concentration, i.e., when moving from the smooth mollifier to the orthant function, a crucial ingredient is anti-concentration. It is far from clear if spectrahedrons enjoy such nice properties in either Boolean or Gaussian space.

As far as we are aware, none of these questions have been considered for any class of spectrahedrons except polytopes. Our main contribution is to make significant progress on all of these questions for the class of positive spectrahedrons.
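For concreteness, Sylvester's criterion — the basis of the second approach above — says that a symmetric matrix is positive definite iff all of its leading principal minors are positive; each minor is a determinant, i.e., a polynomial of degree at most k in the matrix entries. A minimal sketch (toy matrices of our own choosing):

```python
import numpy as np

def leading_minors(X):
    """Determinants of the k leading principal submatrices of X."""
    k = X.shape[0]
    return [np.linalg.det(X[:j, :j]) for j in range(1, k + 1)]

def is_positive_definite_sylvester(X):
    """Sylvester's criterion: X is positive definite iff every leading minor > 0."""
    return all(m > 0 for m in leading_minors(X))

rng = np.random.default_rng(1)
A = rng.normal(size=(4, 4))
pd_matrix = A @ A.T + 4 * np.eye(4)      # positive definite by construction
not_pd = pd_matrix - 100 * np.eye(4)     # shifted far down: not PSD

print(is_positive_definite_sylvester(pd_matrix))  # True
print(is_positive_definite_sylvester(not_pd))     # False
# Agreement with the eigenvalue ("eigenstructure") view used in this paper:
print(np.all(np.linalg.eigvalsh(pd_matrix) > 0))  # True
```

Each of the k minors is a PTF input of degree up to k, which is why this route runs into the exponential-in-degree seed lengths mentioned above.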
In order to state our main result, we first define PRGs and (τ, M)-regular spectrahedrons. A pseudorandom generator is a function G : {−1, 1}^r → {−1, 1}^n, and it is said to ε-fool a class of functions F ⊆ {f : {−1, 1}^n → {0, 1}} with seed length r if it satisfies the following: for every f ∈ F, we have

| Pr_{x∼U_n}[f(x) = 1] − Pr_{y∼U_r}[f(G(y)) = 1] | ≤ ε,

where U_n (resp. U_r) corresponds to the uniform distribution over {−1, 1}^n (resp. {−1, 1}^r). We next define the class of regular positive spectrahedrons. Given τ, M > 0, we say a sequence of k × k positive semidefinite matrices (A_1, . . . , A_n) is (τ, M)-regular if

I ⪯ Σ_{i=1}^n (A_i)^2 ⪯ M · I   and   A_i ⪯ τ · I for every i ∈ [n].   (1)

This regularity assumption is very natural: it says that the width of a semidefinite program defined by these matrices is bounded. We remark that our regularity condition naturally extends (and is in fact less restrictive than) the regularity condition that was used in prior works on fooling halfspaces and polytopes [GOWZ10, DGJ+10, MZ13, HKM13]. In Section 1.5.2 we discuss more about why this notion of regularity is necessary and sufficient for our proof techniques.

A spectrahedron S ⊆ R^n is the feasible region of a semidefinite program, i.e., the convex set S = { x ∈ R^n : Σ_i x_i A_i ⪯ B }. We say S is a positive spectrahedron if either all A_i s are positive semidefinite (PSD) or all A_i s are negative semidefinite. We say S is a (τ, M)-regular positive spectrahedron if (A_1, . . . , A_n) are (τ, M)-regular. It is also natural to consider an intersection of positive spectrahedrons S_1, . . . , S_t. However, without loss of generality one can assume that t = 2, since one can "pack" all the S_i s with PSD matrices into a larger block-diagonal matrix of dimension t · k, and similarly all the negative semidefinite ones; so we can always assume we are working with an intersection of two positive spectrahedrons. For simplicity, in the introduction we assume that we are working with a single regular positive spectrahedron, and state our main theorem.
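The definitions above can be checked mechanically. The sketch below (toy matrices, and with Eq. (1) read as stated) tests (τ, M)-regularity, evaluates membership in a positive spectrahedron via the top eigenvalue, and illustrates the block-diagonal "packing" of an intersection:

```python
import numpy as np

def is_regular(As, tau, M):
    """(tau, M)-regularity as in Eq. (1):
    I <= sum_i (A_i)^2 <= M*I  and  A_i <= tau*I for every i."""
    eigs = np.linalg.eigvalsh(sum(A @ A for A in As))   # ascending order
    tops = [np.linalg.eigvalsh(A)[-1] for A in As]      # lambda_max(A_i)
    return eigs[0] >= 1 - 1e-9 and eigs[-1] <= M + 1e-9 and max(tops) <= tau + 1e-9

def in_spectrahedron(x, As, B):
    """Membership in {x : sum_i x_i A_i <= B}, i.e. lambda_max(sum - B) <= 0."""
    return bool(np.linalg.eigvalsh(sum(xi * A for xi, A in zip(x, As)) - B)[-1] <= 0)

def pack(As1, B1, As2, B2):
    """Pack an intersection of two spectrahedrons into one block-diagonal one."""
    blk = lambda P, Q: np.block([[P, np.zeros_like(P)], [np.zeros_like(Q), Q]])
    return [blk(P, Q) for P, Q in zip(As1, As2)], blk(B1, B2)

rng = np.random.default_rng(2)
n, k = 6, 3
As = [np.diag(rng.uniform(0.5, 0.7, size=k)) for _ in range(n)]  # toy PSD tuple
B = 10.0 * np.eye(k)
x = rng.choice([-1, 1], size=n)

print(is_regular(As, tau=0.7, M=3.0))     # True for these toy matrices
print(in_spectrahedron(x, As, B))         # True: the sum stays well below 10*I
As2, B2 = pack(As, B, As, B)
print(in_spectrahedron(x, As2, B2))       # same answer on the packed version
```

The packed membership test agrees with the original because the eigenvalues of a block-diagonal matrix are the union of the blocks' eigenvalues.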
Result 1 (PRG for positive spectrahedrons). There exists a PRG G : {0, 1}^r → {−1, 1}^n with seed length r = O(log n · log k · M · 1/δ) that δ-fools (τ, M)-regular positive spectrahedrons for τ ≤ poly(δ/(M · log k)).

Typically, handling the "regular case" is the first step towards obtaining optimal results in pseudorandom generators for geometric objects, and we have accomplished that here for the first time. To prove this theorem, we follow the well-known three-step approach and prove the following:

1. An invariance principle for the Bentkus mollifier of arbitrary regular spectrahedrons.
2. Boolean and Gaussian anti-concentration for positive regular spectrahedrons.
3. An invariance principle for positive regular spectrahedrons.

Before proving these statements, we first overview the [HKM13, OST19] approach to proving invariance principles (since our high-level ideas are inspired by their works).
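To make the fooling definition concrete, the error of any candidate generator against a fixed function can be estimated by Monte Carlo. The harness below is generic (the toy test function and "generators" are ours, not the construction behind Result 1):

```python
import numpy as np

rng = np.random.default_rng(3)

def fooling_error(f, gen, r, n, trials=40000):
    """Monte-Carlo estimate of |Pr_{x~U_n}[f(x)=1] - Pr_{y~U_r}[f(gen(y))=1]|."""
    x = rng.choice([-1, 1], size=(trials, n))
    y = rng.choice([-1, 1], size=(trials, r))
    p_true = np.mean([f(row) for row in x])
    p_gen = np.mean([f(gen(row)) for row in y])
    return abs(p_true - p_gen)

n = 8
f = lambda z: int(np.sum(z) >= 0)        # a toy halfspace as the test function

identity = lambda y: y                    # r = n: trivially 0-fools everything
repeat = lambda y: np.tile(y, 2)          # r = n/2: repeating the seed fails

err_id = fooling_error(f, identity, n, n)
err_rep = fooling_error(f, repeat, n // 2, n)
print(err_id, err_rep)    # err_id is ~0 up to sampling noise; err_rep is not
```

The point of a PRG is to make such errors small with r much smaller than n; naive seed reuse (the `repeat` "generator") already fails on a simple halfspace.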
(For simplicity in exposition, we assume here that ∥B∥ ≤ M; our main theorems depend on the norm of B. Crucially, we also remark that the seed length of our PRG depends only logarithmically on k, so even for an intersection of t positive spectrahedrons the dependence would be logarithmic in t as well.)

First recall that a polytope is the feasible region of the set {x ∈ R^n : W x ≤ b} for a fixed W ∈ R^{n×n}, b ∈ R^n (for simplicity we assume that the number of constraints and variables are equal; their analysis is more general). We say a polytope is τ-regular if each row W_i satisfies ∥W_i∥_2 = 1 and ∥W_i∥_∞ ≤ τ. At a high level, the [HKM13] invariance principle states the following:

| Pr_{x∼U_n}[W x ≤ b] − Pr_{g∼G_n}[W g ≤ b] | ≤ poly(log n, τ).   (2)

To show this, they first express the orthant function above (which we denote O : R^n → {0, 1}) as [W x ≤ b] = [W_1 x ≤ b_1] · · · [W_n x ≤ b_n]. Given this structure, they use the well-known Lindeberg method [Lin22] (see [O'D14, Tao10] for a detailed exposition) to move from the uniform distribution over the Boolean space to the Gaussian space. To establish Eq. (2), they follow a three-step approach. (1) First, they prove a version of Eq. (2) for smooth functions Õ : R^n → R (i.e., functions that have bounded multivariate derivatives). In particular, they use the Lindeberg method to show that the expected value of Õ(W x) for x ∼ U_n is "close" to the expected value of Õ(W g) for g ∼ G_n. To understand this closeness, they write out Õ(W z) using the standard multivariate Taylor series and bound the difference between Õ(W x) and Õ(W g) by the higher-order derivatives of the smooth function Õ. (2) Second, they observe that a result of Bentkus [Ben90] provides exactly such an approximator Õ : R^n → R (which we refer to as the Bentkus mollifier), which serves as a smooth approximation to the {0, 1}-valued orthant function O(x) = [W x ≤ b]. Additionally, this mollifier crucially satisfies the property that ∥Õ^{(ℓ)}∥ ≤ O(log^ℓ n). (3) So far they established that the Bentkus mollifier (which served as a proxy for [W x ≤ b]) satisfies an approximate version of Eq. (2). In order to go from closeness with respect to this Bentkus mollifier to multidimensional CDF closeness, they prove Gaussian anti-concentration of polytopes. For this, they use a result of Nazarov [Naz03] (as a black-box) which shows that the Gaussian surface area of a polytope is O(√log n). These three steps allow them to prove Eq. (2).

We begin by defining spectral functions. Let f : R^k → R; we say ψ : Sym_k → R is a spectral function if ψ(M) = f(λ(M)) for all M ∈ Sym_k, where λ(M) = (λ_1, . . . , λ_k) are the k eigenvalues of M. In other words, a spectral function ψ(·) depends only on a function f applied to the eigenvalues of its argument. We say f satisfies an invariance principle if

E_{x∼U_n}[ ψ(Σ_i x_i A_i − B) ] ≈_ε E_{g∼G_n}[ ψ(Σ_i g_i A_i − B) ]

for symmetric matrices A_1, . . . , A_n, B. A conceptual challenge in proving an invariance principle even for smooth spectral functions is that standard Lindeberg-style proofs of invariance theorems use the multivariate Taylor series of the mollifier function, which cannot be used here, since our functions act on the eigenvalues of matrices.
In the past, there have been various invariance principles [MOO05, Mos08, HKM13, Yao19], but none of them apply here; as far as we are aware, invariance principles with non-diagonal A_i, B have not been studied. In this work, we overcome this challenge and adapt the Lindeberg-style proofs of probabilistic invariance principles to prove an analogue for spectral functions.

To this end, recall that we are concerned with spectrahedrons whose feasible regions are given by {x ∈ R^n : Σ_i x_i A_i ⪯ B}, which can alternatively be written as {x : λ_max(Σ_i x_i A_i − B) ≤ 0}. So we let our spectral function f : R^k → R be f(λ) = [max_i λ_i ≤ 0] (recall that although our spectrahedron acts on the n bits on which we want to prove an invariance principle, our spectral function acts only on the k eigenvalues). For this function, we can still use the Bentkus mollifier Õ : R^k → R as a smooth approximation to f. So our first main contribution is to prove an invariance principle for the Bentkus mollifier applied to the spectrum of matrices. We remark that, in contrast to [HKM13], we do not prove a general invariance principle for spectral functions; instead, our spectral function is tailored to the Bentkus mollifier (which is also the case for [OST19]). In fact, our analysis can allow arbitrary orthant functions which can be approximated by a Bentkus mollifier.
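For intuition, here is a toy smooth spectral surrogate for f(λ) = [max_i λ_i ≤ 0] — a product of logistic steps in the eigenvalues, standing in for (but much cruder than) the Bentkus mollifier. Since it depends on its matrix argument only through the spectrum, it is automatically unitarily invariant:

```python
import numpy as np

def smooth_orthant(lmbda, c=30.0):
    """Smooth surrogate for [max_i lambda_i <= 0]: product of logistic steps.
    (A toy stand-in for the Bentkus mollifier, not Bentkus's construction.)"""
    return np.prod(1.0 / (1.0 + np.exp(c * lmbda)))

def spectral_mollifier(M, c=30.0):
    """psi(M) = f(lambda(M)) with f the smooth orthant above."""
    return smooth_orthant(np.linalg.eigvalsh(M), c)

rng = np.random.default_rng(4)
M = np.diag([-1.0, -0.5, -2.0])           # lambda_max = -0.5 < 0: deep inside
N = M + 1.5 * np.eye(3)                   # lambda_max = 1.0 > 0: outside

print(spectral_mollifier(M))              # close to 1
print(spectral_mollifier(N))              # close to 0

# Spectral functions are unitarily invariant: psi(Q M Q^T) = psi(M).
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
print(abs(spectral_mollifier(Q @ M @ Q.T) - spectral_mollifier(M)) < 1e-8)  # True
```

Near the boundary λ_max ≈ 0, the surrogate interpolates smoothly between 0 and 1, which is exactly the regime where anti-concentration (discussed below) is needed to pass from the mollifier back to the indicator.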
Fréchet derivatives. Since our Bentkus mollifier acts on the eigenspectrum of matrices, instead of multivariate Taylor expansion we adopt Fréchet derivatives, a notion of derivatives studied in Banach spaces. Unfortunately, Fréchet series (in contrast to standard multivariate series) are still not well understood. In fact, even basic properties such as continuity, Lipschitz continuity, differentiability and continuous differentiability, which have been well known for centuries in standard calculus, were only proven in the last three decades [BSS98, Lew96, BS99, CQT03]. In particular, Fréchet series of spectral functions only appeared in the last decade. (Here ∥f^{(ℓ)}∥ is the 1-norm of the coefficients in the ℓ-th derivative; in [HKM13], they care about ∥f^{(4)}∥ = max_x Σ_{p,q,r,s} |∂_p ∂_q ∂_r ∂_s f(x)|.)

Fortunately for us, Sendov [Sen07] provided a tensorial representation of high-order Fréchet series for spectral functions, which we employ to analyze the Fréchet derivatives of the Bentkus mollifier. The challenge is in bounding the 3-tensors that appear in Sendov's theorem, which produce 6 terms corresponding to different permutations of the tensors after simplification. Three of these 6 terms can simply be upper bounded by ∥Õ^{(3)}∥, which we know to be small for the Bentkus mollifier. We remark that these are exactly, and the only, terms that appear in standard invariance-principle proofs for linear forms. Intuitively this is not surprising, since the first three terms simply correspond to the case when the A_i, B are diagonal, which reduces a spectrahedron to a polytope. However, bounding the remaining terms is highly non-trivial, and one of our technical contributions is in showing that these remaining terms are bounded for the Bentkus mollifier.

Bounding derivatives and obtaining the invariance principle. Bounding the last three terms of the 3-tensors significantly deviates from the analysis of [HKM13], since we need to deal with off-diagonal entries of matrices, which is unique to the matrix-spectrahedron case and is not faced in [HKM13, ST17, OST19]. To bound these terms, we use several properties of Fréchet derivatives, such as mean value theorems for Fréchet derivatives, divided-difference representations of Fréchet derivatives [BLZ05], and Dyson's theorem [Bha13], which provides a useful integral expression for Fréchet derivatives (using the structure of the mollifier). More importantly, since we work with the Bentkus mollifier [Ben90], we completely open up the Bentkus black-box and show various analytic properties of this mollifier Õ in order to prove that our Fréchet derivatives are bounded.

In order to go from bounded third-order Fréchet derivatives to a final invariance principle, we still need to borrow some results from random matrix theory to upper bound the moments of Σ_i x_i A_i. Although the concentration of Σ_i x_i A_i for uniformly random x ∼ U_n is well studied via standard matrix Chernoff bounds [Tro15], we need better concentration of this random matrix at higher Schatten norms. For the diagonal (polytope) case, [HKM13] used standard hypercontractivity and [OST19] used Rosenthal's inequality. Fortunately for us, a matrix version of Rosenthal's inequality [MJC+14] was proven a few years back, and we use it to conclude our proof (in fact, we also crucially rely on this inequality to construct our PRG). Putting everything together, we obtain our main invariance principle for the Bentkus mollifier applied as a spectral function:

| E_{x∼U_n}[ Õ(Σ_{i=1}^n x_i A_i − B) ] − E_{g∼G_n}[ Õ(Σ_{i=1}^n g_i A_i − B) ] | ≤ poly(log k, M, τ).   (3)

We remark that the invariance principle above does not assume positivity of the matrices. We believe this is a necessity for future work on fooling arbitrary spectrahedrons.

Even with an invariance principle in hand, we are faced with the same challenges as [HKM13, ST17, OST19] to show an anti-concentration statement. Recall that our goal is to show that, for a (τ, M)-regular positive spectrahedron S, the expected value of the indicator function [x ∈ S] for x ∼ U_n is close to the expected value of [g ∈ S] for g ∼ G_n. This is "almost" what we showed in the previous section, except that the Bentkus mollifier Õ in Eq. (3) is replaced by the orthant indicator function f(x) = [max_i x_i ≤ 0].

1.5.1 Geometric properties

A well-known theorem of Ball [Bal93] shows an upper bound of O(n^{1/4}) on the Gaussian surface area (GSA) of an arbitrary n-dimensional convex object. Crucially, the works of [HKM13, ST17] used an improvement of Ball's theorem by Nazarov [Naz03], who showed that the GSA of k-facet polytopes is O(√log k). This logarithmic upper bound on GSA allows [KOS08, HKM13, ST17, CDS19] to obtain invariance principles, learning algorithms and pseudorandom generators that depend poly-logarithmically on k. In contrast, for our setting it is unclear what the GSA of spectrahedrons is. Clearly, Ball's theorem gives an upper bound of O(n^{1/4}) on GSA for us; apart from that, spectrahedrons are poorly understood. Below we prove an upper bound of O(1) on the GSA of positive spectrahedrons.

Result 2 (Geometric properties of positive spectrahedrons). Let S be a positive spectrahedron and consider F : {−1, 1}^n → {0, 1} defined as F(x) = [x ∈ S].
The average sensitivity of F is O(√n), the ε-Boolean noise sensitivity of F is O(√ε), and the Gaussian surface area is O(1).

We remark that the noise-sensitivity statement above can be viewed as a "positive-matrix analogue" of the well-known Peres theorem [Per04]. In order to prove this statement, we first observe that the average sensitivity of F being O(√n) follows immediately from the observation that positive spectrahedrons correspond to unate functions, and Kane [Kan14a] showed that AS(f) ≤ O(√n) if f is unate (a similar statement is known to be false for noise sensitivity). One issue we need to handle when translating between noise sensitivity and average sensitivity is the following: in the standard technique of [Per04, DGJ+10, Kan14a], one upper bounds the ε-noise sensitivity of a function f by "bucketing" the input variables into m = O(1/ε) buckets B_1, . . . , B_m, reducing the function f : {−1, 1}^n → {−1, 1} to a function g : {−1, 1}^m → {−1, 1} defined (in our setting) via g(b) = [Σ_{ℓ=1}^m b_ℓ Σ_{i∈B_ℓ} z_i A_i ⪯ B] (for uniformly random z). One then upper bounds NS_ε(f) using AS(g) (up to a factor of ε). When using this technique to bound the ε-noise sensitivity of halfspaces, both f and g are intersections of halfspaces, and one can upper bound the average sensitivity of g using Kane's result [Kan14a] by O(√m). However, in our setting, if f is an indicator of a positive spectrahedron, then g no longer needs to be an indicator of a positive spectrahedron, since Σ_{i∈B_ℓ} z_i A_i need not be either a positive semidefinite or a negative semidefinite matrix. We overcome this by modifying the bucketing procedure of [DGJ+10] to ensure that g is an indicator of a unate function. However, in the process we end up upper bounding NS_ε(f) by the "average 2-sensitivity" of g. We extend the results of Kane [Kan14a] by showing that even the "average 2-sensitivity" of g is small in our setting. Finally, to move from an upper bound on ε-noise sensitivity to Gaussian surface area, we use standard folklore results [DHK+10, Kan11a, Bal13].
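The ε-noise sensitivity appearing in Result 2 can be estimated empirically; in the sketch below we rerandomize each coordinate independently with probability ε (one common convention) on a toy positive spectrahedron with parameters of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(5)
n, k = 40, 4

# Toy positive spectrahedron with illustrative (not necessarily regular) parameters.
As = [np.diag(rng.uniform(0.0, 1.0, size=k)) for _ in range(n)]
B = 3.7 * np.eye(k)

def F(x):
    """F(x) = [sum_i x_i A_i <= B], evaluated via the top eigenvalue."""
    return np.linalg.eigvalsh(sum(xi * A for xi, A in zip(x, As)) - B)[-1] <= 0

def noise_sensitivity(F, n, eps, trials=4000):
    """Estimate Pr[F(x) != F(y)], where y rerandomizes each coordinate w.p. eps."""
    flips = 0
    for _ in range(trials):
        x = rng.choice([-1, 1], size=n)
        y = np.where(rng.random(n) < eps, rng.choice([-1, 1], size=n), x)
        flips += F(x) != F(y)
    return flips / trials

ns_small, ns_large = noise_sensitivity(F, n, 0.01), noise_sensitivity(F, n, 0.1)
print(ns_small, ns_large)   # more noise means more sensitive; both well below 1/2
```

Result 2 asserts that this quantity grows only like O(√ε) for positive spectrahedrons, matching the Peres-type behavior of halfspaces.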
Gaussian anti-concentration of polytopes follows directly from the fact that the Gaussian surface area of polytopes is bounded, since their surface has only finitely many normal vectors; this is crucially used in [HKM13, ST17, CDS19]. However, it is not clear how to obtain Gaussian anti-concentration of positive spectrahedrons even with bounded Gaussian surface area (as proven in Result 2), due to their complicated geometric structure. Here, to move from mollifier closeness to CDF closeness, we prove Boolean anti-concentration for positive spectrahedrons, which is in fact stronger than Gaussian anti-concentration, inspired by the Boolean anti-concentration for polytopes in [OST19].

Regularity condition. Before explaining the Boolean anti-concentration, we need to revisit the regularity condition, which is also used for polytopes. In [HKM13, ST17], it is assumed that every halfspace (or row of the matrix W) satisfies ∥W_i∥_2 = 1 and ∥W_i∥_∞ ≤ τ. One important question is: what is a regularity assumption for spectrahedrons, and under which assumptions can we show anti-concentration? A natural possibility is to see if Nazarov's result [Naz03] holds for spectrahedrons (i.e., to show anti-concentration in the weaker Gaussian setting). To the best of our knowledge, this has firstly not been studied in the literature. Moreover, it is not hard to see that, in order for Nazarov's proof to work for spectrahedrons, one could make the very strong assumption that every A_i satisfies λ_min(A_i) ≥ 1. However, this seems to significantly restrict the class of spectrahedrons. In order to resolve this, we propose (τ, M)-regularity as defined in Eq. (1) and prove a stronger statement, i.e., Boolean anti-concentration for (τ, M)-regular positive spectrahedrons. We use this statement to go from closeness between the mollifiers Õ(Σ_i x_i A_i − B) and Õ(Σ_i g_i A_i − B) (which we already established in Eq. (3)) to closeness between [Σ_i x_i A_i ⪯ B] and [Σ_i g_i A_i ⪯ B]. In this direction, we prove a Littlewood-Offord-type theorem for positive spectrahedrons.

Result 3 (Littlewood-Offord for positive spectrahedrons). Suppose (A_1, . . . , A_n) are (τ, M)-regular. Then for every Λ > 0, we have

Pr_{x∼U_n}[ λ_max(Σ_i x_i A_i − B) ∈ [−Λ, Λ] ] ≤ O(Λ).

The classic Littlewood-Offord theorem [LO39, Erd45] is an anti-concentration inequality which, for a halfspace w ∈ R^n (satisfying |w_i| ≥ 1) and α ∈ R, bounds the probability that Σ_i w_i x_i ∈ [α, α + 2] (where x ∼ U_n). In [OST19] this was generalized to intersections of halfspaces, and in the result above we show a matrix version of the Littlewood-Offord theorem. Intuitively, our statement shows that the largest eigenvalue of a positive spectrahedron cannot be very concentrated in a small region (i.e., small eigenvalue regions have small measure over the Boolean cube).

The proof of our result is similar to the proofs in [Kan14a, OST19], which show anti-concentration for intersections of unate functions (which is the case for positive spectrahedrons). There are a couple of subtleties for us: in [OST19], they perform a random "bucketing" of the coordinates in a polytope and show that, with high probability, each bucket has "significant" weight, which follows immediately from the Paley-Zygmund inequality. However, for us, random bucketing does not produce a positive spectrahedron (the same issue we faced in Theorem 2), so instead we need to bucket in a non-standard way to go from a positive spectrahedron to a bucket which corresponds to a unate function. Next, to show that each bucket has significant weight (which in our case corresponds to a large smallest eigenvalue), we invoke the matrix Chernoff bound for negatively correlated variables. We remark that higher-dimensional extensions of the Littlewood-Offord theorem [FF88, TV12] do not talk of the eigenspectrum of matrices and differ from our result.

Using the standard bits-to-Gaussians trick, this also gives us Gaussian anti-concentration (i.e., the positive-spectrahedron analogue of Nazarov's result [Naz03], which was unknown as far as we are aware). Putting this together with our invariance principle we obtain our main result.

Result 4 (Fooling positive spectrahedrons). For every (τ, M)-regular positive spectrahedron S,

| E_{x∼U_n}[x ∈ S] − E_{g∼G_n}[g ∈ S] | ≤ poly(M, log k, τ).   (4)

Apart from the application of constructing pseudorandom generators (which we discuss in the next section), we believe that our invariance principle for the Bentkus mollifier of arbitrary spectrahedrons, the opening up of the Bentkus mollifier (i.e., understanding the Bentkus functions, which were used almost as a black-box in [HKM13, ST17, OST19]), the Littlewood-Offord theorem for positive spectrahedrons and the Gaussian surface area bound for positive spectrahedrons could be of independent interest.

1.6 Applications

We now briefly discuss how to use the invariance principle to obtain our pseudorandom generator. Our construction is based on the Meka-Zuckerman [MZ13]
PRG construction for fooling halfspaces. We note in passing that this same PRG (with different parameters) was also used by [HKM13, ST17], and a slight modification of it by [OST19]. We omit the details of the PRG construction here, referring the interested reader to Section 6.3 for an explicit construction.

One subtlety in going from the invariance principle to fooling the MZ generator is the following: recall that our invariance principles showed that the expected value under the uniform distribution is close to the expected value under the Gaussian distribution. However, in order to fool the MZ generator, one needs to show that the invariance principle proofs also hold for k-wise independent distributions. In this direction, we first use a neat trick from [OST19] which shows that, in order to establish invariance principles for k-wise independent distributions, it suffices to show just Boolean anti-concentration; second, we crucially use the fact that the matrix Rosenthal inequality can be derandomized by analyzing its original proof. Put together, this shows that our invariance principle proof holds for k-wise independent distributions and gives us our main PRG result.
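For intuition about limited independence, here is the textbook pairwise-independent (k = 2) construction over GF(2) — output bit i is the seed's inner product with (binary(i), 1) — with seed length log n + 1. This is a generic construction for illustration, not the Meka-Zuckerman generator itself:

```python
import itertools
import numpy as np

def pairwise_bits(seed, n):
    """Expand an (m+1)-bit seed into n pairwise-independent bits over GF(2):
    bit_i = <seed, (binary(i), 1)> mod 2, for i = 0..n-1."""
    m = len(seed) - 1
    out = []
    for i in range(n):
        vec = [(i >> j) & 1 for j in range(m)] + [1]
        out.append(sum(s * v for s, v in zip(seed, vec)) % 2)
    return out

n, m = 8, 3   # n = 2^m output bits from an (m+1)-bit seed
# Exhaustively verify pairwise independence over all 2^(m+1) seeds.
seeds = list(itertools.product([0, 1], repeat=m + 1))
samples = np.array([pairwise_bits(list(s), n) for s in seeds])

for i, j in itertools.combinations(range(n), 2):
    for a, b in itertools.product([0, 1], repeat=2):
        frac = np.mean((samples[:, i] == a) & (samples[:, j] == b))
        assert abs(frac - 0.25) < 1e-9   # every pair is uniform on {0,1}^2
print("pairwise independence verified for all", n, "output bits")
```

The general k-wise case replaces the inner products with evaluations of a random degree-(k−1) polynomial over a finite field, at a seed cost of roughly k log n.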
Result 5 (PRG for positive spectrahedrons). Let S be a (τ, M)-regular positive spectrahedron. There exists a PRG G : {0, 1}^r → {−1, 1}^n with seed length r = (log n) · poly(log k, M, 1/δ) that δ-fools S with respect to the uniform distribution, for every τ ≤ poly(δ/(log k · M)).

Learning geometric objects is a fundamental problem in computational learning theory. An application of upper bounding the noise sensitivity or Gaussian surface area of spectrahedrons (in Theorem 2) is in agnostic learning. The agnostic learning framework introduced by [KSS94, Hau92] is the following: let
C ⊆ {c : {−1, 1}^n → {0, 1}} be a concept class and D : {−1, 1}^n × {0, 1} → [0,
1] be a distribution. Define

opt(C) = min_{c∈C} Pr_{(x,b)∼D}[c(x) ≠ b],

i.e., the error of the best approximation to D from within the concept class. The goal of an agnostic learner is the following: given many samples (x, b) ∼ D, produce a hypothesis h : {−1, 1}^n → {0, 1} which satisfies

Pr_{(x,b)∼D}[h(x) ≠ b] ≤ opt(C) + ε.

Note that if opt(C) = 0, this is the standard PAC learning framework, and agnostic learning models learnability under adversarial noise. A natural restriction of this model is when the marginal of D on the first n bits is the uniform distribution on {−1, 1}^n. It is a folklore result [KOS04] that a function f having low noise sensitivity can be approximated by low-degree polynomials (see [HKM13, Lemma 2.7] for an explicit statement). Furthermore, the well-known L1-polynomial regression algorithm [KKMS08] shows how to learn low-degree polynomials in the agnostic framework. Putting these two connections together gives us the following theorem.

Result 6 (Learning positive spectrahedrons). The concept class of positive spectrahedrons (in n variables with k × k symmetric matrices) can be agnostically learned under the uniform distribution in time n^{O(log k)} for every constant error parameter.

The previous best known result [KOS08] for learning positive spectrahedrons, even in the PAC model, was 2^{O(n^{1/2})} (as far as we are aware); our result provides a substantially better complexity.

Discrepancy sets for spectrahedrons

Understanding discrepancy sets for convex objects is a fundamentally important problem in the fields of convex geometry, optimization, and a range of other areas. Prior works of [HKM13, ST17, OST19] constructed such discrepancy sets for polytopes, but a natural question is to extend their constructions to spectrahedrons.
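As a toy illustration of the counting problem behind discrepancy sets (a hypothetical 2-variable, 2 × 2 instance of ours, not an example from the paper): the {−1, 1}^n-volume of a positive spectrahedron can be computed exactly by enumeration, and a discrepancy set is a small set whose empirical fraction δ-approximates this volume.

```python
from itertools import product

# Hypothetical toy instance: x1*A1 + x2*A2 <= B with PSD A1, A2 (k = 2).
A1 = [[1.0, 0.0], [0.0, 0.0]]
A2 = [[0.5, 0.5], [0.5, 0.5]]
B  = [[1.0, 0.0], [0.0, 1.0]]

def psd_2x2(M):
    # A symmetric 2x2 matrix is PSD iff its trace and determinant are >= 0.
    tr  = M[0][0] + M[1][1]
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return tr >= -1e-12 and det >= -1e-12

def in_spectrahedron(x):
    M = [[B[i][j] - x[0] * A1[i][j] - x[1] * A2[i][j] for j in range(2)]
         for i in range(2)]
    return psd_2x2(M)

cube = list(product((-1, 1), repeat=2))
volume = sum(in_spectrahedron(x) for x in cube) / len(cube)
print("fraction of the Boolean cube inside S:", volume)
```

For large n this brute-force count is infeasible, which is exactly why a deterministic small discrepancy set (Result 7 below in spirit) is useful.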
In our context, one application of our main result can be viewed as follows: consider the family of all positive spectrahedrons over the Boolean cube, S = {x ∈ {−1, 1}^n : Σ_i x_i A_i ⪯ B}; can we construct a small subset of the Boolean cube {−1, 1}^n which δ-approximates the {−1, 1}^n-volume of every positive spectrahedron? One way to construct such a set is to construct a PRG for this class of functions. So an immediate corollary of our
PRG for positive spectrahedrons is the following theorem.

Result 7 (Discrepancy set for positive spectrahedrons). There is a deterministic algorithm which, given a (τ, M)-regular positive spectrahedron S, runs in time exp(log n, log k, M, 1/δ) and outputs a δ-approximation of the number of points in {−1, 1}^n contained in S, as long as τ ≤ poly(δ/(M log k)).

Constructing
PRGs for
PTFs has received a lot of attention. However, the best known seed length for fooling a degree-k PTF on n bits scales as O(log n · k) (over the Boolean space). A simple observation we make is that fooling spectrahedrons (on n bits with k × k matrices) can in fact be viewed as the more challenging task of fooling an intersection of k many degree-k PTFs.

Recall that a spectrahedron is given by S = {x ∈ R^n : B − Σ_i x_i A_i ⪰ 0}. Without loss of generality, we may assume that the measure of x satisfying det(B − Σ_i x_i A_i) = 0 is zero. Sylvester's criterion implies that a matrix M (which in our case is B − Σ_i x_i A_i) is positive definite if and only if the determinants of the k leading principal minors of M are all positive. Hence, an alternate characterization of S is, modulo a zero-measure set,

S = ∧_{r=1}^k [det((B − Σ_i x_i A_i)_{r×r}) > 0] = ∧_{r=1}^k sign[p_r(x)],

where M_{r×r} denotes the top-left r × r principal minor of M. Clearly each determinantal expression produces a polynomial p_r of degree at most r. So our main result about fooling S shows that there is a structured class of intersections of degree-k PTFs (i.e., polynomials which can be written as above) which can be fooled by a
PRG with seed length O(log n · log k · M/δ), which is exponentially better than using existing
PRGs for
PTFs. We remark that, a priori, it is not even clear why an arbitrary polynomial should correspond to a spectrahedron as above. However, a well-known result of [HMV06, GM12] states that an arbitrary degree-d polynomial p ∈ R[x_1, . . . , x_n] with real coefficients has a symmetric determinantal representation, i.e., there exist symmetric matrices A_0, A_1, . . . , A_n such that

p(x_1, . . . , x_n) = det(A_0 + Σ_i x_i A_i),

where each A_i ∈ Sym_{(n+d choose d)}. So, if we could fool arbitrary spectrahedrons, that might be a promising avenue to fool PTFs and intersections of
PTFs. We remark that counting integer solutions to positive spectrahedrons is not as naturally motivated as it is for polytopes, but nevertheless understanding discrepancy sets for geometric objects is a fundamental question. See [Qua12] for a simple linear-algebraic proof of the determinantal-representation statement above.

Future work

Our work opens a new line of research into understanding
PRGs for spectrahedrons with several novel techniques. This raises several questions for future work.

1. Can we remove regularity for positive spectrahedrons? One of the crucial techniques that Servedio and Tan [ST17] introduced (inspired by a prior work of Servedio [Ser06]) was decomposing a polytope into head and tail variables (i.e., the tail coordinates of a halfspace satisfy regularity and the head coordinates are the dominant variables). They express the head variables as a CNF, use the result of Bazzi [Baz09] to fool the head variables, and use invariance principles for the tail variables. However, in our setting it is unclear how to break up a single spectrahedron into head and tail variables, and even if this were possible, what is the analogue of the CNF in our setting?

2. Can we fool arbitrary spectrahedrons? Besides the difficulty in removing the regularity condition, another fundamental barrier we face here is anti-concentration. Even the Gaussian surface area of an arbitrary spectrahedron is unknown (as far as we are aware). Our techniques, such as bucketing, using Kane's result [Kan14a], and Boolean anti-concentration [OST19], crucially use the assumption of positivity. Going beyond this might require new understanding of the geometric structure (like average sensitivity and noise sensitivity) of arbitrary spectrahedrons.

3. A general invariance principle for spectral functions? Here, we showed our invariance principle specifically for the Bentkus mollifier. However, like the result of [HKM13], can we prove a general invariance principle for arbitrary smooth spectral functions? Given their applications, invariance principles are now considered powerful techniques in computational complexity theory. Having an invariance principle for spectral functions could find more applications, such as deciding noisy entangled quantum games [Yao19].

4. Can we fool spectrahedral caps?
Let S^{n−1} = {x ∈ R^n : ‖x‖ = 1} denote the n-dimensional sphere; a spectrahedral cap is then the subset of S^{n−1} that is "cut out" by a spectrahedron, i.e., for a spectrahedron S, we define the spectrahedral cap C_S as C_S = S^{n−1} ∩ S. In the polytope setting, fooling spherical caps has received a lot of attention classically [HKM13, KM15] (with almost-optimal seed-length PRGs). Can we similarly fool spectrahedral caps?

5. Fooling polynomial threshold functions? Can we make progress in finding better
PRGs for
PTFs using the techniques we developed here for fooling arbitrary spectrahedrons?
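As a sanity check on the determinantal view connecting spectrahedra and PTFs (a hand-picked toy instance of ours, not the [HMV06, GM12] construction): the polynomial p(x1, x2) = x1·x2 has the symmetric determinantal representation A0 = 0, A1 = diag(1, 0), A2 = diag(0, 1).

```python
from itertools import product

def det2(M):
    return M[0][0] * M[1][1] - M[0][1] * M[1][0]

# Hypothetical toy symmetric determinantal representation of p(x1, x2) = x1*x2.
A0 = [[0, 0], [0, 0]]
A1 = [[1, 0], [0, 0]]
A2 = [[0, 0], [0, 1]]

def rep(x1, x2):
    M = [[A0[i][j] + x1 * A1[i][j] + x2 * A2[i][j] for j in range(2)]
         for i in range(2)]
    return det2(M)

# Verify det(A0 + x1*A1 + x2*A2) = x1*x2 on an integer grid.
for x1, x2 in product(range(-3, 4), repeat=2):
    assert rep(x1, x2) == x1 * x2
print("det(A0 + x1*A1 + x2*A2) agrees with p(x1, x2) = x1*x2 on the grid")
```

The general [HMV06, GM12] representation replaces this hand-picked pair with symmetric matrices of dimension binom(n+d, d).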
Acknowledgements.
We thank Jop Briët and Minglong Qin for helpful comments. This collaboration earlier faced some bureaucratic issues. We are deeply grateful for the support from Jelani Nelson, Kewen Wu, Yitong Yin and others in the TCS community. P.Y. was supported by the National Key R&D Program of China 2018YFB1003202, the National Natural Science Foundation of China (Grant No. 61972191), the Program for Innovative Talents and Entrepreneurs in Jiangsu, the Fundamental Research Funds for the Central Universities 0202/14380068 and the Anhui Initiative in Quantum Information Technologies Grant No. AHY150100. Part of the work was done when P.Y. and S.A. were participating in the program "Quantum Wave in Computing" held at the Simons Institute for the Theory of Computing.
Organization.
In Section 2 we introduce all the mathematical preliminaries used in this paper, and state various lemmas in random matrix theory and multidimensional calculus. In Section 3, we introduce the Bentkus mollifier and discuss its properties. In Section 4 we state our main theorem regarding spectral derivatives of smooth functions and go on to bound the spectral derivatives of the Bentkus function (proving a technical lemma in Appendix A). In Section 5 we prove an upper bound on the Gaussian surface area of positive spectrahedrons as well as our Littlewood-Offord theorem for this class. In Section 6 we prove our invariance principle theorem and go on to construct a pseudorandom generator for the class of positive spectrahedrons.
For an integer n ≥
1, let [n] represent the set {1, . . . , n}. Given a finite set X and a natural number k, let X^k be the set X × · · · × X, the k-fold Cartesian product of X. Given a = (a_1, . . . , a_k) and a set S ⊆ [k], we write a_S and a_{−S} to represent the projections of a onto the coordinates specified by S and onto the coordinates outside S, respectively. For any i ∈ [k], a_{−i} represents (a_1, . . . , a_{i−1}, a_{i+1}, . . . , a_k), and a_{<i}, a_{≥i} are defined similarly. Let µ be a probability distribution on X, and let µ(x) represent the probability of x ∈ X according to µ. Let X be a random variable distributed according to µ. We use the same symbol to represent a random variable and its distribution whenever it is clear from the context. The expectation of a function f on X is defined as E[f(X)] = E_{x∼X}[f(x)] = Σ_{x∈X} Pr[X = x] · f(x) = Σ_x µ(x) · f(x), where x ∼ X represents that x is drawn according to X. For any event E(x) on x, [E(x)] represents the indicator function of E. In this paper, the lower-cased letters in bold x, y, z, . . . are reserved for random variables.

Distributions.
Throughout, we denote by G the standard Gaussian distribution N(0, 1) on R, with mean 0 and variance 1. We denote by U_n the uniform distribution on {−1, 1}^n. We say a sequence of random variables X = (x_1, . . . , x_n) is t-wise uniform if any subset of X of size t is uniformly distributed (observe that the uniform distribution is clearly t-wise uniform for every t ≥ 1). A family H of functions [n] → [m] is said to be an r-wise uniform hash family if, for h ∼ H, the tuple (h(1), . . . , h(n)) is r-wise uniform.

For any f : R → R in C^d, the set of all real functions that are d-times differentiable, we use f^{(d)} to denote the d-th derivative of f. Given a function F : R^k → R and a k-dimensional multi-index α = (α_1, . . . , α_k) ∈ N^k, ∂_α F denotes the mixed partial derivative taken α_i times in the i-th coordinate.

Fact 1.
Let k ∈ N and f : R^k → R be a C^d function. Then for all x, y ∈ R^k,

f(x + y) = Σ_{α ∈ N^k : |α| ≤ d−1} (∂_α f(x)/α!) · Π_{i=1}^k y_i^{α_i} + err(x, y),

where α! = α_1! · · · α_k!, |α| = Σ_i α_i, and

|err(x, y)| ≤ sup_{v ∈ R^k} ( Σ_{α ∈ N^k : |α| = d} |∂_α f(v)| ) · max_i |y_i|^d.

For a t-times differentiable function f : R^k → R and s ≤ t, define

‖f^{(s)}‖_1 = max { Σ_{p_1,...,p_s ∈ [k]} |∂_{p_1} · · · ∂_{p_s} f(x)| : x ∈ R^k }.

Definition 2. Let f : R → R. For any distinct inputs x_1, . . . , x_{i+1} ∈ R, the divided difference is defined recursively as follows:

f^{[0]} = f,

f^{[i]}(x_1, . . . , x_{i+1}) = (f^{[i−1]}(x_1, . . . , x_{i−1}, x_i) − f^{[i−1]}(x_1, . . . , x_{i−1}, x_{i+1})) / (x_i − x_{i+1}).

For other values of x_1, . . . , x_{i+1}, f^{[i]} is defined by continuous extension.

Fact 3 (Mean value theorem for divided differences [Boo05]). For any f ∈ C^n and any x_1, . . . , x_{n+1}, there exists ξ ∈ (min{x_1, . . . , x_{n+1}}, max{x_1, . . . , x_{n+1}}) such that

f^{[n]}(x_1, . . . , x_{n+1}) = f^{(n)}(ξ)/n!.

Let f : {−1, 1}^n → {0, 1}, g : R^n → {0, 1} and S be a Borel set in R^n. We define the following combinatorial properties of the Boolean-valued functions f, g.

1. Average sensitivity: AS(f) = Σ_{i=1}^n Pr_x[f(x) ≠ f(x ⊕ e_i)], where the probability is taken uniformly over {−1, 1}^n.

2. ε-Noise sensitivity: NS_ε(f) = Pr_{x,y}[f(x) ≠ f(y)], where the probability is taken according to the following distribution: x is uniformly random in {−1, 1}^n and y is obtained from x by independently flipping each x_i with probability ε.

3. Gaussian noise sensitivity: GNS_ε(g) = Pr_{x,z}[g(x) ≠ g(y)], where x, z are independent random Gaussian vectors distributed as G^n, and y = (1 − ε)x + √(2ε − ε²) · z.

4.
Gaussian surface area: GSA(S) = lim inf_{δ→0} G^n(S_δ \ S)/δ, where S_δ = {x : dist(x, S) ≤ δ} denotes the δ-neighborhood of S under the Euclidean distance.

We refer interested readers to [O'D14] for more on these parameters and their applications to the analysis of Boolean functions.

For any integer k >
0, we use
Mat_k and Sym_k to represent the set of k × k real matrices and the set of k × k real symmetric matrices, respectively. For any matrix X, ‖X‖_p represents the Schatten p-norm of X and ‖X‖ represents the spectral norm of X. I_k represents the k × k identity matrix. The subscript k may be omitted whenever the dimension is clear from the context. We need the following results from matrix analysis.

Fact 4 ([Bha00]). For any k × k real symmetric matrix A, let B be the upper-triangular part of A, namely B_{i,j} = A_{i,j} if i ≤ j and B_{i,j} = 0 otherwise. Then ‖B‖ ≤ O(log k) · ‖A‖.

Fact 5 ([Tro12, Theorem 1.1]). Let n, k ≥ 1 be integers and X_1, . . . , X_n be independent random k × k real symmetric matrices satisfying 0 ⪯ X_i ⪯ R · I for i ∈ [n]. Set µ = λ_min(Σ_{i=1}^n E[X_i]). Then

Pr[λ_min(Σ_{i=1}^n X_i) ≤ (1 − δ)µ] ≤ k · (e^{−δ}/(1 − δ)^{1−δ})^{µ/R}

for every δ ∈ [0, 1).

Fact 6.
For every integer m ≥ 1 and A_1, . . . , A_n ∈ Sym_k, it holds that

E[‖Σ_i g_i A_i‖^m] ≤ (1 + 2m⌈log k⌉)^{m/2} · ‖Σ_i (A_i)²‖^{m/2} and E[‖Σ_i x_i A_i‖^m] ≤ (1 + 2m⌈log k⌉)^{m/2} · ‖Σ_i (A_i)²‖^{m/2},

where the expectations are taken over x ∼ U_n and g ∼ G^n, respectively. Additionally, the second inequality still holds if x is m⌈log k⌉-wise uniform.

Proof. It suffices to prove the second inequality, as the first one follows by the standard bits-to-Gaussians trick [O'D14, Chapter 11]. Let B = Σ_i x_i A_i, where x ∼ U_n. The proof closely follows the argument in [Tro16], where Tropp proved the case m = 1. For any integer p ≥
1, it is proved in [Tro16, Eqs. (4.9), (4.11)] that

E[Tr B^{2p}] ≤ k · ((2p + 1)/e)^p · ‖Σ_i (A_i)²‖^p.

Thus

E[‖B‖^m] ≤ (E[Tr B^{2pm}])^{1/(2p)} ≤ k^{1/(2p)} · ((2pm + 1)/e)^{m/2} · ‖Σ_i (A_i)²‖^{m/2}.

Setting p = ⌈log k⌉, we conclude the result.

Fact 7 (Matrix Rosenthal inequality [MJC+
14, Corollary 7.4]). Let X_1, . . . , X_n be centered, independent random real symmetric matrices. Then

(E[‖Σ_i X_i‖_{2p}^{2p}])^{1/(2p)} ≤ √(4p − 2) · ‖(Σ_i E[X_i²])^{1/2}‖_{2p} + (4p − 2) · (Σ_i E[‖X_i‖_{2p}^{2p}])^{1/(2p)}.

This inequality still holds if X_1, . . . , X_n are p-wise independent.

Let f : R^k → R and λ : Sym_k → R^k, where λ(X) = (λ_1(X), . . . , λ_k(X)) are the eigenvalues of X sorted in non-increasing order. We use λ_max and λ_1 interchangeably. Let F = f ∘ λ : Sym_k → R. If f : R → R is an analytic function on R, namely its Taylor series converges on all of R, we define f(X) for general matrices using its Taylor expansion. It is not hard to see that the Taylor series still converges on matrix inputs. If X is symmetric with a spectral decomposition X = U D U^T, where D = diag(λ_1(X), . . . , λ_k(X)), then f(X) = U diag(f(λ_1(X)), . . . , f(λ_k(X))) U^T.

The Fréchet derivative is a notion of derivative defined on Banach spaces. In this paper, we are only concerned with Fréchet derivatives on matrix spaces. Readers may refer to [Col12] for a more thorough treatment. The Fréchet derivatives are the maps defined as follows.

Definition 8. Given integers m, n ≥ 1, a map F : Mat_m → Mat_n and P, Q ∈ Mat_m, the Fréchet derivative of F at P with respect to Q is defined to be

DF(P)[Q] = (d/dt) F(P + tQ) |_{t=0}.

The k-th order Fréchet derivative of F at P with respect to (Q_1, . . . , Q_k) is defined to be

D^k F(P)[Q_1, . . . , Q_k] = (d/dt) D^{k−1} F(P + tQ_k)[Q_1, . . . , Q_{k−1}] |_{t=0}.

Fréchet derivatives share many common properties with derivatives in Euclidean spaces, such as linearity, composition rules, Taylor expansions, etc. We refer the interested reader to [Col12] for more. Some basic properties of Fréchet derivatives are summarized in Fact 9 below.
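Definition 8 can be checked numerically: for the map F(X) = X², the product rule for Fréchet derivatives (Fact 9, item 2, below) gives DF(P)[Q] = PQ + QP, which a finite-difference quotient reproduces. This is an illustrative sketch of ours in pure Python; all helper names are ours.

```python
def mat_add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def mat_scale(c, A):
    return [[c * a for a in row] for row in A]

def mat_mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def F(X):
    # Test map F(X) = X^2; by the product rule, DF(P)[Q] = PQ + QP.
    return mat_mul(X, X)

P = [[1.0, 2.0], [2.0, 0.5]]
Q = [[0.0, 1.0], [1.0, 3.0]]

t = 1e-6
# Finite-difference approximation of the Frechet derivative (Definition 8):
# DF(P)[Q] ~ (F(P + tQ) - F(P)) / t for small t.
num = mat_scale(1.0 / t, mat_add(F(mat_add(P, mat_scale(t, Q))),
                                 mat_scale(-1.0, F(P))))
exact = mat_add(mat_mul(P, Q), mat_mul(Q, P))
err = max(abs(num[i][j] - exact[i][j]) for i in range(2) for j in range(2))
print("max deviation between finite difference and PQ + QP:", err)
```

The residual is of order t·‖Q²‖, consistent with the first-order Taylor expansion F(P + tQ) = P² + t(PQ + QP) + t²Q².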
Fact 9. [Bha13, Chapter X.4] Given
F, G : Mat_n → Mat_m and P, Q, Q_1, . . . , Q_k ∈ Mat_n, the following hold:

1. D(F + G)(P)[Q] = DF(P)[Q] + DG(P)[Q].

2. D(F · G)(P)[Q] = DF(P)[Q] · G(P) + F(P) · DG(P)[Q].

3. If m = n, then D(F ∘ G)(P)[Q] = DF(G(P))[DG(P)[Q]] (the chain rule).

4. D^k F(P)[Q_1, . . . , Q_k] = D^k F(P)[Q_{σ(1)}, . . . , Q_{σ(k)}] for every k > 1 and permutation σ ∈ S_k.

The following fact states that Fréchet derivatives can be expressed as divided differences.
Fact 10 ([BLZ05]). Let f : R → R be an analytic function and X = diag(x_1, . . . , x_k) be a diagonal matrix whose spectrum is in R. For any matrices A, B, the following hold:

Df(X)[A] = ( f^{[1]}(x_{i_1}, x_{i_2}) · A_{i_1,i_2} )_{1 ≤ i_1,i_2 ≤ k}, (5)

D²f(X)[A, B] = ( Σ_{j=1}^k f^{[2]}(x_{i_1}, x_j, x_{i_2}) · A_{i_1,j} B_{j,i_2} )_{1 ≤ i_1,i_2 ≤ k}. (6)

Fact 11 (Dyson's expansion [Bha13, Chapter X.4]). Let f(x) = e^x. For any X ∈ Sym_k and A ∈ Mat_k, it holds that

Df(X)[A] = ∫_0^1 du · e^{(1−u)X} A e^{uX}.

Lemma 12.
Let f(x) = e^{−x²/2}. It holds that

D²f(X)[A, B] = (1/4) ∫_0^1 du ∫_0^1 dv (1−u) · e^{−(1−u)(1−v)X²/2} (XB + BX) e^{−(1−u)vX²/2} (XA + AX) e^{−uX²/2}
+ (1/4) ∫_0^1 du ∫_0^1 dv u · e^{−(1−u)X²/2} (XA + AX) e^{−u(1−v)X²/2} (XB + BX) e^{−uvX²/2}
− (1/2) ∫_0^1 du e^{−(1−u)X²/2} (AB + BA) e^{−uX²/2}.

(In [BLZ05, Lemma 3.8] this fact is proven when A = B is a symmetric matrix, and it is not hard to generalize their proof to obtain Eqs. (5), (6).) In particular, if A = B = H is a symmetric matrix, then

D²f(X)[H, H] = (1/4) ∫_0^1 du ∫_0^1 dv (1−u) · e^{−(1−u)(1−v)X²/2} (XH + HX) e^{−(1−u)vX²/2} (XH + HX) e^{−uX²/2}
+ (1/4) ∫_0^1 du ∫_0^1 dv u · e^{−(1−u)X²/2} (XH + HX) e^{−u(1−v)X²/2} (XH + HX) e^{−uvX²/2}
− (1/2) ∫_0^1 du e^{−(1−u)X²/2} H² e^{−uX²/2}.

Note that f(x) = e^{−x²/2} is analytic on R. Thus it is valid to define f on arbitrary matrices.

Proof.
For any t ∈ (0, 1], let g(x) = e^{−tx²}. By the definition of the Fréchet derivative,

Dg(X)[A] = lim_{ε→0} (1/ε) (e^{−t(X+εA)²} − e^{−tX²})
= lim_{ε→0} (1/ε) (e^{−t(X² + ε(XA+AX) + ε²A²)} − e^{−tX²})
= lim_{ε→0} (1/ε) (e^{−t(X² + ε(XA+AX))} + O(ε²) − e^{−tX²})
= −t ∫_0^1 du · e^{−(1−u)tX²} (XA + AX) e^{−utX²},

where the third equality is from the fact that ‖e^{X+εY} − e^X‖ = O(ε) and the last equality is from Fact 11. Setting t = 1/2, we have

Df(X)[A] = −(1/2) ∫_0^1 du · e^{−(1−u)X²/2} (XA + AX) e^{−uX²/2}.

Taking one more derivative in X with respect to B, we conclude the result.

Definition 13.
Given τ, M > 0, we say a sequence of k × k positive semidefinite matrices (A_1, . . . , A_n) is (τ, M)-regular if

I ⪯ Σ_{i=1}^n (A_i)² ⪯ M · I and A_i ⪯ τ · I for every i ∈ [n]. (7)

A spectrahedron S ⊆ R^n is the feasible region of a semidefinite program, namely a set S = {x ∈ R^n : Σ_i x_i A_i ⪯ B} for some symmetric matrices A_1, . . . , A_n, B. We say S is a positive spectrahedron if either all the A_i s are positive semidefinite or all the A_i s are negative semidefinite (NSD). Moreover, it is (τ, M)-regular if either (A_1, . . . , A_n) or (−A_1, . . . , −A_n) is (τ, M)-regular. We say S is an intersection of positive spectrahedrons if S = S_1 ∩ S_2, where S_1 and S_2 are positive spectrahedrons whose matrices are all positive semidefinite and all negative semidefinite, respectively. Note that it suffices to consider intersections of two spectrahedrons, as one can pack all the PSD matrices into one large block-diagonal matrix (looking ahead, this will only affect the parameters in our main results by a logarithmic factor). Packing the corresponding B_i s in the same way, one gets a positive spectrahedron; the same holds for all the negative semidefinite matrices.

Pseudorandomness

Definition 14.
A function g : {−1, 1}^r → {−1, 1}^n with seed length r is said to δ-fool a function f : {−1, 1}^n → R if

| E_{s∼U_r}[f(g(s))] − E_{u∼U_n}[f(u)] | ≤ δ.

The function g is said to be an efficient pseudorandom generator (PRG) that δ-fools a class F of n-variable functions if g is computable by a deterministic uniform poly(n)-time algorithm and g δ-fools every function f ∈ F.

For ℓ ≥
1, let T_ℓ be an ℓ-tensor, i.e., T_ℓ : (R^k)^{×ℓ} → R. Note that an ℓ-tensor is uniquely defined by its coefficients {T_{i_1,...,i_ℓ} : i_1, . . . , i_ℓ ∈ [k]}. Below we abuse notation by letting T(i_1, . . . , i_ℓ) = T_{i_1,...,i_ℓ}. Often we will use the natural bijection between 2ℓ-tensors acting on R^k and ℓ-tensors acting on Mat_k, i.e., for a 2ℓ-tensor T : (R^k)^{×2ℓ} → R defined as

T(x^1, . . . , x^{2ℓ}) = Σ_{i_1,...,i_{2ℓ} ∈ [k]} T(i_1, . . . , i_ℓ, i_{ℓ+1}, . . . , i_{2ℓ}) · x^1_{i_1} · · · x^{2ℓ}_{i_{2ℓ}},

we can also view T as T' : (Mat_k)^{×ℓ} → R, defined by rearranging the terms above to obtain

T'(X^1, . . . , X^ℓ) = Σ_{i_1,j_1 ∈ [k]} Σ_{i_2,j_2 ∈ [k]} · · · Σ_{i_ℓ,j_ℓ ∈ [k]} T(i_1, . . . , i_ℓ, j_1, . . . , j_ℓ) · X^1_{i_1,j_1} · · · X^ℓ_{i_ℓ,j_ℓ}.

Finally, we define a "permutation folding" operator which takes a t-tensor on R^k as defined above and a permutation σ ∈ S_t, and produces a t-tensor on Mat_k.

Definition 15 (Definition of diag_σ T). Let T : (R^k)^{×t} → R be a t-tensor and σ ∈ S_t. Then we define diag_σ T : (Mat_k)^{×t} → R as the map

(diag_σ T)((i_1, j_1), . . . , (i_t, j_t)) = T(i_1, . . . , i_t) if ~i = σ~j, (8)

and 0 otherwise.

In this paper, we are interested in smooth approximators of the function ψ : R^k → R defined as

ψ(x) = [max_i x_i ≤ 0]. (9)

To this end, we introduce the mollifier defined by Bentkus in [Ben90] and establish several new properties of it. Readers may refer to [Ben90, FK20] for a more thorough treatment.

Definition 16 ([Ben90]). Let g(x) = ∫_{−∞}^x (1/√(2π)) e^{−t²/2} dt. For every integer k ≥ 1, define G : R^k → R as

G(x_1, . . . , x_k) = Π_{i=1}^k g(x_i).

Properties of the mollifier and its derivatives

It is easy to calculate that

g'(x) = (1/√(2π)) e^{−x²/2}, (10)

g''(x) = −(x/√(2π)) e^{−x²/2}, (11)

g'''(x) = (1/√(2π)) (x² − 1) e^{−x²/2}. (12)

In order to simplify many calculations, we introduce the function

ḡ(x) = g'(x)/g(x). (13)

Fact 17.
[FK20] It holds that

ḡ'(u) = −(u + ḡ(u)) · ḡ(u); (14)

ḡ''(u) = (u² − 1) · ḡ(u) + 3u · ḡ(u)² + 2ḡ(u)³. (15)

Also, ḡ is positive and monotone decreasing on R, and ḡ' is negative and monotone increasing on R.

Fact 18 ([Fel68, Section 7.1]). For any x > 0, it holds that

(e^{−x²/2}/√(2π)) · (1/x − 1/x³) ≤ 1 − g(x) ≤ e^{−x²/2}/(x√(2π)).

The following lemma immediately follows from Fact 17 and Fact 18.
Lemma 19.
For any ∆ ≥ 1 and x ∈ R with |x| ≤ ∆, it holds that

|ḡ(x)| ≤ O(∆), |ḡ'(x)| ≤ O(∆) · |ḡ(x)|, |ḡ''(x)| ≤ O(∆²) · |ḡ(x)|.

Fact 20 ([Ben90]). For any integers t, k ≥ 1 and any x ∈ R^k,

‖G^{(t)}(x)‖_1 ≤ C_t log^{t/2}(k + 1) (16)

for some constant C_t depending only on t.

Lemma 21.
For any x ∈ R^k, if there exist more than 2 log k indices satisfying x_i ≤ 0, then ‖G^{(1)}(x)‖_1 ≤ O(1/k).

Proof. Note that g(z) ≤ 1/2 if z ≤
0. Let T = {i : x_i ≤ 0}. Then

‖G^{(1)}(x)‖_1 = Σ_{i=1}^k |g'(x_i) · Π_{j≠i} g(x_j)| = Σ_{i∈T} |g'(x_i) · Π_{j≠i} g(x_j)| + |Σ_{i∉T} g'(x_i) · Π_{j≠i} g(x_j)|
≤ |T| · (1/2)^{|T|−1} + (1/2)^{|T|} · |Σ_{i∉T} g'(x_i) · Π_{j≠i, j∉T} g(x_j)|
≤ |T| · (1/2)^{|T|−1} + 2√k · (1/2)^{|T|},

where the equality used that the terms are all positive and the second inequality is from Fact 20. The upper bound is O(1/k) if |T| ≥ 2 log k.

Claim 22.
For any x > y, it holds that

|(g(x)g'(y) − g'(x)g(y)) / (x − y)| ≤ (1 + |x|) exp(−y²/2) = (1 + |x|) · g'(y) · √(2π). (17)

Proof.

|(g(x)g'(y) − g'(x)g(y)) / (x − y)|
= (1/2π) |∫_{−∞}^0 [exp(−(y² + (t + x)²)/2) − exp(−(x² + (t + y)²)/2)] / (x − y) dt|
≤ (1/2π) exp(−(x² + y²)/2) ∫_{−∞}^0 |exp(−t²/2)| · |(exp(−ty) − exp(−tx)) / (x − y)| dt
= (1/2π) exp(−(x² + y²)/2) ∫_{−∞}^0 |exp(−t²/2 − tx)| · |(1 − exp(−t(y − x))) / (y − x)| dt
≤ (1/2π) exp(−(x² + y²)/2) ∫_{−∞}^0 |exp(−t²/2 − tx) · t| dt
= (1/2π) exp(−y²/2) ∫_{−∞}^0 |exp(−(t + x)²/2) · t| dt
= (1/2π) exp(−y²/2) (exp(−x²/2) + √(2π)x − x ∫_x^∞ e^{−t²/2} dt)
≤ (1 + |x|) exp(−y²/2),

where the second inequality used |1 − e^{−z}| ≤ |z|.

For every θ > 0 and α ∈ R, we define the Bentkus mollifier as follows:

G_θ(x) = Pr_{g∼G^k}[x + α + θ g ≤
0]. (18)

It is not hard to verify that

G_θ(x) = Π_{i=1}^k ∫_{−∞}^{−x_i/θ} (1/√(2π)) e^{−t²/2} dt = G(−x_1/θ, . . . , −x_k/θ).

The following fact states that G_θ(· + α) and G_θ(· − α) are good approximators of ψ defined in Eq. (9), except on a small inner/outer region near the "boundary", which is made precise below.

Fact 23 (Lemma 6.7 and Fact 6.8 in [OST19]). For any δ, θ ∈ (0, 1) and x ∈ R^k, there exist Λ = Θ(θ · √(log(k/δ))) and α = Θ(θ · √(log(k/δ))) such that the following holds:

1. |G_θ(x + α) − ψ(x)| ≤ δ if max_i x_i ≤ −Λ;

2. |G_θ(x − α) − ψ(x)| ≤ δ if max_i x_i ≥ Λ;

3. G_θ(x + α) − δ ≤ ψ(x) ≤ G_θ(x − α) + δ for all x ∈ R^k,

where x + α denotes (x_1 + α, . . . , x_k + α).

Let A_i = diag(A_i^1, A_i^2) and D = diag(D^1, D^2) be block-diagonal matrices. To keep the notation succinct, we set A(x) = Σ_i x_i A_i − D.

Fact 24 ([OST19, Lemma 6.9]). Let k, δ, θ, Λ, α be parameters satisfying Fact 23. Let Ψ, Ψ_θ : Sym_k → R be the functions defined as Ψ(M) = ψ(λ(M)) and Ψ_θ(M) = G_θ(λ(M)), where ψ is defined in Eq. (9) and G_θ is defined in Eq. (18), and let x and x' be two random variables in R^n satisfying

|E[Ψ_θ(A(x) + βI)] − E[Ψ_θ(A(x') + βI)]| ≤ η

for both β = α and β = −α. Then it holds that

|E[Ψ(A(x))] − E[Ψ(A(x'))]| ≤ η + 3δ + Pr[λ_max(A(x)) ∈ (−Λ, Λ]].

Before we describe the main theorem of this section, we need notation introduced by Sendov in [Sen07] (Definition 25 below) to calculate the high-order Fréchet derivatives of spectral functions.
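Before moving on, a quick numeric sanity check (illustrative code of ours, ignoring the α-shift) that the mollifier of Eq. (18) behaves like the indicator ψ of Eq. (9) away from the boundary: G_θ(x) = Π_i Φ(−x_i/θ) is ≈ 1 when max_i x_i ≪ 0 and ≈ 0 when some x_i ≫ 0.

```python
from math import erf, sqrt

def Phi(z):
    # Standard normal CDF via the error function.
    return 0.5 * (1 + erf(z / sqrt(2)))

def G_theta(x, theta):
    """Bentkus mollifier (without the alpha-shift): prod_i Phi(-x_i/theta)."""
    p = 1.0
    for xi in x:
        p *= Phi(-xi / theta)
    return p

def psi(x):
    # The hard indicator being mollified: 1[max_i x_i <= 0].
    return 1.0 if max(x) <= 0 else 0.0

theta = 0.1
inside  = (-1.0, -0.8, -1.5)      # max coordinate well below 0
outside = (-1.0, 0.9, -1.5)       # one coordinate well above 0
print(G_theta(inside, theta), G_theta(outside, theta))   # ~1 and ~0
```

The transition happens in a window of width Θ(θ√log(k/δ)) around the boundary, which is exactly the region handled by Λ and α in Fact 23.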
Definition 25. [Sen07] Let t ≥ 1 and x ∈ R^k. Let T : (R^k)^{×t} → R be a t-tensor. For every ℓ ∈ [t], define a (t + 1)-tensor T^ℓ_out : (R^k)^{×(t+1)} → R as follows:

(T^ℓ_out)(i_1, . . . , i_{t+1}) = 0 if i_ℓ = i_{t+1}, and

(T^ℓ_out)(i_1, . . . , i_{t+1}) = (T(i_1, . . . , i_{ℓ−1}, i_{t+1}, i_{ℓ+1}, . . . , i_t) − T(i_1, . . . , i_{ℓ−1}, i_ℓ, i_{ℓ+1}, . . . , i_t)) / (x_{i_{t+1}} − x_{i_ℓ}) if i_ℓ ≠ i_{t+1}.

Finally, for every ℓ ∈ [t], define

T_{σ(ℓ)}(x) = ∇f(x) if ℓ = 1 and σ = (1); T_{σ(ℓ)}(x) = (T_σ(x))^ℓ_out if ℓ ≤ t − 1; T_{σ(ℓ)}(x) = ∇T_σ(x) if ℓ = t,

where σ(ℓ) is defined as follows: let σ be a permutation of [k] given in its cycle decomposition; then σ(ℓ) is a permutation of [k + 1] elements whose cycle representation is the same as that of σ, except that the element k + 1 is inserted after the ℓ-th element and before the (ℓ + 1)-th element in the cycle representation of σ.

We are now ready to state the main theorem for computing spectral derivatives.
Theorem 26. [Sen07] Let X ∈ Sym_k be such that the eigenvalues of X are all distinct. Let F : Sym_k → R be a spectral function (i.e., F = f ∘ λ for some f : R^k → R). Then F is t-times differentiable at X if and only if f is t-times differentiable at λ(X).

Moreover, for every σ ∈ S_t and x ∈ R^k, let T_σ(x) : (R^k)^{×t} → R be the t-tensor defined in Definition 25 (which depends on the function f). Then, for every U_1, . . . , U_t ∈ Sym_k, we have

D^t F(X)[U_1, . . . , U_t] = (Σ_{σ∈S_t} diag_σ T_σ(λ(X)))(V^T U_1 V, . . . , V^T U_t V),

where V satisfies X = V diag(λ(X)) V^T and diag_σ T_σ : (Mat_k)^{×t} → R is the t-tensor on Sym_k defined in Definition 15.

In this section, we first work out the relevant quantities needed to compute the spectral derivatives of smooth functions.
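A small numeric check of Fact 10 (the first-order divided-difference, i.e., Daleckii-Krein, formula) for f = exp on a diagonal X; we verify the diagonal entries against finite differences. This is an illustrative sketch of ours; all names are ours.

```python
from math import exp

def divdiff_exp(a, b):
    # First divided difference exp^[1](a, b), with the confluent case a = b.
    return exp(a) if a == b else (exp(a) - exp(b)) / (a - b)

x = (0.3, -0.7, 1.1)                 # spectrum of the diagonal matrix X
A = [[0.0, 1.0, -2.0],
     [1.0, 0.5, 0.3],
     [-2.0, 0.3, -1.0]]

# Fact 10: (D exp(X)[A])_{ij} = exp^[1](x_i, x_j) * A_{ij} for diagonal X.
D = [[divdiff_exp(x[i], x[j]) * A[i][j] for j in range(3)] for i in range(3)]

# Cross-check the diagonal entries against a finite difference: for i = j
# the formula reads (D exp(X)[A])_{ii} = exp(x_i) * A_{ii}.
t = 1e-7
for i in range(3):
    fd = (exp(x[i] + t * A[i][i]) - exp(x[i])) / t
    assert abs(fd - D[i][i]) < 1e-5
print("diagonal entries of D exp(X)[A] match finite differences")
```

The off-diagonal entries carry the genuine divided differences, which is where the tensors of Definition 25 and Theorem 26 come from at higher orders.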
Theorem 27.
Let k, n ≥ . Let f : R k → R be a -times differentiable symmetric function and λ : Sym k → R k be the map λ ( M ) = ( λ ( M ) , . . . , λ k ( M )) for every M ∈ Sym k . Let F : Sym k → R be defined as F ( M ) = ( f ◦ λ )( M ) for all M ∈ Sym k . Then, for every P ∈ Sym k with distinct eigenvalues and H ∈ Sym k , let P = V (diag ( λ ( P ))) V T be a spectral decomposition of P and H = V QV T . Then D F ( P ) [ Q, Q, Q ] is the summation of the following terms.1. P i ∇ i ,i ,i f ( x ) H i ,i P i = i ∇ i ,i ,i f ( x ) H i ,i H i ,i P i = i = i ( ∇ i ,i ,i f ( x )) · H i ,i H i ,i H i ,i P i = i (cid:18) ∇ i ,i −∇ i ,i x i − x i − ∇ i −∇ i ( x i − x i ) (cid:19) f ( x ) H i ,i H i ,i P i = i = i ∇ i ,i −∇ i ,i x i − x i f ( x ) H i ,i H i ,i P i = i = i (cid:16) ∇ i −∇ i ( x i − x i )( x i − x i ) − ∇ i −∇ i ( x i − x i )( x i − x i ) (cid:17) f ( x ) H i ,i H i ,i H i ,i P i = i = i (cid:16) ∇ i −∇ i ( x i − x i )( x i − x i ) − ∇ i −∇ i ( x i − x i )( x i − x i ) (cid:17) f ( x ) H i ,i H i ,i H i ,i , For more intuition, consider a simple example: let σ = (12)(3) be a permutation on [3], then σ ( · ) is a permutationon [4] defined as follows: σ (1) is (142)(3), similarly σ (2) = (124)(3), σ (3) = (12)(34), σ (4) = (12)(3)(4). Think of x ∈ R k as the eigenvalues of X ∈ Sym k , i.e., x = λ ( X ). here x = ( λ ( P ) , . . . , λ k ( P )) .Proof. To prove this theorem, we first apply Theorem 26 for t = 3 to obtain D F ( P ) [ Q, Q, Q ] = X σ ∈ S diag σ T σ ( λ ( P )) ( H, H, H ) . (19)We next carefully express each quantity in the summation using the definition of these tensors andupper bound each term. To this end, we break down all the six elements of S and analyze themseparately as follows. Case 1: σ = (1)(2)(3) . Then T σ ( x ) = ∇ f ( x ). Case 2: σ = (12)(3) . 
First note that we have for σ = (12) and (cid:0) T (12) ( x ) (cid:1) i ,i = ( i = i x i − x i · ( ∇ i − ∇ i ) f ( x ) i = i Now, in order to compute T (12)(3) , we need to compute ∇ T (12) ( x ) which can be written as follows (cid:0) T (12)(3) ( x ) (cid:1) i ,i ,i = i = i x i − x i · (cid:16) ∇ i ,i − ∇ i ,i (cid:17) f ( x ) − x i − x i ) · ( ∇ i − ∇ i ) f ( x ) i = i = i x i − x i · (cid:16) ∇ i ,i − ∇ i ,i (cid:17) f ( x ) + x i − x i ) · ( ∇ i − ∇ i ) f ( x ) i = i = i x i − x i · (cid:16) ∇ i ,i − ∇ i ,i (cid:17) f ( x ) i = i = i Case 3: σ = (13)(2) . First note that for σ = (1)(2), we have T (1)(2) = ∇ f and σ (1) = (13)(2).So, we need to compute (cid:0) ∇ f (cid:1) f ( x ) and we get (cid:0) T (13)(2) ( x ) (cid:1) i ,i ,i = ( i = i x i − x i · (cid:16) ∇ i ,i − ∇ i ,i (cid:17) f ( x ) i = i Case 4: σ = (1)(23) . First note that for σ = (1)(2), we have T (1)(2) = ∇ f and σ (2) = (1)(23).So, we need to compute (cid:0) ∇ f (cid:1) f ( x ) and we get (cid:0) T (1)(23) ( x ) (cid:1) i ,i ,i = ( i = i x i − x i · (cid:16) ∇ i ,i − ∇ i ,i (cid:17) f ( x ) i = i Case 5: σ = (123) . Let σ = (12), then σ (2) = (123). So we need to compute (cid:0) T (12) (cid:1) f ( x ) andwe obtain (cid:0) T (123) ( x ) (cid:1) i ,i ,i = x i − x i ) · ( ∇ i − ∇ i ) f ( x ) i = i = i x i − x i ) · ( ∇ i − ∇ i ) f ( x ) i = i = i x i − x i )( x i − x i ) · ( ∇ i − ∇ i ) f ( x ) − x i − x i )( x i − x i ) · ( ∇ i − ∇ i ) f ( x ) i = i = i ase 6: σ = (132) . Let σ = (12), then στ (1) = (132). So we need to compute (cid:0) T (12) (cid:1) f ( x ) andwe obtain. 
(cid:0) T (132) ( x ) (cid:1) i ,i ,i = − x i − x i ) · ( ∇ i − ∇ i ) f ( x ) i = i = i x i − x i ) · ( ∇ i − ∇ i ) f ( x ) i = i = i x i − x i )( x i − x i ) · ( ∇ i − ∇ i ) f ( x ) − x i − x i )( x i − x i ) · ( ∇ i − ∇ i ) f ( x ) i = i = i X σ ∈ S T σ ( x )( H, H, H ) = X σ X i ,i ,i ( T σ ( x )) i ,i ,i H i ,i σ (1) H i ,i σ (2) H i ,i σ (3) Let’s write this out as follows: by T i , we mean T case ( i ) above X i ,i ,i ( T ) i ,i ,i H i ,i H i ,i H i ,i + ( T ) i ,i ,i H i ,i H i ,i H i ,i + ( T ) i ,i ,i H i ,i H i ,i H i ,i + ( T ) i ,i ,i H i ,i H i ,i H i ,i + ( T ) i ,i ,i H i ,i H i ,i H i ,i + ( T ) i ,i ,i H i ,i H i ,i H i ,i and in particular, assuming H is symmetric the above simplifies to X i ,i ,i ( T ) i ,i ,i H i ,i H i ,i H i ,i + ( T ) i ,i ,i H i ,i H i ,i + ( T ) i ,i ,i H i ,i H i ,i + ( T ) i ,i ,i H i ,i H i ,i + ( T ) i ,i ,i H i ,i H i ,i H i ,i + ( T ) i ,i ,i H i ,i H i ,i H i ,i (20)Now, we will break up this sum into 5 cases as follows which we need to upper bound Case (i): i = i = i . Then Eq. (20) reduces to the following X i ,i H i ,i H i ,i ( T + T ) + H i ,i H i ,i ( T + T + T + T ) (21)Note that when we say T q above, we mean ( T q ) i ,i ,i = ( T q ) i ,i ,i (since i = i ). Let us now plugin the values of the corresponding T q s into the formula and rewrite the above as follows X i = i H i ,i H i ,i (cid:0) ∇ i ,i ,i f ( x ) + 0 (cid:1) ++ H i ,i H i ,i ∇ i ,i − ∇ i ,i x i − x i + ∇ i − ∇ i ( x i − x i ) + ∇ i ,i − ∇ i ,i x i − x i + ∇ i − ∇ i ( x i − x i ) ! f ( x )= X i = i H i ,i H i ,i (cid:0) ∇ i ,i ,i f ( x ) (cid:1) + 2 H i ,i H i ,i ∇ i ,i − ∇ i ,i x i − x i + ∇ i − ∇ i ( x i − x i ) ! f ( x ) (22) Case (ii): i = i = i . Then Eq. 
(20) reduces to X i ,i H i ,i H i ,i ( T + T ) + H i ,i H i ,i ( T + T + T + T ) (23)23he above simplies to the following X i = i H i ,i H i ,i (cid:0) ∇ i ,i ,i f ( x ) + 0 (cid:1) ++ H i ,i H i ,i ∇ i ,i − ∇ i ,i x i − x i + ∇ i ,i − ∇ i ,i x i − x i + ∇ i − ∇ i ( x i − x i ) + ∇ i − ∇ i ( x i − x i ) ! f ( x )= X i = i H i ,i H i ,i (cid:0) ∇ i ,i ,i f ( x ) (cid:1) + 2 H i ,i H i ,i ∇ i ,i − ∇ i ,i x i − x i + ∇ i − ∇ i ( x i − x i ) ! f ( x ) (24) Case (iii): i = i = i . Then Eq. (20) reduces to X i ,i H i ,i H i ,i ( T + T ) + H i ,i H i ,i ( T + T + T + T ) (25)The above simplifies to the following X i = i H i ,i H i ,i (cid:0) ∇ i ,i ,i f ( x ) + 0 (cid:1) ++ H i ,i H i ,i ∇ i ,i − ∇ i ,i x i − x i − ∇ i − ∇ i ( x i − x i ) + ∇ i ,i − ∇ i ,i x i − x i − ∇ i − ∇ i ( x i − x i ) ! f ( x )= X i = i H i ,i H i ,i (cid:0) ∇ i ,i ,i f ( x ) (cid:1) + 2 H i ,i H i ,i ∇ i ,i − ∇ i ,i x i − x i − ∇ i − ∇ i ( x i − x i ) ! f ( x ) (26) Case (i)+ Case (ii)+ Case (iii).
We first upper bound these three cases to get the desiredupper bound in the theorem statement. First summing the three cases, we have X i = i H i ,i H i ,i (cid:0) ∇ i ,i ,i + ∇ i ,i ,i + ∇ i ,i ,i (cid:1) f ( x )+ 6 X i = i H i ,i H i ,i ∇ i ,i − ∇ i ,i x i − x i − ∇ i − ∇ i ( x i − x i ) ! f ( x ) | {z } ( ⋆ ) (27)We now bound ( ⋆ ) using the following claim. Case (iv): i = i = i . Then Eq. (20) reduces to X i H i ,i ( T + T + T + T + T + T ) = X i H i ,i ∇ i ,i ,i f (28)24 ase (v): i = i = i . Then Eq. (20) stays the same and we get X i ,i ,i ( ∇ i ,i ,i f ) · H i ,i H i ,i H i ,i + ∇ i ,i − ∇ i ,i x i − x i f ( x ) H i ,i H i ,i + ∇ i ,i − ∇ i ,i x i − x i f ( x ) H i ,i H i ,i + ∇ i ,i − ∇ i ,i x i − x i f ( x ) H i ,i H i ,i + (cid:18) ∇ i − ∇ i ( x i − x i )( x i − x i ) − ∇ i − ∇ i ( x i − x i )( x i − x i ) (cid:19) f ( x ) H i ,i H i ,i H i ,i + (cid:18) ∇ i − ∇ i ( x i − x i )( x i − x i ) − ∇ i − ∇ i ( x i − x i )( x i − x i ) (cid:19) f ( x ) H i ,i H i ,i H i ,i (29)This concludes the proof of the theorem statement. We now state the main theorem we prove using the theorem above. Let G : R k → R be the Bentkusfunction given in Definition 16. Theorem 28.
Let k ≥ 2 be an integer and let ψ : Sym_k → R be the function defined as ψ(M) = (G ∘ λ)(M), where G is given in Definition 16. Given ∆ ≥ 0 and X ∈ Sym_k with eigenvalues λ(X) = (x_1, ..., x_k) satisfying ‖X‖ ≤ ∆, it holds that

|D³ψ(X)[H, H, H]| ≤ O(∆ · log k · ‖H‖³).

The following corollary follows from the definition of G_θ in Eq. (18) and the chain rule for Fréchet derivatives in Fact 9.

Corollary 29.
Let k ≥ 2 be an integer, θ > 0 and α ∈ R, and let Ψ_θ : Sym_k → R be the function defined as Ψ_θ(M) = (G_θ ∘ λ)(M + αI), where G_θ is given in Eq. (18). Given ∆ ≥ 0 and X ∈ Sym_k with eigenvalues λ(X) = (x_1, ..., x_k) satisfying ‖X‖ ≤ ∆, it holds that

|D³Ψ_θ(X + αI)[H, H, H]| ≤ O( ((∆ + α)/θ) · log k · ‖H‖³ ).

In order to prove Theorem 28, we upper bound each of the terms listed in Theorem 27 individually in the following sections (in increasing order of difficulty). Since the calculations are fairly technical, we break the analysis into separate sections for modularity and reader convenience. In Section 4.4.1 we bound the first three terms in Theorem 27 (this is the easy case, since the analysis is very similar to that of [HKM13], though it requires new properties of the Bentkus function); in Sections 4.4.2 and 4.4.3 we bound the fourth and fifth terms (this already deviates from the analysis of [HKM13]); and finally in Section 4.5 we bound the sixth and seventh terms (this calculation is fairly involved and deviates significantly from prior works, since we need to deal with properties of Fréchet derivatives, new properties of the Bentkus function, and the non-diagonal entries of the matrices H, which is unique to the matrix-spectrahedron case and is not faced in [HKM13, ST17, OST19]).

As spectral functions and spectral norms are unitarily invariant, we assume without loss of generality that X = diag(x_1, ..., x_k) is diagonal. To apply Theorem 27, we further assume that x_1, ..., x_k are all distinct; the general result then follows by continuity.

4.4 Bounding terms (1)-(5) in Theorem 27 for the Bentkus function

Let G : R^k → R be the Bentkus function given in Definition 16. Recall that G(x) = Π_i g(x_i), where g(x) = (1/√(2π)) ∫_{−∞}^{−x} e^{−t²/2} dt. Recall the notation g′(x) = −(1/√(2π)) e^{−x²/2} and g_1(x) = g′(x)/g(x).

4.4.1 Bounding terms (1), (2), (3) in Theorem 27

Lemma 30 (Bounding terms (1), (2), (3)).
The following three terms

|Σ_i ∇_{i,i,i}G(x) · H_{i,i}³|,  |Σ_{i1≠i2} ∇_{i1,i1,i2}G(x) · H_{i1,i1}² H_{i2,i2}|,  |Σ_{i1≠i2≠i3} ∇_{i1,i2,i3}G(x) · H_{i1,i1} H_{i2,i2} H_{i3,i3}|

can be upper bounded by O(log^{1.5} k · ‖H‖³).

Proof. The first upper bound is straightforward. Observe that

|Σ_i ∇_{i,i,i}G(x) H_{i,i}³| ≤ max_i |H_{i,i}|³ · Σ_i |∇_{i,i,i}G(x)| ≤ max_i |H_{i,i}|³ · ‖G^{(3)}(x)‖ ≤ ‖H‖³ · O(log^{1.5} k),

where the second inequality follows by the definition of ‖G^{(3)}‖, and the last inequality used max_{i,j} |H_{i,j}| ≤ ‖H‖ (the latter being the spectral norm of H) and Fact 20 to conclude ‖G^{(3)}‖ ≤ O(log^{1.5} k). Similarly, the remaining two terms can be bounded exactly as above (by observing that Σ_{i1≠i2} ∇_{i1,i1,i2}G and Σ_{i1≠i2≠i3} ∇_{i1,i2,i3}G also appear in the expression for ‖G^{(3)}‖).

4.4.2 Bounding term (4) in Theorem 27

In order to bound the remaining terms in Theorem 27, we need the following claim.
Claim 31.
It holds that1. P i = i g ( x i ) (cid:12)(cid:12)(cid:12) G ( x ) H i ,i H i ,i (cid:12)(cid:12)(cid:12) ≤ O (cid:16) √ log k · k H k (cid:17) . P i = i g ( x i ) (cid:12)(cid:12)(cid:12) G ( x ) H i ,i H i ,i (cid:12)(cid:12)(cid:12) ≤ O (cid:16) √ log k · k H k (cid:17) . P i = i = i (cid:12)(cid:12)(cid:12) g ( x i ) g ( x i ) G ( x ) H i ,i H i ,i (cid:12)(cid:12)(cid:12) ≤ O (cid:16) log k · k H k (cid:17) .Proof. For Item 1, we have X i = i g ( x i ) (cid:12)(cid:12) G ( x ) H i ,i H i ,i (cid:12)(cid:12) ≤ X i g ( x i ) G ( x ) · max i X i (cid:12)(cid:12) H i ,i H i ,i (cid:12)(cid:12) ≤ k G (1) ( x ) k k H k where the last inequality is becausemax i X i (cid:12)(cid:12) H i ,i H i ,i (cid:12)(cid:12) ≤ k H k max i (cid:0) H (cid:1) i ,i ≤ k H k , (30)using the fact that max ij | H ij | ≤ k H k . Using Fact 20 shows the first inequality. Item 2 follows bythe same reason. 26or Item 3, we have X i = i = i (cid:12)(cid:12) g ( x i ) g ( x i ) G ( x ) H i ,i H i ,i (cid:12)(cid:12) = X i = i | g ( x i ) g ( x i ) G ( x ) | max i ,i X i (cid:12)(cid:12) H i ,i H i ,i (cid:12)(cid:12) ≤ O (cid:16) log k · k H k (cid:17) where the inequality is from Fact 20 and the fact that X i (cid:12)(cid:12) H i ,i H i ,i (cid:12)(cid:12) ≤ k H k · (cid:0) H (cid:1) i ,i ≤ k H k . (31) Lemma 32 (Bounding terms (4) in Theorem 27) . We have X i = i H i ,i H i ,i ∇ i ,i G − ∇ i ,i Gx i − x i − ∇ i G − ∇ i G ( x i − x i ) ! ≤ O (cid:16) ∆ · p log k k H k (cid:17) . Proof.
First observe that ∇ i G ( x ) = g ′ ( x i ) Y j = i G ( x j ) = g ( x i ) · G ( x ) , and similarly we have ∇ i ,i G ( x ) = g ( x i ) ∇ i G ( x )+ G ( x ) ∇ i g ( x i ) = (cid:0) g ( x i ) − ( x i + g ( x i )) g ( x i ) (cid:1) G ( x ) = − x i g ( x i ) G ( x ) , where we used Fact 17. Now, let us start upper bounding the lemma statement as follows (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) X i = i H i ,i H i ,i ∇ i ,i G − ∇ i ,i Gx i − x i − ∇ i G − ∇ i G ( x i − x i ) !(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ X i = i (cid:12)(cid:12)(cid:12) − g ( x i ) g ( x i ) + x i g ( x i ) x i − x i − g ( x i ) − g ( x i )( x i − x i ) (cid:12)(cid:12)(cid:12) · | G ( x ) · H i ,i H i ,i | = X i = i (cid:12)(cid:12)(cid:12) − g ( x i ) g ( x i ) + x i g ( x i ) x i − x i − g ′ ( ξ i ,i ) x i − x i (cid:12)(cid:12)(cid:12) · | G ( x ) · H i ,i H i ,i | = X i = i (cid:12)(cid:12)(cid:12) g ( x i ) g ( x i ) + x i g ( x i ) x i − x i − ( ξ i ,i + g ( ξ i ,i )) g ( ξ i ,i ) x i − x i (cid:12)(cid:12)(cid:12) · | G ( x ) · H i ,i H i ,i |≤ X i = i (cid:12)(cid:12)(cid:12) x i g ( x i ) − ξ i ,i g ( ξ i ,i ) x i − x i (cid:12)(cid:12)(cid:12) · | G ( x ) · H i ,i H i ,i | | {z } :=(1) + (cid:12)(cid:12)(cid:12) g ( x i ) g ( x i ) − g ( ξ i ,i ) x i − x i (cid:12)(cid:12)(cid:12) · | G ( x ) · H i ,i H i ,i | | {z } :=(2) , where the first equality used the mean-value theorem to obtain a ξ i ,i ∈ [ x i , x i ], second equalityused Eq. (14). We now bound both these terms separately as follows.27 erm 1 upper bound. Note that ξ i ,i is between x i and x i . 
The first term is upper bounded by X i = i (cid:12)(cid:12)(cid:12) x i g ( x i ) − ξ i ,i g ( ξ i ,i ) x i − ξ i ,i (cid:12)(cid:12)(cid:12) · | G ( x ) · H i ,i H i ,i | = X i = i (cid:12)(cid:12)(cid:12)(cid:0) − η i ,i (cid:1) g ( η i ,i ) − η i ,i g ( η i ,i ) (cid:12)(cid:12)(cid:12) · (cid:12)(cid:12) G ( x ) · H i ,i H i ,i (cid:12)(cid:12) ≤ X i = i g ( η i ,i ) (cid:12)(cid:12) G ( x ) · H i ,i H i ,i (cid:12)(cid:12) for some η i ,i between x i and ξ i ,i , where we apply a mean value theorem for the function xg ( x )for the equality and Lemma 19 for the inequality. Note that g ( · ) is nonnegative and monotonedecreasing by Fact 17. Thus the first term is upper bounded by2∆ X i = i max { g ( x i ) , g ( x i ) } (cid:12)(cid:12) G ( x ) · H i ,i H i ,i (cid:12)(cid:12) which, in turn, is upper bounded by O (cid:16) ∆ · √ log k · k H k (cid:17) from Fact 20 and Eqs (30), (31). Term 2 upper bound.
By triangle inequality we upper bound the second term by X i = i (cid:12)(cid:12)(cid:12) g ( x i ) g ( x i ) − g ( ξ i ,i ) x i − x i (cid:12)(cid:12)(cid:12) · | G ( x ) · H i ,i H i ,i |≤ X i = i (cid:12)(cid:12)(cid:12) g ( x i ) g ( x i ) − g ( x i ) x i − x i (cid:12)(cid:12)(cid:12) · | G ( x ) · H i ,i H i ,i | + X i = i (cid:12)(cid:12)(cid:12) g ( x i ) − g ( ξ i ,i ) x i − x i (cid:12)(cid:12)(cid:12) · | G ( x ) · H i ,i H i ,i | . (32)We first upper bound the first quantity in Eq. (32) first as follows. X i = i (cid:12)(cid:12)(cid:12) g ( x i ) g ( x i ) − g ( x i ) x i − x i (cid:12)(cid:12)(cid:12) · | G ( x ) · H i ,i H i ,i | = X i = i (cid:12)(cid:12)(cid:12) g ( x i ) − g ( x i ) x i − x i (cid:12)(cid:12)(cid:12) · | G ( x ) | · | g ( x i ) | · | H i ,i H i ,i | = X i = i | g ′ ( ζ i ,i ) | · | G ( x ) | · | g ( x i ) | · | H i ,i H i ,i |≤ · X i = i | G ( x ) | · | g ( x i ) | · | H i ,i H i ,i |≤ k G (1) k k H k . ≤ O (cid:16) ∆ · p log k · k H k (cid:17) , (33)where ζ i ,i between x i and x i , first inequality uses Fact 19, the second inequality uses Eqs. (30), (31)and the last inequality is from Fact 20.We now bound the second term in Eq. (32) as follows28 i = i (cid:12)(cid:12)(cid:12) g ( x i ) − g ( ξ i ,i ) x i − x i (cid:12)(cid:12)(cid:12) · | G ( x ) · H i ,i H i ,i |≤ X i = i (cid:12)(cid:12)(cid:12) g ( x i ) − g ( ξ i ,i ) ξ i ,i − x i (cid:12)(cid:12)(cid:12) · | G ( x ) · H i ,i H i ,i | (for ξ is between x i and x i )= 2 X i = i (cid:12)(cid:12) g ( η i ,i ) g ′ ( η i ,i ) (cid:12)(cid:12) · | G ( x ) | · (cid:12)(cid:12) H i ,i H i ,i (cid:12)(cid:12) (for some η i ,i between x i and ξ i ,i ) ≤ X i = i | g ( η i ,i ) | · | G ( x ) | · (cid:12)(cid:12) H i ,i H i ,i (cid:12)(cid:12) (Fact 19) ≤ X i = i max { g ( x i ) , g ( x i ) } · | G ( x ) | · (cid:12)(cid:12) H i ,i H i ,i (cid:12)(cid:12) . (Fact 17 and Lemma 19)Further applying Fact 20 and putting together Eqs. 
(31)(30), we conclude that it can be upperbounded by O (cid:16) ∆ √ log k k H k (cid:17) . ) in Theorem 27Lemma 33 (Bounding terms (5) in Theorem 27) . We have (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) X i = i = i ∇ i ,i G ( x ) − ∇ i ,i G ( x ) x i − x i H i ,i H i ,i (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ O (cid:16) ∆ · log k · k H k (cid:17) . Proof. (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) X i = i = i ∇ i ,i G ( x ) − ∇ i ,i G ( x ) x i − x i H i ,i H i ,i (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) = (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) X i = i = i g ( x i ) ( g ( x i ) − g ( x i )) x i − x i G ( x ) H i ,i H i ,i (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) = (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) X i = i = i g ′ ( ξ i ,i ) g ( x i ) G ( x ) H i ,i H i ,i (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (for some ξ i ,i between x i and x i ) ≤ X i = i = i (cid:12)(cid:12) max { g ( x i ) , g ( x i ) } g ( x i ) G ( x ) H i ,i H i ,i (cid:12)(cid:12) ≤ O (cid:16) ∆ · log k · k H k (cid:17) , where the last inequality is from Fact 20 and Eqs. (30)(31). (6 , in Theorem 27 for Bentkus function Let G : R k → R be the Bentkus function given in Definition 16. Recall that G ( x ) = Q i g ( x i ),where g ( x ) = √ π R − x −∞ e − t / dt . Recall the notation g ′ ( x ) = √ π e − x / and g ( x ) = g ′ ( x ) /g ( x ).Restating the terms for convenience. 29 emma 34 (Bounding terms (6 ,
7) in Theorem 27) . (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) X i = i = i g ( x i ) − g ( x i ) x i − x i − g ( x i ) − g ( x i ) x i − x i x i − x i G ( x ) H i ,i H i ,i H i ,i (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ O (cid:16) ∆ log k k H k (cid:17) (34)This is the most involved part. Note that the left hand side is unchanged if we zero out alldiagonal entries of H . And further note that k H − diag ( H ) k ≤ k H k where diag ( H ) is a diagonalmatrix obtained by diagonalizing H . Thus, we may assume that the diagonal elements in H arezero without loss of generality. We break down the analysis into two cases (the first one being thesimpler case). x i s. The simpler case is when the number of negative x i s is “large”. Lemma 35. If |{ i : x i < }| > k , then the quantity in Eq. (34) is upper bounded by O (cid:16) ∆ √ log kk · k H k (cid:17) .Proof. Applying Fact 3 a mean value theorem of divided difference and Lemma 19, the term inEq. (34) is upper bounded by O ∆ X i = i = i g ( ζ i ,i ,i ) G ( x ) | H i ,i H i ,i H i ,i |≤ O ∆ X i = i = i max { g ( x i ) , g ( x i ) , g ( x i ) } G ( x ) | H i ,i H i ,i H i ,i |≤ O ∆ k G (1) k max i X i ,i | H i ,i H i ,i H i ,i | ≤ O ∆ · k · k G (1) ( x ) k · max i ,i X i | H i ,i H i ,i H i ,i | ! ≤ O (cid:16) ∆ · k · k G (1) ( x ) k · k H k (cid:17) ≤ O (cid:18) ∆ √ log kk · k H k (cid:19) where the first inequality is from the positivity and monotonicity of g ( · ) due to Fact 17 to concludethat | g ( ζ i ,i ,i ) | ≤ max {| g ( x i ) | , | g ( x i ) | , | g ( x i ) |} ; the second last inequality is from the follow-ing fact X i | H i ,i H i ,i H i ,i | ≤ k H k vuut X i H i ,i ! X i H i ,i ! ≤ k H k ; (35)the last inequality is from Lemma 21 (which uses that the number of negative x i s is ≤ k ).30 .5.2 Case 2: A few negative x i s We now assume that |{ i : x i < }| ≤ k and this case the most complicated and upper boundingit is the most technical. We push this proof to Appendix A. 
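The norm bound on H − diag(H) invoked before Lemma 35 (that zeroing out the diagonal of H at most doubles the spectral norm) follows in one line from the triangle inequality; this derivation is our reconstruction, not verbatim from the paper:

```latex
\|H-\mathrm{diag}(H)\| \;\le\; \|H\|+\|\mathrm{diag}(H)\|
\;=\; \|H\|+\max_i|H_{ii}|
\;=\; \|H\|+\max_i|e_i^{\top}He_i|
\;\le\; 2\|H\|,
```

since |e_iᵀ H e_i| ≤ ‖H‖ for every standard basis vector e_i.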
Proof of Theorem 28.
Combining Theorem 27 and Lemmas 30, 32, 33, 34, we obtain our result.
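The derivative identities of the product mollifier used repeatedly in the lemmas above (∇_iG(x) = g_1(x_i)G(x) and ∇_{i,i}G(x) = −x_i g_1(x_i)G(x), from the proof of Lemma 32) are easy to check numerically. A small sketch, assuming g(x) = Φ(−x) with Φ the standard normal CDF (our reading of Definition 16):

```python
import math

def Phi(t):                 # standard normal CDF
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def g(x):                   # g(x) = Pr[N(0,1) <= -x]
    return Phi(-x)

def g1(x):                  # g1 = g'/g, with g'(x) = -exp(-x^2/2)/sqrt(2*pi)
    return -math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi) / g(x)

def G(x):                   # product mollifier G(x) = prod_i g(x_i)
    p = 1.0
    for xi in x:
        p *= g(xi)
    return p

x, i, h = [0.3, -1.1, 0.7], 1, 1e-5

def partial(i, pt, h=1e-5):  # central finite difference in coordinate i
    up = list(pt); up[i] += h
    dn = list(pt); dn[i] -= h
    return (G(up) - G(dn)) / (2 * h)

# first identity:  dG/dx_i = g1(x_i) * G(x)
assert abs(partial(i, x) - g1(x[i]) * G(x)) < 1e-7

# second identity: d^2 G / dx_i^2 = -x_i * g1(x_i) * G(x)
up = list(x); up[i] += h
dn = list(x); dn[i] -= h
second = (G(up) - 2 * G(x) + G(dn)) / h ** 2
assert abs(second - (-x[i]) * g1(x[i]) * G(x)) < 1e-4
```

The second identity is exactly the simplification via Fact 17 (g_1′(x) = −(x + g_1(x)) g_1(x)) used at the start of the proof of Lemma 32.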
In this section we prove certain combinatorial and geometric properties of positive spectrahedrons. Understanding the surface area of a convex object is a fundamental question in convex geometry. In the context of theoretical computer science, one of the earliest works, by Klivans, O’Donnell and Servedio [KOS04], related learnability of geometric convex objects (in the PAC and agnostic settings) to a natural complexity measure of
Gaussian surface area (GSA). Recall that for a convex object S ⊆ R^n, we have

GSA(S) = liminf_{δ→0} G_n(S_δ \ S)/δ,

where S_δ = {x : dist(x, S) ≤ δ} denotes the δ-neighborhood of S under the Euclidean distance and G_n denotes the standard Gaussian measure on R^n. In some sense, the work of [KOS04] showed that the GSA of convex objects characterizes the learnability of these objects under the Gaussian distribution. This remarkable connection has provided further motivation to understand the
GSA of basic well-studied convex sets. In this direction, a well-known result of Ball gives an upper bound on the surface area of arbitrary convex objects.
Theorem 36. [Bal93] The Gaussian surface area of every convex set in R^n is at most O(n^{1/4}).

For our setting it is unclear what the Gaussian surface area of spectrahedrons is. Clearly, since spectrahedrons are convex objects, one can use Theorem 36 to show an upper bound of O(n^{1/4}). Below we show that one can in fact prove an upper bound of O(1) on the Gaussian surface area of positive spectrahedrons. We make this formal in the theorem below.

Theorem 37 (Matrix version of Peres’ theorem). Let S be a positive spectrahedron defined as

S = { x ∈ R^n : Σ_i x_i A_i ⪯ B },

where A_1, ..., A_n, B ∈ Sym_k and A_i is PSD for every i ∈ [n]. Then the Gaussian surface area of S is O(1) (independent of k, n). Moreover, let f(x) = [x ∈ S] for x ∈ {−1,1}^n; then the ε-noise sensitivity of f is NS_ε(f) = O(√ε).

Corollary 38.
Let S_1, S_2 be distinct positive spectrahedrons specified by {A_1^j, ..., A_n^j, B^j}_{j∈[2]} respectively, where A_i^1 ⪰ 0 and A_i^2 ⪯ 0 for all i. Let

F(x) = ∧_{j=1}^2 [ Σ_i x_i A_i^j ⪯ B^j ]

be an intersection of positive spectrahedrons. Then AS(F) ≤ O(√n) and GSA(S_1 ∩ S_2) = O(1).

Subsequently it was shown that this bound is optimal for a convex body formed by exp(√n) randomly intersecting halfspaces. Kane [Kan14a] showed that the Gaussian surface area of an intersection of k halfspaces is O(√log k) (thereby reproving Nazarov [Naz03]). Before stating Kane’s result, we need to introduce the following notion.

Definition 39 (Unate function). A function f : {−1,1}^n → {0,1} is unate if it satisfies the following: for every i ∈ [n], f is either increasing or decreasing with respect to the i-th coordinate, i.e., for every i ∈ [n], either f(x_1, ..., x_{i−1}, −1, x_{i+1}, ..., x_n) ≤ f(x_1, ..., x_{i−1}, 1, x_{i+1}, ..., x_n) for all x, or f(x_1, ..., x_{i−1}, −1, x_{i+1}, ..., x_n) ≥ f(x_1, ..., x_{i−1}, 1, x_{i+1}, ..., x_n) for all x.

In particular, Kane proved the following stronger statement.
Theorem 40. [Kan14a] Let f_1, ..., f_k : {−1,1}^n → {0,1} be unate functions and let F : {−1,1}^n → {0,1} be defined as F(x) = ∧_i f_i(x). Then the average sensitivity of F satisfies AS(F) ≤ O(√(n log(k + 1))).

It is not hard to see that a positive spectrahedron is a unate function, so Theorem 40 holds for us as well with k = 1. To be precise, we have

Corollary 41.
Let S be as defined in Theorem 37. Let F : {−1,1}^n → {0,1} be defined as F(x) = 1 if and only if x ∈ S. Then AS(F) ≤ O(√n).

In order to translate Theorem 40 into the corollary above: for a positive spectrahedron S, let F(x) = [x ∈ S] for x ∈ {−1,1}^n; then one can rewrite F as

F(x) = ∧_{j=1}^k [ λ_j( Σ_i x_i A_i − B ) ≤ 0 ],

which is an AND of k unate functions by Weyl’s inequality [Bha13, Theorem III.2.1] (the inner functions are unate since all the A_i’s are promised to be PSD).

Recall that we are interested in the Gaussian surface area of such bodies (not just the average sensitivity), which is closely related to the noise sensitivity of positive spectrahedrons. In the same paper, Kane [Kan14a] adapts the well-known techniques of [DGJ+
10] to show that the ε-noise sensitivity of intersections of halfspaces is at most O(√ε log k), and remarks that such a bound does not hold for intersections of unate functions. Below, we show that one can modify the proof of [DGJ+
10] to also show that the noise sensitivity of positive spectrahedrons can be bounded by the “average 2-sensitivity” of positive spectrahedrons, which we show is O(√ε) by modifying Kane’s proof of Theorem 40. This proves our Theorem 37.

Proof of Theorem 37.
In order to prove the theorem, we first show that for the function f : {−1,1}^n → {0,1} defined as

f(x) = [ Σ_{i=1}^n x_i A_i ⪯ B ],

where A_1, ..., A_n, B ∈ Sym_k and A_i is PSD for every i ∈ [n], the ε-noise sensitivity of f satisfies

NS_ε(f) = Pr_{(x,y) ε-correlated} [ f(x) ≠ f(y) ] ≤ O(√ε).

For simplicity, let us assume that ε = 1/m for some integer m which divides n (since NS_ε is a non-decreasing function of ε, we can round ε down to satisfy this condition). In order to analyze NS_ε(f), we first observe that one can generate an ε-correlated pair of strings (x, y) ∈ {−1,1}^n × {−1,1}^n as follows. (There is a +1 compared to Kane’s result to ensure that the result is valid for k = 1.)
1. Pick a uniformly random string z ∼ U_n.

2. Randomly partition [n] into m disjoint buckets C_1, ..., C_m ⊆ [n] such that ∪_ℓ C_ℓ = [n]. Furthermore, for the z picked in step 1, split each bucket as follows: for every ℓ ∈ [m], split C_ℓ into C_{ℓ,1} and C_{ℓ,−1}, where C_{ℓ,1} corresponds to the positive coordinates of z within C_ℓ and C_{ℓ,−1} to the negative coordinates of z within C_ℓ. So overall there are 2m disjoint buckets {C_{ℓ,s} : ℓ ∈ [m], s ∈ {−1,1}} such that ∪_{ℓ,s} C_{ℓ,s} = [n]. Set C̃_ℓ = C_{ℓ,1} if ℓ ≤ m and C̃_ℓ = C_{ℓ−m,−1} if ℓ > m.

3. Corresponding to each bucket C̃_ℓ, pick a uniformly random bit b_ℓ ∼ U_1.

4. Obtain x as follows: for every ℓ ∈ [2m], obtain x from z by multiplying all the bits of z indexed by C̃_ℓ by b_ℓ.

5. Obtain y as follows: pick a uniformly random ℓ ∈ [m] and flip the signs of x_i (obtained in step 4) for all indices i in C_ℓ, i.e., y_i = −x_i if i ∈ C_ℓ and y_i = x_i otherwise.

Observe that the pair (x, y) obtained in steps (4,
5) are uniform and ε -correlated. To see this,first observe that the probability of obtaining x ∈ {− , } n is given byPr z ∼U n , { C k } , b ∼U m [ x = x ] = Pr z ,C, b [ z C · b = x C , . . . z C m · b m = x C m ]= m X i =1 Pr z ,C, b (cid:2) z C i · b i = x C i | z C
10, Proposition 9.2]. The second result we use is by Ball [Bal13], who showed the following: if a Boolean function f is the indicator function of a convex set S, i.e., f^{−1}(1) = S, and S has a smooth boundary, then the Gaussian surface area of S can be bounded as

GSA(S) ≤ lim_{ε→0} GNS_ε(f)/√ε.  (40)

Putting together Eq. (40) and Eq. (39), we get

GSA(S) ≤ lim_{ε→0} NS_ε(f)/√ε ≤ O(1),

where the final inequality used the upper bound we derived earlier in Eq. (38). This concludes the proof of the theorem.

We note that [DGJ+
10, Proposition 9.2] shows this statement with equality asymptotically (i.e., when we take k Bernoulli random variables to approximate a Gaussian and let k → ∞) for f being a degree-d polynomial threshold function, and the same proof holds true when f is an intersection of spectrahedrons. We remark that one can also obtain this bound via [Kan11a, Section 3].
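As an aside, the unateness of positive spectrahedrons that underlies Corollary 41 is easy to confirm by brute force on a toy instance: with every A_i PSD, raising a coordinate from −1 to +1 moves Σ_i x_i A_i up in the PSD order, so membership can only switch from 1 to 0. A small sketch (assuming numpy; the random instance and the choice B = I are illustrative only):

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)
n, k = 6, 3

# toy positive spectrahedron: each A_i = M M^T is PSD, B = I
A = []
for _ in range(n):
    M = rng.standard_normal((k, k))
    A.append(M @ M.T)
B = np.eye(k)

def f(x):
    S = sum(xi * Ai for xi, Ai in zip(x, A))
    return bool(np.max(np.linalg.eigvalsh(S - B)) <= 0)

# unateness: f is non-increasing in every coordinate, since flipping
# x_i from -1 to +1 adds the PSD matrix 2*A_i to the sum
for x in itertools.product([-1, 1], repeat=n):
    for i in range(n):
        hi = list(x); hi[i] = 1
        lo = list(x); lo[i] = -1
        assert f(hi) <= f(lo)
```

At the all-minus point the sum is negative semidefinite, so f = 1 there, while at the all-plus point the sum is far above B, so f = 0; the function is genuinely non-constant in this instance.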
We now prove Corollary 38, which bounds the Gaussian surface area of intersections of positive spectrahedrons.
Proof of Corollary 38.
The proof is very similar to the proof of the theorem above. Let m = ⌈1/ε⌉. We follow the same bucketing steps (1)-(5) as in Theorem 37 to obtain a function g : {−1,1}^{2m} → {0,1} given by

g(b) = [ Σ_{q=1}^{2m} b_q Σ_{j ∈ C̃_q} z_j A_j^1 ⪯ B^1 ] · [ Σ_{q=1}^{2m} b_q Σ_{j ∈ C̃_q} z_j A_j^2 ⪯ B^2 ].

Observe that g is an intersection of positive spectrahedrons and, by definition, each positive spectrahedron is a unate function. So, by Theorem 40, we have

AS(g) ≤ O(√m) = O(√(1/ε)).

Repeating the same steps after Eq. (38), we get that
GSA(S_1 ∩ S_2) ≤ lim_{ε→0} GNS_ε(F)/√ε ≤ lim_{ε→0} NS_ε(F)/√ε ≤ lim_{ε→0} √ε · AS(g) ≤ O(1).

This concludes the proof of the corollary.
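The bucketing process in steps (1)-(5) of the proof of Theorem 37 can be simulated directly. The sketch below (assuming numpy; toy parameters n = 12, m = 4, so ε = 1/m) checks empirically that the resulting x is uniform and that each coordinate of y is flipped with probability exactly 1/m:

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 12, 4
eps = 1.0 / m

def correlated_pair():
    """One sample from the bucketing process, steps (1)-(5)."""
    z = rng.choice([-1, 1], size=n)            # step 1: uniform z
    part = rng.integers(0, m, size=n)          # step 2: buckets C_1..C_m
    # split each C_l by the sign of z, giving 2m buckets \tilde{C}_l
    tilde = part + m * (z < 0)                 # bucket index in [0, 2m)
    b = rng.choice([-1, 1], size=2 * m)        # step 3: a sign per bucket
    x = z * b[tilde]                           # step 4
    l = rng.integers(0, m)                     # step 5: flip one bucket C_l
    y = np.where(part == l, -x, x)
    return x, y

# empirical sanity checks: x is uniform, and y_i != x_i w.p. 1/m = eps
T = 4000
flips, ones = 0.0, 0.0
for _ in range(T):
    x, y = correlated_pair()
    flips += np.mean(x != y)
    ones += np.mean(x == 1)
assert abs(flips / T - eps) < 0.05
assert abs(ones / T - 0.5) < 0.05
```

This matches the claim that the pair (x, y) is uniform and ε-correlated: each coordinate lies in the flipped bucket C_ℓ independently of x with probability 1/m.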
We now prove the main lemma, which shows that the largest eigenvalue of a positive spectrahedron cannot be very concentrated. In particular, we show that for a uniformly random x ∼ U_n, if we consider the spectrahedron D = Σ_i x_i A_i − B, then the measure (over the Boolean cube) of the event that D has its largest eigenvalue in a small interval is fairly small. This anti-concentration statement will be crucial in our invariance principle proof when we move from the Bentkus mollifier to our CDF function. In passing, we remark that, prior to this work, we are not aware that even the weaker Gaussian analogue of this statement was known (in particular, the results of [HKM13, ST17] only require Gaussian anti-concentration, for which they use a result of Nazarov [Naz03] as a black box).

We remark that the proof of our main theorem (stated below) closely follows [OST19, Kan14a], since those works are able to handle intersections of unate functions, which is the case for positive spectrahedrons. However, there are two subtleties.

(i) In [OST19] they bucket the set of halfspaces (which form the polytope) and show that each bucket has significant weight. Crucially for them, they use the fact that intersections of halfspaces are still unate functions. But this is not the case for positive spectrahedrons. For this, we need to modify the bucketing procedure (akin to what happens in the proof of Theorem 37) so that this bucketing of positive spectrahedrons still results in a unate function.

(ii) In [OST19] they prove an analogue of Lemma 46, which shows that each bucket has “significant weight”. However, our proof deviates significantly from the proof in [OST19]. For them, proving the statement of the lemma (for diagonal matrices) follows directly from the Paley-Zygmund inequality, but, as far as we are aware, a matrix version of this inequality is not available. Due to this difficulty, we modify their proof and use the matrix Chernoff bound to prove the statement above.

Theorem 42.
Let k ≥ 2 be an integer and τ ≤ √log k. Let {B^1, B^2} ⊆ Sym_k, and let {A_i^1}_{i∈[n]} and {A_i^2}_{i∈[n]} be sequences of PSD and NSD matrices, respectively, satisfying, for all i ∈ [n] and j ∈ [2], A_i^1 ⪯ τ·I, A_i^2 ⪰ −τ·I and Σ_{i=1}^n (A_i^j)² ⪰ I. Then for every Λ ≥ τ log k, we have

Pr_{x∼U_n} [ ∃j ∈ [2] s.t. λ_max( Σ_i x_i A_i^j − B^j ) ∈ (−Λ, Λ] ] ≤ O(Λ).

Again using the standard bits-to-Gaussians trick, we have the following corollary.
Corollary 43.
Let k ≥ 2 be an integer and τ ≤ k. Let {B^1, B^2} ⊆ Sym_k, and let {A_i^1}_{i∈[n]} and {A_i^2}_{i∈[n]} be sequences of PSD and NSD matrices, respectively, satisfying, for all i ∈ [n] and j ∈ [2], A_i^1 ⪯ τ·I, A_i^2 ⪰ −τ·I and Σ_i (A_i^j)² ⪰ I. Then for every Λ ≥ τ log k, we have

Pr_{g∼G_n} [ ∃j ∈ [2] s.t. λ_max( Σ_i g_i A_i^j − B^j ) ∈ (−Λ, Λ] ] ≤ O(Λ).

In order to prove Theorem 42 we will use the following two lemmas from [OST19]. Before stating these lemmas, we introduce a few definitions from [OST19] (adapted to our setting of positive spectrahedrons). For the rest of the section, we let F : {−1,1}^n → {0,1} be the indicator of an intersection of positive spectrahedrons, i.e., for every j ∈ [2], let F_j(x) = [ Σ_{i=1}^n x_i A_i^j ⪯ B^j ], where the {A_i^j}_i satisfy Eq. (7), and

F(x) = ∧_{j=1}^2 F_j(x) = ∧_{j=1}^2 [ Σ_{i=1}^n x_i A_i^j ⪯ B^j ].  (41)

1. For a set S ⊆ {−1,1}^n, let E(S) be the fraction of the n · 2^{n−1} hypercube edges which have one endpoint in S and one endpoint in S^c (i.e., the complement of S).

2. We let H_j ⊆ {−1,1}^n be the indicator-set for F_j, i.e., x ∈ H_j if and only if F_j(x) = 1. Additionally, suppose we have sets {H̄_1, H̄_2} with H_j ⊆ H̄_j such that the H̄_j are also indicator-sets of unate functions. Let ∂H_j = H̄_j \ H_j.

3. For α ∈ [0,1], ∂H_j is α-semi-thin if for every x ∈ ∂H_j, at least an α-fraction of its hypercube-neighbours (i.e., the set of y ∈ {−1,1}^n for which d(x, y) = 1) are outside ∂H_j.

4. We now define a few sets: let F = H̄_1 ∩ H̄_2, F° = H_1 ∩ H_2 and ∂F = F \ F°.

With this terminology, we have the following lemma that bounds the number of edges that cross F.

Lemma 44 ([OST19, Theorem 7.18]). For j ∈ [2], let H_j be as defined above. Suppose ∂H_j is α-semi-thin. Then

vol(∂F) ≤ O( 1/(α√n) ).

Using this lemma, we get the following theorem (which is the analogue of [OST19, Theorem 7.19]).

Theorem 45.
Let λ > 0, α ∈ [0,1] and {B^1, B^2} ⊆ Sym_k. Let {A_i^j}_{i∈[n], j∈[2]} ⊆ Sym_k satisfy A_i^1 ⪰ 0 and A_i^2 ⪯ 0 for all i ∈ [n], and suppose at least an α-fraction of the i ∈ [n] satisfy A_i^1 ⪰ λ·I and A_i^2 ⪯ −λ·I. Then, we have

Pr_{x∼U_n} [ ∃j ∈ [2] s.t. λ_max( Σ_i x_i A_i^j − B^j ) ∈ (−λ, 0] ] ≤ O( 1/(α√n) ).

Proof.
Let {A_i^j}, {B^j} be as in the theorem statement. Let

H_j = { x ∈ {−1,1}^n : λ_max( Σ_i x_i A_i^j − B^j ) ≤ −λ },  H̄_j = { x ∈ {−1,1}^n : λ_max( Σ_i x_i A_i^j − B^j ) ≤ 0 }.

Clearly we then have

∂H_j = { x ∈ {−1,1}^n : λ_max( Σ_i x_i A_i^j − B^j ) ∈ (−λ, 0] }

and

∂F = { x ∈ {−1,1}^n : ∃j ∈ [2] s.t. λ_max( Σ_i x_i A_i^j − B^j ) ∈ (−λ, 0] }.

Since we assumed that at least an α-fraction of the i’s satisfy A_i^1 ⪰ λ·I and A_i^2 ⪯ −λ·I, it follows that ∂H_j is α-semi-thin, hence we can apply Lemma 44 to obtain the theorem statement.

Using this theorem, we are now ready to prove our main technical lemma, which says that we can always “randomly bucket” our positive spectrahedron so that many of the buckets have “pretty large” smallest eigenvalue.

Lemma 46.
Let $\{A_i\}_{i \in [n]} \subseteq \mathrm{Sym}_k$ be a sequence of positive semidefinite matrices which is $(\tau, M)$-regular with $\tau \leq 1/\sqrt{\log k}$. Let $m \geq \tau \log k$ and let $\pi : [n] \to [m]$ be a random hash function that independently assigns each $i \in [n]$ to a uniformly random bucket in $[m]$. For $c \in [m]$, let
$$\sigma_c = \sum_{j \in \pi^{-1}(c)} A_j,$$
and say that the bucket $c \in [m]$ is good if $\sigma_c \succeq \frac{1}{2\tau m} \cdot I$. Then
$$\Pr\big[\text{at most } m/2 \text{ buckets } c \in [m] \text{ are good}\big] \leq \exp(-\Omega(m)).$$

Proof.
Let $z_i \in \{0,1\}$ be a random variable satisfying $\Pr[z_i = 1] = 1/m$ and let $Z_i = z_i \cdot A_i$, so that one can write $\sigma_c = \sum_i Z_i$. In particular, this implies
$$\mathbb{E}[\sigma_c] = \frac{1}{m} \sum_i A_i \succeq \frac{1}{\tau m} \sum_i (A_i)^2 \succeq \frac{1}{\tau m} \cdot I.$$
Applying Fact 5 (for $\delta = 1/2$, $\mu = 1/(\tau m)$ and $R = \tau$) we have
$$\Pr\Big[\sum_i Z_i \succeq \frac{1}{2\tau m} \cdot I\Big] \geq 1 - k \cdot \Big(\frac{e}{2}\Big)^{-1/(2\tau^2 m)} \geq \frac{3}{4}.$$
For $j \in [n]$ and $c \in [m]$, define the random variables
$$Y_{c,j} = [\pi(j) = c] \qquad \text{and} \qquad X_j = \Big[\sum_{c=1}^{m} Y_{c,j}\, \sigma_c \succeq \frac{1}{2\tau m} \cdot I\Big].$$
Using Claim 47 below, $X_1, \ldots, X_n$ are negatively associated. Thus we may apply the Chernoff bound to $\sum_{i=1}^{m} X_i$, which has mean at least $3m/4$; this gives us the lemma statement.
Claim 47. The random variables $X_1, \ldots, X_n$ are negatively associated.

Proof. From [DP09, Page 35, Example 3.1], the set of random variables $\{Y_{c,j}\}_{1 \leq c \leq m}$ is negatively associated for each $j \in [n]$. Note that $\{Y_{1,j}, \ldots, Y_{m,j}\}_{j \in [n]}$ are $n$ independent families of random variables, so by [DP09, Page 35], the variables $\{Y_{c,j}\}_{c \in [m], j \in [n]}$ are negatively associated. Given $\sigma_1, \ldots, \sigma_m$, the indicator $\big[\sum_{c=1}^{m} Y_{c,j}\, \sigma_c \succeq \frac{1}{2\tau m} \cdot I\big]$ is a monotone non-decreasing function of $Y_{1,j}, \ldots, Y_{m,j}$. Thus, from [DP09, Page 35], $X_1, \ldots, X_m$ are negatively associated.

The proof of this claim concludes the proof of the lemma. We are now ready to prove our main theorem.

Proof of Theorem 42.
For $j \in [2]$, let $f_j(x) = \sum_{i=1}^{n} x_i A_{ij}$. Let $\pi : [n] \to [2m]$ be a random hash function that independently assigns each $i \in [n]$ to a uniformly random bucket in $[2m]$. Let $C_1, \ldots, C_{2m} \subseteq [n]$ be the buckets and let $z \in \{-1,1\}^{2m}$ be uniformly random. Consider the function $g_j : \{-1,1\}^{2m} \to \mathrm{Sym}_k$ defined as
$$g_j(z) = \sum_{q=1}^{2m} z_q \cdot \sum_{i \in C_q} A_{ij}.$$
For $q \in [2m]$, define $\bar{A}_{qj} = \sum_{i \in C_q} A_{ij}$, so $g_j(z) = \sum_q z_q \bar{A}_{qj}$. Observe that the distributions of $f_j$ and $g_j$ are the same, i.e., for every $D \in \mathrm{Sym}_k$ we have
$$\Pr_{z \sim \mathcal{U}^{2m}, \{C_i\}}[g_j(z) = D] = \Pr_{x \sim \mathcal{U}^n}[f_j(x) = D]. \qquad (42)$$
In order to see this, we argue that the $n$-bit string $w \in \{-1,1\}^n$ defined by $w_i = z_q$ iff $i \in C_q$ is uniformly random. To show this, we first prove the following: for $z \in \{-1,1\}^{2m}$, let $S = \{q \in [2m] : z_q = 1\}$ and $\mathbf{T} = \cup_{q \in S} C_q$. Then, observe that for every $T \subseteq [n]$, we have $\Pr_{z, \{C_q\}}[\mathbf{T} = T] = 2^{-n}$: for every $i \in [n]$, the probability that $i \in C_q$ is $1/(2m)$ and the probability that $C_q$ is included in $\mathbf{T}$ is $1/2$ since $z_q$ is a uniformly random bit; hence for every $i \in [n]$, we have $\Pr_{z, \{C_q\}}[i \in \mathbf{T}] = \sum_{q=1}^{2m} \frac{1}{2m} \cdot \frac{1}{2} = \frac{1}{2}$ (independently across $i \in [n]$ by construction). It is now easy to see that $w$ is uniformly random because
$$\Pr_{z, \{C_j\}}[W = w] = \sum_{T} \Pr[\mathbf{T} = T] \cdot \Pr[W = w \mid \mathbf{T} = T] = \frac{1}{2^n} \sum_{T} \Pr[W = w \mid \mathbf{T} = T] = 2^{-n},$$
where the last equality used the fact that once we fix $\mathbf{T}$, all the bits of $w$ which are $1$ are fixed.

For $m = \tau \log k$, let $\pi : [n] \to [2m]$ be a random hash that buckets these $n$ variables (jointly for $j \in [2]$). By Lemma 46, with probability at least $1 - e^{-\Omega(m)}$, at least $9m/5$ of the $2m$ buckets are good for $j = 1$, i.e., a good bucket $q \in [2m]$ for $j = 1$ satisfies $\sum_{i \in \pi^{-1}(q)} A_{i1} \succeq \frac{1}{2\tau m} \cdot I$. For the same reason, with probability at least $1 - e^{-\Omega(m)}$, at least $9m/5$ of the $2m$ buckets are good for $j = 2$, i.e., a good bucket $q \in [2m]$ for $j = 2$ satisfies $\sum_{i \in \pi^{-1}(q)} A_{i2} \preceq -\frac{1}{2\tau m} \cdot I$. Applying a union bound, at least $8m/5$ of the $2m$ buckets are good for every $j \in [2]$ with probability at least $1 - 2 e^{-\Omega(m)}$.

By the argument at the start of the proof, we know that after bucketing, we can convert each $f_j$ into a function $g_j : \{-1,1\}^{2m} \to \mathrm{Sym}_k$ such that $f_j$ and $g_j$ have the same distribution. Now we can invoke Theorem 45 as follows: we know that a $4/5$-fraction of $q \in [2m]$ satisfy $\bar{A}_{q1} \succeq \frac{1}{2\tau m} \cdot I$ and $\bar{A}_{q2} \preceq -\frac{1}{2\tau m} \cdot I$, so we have
$$\Pr_{z \sim \mathcal{U}^{2m}}\Big[\exists j \in [2] \text{ s.t. } \lambda_{\max}\Big(\sum_{q=1}^{2m} z_q \bar{A}_{qj} - B_j\Big) \in \Big(-\frac{1}{2\tau m}, 0\Big]\Big] \leq O\Big(\sqrt{\frac{1}{m}}\Big) + 2 e^{-\Omega(m)}.$$
We now prove the main theorem statement. In order to do so, first observe that we can partition the interval on the left-hand side into $\lceil 2\Lambda \tau m \rceil$ intervals of width $\frac{1}{2\tau m}$, as $\Lambda \geq \frac{1}{2\tau m}$ from our choice of parameters; by a union bound we have
$$\Pr_{x \sim \mathcal{U}^n}\Big[\exists j \in [2] \text{ s.t. } \lambda_{\max}\Big(\sum_i x_i A_{ij} - B_j\Big) \in (-\Lambda, 0]\Big] \leq O\Big(\Lambda \cdot \tau \cdot m \cdot \Big(\sqrt{\frac{1}{m}} + \exp(-\Omega(m))\Big)\Big).$$
From the choice of the parameters, the first term above dominates, and thus
$$\Pr_{x \sim \mathcal{U}^n}\Big[\exists j \in [2] \text{ s.t. } \lambda_{\max}\Big(\sum_i x_i A_{ij} - B_j\Big) \in (-\Lambda, 0]\Big] \leq O(\Lambda).$$
Similarly one can show the same bound when the interval above is replaced with $(0, \Lambda]$. Hence we get our theorem statement.
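The random-bucketing step above (Lemma 46) can be illustrated numerically. The following is a toy scalar analogue, not the matrix statement: $n$ equal weights (so no coordinate dominates, mimicking $\tau$-regularity) are hashed uniformly into $m$ buckets, and a bucket is "good" if its total weight is at least half its expectation; with overwhelming probability almost all buckets are good. The function name and parameters are illustrative choices, not from the paper.

```python
import random

def bucket_experiment(n=10_000, m=20, seed=7):
    # Scalar analogue of the bucketing lemma: n equal weights 1/n are
    # hashed uniformly into m buckets; a bucket is "good" if its total
    # weight is at least half its expectation 1/m.
    rng = random.Random(seed)
    buckets = [0.0] * m
    for _ in range(n):
        buckets[rng.randrange(m)] += 1.0 / n
    good = sum(1 for s in buckets if s >= 0.5 / m)
    return good, m

good, m = bucket_experiment()
print(good, "of", m, "buckets are good")
```

By a Chernoff bound, the probability that any fixed bucket fails is exponentially small in $n/m$, matching the $\exp(-\Omega(m))$ failure probability in the lemma.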
In this section, we establish our main invariance principles.
We now prove our main lemma which is an invariance principle for the Bentkus mollifier. We remarkthat our analysis is the standard Lindeberg-style argument for proving invariance principles, butwhen applied to the spectral Bentkus mollifier. We first write out the Fr´echet series for the Bentkusmollifier, which we then upper bound using our main Theorem 28. In order to understand the errorterms in the Fr´echet series, we use the matrix Rosenthal inequality (in Fact 7) in order to understandthe moments of random matrices (we remark that this inequality will also be useful in our
PRG construction). Superficially, our proof techniques resemble the previous invariance principle proofsused in [HKM13, ST17, OST19], but the quantities we need to bound are very different fromtheir analysis. To be precise, for a vector v ∈ R k , observe that the event [ ∀ i ∈ [ k ] : v i ≤ b i + Λ , and ∃ j ∈ [ k ] : v j ≥ b j − Λ] canbe broken down into the intersections of Λ / τ m events given by V τm − ℓ =0 [ ∀ i ∈ [ k ] : v i ≤ b i + Λ − ℓ/ τ m, and ∃ j ∈ [ ℓ ] : v j > b j − Λ − ( ℓ + 1) / τ m ]. emma 48. Let k ≥ , θ, τ ∈ (0 , and Ψ θ : Sym k → R be defined as Ψ θ ( M ) = ( G θ ◦ λ ) ( M ) where G θ is the Bentkus mollifier defined in Eq. (18) . Let S , S be ( τ, M ) -regular positive spectrahedronsspecified by matrices { A , . . . , A n , B } and { A , . . . , A n , B } respectively. Let A i = diag (cid:0) A i , A i (cid:1) and B = diag ( B , B ) be block diagonal matrices. Then (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) E x ∼U n " Ψ θ n X i =1 x i A i − B ! − E g ∼G n " Ψ θ n X i =1 g i A i − B ! ≤ O (cid:18) log kθ · ( M + k B k ) · ( M · τ ) . (cid:19) . This inequality holds if x is (10 log k ) -wise uniform.Proof. Let t = ⌈ /τ ⌉ . Let H = { h : [ n ] → [ t ] } be a family of (10 log k )-wise uniform hashingfunctions, i.e., for every subset I ⊆ [ n ] of size at most 10 log k , and b ∈ [ t ] I , we havePr h ∈H [ h ( i ) = b i ] = 1 t | I | , where the probability is taken over a uniformly random function h ∈ H . Fix an h ∈ H (think of h as a partition of [ n ] into t blocks S , . . . , S t ⊆ [ n ], where S i = h − ( i ) for all i ∈ [ t ]). For x ∼ U n and y ∼ G n let us divide x , y into blocks x , . . . , x t and y , . . . , y t according to h . It is not hardto see that x i ∼ uniform {− , } | h − ( i ) | and y i ∼ G | h − ( i ) | . We now upper bound the quantity (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) E x ∼U n " Ψ θ n X i =1 x i A i − B ! − E y ∈G n " Ψ θ n X i =1 y i A i − B ! (43)by the standard hybrid argument. Let { Z , . . . 
, Z t } be a set of random variable on n coordinatessuch that Z is the uniform distribution on {− , } n and Z t is uniform in G n . To this end, define Z ℓ as follows: for j ∈ [ ℓ ], let Z ℓ | h − ( j ) = y j and for ℓ < j ≤ t let Z ℓ | h − ( j ) = x j . It is easy to see that Z ∼ U n and Z t ∼ G n . We now can upper bound Eq. (43) as (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) E x ∼U n " Ψ θ n X i =1 x i A i − B ! − E y ∼G n " Ψ θ n X i =1 y i A i − B ! = (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) t X ℓ =1 E x ∼U n y ∼G n " Ψ θ n X i =1 Z ℓi A i − B ! − E x ∼U n y ∼G n " Ψ θ n X i =1 Z ℓ − i A i − B ! ≤ t X ℓ =1 (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) E x ∼U n y ∼G n " Ψ θ n X i =1 Z ℓi A i − B ! − E x ∼U n y ∼G n " Ψ θ n X i =1 Z ℓ − i A i − B ! (44)We now upper bound each of the t quantities on the RHS of Eq. (44). Fix ℓ ∈ [ t ] and let usassume for simplicity that h − ( ℓ ) = [ m ]. By definition of Z ℓ we observe that Z ℓj = Z ℓ +1 j for all j ∈ { m + 1 , . . . , n } and in fact we have Z ℓ = ( x , . . . , x m , Z m +1 , . . . , Z n ) , Z ℓ +1 = ( y , . . . , y m , Z m +1 , . . . , Z n ) , where x i ∼ U and y i ∈ G is uniform in their respective domains. Crucially note that Z m +1 , . . . , Z n is independent of the x i s or y i s by definition of Z ℓ , Z ℓ +1 . Rewriting the ℓ -th term in Eq. (44),we get (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) E x ∼U n y ∼G n Ψ θ m X i =1 x i A i | {z } Q + n X i = m +1 Z i A i − B | {z } P − E x ∼U n y ∼G n Ψ θ m X i =1 y i A i | {z } R + n X i = m +1 Z i A i − B | {z } P (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (45)40et us analyze both these quantities separately. 
We can first write the Fr´echet series for both theseexpressions asΨ θ ( Q + P ) = Ψ θ ( P ) + D Ψ θ ( P ) [ Q ] + 12 D Ψ θ ( P ) [ Q, Q ] + 16 D Ψ θ (cid:0) P ′ (cid:1) [ Q, Q, Q ] (46)where P ′ = P + ξQ for some ξ ∈ [0 , Ψ θ ( R + P ) = Ψ θ ( P ) + D Ψ θ ( P ) [ R ] + 12 D Ψ θ ( P ) [ R, R ] + 16 D Ψ θ (cid:0) P ′′ (cid:1) [ R, R, R ] , (47)where P ′′ = P + ξ ′ R for some ξ ∈ [0 , x match with the standardnormal distributions. Thus we have that E x ∼U n y ∼G n [ D Ψ θ ( P ) [ R ]] = E x ∼U n y ∼G n [ D Ψ θ ( P ) [ Q ]] E x ∼U n y ∼G n (cid:2) D Ψ θ ( P ) [ R, R ] (cid:3) = E x ∼U n y ∼G n (cid:2) D Ψ θ ( P ) [ Q, Q ] (cid:3) . (48)So by taking the difference of Eq. (47) and Eq. (46), only the third order spectral derivatives remainto be bounded. For this, we now use the Corollary 29 and obtain (cid:12)(cid:12) D Ψ θ (cid:0) P ′ (cid:1) [ Q, Q, Q ] (cid:12)(cid:12) ≤ O (cid:18) ∆ θ log k · k Q k (cid:19) (49) (cid:12)(cid:12) D Ψ θ (cid:0) P ′′ (cid:1) [ R, R, R ] (cid:12)(cid:12) ≤ O (cid:18) ∆ θ log k · k R k (cid:19) . (50)where ∆ = k P ′ k and ∆ = k P ′′ k .Thus, the absolute value of Eq. (45) is upper bounded bylog kθ E h ∆ k Q k + ∆ k R k i ≤ log kθ (cid:18) E h k P ′ k i / E h k Q k i / + E h k P ′′ k i / E h k R k i / (cid:19) , (51)where the inequality is by Cauchy-Schwarz inequality.Using Fact 6 and the fact that P i ( A i ) (cid:22) M · I , we have E h k P ′ k i ≤ O (cid:16) log k · M + k B k (cid:17) , E h k P ′′ k i ≤ O (cid:16) log k · M + k B k (cid:17) (52)We now upper bound the last term in Eq. (51) using the following claim. Claim 49.
It holds that E h k Q k i ≤ O (cid:0) log k · τ · M (cid:1) , E h k R k i ≤ O (cid:0) log k · τ · M (cid:1) . Before proving this claim, observe that combining Claim 49 with Eq. (52), (51), we can upperbound Eq. (51) (and in turn Eq. (45)) by O (cid:18) log kθ · (cid:0) M log k + k B k (cid:1) · (cid:0) log k · τ . · M . (cid:1)(cid:19) ≤ O (cid:18) log kθ · ( M + k B k ) · ( M · τ ) . (cid:19) This follows directly from the mean value theorem for Fr´echet derivatives [AP95]. (cid:12)(cid:12)(cid:12) E x ∼U n " Ψ θ n X i =1 x i A i ! − E y ∼G n " Ψ θ n X i =1 y i A i ! ≤ O (cid:18) log kθ · ( M + k B k ) · ( M · τ ) . (cid:19) , concluding the theorem proof. We now prove the claim above. Proof of Claim 49.
Note that Q = P ni =1 x i A i , where ( x , . . . , x n ) is i.i.d. with Pr [ x i = 1] =Pr[ x i = −
1] = t and Pr[ x i = 0] = 1 − /t . Then using Fact 7, we have E h k Q k p p i / p ≤ p p − (cid:13)(cid:13)(cid:13) t X i (cid:0) A i (cid:1) ! / (cid:13)(cid:13)(cid:13) p + (8 p − t X i k A i k p p ! / p ≤ p p − · r Mt · k p + (8 p − (cid:18) τ p − · k · Mt (cid:19) / p where the second inequality used P i (cid:0) A i (cid:1) (cid:22) M · I for both terms and 0 (cid:22) A i (cid:22) τ I for upperbounding the second term. Setting p = 10 log k , t = 1 /τ we have E h k Q k p p i / p ≤ O (cid:16)p log k · √ τ · √ M + log k · τ · ( M/τ ) / (80 log k ) (cid:17) = O (cid:16) log k · √ τ · √ M (cid:17) . Thus, we have E h k Q k i ≤ E h k Q k p p i p ≤ O (cid:0) log k · τ · M (cid:1) , where in the first inequality note that the LHS is the spectral norm and the RHS is the (8 p )-Schatten norm. This proves the first inequality in the claim statement. The second inequality inthe claim follows by the exact same argument (since Fact 7 applies to even P i g i A i ).The proof of this claim concludes the proof of the theorem. We are now ready to prove our main theorem now, which involves combining our anti-concentrationTheorem 42 and our invariance principle for Bentkus mollifier in Lemma 48. Theorem 50.
Let k ≥ , M ≥ , γ ≥ , τ ∈ [0 , , δ ∈ [0 , . Let S , S be ( τ, M ) -regular positivespectrahedrons specified by matrices { A , . . . , A n , B } ∈ Sym k and { A , . . . , A n , B } ∈ Sym k respec-tively satisfying k B k , k B k ≤ γ . Let S = S ∩ S . If µ is a (10 log k ) -wise uniform distributionover {− , } n , then (cid:12)(cid:12)(cid:12)(cid:12) E x ∼ µ [ x ∈ S ] − E g ∼G n [ g ∈ S ] (cid:12)(cid:12)(cid:12)(cid:12) ≤ C · (cid:0) M + γ (cid:1) / · log / k · M / · τ / , for some universal constant C > . We remark that our theorem statements should also hold true for a larger class of proper distributions as consideredin [HKM13], which requires one to extend our main Theorem 19 to show that even the 4th order spectral derivativescan be bounded by k f (4) k . We believe this should be possible and leave this to be made rigorous for future work. roof. Again for notational simplicity, let A i = diag (cid:0) A i , A i (cid:1) and B = diag ( B , B ) be blockdiagonal matrices. We conclude the result by combining Fact 24, Lemma 48 and Corollary 42 asfollows: first Lemma 48 implies (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) E x ∼ µ " Ψ θ n X i =1 x i A i − B ! − E g ∼G n " Ψ θ n X i =1 g i A i − B ! ≤ O (cid:18) log kθ · ( M + k B k ) · ( M · τ ) . (cid:19) , In particular, using Fact 24 (for D = B − β · I and D = B + β · I ), the “if” condition of Fact 24 issatisfied with η = O (cid:18) log kθ · (cid:0) M + ( γ + β ) (cid:1) · ( M · τ ) . (cid:19) where β = O ( θ · p log k/δ ). In particular, Fact 24 and Corollary 43 now together imply that (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) E x ∼ µ " Ψ n X i =1 x i A i − B ! − E g ∼G n " Ψ n X i =1 g i A i − B ! ≤ γ + 3 δ + Pr g ∼G n " λ max n X i =1 g i A i − B ! ∈ [ − Λ , Λ] = O (cid:18) log kθ · (cid:18) M + (cid:16) γ + θ · p log( k/δ ) (cid:17) (cid:19) · ( M · τ ) . + δ + Λ (cid:19) ≤ O (cid:18) log kθ · (cid:18) M + (cid:16) γ + p log( k/δ ) (cid:17) (cid:19) · ( M · τ ) . 
+ δ + Λ (cid:19) Let us fix θ ← δ, θ ← Λ , (cid:18) ( M · τ ) . · log k · (cid:18) M + (cid:16) γ + p log k (cid:17) (cid:19)(cid:19) / ← θ. This gives us (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) E x ∼ µ " Ψ n X i =1 x i A i − B ! − E g ∼G n " Ψ n X i =1 g i A i − B ! ≤ (cid:0) ( M · τ ) . · log k · ( M + γ ) (cid:1) / . We are now ready to describe our pseudorandom generator for fooling positive spectrahedrons.Our
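The invariance principle proved above can be visualized in the simplest scalar setting. The sketch below is only an illustration of the *statement*, not of the Lindeberg telescoping proof: it Monte Carlo-estimates $\mathbb{E}[\psi(\sum_i x_i a_i)]$ under uniform $x \in \{-1,1\}^n$ versus Gaussian $g$, for a smooth bounded test function $\psi$ standing in for the Bentkus mollifier. All names and parameter choices here are illustrative assumptions.

```python
import math, random

def invariance_gap(a, psi, trials=20_000, seed=1):
    # Compare E[psi(sum_i x_i a_i)] for Rademacher x against Gaussian g.
    # The invariance principle predicts a small gap when the coefficient
    # vector is regular (no single |a_i| dominates).
    rng = random.Random(seed)
    s_bool = s_gauss = 0.0
    for _ in range(trials):
        s_bool += psi(sum(ai * rng.choice((-1.0, 1.0)) for ai in a))
        s_gauss += psi(sum(ai * rng.gauss(0.0, 1.0) for ai in a))
    return abs(s_bool - s_gauss) / trials

n = 200
a = [1.0 / math.sqrt(n)] * n                        # tau-regular: max |a_i| = n^(-1/2)
psi = lambda t: 1.0 / (1.0 + math.exp(-4.0 * t))    # smooth, bounded in [0, 1]
print(invariance_gap(a, psi))
```

The actual proof does not compare the two endpoints directly: it walks through $t$ hybrid distributions, replacing one block of Boolean coordinates by Gaussians at a time, and bounds each step via third-order Fréchet derivatives of the mollifier.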
PRG is based on the well-known construction of Meka and Zuckerman [MZ13] which we de-scribe now. We remark that the same
PRG (with minor modifications and different parametersettings) was used in [MZ13, HKM13, ST17] in order to obtain
PRG s for polytopes.
Meka-Zuckerman PRG.
We begin by describing the Meka-Zuckerman
PRG. Let us fix the parameters: $\delta \in (0,1)$ and $\tau = \Omega\big(\delta^{O(1)}/(\log k \cdot M \cdot (M + \gamma))\big)$, chosen so that the upper bound obtained in our invariance-principle proof equals $\delta$. Let $t = \lceil 1/\tau \rceil$ and consider a family of $(5 \log k)$-wise uniform hash functions $\mathcal{H} = \{h : [n] \to [t]\}$, i.e., for every subset $I \subseteq [n]$ of size at most $5 \log k$ and $b \in [t]^I$, we have
$$\Pr_{h \in \mathcal{H}}[h(i) = b_i \text{ for all } i \in I] = \frac{1}{t^{|I|}},$$
where the probability is taken over a uniformly random $h \in \mathcal{H}$. Efficient constructions of such hash function families are known with $|\mathcal{H}| = n^{O(\log k)}$. For simplicity (as in the proofs of [MZ13, HKM13]), we also assume that for every $j \in [t]$, we have $|h^{-1}(j)| = n/t$. Let $m = n/t$ and let $G' : \{0,1\}^s \to \{-1,1\}^m$ generate a $(10 \log k)$-wise uniform distribution over $\{-1,1\}^m$, i.e., for every $I \subseteq [m]$ of size at most $10 \log k$ and $b \in \{-1,1\}^I$, we have
$$\Pr_{z \in \{0,1\}^s,\, x = G'(z)}[x_i = b_i \text{ for all } i \in I] = \frac{1}{2^{|I|}},$$
where the probability is taken over uniformly random $z \in \{0,1\}^s$. It is well known by [NN93] that efficient constructions of such generators $G'$ exist with $s = O(\log k \log n)$. Finally, we are ready to describe the Meka-Zuckerman generator: for a given hash function family $\mathcal{H}$ and generator $G'$, define $G : \mathcal{H} \times (\{0,1\}^s)^t \to \{-1,1\}^n$ by
$$G(h, z_1, \ldots, z_t) = x, \qquad \text{where } x|_{h^{-1}(i)} = G'(z_i) \text{ for } i \in [t].$$
Clearly the seed length of this generator is
$$O\Big((\log n)(\log k) + (\log n)(\log k) \cdot \frac{1}{\tau}\Big) = O\big((\log n)(\log k)/\tau\big) = (\log n) \cdot \mathrm{poly}(\log k, M, 1/\delta, \gamma),$$
where the first term is the logarithm of the number of elements of the hash function family $|\mathcal{H}|$, the second term arises because $s = O((\log n)(\log k))$ and we picked $t = O(1/\tau)$, and the final equality used the bound on $\tau$ we fixed at the start. We now restate our main theorem and prove it.

Theorem 51.
Let $\delta \in (0,1)$, $k, n, M \geq 1$ and $\tau \leq \delta^{O(1)}/(\log k \cdot M \cdot (M + \gamma))$. Let $S_1, S_2$ be $(\tau, M)$-regular positive spectrahedrons specified by matrices $\{A_{11}, \ldots, A_{n1}, B_1\} \subseteq \mathrm{Sym}_k$ and $\{A_{12}, \ldots, A_{n2}, B_2\} \subseteq \mathrm{Sym}_k$ with $\|B_1\|, \|B_2\| \leq \gamma$. Let $S = S_1 \cap S_2$. There exists a PRG $G : \{0,1\}^r \to \{-1,1\}^n$ with $r = (\log n) \cdot \mathrm{poly}(\log k, M, 1/\delta, \gamma)$ that $\delta$-fools $S$ with respect to the uniform distribution.

The proof of this theorem is a generic statement that allows one to go from invariance principles proven using our proof techniques to constructions of PRGs. The proof uses the same ideas as Harsha, Klivans and Meka [HKM13, Section 7.2] (except that now we directly proved Boolean anti-concentration instead of the weaker Gaussian anti-concentration proven by [HKM13]). We provide the proof below for completeness.
Proof.
Again for notational simplicity, let A i = diag (cid:0) A i , A i (cid:1) and B = diag ( B , B ) be blockdiagonal matrices. The PRG G will be the Meka-Zuckerman PRG defined above, so the seed length r = (log n ) · poly(log k, M, /δ, γ ) immediately follows. (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) E x ∼U r " Ψ θ n X i =1 ( G ( x )) i A i − B ! − E g ∼G n " Ψ θ n X i =1 g i A i − B ! ≤ O (cid:18) log kθ · ( M + k B k ) · ( M · τ ) . (cid:19) , (53)44here we used the fact that G ( x ) for uniformly random x ∈ { , } r generates a (10 log k )-wise uni-form distribution and Lemma 48 holds for every (10 log k )-wise uniform distribution µ . Repeatingthe same calculation that we did in the proof of Theorem 50, we get (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) E x ∼U r " Ψ n X i =1 ( G ( x )) i A i − B ! − E g ∼G n " Ψ n X i =1 g i A i − B ! ≤ γ + 3 δ + Pr g ∼G n [ λ max ( A ( g )) ∈ ( − Λ , Λ]]= O (cid:18) log kθ · ( M + k B k ) · ( M · τ ) . + δ + Λ (cid:19) , and using our assumption on τ (and the same parameters as in Theorem 50), this implies that (cid:12)(cid:12)(cid:12)(cid:12) E x ∼U r [ G ( x ) ∈ S ] − E g ∼G n [ g ∈ S ] (cid:12)(cid:12)(cid:12)(cid:12) ≤ δ, hence proving our theorem statement. References [AHK05] Sanjeev Arora, Elad Hazan, and Satyen Kale. Fast algorithms for approximate semidef-inite programming using the multiplicative weights update method. In , pages 339–348.IEEE, 2005. 1[AK07] Sanjeev Arora and Satyen Kale. A combinatorial, primal-dual approach to semidefiniteprograms. In
Proceedings of the thirty-ninth annual ACM symposium on Theory ofcomputing , pages 227–236, 2007. 1[AP95] Antonio Ambrosetti and Giovanni Prodi.
A primer of nonlinear analysis , volume 34.Cambridge University Press, 1995. 41[AS10] Brendan P.W. Ames and Hristo S. Sendov. Asymptotic expansions of the orderedspectrum of symmetric matrices.
Nonlinear Analysis: Theory, Methods & Applications ,72(11):4288 – 4297, 2010. 6[AS12] Brendan P.W. Ames and Hristo S. Sendov. A new derivation of a formula by Kato.
Linear Algebra and its Applications , 436(3):722 – 730, 2012. 6[AS16] Brendan P.W. Ames and Hristo S. Sendov. Derivatives of compound matrix valuedfunctions.
Journal of Mathematical Analysis and Applications , 433(2):1459 – 1485,2016. 6[AZLO16] Zeyuan Allen-Zhu, Yin Tat Lee, and Lorenzo Orecchia. Using optimization to obtaina width-independent, parallel, simpler, and faster positive SDP solver. In
Proceedingsof the 2016 Annual ACM-SIAM Symposium on Discrete Algorithms , pages 1824–1831,2016. 1[Bal93] Keith Ball. The reverse isoperimetric problem for Gaussian measure.
Discrete & Com-putational Geometry , 10(4):411–420, 1993. 7, 3145Bal13] Keith Ball. Talk: Noise sensitivity and Gaussian surface area, 2013. .7, 34[Baz09] Louay MJ Bazzi. Polylogarithmic independence can fool DNF formulas.
SIAM Journalon Computing , 38(6):2220–2272, 2009. 11[Ben90] Vidmantas Bentkus. Smooth approximations of the norm and differentiable func-tions with bounded support in Banach space ℓ k ∞ . Lithuanian Mathematical Journal ,30(3):223–230, 1990. 2, 5, 6, 17, 18[Bha00] Rajendra Bhatia. Pinching, trimming, truncating, and averaging of matrices.
TheAmerican Mathematical Monthly , 107(7):602–608, 2000. 13[Bha13] Rajendra Bhatia.
Matrix analysis , volume 169. Springer Science & Business Media,2013. 6, 15, 32[BHK +
19] Boaz Barak, Samuel Hopkins, Jonathan Kelner, Pravesh K Kothari, Ankur Moitra,and Aaron Potechin. A nearly tight sum-of-squares lower bound for the planted cliqueproblem.
SIAM Journal on Computing , 48(2):687–735, 2019. 1[BLZ05] Jan Brinkhuis, Zhi-Quan Luo, and Shuzhong Zhang. Matrix convex functions withapplications to weighted centers for semidefinite programming, 2005. 6, 15[Boo05] Carl de Boor.
Divided differences . Surv. Approx. Theory 1, 2005. 13[BPT12] Grigoriy Blekherman, Pablo A. Parrilo, and Rekha R. Thomas.
Semidefinite Optimiza-tion and Convex Algebraic Geometry . Society for Industrial and Applied Mathematics,2012. 1[BS99] Rajendra Bhatia and Kalyan B. Sinha. Derivations, derivatives and chain rules.
LinearAlgebra and its Applications , 302-303:231 – 244, 1999. 5[BSS98] Rajendra Bhatia, Dinesh Singh, and Kalyan B. Sinha. Differentiation of operator func-tions and perturbation bounds.
Communications in Mathematical Physics , 191:603–611, 1998. 5[CDS19] Eshan Chattopadhyay, Anindya De, and Rocco A Servedio. Simple and efficient pseu-dorandom generators from Gaussian processes. In . Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2019. 1,2, 3, 6, 7[Col12] Rodney Coleman.
Calculus on normed vector spaces . Springer Science & BusinessMedia, 2012. 14, 15[CQT03] Xin Chen, Houduo Qi, and Paul Tseng. Analysis of nonsmooth symmetric-matrix-valued functions with applications to semidefinite complementarity problems.
SIAMJournal on Optimization , 13(4):960–985, 2003. 5[DGJ +
10] Ilias Diakonikolas, Parikshit Gopalan, Ragesh Jaiswal, Rocco A Servedio, and EmanueleViola. Bounded independence fools halfspaces.
SIAM Journal on Computing ,39(8):3441–3462, 2010. 1, 2, 3, 7, 32, 3446DHK +
10] Ilias Diakonikolas, Prahladh Harsha, Adam Klivans, Raghu Meka, Prasad Raghaven-dra, Rocco A Servedio, and Li-Yang Tan. Bounding the average sensitivity and noisesensitivity of polynomial threshold functions. In
Proceedings of the forty-second ACMsymposium on Theory of computing , pages 533–542, 2010. 1, 7[DP09] Devdatt P. Dubhashi and Alessandro Panconesi.
Concentration of Measure for the Analysis of Randomized Algorithms. Cambridge University Press, 2009. 38

[Erd45] Paul Erdős. On a lemma of Littlewood and Offord. Bulletin of the American Mathematical Society, 51(12):898–902, 1945. 8

[Fel68] William Feller. An introduction to probability theory and its applications, vol 1. New York: Wiley, 1968. 18

[FF88] Péter Frankl and Z. Füredi. Solution of the Littlewood-Offord problem in high dimensions.
Annals of Mathematics , pages 259–270, 1988. 8[FK20] Xiao Fang and Yuta Koike. High-dimensional central limit theorems by Stein’s method. arXiv preprint arXiv:2001.10917 , 2020. 17, 18[FMP +
15] Samuel Fiorini, Serge Massar, Sebastian Pokutta, Hans Raj Tiwary, and Ronald deWolf. Exponential lower bounds for polytopes in combinatorial optimization.
Journalof the ACM (JACM) , 62(2):1–23, 2015. 1[GKM18] Parikshit Gopalan, Daniel M Kane, and Raghu Meka. Pseudorandomness via thediscrete Fourier transform.
SIAM Journal on Computing, 47(6):2451–2487, 2018. 1

[GM12] Bernd Gärtner and Jiří Matoušek. Approximation algorithms and semidefinite programming. Springer Science & Business Media, 2012. 1, 10

[GOWZ10] Parikshit Gopalan, Ryan O'Donnell, Yi Wu, and David Zuckerman. Fooling functions of halfspaces under product distributions. In 25th Annual IEEE Conference on Computational Complexity, pages 223–234. IEEE, 2010. 1, 3

[GW95] Michel X. Goemans and David P. Williamson. Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming.
J. ACM ,42(6):1115–1145, 1995. 1[GW13] Gus Gutoski and Xiaodi Wu. Parallel approximation of min-max problems.
Computa-tional Complexity , 22:385 – 428, 2013. 1[Hau92] David Haussler. Decision theoretic generalizations of the PAC model for neural net andother learning applications.
Information and computation , 100(1):78–150, 1992. 9[HKM13] Prahladh Harsha, Adam Klivans, and Raghu Meka. An invariance principle for poly-topes.
Journal of the ACM (JACM) , 59(6):1–25, 2013. , 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11,25, 35, 39, 42, 43, 44[HMV06] J William Helton, Scott A McCullough, and Victor Vinnikov. Noncommutative convex-ity arises from linear matrix inequalities.
Journal of Functional Analysis , 240(1):105–191, 2006. 10[JJUW11] Rahul Jain, Zhengfeng Ji, Sarvagya Upadhyay, and John Watrous. QIP = PSPACE.
Journal of the ACM , 58(6), 2011. 1 47JLL +
20] Arun Jambulapati, Yin Tat Lee, Jerry Li, Swati Padmanabhan, and Kevin Tian. Posi-tive semidefinite programming: Mixed, parallel, and width-independent. In
Proceedingsof the 52nd Annual ACM SIGACT Symposium on Theory of Computing , STOC 2020,page 789–802, 2020. 1[JUW09] R. Jain, S. Upadhyay, and J. Watrous. Two-message quantum interactive proofs are inPSPACE. In ,pages 534–543, 2009. 1[JY11] R. Jain and P. Yao. A parallel approximation algorithm for positive semidefinite pro-gramming. In , pages 463–471, 2011. 1[Kan10] Daniel M. Kane. k -independent Gaussians fool polynomial threshold functions. arXivpreprint arXiv:1012.1614 , 2010. 1[Kan11a] Daniel M. Kane. The Gaussian surface area and noise sensitivity of degree- d polynomialthreshold functions. computational complexity , 20(2):389–412, 2011. 1, 7, 34[Kan11b] Daniel M. Kane. k-independent Gaussians fool polynomial threshold functions. In Proceedings of the 26th Annual IEEE Conference on Computational Complexity, CCC ,pages 252–261. IEEE Computer Society, 2011. 1[Kan11c] Daniel M. Kane. A small PRG for polynomial threshold functions of Gaussians. InRafail Ostrovsky, editor,
IEEE 52nd Annual Symposium on Foundations of ComputerScience, FOCS , pages 257–266. IEEE Computer Society, 2011. 1[Kan14a] Daniel Kane. The average sensitivity of an intersection of half spaces.
Research in theMathematical Sciences , 1(1):13, 2014. 7, 8, 11, 32, 35[Kan14b] Daniel M. Kane. A pseudorandom generator for polynomial threshold functions ofGaussian with subpolynomial seed length. In , pages 217–228. IEEE, 2014. 1[KKMS08] Adam Tauman Kalai, Adam R Klivans, Yishay Mansour, and Rocco A Servedio. Ag-nostically learning halfspaces.
SIAM Journal on Computing , 37(6):1777–1805, 2008.9[KM15] Pravesh K. Kothari and Raghu Meka. Almost optimal pseudorandom generators forspherical caps. In
Proceedings of the forty-seventh annual ACM symposium on Theoryof computing , pages 247–256, 2015. 1, 11[KOS04] Adam R Klivans, Ryan O’Donnell, and Rocco A Servedio. Learning intersections andthresholds of halfspaces.
Journal of Computer and System Sciences , 68(4):808–840,2004. 9, 31[KOS08] Adam R Klivans, Ryan O’Donnell, and Rocco A Servedio. Learning geometric conceptsvia Gaussian surface area. In , pages 541–550. IEEE, 2008. 2, 3, 7, 9[KSS94] Michael J Kearns, Robert E Schapire, and Linda M Sellie. Toward efficient agnosticlearning.
Machine Learning , 17(2-3):115–141, 1994. 948Lew96] A. S. Lewis. Derivatives of spectral functions.
Mathematics of Operations Research, 21(3):576–588, 1996. 5

[Lin22] J. W. Lindeberg. Eine neue Herleitung des Exponentialgesetzes in der Wahrscheinlichkeitsrechnung. Mathematische Zeitschrift, 15:211–225, 1922. 2, 4

[LO39] John Edensor Littlewood and Albert C. Offord. On the number of real roots of a random algebraic equation. II. In
Mathematical Proceedings of the Cambridge PhilosophicalSociety , volume 35, pages 133–148. Cambridge University Press, 1939. 8[LRS15] James R Lee, Prasad Raghavendra, and David Steurer. Lower bounds on the size ofsemidefinite programming relaxations. In
Proceedings of the forty-seventh annual ACMsymposium on Theory of computing , pages 567–576, 2015. 1[MJC +
14] Lester Mackey, Michael I. Jordan, Richard Y. Chen, Brendan Farrell, and Joel A.Tropp. Matrix concentration inequalities via the method of exchangeable pairs.
Ann.Probab. , 42(3):906–945, 05 2014. 6, 14[MOO05] Elchanan Mossel, Ryan O’Donnell, and Krzysztof Oleszkiewicz. Noise stability of func-tions with low influences: invariance and optimality. In , pages 21–30. IEEE, 2005. 5[Mos08] Elchanan Mossel. Gaussian bounds for noise correlation of functions and tight analysisof long codes. In , pages 156–165. IEEE Computer Society, 2008. 5[MZ13] Raghu Meka and David Zuckerman. Pseudorandom generators for polynomial thresholdfunctions.
SIAM Journal on Computing , 42(3):1275–1301, 2013. 1, 2, 3, 9, 43, 44[Naz03] Fedor Nazarov. On the maximal perimeter of a convex set in R n with respect to aGaussian measure. In Geometric aspects of functional analysis , pages 169–187. Springer,2003. 2, 3, 5, 7, 8, 32, 35[NN93] Joseph Naor and Moni Naor. Small-bias probability spaces: Efficient constructions andapplications.
A Proof of Lemma 34: Case 2
Recall that the goal is to prove the following inequality:
$$\Bigg|\sum_{i_1\neq i_2\neq i_3}\frac{\frac{g(x_{i_1})-g(x_{i_2})}{x_{i_1}-x_{i_2}}-\frac{g(x_{i_1})-g(x_{i_3})}{x_{i_1}-x_{i_3}}}{x_{i_2}-x_{i_3}}\; G(x)\, H_{i_1,i_2}H_{i_2,i_3}H_{i_3,i_1}\Bigg| \le O\big(\Delta\cdot\log^2 k\cdot\|H\|^3\big). \tag{54}$$
First observe that the LHS of the inequality above can be rephrased as follows:
$$\Bigg|\sum_{\substack{i_1\neq i_2\neq i_3\\ x_{i_2}>x_{i_3}}}\frac{\frac{g'(x_{i_1})g(x_{i_2})-g(x_{i_1})g'(x_{i_2})}{x_{i_1}-x_{i_2}}\,g(x_{i_3})-\frac{g'(x_{i_1})g(x_{i_3})-g(x_{i_1})g'(x_{i_3})}{x_{i_1}-x_{i_3}}\,g(x_{i_2})}{x_{i_2}-x_{i_3}}\, G\big(x_{-\{i_1,i_2,i_3\}}\big)\, H_{i_1,i_2}H_{i_2,i_3}H_{i_3,i_1}\Bigg|. \tag{55}$$
Providing an upper bound on this consists of several lemmas, and the result is concluded by combining all of them via triangle inequalities. To keep the expressions short, we use the following notation to represent Eq. (54), whose meaning is clear from context:
$$\Bigg|\sum_{\substack{i_1\neq i_2\neq i_3\\ x_{i_2}>x_{i_3}}}\frac{\frac{h'_{i_1}h_{i_2}-h_{i_1}h'_{i_2}}{[i_1-i_2]}\,h_{i_3}-\frac{h'_{i_1}h_{i_3}-h_{i_1}h'_{i_3}}{[i_1-i_3]}\,h_{i_2}}{[i_2-i_3]}\Bigg|, \tag{56}$$
where we implicitly hide the $G\big(x_{-\{i_1,i_2,i_3\}}\big)\,H_{i_1,i_2}H_{i_2,i_3}H_{i_3,i_1}$ term. We first give a sketch of how we are going to upper bound this quantity and break it into subsections:
$$(56) = \underbrace{\frac{h_{i_1}h'_{i_2}-h'_{i_1}h_{i_2}}{[i_1-i_2]}\cdot\frac{h_{i_2}-h_{i_3}}{[i_2-i_3]}}_{\text{Section A.1, Lemma 52}} \;-\; \underbrace{\frac{\frac{h_{i_1}h'_{i_2}-h'_{i_1}h_{i_2}}{[i_1-i_2]}-\frac{h_{i_1}h'_{i_3}-h'_{i_1}h_{i_3}}{[i_1-i_3]}}{[i_2-i_3]}\,h_{i_3}}_{(\star)\ \text{Section A.2, Remark 1}}.$$
We now break up Remark 1 into two cases:
$$(\star) = \underbrace{\text{Remark 1}\cdot\mathbb{I}\big[\min\{x_{i_2},x_{i_3}\} > x_{i_1}\big]}_{(\dagger)} \;+\; \underbrace{\text{Remark 1}\cdot\mathbb{I}\big[x_{i_3} < x_{i_1} < x_{i_2}\big]}_{(\dagger\dagger)}.$$
Note that these are the only two cases we need to handle since, by the symmetry between $i_2$ and $i_3$, we can assume $x_{i_2} > x_{i_3}$ without loss of generality. Now we bound these two terms separately:
$$(\dagger) = \underbrace{\frac{\frac{h'_{i_1}-h'_{i_2}}{[i_1-i_2]}-\frac{h'_{i_1}-h'_{i_3}}{[i_1-i_3]}}{[i_2-i_3]}\,h_{i_2}h_{i_3}}_{\text{Section A.2, Lemma 57}} \;-\; \underbrace{\frac{\frac{h_{i_1}-h_{i_2}}{[i_1-i_2]}-\frac{h_{i_1}-h_{i_3}}{[i_1-i_3]}}{[i_2-i_3]}\,h'_{i_2}h_{i_3}}_{\text{Section A.2, Lemma 58}},$$
and
$$(\dagger\dagger) = \underbrace{\frac{\frac{h_{i_1}h'_{i_2}-h_{i_2}h'_{i_1}}{[i_1-i_2]}-\frac{h_{i_1}h'_{i_3}-h_{i_3}h'_{i_1}}{[i_1-i_3]}}{[i_2-i_3]}\,h_{i_3}}_{\text{Section A.3, Lemma 59}} \;+\; \underbrace{\frac{\frac{h'_{i_1}-h'_{i_2}}{[i_1-i_2]}\,h_{i_2}-\frac{h'_{i_1}-h'_{i_3}}{[i_1-i_3]}\,h_{i_3}}{[i_2-i_3]}\,h_{i_3}}_{(\P)\ \text{Section A.4, Remark 2}},$$
and
$$(\P) = \underbrace{\frac{h'_{i_1}-h'_{i_2}}{[i_1-i_2]}\cdot\frac{h_{i_2}-h_{i_3}}{[i_2-i_3]}\cdot h_{i_3}}_{\text{Section A.4, Lemma 60}} \;+\; \underbrace{\frac{\frac{h'_{i_1}-h'_{i_2}}{[i_1-i_2]}-\frac{h'_{i_1}-h'_{i_3}}{[i_1-i_3]}}{[i_2-i_3]}\,h_{i_2}h_{i_3}}_{\text{Section A.4, Lemma 61}}.$$
Each of these terms is upper bounded by $O\big(\Delta\cdot\log^2 k\cdot\|H\|^3\big)$ in the respective sections (as indicated by the underbraces).

A.1 Case 2.1

Lemma 52.
$$\Bigg|\sum_{\substack{i_1\neq i_2\neq i_3\\ x_{i_2}>x_{i_3}}} \frac{h_{i_1}h'_{i_2}-h'_{i_1}h_{i_2}}{[i_1-i_2]}\cdot\frac{h_{i_2}-h_{i_3}}{[i_2-i_3]}\Bigg| \le O\big(\Delta\cdot\log k\cdot\|H\|^3\big).$$

Remark 1.
Using the triangle inequality, it suffices to upper bound (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) X i = i = i xi >xi h i ih i i ′ −h i i ′ h i i [ i − i ] − h i ih i i ′ −h i i ′ h i i [ i − i ] [ i − i ] h i i (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) Proof of Lemma 52.
We apply Claim 22 to the first sum and obtain O (∆ max { g ′ ( x i ) , g ′ ( x i ) } )(note that we have max {· , ·} to compensate for the fact that x i ≥ x i or x i ≥ x i ). Therefore, theleft hand side in Lemma 52 can be upper bounded by O X i = i = i xi >xi ∆ (cid:12)(cid:12)(cid:12)(cid:12) max (cid:8) g ′ ( x i ) , g ′ ( x i ) (cid:9) · g ( x i ) − g ( x i ) x i − x i · G (cid:0) x −{ i ,i ,i } (cid:1) H i ,i H i ,i H i ,i (cid:12)(cid:12)(cid:12)(cid:12) ≤ O X i = i = i xi >xi ≥ ,xi ≥ ( · · · ) + X i = i = i xi >xi ≥ ,xi < ( · · · ) + X i = i = i xi >xi ,xi < ,xi ≥ ( · · · ) + X i = i = i xi >xi ,xi < ,xi < ( · · · ) (57) First term in Eq. (57) . Note that g ( x ) ≥ if x ≥
0. Since g ′ is monotone decreasing in theinterval [0 , ∞ ), the first summation is upper bounded by O (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) X i = i = i xi >xi ≥ ,xi ≥ ∆ (cid:12)(cid:12) max (cid:8) g ′ ( x i ) g ′ ( x i ) G ( x − i ) , g ′ ( x i ) g ′ ( x i ) G ( x − i ) (cid:9) H i ,i H i ,i H i ,i (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ O ∆ · k G (2) k · max i ,i X i | H i ,i H i ,i H i ,i | ! (58) ≤ O ∆ · log k · max i ,i X i | H i ,i H i ,i H i ,i | ! ≤ O (cid:0) ∆ · log k · k H k (cid:1) (59)where the first inequality used that | g ′ ( ζ i ,i ) | ≤ max {| g ′ ( x i ) | , | g ′ ( x i ) |} ) and the last inequalityfollows by Eq. (35). Second term in Eq. (57) . The second summation is upper bounded as follows, again by the52ean value theorem observe that O X i = i = i xi >xi ≥ ,xi < ∆ (cid:12)(cid:12) max (cid:8) g ′ ( x i ) g ′ ( x i ) , g ′ ( x i ) g ′ ( x i ) (cid:9) G ( x − i ) H i ,i H i ,i H i ,i (cid:12)(cid:12) ≤ O ∆ · X i : x i < k G (1) ( x − i ) k max i X i | H i ,i H i ,i H i ,i | + k G (2) ( x − i ) k max i ,i | H i ,i H i ,i H i ,i | ≤ O (cid:16) ∆ · log . k · k H k (cid:17) , where the last inequality is from Fact 20 and the assumption that |{ i : x i ≤ }| ≤ k . Third term in Eq. (57) . Using the fact that g ′ ( · ) is bounded by a constant, the thirdsummation is upper bounded by O X i = i = i xi >xi ,xi < ,xi ≥ ∆ (cid:12)(cid:12) max (cid:8) g ′ ( x i ) , g ′ ( x i ) (cid:9) · G (cid:0) x −{ i ,i } (cid:1) H i ,i H i ,i H i ,i (cid:12)(cid:12) = O X i = i = i xi >xi ,xi ≥ ,xi < ,xi ≥ ( · · · ) + X i = i = i xi >xi ,xi ≥ ,xi < ,xi < ( · · · ) . (60)For the first summation in Eq. (60), using the fact that g ( x ) ≥ when x ≥
0, it is upper bounded by O ∆ X i = i = i xi >xi ,xi ≥ ,xi < ,xi ≥ (cid:12)(cid:12) max (cid:8) g ′ ( x i ) , g ′ ( x i ) (cid:9) · G (cid:0) x −{ i } (cid:1) H i ,i H i ,i H i ,i (cid:12)(cid:12) ≤ O ∆ X i : x i < k G (1) k max i X i | H i ,i H i ,i H i ,i | ≤ O ∆ X i : x i < p log k max i X i | H i ,i H i ,i H i ,i | ≤ O (cid:16) ∆ · log . k · k H k (cid:17) , where the second inequality is from Fact 20, and the last inequality used Eq. (35) and the assumptionthat |{ i : x i ≤ }| ≤ k .In order to upper bound the second summation in Eq. (60), first observe that both g ( · ) and G ( · ) are positive and upper bounded by 1. Thus, Eq. (60) can be bounded as O ∆ X i = i xi < ,xi < X i | H i ,i H i ,i H i ,i | ≤ O (cid:16) ∆ · log k · k H k (cid:17) . |{ i : x i ≤ }| ≤ k . Fourth term in Eq. (57) . The last summation is upper bounded by O (cid:16) ∆ · log k · k H k (cid:17) using the same arguments to upper bound the second summation in Eq. (60). A.2 Case 2.2
We upper bound the quantity in Remark 1 in the two cases $x_{i_2} > x_{i_3}$ and $x_{i_3} > x_{i_2}$. In order to prove this lemma we need the following lemmas and claims.

Claim 53. For integer $k \ge 1$, $X \in \mathrm{Sym}_k$ and $H \in \mathrm{Mat}_k$ it holds that
$$\big\|(XH+HX)e^{-X^2/2}\big\| \le 2\|X\|\cdot\big\|He^{-X^2/2}\big\|.$$

Proof. As the Schatten norm is unitarily invariant, we may assume that $X = \mathrm{diag}(x_1,\ldots,x_k)$ is diagonal without loss of generality. Then
$$\big\|(XH+HX)e^{-X^2/2}\big\|^2 = \sum_{i,j} H_{i,j}^2\,(x_i+x_j)^2\, e^{-x_j^2} \le 4\|X\|^2\sum_{i,j} H_{i,j}^2\, e^{-x_j^2} = 4\|X\|^2\,\big\|He^{-X^2/2}\big\|^2.$$

Lemma 54.
Given an integer k ≥ , u , u , u ≥ satisfying u + u + u = 1 and X ∈ Sym k , H , H , H ∈ Mat k , if u , u ≤ , then it holds that (cid:12)(cid:12)(cid:12) Tr h e − u X H e − u X H e − u X H i(cid:12)(cid:12)(cid:12) ≤ k H e − X k · k H e − X k · k H k Proof.
Using the inequality | Tr ABC | ≤ k A k · k B k · k C k (where k · k is the standard Frobeniusnorm and k · k is the spectral norm), we have (cid:12)(cid:12)(cid:12) Tr h e − u X H e − u X H e − u X H i(cid:12)(cid:12)(cid:12) ≤ k e − u X H e ( u − ) X k · k e − u X H e ( u − ) X k · k H k≤ k H e − X k · k H e − X k · k H k The last inequality is from Lemma 55
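The first step of this proof is the standard bound $|\mathrm{Tr}[ABC]| \le \|A\|\cdot\|B\|\cdot\|C\|_{\mathrm{op}}$ (Frobenius norms on the first two factors, spectral norm on the third), which follows from Cauchy–Schwarz for the trace inner product together with $\|BC\| \le \|B\|\,\|C\|_{\mathrm{op}}$. A quick numerical sanity check of this generic inequality (illustrative only, not part of the proof):

```python
import numpy as np

# Numerical sanity check (illustrative): |Tr[ABC]| <= ||A||_F * ||B||_F * ||C||_op,
# i.e. Frobenius norms on the first two factors and the spectral norm on the last.
rng = np.random.default_rng(0)
k = 8

for _ in range(1000):
    A, B, C = (rng.standard_normal((k, k)) for _ in range(3))
    lhs = abs(np.trace(A @ B @ C))
    rhs = np.linalg.norm(A, 'fro') * np.linalg.norm(B, 'fro') * np.linalg.norm(C, 2)
    assert lhs <= rhs + 1e-9

print("trace inequality holds on 1000 random triples")
```

The same check passes for any matrix size; the spectral norm on the last factor is what lets the proof keep a bare $\|H_3\|$ with no Gaussian damping.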
Lemma 55.
Given diagonal matrices $A = \mathrm{diag}(a_1,\ldots,a_k)$, $B = \mathrm{diag}(b_1,\ldots,b_k)$ with $a_1 \ge \cdots \ge a_k \ge 0$ and $b_1 \ge \cdots \ge b_k \ge 0$, and a symmetric matrix $H$, it holds that $\|AHB\| \le \|HAB\|$.

Proof. Note that
$$\|HAB\|^2 - \|AHB\|^2 = \sum_{i,j} H_{i,j}^2\big(a_j^2 b_j^2 - a_i^2 b_j^2\big) = \frac12\sum_{i,j} H_{i,j}^2\Big(a_i^2 b_i^2 + a_j^2 b_j^2 - a_i^2 b_j^2 - a_j^2 b_i^2\Big) = \frac12\sum_{i,j} H_{i,j}^2\big(a_i^2 - a_j^2\big)\big(b_i^2 - b_j^2\big) \ge 0,$$
where the second equality is from the symmetry of $H$ (pairing the $(i,j)$ and $(j,i)$ terms), and the final quantity is nonnegative because the $a_i$'s and $b_i$'s are sorted in the same order.

Lemma 56. Given an integer $k \ge 1$, matrices $A, B, C \in \mathrm{Mat}_k$ and $X \in \mathrm{Sym}_k$ with $\|X\| \le \Delta$, it holds that
$$\Big|\mathrm{Tr}\Big[D^2\big(e^{-X^2/2}\big)[A,B]\,C\Big]\Big| \le O(\Delta^2)\cdot\max\Big\{\|Ae^{-X^2/2}\|\cdot\|Be^{-X^2/2}\|\cdot\|C\|,\ \|Ae^{-X^2/2}\|\cdot\|Ce^{-X^2/2}\|\cdot\|B\|,\ \|Be^{-X^2/2}\|\cdot\|Ce^{-X^2/2}\|\cdot\|A\|\Big\}.$$

Proof.
Combining Lemma 12, Lemma 54 and the inequalities $\|(XA+AX)e^{-X^2/2}\| \le 2\Delta\,\|Ae^{-X^2/2}\|$ and $\|XA+AX\| \le 2\Delta\,\|A\|$ (both using $\|X\| \le \Delta$), we conclude the result.

Lemma 57.
$$\Bigg|\sum_{\substack{i_1\neq i_2\neq i_3\\ x_{i_2}>x_{i_1},\ x_{i_3}>x_{i_1}}} \frac{\frac{h'_{i_1}-h'_{i_2}}{[i_1-i_2]}-\frac{h'_{i_1}-h'_{i_3}}{[i_1-i_3]}}{[i_2-i_3]}\,h_{i_2}h_{i_3}\Bigg| \le O\big(\Delta\cdot\log^{1.5} k\cdot\|H\|^3\big).$$

Proof of Lemma 57.
We break the summation into two summations (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) X i = i = i xi >xi >xi ( · · · ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) X i = i = i xi >xi >xi ( · · · ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (61)For the first summation, we define A i,j = ( H i,j , if x i < x j , otherwise . and Then k A k ≤ log k · k H k by Fact 4 (without loss of generality, we may assume that x i s aresorted in increasing order. Further notice that all the diagonal entries of H are zeros. Thus A isthe upper triangle part of H ). We first bound the first term in Eq. (61). In this direction, we firstrewrite it as1 √ π X i G ( x − i ) (cid:18)(cid:16) D (cid:16) e − X / (cid:17) [ A, A T ] H (cid:17) i ,i (cid:19) = 1 √ π X i < ( · · · ) + 1 √ π X i ≥ ( · · · ) , (62)where X = diag ( x , . . . , x k ) and we implicitly used that we are summing over terms with x i < x i .For the first summation in Eq. (62), (cid:12)(cid:12)(cid:12) Tr (cid:16) D (cid:16) e − X / (cid:17) [ A, A T ] HE i ,i (cid:17)(cid:12)(cid:12)(cid:12) ≤ log k · max n k Ae − X / k · k HE i ,i k , k Ae − X / k · k He − X / k · k AE i ,i k o ≤ log k k He − X / k · k H k . (63)55here the first inequality is by Lemma 56 and the second inequality is because X and E i ,i arediagonal and A is a submatrix of H . Thus, the first summation in Eq. (62) is upper bounded by∆ · log k √ π X i : x i < G ( x − i ) k He − X / k · k H k = ∆ log k · X i : x i < G ( x − i ) X i ,i e − x i H i ,i k H k ≤ ∆ log k · max i X i = i g ′ ( x i ) · G ( x − i ) · H i ,i k H k ≤ O (cid:16) ∆ log . k k H k (cid:17) , (64)where the first inequality is from the assumption that |{ i : x i < }| ≤ k and the second in-equality is from Fact 20.For the second summation in Eq. (62), we define˜ H i,j = ( H i,j g ( x j ) , if x j ≥ , otherwise . Then k ˜ H k ≤ k H k as g ( x i ) ≥ if x i ≥
0. It is easy to verify that the second summation inEq. (62) is equal to (cid:12)(cid:12)(cid:12)(cid:12) √ π G ( x ) Tr D (cid:16) e − X / (cid:17) [ A, A T ] ˜ H (cid:12)(cid:12)(cid:12)(cid:12) ≤ ∆ log k √ π G ( x ) k He − X / k k H k ≤ O (cid:16) ∆ · log . k k H k (cid:17) . where the first inequality is from Lemma 56 and the second inequality is from Fact 20.Finally, the second summation in Eq. (61) can be upper bounded using the verbatim samearguments by O (cid:16) ∆ · log . k · k H k (cid:17) . This proves the lemma statement. Lemma 58. (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) X i = i = i xi >xi ,xi >xi h i i−h i i [ i − i ] − h i i−h i i [ i − i ] [ i − i ] h i i ′ h i i (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ O (cid:16) ∆ · log . k · k H k (cid:17) Proof. (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) X i = i = i xi >xi ,xi >xi h i i−h i i [ i − i ] − h i i−h i i [ i − i ] [ i − i ] h i i ′ h i i (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) = (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) X i = i = i xi >xi ,xi >xi ≥ ( · · · ) + X i = i = i xi >xi ,xi > ,xi < ( · · · ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (65)56o upper bound the first summation in Eq. (65), we apply Fact 3 and upper bound the firstsummation by O X i = i = i xi >xi ,xi >xi ≥ (cid:12)(cid:12) g ′′ ( ξ i ,i ,i ) g ′ ( x i ) G (cid:0) x −{ i ,i } (cid:1) H i ,i H i ,i H i ,i (cid:12)(cid:12) ≤ O (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ∆ · X i = i = i xi >xi ,xi >xi ≥ g ′ ( x i ) g ′ ( x i ) G (cid:0) x −{ i ,i } (cid:1) H i ,i H i ,i H i ,i (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ O k G (2) ( x ) k max i ,i X i | H i ,i H i ,i H i ,i | ! ≤ O (cid:16) ∆ · log k · k H k (cid:17) where the last inequality is from Fact 20 and Eq. (35). Note that | g ′′ ( ξ ) | ≤ ∆ for any ξ ∈ [ x i , max { x i , x i } ] by Eq. 
(11). Applying Fact 3, the second summation in Eq. (65) is upperbounded by O ∆ X i : x i < X i ,i g ′ ( x i ) G (cid:0) x −{ i ,i } (cid:1) | H i ,i H i ,i H i ,i | ≤ O ∆ · log k · max i k G (1) ( x − i ) k · max i X i | H i ,i H i ,i H i ,i | ! ≤ O (cid:16) ∆ · log . k · k H k (cid:17) where the first inequality is from the assumption that |{ i : x i < }| ≤ k and the second in-equality is from Fact 20 and Eq. (35). A.3 Case 2.3
We now bound the second case of Remark 1 when x i > x i > x i . Recall that the goal is to upperbound the following lemma. Lemma 59. (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) X i = i = i xi >xi >xi h i ih i i ′ −h i ih i i ′ [ i − i ] − h i ih i i ′ −h i ih i i ′ [ i − i ] [ i − i ] h i i (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ O (cid:16) ∆ · log . k · k H k (cid:17) Remark 2.
Combining with Remark 1, it suffices to upper bound (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) X i = i = i xi >xi >xi h i i ′ −h i i ′ [ i − i ] h i i − h i i ′ −h i i ′ [ i − i ] h i i [ i − i ] h i i (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) roof of Lemma 59. (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) X i = i = i xi >xi >xi h i ih i i ′ −h i ih i i ′ [ i − i ] − h i ih i i ′ −h i ih i i ′ [ i − i ] [ i − i ] h i i (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ X i = i = i xi >xi >xi (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) h i ih i i ′ −h i ih i i ′ [ i − i ] − h i ih i i ′ −h i ih i i ′ [ i − i ] [ i − i ] h i i (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + X i = i = i xi >xi >xi (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) h i ih i i ′ −h i ih i i ′ [ i − i ] − h i ih i i ′ −h i ih i i ′ [ i − i ] [ i − i ] h i i (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) = X i = i = i xi >xi >xi (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) h i i−h i i [ i − i ] − h i i−h i i [ i − i ] [ i − i ] h i i ′ h i i (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + X i = i = i xi >xi >xi (cid:12)(cid:12)(cid:12)(cid:12) h i i − h i i [ i − i ] · h i i ′ − h i i ′ [ i − i ] · h i i (cid:12)(cid:12)(cid:12)(cid:12) (66)The first term is upper bounded by O (cid:16) ∆ · log . 
k · k H k (cid:17) using the same argument in Lemma 58.The second term can be rephrased as (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) X i = i = i xi >xi >xi g ( x i ) − g ( x i ) x i − x i · g ′ ( x i ) − g ′ ( x i ) x i − x i g ( x i ) G ( x − i ) H i ,i H i ,i H i ,i (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) X i = i = i xi >xi >xi ,xi ≥ ( · · · ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) + (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) X i = i = i xi >xi >xi ,xi < ( · · · ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (67)For the first summation in Eq. (67), we apply the mean value theorem for both g and g ′ . FromEq. (11) it is upper bounded by X i = i = i xi >xi >xi ,xi ≥ ∆ (cid:12)(cid:12) g ′ ( x i ) g ′ ( x i ) g ( x i ) G ( x − i ) H i ,i H i ,i H i ,i (cid:12)(cid:12) ≤ O (cid:16) ∆ · k G (2) ( x ) k k H k (cid:17) ≤ O (cid:16) ∆ · log k · k H k (cid:17) . For the second term in Eq. (67), it is not hard to verify that (cid:12)(cid:12)(cid:12)(cid:12) g ′ ( x i ) − g ′ ( x i ) x i − x i (cid:12)(cid:12)(cid:12)(cid:12) ≤ ∆ max (cid:8) g ′ ( x i ) , g ( x i ) (cid:9) (68)Further notice that | g ′ ( · ) | ≤
1. Applying the mean value theorem to g , we upper bound the secondsummation in 67 by O ∆ X i = i = i xi >xi >xi ,xi < max (cid:8) g ′ ( x i ) , g ′ ( x i ) (cid:9) g ( x i ) G ( x − i ) | H i ,i H i ,i H i ,i | ≤ O ∆ · log k · max i ·k G (1) ( x − i ) k · max i X i | H i ,i H i ,i H i ,i | ! ≤ O (cid:16) ∆ · log . k · k H k (cid:17) |{ i : x i < }| ≤ k and the second in-equality is from Fact 20 Eq. (35). A.4 Case 2.4
In this section, we want to upper bound Remark 2. To do so we write it as the sum of two terms (the first one is easy to bound).
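Throughout these sections, divided differences such as $\frac{h_{i_1}-h_{i_2}}{[i_1-i_2]}$ are controlled via the mean value theorem: a first divided difference of $g$ is bounded by $\sup|g'|$, and a second divided difference by $\frac{1}{2}\sup|g''|$. A small numerical illustration with $\tanh$ as a stand-in smooth function (the paper's mollifier $g$ is a different function; this only checks the generic calculus facts being invoked):

```python
import numpy as np

# Illustrative check of the mean-value-theorem bounds used for divided differences.
# g = tanh is a stand-in; the mollifier g in the paper is a different function.
rng = np.random.default_rng(1)
g = np.tanh
SUP_G1 = 1.0                          # sup |tanh'(x)| = 1, attained at x = 0
SUP_G2 = 4.0 / (3.0 * np.sqrt(3.0))   # sup |tanh''(x)| = 4 / (3*sqrt(3))

def dd1(a, b):
    """First divided difference g[a, b] = (g(a) - g(b)) / (a - b)."""
    return (g(a) - g(b)) / (a - b)

for _ in range(1000):
    x1, x2, x3 = rng.standard_normal(3) * 3
    if min(abs(x1 - x2), abs(x1 - x3), abs(x2 - x3)) < 1e-3:
        continue  # skip near-collisions to avoid floating-point cancellation
    # |g[a, b]| <= sup|g'| by the mean value theorem
    assert abs(dd1(x1, x2)) <= SUP_G1 + 1e-9
    # second divided difference: (g[x1,x2] - g[x1,x3]) / (x2 - x3) = g''(xi) / 2
    dd2 = (dd1(x1, x2) - dd1(x1, x3)) / (x2 - x3)
    assert abs(dd2) <= SUP_G2 / 2 + 1e-6

print("divided-difference bounds hold on random triples")
```

The second assertion is exactly the $|g''(\xi)| \le \Delta$ step used in Lemmas 58 and 59, with $\tanh$'s explicit derivative bounds standing in for $\Delta$.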
Lemma 60.
$$\Bigg|\sum_{\substack{i_1\neq i_2\neq i_3\\ x_{i_2}>x_{i_1}>x_{i_3}}} \frac{h'_{i_1}-h'_{i_2}}{[i_1-i_2]}\cdot\frac{h_{i_2}-h_{i_3}}{[i_2-i_3]}\cdot h_{i_3}\Bigg| \le O\big(\Delta\cdot\log k\cdot\|H\|^3\big).$$

Proof of Lemma 60.
We split the summation into the two cases $x_{i_1} \ge 0$ and $x_{i_1} < 0$. For the case that $x_{i_1} \ge 0$, we apply the mean value theorem to $g(\cdot)$ and Eq. (68); it is upper bounded by $O\big(\Delta\cdot\log k\cdot\|H\|^3\big)$. For the case that $x_{i_1} < 0$, we have $x_{i_3} < 0$. Note that $|g'(\cdot)| \le$
1. Thus it isupper bounded by O X i = i xi Lemma 61. (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) X i = i = i xi >xi >xi h i i ′ −h i i ′ [ i − i ] − h i i ′ −h i i ′ [ i − i ] [ i − i ] h i ih i i (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ O (cid:16)p log k · k H k (cid:17) . Before we prove this lemma, we first prove a “simpler” proposition which will be crucial inupper bound the above. Proposition 62. (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) X i = i = i h i i ′ −h i i ′ [ i − i ] − h i i ′ −h i i ′ [ i − i ] [ i − i ] (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ O (cid:16) ∆ · p log k · k H k (cid:17) Proof. Using Fact 10, (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) X i = i = i g ′ ( x i ) − g ′ ( x i ) x i − x i − g ′ ( x i ) − g ′ ( x i ) x i − x i x i − x i G ( x ) H i ,i H i ,i H i ,i (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) = 1 √ π (cid:12)(cid:12)(cid:12)(cid:12) Tr (cid:20) D (cid:18) e − X (cid:19) [ H, H ] · H (cid:21)(cid:12)(cid:12)(cid:12)(cid:12) G ( x ) , X = diag ( x , . . . , x n ). Using Lemma 12, it suffices to upper bound G ( x ) (cid:12)(cid:12)(cid:12)(cid:12) Tr (cid:20) e − uX ( XH + HX ) e − v (1 − u ) X ( XH + HX ) e − (1 − v )(1 − u ) X H (cid:21)(cid:12)(cid:12)(cid:12)(cid:12) (69)and G ( x ) (cid:12)(cid:12)(cid:12)(cid:12) Tr (cid:20) e ( u − X H e − uX H (cid:21)(cid:12)(cid:12)(cid:12)(cid:12) (70)Note that u + v (1 − u ) + (1 − v ) (1 − u )=1. At least two of these three quantities are at most .We upper bound Eq. 69 in the following three cases.If u ≤ and (1 − u ) (1 − v ) ≤ , using Claim 53 and Lemma 54, Eq. (69) is upper bounded by (cid:13)(cid:13) ( XH + HX ) e − X (cid:13)(cid:13) k H k ≤ ∆ k He − X / k · k H k If u ≤ and v (1 − u ) ≤ , then the Eq. (69) is upper bounded by k ( XH + HX ) e − X k · k He − X k k XH + HX k ≤ k He − X / k · k XH + HX k≤ k He − X / k · k H k . where the second last inequality is by Claim 53. 
The case that u (1 − v ) ≤ and v (1 − u ) ≤ follows similarly.Also Eq. (70) can be upper bounded with similar arguments. Thus G ( x ) · (cid:12)(cid:12)(cid:12) Tr e ( u − X / H e − uX / H (cid:12)(cid:12)(cid:12) ≤ G ( x ) k H k · k He − X / k . (71)Therefore, (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) X i = i = i g ′ ( x i ) − g ′ ( x i ) x i − x i − g ′ ( x i ) − g ′ ( x i ) x i − x i x i − x i G ( x ) H i ,i H i ,i H i ,i (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (72) ≤ (cid:18) G ( x ) · Tr (cid:20) D (cid:18) e − X (cid:19) [ H, H ] · H (cid:21)(cid:19) (73) ≤ O (cid:16) ∆ G ( x ) k H kk He − X / k (cid:17) = O ∆ X i ,i e − x i H i ,i G ( x ) · k H k ≤ O ∆ X i ,i g ′ ( x i ) G ( x ) H i ,i · k H k ≤ O ∆ X i ,i g ′ ( x i ) G ( x − i ) H i ,i · k H k ≤ O ∆ k G (1) ( x ) k · max i X i H i ,i ! · k H k ! ≤ O (cid:18) ∆ k G (1) ( x ) k · max i (cid:0) H (cid:1) i ,i · k H k (cid:19) ≤ O (cid:16) ∆ · p log k · k H k (cid:17) , (74)60here the second inequality used e − x i / ≤ 1, third inequality used g ( x ) ∈ [0 , 1] and the lastinequality is from Fact 20.We are now ready to prove the main lemma. Note that end of the day we need to bound thecase (in Remark 2) when the quantity in Lemma 60 contains G ( x − i ) instead of G ( x −{ i ,i ,i } ).Observe that the inequality in Lemma 61 can be written as (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) X i = i = i i
By the paragraph above, proving this lemma is equivalent to proving Eq. (75).For the first summation above, let (cid:0) A i (cid:1) i,j = ( H i,i , if j = i and i > i , otherwise . The left hand side of the claim statement can be expressed as (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) √ π X i G ( x − i ) (cid:16) Tr D (cid:16) e − X / (cid:17) h A i , (cid:0) A i (cid:1) T i H (cid:17)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) = (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) √ π X i : x i < ( · · · ) + 1 √ π X i : x i ≥ ( · · · ) (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) (76)Using the same arguments in Lemma 62 and the fact that k A i e − X / k ≤ k He − X / k , k A i k ≤k H k , we can upper bound the first summation in Eq. (76) by1 √ π X i : x i < G ( x − i ) k He − X / k · k H k ≤ O (cid:16) ∆ · log . k k H k (cid:17) , where the inequality follows from the argument in Eq. (64).For the second summation, define B i ,i = H i ,i q g ( x i ) , if x i ≥ , otherwise . Note that k Be − X / k ≤ √ k He − X / k and k B k ≤ √ k H k (since g ( x ) ≥ / x ≥ (cid:12)(cid:12)(cid:12)(cid:12) G ( x ) 1 √ π Tr D (cid:16) e − X / (cid:17) (cid:2) B, B T (cid:3) A (cid:12)(cid:12)(cid:12)(cid:12) ≤ √ π G ( x ) k He − X / k k H k ≤ O (cid:16) ∆ · log . k · k H k (cid:17)(cid:17)
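Several of the bounds above go through Lemma 55-type rearrangement facts. The proof of Lemma 55 pairs the $(i,j)$ and $(j,i)$ terms, which is where symmetry enters; under that reading of the (garbled) statement — diagonal $A, B$ with nonnegative entries sorted in the same order and a symmetric $H$ — the Frobenius-norm bound $\|AHB\| \le \|HAB\|$ can be sanity-checked numerically:

```python
import numpy as np

# Sanity check (under the stated assumptions): for diagonal A, B with nonnegative
# entries sorted in the same order and symmetric H, ||A H B||_F <= ||H A B||_F.
# Key identity: ||HAB||^2 - ||AHB||^2
#             = (1/2) * sum_{i,j} H_ij^2 * (a_i^2 - a_j^2) * (b_i^2 - b_j^2) >= 0.
rng = np.random.default_rng(2)
k = 10

for _ in range(500):
    a = np.sort(rng.random(k))[::-1]   # decreasing, nonnegative
    b = np.sort(rng.random(k))[::-1]   # decreasing, nonnegative (same ordering as a)
    A, B = np.diag(a), np.diag(b)
    S = rng.standard_normal((k, k))
    H = S + S.T                        # symmetric: the proof pairs (i,j) with (j,i)
    lhs = np.linalg.norm(A @ H @ B, 'fro')
    rhs = np.linalg.norm(H @ A @ B, 'fro')
    assert lhs <= rhs + 1e-9

print("Lemma 55-type bound holds on 500 random symmetric instances")
```

The same-ordering hypothesis is what makes both factors $(a_i^2 - a_j^2)$ and $(b_i^2 - b_j^2)$ share a sign; in the proofs above it is satisfied because $A$ and $B$ are both decreasing functions of the same diagonal $X^2$.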