The smoothed complexity of Frank-Wolfe methods via conditioning of random matrices and polytopes
TThe smoothed complexity of Frank-Wolfe methods via conditioningof random matrices and polytopes
Luis RademacherUniversity of California, Davis [email protected]
Chang ShuUniversity of California, Davis [email protected]
Abstract
Frank-Wolfe methods are popular for optimization over a polytope. One of the reasons isbecause they do not need projection onto the polytope but only linear optimization over it.To understand its complexity, a fruitful approach in many works has been the use of conditionmeasures of polytopes. Lacoste-Julien and Jaggi introduced a condition number for polytopesand showed linear convergence for several variations of the method. The actual running timecan still be exponential in the worst case (when the condition number is exponential). Westudy the smoothed complexity of the condition number, namely the condition number of smallrandom perturbations of the input polytope and show that it is polynomial for any simplexand exponential for general polytopes. Our results also apply to other condition measuresof polytopes that have been proposed for the analysis of Frank-Wolfe methods: vertex-facetdistance (Beck and Shtern) and facial distance (Pe˜na and Rodr´ıguez).Our argument for polytopes is a refinement of an argument that we develop to study theconditioning of random matrices. The basic argument shows that for c > d -by- n randomGaussian matrix with n ≥ cd has a d -by- d submatrix with minimum singular value that isexponentially small with high probability. This also has consequences on known results aboutthe robust uniqueness of tensor decompositions, the complexity of the simplex method and thediameter of polytopes. Frank-Wolfe methods (FWMs) [FW56] are a family of algorithms that attempt to minimize adifferentiable function over a convex set. For concreteness we start by describing the basic Frank-Wolfe method to minimize a differentiable function f : C (cid:55)→ R where C ⊆ R d is a compact convexset. It is an iterative method and proceeds as follows:Let x ∈ C . for k = 0 , . . . , K do Compute y ∈ argmin x ∈ C ( ∇ f ( x k )) T x .Let x k +1 = x k + α ∗ ( y − x k ), where α ∗ is a suitablestep size. end for min x ∈ C (cid:107) x (cid:107) x x x O a r X i v : . [ c s . D S ] N ov ome of our results are about Wolfe’s method [Wol76], which is a variation of Frank-Wolfemethods specialized to the minimum norm point problem in a polytope (that is, a bounded convexpolyhedron). In this paper we are interested in the complexity of FWMs. The time complexity of Wolfe’s methodis know to be exponential in the worst case (by an upper bound in [Wol76] and a lower bound in[LHR20]). There is a large body of work proving linear convergence of several variations of FWMs[GM86, GH13, LJJ13, LJJ15, BS17, PnRS16, PnR19, PNAJ20]. We are particularly interested in[LJJ13, LJJ15, BS17, PnRS16, PnR19] which prove global linear convergence of certain variationsof FWMs: F-W with away steps, pairwise F-W and Wolfe’s method when the feasible region is apolytope C = conv( A ) for finite A ⊆ R d . In these results the upper bound on the running time(actual speed of linear convergence) depends on a condition number of C . Informally speaking, thedependence is of the following kind: if x t is the current point after t iterations, then the functionvalue satisfies f ( x t ) − f ∗ ≤ (1 − κ ) t ( f ( x ) − f ∗ ) where f ∗ is the optimal value, x is the initialpoint and 0 ≤ κ ≤ κ is small, then convergence is slow. 
In thepreviously mentioned papers, κ is of the form “something” / diam( C ), where “something” can be: • [LJJ15] minimum width, minwidth( A ) = min S ⊆ A width( S ) (width is standard, see Sec-tion 2.8.1); • [LJJ15] pyramidal width, PWidth( A ); • [BS17] vertex-facet distance, vf( C ) = min F ∈ facets( C ) d (aff F, vertices( C ) \ F ); or • [PnR19] facial distance, Φ( C ) = min F ∈ faces( C ) ∅ (cid:32) F (cid:32) C d ( F, conv(vertices( C ) \ F )).We do not provide a definition of pyramidal width at this point as it is complicated and itwas shown in [PnR19] that PWidth( A ) = Φ( C ) (Theorem 2.24 here). It is also known thatminwidth( A ) ≤ PWidth( A ) [LJJ15, Section 3.1]. We start with the observation that Φ( C ) ≤ vf( C )(Theorem 2.25). (Note that the reverse inequality was claimed in [PnR19], but the cube [0 , d isa counterexample: Φ([0 , d ) = 1 / √ d while vf([0 , d ) = 1.) This implies that all four quantities liebetween minwidth( A ) and vf( C ) (Theorem 2.25). It follows from [LHR20] that all of them can beexponentially small as a function of the bit-length of A . In fact, a stronger result follows from thework of Alon and Vu [AV97] combined with the stated inequalities. Alon and Vu showed that thereis a 0–1 simplex S such that vf( S ) is sub-exponentially small in the dimension (Corollary 3.3). Theconnection between polytope conditioning for FWMs and the Alon and Vu result was observed in[LJJ15].The main contributions of this paper are about the smoothed analysis of FWMs and the condi-tion numbers of matrices and polytopes. Smoothed analysis [ST01] is an approach to understandthe behavior of algorithms that are efficient in practice but are inefficient in the worst case. Themain idea is to study small random perturbations of any given instance of a problem. Suppose thatthe instance is described by a vector x ∈ R n . Then one aims to understand T ( x + g ), where g ∈ R n is a random vector with distribution N (0 , σ I n ) and T is a measure of complexity (for example, T ( x ) could be the running time of a particular algorithm on input x ). We adopt a definition thatfirst appeared in [BV04, BV06]. 2 efinition 1.1 ([RV05] [RV07, Section 1.1]) . We say T has (probabilistic) polynomial smoothedcomplexity if there is a polynomial p such that max x ∈ R n , (cid:107) x (cid:107)≤ P g (cid:0) T ( x + g ) ≥ p ( n, /σ, /δ ) (cid:1) ≤ δ Our first smoothed analysis result concerns FWMs minimizing a convex function on a simplex(Section 3). We show that minwidth has good smoothed complexity (Lemma 3.6). This impliesthe following result on polytope conditioning that can be combined with results in [LJJ15] to showpolynomial smoothed time complexity of several FWMs for the minimization of a convex functionin any simplex:
Theorem 1.2.
Let A = { A , . . . , A d +1 } be a set of independent Gaussian random vectors withmeans µ i , (cid:107) µ i (cid:107) ≤ , i ∈ [ d + 1] , and covariance matrix σ I d . Then for δ > , with probability atleast − δ , the measure of conditioning κ = PWidth( A )diam( A ) of A is at least some inverse polynomial in d , /σ and /δ . Note that even the problem of finding the minimum norm point in a simplex is not known tohave a simple polynomial time algorithm. All polynomial time algorithms we know for such a specialcase are general purpose convex programming algorithms such as the ellipsoid method. Moreover,[LHR20] shows that the linear programming problem reduces in strongly polynomial time to theminimum norm point in a simplex problem. This suggests that to find a simple polynomial timealgorithm for the minimum norm point in a simplex is hard and, in particular, to find a stronglypolynomial time algorithm would imply the existence of a strongly polynomial time algorithm forlinear programming, which would solve a major open problem.Our second smoothed analysis result concerns condition measures of general polytopes (Sec-tion 7). We show that the standard global linear convergence results for FWMs mentioned abovebased on polytope conditioning cannot guarantee polynomial complexity for general polytopes inthe average or smoothed sense. More specifically, for V-polytopes conv( A ) with | A | and d largeand comparable, d ≈ δ | A | , δ ∈ (0 , A ) gets smaller, in thecontext of Definition 1.1 one sets T = 1 / vf. It is enough to take x = 0 there and we show: Theorem 1.3.
Let δ ∈ (0 , . Suppose A = { A , . . . , A n +1 } is a set of iid. standard Gaussianrandom vectors in R d and d = (cid:98) δn (cid:99) . Let P n +1 = conv( A , . . . , A n +1 ) . Then P (cid:0) diam( P n +1 ) ≥ √ d (cid:1) ≥ − e − nd , and there exists constants < c, c (cid:48) < (that depends only δ ) such that, lim n →∞ P (cid:0) vf( P n +1 ) ≤ c d (cid:1) ≥ c (cid:48) . Hence the measure of conditioning κ = vf( P n +1 )diam( P n +1 ) of A is exponentially small in d with constantprobability. Theorem 1.3 combined with Theorem 2.25 implies that none of the four measures of polytopeconditioning (minwidth, PWidth, Φ, vf) has polynomial smoothed complexity.A way of interpreting Theorem 1.3 is that the standard conditioning measures of polytopes forFWMs are somewhat pessimistic and can appear ill-conditioned even then polytope is bad only3ocally. For example, vertex-facet distance can be small even if one vertex and one facet are badwhile the rest of the polytope is good. In other words, it may still be possible to show smoothedpolynomial complexity of FWMs in a different way.Theorem 1.3 is a statement about the minimum distance between the affine hull of d pointsthat form a facet and a vertex not on that facet. In order to understand this problem we study firsta simplified version where we replace affine hull by span and we remove the restriction that the d − n standard Gaussian randompoints in R d , how close can one of the points be to the span of some d − n is somewhatlarger than d , say, n = 2 d ? This question is easier to understand than the polytope version andit relates to conditioning of random matrices and the restricted isometry property in compressivesensing. The relation starts from the known observation (Lemma 2.8) that the minimum point-hyperplane distance is, up to polynomial factors, the same as the smallest singular value of amatrix. Given this, our question is essentially equivalent to: given an d -by- n random matrix withiid. standard Gaussian entries, what is the minimum of the smallest singular values over d -by- d submatrices? We answer this question by showing that when n/d ≥ c > Theorem 1.4.
Let A be an d -by- n random matrix with iid. standard Gaussian entries with d ≥ and nd ≥ c > . Then, there exist constants c , c > , < c < (that depend only on c ) suchthat with probability at least − c c d , min S ⊆ [ n ] , | S | = d σ d ( A S ) ≤ c c d − . Theorem 1.5.
Let A be an d -by- n random matrix with iid. standard Gaussian entries with d ≥ and < nd − ≤ C . Then, there exist constants C > , < C < (that depend only on C ) suchthat with probability at least − nC d − , min S ⊆ [ n ] , | S | = d σ d ( A S ) ≥ C d − . While Theorems 1.4 and 1.5 are new as far as we know, there is a large body of work, partlymotivated by compressive sensing, that studies questions related to them. In that area one isgenerally interested in showing that all d -by- k submatrices of A are well-conditioned, say, σ /σ k isno more than a constant (the restricted isometry property of Cand`es and Tao [CT05, CT06]). Thiscan only happen when k is much smaller than d , a regime very different from our case k = d . Thestandard analyses in compressive sensing as well as recent results such as [CJL19] do not seem tobe able to clarify the behavior in our regime.The idea of the proof of Theorem 1.4 (Section 4) is the following: Consider the case n = 2 d forconcreteness and aim to show that with constant probability one point is exponentially close to thespan of d − S be the family of sets of d − A . For S ∈ S , let B S be theset of points in R d within distance (cid:15) of span S . Let V (cid:15) = (cid:83) S ∈S B S . It is enough to show that for (cid:15) = 1 /c d , c >
1, the Gaussian volume G ( V (cid:15) ) is at least a constant. We do this by lower boundingit using the first two terms of the inclusion-exclusion principle (Bonferroni inequality) : G ( V (cid:15) ) ≥ (cid:88) S G ( B S ) − (cid:88) S,T : S (cid:54) = T G ( B S ∩ B T ) . B S ∩ B T can be large if S and T share many columns. To deal with this difficulty,replace S above with a large subfamily T ⊆ S of subsets of columns where each pair of subsetshas few columns in common by picking separated subsets greedily (Gilbert-Varshamov bound) . See[Raz88], [Juk11, Lemma 19.3] for another instance of Bonferroni’s inequality with almost pairwiseindependence.While Theorems 1.4 and 1.5 are results about random matrices, they have direct implications inthe analysis of algorithms: In Section 5 we discuss how Theorem 1.4 condtions the applicability ofthe robustness of tensor decomposition result by Bhaskara, Charikar and Vijayaraghavan [BCV14].In Section 6 we discuss how Theorem 1.4 conditions the applicability of results about the complexityof the simplex method and the diameter of polytopes in [BR13, BGR15, DH16, EV17].
For v ∈ R d and i ∈ [ d ], let v − i denote vector v with coordinate v i removed, that is v − i :=( v , . . . , v i − , v i +1 , . . . , v d ). If v (cid:54) = 0, let ˆ v := v/ (cid:107) v (cid:107) . Let B ( x, (cid:15) ) := { y ∈ R d : (cid:107) y − x (cid:107) ≤ (cid:15) } . Let S d − denote the ( d − R d . For v ∈ S d − , denote the spherical capcentered at v with angle α as C α ( v ) := { x ∈ S d − : v · x ≥ cos α } . For A ⊆ R d , let A (cid:15) := { x ∈ R d :dist( x, A ) ≤ (cid:15) } , A − (cid:15) := { x ∈ R d : B ( x, (cid:15) ) ⊂ A } . Let G denote the standard multivariate Gaussianprobability measure. For random variables or distributions X, Y , notation X d = Y states that X and Y have the same distribution. We first recall the definition of noncentral chi-square distribution.
Definition 2.1.
Let Y , Y , . . . , Y k be independent Gaussian random variables with means µ i andunit variance. Then the random variable (cid:80) ki =1 Y i is distributed according to the noncentral chi-square distribution with k degrees of freedom and noncentrality parameter λ = (cid:80) ki =1 µ i . Theprobability density function of noncentral chi-square distribution is given by f Y ( x ; k, λ ) = ∞ (cid:88) i =0 e − λ/ ( λ/ i i ! f Z k +2 i ( x ) , where Z q is distributed as (central) chi-square with q degrees of freedom, denoted as χ q . We show an anti-concentration inequality of noncentral chi-square distribution by comparingto (central) chi-square distribution.
Lemma 2.2.
Let µ ∈ R . Let X i ∼ N (0 , σ ) , Y i ∼ N ( µ, σ ) , i ∈ [ k ] , be independent. Then P (cid:32) k (cid:88) i =1 X i ≥ t (cid:33) ≤ P (cid:32) k (cid:88) i =1 Y i ≥ t (cid:33) . Proof.
The proof follows directly from the density function of noncentral chi-square distributionand some basic facts about the chi-square distribution. We consider random variables Y i /σ ∼ ( µ/σ, (cid:80) ki =1 Y i /σ is distributed as noncentral chi-square with k degrees of freedom andnoncentrality parameter kµ /σ . From Definition 2.1, P (cid:32) k (cid:88) i =1 Y i σ ≥ t σ (cid:33) = ∞ (cid:88) i =0 e − kµ / σ ( kµ / σ ) i i ! P (cid:18) Z k +2 i ≥ t σ (cid:19) ≥ ∞ (cid:88) i =0 e − kµ / σ ( kµ / σ ) i i ! P (cid:18) Z k ≥ t σ (cid:19) = P (cid:18) Z k ≥ t σ (cid:19) = P (cid:32) k (cid:88) i =1 X i ≥ t (cid:33) , where Z k ∼ χ k . The inequality comes from the fact that chi-square random variable with ( k + i )degrees of freedom is equal in distribution to the sum of a chi-square random variable with k degreesof freedom and squares of i independent standard Gaussian random variables, so that Z k +2 i haslarger tail than Z k .The following lemma provides a comparison inequality between the ratio of noncentral chi-squarerandom variables and chi-square random variables, which is used in the proof of Lemma 7.4. Lemma 2.3.
Let µ ∈ R . Let X , X i , Y ∼ N (0 , σ ) , Y i ∼ N ( µ, σ ) , i ∈ [ k ] and be independent.Then for any t ∈ (0 , , P (cid:18) Y Y + (cid:80) n Y i ≥ t (cid:19) ≤ P (cid:18) X X + (cid:80) n X i ≥ t (cid:19) . Proof.
Let f denote the probability density function of X and Y . By the law of total expectation, P (cid:18) Y Y + (cid:80) ni =1 Y i ≥ t (cid:19) = (cid:90) ∞ P (cid:18) yy + (cid:80) ni =1 Y i ≥ t (cid:19) f ( y ) dy = (cid:90) ∞ P (cid:32) n (cid:88) i =1 Y i ≤ (1 /t − y (cid:33) f ( y ) dy ≤ (cid:90) ∞ P (cid:32) n (cid:88) i =1 X i ≤ (1 /t − x (cid:33) f ( x ) dx (Lemma 2.2)= (cid:90) ∞ P (cid:18) xx + (cid:80) ni =1 X i ≥ t (cid:19) f ( x ) dx = P (cid:18) X X + (cid:80) ni =1 X i ≥ t (cid:19) . Lemma 2.4.
Let X ∼ N (0 , σ ) , Y ∼ N ( c, σ ) , c ∈ R , then for any t > , P ( (cid:107) Y (cid:107) < t ) ≤ P ( (cid:107) X (cid:107) < t ) . roof. Without loss of generality, we may assume c >
0. Then P ( (cid:107) Y (cid:107) < t ) = (cid:90) t − t e − ( x − c )22 σ dx = (cid:90) t − c − t − c e − x σ dx = (cid:90) − t − t − c e − x σ dx + (cid:90) t − t e − x σ dx − (cid:90) t − t − c e − x σ dx = P (cid:0) (cid:107) X (cid:107) < t (cid:1) + (cid:18)(cid:90) t + ct e − x σ dx − (cid:90) tt − c e − x σ dx (cid:19) ≤ P (cid:0) (cid:107) X (cid:107) < t (cid:1) . Lemma 2.5 ([LM00]) . Let ( X , . . . , X n ) be iid. standard Gaussian variables. Let α , . . . , α n benonnegative. Let Z = (cid:80) ni =1 α i ( X i − . Then, the following inequalities hold for any positive t : P (cid:0) Z ≥ (cid:107) α (cid:107) √ t + 2 (cid:107) α (cid:107) ∞ t (cid:1) ≤ exp( − t ) , P (cid:0) Z ≤ − (cid:107) α (cid:107) √ t (cid:1) ≤ exp( − t ) . Lemma 2.6.
Let A ( n, t, w ) be the maximum number of binary n -vectors with exactly w ones andpairwise Hamming distance greater than or equal to t . Then for any c > , there exist constants c > and c > (that depend only on c ) such that for all d ≥ and n/d ≥ c we have A ( n, c d, d ) ≥ c d .Proof. Pick vectors greedily (Gilbert-Varshamov bound) to get, for integral t : A ( n, t, w ) ≥ (cid:0) nw (cid:1) B ( n, t ) , where B ( n, t ) is the number of binary vectors at Hamming distance less than or equal to t from thezero vector. We have B ( n, t ) = (cid:80) tk =0 (cid:0) nk (cid:1) ≤ (cid:0) net (cid:1) t (see footnote ). Note that this last inequality isvalid also for fractional 0 < t ≤ n using the fact that ( a/b ) b is increasing in b for a >
0, 0 < b ≤ a/e .Thus, for any 0 < c < d ≥ A ( n, cd, d ) ≥ (cid:0) nd (cid:1)(cid:0) necd (cid:1) cd ≥ ( n/d ) d (cid:0) necd (cid:1) cd = ( n/d ) d (1 − c ) ( e/c ) cd ≥ c d (1 − c )0 ( e/c ) cd = (cid:18) c ( c e/c ) c (cid:19) d (1)We have lim c → + ( a/c ) c = 1 and ( a/c ) c is increasing again for a >
0, 0 ≤ c ≤ a/e . Given this,choose c ∈ (0 ,
1) such that ( c e/c ) c < c . Let c := c ( c e/c ) c >
1. We have A ( n, c d, d ) ≥ c d for d ≥
1. The claim follows. (cid:80) tk =0 (cid:0) nk (cid:1) ≤ (cid:80) tk =0 n k k ! = (cid:80) tk =0 t k k ! ( n/t ) k ≤ e t ( n/t ) t . .5 Generalization of Archimedes’ formula Lemma 2.7.
Let d ≥ . Let U be a uniformly random d -dimensional unit vector. Then ( U , . . . , U d − ) is uniform in B d − and Pr( (cid:107) ( U , . . . , U d − ) (cid:107) ≤ t ) = t d − .Proof. The first part is well-known, a proof can be found in [BGMN05, Corollary 4]. The secondpart follows immediately from the first part.
Lemma 2.8 (see e.g. [BCMV13, Lemma 3.5] for a proof) . If A ∈ R m × n has columns a , . . . , a n and m ≥ n , then denoting a − i = span ( a j : j (cid:54) = i ) , we have √ n min i ∈ [ n ] dist( a i , a − i ) ≤ σ n ( A ) ≤ min i ∈ [ n ] dist( a i , a − i ) . (cid:15) -neighborhoodCorollary 2.9. Let Q be a convex set in R d . Then there exists an absolute constant c > suchthat G ( Q \ Q − (cid:15) ) ≤ c(cid:15)d / .Proof. Follows immediately from [CCK17, Lemma A.2] and the fact (cid:107) I (cid:107) HS = √ d (Hilbert-Schmidtnorm). Their proof is based on [Bal93, Naz03]. ([Ver18, Theorem 4.4.5]) . Let X be an m × n random matrix whose entries are iid.standard Gaussian random variables. Then for t > , P (cid:0) σ max ( X ) > c ( √ m + √ n + t ) (cid:1) ≤ e − t , where c is some absolute positive constant. Lemma 2.11.
Given linearly independent vectors p , p , . . . , p d ∈ R d , the shortest vector in theiraffine hull is v = P − / (cid:107) P − (cid:107) , where P = ( p · · · p d ) (cid:62) . In particular, (cid:107) v (cid:107) = 1 / (cid:107) P − (cid:107) .Proof. From [LHR20, Lemma 1.2], the shortest vector in the affine hull, v , satisfies P v = (cid:107) v (cid:107) . Since P has independent columns, v = (cid:107) v (cid:107) P − . Compute norms of vectors in the above equationto get (cid:107) v (cid:107) = 1 / (cid:107) P − (cid:107) . The claim follows.The following lemma can directly generalize to Gaussian random vectors with mean zero andcovariance matrix σ I d by scaling by σ . Lemma 2.12.
Let X , . . . , X n be iid. standard Gaussian random vectors in R d . For S ⊆ [ n ] , | S | = d , define V S as the shortest vector in aff( X S ) . Then there exists a constant c > such that P (cid:18) max S ⊆ [ n ] , | S | = d (cid:107) V S (cid:107) ≤ c (2 + (cid:112) n/d ) (cid:19) ≥ − e − d . roof. Let X be the matrix whose column vectors are X , . . . , X n . For any S ⊆ [ n ] , | S | = d , X S islinearly independent with probability 1. By Lemma 2.11, (cid:107) V S (cid:107) = 1 (cid:107) X − S (cid:107) ≤ √ d σ min ( X − S ) = σ max ( A S ) √ d ≤ σ max ( A ) √ d . (2)From Lemma 2.10 we know P (cid:16) σ max ( A ) > c ( √ d + √ n + t ) (cid:17) ≤ e − t . The claim follows by letting t = √ d and applying (2). We will need the fact that the number of facets of the convex hull of n Gaussian random points in R d is exponential in d with high probability when n = cd , c >
1. We could not find such a result inthe literature and we do not see how to deduce it from results on the asymptotic number of facetsin stochastic geometry [Ray70, AW91, HMR04, HR05, BLR18] (the difficulties are: either theyonly determine the expectation or variance of the number of facets, or the bounds are as n goesto infinity for fixed d ). Nevertheless, it is easy to deduce what we want from the work of Donohoand Tanner on compressive sensing and the neighborliness of random polytopes. We build on topof basic polytope theory from [Zie93]. Definition 2.13 (Neighborliness) . A polytope P is k -neighborly if every subset of k vertices formsa ( k − -face. Let f l ( P ) denote the number of l -faces of polytope P . Theorem 2.14 ([DT05], Corollary 1.1, Lemma 3.2) . There exists a function (threshold) ρ ( δ ) :(0 , → R , ρ ( δ ) > with the following property: Let δ ∈ (0 , . Let d = (cid:98) δn (cid:99) . Let ρ < ρ ( δ ) . Let X , . . . , X n be iid. samples from a Gaussian distribution in R d with non-singular covariance. Let P = conv { X , . . . , X n } . Then lim n →∞ P (cid:0) f ( P ) = n and P is (cid:98) ρd (cid:99) -neighborly (cid:1) = 1 . The above theorem demonstrates, given its assumptions, that when n is large enough, P has (cid:0) n (cid:98) ρd (cid:99) (cid:1) many (cid:98) ρd (cid:99) -faces with high probability. Note also that P is simplicial (every facet is a simplex)a.s. Thus, a.s. each facet of P provides at most (cid:0) d (cid:98) ρd (cid:99) (cid:1) many (cid:98) ρd (cid:99) -faces, and the number of facetsis at least (cid:0) n (cid:98) ρd (cid:99) (cid:1)(cid:0) d (cid:98) ρd (cid:99) (cid:1) ≥ (cid:16) nd (cid:17) (cid:98) ρd (cid:99) ≥ (cid:18) δ (cid:19) (cid:98) ρd (cid:99) ≥ c d , for some c > d large enough). We conclude: Corollary 2.15.
Let δ ∈ (0 , . Let P be the convex hull of n iid. standard Gaussian randompoints in R d , d = (cid:98) δn (cid:99) . Then there exists a constant c > (that depends only on δ ) such that lim n →∞ P (cid:0) f d ( P ) ≥ c d (cid:1) = 1 . Corollary 2.15 can probably also be proven directly from different but related neighborlinessresults by Vershik and Sporyshev [VS92], [DT05, Theorem 2].9 .8 Condition measures of polytopes (Directional width and width) . The directional width of a set A ⊆ R d with respectto a direction r ∈ R d is defined as dirW( A, r ) := sup s,v ∈A (cid:104) r (cid:107) r (cid:107) , s − v (cid:105) . The width of A , denoted width( A ) is the infimum of the directional width over all directions on its affine hull. Definition 2.17 (Minwidth, [LJJ15, Section 3.1]) . The minwidth of a finite set A ⊆ R d , denoted minwidth( A ) , is the minimum width over all subsets of A . (Pyramidal directional width, [LJJ15]) . We define the pyramidal directional widthof a finite set A ⊆ R d with respect to a direction r ∈ R d and a base point x ∈ conv( A ) to be PDirW(
A, r, x ) := min S ∈ S x dirW( S ∪ { s ( A, r ) } , r ) = min S ∈ S x max s ∈ A,v ∈ S (cid:28) r (cid:107) r (cid:107) , s − v (cid:29) , where S x := { T ⊆ A : x is a proper convex combination of all the elements in T } and s ( A, r ) :=argmax v ∈ A (cid:104) r, v (cid:105) . Definition 2.19 (Feasible direction, [LJJ15]) . A direction r is feasible for A from x if it pointsinwards conv ( A ) , i.e. r ∈ cone ( A − x ) . A direction r is feasible for A if it is feasible for A fromsome x ∈ A . Definition 2.20 (Pyramidal width, [LJJ15]) . We define the pyramidal width of a finite set A ⊆ R d to be the smallest pyramidal directional width of all its faces, PWidth( A ) := min K ∈ faces(conv( A )) x ∈ Kr ∈ cone( K − x ) \{ } PDirW( K ∩ A, r, x ) . The vertex-facet distance polytope conditioning parameter for the analysis of FWMs was introducedin [BS17]. We adopt here the slightly specialized definition in [PnR19], which is defined as a propertyof a polytope independent of the representation, while the original version in [BS17] can depend onthe numbers used to represent a polytope.
Definition 2.21 (vertex-facet distance [BS17, PnR19]) . Let P ⊆ R d be a polytope with dim(aff( P )) ≥ . The vertex-facet distance of P is vf( P ) := min F ∈ facets( P ) dist(aff( F ) , vertices( P ) \ F ) . We show vf (cid:0) conv( A ) (cid:1) ≥ PWidth( A ). It seems that this result may have already been know to[PnR19, comment before Theorem 1, combined with Theorem 2], but it is claimed there in the wrongdirection. That direction is impossible as the example of a unit cube shows: PWidth([0 , d ) = 1 / √ d [LJJ15, Lemma 4], but vf([0 , d ) = 1. 10 roposition 2.22. Let A ⊆ R d be a finite set with at least two points. Then vf (cid:0) conv( A ) (cid:1) ≥ PWidth( A ) . Figure 1: Proof of Proposition 2.22
Proof.
Let P = conv( A ). Let F be a facet of P and pick v ∈ vertices( P ) \ F so that dist (cid:0) v, aff( F ) (cid:1) = (cid:15) := vf( P ). Pick x ∈ relint (cid:0) conv( F ∪ { v } ) (cid:1) and let r be the unit outer normal vector to F (in aff( P )if P is not full-dimensional). We set K = P in Definition 2.20 so that r ∈ cone( K − x ) = aff( P ) andPWidth( A ) ≤ PDirW( K ∩ A, r, x ) = PDirW(
A, r, x ). Now, set S = A ∩ ( F ∪ { v } ) in Definition 2.18so that, with these choices, PDirW( A, r, x ) ≤ dirW( S, r ) ≤ (cid:15) . The claim follows. ([PnR19]) . Let C ⊆ R d be a polytope with dim(aff( C )) ≥ . The facial distance of C is Φ( C ) := min F ∈ faces( C ) ∅ (cid:32) F (cid:32) C d ( F, conv(vertices( C ) \ F )) . One of the motivations of [PnR19] to introduce parameter Φ is that it is the same as PWidth(except in degenerate cases) while the definition of Φ is simpler to use in many cases. We quotetheir result next.
Theorem 2.24 ([PnR19, Theorem 2]) . Let A ⊆ R d be a finite set with at least two points. Then Φ (cid:0) conv( A ) (cid:1) = PWidth( A ) . Let A ⊆ R d be a finite set with at least two points. Then minwidth( A ) ≤ Φ (cid:0) conv( A ) (cid:1) = PWidth( A ) ≤ vf (cid:0) conv( A ) (cid:1) . Proof.
Immediate from [LJJ15, Section 3.1], Theorem 2.24 and Proposition 2.22.11
Conditioning of simplices
In this section we show that the smoothed conditioning of any simplex is polynomial. This impliesthat several FWMs have smoothed polynomial complexity on the minimum norm point in a simplexproblem and the minimization of many convex functions on a simplex. To put this result in context,we first argue (based on known results) that even a simplex with vertices having 0–1 coordinates canhave bad conditioning. Another relevant context to keep in mind is the fact that linear programmingreduces in strongly polynomial time to the minimum norm point in a simplex [LHR20]. width and minwidth of a simplex
We start with the observation that the minwidth of a simplex is the same as its width.
Lemma 3.1.
Let A be the vertex set of a simplex in R d and A ⊂ A which includes more than onevertex. Then width( A ) ≤ width( A ) . In particular, minwidth( A ) = width( A ) .Proof. We prove by induction in d . The width of a polytope is the minimum distance betweenparallel supporting hyperplanes in its affine hull. Width of a 2-simplex is the minimum height oftriangle, which is smaller than the length of any edge. For a k -simplex A , suppose the width of oneof its facet is given by the distance between two parallel ( k − p k − and p k − .One can extend p k − and p k − to parallel hyperplanes in R k that enclose A . Suppose extensions p k − and p k − give the minimum distance. Then,dist (cid:0) p k − , p k − (cid:1) = min a ∈ p k − ,b ∈ p k − (cid:107) a − b (cid:107) ≤ min a ∈ p k − ,b ∈ p k − (cid:107) a − b (cid:107) = dist (cid:0) p k − , p k − (cid:1) which shows that the width of a k -simplex is less than the width of any of its facets. The claimthen follows by induction. Lacoste-Julien and Jaggi [LJJ15] observed that the minwidth of the unit cube in R d is exponentiallysmall in d . This example was one of their motivations for introducing PWidth, which is 1 / √ d forthe cube. Their observation is based on the following result by Alon and Vu: Theorem 3.2 ([AV97, Theorem 3.2.2], [Zie00, Corollary 27]) . There are d + 1 vectors in { , } d that form the vertices of a d -dimensional simplex S so that d − d d/ ≤ vf( S ) ≤ d (2+ o (1)) d d/ . [LHR20] observed that PWidth can be exponentially small in the size (bitlength) of a setof points with integer coordinates. Using Theorem 3.2 and the relationships between polytopecondition measures, we can immediately strengthen this result and show that this is not just a“large numbers” phenomenon, namely, all condition measures are exponentially small even for a0–1 simplex: Corollary 3.3.
There are d + 1 vectors in { , } d that form the vertices of a d -dimensional simplex S so that width(vertices( S )) = minwidth(vertices( S )) ≤ PWidth(vertices( S )) = Φ( S ) ≤ vf( S ) ≤ d (2+ o (1)) d d/ . roof. Let S be the d -dimensional simplex given by Theorem 3.2. Lemma 3.1 gives the leftmostequality. The rightmost inequality is one of the conclusions of Theorem 3.2. The other relationsfollow from Theorem 2.25. Now we start analyzing smoothed complexity of FWMs on the minimization of a strongly convexfunction with Lipschitz gradient on a simplex.
Definition 3.4.
A differentiable function f is said to have L -Lipschitz gradient if for some L > and for all x, y in its domain we have (cid:107)∇ f ( x ) − ∇ f ( y ) (cid:107) ≤ L (cid:107) x − y (cid:107) . Definition 3.5.
A differentiable function f is µ -strongly convex if for some µ > and for all x, y in its domain, we have f ( y ) ≥ f ( x ) + ∇ f ( x ) T ( y − x ) + µ (cid:107) y − x (cid:107) . In [LJJ15, Theorem 1], Lacoste-Julien and Jaggi proved the global linear convergence of FWMson the minimization of a strongly convex function with Lipschitz gradient: suppose u t is the currentpoint after t good iterations , f ( u t ) satisfies f ( u t ) − f ∗ ≤ (cid:32) − µ L (cid:18) PWidth( A )diam( A ) (cid:19) (cid:33) t ( f ( u ) − f ∗ ) , (3)where f ∗ is the optimal value and u is the initial point. To show polynomial smoothed complexity,we need to prove that the measure of conditioning κ = PWidth( A )diam( A ) is at least inverse polynomialin d, /σ, /δ . We are going to get this by giving a polynomial lower bound on PWidth( A ) and apolynomial upper bound on diam( A ). We know from Theorem 2.25 that minwidth ≤ PWidth, and from Lemma 3.1 that minwidth =width for any simplex. Thus, we instead find a lower bound on width, namely the diameter of aball contained in the simplex, which is also a lower bound on PWidth. In the next lemma, we provethat a random simplex contains a ball of radius Ω( d − ) with probability close to 1. Lemma 3.6.
Let A = { A , . . . , A d +1 } be a set of independent Gaussian random vectors with means µ i , (cid:107) µ i (cid:107) ≤ , i ∈ [ d + 1] , and covariance matrix σ I d . Then for δ > , P (cid:16) minwidth (cid:0) conv( A ) (cid:1) ≥ √ πσδ ( d + 1) − (cid:17) ≥ − δ. Moreover, P (cid:16) PWidth (cid:0) conv( A ) (cid:1) ≥ √ πσδ ( d + 1) − (cid:17) ≥ − δ. The number of good iterations depends on variants of FWMs being used. It is always lower bounded by somelinear function of the actual number of iterations. See details in [LJJ15, Theorem 1]. roof. It is easy to see that A forms a simplex with probability 1. From Lemma 3.1, we know theminwidth of a simplex is its width. Let D i be the distance from A i to the affine hull of its oppositefacet, aff { A j : j (cid:54) = i } . Conditioning on aff { A j : j (cid:54) = i } , by the rotational invariance of Gaussiandistribution, D i is equal in distribution to the absolute value of a Gaussian random variable withmean µ ∈ R (not necessarily be zero) and variance σ . Let X ∼ N (0 , σ ). By Lemma 2.4, we have P ( D i < t ) ≤ P ( (cid:107) X (cid:107) < t ) for all t . The right hand side is upper bounded by 2 t/ √ πσ , which is theproduct of maximal Gaussian density and length of interval. Apply union bound to get P (cid:32) d +1 (cid:92) i =1 { D i ≥ t } (cid:33) ≥ − t ( d + 1) √ πσ . Let C i be the distance between the center of mass of conv( A ) and aff( A j : j (cid:54) = i ). Note that C i = D i / ( d + 1). Then P (cid:32) d +1 (cid:92) i =1 { C i ≥ td + 1 } (cid:33) ≥ − t ( d + 1) √ πσ . The above expression states that with some probability the ball centered at the center of mass andof radius t/ ( d + 1) lies inside conv( A ). Setting t = δσ √ π √ d +1) and using the fact that the width of thesimplex is at least the diameter of the inscribed ball, we get P (cid:16) width (cid:0) conv( A ) (cid:1) ≥ √ πσδ ( d + 1) − (cid:17) ≥ − δ. The claim follows immediately from Lemma 3.1 and Theorem 2.25.
Let A = { A , . . . , A d +1 } be a set of independent Gaussian random vectors with means µ i , (cid:107) µ i (cid:107) ≤ , i ∈ [ d + 1] , and covariance matrix σ I d . Then for δ > , P (cid:32) diam( A ) ≤ (cid:16) σ (cid:114) d + 3 ln (cid:16) d + 1 δ (cid:17) + 1 (cid:17)(cid:33) ≥ − δ. Proof.
Let A i = µ i + X i , where X i ∼ N (0 , σ I d ). Let t >
0. Triangle inequality gives that P ( (cid:107) A i (cid:107) > t + 1) = P ( (cid:107) X i + µ i (cid:107) > t + 1) ≤ P ( (cid:107) X i (cid:107) > t ) . Apply Lemma 2.5 with α = ( σ , . . . , σ ),we have P (cid:18) (cid:107) A i (cid:107) > σ (cid:113) d + 2 √ dt + 2 t + 1 (cid:19) ≤ P (cid:18) (cid:107) X i (cid:107) ≥ σ (cid:113) d + 2 √ dt + 2 t (cid:19) ≤ e − t , which shows that every A i is contained in a ball of radius σ (cid:112) d + 2 √ dt + 2 t + 1 ≤ σ √ d + 3 t + 1with high probability. With union bound, we see the diameter of the ball is an upper bound of thediameter of convex hull of A : P (cid:16) diam (cid:0) conv( A ) (cid:1) ≤ (cid:0) σ √ d + 3 t + 1 (cid:1)(cid:17) ≥ − ( d + 1) e − t . The claim then follows by setting t = ln (cid:0) ( d + 1) /δ (cid:1) .14ext we restate and prove our main theorem for this section: Theorem 1.2.
Let A = { A , . . . , A d +1 } be a set of independent Gaussian random vectors withmeans µ i , (cid:107) µ i (cid:107) ≤ , i ∈ [ d + 1] , and covariance matrix σ I d . Then for δ > , with probability atleast − δ , the measure of conditioning κ = PWidth( A )diam( A ) of A is at least some inverse polynomial in d , /σ and /δ .Proof. We proved in Lemma 3.6 and Lemma 3.7 that, P (cid:16) PWidth (cid:0) conv( A ) (cid:1) ≥ √ πσδ ( d + 1) − (cid:17) ≥ − δ. and P (cid:32) diam( A ) ≤ (cid:16) σ (cid:114) d + 3 ln (cid:16) d + 1 δ (cid:17) + 1 (cid:17)(cid:33) ≥ − δ. Thus with probability at least 1 − δ , we havePWidth( A )diam( A ) ≥ √ πσδ ( d + 1) − (cid:16) σ (cid:113) d + 3 ln (cid:0) d +1 δ (cid:1) + 1 (cid:17) = δ (cid:112) π/ d + 1) (cid:16)(cid:113) d + 3 ln (cid:0) d +1 δ (cid:1) + σ (cid:17) ≥ /ρ ( d, /σ, /δ )where ρ is a polynomial function of d, /σ, /δ .Going back to (3), let h t = f ( u t ) − f ∗ . We have h t ≤ (cid:32) − µ L (cid:18) PWidth( A )diam( A ) (cid:19) (cid:33) t h . Based on our smoothed analysis on the measure of conditioning in Theorem 1.2, with probabilityat least 1 − δ h t ≤ (cid:18) − µ Lρ (cid:19) t h ≤ e − µt Lρ h . Hence one needs at most Lρ ln( (cid:15) ) µ good iterations to get a solution whose value is within distance (cid:15) ( f − f ∗ ) of f ∗ . Let T denote the number of good iterations, we have (using the notation fromDefinition 1.1) max A ⊆ B (0 , ⊆ R d | A | = d +1 P g (cid:32) T ( A + g ) ≥ Lρ ( d, σ , δ ) ln( (cid:15) ) µ (cid:33) ≤ δ. Conditioning of random matrices
In this section we prove that the smallest singular value of some square submatrix of a d -by- n Gaussian random matrix is exponentially small with probability exponentially close to 1 when n/d ≥ c >
1. From Lemma 2.8, we know that the smallest singular value of a square matrix iscomparable to the minimum distance between one column vector and the span of the other columnvectors (one-off-distance). If we consider exponentially narrow bands around each span of d − Lemma 4.1.
Let u, v ∈ R d be unit length vectors, let (cid:15) > , and let c S , c T ∈ R . Let B S = { x ∈ R d : c S ≤ x · u ≤ c S + (cid:15) } , B T = { x ∈ R d : c T ≤ x · v ≤ c T + (cid:15) } . Then G ( B S ∩ B T ) ≤ (cid:15) (cid:112) π (1 − ( u · v ) ) . Proof. If u and v are parallel then the claim holds. If they are not parallel, then by the structure ofthe Gaussian measure G this is a two-dimensional problem in the plane spanned by u, v . Identifythis plane with R . G ( B S ∩ B T ) is at most the maximum density 1 / √ π multiplied by the area ofthe parallelogram P (cid:48) := { x ∈ R : c S ≤ x · u ≤ c S + (cid:15), c T ≤ x · u ≤ c T + (cid:15) } . One can see that P (cid:48) has the same area as P := { x ∈ R : | x · u | ≤ (cid:15)/ , | x · v | ≤ (cid:15)/ } . Defining A to be the matrix withrows u, v , we have P = { x : (cid:107) Ax (cid:107) ∞ ≤ (cid:15)/ } . This implies area( P ) = (cid:15) | det A − | = (cid:15) / | det A | = (cid:15) / √ det AA T = (cid:15) / (cid:112) − ( u · v ) . The claim follows.We now switch our focus to the random regime. The following lemma gives a probabilistic upperbound of the intersection of two bands around the spans of two (possibly not disjoint) subsets ofrandom vectors in high dimensional space. The bound is good when not too many points are sharedby the subsets (so that the behavior is not very different from two independent bands). Lemma 4.2.
Let d ≥ . Let ≤ k ≤ d − . Let A , . . . , A k , S , . . . , S d − k − , T , . . . , T d − k − be d -dimensional iid. standard Gaussian random vectors. Let B S = (span { A , . . . , A k , S , . . . , S d − k − } ) (cid:15)/ , B T = (span { A , . . . , A k , T , . . . , T d − k − } ) (cid:15)/ . Then for any t ≥ , P (cid:18) G ( B S ∩ B T ) ≥ (cid:15) t √ π (cid:19) ≤ t d − k − . roof. If d ≤ k ≥ d −
2, then the claim is immediate. Otherwise, 0 ≤ k ≤ d − G this is a ( d − k )-dimensional problemin { A , . . . , A k } ⊥ . More precisely, let U, V be two ( d − k )-dimensional iid. uniformly random unit-length vectors and define B (cid:48) S = { x ∈ R d − k : | x · U | ≤ (cid:15)/ } and B (cid:48) T = { x ∈ R d − k : | x · V | ≤ (cid:15)/ } .Then G ( B S ∩ B T ) has the same distribution as G ( B (cid:48) S ∩ B (cid:48) T ). From Lemma 4.1 we have G ( B (cid:48) S ∩ B (cid:48) T ) ≤ (cid:15) √ π (1 − ( U · V ) ) .Using the rotational symmetry of the distribution of U and V and then Lemma 2.7 we get P (cid:0)(cid:112) (1 − ( U · V ) ) ≤ /t (cid:1) = P (cid:16)(cid:113) U + · · · + U d − k − ≤ /t (cid:17) ≤ P (cid:16)(cid:113) U + · · · + U d − k − ≤ /t (cid:17) = 1 /t d − k − . The claim follows.The main technical content of our singular value bound is the following lower bound on theGaussian volume of the union of bands around any d − d -by- n Gaussian randommatrix. We also include an upper bound on the volume.
Lemma 4.3.
Let (cid:15) ≥ , d ≥ . For { A , . . . , A n } ⊆ R d , define V = G (cid:16) (cid:91) S ⊆ [ n ] , | S | = d − span A S (cid:17) (cid:15) . V ≤ (cid:15) √ π (cid:0) nd − (cid:1) .2. Suppose A , . . . , A n are d -dimensional iid. standard Gaussian random vectors with nd − ≥ c > . Then there exist constants c , c > (that depend only on c ) such that when (cid:15) ≤ / ( c c d − ) and with probability at least − c e − d we have V ≥ c d − √ π (cid:15) .Proof of part 1. The upper bound follows from the union bound and the fact that the 1-dimensionalGaussian density is upper bounded by 1 / √ π . Proof of part 2.
Let S = { S ⊆ [ n ] , | S | = d − } . Use Lemma 2.6 to get the bound A (cid:0) n, c ( d − , d − (cid:1) ≥ c d − . We get a subfamily T ⊆ S such that for all
S, T ∈ T with S (cid:54) = T we have | S ∩ T | ≤ (1 − c )( d −
1) and |T | = c d − for some constants 0 < c < c > c ), and any d ≥
2. Let N = |T | .Let B S = (span A S ) (cid:15) . Use the first two terms of the inclusion-exclusion principle (Bonferroniinequality) and use Lemma 4.2 in a union bound applied to all pairs of sets in T to get G ( B S ∩B T ) ≤ (cid:15) t √ π for all S, T ∈ T , S (cid:54) = T . We get a bound on V that holds with probability at least1 − ( N ) t d − − (1 − c / d − − = 1 − ( N ) t c d − / − ≥ − N t c d − / − = 1 − t (cid:16) c t c / (cid:17) d − = 1 − c e − d (choosing aconstant t > c ( c ) and c ( c ) such that c /t c / = 1 /e and then setting c = t/e ,17hich ultimately depends only on c ). The bound on V is V ≥ G (cid:32)(cid:16) (cid:91) S ∈T span A S (cid:17) (cid:15) (cid:33) ≥ (cid:88) S ∈T G ( B S ) − (cid:88) S,T ∈T ,S (cid:54) = T G ( B S ∩ B T ) ≥ N (cid:15) √ π e − (cid:15) / − (cid:18) N (cid:19) (cid:15) t √ π ≥ N (cid:15) √ π ( e − (cid:15) / − tN (cid:15) ) ≥ N (cid:15) √ π (1 − (cid:15) / − tN (cid:15) ) ≥ N (cid:15) √ π (for (cid:15) ≤ / (8 tN )).In other words, V ≥ c d − (cid:15) √ π for (cid:15) ≤ / (8 ec c d − ). We finish our proof by taking c = 8 ec .We are ready now to restate and prove the main results of the section. Theorem 1.4.
Let A be an d -by- n random matrix with iid. standard Gaussian entries with d ≥ and nd ≥ c > . Then, there exist constants c , c > , < c < (that depend only on c ) suchthat with probability at least − c c d , min S ⊆ [ n ] , | S | = d σ d ( A S ) ≤ c c d − . Proof.
Pick c ∈ (1 , c ). Let m = (cid:98) c d (cid:99) . Note that m ≥ c d − ≥ c d − c ≥ c ( d − A , . . . , A m with (cid:15) = 1 /c c d − . Then we get V ≥ √ πc withprobability greater than 1 − c e − d . This implies that with probability greater than(1 − c e − d ) (cid:0) − (1 − √ πc ) n − m (cid:1) ≥ (1 − c e − d ) (cid:0) − (1 − √ πc ) ( c − c ) d (cid:1) ≥ − c e − d − (1 − √ πc ) ( c − c ) d ≥ − c c d where c = max { /e, (1 − √ πc ) ( c − c ) } , at least one of A m +1 , . . . , A n , say A ∗ , falls in V , that is,falls within distance (cid:15) = 1 /c c d − of span( A S ) for some S ⊆ [ m ] , | S | = d −
1. Lemma 2.8 gives σ d ( A S , A ∗ ) ≤ /c c d − . Theorem 1.5.
Let A be an d -by- n random matrix with iid. standard Gaussian entries with d ≥ and < nd − ≤ C . Then, there exist constants C > , < C < (that depend only on C ) suchthat with probability at least − nC d − , min S ⊆ [ n ] , | S | = d σ d ( A S ) ≥ C d − . roof. Apply Lemma 4.3 to columns A , . . . , A n − to get V ≤ (cid:15) √ π (cid:18) nd − (cid:19) ≤ (cid:15) √ π (cid:18) end − (cid:19) d − ≤ (cid:15) √ π ( eC ) d − . By picking (cid:15) = 1 /C d − where C > eC , there exists a constant eC C < C < V ≤ C d − . This implies that, with probability at most C d − , column A n is within distance 1 /C d − of span A S for some S ⊆ [ n − , | S | = d −
1. A similar claim holds for columns A , . . . , A n − aswell. Applying the union bound, we get that no A i falls within distance 1 /C d − of span A S for any S ⊆ [ n − , | S | = d − − nC d − . Lemma 2.8 gives σ d ( A S , A n ) ≥ /C d − with probability at least 1 − nC d − . Kruskal [Kru77] showed a sufficient condition under which the component vectors a i , b i , c i , i =1 , . . . , n of an order-3 tensor T = (cid:80) ni =1 a i ⊗ b i ⊗ c i are uniquely determined by the tensor (up toinherent ambiguities). The condition depends on a parameter now known as the Kruskal rank of amatrix: For a d -by- n matrix A , the Kruskal rank of A , denoted K-rank( A ), is the maximum r ∈ [ n ]such that any r columns of A are linearly independent. The condition is K-rank( A ) + K-rank( B ) +K-rank( C ) ≥ n + 2, where A , B , C are the matrices with columns ( a i ), ( b i ), ( c i ), respectively. Forconcreteness, it is helpful to consider the symmetric case A = B = C ∈ R d × n . Kruskal’s conditionbecomes 3 K-rank( A ) ≥ n + 2. Informally, for a generic matrix A we have K-rank( A ) = d and soKruskal’s result guarantees uniqueness for generic A when n ≤ d/ − robust decomposition . That is, when the observed tensor is a smallperturbation of the original tensor, the components of the perturbed tensor are uniquely determinedand close to the components of the original tensor. Their condition for robust unique decompositionis a refinement of Kruskal’s condition: Let τ >
0. The robust Kruskal rank (with threshold τ ) of A , denoted K-rank τ ( A ), is the maximum k ∈ [ n ] such that for any subset S ⊆ [ n ] of size k wehave σ k ( A S ) ≥ /τ ( σ k denotes the k th largest singular value). The condition is K-rank τ ( A ) +K-rank τ ( B )+K-rank τ ( C ) ≥ n +2 and the error in the recovered components depends polynomiallyon τ .In this context, Theorem 1.4 can be stated in the following equivalent way: Theorem 5.1.
Let A be an d -by- n random matrix with iid. standard Gaussian entries with d ≥ and n/d ≥ c > . Then, there exist constants c , c > , < c < (that depend only on c ) suchthat with probability at least − c c d , K-rank τ ( A ) = d ⇒ τ ≥ c c d − . This has the following implication for Bhaskara, Charikar and Vijayaraghavan’s result: Eventhough Kruskal’s result guarantees uniqueness for generic A when n = 3 d/ − A ) = d ), Bhaskara, Charikar and Vijayaraghavan’srobust uniqueness can give a polynomial bound on the reconstruction error on no more than anexponentially small fraction of matrices A when the fraction is measured by the Gaussian measure.This rarity of sufficiently well-conditioned matrices A is somewhat surprising.19 On the complexity of the simplex method and the diameter ofpolytopes
In [BR13], Brunsch and R¨oglin introduced the following property of a matrix:
Definition 6.1 ( δ -distance property, [BGR15]) . Let A = ( a , . . . , a m ) (cid:62) be an m -by- n matrix withunit rows. We say that A satisfies the δ -distance property if: for any I ⊆ [ m ] and any j ∈ [ m ] whenever a j / ∈ span { a i : i ∈ I } we have d ( a j , span { a i : i ∈ I } ) ≥ δ . This property has been used in several papers [BR13, BGR15, DH16, EV17] to study polytopesof the form { x ∈ R n : Ax ≤ b } to provide upper bounds of the form poly( m, n, /δ ) on their diameterand the number of pivot steps of the simplex method. Our Theorem 1.4 combined with Lemma 2.8and concentration of the length of a Gaussian random vector implies that, for m/n ≥ c (cid:48) > A with the δ -distance property for δ ≥ c n , 0 < c <
1, are “rare”: they are exponentiallyunlikely when the rows are iid. random unit vectors. As in Section 5, this rarity of well-conditionedmatrices A is somewhat surprising. In this section we prove that the vertex-facet distance of the convex hull of a linear number of d -dimensional iid. Gaussian points can be exponentially small with probability at least some constant.The argument is a more elaborate version of the argument for the minimum singular value inSection 4 and works in the following way. Figure 2 shows a polytope, the convex hull of a partialsequence of random points, and (cid:15) -innner bands at all facets. If a new point falls into the blueregion, then the new polytope, which is the convex hull of the old polytope plus the new point, willhave vertex-facet distance no larger than (cid:15) : the new point is a vertex and its distance to the affinehull of the facet associated to the band where the point lies in is less than (cid:15) .To get a lower bound on the Gaussian measure of the blue region (Lemma 7.5 ), we add themeasures of the bands and then we subtract the measures of pairwise intersections of bands and (cid:15) -inner neighbourhood (grey region). Lemma 7.4 gives a bound on the measure of a pairwiseintersection. Its proof is divided into two cases: Lemma 7.1 for the case where the two facets donot share vertices and Lemma 7.2 for the case where they do share vertices. This argument is arefinement of the proof of Lemma 4.2.Figure 2: A polytope (triangle) and the region (blue) where a new point would create a smallvertex-facet distance. 20 emma 7.1. Let S , . . . , S d , T , . . . , T d be iid. standard Gaussian random vectors in R d . Let B S = (cid:0) aff { S , . . . , S d } (cid:1) (cid:15)/ , B T = (cid:0) aff { T , . . . , T d } (cid:1) (cid:15)/ . Then for t ≥ P (cid:18) G ( B S ∩ B T ) ≥ (cid:15) t √ π (cid:19) ≤ t d − . Proof.
By the rotational invariance of the Gaussian distribution, unit normal vectors
U, V to B S , B T are independent and are uniformly distributed on S d − . Define B (cid:48) S = { x ∈ R d : | x · U | ≤ (cid:15)/ } , B (cid:48) T = { x ∈ R d : | x · V | ≤ (cid:15)/ } . By a standard argument (say, using logconcavity) we have P (cid:0) G ( B S ∩ B T ) ≥ t (cid:1) ≤ P (cid:0) G ( B (cid:48) S ∩B (cid:48) T ) ≥ t (cid:1) . Then, by the argument in the proof of Lemma 4.2 we get that for any t ≥ P (cid:16) G ( B (cid:48) S ∩ B (cid:48) T ) ≥ (cid:15) t √ π (cid:17) ≤ t d − . The claim follows.
Lemma 7.2.
Let A , . . . , A k , S , . . . , S d − k , T , . . . , T d − k be iid. standard Gaussian random vectorsin R d , and ≤ k ≤ d . Let B S = (cid:0) aff { A , . . . , A k , S , . . . , S d − k } (cid:1) (cid:15)/ , B T = (cid:0) aff { A , . . . , A k , T , . . . , T d − k } (cid:1) (cid:15)/ . Then for < α ≤ β < π/ , P (cid:18) G ( B S ∩ B T ) ≥ (cid:15) √ π sin α (cid:19) ≤ (sin β ) d − k − + 2 (cid:18) sin α sin( β − α ) (cid:19) d − k − . (4) In particular, for t > π we have P (cid:18) G ( B S ∩ B T ) ≥ (cid:15) t √ π (cid:19) ≤ (cid:32) π / √ t (cid:33) d − k − . Proof. If d − k ≤
2, then the bound holds immediately. Otherwise, d − k > d − k + 1)-dimensionalproblem: Conditioning on A i = a i , i = 1 , . . . , k , we project onto the orthogonal complement of thelinear subspace parallel to aff { a , . . . , a k } . We will then prove the bound claimed in (4) conditioningon A , . . . , A k , which implies the claimed bound by total probability.With a slight abuse of notation, we denote the projection of aff { a , . . . , a k } as a and theprojections of S i , T i as S i , T i , i = 1 , . . . , d − k . Using the fact that the Gaussian distribution isrotationally invariant, we may assume without loss of generality that a = µe for some µ ≥
0. A21ormal vector to aff { a , S , . . . , S d − k } is U = det e e · · · e d − k +1 P T ... P Td − k , (5)where P i := S i − a , i ∈ [ d − k ]. Define the matrix P = (cid:0) P · · · P d − k (cid:1) . Let V be a normal vector toaff { a , T , . . . , T d − k } , defined similarly.Set (cid:0) H i P (cid:48) i (cid:1) = P i , where H i ∼ N ( µ,
1) and P (cid:48) i ∼ N (0 , I d − k ). Denote H T = (cid:0) H · · · H d − k (cid:1) as thefirst row of matrix P and P (cid:48) = (cid:0) P (cid:48) · · · P (cid:48) d − k (cid:1) as the rest. H and P (cid:48) are independent.Note that (cid:107) U (cid:107) = det( P T P ) (follows from (5) and the Cauchy-Binet formula). Also, U = U · e = det( P (cid:48) ). We now compute the distribution of the first coordinate of unit normal vector ˆ U (using the matrix determinant lemma to compute the determinant of a rank-1 update).ˆ U = det( P (cid:48) T P (cid:48) )det( P T P )= det( P (cid:48) T P (cid:48) )det( P (cid:48) T P (cid:48) + HH T )= det( P (cid:48) T P (cid:48) )(1 + H T P (cid:48)− P (cid:48)− T H ) det( P (cid:48) T P (cid:48) )= 11 + H T P (cid:48)− P (cid:48)− T H .
Claim 7.3.
We have H T P (cid:48)− P (cid:48)− T H d = (cid:80) d − ki =1 Y i Y , where Y ∼ N (0 , , Y i ∼ N ( µ, , i ∈ [ d − k ] and Y , Y , . . . , Y d − k are independent.Proof of claim. Random variables P (cid:48) and H are independent. Moreover, P (cid:48) is a Gaussian matrixand therefore the distribution of P (cid:48)− is invariant under any orthogonal transformation applied torows or columns. Thus, it is enough to consider the case H = (cid:107) H (cid:107) e . Note that (cid:107) H (cid:107) d = (cid:80) d − ki =1 Y i ,and e T P (cid:48)− P (cid:48)− T e = (cid:107) first row of P (cid:48)− (cid:107) d = Y . The claim follows. ♦ Recall that ˆ
U , ˆ V are unit normal vectors to B S , B T , respectively. We aim to show that P (cid:0) ˆ V ∈C α ( ˆ U ) ∪ C α ( − ˆ U ) (cid:1) , i.e. P (cid:0) | ˆ U · ˆ V | ≥ cos α (cid:1) , is upper bounded by an expression of the form c ( α ) d with c ( α ) → α → C α ( ˆ U ) denotes the spherical cap centered at ˆ U with angle α ). Tosee this, we divide the analysis into two cases, depending on whether the cap is close to e . Thecase analysis depends on a parameter β that will need to satisfy the constraint β ≥ α . Case 1: C α ( ˆ U ) ⊆ C β ( e ) ∪ C β ( − e ) (equivalently, | ˆ U | ≥ cos( β − α )). In the formula for U , the determinant should be interpreted as a formal cofactor expansion along the first row;the entries in the first row are the canonical vectors and the expansion gives the coefficients of these vectors (assubdeterminants).
22n this case, the α -cap around ˆ U is contained in a larger cap centered at e . P (cid:16)(cid:8) ˆ V ∈ C α ( ˆ U ) ∪ C α ( − ˆ U ) (cid:9) ∩ (cid:8) C α ( ˆ U ) ⊆ C β ( e ) ∪ C β ( − e ) (cid:9)(cid:17) ≤ P (cid:0) ˆ V ∈ C β ( e ) ∪ C β ( − e ) (cid:1) (using β ≤ π/
$= P\big(\hat V_1 \ge \cos\beta\big). \qquad (6)$

From Claim 7.3 we get $\hat V_1 \overset{d}{=} Y_0/\sqrt{Y_0^2 + \sum_{i=1}^{d-k} Y_i^2}$. To upper bound (6), we get from Lemma 2.3 that making $a = 0$ (equivalently, $\mu = 0$) only makes the right-hand side larger, and we then bound the case $a = 0$ explicitly. More precisely, let $W$ be a normal vector to $\operatorname{span}\{T_1, \dots, T_{d-k}\}$ defined similarly to $U$ and $V$:

$$W = \det \begin{pmatrix} e_1 & e_2 & \cdots & e_{d-k+1} \\ & T_1^T & \\ & \vdots & \\ & T_{d-k}^T & \end{pmatrix}.$$

Note that $\hat W$ is a uniformly random unit vector. Following the same computation as for $V$, one can derive $\hat W_1 \overset{d}{=} X_0/\sqrt{X_0^2 + \sum_{i=1}^{d-k} X_i^2}$, where $X_0, X_i \sim N(0,1)$, $i \in [d-k]$. Then, by Lemma 2.3, $P(\hat V_1 \ge \cos\beta) \le P(\hat W_1 \ge \cos\beta)$. Hence,

$$\begin{aligned}
P\Big(\big\{\hat V \in C_\alpha(\hat U) \cup C_\alpha(-\hat U)\big\} \cap \big\{C_\alpha(\hat U) \subseteq C_\beta(e_1) \cup C_\beta(-e_1)\big\}\Big)
&\le P\big(\hat W_1 \ge \cos\beta\big) = P\Big(\sum_{i=2}^{d-k+1} \hat W_i^2 \le \sin^2\beta\Big) \\
&\le P\Big(\sum_{i=2}^{d-k} \hat W_i^2 \le \sin^2\beta\Big) \le (\sin\beta)^{d-k-2} \quad \text{(Lemma 2.7)}.
\end{aligned}$$

Case 2: $C_\alpha(\hat U) \not\subseteq C_\beta(e_1) \cup C_\beta(-e_1)$.

If $C_\alpha(\hat U)$ is not contained in $C_\beta(e_1) \cup C_\beta(-e_1)$, then $\hat U$ makes an angle at least $\beta - \alpha$ with $e_1$ and $-e_1$, that is,

$$|\hat U_1| < \cos(\beta - \alpha). \qquad (7)$$

Our goal here is to bound

$$\begin{aligned}
&P\Big(\big\{\hat V \in C_\alpha(\hat U) \cup C_\alpha(-\hat U)\big\} \cap \big\{C_\alpha(\hat U) \not\subseteq C_\beta(e_1) \cup C_\beta(-e_1)\big\}\Big) \\
&\quad= P\Big(\big\{\hat V \in C_\alpha(\hat U)\big\} \cap \big\{C_\alpha(\hat U) \not\subseteq C_\beta(e_1) \cup C_\beta(-e_1)\big\}\Big)
 + P\Big(\big\{\hat V \in C_\alpha(-\hat U)\big\} \cap \big\{C_\alpha(\hat U) \not\subseteq C_\beta(e_1) \cup C_\beta(-e_1)\big\}\Big) \\
&\quad= 2\, P\Big(\big\{\hat V \in C_\alpha(\hat U)\big\} \cap \big\{C_\alpha(\hat U) \not\subseteq C_\beta(e_1) \cup C_\beta(-e_1)\big\}\Big). \qquad (8)
\end{aligned}$$

[Figure 3: Case 2 of the proof of Lemma 7.4.]

Observe that the distributions of $\hat U$ and $\hat V$ are invariant under rotations that fix $e_1$. Thus, if we let $U^-, V^-$ be the projections of $\hat U, \hat V$ orthogonal to $e_1$ and $\widehat{U^-}, \widehat{V^-}$ be their normalizations, respectively, then $\widehat{U^-}, \widehat{V^-} \sim \operatorname{Unif}(S^{d-k-1})$.

This observation motivates us to use the corresponding probability for the projections to bound (8). We will show that under condition (7) of Case 2, $\hat V \in C_\alpha(\hat U)$ implies that $\widehat{V^-} \in C_{f(\alpha)}(\widehat{U^-})$, where $f(\alpha)$ is a bound (to be determined) on the angle that depends only on $\alpha$. As events,

$$\{\hat V \in C_\alpha(\hat U)\} \subseteq \{V^- \in \operatorname{Proj}_{e_1^\perp} C_\alpha(\hat U)\} \subseteq \{\widehat{V^-} \in C_{f(\alpha)}(\widehat{U^-})\}. \qquad (9)$$

Bounding $f(\alpha)$ is a three-dimensional problem, since $\widehat{U^-}, \widehat{V^-}$ lie in $\operatorname{span}\{e_1, \hat U, \hat V\}$. From now on the analysis lives in this three-dimensional space. Let $\tilde e_2 = (\hat U - \hat U_1 e_1)/\|\hat U - \hat U_1 e_1\|$ (so that $\{e_1, \tilde e_2\}$ is an orthonormal basis of $\operatorname{span}\{e_1, \hat U\}$). Let $\{e_1, \tilde e_2, \tilde e_3\}$ be an orthonormal basis of $\operatorname{span}\{e_1, \hat U, \hat V\}$, and let $\hat U = (\hat U_1, \hat U_2, 0)$ be the coordinate tuple of $\hat U$ relative to $\{e_1, \tilde e_2, \tilde e_3\}$. Consider $x \in C_\alpha(\hat U)$ such that $x \cdot \hat U = \cos\gamma$. Its coordinates $(x_1, x_2, x_3)$ in our chosen basis satisfy the following system of equations:

$$x_1^2 + x_2^2 + x_3^2 = 1, \qquad x_1 \hat U_1 + x_2 \hat U_2 = \cos\gamma.$$

The projections of all such $x$ (for fixed $\gamma$) onto $\operatorname{span}\{\tilde e_2, \tilde e_3\}$ form the ellipse

$$(x_2 - \hat U_2 \cos\gamma)^2 + x_3^2\, \hat U_1^2 = \hat U_1^2 \sin^2\gamma.$$

If $\hat U_1 = 0$, then $\hat U_2 = 1$, and the projection is the line segment inside the unit circle at $x_2 = \cos\gamma$. The angle between $x^-$ and $\widehat{U^-}$ is then upper bounded by $\gamma$. As $\gamma$ ranges from $0$ to $\alpha$, we conclude that $\widehat{U^-}$ and $\widehat{V^-}$ form an angle at most $\alpha$ when $\hat U_1 = 0$.

If $\hat U_1 \ne 0$, the projection is an ellipse inside the unit circle. As shown in Figure 3, the angle between $x^-$ and $\widehat{U^-}$ can be upper bounded by the angle formed by $\widehat{U^-}$ and the tangent line

$$x_2 = \frac{\sqrt{\cos^2\gamma - \hat U_1^2}}{\sin\gamma}\, x_3.$$

Note that from (7) we know $\hat U_1^2 < \cos^2(\beta-\alpha) \le \cos^2\alpha \le \cos^2\gamma$ (here we use $\beta \ge 2\alpha$ explicitly), so the tangent line always exists. Hence the angle between $x^-$ and $\widehat{U^-}$ is at most $\arctan\big(\sin\gamma/\sqrt{\cos^2\gamma - \hat U_1^2}\big)$. Furthermore, since $\arctan\big(\sin\gamma/\sqrt{\cos^2\gamma - \hat U_1^2}\big)$ is increasing in $\gamma$, we conclude that for any $\hat V \in C_\alpha(\hat U)$, its normalized projection orthogonal to $e_1$, namely $\widehat{V^-}$, is contained in the spherical cap centered at $\widehat{U^-}$ with polar angle at most $\arctan\big(\sin\alpha/\sqrt{\cos^2\alpha - \hat U_1^2}\big)$ when $\hat U_1 \ne 0$. Therefore, with (7), we can take

$$f(\alpha) = \max\left\{\arctan\left(\frac{\sin\alpha}{\sqrt{\cos^2\alpha - \cos^2(\beta-\alpha)}}\right),\ \alpha\right\}.$$

Combining with (8) and (9),

$$\begin{aligned}
P\big(\{\hat V \in C_\alpha(\hat U) \cup C_\alpha(-\hat U)\} \cap \{C_\alpha(\hat U) \not\subseteq C_\beta(e_1) \cup C_\beta(-e_1)\}\big)
&\le 2\, P\big(\{\widehat{V^-} \in C_{f(\alpha)}(\widehat{U^-})\} \cap \{C_\alpha(\hat U) \not\subseteq C_\beta(e_1) \cup C_\beta(-e_1)\}\big) \\
&\le 2\, P\big(|\widehat{U^-} \cdot \widehat{V^-}| \ge \cos f(\alpha)\big) \\
&= 2\, P\Big(\sqrt{1 - (\widehat{U^-} \cdot \widehat{V^-})^2} \le \sin f(\alpha)\Big) \\
&\le 2\,(\sin f(\alpha))^{d-k-2} \quad \text{(Lemma 2.7)} \\
&= 2\left(\max\left\{\frac{\sin\alpha}{\sqrt{\cos^2\alpha - \cos^2(\beta-\alpha) + \sin^2\alpha}},\ \sin\alpha\right\}\right)^{d-k-2} \\
&= 2\left(\max\left\{\frac{\sin\alpha}{\sin(\beta-\alpha)},\ \sin\alpha\right\}\right)^{d-k-2}
 = 2\left(\frac{\sin\alpha}{\sin(\beta-\alpha)}\right)^{d-k-2}.
\end{aligned}$$

Therefore,

$$P\big(|\hat U \cdot \hat V| \ge \cos\alpha\big) \le (\sin\beta)^{d-k-2} + 2\left(\frac{\sin\alpha}{\sin(\beta-\alpha)}\right)^{d-k-2}. \qquad (10)$$

Note that we proved bound (10) conditioning on the $A_i$'s, hence it is also a valid bound for random $A_i$'s (unconditionally). By Lemma 4.1, (10) implies

$$P\left(G(B_S \cap B_T) \ge \frac{\varepsilon^2}{\sqrt{2\pi}\,\sin\alpha}\right) \le (\sin\beta)^{d-k-2} + 2\left(\frac{\sin\alpha}{\sin(\beta-\alpha)}\right)^{d-k-2}. \qquad (11)$$

Using the inequalities $(2/\pi)x \le \sin x \le x$ for $0 \le x \le \pi/2$,

$$P\left(G(B_S \cap B_T) \ge \frac{\sqrt{\pi}\,\varepsilon^2}{2\sqrt{2}\,\alpha}\right) \le \beta^{d-k-2} + 2\left(\frac{\pi\alpha}{2(\beta-\alpha)}\right)^{d-k-2}.$$

Set $\beta = \sqrt{\alpha}$ and restrict $0 < \alpha < 1/4$ so that $\sqrt{\alpha} \le 1/2$. The above probabilistic bound simplifies to

$$\begin{aligned}
P\left(G(B_S \cap B_T) \ge \frac{\sqrt{\pi}\,\varepsilon^2}{2\sqrt{2}\,\alpha}\right)
&\le \alpha^{(d-k-2)/2} + 2\left(\frac{\pi\alpha}{2(\sqrt{\alpha}-\alpha)}\right)^{d-k-2} \\
&= \alpha^{(d-k-2)/2} + 2\left(\frac{\pi\sqrt{\alpha}}{2(1-\sqrt{\alpha})}\right)^{d-k-2} \\
&\le \alpha^{(d-k-2)/2} + 2\big(\pi\sqrt{\alpha}\big)^{d-k-2} \qquad (\text{use } 1-\sqrt{\alpha} > 1/2) \\
&\le 3\big(\pi\sqrt{\alpha}\big)^{d-k-2}.
\end{aligned}$$

The claim follows by setting $\alpha = \pi/t$.
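As a quick, non-rigorous illustration of the bound just derived, the following Monte Carlo sketch estimates $P(|\hat U \cdot \hat V| \ge \cos\alpha)$ for the unit normals of two random hyperplanes through $d$ Gaussian points sharing $k$ of them, and compares it with the simplified bound $3(\pi\sqrt{\alpha})^{d-k-2}$ (valid for $0 < \alpha < 1/4$). This is not part of the proof; the choices of $d$, $k$, $\alpha$ and the trial count are arbitrary, and the experiment only checks that the empirical frequency does not exceed the bound, not that the bound is tight.

```python
# Monte Carlo sanity check (illustration only) of the tail bound
# P(|U.V| >= cos(alpha)) <= 3 * (pi * sqrt(alpha))^(d - k - 2)
# for unit normals U, V of two random hyperplanes sharing k Gaussian points.
import numpy as np

rng = np.random.default_rng(0)

def unit_normal(points):
    """Unit normal of aff(points); points is a (d, d) array of d points in R^d."""
    diffs = points[1:] - points[0]      # d - 1 spanning directions of the hyperplane
    _, _, vt = np.linalg.svd(diffs)     # last right-singular vector spans the null space
    return vt[-1]

d, k, alpha, trials = 30, 10, 0.05, 20000
hits = 0
for _ in range(trials):
    A = rng.standard_normal((k, d))         # shared points
    S = rng.standard_normal((d - k, d))     # points completing the first hyperplane
    T = rng.standard_normal((d - k, d))     # points completing the second hyperplane
    U = unit_normal(np.vstack([A, S]))
    V = unit_normal(np.vstack([A, T]))
    hits += abs(U @ V) >= np.cos(alpha)

print("empirical P(|U.V| >= cos alpha):", hits / trials)
print("bound 3*(pi*sqrt(alpha))^(d-k-2):", 3 * (np.pi * np.sqrt(alpha)) ** (d - k - 2))
```

In this regime the empirical frequency is typically 0 while the bound is of order $10^{-3}$, consistent with (10) being a generous upper bound.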
Combining Lemmas 7.1 and 7.2 we get Lemma 7.4.

Lemma 7.4. Let $A_1, \dots, A_k, S_1, \dots, S_{d-k}, T_1, \dots, T_{d-k}$ be iid standard Gaussian random vectors in $\mathbb{R}^d$ and $1 \le k \le d-2$. For $\varepsilon > 0$, let

$$B_S = \big(\operatorname{aff}\{A_1, \dots, A_k, S_1, \dots, S_{d-k}\}\big)^{\varepsilon/2}, \qquad B_T = \big(\operatorname{aff}\{A_1, \dots, A_k, T_1, \dots, T_{d-k}\}\big)^{\varepsilon/2}.$$

Then for $t > 4\pi$ we have

$$P\left(G(B_S \cap B_T) \ge \frac{\varepsilon^2 t}{2\sqrt{2\pi}}\right) \le 3\left(\frac{\pi^{3/2}}{\sqrt{t}}\right)^{d-k-2}.$$

Suppose $P_n = \operatorname{conv}(A_1, \dots, A_n)$ is a full-dimensional simplicial polytope in $\mathbb{R}^d$ and $\mathcal{F}_n$ is its set of facets. For $S \in \mathcal{F}_n$, we abuse notation so that $S$ also denotes the index set of vertices of $S$. Let $U_S$ be a unit inner normal vector of $\operatorname{aff}(A_S)$ to $P_n$. Define

$$(\operatorname{aff} A_S)^{\varepsilon-} := \{x + sU_S : x \in \operatorname{aff}(A_S),\ 0 \le s \le \varepsilon\},$$

the slab of width $\varepsilon$ on the inner side of $\operatorname{aff}(A_S)$.

Theorem. Let $\delta \in (0,1]$. Suppose $A_1, \dots, A_n$ are $d$-dimensional iid standard Gaussian random vectors with $d = \lfloor \delta n \rfloor$. Let $P_n = \operatorname{conv}(A_1, \dots, A_n)$, which is full-dimensional and simplicial a.s. For $\varepsilon > 0$, define a.s.

$$V_n = G\left(\bigcup_{S \in \mathcal{F}_n} (\operatorname{aff} A_S)^{\varepsilon-} \setminus P_n\right).$$

1. $V_n \le \frac{\varepsilon}{\sqrt{2\pi}} \binom{n}{d}$.
2. There exist constants $c_1 > 1$ and $c_2, c_3 > 0$ (that depend only on $\delta$) such that when $\varepsilon = \varepsilon(d) \le 1/(c_3 c_1^d)$ we have $\lim_{n\to\infty} P\big(V_n \ge (c_1^d/c_2)\,\varepsilon\big) = 1$.

Proof of part 1. The upper bound follows from a union bound over the at most $\binom{n}{d}$ facets, together with the fact that the 1-dimensional Gaussian density is upper bounded by $1/\sqrt{2\pi}$, so that each slab has Gaussian measure at most $\varepsilon/\sqrt{2\pi}$.

Proof of part 2. From Corollary 2.15, there exists a constant $c_F > 1$ (that depends only on $\delta$) such that $P\big(|\mathcal{F}_n| \ge c_F^d\big) \to 1$ as $n \to \infty$. Since $P_n$ is simplicial a.s., we may represent $\mathcal{F}_n$ as a set of binary $n$-vectors with exactly $d$ ones. Let $A_{\mathcal{F}_n}(t)$ be the maximum number of vectors in $\mathcal{F}_n$ with pairwise Hamming distance at least $t$. As in the proof of Lemma 2.6, one can pick vectors greedily (Gilbert-Varshamov bound) so that when $|\mathcal{F}_n| \ge c_F^d$ and $c \in (0,1)$, using $n/d \le 2/\delta$ when $d$ is large enough,

$$A_{\mathcal{F}_n}(cd) \ge \frac{c_F^d}{(ne/(cd))^{cd}} \ge \frac{c_F^d}{(2e/(c\delta))^{cd}}.$$

Since $\lim_{c \to 0^+} (2e/(c\delta))^c = 1$ and $(2e/(c\delta))^c$ is increasing for $0 \le c \le 2/\delta$, we can pick $c_4 \in (0,1)$ such that $(2e/(c_4\delta))^{c_4} < c_F$. Let $c_5 = c_F/(2e/(c_4\delta))^{c_4} > 1$. Then we have

$$\lim_{n\to\infty} P\big(A_{\mathcal{F}_n}(c_4 d) \ge c_5^d\big) = 1. \qquad (12)$$

In this way we get a subset of facets $\mathcal{T} \subseteq \mathcal{F}_n$ such that any two different facets in $\mathcal{T}$ share no more than $(1 - c_4/2)d$ vertices and $|\mathcal{T}| = \lceil c_5^d \rceil$, for constants $0 < c_4 < 1 < c_5$ (that depend only on $\delta$). Let $N = |\mathcal{T}|$.

Let $B_S = (\operatorname{aff} A_S)^{\varepsilon-}$, $S \in \mathcal{F}_n$. Using an argument similar to the proof of Lemma 4.3, we get

$$\begin{aligned}
V_n &= G\left(\bigcup_{S\in\mathcal F_n} (\operatorname{aff} A_S)^{\varepsilon-} \setminus P_n\right)
 = G\left(\bigcup_{S\in\mathcal F_n} (\operatorname{aff} A_S)^{\varepsilon-}\right) - G\big(P_n \setminus (P_n)^{-\varepsilon}\big) \\
&\ge G\left(\bigcup_{S\in\mathcal T} (\operatorname{aff} A_S)^{\varepsilon-}\right) - G\big(P_n \setminus (P_n)^{-\varepsilon}\big) \\
&\ge \sum_{S\in\mathcal T} G(B_S) - \frac12\sum_{S,T\in\mathcal T,\, S\ne T} G(B_S\cap B_T) - G\big(P_n\setminus(P_n)^{-\varepsilon}\big).
\end{aligned}$$

We are going to bound each of the three terms in the last expression.

First term: $\sum_{S\in\mathcal T} G(B_S)$. From Lemma 2.12, there exists a constant $c_0 > 0$ (that depends only on $\delta$) such that

$$P\Big(\max_{S \subseteq [n],\, |S| = d} \operatorname{dist}(\operatorname{aff} A_S, 0) \le c_0\Big) \ge 1 - e^{-d}.$$

Moreover, we increase $c_0$ if necessary so that $c_0 > 1$, which ensures that $c_0 \ge \varepsilon$. Recall that $B_S = (\operatorname{aff} A_S)^{\varepsilon-}$. We get

$$P\left(\sum_{S\in\mathcal T} G(B_S) \ge \frac{N\varepsilon}{\sqrt{2\pi}}\, e^{-2c_0^2}\right) \ge 1 - e^{-d}. \qquad (13)$$

Second term: $\frac12\sum_{S,T\in\mathcal T,\,S\ne T} G(B_S\cap B_T)$. Use Lemma 7.4 in a union bound applied to all pairs of sets in $\mathcal{T}$.
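As an aside, the greedy (Gilbert-Varshamov style) construction of $\mathcal{T}$ above can be sketched in a few lines of code. This is a minimal sketch for intuition only: the function name `greedy_packing` and the toy facet list are illustrative and are not part of the paper's development; facets are encoded as sets of $d$ vertex indices, so sharing at most $(1 - c_4/2)d$ vertices is the same as Hamming distance at least $c_4 d$ between the indicator vectors.

```python
# Schematic greedy packing: keep facets that pairwise share at most (1 - c4/2)*d vertices.
def greedy_packing(facets, d, c4):
    """facets: iterable of frozensets of vertex indices, each of size d; 0 < c4 < 1."""
    chosen = []
    max_shared = (1 - c4 / 2) * d
    for F in facets:
        # accept F only if it is far (in Hamming distance) from everything kept so far
        if all(len(F & G) <= max_shared for G in chosen):
            chosen.append(F)
    return chosen

# Toy example with hypothetical facets of a simplicial polytope in R^3 (d = 3):
facets = [frozenset({0, 1, 2}), frozenset({0, 1, 3}),
          frozenset({2, 4, 5}), frozenset({3, 4, 5})]
print(greedy_packing(facets, d=3, c4=0.9))  # keeps facets sharing at most 1.65 -> 1 vertex
```

The greedy argument in the proof bounds how many facets survive this filtering; the code only illustrates the selection rule itself.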
For $t > \pi^3$ we have

$$\frac12\sum_{S,T\in\mathcal T,\, S\ne T} G(B_S\cap B_T) \le \binom{N}{2}\frac{\varepsilon^2 t}{2\sqrt{2\pi}}$$

with probability at least

$$1 - \binom{N}{2}\, 3\left(\frac{\pi^{3/2}}{\sqrt t}\right)^{d-(1-c_4/2)d-2}
 \ge 1 - \frac{3N^2 t}{2\pi^3}\left(\frac{\pi^{3/2}}{\sqrt t}\right)^{c_4 d/2}
 = 1 - \frac{3t}{2\pi^3}\left(\frac{c_5^{4/c_4}\,\pi^{3/2}}{\sqrt t}\right)^{c_4 d/2}.$$

Choose $t = c_6 := \pi^3\big(e\, c_5^{4/c_4}\big)^2$ to get

$$P\left(\frac12\sum_{S,T\in\mathcal T,\, S\ne T} G(B_S\cap B_T) \le \binom{N}{2}\frac{\varepsilon^2 c_6}{2\sqrt{2\pi}}\right) \ge 1 - \frac{3c_6}{2\pi^3}\, e^{-c_4 d/2}. \qquad (14)$$

Third term: $G\big(P_n \setminus (P_n)^{-\varepsilon}\big)$. From Corollary 2.9, we know $G\big(P_n \setminus (P_n)^{-\varepsilon}\big) \le c_7\,\varepsilon\, d^{1/2}$ for some absolute constant $c_7$.

Combining (12), (13) and (14) we conclude that with probability $1 - o(1)$ as $n \to \infty$:

$$\begin{aligned}
V_n &\ge \sum_{S\in\mathcal T} G(B_S) - \frac12\sum_{S,T\in\mathcal T,\, S\ne T} G(B_S\cap B_T) - G\big(P_n\setminus(P_n)^{-\varepsilon}\big) \\
&\ge \frac{N\varepsilon}{\sqrt{2\pi}}\, e^{-2c_0^2} - \binom{N}{2}\frac{\varepsilon^2 c_6}{2\sqrt{2\pi}} - c_7\,\varepsilon\, d^{1/2} \\
&\ge \frac{N\varepsilon}{\sqrt{2\pi}}\left(e^{-2c_0^2} - \frac{N\varepsilon c_6}{4} - \frac{\sqrt{2\pi}\, c_7\, d^{1/2}}{N}\right).
\end{aligned}$$

Note that $\sqrt{2\pi}\,c_7\,d^{1/2}/N$ decays exponentially in $d$, since $N \ge c_5^d$ with $c_5 > 1$. Therefore, when $\varepsilon \le 1/(3e^{2c_0^2} c_6 N)$,

$$\lim_{n\to\infty} P\left(V_n \ge \frac{N\varepsilon}{3\sqrt{2\pi}\, e^{2c_0^2}}\right) = 1.$$

The proof is finished by setting $c_1 = c_5$, $c_2 = 3\sqrt{2\pi}\, e^{2c_0^2}$ and $c_3 = 3e^{2c_0^2} c_6$.

We are now ready to restate and prove the main result of the section.

Theorem 1.3. Let $\delta \in (0,1]$. Suppose $A = \{A_1, \dots, A_{n+1}\}$ is a set of iid standard Gaussian random vectors in $\mathbb{R}^d$ and $d = \lfloor \delta n \rfloor$. Let $P_{n+1} = \operatorname{conv}(A_1, \dots, A_{n+1})$. Then

$$P\big(\operatorname{diam}(P_{n+1}) \ge \sqrt{d}\big) \ge 1 - e^{-nd/32},$$

and there exist constants $0 < c, c' < 1$ (that depend only on $\delta$) such that

$$\lim_{n\to\infty} P\big(\operatorname{vf}(P_{n+1}) \le c^d\big) \ge c'.$$

Hence the condition measure $\kappa = \operatorname{vf}(P_{n+1})/\operatorname{diam}(P_{n+1})$ of $A$ is exponentially small in $d$ with constant probability.

Proof. For $\operatorname{diam}(P_{n+1})$, by Lemma 2.5 we have

$$\begin{aligned}
P\Big(\operatorname{diam}(P_{n+1})^2 \le 2\big(d - 2\sqrt{dt}\big)\Big)
&= P\Big(\|A_i - A_j\|^2 \le 2\big(d - 2\sqrt{dt}\big),\ \forall\, i \ne j \in [n+1]\Big) \\
&\le P\left(\bigcap_{i=1}^{\lfloor (n+1)/2 \rfloor} \Big\{\|A_{2i-1} - A_{2i}\|^2 \le 2\big(d - 2\sqrt{dt}\big)\Big\}\right) \le \big(e^{-t}\big)^{n/2}.
\end{aligned}$$

We get the claimed bound by setting $t = d/16$, so that $2(d - 2\sqrt{dt}) = d$.

Applying the preceding theorem to $P_n = \operatorname{conv}(A_1, \dots, A_n)$ with $\varepsilon = 1/(c_3 c_1^d)$, we have

$$\lim_{n\to\infty} P\left(V_n \ge \frac{1}{c_2 c_3}\right) = 1.$$

Since $\operatorname{vf}(P_{n+1}) \le \varepsilon$ when $A_{n+1} \in \bigcup_{S\in\mathcal F_n}(\operatorname{aff} A_S)^{\varepsilon-}\setminus P_n$,

$$\lim_{n\to\infty} P\big(\operatorname{vf}(P_{n+1}) \le 1/(c_3 c_1^d)\big) \ge \frac{1}{c_2 c_3}.$$

The claim follows by picking $c = 1/c_1$ and $c' = 1/(c_2 c_3)$.

Acknowledgments. We would like to thank Nina Amenta, Jesús De Loera, Miles Lopes, Javier Peña, Thomas Strohmer, Roman Vershynin and Van Vu for helpful discussions. This material is based upon work supported by the National Science Foundation under Grants CCF-1657939, CCF-1422830, CCF-2006994 and CCF-1934568.

References

[AV97] Noga Alon and Van Vu. Anti-Hadamard matrices, coin weighing, threshold gates, and indecomposable hypergraphs. Journal of Combinatorial Theory, Series A, 79(1):133-160, 1997.

[AW91] Fernando Affentranger and John A. Wieacker. On the convex hull of uniform random points in a simple d-polytope. Discrete & Computational Geometry, 6:291-305, 1991.

[Bal93] Keith Ball. The reverse isoperimetric problem for Gaussian measure. Discrete & Computational Geometry, 10(4):411-420, 1993.

[BCMV13] Aditya Bhaskara, Moses Charikar, Ankur Moitra, and Aravindan Vijayaraghavan. Smoothed analysis of tensor decompositions. CoRR, abs/1311.3651, 2013.

[BCV14] Aditya Bhaskara, Moses Charikar, and Aravindan Vijayaraghavan. Uniqueness of tensor decompositions with applications to polynomial identifiability.
In Conference on Learning Theory, pages 742-778, 2014.

[BGMN05] Franck Barthe, Olivier Guédon, Shahar Mendelson, and Assaf Naor. A probabilistic approach to the geometry of the $\ell_p^n$-ball. The Annals of Probability, 33(2):480-513, 2005.

[BGR15] Tobias Brunsch, Anna Großwendt, and Heiko Röglin. Solving totally unimodular LPs with the shadow vertex algorithm. In 32nd International Symposium on Theoretical Aspects of Computer Science (STACS 2015), volume 30 of LIPIcs, pages 171-183, 2015.

[BLR18] Károly J. Böröczky, Gábor Lugosi, and Matthias Reitzner. Facets of high-dimensional Gaussian polytopes. arXiv preprint arXiv:1808.01431, 2018.

[BR13] Tobias Brunsch and Heiko Röglin. Finding short paths on polytopes by the shadow vertex algorithm. In International Colloquium on Automata, Languages, and Programming, pages 279-290. Springer, 2013.

[BS17] Amir Beck and Shimrit Shtern. Linearly convergent away-step conditional gradient for non-strongly convex functions. Mathematical Programming, 164(1):1-27, 2017.

[BV04] René Beier and Berthold Vöcking. Typical properties of winners and losers in discrete optimization. In Proceedings of the 36th Annual ACM Symposium on Theory of Computing, pages 343-352, 2004.

[BV06] René Beier and Berthold Vöcking. Typical properties of winners and losers in discrete optimization. SIAM J. Comput., 35(4):855-881, 2006.

[CCK17] Victor Chernozhukov, Denis Chetverikov, and Kengo Kato. Central limit theorems and bootstrap in high dimensions. The Annals of Probability, 45(4):2309-2352, 2017.

[CJL19] T. Tony Cai, Tiefeng Jiang, and Xiaoou Li. Asymptotic analysis for extreme eigenvalues of principal minors of random matrices. arXiv preprint arXiv:1905.08757, 2019.

[CT05] Emmanuel J. Candès and Terence Tao. Decoding by linear programming. IEEE Trans. Inf. Theory, 51(12):4203-4215, 2005.

[CT06] Emmanuel J. Candès and Terence Tao. Near-optimal signal recovery from random projections: Universal encoding strategies? IEEE Trans. Inf. Theory, 52(12):5406-5425, 2006.

[DH16] Daniel Dadush and Nicolai Hähnle. On the shadow simplex method for curved polyhedra. Discret. Comput. Geom., 56(4):882-909, 2016.

[DT05] David L. Donoho and Jared Tanner. Neighborliness of randomly projected simplices in high dimensions. Proceedings of the National Academy of Sciences, 102(27):9452-9457, 2005.

[EV17] Friedrich Eisenbrand and Santosh S. Vempala. Geometric random edge. Math. Program., 164(1-2):325-339, 2017.

[FW56] Marguerite Frank and Philip Wolfe. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3(1-2):95-110, 1956.

[GH13] Dan Garber and Elad Hazan. A polynomial time conditional gradient algorithm with applications to online and stochastic optimization. CoRR, abs/1301.4666, 2013.

[GM86] Jacques Guélat and Patrice Marcotte. Some comments on Wolfe's 'away step'. Mathematical Programming, 35(1):110-119, 1986.

[HMR04] Daniel Hug, Götz Olaf Munsonius, and Matthias Reitzner. Asymptotic mean values of Gaussian polytopes. Beiträge Algebra Geom., 45(2):531-548, 2004.

[HR05] Daniel Hug and Matthias Reitzner. Gaussian polytopes: variances and limit theorems. Adv. in Appl. Probab., 37(2):297-320, 2005.

[Juk11] Stasys Jukna. Extremal Combinatorics: With Applications in Computer Science. Springer Science & Business Media, 2011.

[Kru77] Joseph B. Kruskal. Three-way arrays: rank and uniqueness of trilinear decompositions, with application to arithmetic complexity and statistics. Linear Algebra Appl., 18(2):95-138, 1977.

[LHR20] Jesús A. De Loera, Jamie Haddock, and Luis Rademacher.
The minimum Euclidean-norm point in a convex polytope: Wolfe's combinatorial algorithm is exponential. SIAM J. Comput., 49(1):138-169, 2020.

[LJJ13] Simon Lacoste-Julien and Martin Jaggi. An affine invariant linear convergence analysis for Frank-Wolfe algorithms. 2013.

[LJJ15] Simon Lacoste-Julien and Martin Jaggi. On the global linear convergence of Frank-Wolfe optimization variants. In Advances in Neural Information Processing Systems 28, pages 496-504, 2015.

[LM00] B. Laurent and P. Massart. Adaptive estimation of a quadratic functional by model selection. Ann. Statist., 28(5):1302-1338, 2000.

[Naz03] Fedor Nazarov. On the maximal perimeter of a convex set in $\mathbb{R}^n$ with respect to a Gaussian measure. In Geometric Aspects of Functional Analysis, pages 169-187. Springer, 2003.

[PNAJ20] Fabian Pedregosa, Geoffrey Negiar, Armin Askari, and Martin Jaggi. Linearly convergent Frank-Wolfe with backtracking line-search. In International Conference on Artificial Intelligence and Statistics, pages 1-10. PMLR, 2020.

[PnR19] Javier Peña and Daniel Rodríguez. Polytope conditioning and linear convergence of the Frank-Wolfe algorithm. Math. Oper. Res., 44(1):1-18, 2019.

[PnRS16] Javier Peña, Daniel Rodríguez, and Negar Soheili. On the von Neumann and Frank-Wolfe algorithms with away steps. SIAM Journal on Optimization, 26(1):499-512, 2016.

[Ray70] H. Raynaud. Sur l'enveloppe convexe des nuages de points aléatoires dans $\mathbb{R}^n$. I. J. Appl. Probability, 7:35-48, 1970.

[Raz88] Alexander A. Razborov. Bounded-depth formulae over $\{\&, \oplus\}$ and some combinatorial problems. Problems of Cybernetics. Complexity Theory and Applied Mathematical Logic, pages 149-166, 1988.

[RV05] Heiko Röglin and Berthold Vöcking. Smoothed analysis of integer programming. In International Conference on Integer Programming and Combinatorial Optimization, pages 276-290. Springer, 2005.

[RV07] Heiko Röglin and Berthold Vöcking. Smoothed analysis of integer programming. Math. Program., 110(1):21-56, 2007.

[ST01] Daniel Spielman and Shang-Hua Teng. Smoothed analysis of algorithms: Why the simplex algorithm usually takes polynomial time. In Proceedings of the Thirty-Third Annual ACM Symposium on Theory of Computing, pages 296-305, 2001.

[Ver18] Roman Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science, volume 47 of Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2018.

[VS92] A. M. Vershik and P. V. Sporyshev. Asymptotic behavior of the number of faces of random polyhedra and the neighborliness problem. Volume 11, pages 181-201, 1992.

[Wol76] Philip Wolfe. Finding the nearest point in a polytope. Mathematical Programming, 11(1):128-149, 1976.

[Zie93] Günter M. Ziegler. Lectures on Polytopes.