A Simple Proof of Optimal Approximations
Mónika Csikós∗   Nabil H. Mustafa†

Université Paris-Est, LIGM, Equipe A3SI, ESIEE Paris, Cité Descartes, 2 boulevard Blaise Pascal, 93162 Noisy-le-Grand Cedex, France.
Abstract
The fundamental result of Li, Long, and Srinivasan [LLS01] on approximations of set systems has become a key tool across several communities such as learning theory, algorithms, combinatorics and data analysis (described as 'the pinnacle of a long sequence of papers' in [HP11, Section 7.4]).

The goal of this paper is to give a simpler, self-contained, modular proof of this result for finite set systems. The only ingredient we assume is the standard Chernoff concentration bound. This makes the proof accessible to a wider audience, including readers not familiar with techniques from statistical learning theory, and makes it possible to cover it in a single self-contained lecture in an algorithms course.

∗Email: [email protected]. †Email: [email protected].

Introduction
Given a finite set system $(X, \mathcal F)$, our goal is to construct a small set $A \subseteq X$ such that each set of $\mathcal F$ is 'well-approximated' by $A$. Research on such approximations started in the 1950s, with random sampling being the key tool for showing their existence as well as for construction algorithms. Since then, the notion of approximations has become a fundamental structure across several communities: learning theory, statistics, combinatorics, algorithms and data analysis. This work addresses two of the milestones in this area: the original work of Vapnik and Chervonenkis [VC71] on the basic notion of $\epsilon$-approximations, and then its generalization by Li, Long, and Srinivasan [LLS01].

$\epsilon$-approximations. Given $(X, \mathcal F)$ with $n = |X|$ and a parameter $\epsilon > 0$,
a set $A \subseteq X$ of size $t$ is called an $\epsilon$-approximation for $\mathcal F$ if for all $S \in \mathcal F$,
$$\left| \frac{|S|}{n} - \frac{|A \cap S|}{t} \right| \le \epsilon, \qquad \text{or equivalently,} \qquad |A \cap S| \in \left[ \frac{|S|\,t}{n} - \epsilon t,\; \frac{|S|\,t}{n} + \epsilon t \right].$$

A basic upper bound on sizes of $\epsilon$-approximations follows immediately from Chernoff's bound, which we first recall (see [AS12]).

Theorem A (Chernoff's bound). Let $X$ be a set of $n$ elements and $A$ be a uniform random sample of $X$ of size $t$. Then for any $S \subseteq X$ and $\eta > 0$,
$$\Pr\left[\, |A \cap S| \notin \left( \frac{|S|\,t}{n} - \eta,\; \frac{|S|\,t}{n} + \eta \right) \right] \le 2 \exp\left( - \frac{\eta^2\, n}{2|S|\,t + \eta n} \right).$$

Setting $\eta = \epsilon t$, Chernoff's bound implies that a uniform random sample $A$ of size $t$ fails to be an $\epsilon$-approximation for a fixed $S \in \mathcal F$ with probability at most $2 \exp\left( - \frac{\epsilon^2 t}{3} \right)$. This fact together with the union bound gives the following trivial bound on $\epsilon$-approximation sizes for any set system.

Theorem B.
Let $(X, \mathcal F)$ be a finite set system and $\epsilon, \gamma > 0$ be given parameters. Then a uniform random sample $A \subseteq X$ of size $\frac{3}{\epsilon^2} \ln \frac{2|\mathcal F|}{\gamma}$ is an $\epsilon$-approximation for $\mathcal F$ with probability at least $1 - \gamma$.

A breakthrough in the study of $\epsilon$-approximations dates back to 1971, when Vapnik and Chervonenkis studied set systems with finite VC-dimension [VC71]. The VC-dimension of $(X, \mathcal F)$, denoted by VC-dim$(X, \mathcal F)$, is the size of the largest $Y \subseteq X$ for which $\mathcal F|_Y = 2^Y$, where $\mathcal F|_Y = \{ Y \cap S : S \in \mathcal F \}$.

Theorem 1 ([VC71]). There exists an absolute constant $c$ such that the following holds. Given a set system $(X, \mathcal F)$ such that $|\mathcal F|_Y| \le (e|Y|/d)^d$ for all $Y \subseteq X$, $|Y| \ge d$, and parameters $0 < \epsilon, \gamma < 1$, a uniform random sample $A \subseteq X$ of size
$$\frac{c}{\epsilon^2} \cdot \left( d \ln \frac{1}{\epsilon} + \ln \frac{1}{\gamma} \right)$$
is an $\epsilon$-approximation for $\mathcal F$ with probability at least $1 - \gamma$.

As VC-dim$(X, \mathcal F) \le d$ implies that $|\mathcal F|_Y| \le (e|Y|/d)^d$ for any $Y \subseteq X$ [Sau72, She72], Theorem 1 also applies to set systems with VC-dim$(X, \mathcal F) \le d$. Theorem 1 immediately implies a bound of $O\left(\frac{d}{\epsilon} \log \frac{1}{\epsilon}\right)$ on the sizes of $\epsilon$-nets for set systems with VC-dim$(X, \mathcal F) \le d$ (see [Cha00, Chapter 4]).
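For concreteness, here is a minimal Python sketch of Theorem B (purely illustrative and not from the original text; the helper name is_eps_approx and the toy interval system are our choices): it draws a uniform sample of the stated size and checks the definition directly.

```python
import math
import random

def is_eps_approx(A, family, n, eps):
    """Check | |S|/n - |A & S|/t | <= eps for every S in the family."""
    t, A = len(A), set(A)
    return all(abs(len(S) / n - len(A & S) / t) <= eps for S in family)

# Toy set system: X = {0, ..., n-1}, F = intervals with endpoints on a coarse grid.
n, eps, gamma = 10_000, 0.1, 0.1
X = range(n)
family = [set(range(i, j)) for i in range(0, n, 500) for j in range(i + 500, n + 1, 500)]

# Theorem B: size (3/eps^2) * ln(2|F|/gamma) suffices with probability >= 1 - gamma.
t = math.ceil(3 / eps ** 2 * math.log(2 * len(family) / gamma))
A = random.sample(X, min(t, n))  # uniform random sample without replacement
print(f"sample size {len(A)}, eps-approximation: {is_eps_approx(A, family, n, eps)}")
```

Note that the sample size depends on $|\mathcal F|$ only logarithmically, which is what the union bound buys; the VC-dimension bounds below remove even that dependence.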
Relative $(\epsilon, \delta)$-approximations. The notion of relative approximations unifies and extends several previous notions. Given a set system $(X, \mathcal F)$ with $n = |X|$ and parameters $0 < \epsilon, \delta < 1$, a set $A$ of size $t$ is a relative $(\epsilon, \delta)$-approximation for $(X, \mathcal F)$ if for all $S \in \mathcal F$,
$$\left| \frac{|S|}{n} - \frac{|A \cap S|}{t} \right| \le \delta \cdot \max\left\{ \frac{|S|}{n}, \epsilon \right\}, \qquad \text{or equivalently,} \qquad |A \cap S| = \frac{|S|\,t}{n} \pm \delta t \max\left\{ \frac{|S|}{n}, \epsilon \right\}.$$
Then the influential result of Li, Long, and Srinivasan [LLS01] can be stated as follows.¹

Theorem 2 ([LLS01]). There exists an absolute constant $c$ such that the following holds. Let $(X, \mathcal F)$ be a set system such that $|\mathcal F|_Y| \le (e|Y|/d)^d$ for all $Y \subseteq X$, $|Y| \ge d$, and let $0 < \delta, \epsilon, \gamma < 1$ be given parameters. Then a uniform random sample $A \subseteq X$ of size
$$\frac{c}{\epsilon \delta^2} \cdot \left( d \ln \frac{1}{\epsilon} + \ln \frac{1}{\gamma} \right)$$
is a relative $(\epsilon, \delta)$-approximation for $(X, \mathcal F)$ with probability at least $1 - \gamma$.

Note that this bound is asymptotically tight. Again, Theorem 2 applies to set systems of VC-dimension $d$. Note also that a relative $(1, \delta)$-approximation is a $\delta$-approximation, and so we recover an improved version of Theorem 1 from Theorem 2 (the bound for this special case was proved earlier by Talagrand [Tal94]).
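In the same illustrative style as the sketch above (again ours, not the paper's), the relative condition differs only in its right-hand side, which scales with the measure of $S$ once $|S|/n \ge \epsilon$:

```python
def is_relative_approx(A, family, n, eps, delta):
    """Check | |S|/n - |A & S|/t | <= delta * max(|S|/n, eps) for every S."""
    t, A = len(A), set(A)
    return all(
        abs(len(S) / n - len(A & S) / t) <= delta * max(len(S) / n, eps)
        for S in family
    )
```

Setting eps = 1 makes the right-hand side equal to delta for every set, recovering the plain $\delta$-approximation check, exactly as noted above.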
Our Results.

The proof of Theorem 1 uses symmetrization, a technique from statistics: to prove that a uniform random sample $A$ satisfies the required properties, one takes another random sample $G$, sometimes called a 'ghost sample'. Properties of $A$ are then proven by comparing it with $G$. Note that $G$ is not used in the algorithm or its construction; it is solely a method of analysis, a 'thought experiment' of sorts. The symmetrization proof is the one given in nearly all texts [KV94, AS12, AB09, DGL96, Cha00, Mat99, Mat02, HP11]², often with the caveat that the idea is ingenious but difficult to understand intuitively (e.g., "one might be tempted to believe that it works by some magic" [Mat02, Section 10.2]). The other known proof, for a statement with sub-optimal dependency on the success probability, uses discrepancy theory and leads to efficient deterministic algorithms for computing approximations (see [Cha00, Mat99]).

The proof of Theorem 2 uses, in addition to symmetrization, the technique of chaining. Chaining here is essentially the idea of doing the analysis by partitioning each $S \in \mathcal F$ into a logarithmic number of smaller sets, each belonging to a distinct 'level'. The number of sets increases with increasing level, while the size of each set decreases. The overall sum turns out to be a geometric series, which then gives the optimal bounds. What makes the proof of Theorem 2 in the original paper [LLS01] difficult is that it combines chaining and symmetrization intricately. All the tail bounds are stated in their 'symmetrized' forms and symmetrization is carried through the entire proof. It is not an easy proof to explain to undergraduate or even graduate students in computer science.

In this paper we give short proofs of Theorems 1 and 2 based on the following two independent, intuitive and easy-to-explain ideas.

¹The original result was stated using the notion of $(\epsilon, \delta)$-samples. As shown in [HS11], these two notions are asymptotically equivalent: an $(\epsilon, \delta)$-sample is a relative $(\epsilon, \delta)$-approximation and a relative $(\epsilon, \delta)$-approximation is an $(\epsilon, \delta)$-sample.

²Also used in teaching; to pick two arbitrary examples, see here for an example from the perspective of statistics/learning and here from the algorithmic side.

Theorem 2 (Section 3).
The idea is to first do a 'pre-processing' step of taking a uniform random sample $A' \subseteq X$ that is an $\frac{\epsilon\delta}{3}$-approximation of $\mathcal F$ with probability $1 - \frac{\gamma}{2}$. By using Theorem 1, we can take $A'$ of size
$$|A'| = \Theta\left( \frac{1}{\epsilon^2 \delta^2} \left( d \ln \frac{1}{\epsilon\delta} + \ln \frac{1}{\gamma} \right) \right).$$
This removes the dependence on $|X|$ and with it, the need for symmetrization. The rest is then a simple application of chaining, similar in this aspect to the original proof in [LLS01].
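In code, the pre-processing is just a composition of two uniform samples. The following sketch (ours, with all constants set to 1, so purely illustrative of the structure) mirrors the two stages:

```python
import math
import random

def two_stage_sample(X, eps, delta, d, gamma):
    """Stage 1 takes a coarse (eps*delta/3)-approximation A1 of X, removing any
    dependence on |X|; stage 2 samples the final set from A1 alone.
    Sizes follow the stated asymptotic bounds with all constants set to 1."""
    # Stage 1: |A1| = Theta( (1/(eps*delta))^2 * (d ln 1/(eps*delta) + ln 1/gamma) ).
    s1 = math.ceil((d * math.log(1 / (eps * delta)) + math.log(1 / gamma)) / (eps * delta) ** 2)
    A1 = random.sample(X, min(s1, len(X)))
    # Stage 2: |A| = Theta( 1/(eps*delta^2) * (d ln 1/eps + ln 1/gamma) );
    # the chaining analysis shows A is a relative (eps, delta)-approximation w.h.p.
    s2 = math.ceil((d * math.log(1 / eps) + math.log(1 / gamma)) / (eps * delta ** 2))
    return random.sample(A1, min(s2, len(A1)))
```

Observe that stage 2 never looks at $X$ again; this is precisely what removes the need for symmetrization in the analysis.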
Theorem 1 (Section 2). While we have decoupled symmetrization and chaining in the proof of Theorem 2, it now uses Theorem 1 as a black-box. For the case where $(X, \mathcal F)$ is a finite set system³, we present a short elementary proof of Theorem 1 without any need of symmetrization. It can be viewed as a simplified and 'compressed' version of the discrepancy-based argument of [MWW93], but with optimal dependency on the success probability.

To see the intuition, observe that since $|\mathcal F| \le (e|X|/d)^d$, the bound of Theorem B depends only on $|X|$: in particular, a random sample $A_1 \subseteq X$ of size $O\left( \frac{1}{\epsilon^2} \ln |X|^d \right) = O\left( \frac{d}{\epsilon^2} \ln |X| \right)$ is an $\epsilon$-approximation. The size of $A_1$ is much smaller than that of $X$, and so applying Theorem B again to $\mathcal F|_{A_1}$ gives an $\epsilon$-approximation $A_2 \subseteq A_1$ for $\mathcal F|_{A_1}$, with
$$|A_2| = O\left( \frac{1}{\epsilon^2} \ln |A_1|^d \right) = O\left( \frac{d}{\epsilon^2} \ln \left( \frac{d}{\epsilon^2} \ln |X| \right) \right) = O\left( \frac{d}{\epsilon^2} \ln \frac{d}{\epsilon} + \frac{d}{\epsilon^2} \ln \ln |X| \right).$$
The size of $A_2$ is again much smaller than that of $A_1$; furthermore, it follows immediately from the definition of $\epsilon$-approximations that $A_2$ is a $(2\epsilon)$-approximation for $\mathcal F$. Repeating this gives the proof of Theorem 1.
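To spell out the composition step just used: if $A'$ is an $\epsilon_1$-approximation for $(X, \mathcal F)$ and $A \subseteq A'$ is an $\epsilon_2$-approximation for $(A', \mathcal F|_{A'})$, then for every $S \in \mathcal F$, since $A \cap S = A \cap (A' \cap S)$,
$$\left| \frac{|S|}{|X|} - \frac{|A \cap S|}{|A|} \right| \le \left| \frac{|S|}{|X|} - \frac{|A' \cap S|}{|A'|} \right| + \left| \frac{|A' \cap S|}{|A'|} - \frac{|A \cap (A' \cap S)|}{|A|} \right| \le \epsilon_1 + \epsilon_2,$$
so $A$ is an $(\epsilon_1 + \epsilon_2)$-approximation for $(X, \mathcal F)$; with $\epsilon_1 = \epsilon_2 = \epsilon$ this is exactly the $(2\epsilon)$ claim above.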
For chaining we will need the following immediate consequence of Theorem 1 (better bounds exist [Hau95]; however the one derived below suffices for our needs).

Lemma 3. Let $\alpha \ge 2$ and let $(X, \mathcal P)$ be an $\alpha$-packing; that is, for any pair $S, S' \in \mathcal P$, the symmetric difference of $S$ and $S'$, denoted by $\Delta(S, S')$, has size at least $\alpha$. Then if $|\mathcal P|_Y| \le (e|Y|/d)^d$ for all $Y \subseteq X$ with $|Y| \ge d$, we have $|\mathcal P| \le \left( \frac{c_2 |X|}{\alpha} \right)^{3d}$, where $c_2$ is a constant depending only on the constant $c$ from Theorem 1.

Proof. Let $\mathcal G = \{ \Delta(S, S') : S, S' \in \mathcal P \}$. For any $Y \subseteq X$, we have $|\mathcal G|_Y| \le |\mathcal P|_Y|^2 \le (e|Y|/d)^{2d}$, and so Theorem 1 applied to $\mathcal G$ gives the existence of an $\frac{\alpha - 1}{|X|}$-approximation $A$ for $\mathcal G$ of size at most $c'\, d \left( \frac{|X|}{\alpha} \right)^3$, for a constant $c'$ depending only on $c$. This implies that $A \cap S \ne A \cap S'$ for any distinct $S, S' \in \mathcal P$, since
$$\left| \Delta(S, S') \cap A \right| \ge |\Delta(S, S')| \cdot \frac{|A|}{|X|} - (\alpha - 1) \frac{|A|}{|X|} \ge \alpha \frac{|A|}{|X|} - (\alpha - 1) \frac{|A|}{|X|} > 0,$$
and so $|\mathcal P| = |\mathcal P|_A| \le \left( \frac{e |A|}{d} \right)^d \le \left( \frac{e c' |X|^3}{\alpha^3} \right)^d \le \left( \frac{c_2 |X|}{\alpha} \right)^{3d}$.

Proof of Theorem 1. Let $T(\epsilon, \gamma)$ be such that a uniform random sample of size $T(\epsilon, \gamma)$ from $X$ is an $\epsilon$-approximation for $\mathcal F$ with probability at least $1 - \gamma$. Our goal is to show that $T(\epsilon, \gamma) \le \frac{c}{\epsilon^2} \cdot \left( d \ln \frac{1}{\epsilon} + \ln \frac{1}{\gamma} \right)$, where $c$ is a large-enough absolute constant.

³This is typically the case in its use in algorithms, computational geometry, etc. The infinite case can usually be reduced to the finite case by a sufficiently fine grid, see [MWW93].

When $\epsilon \le 1/\sqrt{|X|}$, we have $|X| \le \frac{1}{\epsilon^2}$, and so taking the whole set $A = X$ gives a trivial $\epsilon$-approximation of the required size; this is the base of the induction. Otherwise, by induction on $\epsilon$: a random sample $A' \subseteq X$ of size $T\left(\frac{\epsilon}{2}, \frac{\gamma}{2}\right)$ is an $\frac{\epsilon}{2}$-approximation for $\mathcal F$ with probability at least $1 - \frac{\gamma}{2}$. By Theorem B, let $A$ be a random sample of $A'$ that is an $\frac{\epsilon}{2}$-approximation for $\mathcal F|_{A'}$ with probability $1 - \frac{\gamma}{2}$. Thus $A$ is a uniform random sample of $X$ that is an $\epsilon$-approximation for $\mathcal F$ with probability at least $1 - \gamma$, implying the recurrence
$$T(\epsilon, \gamma) \le |A| = 3 \left( \frac{\epsilon}{2} \right)^{-2} \ln \frac{2\, |\mathcal F|_{A'}|}{\gamma/2} \le \frac{12}{\epsilon^2} \ln\left( \frac{4}{\gamma} \left( \frac{e\, T(\epsilon/2, \gamma/2)}{d} \right)^{d} \right).$$
The required bound on $T(\epsilon, \gamma)$ follows inductively: substituting $T\left(\frac{\epsilon}{2}, \frac{\gamma}{2}\right) \le \frac{4c}{\epsilon^2} \left( d \ln \frac{2}{\epsilon} + \ln \frac{2}{\gamma} \right)$ and using $\ln(a + b) \le \ln a + \frac{b}{a}$,
$$T(\epsilon, \gamma) \le \frac{12}{\epsilon^2} \ln\left( \frac{4}{\gamma} \left( \frac{4ec}{\epsilon^2} \left( \ln \frac{2}{\epsilon} + \frac{1}{d} \ln \frac{2}{\gamma} \right) \right)^{d} \right) \le \frac{12}{\epsilon^2} \left( \ln \frac{4}{\gamma} + d \ln \frac{4ec}{\epsilon^2} + d \ln \ln \frac{2}{\epsilon} + \frac{\ln(2/\gamma)}{\ln(2/\epsilon)} \right) \le \frac{c}{\epsilon^2} \left( d \ln \frac{1}{\epsilon} + \ln \frac{1}{\gamma} \right)$$
for a large-enough $c$.

Proof of Theorem 2. W.l.o.g. we assume that $\epsilon, \delta, \gamma \le \frac{1}{4}$. Set $c_1 = 10c$, $k = \left\lceil \log_2 \frac{1}{\delta} \right\rceil$, and $\epsilon_i = \sqrt{\frac{i+1}{2^i}}\, \epsilon$.

Passing from $X$ to $A'$. By Theorem 1, let $A' \subseteq X$ be a random sample that is an $\frac{\epsilon\delta}{3}$-approximation of $\mathcal F$ with probability $1 - \frac{\gamma}{2}$. We will show that a random sample $A \subseteq A'$ of the required size $t = \frac{c_1}{\epsilon \delta^2} \ln \frac{3}{\epsilon^d \gamma}$ is a relative $\left(\epsilon, \frac{\delta}{2}\right)$-approximation of $\mathcal F|_{A'}$ with probability $1 - \frac{\gamma}{2}$. This would complete the proof, since then for any $S \in \mathcal F$,
$$\left| \frac{|S|}{|X|} - \frac{|A \cap S|}{|A|} \right| \le \left| \frac{|S|}{|X|} - \frac{|A' \cap S|}{|A'|} \right| + \left| \frac{|A' \cap S|}{|A'|} - \frac{|A \cap S|}{|A|} \right| \le \frac{\epsilon\delta}{3} + \frac{\delta}{2} \max\left\{ \frac{|A' \cap S|}{|A'|}, \epsilon \right\} \le \frac{\epsilon\delta}{3} + \frac{\delta}{2} \max\left\{ \frac{|S|}{|X|} + \frac{\epsilon\delta}{3}, \epsilon \right\} \le \delta \max\left\{ \frac{|S|}{|X|}, \epsilon \right\}.$$
To avoid additional variables, we will simply assume that $X = A'$, and so by Theorem 1,
$$n = |X| \le \frac{9c}{\epsilon^2 \delta^2} \left( d \ln \frac{3}{\epsilon\delta} + \ln \frac{2}{\gamma} \right) \le \frac{27c\, d}{\epsilon^3 \delta^3} \left( 1 + \ln \frac{2}{\gamma} \right) \le \frac{27c\, d}{\epsilon^3 \delta^3} \cdot \frac{2}{\gamma}, \qquad |\mathcal F| \le \left( \frac{e |X|}{d} \right)^{d} \le \left( \frac{54\, e\, c}{\epsilon^3 \delta^3 \gamma} \right)^{d}.$$
Levels and their approximations. For $i \in [0, k]$, let $\mathcal P_i$ be a maximal $\frac{\epsilon n}{2^i}$-packing of $(X, \mathcal F)$ such that $\mathcal P_i \subseteq \mathcal P_{i+1}$, and set $\mathcal P_{k+1} = \mathcal F$. Note that for any $S \in \mathcal P_{i+1} \setminus \mathcal P_i$ there exists a set $F_S \in \mathcal P_i$ such that $|\Delta(S, F_S)| < \frac{\epsilon n}{2^i}$. Define
$$\mathcal A_i = \{ S \setminus F_S : S \in \mathcal P_{i+1} \setminus \mathcal P_i \} \qquad \text{and} \qquad \mathcal B_i = \{ F_S \setminus S : S \in \mathcal P_{i+1} \setminus \mathcal P_i \},$$
with
$$|\mathcal A_i|, |\mathcal B_i| \le |\mathcal P_{i+1}| \le \left( \frac{c_2 \cdot 2^{i+1}}{\epsilon} \right)^{3d} \quad \text{for } i \le k - 1 \quad \big(\text{by Lemma 3}\big)$$
and $|S| < \frac{\epsilon n}{2^i}$ for all $S \in \mathcal A_i \cup \mathcal B_i$.
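Such maximal nested packings always exist and are easy to construct greedily; a minimal sketch (ours, quadratic-time and purely illustrative) that builds $\mathcal P_0 \subseteq \cdots \subseteq \mathcal P_k$ by scanning $\mathcal F$ once per level with a shrinking distance threshold:

```python
def symdiff_size(S, T):
    return len(S ^ T)  # size of the symmetric difference of two Python sets

def nested_packings(family, n, eps, k):
    """Greedily build P_0 <= P_1 <= ... <= P_k, where P_i is a maximal
    (eps*n/2^i)-packing: kept sets are pairwise at least the threshold apart,
    and every other set is within the threshold of some kept set."""
    packings, P = [], []
    for i in range(k + 1):
        alpha = eps * n / 2 ** i  # threshold shrinks with the level
        for S in family:
            if all(symdiff_size(S, T) >= alpha for T in P):
                P.append(S)
        packings.append(list(P))  # P only grows, hence the nesting
    return packings
```

Since the threshold only decreases, sets kept at coarser levels automatically remain valid at finer ones, which gives the nesting $\mathcal P_i \subseteq \mathcal P_{i+1}$ for free.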
Claim 4. With probability $1 - \frac{\gamma}{2}$, $A$ is simultaneously (i) a relative $(\epsilon, \delta)$-approximation for $\mathcal A_k \cup \mathcal B_k$, and (ii) a relative $(\epsilon_i, \delta)$-approximation for $\mathcal A_i \cup \mathcal B_i$ for all $i \in [0, k-1]$, and (iii) a relative $(\epsilon, \delta)$-approximation for $\mathcal P_0$.

Proof. (i) Each set in $\mathcal A_k \cup \mathcal B_k$ has size at most $\frac{\epsilon n}{2^k} \le \epsilon n \delta \le \epsilon n$, and so by Theorem A with $\eta = \delta t \epsilon$, $A$ fails to be a relative $(\epsilon, \delta)$-approximation for $\mathcal A_k \cup \mathcal B_k$ with probability at most
$$|\mathcal F| \cdot 2 \exp\left( - \frac{(\delta t \epsilon)^2\, n}{2 \epsilon n \delta\, t + \delta t \epsilon\, n} \right) \le |\mathcal F| \cdot 2 \exp\left( - \frac{\delta t \epsilon}{3} \right) = |\mathcal F| \cdot 2 \left( \frac{\epsilon^d \gamma}{3} \right)^{\frac{c_1}{3\delta}} \le \left( \frac{54ec}{\epsilon^3 \delta^3 \gamma} \right)^{d} \cdot 2 \left( \frac{\epsilon^d \gamma}{3} \right)^{\frac{c_1}{3\delta}} \le \frac{\gamma}{6}.$$

(ii) Each set $S \in \mathcal A_i \cup \mathcal B_i$ has size at most $\frac{\epsilon n}{2^i} \le \epsilon_i n$, and so by Theorem A with $\eta = \delta t \epsilon_i$,
$$\Pr\big[\, A \text{ is not an } (\epsilon_i, \delta)\text{-approx. for } S \,\big] \le 2 \exp\left( - \frac{\delta^2 t\, \epsilon_i^2\, n}{2\, \frac{\epsilon n}{2^i} + \delta \epsilon_i n} \right) = 2 \exp\left( - \frac{(i+1)\, \delta^2 t\, \epsilon}{2 + \delta \sqrt{(i+1)\, 2^i}} \right) \le 2 \exp\left( - \frac{c_1 (i+1)}{3} \ln \frac{3}{\epsilon^d \gamma} \right),$$
using $\delta \sqrt{(i+1)\, 2^i} \le \sqrt{\delta k} \le 1$ for $i \le k - 1$. The probability that $A$ fails to be a relative $(\epsilon_i, \delta)$-approximation for some set of $\mathcal A_i \cup \mathcal B_i$ is at most
$$\sum_{i=0}^{k-1} |\mathcal A_i \cup \mathcal B_i| \cdot 2 \left( \frac{\epsilon^d \gamma}{3} \right)^{\frac{c_1 (i+1)}{3}} \le \sum_{i=0}^{k-1} 4 \left( \frac{c_2\, 2^{i+1}}{\epsilon} \right)^{3d} \left( \frac{\epsilon^d \gamma}{3} \right)^{\frac{c_1 (i+1)}{3}} \le \frac{\gamma}{12} \sum_{i=0}^{\infty} (2\epsilon)^{i d} \le \frac{\gamma}{6}.$$

(iii) The probability of failure for any set $S \in \mathcal P_0$, by Theorem A with $\eta = \delta t \max\left\{ \frac{|S|}{n}, \epsilon \right\}$, is at most
$$\sum_{S \in \mathcal P_0} 2 \exp\left( - \frac{\delta^2 t \left( \max\left\{ \frac{|S|}{n}, \epsilon \right\} \right)^2 n}{2|S| + \delta \max\left\{ \frac{|S|}{n}, \epsilon \right\} n} \right) \le |\mathcal P_0| \cdot 2 \exp\left( - \frac{\epsilon \delta^2 t}{3} \right) \le \left( \frac{c_2}{\epsilon} \right)^{3d} \cdot 2 \left( \frac{\epsilon^d \gamma}{3} \right)^{\frac{c_1}{3}} \le \frac{\gamma}{6}.$$
Chaining. Let $S \in \mathcal F$. There exists a set $S_k \in \mathcal P_k$ with $A_k = S \setminus S_k \in \mathcal A_k$ and $B_k = S_k \setminus S \in \mathcal B_k$ such that $S = (S_k \setminus B_k) \cup A_k$. Similarly one can write $S_k$ in terms of $S_{k-1} \in \mathcal P_{k-1}$, $A_{k-1} \in \mathcal A_{k-1}$, $B_{k-1} \in \mathcal B_{k-1}$, and so on until we reach $S_0 \in \mathcal P_0$. Thus using Claim 4 (i), (ii) and (iii), we get that with probability at least $1 - \frac{\gamma}{2}$,
$$\left| \frac{|S|}{n} - \frac{|A \cap S|}{t} \right| = \left| \frac{|S_k|}{n} - \frac{|B_k|}{n} + \frac{|A_k|}{n} - \left( \frac{|A \cap S_k|}{t} - \frac{|A \cap B_k|}{t} + \frac{|A \cap A_k|}{t} \right) \right| \overset{(i)}{\le} \left| \frac{|S_k|}{n} - \frac{|A \cap S_k|}{t} \right| + \delta \max\left\{ \epsilon, \frac{|A_k|}{n} \right\} + \delta \max\left\{ \epsilon, \frac{|B_k|}{n} \right\} = \left| \frac{|S_k|}{n} - \frac{|A \cap S_k|}{t} \right| + 2\delta\epsilon \le \cdots \overset{(ii)}{\le} \left| \frac{|S_0|}{n} - \frac{|A \cap S_0|}{t} \right| + 2\delta \sum_{j=0}^{k-1} \epsilon_j + 2\delta\epsilon \overset{(iii)}{\le} \delta \max\left\{ \epsilon, \frac{|S_0|}{n} \right\} + 14\delta\epsilon \le \delta\, \frac{|S|}{n} + 16\delta\epsilon \le 2\delta \max\left\{ \frac{|S|}{n}, 16\epsilon \right\},$$
where the second-to-last step uses the fact that $|S_0| \le |S| + \sum_{j=0}^{k} |B_j| \le |S| + \sum_{j=0}^{\infty} \frac{\epsilon n}{2^j} \le |S| + 2\epsilon n$. Therefore $A$ is a relative $(16\epsilon, 2\delta)$-approximation of $\mathcal F|_{A'}$. Repeating the same arguments with $\delta' = \delta/4$ and $\epsilon' = \epsilon/16$, we get a relative $\left(\epsilon, \frac{\delta}{2}\right)$-approximation of $\mathcal F|_{A'}$, as required.
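Remark. The series controlling the chaining error converges geometrically (a short computation, with the constants as above):
$$\sum_{j=0}^{\infty} \epsilon_j = \epsilon \sum_{j=0}^{\infty} \sqrt{\frac{j+1}{2^j}} \le 6\epsilon,$$
since the ratio of consecutive terms, $\sqrt{\frac{j+2}{2(j+1)}}$, is at most $\frac{\sqrt 3}{2}$ from $j = 1$ on. This is what keeps the total chaining error at $O(\delta\epsilon)$, rather than the $O\left(\delta\epsilon \log \frac{1}{\delta}\right)$ that a naive summation over the $k$ levels would give.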
References

[AB09] M. Anthony and P. L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 2009.
[AS12] N. Alon and J. Spencer. The Probabilistic Method. John Wiley, 2012.
[Cha00] B. Chazelle. The Discrepancy Method: Randomness and Complexity. Cambridge University Press, New York, NY, USA, 2000.
[DGL96] L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Springer, Berlin, 1996.
[Hau95] D. Haussler. Sphere packing numbers for subsets of the Boolean n-cube with bounded Vapnik-Chervonenkis dimension. J. Comb. Theory, Ser. A, 69(2):217–232, 1995.
[HP11] S. Har-Peled. Geometric Approximation Algorithms. American Mathematical Society, Boston, MA, USA, 2011.
[HS11] S. Har-Peled and M. Sharir. Relative (p, ε)-approximations in geometry. Discrete & Computational Geometry, 45(3):462–496, 2011.
[KV94] M. J. Kearns and U. V. Vazirani. An Introduction to Computational Learning Theory. MIT Press, 1994.
[LLS01] Y. Li, P. M. Long, and A. Srinivasan. Improved bounds on the sample complexity of learning. J. Comput. Syst. Sci., 62(3):516–527, 2001.
[Mat99] J. Matoušek. Geometric Discrepancy: An Illustrated Guide. Springer, 1999.
[Mat02] J. Matoušek. Lectures on Discrete Geometry. Springer-Verlag, New York, NY, 2002.
[MWW93] J. Matoušek, E. Welzl, and L. Wernisch. Discrepancy and approximations for bounded VC-dimension. Combinatorica, 13(4):455–466, 1993.
[Sau72] N. Sauer. On the density of families of sets. Journal of Combinatorial Theory, Series A, 13:145–147, 1972.
[She72] S. Shelah. A combinatorial problem; stability and order for models and theories in infinitary languages. Pacific Journal of Mathematics, 41:247–261, 1972.
[Tal94] M. Talagrand. Sharper bounds for Gaussian and empirical processes. Annals of Probability, 22:28–76, 1994.
[VC71] V. Vapnik and A. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16(2):264–280, 1971.