[PDF] Decision list compression by mild random restrictions

Abstract

A decision list is an ordered list of rules. Each rule is specified by a term, which is a conjunction of literals, and a value. Given an input, the output of a decision list is the value corresponding to the first rule whose term is satisfied by the input. Decision lists generalize both CNFs and DNFs, and have been studied both in complexity theory and in learning theory. The size of a decision list is the number of rules, and its width is the maximal number of variables in a term. We prove that decision lists of small width can always be approximated by decision lists of small size, where we obtain sharp bounds. This in particular resolves a conjecture of Gopalan, Meka and Reingold (Computational Complexity, 2013) on DNF sparsification. An ingredient in our proof is a new random restriction lemma, which allows to analyze how DNFs (and more generally, decision lists) simplify if a small fraction of the variables are fixed. This is in contrast to the more commonly used switching lemma, which requires most of the variables to be fixed.

Full PDF

aa r X i v : . [ c s . CC ] F e b Decision list compression by mild random restrictions

Shachar Lovett ∗ Computer Science DepartmentUniversity of California, San Diego [email protected]

Kewen WuSchool of EECSPeking University, Beijing shlw [email protected]

Jiapeng Zhang † School of Engineering and Applied ScienceHarvard University [email protected]

February 19, 2020

Abstract

A decision list is an ordered list of rules. Each rule is speciﬁed by a term, which is aconjunction of literals, and a value. Given an input, the output of a decision list is the valuecorresponding to the ﬁrst rule whose term is satisﬁed by the input. Decision lists generalizeboth CNFs and DNFs, and have been studied both in complexity theory and in learning theory.The size of a decision list is the number of rules, and its width is the maximal number ofvariables in a term. We prove that decision lists of small width can always be approximated bydecision lists of small size, where we obtain sharp bounds. This in particular resolves a conjectureof Gopalan, Meka and Reingold (Computational Complexity, 2013) on DNF sparsiﬁcation.An ingredient in our proof is a new random restriction lemma, which allows to analyze howDNFs (and more generally, decision lists) simplify if a small fraction of the variables are ﬁxed.This is in contrast to the more commonly used switching lemma, which requires most of thevariables to be ﬁxed.

Decision lists are a model to represent boolean functions, ﬁrst introduced by Rivest [24]. A decisionlist is given by a list of rules ( C , v ) , . . . , ( C m , v m ). A rule is composed of a condition, given by aterm C i , which is a conjunction of literals (variables or their negations); and an output value v i insome set V . A decision list computes a function f : { , } n → V as follows: If C ( x ) = True then output v , else if C ( x ) = True then output v ,. . . , else if C m ( x ) = True then output v m .The last rule is the default value , where we assume that C m ≡ True .Decision lists generalize both CNFs and DNFs. For example, a DNF is a decision list with v = · · · = v m − = 1 and v m = 0, and a CNF is a decision list with v = · · · = v m − = 0 ∗ Research supported by NSF award 1614023. † Research Supported by NSF grant CCF-1763299 and Salil Vadhan’s Simons Investigator Award. v m = 1. It can be shown that decision lists are a strict generalization of both DNFs andCNFs [17, 24]. Following Rivest’s original work, decision lists have been studied both in complexitytheory [2, 5, 6, 8, 11, 18, 26] and in learning theory [3, 7, 12, 15, 20, 27, 28]. Complexity measures of decision lists.

There are two natural complexity measures of decisionlists: size and width . Let L = (( C i , v i )) i ∈ [ m ] be a decision list. Its size is the number of rules in it(namely m ), and its width is the maximal number of variables in a term C i . Decision list approximation.

A decision list

L ε -approximates another decision list L ′ if thetwo agree on a (1 − ε ) fraction of the inputs. It is straightforward to see that small-size decision listscan be approximated by small-width decision lists, by removing rules of large width. Concretely, adecision list of size m can be ε -approximated by a decision list of width w = log( m/ε ), simply byremoving all rules with terms of width more than w . The reverse direction is the main focus of thiswork. We prove the following result, which provides sharp bounds on approximating small-widthdecision lists by small-size decision lists. Theorem 1.1 (Main result) . Let w ≥ , ε > . Any width- w decision list L can be ε -approximatedby a decision list L ′ of width w and size s = (cid:0) w log ε (cid:1) O ( w ) . Moreover, L ′ is a sub-decision listof L , obtained by keeping s rules in L and removing the rest. The bound on s is optimal, up to theunspeciﬁed constant in the O ( w ) term. The proof of Theorem 1.1 appears in Section 2. We note that the size bound can be simpliﬁed,depending on whether the required error ε is below or above 2 − w : (cid:18) w log 1 ε (cid:19) O ( w ) = ( O ( w ) ε ≥ − w (cid:0) w log ε (cid:1) O ( w ) ε ≤ − w . In both cases, the bound we obtain is sharp, up to the unspeciﬁed constant in the O ( w ) term. Wegive examples demonstrating this in Section 3. Random restrictions are an essential ingredient of the proof of Theorem 1.1. H˚astad’s switchinglemma [4, 13, 22] is based on the fact that small-width DNFs simplify under random restrictions.More concretely, a random restriction that ﬁxes a 1 − O (1 /w ) fraction of the inputs simpliﬁes awidth- w DNF to a small-depth decision tree. In this work, we study random restrictions where asmall constant fraction of the variables is ﬁxed.A good example to keep in mind is the TRIBES function: a read-once DNF with 2 w termsof width w on disjoint variables. The TRIBES function does not simplify signiﬁcantly under arandom restriction, unless one really ﬁxes a 1 − O (1 /w ) fraction of the inputs. For example, if werandomly ﬁx 50% of the inputs, say, then the TRIBES function simpliﬁes to what is essentially asmaller TRIBES function (more formally, it simpliﬁes with high probability to a read-once DNF ofwidth Ω( w )). However, we show that this is in essence the worst possible example.The following lemma is a special case of Lemma 2.12 applied to DNFs (the full lemma dealswith decision lists). Given a DNF f : { , } n → { , } , let ρ ∈ { , , ∗} n be a restriction, and let f ↾ ρ be the restricted DNF. Clearly, some terms in f might become redundant in f ↾ ρ . For example,they could be false, or they could be implied by other terms. A term that is not redundant is called useful . We show that after ﬁxing even a small fraction of the variables (say, 1%), a width- w DNFsimpliﬁes to have at most 2 O ( w ) useful terms, and hence cannot be “too complicated”.2 emma 1.2 (DNFs simplify after mild random restrictions) . Let f be a width- w DNF, and let f ↾ ρ be a restriction of f obtained by restricting each variable with probability β , where the restrictedvariables take values 0 and 1 with equal probability. Then the expected number of useful terms in f ↾ ρ is at most (4 /β ) w . We discuss some applications of Theorem 1.1 below.

This decision list compression problem is a natural generalization of the

DNF sparsiﬁcation problem,introduced by Gopalan, Meka and Reingold [10] as a means to obtain pseudorandom generatorsfooling small-width DNFs. Their main structural result can be summarized as follows.

Theorem 1.3 ( [10]) . Any width- w DNF can be ε -approximated by a DNF of width w and size ( w log(1 /ε )) O ( w ) . They conjectured that a better bound is possible.

Conjecture 1.4 ( [10]) . Any width- w DNF can be ε -approximated by a DNF of width w and size s ( w, ε ) , where: • Weak version: s ( w, ε ) = c ( ε ) w for some function c . • Strong version: s ( w, ε ) = (log(1 /ε )) O ( w ) . The weak version was resolved by Lovett and Zhang [19], where they showed that c ( ε ) =(1 /ε ) O (1) suﬃces. Our main result, Theorem 1.1, veriﬁes the strong version of their conjecture (andin fact, proves a sharper bound than the one conjectured). Corollary 1.5 (This work) . Any width- w DNF can be ε -approximated by a DNF of width w andsize (cid:0) w log ε (cid:1) O ( w ) . We remark that Corollary 1.5 is also tight, up to the unspeciﬁed constant in the O ( w ) term.The proof is very similar to the proof in Section 3 that Theorem 1.1 is tight. We sketch the proofhere: • For 2 − w ≤ ε ≤ /

3, Claim 3.1 shows the existence of a function f : { , } w → { , } thatcannot be (1 / w and size O (2 w /w ). In particular, f cannot be approximated by a DNF of width w and size O (2 w /w ). Note that f can triviallybe computed by a DNF of width w and size 2 w , and that 2 Ω( w ) = (cid:0) w log ε (cid:1) Ω( w ) in thisregime. • For ε ≤ − w , consider exactly computing the Threshold- w function on log(1 /ε ) variables,which amounts to approximation with any error < ε . This requires a width- w DNF of size (cid:0) log(1 /ε ) w (cid:1) = (cid:0) w log ε (cid:1) Ω( w ) . 3 .2.2 Junta theorem A k -junta is a function depending on at most k variables. Friedgut’s junta theorem [9] shows thatboolean functions of small inﬂuence can be approximated by juntas. For the relevant deﬁnitionssee for example [21]. Theorem 1.6 (Friedgut’s junta theorem [9]) . Let f : { , } n → { , } be a boolean function withtotal inﬂuence I . Then for any ε > , f can be ε -approximated by a k -junta for k = 2 O ( I/ε ) . It is well known that width- w DNFs have total inﬂuence I = O ( w ), which implies by The-orem 1.6 that width- w DNFs can be ε -approximated by 2 O ( w/ε ) -juntas. Since a width- w size- s decision list is a ( sw )-junta, as a corollary of Theorem 1.1, we improve the bound, and generalizeit to decision lists. Corollary 1.7 (This work) . Any width- w decision list can be ε -approximated by a k -junta for k = (cid:0) w log ε (cid:1) O ( w ) . This improves previous bounds, even when restricted to DNFs or CNFs. By combining theresults in [10, 19] one gets the bound k = min { w log(1 /ε ) , /ε } O ( w ) for width- w DNFs or CNFs. Itcan be veriﬁed that our new result is indeed better; for example for ε = w − w we obtain (log w ) O ( w ) instead of w O ( w ) . It is also worthwhile noting that the result of [19], which obtained the bound(1 /ε ) O ( w ) , can be extended to decision lists with minimal changes. A class of boolean functions is said to be ( ε, δ )-PAC learnable using q queries if there exists alearning algorithm that, given query access to an unknown function in the class, returns withprobability (1 − δ ) a function which ε -approximates the unknown function, while making at most q queries. In our context we consider membership queries, where the learning algorithm can querythe value of the unknown function on any chosen input.A celebrated result of Jacskson [14] shows that polynomial-size DNFs can be PAC learned underthe uniform distribution using membership queries. Theorem 1.8 (Jackson’s harmonic sieve [14]) . The class of n -variate DNFs of size s is ( ε, δ ) -PAClearnable under the uniform distribution with q = poly ( s, n, /ε, log(1 /δ )) membership queries. Using Theorem 1.1, we can extend Jackson’s result to small-width DNFs. Note that the DNFsparsiﬁcation bound from [10, 19] also works here, if we replace the bound on s with their corre-sponding bound. Corollary 1.9 (This work) . The class of n -variate DNFs of width w is ( ε, δ ) -PAC learnableunder the uniform distribution with q = poly ( s, n, /ε, log(1 /δ )) membership queries, where s = (cid:0) w log ε (cid:1) O ( w ) .Proof Sketch. Jackson’s algorithm combines a weak learner based on Fourier analysis and a boostingalgorithm that converts this weak learner to a strong learner. Let f ( x ) be the target DNF thatwe are trying to learn. The weak learner solves the following problem: given a distribution D on { , } n , output a set S such that the parity χ S ( x ) = L i ∈ S x i is correlated with f under thedistribution D . Initially D is the uniform distribution, but the boosting algorithm keeps adapting D to focus on inputs where it made many mistakes.In Jackson’s algorithm, the existence of such S is shown by observing that for a size- s DNF, atleast one of the terms must be 1 /s correlated to the function; and each term’s contribution can be4ttributed to the parities supported on it. For width- w terms, this leads to at most a 2 − w decreasein the correlation.Assume now that f ( x ) is a width- w DNF with too many terms, so we cannot apply the previousargument directly. Apply Theorem 1.1 with error γ (to be determined soon), to obtain an approx-imate width- w DNF g ( x ) which γ -approximates f ( x ), where g has at most s = (cid:16) w log γ (cid:17) O ( w ) terms. Crucially, we obtain g ( x ) by removing some of the terms in f ( x ), and hence g ( x ) ≤ f ( x )for all inputs x . In particular, Pr x ∼ D [ f ( x ) = 1] ≥ Pr x ∼ D [ g ( x ) = 1].Assume that we know that the distribution D is not too far from uniform. Concretely, that D ( x ) ≤ K − n for some parameter K . This implies thatPr x ∼ D [ f ( x ) = 1] ≤ Pr x ∼ D [ g ( x ) = 1] + γK. We will choose γ = 1 / K . We may assume that Pr x ∼ D [ f ( x ) = 1] ∈ [1 / , / f under D . Thus Pr x ∼ D [ g ( x ) = 1] ∈ [1 / , / C of g which isΩ(1 /s )-correlated with g . One can verify that as g ( x ) ≤ f ( x ), C is also Ω(1 /s )-correlated with f .Finally, we need to bound K . It is known (see for example [16]) that boosting algorithms canbe restricted to have K = ε − O (1) , which completes the proof. We give a high-level overview of the proof of Theorem 1.1. Let L = (( C i , v i )) be a decision list ofwidth w and size m . General Framework.

Given a subset J ⊂ [ m ], we denote by L | J the decision list restricted tothe rules in J , where we delete the rest. Our goal is to ﬁnd a small subset J ⊂ [ m ] such that L | J approximates L . We say that a rule ( C i , v i ) of L is hit by an input x if C i ( x ) = 1 and C j ( x ) = 0for j < i ; in this case, L ( x ) = v i . The main intuition underlying our approach is: If a rule is rarely hit by random inputs, then we can safely remove it.

Armed with this intuition, our approach is to choose J to be the set of rules with the highestprobability of being hit. We show that in order to get an ε -approximation, it suﬃces to keep thetop (cid:0) w log ε (cid:1) O ( w ) rules.Our general approach follows that of Lovett and Zhang [19]. They combined two central resultsin the analysis of boolean functions: random restrictions and noise stability . The main innovationin the current work is that we apply random restrictions that ﬁx only a small fraction of the inputs;this is in contrast to the common use of random restrictions, such as in the proof of H˚astad’sswitching lemma [13], where most variables are ﬁxed. The ability to handle random restrictionswhich ﬁx only a small fraction is what allows us to obtain improved bounds. Mild random restrictions.

An index i ∈ [ m ] is said to be useful if there exists an assignment x such that the evaluation of L ( x ) hits the i -th rule (and hence outputs v i ). We denote the numberof useful indices in L by L ). This notion is natural, as we can always discard rules if noassignment hits them. The main point is that restrictions can render some rules in a decision listuseless. Let ρ be a random restriction that keeps each variable alive with probability α . We show5hat on average, the restricted decision list L ↾ ρ has a small number of useful indices: E ρ [ L ↾ ρ )] ≤ (cid:18) − α (cid:19) w . The proof is based on an encoding argument. Let ρ be a restriction for which L ↾ ρ has T usefulindices. Let t ∈ [ T ] be uniformly chosen. We construct a new restriction ρ ′ by further restrictingthe variables in the t -th useful rule so that this rule is satisﬁed. Then from ρ ′ and some smalladditional information a , we can recover both ρ and t . This shows that the probability of T beingtoo large is very low, as the entropy of ( ρ ′ , a ) is much lower than that of ( ρ, t ). Noise Stability.

Since there is no guarantee about the value on each rule of the decision list,it is convenient to consider the following index function. Let L = (( C i , v i )) i ∈ [ m ] be a decision liston n variables. The index function of L outputs for an input x the index i of the ﬁrst term in L satisﬁed by x . Equivalently, Ind L is given by the decision list Ind L = (( C i , i )) i ∈ [ m ] .We make two important deﬁnitions. What we want to analyze are the quantities p L ( i ) := Pr x [Ind L ( x ) = i ] , where x is taken from the uniform distribution of the input. In particular, we want to show thatthere is a small set of indices J such that P i ∈ J p L ( i ) ≥ − ε . What we can analyze using randomrestrictions are the quantities q L ( α, i ) = Pr ρ [index i is useful in L ↾ ρ ] , since it holds that X i q L ( α, i ) = E ρ [ L ↾ ρ )] ≤ (cid:18) − α (cid:19) w . We use noise stability to bridge between the two.Let β = 1 − α . For any x ∈ { , } n , the noise distribution y ∼ N β ( x ) is sampled by takingPr [ y i = x i ] = β independently for i ∈ [ n ]. Consider sampling x ∈ { , } n uniformly and y ∼N β ( x ). We can equivalently sample the pair ( x, y ) by ﬁrst sampling a common restriction ρ ,where each variables stays alive with probability α , and then sample its completion for x and y independently. Let Stab L ( β, i ) := Pr x,y [Ind L ( x ) = Ind L ( y ) = i ] . We show that p L ( i ) and q L ( α, i ) are both polynomially related, by relating them to Stab L ( β, i ): p L ( i ) q L (1 − β, i ) ≤ Stab L ( β, i ) ≤ p L ( i ) β . The upper bound is proven by hypercontrativity, and the lower bound by a somewhat delicateCauchy-Schwarz inequality. This allows us to obtain that p L ( i ) ≤ q L (1 − β, i ) β β . Finally, we put everything together by optimizing the value of β .6 elated works. We already discussed the works of Gopalan, Meka and Reingold [10] and Lovettand Zhang [19] which gave weaker bounds for DNF sparsiﬁcation than those in Theorem 1.1.There have been previous works studying how small-width DNFs simplify under mild randomrestrictions that ﬁx a small fraction of the variables (say, 1%). Segerlind, Buss and Impagliazzo’swork [25], improved by Razborov [23], show that width- w DNFs simplify to a decision tree of depth2 O ( w ) . We obtain bounds on size (namely, number of useful terms) in Theorem 1.1, which arebetter than bounds on depth. However, we only bound the ﬁrst moment (that is, expected numberof useful terms), while [23] bounds higher moments as well. So to some extent, the results areincomparable. We believe that with some further work, one can improve our techniques to obtainbounds on higher moments as well (this was unnecessary for the current work). Finally, it is alsoworthwhile to mention the work by the authors and Alweiss [1], where mild random restrictions(of a somewhat diﬀerent ﬂavor) were used to obtain improved bounds for the sunﬂower lemma incombinatorics. Paper Organization.

In Section 2, we prove the upper bound on decision list compression. InSection 3, we give the lower bounds to show the tightness of our result.

Acknowledgements.

We thank Ben Rossman for invaluable discussions. We also thank RyanAlweiss and the anonymous reviewers for helpful suggestions on an earlier version of this paper.

We start by make some deﬁnitions formal. We denote [ n ] = { , , . . . , n } , variables are x , . . . , x n ,and literals are x , ¬ x , . . . , x n , ¬ x n . A term is a conjunction of literals. Deﬁnition 2.1 (Decision list) . A width- w size- m decision list is a list L = (( C i , v i )) i ∈ [ m ] of rules.A rule is a pair ( C i , v i ) , where C i is a term containing at most w literals and each v i is a value insome ﬁnite set V . We assume C m ≡ , and ( C m , v m ) is the ﬁnal default rule.For any J ⊆ [ m ] with m ∈ J , we denote by L | J = (( C j , v j )) j ∈ J the restriction of L to the rulesin J , where elements of J are taken in ascending order. The evaluation of L given assignment x is to ﬁnd the ﬁrst index i such that C i ( x ) = 1 and thento output L ( x ) = v i . We make additional remarks for the decision list to avoid potential pitfalls. • If m / ∈ J , we will consider L | J invalid, as it does not have a default rule at the end. • No variable appears in any single term more than once, which rules out x ∧ x and x ∧ ¬ x .Our goal in this section is to prove the following theorem, which is the upper bound part inTheorem 1.1. Theorem 2.2.

Let L = (( C i , v i )) i ∈ [ m ] be a width- w decision list. Then for every ε > , there exists J ⊆ [ m ] , m ∈ J of size | J | = (cid:0) w log ε (cid:1) O ( w ) such that Pr [ L ( x ) = L | J ( x )] ≤ ε . Since there is no guarantee about the value on each rule of the decision list, it is convenient toconsider the index function. Let L = (( C i , v i )) i ∈ [ m ] be a decision list on n variables. The indexfunction of L is a function Ind L : { , } n → [ m ], given byInd L ( x ) = min { i ∈ [ m ] | C i ( x ) = 1 } . L is given by the decision list Ind L = (( C i , i )) i ∈ [ m ] . Using the index function, itsuﬃces to discard some rules of L and show it still approximates the index function. Claim 2.3.

Let L = (( C i , v i )) i ∈ [ m ] be a decision list. Then for any J ⊆ [ m ] , m ∈ J , we have Pr [ L ( x ) = L | J ( x )] ≤ Pr [

Ind L ( x ) / ∈ J ] . Proof.

This follows as if Ind L ( x ) = j ∈ J , then L ( x ) = L | J ( x ) = v j .Obviously, if a rule of a decision list is covered by some previous rules, then we can safely removeit. For example, in ( x , , ( x ∧ x ,

2) the second rule is useless. To make this more formal, weintroduce the following notion of a useful index . Deﬁnition 2.4 (Useful index) . Given size- m decision list L , an index i ∈ [ m ] is said to be useful if there exists an assignment x such that Ind L ( x ) = i . We denote by useful ( L ) the number ofuseful indices in L . Example 2.5.

Assume L = (( x , a ) , ( x ∧ ¬ x , b ) , (1 , c ) , ( x , d ) , (1 , e )) . Then indices , are useful,but indices , , are not. So useful ( L ) = 2 . The main intuition underlying our approach is that rules that are hardly hit by random inputscan be removed. Motivated by this, we deﬁne hit probability p L ( i ) := Pr [Ind L ( x ) = i ] . Claim 2.6.

For any size- m decision list L , we have P mi =1 p L ( i ) = 1 .Proof. This follows as the events [Ind L ( x ) = i ] are a partition of the probability space.The following is our main technical lemma. Lemma 2.7.

Let L = (( C i , v i )) i ∈ [ m ] be a width- w decision list. Sort [ m ] = { j , . . . , j m } such that p L ( j ) ≥ p L ( j ) ≥ · · · ≥ p L ( j m ) . For any ε > , let t = (cid:18) w log 1 ε (cid:19) O ( w ) . Then for J = { j , . . . , j t , m } it holds that Pr [

Ind L ( x ) / ∈ J ] ≤ ε . The proof of Theorem 2.2 follows immediately, by combining Lemma 2.7 and Claim 2.3. A restriction on n variables is ρ ∈ { , , ∗} n . An ( n, k )-random restriction is the uniform distributionover restrictions ρ ∈ { , , ∗} n with exactly k stars, which we denote by R ( n, k ). An ( n, α )-randomrestriction, which we denote by U ( n, α ), assigns independently each bit of the restriction ρ to 0 , , ∗ with probability − α , − α , α respectively. Given a decision list L : { , } n → V , its restriction under ρ is L ↾ ρ : { , } ρ − ( ∗ ) → V . Deﬁnition 2.8 (Useful probability) . Given size- m decision list L and α ∈ (0 , , the useful prob-ability of an index i ∈ [ m ] is q L ( α, i ) := Pr ρ ∼ U ( n,α ) [ index i is useful in L ↾ ρ ] . L initially does not contain useless rules, so for any α and i , we alwayshave q L ( α, i ) >

0. We also have the following simple fact regarding useful probability.

Claim 2.9.

For any size- m decision list L , we have P mi =1 q L ( α, i ) = E ρ ∼ U ( n,α ) [ useful ( L ↾ ρ )] .Proof. Let 1 ρ,i be the indicator of index i being useful in L ↾ ρ . Then E ρ ∼ U ( n,α ) [ L ↾ ρ )] = E ρ " m X i =1 ρ,i = m X i =1 E ρ [1 ρ,i ] = m X i =1 q L ( α, i ) . Now we present an encoding/decoding scheme for random restriction and analyze the expecta-tion in Claim 2.9 explicitly. Let α ∈ (0 ,

1) be such that αn is an integer. Deﬁne: U := (cid:26) ( ρ, s ) (cid:12)(cid:12)(cid:12)(cid:12) ρ ∈ R ( n, αn ) , s ∈ { , . . . , L ↾ ρ ) } (cid:27) V := ( ( ρ ′ , a ) (cid:12)(cid:12)(cid:12)(cid:12) ρ ′ ∈ w [ k =0 R ( n, αn − k ) , a ∈ { Old , New } w ) . We deﬁne two deterministic algorithms Enc :

U → V and Dec : Enc( U ) ⊆ V → U such thatDec(Enc( ρ, s )) = ( ρ, s ) holds for any ( ρ, s ) ∈ U . Algorithm 1:

Encoding algorithm Enc( ρ, s ) Input: restriction and index ( ρ, s ) ∈ U Output: restriction and string ( ρ ′ , a ) ∈ V I ← { i | i is a useful index in L ↾ ρ } j ← the s -th element in I ρ ′ ← ρ, a ← ∅ /* Assume C j = V ck =1 y j k , y j k ∈ { x j k , ¬ x j k } , c ≤ w */ for k = 1 to c do if ρ ( x j k ) ∈ { , } then Append a with Old /* x j k is already set by ρ */ else Append a with New /* x j k is newly set to satisfy this term */ if y j k = x j k then Update ρ ′ ( x j k ) ← else Update ρ ′ ( x j k ) ← end Complete a arbitrarily to length w end return ( ρ ′ , a )The following claim proves the correctness of the encoding and decoding algorithms. Claim 2.10.

Dec ( Enc ( ρ, s )) = ( ρ, s ) holds for any ( ρ, s ) ∈ U .Proof. Sort literals in each term of L = (( C i , v i )) i ∈ [ m ] arbitrarily. To justify the correctness, let( ρ ′ , a ) = Enc( ρ, s ), then we need to ensure: • Dec( ρ ′ , a ) obtains the same j in line 1 as Enc( ρ, s ) does in line 2:During Enc( ρ, s ), index j is useful in L ↾ ρ , thus setting unﬁxed variables to satisfy C j willnot make any term C i for i < j satisﬁed. Hence the ﬁrst satisﬁed term in L ↾ ρ ′ is C j .9 lgorithm 2: Decoding algorithm Dec( ρ ′ , a ) Input: restriction and string ( ρ ′ , a ) ∈ Enc( U ) ⊆ V Output: restriction and index ( ρ, s ) ∈ U j ← index of the ﬁrst satisﬁed term in L ↾ ρ ′ ρ ← ρ ′ /* Assume C j = V ck =1 y j k , y j k ∈ { x j k , ¬ x j k } , c ≤ w */ for k = 1 to c do if a k = New then /* x j k was not set by ρ */ Update ρ ( x j k ) ← ∗ end end I ← { i | i is a useful index in L ↾ ρ } s ← rank of j in I return ( ρ, s ) • Dec( ρ ′ , a ) in line 8 obtains the correct ρ :Since each term is sorted in advance, and a encodes which variable in C j is set by Enc( ρ, s )rather than ρ , the loop in Dec( ρ ′ , a ) will set these variables back to ∗ and recover ρ . Corollary 2.11. |U | ≤ |V| .Proof.

Enc is an injection from U to Enc( U ) ⊂ V . Lemma 2.12.

Let L be a width- w decision list on n variables and let α ∈ (0 , . Then E ρ ∼ U ( n,α ) [ useful ( L ↾ ρ )] ≤ (cid:18) − α (cid:19) w . Proof.

We ﬁrst prove the bound for ρ ∼ R ( n, αn ) and then increase the number of variables toinﬁnity, by adding dummy variables. This proves the desired bound as for n ′ → ∞ , the restrictionof R ( n ′ , αn ′ ) to the ﬁrst n variables converges to U ( n, α ). We have E ρ ∼ R ( n,αn ) [ L ↾ ρ )] = 1 | R ( n, αn ) | X ρ ∈ R ( n,αn ) L ↾ ρ )= |U || R ( n, αn ) | ≤ |V|| R ( n, αn ) | ≤ (cid:16)P wk =0 (cid:0) nαn − k (cid:1) (1 − α ) n + k (cid:17) × w (cid:0) nαn (cid:1) (1 − α ) n ≤ (cid:16)P wk =0 (cid:0) nαn − k (cid:1)(cid:17) × w (cid:0) nαn (cid:1) ≤ (cid:0) n + wαn (cid:1) × w (cid:0) nαn (cid:1) ≤ (cid:18) − α (cid:19) w . We use noise stability as a bridge between p L ( i ) and q L ( α, i ).10 eﬁnition 2.13 (Noisy distribution) . Given x ∈ { , } n and a noise parameter β ∈ (0 , , wedenote by N β ( x ) the distribution over y ∈ { , } n , where Pr [ y i = x i ] = β , Pr [ y i = x i ] = − β independently for all i ∈ [ n ] . Deﬁnition 2.14 (Stability) . Let g : { , } n → { , } be a boolean function. The β -stability of g isStab β ( g ) = Pr x ∈{ , } n ,y ∼N β ( x ) [ g ( x ) = g ( y ) = 1] . The hypercontractive inequality (see for example [21], page 259) allows us to bound the stabilityof a boolean function by its acceptance rate.

Fact 2.15.

Let g : { , } n → { , } and β ∈ (0 , . Then Stab β ( g ) ≤ (Pr [ g ( x ) = 1]) β . Next, we deﬁne index stability and relate it to useful probability q L ( · , · ) and hit probability p L ( · ). Deﬁnition 2.16 (Index stability) . Given a size- m decision list L on n variables, the β -stability ofindex i ∈ [ m ] is Stab L ( β, i ) := Pr x ∈{ , } n ,y ∼N β ( x ) [ Ind L ( x ) = Ind L ( y ) = i ] . Lemma 2.17 (Bridging lemma) . Let L be a size- m width- w decision list on n variables. Then forany index i ∈ [ m ] and β ∈ (0 , , we have p L ( i ) q L (1 − β, i ) ≤ Stab L ( β, i ) ≤ p L ( i ) β . Proof.

We ﬁrst prove the upper bound. Let g : { , } n → { , } be an indicator boolean functionfor Ind L ( x ) = i . Then using Fact 2.15, we haveStab L ( β, i ) = Stab β ( g ) ≤ (Pr [ g ( x ) = 1]) β = (Pr [Ind L ( x ) = i ]) β = p L ( i ) β . We now turn to prove the lower bound. Let α = 1 − β . Observe that we can sample ( x, y )where x ∈ { , } n , y ∼ N β ( x ) as follows: • Sample restriction ρ ∼ U ( n, α ); • Sample uniform x ′ ∈ { , } ρ − ( ∗ ) and complete stars in ρ with it as x ; • Sample uniform y ′ ∈ { , } ρ − ( ∗ ) and complete stars in ρ with it as y .We thus have Stab L ( β, i ) = Pr ρ,x ′ ,y ′ (cid:2) Ind L ↾ ρ ( x ′ ) = Ind L ↾ ρ ( y ′ ) = i (cid:3) . We now make a seemingly redundant, but surprisingly useful, conditioning. Let E ( ρ, i ) denote theevent E ( ρ, i ) := [ i is useful in L ↾ ρ ] . Then we can equivalently writeStab L ( β, i ) = Pr ρ,x ′ ,y ′ (cid:2) Ind L ↾ ρ ( x ′ ) = Ind L ↾ ρ ( y ′ ) = i ∧ E ( ρ, i ) (cid:3) . ρ , deﬁne r ρ ( i ) := Pr x ′ (cid:2) Ind L ↾ ρ ( x ′ ) = i (cid:3) . Since x ′ , y ′ are independent for any ﬁxed restriction, we haveStab L ( β, i ) = Pr ρ [ E ( ρ, i )] · Pr ρ,x ′ ,y ′ (cid:20) Ind L ↾ ρ ( x ′ ) = Ind L ↾ ρ ( y ′ ) = i (cid:12)(cid:12)(cid:12)(cid:12) E ( ρ, i ) (cid:21) = q L ( α, i ) · E ρ (cid:20) r ρ ( i ) (cid:12)(cid:12)(cid:12)(cid:12) E ( ρ, i ) (cid:21) ≥ q L ( α, i ) · (cid:18) E ρ (cid:20) r ρ ( i ) (cid:12)(cid:12)(cid:12)(cid:12) E ( ρ, i ) (cid:21)(cid:19) (Cauchy-Schwarz inequality)= 1 q L ( α, i ) (cid:18) q L ( α, i ) · E ρ (cid:20) r ρ ( i ) (cid:12)(cid:12)(cid:12)(cid:12) E ( ρ, i ) (cid:21)(cid:19) = 1 q L ( α, i ) (cid:18) Pr ρ,x ′ (cid:2) Ind L ↾ ρ ( x ′ ) = i ∧ E ( ρ, i ) (cid:3)(cid:19) = 1 q L ( α, i ) (cid:18) Pr ρ,x ′ (cid:2) Ind L ↾ ρ ( x ′ ) = i (cid:3)(cid:19) = 1 q L ( α, i ) (cid:16) Pr x [Ind L ( x ) = i ] (cid:17) = p L ( i ) q L ( α, i ) . Corollary 2.18.

Let L be a size- m width- w decision list. Then for any index i ∈ [ m ] and β ∈ (0 , ,we have p L ( i ) ≤ q L (1 − β, i ) β β . As a remark, we note that Lemma 2.17 can be generalized to arbitrary boolean functions witha similar proof.

Lemma 2.19.

Let g : { , } n → { , } be a boolean function which is not identically zero. Set | g | = Pr [ g ( x ) = 1] . Then for any β ∈ (0 , , we have | g | Pr ρ ∼ U ( n, − β ) [ g ↾ ρ ≤ Stab β ( g ) ≤ | g | β . Now we put everything together and give the proof of Lemma 2.7.

Proof of Lemma 2.7.

Recall that we sorted [ m ] = { j , . . . , j m } such that p L ( j ) ≥ p L ( j ) ≥ · · · ≥ p L ( j m ). Let J = { j , . . . , j t , m } for t to be optimized later.Next, let β ∈ (0 ,

1) to be optimized later and set α = 1 − β . Sort [ m ] = { i , . . . , i m } such that q L ( α, i ) ≥ q L ( α, i ) ≥ · · · ≥ q L ( α, i m ). By Claim 2.9 and Lemma 2.12, we have m X k =1 q L ( α, i k ) = E ρ ∼ U ( n,α ) [ L ↾ ρ )] ≤ (cid:18) − α (cid:19) w = (cid:18) β (cid:19) w . q L in decreasing order, so q L ( α, i k ) ≤ k (cid:18) β (cid:19) w . Observe that j , . . . , j t have the largest hit probability, and apply Corollary 2.18, then X j / ∈ J p L ( j ) ≤ m X k = t +1 p L ( j k ) ≤ m X k = t +1 p L ( i k ) ≤ m X k = t +1 q L ( α, i k ) β β ≤ (cid:18) β (cid:19) w × β β X k ≥ t +1 (cid:18) k (cid:19) β β ≤ (cid:18) β (cid:19) w × β β × β − β × t − − β β . If we restrict β ≤ / t = (cid:18) ε (cid:19) β − β (cid:18) β (cid:19) w × β − β (cid:18) β − β (cid:19) β − β ≤ (cid:18) ε (cid:19) β (cid:18) β (cid:19) w , then Pr [Ind L ( x ) / ∈ J ] = X j / ∈ J p L ( j ) ≤ ε. Now we divide ε into two cases. Assume ε = 2 − ℓw . Then: • If ℓ ≤ β = 1 / t = 2 O ( w ) . • If ℓ ≥ β = 1 /ℓ and get t = ℓ O ( w ) .One can verify that in either case we get t = (cid:18) w log 1 ε (cid:19) O ( w ) . In this section, we prove two lower bounds for decision list compression, which show that the boundsin Theorem 1.1 are tight up to constants.

Claim 3.1.

For any w , there is a width- w decision list L : { , } w → { , } such that Pr (cid:2) L ( x ) = L ′ ( x ) (cid:3) > / for any width- w decision list L ′ of size at most w / w .Proof. Since any boolean function on w variables can be expressed as some width- w decision list,13here are 2 w possible L . On the other hand, for any ﬁxed L ′ , it can approximate at most (cid:18) w w / (cid:19) × w / ≤ . × w diﬀerent boolean functions within distance 1 /

3; and for ﬁxed size m , there are at most (3 w × m distinct size- m width- w decision lists. As small-size decision lists can be embedded in larger ones,when restricted to size at most 2 w / w , width- w decision lists only approximate at most(3 w × w w × . × w < w diﬀerent boolean functions on w variables. Claim 3.2.

For any w and n > w , there is a width- w decision list L : { , } n → { , } which isnot equivalent to any width- w decision list L ′ of size smaller than (cid:0) nw (cid:1) /n .Proof. Let m = (cid:0) nw (cid:1) and sort all (cid:0) nw (cid:1) subsets of [ n ] with size w as { S , . . . , S m } arbitrarily. For any i ∈ [ m ], deﬁne C i = V j ∈ S i x j . For any v ∈ { , } m , let L v = (( C , v ) , . . . , ( C m , v m ) , (1 , m + 1) width- w decision list.As small-size decision lists can be embedded in larger ones, assume towards a contradiction thatany L v is equivalent to some size-( m/n ) width- w decision list L ′ v . Given L ′ v , we can recover L v by enumerating all assignments, since all rules in L v are useful. Thus, by counting argument, thenumber of possible L ′ v is upper bounded by × w X k =0 k (cid:18) nk (cid:19)! ( nw ) /n ≤ (cid:18) nw (cid:19) m/n < m . Now the general lower bound follows immediately.

Corollary 3.3.

For any w and ε ≤ / , there is a width- w decision list L such that Pr (cid:2) L ( x ) = L ′ ( x ) (cid:3) > ε holds for any width- w decision list L ′ of size at most (cid:18) w log 1 ε (cid:19) O ( w ) . Proof.

For ε ≥ − w , let L be the decision list in Claim 3.1. Then it cannot be approximatedwithin ε < / L ′ of size at most2 w w = (cid:18) w log 1 ε (cid:19) O ( w ) . For ε < − w , let L be the decision list in Claim 3.2 with n = log(1 /ε ). Since now ε = 2 − n , thedesired L ′ must be equivalent to L . Thus it cannot be realized by a decision list L ′ of size at most (cid:0) nw (cid:1) n = (cid:18) log ε w (cid:19) O (1) = (cid:18) w log 1 ε (cid:19) O ( w ) . eferences [1] R. Alweiss, S. Lovett, K. Wu, and J. Zhang. Improved bounds for the sunﬂower lemma. arXivpreprint arXiv:1908.08483 , 2019.[2] V. Arvind, J. K¨obler, S. Kuhnert, G. Rattan, and Y. Vasudev. On the isomorphism problemfor decision trees and decision lists. Theoretical Computer Science , 590:38–54, 2015.[3] G. Bagallo and D. Haussler. Boolean feature discovery in empirical learning.

Machine learning ,5(1):71–99, 1990.[4] P. Beame. A switching lemma primer. Technical report, Technical Report UW-CSE-95-07-01,Department of Computer Science, 1994.[5] A. Blum. Rank-r decision trees are a subclass of r-decision lists.

Information ProcessingLetters , 42(4):183–185, 1992.[6] A. Chattopadhyay, M. Mahajan, N. S. Mande, and N. Saurabh. Lower bounds for lineardecision lists.

CoRR , abs/1901.05911, 2019.[7] A. Ehrenfeucht, D. Haussler, M. Kearns, and L. Valiant. A general lower bound on the numberof examples needed for learning.

Information and Computation , 82(3):247–261, 1989.[8] T. Eiter, T. Ibaraki, and K. Makino. Decision lists and related boolean functions.

TheoreticalComputer Science , 270(1-2):493–524, 2002.[9] E. Friedgut. Boolean functions with low average sensitivity depend on few coordinates.

Com-binatorica , 18(1):27–35, 1998.[10] P. Gopalan, R. Meka, and O. Reingold. DNF sparsiﬁcation and a faster deterministic countingalgorithm.

Computational Complexity , 22(2):275–310, 2013.[11] D. Guijarro, V. Lavin, and V. Raghavan. Monotone term decision lists.

Theoretical ComputerScience , 259(1-2):549–575, 2001.[12] T. Hancock, T. Jiang, M. Li, and J. Tromp. Lower bounds on learning decision lists and trees.

Information and Computation , 126(2):114–122, 1996.[13] J. H˚astad.

Computational Limitations of Small-depth Circuits . MIT Press, Cambridge, MA,USA, 1987.[14] J. C. Jackson. An eﬃcient membership-query algorithm for learning DNF with respect to theuniform distribution.

Journal of Computer and System Sciences , 55(3):414–440, 1997.[15] M. Kearns, M. Li, L. Pitt, and L. Valiant. On the learnability of boolean formulae. In

Annual ACM Symposium on Theory of Computing: Proceedings of the nineteenth annual ACMconference on Theory of computing , volume 1987, pages 285–295. Citeseer, 1987.[16] A. R. Klivans and R. A. Servedio. Boosting and hard-core set construction.

Machine Learning ,51(3):217–238, 2003. 1517] R. Kohavi and S. Benson. Research note on decision lists.

Machine Learning , 13(1):131–134,1993.[18] M. Krause. On the computational power of boolean decision lists. computational complexity ,14(4):362–375, 2006.[19] S. Lovett and J. Zhang. DNF sparsiﬁcation beyond sunﬂowers. In

Proceedings of the 51stAnnual ACM SIGACT Symposium on Theory of Computing, STOC 2019, Phoenix, AZ, USA,June 23-26, 2019. , pages 454–460, 2019.[20] Z. Nevo and R. El-Yaniv. On online learning of decision lists.

Journal of Machine LearningResearch , 3(Oct):271–301, 2002.[21] R. O’Donnell.

Analysis of boolean functions . Cambridge University Press, 2014.[22] A. A. Razborov. Bounded arithmetic and lower bounds in boolean complexity. In

FeasibleMathematics II , pages 344–386. Springer, 1995.[23] A. A. Razborov. Pseudorandom generators hard for k-dnf resolution and polynomial calculusresolution.

Annals of Mathematics , pages 415–472, 2015.[24] R. L. Rivest. Learning decision lists.

Machine learning , 2(3):229–246, 1987.[25] N. Segerlind, S. Buss, and R. Impagliazzo. A switching lemma for small restrictions and lowerbounds for k-dnf resolution.

SIAM Journal on Computing , 33(5):1171–1200, 2004.[26] G. Tur´an and F. Vatan. Linear decision lists and partitioning algorithms for the constructionof neural networks. In

Foundations of Computational Mathematics , pages 414–423. Springer,1997.[27] F. Wang and C. Rudin. Falling rule lists. In

Artiﬁcial Intelligence and Statistics , pages 1013–1022, 2015.[28] I. H. Witten, E. Frank, M. A. Hall, and C. J. Pal.