A Scalable Two Stage Approach to Computing Optimal Decision Sets
Alexey Ignatiev, Edward Lam, Peter J. Stuckey, Joao Marques-Silva
Monash University, Melbourne, Australia; CSIRO Data61, Melbourne, Australia; ANITI, IRIT, CNRS, Toulouse, France
{alexey.ignatiev,edward.lam,peter.stuckey}@monash.edu, [email protected]

Abstract
Machine learning (ML) is ubiquitous in modern life. Since it is being deployed in technologies that affect our privacy and safety, it is often crucial to understand the reasoning behind its decisions, warranting the need for explainable AI. Rule-based models, such as decision trees, decision lists, and decision sets, are conventionally deemed to be the most interpretable. Recent work uses propositional satisfiability (SAT) solving (and its optimization variants) to generate minimum-size decision sets. Motivated by the limited practical scalability of these earlier methods, this paper proposes a novel approach to learn minimum-size decision sets by enumerating individual rules of the target decision set independently of each other, and then solving a set cover problem to select a subset of rules. The approach makes use of modern maximum satisfiability and integer linear programming technologies. Experiments on a wide range of publicly available datasets demonstrate the advantage of the new approach over the state of the art in SAT-based decision set learning.
* This work is supported in part by the AI Interdisciplinary Institute ANITI, funded by the French program "Investing for the Future - PIA3" under Grant agreement no. ANR-19-PI3A-0004.

Introduction

Rapid advances in artificial intelligence and, in particular, in machine learning (ML) have influenced all aspects of human lives. Given the practical achievements and the overall success of modern approaches to ML (LeCun, Bengio, and Hinton 2015; Jordan and Mitchell 2015; Mnih et al. 2015; ACM 2018), one can argue that it will prevail as a generic computing paradigm and will find an ever growing range of practical applications. Unfortunately, the most widely used ML models are opaque, which makes it hard for a human decision-maker to comprehend the outcomes of such models. This motivated efforts on validating the operation of ML models (Ruan, Huang, and Kwiatkowska 2018; Katz et al. 2017) but also on devising approaches to explainable artificial intelligence (XAI) (Ribeiro, Singh, and Guestrin 2016; Lundberg and Lee 2017; Monroe 2018). One of the major lines of work in XAI is devoted to training logic-based models, e.g. decision trees (Bessiere, Hebrard, and O'Sullivan 2009; Narodytska et al. 2018; Hu,
Rudin, and Seltzer 2019; Aglin, Nijssen, and Schaus 2020), decision lists (Angelino et al. 2017; Rudin and Ertekin 2018) or decision sets (Lakkaraju, Bach, and Leskovec 2016; Ignatiev et al. 2018; Malioutov and Meel 2018; Ghosh and Meel 2019; Yu et al. 2020), where concise explanations can be obtained directly from the model. This paper focuses on the decision set (DS) model, which comprises an unordered set of if-then rules.

One of the advantages of decision sets over other rule-based models is that rule independence makes it straightforward to explain a prediction: a user can pick any rule that "fires" the prediction and the rule itself serves as the explanation. As a result, generation of minimum-size decision sets is of great interest. Recent work proposed SAT-based methods for generating minimum-size decision sets by solving a sequence of problems that determine whether a decision set of size K exists, with K being the number of rules (Ignatiev et al. 2018) or the number of literals (Yu et al. 2020), s.t. the decision set agrees with the training data. Unfortunately, the scalability of both approaches is limited due to the large size of the propositional encoding.

Motivated by this limitation, our work proposes a novel approach that splits the DS generation problem into two parts: (1) exhaustive rule enumeration and (2) computing a subset of rules agreeing with the training data. In general, this novel approach enables a significantly more compact propositional encoding, which makes it scalable for problem instances that are out of reach for the state of the art. The proposed approach is inspired by the standard setup used in two-level logic minimization (Quine 1952, 1955; McCluskey 1956). While the first part can be done using enumeration of maximum satisfiability (MaxSAT) solutions, the second part is reduced to the set cover problem, for which integer linear programming (ILP) is effective.
Experiments on a wide range of datasets indicate that this approach outperforms the state of the art. The remainder of this paper presents these developments in detail.

Preliminaries

Satisfiability and Maximum Satisfiability.
We use the standard definitions for propositional satisfiability (SAT) and maximum satisfiability (MaxSAT) solving (Biere et al. 2009). A literal over a Boolean variable x is either the variable x itself or its negation ¬x. (Given a constant parameter σ ∈ {0, 1}, we use notation x^σ to represent literal x if σ = 1, and to represent literal ¬x if σ = 0.) A clause is a disjunction of literals. A term is a conjunction of literals. A propositional formula is said to be in conjunctive normal form (CNF) or disjunctive normal form (DNF) if it is a conjunction of clauses or a disjunction of terms, respectively. Whenever convenient, clauses and terms are treated as sets of literals. Formulas written as sets of sets of literals (either in CNF or DNF) are described as clausal.

We will make use of partial maximum satisfiability (Partial MaxSAT) (Biere et al. 2009, Chapter 19), which can be formulated as follows. A partial CNF formula can be seen as a conjunction of hard clauses H (which must be satisfied) and soft clauses S (which represent a preference to satisfy those clauses). The Partial MaxSAT problem consists in finding an assignment that satisfies all the hard clauses and maximizes the total number of satisfied soft clauses.

Classification Problems and Decision Sets.
We follow the notation used in earlier work (Bessiere, Hebrard, and O'Sullivan 2009; Lakkaraju, Bach, and Leskovec 2016; Ignatiev et al. 2018; Yu et al. 2020). Consider a set of features F = {1, . . . , K}. The domain of possible values for feature r ∈ [K] is D_r. The complete space of feature values (or feature space (Han, Kamber, and Pei 2012)) is 𝔽 ≜ ∏_{r=1}^{K} D_r. The vector f = (f_1, . . . , f_K) of K variables f_r ∈ D_r refers to a point in 𝔽. Concrete (constant) points in 𝔽 are denoted by v = (v_1, . . . , v_K), with v_r ∈ D_r. For simplicity, all the features are assumed to be binary, i.e. D_r = {0, 1}, ∀r ∈ [K]; categorical and ordinal features can be mapped to binary features using standard techniques (Pedregosa et al. 2011). Therefore, whenever convenient, a Boolean literal on a feature r can be represented as f_r (or ¬f_r, resp.), denoting that feature f_r takes value 1 (value 0, resp.).

Consider a standard classification scenario with training data E = {e_1, . . . , e_M}. A data instance (or example) e_i ∈ E is a pair (v_i, c_i) where v_i ∈ 𝔽 is a vector of feature values and c_i ∈ C is a class. An example e_i can be seen as associating a vector of feature values v_i with a class c_i ∈ C. This work focuses on binary classification problems, i.e. C = {⊖, ⊕}, but the proposed ideas are easily extendable to the case of multiple classes. Given example e_i = (v_i, c_i) ∈ E and the r-th component v_ir of v_i, literal f_r^{1−v_ir} is said to discriminate e_i because f_r^{1−v_ir} ≜ ¬v_ir, i.e. f_r^{1−v_ir} falsifies example e_i when it is considered as a conjunction of feature literals. The concept of example discrimination can be extended to terms. Examples e_i, e_j ∈ E associating the same set of feature values with the opposite classes are referred to as overlapping. We assume wlog. that the training data E is perfectly classifiable, i.e.
E partially defines a Boolean function φ : 𝔽 → C; in other words, there are no overlapping examples in E. Otherwise, either e_i or e_j can be removed from dataset E, incurring an error of 1. In general, repeated "collisions" can be resolved by taking the majority vote, which results in the highest possible accuracy on the training data.

The objective of classification in ML is to devise a function φ̂ that matches the actual function φ on the training data E and generalizes suitably well on unseen test data (Fürnkranz, Gamberger, and Lavrac 2012; Han, Kamber, and Pei 2012; Mitchell 1997; Quinlan 1993). In many settings, function φ̂ is not required to match φ on the complete set of examples E and instead an accuracy measure is considered. Furthermore, in classification problems one conventionally has to optimize with respect to (1) the complexity of φ̂, (2) the accuracy of the learnt function (to make it match the actual function φ on a maximum number of examples), or (3) both. As this paper assumes that the training data does not have overlapping instances, we aim solely at minimizing the representation size of the target ML models.

This paper focuses on learning representations of φ̂ corresponding to decision sets (DS) (Lakkaraju, Bach, and Leskovec 2016; Ignatiev et al. 2018; Malioutov and Meel 2018; Ghosh and Meel 2019; Yu et al. 2020). A decision set is an unordered set of rules. Each rule π is from the set R = ∏_{r=1}^{K} {f_r, ¬f_r, u}, where u represents a don't care value. For each example e ∈ E, a rule of the form π ⇒ c, with π ∈ R and c ∈ C, is interpreted as "if the feature values of example e agree with π then the rule predicts that example e has class c". Hereinafter, we will be dealing with learning minimum-size decision sets, with the size measure being either the number of rules in the decision set or the total number of literals in it (sometimes referred to as total size).
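To make the firing semantics concrete, here is a minimal Python sketch (illustrative only, not the paper's implementation): a rule is modeled as a partial assignment over binary features (i.e. a term), an unordered decision set as a list of (rule, class) pairs, and the two-rule set below is hypothetical.

```python
# A decision set as a list of (rule, class) pairs; each rule is a partial
# assignment {feature_index: required_value} over binary features, i.e. a term.
def fires(rule, instance):
    """An unordered rule fires on an instance iff all its literals agree."""
    return all(instance[r] == v for r, v in rule.items())

def predict(decision_set, instance):
    """Collect the classes of all firing rules (may be empty, may overlap)."""
    return {c for rule, c in decision_set if fires(rule, instance)}

# Hypothetical two-rule set over three binary features:
ds = [({0: 1}, "yes"), ({0: 0, 2: 1}, "no")]
print(predict(ds, (1, 0, 0)))  # {'yes'}
```

Note that `predict` may return an empty set (no rule applies) or more than one class (overlapping rules), exactly the two situations discussed next.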
Note that because rules in decision sets are unordered, some rules may overlap, i.e. multiple rules π_i ∈ R may agree with some instance of the feature space 𝔽. It may also happen that none of the rules of a decision set apply to some instances of 𝔽.

Example 1. Consider the following dataset of four data instances representing the "to date or not to date?" example by Domingos (2015):
Example  Day      Venue   Weather  TV Show  Date?
e1       Weekday  Dinner  Warm     Bad      No
e2       Weekend  Club    Warm     Bad      Yes
e3       Weekend  Club    Warm     Bad      Yes
e4       Weekend  Club    Cold     Good     No
This data serves to predict whether a friend accepts an invitation to go out for a date given various circumstances. An example of a valid decision set for this data is the following:

IF TV Show = Good THEN Date = No
IF Day = Weekday THEN Date = No
IF TV Show = Bad ∧ Day = Weekend THEN Date = Yes
This DS has 3 rules and a total size of 7 (1 for each literal on the left and right, or alternatively, 1 for each literal on the left and 1 for each rule). It does not exhibit rule overlap for examples in 𝔽 while the following decision set does:

IF TV Show = Good THEN Date = No
IF Day = Weekday THEN Date = No
IF Weather = Warm ∧ Day = Weekend THEN Date = Yes
Here, the first and third rules overlap for all examples with feature values Day = Weekend, Weather = Warm, and TV Show = Good.

Related Work
Rule-based ML models can be traced back to around the 70s and 80s (Michalski 1969; Shwayder 1975; Hyafil and Rivest 1976; Breiman et al. 1984; Quinlan 1986; Rivest 1987). To the best of our knowledge, decision sets first appear as an unordered variant of decision lists (Rivest 1987; Clark and Niblett 1989) in (Clark and Boswell 1991). The use of logic and optimization for synthesizing a disjunction of rules matching a given training dataset was first tried in (Kamath et al. 1992). Recently, (Lakkaraju, Bach, and Leskovec 2016) argued that decision sets are more interpretable than decision trees and decision lists.

Our work builds on (Ignatiev et al. 2018; Yu et al. 2020), where SAT-based models were proposed for training decision sets of smallest size. The method of (Ignatiev et al. 2018) minimized the number of rules in perfect decision sets, i.e. those that agree perfectly with the training data, which is assumed to be consistent; it was also shown to significantly outperform the smooth local search approach of (Lakkaraju, Bach, and Leskovec 2016). Rule minimization was then followed by minimization of the total number of literals used in the decision set, which resulted in a lexicographic approach to the minimization problem. In contrast, (Yu et al. 2020) focused on minimizing the total number of literals in the target DS. This work showed that minimizing the number of rules is more scalable for solving the perfect decision set problem since the optimization measure, i.e. the number of rules, is more coarse-grained. However, minimizing the total number of literals was shown to produce significantly smaller and, thus, more interpretable target decision sets. Furthermore, they showed that sparse decision sets (minimizing either the number of literals or the number of rules) provide a user with yet another way to produce a succinct classifier representation, by trading off its accuracy for smaller size.
Sparse decision sets were also considered in (Malioutov and Meel 2018; Ghosh and Meel 2019), where the authors proposed a MaxSAT model for representing one target class of the training data. ILP was also applied to compute a variant of sparse decision sets (Dash, Günlük, and Wei 2018). As was shown in (Yu et al. 2020), sparse decision sets, although much easier to compute, achieve lower test accuracy compared to perfect decision sets. As a result, the focus of this work is solely on improving the scalability of computing perfect decision sets, i.e. sparse models are excluded from consideration.
Similar to the recent logic-based approaches to learning decision sets (Ignatiev et al. 2018; Malioutov and Meel 2018; Ghosh and Meel 2019), our approach builds on state-of-the-art SAT and MaxSAT technology. These prior works consider a SAT or MaxSAT model that determines whether there exists a decision set of size N given training data E with |E| = M examples. The problem is solved by iteratively varying size N and making either a series of SAT calls or one MaxSAT call. The main limitation of prior work is the encoding formula size, which is O(N × M × K), where N is the target size of the decision set (determined either as the number of rules (Ignatiev et al. 2018) or as the total number of literals (Yu et al. 2020)), M is the number of training data instances and K is the number of features in the training data. This limitation significantly impairs the scalability of these approaches and hence restricts their practical applicability.

In contrast to the aforementioned works, the approach detailed below does not aim at devising a decision set in one step and instead consists of two phases. The first phase sequentially enumerates all individual minimal rules given an input dataset. The second phase computes a minimum-size subset of rules (either in terms of the number of rules or the total number of literals in use) that covers all the training data instances. This way the approach trades off a large encoding size, and thus potentially hard SAT oracle calls for computing a complete decision set, for a (much) larger number of simpler oracle calls, each computing an individual rule, followed by solving the set cover problem. This algorithmic setup is, in a sense, inspired by the effectiveness of the standard clausal formula minimization approach (Quine 1952, 1955; McCluskey 1956; Brayton et al. 1984; Espresso).

First, recall that we consider binary classification, i.e. C = {⊖, ⊕}, but the ideas of this section can be easily adapted to multi-class problems, e.g.
by using one-hot encoding (Pedregosa et al. 2011). Next, let us split the set of training examples E into the sets of examples E⊕ and E⊖ for the respective classes, s.t. E = E⊕ ∪ E⊖ and E⊕ ∩ E⊖ = ∅.

Recall that a decision set is an unordered set of if-then rules π ⇒ c, each associating a set of literals π ∈ R, R = ∏_{r=1}^{K} {f_r, ¬f_r, u}, over the feature values present in rule π with the corresponding class c ∈ {⊖, ⊕}. Following (Ignatiev et al. 2018), observe that each set of literals π in the rule forms a term, and so every class c_i ∈ {⊖, ⊕} in a decision set can be represented logically as a disjunction of terms, each term representing a conjunction of literals in π.

Example 2. Consider the example dataset shown in Example 1. Assume that features
Day, Venue, Weather, and TV Show are represented with Boolean variables f_1, f_2, f_3, and f_4, respectively. Observe that all the features f_r, r ∈ [4], in the example are binary, and thus each value for feature f_r can be represented either as literal f_r or literal ¬f_r. Let us map the original feature values to {0, 1} such that the alphabetically-first value is mapped to 0 while the other is mapped to 1. The classes No and Yes are mapped to ⊖ and ⊕, respectively. As a result, our dataset becomes

Example  f_1  f_2  f_3  f_4  c
e1       0    1    1    0    ⊖
e2       1    0    1    0    ⊕
e3       1    0    1    0    ⊕
e4       1    0    0    1    ⊖

Using this binary dataset and the first decision set from Example 1, the classes c = ⊖ and c = ⊕ are represented as the DNF formulas φ⊖ ≜ (f_4) ∨ (¬f_1) and φ⊕ ≜ (¬f_4 ∧ f_1).

In this work, we follow (Ignatiev et al. 2018; Yu et al. 2020) and compute minimum-size decision sets in the form of disjunctive representations φ⊖ and φ⊕ of classes ⊖ and ⊕. Even though it is simpler to construct rules for one class when there are only two classes (Malioutov and Meel 2018; Ghosh and Meel 2019), computing both φ⊖ and φ⊕ achieves better interpretability. Specifically, if both classes are explicitly represented, it is relatively easy to extract explicit and succinct explanations for any class, but this is not the case when only one class is computed. This approach also immediately extends to problems with three or more classes.

Without loss of generality, we focus on computing a disjunctive representation φ⊕ for the class c = ⊕. The same reasoning can be applied to compute φ⊖. The target DNF φ⊕ must be consistent with the training data, i.e. every term π ∈ φ⊕ must (1) agree with at least one example e_i ∈ E⊕ and (2) discriminate all examples e_j ∈ E⊖. Furthermore, each term π ∈ φ⊕ must be irreducible, meaning that any subterm π′ ⊊ π does not fulfill one of the two conditions above.
This section describes the first phase of the proposed approach, namely, how a term π satisfying both of the conditions above can be obtained separately of the other terms. Every term is computed as a MaxSAT solution to a partial CNF formula

ψ ≜ H ∧ S (1)

with H and S being the hard and soft parts, described below. Consider two sets of Boolean variables P and N, |P| = |N| = K. For every feature f_r, r ∈ [K], define variables p_r ∈ P and n_r ∈ N. The idea is inspired by the dual-rail encoding (DRE) of propositional formulas (Bryant et al. 1987), e.g. studied in the context of logic minimization (Manquinho et al. 1997; Jabbour et al. 2014). Variables p_r and n_r are referred to as dual-rail variables. We assume that p_r = 1 iff literal f_r occurs in the target term π, while n_r = 1 iff literal ¬f_r occurs in π. Moreover, for every feature f_r, r ∈ [K], a hard clause is added to H to forbid the feature taking two values at once:

∀r ∈ [K] (¬p_r ∨ ¬n_r) (2)

The other combinations of values for p_r and n_r encode the fact that feature r occurs in the target term π positively, negatively, or does not occur at all (when p_r = n_r = 0). S represents a preference to discard the features from a target term π and thus contains a pair of soft unit clauses expressing that preference:

S ≜ {(¬p_r), (¬n_r) | r ∈ [K]} (3)

By construction of S and given a MaxSAT solution for (1), the target term π is composed of all features f_r for which one of the dual-rail variables (either p_r or n_r) is assigned to 1 by the solution, i.e. the corresponding soft clauses are falsified.

Discrimination Constraints.
Every example e_j = (v_j, ⊖) from E⊖ must be discriminated. This can be enforced by using a clause (∨_{r∈[K]} f_r^{1−v_jr}), where constant v_jr is the value of the r-th feature in example e_j. To represent this in the dual-rail formulation, we add the following hard clauses to H:

∀j ∈ [|E⊖|] (∨_{r∈[K]} δ_jr), (4)

where δ_jr is to be replaced by dual-rail variable p_r if v_jr = 0 and by n_r if v_jr = 1.

Example 3. Consider instance e_1 = (f_1 = 0, f_2 = 1, f_3 = 1, f_4 = 0) ∈ E⊖ of the running example. To discriminate it, we add a hard clause (p_1 ∨ n_2 ∨ n_3 ∨ p_4). Indeed, to satisfy this clause, we have to pick one of the literals discriminating example e_1, e.g. if p_1 = 1 then literal f_1 occurs in term π, which discriminates instance e_1.

Coverage Constraints.
To enforce that every term π covers at least one training instance of E⊕, we can use similar reasoning. Observe that a term π ∈ R covers instance e_i = (v_i, c_i) ∈ E⊕ iff none of its literals discriminates e_i, i.e. f_r^{1−v_ir} ∉ π for any r ∈ [K]. For each example e_i = (v_i, c_i) ∈ E⊕, we introduce an auxiliary variable t_i defined by:

t_i ↔ ¬(∨_{r=1}^{K} δ_ir), (5)

where δ_ir is to be replaced by dual-rail variable p_r if v_ir = 0 and by n_r if v_ir = 1. As a result, t_i is true iff term π covers example e_i.

Example 4. Consider instance e_2 = (f_1 = 1, f_2 = 0, f_3 = 1, f_4 = 0) ∈ E⊕ of the running example. Introduce variable t_2 ↔ ¬(n_1 ∨ p_2 ∨ n_3 ∨ p_4) as shown above. If t_2 =
1, the literals in the target term π cannot discriminate example e_2.

Once auxiliary variables t_i are introduced for each example e_i ∈ E⊕, the hard clause

∨_{i∈[|E⊕|]} t_i (6)

can be added to H to ensure that any term π agrees with at least one of the training data instances. The overall partial MaxSAT model (1) comprises hard clauses (2), (4), (5), (6) and also soft clauses (3). The number of variables used in the encoding is O(K + M) while the number of clauses is O(K × M). Recall that earlier works proposed encodings with O(N × M × K) variables and clauses, which in some situations makes it hard (or infeasible) to prove optimality of large decision sets.

Example 5. Consider our aim at computing rules for class ⊕ in the running example. By applying the DRE, one obtains the formula ψ = H ∧ S where

H = (¬p_1 ∨ ¬n_1) ∧ (¬p_2 ∨ ¬n_2) ∧ (¬p_3 ∨ ¬n_3) ∧ (¬p_4 ∨ ¬n_4) ∧ (p_1 ∨ n_2 ∨ n_3 ∨ p_4) ∧ (n_1 ∨ p_2 ∨ p_3 ∨ n_4) ∧ [t_2 ↔ ¬(n_1 ∨ p_2 ∨ n_3 ∨ p_4)] ∧ [t_3 ↔ ¬(n_1 ∨ p_2 ∨ n_3 ∨ p_4)] ∧ (t_2 ∨ t_3)

and

S = {(¬p_1), (¬n_1), (¬p_2), (¬n_2), (¬p_3), (¬n_3), (¬p_4), (¬n_4)}

Any assignment satisfying the hard clauses H of the dual-rail MaxSAT formula (1) constructed above defines a term π that discriminates all examples of E⊖ and covers at least one example of E⊕; the soft clauses S ensure minimality of terms π. More importantly, one can exhaustively enumerate all solutions of (1) to compute the set of all such terms (i.e. one can use the standard trick of adding a hard clause blocking the previous solution and asking for a new one until no more solutions can be found). Let us refer to this set of terms as T⊕. Finally, we claim that as soon as exhaustive solution enumeration for formula (1) is finished, the set of terms T⊕ covers every example e_i ∈ E⊕. The rationale is that if sets E⊕ and E⊖ do not overlap then for any example e_i ∈ E⊕ there is a way to cover it by a term π s.t. all examples of E⊖ are discriminated by π.
This means that, by construction of (1), for every variable t_i, the hard part H of the formula has a satisfying assignment assigning t_i = 1. (Recall that we assume the training data to be perfectly classifiable.)
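The authors enumerate MaxSAT solutions of (1) with blocking clauses; as a solver-free illustration, the sketch below enumerates the irreducible terms of the running example directly, checking conditions (1) and (2) plus minimality over all 3^K candidate terms. This is a brute-force stand-in for the dual-rail encoding, not the actual implementation.

```python
from itertools import product

# Running example, binarized: E_pos holds class "Yes", E_neg class "No".
E_pos = [(1, 0, 1, 0), (1, 0, 1, 0)]   # e2, e3
E_neg = [(0, 1, 1, 0), (1, 0, 0, 1)]   # e1, e4

def discriminates(term, example):
    """A term falsifies (discriminates) an example iff some literal disagrees."""
    return any(example[r] != v for r, v in term.items())

def valid(term):
    """Conditions (1) and (2): cover some positive, discriminate all negatives."""
    return (all(discriminates(term, e) for e in E_neg)
            and any(not discriminates(term, e) for e in E_pos))

def irreducible(term):
    """No single literal can be dropped without breaking validity."""
    return all(not valid({r: v for r, v in term.items() if r != q})
               for q in term)

K = 4
terms = []
# Each feature occurs positively (1), negatively (0), or not at all (None).
for choice in product([None, 0, 1], repeat=K):
    term = {r: v for r, v in enumerate(choice) if v is not None}
    if term and valid(term) and irreducible(term):
        terms.append(term)

# Yields the four terms of T_plus: (f1^f3), (f1^-f4), (-f2^f3), (-f2^-f4).
print(len(terms))  # 4
```

The MaxSAT enumeration of the paper reaches the same set of terms, but visits them smallest-first and scales to non-toy instances; the brute force above only serves to make the two conditions executable.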
Example 6. Consider our running example. Observe that a valid solution for formula ψ above is {p_1, n_4}, from which we can extract a term π = (f_1 ∧ ¬f_4) for the target class c = ⊕. The term π is added to T⊕. Observe that π covers both examples e_2 and e_3 and discriminates examples e_1 and e_4 from class ⊖.

Once the set T⊕ of all terms for class c = ⊕ is obtained, the next step of the approach is to compute a smallest size cover φ⊕ of the training examples E⊕. Concretely, the problem is to select the smallest size subset φ⊕ of T⊕ that covers all the training examples. The size can be either the number of terms used or the total number of literals used in φ⊕. Therefore, the problem to solve is essentially the set cover problem (Karp 1972). Assume that |T⊕| = L and create a Boolean variable b_j for every term π_j ∈ T⊕, indicating that rule j is selected. Also, consider L × M′, M′ = |E⊕|, Boolean constant values a_ij s.t. a_ij = 1 iff term π_j covers example e_i. Then, the problem of computing the cover with the fewest number of terms can be stated as:

minimize ∑_{j=1}^{L} b_j (7)
subject to ∑_{j=1}^{L} a_ij · b_j ≥ 1, ∀i ∈ [M′] (8)

Alternatively, the objective function can be modified to minimize the total number of literals. Concretely, create a constant s_j ∈ ℤ s.t. s_j = |π_j|, π_j ∈ T⊕. The problem is then to

minimize ∑_{j=1}^{L} s_j · b_j (9)
subject to ∑_{j=1}^{L} a_ij · b_j ≥ 1, ∀i ∈ [M′] (10)

Example 7. For our running example, |T⊕| =
4, with the terms being:

T⊕ = {(f_1 ∧ f_3), (f_1 ∧ ¬f_4), (¬f_2 ∧ f_3), (¬f_2 ∧ ¬f_4)}

The set cover problem for c = ⊕ can be seen as the table:

          π_1  π_2  π_3  π_4
a_ij  e2  1    1    1    1
      e3  1    1    1    1
s_j       2    2    2    2

This example has trivial solutions because every term covers all examples of E⊕, i.e., every a_ij = 1. Consider now class c = ⊖; |T⊖| =
4, with the terms being:

T⊖ = {(¬f_1), (f_2), (¬f_3), (f_4)}

Then the set cover problem for c = ⊖ can be seen as the following table:

          π_1  π_2  π_3  π_4
a_ij  e1  1    1    0    0
      e4  0    0    1    1
s_j       1    1    1    1

In our case, one valid solution picks columns π_1 and π_3, as both of them together cover E⊖. Thus, when minimizing the number of terms, the fewest number of columns to pick, such that every row has at least one 1, is 2, i.e., any optimal solution has cost 2 (1 + 1).
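The paper solves (7)-(10) with an ILP (or MaxSAT) solver; as a stdlib-only stand-in, the exhaustive search below solves the c = ⊖ instance of Example 7 for both objectives (the data structures are assumptions made for illustration, not the prototype's own):

```python
from itertools import combinations

# Set cover instance for class "No" of the running example:
# terms T_neg = [not f1, f2, not f3, f4]; rows are examples e1 and e4.
covers = {0: {"e1"}, 1: {"e1"}, 2: {"e4"}, 3: {"e4"}}   # a_ij, as sets
sizes = [1, 1, 1, 1]                                     # s_j = |pi_j|
universe = {"e1", "e4"}

def min_cover(objective):
    """Exhaustive stand-in for the ILP: find a cover minimizing `objective`."""
    best, best_cost = None, float("inf")
    for k in range(1, len(covers) + 1):
        for subset in combinations(covers, k):
            if set().union(*(covers[j] for j in subset)) == universe:
                cost = objective(subset)
                if cost < best_cost:
                    best, best_cost = subset, cost
    return best, best_cost

_, rule_cost = min_cover(len)                                # objective (7)
_, lit_cost = min_cover(lambda s: sum(sizes[j] for j in s))  # objective (9)
print(rule_cost, lit_cost)  # 2 2
```

Both objectives agree here because all terms of T⊖ have size 1; on larger instances the two objectives can select different covers, which is exactly the rules-vs-literals trade-off studied in the experiments.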
Breaking Symmetric Rules.

Observe in Example 7 that the terms π_1 and π_2 in T⊕ cover the same examples from E⊕. In the context of the set cover problem, such terms are described as symmetric. It is straightforward that at most one term in a set of symmetric terms can appear in a solution because of the minimization in the set cover problem.

Symmetric terms can become an issue if the total number of terms is exponential in the number of features. This kind of repetition can be avoided by using the instance coverage variables t_i. Concretely, given a term π ∈ T⊕ covering a set E′⊕ ⊂ E⊕ of data instances, one can add a clause (∨_{i ∈ E⊕ \ E′⊕} t_i) enforcing that any terms discovered later must cover at least one instance e_i uncovered by term π.

While the terms are symmetric for objective (7), there is a dominance relation for objective (9). Consider a term π with the same coverage as another term ρ s.t. |ρ| ≥ |π|: term π dominates term ρ. Given a set cover solution S ∪ {ρ}, we can always replace ρ by π to get a no worse solution S ∪ {π}. Therefore, term ρ can be ignored during selection. Because term enumeration is done with MaxSAT, i.e. smaller terms come first, we can use the same method above for symmetry to eliminate dominated terms, since a dominating term (a smallest term with the same coverage) will always be discovered first. Clearly, optimality for objectives (7) and (9) is still guaranteed if breaking of symmetric rules is applied.

Example 8. By breaking symmetric terms, the enumeration procedure computes only one term for class ⊕ and two terms for class ⊖. (Recall that we previously got |T⊕| = |T⊖| = 4.)
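Outside a MaxSAT solver, the symmetry and dominance filtering can be mimicked by keeping, for each coverage signature, only the first term discovered, with terms processed smallest-first to mirror the enumeration order. A sketch on the running example's (assumed) data:

```python
def dedupe_by_coverage(terms, positives):
    """Keep one term per coverage signature. Terms are processed smallest
    first (mirroring MaxSAT enumeration order), so the term kept for a
    signature dominates the dropped ones for objectives (7) and (9)."""
    seen, kept = set(), []
    for term in sorted(terms, key=len):
        covered = frozenset(i for i, v in enumerate(positives)
                            if all(v[r] == val for r, val in term.items()))
        if covered not in seen:
            seen.add(covered)
            kept.append(term)
    return kept

# The four symmetric T_plus terms of the running example collapse to one:
T_plus = [{0: 1, 2: 1}, {0: 1, 3: 0}, {1: 0, 2: 1}, {1: 0, 3: 0}]
E_pos = [(1, 0, 1, 0), (1, 0, 1, 0)]   # e2, e3
kept = dedupe_by_coverage(T_plus, E_pos)
print(len(kept))  # 1
```

This matches Example 8: only one ⊕-term survives, since all four cover exactly {e2, e3}. In the actual prototype the same effect is obtained by adding the blocking clause over the uncovered t_i variables during enumeration.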
Figure 1: Scalability of the competitors. (a) Raw performance (CPU time vs. number of instances solved). (b) Detailed runtime comparison of ruler-l-ilp+b vs. opt.
Experimental Results

This section evaluates the proposed rule enumeration based approach in terms of scalability and compares it with the state of the art in SAT-based learning of minimum-size decision sets on a variety of publicly available datasets. The experiments were performed in Debian Linux on an Intel Xeon Silver-4110 2.10GHz processor with 64GByte of memory. Following the setup of recent work (Yu et al. 2020), the time limit was set to 1800s for each individual process to run. The memory limit was set to 8GByte per process.
Prototype Implementation and Selected Competitors.
A prototype of our rule enumeration based approach was developed as a set of Python scripts, in the following referred to as ruler. The implementation of rule enumeration was done with the use of the state-of-the-art MaxSAT solver RC2 (Ignatiev, Morgado, and Marques-Silva 2018, 2019), which proved to be the most effective in MaxSAT model enumeration (MaxSAT Evaluation 2020). As a result, the terms are computed in a sorted fashion, i.e. the smallest ones come first. For the second phase of the approach, i.e. computing the set cover, we attempted to solve the problem both (1) with the RC2 MaxSAT solver and (2) with the Gurobi ILP solver (Gurobi Optimization 2020). (We also tried using Gurobi for the first phase, but it was significantly outperformed by the MaxSAT-based solution.) The corresponding configurations of the prototype are called ruler-*-rc2 and ruler-*-ilp, where '*' can be either 'r' or 'l', meaning that the solver minimizes either the number of rules or the total number of literals. Configurations with the suffix +b apply symmetry breaking constraints to reduce the number of implicant rules computed in the first phase of the approach. Finally, all configurations were set to compute explicit optimal DNF representations for all classes given a dataset.

The competing approaches include the SAT-based methods (Ignatiev et al. 2018) MinDS and MinDS⋆, referred to as mds and mds⋆ and available as part of https://github.com/alexeyignatiev/minds. The former tool computes the fewest number of rules while the latter lexicographically minimizes the number of rules and then the number of literals. The second competitor is a recent SAT-based approach that minimizes the total number of literals in the model (Yu et al. 2020), in the following referred to as opt.
Note that as the main objective of this experimental assessment is to demonstrate the scalability of the proposed approach, the methods for computing sparse decision sets (Malioutov and Meel 2018; Ghosh and Meel 2019; Yu et al. 2020) are intentionally excluded due to the significant difference in the problem they tackle.

Benchmarks.
All datasets considered in the evaluation were adopted from (Yu et al. 2020) and used unchanged. These datasets originated from the UCI Machine Learning Repository (UCI) and the Penn Machine Learning Benchmarks (PennML). The total number of datasets is 1065. The number of one-hot encoded (Pedregosa et al. 2011) features (training instances, resp.) per dataset in the benchmark suite varies from 3 to 384 (from 14 to 67557, resp.). Also, since ruler can handle only perfectly classifiable data, it processes each training dataset by keeping the largest consistent (non-overlapping) set of examples. This technique is applied in (Ignatiev et al. 2018; Yu et al. 2020) as well, which enables one to achieve the highest possible accuracy on the training data. Motivated by one of the conclusions of (Yu et al. 2020) stating that perfectly accurate decision sets, if successfully computed, are significantly more accurate than sparse and also heuristic models, here we do not compare test accuracy of the competitors; we assume test accuracy to be (close to) identical for all the considered approaches.
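The consistency preprocessing amounts to a per-vector majority vote over overlapping examples (cf. the earlier discussion of collisions); a possible stdlib-only sketch, with made-up data:

```python
from collections import Counter

def largest_consistent_subset(examples):
    """Resolve overlapping examples (same feature vector, opposite classes)
    by keeping only the majority class per vector (ties keep the class seen
    first), yielding perfectly classifiable training data."""
    votes = Counter(examples)                  # (vector, class) -> count
    winner = {}
    for (vec, cls), n in votes.items():
        if n > winner.get(vec, (None, 0))[1]:
            winner[vec] = (cls, n)
    return [e for e in examples if winner[e[0]][0] == e[1]]

# Hypothetical data: vector (0, 1) is labelled "yes" twice and "no" once.
data = [((0, 1), "yes"), ((0, 1), "yes"), ((0, 1), "no"), ((1, 1), "no")]
print(len(largest_consistent_subset(data)))  # 3
```

Dropping the minority labels is exactly what bounds the training error incurred by this step while keeping the remaining data perfectly classifiable.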
Raw Performance.
Figure 1a shows the scalability of all the selected approaches. Observe that the mixed solution ruler-r-ilp+b demonstrates the best performance, being able to train decision sets for 802 datasets. The second best approach is ruler-l-ilp+b, which copes with 800 benchmarks. The pure MaxSAT-based ruler-r-rc2+b and ruler-l-rc2+b perform worse, with 734 and 669 instances solved. This is not surprising because the structure of set cover problems naturally fits the capabilities of modern ILP solvers.

Figure 2: Model size comparison. (a) Literals or rules: ruler-l-ilp+b vs. mds. (b) Literals or lexicographic: ruler-l-ilp+b vs. mds⋆.

Disabling symmetry breaking constraints affects the performance of all configurations of ruler, which drops significantly. When it is disabled, the best such configuration (ruler-l-ilp) solves 686 benchmarks while the worst one (ruler-l-rc2) tackles 556. Here, we should say that the maximum and average number of rules enumerated if symmetry breaking is disabled is 326399 and 19604.4, respectively. Breaking symmetric rules decreases these numbers to 8865 and 563.7, respectively.

As for the rivals of the proposed approach, the best of them (mds) is far behind ruler-*-ilp+b and successfully trains 578 models, even though it targets rule minimization, which is arguably a much simpler problem. Another competitor (mds⋆) lexicographically minimizes the number of rules and then the number of literals, which is unsurprisingly harder to deal with, as it solves 398 benchmarks. Finally, the worst performance is demonstrated by opt, which learns 351 models. We reemphasize that both opt and ruler-l-ilp+b compute minimum-size decision sets in terms of the number of literals. However, the proposed solution outperforms the competition by 449 benchmarks. Note that due to the large encoding size, opt (mds, resp.) is practically limited to optimal models having a few dozens of literals (rules, resp.).
There is no such limitation in ruler∗∗ – in our experiments, it could obtain minimum-size models having thousands of literals in total within the given time limit. The runtime comparison for ruler_lilp+b and opt is detailed in Figure 1b – observe that, except for a few outliers, ruler_lilp+b outperforms the rival by up to four orders of magnitude.

Rules vs. Literals.
Here we demonstrate that literal minimization results in smaller, and thus more interpretable, models than rule minimization. Figure 2a compares the total number of literals in the models of mds and ruler_lilp+b, while Figure 2b compares the model sizes for mds⋆ and ruler_lilp+b. To make a valid comparison, we used only instances solved by both approaches in each pair. Among the datasets used in Figure 2a, the average number of literals obtained by mds and ruler_lilp+b is 116.2 and 62.2, respectively – the advantage of literal minimization is clear. Also, as shown in Figure 2b, lexicographic optimization results in models almost identical in size to the models produced by ruler_lilp+b. This suggests that applying the approach of mds⋆ may in general pay off in terms of solution size, but at the cost of significantly reduced scalability compared to mds and, more importantly, to ruler_lilp+b (398 vs. 578 vs. 800 instances solved, which represents 37.4%, 54.3%, and 75.1% of all 1065 benchmarks, respectively).

Conclusions

This paper has introduced a novel approach to learning minimum-size decision sets based on individual rule enumeration. The proposed approach is motivated by the standard two-stage methods applied in two-level logic minimization (Quine 1952, 1955; McCluskey 1956) and splits the problem into two parts: (1) exhaustively enumerating individual rules, followed by (2) solving a set cover problem. The basic approach is additionally augmented with symmetry breaking, enabling us to significantly reduce the number of rules produced. The approach has been applied to computing minimum-size decision sets both in terms of the number of rules and in terms of the total number of literals.
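To make the second stage concrete, the step after rule enumeration is a weighted set cover problem: each candidate rule covers a set of training examples and has a cost (e.g. its number of literals), and a cheapest subset of rules covering all examples must be selected. The following is a toy brute-force sketch; the instance, costs, and function name are illustrative, and at the scale of the experiments this stage is solved with MaxSAT or ILP technology rather than enumeration of subsets:

```python
from itertools import combinations

def min_set_cover(universe, rules, cost):
    """Exact (brute-force) weighted set cover: find the cheapest
    subset of rules whose covered examples together include all of
    `universe`. Returns (total cost, tuple of selected rule indices)."""
    best = None
    for k in range(1, len(rules) + 1):
        for subset in combinations(range(len(rules)), k):
            covered = set().union(*(rules[i] for i in subset))
            if covered >= universe:
                c = sum(cost[i] for i in subset)
                if best is None or c < best[0]:
                    best = (c, subset)
    return best

# toy instance: 5 training examples, 4 enumerated rules,
# costs given by each rule's number of literals
universe = {0, 1, 2, 3, 4}
rules = [{0, 1}, {2, 3}, {1, 2, 4}, {0, 3, 4}]
cost = [2, 2, 3, 3]
print(min_set_cover(universe, rules, cost))  # → (6, (2, 3))
```

Here no pair of 2-literal rules covers all five examples, so the optimum picks the two 3-literal rules, for 6 literals in total; minimizing the number of rules instead would simply set every cost to 1.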
The proposed approach has been shown to outperform the state of the art in logic-based learning of minimum-size decision sets by a few orders of magnitude.

As the proposed approach targets computing perfectly accurate decision sets, a natural line of future work is to examine ways of applying it to computing sparse decision sets that trade off accuracy for size. Another line of work is to address the issue of potential rule overlap with respect to the proposed approach. Finally, it is of interest to apply similar rule enumeration techniques to devising other kinds of rule-based ML models, e.g. decision lists and decision trees.

References
ACM. 2018. Fathers of the Deep Learning Revolution Receive ACM A.M. Turing Award. http://tiny.cc/9plzpz.
Aglin, G.; Nijssen, S.; and Schaus, P. 2020. Learning Optimal Decision Trees Using Caching Branch-and-Bound Search. In AAAI, 3146–3153.
Angelino, E.; Larus-Stone, N.; Alabi, D.; Seltzer, M.; and Rudin, C. 2017. Learning Certifiably Optimal Rule Lists. In KDD, 35–44.
Bessiere, C.; Hebrard, E.; and O'Sullivan, B. 2009. Minimising Decision Tree Size as Combinatorial Optimisation. In CP, 173–187.
Biere, A.; Heule, M.; van Maaren, H.; and Walsh, T., eds. 2009. Handbook of Satisfiability. IOS Press. ISBN 978-1-58603-929-5.
Brayton, R. K.; Hachtel, G. D.; McMullen, C.; and Sangiovanni-Vincentelli, A. 1984. Logic Minimization Algorithms for VLSI Synthesis, volume 2. Springer Science & Business Media.
Breiman, L.; Friedman, J. H.; Olshen, R. A.; and Stone, C. J. 1984. Classification and Regression Trees. Wadsworth. ISBN 0-534-98053-8.
Bryant, R. E.; Beatty, D. L.; Brace, K. S.; Cho, K.; and Sheffler, T. J. 1987. COSMOS: A Compiled Simulator for MOS Circuits. In DAC, 9–16.
Clark, P.; and Boswell, R. 1991. Rule Induction with CN2: Some Recent Improvements. In EWSL, 151–163.
Clark, P.; and Niblett, T. 1989. The CN2 Induction Algorithm. Machine Learning 3: 261–283.
Dash, S.; Günlük, O.; and Wei, D. 2018. Boolean Decision Rules via Column Generation. In NeurIPS, 4660–4670.
Domingos, P. 2015. The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World. Basic Books. ISBN 978-0465065707.
Espresso. 1993. Espresso — Multi-valued PLA Minimization. http://tiny.cc/txdnsz.
Fürnkranz, J.; Gamberger, D.; and Lavrac, N. 2012. Foundations of Rule Learning. Springer. ISBN 978-3-540-75196-0.
Ghosh, B.; and Meel, K. S. 2019. IMLI: An Incremental Framework for MaxSAT-Based Learning of Interpretable Classification Rules. In AIES.
Han, J.; Kamber, M.; and Pei, J. 2011. Data Mining: Concepts and Techniques, 3rd edition. Morgan Kaufmann. ISBN 978-0123814791.
Hu, X.; Rudin, C.; and Seltzer, M. 2019. Optimal Sparse Decision Trees. In NeurIPS, 7265–7273.
Hyafil, L.; and Rivest, R. L. 1976. Constructing Optimal Binary Decision Trees is NP-Complete. Inf. Process. Lett. 5(1): 15–17.
Ignatiev, A.; Morgado, A.; and Marques-Silva, J. 2018. PySAT: A Python Toolkit for Prototyping with SAT Oracles. In SAT, 428–437.
Ignatiev, A.; Morgado, A.; and Marques-Silva, J. 2019. RC2: an Efficient MaxSAT Solver. J. Satisf. Boolean Model. Comput.
Ignatiev, A.; Pereira, F.; Narodytska, N.; and Marques-Silva, J. 2018. A SAT-Based Approach to Learn Explainable Decision Sets. In IJCAR, 627–645.
Jabbour, S.; Marques-Silva, J.; Sais, L.; and Salhi, Y. 2014. Enumerating Prime Implicants of Propositional Formulae in Conjunctive Normal Form. In JELIA, 152–165.
Jordan, M. I.; and Mitchell, T. M. 2015. Machine learning: Trends, perspectives, and prospects. Science 349(6245): 255–260.
Kamath, A. P.; Karmarkar, N. K.; Ramakrishnan, K. G.; and Resende, M. G. C. 1992. A Continuous Approach to Inductive Inference. Math. Program. 57: 215–238.
Karp, R. M. 1972. Reducibility Among Combinatorial Problems. In Complexity of Computer Computations, 85–103.
Katz, G.; Barrett, C. W.; Dill, D. L.; Julian, K.; and Kochenderfer, M. J. 2017. Reluplex: An Efficient SMT Solver for Verifying Deep Neural Networks. In CAV, 97–117.
Lakkaraju, H.; Bach, S. H.; and Leskovec, J. 2016. Interpretable Decision Sets: A Joint Framework for Description and Prediction. In KDD, 1675–1684.
LeCun, Y.; Bengio, Y.; and Hinton, G. 2015. Deep learning. Nature 521(7553): 436–444.
Lundberg, S. M.; and Lee, S.-I. 2017. A Unified Approach to Interpreting Model Predictions. In NIPS, 4765–4774.
Malioutov, D.; and Meel, K. S. 2018. MLIC: A MaxSAT-Based Framework for Learning Interpretable Classification Rules. In CP, 312–327.
Manquinho, V.; Flores, P.; Marques-Silva, J.; and Oliveira, A. 1997. Prime Implicant Computation Using Satisfiability Algorithms. In ICTAI, 232–239.
MaxSAT Evaluation 2020. 2020. MaxSAT Evaluation 2020. https://maxsat-evaluations.github.io/2020/.
McCluskey, E. J. 1956. Minimization of Boolean Functions. Bell System Technical Journal 35(6): 1417–1444.
Michalski, R. S. 1969. On the Quasi-Minimal Solution of the General Covering Problem. In International Symposium on Information Processing, 125–128.
Mitchell, T. M. 1997. Machine Learning. McGraw-Hill.
Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A. A.; Veness, J.; Bellemare, M. G.; Graves, A.; Riedmiller, M.; Fidjeland, A. K.; Ostrovski, G.; et al. 2015. Human-level control through deep reinforcement learning. Nature 518(7540): 529–533.
Monroe, D. 2018. AI, Explain Yourself. Commun. ACM 61(11): 11–13.
Narodytska, N.; Ignatiev, A.; Pereira, F.; and Marques-Silva, J. 2018. Learning Optimal Decision Trees with SAT. In IJCAI, 1362–1368.
Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; Vanderplas, J.; Passos, A.; Cournapeau, D.; Brucher, M.; Perrot, M.; and Duchesnay, E. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12: 2825–2830.
PennML. 2020. Penn Machine Learning Benchmarks. https://github.com/EpistasisLab/penn-ml-benchmarks.
Quine, W. V. 1952. The Problem of Simplifying Truth Functions. American Mathematical Monthly 59(8): 521–531.
Quine, W. V. 1955. A Way to Simplify Truth Functions. American Mathematical Monthly 62(9): 627–631.
Quinlan, J. R. 1986. Induction of Decision Trees. Machine Learning 1(1): 81–106.
Quinlan, J. R. 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann.
Ribeiro, M. T.; Singh, S.; and Guestrin, C. 2016. "Why Should I Trust You?": Explaining the Predictions of Any Classifier. In KDD, 1135–1144.
Rivest, R. L. 1987. Learning Decision Lists. Machine Learning 2(3): 229–246.
Ruan, W.; Huang, X.; and Kwiatkowska, M. 2018. Reachability Analysis of Deep Neural Networks with Provable Guarantees. In IJCAI, 2651–2659.
Rudin, C.; and Ertekin, S. 2018. Learning customized and optimized lists of rules with mathematical programming. Mathematical Programming Computation 10: 659–702.
Shwayder, K. 1975. Combining Decision Rules in a Decision Table.