LIBRE: Learning Interpretable Boolean Rule Ensembles

Graziano Mita, Paolo Papotti, Maurizio Filippone, Pietro Michiardi
EURECOM, 06410 Biot (France)
{mita, papotti, filippone, michiardi}@eurecom.fr

Abstract
We present a novel method – libre – to learn an interpretable classifier, which materializes as a set of Boolean rules. libre uses an ensemble of bottom-up, weak learners operating on a random subset of features, which allows for the learning of rules that generalize well on unseen data even in imbalanced settings. Weak learners are combined with a simple union so that the final ensemble is also interpretable. Experimental results indicate that libre efficiently strikes the right balance between prediction accuracy, which is competitive with black box methods, and interpretability, which is often superior to alternative methods from the literature.
1 INTRODUCTION

Model interpretability has become an important factor to consider when applying machine learning in critical application domains. In medicine, law, and predictive maintenance, to name a few, understanding the output of the model is at least as important as the output itself. However, a large fraction of models currently in use (e.g., deep nets, SVMs) favor predictive performance at the expense of interpretability.

To deal with this problem, interpretable models have flourished in the machine learning literature over the past years. Although defining interpretability is difficult (Miller, 2017; Doshi-Velez and Kim, 2017), the common goal of such methods is to provide an explanation of their output. The form and properties of the explanation are often application specific.

In this work, we focus on predictive rule learning for challenging applications where data is unbalanced.
IF mean corpuscular volume ∈ [90, 96) OR gamma glutamyl transpeptidase ∈ [20, max] THEN liver disorder = True
ELSE liver disorder = False
Figure 1: Example of rules learned by libre for Liver.

For rules, interpretability translates into simplicity, and it is measured as a function of the number of rules and their size (average number of atoms): such proxies are easy to compute, understandable, and allow comparing several rule-based models. The goal is to learn a set of rules from the training set that (i) effectively predict a given target, (ii) generalize to unseen data, and (iii) are interpretable, i.e., a small number of short rules (e.g., fig. 1). The first objective is particularly difficult to meet in the presence of imbalanced data. In this case, most rule-based methods fail at characterizing the minority class. Additional data issues that hinder the application of rule-based methods (Weiss, 2004) are data fragmentation (especially in the case of small disjuncts (Holte et al., 1989)), overlaps between imbalanced classes, and the presence of rare examples.

Many seminal rule learning methods come from the data mining community: cba (Liu et al., 1998), cpar (Yin and Han, 2003), and cmar (Li et al., 2001), for example, use mining to identify class association rules and then choose a subset of them according to a ranking to implement the classifier. In practice, however, these methods output a huge number of rules, which negatively impacts interpretability. Another family of approaches includes methods like cn2 (Clark and Niblett, 1989), foil (Quinlan and Cameron-Jones, 1993), and ripper-k (Cohen, 1995), whereby top-down learners build rules by greedily adding the condition that best explains the remaining data.
Top-down learners are well suited for noisy data and are known to find general rules (Fürnkranz et al., 2014). They work well for the so-called large disjuncts, but have difficulties identifying small disjuncts and rare examples, which are quite common in imbalanced settings. In contrast, bottom-up learners like modlem (Grzymala-Busse and Stefanowski, 2001) start directly from very specific rules (the examples themselves) and generalize them until a given criterion is met. Such methods are susceptible to noise, and tend to induce a very high number of specific rules, but are better suited for cases where only few examples characterize the target class (Fürnkranz et al., 2014). Hybrid approaches such as bracid (Napierala, 2012) take the best from both worlds: maximally-specific rules (the examples themselves) and general rules are used together in a hybrid classification strategy that combines rule learning and instance-based learning. Thus, they achieve better generalization, also in imbalanced settings, but still generate many rules, penalizing interpretability. Other approaches to tackle data-related issues include heuristics to inflate the importance of rules for minority classes (Grzymala-Busse et al., 2000; Nguyen and Ho, 2005; Blaszczynski et al., 2010).

Recent work focuses on marrying competitive predictive accuracy with high interpretability. A popular approach is to use the output of an association rule discovery algorithm (like FP-Growth) and combine the discovered rules into a small and compact subset with high predictive performance. The rule combination process can either be formalized as an integer optimization problem or solved heuristically, explicitly encoding interpretability needs in the optimization function. Such approaches have been successfully applied to rule lists (Yang et al., 2017; Chen and Rudin, 2018; Angelino et al., 2018) and rule sets (Lakkaraju et al., 2016; Wang et al., 2017). Alternatively, rules can be directly learned from the data through an integer optimization framework (Hauser et al., 2010; Chang et al., 2012; Malioutov and Varshney, 2013; Goh and Rudin, 2014; Su et al., 2016; Dash et al., 2018).

Both rule-mining and integer-optimization based approaches underestimate the complexity and importance of finding good candidate rules, and become expensive when the input dimensionality increases, unless some constraints are imposed on the size and support of the rules. Although such constraints favour interpretability, they have a negative impact on the predictive performance of the model, as we show empirically in our work. Additionally, these methods do not consider class imbalance issues.

The key idea in our work is to exploit the known advantages of bottom-up learners in imbalanced settings, and improve their generalization and noise tolerance through an ensembling technique that does not sacrifice interpretability. As a result, we produce a rule-based method that is (i) versatile and effective in dealing with both balanced and imbalanced data, (ii) interpretable, as it produces small and compact rule sets, and (iii) scalable to big datasets.
Contributions. (i) We propose libre, a novel ensemble method that, unlike other ensemble proposals in the literature (W. Cohen and Singer, 1999; Friedman and Popescu, 2008; Dembczyński et al., 2010), is interpretable. Each weak learner uses a bottom-up approach based on monotone Boolean function synthesis and generates rules with no assumptions on their size and support. Candidate rules are then combined with a simple union, to obtain a final interpretable rule set. The idea of ensembling is crucial to improve generalization, while using bottom-up weak learners allows us to generate meaningful rules even when the target class has few available samples. (ii) Our base algorithm for a weak learner, which is designed to generate a small number of compact rules, is inspired by Muselli and Quarati (2005), but it dramatically improves computational efficiency. (iii) We perform an extensive experimental validation indicating that libre scales to large datasets, has competitive predictive performance compared to state-of-the-art approaches (even black-box models), and produces few and simple rules, often outperforming existing interpretable models.
2 PROBLEM FORMULATION

Our methodology targets binary classification, although it can be easily extended to multi-class settings. For the sake of building interpretable models, we focus on Boolean functions for the mapping between inputs and labels, which are amenable to a simple interpretation.

Boolean functions can be used as a model for binary classifiers f(x) = y, where x ∈ {0, 1}^d and y ∈ {0, 1}. The function f induces a separation of {0, 1}^d into two subsets F and T, where F = {x ∈ {0, 1}^d : f(x) = 0} and T = {x ∈ {0, 1}^d : f(x) = 1}. We call such subsets negative and positive subsets, respectively. Clearly, F ∪ T = {0, 1}^d corresponds to the full truth table of the classification problem. We restrict the input space {0, 1}^d to be a partially ordered set (poset): a Boolean lattice on which we impose a partial ordering relation.
Definition 2.1. Let ∧, ∨, ¬ be the and, or, and not logic operators, respectively. A Boolean lattice is a 5-tuple ({0, 1}^d, ∧, ∨, 0, 1); the absence of the ¬ operator implies that a lattice is not a Boolean algebra. Let ≤ be a partial order relation such that x ≤ x′ ⟺ x ∧ x′ = x. Then, ({0, 1}^d, ≤) is a poset, a set on which a partial order relation has been imposed.

The theory of Boolean algebra ensures that the class B_d of Boolean functions f : {0, 1}^d → {0, 1} can be realized in terms of ∧, ∨, and ¬. However, if {0, 1}^d is a Boolean lattice, ¬ is not allowed and only a subset M_d of B_d can be realized. The class M_d coincides with the collection of monotone Boolean functions. The lack of the ¬ operator may limit the family of functions we can reconstruct. However, by applying a suitable transformation of the input space, we can enforce the monotonicity constraint (Muselli, 2005). As a consequence, it is possible to find a function f̃ ∈ M_d that approximates f ∈ B_d arbitrarily well.
Definition 2.2. Let (X, ≤) and (Y, ≤) be two posets. Then, f : X → Y is called monotone if x ≤ x′ implies f(x) ≤ f(x′).
Definition 2.3. Given x ∈ {0, 1}^d, let I_m be the set of the first m positive integers {1, . . . , m}, and let P(x) = {i ∈ I_m : x(i) = 1}. The inverse of P is denoted by p, with p(P(x), m) = x.
Definition 2.4. Let f̃ ∈ M_d be a monotone Boolean function, and let A ⊆ {0, 1}^d be a set of lattice elements. Then, f̃ can be written as f̃(x) = ⋁_{a∈A} ⋀_{j∈P(a)} x(j).

The monotone Boolean function f̃ is specified in disjunctive normal form (DNF), and is univocally determined by the set A and its elements. Thus, given F and T, learning f̃ amounts to finding a particular set of lattice elements A defining the boundary separating positive from negative samples.
Definition 2.5. Given a ∈ {0, 1}^d = T ∪ F, if a ≤ x for some x ∈ T, ∄y ∈ F : a ≤ y, and ∃y ∈ F : b ≤ y, ∀b < a, then a is a boundary point for (T, F).

The set A of boundary points defines the separation boundary. If a′ ≰ a″ and a″ ≰ a′, ∀a′, a″ ∈ A, a′ ≠ a″, then the separation boundary is irredundant. In other words, a boundary point is a lattice element that is smaller than or equal to at least one positive element in T, but not smaller than or equal to any negative element in F. In practical applications, however, we usually have access to a subset of the whole space, D+ ⊆ T and D− ⊆ F. The goal of the algorithms we present next is to approximate the boundary A, given D+ and D−.

We show that boundary points, and binary samples in general, naturally translate into classification rules. Indeed, let R be the set of rules corresponding to the discovered boundary. R(·) represents a binary classifier: R(x) = 1 if ∃r ∈ R : r(x) = 1, and R(x) = 0 otherwise. Then, x is classified as positive if there is at least one rule in R that is true for it.

3 METHODOLOGY

We presented a theoretical framework that casts binary classification as the problem of finding the boundary points for D+ ⊆ T and D− ⊆ F. Next, we use this framework to design our interpretable classifier. First, we describe a base, bottom-up method – later used as a weak learner – that illustrates how to move inside the Boolean lattice to find boundary points. However, the base method does not scale to large datasets, and tends to overfit. Thus, we present libre, an ensemble classifier that overcomes such limitations by running on randomly selected subsets of features. libre is interpretable because it combines the output of an ensemble of weak learners with a simple union operation. Finally, we present a procedure to select a subset of the generated points – the ones with the best predictive performance – and reduce the complexity of the boundary.

We assume that the input dataset is a poset and that the function we want to reconstruct is monotone. This is ensured by applying inverse one-hot encoding on discretized features, and concatenating the resulting binary features, as done in Muselli (2006). Given z ∈ I_m = {1, ..., m}, inverse one-hot encoding produces a binary string b of length m, where b(i) = 1 for i ≠ z and b(i) = 0 for i = z. More details can be found in the supplementary material.
Example 3.1. Consider a dataset with two continuous features, f1 and f2, both taking values in the domain [0, max]. Suppose f1 is discretized into the intervals [0, 40), [40, max], and f2 into [0, 30), [30, 60), [60, max]. A record with f1 = 33.0 and f2 = 44.0 is first discretized as f1′ = 1, f2′ = 2, and then binarized as 01 101. In other words, each feature of a record is encoded with a number of bits equal to the size of its discretized domain, and can have only one bit set to zero.

We develop an approximate algorithm that learns the set A for (D+, D−). The algorithm strives to find lattice elements such that both |A| and |P(a)|, ∀a ∈ A, are small, translating into a small number of sparse boundary points (short rules).
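To make the encoding concrete, the following Python sketch reproduces example 3.1; the helper names (inverse_one_hot, binarize_record) and the use of bisect are our own choices, not part of libre's released code.

```python
import bisect

def inverse_one_hot(z, m):
    """Encode z in {1, ..., m} as an m-bit string with a single 0 at position z."""
    return [0 if i == z else 1 for i in range(1, m + 1)]

def binarize_record(record, cut_points):
    """Discretize each continuous feature against its cut points, then
    concatenate the inverse one-hot encodings of the interval indices."""
    bits = []
    for value, cuts in zip(record, cut_points):
        z = bisect.bisect_right(cuts, value) + 1   # 1-based interval index
        bits += inverse_one_hot(z, len(cuts) + 1)
    return bits

# Example 3.1: f1 is cut at 40; f2 is cut at 30 and 60.
cuts = [[40], [30, 60]]
print(binarize_record([33.0, 44.0], cuts))  # [0, 1, 1, 0, 1] -> "01 101"
```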
Algorithm Design. To proceed with the presentation of our algorithm, we need the following definitions:
Definition 3.1. Given two lattice elements x, x′ ∈ {0, 1}^d, we say that x′ covers x if and only if x′ ≤ x.
Definition 3.2. Given a lattice element x ∈ {0, 1}^d, flipping off the k-th component of x produces an element z such that z(i) = x(i) for i ≠ k and z(i) = 0 for i = k.
Definition 3.3. Given a positive binary sample x ∈ D+, we say that a flip-off operation produces a conflict if the lattice element z resulting from the flip-off is such that ∃x′ ∈ D− : z ≤ x′.

Then, a boundary point is a lattice element that covers at least one positive sample, and for which any additional flip-off operation would produce a conflict, as defined above.
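Since the partial order reduces to a bitwise subset test, definitions 3.1–3.3 translate directly into code. The following is a minimal sketch under our own naming, with lattice elements represented as 0/1 lists.

```python
def leq(x, xp):
    """x <= x' in the lattice: every 1 of x is also a 1 of x'."""
    return all(xi <= xpi for xi, xpi in zip(x, xp))

def covers(a, x):
    """A boundary point / rule a covers sample x iff a <= x (def. 3.1)."""
    return leq(a, x)

def flip_off(x, k):
    """Set the k-th bit (0-based) of x to 0 (def. 3.2)."""
    z = list(x)
    z[k] = 0
    return z

def conflict(z, D_neg):
    """z conflicts with the negatives iff some x' in D- covers it (def. 3.3)."""
    return any(leq(z, xp) for xp in D_neg)
```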
Algorithm 1: FindBoundary

    Set A = ∅ and S = D+;
    while S ≠ ∅ do
        Choose x ∈ S;
        Set I = P(x), J = ∅;
        FindBoundaryPoint(A, I, J);
        Remove from S the elements covered by a, ∀a ∈ A;
    end

Algorithm 1 presents the main steps of our algorithm, where A is the boundary set and S = {s ∈ D+ : ∄a ∈ A, a ≤ s} is the set of elements in D+ that are not covered by a boundary point in A. I is the set of indexes of the components of the current positive sample x that can be flipped off, and J is the set of indexes that cannot be flipped off without creating a conflict with D−. While S is not empty, an element x is picked from S. Then, the procedure FindBoundaryPoint is used to generate one or more boundary points by flipping off the candidate bits of x. According to definition 3.2, a boundary point is generated when an additional flip-off would lead to a conflict, in the sense of definition 3.3. When the FindBoundaryPoint procedure completes its operation, both A and S are updated.
Example 3.2. Let D+ = {11001} and D− = {01011, 10011}. Take the positive sample 11001, for which I = {1, 2, 5} and J = ∅. Suppose that FindBoundaryPoint flips off the bits in I from left to right. Flipping off the first bit generates 01001 ≤ 01011 ∈ D−. The first bit is therefore moved to J and kept at 1. Flipping off the second bit generates 10001 ≤ 10011 ∈ D−. Also the second bit is moved to J. We finally flip off the last bit and obtain 11000, which is not in conflict with any element in D−. 11000 is therefore a boundary point for (D+, D−).

If we think about binary samples in terms of rules, a positive sample can be seen as a maximally-specific rule, with equality conditions on the input features (the value that particular feature takes on that particular sample). Flipping off bits is nothing more than generalizing that rule. Our goal is to perform as many flip-off operations as possible before running into a conflict. Retrieving the complete set of boundary points requires an exhaustive search over the lattice, which is expensive – its cost grows exponentially with the dimension d of the Boolean lattice – restricting its application to small, low-dimensional datasets. In this work, we propose an approximate heuristic for the FindBoundaryPoint procedure.
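The sketch below replays example 3.2 with the naive left-to-right scan used there, reusing flip_off and conflict from the previous sketch; it illustrates the flip-off mechanics only, not the quality-driven heuristic introduced next, and the D− values are our reconstruction of the example.

```python
def generalize_left_to_right(x, D_neg):
    """Flip off the 1-bits of x from left to right, keeping a bit at 1
    whenever flipping it would create a conflict with D- (example 3.2)."""
    a = list(x)
    for k in range(len(a)):
        if a[k] == 1:
            z = flip_off(a, k)
            if not conflict(z, D_neg):
                a = z                       # safe generalization
    return a                                # a boundary point

D_neg = [[0, 1, 0, 1, 1], [1, 0, 0, 1, 1]]  # 01011, 10011
print(generalize_left_to_right([1, 1, 0, 0, 1], D_neg))  # [1, 1, 0, 0, 0]
```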
Finding Boundary Points.
The key idea is to find a subset of all possible boundary points, steering their selection through a measure of their quality. A boundary point is considered to be "good" if it contributes to decreasing the complexity of the resulting boundary set, which is measured in terms of its cardinality |A| and the total number of positive bits ∑_{a∈A} |P(a)|. In practice, |A| can be decreased by choosing boundary points that cover the largest number of elements in S. To do this, we iteratively select the best candidate index i ∈ I according to a measure of potential coverage. Decreasing ∑_{a∈A} |P(a)| implies finding boundary points with a low number of 1s. Before proceeding, we define a notion of distance between lattice elements:
Definition 3.4. Given x, x′ ∈ {0, 1}^d, the distance d_l(x, x′) between x and x′ is defined as d_l(x, x′) = ∑_{i=1}^{d} |x(i) − x′(i)|_+, where |·|_+ is equal to 1 if (·) ≥ 1, and 0 otherwise.
Definition 3.5. In the same way, we can define the distance between a lattice element x and a set V as d_l(x, V) = min_{x′∈V} d_l(x, x′).

Every boundary point a for (D+, D−) has distance d_l(a, D−) = 1; in fact, boundary points are all lattice elements for which a flip-off would generate a conflict. In the iterative selection of the best index i ∈ I to be flipped off, indexes having high d_l(p(I ∪ J), D−_i) are preferred, where D−_i = {x ∈ D− : x(i) = 0}, because they are the ones that contribute most to reducing the number of 1s of a potential boundary point.
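Definitions 3.4 and 3.5 in code, again assuming 0/1 lists; d_l counts the positions where x has a 1 that x′ lacks.

```python
def d_l(x, xp):
    """Distance of def. 3.4: number of positions where x(i)=1 and x'(i)=0."""
    return sum(1 for xi, xpi in zip(x, xp) if xi - xpi >= 1)

def d_l_set(x, V):
    """Distance of def. 3.5: minimum distance from x to a non-empty set V
    (raises ValueError on an empty V, i.e., the distance is undefined)."""
    return min(d_l(x, xp) for xp in V)

# A boundary point sits at distance 1 from the negatives:
print(d_l_set([1, 1, 0, 0, 0], [[0, 1, 0, 1, 1], [1, 0, 0, 1, 1]]))  # 1
```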
Algorithm 2: FindBoundaryPoint(A, I, J)

    For each i ∈ I compute |S_i|, |D+0_i|, d_l(p(I ∪ J), D−_i);
    while I ≠ ∅ do
        Move from I to J all i with d_l(p(I ∪ J), D−_i) = 1;
        if I = ∅ then break;
        Choose the best index i ∈ I;
        Remove i from I;
        For each i ∈ I update d_l(p(I ∪ J), D−_i);
    end
    if there is no a ∈ A : p(J) ≥ a then
        Set A = A ∪ {p(J)};
    end

Algorithm 2 illustrates our approximate procedure, where S_i = {s ∈ S : s(i) = 0} and D+0_i = {t ∈ D+ : t(i) = 0} are proxies for the potential coverage of flipping off a given bit i. The first step of the algorithm computes, for each index i ∈ I, the terms |S_i| and |D+0_i| indicating its potential coverage, and d_l(p(I ∪ J), D−_i). While the set I is not empty, indexes inducing a unit distance to D− are moved to J. Then, we choose the best index i_best among the remaining indices in I, using one of two greedy heuristics: we can choose to optimize either for the tuple H1 = (|S_i|, |D+0_i|, d_l(p(I ∪ J), D−_i)) or for the tuple H2 = (d_l(p(I ∪ J), D−_i), |S_i|, |D+0_i|). H1 prioritizes a lower number of boundary points, while H2 tends to generate boundary points with fewer 1s. When I is empty, p(J) is added to the boundary set A if A does not already contain an element covering p(J). Note that, in algorithm 2, the distance is computed only once, and updated at each iteration. This is because only one bit is selected and removed from I; then, p(I ∪ J)_new = p((I ∪ J)_old \ {i}). Formally, we apply definition 3.4 exclusively for i = i_best.
Example 3.3. Let D+ = {10101, 01101, 01110} and D− = {11010, 11100}. We describe the procedure for a few steps and only for the first positive sample, 10101. Suppose we optimize the tuple H1 = (|S_i|, |D+0_i|, d_l(p(I ∪ J), D−_i)). For 10101 we have I = {1, 3, 5} and J = ∅. At the beginning S = D+. |D+0_1| = 2, |D+0_3| = 0, |D+0_5| = 1. D−_1 = ∅, D−_3 = {11010}, D−_5 = {11010, 11100}. Consequently: d_l(p(I ∪ J), D−_1) = undefined, d_l(p(I ∪ J), D−_3) = 2, d_l(p(I ∪ J), D−_5) = 1. Bit 5 is moved to J. Bit 1 has the highest value of |D+0_i| and is selected as the best candidate to be flipped off. The distance is recalculated and the procedure continues until the set of candidate bits I is empty.

The algorithmic complexity of algorithm 1, when it runs algorithm 2, is polynomial in the number n of distinct training samples and the lattice dimension d. This is dramatically faster than the exhaustive algorithm, and improves on the complexity of Muselli and Quarati (2005). We also point out that most sequential-covering algorithms repeatedly remove the samples covered by the new rules, forcing the induction phase to work in a more partitioned space with less data, especially affecting minority rules, which already rely on few samples. The problem is mitigated in our solution: although the use of S cannot avoid this behavior, our heuristics keep a global and constant view of both D−, in the conflict detection, and D+, in the discrimination of the best bits to flip.
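For concreteness, here is a compact Python sketch of the approximate procedure with the H1 ordering, reusing d_l_set and the 0/1-list representation from the previous sketches; the update of A and S (algorithm 1) is omitted, and distances are recomputed at every iteration instead of updated incrementally, so this is illustrative rather than optimized.

```python
def p(indices, d):
    """Inverse of P: the lattice element with 1s exactly at the 1-based indices."""
    return [1 if i + 1 in indices else 0 for i in range(d)]

def find_boundary_point(S, D_pos, D_neg, I, d):
    """Approximate FindBoundaryPoint with the H1 ordering (illustrative)."""
    I, J = set(I), set()
    while I:
        x = p(I | J, d)
        # Move to J every index whose flip-off would create a conflict:
        # a negative with bit i = 0 at distance 1 differs from x exactly in i.
        for i in sorted(I):
            D_neg_i = [t for t in D_neg if t[i - 1] == 0]
            if D_neg_i and d_l_set(x, D_neg_i) == 1:
                I.discard(i)
                J.add(i)
        if not I:
            break
        def h1(i):
            cov_S = sum(1 for s in S if s[i - 1] == 0)        # |S_i|
            cov_D = sum(1 for t in D_pos if t[i - 1] == 0)    # |D+0_i|
            D_neg_i = [t for t in D_neg if t[i - 1] == 0]
            dist = d_l_set(x, D_neg_i) if D_neg_i else float("inf")
            return (cov_S, cov_D, dist)
        I.discard(max(I, key=h1))      # flip off the best candidate bit
    return p(J, d)                      # candidate boundary point
```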
From Boundary Set To Rules.

Each element a of the boundary set A can be practically seen as the antecedent of an if-then rule having as target the positive class. When a binary sample x is presented to a, the rule outputs 1 only if x has a 1 in all positions where a has value 1, that is, if a ≤ x. Then, the antecedent of the rule is expressed as a function of the input features in the original domain.
Example 3.4. Consider a dataset with two continuous features, f1 and f2, discretized as follows: f1: [0, 40), [40, max); f2: [0, 30), [30, 60), [60, max). Let A = {01 100}. From the boundary point we obtain a rule as follows: the first two bits, referring to feature f1 – 01 – are mapped to "if f1 ∈ [0, 40)"; the last three bits, referring to f2 – 100 – are mapped to "if f2 ∈ [30, max)". The resulting rule is "if f1 ∈ [0, 40) and f2 ∈ [30, max) then label = 1".
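The mapping of example 3.4 can be sketched as follows, assuming the same cut-point representation as in the earlier encoding sketch and domains starting at 0 as in the examples: within each feature block, the 0-bits of the boundary point mark the allowed intervals.

```python
def boundary_point_to_rule(a, cut_points, names):
    """Map a boundary point to a human-readable antecedent (example 3.4)."""
    conds, start = [], 0
    for cuts, name in zip(cut_points, names):
        m = len(cuts) + 1
        block = a[start:start + m]
        start += m
        edges = ["0"] + [str(c) for c in cuts] + ["max"]
        allowed = [z for z in range(m) if block[z] == 0]
        if len(allowed) < m:   # a block of all 0s leaves the feature unconstrained
            ivals = [f"[{edges[z]}, {edges[z + 1]})" for z in allowed]
            conds.append(name + " ∈ " + " ∪ ".join(ivals))
    return "IF " + " AND ".join(conds) + " THEN label = 1"

cuts = [[40], [30, 60]]
print(boundary_point_to_rule([0, 1, 1, 0, 0], cuts, ["f1", "f2"]))
# IF f1 ∈ [0, 40) AND f2 ∈ [30, 60) ∪ [60, max) THEN label = 1
```

Merging the adjacent intervals ([30, 60) ∪ [60, max) = [30, max)) yields exactly the rule in the example.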
The base approach generates boundary points by generalizing input samples, i.e., by flipping off positive bits if no conflict with negative samples is encountered. The hypothesis underlying this procedure is that when no conflicts are found, a boundary point induces a valid rule. However, such a rule might be violated when used on unseen data. Stopping the flipping-off procedure as soon as a single conflict is found has two main effects: i) we obtain very specific rules, which might be simplified if the approach could tolerate a limited number of conflicts; ii) the rules cover no negative samples in the training set and tend to overfit.

To address these issues, a simple method would be to introduce a measure for the number of conflicts and use it as an additional heuristic in the learning process. However, this would dramatically increase the complexity of the algorithm. A more natural way to overcome such challenges is to make the algorithm directly work on (random) subsets of features; in this way, the learning process produces more general rules by construction. Randomization is a well-known technique to implement ensemble methods that provide superior classification accuracy, as demonstrated, for example, in random forests (Ho, 1998; Breiman, 2001). By using randomization, we can directly use the methodology described in the previous sections, without modifying the search procedure. The new approach – libre – is an interpretable ensemble of rules that operates on randomized subsets of features.

Formally, let E be the number of classifiers in the ensemble. For each classifier j ∈ {1, . . . , E}, we randomly sample k_j features of the original space and run algorithm 1 to produce a boundary set A_j for the reduced input space. The A_j can be generated in parallel, since weak learners are independent from each other. At this point, to make the ensemble interpretable, we crucially do not apply a voting (or aggregation) mechanism to produce the final class prediction, but a simple union, such that A = ⋃_{j=1}^{E} A_j.

We note that libre addresses the problems outlined above, as we show experimentally. By training an ensemble of weak learners that operate on a small subset of features, we artificially inflate the probability of finding negative examples. Each weak learner is constrained to run on fewer features, which not only reduces the impact of d on the execution time, but also has an immediate effect on the interpretability of the model, which is forced to generate simpler rules exactly because it operates on fewer input features. Note that there are no guarantees that elements of A_j will actually be boundary points in the full feature space: weak learners have only a partial view of the full input space and might generate rules that are not globally true. Thus, it is important to filter out the points that are clearly far from the boundary by using the selection procedure described in the next section.
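A sketch of the ensemble layer under our own naming: each weak learner sees a random subset of k features, runs the base procedure (here an abstract find_boundary callable), and the local boundary sets are lifted back to the full space and merged with a plain union. The handling of projected samples that fall in both classes is our assumption, not a detail specified above.

```python
import random

def libre_ensemble(D_pos, D_neg, d, E, k, find_boundary, seed=0):
    """Union of E weak learners, each trained on k randomly chosen features."""
    rng = random.Random(seed)
    A = []
    for _ in range(E):
        feats = sorted(rng.sample(range(d), k))
        # Project (and deduplicate) the samples onto the selected features.
        P_pos = {tuple(x[i] for i in feats) for x in D_pos}
        P_neg = {tuple(x[i] for i in feats) for x in D_neg}
        P_pos -= P_neg   # drop projections appearing in both classes (our choice)
        for a in find_boundary(P_pos, P_neg):
            full = [0] * d   # 0s outside feats leave those bits unconstrained
            for bit, i in zip(a, feats):
                full[i] = bit
            if full not in A:        # simple union, no voting
                A.append(full)
    return A
```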
The model learned by our greedy heuristic materializes as a set A, which might contain a large number of elements and, in the case of libre, might also contain elements that cover many negative samples. Next, we explain how to produce a boundary set A* with a good tradeoff between complexity and predictive performance. This can be cast as a weighted set cover problem. Since exploring all possible subsets of elements in A can be computationally demanding, we use a standard greedy weighted set cover algorithm. Each element a ∈ A is assigned a weight that is proportional to the number of positive and negative covered samples; the importance of the two contributions is governed by a parameter α. At each iteration, the element a with the highest weight is selected; if there is more than one, the element with the highest number of zeros is preferred. All samples covered by the selected element are removed, and the weights are recalculated. The process continues until either all samples are covered or a stopping condition is met. Before running the selection procedure, with the aim of speeding up execution times, we may first apply a filtering procedure to reduce the size of the initial set to a small number of good candidates: as proposed by Gu et al. (2003), we select the top K rules according to exclusiveness and local support, which are better suited than confidence and support for imbalanced settings.
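A minimal sketch of the greedy selection stage, reusing covers from the earlier sketch; the exact weight function is our guess at "proportional to the number of positive and negative covered samples" with trade-off α, so treat it as illustrative.

```python
def select_rules(A, D_pos, D_neg, alpha=0.7, max_rules=None):
    """Greedy weighted set cover over the candidate boundary points in A."""
    remaining = list(D_pos)
    selected = []
    while remaining and (max_rules is None or len(selected) < max_rules):
        def weight(a):
            pos = sum(covers(a, x) for x in remaining)
            neg = sum(covers(a, x) for x in D_neg)
            return alpha * pos - (1 - alpha) * neg
        # Tie-break on the number of zeros (more general rules first).
        best = max(A, key=lambda a: (weight(a), sum(b == 0 for b in a)))
        if weight(best) <= 0:
            break                     # stopping condition: no useful rule left
        selected.append(best)
        remaining = [x for x in remaining if not covers(best, x)]
    return selected
```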
4 EXPERIMENTS

We evaluate libre in terms of predictive performance, interpretability, and scalability, and compare it with other rule-based methods and black-box models.

Datasets.
We report the results for seven publicly available datasets from the UCI repository and two real industrial IT datasets, proprietary of Sap. Results on other UCI datasets are in the supplementary material. These datasets cover several domains and have different imbalance ratios, numbers of records, and numbers of features, as summarized in table 1.
Dataset      Records  Features  Imb. ratio
Adult        –        –         –
Australian   690      14        .44
Bank         –        –         –
Ilpd         583      10        .28
Liver        345      5         .51
Pima         768      8         .35
Transfusion  748      5         .24
Sap-Clean    –        –         –
Sap-Full     –        –         –

Table 1: Characteristics of evaluated datasets.

Some of these datasets have been used to evaluate methods for class imbalance (Van Hulse et al., 2007) and present characteristics that make them difficult to learn: overlapping classes, noisy and rare examples. All datasets have, or were transformed to have, a binary class.
The Sap datasets consist of monitoring data collected across database systems. They consist of 45 features, hand-crafted by domain experts based on low-level system metrics. Sap runs a predictive maintenance system on this data and notifies customers, who confirm or discard the warnings: we use these as binary labels. Sap-Clean is the clean version of Sap-Full, where we removed records with at least one missing value.
Comparison With Other Methods. We compare libre with two recent works: Scalable Bayesian Rule Lists (s-brl) (Yang et al., 2017) and Bayesian Rule Sets (brs) (Wang et al., 2017). We also report the results for weka implementations of ripper-k (Cohen, 1995) and modlem (Grzymala-Busse and Stefanowski, 2001) – as representatives of top-down and bottom-up approaches – and scikit-learn implementations of Decision Tree (dt) (Breiman et al., 1984), Support Vector Machine with RBF kernel (rbf-svm) (Cortes and Vapnik, 1995), and random forests (rf) (Breiman, 2001). rbf-svm and rf are selected as popular black-box models; rf is also a representative ensemble method. Other relevant methods are not publicly available (cg (Dash et al., 2018)), do not work properly (ids (Lakkaraju et al., 2016)), or are only partially implemented (bracid (Napierala, 2012)).
Parameter Tuning. All results refer to stratified 5-fold cross validation, where the same splits are used for all tested methods. The initial set of candidate rules for s-brl and brs is generated by running FP-Growth with a minimum support of 1 and a maximum mining length of 5. We also optimize brs and s-brl's prior hyperparameters by cross validation. For brs, we run 2 chains of 500 iterations. For ripper-k, we vary the number of optimization steps between 1 and 5, and activate pruning. For modlem, we try all available classification strategies and condition measures. For rbf-svm, we optimize C and γ. For dt and rf, we optimize the maximum depth (including the unbounded setting), try all possible options for the maximum number of features, and vary the number of trees for rf. For libre, we vary the number of weak learners E; each weak learner uses up to 5 features. Additionally, we try the two heuristics H1 and H2 to generate rules and vary α for weighted set cover. Parameters not reported above are all fixed to recommended or default values.
Dataset      rbf-svm   rf        dt        ripper-k  modlem    s-brl     brs       libre     libre-3
Adult        .62(.01)  .68(.01)  .68(.01)  .59(.02)  .66(.01)  .68(.01)  .61(.01)  .70(.01)  .62(.01)
Australian   .83(.02)  .86(.02)  .84(.02)  .85(.02)  .68(.28)  .82(.03)  .83(.03)  .84(.03)  .84(.03)
Bank         .46(.01)  .50(.01)  .50(.01)  .44(.04)  .50(.03)  .50(.02)  .32(.05)  .55(.01)  .44(.01)
Ilpd         .47(.02)  .44(.08)  .42(.10)  .20(.11)  .48(.08)  .14(.13)  .09(.08)  .54(.06)  .52(.04)
Liver        .58(.08)  .58(.07)  .56(.10)  .59(.04)  .58(.07)  .54(.03)  .61(.05)  .60(.07)  .63(.06)
Pima         .61(.04)  .63(.04)  .60(.01)  .60(.03)  .38(.18)  .61(.07)  .03(.03)  .64(.05)  .64(.05)
Transfusion  .41(.07)  .35(.06)  .35(.05)  .42(.10)  .42(.08)  .05(.10)  .04(.05)  .49(.12)  .49(.12)
Sap-Clean    .93(.02)  .93(.01)  .85(.03)  .86(.02)  .88(.01)  .90(.01)  .68(.03)  .95(.02)  .72(.03)
Sap-Full     -         -         -         -         -         .81(.02)  -         .89(.03)  .68(.04)
Avg Rank     4.7(1.2)  3.3(1.6)  4.9(2.1)  5.3(2.1)  4.9(2.2)  5.2(2.8)  7.2(2.5)  –         –

Table 2: F1-score (st. dev. in parenthesis).
Dataset      dt        ripper-k  modlem    s-brl      brs       libre      libre-3
Adult        –         –         –         –          –         –          –
Australian   –         –         –         –          –         –          –
Bank         –         –         –         –          –         –          –
Ilpd         –         –         –         –          –         –          –
Liver        –         –         –         –          –         –          –
Pima         –         –         –         –          –         –          –
Transfusion  –         –         –         –          –         –          –
Sap-Clean    –         –         –         –          –         –          –
Sap-Full     -         -         -         56.4(4.6)  -         17.5(5.2)  –
Avg Rank     5.7(0.7)  3.2(1.0)  6.7(0.9)  4.9(0.7)   1.9(1.2)  2.9(0.9)   –

Table 3: Number of rules (st. dev. in parenthesis).
Data Preprocessing. Before running rbf-svm, we standardize the input data to get better results; the remaining methods showed no benefit from standardization in our experiments. For s-brl and libre, we apply the ChiMerge discretization algorithm (Kerber, 1992), optimizing the discretization threshold during training; in brs, discretization is instead controlled by an internal parameter, also optimized during training. The remaining algorithms have no explicit need for discretization. For the methods requiring binarization, we apply one-hot encoding, except for libre, which uses inverse one-hot encoding.
Evaluation Metrics. We use the F1-score to compare the predictive performance of the classifiers, as it is well suited to evaluate the capability to characterize the target class in both balanced and imbalanced settings. For rule-based methods, we use standard metrics from the literature to evaluate the interpretability of the rule sets, namely the number of rules that implement a model, and the average number of atoms per rule. For dt, we extract the rules following the paths from root to leaves: this captures the perception of a user who looks at the tree to understand the output of the model. For s-brl, the number of atoms in a rule is equal to the sum of the atoms in the previous rules, highlighting the fact that a user has to go through all the rules up to the one that returns the label. For all rule-based methods, we change inequalities (<, ≤, >, ≥) to ranges to have a fair comparison. For example, f ≥ 3 becomes f ∈ [3, max].
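For reference, the evaluation protocol (stratified 5-fold cross validation with F1 on the positive class) corresponds to the following scikit-learn sketch; the random-forest baseline and random data stand in for the actual models and datasets.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold

def evaluate(clf, X, y, seed=0):
    """Stratified 5-fold CV; the same splits can be reused across methods."""
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    scores = []
    for train, test in skf.split(X, y):
        clf.fit(X[train], y[train])
        scores.append(f1_score(y[test], clf.predict(X[test])))
    return np.mean(scores), np.std(scores)

# Example with the rf baseline on random data:
X, y = np.random.rand(200, 8), np.random.randint(0, 2, 200)
print(evaluate(RandomForestClassifier(n_estimators=100), X, y))
```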
Predictive Performance Evaluation. Table 2 shows the means and standard deviations of the F1-score for the tested algorithms, together with the rank of their average performance. We additionally report the results for libre when it is constrained to generate at most 3 rules (libre-3). Overall, libre emerges as the best method, beating both rbf-svm and rf, demonstrating its versatility in both balanced and imbalanced settings. libre-3, dt, modlem, s-brl and ripper-k show very similar performance, even if modlem is usually worse in balanced settings. brs is the worst method in terms of predictive performance. Focusing on the single datasets, we can see that, except for Australian, libre consistently obtains the highest F1-score. In Bank, Ilpd, and
Transfusion, the gap between libre and the closest competitor is significant; the gap is even larger in comparison to alternative rule-based methods. For the remaining datasets, the differences with the competitors are less pronounced but still significant. In particular, Ilpd seems to be very problematic for most of the tested methods: ripper-k, brs and s-brl do not learn anything useful about the positive class; modlem performs marginally better. From a deeper analysis, it emerges that Ilpd is an imbalanced dataset with overlapping classes: rules learned by libre have an error rate close to 50% on the training set, a consequence of the class imbalance. ripper-k is not able to learn these rules, whereas the selection stage of brs and s-brl does not include such rules in the final set even when they are in the set of candidate mined rules.

With Sap-Clean, libre-3 still outperforms brs, but limiting the number of rules to 3 causes a significant drop in F1-score w.r.t. libre. The situation is different for Sap-Full, the original version of the dataset also containing missing values. From table 1, Sap-Full is more than five times bigger than Sap-Clean, indicating that missing values are not a negligible problem in real scenarios. A method that runs without additional preprocessing is thus truly desirable. Only libre and s-brl fit this requirement, while ripper-k, brs, and modlem natively manage missing values for categorical features only, and require additional preprocessing for continuous features. Despite the huge number of missing values, results for libre on Sap-Full are comparable to those of other rule-based methods executed on Sap-Clean.
Interpretability Evaluation. Next, using table 3, we evaluate interpretability in terms of the quantity and simplicity of rules. In our analysis, we also refer to table 2, to measure the trade-off that exists between interpretability and predictive performance.

In terms of number of rules, libre is better than ripper-k on average, indicating that it indeed overcomes the limitations of bottom-up learners like modlem, which is instead the worst method together with dt. s-brl is competitive for small datasets, but the number of rules increases considerably for bigger datasets like Adult, Bank, and Sap. Overall, brs generates compact rule sets, with only one rule for half of the tested datasets. However, we should also notice that, except for Liver, these are the same datasets that give an F1-score close to zero. In terms of the average number of atoms per rule, libre produces short rules (detailed results are in the supplementary material).
Only s-brl has issues when the number of rules is significant (as in the Adult, Bank, and Sap datasets): indeed, in rule lists every rule depends on the previous ones, and the number of atoms easily explodes.

Records     ripper-k  modlem  brs  libre
10'000      –         –       –    –
100'000     –         –       –    –
500'000     –         –       –    –
1'000'000   –         –       –    –

Table 4: Runtime in seconds (st. dev. in parenthesis).
Scalability Evaluation.
Table 4 shows the run time for libre and three representative rule-based competitors on synthetic balanced datasets with 10 features and a varying number of records: from 10'000 to 1'000'000. For each configuration, we randomly generate the dataset 3 times and report the average run time and standard deviation. All methods are tested with their default parameters and run sequentially, for a fair comparison. For libre, the time refers to one weak learner, which is also a good approximation of the computing time of E parallel weak learners. The symbol "-" identifies out-of-memory errors.

modlem and brs fail with an out-of-memory error on the 500'000- and 1'000'000-record datasets. They also show much higher run times than ripper-k and libre on the smaller datasets; the latter two are instead able to complete their training in a few minutes even on the largest datasets.

Note that each weak learner in libre works with D+ and D− sets that consist of distinct records: even if the original dataset has millions of entries, the number of binary records processed by the algorithm is much lower, especially when the number of input features of each weak learner is relatively low. We also point out that, for practical applications where interpretability is needed, it is more convenient to limit the number of features and train a bigger ensemble with more learners to quickly generate understandable rules.

5 CONCLUSION

Model interpretability has recently become of primary importance in many applications. In this work, we focused on the task of learning a set of rules which specify, using Boolean expressions, the classification model. We devised a practical method based on monotone Boolean function synthesis to learn rules from data. Our approach uses an ensemble of bottom-up learners that generalizes better than traditional bottom-up methods, and that works well in both balanced and imbalanced scenarios. Interpretability needs can be easily encoded in the rule generation and selection procedure, which produces short and compact rule sets. Our experiments show that libre strikes the right balance between predictive performance and interpretability, often outperforming alternative approaches from the literature. For future work, we will extend our model considering noisy labels and a Bayesian formulation.
Acknowledgements

The authors wish to thank SAP Labs France for support.
References
Apache Spark. http://spark.apache.org/.
E. Angelino, N. Larus-Stone, D. Alabi, M. Seltzer, and C. Rudin. Learning certifiably optimal rule lists for categorical data. JMLR, 18(234):1–78, 2018.
J. Blaszczynski, M. Deckert, J. Stefanowski, and S. Wilk. Integrating selective pre-processing of imbalanced data with ivotes ensemble. In Proc. of the 7th Int. Conf. on Rough Sets and Current Trends in Computing, RSCTC, pages 148–157, 2010.
L. Breiman. Random forests. Mach. Learn., 45(1):5–32, 2001.
L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth and Brooks, Monterey, CA, 1984.
A. Chang, D. Bertsimas, and C. Rudin. An integer optimization approach to associative classification. In Proc. of the 24th Int. Conf. on Neural Information Processing Systems, NIPS, pages 3302–3310, 2012.
C. Chen and C. Rudin. An optimization approach to learning falling rule lists. In Proc. of the 21st Int. Conf. on Artif. Intel. and Stat., AISTATS, pages 604–612, 2018.
P. Clark and T. Niblett. The CN2 induction algorithm. Mach. Learn., 3(4):261–283, 1989.
W. W. Cohen. Fast effective rule induction. In Proc. of the 20th Int. Conf. on Mach. Learn., ICML, pages 115–123, 1995.
C. Cortes and V. Vapnik. Support-vector networks. Mach. Learn., 20(3):273–297, 1995.
S. Dash, O. Günlük, and D. Wei. Boolean decision rules via column generation. In Proc. of the 31st Int. Conf. on Neural Information Processing Systems, NIPS, 2018.
K. Dembczyński, W. Kotłowski, and R. Słowiński. ENDER: a statistical framework for boosting decision rules. Data Min. and Knowl. Disc., 21(1):52–90, 2010.
F. Doshi-Velez and B. Kim. A roadmap for a rigorous science of interpretability. CoRR in arXiv, 2017.
J. H. Friedman and B. E. Popescu. Predictive learning via rule ensembles. The Annals of Appl. Stat., 2(3):916–954, 2008.
J. Fürnkranz, D. Gamberger, and N. Lavrač. Foundations of Rule Learning. Springer Publishing Company, Incorporated, 2014.
S. T. Goh and C. Rudin. Box drawings for learning with imbalanced data. In Proc. of the 20th Int. Conf. on Knowl. Disc. and Data Min., KDD, pages 333–342, 2014.
J. W. Grzymala-Busse and J. Stefanowski. Three discretization methods for rule induction. Int. Journal of Intelligent Systems, pages 29–38, 2001.
J. W. Grzymala-Busse, L. K. Goodwin, W. J. Grzymala-Busse, and X. Zheng. An approach to imbalanced data sets based on changing rule strength. In Rough-Neural Computing: Techniques for Computing with Words, 2000.
L. Gu, J. Li, H. He, G. Williams, S. Hawkins, and C. Kelman. Association rule discovery with unbalanced class distributions. In Proc. of the 16th Austr. Conf. on Artif. Intel., AI, pages 221–232, 2003.
J. R. Hauser, O. Toubia, T. Evgeniou, R. Befurt, and D. Dzyabura. Disjunctions of conjunctions, cognitive simplicity, and consideration sets. Journal of Marketing Res., 47(3):485–496, 2010.
T. K. Ho. The random subspace method for constructing decision forests. IEEE Trans. on Pattern Analysis and Machine Intel., TPAMI, 20(8):832–844, 1998.
R. C. Holte, L. E. Acker, and B. W. Porter. Concept learning and the problem of small disjuncts. In Proc. of the 11th Int. Joint Conf. on Artif. Intel., IJCAI, pages 813–818, 1989.
R. Kerber. ChiMerge: Discretization of numeric attributes. In Proc. of the 10th Nat. Conf. on Artif. Intel., AAAI, pages 123–128, 1992.
H. Lakkaraju, S. H. Bach, and J. Leskovec. Interpretable decision sets: A joint framework for description and prediction. In Proc. of the 22nd Int. Conf. on Knowl. Disc. and Data Min., KDD, pages 1675–1684, 2016.
W. Li, J. Han, and J. Pei. CMAR: accurate and efficient classification based on multiple class-association rules. In Proc. of the 2001 IEEE Int. Conf. on Data Min., ICDM, pages 369–376, 2001.
B. Liu, W. Hsu, and Y. Ma. Integrating classification and association rule mining. In Proc. of the 4th Int. Conf. on Knowl. Disc. and Data Min., KDD, pages 80–86, 1998.
D. Malioutov and K. Varshney. Exact rule learning via boolean compressed sensing. In Proc. of the 30th Int. Conf. on Mach. Learn., ICML, pages 765–773, 2013.
T. Miller. Explanation in artificial intelligence: Insights from the social sciences. CoRR in arXiv, 2017.
M. Muselli. Approximation properties of positive boolean functions. In Proc. of the 16th WIRN/NAIS, 2005.
M. Muselli. Switching neural networks: A new connectionist model for classification. In B. Apolloni, M. Marinaro, G. Nicosia, and R. Tagliaferri, editors, Neural Nets, pages 23–30, Berlin, Heidelberg, 2006. Springer Berlin Heidelberg.
M. Muselli and A. Quarati. Reconstructing positive boolean functions with shadow clustering. In Proc. of the 2005 Europ. Conf. on Circuit Theory and Design, ECCTD, volume 3, pages III/377–III/380, 2005.
K. Napierala. BRACID: a comprehensive approach to learning rules from imbalanced data. Journal of Intel. Information Systems, 39(2):335–373, 2012.
C. H. Nguyen and T. Ho. An imbalanced data rule learner. In Proc. of the 9th Europ. Conf. on Principles and Practice of Knowl. Disc. in Databases, PKDD, volume 3721, pages 617–624, 2005.
J. R. Quinlan and R. M. Cameron-Jones. FOIL: A midterm report. In Proc. of the 4th Europ. Conf. on Mach. Learn., ECML, pages 1–20, 1993.
G. Su, D. Wei, K. R. Varshney, and D. M. Malioutov. Learning sparse two-level boolean rules. In Proc. of the IEEE 26th Int. Workshop on Mach. Learn. for Signal Processing, MLSP, pages 1–6, 2016.
J. Van Hulse, T. M. Khoshgoftaar, and A. Napolitano. Experimental perspectives on learning from imbalanced data. In Proc. of the 24th Int. Conf. on Mach. Learn., ICML, pages 935–942, 2007.
W. W. Cohen and Y. Singer. A simple, fast, and effective rule learner. In Proc. of the 16th Nat. Conf. on Artif. Intel., AAAI, 1999.
T. Wang, C. Rudin, F. Doshi-Velez, Y. Liu, E. Klampfl, and P. MacNeille. A Bayesian framework for learning rule sets for interpretable classification. JMLR, 18(70):1–37, 2017.
G. M. Weiss. Mining with rarity: A unifying framework. SIGKDD Explorations, 6(1):7–19, June 2004.
H. Yang, C. Rudin, and M. Seltzer. Scalable Bayesian rule lists. In Proc. of the 34th Int. Conf. on Mach. Learn., ICML, pages 3921–3930, 2017.
X. Yin and J. Han. CPAR: Classification based on Predictive Association Rules, pages 331–335. SIAM, 2003.
A THE BASE METHOD STEP BY STEP

In this section, we show in detail the main steps of the base algorithm, using a concrete example. Consider the scenario of forecasting the failure condition of an IT system from two values representing the CPU and main memory (MEM) utilization, as depicted in the first two columns of table 5. We assume that CPU and MEM are continuous features with values in the domain [0, 100]. The last column reports the binary Label, where 1 represents a system failure. The example reports eight records, of which two are failures.
      CPU  MEM  r1  r2  String  Label
t1    95   10   3   1   110 01  1
t2    80   10   1   1   011 01  0
t3    81   85   2   2   101 10  1
t4    10   85   1   2   011 10  0
t5    10   10   1   1   011 01  0
t6    82   10   2   1   101 01  0
t7    85   10   2   1   101 01  0
t8    81   10   2   1   101 01  0

Table 5: Original values of CPU and MEM, their mappings to discrete ranges (r1, r2), binary encoding, and binary label.

A.1 Discretization And Binarization
The first operation is discretization. Assume the discretization algorithm identifies three intervals for CPU and two intervals for MEM, as follows. CPU: [0, 81), [81, 95), [95, max). MEM: [0, 85), [85, max). We can now map the original values to integer values over the ranges (1, 2, 3) and (1, 2), as shown in columns r1 and r2, respectively. The resulting discretized records are then mapped to (inverse one-hot encoded) binary strings of five bits, as recorded in the String column. We also define a partial order relation between binary records, such that x ≤ x′ ⟺ x ∧ x′ = x. Moreover, the application of inverse one-hot encoding ensures that the relation between input features and labels is monotone, according to definition 2.2 in the main paper. We can give an intuition through a simple example: consider the two binary strings 011 and 110; we see that 011 ≰ 110 and 110 ≰ 011, i.e., the two strings are incomparable.
A.2 Learning The Boundary
Consider the first positive sample t1, with string 110 01. An exhaustive search strategy would explore all possible flipping alternatives to find the most general conflict-free binary strings. If, for example, we flip off the first bit we obtain 010 01 ≤ t2: we have therefore a conflict. If we keep the first bit at 1 and flip off the second bit, we obtain 100 01, which is in conflict with t6–t8. Finally, if we flip off the last bit, we obtain 110 00, which has no conflict: this is a candidate boundary point. If we repeat the same procedure for t3, after flipping off the third bit we obtain another boundary point, 100 10.

Figure 2: Partially ordered set created from the records in table 5.

Figure 2 shows the partially ordered set corresponding to table 5. At the beginning, the nodes at the top are the ones for which we know the label, represented with a superscript symbol + and − for positive and negative, respectively. They can be seen as maximally-specific rules. If we take the positive class as target, we move inside the Boolean lattice by flipping off positive bits, starting from the positive binary samples, and go down to find binary elements – located on the boundary – that divide positive and negative samples. While we navigate the Boolean lattice, nodes are labelled according to the cover test against the negative samples. As soon as a conflict is found, we can avoid going down from that node, but there is still the possibility to explore that path from another binary sample. This recursive procedure corresponds to up-and-down movements in the lattice. However, if at each iteration we are able to select the best candidate bit and avoid conflicts, we only take steps down in the Boolean lattice. We use the heuristic described in the main paper to choose the best candidate bit to flip off.

A.3 A Practical Example
Consider again the example in table 5. Since at the beginning S = T, we only report |T_i|. For the first positive record t1 = 110 01, we have: F_1 = {011 01, 011 10}, F_2 = {101 01}, F_5 = {011 10}. We have therefore: d_l(t1, F_1) = 1, d_l(t1, F_2) = 1, d_l(t1, F_5) = 2. We already know that flipping off either the first or the second bit would lead to a conflict: thus, we directly flip off the fifth bit to obtain the boundary point 110 00, independently from the value of |T_5|. Element 110 00 is added to the set of boundary points A.

For the second positive record t3 = 101 10, we have: F_1 = {011 01, 011 10}, F_3 = ∅, F_4 = {011 01, 101 01}. We have therefore: d_l(t3, F_1) = 1, d_l(t3, F_3) = undefined, and d_l(t3, F_4) = 1. Although i = 3 induces a distance from an empty set, since we know that flipping off the other indexes generates conflicts, we can immediately label 100 10 as a boundary point and add it to A.

A.4 From Boundary Set To Rules
At the end of the previous phase, we obtain the boundary set A = {110 00, 100 10}. In this case, each boundary point covers only one distinct positive sample; therefore the union of the two points covers the whole set of positive samples, and both points are kept after the regularization. Let us suppose we follow a positive set cover strategy, without early stopping. Then, the boundary set can be immediately mapped to the rule set shown in fig. 3.

IF CPU ∈ [95, max) OR (CPU ∈ [81, max) AND MEM ∈ [85, max)) THEN Label = 1
ELSE Label = 0

Figure 3: Rule set extracted from the boundary set A.
B PARALLEL AND DISTRIBUTED IMPLEMENTATION

libre is amenable to parallel and distributed implementations. Indeed, it processes one positive sample at a time. An exhaustive version of the FindBoundaryPoint procedure is embarrassingly parallel and easily parallelizable on multi-core architectures: it is sufficient to spawn a UNIX process per positive sample and exploit all available cores.

The approximate procedure, instead, requires a slightly more involved approach. Indeed, the approximate FindBoundaryPoint procedure processes positive records that have not yet been covered by any boundary point. Hence, a global view on the set S is required. We experimented with two alternatives. The first is to place S in a shared, in-RAM datastore, because UNIX processes – unlike threads – do not have shared memory access. The second alternative is to simply let each individual process hold its own version of S, thus sacrificing the global view. Our experiments indicate that the loss in performance due to a local view is negligible, and largely outweighed by the gain in execution speed, since the execution time decreases linearly with the number of spawned UNIX processes. Moreover, both D+ and D− remain consistent throughout the whole induction phase.

libre can also easily be distributed so that it runs on a cluster of machines, using for example a distributed computing framework such as Apache Spark. This approach, called data parallelism, splits input data across machines and lets each machine execute, independently, a weak learner. The data splitting operation shuffles random subsets of the input features to each worker machine. Once each worker finishes generating its local rule set, the rule sets are merged in the "driver" machine, which eventually applies the filtering and then executes the rule selection procedure to produce the final boundary.
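The process-based variant with a local view of S can be sketched with Python's standard multiprocessing; find_boundary stands for the base method, and all names and structural choices are ours.

```python
from multiprocessing import Pool
import random

def train_weak_learner(args):
    """Runs in a worker process: one weak learner with a local view of S."""
    D_pos, D_neg, feats = args
    P_pos = {tuple(x[i] for i in feats) for x in D_pos}
    P_neg = {tuple(x[i] for i in feats) for x in D_neg}
    return feats, find_boundary(P_pos - P_neg, P_neg)  # find_boundary: stub

def train_ensemble(D_pos, D_neg, d, E, k, workers=4):
    jobs = [(D_pos, D_neg, sorted(random.sample(range(d), k))) for _ in range(E)]
    with Pool(workers) as pool:
        local_sets = pool.map(train_weak_learner, jobs)
    # "Driver": merge the local rule sets with a union; filtering and the
    # greedy rule selection run afterwards on the merged set.
    return [(feats, a) for feats, rules in local_sets for a in rules]
```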
C THE IMPACT OF LIBRE'S PARAMETERS

In this section we investigate how acting on libre's parameters allows us to obtain specific performance-interpretability tradeoffs. We do not cover all the possible parameters: in particular, we focus on the discretization threshold, #estimators, and #features per estimator. The effects of α and of early stopping in weighted set cover are not reported here, since they are well known from previous studies. When we vary one parameter, all the others are kept fixed to isolate its impact. We also give some rules of thumb to choose them.

C.1 The Effects Of Varying The Discretization Threshold
The choice of the discretization threshold depends on the specific dataset: a threshold equal to zero means no discretization, whereas increasing the threshold is equivalent to increasing the tolerance to combine consecutive ranges of values with different label distributions. In general, a zero threshold gives poor performance and results in a bigger lattice with a consequently slower training time; a too aggressive (high) threshold is not recommended either, because it would lead to a huge loss of information.

The most significant effects occur as soon as we start increasing the threshold: in general, the F1-score improves (and eventually oscillates) up to a value after which it can decrease. Clearly, if the dataset contains only continuous features and we keep increasing the threshold, original and discrete records will coincide at a certain point.

The threshold also affects the number of rules and their size. In general, when there is no discretization, two extreme cases are possible: i) we might have as many rules as the number of positive examples (if their binary representation does not generate conflicts with the elements in F), with #atoms = #features, meaning that the model simply overfitted the training data; ii) we might end up with few rules with a very high number of atoms (or no rules at all): the model tried to generalize positive records but was not able to learn something meaningful because too many conflicts were present in the dataset.

From our experiments, the second option is more common (few complex rules). Again, as soon as we start increasing the threshold, the model starts to learn: the number of discovered rules increases and the number of atoms decreases, since the model is able to filter out useless features. After that, changes tend to stabilize: in our experiments, this happens when the discretization threshold is roughly between 3 and 6.
C.2 The Effects Of Varying #estimators And #features

We analyze how #estimators and #features affect the predictive performance and interpretability of libre, keeping the remaining parameters fixed. Results are reported for the Heart UCI dataset, but the considerations we make are quite general.
Parameter Settings. We fixed a discretization threshold of 6. The search procedure optimizes the H1 tuple with α = 0.7, without applying any early-stopping condition. We varied #estimators (from 1 up to 20) and #features per estimator, performing up to 50 runs for each (#estimators, #features) configuration, where the features used by each estimator are randomly selected. Notice that this is not the optimal set of parameters.

Effects On F1-score.
As shown in fig. 4, if we fix #estimators, when #estimators is low (one estimator) the F1-score improves considerably as #features increases. When enough estimators are used, the F1-score stabilizes: we can use fewer #features per estimator with almost no effect on the F1-score. From fig. 5, we can see that, if we fix #features, the F1-score benefits from increasing #estimators. When #features increases, limiting #estimators to a low value does not significantly impact the F1-score.

In other words, for low #features it is convenient to run more estimators: each estimator works on different subsets of the input features, and the union of their rules will hopefully be diverse, with a consequently higher F1-score. For the specific case of Heart, we do not notice any significant difference in F1-score when passing from 5 to 20 estimators. However, it is generally convenient to increase #estimators in order to try as many combinations of features as possible and reduce the variance of the results. For datasets with many features, this may make the difference.
Effects On rules. As shown in fig. 6, if we fix estimators and increase features, the average number of rules tends to increase up to a certain value, and then stabilizes or gets slightly worse. From fig. 7, we notice that, when features is low, the number of rules tends to increase as we increase the number of estimators. Indeed, the model generates fewer rules when there are not enough discriminant features; with more estimators, each estimator discovers different rules that are then combined. As we increase features per estimator, the probability that different estimators work with similar sets of features increases, together with the probability of generating the same (or very similar) rules: that is why the size of the rule set tends to stabilize. In these cases, it might be convenient to run fewer estimators to save execution time. In general, increasing the number of estimators considerably reduces the variance of the results.
Effects On atoms. As shown in fig. 8, if we keep estimators fixed, the number of atoms of the rule set increases as features increases. If we fix features (fig. 9), estimators does not seem to affect atoms significantly. As usual, increasing estimators reduces the variance of the results.
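The three effects above come from the same kind of parameter sweep; the following is a hedged sketch of it, where train_and_eval is a hypothetical stand-in that trains libre once with the given parameters and returns the metric of interest (F1-score, rules, or atoms).

import statistics

def sweep(estimator_grid, feature_grid, train_and_eval, runs=50):
    results = {}
    for n_est in estimator_grid:
        for n_feat in feature_grid:
            # Repeat training with fresh random feature subsets.
            scores = [train_and_eval(n_est, n_feat, seed=s)
                      for s in range(runs)]
            # Mean and standard deviation over runs; the latter shows
            # the variance reduction obtained with more estimators.
            results[(n_est, n_feat)] = (statistics.mean(scores),
                                        statistics.stdev(scores))
    return results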
Final Remarks.
In conclusion, if we want interpretable rule sets, it is better to use few input features per estimator and as many estimators as possible. In appendix C, we have not used any early-stopping condition. However, it is good practice to tune this parameter in order to generate rule sets that are both more interpretable and highly accurate; a sketch of the idea follows.
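As an illustration only (the selection loop below is not libre's actual simplification procedure), early stopping can be grafted onto a greedy rule-selection pass: stop adding rules once the validation F1-score no longer improves enough. Here predict is a hypothetical helper that labels a record positive when any selected rule fires.

from sklearn.metrics import f1_score

def select_rules(candidates, X_val, y_val, predict, min_gain=1e-3):
    selected, best = [], 0.0
    for rule in candidates:  # assumed sorted by individual quality
        score = f1_score(y_val, predict(selected + [rule], X_val))
        if score < best + min_gain:
            break  # early stop: keep the rule set compact
        selected, best = selected + [rule], score
    return selected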
D SCALABILITY EVALUATION
Here, we extensively test the scalability of libre. Unlike the main paper, we use up to 50 features and also investigate the impact of class imbalance on the execution time.
Synthetic Dataset.
For the scalability evaluation, we synthetically generate datasets with up to 1'000'000 records and class imbalance ratios of 0.001, 0.01, 0.1, and 0.5.

Settings.
We vary the number of records (10'000, 100'000, 500'000, 1'000'000), features (10, 20, 50), and class imbalance ratio (0.001, 0.01, 0.1, 0.5): for each dataset configuration, libre runs up to 100 times with different randomly generated subsets of features of size 10, 20, and 50. The average execution time in seconds is reported as the sum of two contributions: rule generation and rule simplification times. Times refer to one weak learner only: if N weak learners run in parallel, the reported time remains a good estimate. Before executing libre, we discretize the dataset with a discretization threshold equal to 6, which we empirically found to be a good value. The simplification procedure runs on the top 500 rules, if more are generated.

Figure 4: Heart dataset: F1-score as a function of features.
Figure 5: Heart dataset: F1-score as a function of estimators.
Figure 6: Heart dataset: number of rules as a function of features.
Figure 7: Heart dataset: number of rules as a function of estimators.
Figure 8: Heart dataset: number of atoms as a function of features.
Figure 9: Heart dataset: number of atoms as a function of estimators.
Figure 10: Run time on synthetic data (execution time in seconds vs. number of records, split into rule generation and rule simplification times; one panel per class imbalance ratio in {0.001, 0.01, 0.1, 0.5}).

Results.
As shown in fig. 10, the execution time is dominated by the rule generation term. For a given class imbalance ratio, the execution time increases as we increase the number of records and features. The generation time also depends on which features are fed into the model, for two main reasons: i) ChiMerge encodes poorly predictive features with larger domains, increasing the search space; ii) the generation procedure struggles to generate rules when it runs on features that are not useful for predicting the target class. This explains the high variance in the results. Intuitively, as the class imbalance ratio approaches 0.5, the number of processed records increases, together with the execution time. However, we verified experimentally that this effect is partly compensated by the higher number of negative records. As already pointed out in the main paper, we run the rule generation procedure with up to 50 features for experimental purposes only: in practical applications, if interpretability is a requirement, it is more convenient to limit the number of features and train a larger ensemble with more learners, in order to generate compact rules in a reasonable time.
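The per-learner timing convention used above (one weak learner's time approximates the parallel wall-clock time) can be made concrete with a small sketch; train_one is a hypothetical stand-in for training a single weak learner on its feature subset.

from concurrent.futures import ProcessPoolExecutor

# With one process per weak learner, wall-clock time is close to the
# slowest single learner rather than the sum of all learners, so the
# reported per-learner times remain a good estimate of the total.
def train_parallel(feature_subsets, train_one):
    with ProcessPoolExecutor() as pool:
        # One task per weak learner; results are the learners' rule sets.
        return list(pool.map(train_one, feature_subsets))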
E FULL EXPERIMENTS
In this section, we report the full experimental campaign. We use the same methods, training procedure, preprocessing, and evaluation measures as the main paper, but we report results for more datasets, as described in table 6. We also clarify which class we have trained the model on (target class): in the case of multi-class classification datasets, records not belonging to the target class are considered negative (a minimal sketch follows table 6). Table 7 reports a comparison between libre and the selected methods in terms of F1-score, whereas table 8 and table 9 report the number of rules and the average number of atoms, respectively. We also compare the rule sets leading to the best F1-scores for ripper-k, brs, and s-brl with a few configurations of libre. In fig. 11, we report the average number of rules and atoms per rule as a function of the F1-score: points at the bottom-right side of each plot are preferable, since they correspond to compact and highly predictive rule sets.
F MORE EXAMPLES OF RULE SETS LEARNED BY LIBRE

In the main paper, we showed an example of a rule set learned by libre for Liver. In this section, we report additional examples for the medical UCI datasets described in table 6, for which it might be interesting to understand the relation between the input features and the predicted diseases. Notice that different rule sets may be obtained depending on how folds are randomly built during cross-validation.
Dataset       Records  Features  Pos. ratio  Target class
Adult                                        >
Australian    690      14        .44         2
Balance       625      4         .08         B
Bank
Haberman      306      3         .26         died
Heart         270      13        .51         presence
Ilpd          583      10        .28         liver patient
Liver         345      5         .51         drinks >
Pima          768      8         .35         1
Sonar         208      60        .53         R
Tictactoe     958      9         .65         positive
Transfusion   748      5         .24         yes
Wisconsin     699      9         .34         malignant
Sap-Clean
Sap-Full

Table 6: Characteristics of evaluated datasets.
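As referenced above, here is a minimal sketch of the binarization applied to multi-class datasets, using the target classes of table 6; binarize is our own illustrative helper, not part of libre.

# Records of the target class are positive, all other records negative.
def binarize(labels, target_class):
    return [1 if label == target_class else 0 for label in labels]

# Example with Balance, whose target class in table 6 is "B":
print(binarize(["L", "B", "R", "B"], "B"))  # -> [0, 1, 0, 1]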
Dataset      rbf-svm   rf        dt        ripper-k  modlem    s-brl     brs       libre     libre
Adult        .62(.01)  .68(.01)  .68(.01)  .59(.02)  .66(.01)  .68(.01)  .61(.01)  .70(.01)  .62(.01)
Australian   .83(.02)  .86(.02)  .84(.02)  .85(.02)  .68(.28)  .82(.03)  .83(.03)  .84(.03)  .84(.03)
Balance      .03(.07)  .00(.00)  .01(.03)  .00(.00)  .16(.04)  .00(.00)  .00(.00)  .16(.08)  .14(.06)
Bank         .46(.01)  .50(.01)  .50(.01)  .44(.04)  .50(.03)  .50(.02)  .32(.05)  .55(.01)  .44(.01)
Haberman     .24(.10)  .26(.07)  .36(.08)  .38(.07)  .40(.07)  .17(.21)  .07(.06)  .41(.04)  .41(.04)
Heart        .78(.06)  .79(.07)  .71(.01)  .73(.09)  .39(.31)  .74(.05)  .70(.09)  .77(.06)  .75(.02)
Ilpd         .47(.02)  .44(.08)  .42(.10)  .20(.11)  .48(.08)  .14(.13)  .09(.08)  .54(.06)  .52(.04)
Liver        .58(.08)  .58(.07)  .56(.10)  .59(.04)  .58(.07)  .54(.03)  .61(.05)  .60(.07)  .63(.06)
Pima         .61(.04)  .63(.04)  .60(.01)  .60(.03)  .38(.18)  .61(.07)  .03(.03)  .64(.05)  .64(.05)
Sonar        .81(.04)  .83(.05)  .75(.05)  .77(.08)  .70(.06)  .76(.05)  .69(.06)  .79(.03)  .76(.04)
Tictactoe    .99(.01)  .99(.01)  .97(.01)  .98(.01)  .55(.10)  .99(.01)  .99(.01)  .68(.04)
Transfusion  .41(.07)  .35(.06)  .35(.05)  .42(.10)  .42(.08)  .05(.10)  .04(.05)  .49(.12)  .49(.12)
Wisconsin    .95(.02)  .95(.01)  .91(.04)  .94(.02)  .95(.01)  .94(.02)  .88(.03)  .95(.01)  .93(.02)
Sap-Clean    .93(.02)  .93(.01)  .85(.03)  .86(.02)  .88(.01)  .90(.01)  .68(.03)  .95(.02)  .72(.03)
Sap-Full     -         -         -         -         -         .81(.02)  -         .89(.03)  .68(.04)
Avg Rank     4.0(1.8)  3.1(1.9)  5.5(1.9)  5.3(1.7)  5.0(2.8)  5.3(2.3)  7.3(2.5)

Table 7: F1-score (st. dev. in parentheses).
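A hedged sketch of how the "Avg Rank" rows in tables 7-9 can be computed: methods are ranked on each dataset (rank 1 = best, i.e., highest F1-score), and each method's ranks are averaged over datasets. The handling of ties and of missing entries below is our own simplification; the paper does not spell out its exact convention.

import statistics

def average_ranks(scores):
    # scores: {dataset: {method: f1}}; missing methods simply absent.
    ranks = {}
    for per_method in scores.values():
        ordered = sorted(per_method, key=per_method.get, reverse=True)
        for pos, method in enumerate(ordered, start=1):
            ranks.setdefault(method, []).append(pos)
    return {m: statistics.mean(r) for m, r in ranks.items()}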
Dataset      dt        ripper-k  modlem    s-brl      brs       libre     libre
Adult
Australian
Balance
Bank
Haberman
Heart
Ilpd
Liver
Pima
Sonar
Tictactoe
Transfusion
Wisconsin
Sap-Clean
Sap-Full     -         -         -         56.4(4.6)  -         17.5(5.2)
Avg Rank     5.9(0.9)  3.3(1.3)  6.5(0.9)  4.8(0.8)

Table 8: Number of rules (st. dev. in parentheses).
Dataset      dt        ripper-k  modlem    s-brl      brs       libre     libre
Adult
Australian
Balance
Bank
Haberman
Heart
Ilpd
Liver
Pima
Sonar
Tictactoe
Transfusion
Wisconsin
Sap-Clean
Sap-Full     -         -         -         85.6(9.7)  -         4.7(0.3)
Avg Rank     5.8(1.6)
Table 9: Average number of atoms (st. dev. in parentheses).

Figure 11: F1-score vs. average number of rules and atoms per rule, for (a) Adult, (b) Australian, (c) Balance, (d) Bank, (e) Haberman, (f) Heart, (g) Ilpd, (h) Liver, (i) Pima, (j) Sonar, (k) Tictactoe, (l) Transfusion, (m) Wisconsin, (n) Sap-Clean.

IF (number of positive axillary nodes ∈ [2, max]) THEN died within 5 years
ELSE survived 5 years or longer
Figure 12: Example of rules learned by libre for Haberman.
IF (slope of the peak exercise ∈ {flat, downsloping} AND number of major vessels ∈ [1,
OR (chest pain type ∈ {asymptomatic} AND thal ∈ {reversable defect})
OR (sex ∈ {male} AND fasting blood sugar > ∈ {False} AND number of major vessels ∈ [1,
THEN class = presence
ELSE class = absence
Figure 13: Example of rules learned by libre for Heart.
IF (TB ∈ [min, AND sgbp ∈ [min, 42))
OR (TB ∈ [min, AND alkphos ∈ [min,
OR (age ∈ [35, , [56, AND sgbp ∈ [42,
THEN class = liver patient
ELSE class = non liver patient
Figure 14: Example of rules learned by libre for Ilpd.
IF (mean corpuscular volume ∈ [90, 96)) OR (gamma glutamyl transpeptidase ∈ [20, max]) THEN liver disorder = True
ELSE liver disorder = False
Figure 15: Example of rules learned by libre for Liver.
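As a concrete illustration, here is a small sketch of applying the Liver rule set of fig. 15 to one record; predict_liver, in_range, and the dict-based record format are our own illustration of how a libre rule set is read, not libre's API.

def in_range(x, lo, hi=None):
    # hi=None encodes an interval that is open up to "max".
    return x >= lo and (hi is None or x < hi)

def predict_liver(record):
    rule1 = in_range(record["mean corpuscular volume"], 90, 96)
    rule2 = in_range(record["gamma glutamyl transpeptidase"], 20)
    return rule1 or rule2  # liver disorder = True if any rule fires

# This record misses rule 1 but fires rule 2, so the prediction is True.
print(predict_liver({"mean corpuscular volume": 85,
                     "gamma glutamyl transpeptidase": 25}))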
IF (glucose ∈ [158, max] AND blood pressure ∈ [56, max])
OR (glucose ∈ [110, 158] AND BMI ∈ [30., max))
OR (pregnancies ∈ [4, max] AND diabetes pedigree function ∈ [0., max])
THEN diabetes = True
ELSE diabetes = False
Figure 16: Example of rules learned by libre for Pima.
IF (months since last donation ∈ [0, 8) AND total blood donated ∈ [1250, max)) THEN transfusion = Yes
ELSE transfusion = No
Figure 17: Example of rules learned by libre for Transfusion.
IF (uniformity of cell shape ∈ [5, max])
OR (clump thickness ∈ [2, max] AND bare nuclei ∈ [8, max))
OR (clump thickness ∈ [7, max] AND marginal adhesion ∈ [1, , [4, max))
THEN class = malignant
ELSE class = benign
Figure 18: Example of rules learned by libre for Wisconsin.