LIBRE: Learning Interpretable Boolean Rule Ensembles

Graziano Mita, Paolo Papotti, Maurizio Filippone, Pietro Michiardi
EURECOM, 06410 Biot (France)
{mita, papotti, filippone, michiardi}@eurecom.fr

Abstract
We present a novel method – libre – to learn an interpretable classifier, which materializes as a set of Boolean rules. libre uses an ensemble of bottom-up, weak learners operating on a random subset of features, which allows for the learning of rules that generalize well on unseen data even in imbalanced settings. Weak learners are combined with a simple union so that the final ensemble is also interpretable. Experimental results indicate that libre efficiently strikes the right balance between prediction accuracy, which is competitive with black box methods, and interpretability, which is often superior to alternative methods from the literature.
1 INTRODUCTION

Model interpretability has become an important factor to consider when applying machine learning in critical application domains. In medicine, law, and predictive maintenance, to name a few, understanding the output of the model is at least as important as the output itself. However, a large fraction of models currently in use (e.g., deep nets, SVMs) favor predictive performance at the expense of interpretability.

To deal with this problem, interpretable models have flourished in the machine learning literature over the past years. Although defining interpretability is difficult (Miller, 2017; Doshi-Velez and Kim, 2017), the common goal of such methods is to provide an explanation of their output. The form and properties of the explanation are often application specific.

In this work, we focus on predictive rule learning for challenging applications where data is unbalanced.
IF mean corpuscular volume ∈ [90, 96) OR gamma glutamyl transpeptidase ∈ [20, max] THEN liver disorder = True
ELSE liver disorder = False
Figure 1: Example of rules learned by libre for Liver.

For rules, interpretability translates into simplicity, and it is measured as a function of the number of rules and their size (average number of atoms): such proxies are easy to compute, understandable, and allow comparing several rule-based models. The goal is to learn a set of rules from the training set that (i) effectively predict a given target, (ii) generalize to unseen data, and (iii) are interpretable, i.e., a small number of short rules (e.g., fig. 1). The first objective is particularly difficult to meet in the presence of imbalanced data. In this case, most rule-based methods fail at characterizing the minority class. Additional data issues that hinder the application of rule-based methods (Weiss, 2004) are data fragmentation (especially in the case of small disjuncts (Holte et al., 1989)), overlaps between imbalanced classes, and the presence of rare examples.

Many seminal rule learning methods come from the data mining community: cba (Liu et al., 1998), cpar (Yin and Han, 2003), and cmar (Li et al., 2001), for example, use mining to identify class association rules and then choose a subset of them according to a ranking to implement the classifier. In practice, however, these methods output a huge number of rules, which negatively impacts interpretability. Another family of approaches includes methods like cn2 (Clark and Niblett, 1989), foil (Quinlan and Cameron-Jones, 1993), and ripper-k (Cohen, 1995), whereby top-down learners build rules by greedily adding the condition that best explains the remaining data.
Top-down learners are well suited for noisy data and are known to find general rules (Fürnkranz et al., 2014). They work well for the so-called large disjuncts, but have difficulties identifying small disjuncts and rare examples, which are quite common in imbalanced settings. In contrast, bottom-up learners like modlem (Grzymala-Busse and Stefanowski, 2001) start directly from very specific rules (the examples themselves) and generalize them until a given criterion is met. Such methods are susceptible to noise, and tend to induce a very high number of specific rules, but are better suited for cases where only few examples characterize the target class (Fürnkranz et al., 2014). Hybrid approaches such as bracid (Napierala, 2012) take the best from both worlds: maximally-specific rules (the examples themselves) and general rules are used together in a hybrid classification strategy that combines rule learning and instance-based learning. Thus, they achieve better generalization, also in imbalanced settings, but still generate many rules, penalizing interpretability. Other approaches to tackle data-related issues include heuristics to inflate the importance of rules for minority classes (Grzymala-Busse et al., 2000; Nguyen and Ho, 2005; Blaszczynski et al., 2010).

Recent work focuses on marrying competitive predictive accuracy with high interpretability. A popular approach is to use the output of an association rule discovery algorithm (like FP-Growth) and combine the discovered rules into a small and compact subset with high predictive performance. The rule combination process can either be formalized as an integer optimization problem or solved heuristically, explicitly encoding interpretability needs in the optimization function. Such approaches have been successfully applied to rule lists (Yang et al., 2017; Chen and Rudin, 2018; Angelino et al., 2018) and rule sets (Lakkaraju et al., 2016; Wang et al., 2017). Alternatively, rules can be directly learned from the data through an integer optimization framework (Hauser et al., 2010; Chang et al., 2012; Malioutov and Varshney, 2013; Goh and Rudin, 2014; Su et al., 2016; Dash et al., 2018).

Both rule-mining and integer-optimization based approaches underestimate the complexity and importance of finding good candidate rules, and become expensive when the input dimensionality increases, unless some constraints are imposed on the size and support of the rules. Although such constraints favour interpretability, they have a negative impact on the predictive performance of the model, as we show empirically in our work. Additionally, these methods do not consider class imbalance issues.

The key idea in our work is to exploit the known advantages of bottom-up learners in imbalanced settings, and improve their generalization and noise tolerance through an ensembling technique that does not sacrifice interpretability. As a result, we produce a rule-based method that is (i) versatile and effective in dealing with both balanced and imbalanced data, (ii) interpretable, as it produces small and compact rule sets, and (iii) scalable to big datasets.
Contributions. (i) We propose libre, a novel ensemble method that, unlike other ensemble proposals in the literature (W. Cohen and Singer, 1999; Friedman and Popescu, 2008; Dembczyński et al., 2010), is interpretable. Each weak learner uses a bottom-up approach based on monotone Boolean function synthesis and generates rules with no assumptions on their size and support. Candidate rules are then combined with a simple union, to obtain a final interpretable rule set. The idea of ensembling is crucial to improve generalization, while using bottom-up weak learners allows us to generate meaningful rules even when the target class has few available samples. (ii) Our base algorithm for a weak learner, which is designed to generate a small number of compact rules, is inspired by Muselli and Quarati (2005), but it dramatically improves computational efficiency. (iii) We perform an extensive experimental validation indicating that libre scales to large datasets, has competitive predictive performance compared to state-of-the-art approaches (even black-box models), and produces few and simple rules, often outperforming existing interpretable models.
2 PROBLEM FORMULATION

Our methodology targets binary classification, although it can be easily extended to multi-class settings. For the sake of building interpretable models, we focus on Boolean functions for the mapping between inputs and labels, which are amenable to a simple interpretation.

Boolean functions can be used as a model for binary classifiers f(x) = y, where x ∈ {0, 1}^d and y ∈ {0, 1}. The function f induces a separation of {0, 1}^d into two subsets F and T, where F = {x ∈ {0, 1}^d : f(x) = 0} and T = {x ∈ {0, 1}^d : f(x) = 1}. We call such subsets negative and positive subsets, respectively. Clearly, F ∪ T = {0, 1}^d corresponds to the full truth table of the classification problem. We restrict the input space {0, 1}^d to be a partially ordered set (poset): a Boolean lattice on which we impose a partial ordering relation.
Definition 2.1. Let ∧, ∨, ¬ be the and, or, and not logic operators, respectively. A Boolean lattice is a 5-tuple ({0, 1}^d, ∧, ∨, 0, 1); the absence of the ¬ operator implies that a lattice is not a Boolean algebra. Let ≤ be a partial order relation such that x ≤ x′ ⟺ x ∧ x′ = x. Then, ({0, 1}^d, ≤) is a poset, a set on which a partial order relation has been imposed.

The theory of Boolean algebra ensures that the class B_d of Boolean functions f : {0, 1}^d → {0, 1} can be realized in terms of ∧, ∨, and ¬. However, if {0, 1}^d is a Boolean lattice, ¬ is not allowed and only a subset M_d of B_d can be realized. The class M_d coincides with the collection of monotone Boolean functions. The lack of the ¬ operator may limit the family of functions we can reconstruct. However, by applying a suitable transformation of the input space, we can enforce the monotonicity constraint (Muselli, 2005). As a consequence, it is possible to find a function f̃ ∈ M_d that approximates f ∈ B_d arbitrarily well.
Definition 2.2. Let (X, ≤) and (Y, ≤) be two posets. Then, f : X → Y is called monotone if x ≤ x′ implies f(x) ≤ f(x′).
Definition 2.3. Given x ∈ {0, 1}^d, let I_m be the set of the first m positive integers {1, . . . , m}, and let P(x) = {i ∈ I_m : x(i) = 1}. The inverse of P is denoted by p, with p(P(x), m) = x.
Definition 2.4. Let f̃ ∈ M_d be a monotone Boolean function, and let A ⊆ {0, 1}^d be a set of lattice elements. Then, f̃ can be written as f̃(x) = ⋁_{a∈A} ⋀_{j∈P(a)} x(j).

The monotone Boolean function f̃ is specified in disjunctive normal form (DNF), and is univocally determined by the set A and its elements. Thus, given F and T, learning f̃ amounts to finding a particular set of lattice elements A defining the boundary separating positive from negative samples.
Definition 2.5. Given a ∈ {0, 1}^d = T ∪ F, if a ≤ x for some x ∈ T, ∄y ∈ F : a ≤ y, and ∃y ∈ F : b ≤ y, ∀b < a, then a is a boundary point for (T, F).

The set A of boundary points defines the separation boundary. If a′ ≰ a″ and a″ ≰ a′, ∀a′, a″ ∈ A, a′ ≠ a″, then the separation boundary is irredundant. In other words, a boundary point is a lattice element that is smaller than or equal to at least one positive element in T, but not smaller than or equal to any negative element in F. In practical applications, however, we usually have access to a subset of the whole space, D+ ⊆ T and D− ⊆ F. The goal of the algorithms we present next is to approximate the boundary A, given D+ and D−.

We show that boundary points, and binary samples in general, naturally translate into classification rules. Indeed, let R be the set of rules corresponding to the discovered boundary. R(·) represents a binary classifier: R(x) = 1 if ∃r ∈ R : r(x) = 1, and R(x) = 0 otherwise. Then, x is classified as positive if there is at least one rule in R that is true for it.

3 METHODOLOGY

We presented a theoretical framework that casts binary classification as the problem of finding the boundary points for D+ ⊆ T and D− ⊆ F. Next, we use this framework to design our interpretable classifier. First, we describe a base, bottom-up method – later used as a weak learner – that illustrates how to move inside the Boolean lattice to find boundary points. However, the base method does not scale to large datasets, and tends to overfit. Thus, we present libre, an ensemble classifier that overcomes such limitations by running on randomly selected subsets of features. libre is interpretable because it combines the output of an ensemble of weak learners with a simple union operation. Finally, we present a procedure to select a subset of the generated points – the ones with the best predictive performance – and reduce the complexity of the boundary.

We assume that the input dataset is a poset and that the function we want to reconstruct is monotone. This is ensured by applying inverse one-hot encoding on discretized features, and concatenating the resulting binary features, as done in Muselli (2006). Given z ∈ I_m = {1, ..., m}, inverse one-hot encoding produces a binary string b of length m, where b(i) = 1 for i ≠ z and b(i) = 0 for i = z. More details can be found in the supplementary material.
Example 3.1. Consider a dataset with two continuous features, f1 and f2, both taking values in the domain [0, max]. Suppose f1 is discretized into the intervals [0, 40), [40, max], and f2 into [0, 30), [30, 60), [60, max]. A record with f1 = 33.0 and f2 = 44.0 is first discretized as f1′ = 1, f2′ = 2, and then binarized as 01 101. In other words, each feature of a record is encoded with a number of bits equal to the size of its discretized domain, and can have only one bit set to zero.

We develop an approximate algorithm that learns the set A for (D+, D−). The algorithm strives to find lattice elements such that both |A| and |P(a)|, ∀a ∈ A, are small, translating into a small number of sparse boundary points (short rules).
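To make the encoding concrete, the following Python sketch reproduces example 3.1; the helper names (inverse_one_hot, binarize_record) and the use of bisect are our own choices, not part of libre's released code.

```python
import bisect

def inverse_one_hot(z, m):
    """Encode z in {1, ..., m} as an m-bit string with a single 0 at position z."""
    return [0 if i == z else 1 for i in range(1, m + 1)]

def binarize_record(record, cut_points):
    """Discretize each continuous feature against its cut points, then
    concatenate the inverse one-hot encodings of the interval indices."""
    bits = []
    for value, cuts in zip(record, cut_points):
        z = bisect.bisect_right(cuts, value) + 1   # 1-based interval index
        bits += inverse_one_hot(z, len(cuts) + 1)
    return bits

# Example 3.1: f1 is cut at 40; f2 is cut at 30 and 60.
cuts = [[40], [30, 60]]
print(binarize_record([33.0, 44.0], cuts))  # [0, 1, 1, 0, 1] -> "01 101"
```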
Algorithm Design. To proceed with the presentation of our algorithm, we need the following definitions:
Definition 3.1. Given two lattice elements x, x′ ∈ {0, 1}^d, we say that x′ covers x if and only if x′ ≤ x.
Definition 3.2. Given a lattice element x ∈ {0, 1}^d, flipping off the k-th component of x produces an element z such that z(i) = x(i) for i ≠ k and z(i) = 0 for i = k.
Definition 3.3. Given a positive binary sample x ∈ D+, we say that a flip-off operation produces a conflict if the lattice element z resulting from the flip-off is such that ∃x′ ∈ D− : z ≤ x′.

Then, a boundary point is a lattice element that covers at least one positive sample, and for which any additional flip-off operation would produce a conflict, as defined above.
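Since the partial order reduces to a bitwise subset test, definitions 3.1–3.3 translate directly into code. The following is a minimal sketch under our own naming, with lattice elements represented as 0/1 lists.

```python
def leq(x, xp):
    """x <= x' in the lattice: every 1 of x is also a 1 of x'."""
    return all(xi <= xpi for xi, xpi in zip(x, xp))

def covers(a, x):
    """A boundary point / rule a covers sample x iff a <= x (def. 3.1)."""
    return leq(a, x)

def flip_off(x, k):
    """Set the k-th bit (0-based) of x to 0 (def. 3.2)."""
    z = list(x)
    z[k] = 0
    return z

def conflict(z, D_neg):
    """z conflicts with the negatives iff some x' in D- covers it (def. 3.3)."""
    return any(leq(z, xp) for xp in D_neg)
```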
Algorithm 1: FindBoundary

    Set A = ∅ and S = D+;
    while S ≠ ∅ do
        Choose x ∈ S;
        Set I = P(x), J = ∅;
        FindBoundaryPoint(A, I, J);
        Remove from S the elements covered by a, ∀a ∈ A;
    end

Algorithm 1 presents the main steps of our algorithm, where A is the boundary set and S = {s ∈ D+ : ∄a ∈ A, a ≤ s} is the set of elements in D+ that are not covered by a boundary point in A. I is the set of indexes of the components of the current positive sample x that can be flipped off, and J is the set of indexes that cannot be flipped off without creating a conflict with D−. While S is not empty, an element x is picked from S. Then, the procedure FindBoundaryPoint is used to generate one or more boundary points by flipping off the candidate bits of x. According to definition 3.2, a boundary point is generated when an additional flip-off would lead to a conflict, in the sense of definition 3.3. When the FindBoundaryPoint procedure completes its operation, both A and S are updated.
Example 3.2. Let D+ = {11001} and D− = {01011, 10011}. Take the positive sample 11001, for which I = {1, 2, 5} and J = ∅. Suppose that FindBoundaryPoint flips off the bits in I from left to right. Flipping off the first bit generates 01001 ≤ 01011 ∈ D−. The first bit is therefore moved to J and kept at 1. Flipping off the second bit generates 10001 ≤ 10011 ∈ D−. Also the second bit is moved to J. We finally flip off the last bit and obtain 11000, which is not in conflict with any element in D−. 11000 is therefore a boundary point for (D+, D−).

If we think about binary samples in terms of rules, a positive sample can be seen as a maximally-specific rule, with equality conditions on the input features (the value that particular feature takes on that particular sample). Flipping off bits is nothing more than generalizing that rule. Our goal is to perform as many flip-off operations as possible before running into a conflict. Retrieving the complete set of boundary points requires an exhaustive search over the lattice, which is expensive – its cost grows exponentially with the dimension d of the Boolean lattice – restricting its application to small, low-dimensional datasets. In this work, we propose an approximate heuristic for the FindBoundaryPoint procedure.
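The sketch below replays example 3.2 with the naive left-to-right scan used there, reusing flip_off and conflict from the previous sketch; it illustrates the flip-off mechanics only, not the quality-driven heuristic introduced next, and the D− values are our reconstruction of the example.

```python
def generalize_left_to_right(x, D_neg):
    """Flip off the 1-bits of x from left to right, keeping a bit at 1
    whenever flipping it would create a conflict with D- (example 3.2)."""
    a = list(x)
    for k in range(len(a)):
        if a[k] == 1:
            z = flip_off(a, k)
            if not conflict(z, D_neg):
                a = z                       # safe generalization
    return a                                # a boundary point

D_neg = [[0, 1, 0, 1, 1], [1, 0, 0, 1, 1]]  # 01011, 10011
print(generalize_left_to_right([1, 1, 0, 0, 1], D_neg))  # [1, 1, 0, 0, 0]
```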
Finding Boundary Points.
The key idea is to find a subset of all possible boundary points, steering their selection through a measure of their quality. A boundary point is considered to be "good" if it contributes to decreasing the complexity of the resulting boundary set, which is measured in terms of its cardinality |A| and the total number of positive bits ∑_{a∈A} |P(a)|. In practice, |A| can be decreased by choosing boundary points that cover the largest number of elements in S. To do this, we iteratively select the best candidate index i ∈ I according to a measure of potential coverage. Decreasing ∑_{a∈A} |P(a)| implies finding boundary points with a low number of 1s. Before proceeding, we define a notion of distance between lattice elements:
Definition 3.4. Given x, x′ ∈ {0, 1}^d, the distance d_l(x, x′) between x and x′ is defined as d_l(x, x′) = ∑_{i=1}^{d} |x(i) − x′(i)|_+, where |·|_+ is equal to 1 if (·) ≥ 1, and 0 otherwise.
Definition 3.5. In the same way, we can define the distance between a lattice element x and a set V as d_l(x, V) = min_{x′∈V} d_l(x, x′).

Every boundary point a for (D+, D−) has distance d_l(a, D−) = 1; in fact, boundary points are all lattice elements for which a flip-off would generate a conflict. In the iterative selection of the best index i ∈ I to be flipped off, indexes having high d_l(p(I ∪ J), D−_i) are preferred, where D−_i = {x ∈ D− : x(i) = 0}, because they are the ones that contribute most to reducing the number of 1s of a potential boundary point.
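Definitions 3.4 and 3.5 in code, again assuming 0/1 lists; d_l counts the positions where x has a 1 that x′ lacks.

```python
def d_l(x, xp):
    """Distance of def. 3.4: number of positions where x(i)=1 and x'(i)=0."""
    return sum(1 for xi, xpi in zip(x, xp) if xi - xpi >= 1)

def d_l_set(x, V):
    """Distance of def. 3.5: minimum distance from x to a non-empty set V
    (raises ValueError on an empty V, i.e., the distance is undefined)."""
    return min(d_l(x, xp) for xp in V)

# A boundary point sits at distance 1 from the negatives:
print(d_l_set([1, 1, 0, 0, 0], [[0, 1, 0, 1, 1], [1, 0, 0, 1, 1]]))  # 1
```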
Algorithm 2: FindBoundaryPoint(A, I, J)

    For each i ∈ I compute |S_i|, |D+0_i|, d_l(p(I ∪ J), D−_i);
    while I ≠ ∅ do
        Move from I to J all i with d_l(p(I ∪ J), D−_i) = 1;
        if I = ∅ then break;
        Choose the best index i ∈ I;
        Remove i from I;
        For each i ∈ I update d_l(p(I ∪ J), D−_i);
    end
    if there is no a ∈ A : p(J) ≥ a then
        Set A = A ∪ {p(J)};
    end

Algorithm 2 illustrates our approximate procedure, where S_i = {s ∈ S : s(i) = 0} and D+0_i = {t ∈ D+ : t(i) = 0} are proxies for the potential coverage of flipping off a given bit i. The first step of the algorithm computes, for each index i ∈ I, the terms |S_i| and |D+0_i| indicating its potential coverage, and d_l(p(I ∪ J), D−_i). While the set I is not empty, indexes inducing a unit distance to D− are moved to J. Then, we choose the best index i_best among the remaining indices in I, using one of two greedy heuristics: we can choose to optimize either for the tuple H1 = (|S_i|, |D+0_i|, d_l(p(I ∪ J), D−_i)) or for the tuple H2 = (d_l(p(I ∪ J), D−_i), |S_i|, |D+0_i|). H1 prioritizes a lower number of boundary points, while H2 tends to generate boundary points with fewer 1s. When I is empty, p(J) is added to the boundary set A if A does not already contain an element covering p(J). Note that, in algorithm 2, the distance is computed only once, and updated at each iteration. This is because only one bit is selected and removed from I; then, p(I ∪ J)_new = p((I ∪ J)_old \ {i}). Formally, we apply definition 3.4 exclusively for i = i_best.
Example 3.3. Let D+ = {10101, 01101, 01110} and D− = {11010, 11100}. We describe the procedure for a few steps and only for the first positive sample, 10101. Suppose we optimize the tuple H1 = (|S_i|, |D+0_i|, d_l(p(I ∪ J), D−_i)). For 10101 we have I = {1, 3, 5} and J = ∅. At the beginning S = D+. |D+0_1| = 2, |D+0_3| = 0, |D+0_5| = 1. D−_1 = ∅, D−_3 = {11010}, D−_5 = {11010, 11100}. Consequently: d_l(p(I ∪ J), D−_1) = undefined, d_l(p(I ∪ J), D−_3) = 2, d_l(p(I ∪ J), D−_5) = 1. Bit 5 is moved to J. Bit 1 has the highest value of |D+0_i| and is selected as the best candidate to be flipped off. The distance is recalculated and the procedure continues until the set of candidate bits I is empty.

The algorithmic complexity of algorithm 1, when it runs algorithm 2, is polynomial in the number n of distinct training samples and the lattice dimension d. This is dramatically faster than the exhaustive algorithm, and improves on the complexity of Muselli and Quarati (2005). We also point out that most sequential-covering algorithms repeatedly remove the samples covered by the new rules, forcing the induction phase to work in a more partitioned space with less data, especially affecting minority rules, which already rely on few samples. The problem is mitigated in our solution: although the use of S cannot avoid this behavior, our heuristics keep a global and constant view of both D−, in the conflict detection, and D+, in the discrimination of the best bits to flip.
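For concreteness, here is a compact Python sketch of the approximate procedure with the H1 ordering, reusing d_l_set and the 0/1-list representation from the previous sketches; the update of A and S (algorithm 1) is omitted, and distances are recomputed at every iteration instead of updated incrementally, so this is illustrative rather than optimized.

```python
def p(indices, d):
    """Inverse of P: the lattice element with 1s exactly at the 1-based indices."""
    return [1 if i + 1 in indices else 0 for i in range(d)]

def find_boundary_point(S, D_pos, D_neg, I, d):
    """Approximate FindBoundaryPoint with the H1 ordering (illustrative)."""
    I, J = set(I), set()
    while I:
        x = p(I | J, d)
        # Move to J every index whose flip-off would create a conflict:
        # a negative with bit i = 0 at distance 1 differs from x exactly in i.
        for i in sorted(I):
            D_neg_i = [t for t in D_neg if t[i - 1] == 0]
            if D_neg_i and d_l_set(x, D_neg_i) == 1:
                I.discard(i)
                J.add(i)
        if not I:
            break
        def h1(i):
            cov_S = sum(1 for s in S if s[i - 1] == 0)        # |S_i|
            cov_D = sum(1 for t in D_pos if t[i - 1] == 0)    # |D+0_i|
            D_neg_i = [t for t in D_neg if t[i - 1] == 0]
            dist = d_l_set(x, D_neg_i) if D_neg_i else float("inf")
            return (cov_S, cov_D, dist)
        I.discard(max(I, key=h1))      # flip off the best candidate bit
    return p(J, d)                      # candidate boundary point
```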
From Boundary Set To Rules.

Each element a of the boundary set A can be practically seen as the antecedent of an if-then rule having as target the positive class. When a binary sample x is presented to a, the rule outputs 1 only if x has a 1 in all positions where a has value 1, that is, if a ≤ x. Then, the antecedent of the rule is expressed as a function of the input features in the original domain.
Example 3.4. Consider a dataset with two continuous features, f1 and f2, discretized as follows: f1: [0, 40), [40, max); f2: [0, 30), [30, 60), [60, max). Let A = {01 100}. From the boundary point we obtain a rule as follows: the first two bits, referring to feature f1 – 01 – are mapped to "if f1 ∈ [0, 40)"; the last three bits, referring to f2 – 100 – are mapped to "if f2 ∈ [30, max)". The resulting rule is "if f1 ∈ [0, 40) and f2 ∈ [30, max) then label = 1".
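The mapping of example 3.4 can be sketched as follows, assuming the same cut-point representation as in the earlier encoding sketch and domains starting at 0 as in the examples: within each feature block, the 0-bits of the boundary point mark the allowed intervals.

```python
def boundary_point_to_rule(a, cut_points, names):
    """Map a boundary point to a human-readable antecedent (example 3.4)."""
    conds, start = [], 0
    for cuts, name in zip(cut_points, names):
        m = len(cuts) + 1
        block = a[start:start + m]
        start += m
        edges = ["0"] + [str(c) for c in cuts] + ["max"]
        allowed = [z for z in range(m) if block[z] == 0]
        if len(allowed) < m:   # a block of all 0s leaves the feature unconstrained
            ivals = [f"[{edges[z]}, {edges[z + 1]})" for z in allowed]
            conds.append(name + " ∈ " + " ∪ ".join(ivals))
    return "IF " + " AND ".join(conds) + " THEN label = 1"

cuts = [[40], [30, 60]]
print(boundary_point_to_rule([0, 1, 1, 0, 0], cuts, ["f1", "f2"]))
# IF f1 ∈ [0, 40) AND f2 ∈ [30, 60) ∪ [60, max) THEN label = 1
```

Merging the adjacent intervals ([30, 60) ∪ [60, max) = [30, max)) yields exactly the rule in the example.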
The base approach generates boundary points by generalizing input samples, i.e., by flipping off positive bits if no conflict with negative samples is encountered. The hypothesis underlying this procedure is that when no conflicts are found, a boundary point induces a valid rule. However, such a rule might be violated when used on unseen data. Stopping the flipping-off procedure as soon as a single conflict is found has two main effects: i) we obtain very specific rules, which might be simplified if the approach could tolerate a limited number of conflicts; ii) the rules cover no negative samples in the training set and tend to overfit.

To address these issues, a simple method would be to introduce a measure for the number of conflicts and use it as an additional heuristic in the learning process. However, this would dramatically increase the complexity of the algorithm. A more natural way to overcome such challenges is to make the algorithm directly work on (random) subsets of features; in this way, the learning process produces more general rules by construction. Randomization is a well-known technique to implement ensemble methods that provide superior classification accuracy, as demonstrated, for example, in random forests (Ho, 1998; Breiman, 2001). By using randomization, we can directly use the methodology described in the previous sections, without modifying the search procedure. The new approach – libre – is an interpretable ensemble of rules that operates on randomized subsets of features.

Formally, let E be the number of classifiers in the ensemble. For each classifier j ∈ {1, . . . , E}, we randomly sample k_j features of the original space and run algorithm 1 to produce a boundary set A_j for the reduced input space. The A_j can be generated in parallel, since weak learners are independent from each other. At this point, to make the ensemble interpretable, we crucially do not apply a voting (or aggregation) mechanism to produce the final class prediction, but a simple union, such that A = ⋃_{j=1}^{E} A_j.

We note that libre addresses the problems outlined above, as we show experimentally. By training an ensemble of weak learners that operate on a small subset of features, we artificially inflate the probability of finding negative examples. Each weak learner is constrained to run on fewer features, which not only reduces the impact of d on the execution time, but also has an immediate effect on the interpretability of the model, which is forced to generate simpler rules exactly because it operates on fewer input features. Note that there are no guarantees that elements of A_j will actually be boundary points in the full feature space: weak learners have only a partial view of the full input space and might generate rules that are not globally true. Thus, it is important to filter out the points that are clearly far from the boundary by using the selection procedure described in the next section.
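A sketch of the ensemble layer under our own naming: each weak learner sees a random subset of k features, runs the base procedure (here an abstract find_boundary callable), and the local boundary sets are lifted back to the full space and merged with a plain union. The handling of projected samples that fall in both classes is our assumption, not a detail specified above.

```python
import random

def libre_ensemble(D_pos, D_neg, d, E, k, find_boundary, seed=0):
    """Union of E weak learners, each trained on k randomly chosen features."""
    rng = random.Random(seed)
    A = []
    for _ in range(E):
        feats = sorted(rng.sample(range(d), k))
        # Project (and deduplicate) the samples onto the selected features.
        P_pos = {tuple(x[i] for i in feats) for x in D_pos}
        P_neg = {tuple(x[i] for i in feats) for x in D_neg}
        P_pos -= P_neg   # drop projections appearing in both classes (our choice)
        for a in find_boundary(P_pos, P_neg):
            full = [0] * d   # 0s outside feats leave those bits unconstrained
            for bit, i in zip(a, feats):
                full[i] = bit
            if full not in A:        # simple union, no voting
                A.append(full)
    return A
```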
The model learned by our greedy heuristic materializes as a set A, which might contain a large number of elements and, in the case of libre, might also contain elements that cover many negative samples. Next, we explain how to produce a boundary set A* with a good tradeoff between complexity and predictive performance. This can be cast as a weighted set cover problem. Since exploring all possible subsets of elements in A can be computationally demanding, we use a standard greedy weighted set cover algorithm. Each element a ∈ A is assigned a weight that is proportional to the number of positive and negative covered samples; the importance of the two contributions is governed by a parameter α. At each iteration, the element a with the highest weight is selected; if there is more than one, the element with the highest number of zeros is preferred. All samples covered by the selected element are removed, and the weights are recalculated. The process continues until either all samples are covered or a stopping condition is met. Before running the selection procedure, with the aim of speeding up execution times, we may first apply a filtering procedure to reduce the size of the initial set to a small number of good candidates: as proposed by Gu et al. (2003), we select the top K rules according to exclusiveness and local support, which are better suited than confidence and support for imbalanced settings.
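A minimal sketch of the greedy selection stage, reusing covers from the earlier sketch; the exact weight function is our guess at "proportional to the number of positive and negative covered samples" with trade-off α, so treat it as illustrative.

```python
def select_rules(A, D_pos, D_neg, alpha=0.7, max_rules=None):
    """Greedy weighted set cover over the candidate boundary points in A."""
    remaining = list(D_pos)
    selected = []
    while remaining and (max_rules is None or len(selected) < max_rules):
        def weight(a):
            pos = sum(covers(a, x) for x in remaining)
            neg = sum(covers(a, x) for x in D_neg)
            return alpha * pos - (1 - alpha) * neg
        # Tie-break on the number of zeros (more general rules first).
        best = max(A, key=lambda a: (weight(a), sum(b == 0 for b in a)))
        if weight(best) <= 0:
            break                     # stopping condition: no useful rule left
        selected.append(best)
        remaining = [x for x in remaining if not covers(best, x)]
    return selected
```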
4 EXPERIMENTS

We evaluate libre in terms of predictive performance, interpretability, and scalability, and compare it with other rule-based methods and black-box models.

Datasets.
We report the results for seven publicly available datasets from the UCI repository and two real industrial IT datasets, proprietary of Sap. Results on other UCI datasets are in the supplementary material. These datasets cover several domains and have different imbalance ratios, numbers of records, and numbers of features, as summarized in table 1.
Dataset      Records  Features  Imb. ratio
Adult        –        –         –
Australian   690      14        .44
Bank         –        –         –
Ilpd         583      10        .28
Liver        345      5         .51
Pima         768      8         .35
Transfusion  748      5         .24
Sap-Clean    –        –         –
Sap-Full     –        –         –

Table 1: Characteristics of evaluated datasets.

Some of these datasets have been used to evaluate methods for class imbalance (Van Hulse et al., 2007) and present characteristics that make them difficult to learn: overlapping classes, noisy and rare examples. All datasets have, or were transformed to have, a binary class.
The Sap datasets consist of monitoring data collected across database systems. They consist of 45 features, hand-crafted by domain experts based on low-level system metrics. Sap runs a predictive maintenance system on this data and notifies customers, who confirm or discard the warnings: we use these as binary labels. Sap-Clean is the clean version of Sap-Full, where we removed records with at least one missing value.
Comparison With Other Methods. We compare libre with two recent works: Scalable Bayesian Rule Lists (s-brl) (Yang et al., 2017) and Bayesian Rule Sets (brs) (Wang et al., 2017). We also report the results for weka implementations of ripper-k (Cohen, 1995) and modlem (Grzymala-Busse and Stefanowski, 2001) – as representatives of top-down and bottom-up approaches – and scikit-learn implementations of Decision Tree (dt) (Breiman et al., 1984), Support Vector Machine with RBF kernel (rbf-svm) (Cortes and Vapnik, 1995), and random forests (rf) (Breiman, 2001). rbf-svm and rf are selected as popular black-box models; rf is also a representative ensemble method. Other relevant methods are not publicly available (cg (Dash et al., 2018)), do not work properly (ids (Lakkaraju et al., 2016)), or are only partially implemented (bracid (Napierala, 2012)).
Parameter Tuning. All results refer to stratified 5-fold cross validation, where the same splits are used for all tested methods. The initial set of candidate rules for s-brl and brs is generated by running FP-Growth with a minimum support of 1 and a maximum mining length of 5. We also optimize brs and s-brl's prior hyperparameters by cross validation. For brs, we run 2 chains of 500 iterations. For ripper-k, we vary the number of optimization steps between 1 and 5, and activate pruning. For modlem, we try all available classification strategies and condition measures. For rbf-svm, we optimize C and γ. For dt and rf, we optimize the maximum depth (including the unbounded setting), try all possible options for the maximum number of features, and vary the number of trees for rf. For libre, we vary the number of weak learners E; each weak learner uses up to 5 features. Additionally, we try the two heuristics H1 and H2 to generate rules and vary α for weighted set cover. Parameters not reported above are all fixed to recommended or default values.
Dataset      rbf-svm   rf        dt        ripper-k  modlem    s-brl     brs       libre     libre-3
Adult        .62(.01)  .68(.01)  .68(.01)  .59(.02)  .66(.01)  .68(.01)  .61(.01)  .70(.01)  .62(.01)
Australian   .83(.02)  .86(.02)  .84(.02)  .85(.02)  .68(.28)  .82(.03)  .83(.03)  .84(.03)  .84(.03)
Bank         .46(.01)  .50(.01)  .50(.01)  .44(.04)  .50(.03)  .50(.02)  .32(.05)  .55(.01)  .44(.01)
Ilpd         .47(.02)  .44(.08)  .42(.10)  .20(.11)  .48(.08)  .14(.13)  .09(.08)  .54(.06)  .52(.04)
Liver        .58(.08)  .58(.07)  .56(.10)  .59(.04)  .58(.07)  .54(.03)  .61(.05)  .60(.07)  .63(.06)
Pima         .61(.04)  .63(.04)  .60(.01)  .60(.03)  .38(.18)  .61(.07)  .03(.03)  .64(.05)  .64(.05)
Transfusion  .41(.07)  .35(.06)  .35(.05)  .42(.10)  .42(.08)  .05(.10)  .04(.05)  .49(.12)  .49(.12)
Sap-Clean    .93(.02)  .93(.01)  .85(.03)  .86(.02)  .88(.01)  .90(.01)  .68(.03)  .95(.02)  .72(.03)
Sap-Full     -         -         -         -         -         .81(.02)  -         .89(.03)  .68(.04)
Avg Rank     4.7(1.2)  3.3(1.6)  4.9(2.1)  5.3(2.1)  4.9(2.2)  5.2(2.8)  7.2(2.5)  –         –

Table 2: F1-score (st. dev. in parenthesis).
Dataset      dt        ripper-k  modlem    s-brl      brs       libre      libre-3
Adult        –         –         –         –          –         –          –
Australian   –         –         –         –          –         –          –
Bank         –         –         –         –          –         –          –
Ilpd         –         –         –         –          –         –          –
Liver        –         –         –         –          –         –          –
Pima         –         –         –         –          –         –          –
Transfusion  –         –         –         –          –         –          –
Sap-Clean    –         –         –         –          –         –          –
Sap-Full     -         -         -         56.4(4.6)  -         17.5(5.2)  –
Avg Rank     5.7(0.7)  3.2(1.0)  6.7(0.9)  4.9(0.7)   1.9(1.2)  2.9(0.9)   –

Table 3: Number of rules (st. dev. in parenthesis).
Data Preprocessing. Before running rbf-svm, we standardize the input data to get better results; the remaining methods showed no benefit from standardization in our experiments. For s-brl and libre, we apply the ChiMerge discretization algorithm (Kerber, 1992), optimizing the discretization threshold during training; in brs, discretization is instead controlled by an internal parameter, also optimized during training. The remaining algorithms have no explicit need for discretization. For the methods requiring binarization, we apply one-hot encoding, except for libre, which uses inverse one-hot encoding.
Evaluation Metrics. We use the F1-score to compare the predictive performance of the classifiers, as it is well suited to evaluate the capability to characterize the target class in both balanced and imbalanced settings. For rule-based methods, we use standard metrics from the literature to evaluate the interpretability of the rule sets, namely the number of rules that implement a model, and the average number of atoms per rule. For dt, we extract the rules following the paths from root to leaves: this captures the perception of a user who looks at the tree to understand the output of the model. For s-brl, the number of atoms in a rule is equal to the sum of the atoms in the previous rules, highlighting the fact that a user has to go through all the rules up to the one that returns the label. For all rule-based methods, we change inequalities (<, ≤, >, ≥) to ranges to have a fair comparison. For example, f ≥ 3 becomes f ∈ [3, max].
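For reference, the evaluation protocol (stratified 5-fold cross validation with F1 on the positive class) corresponds to the following scikit-learn sketch; the random-forest baseline and random data stand in for the actual models and datasets.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import StratifiedKFold

def evaluate(clf, X, y, seed=0):
    """Stratified 5-fold CV; the same splits can be reused across methods."""
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    scores = []
    for train, test in skf.split(X, y):
        clf.fit(X[train], y[train])
        scores.append(f1_score(y[test], clf.predict(X[test])))
    return np.mean(scores), np.std(scores)

# Example with the rf baseline on random data:
X, y = np.random.rand(200, 8), np.random.randint(0, 2, 200)
print(evaluate(RandomForestClassifier(n_estimators=100), X, y))
```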
Predictive Performance Evaluation. Table 2 shows the means and standard deviations of the F1-score for the tested algorithms, together with the rank of their average performance. We additionally report the results for libre when it is constrained to generate at most 3 rules (libre-3). Overall, libre emerges as the best method, beating both rbf-svm and rf, demonstrating its versatility in both balanced and imbalanced settings. libre-3, dt, modlem, s-brl and ripper-k show very similar performance, even if modlem is usually worse in balanced settings. brs is the worst method in terms of predictive performance. Focusing on the single datasets, we can see that, except for Australian, libre consistently obtains the highest F1-score. In Bank, Ilpd, and
Transfusion, the gap between libre and the closest competitor is significant; the gap is even larger in comparison to alternative rule-based methods. For the remaining datasets, the differences with the competitors are less pronounced but still significant. In particular, Ilpd seems to be very problematic for most of the tested methods: ripper-k, brs and s-brl do not learn anything useful about the positive class; modlem performs marginally better. From a deeper analysis, it emerges that Ilpd is an imbalanced dataset with overlapping classes: rules learned by libre have an error rate close to 50% on the training set, a consequence of the class imbalance. ripper-k is not able to learn these rules, whereas the selection stage of brs and s-brl does not include such rules in the final set even when they are in the set of candidate mined rules.

With Sap-Clean, libre-3 still outperforms brs, but limiting the number of rules to 3 causes a significant drop in F1-score w.r.t. libre. The situation is different for Sap-Full, the original version of the dataset also containing missing values. From table 1, Sap-Full is more than five times bigger than Sap-Clean, indicating that missing values are not a negligible problem in real scenarios. A method that runs without additional preprocessing is thus truly desirable. Only libre and s-brl fit this requirement, while ripper-k, brs, and modlem natively manage missing values for categorical features only, and require additional preprocessing for continuous features. Despite the huge number of missing values, results for libre on Sap-Full are comparable to those of other rule-based methods executed on Sap-Clean.
Interpretability Evaluation. Next, using table 3, we evaluate interpretability in terms of the quantity and simplicity of rules. In our analysis, we also refer to table 2, to measure the trade-off that exists between interpretability and predictive performance.

In terms of number of rules, libre is better than ripper-k on average, indicating that it indeed overcomes the limitations of bottom-up learners like modlem, which is instead the worst method together with dt. s-brl is competitive for small datasets, but the number of rules increases considerably for bigger datasets like Adult, Bank, and Sap. Overall, brs generates compact rule sets, with only one rule for half of the tested datasets. However, we should also notice that, except for Liver, these are the same datasets that give an F1-score close to zero. In terms of the average number of atoms per rule, libre produces short rules (detailed results are in the supplementary material).
Only s-brl has issues when the number of rules is significant (as in the Adult, Bank, and Sap datasets): indeed, in rule lists every rule depends on the previous ones, and the number of atoms easily explodes.

Records     ripper-k  modlem  brs  libre
10'000      –         –       –    –
100'000     –         –       –    –
500'000     –         –       –    –
1'000'000   –         –       –    –

Table 4: Runtime in seconds (st. dev. in parenthesis).
Scalability Evaluation.
Table 4 shows the run time for libre and three representative rule-based competitors on synthetic balanced datasets with 10 features and a varying number of records: from 10'000 to 1'000'000. For each configuration, we randomly generate the dataset 3 times and report the average run time and standard deviation. All methods are tested with their default parameters and run sequentially, for a fair comparison. For libre, the time refers to one weak learner, which is also a good approximation of the computing time of E parallel weak learners. The symbol "-" identifies out-of-memory errors.

modlem and brs fail with an out-of-memory error on the 500'000- and 1'000'000-record datasets. They also show much higher run times than ripper-k and libre on the smaller datasets; the latter two are instead able to complete their training in a few minutes even on the largest datasets.

Note that each weak learner in libre works with D+ and D− sets that consist of distinct records: even if the original dataset has millions of entries, the number of binary records processed by the algorithm is much lower, especially when the number of input features of each weak learner is relatively low. We also point out that, for practical applications where interpretability is needed, it is more convenient to limit the number of features and train a bigger ensemble with more learners to quickly generate understandable rules.

5 CONCLUSION

Model interpretability has recently become of primary importance in many applications. In this work, we focused on the task of learning a set of rules which specify, using Boolean expressions, the classification model. We devised a practical method based on monotone Boolean function synthesis to learn rules from data. Our approach uses an ensemble of bottom-up learners that generalizes better than traditional bottom-up methods, and that works well in both balanced and imbalanced scenarios. Interpretability needs can be easily encoded in the rule generation and selection procedure, which produces short and compact rule sets. Our experiments show that libre strikes the right balance between predictive performance and interpretability, often outperforming alternative approaches from the literature. For future work, we will extend our model considering noisy labels and a Bayesian formulation.
Acknowledgements

The authors wish to thank SAP Labs France for support.
References
Apache Spark. http://spark.apache.org/.
E. Angelino, N. Larus-Stone, D. Alabi, M. Seltzer, and C. Rudin. Learning certifiably optimal rule lists for categorical data. JMLR, 18(234):1–78, 2018.
J. Blaszczynski, M. Deckert, J. Stefanowski, and S. Wilk. Integrating selective pre-processing of imbalanced data with ivotes ensemble. In Proc. of the 7th Int. Conf. on Rough Sets and Current Trends in Computing, RSCTC, pages 148–157, 2010.
L. Breiman. Random forests. Mach. Learn., 45(1):5–32, 2001.
L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth and Brooks, Monterey, CA, 1984.
A. Chang, D. Bertsimas, and C. Rudin. An integer optimization approach to associative classification. In Proc. of the 24th Int. Conf. on Neural Information Processing Systems, NIPS, pages 3302–3310, 2012.
C. Chen and C. Rudin. An optimization approach to learning falling rule lists. In Proc. of the 21st Int. Conf. on Artif. Intel. and Stat., AISTATS, pages 604–612, 2018.
P. Clark and T. Niblett. The CN2 induction algorithm. Mach. Learn., 3(4):261–283, 1989.
W. W. Cohen. Fast effective rule induction. In Proc. of the 20th Int. Conf. on Mach. Learn., ICML, pages 115–123, 1995.
C. Cortes and V. Vapnik. Support-vector networks. Mach. Learn., 20(3):273–297, 1995.
S. Dash, O. Günlük, and D. Wei. Boolean decision rules via column generation. In Proc. of the 31st Int. Conf. on Neural Information Processing Systems, NIPS, 2018.
K. Dembczyński, W. Kotłowski, and R. Słowiński. ENDER: a statistical framework for boosting decision rules. Data Min. and Knowl. Disc., 21(1):52–90, 2010.
F. Doshi-Velez and B. Kim. A roadmap for a rigorous science of interpretability. CoRR in arXiv, 2017.
J. H. Friedman and B. E. Popescu. Predictive learning via rule ensembles. The Annals of Appl. Stat., 2(3):916–954, 2008.
J. Fürnkranz, D. Gamberger, and N. Lavrač. Foundations of Rule Learning. Springer Publishing Company, Incorporated, 2014.
S. T. Goh and C. Rudin. Box drawings for learning with imbalanced data. In Proc. of the 20th Int. Conf. on Knowl. Disc. and Data Min., KDD, pages 333–342, 2014.
J. W. Grzymala-Busse and J. Stefanowski. Three discretization methods for rule induction. Int. Journal of Intelligent Systems, pages 29–38, 2001.
J. W. Grzymala-Busse, L. K. Goodwin, W. J. Grzymala-Busse, and X. Zheng. An approach to imbalanced data sets based on changing rule strength. In Rough-Neural Computing: Techniques for Computing with Words, 2000.
L. Gu, J. Li, H. He, G. Williams, S. Hawkins, and C. Kelman. Association rule discovery with unbalanced class distributions. In Proc. of the 16th Austr. Conf. on Artif. Intel., AI, pages 221–232, 2003.
J. R. Hauser, O. Toubia, T. Evgeniou, R. Befurt, and D. Dzyabura. Disjunctions of conjunctions, cognitive simplicity, and consideration sets. Journal of Marketing Res., 47(3):485–496, 2010.
T. K. Ho. The random subspace method for constructing decision forests. IEEE Trans. on Pattern Analysis and Machine Intel., TPAMI, 20(8):832–844, 1998.
R. C. Holte, L. E. Acker, and B. W. Porter. Concept learning and the problem of small disjuncts. In Proc. of the 11th Int. Joint Conf. on Artif. Intel., IJCAI, pages 813–818, 1989.
R. Kerber. ChiMerge: Discretization of numeric attributes. In Proc. of the 10th Nat. Conf. on Artif. Intel., AAAI, pages 123–128, 1992.
H. Lakkaraju, S. H. Bach, and J. Leskovec. Interpretable decision sets: A joint framework for description and prediction. In Proc. of the 22nd Int. Conf. on Knowl. Disc. and Data Min., KDD, pages 1675–1684, 2016.
W. Li, J. Han, and J. Pei. CMAR: accurate and efficient classification based on multiple class-association rules. In Proc. of the 2001 IEEE Int. Conf. on Data Min., ICDM, pages 369–376, 2001.
B. Liu, W. Hsu, and Y. Ma. Integrating classification and association rule mining. In Proc. of the 4th Int. Conf. on Knowl. Disc. and Data Min., KDD, pages 80–86, 1998.
D. Malioutov and K. Varshney. Exact rule learning via boolean compressed sensing. In Proc. of the 30th Int. Conf. on Mach. Learn., ICML, pages 765–773, 2013.
T. Miller. Explanation in artificial intelligence: Insights from the social sciences. CoRR in arXiv, 2017.
M. Muselli. Approximation properties of positive boolean functions. In Proc. of the 16th WIRN/NAIS, 2005.
M. Muselli. Switching neural networks: A new connectionist model for classification. In B. Apolloni, M. Marinaro, G. Nicosia, and R. Tagliaferri, editors, Neural Nets, pages 23–30, Berlin, Heidelberg, 2006. Springer Berlin Heidelberg.
M. Muselli and A. Quarati. Reconstructing positive boolean functions with shadow clustering. In Proc. of the 2005 Europ. Conf. on Circuit Theory and Design, ECCTD, volume 3, pages III/377–III/380, 2005.
K. Napierala. BRACID: a comprehensive approach to learning rules from imbalanced data. Journal of Intel. Information Systems, 39(2):335–373, 2012.
C. H. Nguyen and T. Ho. An imbalanced data rule learner. In Proc. of the 9th Europ. Conf. on Principles and Practice of Knowl. Disc. in Databases, PKDD, volume 3721, pages 617–624, 2005.
J. R. Quinlan and R. M. Cameron-Jones. FOIL: A midterm report. In Proc. of the 4th Europ. Conf. on Mach. Learn., ECML, pages 1–20, 1993.
G. Su, D. Wei, K. R. Varshney, and D. M. Malioutov. Learning sparse two-level boolean rules. In Proc. of the IEEE 26th Int. Workshop on Mach. Learn. for Signal Processing, MLSP, pages 1–6, 2016.
J. Van Hulse, T. M. Khoshgoftaar, and A. Napolitano. Experimental perspectives on learning from imbalanced data. In Proc. of the 24th Int. Conf. on Mach. Learn., ICML, pages 935–942, 2007.
W. W. Cohen and Y. Singer. A simple, fast, and effective rule learner. In Proc. of the 16th Nat. Conf. on Artif. Intel., AAAI, 1999.
T. Wang, C. Rudin, F. Doshi-Velez, Y. Liu, E. Klampfl, and P. MacNeille. A Bayesian framework for learning rule sets for interpretable classification. JMLR, 18(70):1–37, 2017.
G. M. Weiss. Mining with rarity: A unifying framework. SIGKDD Explorations, 6(1):7–19, June 2004.
H. Yang, C. Rudin, and M. Seltzer. Scalable Bayesian rule lists. In Proc. of the 34th Int. Conf. on Mach. Learn., ICML, pages 3921–3930, 2017.
X. Yin and J. Han. CPAR: Classification based on Predictive Association Rules, pages 331–335. SIAM, 2003.
A THE BASE METHOD STEP BY STEP

In this section, we show in detail the main steps of the base algorithm, using a concrete example. Consider the scenario of forecasting the failure condition of an IT system from two values representing the CPU and main memory (MEM) utilization, as depicted in the first two columns of table 5. We assume that CPU and MEM are continuous features with values in the domain [0, 100]. The last column reports the binary Label, where 1 represents a system failure. The example reports eight records, of which two are failures.
      CPU  MEM  r1  r2  String  Label
t1    95   10   3   1   110 01  1
t2    80   10   1   1   011 01  0
t3    81   85   2   2   101 10  1
t4    10   85   1   2   011 10  0
t5    10   10   1   1   011 01  0
t6    82   10   2   1   101 01  0
t7    85   10   2   1   101 01  0
t8    81   10   2   1   101 01  0

Table 5: Original values of CPU and MEM, their mappings to discrete ranges (r1, r2), binary encoding, and binary label.

A.1 Discretization And Binarization
The first operation is discretization. Assume the discretization algorithm identifies three intervals for CPU and two intervals for MEM, as follows. CPU: [0, 81), [81, 95), [95, max). MEM: [0, 85), [85, max). We can now map the original values to integer values over the ranges (1, 2, 3) and (1, 2), as shown in columns r1 and r2, respectively. The resulting discretized records are then mapped to (inverse one-hot encoded) binary strings of five bits, as recorded in the String column. We also define a partial order relation between binary records, such that x ≤ x′ ⟺ x ∧ x′ = x. Moreover, the application of inverse one-hot encoding ensures that the relation between input features and labels is monotone, according to definition 2.2 in the main paper. We can give an intuition through a simple example: consider the two binary strings 011 and 110; we see that 011 ≰ 110 and 110 ≰ 011, i.e., the two strings are incomparable.
A.2 Learning The Boundary
Consider the first positive sample t1, with string 110 01. An exhaustive search strategy would explore all possible flipping alternatives to find the most general conflict-free binary strings. If, for example, we flip off the first bit we obtain 010 01 ≤ t2: we have therefore a conflict. If we keep the first bit at 1 and flip off the second bit, we obtain 100 01, which is in conflict with t6–t8. Finally, if we flip off the last bit, we obtain 110 00, which has no conflict: this is a candidate boundary point. If we repeat the same procedure for t3, after flipping off the third bit we obtain another boundary point, 100 10.

Figure 2: Partially ordered set created from the records in table 5.

Figure 2 shows the partially ordered set corresponding to table 5. At the beginning, the nodes at the top are the ones for which we know the label, represented with a superscript symbol + and − for positive and negative, respectively. They can be seen as maximally-specific rules. If we take the positive class as target, we move inside the Boolean lattice by flipping off positive bits, starting from the positive binary samples, and go down to find binary elements – located on the boundary – that divide positive and negative samples. While we navigate the Boolean lattice, nodes are labelled according to the cover test against the negative samples. As soon as a conflict is found, we can avoid going down from that node, but there is still the possibility to explore that path from another binary sample. This recursive procedure corresponds to up-and-down movements in the lattice. However, if at each iteration we are able to select the best candidate bit and avoid conflicts, we only take steps down in the Boolean lattice. We use the heuristic described in the main paper to choose the best candidate bit to flip off.

A.3 A Practical Example
Consider again the example in table 5. Since at the beginning S = T, we only report |T_i|. For the first positive record t1 = 110 01, we have: F_1 = {011 01, 011 10}, F_2 = {101 01}, F_5 = {011 10}. We have therefore: d_l(t1, F_1) = 1, d_l(t1, F_2) = 1, d_l(t1, F_5) = 2. We already know that flipping off either the first or the second bit would lead to a conflict: thus, we directly flip off the fifth bit to obtain the boundary point 110 00, independently from the value of |T_5|. Element 110 00 is added to the set of boundary points A.

For the second positive record t3 = 101 10, we have: F_1 = {011 01, 011 10}, F_3 = ∅, F_4 = {011 01, 101 01}. We have therefore: d_l(t3, F_1) = 1, d_l(t3, F_3) = undefined, and d_l(t3, F_4) = 1. Although i = 3 induces a distance from an empty set, since we know that flipping off the other indexes generates conflicts, we can immediately label 100 10 as a boundary point and add it to A.

A.4 From Boundary Set To Rules
At the end of the previous phase, we obtain the boundary set A = {110 00, 100 10}. In this case, each boundary point covers only one distinct positive sample; therefore the union of the two points covers the whole set of positive samples, and both points are kept after the regularization. Let us suppose we follow a positive set cover strategy, without early stopping. Then, the boundary set can be immediately mapped to the rule set shown in fig. 3.

IF CPU ∈ [95, max) OR (CPU ∈ [81, max) AND MEM ∈ [85, max)) THEN Label = 1
ELSE Label = 0

Figure 3: Rule set extracted from the boundary set A.
B PARALLEL AND DISTRIBUTED IMPLEMENTATION

libre is amenable to parallel and distributed implementations. Indeed, it processes one positive sample at a time. An exhaustive version of the FindBoundaryPoint procedure is embarrassingly parallel and easily parallelizable on multi-core architectures: it is sufficient to spawn a UNIX process per positive sample and exploit all available cores.

The approximate procedure, instead, requires a slightly more involved approach. Indeed, the approximate FindBoundaryPoint procedure processes positive records that have not yet been covered by any boundary point. Hence, a global view on the set S is required. We experimented with two alternatives. The first is to place S in a shared, in-RAM datastore, because UNIX processes – unlike threads – do not have shared memory access. The second alternative is to simply let each individual process hold its own version of S, thus sacrificing the global view. Our experiments indicate that the loss in performance due to a local view is negligible, and largely outweighed by the gain in execution speed, since the execution time decreases linearly with the number of spawned UNIX processes. Moreover, both D+ and D− remain consistent throughout the whole induction phase.

libre can also easily be distributed so that it runs on a cluster of machines, using for example a distributed computing framework such as Apache Spark. This approach, called data parallelism, splits input data across machines and lets each machine execute, independently, a weak learner. The data splitting operation shuffles random subsets of the input features to each worker machine. Once each worker finishes generating its local rule set, the rule sets are merged in the "driver" machine, which eventually applies the filtering and then executes the rule selection procedure to produce the final boundary.
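The process-based variant with a local view of S can be sketched with Python's standard multiprocessing; find_boundary stands for the base method, and all names and structural choices are ours.

```python
from multiprocessing import Pool
import random

def train_weak_learner(args):
    """Runs in a worker process: one weak learner with a local view of S."""
    D_pos, D_neg, feats = args
    P_pos = {tuple(x[i] for i in feats) for x in D_pos}
    P_neg = {tuple(x[i] for i in feats) for x in D_neg}
    return feats, find_boundary(P_pos - P_neg, P_neg)  # find_boundary: stub

def train_ensemble(D_pos, D_neg, d, E, k, workers=4):
    jobs = [(D_pos, D_neg, sorted(random.sample(range(d), k))) for _ in range(E)]
    with Pool(workers) as pool:
        local_sets = pool.map(train_weak_learner, jobs)
    # "Driver": merge the local rule sets with a union; filtering and the
    # greedy rule selection run afterwards on the merged set.
    return [(feats, a) for feats, rules in local_sets for a in rules]
```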
C THE IMPACT OF LIBRE'S PARAMETERS

In this section we investigate how acting on libre's parameters allows us to obtain specific performance-interpretability tradeoffs. We do not cover all the possible parameters: in particular, we focus on the discretization threshold, #estimators, and #features per estimator. The effects of α and of early stopping in weighted set cover are not reported here, since they are well known from previous studies. When we vary one parameter, all the others are kept fixed to isolate its impact. We also give some rules of thumb to choose them.

C.1 The Effects Of Varying The Discretization Threshold
The choice of the discretization threshold depends on the specific dataset: a threshold equal to zero means no discretization, whereas increasing the threshold is equivalent to increasing the tolerance to combine consecutive ranges of values with different label distributions. In general, a zero threshold gives poor performance and results in a bigger lattice with a consequently slower training time; a too aggressive (high) threshold is not recommended either, because it would lead to a huge loss of information.

The most significant effects occur as soon as we start increasing the threshold: in general, the F1-score improves (and eventually oscillates) up to a value after which it can decrease. Clearly, if the dataset contains only continuous features and we keep increasing the threshold, original and discrete records will coincide at a certain point.

The threshold also affects the number of rules and their size. In general, when there is no discretization, two extreme cases are possible: i) we might have as many rules as the number of positive examples (if their binary representation does not generate conflicts with the elements in F), with #atoms = #features, meaning that the model simply overfitted the training data; ii) we might end up with few rules with a very high number of atoms (or no rules at all): the model tried to generalize positive records but was not able to learn something meaningful because too many conflicts were present in the dataset.

From our experiments, the second option is more common (few complex rules). Again, as soon as we start increasing the threshold, the model starts to learn: the number of discovered rules increases and the number of atoms decreases, since the model is able to filter out useless features. After that, changes tend to stabilize: in our experiments, this happens when the discretization threshold is roughly between 3 and 6.
C.2 The Effects Of Varying #estimators And #features

We analyze how #estimators and #features affect the predictive performance and interpretability of libre, keeping the remaining parameters fixed. Results are reported for the Heart UCI dataset, but the considerations we make are quite general.
Parameter Settings. We fixed a discretization threshold of 6. The search procedure optimizes the H1 tuple with α = 0.7, without applying any early-stopping condition. We varied #estimators (from 1 up to 20) and #features per estimator, performing up to 50 runs for each (#estimators, #features) configuration, where the features used by each estimator are randomly selected. Notice that this is not the optimal set of parameters.

Effects On F1-score.
As shown in fig. 4, if we fix #estimators, when #estimators is low (one estimator) the F1-score improves considerably as #features increases. When enough estimators are used, the F1-score stabilizes: we can use fewer #features per estimator with almost no effect on the F1-score. From fig. 5, we can see that, if we fix #features, the F1-score benefits from increasing #estimators. When #features increases, limiting #estimators to a low value does not significantly impact the F1-score.

In other words, for low #features it is convenient to run more estimators: each estimator works on different subsets of the input features, and the union of their rules will hopefully be diverse, with a consequently higher F1-score. For the specific case of Heart, we do not notice any significant difference in F1-score when passing from 5 to 20 estimators. However, it is generally convenient to increase #estimators in order to try as many combinations of features as possible and reduce the variance of the results. For datasets with many features, this may make the difference.
Effects On rules. As shown in fig. 6, if we fix estimators and increase features, the average number of rules tends to increase up to a certain value, and then stabilizes or gets slightly worse. From fig. 7, we notice that, when features is low, the number of rules tends to increase as we increase the number of estimators. Indeed, the model generates fewer rules when there are not enough discriminant features; with more estimators, each estimator discovers different rules that are then combined. As we increase features per estimator, the probability that different estimators work with similar sets of features increases, together with the probability of generating the same (or very similar) rules: that is why the size of the rule set tends to stabilize. In these cases, it might be convenient to run fewer estimators to save execution time. In general, increasing the number of estimators considerably reduces the variance of the results.
Effects On atoms. As shown in fig. 8, if we keep estimators fixed, the number of atoms of the rule set increases as features increases. If we fix features (fig. 9), estimators does not seem to affect atoms significantly. As usual, increasing estimators reduces the variance of the results.
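The three effects above come from the same kind of parameter sweep; the following is a hedged sketch of it, where train_and_eval is a hypothetical stand-in that trains libre once with the given parameters and returns the metric of interest (F1-score, rules, or atoms).

import statistics

def sweep(estimator_grid, feature_grid, train_and_eval, runs=50):
    results = {}
    for n_est in estimator_grid:
        for n_feat in feature_grid:
            # Repeat training with fresh random feature subsets.
            scores = [train_and_eval(n_est, n_feat, seed=s)
                      for s in range(runs)]
            # Mean and standard deviation over runs; the latter shows
            # the variance reduction obtained with more estimators.
            results[(n_est, n_feat)] = (statistics.mean(scores),
                                        statistics.stdev(scores))
    return results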
Final Remarks.
In conclusion, if we want interpretable rule sets, it is better to use few input features per estimator and as many estimators as possible. In appendix C, we have not used any early-stopping condition. However, it is good practice to tune this parameter in order to generate rule sets that are both more interpretable and highly accurate; a sketch of the idea follows.
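As an illustration only (the selection loop below is not libre's actual simplification procedure), early stopping can be grafted onto a greedy rule-selection pass: stop adding rules once the validation F1-score no longer improves enough. Here predict is a hypothetical helper that labels a record positive when any selected rule fires.

from sklearn.metrics import f1_score

def select_rules(candidates, X_val, y_val, predict, min_gain=1e-3):
    selected, best = [], 0.0
    for rule in candidates:  # assumed sorted by individual quality
        score = f1_score(y_val, predict(selected + [rule], X_val))
        if score < best + min_gain:
            break  # early stop: keep the rule set compact
        selected, best = selected + [rule], score
    return selected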
D SCALABILITY EVALUATION
Here, we extensively test the scalability of libre. Unlike the main paper, we use up to 50 features and also investigate the impact of class imbalance on the execution time.
Synthetic Dataset.
For the scalability evaluation, we synthetically generate datasets with up to 1'000'000 records and class imbalance ratios of 0.001, 0.01, 0.1, and 0.5.

Settings.
We vary the number of records (10'000, 100'000, 500'000, 1'000'000), features (10, 20, 50), and class imbalance ratio (0.001, 0.01, 0.1, 0.5): for each dataset configuration, libre runs up to 100 times with different randomly generated subsets of features of size 10, 20, and 50. The average execution time in seconds is reported as the sum of two contributions: rule generation and rule simplification times. Times refer to one weak learner only: if N weak learners run in parallel, the reported time remains a good estimate. Before executing libre, we discretize the dataset with a discretization threshold equal to 6, which we empirically found to be a good value. The simplification procedure runs on the top 500 rules, if more are generated.

Figure 4: Heart dataset: F1-score as a function of features.
Figure 5: Heart dataset: F1-score as a function of estimators.
Figure 6: Heart dataset: number of rules as a function of features.
Figure 7: Heart dataset: number of rules as a function of estimators.
Figure 8: Heart dataset: number of atoms as a function of features.
Figure 9: Heart dataset: number of atoms as a function of estimators.
Figure 10: Run time on synthetic data (execution time in seconds vs. number of records, split into rule generation and rule simplification times; one panel per class imbalance ratio in {0.001, 0.01, 0.1, 0.5}).

Results.
As shown in fig. 10, the execution time is dominated by the rule generation term. For a given class imbalance ratio, the execution time increases as we increase the number of records and features. The generation time also depends on which features are fed into the model, for two main reasons: i) ChiMerge encodes poorly predictive features with larger domains, increasing the search space; ii) the generation procedure struggles to generate rules when it runs on features that are not useful for predicting the target class. This explains the high variance in the results. Intuitively, as the class imbalance ratio approaches 0.5, the number of processed records increases, together with the execution time. However, we verified experimentally that this effect is partly compensated by the higher number of negative records. As already pointed out in the main paper, we run the rule generation procedure with up to 50 features for experimental purposes only: in practical applications, if interpretability is a requirement, it is more convenient to limit the number of features and train a larger ensemble with more learners, in order to generate compact rules in a reasonable time.
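The per-learner timing convention used above (one weak learner's time approximates the parallel wall-clock time) can be made concrete with a small sketch; train_one is a hypothetical stand-in for training a single weak learner on its feature subset.

from concurrent.futures import ProcessPoolExecutor

# With one process per weak learner, wall-clock time is close to the
# slowest single learner rather than the sum of all learners, so the
# reported per-learner times remain a good estimate of the total.
def train_parallel(feature_subsets, train_one):
    with ProcessPoolExecutor() as pool:
        # One task per weak learner; results are the learners' rule sets.
        return list(pool.map(train_one, feature_subsets))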
E FULL EXPERIMENTS
In this section, we report the full experimental campaign. We use the same methods, training procedure, preprocessing, and evaluation measures as the main paper, but we report results for more datasets, as described in table 6. We also clarify which class we have trained the model on (target class): in the case of multi-class classification datasets, records not belonging to the target class are considered negative (a minimal sketch follows table 6). Table 7 reports a comparison between libre and the selected methods in terms of F1-score, whereas table 8 and table 9 report the number of rules and the average number of atoms, respectively. We also compare the rule sets leading to the best F1-scores for ripper-k, brs, and s-brl with a few configurations of libre. In fig. 11, we report the average number of rules and atoms per rule as a function of the F1-score: points at the bottom-right side of each plot are preferable, since they correspond to compact and highly predictive rule sets.
F MORE EXAMPLES OF RULE SETS LEARNED BY LIBRE

In the main paper, we showed an example of a rule set learned by libre for Liver. In this section, we report additional examples for the medical UCI datasets described in table 6, for which it might be interesting to understand the relation between the input features and the predicted diseases. Notice that different rule sets may be obtained depending on how folds are randomly built during cross-validation.
Dataset       Records  Features  Pos. ratio  Target class
Adult                                        >
Australian    690      14        .44         2
Balance       625      4         .08         B
Bank
Haberman      306      3         .26         died
Heart         270      13        .51         presence
Ilpd          583      10        .28         liver patient
Liver         345      5         .51         drinks >
Pima          768      8         .35         1
Sonar         208      60        .53         R
Tictactoe     958      9         .65         positive
Transfusion   748      5         .24         yes
Wisconsin     699      9         .34         malignant
Sap-Clean
Sap-Full

Table 6: Characteristics of evaluated datasets.
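As referenced above, here is a minimal sketch of the binarization applied to multi-class datasets, using the target classes of table 6; binarize is our own illustrative helper, not part of libre.

# Records of the target class are positive, all other records negative.
def binarize(labels, target_class):
    return [1 if label == target_class else 0 for label in labels]

# Example with Balance, whose target class in table 6 is "B":
print(binarize(["L", "B", "R", "B"], "B"))  # -> [0, 1, 0, 1]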
Dataset      rbf-svm   rf        dt        ripper-k  modlem    s-brl     brs       libre     libre
Adult        .62(.01)  .68(.01)  .68(.01)  .59(.02)  .66(.01)  .68(.01)  .61(.01)  .70(.01)  .62(.01)
Australian   .83(.02)  .86(.02)  .84(.02)  .85(.02)  .68(.28)  .82(.03)  .83(.03)  .84(.03)  .84(.03)
Balance      .03(.07)  .00(.00)  .01(.03)  .00(.00)  .16(.04)  .00(.00)  .00(.00)  .16(.08)  .14(.06)
Bank         .46(.01)  .50(.01)  .50(.01)  .44(.04)  .50(.03)  .50(.02)  .32(.05)  .55(.01)  .44(.01)
Haberman     .24(.10)  .26(.07)  .36(.08)  .38(.07)  .40(.07)  .17(.21)  .07(.06)  .41(.04)  .41(.04)
Heart        .78(.06)  .79(.07)  .71(.01)  .73(.09)  .39(.31)  .74(.05)  .70(.09)  .77(.06)  .75(.02)
Ilpd         .47(.02)  .44(.08)  .42(.10)  .20(.11)  .48(.08)  .14(.13)  .09(.08)  .54(.06)  .52(.04)
Liver        .58(.08)  .58(.07)  .56(.10)  .59(.04)  .58(.07)  .54(.03)  .61(.05)  .60(.07)  .63(.06)
Pima         .61(.04)  .63(.04)  .60(.01)  .60(.03)  .38(.18)  .61(.07)  .03(.03)  .64(.05)  .64(.05)
Sonar        .81(.04)  .83(.05)  .75(.05)  .77(.08)  .70(.06)  .76(.05)  .69(.06)  .79(.03)  .76(.04)
Tictactoe    .99(.01)  .99(.01)  .97(.01)  .98(.01)  .55(.10)  .99(.01)  .99(.01)  .68(.04)
Transfusion  .41(.07)  .35(.06)  .35(.05)  .42(.10)  .42(.08)  .05(.10)  .04(.05)  .49(.12)  .49(.12)
Wisconsin    .95(.02)  .95(.01)  .91(.04)  .94(.02)  .95(.01)  .94(.02)  .88(.03)  .95(.01)  .93(.02)
Sap-Clean    .93(.02)  .93(.01)  .85(.03)  .86(.02)  .88(.01)  .90(.01)  .68(.03)  .95(.02)  .72(.03)
Sap-Full     -         -         -         -         -         .81(.02)  -         .89(.03)  .68(.04)
Avg Rank     4.0(1.8)  3.1(1.9)  5.5(1.9)  5.3(1.7)  5.0(2.8)  5.3(2.3)  7.3(2.5)

Table 7: F1-score (st. dev. in parentheses).
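A hedged sketch of how the "Avg Rank" rows in tables 7-9 can be computed: methods are ranked on each dataset (rank 1 = best, i.e., highest F1-score), and each method's ranks are averaged over datasets. The handling of ties and of missing entries below is our own simplification; the paper does not spell out its exact convention.

import statistics

def average_ranks(scores):
    # scores: {dataset: {method: f1}}; missing methods simply absent.
    ranks = {}
    for per_method in scores.values():
        ordered = sorted(per_method, key=per_method.get, reverse=True)
        for pos, method in enumerate(ordered, start=1):
            ranks.setdefault(method, []).append(pos)
    return {m: statistics.mean(r) for m, r in ranks.items()}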
Dataset      dt        ripper-k  modlem    s-brl      brs       libre     libre
Adult
Australian
Balance
Bank
Haberman
Heart
Ilpd
Liver
Pima
Sonar
Tictactoe
Transfusion
Wisconsin
Sap-Clean
Sap-Full     -         -         -         56.4(4.6)  -         17.5(5.2)
Avg Rank     5.9(0.9)  3.3(1.3)  6.5(0.9)  4.8(0.8)

Table 8: Number of rules (st. dev. in parentheses).
Dataset      dt        ripper-k  modlem    s-brl      brs       libre     libre
Adult
Australian
Balance
Bank
Haberman
Heart
Ilpd
Liver
Pima
Sonar
Tictactoe
Transfusion
Wisconsin
Sap-Clean
Sap-Full     -         -         -         85.6(9.7)  -         4.7(0.3)
Avg Rank     5.8(1.6)
Table 9: Average number of atoms (st. dev. in parentheses).

Figure 11: F1-score vs. average number of rules and atoms per rule, for (a) Adult, (b) Australian, (c) Balance, (d) Bank, (e) Haberman, (f) Heart, (g) Ilpd, (h) Liver, (i) Pima, (j) Sonar, (k) Tictactoe, (l) Transfusion, (m) Wisconsin, (n) Sap-Clean.

IF (number of positive axillary nodes ∈ [2, max]) THEN died within 5 years
ELSE survived 5 years or longer
Figure 12: Example of rules learned by libre for Haberman.
IF (slope of the peak exercise ∈ {flat, downsloping} AND number of major vessels ∈ [1,
OR (chest pain type ∈ {asymptomatic} AND thal ∈ {reversable defect})
OR (sex ∈ {male} AND fasting blood sugar > ∈ {False} AND number of major vessels ∈ [1,
THEN class = presence
ELSE class = absence
Figure 13: Example of rules learned by libre for Heart.
IF (TB ∈ [min, AND sgbp ∈ [min, 42))
OR (TB ∈ [min, AND alkphos ∈ [min,
OR (age ∈ [35, , [56, AND sgbp ∈ [42,
THEN class = liver patient
ELSE class = non liver patient
Figure 14: Example of rules learned by libre for Ilpd.
IF (mean corpuscular volume ∈ [90, 96)) OR (gamma glutamyl transpeptidase ∈ [20, max]) THEN liver disorder = True
ELSE liver disorder = False
Figure 15: Example of rules learned by libre for Liver.
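As a concrete illustration, here is a small sketch of applying the Liver rule set of fig. 15 to one record; predict_liver, in_range, and the dict-based record format are our own illustration of how a libre rule set is read, not libre's API.

def in_range(x, lo, hi=None):
    # hi=None encodes an interval that is open up to "max".
    return x >= lo and (hi is None or x < hi)

def predict_liver(record):
    rule1 = in_range(record["mean corpuscular volume"], 90, 96)
    rule2 = in_range(record["gamma glutamyl transpeptidase"], 20)
    return rule1 or rule2  # liver disorder = True if any rule fires

# This record misses rule 1 but fires rule 2, so the prediction is True.
print(predict_liver({"mean corpuscular volume": 85,
                     "gamma glutamyl transpeptidase": 25}))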
IF (glucose ∈ [158, max] AND blood pressure ∈ [56, max])
OR (glucose ∈ [110, 158] AND BMI ∈ [30., max))
OR (pregnancies ∈ [4, max] AND diabetes pedigree function ∈ [0., max])
THEN diabetes = True
ELSE diabetes = False
Figure 16: Example of rules learned by libre for Pima.
IF (months since last donation ∈ [0, 8) AND total blood donated ∈ [1250, max)) THEN transfusion = Yes
ELSE transfusion = No
Figure 17: Example of rules learned by libre for Transfusion.
IF (uniformity of cell shape ∈ [5, max])
OR (clump thickness ∈ [2, max] AND bare nuclei ∈ [8, max))
OR (clump thickness ∈ [7, max] AND marginal adhesion ∈ [1, , [4, max))
THEN class = malignant
ELSE class = benign
Figure 18: Example of rules learned by libre for Wisconsin.