Preventing the Generation of Inconsistent Sets of Classification Rules
Thiago Zafalon Miranda, Diorge Brognara Sardinha, Ricardo Cerri
Thiago Zafalon Miranda
Department of Computer Science, Federal University of São Carlos
São Carlos, Brazil
[email protected]

Diorge Brognara Sardinha
Department of Computer Science, Federal University of São Carlos
São Carlos, Brazil
[email protected]

Ricardo Cerri
Department of Computer Science, Federal University of São Carlos
São Carlos, Brazil
[email protected]
Abstract—In recent years, the interest in interpretable classification models has grown. One of the proposed ways to improve the interpretability of a rule-based classification model is to use sets (unordered collections) of rules instead of lists (ordered collections) of rules. One of the problems associated with sets is that multiple rules may cover a single instance but predict different classes for it, thus requiring a conflict resolution strategy. In this work, we propose two algorithms capable of finding feature-space regions inside which any created rule would be consistent with the already existing rules, preventing inconsistencies from arising. Our algorithms do not generate classification models, but are instead meant to enhance algorithms that do so, such as Learning Classifier Systems. Both algorithms are described and analyzed exclusively from a theoretical perspective, since we have not yet modified a model-generating algorithm to incorporate our proposed solutions. This work presents the novelty of using conflict avoidance strategies instead of conflict resolution strategies.
Index Terms—Classification rules, Rule generation, Rule consistency, Constraint handling
I. INTRODUCTION
Classification is one of the commonest tasks of Machine Learning, concisely described in [1] as the generation of a model that learns relations between predictive features and target features. This learning occurs by adjusting the internal parameters of the model.

In recent years, the interest in interpretable classification models has grown, partly due to regulations such as the General Data Protection Regulation (commonly known as GDPR), which created a "right to explanation", a regulation "whereby a user can ask for an explanation of an algorithmic decision that significantly affects them" [2].

Even though interpretability, in the context of classification models, is not an objectively and consistently defined concept [3], it is reasonable to say that some types of classification models are inherently more interpretable than others; a Decision Tree [4], for instance, can be said to be more interpretable than a Deep Neural Network [5].

It is generally accepted that rule-based classifiers are among the most interpretable [6]. The training phase of such classifiers usually consists in creating and tuning a list of classification rules. A classification rule usually has two components, its antecedent and its consequent. The antecedent is a collection
of tests over feature values, and the consequent is the label (or set of labels, in multi-label classification) that will be assigned to the dataset instance being classified, if it passes all the antecedent's tests. A simple classification rule is exemplified in Figure 1.

Fig. 1. Simple Classification Rule

The interpretability of a rule-based classification model is frequently measured by its size, i.e. the number of rules in the model and/or the number of features tested by the rules [6]. Therefore, many algorithms that generate interpretable classification models try to minimize the model size.

In [1], however, the authors propose an alternative way of improving the interpretability of a rule-based classification model: using sets (unordered collections) of rules instead of lists (ordered collections).

If a classifier employs a list of rules, then its n-th rule cannot be correctly interpreted alone, because an instance that is covered by it may also be covered by a previous rule; the actual class predicted by the classifier would be the one of the previous rule. Using, instead, a set of rules allows the user to analyze the rules individually, making the model more interpretable.

However, using a set of rules may create conflicts when multiple rules (with different consequents) cover the same dataset instance. The authors of [1] discuss two conflict resolution strategies, allowing the classifier to function properly even if it contains multiple rules that contradict each other.

In this work, we propose two algorithms capable of finding sets of feature-space regions such that any rule created within those regions will always be consistent with an existing set of rules R. In this context, a rule r₁ is said to be consistent with a rule r₂ if their consequents are identical or if there is no intersection between their antecedents, i.e. it is not possible to create an object that would be covered by rules that predict different labels. A set of rules is said to be consistent if each rule of the set is consistent with every other rule. By only creating consistent rules, one avoids the problem of conflicting predictions entirely, hence improving the interpretability of the classification model.

The proposed algorithms do not generate classification models by themselves; instead, they are meant to enhance algorithms that do so. They can be used, for instance, during the initialization and mutation phases of a genetic algorithm. Since no model is directly generated, our algorithms can only be evaluated by modifying an existing model-building algorithm to use one of the methods, then measuring the relative change on the induced models. The two algorithms themselves are independent of the metrics chosen.

It is interesting to note that our algorithms are not sensitive to the type of the consequent of the rules, as long as they can be tested for equality. This means that they can be used to supplement algorithms that generate any kind of rule format, such as hierarchical multi-label or flat single-label rules, since inconsistencies between rules arise from overlaps between the rules' antecedents.

The remainder of this work is organized as follows: in Section II we discuss related works; in Section III we present two algorithms to aid the creation of consistent rules, and analyze their properties; in Section IV we discuss future works and present our conclusions.

II. RELATED WORKS
The concept of interpretability has been a point of contention in Artificial Intelligence (AI) literature. There are many different views on what constitutes an interpretable classification model, how to measure interpretability, and whether it is necessary, or even worthwhile, to sacrifice the predictive power of a classifier in favor of its interpretability. Some authors have proposed mechanisms to improve the interpretability of black-box models [7], while others have focused on transparent rule-based models, such as Learning Classifier Systems [8]. In this section, we will discuss some of the works which have focused on algorithms that generate rule-based models, and why interpretability is important.

In [9] and [10] the authors argue that AI models do not usually operate in a vacuum, they interact with humans, and that various types of Human-AI interactions may benefit from an interpretable model.

In areas such as bioinformatics (protein function prediction, gene function prediction, among others) it is important that the classification model is interpretable, in order to make it possible for its users to validate it [11]. In medical and financial applications, understanding a computer-induced model is often a prerequisite for users to trust the model's predictions [6].

Considering that rule-based classification models are inherently transparent, thus interpretable, many algorithms that generate interpretable models have been published (see the discussion of transparency in [3]).

The Decision Tree algorithm C4.5 [12], for instance, generates models that can be interpreted as easily as a flowchart. It also employs a pruning strategy that simultaneously improves the interpretability of the model, by reducing its size, and its predictive power, by reducing overfitting.

In [13], the authors modified the algorithm C4.5 to handle multi-label classification. One of the most interesting parts of their work, from an interpretability perspective, was the generation of a set of rules from the decision tree. This process of "splitting" a decision tree into a set of rules is one of the few processes that we know of that can generate a consistent set of rules.

In [14], the authors propose an algorithm based on Predictive Clustering Trees (PCTs) [15] to perform hierarchical multi-label classification using a single, global model. PCT-based algorithms see decision trees as hierarchies of clusters and, as such, during the model training phase, they try to minimize intra-cluster variance. The proposed algorithm, called Clus-HMC, was later modified in [16] to handle class hierarchies organized as Directed Acyclic Graphs (DAGs), and used in [17] to generate a collection of trees which build an ensemble.

In [18], the authors propose an evolutionary algorithm to generate interpretable fuzzy classification rules by using the Pittsburgh approach, in which each individual of the population represents a complete classifier. The fittest selection mechanism used was the multi-objective algorithm NSGA-II [19], and the functions being optimized were accuracy, number of rules, and length of rules. The authors also discuss the issue of interpretability of fuzzy classification rules and strategies to improve it, such as merging similar fuzzy sets.

In [20], the authors propose a Genetic Algorithm (GA) to generate interpretable traditional (non-fuzzy) classification rules.
The algorithm, called HMC-GA, is the only GA-based method in the literature that is capable of building a global hierarchical multi-label classification model [21].

In [22], the authors propose the first ant colony-based classification algorithm, called Ant-Miner. It generates lists of classification rules, and had, in its original version, the limitation of only handling categorical features. Ant-Miner was used as a base for many algorithms, such as Multi-Label Ant-Miner (MuLAM) [23], which generates flat (i.e. non-hierarchical) multi-label classification rules; cAnt-Miner [24], which removed the restriction of using only categorical features; h-Ant-Miner [25], which generates hierarchical single-label classification rules; and hm-Ant-Miner [11], which generates hierarchical multi-label classification rules.

In [26], the authors propose a new sequential covering strategy for cAnt-Miner, in an algorithm called cAnt-Miner_PB. This algorithm was later enhanced to generate sets (unordered collections) of rules in an algorithm called Unordered cAnt-Miner_PB [1]. The authors of Unordered cAnt-Miner_PB argue that a set of rules is more interpretable than a list (ordered collection) of rules. They also propose a new interpretability metric, called Prediction-Explanation Size, that accounts for the inter-dependency of rules in lists.

The sets of rules generated by Unordered cAnt-Miner_PB could contain inconsistent rules, i.e. multiple rules that cover the same dataset instance but predict different classes for it. The authors discuss mechanisms to resolve such conflicts when they arise (i.e. when making a prediction), such as using the rule with the highest quality, or aggregating the predictions of the conflicting rules and selecting the most common label in the aggregation.

The idea that sets of rules are more interpretable than lists of rules, and the fact that there are, as far as we are aware, no mechanisms to prevent the generation of inconsistent rules, motivated us to research and develop the algorithms described in Section III.

III. PROPOSED ALGORITHMS
We propose two algorithms to solve the problem of adding a new rule to an existing set of consistent rules, creating an expanded rule set that is still consistent. More specifically, the algorithms find feature-space regions in which rules can be created while being consistent with an already existing set of rules; the actual creation of the rules inside such regions is not within the scope of the methods.

There is an important distinction between identifying whether a new rule is consistent with an existing collection of rules and creating a new rule that is consistent with an existing collection of rules; the first task, analogous to determining if a cake tastes good, is trivial; the second task, analogous to baking a good cake, is far more complex. The algorithms we propose are meant to guide the execution of the latter task.

To the best of our knowledge, both algorithms, and the conflict avoidance approach they employ, are new to the literature. We refer to these algorithms as Constrained Feature-Space Greedy Search (CFSGS) and Constrained Feature-Space Box-Enlargement (CFSBE). We present them, respectively, in Section III-A and Section III-B.

Whenever we refer to a rule, unless stated otherwise, we will be referring only to its antecedent. The case in which rules have the same consequent but different antecedents, hence are consistent with each other, will be discussed in Section III-D.

We will assume that all predictive features are continuous. Both algorithms can handle categorical features, but explaining them exclusively in terms of continuous features allows a better visualization and explanation. We will discuss the treatment of categorical features in Section III-C.

In order to simplify the explanation of both algorithms, we will use the convention that rules have exactly one feature test for each predictive feature and have the format shown in Equation 1, where testᵢ denotes the i-th test of the rule, i.e. the test over the i-th feature, and |f| denotes the number of features in the dataset. We will also use the convention that feature tests have the format shown in Equation 2, where fᵢ denotes the value of the i-th feature of the dataset instance being tested, and testᵢ.lower and testᵢ.upper are, respectively, the lower and upper bound values of the test.

rule := test₁ ∧ test₂ ∧ ··· ∧ test_|f|    (1)

testᵢ := testᵢ.lower ≤ fᵢ < testᵢ.upper    (2)

Fig. 2. Classification rules in 2D feature-space
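To make the conventions of Equations 1 and 2 concrete, the following minimal Python sketch models a rule as one half-open interval test per continuous feature. The class and method names (Test, Rule, covers) are our own illustration, not part of the proposed algorithms.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Test:
    lower: float  # inclusive lower bound (test_i.lower)
    upper: float  # exclusive upper bound (test_i.upper)

    def covers(self, value: float) -> bool:
        # Equation 2: test_i := test_i.lower <= f_i < test_i.upper
        return self.lower <= value < self.upper

@dataclass
class Rule:
    tests: List[Test]  # one test per predictive feature (Equation 1)
    consequent: str    # the predicted label

    def covers(self, instance: List[float]) -> bool:
        # Equation 1: rule := test_1 AND test_2 AND ... AND test_|f|
        return all(t.covers(v) for t, v in zip(self.tests, instance))
```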
It is important to observe that the lower bound of a test is inclusive, but the upper bound is exclusive. This definition prevents inconsistencies when two tests from different rules "touch" each other, i.e. the upper bound of the i-th test from a rule has the same value as the lower bound of the i-th test from another rule. To exemplify this, consider the rules described in Equations 3 and 4. If we used an inclusive upper bound, a person with age = 10 would be covered by both rules, which would make the rules inconsistent with each other.

rule₁ = IF 0 ≤ age < 10 THEN class = child    (3)
rule₂ = IF 10 ≤ age < 150 THEN class = adult    (4)

Both algorithms can be more easily explained if we use a geometric interpretation, that is, by viewing the antecedent of a classification rule as an |f|-dimensional hyperrectangle, f being the set of features of the dataset.

Figure 2 shows what three classification rules, represented as colored rectangles, and seven dataset instances, represented as black dots, could look like in a classification problem with two predictive features, f₁ and f₂.

Considering that the grid squares in Figure 2 have unitary length, the antecedents of the rules depicted in the figure can be formally described by Equations 5, 6 and 7.

rule₁ = { test₁ = 2 ≤ f₁ < 5,  test₂ = 5 ≤ f₂ < 8 }    (5)
rule₂ = { test₁ = 6.5 ≤ f₁ < 9.5,  test₂ = 5 ≤ f₂ < 7 }    (6)
rule₃ = { test₁ = 1 ≤ f₁ < 4,  test₂ = 1 ≤ f₂ < 3 }    (7)

The premise of the two proposed algorithms is that if we can find a region of the feature-space that is not covered by any rule, we could create a rule inside such region. The new rule would be consistent with the already existing rules, because no dataset instance could be simultaneously covered by more than one rule.
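Using the sketch above, the consistency definition from Section I and the effect of the exclusive upper bound can be checked mechanically. This hypothetical helper reproduces the age example of Equations 3 and 4: the two rules touch at age = 10 but do not overlap.

```python
def overlaps(a: Rule, b: Rule) -> bool:
    # Axis-aligned hyperrectangles intersect iff their intervals overlap on
    # every dimension; with half-open intervals [lower, upper), rules that
    # merely "touch" (one's upper bound equals the other's lower bound) do not.
    return all(ta.lower < tb.upper and tb.lower < ta.upper
               for ta, tb in zip(a.tests, b.tests))

def consistent(a: Rule, b: Rule) -> bool:
    # Two rules are consistent iff their consequents are identical or their
    # antecedents do not intersect.
    return a.consequent == b.consequent or not overlaps(a, b)

child = Rule([Test(0, 10)], "child")    # Equation 3
adult = Rule([Test(10, 150)], "adult")  # Equation 4
assert consistent(child, adult)         # they touch at age = 10, no overlap
```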
A. Constrained Feature-Space Greedy Search

Since the region covered by a rule is the region described by the conjunction of its tests, the region not covered by it can be described by the disjunction of the negations of its tests, as we can see in Equations 8 and 9.

rule = test₁ ∧ ··· ∧ test_n    (8)
¬rule = ¬(test₁ ∧ ··· ∧ test_n)
¬rule = (¬test₁) ∨ ··· ∨ (¬test_n)

testᵢ = lowerᵢ ≤ fᵢ < upperᵢ    (9)
testᵢ = (lowerᵢ ≤ fᵢ) ∧ (fᵢ < upperᵢ)
¬testᵢ = ¬((lowerᵢ ≤ fᵢ) ∧ (fᵢ < upperᵢ))
¬testᵢ = ¬(lowerᵢ ≤ fᵢ) ∨ ¬(fᵢ < upperᵢ)
¬testᵢ = (lowerᵢ > fᵢ) ∨ (fᵢ ≥ upperᵢ)
¬testᵢ = (fᵢ < lowerᵢ) ∨ (fᵢ ≥ upperᵢ)

Negating the tests of r₁, for instance, results in four inequalities (or constraints), each describing a region not covered by the rule:
• f₁ < 2, the region to the left of the yellow rectangle
• f₁ ≥ 5, the region to the right of the yellow rectangle
• f₂ < 5, the region below the yellow rectangle
• f₂ ≥ 8, the region above the yellow rectangle

In Constrained Feature-Space Greedy Search (CFSGS), we create a collection C with the constraints generated by the negation of all the tests from all the rules, with Cᵢ,ⱼ denoting the j-th inequality generated from the i-th rule. The constraints generated from the negation of the tests of r₁, r₂, and r₃ are shown in Table II. As we assumed that all features are continuous, the number of constraints generated per rule, n_f, is equal to 2 · |f|. We discuss the relation between n_f and the data type of the features (categorical or continuous) in Section III-C.

By using Algorithm 1 we can organize C into a Directed Acyclic Graph (DAG). Doing so allows us to perform a greedy search to find all subsets of C that contain exactly one constraint from each rule such that all constraints are simultaneously satisfiable. Such subsets of C describe non-empty regions of the feature-space that are not covered by any rule. The DAG generated from the constraints of r₁, r₂, and r₃ is shown in Figure 3.

The objective of the search is to find consistent paths from the root to a leaf node. A path is said to be consistent iff the constraints represented by its nodes can all be simultaneously satisfied, such as the one described in Equation 10. If adding a node to the path currently being explored makes the constraints unsatisfiable, the search algorithm backtracks and tries adding another node.
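A possible implementation of this constraint-generation step, continuing the earlier sketch, is shown below; encoding each inequality as a (feature index, operator, value) triple is our own choice, not prescribed by the paper.

```python
from typing import Dict, List, Tuple

# (i, "lt", v) encodes f_i < v and (i, "ge", v) encodes f_i >= v (Equation 9).
Constraint = Tuple[int, str, float]

def negate_rule(rule: Rule) -> List[Constraint]:
    # Each test lower <= f_i < upper yields the two constraints
    # f_i < lower and f_i >= upper, describing regions the rule does not cover.
    constraints: List[Constraint] = []
    for i, t in enumerate(rule.tests):
        constraints.append((i, "lt", t.lower))
        constraints.append((i, "ge", t.upper))
    return constraints

def build_constraint_collection(rules: List[Rule]) -> Dict[int, List[Constraint]]:
    # C[i] holds the n_f constraints generated from the i-th rule.
    return {i: negate_rule(r) for i, r in enumerate(rules)}
```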
ONSTRAINTS G ENERATED F ROM R ULES
TABLE IIC
ONSTRAINTS G ENERATED F ROM R ULES
Index in C Constraint Generated C , f < C , f ≥ C , f < C , f ≥ C , f < . C , f ≥ . C , f < C , f ≥ C , f < C , f ≥ C , f < C , f ≥ Algorithm 1
Algorithm 1 CFSGS - DAG Build

Require:
  Collection of constraints: C
  Number of existing rules: n_r
  Number of constraints generated from a rule: n_f (number of categorical features plus 2 times the number of continuous features)
Ensure:
  The DAG G representing the constrained feature-space to be explored

 1: G ← New Directed Graph
 2: G.E ← {}        ▷ The set of edges of G
 3: G.V ← {root}    ▷ The set of vertices of G
 4: for i ∈ {1, 2, ..., n_r} do
 5:   for j ∈ {1, 2, ..., n_f} do
 6:     G.V ← G.V ∪ {Cᵢ,ⱼ}
 7:     if i > 1 then
 8:       for k ∈ {1, 2, ..., n_f} do
 9:         G.E ← G.E ∪ {(Cᵢ₋₁,ₖ, Cᵢ,ⱼ)}
10:       end for
11:     end if
12:   end for
13: end for
14: return G

Fig. 3. Constraints as DAG

{C₁,₁, C₂,₁, C₃,₁} = (f₁ < 2) ∧ (f₁ < 6.5) ∧ (f₁ < 1)
                   = f₁ < 1    (10)

We could fully explore the DAG to generate all consistent paths, but since we only need one, we stop the search as soon as the first one is found. It is important to observe that the order in which the nodes within a level are explored determines the order in which paths are found. If the nodes are explored in a lexicographic order, e.g. C₁,₁ is explored before C₁,₂, the first paths found will have a bias for the lower regions of the first features (e.g. the bottom left area in Figure 2). To remove this bias, it suffices to explore the nodes of each level in a random order.

The runtime complexity of building the DAG can be expressed as a function of the number of existing rules n_r and the number of constraints generated per rule n_f.

In Algorithm 1, line 6 is executed n_r · n_f times, due to the two nested loops. Line 9 is executed (n_r − 1) · n_f · n_f times, since the conditional in line 7 excludes one iteration of the outer loop. Considering the cost of line 6 to be a constant k₁, and the cost of line 9 to be a constant k₂, the total cost of building the DAG is described by Equation 11.

T(n_f, n_r) = n_r · n_f · k₁ + (n_r − 1) · n_f · n_f · k₂    (11)
T(n_f, n_r) ∈ O(n_f² · n_r)

For the search algorithm, the worst-case scenario of having no possible consistent paths would cause the exploration of every path. Since creating a path consists of choosing an edge, out of n_f edges, repeated over n_r levels of the graph, there are exactly n_f^(n_r) possible paths, leading to a complexity cost of O(n_f^(n_r)).

The total computational cost of CFSGS is the sum of building the DAG and exploring it, which makes the method bounded by (O(n_f² · n_r) + O(n_f^(n_r))) ∈ O(n_f^(n_r)). While the algorithm is capable of finding all sub-regions where a consistent rule could be created, this exponential complexity cost makes it unsuitable for many applications.
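Since consecutive levels of the DAG are fully connected, the greedy search amounts to a depth-first backtracking that picks one constraint per rule. The sketch below continues the constraint encoding introduced earlier, stops at the first consistent path, and randomizes the visiting order to avoid the lexicographic bias just discussed; it is an illustration of the search, not the authors' implementation.

```python
import random
from typing import Dict, List, Optional

def satisfiable(path: List[Constraint]) -> bool:
    # A conjunction of f_i < a and f_i >= b constraints is satisfiable iff, for
    # every feature, the largest ">=" bound lies below the smallest "<" bound.
    lo: Dict[int, float] = {}
    hi: Dict[int, float] = {}
    for feat, op, value in path:
        if op == "ge":
            lo[feat] = max(lo.get(feat, float("-inf")), value)
        else:
            hi[feat] = min(hi.get(feat, float("inf")), value)
    return all(lo.get(f, float("-inf")) < hi.get(f, float("inf"))
               for f in set(lo) | set(hi))

def greedy_search(C: Dict[int, List[Constraint]], level: int = 0,
                  path: Optional[List[Constraint]] = None) -> Optional[List[Constraint]]:
    # Depth-first search for one consistent root-to-leaf path; constraints of
    # each level are visited in random order to remove the positional bias.
    path = path or []
    if level == len(C):
        return path  # reached a leaf: `path` describes a non-covered region
    for constraint in random.sample(C[level], len(C[level])):
        if satisfiable(path + [constraint]):
            result = greedy_search(C, level + 1, path + [constraint])
            if result is not None:
                return result
    return None  # no consistent path exists below this prefix
```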
B. Constrained Feature-Space Box-Enlargement

Extending the geometrical interpretation provided by CFSGS, we would like a way to guide the search through the DAG such that the cost becomes polynomial in relation to the number of features and rules. Constrained Feature-Space Box-Enlargement's central idea is that it is possible to leverage information from the training dataset to visit only nodes that lead to a possible consistent path to a leaf node.

Instead of searching the feature-space for suitable regions, CFSBE starts from a point known not to be covered by any rule, called a "seed", and "enlarges" this point along the different dimensions, creating a box that does not overlap with any of the existing rules' antecedents. A rule created inside such a box would also not overlap with the existing rules, so they would be consistent, regardless of their consequents.
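CFSBE therefore needs a seed point that is known not to be covered by any rule; as discussed below, keeping track of which training instances are still uncovered makes seed selection cheap. One way this bookkeeping could look, reusing the Rule sketch from earlier (class and method names are ours):

```python
from collections import defaultdict
from typing import Dict, List, Optional, Set

class SeedTracker:
    # Maps each training instance to the rules covering it, so an uncovered
    # instance can be drawn as a CFSBE seed in O(1).
    def __init__(self, instances: List[List[float]]):
        self.instances = instances
        self.covered_by: Dict[int, Set[int]] = defaultdict(set)
        self.uncovered: Set[int] = set(range(len(instances)))

    def add_rule(self, rule_id: int, rule: Rule) -> None:
        # O(n) update on rule insertion, n being the number of instances.
        for i, x in enumerate(self.instances):
            if rule.covers(x):
                self.covered_by[i].add(rule_id)
                self.uncovered.discard(i)

    def remove_rule(self, rule_id: int) -> None:
        # O(n) update on rule removal: instances left without any covering
        # rule become available as seeds again.
        for i, rule_ids in self.covered_by.items():
            rule_ids.discard(rule_id)
            if not rule_ids:
                self.uncovered.add(i)

    def pick_seed(self) -> Optional[List[float]]:
        # O(1): any instance not covered by any rule can serve as a seed.
        if not self.uncovered:
            return None
        return self.instances[next(iter(self.uncovered))]
```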
Fig. 4. CFSBE - Horizontal Axis First
Fig. 5. CFSBE - Vertical Axis First
Searching for arbitrary non-covered points is equivalent to searching for non-covered regions, being as computationally expensive as CFSGS. However, if we keep track of which dataset instances are not covered by any rules, we may use one of such instances as the seed. By using an associative array structure to map which points are covered by which rules, choosing a seed would have cost O(1), and updating the structure on the insertion or removal of a rule would have cost O(n), n being the number of instances in the training dataset.

After selecting a non-covered point as seed, we must choose the order in which the dimensions will be expanded. Consider the point p₁ in Figure 2; if we first enlarge it along the f₁ dimension and f₂ afterwards, we end up with the green region shown in Figure 4. If, however, we start with f₂, we end up with the green region shown in Figure 5.

Once a seed point and a feature order are chosen, a degenerate hyperrectangle (or box) is created around the point. The box is then enlarged in each dimension according to the chosen ordering.

To determine the limit of the expansion along each dimension, that is, the boundaries to which the box can grow without overlapping with existing rules, it is necessary to check which rules "intersect" with the box on the other dimensions.

To exemplify the need for checking for "intersections" on dimensions that are not currently being expanded on, consider the point p₂ in Figure 2. Even though it is not covered by any rule, it does pass the f₂ feature test of the rule r₁, i.e. it "intersects" the rule r₁ on the f₂ dimension. If we create our box around p₂ and grow it along the f₁ dimension, it would eventually contain points that satisfy both tests of r₁, i.e. there would be an intersection between the box and r₁.

The method used to safely grow a box in such a way that it does not overlap with the boxes described by the antecedents of a set of rules is presented in Algorithm 2.
Algorithm 2 Constrained Feature-Space Box-Enlargement

Require:
  Set of non-overlapping rules: R
  Seed point x, not covered by any rule of R
  Order of dimensions to expand: O_d (a permutation of {1, 2, ..., |O_d|})
Ensure:
  The box B covers a possible hyperrectangle containing x that cannot be further expanded, and does not intersect any rule of R

 1: B ← Degenerate rectangle with |O_d| dimensions
 2: for i ∈ O_d do
 3:   Bᵢ.lower ← xᵢ
 4:   Bᵢ.upper ← xᵢ
 5: end for
 6: for d ∈ O_d do
 7:   B_d.lower ← −∞
 8:   B_d.upper ← +∞
 9:   for r ∈ R do
10:     if Intersects(B, r, d) then
11:       if x_d ≥ r_d.upper then
12:         B_d.lower ← max(B_d.lower, r_d.upper)
13:       else    ▷ Meaning x_d < r_d.lower, otherwise x would be covered by r
14:         B_d.upper ← min(B_d.upper, r_d.lower)
15:       end if
16:     end if
17:   end for
18: end for
19: return B

20: function Intersects(Box B, Rule r, Dimension to skip d)
21:   NumDimensions ← |r|
22:   for i ∈ {1, 2, ..., NumDimensions} \ {d} do
23:     u ← Bᵢ.upper > rᵢ.lower
24:     l ← Bᵢ.lower < rᵢ.upper
25:     if ¬(u ∧ l) then
26:       return False
27:     end if
28:   end for
29:   return True
30: end function
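For concreteness, Algorithm 2 can be transcribed almost line by line into Python, continuing the earlier sketches; representing the box as a list of (lower, upper) pairs is our own choice.

```python
from typing import List, Tuple

Box = List[Tuple[float, float]]

def intersects(box: Box, rule: Rule, skip: int) -> bool:
    # The Intersects function of Algorithm 2: True iff the box and the rule
    # overlap on every dimension except `skip`.
    return all(box[i][1] > rule.tests[i].lower and box[i][0] < rule.tests[i].upper
               for i in range(len(rule.tests)) if i != skip)

def enlarge_box(rules: List[Rule], seed: List[float], order: List[int]) -> Box:
    # Main body of Algorithm 2: start from a degenerate box around the seed
    # and enlarge one dimension at a time, clipping against every rule that
    # intersects the box on the remaining dimensions.
    box: Box = [(v, v) for v in seed]
    for d in order:
        lower, upper = float("-inf"), float("inf")
        for r in rules:
            if intersects(box, r, d):
                if seed[d] >= r.tests[d].upper:
                    lower = max(lower, r.tests[d].upper)  # rule lies below the seed on d
                else:
                    # seed[d] < r.tests[d].lower; otherwise the seed would be covered by r
                    upper = min(upper, r.tests[d].lower)
        box[d] = (lower, upper)
    return box
```

Calling enlarge_box with order [0, 1] versus [1, 0] reproduces the dependence on the expansion order illustrated by Figures 4 and 5.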
It is worth noting that even though CFSBE cannot find all suitable hyperrectangles, it can find all the arguably relevant ones. Consider the dataset depicted in Figure 2: CFSGS could find a region below the blue rectangle, while CFSBE cannot; but since there are no instances there, it is arguable that the rules created in such a region would not be useful, as their coverage would be zero and their predictive power could not be measured on the dataset used to train the classification model.

CFSBE's runtime complexity can be calculated in a straightforward manner. Let d be the number of dimensions (features) in the dataset and r be the number of rules in the set of existing rules. The innermost part of the algorithm, in lines 11 through 15, has a constant cost, k₁, because it does not depend on the values of d or r. The Intersects function has a loop that executes at most d − 1 steps. The contents of this loop do not depend on d nor r, therefore they also have a constant cost, k₂, hence the worst-case scenario for this function is T_I(d) = (d − 1) · k₂.

Lines 1 through 4 perform the initialization of the degenerate rectangle, which occurs in a loop with d steps, each step having constant cost k₃. The rest of the algorithm is a trivial nesting of loops, the first of which takes d steps, the second r steps, and the third, inside the Intersects function, takes at most d − 1 steps, as discussed previously.

The total cost of CFSBE is described as T in Equation 12. There are three terms in this summation, the first being the creation of the degenerate rectangle, the second the main algorithm body, and the third value, n, comes from keeping track of the available seeds, as discussed previously. Many algorithms that generate rule-based classification models, however, already have to create a mapping between dataset instances and rules that cover them; Learning Classifier Systems, for instance, may need such information to calculate the fitness of the rules [20]. Therefore, in practice, the cost of CFSBE could be considered as O(d² · r).

T(d, r, n) = d · k₃ + d · r · (T_I(d) + k₁) + n    (12)
T(d, r, n) ∈ O(d² · r + n)

C. Tests Over Categorical Features
We explained both algorithms assuming that the dataset contained only continuous features, but both algorithms can be modified to handle categorical features. Since a feature test over a categorical feature is simply an equality test, the main difference for CFSGS is that during the creation of the collection of constraints C, a single constraint is generated for categorical features, instead of two, as we can see in Equations 13 and 14. Consequently, the parameter n_f of Algorithm 1 equals the number of categorical features plus two times the number of continuous features.

rule₁ = { test₁ = 2 ≤ f₁ < 5,  test₂ = f₂ = yellow }    (13)

C = { C₁,₁ = f₁ < 2,  C₁,₂ = f₁ ≥ 5,  C₁,₃ = f₂ ≠ yellow }    (14)

For CFSBE to handle categorical features, we must change the data structure that represents the enlarging box. Instead of being a simple associative array that maps feature indices to continuous ranges, it must now map feature indices to either sets of values, for categorical features, or continuous ranges, for continuous features. Considering this difference, it is more appropriate to call the Box a conflict-free "Region".

Modifying the algorithm to check whether the Region and a rule intersect along a dimension that represents a categorical feature is rather simple: one only needs to check if the value being tested by the rule is a member of the set of values of the Region for that dimension. Similarly, adjusting the Region's values for a categorical dimension, in order to avoid overlapping with a rule, equates to removing the value which is tested by the rule from the set of values for that dimension.
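A speculative sketch of the mixed Region described above: continuous dimensions keep half-open ranges while categorical dimensions keep the set of still-allowed values. All names here are illustrative, not taken from the paper.

```python
from dataclasses import dataclass
from typing import Set, Tuple, Union

@dataclass
class CategoricalTest:
    value: str  # equality test: f_i = value

# A Region dimension is either a continuous half-open range or the set of
# categorical values still allowed for that feature.
RegionDim = Union[Tuple[float, float], Set[str]]

def region_intersects_on(dim: RegionDim, test) -> bool:
    # Overlap check for a single dimension; `test` is either a continuous Test
    # or a CategoricalTest.
    if isinstance(dim, set):
        return test.value in dim  # categorical: the tested value is still allowed
    lower, upper = dim
    return upper > test.lower and lower < test.upper

def shrink_categorical(dim: Set[str], test: CategoricalTest) -> Set[str]:
    # Avoiding overlap on a categorical dimension removes the value tested by
    # the rule from the Region's allowed set.
    return dim - {test.value}
```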
D. Rules With Identical Consequents

We explained both algorithms using the simplification of ignoring the case in which the created rule could overlap with rules that already existed because they have the same consequent. If the consequent of the rule that will be created is known beforehand, then both CFSGS and CFSBE can be modified to allow rules with the same consequent to overlap. If, however, the consequent is not known beforehand, to ensure that the rule created inside the region found will be consistent with the already existing rules, both CFSGS and CFSBE must assume that the consequent will be different from the consequents of the rules that already exist, i.e. that rules cannot overlap. It is common for evolutionary algorithms to generate the consequent of a rule as a function of its antecedent, e.g. [20], [22], [23]. In such cases, both our algorithms will not allow intersections in the antecedents.

Remember that rules are inconsistent, and therefore require a conflict resolution strategy, iff their antecedents overlap but they have different consequents, i.e. they predict different labels for a single dataset instance. That means that if two rules have the same consequent, then they are consistent, and do not require a conflict resolution strategy, even if they overlap.

For CFSGS, that means that during the creation of C, the collection of constraints, it is not necessary to generate constraints from rules that have the same consequent as the rule that will be created. Not generating constraints from a rule allows the region found by CFSGS to overlap with such a rule.

For CFSBE, when enlarging the box, we can safely ignore rules that have the same consequent as the rule that will be created, again resulting in the possibility of overlaps. That can be achieved by either changing line 9 of Algorithm 2 to skip such rules, or by simply removing them from the argument R.
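Under the assumption that the consequent of the future rule is known beforehand, the "remove them from R" option amounts to a one-line filter on top of the enlarge_box sketch above:

```python
def enlarge_box_for_label(rules: List[Rule], seed: List[float],
                          order: List[int], label: str) -> Box:
    # Rules predicting `label` are allowed to overlap the future rule, so only
    # rules with a different consequent constrain the enlargement.
    return enlarge_box([r for r in rules if r.consequent != label], seed, order)
```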
IV. CONCLUSION AND FUTURE WORK

In this work we discussed the problem of generating sets of rules without inconsistencies and proposed two algorithms to solve this problem, called CFSGS and CFSBE.

CFSGS is able to search through the feature-space for any region where the antecedent of a rule can be created without creating inconsistencies with any existing rule. However, the algorithm is computationally expensive.

The CFSBE algorithm, on the other hand, can only find regions around dataset instances that are not covered by any existing rule, but its computational cost is far more reasonable. We argue that the non-covered dataset instance requirement of CFSBE is not a hindering issue.

Neither algorithm is particularly useful by itself, since both are meant to supplement algorithms that generate rule-based classification models. In the future, we intend to modify a Learning Classifier System to use CFSBE both during the initial population creation and during the mutation phases, in order to study its effects on the predictive power and interpretability of the generated models, and whether it makes the models more prone to overfitting.

V. ACKNOWLEDGMENTS
This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001. R. Cerri thanks the São Paulo Research Foundation (FAPESP) for the grant.
CONFLICTS OF INTEREST
The authors declare no conflict of interest. The financing institutions had no role in the design of the study, in the writing of the manuscript or in the decision to publish the results.

REFERENCES
[1] F. E. Otero and A. A. Freitas, "Improving the interpretability of classification rules discovered by an ant colony algorithm," in Proceedings of the 15th Annual Conference on Genetic and Evolutionary Computation. ACM, 2013, pp. 73–80.
[2] B. Goodman and S. Flaxman, "European Union regulations on algorithmic decision-making and a 'right to explanation'," arXiv preprint arXiv:1606.08813, 2016.
[3] Z. C. Lipton, "The mythos of model interpretability," arXiv preprint arXiv:1606.03490, 2016.
[4] J. R. Quinlan, "Induction of decision trees," Machine Learning, vol. 1, no. 1, pp. 81–106, 1986.
[5] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, p. 436, 2015.
[6] A. A. Freitas, "Comprehensible classification models: a position paper," ACM SIGKDD Explorations Newsletter, vol. 15, no. 1, pp. 1–10, 2014.
[7] R. Caruana, H. Kangarloo, J. Dionisio, U. Sinha, and D. Johnson, "Case-based explanation of non-case-based learning methods," in Proceedings of the AMIA Symposium. American Medical Informatics Association, 1999, p. 212.
[8] J. H. Holland, L. B. Booker, M. Colombetti, M. Dorigo, D. E. Goldberg, S. Forrest, R. L. Riolo, R. E. Smith, P. L. Lanzi, W. Stolzmann et al., "What is a learning classifier system?" in International Workshop on Learning Classifier Systems. Springer, 1999, pp. 3–32.
[9] K. R. Varshney, P. Khanduri, P. Sharma, S. Zhang, and P. K. Varshney, "Why interpretability in machine learning? An answer using distributed detection and data fusion theory," arXiv preprint arXiv:1806.09710, 2018.
[10] S. Nirenburg, "Cognitive systems: Toward human-level functionality," AI Magazine, vol. 38, no. 4, 2017.
[11] F. E. Otero, A. A. Freitas, and C. G. Johnson, "A hierarchical multi-label classification ant colony algorithm for protein function prediction," Memetic Computing, vol. 2, no. 3, pp. 165–181, 2010.
[12] J. R. Quinlan, C4.5: Programs for Machine Learning. Elsevier, 2014.
[13] A. Clare and R. D. King, "Knowledge discovery in multi-label phenotype data," in European Conference on Principles of Data Mining and Knowledge Discovery. Springer, 2001, pp. 42–53.
[14] H. Blockeel, M. Bruynooghe, S. Džeroski, J. Ramon, and J. Struyf, "Hierarchical multi-classification," in Workshop Notes of the KDD'02 Workshop on Multi-Relational Data Mining, 2002, pp. 21–35.
[15] H. Blockeel, L. D. Raedt, and J. Ramon, "Top-down induction of clustering trees," in Proceedings of the Fifteenth International Conference on Machine Learning. Morgan Kaufmann Publishers Inc., 1998, pp. 55–63.
[16] C. Vens, J. Struyf, L. Schietgat, S. Džeroski, and H. Blockeel, "Decision trees for hierarchical multi-label classification," Machine Learning, vol. 73, no. 2, p. 185, 2008.
[17] L. Schietgat, C. Vens, J. Struyf, H. Blockeel, D. Kocev, and S. Džeroski, "Predicting gene function using hierarchical multi-label decision tree ensembles," BMC Bioinformatics, vol. 11, no. 1, p. 2, 2010.
[18] H. Wang, S. Kwong, Y. Jin, W. Wei, and K.-F. Man, "Agent-based evolutionary approach for interpretable rule-based knowledge extraction," IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), vol. 35, no. 2, pp. 143–155, 2005.
[19] K. Deb, A. Pratap, S. Agarwal, and T. Meyarivan, "A fast and elitist multiobjective genetic algorithm: NSGA-II," IEEE Transactions on Evolutionary Computation, vol. 6, no. 2, pp. 182–197, 2002.
[20] R. Cerri, R. C. Barros, and A. C. de Carvalho, "A genetic algorithm for hierarchical multi-label classification," in Proceedings of the 27th Annual ACM Symposium on Applied Computing. ACM, 2012, pp. 250–255.
[21] E. C. Gonçalves, A. A. Freitas, and A. Plastino, "A survey of genetic algorithms for multi-label classification," in 2018 IEEE Congress on Evolutionary Computation (CEC). IEEE, 2018, pp. 1–8.
[22] R. S. Parpinelli, H. S. Lopes, and A. A. Freitas, "Data mining with an ant colony optimization algorithm," IEEE Transactions on Evolutionary Computation, vol. 6, no. 4, pp. 321–332, 2002.
[23] A. Chan and A. A. Freitas, "A new ant colony algorithm for multi-label classification with applications in bioinformatics," in Proceedings of the 8th Annual Conference on Genetic and Evolutionary Computation. ACM, 2006, pp. 27–34.
[24] F. E. Otero, A. A. Freitas, and C. G. Johnson, "cAnt-Miner: An ant colony classification algorithm to cope with continuous attributes," in International Conference on Ant Colony Optimization and Swarm Intelligence. Springer, 2008, pp. 48–59.
[25] ——, "A hierarchical classification ant colony algorithm for predicting gene ontology terms," in European Conference on Evolutionary Computation, Machine Learning and Data Mining in Bioinformatics. Springer, 2009, pp. 68–79.
[26] ——, "A new sequential covering strategy for inducing classification rules with ant colony algorithms," IEEE Transactions on Evolutionary Computation, vol. 17, no. 1, pp. 64–76, 2013.