Preventing the Generation of Inconsistent Sets of Classification Rules
Thiago Zafalon Miranda, Diorge Brognara Sardinha, Ricardo Cerri
Thiago Zafalon Miranda
Department of Computer Science, Federal University of São Carlos
São Carlos, Brazil
[email protected]

Diorge Brognara Sardinha
Department of Computer Science, Federal University of São Carlos
São Carlos, Brazil
[email protected]

Ricardo Cerri
Department of Computer Science, Federal University of São Carlos
São Carlos, Brazil
[email protected]
Abstract—In recent years, the interest in interpretable classification models has grown. One of the proposed ways to improve the interpretability of a rule-based classification model is to use sets (unordered collections) of rules instead of lists (ordered collections) of rules. One of the problems associated with sets is that multiple rules may cover a single instance but predict different classes for it, thus requiring a conflict resolution strategy. In this work, we propose two algorithms capable of finding feature-space regions inside which any created rule would be consistent with the already existing rules, preventing inconsistencies from arising. Our algorithms do not generate classification models, but are instead meant to enhance algorithms that do so, such as Learning Classifier Systems. Both algorithms are described and analyzed exclusively from a theoretical perspective, since we have not yet modified a model-generating algorithm to incorporate our proposed solutions. This work presents the novelty of using conflict avoidance strategies instead of conflict resolution strategies.
Index Terms—Classification rules, Rule generation, Rule consistency, Constraint handling
I. INTRODUCTION
Classification is one of the commonest tasks of Machine Learning, concisely described in [1] as the generation of a model that learns relations between predictive features and target features. This learning occurs by adjusting the internal parameters of the model.

In recent years, the interest in interpretable classification models has grown, partly due to regulations such as the General Data Protection Regulation (commonly known as GDPR), which created a "right to explanation", a regulation "whereby a user can ask for an explanation of an algorithmic decision that significantly affects them" [2].

Even though interpretability, in the context of classification models, is not an objectively and consistently defined concept [3], it is reasonable to say that some types of classification models are inherently more interpretable than others; a Decision Tree [4], for instance, can be said to be more interpretable than a Deep Neural Network [5].

It is generally accepted that rule-based classifiers are among the most interpretable [6]. The training phase of such classifiers usually consists in creating and tuning a list of classification rules. A classification rule usually has two components, its antecedent and its consequent. The antecedent is a collection
of tests over feature values, and the consequent is the label (or set of labels, in multi-label classification) that will be assigned to the dataset instance being classified, if it passes all the antecedent's tests. A simple classification rule is exemplified in Figure 1.

Fig. 1. Simple Classification Rule

The interpretability of a rule-based classification model is frequently measured by its size, i.e. the number of rules in the model and/or the number of features tested by the rules [6]. Therefore, many algorithms that generate interpretable classification models try to minimize the model size.

In [1], however, the authors propose an alternative way of improving the interpretability of a rule-based classification model: using sets (unordered collections) of rules instead of lists (ordered collections).

If a classifier employs a list of rules, then its n-th rule cannot be correctly interpreted alone, because an instance that is covered by it may also be covered by a previous rule; the actual class predicted by the classifier would be the one of the previous rule. Using, instead, a set of rules allows the user to analyze the rules individually, making the model more interpretable.

However, using a set of rules may create conflicts when multiple rules (with different consequents) cover the same dataset instance. The authors of [1] discuss two conflict resolution strategies, allowing the classifier to function properly even if it contains multiple rules that contradict each other.

In this work, we propose two algorithms capable of finding sets of feature-space regions such that any rule created within those regions will always be consistent with an existing set of rules R. In this context, a rule r₁ is said to be consistent with a rule r₂ if their consequents are identical or if there is no intersection between their antecedents, i.e. it is not possible to create an object that would be covered by rules that predict different labels. A set of rules is said to be consistent if each rule of the set is consistent with every other rule. By only creating consistent rules, one avoids the problem of conflicting predictions entirely, hence improving the interpretability of the classification model.

The proposed algorithms do not generate classification models by themselves; instead, they are meant to enhance algorithms that do so. They can be used, for instance, during the initialization and mutation phases of a genetic algorithm. Since no model is directly generated, our algorithms can only be evaluated by modifying an existing model-building algorithm to use one of the methods, then measuring the relative change on the induced models. The two algorithms themselves are independent of the metrics chosen.

It is interesting to note that our algorithms are not sensitive to the type of the consequent of the rules, as long as they can be tested for equality. This means that they can be used to supplement algorithms that generate any kind of rule format, such as hierarchical multi-label or flat single-label rules, since inconsistencies between rules arise from overlaps between the rules' antecedents.

The remainder of this work is organized as follows: in Section II we discuss related works; in Section III we present two algorithms to aid the creation of consistent rules, and analyze their properties; in Section IV we discuss future works and present our conclusions.

II. RELATED WORKS
The concept of interpretability has been a point of contention in Artificial Intelligence (AI) literature. There are many different views on what constitutes an interpretable classification model, how to measure interpretability, and whether it is necessary, or even worthwhile, to sacrifice the predictive power of a classifier in favor of its interpretability. Some authors have proposed mechanisms to improve the interpretability of black-box models [7], while others have focused on transparent rule-based models, such as Learning Classifier Systems [8]. In this section, we will discuss some of the works which have focused on algorithms that generate rule-based models, and why interpretability is important.

In [9] and [10] the authors argue that AI models do not usually operate in a vacuum, they interact with humans, and that various types of Human-AI interactions may benefit from an interpretable model.

In areas such as bioinformatics (protein function prediction, gene function prediction, among others) it is important that the classification model is interpretable, in order to make it possible for its users to validate it [11]. In medical and financial applications, understanding a computer-induced model is often a prerequisite for users to trust the model's predictions [6].

Considering that rule-based classification models are inherently transparent, thus interpretable, many algorithms that generate interpretable models have been published (see the discussion of transparency in [3]).

The Decision Tree algorithm C4.5 [12], for instance, generates models that can be interpreted as easily as a flowchart. It also employs a pruning strategy that simultaneously improves the interpretability of the model, by reducing its size, and its predictive power, by reducing overfitting.

In [13], the authors modified the algorithm C4.5 to handle multi-label classification. One of the most interesting parts of their work, from an interpretability perspective, was the generation of a set of rules from the decision tree. This process of "splitting" a decision tree into a set of rules is one of the few processes that we know of that can generate a consistent set of rules.

In [14], the authors propose an algorithm based on Predictive Clustering Trees (PCTs) [15] to perform hierarchical multi-label classification using a single, global model. PCT-based algorithms see decision trees as hierarchies of clusters and, as such, during the model training phase, they try to minimize intra-cluster variance. The proposed algorithm, called Clus-HMC, was later modified in [16] to handle class hierarchies organized as Directed Acyclic Graphs (DAGs), and used in [17] to generate a collection of trees which build an ensemble.

In [18], the authors propose an evolutionary algorithm to generate interpretable fuzzy classification rules by using the Pittsburgh approach, in which each individual of the population represents a complete classifier. The fittest selection mechanism used was the multi-objective algorithm NSGA-II [19], and the functions being optimized were accuracy, number of rules, and length of rules. The authors also discuss the issue of interpretability of fuzzy classification rules and strategies to improve it, such as merging similar fuzzy sets.

In [20], the authors propose a Genetic Algorithm (GA) to generate interpretable traditional (non-fuzzy) classification rules.
The algorithm, called HMC-GA, is the only GA-based method in the literature that is capable of building a global hierarchical multi-label classification model [21].

In [22], the authors propose the first ant colony-based classification algorithm, called Ant-Miner. It generates lists of classification rules, and had, in its original version, the limitation of only handling categorical features. Ant-Miner was used as a base for many algorithms, such as Multi-Label Ant-Miner (MuLAM) [23], which generates flat (i.e. non-hierarchical) multi-label classification rules; cAnt-Miner [24], which removed the restriction of using only categorical features; h-Ant-Miner [25], which generates hierarchical single-label classification rules; and hm-Ant-Miner [11], which generates hierarchical multi-label classification rules.

In [26], the authors propose a new sequential covering strategy for cAnt-Miner, in an algorithm called cAnt-Miner_PB. This algorithm was later enhanced to generate sets (unordered collections) of rules in an algorithm called Unordered cAnt-Miner_PB [1]. The authors of Unordered cAnt-Miner_PB argue that a set of rules is more interpretable than a list (ordered collection) of rules. They also propose a new interpretability metric, called Prediction-Explanation Size, that accounts for the inter-dependency of rules in lists.

The sets of rules generated by Unordered cAnt-Miner_PB could contain inconsistent rules, i.e. multiple rules that cover the same dataset instance but predict different classes for it. The authors discuss mechanisms to resolve such conflicts when they arise (i.e. when making a prediction), such as using the rule with the highest quality, or aggregating the predictions of the conflicting rules and selecting the most common label in the aggregation.

The idea that sets of rules are more interpretable than lists of rules, and the fact that there are, as far as we are aware, no mechanisms to prevent the generation of inconsistent rules, motivated us to research and develop the algorithms described in Section III.

III. PROPOSED ALGORITHMS
We propose two algorithms to solve the problem of adding a new rule to an existing set of consistent rules, creating an expanded rule set that is still consistent. More specifically, the algorithms find feature-space regions in which rules can be created while being consistent with an already existing set of rules; the actual creation of the rules inside such regions is not within the scope of the methods.

There is an important distinction between identifying whether a new rule is consistent with an existing collection of rules and creating a new rule that is consistent with an existing collection of rules; the first task, analogous to determining if a cake tastes good, is trivial; the second task, analogous to baking a good cake, is far more complex. The algorithms we propose are meant to guide the execution of the latter task.

To the best of our knowledge, both algorithms, and the conflict avoidance approach they employ, are new to the literature. We refer to these algorithms as Constrained Feature-Space Greedy Search (CFSGS) and Constrained Feature-Space Box-Enlargement (CFSBE). We present them, respectively, in Section III-A and Section III-B.

Whenever we refer to a rule, unless stated otherwise, we will be referring only to its antecedent. The case in which rules have the same consequent but different antecedents, hence are consistent with each other, will be discussed in Section III-D.

We will assume that all predictive features are continuous. Both algorithms can handle categorical features, but explaining them exclusively in terms of continuous features allows a better visualization and explanation. We will discuss the treatment of categorical features in Section III-C.

In order to simplify the explanation of both algorithms, we will use the convention that rules have exactly one feature test for each predictive feature and have the format shown in Equation 1, where testᵢ denotes the i-th test of the rule, i.e. the test over the i-th feature, and |f| denotes the number of features in the dataset. We will also use the convention that feature tests have the format shown in Equation 2, where fᵢ denotes the value of the i-th feature of the dataset instance being tested, and testᵢ.lower and testᵢ.upper are, respectively, the lower and upper bound values of the test.

rule := test₁ ∧ test₂ ∧ ··· ∧ test_|f|    (1)

testᵢ := testᵢ.lower ≤ fᵢ < testᵢ.upper    (2)

Fig. 2. Classification rules in 2D feature-space
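To make the conventions of Equations 1 and 2 concrete, the following minimal Python sketch models a rule as one half-open interval test per continuous feature. The class and method names (Test, Rule, covers) are our own illustration, not part of the proposed algorithms.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Test:
    lower: float  # inclusive lower bound (test_i.lower)
    upper: float  # exclusive upper bound (test_i.upper)

    def covers(self, value: float) -> bool:
        # Equation 2: test_i := test_i.lower <= f_i < test_i.upper
        return self.lower <= value < self.upper

@dataclass
class Rule:
    tests: List[Test]  # one test per predictive feature (Equation 1)
    consequent: str    # the predicted label

    def covers(self, instance: List[float]) -> bool:
        # Equation 1: rule := test_1 AND test_2 AND ... AND test_|f|
        return all(t.covers(v) for t, v in zip(self.tests, instance))
```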
It is important to observe that the lower bound of a test is inclusive, but the upper bound is exclusive. This definition prevents inconsistencies when two tests from different rules "touch" each other, i.e. the upper bound of the i-th test from a rule has the same value as the lower bound of the i-th test from another rule. To exemplify this, consider the rules described in Equations 3 and 4. If we used an inclusive upper bound, a person with age = 10 would be covered by both rules, which would make the rules inconsistent with each other.

rule₁ = IF 0 ≤ age < 10 THEN class = child    (3)
rule₂ = IF 10 ≤ age < 150 THEN class = adult    (4)

Both algorithms can be more easily explained if we use a geometric interpretation, that is, by viewing the antecedent of a classification rule as an |f|-dimensional hyperrectangle, f being the set of features of the dataset.

Figure 2 shows what three classification rules, represented as colored rectangles, and seven dataset instances, represented as black dots, could look like in a classification problem with two predictive features, f₁ and f₂.

Considering that the grid squares in Figure 2 have unitary length, the antecedents of the rules depicted in the figure can be formally described by Equations 5, 6 and 7.

rule₁ = { test₁ = 2 ≤ f₁ < 5,  test₂ = 5 ≤ f₂ < 8 }    (5)
rule₂ = { test₁ = 6.5 ≤ f₁ < 9.5,  test₂ = 5 ≤ f₂ < 7 }    (6)
rule₃ = { test₁ = 1 ≤ f₁ < 4,  test₂ = 1 ≤ f₂ < 3 }    (7)

The premise of the two proposed algorithms is that if we can find a region of the feature-space that is not covered by any rule, we could create a rule inside such region. The new rule would be consistent with the already existing rules, because no dataset instance could be simultaneously covered by more than one rule.
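Using the sketch above, the consistency definition from Section I and the effect of the exclusive upper bound can be checked mechanically. This hypothetical helper reproduces the age example of Equations 3 and 4: the two rules touch at age = 10 but do not overlap.

```python
def overlaps(a: Rule, b: Rule) -> bool:
    # Axis-aligned hyperrectangles intersect iff their intervals overlap on
    # every dimension; with half-open intervals [lower, upper), rules that
    # merely "touch" (one's upper bound equals the other's lower bound) do not.
    return all(ta.lower < tb.upper and tb.lower < ta.upper
               for ta, tb in zip(a.tests, b.tests))

def consistent(a: Rule, b: Rule) -> bool:
    # Two rules are consistent iff their consequents are identical or their
    # antecedents do not intersect.
    return a.consequent == b.consequent or not overlaps(a, b)

child = Rule([Test(0, 10)], "child")    # Equation 3
adult = Rule([Test(10, 150)], "adult")  # Equation 4
assert consistent(child, adult)         # they touch at age = 10, no overlap
```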
A. Constrained Feature-Space Greedy Search

Since the region covered by a rule is the region described by the conjunction of its tests, the region not covered by it can be described by the disjunction of the negations of its tests, as we can see in Equations 8 and 9.

rule = test₁ ∧ ··· ∧ test_n    (8)
¬rule = ¬(test₁ ∧ ··· ∧ test_n)
¬rule = (¬test₁) ∨ ··· ∨ (¬test_n)

testᵢ = lowerᵢ ≤ fᵢ < upperᵢ    (9)
testᵢ = (lowerᵢ ≤ fᵢ) ∧ (fᵢ < upperᵢ)
¬testᵢ = ¬((lowerᵢ ≤ fᵢ) ∧ (fᵢ < upperᵢ))
¬testᵢ = ¬(lowerᵢ ≤ fᵢ) ∨ ¬(fᵢ < upperᵢ)
¬testᵢ = (lowerᵢ > fᵢ) ∨ (fᵢ ≥ upperᵢ)
¬testᵢ = (fᵢ < lowerᵢ) ∨ (fᵢ ≥ upperᵢ)

Negating the tests of r₁, for instance, results in four inequalities (or constraints), each describing a region not covered by the rule:
• f₁ < 2, the region to the left of the yellow rectangle
• f₁ ≥ 5, the region to the right of the yellow rectangle
• f₂ < 5, the region below the yellow rectangle
• f₂ ≥ 8, the region above the yellow rectangle

In Constrained Feature-Space Greedy Search (CFSGS), we create a collection C with the constraints generated by the negation of all the tests from all the rules, with Cᵢ,ⱼ denoting the j-th inequality generated from the i-th rule. The constraints generated from the negation of the tests of r₁, r₂, and r₃ are shown in Table II. As we assumed that all features are continuous, the number of constraints generated per rule, n_f, is equal to 2 · |f|. We discuss the relation between n_f and the data type of the features (categorical or continuous) in Section III-C.

By using Algorithm 1 we can organize C into a Directed Acyclic Graph (DAG). Doing so allows us to perform a greedy search to find all subsets of C that contain exactly one constraint from each rule such that all constraints are simultaneously satisfiable. Such subsets of C describe non-empty regions of the feature-space that are not covered by any rule. The DAG generated from the constraints of r₁, r₂, and r₃ is shown in Figure 3.

The objective of the search is to find consistent paths from the root to a leaf node. A path is said to be consistent iff the constraints represented by its nodes can all be simultaneously satisfied, such as the one described in Equation 10. If adding a node to the path currently being explored makes the constraints unsatisfiable, the search algorithm backtracks and tries adding another node.
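A possible implementation of this constraint-generation step, continuing the earlier sketch, is shown below; encoding each inequality as a (feature index, operator, value) triple is our own choice, not prescribed by the paper.

```python
from typing import Dict, List, Tuple

# (i, "lt", v) encodes f_i < v and (i, "ge", v) encodes f_i >= v (Equation 9).
Constraint = Tuple[int, str, float]

def negate_rule(rule: Rule) -> List[Constraint]:
    # Each test lower <= f_i < upper yields the two constraints
    # f_i < lower and f_i >= upper, describing regions the rule does not cover.
    constraints: List[Constraint] = []
    for i, t in enumerate(rule.tests):
        constraints.append((i, "lt", t.lower))
        constraints.append((i, "ge", t.upper))
    return constraints

def build_constraint_collection(rules: List[Rule]) -> Dict[int, List[Constraint]]:
    # C[i] holds the n_f constraints generated from the i-th rule.
    return {i: negate_rule(r) for i, r in enumerate(rules)}
```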
ONSTRAINTS G ENERATED F ROM R ULES
TABLE IIC
ONSTRAINTS G ENERATED F ROM R ULES
Index in C Constraint Generated C , f < C , f ≥ C , f < C , f ≥ C , f < . C , f ≥ . C , f < C , f ≥ C , f < C , f ≥ C , f < C , f ≥ Algorithm 1
Algorithm 1 CFSGS - DAG Build

Require:
  Collection of constraints: C
  Number of existing rules: n_r
  Number of constraints generated from a rule: n_f (number of categorical features plus 2 times the number of continuous features)
Ensure:
  The DAG G representing the constrained feature-space to be explored

 1: G ← New Directed Graph
 2: G.E ← {}        ▷ The set of edges of G
 3: G.V ← {root}    ▷ The set of vertices of G
 4: for i ∈ {1, 2, ..., n_r} do
 5:   for j ∈ {1, 2, ..., n_f} do
 6:     G.V ← G.V ∪ {Cᵢ,ⱼ}
 7:     if i > 1 then
 8:       for k ∈ {1, 2, ..., n_f} do
 9:         G.E ← G.E ∪ {(Cᵢ₋₁,ₖ, Cᵢ,ⱼ)}
10:       end for
11:     end if
12:   end for
13: end for
14: return G

Fig. 3. Constraints as DAG

{C₁,₁, C₂,₁, C₃,₁} = (f₁ < 2) ∧ (f₁ < 6.5) ∧ (f₁ < 1)
                   = f₁ < 1    (10)

We could fully explore the DAG to generate all consistent paths, but since we only need one, we stop the search as soon as the first one is found. It is important to observe that the order in which the nodes within a level are explored determines the order in which paths are found. If the nodes are explored in a lexicographic order, e.g. C₁,₁ is explored before C₁,₂, the first paths found will have a bias for the lower regions of the first features (e.g. the bottom left area in Figure 2). To remove this bias, it suffices to explore the nodes of each level in a random order.

The runtime complexity of building the DAG can be expressed as a function of the number of existing rules n_r and the number of constraints generated per rule n_f.

In Algorithm 1, line 6 is executed n_r · n_f times, due to the two nested loops. Line 9 is executed (n_r − 1) · n_f · n_f times, since the conditional in line 7 excludes one iteration of the outer loop. Considering the cost of line 6 to be a constant k₁, and the cost of line 9 to be a constant k₂, the total cost of building the DAG is described by Equation 11.

T(n_f, n_r) = n_r · n_f · k₁ + (n_r − 1) · n_f · n_f · k₂    (11)
T(n_f, n_r) ∈ O(n_f² · n_r)

For the search algorithm, the worst-case scenario of having no possible consistent paths would cause the exploration of every path. Since creating a path consists of choosing an edge, out of n_f edges, repeated over n_r levels of the graph, there are exactly n_f^(n_r) possible paths, leading to a complexity cost of O(n_f^(n_r)).

The total computational cost of CFSGS is the sum of building the DAG and exploring it, which makes the method bounded by (O(n_f² · n_r) + O(n_f^(n_r))) ∈ O(n_f^(n_r)). While the algorithm is capable of finding all sub-regions where a consistent rule could be created, this exponential complexity cost makes it unsuitable for many applications.
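Since consecutive levels of the DAG are fully connected, the greedy search amounts to a depth-first backtracking that picks one constraint per rule. The sketch below continues the constraint encoding introduced earlier, stops at the first consistent path, and randomizes the visiting order to avoid the lexicographic bias just discussed; it is an illustration of the search, not the authors' implementation.

```python
import random
from typing import Dict, List, Optional

def satisfiable(path: List[Constraint]) -> bool:
    # A conjunction of f_i < a and f_i >= b constraints is satisfiable iff, for
    # every feature, the largest ">=" bound lies below the smallest "<" bound.
    lo: Dict[int, float] = {}
    hi: Dict[int, float] = {}
    for feat, op, value in path:
        if op == "ge":
            lo[feat] = max(lo.get(feat, float("-inf")), value)
        else:
            hi[feat] = min(hi.get(feat, float("inf")), value)
    return all(lo.get(f, float("-inf")) < hi.get(f, float("inf"))
               for f in set(lo) | set(hi))

def greedy_search(C: Dict[int, List[Constraint]], level: int = 0,
                  path: Optional[List[Constraint]] = None) -> Optional[List[Constraint]]:
    # Depth-first search for one consistent root-to-leaf path; constraints of
    # each level are visited in random order to remove the positional bias.
    path = path or []
    if level == len(C):
        return path  # reached a leaf: `path` describes a non-covered region
    for constraint in random.sample(C[level], len(C[level])):
        if satisfiable(path + [constraint]):
            result = greedy_search(C, level + 1, path + [constraint])
            if result is not None:
                return result
    return None  # no consistent path exists below this prefix
```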
B. Constrained Feature-Space Box-Enlargement

Extending the geometrical interpretation provided by CFSGS, we would like a way to guide the search through the DAG such that the cost becomes polynomial in relation to the number of features and rules. Constrained Feature-Space Box-Enlargement's central idea is that it is possible to leverage information from the training dataset to visit only nodes that lead to a possible consistent path to a leaf node.

Instead of searching the feature-space for suitable regions, CFSBE starts from a point known not to be covered by any rule, called a "seed", and "enlarges" this point along the different dimensions, creating a box that does not overlap with any of the existing rules' antecedents. A rule created inside such a box would also not overlap with the existing rules, so they would be consistent, regardless of their consequents.
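CFSBE therefore needs a seed point that is known not to be covered by any rule; as discussed below, keeping track of which training instances are still uncovered makes seed selection cheap. One way this bookkeeping could look, reusing the Rule sketch from earlier (class and method names are ours):

```python
from collections import defaultdict
from typing import Dict, List, Optional, Set

class SeedTracker:
    # Maps each training instance to the rules covering it, so an uncovered
    # instance can be drawn as a CFSBE seed in O(1).
    def __init__(self, instances: List[List[float]]):
        self.instances = instances
        self.covered_by: Dict[int, Set[int]] = defaultdict(set)
        self.uncovered: Set[int] = set(range(len(instances)))

    def add_rule(self, rule_id: int, rule: Rule) -> None:
        # O(n) update on rule insertion, n being the number of instances.
        for i, x in enumerate(self.instances):
            if rule.covers(x):
                self.covered_by[i].add(rule_id)
                self.uncovered.discard(i)

    def remove_rule(self, rule_id: int) -> None:
        # O(n) update on rule removal: instances left without any covering
        # rule become available as seeds again.
        for i, rule_ids in self.covered_by.items():
            rule_ids.discard(rule_id)
            if not rule_ids:
                self.uncovered.add(i)

    def pick_seed(self) -> Optional[List[float]]:
        # O(1): any instance not covered by any rule can serve as a seed.
        if not self.uncovered:
            return None
        return self.instances[next(iter(self.uncovered))]
```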
Fig. 4. CFSBE - Horizontal Axis First
Fig. 5. CFSBE - Vertical Axis First
Searching for arbitrary non-covered points is equivalent to searching for non-covered regions, being as computationally expensive as CFSGS. However, if we keep track of which dataset instances are not covered by any rules, we may use one of such instances as the seed. By using an associative array structure to map which points are covered by which rules, choosing a seed would have cost O(1), and updating the structure on the insertion or removal of a rule would have cost O(n), n being the number of instances in the training dataset.

After selecting a non-covered point as seed, we must choose the order in which the dimensions will be expanded. Consider the point p₁ in Figure 2; if we first enlarge it along the f₁ dimension and f₂ afterwards, we end up with the green region shown in Figure 4. If, however, we start with f₂, we end up with the green region shown in Figure 5.

Once a seed point and a feature order are chosen, a degenerate hyperrectangle (or box) is created around the point. The box is then enlarged in each dimension according to the chosen ordering.

To determine the limit of the expansion along each dimension, that is, the boundaries to which the box can grow without overlapping with existing rules, it is necessary to check which rules "intersect" with the box on the other dimensions.

To exemplify the need for checking for "intersections" on dimensions that are not currently being expanded on, consider the point p₂ in Figure 2. Even though it is not covered by any rule, it does pass the f₂ feature test of the rule r₁, i.e. it "intersects" the rule r₁ on the f₂ dimension. If we create our box around p₂ and grow it along the f₁ dimension, it would eventually contain points that satisfy both tests of r₁, i.e. there would be an intersection between the box and r₁.

The method used to safely grow a box in such a way that it does not overlap with the boxes described by the antecedents of a set of rules is presented in Algorithm 2.
Algorithm 2 Constrained Feature-Space Box-Enlargement

Require:
  Set of non-overlapping rules: R
  Seed point x, not covered by any rule of R
  Order of dimensions to expand: O_d (a permutation of {1, 2, ..., |O_d|})
Ensure:
  The box B covers a possible hyperrectangle containing x that cannot be further expanded, and does not intersect any rule of R

 1: B ← Degenerate rectangle with |O_d| dimensions
 2: for i ∈ O_d do
 3:   Bᵢ.lower ← xᵢ
 4:   Bᵢ.upper ← xᵢ
 5: end for
 6: for d ∈ O_d do
 7:   B_d.lower ← −∞
 8:   B_d.upper ← +∞
 9:   for r ∈ R do
10:     if Intersects(B, r, d) then
11:       if x_d ≥ r_d.upper then
12:         B_d.lower ← max(B_d.lower, r_d.upper)
13:       else    ▷ Meaning x_d < r_d.lower, otherwise x would be covered by r
14:         B_d.upper ← min(B_d.upper, r_d.lower)
15:       end if
16:     end if
17:   end for
18: end for
19: return B

20: function Intersects(Box B, Rule r, Dimension to skip d)
21:   NumDimensions ← |r|
22:   for i ∈ {1, 2, ..., NumDimensions} \ {d} do
23:     u ← Bᵢ.upper > rᵢ.lower
24:     l ← Bᵢ.lower < rᵢ.upper
25:     if ¬(u ∧ l) then
26:       return False
27:     end if
28:   end for
29:   return True
30: end function
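For concreteness, Algorithm 2 can be transcribed almost line by line into Python, continuing the earlier sketches; representing the box as a list of (lower, upper) pairs is our own choice.

```python
from typing import List, Tuple

Box = List[Tuple[float, float]]

def intersects(box: Box, rule: Rule, skip: int) -> bool:
    # The Intersects function of Algorithm 2: True iff the box and the rule
    # overlap on every dimension except `skip`.
    return all(box[i][1] > rule.tests[i].lower and box[i][0] < rule.tests[i].upper
               for i in range(len(rule.tests)) if i != skip)

def enlarge_box(rules: List[Rule], seed: List[float], order: List[int]) -> Box:
    # Main body of Algorithm 2: start from a degenerate box around the seed
    # and enlarge one dimension at a time, clipping against every rule that
    # intersects the box on the remaining dimensions.
    box: Box = [(v, v) for v in seed]
    for d in order:
        lower, upper = float("-inf"), float("inf")
        for r in rules:
            if intersects(box, r, d):
                if seed[d] >= r.tests[d].upper:
                    lower = max(lower, r.tests[d].upper)  # rule lies below the seed on d
                else:
                    # seed[d] < r.tests[d].lower; otherwise the seed would be covered by r
                    upper = min(upper, r.tests[d].lower)
        box[d] = (lower, upper)
    return box
```

Calling enlarge_box with order [0, 1] versus [1, 0] reproduces the dependence on the expansion order illustrated by Figures 4 and 5.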
It is worth noting that even though CFSBE cannot find all suitable hyperrectangles, it can find all the arguably relevant ones. Consider the dataset depicted in Figure 2: CFSGS could find a region below the blue rectangle, while CFSBE cannot; but since there are no instances there, it is arguable that the rules created in such a region would not be useful, as their coverage would be zero and their predictive power could not be measured on the dataset used to train the classification model.

CFSBE's runtime complexity can be calculated in a straightforward manner. Let d be the number of dimensions (features) in the dataset and r be the number of rules in the set of existing rules. The innermost part of the algorithm, in lines 11 through 15, has a constant cost, k₁, because it does not depend on the values of d or r. The Intersects function has a loop that executes at most d − 1 steps. The contents of this loop do not depend on d nor r, therefore they also have a constant cost, k₂, hence the worst-case scenario for this function is T_I(d) = (d − 1) · k₂.

Lines 1 through 4 perform the initialization of the degenerate rectangle, which occurs in a loop with d steps, each step having constant cost k₃. The rest of the algorithm is a trivial nesting of loops, the first of which takes d steps, the second r steps, and the third, inside the Intersects function, takes at most d − 1 steps, as discussed previously.

The total cost of CFSBE is described as T in Equation 12. There are three terms in this summation, the first being the creation of the degenerate rectangle, the second the main algorithm body, and the third value, n, comes from keeping track of the available seeds, as discussed previously. Many algorithms that generate rule-based classification models, however, already have to create a mapping between dataset instances and rules that cover them; Learning Classifier Systems, for instance, may need such information to calculate the fitness of the rules [20]. Therefore, in practice, the cost of CFSBE could be considered as O(d² · r).

T(d, r, n) = d · k₃ + d · r · (T_I(d) + k₁) + n    (12)
T(d, r, n) ∈ O(d² · r + n)

C. Tests Over Categorical Features
We explained both algorithms assuming that the dataset contained only continuous features, but both algorithms can be modified to handle categorical features. Since a feature test over a categorical feature is simply an equality test, the main difference for CFSGS is that during the creation of the collection of constraints C, a single constraint is generated for categorical features, instead of two, as we can see in Equations 13 and 14. Consequently, the parameter n_f of Algorithm 1 equals the number of categorical features plus two times the number of continuous features.

rule₁ = { test₁ = 2 ≤ f₁ < 5,  test₂ = f₂ = yellow }    (13)

C = { C₁,₁ = f₁ < 2,  C₁,₂ = f₁ ≥ 5,  C₁,₃ = f₂ ≠ yellow }    (14)

For CFSBE to handle categorical features, we must change the data structure that represents the enlarging box. Instead of being a simple associative array that maps feature indices to continuous ranges, it must now map feature indices to either sets of values, for categorical features, or continuous ranges, for continuous features. Considering this difference, it is more appropriate to call the Box a conflict-free "Region".

Modifying the algorithm to check whether the Region and a rule intersect along a dimension that represents a categorical feature is rather simple: one only needs to check if the value being tested by the rule is a member of the set of values of the Region for that dimension. Similarly, adjusting the Region's values for a categorical dimension, in order to avoid overlapping with a rule, equates to removing the value which is tested by the rule from the set of values for that dimension.
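A speculative sketch of the mixed Region described above: continuous dimensions keep half-open ranges while categorical dimensions keep the set of still-allowed values. All names here are illustrative, not taken from the paper.

```python
from dataclasses import dataclass
from typing import Set, Tuple, Union

@dataclass
class CategoricalTest:
    value: str  # equality test: f_i = value

# A Region dimension is either a continuous half-open range or the set of
# categorical values still allowed for that feature.
RegionDim = Union[Tuple[float, float], Set[str]]

def region_intersects_on(dim: RegionDim, test) -> bool:
    # Overlap check for a single dimension; `test` is either a continuous Test
    # or a CategoricalTest.
    if isinstance(dim, set):
        return test.value in dim  # categorical: the tested value is still allowed
    lower, upper = dim
    return upper > test.lower and lower < test.upper

def shrink_categorical(dim: Set[str], test: CategoricalTest) -> Set[str]:
    # Avoiding overlap on a categorical dimension removes the value tested by
    # the rule from the Region's allowed set.
    return dim - {test.value}
```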
D. Rules With Identical Consequents

We explained both algorithms using the simplification of ignoring the case in which the created rule could overlap with rules that already existed because they have the same consequent. If the consequent of the rule that will be created is known beforehand, then both CFSGS and CFSBE can be modified to allow rules with the same consequent to overlap. If, however, the consequent is not known beforehand, to ensure that the rule created inside the region found will be consistent with the already existing rules, both CFSGS and CFSBE must assume that the consequent will be different from the consequents of the rules that already exist, i.e. that rules cannot overlap. It is common for evolutionary algorithms to generate the consequent of a rule as a function of its antecedent, e.g. [20], [22], [23]. In such cases, both our algorithms will not allow intersections in the antecedents.

Remember that rules are inconsistent, and therefore require a conflict resolution strategy, iff their antecedents overlap but they have different consequents, i.e. they predict different labels for a single dataset instance. That means that if two rules have the same consequent, then they are consistent, and do not require a conflict resolution strategy, even if they overlap.

For CFSGS, that means that during the creation of C, the collection of constraints, it is not necessary to generate constraints from rules that have the same consequent as the rule that will be created. Not generating constraints from a rule allows the region found by CFSGS to overlap with such a rule.

For CFSBE, when enlarging the box, we can safely ignore rules that have the same consequent as the rule that will be created, again resulting in the possibility of overlaps. That can be achieved by either changing line 9 of Algorithm 2 to skip such rules, or by simply removing them from the argument R.
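Under the assumption that the consequent of the future rule is known beforehand, the "remove them from R" option amounts to a one-line filter on top of the enlarge_box sketch above:

```python
def enlarge_box_for_label(rules: List[Rule], seed: List[float],
                          order: List[int], label: str) -> Box:
    # Rules predicting `label` are allowed to overlap the future rule, so only
    # rules with a different consequent constrain the enlargement.
    return enlarge_box([r for r in rules if r.consequent != label], seed, order)
```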
IV. CONCLUSION AND FUTURE WORK

In this work we discussed the problem of generating sets of rules without inconsistencies and proposed two algorithms to solve this problem, called CFSGS and CFSBE.

CFSGS is able to search through the feature-space for any region where the antecedent of a rule can be created without creating inconsistencies with any existing rule. However, the algorithm is computationally expensive.

The CFSBE algorithm, on the other hand, can only find regions around dataset instances that are not covered by any existing rule, but its computational cost is far more reasonable. We argue that the non-covered dataset instance requirement of CFSBE is not a hindering issue.

Neither algorithm is particularly useful by itself, since both are meant to supplement algorithms that generate rule-based classification models. In the future, we intend to modify a Learning Classifier System to use CFSBE both during the initial population creation and during the mutation phases, in order to study its effects on the predictive power and interpretability of the generated models, and whether it makes the models more prone to overfitting.

V. ACKNOWLEDGMENTS
This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brasil (CAPES) - Finance Code 001. R. Cerri thanks the São Paulo Research Foundation (FAPESP) for the grant.
CONFLICTS OF INTEREST
The authors declare no conflict of interest. The financing institutions had no role in the design of the study, in the writing of the manuscript or in the decision to publish the results.

REFERENCES
[1] F. E. Otero and A. A. Freitas, "Improving the interpretability of classification rules discovered by an ant colony algorithm," in Proceedings of the 15th Annual Conference on Genetic and Evolutionary Computation. ACM, 2013, pp. 73–80.
[2] B. Goodman and S. Flaxman, "European Union regulations on algorithmic decision-making and a 'right to explanation'," arXiv preprint arXiv:1606.08813, 2016.
[3] Z. C. Lipton, "The mythos of model interpretability," arXiv preprint arXiv:1606.03490, 2016.
[4] J. R. Quinlan, "Induction of decision trees," Machine Learning, vol. 1, no. 1, pp. 81–106, 1986.
[5] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, p. 436, 2015.
[6] A. A. Freitas, "Comprehensible classification models: a position paper," ACM SIGKDD Explorations Newsletter, vol. 15, no. 1, pp. 1–10, 2014.
[7] R. Caruana, H. Kangarloo, J. Dionisio, U. Sinha, and D. Johnson, "Case-based explanation of non-case-based learning methods," in Proceedings of the AMIA Symposium. American Medical Informatics Association, 1999, p. 212.
[8] J. H. Holland, L. B. Booker, M. Colombetti, M. Dorigo, D. E. Goldberg, S. Forrest, R. L. Riolo, R. E. Smith, P. L. Lanzi, W. Stolzmann et al., "What is a learning classifier system?" in International Workshop on Learning Classifier Systems. Springer, 1999, pp. 3–32.
[9] K. R. Varshney, P. Khanduri, P. Sharma, S. Zhang, and P. K. Varshney, "Why interpretability in machine learning? An answer using distributed detection and data fusion theory," arXiv preprint arXiv:1806.09710, 2018.
[10] S. Nirenburg, "Cognitive systems: Toward human-level functionality," AI Magazine, vol. 38, no. 4, 2017.
[11] F. E. Otero, A. A. Freitas, and C. G. Johnson, "A hierarchical multi-label classification ant colony algorithm for protein function prediction," Memetic Computing, vol. 2, no. 3, pp. 165–181, 2010.
[12] J. R. Quinlan, C4.5: Programs for Machine Learning. Elsevier, 2014.
[13] A. Clare and R. D. King, "Knowledge discovery in multi-label phenotype data," in European Conference on Principles of Data Mining and Knowledge Discovery. Springer, 2001, pp. 42–53.
[14] H. Blockeel, M. Bruynooghe, S. Džeroski, J. Ramon, and J. Struyf, "Hierarchical multi-classification," in Workshop Notes of the KDD'02 Workshop on Multi-Relational Data Mining, 2002, pp. 21–35.
[15] H. Blockeel, L. D. Raedt, and J. Ramon, "Top-down induction of clustering trees," in Proceedings of the Fifteenth International Conference on Machine Learning. Morgan Kaufmann Publishers Inc., 1998, pp. 55–63.
[16] C. Vens, J. Struyf, L. Schietgat, S. Džeroski, and H. Blockeel, "Decision trees for hierarchical multi-label classification," Machine Learning, vol. 73, no. 2, p. 185, 2008.
[17] L. Schietgat, C. Vens, J. Struyf, H. Blockeel, D. Kocev, and S. Džeroski, "Predicting gene function using hierarchical multi-label decision tree ensembles," BMC Bioinformatics, vol. 11, no. 1, p. 2, 2010.
[18] H. Wang, S. Kwong, Y. Jin, W. Wei, and K.-F. Man, "Agent-based evolutionary approach for interpretable rule-based knowledge extraction," IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), vol. 35, no. 2, pp. 143–155, 2005.
[19] K. Deb, A. Pratap, S. Agarwal, and T. Meyarivan, "A fast and elitist multiobjective genetic algorithm: NSGA-II," IEEE Transactions on Evolutionary Computation, vol. 6, no. 2, pp. 182–197, 2002.
[20] R. Cerri, R. C. Barros, and A. C. de Carvalho, "A genetic algorithm for hierarchical multi-label classification," in Proceedings of the 27th Annual ACM Symposium on Applied Computing. ACM, 2012, pp. 250–255.
[21] E. C. Gonçalves, A. A. Freitas, and A. Plastino, "A survey of genetic algorithms for multi-label classification," in 2018 IEEE Congress on Evolutionary Computation (CEC). IEEE, 2018, pp. 1–8.
[22] R. S. Parpinelli, H. S. Lopes, and A. A. Freitas, "Data mining with an ant colony optimization algorithm," IEEE Transactions on Evolutionary Computation, vol. 6, no. 4, pp. 321–332, 2002.
[23] A. Chan and A. A. Freitas, "A new ant colony algorithm for multi-label classification with applications in bioinformatics," in Proceedings of the 8th Annual Conference on Genetic and Evolutionary Computation. ACM, 2006, pp. 27–34.
[24] F. E. Otero, A. A. Freitas, and C. G. Johnson, "cAnt-Miner: An ant colony classification algorithm to cope with continuous attributes," in International Conference on Ant Colony Optimization and Swarm Intelligence. Springer, 2008, pp. 48–59.
[25] ——, "A hierarchical classification ant colony algorithm for predicting gene ontology terms," in European Conference on Evolutionary Computation, Machine Learning and Data Mining in Bioinformatics. Springer, 2009, pp. 68–79.
[26] ——, "A new sequential covering strategy for inducing classification rules with ant colony algorithms," IEEE Transactions on Evolutionary Computation, vol. 17, no. 1, pp. 64–76, 2013.