Evolving Multi-label Classification Rules by Exploiting High-order Label Correlation
Shabnam Nazmi, Xuyang Yan, Abdollah Homaifar, Emily Doucette
Highlights
Evolving Multi-label Classification Rules by Exploiting High-order Label Correlation

• A multi-label classification model is proposed by extending the supervised learning classifier system (UCS) through the LP technique.
• The high-order label correlation is exploited to improve the predictive performance, and the incompleteness challenge is addressed.
• Approximate bounds are derived for the average classifier fitness in terms of the dataset properties.

Shabnam Nazmi (a), Xuyang Yan (a), Abdollah Homaifar (a, *), and Emily Doucette (b)

(a) North Carolina A&T State University
(b) The Air Force Research Laboratory - Munitions Directorate, 101 West Eglin Blvd., Eglin AFB, FL, USA

Preprint submitted to Elsevier.
ARTICLE INFO
Keywords: Multi-label classification, Label correlation, Label powerset, Learning classifier systems
ABSTRACT
In multi-label classification tasks, each problem instance is associated with multiple classes simultaneously. In such settings, the correlation between labels contains valuable information that can be used to obtain more accurate classification models. The correlation between labels can be exploited at different levels, such as capturing the pair-wise correlation or exploiting the higher-order correlations. Even though the high-order approach is more capable of modeling the correlation, it is computationally more demanding and has scalability issues. This paper aims at exploiting the high-order label correlation within subsets of labels using a supervised learning classifier system (UCS). For this purpose, the label powerset (LP) strategy is employed, and a prediction aggregation within the set of labels relevant to an unseen instance is utilized to increase the prediction capability of the LP method in the presence of unseen labelsets. The exact match ratio and Hamming loss measures are considered to evaluate rule performance, and the expected fitness value of a classifier is investigated for both metrics. In addition, a computational complexity analysis is provided for the proposed algorithm. The experimental results of the proposed method are compared with other well-known LP-based methods on multiple benchmark datasets and confirm the competitive performance of this method.
1. Introduction
In multi-label classification (MLC) tasks, each problem instance is associated with multiple classes at the same time. Emotion identification [1], image annotation [24], text categorization [23], semantic scene classification, and gene and protein function prediction [34] are examples of such problems. For instance, in text categorization, a document can be classified as History and Biography simultaneously.

Over the past decade, many multi-label classification algorithms have been proposed to solve the multi-label classification problem in various domains. These algorithms can be categorized into two major groups: problem transformation methods and algorithm adaptation methods [54, 40]. Problem transformation methods transform the multi-label problem into one or multiple single-label classification problems, e.g., the label powerset (LP) [40] and binary relevance (BR) [3] methods. Algorithm adaptation methods modify existing multi-class methods for multi-label problems, such as methods based on kNN [53, 16], decision trees [47, 6], neural networks [48, 50], and support vector machines [11].

In many real-world multi-label classification problems, a correlation exists between different classes. For instance, a document belonging to the class 'Biography' can also be considered to belong to the class 'History'. Incorporating this information into the classification model could help with obtaining a more accurate classifier. The label correlation can be taken into account through three different strategies, namely first-order, second-order, and high-order [54]. The

* Corresponding author. [email protected] (S. Nazmi); [email protected] (X. Yan); [email protected] (A. Homaifar); [email protected] (E. Doucette)
first-order strategy converts the multi-label problem into multiple single-label classification problems and ignores the correlation among labels [53, 3]. The second-order strategy considers the pair-wise correlation between labels [11, 13, 52], and the high-order strategy looks at the high-order correlation through a subset of labels [32, 33, 41].

Many algorithms have been proposed that take into account the second-order correlation, often by exploiting the pair-wise relationship between labels. One way to model pairwise correlation is to exploit the co-occurrence pattern between label pairs (e.g., CLR [13] and LLSF [18]), which only considers the positive correlation between labels. On the other hand, LPLC [19] and the approach proposed in [27] exploit local positive and negative pairwise correlations between labels to obtain an MLC model. The PRC algorithm [20] extends pairwise classification to obtain a ranking procedure based on binary preference relations. Methods developed to capture the high-order correlation are more capable of modeling correlation among labels, but are computationally more expensive and suffer from scalability issues [54].

High-order approaches mine the relationship between all classes or subsets of classes. Classifier chains (CC) [33] is a multi-label classification method that models such relationships by using the vector of class labels as additional sample attributes and transforms the multi-label classification problem into a chain of q binary classification problems. Extensions of the CC algorithm, such as probabilistic classifier chains (PCC) [5], add a probabilistic interpretation to CC. Also, Bayesian CC [49] describes the dependency structure of the class labels as a tree. LP is one of the methods that allows for exploiting the high-order label correlation by taking into account label subsets.
Random k-labelset (RAkEL) [41] exploits label correlation in a random way by transforming the problem into an ensemble of multi-class classification problems, where each component of the ensemble learns a random subset of the labels through a classifier induced by the LP technique. The ensemble of pruned sets (EPS) [32] follows the LP strategy but focuses only on the most important correlations in order to reduce complexity.

Although LP is a straightforward approach to transform multi-label problems into multi-class problems and incorporates label correlation into the learning problem, it is challenged in two significant ways [54]: (i) incompleteness, where LP is limited to predicting labelsets appearing in the training data; (ii) inefficiency: when the number of labels is large, there are too many possible label powersets to learn, and the training instances for some powersets are very few, which creates class imbalance. RAkEL tackles these challenges by combining ensemble learning with LP only on randomly chosen k-sized labelsets.

A learning classifier system (LCS) is a genetic-based machine learning system that combines discovery and learning components to train a rule-based model [17, 45]. The evolutionary component finds new rules, and the learning component assigns credit to the rules based on an estimate of their contribution. LCSs are applied in a variety of domains such as biology, computer science, medicine, and the social sciences. Three of the major structures developed for LCSs are: XCS [22], which operates under the reinforcement learning framework; UCS [2] and ExSTraCS [46], which are developed for supervised learning tasks; and N-LCS [7], which leverages neural networks. In [25], several convolutional neural network (CNN) structures are exploited to study the performance of N-LCSs with CNNs. In [31], UCS is successfully used to solve multiple real-world pattern recognition problems.
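As a concrete illustration of the LP transformation and its incompleteness issue, the following minimal Python sketch (ours, not the authors' implementation) maps each distinct labelset to one atomic class; any labelset absent from the training data can never be predicted by the resulting multi-class model.

```python
def lp_transform(labelsets):
    """Map each distinct labelset to an atomic class id (the LP transformation)."""
    classes = {}   # frozenset of labels -> LP class id
    y = []
    for ls in labelsets:
        key = frozenset(ls)
        if key not in classes:
            classes[key] = len(classes)
        y.append(classes[key])
    return y, classes

# Three training samples but only two distinct labelsets -> two LP classes;
# the labelset {"biography"} alone was never observed, so LP cannot predict it.
train = [{"history", "biography"}, {"history"}, {"history", "biography"}]
y, classes = lp_transform(train)
```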
Furthermore, strength-based learning classifiers are adapted to handle multi-label data with weighted labels in [28], and [29] investigated the UCS algorithm for its potential in solving multi-label classification problems.

In this paper, the high-order strategy to handle label correlation through the LP technique is considered, and the UCS algorithm is adapted to evolve a rule-based multi-label classification model. The prediction of each rule is a subset of labels induced from the training data. The genetic algorithm (GA) creates new rules by combining two of the existing rules through genetic operators. To reduce the computational complexity, the genetic search is limited to the classifier condition. To overcome the incompleteness of the LP technique on unseen samples, unlike methods that consider random subsets of labels, the proposed method generates new label powersets by exploiting the information learned within each problem niche collectively. This approach adopts a prediction scheme similar to a kNN method with a dynamic k that aggregates predictions from all the rules relevant (matching) to a given instance. Approximate bounds are derived for the expected value of a classifier's fitness using the average Hamming distance and average Hamming weight bounds. Moreover, a computational complexity analysis is performed for the proposed algorithm. The major contributions of this work are as follows:

• A new multi-label classification technique is developed that exploits high-order label correlation by adapting the traditional UCS algorithm to predict multi-labels through the LP technique. Inspired by the kNN method, a prediction aggregation is proposed to tackle the incompleteness of the LP technique.

• For evaluating the fitness of the classification rules, two strategies are considered. The average classifier fitness using each evaluation strategy is derived in terms of the multi-label data properties, and discussions provide insight on the derived bounds.
These strategies are also studied through experiments on synthetic and real-world datasets.

• Experiments on multiple benchmark datasets are conducted to compare the proposed method with other well-known multi-label classification methods, and statistical analyses are performed to analyze the results.

The rest of the paper is organized as follows: Section 2 introduces the notation and important metrics; Section 3 presents the proposed method and its theoretical analysis; and Section 4 describes the experimental setup and results. Finally, concluding remarks and future work are presented.
2. Multi-label classification problem
Let $\mathbb{X}$ denote an input space and let $\mathcal{L} = \{\lambda_1, \dots, \lambda_m\}$ be a finite set of class labels. Suppose every instance $\mathbf{x} \in \mathbb{X}$, where $\mathbf{x} \in \mathbb{R}^d$, is associated with a subset of labels $L \subset \mathcal{L}$, which is often called the set of relevant labels. The complement of $L$ is called the irrelevant set and is denoted by $\bar{L}$. Then $D = \{(\mathbf{x}_1, L_1), (\mathbf{x}_2, L_2), \dots, (\mathbf{x}_n, L_n)\}$ is a finite set of training instances that are assumed to be randomly drawn from an unknown distribution. The objective is to train a multi-label classifier $h : \mathbb{X} \rightarrow 2^{\mathcal{L}}$ that best approximates the training data and generalizes well to the samples in the test data. The function $f(\mathbf{x}, \lambda)$ calculates the score value for class $\lambda$.

To characterize the properties of a multi-label problem that influence the learning performance, various metrics have been proposed in the literature. Label cardinality (1) is the average number of labels per sample, and label density (2) is the cardinality divided by the number of classes [38].
$$Card(D) = \frac{1}{n} \sum_{i=1}^{n} |L_i|, \qquad (1)$$

$$Dens(D) = \frac{1}{n} \sum_{i=1}^{n} \frac{|L_i|}{m} = \frac{Card(D)}{m}. \qquad (2)$$

Moreover, in [51] and [41], the number of distinct labelsets (DL) is defined as the number of different label combinations in the dataset:

$$DL(D) = \left| \{ L \subset \mathcal{L} \mid \exists\, (\mathbf{x}, L) \in D \} \right|. \qquad (3)$$
In [51], the proportion of distinct labelsets (PDL) is defined as the number of distinct labelsets relative to the number of instances:
$$PDL(D) = \frac{DL(D)}{n}. \qquad (4)$$
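To make these statistics concrete, the short Python sketch below (function and variable names are ours) computes Card, Dens, DL, and PDL from a list of labelsets:

```python
def ml_stats(labelsets, m):
    """Dataset statistics of Eqs. (1)-(4); labelsets is a list of sets, m the number of classes."""
    n = len(labelsets)
    card = sum(len(L) for L in labelsets) / n        # Eq. (1): label cardinality
    dens = card / m                                  # Eq. (2): label density
    dl = len({frozenset(L) for L in labelsets})      # Eq. (3): distinct labelsets
    pdl = dl / n                                     # Eq. (4): proportion of distinct labelsets
    return card, dens, dl, pdl

card, dens, dl, pdl = ml_stats([{0, 1}, {1}, {0, 1}, {2}], m=3)
# card = 1.5, dens = 0.5, dl = 3, pdl = 0.75
```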
3. Proposed methodology
In this section, the structure of the proposed multi-label learning classifier system (abbreviated as MLR, for multi-label classification rules) is explained. A theoretical analysis is presented to show the relationship between the expected fitness of a classification rule in MLR and the dataset properties, using two fitness evaluation strategies. In addition, the computational complexity of the proposed algorithm is discussed.
The proposed algorithm consists of two main components: the rule structure and the multi-label prediction.
The following shows an example of a rule in the multi-label setting that matches $\mathbf{x}$:

$$R : \{\text{if } \mathbf{x} \in [\mathbf{c} - \mathbf{s}, \mathbf{c} + \mathbf{s}], \text{ then prediction} = \mathbf{y}\}, \qquad (5)$$

where $\mathbf{c}$ and $\mathbf{s}$ are the vectors of center and spread of a hyper-rectangle, respectively, encoding the classifier condition $\delta$, and $\mathbf{y}$ is a binary vector with a '1' representing labels relevant to $\mathbf{x}$ and a '0' representing irrelevant labels. Every time the covering mechanism creates a new rule, it assigns the correct labelset of the training instance to the created rule. The set of all matching rules comprises the match set ($[M]$). The experience ($exp$) of a rule is the number of times that it has matched $\mathbf{x}$, and its numerosity ($num$) is the number of copies of it in the population ($[P]$) created by the GA. It is assumed that the maximum number of rules allowed in $[P]$ is $N$. In the MLR algorithm, no genetic search is applied to the label space, and the offspring take exactly the same predictions as their parents. Moreover, the fitness ($F$) specifies the relative predictive performance of a rule and is used by the GA as a measure of contribution. The fitness of a rule at iteration $t$ can be calculated using different multi-label performance metrics:

$$F^{em}_t = \left( \frac{em_t}{exp_t} \right)^{\nu}, \qquad (6)$$

$$F^{hl}_t = \left( 1 - \frac{hl_t}{exp_t} \right)^{\nu}. \qquad (7)$$

In the above equations, $em_t$ and $hl_t$ are the accumulated exact match (EM) and Hamming loss (HL) measures of a rule's predictions at iteration $t$, respectively. These values can be calculated at each iteration as $\sum_{i=1}^{exp_t} M_i$ for a given multi-label metric $M$. Moreover, $\nu$ is a constant set by the user that determines the strength of the pressure toward accurate rules [30].

In the original UCS, the predicted class is the one predicted by the classifier with the highest fitness value. Let $\mathbf{y}_l$ be the prediction of rule $R_l$; then the predicted multi-label can be obtained as follows:

$$\mathbf{y}_{max} : \{ \mathbf{y}_{l^*} \mid F_{l^*} = \max_{l = 1, \dots, |[M]|} F_l \}, \qquad (8)$$

where $F_l$ is the fitness of rule $R_l$.
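The rule encoding of Eq. (5) and the two fitness criteria of Eqs. (6)-(7) can be sketched as follows. This is an illustrative Python fragment (class and attribute names are ours, not the authors' implementation), with accumulated EM and HL tallies as in the text:

```python
class Rule:
    """Hyper-rectangle rule: matches x when c - s <= x <= c + s (Eq. 5)."""

    def __init__(self, center, spread, labelset, m, nu=1.0):
        self.c, self.s = center, spread
        self.y = labelset          # binary prediction vector
        self.m = m                 # number of classes
        self.nu = nu               # accuracy-pressure exponent
        self.exp = 0               # experience: times the rule matched
        self.em_sum = 0.0          # accumulated exact matches
        self.hl_sum = 0.0          # accumulated Hamming losses

    def matches(self, x):
        return all(ci - si <= xi <= ci + si
                   for xi, ci, si in zip(x, self.c, self.s))

    def update(self, true_y):
        self.exp += 1
        self.em_sum += 1.0 if self.y == true_y else 0.0
        hd = sum(a != b for a, b in zip(self.y, true_y))
        self.hl_sum += hd / self.m

    def fitness_em(self):          # Eq. (6)
        return (self.em_sum / self.exp) ** self.nu

    def fitness_hl(self):          # Eq. (7)
        return (1.0 - self.hl_sum / self.exp) ** self.nu

r = Rule(center=[0.5, 0.5], spread=[0.2, 0.2], labelset=[1, 0, 1], m=3)
r.update([1, 0, 1])   # exact match
r.update([1, 0, 0])   # Hamming distance 1, so HL = 1/3
# fitness_em = 0.5, fitness_hl = 1 - (0 + 1/3)/2 = 5/6
```

Note how a partially correct prediction still contributes to the HL-based fitness but not to the EM-based one, which is the distinction analyzed later in the fitness bounds.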
We call the algorithm that implements this prediction strategy MLR$_{max}$. Nonetheless, to overcome the incompleteness of LP-based learning, the labelsets predicted by the locally relevant rules (the rules that appear in $[M]$) are aggregated, and each class label is assigned a score value. More specifically, the aggregation of predictions is composed of two steps: assembling all of the predicted classes into a unified prediction, and constructing a combined score for each class. To perform the former, a simple union of all the predicted labels in $[M]$ is considered. Let $\mathbf{y}_l$ be the prediction of rule $R_l$; then the predicted combined multi-label $\bar{\mathbf{y}}$ is:

$$\bar{\mathbf{y}} : \{ \bar{y}_i \mid \exists R_l : \bar{y}_i \in \mathbf{y}_l;\ i = 1, \dots, m;\ l = 1, \dots, |[M]| \}, \qquad (9)$$

where $|[M]|$ is the size of the current match set and works similarly to a dynamic $k$ in a kNN algorithm. In the second step, a score is calculated for each class from the rules in $[M]$ based on the fitness and density of the rules that cover the subspace containing $\mathbf{x}$. The score of the $i$-th class $\lambda_i$ can be calculated as follows:

$$s_i = \sum_{l = 1, \dots, |[M]|,\ y_{li} = 1} F_l \times num_l. \qquad (10)$$

A higher score indicates that $\lambda_i$ is advocated by rules with a larger number of copies, higher fitness, or both. We call the algorithm that implements this prediction strategy MLR$_{agg}$. By employing different bi-partitioning methods, a set of relevant labels can be obtained from the normalized scores.

Evaluating the fitness of a classifier using criteria (6) and (7) leads to the evolution of classifiers with different properties. In this section, the objective is to find a relation between the expected fitness ($\bar{F}$) of a classifier when evaluated with each criterion and the dataset properties. First, $\bar{F}_{em}$ is derived when EM is used.
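The two-step aggregation of Eqs. (9)-(10) can be sketched as below. This is our own minimal illustration (function names and the 0.5 threshold are arbitrary choices, not the paper's): each label's score sums fitness times numerosity over the matching rules that predict it, and the normalized scores are then bi-partitioned:

```python
def aggregate(match_set, m, threshold=0.5):
    """match_set: list of (y, fitness, numerosity) tuples; y is a 0/1 list of length m."""
    scores = [0.0] * m
    for y, fit, num in match_set:
        for i in range(m):
            if y[i] == 1:
                scores[i] += fit * num          # Eq. (10)
    top = max(scores) or 1.0                    # avoid division by zero
    norm = [s / top for s in scores]
    return [1 if s >= threshold else 0 for s in norm]

match_set = [([1, 1, 0], 0.9, 3),   # (prediction, fitness, numerosity)
             ([1, 0, 0], 0.8, 1),
             ([0, 0, 1], 0.2, 1)]
pred = aggregate(match_set, m=3)
# scores = [3.5, 2.7, 0.2] -> normalized ~[1.0, 0.77, 0.06] -> pred = [1, 1, 0]
```

Note that the combined labelset {label 0, label 1} here need not equal any single rule's prediction, which is how the aggregation mitigates LP incompleteness.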
Then, an upper bound is derived on $\bar{F}_{hl}$ in terms of the number of classes $m$ and the number of distinct labelsets $DL$ in Section 3.2.1. Furthermore, a lower bound is derived on $\bar{F}_{hl}$ in terms of the number of classes $m$ and the label density value in Section 3.2.2.

Consider a random classification rule $R$ with condition $\delta$ and prediction $\mathbf{y}$. Assuming an evenly distributed sample space, for a classifier with hyper-rectangle condition encoding, the probability that $R$ matches $\mathbf{x}$ is proportional to the volume that it covers [8]. Let $V_R$ and $V_D$ be the volume covered by $R$ and the volume of the input space of $D$, respectively. The probability of matching is as follows:

$$P_m(R) = \frac{V_R}{V_D} = \frac{\prod_{i=1}^{d} r_i}{\prod_{i=1}^{d} s_i}, \qquad (11)$$

where $r_i$ and $s_i$ are the spans of the classifier condition and of the input space of $D$, respectively. Assuming a normalized input space, the average volume covered by a classifier is $r^d$, where $r$ is a user-defined initialization parameter utilized by the covering mechanism. Furthermore, assume that the average generality of the population is $\tau$, which is a function of the generalization parameter $P_\#$ and the probability of mutation by the GA. Thus, we reach the following relation for the probability that $R$ is part of the match set:

$$P(R \in [M]) = r^d \tau. \qquad (12)$$

In MLR, the genetic algorithm selects two classifiers through a roulette wheel selection procedure proportional to their fitness. For a dataset $D$ with $n$ samples, the average number of samples that $R$ matches is $n_m = r^d \tau \times n$, and $L_m \subset L$ is the set containing the labelsets of these samples. Without loss of generality, assume that $\nu = 1$ in equations (6) and (7). Thus, the expected fitness update for a random classifier using the EM criterion is

$$\bar{F}_{em} = r^d \tau \times \frac{1}{DL}, \qquad (13)$$

which is a function of the number of distinct labelsets in $D$. Using the HL criterion, the average fitness of this classifier can be formulated as

$$\bar{F}_{hl} = r^d \tau \times \frac{\sum_{i=1}^{n_m} (1 - hl_i)}{n_m}. \qquad (14)$$

Substituting $hl$ with its average value based on the average Hamming distance ($ahd$), the following equation is obtained:

$$\bar{F}_{hl} = r^d \tau \times \left( 1 - \frac{ahd}{m} \right). \qquad (15)$$

In (15), $ahd$ is the average Hamming distance between the labelsets of the samples that $R$ matches. To find a relation between $\bar{F}_{hl}$ and the properties of $D$, we first provide a few definitions.

A binary code is a non-empty subset of the $m$-dimensional vector space over the binary field $F_2$ [12]. Assuming that each label combination in $D$ is observed only once, i.e.,
$DL = n$, the set of labelsets $L$ of a dataset forms a binary code with cardinality $DL$. Similarly, $L_m$ forms a binary code with cardinality $N_m$. The average Hamming distance for the binary code $L$ is defined as

$$ahd(L) = \frac{1}{DL^2} \sum_{L_i \in L} \sum_{L_j \in L} hd(L_i, L_j). \qquad (16)$$

Moreover, the Hamming weight ($hw$) is the number of non-zero elements in a binary string [55]. The average Hamming weight ($ahw$) of the binary code $L$ is defined as

$$ahw(L) = \frac{1}{DL} \sum_{L_i \in L} hw(L_i). \qquad (17)$$

To the best of our knowledge, there is no closed-form calculation for equations (16) and (17). Therefore, we proceed by employing upper-bound and lower-bound approximations for them to obtain bounds for $\bar{F}_{hl}$.

In the literature, lower- and upper-bound approximations for the value of $ahd$ have been proposed [55] that consider special cases for the cardinality of a binary code. One straightforward lower bound on $ahd(L)$ is as follows [12]:

$$\frac{m+1}{2} - \frac{2^{m-1}}{DL} \leq ahd(L). \qquad (18)$$

This inequality is meaningful only when $DL \geq 2^m / (m+1)$ [55]. Inequality (18) suggests that a larger number of distinct labelsets in a multi-label dataset increases the lower bound on the average Hamming distance of $L$. For a classifier that matches a subset of the instances, $L_m \subset L$ holds. This means that $N_m \leq DL$, and the $ahd$ for this classifier follows the inequality

$$ahd(L_m) \leq ahd(L). \qquad (19)$$

In (19), equality holds when $L_m = L$, i.e., $R$ matches all samples in $D$, which is the case when $R$ is a classifier with an over-general condition. In other words, the prediction made by a classifier with an over-general condition has the maximum average Hamming distance to the labelset of a training instance. Putting (15) and (18) together, the following upper bound exists for the expected fitness of such a classifier based on the number of distinct labelsets $DL$ and the number of classes $m$:

$$\bar{F}_{hl} \leq r^d \tau \left( 1 - \frac{m+1}{2m} + \frac{2^{m-1}}{m \times DL} \right). \qquad (20)$$

According to inequality (20), a larger number of distinct labelsets imposes a smaller upper bound on the expected value of the classifier fitness.

Based on the definitions, $\bar{F}_{hl} = ave(F)$ over $0 \leq hd \leq m$, while $\bar{F}_{em} = ave(F)$ only when $hd = 0$. This means that the classifier fitness $F^{hl}$ at a given time is considered non-zero even when its prediction is not an exact match of the true labelset, i.e., when its $hd > 0$. As a result of this more frequent positive fitness evaluation, $\bar{F}_{em} \leq \bar{F}_{hl}$ holds for a classifier. This means that in the MLR algorithm, when classifiers are evaluated with respect to the HL of their predicted labelset, they are expected to receive a larger fitness update on average. This relation implies that classifiers that are not very accurate but can partially predict the correct multi-label are respected as contributing classifiers when the evaluation criterion is HL. With a higher fitness value, these classifiers will have a better chance of receiving reproductive opportunities from the GA and remaining in the population of rules.
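The lower bound of (18) can be checked numerically. The sketch below (ours, assuming the double sum in Eq. (16) is averaged over all ordered pairs, i.e., normalized by $DL^2$) shows that for the full binary code of length $m$, where $DL = 2^m$, the bound is attained exactly:

```python
from itertools import product

def ahd(code):
    """Average Hamming distance over all ordered codeword pairs (Eq. 16)."""
    dl = len(code)
    total = sum(sum(a != b for a, b in zip(u, v)) for u in code for v in code)
    return total / dl ** 2

m = 4
full_code = list(product([0, 1], repeat=m))            # all 2^m codewords
lower_bound = (m + 1) / 2 - 2 ** (m - 1) / len(full_code)
# For the full code: ahd = m/2 = 2.0 and the bound is tight (also 2.0).
```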
An upper bound for $ahd(L)$ is proposed in [55] that relates it to the value of $ahw$ in (17), as follows:

$$ahd(L) \leq 2\, ahw(L) - \frac{2\, ahw(L)^2}{m}. \qquad (21)$$

Considering that, within the multi-label learning framework with $DL = n$, the $ahw$ of the set of binary label vectors corresponds to the value of the label cardinality $Card$, as defined in (1), inequality (21) offers the following upper bound on the $ahd$ of the labelsets of $D$:

$$ahd(L) \leq 2\, Card - \frac{2\, Card^2}{m}. \qquad (22)$$

Putting (15) and (22) together, and substituting the label density $Dens$ for $Card / m$, we obtain the following lower bound on the expected fitness of a random classifier based on the $Dens$ value of a dataset:

$$r^d \tau \left( 2\, Dens^2 - 2\, Dens + 1 \right) \leq \bar{F}_{hl}. \qquad (23)$$

According to (23), the lower bound of $\bar{F}_{hl}$ is a quadratic function of $Dens$ whose minimum occurs at $Dens = 1/2$. Furthermore, according to (22), $ahd$ attains its maximum value when $Card = m/2$. Therefore, employing the HL criterion causes the individual classifiers in the MLR algorithm to expect the smallest average fitness exactly when the average Hamming distance between the classifier prediction and the correct labelset is expected to be at its largest value.

In this section, the computational complexity of the proposed multi-label classification algorithm is analyzed in terms of its major components. Today's learning classifier systems, including UCS and ExSTraCS, consist of many interacting components with complex dependencies. In [46], a comprehensive list of these components along with their functional descriptions is provided. The analysis presented here studies the complexity of these components individually, without considering the complex interactions among them.

Table 1 presents the complexity of training the MLR algorithm in terms of its components, as well as its overall complexity for one training iteration. Here, it is assumed that the genetic algorithm employs tournament selection with size $t$ and uniform crossover. Given that the algorithm is trained for $it$ iterations, deletion from the population and the genetic algorithm repeat numerous times during training. According to Table 1, the computational complexity of training the MLR algorithm is of order $O(N \cdot n \cdot d \cdot m)$, which grows linearly in the number of rules $N$, the number of training instances $n$, the input dimension $d$, and the number of classes $m$.
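The weight-based bound of (21)-(22) can likewise be verified empirically. The following script (our own sanity check, again averaging ordered pairs as in Eq. (16)) draws random binary label codes and confirms that the average Hamming distance never exceeds $2\,ahw - 2\,ahw^2/m$:

```python
import random

def check_ahw_bound(m=6, dl=20, trials=50, seed=0):
    """Empirically verify ahd(L) <= 2*ahw(L) - 2*ahw(L)^2/m (Eq. 21) on random codes."""
    rng = random.Random(seed)
    for _ in range(trials):
        code = [tuple(rng.randint(0, 1) for _ in range(m)) for _ in range(dl)]
        ahw = sum(sum(v) for v in code) / len(code)
        ahd = sum(sum(a != b for a, b in zip(u, v))
                  for u in code for v in code) / len(code) ** 2
        if ahd > 2 * ahw - 2 * ahw ** 2 / m + 1e-12:
            return False   # bound violated
    return True
```

The bound holds because, per coordinate, the ordered-pair disagreement rate is $2 p_i (1 - p_i)$, and the Cauchy-Schwarz inequality relates $\sum_i p_i^2$ to $(\sum_i p_i)^2 / m$.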
4. Results and Discussion
In this section, the benchmark datasets and the classification algorithms used in the comparison experiments are described. Then, the multi-label evaluation measures
are explained, and the strategies employed for parameter instantiation are reported. Finally, the results are presented and discussed.

Table 1
Computational complexity of the major MLR components and the overall complexity of MLR.

Operation                            Big O (one training iteration)
Matching                             O(N·d)
Parameter update (exp, hl, ...)      O(|[M]|·m) ≤ O(N·m)
Fitness calculation                  O(|[M]|) ≤ O(N)
Deletion weight                      O(N)
Delete from [P]                      O(N)
GA (applied once)                    O(N·d·m·t)
Overall (one iteration)              O(N·n·d·m)
For the comparison, five real-world datasets are used: Yeast [11], Emotions [37], Flags [14], Computer Audition Lab 500 (CAL500) [43], and Genbase [10]. Table 2 shows the number of instances, features, and classes for each dataset.
The comparison of the proposed algorithm is performed using the implementations of the following algorithms in the MULAN library (http://mulan.sourceforge.net/) under the machine learning framework WEKA [15]. The compared methods are as follows:

• Label powerset methods: vanilla LP, RAkEL, the ensemble of pruned sets (EPS), and the hierarchy of multi-label classifiers (HOMER) with balanced k-means [39].
• Binary relevance methods: the ensemble of classifier chains (ECC).
• Algorithm adaptation methods: multi-label k-nearest neighbors (ML-kNN).

In the LP algorithm, the implementation of the decision tree algorithm in WEKA is employed as the base learner. The proposed method is implemented in Python using the UCS implementation [44] as the base algorithm.

In order to improve the performance and speed up the execution of the algorithm, the RF-ML feature selection strategy [35] is employed in this work. RF-ML is an extension of the well-known ReliefF feature selection strategy to multi-label data that takes into account the effect of interacting features in ML problems without the need to transform the ML problem into a multi-class problem. The result of applying feature selection is either a subset or a ranked list of the original features. In the latter case, only the top thirty percent of the features are used for model training [4].
Preprint submitted to Elsevier
Page 5 of 12volving Multi-label Classification Rules by Exploiting High-order Label Correlation
Table 2
ML datasets. In the Feat. column, n, c, and b refer to numeric, categorical, and binary attributes, respectively.

Dataset    Domain    Inst.   Feat.          Labels   Card.    Dens.   DL    PDL
Yeast      Biology   2,417   103(n)         14       4.237    0.303   198   0.082
Emotions   Music     593     72(n)          6        1.869    0.311   27    0.045
Flags      Images    194     9(c) + 10(n)   7        3.392    0.485   54    0.278
CAL500     Music     502     68(n)          174      26.044   0.150   502   1
Genbase    Biology   662     1,186(b)       27       1.245    0.046   32    0.048

All experiments are carried out on a 2.70 GHz Windows 10 machine with 16.0 GB of RAM.
In this section, the evaluation measures used in the experiments are explained [26].

∙ Hamming loss computes the fraction of labels whose relevance is predicted incorrectly:

$$HL(h) = \frac{1}{m} \left| h(\mathbf{x}) \,\Delta\, L \right|,$$

where $\Delta$ denotes the symmetric difference between the predicted labelset $h(\mathbf{x})$ and the relevant set $L$.

∙ Accuracy is the number of correctly predicted classes relative to the union of relevant and predicted labels:

$$Acc(h) = \frac{|h(\mathbf{x}) \cap L|}{|h(\mathbf{x}) \cup L|}.$$

∙ Precision is the number of correctly predicted classes relative to the set of predicted labels:

$$Pr(h) = \frac{|h(\mathbf{x}) \cap L|}{|h(\mathbf{x})|}.$$

∙ Recall is the number of correctly predicted classes relative to the set of relevant labels:

$$Rc(h) = \frac{|h(\mathbf{x}) \cap L|}{|L|}.$$

∙ The $F_1$ measure is the harmonic mean of the precision and recall of the predicted labels:

$$F_1(h) = \frac{2 \times Pr \times Rc}{Pr + Rc}.$$

For an evaluation measure $M$, a macro-measure is computed by evaluating the underlying measure once for each label and calculating the mean value. In contrast, a micro-measure aggregates the predictions of all labels and evaluates the measure at the end:

$$M_{micro} = M\left( \sum_{i=1}^{m} TP_i, \sum_{i=1}^{m} FP_i, \sum_{i=1}^{m} TN_i, \sum_{i=1}^{m} FN_i \right),$$

$$M_{macro} = \frac{1}{m} \sum_{i=1}^{m} M(TP_i, FP_i, TN_i, FN_i),$$

where $TP$, $FP$, $TN$, and $FN$ stand for true positives, false positives, true negatives, and false negatives, respectively. Based on these definitions, the micro and macro averages of the $F_1$ measure can be calculated as follows.

∙ $F_1^{micro}$ is the harmonic mean of the micro-precision and the micro-recall:

$$F_1^{micro} = \frac{2 \times Pr_{micro} \times Rc_{micro}}{Pr_{micro} + Rc_{micro}}.$$

∙ $F_1^{macro}$ is the harmonic mean of precision and recall computed per label and then averaged across all labels. If $p_j$ and $r_j$ are the precision and recall for $\lambda_j$, then:

$$F_1^{macro} = \frac{1}{m} \sum_{j=1}^{m} \frac{2\, p_j \times r_j}{p_j + r_j}.$$

∙ One error computes how often the top-ranked label is not relevant:

$$OE(f) = \begin{cases} 1 & \text{if } \arg\max_{\lambda \in \mathcal{L}} f(\mathbf{x}, \lambda) \notin L, \\ 0 & \text{otherwise.} \end{cases}$$

∙ Ranking loss computes the average fraction of label pairs that are not correctly ordered:

$$RL(f) = \frac{\left| \{ (\lambda, \acute{\lambda}) \mid f(\mathbf{x}, \lambda) \leq f(\mathbf{x}, \acute{\lambda}),\ (\lambda, \acute{\lambda}) \in L \times \bar{L} \} \right|}{|L| \times |\bar{L}|}.$$

The parameters of the methods used in the comparison are instantiated following recommendations from the literature.
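The example-based measures defined in this section can be sketched in a few lines of Python (ours; labelsets are represented as Python sets, and the edge-case conventions for empty sets are our choices):

```python
def hamming_loss(h, L, m):
    """Fraction of the m labels whose relevance is predicted incorrectly."""
    return len(h.symmetric_difference(L)) / m

def accuracy(h, L):
    """Correct predictions relative to the union of predicted and relevant labels."""
    union = h | L
    return len(h & L) / len(union) if union else 1.0

def precision(h, L):
    """Correct predictions relative to the predicted labelset h."""
    return len(h & L) / len(h) if h else 0.0

def recall(h, L):
    """Correct predictions relative to the relevant labelset L."""
    return len(h & L) / len(L) if L else 0.0

def f1(h, L):
    p, r = precision(h, L), recall(h, L)
    return 2 * p * r / (p + r) if p + r else 0.0

h, L = {0, 1}, {1, 2}
# With m = 4: hamming_loss = 0.5, accuracy = 1/3, precision = recall = f1 = 0.5
```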
In cases where a parameter is to be determined from a set of values, the value that corresponds to the maximum $F_1$ measure on each dataset is used in the experiments. All parameters and threshold values are determined through a train-test split on each dataset.

The number of models in RAkEL is set to min(2m, ·) for all datasets [42]. The size of the labelsets for RAkEL is set to m/2, as it provides a balance between computational complexity and performance [42, 33]. The number of neighbors in the ML-kNN method for each dataset is selected from the set (6, ...) with a step size of 2. The EPS algorithm requires setting multiple parameters: the strategy parameter s, denoted as A_b and B_b for strategies A and B, respectively; the parameter b, which is selected from the set {1, ...} for each strategy; the parameter p, which is selected by decreasing from 5; and, finally, the number of models, which is set to 10 [32]. The number of models in ECC is also set to 10, to be consistent with the other ensemble methods. HOMER requires the number of clusters to be determined, which is selected from (2, ...) [39].

In the proposed method, the maximum number of rules allowed in the model, N, is selected from (1000, ...) in steps of 1000; P_#, the probability of replacing an allele in the classifier condition with a hash (don't-care) symbol, is selected from [0.··, 0.··] with a step size of 0.05; and the threshold by which the genetic algorithm is applied is selected from (5, ...) with a step size of 10. Once the parameters are determined, the threshold values for all methods on all datasets are selected from [0.··, 0.··] with a step of 0.05.

This experimental study aims at addressing the following questions: (i) Which strategy is more effective for evaluating individual classification rules? (ii) What is the effect of employing prediction aggregation over the maximum-fitness criterion? (iii) How effective is the proposed algorithm in exploiting label correlation compared to the other methods?
The fitness of a single classifier can be evaluated using the EM or the HL measure, as shown in (6) and (7). In this section, the effect of employing each criterion on the overall model performance is investigated using synthetic and real-world data. For the synthetic data, the framework proposed in [36] is employed using the hyper-cube strategy. The experiment on each dataset is repeated ten times to reduce the variance of the results, and the model performance is reported in terms of the HL of the model on test data in Tables 3 and 4.

According to Table 3, the model performs better on synthetic datasets using 𝑒𝑚𝑡, while Table 4 shows that ℎ𝑙𝑡 provides smaller test HL values on real-world datasets. According to the discussion in Section 3.2, during training, 𝑒𝑚𝑡 causes the expected update in the fitness of the classifiers to be smaller compared to employing ℎ𝑙𝑡. This creates an evolutionary pressure towards classifiers with more specific conditions that cover only a few samples or a very small subspace but tend to be more accurate. Such pressure increases the chance of evolving rules that overfit the training data, especially on real-world problems, as observed in Table 4. On the other hand, ℎ𝑙𝑡 prevents over-fitting by preserving the classifiers that are not very accurate but predict partially correct labelsets on every iteration.

According to Table 4, employing ℎ𝑙𝑡 leads to models with better performance on four out of five datasets. A one-tailed Wilcoxon signed-ranks test with 𝛼 = 0. and 𝑁 = 5 is applied to the results. The test did not provide enough evidence for rejecting the null hypothesis, which means that, given the current evidence, the performance of the MLR algorithm using either evaluation method is not significantly different. However, in the following comparison experiments, ℎ𝑙𝑡, i.e. strategy (7), is employed to guarantee the condition for sufficient generalization and avoid over-fitting.
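Since equations (6) and (7) are not reproduced here, the sketch below assumes the EM-based criterion scores a rule by the exact-match indicator and the HL-based criterion by one minus the Hamming loss; it also re-runs the one-tailed Wilcoxon signed-ranks test on the Table 4 values with an exact null distribution. The function names are illustrative, not from the paper:

```python
from itertools import product

# Assumed forms of the two criteria: exact-match indicator vs. 1 - Hamming loss.
def em_fitness(pred, true):
    """Exact match: full credit only when the whole labelset is correct."""
    return 1.0 if tuple(pred) == tuple(true) else 0.0

def hl_fitness(pred, true):
    """One minus Hamming loss: partial credit for partially correct labelsets."""
    return 1.0 - sum(p != t for p, t in zip(pred, true)) / len(true)

# A partially correct rule gets zero under EM but substantial credit under HL,
# which is the generalization pressure discussed above.
assert em_fitness((1, 0, 1, 0), (1, 0, 1, 1)) == 0.0
assert hl_fitness((1, 0, 1, 0), (1, 0, 1, 1)) == 0.75

def wilcoxon_one_sided(x, y):
    """Exact one-tailed Wilcoxon signed-ranks test (H1: x tends to exceed y)."""
    d = [a - b for a, b in zip(x, y) if a != b]
    order = sorted(range(len(d)), key=lambda i: abs(d[i]))
    ranks = [0.0] * len(d)
    for r, i in enumerate(order):
        ranks[i] = r + 1.0            # no tied |differences| in this data
    w_pos = sum(r for r, di in zip(ranks, d) if di > 0)
    # Enumerate all 2^n sign assignments for the exact null distribution.
    hits = sum(1 for signs in product((0, 1), repeat=len(d))
               if sum(r for s, r in zip(signs, ranks) if s) >= w_pos)
    return w_pos, hits / 2 ** len(d)

# Test-HL rows of Table 4 (EM-based vs. HL-based on the five datasets).
em_row = [0.2610, 0.2620, 0.3101, 0.2104, 0.0106]
hl_row = [0.2180, 0.2381, 0.2923, 0.1963, 0.0121]
w_plus, p_value = wilcoxon_one_sided(em_row, hl_row)
assert p_value > 0.05  # too little evidence to reject the null, as reported
```

With the Table 4 rows, the one-sided p-value is 0.0625, above the customary 0.05 level, consistent with the reported failure to reject the null hypothesis at 𝑁 = 5.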
In this section, the results of training the different ML algorithms on the selected datasets are presented in Tables 6-14. The results are obtained by running a 5-fold cross-validation for each dataset. The numbers within parentheses are the relative ranks of the algorithms on a dataset with respect to a metric. The best average rank is shown in bold for each evaluation metric.

To study the effect of the prediction aggregation strategy (10), two sets of results are reported for the proposed method: the results using equation (8), denoted MLR𝑚𝑎𝑥, and the aggregated predictions after applying a bi-partitioning method, denoted MLR𝑎𝑔𝑔. In this study, the reported aggregated performances are the better of the results obtained after applying One Threshold and Rank Cut [21] to the combined predictions using (9) and (10). Note that MLR𝑚𝑎𝑥 scores all classes equally and, as a result, no class ranking is available to be reported in terms of a ranking-based measure.

To analyze the relative performance of the different algorithms, the Friedman test [9] is employed. The Friedman test is a non-parametric statistical test that compares multiple algorithms trained on multiple datasets based on their average ranks. According to Table 5, the null hypothesis is rejected for all evaluation metrics except the Recall and 𝐹𝑚𝑎𝑐𝑟𝑜 metrics, suggesting that the performance of the methods is significantly different for all other metrics. Consequently, a post-hoc test [9] is applied to investigate the relative performance among the algorithms. For this purpose, the Bonferroni-Dunn test [9] is employed with 𝑘 = 8, i.e. the number of algorithms compared, and 𝑁 = 5, i.e. the number of datasets, with a significance level of 0.05. Figure 1 shows the critical distance (CD) diagrams for each evaluation metric. The top line in each diagram is the axis along which the average rank of each ML classifier is plotted, from the lowest ranks (best performance) on the left to the highest ranks (worst performance) on the right. In each sub-figure, groups of algorithms that are not statistically different (their average ranks are within one CD of one another) are connected. The following observations are made based on the presented experiments:

• According to Tables 6-14, the MLR algorithm using the prediction aggregation strategy (9) has a better average rank than the maximum prediction strategy (8) in terms of all metrics, which confirms the effectiveness of aggregating the predictions.

• MLR𝑎𝑔𝑔 has the best average rank in terms of six out of the nine evaluation metrics and has an outstanding performance in terms of the Accuracy, Precision, and 𝐹 measures. In terms of the Recall metric, MLR𝑎𝑔𝑔 and RA𝑘EL both have the same best average rank.
Table 3
The average test HL ↓ of the model using EM and HL as evaluation measures on synthetic data. Ten datasets are generated per class size.

Dataset    2-class  3-class  5-class  8-class  10-class  15-class
EM-based   0.006    0.018    0.030    0.073    0.091     0.109
HL-based   0.009    0.021    0.035    0.095    0.133     0.139

Table 4
The average test HL ↓ of the model using EM and HL as evaluation measures on real-world data.

Dataset    Yeast    Emotions  Flags    CAL500   Genbase
EM-based   0.2610   0.2620    0.3101   0.2104   0.0106
HL-based   0.2180   0.2381    0.2923   0.1963   0.0121

Table 5
Summary of the Friedman rank test for 𝐹 (𝑘 = 8, 𝑁 = 5).

Evaluation metric   𝐹       Critical value (𝛼 = 0. )
Hamming Loss        19.40   14.06
Accuracy            22.73
𝐹
𝐹
𝐹

• When compared based on the DL value of the benchmark datasets, the MLR𝑎𝑔𝑔 algorithm has the best performance in terms of six measures on the CAL500 dataset, which has the highest possible DL value. This result shows that the proposed algorithm is capable of addressing the incompleteness challenge of the LP technique by predicting unseen labelsets more effectively.

• According to Figure 1, the proposed MLR𝑎𝑔𝑔 has significantly better performance than vanilla LP in terms of five metrics. It also offers a significant improvement over HOMER on the Accuracy and 𝐹 score measures, and over ML-𝑘NN on the
Accuracy measure.
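The critical distances in Figure 1 follow the standard procedure of Demšar [9]. A minimal sketch, assuming the usual Friedman statistic over average ranks and the tabulated two-tailed Bonferroni-Dunn value 𝑞 ≈ 2.690 for 𝑘 = 8 at the 0.05 level (taken from Demšar's tables, not derived here; function names are illustrative):

```python
from math import sqrt

def friedman_chi2(avg_ranks, n_datasets):
    """Friedman statistic computed from the algorithms' average ranks."""
    k = len(avg_ranks)
    return (12.0 * n_datasets / (k * (k + 1))) * (
        sum(r * r for r in avg_ranks) - k * (k + 1) ** 2 / 4.0)

def bonferroni_dunn_cd(k, n_datasets, q_alpha):
    """Critical distance: two average ranks further apart than CD differ significantly."""
    return q_alpha * sqrt(k * (k + 1) / (6.0 * n_datasets))

# Sanity check: identical average ranks give a zero Friedman statistic.
assert friedman_chi2([4.5] * 8, 5) == 0.0

# Setup of Table 5 / Figure 1: k = 8 algorithms over N = 5 datasets.
cd = bonferroni_dunn_cd(k=8, n_datasets=5, q_alpha=2.690)
assert abs(cd - 4.17) < 0.01  # algorithms within ~4.17 rank units are connected
```

For 𝑘 = 8 and 𝑁 = 5 this gives CD ≈ 4.17, so two algorithms are connected in a sub-figure of Figure 1 when their average ranks differ by less than roughly 4.17.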
5. Conclusion
In this paper, a multi-label classification algorithm is proposed that extends supervised learning classifier systems to exploit high-order label correlation and obtain a more accurate classification model. The proposed method builds classification rules by extending the LP technique and employs a prediction aggregation that works similarly to a 𝑘NN method with a dynamic 𝑘. Two strategies for evaluating the performance of the individual classifiers during training are considered and investigated by deriving approximate bounds on the expected classifier fitness in terms of the number of classes, the number of distinct labelsets, and the label density of the dataset.

The complexity analysis reveals that the cost of training MLR is linear in the number of instances, the number of features, the number of classes, and the number of rules in the population. Experiments on the synthetic and real-world datasets suggest that evaluating classifier performance using the Hamming loss measure is more effective in preventing over-fitting than the exact match measure. This result is due to the higher expected fitness of classifiers that are partially correct when evaluated using the HL criterion. The proposed method is compared with multiple well-known multi-label classification methods on multiple datasets and has the best average rank in terms of seven out of nine measures. Statistical tests on the results show that the MLR algorithm with aggregated predictions outperforms the other methods on most of the datasets and shows competitive performance on the others.
The lower performance of the model in terms of the macro-averaged 𝐹 score suggests that the model may exhibit poor prediction performance on datasets with imbalanced classes, where it is necessary to correctly predict the infrequently occurring class labels.

In future work, the impact of other mechanisms, such as the genetic operators and deletion, will be incorporated into the analysis presented for the performance of the individual classifiers to obtain a more complete analysis of the MLR algorithm. We will also investigate different techniques to improve the performance of the proposed method on datasets with imbalanced classes.
6. Acknowledgement
This work is supported by the Air Force Research Laboratory (AFRL) and the Office of the Secretary of Defense (OSD) under agreement number FA8750-15-2-0116.
References

[1] Almeida, A.M., Cerri, R., Paraiso, E.C., Mantovani, R.G., Junior, S.B., 2018. Applying multi-label techniques in emotion identification of short texts. Neurocomputing 320, 35–46.
[2] Bernadó-Mansilla, E., Garrell-Guiu, J.M., 2003. Accuracy-based learning classifier systems: models, analysis and applications to classification tasks. Evolutionary Computation 11, 209–238.
[3] Boutell, M.R., Luo, J., Shen, X., Brown, C.M., 2004. Learning multi-label scene classification. Pattern Recognition 37, 1757–1771.
[4] Cai, Z., Zhu, W., 2018. Feature selection for multi-label classification using neighborhood preservation. IEEE/CAA Journal of Automatica Sinica 5, 320–330.
[5] Cheng, W., Hüllermeier, E., Dembczynski, K.J., 2010. Bayes optimal multilabel classification via probabilistic classifier chains, in:
Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 279–286.
[6] Clare, A., King, R.D., 2001. Knowledge discovery in multi-label phenotype data, in: European Conference on Principles of Data Mining and Knowledge Discovery, Springer. pp. 42–53.
[7] Dam, H.H., Abbass, H.A., Lokan, C., Yao, X., 2007. Neural-based learning classifier systems. IEEE Transactions on Knowledge and Data Engineering 20, 26–39.
[8] Debie, E., Shafi, K., 2019. Implications of the curse of dimensionality for supervised learning classifier systems: theoretical and empirical analyses. Pattern Analysis and Applications 22, 519–536.
[9] Demšar, J., 2006. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research 7, 1–30.
[10] Diplaris, S., Tsoumakas, G., Mitkas, P.A., Vlahavas, I., 2005. Protein classification with multiple algorithms, in: Panhellenic Conference on Informatics, Springer. pp. 448–456.
[11] Elisseeff, A., Weston, J., 2002. A kernel method for multi-labelled classification, in: Advances in Neural Information Processing Systems, pp. 681–687.
[12] Fu, F.W., Wei, V.K., Yeung, R.W., 2001. On the minimum average distance of binary codes: linear programming approach. Discrete Applied Mathematics 111, 263–281.
[13] Fürnkranz, J., Hüllermeier, E., Mencía, E.L., Brinker, K., 2008. Multilabel classification via calibrated label ranking. Machine Learning 73, 133–153.
[14] Goncalves, E.C., Plastino, A., Freitas, A.A., 2013. A genetic algorithm for optimizing the label ordering in multi-label classifier chains, in: 2013 IEEE 25th International Conference on Tools with Artificial Intelligence, IEEE. pp. 469–476.
[15] Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H., 2009. The WEKA data mining software: an update. ACM SIGKDD Explorations Newsletter 11, 10–18.

Figure 1: Comparison of MLR with aggregated and max predictions against the other algorithms with the Bonferroni-Dunn test with 𝛼 = 0.05. Panels: (a) Hamming Loss, (b) Accuracy, (c) 𝐹, (d) Precision, (e) Micro-𝐹, (f) One Error, (g) Rank Loss.

Table 6
The performance of the ML algorithms in terms of Hamming Loss ↓.

Datasets    yeast      emotions   flags      CAL500     genbase    Ave. rank
RA𝑘EL       0.2413(5)  0.2580(7)  0.2819(7)  0.2566(7)  0.0139(6)  6.4
LP(𝐽)       0.2805(8)  0.2530(5)  0.2981(8)  0.2006(6)  0.0137(4)  6.2
ML-𝑘NN      0.1976(3)  0.2218(2)  0.2458(1)  0.1404(2)  0.0158(8)  3.2
EPS         0.2673(6)  0.2724(8)  0.2606(4)  0.2662(8)  0.0150(7)  6.6
HOMER(LP)   0.2791(7)  0.2533(6)  0.2804(6)  0.1966(5)  0.0135(3)  5.4
ECC         0.2062(4)  0.2007(1)  0.2561(3)  0.1424(3)  0.0138(5)  3.2
MLR𝑚𝑎𝑥
MLR𝑎𝑔𝑔

[16] Hanifelou, Z., Adibi, P., Monadjemi, S.A., Karshenas, H., 2018.
Knn-based multi-label twin support vector machine with priority of labels. Neurocomputing 322, 177–186.
[17] Holland, J.H., et al., 1992. Adaptation in natural and artificial sys-
tems: an introductory analysis with applications to biology, control, and artificial intelligence. MIT Press.
[18] Huang, J., Li, G., Huang, Q., Wu, X., 2015. Learning label specific features for multi-label classification, in: 2015 IEEE International Conference on Data Mining, IEEE. pp. 181–190.
[19] Huang, J., Li, G., Wang, S., Xue, Z., Huang, Q., 2017. Multi-label classification by exploiting local positive and negative pairwise label correlation. Neurocomputing 257, 164–174.
[20] Hüllermeier, E., Fürnkranz, J., Cheng, W., Brinker, K., 2008. Label ranking by learning pairwise preferences. Artificial Intelligence 172, 1897–1916.
[21] Ioannou, M., Sakkas, G., Tsoumakas, G., Vlahavas, I., 2010. Obtaining bipartitions from score vectors for multi-label classification, in: 2010 22nd International Conference on Tools with Artificial Intelligence, IEEE. pp. 409–416.
[22] Iqbal, M., Browne, W.N., Zhang, M., 2013. Evolving optimum populations with XCS classifier systems. Soft Computing 17, 503–518.
[23] Jiang, M., Pan, Z., Li, N., 2017. Multi-label text categorization using L21-norm minimization extreme learning machine. Neurocomputing 261, 4–10.
[24] Jing, X.Y., Wu, F., Li, Z., Hu, R., Zhang, D., 2016. Multi-label dictionary learning for image annotation. IEEE Transactions on Image Processing 25, 2712–2725.
[25] Kim, J.Y., Cho, S.B., 2019. Exploiting deep convolutional neural networks for a neural-based learning classifier system. Neurocomputing 354, 61–70.
[26] Madjarov, G., Kocev, D., Gjorgjevikj, D., Džeroski, S., 2012. An extensive experimental comparison of methods for multi-label learning. Pattern Recognition 45, 3084–3104.

Table 7
The performance of the ML algorithms in terms of Accuracy ↑.

Datasets    yeast      emotions   flags      CAL500     genbase    Ave. rank
RA𝑘EL       0.5008(4)  0.5280(3)  0.6114(3)  0.2505(2)  0.8319(2)  2.8
LP(𝐽)       0.4113(7)  0.4764(7)  0.5533(8)  0.1982(6)  0.8252(5)  6.6
ML-𝑘NN      0.4995(6)  0.5082(6)  0.6197(2)  0.1947(7)  0.6873(8)  5.8
EPS         0.5044(3)  0.5234(4)  0.6077(4)  0.1903(8)  0.8183(6)  5.0
HOMER(LP)   0.4063(8)  0.4662(8)  0.5627(7)  0.2050(5)  0.8279(4)  6.4
ECC         0.5001(5)  0.5190(5)  0.6058(5)  0.2118(4)  0.7632(7)  5.2
MLR𝑚𝑎𝑥
MLR𝑎𝑔𝑔

Table 8
The performance of the ML algorithms in terms of 𝐹 score ↑.

Datasets    yeast      emotions   flags      CAL500     genbase    Ave. rank
RA𝑘EL       0.6218(4)  0.6359(3)  0.7348(2)  0.3955(2)  0.8459(2)  2.6
LP(𝐽)       0.5140(8)  0.5587(8)  0.6625(8)  0.3216(6)  0.8511(1)  6.2
ML-𝑘NN      0.6070(6)  0.5906(6)  0.7337(3)  0.3204(7)  0.6977(8)  6.0
EPS         0.6284(3)  0.6368(2)  0.7169(4)  0.3014(8)  0.8362(4)  4.2
HOMER(LP)   0.5173(7)  0.5621(7)  0.6782(7)  0.3333(5)  0.8314(6)  6.4
ECC         0.6077(5)  0.5911(5)  0.7147(5)  0.3420(4)  0.7689(7)  5.2
MLR𝑚𝑎𝑥
MLR𝑎𝑔𝑔

Table 9
The performance of the ML algorithms in terms of Precision ↑.

Datasets    yeast      emotions   flags      CAL500     genbase    Ave. rank
RA𝑘EL       0.6030(5)  0.5763(7)  0.6497(8)  0.3089(7)  0.8387(2)  5.8
LP(𝐽)       0.5384(8)  0.5886(6)  0.6613(7)  0.3293(5)  0.8330(5)  6.2
ML-𝑘NN      0.7222(1)  0.6441(3)  0.7361(2)  0.5850(1)  0.7140(8)  3.0
EPS         0.5616(6)  0.5617(8)  0.6929(5)  0.2897(8)  0.8290(6)  6.6
HOMER(LP)   0.5484(7)  0.6075(5)  0.6748(6)  0.3455(4)  0.8355(3)  5.0
ECC         0.6888(3)  0.6461(2)  0.7081(3)  0.5609(2)  0.7711(7)  3.4
MLR𝑚𝑎𝑥
MLR𝑎𝑔𝑔

Table 10
The performance of the ML algorithms in terms of Recall ↑.

Datasets    yeast      emotions   flags      CAL500     genbase    Ave. rank
RA𝑘EL       0.6998(4)  0.5763(7)  0.8716(1)  0.5737(1)  0.8662(1)
LP(𝐽)       0.5409(8)  0.5886(6)  0.6664(8)  0.3279(5)  0.8275(6)  6.6
ML-𝑘NN      0.5694(6)  0.6441(3)  0.7668(2)  0.2259(7)  0.6919(8)  5.2
EPS         0.7802(1)  0.5617(8)  0.7537(3)  0.2238(8)  0.8608(2)  4.4
HOMER(LP)   0.5518(7)  0.6075(5)  0.7001(6)  0.3428(4)  0.8298(5)  5.4
ECC         0.5890(5)  0.6461(2)  0.7351(4)  0.2545(6)  0.7711(7)  4.8
MLR𝑚𝑎𝑥
MLR𝑎𝑔𝑔

[27] Nan, G., Li, Q., Dou, R., Liu, J., 2018. Local positive and nega-
tive correlation-based k-labelsets for multi-label classification. Neurocomputing 318, 90–101.
[28] Nazmi, S., Razeghi-Jahromi, M., Homaifar, A., 2017. Multilabel classification with weighted labels using learning classifier systems, in: 2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA), IEEE. pp. 275–280.
[29] Nazmi, S., Yan, X., Homaifar, A., 2018. Multi-label classification using genetic-based machine learning, in: 2018 IEEE International Conference on Systems, Man, and Cybernetics (SMC), IEEE. pp. 675–680.
[30] Orriols-Puig, A., Bernadó-Mansilla, E., 2006. Revisiting UCS: description, fitness sharing, and comparison with XCS, in: Learning Classifier Systems. Springer, pp. 96–116.
[31] Orriols-Puig, A., Casillas, J., Bernadó-Mansilla, E., 2008. Genetic-based machine learning systems are competitive for pattern recognition. Evolutionary Intelligence 1, 209–232.
[32] Read, J., 2008. A pruned problem transformation method for multi-label classification, in: Proc. 2008 New Zealand Computer Science Research Student Conference (NZCSRS 2008), p. 41.
[33] Read, J., Pfahringer, B., Holmes, G., Frank, E., 2011. Classifier chains for multi-label classification. Machine Learning 85, 333.
[34] Schietgat, L., Vens, C., Struyf, J., Blockeel, H., Kocev, D., Džeroski, S., 2010. Predicting gene function using hierarchical multi-label decision tree ensembles. BMC Bioinformatics 11, 2.
[35] Spolaôr, N., Cherman, E.A., Monard, M.C., Lee, H.D., 2013. ReliefF for multi-label feature selection, in: 2013 Brazilian Conference on Intelligent Systems, IEEE. pp. 6–11.

Table 11
The performance of the ML algorithms in terms of Micro-F ↑.

Datasets    yeast      emotions   flags      CAL500     genbase    Ave. rank
RA𝑘EL       0.6350(4)  0.6587(1)  0.7513(2)  0.3984(2)  0.8550(2)  2.2
LP(𝐽)       0.5387(8)  0.5924(7)  0.6904(8)  0.3253(6)  0.8511(5)  6.8
ML-𝑘NN      0.6070(6)  0.6270(6)  0.7488(3)  0.3185(7)  0.8034(8)  6.0
EPS         0.6367(3)  0.6516(2)  0.7417(4)  0.3012(8)  0.8435(6)  4.6
HOMER(LP)   0.5438(7)  0.5922(8)  0.7145(6)  0.3402(5)  0.8524(4)  6.0
ECC         0.6312(5)  0.6512(3)  0.7391(5)  0.3430(4)  0.8434(7)  4.8
MLR𝑚𝑎𝑥
MLR𝑎𝑔𝑔

Table 12
The performance of the ML algorithms in terms of Macro-F ↑.

Datasets    yeast      emotions   flags      CAL500     genbase    Ave. rank
RA𝑘EL       0.4276(3)  0.6499(1)  0.6922(1)  0.1915(1)  0.8478(1)
LP(𝐽)       0.3756(6)  0.5809(8)  0.6154(7)  0.1506(3)  0.8115(3)  5.4
ML-𝑘NN      0.3609(8)  0.5889(6)  0.6221(6)  0.1052(7)  0.6577(6)  6.6
EPS         0.4397(1)  0.6448(2)  0.6302(4)  0.0957(8)  0.8013(4)  3.8
HOMER(LP)   0.3841(5)  0.5822(7)  0.6375(5)  0.1702(2)  0.8146(2)  4.2
ECC         0.3736(7)  0.6219(4)  0.6327(3)  0.1301(5)  0.7985(5)  4.8
MLR𝑚𝑎𝑥
MLR𝑎𝑔𝑔

Table 13
The performance of the ML algorithms in terms of One-error ↓.

Datasets    yeast      emotions   flags      CAL500     genbase    Ave. rank
RA𝑘EL       0.2954(5)  0.3222(5)  0.2468(4)  0.2211(4)  0.1662(3)  4.2
LP(𝐽)       0.5337(7)  0.4234(6)  0.6233(7)  0.9832(6)  0.2328(7)  6.6
ML-𝑘NN      0.2371(2)  0.3104(3)  0.2522(5)  0.1195(2)  0.1768(6)  3.6
EPS         0.2474(3)  0.3171(4)  0.1854(1)  0.9868(7)  0.1708(4)  3.8
HOMER(LP)   0.5130(6)  0.4252(7)  0.3549(6)  0.8146(5)  0.1647(2)  5.2
ECC         0.2478(4)  0.2918(1)  0.2059(2)  0.1474(3)  0.1753(5)  3.0
MLR𝑚𝑎𝑥      -          -          -          -          -          -
MLR𝑎𝑔𝑔

Table 14
The performance of the ML algorithms in terms of Rank Loss ↓.

Datasets    yeast      emotions   flags      CAL500     genbase    Ave. rank
RA𝑘EL       0.2172(5)  0.1977(5)  0.2461(4)  0.2450(4)  0.0952(5)  4.6
LP(𝐽)       0.4102(7)  0.3331(7)  0.5405(7)  0.6578(6)  0.1924(7)  6.8
ML-𝑘NN      0.1708(2)  0.1869(3)  0.1982(1)  0.1853(2)  0.0165(2)
EPS         0.1998(4)  0.1890(4)  0.2265(3)  0.6802(7)  0.0556(4)  4.4
HOMER(LP)   0.3599(6)  0.3107(6)  0.3577(6)  0.4279(5)  0.1463(6)  5.8
ECC         0.1798(3)  0.1650(2)  0.2018(2)  0.1987(3)  0.0135(1)  2.2
MLR𝑚𝑎𝑥      -          -          -          -          -          -
MLR𝑎𝑔𝑔

[36] Tomás, J.T., Spolaôr, N., Cherman, E.A., Monard, M.C., 2014. A framework to generate synthetic multi-label datasets. Electron. Notes Theor. Comput. Sci. 302, 155–176. URL: http://dx.doi.org/10.1016/
j.entcs.2014.01.025, doi: .
[37] Trohidis, K., Tsoumakas, G., Kalliris, G., Vlahavas, I.P., 2008. Multi-label classification of music into emotions, in: ISMIR, pp. 325–330.
[38] Tsoumakas, G., Katakis, I., 2007. Multi-label classification: An overview. International Journal of Data Warehousing and Mining (IJDWM) 3, 1–13.
[39] Tsoumakas, G., Katakis, I., Vlahavas, I., 2008. Effective and efficient multilabel classification in domains with large number of labels, in: Proc. ECML/PKDD 2008 Workshop on Mining Multidimensional Data (MMD'08), pp. 53–59.
[40] Tsoumakas, G., Katakis, I., Vlahavas, I., 2009. Mining multi-label data, in: Data Mining and Knowledge Discovery Handbook. Springer, pp. 667–685.
[41] Tsoumakas, G., Katakis, I., Vlahavas, I., 2010. Random k-labelsets for multilabel classification. IEEE Transactions on Knowledge and Data Engineering 23, 1079–1089.
[42] Tsoumakas, G., Vlahavas, I., 2007. Random k-labelsets: An ensemble method for multilabel classification, in: European Conference on Machine Learning, Springer. pp. 406–417.
[43] Turnbull, D., Barrington, L., Torres, D., Lanckriet, G., 2008. Semantic annotation and retrieval of music and sound effects. IEEE Transactions on Audio, Speech, and Language Processing 16, 467–476.
[44] Urbanowicz, R. The educational learning classifier system (eLCS). URL: https://sourceforge.net/projects/educationallcs/.
[45] Urbanowicz, R.J., Moore, J.H., 2009. Learning classifier systems: a complete introduction, review, and roadmap. Journal of Artificial Evolution and Applications 2009, 1.
[46] Urbanowicz, R.J., Moore, J.H., 2015. ExSTraCS 2.0: description and evaluation of a scalable learning classifier system. Evolutionary Intelligence 8, 89–116.
[47] Vens, C., Struyf, J., Schietgat, L., Džeroski, S., Blockeel, H., 2008. Decision trees for hierarchical multi-label classification. Machine Learning 73, 185.
[48] Wang, J., Yang, Y., Mao, J., Huang, Z., Huang, C., Xu, W., 2016. CNN-RNN: A unified framework for multi-label image classification, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2285–2294.
[49] Zaragoza, J.C., Sucar, E., Morales, E., Bielza, C., Larranaga, P., 2011. Bayesian chain classifiers for multidimensional classification, in: Twenty-Second International Joint Conference on Artificial Intelligence.
[50] Zhang, M.L., 2009. ML-RBF: RBF neural networks for multi-label learning. Neural Processing Letters 29, 61–74.
[51] Zhang, M.L., Peña, J.M., Robles, V., 2009. Feature selection for multi-label naive Bayes classification. Information Sciences 179, 3218–3229.
[52] Zhang, M.L., Zhou, Z.H., 2006. Multilabel neural networks with applications to functional genomics and text categorization. IEEE Transactions on Knowledge and Data Engineering 18, 1338–1351.
[53] Zhang, M.L., Zhou, Z.H., 2007. ML-kNN: A lazy learning approach to multi-label learning. Pattern Recognition 40, 2038–2048.
[54] Zhang, M.L., Zhou, Z.H., 2013. A review on multi-label learning algorithms. IEEE Transactions on Knowledge and Data Engineering 26, 1819–1837.
[55] Zhang, Z.Z., 2001. A relation between the average Hamming distance and the average Hamming weight of binary codes. Journal of Statistical Planning and Inference 94, 413–419.
Shabnam Nazmi received her B.S. degree in Electrical Engineering from K.N. Toosi University of Technology and her M.S. degree in Electrical Engineering from Sharif University of Technology in 2009 and 2012, respectively. She is currently a Ph.D. candidate at the Department of Electrical and Computer Engineering, North Carolina A&T State University. Her research interests include multi-label classification and its application to the test and evaluation of autonomous vehicles, genetic-based machine learning, and learning from streaming data.

Xuyang Yan received his B.S. degree in Electrical Engineering from North Carolina Agricultural and Technical State University (NC A&T) and Henan Polytechnic University in 2016. In 2018, he earned his M.S. degree in electrical engineering at NC A&T. He is currently pursuing his Ph.D. degree in electrical engineering at NC A&T. His research interests include extracting knowledge from streaming data, analyzing the emergent behaviors of large-scale autonomous systems, and the application of machine learning techniques in robotics.

Abdollah Homaifar received his B.S. and M.S. degrees from the State University of New York at Stony Brook in 1979 and 1980, respectively, and his Ph.D. degree from the University of Alabama in 1987, all in Electrical Engineering. He is the NASA Langley Distinguished Professor and the Duke Energy Eminent Professor in the Department of Electrical and Computer Engineering at North Carolina A&T State University (NCA&TSU). He is the director of the Autonomous Control and Information Technology Institute and the Testing, Evaluation, and Control of Heterogeneous Large-scale Systems of Autonomous Vehicles (TECHLAV) Center at NCA&TSU. His research interests include machine learning, unmanned aerial vehicles (UAVs), testing and evaluation of autonomous vehicles, optimization, and signal processing. He also serves as an associate editor of the Journal of Intelligent Automation and Soft Computing and is a reviewer for IEEE Transactions on Fuzzy Systems; Systems, Man, and Cybernetics; and Neural Networks. He is a member of the IEEE Control Society, Sigma Xi, Tau Beta Pi, and Eta Kappa Nu.

Emily Doucette serves as the Multi-Domain Networked Weapons technical lead for the Air Force Research Laboratory Munitions Directorate. Prior to this post, she served the Munitions Directorate as the Assistant to the Chief Scientist (2017-2019) and as a research engineer for the Weapon Dynamics and Control Sciences Branch since 2012. She earned a Ph.D. in aerospace engineering from Auburn University and is a recipient of the SMART Scholarship. Her research interests include estimation theory, human-machine teaming, decentralized task assignment, cooperative autonomous engagement, and risk-aware target tracking and interdiction. Dr. Doucette leads a team of postdoctoral and graduate student researchers to support collaborative efforts across DoD, industry, academia, and international partnerships. She served on the AFRL Munitions Directorate Autonomy Steering Committee, is active in the Autonomy Community of Interest, and is the co-lead for the OSD Autonomy Center of Excellence.