Evolving Multi-label Classification Rules by Exploiting High-order Label Correlation
Shabnam Nazmi, Xuyang Yan, Abdollah Homaifar, Emily Doucette
Highlights
Evolving Multi-label Classification Rules by Exploiting High-order Label Correlation

• A multi-label classification model is proposed by extending the supervised learning classifier system (UCS) through the LP technique.
• The high-order label correlation is exploited to improve the predictive performance, and the incompleteness challenge is addressed.
• Approximate bounds are derived for the average classifier fitness in terms of the dataset properties.

Shabnam Nazmi (a), Xuyang Yan (a), Abdollah Homaifar (a, *), and Emily Doucette (b)

(a) North Carolina A&T State University
(b) The Air Force Research Laboratory - Munitions Directorate, 101 West Eglin Blvd., Eglin AFB, FL, USA

Preprint submitted to Elsevier.
ARTICLE INFO
Keywords: Multi-label classification, Label correlation, Label powerset, Learning classifier systems
ABSTRACT
In multi-label classification tasks, each problem instance is associated with multiple classes simultaneously. In such settings, the correlation between labels contains valuable information that can be used to obtain more accurate classification models. The correlation between labels can be exploited at different levels, such as capturing the pair-wise correlation or exploiting the higher-order correlations. Even though the high-order approach is more capable of modeling the correlation, it is computationally more demanding and has scalability issues. This paper aims at exploiting the high-order label correlation within subsets of labels using a supervised learning classifier system (UCS). For this purpose, the label powerset (LP) strategy is employed, and a prediction aggregation within the set of labels relevant to an unseen instance is utilized to increase the prediction capability of the LP method in the presence of unseen labelsets. The exact match ratio and Hamming loss measures are considered to evaluate rule performance, and the expected fitness value of a classifier is investigated for both metrics. In addition, a computational complexity analysis is provided for the proposed algorithm. The experimental results of the proposed method are compared with other well-known LP-based methods on multiple benchmark datasets and confirm the competitive performance of this method.
1. Introduction
In multi-label classification (MLC) tasks, each problem instance is associated with multiple classes at the same time. Emotion identification [1], image annotation [24], text categorization [23], semantic scene classification, and gene and protein function prediction [34] are examples of such problems. For instance, in text categorization, a document can be classified as History and Biography simultaneously.

Over the past decade, many multi-label classification algorithms have been proposed to solve the multi-label classification problem in various domains. These algorithms can be categorized into two major groups: problem transformation methods and algorithm adaptation methods [54, 40]. Problem transformation methods transform the multi-label problem into one or multiple single-label classification problems, e.g., the label powerset (LP) [40] and binary relevance (BR) [3] methods. Algorithm adaptation methods modify existing multi-class methods for multi-label problems, such as methods based on kNN [53, 16], decision trees [47, 6], neural networks [48, 50], and support vector machines [11].

In many real-world multi-label classification problems, a correlation exists between different classes. For instance, a document belonging to the class 'Biography' can also be considered to belong to the class 'History'. Incorporating this information into the classification model could help with obtaining a more accurate classifier. The label correlation can be taken into account through three different strategies, namely first-order, second-order, and high-order [54]. The

* Corresponding author. [email protected] (S. Nazmi); [email protected] (X. Yan); [email protected] (A. Homaifar); [email protected] (E. Doucette)
first-order strategy converts the multi-label problem into multiple single-label classification problems and ignores the correlation among labels [53, 3]. The second-order strategy considers the pair-wise correlation between labels [11, 13, 52], and the high-order strategy looks at the high-order correlation through a subset of labels [32, 33, 41].

Many algorithms have been proposed that take into account the second-order correlation, often by exploiting the pair-wise relationship between labels. One way to model pairwise correlation is to exploit the co-occurrence pattern between label pairs (e.g., CLR [13] and LLSF [18]), which only considers the positive correlation between labels. On the other hand, LPLC [19] and the approach proposed in [27] exploit local positive and negative pairwise correlations between labels to obtain an MLC model. The PRC algorithm [20] extends pairwise classification to obtain a ranking procedure based on binary preference relations. Methods developed to capture the high-order correlation are more capable of modeling correlation among labels, but are computationally more expensive and suffer from scalability issues [54].

High-order approaches mine the relationship between all classes or subsets of classes. Classifier chains (CC) [33] is a multi-label classification method that models such relationships by using the vector of class labels as additional sample attributes and transforms the multi-label classification problem into a chain of q binary classification problems. Extensions of the CC algorithm, such as probabilistic classifier chains (PCC) [5], add a probabilistic interpretation to CC. Also, Bayesian CC [49] describes the dependency structure of the class labels as a tree. LP is one of the methods that allows for exploiting the high-order label correlation by taking into account label subsets.
Random k-labelset (RAkEL) [41] exploits label correlation in a random way by transforming the problem into an ensemble of multi-class classification problems, where each component of the ensemble learns a random subset of the labels through a classifier induced by the LP technique. The ensemble of pruned sets (EPS) [32] follows the LP strategy but focuses only on the most important correlations in order to reduce complexity.

Although LP is a straightforward approach to transform multi-label problems into multi-class problems and incorporates label correlation into the learning problem, it is challenged in two significant ways [54]: (i) incompleteness, where LP is limited to predicting labelsets appearing in the training data; (ii) inefficiency: when the number of labels is large, there are too many possible label powersets to learn, and the training instances for some powersets are very few, which creates class imbalance. RAkEL tackles these challenges by combining ensemble learning with LP only on randomly chosen k-sized labelsets.

A learning classifier system (LCS) is a genetic-based machine learning system that combines discovery and learning components to train a rule-based model [17, 45]. The evolutionary component finds new rules, and the learning component assigns credit to the rules based on an estimate of their contribution. LCSs are applied in a variety of domains such as biology, computer science, medicine, and the social sciences. Three of the major structures developed for LCSs are: XCS [22], which operates under the reinforcement learning framework; UCS [2] and ExSTraCS [46], which are developed for supervised learning tasks; and N-LCS [7], which leverages neural networks. In [25], several convolutional neural network (CNN) structures are exploited to study the performance of N-LCSs with CNNs. In [31], UCS is successfully used to solve multiple real-world pattern recognition problems.
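As a concrete illustration of the LP transformation and its incompleteness issue, the following minimal Python sketch (ours, not the authors' implementation) maps each distinct labelset to one atomic class; any labelset absent from the training data can never be predicted by the resulting multi-class model.

```python
def lp_transform(labelsets):
    """Map each distinct labelset to an atomic class id (the LP transformation)."""
    classes = {}   # frozenset of labels -> LP class id
    y = []
    for ls in labelsets:
        key = frozenset(ls)
        if key not in classes:
            classes[key] = len(classes)
        y.append(classes[key])
    return y, classes

# Three training samples but only two distinct labelsets -> two LP classes;
# the labelset {"biography"} alone was never observed, so LP cannot predict it.
train = [{"history", "biography"}, {"history"}, {"history", "biography"}]
y, classes = lp_transform(train)
```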
Furthermore, strength-based learning classifiers are adapted to handle multi-label data with weighted labels in [28], and [29] investigated the UCS algorithm for its potential in solving multi-label classification problems.

In this paper, the high-order strategy to handle label correlation through the LP technique is considered, and the UCS algorithm is adapted to evolve a rule-based multi-label classification model. The prediction of each rule is a subset of labels induced from the training data. The genetic algorithm (GA) creates new rules by combining two of the existing rules through genetic operators. To reduce the computational complexity, the genetic search is limited to the classifier condition. To overcome the incompleteness of the LP technique on unseen samples, unlike methods that consider random subsets of labels, the proposed method generates new label powersets by exploiting the information learned within each problem niche collectively. This approach adopts a prediction scheme similar to a kNN method with a dynamic k that aggregates predictions from all the rules relevant (matching) to a given instance. Approximate bounds are derived for the expected value of a classifier's fitness using the average Hamming distance and average Hamming weight bounds. Moreover, a computational complexity analysis is performed for the proposed algorithm. The major contributions of this work are as follows:

• A new multi-label classification technique is developed that exploits high-order label correlation by adapting the traditional UCS algorithm to predict multi-labels through the LP technique. Inspired by the kNN method, a prediction aggregation is proposed to tackle the incompleteness of the LP technique.

• For evaluating the fitness of the classification rules, two strategies are considered. The average classifier fitness using each evaluation strategy is derived in terms of the multi-label data properties, and discussions provide insight on the derived bounds.
These strategies are also studied through experiments on synthetic and real-world datasets.

• Experiments on multiple benchmark datasets are conducted to compare the proposed method with other well-known multi-label classification methods, and statistical analyses are performed to analyze the results.

The rest of the paper is organized as follows: Section 2 introduces the notation and important metrics; Section 3 presents the proposed method and its theoretical analysis; and Section 4 describes the experimental setup and results. Finally, concluding remarks and future work are presented.
2. Multi-label classification problem
Let $\mathbb{X}$ denote an input space and let $\mathcal{L} = \{\lambda_1, \dots, \lambda_m\}$ be a finite set of class labels. Suppose every instance $\mathbf{x} \in \mathbb{X}$, where $\mathbf{x} \in \mathbb{R}^d$, is associated with a subset of labels $L \subset \mathcal{L}$, which is often called the set of relevant labels. The complement of $L$ is called the irrelevant set and is denoted by $\bar{L}$. Then $D = \{(\mathbf{x}_1, L_1), (\mathbf{x}_2, L_2), \dots, (\mathbf{x}_n, L_n)\}$ is a finite set of training instances that are assumed to be randomly drawn from an unknown distribution. The objective is to train a multi-label classifier $h : \mathbb{X} \rightarrow 2^{\mathcal{L}}$ that best approximates the training data and generalizes well to the samples in the test data. The function $f(\mathbf{x}, \lambda)$ calculates the score value for class $\lambda$.

To characterize the properties of a multi-label problem that influence the learning performance, various metrics have been proposed in the literature. Label cardinality (1) is the average number of labels per sample, and label density (2) is the cardinality divided by the number of classes [38].
$$Card(D) = \frac{1}{n} \sum_{i=1}^{n} |L_i|, \qquad (1)$$

$$Dens(D) = \frac{1}{n} \sum_{i=1}^{n} \frac{|L_i|}{m} = \frac{Card(D)}{m}. \qquad (2)$$

Moreover, in [51] and [41], the number of distinct labelsets (DL) is defined as the number of different label combinations in the dataset:

$$DL(D) = \left| \{ L \subset \mathcal{L} \mid \exists\, (\mathbf{x}, L) \in D \} \right|. \qquad (3)$$
In [51], the proportion of distinct labelsets (PDL) is defined as the number of distinct labelsets relative to the number of instances:
$$PDL(D) = \frac{DL(D)}{n}. \qquad (4)$$
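To make these statistics concrete, the short Python sketch below (function and variable names are ours) computes Card, Dens, DL, and PDL from a list of labelsets:

```python
def ml_stats(labelsets, m):
    """Dataset statistics of Eqs. (1)-(4); labelsets is a list of sets, m the number of classes."""
    n = len(labelsets)
    card = sum(len(L) for L in labelsets) / n        # Eq. (1): label cardinality
    dens = card / m                                  # Eq. (2): label density
    dl = len({frozenset(L) for L in labelsets})      # Eq. (3): distinct labelsets
    pdl = dl / n                                     # Eq. (4): proportion of distinct labelsets
    return card, dens, dl, pdl

card, dens, dl, pdl = ml_stats([{0, 1}, {1}, {0, 1}, {2}], m=3)
# card = 1.5, dens = 0.5, dl = 3, pdl = 0.75
```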
3. Proposed methodology
In this section, the structure of the proposed multi-label learning classifier system (abbreviated as MLR, for multi-label classification rules) is explained. A theoretical analysis is presented to show the relationship between the expected fitness of a classification rule in MLR and the dataset properties, using two fitness evaluation strategies. In addition, the computational complexity of the proposed algorithm is discussed.
The proposed algorithm consists of two main components: the rule structure and the multi-label prediction.
The following shows an example of a rule in the multi-label setting that matches $\mathbf{x}$:

$$R : \{\text{if } \mathbf{x} \in [\mathbf{c} - \mathbf{s}, \mathbf{c} + \mathbf{s}], \text{ then prediction} = \mathbf{y}\}, \qquad (5)$$

where $\mathbf{c}$ and $\mathbf{s}$ are the vectors of center and spread of a hyper-rectangle, respectively, encoding the classifier condition $\delta$, and $\mathbf{y}$ is a binary vector with a '1' representing labels relevant to $\mathbf{x}$ and a '0' representing irrelevant labels. Every time the covering mechanism creates a new rule, it assigns the correct labelset of the training instance to the created rule. The set of all matching rules comprises the match set ($[M]$). The experience ($exp$) of a rule is the number of times that it has matched $\mathbf{x}$, and its numerosity ($num$) is the number of copies of it in the population ($[P]$) created by the GA. It is assumed that the maximum number of rules allowed in $[P]$ is $N$. In the MLR algorithm, no genetic search is applied to the label space, and the offspring take exactly the same predictions as their parents. Moreover, the fitness ($F$) specifies the relative predictive performance of a rule and is used by the GA as a measure of contribution. The fitness of a rule at iteration $t$ can be calculated using different multi-label performance metrics:

$$F^{em}_t = \left( \frac{em_t}{exp_t} \right)^{\nu}, \qquad (6)$$

$$F^{hl}_t = \left( 1 - \frac{hl_t}{exp_t} \right)^{\nu}. \qquad (7)$$

In the above equations, $em_t$ and $hl_t$ are the accumulated exact match (EM) and Hamming loss (HL) measures of a rule's predictions at iteration $t$, respectively. These values can be calculated at each iteration as $\sum_{i=1}^{exp_t} M_i$ for a given multi-label metric $M$. Moreover, $\nu$ is a constant set by the user that determines the strength of the pressure toward accurate rules [30].

In the original UCS, the predicted class is the one predicted by the classifier with the highest fitness value. Let $\mathbf{y}_l$ be the prediction of rule $R_l$; then the predicted multi-label can be obtained as follows:

$$\mathbf{y}_{max} : \{ \mathbf{y}_{l^*} \mid F_{l^*} = \max_{l = 1, \dots, |[M]|} F_l \}, \qquad (8)$$

where $F_l$ is the fitness of rule $R_l$.
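The rule encoding of Eq. (5) and the two fitness criteria of Eqs. (6)-(7) can be sketched as follows. This is an illustrative Python fragment (class and attribute names are ours, not the authors' implementation), with accumulated EM and HL tallies as in the text:

```python
class Rule:
    """Hyper-rectangle rule: matches x when c - s <= x <= c + s (Eq. 5)."""

    def __init__(self, center, spread, labelset, m, nu=1.0):
        self.c, self.s = center, spread
        self.y = labelset          # binary prediction vector
        self.m = m                 # number of classes
        self.nu = nu               # accuracy-pressure exponent
        self.exp = 0               # experience: times the rule matched
        self.em_sum = 0.0          # accumulated exact matches
        self.hl_sum = 0.0          # accumulated Hamming losses

    def matches(self, x):
        return all(ci - si <= xi <= ci + si
                   for xi, ci, si in zip(x, self.c, self.s))

    def update(self, true_y):
        self.exp += 1
        self.em_sum += 1.0 if self.y == true_y else 0.0
        hd = sum(a != b for a, b in zip(self.y, true_y))
        self.hl_sum += hd / self.m

    def fitness_em(self):          # Eq. (6)
        return (self.em_sum / self.exp) ** self.nu

    def fitness_hl(self):          # Eq. (7)
        return (1.0 - self.hl_sum / self.exp) ** self.nu

r = Rule(center=[0.5, 0.5], spread=[0.2, 0.2], labelset=[1, 0, 1], m=3)
r.update([1, 0, 1])   # exact match
r.update([1, 0, 0])   # Hamming distance 1, so HL = 1/3
# fitness_em = 0.5, fitness_hl = 1 - (0 + 1/3)/2 = 5/6
```

Note how a partially correct prediction still contributes to the HL-based fitness but not to the EM-based one, which is the distinction analyzed later in the fitness bounds.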
We call the algorithm that implements this prediction strategy MLR$_{max}$. Nonetheless, to overcome the incompleteness of LP-based learning, the labelsets predicted by the locally relevant rules (the rules that appear in $[M]$) are aggregated, and each class label is assigned a score value. More specifically, the aggregation of predictions is composed of two steps: assembling all of the predicted classes into a unified prediction, and constructing a combined score for each class. To perform the former, a simple union of all the predicted labels in $[M]$ is considered. Let $\mathbf{y}_l$ be the prediction of rule $R_l$; then the predicted combined multi-label $\bar{\mathbf{y}}$ is:

$$\bar{\mathbf{y}} : \{ \bar{y}_i \mid \exists R_l : \bar{y}_i \in \mathbf{y}_l;\ i = 1, \dots, m;\ l = 1, \dots, |[M]| \}, \qquad (9)$$

where $|[M]|$ is the size of the current match set and works similarly to a dynamic $k$ in a kNN algorithm. In the second step, a score is calculated for each class from the rules in $[M]$ based on the fitness and density of the rules that cover the subspace containing $\mathbf{x}$. The score of the $i$-th class $\lambda_i$ can be calculated as follows:

$$s_i = \sum_{l = 1, \dots, |[M]|,\ y_{li} = 1} F_l \times num_l. \qquad (10)$$

A higher score indicates that $\lambda_i$ is advocated by rules with a larger number of copies, higher fitness, or both. We call the algorithm that implements this prediction strategy MLR$_{agg}$. By employing different bi-partitioning methods, a set of relevant labels can be obtained from the normalized scores.

Evaluating the fitness of a classifier using criteria (6) and (7) leads to the evolution of classifiers with different properties. In this section, the objective is to find a relation between the expected fitness ($\bar{F}$) of a classifier when evaluated with each criterion and the dataset properties. First, $\bar{F}_{em}$ is derived when EM is used.
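The two-step aggregation of Eqs. (9)-(10) can be sketched as below. This is our own minimal illustration (function names and the 0.5 threshold are arbitrary choices, not the paper's): each label's score sums fitness times numerosity over the matching rules that predict it, and the normalized scores are then bi-partitioned:

```python
def aggregate(match_set, m, threshold=0.5):
    """match_set: list of (y, fitness, numerosity) tuples; y is a 0/1 list of length m."""
    scores = [0.0] * m
    for y, fit, num in match_set:
        for i in range(m):
            if y[i] == 1:
                scores[i] += fit * num          # Eq. (10)
    top = max(scores) or 1.0                    # avoid division by zero
    norm = [s / top for s in scores]
    return [1 if s >= threshold else 0 for s in norm]

match_set = [([1, 1, 0], 0.9, 3),   # (prediction, fitness, numerosity)
             ([1, 0, 0], 0.8, 1),
             ([0, 0, 1], 0.2, 1)]
pred = aggregate(match_set, m=3)
# scores = [3.5, 2.7, 0.2] -> normalized ~[1.0, 0.77, 0.06] -> pred = [1, 1, 0]
```

Note that the combined labelset {label 0, label 1} here need not equal any single rule's prediction, which is how the aggregation mitigates LP incompleteness.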
Then, an upper bound is derived on $\bar{F}_{hl}$ in terms of the number of classes $m$ and the number of distinct labelsets $DL$ in Section 3.2.1. Furthermore, a lower bound is derived on $\bar{F}_{hl}$ in terms of the number of classes $m$ and the label density value in Section 3.2.2.

Consider a random classification rule $R$ with condition $\delta$ and prediction $\mathbf{y}$. Assuming an evenly distributed sample space, for a classifier with hyper-rectangle condition encoding, the probability that $R$ matches $\mathbf{x}$ is proportional to the volume that it covers [8]. Let $V_R$ and $V_D$ be the volume covered by $R$ and the volume of the input space of $D$, respectively. The probability of matching is as follows:

$$P_m(R) = \frac{V_R}{V_D} = \frac{\prod_{i=1}^{d} r_i}{\prod_{i=1}^{d} s_i}, \qquad (11)$$

where $r_i$ and $s_i$ are the spans of the classifier condition and of the input space of $D$, respectively. Assuming a normalized input space, the average volume covered by a classifier is $r^d$, where $r$ is a user-defined initialization parameter utilized by the covering mechanism. Furthermore, assume that the average generality of the population is $\tau$, which is a function of the generalization parameter $P_\#$ and the probability of mutation by the GA. Thus, we reach the following relation for the probability that $R$ is part of the match set:

$$P(R \in [M]) = r^d \tau. \qquad (12)$$

In MLR, the genetic algorithm selects two classifiers through a roulette wheel selection procedure proportional to their fitness. For a dataset $D$ with $n$ samples, the average number of samples that $R$ matches is $n_m = r^d \tau \times n$, and $L_m \subset L$ is the set containing the labelsets of these samples. Without loss of generality, assume that $\nu = 1$ in equations (6) and (7). Thus, the expected fitness update for a random classifier using the EM criterion is

$$\bar{F}_{em} = r^d \tau \times \frac{1}{DL}, \qquad (13)$$

which is a function of the number of distinct labelsets in $D$. Using the HL criterion, the average fitness of this classifier can be formulated as

$$\bar{F}_{hl} = r^d \tau \times \frac{\sum_{i=1}^{n_m} (1 - hl_i)}{n_m}. \qquad (14)$$

Substituting $hl$ with its average value based on the average Hamming distance ($ahd$), the following equation is obtained:

$$\bar{F}_{hl} = r^d \tau \times \left( 1 - \frac{ahd}{m} \right). \qquad (15)$$

In (15), $ahd$ is the average Hamming distance between the labelsets of the samples that $R$ matches. To find a relation between $\bar{F}_{hl}$ and the properties of $D$, we first provide a few definitions.

A binary code is a non-empty subset of the $m$-dimensional vector space over the binary field $F_2$ [12]. Assuming that each label combination in $D$ is observed only once, i.e.,
$DL = n$, the set of labelsets $L$ of a dataset forms a binary code with cardinality $DL$. Similarly, $L_m$ forms a binary code with cardinality $N_m$. The average Hamming distance for the binary code $L$ is defined as

$$ahd(L) = \frac{1}{DL^2} \sum_{L_i \in L} \sum_{L_j \in L} hd(L_i, L_j). \qquad (16)$$

Moreover, the Hamming weight ($hw$) is the number of non-zero elements in a binary string [55]. The average Hamming weight ($ahw$) of the binary code $L$ is defined as

$$ahw(L) = \frac{1}{DL} \sum_{L_i \in L} hw(L_i). \qquad (17)$$

To the best of our knowledge, there is no closed-form calculation for equations (16) and (17). Therefore, we proceed by employing upper-bound and lower-bound approximations for them to obtain bounds for $\bar{F}_{hl}$.

In the literature, lower- and upper-bound approximations for the value of $ahd$ have been proposed [55] that consider special cases for the cardinality of a binary code. One straightforward lower bound on $ahd(L)$ is as follows [12]:

$$\frac{m+1}{2} - \frac{2^{m-1}}{DL} \leq ahd(L). \qquad (18)$$

This inequality is meaningful only when $DL \geq 2^m / (m+1)$ [55]. Inequality (18) suggests that a larger number of distinct labelsets in a multi-label dataset increases the lower bound on the average Hamming distance of $L$. For a classifier that matches a subset of the instances, $L_m \subset L$ holds. This means that $N_m \leq DL$, and the $ahd$ for this classifier follows the inequality

$$ahd(L_m) \leq ahd(L). \qquad (19)$$

In (19), equality holds when $L_m = L$, i.e., $R$ matches all samples in $D$, which is the case when $R$ is a classifier with an over-general condition. In other words, the prediction made by a classifier with an over-general condition has the maximum average Hamming distance to the labelset of a training instance. Putting (15) and (18) together, the following upper bound exists for the expected fitness of such a classifier based on the number of distinct labelsets $DL$ and the number of classes $m$:

$$\bar{F}_{hl} \leq r^d \tau \left( 1 - \frac{m+1}{2m} + \frac{2^{m-1}}{m \times DL} \right). \qquad (20)$$

According to inequality (20), a larger number of distinct labelsets imposes a smaller upper bound on the expected value of the classifier fitness.

Based on the definitions, $\bar{F}_{hl} = ave(F)$ over $0 \leq hd \leq m$, while $\bar{F}_{em} = ave(F)$ only when $hd = 0$. This means that the classifier fitness $F^{hl}$ at a given time is considered non-zero even when its prediction is not an exact match of the true labelset, i.e., when its $hd > 0$. As a result of this more frequent positive fitness evaluation, $\bar{F}_{em} \leq \bar{F}_{hl}$ holds for a classifier. This means that in the MLR algorithm, when classifiers are evaluated with respect to the HL of their predicted labelset, they are expected to receive a larger fitness update on average. This relation implies that classifiers that are not very accurate but can partially predict the correct multi-label are respected as contributing classifiers when the evaluation criterion is HL. With a higher fitness value, these classifiers will have a better chance of receiving reproductive opportunities from the GA and remaining in the population of rules.
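The lower bound of (18) can be checked numerically. The sketch below (ours, assuming the double sum in Eq. (16) is averaged over all ordered pairs, i.e., normalized by $DL^2$) shows that for the full binary code of length $m$, where $DL = 2^m$, the bound is attained exactly:

```python
from itertools import product

def ahd(code):
    """Average Hamming distance over all ordered codeword pairs (Eq. 16)."""
    dl = len(code)
    total = sum(sum(a != b for a, b in zip(u, v)) for u in code for v in code)
    return total / dl ** 2

m = 4
full_code = list(product([0, 1], repeat=m))            # all 2^m codewords
lower_bound = (m + 1) / 2 - 2 ** (m - 1) / len(full_code)
# For the full code: ahd = m/2 = 2.0 and the bound is tight (also 2.0).
```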
An upper bound for $ahd(L)$ is proposed in [55] that relates it to the value of $ahw$ in (17), as follows:

$$ahd(L) \leq 2\, ahw(L) - \frac{2\, ahw(L)^2}{m}. \qquad (21)$$

Considering that, within the multi-label learning framework with $DL = n$, the $ahw$ of the set of binary label vectors corresponds to the value of the label cardinality $Card$, as defined in (1), inequality (21) offers the following upper bound on the $ahd$ of the labelsets of $D$:

$$ahd(L) \leq 2\, Card - \frac{2\, Card^2}{m}. \qquad (22)$$

Putting (15) and (22) together, and substituting the label density $Dens$ for $Card / m$, we obtain the following lower bound on the expected fitness of a random classifier based on the $Dens$ value of a dataset:

$$r^d \tau \left( 2\, Dens^2 - 2\, Dens + 1 \right) \leq \bar{F}_{hl}. \qquad (23)$$

According to (23), the lower bound of $\bar{F}_{hl}$ is a quadratic function of $Dens$ whose minimum occurs at $Dens = 1/2$. Furthermore, according to (22), $ahd$ attains its maximum value when $Card = m/2$. Therefore, employing the HL criterion causes the individual classifiers in the MLR algorithm to expect the smallest average fitness exactly when the average Hamming distance between the classifier prediction and the correct labelset is expected to be at its largest value.

In this section, the computational complexity of the proposed multi-label classification algorithm is analyzed in terms of its major components. Today's learning classifier systems, including UCS and ExSTraCS, consist of many interacting components with complex dependencies. In [46], a comprehensive list of these components along with their functional descriptions is provided. The analysis presented here studies the complexity of these components individually, without considering the complex interactions among them.

Table 1 presents the complexity of training the MLR algorithm in terms of its components, as well as its overall complexity for one training iteration. Here, it is assumed that the genetic algorithm employs tournament selection with size $t$ and uniform crossover. Given that the algorithm is trained for $it$ iterations, deletion from the population and the genetic algorithm repeat numerous times during training. According to Table 1, the computational complexity of training the MLR algorithm is of order $O(N \cdot n \cdot d \cdot m)$, which grows linearly in the number of rules $N$, the number of training instances $n$, the input dimension $d$, and the number of classes $m$.
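The weight-based bound of (21)-(22) can likewise be verified empirically. The following script (our own sanity check, again averaging ordered pairs as in Eq. (16)) draws random binary label codes and confirms that the average Hamming distance never exceeds $2\,ahw - 2\,ahw^2/m$:

```python
import random

def check_ahw_bound(m=6, dl=20, trials=50, seed=0):
    """Empirically verify ahd(L) <= 2*ahw(L) - 2*ahw(L)^2/m (Eq. 21) on random codes."""
    rng = random.Random(seed)
    for _ in range(trials):
        code = [tuple(rng.randint(0, 1) for _ in range(m)) for _ in range(dl)]
        ahw = sum(sum(v) for v in code) / len(code)
        ahd = sum(sum(a != b for a, b in zip(u, v))
                  for u in code for v in code) / len(code) ** 2
        if ahd > 2 * ahw - 2 * ahw ** 2 / m + 1e-12:
            return False   # bound violated
    return True
```

The bound holds because, per coordinate, the ordered-pair disagreement rate is $2 p_i (1 - p_i)$, and the Cauchy-Schwarz inequality relates $\sum_i p_i^2$ to $(\sum_i p_i)^2 / m$.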
4. Results and Discussion
In this section, the benchmark datasets and the classification algorithms used in the comparison experiments are described. Then, the multi-label evaluation measures
are explained, and the strategies employed for parameter instantiation are reported. Finally, the results are presented and discussed.

Table 1
Computational complexity of the major MLR components and the overall complexity of MLR.

Operation                            Big O (one training iteration)
Matching                             O(N·d)
Parameter update (exp, hl, ...)      O(|[M]|·m) ≤ O(N·m)
Fitness calculation                  O(|[M]|) ≤ O(N)
Deletion weight                      O(N)
Delete from [P]                      O(N)
GA (applied once)                    O(N·d·m·t)
Overall (one iteration)              O(N·n·d·m)
For the comparison, five real-world datasets are used: Yeast [11], Emotions [37], Flags [14], Computer Audition Lab 500 (CAL500) [43], and Genbase [10]. Table 2 shows the number of instances, features, and classes for each dataset.
The comparison of the proposed algorithm is performed using the implementations of the following algorithms in the MULAN library (http://mulan.sourceforge.net/) under the machine learning framework WEKA [15]. The compared methods are as follows:

• Label powerset methods: vanilla LP, RAkEL, the ensemble of pruned sets (EPS), and the hierarchy of multi-label classifiers (HOMER) with balanced k-means [39].
• Binary relevance methods: the ensemble of classifier chains (ECC).
• Algorithm adaptation methods: multi-label k-nearest neighbors (ML-kNN).

In the LP algorithm, the implementation of the decision tree algorithm in WEKA is employed as the base learner. The proposed method is implemented in Python using the UCS implementation [44] as the base algorithm.

In order to improve the performance and speed up the execution of the algorithm, the RF-ML feature selection strategy [35] is employed in this work. RF-ML is an extension of the well-known ReliefF feature selection strategy to multi-label data that takes into account the effect of interacting features in ML problems without the need to transform the ML problem into a multi-class problem. The result of applying feature selection is either a subset or a ranked list of the original features. In the latter case, only the top thirty percent of the features are used for model training [4].
Preprint submitted to Elsevier
Page 5 of 12volving Multi-label Classification Rules by Exploiting High-order Label Correlation
Table 2
ML datasets. In the Feat. column, n, c, and b refer to numeric, categorical, and binary attributes, respectively.

Dataset    Domain    Inst.   Feat.          Labels   Card.    Dens.   DL    PDL
Yeast      Biology   2,417   103(n)         14       4.237    0.303   198   0.082
Emotions   Music     593     72(n)          6        1.869    0.311   27    0.045
Flags      Images    194     9(c) + 10(n)   7        3.392    0.485   54    0.278
CAL500     Music     502     68(n)          174      26.044   0.150   502   1
Genbase    Biology   662     1,186(b)       27       1.245    0.046   32    0.048

All experiments are carried out on a 2.70 GHz Windows 10 machine with 16.0 GB of RAM.
In this section, the evaluation measures used in the experiments are explained [26].

∙ Hamming loss computes the fraction of labels whose relevance is predicted incorrectly:

$$HL(h) = \frac{1}{m} \left| h(\mathbf{x}) \,\Delta\, L \right|,$$

where $\Delta$ denotes the symmetric difference between the predicted labelset $h(\mathbf{x})$ and the relevant set $L$.

∙ Accuracy is the number of correctly predicted classes relative to the union of relevant and predicted labels:

$$Acc(h) = \frac{|h(\mathbf{x}) \cap L|}{|h(\mathbf{x}) \cup L|}.$$

∙ Precision is the number of correctly predicted classes relative to the set of predicted labels:

$$Pr(h) = \frac{|h(\mathbf{x}) \cap L|}{|h(\mathbf{x})|}.$$

∙ Recall is the number of correctly predicted classes relative to the set of relevant labels:

$$Rc(h) = \frac{|h(\mathbf{x}) \cap L|}{|L|}.$$

∙ The $F_1$ measure is the harmonic mean of the precision and recall of the predicted labels:

$$F_1(h) = \frac{2 \times Pr \times Rc}{Pr + Rc}.$$

For an evaluation measure $M$, a macro-measure is computed by evaluating the underlying measure once for each label and calculating the mean value. In contrast, a micro-measure aggregates the predictions of all labels and evaluates the measure at the end:

$$M_{micro} = M\left( \sum_{i=1}^{m} TP_i, \sum_{i=1}^{m} FP_i, \sum_{i=1}^{m} TN_i, \sum_{i=1}^{m} FN_i \right),$$

$$M_{macro} = \frac{1}{m} \sum_{i=1}^{m} M(TP_i, FP_i, TN_i, FN_i),$$

where $TP$, $FP$, $TN$, and $FN$ stand for true positives, false positives, true negatives, and false negatives, respectively. Based on these definitions, the micro and macro averages of the $F_1$ measure can be calculated as follows.

∙ $F_1^{micro}$ is the harmonic mean of the micro-precision and the micro-recall:

$$F_1^{micro} = \frac{2 \times Pr_{micro} \times Rc_{micro}}{Pr_{micro} + Rc_{micro}}.$$

∙ $F_1^{macro}$ is the harmonic mean of precision and recall computed per label and then averaged across all labels. If $p_j$ and $r_j$ are the precision and recall for $\lambda_j$, then:

$$F_1^{macro} = \frac{1}{m} \sum_{j=1}^{m} \frac{2\, p_j \times r_j}{p_j + r_j}.$$

∙ One error computes how often the top-ranked label is not relevant:

$$OE(f) = \begin{cases} 1 & \text{if } \arg\max_{\lambda \in \mathcal{L}} f(\mathbf{x}, \lambda) \notin L, \\ 0 & \text{otherwise.} \end{cases}$$

∙ Ranking loss computes the average fraction of label pairs that are not correctly ordered:

$$RL(f) = \frac{\left| \{ (\lambda, \acute{\lambda}) \mid f(\mathbf{x}, \lambda) \leq f(\mathbf{x}, \acute{\lambda}),\ (\lambda, \acute{\lambda}) \in L \times \bar{L} \} \right|}{|L| \times |\bar{L}|}.$$

The parameters of the methods used in the comparison are instantiated following recommendations from the literature.
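The example-based measures defined in this section can be sketched in a few lines of Python (ours; labelsets are represented as Python sets, and the edge-case conventions for empty sets are our choices):

```python
def hamming_loss(h, L, m):
    """Fraction of the m labels whose relevance is predicted incorrectly."""
    return len(h.symmetric_difference(L)) / m

def accuracy(h, L):
    """Correct predictions relative to the union of predicted and relevant labels."""
    union = h | L
    return len(h & L) / len(union) if union else 1.0

def precision(h, L):
    """Correct predictions relative to the predicted labelset h."""
    return len(h & L) / len(h) if h else 0.0

def recall(h, L):
    """Correct predictions relative to the relevant labelset L."""
    return len(h & L) / len(L) if L else 0.0

def f1(h, L):
    p, r = precision(h, L), recall(h, L)
    return 2 * p * r / (p + r) if p + r else 0.0

h, L = {0, 1}, {1, 2}
# With m = 4: hamming_loss = 0.5, accuracy = 1/3, precision = recall = f1 = 0.5
```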
In cases where a parameter is to be determined from a set of values, the value that corresponds to the maximum $F_1$ measure on each dataset is used in the experiments. All parameters and threshold values are determined through a train-test split on each dataset.

The number of models in RAkEL is set to min(2m, ·) for all datasets [42]. The size of the labelsets for RAkEL is set to m/2, as it provides a balance between computational complexity and performance [42, 33]. The number of neighbors in the ML-kNN method for each dataset is selected from the set (6, ...) with a step size of 2. The EPS algorithm requires setting multiple parameters: the strategy parameter s, denoted as A_b and B_b for strategies A and B, respectively; the parameter b, which is selected from the set {1, ...} for each strategy; the parameter p, which is selected by decreasing from 5; and, finally, the number of models, which is set to 10 [32]. The number of models in ECC is also set to 10, to be consistent with the other ensemble methods. HOMER requires the number of clusters to be determined, which is selected from (2, ...) [39].

In the proposed method, the maximum number of rules allowed in the model, N, is selected from (1000, ...) in steps of 1000; P_#, the probability of replacing an allele in the classifier condition with a hash (don't-care) symbol, is selected from [0.··, 0.··] with a step size of 0.05; and the threshold by which the genetic algorithm is applied is selected from (5, ...) with a step size of 10. Once the parameters are determined, the threshold values for all methods on all datasets are selected from [0.··, 0.··] with a step of 0.05.

This experimental study aims at addressing the following questions: (i) Which strategy is more effective for evaluating individual classification rules? (ii) What is the effect of employing prediction aggregation over the maximum-fitness criterion? (iii) How effective is the proposed algorithm in exploiting label correlation compared to the other methods?
The fitness of a single classifier can be evaluated using the EM or the HL measure, as shown in (6) and (7). In this section, the effect of employing each criterion on the overall model performance is investigated using synthetic and real-world data. For the synthetic data, the framework proposed in [36] is employed using the hyper-cube strategy. The experiment on each dataset is repeated ten times to reduce the variance of the results, and the model performance is reported in terms of the HL of the model on test data in Tables 3 and 4.

According to Table 3, the model performs better on synthetic datasets using 𝑒𝑚𝑡, while Table 4 shows that ℎ𝑙𝑡 provides smaller test HL values on real-world datasets. According to the discussion in Section 3.2, during training, 𝑒𝑚𝑡 causes the expected update in the fitness of the classifiers to be smaller compared to employing ℎ𝑙𝑡. This creates an evolutionary pressure towards classifiers with more specific conditions that cover only a few samples or a very small subspace but tend to be more accurate. Such pressure increases the chance of evolving rules that overfit the training data, especially on real-world problems, as observed in Table 4. On the other hand, ℎ𝑙𝑡 prevents over-fitting by preserving the classifiers that are not very accurate but predict partially correct labelsets on every iteration.

According to Table 4, employing ℎ𝑙𝑡 leads to models with better performance on four out of five datasets. A one-tailed Wilcoxon signed-ranks test with 𝛼 = 0. and 𝑁 = 5 is applied to the results. The test did not provide enough evidence for rejecting the null hypothesis, which means that, given the current evidence, the performance of the MLR algorithm using either evaluation method is not significantly different. However, in the following comparison experiments, ℎ𝑙𝑡, i.e. strategy (7), is employed to guarantee the condition for sufficient generalization and avoid over-fitting.
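Since equations (6) and (7) are not reproduced here, the sketch below assumes the EM-based criterion scores a rule by the exact-match indicator and the HL-based criterion by one minus the Hamming loss; it also re-runs the one-tailed Wilcoxon signed-ranks test on the Table 4 values with an exact null distribution. The function names are illustrative, not from the paper:

```python
from itertools import product

# Assumed forms of the two criteria: exact-match indicator vs. 1 - Hamming loss.
def em_fitness(pred, true):
    """Exact match: full credit only when the whole labelset is correct."""
    return 1.0 if tuple(pred) == tuple(true) else 0.0

def hl_fitness(pred, true):
    """One minus Hamming loss: partial credit for partially correct labelsets."""
    return 1.0 - sum(p != t for p, t in zip(pred, true)) / len(true)

# A partially correct rule gets zero under EM but substantial credit under HL,
# which is the generalization pressure discussed above.
assert em_fitness((1, 0, 1, 0), (1, 0, 1, 1)) == 0.0
assert hl_fitness((1, 0, 1, 0), (1, 0, 1, 1)) == 0.75

def wilcoxon_one_sided(x, y):
    """Exact one-tailed Wilcoxon signed-ranks test (H1: x tends to exceed y)."""
    d = [a - b for a, b in zip(x, y) if a != b]
    order = sorted(range(len(d)), key=lambda i: abs(d[i]))
    ranks = [0.0] * len(d)
    for r, i in enumerate(order):
        ranks[i] = r + 1.0            # no tied |differences| in this data
    w_pos = sum(r for r, di in zip(ranks, d) if di > 0)
    # Enumerate all 2^n sign assignments for the exact null distribution.
    hits = sum(1 for signs in product((0, 1), repeat=len(d))
               if sum(r for s, r in zip(signs, ranks) if s) >= w_pos)
    return w_pos, hits / 2 ** len(d)

# Test-HL rows of Table 4 (EM-based vs. HL-based on the five datasets).
em_row = [0.2610, 0.2620, 0.3101, 0.2104, 0.0106]
hl_row = [0.2180, 0.2381, 0.2923, 0.1963, 0.0121]
w_plus, p_value = wilcoxon_one_sided(em_row, hl_row)
assert p_value > 0.05  # too little evidence to reject the null, as reported
```

With the Table 4 rows, the one-sided p-value is 0.0625, above the customary 0.05 level, consistent with the reported failure to reject the null hypothesis at 𝑁 = 5.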
In this section, the results of training the different ML algorithms on the selected datasets are presented in Tables 6-14. The results are obtained by running a 5-fold cross-validation for each dataset. The numbers within parentheses are the relative ranks of the algorithms on a dataset with respect to a metric. The best average rank is shown in bold for each evaluation metric.

To study the effect of the prediction aggregation strategy (10), two sets of results are reported for the proposed method: the results using equation (8), denoted MLR𝑚𝑎𝑥, and the aggregated predictions after applying a bi-partitioning method, denoted MLR𝑎𝑔𝑔. In this study, the reported aggregated performances are the better of the results obtained after applying One Threshold and Rank Cut [21] to the combined predictions using (9) and (10). Note that MLR𝑚𝑎𝑥 scores all classes equally and, as a result, no class ranking is available to be reported in terms of a ranking-based measure.

To analyze the relative performance of the different algorithms, the Friedman test [9] is employed. The Friedman test is a non-parametric statistical test that compares multiple algorithms trained on multiple datasets based on their average ranks. According to Table 5, the null hypothesis is rejected for all evaluation metrics except the Recall and 𝐹𝑚𝑎𝑐𝑟𝑜 metrics, suggesting that the performance of the methods is significantly different for all other metrics. Consequently, a post-hoc test [9] is applied to investigate the relative performance among the algorithms. For this purpose, the Bonferroni-Dunn test [9] is employed with 𝑘 = 8, i.e. the number of algorithms compared, and 𝑁 = 5, i.e. the number of datasets, with a significance level of 0.05. Figure 1 shows the critical distance (CD) diagrams for each evaluation metric. The top line in each diagram is the axis along which the average rank of each ML classifier is plotted, from the lowest ranks (best performance) on the left to the highest ranks (worst performance) on the right. In each sub-figure, groups of algorithms that are not statistically different (their average ranks are within one CD of one another) are connected. The following observations are made based on the presented experiments:

• According to Tables 6-14, the MLR algorithm using the prediction aggregation strategy (9) has a better average rank than the maximum prediction strategy (8) in terms of all metrics, which confirms the effectiveness of aggregating the predictions.

• MLR𝑎𝑔𝑔 has the best average rank in terms of six out of the nine evaluation metrics and has an outstanding performance in terms of the Accuracy, Precision, and 𝐹 measures. In terms of the Recall metric, MLR𝑎𝑔𝑔 and RA𝑘EL both have the same best average rank.
Table 3
The average test HL ↓ of the model using EM and HL as evaluation measures on synthetic data. Ten datasets are generated per class size.

Dataset    2-class  3-class  5-class  8-class  10-class  15-class
EM-based   0.006    0.018    0.030    0.073    0.091     0.109
HL-based   0.009    0.021    0.035    0.095    0.133     0.139

Table 4
The average test HL ↓ of the model using EM and HL as evaluation measures on real-world data.

Dataset    Yeast    Emotions  Flags    CAL500   Genbase
EM-based   0.2610   0.2620    0.3101   0.2104   0.0106
HL-based   0.2180   0.2381    0.2923   0.1963   0.0121

Table 5
Summary of the Friedman rank test for 𝐹 (𝑘 = 8, 𝑁 = 5).

Evaluation metric   𝐹       Critical value (𝛼 = 0. )
Hamming Loss        19.40   14.06
Accuracy            22.73
𝐹
𝐹
𝐹

• When compared based on the DL value of the benchmark datasets, the MLR𝑎𝑔𝑔 algorithm has the best performance in terms of six measures on the CAL500 dataset, which has the highest possible DL value. This result shows that the proposed algorithm is capable of addressing the incompleteness challenge of the LP technique by predicting unseen labelsets more effectively.

• According to Figure 1, the proposed MLR𝑎𝑔𝑔 has significantly better performance than vanilla LP in terms of five metrics. It also offers a significant improvement over HOMER on the Accuracy and 𝐹 score measures, and over ML-𝑘NN on the
Accuracy measure.
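The critical distances in Figure 1 follow the standard procedure of Demšar [9]. A minimal sketch, assuming the usual Friedman statistic over average ranks and the tabulated two-tailed Bonferroni-Dunn value 𝑞 ≈ 2.690 for 𝑘 = 8 at the 0.05 level (taken from Demšar's tables, not derived here; function names are illustrative):

```python
from math import sqrt

def friedman_chi2(avg_ranks, n_datasets):
    """Friedman statistic computed from the algorithms' average ranks."""
    k = len(avg_ranks)
    return (12.0 * n_datasets / (k * (k + 1))) * (
        sum(r * r for r in avg_ranks) - k * (k + 1) ** 2 / 4.0)

def bonferroni_dunn_cd(k, n_datasets, q_alpha):
    """Critical distance: two average ranks further apart than CD differ significantly."""
    return q_alpha * sqrt(k * (k + 1) / (6.0 * n_datasets))

# Sanity check: identical average ranks give a zero Friedman statistic.
assert friedman_chi2([4.5] * 8, 5) == 0.0

# Setup of Table 5 / Figure 1: k = 8 algorithms over N = 5 datasets.
cd = bonferroni_dunn_cd(k=8, n_datasets=5, q_alpha=2.690)
assert abs(cd - 4.17) < 0.01  # algorithms within ~4.17 rank units are connected
```

For 𝑘 = 8 and 𝑁 = 5 this gives CD ≈ 4.17, so two algorithms are connected in a sub-figure of Figure 1 when their average ranks differ by less than roughly 4.17.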
5. Conclusion
In this paper, a multi-label classification algorithm is proposed that extends supervised learning classifier systems to exploit high-order label correlation and obtain a more accurate classification model. The proposed method builds classification rules by extending the LP technique and employs a prediction aggregation that works similarly to a 𝑘NN method with a dynamic 𝑘. Two strategies for evaluating the performance of the individual classifiers during training are considered and investigated by deriving approximate bounds on the expected classifier fitness in terms of the number of classes, the number of distinct labelsets, and the label density of the dataset.

The complexity analysis reveals that the cost of training MLR is linear in the number of instances, the number of features, the number of classes, and the number of rules in the population. Experiments on the synthetic and real-world datasets suggest that evaluating classifier performance using the Hamming loss measure is more effective in preventing over-fitting than the exact match measure. This result is due to the higher expected fitness of classifiers that are partially correct when evaluated using the HL criterion. The proposed method is compared with multiple well-known multi-label classification methods on multiple datasets and has the best average rank in terms of seven out of nine measures. Statistical tests on the results show that the MLR algorithm with aggregated predictions outperforms the other methods on most of the datasets and shows competitive performance on the others.
The lower performance of the model in terms of the macro-averaged 𝐹 score suggests that the model may exhibit poor prediction performance on datasets with imbalanced classes, where it is necessary to correctly predict the infrequently occurring class labels.

In future work, the impact of other mechanisms, such as the genetic operators and deletion, will be incorporated into the analysis presented for the performance of the individual classifiers to obtain a more complete analysis of the MLR algorithm. We will also investigate different techniques to improve the performance of the proposed method on datasets with imbalanced classes.
6. Acknowledgement
This work is supported by the Air Force Research Laboratory (AFRL) and the Office of the Secretary of Defense (OSD) under agreement number FA8750-15-2-0116.
References

[1] Almeida, A.M., Cerri, R., Paraiso, E.C., Mantovani, R.G., Junior, S.B., 2018. Applying multi-label techniques in emotion identification of short texts. Neurocomputing 320, 35–46.
[2] Bernadó-Mansilla, E., Garrell-Guiu, J.M., 2003. Accuracy-based learning classifier systems: models, analysis and applications to classification tasks. Evolutionary Computation 11, 209–238.
[3] Boutell, M.R., Luo, J., Shen, X., Brown, C.M., 2004. Learning multi-label scene classification. Pattern Recognition 37, 1757–1771.
[4] Cai, Z., Zhu, W., 2018. Feature selection for multi-label classification using neighborhood preservation. IEEE/CAA Journal of Automatica Sinica 5, 320–330.
[5] Cheng, W., Hüllermeier, E., Dembczynski, K.J., 2010. Bayes optimal multilabel classification via probabilistic classifier chains, in:
Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 279–286.
[6] Clare, A., King, R.D., 2001. Knowledge discovery in multi-label phenotype data, in: European Conference on Principles of Data Mining and Knowledge Discovery, Springer. pp. 42–53.
[7] Dam, H.H., Abbass, H.A., Lokan, C., Yao, X., 2007. Neural-based learning classifier systems. IEEE Transactions on Knowledge and Data Engineering 20, 26–39.
[8] Debie, E., Shafi, K., 2019. Implications of the curse of dimensionality for supervised learning classifier systems: theoretical and empirical analyses. Pattern Analysis and Applications 22, 519–536.
[9] Demšar, J., 2006. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research 7, 1–30.
[10] Diplaris, S., Tsoumakas, G., Mitkas, P.A., Vlahavas, I., 2005. Protein classification with multiple algorithms, in: Panhellenic Conference on Informatics, Springer. pp. 448–456.
[11] Elisseeff, A., Weston, J., 2002. A kernel method for multi-labelled classification, in: Advances in Neural Information Processing Systems, pp. 681–687.
[12] Fu, F.W., Wei, V.K., Yeung, R.W., 2001. On the minimum average distance of binary codes: linear programming approach. Discrete Applied Mathematics 111, 263–281.
[13] Fürnkranz, J., Hüllermeier, E., Mencía, E.L., Brinker, K., 2008. Multilabel classification via calibrated label ranking. Machine Learning 73, 133–153.
[14] Goncalves, E.C., Plastino, A., Freitas, A.A., 2013. A genetic algorithm for optimizing the label ordering in multi-label classifier chains, in: 2013 IEEE 25th International Conference on Tools with Artificial Intelligence, IEEE. pp. 469–476.
[15] Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H., 2009. The WEKA data mining software: an update. ACM SIGKDD Explorations Newsletter 11, 10–18.

Figure 1: Comparison of MLR with aggregated and max predictions against the other algorithms with the Bonferroni-Dunn test with 𝛼 = 0.05. Panels: (a) Hamming Loss, (b) Accuracy, (c) 𝐹, (d) Precision, (e) Micro-𝐹, (f) One Error, (g) Rank Loss.

Table 6
The performance of the ML algorithms in terms of Hamming Loss ↓.

Datasets    yeast      emotions   flags      CAL500     genbase    Ave. rank
RA𝑘EL       0.2413(5)  0.2580(7)  0.2819(7)  0.2566(7)  0.0139(6)  6.4
LP(𝐽)       0.2805(8)  0.2530(5)  0.2981(8)  0.2006(6)  0.0137(4)  6.2
ML-𝑘NN      0.1976(3)  0.2218(2)  0.2458(1)  0.1404(2)  0.0158(8)  3.2
EPS         0.2673(6)  0.2724(8)  0.2606(4)  0.2662(8)  0.0150(7)  6.6
HOMER(LP)   0.2791(7)  0.2533(6)  0.2804(6)  0.1966(5)  0.0135(3)  5.4
ECC         0.2062(4)  0.2007(1)  0.2561(3)  0.1424(3)  0.0138(5)  3.2
MLR𝑚𝑎𝑥
MLR𝑎𝑔𝑔

[16] Hanifelou, Z., Adibi, P., Monadjemi, S.A., Karshenas, H., 2018.
Knn-based multi-label twin support vector machine with priority of labels. Neurocomputing 322, 177–186.
[17] Holland, J.H., et al., 1992. Adaptation in natural and artificial sys-
tems: an introductory analysis with applications to biology, control, and artificial intelligence. MIT Press.
[18] Huang, J., Li, G., Huang, Q., Wu, X., 2015. Learning label specific features for multi-label classification, in: 2015 IEEE International Conference on Data Mining, IEEE. pp. 181–190.
[19] Huang, J., Li, G., Wang, S., Xue, Z., Huang, Q., 2017. Multi-label classification by exploiting local positive and negative pairwise label correlation. Neurocomputing 257, 164–174.
[20] Hüllermeier, E., Fürnkranz, J., Cheng, W., Brinker, K., 2008. Label ranking by learning pairwise preferences. Artificial Intelligence 172, 1897–1916.
[21] Ioannou, M., Sakkas, G., Tsoumakas, G., Vlahavas, I., 2010. Obtaining bipartitions from score vectors for multi-label classification, in: 2010 22nd International Conference on Tools with Artificial Intelligence, IEEE. pp. 409–416.
[22] Iqbal, M., Browne, W.N., Zhang, M., 2013. Evolving optimum populations with XCS classifier systems. Soft Computing 17, 503–518.
[23] Jiang, M., Pan, Z., Li, N., 2017. Multi-label text categorization using L21-norm minimization extreme learning machine. Neurocomputing 261, 4–10.
[24] Jing, X.Y., Wu, F., Li, Z., Hu, R., Zhang, D., 2016. Multi-label dictionary learning for image annotation. IEEE Transactions on Image Processing 25, 2712–2725.
[25] Kim, J.Y., Cho, S.B., 2019. Exploiting deep convolutional neural networks for a neural-based learning classifier system. Neurocomputing 354, 61–70.
[26] Madjarov, G., Kocev, D., Gjorgjevikj, D., Džeroski, S., 2012. An extensive experimental comparison of methods for multi-label learning. Pattern Recognition 45, 3084–3104.

Table 7
The performance of the ML algorithms in terms of Accuracy ↑.

Datasets    yeast      emotions   flags      CAL500     genbase    Ave. rank
RA𝑘EL       0.5008(4)  0.5280(3)  0.6114(3)  0.2505(2)  0.8319(2)  2.8
LP(𝐽)       0.4113(7)  0.4764(7)  0.5533(8)  0.1982(6)  0.8252(5)  6.6
ML-𝑘NN      0.4995(6)  0.5082(6)  0.6197(2)  0.1947(7)  0.6873(8)  5.8
EPS         0.5044(3)  0.5234(4)  0.6077(4)  0.1903(8)  0.8183(6)  5.0
HOMER(LP)   0.4063(8)  0.4662(8)  0.5627(7)  0.2050(5)  0.8279(4)  6.4
ECC         0.5001(5)  0.5190(5)  0.6058(5)  0.2118(4)  0.7632(7)  5.2
MLR𝑚𝑎𝑥
MLR𝑎𝑔𝑔

Table 8
The performance of the ML algorithms in terms of 𝐹 score ↑.

Datasets    yeast      emotions   flags      CAL500     genbase    Ave. rank
RA𝑘EL       0.6218(4)  0.6359(3)  0.7348(2)  0.3955(2)  0.8459(2)  2.6
LP(𝐽)       0.5140(8)  0.5587(8)  0.6625(8)  0.3216(6)  0.8511(1)  6.2
ML-𝑘NN      0.6070(6)  0.5906(6)  0.7337(3)  0.3204(7)  0.6977(8)  6.0
EPS         0.6284(3)  0.6368(2)  0.7169(4)  0.3014(8)  0.8362(4)  4.2
HOMER(LP)   0.5173(7)  0.5621(7)  0.6782(7)  0.3333(5)  0.8314(6)  6.4
ECC         0.6077(5)  0.5911(5)  0.7147(5)  0.3420(4)  0.7689(7)  5.2
MLR𝑚𝑎𝑥
MLR𝑎𝑔𝑔

Table 9
The performance of the ML algorithms in terms of Precision ↑.

Datasets    yeast      emotions   flags      CAL500     genbase    Ave. rank
RA𝑘EL       0.6030(5)  0.5763(7)  0.6497(8)  0.3089(7)  0.8387(2)  5.8
LP(𝐽)       0.5384(8)  0.5886(6)  0.6613(7)  0.3293(5)  0.8330(5)  6.2
ML-𝑘NN      0.7222(1)  0.6441(3)  0.7361(2)  0.5850(1)  0.7140(8)  3.0
EPS         0.5616(6)  0.5617(8)  0.6929(5)  0.2897(8)  0.8290(6)  6.6
HOMER(LP)   0.5484(7)  0.6075(5)  0.6748(6)  0.3455(4)  0.8355(3)  5.0
ECC         0.6888(3)  0.6461(2)  0.7081(3)  0.5609(2)  0.7711(7)  3.4
MLR𝑚𝑎𝑥
MLR𝑎𝑔𝑔

Table 10
The performance of the ML algorithms in terms of Recall ↑.

Datasets    yeast      emotions   flags      CAL500     genbase    Ave. rank
RA𝑘EL       0.6998(4)  0.5763(7)  0.8716(1)  0.5737(1)  0.8662(1)
LP(𝐽)       0.5409(8)  0.5886(6)  0.6664(8)  0.3279(5)  0.8275(6)  6.6
ML-𝑘NN      0.5694(6)  0.6441(3)  0.7668(2)  0.2259(7)  0.6919(8)  5.2
EPS         0.7802(1)  0.5617(8)  0.7537(3)  0.2238(8)  0.8608(2)  4.4
HOMER(LP)   0.5518(7)  0.6075(5)  0.7001(6)  0.3428(4)  0.8298(5)  5.4
ECC         0.5890(5)  0.6461(2)  0.7351(4)  0.2545(6)  0.7711(7)  4.8
MLR𝑚𝑎𝑥
MLR𝑎𝑔𝑔

[27] Nan, G., Li, Q., Dou, R., Liu, J., 2018. Local positive and nega-
tive correlation-based k-labelsets for multi-label classification. Neurocomputing 318, 90–101.
[28] Nazmi, S., Razeghi-Jahromi, M., Homaifar, A., 2017. Multilabel classification with weighted labels using learning classifier systems, in: 2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA), IEEE. pp. 275–280.
[29] Nazmi, S., Yan, X., Homaifar, A., 2018. Multi-label classification using genetic-based machine learning, in: 2018 IEEE International Conference on Systems, Man, and Cybernetics (SMC), IEEE. pp. 675–680.
[30] Orriols-Puig, A., Bernadó-Mansilla, E., 2006. Revisiting UCS: description, fitness sharing, and comparison with XCS, in: Learning Classifier Systems. Springer, pp. 96–116.
[31] Orriols-Puig, A., Casillas, J., Bernadó-Mansilla, E., 2008. Genetic-based machine learning systems are competitive for pattern recognition. Evolutionary Intelligence 1, 209–232.
[32] Read, J., 2008. A pruned problem transformation method for multi-label classification, in: Proc. 2008 New Zealand Computer Science Research Student Conference (NZCSRS 2008), p. 41.
[33] Read, J., Pfahringer, B., Holmes, G., Frank, E., 2011. Classifier chains for multi-label classification. Machine Learning 85, 333.
[34] Schietgat, L., Vens, C., Struyf, J., Blockeel, H., Kocev, D., Džeroski, S., 2010. Predicting gene function using hierarchical multi-label decision tree ensembles. BMC Bioinformatics 11, 2.
[35] Spolaôr, N., Cherman, E.A., Monard, M.C., Lee, H.D., 2013. ReliefF for multi-label feature selection, in: 2013 Brazilian Conference on Intelligent Systems, IEEE. pp. 6–11.

Table 11
The performance of the ML algorithms in terms of Micro-F ↑.

Datasets    yeast      emotions   flags      CAL500     genbase    Ave. rank
RA𝑘EL       0.6350(4)  0.6587(1)  0.7513(2)  0.3984(2)  0.8550(2)  2.2
LP(𝐽)       0.5387(8)  0.5924(7)  0.6904(8)  0.3253(6)  0.8511(5)  6.8
ML-𝑘NN      0.6070(6)  0.6270(6)  0.7488(3)  0.3185(7)  0.8034(8)  6.0
EPS         0.6367(3)  0.6516(2)  0.7417(4)  0.3012(8)  0.8435(6)  4.6
HOMER(LP)   0.5438(7)  0.5922(8)  0.7145(6)  0.3402(5)  0.8524(4)  6.0
ECC         0.6312(5)  0.6512(3)  0.7391(5)  0.3430(4)  0.8434(7)  4.8
MLR𝑚𝑎𝑥
MLR𝑎𝑔𝑔

Table 12
The performance of the ML algorithms in terms of Macro-F ↑.

Datasets    yeast      emotions   flags      CAL500     genbase    Ave. rank
RA𝑘EL       0.4276(3)  0.6499(1)  0.6922(1)  0.1915(1)  0.8478(1)
LP(𝐽)       0.3756(6)  0.5809(8)  0.6154(7)  0.1506(3)  0.8115(3)  5.4
ML-𝑘NN      0.3609(8)  0.5889(6)  0.6221(6)  0.1052(7)  0.6577(6)  6.6
EPS         0.4397(1)  0.6448(2)  0.6302(4)  0.0957(8)  0.8013(4)  3.8
HOMER(LP)   0.3841(5)  0.5822(7)  0.6375(5)  0.1702(2)  0.8146(2)  4.2
ECC         0.3736(7)  0.6219(4)  0.6327(3)  0.1301(5)  0.7985(5)  4.8
MLR𝑚𝑎𝑥
MLR𝑎𝑔𝑔

Table 13
The performance of the ML algorithms in terms of One-error ↓.

Datasets    yeast      emotions   flags      CAL500     genbase    Ave. rank
RA𝑘EL       0.2954(5)  0.3222(5)  0.2468(4)  0.2211(4)  0.1662(3)  4.2
LP(𝐽)       0.5337(7)  0.4234(6)  0.6233(7)  0.9832(6)  0.2328(7)  6.6
ML-𝑘NN      0.2371(2)  0.3104(3)  0.2522(5)  0.1195(2)  0.1768(6)  3.6
EPS         0.2474(3)  0.3171(4)  0.1854(1)  0.9868(7)  0.1708(4)  3.8
HOMER(LP)   0.5130(6)  0.4252(7)  0.3549(6)  0.8146(5)  0.1647(2)  5.2
ECC         0.2478(4)  0.2918(1)  0.2059(2)  0.1474(3)  0.1753(5)  3.0
MLR𝑚𝑎𝑥      -          -          -          -          -          -
MLR𝑎𝑔𝑔

Table 14
The performance of the ML algorithms in terms of Rank Loss ↓.

Datasets    yeast      emotions   flags      CAL500     genbase    Ave. rank
RA𝑘EL       0.2172(5)  0.1977(5)  0.2461(4)  0.2450(4)  0.0952(5)  4.6
LP(𝐽)       0.4102(7)  0.3331(7)  0.5405(7)  0.6578(6)  0.1924(7)  6.8
ML-𝑘NN      0.1708(2)  0.1869(3)  0.1982(1)  0.1853(2)  0.0165(2)
EPS         0.1998(4)  0.1890(4)  0.2265(3)  0.6802(7)  0.0556(4)  4.4
HOMER(LP)   0.3599(6)  0.3107(6)  0.3577(6)  0.4279(5)  0.1463(6)  5.8
ECC         0.1798(3)  0.1650(2)  0.2018(2)  0.1987(3)  0.0135(1)  2.2
MLR𝑚𝑎𝑥      -          -          -          -          -          -
MLR𝑎𝑔𝑔

[36] Tomás, J.T., Spolaôr, N., Cherman, E.A., Monard, M.C., 2014. A framework to generate synthetic multi-label datasets. Electron. Notes Theor. Comput. Sci. 302, 155–176. URL: http://dx.doi.org/10.1016/
j.entcs.2014.01.025, doi: .
[37] Trohidis, K., Tsoumakas, G., Kalliris, G., Vlahavas, I.P., 2008. Multi-label classification of music into emotions, in: ISMIR, pp. 325–330.
[38] Tsoumakas, G., Katakis, I., 2007. Multi-label classification: An overview. International Journal of Data Warehousing and Mining (IJDWM) 3, 1–13.
[39] Tsoumakas, G., Katakis, I., Vlahavas, I., 2008. Effective and efficient multilabel classification in domains with large number of labels, in: Proc. ECML/PKDD 2008 Workshop on Mining Multidimensional Data (MMD'08), pp. 53–59.
[40] Tsoumakas, G., Katakis, I., Vlahavas, I., 2009. Mining multi-label data, in: Data Mining and Knowledge Discovery Handbook. Springer, pp. 667–685.
[41] Tsoumakas, G., Katakis, I., Vlahavas, I., 2010. Random k-labelsets for multilabel classification. IEEE Transactions on Knowledge and Data Engineering 23, 1079–1089.
[42] Tsoumakas, G., Vlahavas, I., 2007. Random k-labelsets: An ensemble method for multilabel classification, in: European Conference on Machine Learning, Springer. pp. 406–417.
[43] Turnbull, D., Barrington, L., Torres, D., Lanckriet, G., 2008. Semantic annotation and retrieval of music and sound effects. IEEE Transactions on Audio, Speech, and Language Processing 16, 467–476.
[44] Urbanowicz, R. The educational learning classifier system (eLCS). URL: https://sourceforge.net/projects/educationallcs/.
[45] Urbanowicz, R.J., Moore, J.H., 2009. Learning classifier systems: a complete introduction, review, and roadmap. Journal of Artificial Evolution and Applications 2009, 1.
[46] Urbanowicz, R.J., Moore, J.H., 2015. ExSTraCS 2.0: description and evaluation of a scalable learning classifier system. Evolutionary Intelligence 8, 89–116.
[47] Vens, C., Struyf, J., Schietgat, L., Džeroski, S., Blockeel, H., 2008. Decision trees for hierarchical multi-label classification. Machine Learning 73, 185.
[48] Wang, J., Yang, Y., Mao, J., Huang, Z., Huang, C., Xu, W., 2016. CNN-RNN: A unified framework for multi-label image classification, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2285–2294.
[49] Zaragoza, J.C., Sucar, E., Morales, E., Bielza, C., Larranaga, P., 2011. Bayesian chain classifiers for multidimensional classification, in: Twenty-Second International Joint Conference on Artificial Intelligence.
[50] Zhang, M.L., 2009. ML-RBF: RBF neural networks for multi-label learning. Neural Processing Letters 29, 61–74.
[51] Zhang, M.L., Peña, J.M., Robles, V., 2009. Feature selection for multi-label naive Bayes classification. Information Sciences 179, 3218–3229.
[52] Zhang, M.L., Zhou, Z.H., 2006. Multilabel neural networks with applications to functional genomics and text categorization. IEEE Transactions on Knowledge and Data Engineering 18, 1338–1351.
[53] Zhang, M.L., Zhou, Z.H., 2007. ML-kNN: A lazy learning approach to multi-label learning. Pattern Recognition 40, 2038–2048.
[54] Zhang, M.L., Zhou, Z.H., 2013. A review on multi-label learning algorithms. IEEE Transactions on Knowledge and Data Engineering 26, 1819–1837.
[55] Zhang, Z.Z., 2001. A relation between the average Hamming distance and the average Hamming weight of binary codes. Journal of Statistical Planning and Inference 94, 413–419.
Shabnam Nazmi received her B.S. degree in Electrical Engineering from K.N. Toosi University of Technology and her M.S. degree in Electrical Engineering from Sharif University of Technology in 2009 and 2012, respectively. She is currently a Ph.D. candidate at the Department of Electrical and Computer Engineering, North Carolina A&T State University. Her research interests include multi-label classification and its application to the test and evaluation of autonomous vehicles, genetic-based machine learning, and learning from streaming data.

Xuyang Yan received his B.S. degree in Electrical Engineering from North Carolina Agricultural and Technical State University (NC A&T) and Henan Polytechnic University in 2016. In 2018, he earned his M.S. degree in electrical engineering at NC A&T. He is currently pursuing his Ph.D. degree in electrical engineering at NC A&T. His research interests include extracting knowledge from streaming data, analyzing the emergent behaviors of large-scale autonomous systems, and the application of machine learning techniques in robotics.

Abdollah Homaifar received his B.S. and M.S. degrees from the State University of New York at Stony Brook in 1979 and 1980, respectively, and his Ph.D. degree from the University of Alabama in 1987, all in Electrical Engineering. He is the NASA Langley Distinguished Professor and the Duke Energy Eminent Professor in the Department of Electrical and Computer Engineering at North Carolina A&T State University (NCA&TSU). He is the director of the Autonomous Control and Information Technology Institute and the Testing, Evaluation, and Control of Heterogeneous Large-scale Systems of Autonomous Vehicles (TECHLAV) Center at NCA&TSU. His research interests include machine learning, unmanned aerial vehicles (UAVs), testing and evaluation of autonomous vehicles, optimization, and signal processing. He also serves as an associate editor of the Journal of Intelligent Automation and Soft Computing and is a reviewer for IEEE Transactions on Fuzzy Systems; Systems, Man, and Cybernetics; and Neural Networks. He is a member of the IEEE Control Society, Sigma Xi, Tau Beta Pi, and Eta Kappa Nu.

Emily Doucette serves as the Multi-Domain Networked Weapons technical lead for the Air Force Research Laboratory Munitions Directorate. Prior to this post, she served the Munitions Directorate as the Assistant to the Chief Scientist (2017-2019) and as a research engineer for the Weapon Dynamics and Control Sciences Branch since 2012. She earned a Ph.D. in aerospace engineering from Auburn University and is a recipient of the SMART Scholarship. Her research interests include estimation theory, human-machine teaming, decentralized task assignment, cooperative autonomous engagement, and risk-aware target tracking and interdiction. Dr. Doucette leads a team of postdoctoral and graduate student researchers to support collaborative efforts across DoD, industry, academia, and international partnerships. She served on the AFRL Munitions Directorate Autonomy Steering Committee, is active in the Autonomy Community of Interest, and is the co-lead for the OSD Autonomy Center of Excellence.