A Nature-Inspired Feature Selection Approach based on Hypercomplex Information
Gustavo H. de Rosa, João P. Papa
Department of Computing, São Paulo State University, Bauru, São Paulo - Brazil
[email protected], [email protected]
Xin-She Yang
School of Science and Technology, Middlesex University, London, United Kingdom
[email protected]
January 15, 2021

Abstract
Feature selection for a given model can be transformed into an optimization task. The essential idea behind it is to find the most suitable subset of features according to some criterion. Nature-inspired optimization can mitigate this problem by producing compelling yet straightforward solutions when dealing with complicated fitness functions. Additionally, new mathematical representations, such as quaternions and octonions, are being used to handle higher-dimensional spaces. In this context, we introduce a meta-heuristic optimization framework for hypercomplex-based feature selection, where hypercomplex numbers are mapped to real-valued solutions and then transferred onto a boolean hypercube by a sigmoid function. The proposed hypercomplex feature selection is tested with several meta-heuristic algorithms and hypercomplex representations, achieving results comparable to some state-of-the-art approaches. The good results achieved by the proposed approach make it a promising tool amongst feature selection research.

Keywords: Meta-heuristic optimization · Hypercomplex spaces · Feature selection
Optimization techniques have become more and more popular in the last few years. Beneficial in numerous applications, ranging from engineering [1, 2] and medicine [3, 4] to machine learning fine-tuning [5, 6, 7, 8], they provide suitable solutions with virtually no human interaction in the modeling process, leaving the burden of choosing parameters to the model itself. In this context, most of the obstacles described by non-convex mathematical functions [9] require more robust optimization approaches than conventional optimization methods.

Meta-heuristic algorithms, usually referred to as nature-inspired, swarm-based, or evolutionary algorithms, have gained great attention in recent years, attempting to solve optimization problems in a more appealing way than traditional methods. These nature-inspired techniques work without derivatives, thus being suitable for problems with high-dimensional spaces. Even though they provide outstanding results in different applications, they can still get trapped in local optima. Thus, an important question is how to run these algorithms in the case of complex objective functions. One can refer to hybrid variants [10], aging mechanisms [11], and fitness landscape analysis [12] as some distinct strategies used to deal with this issue.

As mentioned above, the problem of selecting possible parameters can be solved as an optimization problem, where a subset of parameters or features is used to calculate the value of a fitness function. This is similar to feature selection, which is usually classified into two divisions: (i) wrapper approaches [13], and (ii) filter-based approaches [14]. The former methods use the output of some classifier (e.g., classification accuracy) to control the optimization method. Conversely, filter-based ones do not consider this information.

One can presume that feature selection is a straightforward solution that automates the choice of parameters. However, it is still necessary to select an appropriate fitness function, which is regularly correlated to the problem's nature. Also, most machine learning problems deal with high-dimensional data, thus amplifying the problem of exploring the search space. An intriguing way to tackle this obstacle is to use a more complex representation of the search space, the so-called hypercomplex search space. The goal behind handling hypercomplex spaces is based on the possibility of having more natural fitness landscapes, although this has not been mathematically proved yet. Nevertheless, the results achieved previously sustain such a hypothesis [15, 16, 17, 18].

Normalized quaternions, also known as versors, are broadly used to describe the orientation of objects in three-dimensional spaces, being extremely efficient in performing rotations in such spaces [19]. An intriguing extension of quaternions are the octonions, comprised of eight dimensions [20]. Even though they are not well known in the literature, they have compelling traits that make them suitable for special relativity and quantum mechanics, among other research specialties [21, 22]. However, to the best of our knowledge, they have not been used to embed search spaces in meta-heuristic feature selection so far.

This study considers meta-heuristic techniques, along with their quaternion- and octonion-based versions, validated on different datasets, demonstrating the robustness of quaternionic and octonionic representations for hypercomplex-embedded search spaces. Therefore, we believe this paper can serve as a foundation for prospective research regarding hypercomplex representations in the context of meta-heuristic-based feature selection.

The rest of this paper is organized as follows.
Sections 2 and 3 present the theoretical background related to hypercomplex-based spaces (quaternions and octonions) and the proposed approach for hypercomplex-based feature selection, respectively. Section 4 discusses the methodology and the computational setup adopted in this paper, while Section 5 presents the numerical results. Finally, Section 6 states conclusions and future works.

The following problem can be solved with modern methods of numerical analysis:

x^2 + 1 = 0,   (1)

in spite of the fact that x^2 = -1 cannot have a real solution, as the square of any real number must be non-negative, x ∈ ℝ. The problem (1) can be solved using the imaginary representation:

i^2 = -1,   (2)

although this may not appear to be logically correct at first glance. The imaginary numbers assemble a structure called complex numbers, which is formed by real and imaginary terms, as follows:

c = h_0 + h_1 i,   (3)

where h_0, h_1 ∈ ℝ and i^2 = -1. One can perceive that it is feasible to obtain a real number by setting h_1 = 0, or an imaginary number by setting h_0 = 0. Thus, the complex numbers generalize both real and imaginary numbers.

One striking operation that performs well in a two-dimensional space is the rotation of complex numbers. Firstly, let us map a complex number onto a two-dimensional grid, called the complex plane, where the horizontal axis holds the real part (Re), and the vertical axis is accountable for the imaginary part (Im). This description is depicted by Figure 1.

One can see that we need to multiply a complex number by i for each 90-degree rotation in the complex plane. To clarify this, let us consider a point denoted by r = 1 + i. Also, let x be the result of the multiplication of r by i, as follows:

x = ri = i + i^2 = -1 + i.   (4)

Now, we can obtain a point y by multiplying x again by i:

y = xi = -i + i^2 = -1 - i.   (5)
Figure 1: Representation of a complex plane, which is used to map complex numbers onto a two-dimensional space.
Moreover, if we multiply the result y by i, a point w can be obtained as follows:

w = yi = -i - i^2 = 1 - i.   (6)

Finally, by multiplying w by i, we conclude:

z = wi = i - i^2 = 1 + i,   (7)

where z is the same first defined position, i.e., r = z. Figure 2 illustrates the above calculations.

Figure 2: Representation of complex numbers' rotation throughout the complex plane (r = z = 1 + i, x = -1 + i, y = -1 - i, w = 1 - i).
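The rotation sequence above can be checked directly with Python's built-in complex type, where `1j` plays the role of the imaginary unit i:

```python
# Each multiplication by i rotates a point 90 degrees counter-clockwise
# in the complex plane; four multiplications return to the start.
r = 1 + 1j       # the starting point r
x = r * 1j       # Equation (4): -1 + i
y = x * 1j       # Equation (5): -1 - i
w = y * 1j       # Equation (6):  1 - i
z = w * 1j       # Equation (7):  1 + i, back to r
assert z == r
print(r, x, y, w, z)
```

The built-in type handles the bookkeeping of Equations (4)-(7) automatically, which is exactly the behavior the complex-plane picture encodes.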
In a similar manner, we can extend the idea of complex numbers by adding new imaginary terms, producing the so-called hypercomplex numbers. This concept also allows rotations to be performed in higher-dimensional complex spaces. In this work, we consider two traditional hypercomplex representations: quaternions and octonions.
A quaternion q is a hypercomplex number, composed of real and complex parts, being q = h_0 + h_1 i + h_2 j + h_3 k, where h_0, h_1, h_2, h_3 ∈ ℝ and i, j, k are imaginary numbers (also known as the "fundamental quaternion units"). This assumption is held by the following set of equations:

ij = k,   (8)
jk = i,   (9)
ki = j,   (10)
ji = -k,   (11)
kj = -i,   (12)
ik = -j,   (13)

and

i^2 = j^2 = k^2 = -1.   (14)

Essentially, a quaternion q is a four-dimensional representation over the real numbers, i.e., ℝ^4.

Given two arbitrary quaternions q_1 = g_0 + g_1 i + g_2 j + g_3 k and q_2 = h_0 + h_1 i + h_2 j + h_3 k, the quaternion algebra defines a set of main operations [23]. The addition operation, for instance, can be defined as follows:

q_1 + q_2 = (g_0 + g_1 i + g_2 j + g_3 k) + (h_0 + h_1 i + h_2 j + h_3 k) = (g_0 + h_0) + (g_1 + h_1) i + (g_2 + h_2) j + (g_3 + h_3) k,   (15)

while the subtraction is defined as follows:

q_1 - q_2 = (g_0 + g_1 i + g_2 j + g_3 k) - (h_0 + h_1 i + h_2 j + h_3 k) = (g_0 - h_0) + (g_1 - h_1) i + (g_2 - h_2) j + (g_3 - h_3) k.   (16)

Moreover, Fister et al. [15, 16] introduced two other operations, q_rand and q_zero. The former initializes a given quaternion with values drawn from a Gaussian distribution N(0, 1), and is defined as follows:

q_rand() = {g_i = N(0, 1) | i ∈ {0, 1, 2, 3}}.   (17)

The latter initializes a quaternion with zero values, as follows:

q_zero() = {g_i = 0 | i ∈ {0, 1, 2, 3}}.   (18)

Octonions are a natural extension of quaternions and were discovered independently by John T. Graves and Arthur Cayley around 1843. An octonion is composed of seven complex parts and one real-valued term, being defined as follows:

o = h_0 e_0 + h_1 e_1 + h_2 e_2 + . . . + h_7 e_7,   (19)

where h_i ∈ ℝ and e_i are the imaginary numbers, i = 0, . . . , 7. Commonly, e_0 = 1 is used in order to obtain the real-valued term of the octonion.

The addition, subtraction, and norm equations are computed likewise to the quaternions' formulae, giving us a clear implementation framework in order to manipulate several hypercomplex representations.
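These operations can be sketched by storing a hypercomplex number as its list of real coefficients (four for a quaternion, eight for an octonion). The function names below are illustrative, not LibOPT's API, and the Hamilton product `q_mul` is the standard multiplication implied by Equations (8)-(14), although the paper itself only relies on addition, subtraction, and the norm:

```python
import random

def q_add(q1, q2):
    """Component-wise addition, Equation (15)."""
    return [g + h for g, h in zip(q1, q2)]

def q_sub(q1, q2):
    """Component-wise subtraction, Equation (16)."""
    return [g - h for g, h in zip(q1, q2)]

def q_rand(dims=4):
    """Gaussian N(0, 1) initialization, Equation (17); dims=8 yields an octonion."""
    return [random.gauss(0.0, 1.0) for _ in range(dims)]

def q_zero(dims=4):
    """All-zeros initialization, Equation (18)."""
    return [0.0] * dims

def q_mul(q1, q2):
    """Hamilton product, encoding ij = k, ji = -k, i^2 = -1 (Equations 8-14)."""
    g0, g1, g2, g3 = q1
    h0, h1, h2, h3 = q2
    return [g0*h0 - g1*h1 - g2*h2 - g3*h3,
            g0*h1 + g1*h0 + g2*h3 - g3*h2,
            g0*h2 - g1*h3 + g2*h0 + g3*h1,
            g0*h3 + g1*h2 - g2*h1 + g3*h0]

# Verify the unit relations from the multiplication table.
i, j, k = [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]
assert q_mul(i, j) == k               # Equation (8):  ij = k
assert q_mul(j, i) == [0, 0, 0, -1]   # Equation (11): ji = -k
assert q_mul(i, i)[0] == -1           # Equation (14): i^2 = -1
```

Because addition and subtraction are purely component-wise, the same code handles octonions by passing eight-element lists.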
This section outlines the proposed method for meta-heuristic-based feature selection. One can understand the feature selection process as a method that decides whether a feature should be selected or not (boolean) in order to solve a given problem. As traditional optimization algorithms use a continuous-valued search space, we need to shape the search space into an n-dimensional binary structure, where solutions are selected across the edges of a hypercube. Furthermore, as our problem is to select a feature or not, each solution individual is now an n-dimensional binary array, where each dimension corresponds to a specific feature and the values 1 and 0 indicate whether this feature will or will not be part of the new set.

Concerning conventional optimization algorithms, the solutions are found upon continuous-valued positions of the search space. In order to accomplish this binary-valued individual, one can restrain the new solutions to binary values only:

S(x_i^j) = 1 / (1 + e^(-x_i^j)),   (20)

x_i^j = 1 if S(x_i^j) > α, and 0 otherwise,   (21)

in which α ∼ U(0, 1), and x ∈ ℝ stands for a possible solution.

Equation 20 represents the transfer function, which maps real-valued solutions into binary-valued ones. Note that any transfer function can be used to fulfill this purpose. In this work, we are using a sigmoid function (Equation 20), illustrated by Figure 3, to map real-valued solutions and generate a probability. Further, the mapped value is compared against a uniform distribution sample in order to obtain the binary output (Equation 21).

Figure 3:
Sigmoid transfer function f(x) = 1 / (1 + e^(-x)).

A hypercomplex-based feature selection strategy does not deviate too much from the regular method. One can encode the common search space into a higher-dimensional space by applying the power of quaternions or octonions. When conducting the meta-heuristic algorithm through the hypercomplex space towards a feasible solution, a crucial operator that needs to be defined is the p-norm, which is responsible for mapping hypercomplex values to real numbers. Let q be a hypercomplex number with real coefficients {h_d}, d = 1, . . . , D; one can compute the Minkowski p-norm as follows:

||q||_p = ( sum_{d=1}^{D} |h_d|^p )^(1/p),   (22)

where D is the number of dimensions of the space (2 for complex numbers, 4 for quaternions, and 8 for octonions, for instance) and p ≥ 1. Common values for the latter variable are 1 or 2 for the Taxicab and Euclidean norms, respectively. Hence, one can see the p-norm as a generalization of such distance operators. In this work, we opted to use the Euclidean norm as the mapping function.

Prior to the transfer function activation, there is an additional equation, called the Span function, which is responsible for mapping the norm's output between the lower and upper bounds, as follows:

q_span() = (b_u - b_l) ||q||_p / D^(1/p) + b_l,   (23)

where b_l and b_u stand for the lower and upper bounds, respectively.

Figure 4 illustrates an encoding of a solution vector x into a quaternionic space, where x_i^j depicts the i-th component of the hypercomplex number for the j-th decision variable. The same approach can be applied to octonionic spaces by extending the quaternion q (four components) to an octonion o (eight components).

Figure 4: Quaternionic hypercomplex encoding of a solution vector x, such that x_i^j stands for the i-th component of the hypercomplex number for the j-th decision variable.

The idea behind this work is to model the task of selecting the most suitable features for a given problem through a meta-heuristic optimization process. As stated in Section 1, feature selection stands for a proper selection of features, reducing a particular problem's dimensionality and usually enhancing its performance. Also, as the proposed approach is a wrapper-based one, there is a need to define an objective function that will conduct the optimization process. Therefore, the proposed approach aims at selecting the subset of features that minimizes the classification error (maximizes the classification accuracy) of a given supervised classifier over a validation set. Although any supervised pattern recognition classifier could be applied, we opted to use the Optimum-Path Forest (OPF) [24, 25], since it is parameterless and has a fast training procedure. Essentially, the OPF encodes each dataset sample as a node in a graph, whose connections are defined by an adjacency relation. Its learning process aims at finding prime samples called prototypes and trying to conquer the remaining samples by offering them optimum paths according to a path-cost function. In the end, optimum-path trees are achieved, each one rooted at a different prototype node.
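Putting Equations (20)-(23) together, the decoding chain from one hypercomplex coefficient vector to one binary feature decision can be sketched as follows. The bounds b_l = -10 and b_u = 10 are illustrative assumptions, and the helper names are not LibOPT's API:

```python
import math
import random

def p_norm(q, p=2):
    """Minkowski p-norm of the coefficients, Equation (22); p=2 is Euclidean."""
    return sum(abs(h) ** p for h in q) ** (1.0 / p)

def q_span(q, b_l, b_u, p=2):
    """Span the norm between the lower and upper bounds, Equation (23)."""
    return (b_u - b_l) * p_norm(q, p) / len(q) ** (1.0 / p) + b_l

def transfer(x):
    """Sigmoid transfer function, Equation (20)."""
    return 1.0 / (1.0 + math.exp(-x))

def binarize(q, b_l=-10.0, b_u=10.0, p=2):
    """Stochastic binary decision for one feature, Equation (21)."""
    alpha = random.uniform(0.0, 1.0)
    return 1 if transfer(q_span(q, b_l, b_u, p)) > alpha else 0

# One quaternion per feature, as in Figure 4; octonions would use eight coefficients.
solution = [[random.gauss(0.0, 1.0) for _ in range(4)] for _ in range(5)]
mask = [binarize(q) for q in solution]
print(mask)
```

Running `binarize` over the whole solution vector yields the binary mask that switches each feature on or off before the wrapper evaluates the classifier.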
Dataset          Task                        Training  Testing  Features
Arcene           Mass Spectrometry           100       100      10,000
BASEHOCK         Text                        997       996      4,862
COIL20           Face Image                  770       770      1,024
DNA              Biological                  2,000     1,186    180
Isolet           Spoken Letter Recognition   780       780      617
Lung             Biological                  102       101      3,312
Madelon          Artificial                  2,000     600      500
MPEG7-BAS        Image Descriptor            700       700      180
MPEG7-Fourier    Image Descriptor            700       700      126
Mushrooms        Biological                  4,062     4,062    112
NTL-Commercial   Energy Theft                2,476     2,476    8
NTL-Industrial   Energy Theft                1,591     1,591    8
ORL              Face Image                  200       200      1,024
PCMAC            Text                        972       971      3,289
Phishing         Network Security            5,528     5,527    68
Segment          Image Segmentation          1,155     1,155    19
Sonar            Signal                      104       104      60
Splice           Biological                  1,000     2,175    60
Vehicle          Image Silhouettes           423       423      18
Wine             Chemical                    89        89       13

Table 1: Datasets employed in the experiments.

Given a set of hypercomplex candidate solutions {s_i}, i = 1, . . . , S, where s_i = {s_i^1, s_i^2, . . . , s_i^N}, such that S stands for the number of meta-heuristic candidates and N for the number of decision variables (the number of features of each problem, depicted in Table 1), we wish to learn the best set of features F*. Namely, we want to solve the following optimization problem:

F* = {f_i(s_i, F) | i ∈ [1, S]}, s.t. 1 ≤ |F| ≤ N,   (24)

where f_i(s_i, F) stands for the fitness function (OPF accuracy over the validation set) of candidate i based on its binary solution s_i, which is responsible for activating or deactivating the set of features F. Finally, the meta-heuristic intrinsic mechanics are executed to identify the best solution so far and to update the candidates' positions in the hypercomplex space.

Table 1 describes all the datasets utilized in this work. We selected datasets that vary in the number of samples, classes, and features, providing a stronger validation under distinct scenarios. The datasets were downloaded from LibSVM's project and Arizona State University's (ASU) repository (http://featureselection.asu.edu/datasets.php), being already quantized for categorical features and processed for missing values (one can find their post-processed versions at http://recogna.tech). As we need a distinct set to guide the optimization process, apart from the test set, we partitioned all datasets' training sets in half, composing the so-called validation set. Therefore, one part is used for the training step, the other for the optimization task validation, and the test set to assess the experimental validation (testing step). Note that some meta-heuristics might start their searching procedure with every possible initial feature, while others might start with a random subset of the initial features.
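The wrapper fitness in Equation (24) can be sketched directly: given a binary mask, restrict the data to the selected columns and measure validation error. The paper uses OPF for this step; the hand-rolled 1-NN classifier and the toy data below are stand-ins used purely for illustration:

```python
import math

def one_nn_error(mask, train, val):
    """Wrapper fitness sketch: 1-NN classification error on the validation
    set, using only the features switched on in the binary mask.
    (The paper's classifier is OPF; 1-NN is an illustrative substitute.)"""
    idx = [i for i, bit in enumerate(mask) if bit == 1]
    if not idx:
        return 1.0  # an empty subset is invalid: Equation (24) requires |F| >= 1

    def dist(a, b):
        return math.sqrt(sum((a[i] - b[i]) ** 2 for i in idx))

    errors = 0
    for xv, yv in val:
        nearest = min(train, key=lambda sample: dist(sample[0], xv))
        errors += nearest[1] != yv
    return errors / len(val)

# Toy data: feature 1 separates the classes, feature 2 is pure noise.
train = [([0.0, 5.0], 0), ([0.1, -3.0], 0), ([1.0, 4.0], 1), ([0.9, -2.0], 1)]
val = [([0.05, 9.0], 0), ([0.95, 9.0], 1)]
print(one_nn_error([1, 0], train, val))  # 0.0: dropping the noisy feature keeps a perfect score
```

A meta-heuristic would call this function once per candidate per iteration, keeping the mask with the lowest error.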
Algorithm  Parameters
ABC        number of trials = 1
AIWPSO     c_1 = 1. | c_2 = 1. | w = [0. , 0. ]
BA         f = [0, ] | A = 1. | r = 0.
CS         β = 1. | p = 0. | α = 0.
FA         α = 0. | β = 1. | γ = 1.
FPA        β = 1. | p = 0.
PSO        c_1 = 1. | c_2 = 1. | w = 0.

Table 2: Parameter settings for the meta-heuristic algorithms considered in this work.
The source code used in this work comes from two libraries: LibOPT (https://github.com/jppbsi/LibOPT) and LibDEV (https://github.com/jppbsi/LibDEV). Both libraries are implemented in the C language and have been extensively used throughout scientific research. The LibOPT library is a collection of meta-heuristic optimization techniques, while the LibDEV library provides an integration environment, e.g., feature selection conducted over meta-heuristic optimizations. One can refer to [26] in order to understand how to work under the LibOPT environment, i.e., how to design a hypercomplex optimization task.

To perform a reasonable comparison among distinct meta-heuristic techniques, we must rely on mathematical methods that sustain the observations. The first step is to decide whether to use a parametric or a non-parametric statistical test [27]. Unfortunately, we cannot assume normality of our numerical trials due to the random and non-deterministic nature of the meta-heuristic techniques, restricting our analysis to non-parametric approaches.

Secondly, acknowledging that the results of our numerical trials are independent (i.e., classification accuracies) and continuous over a particular dependent variable (i.e., number of observations), we can identify that the Wilcoxon signed-rank test [28] satisfies our requirements. It is a non-parametric hypothesis test used to compare related observations (in our case, repeated measurements over a certain meta-heuristic) to assess whether there are statistically significant differences between them.

For every dataset, each meta-heuristic was evaluated under a 2-fold cross-validation with 25 runs. Additionally, for every meta-heuristic, 15 agents (particles) were used over 25 convergence iterations. Remember that the training set was again split in half to form the validation set, used during the optimization process.

To provide a thorough comparison between meta-heuristics, we have chosen different techniques, ranging from swarm-based to evolutionary-inspired ones, in the context of feature selection:

• Artificial Bee Colony (ABC) [29];
• Adaptive Inertia Weight Particle Swarm Optimization (AIWPSO) [30];
• Bat Algorithm (BA) [31];
• Cuckoo Search (CS) [32];
• Firefly Algorithm (FA) [33];
• Flower Pollination Algorithm (FPA) [34];
• Particle Swarm Optimization (PSO) [35].

Note that, for each selected meta-heuristic, we will also present their quaternion- and octonion-based versions, the former preceded by a Q prefix and the latter preceded by an O prefix. Table 2 presents the chosen parameter setting for every meta-heuristic technique; these values were empirically chosen according to their authors' definitions. We omitted quaternion- and octonion-based algorithms from the table, as their parameters are the same as in their original versions.

Concerning ABC, we only need to set the number of trial limits for each food source. AIWPSO defines minimum and maximum weights as the w interval, and c_1 and c_2 as the control parameters. BA has the minimum and maximum frequency range defined by the f interval, as well as the loudness parameter A and the pulse rate r. With CS, we need to set β, which is used to compute the Lévy distribution, p, which is the probability of replacing the worst nests by new ones, and α, which is the step size. Regarding FA, we have α for calculating the randomized parameter, as well as the attractiveness parameter β and the light absorption coefficient γ. FPA requires the β parameter, used to compute the Lévy distribution, and p, which is the probability of local pollination. Finally, PSO defines w as the inertia weight, and c_1 and c_2 as the control parameters.

This section presents the numerical results concerning the proposed experiments. It is divided into two subsections, which discuss the overall analysis and the convergence analysis, respectively. In order to provide statistical analysis of the numerical results, we opted to bold the best results' cells according to the Wilcoxon signed-rank test with 5% significance. In other words, regarding a particular column, every bolded cell achieved the most suitable accuracy, time, or number of features according to the statistical test.
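The experimental loop described above can be made concrete with a minimal sketch that uses the paper's budget of 15 agents and 25 iterations, but substitutes plain random resampling for a real meta-heuristic update rule and a toy fitness for the OPF validation error; all names and the toy problem are illustrative assumptions:

```python
import math
import random

random.seed(42)
N_FEATURES, AGENTS, ITERATIONS = 6, 15, 25  # paper's agent/iteration budget
INFORMATIVE = {0, 2}  # toy ground truth: features the "classifier" needs

def binarize(q):
    """Euclidean norm -> span into [-10, 10] -> sigmoid -> stochastic threshold."""
    norm = math.sqrt(sum(h * h for h in q))
    spanned = 20.0 * norm / math.sqrt(len(q)) - 10.0
    prob = 1.0 / (1.0 + math.exp(-spanned))
    return 1 if prob > random.random() else 0

def toy_fitness(mask):
    """Stand-in for the OPF validation error: penalize dropped informative
    features heavily, and subset size lightly."""
    missing = sum(1 for f in INFORMATIVE if mask[f] == 0)
    return missing + 0.1 * sum(mask)

best_mask, best_fit = None, float("inf")
for _ in range(ITERATIONS):
    for _ in range(AGENTS):
        # A real meta-heuristic would *update* its agents; here we just resample.
        agent = [[random.gauss(0.0, 1.0) for _ in range(4)] for _ in range(N_FEATURES)]
        mask = [binarize(q) for q in agent]
        fit = toy_fitness(mask)
        if fit < best_fit:
            best_mask, best_fit = mask, fit

print(best_mask, best_fit)
```

Swapping the resampling line for a PSO-, BA-, or CS-style position update, and the toy fitness for the OPF wrapper, recovers the pipeline evaluated in the experiments.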
Table 3 describes all datasets' average accuracy over the test set found by each meta-heuristic technique. A very interesting fact to highlight is that, for almost every dataset, at least one meta-heuristic technique was able to achieve a performance comparable to the baseline approach, i.e., OPF classification using the whole dataset. On the other hand, considering the Arcene, Mushrooms, NTL-Commercial, NTL-Industrial, and Splice datasets, meta-heuristic techniques outperformed the baseline approach.

Regarding only meta-heuristic techniques and their hypercomplex versions, one can see that the hypercomplex-based algorithms were able to achieve comparable accuracy values. In some cases, they even outperformed their naïve versions, e.g., QFA, QFPA, OFPA, and QPSO on Arcene; QAIWPSO on COIL20; QAIWPSO on Madelon; QCS, QFPA, and OFPA on Mushrooms; QABC and OABC on ORL; QABC, QBA, and OBA on PCMAC; QABC, OABC, OAIWPSO, QFA, OFA, QFPA, and OFPA on Phishing; OABC and QPSO on Splice; and QCS and OCS on Wine. In this case, outperforming means that a particular technique was capable of achieving a higher accuracy than another technique. Essentially, the Wilcoxon signed-rank test assesses whether there was a statistical similarity between the accuracies obtained by each one of the techniques. Thus, as the statistical test was conducted over the independent accuracies for each meta-heuristic technique, it is possible to observe that the most significant techniques (bolded ones) in the aforementioned cases were also the ones that achieved higher accuracy than their naïve versions.
Table 3: Average accuracy achieved over the test set considering all datasets, reporting ABC, AIWPSO, BA, CS, FA, FPA, and PSO, their Q- and O-prefixed hypercomplex versions, and the baseline.
An interesting fact emerging from Table 4 is that CS was able to achieve the lowest number of features in nearly every dataset. Furthermore, when standard CS did not deliver the lowest number of features, its quaternionic and octonionic representations were able to achieve this intent (QCS achieved the lowest number of features for the Phishing dataset, while OCS achieved this goal for the DNA, NTL-Commercial, and Segment datasets).

Even though most algorithms were able to diminish the feature space size and obtain statistically similar accuracy with respect to the baseline method, in some cases they reached a slightly lower accuracy than the original OPF classification. However, it should be noted that in the case of the BASEHOCK dataset, where even the baseline classification obtained the best accuracy, all other meta-heuristic techniques could reduce the number of features by about 35% while scoring 2-3% lower accuracy than OPF.

Table 4: Average number of features used over the test set considering all datasets.

Table 5 shows that CS-based techniques completed the optimization runs in a significantly lower computation time than every other technique. Additionally, Figures 5 and 6 illustrate a more in-depth comparison between CS and its hypercomplex versions for four distinct datasets: DNA, NTL-Commercial, Phishing, and Segment. One can observe that Figure 5 represents the CS-based techniques' behavior, where the best techniques are positioned in the top-left corner of the graphic, i.e., best accuracy and lowest number of features. For the sake of brevity, we opted to show some datasets that have discrepant data, i.e., datasets that have a low amount of features, being more susceptible when selecting a subset of features. In such cases, any incorrect feature selection will depreciate the classification results, thus making the convergence process more unstable.

One can perceive that CS-based techniques encountered a feasible number of features, but not necessarily the best accuracy. If one observes the difference between the best and the worst accuracy considering all datasets (except NTL-based ones) and meta-heuristic techniques, there is not a single one that surpasses the 4.77% barrier. Nevertheless, CS suffered in the NTL datasets (energy theft identification), which are highly unbalanced and have a relatively small amount of features. As CS encountered the lowest number of features in such low-dimensional datasets, it is possible to observe that it has overfitted the optimization process to find the lowest number of possible features at the cost of penalizing the classifier, hence achieving an unsuitable accuracy for these particular datasets.

Regarding the hypercomplex techniques, such as quaternion- and octonion-based ones, it is possible to observe that they have an extra computational loop per feature, due to their number of dimensions, i.e., 4 and 8, respectively. If the number of selected features is sufficiently small to overcome this extra loop, the hypercomplex techniques will achieve a shorter computational time than the conventional ones. For example, in an n-features problem, the conventional technique's loop runs n times, while the quaternion and octonion loops run 4n and 8n times, respectively. If the quaternion-based technique selects n/4 features on average while the octonion-based one selects n/8 features on average, both will perform a loop that runs n times, being comparable to the conventional algorithm.

The results obtained in this study prove the promising use of meta-heuristic optimization techniques when selecting a quasi-optimum subset of features while preserving performance and discriminative aptitude.
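The break-even arithmetic above can be illustrated numerically; the feature count n = 800 is an arbitrary example, not taken from the paper:

```python
n = 800  # arbitrary illustrative feature count
conventional = n * 1        # one coefficient per feature
quaternion = (n // 4) * 4   # n/4 selected features, 4 coefficients each
octonion = (n // 8) * 8     # n/8 selected features, 8 coefficients each
assert conventional == quaternion == octonion == 800
```

Whenever the hypercomplex versions select fewer than n/4 (or n/8) features, their inner loop becomes cheaper than the conventional one despite the larger per-feature encoding.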
Table 5: Average computation time required by the optimization process considering all datasets.
The convergence curves of CS and its variants obtained for the DNA, NTL-Commercial, Phishing, and Segment datasets are shown in Figure 7.

An interesting fact that one can perceive is that hypercomplex-based techniques were able to converge faster and better than the standard version in three out of four datasets (DNA, Phishing, and Segment). Additionally, it is essential to highlight that, as hypercomplex-based algorithms use an enhanced version of the search space, i.e., a space with a more substantial amount of possible values, they are capable of better exploring it, thus leading to better convergence rates and fitness values. Moreover, as OCS encodes a higher-dimensional space, i.e., 8 dimensions, it was able to achieve the lowest fitness for two datasets (Phishing and Segment), thus showing its capability of exploring the search space.

Figure 5: Number of selected features vs. accuracy ([0, 1]) considering CS, QCS, and OCS on: (a) DNA, (b) NTL-Commercial, (c) Phishing, and (d) Segment datasets.
Figure 6: Computation time (s) for each independent run of CS, QCS, and OCS on: (a) DNA, (b) NTL-Commercial, (c) Phishing, and (d) Segment datasets.
Figure 7: Iteration vs. fitness considering CS, QCS, and OCS on: (a) DNA, (b) NTL-Commercial, (c) Phishing, and (d) Segment datasets.
This paper addressed the problem of feature selection through a meta-heuristic optimization approach. A wide rangeof meta-heuristic techniques was employed in distinct datasets in order to provide a more thoughtful numericalvalidation of the proposed computational framework. Additionally, we also present three distinct search spaces for eachoptimization technique: standard, quaternionic, and octatonic.In most circumstances, the meta-heuristic techniques were able to outperform the baseline approach (OPF classificationover the full-features dataset). In such cases, outperforming means that a singular technique was able to attain higheraccuracy than another algorithm, according to the Wilcoxon signed-rank test with 5% of significance. Besides, it ispossible to highlight that all meta-heuristic techniques were able to diminish a substantial number of the initial datasets’features while maintaining their classification accuracy.Even though most algorithms were able to reduce the features’ space size and obtain statistically similar accuracywithin respect to the baseline method, in some cases, they reached a slightly lower accuracy than the original OPFclassification. Nevertheless, it should be remarked that in the BASEHOCK dataset, where the baseline classificationachieved the best accuracy, all other meta-heuristic techniques could decrease by about 35% of the number of featureswhile scoring 2-3% lower accuracy than OPF.An intriguing fact is that CS was able to obtain the lowest number of features in nearly every dataset, but not necessarilythe best accuracy. If one perceives the discrepancy between the best and the worst accuracy considering all datasets(except NTL-based ones) and meta-heuristic techniques, there is not a single one that exceeds the 4.77% limit.Nonetheless, CS underwent in the NTL datasets (energy theft identification), which are highly unbalanced and havea comparatively small amount of features. 
As CS obtained the lowest number of features on such low-dimensional datasets, it is reasonable to say that it overfitted the optimization process in its attempt to find the smallest possible feature subset. Such behavior penalized the classifier and, consequently, led to unsatisfactory accuracy on these particular datasets.

Furthermore, we presented a more in-depth analysis of CS and its variants, QCS and OCS, over four distinct datasets with discrepant data, i.e., datasets with few features that are highly sensitive to feature selection. This analysis provided insights into the number of selected features versus the accuracy achieved, the time taken by the optimization process, and the convergence behavior. It is also worth highlighting that the hypercomplex-based CS approaches took more time than the standard version, while converging to lower fitness values than their naïve counterpart.

For future work, we aim to explore the hypercomplex mapping function in more depth, e.g., the norm function. We hope to better understand the hypercomplex structure, as one of the central concepts in applying it to feature selection lies in transferring values from hypercomplex- to real-valued search spaces.

Acknowledgments
The authors are grateful to the São Paulo Research Foundation (FAPESP) grants.
References

[1] X.-S. Yang. Engineering Optimization: An Introduction with Metaheuristic Applications. Wiley Publishing, 1st edition, 2010.
[2] B. K. Oh, K. J. Kim, Y. Kim, H. S. Park, and H. Adeli. Evolutionary learning based sustainable strain sensing model for structural health monitoring of high-rise buildings. Applied Soft Computing, 58:576–585, 2017.
[3] S. Klein, M. Staring, and J. Pluim. Evaluation of optimization methods for nonrigid medical image registration using mutual information and b-splines. IEEE Transactions on Image Processing, 16(12):2879–2890, 2007.
[4] N. Dey, S. Samanta, S. Chakraborty, A. Das, S. Chaudhuri, S. Sheli, and J. Suri. Firefly algorithm for optimization of scaling factors during embedding of manifold medical information: an application in ophthalmology imaging. Journal of Medical Imaging and Health Informatics, 4(3):384–394, 2014.
[5] G. H. Rosa, J. P. Papa, A. N. Marana, W. Scheirer, and D. D. Cox. Fine-tuning convolutional neural networks using harmony search. In Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications. Springer International Publishing, 2015.
[6] J. P. Papa, G. H. Rosa, K. A. P. Costa, A. N. Marana, W. Scheirer, and D. D. Cox. On the model selection of Bernoulli restricted Boltzmann machines through harmony search. In Proceedings of the Genetic and Evolutionary Computation Conference, GECCO '15, pages 1449–1450, New York, USA, 2015. ACM.
[7] J. P. Papa, W. Scheirer, and D. D. Cox. Fine-tuning deep belief networks using harmony search. Applied Soft Computing, 46:875–885, 2016.
[8] J. P. Papa, G. H. Rosa, A. N. Marana, W. Scheirer, and D. D. Cox. Model selection for discriminative restricted Boltzmann machines through meta-heuristic techniques. Journal of Computational Science, 9:14–18, 2015.
[9] D. P. Bertsekas. Nonlinear Programming. Athena Scientific, Belmont, 1999.
[10] X. M. Hu, J. Zhang, Y. Yu, H. S. H. Chung, Y. L. Li, Y. H. Shi, and X. N. Luo. Hybrid genetic algorithm using a forward encoding scheme for lifetime maximization of wireless sensor networks. IEEE Transactions on Evolutionary Computation, 14(5):766–781, 2010.
[11] W. N. Chen, J. Zhang, Y. Lin, N. Chen, Z. H. Zhan, H. S. H. Chung, Y. Li, and Y. H. Shi. Particle swarm optimization with an aging leader and challengers. IEEE Transactions on Evolutionary Computation, 17(2):241–258, 2013.
[12] E. Pitzer and M. Affenzeller. A Comprehensive Survey on Fitness Landscape Analysis. Springer, Berlin, 2012.
[13] R. Kohavi and G. H. John. Wrappers for feature subset selection. Artificial Intelligence, 97(1–2):273–324, 1997.
[14] N. Sánchez-Maroño, A. Alonso-Betanzos, and M. Tombilla-Sanromán. Filter methods for feature selection – a comparative study. In International Conference on Intelligent Data Engineering and Automated Learning, pages 178–187. Springer, 2007.
[15] I. Fister, X.-S. Yang, J. Brest, and I. Fister Jr. Modified firefly algorithm using quaternion representation. Expert Systems with Applications, 40(18):7220–7230, 2013.
[16] I. Fister, J. Brest, I. Fister Jr., and X.-S. Yang. Modified bat algorithm with quaternion representation. In IEEE Congress on Evolutionary Computation, pages 491–498, 2015.
[17] J. P. Papa, D. R. Pereira, A. Baldassin, and X.-S. Yang. On the harmony search using quaternions. In F. Schwenker, H. M. Abbas, N. El-Gayar, and E. Trentin, editors, Artificial Neural Networks in Pattern Recognition: 7th IAPR TC3 Workshop, ANNPR, pages 126–137, Cham, 2016. Springer International Publishing.
[18] J. P. Papa, G. H. Rosa, D. R. Pereira, and X.-S. Yang. Quaternion-based deep belief networks fine-tuning. Applied Soft Computing, 60:328–335, 2017.
[19] J. C. Hart, G. K. Francis, and L. H. Kauffman. Visualizing quaternion rotation. ACM Transactions on Graphics (TOG), 13(3):256–276, 1994.
[20] J. T. Graves. On a connection between the general theory of normal couples and the theory of complete quadratic functions of two variables. Philosophical Magazine, 26(173):315–320, 1845.
[21] S. De Leo. Quaternions and special relativity. Journal of Mathematical Physics, 37(6):2955–2968, 1996.
[22] D. Finkelstein, J. M. Jauch, S. Schiminovich, and D. Speiser. Foundations of quaternion quantum mechanics. Journal of Mathematical Physics, 3(2):207–220, 1962.
[23] D. Eberly. Quaternion algebra and calculus. Technical report, Magic Software, 2002.
[24] J. P. Papa, A. X. Falcão, and C. T. N. Suzuki. Supervised pattern classification based on optimum-path forest. International Journal of Imaging Systems and Technology, 19(2):120–131, 2009.
[25] J. P. Papa, A. X. Falcão, V. H. C. Albuquerque, and J. M. R. S. Tavares. Efficient supervised optimum-path forest classification for large datasets. Pattern Recognition, 45(1):512–520, 2012.
[26] J. P. Papa, G. H. Rosa, and X.-S. Yang. On the Hypercomplex-Based Search Spaces for Optimization Purposes, pages 119–147. Springer International Publishing, Cham, 2018.
[27] M. Hollander, D. A. Wolfe, and E. Chicken. Nonparametric Statistical Methods, volume 751. John Wiley & Sons, Hoboken, NJ, USA, 2013.
[28] F. Wilcoxon. Individual comparisons by ranking methods. Biometrics Bulletin, 1(6):80–83, 1945.
[29] D. Karaboga and B. Basturk. A powerful and efficient algorithm for numerical function optimization: Artificial bee colony (ABC) algorithm. Journal of Global Optimization, 39(3):459–471, 2007.
[30] A. Nickabadi, M. M. Ebadzadeh, and R. Safabakhsh. A novel particle swarm optimization algorithm with adaptive inertia weight. Applied Soft Computing, 11:3658–3670, 2011.
[31] X.-S. Yang and A. H. Gandomi. Bat algorithm: a novel approach for global engineering optimization. Engineering Computations, 29(5):464–483, 2012.
[32] X.-S. Yang and S. Deb. Engineering optimisation by cuckoo search. International Journal of Mathematical Modelling and Numerical Optimisation, 1:330–343, 2010.
[33] X.-S. Yang. Firefly algorithm, stochastic test functions and design optimisation. International Journal of Bio-Inspired Computing, 2(2):78–84, 2010.
[34] X.-S. Yang, M. Karamanoglu, and X. He. Flower pollination algorithm: A novel approach for multiobjective optimization. Engineering Optimization, 46(9):1222–1237, 2014.
[35] J. Kennedy and R. C. Eberhart.