On Explaining Machine Learning Models by Evolving Crucial and Compact Features
Marco Virgolin
Life Sciences and Health Group, Centrum Wiskunde & Informatica, Amsterdam 1098 XG, the Netherlands
Tanja Alderliesten
Department of Radiation Oncology, Amsterdam UMC, Amsterdam 1105 AZ, the Netherlands
Peter A.N. Bosman
Life Sciences and Health Group, Centrum Wiskunde & Informatica, Amsterdam 1098 XG, the Netherlands
Algorithmics Group, Delft University of Technology, Delft 2628 XE, the Netherlands
Abstract
Feature construction can substantially improve the accuracy of Machine Learning (ML) algorithms. Genetic Programming (GP) has been proven to be effective at this task by evolving non-linear combinations of input features. GP additionally has the potential to improve ML explainability since explicit expressions are evolved. Yet, in most GP works the complexity of evolved features is not explicitly bounded or minimized, though this is arguably key for explainability. In this article, we assess to what extent GP still performs favorably at feature construction when constructing features that are (1) of small-enough number, to enable visualization of the behavior of the ML model; (2) of small-enough size, to enable interpretability of the features themselves; (3) of sufficient informative power, to retain or even improve the performance of the ML algorithm. We consider a simple feature construction scheme using three different GP algorithms, as well as random search, to evolve features for five ML algorithms, including support vector machines and random forest. Our results on 21 datasets pertaining to classification and regression problems show that constructing only two compact features can be sufficient to rival the use of the entire original feature set. We further find that a modern GP algorithm, GP-GOMEA, performs best overall. These results, combined with examples that we provide of readable constructed features and of 2D visualizations of ML behavior, lead us to positively conclude that GP-based feature construction still works well when explicitly searching for compact features, making it extremely helpful to explain ML models.

This preprint is associated with a manuscript accepted for publication in Swarm and Evolutionary Computation, doi.org/10.1016/j.swevo.2019.100640. This work is licensed under a Creative Commons "Attribution-NonCommercial-NoDerivs 3.0 Unported" license.

Keywords: feature construction, interpretable machine learning, genetic programming, GOMEA
1. Introduction
Feature selection and feature construction are two important steps to improve the performance of any Machine Learning (ML) algorithm [1, 2]. Feature selection is the task of excluding features that are redundant or misleading. Feature construction is the task of transforming (parts of) the original feature space into one that the ML algorithm can better exploit.

Preprint submitted to Swarm and Evolutionary Computation, January 13, 2020

Figure 1: Regression surface learned by SVM for the Yacht dataset (in blue), expressed as a 2D function of the two features (on the bottom axes) constructed by our approach. Circles are training samples, diamonds are test samples. The dataset has six features ($x^{(i)}$). Our approach constructs two new features (using GP-GOMEA, see Sec. 4.1), which are non-linear transformations of the prismatic coefficient ($x^{(2)}$) and the Froude number ($x^{(6)}$). With only two features the SVM prediction surface can be visualized. Moreover, these new features are understandable. Finally, the modeling quality is actually improved over employing SVM directly on all six features. The coefficient of determination of SVM increased from 85% using the original features to 98% using the two new features.

A very interesting method to perform feature construction automatically is Genetic Programming (GP) [3, 4]. GP can synthesize functions without many prior assumptions on their form, differently from, e.g., logistic regression or regression splines [5, 6]. Moreover, feature construction not only depends on the data at hand, but also on the way a specific ML algorithm can model that data. Evolutionary methods in general are highly flexible in their use due to the way they perform search (i.e., derivative free).
This makes it possible, for example, to evaluate the quality of a feature for a specific ML algorithm by directly measuring what its impact is on the performance of the ML algorithm (i.e., by training and validating the ML algorithm when using that feature).

Explaining what constructed features mean can shed light on the behavior of ML-inferred models that use such features. Reducing the number of features is also important to improve interpretability. If the original feature space is reduced to few constructed features (e.g., up to two for regression and up to three for classification), the function learned by the ML model can be straightforwardly visualized w.r.t. the new features. In fact, how to make ML models more understandable is a key topic of modern ML research, as many practical, sensitive applications exist where explaining (part of) the behavior of ML models is essential to trust their use (e.g., in medical applications) [7, 8, 9, 10]. Typically, GP for feature construction searches in a subspace of mathematical expressions. Adding to the appeal and potential of GP, these expressions can be human-interpretable if simple enough [8, 11].

Figure 1 presents an example of the potential held by such an approach: a multi-dimensional dataset transformed into a 2D one, where both the behavior of the ML algorithm and the meaning of the new features are clear, while the performance of the ML algorithm is not compromised w.r.t. the use of the original feature set (it is actually improved).

In this article we study whether GP can be useful to construct a low number of small features, to increase the chance of obtaining interpretable ML models, without compromising their accuracy (compared to using the original feature set). To this end, we design a simple, iterative feature construction scheme, and perform a wide set of experiments: we consider four types of feature construction methods (three GP algorithms and random search) and five types of ML algorithms.
We apply their combinations on 21 classification and regression datasets to determine to what extent they are capable of effectively and efficiently finding crucial and compact features for specific ML algorithms.

The main original scientific contribution of this work is an investigation of whether GP can be used to construct features that are:
• Of small-enough number, to enable visualization of the behavior of the ML model;
• Of small-enough size, to enable interpretability of the features themselves;
• Of sufficient informative power, to retain or even improve the performance of the ML algorithm, compared to using the original feature set.

These aspects are assessed under different circumstances:
• We test different search algorithms, including modern model-based GP and random search;
• We test different ML algorithms.

The remainder of this article is organized as follows. Related work is reported in Section 2. The proposed feature construction scheme is presented in Section 3. The search algorithms to construct features, as well as the considered ML algorithms, are presented in Section 4. The experimental setup is described in Section 5. Results related to performance are reported in Section 6, while results concerning interpretability are reported in Section 8. Section 10 discusses our findings, and Section 11 concludes this article.
2. Related work
In this article, we consider GP for feature construction to achieve better explainable ML models. Different forms of GP to obtain explainable ML have been explored in literature, but they do not necessarily leverage feature construction. E.g., [12] introduced a form of GP for the automatic synthesis of interpretable classifiers, generated from scratch as self-contained ML models, made of IF-THEN rules. A very different paradigm for explainable ML by GP is considered in [13], where the authors explore the use of GP to recover the behavior of a given unintelligible classifier by evolving interpretable approximation models. Other GP-based approaches and paradigms to synthesize interpretable ML models from scratch, or to approximate the behavior of pre-existing ML models by interpretable expressions, are reported in recent surveys on explainable artificial intelligence such as [8, 9].

Since in this article we particularly study what the potential of GP for feature construction is in terms of added value for explaining complex, not directly explainable models learned by various popular ML algorithms, the related work that follows describes GP approaches for feature construction. For readers interested in feature selection, we refer to a recent survey [14].

One of the first approaches of GP for feature construction is presented in [15]. There, each GP solution is a set of K features. The fitness of a set is the cross-validation performance of a decision tree [16] using that set. The results on six classification datasets show that the approach is able to synthesize a feature set that is competitive with the original one, and can also be added to the original set for further improvements. No attention is however given to the interpretability of evolved features.
The work in [17] generates one feature with Standard, tree-based GP (SGP) [3], to be added to the original set. Feature importance metrics of decision trees such as information gain, Gini index and Chi-squared are used as fitness measure. An advantage of using such fitness measures over ML performance is that they can be computed very quickly. However, they are decision tree-specific. Results show that the approach can improve prediction accuracy, and, for a few problems, it is shown that decision trees that are simple enough to be reasonably interpretable can be found.

Feature construction for high-dimensional datasets is considered in [18], for eight bio-medical binary classification problems, with 2,000 to 24,188 features. This approach is different from the typical ones, as the authors propose to use SGP to evolve classifiers rather than features, and extract features from the components (subtrees) of such classifiers. These are then used as new features for an ML algorithm. Results on K-Nearest Neighbors [19], Naive Bayes classifier [20, 21], and decision tree show that a so-found feature set can be competitive with or outperform the original one. The authors show an example where a single interpretable feature is constructed that enables linear separation of the classification examples.

Different from the aforementioned works, [22] explores feature construction for regression. An SGP-based approach is designed to tackle regression problems with a large number of features, and is tested on six datasets. Instead of using the constructed features for a different ML algorithm, SGP dynamically incorporates them within an ongoing run, to enrich the terminal set. Every α generations of SGP, the subtrees composing the best solutions become new features by encapsulation into new terminal nodes. The approach is found to improve the ability of SGP to find accurate solutions.
However, the features found by encapsulating subtrees are not interpretable because allowing subsequent encapsulations leads to an exponential growth of solution size.

A recent work that focuses on evolutionary dimensionality reduction and consequent visualization is [23], where a multi-objective, grammar-based SGP approach is employed. K feature transformations are evolved in synergy to enable, at the same time, good classification accuracy, and visualization through dimensionality reduction. The system is thoroughly tested on 42 classification tasks, showing that the algorithm performs well compared to state-of-the-art dimensionality reduction methods, and it enables visualization of the learned space. However, as trees are free to grow up to a height of 50, the constructed features themselves cannot be interpreted.

The most similar works to ours that we found are [24] and [25]. In [24], which is our previous work, the possibility of using a modern model-based GP algorithm (which we also use in our comparisons) for feature construction is explored on four regression datasets. There, focus is put on keeping feature size small, to actively attempt to obtain readable features. These features are iteratively constructed to be added to the original feature set to improve the performance of the ML algorithm, and three ML algorithms are compared (linear regression, support vector machines [26], random forest [27]). Reducing the feature space to enable a better understanding of inferred ML models is not considered.

In [25], different feature construction approaches are compared on gene-expression datasets that have a large number of features (thousands to tens of thousands) to study if evolving class-dependent features, i.e., features that are each targeted at aiding the ML algorithm detect one specific class, can be beneficial.
Similarly to us, the authors show visualizations of feature space reduced to up to three constructed features, and an example of three features that are encoded as very small, easy-to-interpret trees. However, such small features are a rare outcome as the trees used to encode features typically had more than 75 nodes. These trees are therefore arguably extremely hard to read and interpret.

Our work is different from previous research in two major aspects. First, none of the previous work principally addresses the conflicting objectives of retaining good performance of an ML algorithm while attempting to explain both its behavior (by dimensionality reduction to allow visualization), and the meaning of the features themselves (by constraining feature complexity). Second, multiple GP algorithms within a same feature construction scheme, on multiple ML algorithms, are not compared in previous work. Most of the time, it is a different feature construction scheme that is tested, using arguably small variations of SGP. Here, we consider random search, two versions of SGP, as well as another modern GP algorithm. Furthermore, we adopt both "weak" ML algorithms such as ordinary least squares linear regression and the naive Bayes classifier, as well as "strong", state-of-the-art ones, which are rarely used in literature for feature construction, such as support vector machine and random forest; on both classification and regression tasks.
3. Iterative evolutionary feature construction
We use a remarkably simple scheme to construct features. Our approach constructs $K \in \mathbb{N}^+$ features by iterating $K$ GP runs. The evolution of the $k$-th feature ($k \in \{1, \dots, K\}$) uses the previously constructed $k - 1$ features.

The dataset $D$ defining the problem at hand is split into two parts: the training set $Tr$ and the test set $Te$. This partition is kept fixed through the whole procedure. Only $Tr$ is used to construct features, while $Te$ is exclusively used for final evaluation to avoid positive bias in the results [28]. We use the notation $x_j^{(i)}$ to refer to the $i$-th feature value of the $j$-th example, and $y_j$ for the desired outcome (label for classification or target value for regression) of the $j$-th example.

The $k$-th GP run evolves the $k$-th feature. An example is shown in Figure 2. Each solution in the population competes to become the new feature $x^{(k)}$, which represents a transformation of the original feature set. In every run, the population is initialized at random.

We evaluate the fitness of a feature of the $k$-th run by measuring the performance of the ML algorithm on a dataset that contains that feature and the previously evolved $k - 1$ features. Features constructed in previous iterations are not used as terminal nodes in the $k$-th run. This prevents the generation of nested features, which could harm interpretability.

At the end of the $k$-th run, the best feature is stored and its values $x_j^{(k)}$ are added to $Tr$ and $Te$ for the next iterations.
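The iterative scheme described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: a plain random sampler over a fixed pool of candidate expressions stands in for a GP run, and a leave-one-out 1-nearest-neighbour regressor stands in for the cross-validated ML algorithm. All names (`construct_features`, `random_search`, `loo_1nn_error`, `pool`) are hypothetical.

```python
import random


def loo_1nn_error(rows, y):
    """Leave-one-out squared error of a 1-nearest-neighbour regressor.

    Assumption: a cheap stand-in for the cross-validated ML algorithm.
    """
    total = 0.0
    for j, row in enumerate(rows):
        nearest = min(
            (i for i in range(len(rows)) if i != j),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(row, rows[i])),
        )
        total += (y[j] - y[nearest]) ** 2
    return total / len(rows)


def random_search(evaluate, pool, n_samples, rng):
    """Sample candidates from `pool`; keep the one with the lowest error."""
    best, best_err = None, float("inf")
    for _ in range(n_samples):
        candidate = rng.choice(pool)
        err = evaluate(candidate)
        if err < best_err:
            best, best_err = candidate, err
    return best


def construct_features(X_train, y_train, K, pool, n_samples=20, seed=0):
    """Run K searches; the k-th search is scored together with the k-1
    features found so far, but candidates only read the original features."""
    rng = random.Random(seed)
    features, columns = [], []  # chosen functions and their training values

    for _ in range(K):
        def evaluate(f):
            new_column = [f(x) for x in X_train]
            rows = list(zip(*columns, new_column))  # previous + new feature
            return loo_1nn_error(rows, y_train)

        best = random_search(evaluate, pool, n_samples, rng)
        features.append(best)
        columns.append([best(x) for x in X_train])
    return features, columns


# Toy data: the target is the product of the two original features.
X = [(1.0, 2.0), (2.0, 1.0), (3.0, 3.0), (0.5, 4.0), (4.0, 0.5), (2.0, 2.5)]
y = [a * b for a, b in X]
pool = [
    lambda x: x[0],
    lambda x: x[1],
    lambda x: x[0] * x[1],
    lambda x: x[0] + x[1],
]
features, columns = construct_features(X, y, K=2, pool=pool)
```

Because candidates are functions of the original features only, no constructed feature is nested inside another, which mirrors the restriction stated above. The test set would be transformed with the same `features` after each iteration.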
The fitness of a feature is computed by measuring the performance (i.e., error) of the ML algorithm when the new feature is added to $Tr$. We consider the $C$-fold cross-validation error rather than the training error to promote generalization and prevent overfitting. The pseudo code of the evaluation function is shown in Algorithm 1.

Figure 2: Construction of the $k$-th feature and computation of the $k$-th test error. Evolved features use the features of the original dataset (not shown) and random constants as terminal nodes. Dashed arrows represent inputs, solid arrows represent outputs.

Specifically, the $C$-fold cross-validation error is computed by partitioning $Tr$ into $C$ splits. For each iteration $c = 1, \dots, C$, a different split is used for validation (set $V_c$), and the remaining $C - 1$ splits are used for training (set $Tr_c$). The mean validation error is the final result.

For classification tasks, in order to take into account both multiple and possibly imbalanced class distributions, the prediction error is computed as 1 minus the macro $F_1$ score:

$$1 - F_1 = 1 - \frac{1}{|\text{classes}|} \sum_{\gamma \in \text{classes}} F_1^{\gamma} = 1 - \frac{1}{|\text{classes}|} \sum_{\gamma \in \text{classes}} 2\, \frac{\dfrac{TP^{\gamma}}{TP^{\gamma} + FP^{\gamma}} \cdot \dfrac{TP^{\gamma}}{TP^{\gamma} + FN^{\gamma}}}{\dfrac{TP^{\gamma}}{TP^{\gamma} + FP^{\gamma}} + \dfrac{TP^{\gamma}}{TP^{\gamma} + FN^{\gamma}}},$$

where $TP^{\gamma}$, $FN^{\gamma}$, $FP^{\gamma}$ are the true positive, false negative, and false positive classifications for the class $\gamma$, respectively. If the computation of $F_1^{\gamma}$ results in […], we set $F_1^{\gamma} =$ […].

Algorithm 1 Computation of the fitness of a feature s
function ComputeFeatureFitness(s)
    Tr′ ← AddFeatureToCurrentTrainingSet(s)
    error ← 0
    for c = 1, …, C do
        T_c, V_c ← SplitSet(c, C, Tr′)
        M ← TrainMLModel(T_c)
        error ← error + ComputeError(M, V_c)
    return error / C
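As a rough, runnable counterpart to Algorithm 1, the sketch below computes the C-fold cross-validated macro-F1 error of a candidate feature. Several parts are assumptions made for illustration only: the folds are interleaved (every C-th example), a nearest-centroid classifier stands in for the ML algorithm, and a class whose F1 is undefined contributes 0. The function names are hypothetical.

```python
def macro_f1_error(y_true, y_pred):
    """1 minus the macro-averaged F1 score over the classes in y_true."""
    classes = sorted(set(y_true))
    total = 0.0
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn  # F1 = 2TP / (2TP + FP + FN)
        total += 2 * tp / denom if denom else 0.0  # assumption: undefined -> 0
    return 1.0 - total / len(classes)


def train_nearest_centroid(rows, labels):
    """Stand-in ML algorithm: predict the class with the closest centroid."""
    sums, counts = {}, {}
    for row, label in zip(rows, labels):
        acc = sums.setdefault(label, [0.0] * len(row))
        for d, value in enumerate(row):
            acc[d] += value
        counts[label] = counts.get(label, 0) + 1
    centroids = {l: [s / counts[l] for s in acc] for l, acc in sums.items()}

    def predict(row):
        return min(
            centroids,
            key=lambda l: sum((a - b) ** 2 for a, b in zip(row, centroids[l])),
        )

    return predict


def compute_feature_fitness(feature_values, current_rows, y, C,
                            train_model, compute_error):
    """Mean validation error over C folds for the feature-extended set."""
    # Tr' <- current training set with the candidate feature appended
    rows = [tuple(r) + (v,) for r, v in zip(current_rows, feature_values)]
    n = len(rows)
    total = 0.0
    for c in range(C):
        val_idx = set(range(c, n, C))  # assumption: interleaved folds
        train_rows = [rows[i] for i in range(n) if i not in val_idx]
        train_y = [y[i] for i in range(n) if i not in val_idx]
        model = train_model(train_rows, train_y)
        val_true = [y[i] for i in sorted(val_idx)]
        val_pred = [model(rows[i]) for i in sorted(val_idx)]
        total += compute_error(val_true, val_pred)
    return total / C


# A feature that separates the two classes perfectly yields zero error.
labels = ["a", "a", "a", "a", "b", "b", "b", "b"]
values = [0.0, 0.1, 0.2, 0.3, 5.0, 5.1, 5.2, 5.3]
fitness = compute_feature_fitness(
    values, [()] * 8, labels, 2, train_nearest_centroid, macro_f1_error
)
```

Passing the model trainer and the error function as arguments keeps the fitness routine independent of the ML algorithm, in line with the scheme being applied to several algorithms.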
Computing the fitness of a feature is particularly expensive, as it consists of a $C$-fold cross-validation of the ML algorithm. This limits the feasibility of, e.g., adopting large population sizes and large numbers of evaluations for the GP algorithms.

We therefore attempt to prevent unnecessary cross-validation calls, by assessing if features meet four criteria. Let $n$ be the number of examples in $Tr$. The criteria are the following:

1. The feature is not a constant. We avoid evaluating constant features as they are likely to be useless for many ML algorithms, which internally already compute an intercept.

2. The feature does not contain extreme values that may cause numerical errors, i.e., with absolute value below a lower-bound $\beta_\ell$ or above an upper-bound $\beta_u$. Here, we set $\beta_\ell =$ […] and $\beta_u =$ […] (none of the datasets considered here have values exceeding these bounds).

3. The feature is not equivalent to one constructed in the previous $k - 1$ iterations. Equivalence is determined by checking the values available in $Tr$, i.e., equivalence holds if: $\exists\, i \in \{1, \dots, k-1\} : \forall j \in Tr,\; x_j^{(k)} = x_j^{(i)}$. Note that a constructed feature that is equivalent to a feature of the original feature set can be valid, as long as no other previously constructed feature exists that is already equivalent. Thus, our approach can in principle perform pure feature selection.

4. The values of the feature in consideration have changed since the last time the feature was evaluated. GP variation can change the syntax of a feature without necessarily affecting its behavior (e.g., inserting a multiplication by 1 will not change the final values a feature computes). If the values do not change, then the fitness of the feature will not change either (see Sec. 3.2). We therefore avoid unnecessary re-computations of feature fitnesses, by caching the feature values prior to GP variation, and checking whether they have changed after variation.

The computational effort for each criterion is $O(n)$ (it is $O((k-1)n)$ for criterion 3, however in our experiments $k \ll n$). The fitness of a feature failing criterion 1, 2, or 3 is set to the maximum possible error value. If criterion 4 fails, the fitness remains the same (although performing cross-validation may lead to slightly different results when using stochastic ML algorithms like random forest).
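The four criteria above amount to a cheap pre-check that decides whether cross-validation needs to run at all. The following is a small sketch of such a check, assuming exact value comparison for equivalence and treating exact zeros as non-extreme (the text does not say how zeros are handled). The function name, the return labels, and the bounds used in the example calls are hypothetical; the bound values used in the experiments are not reproduced here.

```python
import math


def precheck(values, previous_columns, cached_values, beta_low, beta_up):
    """Decide what to do with a candidate feature before cross-validation.

    Returns 'reject'   -> assign the maximum possible error (criteria 1-3),
            'reuse'    -> keep the previously computed fitness (criterion 4),
            'evaluate' -> run the expensive cross-validation.
    """
    # Criterion 1: the feature must not be constant on the training set.
    if all(v == values[0] for v in values):
        return "reject"

    # Criterion 2: no extreme magnitudes (assumption: exact 0 is allowed).
    for v in values:
        if not math.isfinite(v) or abs(v) > beta_up or 0 < abs(v) < beta_low:
            return "reject"

    # Criterion 3: not identical, example by example, to an earlier
    # constructed feature.
    if any(column == values for column in previous_columns):
        return "reject"

    # Criterion 4: values unchanged since the last evaluation.
    if cached_values is not None and cached_values == values:
        return "reuse"

    return "evaluate"


# Illustrative bounds only.
LOW, UP = 1e-6, 1e6
constant = precheck([1.0, 1.0, 1.0], [], None, LOW, UP)
extreme = precheck([1.0, 2e9, 3.0], [], None, LOW, UP)
duplicate = precheck([1.0, 2.0, 3.0], [[1.0, 2.0, 3.0]], None, LOW, UP)
unchanged = precheck([1.0, 2.0, 3.0], [[3.0, 2.0, 1.0]], [1.0, 2.0, 3.0], LOW, UP)
fresh = precheck([1.0, 2.0, 3.0], [], [1.0, 2.0, 4.0], LOW, UP)
```

Each check is a single pass over the n training values (k − 1 passes for the equivalence check), which matches the stated cost of O(n) and O((k − 1)n).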
4. Considered search algorithms and machine learn-ing algorithms
We consider SGP, Random Search (RS), and the GP instance of the Gene-pool Optimal Mixing Evolutionary Algorithm (GP-GOMEA) as competing search algorithms to construct features. SGP is widely used in feature construction (see related work in Sec. 2). RS is not typically considered, yet we believe it is important to assess whether evolution does bring any benefit over random enumeration within the confines of our study, i.e., when forcing to find small features. GP-GOMEA is a recently introduced GP algorithm that has proven to be particularly proficient in evolving accurate solutions of limited size [11, 29, 24].

As ML algorithms, we consider the Naive Bayes classifier (NB), ordinary least-squares Linear Regression (LR), Support Vector Machines (SVM), Random Forest (RF), and eXtreme Gradient Boosting (XGB). NB is used only for classification tasks, LR only for regression tasks, SVM, RF, and XGB for both tasks. We provide more details in the following sections.
All search algorithms use the fitness evaluation function. A feature s is evaluated by first checking whether the four criteria of Section 3.3 are met, and then, if the outcome is positive, by running the ML algorithm over the feature-extended dataset.

For SGP, we use subtree crossover and subtree mutation, picking the depth of subtree roots uniformly randomly as proposed in [30]. The candidate parents for variation are chosen with tournament selection. Since we are interested in constructing small features so as to increase the chances they will be interpretable, we consider two versions of SGP. The first is the classic one where solutions are free to grow to tree heights typically much larger than the one used for tree initialization. In the following, the notation SGP refers to this first version. The second one uses trees that are not allowed to grow past the initial maximum tree height. We call this version bounded SGP, and use the notation SGP_b.

RS is realized by continuously sampling and evaluating new trees, keeping the best [3]. Like for SGP_b, a maximum tree height is fixed during the whole run. If evolution is hypothetically no better than RS, then we expect that SGP_b and GP-GOMEA will construct features that are no better than the ones constructed by RS.

GP-GOMEA is a recently introduced GP algorithm that has been found to deliver accurate solutions of small size on benchmark problems [29], and to work well when a small size is enforced in symbolic regression [11, 24]. GP-GOMEA uses a tree template fixed by a maximum tree height (which can include intron nodes to allow for unbalanced tree shapes) and performs homologous variation, i.e., mixed tree nodes come from the same positions in the tree. Each generation, prior to mixing, a hierarchical model that captures interdependencies (linkage) between nodes is built (using mutual information).
This model, called Linkage Tree (LT), drives variation by indicating what nodes should be changed en bloc during mixing, to avoid the disruption of patterns with large linkage.

The LT has been shown to enable GP-GOMEA to outperform subtree crossover and subtree mutation of SGP, as well as the use of a randomly built LT, i.e., the Random Tree (RT), on problems of different nature [11, 29]. However, the LT requires sufficiently large population sizes to be accurate and beneficial (e.g., several thousand solutions in GP for symbolic regression) [11]. Because in the framework of this article fitness evaluations use the cross-validation of an ML algorithm, we cannot afford to use large population sizes. Accordingly, we found the adoption of the LT to not be superior to the adoption of the RT under these circumstances in preliminary experiments. Therefore, for the most part, we adopt GP-GOMEA with the RT (GP-GOMEA_RT). This means we effectively compare random hierarchical homologous variation with subtree-based variation. An example of adopting the LT and large population sizes for feature construction is provided in Section 10.

We now briefly describe the ML algorithms used in this work: NB, LR, SVM, RF, and XGB. NB and LR are less computationally expensive compared to SVM, RF, and XGB. Details on the computational time complexity of these algorithms are reported at: https://bit.ly/ […]

[…] We use the mlpack implementation of NB [31] and assume the data to be normally distributed (default setting).

Similarly to NB, LR is often used as a baseline for regression tasks, as it is simple and fast. LR assumes that the target variable can be explained by a linear combination of the features [20]. We use the mlpack implementation of LR [31].

SVM is a powerful ML algorithm that can be used for non-linear classification and regression [26, 32]. We use the libsvm C++ implementation [32].
We consider the Radial Basis Function (RBF) kernel, which works well in practice for many problems, with C-SVM for classification, and ε-SVM for regression.

Table 1: Parameter settings of the GP algorithms.

Parameter                        SGP(b)                  GP-GOMEA_RT
Population size                  100                     100
Initialization method            Ramped Half and Half    Half and Half
Initialization max tree height   2–6 (2 or 4)            2 or 4
Max tree height                  17 (2 or 4)             2 or 4
Variation                        SX 0.9, SM 0.[…]        […]
Function set                     { +, ×, −, ÷, ·, √·, log_p, exp } for all
Terminal set                     { x^(i), ERC } for all

RF is an ensemble ML algorithm which, like SVM, can be used for both classification and regression and can infer non-linear patterns [27]. RF builds an ensemble of (typically deep) decision trees, each trained on a sample of the training set (bagging). At prediction time, the mean (or maximum agreement) prediction of the decision trees is returned. We use the ranger C++ implementation [33].

XGB is, like RF, an ensemble ML algorithm, typically based on decision trees, and capable of learning non-linear models [34]. XGB works by boosting, i.e., stacking together multiple weak estimators (small decision trees) that fit the data in an incremental fashion. We use the dmlc C++ implementation (https://bit.ly/
5. Experiments
We perform 30 runs of our Feature Construction Scheme (FCS), with SGP, SGP_b, RS, and GP-GOMEA_RT, in combination with each ML algorithm (NB only for classification and LR only for regression), on each problem. Each run of the FCS uses a random train-test split of 80%-20%, and considers up to K = […] GP-GOMEA_RT performs more evaluations than SGP per generation [29].

For GP-GOMEA_RT, SGP_b, and RS, we consider two levels of maximum tree height h: 2 and 4. This choice yields a maximum solution size of 7 and 31 respectively (using function nodes with a maximum arity r = 2). […] h = […] h = […] ·, √·, log_p, exp. We do not consider bigger tree heights as resulting features may likely be impossible to interpret, defying a key focus of this work.

Other parameter settings used for the GP algorithms are shown in Table 1. SGP_b uses the same settings as SGP, except for the maximum tree height (at initialization and along the whole run), which is set to the same value as for GP-GOMEA_RT. In GP-GOMEA_RT we use the Half and Half (HH) tree initialization method instead of the Ramped Half and Half (RHH) [3] commonly used for SGP. This proved to be beneficial since GOM varies nodes instead of subtrees [11, 24]. For both HH and RHH, syntactical uniqueness of solutions is enforced for up to 100 tries [3]. In GP-GOMEA_RT we additionally avoid sampling trees having a terminal node as root by setting the minimum tree height of the grow method to 1. This is not done for SGP and SGP_b because, differently from GP-GOMEA_RT where homologous nodes are varied, subtree root nodes for subtree crossover (SX) and subtree mutation (SM) are chosen uniformly randomly. RS samples new trees using the same initialization method as SGP_b, i.e., RHH.

The division operator ÷ used in the function set is the analytic quotient operator ($a \div b = a / \sqrt{1 + b^2}$), which was shown to lead to better generalization performance than protected division [35]. The logarithm is protected, $\log_p(\cdot) = \log(|\cdot|)$ and $\log_p(0) = 0$, and so is the square root operator. The terminal set contains the original feature set, and an Ephemeral Random Constant (ERC) [4] with values uniformly sampled between the minimum and maximum values of the features in the original training set, i.e., $[\min x_j^{(i)}, \max x_j^{(i)}],\ \forall i, j \in Tr$.

The hyperparameter settings for the SVM, RF and XGB are shown in Table 2, and are mostly default [27, 32, 33] (for XGB, we referred to https://bit.ly/ […]

[…] that can be considered traditional, i.e., they have small to moderate dimensionality (number of features). We mostly study this type of dataset because we seek to find small constructed features that can be interpreted. Hence, they can represent a transformation of only a limited number of original features. Details on the datasets are reported in Table 3. Rows with missing values are omitted. Most datasets are taken from the UCI Machine Learning repository, with the exception of Dow Chemical and Tower, which come from GP literature [36, 37].

We further consider a very high-dimensional dataset from UCI (https://bit.ly/ […],531 features are considered, in 801 examples. Since large computational resources are needed to handle this dataset, we consider only NB as ML algorithm for feature construction upon this data.
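The protected operators and the ERC sampling described in the experimental setup above are simple enough to write down directly. The sketch below is illustrative only: the analytic quotient and the protected logarithm follow the definitions given in the text, while the protected square root is assumed to take the absolute value of its argument (the text only says it is protected). Function names are hypothetical.

```python
import math
import random


def analytic_quotient(a, b):
    """a / sqrt(1 + b^2): a smooth division that never divides by zero."""
    return a / math.sqrt(1.0 + b * b)


def protected_log(x):
    """log(|x|), with the value at 0 defined as 0."""
    return 0.0 if x == 0 else math.log(abs(x))


def protected_sqrt(x):
    """sqrt(|x|). Assumption: protection is done via the absolute value."""
    return math.sqrt(abs(x))


def sample_erc(X_train, rng):
    """Draw an ephemeral random constant uniformly between the smallest and
    largest feature value found anywhere in the training set."""
    lowest = min(min(row) for row in X_train)
    highest = max(max(row) for row in X_train)
    return rng.uniform(lowest, highest)


X_train = [(0.5, -2.0, 3.0), (1.5, 4.0, -1.0)]
constant = sample_erc(X_train, random.Random(42))
```

Unlike protected division, the analytic quotient has no discontinuity around b = 0, since its denominator is always at least 1.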
6. Results: performance on traditional datasets
The results described in this section aim at assessingwhether it is possible to construct few and small featuresthat lead to an equal or better performance than the origi-nal set, and whether some search algorithms can constructbetter features than others.
We begin by observing the dataset-wise aggregated performance of FCS for the different GP algorithms and the different ML algorithms, separately for classification and regression.

Figure 3 shows dataset-wise aggregated results obtained for NB, SVM, RF, and XGB, for the 10 traditional classification tasks. Each data point is the mean among the dataset-specific medians of macro F1 from the 30 runs.

Footnotes:
The datasets are available at http://goo.gl/
http://archive.ics.uci.edu/ml/

Figure 3: Aggregated results on the classification datasets. Panels: NB, SVM, RF, XGB (training and test), for each level of maximum tree height h. Horizontal axis: Number of features. Vertical axis: Average of median F1 […]

Figure 4: Aggregated results on the regression datasets. Panels: LR, SVM, RF, XGB (training and test), for each level of maximum tree height h. Horizontal axis: Number of features. Vertical axis: Average of median R² score obtained on 30 runs for each dataset.

Table 2: Salient hyper-parameter settings of SVM, RF, and XGB.

SVM
  Kernel               RBF
  Cost                 1
  Epsilon              0.1
  Tolerance            0.001
  Gamma                […] k
  Shrinking            Active
RF
  Number of trees      100
  Bagging              sampling with replacement
  Classification mtry  √[…]
  […]                  […]
  Min node size        1 classification, 5 regression
  Split rule           Gini classification, Variance regression
XGB
  Number of trees      100
  Booster              gbtree
  Max depth            6
  Objective            multiclass softmax, MSE regression
  Learning rate        0.3

Table 3: Traditional classification and regression datasets.

Dataset                  # Features   # Examples   # Classes
Classification
  Cylinder Bands         39           277          2
  Breast Cancer Wisc.    29           569          2
  Ecoli                  7            336          8
  Ionosphere             34           351          2
  Iris                   4            150          3
  Madelon                500          2600         2
  Image Segmentation     19           2310         7
  Sonar                  60           208          2
  Vowel                  9            990          11
  Yeast                  8            1484         10
Regression
  Airfoil                6            1503         –
  Boston Housing         13           506          –
  Concrete               9            1030         –
  Dow Chemical           57           1066         –
  Energy Cooling         9            768          –
  Energy Heating         9            768          –
  Tower                  26           4999         –
  Wine Red               12           1599         –
  Wine White             12           4898         –
  Yacht                  7            308          –

In general, the use of only one constructed feature does not perform as well as the use of the original feature set. Constructing more features improves the performance, but with diminishing returns.

Specifically for NB, the use of two constructed features is already preferable to the use of the original feature set. This is likely due to the fact that NB assumes complete independence between the provided features, and this can be implicitly tackled by FCS. SGP (unbounded) is the best performing algorithm as it can evolve arbitrarily complex features; however, the magnitude of improvement of the macro F1 score with respect to GP-GOMEA_RT and SGP_b is limited. For h = […], K = 5, GP-GOMEA_RT reaches the performance of SGP. GP-GOMEA_RT is typically slightly better than SGP_b, and RS has worse performance. Training and test F1 scores do not differ much for any feature construction algorithm, meaning that overfitting is not an issue for NB. Rather, compared to the other ML algorithms, NB underfits.

The performance of FCS for SVM has an almost identical pattern to the one observed for NB, except for the fact that the performance is found to be consistently better. However, for SVM it is preferable to use the original feature set rather than few constructed features. This is evident in terms of training performance, but less at test time. In fact, using only 5 constructed features leads to similar test performance compared to using the original set. The GP algorithms compare to each other similarly to when using NB. Compared to NB, it can be seen that SVM exhibits larger gaps between training and test results, suggesting that some overfitting takes place, especially when the original feature set is used.

The way performance improves for RF by constructing features is similar to the one observed for NB and SVM. However, for RF the differences between the search algorithms are particularly small: notice that using RS leads to performance close to that obtained by using the other GP algorithms, compared to the SVM case. Moreover, virtually no difference can be seen between GP-GOMEA_RT and SGP_b. This suggests that RF already works well with less refined features. Now, the features constructed by SGP are no longer the best performing at test time. This is likely because SGP evolves larger, more complex features than the other algorithms (see Sec. 6.1.3), making RF overfit. In fact, RF exhibits the largest difference between training and test results compared to NB and SVM, for any feature construction algorithm and h limit.
Still, the test results of RF are slightly better than those of SVM and markedly better than those of NB, meaning that the latter two are underfitting.

The training and test performance obtained when using XGB is similar to the one obtained when using RF, but the differences between the search algorithms are even less marked than for RF. Some differences can be seen for K = 1 (SGP better than GP-GOMEA_RT, and GP-GOMEA_RT better than SGP_b and RS), but this difference is much less marked on the test set. When more features are constructed, essentially all search algorithms deliver the same performance. XGB seems to be able to construct non-linear relationships even better than RF. As to potential overfitting, the trend of differences between training and test performance that can be observed for XGB mirrors the one visible for RF.

As to maximum tree height, allowing the constructed features to be bigger (h = 4 vs. h = 2) moderately improves the performance. Interestingly, GP-GOMEA_RT with h = ...

Results on the regression tasks are shown in Figure 4, dataset-wise aggregated for LR, SVM, RF, and XGB. We report the results in terms of coefficient of determination, i.e., $R^2(y, \bar{y}) = 1 - \mathrm{MSE}(y, \bar{y}) / \mathrm{var}(y)$. For the four ML algorithms, results overall follow the same pattern. SGP is typically better, especially for LR and SVM, although constructing more features reduces the performance gap with the other GP algorithms. GP-GOMEA_RT is slightly, yet consistently, the best performing within the maximum tree height limitation of 2, while SGP_b is visibly preferable only when a single feature is constructed for LR and SVM, for h = 4. Differently from the classification case, two features are typically enough to reach the performance of the original feature set for all ML algorithms except XGB. Moreover, for LR, SVM, and RF, training and test performance are similar, meaning that no considerable overfitting is taking place, regardless of the feature construction algorithm used or the limit on h. This however is not the case for XGB, where a large performance gap is encountered. Still, the test performance obtained when using XGB is ultimately slightly better than the one obtained for RF.

As for classification, allowing for larger trees results in better performance overall, and reduces the gap between SGP and the other GP algorithms. With XGB, all search algorithms perform similarly.

Figure 5 shows the aggregated feature size for the different GP algorithms and RS. The aggregated solution size is computed by taking the median solution size per run, then averaging over datasets, and finally averaging over ML algorithms (classification and regression are considered together). The figure shows how the known tendency of SGP to bloat sets it apart from the algorithms working with a strict tree height limitation. SGP features are so large that it is nearly impossible to interpret them (see Sec. 8.1).

RS finds the smallest features for both height limits h = 2 and h = 4. Considering that GP-GOMEA_RT and SGP_b generate trees within the same height bounds as RS, we conclude that it is the variation operators that allow finding larger trees with improved fitness within the height limit. GP-GOMEA_RT seems to construct slightly, yet consistently, larger trees than SGP_b.

For SGP, it can be seen that subsequently constructed features are smaller (this is barely visible for GP-GOMEA_RT and SGP_b as well). This is interesting because we do not use any mechanism to promote smaller trees. This result is likely linked to the diminishing returns in performance observed in Figures 3 and 4: constructing new complex and informative features becomes harder with the number of FCS iterations.

Figure 5: Aggregated feature size for k = 1, ..., 5. Solid and dotted lines represent solution size for the two maximum tree heights (h = 2 and h = 4).

The aggregated results of Section 6.1 show moderate differences between GP-GOMEA_RT and SGP_b. These are arguably the most interesting algorithms to compare in depth, as they are able to construct small features that lead to good performance (RS typically constructs less informative features, while SGP constructs very large ones).

We perform statistical significance tests to compare GP-GOMEA_RT and SGP_b. We consider their median performance on the test set T_e, obtained by the FCS, and also compare it with the use of the original feature set, for each ML algorithm and each dataset. In our case, the treatments of our significance tests are the two search algorithms (i.e., GP-GOMEA_RT and SGP_b) and the original feature set, while the subjects are the configurations given by pairing ML algorithms and datasets [38].

We first perform a Friedman test to assess whether differences exist among the use of different treatments (GP algorithms and original feature set) upon multiple subjects (ML algorithm-dataset combinations). As post-hoc analysis, we use pairwise Wilcoxon signed-rank tests, paired by subject (ML algorithm-dataset combination), to see how the treatments compare to each other [38]. We adopt the Holm correction method to prevent reporting false positive results that might have happened due to pure chance [39].

We consider both h = 2 and h = 4, and focus on K = 2, since consideration of only two constructed features makes interpretation easier, and allows human visualization (see Sec. 8.1).

For both h = 2, 4, the Friedman test strongly indicates differences between GP-GOMEA_RT, SGP_b, and the use of the original feature set (p-value ≪ 0.05). Figure 6 (top) shows the Holm-corrected p-values obtained by the pairwise Wilcoxon tests for classification, where the alternative hypothesis is that the row allows for larger macro F1 scores than the column. No significant differences between GP-GOMEA_RT and SGP_b are found for either h = 2 or h = 4. Both GP algorithms can deliver constructed features that are competitive with the use of the original feature set. The original feature set is not significantly better than using feature construction. Moreover, for GP-GOMEA_RT and h = 4, the hypothesis that feature construction is not better than the original feature set can be rejected with a corrected p-value below 0.1. The latter result appears to be in contrast with the results from Fig. 3 for SVM, RF, and XGB, where it can be seen that the construction of only two features does, on average, lead to slightly worse test results than using the original feature set. Nonetheless, the opposite is true for NB, and with rather large magnitude. A more in-depth analysis on this is provided in Sec. 6.3.

As for classification datasets, the Friedman test indicates that differences are present between the treatments. Figure 6 (bottom) shows the Holm-corrected p-values obtained by the pairwise Wilcoxon tests for regression.

The statistical tests confirm the hypothesis that the algorithms are capable of providing constructed features that are more informative than the original feature set, as observed in Fig. 4 for the regression datasets. Now, GP-GOMEA_RT is significantly better than SGP_b when h = 2. For h = 4, instead, GP-GOMEA_RT is not found to be significantly better than SGP_b.

Figure 6: Holm-corrected p-values of pairwise Wilcoxon tests on test performance (top: classification, bottom: regression; panels for h = 2 and h = 4). Rows are tested to be significantly better than columns. Orig stands for the original feature set.

Results presented in Sec. 6.1 indicate that our FCS brings most benefit if used with the weak ML algorithms. We now report, for each ML algorithm, on how many datasets 2 features constructed using GP-GOMEA (with h = 2 and h = 4) lead to statistically significantly (using Holm-corrected pairwise Wilcoxon test, p-value < 0.05) better, equal, or worse test performance compared to using the original feature set (Table 4).

... set is preferable. However, for some datasets reducing the space to two compact features without compromising performance is still possible.

The use of the original feature set is generally hardest to beat when adopting RF or XGB. For RF, in the regression case with h = 4, FCS brings benefits on the datasets Airfoil, Energy Cooling, Energy Heating, and Yacht; and performs on par with the use of the original feature set on the datasets Boston Housing and Concrete. These datasets are the ones with the smallest number of original features. We find similar results for SVM and for XGB. In the latter case, FCS is, in terms of statistical significance, equal to the original feature set only on Energy Cooling, Energy Heating, and Yacht. It is reasonable to expect that FCS works well when few features can be combined.

In the classification case, findings are different. For RF and h = 4, the datasets where using two constructed features brings similar or better results than using the original feature set are Breast Cancer Wisconsin and Iris. The latter does have a small number of original features (4), but the former has more than several other datasets (29). Furthermore, the datasets where FCS helps are different for SVM: FCS performs as well as the original feature set on Iris and Cylinder Bands (39 features), and better on Madelon (500 features) and Image Segmentation (19 features). Regarding XGB, there is no dataset where FCS is superior to the original feature set, but it is also not worse on almost half of the datasets. For classification datasets, we cannot conclude that a small cardinality of the original feature set is a good indication that feature construction will work well. Furthermore, feature construction influences different ML algorithms in different ways.

Table 4: Number of datasets where using two features constructed with GP-GOMEA results in significantly better/equal/worse test performance compared to using the original feature set. Columns: NB, SVM, RF, XGB (classification) and LR, SVM, RF, XGB (regression); rows: h = 2 and h = 4.
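The Holm correction applied to the pairwise Wilcoxon p-values above can be sketched in a few lines. This is a minimal illustration rather than the code used for the experiments: the function name `holm_correction` and the example p-values are hypothetical, and the raw p-values are assumed to come from a statistics package that implements the Wilcoxon signed-rank test.

```python
def holm_correction(p_values):
    """Return Holm step-down adjusted p-values, in the input order.

    The i-th smallest raw p-value (i = 0, 1, ...) is multiplied by (m - i),
    capped at 1, and forced to be non-decreasing along the sorted order.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)  # enforce monotonicity
        adjusted[idx] = running_max
    return adjusted


# Hypothetical raw p-values of three pairwise comparisons.
raw = [0.01, 0.04, 0.03]
corrected = holm_correction(raw)
significant = [p < 0.05 for p in corrected]
```

With these made-up inputs only the first comparison remains significant at the 0.05 level after correction, which illustrates how the correction guards against false positives when several pairwise tests are run.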
7. Results: performance on a highly-dimensional dataset
We further consider the RNA-Seq cancer gene expression dataset, comparing FCS by GP-GOMEA_RT with h = 4 ... (...,000 features vs. fewer than 1,000 examples) that actual patterns cannot be retrieved. The use of FCS forces NB to use only a small number of constructed features, which, in turn, can contain only a small number of the original features. Essentially, FCS provides both the advantages of feature construction and feature selection. This leads to large F1 scores already when only two features are constructed.
8. Results: improving interpretability
The results presented in Sec. 6 and 7 showed that the original feature set can already be outperformed by two small constructed features in many cases. We now aim at assessing whether constraining feature size can enable interpretability of the features themselves, as well as whether extra insight can be achieved by plotting and visualizing the behavior of a trained ML model in the new two-dimensional space.

Figure 7: Comparison between the use of the original feature set and FCS with GP-GOMEA_RT (h = 4) on high-dimensional gene expression data. The vertical axis reports the median F1 score, the horizontal axis reports the number of features constructed by FCS. Stars indicate statistically significant superiority (p-value < 0.05) of one method w.r.t. the other.
Table 5 shows some examples of features constructed by GP-GOMEA_RT, for h = 2 and h = 4. We report the first feature constructed for the K = ... K = 2: from F1 = 0.... to F1 = 0.63 for h = 2, and to F1 = 0.66 for h = 4 ... R² obtained with the original feature set is 0.59, the one with two features constructed by GP-GOMEA_RT is 0.76 (0....) ...

For h = 2, we argue that constructed features are mostly easy to interpret. For example, the feature shown for LR on Concrete tells us that aging (x^(8)) has a negative impact on concrete compressive strength, whereas using more water (x^(4)) than cement (x^(1)) has a positive effect (both features are in kg/cm³). The impact of other features is less important (within the data variability of the dataset). For h = 4, some features can be harder to read and understand; however, many are still accessible. This is mostly because, even though the total solution size reachable with h = 4 ...

... GP-GOMEA_RT in Sec. 7 are also not excessively complex to be understood. For example, the first two features for the median run combine the gene expression levels x^(18382), x^(8014), x^(3885), x^(17316), x^(7491), x^(7296), x^(19333), x^(5524), x^(18053), x^(5579), x^(4417), x^(14153), x^(19751), x^(13744), and x^(16581) through sums, differences, products, and square roots. Even though the second feature is somewhat involved, it is arguably still possible to carefully analyze it and obtain a picture of how gene expression levels interact.

Overall, we cannot draw a strict conclusion on whether the features found by our approach are interpretable, as interpretability is a subjective matter and, to date, no clear-cut metric exists [7, 8] (we discuss this more in Sec. 10). Yet, it appears evident that enforcing a restriction on their size is a necessary condition. We generally find that features using 15 or more nodes start to be hard to interpret w.r.t. our experimental settings, i.e., using our function set. Lastly, features constructed without a strict size limitation (by SGP) are in general very large, and thus extremely hard to understand. As an example, Figure 8 shows the first of the two features with median test performance constructed by SGP for LR on Concrete (this is smaller than the first feature found by SGP for NB on Ecoli).
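To relate the height limit h to the number of nodes a feature can contain, the following sketch counts the nodes of a full tree. It rests on assumptions that are not restated in this section: height is counted in edges (a single terminal has height 0) and every function node has at most two arguments. The function names are illustrative.

```python
def max_nodes(height, arity=2):
    """Number of nodes in a full tree with the given height and arity.

    Assumption: height is measured in edges, so a lone terminal has height 0.
    """
    return sum(arity ** level for level in range(height + 1))


def max_terminals(height, arity=2):
    """Number of leaves (input features or constants) in the same full tree."""
    return arity ** height


sizes = {h: (max_nodes(h), max_terminals(h)) for h in (2, 4)}
```

Under these assumptions a tree of height 2 has at most 7 nodes, well below the 15-node mark at which features start to become hard to read, while a tree of height 4 can hold up to 31 nodes. This is consistent with features for h = 4 being only sometimes difficult to interpret.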
The construction of a small number of interpretable features can enable a better understanding of the problem and of the learned ML models. The case where up to two features are constructed is particularly interesting, since it allows visualization.

Table 5: Examples of features constructed by GP-GOMEA_RT with h ∈ {2, 4}, K = 2, for NB on Ecoli, and for LR on Concrete.

Figure 8: Example of a relatively "small" feature constructed by SGP, derived from a tree with 96 nodes. Note that the analytic quotient operator (÷) and the protected logarithm (log_p) are not expanded to their respective definitions to keep the feature contained. This feature is arguably very hard to interpret.

We provide one example of classification boundaries and one of a regressed surface, inferred by SVM on a two-dimensional feature space obtained with our approach using GP-GOMEA_RT.

The classification dataset on which we find the best test improvement for h = 4 ... the analytic quotient ÷ and the protected logarithm log_p are replaced by their definitions for readability. The constructed features are rather complex here, yet readable. At the same time, it can be clearly seen how the training and test examples are distributed in the 2D space, and what classification boundaries SVM learned.

For regression, Figure 1 shows the surface learned by SVM on Yacht (median run), where GP-GOMEA_RT with h = ... R² of 0.98, against 0.85 obtained using the original feature set. The features are arguably easy to interpret, while it can be seen that the learned surface accurately models most of the data points.

Figure 9: Classification boundaries learned by SVM with two features constructed by GP-GOMEA_RT (h = 4) on the Image Segmentation dataset. The run with median test performance is shown. Circles are training samples, diamonds are test samples.
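The analytic quotient and the protected logarithm that appear in the constructed features can be written out explicitly. The sketch below uses the analytic quotient of Ni et al. [35], a ÷ b = a / sqrt(1 + b²), and takes the protected logarithm to be log|x|. Returning 0 at x = 0 is an assumption made here for illustration, and `example_feature` is a made-up feature, not one taken from Table 5.

```python
import math


def analytic_quotient(a, b):
    """Analytic quotient a / sqrt(1 + b^2): smooth and defined for every b."""
    return a / math.sqrt(1.0 + b * b)


def protected_log(x):
    """Logarithm of |x|; returning 0 for x == 0 is an assumption of this sketch."""
    return math.log(abs(x)) if x != 0 else 0.0


def example_feature(x1, x2, x3):
    """Hypothetical compact feature: (x1 ÷ x2) + log_p(x3)."""
    return analytic_quotient(x1, x2) + protected_log(x3)


value = example_feature(3.0, 0.0, -math.e)
```

Because the denominator of the analytic quotient is always at least 1, the operator never divides by zero, which avoids the discontinuities of a conventional protected division.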
9. Running time
Our results are made possible by evaluating the fitness of constructed features with cross-validation, a procedure which is particularly expensive. Table 6 shows the serial running time (mean over 30 runs) to construct five features on the smallest and largest classification and regression datasets, using GP-GOMEA_RT with h = 4, on an AMD Opteron™ Processor 6386 SE (http://cpuboss.com/cpu/AMD-Opteron-6386-SE). Running time has a large variability, from seconds to dozens of hours, depending on dataset size and ML algorithm. For the traditional datasets and ML algorithms we considered, it can be argued that our approach can be used in practice. However, for very high-dimensional datasets, only fast ML algorithms can be used. The construction of 5 features for the RNA-Seq gene expression dataset took 25 minutes even though NB was used. Using slower ML algorithms would easily require dozens to hundreds of hours.

Table 6: Mean serial running time to construct five features using GP-GOMEA_RT (h = 4) on the smallest and largest traditional datasets.

Dataset           Size          NB/LR   SVM    RF     XGB
Classification
  Iris            150 × 4       ...     ...    ...    ...
  Madelon         2600 × 500    4 m     14 h   8 h    10 h
Regression
  Yacht           308 × 7       ...     ...    ...    ...
  Tower           4999 × 26     2 m     34 h   34 h   13 h
As to memory occupation, it mostly depends on the way the chosen ML algorithm handles the dataset. Our runs required at most a few hundred MBs when dealing with the larger traditional datasets, for SVM and RF. Handling the parallel execution of FCS experiments upon the gene expression dataset required a few GBs.
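The cost discussed in this section comes from the fact that every fitness evaluation retrains the ML algorithm once per cross-validation fold. The sketch below illustrates the idea for regression, scoring a candidate feature by its mean cross-validated R². It is not the implementation used in the experiments: a one-dimensional least-squares fit stands in for the ML algorithm, the folds are interleaved deterministically, and all names (`cv_fitness`, `fit_line`, `r2_score`) are hypothetical.

```python
from statistics import mean


def r2_score(y_true, y_pred):
    """Coefficient of determination, 1 - MSE / var, on the given examples."""
    y_bar = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def fit_line(xs, ys):
    """Least-squares slope and intercept (stand-in for the ML algorithm)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    return slope, y_bar - slope * x_bar


def cv_fitness(feature, X, y, folds=5):
    """Mean test R^2 over `folds` interleaved folds: one training per fold."""
    values = [feature(row) for row in X]
    n = len(values)
    scores = []
    for k in range(folds):
        test = list(range(k, n, folds))
        held_out = set(test)
        train = [i for i in range(n) if i not in held_out]
        slope, intercept = fit_line([values[i] for i in train],
                                    [y[i] for i in train])
        preds = [slope * values[i] + intercept for i in test]
        scores.append(r2_score([y[i] for i in test], preds))
    return mean(scores)


# Toy data: the target depends linearly on the sum of the two inputs.
X = [(float(i), 2.0 * i) for i in range(20)]
y = [3.0 * i + 1.0 for i in range(20)]
good = cv_fitness(lambda row: row[0] + row[1], X, y)
poor = cv_fitness(lambda row: 1.0, X, y)
```

With a budget of thousands of fitness evaluations, the training cost of the ML algorithm is multiplied by the number of folds and by the number of evaluated features, which is why slow learners lead to running times of many hours.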
10. Discussion
We believe this is one of the few works on evolutionary feature construction where the focus is put on both improving the performance of an ML algorithm and on human interpretability at the same time. The interpretability we aimed for is twofold: understanding the meaning of the features themselves, as well as reducing their number. GP algorithms are key, as they can provide constructed features as interpretable expressions given basic functional components and a complexity limit (e.g., tree height).

We have run a large set of experiments, totaling more than 150,000 CPU-hours. Our results strongly support the hypothesis that the original feature set can be replaced by few (even solely K = 2) features built with our FCS without compromising performance in many cases. In some cases, performance even improved. GP-GOMEA_RT and SGP_b achieve this result while keeping the constructed feature size extremely limited (h = 2, 4). SGP can outperform GP-GOMEA_RT and SGP_b, but at the cost of constructing five to ten times larger features. RS proved to be less effective than the GP algorithms.

Our FCS is arguably most sensible to use for simpler ML algorithms, such as NB and LR. Constructed features change the space upon which the ML algorithm operates. SVM already includes the kernel trick to change the feature space. Similarly, the trees of RF and XGB effectively embody complex non-linear feature combinations to explain the variance in the data. NB and LR, instead, do not include such mechanisms. Rather, they have particular assumptions on how the features should be combined (NB assumes normality, LR linearity). The features constructed by GP can transform the input the ML algorithm operates upon, to better fit its assumptions.

We found that, for NB and LR, performance was almost always significantly better than when using the original feature set. As running times for these ML algorithms can be in the order of seconds or minutes (Sec. 9), feature construction has the potential to be routinely used in data analysis and machine learning practice.
Furthermore, FCS (or a modification where the constructed features are added to the original set) can be used as an alternative way to tune simple ML algorithms which have limited or no hyper-parameters.

We have shown that our approach can also be helpful when dealing with high-dimensional data (on the RNA-Seq gene expression dataset), where system underdetermination can cause even simpler ML algorithms to overfit. This is because FCS essentially embodies feature selection, as we only construct a small number of small-sized features.

We remark that we did not adopt very popular high-dimensional datasets concerning image recognition such as MNIST [40], CIFAR [41], or ImageNet [42]. In these datasets, features represent pixels, and each pixel has no particular meaning. Consequently, constructing features as readable pixel transformations will likely carry no helpful information to explain the behavior of an ML model.

Regarding the comparison between the search algorithms, GP-GOMEA_RT was found to be slightly preferable to SGP_b (especially for h = ..., K = ...) ... sufficiently large population sizes enable the possibility to exploit linkage estimation and perform better-than-random mixing [11, 29]. To validate this also within the framework of our proposed FCS, we scaled the population size and the budget of fitness evaluations, and compared the use of the LT with the use of the RT, on two traditional classification datasets: Image Segmentation (19 features) and Madelon (500 features), using NB. The outcome is shown in Figure 10: the employment of big-enough population sizes (and of sufficient numbers of fitness evaluations) can lead to better performance, if statistical metrics can be measured reliably. For Image Segmentation, the number of terminals to be considered in the genotype is relatively small due to the use of 19 features. This allows the LT to estimate node interdependencies reliably, and deliver better-than-random performance. For Madelon, the large number of terminals (500 features) makes it hard for the LT to outperform the RT within a limited computational budget. All in all, we recommend the use of GP-GOMEA as feature constructor since it was not worse on classification and was statistically better for regression. Furthermore, we advise using the LT if the population size can be of the order of thousands or more (or even better, if exponential population sizing schemes are used as in [11, 29]). Otherwise, the RT should be preferred.

Figure 10: Comparison between the use of the RT and of the LT in GP-GOMEA. Vertical axis: median F1 score of 30 runs, obtained by NB on Image Segmentation (left) and on Madelon (right) using the first constructed feature, with h = ... (different scale). Horizontal axis: population size / fitness evaluations budget. Stars indicate significant superiority (p-value < 0.05) of one method w.r.t. the other.

To assess whether small constructed features are interpretable and whether it is possible to visualize the behavior of learned ML models, we showed some examples, providing evidence that both requirements can be reasonably satisfied. However, we did not perform a thorough study on interpretability of the constructed features. Several metrics have been recently proposed to measure some form of interpretability for ML models, which could be used to measure the interpretability of features as well. For example, in [7] two metrics called simulatability and decomposability are proposed. Simulatability represents the capability of humans to predict the output of a model given an input. Decomposability represents the capacity to intuitively understand the components of a model.
Crucially, to measure this type of metric, user studies need to be conducted. For example, experts of a field should be asked to provide feedback on features constructed for datasets they are knowledgeable about (e.g., biochemists for data on gene expression, civil engineers for data on concrete strength). Nonetheless, we believe that enforcing features (and GP programs in general) to be small still remains a necessary condition to allow interpretability, although it is often ignored in GP literature [11].

Considering the visualization examples proposed in Section 8, it is natural to compare our approach with well-known dimensionality reduction techniques, such as Principal Component Analysis (PCA) [43] or t-Distributed Stochastic Neighbor Embedding (t-SNE) [44]. We remark that those techniques and our FCS have very different objectives. In general, the sole aim of such techniques is to reduce the data dimensionality. PCA does so by detecting components that capture maximal variance. However, it does not attempt to optimize the transformation of the original feature set to improve an ML algorithm's performance. Also, PCA does not focus on the interpretability of the feature transformations. FCS takes the performance of the ML algorithm and the interpretability of the features into account, while dimensionality reduction comes from forcing the construction of few features. We compared using 2 features constructed with RS (the worst search algorithm) with maximum h = 2, with using the first 2 PCs found by PCA. The use of constructed features over PCs resulted in significantly superior or equal test performance for all ML algorithms and for all problems. We remark, however, that PCA is extremely fast and independent of the ML algorithm.

Our FCS has several limitations. A first limitation regards the performance obtainable by the ML algorithm using the constructed features. FCS is iterative, and this can lead to suboptimal performance for a chosen K, compared to attempting to find K features at once. This is because the contributions of multiple features to an ML algorithm are not necessarily perpendicular to each other [25]. FCS could be changed to find, at any given iteration, a synergistic set of K features that is independent of previous iterations. To this end, the search algorithms need to be modified so that they can evolve sets of constructed features (a similar proposal for SGP was made in [15]). Yet, it is reasonable to expect that if K features need to be learned at the same time, larger population sizes may be needed compared to learning the K features iteratively.

Another limitation of this work is that hyper-parameter tuning was not considered. Including hyper-parameter tuning within FCS could bring even higher performance scores, or help prevent overfitting. A possibility could be, for example, to evolve pairs of features and hyper-parameter settings, where every time a feature is evaluated, the optimal hyper-parameters are also searched for. Such a procedure would likely require strong parallelization efforts, as C-fold cross-validation should be carried out for each combination of hyper-parameter values.

Lastly, it would be interesting to extend our approach to other classification and regression settings, e.g., problems with missing data; or to unsupervised tasks, as simple features may lead to better clustering of the examples.
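The iterative nature of FCS discussed above can be illustrated with a generic greedy loop in which each new feature is searched for while the previously constructed ones stay fixed. This is only a sketch under stated assumptions: plain random sampling stands in for the search algorithm (in the spirit of RS), and `propose`, `score`, and `budget` are hypothetical names, not part of the original framework.

```python
import random


def iterative_construction(propose, score, K, budget=200, seed=0):
    """Greedily build K features, one per iteration.

    propose(rng) returns a random candidate feature.
    score(features) returns the fitness of a list of features (higher is
    better), e.g. a cross-validated score of the ML algorithm.
    """
    rng = random.Random(seed)
    chosen = []
    for _ in range(K):
        best, best_score = None, float("-inf")
        for _ in range(budget):
            candidate = propose(rng)
            s = score(chosen + [candidate])  # earlier features stay fixed
            if s > best_score:
                best, best_score = candidate, s
        chosen.append(best)
    return chosen


# Toy problem: "features" are integers and the fitness rewards a sum near 12.
result = iterative_construction(
    propose=lambda rng: rng.randint(0, 9),
    score=lambda feats: -abs(sum(feats) - 12),
    K=2,
)
```

Because each iteration commits to the best feature found so far, a later feature can only compensate for earlier choices, not revise them. Searching for a set of K features at once would remove this restriction at the price of a larger search space.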
11. Conclusion
With a simple evolutionary feature construction framework, we have studied the feasibility of constructing few crucial and compact features with Genetic Programming (GP), towards improving the explainability of Machine Learning (ML) models without losing prediction accuracy. Within the proposed framework, we compared standard GP, random search, and the GP adaptation of the Gene-pool Optimal Mixing Evolutionary Algorithm (GP-GOMEA) as feature constructors, and found that GP-GOMEA is overall preferable when strict limitations on feature size are enforced. Despite limitations on feature size, and despite the reduction of problem dimensionality that we imposed by constructing only two features, we obtained equal or better ML prediction performance compared to using the original feature set for more than half the combinations of datasets and ML algorithms. In many cases, humans can understand what the features mean, and it is possible to visualize how trained ML models will behave. All in all, we conclude that feature construction is most useful and sensible for simpler ML algorithms, where more resources can be used for evolution (e.g., larger population sizes), which, in turn, unlock the added benefits of more advanced evolutionary mechanisms (e.g., using linkage learning in GP-GOMEA).
Acknowledgments
The authors acknowledge the Kinderen Kankervrij foundation for financial support (project
References

[1] H. Liu, H. Motoda, Feature extraction, construction and selection: A data mining perspective, Vol. 453, Springer Science & Business Media, 1998.
[2] J. Friedman, T. Hastie, R. Tibshirani, The elements of statistical learning, Vol. 1, Springer Series in Statistics, New York, NY, USA, 2001.
[3] J. R. Koza, Genetic Programming: On the Programming of Computers by Means of Natural Selection, MIT Press, Cambridge, MA, USA, 1992.
[4] R. Poli, W. B. Langdon, N. F. McPhee, J. R. Koza, A field guide to genetic programming, Lulu.com, 2008.
[5] J. H. Friedman, Multivariate adaptive regression splines, The Annals of Statistics (1991) 1–67.
[6] D. W. Hosmer Jr, S. Lemeshow, R. X. Sturdivant, Applied logistic regression, Vol. 398, John Wiley & Sons, 2013.
[7] Z. C. Lipton, The mythos of model interpretability, Queue 16 (3) (2018) 30:31–30:57.
[8] R. Guidotti, A. Monreale, S. Ruggieri, F. Turini, F. Giannotti, D. Pedreschi, A survey of methods for explaining black box models, ACM Computing Surveys (CSUR) 51 (5) (2018) 93.
[9] A. Adadi, M. Berrada, Peeking inside the black-box: A survey on explainable artificial intelligence (XAI), IEEE Access 6 (2018) 52138–52160.
[10] B. Goodman, S. Flaxman, European Union regulations on algorithmic decision-making and a right to explanation, AI Magazine 38 (3) (2017) 50–57.
[11] M. Virgolin, T. Alderliesten, C. Witteveen, P. A. N. Bosman, Improving model-based genetic programming for symbolic regression of small expressions, CoRR abs/1904.02050, arXiv:1904.02050.
[12] A. Cano, A. Zafra, S. Ventura, An interpretable classification rule mining algorithm, Information Sciences 240 (2013) 1–20.
[13] B. P. Evans, B. Xue, M. Zhang, What's inside the black-box?: a genetic programming method for interpreting complex machine learning models, in: Genetic and Evolutionary Computation (GECCO 2019), ACM, 2019, pp. 1012–1020.
[14] B. Xue, M. Zhang, W. N. Browne, X. Yao, A survey on evolutionary computation approaches to feature selection, IEEE Transactions on Evolutionary Computation 20 (4) (2016) 606–626.
[15] K. Krawiec, Genetic programming-based construction of features for machine learning and knowledge discovery tasks, Genetic Programming and Evolvable Machines 3 (4) (2002) 329–343.
[16] L. Breiman, Classification and regression trees, Routledge, 2017.
[17] M. Muharram, G. D. Smith, Evolutionary constructive induction, IEEE Transactions on Knowledge and Data Engineering 17 (11) (2005) 1518–1528.
[18] B. Tran, B. Xue, M. Zhang, Genetic programming for feature construction and selection in classification on high-dimensional data, Memetic Computing 8 (1) (2016) 3–15.
[19] N. S. Altman, An introduction to kernel and nearest-neighbor nonparametric regression, The American Statistician 46 (3) (1992) 175–185.
[20] S. J. Russell, P. Norvig, Artificial intelligence: a modern approach, Malaysia; Pearson Education Limited, 2016.
[21] K. P. Murphy, Naive Bayes classifiers, University of British Columbia 18.
[22] Q. Chen, M. Zhang, B. Xue, Genetic programming with embedded feature construction for high-dimensional symbolic regression, in: Intelligent and Evolutionary Systems, Springer, 2017, pp. 87–102.
[23] A. Cano, S. Ventura, K. J. Cios, Multi-objective genetic programming for feature extraction and data visualization, Soft Computing 21 (8) (2017) 2069–2089.
[24] M. Virgolin, T. Alderliesten, A. Bel, C. Witteveen, P. A. N. Bosman, Symbolic regression and feature construction with GP-GOMEA applied to radiotherapy dose reconstruction of childhood cancer survivors, in: Genetic and Evolutionary Computation (GECCO 2018), ACM, 2018, pp. 1395–1402.
[25] B. Tran, B. Xue, M. Zhang, Genetic programming for multiple-feature construction on high-dimensional classification, Pattern Recognition 93 (2019) 404–417.
[26] C. Cortes, V. Vapnik, Support-vector networks, Machine Learning 20 (3) (1995) 273–297.
[27] L. Breiman, Random forests, Machine Learning 45 (1) (2001) 5–32.
[28] R. Kohavi, G. H. John, Wrappers for feature subset selection, Artificial Intelligence 97 (1-2) (1997) 273–324.
[29] M. Virgolin, T. Alderliesten, C. Witteveen, P. A. N. Bosman, Scalable genetic programming by gene-pool optimal mixing and input-space entropy-based building-block learning, in: Genetic and Evolutionary Computation (GECCO 2017), ACM, New York, NY, USA, 2017, pp. 1041–1048.
[30] T. P. Pawlak, B. Wieloch, K. Krawiec, Semantic backpropagation for designing search operators in genetic programming, IEEE Transactions on Evolutionary Computation 19 (3) (2015) 326–340.
[31] R. R. Curtin, J. R. Cline, N. P. Slagle, W. B. March, P. Ram, N. A. Mehta, A. G. Gray, MLPACK: A scalable C++ machine learning library, Journal of Machine Learning Research 14 (2013) 801–805.
[32] C.-C. Chang, C.-J. Lin, LIBSVM: A library for support vector machines, ACM Transactions on Intelligent Systems and Technology 2 (2011) 27:1–27:27, software available at ...
[33] M. N. Wright, A. Ziegler, ranger: A fast implementation of random forests for high dimensional data in C++ and R, arXiv preprint arXiv:1508.04409.
[34] T. Chen, C. Guestrin, XGBoost: A scalable tree boosting system, in: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, 2016, pp. 785–794.
[35] J. Ni, R. H. Drieberg, P. I. Rockett, The use of an analytic quotient operator in genetic programming, IEEE Transactions on Evolutionary Computation 17 (1) (2013) 146–152.
[36] D. R. White, J. McDermott, M. Castelli, L. Manzoni, B. W. Goldman, G. Kronberger, W. Jaśkowski, U.-M. O'Reilly, S. Luke, Better GP benchmarks: community survey results and proposals, Genetic Programming and Evolvable Machines 14 (1) (2013) 3–29.
[37] J. Albinati, G. L. Pappa, F. E. Otero, L. O. V. Oliveira, The effect of distinct geometric semantic crossover operators in regression problems, in: European Conference on Genetic Programming, Springer, 2015, pp. 3–15.
[38] J. Demšar, Statistical comparisons of classifiers over multiple data sets, Journal of Machine Learning Research 7 (2006) 1–30.
[39] S. Holm, A simple sequentially rejective multiple test procedure, Scandinavian Journal of Statistics (1979) 65–70.
[40] Y. LeCun, The MNIST database of handwritten digits, http://yann.lecun.com/exdb/mnist/