Unsupervised Feature Construction for Improving Data Representation and Semantics
Marian-Andrei Rizoiu · Julien Velcin · Stéphane Lallich
Received: 27/01/2012 / Accepted: 29/01/2013
Abstract
Feature-based format is the main data representation format used by machine learning algorithms. When the features do not properly describe the initial data, performance starts to degrade. Some algorithms address this problem by internally changing the representation space, but the newly-constructed features are rarely comprehensible. We seek to construct, in an unsupervised way, new features that are more appropriate for describing a given dataset and, at the same time, comprehensible for a human user. We propose two algorithms that construct the new features as conjunctions of the initial primitive features or their negations. The generated feature sets have reduced correlations between features and succeed in catching some of the hidden relations between individuals in a dataset. For example, a feature like sky ∧ ¬building ∧ panorama would be true for non-urban images and is more informative than simple features expressing the presence or the absence of an object. The notion of Pareto optimality is used to evaluate feature sets and to obtain a balance between total correlation and the complexity of the resulting feature set. Statistical hypothesis testing is used in order to automatically determine the values of the parameters used for constructing a data-dependent feature set. We experimentally show that our approaches achieve the construction of informative feature sets for multiple datasets.
Keywords Unsupervised feature construction · Feature evaluation · Nonparametric statistics · Data mining · Clustering · Representations · Algorithms for data and knowledge management · Heuristic methods · Pattern analysis
Marian-Andrei Rizoiu · Julien Velcin · Stéphane Lallich
ERIC Laboratory, University Lumière Lyon 2, 5, avenue Pierre Mendès France, 69676 Bron Cedex, France
Tel. +33 (0)4 78 77 31 54, Fax +33 (0)4 78 77 23 75
Marian-Andrei Rizoiu, E-mail: [email protected]
Julien Velcin, E-mail: [email protected]
Stéphane Lallich, E-mail: [email protected]
1 Introduction

Most machine learning algorithms use a representation space based on a feature-based format. This format is a simple way to describe an instance as a measurement vector on a set of predefined features. In the case of supervised learning, a class label is also available. One limitation of the feature-based format is that the supplied features sometimes do not adequately describe, in terms of classification, the semantics of the dataset. This happens, for example, when general-purpose features are used to describe a collection that contains certain relations between individuals.

In order to obtain good results in classification tasks, many algorithms and preprocessing techniques (e.g., SVM (Cortes and Vapnik, 1995), PCA (Dunteman, 1989), etc.) deal with non-adequate variables by internally changing the description space. The main drawback of these approaches is that they function as a black box, where the new representation space is either hidden (for SVM) or completely synthetic and incomprehensible to human readers (PCA).

The purpose of our work is to construct a new feature set that is more descriptive for both supervised and unsupervised classification tasks. In the same way that frequent itemsets (Piatetsky-Shapiro, 1991) help users to understand the patterns in transactions, our goal with the new features is to help understand relations between individuals of datasets. Therefore, the new features should be easily comprehensible by a human reader. The literature proposes algorithms that construct features based on the original user-supplied features (called primitives). However, to our knowledge, all of these algorithms construct the feature set in a supervised way, based on class information supplied a priori with the data.

In order to construct new features, we propose two algorithms that create new feature sets in the absence of classified examples, in an unsupervised manner. The first algorithm is an adaptation of an established supervised algorithm, making it unsupervised. For the second algorithm, we have developed a completely new heuristic that selects, at each iteration, pairs of highly correlated features and replaces them with conjunctions of literals that do not co-occur. Therefore, the overall redundancy of the feature set is reduced. Later iterations create more complex Boolean formulas, which can contain negations (meaning the absence of features). We use statistical considerations (hypothesis testing) to automatically determine the values of parameters depending on the dataset, and a Pareto front (Sawaragi et al, 1985)-inspired method for the evaluation.
The main advantage of the proposed methods over PCA or the kernel of the SVM is that the newly-created features are comprehensible to human readers (features like people ∧ manifestation ∧ urban and people ∧ ¬urban ∧ forest are easily interpretable).

In Sections 2 and 3, we present our proposed algorithms, and in Section 4 we describe the evaluation metrics and the complexity measures. In Section 5, we perform a set of initial experiments and outline some of the inconveniences of the algorithms. In Section 6, by use of statistical hypothesis testing, we address these weak points, notably the choice of the threshold parameter. In Section 7, a second set of experiments validates the proposed improvements. Finally, Section 8 draws the conclusions and outlines future work.

1.1 Motivation: why construct a new feature set

In the context of classification (supervised or unsupervised), a useful feature needs to portray new information. A feature p_j that is highly correlated with another feature p_i does not bring any new information, since the value of p_j can be deduced from that of p_i. Subsequently, one could filter out "irrelevant" features before applying the classification algorithm. But by simply removing certain features, one runs the risk of losing important information about the hidden structure of the feature set, and this is the reason why we perform feature construction. Feature construction attempts to increase the expressive power of the original features by discovering missing information about relationships between features.

We deal primarily with datasets described with Boolean features. Any dataset described using the feature-value format can be converted to a binary format using discretization and binarization. In real-life datasets, most binary features have specific meanings. Let us consider the example of a set of images that are tagged using Boolean features. Each feature marks the presence (true) or the absence (false) of a certain object in the image. These objects could include: water, cascade, manifestation, urban, groups or interior. In this case, part of the semantic structure of the feature set can be guessed quite easily. Relations like "is-a" and "part-of" are fairly intuitive: cascade is a sort of water, paw is part of animal, etc. But other relations might be induced by the semantics of the dataset (images in our example): manifestation will co-occur with urban, for they usually take place in the city.

Fig. 1: Example of images tagged with {groups, road, building, interior}

Fig. 1 depicts a simple image dataset described using the feature set {groups, road, building, interior}. The feature set is quite redundant and some of the features are non-informative (e.g., the feature groups is present for all individuals). Considering co-occurrences between features, we could create the more eloquent features groups ∧ ¬road ∧ interior (describing the top row) and groups ∧ road ∧ building (describing the bottom row).

The idea is to create a data-dependent feature set, so that the new features are as independent as possible, limiting co-occurrences between the new features. At the same time, they should be comprehensible to the human reader.

1.2 Related work

The literature proposes methods for augmenting the descriptive power of features.
Liu and Motoda (1998) collect some of them and divide them into three categories: feature selection, feature extraction and feature construction.
Feature selection (Lallich and Rakotomalala, 2000; Mo and Huang, 2011) seeks to filter the original feature set in order to remove redundant features. This results in a representation space of lower dimensionality.
Feature extraction is a process that extracts a set of new features from the original features through functional mapping (Motoda and Liu, 2002). For example, the SVM algorithm (Cortes and Vapnik, 1995) constructs a kernel function that changes the description space into a new, separable one.
Supervised and unsupervised algorithms can be boosted by pre-processing with principal component analysis (PCA) (Dunteman, 1989). PCA is a mathematical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of uncorrelated variables, called principal components. Manifold learning (Huo et al, 2005) can be seen as a classification approach where the representation space is changed internally in order to boost performance. Feature extraction mainly seeks to reduce the description space and the redundancy between features. Newly-created features are rarely comprehensible and very difficult to interpret. Both feature selection and feature extraction are inadequate for detecting relations between the original features.
Feature construction is a process that discovers missing information about the relationships between features and augments the space of features by inferring or creating additional features (Motoda and Liu, 2002). This usually results in a representation space of larger dimension than the original space. Constructive induction (Michalski, 1983) is a process of constructing new features using two intertwined searches (Bloedorn and Michalski, 1998).
Algorithm 1
General feature construction schema.
Input: P – set of primitive user-given features
Input: I – the data expressed using P, which will be used to construct features
Inner parameters: Op – set of operators for constructing features, M – machine learning algorithm to be employed
Output: F – set of new (constructed and/or primitive) features

  F ← P
  iter ← 0
  repeat
    iter ← iter + 1
    I_iter ← convert(I_{iter−1}, F)
    output ← run M(I_iter, F)
    F ← F ∪ {new features constructed with Op(F, output)}
    prune useless features in F
  until stopping criteria are met

Algorithm 1, presented in Gomez and Morales (2002) and Yang et al (1991), represents the general schema followed by most constructive induction algorithms. The general idea is to start from I, the dataset described with the set of primitive features. Using a set of constructors and the results of a machine learning algorithm, the algorithm constructs new features that are added to the feature set. In the end, useless features are pruned. These steps are iterated until some stopping criterion is met (e.g., a maximum number of iterations performed or a maximum number of created features).

Most constructive induction systems construct features as conjunctions or disjunctions of literals. Literals are the features or their negations; e.g., for the feature set {a, b} the literal set is {a, ¬a, b, ¬b}. The operator sets {AND, Negation} and {OR, Negation} are both complete sets for the Boolean space: any Boolean function can be created using only operators from one set.

FRINGE (Pagallo and Haussler, 1990) creates new features using a decision tree that it builds at each iteration. New features are conjunctions of the last two nodes in each positive path (a positive path connects the root with a leaf having the class label true). The newly-created features are added to the feature set and then used in the next iteration to construct the decision tree. This first feature construction algorithm was initially designed to solve replication problems in decision trees.

Other algorithms have further improved this approach. CITRE (Matheus, 1990) adds other search strategies, like root (selects the first two nodes in a positive path) or root-fringe (selects the first and last node in the path). It also introduces domain knowledge by applying filters to prune the constructed features. CAT (Zheng, 1998) is another example of a hypothesis-driven constructive algorithm similar to FRINGE. It also constructs conjunctive features based on the output of decision trees. It uses a dynamic-path-based approach (the conditions used to generate new features are chosen dynamically) and it includes a pruning technique.

There are alternative representations, other than conjunctive and disjunctive. The M-of-N and X-of-N representations use feature-value pairs. A feature-value pair AV_k (A_i = V_ij) is true for an instance if and only if the feature A_i has the value V_ij for that instance. The difference between M-of-N and X-of-N is that, while the second one counts the number of true feature-value pairs, the first one uses a threshold parameter to assign a truth value to the entire representation. The XofN algorithm (Zheng, 1995) builds such feature-value-pair representations, using X-of-N; it also takes into account the complexity of the generated features.

Comparative studies like Zheng (1996) show that conjunctive and disjunctive representations have very similar performances in terms of prediction accuracy and theoretical complexity.
M-of-N, while more complex, has a stronger representation power than the two before. The X-of-N representation has the strongest representation power, but the same studies show that it suffers from data fragmenting more than the other three. The problem with all of these algorithms is that they all work in a supervised environment and cannot function without a class label. In the following sections, we propose two approaches towards unsupervised feature construction.
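To make the generic schema of Algorithm 1 concrete, here is a minimal Python sketch of its loop; the learning algorithm M, the constructing operators Op and the pruning step are passed in as callables, and every name and type below is illustrative rather than taken from any of the systems cited above.

```python
from typing import Any, Callable, Dict, List

Individual = Dict[str, bool]             # an individual described by Boolean features
Feature = Callable[[Individual], bool]   # a feature is a Boolean function of an individual

def constructive_induction(
    primitives: List[Feature],
    data: List[Individual],
    run_learner: Callable[[List[Individual], List[Feature]], Any],    # the algorithm M
    build_features: Callable[[List[Feature], Any], List[Feature]],    # the operators Op
    prune: Callable[[List[Feature], List[Individual]], List[Feature]],
    max_iter: int = 10,
) -> List[Feature]:
    """Generic constructive-induction loop: run M on the data described by the
    current features, derive new features from its output, prune, and iterate."""
    features = list(primitives)                     # F <- P
    for _ in range(max_iter):                       # one possible stopping criterion
        output = run_learner(data, features)        # run M on (I_iter, F)
        new_features = build_features(features, output)
        if not new_features:                        # nothing new was built: stop early
            break
        features = prune(features + new_features, data)
    return features
```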
We propose uFRINGE, an unsupervised version of FRINGE, one of the first feature construction algorithms. FRINGE (Pagallo and Haussler, 1990) is a framework algorithm (see Section 1.2), following the same general schema shown in Algorithm 1. It creates new features using a logical decision tree, built with a traditional algorithm like ID3 (Quinlan, 1986) or C4.5 (Quinlan, 1993). Taking a closer look at FRINGE, one observes that its only supervised component is the decision tree construction.
The actual construction of features is independent of the existence of a class attribute. Hence, using an unsupervised decision tree construction algorithm renders FRINGE unsupervised.
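The feature-generation step itself is the one inherited from FRINGE: take, on every root-to-leaf path of the tree, the conjunction of the last two tests. A minimal sketch of that step, assuming a toy binary-tree node structure (this is not the authors' implementation, and real clustering trees would also carry cluster statistics), could look like this.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Literal = Tuple[str, bool]          # (feature name, expected truth value)

@dataclass
class Node:
    """A node of a Boolean (clustering) tree; leaves carry no test."""
    test: Optional[str] = None
    true_child: Optional["Node"] = None     # branch followed when the test is true
    false_child: Optional["Node"] = None    # branch followed when the test is false

def path_features(root: Node) -> List[Tuple[Literal, Literal]]:
    """Return, for every root-to-leaf path, the conjunction of the last two
    literals on the path (uFRINGE uses all paths, since clustering-tree leaves
    carry no class label)."""
    found: List[Tuple[Literal, Literal]] = []

    def walk(node: Node, path: List[Literal]) -> None:
        if node.test is None:                          # reached a leaf
            if len(path) >= 2:
                found.append((path[-2], path[-1]))
            return
        walk(node.true_child, path + [(node.test, True)])
        walk(node.false_child, path + [(node.test, False)])

    walk(root, [])
    return found

# e.g., a tree splitting first on "water", then on "cascade"/"urban", yields the
# candidate conjunctions water∧cascade, water∧¬cascade, ¬water∧urban, ¬water∧¬urban.
tree = Node("water", Node("cascade", Node(), Node()), Node("urban", Node(), Node()))
print(path_features(tree))
```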
Clustering trees (Blockeel et al, 1998) were introduced as generalized logical decision trees. They are constructed using a top-down strategy: at each step, the cluster under a node is split into two, seeking to maximize the intra-cluster variance. The authors argue that the supervised indicators used in traditional decision tree algorithms are special cases of intra-cluster variance, as they measure intra-cluster class diversity. Following this interpretation, clustering trees can be considered generalizations of decision trees and are suitable candidates for replacing ID3 in uFRINGE.

Adapting FRINGE to use clustering trees is straightforward: it is enough to replace M in Algorithm 1 with the clustering trees algorithm. At each step, uFRINGE constructs a clustering tree using the dataset and the current feature set. Just like in FRINGE, new features are created using the conditions under the last two nodes in each path connecting the root to a leaf. FRINGE constructs new features starting only from positive leaves (leaves labelled true); but unlike in decision trees, in clustering trees the leaves are not labelled using class features. Therefore, uFRINGE constructs new features based on all paths from the root to a leaf. Newly-constructed features are added to the feature set and used in the next clustering tree construction. The algorithm stops either when no more features can be constructed from the clustering tree or when a maximum allowed number of features has already been constructed.

Limitations. uFRINGE is capable of constructing new features in an unsupervised context. It is also relatively simple to understand and implement, as it is based on the same framework as FRINGE. However, it suffers from a couple of drawbacks. Constructed features tend to be redundant and to contain doubles. Newly-constructed features are added to the feature set and are used, alongside old features, in later iterations. Older features are never removed from the feature set and can be combined multiple times, thus resulting in doubles in the constructed feature set. What is more, old features can be combined with new features in which they already participate, therefore constructing redundant features (e.g., combining f_1 with f_1 ∧ f_2 ∧ f_3 results in f_1 ∧ f_1 ∧ f_2 ∧ f_3). Another limitation is controlling the number of constructed features. The algorithm stops when a maximum number of features have been constructed. This is very inconvenient, as the dimension of the new feature set cannot be known in advance and is highly dependent on the dataset. Furthermore, constructing too many features leads to overfitting and an overly complex feature set. These shortcomings could be corrected by refining the construction operator and by introducing a filter operator.

We address the limitations of uFRINGE by proposing a second, innovative approach. We propose an iterative algorithm that reduces the overall correlation of the features of a dataset by iteratively replacing pairs of highly correlated features with conjunctions of literals. We use a greedy search strategy to identify the features that are highly correlated, then use a construction operator to create new features. From two correlated features f_i and f_j we create three new features: f_i ∧ f_j, f_i ∧ ¬f_j and ¬f_i ∧ f_j. In the end, both f_i and f_j are removed from the feature set. The algorithm stops when no new features are created or when it has performed a maximum number of iterations. The formalization and the different key parts of the algorithm (e.g., the search strategy, the construction operators or the feature pruning) are presented in the next sections.
Fig. 2: Graphical representation of how new features are constructed (Venn diagrams): (a) Iter. 0: initial features (primitives), (b) Iter. 1: combining f_1 and f_2, and (c) Iter. 2: combining f_1 ∧ f_2 and f_3

Fig. 2 illustrates visually, using Venn diagrams, how the algorithm replaces the old features with new ones. Features are represented as rectangles, where the rectangle of each feature contains the individuals having that feature set to true. Naturally, the individuals in the intersection of two rectangles have both features set to true. f_1 and f_2 have a big intersection, showing that they co-occur frequently. On the contrary, f_1 and f_4 have a small intersection, suggesting that their co-occurrence is lower than what chance would predict (they are negatively correlated). f_3 is included in the intersection of f_1 and f_2, while f_4 has no other common elements. f_5 is incompatible with all of the others.

In the first iteration, f_1 and f_2 are combined and three features are created: f_1 ∧ f_2, f_1 ∧ ¬f_2 and ¬f_1 ∧ f_2. These new features replace f_1 and f_2, the original ones. At the second iteration, f_1 ∧ f_2 is combined with f_3. As f_3 is contained in f_1 ∧ f_2, the feature ¬(f_1 ∧ f_2) ∧ f_3 has a support equal to zero and is removed. Note that f_1 and f_4 are never combined, as they are considered uncorrelated. The final feature set is {f_1 ∧ ¬f_2, f_1 ∧ f_2 ∧ f_3, f_1 ∧ f_2 ∧ ¬f_3, ¬f_1 ∧ f_2, f_4, f_5}.

3.1 uFC - the proposed algorithm

We define the set P = {p_1, p_2, ..., p_k} of k user-supplied initial features and I = {i_1, i_2, ..., i_n} the dataset described using P. We start from the hypothesis that, even if the primitive set P cannot adequately describe the dataset I, there is a data-specific feature set F = {f_1, f_2, ..., f_m} that can be created in order to represent the data better. New features are iteratively created using conjunctions of primitive features or their negations (as seen in Fig. 2). Our algorithm does not use the output of a learning algorithm in order to create the new features. Instead, we use a greedy search strategy and a feature set evaluation function that can determine whether a newly-obtained feature set is more appropriate than the former one.

The schema of our proposal is presented in Algorithm 2. The feature construction is performed starting from the dataset I and the primitives P. The algorithm follows the general inductive schema presented in Algorithm 1. At each iteration, uFC searches for frequently co-occurring pairs in the feature set created at the previous iteration (F_{iter−1}). It determines the candidate set O and then creates new features as conjunctions of the highest scoring pairs. The new features are added to the current set (F_iter), after which the set is filtered in order to remove obsolete features.
Algorithm 2 uFC - Unsupervised feature construction
Input: P – set of primitive user-given features
Input: I – the data expressed using P, which will be used to construct features
Inner parameters: λ – correlation threshold for searching, limit_iter – maximum number of iterations
Output: F – set of newly-constructed features

  F_0 ← P
  iter ← 0
  repeat
    iter ← iter + 1
    O ← search_correlated_pairs(I_iter, F_{iter−1}, λ)
    F_iter ← F_{iter−1}
    while O ≠ ∅ do
      pair ← highest_scoring_pair(O)
      F_iter ← F_iter ∪ construct_new_feat(pair)
      remove_candidate(O, pair)
    end while
    prune_obsolete_features(F_iter, I_iter)
    I_{iter+1} ← convert(I_iter, F_iter)
  until F_iter = F_{iter−1} OR iter = limit_iter
  F ← F_iter

At the end of each iteration, the dataset I is translated to reflect the feature set F_iter. A new iteration is performed as long as new features were generated in the current iteration and the maximum number of iterations has not yet been reached (limit_iter is a parameter of the algorithm).

3.2 Searching co-occurring pairs

The search_correlated_pairs function searches for frequently co-occurring pairs of features in a feature set F. We start with an empty set O ← ∅ and we investigate all possible pairs of features {f_i, f_j} ∈ F × F. We use a function r to measure the co-occurrence of a pair of features {f_i, f_j} and compare it to a threshold λ. If the value of the function is above the threshold, their co-occurrence is considered significant and the pair is added to O. Therefore, O will be

O = {{f_i, f_j} | {f_i, f_j} ∈ F × F such that r({f_i, f_j}) > λ}

The r function is the empirical Pearson correlation coefficient, which is a measure of the strength of the linear dependency between two variables. r ∈ [−1, 1] and it is defined as the covariance of the two variables divided by the product of their standard deviations. The sign of r gives the direction of the correlation (inverse correlation for r < 0, positive correlation for r > 0). For two Boolean features, with the contingency table given in Table 1, the r function has the following formulation:

r({f_i, f_j}) = (a × d − b × c) / √((a + b) × (a + c) × (b + d) × (c + d))

The λ threshold parameter will serve to fine-tune the number of selected pairs. Its impact on the behaviour of the algorithm will be studied in Section 5.3. A method for automatically choosing λ using statistical hypothesis testing is presented in Section 6.1.

Table 1: Contingency table for two Boolean features

           f_j    ¬f_j
  f_i       a      b
  ¬f_i      c      d

Once the set O is constructed, uFC performs a greedy search. The function highest_scoring_pair is iteratively used to extract from O the pair {f_i, f_j} that has the highest co-occurrence score. The function construct_new_feat constructs three new features: f_i ∧ f_j, f_i ∧ ¬f_j and ¬f_i ∧ f_j. They represent, respectively, the intersection of the initial two features and the relative complements of one feature in the other. The new features are guaranteed by construction to be negatively correlated: if one of them is set to true for an individual, the other two will surely be false. At each iteration, very simple features are constructed: conjunctions of two literals. The creation of more complex and semantically rich features emerges through the iterative process; f_i and f_j can be either primitives or features constructed in previous iterations.

After the construction of features, the remove_candidate function removes from O the pair {f_i, f_j}, as well as any other pair that contains f_i or f_j.
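As an illustration of the two steps above (the search for correlated pairs and the construction operator), the sketch below computes r from the 2×2 contingency table of Table 1, keeps the pairs above λ, and builds the three replacement features. The column-wise data layout (feature name → Boolean vector) and the function names are ours, chosen for readability, and are not the authors' implementation.

```python
from itertools import combinations
from math import sqrt
from typing import Dict, List, Tuple

BoolColumn = List[bool]

def pearson_r(x: BoolColumn, y: BoolColumn) -> float:
    """Pearson correlation of two Boolean features, computed from the
    contingency counts a, b, c, d of Table 1."""
    a = sum(xi and yi for xi, yi in zip(x, y))
    b = sum(xi and not yi for xi, yi in zip(x, y))
    c = sum((not xi) and yi for xi, yi in zip(x, y))
    d = sum((not xi) and (not yi) for xi, yi in zip(x, y))
    denom = sqrt((a + b) * (a + c) * (b + d) * (c + d))
    return (a * d - b * c) / denom if denom else 0.0

def search_correlated_pairs(data: Dict[str, BoolColumn], lam: float) -> List[Tuple[str, str]]:
    """All feature pairs whose correlation exceeds the threshold, best first."""
    scored = {(fi, fj): pearson_r(data[fi], data[fj])
              for fi, fj in combinations(sorted(data), 2)}
    return sorted((p for p, r in scored.items() if r > lam),
                  key=lambda p: scored[p], reverse=True)

def construct_new_feat(data: Dict[str, BoolColumn], fi: str, fj: str) -> Dict[str, BoolColumn]:
    """Replace the correlated pair {fi, fj} by fi∧fj, fi∧¬fj and ¬fi∧fj."""
    x, y = data[fi], data[fj]
    return {
        f"{fi} AND {fj}":     [xi and yi for xi, yi in zip(x, y)],
        f"{fi} AND NOT {fj}": [xi and not yi for xi, yi in zip(x, y)],
        f"NOT {fi} AND {fj}": [(not xi) and yi for xi, yi in zip(x, y)],
    }
```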
When there are no more pairs in O, prune_obsolete_features is used to remove two types of features from the feature set:

– features that are false for all individuals. These usually appear in the case of hierarchical relations. We consider that f_1 and f_2 have a hierarchical relation if all individuals that have feature f_1 true automatically have feature f_2 true (e.g., f_1 "is a type of" f_2 or f_1 "is a part of" f_2). One of the generated features (in the example, f_1 ∧ ¬f_2) is false for all individuals and is, therefore, eliminated. In the example of water and cascade, we create only water ∧ cascade and water ∧ ¬cascade, since there cannot exist a cascade without water (considering that a value of false means the absence of a feature and not missing data).

– features that participated in the creation of a new feature. Effectively, all {f_i | {f_i, f_j} ∈ O, f_j ∈ F} are replaced by the newly-constructed features:

{f_i, f_j | f_i, f_j ∈ F, {f_i, f_j} ∈ O}  →(repl.)  {f_i ∧ f_j, f_i ∧ ¬f_j, ¬f_i ∧ f_j}

To our knowledge, there are no widely accepted measures to evaluate the overall correlation between the features of a feature set. We propose a measure inspired by the "inclusion-exclusion" principle (Feller, 1950). In set theory, this principle expresses the cardinality of a finite union of finite sets in terms of the cardinalities of those sets and of their intersections. In its Boolean form, it is used to estimate the probability of a clause (disjunction of literals) as a function of its composing terms (conjunctions of literals). Given the feature set F = {f_1, f_2, ..., f_m}, we have:

p(f_1 ∨ f_2 ∨ ... ∨ f_m) = Σ_{k=1}^{m} ( (−1)^{k−1} Σ_{1 ≤ i_1 < ... < i_k ≤ m} p(f_{i_1} ∧ f_{i_2} ∧ ... ∧ f_{i_k}) )

which, by putting apart the first term, is equivalent to:

p(f_1 ∨ f_2 ∨ ... ∨ f_m) = Σ_{i=1}^{m} p(f_i) + Σ_{k=2}^{m} ( (−1)^{k−1} Σ_{1 ≤ i_1 < ... < i_k ≤ m} p(f_{i_1} ∧ ... ∧ f_{i_k}) )

Without loss of generality, we can consider that each individual has at least one feature set to true; otherwise, we can create an artificial feature "null" that is set to true when all the others are false. Consequently, the left side of the equation is equal to 1. On the right side, the second term captures the probabilities of intersections of the features. Knowing that 1 ≤ Σ_{i=1}^{m} p(f_i) ≤ m, this intersection term has a value of zero when all features are incompatible (no overlapping) and a "worst case scenario" value of m − 1, when all individuals have all the features set to true.

Based on these observations, we propose the Overlapping Index evaluation measure:

OI(F) = (Σ_{i=1}^{m} p(f_i) − 1) / (m − 1)

OI(F) ∈ [0, 1] and "better" means closer to zero. Hence, a feature set F_1 describes a dataset better than another feature set F_2 when OI(F_1) < OI(F_2).
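A direct computation of OI from a Boolean dataset stored column-wise (feature name → Boolean vector, one entry per individual) might look as follows; this is only a sketch of the formula above.

```python
from typing import Dict, List

def overlapping_index(data: Dict[str, List[bool]]) -> float:
    """OI(F) = (sum_i p(f_i) - 1) / (m - 1), assuming every individual has at
    least one feature set to true; lower values mean less overlap."""
    m = len(data)                                   # number of features
    n = len(next(iter(data.values())))              # number of individuals
    sum_p = sum(sum(column) / n for column in data.values())
    return (sum_p - 1.0) / (m - 1)
```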
Number of features. Considering the case of the majority of machine learning datasets, where the number of primitives is smaller than the number of individuals in the dataset, reducing correlations between features comes at the expense of increasing the number of features. Consider the pair of features {f_i, f_j} judged correlated. Unless f_i ⊇ f_j or f_i ⊆ f_j, the algorithm will replace {f_i, f_j} by {f_i ∧ f_j, f_i ∧ ¬f_j, ¬f_i ∧ f_j}, thus increasing the total number of features. A feature set that contains too many features is no longer informative, nor comprehensible. The maximum number of features that can be constructed is mechanically limited by the number of unique combinations of primitives in the dataset (the number of unique individuals):

|F| ≤ unique(I) ≤ |I|

where F is the constructed feature set and I is the dataset. To measure the complexity in terms of number of features, we use:

C_0(F) = (|F| − |P|) / (unique(I) − |P|)

where P is the primitive feature set. C_0 measures the ratio between how many extra features are constructed and the maximum number of features that could be constructed; 0 ≤ C_0 ≤ 1.
The average length of features. At each iteration, simple conjunctions of two literals are constructed; complex Boolean formulas are created by combining features constructed in previous iterations. Long and complicated expressions generate incomprehensible features, which are more likely a random side-effect than a product of underlying semantics. We define C_1 as the average number of literals (a primitive or its negation) that appear in the Boolean formula representing a new feature. With

P̄ = {¬p_i | p_i ∈ P} ;  L = P ∪ P̄

we have

C_1(F) = ( Σ_{f_i ∈ F} |{l_j | l_j ∈ L, l_j appears in f_i}| ) / |F|

where P is the primitive set and 1 ≤ C_1 < ∞.

As more iterations are performed, the feature set contains more features (C_0 grows), which are increasingly more complex (C_1 grows). This suggests a correlation between the two. What is more, since C_1 can potentially double at each iteration and C_0 can have at most a linear increase, the correlation is exponential. For this reason, in the following sections we shall use only C_0 as the complexity measure.
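The two complexity measures are equally simple to compute. In the sketch below the dataset is again stored column-wise, and each constructed feature is kept as a list of (primitive, polarity) literals; both choices are illustrative, not the authors' data structures.

```python
from typing import Dict, List, Tuple

Literal = Tuple[str, bool]   # (primitive name, True) or (primitive name, False) for its negation

def c0(data: Dict[str, List[bool]], n_primitives: int) -> float:
    """C0 = (|F| - |P|) / (unique(I) - |P|): how many extra features were built,
    relative to the maximum that could be built (assumes unique(I) > |P|)."""
    individuals = list(zip(*data.values()))          # one Boolean tuple per individual
    return (len(data) - n_primitives) / (len(set(individuals)) - n_primitives)

def c1(features: List[List[Literal]]) -> float:
    """C1 = average number of literals per feature formula."""
    return sum(len(formula) for formula in features) / len(features)
```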
Overfitting. All algorithms that learn from data risk overfitting the solution to the learning set. There are two ways in which uFC can overfit the resulting feature set, corresponding to the two complexity measures above: a) constructing too many features (measure C_0) and b) constructing features that are too long (measure C_1). The worst overfitting of type a) occurs when the algorithm constructs as many features as the theoretical maximum (one for each individual in the dataset). The worst overfitting of type b) appears in the same conditions, where each constructed feature is a conjunction of all the primitives appearing for the corresponding individual. The two complexity measures can be used to quantify the two types of overfitting. Since C_0 and C_1 are correlated, both types of overfitting appear simultaneously and can be considered as two sides of a single phenomenon.

4.2 The trade-off between two opposing criteria

C_0 is a measure of how overfitted a feature set is. In order to avoid overfitting, feature set complexity should be kept at low values while the algorithm optimizes the co-occurrence score of the feature set. Optimizing both the correlation score and the complexity at the same time is not possible, as they are opposing criteria; a compromise between the two must be achieved. This is equivalent to the optimization of two contrary criteria, a very well-known problem in multi-objective optimization. To acquire a trade-off between the two mutually contradicting objectives, we use the concept of Pareto optimality (Sawaragi et al, 1985), originally developed in economics. Given multiple feature sets, a set is considered to be Pareto optimal if there is no other set that has both a better correlation score and a better complexity for a given dataset. Pareto optimal feature sets form the Pareto front. This means that no single optimum can be constructed, but rather a class of optima, depending on the ratio between the two criteria.

We plot the solutions in the plane defined by the complexity, as one axis, and the co-occurrence score, as the other. Constructing the Pareto front in this plane makes a visual evaluation of several characteristics of the uFC algorithm possible, based on the deviation of the solutions compared to the front. The distance between the different solutions and the constructed Pareto front visually shows how stable the algorithm is. The convergence of the algorithm can be visually evaluated by how fast (in number of performed iterations) the algorithm transits the plane from the region of solutions with low complexity and high co-occurrence score to solutions with high complexity and low co-occurrence. We can also visually evaluate overfitting, which corresponds to the region of the plane with high complexity and low co-occurrence score: solutions found in this region are overfitted.

In order to avoid overfitting, we propose the "closest-point" heuristic for finding a compromise between OI and C_0. We consider the two criteria to have equal importance. We consider as a good compromise the solution in which the gain in co-occurrence score and the loss in complexity are fairly equal. If one of the indicators has a value considerably larger than the other, the solution is considered unsuitable: such solutions would have either a high correlation between features or a high complexity.
Therefore, we perform a battery of tests and we search a posteriori the Pareto front for solutions for which the two indicators have essentially equal values. In the space of solutions, this translates into a minimal Euclidean distance between the solution and the ideal point (0, 0).
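With the solutions of several runs summarised as (OI, C0) pairs, both the Pareto front and the "closest-point" selection reduce to a few lines; the sketch below assumes that lower is better on both axes.

```python
from math import hypot
from typing import List, Tuple

Solution = Tuple[float, float]            # (OI, C0) of one uFC run

def pareto_front(solutions: List[Solution]) -> List[Solution]:
    """Keep the non-dominated solutions: no other run is at least as good on
    both criteria and strictly better on at least one of them."""
    def dominated(s: Solution) -> bool:
        return any(o != s and o[0] <= s[0] and o[1] <= s[1] for o in solutions)
    return [s for s in solutions if not dominated(s)]

def closest_point(solutions: List[Solution]) -> Solution:
    """'Closest-point' heuristic: the Pareto-optimal run with the smallest
    Euclidean distance to the ideal point (0, 0)."""
    return min(pareto_front(solutions), key=lambda s: hypot(*s))
```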
Throughout the experiments, uFC was executed by varying only the two parameters λ and limit_iter. We denote an execution with specific values for the parameters as uFC(λ, limit_iter), whereas the execution where the parameters were determined a posteriori using the "closest-point" strategy is noted uFC*(λ, limit_iter). For uFRINGE, the maximum number of features was set at 300. We perform a comparative evaluation of the two algorithms from a qualitative and a quantitative point of view, together with examples of typical executions. Finally, we study the impact of the two parameters of uFC.

Experiments were performed on three Boolean datasets. The hungarian dataset is a real-life collection of images depicting Hungarian urban and countryside settings. Images were manually tagged using one or more of 13 tags. Each tag represents an object that appears in the image (e.g., tree, cascade, etc.). The tags serve as features and a feature takes the value true if the corresponding object is present in the image, or false otherwise. The resulting dataset contains 264 individuals, described by 13 Boolean features. Once the dataset was constructed, the images were not used any more. The street dataset was constructed in a similar way, starting from images taken from the LabelMe dataset (Russell et al, 2008). 608 urban images from Barcelona, Madrid and Boston were selected. Image labels were transformed into tags depicting objects by using the uniformization list provided with the toolbox. The dataset contains 608 individuals, described by 66 Boolean features. The third dataset is "Spect Heart" from the UCI repository. The corpus is provided with a "class" attribute and divided into a learning corpus and a testing one. We eliminated the class attribute and concatenated the learning and testing corpora into a single dataset. It contains 267 instances described by 22 Boolean features. Unlike the first two datasets, the features of spect have no specific meaning, being called "F1", "F2", ..., "F22".

The hungarian dataset is available at http://eric.univ-lyon2.fr/~arizoiu/files/hungarian.txt, the street dataset at http://eric.univ-lyon2.fr/~arizoiu/files/street.txt, and the spect dataset at http://archive.ics.uci.edu/ml/datasets/SPECT+Heart.

5.1 uFC and uFRINGE: Qualitative evaluation

For the human reader, it is quite obvious why water and cascade have the tendency to appear together or why road and interior have the tendency to appear separately. One would expect that, based on a given dataset, the algorithms would succeed in making these associations and catching the underlying semantics. Table 2 shows the features constructed with uFRINGE and uFC*(0.194, 2) on hungarian. A quick overview shows that the constructed features manage to make associations that seem "logical" to a human reader. For example, one would expect the feature sky ∧ ¬building ∧ panorama to denote images where there is a panoramic view and the sky, but no buildings, therefore suggesting images outside the city. Fig. 3 supports this expectation. Similarly, the feature sky ∧ building ∧ groups ∧ road covers urban images where groups of people are present, and water ∧ cascade ∧ tree ∧ forest denotes a cascade in the forest (Fig. 4).

Fig. 3: Images related to the newly-constructed feature sky ∧ ¬building ∧ panorama on hungarian: (a) Hungarian puszta, (b)(c) Hungarian Mátra mountains

Fig. 4: Images related to the newly-constructed feature water ∧ cascade ∧ tree ∧ forest on hungarian

Comprehension quickly deteriorates when the constructed feature set is overfitted, when the constructed features are too complex.
The execution of uFC(0.184, 5) reveals features like:

sky ∧ building ∧ tree ∧ building ∧ forest ∧ sky ∧ building ∧ groups ∧ road ∧ sky ∧ building ∧ panorama ∧ groups ∧ road ∧ person ∧ sky ∧ groups ∧ road

Even if the formula is not in Disjunctive Normal Form (DNF), it is obvious that it is too complex to make any sense. If uFC tends to construct overly complex features, uFRINGE suffers from another type of dimensionality curse. Even if the complexity of its features does not impede comprehension, the fact that there are over 300 features constructed from 13 primitives makes the newly-constructed feature set unusable. The number of features is actually greater than the number of individuals in the dataset, which proves that some of the features are redundant. The actual correlation score of the newly-created feature set is even greater than that of the initial primitive set. What is more, new features present redundancy, just as predicted in Section 2: for example, the feature water ∧ forest ∧ grass ∧ water ∧ person, which contains the primitive water twice.

Table 2: Feature sets constructed by uFC and uFRINGE

  primitives   | uFRINGE                                             | uFC(0.194, 2)
  person       | water ∧ forest ∧ grass ∧ water ∧ person             | groups ∧ road ∧ interior
  groups       | panorama ∧ building ∧ forest ∧ grass                | groups ∧ road ∧ interior
  water        | tree ∧ person ∧ grass                               | groups ∧ road ∧ interior
  cascade      | tree ∧ person ∧ grass                               | water ∧ cascade ∧ tree ∧ forest
  sky          | groups ∧ tree ∧ person                              | water ∧ cascade ∧ tree ∧ forest
  tree         | person ∧ interior                                   | water ∧ cascade ∧ tree ∧ forest
  grass        | person ∧ interior                                   | sky ∧ building ∧ tree ∧ forest
  forest       | water ∧ panorama ∧ grass ∧ groups ∧ tree            | sky ∧ building ∧ tree ∧ forest
  statue       | water ∧ panorama ∧ grass ∧ groups ∧ tree            | sky ∧ building ∧ tree ∧ forest
  building     | statue ∧ groups ∧ groups                            | sky ∧ building ∧ panorama
  road         | statue ∧ groups ∧ groups                            | sky ∧ building ∧ panorama
  interior     | panorama ∧ statue ∧ groups                          | sky ∧ building ∧ panorama
  panorama     | grass ∧ water ∧ forest ∧ sky                        | groups ∧ road ∧ person
               | grass ∧ water ∧ forest ∧ sky                        | groups ∧ road ∧ person
               | person ∧ grass ∧ water ∧ forest                     | groups ∧ road ∧ person
               | groups ∧ sky ∧ grass ∧ building                     | sky ∧ building ∧ groups ∧ road
               | groups ∧ sky ∧ grass ∧ building                     | sky ∧ building ∧ groups ∧ road
               | groups ∧ person ∧ water ∧ forest ∧ statue ∧ groups  | sky ∧ building ∧ groups ∧ road
               | groups ∧ person ∧ water ∧ forest ∧ statue ∧ groups  | water ∧ cascade
               | grass ∧ person ∧ statue                             | tree ∧ forest
               | grass ∧ person ∧ statue                             | grass
               | ... and 284 others                                  | statue

The same conclusions are drawn from the execution on the street dataset. uFC*(0.446, 3) creates comprehensible features. For example, headlight ∧ windshield ∧ ¬arm ∧ head (Fig. 5) suggests images in which the front part of cars appears. It is especially interesting how the algorithm specifies ¬arm in conjunction with head in order to differentiate between people (head ∧ arm) and objects that have heads (but no arms).

Fig. 5: Images related to the newly-constructed feature headlight ∧ windshield ∧ ¬arm ∧ head on street

5.2 uFC and uFRINGE: Quantitative evaluation

Table 3 shows, for the three datasets, the values of several indicators: the size of the feature set, the average length of a feature (C_1), and the OI and C_0 indicators. For each dataset, we compare four feature sets: the initial feature set (primitives), the execution of uFC* (parameters determined by the "closest-point" heuristic), uFC with another random set of parameters, and uFRINGE. For the hungarian and street datasets, the same parameter combinations are used as in the qualitative evaluation.
Table 3: Values of indicators for multiple runs on each dataset

  dataset    | Strategy         | feat | length | OI   | C_0
  hungarian  | Primitives       | 13   | 1.00   | 0.24 | 0.00
             | uFC*(0.194, 2)   | 21   | 2.95   | 0.08 | 0.07
             | uFC(0.184, 5)    | 36   | 11.19  | 0.03 | 0.20
             | uFRINGE          | 306  | 3.10   | 0.24 | 2.53
  street     | Primitives       | 66   | 1.00   | 0.12 | 0.00
             | uFC*(0.446, 3)   | 81   | 2.14   | 0.06 | 0.04
             | uFC(0.180, 5)    | 205  | 18.05  | 0.02 | 0.35
             | uFRINGE          | 233  | 2.08   | 0.20 | 0.42
  spect      | Primitives       | 22   | 1.00   | 0.28 | 0.00
             | uFC*(0.432, 3)   | 36   | 2.83   | 0.09 | 0.07
             | uFC(0.218, 4)    | 62   | 8.81   | 0.03 | 0.20
             | uFRINGE          | 307  | 2.90   | 0.25 | 1.45

On all three datasets, uFC* creates feature sets that are less correlated than the primitive sets, while the increase in complexity is only marginal. Very few (2-3) iterations are needed, as uFC converges very fast. Increasing the number of iterations has very little impact on OI, but results in very complex vocabularies (large C_0 and feature lengths). In the feature set created by uFC(0.180, 5) on street, each feature contains, on average, more than 18 literals. This is obviously too much for human comprehension.

For uFRINGE, the OI indicator shows very marginal or no improvement on the spect and hungarian datasets, and even a degradation on street (compared to the primitive set). Features constructed using this approach have an average length between 2.08 and 3.10 literals, just as much as the selected uFC* configuration; but it constructs between 2.6 and 13.9 times more features than uFC*. We consider this to be due to the lack of filtering in uFRINGE, which would also explain the poor OI score: old features remain in the feature set and amplify the total correlation by adding the correlation between old and new features.

5.3 Impact of parameters λ and limit_iter

In order to understand the impact of the parameters, we executed uFC with a wide range of values for λ and limit_iter and studied the evolution of the indicators OI and C_0. For each dataset, we varied λ between
0.002 and 0.5. For each value of λ, we executed uFC by varying limit_iter between 1 and 30 for the hungarian dataset, and between 1 and 20 for street and spect. We study the evolution of the indicators as a function of limit_iter and of λ respectively, plot the solutions in the (OI, C_0) space and construct the Pareto front.

For the study of limit_iter, we hold λ fixed at various values and we vary only limit_iter. The evolution of the OI correlation indicator is given in Fig. 6a. As expected, the measure improves with the number of iterations. OI has a very rapid descent and needs fewer than 10 iterations to converge, on all datasets, towards a value dependent on λ: the higher the value of λ, the higher the value of convergence. The complexity has a very similar evolution, but in the inverse direction: it increases with the number of iterations performed. It also converges towards a value that is dependent on λ: the higher the value of λ, the lower the complexity of the resulting feature set.

Fig. 6: Variation of the OI indicator with limit_iter on hungarian (a) and with λ on street (b)

Similarly, we study λ by fixing limit_iter. Fig. 6b shows how OI evolves when varying λ. As foreseen, for all values of limit_iter, the OI indicator increases with λ, while C_0 decreases with λ. OI shows an abrupt increase between 0.2 and 0.3, for all datasets. For lower values of λ, many pairs get combined, as their correlation score is bigger than the threshold. As λ increases, only highly correlated pairs get selected, and this usually happens in the first iterations; performing more iterations does not bring any change and the indicators are less dependent on limit_iter. For hungarian, no pair has a correlation score higher than 0.4; setting λ higher than this value causes uFC to output the primitive set (no features are created).
Fig. 7: The distribution of solutions, the Pareto front and the closest point on the spect dataset: (a) the Pareto front, (b) the Pareto front (magnified)
To study Pareto optimality, we plot the generated solutions in the (OI, C_0) space. Fig. 7a presents the distribution of solutions, the Pareto front and the solution chosen by the "closest-point" heuristic. The solutions generated by uFC with a wide range of parameter values are not dispersed in the solution space; their distribution is rather compact. This shows good algorithm stability. Even if not all the solutions are Pareto optimal, none of them are too distant from the front and there are no outliers.

Most of the solutions densely populate the part of the curve corresponding to low OI and high C_0. As pointed out in Section 4.2, the area of the front corresponding to high feature set complexity (high C_0) represents the overfitting area. This confirms that the algorithm converges fast, then enters overfitting. Most of the improvement in quality is obtained in the first 2-3 iterations, while further iterating improves quality only marginally, at the cost of an explosion of complexity. The "closest-point" heuristic keeps the construction out of overfitting by stopping the algorithm at the point where the gain in co-occurrence score and the loss in complexity are fairly equal. Fig. 7b magnifies the region of the solution space corresponding to low numbers of iterations.

5.4 Relation between number of features and feature length

Both the average length of a feature (C_1) and the number of features (C_0) increase with the number of iterations. In Section 4.1 we speculated that the two are correlated: C_1 = f(C_0). For each λ in the batch of tests, we create the C_0 and C_1 series depending on limit_iter and we perform a statistical hypothesis test, using the Kendall rank coefficient as the test statistic. The Kendall rank coefficient is particularly useful as it makes no assumptions about the distributions of C_0 and C_1. For all values of λ, for all datasets, the statistical test revealed p-values consistently lower than the habitually used significance levels, which makes us reject the null independence hypothesis and conclude that C_0 and C_1 are statistically dependent.

The major difficulty of uFC, shown by the initial experiments, is setting the values of the parameters. An unfortunate choice would result in either an overly complex feature set or a feature set in which features are still correlated. But both parameters, λ and limit_iter, are dependent on the dataset, and finding suitable values would be a process of trial and error for each new corpus. The "closest-point" heuristic achieves an acceptable equilibrium between complexity and performance, but requires multiple executions with a large choice of parameter values and the construction of the Pareto front, which might not always be desirable or even possible.

We propose a new method for choosing λ based on statistical hypothesis testing and a new stopping criterion inspired by the "closest-point" heuristic. These are integrated into a new "risk-based" heuristic that approximates the best solution while avoiding the time-consuming construction of multiple solutions and of the Pareto front. The only parameter is the significance level α, which is independent of the dataset and makes the task of running uFC on new, unseen datasets easy. A pruning technique is also proposed.

6.1 Automatic choice of λ

We propose replacing the user-supplied co-occurrence threshold λ with a technique that selects only pairs of features for which the positive linear correlation is statistically significant.
These pairs are added to the set O of co-occurring pairs (defined in Section 3.2) and, starting from O, new features are constructed. We use a statistical method, hypothesis testing: for each pair of candidate features, we test the independence hypothesis H_0 against the positive correlation hypothesis H_1.

We use as a test statistic the Pearson correlation coefficient (calculated as defined in Section 3.2) and test the following, formally defined, hypotheses:

H_0: ρ = 0
H_1: ρ > 0

where ρ is the theoretical correlation coefficient between two candidate features. We can show that, in the case of Boolean variables having the contingency table shown in Table 1, the observed value of the χ² statistic of independence is χ²_obs = n r² (n is the size of the dataset). Consequently, under the hypothesis H_0, n r² approximately follows a χ² distribution with one degree of freedom (n r² ∼ χ²(1)), so r √n follows a standard normal distribution (r √n ∼ N(0, 1)), given that n is large enough.

We reject the H_0 hypothesis in favour of H_1 if and only if r √n ≥ u_{1−α}, where u_{1−α} is the right critical value of the standard normal distribution. Two features are therefore considered significantly correlated when r({f_i, f_j}) ≥ u_{1−α} / √n. The significance level α represents the risk of rejecting the independence hypothesis when it was in fact true. It can be interpreted as the false discovery risk in data mining; in the context of feature construction, it is the false construction risk, since it is the risk of constructing new features based on a pair of features that are not really correlated. Statistical literature usually sets α at 0.05 or 0.01, but lower levels, such as 0.001, or levels corrected by m, the theoretical number of tests to be performed, can also be used.
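The decision rule above depends on the data only through n, so the implied correlation threshold can be computed up front; a minimal sketch, using the standard-normal quantile from the Python standard library:

```python
from math import sqrt
from statistics import NormalDist

def lambda_threshold(n: int, alpha: float) -> float:
    """Smallest correlation r for which H0 is rejected at risk alpha:
    r >= u_{1-alpha} / sqrt(n), with r*sqrt(n) ~ N(0, 1) under H0."""
    u = NormalDist().inv_cdf(1.0 - alpha)    # right critical value of N(0, 1)
    return u / sqrt(n)

# Sanity check against the experiments: for hungarian (n = 264) and alpha = 0.001,
# lambda_threshold(264, 0.001) ≈ 3.09 / 16.2 ≈ 0.190, the threshold reported for
# uFC_alpha(0.001) on hungarian in Section 7.
```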
6.2 Candidate pruning technique. Stopping criterion.

Pruning. In order to apply the χ² independence test, the expected frequencies under the H_0 hypothesis must be greater than or equal to 5. We add this constraint to the new feature search strategy (Section 3.2): pairs for which any of the values (a + b)(a + c)/n, (a + b)(b + d)/n, (c + d)(a + c)/n and (c + d)(b + d)/n is smaller than 5 are filtered out from the set of candidate pairs O. This prevents the algorithm from constructing features that are present for very few individuals in the dataset.
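The constraint can be checked directly on the contingency counts of Table 1; a small sketch (the function name is ours):

```python
def expected_counts_ok(a: int, b: int, c: int, d: int, minimum: float = 5.0) -> bool:
    """True when all four expected cell counts of the 2x2 table under H0
    (independence) reach the minimum required for the chi-square approximation."""
    n = a + b + c + d
    expected = (
        (a + b) * (a + c) / n,   # expected count for (f_i, f_j)
        (a + b) * (b + d) / n,   # expected count for (f_i, not f_j)
        (c + d) * (a + c) / n,   # expected count for (not f_i, f_j)
        (c + d) * (b + d) / n,   # expected count for (not f_i, not f_j)
    )
    return all(e >= minimum for e in expected)
```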
Risk-based heuristic. We introduced in Section 4.2 the "closest-point" heuristic for choosing the values of the parameters λ and limit_iter. It searches, on the Pareto front, for the solution for which the two indicators are approximately equal. We transform this heuristic into a stopping criterion: OI and C_0 are combined into a single formula, the root mean square (RMS), and the algorithm stops iterating when RMS has reached a minimum. Using the generalized mean inequality, we can prove that RMS(OI, C_0) has only one global minimum, as with each iteration the complexity increases and OI decreases.

The limit_iter parameter, which is data-dependent, is replaced by the automatic RMS stopping criterion. This stopping criterion, together with the automatic λ choice strategy presented in Section 6.1, forms a data-independent heuristic for choosing the parameters. We call this new heuristic the risk-based heuristic. It makes it possible to approximate the best parameter compromise and to avoid the time-consuming task of computing a batch of solutions and constructing the Pareto front.
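As a sketch of the stopping rule, assume that one uFC iteration can be run on demand and returns the current (OI, C0); the loop below stops at the first iteration where the RMS of the two indicators stops decreasing. The callback-style interface is illustrative only.

```python
from math import sqrt
from typing import Callable, Tuple

def rms(oi: float, c0: float) -> float:
    """Root mean square of the two indicators to be minimised."""
    return sqrt((oi ** 2 + c0 ** 2) / 2.0)

def iterate_until_rms_minimum(step: Callable[[], Tuple[float, float]],
                              initial: Tuple[float, float],
                              max_iter: int = 30) -> int:
    """Run step() (one uFC iteration, returning the new (OI, C0)) until the RMS
    increases; because OI only decreases and C0 only increases, the first
    increase identifies the unique global minimum.  Returns the chosen iteration."""
    best_iter, best_value = 0, rms(*initial)
    for i in range(1, max_iter + 1):
        value = rms(*step())
        if value >= best_value:
            return best_iter                 # the previous iteration was the minimum
        best_iter, best_value = i, value
    return best_iter
```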
We test the proposed improvements, similarly to what was shown in Section 5, on the same three datasets: hungarian, spect and street. We execute uFC in two ways: the classical uFC (Section 3) and the improved uFC (Section 6). The classical uFC needs to have the parameters λ and limit_iter set (noted uFC(λ, limit_iter)); uFC*(λ, limit_iter) denotes the execution with parameters determined a posteriori using the "closest-point" heuristic. The improved uFC is denoted uFC_α(risk); the "risk-based" heuristic is used to determine its parameters and control its execution.

7.1 Risk-based heuristic for choosing parameters

Root Mean Square. In the first batch of experiments, we study the variation of the root mean square aggregation function for a series of selected values of λ. We vary limit_iter between 0 and 30 for hungarian, and between 0 and 20 for spect and street. The evolution of RMS is presented in Fig. 8.

Fig. 8: RMS vs. limit_iter on spect

For all λ, the RMS starts by decreasing, as OI descends more rapidly than C_0 increases. In just 1-3 iterations, RMS reaches its minimum, and afterwards its value starts to increase. This is due to the fact that complexity increases rapidly, with only marginal improvement of quality. This behaviour is consistent with the results presented in Section 5.
Fig. 9: The "closest-point" (uFC*(0.194, 2)) and "risk-based" (uFC_α(0.001): (0.190, 2)) solutions for multiple values of α on hungarian
The second batch of experiments deals with comparing the“risk-based” heuristic to the “closest-point” heuristic. The “closest-point” was determined as described in Section 5. The “risk-based” heuristic was executed multiple times, with val-ues for parameter α ∈ { .
05, 0 .
01, 0 . . . . . . . . } Table 4: “closest-point” and “risk-based” heuristics
Table 4: "Closest-point" and "risk-based" heuristics

  dataset    | Strategy         | λ    | limit_iter | feat | common | length | OI    | C_0
  hungarian  | Primitives       | -    | -          | 13   | -      | 1.00   | 0.235 | 0.000
             | uFC*(0.194, 2)   |      |            |      |        |        |       |
             | uFC_α(0.001)     |      |            |      |        |        |       |
  street     | Primitives       | -    | -          | 66   | -      | 1.00   | 0.121 | 0.000
             | uFC*(0.446, 3)   |      |            |      |        |        |       |
             | uFC_α(0.0001)    |      |            |      |        |        |       |
  spect      | Primitives       | -    | -          | 22   | -      | 1.00   | 0.279 | 0.000
             | uFC*(0.432, 3)   |      |            |      |        |        |       |
             | uFC_α(0.0001)    |      |            |      |        |        |       |

Table 4 gives a quantitative comparison between the two heuristics. A risk of
0.001 is used for hungarian and 0.0001 for spect and street. The feature sets created by the two approaches are very similar, considering all indicators. Not only are the differences between the values of OI, C_0, average feature length and feature set dimension negligible, but most of the created features are identical. On hungarian, 19 of the 21 features created by the two heuristics are identical. Table 5 shows the two feature sets, with the non-identical features in bold.

Table 5: Feature sets constructed by the "closest-point" and "risk-based" heuristics on hungarian

  primitives   | uFC*(0.194, 2)                    | uFC_α(0.001)
  person       | groups ∧ road ∧ interior          | groups ∧ road ∧ interior
  groups       | groups ∧ road ∧ interior          | groups ∧ road ∧ interior
  water        | groups ∧ road ∧ interior          | groups ∧ road ∧ interior
  cascade      | water ∧ cascade ∧ tree ∧ forest   | water ∧ cascade ∧ tree ∧ forest
  sky          | water ∧ cascade ∧ tree ∧ forest   | water ∧ cascade ∧ tree ∧ forest
  tree         | water ∧ cascade ∧ tree ∧ forest   | water ∧ cascade ∧ tree ∧ forest
  grass        | sky ∧ building ∧ tree ∧ forest    | sky ∧ building ∧ tree ∧ forest
  forest       | sky ∧ building ∧ tree ∧ forest    | sky ∧ building ∧ tree ∧ forest
  statue       | sky ∧ building ∧ tree ∧ forest    | sky ∧ building ∧ tree ∧ forest
  building     | sky ∧ building ∧ panorama         | sky ∧ building ∧ panorama
  road         | sky ∧ building ∧ panorama         | sky ∧ building ∧ panorama
  interior     | sky ∧ building ∧ panorama         | sky ∧ building ∧ panorama
  panorama     | groups ∧ road ∧ person            | groups ∧ road ∧ person
               | groups ∧ road ∧ person            | groups ∧ road ∧ person
               | groups ∧ road ∧ person            | groups ∧ road ∧ person
               | water ∧ cascade                   | sky ∧ building ∧ groups ∧ road
               | sky ∧ building                    | sky ∧ building ∧ groups ∧ road
               | tree ∧ forest                     | sky ∧ building ∧ groups ∧ road
               | groups ∧ road                     | water ∧ cascade
               | grass                             | tree ∧ forest
               | statue                            | grass
               |                                   | statue

Fig. 9 presents the distribution of the solutions created by the "risk-based" heuristic with multiple values of α, plotted on the same graphic as the Pareto front in the (OI, C_0) space. Solutions for different values of the risk α are grouped closely together. Not all of them are on the Pareto front, but they are never too far from the "closest-point" solution, providing a good equilibrium between quality and complexity.
Fig. 10: “Closest-point” and “risk-based” heuristics for street, without pruning (a) and with pruning (b). Legend: Pareto front; uFC*(0.446, 3) and uFCα(0.0001): (0.150, 1) in (a); uFC*(0.350, 3) and uFCα(0.0001): (0.150, 2) in (b); uFCα for multiple risks.

On street, the performance of the “risk-based” heuristic starts to degrade compared to uFC*. Table 4 shows differences in the resulting complexity, and only 33% of the constructed features are common to the two approaches. Fig. 10a shows that the solutions found by the “risk-based” approach move away from the “closest-point”. The cause is the large size of the street dataset: as the sample size increases, the null hypothesis tends to be rejected at lower p-values. The auto-determined λ threshold is therefore set too low and the constructed feature sets are too complex. Pruning solves this problem, as shown in Fig. 10b and Section 7.2.
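The sample-size effect can be made concrete with a rough sketch. We assume, for illustration only, a one-sided test of co-occurrence under independence: two features are declared correlated when their observed co-occurrence exceeds the value expected under independence by more than the normal critical value at risk α. This is not the exact test used by uFCα, but it shows how the critical excess, which plays the role of the auto-determined threshold, shrinks as the number of individuals n grows.

```python
from statistics import NormalDist

def critical_excess(p_a, p_b, n, alpha):
    # Under independence, the co-occurrence count of two Boolean features with
    # marginal frequencies p_a and p_b is roughly Binomial(n, p_a * p_b).
    # The smallest excess over the expected proportion that is significant at
    # risk alpha (one-sided normal approximation) shrinks like 1 / sqrt(n).
    p0 = p_a * p_b
    z = NormalDist().inv_cdf(1.0 - alpha)
    return z * (p0 * (1.0 - p0) / n) ** 0.5

for n in (300, 3000, 30000):   # increasing, hypothetical sample sizes
    print(n, round(critical_excess(0.3, 0.4, n, alpha=0.001), 4))
```

At a fixed risk, a large dataset such as street makes even tiny excesses significant, which is why the auto-determined threshold ends up too permissive without pruning.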
7.2 Pruning the candidates

The pruning technique is independent of the “risk-based” heuristic and can be applied in conjunction with the classical uFC algorithm. An execution of this type is denoted uFC_P(λ, max_iter). We execute uFC_P(λ, max_iter) with the same parameters and on the same datasets as described in Section 5.3.
Fig. 11: Pruned and non-pruned Pareto fronts on hungarian (a) and a zoom on the relevant part (b)

We compare uFC with and without pruning by plotting on the same graphic the two Pareto fronts resulting from each set of executions. Fig. 11a shows the pruned and non-pruned Pareto fronts on hungarian. The graphic should be interpreted in a manner similar to a ROC curve, since the algorithm seeks to minimize OI and C at the same time. When one Pareto front runs closer to the origin (0, 0) than a second, the first dominates the second and, thus, its corresponding approach yields better results. For all datasets, the pruned Pareto front dominates the non-pruned one. The difference is marginal, but it shows that filtering improves results.

The most important conclusion is that filtering limits complexity. As the initial experiments (Fig. 7a) showed, most of the non-pruned solutions correspond to very high complexities; visually, the non-pruned Pareto front runs almost tangent to the vertical (complexity) axis. The Pareto front corresponding to the pruned approach stops, for all datasets, at complexities lower than 0.15. This shows that filtering successfully discards solutions that are too complex to be interpretable.

Last, but not least, filtering corrects the problem of automatically choosing λ for the “risk-based” heuristic on big datasets.
We ran uFC_P with risk α ∈ {0.05, 0.01, . . .}. Fig. 10b presents the distribution of solutions found with the “risk-based pruned” heuristic on street. Unlike the results without pruning (Fig. 10a), the solutions generated with pruning are distributed close to those generated by “closest-point” and to the Pareto front.

7.3 Algorithm stability

In order to evaluate the stability of the uFCα algorithm, we introduce noise into the hungarian dataset. The percentage of noise varied between 0% (no noise) and 30%. Introducing a percentage x% of noise means that x% × k × n random feature values in the dataset are inverted (false becomes true and true becomes false), where k is the number of primitives and n is the number of individuals. For each noise percentage, 10 noised datasets are created and only the averages are reported. uFCα is executed on all the noised datasets with the same combination of parameters (risk = 0.001 and no filtering).
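The noise-injection procedure described above is straightforward to sketch. The snippet below assumes the dataset is a Boolean individuals × primitives matrix and flips round(x · n · k) randomly chosen cells; function and variable names are ours, not the paper's.

```python
import numpy as np

def add_noise(data: np.ndarray, x: float, rng: np.random.Generator) -> np.ndarray:
    """Invert x * n * k randomly chosen Boolean cells of an (n, k) matrix."""
    n, k = data.shape
    n_flips = int(round(x * n * k))
    noisy = data.copy()
    # Pick cell indices without replacement over the flattened matrix.
    flat_idx = rng.choice(n * k, size=n_flips, replace=False)
    rows, cols = np.unravel_index(flat_idx, (n, k))
    noisy[rows, cols] = ~noisy[rows, cols]   # false becomes true and vice versa
    return noisy

rng = np.random.default_rng(0)
data = rng.random((100, 13)) < 0.2           # hypothetical sparse Boolean dataset
for x in (0.0, 0.1, 0.3):
    flipped = (add_noise(data, x, rng) != data).sum()
    print(f"{x:.0%} noise -> {flipped} cells inverted")
```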
Fig. 12: uFCα(risk) stability on hungarian when varying the noise percentage: OI and C indicators (a) and number of constructed features (b)

The stability is evaluated using five indicators (the last two are illustrated in the sketch following this list):
– Overlapping Index (OI);
– Feature set complexity (C);
– Number of features: the total number of features constructed by the algorithm;
– Common with zero noise: the number of identical features between the feature sets constructed on the noised datasets and on the non-noised dataset. This indicator evaluates the extent to which the algorithm is capable of constructing the same features, even in the presence of noise;
– Common between runs: the average number of identical features between feature sets constructed from datasets with the same noise percentage. This indicator evaluates how much the constructed feature sets differ at a given noise level.
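A minimal sketch of the last two indicators, assuming each constructed feature is represented as a set of (possibly negated) literals, so that two features are "identical" exactly when they contain the same literals; the feature sets below are hypothetical.

```python
# A constructed feature is modelled as a frozenset of signed literals,
# e.g. sky ∧ ¬building ∧ panorama -> frozenset({"sky", "-building", "panorama"}).
def common_features(fs_a, fs_b):
    """Number of identical features between two constructed feature sets."""
    return len(set(fs_a) & set(fs_b))

def common_between_runs(runs):
    """Average pairwise number of identical features over runs at one noise level."""
    pairs = [(a, b) for i, a in enumerate(runs) for b in runs[i + 1:]]
    return sum(common_features(a, b) for a, b in pairs) / len(pairs)

clean = [frozenset({"sky", "building", "panorama"}), frozenset({"water", "cascade"})]
noisy_run_1 = [frozenset({"sky", "building", "panorama"}), frozenset({"water"})]
noisy_run_2 = [frozenset({"water", "cascade"}), frozenset({"grass"})]

print(common_features(noisy_run_1, clean))           # "common with zero noise" for run 1
print(common_between_runs([noisy_run_1, noisy_run_2]))
```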
As the noise percentage increases, the dataset becomes more random. Fewer pairs of primitives are considered correlated and therefore fewer new features are created. Fig. 12a shows that the overlapping indicator increases with the noise percentage, while the complexity decreases. Furthermore, most feature values in the initial dataset are false. As the percentage of noise increases, this ratio equilibrates (more false values become true than the contrary). As a consequence, for high noise percentages, the OI score is higher than for the primitive set.

The same conclusions can be drawn from Fig. 12b. The indicator Number of features decreases when the noise percentage increases, because fewer features are constructed and the resulting feature set is very similar to the primitive set. The number of constructed features stabilizes at around 20% of noise. This is the point where most of the initial correlation between features is lost.
Common with zero noise has a similar evolution. The number of features identical to those constructed on the non-noised dataset decreases quickly and stabilizes at around 20% noise. After 20%, all the identical features are among the initial primitives. Similarly, the value of
Common between runs decreases at first. For small amounts of introduced noise, the correlation between certain features is reduced, modifying the order in which pairs of correlated features are selected in Algorithm 2. This results in a diversity of constructed feature sets. As the noise level increases and the noised datasets become more random, the constructed feature sets resemble the primitive set, therefore increasing the value of
Common between runs.

8 Conclusion and future work

In this article, we propose two approaches to feature construction. Unlike the other feature construction algorithms proposed so far in the literature, our proposals work in an unsupervised learning paradigm. uFRINGE is an unsupervised adaptation of the FRINGE algorithm, while uFC is a new approach that replaces linearly correlated features with conjunctions of literals. We show that our approaches succeed in reducing the overall correlation in the feature set, while constructing comprehensible and interpretable features. We have performed extensive experiments to highlight the impact of the parameters on the total correlation measure and on the feature set complexity. Based on the first set of experiments, we have proposed a heuristic that finds a suitable balance between quality and complexity and avoids time-consuming multiple executions followed by the construction of a Pareto front. We use statistical hypothesis testing and confidence levels for parameter approximation, and reasoning on the Pareto front of the solutions for evaluation. We also propose a pruning technique, based on hypothesis testing, that limits the complexity of the generated features and speeds up the construction process.

For future development, we consider taking into account non-linear correlation between variables by modifying the metric of the search and the co-occurrence measure. Another research direction is adapting our algorithms to Web 2.0 data (e.g., automatic treatment of labels on the web). Several challenges arise, such as very large label sets (it is common to have over 10 000 features), non-standard label names (see the standardization preprocessing we performed for the LabelMe dataset in Section 5) and missing data (a value of false can mean either absence or missing data). We also consider converting the generated features to Disjunctive Normal Form for easier reading, and suppressing features that have low support in the dataset. This would reduce the size of the feature set by removing rare features, but would introduce new difficulties such as detecting nuggets.