Neural Decision Trees
Randall Balestriero, Electrical and Computer Engineering Department, Rice University, Houston TX, USA ([email protected])

Abstract—In this paper we propose a synergistic melting of neural networks and decision trees (DT) that we call neural decision trees (NDT). NDT is an architecture à la decision tree where each splitting node is an independent multilayer perceptron (MLP), allowing oblique decision functions, or arbitrary nonlinear decision functions if more than one layer is used. Each MLP can thus be seen as a node of the tree. We then show that, under a weight-sharing assumption among those units, we end up with a Hashing Neural Network (HNN), which is a multilayer perceptron with a sigmoid activation function for the last layer, as opposed to the standard softmax. The output units then jointly represent the probability of being in a particular region. The proposed framework allows global optimization, as opposed to the greedy optimization of DTs, and differentiability w.r.t. all parameters and the input, allowing easy integration in any learnable pipeline, for example after CNNs for computer vision tasks. We also demonstrate the modeling power of the HNN, which can learn unions of disjoint regions for final clustering or classification, making it more general and powerful than a standard softmax MLP, which requires linear separability, and thus reducing the burden on the inner layers to perform complex data transformations. We finally show experiments for supervised, semi-supervised and unsupervised tasks and compare results with standard DTs and MLPs.
Index Terms—Artificial Neural Networks, Multilayer Perceptrons, Decision Trees, Locality Sensitive Hashing, Oblique Decision Trees
I. INTRODUCTION AND DECISION TREE REVIEW
A. Motivations

Hashing. A hash function $H : X \rightarrow Y$ is a many-to-one mapping from a possibly infinite set to a finite set Harrison (1971), where we usually have $X \subset \mathbb{R}^d$ and $Y \subset \{1, ..., C\}$. As a result, the following function

$$H(x) = \arg\max_y p(y \mid x), \quad (1)$$

typical of discriminative models Jordan (2002), can be seen as a hashing function that has been optimized or learned given some dataset of N independent observations $\{(x_i, y_i)\}_{i=1}^N$ of input data $x_i$ and corresponding labels $y_i$. Decision trees (DT) are a special kind of discriminative model aiming at breaking up a complex decision into a union of simple binary decisions, a.k.a. splitting nodes Safavian and Landgrebe (1990). In order to do so, DT learning involves a sequential top-down partitioning of the input space X into sub-regions $\Omega_k$ satisfying

$$\cup_k \Omega_k = X, \quad (2)$$
$$\Omega_k \cap \Omega_l = \emptyset, \quad \forall k \neq l. \quad (3)$$

This partition is done so that the label distribution of the points w.r.t. each region has minimal entropy. In particular, the optimum is obtained when all the points lying in a sub-region belong to the same class, and this for all the sub-regions. Once trained, a new observation x is classified by first finding which region $\Omega(x)$ it belongs to and by predicting the label specific to this region according to

$$H(x) = \mathrm{mode}(\{y_i \mid x_i \in \Omega(x)\}), \quad (4)$$

for classification problems and

$$H(x) = \mathrm{mean}(\{y_i \mid x_i \in \Omega(x)\}), \quad (5)$$

for regression problems. What puts DTs among the most powerful discriminative techniques for non-cognitive tasks is the fact that the number of sub-regions $\Omega_k$ of X grows exponentially w.r.t. their depth. This property is at the core of the developed Hashing Neural Network (HNN), coupled with the acute modeling capacity of deep neural networks. We now briefly describe standard univariate and multivariate decision trees, their advantages and drawbacks, as well as motivations to extend them into a more unified and differentiable framework. In fact, as Quinlan said Quinlan (1994), decision trees are better when discrimination has to be done through a sequential testing of different attributes, whereas ANNs are good when knowledge of a simultaneous combination of many attributes is required; trying to get the best of both worlds thus seems natural.
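As an illustration of Eqs. (4)-(5), the following minimal NumPy sketch predicts from region membership; the helper `region_of`, which stands in for the learned tree traversal, is a hypothetical assumption of this sketch, not part of the paper.

```python
import numpy as np

def predict(x, X_train, y_train, region_of, task="classification"):
    """Predict the label of x from the training points sharing its region."""
    r = region_of(x)                                        # Omega(x)
    in_region = np.array([region_of(xi) == r for xi in X_train])
    labels = y_train[in_region]                             # {y_i | x_i in Omega(x)}
    if task == "classification":                            # Eq. (4): mode
        values, counts = np.unique(labels, return_counts=True)
        return values[np.argmax(counts)]
    return labels.mean()                                    # Eq. (5): mean
```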
B. Univariate and Oblique Decision Trees

Decision trees lead to a recursive partitioning of the input space X through a cascade of simple tests Kuncheva (2004). In the case of univariate or monothetic DTs, the local test at each splitting node is done by looking at one attribute $att$ of $x \in X$ and comparing its value to a threshold b. If $x(att) < b$, the observation is passed to the left child, where another test is performed. This process is repeated until a leaf is reached, at which point a prediction can be made for x. There is an intuition behind the sequence of simple tests performed by the tree to classify an object, which is particularly useful in botany, zoology and medical diagnosis. By looking at all the leaves, one can see that they partition the input space into a set of axis-aligned regions. Since growing the optimal tree has been proven NP-complete Hyafil and Rivest (1976), standard DT induction performs a greedy optimization by learning the best attribute and threshold for each split sequentially Quinlan (1986), unless everything is picked at random as in Extremely Randomized DTs Geurts et al. (2006). A tree is usually built in a top-down fashion, but bottom-up and hybrid algorithms also exist. If a stopping criterion is used during the growing phase, it is called prepruning. Typical stopping criteria for prepruning involve a validation set, a criterion on the impurity reduction, or a minimum number of examples reaching a node. Finally, hypothesis testing using a Chi-Square test Goulden (1939), checking whether the class distribution of the children differs from that of the parent, can also be used. Early stopping can be detrimental by halting the exploration too soon, a phenomenon known as the horizon effect Duda et al. (2012); a DT can instead be fully developed and then postpruned using some heuristics, which is a way to regularize the splits. In fact, a fully developed DT, provided there are no two identical objects $x_i, x_j$ in the dataset with different class labels, can learn the dataset exactly, leading to zero re-substitution error but making it unstable: it can memorize the training set, so that a small change in the input can lead to a completely different fitted tree. The main limitation of univariate DTs resides in the axis-aligned splits. This inherently implies that the performance of a DT is not invariant to rotations of the input space, and for cognitive tasks, tests done on only one attribute at a time lead to extremely poor results unless hand-crafted features are provided. As a result, oblique decision trees have been developed, for which a test is now done on a linear combination of the attributes of x. Since the splitting is still not differentiable, optimizing the cutting value and the weight vector w is usually done with Genetic Algorithms (GA) Cantu-Paz and Kamath (2003). Finally, unsupervised pre-processing has also been developed with Random Projection Trees Dasgupta and Freund (2008); Blaser and Fryzlewicz (2015) and PCA Trees Verma et al. (2009); Sproull (1991) to avoid the rotation problems. A partitioning example is presented in Fig. 1 for a univariate and an oblique tree.

Fig. 1. Random Projection Tree example.

The contributions of the paper are: learning arbitrary decision boundaries, as opposed to axis-aligned or linear ones, through a global optimization framework, as opposed to greedy optimization, via differentiable splitting nodes;
the derivation of the HNN, allowing the use of deep neural networks with a new deep hashing layer which is related to Locality Sensitive Hashing (LSH); and finally, the use of the developed HNN for supervised and unsupervised problems as well as for classification and regression tasks.

Fig. 2. Simple tree (splitting nodes $X_{i,j}$ and leaves $L_k$).
II. NEURAL DECISION TREES
We first introduce the neural decision tree (NDT), a softened version of decision trees allowing finely learned arbitrary decision surfaces for each node split and a global optimization framework. We first briefly review supervised LSH, as the neural decision tree is a particular instance of an LSH framework.
A. Locality Sensitive Hashing
Locality Sensitive Hashing Gionis et al. (1999); Charikar (2002) aims at mapping similar inputs to the same hash value. In the case of trees, the hash value corresponds to the reached leaf. Learning this kind of function in a supervised manner has been studied Liu et al. (2012). For example, in Xia et al. (2014) the similarity matrix induced by the labels is factorized into $H^T H$ and the features H are learned through a CNN. In Salakhutdinov and Hinton (2009) a deep autoencoder is learned in an unsupervised manner and the latent representation is then used for LSH. This last framework will be a special case of our unsupervised HNN, with the main difference that the autoencoder will not just be trained to reconstruct the input but also to provide a meaningfully clustered latent space representation.

B. Model
The main change we make to a decision tree to render it differentiable is to replace the splitting function, which can be seen as an indicator function, with a sigmoid function

$$\Phi(x) = \frac{1}{1 + e^{-x}}. \quad (6)$$

We now interpret, for each node $X_{i,j}$, the output of $\Phi$ as the probability that the instance x goes to the left child of the node; note that this is a generalized version of the node change suggested in Laptev and Buhmann (2014). For example, looking at the tree representation of Fig. 2, we have

$$P(x \in L_1) = P(X_{1,1} = \text{left} \mid x)\, P(X_{2,1} = \text{left} \mid X_{1,1} = \text{left}, x). \quad (7)$$

In fact, if we denote the attribute and threshold value of each node $X_{i,j}$ in the tree by $att_{X_{i,j}}$ and $b_{X_{i,j}}$, and the parent of $X_{i,j}$ by $X_{\pi(i,j)}$, we can express the probability of a point being passed to the left child through a sigmoid as

$$P(X_{i,j} = \text{left} \mid x, X_{\pi(i,j)}) = \Phi\big(b_{X_{i,j}} - x(att_{X_{i,j}})\big). \quad (8)$$
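To make Eqs. (6)-(7) concrete, here is a minimal NumPy sketch (our notation, not the paper's code) of soft routing in the depth-2 tree of Fig. 2: each of the three splitting nodes is a linear unit (w, b) whose sigmoid output is the probability of going left, and each leaf probability is the product of the branch probabilities along its path.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))            # Eq. (6)

def leaf_probabilities(x, nodes):
    """nodes = {'root': (w, b), 'left': (w, b), 'right': (w, b)}."""
    p = {k: sigmoid(x @ w + b) for k, (w, b) in nodes.items()}
    return np.array([
        p["root"] * p["left"],                  # P(x in L1), Eq. (7)
        p["root"] * (1 - p["left"]),            # P(x in L2)
        (1 - p["root"]) * p["right"],           # P(x in L3)
        (1 - p["root"]) * (1 - p["right"]),     # P(x in L4)
    ])

rng = np.random.default_rng(0)
nodes = {k: (rng.normal(size=2), 0.0) for k in ("root", "left", "right")}
print(leaf_probabilities(rng.normal(size=2), nodes))  # four probabilities summing to 1
```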
We now introduce some notation in order to derive the overall loss function of the tree. We derive the binary classification case only, since the general case is developed in the HNN section; we thus restrict $y \in \{0, 1\}$. For the case where each node is optimized in a top-down fashion, we first derive the soft version of the usual DT splitting criteria. We have, for each node $X_{i,j}$ which is not a leaf,

$$P = \sum_n y_n, \quad (11)$$
$$N = \sum_n (1 - y_n), \quad (12)$$
$$n_{left} = \sum_n \Phi(x_n), \quad (13)$$
$$n_{right} = P + N - n_{left}, \quad (14)$$
$$P_{left} = \sum_n \Phi(x_n)\, y_n, \quad (15)$$
$$N_{left} = n_{left} - P_{left}, \quad (16)$$
$$P_{right} = P - P_{left}, \quad (17)$$
$$N_{right} = N - N_{left}, \quad (18)$$

where these quantities represent, respectively, the number of positive and negative observations, the (soft) numbers of observations going to the left and right children, the number of observations with y = 1 going to the left child, the number of observations with y = 0 going to the left child, and similarly for the right child. We now present the main loss functions that can be used, adapted from standard DT losses.

a) Information Gain (ID3 De Mántaras (1991), C4.5 Quinlan (2014), C5.0 Im and Jensen (2005)): The information gain represents the amount by which the entropy of the class labels changes w.r.t. the splitting of the dataset. It has to be maximized, which happens when one minimizes the weighted sum of the local entropies, which in turn requires the class distribution of each region to converge to a Dirac. It is defined as

$$IG(x, y; \Phi) = E(P, N) - \frac{n_{left}}{P+N}\, E(P_{left}, N_{left}) - \frac{n_{right}}{P+N}\, E(P_{right}, N_{right}). \quad (19)$$

b) Gini Impurity (CART Lewis (2000)): The Gini impurity is more statistically rooted. It has to be minimized and attains its global minimum of 0 when each region encodes only one class. It is defined for each region as

$$G(x, y) = 1 - \sum_k f_k^2, \quad (20)$$

where $f_k$ is the proportion of observations of class k. It symbolizes the expected classification error incurred if a class label were drawn following the class label distribution of the leaf. In fact, we have

$$P(\hat{y}_i \neq y_i) = \sum_k P(\hat{y}_i = k)\, P(y_i \neq k) = \sum_k f_k (1 - f_k) = 1 - \sum_k f_k^2.$$

The loss function per node is thus the weighted Gini impurity of the children, defined as

$$i_{left} = 1 - \left(\frac{P_{left}}{n_{left}}\right)^2 - \left(\frac{N_{left}}{n_{left}}\right)^2, \quad (21)$$
$$i_{right} = 1 - \left(\frac{P_{right}}{n_{right}}\right)^2 - \left(\frac{N_{right}}{n_{right}}\right)^2, \quad (22)$$
$$G(x, y, \Phi) = \frac{n_{left}}{N+P}\, i_{left} + \frac{n_{right}}{N+P}\, i_{right}. \quad (23)$$

c) Variance Reduction (CART Breiman et al. (1984)): Finally, to tackle regression problems, another measure has been derived, the variance reduction. In this case, one aims to find the split so that the intra-region variance is minimal. It is defined as

$$\frac{\sum_{i=1}^N \Phi(x_i)(y_i - \tilde{y}_{left})^2}{\sum_{i=1}^N \Phi(x_i)} + \frac{\sum_{i=1}^N (1 - \Phi(x_i))(y_i - \tilde{y}_{right})^2}{\sum_{i=1}^N (1 - \Phi(x_i))}, \quad (24)$$

with

$$\tilde{y}_{left} = \frac{\sum_{i=1}^N \Phi(x_i)\, y_i}{\sum_{i=1}^N \Phi(x_i)}, \quad (25)$$
$$\tilde{y}_{right} = \frac{\sum_{i=1}^N (1 - \Phi(x_i))\, y_i}{\sum_{i=1}^N (1 - \Phi(x_i))}. \quad (26)$$

Note that a weighted version of this variance reduction can be used w.r.t. the probability of each region.
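A minimal NumPy sketch of the soft node statistics of Eqs. (11)-(18) and the resulting soft Gini criterion of Eqs. (21)-(23) follows; `phi` holds the sigmoid outputs $\Phi(x_n)$ of one node on a batch, `y` the binary labels, and the small epsilon guard against empty children is an implementation assumption of this sketch.

```python
import numpy as np

def soft_gini(phi, y, eps=1e-12):
    """Soft Gini split criterion, Eq. (23), for one differentiable node."""
    P, N = y.sum(), (1 - y).sum()              # class counts, Eqs. (11)-(12)
    n_left = phi.sum()                         # soft count going left, Eq. (13)
    n_right = P + N - n_left                   # Eq. (14)
    P_left = (phi * y).sum()                   # soft positives going left, Eq. (15)
    N_left = n_left - P_left                   # Eq. (16)
    P_right, N_right = P - P_left, N - N_left  # Eqs. (17)-(18)
    i_left = 1 - (P_left / (n_left + eps)) ** 2 - (N_left / (n_left + eps)) ** 2
    i_right = 1 - (P_right / (n_right + eps)) ** 2 - (N_right / (n_right + eps)) ** 2
    return (n_left * i_left + n_right * i_right) / (P + N)   # Eq. (23)
```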
Learning an NDT is now straightforward and similar to learning a DT, except that for each node, instead of searching heuristically or exhaustively for the splitting criterion, it is optimized through an iterative optimization procedure such as gradient descent. This already alleviates the drawbacks of non-differentiability encountered in oblique trees, where GAs had to be used to find the optimal hyperplane. However, it is also possible to go further by not just optimizing each splitting node in a greedy manner but optimizing all the splitting nodes simultaneously. In fact, with the Neural Decision Tree framework, we are now able to optimize the overall cost function simultaneously over all the nodes. This loses the sequential aspect of the tests, and there no longer exists an analytical solution for the global loss as was the case in the greedy framework; however, the likelihood of being stuck in a poor local optimum is smaller, since a non-optimal split at a given node no longer degrades the performance of all its children. The global loss function corresponds to the loss function of each leaf weighted by the probability to reach it. This is defined explicitly for the Gini impurity as

$$Gini(Tree, x) = \sum_{leaf} \left[\frac{\sum_i P(x_i \in leaf)}{P+N}\right] \left(\frac{N_{leaf}}{P_{leaf}+N_{leaf}} - \left(\frac{N_{leaf}}{P_{leaf}+N_{leaf}}\right)^2\right), \quad (27)$$

and for the entropy as

$$IG(Tree, x) = E(P, N) - \sum_{leaf} \left[\frac{\sum_i P(x_i \in leaf)}{P+N}\right] E(P_{leaf}, N_{leaf}), \quad (28)$$

where the quantities $\left[\frac{\sum_i P(x_i \in leaf)}{P+N}\right]$ denote the probability that a given point belongs to this leaf, estimated on the training set. As a result, these loss functions correspond to the generalization of the node loss function to all the leaves, weighted by the probability to go into each of the leaves.
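A hedged sketch of the global Gini loss of Eq. (27): every leaf is scored by its soft class counts and weighted by its estimated reach probability, so all splitting nodes receive gradient simultaneously. It assumes a matrix `leaf_probs` of $P(x_i \in leaf)$, e.g. produced by the routing sketch above.

```python
import numpy as np

def global_gini(leaf_probs, y, eps=1e-12):
    """Eq. (27). leaf_probs: (N_points, N_leaves); y: binary labels (N_points,)."""
    total = len(y)                              # P + N
    P_leaf = leaf_probs.T @ y                   # soft positive count per leaf
    n_leaf = leaf_probs.sum(axis=0)             # soft mass per leaf
    N_leaf = n_leaf - P_leaf
    q = N_leaf / (n_leaf + eps)                 # N_leaf / (P_leaf + N_leaf)
    reach = n_leaf / total                      # [sum_i P(x_i in leaf)] / (P + N)
    return np.sum(reach * (q - q ** 2))         # weighted per-leaf Gini
```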
III. HASHING NEURAL NETWORK

A. Motivation
One can see that, for the special case where $G(x) = x^T w + b$, we can rewrite the NDT as a perceptron where the output neurons all have a sigmoid activation. The output, an ordered chain of binary values, is simply equivalent to the path in the corresponding tree that would route x to the corresponding leaf. This is the main motivation for rewriting the NDT as the HNN, so as to then be able to add multiple layers and leverage the deep learning framework combined with the NDT. This idea of combining the topology of DTs with the learning capacities of ANNs is not new. For example, combining an ANN for latent space representation followed by a DT is done in Chandra et al. (2007). Generating rules based on a trained ANN with DTs is studied in Fu (1994); Towell and Shavlik (1993); Kamruzzaman and Hasan (2010); Craven and Shavlik (1994); Craven (1996). Using an ANN to filter a dataset prior to learning a DT is explored in Krishnan et al. (1999). Finally, and more recently, a reformulation of regression trees as sparse ANNs has been done in Biau et al. (2016) in order to fine-tune the learned DT. In our case, the HNN can be summarized simply as a generalized NDT in the sense that it learns regions so that the class distribution within each region has minimal uncertainty, ultimately with only one class per region but not necessarily one region per class, as opposed to current architectures. In addition, we will see that thanks to the LSH framework, we are also able to perform HNN learning in a semi-supervised or even unsupervised way.

Fig. 3. Example of sub-region query.
B. The Hashing Layer
In the HNN framework, the last layer now plays the role of a hashing function, hence its name of hashing layer. As a result, its output no longer represents $p(y \mid x)$ but $p(x \in L)$, where L is any of the sub-regions encoded by the network, and the output of the neurons corresponds to a prediction of the path taken in a decision tree. From the sub-region membership, a prediction policy can be used, for example based on the most represented training class in the region. This already highlights the ability to also make confidence predictions: if a region is ambiguous, different kinds of predictions can be made depending on the problem at hand. We now develop the supervised case.
1) Supervised Case:
The number of output neurons $out$ must be at least $\log_2(C)$, where C is the number of classes of the task at hand. In fact, one needs at least C different regions, and the number of different regions that can be modeled by a hashing layer with $out$ neurons is $2^{out}$. When the number of output neurons is greater, the ANN has the flexibility to learn different sub-regions for each class. This is particularly interesting, for example, if the latent representation is somehow clustered yet the number of clusters is greater than the number of classes, and clusters of the same class are not necessarily neighbors. In order to derive our loss function for multiclass problems, we first impose the formulation of y as a one-hot vector with a 1 at the index of the class. From that we have the following quantities

$$out_n(x) = \Phi(x^T w_n + b_n), \quad \forall n \in \{1, ..., out\}, \quad (29)$$
$$\mathcal{P} = \{0, 1\}^{out}, \quad (30)$$

and we call chain each element of $\mathcal{P}$, which fully and uniquely identifies each of the sub-regions encoded in the network. We thus have

$$p(x_i \in chain) = \prod_n out_n(x_i)^{1_{chain(n)=1}}\, (1 - out_n(x_i))^{1_{chain(n)=0}}. \quad (31)$$

The average y per sub-region is the weighted mean of all the y w.r.t. their membership probabilities; this gives the class distribution per chain as

$$y_{chain} = \frac{\sum_i y_i\, p(x_i \in chain)}{\sum_i p(x_i \in chain)}, \quad (32)$$

where we recall that $y_i$ is a one-hot vector, and thus $y_{chain}$ is a vector of size C which is nonnegative and sums to 1: it is $p(y \mid chain)$. Finally, we can now compute a measure of uncertainty E for each chain, whether it is the entropy, the Gini impurity or any other differentiable function, and write the final loss as the sum of these uncertainties weighted by the probability to reach each sub-region:

$$loss = \sum_{chain \in \mathcal{P}} p(chain)\, E(y_{chain}), \quad (33)$$

where, as for the NDT, the estimated probability to reach a sub-region is simply

$$p(chain) = \frac{\sum_{i=1}^N p(x_i \in chain)}{N}. \quad (34)$$

Note that all the classification losses can be used with class weights in order to deal with unbalanced datasets, which occur often in real-world problems. For regression problems we first define the local variance of the outputs as

$$V(chain) = \frac{\sum_i (y_i - \tilde{y}_{chain})^2\, p(x_i \in chain)}{\sum_i p(x_i \in chain)}, \quad (35)$$

where

$$\tilde{y}_{chain} = \frac{\sum_i y_i\, p(x_i \in chain)}{\sum_i p(x_i \in chain)}. \quad (36)$$

As a result, we can write the complete regression loss as

$$loss = \sum_{chain \in \mathcal{P}} p(chain)\, V(chain). \quad (37)$$

Note that it is possible not to weight the local loss functions by the probability to reach the region. Finally, note that this hashing layer can obviously be used after any already known neural network layer, such as densely connected Gardner and Dorling (1998), convolutional Krizhevsky et al. (2012) or even recurrent Gers et al. (2000) layers, making the HNN act on a deep, latent representation rather than on the input space. The training is then done through back-propagation with the presented loss.
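The following sketch assembles Eqs. (29)-(34) with entropy as the uncertainty E; W, b are assumed hashing-layer parameters and Y is one-hot. Enumerating all $2^{out}$ chains is exponential, so this naive loop is only illustrative for small $out$.

```python
import numpy as np
from itertools import product

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hashing_loss(X, Y, W, b):
    """Supervised hashing-layer loss, Eq. (33). X: (N, d); Y: (N, C) one-hot."""
    out = sigmoid(X @ W + b)                                 # Eq. (29), shape (N, out)
    loss = 0.0
    for chain in product([0, 1], repeat=out.shape[1]):       # P = {0,1}^out, Eq. (30)
        c = np.array(chain)
        p_x = np.prod(np.where(c == 1, out, 1.0 - out), axis=1)   # Eq. (31)
        mass = p_x.sum()
        if mass < 1e-12:                                     # empty region: no contribution
            continue
        y_chain = (Y * p_x[:, None]).sum(axis=0) / mass      # p(y|chain), Eq. (32)
        entropy = -np.sum(y_chain * np.log(y_chain + 1e-12)) # uncertainty E
        loss += p_x.mean() * entropy                         # weighted by p(chain), Eq. (34)
    return loss
```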
2) Semi-Supervised Learning:
One extension not developed in this paper concerns the semi-supervised case, for which a standard discriminative loss, such as the information gain or the entropy, is used for the labeled examples, and the unsupervised loss is used for the unlabeled as well as the labeled examples. The unsupervised loss would typically be the intra-region variance, which is similar to the loss of the k-NN algorithm or of a GMM with identity covariance matrix. This results in aggregating into the same region parts of the space with high density, while constraining that two different labels never occur inside one region. Extensions of this, as well as validation results, will be presented in order to validate this hybrid loss between decision trees and k-NN.
3) Unsupervised Case:
In the last section a loss function was derived for the case where we have access to all the labels $y_i$ of the training inputs $x_i$. The semi-supervised framework is basically a deep autoencoder on which a hashing layer is connected to the latent space, namely the middle layer. As a result, the reconstruction loss is applied to all inputs and the standard hashing layer loss is used when labels are available. In the unsupervised case, a deep autoencoder is again trained coupled with a hashing layer, but this time the hashing layer is unsupervised. We now derive analytically the loss function of the hashing layer in the unsupervised case. There are two ways to do it: the layer can either be random, so as to become a known LSH function such as MinHash, or one can try to find clusters so that the intra-cluster variance is minimal, similarly to a k-NN Fukunaga and Narendra (1975) approach. The variance per region is defined as

$$V_{chain}(x) = \frac{\sum_i (x_i - \tilde{x}_{chain})^2\, p(x_i \in chain)}{\sum_i p(x_i \in chain)}, \quad (38)$$

where the local sub-region mean is defined as

$$\tilde{x}_{chain} = \frac{\sum_i p(x_i \in chain)\, x_i}{\sum_i p(x_i \in chain)}. \quad (39)$$

As a result, the overall loss is simply

$$loss = \sum_{chain \in \mathcal{P}} p(chain)\, V_{chain}(x). \quad (40)$$

If we denote by G the mapping from the input to the last layer of the neural network before the hashing layer, the final unsupervised loss thus becomes

$$loss = \sum_{x \in X} \|G^{-1}(G(x)) - x\|^2 + \sum_{chain \in \mathcal{P}} p(chain)\, V_{chain}(G(x)), \quad (41)$$

and for the semi-supervised case

$$loss = \sum_{x \in X} \|G^{-1}(G(x)) - x\|^2 + \sum_{chain \in \mathcal{P}} p(chain)\, E(y_{chain}), \quad (42)$$

where $y_{chain}$ is computed over the labeled examples only. Concerning the random strategy, similarly to Extremely Randomized Trees, it is possible to adopt an unlearned approach for which the hyperplanes (w, b) are drawn according to a Normal distribution. This way, we follow the LSH framework, for which we have

$$p(out_n(x_i) \neq out_n(x_j)) \propto d(x_i, x_j), \quad (43)$$

where d is a distance measure.
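A sketch of the unsupervised hashing term of Eqs. (38)-(40), under the assumption that `H` holds the latent codes G(x) of an already defined autoencoder; the reconstruction term of Eq. (41) would be added separately by that autoencoder's own loss.

```python
import numpy as np
from itertools import product

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def unsupervised_hashing_loss(H, W, b):
    """Soft intra-region variance, Eqs. (38)-(40). H: (N, d) latent codes G(x)."""
    out = sigmoid(H @ W + b)
    loss = 0.0
    for chain in product([0, 1], repeat=out.shape[1]):
        c = np.array(chain)
        p_x = np.prod(np.where(c == 1, out, 1.0 - out), axis=1)  # p(x_i in chain)
        mass = p_x.sum()
        if mass < 1e-12:
            continue
        mean = (p_x[:, None] * H).sum(axis=0) / mass             # Eq. (39)
        var = (p_x * ((H - mean) ** 2).sum(axis=1)).sum() / mass # Eq. (38)
        loss += (mass / len(H)) * var                            # Eq. (40), weighted by p(chain)
    return loss
```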
C. Training

1) Iterative Optimization Schemes: Since there is no analytical form for the optimal weights and biases inside a deep neural network, one has to use iterative optimization methods. Two of the main possibilities are Genetic Algorithms Davis (1991) and gradient-based methods. We focus here on gradient-based methods, as they are the most popular optimization technique nowadays. Put simply, for each free parameter W the update rule is

$$W = W - \alpha \nabla_W loss + \beta f(W), \quad (44)$$

where $\alpha$ is the learning rate and $\beta$ a regularization parameter applied to some extra function f. A common technique to find the best learning rate and regularizer is cross-validation, but new techniques have been developed allowing adaptive learning rates and momentum which are changed during training Yu et al. (2006); Yu and Liu (2002); Hamid et al. (2011); Nawi et al. (2011). Whatever activation function is used, one can also add new parameters in order to scale the input, as presented in He et al. (2015). Finally, many tricks for better back-propagation are studied in LeCun et al. (2012), and a deep study of the behavior of the weights during learning is carried out in LeCun et al. (1991). We now derive the explicit gradient for the case where the loss E is the Gini impurity. It is clear that

$$\frac{d}{dW} loss = \sum_n \frac{d\, loss}{d\, out_n} \frac{d\, out_n}{dW} = \sum_{chain \in \mathcal{P}} \sum_n \frac{d\, [p(chain)\, E(y_{chain})]}{d\, out_n} \frac{d\, out_n}{dW}, \quad (45)$$

with $\frac{d [p(chain) E(y_{chain})]}{d\, out_n}$ a scalar and $\frac{d\, out_n}{dW}$ a matrix. We now derive this derivative explicitly:

$$\frac{d\, out(x_i)}{dW} = \begin{pmatrix} (-1)^{1_{chain(1)=0}}\, out_1(x_i)(1 - out_1(x_i))\, x_i^T \\ (-1)^{1_{chain(2)=0}}\, out_2(x_i)(1 - out_2(x_i))\, x_i^T \\ \vdots \end{pmatrix}, \quad (46)$$

which is of size (out, in) and has on each row the input $x_i$, which might be the output of another upper layer, weighted by the activation; it is similar to a standard sigmoid layer except for the indicator function, which determines the sign. If we now denote the true output by

$$\sigma_n^{chain(n)}(x_i) = out_n(x_i)^{1_{chain(n)=1}}\, (1 - out_n(x_i))^{1_{chain(n)=0}}, \quad (47)$$

which includes the indicator function for clarity, we have

$$\frac{\partial}{\partial out_n} p(chain) = \frac{\partial}{\partial out_n} \frac{\sum_i \prod_m \sigma_m^{chain(m)}(x_i)}{N} = \frac{\sum_i (-1)^{1_{chain(n)=0}} \prod_{m \neq n} \sigma_m^{chain(m)}(x_i)}{N}. \quad (48)$$

Note that the derivative of the probability w.r.t. an output is thus simply a sum of products of the other output neurons, where the neuron considered in the derivative determines the sign. In short, we see that this changes linearly as one fixes all the neurons but the one considered for variations, which is natural.
We now derive the final needed derivative. Writing $f_k = \frac{\sum_i y_{i,k}\, p(x_i \in chain)}{\sum_j \sum_i y_{i,j}\, p(x_i \in chain)}$ and using Eq. (47),

$$\frac{\partial}{\partial out_n} E(y_{chain}) = \frac{\partial}{\partial out_n}\Big(1 - \sum_k f_k^2\Big) = -2 \sum_k f_k\, \frac{\partial f_k}{\partial out_n},$$

where, by the quotient rule with $A_k = \sum_i y_{i,k} \prod_m \sigma_m^{chain(m)}(x_i)$ and $B = \sum_j A_j$,

$$\frac{\partial f_k}{\partial out_n} = \frac{\frac{\partial A_k}{\partial out_n}\, B - A_k\, \frac{\partial B}{\partial out_n}}{B^2}, \qquad \frac{\partial A_k}{\partial out_n} = \sum_i y_{i,k}\, (-1)^{1_{chain(n)=0}} \prod_{m \neq n} \sigma_m^{chain(m)}(x_i).$$

Finally, note that performing stochastic gradient descent is an option when dealing with large training sets. It has also been shown to improve the convergence rate.
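When implementing the hand-derived gradients of Eqs. (45)-(48), it is prudent to check them numerically; the generic helper below, an illustrative assumption of this text rather than part of the paper (in practice a framework's backpropagation computes Eq. (45) automatically), compares any analytic gradient against central finite differences before trusting the update of Eq. (44).

```python
import numpy as np

def finite_diff_grad(loss_fn, W, eps=1e-6):
    """Central finite-difference gradient of loss_fn at W, entry by entry."""
    grad = np.zeros_like(W)
    for idx in np.ndindex(W.shape):
        E = np.zeros_like(W)
        E[idx] = eps
        grad[idx] = (loss_fn(W + E) - loss_fn(W - E)) / (2 * eps)
    return grad

# Usage: compare with the analytic gradient, then apply Eq. (44):
# W <- W - alpha * grad + beta * f(W).
```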
2) Regularization Techniques:
Initially, regularization was done by adding to the standard cost function a regularization term, typically the L1 or L2 norm of the weights, forcing the learned parameters to be sparse. Since then, new kinds of regularization techniques have been developed, such as dropout Hinton et al. (2012); Srivastava et al. (2014). In dropout, each neuron has a nonzero probability of being deactivated (simply outputting 0), forcing the weights to avoid co-adaptation. This probabilistic deactivation can be transformed into adding some Gaussian noise to each of the neuron outputs, which in this case forces the weights to be robust to noise. The motivation here is not necessarily to avoid overfitting; in fact, when using ensemble methods, overfitting can actually be beneficial for the subsequent variance reduction from averaging Krogh (1996). However, another type of regularization can be used, on the distribution of the data across the regions. One such term is

$$\sum_{chain \in \mathcal{P}} \Big\| p(chain) - \frac{1}{|\mathcal{P}|} \Big\|^2, \quad (49)$$

so that regions become more equally likely.
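A one-line sketch of the region-balancing regularizer of Eq. (49), which penalizes the squared distance between the empirical region distribution and the uniform one, pushing the hashing layer to use all regions:

```python
import numpy as np

def uniformity_penalty(p_chain):
    """Eq. (49). p_chain: vector of estimated region probabilities, Eq. (34)."""
    target = 1.0 / len(p_chain)                 # uniform distribution 1/|P|
    return np.sum((p_chain - target) ** 2)
```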
D. Toy Datasets

We now present the application of the HNN on two simple toy datasets for binary classification: the two-moons and the two-circle datasets. Each one presents nonlinearly separable data points, yet the decision boundary can be represented very effectively by a small union of linear pieces. The main result is that the number of parameters needed with the HNN is smaller than when using an ANN. For all the examples below, the ANN is made of 3 layers, the smallest network able to tackle these two problems. As we will see, the HNN only requires one layer, and the number of parameters needed for a similar decision boundary is markedly smaller for the HNN than for the ANN. This simple example shows that the reformulation as a hashing layer helps to avoid overfitting in general: overfitting is not only learning the training set, it is also using a more flexible model than needed Hawkins (2004). Since with the HNN we are able to obtain the same decision boundaries with fewer parameters, it means that the ANN architecture was somehow sub-optimal.

Fig. 4. Noisy two-moons dataset: learned regions and final binary classification regions for the HNN.

Fig. 5. Evolution of p(chain) over the regions learned during training.
1) Two-moon Dataset:
The two-moons dataset is a typical example of nonlinear binary classification. It can be solved easily with kernel-based methods or nonlinear classifiers in general. As we will show, even though the used HNN has only one layer, through the way it combines the learned hyperplanes it is able to learn nonlinear decision boundaries. In Fig. 4 one can see the HNN with out = 3 after training. The decision boundary is similar in shape to the one learned by the MLP shown in Fig. 7; in fact, it is made up of combined hyperplanes. One can see in Fig. 5 the evolution of the probability to reach each of the regions during training, as opposed to the class probabilities for the MLP presented in Fig. 8. We also present in Fig. 6 the evolution of the error and regularization terms during training for the HNN. As one can see, regions starting with almost no points can recover and become preponderant, whereas useful regions can be disregarded at any point in the training. Similarly, we have in Fig. 9 the corresponding curves for the ANN. It is interesting to see that the convergence rate is also faster with the HNN: it converges in markedly fewer iterations than the ANN.

Fig. 6. Evolution of the errors during training. On the left is the pure error as defined in Eq. (33) and with the addition of the regularization terms. In the middle is the norm of the weights, and on the right the distance with respect to a uniform distribution of the points in each region.

Fig. 7. Noisy two-moons dataset: learned regions and final binary classification regions for the MLP.

Fig. 8. Evolution of p(y = 0) and p(y = 1) during training of the MLP.
2) Two-circle Dataset:
The two-circle dataset is another simple yet meaningful dataset. It presents two circles with the same center but different radii; as a consequence, one is inside the other, and the binary classification task is to discriminate between the two. It is quite straightforward to see that a hand-crafted change of variables $(x, y) \rightarrow (r, \theta)$ can make this problem linearly separable, yet we will see how the HNN and an ANN solve this discrimination problem directly. We present in Fig. 10 the result of the HNN. Again, the decision boundary is also presented for the case of an ANN in Fig. 13. We also present in Fig. 11 for the HNN and Fig. 14 for the ANN the evolution of the probability to reach the sub-regions during training. As can be seen in Fig. 12 and Fig. 15, the convergence rate is significantly faster for the HNN: it converges in far fewer iterations than the neural network. Finally, we present the case where more output neurons are used for the HNN on the two-circle dataset. Note that the number of parameters remains smaller than for the ANN: to match the same modeling ability, an MLP's parameter count grows exponentially w.r.t. the number of hidden neurons, whereas for the HNN it grows linearly, since the modeling capacity itself grows exponentially with the number of output neurons. We present in Fig. 16, Fig. 17 and Fig. 18 the results for this larger HNN.

Fig. 9. Evolution of the errors during training for the ANN.

Fig. 10. Regions and decision boundary for the HNN on the two-circle dataset.

Fig. 11. Evolution of p(chain) during training for the possible regions of the HNN on the two-circle dataset.

Fig. 12. Evolution of the error and regularization for the HNN.

Fig. 13. ANN decision boundary for the two-circle dataset.

Fig. 14. ANN p(chain) for the two classes.

Fig. 15. ANN training statistics including the error and the regularizations.

Fig. 16. HNN with more output neurons: learned regions.

Fig. 17. HNN with more output neurons: evolution of p(chain).

Fig. 18. HNN with more output neurons: evolution of the error and regularization errors during training.

IV. FUTURE WORK
This paper introduces a new way to improve neural networks through the analysis of the output activations and the loss function. For example, it has been demonstrated in Tang (2013) that using an SVM-type loss instead of the standard cross-entropy reduces the generalization loss. Yet the important point was that the cross-entropy of the newly trained network was far from optimal, or even from what one could consider satisfactory. This suggests that different loss functions do not just affect the learning but also the final network. As a result, one part of future work is to study the impact of the loss function on different training sets with fixed topologies. This includes training with standard neural networks and with the HNN in the case where $out = \log_2(C)$. This could lead to a new framework aiming at learning the loss function online.

With the analysis of trees, one natural extension of the HNN is boosting or its simpler form, bagging. Applying these ensemble methods to complete ANNs might be difficult due to the high complexity of training even one network. As a result, some techniques such as dropout have been used and analyzed as a weak way to perform model averaging or bagging. A solution in our case would be to perform bagging or boosting of only the hashing layer part, namely the last layer of the HNN. This way, the workers live in the same latent space but act on different pieces of it, and their combination is used for improving the latent space representation. Another approach would be to see model averaging from a Bayesian point of view Penny et al. (2006). The model evidence is given by

$$p(y \mid m) = \int p(y \mid \theta, m)\, p(\theta \mid m)\, d\theta, \quad (50)$$

and Bayesian Model Selection (BMS) is

$$m_{MP} = \arg\max_{m \in M} p(m \mid y). \quad (51)$$

When dealing with one model m, the inference $p(\theta \mid y, m)$ depends on the chosen model. In order to take into account the uncertainty in the choice of the model, model averaging can be used. Bayesian Model Averaging (BMA) formulates the distribution

$$p(\theta \mid y) = \sum_m p(\theta \mid y, m)\, p(m \mid y); \quad (52)$$

see Hoeting et al. (1999) for a review. This whole framework can be used with a nonparametric PGM and newly defined probabilities for a neural network, with $p(x \mid ANN)$ based on the reconstruction error and $p(ANN)$ based, for example, on the model complexity or the topology of the connections. Finally, an important aspect resides in the correlation between different $out_n(x_i)$ and $out_m(x_i)$ across inputs: they have to be uncorrelated, otherwise the hyperplanes are scaled versions of each other.

V. CONCLUSION
In this paper we presented an extension of DTs and ANNs providing a unified framework that takes the best of both approaches: the ability to hash a dataset into an exponentially large number of regions, which is the strength of DTs, coupled with the ability to learn arbitrary decision boundaries for those regions, coming from the ability of ANNs to model arbitrary functions. We leverage the differentiability of our approach to derive a global loss function training all the nodes simultaneously with respect to the resulting leaves' entropy, showing robustness to the poor local optima DTs can fall into. The differentiability of the model allows easy integration in many machine learning pipelines, allowing the extension of CNNs to robust semi-supervised clustering, for example. In addition, the ability to learn arbitrary unions of regions of the space to perform per-class clustering relaxes the condition of having fully linearized the dataset; this should reduce the required depth of today's deep architectures. Furthermore, this network has the capacity to be information-theoretically optimal, as the minimum required number of output neurons in supervised problems is $\log_2(C)$, as opposed to C for usual softmax layers, where C is the number of classes. Finally, the possibility to apply this framework in supervised as well as unsupervised settings might lead to interesting behavior through the clustering property of the latent representations, as the experiments showed promising results.

REFERENCES
G. Biau, E. Scornet, and J. Welbl. Neural random forests. ArXiv e-prints, April 2016.

Rico Blaser and Piotr Fryzlewicz. Random rotation ensembles. Journal of Machine Learning Research, 2:1–15, 2015.

Leo Breiman, Jerome Friedman, Charles J. Stone, and Richard A. Olshen. Classification and Regression Trees. CRC Press, 1984.

Erick Cantu-Paz and Chandrika Kamath. Inducing oblique decision trees with evolutionary algorithms. IEEE Transactions on Evolutionary Computation, 7(1):54–68, 2003.

Rohitash Chandra, Kaylash Chaudhary, and Akshay Kumar. The combination and comparison of neural networks with decision trees for wine classification. School of Sciences and Technology, University of Fiji, 2007.

Moses S. Charikar. Similarity estimation techniques from rounding algorithms. In Proceedings of the Thirty-Fourth Annual ACM Symposium on Theory of Computing, pages 380–388. ACM, 2002.

Mark Craven and Jude W. Shavlik. Using sampling and queries to extract rules from trained neural networks. In ICML, pages 37–45, 1994.

Mark W. Craven. Extracting Comprehensible Models from Trained Neural Networks. PhD thesis, University of Wisconsin–Madison, 1996.

Sanjoy Dasgupta and Yoav Freund. Random projection trees and low dimensional manifolds. In Proceedings of the Fortieth Annual ACM Symposium on Theory of Computing, pages 537–546. ACM, 2008.

Lawrence Davis. Handbook of Genetic Algorithms. 1991.

R. López De Mántaras. A distance-based attribute selection measure for decision tree induction. Machine Learning, 6(1):81–92, 1991.

Richard O. Duda, Peter E. Hart, and David G. Stork. Pattern Classification. John Wiley & Sons, 2012.

LiMin Fu. Rule generation from neural networks. IEEE Transactions on Systems, Man, and Cybernetics, 24(8):1114–1124, 1994.

Keinosuke Fukunaga and Patrenahalli M. Narendra. A branch and bound algorithm for computing k-nearest neighbors. IEEE Transactions on Computers, 100(7):750–753, 1975.

Matt W. Gardner and S. R. Dorling. Artificial neural networks (the multilayer perceptron): a review of applications in the atmospheric sciences. Atmospheric Environment, 32(14):2627–2636, 1998.

Felix A. Gers, Jürgen Schmidhuber, and Fred Cummins. Learning to forget: Continual prediction with LSTM. Neural Computation, 12(10):2451–2471, 2000.

Pierre Geurts, Damien Ernst, and Louis Wehenkel. Extremely randomized trees. Machine Learning, 63(1):3–42, 2006.

Aristides Gionis, Piotr Indyk, Rajeev Motwani, et al. Similarity search in high dimensions via hashing. In VLDB, volume 99, pages 518–529, 1999.

Cyril Harold Goulden. Methods of Statistical Analysis. 1939.

Norhamreeza Abdul Hamid, Nazri Mohd Nawi, Rozaida Ghazali, and Mohd Najib Mohd Salleh. Accelerating learning performance of back propagation algorithm by using adaptive gain together with adaptive momentum and adaptive learning rate on classification problems. In Ubiquitous Computing and Multimedia Applications, pages 559–570. Springer, 2011.

Malcolm C. Harrison. Implementation of the substring test by hashing. Communications of the ACM, 14(12):777–779, 1971.

Douglas M. Hawkins. The problem of overfitting. Journal of Chemical Information and Computer Sciences, 44(1):1–12, 2004.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pages 1026–1034, 2015.

Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.

Jennifer A. Hoeting, David Madigan, Adrian E. Raftery, and Chris T. Volinsky. Bayesian model averaging: a tutorial. Statistical Science, pages 382–401, 1999.

Laurent Hyafil and Ronald L. Rivest. Constructing optimal binary decision trees is NP-complete. Information Processing Letters, 5(1):15–17, 1976.

Jungho Im and John R. Jensen. A change detection model based on neighborhood correlation image analysis and decision tree classification. Remote Sensing of Environment, 99(3):326–340, 2005.

A. Jordan. On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. Advances in Neural Information Processing Systems, 14:841, 2002.

S. M. Kamruzzaman and Ahmed Ryadh Hasan. Rule extraction using artificial neural networks. arXiv preprint arXiv:1009.4984, 2010.

R. Krishnan, G. Sivakumar, and P. Bhattacharya. Extracting decision trees from trained neural networks. Pattern Recognition, 32(12), 1999.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.

Anders Krogh and Peter Sollich. Learning with ensembles: How over-fitting can be useful. In Proceedings of the 1995 Conference, volume 8, page 190, 1996.

Ludmila I. Kuncheva. Combining Pattern Classifiers: Methods and Algorithms. John Wiley & Sons, 2004.

Dmitry Laptev and Joachim M. Buhmann. Convolutional decision trees for feature learning and segmentation. In German Conference on Pattern Recognition, pages 95–106. Springer, 2014.

Yann LeCun, Ido Kanter, and Sara A. Solla. Second order properties of error surfaces: Learning time and generalization. In Advances in Neural Information Processing Systems, pages 918–924, 1991.

Yann A. LeCun, Léon Bottou, Genevieve B. Orr, and Klaus-Robert Müller. Efficient backprop. In Neural Networks: Tricks of the Trade, pages 9–48. Springer, 2012.

Roger J. Lewis. An introduction to classification and regression tree (CART) analysis. In Annual Meeting of the Society for Academic Emergency Medicine, San Francisco, California, pages 1–14, 2000.

Wei Liu, Jun Wang, Rongrong Ji, Yu-Gang Jiang, and Shih-Fu Chang. Supervised hashing with kernels. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 2074–2081. IEEE, 2012.

Nazri Mohd Nawi, Norhamreeza Abdul Hamid, R. S. Ransing, Rozaida Ghazali, and Mohd Najib Mohd Salleh. Enhancing back propagation neural network algorithm with adaptive gain on classification problems. Networks, 4(2), 2011.

Will Penny, J. Mattout, and N. Trujillo-Barreto. Bayesian model selection and averaging. In Statistical Parametric Mapping: The Analysis of Functional Brain Images. Elsevier, London, 2006.

J. Ross Quinlan. Induction of decision trees. Machine Learning, 1(1):81–106, 1986.

J. Ross Quinlan. C4.5: Programs for Machine Learning. Elsevier, 2014.

John Ross Quinlan. Comparing connectionist and symbolic learning methods. In Computational Learning Theory and Natural Learning Systems: Constraints and Prospects. Citeseer, 1994.

S. Rasoul Safavian and David Landgrebe. A survey of decision tree classifier methodology. 1990.

Ruslan Salakhutdinov and Geoffrey Hinton. Semantic hashing. International Journal of Approximate Reasoning, 50(7):969–978, 2009.

Robert F. Sproull. Refinements to nearest-neighbor searching in k-dimensional trees. Algorithmica, 6(1-6):579–589, 1991.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.

Yichuan Tang. Deep learning using linear support vector machines. arXiv preprint arXiv:1306.0239, 2013.

Geoffrey G. Towell and Jude W. Shavlik. Extracting refined rules from knowledge-based neural networks. Machine Learning, 13(1):71–101, 1993.

Nakul Verma, Samory Kpotufe, and Sanjoy Dasgupta. Which spatial partition trees are adaptive to intrinsic dimension? In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 565–574. AUAI Press, 2009.

Rongkai Xia, Yan Pan, Hanjiang Lai, Cong Liu, and Shuicheng Yan. Supervised hashing for image retrieval via image representation learning. In AAAI, volume 1, page 2, 2014.

Chien-Cheng Yu and Bin-Da Liu. A backpropagation algorithm with adaptive learning rate and momentum coefficient. In Neural Networks, 2002. IJCNN'02. Proceedings of the 2002 International Joint Conference on, volume 2, pages 1218–1223. IEEE, 2002.

Lean Yu, Shouyang Wang, and Kin Keung Lai. An adaptive BP algorithm with optimal learning rates and directional error correction for foreign exchange market trend prediction. In