Neural Decision Trees
Randall Balestriero, Electrical and Computer Engineering Department, Rice University, Houston TX, USA ([email protected])

Abstract—In this paper we propose a synergistic melting of neural networks and decision trees (DT) that we call neural decision trees (NDT). NDT is an architecture à la decision tree where each splitting node is an independent multilayer perceptron (MLP), allowing oblique decision functions, or arbitrary nonlinear decision functions if more than one layer is used. Each MLP can thus be seen as a node of the tree. We then show that, under a weight-sharing assumption among those units, we end up with a Hashing Neural Network (HNN), which is a multilayer perceptron with a sigmoid activation function for the last layer, as opposed to the standard softmax. The output units then jointly represent the probability of being in a particular region. The proposed framework allows global optimization, as opposed to the greedy optimization of DTs, and differentiability w.r.t. all parameters and the input, allowing easy integration in any learnable pipeline, for example after CNNs for computer vision tasks. We also demonstrate the modeling power of the HNN, which can learn unions of disjoint regions for final clustering or classification, making it more general and powerful than a standard softmax MLP, which requires linear separability, and thus reducing the burden on the inner layers to perform complex data transformations. We finally show experiments for supervised, semi-supervised and unsupervised tasks and compare results with standard DTs and MLPs.
Index Terms—Artificial Neural Networks, Multilayer Perceptrons, Decision Trees, Locality Sensitive Hashing, Oblique Decision Trees
I. INTRODUCTION AND DECISION TREE REVIEW
A. Motivations

Hashing. A hash function $H : X \rightarrow Y$ is a many-to-one mapping from a possibly infinite set to a finite set Harrison (1971), where we usually have $X \subset \mathbb{R}^d$ and $Y \subset \{1, ..., C\}$. As a result, the following function

$$H(x) = \arg\max_y p(y \mid x), \quad (1)$$

typical of discriminative models Jordan (2002), can be seen as a hashing function that has been optimized or learned given some dataset of N independent observations $\{(x_i, y_i)\}_{i=1}^N$ of input data $x_i$ and corresponding labels $y_i$. Decision trees (DT) are a special kind of discriminative model aiming at breaking up a complex decision into a union of simple binary decisions, a.k.a. splitting nodes Safavian and Landgrebe (1990). In order to do so, DT learning involves a sequential top-down partitioning of the input space X into sub-regions $\Omega_k$ satisfying

$$\cup_k \Omega_k = X, \quad (2)$$
$$\Omega_k \cap \Omega_l = \emptyset, \quad \forall k \neq l. \quad (3)$$

This partition is done so that the label distribution of the points w.r.t. each region has minimal entropy. In particular, the optimum is obtained when all the points lying in a sub-region belong to the same class, and this for all the sub-regions. Once trained, a new observation x is classified by first finding which region $\Omega(x)$ it belongs to and by predicting the label specific to this region according to

$$H(x) = \mathrm{mode}(\{y_i \mid x_i \in \Omega(x)\}), \quad (4)$$

for classification problems and

$$H(x) = \mathrm{mean}(\{y_i \mid x_i \in \Omega(x)\}), \quad (5)$$

for regression problems. What puts DTs among the most powerful discriminative techniques for non-cognitive tasks is the fact that the number of sub-regions $\Omega_k$ of X grows exponentially w.r.t. their depth. This property is at the core of the developed Hashing Neural Network (HNN), coupled with the acute modeling capacity of deep neural networks. We now briefly describe standard univariate and multivariate decision trees, their advantages and drawbacks, as well as motivations to extend them into a more unified and differentiable framework. In fact, as Quinlan said Quinlan (1994), decision trees are better when discrimination has to be done through a sequential testing of different attributes, whereas ANNs are good when knowledge of a simultaneous combination of many attributes is required; trying to get the best of both worlds thus seems natural.
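As an illustration of Eqs. (4)-(5), the following minimal NumPy sketch predicts from region membership; the helper `region_of`, which stands in for the learned tree traversal, is a hypothetical assumption of this sketch, not part of the paper.

```python
import numpy as np

def predict(x, X_train, y_train, region_of, task="classification"):
    """Predict the label of x from the training points sharing its region."""
    r = region_of(x)                                        # Omega(x)
    in_region = np.array([region_of(xi) == r for xi in X_train])
    labels = y_train[in_region]                             # {y_i | x_i in Omega(x)}
    if task == "classification":                            # Eq. (4): mode
        values, counts = np.unique(labels, return_counts=True)
        return values[np.argmax(counts)]
    return labels.mean()                                    # Eq. (5): mean
```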
B. Univariate and Oblique Decision Trees

Decision trees lead to a recursive partitioning of the input space X through a cascade of simple tests Kuncheva (2004). In the case of univariate or monothetic DTs, the local test at each splitting node is done by looking at one attribute $att$ of $x \in X$ and comparing its value to a threshold b. If $x(att) < b$, the observation is passed to the left child, where another test is performed. This process is repeated until a leaf is reached, at which point a prediction can be made for x. There is an intuition behind the sequence of simple tests performed by the tree to classify an object, which is particularly useful in botany, zoology and medical diagnosis. By looking at all the leaves, one can see that they partition the input space into a set of axis-aligned regions. Since growing the optimal tree has been proven NP-complete Hyafil and Rivest (1976), standard DT induction performs a greedy optimization by learning the best attribute and threshold for each split sequentially Quinlan (1986), unless everything is picked at random as in Extremely Randomized DTs Geurts et al. (2006). A tree is usually built in a top-down fashion, but bottom-up and hybrid algorithms also exist. If a stopping criterion is used during the growing phase, it is called prepruning. Typical stopping criteria for prepruning involve a validation set, a criterion on the impurity reduction, or a minimum number of examples reaching a node. Finally, hypothesis testing using a Chi-Square test Goulden (1939), checking whether the class distribution of the children differs from that of the parent, can also be used. Early stopping can be detrimental by halting the exploration too soon, a phenomenon known as the horizon effect Duda et al. (2012); a DT can instead be fully developed and then postpruned using some heuristics, which is a way to regularize the splits. In fact, a fully developed DT, provided there are no two identical objects $x_i, x_j$ in the dataset with different class labels, can learn the dataset exactly, leading to zero re-substitution error but making it unstable: it can memorize the training set, so that a small change in the input can lead to a completely different fitted tree. The main limitation of univariate DTs resides in the axis-aligned splits. This inherently implies that the performance of a DT is not invariant to rotations of the input space, and for cognitive tasks, tests done on only one attribute at a time lead to extremely poor results unless hand-crafted features are provided. As a result, oblique decision trees have been developed, for which a test is now done on a linear combination of the attributes of x. Since the splitting is still not differentiable, optimizing the cutting value and the weight vector w is usually done with Genetic Algorithms (GA) Cantu-Paz and Kamath (2003). Finally, unsupervised pre-processing has also been developed with Random Projection Trees Dasgupta and Freund (2008); Blaser and Fryzlewicz (2015) and PCA Trees Verma et al. (2009); Sproull (1991) to avoid the rotation problems. A partitioning example is presented in Fig. 1 for a univariate and an oblique tree.

Fig. 1. Random Projection Tree example.

The contributions of the paper are: learning arbitrary decision boundaries, as opposed to axis-aligned or linear ones, through a global optimization framework, as opposed to greedy optimization, via differentiable splitting nodes;
the derivation of the HNN, allowing the use of deep neural networks with a new deep hashing layer which is related to Locality Sensitive Hashing (LSH); and finally, the use of the developed HNN for supervised and unsupervised problems as well as for classification and regression tasks.

Fig. 2. Simple tree (splitting nodes $X_{i,j}$ and leaves $L_k$).
II. NEURAL DECISION TREES
We first introduce the neural decision tree (NDT), a softened version of decision trees allowing finely learned arbitrary decision surfaces for each node split and a global optimization framework. We first briefly review supervised LSH, as the neural decision tree is a particular instance of an LSH framework.
A. Locality Sensitive Hashing
Locality Sensitive Hashing Gionis et al. (1999); Charikar (2002) aims at mapping similar inputs to the same hash value. In the case of trees, the hash value corresponds to the reached leaf. Learning this kind of function in a supervised manner has been studied Liu et al. (2012). For example, in Xia et al. (2014) the similarity matrix induced by the labels is factorized into $H^T H$ and the features H are learned through a CNN. In Salakhutdinov and Hinton (2009) a deep autoencoder is learned in an unsupervised manner and the latent representation is then used for LSH. This last framework will be a special case of our unsupervised HNN, with the main difference that the autoencoder will not just be trained to reconstruct the input but also to provide a meaningfully clustered latent space representation.

B. Model
The main change we make to a decision tree to render it differentiable is to replace the splitting function, which can be seen as an indicator function, with a sigmoid function

$$\Phi(x) = \frac{1}{1 + e^{-x}}. \quad (6)$$

We now interpret, for each node $X_{i,j}$, the output of $\Phi$ as the probability that the instance x goes to the left child of the node; note that this is a generalized version of the node change suggested in Laptev and Buhmann (2014). For example, looking at the tree representation of Fig. 2, we have

$$P(x \in L_1) = P(X_{1,1} = \text{left} \mid x)\, P(X_{2,1} = \text{left} \mid X_{1,1} = \text{left}, x). \quad (7)$$

In fact, if we denote the attribute and threshold value of each node $X_{i,j}$ in the tree by $att_{X_{i,j}}$ and $b_{X_{i,j}}$, and the parent of $X_{i,j}$ by $X_{\pi(i,j)}$, we can express the probability of a point being passed to the left child through a sigmoid as

$$P(X_{i,j} = \text{left} \mid x, X_{\pi(i,j)}) = \Phi\big(b_{X_{i,j}} - x(att_{X_{i,j}})\big). \quad (8)$$
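To make Eqs. (6)-(7) concrete, here is a minimal NumPy sketch (our notation, not the paper's code) of soft routing in the depth-2 tree of Fig. 2: each of the three splitting nodes is a linear unit (w, b) whose sigmoid output is the probability of going left, and each leaf probability is the product of the branch probabilities along its path.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))            # Eq. (6)

def leaf_probabilities(x, nodes):
    """nodes = {'root': (w, b), 'left': (w, b), 'right': (w, b)}."""
    p = {k: sigmoid(x @ w + b) for k, (w, b) in nodes.items()}
    return np.array([
        p["root"] * p["left"],                  # P(x in L1), Eq. (7)
        p["root"] * (1 - p["left"]),            # P(x in L2)
        (1 - p["root"]) * p["right"],           # P(x in L3)
        (1 - p["root"]) * (1 - p["right"]),     # P(x in L4)
    ])

rng = np.random.default_rng(0)
nodes = {k: (rng.normal(size=2), 0.0) for k in ("root", "left", "right")}
print(leaf_probabilities(rng.normal(size=2), nodes))  # four probabilities summing to 1
```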
We now introduce some notation in order to derive the overall loss function of the tree. We derive the binary classification case only, since the general case is developed in the HNN section; we thus restrict $y \in \{0, 1\}$. For the case where each node is optimized in a top-down fashion, we first derive the soft version of the usual DT splitting criteria. We have, for each node $X_{i,j}$ which is not a leaf,

$$P = \sum_n y_n, \quad (11)$$
$$N = \sum_n (1 - y_n), \quad (12)$$
$$n_{left} = \sum_n \Phi(x_n), \quad (13)$$
$$n_{right} = P + N - n_{left}, \quad (14)$$
$$P_{left} = \sum_n \Phi(x_n)\, y_n, \quad (15)$$
$$N_{left} = n_{left} - P_{left}, \quad (16)$$
$$P_{right} = P - P_{left}, \quad (17)$$
$$N_{right} = N - N_{left}, \quad (18)$$

where these quantities represent, respectively, the number of positive and negative observations, the (soft) numbers of observations going to the left and right children, the number of observations with y = 1 going to the left child, the number of observations with y = 0 going to the left child, and similarly for the right child. We now present the main loss functions that can be used, adapted from standard DT losses.

a) Information Gain (ID3 De Mántaras (1991), C4.5 Quinlan (2014), C5.0 Im and Jensen (2005)): The information gain represents the amount by which the entropy of the class labels changes w.r.t. the splitting of the dataset. It has to be maximized, which happens when one minimizes the weighted sum of the local entropies, which in turn requires the class distribution of each region to converge to a Dirac. It is defined as

$$IG(x, y; \Phi) = E(P, N) - \frac{n_{left}}{P+N}\, E(P_{left}, N_{left}) - \frac{n_{right}}{P+N}\, E(P_{right}, N_{right}). \quad (19)$$

b) Gini Impurity (CART Lewis (2000)): The Gini impurity is more statistically rooted. It has to be minimized and attains its global minimum of 0 when each region encodes only one class. It is defined for each region as

$$G(x, y) = 1 - \sum_k f_k^2, \quad (20)$$

where $f_k$ is the proportion of observations of class k. It symbolizes the expected classification error incurred if a class label were drawn following the class label distribution of the leaf. In fact, we have

$$P(\hat{y}_i \neq y_i) = \sum_k P(\hat{y}_i = k)\, P(y_i \neq k) = \sum_k f_k (1 - f_k) = 1 - \sum_k f_k^2.$$

The loss function per node is thus the weighted Gini impurity of the children, defined as

$$i_{left} = 1 - \left(\frac{P_{left}}{n_{left}}\right)^2 - \left(\frac{N_{left}}{n_{left}}\right)^2, \quad (21)$$
$$i_{right} = 1 - \left(\frac{P_{right}}{n_{right}}\right)^2 - \left(\frac{N_{right}}{n_{right}}\right)^2, \quad (22)$$
$$G(x, y, \Phi) = \frac{n_{left}}{N+P}\, i_{left} + \frac{n_{right}}{N+P}\, i_{right}. \quad (23)$$

c) Variance Reduction (CART Breiman et al. (1984)): Finally, to tackle regression problems, another measure has been derived, the variance reduction. In this case, one aims to find the split so that the intra-region variance is minimal. It is defined as

$$\frac{\sum_{i=1}^N \Phi(x_i)(y_i - \tilde{y}_{left})^2}{\sum_{i=1}^N \Phi(x_i)} + \frac{\sum_{i=1}^N (1 - \Phi(x_i))(y_i - \tilde{y}_{right})^2}{\sum_{i=1}^N (1 - \Phi(x_i))}, \quad (24)$$

with

$$\tilde{y}_{left} = \frac{\sum_{i=1}^N \Phi(x_i)\, y_i}{\sum_{i=1}^N \Phi(x_i)}, \quad (25)$$
$$\tilde{y}_{right} = \frac{\sum_{i=1}^N (1 - \Phi(x_i))\, y_i}{\sum_{i=1}^N (1 - \Phi(x_i))}. \quad (26)$$

Note that a weighted version of this variance reduction can be used w.r.t. the probability of each region.
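A minimal NumPy sketch of the soft node statistics of Eqs. (11)-(18) and the resulting soft Gini criterion of Eqs. (21)-(23) follows; `phi` holds the sigmoid outputs $\Phi(x_n)$ of one node on a batch, `y` the binary labels, and the small epsilon guard against empty children is an implementation assumption of this sketch.

```python
import numpy as np

def soft_gini(phi, y, eps=1e-12):
    """Soft Gini split criterion, Eq. (23), for one differentiable node."""
    P, N = y.sum(), (1 - y).sum()              # class counts, Eqs. (11)-(12)
    n_left = phi.sum()                         # soft count going left, Eq. (13)
    n_right = P + N - n_left                   # Eq. (14)
    P_left = (phi * y).sum()                   # soft positives going left, Eq. (15)
    N_left = n_left - P_left                   # Eq. (16)
    P_right, N_right = P - P_left, N - N_left  # Eqs. (17)-(18)
    i_left = 1 - (P_left / (n_left + eps)) ** 2 - (N_left / (n_left + eps)) ** 2
    i_right = 1 - (P_right / (n_right + eps)) ** 2 - (N_right / (n_right + eps)) ** 2
    return (n_left * i_left + n_right * i_right) / (P + N)   # Eq. (23)
```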
Learning an NDT is now straightforward and similar to learning a DT, except that for each node, instead of searching heuristically or exhaustively for the splitting criterion, it is optimized through an iterative optimization procedure such as gradient descent. This already alleviates the drawbacks of non-differentiability encountered in oblique trees, where GAs had to be used to find the optimal hyperplane. However, it is also possible to go further by not just optimizing each splitting node in a greedy manner but optimizing all the splitting nodes simultaneously. In fact, with the Neural Decision Tree framework, we are now able to optimize the overall cost function simultaneously over all the nodes. This loses the sequential aspect of the tests, and there no longer exists an analytical solution for the global loss as was the case in the greedy framework; however, the likelihood of being stuck in a poor local optimum is smaller, since a non-optimal split at a given node no longer degrades the performance of all its children. The global loss function corresponds to the loss function of each leaf weighted by the probability to reach it. This is defined explicitly for the Gini impurity as

$$Gini(Tree, x) = \sum_{leaf} \left[\frac{\sum_i P(x_i \in leaf)}{P+N}\right] \left(\frac{N_{leaf}}{P_{leaf}+N_{leaf}} - \left(\frac{N_{leaf}}{P_{leaf}+N_{leaf}}\right)^2\right), \quad (27)$$

and for the entropy as

$$IG(Tree, x) = E(P, N) - \sum_{leaf} \left[\frac{\sum_i P(x_i \in leaf)}{P+N}\right] E(P_{leaf}, N_{leaf}), \quad (28)$$

where the quantities $\left[\frac{\sum_i P(x_i \in leaf)}{P+N}\right]$ denote the probability that a given point belongs to this leaf, estimated on the training set. As a result, these loss functions correspond to the generalization of the node loss function to all the leaves, weighted by the probability to go into each of the leaves.
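A hedged sketch of the global Gini loss of Eq. (27): every leaf is scored by its soft class counts and weighted by its estimated reach probability, so all splitting nodes receive gradient simultaneously. It assumes a matrix `leaf_probs` of $P(x_i \in leaf)$, e.g. produced by the routing sketch above.

```python
import numpy as np

def global_gini(leaf_probs, y, eps=1e-12):
    """Eq. (27). leaf_probs: (N_points, N_leaves); y: binary labels (N_points,)."""
    total = len(y)                              # P + N
    P_leaf = leaf_probs.T @ y                   # soft positive count per leaf
    n_leaf = leaf_probs.sum(axis=0)             # soft mass per leaf
    N_leaf = n_leaf - P_leaf
    q = N_leaf / (n_leaf + eps)                 # N_leaf / (P_leaf + N_leaf)
    reach = n_leaf / total                      # [sum_i P(x_i in leaf)] / (P + N)
    return np.sum(reach * (q - q ** 2))         # weighted per-leaf Gini
```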
III. HASHING NEURAL NETWORK

A. Motivation
One can see that, for the special case where $G(x) = x^T w + b$, we can rewrite the NDT as a perceptron where the output neurons all have a sigmoid activation. The output, an ordered chain of binary values, is simply equivalent to the path in the corresponding tree that would route x to the corresponding leaf. This is the main motivation for rewriting the NDT as the HNN, so as to then be able to add multiple layers and leverage the deep learning framework combined with the NDT. This idea of combining the topology of DTs with the learning capacities of ANNs is not new. For example, combining an ANN for latent space representation followed by a DT is done in Chandra et al. (2007). Generating rules based on a trained ANN with DTs is studied in Fu (1994); Towell and Shavlik (1993); Kamruzzaman and Hasan (2010); Craven and Shavlik (1994); Craven (1996). Using an ANN to filter a dataset prior to learning a DT is explored in Krishnan et al. (1999). Finally, and more recently, a reformulation of regression trees as sparse ANNs has been done in Biau et al. (2016) in order to fine-tune the learned DT. In our case, the HNN can be summarized simply as a generalized NDT in the sense that it learns regions so that the class distribution within each region has minimal uncertainty, ultimately with only one class per region but not necessarily one region per class, as opposed to current architectures. In addition, we will see that thanks to the LSH framework, we are also able to perform HNN learning in a semi-supervised or even unsupervised way.

Fig. 3. Example of sub-region query.
B. The Hashing Layer
In the HNN framework, the last layer now plays the role of a hashing function, hence its name of hashing layer. As a result, its output no longer represents $p(y \mid x)$ but $p(x \in L)$, where L is any of the sub-regions encoded by the network, and the output of the neurons corresponds to a prediction of the path taken in a decision tree. From the sub-region membership, a prediction policy can be used, for example based on the most represented training class in the region. This already highlights the ability to also make confidence predictions: if a region is ambiguous, different kinds of predictions can be made depending on the problem at hand. We now develop the supervised case.
1) Supervised Case:
The number of output neurons $out$ must be at least $\log_2(C)$, where C is the number of classes of the task at hand. In fact, one needs at least C different regions, and the number of different regions that can be modeled by a hashing layer with $out$ neurons is $2^{out}$. When the number of output neurons is greater, the ANN has the flexibility to learn different sub-regions for each class. This is particularly interesting, for example, if the latent representation is somehow clustered yet the number of clusters is greater than the number of classes, and clusters of the same class are not necessarily neighbors. In order to derive our loss function for multiclass problems, we first impose the formulation of y as a one-hot vector with a 1 at the index of the class. From that we have the following quantities

$$out_n(x) = \Phi(x^T w_n + b_n), \quad \forall n \in \{1, ..., out\}, \quad (29)$$
$$\mathcal{P} = \{0, 1\}^{out}, \quad (30)$$

and we call chain each element of $\mathcal{P}$, which fully and uniquely identifies each of the sub-regions encoded in the network. We thus have

$$p(x_i \in chain) = \prod_n out_n(x_i)^{1_{chain(n)=1}}\, (1 - out_n(x_i))^{1_{chain(n)=0}}. \quad (31)$$

The average y per sub-region is the weighted mean of all the y w.r.t. their membership probabilities; this gives the class distribution per chain as

$$y_{chain} = \frac{\sum_i y_i\, p(x_i \in chain)}{\sum_i p(x_i \in chain)}, \quad (32)$$

where we recall that $y_i$ is a one-hot vector, and thus $y_{chain}$ is a vector of size C which is nonnegative and sums to 1: it is $p(y \mid chain)$. Finally, we can now compute a measure of uncertainty E for each chain, whether it is the entropy, the Gini impurity or any other differentiable function, and write the final loss as the sum of these uncertainties weighted by the probability to reach each sub-region:

$$loss = \sum_{chain \in \mathcal{P}} p(chain)\, E(y_{chain}), \quad (33)$$

where, as for the NDT, the estimated probability to reach a sub-region is simply

$$p(chain) = \frac{\sum_{i=1}^N p(x_i \in chain)}{N}. \quad (34)$$

Note that all the classification losses can be used with class weights in order to deal with unbalanced datasets, which occur often in real-world problems. For regression problems we first define the local variance of the outputs as

$$V(chain) = \frac{\sum_i (y_i - \tilde{y}_{chain})^2\, p(x_i \in chain)}{\sum_i p(x_i \in chain)}, \quad (35)$$

where

$$\tilde{y}_{chain} = \frac{\sum_i y_i\, p(x_i \in chain)}{\sum_i p(x_i \in chain)}. \quad (36)$$

As a result, we can write the complete regression loss as

$$loss = \sum_{chain \in \mathcal{P}} p(chain)\, V(chain). \quad (37)$$

Note that it is possible not to weight the local loss functions by the probability to reach the region. Finally, note that this hashing layer can obviously be used after any already known neural network layer, such as densely connected Gardner and Dorling (1998), convolutional Krizhevsky et al. (2012) or even recurrent Gers et al. (2000) layers, making the HNN act on a deep, latent representation rather than on the input space. The training is then done through back-propagation with the presented loss.
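The following sketch assembles Eqs. (29)-(34) with entropy as the uncertainty E; W, b are assumed hashing-layer parameters and Y is one-hot. Enumerating all $2^{out}$ chains is exponential, so this naive loop is only illustrative for small $out$.

```python
import numpy as np
from itertools import product

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hashing_loss(X, Y, W, b):
    """Supervised hashing-layer loss, Eq. (33). X: (N, d); Y: (N, C) one-hot."""
    out = sigmoid(X @ W + b)                                 # Eq. (29), shape (N, out)
    loss = 0.0
    for chain in product([0, 1], repeat=out.shape[1]):       # P = {0,1}^out, Eq. (30)
        c = np.array(chain)
        p_x = np.prod(np.where(c == 1, out, 1.0 - out), axis=1)   # Eq. (31)
        mass = p_x.sum()
        if mass < 1e-12:                                     # empty region: no contribution
            continue
        y_chain = (Y * p_x[:, None]).sum(axis=0) / mass      # p(y|chain), Eq. (32)
        entropy = -np.sum(y_chain * np.log(y_chain + 1e-12)) # uncertainty E
        loss += p_x.mean() * entropy                         # weighted by p(chain), Eq. (34)
    return loss
```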
2) Semi-Supervised Learning:
One extension not developed in this paper concerns the semi-supervised case, for which a standard discriminative loss, such as the information gain or the entropy, is used for the labeled examples, and the unsupervised loss is used for the unlabeled as well as the labeled examples. The unsupervised loss would typically be the intra-region variance, which is similar to the loss of the k-NN algorithm or of a GMM with identity covariance matrix. This results in aggregating into the same region parts of the space with high density, while constraining that two different labels never occur inside one region. Extensions of this, as well as validation results, will be presented in order to validate this hybrid loss between decision trees and k-NN.
3) Unsupervised Case:
In the last section a loss function was derived for the case where we have access to all the labels $y_i$ of the training inputs $x_i$. The semi-supervised framework is basically a deep autoencoder on which a hashing layer is connected to the latent space, namely the middle layer. As a result, the reconstruction loss is applied to all inputs and the standard hashing layer loss is used when labels are available. In the unsupervised case, a deep autoencoder is again trained coupled with a hashing layer, but this time the hashing layer is unsupervised. We now derive analytically the loss function of the hashing layer in the unsupervised case. There are two ways to do it: the layer can either be random, so as to become a known LSH function such as MinHash, or one can try to find clusters so that the intra-cluster variance is minimal, similarly to a k-NN Fukunaga and Narendra (1975) approach. The variance per region is defined as

$$V_{chain}(x) = \frac{\sum_i (x_i - \tilde{x}_{chain})^2\, p(x_i \in chain)}{\sum_i p(x_i \in chain)}, \quad (38)$$

where the local sub-region mean is defined as

$$\tilde{x}_{chain} = \frac{\sum_i p(x_i \in chain)\, x_i}{\sum_i p(x_i \in chain)}. \quad (39)$$

As a result, the overall loss is simply

$$loss = \sum_{chain \in \mathcal{P}} p(chain)\, V_{chain}(x). \quad (40)$$

If we denote by G the mapping from the input to the last layer of the neural network before the hashing layer, the final unsupervised loss thus becomes

$$loss = \sum_{x \in X} \|G^{-1}(G(x)) - x\|^2 + \sum_{chain \in \mathcal{P}} p(chain)\, V_{chain}(G(x)), \quad (41)$$

and for the semi-supervised case

$$loss = \sum_{x \in X} \|G^{-1}(G(x)) - x\|^2 + \sum_{chain \in \mathcal{P}} p(chain)\, E(y_{chain}), \quad (42)$$

where $y_{chain}$ is computed over the labeled examples only. Concerning the random strategy, similarly to Extremely Randomized Trees, it is possible to adopt an unlearned approach for which the hyperplanes (w, b) are drawn according to a Normal distribution. This way, we follow the LSH framework, for which we have

$$p(out_n(x_i) \neq out_n(x_j)) \propto d(x_i, x_j), \quad (43)$$

where d is a distance measure.
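A sketch of the unsupervised hashing term of Eqs. (38)-(40), under the assumption that `H` holds the latent codes G(x) of an already defined autoencoder; the reconstruction term of Eq. (41) would be added separately by that autoencoder's own loss.

```python
import numpy as np
from itertools import product

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def unsupervised_hashing_loss(H, W, b):
    """Soft intra-region variance, Eqs. (38)-(40). H: (N, d) latent codes G(x)."""
    out = sigmoid(H @ W + b)
    loss = 0.0
    for chain in product([0, 1], repeat=out.shape[1]):
        c = np.array(chain)
        p_x = np.prod(np.where(c == 1, out, 1.0 - out), axis=1)  # p(x_i in chain)
        mass = p_x.sum()
        if mass < 1e-12:
            continue
        mean = (p_x[:, None] * H).sum(axis=0) / mass             # Eq. (39)
        var = (p_x * ((H - mean) ** 2).sum(axis=1)).sum() / mass # Eq. (38)
        loss += (mass / len(H)) * var                            # Eq. (40), weighted by p(chain)
    return loss
```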
C. Training

1) Iterative Optimization Schemes: Since there is no analytical form for the optimal weights and biases inside a deep neural network, one has to use iterative optimization methods. Two of the main possibilities are Genetic Algorithms Davis (1991) and gradient-based methods. We focus here on gradient-based methods, as they are the most popular optimization technique nowadays. Put simply, for each free parameter W the update rule is

$$W = W - \alpha \nabla_W loss + \beta f(W), \quad (44)$$

where $\alpha$ is the learning rate and $\beta$ a regularization parameter applied to some extra function f. A common technique to find the best learning rate and regularizer is cross-validation, but new techniques have been developed allowing adaptive learning rates and momentum which are changed during training Yu et al. (2006); Yu and Liu (2002); Hamid et al. (2011); Nawi et al. (2011). Whatever activation function is used, one can also add new parameters in order to scale the input, as presented in He et al. (2015). Finally, many tricks for better back-propagation are studied in LeCun et al. (2012), and a deep study of the behavior of the weights during learning is carried out in LeCun et al. (1991). We now derive the explicit gradient for the case where the loss E is the Gini impurity. It is clear that

$$\frac{d}{dW} loss = \sum_n \frac{d\, loss}{d\, out_n} \frac{d\, out_n}{dW} = \sum_{chain \in \mathcal{P}} \sum_n \frac{d\, [p(chain)\, E(y_{chain})]}{d\, out_n} \frac{d\, out_n}{dW}, \quad (45)$$

with $\frac{d [p(chain) E(y_{chain})]}{d\, out_n}$ a scalar and $\frac{d\, out_n}{dW}$ a matrix. We now derive this derivative explicitly:

$$\frac{d\, out(x_i)}{dW} = \begin{pmatrix} (-1)^{1_{chain(1)=0}}\, out_1(x_i)(1 - out_1(x_i))\, x_i^T \\ (-1)^{1_{chain(2)=0}}\, out_2(x_i)(1 - out_2(x_i))\, x_i^T \\ \vdots \end{pmatrix}, \quad (46)$$

which is of size (out, in) and has on each row the input $x_i$, which might be the output of another upper layer, weighted by the activation; it is similar to a standard sigmoid layer except for the indicator function, which determines the sign. If we now denote the true output by

$$\sigma_n^{chain(n)}(x_i) = out_n(x_i)^{1_{chain(n)=1}}\, (1 - out_n(x_i))^{1_{chain(n)=0}}, \quad (47)$$

which includes the indicator function for clarity, we have

$$\frac{\partial}{\partial out_n} p(chain) = \frac{\partial}{\partial out_n} \frac{\sum_i \prod_m \sigma_m^{chain(m)}(x_i)}{N} = \frac{\sum_i (-1)^{1_{chain(n)=0}} \prod_{m \neq n} \sigma_m^{chain(m)}(x_i)}{N}. \quad (48)$$

Note that the derivative of the probability w.r.t. an output is thus simply a sum of products of the other output neurons, where the neuron considered in the derivative determines the sign. In short, we see that this changes linearly as one fixes all the neurons but the one considered for variations, which is natural.
We now derive the final needed derivative. Writing $f_k = \frac{\sum_i y_{i,k}\, p(x_i \in chain)}{\sum_j \sum_i y_{i,j}\, p(x_i \in chain)}$ and using Eq. (47),

$$\frac{\partial}{\partial out_n} E(y_{chain}) = \frac{\partial}{\partial out_n}\Big(1 - \sum_k f_k^2\Big) = -2 \sum_k f_k\, \frac{\partial f_k}{\partial out_n},$$

where, by the quotient rule with $A_k = \sum_i y_{i,k} \prod_m \sigma_m^{chain(m)}(x_i)$ and $B = \sum_j A_j$,

$$\frac{\partial f_k}{\partial out_n} = \frac{\frac{\partial A_k}{\partial out_n}\, B - A_k\, \frac{\partial B}{\partial out_n}}{B^2}, \qquad \frac{\partial A_k}{\partial out_n} = \sum_i y_{i,k}\, (-1)^{1_{chain(n)=0}} \prod_{m \neq n} \sigma_m^{chain(m)}(x_i).$$

Finally, note that performing stochastic gradient descent is an option when dealing with large training sets. It has also been shown to improve the convergence rate.
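When implementing the hand-derived gradients of Eqs. (45)-(48), it is prudent to check them numerically; the generic helper below, an illustrative assumption of this text rather than part of the paper (in practice a framework's backpropagation computes Eq. (45) automatically), compares any analytic gradient against central finite differences before trusting the update of Eq. (44).

```python
import numpy as np

def finite_diff_grad(loss_fn, W, eps=1e-6):
    """Central finite-difference gradient of loss_fn at W, entry by entry."""
    grad = np.zeros_like(W)
    for idx in np.ndindex(W.shape):
        E = np.zeros_like(W)
        E[idx] = eps
        grad[idx] = (loss_fn(W + E) - loss_fn(W - E)) / (2 * eps)
    return grad

# Usage: compare with the analytic gradient, then apply Eq. (44):
# W <- W - alpha * grad + beta * f(W).
```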
2) Regularization Techniques:
Initially, regularization was done by adding to the standard cost function a regularization term, typically the L1 or L2 norm of the weights, forcing the learned parameters to be sparse. Since then, new kinds of regularization techniques have been developed, such as dropout Hinton et al. (2012); Srivastava et al. (2014). In dropout, each neuron has a nonzero probability of being deactivated (simply outputting 0), forcing the weights to avoid co-adaptation. This probabilistic deactivation can be transformed into adding some Gaussian noise to each of the neuron outputs, which in this case forces the weights to be robust to noise. The motivation here is not necessarily to avoid overfitting; in fact, when using ensemble methods, overfitting can actually be beneficial for the subsequent variance reduction from averaging Krogh (1996). However, another type of regularization can be used, on the distribution of the data across the regions. One such term is

$$\sum_{chain \in \mathcal{P}} \Big\| p(chain) - \frac{1}{|\mathcal{P}|} \Big\|^2, \quad (49)$$

so that regions become more equally likely.
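A one-line sketch of the region-balancing regularizer of Eq. (49), which penalizes the squared distance between the empirical region distribution and the uniform one, pushing the hashing layer to use all regions:

```python
import numpy as np

def uniformity_penalty(p_chain):
    """Eq. (49). p_chain: vector of estimated region probabilities, Eq. (34)."""
    target = 1.0 / len(p_chain)                 # uniform distribution 1/|P|
    return np.sum((p_chain - target) ** 2)
```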
D. Toy Datasets

We now present the application of the HNN on two simple toy datasets for binary classification: the two-moons and the two-circle datasets. Each one presents nonlinearly separable data points, yet the decision boundary can be represented very effectively by a small union of linear pieces. The main result is that the number of parameters needed with the HNN is smaller than when using an ANN. For all the examples below, the ANN is made of 3 layers, the smallest network able to tackle these two problems. As we will see, the HNN only requires one layer, and the number of parameters needed for a similar decision boundary is markedly smaller for the HNN than for the ANN. This simple example shows that the reformulation as a hashing layer helps to avoid overfitting in general: overfitting is not only learning the training set, it is also using a more flexible model than needed Hawkins (2004). Since with the HNN we are able to obtain the same decision boundaries with fewer parameters, it means that the ANN architecture was somehow sub-optimal.

Fig. 4. Noisy two-moons dataset: learned regions and final binary classification regions for the HNN.

Fig. 5. Evolution of p(chain) over the regions learned during training.
1) Two-moon Dataset:
The two-moons dataset is a typical example of nonlinear binary classification. It can be solved easily with kernel-based methods or nonlinear classifiers in general. As we will show, even though the used HNN has only one layer, through the way it combines the learned hyperplanes it is able to learn nonlinear decision boundaries. In Fig. 4 one can see the HNN with out = 3 after training. The decision boundary is similar in shape to the one learned by the MLP shown in Fig. 7; in fact, it is made up of combined hyperplanes. One can see in Fig. 5 the evolution of the probability to reach each of the regions during training, as opposed to the class probabilities for the MLP presented in Fig. 8. We also present in Fig. 6 the evolution of the error and regularization terms during training for the HNN. As one can see, regions starting with almost no points can recover and become preponderant, whereas useful regions can be disregarded at any point in the training. Similarly, we have in Fig. 9 the corresponding curves for the ANN. It is interesting to see that the convergence rate is also faster with the HNN: it converges in markedly fewer iterations than the ANN.

Fig. 6. Evolution of the errors during training. On the left is the pure error as defined in Eq. (33) and with the addition of the regularization terms. In the middle is the norm of the weights, and on the right the distance with respect to a uniform distribution of the points in each region.

Fig. 7. Noisy two-moons dataset: learned regions and final binary classification regions for the MLP.

Fig. 8. Evolution of p(y = 0) and p(y = 1) during training of the MLP.
2) Two-circle Dataset:
The two-circle dataset is another simple yet meaningful dataset. It presents two circles with the same center but different radii; as a consequence, one is inside the other, and the binary classification task is to discriminate between the two. It is quite straightforward to see that a hand-crafted change of variables $(x, y) \rightarrow (r, \theta)$ can make this problem linearly separable, yet we will see how the HNN and an ANN solve this discrimination problem directly. We present in Fig. 10 the result of the HNN. Again, the decision boundary is also presented for the case of an ANN in Fig. 13. We also present in Fig. 11 for the HNN and Fig. 14 for the ANN the evolution of the probability to reach the sub-regions during training. As can be seen in Fig. 12 and Fig. 15, the convergence rate is significantly faster for the HNN: it converges in far fewer iterations than the neural network. Finally, we present the case where more output neurons are used for the HNN on the two-circle dataset. Note that the number of parameters remains smaller than for the ANN: to match the same modeling ability, an MLP's parameter count grows exponentially w.r.t. the number of hidden neurons, whereas for the HNN it grows linearly, since the modeling capacity itself grows exponentially with the number of output neurons. We present in Fig. 16, Fig. 17 and Fig. 18 the results for this larger HNN.

Fig. 9. Evolution of the errors during training for the ANN.

Fig. 10. Regions and decision boundary for the HNN on the two-circle dataset.

Fig. 11. Evolution of p(chain) during training for the possible regions of the HNN on the two-circle dataset.

Fig. 12. Evolution of the error and regularization for the HNN.

Fig. 13. ANN decision boundary for the two-circle dataset.

Fig. 14. ANN p(chain) for the two classes.

Fig. 15. ANN training statistics including the error and the regularizations.

Fig. 16. HNN with more output neurons: learned regions.

Fig. 17. HNN with more output neurons: evolution of p(chain).

Fig. 18. HNN with more output neurons: evolution of the error and regularization errors during training.

IV. FUTURE WORK
This paper introduces a new way to improve neural networks through the analysis of the output activations and the loss function. For example, it has been demonstrated in Tang (2013) that using an SVM-type loss instead of the standard cross-entropy reduces the generalization loss. Yet the important point was that the cross-entropy of the newly trained network was far from optimal, or even from what one could consider satisfactory. This suggests that different loss functions do not just affect the learning but also the final network. As a result, one part of future work is to study the impact of the loss function on different training sets with fixed topologies. This includes training with standard neural networks and with the HNN in the case where $out = \log_2(C)$. This could lead to a new framework aiming at learning the loss function online.

With the analysis of trees, one natural extension of the HNN is boosting or its simpler form, bagging. Applying these ensemble methods to complete ANNs might be difficult due to the high complexity of training even one network. As a result, some techniques such as dropout have been used and analyzed as a weak way to perform model averaging or bagging. A solution in our case would be to perform bagging or boosting of only the hashing layer part, namely the last layer of the HNN. This way, the workers live in the same latent space but act on different pieces of it, and their combination is used for improving the latent space representation. Another approach would be to see model averaging from a Bayesian point of view Penny et al. (2006). The model evidence is given by

$$p(y \mid m) = \int p(y \mid \theta, m)\, p(\theta \mid m)\, d\theta, \quad (50)$$

and Bayesian Model Selection (BMS) is

$$m_{MP} = \arg\max_{m \in M} p(m \mid y). \quad (51)$$

When dealing with one model m, the inference $p(\theta \mid y, m)$ depends on the chosen model. In order to take into account the uncertainty in the choice of the model, model averaging can be used. Bayesian Model Averaging (BMA) formulates the distribution

$$p(\theta \mid y) = \sum_m p(\theta \mid y, m)\, p(m \mid y); \quad (52)$$

see Hoeting et al. (1999) for a review. This whole framework can be used with a nonparametric PGM and newly defined probabilities for a neural network, with $p(x \mid ANN)$ based on the reconstruction error and $p(ANN)$ based, for example, on the model complexity or the topology of the connections. Finally, an important aspect resides in the correlation between different $out_n(x_i)$ and $out_m(x_i)$ across inputs: they have to be uncorrelated, otherwise the hyperplanes are scaled versions of each other.

V. CONCLUSION
In this paper we presented an extension of DTs and ANNs providing a unified framework that takes the best of both approaches: the ability to hash a dataset into an exponentially large number of regions, which is the strength of DTs, coupled with the ability to learn arbitrary decision boundaries for those regions, coming from the ability of ANNs to model arbitrary functions. We leverage the differentiability of our approach to derive a global loss function training all the nodes simultaneously with respect to the resulting leaves' entropy, showing robustness to the poor local optima DTs can fall into. The differentiability of the model allows easy integration in many machine learning pipelines, allowing the extension of CNNs to robust semi-supervised clustering, for example. In addition, the ability to learn arbitrary unions of regions of the space to perform per-class clustering relaxes the condition of having fully linearized the dataset; this should reduce the required depth of today's deep architectures. Furthermore, this network has the capacity to be information-theoretically optimal, as the minimum required number of output neurons in supervised problems is $\log_2(C)$, as opposed to C for usual softmax layers, where C is the number of classes. Finally, the possibility to apply this framework in supervised as well as unsupervised settings might lead to interesting behavior through the clustering property of the latent representations, as the experiments showed promising results.

REFERENCES
G. Biau, E. Scornet, and J. Welbl. Neural random forests. ArXiv e-prints, April 2016.

Rico Blaser and Piotr Fryzlewicz. Random rotation ensembles. Journal of Machine Learning Research, 2:1–15, 2015.

Leo Breiman, Jerome Friedman, Charles J. Stone, and Richard A. Olshen. Classification and Regression Trees. CRC Press, 1984.

Erick Cantu-Paz and Chandrika Kamath. Inducing oblique decision trees with evolutionary algorithms. IEEE Transactions on Evolutionary Computation, 7(1):54–68, 2003.

Rohitash Chandra, Kaylash Chaudhary, and Akshay Kumar. The combination and comparison of neural networks with decision trees for wine classification. School of Sciences and Technology, University of Fiji, 2007.

Moses S. Charikar. Similarity estimation techniques from rounding algorithms. In Proceedings of the Thirty-Fourth Annual ACM Symposium on Theory of Computing, pages 380–388. ACM, 2002.

Mark Craven and Jude W. Shavlik. Using sampling and queries to extract rules from trained neural networks. In ICML, pages 37–45, 1994.

Mark W. Craven. Extracting Comprehensible Models from Trained Neural Networks. PhD thesis, University of Wisconsin–Madison, 1996.

Sanjoy Dasgupta and Yoav Freund. Random projection trees and low dimensional manifolds. In Proceedings of the Fortieth Annual ACM Symposium on Theory of Computing, pages 537–546. ACM, 2008.

Lawrence Davis. Handbook of Genetic Algorithms. 1991.

R. López De Mántaras. A distance-based attribute selection measure for decision tree induction. Machine Learning, 6(1):81–92, 1991.

Richard O. Duda, Peter E. Hart, and David G. Stork. Pattern Classification. John Wiley & Sons, 2012.

LiMin Fu. Rule generation from neural networks. IEEE Transactions on Systems, Man, and Cybernetics, 24(8):1114–1124, 1994.

Keinosuke Fukunaga and Patrenahalli M. Narendra. A branch and bound algorithm for computing k-nearest neighbors. IEEE Transactions on Computers, 100(7):750–753, 1975.

Matt W. Gardner and S. R. Dorling. Artificial neural networks (the multilayer perceptron): a review of applications in the atmospheric sciences. Atmospheric Environment, 32(14):2627–2636, 1998.

Felix A. Gers, Jürgen Schmidhuber, and Fred Cummins. Learning to forget: Continual prediction with LSTM. Neural Computation, 12(10):2451–2471, 2000.

Pierre Geurts, Damien Ernst, and Louis Wehenkel. Extremely randomized trees. Machine Learning, 63(1):3–42, 2006.

Aristides Gionis, Piotr Indyk, Rajeev Motwani, et al. Similarity search in high dimensions via hashing. In VLDB, volume 99, pages 518–529, 1999.

Cyril Harold Goulden. Methods of Statistical Analysis. 1939.

Norhamreeza Abdul Hamid, Nazri Mohd Nawi, Rozaida Ghazali, and Mohd Najib Mohd Salleh. Accelerating learning performance of back propagation algorithm by using adaptive gain together with adaptive momentum and adaptive learning rate on classification problems. In Ubiquitous Computing and Multimedia Applications, pages 559–570. Springer, 2011.

Malcolm C. Harrison. Implementation of the substring test by hashing. Communications of the ACM, 14(12):777–779, 1971.

Douglas M. Hawkins. The problem of overfitting. Journal of Chemical Information and Computer Sciences, 44(1):1–12, 2004.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pages 1026–1034, 2015.

Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.

Jennifer A. Hoeting, David Madigan, Adrian E. Raftery, and Chris T. Volinsky. Bayesian model averaging: a tutorial. Statistical Science, pages 382–401, 1999.

Laurent Hyafil and Ronald L. Rivest. Constructing optimal binary decision trees is NP-complete. Information Processing Letters, 5(1):15–17, 1976.

Jungho Im and John R. Jensen. A change detection model based on neighborhood correlation image analysis and decision tree classification. Remote Sensing of Environment, 99(3):326–340, 2005.

A. Jordan. On discriminative vs. generative classifiers: A comparison of logistic regression and naive Bayes. Advances in Neural Information Processing Systems, 14:841, 2002.

S. M. Kamruzzaman and Ahmed Ryadh Hasan. Rule extraction using artificial neural networks. arXiv preprint arXiv:1009.4984, 2010.

R. Krishnan, G. Sivakumar, and P. Bhattacharya. Extracting decision trees from trained neural networks. Pattern Recognition, 32(12), 1999.

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.

Anders Krogh and Peter Sollich. Learning with ensembles: How over-fitting can be useful. In Proceedings of the 1995 Conference, volume 8, page 190, 1996.

Ludmila I. Kuncheva. Combining Pattern Classifiers: Methods and Algorithms. John Wiley & Sons, 2004.

Dmitry Laptev and Joachim M. Buhmann. Convolutional decision trees for feature learning and segmentation. In German Conference on Pattern Recognition, pages 95–106. Springer, 2014.

Yann LeCun, Ido Kanter, and Sara A. Solla. Second order properties of error surfaces: Learning time and generalization. In Advances in Neural Information Processing Systems, pages 918–924, 1991.

Yann A. LeCun, Léon Bottou, Genevieve B. Orr, and Klaus-Robert Müller. Efficient backprop. In Neural Networks: Tricks of the Trade, pages 9–48. Springer, 2012.

Roger J. Lewis. An introduction to classification and regression tree (CART) analysis. In Annual Meeting of the Society for Academic Emergency Medicine, San Francisco, California, pages 1–14, 2000.

Wei Liu, Jun Wang, Rongrong Ji, Yu-Gang Jiang, and Shih-Fu Chang. Supervised hashing with kernels. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 2074–2081. IEEE, 2012.

Nazri Mohd Nawi, Norhamreeza Abdul Hamid, R. S. Ransing, Rozaida Ghazali, and Mohd Najib Mohd Salleh. Enhancing back propagation neural network algorithm with adaptive gain on classification problems. Networks, 4(2), 2011.

Will Penny, J. Mattout, and N. Trujillo-Barreto. Bayesian model selection and averaging. In Statistical Parametric Mapping: The Analysis of Functional Brain Images. Elsevier, London, 2006.

J. Ross Quinlan. Induction of decision trees. Machine Learning, 1(1):81–106, 1986.

J. Ross Quinlan. C4.5: Programs for Machine Learning. Elsevier, 2014.

John Ross Quinlan. Comparing connectionist and symbolic learning methods. In Computational Learning Theory and Natural Learning Systems: Constraints and Prospects. Citeseer, 1994.

S. Rasoul Safavian and David Landgrebe. A survey of decision tree classifier methodology. 1990.

Ruslan Salakhutdinov and Geoffrey Hinton. Semantic hashing. International Journal of Approximate Reasoning, 50(7):969–978, 2009.

Robert F. Sproull. Refinements to nearest-neighbor searching in k-dimensional trees. Algorithmica, 6(1-6):579–589, 1991.

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.

Yichuan Tang. Deep learning using linear support vector machines. arXiv preprint arXiv:1306.0239, 2013.

Geoffrey G. Towell and Jude W. Shavlik. Extracting refined rules from knowledge-based neural networks. Machine Learning, 13(1):71–101, 1993.

Nakul Verma, Samory Kpotufe, and Sanjoy Dasgupta. Which spatial partition trees are adaptive to intrinsic dimension? In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 565–574. AUAI Press, 2009.

Rongkai Xia, Yan Pan, Hanjiang Lai, Cong Liu, and Shuicheng Yan. Supervised hashing for image retrieval via image representation learning. In AAAI, volume 1, page 2, 2014.

Chien-Cheng Yu and Bin-Da Liu. A backpropagation algorithm with adaptive learning rate and momentum coefficient. In Neural Networks, 2002. IJCNN'02. Proceedings of the 2002 International Joint Conference on, volume 2, pages 1218–1223. IEEE, 2002.

Lean Yu, Shouyang Wang, and Kin Keung Lai. An adaptive BP algorithm with optimal learning rates and directional error correction for foreign exchange market trend prediction. In