NDT: Neural Decision Tree Towards Fully Functioned Neural Graph
Han Xiao

State Key Laboratory of Intelligent Technology and Systems, National Laboratory for Information Science and Technology, Department of Computer Science and Technology, Tsinghua University, Beijing 100084, PR China. Correspondence to: Han Xiao <[email protected]>.

Abstract
Though traditional algorithms can be embedded into neural architectures with the principle proposed by (Xiao, 2017), the variables that only occur in the condition of a branch cannot be updated, as a special case. To tackle this issue, we multiply the conditioned branches with the Dirac symbol (i.e. $\mathbb{1}_{x>0}$), then approximate the Dirac symbol with a continuous function (e.g. $1 - e^{-\alpha|x|}$). In this way, the gradients of condition-specific variables can be worked out in the back-propagation process, approximately, making a fully functioned neural graph. Within our novel principle, we propose the neural decision tree (NDT), which takes simplified neural networks as the decision function in each branch and employs complex neural networks to generate the output in each leaf. Extensive experiments verify our theoretical analysis and demonstrate the effectiveness of our model.
1. Introduction
Inspired by brain science, neural architectures were proposed in 1943 (Mcculloch & Pitts, 1943). This branch of artificial intelligence has developed from the single perceptron (Casper et al., 1969) to deep complex networks (Lecun et al., 2015), achieving several critical successes such as AlphaGo (Silver et al., 2016). Notably, all the operators (i.e. matrix multiplication, non-linear functions, convolution, etc.) in traditional neural networks are numerical and continuous, which allows them to benefit from the back-propagation algorithm (Rumelhart et al., 1988). Recently, logics-based methods (e.g. the Hungarian algorithm, the max-flow algorithm, A* search) have been embedded into neural architectures in a dynamically graph-constructing manner, opening a new chapter for intelligence systems (Xiao, 2017).
Generally, a neural graph is defined as an intelligence architecture that is characterized by both logics and neurons.
With this principle from the seminal work, we attempt to tackle image classification. Specifically, regarding this task, the overfull categories place too much burden on classifiers, which is a common issue for large-scale datasets such as ImageNet (Deng et al., 2009).
We conjecture that it would be effective to roughly classify the samples with a decision tree and then categorize the corresponding samples with a strong neural network in each leaf, because in each leaf there are far fewer categories to predict.
The attribute split in traditional decision trees (e.g. ID3, Random Forest, etc.) is oversimplified for precise pre-classification (Zhou & Feng, 2017). Thus, we propose the method of neural decision tree (NDT), which applies neural networks as decision functions to strengthen the performance.

Regarding the calculus procedure of NDT, the basic principle is to treat the logic flow (i.e. "if, for, while" in the sense of a programming language) as a dynamic graph-constructing process, which is illustrated in Figure 1. This figure demonstrates the classification of four categories (i.e. sun, moon, car and pen), where an if structure is employed to split the samples into two branches (i.e. sun-moon, car-pen), and fully connected networks generate the results respectively. In the forward propagation, our methodology activates one branch according to the condition of the if, then dynamically constructs the graph according to the instructions in the activated branch. In this way, the calculus graph is constructed as a non-branching and continuous structure, where backward propagation can be performed conventionally, as demonstrated in Figure 1 (b). Generally, we should note that the repeat structures (i.e. for, while) can be treated as performing if multiple times, which can also be tackled by the proposed principle. Thus, all the traditional algorithms can be embedded into neural architectures. For more details, please refer to (Xiao, 2017).

However, as a special challenge of this paper, the variables that are only introduced in the condition of a branch cannot be updated in the backward propagation, because they are outside the dynamically constructed graph, for the example of W in Figure 1. Thus, to make a completely functioned neural graph, this paper attempts to tackle this issue in an approximated manner.
Figure 1.
Illustration of how the logic flow is processed in our methodology. Referring to (b), we process the if-else structure of (a) in a dynamically graph-constructing manner. Theoretically, we construct the graph according to the active branch in the forward propagation. Once the forward propagation has constructed the graph according to the logic instructions, the backward propagation is performed as usual in a continuous and non-branching graph. Practically, the dynamically constructed process corresponds to batch operations. The samples whose ids satisfy the condition are activated in the if branch, while the remaining ones are tackled in the else branch. After the end of the if-else instruction, the sub-batch hidden representations are joined as the classified results.

Simply, we multiply the symbols inside the branch with the Dirac function (i.e. $\mathbb{1}_{x>0}$ or $\mathbb{1}_{x\le 0}$). Specifically, regarding Figure 1, we reform $FCNetwork(img)$ as $FCNetwork(img \otimes \mathbb{1}_{\tanh(\cdot)>0})$ in the if branch and perform the corresponding transformation in the else branch, where $\otimes$ is element-wise multiplication. The forward propagation is not modified by this reformulation, while in the backward process, we approximate the Dirac symbol with a continuous function to work out the gradients of the condition, which solves this issue. It is noted that in this paper, the continuous function is $1 - e^{-\alpha|x|} \approx \mathbb{1}_{x>0}$.

We conduct our experiments on public benchmark datasets: MNIST and CIFAR. Experimental results illustrate that our model outperforms other baselines extensively and significantly, which illustrates the effectiveness of our methodology. The most important conclusion is that "our model is differentiable", which verifies our theory and provides the novel methodology of the fully functioned neural graph.

Contributions.
(1.) We complete the principle of the neural graph, which characterizes intelligence systems with both logics and neurons. Also, we provide the proof that the neural graph is Turing complete, which makes a learnable Turing machine for the theory of computation.
(2.) To tackle the issue of overfull categories, we propose the method of neural decision tree (NDT), which takes simplified neural networks as the decision function in each branch and employs complex neural networks to generate the output in each leaf.
(3.) Our model outperforms other baselines extensively, verifying the effectiveness of our theory and method.
Organization.
In Section 2, our methodology and neural architecture are discussed. In Section 3, we specify the implementation of the fully functioned neural graph in detail. In Section 4, we provide the proof that the neural graph is Turing complete. In Section 5, we conduct the experiments for performance and verification. In Section 6, we briefly introduce the related work. In Section 7, we list the potential future work from a developing perspective. Finally, in Section 8, we conclude our paper and publish our codes.
2. Methodology
First, we introduce the overview of our model. Then, we discuss each component specifically. Last, we discuss our model from the ensemble perspective.
Our architecture is illustrated in Figure 2, which is composed of three customized components, namely the feature, condition and target networks. First, the input is transformed by the feature network, and the hidden features are then classified by the decision tree component composed of hierarchical condition networks. Second, the target networks predict the categories for each sample in each leaf. Finally, the targets are joined to work out the cross-entropy objective. The process is exemplified in Algorithm 1.
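To make the batched forward process concrete, the following is a minimal NumPy sketch of a depth-1 NDT forward pass under our reading of Figure 2 and Algorithm 1. All module names, shapes and parameters are illustrative assumptions, not the released implementation.

```python
import numpy as np

def feature_net(x):
    return x  # identity mapping, as in the experimental settings

def condition_net(h, W, v):
    # simplified one-hidden-layer perceptron producing one score per sample
    return np.tanh(h @ W) @ v

def target_net(h, params):
    W1, W2 = params
    return np.tanh(h @ W1) @ W2  # class logits

def ndt_forward(x, cond_params, left_params, right_params):
    h = feature_net(x)
    score = condition_net(h, *cond_params)        # shape: (batch,)
    left_mask = score > 0                         # "if" branch (sub-batch)
    n_classes = target_net(h[:1], left_params).shape[1]
    logits = np.empty((x.shape[0], n_classes))
    if left_mask.any():
        logits[left_mask] = target_net(h[left_mask], left_params)
    if (~left_mask).any():
        logits[~left_mask] = target_net(h[~left_mask], right_params)
    return logits                                  # joined back in batch order

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(8, 784))
    cond = (rng.normal(size=(784, 300)) * 0.01, rng.normal(size=(300,)) * 0.01)
    left = (rng.normal(size=(784, 300)) * 0.01, rng.normal(size=(300, 10)) * 0.01)
    right = (rng.normal(size=(784, 300)) * 0.01, rng.normal(size=(300, 10)) * 0.01)
    print(ndt_forward(x, cond, left, right).shape)  # (8, 10)
```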
Feature Network.
To extract abstract features with deep neural structures, we introduce the feature network, which is often a stacked CNN or an LSTM.
Condition Network.
To exactly pre-classify each sample, we employ a simplified neural network as the condition network, which is usually a one- or two-layer perceptron with the non-linear function tanh. This component is only applied in the inner nodes of the decision tree.
Figure 2.
The neural architecture of NDT (depth = 2). The input is classified by the decision tree component with the condition networks, then the target networks predict the categories for each sample in each leaf. Notably, the tree component takes advantage of the sub-batch technique, while the targets are joined in batch to compute the cross-entropy objective.

Actually, the effectiveness of traditional decision trees stems from the information-gain splitting rules, which cannot be learned by condition networks directly. Thus, we involve an objective item for each decision node to maximize the information gain:
$$\max \; InfoGain = \frac{N_{left}}{N_{total}} \sum_{j=0}^{|F|} p_j^{left} \ln(p_j^{left}) + \frac{N_{right}}{N_{total}} \sum_{j=0}^{|F|} p_j^{right} \ln(p_j^{right}) \quad (1)$$

where $N$ is the corresponding count, $|F|$ is the feature number and $p$ is the corresponding probabilistic distribution of features. Regarding the derivatives relative to the Dirac symbol, we firstly reformulate the information gain in the form of the Dirac symbol as:

$$N_{left} = \sum_{i=1}^{N_{total}} \mathbb{1}_{cn_i > 0} \quad (2)$$

$$N_{right} = \sum_{i=1}^{N_{total}} \mathbb{1}_{cn_i \le 0} \quad (3)$$

$$p_j^{left} = \frac{\sum_{i=0}^{N_{total}} \mathbb{1}_{cn_i > 0}\, L_{i,j}}{\sum_{i=0}^{N_{total}} \mathbb{1}_{cn_i > 0}} \quad (4)$$

$$p_j^{right} = \frac{\sum_{i=0}^{N_{total}} \mathbb{1}_{cn_i \le 0}\, L_{i,j}}{\sum_{i=0}^{N_{total}} \mathbb{1}_{cn_i \le 0}} \quad (5)$$

where $cn$ is short for condition network and $L_{i,j}$ is the ad-hoc label vector of the $i$-th sample, in which the true label position is 1 and the others are 0. By simple computations, we have:

$$\frac{\partial IG}{\partial\, \mathbb{1}_{cn > 0}} = \frac{\sum_{i=0}^{N_{total}} L_{i,j} \ln(p_j^{left})}{N_{total}} \quad (6)$$

$$\frac{\partial IG}{\partial\, \mathbb{1}_{cn \le 0}} = \frac{\sum_{i=0}^{N_{total}} L_{i,j} \ln(p_j^{right})}{N_{total}} \quad (7)$$

where $IG$ is short for $InfoGain$. As discussed in the Introduction, we approximate the Dirac symbol with a continuous function, specifically $1 - e^{-\alpha|x|} \approx \mathbb{1}_{x>0}$. Thus, the gradient of the condition network can be deduced as:

$$\frac{\partial IG}{\partial cn} \approx \frac{\partial IG}{\partial\, \mathbb{1}_{cn > 0}}\, \frac{\partial (1 - e^{-\alpha|cn|})}{\partial cn} = \frac{\partial IG}{\partial\, \mathbb{1}_{cn > 0}}\, \big(\alpha e^{-\alpha|cn|}\, s(cn)\big) \quad (8)$$

$$\frac{\partial IG}{\partial cn} \approx \frac{\partial IG}{\partial\, \mathbb{1}_{cn \le 0}}\, \frac{\partial (1 - e^{-\alpha|cn|})}{\partial cn} = \frac{\partial IG}{\partial\, \mathbb{1}_{cn \le 0}}\, \big(\alpha e^{-\alpha|cn|}\, s(cn)\big) \quad (9)$$

where $s$ is the sign function.

Actually, all the reduction can be performed automatically within the proposed principle, that is, multiplying the symbols inside the branch with the Dirac function. Specifically, as an example for the count, $N_{left} = \sum_{i=1}^{N_{total}} 1 \times \mathbb{1}_{cn_i > 0}$.
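As a quick sanity check of Eqs. (8)-(9), the derivative of the surrogate $1 - e^{-\alpha|x|}$ is $\alpha e^{-\alpha|x|} s(x)$. The short NumPy snippet below (our own illustration, with $\alpha = 1000$ as in the experimental settings) compares this analytic form against central finite differences.

```python
import numpy as np

alpha = 1000.0

def surrogate(x):
    # continuous approximation of the Dirac symbol, 1 - exp(-alpha * |x|)
    return 1.0 - np.exp(-alpha * np.abs(x))

def surrogate_grad(x):
    # analytic derivative: alpha * exp(-alpha * |x|) * sign(x), cf. Eqs. (8)-(9)
    return alpha * np.exp(-alpha * np.abs(x)) * np.sign(x)

x = np.array([-2e-3, -1e-4, 1e-4, 2e-3])
eps = 1e-9
numeric = (surrogate(x + eps) - surrogate(x - eps)) / (2 * eps)
print(np.allclose(numeric, surrogate_grad(x), rtol=1e-3))  # True
```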
Target Network. To finally predict the category of each sample, we apply a complex network as the target network, which is often a stacked convolutional network for images or an LSTM for sentences.
NDT can be treated as an ensemble model, which ensembles many target networks with the hard-branching condition networks. Currently, there exist two branches of ensemble methods, namely splitting by features or splitting by samples, both of which increase the difficulty for the single classifier. However, NDT splits the data by categories, which means each single classifier deals with a simpler task.

The key point is the split purity of the condition networks, because the branching reduces the number of samples for each leaf. Relative to a single classifier, if our model keeps the sample number per category, NDT gains more. For an example of one leaf, the sample number reduces to 30%, while the category number reduces from 10 to 3. With similarly sufficient samples, our model deals with a 3-class classification, which is much easier than a 10-class one. Thus, our model benefits from the strengthening of the single classifiers.
Algorithm 1
Neural Decision Tree (NDT)
3. Dynamical Graph Construction
As previously introduced, the neural graph is the intelligence architecture which is characterized by both logics and neurons. Mathematically, the components of neurons are continuous functions, such as matrix multiplication, the hyperbolic tangent (tanh), convolution layers, etc., which can be implemented as mathematical operations.

Obviously, a simple principal implementation for the non-batch mode is easy and direct. But practically, all the latest training methods take advantage of the batched mode. Hence, we focus on the batched implementation of the neural graph in this section.

Conventionally, a neural graph is composed of two styles of variables, namely symbols such as W in Figure 1, and atomic types such as the integer d in Algorithm 1, Line 2. In essence, symbolic variables originate from the weights between neurons, while the atomic types are introduced by the embedded traditional algorithms.

Therefore, regarding the component of logics, there exist two styles: symbol- and atomic-type-specific logic components, which are differentiated in implementation. Symbol-specific logics indicate that the condition involves the symbols, such as Line 5 ∼ of Algorithm 1. The logic flow consists of branch (i.e. if) and repeat (i.e. for, while), and repeat can be treated as performing branch multiple times. Thus, the three batch operations, namely the sub-, join- and allocate-batch operations, can process all the traditional algorithms, such as the resolution method, A* searching, Q-learning, label propagation, PCA, K-Means, Multi-Armed Bandit (MAB) and AdaBoost.
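For illustration, a minimal NumPy sketch of how the three batch operations could be realized is given below. The function names and the exact semantics of allocate-batch are our own reading of the text, not the authors' implementation.

```python
import numpy as np

def sub_batch(batch, mask):
    """Select the rows of `batch` whose condition holds, remembering their ids."""
    ids = np.nonzero(mask)[0]
    return batch[ids], ids

def join_batch(shape, parts):
    """Scatter the per-branch results back into one batch in the original order."""
    out = np.zeros(shape)
    for values, ids in parts:
        out[ids] = values
    return out

def allocate_batch(batch_size, dim, fill=0.0):
    """Allocate storage for a variable introduced inside the embedded logic flow."""
    return np.full((batch_size, dim), fill)

x = np.arange(8.0).reshape(4, 2)
mask = x[:, 0] > 3.0
left, left_ids = sub_batch(x, mask)      # "if" sub-batch
right, right_ids = sub_batch(x, ~mask)   # "else" sub-batch
y = join_batch(x.shape, [(left * 10, left_ids), (right * -1, right_ids)])
```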
4. Neural Graph is Turing Complete
Actually, if the neural graph can simulate a Turing machine, it is Turing complete. A Turing machine is composed of four parts: a cell-divided tape, a reading/writing head, a state register and a finite table of instructions. Correspondingly,
symbols are based on tensor arrays, which simulate the cell-divided tape. The forward/backward processes indicate where to read/write. Atomic-type-specific variables record the state. Last, the logic flow (i.e. if, while, for) constructs the finite instruction table.
In summary, neural graph is Turing complete.
Specifically, the neural graph is a learnable Turing machine rather than a static one. A learnable Turing machine can adjust its behaviors/performance according to data and environment. Traditional computation models focus on static algorithms, while the neural graph takes advantage of data and perception to strengthen the rationality of behaviors.
5. Experiment
In this section, we verify our model on two datasets: MNIST (Lecun et al., 1998) and CIFAR (Krizhevsky, 2009). We first introduce the experimental settings in Section 5.1. Then, in Section 5.2, we conduct performance experiments to test our model. Last, in Section 5.3, to further verify our theoretical analysis that NDT reduces the category number of the leaf nodes, we perform a case study to justify our assumption.
There exist three customized networks in our model, namely the feature, condition and target networks. We simply apply an identity mapping as the feature network. Regarding the condition network, we apply a two-layer fully connected perceptron, with the hyper-parameters input-300-1 for MNIST and input-3000-1 for CIFAR. Regarding the target network, we employ a three-layer fully connected perceptron, with the hyper-parameters input-300-100-10 for MNIST, input-3000-1000-10 for CIFAR-10 and input-3000-1000-100 for CIFAR-100. To train the model, we leverage AdaDelta (Zeiler, 2012) as our optimizer, with its moment factor η and constant ε as hyper-parameters. We train the model until convergence, for at most 1,000 rounds. Regarding the batch size, we always choose the largest one that fully utilizes the computing devices. Notably, the hyper-parameter of the approximated continuous function is α = 1000.

(We know the feature and target networks are too oversimplified for this task, but this version targets an exemplified model, which can still verify our conclusions. We will apply complex feature and target networks in the next/final version.)

MNIST. The MNIST dataset (Lecun et al., 1998) is a classic benchmark dataset, which consists of handwritten digit images, 28 x 28 pixels in size, organized into 10 classes (0 to 9) with 60,000 training and 10,000 test samples.
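For quick reference, the layer sizes stated above can be collected into one configuration sketch. The input dimensions (28 x 28 for MNIST, 32 x 32 x 3 for CIFAR) are our assumption of raw-pixel inputs, since the feature network is the identity mapping; this is a convenience summary, not the authors' released configuration.

```python
# Layer sizes from the experimental settings above; input sizes assume raw pixels.
NDT_CONFIG = {
    "MNIST": {
        "feature_net": "identity",
        "condition_net": [28 * 28, 300, 1],        # two-layer perceptron, tanh
        "target_net": [28 * 28, 300, 100, 10],     # three-layer perceptron
    },
    "CIFAR-10": {
        "feature_net": "identity",
        "condition_net": [32 * 32 * 3, 3000, 1],
        "target_net": [32 * 32 * 3, 3000, 1000, 10],
    },
    "CIFAR-100": {
        "feature_net": "identity",
        "condition_net": [32 * 32 * 3, 3000, 1],
        "target_net": [32 * 32 * 3, 3000, 1000, 100],
    },
    "optimizer": "AdaDelta",
    "alpha": 1000,        # sharpness of the surrogate 1 - exp(-alpha * |x|)
    "max_rounds": 1000,
}
```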
Table 1.
Performance Evaluation on MNIST Dataset.
Methods                   Accuracy (%)
Single Target Network     96.95
LeNet-5                   99.50
Multi-Perspective CNN     81.38
Deep Belief Net           98.75
SVM (RBF kernel)          98.60
Random Forest             96.80
NDT (depth = 2)           97.90
We select some representative and competitive baselines: a modern CNN-based architecture, LeNet-5 with dropout and ReLUs; the classic linear classifier SVM with an RBF kernel; Deep Belief Nets; and a standard Random Forest with 2,000 trees. We could observe that:

1. NDT will beat all the baselines, verifying our theory and justifying the effectiveness of our model.

2. Compared to the single target network, NDT promotes the accuracy by 0.65 points, which illustrates that the ensemble of target networks is effective.

3. Compared to Random Forest, which is also a tree-based method, NDT promotes the accuracy by 0.75 points, which demonstrates that the neurons indeed strengthen the decision trees.
CIFAR.
The CIFAR-10/100 datasets (Krizhevsky, 2009) are also classic benchmarks for overfull-category classification, which consist of color natural images, 32 x 32 pixels in size, from 10/100 classes with 50,000 training and 10,000 test images. Several representative baselines are selected: Network in Network (NIN) (Lin et al., 2013), FitNets (Rao et al., 2016), Deeply Supervised Nets (DSN) (Lee et al., 2014), High-Way (Srivastava et al., 2015), All-CNN (Springenberg et al., 2014), Exponential Linear Units (ELU) (Clevert et al., 2015), FitResNets (Mishkin & Matas, 2015), gcForest (Zhou & Feng, 2017) and Deep ResNet (He et al., 2016). We could conclude that:

1. NDT will beat all the strong baselines, which verifies the effectiveness of neural decision trees and justifies the theoretical analysis.

2. Compared to the single target network, NDT promotes the accuracy by 4.85 points, which illustrates that the ensemble of target networks is effective.

3. Compared with gcForest, the performance improves by - points, which illustrates that neurons empower the decision trees more effectively than direct ensembles.

4. Compared with ResNet, which is the strongest baseline, we promote the results by over - points, which justifies our assumption that NDT reduces the category number of the leaf nodes to enhance the intelligence systems.
Figure 3.
Case study for NDT on MNIST with depth = 2 (a) and depth = 3 (b). The left tables are the test sample numbers that correspond to the row-th leaf node and the col-th category. For example, the sliced "1105" means there are 1,105 test samples of category "1" in leaf node "A". We slice the main component of a leaf and draw the corresponding decision trees in the right panel. Notably, "X" indicates the empty class.
Table 2.
Performance Evaluation (Error (%)) on CIFAR.
Methods                  CIFAR-10    CIFAR-100
NIN                      8.81        35.68
DSN                      8.22        34.57
FitNets                  8.39        35.04
High-Way                 7.72        32.39
All-CNN                  7.25        33.71
ELU                      6.55        24.28
FitResNets               5.84        27.66
ResNet                   6.61        25.16
gcForest                 31.00       -
Random Forest            50.17       -
Single Target Network    -           89.37
NDT (depth = 4)          -

To further testify our assumption that NDT reduces the category number of the leaf nodes, we perform a case study on MNIST. We make statistics of the test samples for each leaf node, as illustrated in Figure 3. Each item of the table means how many samples of the col-th category fall into the row-th leaf node. For example, the "1105" in the first row and second column means that 1,105 test samples of category "1" are pre-classified into leaf node "A". Correspondingly, we draw the decision trees in the right panel with labeled categories, which specifically illustrates the decision process of NDT. For a complete verification, we vary the depth of NDT with 2 and 3.

Firstly, we can clearly draw the conclusion from Figure 3 that each leaf node needs to predict fewer categories, which justifies our assumption. For example, in the bottom figure, the node "A" only needs to predict the category "1", which is a single classification, and the node "H" only needs to predict the categories "0, 3, 5, 8", which is a four-class classification. Because a small classification task is less difficult than a large one, our target network in each leaf can perform better, which leads to a performance promotion in a tree-ensemble manner.

Secondly, from Figure 3, the split purity can be worked out. Generally, the two-layer tanh perceptron achieves a decent split purity. Indeed, the most difficult leaf nodes (e.g. "D" in the top and "H" in the bottom) are not perfect, but the others gain a competitive split purity. Statistically, the main component of the sliced grid takes a 92.4% share of the total samples, so with a large probability, NDT would perform better than 92.4% accuracy in this case.
Finally, we discuss the hyper-parameter depth. From the top to the bottom of Figure 3, the categories are further split. For example, the node "B" in the top is split into "C" and "D" in the bottom, which means that the categories "2" and "6" are further pre-classified. In this way, a deeper neural decision tree is advantageous. But a much deeper NDT makes less sense, because the categories have already been split well, and there would be mostly no difference between a 1- and a 2-class classification. However, considering the efficiency and the consumed resources, we suggest applying a suitable depth, theoretically about $\log(|C|)$, where $|C|$ is the total category number; for instance, with the 10 categories of MNIST and binary splits, this suggests a depth of roughly $\log_2 10 \approx 3$.
6. Related Work
In this section, we briefly introduce three lines of related work: image recognition, decision trees and neural graph.

Convolution layers are necessary in current neural architectures for image recognition. Almost every model is a convolutional model with different configurations and layers, such as All-CNN (Springenberg et al., 2014) and DSN (Lee et al., 2014). Empirically, deeper networks produce better accuracy, but it is difficult to train much deeper networks due to the issue of vanishing/exploding gradients (Glorot & Bengio, 2010). Recently, two ways have emerged to tackle this problem: High-Way (Srivastava et al., 2015) and Residual Networks (He et al., 2015). Inspired by LSTM, the high-way network applies transform and carry gates for each layer, which allow information to flow across layers along the computation path without attenuation. In a more direct manner, the residual network simply employs identity mappings to connect relatively top and bottom layers, which propagates the gradients more effectively through the whole network. Notably, achieving the state-of-the-art performance, the residual network (ResNet) is, temporarily, the strongest model for image recognition.

Decision trees are a classic paradigm of artificial intelligence, and random forest is the representative methodology of this branch. During recent years, completely random tree forests have been proposed, such as iForest (Liu et al., 2008) for anomaly detection. However, with the popularity of deep neural networks, many researches focus on the fusion between neurons and random forests. For example, (Richmond et al., 2015) converts cascaded random forests to convolutional neural networks, and (Welbl, 2014) leverages random forests to initialize neural networks. Specially, as the state-of-the-art model, gcForest (Zhou & Feng, 2017) allocates a very deep architecture for forests, which is experimentally verified on several tasks. Notably, none of this branch could jointly train the neurons and the decision trees, which is the main disadvantage.

To jointly fuse neurons and logics, (Xiao, 2017) proposes the basic principle of the neural graph, which could embed traditional logics-based algorithms into neural architectures. The seminal paper merges the Hungarian algorithm with neurons as a Hungarian layer, which could effectively recognize matched/unmatched sentence pairs. However, as a special case, the variables only introduced in the condition could not be updated, which is a disadvantage for characterizing complex systems. Thus, this paper focuses on this issue to make a fully functioned neural graph.
7. Future Work
We list three lines of future work: designing new components of the neural graph, implementing a script language for the neural graph, and analyzing the theoretical properties of the learnable Turing machine.

This paper exemplifies an approach to embed decision trees into neural architectures. Actually, many traditional algorithms could promote intelligence systems with neurons. For example, a neural A* search could learn the heuristic rules from data, which could be more effective and less resource-consuming. For a further example, we could represent the data with deep neural networks and conduct label propagation upon the hidden representations, where the propagation graph is constructed by the K-NN method. Because the label propagation, K-NN and deep neural networks are trained jointly, a performance promotion could be expected.

In fact, a fully functioned neural graph may be extremely hard and complex to implement. Thus, we expect to publish a script language for modeling neural graphs and also a library that includes all the mainstream intelligence methods. Based on these instruments, the neural graph could be more convenient for practical usage.

Finally, as we discussed, the neural graph is Turing complete, making a learnable Turing machine. We believe theoretical analysis is necessary for the compilation and ability of the neural graph. Take an example: do the learnable and the static Turing machine have the same ability? Take a further example: could our brain excel the Turing machine? If not, some excellent neural graphs may gain advantages over the biological brain, because both of them are learnable Turing machines. If it could, the theoretical foundations of intelligence should be reformed. Take the final example: what is the best computation model for intelligence?
8. Conclusion
This paper proposes the principle of the fully functioned neural graph. Based on this principle, we design the neural decision tree (NDT) for image recognition. Experimental results on benchmark datasets demonstrate the effectiveness of our proposed method.
References
Casper, M., Mengel, M., Fuhrmann, C., Herrmann, E., Appenrodt, B., Schiedermaier, P., Reichert, M., Bruns, T., Engelmann, C., and Grnhage, F. Perceptrons: An introduction to computational geometry. 75(3):3356-62, 1969.

Clevert, Djork-Arné, Unterthiner, Thomas, and Hochreiter, Sepp. Fast and accurate deep network learning by exponential linear units (ELUs). Computer Science, 2015.

Deng, Jia, Dong, Wei, Socher, Richard, Li, Li-Jia, Li, Kai, and Fei-Fei, Li. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pp. 248-255. IEEE, 2009.

Glorot, Xavier and Bengio, Yoshua. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 249-256, 2010.

He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, and Sun, Jian. Deep residual learning for image recognition. pp. 770-778, 2015.

He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, and Sun, Jian. Deep residual learning for image recognition. In Computer Vision and Pattern Recognition, pp. 770-778, 2016.

Krizhevsky, Alex. Learning multiple layers of features from tiny images. 2009.

Lecun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.

Lecun, Yann, Bengio, Yoshua, and Hinton, Geoffrey. Deep learning. Nature, 521(7553):436-444, 2015.

Lee, Chen Yu, Xie, Saining, Gallagher, Patrick, Zhang, Zhengyou, and Tu, Zhuowen. Deeply-supervised nets. Eprint Arxiv, pp. 562-570, 2014.

Lin, Min, Chen, Qiang, and Yan, Shuicheng. Network in network. Computer Science, 2013.

Liu, Fei Tony, Kai, Ming Ting, and Zhou, Zhi Hua. Isolation forest. In Eighth IEEE International Conference on Data Mining, pp. 413-422, 2008.

Mcculloch, Warren S. and Pitts, Walter H. A logical calculus of ideas imminent in nervous activity. 1943.

Mishkin, Dmytro and Matas, Jiri. All you need is a good init. 69(14):3013-3018, 2015.

Rao, Jinfeng, He, Hua, and Lin, Jimmy. Noise-contrastive estimation for answer selection with deep neural networks. In ACM International on Conference on Information and Knowledge Management, pp. 1913-1916, 2016.

Richmond, David L., Kainmueller, Dagmar, Yang, Michael Y., Myers, Eugene W., and Rother, Carsten. Relating cascaded random forests to deep convolutional neural networks for semantic segmentation. Computer Science, 2015.

Rumelhart, D. E., Hinton, G. E., and Williams, R. J. Learning internal representations by error propagation. MIT Press, 1988.

Silver, David, Huang, Aja, Maddison, Chris J., Guez, Arthur, Sifre, Laurent, Driessche, George Van Den, Schrittwieser, Julian, Antonoglou, Ioannis, Panneershelvam, Veda, and Lanctot, Marc. Mastering the game of go with deep neural networks and tree search. Nature, 529(7587):484, 2016.

Springenberg, Jost Tobias, Dosovitskiy, Alexey, Brox, Thomas, and Riedmiller, Martin. Striving for simplicity: The all convolutional net. Eprint Arxiv, 2014.

Srivastava, Rupesh Kumar, Greff, Klaus, and Schmidhuber, Jürgen. Training very deep networks. Computer Science, 2015.

Welbl, Johannes. Casting random forests as artificial neural networks (and profiting from it). In German Conference on Pattern Recognition, pp. 765-771. Springer, 2014.

Xiao, Han. Hungarian layer: Logics empowered neural architecture. arXiv preprint arXiv:1712.02555, 2017.

Zeiler, Matthew D. Adadelta: An adaptive learning rate method. 2012.