Supervising Unsupervised Learning with Evolutionary Algorithm in Deep Neural Network
Takeshi Inagaki
IBM Japan, Tokyo
Abstract—A method to control the results of gradient-descent unsupervised learning in a deep neural network by means of an evolutionary algorithm is proposed. To process crossover of unsupervisedly trained models, the algorithm evaluates the pointwise fitness of individual nodes in the neural network. Labeled training data are randomly sampled, and the breeding process selects nodes by calculating the degree of their consistency across different sets of sampled data. In this way, the method supervises unsupervised training through an evolutionary process. We also introduce a modified Restricted Boltzmann Machine which contains a repulsive force among the nodes of a neural network; it isolates network nodes from each other and prevents accidental degeneration of nodes during the evolutionary process. These new methods are applied to a document classification problem and yield better accuracy than a traditional fully supervised classifier implemented with a linear regression algorithm.
Index Terms—Deep neural networks; deep learning; evolutionary algorithm; classification problem
I. INTRODUCTION
Deep learning that applies unsupervised learning to the lower layers (near the input layer) and trains the higher layers (near the output layer) with labeled data captures a wide variety of data characteristics from unlabeled data, and the supervised learning layer can then recognize abstracted features of the target data [1]. Data in some categories, such as images, sound, or other data collected by sensors, are governed by the laws of physics behind them, and unsupervised learning is expected to automatically detect the patterns in the data caused by these laws. These patterns are regarded as features of the data to be used for the supervised learning of the higher layers. However, if this method is applied to more conceptual problems such as the classification of text documents written in natural language, the features extracted by the unsupervised layers are not always relevant to the labeling of the data [2].

To understand the reason, we need to know the role of unsupervised learning in the lower layers. Unsupervised learning layers are designed to have a large number of input nodes and a smaller number of output nodes; this reduces the number of dimensions of the parameter space representing the input data. In the case of image recognition, not all possible arrangements of pixels occur in photographic images. Only a limited number of arrangement patterns appear in the real world governed by the laws of physics, such as the face of a cat, the shape of trees in a forest, a tall building, and so on. By applying unsupervised learning to a neural network, the raw pixel parameters are expected to be reduced to a limited number of image patterns. These patterns, with their reduced number of degrees of freedom, are regarded as representations of the concepts humans recognize. By referring to these abstracted features, supervised learning performed on the higher layers of the network does not need a large amount of labeled data covering all pixel arrangements; it only needs labels for a limited number of pixel patterns.

When the same method is applied to text documents, the unsupervised learning layers are expected to form clusters of words or phrases representing abstracted features. A problem that arises in the classification of text documents is that concepts in human writing are not always concrete but sometimes rather abstract, with ambiguity in their nature. For example, a text may allow multiple interpretations depending on the interests of the individual reader. In document classification, one text includes multiple concepts; some of them are relevant for a specific classification while others are not useful in that classification context. The concepts automatically detected by unsupervised learning include concepts unnecessary for a specific classification problem, and these may cause unexpected results in classification. To avoid this, we develop a new evolutionary algorithm which picks the nodes in the neural network relevant for the specific classification problem to be addressed and generates a child neural network from them. This new method yields better classification accuracy than a traditional classifier using the linear regression algorithm. It works efficiently especially when there is a large amount of unlabeled data but only a small number of labeled examples.

II. EVOLUTIONARY ALGORITHM
The idea of applying evolutionary algorithms to neural networks (called neuroevolution) has a rather long history [3] and is still an active area of study [4]. It was inspired by the evolution of life and is designed to adapt a neural network to its external environment. It evaluates the fitness of parents to the environment and breeds children from the better-fitting individuals. The basic techniques employed are crossover and mutation. This approach is regarded as a method for solving an optimization problem by random sampling (namely, global search). It is an alternative to the traditional gradient descent method, which assumes an analytic property (differentiability) of the loss function in the parameter space, the learning process traveling along a trajectory toward a minimum point of the function.

The new method introduced in this paper applies the evolutionary steps of an evolutionary algorithm to overcome the issue discussed in the previous section. Geometrically, there is no distinction between relevant features and irrelevant features extracted by unsupervised learning; both are detected equally by gradient descent. In the crossover process of the evolutionary algorithm, the selection of features is performed by a sampling method. This can be done by introducing a metric to measure the relevancy of features. This new method turns out to be effective for cases where gradient descent partially works but cannot fit a solution perfectly due to the global ambiguity of the underlying unsupervised learning results, by which not all geometric minima are relevant for a solution. The evolutionary process selects the relevant ones and removes the irrelevant ones.
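For reference, a minimal generic evolutionary loop with the fitness evaluation, crossover, and mutation operations described above might look as follows. This is a sketch only: the callables fitness, crossover, and mutate are hypothetical placeholders, and the paper's method replaces this whole-model selection with the per-node selection of Section III.

```python
import random

def evolutionary_search(population, fitness, crossover, mutate,
                        generations=50, mutation_rate=0.1):
    """Generic evolutionary loop: evaluate fitness against the
    environment, keep the better-fitting half, and breed children
    by crossover with occasional mutation (global random search)."""
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[:max(2, len(ranked) // 2)]
        children = []
        while len(parents) + len(children) < len(population):
            mother, father = random.sample(parents, 2)
            child = crossover(mother, father)
            if random.random() < mutation_rate:
                child = mutate(child)
            children.append(child)
        population = parents + children
    return max(population, key=fitness)
```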
III. SUPERVISING EVOLUTIONARY ALGORITHM

The new algorithm proposed here is a hybrid of an evolutionary algorithm and the gradient descent method. We use a simple model with three layers of nodes. Nodes in a layer are connected to the nodes of the adjacent layers, upper and lower. Nodes in the middle layer (the hidden layer) correspond to concepts in the data, and each of them is connected to the nodes of the input layer and the output layer. In the case of the document classification problem, nodes in the input layer correspond to words in a document, and nodes in the output layer correspond to the categories of classification. In more general cases, the input nodes correspond to features extracted from the target data, e.g. shapes in images and so on. Connections among nodes are represented by weight factors (denoted as w_{i,j} below). The hidden layer is trained by an unsupervised training algorithm (an autoencoder is used in this paper) to form concepts in the data, for example clusters of words in the context of document classification. To define a statistical mechanics model for this hidden layer, the Restricted Boltzmann Machine described by the action (1) is employed:

S(x, y) = -\sum_{i,j} w_{i,j} y_i x_j + \sum_i b_i y_i + \sum_j c_j x_j    (1)

where y_i represents concepts and x_j represents input features such as the appearance of words in a document. With this action, the probability distribution function of concepts for given input features is given by (2):

P(y | x) = e^{-S(x, y)}    (2)

The output layer on top of the hidden layer is trained with labeled data to associate the concepts in the hidden layer with the categories of classification. This layer is trained by backpropagation to correct discrepancies between the predictions and the label data.

The idea of the evolutionary algorithm is to execute these two training steps in iterations, where every iteration uses the bred artifacts of the previous iteration as the set of initial values of the model parameters for the next training. During these continuous steps, it generates a child model from multiple parent models. A child is built up from hidden layer nodes picked from the parents. These nodes are selected by evaluating their relevancy for the given classification problem. In that way, later generations of a model contain concepts relevant for the classification problem and achieve better accuracy for that problem. The process is described below (see the sketch after this list):

1) In the first few iterations, models are populated with randomly generated initial seed parameters. First, train the hidden layer with unlabeled data by unsupervised training, then train the output layer with labeled data. These learning processes employ gradient descent. The resulting models are put in a pool of models.
2) Thereafter, examine the models in the model pool with labeled test data and select two or more models with good precision scores.
3) In the hidden layers of the selected models, pick the hidden nodes (concepts) consistently contributing to classification. A function to measure this consistency is defined later in this paper.
4) Breed a new hidden layer combining the nodes picked in the previous step, and create a new output layer with random parameters.
5) Train the new hidden layer with unlabeled data and the new output layer with labeled data by gradient descent again. Then put this new model in the model pool.
6) Repeat steps 2) to 5) until the growth of model accuracy saturates.
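The following is a minimal Python sketch of one iteration of this breeding loop. The helpers precision(), consistency_scores(), make_model(), train_hidden_unsupervised(), and train_output_supervised() are hypothetical placeholders standing in for the operations named in steps 2) to 5), not the author's implementation; the consistency function itself corresponds to equation (3) below.

```python
def breed_iteration(pool, unlabeled, labeled, test_set,
                    n_parents=2, n_hidden=40):
    """One iteration of steps 2)-5): select good parents, pick
    consistently contributing hidden nodes, breed a child model,
    and retrain it by gradient descent."""
    # Step 2): rank pooled models by precision on labeled test data.
    parents = sorted(pool, key=lambda m: precision(m, test_set),
                     reverse=True)[:n_parents]
    # Step 3): score each parent's hidden nodes for consistent
    # contribution to classification (the function of equation (3)).
    candidates = [(score, node)
                  for parent in parents
                  for score, node in consistency_scores(parent, labeled)]
    candidates.sort(key=lambda pair: pair[0], reverse=True)
    # Step 4): breed a child hidden layer from the best nodes and
    # attach a randomly initialized output layer.
    child = make_model(hidden_nodes=[node for _, node in candidates[:n_hidden]])
    # Step 5): retrain both layers by gradient descent and pool the child.
    train_hidden_unsupervised(child, unlabeled)
    train_output_supervised(child, labeled)
    pool.append(child)
    return pool
```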
More details of steps 3) and 4) are as follows. Pick M models from the model pool and perform supervised training of the output layer. At this point, we divide the labeled documents for training into N sets at random, but such that all categories are included equally. With these training sets, train N individual output layers on top of the same hidden layer. Do this operation for all M hidden layers. This process generates M × N models. The evolutionary process then picks the relevant nodes of a hidden layer, i.e. those which contribute to classification consistently for all N models trained on top of that hidden layer. Doing the same for all M independent hidden layers, each with its N output layers, we obtain a set of hidden layer nodes picked from the M models. To evaluate the relevancy of each node in the hidden layer, a score value calculated by the function (3) is used:

s_j = \sum_{n' \neq n} \sum_i w^n_{i,j} w^{n'}_{i,j}    (3)

where w^n_{i,j} are the weight parameters of the output layer, the index j stands for the hidden layer nodes (concepts), i for the output layer nodes (categories), and n is the index over the N training sets of the output layer. This regards w^n_{i,j} as a vector |w^n_j> spanning the dimensions of the categories. The score (3) is the sum of the inner products of two vectors |w^n_j> and |w^{n'}_j> over all combinations of n and n' with n ≠ n'. Here the inner product is used to measure the coincidence of concept j in document sets n and n' over all categories.

To understand the implication of this, we need to know the cause of misdetection of categories. Concepts detected by the autoencoder are based on deviations of the distribution of words over the documents. These may or may not be related to concepts relevant for the categories. If we do supervised training on top of these concepts, some of the irrelevant concepts may correlate with categories by accident. However, this is an accidental occurrence only on the particular set of training documents used, and it may not happen for other sets of documents. Because there is no relation between the sampling-based division of the training documents and the labeled categories of the documents, if we sample multiple times, the effect of the labeled categories will persist but the effects of accidental co-occurrence in the sampled documents will disappear. For example, consider the case where both a relevant and an irrelevant concept appear on a set of documents of a category with probability 50%. If we take another document set of the same category, the relevant concept will appear in it again consistently with probability 50%, but the same irrelevant one only with 25%, requiring a double accident. With this observation, we can conclude that if a concept is relevant for the categories, it will be correlated with them across different sample sets of documents. In the simplest case of N = 2, the labeled documents are divided into two sets, even and odd, both including all categories of documents equally. Equation (3) becomes s_j = \sum_i w^{even}_{i,j} w^{odd}_{i,j}. This measures how a concept j correlates with all categories in the even and odd sets. If the concept is relevant, this inner product becomes large. By applying this measurement in the iteration process described above, the nodes in the hidden layer continuously become more relevant for the categories of classification.
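As a minimal sketch, the score (3) can be computed directly from the stack of output-layer weight matrices. The array layout assumed here (N training sets × C categories × J hidden nodes) is an illustrative choice, not prescribed by the paper:

```python
import numpy as np

def node_scores(w):
    """Relevancy score s_j of equation (3).

    w: array of shape (N, C, J) holding the output-layer weights
       w^n_{i,j} of N output layers trained on disjoint random
       splits of the labeled data (C categories, J hidden nodes).
    Returns an array of J scores, one per hidden node.
    """
    # Sum of inner products <w^n_j | w^n'_j> over all n, n' pairs,
    # then subtract the n == n' terms to keep only n != n'.
    total = np.einsum('nij,mij->j', w, w)
    diag = np.einsum('nij,nij->j', w, w)
    return total - diag

# Simplest case N = 2 (even/odd split): s_j reduces to
# sum_i w_even[i, j] * w_odd[i, j], counted twice by the symmetric sum.
w = np.random.randn(2, 20, 40)   # e.g. 20 categories, 40 hidden nodes
print(node_scores(w).shape)      # (40,)
```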
IV. IMPLEMENTATION

We apply a Restricted Boltzmann Machine in the hidden layer and train it with Denoising Autoencoders [5], performing the encoding and decoding steps and adjusting the parameters w, b, c by gradient steps that optimize the cross-entropy objective (4):

L = \sum_j \{ x_j \ln \hat{x}_j + (1 - x_j) \ln(1 - \hat{x}_j) \}    (4)

where \hat{x} is the reconstructed visible parameter calculated from equations (5) and (6):

E_i = \sum_j w_{i,j} x_j + b_i,   E_j = \sum_i w_{i,j} \hat{y}_i + c_j    (5)

\hat{x}_j = \frac{1}{1 + e^{-E_j}},   \hat{y}_i = \frac{1}{1 + e^{-E_i}}    (6)

One potential problem with the algorithm described in the previous section is the condensation of multiple nodes of the hidden layer onto a single concept. That is, multiple nodes in the hidden layer may represent the same concept redundantly and fail to contribute to classification individually. To avoid that, we implemented a repulsive force among the nodes of the Restricted Boltzmann Machine. It is achieved by modifying the energy of the Boltzmann Machine states as in (7):

\tilde{E}_i = \sum_j \Big( w_{i,j} - \alpha \sum_{i' \neq i} w_{i',j} \Big) x_j + b_i,   E_j = \sum_i w_{i,j} \hat{y}_i + c_j    (7)

where \alpha is a small constant proportional to O(\sqrt{C}) and C is the number of nodes of the output layer (the number of categories). This modification changes the stochastic gradient descent to (8)-(10). With this, the influence of input nodes (words) commonly referred to by many nodes in the hidden layer is suppressed.

\frac{\partial L}{\partial w_{i,j}} = \sum_{i'} \frac{\partial L}{\partial \tilde{E}_{i'}} \frac{\partial \tilde{E}_{i'}}{\partial w_{i,j}} + \sum_{j'} \frac{\partial L}{\partial E_{j'}} \frac{\partial E_{j'}}{\partial w_{i,j}}
  = \Big[ \Big\{ \sum_{j'} w_{i,j'} (x_{j'} - \hat{x}_{j'}) \Big\} \hat{y}_i (1 - \hat{y}_i) - \alpha \sum_{i'' \neq i} \Big\{ \sum_{j'} w_{i'',j'} (x_{j'} - \hat{x}_{j'}) \Big\} \hat{y}_{i''} (1 - \hat{y}_{i''}) \Big] x_j + (x_j - \hat{x}_j) \hat{y}_i    (8)

\frac{\partial L}{\partial b_i} = \sum_{i'} \frac{\partial L}{\partial \tilde{E}_{i'}} \frac{\partial \tilde{E}_{i'}}{\partial b_i}
  = \Big\{ \sum_{j'} w_{i,j'} (x_{j'} - \hat{x}_{j'}) \Big\} \hat{y}_i (1 - \hat{y}_i) - \alpha \sum_{i'' \neq i} \Big\{ \sum_{j'} w_{i'',j'} (x_{j'} - \hat{x}_{j'}) \Big\} \hat{y}_{i''} (1 - \hat{y}_{i''})    (9)

\frac{\partial L}{\partial c_j} = \sum_{j'} \frac{\partial L}{\partial E_{j'}} \frac{\partial E_{j'}}{\partial c_j} = x_j - \hat{x}_j    (10)

To stabilize the learning results, an ensemble of hidden layers is populated. The input layer is connected to multiple hidden layers that are independently trained and evolved, and the output layer is connected to all nodes of this ensemble of hidden layers. Instead of one big hidden layer with a large number of concepts, a model has a set (ensemble) of independent hidden layers each with a smaller number of concepts; this contributed to improved accuracy in our experiments.
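A minimal numpy sketch of one autoencoder gradient step with the repulsive term of (7)-(10) follows. The array shapes, the learning rate eta, and the default value of alpha are illustrative assumptions, and the input-corruption step of the denoising autoencoder is omitted for brevity:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def repulsive_autoencoder_step(w, b, c, x, alpha=0.05, eta=0.1):
    """One gradient step with the repulsive force of (7).
    w: (H, V) weights for H hidden and V visible nodes; x: (V,)."""
    # Encoding with the modified energy (7): each hidden node's
    # weights are repelled by the summed weights of all other nodes.
    w_tilde = w - alpha * (w.sum(axis=0, keepdims=True) - w)
    y = sigmoid(w_tilde @ x + b)     # hidden activations, eqs. (6)-(7)
    x_hat = sigmoid(w.T @ y + c)     # reconstruction, eqs. (5)-(6)

    r = x - x_hat                    # (x_j - x_hat_j) of (8)-(10)
    g = (w @ r) * y * (1.0 - y)      # {sum_j' w_{i,j'} r_{j'}} y_i (1 - y_i)
    # Repulsive cross term of (8)-(9): subtract alpha times the
    # contribution of all other hidden nodes.
    g_tilde = g - alpha * (g.sum() - g)

    # Gradient steps following the signs of (8)-(10), which ascend
    # the objective (4) as written.
    w += eta * (np.outer(g_tilde, x) + np.outer(y, r))
    b += eta * g_tilde
    c += eta * r
    return w, b, c
```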
V. EXPERIMENTAL RESULTS

We classified 2,000 documents of 20 categories (100 documents in each). A hidden layer consists of 40 nodes, and a model has 10 independent hidden layers for ensemble learning. To compare the result with a traditional method, the linear regression algorithm was applied to the same data. Both the linear regression model and this neural network model were trained with 200 labeled documents (10 documents per category). In the case of our new algorithm, however, 20,000 documents from the same corpus were used for unsupervised learning. The accuracy of the linear regression model was 38.9%. When Tf-Idf calculated from the 20,000 documents was applied, the result of the linear regression algorithm improved to 42.8%. On the other hand, the accuracy of the new model was 56.2%. This accuracy varies largely with the number and quality of documents used for unsupervised training. Fig. 1 shows the increase of accuracy by iteration when 2,000 documents were used for unsupervised training (46.1%), when 20,000 documents were used (56.2%), and when artificial documents, generated from the 2,000 documents by repeatedly assembling words from the same categories into 20,000 documents, were used (84.4%). This artificial document set is regarded as the case where an ideally large number of documents belonging to the categories is used for unsupervised training.
Fig. 1. Increase of classification accuracy by iteration for different numbers of documents used for unsupervised training: 2,000 documents (dotted line, 46.1%), 20,000 documents (solid line, 56.2%), and artificially generated documents simulating an ideally large number of documents (dashed line, 84.4%). Supervised training was performed with 200 documents (10 labeled documents for each of 20 categories) in all cases. The vertical axis is accuracy in percent; the horizontal axis is the number of iterations. Accuracy improves with the number of documents used for unsupervised training.
Fig. 2 shows the effect of the repulsive force introduced by (7). When this effect is absent, the increase of accuracy is slower, and accuracy remains lower even after repeated iterations.
Fig. 2. Comparison of the accuracy increase by iteration with the repulsive force applied (solid line, 56.2% at maximum) and without it (dashed line, 47.5% at maximum). The vertical axis is accuracy in percent; the horizontal axis is the number of iterations. Without the repulsive force, accuracy is lower and unstable.
Note that Tf-Idf, which was effective for the linear regression model, did not improve accuracy when applied to the new algorithm proposed in this paper. The reason is presumed to be that the same effect is already incorporated by the repulsive force introduced in (7).

VI. CONCLUSION