A Layer-Wise Information Reinforcement Approach to Improve Learning in Deep Belief Networks
Mateus Roder, Leandro A. Passos, Luiz Carlos Felix Ribeiro, Clayton Pereira, João Paulo Papa
São Paulo State University - UNESP, Bauru, Brasil
{mateus.roder, clayton.pereira, leandro.passos, luiz.felix, joao.papa}@unesp.br

Abstract.
With the advent of deep learning, the number of works proposing new methods or improving existing ones has grown exponentially in the last years. In this scenario, "very deep" models emerged, since they were expected to extract more intrinsic and abstract features while supporting better performance. However, such models suffer from the gradient vanishing problem, i.e., backpropagation values become too close to zero in their shallower layers, ultimately causing learning to stagnate. Such an issue was overcome in the context of convolutional neural networks by creating "shortcut connections" between layers, in a so-called deep residual learning framework. Nonetheless, a very popular deep learning technique called Deep Belief Network still suffers from gradient vanishing when dealing with discriminative tasks. Therefore, this paper proposes the Residual Deep Belief Network, which reinforces the information conveyed layer-by-layer to improve feature extraction and knowledge retention, thus supporting better discriminative performance. Experiments conducted over three public datasets demonstrate its robustness concerning the task of binary image classification.
Keywords:
Deep Belief Networks · Residual Networks · Restricted Boltzmann Machines.
1 Introduction

Machine learning-based approaches have been massively studied and applied to daily tasks in the last decades, mostly due to the remarkable accomplishments achieved by deep learning models. Despite the success attained by these techniques, they still suffer from a well-known drawback regarding the backpropagation-based learning procedure: the vanishing gradient. This problem becomes more prominent in deeper models, since the gradient vanishes and is not propagated adequately to the former layers, thus preventing a proper parameter update.
To tackle such an issue, He et al. [4] proposed the ResNet, a framework in which the layers learn residual functions with reference to the layer inputs, instead of learning unreferenced functions. In short, the idea is to map a set of stacked layers to a residual mapping, which comprises a combination of the set's input and output, and then map it back to the desired underlying mapping.

The model quickly achieved popularity, being applied in a wide range of applications, such as traffic surveillance [7], medicine [12, 8], and action recognition [2], to cite a few. Moreover, many works proposed different approaches built on the idea of residual functions. Lin et al. [11], for instance, proposed the RetinaNet, a pyramidal-shaped network that employs residual stages to deal with one-shot small object detection over unbalanced datasets. Meanwhile, Szegedy et al. [17] proposed the Inception-ResNet for object recognition. Later, Santos et al. [16] proposed the Cascade Residual Convolutional Neural Network for video segmentation.

In the context of deep neural networks, there exists another class of methods composed of Restricted Boltzmann Machines (RBMs) [6], a stochastic approach represented as a bipartite graph whose training is given by the minimization of the energy between a visible and a latent layer. Among these methods, Deep Belief Networks (DBNs) [5] and Deep Boltzmann Machines [15, 13] achieved considerable popularity in the last years due to the satisfactory results over a wide variety of applications [14, 3, 18].

However, to the best of our knowledge, no work has addressed the concept of reinforcing the feature extraction of those models in a layer-by-layer fashion. Therefore, the main contributions of this paper are twofold: (i) to propose the Residual Deep Belief Network (Res-DBN), a novel approach that combines each layer's input and output to reinforce the information conveyed through it, and (ii) to support the literature concerning both DBNs and residual-based models.

The remainder of this paper is organized as follows: Section 2 introduces the main concepts regarding RBMs and DBNs, while Section 3 proposes the Residual Deep Belief Network. Further, Section 4 describes the methodology and datasets employed in this work. Finally, Sections 5 and 6 provide the experimental results and conclusions, respectively.
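Before moving on, the sketch below illustrates the residual mapping discussed above. It is a minimal example of our own (PyTorch is assumed only for illustration; the class name and layer sizes are not taken from [4]): the block learns a residual function F(x) and adds the input back through an identity shortcut.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Learns a residual function F(x); the shortcut adds the input back."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        identity = x                      # the "shortcut connection"
        out = self.relu(self.conv1(x))
        out = self.conv2(out)
        return self.relu(out + identity)  # desired mapping: F(x) + x
```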
2 Theoretical Background

This section introduces a brief theoretical background regarding Restricted Boltzmann Machines and Deep Belief Networks.
2.1 Restricted Boltzmann Machines

A Restricted Boltzmann Machine stands for a stochastic, physics-inspired computational model capable of learning the intrinsic patterns of a data distribution. The process is represented as a bipartite graph in which the data composes a visible input-like layer $\mathbf{v}$, while a latent $n$-dimensional vector $\mathbf{h}$ comprises the set of hidden neurons onto which the model tries to map such inputs. The model's training procedure dwells on the minimization of the system's energy, given as follows:

$$E(\mathbf{v}, \mathbf{h}) = -\sum_{i=1}^{m} b_i v_i - \sum_{j=1}^{n} c_j h_j - \sum_{i=1}^{m}\sum_{j=1}^{n} w_{ij} v_i h_j, \quad (1)$$

where $m$ and $n$ stand for the dimensions of the visible and hidden layers, respectively, while $\mathbf{b}$ and $\mathbf{c}$ denote their respective bias vectors. Further, $\mathbf{W}$ corresponds to the weight matrix connecting both layers, in which $w_{ij}$ stands for the connection between the visible unit $i$ and the hidden unit $j$. Notice the model is restricted, thus implying that no connections are allowed among neurons of the same layer.

Ideally, the model would be solved by computing the joint probability of the visible and hidden neurons in an analytic fashion. However, such an approach is intractable, since it requires computing the partition function, i.e., evaluating every possible configuration of the system. Therefore, Hinton proposed Contrastive Divergence (CD) [6], an alternative method that estimates the conditional probabilities of the visible and hidden neurons using Gibbs sampling over a Markov Chain Monte Carlo (MCMC) procedure. Hence, the probabilities of both input and hidden units are computed as follows:

$$p(h_j = 1 \mid \mathbf{v}) = \sigma\left(c_j + \sum_{i=1}^{m} w_{ij} v_i\right), \quad (2)$$

and

$$p(v_i = 1 \mid \mathbf{h}) = \sigma\left(b_i + \sum_{j=1}^{n} w_{ij} h_j\right), \quad (3)$$

where $\sigma$ stands for the logistic-sigmoid function.

2.2 Deep Belief Networks

Conceptually, Deep Belief Networks are graph-based generative models composed of a visible layer and a set of hidden layers connected by weight matrices, with no connections between neurons in the same layer. In practice, the model comprises a set of stacked RBMs whose hidden layers greedily feed the visible layer of the subsequent RBM. Finally, a softmax layer is attached at the top of the model, and the weights are fine-tuned using backpropagation for classification purposes. Figure 1 depicts the model. Notice that $\mathbf{W}^{(l)}$, $l \in [1, L]$, stands for the weight matrix at layer $l$, where $L$ denotes the number of hidden layers. Moreover, $\mathbf{v}$ stands for the visible layer, while $\mathbf{h}^{(l)}$ represents the $l$-th hidden layer.
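In a DBN, each RBM in the stack is trained independently with CD. The following is a minimal sketch of CD-1 in plain NumPy implementing Eqs. 1-3 with a single Gibbs step, under our assumptions (binary inputs, mini-batches as rows); all names are ours:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class RBM:
    """Bernoulli-Bernoulli RBM trained with one step of Gibbs sampling (CD-1)."""
    def __init__(self, m, n, lr=0.1):
        self.W = rng.normal(0.0, 0.01, size=(m, n))  # w_ij connecting v_i and h_j
        self.b = np.zeros(m)                         # visible bias (Eq. 1)
        self.c = np.zeros(n)                         # hidden bias (Eq. 1)
        self.lr = lr

    def p_h(self, v):                                # Eq. 2
        return sigmoid(self.c + v @ self.W)

    def p_v(self, h):                                # Eq. 3
        return sigmoid(self.b + h @ self.W.T)

    def cd1(self, v0):
        """One CD-1 update for a mini-batch v0 of shape (batch, m)."""
        ph0 = self.p_h(v0)
        h0 = (rng.random(ph0.shape) < ph0).astype(float)  # sample hidden states
        pv1 = self.p_v(h0)                                # reconstruction
        ph1 = self.p_h(pv1)
        batch = v0.shape[0]
        self.W += self.lr * (v0.T @ ph0 - pv1.T @ ph1) / batch
        self.b += self.lr * (v0 - pv1).mean(axis=0)
        self.c += self.lr * (ph0 - ph1).mean(axis=0)
```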
Fig. 1. DBN architecture with two hidden layers for classification purposes.

3 Residual Deep Belief Network

In this section, we present the proposed approach concerning the layer-by-layer residual reinforcement in Deep Belief Networks, hereinafter called Res-DBN. Since such a network is a hybrid model between sigmoid belief networks and binary RBMs [5], it is important to highlight some "tricks" to make use of the information provided layer-by-layer.

As aforementioned, DBNs can be viewed as hybrid networks that model the data's prior distribution in a layer-by-layer fashion to improve the lower bound of the model distribution. Such a fact motivated us to make use of the information learned in each stacked RBM for reinforcement, since the greedy layer-wise pre-training uses the activation of the latent binary variables as the input to the next visible layer. Generally speaking, such activation is defined by Eq. 2, and its pre-activation vector $a^{(l)}$ is given as follows:

$$a_j^{(l)} = c_j^{(l)} + \sum_{i=1}^{m} w_{ij}^{(l)} x_i^{(l-1)}, \quad (4)$$

where $c_j^{(l)}$ stands for the bias of hidden layer $l$, $m$ is the number of units in the previous layer, $w_{ij}^{(l)}$ represents the weight matrix of layer $l$, and $x_i^{(l-1)}$ stands for the input data from layer $l-1$, with $x_i^{(0)} = v_i$.

Therefore, it is possible to use the "reinforcement pre-activation" vector, denoted as $\hat{a}^{(l)}$, from layer $l$, $\forall l > 1$. Since the standard RBM post-activation output (provided by Eq. 2) lies in the $[0, 1]$ interval, it is necessary to limit the reinforcement term of the proposed approach as follows:

$$\hat{a}^{(l)} = \frac{\delta(a^{(l-1)})}{\max_j\{\delta(a_j^{(l-1)})\}}, \quad (5)$$

where $\delta$ stands for the Rectifier function, i.e., $\delta(z) = \max(0, z)$, while the $\max$ operator returns the maximum value of the $\delta$ output vector for normalization purposes. Then, the new input data with the information aggregation for layer $l$ is defined by adding the values obtained from Eq. 5 to the post-activation, i.e., $\sigma(a^{(l-1)})$, as follows:

$$x_j^{(l-1)} = \sigma(a_j^{(l-1)}) + \hat{a}_j^{(l)}, \quad (6)$$

where $x^{(l-1)}$ stands for the new input data to layer $l$, $\forall l > 1$, and its normalized, vectorized form can be obtained as follows:

$$x^{(l-1)} = \frac{x^{(l-1)}}{\max_j\{x_j^{(l-1)}\}}. \quad (7)$$

It is important to highlight that, in Eq. 5, we only use the positive pre-activations to retrieve and propagate the signal that is meaningful for neuron excitation, i.e., values greater than 0, which generate probabilities greater than 50% after applying the sigmoid activation.
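A minimal sketch of Eqs. 5-7 for a single sample follows, under our assumptions (plain NumPy; the small epsilon guarding against division by zero is our addition, which the equations leave implicit):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)  # the Rectifier delta(z) = max(0, z)

def reinforced_input(a_prev, eps=1e-12):
    """Given the previous layer's pre-activation a^(l-1), builds the
    reinforced, normalized input x^(l-1) fed to layer l (Eqs. 5-7)."""
    a_hat = relu(a_prev) / (relu(a_prev).max() + eps)  # Eq. 5
    x = sigmoid(a_prev) + a_hat                        # Eq. 6
    return x / (x.max() + eps)                         # Eq. 7
```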
Fig. 2. Res-DBN architecture with three hidden layers.

Figure 2 depicts the Res-DBN architecture, with hidden layers connected by the weight matrices $\mathbf{W}^{(l)}$. The dashed connections stand for the reinforcement approach, with the information aggregation occurring as covered by Eqs. 4 to 7, from a generic hidden layer to the next one ($h^{(1)} \rightarrow h^{(2)}$, for instance).

4 Methodology

In this section, we present details regarding the datasets employed in our experiments, as well as the experimental setup applied in this paper.

4.1 Datasets
Three well-known image datasets were employed throughout the experiments (a loading sketch follows the list):

• MNIST [10]: a set of 28 × 28 binary images of handwritten digits (0-9), i.e., 10 classes. The original version contains a training set with 60,000 images of the digits '0'-'9', as well as a test set with 10,000 images (http://yann.lecun.com/exdb/mnist).

• Fashion-MNIST [20]: a set of 28 × 28 binary images of clothing objects. The original version contains a training set with 60,000 images of 10 distinct objects (t-shirt, trouser, pullover, dress, coat, sandal, shirt, sneaker, bag, and ankle boot), and a test set with 10,000 images (https://github.com/zalandoresearch/fashion-mnist).

• Kuzushiji-MNIST [1]: a set of 28 × 28 binary images of hiragana characters. The original version contains a training set with 60,000 images of 10 previously selected hiragana characters, and a test set with 10,000 images (https://github.com/rois-codh/kmnist).
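Since the RBMs expect binary inputs, the images must be binarized beforehand. A minimal loading sketch, assuming torchvision is available; the 0.5 threshold is our choice for illustration, not a value stated in the text:

```python
from torchvision import datasets, transforms

# Threshold at 0.5 to obtain binary 28x28 inputs for the Bernoulli RBMs.
binarize = transforms.Compose([
    transforms.ToTensor(),                           # scales pixels to [0, 1]
    transforms.Lambda(lambda x: (x > 0.5).float()),  # binarization
])

train = datasets.MNIST("./data", train=True, download=True, transform=binarize)
test = datasets.MNIST("./data", train=False, download=True, transform=binarize)
# datasets.FashionMNIST and datasets.KMNIST expose the same interface.
```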
4.2 Experimental Setup

Concerning the experiments, we employed the concepts mentioned in Section 3, considering two main phases: (i) the DBN pre-training and (ii) the discriminative fine-tuning. Regarding the former, it is important to highlight that the information reinforcement is performed during the greedy layer-wise process, in which the hidden layers ($l = 1, 2, \ldots, L$) receive the positive "residual" information. Such a process takes into account a mini-batch of size 128, a learning rate of 0.1, 50 epochs for the convergence of the bottommost RBM, and 25 epochs for the convergence of the intermediate and top layers. The latter value is half of the initial one, chosen to evaluate Res-DBN's earlier convergence.

Moreover, regarding the classification phase, a softmax layer was attached at the top of the model after the DBN pre-training, and the fine-tuning process was performed for 20 epochs through backpropagation using the well-known Adam [9] optimizer with a learning rate of $10^{-3}$ for all layers. Furthermore, 15 independent executions were performed for each model to provide a statistical analysis. To assess the robustness of the proposed approach, we employed seven different DBN architectures, varying the number of hidden neurons and layers, as denoted in Table 1.

Table 1. Different setups, where i stands for the number of neurons on the input layer.

Model | Res-DBN                   | DBN
(a)   | i:500:500:10              | i:500:500:10
(b)   | i:500:500:500:10          | i:500:500:500:10
(c)   | i:500:500:500:500:10      | i:500:500:500:500:10
(d)   | i:1000:1000:10            | i:1000:1000:10
(e)   | i:1000:1000:1000:10       | i:1000:1000:1000:10
(f)   | i:1000:1000:1000:1000:10  | i:1000:1000:1000:1000:10
(g)   | i:2000:2000:2000:2000:10  | i:2000:2000:2000:2000:10
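For concreteness, the sketch below expresses the Table 1 topologies and the fine-tuning settings above in PyTorch. It is a sketch under our assumptions: in the actual pipeline, the Linear weights would be initialized from the pre-trained (Res-)DBN stack rather than at random.

```python
import torch.nn as nn
import torch.optim as optim

# Hidden-layer widths from Table 1; the 10-way softmax head is appended below.
CONFIGS = {
    "a": [500, 500],
    "b": [500, 500, 500],
    "c": [500, 500, 500, 500],
    "d": [1000, 1000],
    "e": [1000, 1000, 1000],
    "f": [1000, 1000, 1000, 1000],
    "g": [2000, 2000, 2000, 2000],
}

def build_classifier(n_inputs, hidden, n_classes=10):
    """MLP mirroring a DBN topology for the discriminative fine-tuning phase."""
    sizes = [n_inputs] + hidden
    layers = []
    for m, n in zip(sizes[:-1], sizes[1:]):
        layers += [nn.Linear(m, n), nn.Sigmoid()]   # mirrors the RBM stack
    layers.append(nn.Linear(sizes[-1], n_classes))  # softmax head (via the loss)
    return nn.Sequential(*layers)

model = build_classifier(28 * 28, CONFIGS["e"])
optimizer = optim.Adam(model.parameters(), lr=1e-3)  # 20 epochs of fine-tuning
criterion = nn.CrossEntropyLoss()                    # applies log-softmax internally
```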
5 Experimental Results

In this section, we present the experimental results concerning the seven distinct DBN architectures, i.e., (a)-(g), over the aforementioned datasets. Table 2 provides the average accuracies and standard deviations for each configuration over 15 trials, where the proposed approach is compared against the standard DBN formulation on each dataset for each configuration. Further, results in bold represent the best values according to the Wilcoxon signed-rank test [19] with significance $p \leq 0.05$ concerning each model configuration. On the other hand, underlined values represent the best results over all models regarding each dataset without a statistical difference, i.e., results similar to the best one achieved.
Table 2. Experimental results on different datasets: mean accuracy (%) ± standard deviation over 15 runs for Res-DBN and the standard DBN on MNIST, Fashion MNIST, and Kuzushiji MNIST, for configurations (a)-(g).
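To illustrate the statistical protocol, a short sketch using SciPy's implementation of the Wilcoxon signed-rank test follows; the accuracy values below are placeholders of our own, not the ones from Table 2:

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)
# Placeholder per-run accuracies for one configuration (15 trials each).
res_dbn = 97.0 + 0.1 * rng.standard_normal(15)
dbn = 96.8 + 0.1 * rng.standard_normal(15)

stat, p = wilcoxon(res_dbn, dbn)  # paired, non-parametric test
print(f"p-value = {p:.4f}; significant at 0.05: {p <= 0.05}")
```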
Regarding the original MNIST dataset, the preeminence of the proposed model over the standard DBN is evident, since the best results were obtained exclusively by Res-DBN and, among these, five out of the seven scenarios presented statistical significance. Such a behavior is stressed in the Kuzushiji MNIST dataset, where the best results were obtained solely by Res-DBN over every possible configuration. The similarity of results between these datasets is somewhat expected, since both are composed of handwritten digits or characters.

The Fashion MNIST dataset presents the single experimental scenario, i.e., model (a), in which the proposed model was outperformed by the traditional DBN, although by a small margin. In all other cases, Res-DBN presented results superior or equal to the traditional formulation, which favors the use of Res-DBN over standard DBNs.

Finally, one can observe that the best overall results were obtained using the more complex models, i.e., those with a higher number of layers and neurons, as denoted by the underlined values. Additionally, the proposed model outperformed, or is at least equivalent to, the standard DBN in virtually all scenarios, except one concerning the Fashion-MNIST dataset.
Figures 3, 4, and 5 depict the models' learning curves over the test sets of MNIST, Fashion MNIST, and Kuzushiji MNIST, respectively. In Figure 3, one can observe that Res-DBN(e) converged faster than the remaining approaches, obtaining reasonably good results after seven iterations. At the end of the process, Res-DBN(f) and (g) boosted and outperformed Res-DBN(e), as well as all the standard DBN approaches, depicted as dashed lines.
Fig. 3. Accuracy on MNIST test set.
Regarding Fashion MNIST, it can be observed in Figure 4 that Res-DBN(e) was once again the fastest technique to converge, obtaining acceptable results after five iterations. However, after the fifth iteration, all models seem to overfit, explaining the performance decrease observed over the testing samples. Finally, after 14 iterations, the results start increasing once again, with Res-DBN(g) being the most accurate technique after 20 iterations.

Finally, the Kuzushiji learning curves, depicted in Figure 5, display a behavior similar to that observed on the MNIST dataset. Moreover, they show that Res-DBN provided better results than its traditional variant in all cases right from the beginning of the training, in some cases by a margin greater than 2%, showing a promising improvement.
Fig. 4. Accuracy on Fashion MNIST test set.

Fig. 5. Accuracy on Kuzushiji MNIST test set.

6 Conclusions

In this paper, we proposed a novel approach based on reinforcing the DBN's layer-by-layer feature extraction in a residual fashion, the so-called Residual Deep Belief Network. Experiments conducted over three public datasets confirm the sturdiness of the model. Moreover, it is important to highlight the faster convergence achieved by Res-DBN compared to the DBN, since half of the epochs were employed for pre-training the hidden layers, and the results still outperformed the latter model.

Regarding future work, we intend to investigate the model in the video domain, applying it to classification and recognition tasks, as well as to propose a similar approach for Deep Boltzmann Machines.
Acknowledgments
The authors are grateful to FAPESP grants
References
1. Clanuwat, T., Bober-Irizar, M., Kitamoto, A., Lamb, A., Yamamoto, K., Ha, D.: Deep learning for classical Japanese literature. arXiv preprint arXiv:1812.01718 (2018)
2. Feichtenhofer, C., Pinz, A., Wildes, R.: Spatiotemporal residual networks for video action recognition. In: Advances in Neural Information Processing Systems. pp. 3468–3476 (2016)
3. Hassan, M.M., Alam, M.G.R., Uddin, M.Z., Huda, S., Almogren, A., Fortino, G.: Human emotion recognition using deep belief network architecture. Information Fusion, 10–18 (2019)
4. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: IEEE CVPR. pp. 770–778 (2016)
5. Hinton, G.E., Osindero, S., Teh, Y.W.: A fast learning algorithm for deep belief nets. Neural Computation (7), 1527–1554 (2006)
6. Hinton, G.: Training products of experts by minimizing contrastive divergence. Neural Computation (8), 1771–1800 (2002)
7. Jung, H., Choi, M.K., Jung, J., Lee, J.H., Kwon, S., Young Jung, W.: ResNet-based vehicle classification and localization in traffic surveillance systems. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. pp. 61–67 (2017)
8. Khojasteh, P., Passos, L.A., Carvalho, T., Rezende, E., Aliahmad, B., Papa, J.P., Kumar, D.K.: Exudate detection in fundus images using deeply-learnable features. Computers in Biology and Medicine, 62–69 (2019)
9. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
10. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proceedings of the IEEE (11), 2278–2324 (1998)
11. Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object detection. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 2980–2988 (2017)
12. Passos, L.A., Pereira, C.R., Rezende, E.R., Carvalho, T.J., Weber, S.A., Hook, C., Papa, J.P.: Parkinson disease identification using residual networks and optimum-path forest. In: 2018 IEEE 12th International Symposium on Applied Computational Intelligence and Informatics (SACI). pp. 000325–000330. IEEE (2018)
13. Passos, L.A., Papa, J.P.: A metaheuristic-driven approach to fine-tune deep Boltzmann machines. Applied Soft Computing p. 105717 (2019)
14. Pereira, C.R., Passos, L.A., Lopes, R.R., Weber, S.A., Hook, C., Papa, J.P.: Parkinson's disease identification using restricted Boltzmann machines. In: International Conference on Computer Analysis of Images and Patterns. pp. 70–80. Springer (2017)
15. Salakhutdinov, R., Hinton, G.E.: Deep Boltzmann machines. In: AISTATS. vol. 1, p. 3 (2009)
16. Santos, D.F., Pires, R.G., Colombo, D., Papa, J.P.: Video segmentation learning using cascade residual convolutional neural network. In: 2019 32nd SIBGRAPI Conference on Graphics, Patterns and Images (SIBGRAPI). pp. 1–7. IEEE (2019)
17. Szegedy, C., Ioffe, S., Vanhoucke, V., Alemi, A.A.: Inception-v4, Inception-ResNet and the impact of residual connections on learning. In: Thirty-First AAAI Conference on Artificial Intelligence (2017)
18. Wang, J., Wang, K., Wang, Y., Huang, Z., Xue, R.: Deep Boltzmann machine based condition prediction for smart manufacturing. Journal of Ambient Intelligence and Humanized Computing (3), 851–861 (2019)
19. Wilcoxon, F.: Individual comparisons by ranking methods. Biometrics Bulletin 1
20. Xiao, H., Rasul, K., Vollgraf, R.: Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747 (2017)