Cosine Normalization: Using Cosine Similarity Instead of Dot Product in Neural Networks
Luo Chunjie
Zhan Jianfeng, Wang Lei, Yang Qiang

Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Beijing Academy of Frontier Science and Technology. Correspondence to: Luo Chunjie <[email protected]>, Zhan Jianfeng <[email protected]>, Wang Lei <[email protected]>, Yang Qiang <[email protected]>.

Abstract
Traditionally, multi-layer neural networks use the dot product between the output vector of the previous layer and the incoming weight vector as the input to the activation function. The result of the dot product is unbounded, which increases the risk of large variance. Large variance of a neuron makes the model sensitive to changes in the input distribution, resulting in poor generalization, and aggravates the internal covariate shift that slows down training. To bound the dot product and decrease the variance, we propose to use cosine similarity or centered cosine similarity (the Pearson Correlation Coefficient) instead of the dot product in neural networks, which we call cosine normalization. We compare cosine normalization with batch, weight and layer normalization in fully-connected neural networks as well as convolutional networks on the MNIST, 20NEWS GROUP, CIFAR-10/100 and SVHN data sets. Experiments show that cosine normalization achieves better performance than the other normalization techniques.

Deep neural networks have achieved great success in recent years in many areas, e.g. image recognition (Krizhevsky et al., 2012), speech processing (Hinton et al., 2012), natural language processing (Mikolov et al., 2013), and the game of Go (Silver et al., 2016). Training deep neural networks is a nontrivial task. Gradient descent is commonly used to train neural networks; however, due to the gradient vanishing problem (Hochreiter et al., 2001), it works badly when applied directly to deep networks.

Many approaches have been adopted to overcome the difficulty of training deep networks, for example pre-training (Hinton et al., 2006; Hinton & Salakhutdinov, 2006), special network structures (Simonyan & Zisserman, 2014; Szegedy et al., 2015; He et al., 2016), ReLU activation (Nair & Hinton, 2010; Maas et al., 2013), noise injection (Wan et al., 2013; Srivastava et al., 2014), and normalization (Ioffe & Szegedy, 2015; Salimans & Kingma, 2016; Ba et al., 2016; Arpit et al., 2016; Ren et al., 2016).

In previous work, multi-layer neural networks use the dot product (also called inner product) between the output vector of the previous layer and the incoming weight vector as the input to the activation function:

$net = \vec{w} \cdot \vec{x}$  (1)

where $net$ is the input to the activation function (the pre-activation), $\vec{w}$ is the incoming weight vector, $\vec{x}$ is the input vector (which is also the output vector of the previous layer), and $(\cdot)$ indicates the dot product. Equation 1 can be rewritten as Equation 2, where $\cos\theta$ is the cosine of the angle between $\vec{w}$ and $\vec{x}$, and $|\cdot|$ is the Euclidean norm of a vector:

$net = |\vec{w}|\,|\vec{x}|\cos\theta$  (2)

The result of the dot product is unbounded, which increases the risk of large variance. Large variance of a neuron makes the model sensitive to changes in the input distribution, resulting in poor generalization. Large variance can also aggravate the internal covariate shift, which slows down training (Ioffe & Szegedy, 2015). Using small weights can alleviate this problem. Weight decay (L2-norm) (Krogh & Hertz, 1991) and max normalization (max-norm) (Srebro & Shraibman, 2005; Srivastava et al., 2014) are methods that decrease the weights.
Batch normalization (Ioffe & Szegedy, 2015) uses statistics calculated from mini-batch training examples to normalize the result of the dot product, while layer normalization (Ba et al., 2016) uses statistics from the same layer on a single training case. The variance can be constrained within a certain range using batch or layer normalization. Weight normalization (Salimans & Kingma, 2016) re-parameterizes the weight vector by dividing it by its norm, and thus partially bounds the result of the dot product.

To thoroughly bound the dot product, a straightforward idea is to use cosine similarity. Similarity (or distance) based methods are widely used in data mining and machine learning (Tan et al., 2006). In particular, cosine similarity is among the most commonly used measures in high-dimensional spaces; for example, in information retrieval and text mining, cosine similarity gives a useful measure of how similar two documents are (Singhal, 2001).

In this paper, we combine cosine similarity with neural networks. We use cosine similarity instead of the dot product when computing the pre-activation. That can be seen as a normalization procedure, which we call cosine normalization. Equation 3 shows cosine normalization (a small numeric check follows at the end of this introduction):

$net_{norm} = \cos\theta = \frac{\vec{w} \cdot \vec{x}}{|\vec{w}|\,|\vec{x}|}$  (3)

As an extension, we can use the centered cosine similarity, the Pearson Correlation Coefficient (PCC), instead of the dot product. By ignoring the magnitudes of $\vec{w}$ and $\vec{x}$, the input to the activation function is bounded between -1 and 1, so a higher learning rate can be used for training without the risk of large variance. Moreover, a network with cosine normalization can be trained by both batch gradient descent and stochastic gradient descent, since it does not depend on any statistics over batch or mini-batch examples.

We compare our cosine normalization with batch, weight and layer normalization in fully-connected neural networks on the MNIST and 20NEWS GROUP data sets. Additionally, convolutional networks with the different normalization techniques are evaluated on the CIFAR-10/100 and SVHN data sets. A brief summary of the results:

• Cosine normalization achieves lower test error than batch, weight and layer normalization.
• Centered cosine normalization (Pearson Correlation Coefficient) further reduces the test error.
• Cosine normalization is more stable than the other normalization techniques, especially batch normalization.
• Cosine normalization accelerates the training of neural networks as well as the other normalizations do.
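As a quick numeric check of Equation 2 and of the boundedness claim behind Equation 3, the following PyTorch snippet (ours, purely illustrative) verifies that the dot product factors into $|\vec{w}|\,|\vec{x}|\cos\theta$, and that scaling the input scales the dot product but leaves the cosine unchanged:

```python
import torch

torch.manual_seed(0)
w, x = torch.randn(5), torch.randn(5)

# Equation 2: the dot product equals |w| |x| cos(theta).
cos = torch.dot(w, x) / (w.norm() * x.norm())
assert torch.isclose(torch.dot(w, x), w.norm() * x.norm() * cos)

# The dot product is unbounded: it grows with the input magnitude.
print(torch.dot(w, 10 * x))  # 10x the original dot product
# The cosine stays in [-1, 1], unchanged by the scaling.
print(torch.dot(w, 10 * x) / (w.norm() * (10 * x).norm()))  # equals cos
```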
1. Background and Motivation
Large variance of a neuron in a neural network makes the model sensitive to changes in the input distribution, and thus results in poor generalization. Moreover, variance can be amplified as information moves forward through the layers, especially in deep networks. Large variance can also aggravate the internal covariate shift, which refers to the change in the distribution of each layer's inputs during training as the parameters of the previous layers change (Ioffe & Szegedy, 2015). Internal covariate shift slows down training because the layers need to continuously adapt to the new distribution. Traditionally, neural networks use the dot product to compute the pre-activation of a neuron. The result of the dot product is unbounded; that is to say, the result can be any value in the whole real space, which increases the risk of large variance.

Using small weights can alleviate this problem, since the pre-activation $net$ in Equation 2 decreases when $|\vec{w}|$ is small. Weight decay (Krogh & Hertz, 1991) and max normalization (Srebro & Shraibman, 2005; Srivastava et al., 2014) are methods that try to keep the weights small. Weight decay adds an extra term to the cost function that penalizes the squared value of each weight separately. Max normalization puts a constraint on the maximum squared length of the incoming weight vector of each neuron: if an update violates this constraint, max normalization scales the vector of incoming weights down to the allowed length (a code sketch is given at the end of this section). The objective (or the direction toward the objective) of the original optimization problem is changed when using weight decay (or max normalization). Moreover, they introduce additional hyper-parameters that must be carefully preset.

Batch normalization (Ioffe & Szegedy, 2015) uses statistics calculated from mini-batch training examples to normalize the pre-activation. The normalized value is re-scaled and re-shifted using additional parameters. Since batch normalization uses statistics over mini-batch examples, its effect depends on the mini-batch size. To overcome this problem, normalization propagation (Arpit et al., 2016) uses a data-independent parametric estimate of the mean and standard deviation, while layer normalization (Ba et al., 2016) computes the mean and standard deviation over the same layer on a single training case. Weight normalization (Salimans & Kingma, 2016) re-parameterizes the incoming weight vector by dividing it by its norm. It decouples the length of the weight vector from its direction, and thus partially bounds the result of the dot product, but it does not consider the length of the input vector. These methods all introduce additional parameters to be learned, which makes the model more complex.

An important source of inspiration for our work is cosine similarity, which is widely used in data mining and machine learning (Singhal, 2001; Tan et al., 2006). To thoroughly bound the dot product, a straightforward idea is to use cosine similarity. We combine cosine similarity with neural networks; the details are described in the next section.
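For concreteness, here is a minimal PyTorch sketch (ours, not from the cited papers) of the max-norm rescaling described above; the threshold c is a hypothetical value for the hyper-parameter that, as noted, must be preset carefully:

```python
import torch

@torch.no_grad()
def apply_max_norm_(weight: torch.Tensor, c: float = 3.0) -> None:
    """Rescale each unit's incoming weight vector to length at most c.

    weight: (out_features, in_features); each row is one neuron's
    incoming weight vector.
    """
    norms = weight.norm(dim=1, keepdim=True)
    scale = (c / norms).clamp(max=1.0)  # rows already within the limit keep scale 1
    weight.mul_(scale)
```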
2. Cosine Normalization
To decrease the variance of a neuron, we propose a new method, called cosine normalization, which simply uses cosine similarity instead of the dot product in neural networks.
A simple multi-layer neural network is shown in Figure 1. Using cosine normalization, the output of a hidden unit is computed by Equation 4:

$o = f(net_{norm}) = f(\cos\theta) = f\left(\frac{\vec{w} \cdot \vec{x}}{|\vec{w}|\,|\vec{x}|}\right)$  (4)

where $net_{norm}$ is the normalized pre-activation, $\vec{w}$ is the incoming weight vector, $\vec{x}$ is the input vector, $(\cdot)$ indicates the dot product, and $f$ is a nonlinear activation function. Cosine normalization bounds the pre-activation between -1 and 1. The result can be even smaller when the dimension is high. As a result, the variance can be controlled within a very narrow range.

[Figure 1. A simple neural network. The output of a hidden unit is the nonlinear transform of the dot product between the input vector and the incoming weight vector, i.e. $f(\vec{w} \cdot \vec{x})$. With cosine normalization, the output of a hidden unit is computed by $f(\frac{\vec{w} \cdot \vec{x}}{|\vec{w}|\,|\vec{x}|})$.]

Empirically, we find that with the ReLU activation function, $\max(0, net_{norm})$, the result of normalization needs no re-scaling and re-shifting; therefore, there is no additional parameter to be learned or hyper-parameter to be preset. However, when using other activation functions, such as sigmoid, tanh, or softmax, the result of normalization should be re-valued to fully utilize the nonlinear regime of those functions.

When implementing cosine normalization in fully-connected networks, we just need to divide by the norm of the incoming weight vector as well as the norm of the input vector. The input vector is the output vector of the previous layer; that is to say, the hidden units in the same layer share the same norm of the input vector. In convolutional networks, by contrast, the input vector is constrained to a receptive field, and different receptive fields have different norms.

One thing to notice is that cosine similarity can only measure the similarity between two non-zero vectors, since the denominator cannot be zero. A non-zero bias can be added to avoid the zero-vector situation. Let $\vec{w} = [w_1, w_2, ..., w_i]$ and $\vec{x} = [x_1, x_2, ..., x_i]$. After adding the bias, $\vec{w} = [w_0, w_1, w_2, ..., w_i]$ and $\vec{x} = [x_0, x_1, x_2, ..., x_i]$, where $w_0$ and $x_0$ should be non-zero. A minimal code sketch of this forward computation follows.
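The sketch below (ours, not the authors' released code) implements a fully-connected layer with the cosine-normalized pre-activation of Equation 4 in PyTorch; the eps guard standing in for the non-zero-vector requirement is our assumption:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CosineLinear(nn.Module):
    """Fully-connected layer with cosine normalization (Equation 4):
    net_norm = (w . x) / (|w| |x|) for each hidden unit."""

    def __init__(self, in_features: int, out_features: int, eps: float = 1e-8):
        super().__init__()
        self.weight = nn.Parameter(0.1 * torch.randn(out_features, in_features))
        self.eps = eps  # numerical guard against the zero-vector case

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Unit-normalize each incoming weight vector (one row per hidden unit)
        # and each input vector; the ordinary matmul then yields cos(theta),
        # so every pre-activation lies in [-1, 1].
        w = F.normalize(self.weight, dim=1, eps=self.eps)
        x = F.normalize(x, dim=1, eps=self.eps)
        return x @ w.t()

# Usage with ReLU, which per the text needs no re-scaling or re-shifting:
# h = torch.relu(CosineLinear(784, 1000)(torch.randn(32, 784)))
```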
We can use gradient descent (back-propagation) to train a neural network with cosine normalization. Compared to batch normalization, cosine normalization does not depend on any statistics over batch or mini-batch examples, so the model can be trained by both batch gradient descent and stochastic gradient descent. Meanwhile, cosine normalization performs the same computation in forward propagation at training and inference time. The procedure of back-propagation in a network with cosine normalization is the same as in an ordinary neural network, except for the derivative of $net_{norm}$ with respect to $w$ or $x$.

To show the derivatives conveniently, the dot product can be rewritten as Equation 5, where $w_i$ indicates the $i$-th dimension of $\vec{w}$ and $x_i$ the $i$-th dimension of $\vec{x}$:

$net = \sum_i w_i x_i$  (5)

The derivative of $net$ with respect to $w_i$ or $x_i$ in an ordinary neural network is therefore given by Equation 6 or Equation 7:

$\frac{\partial net}{\partial w_i} = x_i$  (6)

$\frac{\partial net}{\partial x_i} = w_i$  (7)

Correspondingly, cosine normalization can be rewritten as Equation 8:

$net_{norm} = \cos\theta = \frac{\sum_i w_i x_i}{\sqrt{\sum_i w_i^2}\,\sqrt{\sum_i x_i^2}}$  (8)

Then the derivative of $net_{norm}$ with respect to $w_i$ or $x_i$ can be calculated by Equation 9 or Equation 10, where the summation index $j$ runs over all dimensions:

$\frac{\partial net_{norm}}{\partial w_i} = \frac{x_i}{\sqrt{\sum_j w_j^2}\,\sqrt{\sum_j x_j^2}} - \frac{w_i \sum_j w_j x_j}{\left(\sqrt{\sum_j w_j^2}\right)^3 \sqrt{\sum_j x_j^2}}$  (9)

$\frac{\partial net_{norm}}{\partial x_i} = \frac{w_i}{\sqrt{\sum_j w_j^2}\,\sqrt{\sum_j x_j^2}} - \frac{x_i \sum_j w_j x_j}{\sqrt{\sum_j w_j^2}\left(\sqrt{\sum_j x_j^2}\right)^3}$  (10)

Equations 9 and 10 can be written more briefly as Equations 11 and 12:

$\frac{\partial net_{norm}}{\partial w_i} = \frac{x_i}{|\vec{w}|\,|\vec{x}|} - \frac{w_i(\vec{w} \cdot \vec{x})}{|\vec{w}|^3\,|\vec{x}|}$  (11)

$\frac{\partial net_{norm}}{\partial x_i} = \frac{w_i}{|\vec{w}|\,|\vec{x}|} - \frac{x_i(\vec{w} \cdot \vec{x})}{|\vec{w}|\,|\vec{x}|^3}$  (12)

As pointed out in (LeCun et al., 2012), centering the inputs of units can help the training of neural networks. Batch and layer normalization center the data by subtracting the mean of the batch or layer, and mean-only batch normalization can enhance the performance of weight normalization (Salimans & Kingma, 2016). We can use the Pearson Correlation Coefficient (PCC), which is the centered cosine similarity, to extend cosine normalization:

$net_{norm} = \frac{(\vec{w} - \mu_w) \cdot (\vec{x} - \mu_x)}{|\vec{w} - \mu_w|\,|\vec{x} - \mu_x|}$  (13)

where $\mu_w$ is the mean of $\vec{w}$ and $\mu_x$ is the mean of $\vec{x}$. A code sketch of this centered variant follows.
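The following sketch (ours; the eps guard and tensor shapes are our assumptions) implements the centered variant of Equation 13 on a batch of inputs, reusing PyTorch's F.normalize as in the previous sketch:

```python
import torch
import torch.nn.functional as F

def pcc_preactivation(weight: torch.Tensor, x: torch.Tensor,
                      eps: float = 1e-8) -> torch.Tensor:
    """Centered cosine similarity (Equation 13).

    weight: (out_features, in_features), x: (batch, in_features).
    Returns the (batch, out_features) pre-activations, bounded in [-1, 1].
    """
    w = weight - weight.mean(dim=1, keepdim=True)  # subtract mu_w per unit
    x = x - x.mean(dim=1, keepdim=True)            # subtract mu_x per example
    w = F.normalize(w, dim=1, eps=eps)             # divide by |w - mu_w|
    x = F.normalize(x, dim=1, eps=eps)             # divide by |x - mu_x|
    return x @ w.t()
```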
3. Discussions
Weight normalization (Salimans & Kingma, 2016) re-parameterizes the weights using new parameters:

$\vec{w}_{new} = \frac{g}{|\vec{w}|}\vec{w}$  (14)

Then the output of a hidden unit is computed as:

$o = f(net_{norm}) = f(\vec{w}_{new} \cdot \vec{x}) = f\left(\frac{g}{|\vec{w}|}\vec{w} \cdot \vec{x}\right)$  (15)

where $g$ is a re-scaling parameter that can be learned by gradient descent. Ignoring the re-scaling parameter $g$, weight normalization can be seen as a partial cosine normalization which only constrains the weights. By additionally dividing by the magnitude of $\vec{x}$, cosine normalization bounds the pre-activation within a narrower range, and thus yields lower variance of neurons.

Moreover, cosine normalization makes the model more robust to different input magnitudes. For example, in the forward procedure of a fully-connected network, we have $\vec{x}_{l+1} = f(\vec{w} \cdot \vec{x}_l)$. If we scale $\vec{x}_l$ by a factor $\lambda$, then $\vec{x}_{l+1} = f(\vec{w} \cdot (\lambda\vec{x}_l))$. When the activation function $f$ is ReLU, we have $\vec{x}_{l+1} = \lambda f(\vec{w} \cdot \vec{x}_l)$, so $\lambda$ is transmitted linearly to the last layer. When the last layer is a softmax, $\exp(\vec{x}) / \sum \exp(\vec{x})$, the output distribution becomes steeper due to the nonlinearity of the softmax. For example, if the input vector to the softmax is [1, 2], the output distribution is [0.2689, 0.7311]; with $\lambda = 10$, after the linear transmission the input vector becomes [10, 20] and the output distribution becomes approximately [0, 1] (verified in the short check below). Suppose we want to recognize a handwritten digit: scaling the whole digit by a factor does not add any valid information; in other words, the output distribution should not change. With cosine normalization, the output distribution is stable when the input magnitude varies, since it depends only on the angle between the input and the weight.
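The softmax numbers above are easy to reproduce; this two-line check (ours) confirms them:

```python
import torch

v = torch.tensor([1.0, 2.0])
print(torch.softmax(v, dim=0))       # tensor([0.2689, 0.7311])
print(torch.softmax(10 * v, dim=0))  # ~tensor([4.54e-05, 1.0000]), i.e. about [0, 1]
```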
In the backward procedure of weight normalization, the derivative of $net_{norm}$ with respect to $w_i$ or $x_i$ can be calculated by Equation 16 or Equation 17:

$\frac{\partial net_{norm}}{\partial w_i} = \frac{x_i}{|\vec{w}|} - \frac{w_i(\vec{w} \cdot \vec{x})}{|\vec{w}|^3}$  (16)

$\frac{\partial net_{norm}}{\partial x_i} = \frac{w_i}{|\vec{w}|}$  (17)

After scaling the input by $\lambda$, the derivative of $net_{norm}$ with respect to $w_i$ becomes Equation 18. Comparing Equation 16 with Equation 18, we can see that scaling the input also scales the gradient in weight normalization, while in cosine normalization, as shown in Equation 11, the scaling factor $\lambda$ is offset by the $|\lambda\vec{x}|$ in the denominator.

$\frac{\partial net_{norm}}{\partial w_i} = \frac{\lambda x_i}{|\vec{w}|} - \frac{w_i(\vec{w} \cdot \lambda\vec{x})}{|\vec{w}|^3} = \lambda\left(\frac{x_i}{|\vec{w}|} - \frac{w_i(\vec{w} \cdot \vec{x})}{|\vec{w}|^3}\right)$  (18)

Layer normalization (Ba et al., 2016) uses Equation 19 to normalize the pre-activation, followed by re-scaling and re-shifting the normalized value (Equation 20):

$net_{norm} = \frac{net - \mu}{\sigma} = \frac{\vec{w} \cdot \vec{x} - \mu}{\sigma}$  (19)

$o = f(\gamma\, net_{norm} + \beta)$  (20)

The mean $\mu$ and standard deviation $\sigma$ are computed over a layer on a single training case. $\gamma$ is a re-scaling parameter and $\beta$ a re-shifting parameter, both learned during training.

Because $|\vec{x} - \mu_x| = \sqrt{\sum_i (x_i - \mu_x)^2}$ and $\sigma_x = \sqrt{\frac{1}{n}\sum_i (x_i - \mu_x)^2}$, where $n$ is a constant equal to the dimension of $\vec{x}$, the centered cosine normalization (Pearson Correlation Coefficient) can be rewritten as:

$net_{norm} = \frac{(\vec{w} - \mu_w) \cdot (\vec{x} - \mu_x)}{n\,\sigma_w\,\sigma_x}$  (21)

Ignoring the constraining of the weights, layer normalization is similar to the Pearson Correlation Coefficient in how it constrains $\vec{x}$ in fully-connected networks. However, there are three differences between the Pearson Correlation Coefficient and layer normalization: 1) The Pearson Correlation Coefficient constrains $\vec{w}$ as well as $\vec{x}$, while layer normalization constrains only $\vec{x}$; thus the Pearson Correlation Coefficient is robust to the scaling or shifting of both weight and input. 2) Layer normalization computes the mean and standard deviation after the dot product and before the activation, while the Pearson Correlation Coefficient computes them before the dot product, i.e. after the previous layer's activation. 3) In convolutional networks, the Pearson Correlation Coefficient calculates the mean and standard deviation over receptive fields, while layer normalization calculates them over the whole layer. That is to say, different receptive fields have different means and standard deviations under the Pearson Correlation Coefficient, while a whole layer shares the same mean and standard deviation under layer normalization. As pointed out in (Ba et al., 2016), layer normalization works well when all the hidden units in a layer make similar contributions, but this assumption no longer holds for convolutional networks. The Pearson Correlation Coefficient only needs the assumption of similar contributions within a receptive field rather than across the whole layer, which is more reasonable for convolutional networks (a sketch of per-receptive-field normalization is given at the end of this section).

In machine learning and data mining, there are many metrics to measure the similarity or distance between samples. Among them, cosine similarity and the centered cosine similarity (Pearson Correlation Coefficient) are heavily used in many fields, e.g. K-nearest neighbors for classification, K-means for clustering, information retrieval, and item- or user-based recommendation. There are also neural networks that use similarity metrics as the output of neurons, e.g. Radial Basis Function networks (RBF) (Moody & Darken, 1989) and Self-Organizing Maps (SOM) (Kohonen, 1982). These networks are not trained by back-propagation, and it is hard to build end-to-end deep networks using RBF or SOM. Lin et al. (2013) argue that the level of abstraction is low with the dot product (a generalized linear model), and thus use a multi-layer perceptron (network in network) to learn the convolution filters in convolutional networks. Since the dot product is not a decent metric, we may directly try other metrics. To the best of our knowledge, this is the first time cosine similarity or the Pearson Correlation Coefficient has been used as the basic metric to build an end-to-end deep network trained by back-propagation.
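Returning to the receptive-field point above, here is a minimal PyTorch sketch (ours; the kernel size, padding and eps guard are our assumptions) of cosine normalization in a convolutional layer, where each filter and each receptive-field patch are normalized by their own norms:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CosineConv2d(nn.Module):
    """Convolution with cosine-normalized pre-activations: each output value
    is the cosine between a unit-normalized filter and the corresponding
    receptive-field patch, normalized by that patch's own norm."""

    def __init__(self, in_ch: int, out_ch: int, k: int = 3, eps: float = 1e-8):
        super().__init__()
        self.weight = nn.Parameter(0.1 * torch.randn(out_ch, in_ch, k, k))
        self.eps = eps
        self.pad = k // 2  # 'same' padding for odd kernel sizes

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Unit-normalize each filter over its (in_ch * k * k) elements.
        w = F.normalize(self.weight.flatten(1), dim=1, eps=self.eps)
        w = w.view_as(self.weight)
        dot = F.conv2d(x, w, padding=self.pad)  # w_hat . x for every patch
        # Per-patch norm of the input: convolve x^2 with a ones kernel,
        # which sums the squares over each receptive field, then take sqrt.
        ones = torch.ones_like(self.weight[:1])
        patch_norm = F.conv2d(x * x, ones, padding=self.pad).clamp_min(self.eps).sqrt()
        return dot / patch_norm  # cos(theta) at every spatial location
```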
4. Experiments
We compare our cosine normalization and centered cosine normalization (PCC) with batch, weight and layer normalization in fully-connected neural networks on the MNIST and 20NEWS GROUP data sets. Additionally, convolutional networks with the different normalizations are evaluated on the CIFAR-10, CIFAR-100 and SVHN data sets. We also tested the networks without any normalization, both fully-connected and convolutional; the results were much worse than with normalization, so we focus only on the comparison of the different normalization techniques.
A fully-connected neural network with two hidden layers is used in the experiments on MNIST and 20NEWS GROUP. Each hidden layer has 1000 units. The last layer is a softmax classification layer with 10 classes for MNIST and 20 classes for 20NEWS GROUP. To evaluate the convolutional networks, a VGG-like architecture with 12 weighted layers, shown in Table 1, is used in the experiments on CIFAR-10/100 and SVHN. Each convolutional layer has 3 × 3 filters.

Table 1. VGG-like architecture.

conv-512
conv-512
conv-512
maxpool
conv-512
conv-512
conv-512
maxpool
conv-512
conv-512
conv-512
maxpool
fully-connected-1000
fully-connected-1000
fully-connected-10/100
soft-max

The ReLU activation function is used in the hidden layers. All weights are randomly initialized from a truncated normal distribution with mean 0 and variance 0.1. Mini-batch gradient descent is used to train the networks; the batch size is 100 in the fully-connected experiments and 128 in the convolutional ones. In our experiments, we use no re-scaling and re-shifting after normalization for the hidden layers; however, for the last layer, we re-scale the normalized values before feeding them to the softmax. We tried different learning rates for all normalization techniques and found that cosine normalization can use a larger learning rate than the others. The learning rates of cosine normalization, centered cosine normalization (PCC), batch normalization, weight normalization and layer normalization are 10, 10, 1, 1 and 1, respectively. An exponential moving average of the parameters with decay 0.9999 is used during inference in the convolutional networks. No regularization, dropout, or dynamic learning rate is used. We train the fully-connected networks for 200 epochs, and the convolutional networks until the performance no longer improves, measured in steps (in this paper, a training epoch refers to a cycle in which all training data are used once for training, while a training step refers to one parameter update).

4.3.1. MNIST

The results for MNIST are shown in Figure 2 and Table 2. Centered cosine normalization achieves the lowest mean test error, 1.39%, followed by cosine normalization with 1.40%.

[Figure 2. The MNIST test error of different normalization techniques vs. the number of training epochs.]

Table 2. The mean and variance of test error over the last 50 epochs in the MNIST experiments.

methods                 mean (%)   variance
centered cosine (PCC)   1.39       0
cosine norm             1.40       0.009
batch norm              1.45       6.740
weight norm             1.65       0.054
layer norm              1.43       0.108

4.3.2. 20NEWS GROUP

The results for 20NEWS GROUP are shown in Figure 3 and Table 3. Centered cosine normalization achieves the lowest test error, 29.37%, and cosine normalization achieves the second lowest, 31.73%. Batch normalization performs poorly in this task of high-dimensional text classification, achieving only a 43.94% test error. Weight normalization (33.55%) and layer normalization (33.29%) achieve close performance. Both batch and weight normalization have a larger variance of test error than the other normalizations.

[Figure 3. The 20NEWS GROUP test error of different normalization techniques vs. the number of training epochs.]

Table 3. The mean and variance of test error over the last 50 epochs in the 20NEWS GROUP experiments.

methods                 mean (%)   variance
centered cosine (PCC)   29.37      0.201
cosine norm             31.73      0.633
batch norm              43.94      1.231
weight norm             33.55      4.775
layer norm              33.29      0.556

4.3.3. CIFAR-10

The results for CIFAR-10 are shown in Figure 4 and Table 4. Centered cosine normalization achieves the lowest test error, 6.39%, and cosine normalization achieves the second lowest, 7.33%.
Layer normalization also achieves good performance in this experiment, better than batch normalization, with a 7.42% test error. Batch normalization achieves a test error of 8.08% and still has a larger variance of test error than the other normalizations. Weight normalization achieves the highest test error, 8.55%.

[Figure 4. The CIFAR-10 test error of different normalization techniques vs. the number of training steps.]

Table 4. The mean and variance of test error over the last 10000 steps in the CIFAR-10 experiments.

methods                 mean (%)   variance
centered cosine (PCC)   6.39       0.076
cosine norm             7.33       0.036
batch norm              8.08       1.052
weight norm             8.55       0.010
layer norm              7.42       0.008

4.3.4. CIFAR-100

The results for CIFAR-100 are shown in Figure 5 and Table 5. Centered cosine normalization achieves the lowest test error, 27.49%. Cosine normalization and batch normalization achieve very close performance, 31.02% and 31.01% respectively, but batch normalization has a larger variance of test error. Weight normalization achieves the highest test error, 37.87%.

[Figure 5. The CIFAR-100 test error of different normalization techniques vs. the number of training steps.]

Table 5. The mean and variance of test error over the last 10000 steps in the CIFAR-100 experiments.

methods                 mean (%)   variance
centered cosine (PCC)   27.49      1.03
cosine norm             31.02      0.43
batch norm              31.01      3.23
weight norm             37.87      1.36
layer norm              31.66      0.22

4.3.5. SVHN

The results for SVHN are shown in Figure 6 and Table 6. Centered cosine normalization again achieves the lowest test error, 2.22%, followed by cosine normalization with 2.34%.

[Figure 6. The SVHN test error of different normalization techniques vs. the number of training steps.]

Table 6. The mean and variance of test error over the last 10000 steps in the SVHN experiments.

methods                 mean (%)   variance
centered cosine (PCC)   2.22       0.01
cosine norm             2.34       0.11
batch norm              2.49       0.14
weight norm             2.63       0.03
layer norm              2.58       0.01
5. Conclusions
In this paper, we propose a new normalization technique, called cosine normalization, which uses cosine similarity or the centered cosine similarity (Pearson Correlation Coefficient) instead of the dot product in neural networks. Cosine normalization bounds the pre-activation of a neuron within a narrower range, and thus yields lower variance of neurons. Moreover, cosine normalization makes the model more robust to different input magnitudes. Networks with cosine normalization can be trained by back-propagation. The method does not depend on any statistics over batch or mini-batch examples, and performs the same computation in forward propagation at training and inference time. In convolutional networks, it normalizes the neurons over their receptive fields rather than over the whole layer or batch. Cosine normalization is evaluated on different types of networks (fully-connected and convolutional) and on different data sets (MNIST, 20NEWS GROUP, CIFAR-10/100, SVHN). Experiments show that cosine normalization and centered cosine normalization significantly reduce the classification test error compared to batch, weight and layer normalization.
References
Abadi, Martín, Agarwal, Ashish, Barham, Paul, Brevdo, Eugene, Chen, Zhifeng, Citro, Craig, Corrado, Greg S., Davis, Andy, Dean, Jeffrey, Devin, Matthieu, et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.

Arpit, Devansh, Zhou, Yingbo, Kota, Bhargava U., and Govindaraju, Venu. Normalization propagation: A parametric technique for removing internal covariate shift in deep networks. arXiv preprint arXiv:1603.01431, 2016.

Ba, Jimmy Lei, Kiros, Jamie Ryan, and Hinton, Geoffrey E. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.

He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, and Sun, Jian. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778, 2016.

Hinton, Geoffrey, Deng, Li, Yu, Dong, Dahl, George E., Mohamed, Abdel-rahman, Jaitly, Navdeep, Senior, Andrew, Vanhoucke, Vincent, Nguyen, Patrick, Sainath, Tara N., et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine, 29(6):82-97, 2012.

Hinton, Geoffrey E. and Salakhutdinov, Ruslan R. Reducing the dimensionality of data with neural networks. Science, 313(5786):504-507, 2006.

Hinton, Geoffrey E., Osindero, Simon, and Teh, Yee-Whye. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527-1554, 2006.

Hochreiter, Sepp, Bengio, Yoshua, Frasconi, Paolo, and Schmidhuber, Jürgen. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies, 2001.

Ioffe, Sergey and Szegedy, Christian. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of The 32nd International Conference on Machine Learning, pp. 448-456, 2015.

Kohonen, Teuvo. Self-organized formation of topologically correct feature maps. Biological Cybernetics, 43(1):59-69, 1982.

Krizhevsky, Alex and Hinton, Geoffrey. Learning multiple layers of features from tiny images. 2009.

Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pp. 1097-1105, 2012.

Krogh, Anders and Hertz, John A. A simple weight decay can improve generalization. In NIPS, volume 4, pp. 950-957, 1991.

LeCun, Yann, Bottou, Léon, Bengio, Yoshua, and Haffner, Patrick. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.

LeCun, Yann A., Bottou, Léon, Orr, Genevieve B., and Müller, Klaus-Robert. Efficient backprop. In Neural Networks: Tricks of the Trade, pp. 9-48. Springer, 2012.

Lin, Min, Chen, Qiang, and Yan, Shuicheng. Network in network. arXiv preprint arXiv:1312.4400, 2013.

Maas, Andrew L., Hannun, Awni Y., and Ng, Andrew Y. Rectifier nonlinearities improve neural network acoustic models. In Proc. ICML, volume 30, 2013.

Mikolov, Tomas, Sutskever, Ilya, Chen, Kai, Corrado, Greg S., and Dean, Jeff. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pp. 3111-3119, 2013.

Moody, John and Darken, Christian J. Fast learning in networks of locally-tuned processing units. Neural Computation, 1(2):281-294, 1989.

Nair, Vinod and Hinton, Geoffrey E. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 807-814, 2010.

Netzer, Yuval, Wang, Tao, Coates, Adam, Bissacco, Alessandro, Wu, Bo, and Ng, Andrew Y. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, volume 2011, pp. 5, 2011.

Ren, Mengye, Liao, Renjie, Urtasun, Raquel, Sinz, Fabian H., and Zemel, Richard S. Normalizing the normalizers: Comparing and extending network normalization schemes. arXiv preprint arXiv:1611.04520, 2016.

Salimans, Tim and Kingma, Diederik P. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In Advances in Neural Information Processing Systems, pp. 901-909, 2016.

Silver, David, Huang, Aja, Maddison, Chris J., Guez, Arthur, Sifre, Laurent, Van Den Driessche, George, Schrittwieser, Julian, Antonoglou, Ioannis, Panneershelvam, Veda, Lanctot, Marc, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484-489, 2016.

Simonyan, Karen and Zisserman, Andrew. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

Singhal, Amit. Modern information retrieval: A brief overview. IEEE Data Eng. Bull., 24(4):35-43, 2001.

Srebro, Nathan and Shraibman, Adi. Rank, trace-norm and max-norm. In International Conference on Computational Learning Theory, pp. 545-560. Springer, 2005.

Srivastava, Nitish, Hinton, Geoffrey E., Krizhevsky, Alex, Sutskever, Ilya, and Salakhutdinov, Ruslan. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929-1958, 2014.

Szegedy, Christian, Liu, Wei, Jia, Yangqing, Sermanet, Pierre, Reed, Scott, Anguelov, Dragomir, Erhan, Dumitru, Vanhoucke, Vincent, and Rabinovich, Andrew. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1-9, 2015.

Tan, Pang-Ning et al. Introduction to Data Mining. Pearson Education India, 2006.

Wan, Li, Zeiler, Matthew, Zhang, Sixin, Cun, Yann L., and Fergus, Rob. Regularization of neural networks using DropConnect. In Proceedings of the 30th International Conference on Machine Learning (ICML), 2013.