[PDF] How to Center Binary Deep Boltzmann Machines

Abstract

This work analyzes centered binary Restricted Boltzmann Machines (RBMs) and binary Deep Boltzmann Machines (DBMs), where centering is done by subtracting offset values from visible and hidden variables. We show analytically that (i) centering results in a different but equivalent parameterization for artificial neural networks in general, (ii) the expected performance of centered binary RBMs/DBMs is invariant under simultaneous flip of data and offsets, for any offset value in the range of zero to one, (iii) centering can be reformulated as a different update rule for normal binary RBMs/DBMs, and (iv) using the enhanced gradient is equivalent to setting the offset values to the average over model and data mean. Furthermore, numerical simulations suggest that (i) optimal generative performance is achieved by subtracting mean values from visible as well as hidden variables, (ii) centered RBMs/DBMs reach significantly higher log-likelihood values than normal binary RBMs/DBMs, (iii) centering variants whose offsets depend on the model mean, like the enhanced gradient, suffer from severe divergence problems, (iv) learning is stabilized if an exponentially moving average over the batch means is used for the offset values instead of the current batch mean, which also prevents the enhanced gradient from diverging, (v) centered RBMs/DBMs reach higher LL values than normal RBMs/DBMs while having a smaller norm of the weight matrix, (vi) centering leads to an update direction that is closer to the natural gradient and that the natural gradient is extremly efficient for training RBMs, (vii) centering dispense the need for greedy layer-wise pre-training of DBMs, (viii) furthermore we show that pre-training often even worsen the results independently whether centering is used or not, and (ix) centering is also beneficial for auto encoders.

Full PDF

HHow to Center Binary Deep Boltzmann Machines

Jan Melchior

[email protected]

Theory of Neural SystemsInstitut f¨ur NeuroinformatikRuhr Universit¨at Bochum44780 Bochum, Germany

Asja Fischer

[email protected]

Theory of Machine LearningInstitut f¨ur NeuroinformatikRuhr Universit¨at Bochum44780 Bochum, Germany

Laurenz Wiskott

[email protected]

Theory of Neural SystemsInstitut f¨ur NeuroinformatikRuhr Universit¨at Bochum44780 Bochum, Germany

Abstract

This work analyzes centered binary Restricted Boltzmann Machines (RBMs) and binaryDeep Boltzmann Machines (DBMs), where centering is done by subtracting oﬀset valuesfrom visible and hidden variables. We show analytically that (i) centering results in adiﬀerent but equivalent parameterization for artiﬁcial neural networks in general, (ii) theexpected performance of centered binary RBMs/DBMs is invariant under simultaneous ﬂipof data and oﬀsets, for any oﬀset value in the range of zero to one, (iii) centering can bereformulated as a diﬀerent update rule for normal binary RBMs/DBMs, and (iv) using theenhanced gradient is equivalent to setting the oﬀset values to the average over model anddata mean. Furthermore, numerical simulations suggest that (i) optimal generative perfor-mance is achieved by subtracting mean values from visible as well as hidden variables, (ii)centered RBMs/DBMs reach signiﬁcantly higher log-likelihood values than normal binaryRBMs/DBMs, (iii) centering variants whose oﬀsets depend on the model mean, like theenhanced gradient, suﬀer from severe divergence problems, (iv) learning is stabilized if anexponentially moving average over the batch means is used for the oﬀset values insteadof the current batch mean, which also prevents the enhanced gradient from diverging, (v)centered RBMs/DBMs reach higher LL values than normal RBMs/DBMs while havinga smaller norm of the weight matrix, (vi) centering leads to an update direction that iscloser to the natural gradient and that the natural gradient is extremly eﬃcient for trainingRBMs, (vii) centering dispense the need for greedy layer-wise pre-training of DBMs, (viii)furthermore we show that pre-training often even worsen the results independently whethercentering is used or not, and (ix) centering is also beneﬁcial for auto encoders.

Keywords: centering, Boltzmann machines, artiﬁcial neural networks, generative models,auto encoders, contrastive divergence, enhanced gradient, natural gradient a r X i v : . [ s t a t . M L ] J u l . Introduction In the last decade Restricted Boltzmann Machines (RBMs) got into the focus of attentionbecause they can be considered as building blocks of deep neural networks (Hinton et al.,2006; Bengio, 2009). RBM training methods are usually based on gradient ascent on thelog-Likelihood (LL) of the model parameters given the training data. Since the gradientis intractable, it is often approximated using Gibbs sampling only for a few steps (Hintonet al., 2006; Tieleman, 2008; Tieleman and Hinton, 2009)Two major problems have been reported when training RBMs. Firstly, the bias of thegradient approximation introduced by using only a few steps of Gibbs sampling may leadto a divergence of the LL during training (Fischer and Igel, 2010; Schulz et al., 2010).To overcome the divergence problem, Desjardins et al. (2010) and Cho et al. (2010) haveproposed to use parallel tempering, which is an advanced sampling method that leads to afaster mixing Markov chain and thus to a better approximation of the LL gradient.Secondly, the learning process is not invariant to the data representation. For exampletraining an RBM on the

MNIST dataset leads to a better model than training it on (the dataset generated by ﬂipping each bit in

MNIST ). This is due to missinginvariance properties of the gradient with respect to these ﬂip transformations and not dueto the model’s capacity, since an RBM trained on

MNIST can be transformed in such a waythat it models with the same LL. Recently, two approaches have been introducedthat address the invariance problem. The enhanced gradient (Cho et al., 2011, 2013b) hasbeen designed as an invariant alternative to the true LL gradient of binary RBMs and hasbeen derived by calculating a weighted average over the gradients one gets by applying anypossible bit ﬂip combination on the dataset. Empirical results suggest that the enhancedgradient leads to more distinct features and thus to better classiﬁcation results based on thelearned hidden representation of the data. Furthermore, in combination with an adaptivelearning rate the enhanced gradient leads to more stable training in the sense that good LLvalues are reached independently of the initial learning rate. Tang and Sutskever (2011),on the other hand have shown empirically that subtracting the data mean from the visiblevariables leads to a model that can reach similar LL values on the

MNIST and the dataset and comparable results to those of the enhanced gradient. Removing the mean fromall variables is known as the ‘centering trick’ which was originally proposed for feed forwardneural networks (LeCun et al., 1998; Schraudolph, 1998). It has recently also been appliedto the visible and hidden variables of Deep Boltzmann Machines (DBM) (Montavon andM¨uller, 2012) where it has been shown to lead to an initially better conditioned optimizationproblem. Furthermore, the learned features have shown better discriminative propertiesand centering has improved the generative properties of locally connected DBMs. A relatedapproach applicable to multi-layer perceptrons where the activation functions of the neuronsare transformed to have zero mean and zero slope on average was proposed by Raiko et al.(2012). The authors could show that the gradient under this transformation became closerto the natural gradient, which is desirable since the natural gradient follows the directionof steepest ascent in the manifold of probability distributions. Furthermore, the naturalgradient is independent of the concrete parameterization of the distributions and is thus

1. Note, that changing the model such that the mean of the visible variables is removed is not equivalentto just removing the mean of the data. we give a uniﬁed view on centering where we analysis in particular theproperties and performance of centered binary RBMs and DBMs. We begin with a briefoverview over binary RBMs, the standard learning algorithms, the natural gradient of theLL of RBMs, and the basic ideas used to construct the enhanced gradient in Section 2.In Section 3, we discuss the theoretical properties of centered RBMs, show that centeringcan be reformulated as a diﬀerent update rule for normal binary RBMs, that the enhancedgradient is a particular form of centering and ﬁnally that centering in RBMs and its prop-erties naturally extend to DBMs. Furthermore, in Section 4, we show that centering is analternative parameterization for arbitrary Artiﬁcial Neural Networks (ANNs) in general andwe discusses how the parameters of centered and normal binary ANNs should be initial-ized. Our experimental setups are described in Section 5 before we empirically analyze theperformance of centered RBMs with diﬀerent initializations, oﬀset parameters, samplingmethods, and learning rates in Section 6. The empirical analysis includes experiments on10 real world problems, a comparison of the centered gradient with the natural gradient,and experiments on Deep Boltzmann machines and Auto encoders (AEs). Finally our workis concluded in Section 7.

2. Restricted Boltzmann Machines

An RBM (Smolensky, 1986) is a bipartite undirected graphical model with a set of N visibleand M hidden variables taking values x = ( x , ..., x N ) and h = ( h , ..., h M ), respectively.Since an RBM is a Markov random ﬁeld, its joint probability distribution is given by aGibbs distribution p ( x , h ) = 1 Z e − E ( x , h ) , with partition function Z and energy E ( x , h ). For binary RBMs, x ∈ { , } N , h ∈ { , } M ,and the energy, which deﬁnes the bipartite structure, is given by E ( x , h ) = − x T b − c T h − x T Wh , where the weight matrix W , the visible bias vector b and the hidden bias vector c are theparameters of the model, jointly denoted by θ . The partition function which sums over all

2. Previous versions of this work have been published as eprint (Melchior et al., 2013) and as part of thethesis (Fischer, 2014). Z = (cid:88) x (cid:88) h e − E ( x , h ) . RBM training is usually based on gradient ascent using approximations of the LL gra-dient ∇ θ = ∂ (cid:104) log ( p ( x | θ )) (cid:105) d ∂ θ = − (cid:28) ∂E ( x , h ) ∂ θ (cid:29) d + (cid:28) ∂E ( x , h ) ∂ θ (cid:29) m , where (cid:104)·(cid:105) m is the expectation under p ( h , x ) and (cid:104)·(cid:105) d is the expectation under p ( h | x ) p e ( x )with empirical distribution p e . We use the notation ∇ θ for the derivative of the LL withrespect to θ in order to be consistent with the notation used by Cho et al. (2011). Forbinary RBMs the gradient becomes ∇ W = (cid:104) xh T (cid:105) d − (cid:104) xh T (cid:105) m , ∇ b = (cid:104) x (cid:105) d − (cid:104) x (cid:105) m , ∇ c = (cid:104) h (cid:105) d − (cid:104) h (cid:105) m . Common RBM training methods approximate (cid:104)·(cid:105) m by samples gained by diﬀerent Markovchain Monte Carlo methods. Sampling k (usually k = 1) steps from a Gibbs chain initializedwith a data sample yields the Contrastive Divergence (CD) (Hinton et al., 2006) algorithm.In stochastic maximum likelihood (Younes, 1991), in the context of RBMs also known asPersistent Contrastive Divergence (PCD) (Tieleman, 2008), the chain is not reinitializedwith a data sample after parameter updates. This has been reported to lead to bettergradient approximations if the learning rate is chosen suﬃciently small. Fast PersistentContrastive Divergence (FPCD) (Tieleman and Hinton, 2009) tries to further speed uplearning by introducing an additional set of parameters, which is only used for Gibbs sam-pling during learning. The advanced sampling method Parallel Tempering (PT) introducesadditional ‘tempered’ Gibbs chains corresponding to smoothed versions of p ( x , h ). Theenergy of these distributions is multiplied by T , where T is referred to as temperature. Thehigher the temperature of a chain is, the ‘smoother’ the corresponding distribution andthe faster the chain mixes. Samples may swap between chains with a probability given bythe Metropolis Hastings ratio, which leads to better mixing of the original chain (where T = 1). We use PT c to denote the RBM training algorithm that uses Parallel Temperingwith c tempered chains as a sampling method. Usually only one step of Gibbs samplingis performed in each tempered chain before allowing samples to swap, and a deterministiceven odd algorithm (Lingenheil et al., 2009) is used as a swapping schedule. PT c increasesthe mixing rate and has been reported to achieve better gradient approximations than CDand (F)PCD (Desjardins et al., 2010; Cho et al., 2010) with the drawback of having ahigher computational cost.See the introductory paper of Fischer and Igel (2014) for a recent review of RBMs andtheir training algorithms. Cho et al. (2011) proposed a diﬀerent way to update parameters during training of binaryRBMs, which is invariant to the data representation.4hen transforming the state ( x , h ) of a binary RBM by ﬂipping some of its variables(that is ˜ x i = 1 − x i and ˜ h j = 1 − h j for some i, j ), yielding a new state ( ˜x , ˜h ), one cantransform the parameters θ of the RBM to ˜ θ such that E ( x , h | θ ) = E ( ˜x , ˜h | ˜ θ ) + const andthus p ( x , h | θ ) = p ( ˜x , ˜h | ˜ θ ) holds. However, if we update the parameters of the transformedmodel based on the corresponding LL gradient to ˜ θ (cid:48) = ˜ θ + η ∇ ˜ θ and apply the inverseparameter transformation to ˜ θ (cid:48) , the result will diﬀer from θ (cid:48) = θ + η ∇ θ . The describedprocedure of transforming, updating, and transforming back can be regarded as a diﬀerentway to update θ .Following this line of thought there exist 2 N + M diﬀerent parameter updates correspond-ing to the 2 N + M possible binary ﬂips of ( x , h ). Cho et al. (2011) proposed the enhancedgradient as a weighted sum of these 2 N + M parameter updates, which for their choice ofweighting is given by ∇ e W = (cid:104) ( x − (cid:104) x (cid:105) d )( h − (cid:104) h (cid:105) d ) T (cid:105) d − (cid:104) ( x − (cid:104) x (cid:105) m )( h − (cid:104) h (cid:105) m ) T (cid:105) m , (1) ∇ e b = (cid:104) x (cid:105) d − (cid:104) x (cid:105) m − ∇ e W

12 ( (cid:104) h (cid:105) d + (cid:104) h (cid:105) m ) , (2) ∇ e c = (cid:104) h (cid:105) d − (cid:104) h (cid:105) m − ∇ e W T

12 ( (cid:104) x (cid:105) d + (cid:104) x (cid:105) m ) . (3)It has been shown that the enhanced gradient is invariant to arbitrary bit ﬂips of the vari-ables and therefore invariant under the data representation, which has been demonstratedon the MNIST and dataset. Furthermore, the authors reported more stabletraining under various settings in terms of the LL estimate and classiﬁcation accuracy.

Following the direction of steepest ascent in the Euclidean parameter space (as given by thestandard gradient) does not necessarily correspond to the direction of steepest ascent in themanifold of probability distributions { p ( x | θ ) , θ ∈ Θ } , which we are actually interested in.To account for the local geometry of the manifold, the Euclidean metric should be replacedby the Fisher information metric deﬁned by || θ || I ( θ ) = (cid:112)(cid:80) θ k I kl ( θ ) θ l , where I ( θ ) is theFisher information matrix (Amari, 1998). The kl -th entry of the Fisher information matrixfor a parameterized distribution p ( x | θ ) is given by I kl ( θ ) = (cid:28)(cid:18) ∂ log ( p ( x | θ )) ∂θ k (cid:19) (cid:18) ∂ log ( p ( x | θ )) ∂θ l (cid:19)(cid:29) m , where (cid:104)·(cid:105) m denotes the expectation under p ( x | θ ). The gradient associated with the Fishermetric is called the natural gradient and is given by ∇ n θ = I ( θ ) − ∇ θ . The natural gradient points in the direction δ θ achieving the largest change of the objectivefunction (here the LL) for an inﬁnitesimal small distance δ θ between p ( x | θ ) and p ( x | θ + δθ )in terms of the Kullback-Leibler divergence (Amari, 1998). This makes the natural gradientindependent of the parameterization including the invariance to ﬂips of the data as a specialcase. Thus, the natural gradient is clearly the update direction of choice.5or binary RBMs the entries of the Fisher information matrix (Amari et al., 1992;Desjardins et al., 2013; Ollivier et al., 2013) are given by I w ij ,w uv ( θ ) = I ,w uv ,w ij ( θ ) = (cid:104) x i h j x u h v (cid:105) m − (cid:104) x u h v (cid:105) m (cid:104) x u h v (cid:105) m = Cov m ( x i h j , x u h v ) , I w ij ,b u ( θ ) = I b u ,w ij ( θ ) = Cov m ( x i h j , x u ) , I w ij ,c v ( θ ) = I c v ,w ij ( θ ) = Cov m ( x i h j , h v ) , I b i ,b u ( θ ) = I b u ,b i ( θ ) = Cov m ( x i , x u ) , I c j ,c v ( θ ) = I c v ,c j ( θ ) = Cov m ( h j , h v ) . Since these expressions involve expectations under the model distribution they are nottractable in general, but can be approximated using MCMC methods (Ollivier et al., 2013;Desjardins et al., 2013). Furthermore, a diagonal approximation of the Fisher informationmatrix could be used. However, the approximation of the natural gradient is still compu-tationally very expensive so that the practical usability remains questionable (Desjardinset al., 2013).

3. Centered Restricted Boltzmann Machines

Inspired by the centering trick proposed by LeCun et al. (1998), Tang and Sutskever (2011)have addressed the ﬂip-invariance problem by changing the energy of the RBM in a waythat the mean of the input data is removed. Montavon and M¨uller (2012) have extended theidea of centering to the visible and hidden variables of DBMs and have shown that centeringimproves the conditioning of the underlying optimization problem, leading to models withbetter discriminative properties for DBMs in general and better generative properties inthe case of locally connected DBMs.Following their line of thought, the energy for a centered binary RBM where thevisible and hidden variables are shifted by the oﬀset parameters µ = ( µ , . . . , µ N ) and λ = ( λ , . . . , λ M ), respectively, can be formulated as E ( x , h ) = − ( x − µ ) T b − c T ( h − λ ) − ( x − µ ) T W ( h − λ ) . (4)By setting both oﬀsets to zero one retains the normal binary RBM. Setting µ = (cid:104) x (cid:105) d and λ = leads to the model introduced by Tang and Sutskever (2011), and by setting µ = (cid:104) x (cid:105) d and λ = (cid:104) h (cid:105) d we get a shallow variant of the centered DBM analyzed by Montavon andM¨uller (2012).The conditional probabilities for a variable taking the value one are given by p ( x i = 1 | h ) = σ ( w i ∗ ( h − λ ) + b i ) , (5) p ( h j = 1 | x ) = σ (( x − µ ) T w ∗ j + c j ) , (6)where σ ( · ) is the sigmoid function, w i ∗ represents the i th row, and w ∗ j the j th column ofthe weight matrix W .The LL gradient now takes the form ∇ W = (cid:104) ( x − µ )( h − λ ) T (cid:105) d − (cid:104) ( x − µ )( h − λ ) T (cid:105) m , (7) ∇ b = (cid:104) x − µ (cid:105) d − (cid:104) x − µ (cid:105) m = (cid:104) x (cid:105) d − (cid:104) x (cid:105) m , (8) ∇ c = (cid:104) h − λ (cid:105) d − (cid:104) h − λ (cid:105) m = (cid:104) h (cid:105) d − (cid:104) h (cid:105) m . (9)6 b and ∇ c are independent of the choice of µ and λ and thus centering only aﬀects ∇ W .It can be shown (see Appendix A) that the gradient of a centered RBM is invariant to ﬂiptransformations if a ﬂip of x i to 1 − x i implies a change of µ i to 1 − µ i , and a ﬂip h j to 1 − h j implies a change of λ j to 1 − λ j . This obviously holds for µ i = 0 . λ j = 0 . x i and h j under any distribution. Moreover, if the oﬀsets are set toan expectation centered RBMs get also invariant to shifts of variables (see Section 4). Notethat the properties of centered RBMs naturally extend to centered DBMs (see Section 3.2).If we set µ and λ to the expectation values of the variables, these values may dependon the RBM parameters (think for example about (cid:104) h (cid:105) d ) and thus they might change duringtraining. Consequently, a learning algorithm for centered RBM needs to update the oﬀsetvalues to match the expectations under the distribution that has changed with a parameterupdate. When updating the oﬀsets one needs to transform the RBM parameters such thatthe modeled probability distribution stays the same. An RBM with oﬀsets µ and λ can betransformed to an RBM with oﬀsets µ (cid:48) and λ (cid:48) by W (cid:48) = W , (10) b (cid:48) = b + W (cid:0) λ (cid:48) − λ (cid:1) , (11) c (cid:48) = c + W T (cid:0) µ (cid:48) − µ (cid:1) , (12)such that E ( x , h | θ , µ , λ ) = E ( x , h | θ (cid:48) , µ (cid:48) , λ (cid:48) ) + const , is guaranteed. Obviously, this canbe used to transform a centered RBM to a normal RBM and vice versa, highlighting thatcentered and normal RBMs are just diﬀerent parameterizations of the same model class.If the intractable model mean is used for the oﬀsets, they have to be approximated bysamples. Furthermore, when λ is chosen to be (cid:104) h (cid:105) d or (cid:104) h (cid:105) m or when µ is chosen to be (cid:104) x (cid:105) m one could either approximate the mean values using the sampled states or the correspondingconditional probabilities. But due to the Rao-Blackwell theorem an estimation based onthe probabilities has lower variance and therefore is the approximation of choice .Algorithm 1 shows pseudo code for training a centered binary RBM, where we use (cid:104)·(cid:105) to denote the average over samples from the current batch. Thus, for example, we write (cid:104) x d (cid:105) for the average value of data samples x d in the current batch, which is used as anapproximation for the expectation of x under the data distribution that is (cid:104) x (cid:105) d . Similarly, (cid:104) h d (cid:105) approximates (cid:104) h (cid:105) d using the hidden samples h d in the current batch.Note that in Algorithm 1 the update of the oﬀsets is performed before the gradientis calculated, such that gradient and reparameterization both use the current samples.This is in contrast to the algorithm for centered DBMs proposed by Montavon and M¨uller(2012), where the update of the oﬀsets and the reparameterization follows after the gradientupdate. Thus, while the gradient still uses the current samples the reparameterization isbased on samples gained from the model of the previous iteration. However, the proposedDBM algorithm smooths the oﬀset estimations by an exponentially moving average overthe sample means from many iterations, so that the choice of the sample set used for theoﬀset estimation should be less relevant. In Algorithm 1 an exponentially moving averageis obtained if the sliding factor ν is set to 0 < ν < ν = 1. The eﬀects ofusing an exponentially moving average are empirically analyzed in Section 6.2.

3. This can be proven analogously to the proof of proposition 1 in the work of Swersky et al. (2010). lgorithm 1: Training centered RBMs Initialize W ; /* i.e. W ← N (0 , . N × M */ Initialize µ , λ ; /* i.e. µ ← (cid:104) data (cid:105) , λ ← . */ Initialize b , c ; /* i.e. b ← σ − ( µ ) , c ← σ − ( λ ) */ Initialize η, ν µ , ν λ ; /* i.e. η, ν µ , ν λ ∈ { . , ..., . } */ repeat foreach batch in data do foreach sample x d in batch do Calculate h d = p ( h j = 1 | x d ) ; /* (cid:46) Eq. (6) */ Sample x m f rom RBM ; /* (cid:46) Eqs. (5) , (6) */ Calculate h m = p ( h j = 1 | x m ) ; /* (cid:46) Eq. (6) */ Store x m , h d , h m Estimate µ new ; /* i.e. µ new ← (cid:104) x d (cid:105) */ Estimate λ new ; /* i.e. λ new ← (cid:104) h d (cid:105) *//* Transform parameters with respect to the new offsets */ b ← b + ν λ W ( λ new − λ ) ; /* (cid:46) Eq. (11) */ c ← c + ν µ W T ( µ new − µ ) ; /* (cid:46) Eq. (12) *//* Update offsets using exp. moving averages with sliding factors ν µ and ν λ */ µ ← (1 − ν µ ) µ + ν µ µ new λ ← (1 − ν λ ) λ + ν λ λ new /* Update parameters using gradient ascent with learning rate η */ ∇ W ← (cid:104) ( x d − µ )( h d − λ ) T (cid:105) − (cid:104) ( x m − µ )( h m − λ ) T (cid:105) ; /* (cid:46) Eq. (7) */ ∇ b ← (cid:104) x d (cid:105) − (cid:104) x m (cid:105) ; /* (cid:46) Eq. (8) */ ∇ c ← (cid:104) h d (cid:105) − (cid:104) h m (cid:105) ; /* (cid:46) Eq. (9) */ W ← W + η ∇ W b ← b + η ∇ b c ← c + η ∇ c until stopping criteria is met ; /* Transform network to a normal binary RBM if desired */ b ← b − W λ ; /* (cid:46) Eq. (11) */ c ← c − W T µ ; /* (cid:46) Eq. (12) */ µ ← λ ← .1 Centered Gradient We now use the centering trick to derive a centered parameter update, which can replacethe gradient during the training of normal binary RBMs. Similar to the derivation of theenhanced gradient we can transform a normal binary to a centered RBM, perform a gradientupdate, and transform the RBM back (see Appendix B for the derivation). This yields thefollowing parameter updates, which we refer to as centered gradient ∇ c W = (cid:104) ( x − µ )( h − λ ) T (cid:105) d − (cid:104) ( x − µ )( h − λ ) T (cid:105) m , (13) ∇ c b = (cid:104) x (cid:105) d − (cid:104) x (cid:105) m − ∇ c W λ , (14) ∇ c c = (cid:104) h (cid:105) d − (cid:104) h (cid:105) m − ∇ c W T µ . (15)Notice that by setting µ = ( (cid:104) x (cid:105) d + (cid:104) x (cid:105) m ) and λ = ( (cid:104) h (cid:105) d + (cid:104) h (cid:105) m ) the centered gradientbecomes equal to the enhanced gradient (see Appendix C). Thus, it becomes clear thatthe enhanced gradient is a special case of centering. This can also be concluded from thederivation of the enhanced gradient for Gaussian visible variables in (Cho et al., 2013a). Algorithm 2:

Training RBMs using the centered gradient Initialize W ; /* i.e. W ← N (0 , . N × M */ Initialize µ , λ ; /* i.e. µ ← (cid:104) data (cid:105) , λ ← . */ Initialize b , c ; /* i.e. b ← σ − ( µ ) , c ← σ − ( λ ) */ Initialize η, ν µ , ν λ ; /* i.e. η, ν µ , ν λ ∈ { . , ..., . } */ repeat foreach batch in data do foreach ample v d in batch do Calculate h d = p ( h j = 1 | x d ) ; /* (cid:46) Eq. (6) */ Sample x m f rom RBM ; /* (cid:46) Eqs. (5) , (6) */ Calculate h m = p ( h j = 1 | v m ) ; /* (cid:46) Eq. (6) */ Store x m , h d , h m Estimate µ new ; /* i.e. µ new ← (cid:104) x d (cid:105) */ Estimate λ new ; /* i.e. λ new ← (cid:104) h d (cid:105) *//* Update offsets using exp. moving averages with slidingfactors ν µ and ν λ */ µ ← (1 − ν µ ) µ + ν µ µ new λ ← (1 − ν λ ) λ + ν λ λ new /* Update parameters using the centered gradient with learningrate η */ ∇ c W ← (cid:104) ( x d − µ )( h d − λ ) T (cid:105) − (cid:104) ( x m − µ )( h m − λ ) T (cid:105) ; /* (cid:46) Eq. (13) */ ∇ c b ← (cid:104) x d (cid:105) − (cid:104) x m (cid:105) − ∇ c W λ ; /* (cid:46) Eq. (14) */ ∇ c c ← (cid:104) h d (cid:105) − (cid:104) h m (cid:105) − ∇ c W T µ ; /* (cid:46) Eq. (15) */ W ← W + η ∇ c W b ← b + η ∇ c b c ← c + η ∇ c c until stopping criteria is met ; 9he enhanced gradient has been designed such that the weight updates become thediﬀerence of the covariances between one visible and one hidden variable under the dataand the model distribution. Interestingly, one gets the same weight update for two otherchoices of oﬀset parameters: either µ = (cid:104) x (cid:105) d and λ = (cid:104) h (cid:105) m or µ = (cid:104) x (cid:105) m and λ = (cid:104) h (cid:105) d .However, these oﬀsets result in diﬀerent update rules for the bias parameters.Algorithm 2 shows pseudo code for training a normal binary RBM using the centeredgradient, which is equivalent to training a centered binary RBM using Algorithm 1. Bothalgorithms can easily be extended to RBMs with other types of units and DBMs. A DBM (Salakhutdinov and Hinton, 2009) is a deep undirected graphical model with severalhidden layers where successive layers have a bipartite connection structure. Therefore, aDBM can be seen as a stack of several RBMs and thus as natural extension of RBMs. Acentered binary DBM with L layers h (0) , · · · , h ( L ) (where h (0) corresponds to the visiblelayer) represents a Gibbs distribution with energy E (cid:0) h (0) , · · · , h ( L ) (cid:1) = − L (cid:88) l =0 (cid:0) h ( l ) − λ ( l ) (cid:1) T b ( l ) − L − (cid:88) l =0 (cid:0) h ( l ) − λ ( l ) (cid:1) T W ( l ) (cid:0) h ( l +1) − λ ( l +1) (cid:1) , where each layer l has a bias b ( l ) , an oﬀset λ ( l ) and is connected to layer l + 1 by weightmatrix W l .The derivations, proofs and algorithms given in this work for RBMs automatically ex-tend to DBMs since each DBM can be transformed to an RBM with restricted connectionsand partially unknown input data. This is illustrated for a DBM with four layers in Fig-ure 3.2. As a consequence of this relation DBMs can essentially be trained in the sameway as RBMs but also suﬀer from the same problems as described before. The only dif-ference when training DBMs is that the expectation under the data distribution in the LLgradient cannot be calculated exactly as it is the case for RBMs. Instead the term is approx-imated by running a mean ﬁeld estimation until convergence (Salakhutdinov and Hinton,2009), which corresponds to approximating the gradient of a lower variational bound ofthe LL. Furthermore, it is common to pre-train DBMs in a greedy layer wise fashion usingRBMs (Salakhutdinov and Hinton, 2009; Hinton and Salakhutdinov, 2012).

4. Centering in Artiﬁcial Neural Networks in General

Removing the mean from visible and hidden units has originally been proposed for feedforward neural networks (LeCun et al., 1998; Schraudolph, 1998). When this idea wasapplied to RBMs (Tang and Sutskever, 2011) the model was reparameterized such that theprobability distribution deﬁned by the normal and centered RBM stayed the same. In thissection we generalize this concept to show that centering is an alternative parameterizationfor arbitrary ANN architectures in general, if the network is reparameterized accordinglyThis holds independently of the chosen activation functions and connection types includingdirected, undirected and recurrent connections. To show the correctness of this statement,10 a) Deep network version. (b) Shallow network version.

Figure 1: Example for (a) a deep neural network with four layers h (0) , · · · , h (3) and (b) theequivalent two layer shallow version of the same network with restricted connec-tions and unknown input h (2) .let us consider the centered artiﬁcial neuron model o j = φ j (cid:32)(cid:88) i w ij ( a i − µ i ) + c j (cid:33) , (16)where the output o j of the j th neuron depends on its activation function φ j , bias term c j and weights w ij with associated inputs a i and their corresponding oﬀsets µ i . Such neuronscan be used to construct arbitrary network architectures using undirected, directed andrecurrent connections, which can then be optimized with respect to a chosen loss.Two ANNs that represent exactly the same functional input-output mapping can beconsidered as diﬀerent parameterizations of the same model. Thus, a centered ANN isjust a diﬀerent parameterization of an uncentered ANN if we can show that their functionalinput-output mappings are the same. This can be guaranteed in general if all correspondingunits in a centered and an uncentered ANN have the same mapping from inputs to outputs.If the oﬀset µ i is changed to µ (cid:48) i = µ i + ∇ µ i then the output of the centered artiﬁcialneuron (16) becomes φ j (cid:32)(cid:88) i w ij (cid:0) a i − µ (cid:48) i (cid:1) + c j (cid:33) = φ j (cid:32)(cid:88) i w ij ( a i − ( µ i + ∇ µ i )) + c j (cid:33) = φ j (cid:32)(cid:88) i w ij ( a i − µ i ) + c j − (cid:88) i w ij ∇ µ i (cid:33) , showing that the units output does not change when changing the oﬀset µ i to µ (cid:48) i if the unitsbias parameter c j is reparameterized to c (cid:48) j = c j + (cid:80) i w ij ∇ µ i .This generalizes the proposed reparameterization for RBMs given by Equation (12) toANNs. Note that the originally centering algorithm (LeCun et al., 1998; Schraudolph, 1998)did not reparameterize the network, which can cause instabilities especially if the learning11ate is large. By setting µ i or µ (cid:48) i to zero it now follows that for each normal ANN thereexists a centered ANN and vice verse such that the output of each neuron and thus thefunctional mapping from input to output of the whole network stays the same. If the inputto output mapping does not change this also holds for an arbitrary loss depending on thisoutput.Moreover, if we guarantee that a shift of a i implies a shift of µ i by the same value (thatis a shift of a i to a i + δ i implies a shift of µ i to µ i + δ i ) the neuron’s output o j gets invariantto shifts of a i . This is easy to see since δ i cancels out in Equation (16) if the same shift isapplied to both a i and µ i , which holds for example if we set the oﬀsets to the mean valuesof the corresponding variables since (cid:104) a i + δ i (cid:105) = δ i + (cid:104) a i (cid:105) . An AE or auto-associator (Rumelhart et al., 1986b) is a type of neural network that hasoriginally been proposed for unsupervised dimensionality reduction. Like RBMs, AEs havealso been used for unsupervised feature extraction and greedy layer-wise pre-training ofdeep neural networks (Bengio et al., 2007). In general, an AE consists of a determinis-tic encoder encode ( x ), which maps the input x = ( x , ..., x N ) to a hidden representation h = ( h , ..., h M ) and a deterministic decoder decode ( h ), which maps the hidden represen-tation to the reconstructed input representation ˜x . The network is optimized such that thereconstructed input ˜x gets as close as possible to the original input x measured by a chosenloss L ( x , ˜x ). Common choices for the loss are the mean squared error (cid:104) (cid:80) Ni =1 ( x i − ˜ x i ) (cid:105) for arbitrary input and the average cross entropy (cid:104)− (cid:80) Ni =1 x i log ˜ x i + (1 − x i ) log(1 − ˜ x i ) (cid:105) for binary data. AEs are usually trained via back-propagation (Kelley, 1960; Rumelhartet al., 1986a) and they can be seen as feed-forward neural networks where the input patternsare also the labels. We can therefore deﬁne a centered AE by centering the encoder anddecoder, which for a centered three layer AE corresponds to encode ( x ) = φ enc (cid:0) W (cid:48) ( x − µ ) + c (cid:1) = h ,decode ( h ) = φ dec ( W ( h − λ ) + b ) = ˜x , with encoder matrix W (cid:48) , decoder matrix W , encoder bias c , decoder bias b , encoder oﬀset µ , decoder oﬀset λ , encoder activation function φ enc and decoder activation function φ enc .It is common to assume tied weights, which means that the encoder is just the transpose ofthe decoder matrix ( W (cid:48) = W T ). When choosing the activation functions for the encoderand decoder (i.e. sigmoid, tangens-hyperbolicus, radial-basis, linear, linear-rectiﬁer, ... ), wehave to ensure that the encoder activation function is appropriate for the input data (e.g. asigmoid cannot represent negative values). Worth mentioning that when using the sigmoidfunction for φ enc and φ dec the encoder becomes equivalent to Equation 5 and the decoderbecomes equivalent to Equation 6. The networks structure therefore becomes equivalent toan RBM such that the only diﬀerence is the training objective. It is a common way to initialize the weight matrix of ANNs to small random values to breakthe symmetry. The bias parameters are often initialized to zero. However, we argue thatthere exists a more reasonable initialization for the bias parameters.12inton (2010) proposed to initialize the RBM’s visible bias parameter b i to ln( p i / (1 − p i )),where p i is the proportion of the data points in which unit i is on (that is p i = (cid:104) x i (cid:105) d ). Hestates that if this is not done, the hidden units are used to activate the i th visible unit witha probability of approximately p i in the early stage of training.We argue that this initialization is in fact reasonable since it corresponds to the Maxi-mum Likelihood Estimate (MLE) of the visible bias given the data for an RBM with zeroweight matrix, given by b ∗ = ln (cid:18) (cid:104) x (cid:105) d − (cid:104) x (cid:105) d (cid:19) = − ln (cid:18) (cid:104) x (cid:105) d − (cid:19) = σ − ( (cid:104) x (cid:105) d ) , (17)where σ − is the inverse sigmoid function. Notice that the MLE of the visible bias for anRBM with zero weights is the same whether the RBM is centered or not. The conditionalprobability of the visible variables (5) of an RBM with this initialization is then given by p ( x = 1 | h ) = σ ( σ − ( (cid:104) x (cid:105) d )) = (cid:104) x (cid:105) d , where p ( x = 1 | h ) denotes the vector containing theelements p ( x i = 1 | h ). Thus the mean of the data is initially modeled only by the bias valuesand the weights are free to model higher order statistics in the beginning of training. Forthe unknown hidden variables it is reasonable to assume an initial mean of 0 . c ∗ = σ − (0 .

5) = 0 . µ i = (cid:104) x i (cid:105) d and λ j = 0 .

5. Methods

As shown in the previous section the algorithms described by Cho et al. (2011), Tang andSutskever (2011) and Montavon and M¨uller (2012) can all be viewed as diﬀerent ways ofapplying the centering trick. They diﬀer in the choice of the oﬀset parameters and in theway of approximating them, either based on the samples gained from the model in theprevious learning step or from the current one, using an exponentially moving average ornot. The question arises, how RBMs should be centered to achieve the best performancein terms of the LL. In the following we analyze the diﬀerent ways of centering empiricallyand try to derive a deeper understanding of why centering is beneﬁcial.For simplicity we introduce the following shorthand notation. We use d to denote thedata mean (cid:104)·(cid:105) d , m for the model mean (cid:104)·(cid:105) m , a for the average of the means (cid:104)·(cid:105) d + (cid:104)·(cid:105) m ,and if the oﬀsets is set to zero. We indicate the choice of µ in the ﬁrst and the choiceof λ in the second place, for example dm translates to µ = (cid:104) x (cid:105) d and λ = (cid:104) h (cid:105) m . We add asuperscribed b (before) or l (later) to denote whether the reparameterization is performed13efore or after the gradient update. If the sliding factor in Algorithm 1 or 2 is set to a valuesmaller than one and thus an exponentially moving average is used, a subscript s is added.Thus, we indicate the variant of Cho et al. (2011) by aa b , the one of Montavon and M¨uller(2012) by dd ls , the data normalization of Tang and Sutskever (2011) by d0 , and the normalbinary RBM simply by . Table 1 summarizes the abbreviations most frequently used inthis paper.We begin our analysis with RBMs, where one layer is small enough to guarantee thatthe exact LL is still tractable. In a ﬁrst set of experiments we analyze the four algorithmsdescribed above in terms of the evolution of the LL during training. In a second set of exper-iments we analyze the eﬀect of the initialization described in Section 4.2. We proceed witha comparison of the eﬀects of estimating oﬀset values and reparameterizing the parametersbefore or after the gradient update. Afterwards we analyze the eﬀects of using an expo-nentially moving average to approximate the oﬀset values in the diﬀerent algorithms andof choosing other oﬀset values. We continue with comparing the normal and the centeredgradient with the natural gradient. To verify whether the results scale to more realisticproblem sizes we compare the RBMs, DBMs and AE on ten large datasets. We consider four diﬀerent benchmark problems in our detailed analysis.The

Bars & Stripes (MacKay, 2003) problem consists of quadratic patterns of size D × D that can be generated as follows. First, a vertical or horizontal orientation is chosenrandomly with equal probability. Then the state of all pixels of every row or columnis chosen uniformly at random. This leads to N = 2 D +1 patterns (see Figure 2(a) forsome example patterns) where the completely uniform patterns occur twice as often asthe others. The dataset is symmetric in terms of the amount of zeros and ones and thusthe ﬂipped and unﬂipped problems are equivalent. An upper bound of the LL is given by( N −

4) ln (cid:0) N (cid:1) +4 ln (cid:0) N (cid:1) . For our experiments we used D = 3 or D = 2 (only in Section 6.8)leading to an upper bound of − .

59 and − .

86, respectively.The

Shifting Bar dataset is an artiﬁcial benchmark problem we have designed to beasymmetric in terms of the amount of zeros and ones in the data. For an input dimension-ality N , a bar of length 0 < B < N has to be chosen, where BN expresses the percentage ofones in the dataset. A position 0 ≤ p < N is chosen uniformly at random and the states ofthe following B pixels are set to one, where a wrap around is used if p + B ≥ N . The statesof the remaining pixels are set to zero. This leads to N diﬀerent patterns (see Figure 2(b))with equal probability and an upper bound of the LL of N ln (cid:0) N (cid:1) . For our experiments weused N = 9, B = 1 and its ﬂipped version Flipped Shifting Bar , which we get for N = 9, B = 8, both having an upper LL bound of − . MNIST (LeCun et al., 1998) dataset of handwritten digits has become a standardbenchmark problem for RBMs. It consists of 60 ,

000 training and 10 ,

000 testing examplesof gray value handwritten digits of size 28 ×

28. See Figure 2(c) for some example patterns.After binarization (with a threshold of 0.5) the dataset contains 13 .

3% ones, similar to the

Shifting Bar problem, which for our choice of N and B contains 11 .

1% ones. We referto the dataset where each bit of MNIST is ﬂipped (that is each one is replaced by a zeroand vice versa ) as . To our knowledge, the best reported performance in terms14 bbr. µ λ

Description Normal binary RBM(Hinton et al., 2006) d (cid:104) x (cid:105) d Data Normalization RBM(Tang and Sutskever, 2011) dd ls (cid:104) x (cid:105) d (cid:104) h (cid:105) d Original Centered RBM(Montavon and M¨uller, 2012)reparam. after gradient update,use of an exp. moving average aa b . (cid:104) x (cid:105) d + (cid:104) x (cid:105) m ) 0 . (cid:104) h (cid:105) d + (cid:104) h (cid:105) m ) Enhanced gradient RBM(Cho et al., 2011)reparam. before gradient update,no exp. moving average dd bs (cid:104) x (cid:105) d (cid:104) h (cid:105) d Centering using the data mean,reparam. before gradient update,use of an exp. moving average mm bs (cid:104) x (cid:105) m (cid:104) h (cid:105) m Centering using the model mean,reparam. before gradient update,use of an exp. moving average dm bs (cid:104) x (cid:105) d (cid:104) h (cid:105) m Centering using the data meanfor the visible and the model meanfor hidden units,reparam. before gradient update,use of an exp. moving average

Table 1: Look-up table: Abbreviations for the most frequently used algorithms.of the average LL per sample of an RBM with 500 hidden units on

MNIST test data is-84 (Salakhutdinov, 2008; Salakhutdinov and Murray, 2008; Tang and Sutskever, 2011; Choet al., 2013b).The

CalTech 101 Silhouettes (Marlin et al., 2010) dataset consists of 4 .

100 training,2 .

307 validation, and 2264 testing examples of binary object silhouettes of size 28 ×

28. SeeFigure 2(d) for some example patterns. The dataset contains 55 .

1% ones, and thus (likein the

Bars & Stripes problem ) the amount of zeros and ones is almost the same. Thebackground pixels take the value one which is in contrast to

MNIST where the backgroundpixels are set to zero. To our knowledge, the best reported performance in terms of theaverage LL per sample of an RBM with 500 hidden units on

CalTech 101 Silhouettes testdata is -114.75 (Cho et al., 2013b).In some experiments we considered eight additional binary datasets from diﬀerentdomains compromising biological, image, text and game-related data (Larochelle et al.,2010). The datasets diﬀer in dimensionality (112 to 500) and size (a few hundred to several15 a) 8 out of 16 patterns from the

Bars & Stripes dataset.(b) All patterns of the

Shifting Bar dataset with N = 9 and B = 1.(c) Example patterns from the MNIST dataset after binarization.(d) Example patterns from the

Caltech 101 Silhouette dataset.

Figure 2: Some patterns from the diﬀerent benchmark problems.thousand examples) and have been separated into training, validation and test sets. Theaverage test LL for binary RBMs with 23 hidden units and related models can be foundin (Larochelle and Murray, 2011). All datasets contain less ones than zeros, where thepercentage of ones lies between 3 .

9% and 36 .

6. Results

For all models in this work the weight matrices were initialized with random values sampledfrom a Gaussian with zero mean and a standard deviation of 0 .

01. If not stated otherwisethe visible and hidden biases, and oﬀsets were initialized as described in Section 4.2.We begin our analysis with experiments on small RBMs where the LL can be calculatedexactly. We used 4 hidden units when modeling

Bars & Stripes and

Shifting Bar and 16hidden units when modeling

MNIST . For training we used CD and PCD with k steps ofGibbs sampling (CD- k , PCD- k ) and PT c where the c temperatures were distributed uni-formly form 0 to 1. For Bars & Stripes and

Shifting Bar full-batch training was performedfor 50 ,

000 gradient updates, where the LL was evaluated every 50th gradient update. Formodeling

MNIST mini-batch training with a batch size of 100 was performed for 100 epochs,each consisting of 600 gradient updates and the exact LL was evaluated after each epoch.The following tables containing the results for RBMs show the maximum average LLand the corresponding standard deviation reached during training with diﬀerent learning16lgorithms over the 25 trials. In some cases the ﬁnal average LL reached at the end oftraining is given in parenthesis to indicate a potential divergence of the LL. For reasons ofreadability, the average LL was divided by the number of training samples in the case of

MNIST . In order to check if the result of the best method within one row diﬀers signiﬁcantlyfrom the others we performed pairwise signed Wilcoxon rank-sum tests (with p = 0 . The comparison of the learning performance of the previously described algorithms dd ls , aa b , d0 , and (using their originally proposed initializations) shows that training a centeredRBM leads to signiﬁcantly higher LL values than training a normal binary RBM (see Table 2for the results for Bars & Stripes and

MNIST and Table 3 for the results for

Shifting Bar and

Flipped Shifting Bar ). Figure 3(b) illustrates on the

Bars & Stripes dataset thatcentering both the visible and the hidden variables ( dd ls and aa b ) compared to centeringonly the visible variables ( d0 ) accelerates the learning and leads to a higher LL when usingPT. The same holds for PCD as can be seen from Table 2. Thus centered RBMs can formmore accurate models of the data distribution than normal RBMs. This is diﬀerent tothe observations made for DBMs by Montavon and M¨uller (2012), which found a bettergenerative performance of centering only in the case of locally connected DBMs.It can also be seen in Figure 3 that all methods show divergence in combination withCD and PCD (as described before by Fischer and Igel, 2010, for normal RBMs), which canbe prevented for dd ls , d0 , and when using PT as shown in Figure 3(d). This can beexplained by the fact that PT leads to faster mixing Markov chains and thus less biasedgradient approximations. The aa algorithm however suﬀers from severe divergence of theLL when PCD or PT is used, which is even worse than with CD. This divergence problemarises independently of the choice of the learning rate as indicated by the LL values reachedat the end of training (given in parentheses) in Table 2 and Table 3 and which can also beseen by comparing Figure 3(c) and Figure 3(d). The divergence occurs the earlier and fasterthe bigger the learning rate, while for the other algorithms we never observed divergencein combination with PT even for very big learning rates and long training time. Thereasons for this divergence will be discussed in detail in Section 6.4. The results in Table 3also demonstrate the ﬂip invariance of the centered RBMs on the Shifting Bar datasetempirically. While fails to model the ﬂipped version of the dataset correctly dd ls , aa b , d0 have approximately the same performance on the ﬂipped and unﬂipped dataset.17 gradient update − − − − − − l og - li k e li hood aa b dd l d (a) CD-1 - learning rate 0.05 gradient update − − − − − − l og - li k e li hood aa b dd l d (b) PT - learning rate 0.05 gradient update − − − − − − l og - li k e li hood aa b dd l d (c) PCD-1 - learning rate 0.05 gradient update − − − − − − l og - li k e li hood aa b dd l d (d) PCD-1 - learning rate 0.01 Figure 3: Evolution of the average LL during training on the

Bars & Stripes dataset forthe standard centering methods. (a) When CD-1 is used for sampling and alearning rate of η = 0 .

05, (b) when PT is used for sampling and a learning rateof η = 0 .

05, (c) when PCD-1 is used for sampling and a learning rate of η = 0 . η = 0 . l g o r i t h m - η aa b dd l s d000 B a r s & S t r i pe s C D - - . - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) C D - - . - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) C D - - . - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) P C D - - . - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) P C D - - . - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) P C D - - . - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) P T - . - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) P T - . - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) P T - . - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) M N I S T C D - - . - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) C D - - . - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) C D - - . - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) P C D - - . - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) P C D - - . - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) P C D - - . - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) P T - . - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) T a b l e : M a x i m u m a v e r ag e LL du r i n g t r a i n i n go n (t o p )t h e B a r s & S t r i p e s d a t a s e t a nd ( b o tt o m )t h e M N I S T d a t a s e t u s i n g d i ﬀ e r e n t s a m p li n g m e t h o d s a nd l e a r n i n g r a t e s η . l g o r i t h m - η aa b dd l s d000 S h i f t i n g B a r C D - - . - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) C D - - . - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) C D - - . - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) P C D - - . - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) P C D - - . - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) P C D - - . - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) P T - . - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) P T - . - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) P T - . - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) F l i ppe d S h i f t i n g B a r C D - - . - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) C D - - . - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) C D - - . - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) P C D - - . - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) P C D - - . - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) P C D - - . - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) P T - . - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) P T - . - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) P T - . - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) T a b l e : M a x i m u m a v e r ag e LL du r i n g t r a i n i n go n (t o p )t h e S h if t i n g B a r d a t a s e t a nd ( b o tt o m )t h e F l i pp e d S h if t i n g B a r d a t a s e t u s i n g d i ﬀ e r e n t s a m p li n g m e t h o d s a nd l e a r n i n g r a t e s . lgorithm- η init zero init σ − CD-1-0.2 -27.98 ± -21.49 ± ± -21.09 ± ± -24.87 ± ± -22.45 ± ± -21.76 ± ± -24.83 ± -0.2 -28.01 ± -21.72 ± -0.1 -28.28 ± -21.14 ± -0.05 -28.28 ± -24.80 ± Table 4: Maximum average LL during training for on the Flipped Shifting Bar dataset,where the visible bias is initialized to zero or to the inverse sigmoid of the datamean.

Algorithm- η init zero init σ − CD-1-0.1 -165.91 ± ± -167.68 ± ± ± -168.29 ± ± -147.56 ± ± -144.20 ± ± -144.06 ± -0.01 -148.76 ± -145.63 ± Table 5: Maximum average LL during training for on the MNIST dataset, where thevisible bias is initialized to zero or to the inverse sigmoid of the data mean.

The following set of experiments was done to analyze the eﬀects of diﬀerent initializationsof the parameters as discussed in Section 4.2. First, we trained normal binary RBMs (thatis ) where the visible bias was initialized to zero or to the inverse sigmoid of the datamean. In both cases the hidden bias was initialized to zero. Table 4 shows the resultsfor normal binary RBMs trained on the Flipped Shifting Bar dataset, where RBMs withzero initialization failed to learn the distribution accurately. The RBMs using the inversesigmoid initialization achieved good performance and therefore seem to be less sensitive tothe ‘diﬃcult’ representation of the data. However, the results are not as good as the resultsof the centered RBMs shown in Table 3. The same observations can be made when trainingRBMs on the

MNIST dataset (see Table 5). The RBMs with inverse sigmoid initializationachieved signiﬁcantly better results than RBMs initialized to zero in the case of PCD andPT, but still worse compared to the centered RBMs. Furthermore, using the inverse sigmoid21 lgorithm- η dd ls init zero dd ls init σ − CD-1-0.2 -20.34 ± -20.42 ± -20.75 ± -20.85 ± ± -22.63 ± -21.03 ± -20.97 ± -20.86 ± -20.72 ± ± -22.30 ± -0.2 -20.08 ± -20.25 ± -0.1 -20.56 ± -20.68 ± -0.05 -22.93 ± -22.39 ± Table 6: Maximum average LL during training for dd ls on the Flipped Shifting Bar dataset,where the visible bias is initialized to zero or to the inverse sigmoid of the datamean.

Algorithm- η dd bs dd ls Bars & StripesCD-1-0.1 -60.34 ± -60.41 ± -60.19 ± -60.25 ± -61.23 ± -61.22 ± -54.86 ± -54.75 ± -53.71 ± -53.60 ± -56.68 ± -56.68 ± -0.1 -51.25 ± -51.13 ± -0.05 -52.06 ± -51.87 ± -0.01 -56.72 ± -56.73 ± MNIST

CD-1-0.1 -150.60 ± ± -150.98 ± ± -152.23 ± ± -141.11 ± -140.89 ± -139.95 ± -140.02 ± -140.67 ± -140.68 ± -0.01 -141.56 ± -141.46 ± Table 7: Maximum average LL during training on (top) the

Bars & Stripes dataset and(bottom) the

MNIST dataset, using the reparameterization before ( dd bs ) and after( dd ls ) the gradient update.initialization allows us to achieve similar performance on the ﬂipped and normal version ofthe MNIST dataset, while the RBM with zero initialization failed to learn at all.22econd, we trained models using the centering versions dd , aa , and d0 comparing theinitialization suggested in Section 4.2 against the initialization to zero, where we observedthat the diﬀerent ways to initialize had little eﬀect on the performance. In most cases theresults show no signiﬁcant diﬀerence in terms of the maximum LL reached during trialswith diﬀerent initializations or slightly better results were found when using the inversesigmoid, which can be explained by the better starting point yielded by this initialization.See Table 6 for the results for dd ls on the Bars & Stripes dataset as an example. We used theinverse sigmoid initialization in the following experiments since it was beneﬁcial for normalRBMs.

To investigate the eﬀects of performing the reparameterization before or after the gradientupdate in the training of centered RBMs (that is, the diﬀerence of the algorithm suggestedhere and the algorithm suggested by Montavon and M¨uller (2012)), we analyzed the learningbehavior of dd bs and dd ls on all datasets. The results for RBMs trained on the Bar & Stripes dataset are given in Table 7 (top). No signiﬁcant diﬀerence between the two versions canbe observed. The same result was obtained for the

Shifting Bar and

Flipped Shifting Bar dataset. The results for the

MNIST dataset are shown in Table 7 (bottom). Here dd bs performs slightly better than dd ls in the case of CD and no diﬀerence could be observed forPCD and PT. Therefore, we reparameterize the RBMs before the gradient update in theremainder of this work. The severe divergence problem observed when using the enhanced gradient in combinationwith PCD or PT raises the question whether the problem is induced by setting the oﬀsetsto 0 . (cid:104) x (cid:105) d + (cid:104) x (cid:105) m ) and 0 . (cid:104) h (cid:105) d + (cid:104) h (cid:105) m ) or by bad sampling based estimates of gradientand oﬀsets. We therefore trained centered RBMs with 9 visible and 4 hidden units on the2x2 Bars & Stripes dataset using either the exact gradient where only (cid:104) x (cid:105) m and (cid:104) h (cid:105) m wereapproximated by samples or using PT estimates of the gradient while (cid:104) x (cid:105) m and (cid:104) h (cid:105) m were calculated exactly. The results are shown in Figure 4. If the true model expectationsare used as oﬀsets instead of the sample approximations no divergence for aa in combi-nation with PT is observed and the performance of aa and dd become almost equivalent.Interestingly, the divergence is also prevented if one calculates the exact gradient whilestill approximating the oﬀsets by samples. Thus, the divergence behavior must be inducedby approximating both, gradient and oﬀsets. The mean under the data distribution canalways be calculated exactly, which might explain why we do not observe divergence for dd in combination with PT. On the contrary, the divergence becomes even worse when usingjust the model mean (a centering variant which we denote by mm in the following) insteadof the average of data and model mean (as in aa ) as oﬀsets, as can be seen in Figure 6.Furthermore, the divergence also occurs if either visible or hidden oﬀsets are set to thePT-approximated model mean, which can be seen for dm in Figure 8(b).To further deepen the understanding of the divergence eﬀect we investigated the param-eter evolution during training of RBMs with diﬀerent oﬀsets. We observed that the changeof the oﬀset values between two gradient updates gets extremely large during training when23 gradient update − − − − − − l og - li k e li hood aa b dd b d (a) Exact gradient with approximated oﬀsets gradient update − − − − − − l og - li k e li hood aa bs ( exact means ) (b) PT with exact means Figure 4: Evolution of the average LL during training on the

Bars & Stripes dataset forthe standard centering methods. (a) When the exact gradient is used, with ap-proximated oﬀsets and (b) when PT is used for estimating the gradient whilethe mean values for the oﬀsets are calculated exactly. In both cases a learningrate of η = 0 .

05 was used.using the model mean. Figure 5(a) shows exemplary the evolution of the ﬁrst hidden oﬀset λ for a single trial, where the oﬀset approximation for dd b is almost constant while it israther large for aa b and even bigger for mm b . In each iteration we calculated the exactoﬀsets to estimate the approximation error shown in Figure 5(b). Obviously, there is noapproximation error for dd while the error for aa quickly gets large and mm gets eventwice as big. In combination with the gradient approximation error this causes the weightmatrices for aa and mm to grow extremely big as shown in Figure 5(c).To verify that the divergence is not just caused by the additional sampling noise intro-duced by approximating the oﬀsets, we trained a centered RBM using PT for the gradientapproximation while we set the oﬀsets to uniform random values between zero and one ineach iteration. The results are shown in Figure 6(a) and demonstrate that even randomoﬀset values do not lead to the divergence problems. Thus, the divergence seems not becaused by additional sampling noise but rather by the correlated errors of gradient and oﬀsetapproximations. To test this hypothesis we investigated mm b where the samples for oﬀsetand gradient approximations were taken from diﬀerent PT sampling processes. The resultsare shown in Figure 6(b) where no divergence can be observed anymore. While creatingtwo sets of samples for gradient and oﬀset approximations prevents the LL from divergingit almost doubles the computational cost and can therefore not be considered as a relevantsolution in practice. Moreover, using the model mean as oﬀset still leads to slightly worseﬁnal LL values than using the mean under the data distribution. This might be explainedby the fact that the additional approximation of the model mean introduces noise while thedata mean can be estimated exactly. 24 gradient update . . . . . . λ aa b dd b mm b (a) Close up of the oﬀset evolution exemplary for λ gradient update . . . . . . A v e r ageh i ddeno ff s e t app r o x i m a t i one rr o r aa b dd b mm b (b) Evolution of the oﬀset approximation error gradient update F r oben i u s no r m o ft he w e i gh t m a t r i x aa b dd b mm b (c) Evolution of the weight norm Figure 5: Evolution of the oﬀsets and weights of diﬀerent centered variants for an RBMwith 9 visible and 4 hidden units trained on

Bars & Stripes 3x3 using PT .For clearness a single trial is shown, but the experiments where repeated 25trials where all results showed quantitatively the same results. (a) Close up ofthe evolution of oﬀset λ over 500 gradient updates. (b) The evolution of theabsolute diﬀerence between exact and approximated hidden oﬀset averaged of allhidden units. (c) The evolution of the Frobenius norm of the weight matrices.Interestingly, the observed initially faster learning speed of mm and aa , which can beseen in Figure 6(a), does not occur anymore when oﬀset and gradient approximation arebased on diﬀerent sample sets. This observation can also be made when the exact gradientis used (see Figure 4(a) and Figure 7). Thus, the initially faster learning speed seems alsobe caused by the correlated approximations of gradient and oﬀsets.25 gradient update − − − − − − − − l og - li k e li hood aa b dd b mm b Randommeans ) b (a) Random oﬀsets gradient update − − − − − − − − l og - li k e li hood aa b dd b mm b (b) Independant approximations Figure 6: Evolution of the average LL during training on the

Shifting Bars dataset usingPT with learning rate of η = 0 .

1. (a) normal RBMs, centered RBMs andcentered RBMs using random oﬀset values. (b) centered RBMs when the samplesfor the oﬀset approximations come from a diﬀerent Markov chain than the samplesused for the gradient estimation.

An exponentially moving average can be used to smooth the approximation of the oﬀsetsbetween parameter updates. This seems to be reasonable for stabilizing the approximationswhen small batch sizes are used as well as when the model mean is used for the oﬀsets. Wetherefore analyzed the impact of using an exponentially moving average with a sliding factorof 0 .

01 for the estimation of the oﬀset parameters. Figure 8(a) illustrates on the

Bars&Stripes dataset that the learning curves of the diﬀerent models become almost equivalentwhen using an exponentially moving average. The maximum LL values reached are the samewhether an exponentially moving average is used or not, which can be seen by comparingFigure 8(a) and Figure 8(b) and also by comparing the results in Table 2 and Table 3with those in Table 8. Notably, the divergence problem does not occur anymore when anexponentially moving average is used. As discussed in the previous section, this problem iscaused by the correlation between the approximation error of gradient and oﬀsets. Whenusing an exponential moving average the current oﬀsets contain only a small fraction of thecurrent mean such that the correlation is reduced.In the previous experiments dd was used with an exponentially moving average as suggestedfor this centering variant by Montavon and M¨uller (2012). Note however, that in batchlearning when (cid:104) x (cid:105) d is used for the visible oﬀsets, these values stay constant such that anexponentially moving average has no eﬀect. More generally if the training data and thus (cid:104) x (cid:105) d is known in advance the visible oﬀsets should be ﬁxed to this value independent ofwhether batch, mini-batch or online learning is used. However, the use of an exponentially26 gradient update − − − − − − l og - li k e li hood aa b dd b mm b (a) Visible and hidden units centered gradient update − − − − − − l og - li k e li hood a b d b m b a b d b m b (b) Visible or hidden units centered Figure 7: Evolution of the average LL during training on the

Bars & Stripes dataset for thevarious centering methods when the exact gradient is used with the exact oﬀsetsand a learning rate of η = 0 . (cid:104) x (cid:105) d is reasonable if the training data is not known inadvance, as well as for the approximation of the mean of the hidden representation (cid:104) h (cid:105) d .In our experiments, dd does not suﬀer from the divergence problem when PT is used forsampling, even without exponentially moving average, as can be seen in Figure 8(b) forexample. We did not even observe the divergence without a moving average in the case ofmini-batch learning. Thus, dd seems to be generally more stable than the other centeringvariants. As discussed in Section 3, any oﬀset value between 0 and 1 guarantees the ﬂip invarianceproperty as long as it ﬂips simultaneously with the data. An intuitive and constant choice isto set the oﬀsets to 0 .

5, which has also been proposed by Ollivier et al. (2013) and results ina symmetric variant of the energy of RBMs. This leads to comparable LL values on ﬂippedand unﬂipped datasets. However, if the dataset is unbalanced in the amount of zeros andones like

MNIST , the performances is always worse compared to that of a normal RBM onthe version of the dataset which has less ones than zeros. Therefore, ﬁxing the oﬀset valuesto 0.5 cannot be considered as an alternative for centering using expectation values overdata or model distribution.In Section 3, we mentioned the existence of alternative oﬀset parameters which lead tothe same updates for the weights as the enhanced gradient. Setting µ = (cid:104) x (cid:105) d and λ = (cid:104) h (cid:105) m seems reasonable since the data mean is usually known in advance. As mentioned abovewe refer to centering with this choice of oﬀsets as dm . We trained RBMs with dm bs using asliding factor of 0 .

01. The results are shown in Table 8 and suggest that there is no signiﬁcant27 gradient update − − − − − − l og - li k e li hood aa bs dd bs dm bs (a) With exponentially moving average gradient update − − − − − − l og - li k e li hood dd b dm b (b) Without exponentially moving average Figure 8: Evolution of the average LL during training on

Bars & Stripes with the diﬀerentcentering variants, using PT , and a learning rate of η = 0 .

05. (a) When anexponentially moving average with sliding factor of 0.01 was used (where thecurves are almost equivalent) and (b) when no exponentially moving average wasused.diﬀerence between dm bs , aa bs , and dd bs . However, without an exponentially moving average dm b has the same divergence problems as aa b , as shown in Figure 8(b).We further tried variants like mm , m

0, 0 d , m dd for any of these choices. The variants that subtract an oﬀset from both,visible and hidden variables outperformed or achieved the same performance as the variantsthat subtract an oﬀset only from one type of variables. When the model expectation wasused without a exponentially moving average either for µ or λ , or for both oﬀsets we alwaysobserved the divergence problem.Interestingly, if the exact gradient and oﬀsets are used for training no signiﬁcant diﬀer-ence can be observed in terms of the LL evolution whether data mean, model mean or theaverage of both is used for the oﬀsets as shown in Figure 7. But centering both visible andhidden units still leads to better results than centering only one. Furthermore, the resultsillustrate that centered RBMs outperform normal binary RBMs also if the exact gradientis used for training the both models. This emphasizes that the worse performance of nor-mal binary RBMs is caused by the properties of its gradient rather than by the gradientapproximation. In the previous experiments we trained small models in order to be able to run manyexperiments and to evaluate the LL exactly. We now want to show that the results we haveobserved on the toy problems and

MNIST with RBMs with 16 hidden units carry over to28 lgorithm- η aa bs dd bs dm bs Bars & StripesCD-1-0.1 -60.09 ± -60.34 ± -60.35 ± -60.31 ± -60.19 ± -60.25 ± -61.22 ± -61.23 ± -61.23 ± -54.78 ± -54.86 ± -54.92 ± -53.81 ± -53.71 ± -53.88 ± -56.48 ± ± -56.47 ± -0.1 -51.20 ± ± -51.10 ± -0.05 -51.99 ± ± -51.82 ± -0.01 -56.65 ± ± -56.67 ± -20.36 ± -20.32 ± -20.32 ± -20.80 ± -20.86 ± -20.69 ± -22.58 ± ± ± -21.00 ± -20.96 ± -21.00 ± -20.75 ± -20.76 ± -20.88 ± -22.28 ± -22.29 ± ± -0.2 -20.14 ± -20.31 ± -20.07 ± -0.1 -20.42 ± -20.46 ± -20.60 ± -0.05 -22.36 ± -22.39 ± ± MNIST

CD-1-0.1 -150.61 ± -150.60 ± -150.50 ± -151.11 ± ± -150.80 ± -152.83 ± -152.23 ± -152.17 ± -141.10 ± -141.11 ± -140.99 ± -140.01 ± -139.95 ± -139.94 ± -140.85 ± -140.67 ± -140.72 ± -0.01 -142.32 ± -141.56 ± ± Table 8: Maximum average LL during training on (top)

Bars & Stripes , (middel)

FlippedShifting Bar , and (bottom)

MNIST when using an exponentially moving averagewith an sliding factor of 0.01.more realistic settings. Furthermore, we want to investigate the generalization performanceof the diﬀerent models. In a ﬁrst set of experiments we therefore trained the models , d0 , dd bs , and aa bs with 500 hidden units on MNIST and

Caltech . The weight matrices wereinitialized with random values sampled from a Gaussian with zero mean and a standarddeviation of 0 .

01 and visible and hidden biases, and oﬀsets were initialized as described inSection 4.2. The LL was estimated using Annealed Importance Sampling (AIS), where weused the same setup as described in the analysis of Salakhutdinov and Murray (2008).Figure 9 shows the evolution of the average LL on the test data of

MNIST over 25trials for PCD-1 and PT for the diﬀerent centering versions. The models were trained for29 gradient update − − − − − − − − l og - li k e li hood aa bs dd bs d (a) PCD-1 gradient update − − − − − − − − l og - li k e li hood aa bs dd bs d (b) PT Figure 9: Evolution of the average LL on the test data of

MNIST during training fordiﬀerent centering variants with 500 hidden units, using a learning rate of η =0 .

01, and a sliding factor of 0.01. (a) When using PCD-1 and (b) when using

P T for sampling. The error bars indicate the standard deviation of the LL overthe 25 trials.200 epochs, each consisting of 600 gradient updates with a batch size of 100 and the LLwas estimated every 10th epoch using AIS. Both variants dd bs and aa bs reach signiﬁcantlyhigher LL values than and d0 . The standard deviation over the 25 trials indicatedby the error bars is smaller for dd bs and aa bs than for and d0 , especially when PT isused for sampling. Furthermore, and d0 show divergence already after 30.000 gradientupdates when PCD-1 is used, while no divergence can be observed for dd bs and aa bs after120.000 gradient updates. The evolution of the LL on the training data is not shown,since it is almost equivalent to the evolution on the test data. To our knowledge the bestreported performance of an RBM with 500 hidden units carefully trained on MNIST was-84 (Salakhutdinov, 2008; Salakhutdinov and Murray, 2008; Tang and Sutskever, 2011; Choet al., 2013b). In our experience choosing the correct training setup and using additionalmodiﬁcations of the update rule like a momentum term, weight decay, and an annealinglearning rate is essential to reach a value of -84 with normal binary RBMs. However, inorder to get an unbiased comparison of the diﬀerent models, we did not use any of thesemodiﬁcations in our experiments. This explains why our performance of does not reach-84. d0 however, reaches a value of -84 when PT is used for sampling, and dd bs and aa bs reach even higher values around -80 with PCD-1 and -75 with PT .

4. Note, that the preprocessing of

MNIST is usually done by treating the gray values (normalized to valuesin [0 , . MNIST experiments diﬃcult to compare across studies. gradient update − − − − − − − − l og - li k e li hood aa bs dd bs d (a) Training data- learning rate 0.001 gradient update − − − − − − − − l og - li k e li hood aa bs dd bs d (b) Test data - learning rate 0.001 gradient update − − − − − − − − l og - li k e li hood aa bs dd bs d (c) Training data - learning rate 0.01 gradient update − − − − − − − − l og - li k e li hood aa bs dd bs d (d) Test data - learning rate 0.01 Figure 10: Evolution of the average LL on

Caltech dataset with the diﬀerent centeringvariants with 500 hidden units. The results on training and test data for alearning rate of η = 0 .

001 are shown in sub-ﬁgures (a) and (b), respectively andfor a learning rate of η = 0 .

01 in sub-ﬁgures (c) and (d), respectively. In bothcases a sliding factor of 0.01 and PCD-1 was used. The error bars indicate thestandard deviation of the LL over the 25 trials.For the

Caltech dataset, Figure 10 shows the evolution of the average LL on trainingand test data over 25 trials for the diﬀerent centering versions using PCD-1 with a batchsize of 100 and either a learning rate of 0 .

001 or 0 .

01. The LL was estimated every 5000thgradient update using AIS. The results show that dd bs , aa bs and d0 reach higher LL valuesthan for both learning rates and on training and test data. While dd bs and aa bs perform31 gradient update − − − − − − l og - li k e li hood aa b dd b aa bs dd bs (a) MNIST LL on test data gradient update − − − − − − − − l og - li k e li hood aa b ( minibatch, η = 0 . dd b ( minibatch, η = 0 . aa bs ( minibatch, η = 0 . dd bs ( minibatch, η = 0 . aa b ( fullbatch, η = 0 . dd b ( fullbatch, η = 0 . (b) Caltech LL on test data Figure 11: Evolution of the LL of exemplary trials for the diﬀerent centering variants aa b , dd b , aa bs and dd bs on (a) MNIST and (b)

Caltech during training using PT witha batch size of 100, 500 hidden units and a learning rate of 0 . Caltech aa b and dd b were also trained in full batch mode with PT and a learning rateof 0 . d0 when a small learning rate is used, the diﬀerence becomes moreprominent for a big learning rate. Figure 10(c) and (d) show that all models over ﬁt to thetraining data. Nevertheless, dd bs and aa bs reach higher LL values on the test data and thuslead to a better generalization.To emphasize that the divergence problems induced by using the model means as oﬀsetsalso appear for big models when no shifting average is used, we trained RBMs with 500hidden units on MNIST and

Caltech using PT with a learning rate of 0 .

001 and a batchsize of 100. In addition, we trained aa b and dd b on Caltech using full batch learning and alearning rate of 0 .

01. Figure 11 shows that aa b diverges, while dd b and the correspondingcentering versions using a moving average, aa bs and dd bs , show no divergence. The divergencefor aa b even occurs in full batch training as shown in Figure 11(b).In a second set of experiments we extended our analysis to the eight datasets used byLarochelle et al. (2010). The four diﬀerent models , d0 , dd bs , and aa bs were trained withthe same setup as before using PCD-1, a learning rate of 0 .

01, a batch size of 100 and a totalnumber of 5000 epochs. All experiments were repeated 25 times and we trained either RBMswith 16 hidden units and calculated the LL exactly or RBMs with 200 hidden units usingAIS for estimating the LL. Additionally we trained RBMs with 200 hidden with a smallerlearning rate of 0 .

001 for 30000 epochs. Due to the long training time these experimentswere repeated only 10 times. The maximum average LL for the test data is shown in Table 9.On seven out of eight datasets dd bs or aa bs reached the best result independent of whether16 or 200 hidden units or a learning rate of 0 .

01 or 0 .

001 were used. Whenever aa bs reachedthe highest value it was not signiﬁcantly diﬀerent to dd bs . Note, that for training RBMs32ith 16 hidden units d dd bs on some datasets. Only on the RCV1 dataset, 00 lead to better LL values than the centered RBMs for both 16 and 200hidden units. It seems that the convergence rate on the

RCV1 , OCR and

Web dataset israther slow for all models since the diﬀerence between the highest and the ﬁnal LL valuesare rather small. This can also be observed on the training data shown in Table 10. The

DNA dataset and the

NIPS dataset over ﬁt to the training data as indicated by the factthat the divergence is only observed for the test data. In contrast, on the remaining threedatasets

Adult , CONNECT-4 and

Mushroom the divergence can be observed on trainingand test data. Finally note, that all eight datasets contain more zeros than ones in thecurrent representation as mentioned in Section 5.1. Thus, the performance of the normalRBM would be even worse on the ﬂipped datasets while for the centering variants it wouldstay the same.Consistent with the experiments on small models, the results from nine of the ten realworld datasets clearly support the superiority of centered over normal RBMs and show thatcentering visible and hidden units in RBMs is important for yielding good models.One explanation why centering works has been provided by Montavon and M¨uller (2012),who found that centering leads to an initially better conditioned optimization problem.Furthermore, Cho et al. (2011) have shown that when the enhanced gradient is used fortraining the update directions for the weights are less correlated than when the standardgradient is used, which allows to learn more meaningful features.From our analysis in Section 3 we know that centered RBMs and normal RBMs belongto the same model class and therefore the reason why centered RBMs outperform normalRBMs can indeed only be due to the optimization procedure. Furthermore, one has tokeep in mind that in centered RBMs the variables mean values are explicitly stored inthe corresponding oﬀset parameters, or if the centered gradient is used for training normalRBMs the mean values are transferred to the corresponding bias parameters. This allowsthe weights to model second and higher order statistics right from the start, which is incontrast to normal binary RBMs where weights usually capture parts of the mean values.To support this statement empirically, we calculated the average weight and bias normsduring training of the RBMs with 500 hidden units on

MNIST using the standard and thecentered gradient. The results are shown in Figure 12, where it can be seen that the rowand column norms (see Figure 12(a) and 12(b)) of the weight matrix for dd bs , aa bs , and d0 are consistently smaller than for . At the same time the bias values (see Figure 12(c) and12(d)) for dd bs , aa bs , and d0 are much bigger than for , indicating that the weight vectorsof model information that could potentially be modeled by the bias values. Interestingly,the curves for all parameters of dd bs and aa bs show the same logarithmic shape, while for d0 and the visible bias norm does not change signiﬁcantly. It seems that the bias values didnot adapt properly during training. Comparing, d0 with dd bs and aa bs , the weight norms areslightly bigger and the visible bias is much smaller for d0 , indicating that it is not suﬃcientto center only the visible variables and that visible and hidden bias inﬂuence each other.This dependence of the hidden mean and visible bias can also be seen from Equation (11)where the transformation of the visible bias depends on the oﬀset of the hidden variables.33 a t a s e t aa b s dd b s d000 h i dd e nun i t s - l e a r n i n g r a t e . A d u l t - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) C O NN E CT - - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) D NA - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) M u s h r oo m - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) N I P S - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) O C R - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) R C V - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) W e b - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) h i dd e nun i t s - l e a r n i n g r a t e . A d u l t - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) C O NN E CT - - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) D NA - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) M u s h r oo m - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) N I P S - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) O C R - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) R C V - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) W e b - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) h i dd e nun i t s - l e a r n i n g r a t e . A d u l t - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) C O NN E CT - - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) D NA - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) M u s h r oo m - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) N I P S - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) O C R - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) R C V - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) W e b - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) T a b l e : M a x i m u m a v e r ag e LL o n t e s t d a t ao n v a r i o u s d a t a s e t s u s i n g P C D - , w i t h (t o p ) h i dd e nun i t s a nd a l e a r n i n g r a t e o f . , ( m i dd l e ) h i dd e nun i t s a nd a l e a r n i n g r a t e o f . nd ( b o tt o m ) h i dd e nun i t s a nd l e a r n i n g r a t e o f . ( s i n ce t r i a l s a r e a r e n o t e n o u g h t o p e r f o r m a s t a t i s t i c a l s i g n i ﬁ c a n ce t e s t w e s i m p l y und e r li n e d t h e b e s t r e s u l t) . a t a s e t aa b s dd b s d000 h i dd e nun i t s - l e a r n i n g r a t e . A d u l t - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) C O NN E CT - - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) D NA - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) M u s h r oo m - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) N I P S - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) O C R - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) R C V - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) W e b - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) h i dd e nun i t s - l e a r n i n g r a t e . A d u l t - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) C O NN E CT - - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) D NA - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) M u s h r oo m - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) N I P S - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) O C R - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) R C V - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) W e b - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) h i dd e nun i t s - l e a r n i n g r a t e . A d u l t - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) C O NN E CT - - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) D NA - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) M u s h r oo m - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) N I P S - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) O C R - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) R C V - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) W e b - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) - . ± . ( - . ) T a b l e : M a x i m u m a v e r ag e LL o n t r a i n i n g d a t ao n v a r i o u s d a t a s e t s u s i n g P C D - , w i t h (t o p ) h i dd e nun i t s a nd a l e a r n i n g r a t e o f . , ( m i dd l e ) h i dd e nun i t s a nd a l e a r n i n g r a t e o f . nd ( b o tt o m ) h i dd e nun i t s a nd a l e a r n i n g r a t e o f . . ( s i n ce t r i a l s a r e a r e n o t e n o u g h t o p e r f o r m a s t a t i s t i c a l s i g n i ﬁ c a n ce t e s t w e s i m p l y und e r li n e d t h e b e s t r e s u l t) . .8 Comparision to the Natural Gradient The results of the previous section indicate that one explanation for the better performanceof the centered gradient compared to the standard gradient is the decoupling of the bias andweight parameters. As described in Section 2.2 the natural gradient is independent of theparameterization of the model distribution. Thus it is also independent of how the meaninformation is stored in the parameters and should not suﬀer from the described bias-weightcoupling problem. For the same reason it is also invariant to changes of the representationof the data distribution (e.g. variable ﬂipping). That is why we expect the direction of thenatural gradient to be closer to the direction of the centered gradient than the direction ofthe standard gradient.To verify this hypothesis empirically, we trained small RBMs with 4 visible and 4 hiddenunits using the exact natural gradient on the 2x2

Bars & Stripes dataset. After each gradientupdate the diﬀerent exact gradients were calculated and the angle between the centered andthe natural gradient as well as the angle between the standard and the natural gradient wereevaluated. The results are shown in Figure 13 where Figure 13(a) shows the evolution ofthe average LL when the exact natural gradient is used for training with diﬀerent learningrates. Figure 13(b) shows the average angles between the diﬀerent gradients during trainingwhen the natural gradient is used for training with a learning rate of 0 .

1. The angle betweencentered and natural gradient is consistently much smaller than the angle between standardand natural gradient. Comparable results can also be observed for the

Shifting & Bars dataset and when the standard, or centered gradient is used for training.Notice, how fast the natural gradient reaches a value very close to the theoretical LL up-per bound of − .

86 even for a learning rate of 0 .

1. This veriﬁes empirically the theoreticalstatement that the natural gradient is clearly the update direction of choice, which shouldbe used if it is tractable. To further emphasize how quick the natural gradient converges,we compared the average LL evolution of the standard, centered and natural gradient, asshown in Figure 13(c). Although much slower than the natural gradient, the centered gradi-ent reaches the theoretical upper bound of the LL. The standard gradient seems to saturateon a much smaller value, showing again the inferiority of the standard gradient even if it iscalculated exactly and not only approximated.To verify that the better performance of natural and centered gradient is not only dueto larger gradients resulting in bigger step sizes, we also analyzed the LL evolution for thenatural and centered gradient when they are scaled to the norm of the normal gradient beforeupdating the parameters. The results are shown in Figure 13(c). The natural gradient stilloutperforms the other methods but it becomes signiﬁcantly slower than when used with itsoriginal norm. The reason why the norm of the natural gradient is somehow optimal canbe explained by the fact that for distributions of the exponential family a natural gradientupdate on the LL becomes equivalent to performing a Newton step on the LL. In this sense,the Fisher metric results in an automatic step size adaption such that even a learning rateof 1 .

50 100 150 200 gradient update a v e r ageno r m o ft he w e i gh t m a t r i xc o l u m s aa bs dd bs d (a) Norm of the weight matrix colums gradient update . . . . . . . . . . a v e r ageno r m o ft he w e i gh t m a t r i x r o w s aa bs dd bs d (b) Norm of the weight matrix rows gradient update a v e r ageno r m o ft he v i s i b l eb i a s aa bs dd bs d (c) Norm of the visible bias gradient update a v e r ageno r m o ft heh i ddenb i a s aa bs dd bs d (d) Norm of the hidden bias Figure 12: Evolution of the average Euclidean norm of the parameters of the RBMs with500 hidden units trained on

MNIST .gradient. Therefore, the worse performance of the normal gradient does not result from thelength but the direction of the gradient. To conclude, these results support our assumptionthat the centered gradient is closer to the natural gradient and that it is therefore preferableover the standard gradient.

When centering was ﬁrstly applied to DBMs by Montavon and M¨uller (2012) the authorsonly saw an improvement of centering for locally connected DBMs. Due to our observationsfor RBMs and the structural similarity between RBMs and DBMs (a DBM is an RBM with37

50 100 150 200 gradient update − − − − − l og - li k e li hood natural gradient, η = 1 . natural gradient, η = 0 . natural gradient, η = 0 . (a) LL evolution of the natural gradient gradient update a v e r ageang l e t o t hena t u r a l g r ad i en t standard gradient, η = 0 . centered gradient ( dd b ) , η = 0 . (b) Angle between standard natural gradient gradient update − − − − − l og - li k e li hood natural gradient, η = 0 . standardgradient, η = 0 . centeredgradient ( dd b ) , η = 0 . natural gradientnormalized, η = 0 . centeredgradient ( dd b ) , normalized, η = 0 . (c) Comparision of the LL evolution of standard, natural and centered gradient Figure 13: Comparison of the centered gradient, standard gradient, and natural gradientfor RBMs with 4 visible and 4 hidden units trained on

Bars & Stripes 2x2 . (a)The average LL evolution over 25 trials when the natural gradient is used fortraining with diﬀerent learning rates, (b) the average angle over 25 trials betweenthe natural and standard gradient as well as natural and centered gradient whena learning rate of 0.1 is used, and (c) average LL evolution over 25 trials wheneither the natural gradient, standard gradient or centered gradient is used fortraining.restricted connections and partially unknown input data as discussed in Section 3.2) weexpect that the beneﬁt of centering carries over to DBMs. To verify this assumption and38mpirically investigate the diﬀerent centering variants in DBMs we performed extensiveexperiments on the big datasets listed in Section 5.1.Training the models and evaluating the lower bound of the LL was performed as orig-inally proposed for normal DBMs in Salakhutdinov and Hinton (2009). The authors alsoproposed to pre-train DBMs in a layer-wise fashion based on RBMs (Hinton and Salakhut-dinov, 2012). In our experiments we trained all models with and without pre-training toinvestigate the eﬀect of pre-training in both normal and centered DBMs. For pre-trainingwe used the same learning rate and the same oﬀset type as in the ﬁnal DBM models. No-tice, that we keep using the term “average LL” although it is precisely speaking only thelower bound of the average LL, which has been shown to be rather tied (Salakhutdinov andHinton, 2009). For the estimation of the partition function we again used AIS where wedoubled the number of intermediate temperatures compared to the RBM setting to 29000.We continue using the short hand notation introduced for RBMs also for DBMs with theonly diﬀerence that we add a third letter to indicate the oﬀset used in the second hiddenlayer, such that 000 corresponds to the normal binary DBM, and ddd bs and aaa bs correspondsto the centered DBMs using the data mean and the average of data and model mean asoﬀsets, respectively. Due to the large number of experiments and the high computationalcost – especially for estimating the LL – the experiments where repeated only 10 times andwe focused our analysis only on normal DBMs (000) and fully centered DBMs ( ddd bs , aaa bs ).Again, we begin our analysis with the MNIST dataset on which we trained normal andcentered DBMs with 500 hidden units in the ﬁrst and 500 units in the second hidden layer.Training was done using PCD-1 with a batch size of 100, a learning rate of 0 .

001 and in caseof centering a sliding factor of 0 .

01 for the extensive amount of 1000 epochs (600000 gradientupdates). The evolution of the average LL on the test data without pre-training is shownin Figure 14(a) where the evolution of the average LL on the training data is not shownsince it is almost equivalent. Both centered DBMs reach signiﬁcantly higher LL values witha much smaller standard deviation between the trials than the normal DBMs (indicated bythe error bars) and ddd bs performs slightly better than aaa bs . These ﬁndings are diﬀerent tothe observations of Montavon and M¨uller (2012) who reported only an improvement of themodel through centering for locally connected DBMs. This might be due to the diﬀerenttraining setup (e.g. learning rate, batch size, shorter training time or approximation of thedata dependent part of the LL gradient by Gibbs sampling instead of optimizing the lowerbound of the LL) Figure 14(b) shows the evolution of the average LL on the test data of thesame models with pre-training for 120000 gradient updates (200 epochs). The evolution ofthe average LL on the training data was again almost equivalent. ddd bs has approximatelythe same performance with and without pre-training but aaa bs now has similar performanceas ddd bs . Pre-training allows 000 to reach better LL than without pre-training, however it isstill signiﬁcantly worse compared to the centered DBMs with or without pre-training. Bycomparing the results with the results of RBMs with 500 hidden units trained on MNIST shown in Figure 9(a) we see that all DBMs reach higher LL values than the correspondingRBM models.The higher layer representations in DBMs highly depend on the data driven lower layerrepresentations. Thus we expect to see a qualitative diﬀerence between the second layerreceptive ﬁelds or ﬁlters given by the columns of the weight matrices in centered and normalDBMs. We did not visualize the ﬁlters of the ﬁrst layer since all models showed the well39 gradient update − − − − − − − − − − l og - li k e li hood aaa bs ddd bs (a) Without pre-training gradient update − − − − − − − − − − l og - li k e li hood aaa bs ddd bs (b) With pre-training Figure 14: Evolution of the average LL on the test data of the

MNIST dataset for DBMswith 500x500 hidden units. The diﬀerent variants aaa bs , ddd bs and 000 were eithertrained (a) without pre-training or (b) when each DBM layer was pre-trained for120.000 gradient updates (200 epochs). In both cases PCD-1 with a learning rateof η = 0 .

001 and for centering a sliding factor of 0.01 was used. The error barsindicate the standard deviation of the average LL over the 10 trials. We skippedevaluating the initial model and (b) starts after the 200 epochs of pre-trainingto roughly account for the computation overhead of pre-trainingknown stroke like structure, which can be seen for RBMs in the review paper by Fischerand Igel (2014) for example. We visualized the ﬁlters of the second layer by linearly backprojecting the second layer ﬁlters into the input space given by the matrix product of ﬁrstand second layer weight matrix. The corresponding back projected second layer ﬁlters for000 and ddd bs are shown Figure 15(a) and (b), respectively. It can be seen that many secondlayer ﬁlters of 000 are roughly the same and thus highly correlated. Moreover, they seemto represent some kind of mean information. Whereas the ﬁlters for ddd bs have much morediverse and less correlated structures than the ﬁlters of the normal DBM. When pre-trainingis used the ﬁlters for 000 become more diverse and the ﬁlters of both 000 and ddd bs becomemore selective as can be seen in Figure 15(c) and (d), respectively. The eﬀect of the diversitydiﬀerence of the ﬁlters can also be seen from the average activation of the second hiddenlayer. As shown in Figure 16(a) without pre-training the average activation of the hiddenunits of ddd bs given the training data is approximately 0.5 for all units while for aaa bs it isa bit less balanced and for 000 most of the units tend to be either active or inactive allthe time. The results in Figure 16(b) illustrate that the average activity for all modelsbecome less balanced when pre-training is used, which also reﬂects the higher selectivityof the ﬁlters as shown in Figure 15(c) and (d). While the second layer hidden activities of ddd bs and aaa bs stay in a reasonable range, they become extremely selective for 000 where300 out of 500 units are inactive all the time. Therefore, the ﬁlters, average activation and40 a) 000 without pre-training (b) ddd b without pre-training(c) 000 with pre-training (d) ddd b with pre-training Figure 15: Random selection of 100 linearly projected ﬁlters of the second hidden layer for(a) 000 and (b) ddd b without pre-training and (c) 000 and (d) ddd b without 200epochs pre-training. The ﬁlters have been normalized independently such thatthe structure can be seen better.evolution of the LL indicate that that normal RBMs have diﬃculties in making use of thesecond hidden layer with and without pre-training.We continue our analysis with experiments on the Caltech dataset on which we againtrained normal and centered DBMs with 500 hidden units on the ﬁrst and second hidden41

100 200 300 400 500 hidden unit . . . . . . a v e r agea c t i v a t i on aaa bs ddd bs (a) Without pre-training hidden unit . . . . . . a v e r agea c t i v a t i on aaa bs ddd bs (b) With pre-training Figure 16: Decreasingly ordered average hidden activity on the training data for the diﬀer-ent models (a) without pre-training and (b) with pre-traininglayer. Training was also done using PCD-1 with a batch size of 100, a learning rate of0 .

001 and in case of centering a sliding factor of 0 .

01. Since the training data has only 41batches the models were trained for the extensive amount of 10000 epochs (410000 gradientupdates). Figure 17 shows the average LL on the test data (a) without and (b) with 500epochs pre-training. In addition, Figure 17(c) and (d) show the corresponding averageLL on the training data demonstrating that all models overﬁtted to the

Caltech dataset.The results are consistent with the ﬁndings for

MNIST , that is 000 performs worse thancentering on training and test data independent of whether pre-training is used or not.Furthermore, aaa bs seems to perform slightly worse than ddd bs without pre-training, whilethe performance becomes equivalent if pre-training is used. But in contrast to the resultsof MNIST , on

Caltech all methods perform worse with pre-training. This negative eﬀect ofpre-training becomes even worse when the number of pre-training epochs is increased. Inthe case of 2000 epochs of pre-training for example, ddd bs and aaa bs still perform better than000 but the maximal average LL among all models, which was reached by ddd bs was only-98.1 for the training data and -141.4 for the test data, compared to -90.4 and -124.0 when500 epochs of pre-training were used and -87.3 and -118.8 when no pre-training was used.Without pre-training the LL values are comparable to the results when an RBM with 500hidden units is trained on Caltech as shown in Figure 10, illustrating that in terms of theLL a DBM does not necessarily perform better than an RBM. We also visualized the ﬁltersand plotted the average hidden activities for the training data of

Caltech , which lead to thesame conclusions as the results for

MNIST and are not shown for this reason.Finally, we also performed experiments on the eight additional binary datasets de-scribed in Section 5.1 using the same training setup as for the corresponding RBM exper-iments. That is, the DBMs with 200x200 hidden units were trained for 5000 epochs withPCD-1, a batch size of 100, a learning rate of 0 .

01, and in the case of centering a sliding42 gradient update − − − − − − l og - li k e li hood aaa bs ddd bs (a) Test LL without pre-training gradient update − − − − − − l og - li k e li hood aaa bs ddd bs (b) Test LL with 500 epochs pre-training gradient update − − − − − − l og - li k e li hood aaa bs ddd bs (c) Train LL without pre-training gradient update − − − − − − l og - li k e li hood aaa bs ddd bs (d) Train LL with 500 epochs pre-training Figure 17: Evolution of the average LL on the

Caltech dataset for the DBMs aaa bs , ddd bs and000 with 500x500 hidden units. (a) test LL (c) train LL without pre-training, (b)test LL and (d) train LL with 500 epochs (20500 gradients updates) pre-training.The models were trained using PCD-1 with a batch size of 100, a sliding factorof 0.01 and a learning rate of η = 0 .

001 was used. The error bars indicatethe standard deviation of the LL over the 10 trials. We skipped evaluating theinitial model and (b) and (d) start after the 500 epochs of pre-training to roughlyaccount for the computation overhead of pre-trainingfactor of 0 .

01. The LL was evaluated every 50th epoch and in the case of pre-training themodels were pre-trained for 200 epochs. Table 11 shows the maximum average LL on thetest data (top) without pre-training and (bottom) with pre-training. Without pre-training43 ataset dd bs No pre-trainingadult -15.54 ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± Table 11: Maximum average LL on test data on various datasets for DBMs with 200 hiddenunits on the ﬁrst and second layer. For training (top) without pre-training and(bottom) with 200 epochs pre-training PCD-1, with a learning rate of 0 .

01, batchsize of 100 was used. (The best result is underlined).the results are consistent with the ﬁndings for RBMs that ddd b performs better than 000on all datasets except for RCV1 where 000 performs slightly better. The LL values forthe DBMs are comparable but not necessarily better than the corresponding LL values forRBMs, which are shown in Table 9. In the case of the web datasets for example the DBMseven perform worse than the RBM models. When pre-training is used the performance ofall models, centered or normal, is worse than the performance of the corresponding DBMswithout pre-training. For completeness Table 12 shows the maximum average LL for thetraining data leading to the same conclusion as the test data. To summarize, the experi-ments described in this section show that centering leads to higher LL values for DBMs.While pre-training leads to more selective ﬁlters in general it often is even harmful for themodel quality. The beneﬁt of centering in feed forward neural networks for supervised tasks has alreadybeen shown by Schraudolph (1998). In this section we want to analyze centering in a specialkind of unsupervised feed forward neural networks, namely centered AEs as introduced inSection 4.1. We therefore trained normal and centered three layer AEs on the ten big44 ataset dd bs No pre-trainingadult -14.65 ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± Table 12: Maximum average LL on training data on various datasets for DBMs with 200hidden units on the ﬁrst and second layer. For training (top) without pre-trainingand (bottom) with 200 epochs pre-training PCD-1, with a learning rate of 0 . . . .

01. We used a default learning rate of 0 . . .

01 when the AEsconverged rather quickly ( <

500 epochs). Each experiment was repeated 10 times and wecalculated the average maximal reached cost value on test data, the corresponding standarddeviation and the average number of epochs needed for convergence.The results are given in Table 13 showing that except for the RCV1 dataset centeredAEs perform clearly better in terms of the average reached cost value on the test data thannormal AEs. On the training data normal AEs only perform slightly better on datasets45 ataset - Learning rate dd bs Test datamnist - 0.1 50.21472 ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± ± Table 13: Average maximal reached cost value with standard deviation on test and trainingdata of various datasets for centered and normal three layer AEs with sigmoidunits, cross entropy cost function and half the number of output dimensions thaninput dimensions. The average number of epochs till convergence is given inbrackets. 46here both models reached very small cost values anyway. We did no show the results forthe validation sets since they are almost equivalent to the results for test data.Interestingly, the result that centering only performs worse on the RCV1 dataset is fullyconsistent with the ﬁndings for RBMs and DBMs. We inspected the RCV1 dataset andits ﬁrst and second order statistics but did not ﬁnd anything conspicuous compared to theother datasets that might explain why for this particular dataset centering is not beneﬁcial.However, learning is much slower for this dataset when centering is used, which can also beseen by comparing the results for learning rate 0 . .

7. Conclusion

This work discusses centering in RBMs and DBMs, where centering corresponds to sub-tracting oﬀset parameters from visible and hidden variables. Our theoretical analysis yieldsthe following results1. Centered ANNs and normal ANNs are diﬀerent parameterizations of the same modelclass, which justiﬁes the use of centering in arbitrary ANNs (Section 4).2. The LL gradient of centered RBMs/DBMs is invariant under simultaneous ﬂip of dataand oﬀsets, for any oﬀset value in the range of zero to one. This leads to a desiredinvariance of the LL performance of the model under changes of data representation(Appendix A).3. Training a centered RBM/DBM can be reformulated as training a normal RBM/DBMwith a new parameter update, which we refer to as centered gradient (Section 3.1 andAppendix B).4. From the new formulation follows that the enhanced gradient is a particular form ofcentering. That is, the centered gradient becomes equivalent to the enhanced gradientby setting the visible and hidden oﬀsets to the average over model and data mean ofthe corresponding variable (Section 3.1 and Appendix C).Our numerical analysis yielded the following results1. Optimal performance of centered DBMs/RBMs is achieved when both, visible andhidden variables are centered and the oﬀsets are set to their expectations under dataor model distribution.2. Centered RBMs/DBMs reach signiﬁcantly higher LL values than normal binary RBMs/DBMs.As an example, centered RBMs with 500 hidden units achieved an average test LL of-76 on

MNIST compared to a reported value of -84 for carefully trained normal binaryRBMs (Salakhutdinov, 2008; Salakhutdinov and Murray, 2008; Tang and Sutskever,2011; Cho et al., 2013b).3. Using the model expectation (as for the enhanced gradient for example) can lead toa severe divergence of the LL when PCD or PT c is used for sampling. This is causedby the correlation in oﬀset and gradient approximation as discussed in Section 6.4.47. Initializing the bias parameters such that the RBM/DBM/AE is initially centered(i.e. b i = σ − ( (cid:104) x i (cid:105) )) can already improve the performance of a normal binary RBM.However, this initialization leads to a performance still worse than the performanceof a centered RBM as shown in this work and is therefore no alternative to centering.5. The divergence can be prevented when an exponentially moving average for the ap-proximations of the oﬀset values is used, which also stabilized the training for othercentering variants especially when the mini batch size is small.6. Training centered RBMs/DBMs leads to smaller weight norms and larger bias normscompared to normal binary RBMs/DBMs. This supports the hypothesis that whenusing the standard gradient the mean value is modeled by both weights and biases,while when using the centered gradient the mean values are explicitly modeled by thebias parameters.7. The direction of the centered gradient is closer to the natural gradient than that of thestandard gradient and the natural gradient is extremely eﬃcient for training RBMs iftractable.8. Centered DBMs reach higher LL values than normal DBMs independent of whetherpre-training is used or not. Thus pre-training cannot be considered as a replacementfor centering.9. While pre-training helped normal DBMs on MNIST we did no observe an improve-ment through pre-training for centered DBMs. Furthermore, on all other datasetsthan

MNIST pre-training led to lower LL values and the results became worse aslonger pre-training was performed for normal and centered DBMs.10. The visual inspection of the learned ﬁlters, the average second hidden layer activitiesand reached LL values suggest that normal DBMs have diﬃculties in making use ofthe third and higher layers.11. Centering also improved the performance in terms of the optimized loss for AEs, whichsupports our assumption that centering is beneﬁcial not only for probabilistic modelslike RBMs and DBMs.Based on our results we recommend to center all units in the network using the data meanand to use an exponentially moving average if the mini-batch size is rather small ( <

100 forstochastic models and <

10 for deterministic models). Furthermore, we do not recommendto use pre-training of DBMs since it often worsen the results.All results clearly support the superiority of centered RBMs/DBMs and AEs, which webelieve will also extend to other models. Future work might focus on centering in otherprobabilistic models such as the neural auto-regressive distribution estimator (Larochelleand Murray, 2011) or recurrent neural networks such as long short-term memory (Hochreiterand Schmidhuber, 1997).

Acknowledgments

48e would like to thank Nan Wang for helpful discussions. We also would like to thankTobias Glasmachers for his suggestions on the natural gradient part. Asja Fischer wassupported by the German Federal Ministry of Education and Research within the NationalNetwork Computational Neuroscience under grant number 01GQ0951 (Bernstein FokusLearning behavioral models: From human experiment to technical assistance).

Appendix A. Proof of Invariance for the Centered RBM Gradient

In the following we show that the gradient of centered RBMs is invariant to ﬂips of thevariables if the corresponding oﬀset parameters ﬂip as well. Since training a centered RBMis equivalent to training a normal binary RBM using the centered gradient (see Appendix B),the proof also holds for the centered gradient.We begin by formalizing the invariance property in the following deﬁnitions.

Deﬁnition 1

Let there be an RBM with visible variables X = ( X , . . . , X N ) and hiddenvariables H = ( H , . . . , H M ) . The variables X i and H j are called ﬂipped if they take thevalues ˜ x i = 1 − x i and ˜ h j = 1 − h j for any given states x i and h j . Deﬁnition 2

Let there be a binary RBM with parameters θ and energy E and anotherbinary RBM with parameters ˜ θ and energy ˜ E where some of the variables are ﬂipped, suchthat E ( x , h ) = ˜ E (˜ x , ˜ h ) , (18) for all possible states ( x , h ) and corresponding ﬂipped states (˜ x , ˜ h ) , where ˜ x i = 1 − x i , ˜ h j = 1 − h j if X i and H j are ﬂipped and ˜ x i = x i , ˜ h j = h j otherwise. The gradient ∇ θ iscalled ﬂip-invariant or invariant to the ﬂips of the variables if (18) still holds afterupdating θ and ˜ θ to θ + η ∇ θ and ˜ θ + η ∇ ˜ θ , respectively, for an arbitrary learning rate η . We can now state the following theorem.

Theorem 3

The gradient of centered RBMs is invariant to ﬂips of arbitrary variables X i , . . . X i r and H j , . . . H j s with { i , . . . , i r } ⊂ { , . . . , N } and { j , . . . , j s } ⊂ { , . . . , M } ifthe corresponding oﬀset parameters µ i , . . . µ i r and λ j , . . . λ j s ﬂip as well that is if ˜ x i = 1 − x i implies ˜ µ i = 1 − µ i and ˜ h j = 1 − h j implies ˜ λ j = 1 − λ j . Proof

Let there be a centered RBM with parameters θ and energy E and another centeredRBM where some of the variables are ﬂipped with parameters ˜ θ and energy ˜ E , such that E ( x , h ) = ˜ E (˜ x , ˜ h ) for any ( x , h ) and corresponding (˜ x , ˜ h ). W.l.o.g. it is suﬃcient to showthe invariance of the gradient when ﬂipping only one visible variable X i , one hidden variable H j , or both of them, since each derivative with respect to a single parameter can only beaﬀected by the ﬂips of at most one hidden and one visible variable, which follows from thebipartite structure of the model.We start by investigating how the energy changes when the variables are ﬂipped. Forthis purpose we rewrite the energy in Equation (4) in summation notation given by E ( x , h ) (4) = − (cid:88) i ( x i − µ i ) b i − (cid:88) j ( h j − λ j ) c j − (cid:88) ij ( x i − µ i ) w ij ( h j − λ j ) . (19)49o indicate a variable ﬂip we introduce the binary parameter f i that takes the value 1 ifthe corresponding variable X i and the corresponding oﬀset µ i are ﬂipped and 0 otherwise.Similarly, g j = 1 if H j and λ j are ﬂipped and g j = 0 otherwise. Now we use E f i =1 ∧ g j =1 todenote the terms of the energy (19) that are aﬀected by a ﬂip of the variables X i and H j .Analogously, E f i =1 ∧ g j =0 and E f i =0 ∧ g j =1 denote the terms aﬀected by a ﬂip of either X i or H j respectively. For ﬂipped values ˜ x i , ˜ h j these terms get E f i =1 ∧ g j =1 (19) = − (˜ x i − ˜ µ i ) b i − (˜ x i − ˜ µ i ) (cid:88) k (cid:54) = j w ik ( h k − λ k ) − (˜ h j − ˜ λ j ) c j − (˜ h j − ˜ λ j ) (cid:88) u (cid:54) = i w uj ( x u − µ u ) − (˜ x i − ˜ µ i ) w ij (˜ h j − ˜ λ j )= − ((1 − x i ) − (1 − µ i )) b i − ((1 − x i ) − (1 − µ i )) (cid:88) k (cid:54) = j w ik ( h k − λ k ) − ((1 − h j ) − (1 − λ j )) c j − ((1 − h j ) − (1 − λ j )) (cid:88) u (cid:54) = i w uj ( x u − µ u ) − ((1 − x i ) − (1 − µ i )) w ij ((1 − h j ) − (1 − λ j ))= ( x i − µ i ) b i + ( x i − µ i ) (cid:88) k (cid:54) = j w ik ( h k − λ k )( h j − λ j ) c j + ( h j − λ j ) (cid:88) u (cid:54) = i w uj ( x u − µ u ) − ( x i − µ i ) w ij ( h j − λ j ) , and analogously E f i =1 ∧ g j =0 (19) = − (˜ x i − ˜ µ i ) b i − (˜ x i − ˜ µ i ) (cid:88) j w ij ( h j − λ j )= ( x i − µ i ) b i + ( x i − µ i ) (cid:88) j w ij ( h j − λ j ) , and E f i =0 ∧ g j =1 (19) = ( h j − λ j ) c j + ( h j − λ j ) (cid:88) i w ij ( x i − µ i ) . From the fact that the terms diﬀer from the corresponding terms in (19) only in thesign and that E ( x , h ) = ˜ E (˜ x , ˜ h ) holds for any ( x , h ) and corresponding (˜ x , ˜ h ), it followsthat the parameters ˜ θ must be given by˜ w f i ∧ g j ij = ( − f i + g j w ij , (20)˜ b f i ∧ g j i = ( − f i b i , (21)˜ c f i ∧ g j j = ( − g j c j , (22)˜ µ f i ∧ g j i = µ i , ˜ λ f i ∧ g j j = λ j . X i and H j are ﬂippedthe derivatives w.r.t. w ij , b i , , and c j are given by ∇ ˜ w f i =1 ∧ g j =1 ij = (cid:104) (1 − x i − (1 − µ i ))(1 − h j − (1 − λ j )) (cid:105) d −(cid:104) (1 − x i − (1 − µ i ))(1 − h j − (1 − λ j )) (cid:105) m = (cid:104) ( − x i + µ i )( − h j + λ j ) (cid:105) d − (cid:104) ( − x i + µ i )( − h j + λ j ) (cid:105) m = (cid:104) ( x i − µ i )( h j − λ j ) (cid:105) d − (cid:104) ( x i − µ i )( h j − λ j ) (cid:105) m = ( − ∇ w ij , ∇ ˜ b f i =1 ∧ g j =1 i = (cid:104) − x i − (1 − µ i ) (cid:105) d − (cid:104) − x i − (1 − µ i ) (cid:105) m = −(cid:104) x i (cid:105) d + µ i + (cid:104) x i (cid:105) m − µ i = ( − ∇ b i , ∇ ˜ c f i =1 ∧ g j =1 j = (cid:104) − h j − (1 − λ j ) (cid:105) d − (cid:104) − h j − (1 − λ j ) (cid:105) m = −(cid:104) h j (cid:105) d + λ j + (cid:104) h j (cid:105) m − λ j = ( − ∇ c j . If X i is ﬂipped they are given by ∇ ˜ w f i =1 ∧ g j =0 ij = (cid:104) (1 − x i − (1 − µ i ))( h j − λ j ) (cid:105) d −(cid:104) (1 − x i − (1 − µ i ))( h j − λ j ) (cid:105) m = (cid:104) ( − x i + µ i )( h j − λ j ) (cid:105) d − (cid:104) ( − x i + µ i )( h j − λ j ) (cid:105) m = − ( (cid:104) ( x i − µ i )( h j − λ j ) (cid:105) d − (cid:104) ( x i − µ i )( h j − λ j ) (cid:105) m )= ( − ∇ w ij , ∇ ˜ b f i =1 ∧ g j =0 i = ∇ ˜ b f i =1 ∧ g j =1 i = ( − ∇ b i , ∇ ˜ c f i =1 ∧ g j =0 j = ∇ ˜ c f i =0 ∧ g j =0 i = ( − ∇ c j , and due to the symmetry of the model the derivatives if H j is ﬂipped are given by ∇ ˜ w f i =0 ∧ g j =1 ij = ( − ∇ w ij , ∇ ˜ b f i =0 ∧ g j =1 i = ( − ∇ b i , ∇ ˜ c f i =0 ∧ g j =1 j = ( − ∇ c j . Comparing the results with Equations (20) - (22) shows that the gradient underlies thesame sign changes under variable ﬂips as the parameters. Thus, it holds for the updatedparameters that ˜ w f i ∧ g j ij + η ∇ ˜ w f i ∧ g j ij (20) = ( − f i + g j ( w ij + η ∇ w ij ) , ˜ b f i ∧ g j i + η ∇ ˜ b f i ∧ g j i (21) = ( − f i + g j ( b i + η ∇ b i ) , ˜ c f i ∧ g j j + η ∇ ˜ c f i ∧ g j j (22) = ( − f i + g j ( c j + η ∇ c j ) , E ( x , h ) = ˜ E (˜ x , ˜ h ) is still guaranteed and thus that the gradient of centeredRBMs is ﬂip-invariant according to Deﬁnition 2.Theorem 3 holds for any value from zero to one for µ i and λ j , if it is guaranteed thatthe oﬀsets ﬂip simultaneously with the corresponding variables. In practice one wantsthe model to perform equivalently on any ﬂipped version of the dataset without knowingwhich version is presented. This holds if we set the oﬀsets to the expectation value of thecorresponding variables under any distribution, since when µ i = (cid:80) x i p ( x i ) x i , ﬂipping X i leads to ˜ µ i = (cid:80) x i p ( x i ) (1 − x i ) = 1 − (cid:80) x i p ( x i ) x i = 1 − µ i and similarly for λ j , h j .Due to the structural similarity this proof also holds for DBMs. By replacing x by h l (which denotes the state of the variables in the l th hidden layer) and h by h l+1 (denotingthe state of the variables in the l + 1th hidden layer) we can prove the invariance propertyfor the derivatives of the parameters in layer l and l + 1. Appendix B. Derivation of the Centered Gradient

In the following we show that the gradient of centered RBMs can be reformulated as analternative update for the parameters of a normal binary RBM, which we name ‘centeredgradient’.A normal binary RBM with energy E ( x , h ) = − x T b − c T h − x T Wh can be transformedinto a centered RBM with energy ˜ E ( x , h ) = − ( x − µ ) T ˜ b − ˜ c T ( h − λ ) − ( x − µ ) T ˜ W ( h − λ )by the following parameter transformation˜ W (10) = W , (23)˜ b (11) = b + W λ , (24)˜ c (12) = c + W T µ , (25)which guarantees that E ( x , h ) = ˜ E ( x , h ) + const for all ( x , h ) ∈ { , } n + m and thus thatthe modeled distribution stays the same.Updating the parameters of the centered RBM according to Eq. (7) – (9) with a learningrate η leads to an updated set of parameters ˜ W u , ˜ b u , ˜ c u given by˜ W u (7) = ˜ W + η ( (cid:104) ( x − µ )( h − λ ) T (cid:105) d − (cid:104) ( x − µ )( h − λ ) T (cid:105) m ) , (26)˜ b u (8) = ˜ b + η ( (cid:104) x (cid:105) d − (cid:104) x (cid:105) m ) , (27)˜ c u (9) = ˜ c + η ( (cid:104) h (cid:105) d − (cid:104) h (cid:105) m ) . (28)One can now transform the updated centered RBM back to a normal RBM by applyingthe inverse transformation to the updated parameters, which ﬁnally leads to the centered52radient. W u (23) = ˜ W u (23) , (26) = W + η ( (cid:104) ( x − µ )( h − λ ) T (cid:105) d ) − (cid:104) ( x − µ )( h − λ ) T (cid:105) m ) (cid:124) (cid:123)(cid:122) (cid:125) (13) = ∇ c W , (29) b u (24) = ˜ b u − W u λ (27) , (29) = ˜ b + η ( (cid:104) x (cid:105) d − (cid:104) x (cid:105) m ) − ( W + η ∇ c W ) λ (24) = b + W λ + η ( (cid:104) x (cid:105) d − (cid:104) x (cid:105) m ) − W λ − η ∇ c W λ = b + η ( (cid:104) x (cid:105) d − (cid:104) x (cid:105) m − ∇ c W λ ) (cid:124) (cid:123)(cid:122) (cid:125) (14) = ∇ c b , (30) c u (25) = ˜ c u − W Tu µ (28) , (29) = ˜ c + η ( (cid:104) h (cid:105) d − (cid:104) h (cid:105) m ) − ( W + η ∇ c W ) µ (25) = c + W µ + η ( (cid:104) h (cid:105) d − (cid:104) h (cid:105) m ) − W µ − η ∇ c W µ = c + η ( (cid:104) h (cid:105) d − (cid:104) h (cid:105) m − ∇ c W µ ) (cid:124) (cid:123)(cid:122) (cid:125) (15) = ∇ c c . (31)The braces in Equation (29) - (31) mark the centered gradient given by Equations (13) - (15). Appendix C. Enhanced Gradient as Special Case of the CenteredGradient

In the following we show that the enhanced gradient can be derived as a special case of thecentered gradient. By setting µ = ( (cid:104) x (cid:105) d + (cid:104) x (cid:105) m ) and λ = ( (cid:104) h (cid:105) d + (cid:104) h (cid:105) m ) we get ∇ c W (13) = (cid:104) ( x − µ )( h − λ ) T (cid:105) d − (cid:104) ( x − µ )( h − λ ) T (cid:105) m = (cid:104) xh T (cid:105) d − (cid:104) x (cid:105) d λ T − µ (cid:104) h T (cid:105) d + µλ T − (cid:104) xh T (cid:105) m + (cid:104) x (cid:105) m λ T + µ (cid:104) h T (cid:105) m − µλ T = (cid:104) xh T (cid:105) d − (cid:104) x (cid:105) d ( (cid:104) h (cid:105) d + (cid:104) h (cid:105) m ) T −

12 ( (cid:104) x (cid:105) d + (cid:104) x (cid:105) m ) (cid:104) h T (cid:105) d −(cid:104) xh T (cid:105) m + 12 (cid:104) x (cid:105) m ( (cid:104) h (cid:105) d + (cid:104) h (cid:105) m ) T + 12 ( (cid:104) x (cid:105) d + (cid:104) x (cid:105) m ) (cid:104) h T (cid:105) m = (cid:104) xh T (cid:105) d − (cid:104) x (cid:105) d (cid:104) h T (cid:105) d − (cid:104) x (cid:105) d (cid:104) h T (cid:105) m − (cid:104) x (cid:105) d (cid:104) h T (cid:105) d − (cid:104) x (cid:105) m (cid:104) h T (cid:105) d −(cid:104) xh T (cid:105) m + 12 (cid:104) x (cid:105) m (cid:104) h T (cid:105) d + 12 (cid:104) x (cid:105) m (cid:104) h T (cid:105) m + 12 (cid:104) x (cid:105) d (cid:104) h T (cid:105) m + 12 (cid:104) x (cid:105) m (cid:104) h T (cid:105) m = (cid:104) xh T (cid:105) d − (cid:104) x (cid:105) d (cid:104) h T (cid:105) d − (cid:104) xh T (cid:105) m + (cid:104) x (cid:105) m (cid:104) h T (cid:105) m = (cid:104) ( x − (cid:104) x (cid:105) d )( h − (cid:104) h (cid:105) d ) T (cid:105) d − (cid:104) ( x − (cid:104) x (cid:105) m )( h − (cid:104) h (cid:105) m ) T (cid:105) m (1) = ∇ e W , ∇ c b (14) = (cid:104) x (cid:105) d − (cid:104) x (cid:105) m − ∇ e W λ = (cid:104) x (cid:105) d − (cid:104) x (cid:105) m − ∇ e W

12 ( (cid:104) h (cid:105) d + (cid:104) h (cid:105) m ) (2) = ∇ e b , ∇ c c (15) = (cid:104) h (cid:105) d − (cid:104) h (cid:105) m − ∇ e W T µ = (cid:104) h (cid:105) d − (cid:104) h (cid:105) m − ∇ e W T

12 ( (cid:104) x (cid:105) d + (cid:104) x (cid:105) m ) (3) = ∇ e c . References

S. Amari. Natural gradient works eﬃciently in learning.

Neural Computation , 10(2):251–276, 1998.S. Amari, K. Koji, and N. Hiroshi. Information geometry of Boltzmann machines.

IEEETransactions on Neural Networks , 3(2):260–271, 1992.Y. Bengio. Learning deep architectures for AI.

Foundations and Trends in Machine Learn-ing , 21(6):1601–1621, 2009.Y. Bengio, P. Lamblin, D. Popovici, H. Larochelle, et al. Greedy layer-wise training of deepnetworks.

Advances in neural information processing systems , 19:153, 2007.K. Cho, T. Raiko, and A. Ilin. Parallel tempering is eﬃcient for learning restricted Boltz-mann machines. In

Proceedings of the International Joint Conference on Neural Networks(IJCNN) , pages 3246–3253. IEEE Press, 2010.K. Cho, T. Raiko, and A. Ilin. Enhanced gradient and adaptive learning rate for train-ing restricted Boltzmann machines. In

Proceedings of the International Conference onMachine Learning , pages 105–112. Omnipress, 2011.K. Cho, T. Raiko, and A. Ilin. Gaussian-Bernoulli deep Boltzmann machine. In

Proceedingsof the International Joint Conference on Neural Networks , pages 1–7. IEEE, 2013a.K. Cho, T. Raiko, and A. Ilin. Enhanced gradient for training restricted Boltzmann ma-chines.

Neural Computation , 25(3):805–831, 2013b.G. Desjardins, A. Courville, Y. Bengio, P. Vincent, and O. Dellaleau. Tempered MarkovChain Monte Carlo for training of restricted Boltzmann machines. In

Proceedings of theThirteenth International Conference on Artiﬁcial Intelligence and Statistics , volume 9,pages 145–152, 2010.G. Desjardins, R. Pascanu, A. Courville, and Y. Bengio. Metric-free natural gradient forjoint-training of Boltzmann machines. CoRR, abs/arXiv:1301.3545, 2013.54. Fischer.

Training Restricted Boltzmann Machines . PhD thesis, University of Copen-hagen, 2014.A. Fischer and C. Igel. Empirical analysis of the divergence of Gibbs sampling basedlearning algorithms for restricted Boltzmann machines. In

Proceedings of the InternationalConference on Artiﬁcial Neural Networks , volume 6354 of

LNCS , pages 208–217. Springer-Verlag, 2010.A. Fischer and C. Igel. Training restricted Boltzmann machines: An introduction.

PatternRecognition , 47:25–39, 2014.R. Grosse and R. Salakhudinov. Scaling up natural gradient by sparsely factorizing theinverse ﬁsher matrix. In

Proceedings of the 32th International Conference on MachineLearning (ICML-15) , 2015.G. E. Hinton. A practical guide to training restricted Boltzmann machines. Technicalreport, Department of Computer Science, University of Toronto, 2010.G. E. Hinton and R. Salakhutdinov. A better way to pretrain deep boltzmann machines. InF. Pereira, C.J.C. Burges, L. Bottou, and K.Q. Weinberger, editors,

Advances in NeuralInformation Processing Systems 25 , pages 2447–2455. Curran Associates, Inc., 2012.G. E. Hinton, O. Simon, and T. Yee-Whye. A fast learning algorithm for deep belief nets.

Neural Computation , 18(7):1527–1554, 2006.Sepp Hochreiter and J¨urgen Schmidhuber. Long short-term memory.

Neural computation ,9(8):1735–1780, 1997.HJ Kelley. Gradient theory of optimal ﬂight paths.

Ars Journal , 30(10):947–954, 1960.H. Larochelle and I. Murray. The neural autoregressive distribution estimator. In

Proceed-ings of the Fourteenth International Conference on Artiﬁcial Intelligence and Statistics ,volume 15, pages 29–37, 2011.H. Larochelle, Y. Bengio, and J. Turian. Tractable multivariate binary density estimationand the restricted boltzmann forest.

Neural Computation , 22:22852307, 2010.Y. LeCun, L. Bottou, G. Orr, and K. R. M¨uller. Eﬃcient backprop. In

Neural Networks:Tricks of the Trade , Lecture Notes in Computer Science, page 546. Springer Berlin /Heidelberg, 1998.M. Lingenheil, R. Denschlag, G. Mathias, and P. Tavan. Eﬃciency of exchange schemes inreplica exchange.

Chemical Physics Letters , 478:80–84, 2009.S. Loﬀe and C. Szegedy. Batch normalization: Accelerating deep network training byreducing internal covariate shift.

CoRR , abs/1502.03167, 2015. URL http://arxiv.org/abs/1502.03167 .D. MacKay.

Information Theory, Inference, and Learning Algorithm . Cambridge UniversityPress, Cambridge, 2003. 55. Marlin, K. Swersky, B. Chen, and N. de Freitas. Inductive principles for learning re-stricted boltzmann machines. 2010.J. Melchior, A. Fischer, and L. Wiskott. How to center binary restricted boltzmann ma-chines.

CoRR , 2013. URL http://arxiv.org/abs/1311.1354 .G. Montavon and K. R. M¨uller. Deep Boltzmann machines and the centering trick.

LectureNotes in Computer Science , 7700:621–637, 2012.Y. Ollivier, L. Arnold, A. Auger, and N. Hansen. Information-geometric optimizationalgorithms: A unifying picture via invariance principles. Technical report, CoRR,abs/1106.3708v2, 2013.T. Raiko, H. Valpola, and Y. LeCun. Deep learning made easier by linear transformationsin perceptrons.

Journal of Machine Learning Research , 22:924–332, 2012.D. Rumelhart, G. E. Hinton, and R. Williams. Learning representations by back-propagating errors.

Nature , 323:533–536, 1986a.D. Rumelhart, J. McClelland, and the PDP Research Group.

Parallel Distributed Process-ing: Explorations in the Microstructure of Cognition , volume 1. MIT Press, Cambridge,1986b.R. Salakhutdinov. Learning and evaluating Boltzmann machine. Technical report, Univer-sity of Toronto, 2008.R. Salakhutdinov and G. E. Hinton. Deep boltzmann machines. In , volume 5, pages448–455, Clearwater Beach, Florida, USA, 2009.R. Salakhutdinov and I. Murray. On the quantitative analysis of deep belief networks. In

Proceedings of the International Conference on Machine Learning , pages 872–879, NewYork, 2008. ACM.N. Schraudolph. Centering neural network gradient factors. In Genevieve B. Orr andKlaus-Robert M¨uller, editors,

Neural Networks: Tricks of the Trade , volume 1524 of

Lecture Notes in Computer Science , pages 207–226. Springer Verlag, Berlin, 1998.H. Schulz, A. M¨uller, and S. Behnke. Investigating convergence of restricted Boltzmannmachine learning. In

Proceedings of the NIPS 2010 Workshop on Deep Learning andUnsupervised Feature Learning , 2010.B. Schwehn. Using the natural gradient for training restricted Boltzmann machines. Mas-ter’s thesis, University of Edinburgh, Edinburgh, 2010.P. Smolensky.

Information processing in dynamical systems: foundations of harmony theory .MIT Press, Cambridge, MA, USA, 1986.K. Swersky, B. Chen, B. Marlin, and N. de Freitas. A tutorial on stochastic approximationalgorithms for training restricted boltzmann machines and deep belief nets. 2010.56. Tang and I. Sutskever. Data normalization in the learning of restricted Boltzmannmachines. Technical report, Department of Computer Science, University of Toronto,2011.T. Tieleman. Training restricted Boltzmann machines using approximations to the likeli-hood gradient. In W. W. Cohen, A. McCallum, and S. T. Roweis, editors,

Proceedings ofthe International Conference on Machine learning , pages 1064–1071. ACM, 2008.T. Tieleman and G. E. Hinton. Using fast weights to improve persistent contrastive di-vergence. In

Proceedings of the International Conference on Machine learning , pages1033–1040, New York, 2009. ACM.L. Younes. Maximum likelihood estimation of Gibbs ﬁelds. In A. Possolo, editor,