Contrastive Divergence Learning is a Time Reversal Adversarial Game
Omer Yair
Department of Electrical Engineering
Technion - Israel Institute of Technology
Haifa, Israel
[email protected]
Tomer Michaeli
Department of Electrical Engineering
Technion - Israel Institute of Technology
Haifa, Israel
[email protected]

Abstract
Contrastive divergence (CD) learning is a classical method for fitting unnormalized statistical models to data samples. Despite its widespread use, the convergence properties of this algorithm are still not well understood. The main source of difficulty is an unjustified approximation which has been used to derive the gradient of the loss. In this paper, we present an alternative derivation of CD that does not require any approximation and sheds new light on the objective that is actually being optimized by the algorithm. Specifically, we show that CD is an adversarial learning procedure, where a discriminator attempts to classify whether a Markov chain generated from the model has been time-reversed. Thus, although predating generative adversarial networks (GANs) by more than a decade, CD is, in fact, closely related to these techniques. Our derivation settles well with previous observations, which have concluded that CD's update steps cannot be expressed as the gradients of any fixed objective function. In addition, as a byproduct, our derivation reveals a simple correction that can be used as an alternative to Metropolis-Hastings rejection, which is required when the underlying Markov chain is inexact (e.g., when using Langevin dynamics with a large step).
1 Introduction
Unnormalized probability models have drawn significant attention over the years. These models arise, for example, in energy-based models, where the normalization constant is intractable to compute, and are thus relevant to numerous settings. Particularly, they have been extensively used in the context of restricted Boltzmann machines (Smolensky, 1986; Hinton, 2002), deep belief networks (Hinton et al., 2006; Salakhutdinov & Hinton, 2009), Markov random fields (Carreira-Perpinan & Hinton, 2005; Hinton & Salakhutdinov, 2006), and recently also with deep neural networks (Song & Ermon, 2019; Du & Mordatch, 2019; Grathwohl et al., 2019; Nijkamp et al., 2019).

Fitting an unnormalized density model to a dataset is challenging due to the missing normalization constant of the distribution. A naive approach is to employ approximate maximum likelihood estimation (MLE). This approach relies on the fact that the likelihood's gradient can be approximated using samples from the model, generated using Markov chain Monte Carlo (MCMC) techniques. However, a good approximation requires using very long chains and is thus impractical. This difficulty motivated the development of a plethora of more practical approaches, like score matching (Hyvärinen, 2005), noise contrastive estimation (NCE) (Gutmann & Hyvärinen, 2010), and conditional NCE (CNCE) (Ceylan & Gutmann, 2018), which replace the log-likelihood loss with objectives that do not require the computation of the normalization constant or its gradient.

Perhaps the most popular method for learning unnormalized models is contrastive divergence (CD) (Hinton, 2002). CD's advantage over MLE stems from its use of short Markov chains initialized at the data samples.
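For completeness, the standard identity underlying the MLE approximation mentioned above is as follows. Writing $p_\theta(x) = \tilde{p}_\theta(x)/Z(\theta)$ with unnormalized $\tilde{p}_\theta$, the log-likelihood gradient of a sample $x$ is

$\nabla_\theta \log p_\theta(x) = \nabla_\theta \log \tilde{p}_\theta(x) - \mathbb{E}_{X' \sim p_\theta}\left[\nabla_\theta \log \tilde{p}_\theta(X')\right],$

where the expectation over the model is the intractable term that MCMC sampling must approximate, and which requires long chains to estimate accurately.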
CD has been successfully used in a wide range of domains, including modeling images (Hinton et al., 2006), speech (Mohamed & Hinton, 2010), documents (Hinton & Salakhutdinov, 2009), and movie ratings (Salakhutdinov et al., 2007), and is continuing to attract significant research attention (Liu & Wang, 2017; Gao et al., 2018; Qiu et al., 2019).

[Figure 1 graphic: (i) generator update step, (ii) discriminator update step; the panels show a dataset sample fed to an MCMC process (the generator) defined by the learned model, and a discriminator classifying between original and reversed order.]
Figure 1: Contrastive divergence as an adversarial process. In the first step, the distribution model is used to define an MCMC process, which generates a chain of samples. In the second step, the distribution model is updated using a gradient descent step based on the generated chain.

Despite CD's popularity and empirical success, there still remain open questions regarding its theoretical properties. The primary source of difficulty is an unjustified approximation used to derive its objective's gradient, which biases its update steps (Carreira-Perpinan & Hinton, 2005; Bengio & Delalleau, 2009). The difficulty is exacerbated by the fact that CD's update steps cannot be expressed as the gradients of any fixed objective (Tieleman, 2007; Sutskever & Tieleman, 2010).

In this paper, we present an alternative derivation of CD, which relies on completely different principles and requires no approximations. Specifically, we show that CD's update steps are the gradients of an adversarial game in which a discriminator attempts to classify whether a Markov chain generated from the model is presented to it in its original or a time-reversed order (see Fig. 1). Thus, our derivation sheds new light on CD's success: similarly to modern generative adversarial methods (Goodfellow et al., 2014), CD's discrimination task becomes more challenging as the model approaches the true distribution. This keeps the update steps effective throughout the entire training process and prevents early saturation, as often happens in non-adaptive methods like NCE and CNCE. In fact, we derive CD as a natural extension of the CNCE method, replacing the fixed distribution of the contrastive examples with an adversarial adaptive distribution.

CD requires that the underlying MCMC be exact, which is not the case for popular methods like Langevin dynamics. This commonly requires using Metropolis-Hastings (MH) rejection, which ignores some of the generated samples. Interestingly, our derivation reveals an alternative correction method for inexact chains, which does not require rejection.
2 Background

2.1 The Classical Derivation of CD
Assume we have an unnormalized distribution model $p_\theta$. Given a dataset of samples $\{x_i\}$ independently drawn from some unknown distribution $p$, CD attempts to determine the parameters $\theta$ with which $p_\theta$ best explains the dataset. Rather than using the log-likelihood loss, CD's objective involves distributions of samples along finite Markov chains initialized at $\{x_i\}$. When based on chains of length $k$, the algorithm is usually referred to as CD-$k$.

Concretely, let $q_\theta(x'|x)$ denote the transition rule of a Markov chain with stationary distribution $p_\theta$, and let $r^m_\theta$ denote the distribution of samples after $m$ steps of the chain. As the Markov chain is initialized from the dataset distribution and converges to $p_\theta$, we have that $r^0_\theta = p$ and $r^\infty_\theta = p_\theta$. The CD algorithm then attempts to minimize the loss

$\ell_{\text{CD-}k} = D_{\text{KL}}(r^0_\theta \| r^\infty_\theta) - D_{\text{KL}}(r^k_\theta \| r^\infty_\theta) = D_{\text{KL}}(p \| p_\theta) - D_{\text{KL}}(r^k_\theta \| p_\theta),$  (1)

where $D_{\text{KL}}$ is the Kullback-Leibler divergence. Under mild conditions on $q_\theta$ (Cover & Halliwell, 1994) this loss is guaranteed to be positive, and it vanishes when $p_\theta = p$ (in which case $r^k_\theta = p_\theta$).

To allow the minimization of (1) using gradient-based methods, one can write

$\nabla_\theta \ell_{\text{CD-}k} = \mathbb{E}_{\tilde{X} \sim r^k_\theta}[\nabla_\theta \log p_\theta(\tilde{X})] - \mathbb{E}_{X \sim p}[\nabla_\theta \log p_\theta(X)] + \frac{dD_{\text{KL}}(r^k_\theta \| p_\theta)}{dr^k_\theta} \nabla_\theta r^k_\theta.$  (2)

Here, the first two terms can be approximated using two batches of samples, one drawn from $p$ and one from $r^k_\theta$. The third term is the derivative of the loss with respect only to the $\theta$ that appears in $r^k_\theta$, ignoring the dependence of $p_\theta$ on $\theta$. This is the original notation from (Hinton, 2002); an alternative way to write this term would be $\nabla_{\tilde{\theta}} D_{\text{KL}}(r^k_{\tilde{\theta}} \| p_\theta)$.
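To make the sampled estimate of the first two terms of (2) concrete, the following is a minimal, self-contained sketch on a toy one-dimensional Gaussian model. The model, the AR(1) transition kernel, and all numeric values are illustrative assumptions for this sketch, not part of the original algorithm.

```python
import random

random.seed(0)

# Toy unnormalized model: log p_theta(x) = -(x - theta)^2 / 2 (a Gaussian with
# mean theta), so the parameter score is  d/dtheta log p_theta(x) = x - theta.
def grad_log_p(theta, x):
    return x - theta

# A reversible MCMC transition with stationary distribution N(theta, 1):
# a Gaussian AR(1) step (an illustrative stand-in for a generic kernel q_theta).
def mcmc_step(theta, x, rho=0.5):
    return rho * x + (1 - rho) * theta + (1 - rho ** 2) ** 0.5 * random.gauss(0, 1)

data = [random.gauss(2.0, 1.0) for _ in range(2000)]    # samples from p = N(2, 1)
theta, lr = 0.0, 0.2
for _ in range(500):
    batch = random.sample(data, 200)
    contrastive = [mcmc_step(theta, x) for x in batch]  # one MCMC step each (CD-1)
    # sampled estimate of the first two terms of (2)
    grad = sum(grad_log_p(theta, xt) - grad_log_p(theta, x)
               for xt, x in zip(contrastive, batch)) / len(batch)
    theta -= lr * grad
print(f"fitted theta = {theta:.2f}")   # should land near the data mean
```

Descending this estimated gradient is exactly the sampled CD update discussed next; the third term of (2) is simply absent from this computation.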
This term turns out to be intractable, and in the original derivation it is argued to be small and is thus neglected, leading to the approximation

$\nabla_\theta \ell_{\text{CD-}k} \approx \sum_{i=1}^{n} \left( \nabla_\theta \log p_\theta(\tilde{x}_i) - \nabla_\theta \log p_\theta(x_i) \right).$  (3)

Here $\{x_i\}$ is a batch of $n$ samples from the dataset and $\{\tilde{x}_i\}$ are $n$ samples generated by applying $k$ MCMC steps to each of the samples in that batch. The intuition behind the resulting algorithm (summarized in App. A) is therefore simple. In each gradient step $\theta \leftarrow \theta - \eta \nabla_\theta \ell_{\text{CD-}k}$, the log-likelihood of samples from the dataset is increased at the expense of the log-likelihood of the contrastive samples $\{\tilde{x}_i\}$, which are closer to the current learned distribution $p_\theta$.

Despite the simple intuition, it has been shown that without the third term, (2) generally cannot be the gradient of any fixed objective (Tieleman, 2007; Sutskever & Tieleman, 2010), except for some very specific cases, such as CD-1 with a step size $\eta$ that approaches zero, when the Markov chain is based on Langevin dynamics with infinitesimal steps (Hyvarinen, 2007). Here, we show that this approximation is in fact the exact gradient of a particular adversarial objective, which adapts to the current learned model in each step.

2.2 Conditional Noise Contrastive Estimation
Our derivation views CD as an extension of the CNCE method, which itself is an extension of NCE. We therefore start by briefly reviewing those two methods.

In NCE, the unsupervised density learning problem is transformed into a supervised one. This is done by training a discriminator $D_\theta(x)$ to distinguish between samples drawn from $p$ and samples drawn from some preselected contrastive distribution $p_{\text{ref}}$. Specifically, let the random variable $Y$ denote the label of the class from which the variable $X$ has been drawn, so that $X|(Y=1) \sim p$ and $X|(Y=0) \sim p_{\text{ref}}$. Then it is well known that the discriminator minimizing the binary cross-entropy (BCE) loss is given by

$D_{\text{opt}}(x) = P(Y=1|X=x) = \frac{p(x)}{p(x) + p_{\text{ref}}(x)}.$  (4)

Therefore, letting our parametric discriminator have the form

$D_\theta(x) = \frac{p_\theta(x)}{p_\theta(x) + p_{\text{ref}}(x)},$  (5)

and training it with the BCE loss, should in theory lead to $D_\theta(x) = D_{\text{opt}}(x)$ and thus to $p_\theta(x) = p(x)$. In practice, however, the convergence of NCE highly depends on the selection of $p_{\text{ref}}$. If it significantly deviates from $p$, then the two distributions can be easily discriminated even when the learned distribution $p_\theta$ is still very far from $p$. At this point, the optimization essentially stops updating the model, which can result in a very inaccurate estimate for $p$. In the next section we provide a precise mathematical explanation for this behavior.

The CNCE method attempts to alleviate this problem by drawing the contrastive samples based on the dataset samples. Specifically, each dataset sample $x$ is paired with a contrastive sample $\tilde{x}$ that is drawn conditioned on $x$ from some predetermined conditional distribution $q(\tilde{x}|x)$ (e.g., $\mathcal{N}(x, \sigma^2 I)$). The pair is then concatenated in a random order, and a discriminator is trained to predict the correct order. This is illustrated in Fig. 2a.
Specifically, here the two classes are of pairs $(A, B)$, corresponding to $(A, B) = (X, \tilde{X})$ for $Y = 1$ and $(A, B) = (\tilde{X}, X)$ for $Y = 0$, and the discriminator minimizing the BCE loss is given by

$D_{\text{opt}}(a, b) = P(Y=1|A=a, B=b) = \frac{q(b|a)\,p(a)}{q(b|a)\,p(a) + q(a|b)\,p(b)}.$  (6)

Figure 2: From CNCE to CD-1. (a) In CNCE, each contrastive sample is generated using a fixed conditional distribution $q(\cdot|\cdot)$ (which usually corresponds to additive noise). The real and fake samples are then concatenated and presented to a discriminator in a random order, which is trained to predict the correct order. (b) CD-1 can be viewed as CNCE with a $q(\cdot|\cdot)$ that corresponds to the transition rule of a Markov chain with stationary distribution $p_\theta$. Since $q$ depends on $p_\theta$ (hence the subscript $\theta$), during training the distribution of contrastive samples becomes more similar to that of the real samples, making the discrimination task harder.

Therefore, constructing a parametric discriminator of the form

$D_\theta(a, b) = \frac{q(b|a)\,p_\theta(a)}{q(b|a)\,p_\theta(a) + q(a|b)\,p_\theta(b)} = \left(1 + \frac{q(a|b)\,p_\theta(b)}{q(b|a)\,p_\theta(a)}\right)^{-1},$  (7)

and training it with the BCE loss, should lead to $p_\theta \propto p$. Note that here $D_\theta$ is indifferent to a scaling of $p_\theta$, which is thus determined only up to an arbitrary multiplicative constant.

CNCE improves upon NCE, as it allows working with contrastive samples whose distribution is closer to $p$. However, it does not completely eliminate the problem, especially when $p$ exhibits different scales of variation in different directions. This is the case, for example, with natural images, which are known to lie close to a low-dimensional manifold. Indeed, if the conditional distribution $q(\cdot|\cdot)$ is chosen to have a small variance, then CNCE fails to capture the global structure of $p$. And if $q(\cdot|\cdot)$ is taken to have a large variance, then CNCE fails to capture the intricate features of $p$ (see Fig. 3). The latter case can be easily understood in the context of images (see Fig. 2a). Here, the discriminator can easily distinguish which of its pair of input images is the noisy one, without having learned an accurate model for the distribution of natural images (e.g., simply by comparing their smoothness).
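This saturation effect can be illustrated numerically with a small sketch. For illustration only, we assume a 1-D model that already equals $p = \mathcal{N}(0,1)$ and a symmetric Gaussian noise distribution $q$, for which the $q$ factors in (6) cancel; the sketch then estimates the average probability that the optimal discriminator assigns to the incorrect order.

```python
import math
import random

random.seed(1)

def log_p(x):                   # model, here already equal to p: N(0, 1), up to a constant
    return -x * x / 2

def incorrect_prob(x, x_tilde):
    # 1 - D(x, x_tilde) from (6)/(7); since q is a symmetric Gaussian, the q
    # terms cancel and only the density ratio p(x_tilde)/p(x) remains
    return 1 / (1 + math.exp(log_p(x) - log_p(x_tilde)))

xs = [random.gauss(0, 1) for _ in range(5000)]
means = {}
for sigma in (0.1, 5.0):
    vals = [incorrect_prob(x, x + random.gauss(0, sigma)) for x in xs]
    means[sigma] = sum(vals) / len(vals)
    print(f"noise std {sigma}: mean P(wrong order) = {means[sigma]:.3f}")
```

With small noise the task stays near chance level, while large noise drives the probability of the wrong order toward zero for most pairs, so the weight the loss effectively assigns to such pairs collapses.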
When this point is reached, the optimization essentially stops. In the next section we show that CD is in fact an adaptive version of CNCE, in which the contrastive distribution is constantly updated in order to keep the discrimination task hard. This explains why CD is less prone to early saturation than NCE and CNCE.

3 An Alternative Derivation of CD

We now present our alternative derivation of CD. In Sec. 3.1 we identify a decomposition of the CNCE loss, which reveals the term that is responsible for early saturation. In Sec. 3.2, we then present a method for adapting the contrastive distribution in a way that provably keeps this term bounded away from zero. Surprisingly, the resulting update step turns out to precisely match that of CD-1, thus providing a new perspective on CD learning. In Sec. 3.3, we extend our derivation to include CD-$k$ (with $k \geq 1$).

3.1 Reinterpreting CNCE
Let us denote

$w_\theta(a, b) \triangleq \frac{q(a|b)\,p_\theta(b)}{q(b|a)\,p_\theta(a)},$  (8)

so that we can write CNCE's discriminator (7) as

$D_\theta(a, b) = (1 + w_\theta(a, b))^{-1}.$  (9)

Then we have the following observation (see proof in App. B).

[Figure 3 graphic: (a) the toy model, shown in two coordinate planes; (b) comparison of CNCE with large and small noise against CD, showing the first and last iterations, the median weights, the learned density, and samples drawn from the model.]
Figure 3: A toy example illustrating the importance of the adversarial nature of CD. Here, the data lies close to a 2D spiral embedded in a 10-dimensional space. (a) The training samples in the first 3 dimensions. (b) Three different approaches for learning the distribution: CNCE with large contrastive variance (top), CNCE with small contrastive variance (middle), and CD based on Langevin dynamics MCMC with the weight adjustment described in Sec. 3.4 (bottom). As can be seen in the first two columns, CD adapts the contrastive samples according to the data distribution, whereas CNCE does not. Therefore, CNCE with large variance fails to learn the distribution, because the vast majority of its contrastive samples are far from the manifold and quickly become irrelevant (as indicated by the weights $\alpha_\theta$ in the third column). And CNCE with small variance fails to learn the global structure of the distribution, because its contrastive samples are extremely close to the dataset samples. CD, on the other hand, adjusts the contrastive distribution during training, so as to generate samples that are close to the manifold yet traverse large distances along it.

Observation 1.
The gradient of the CNCE loss can be expressed as

$\nabla_\theta \ell_{\text{CNCE}} = \mathbb{E}_{X \sim p,\; \tilde{X}|X \sim q} \left[ \alpha_\theta(X, \tilde{X}) \left( \nabla_\theta \log p_\theta(\tilde{X}) - \nabla_\theta \log p_\theta(X) \right) \right],$  (10)

where

$\alpha_\theta(x, \tilde{x}) \triangleq \left(1 + w_\theta(x, \tilde{x})^{-1}\right)^{-1}.$  (11)

Note that (10) is similar in nature to the (approximate) gradient of the CD loss (3). Particularly, as in CD, the term $\nabla_\theta \log p_\theta(\tilde{X}) - \nabla_\theta \log p_\theta(X)$ causes each gradient step to increase the log-likelihood of samples from the dataset at the expense of the log-likelihood of the contrastive samples. However, as opposed to CD, here we also have the coefficient $\alpha_\theta(x, \tilde{x})$, which assigns a weight between 0 and 1 to each pair of samples $(x, \tilde{x})$. To understand its effect, observe that

$\alpha_\theta(x, \tilde{x}) = 1 - D_\theta(x, \tilde{x}) = D_\theta(\tilde{x}, x).$  (12)

Namely, this coefficient is precisely the probability that the discriminator assigns to the incorrect order of the pair. Therefore, this term gives a low weight to "easy" pairs (i.e., for which $D_\theta(x, \tilde{x})$ is close to 1) and a high weight to "hard" ones.

This weighting coefficient is of course essential for ensuring convergence to $p$. For example, it prevents $\log p_\theta$ from diverging to $\pm\infty$ when the discriminator is presented with the same samples over and over again. The problem is that a discriminator can often correctly discriminate all training pairs, even with a $p_\theta$ that is still far from $p$. In such cases, $\alpha_\theta$ becomes practically zero for all pairs and the model stops updating. This shows that a good contrastive distribution is one which keeps the discrimination task hard throughout the training. As we show next, there is a particular choice which provably prevents $\alpha_\theta$ from converging to zero, and that choice results in the CD method.

3.2 From CNCE to CD-1
To bound $\alpha_\theta$ away from 0, and thus avoid the early stopping of the training process, we now extend the original CNCE algorithm by allowing the conditional distribution $q$ to depend on $p_\theta$ (and thus to change from one step to the next). Our next key observation is that in this setting there exists a particular choice that keeps $\alpha_\theta$ constant.

Observation 2. If $q$ is chosen to be the transition probability of a reversible Markov chain with stationary distribution $p_\theta$, then

$\alpha_\theta(x, \tilde{x}) = \frac{1}{2}, \quad \forall x, \tilde{x}.$  (13)

Proof.
A reversible chain with transition $q$ and stationary distribution $p_\theta$ satisfies the detailed balance property

$q(\tilde{x}|x)\,p_\theta(x) = q(x|\tilde{x})\,p_\theta(\tilde{x}), \quad \forall x, \tilde{x}.$  (14)

Substituting (14) into (8) leads to $w_\theta(x, \tilde{x}) = 1$, which from (11) implies $\alpha_\theta(x, \tilde{x}) = \frac{1}{2}$.

This observation directly links CNCE to CD. First, the suggested method for generating the contrastive samples is precisely the one used in CD-1. Second, as this choice of $q$ leads to $\alpha_\theta(x, \tilde{x}) = \frac{1}{2}$, it causes the gradient of the CNCE loss (10) to become

$\nabla_\theta \ell_{\text{CNCE}} = \frac{1}{2}\,\mathbb{E}_{X \sim p,\; \tilde{X}|X \sim q} \left[ \nabla_\theta \log p_\theta(\tilde{X}) - \nabla_\theta \log p_\theta(X) \right],$  (15)

which is exactly proportional to the CD-1 update (3). We have thus obtained an alternative derivation of CD-1. Namely, rather than viewing CD-1 learning as an approximate gradient descent process for the loss (1), we can view each step as the exact gradient of the CNCE discrimination loss, where the reference distribution $q$ is adapted to the current learned model $p_\theta$. This is illustrated in Fig. 2b.

Since $q$ is chosen based on $p_\theta$, the overall process is in fact an adversarial game. Namely, the optimization alternates between updating $q$, which acts as a generator, and updating $p_\theta$, which defines the discriminator. As $p_\theta$ approaches $p$, the distribution of samples generated from the MCMC also becomes closer to $p$, which makes the discriminator's task harder and thus prevents early saturation.

It should be noted that, formally, since $q$ depends on $p_\theta$, it also indirectly depends on $\theta$, so that a more appropriate notation would be $q_\theta$. However, during the update of $p_\theta$ we fix $q_\theta$ (and vice versa), so that the gradient in the discriminator update does not consider the dependence of $q_\theta$ on $\theta$. This is why (15) does not involve the gradient of $\tilde{X}$, which depends on $q_\theta$.

The reason for fixing $q_\theta$ comes from the adversarial nature of the learning process.
Being part of the chain generation process, the goal of the transition rule $q_\theta$ is to generate chains that appear to be time-reversible, while the goal of the classifier, which is based on the model $p_\theta$, is to correctly classify whether the chains were reversed. Therefore, we do not want the optimization of the classifier to affect $q_\theta$. This is just like in GANs, where the generator and discriminator have different objectives, and so when updating the discriminator the generator is kept fixed.

3.3 From CD-1 to CD-k
To extend our derivation to CD-$k$ with an arbitrary $k \geq 1$, let us now view the discrimination problem of the previous section as a special case of a more general setting. Specifically, the pairs of samples presented to the discriminator in Sec. 3.2 can be viewed as Markov chains of length two (comprising the initial sample from the dataset and one extra generated sample). It is therefore natural to consider also Markov chains of arbitrary lengths. That is, assume we initialize the MCMC at a sample $x_i$ from the dataset and run it for $k$ steps to obtain a sequence $(x^{(0)}, x^{(1)}, \ldots, x^{(k)})$, where $x^{(0)} = x_i$. We can then present this sequence to a discriminator either in its original order or time-reversed, and train the discriminator to classify the correct order. We coin this a time-reversal classification task. Interestingly, in this setting, we have the following.

Observation 3.
When using a reversible Markov chain of length $k+1$ with stationary distribution $p_\theta$, the gradient of the BCE loss of the time-reversal classification task is given by

$\nabla_\theta \ell_{\text{CNCE}} = \frac{1}{2}\,\mathbb{E} \left[ \nabla_\theta \log p_\theta(X^{(k)}) - \nabla_\theta \log p_\theta(X^{(0)}) \right],$  (16)

which is exactly identical to the CD-$k$ update (3) up to a multiplicative factor of $\frac{1}{2}$.

This constitutes an alternative interpretation of CD-$k$. That is, CD-$k$ can be viewed as a time-reversal adversarial game, where in each step, the model $p_\theta$ is updated so as to allow the discriminator to better distinguish MCMC chains from their time-reversed counterparts.

Two remarks are in order. First, it is interesting to note that although the discriminator's task is to classify the order of the whole chain, its optimal strategy is to examine only the endpoints of the chain, $x^{(0)}$ and $x^{(k)}$. Second, it is insightful to recall that the original motivation behind the CD-$k$ loss (1) was that when $p_\theta$ equals $p$, the marginal probability of each individual step in the chain is also $p$. Our derivation, however, requires more than that. To make the chain indistinguishable from its time-reversed version, the joint probability of all samples in the chain must be invariant to a flip of the order. When $p_\theta = p$, this is indeed the case, due to the detailed balance property (14).

Proof of Observation 3.
We provide the outline of the proof (see full derivation in App. C). Let $(A^{(0)}, A^{(1)}, \ldots, A^{(k)})$ denote the input to the discriminator, and let $Y$ indicate the order of the chain, with $Y = 1$ corresponding to $(A^{(0)}, \ldots, A^{(k)}) = (X^{(0)}, X^{(1)}, \ldots, X^{(k)})$ and $Y = 0$ to $(A^{(0)}, \ldots, A^{(k)}) = (X^{(k)}, X^{(k-1)}, \ldots, X^{(0)})$. The discriminator that minimizes the BCE loss is now given by

$D(a_0, a_1, \ldots, a_k) = P(Y=1 \,|\, A^{(0)}=a_0, \ldots, A^{(k)}=a_k) = \left(1 + \frac{q(a_0|a_1) \cdots q(a_{k-1}|a_k)\,p(a_k)}{q(a_k|a_{k-1}) \cdots q(a_1|a_0)\,p(a_0)}\right)^{-1} = \left(1 + \prod_{i=1}^{k} w_\theta(a_{i-1}, a_i)\right)^{-1}.$  (17)

The CNCE paradigm thus defines a discriminator $D_\theta$ having the form of (17) but with $p$ replaced by $p_\theta$. Recall that despite the dependence of the transition probability $q$ on the current learned model $p_\theta$, it is regarded as fixed within each discriminator update step. We therefore omit the subscript $\theta$ from $q$ here. Similarly to the derivation of (10), explicitly writing the gradient of the BCE loss of our discrimination task gives

$\nabla_\theta \ell_{\text{chain}} = \mathbb{E}\left[\left(1 + \prod_{i=1}^{k} w_\theta(X^{(i-1)}, X^{(i)})^{-1}\right)^{-1} \left(\nabla_\theta \log p_\theta(X^{(k)}) - \nabla_\theta \log p_\theta(X^{(0)})\right)\right] = \mathbb{E}\left[\alpha_\theta(X^{(0)}, \ldots, X^{(k)}) \left(\nabla_\theta \log p_\theta(X^{(k)}) - \nabla_\theta \log p_\theta(X^{(0)})\right)\right],$  (18)

where we now defined

$\alpha_\theta(a_0, \ldots, a_k) \triangleq \left(1 + \prod_{i=1}^{k} w_\theta(a_{i-1}, a_i)^{-1}\right)^{-1}.$  (19)

Note that (10) is a special case of (18) corresponding to $k = 1$, where $X$ and $\tilde{X}$ in (10) are $X^{(0)}$ and $X^{(1)}$ in (18). As before, when $q$ satisfies the detailed balance property (14), we obtain $w_\theta = 1$ and consequently the weighting term $\alpha_\theta$ again equals $\frac{1}{2}$. Thus, the gradient (18) reduces to (16), which is exactly proportional to the CD-$k$ update (3).
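As a numeric sanity check of this argument, the chain weight (19) can be evaluated directly. This is a sketch under illustrative assumptions: a 1-D standard-normal model, a Gaussian AR(1) kernel standing in for an exact reversible chain, and an unadjusted Langevin step as an inexact one.

```python
import math
import random

random.seed(2)

def log_p(x):                       # model: standard normal, up to a constant
    return -x * x / 2

def log_q_ar(xp, x, rho=0.8):       # exact reversible kernel: Gaussian AR(1)
    v = 1 - rho * rho
    return -(xp - rho * x) ** 2 / (2 * v) - 0.5 * math.log(2 * math.pi * v)

def log_q_langevin(xp, x, h=0.5):   # unadjusted Langevin proposal (inexact)
    m = x + 0.5 * h * (-x)          # x + (h/2) * grad log p(x)
    return -(xp - m) ** 2 / (2 * h) - 0.5 * math.log(2 * math.pi * h)

def chain_alpha(xs, log_q):
    # Eq. (19): alpha = (1 + prod_i w_i^{-1})^{-1},
    # with w(a, b) = q(a|b) p(b) / (q(b|a) p(a))
    s = 0.0                         # accumulates sum_i log(1 / w_i)
    for a, b in zip(xs[:-1], xs[1:]):
        s -= log_q(a, b) + log_p(b) - log_q(b, a) - log_p(a)
    return 1 / (1 + math.exp(s))

def run_chain(step, x0=1.0, k=5):
    xs = [x0]
    for _ in range(k):
        xs.append(step(xs[-1]))
    return xs

ar_step = lambda x: 0.8 * x + math.sqrt(1 - 0.8 ** 2) * random.gauss(0, 1)
langevin_step = lambda x: 0.75 * x + math.sqrt(0.5) * random.gauss(0, 1)

a_exact = chain_alpha(run_chain(ar_step), log_q_ar)
a_inexact = chain_alpha(run_chain(langevin_step), log_q_langevin)
print(f"alpha (reversible kernel): {a_exact:.6f}")   # exactly 1/2, up to float error
print(f"alpha (Langevin, h=0.5):  {a_inexact:.3f}")  # generally differs from 1/2
```

For the reversible kernel the detailed-balance terms cancel along the chain and $\alpha_\theta$ is exactly $\frac{1}{2}$, whereas the discretized Langevin chain yields a weight that deviates from $\frac{1}{2}$, which is precisely the situation addressed in Sec. 3.4.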
Figure 4: Here, we use different CD configurations for learning the model of Fig. 3. All configurations use Langevin dynamics as their MCMC process, but with different ways of compensating for the lack of detailed balance. From left to right we have the ground-truth density, CD without any correction, CD with Metropolis-Hastings rejection, and CD with our proposed adjustment.

3.4 MCMC Processes That Do Not Have Detailed Balance
In our derivation, we assumed that the MCMC process is reversible, and thus exactly satisfies the detailed balance property (14). This assumption ensured that $w_\theta = 1$ and thus $\alpha_\theta = \frac{1}{2}$. In practice, however, commonly used MCMC methods satisfy this property only approximately. For example, the popular discrete Langevin dynamics process obeys detailed balance only in the limit where the step size approaches zero. The common approach to overcome this is through Metropolis-Hastings (MH) rejection (Hastings, 1970), which guarantees detailed balance by accepting only a portion of the proposed MCMC transitions. In this approach, the probability of accepting a transition from $x$ to $\tilde{x}$ is closely related to the weighting term $w_\theta$, and is given by

$A(x, \tilde{x}) = \min\left(1, w_\theta(x, \tilde{x})\right).$  (20)

Interestingly, our derivation reveals an alternative method for accounting for the lack of detailed balance. Concretely, we saw that the general expression for the gradient of the BCE loss (before assuming detailed balance) is given by (18). This expression differs from the original update step of CD-$k$ only in the weighting term $\alpha_\theta(x^{(0)}, \ldots, x^{(k)})$. Therefore, all that is required for maintaining correctness in the absence of detailed balance is to weigh each chain by its "hardness" $\alpha_\theta(x^{(0)}, \ldots, x^{(k)})$ (see Alg. 2 in App. A). Note that in this case, the update depends not only on the end-points of the chains, but also on their intermediate steps. As can be seen in Fig. 4, this method performs just as well as MH, and significantly better than vanilla CD without correction.

4 Illustration Through a Toy Example
To illustrate our observations, we now conclude with a simple toy example (see Fig. 3). Our goal here is not to draw general conclusions regarding the performance of CNCE and CD, but rather merely to highlight the adversarial nature of CD and its importance when the data density exhibits different scales of variation along different directions.

We take data concentrated around a 2-dimensional manifold embedded in a 10-dimensional space. Specifically, let $e^{(1)}, \ldots, e^{(10)}$ denote the standard basis in $\mathbb{R}^{10}$. Then each data sample is generated by adding Gaussian noise to a random point along a 2D spiral lying in the $e^{(1)}$-$e^{(2)}$ plane. The STD of the noise in the $e^{(1)}$ and $e^{(2)}$ directions is 5 times larger than that in the other 8 axes. Figure 3a shows the projections of the data samples onto the first 3 dimensions. Here, we use a multi-layer perceptron (MLP) as our parametric model, $\log p_\theta$, and train it using several different learning configurations (for the full details see App. D).

Figure 3b visualizes the training as well as the final result achieved by each configuration. The first two rows show CNCE with Gaussian contrastive distributions of two different STDs. The third row shows the adjusted CD described in Sec. 3.4 with Langevin dynamics as its MCMC process. As can be seen, for CNCE with a large STD, the contrastive samples are able to explore large areas around the original samples, but this causes the majority of them to lie relatively far from the manifold (see their projections onto the $e^{(1)}$-$e^{(3)}$ plane). In this case, $\alpha_\theta$ decreases quickly, causing the learning process to ignore most samples at a very early stage of the training. When using CNCE with a small STD, the samples remain relevant throughout the training, but this comes at the price of an inability to capture the global structure of the distribution. CD, on the other hand, is able to enjoy the best of both worlds, as it adapts the contrastive distribution over time.
Indeed, as the learning progresses, the contrastive samples move closer to the manifold to maintain their relevance. Note that since we use the adjusted version of CD, the weights in this configuration are not precisely $\frac{1}{2}$. We chose the step size of the Langevin dynamics so that the median of the weights is approximately $10^{-1}$.

Figure 4 shows the results achieved by different variants of CD. As can be seen, without correcting for the lack of detailed balance, CD fails to estimate the density correctly. When using MH rejection to correct the MCMC, or our adjusted CD (ACD) to correct the update steps, the estimate is significantly improved.

5 Conclusion
The classical CD method has seen many uses and theoretical analyses over the years. The original derivation presented the algorithm as an approximate gradient descent process for a certain loss. However, the accuracy of the approximation has been a matter of much dispute, leaving it unclear what objective the algorithm minimizes in practice. Here, we presented an alternative derivation of CD's update steps, which involves no approximations. Our analysis shows that CD is in essence an adversarial learning procedure, where a discriminator is trained to distinguish whether a Markov chain generated from the learned model has been time-flipped or not. Therefore, although predating GANs by more than a decade, CD in fact belongs to the same family of techniques. This provides a possible explanation for its empirical success.

References
Yoshua Bengio and Olivier Delalleau. Justifying and generalizing contrastive divergence. Neural Computation, 21(6):1601–1621, 2009.

Miguel A Carreira-Perpinan and Geoffrey E Hinton. On contrastive divergence learning. In AISTATS, volume 10, pp. 33–40, 2005.

Ciwan Ceylan and Michael U Gutmann. Conditional noise-contrastive estimation of unnormalised models. In International Conference on Machine Learning, pp. 726–734, 2018.

Thomas M Cover and J Halliwell. Which processes satisfy the second law? Physical Origins of Time Asymmetry, pp. 98–107, 1994.

Yilun Du and Igor Mordatch. Implicit generation and generalization in energy-based models. arXiv preprint arXiv:1903.08689, 2019.

Ruiqi Gao, Yang Lu, Junpei Zhou, Song-Chun Zhu, and Ying Nian Wu. Learning generative convnets via multi-grid modeling and sampling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9155–9164, 2018.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.

Will Grathwohl, Kuan-Chieh Wang, Joern-Henrik Jacobsen, David Duvenaud, Mohammad Norouzi, and Kevin Swersky. Your classifier is secretly an energy based model and you should treat it like one. In International Conference on Learning Representations, 2019.

Michael Gutmann and Aapo Hyvärinen. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 297–304, 2010.

W Keith Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1):97–109, 1970.

Geoffrey E Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8):1771–1800, 2002.

Geoffrey E Hinton and Ruslan R Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.

Geoffrey E Hinton and Russ R Salakhutdinov. Replicated softmax: an undirected topic model. In Advances in Neural Information Processing Systems, pp. 1607–1614, 2009.

Geoffrey E Hinton, Simon Osindero, and Yee-Whye Teh. A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527–1554, 2006.

Aapo Hyvärinen. Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6(Apr):695–709, 2005.

Aapo Hyvärinen. Connections between score matching, contrastive divergence, and pseudolikelihood for continuous-valued variables. IEEE Transactions on Neural Networks, 18(5):1529–1531, 2007.

Qiang Liu and Dilin Wang. Learning deep energy models: Contrastive divergence vs. amortized MLE. arXiv preprint arXiv:1707.00797, 2017.

Abdel-rahman Mohamed and Geoffrey Hinton. Phone recognition using restricted Boltzmann machines. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4354–4357. IEEE, 2010.

Erik Nijkamp, Mitch Hill, Song-Chun Zhu, and Ying Nian Wu. Learning non-convergent non-persistent short-run MCMC toward energy-based model. In Advances in Neural Information Processing Systems, pp. 5232–5242, 2019.

Yixuan Qiu, Lingsong Zhang, and Xiao Wang. Unbiased contrastive divergence algorithm for training energy-based latent variable models. In International Conference on Learning Representations, 2019.

Ruslan Salakhutdinov and Geoffrey Hinton. Deep Boltzmann machines. In Artificial Intelligence and Statistics, pp. 448–455, 2009.

Ruslan Salakhutdinov, Andriy Mnih, and Geoffrey Hinton. Restricted Boltzmann machines for collaborative filtering. In Proceedings of the 24th International Conference on Machine Learning, pp. 791–798, 2007.

Paul Smolensky. Information processing in dynamical systems: Foundations of harmony theory. Technical report, Colorado Univ at Boulder Dept of Computer Science, 1986.

Jascha Sohl-Dickstein, Peter Battaglino, and Michael R DeWeese. Minimum probability flow learning. arXiv preprint arXiv:0906.4779, 2009.

Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. In Advances in Neural Information Processing Systems, pp. 11918–11930, 2019.

Ilya Sutskever and Tijmen Tieleman. On the convergence properties of contrastive divergence. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 789–795, 2010.

Tijmen Tieleman. Some investigations into energy-based models. PhD thesis, University of Toronto, 2007.

A ALGORITHMS
Below we summarize the algorithms of the classical CD and the proposed adjusted version described in Sec. 3.4.
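To complement the pseudocode, the following is a minimal runnable sketch of both update rules on a hypothetical 1D Gaussian toy model with a learnable mean, using uncorrected Langevin dynamics as the transition rule q_θ. The model, step size, learning rate, and chain length below are illustrative choices, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy unnormalized model: log p_theta(x) = -(x - theta)^2 / 2, i.e. a Gaussian
# with learnable mean theta (the normalizer is never needed).
def log_p(x, theta):
    return -0.5 * (x - theta) ** 2

def grad_theta_log_p(x, theta):
    return x - theta

def grad_x_log_p(x, theta):
    return theta - x

def langevin_step(x, theta, eps):
    """One (uncorrected) Langevin move; also returns log q(x'|x) and log q(x|x').

    Normalizing constants of q are omitted since they cancel in the weight ratio.
    """
    mean_fwd = x + 0.5 * eps ** 2 * grad_x_log_p(x, theta)
    x_new = mean_fwd + eps * rng.standard_normal(x.shape)
    log_q_fwd = -0.5 * ((x_new - mean_fwd) / eps) ** 2      # log q(x'|x)
    mean_bwd = x_new + 0.5 * eps ** 2 * grad_x_log_p(x_new, theta)
    log_q_bwd = -0.5 * ((x - mean_bwd) / eps) ** 2          # log q(x|x')
    return x_new, log_q_fwd, log_q_bwd

def cd_k_update(theta, batch, k=5, eps=0.5, lr=0.1, adjusted=False):
    """One parameter update of CD-k (Algorithm 1); with adjusted=True,
    the per-sample alpha weights of Algorithm 2 are applied."""
    x_tilde = batch.copy()
    log_w_tot = np.zeros_like(batch)
    for _ in range(k):
        x_new, lq_fwd, lq_bwd = langevin_step(x_tilde, theta, eps)
        # Accumulate the log of the per-step weight q(x̃|x') p(x') / (q(x'|x̃) p(x̃)).
        log_w_tot += lq_bwd + log_p(x_new, theta) - lq_fwd - log_p(x_tilde, theta)
        x_tilde = x_new
    g = grad_theta_log_p(x_tilde, theta) - grad_theta_log_p(batch, theta)
    if adjusted:
        # alpha = (1 + 1/w_tot)^(-1), computed stably as a sigmoid of log w_tot.
        g = g / (1.0 + np.exp(-log_w_tot))
    return theta - lr * g.mean()

# Fit theta to data drawn from N(3, 1); the estimate should approach 3.
data = 3.0 + rng.standard_normal(2000)
theta = 0.0
for _ in range(400):
    theta = cd_k_update(theta, rng.choice(data, size=100), adjusted=True)
print(f"fitted theta: {theta:.2f}")
```

Setting `adjusted=False` recovers plain CD-k; `adjusted=True` downweights samples whose chains strongly violate detailed balance, as in Algorithm 2.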
Algorithm 1: Contrastive Divergence-k

Require: parametric model p_θ, MCMC transition rule q_θ(·|·) with stationary distribution p_θ, step size η, chain length k.

while not converged do
    Sample a batch {x_i}_{i=1}^n from the dataset
    Initialize {x̃_i}_{i=1}^n to be a copy of the batch
    for i = 1 to n do
        for j = 1 to k do
            Draw a sample x′ from q_θ(·|x̃_i)
            x̃_i ← x′
        end
        g_i ← ∇_θ log p_θ(x̃_i) − ∇_θ log p_θ(x_i)
    end
    θ ← θ − (η/n) Σ_i g_i
end

Algorithm 2: Adjusted Contrastive Divergence-k

Require: parametric model p_θ, MCMC transition rule q_θ(·|·) whose stationary distribution is p_θ, step size η, chain length k.

while not converged do
    Sample a batch {x_i}_{i=1}^n from the dataset
    Initialize {x̃_i}_{i=1}^n to be a copy of the batch
    for i = 1 to n do
        w_i^tot ← 1
        for j = 1 to k do
            Draw a sample x′ from q_θ(·|x̃_i)
            w_i^tot ← w_i^tot · [q_θ(x̃_i|x′) p_θ(x′)] / [q_θ(x′|x̃_i) p_θ(x̃_i)]
            x̃_i ← x′
        end
        α_i ← (1 + 1/w_i^tot)^{−1}
        g_i ← ∇_θ log p_θ(x̃_i) − ∇_θ log p_θ(x_i)
    end
    θ ← θ − (η/n) Σ_i α_i · g_i
end

B DERIVATION OF CNCE'S GRADIENT
Proof of Observation 1.
The BCE loss achieved by the CNCE discriminator (7) is given by

$$\ell_{\mathrm{CNCE}} = -\mathbb{E}_{\substack{A\sim p\\ B|A\sim q}}\big[\log(D_\theta(A,B))\big] - \mathbb{E}_{\substack{B\sim p\\ A|B\sim q}}\big[\log(1-D_\theta(A,B))\big] = -2\,\mathbb{E}_{\substack{X\sim p\\ \tilde X|X\sim q}}\big[\log(D_\theta(X,\tilde X))\big], \tag{21}$$

where we used the fact that $1 - D_\theta(a,b) = D_\theta(b,a)$. Now, substituting the definition of $D_\theta$ from (9), the gradient of (21) can be expressed as

$$\begin{aligned}
\nabla_\theta \ell_{\mathrm{CNCE}} &= 2\,\mathbb{E}\Big[\nabla_\theta \log\big(1 + w_\theta(X,\tilde X)\big)\Big] \\
&= 2\,\mathbb{E}\Big[\big(1 + w_\theta(X,\tilde X)\big)^{-1}\,\nabla_\theta w_\theta(X,\tilde X)\Big] \\
&= 2\,\mathbb{E}\Big[\big(1 + w_\theta(X,\tilde X)\big)^{-1}\,\frac{w_\theta(X,\tilde X)}{w_\theta(X,\tilde X)}\,\nabla_\theta w_\theta(X,\tilde X)\Big] \\
&= 2\,\mathbb{E}\bigg[\frac{w_\theta(X,\tilde X)}{1 + w_\theta(X,\tilde X)}\cdot\frac{\nabla_\theta w_\theta(X,\tilde X)}{w_\theta(X,\tilde X)}\bigg] \\
&= 2\,\mathbb{E}\Big[\big(1 + w_\theta(X,\tilde X)^{-1}\big)^{-1}\,\nabla_\theta \log\big(w_\theta(X,\tilde X)\big)\Big] \\
&= 2\,\mathbb{E}\Big[\alpha_\theta(X,\tilde X)\big(\nabla_\theta \log p_\theta(\tilde X) - \nabla_\theta \log p_\theta(X)\big)\Big],
\end{aligned}\tag{22}$$

with all expectations taken over $X \sim p$, $\tilde X \mid X \sim q$, where we used the fact that $\nabla_\theta w_\theta = w_\theta \nabla_\theta \log(w_\theta)$ and the definition of $\alpha_\theta$ from (11).

C DERIVATION OF THE GRADIENT OF CNCE WITH MULTIPLE MC STEPS
We here describe the full derivation of the gradient in (18), following the same steps as in (22). The BCE loss achieved by the discriminator in (17) is given by

$$\ell_{\mathrm{chain}} = -2\,\mathbb{E}\Big[\log\big(D_\theta(X^{(0)}, X^{(1)}, \ldots, X^{(k)})\big)\Big], \tag{23}$$

where we again used the fact that $1 - D_\theta(a_0, a_1, \ldots, a_k) = D_\theta(a_k, a_{k-1}, \ldots, a_0)$. Now, substituting the definition of $D_\theta$ from (17), the gradient of (23) can be expressed as

$$\begin{aligned}
\nabla_\theta \ell_{\mathrm{chain}} &= 2\,\mathbb{E}\bigg[\nabla_\theta \log\bigg(1 + \prod_{i=1}^{k} w_\theta(X^{(i-1)}, X^{(i)})\bigg)\bigg] \\
&= 2\,\mathbb{E}\bigg[\bigg(1 + \prod_{i=1}^{k} w_\theta(X^{(i-1)}, X^{(i)})\bigg)^{-1}\nabla_\theta\bigg(\prod_{i=1}^{k} w_\theta(X^{(i-1)}, X^{(i)})\bigg)\bigg] \\
&= 2\,\mathbb{E}\bigg[\frac{\prod_{i=1}^{k} w_\theta(X^{(i-1)}, X^{(i)})}{1 + \prod_{i=1}^{k} w_\theta(X^{(i-1)}, X^{(i)})}\cdot\frac{\nabla_\theta\big(\prod_{i=1}^{k} w_\theta(X^{(i-1)}, X^{(i)})\big)}{\prod_{i=1}^{k} w_\theta(X^{(i-1)}, X^{(i)})}\bigg] \\
&= 2\,\mathbb{E}\bigg[\bigg(1 + \bigg(\prod_{i=1}^{k} w_\theta(X^{(i-1)}, X^{(i)})\bigg)^{-1}\bigg)^{-1}\nabla_\theta \log\bigg(\prod_{i=1}^{k} w_\theta(X^{(i-1)}, X^{(i)})\bigg)\bigg] \\
&= 2\,\mathbb{E}\Big[\alpha_\theta(X^{(0)}, \ldots, X^{(k)})\big(\nabla_\theta \log p_\theta(X^{(k)}) - \nabla_\theta \log p_\theta(X^{(0)})\big)\Big],
\end{aligned}\tag{24}$$

where we used the definition of $\alpha_\theta$ from (19).

D TOY EXPERIMENT AND TRAINING DETAILS
We here describe the full details of the toy model and learning configuration which we used to produce the results in the paper. The code for reproducing the results is available at —- (for the blind review, the code will be available in the supplementary material).

The toy model used in the paper consists of a distribution concentrated around a 2D spiral embedded in a 10-dimensional space. Denoting the 10 orthogonal axes of the standard basis in this space by e^(1), ..., e^(10), the spiral lies in the e^(1)-e^(2) plane and is confined to [−, ] in each of these two axes. The samples of the model are produced by selecting random points along the spiral and adding Gaussian noise to them. In order to keep the samples close to the e^(1)-e^(2) plane, we used non-isotropic noise with an STD of . in the e^(1) and e^(2) directions, and an STD of . in the directions e^(3), ..., e^(10).

As a parametric model for log p_θ(x), we used an 8-layer multi-layer perceptron (MLP) of width 512 with skip connections, as illustrated in Fig. 5.

[Figure 5: The architecture — a stack of fully-connected (FC) layers with ReLU activations and skip connections.]

Throughout the paper we referred to the results of five different learning configurations:

1. CNCE with an optimal (small) variance. This configuration uses additive Gaussian noise as its contrastive distribution. We found . to be the STD of the Gaussian which produces the best results.

2. CNCE with a large variance. This configuration is similar to the previous one, except that the STD of the Gaussian was set to . in order to illustrate the problems of using a conditional distribution with a large variance.

3. CD without any MCMC correction. For the MCMC process we used 5 steps of Langevin dynamics, where we did not employ any correction for the inaccuracy which results from using Langevin dynamics with a finite step size. We found . to be the step size (multiplying the standard Gaussian noise term) which produces the best results.

4. CD with MH correction. This configuration is similar to the previous one, except that an MH rejection scheme was used during the MCMC sampling. In this case we found a step size of . to produce the best results.

5. Adjusted CD. This configuration is similar to the previous one, except that we used the method from Sec. 3.4 instead of MH rejection. As in the previous configuration, we found a step size of . to produce the best results.

The optimization of all configurations was performed using SGD with a momentum of 0.9 and an exponentially decaying learning rate. Except for the training of the third configuration, the learning rate decayed from − to − over the optimization steps. For the third configuration we had to reduce the learning rate by a factor of 10 in order to prevent the optimization from diverging.

In order to select the best step size / variance for each of the configurations, we ran a parameter sweep around the relevant value range. The results of this sweep are shown in Fig. 6. For the selection of the number of training steps, we iteratively increased the number of steps until the results stopped improving for all configurations. These results are presented in Fig. 7.

[Figure 6: Parameter sweep over step sizes 0.005–0.0175 for the CNCE, CD without correction, CD + MH, and Adjusted CD configurations.]
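The spiral dataset described above can be sketched roughly as follows. The spiral parameterization, range, and noise STDs here are placeholder guesses, since the exact numeric values did not survive extraction; only the overall structure (a noisy 2D spiral in the first two of 10 axes, with much smaller noise in the remaining eight) follows the text.

```python
import numpy as np

def sample_spiral(n, rng, noise_inplane=0.05, noise_offplane=0.01, turns=2.0):
    """Draw n samples near a 2D spiral embedded in R^10.

    The spiral shape and the two noise STDs are illustrative assumptions,
    not the values used in the paper.
    """
    t = rng.uniform(0.0, 1.0, size=n)          # position along the spiral
    angle = 2.0 * np.pi * turns * t
    radius = t                                 # Archimedean spiral: radius grows linearly
    x = np.zeros((n, 10))
    x[:, 0] = radius * np.cos(angle)           # e^(1) coordinate
    x[:, 1] = radius * np.sin(angle)           # e^(2) coordinate
    # Non-isotropic Gaussian noise: larger in the spiral plane, tiny elsewhere.
    x[:, :2] += noise_inplane * rng.standard_normal((n, 2))
    x[:, 2:] += noise_offplane * rng.standard_normal((n, 8))
    return x

samples = sample_spiral(1000, np.random.default_rng(0))
print(samples.shape)
```

The key property mirrored from the text is the anisotropy: the distribution is effectively two-dimensional, with the remaining eight coordinates tightly concentrated around zero.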