Belief Propagation in Conditional RBMs for Structured Prediction
Wei Ping (Computer Science, UC Irvine, [email protected])
Alexander Ihler (Computer Science, UC Irvine, [email protected])
Abstract
Restricted Boltzmann machines (RBMs) and conditional RBMs (CRBMs) are popular models for a wide range of applications. In previous work, learning on such models has been dominated by contrastive divergence (CD) and its variants. Belief propagation (BP) algorithms are believed to be slow for structured prediction on conditional RBMs (e.g., Mnih et al. [2011]), and not as good as CD when applied in learning (e.g., Larochelle et al. [2012]). In this work, we present a matrix-based implementation of belief propagation algorithms on CRBMs, which is easily scalable to tens of thousands of visible and hidden units. We demonstrate that, in both maximum likelihood and max-margin learning, training conditional RBMs with BP as the inference routine can provide significantly better results than current state-of-the-art CD methods on structured prediction problems. We also include practical guidelines on training CRBMs with BP, and some insights on the interaction of learning and inference algorithms for CRBMs.
1 Introduction

A restricted Boltzmann machine (RBM) is a two-layer latent variable model that uses a layer of hidden units h to model the distribution of visible units v. RBMs are widely used as building blocks for deep generative models, such as deep belief networks [Hinton et al., 2006] and deep Boltzmann machines [Salakhutdinov and Hinton, 2009]. Due to the intractability of the partition function in maximum likelihood estimation (MLE), RBMs are usually learned using the contrastive divergence (CD) algorithm [Hinton, 2002], which approximates the gradient of the log-partition function using a k-step Gibbs sampler (referred to as CD-k). To speed up the convergence of the Markov chain, a critical trick in CD-k is to initialize the state of the Markov chain with each training instance. Although it has been shown that CD-k does not follow the gradient of any objective function [Sutskever and Tieleman, 2010], it works well in many practical applications [Hinton, 2010]. An important variant of CD-k is persistent CD (PCD) [Tieleman, 2008]. PCD uses a persistent Markov chain during learning, where the Markov chain is not reset between parameter updates. Because the learning rate is usually small and the model changes only slightly between parameter updates, the long-run persistent chain in PCD usually provides a better approximation to the target distribution than the limited-step chain in CD-k.

A conditional RBM (CRBM) is the discriminative extension of the RBM to include observed features x. CRBMs are used in deep probabilistic models for supervised learning [Hinton et al., 2006], and also provide a stand-alone solution to a wide range of problems such as classification [Larochelle and Bengio, 2008], human motion capture [Taylor et al., 2006], collaborative filtering [Salakhutdinov et al., 2007], and structured prediction [Mnih et al., 2011, Yang et al., 2014]. For structured prediction, a CRBM need not make any explicit assumptions about the structure of the output variables (the visible units v). This is especially useful in applications where the structure of the outputs is challenging to describe (e.g., multi-label learning [Li et al., 2015]). In image denoising or image segmentation, the hidden units can encode higher-order correlations of the visible units (e.g., shapes, or parts of objects), which play the same role as high-order potentials but can improve statistical efficiency.

In contrast to the success of CD methods for RBMs, it has been noted that both CD-k and PCD may not be well suited to learning conditional RBMs [Mnih et al., 2011]. In particular, PCD is not appropriate for learning such conditional models, because the observed features x greatly affect the model potentials. This means we would need to run a separate persistent chain for every training instance, which is costly for large datasets. To make things worse, as we revisit a training instance in stochastic gradient descent (SGD), which is standard practice for large datasets, the model parameters will have changed substantially, making the persistent chain for this instance far from the target distribution. Also, given the observed features, CRBMs tend to be more peaked than RBMs in a purely generative setting.
CD methods may make slow progress because it is difficult for the sampling procedure to explore these peaked but multi-modal distributions. It was also observed that the important trick in CD-k, which initializes the Markov chain using the training data, does not work well for CRBMs in structured prediction [Mnih et al., 2011]. In contrast, starting the Gibbs chain with a random state (which resembles the original learning algorithm for Boltzmann machines [Ackley et al., 1985]) provides better results.

Approximate inference methods, such as mean field (MF) and belief propagation (BP), can be employed as inference routines in learning, as well as for making predictions after the CRBM has been learned [Welling and Teh, 2003, Yasuda and Tanaka, 2009]. Although loopy BP usually provides a better approximation of marginals than MF [Murphy et al., 1999], it was found to be slow on CRBMs for structured prediction and was only considered practical on problems with few visible and hidden nodes [Mnih et al., 2011, Mandel et al., 2011]. This inefficiency prevents it from being widely applied to conditional RBMs for structured prediction, in which the CRBMs may have thousands of visible and hidden units. More importantly, there is a pervasive opinion that belief propagation does not work well on RBM-based models, especially for learning [Goodfellow et al., 2016, Chapter 16].

In this work, we present an efficient implementation of belief propagation algorithms for conditional RBMs. It takes advantage of the bipartite graph structure and is scalable to tens of thousands of visible and hidden units. Our algorithm uses a compact representation and depends only on matrix products and element-wise operations, which are typically highly optimized in modern high-performance computing architectures. We demonstrate that, in the conditional setting, learning RBM-based models with belief propagation and its variants can provide much better results than the state-of-the-art CD methods. We also show that the marginal structured SVM (MSSVM; [Ping et al., 2014]) can provide improvements for max-margin learning of CRBMs [Yang et al., 2014]. We include practical guidelines on training CRBMs, and some insights on the interaction of learning and message-passing algorithms for CRBMs.

We organize the rest of the paper as follows. Section 2 discusses some connections to related work. We review the RBM model and conditional RBMs in Section 3, and discuss the learning algorithms in Section 4. In Section 5, we provide our efficient inference procedure. We report experimental results in Section 6 and conclude the paper in Section 7.

2 Related Work

Mnih et al. [2011] proposed the CD-PercLoss algorithm for conditional RBMs, which uses a CD-like stochastic search procedure to minimize the perceptron loss on the training data. Given the observed features of a training instance, CD-PercLoss starts the Gibbs chain using the logistic regression component of the CRBM. Yang et al. [2014] trained CRBMs using a latent structured SVM (LSSVM) objective [Yu and Joachims, 2009], and used a greedy search (i.e., iterated conditional modes) for joint maximum a posteriori (MAP) inference over hidden and visible units.

It is also feasible to apply the mean-field (MF) approximation to the partition function in MLE learning of RBMs and CRBMs [Peterson and Anderson, 1987]. Although efficient, this is conceptually problematic in the sense that it effectively maximizes an upper bound of the log-likelihood in learning. In addition, MF uses a unimodal proposal to approximate a multi-modal distribution, which may lead to unsatisfactory results.

Although belief propagation (BP) and its variants have long been used to learn conditional random fields (CRFs) with hidden variables [Quattoni et al., 2007, Ping et al., 2014], they have mainly been applied on sparsely connected graphs (e.g., chains and grids) and were believed to be ineffective and slow on very dense graphs like CRBMs [Mnih et al., 2011, Goodfellow et al., 2016]. A few recent works impose particular assumptions on the type of edge potentials and provide efficient inference algorithms for fully connected CRFs. For example, the edge potentials in [Krähenbühl and Koltun, 2012] are defined by a linear combination of Gaussian kernels. In this work, however, we propose to speed up general belief propagation on conditional RBMs without any restrictions on the potential functions.
3 Background

In this section, we review background on RBMs and conditional RBMs. We also discuss structured prediction with CRBMs.

[Figure 1: Graphical illustration of (a) an RBM with $|v| = 5$ visible units and $|h| = 4$ hidden units, and (b) the extended CRBM with $v$ as output variables and $x$ as observed input features.]

An RBM is an undirected graphical model (see Figure 1(a)) that defines a joint distribution over the vector of visible units $v \in \{0,1\}^{|v| \times 1}$ and the vector of hidden units $h \in \{0,1\}^{|h| \times 1}$,

$$ p(v, h \mid \theta) = \frac{1}{Z(\theta)} \exp\big({-E(v, h; \theta)}\big), \qquad (1) $$

where $|v|$ and $|h|$ are the dimensions of $v$ and $h$ respectively, and $E(v, h; \theta)$ is the energy function,

$$ E(v, h; \theta) = -v^\top W^{vh} h - v^\top b^v - h^\top b^h, $$

with model parameters $\theta = \{W^{vh}, b^v, b^h\}$, including the pairwise interaction term $W^{vh} \in \mathbb{R}^{|v| \times |h|}$, and bias terms $b^v \in \mathbb{R}^{|v| \times 1}$ for the visible units and $b^h \in \mathbb{R}^{|h| \times 1}$ for the hidden units. The function $Z(\theta)$ is the normalization constant, or partition function,

$$ Z(\theta) = \sum_v \sum_h \exp\big({-E(v, h; \theta)}\big), $$

which is typically intractable to calculate.

The conditional RBM (CRBM) extends RBMs to include observed features $x$ (see Figure 1(b) for an illustration [Mnih et al., 2011]), and defines a joint conditional distribution over $v$ and $h$ given the features $x \in \mathbb{R}^{|x| \times 1}$,

$$ p(v, h \mid x; \theta) = \frac{1}{Z(x; \theta)} \exp\big({-E(v, h, x; \theta)}\big), \qquad (2) $$

where the energy function $E$ is defined as

$$ E(v, h, x; \theta) = -v^\top W^{vh} h - v^\top W^{vx} x - h^\top W^{hx} x - v^\top b^v - h^\top b^h, $$

and $\theta = \{W^{vh}, W^{vx}, W^{hx}, b^v, b^h\}$ are the model parameters. $Z(x; \theta)$ is the $x$-dependent partition function,

$$ Z(x; \theta) = \sum_v \sum_h \exp\big({-E(v, h, x; \theta)}\big). $$

One can view an RBM as a special CRBM with $x \equiv 0$.
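For concreteness, here is a minimal NumPy sketch of the CRBM energy and a brute-force log-partition function (the helper names are ours, and enumeration is only feasible for toy models; this is an illustration of the definitions above, not the authors' implementation):

```python
import itertools
import numpy as np

def crbm_energy(v, h, x, Wvh, Wvx, Whx, bv, bh):
    """Energy E(v, h, x; theta) of Eq. (2); v and h are 0/1 vectors."""
    return -(v @ Wvh @ h + v @ Wvx @ x + h @ Whx @ x + v @ bv + h @ bh)

def log_partition(x, Wvh, Wvx, Whx, bv, bh):
    """Brute-force log Z(x; theta) by enumerating all (v, h) states."""
    nv, nh = Wvh.shape
    logZ = -np.inf
    for v in itertools.product([0.0, 1.0], repeat=nv):
        for h in itertools.product([0.0, 1.0], repeat=nh):
            e = crbm_energy(np.array(v), np.array(h), x, Wvh, Wvx, Whx, bv, bh)
            logZ = np.logaddexp(logZ, -e)
    return logZ
```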
Conditional Distribution:  Because CRBMs have a bipartite structure given the observed features, the conditional distributions $p(v \mid h, x)$ and $p(h \mid v, x)$ are fully factorized and can be written as

$$ p(v \mid h, x) = \prod_i p(v_i \mid h, x), \qquad p(h \mid v, x) = \prod_j p(h_j \mid v, x), $$

with

$$ p(v_i = 1 \mid h, x) = \sigma\big(W^{vh}_{i\cdot} h + W^{vx}_{i\cdot} x + b^v_i\big), \qquad p(h_j = 1 \mid v, x) = \sigma\big(v^\top W^{vh}_{\cdot j} + W^{hx}_{j\cdot} x + b^h_j\big), \qquad (3) $$

where $\sigma(u) = 1/(1 + \exp(-u))$ is the logistic function, $W^{vh}_{i\cdot}$ and $W^{vh}_{\cdot j}$ are the $i$-th row and $j$-th column of $W^{vh}$ respectively, $W^{vx}_{i\cdot}$ is the $i$-th row of $W^{vx}$, and $W^{hx}_{j\cdot}$ is the $j$-th row of $W^{hx}$. Eq. (3) allows us to derive a blocked Gibbs sampler that iteratively alternates between drawing $v$ and $h$.
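As an illustration, one sweep of the blocked Gibbs sampler follows directly from Eq. (3); a minimal sketch (helper names are ours):

```python
def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def gibbs_sweep(v, x, Wvh, Wvx, Whx, bv, bh, rng):
    """One blocked Gibbs sweep: draw h | v, x, then v | h, x, via Eq. (3)."""
    ph = sigmoid(v @ Wvh + Whx @ x + bh)      # p(h_j = 1 | v, x) for all j
    h = (rng.random(ph.shape) < ph).astype(float)
    pv = sigmoid(Wvh @ h + Wvx @ x + bv)      # p(v_i = 1 | h, x) for all i
    v = (rng.random(pv.shape) < pv).astype(float)
    return v, h
```

CD-k corresponds to initializing $v$ at a training instance $v^n$ and applying $k$ such sweeps before taking empirical moments.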
Marginal Distribution:  The marginal distribution of the visible units $v$ given observed features $x$ is

$$ p(v \mid x) = \sum_h p(v, h \mid x) = \frac{1}{Z(x; \theta)} \exp\big({-F(v, x; \theta)}\big), \qquad (4) $$

where the negative energy function has the analytic form

$$ -F(v, x; \theta) = \sum_{j=1}^{|h|} \log\Big[1 + \exp\big(v^\top W^{vh}_{\cdot j} + W^{hx}_{j\cdot} x + b^h_j\big)\Big] + v^\top W^{vx} x + v^\top b^v. $$

Note that after marginalizing out the hidden variables, the log-linear model (2) becomes a non-linear model (4), which can capture high-order correlations among the visible units. This property is especially important in many applications of CRBMs with structured output [e.g., Salakhutdinov et al., 2007, Mnih et al., 2011].
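Eq. (4) is easy to evaluate up to the constant $\log Z(x; \theta)$; a sketch (the softplus $\log(1 + e^a)$ is written stably via np.logaddexp; the function name is ours):

```python
def neg_free_energy(v, x, Wvh, Wvx, Whx, bv, bh):
    """-F(v, x; theta) of Eq. (4), with hidden units summed out in closed form."""
    a = v @ Wvh + Whx @ x + bh                 # pre-activation of each hidden unit
    return np.logaddexp(0.0, a).sum() + v @ Wvx @ x + v @ bv
```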
In structured prediction, the visible units $v$ typically represent output variables, while the observed $x$ represent input features, and the hidden units $h$ facilitate modeling the output variables given the observed features. To make predictions, one choice is to infer the modes of the singleton marginals,

$$ p(v_i \mid x) = \sum_{v_{\setminus i}} \sum_h p(v, h \mid x). $$

This marginalization inference is intractable and is closely related to calculating the partition function. One can also decode the output $v$ by performing joint maximum a posteriori (MAP) inference [e.g., Yu and Joachims, 2009, Yang et al., 2014],

$$ (\hat{v}, \hat{h}) = \operatorname*{argmax}_{v, h}\; p(v, h \mid x), $$

which gives a prediction for the pair $(v, h)$; one obtains a prediction of $v$ by simply discarding the $h$ component. Intuitively, the joint MAP prediction is "over-confident", since it deterministically assigns the hidden units to their most likely states, and is not robust when the uncertainty of the hidden units is high. One promising alternative for CRBMs is marginal MAP prediction [Ping et al., 2014],

$$ \tilde{v} = \operatorname*{argmax}_v\; p(v \mid x) = \operatorname*{argmax}_v\; \sum_h \exp\big({-E(v, h, x; \theta)}\big), $$

which explicitly takes into account the uncertainty of the hidden units by marginalizing them out. In general, these predictions are intractable in CRBMs, and one must use approximate inference methods, such as mean field or belief propagation.
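On a toy model, the difference between the two decoders can be checked by brute force; a sketch reusing the crbm_energy and neg_free_energy helpers above (enumeration only, so this is purely illustrative):

```python
def predict_joint_map(x, Wvh, Wvx, Whx, bv, bh):
    """Joint MAP: argmax over (v, h) jointly, then discard h."""
    nv, nh = Wvh.shape
    pairs = ((np.array(v), np.array(h))
             for v in itertools.product([0.0, 1.0], repeat=nv)
             for h in itertools.product([0.0, 1.0], repeat=nh))
    v_hat, _ = min(pairs, key=lambda vh: crbm_energy(*vh, x, Wvh, Wvx, Whx, bv, bh))
    return v_hat

def predict_marginal_map(x, Wvh, Wvx, Whx, bv, bh):
    """Marginal MAP: sum out h analytically, then argmax over v alone."""
    nv = Wvh.shape[0]
    cands = (np.array(v) for v in itertools.product([0.0, 1.0], repeat=nv))
    return max(cands, key=lambda v: neg_free_energy(v, x, Wvh, Wvx, Whx, bv, bh))
```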
4 Learning Algorithms

In this section, we discuss different learning methods for conditional RBMs.

Assume we have a training set $\{v^n, x^n\}_{n=1}^N$; then the log-likelihood can be written as

$$ \sum_{n=1}^N \Big\{ \log \sum_h \exp\big({-E(v^n, h, x^n; \theta)}\big) - \log Z(x^n; \theta) \Big\}. $$

To efficiently maximize the objective function, stochastic gradient descent (SGD) is usually applied. Given a randomly chosen instance $\{v^n, x^n\}$, one can show that the gradient of the log-likelihood w.r.t. $W^{vh}$ is

$$ \frac{\partial \log p(v^n \mid x^n)}{\partial W^{vh}} = v^n (\mu^n)^\top - \mathbb{E}_{p(v, h \mid x^n)}\big[v h^\top\big], \qquad (5) $$

where $\mu^n = \sigma(W^{vh\top} v^n + W^{hx} x^n + b^h)$ and the logistic function $\sigma$ is applied element-wise. The positive part of the gradient can be calculated exactly. The negative part arises from the derivative of the log-partition function and is intractable to calculate. The gradients of the log-likelihood w.r.t. the other parameters are analogous to Eq. (5), and can be found in Appendix A.

CD-k initializes the Gibbs chain at the instance $v^n$ and performs $k$-step Gibbs sampling using Eq. (3). Then, the empirical moment is used as a substitute for the intractable expectation $\mathbb{E}_{p(v, h \mid x^n)}[v h^\top]$. Although this works well on RBMs, it gives unsatisfactory results on CRBMs. In practice, the conditional distributions $p(v, h \mid x^n)$ are strongly influenced by the observed features $x^n$, and are usually more peaked than those of generative RBMs. It is usually difficult for a Markov chain with few steps (e.g., 10) to explore these peaked and multi-modal distributions. PCD uses a long-run persistent Markov chain to improve convergence, but is not suitable for CRBMs, as discussed in Section 1.

Sum-product BP and mean field methods provide pseudo-marginals as substitutes for the intractable expectations in Eq. (5). These deterministic gradient estimates have the advantage that a larger learning rate can be used. BP tends to give a more accurate estimate of $\log Z$ and the marginals, but has been reported to be slow on CRBMs and impractical on structured prediction problems with large output dimensions and hidden layer sizes [Mnih et al., 2011].

More importantly, it has been observed that belief propagation usually gives unsatisfactory results when learning vanilla RBMs. This is mainly because the magnitude of the parameters gradually increases during learning; the RBM model eventually undergoes a "phase transition" after which BP has difficulty converging [Ihler et al., 2005, Mooij and Kappen, 2005]. If BP does not converge, it cannot provide a meaningful gradient direction to update the model, and learning becomes stuck. However, CRBMs appear to behave quite differently, because they operate in the "high signal" regime provided by an informative observation x. This improves the convergence behavior of BP, which may not be surprising since loopy BP is widely accepted as useful for learning other conditional models (e.g., grid CRFs for image segmentation). In addition, given N training instances for learning the CRBM, BP is actually performed on N different RBMs corresponding to the different features $x^n$. During any particular phase of learning, BP may have trouble converging on some training instances, but we can still make progress as long as BP converges on the majority of instances. We demonstrate this behavior in our experiments.
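Concretely, one SGD step of this MLE-BP scheme replaces the intractable expectation in Eq. (5) with the pairwise beliefs $\Gamma$ returned by BP. A sketch, assuming the matrix_bp routine given in Section 5 below (the step itself is standard gradient ascent; substituting beliefs for moments is the approximation just described):

```python
def mle_bp_step(vn, xn, Wvh, Wvx, Whx, bv, bh, lr=0.1, T=10):
    """One stochastic gradient ascent step on log p(v^n | x^n) for W^{vh}."""
    mu = sigmoid(Wvh.T @ vn + Whx @ xn + bh)          # mu^n in Eq. (5), exact
    # absorb the features into the biases, then run BP on the induced RBM
    _, _, Gamma = matrix_bp(Wvh, bv + Wvx @ xn, bh + Whx @ xn, T=T)
    Wvh = Wvh + lr * (np.outer(vn, mu) - Gamma)       # Eq. (5) with E[v h^T] ~ Gamma
    return Wvh
```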
Another by-product of using BP is that it enables us to apply the marginal structured SVM (MSSVM) [Ping et al., 2014] framework for max-margin learning of CRBMs,

$$ \min_\theta \sum_{n=1}^N \Big\{ \max_v \log \sum_h \exp\big(\Delta(v, v^n) - E(v, h, x^n; \theta)\big) - \log \sum_h \exp\big({-E(v^n, h, x^n; \theta)}\big) \Big\}, \qquad (6) $$

where the loss function $\Delta(v, v^n) = \sum_i \Delta(v_i, v^n_i)$ is decomposable (e.g., Hamming loss). Because the loss is decomposable, it adds only a linear term in $v$ to the exponent; for the Hamming loss, $\Delta(v_i, v^n_i) = v_i(1 - 2v^n_i) + v^n_i$, so loss augmentation simply shifts the visible bias $b^v_i$ by $(1 - 2v^n_i)$. In contrast to LSSVM [Yu and Joachims, 2009, Yang et al., 2014], MSSVM marginalizes over the uncertainty of the hidden variables, and can significantly outperform LSSVM when that uncertainty is large [Ping et al., 2014]. Experimentally, we find that MSSVM improves the performance of max-margin CRBMs, likely because there is usually non-trivial uncertainty in the hidden units. Given an instance $\{v^n, x^n\}$, the stochastic gradient of Eq. (6) w.r.t. $W^{vh}$ is

$$ \frac{\partial \ell(v^n, x^n)}{\partial W^{vh}} = \mathbb{E}_{p(h \mid \hat{v}, x^n)}\big[\hat{v} h^\top\big] - v^n (\mu^n)^\top, \qquad (7) $$

where $\mu^n$ is defined as in Eq. (5), and $\hat{v}$ is the loss-augmented marginal MAP prediction,

$$ \hat{v} = \operatorname*{argmax}_v\; \sum_h \exp\big(\Delta(v, v^n) - E(v, h, x^n; \theta)\big); $$

"mixed-product" belief propagation [Liu and Ihler, 2013] or dual-decomposition methods [Ping et al., 2015] for marginal MAP can provide pseudo-marginals to estimate the intractable expectation. (The gradients for the other parameters are analogous.)

5 Matrix-based Belief Propagation

In this section, we present a matrix-based implementation of the sum-product and mixed-product BP algorithms for RBMs. Given a particular $x^n$ in the CRBM (2), we obtain an $x^n$-dependent RBM model,

$$ p(v, h \mid x^n) = \frac{1}{Z(\theta(x^n))} \exp\big(v^\top W^{vh} h + v^\top b^v + h^\top b^h\big), $$

where the bias terms absorb the observed features, $b^v \leftarrow b^v + W^{vx} x^n$ and $b^h \leftarrow b^h + W^{hx} x^n$; thus we can apply the algorithm directly to CRBMs.

We first review the standard message-passing form on RBMs. On a dense graphical model like an RBM, to reduce the amount of computation, one should always pre-compute the product of incoming messages (i.e., the beliefs) at the nodes, and reuse them when updating all outgoing messages. In sum-product BP, the fixed-point update rule for the message sent from hidden unit $h_j$ to visible unit $v_i$ is

$$ m_{j \to i}(v_i) \propto \sum_{h_j} \exp\big(v_i W^{vh}_{ij} h_j\big) \cdot \frac{\tau(h_j)}{m_{i \to j}(h_j)}, \qquad (8) $$

where the belief at $h_j$ is

$$ \tau(h_j) \propto \exp\big(h_j b^h_j\big) \cdot \prod_{k=1}^{|v|} m_{k \to j}(h_j). \qquad (9) $$

The update rule for the message sent from $v_i$ to $h_j$ is

$$ m_{i \to j}(h_j) \propto \sum_{v_i} \exp\big(v_i W^{vh}_{ij} h_j\big) \cdot \frac{\tau(v_i)}{m_{j \to i}(v_i)}, \qquad (10) $$

where the belief at $v_i$ is

$$ \tau(v_i) \propto \exp\big(v_i b^v_i\big) \cdot \prod_{k=1}^{|h|} m_{k \to i}(v_i). \qquad (11) $$

In mixed-product BP, the message sent from a hidden unit to a visible unit is the same as in Eq. (8). The message sent from visible unit $v_i$ to hidden unit $h_j$ is

$$ \tilde{m}_{i \to j}(h_j) \propto \exp\big(\tilde{v}_i W^{vh}_{ij} h_j\big) \cdot \frac{\tau(\tilde{v}_i)}{m_{j \to i}(\tilde{v}_i)}, \qquad (12) $$

where $\tilde{v}_i = \operatorname*{argmax}_{v_i} \tau(v_i)$, and $\tau(v_i)$ is defined in Eq. (11). These update equations are applied repeatedly until the values converge (hopefully), or a stopping criterion is satisfied. Then, the pairwise belief on $(v_i, h_j)$ is calculated as

$$ \tau(v_i, h_j) \propto \exp\big(v_i W^{vh}_{ij} h_j\big) \cdot \frac{\tau(v_i)}{m_{j \to i}(v_i)} \cdot \frac{\tau(h_j)}{m_{i \to j}(h_j)}. $$

It is well known that BP on loopy graphs is not guaranteed to converge, although in practice it usually does [Murphy et al., 1999].
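For reference, here is a deliberately naive per-edge transcription of updates (8)-(11) (a sketch of ours in the spirit of generic BP packages; it touches every edge with an inner sum over neighbors, which is exactly the cost the matrix form below removes):

```python
def scalar_bp(W, bv, bh, T=10):
    """Per-edge sum-product BP, Eqs. (8)-(11). Messages m_hv[i, j] = m_{j->i}(v_i=1)
    and m_vh[i, j] = m_{i->j}(h_j=1) are stored as probabilities of the '1' state."""
    nv, nh = W.shape
    m_hv = np.full((nv, nh), 0.5)
    m_vh = np.full((nv, nh), 0.5)
    for _ in range(T):
        for i in range(nv):
            for j in range(nh):
                # log cavity belief at h_j: Eq. (9) divided by m_{i->j}
                s1 = bh[j] + np.log(m_vh[:, j]).sum() - np.log(m_vh[i, j])
                s0 = np.log(1 - m_vh[:, j]).sum() - np.log(1 - m_vh[i, j])
                c = max(s1, s0)                                  # shared shift
                p1 = np.exp(W[i, j] + s1 - c) + np.exp(s0 - c)   # v_i = 1, Eq. (8)
                p0 = np.exp(s1 - c) + np.exp(s0 - c)             # v_i = 0
                m_hv[i, j] = p1 / (p1 + p0)
        for i in range(nv):
            for j in range(nh):
                # log cavity belief at v_i: Eq. (11) divided by m_{j->i}
                s1 = bv[i] + np.log(m_hv[i, :]).sum() - np.log(m_hv[i, j])
                s0 = np.log(1 - m_hv[i, :]).sum() - np.log(1 - m_hv[i, j])
                c = max(s1, s0)
                p1 = np.exp(W[i, j] + s1 - c) + np.exp(s0 - c)   # h_j = 1, Eq. (10)
                p0 = np.exp(s1 - c) + np.exp(s0 - c)             # h_j = 0
                m_vh[i, j] = p1 / (p1 + p0)
    tau_v = sigmoid(bv + np.log(m_hv / (1 - m_hv)).sum(axis=1))
    tau_h = sigmoid(bh + np.log(m_vh / (1 - m_vh)).sum(axis=0))
    return tau_v, tau_h
```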
Our algorithms use a compact matrix representation. We collect the "free" beliefs into vectors and a matrix,

$$ \tau^v \in \mathbb{R}^{|v| \times 1}, \;\; \tau^v_i = \tau(v_i = 1); \qquad \tau^h \in \mathbb{R}^{|h| \times 1}, \;\; \tau^h_j = \tau(h_j = 1); \qquad \Gamma \in \mathbb{R}^{|v| \times |h|}, \;\; \Gamma_{ij} = \tau(v_i = 1, h_j = 1). $$

The other beliefs can be represented by these "free" beliefs:

$$ \tau(v_i = 0) = 1 - \tau^v_i, \qquad \tau(h_j = 0) = 1 - \tau^h_j, $$
$$ \tau(v_i = 1, h_j = 0) = \tau^v_i - \Gamma_{ij}, \qquad \tau(v_i = 0, h_j = 1) = \tau^h_j - \Gamma_{ij}, \qquad \tau(v_i = 0, h_j = 0) = 1 + \Gamma_{ij} - \tau^v_i - \tau^h_j. $$

We similarly define the normalized message matrices

$$ M^{vh} \in \mathbb{R}^{|v| \times |h|}, \;\; M^{vh}_{ij} = m_{j \to i}(v_i = 1); \qquad M^{hv} \in \mathbb{R}^{|h| \times |v|}, \;\; M^{hv}_{ji} = m_{i \to j}(h_j = 1). $$

Thus, $M^{vh}$ represents all the messages sent from $h$ to $v$, and $M^{hv}$ represents all the messages from $v$ to $h$. One can show (see Appendix B.1) that the update equation for the message matrix $M^{vh}$ in both sum-product and mixed-product BP is

$$ M^{vh} = \sigma\bigg(\log \frac{\exp(W^{vh}) \circ \Lambda^{vh}_1 + \Lambda^{vh}_0}{\Lambda^{vh}_1 + \Lambda^{vh}_0}\bigg), \qquad (13) $$

where

$$ \Lambda^{vh}_1 = (\mathbf{1}_{hv} - M^{hv})^\top \cdot \mathrm{diag}(\tau^h), \qquad \Lambda^{vh}_0 = M^{hv\top} \cdot \mathrm{diag}(\mathbf{1}_h - \tau^h), $$

$\mathbf{1}_{hv}$ is a $|h| \times |v|$ matrix of ones, $\mathbf{1}_h$ is a $|h| \times 1$ vector of ones, $\circ$ is the element-wise Hadamard product, and $\mathrm{diag}(\cdot)$ places the elements of a vector on the diagonal of a diagonal matrix. The logarithm, fraction, and logistic function are all applied element-wise. Similarly, the update equation for the message matrix $M^{hv}$ in sum-product BP is

$$ M^{hv} = \sigma\bigg(\log \frac{\exp(W^{vh\top}) \circ \Lambda^{hv}_1 + \Lambda^{hv}_0}{\Lambda^{hv}_1 + \Lambda^{hv}_0}\bigg), \qquad (14) $$

where

$$ \Lambda^{hv}_1 = (\mathbf{1}_{vh} - M^{vh})^\top \cdot \mathrm{diag}(\tau^v), \qquad \Lambda^{hv}_0 = M^{vh\top} \cdot \mathrm{diag}(\mathbf{1}_v - \tau^v), $$

with $\mathbf{1}_{vh}$ a $|v| \times |h|$ matrix of ones, and $\mathbf{1}_v$ a $|v| \times 1$ vector of ones. In mixed-product BP, the update for $M^{hv}$ is

$$ M^{hv} = \sigma\big(W^{vh\top} \cdot \mathrm{diag}(\tilde{v})\big), \qquad (15) $$

where $\tilde{v}_i = \operatorname*{argmax}_{v_i} \tau(v_i)$ for all $v_i$. In addition, one can show (see Appendix B.2) that the belief vectors $\tau^v$ and $\tau^h$ can be calculated as

$$ \tau^v = \sigma\bigg(b^v + \log\Big(\frac{M^{vh}}{\mathbf{1}_{vh} - M^{vh}}\Big) \cdot \mathbf{1}_h\bigg), \qquad (16) $$

$$ \tau^h = \sigma\bigg(b^h + \log\Big(\frac{M^{hv}}{\mathbf{1}_{hv} - M^{hv}}\Big) \cdot \mathbf{1}_v\bigg), \qquad (17) $$

where $\cdot$ is the matrix product. These update equations are applied repeatedly until a stopping criterion is satisfied. Afterwards, the pairwise belief matrix $\Gamma$ is calculated as

$$ \Gamma = \frac{\Gamma^{11}}{\Gamma^{11} + \Gamma^{01} + \Gamma^{10} + \Gamma^{00}}, \qquad \text{where} \qquad (18) $$

$$ \Gamma^{11} = \exp(W^{vh}) \circ (\tau^v \cdot \tau^{h\top}) \circ (\mathbf{1}_{vh} - M^{vh}) \circ (\mathbf{1}_{hv} - M^{hv})^\top, $$
$$ \Gamma^{01} = \big((\mathbf{1}_v - \tau^v) \cdot \tau^{h\top}\big) \circ M^{vh} \circ (\mathbf{1}_{hv} - M^{hv})^\top, $$
$$ \Gamma^{10} = \big(\tau^v \cdot (\mathbf{1}_h - \tau^h)^\top\big) \circ (\mathbf{1}_{vh} - M^{vh}) \circ M^{hv\top}, $$
$$ \Gamma^{00} = \big((\mathbf{1}_v - \tau^v) \cdot (\mathbf{1}_h - \tau^h)^\top\big) \circ M^{vh} \circ M^{hv\top}. $$

We summarize the matrix-based sum-product and mixed-product BP in Algorithm 1. It is well known that asynchronous (sequential) BP message updates usually converge much faster than synchronous updates [e.g., Wainwright et al., 2003, Gonzalez et al., 2009]; in Algorithm 1, although messages are sent in parallel from all hidden units to all visible units, the bipartite graph structure ensures that these are effectively asynchronous updates, which helps convergence in practice. Our method is also related to message-passing algorithms designed for other binary networks, such as binary LDPC codes [Kschischang et al., 2001], which parametrize each message by a single real number using a hyperbolic tangent transform. Our algorithm is specially designed for RBM-based models, and significantly speeds up BP by taking advantage of the RBM structure and using only matrix operations.
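Algorithm 1 below summarizes these updates. As a concrete reference, here is a NumPy transcription of Eqs. (13)-(18) (the paper's implementation is in Matlab; this re-implementation is ours, is numerically naive for very large weights, and reuses the sigmoid helper defined earlier):

```python
def matrix_bp(W, bv, bh, T=10, mixed=False):
    """Matrix-based sum-product (mixed=False) or mixed-product (mixed=True) BP
    on an RBM with weights W (nv x nh) and biases bv (nv,), bh (nh,).
    Returns tau_v, tau_h and pairwise beliefs Gamma[i, j] = tau(v_i=1, h_j=1)."""
    nv, nh = W.shape
    Mvh = np.full((nv, nh), 0.5)      # messages h -> v, probability of v_i = 1
    Mhv = np.full((nh, nv), 0.5)      # messages v -> h, probability of h_j = 1
    tau_v, tau_h = sigmoid(bv), sigmoid(bh)
    for _ in range(T):
        # h -> v messages, Eq. (13): Lam1[i, j] = (1 - Mhv[j, i]) * tau_h[j], etc.
        Lam1, Lam0 = (1 - Mhv).T * tau_h, Mhv.T * (1 - tau_h)
        Mvh = sigmoid(np.log((np.exp(W) * Lam1 + Lam0) / (Lam1 + Lam0)))
        # visible beliefs, Eq. (16)
        tau_v = sigmoid(bv + np.log(Mvh / (1 - Mvh)).sum(axis=1))
        if mixed:
            # mixed-product v -> h messages, Eq. (15): decode v first
            v_tilde = (tau_v > 0.5).astype(float)
            Mhv = sigmoid(W.T * v_tilde)
        else:
            # sum-product v -> h messages, Eq. (14)
            Lam1, Lam0 = (1 - Mvh).T * tau_v, Mvh.T * (1 - tau_v)
            Mhv = sigmoid(np.log((np.exp(W.T) * Lam1 + Lam0) / (Lam1 + Lam0)))
        # hidden beliefs, Eq. (17)
        tau_h = sigmoid(bh + np.log(Mhv / (1 - Mhv)).sum(axis=1))
    # pairwise beliefs, Eq. (18)
    G11 = np.exp(W) * np.outer(tau_v, tau_h) * (1 - Mvh) * (1 - Mhv).T
    G01 = np.outer(1 - tau_v, tau_h) * Mvh * (1 - Mhv).T
    G10 = np.outer(tau_v, 1 - tau_h) * (1 - Mvh) * Mhv.T
    G00 = np.outer(1 - tau_v, 1 - tau_h) * Mvh * Mhv.T
    return tau_v, tau_h, G11 / (G11 + G01 + G10 + G00)
```

Every operation here is a matrix product or an element-wise map over $|v| \times |h|$ arrays, which is what makes the update amenable to BLAS and GPU execution.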
Algorithm 1
Sum(mixed)-product BP on RBM
Input: $\{W^{vh}, b^v, b^h\}$, number of iterations $T$
Output: beliefs $\{\tau^v, \tau^h, \Gamma\}$
Initialize message matrices $M^{vh} = 0.5 \times \mathbf{1}_{vh}$, $M^{hv} = 0.5 \times \mathbf{1}_{hv}$;
initialize beliefs $\tau^v = \sigma(b^v)$, $\tau^h = \sigma(b^h)$.
for $t = 1$ to $T$ do
    Send messages from $h$ to $v$:
        $\Lambda^{vh}_1 = (\mathbf{1}_{hv} - M^{hv})^\top \cdot \mathrm{diag}(\tau^h)$;  $\Lambda^{vh}_0 = M^{hv\top} \cdot \mathrm{diag}(\mathbf{1}_h - \tau^h)$;
        update $M^{vh}$ by Eq. (13); update $\tau^v$ by Eq. (16).
    Send messages from $v$ to $h$:
        for sum-product BP: $\Lambda^{hv}_1 = (\mathbf{1}_{vh} - M^{vh})^\top \cdot \mathrm{diag}(\tau^v)$; $\Lambda^{hv}_0 = M^{vh\top} \cdot \mathrm{diag}(\mathbf{1}_v - \tau^v)$; update $M^{hv}$ by Eq. (14);
        or, for mixed-product BP: update $M^{hv}$ by Eq. (15).
        Update $\tau^h$ by Eq. (17).
end for
Compute $\Gamma$ by Eq. (18).

In practice, our matrix implementation runs orders of magnitude faster than standard implementations of belief propagation, e.g., the C++ factor graph package libDAI [Mooij, 2010], which has been used for RBM assessments [e.g., Hadjis and Ermon, 2015]. For an RBM with 1000 visible and 500 hidden units, 10 iterations of BP in our Matlab implementation take 0.5 seconds on a laptop with an Intel Core i5 (2.5 GHz). In libDAI (compiled with gcc -O3, i.e., fully optimized for speed), 10 iterations of BP take 297.4 seconds, approximately 600× slower. This is mainly because matrix operations are highly optimized on modern computer architectures, e.g., they are performed in parallel in the instruction pipeline, and no pointers (to messages, neighbors, etc.) need to be dereferenced.

6 Experiments

In this section, we compare our methods with state-of-the-art algorithms for learning CRBMs on two datasets: MNIST and Caltech101 Silhouettes.
Datasets:
The MNIST database [LeCun et al., 1998] contains 60,000 images in the training set and 10,000 images in the test set; we hold out 10,000 images from the training set as the validation set. Each image is 28 × 28 pixels, thus |v| = 784. We binarize the grayscale images by thresholding the pixels at 127 to obtain the clean image v. We test two types of structured prediction tasks in our experiments. The first task is image denoising, denoted "noisy MNIST", where the noisy image x is obtained by flipping either 10% or 20% of the entries in v. The second task is image completion, denoted "occluded MNIST", where the occluded image x is obtained by setting a random patch within the image v to 0. The patch size is either 8 × 8 or 12 × 12 pixels. See Figure 2 for an illustration.

[Figure 2: (Row 1) 7 original images from the test set. (Row 2) The noisy (10%) images. (Row 3) The images predicted from the noisy images. (Row 4) The occluded (8 × 8) images. (Row 5) The images predicted from the occluded images. Rows 3 and 5 use our MLE-BP for learning.]

The Caltech101 Silhouettes dataset [Marlin et al., 2010] has 8,671 images with 28 × 28 binary pixels, where each image represents an object silhouette. The dataset is divided into three subsets: 4,100 examples for training, 2,264 for validation, and 2,307 for testing. We test both the image denoising and image completion tasks. The noisy image x in noisy Caltech101 is obtained by flipping 20% of the pixels of the clean v, and the occluded image in occluded Caltech101 is obtained by setting a random 12 × 12 patch to 1.
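The corruption processes are straightforward to reproduce; a sketch under the settings above (uniformly random patch placement is our assumption, and the helper names are ours):

```python
def flip_noise(v, frac, rng):
    """Noisy inputs: flip a fraction `frac` of the binary pixels in v."""
    x = v.copy()
    idx = rng.choice(x.size, size=int(frac * x.size), replace=False)
    x[idx] = 1 - x[idx]
    return x

def occlude(v, size, fill, rng, side=28):
    """Occluded inputs: set a random `size` x `size` patch to `fill`
    (0 for occluded MNIST, 1 for occluded Caltech101)."""
    img = v.reshape(side, side).copy()
    r, c = rng.integers(0, side - size + 1, size=2)
    img[r:r + size, c:c + size] = fill
    return img.reshape(-1)
```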
Model:
Following [Mnih et al., 2011], we structured the CRBM model with 256 hidden units, giving 1 million parameters in the model. All the learning algorithms are applied to learn this CRBM model. The logistic regression (LR) method can be viewed as learning this CRBM with only $W^{vx}$ and $b^v$ non-zero.

Algorithms:
We train several CRBMs using the state-of-the-art CD methods, including CD-1, CD-10, and CD-PercLoss. We also train models to optimize the likelihood (MLE) using mean field (MLE-MF) and sum-product BP (MLE-BP). Finally, we train MSSVM CRBMs using mixed-product BP, and LSSVM CRBMs using max-product BP. (In previous work [Mnih et al., 2011], MLE-BP was considered impractical on this task for efficiency reasons.) A fixed learning rate and the mini-batch size are selected from candidate sets using the validation set. The CD-PercLoss algorithm uses 10-step Gibbs sampling in its stochastic search process. All the CD methods use 200 epochs of training. In contrast, MLE-MF, MLE-BP, MSSVM, and LSSVM use 50 epochs, because BP and MF provide deterministic gradient estimates and larger learning rates can be applied. Early stopping based on the validation error is also used for all methods; in our experiments, we found that early stopping always worked better than Frobenius norm regularization. We test the learned models of the CD methods and MLE-MF with mean-field predictions, the learned model of MLE-BP with sum-product BP predictions, MSSVM with mixed-product BP, and LSSVM with max-product BP.

Table 1: Average test error (%) for image denoising on noisy MNIST. "All" denotes the percentage of incorrectly labeled pixels among all pixels. "Changed" denotes the percentage of errors among pixels that were changed by the noise process.

Dataset        Noisy (10%)         Noisy (20%)
Method         All      Changed    All      Changed
LR             1.960    12.531     4.088    12.…
CD-1           1.925    12.229     4.012    12.…
CD-10          1.816    11.103     3.995    11.…
CD-PercLoss    1.760    11.121     3.970    10.…
MLE-MF         1.862    11.319     3.917    10.…
MLE-BP         1.688    10.718     3.691    10.…
LSSVM          1.807    11.565     3.910    11.…
MSSVM          1.751    11.023     3.804    10.…

Table 2: Average test error (%) for image completion on occluded MNIST.

Dataset        Occluded (8 × 8)    Occluded (12 × 12)
Method         All      Changed    All      Changed
LR             ….468    61.304     3.498    53.…
CD-1           ….814    63.130     3.983    58.…
CD-10          ….707    67.925     3.921    63.…
CD-PercLoss    ….394    45.684     3.483    35.…
MLE-MF         ….492    49.553     3.477    40.…
MLE-BP         ….329    39.785     3.…      …
LSSVM          ….496    44.037     3.468    39.…
MSSVM          ….391    41.829     3.…      …
Results:
Table 1 shows the percentage of incorrectly labeled pixels on noisy MNIST for the different methods. "All" denotes the errors among all pixels and is the main measurement. We also report the "Changed" errors among the pixels that were changed by the noise/occlusion process. MLE-BP works best and provides 4% and 7% relative improvement over CD-PercLoss on the two datasets with different noise levels. Table 2 shows the results on occluded MNIST. Here MLE-BP provides 4% and 10% relative improvement over CD-PercLoss on the two datasets, respectively. CD-k gives unsatisfactory results in both cases. Here, MSSVM performs worse than MLE-BP, but better than the other methods in Tables 1 and 2. The image completion task is viewed as more difficult on the Changed pixels. However, again, training the CRBM with MLE-BP gives very good results; see the last two rows of images in Figure 2. Table 3 reports the results on Caltech101 Silhouettes; in this setting, MLE-BP and MSSVM perform best for image denoising and image completion, respectively.

Table 3: Average test error (%) for image denoising & completion on the Caltech101 Silhouettes dataset.

Dataset        Noisy (20%)         Occluded (12 × 12)
Method         All      Changed    All      Changed
LR             ….653    11.460     4.771    16.…
CD-1           ….876    12.423     5.033    20.…
CD-10          ….736    12.013     5.149    21.…
CD-PercLoss    ….622    10.808     5.081    15.…
MLE-MF         ….617    11.083     4.692    15.…
MLE-BP         ….445    10.…       ….548    16.…
LSSVM          ….628    11.468     4.703    16.…
MSSVM          ….549    11.…       ….534    14.…

Figure 3 shows the results for image completion under different occlusion levels. MLE-BP works better than CD-10, unless the images are almost fully occluded. Note that full occlusion (28 × 28) corresponds to no conditioning (i.e., x ≡ 0).

[Figure 3: Average test error (%) for image completion on occluded MNIST under different occlusion levels, comparing MLE-BP and CD-10.]

Discussion:
We include several observations on the interaction of learning and inference algorithms for CRBMs. Early in learning, message passing converges quickly; as learning proceeds, we increase the number of BP iterations with the epoch, using T = 7 + epoch (e.g., at epoch 10, T = 17). See Figure 4 for an illustration of the convergence behavior of BP under this strategy during training (with convergence tolerance ε = 0.…).

[Figure 4: Percentage of BP runs that converged in each epoch during MLE-BP training on occluded (8 × 8) MNIST.]

7 Conclusion

In contrast to past work, we argue that belief propagation can be an excellent choice for learning and inference with RBM-based models in the conditional setting. We present a matrix-based expression of the BP updates for CRBMs, which is scalable to tens of thousands of visible and hidden units. Our implementation takes advantage of the bipartite graph structure and uses a compact representation of messages and beliefs. Since it uses only matrix products and element-wise operations, it is highly suited to GPU acceleration. We demonstrate that learning CRBMs with sum-product BP (MLE) and mixed-product BP (MSSVM) can provide significantly better results than the state-of-the-art CD methods on structured prediction problems. Future directions include a GPU-based implementation and applying the method to deep probabilistic models, such as deep Boltzmann machines.
Acknowledgements
This work is sponsored in part by NSF grants IIS-1254071 and CCF-1331915. It is also funded in part by the United States Air Force under Contract No. FA8750-14-C-0011 under the DARPA PPAML program.
References
D. Ackley, G. Hinton, and T. Sejnowski. A learning algorithm for Boltzmann machines. Cognitive Science, 1985.
J. Gonzalez, Y. Low, and C. Guestrin. Residual splash for optimally parallelizing belief propagation. In AISTATS, 2009.
I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. Book in preparation for MIT Press, 2016.
S. Hadjis and S. Ermon. Importance sampling over sets: A new probabilistic inference scheme. In UAI, pages 355-364, 2015.
G. Hinton, S. Osindero, and Y.-W. Teh. A fast learning algorithm for deep belief nets. Neural Computation, 2006.
G. E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 2002.
G. E. Hinton. A practical guide to training restricted Boltzmann machines. UTML TR 2010-003, 2010.
A. T. Ihler, J. W. Fisher III, and A. S. Willsky. Loopy belief propagation: Convergence and effects of message errors. JMLR, 2005.
P. Krähenbühl and V. Koltun. Efficient inference in fully connected CRFs with Gaussian edge potentials. In NIPS, 2012.
F. R. Kschischang, B. J. Frey, and H.-A. Loeliger. Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 2001.
H. Larochelle and Y. Bengio. Classification using discriminative restricted Boltzmann machines. In ICML, 2008.
H. Larochelle, M. Mandel, R. Pascanu, and Y. Bengio. Learning algorithms for the classification restricted Boltzmann machine. JMLR, 2012.
Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998.
X. Li, F. Zhao, and Y. Guo. Conditional restricted Boltzmann machines for multi-label learning with incomplete labels. In AISTATS, 2015.
Q. Liu and A. Ihler. Variational algorithms for marginal MAP. JMLR, 2013.
M. Mandel, R. Pascanu, H. Larochelle, and Y. Bengio. Autotagging music with conditional restricted Boltzmann machines. arXiv:1103.2832, 2011.
B. M. Marlin, K. Swersky, B. Chen, and N. de Freitas. Inductive principles for restricted Boltzmann machine learning. In AISTATS, 2010.
V. Mnih, H. Larochelle, and G. E. Hinton. Conditional restricted Boltzmann machines for structured output prediction. In UAI, 2011.
J. Mooij. libDAI: A free and open source C++ library for discrete approximate inference in graphical models. JMLR, 2010.
J. Mooij and H. Kappen. On the properties of the Bethe approximation and loopy belief propagation on binary networks. Journal of Statistical Mechanics: Theory and Experiment, 2005.
K. Murphy, Y. Weiss, and M. Jordan. Loopy belief propagation for approximate inference: An empirical study. In UAI, 1999.
C. Peterson and J. R. Anderson. A mean field theory learning algorithm for neural networks. Complex Systems, 1987.
W. Ping, Q. Liu, and A. Ihler. Marginal structured SVM with hidden variables. In ICML, 2014.
W. Ping, Q. Liu, and A. Ihler. Decomposition bounds for marginal MAP. In NIPS, 2015.
A. Quattoni, S. Wang, L.-P. Morency, M. Collins, and T. Darrell. Hidden conditional random fields. IEEE Transactions on PAMI, 2007.
R. Salakhutdinov and G. E. Hinton. Deep Boltzmann machines. In AISTATS, 2009.
R. Salakhutdinov, A. Mnih, and G. Hinton. Restricted Boltzmann machines for collaborative filtering. In ICML, pages 791-798, 2007.
I. Sutskever and T. Tieleman. On the convergence properties of contrastive divergence. In AISTATS, 2010.
G. W. Taylor, G. E. Hinton, and S. Roweis. Modeling human motion using binary latent variables. In NIPS, 2006.
T. Tieleman. Training restricted Boltzmann machines using approximations to the likelihood gradient. In ICML, 2008.
M. J. Wainwright, T. S. Jaakkola, and A. S. Willsky. Tree-based reparameterization framework for analysis of sum-product and related algorithms. IEEE Transactions on Information Theory, 2003.
M. Welling and Y. W. Teh. Approximate inference in Boltzmann machines. Artificial Intelligence, 2003.
J. Yang, S. Safar, and M.-H. Yang. Max-margin Boltzmann machines for object segmentation. In CVPR, 2014.
M. Yasuda and K. Tanaka. Approximate learning algorithm in Boltzmann machines. Neural Computation, 2009.
C.-N. J. Yu and T. Joachims. Learning structural SVMs with latent variables. In ICML, 2009.
Appendix
A Gradients of Log-likelihood
Similar to Eq. (5) in the main text, the gradients of the log-likelihood w.r.t. the other weights and biases are

$$ \frac{\partial \log p(v^n \mid x^n)}{\partial W^{vx}} = v^n x^{n\top} - \mathbb{E}_{p(v, h \mid x^n)}\big[v x^{n\top}\big], $$
$$ \frac{\partial \log p(v^n \mid x^n)}{\partial W^{hx}} = \mu^n x^{n\top} - \mathbb{E}_{p(v, h \mid x^n)}\big[h x^{n\top}\big], $$
$$ \frac{\partial \log p(v^n \mid x^n)}{\partial b^v} = v^n - \mathbb{E}_{p(v, h \mid x^n)}\big[v\big], $$
$$ \frac{\partial \log p(v^n \mid x^n)}{\partial b^h} = \mu^n - \mathbb{E}_{p(v, h \mid x^n)}\big[h\big], $$

where $\mu^n = \mathbb{E}_{p(h \mid v^n, x^n)}[h] = \sigma(W^{vh\top} v^n + W^{hx} x^n + b^h)$. All the negative parts of these gradients are intractable to calculate, and must be approximated during learning.
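Under the BP approximation, every intractable expectation above reduces to a singleton or pairwise belief. A sketch extending the SGD step from Section 4 (assumes the matrix_bp and sigmoid helpers defined earlier; substituting beliefs for moments is the approximation, not an exact gradient):

```python
def mle_bp_gradients(vn, xn, Wvh, Wvx, Whx, bv, bh, T=10):
    """Approximate gradients of log p(v^n | x^n) w.r.t. all parameters,
    using E[v h^T] ~ Gamma, E[v] ~ tau_v, E[h] ~ tau_h from BP."""
    mu = sigmoid(Wvh.T @ vn + Whx @ xn + bh)               # exact positive part
    tau_v, tau_h, Gamma = matrix_bp(Wvh, bv + Wvx @ xn, bh + Whx @ xn, T=T)
    return {
        "Wvh": np.outer(vn, mu) - Gamma,
        "Wvx": np.outer(vn - tau_v, xn),                   # E[v x^T] ~ tau_v x^T
        "Whx": np.outer(mu - tau_h, xn),                   # E[h x^T] ~ tau_h x^T
        "bv": vn - tau_v,
        "bh": mu - tau_h,
    }
```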
B Derivation of Matrix-based BP

In this section, we give additional proof details for our matrix-based BP update equations.
B.1 Update rule for $M^{vh}$ in Eq. (13)

$$ M^{vh}_{ij} = \frac{m_{j \to i}(v_i = 1)}{m_{j \to i}(v_i = 1) + m_{j \to i}(v_i = 0)} $$
$$ = \sigma\Bigg(\log \frac{\exp(W^{vh}_{ij}) \cdot \frac{\tau^h_j}{M^{hv}_{ji}} + \frac{1 - \tau^h_j}{1 - M^{hv}_{ji}}}{\frac{\tau^h_j}{M^{hv}_{ji}} + \frac{1 - \tau^h_j}{1 - M^{hv}_{ji}}}\Bigg) \qquad \text{by Eq. (8)} $$
$$ = \sigma\Bigg(\log \frac{\exp(W^{vh}_{ij})\,(1 - M^{hv}_{ji})\,\tau^h_j + M^{hv}_{ji}\,(1 - \tau^h_j)}{(1 - M^{hv}_{ji})\,\tau^h_j + M^{hv}_{ji}\,(1 - \tau^h_j)}\Bigg). $$

Then one can verify that the update of $M^{vh}$ in Eq. (13) holds. The derivation is analogous for the update of $M^{hv}$ in Eq. (14).

B.2 Update rule for $\tau^v$ in Eq. (16)

$$ \tau^v_i = \frac{\tau(v_i = 1)}{\tau(v_i = 1) + \tau(v_i = 0)} = \frac{1}{1 + \frac{\exp\big(\sum_{j=1}^{|h|} \log m_{j \to i}(v_i = 0)\big)}{\exp\big(b^v_i + \sum_{j=1}^{|h|} \log m_{j \to i}(v_i = 1)\big)}} \qquad \text{by Eq. (11)} $$
$$ = \frac{1}{1 + \exp\Big\{-b^v_i - \sum_{j=1}^{|h|} \big(\log M^{vh}_{ij} - \log(1 - M^{vh}_{ij})\big)\Big\}} = \sigma\Big(b^v_i + \big(\log M^{vh}_{i\cdot} - \log(\mathbf{1}_h^\top - M^{vh}_{i\cdot})\big) \cdot \mathbf{1}_h\Big). $$

Then one can verify that the update of $\tau^v$ in Eq. (16) holds. The update of $\tau^h$ in Eq. (17) is derived similarly.

B.3 Pairwise belief matrix in Eq. (18)

The $(i, j)$ element of the pairwise belief matrix is

$$ \Gamma_{ij} = \frac{\tau(v_i = 1, h_j = 1)}{\sum_{v_i, h_j} \tau(v_i, h_j)} = \frac{\exp(W^{vh}_{ij})\,\frac{\tau^v_i \tau^h_j}{M^{vh}_{ij} M^{hv}_{ji}}}{\exp(W^{vh}_{ij})\,\frac{\tau^v_i \tau^h_j}{M^{vh}_{ij} M^{hv}_{ji}} + \frac{(1 - \tau^v_i)\,\tau^h_j}{(1 - M^{vh}_{ij})\,M^{hv}_{ji}} + \frac{\tau^v_i\,(1 - \tau^h_j)}{M^{vh}_{ij}\,(1 - M^{hv}_{ji})} + \frac{(1 - \tau^v_i)(1 - \tau^h_j)}{(1 - M^{vh}_{ij})(1 - M^{hv}_{ji})}}. $$

Multiplying the numerator and denominator by $M^{vh}_{ij}(1 - M^{vh}_{ij}) M^{hv}_{ji}(1 - M^{hv}_{ji})$ and collecting terms gives the intermediate matrices

$$ \Gamma^{11} = \exp(W^{vh}) \circ (\tau^v \cdot \tau^{h\top}) \circ (\mathbf{1}_{vh} - M^{vh}) \circ (\mathbf{1}_{hv} - M^{hv})^\top, $$
$$ \Gamma^{01} = \big((\mathbf{1}_v - \tau^v) \cdot \tau^{h\top}\big) \circ M^{vh} \circ (\mathbf{1}_{hv} - M^{hv})^\top, $$
$$ \Gamma^{10} = \big(\tau^v \cdot (\mathbf{1}_h - \tau^h)^\top\big) \circ (\mathbf{1}_{vh} - M^{vh}) \circ M^{hv\top}, $$
$$ \Gamma^{00} = \big((\mathbf{1}_v - \tau^v) \cdot (\mathbf{1}_h - \tau^h)^\top\big) \circ M^{vh} \circ M^{hv\top}. $$

Then the pairwise belief matrix is $\Gamma = \Gamma^{11} / (\Gamma^{11} + \Gamma^{01} + \Gamma^{10} + \Gamma^{00})$, where the division is element-wise.
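As a quick numerical check of the algebra above: an RBM with a single hidden unit is a star-shaped tree, on which BP is exact, so the matrix updates can be compared against brute-force enumeration. A test harness of ours, reusing matrix_bp from Section 5:

```python
rng = np.random.default_rng(0)
nv, nh = 4, 1                                  # one hidden unit: the graph is a tree
W = rng.normal(size=(nv, nh))
bv, bh = rng.normal(size=nv), rng.normal(size=nh)

# exact singleton marginals of v by enumerating all 2^(nv+nh) joint states
states = [(np.array(v), np.array(h))
          for v in itertools.product([0.0, 1.0], repeat=nv)
          for h in itertools.product([0.0, 1.0], repeat=nh)]
p = np.array([np.exp(v @ W @ h + v @ bv + h @ bh) for v, h in states])
p /= p.sum()
exact_tau_v = sum(pi * v for pi, (v, _) in zip(p, states))

tau_v, tau_h, Gamma = matrix_bp(W, bv, bh, T=20)
assert np.allclose(tau_v, exact_tau_v, atol=1e-8)
```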