Learning Thermodynamics with Boltzmann Machines
Giacomo Torlai and Roger G. Melko
Perimeter Institute for Theoretical Physics, Waterloo, Ontario N2L 2Y5, Canada
Department of Physics and Astronomy, University of Waterloo, Ontario N2L 3G1, Canada
(Dated: June 10, 2016)

A Boltzmann machine is a stochastic neural network that has been extensively used in the layers of deep architectures for modern machine learning applications. In this paper, we develop a Boltzmann machine that is capable of modelling thermodynamic observables for physical systems in thermal equilibrium. Through unsupervised learning, we train the Boltzmann machine on data sets constructed with spin configurations importance-sampled from the partition function of an Ising Hamiltonian at different temperatures using Monte Carlo (MC) methods. The trained Boltzmann machine is then used to generate spin states, for which we compare thermodynamic observables to those computed by direct MC sampling. We demonstrate that the Boltzmann machine can faithfully reproduce the observables of the physical system. Further, we observe that the number of neurons required to obtain accurate results increases as the system is brought close to criticality.
I. INTRODUCTION
Machine learning is a paradigm whereby computer algorithms are designed to learn from, and make predictions on, data. The success of such algorithms in classifying and extracting features from large data sets relies on their ability to infer these features without explicit guidance from a human programmer. Such automatic encoding proceeds by first "training" the algorithm on a large data set and then asking the trained machine to perform some task. Currently, many machine learning applications are performed with neural networks, which essentially fit the data to a graph structure composed of many nodes and edges. If the ultimate goal is to perform classification, as in image or speech recognition, the network can be trained on a labelled data set by maximizing the output probability of the correct label (supervised learning). However, since labelled data is often scarce, a more effective strategy is to learn the full distribution of the data using a generative model, which does not require labels (unsupervised learning). Such generative training allows the network to extract more information, and also to generate approximate samples of the distribution. For the classification of data, this training is followed by a supervised fine-tuning, which can be done with only a small amount of labelled data.

Although neural networks have been researched for many decades, the performance required for solving highly complex problems in real-world applications has been achieved only relatively recently with deep learning. Here, the networks are made up of several layers stacked such that the output of one layer becomes the input of the next layer. The ability to learn multiple levels of representation makes deep learning a very powerful tool for capturing features in high-dimensional data, and it has drastically improved performance in complex tasks such as image recognition, speech recognition, and natural language understanding. Machine learning also has many applications in physics, and has been successfully used to solve complex problems, including searching for exotic particles in high-energy physics, solving dynamical mean-field theory in strongly correlated systems, and classifying the liquid-glass transition. More recently, neural networks have also been employed to identify phases of matter with and without conventional order parameters, and to locate the position of phase transitions to high accuracy. In light of this success, one may ask whether neural networks can be trained for other difficult problems, such as reproducing statistical-mechanical distributions of classical Hamiltonians in an unsupervised setting. This would allow one, for example, to train a neural network using data that has been importance-sampled using Monte Carlo (MC) from a partition function, and then to calculate estimators from the distribution produced by the neural network.

A natural candidate neural network for this task is a Boltzmann machine. A Boltzmann machine is a stochastic neural network, composed of neuron-like nodes forming a network with undirected edges. Each neuron has a binary value with a probabilistic element, which depends on the neighbouring units to which it is connected. The connecting edges weigh the inputs to each neuron to define its state. This architecture, once elaborated, can be used to produce approximate reconstructions of the original data set.
More precisely, a reconstruction is an estimate of the probability distribution of the original input, which is of course imperfectly contained in the limited-size training data set. This procedure has been widely successful, leading Boltzmann machines to become a core piece of deep learning architectures.

In this paper, we explore the ability of Boltzmann machines to learn finite-temperature distributions of the classical Ising Hamiltonian and, consequently, associated thermodynamic observables such as energy, magnetization, or specific heat. We show that faithful recreation of observables is possible for a finite-size lattice Ising system. We also demonstrate that the number of neurons in the network required to recreate data at the critical point can be much larger than in the paramagnetic or ferromagnetic phase. This suggests that deep networks may be required for the faithful representation of thermodynamics by Boltzmann machines at critical points.

II. THE BOLTZMANN MACHINE
In constructing a Boltzmann machine, our goal is to build an approximate model of a target probability distribution. For the sake of concreteness, we will consider the Boltzmann distribution of N Ising spin variables, weighted by the partition function, as our target distribution. It is natural to imagine sampling this distribution with a MC procedure. In addition to producing these samples, a MC simulation usually calculates estimators of thermodynamic observables, such as energy or specific heat, directly from the sampled target distribution. However, one could instead imagine obtaining estimators from an approximate distribution constructed to mimic our target distribution. In this scenario, spin configurations can be generated by a Boltzmann machine that was trained on the MC samples of the target distribution. In this section, we review the concept of sampling the target distribution for an Ising spin Hamiltonian, and detail the construction and training of a Boltzmann machine. In Sec. III we present the results for thermodynamic observables obtained from this Boltzmann machine, trained on finite-temperature configurations produced from the nearest-neighbor Ising ferromagnet.
A. Target probability distribution and thermodynamic observables
Consider a system of N classical spins on a d-dimensional lattice, with Ising spin configuration σ = {σ_1, σ_2, ..., σ_N} and a generic Hamiltonian H_S(σ), where the S subscript indicates the physical (spin) system. When the system is in thermal equilibrium at temperature T, the "target" probability of a spin configuration σ is given by the familiar Boltzmann distribution

$$p_S(\sigma, T) = \frac{1}{Z_S}\, e^{-H_S(\sigma)/T}, \qquad (1)$$

where $Z_S = \mathrm{Tr}_\sigma\, e^{-H_S(\sigma)/T}$ is the canonical partition function. With the knowledge of Z_S it is possible to compute all thermodynamic potentials and average values of observables. However, the estimation of the partition function involves a summation over all 2^N states, which is feasible only for very small systems. The average value of an observable O can be calculated as

$$\langle O(T) \rangle = \frac{1}{M} \sum_{k=1}^{M} O(\sigma_k) \qquad (2)$$

if the σ_k are samples drawn from the distribution p_S(σ, T) at temperature T. This equation is exact only when M → ∞. However, the sampling process can be done using Markov chain MC simulations, leading Eq. (2) to give an expression for a MC expectation value for finite but large M. In what follows, we consider expectation values obtained with this procedure to be the exact results for the target probability distribution. They will be compared to observables calculated from a probability distribution generated by a Boltzmann machine, as we now describe.
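Before turning to the Boltzmann machine, a concrete illustration of the MC estimate in Eq. (2) may be helpful. The sketch below is not part of the original paper; the helper name `metropolis_samples` and all parameter choices are our own. It importance-samples a small Ising chain with single-spin-flip Metropolis updates and averages an observable over the resulting configurations.

```python
import numpy as np

def metropolis_samples(L, T, n_samples, sweeps_between=10, J=1.0, seed=0):
    """Single-spin-flip Metropolis sampling of a 1D Ising chain with periodic
    boundaries (hypothetical helper; spins take values +-1)."""
    rng = np.random.default_rng(seed)
    spins = rng.choice([-1, 1], size=L)
    samples = []
    for _ in range(n_samples):
        for _ in range(sweeps_between * L):
            i = rng.integers(L)
            # Energy cost of flipping spin i under H_S = -J sum_<ij> s_i s_j
            dE = 2.0 * J * spins[i] * (spins[(i - 1) % L] + spins[(i + 1) % L])
            if dE <= 0.0 or rng.random() < np.exp(-dE / T):
                spins[i] *= -1
        samples.append(spins.copy())
    return np.array(samples)

# Eq. (2): a thermodynamic average is a simple mean over the sampled states.
samples = metropolis_samples(L=6, T=2.0, n_samples=2000)
m_abs = np.mean(np.abs(samples.sum(axis=1))) / samples.shape[1]
print(f"<|M|>/N at T = 2.0: {m_abs:.3f}")
```

Configurations such as these, mapped to binary variables, form the training data sets used below.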
B. Restricted Boltzmann Machine

[FIG. 1. Restricted Boltzmann machine. The visible units (blue) are connected to the hidden nodes (red) by a symmetric matrix of weights W. The external fields in the Hamiltonian are represented by new edges with weights b and c connecting the visible and hidden nodes, respectively, to ancillary units (purple and orange) whose value is clamped to one.]

Given a target probability distribution p_S(σ) defined over a set of random variables σ, our goal is to build a probabilistic model p_λ(σ) which mimics our target distribution. The model is in general characterized by a set of parameters λ, which we tune in order to minimize the distance between these two probability distributions. It is advantageous to build a joint probability distribution on a graph, where conditional independence between random variables in the corresponding probabilistic model can be better understood with the help of graph theory and through visualization. We recall that a graph is a collection of nodes and edges, where each node is associated with a variable σ and each edge represents a probabilistic relation between nodes. A probabilistic graphical model defines a joint probability distribution p_λ(σ) over the graph, and conditional independence between the variables σ provides a factorization rule for the distribution. We build the probability distribution over an undirected graph satisfying a local Markov property (called a Markov random field). In particular, we adopt a bilayer architecture. Symmetric edges connect spin nodes σ ∈ {0,1}^N in the so-called "visible" layer with "hidden" nodes h ∈ {0,1}^{n_H} in the hidden layer (Fig. 1). The weights of the edges are described by a matrix W with zero diagonal, where the element W_ij is the weight on the edge connecting h_i to σ_j. We also introduce two external fields b and c coupled to the visible and hidden layers, respectively. One can consider the latter as weights on new edges between each visible and hidden node and an ancillary node, with its variable "clamped" (or fixed) to one. Moreover, all the variables in the graph are stochastic, which constitutes one major difference between this model, called a restricted Boltzmann machine, and regular neural networks. The full probability distribution defined by the graph can be written as a Boltzmann distribution

$$p_\lambda(\sigma, h) = \frac{1}{Z_\lambda}\, e^{-E_\lambda(\sigma, h)}, \qquad (3)$$

where the model parameters are λ = {W, b, c} and the energy is given by

$$E_\lambda(\sigma, h) = -\sum_{ij} W_{ij} h_i \sigma_j - \sum_j b_j \sigma_j - \sum_i c_i h_i. \qquad (4)$$

As the joint distribution is now defined over two sets of nodes, the graph distribution over the spins is obtained by marginalization,

$$p_\lambda(\sigma) = \sum_h p_\lambda(\sigma, h) = \frac{1}{Z_\lambda}\, e^{-\mathcal{E}_\lambda(\sigma)}, \qquad (5)$$

where we introduced an effective visible energy

$$\mathcal{E}_\lambda(\sigma) = -\sum_j b_j \sigma_j - \sum_i \log\left(1 + e^{c_i + \sum_j W_{ij} \sigma_j}\right), \qquad (6)$$

often called the "free energy" in the literature on restricted Boltzmann machines. This probabilistic graphical model has a very important property used in the inference process of the states of the two layers. Since the state of any node is sampled from a non-linear function of its inputs (its "activation"), and the activations of nodes in the same layer are independent of each other (Fig. 1), it is possible to sample one layer at a time, exploiting fast linear algebra routines in numerical simulations. Moreover, for a specific choice of λ, the states of the visible and hidden layers can be inferred exactly with the posteriors p_λ(σ|h) and p_λ(h|σ). Because the Boltzmann machine is restricted (meaning there are no intra-layer connections), the posteriors factorize nicely as

$$p_\lambda(\sigma | h) = \prod_j p_\lambda(\sigma_j | h), \qquad (7)$$
$$p_\lambda(h | \sigma) = \prod_i p_\lambda(h_i | \sigma). \qquad (8)$$
All the probabilities can be easily estimated using Bayes' theorem,

$$p_\lambda(\sigma_j = 1 | h) = \sigma\Big(\sum_i W_{ij} h_i + b_j\Big), \qquad (9)$$

with the function σ(x) = (1 + e^{−x})^{−1} called a "sigmoid" (a similar expression is obtained for the conditional of the hidden layer). We point out that, although we are interested here in the generation of visible spin states, it is straightforward to extend this network to discriminative tasks. By adding a new layer for the labels, the resulting three-layer neural network can perform classification with competitive accuracies on common benchmarks. Restricted Boltzmann machines also play a central role in deep learning, for instance in the greedy layer-by-layer pre-training of deep belief networks, or in their natural multilayer extension, the deep Boltzmann machine.
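A minimal numerical sketch of Eqs. (6)-(9) may help fix conventions. This is our own illustration rather than the authors' code; W is taken as an n_H × N matrix and all helper names are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_hidden_given_visible(sigma, W, c):
    # p_lambda(h_i = 1 | sigma): hidden-layer analogue of Eq. (9)
    return sigmoid(sigma @ W.T + c)

def p_visible_given_hidden(h, W, b):
    # p_lambda(sigma_j = 1 | h), Eq. (9)
    return sigmoid(h @ W + b)

def effective_visible_energy(sigma, W, b, c):
    # E_lambda(sigma) of Eq. (6), with the hidden layer marginalized out
    return -sigma @ b - np.log1p(np.exp(c + W @ sigma)).sum()

# One block Gibbs step: because of Eqs. (7)-(8) each layer is sampled at once.
rng = np.random.default_rng(0)
N, n_H = 64, 16                               # e.g. 8x8 visible spins, 16 hidden nodes
W = rng.normal(scale=1.0 / np.sqrt(N + n_H), size=(n_H, N))
b, c = np.zeros(N), np.zeros(n_H)

sigma = rng.integers(0, 2, size=N).astype(float)            # binary visible state
h = (rng.random(n_H) < p_hidden_given_visible(sigma, W, c)).astype(float)
sigma = (rng.random(N) < p_visible_given_hidden(h, W, b)).astype(float)
```

The layer-wise factorization is what allows each sampling step to be written as a single matrix-vector product, which is the property exploited during training below.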
C. Training
We have discussed how the Boltzmann machine can generate an arbitrary probability distribution, provided a large enough number of hidden nodes, and how we can obtain the probability p_λ(σ). As already mentioned, the training process consists of tuning the machine parameters λ until p_λ(σ) is close to the target distribution p_S(σ). This is equivalent to solving an optimization problem where the function to minimize is the distance between the two distributions. This distance can be defined by the Kullback-Leibler (KL) divergence

$$\mathrm{KL}(p_S \,\|\, p_\lambda) \equiv \sum_\sigma p_S(\sigma) \log \frac{p_S(\sigma)}{p_\lambda(\sigma)} \ge 0. \qquad (10)$$

We build a data set D = {σ^(1), ..., σ^(|D|)} by drawing samples σ from the Ising p_S(σ) with Markov chain MC sampling at temperature T. The probability distribution underlying the data set is p_data(σ) = (1/|D|) Σ_{σ'∈D} δ(σ, σ') and, if the data set size |D| is large enough, p_data(σ) is then a good approximation of p_S(σ). We can then write the KL divergence as

$$\mathrm{KL}(p_{\rm data} \,\|\, p_\lambda) = -\frac{1}{|\mathcal{D}|} \sum_{\sigma \in \mathcal{D}} \log p_\lambda(\sigma) - H(p_{\rm data}), \qquad (11)$$

where the first term is called the negative log-likelihood and H(p_data) = −Σ_σ p_data(σ) log p_data(σ) is the entropy of the data set. The optimization problem is solved by stochastic gradient descent. We choose an initial point λ^(0) in the full configuration space with zero external fields and weights W_ij randomly drawn from a uniform distribution centered around zero. Gradient descent optimization consists of updating all the parameters with the rule

$$\lambda_j \leftarrow \lambda_j - \eta\, \nabla_{\lambda_j} \mathrm{KL}(p_{\rm data} \,\|\, p_\lambda). \qquad (12)$$

The size η of the gradient step, called the "learning rate", is kept constant during the training. The increments in the parameters are obtained by averaging the gradient of the KL divergence over the entire data set D. However, since the data set is usually redundant, the updates can be evaluated on a mini-batch of samples instead, resulting in a larger number of updates for each data set sweep. This optimization procedure, called stochastic gradient descent, substantially speeds up the learning, especially when the data set contains a very large number of samples. On the other hand, for data sets with a moderate number of samples, a common issue in the training of neural networks is overfitting the training data set. Different techniques have been proposed to regularize the networks and overcome overfitting, such as introducing a weight-decay term in the KL divergence cost function, or randomly removing some hidden nodes in the network (called "dropout"). However, producing training data is not an issue for the cases studied here, where MC sampling is fast and efficient. Thus, we build a data set sufficiently large to avoid using regularization. One could nevertheless envision other cases where MC samples are expensive, so that regularization would be required.

To obtain an update rule for the gradient descent we need to take the derivative of the KL divergence in Eq. (12), which reduces to the derivative of the log-likelihood,

$$\nabla_{\lambda_j} \log p_\lambda(\sigma) = -\nabla_{\lambda_j} \mathcal{E}_\lambda(\sigma) + \sum_\sigma p_\lambda(\sigma)\, \nabla_{\lambda_j} \mathcal{E}_\lambda(\sigma). \qquad (13)$$

If we consider for instance the case of λ = W, the derivative of the visible energy is

$$\nabla_W \mathcal{E}_\lambda(\sigma) = -\sum_h p_\lambda(h | \sigma)\, \sigma h^\top. \qquad (14)$$

Plugging this back into Eq. (13), we obtain

$$\nabla_W \mathrm{KL}(p_{\rm data} \,\|\, p_\lambda) = -\langle \sigma h^\top \rangle_{p_\lambda(h|\sigma)} + \langle \sigma h^\top \rangle_{p_\lambda(\sigma, h)}. \qquad (15)$$
The first average of the correlation matrix σh^⊤ can be easily computed by clamping the spin variables σ to a sample from the data set and inferring the state h of the hidden variables from the conditional distribution p_λ(h|σ). In the second term, however, the correlation matrix is averaged over the full model distribution p_λ(σ, h), which involves knowledge of the partition function Z_λ. To overcome this issue, we instead run a MC chain for k Markov steps,

$$\sigma^{(0)} \rightarrow h^{(0)} \rightarrow \sigma^{(1)} \rightarrow h^{(1)} \rightarrow \cdots \rightarrow \sigma^{(k)} \rightarrow h^{(k)}, \qquad (16)$$

by sampling each layer using the exact conditional distributions. The updates of the stochastic gradient descent are then obtained by taking the average of Eq. (15) over a mini-batch D^[b] of samples,

$$\lambda_j \leftarrow \lambda_j - \frac{\eta}{|\mathcal{D}^{[b]}|} \sum_{\sigma \in \mathcal{D}^{[b]}} \nabla_{\lambda_j} \mathrm{KL}(p_{\rm data} \,\|\, p_\lambda), \qquad (17)$$

with b = 1, ..., |D|/|D^[b]|. This training algorithm is called contrastive divergence (CD_k) and is the most effective known tool for the training of restricted Boltzmann machines. Note that since the initial state of the chain is a sample from the data set, and thus already belongs to the distribution, there is no need for a long equilibration time. Hence the order k of the chain can be very low, resulting in a very fast learning procedure. In some cases, only one step (CD_1) is sufficient to reconstruct the visible states with low error.
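To make the CD_1 update concrete, here is a minimal sketch of Eqs. (15)-(17) for a single mini-batch, written under the conventions of the previous code sketch. It is our own illustration rather than the authors' code; the function name and default hyper-parameters are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(batch, W, b, c, eta=0.01, rng=None):
    """One CD_1 step on a mini-batch of binary visible configurations
    (rows of `batch`); W has shape (n_H, N).  Sketch only."""
    rng = np.random.default_rng() if rng is None else rng
    # Positive phase: hidden activations with the visible units clamped to data.
    ph_data = sigmoid(batch @ W.T + c)                       # (B, n_H)
    h_data = (rng.random(ph_data.shape) < ph_data).astype(float)
    # Negative phase: a single Gibbs step sigma -> h -> sigma' -> h'.
    pv_model = sigmoid(h_data @ W + b)                       # (B, N)
    v_model = (rng.random(pv_model.shape) < pv_model).astype(float)
    ph_model = sigmoid(v_model @ W.T + c)
    # Data and model correlation matrices of Eq. (15), averaged over the
    # mini-batch as in Eq. (17).
    B = batch.shape[0]
    dW = (ph_data.T @ batch - ph_model.T @ v_model) / B
    db = (batch - v_model).mean(axis=0)
    dc = (ph_data - ph_model).mean(axis=0)
    return W + eta * dW, b + eta * db, c + eta * dc
```

Repeated over all mini-batches of the data set for many sweeps, this realizes the stochastic gradient descent of Eq. (12).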
III. RESULTS

The classical spin system we choose to train the Boltzmann machine on is the Ising Hamiltonian,

$$H_S(\sigma) = -J \sum_{\langle ij \rangle} \sigma_i \sigma_j, \qquad (18)$$

with ferromagnetic interactions J = 1 between nearest neighbours. As an instructive example we begin by training one machine on a one-dimensional chain with 6 spins. For such a small system it is possible to compute the partition function, and thus the full probability distribution, exactly. We prepare a data set of configurations using the exact probability distribution and then train a Boltzmann machine using contrastive divergence. Because the partition function of the Boltzmann machine is known, we can compute the KL divergence for various sets λ, evaluating the performance of the training. By plotting the KL divergence as a function of the training steps (Fig. 2a) we see how the distribution generated by the machine improves towards the data set distribution. We also show the comparison between the true probability distribution and the ones produced by the machine at two different stages of the training (Fig. 2b).

[FIG. 2. KL divergence as a function of training step (a) and probability distributions (b) for a d = 1 Ising model with N = 6 spins. We show the comparison between the exact probability distribution (red) and the approximate distribution produced by the Boltzmann machine after 10 (green) and 500 (blue) training steps for all of the 2^6 states σ.]

Next, we consider the more interesting case of a two-dimensional system with N = L × L spins on a square lattice with periodic boundaries. Contrary to the one-dimensional case, this system undergoes a second-order phase transition at T_c ≈ 2.269 from an ordered ferromagnetic phase (T < T_c) to a disordered paramagnetic phase (T > T_c). We prepare a data set D_T of binary spin configurations MC-sampled from p_S(σ, T) for several temperatures in a range centered around T_c. For each T we train a different machine M_τ which generates a distribution p_{λ_τ}(σ), where the subscript τ refers to the physical temperature T. For each machine we collect samples using a different number of hidden nodes while adopting the same external hyper-parameters (learning rate, mini-batch size, number of training steps, initial conditions, etc.). We update the parameters with CD using stochastic gradient descent with learning rate η = 0.01 and a mini-batch size of 50 samples. We initialize the weights W from a uniform distribution centered around zero with width w ∝ 1/√(n_H + N). We note that, although a larger value of the contrastive divergence order k is bound to improve the learning, it also substantially slows down the time required to reach a solution.

It is natural to ask how the performance of each Boltzmann machine is affected when the training samples are taken at high or low temperature. Moreover, we are interested in whether or not a Boltzmann machine is able to properly capture the fluctuations that the system undergoes at criticality. Before discussing the quantitative analysis of the thermodynamics, we give an insight into the functioning of these machines by showing the histogram of the matrix elements of W (Fig. 3) after the training at low and high temperature. In these two limits we know what the probability distribution p_S(σ, T) looks like and we can thus obtain a qualitative understanding of the training and sampling processes of the machines. At very high temperature, J/T ≪ 1, all spin configurations become roughly equally likely, p_S(σ) ≈ 1/2^N.
In this case the weights histogram of the high-temperature machine (T = 3.54) displays a sharp peak centered around zero. This means that the visible and hidden layers are quasi-decoupled, and the visible state is random, since the activation probability from Eq. (9) is p_λ(σ_j|h) ≈ 1/2. On the other hand, at low temperature the two fully polarized states σ = 0 and σ = 1 are most probable, and this causes the histogram to be wide and flat. When we start the sampling we initialize both visible and hidden layers randomly. There is a spontaneous symmetry breaking and the machine chooses one of the two polarizations. If the machine chooses the visible state σ = 1 after equilibration, we find, by inspecting the hidden states driving the spins, that the hidden layer is arranged such that only the nodes connected to the positive weights are active (and similarly for the opposite state). The activations will in this case be large and positive, and thus p_λ(σ = 1|h) ≈ 1.

[FIG. 3. Histogram of the relative frequency of appearance of the weight amplitudes for two Boltzmann machines with n_H = 32 hidden nodes, trained at low and high T for the d = 2 Ising model with N = 64 spins.]

For any observable O defined on the spin system we can compare its average value computed on the spins in the data set at temperature T,

$$\langle O(T) \rangle_{\mathcal{D}} = \frac{1}{|\mathcal{D}_T|} \sum_{\sigma \in \mathcal{D}_T} O(\sigma), \qquad (19)$$

with that computed on the spin samples produced by the machine M_τ. After training, we can initialize this machine with a random configuration and perform block Gibbs sampling until equilibration. We can then build another spin data set S_τ with these visible samples and compute the average as

$$\langle O(\tau) \rangle_{\mathcal{S}} = \frac{1}{Z_{\lambda,\tau}} \sum_{\sigma} O(\sigma)\, e^{-\mathcal{E}_{\lambda,\tau}(\sigma)} \simeq \frac{1}{|\mathcal{S}_\tau|} \sum_{\sigma \in \mathcal{S}_\tau} O(\sigma). \qquad (20)$$

If the machine is properly trained we expect the deviations δO = |⟨O(T)⟩_D − ⟨O(τ)⟩_S| to be small.

[FIG. 4. Comparison of the observables generated with the Boltzmann machine with the exact values calculated from the data set (black) for a d = 2 Ising system with N = 64 spins. The observables considered are energy (a), magnetization (b), specific heat (c) and magnetic susceptibility (d). We show the results for Boltzmann machines with hidden nodes n_H = 4 (pink), n_H = 16 (orange) and n_H = 64 (cyan).]

In Fig. 4 we plot the energy per spin E, the magnetization M = ⟨Σ_i σ_i⟩/N, the specific heat C_V = (⟨E²⟩ − ⟨E⟩²)/(N T²) and the magnetic susceptibility χ = (⟨M²⟩ − ⟨M⟩²)/(N T). For the magnetization, we find that even with a number of hidden nodes as low as two (not shown), the machine is able to reproduce the exact behaviour within statistical error. This can be explained, since the learning is based on real-space spin configuration samples, and thus the magnetization is implicitly encoded in the data set. In the case of the energy, however, even though we are computing its value using Eq. (18) applied to the visible units, information about the local energy constraints is not included in the data set. This results in a larger discrepancy between the physical value and that generated with the Boltzmann machine.

Most interestingly, it appears that for a given physical system size N, a Boltzmann machine with fixed n_H learns best away from criticality. In Fig. 5a we plot the scaling of the specific heat with the number of hidden nodes in the machine for five different temperatures. When the system is in an ordered or a disordered state, the machines trained on the spins of the corresponding data sets are able to reproduce the exact values within statistical error, irrespective of n_H.
This is consistent with the weight histograms in Fig. 3. At high temperature this follows from the two layers being quasi-decoupled. For low temperatures we have seen that only the hidden nodes that connect to positive weights (or negative weights, depending on the polarization of the visible layer) are set to 1; increasing the number of hidden nodes will not affect the activation of the visible units. Finally, when the system is at criticality, it is still possible to obtain an accurate approximation of the physical distribution; however, a clear dependence on the finite number of hidden units appears. As illustrated in Fig. 5a, in order to converge the specific heat at the critical point, the required n_H is significantly larger than for T far above or below the transition. We also note that the same scaling plot for the magnetization (not reported here) shows no clear dependence on n_H. Finally, we show in Fig. 5b the scaling curves at criticality for different system sizes. As expected, the threshold in the number of hidden units required for faithful learning of the specific heat grows with increasing N.
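Before concluding, a sketch of the sampling-and-estimation step used throughout this section, i.e. block Gibbs sampling from a trained machine followed by the evaluation of an estimator such as the specific heat, may be useful. This is again our own illustration; `sample_machine`, `ising_energy`, and all parameter values are assumptions, and the usage example below runs on random (untrained) weights only to show the call pattern.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_machine(W, b, c, n_samples, n_equil=1000, thin=10, seed=1):
    """Block Gibbs sampling of visible states from a trained machine (sketch)."""
    rng = np.random.default_rng(seed)
    n_H, N = W.shape
    sigma = rng.integers(0, 2, size=N).astype(float)        # random initial state
    samples = []
    for step in range(n_equil + n_samples * thin):
        h = (rng.random(n_H) < sigmoid(sigma @ W.T + c)).astype(float)
        sigma = (rng.random(N) < sigmoid(h @ W + b)).astype(float)
        if step >= n_equil and (step - n_equil) % thin == 0:
            samples.append(sigma.copy())
    return np.array(samples)

def ising_energy(samples, L):
    """Eq. (18) evaluated on {0,1} samples mapped back to +-1 spins on an
    L x L periodic square lattice."""
    s = 2.0 * samples.reshape(-1, L, L) - 1.0
    return -(s * np.roll(s, 1, axis=1) + s * np.roll(s, 1, axis=2)).sum(axis=(1, 2))

# Usage sketch: specific heat per spin estimated from the machine's samples,
# as in Eq. (20) and the scaling analysis of Fig. 5.
rng = np.random.default_rng(0)
L, n_H, T = 8, 16, 2.27
W = rng.normal(scale=0.01, size=(n_H, L * L))
b, c = np.zeros(L * L), np.zeros(n_H)
E = ising_energy(sample_machine(W, b, c, n_samples=200), L)
print("C_V per spin:", E.var() / (L * L * T ** 2))
```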
IV. CONCLUSIONS

We have trained a generative neural network called a restricted Boltzmann machine to produce a stochastic model of a thermodynamic probability distribution. The physical distributions were produced by Monte Carlo importance-sampling of the spin configurations of a d-dimensional Ising system at different temperatures. For a small system in d = 1, we confirm through an exact calculation that the Boltzmann machine converges to the physical probability distribution with sufficient training steps.

[FIG. 5. Scaling of the specific heat C_V with the number of hidden nodes n_H. In (a) we show the scaling at different temperatures T, when the system is ordered (blue and cyan), disordered (red and pink) and critical (green). In (b) we show the scaling at criticality for different system sizes L. Dotted lines represent the exact value computed on the spin configurations of the training data set.]

For the more difficult problem of the Ising model in d = 2, where exact calculations are impossible, we compare thermodynamic observables produced by the Boltzmann machine to those calculated directly by Monte Carlo. Spin samples produced by Monte Carlo were used to train different machines at distinct temperatures above, below, and at Ising criticality T_c. Once trained, we evaluated different thermodynamic estimators on the samples generated by the Boltzmann machines and showed that they faithfully reproduce those calculated directly from the Monte Carlo samples. For all training instances we fixed the values of the hyper-parameters and varied the number of hidden nodes. We showed that for T > T_c and T < T_c, the Boltzmann machine is able to capture the thermodynamics with only a few hidden nodes. However, near T = T_c, the number of hidden nodes required to reproduce the specific heat becomes large, reflecting the increase of fluctuations at criticality. This growth of the number of hidden nodes required at criticality is reminiscent of the connection between deep learning and the renormalization group suggested previously.

Our results demonstrate that Boltzmann machines may serve as a basic research tool for condensed matter and statistical mechanics, when coupled together with standard Monte Carlo sampling techniques. One application may be to use the approximate configurations produced by the trained machine to calculate thermodynamic estimators that may have been overlooked during the original Monte Carlo sampling (since such configurations are typically discarded). Similarly, estimator calculation could be completely transferred to the machine, in order to re-distribute these tasks away from the Monte Carlo procedure. Conversely, we have demonstrated that the performance of a Boltzmann machine may be evaluated using a comparison of thermodynamic observables calculated from both the physical and the modelled distribution. The conceptual elimination of reliance on the KL divergence may suggest alternatives for evaluating the performance of such machines in other applications.

Among the many possible future applications, it would be particularly interesting to train a Boltzmann machine on configurations produced in various bases by quantum Monte Carlo. One may ask if a standard restricted machine like the one studied in the present paper is sufficient to capture quantum correlations, or if a quantum version of the machine is required.
It would also be interesting to understand the relationship between the sign problem in calculations of estimators directly in quantum Monte Carlo versus their approximation by suitably-trained Boltzmann machines.
ACKNOWLEDGMENTS
We would like to thank M. Amin, E. Andriyash, G. Carleo, J. Carrasquilla, B. Kulchytskyy, D. Schwab, and M. Stoudenmire for many useful discussions. This research was supported by the Natural Sciences and Engineering Research Council of Canada, the Canada Research Chair program, the Ontario Trillium Foundation, and the Perimeter Institute for Theoretical Physics. Simulations were performed on resources provided by the Shared Hierarchical Academic Research Computing Network (SHARCNET). Research at Perimeter Institute is supported through Industry Canada and by the Province of Ontario through the Ministry of Research & Innovation.

Y. LeCun, Y. Bengio, and G. Hinton, Nature, 436 (2008).
G. Hinton, Trends in Cognitive Science, 428 (2007).
A. Krizhevsky, I. Sutskever, and G. Hinton, Proc. Advances in Neural Information Processing Systems, 1090 (2012).
G. Hinton et al., IEEE Signal Processing Magazine, 82 (2012).
R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa, Journal of Machine Learning Research, 2493 (2011).
P. Baldi, P. Sadowski, and D. Whiteson, Nature Communications, 4308 (2014).
L. F. Arsenault, O. A. von Lilienfeld, and A. J. Millis, arXiv:1506.08858 (2015).
S. S. Schoenholz, E. D. Cubuk, D. M. Sussman, E. Kaxiras, and A. J. Liu, Nature Physics, 469 (2016).
J. Carrasquilla and R. G. Melko, arXiv:1605.01735 (2016).
L. Wang, arXiv:1606.00318 (2016).
P. Mehta and D. J. Schwab, arXiv:1410.3831 (2014).
J. Louradour and H. Larochelle, arXiv:1103.4896 (2008).
H. Larochelle and Y. Bengio, ICML'08 Proceedings of the 25th International Conference on Machine Learning (2008).
T. Schmah, G. E. Hinton, S. L. Small, S. Strother, and R. S. Zemel, Advances in Neural Information Processing Systems, 1409 (2008).
H. Larochelle, M. Mandel, R. Pascanu, and Y. Bengio, Journal of Machine Learning Research, 643 (2012).
G. Hinton, S. Osindero, and Y. Teh, Neural Computation, 1527 (2006).
R. Salakhutdinov and I. Murray, ICML'08 Proceedings of the 25th International Conference on Machine Learning, 872 (2008).
R. Salakhutdinov and G. Hinton, International Conference on Artificial Intelligence and Statistics, 448 (2009).
R. Salakhutdinov and G. Hinton, Neural Computation, 1967 (2012).
A. Krogh and J. A. Hertz, Advances in Neural Information Processing Systems, 950 (1992).
N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, Journal of Machine Learning Research, 1929 (2014).
G. Hinton, Neural Computation, 1771 (2002).
G. Carleo and M. Troyer, in preparation (2016).