Estimation of Thermodynamic Observables in Lattice Field Theories with Deep Generative Models
Kim A. Nicoli, Christopher J. Anders, Lena Funcke, Tobias Hartung, Karl Jansen, Pan Kessel, Shinichi Nakajima, Paolo Stornati
HHU-EP-20/16
Machine Learning Group, Technische Universität Berlin, Berlin, Germany; Perimeter Institute for Theoretical Physics, Waterloo, Canada; Department of Mathematics, King's College London, London, United Kingdom; NIC, DESY Zeuthen, Zeuthen, Germany; RIKEN Center for AIP, Tokyo, Japan; Institut für Physik, Humboldt-Universität, Berlin, Germany
In this work, we demonstrate that applying deep generative machine learning models for lattice field theory is a promising route for solving problems where Markov Chain Monte Carlo (MCMC) methods are problematic. More specifically, we show that generative models can be used to estimate the absolute value of the free energy, which is in contrast to existing MCMC-based methods, which are limited to estimating free energy differences. We demonstrate the effectiveness of the proposed method for two-dimensional φ⁴ theory and compare it to MCMC-based methods in detailed numerical experiments.
Introduction. The free energy of a physical system is of great importance since it can be related to several thermodynamical observables. In particular, at non-zero temperature, it allows one to compute the entropy, the pressure or, more generally, the equation of state of the considered physical system. For example, QCD at high temperature, as a generic strongly interacting field theory, plays an essential role in the physics of the early universe and is now extensively probed in large-scale heavy-ion experiments [1]. Hence, knowing such thermodynamic quantities from QCD alone is of very high relevance.

The main tool to study strongly-coupled field theories, such as QCD, is to discretize them on a spacetime lattice and use Markov Chain Monte Carlo (MCMC) methods to numerically calculate the relevant physical quantities. Unfortunately, these thermodynamical quantities are challenging to compute using existing MCMC methods. The fundamental difficulty is that MCMC is not able to directly estimate the partition function of the lattice field theory. Therefore, the absolute value of the free energy cannot be estimated straightforwardly.

Instead, there are a number of MCMC methods to estimate differences of free energies. One typically chooses a free energy difference ΔF = F_b − F_a such that F_a is known either exactly or approximately. One can then deduce the value of the free energy F_b = ΔF + F_a at the desired point in parameter space. If the free energy F_a is only known approximately, this induces an unwanted systematic error. In other situations, such a starting point F_a may not even be available. Most of the methods to estimate ΔF rely on integrating a derivative of the partition function over a trajectory in the parameter space of the lattice field theory [2]. Alternatively, one can use a reweighting procedure to calculate free energy differences between neighbouring points of the discretized trajectory and then sum them up [2, 3]. These approaches require simulations at each parameter point of the discretized trajectory, which is numerically costly and leads to an accumulation of errors. There are also non-equilibrium methods based on Jarzynski's identity to estimate free energy differences without the need for integration [4–6]. However, these methods also require expensive repeated simulations corresponding to an ensemble of non-equilibrium trajectories through phase space.

It is therefore desirable to develop methods which allow the direct estimation of the free energy at a given point in parameter space. In the following, we will propose such a method based on deep generative machine learning models. Over the last years, such deep generative models have been applied with great success to generate, for example, high-resolution images, natural speech, and text (see [7] for an overview). In the recent works [8, 9], deep generative models have also been used in the context of lattice quantum field theories (see also [10]). The main objective of these works was to reduce the integrated autocorrelation of the simulations. In contrast, this work demonstrates that deep generative models can be used to estimate quantities which are not (directly) obtainable by MCMC approaches.

We also note that generative models have been used in [11] to estimate free energy differences in the context of statistical mechanics by combining these models with the Zwanzig free energy perturbation method [12]. Contrary to this approach, our method estimates the absolute value of the free energy.
We furthermore note that the free energy can also be directly computed using the Tensor Renormalization Group method, see [13] for an application to φ⁴ theory. For other novel approaches to obtain thermodynamic quantities and, in particular, the equation of state, see [14, 15].

In the following, we will give a brief overview of relevant aspects of lattice field theories and generative models. We will then discuss how generative models can be used to estimate the free energy and compare this approach to MCMC-based methods in numerical experiments.
Lattice Field Theory. A lattice field theory can be described by an action S(φ). In the following, we will consider (Euclidean) real scalar field theory for concreteness, i.e. φ(x) ∈ ℝ for each lattice site x ∈ Λ of the lattice Λ. The path integral then reduces to an ordinary high-dimensional integral. Therefore, expectation values of operators O(φ) can be calculated by

\[ \langle \mathcal{O} \rangle = \frac{1}{Z} \int \mathcal{D}[\phi] \, \mathcal{O}(\phi) \, \exp(-S(\phi)) \,, \]

where we defined \( \mathcal{D}[\phi] = \prod_{x \in \Lambda} \mathrm{d}\phi(x) \) and the partition function Z is given by

\[ Z = \int \mathcal{D}[\phi] \, \exp(-S(\phi)) \,. \]

If we impose periodic boundary conditions in time for a lattice with temporal extent N_T, the theory is at finite temperature T = 1/β = 1/(N_T a), where a denotes the lattice spacing. The free energy is then defined by

\[ F = -T \ln(Z) \,, \tag{1} \]

and can be related to the pressure p = −F/V, where V denotes the spatial volume of the lattice Λ, whose number of lattice sites we denote by |Λ|. Similarly, the entropy H can be obtained from the free energy via F = U − T H, where U is the internal energy.
Deep Generative Models. We focus on a particular subclass of generative models called normalizing flows (see [16] for a recent review). These flows are distributions q_θ with learnable parameters θ. They also have the appealing property that they allow for efficient sampling and calculation of the probability of the samples.

In more detail, these flows are constructed by defining an invertible neural network g_θ. For a brief overview of neural networks, we refer to the Supplement. The samples φ ∈ ℝ^{|Λ|} are obtained by applying this network to samples z ∈ ℝ^{|Λ|} drawn from a simple prior distribution q_Z, such as a standard normal N(0, 1):

\[ \phi = g_\theta(z) \,, \qquad z \sim q_Z \,. \tag{2} \]

Since the network g_θ is invertible by assumption, it then follows from the change-of-variables theorem that φ ∼ q_θ with

\[ q_\theta(\phi) = q_Z\!\left( g_\theta^{-1}(\phi) \right) \left| \frac{\mathrm{d} g_\theta}{\mathrm{d} z} \right|^{-1} \,. \tag{3} \]

The architecture of the neural network g_θ is chosen such that i) invertibility of g_θ and ii) efficient evaluation of the Jacobian determinant \( \left| \frac{\mathrm{d} g_\theta}{\mathrm{d} z} \right| \) are ensured. A particular example of such an architecture is Non-linear Independent Component Estimation (NICE) [17], for which the neural network g_θ consists of invertible coupling layers y_l : ℝ^{|Λ|} → ℝ^{|Λ|}, i.e.

\[ g_\theta(z) = \left( y_L \circ y_{L-1} \circ \dots \circ y_1 \right)(z) \,. \tag{4} \]

Invertibility and efficient evaluation of the Jacobian determinant are then ensured by splitting the components of the layer y^l = (y^l_u, y^l_d) into two parts, y^l_u ∈ ℝ^{|Λ|−k} and y^l_d ∈ ℝ^k, for given k ∈ {1, …, |Λ|−1}. The layer y^{l+1} = (y^{l+1}_u, y^{l+1}_d) is then recursively defined by

\[ y^{l+1}_u = y^l_u \,, \qquad y^{l+1}_d = y^l_d + m(y^l_u) \,, \tag{5} \]

where m is another neural network (not necessarily satisfying the two requirements from above). Due to the splitting, this can be easily inverted by

\[ y^l_u = y^{l+1}_u \,, \qquad y^l_d = y^{l+1}_d - m(y^{l+1}_u) \,, \]

and the determinant of the Jacobian is given by

\[ \det \frac{\partial y^{l+1}}{\partial y^l} = \begin{vmatrix} \dfrac{\partial y^{l+1}_u}{\partial y^l_u} & \dfrac{\partial y^{l+1}_u}{\partial y^l_d} \\[1ex] \dfrac{\partial y^{l+1}_d}{\partial y^l_u} & \dfrac{\partial y^{l+1}_d}{\partial y^l_d} \end{vmatrix} = \begin{vmatrix} I & 0 \\ * & I \end{vmatrix} = 1 \,. \]

The total Jacobian determinant is then \( \left| \frac{\mathrm{d} g_\theta}{\mathrm{d} z} \right| = 1 \), since it is the product of the Jacobian determinants of the individual layers.
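To make the coupling-layer construction concrete, the following is a minimal sketch of a single additive (NICE-type) coupling layer, assuming PyTorch; the class name, the choice of the inner network m, and the layer sizes are illustrative and not the configuration used in this work.

    import torch
    import torch.nn as nn

    class AdditiveCoupling(nn.Module):
        """One NICE coupling layer: y_u = x_u, y_d = x_d + m(x_u).
        The Jacobian is triangular with unit diagonal, so its determinant is 1."""

        def __init__(self, dim, k, hidden=64):
            super().__init__()
            self.k = k  # size of the unchanged ("upper") block
            # m may be any network; invertibility of the coupling does not rely on it
            self.m = nn.Sequential(
                nn.Linear(k, hidden), nn.Tanh(),
                nn.Linear(hidden, dim - k),
            )

        def forward(self, x):   # x -> y, Eq. (5)
            x_u, x_d = x[:, :self.k], x[:, self.k:]
            return torch.cat([x_u, x_d + self.m(x_u)], dim=1)

        def inverse(self, y):   # y -> x
            y_u, y_d = y[:, :self.k], y[:, self.k:]
            return torch.cat([y_u, y_d - self.m(y_u)], dim=1)

Stacking several such layers, with the roles of the two blocks alternating between layers, yields the full map g_θ of Eq. (4); since every layer has unit Jacobian determinant, Eq. (3) reduces to q_θ(φ) = q_Z(g_θ⁻¹(φ)).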
FIG. 1. Absolute magnetization density as a function of the hopping parameter κ for bare coupling λ = 0.022.

Training. We want to train a generative model which samples field configurations φ ∼ q_θ approximately from the path-integral distribution

\[ p(\phi) = \frac{1}{Z} \exp(-S(\phi)) \,. \tag{6} \]

For this, the Kullback–Leibler divergence [18] between the normalizing flow q_θ and the target distribution p is minimized, i.e.

\[ \mathrm{KL}(q_\theta \,\|\, p) = \int \mathcal{D}[\phi] \, q_\theta(\phi) \ln\!\left( \frac{q_\theta(\phi)}{p(\phi)} \right) = \beta \left( F_q - F \right) \,, \]

where we have defined the variational free energy

\[ \beta F_q = \mathbb{E}_{\phi \sim q_\theta}\!\left[ S(\phi) + \ln q_\theta(\phi) \right] \tag{7} \]

and used the free energy F = −β⁻¹ ln(Z). This divergence vanishes if and only if the distributions q and p are identical [19].

The KL divergence is minimized by gradient descent with respect to the parameters θ of the flow q_θ. Since the free energy F does not depend on the flow q, the variational free energy F_q can equivalently be minimized. Therefore, the training procedure does not require a target distribution (6) with a tractable partition function Z. Using the explicit expression for the probability of the flow (3), we can rewrite the variational free energy as

\[ \beta F_q = \mathbb{E}_{z \sim q_Z}\!\left[ S(g_\theta(z)) - \ln \left| \frac{\mathrm{d} g_\theta}{\mathrm{d} z} \right|(z) + \ln q_Z(z) \right] \,. \]

In training, the expectation value is approximated by its Monte-Carlo estimate. In machine learning, this approach of learning a model from an unnormalized target distribution is very well established [20–23]. Recently, the same method has been used in the context of lattice field theories [8]. Furthermore, this approach has been applied to quantum chemistry [24] and statistical physics [25–27].

The variational free energy does not allow us to infer the value of the KL divergence since the free energy F is not known. In order to alleviate this shortcoming, we define the random variable C(φ) = S(φ) + ln q_θ(φ), which is related to the variational free energy by βF_q = ⟨C⟩_q. In the appendix, we show that

\[ \mathrm{KL}(q_\theta \,\|\, p) = \tfrac{1}{2} \mathrm{Var}_q(C) + \mathcal{O}\!\left( \mathbb{E}_q\!\left[ |w - 1|^3 \right] \right) \,, \]

where we have defined the importance weight w(φ) = p(φ)/q(φ). Thus, convergence of training will result in a small variance Var_q(C). In practice, a Monte-Carlo estimate of this quantity can be calculated without any significant overhead during training, as C(φ) is also needed for the Monte-Carlo estimation of the variational free energy F_q, see (7). It is therefore advisable to closely monitor the variance of C during training.
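A possible implementation of one training step is sketched below, again assuming PyTorch. The method flow.sample_with_log_prob is a hypothetical interface that returns φ = g_θ(z) together with ln q_θ(φ) computed via Eqs. (2)–(3); the optimizer and batch size are illustrative choices.

    import torch

    def train_step(flow, action, optimizer, batch_size=1024):
        """One gradient step on the variational free energy of Eq. (7),
        beta*F_q = E_{phi ~ q_theta}[ S(phi) + ln q_theta(phi) ],
        estimated by Monte-Carlo over a batch of flow samples."""
        phi, log_q = flow.sample_with_log_prob(batch_size)
        C = action(phi) + log_q      # the random variable C(phi)
        loss = C.mean()              # Monte-Carlo estimate of beta*F_q
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # Var_q(C) comes for free and can be monitored as a proxy for the KL divergence
        return loss.item(), C.var().item()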
FIG. 2. Estimate of the free energy density at λ = 0.022 and κ = 0.3 obtained with the flow-based and the MCMC-based method for various lattice sizes. MCMC estimates are obtained from integrating free energy differences. Both methods use the same number of samples (5.6 M) for estimation. Errors are obtained with the delta and uwerr method [32] for flow and HMC, respectively (see Appendix for a Jackknife error analysis).
Estimation of Thermodynamical Observables.
The partition function Z can be rewritten as

\[ Z = \int \mathcal{D}[\phi] \, q_\theta(\phi) \, \tilde{w}(\phi) \,, \tag{8} \]

where we have defined the unnormalized importance weight w̃(φ) = exp(−S(φ))/q_θ(φ). Therefore, the partition function can be estimated by Monte-Carlo as follows:

\[ \hat{Z} = \frac{1}{N} \sum_{i=1}^{N} \tilde{w}(\phi_i) \qquad \text{with} \quad \phi_i \sim q_\theta \,. \tag{9} \]

We emphasize that the sampling procedure does not need to be sequential (as for a Markov chain). As a result, it can be parallelized very efficiently and does not suffer from autocorrelation. From Ẑ, one can then easily estimate the free energy by

\[ \hat{F} = -T \ln \hat{Z} \,. \tag{10} \]

From the free energy (10), one can then straightforwardly obtain estimates for the pressure and entropy, as explained above. The estimator (10) has been extensively studied in the context of variational inference [33–35]. It was shown in [33] that it is a statistically consistent estimator. In [34], its variance and bias were derived using the delta method (see also [26, 35]). For convenience, we summarize the relevant results in the Supplement. Alternatively, one can use the Jackknife method to estimate the bias and variance [36].
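A hedged sketch of the estimator (9)–(10) is given below; it assumes the same hypothetical flow interface as above and works with ln w̃ = −S(φ) − ln q_θ(φ) to avoid numerical overflow.

    import torch

    @torch.no_grad()
    def estimate_free_energy(flow, action, n_samples, batch_size, temperature):
        """Estimate F_hat = -T ln Z_hat with Z_hat = (1/N) sum_i w_tilde(phi_i),
        Eqs. (9)-(10), using a log-sum-exp over ln w_tilde for numerical stability."""
        log_w = []
        for _ in range(n_samples // batch_size):
            phi, log_q = flow.sample_with_log_prob(batch_size)
            log_w.append(-action(phi) - log_q)      # ln w_tilde(phi)
        log_w = torch.cat(log_w)
        n = torch.tensor(float(log_w.numel()))
        log_Z_hat = torch.logsumexp(log_w, dim=0) - torch.log(n)
        return -temperature * log_Z_hat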
Numerical Experiments.
We apply the proposed method to two-dimensional real scalar field theory with action

\[ S = \sum_{x \in \Lambda} \left[ -2\kappa \sum_{\hat{\mu}=1}^{2} \phi(x)\, \phi(x + \hat{\mu}) + (1 - 2\lambda)\, \phi(x)^2 + \lambda\, \phi(x)^4 \right] \,, \]

where κ is the hopping parameter and λ denotes the bare coupling constant of the theory. The action is invariant under Z₂-transformations, i.e. φ → −φ. Figure 1 shows the absolute magnetization ⟨|φ|⟩ as a function of the hopping parameter κ. As the hopping parameter κ increases, spontaneous magnetization is observed.

In the following, we will estimate the free energy F_e at λ_e = 0.022 and κ_e = 0.3 for lattices |Λ| = N_L × N_T of size 64 × 8, 48 × 8, 32 × 8, and 16 × 8. We choose the flow such that q_θ is invariant under Z₂-transformations, i.e. q_θ(φ) = q_θ(−φ). By the definition (3) of q_θ, an odd function g_θ(−z) = −g_θ(z) implies Z₂-invariance of q_θ. The map g_θ is odd if all its coupling blocks y_l are odd, see (4). The latter condition can be ensured by choosing an odd neural network m for the coupling (5), which we achieve by using tanh non-linearities and vanishing biases for the network m.

After training has completed, the free energy is then computed using the proposed estimator (10). For error analysis, we use both the Jackknife as well as the delta method. We then check that the error estimates are compatible. We refer to the Supplement for a more detailed description.

For MCMC, we use a reweighting procedure [2, 3] which is significantly more involved and uses the relation F_e = ΔF^e_b + F_b. Here, F_b is the free energy at κ_b = 0 and λ_b = λ_e. The value of F_b can be calculated analytically, since for vanishing hopping parameter κ

\[ F(\lambda) = -|\Lambda| \, T \ln z(\lambda) \,, \]

where |Λ| denotes the number of sites of the lattice Λ and

\[ z(\lambda) = \sqrt{\frac{1 - 2\lambda}{4\lambda}} \, \exp\!\left( \frac{(1 - 2\lambda)^2}{8\lambda} \right) K_{1/4}\!\left( \frac{(1 - 2\lambda)^2}{8\lambda} \right) \,, \]

with K_n being the modified Bessel function of the second kind. We prove this relation in the Supplement. The free energy difference ΔF^e_b = F_e − F_b = −T ln(Z_e/Z_b) can be obtained from

\[ \mathbb{E}_{p_b}\!\left[ \frac{\exp(-S_e)}{\exp(-S_b)} \right] = \frac{1}{Z_b} \int \mathcal{D}[\phi] \, e^{-S_b(\phi)} \, \frac{e^{-S_e(\phi)}}{e^{-S_b(\phi)}} = \frac{Z_e}{Z_b} \,. \]

We estimate this expectation value with an overrelaxed HMC algorithm [28–31]. In practice, the variance of the estimator will become prohibitively large if the two distributions p_b and p_e do not have sufficient overlap. We therefore choose intermediate distributions p_{i_1}, …, p_{i_K}, ensuring that neighbouring distributions p_{i_k} and p_{i_{k+1}} have sufficient overlap. The free energy difference can then be obtained by

\[ \Delta F^e_b = \Delta F^{e}_{i_K} + \Delta F^{i_K}_{i_{K-1}} + \dots + \Delta F^{i_1}_b \,. \]
In our numerical experiments, we keep λ = 0.022 fixed and only vary the hopping parameter κ of the intermediate distributions p_i. We choose a difference in hopping parameter of δκ = 0.01 for κ ∈ [0.2, 0.3] and δκ = 0.05 for all other intermediate hopping parameters κ. We therefore use K = 14 Markov chains with 400k steps each. Thus, a total number of 5.6 million configurations is used for estimation. For a detailed analysis of the dependence of our results on this choice of δκ, we refer to the Supplement.

The error analysis is performed with both the uwerr [32] and the Jackknife method, which are checked to lead to consistent estimates. We again refer to the Supplement for a more detailed description.

Figure 2 shows that the estimates of both the flow and MCMC are compatible within error bars. Notably however, the estimate of the flow is performed directly at the desired point in parameter space. On the other hand, the MCMC estimate required integration over a trajectory in parameter space for which the starting point is known. This is both a practical and a conceptual concern. For example, in finite-temperature QCD, one often uses a trajectory whose initial free energy is approximated by the Hadron Resonance Gas model (see for example [2]). This in turn introduces an undesirable systematic error. In other situations, one may simply not have a suitable starting point. Furthermore, summing up the free energy differences along the trajectory will lead to an accumulation of errors. In our example of the φ⁴ theory, the trajectory had to pass the critical point, which is challenging for MCMC methods due to critical slowing down and increases the variance of the estimator. This problem is not present for the flow-based estimator.
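For comparison, a minimal sketch of one step of the MCMC reweighting baseline is shown below (NumPy). It only illustrates the free energy perturbation formula used between neighbouring points of the trajectory; the generation of the configurations themselves (here assumed to be given) requires a separate HMC simulation at each intermediate parameter point.

    import numpy as np

    def free_energy_difference(configs, action_a, action_b, temperature):
        """Delta F = F_b - F_a = -T ln E_{p_a}[ exp(-(S_b - S_a)) ],
        estimated from configurations drawn from p_a (e.g. by overrelaxed HMC).
        A log-mean-exp keeps the estimate numerically stable."""
        dS = np.array([action_b(phi) - action_a(phi) for phi in configs])
        log_ratio = np.logaddexp.reduce(-dS) - np.log(len(dS))
        return -temperature * log_ratio

    # The total difference is accumulated along the discretized trajectory:
    # Delta_F = sum of free_energy_difference(...) over all neighbouring pairs.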
Conclusion. In this letter, we have proposed a method to directly estimate the free energy and hence thermodynamical observables of lattice field theories using deep generative models. This method is of great conceptual appeal as it avoids cumbersome integration through parameter space and does not require an exactly or approximately known integration constant. Future work will focus on scaling this approach to four-dimensional gauge theories. While it is presently not understood how to efficiently incorporate non-abelian gauge theories in normalizing flows, recent work demonstrated that flows invariant to certain abelian gauge symmetries can be constructed [9]. This recent progress, combined with the enormous ongoing advances in deep learning, makes it very promising that our method can be applied to non-abelian gauge theories, and ultimately QCD, in the not too distant future.
Acknowledgements.
This work was supported in part by the German Ministry for Education and Research (BMBF) under Grants 01IS14013A-E, 01GQ1115, 01GQ0850, 01IS18025A and 01IS18037A. This work is also supported by the Fraunhofer Heinrich-Hertz-Institut (HHI) and by the grant funded by the DFG (EXC 2046/1, project-id 390685689). P.S. thanks the Helmholtz Einstein International Berlin Research School in Data Science (HEIBRiDS) for funding. Research at Perimeter Institute is supported by the Government of Canada through the Department of Innovation, Science and Economic Development Canada and by the Province of Ontario through the Ministry of Economic Development, Job Creation and Trade. We want to express our gratitude for valuable feedback by Klaus-Robert Müller, Wojciech Samek, Nils Strodthoff, Alessandro Nada, and Stefan Kühn.

[1] W. Busza, K. Rajagopal, and W. Van Der Schee, Heavy ion collisions: the big picture and the big questions, Annual Review of Nuclear and Particle Science, 339 (2018).
[2] O. Philipsen, The QCD equation of state from the lattice, Progress in Particle and Nuclear Physics, 55 (2013).
[3] P. De Forcrand, M. D'Elia, and M. Pepe, 't Hooft loop in SU(2) Yang-Mills theory, Physical Review Letters, 1438 (2001).
[4] C. Jarzynski, Equilibrium free-energy differences from nonequilibrium measurements: A master-equation approach, Physical Review E, 5018 (1997).
[5] C. Jarzynski, Nonequilibrium equality for free energy differences, Physical Review Letters, 2690 (1997).
[6] M. Caselle, A. Nada, and M. Panero, QCD thermodynamics from lattice calculations with nonequilibrium methods: The SU(3) equation of state, Physical Review D, 054513 (2018).
[7] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning (MIT Press, 2016).
[8] M. S. Albergo, G. Kanwar, and P. E. Shanahan, Flow-based generative models for Markov chain Monte Carlo in lattice field theory, Phys. Rev. D, 034515 (2019).
[9] G. Kanwar, M. S. Albergo, D. Boyda, K. Cranmer, D. C. Hackett, S. Racanière, D. J. Rezende, and P. E. Shanahan, Equivariant flow-based sampling for lattice gauge theory, arXiv preprint arXiv:2003.06413 (2020).
[10] J. M. Urban and J. M. Pawlowski, Reducing autocorrelation times in lattice simulations with generative adversarial networks, arXiv preprint arXiv:1811.03533 (2018).
[11] P. Wirnsberger, A. J. Ballard, G. Papamakarios, S. Abercrombie, S. Racanière, A. Pritzel, D. J. Rezende, and C. Blundell, Targeted free energy estimation via learned mappings, arXiv preprint arXiv:2002.04913 (2020).
[12] R. W. Zwanzig, High-temperature equation of state by a perturbation method. II. Polar gases, The Journal of Chemical Physics, 1915 (1955).
[13] S. Akiyama, D. Kadoh, Y. Kuramashi, T. Yamashita, and Y. Yoshimura, Tensor renormalization group approach to four-dimensional complex φ⁴ theory at finite density, arXiv preprint arXiv:2005.04645 (2020).
[14] M. Asakawa, T. Hatsuda, E. Itou, M. Kitazawa, H. Suzuki (FlowQCD Collaboration), et al., Thermodynamics of SU(3) gauge theory from gradient flow on the lattice, Physical Review D, 011501 (2014).
[15] L. Giusti and M. Pepe, Equation of state of a relativistic theory from a moving frame, Physical Review Letters, 031601 (2014).
[16] G. Papamakarios, E. Nalisnick, D. J. Rezende, S. Mohamed, and B. Lakshminarayanan, Normalizing flows for probabilistic modeling and inference, arXiv preprint arXiv:1912.02762 (2019).
[17] L. Dinh, D. Krueger, and Y. Bengio, NICE: Non-linear independent components estimation, arXiv preprint arXiv:1410.8516 (2014).
[18] D. J. C. MacKay, Information Theory, Inference and Learning Algorithms (Cambridge University Press, 2003).
[19] More precisely, the distributions p and q only have to be identical almost everywhere, i.e. up to a set of measure zero.
[20] D. P. Kingma and M. Welling, Auto-encoding variational Bayes, arXiv preprint arXiv:1312.6114 (2013).
[21] D. J. Rezende, S. Mohamed, and D. Wierstra, Stochastic backpropagation and approximate inference in deep generative models, arXiv preprint arXiv:1401.4082 (2014).
[22] D. J. Rezende and S. Mohamed, Variational inference with normalizing flows, arXiv preprint arXiv:1505.05770 (2015).
[23] T. Müller, B. McWilliams, F. Rousselle, M. Gross, and J. Novák, Neural importance sampling, ACM Transactions on Graphics (TOG), 1 (2019).
[24] F. Noé, S. Olsson, J. Köhler, and H. Wu, Boltzmann generators: Sampling equilibrium states of many-body systems with deep learning, Science (2019).
[25] D. Wu, L. Wang, and P. Zhang, Solving statistical mechanics using variational autoregressive networks, Physical Review Letters, 080602 (2019).
[26] K. A. Nicoli, S. Nakajima, N. Strodthoff, W. Samek, K.-R. Müller, and P. Kessel, Asymptotically unbiased estimation of physical observables with neural samplers, Physical Review E, 023304 (2020).
[27] K. Nicoli, P. Kessel, N. Strodthoff, W. Samek, K.-R. Müller, and S. Nakajima, Comment on "Solving statistical mechanics using VANs": Introducing saVANt - VANs enhanced by importance and MCMC sampling, arXiv preprint arXiv:1903.11048 (2019).
[28] S. L. Adler, Over-relaxation method for the Monte Carlo evaluation of the partition function for multiquadratic actions, Physical Review D, 2901 (1981).
[29] C. Whitmer, Over-relaxation methods for Monte Carlo simulations of quadratic and multiquadratic actions, Physical Review D, 306 (1984).
[30] D. J. Callaway and A. Rahman, Lattice gauge theory in the microcanonical ensemble, Physical Review D, 1506 (1983).
[31] Z. Fodor and K. Jansen, Overrelaxation algorithm for coupled gauge-Higgs systems, Physics Letters B, 119 (1994).
[32] U. Wolff (ALPHA Collaboration), Monte Carlo errors with less errors, Computer Physics Communications, 143 (2004).
[33] Y. Burda, R. Grosse, and R. Salakhutdinov, Importance weighted autoencoders, arXiv preprint arXiv:1509.00519 (2015).
[34] S. Nowozin, Debiasing evidence approximations: On importance-weighted autoencoders and jackknife variational inference, ICLR 2018 (2018).
[35] Y. W. Teh, D. Newman, and M. Welling, A collapsed variational Bayesian inference algorithm for latent Dirichlet allocation, in Advances in Neural Information Processing Systems (2007), pp. 1353–1360.
[36] C. Gattringer and C. Lang, Quantum Chromodynamics on the Lattice: An Introductory Presentation, Vol. 788 (Springer Science & Business Media, 2009).
[37] P. J. Bickel and K. A. Doksum, Mathematical Statistics: Basic Ideas and Selected Topics, Volume I, Vol. 117 (CRC Press, 2015).
Conventions for Action
The form of the action S used in the main text is

\[ S(\phi) = \sum_{x \in \Lambda} \left[ -2\kappa \sum_{\hat{\mu}=1}^{2} \phi(x)\, \phi(x + \hat{\mu}) + (1 - 2\lambda)\, \phi(x)^2 + \lambda\, \phi(x)^4 \right] \,. \tag{11} \]

It can be obtained by starting from the (more standard) action

\[ S(\varphi) = \sum_{x \in \Lambda} a^2 \left[ \frac{1}{2} \sum_{\hat{\mu}=1}^{2} \left( \frac{\varphi(x + a\hat{\mu}) - \varphi(x)}{a} \right)^2 + \frac{m^2}{2}\, \varphi(x)^2 + \frac{g}{4!}\, \varphi(x)^4 \right] \tag{12} \]

and performing the following re-definitions:

\[ \varphi = (2\kappa)^{1/2}\, \phi \,, \tag{13} \]
\[ (a m)^2 = \frac{1 - 2\lambda}{\kappa} - 4 \,, \tag{14} \]
\[ a^2 g = \frac{6\lambda}{\kappa^2} \,. \tag{15} \]
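For reference, the lattice action (11) can be evaluated with a few lines of NumPy; the implementation below assumes a field stored as a two-dimensional array with periodic boundary conditions and is only meant as an illustration of the conventions.

    import numpy as np

    def lattice_action(phi, kappa, lam):
        """S = sum_x [ -2*kappa * sum_mu phi(x) phi(x+mu)
                       + (1 - 2*lam) * phi(x)**2 + lam * phi(x)**4 ],  Eq. (11),
        for a 2D field phi with periodic boundaries."""
        hopping = sum(phi * np.roll(phi, -1, axis=mu) for mu in range(phi.ndim))
        return np.sum(-2.0 * kappa * hopping
                      + (1.0 - 2.0 * lam) * phi**2 + lam * phi**4)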
A brief overview of Deep Learning

Neural Networks
Neural networks are a machine-learning algorithm which has proven to be particularly powerful. A neural network is built of layers, which are defined by

\[ y^{(l)}(x) = \sigma\!\left( W_l\, x + b_l \right) \,, \]

where x ∈ ℝⁿ and y^{(l)} ∈ ℝᵐ are the input and output of the layer. The output of the layer is also often called the activation of the layer. The weights W_l ∈ ℝ^{m×n} and the bias b_l ∈ ℝᵐ are the learnable parameters of the neural network. The non-linearity σ : ℝ → ℝ is a non-linear function which is applied element-wise to the components of W_l x + b_l. Widely-used activation functions are σ(x) = max(x, 0) or σ(x) = tanh(x).

A neural network consists of L such layers, i.e.

\[ g(x) = \left( y^{(L)} \circ \dots \circ y^{(1)} \right)(x) \,. \]

It is important to note that the weights W_l and biases b_l do not have to be of the same dimensionality for each layer (although we did not make this explicit in our notation). It is also important to note that we have merely described the most simple type of neural network, namely a fully-connected neural network. There is a zoo of other neural networks, but we will refrain from a more detailed discussion as it is not needed for our purposes (see [7] for an overview).
Learning Parameters with Backpropagation

The parameters of the neural network, W = {(W_l, b_l), l = 1, …, L}, are determined by minimizing a certain loss function L by gradient descent (see (7) for the particular loss function used in this work). It is important to emphasize that the number of parameters is typically very large (many millions for modern neural networks). It is therefore clear that one cannot determine the gradient by finite differences, as we would need to calculate the finite-difference ratio for each of these parameters, which is prohibitively expensive.

The basic idea for calculating the gradient ∇_W L is to use the fact that we know the functional form of the neural network: the gradient of the loss is given by

\[ \nabla_W \mathcal{L} = \frac{\partial \mathcal{L}}{\partial y^{(L)}} \, \frac{\partial y^{(L)}}{\partial y^{(L-1)}} \dots \frac{\partial y^{(1)}}{\partial x} \,. \]

Each term in this expression is known; for example,

\[ \frac{\partial y^{(l+1)}}{\partial y^{(l)}} = \sigma'\!\left( y^{(l)} \right) W_l \,. \]

For a fixed non-linearity, we know the analytical form of the derivative σ′. This observation leads to the following algorithm: we first perform a forward pass of the neural network, i.e., starting from the input x, we calculate the activations y^{(l)} for each layer and store them in memory. This process ends with the final activation y^{(L)}, which is, by definition, the output of the neural network. The gradient of each layer, ∂y^{(l+1)}/∂y^{(l)} = σ′(y^{(l)}) W_l, can then be calculated directly (as we have stored the activation y^{(l)}). Crucially, we only need the matrix product of these Jacobians, and it is efficient to start by calculating the gradient with respect to the output layer L, then layer L−1, and so on, because

\[ \frac{\partial \mathcal{L}}{\partial y^{(L)}} \, \frac{\partial y^{(L)}}{\partial y^{(L-1)}} \dots \frac{\partial y^{(l+1)}}{\partial y^{(l)}} \]

is a vector with the same number of components as y^{(l)}. We can therefore save memory by simply overwriting the stored activation y^{(l)}. This algorithm is called backpropagation and allows us to calculate the gradient ∇_W L for roughly the same cost as a forward pass of the neural network.
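The forward/backward structure described above can be written down in a few lines of NumPy. The sketch below is for a plain fully-connected tanh network and is purely illustrative; it is not the network architecture used in this work.

    import numpy as np

    def forward(x, params):
        """Forward pass; the activations of all layers are stored for the backward pass."""
        activations = [x]
        for W, b in params:
            x = np.tanh(W @ x + b)
            activations.append(x)
        return activations

    def backward(activations, params, grad_out):
        """Backpropagation: starting from grad_out = dL/dy_L, multiply the stored
        layer Jacobians from the output layer down to the input and collect
        (dL/dW_l, dL/db_l) for every layer."""
        grads = [None] * len(params)
        delta = grad_out
        for l in range(len(params) - 1, -1, -1):
            W, _ = params[l]
            y = activations[l + 1]                  # activation of layer l
            delta = delta * (1.0 - y**2)            # tanh'(z) = 1 - tanh(z)^2
            grads[l] = (np.outer(delta, activations[l]), delta)
            delta = W.T @ delta                     # gradient w.r.t. the previous activation
        return grads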
Relation between Var(C) and KL divergence

Theorem. Let C(φ) = S(φ) + ln q(φ). The following relation between the KL divergence and the variance of C holds:

\[ \mathrm{KL}(q_\theta \,\|\, p) = \tfrac{1}{2} \mathrm{Var}_q(C) + \mathcal{O}\!\left( \mathbb{E}_q\!\left[ |w-1|^3 \right] \right) \,, \]

where w(φ) = p(φ)/q(φ) is the normalized importance weight.

Proof. The expectation value of the normalized importance weight is

\[ \mathbb{E}_q\!\left[ \frac{p}{q} \right] = \int \mathcal{D}[\phi] \, p(\phi) = 1 \,. \]

The Kullback–Leibler divergence can be rewritten in terms of the normalized importance weight:

\[ \mathrm{KL}(q \,\|\, p) = \mathbb{E}_q\!\left[ \ln \frac{q}{p} \right] = -\mathbb{E}_q\!\left[ \ln w \right] \,. \]

We now expand the KL divergence around the expectation value of the normalized importance weight:

\[ \mathrm{KL}(q \,\|\, p) = -\mathbb{E}_q\!\left[ \ln\!\left( 1 + (w-1) \right) \right] = \mathbb{E}_q\!\left[ \sum_{j=1}^{\infty} \frac{(-1)^j}{j} (w-1)^j \right] = -\underbrace{\mathbb{E}_q[w-1]}_{=0} + \frac{1}{2}\, \mathbb{E}_q\!\left[ (w-1)^2 \right] + \mathcal{O}\!\left( \mathbb{E}_q\!\left[ |w-1|^3 \right] \right) \,. \]

We now relate this expression to the variance of C. To this end, we first observe that C = −ln w̃, where w̃ = exp(−S)/q is the unnormalized importance weight. We then rewrite the expectation value of C as

\[ \mathbb{E}_q[C] = -\mathbb{E}_q[\ln \tilde{w}] = -\mathbb{E}_q[\ln w + \ln Z] = -\ln Z - \mathbb{E}_q[\ln w] = -\ln Z + \mathrm{KL}(q \,\|\, p) = -\ln Z + \mathcal{O}\!\left( \mathbb{E}_q\!\left[ (w-1)^2 \right] \right) \,, \]

where the last step uses the expansion for the KL divergence derived above. It then follows that its variance is given by

\[ \mathrm{Var}_q(C) = \mathbb{E}_q\!\left[ \left( C - \mathbb{E}_q[C] \right)^2 \right] = \mathbb{E}_q\!\left[ \left( -\ln \tilde{w} + \mathbb{E}_q[\ln \tilde{w}] \right)^2 \right] = \mathbb{E}_q\!\Big[ \big( \underbrace{-\ln \tilde{w} + \ln Z}_{= -\ln w} + \mathcal{O}\!\left( \mathbb{E}_q[(w-1)^2] \right) \big)^2 \Big] \,. \]

Expanding the logarithm around E_q[w] = 1 again, we obtain

\[ \mathrm{Var}_q(C) = \mathbb{E}_q\!\left[ (w-1)^2 \right] + \mathcal{O}\!\left( \mathbb{E}_q\!\left[ |w-1|^3 \right] \right) \,. \]

Combining this expression with the expansion derived for the KL divergence, we obtain the claim of the theorem.

The higher-order moments will be small towards the end of the training process, for which q ≈ p and thus w ≈ 1. Thus, the variance of C will become small. We indeed observe this behaviour in our numerical experiments, see Figure 3 for an example.

FIG. 3. The variance Var_q(C) decreases during training. We use hopping parameter κ = 0.3 and bare coupling λ = 0.022.
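In practice, the variance of C can be tracked directly from the quantities that are already computed for the loss. The helper below (NumPy) also evaluates the leading term ½E_q[(w−1)²] from self-normalized weights, so that the two sides of the relation above can be compared during training; it is a sketch, not part of the published code.

    import numpy as np

    def kl_proxies(log_w_tilde):
        """Given ln w_tilde = -S(phi) - ln q(phi) for a batch phi ~ q, return
        (0.5*Var_q(C), 0.5*E_q[(w-1)^2]) with C = -ln w_tilde and the
        self-normalized weight w = w_tilde / mean(w_tilde)."""
        log_w_tilde = np.asarray(log_w_tilde)
        C = -log_w_tilde
        w_tilde = np.exp(log_w_tilde - log_w_tilde.max())   # rescaling cancels below
        w = w_tilde / w_tilde.mean()
        return 0.5 * np.var(C), 0.5 * np.mean((w - 1.0) ** 2)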
Analytic Solution for Partition Function

The action of the scalar field theory is given by

\[ S = \sum_{x \in \Lambda} \left[ -2\kappa \sum_{\hat{\mu}=1}^{2} \phi(x)\, \phi(x + \hat{\mu}) + (1 - 2\lambda)\, \phi(x)^2 + \lambda\, \phi(x)^4 \right] \,. \]

We want to calculate the partition function

\[ Z = \int \mathcal{D}[\phi] \, \exp(-S(\phi)) \,, \]

which for vanishing hopping parameter κ decouples into independent integrals over each site of the lattice Λ:

\[ Z = \prod_{x \in \Lambda} \left( \int \mathrm{d}\phi(x) \, \exp\!\left( -\lambda\, \phi(x)^4 - (1 - 2\lambda)\, \phi(x)^2 \right) \right) \,. \]

The partition function can then be calculated analytically using the integral

\[ \int \exp(-a x^4 - b x^2) \, \mathrm{d}x = \sqrt{\frac{b}{4a}} \, \exp\!\left( \frac{b^2}{8a} \right) K_{1/4}\!\left( \frac{b^2}{8a} \right) \,, \]

where K_n is the modified Bessel function of the second kind. Using this formula, we obtain the following analytic form of the free energy:

\[ F = -T \, |\Lambda| \ln(z) \,, \]

where we have defined

\[ z(\lambda) = \sqrt{\frac{1 - 2\lambda}{4\lambda}} \, \exp\!\left( \frac{(1 - 2\lambda)^2}{8\lambda} \right) K_{1/4}\!\left( \frac{(1 - 2\lambda)^2}{8\lambda} \right) \,. \]

This result corresponds to the zeroth order of the hopping expansion [36] of the partition function Z and one may, in principle, also calculate higher-order corrections. However, they are not needed for our purposes.

FIG. 4. Integrated autocorrelation time of the free energy during refinement of the step size. The shaded red areas refer to the intervals in hopping parameter κ for which the refinement is applied (see Table I). Darker shading indicates a narrower step size. The experiments were performed using the overrelaxed HMC algorithm.
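The analytic result above is easy to evaluate numerically; the sketch below uses the exponentially scaled Bessel function scipy.special.kve, for which kve(ν, u) = exp(u) K_ν(u), so that the product of the exponential and K_{1/4} does not overflow for small λ.

    import numpy as np
    from scipy.special import kve

    def free_energy_kappa0(lam, temperature, n_sites):
        """F = -T * |Lambda| * ln z(lam) at vanishing hopping parameter, with
        z(lam) = sqrt((1-2*lam)/(4*lam)) * exp(u) * K_{1/4}(u), u = (1-2*lam)^2/(8*lam)."""
        u = (1.0 - 2.0 * lam) ** 2 / (8.0 * lam)
        log_z = 0.5 * np.log((1.0 - 2.0 * lam) / (4.0 * lam)) + np.log(kve(0.25, u))
        return -temperature * n_sites * log_z

This provides the starting point F_b of the MCMC trajectory at κ = 0 and can also serve as a cross-check of the flow-based estimate in that regime.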
Error Analysis for Free Energy Estimator
Our discussion is based on [34], which discussed the same results in the context of variational inference. We provide a review here since these results may be hard to extract for physicists not familiar with variational inference. We also point out a subtlety that was not discussed in the previous work.

The estimator for the free energy is given by

\[ \hat{F} = -T \ln \frac{1}{N} \sum_{i=1}^{N} \tilde{w}(\phi_i) \,, \qquad \phi_i \sim q_\theta \,. \tag{16} \]

Theorems for the variance and bias of this estimator are discussed in the following. For this, we use the delta method of moments, which is summarized in the following theorem.
Theorem. Let \( \hat{X}_N = \frac{1}{N} \sum_{i=1}^{N} X_i \) be the sample mean of independent and identically distributed random variables X_i with \( \mathbb{E}[X_i^{k+2}] < \infty \) for k ∈ {1, 2}. Let h be a real-valued function with uniformly bounded derivatives. It then holds that

\[ \mathbb{E}\!\left[ h(\hat{X}_N) \right] = c_0 + \frac{c_1}{N} + \mathcal{O}\!\left( \frac{1}{N^2} \right) \,, \qquad \text{where} \quad c_0 = h(\mu) \,, \quad c_1 = \frac{h''(\mu)\, \sigma^2}{2} \,, \]

with \( \sigma^2 = \mathbb{E}[(X - \mathbb{E}X)^2] \) and \( \mu = \mathbb{E}[X] \).

We refer to Chapter 5.3 of [37] for a proof and more details.

The application of the delta method to the free energy estimator F̂ is, in practice, subject to a subtlety regarding the bounded differentiability of the function h. We will ignore this subtlety in the following and return to it at the end of the section.
Theorem. The bias of F̂ is given by

\[ \mathrm{B}\!\left[ -\beta \hat{F} \right] = -\frac{1}{2N} \, \frac{\mathbb{E}_q\!\left[ (\tilde{w} - \mathbb{E}_q[\tilde{w}])^2 \right]}{\mathbb{E}_q[\tilde{w}]^2} + \mathcal{O}\!\left( N^{-2} \right) \,, \]

assuming that \( \mathbb{E}_q[\tilde{w}^{k+2}] < \infty \) for k = 0, 1.

Proof. The bias of −βF̂ = ln Ẑ is given by

\[ \mathrm{B}\!\left[ -\beta \hat{F} \right] = \mathbb{E}_q[\ln \hat{Z}] - \ln Z \,. \]

Using the delta method for moments, we derive that

\[ \mathbb{E}_q[\ln \hat{Z}] = \ln Z - \frac{1}{2 N Z^2} \, \mathbb{E}_q\!\left[ (\tilde{w} - \mathbb{E}_q[\tilde{w}])^2 \right] + \mathcal{O}\!\left( N^{-2} \right) \,, \tag{17} \]

where we have used that h(x) = ln(x) has second derivative h''(x) = −1/x². The proof then concludes by observing that \( \mathbb{E}_q[\tilde{w}] = Z \).
Theorem. The variance of F̂ is given by

\[ \mathrm{Var}\!\left[ -\beta \hat{F} \right] = \frac{1}{N} \, \frac{\mathbb{E}_q\!\left[ (\tilde{w} - \mathbb{E}_q[\tilde{w}])^2 \right]}{\left( \mathbb{E}_q[\tilde{w}] \right)^2} + \mathcal{O}\!\left( \frac{1}{N^2} \right) \,, \]

assuming that \( \mathbb{E}_q[\tilde{w}^{k+2}] < \infty \) for k = 0, 1.

Proof. The variance can be written as

\[ \mathrm{Var}\!\left[ -\beta \hat{F} \right] = \mathbb{E}_q\!\left[ (\ln \hat{Z})^2 \right] - \mathbb{E}_q\!\left[ \ln \hat{Z} \right]^2 \,. \]

We now evaluate both terms on the right-hand side individually using the delta method. For the first term, we use the delta method with h(x) = (ln x)², which has second derivative

\[ h''(x) = \frac{2\,(1 - \ln x)}{x^2} \,. \]

Using this expression, we then obtain that \( \mathbb{E}_q[(\ln \hat{Z})^2] \) is equal to

\[ (\ln Z)^2 + \left( \frac{1 - \ln Z}{Z^2} \right) \frac{\mathbb{E}_q\!\left[ (\tilde{w} - \mathbb{E}_q[\tilde{w}])^2 \right]}{N} + \mathcal{O}\!\left( N^{-2} \right) \,. \]

For the squared expectation value, we use the expansion (17) derived in the proof for the bias. This gives that \( \left( \mathbb{E}_q[\ln \hat{Z}] \right)^2 \) is equal to

\[ \left( \ln Z - \frac{1}{2 N Z^2} \mathbb{E}_q\!\left[ (\tilde{w} - \mathbb{E}_q[\tilde{w}])^2 \right] + \mathcal{O}(N^{-2}) \right)^2 = (\ln Z)^2 - \frac{\ln Z}{N Z^2} \, \mathbb{E}_q\!\left[ (\tilde{w} - \mathbb{E}_q[\tilde{w}])^2 \right] + \mathcal{O}\!\left( N^{-2} \right) \,. \]

Subtracting these two expressions, it then follows that

\[ \mathrm{Var}\!\left[ -\beta \hat{F} \right] = \frac{1}{Z^2} \, \frac{\mathbb{E}_q\!\left[ (\tilde{w} - \mathbb{E}_q[\tilde{w}])^2 \right]}{N} + \mathcal{O}\!\left( N^{-2} \right) \,, \]

and the proof concludes by observing that \( Z = \mathbb{E}_q[\tilde{w}] \).

A few remarks are in order: from the theorems, it follows that the standard deviation of the estimator F̂ is of order O(1/√N). In the large-N limit, we can therefore neglect the bias correction as it is of order O(N⁻¹). Furthermore, we can replace the expectation values in the theorems by the sample mean, up to (negligible) higher-order corrections. In practice, we therefore use these results to estimate the variance and bias of F̂. Alternatively, one can use a standard Jackknife analysis to estimate variance and bias (see for example [36]). In our experiments, we use both methods to estimate the errors and check that they lead to consistent results. Lastly, we remark that error estimators for general observables involving the partition function can be derived, see [26].

As mentioned above, the delta method requires that the derivatives of the function h are (uniformly) bounded. For a generic LQFT, this will not be the case for h(x) = ln(x), since its derivatives diverge for x → 0⁺. To the best of our knowledge, the same problem will generically arise in the context of variational inference but seems not to have been discussed in the literature.

To address this subtlety, one could require that the action of the lattice quantum field theory is bounded. For example, this can be ensured by putting the field theory in a box potential. Since only very high-energy configurations are affected by this (for a suitably large choice of the box potential) and since these configurations are extremely unlikely to be sampled, this modification will have no practical effect on the numerical experiments. After this modification, Ẑ is bounded from below and h^{(n)}(Ẑ) is bounded as a result.

More rigorously, the result for the variance can be derived without assumptions on a bound for the derivatives by using the delta method for in-law approximation, which takes the following form.
Theorem. Let \( \hat{X}_N = \frac{1}{N} \sum_{i=1}^{N} X_i \) be the sample mean of independent and identically distributed random variables X_i with \( \mathbb{E}[X_i^{k}] < \infty \) for k ∈ {1, 2}. Let h be a function differentiable at \( \mu = \mathbb{E}_q[X] \). Then

\[ \sqrt{N}\left( h(\hat{X}_N) - h(\mu) \right) \;\xrightarrow{\;\mathcal{D}\;}\; \mathcal{N}\!\left( 0, \sigma^2(h) \right) \,, \qquad \text{where} \quad \sigma^2(h) = h'(\mu)^2 \, \mathrm{Var}(X) \,. \]

For a proof, we again refer to [37], see Theorem 5.3.3. Applying this theorem to the free energy estimator −βF̂ = ln Ẑ, we obtain the same expression for its variance as derived above. However, the theorem does not require any bound on the derivatives of h(x) = ln(x).
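In code, the leading-order variance and bias follow directly from the sample moments of w̃; the sketch below (NumPy) works with ln w̃ for stability and returns the estimate of F̂ together with its delta-method standard deviation and bias.

    import numpy as np

    def free_energy_with_errors(log_w_tilde, temperature):
        """F_hat = -T ln Z_hat with Z_hat = mean(w_tilde), plus the delta-method
        results: Var[ln Z_hat] ~ Var(w_tilde)/(N Z^2) and
        B[ln Z_hat] ~ -Var(w_tilde)/(2 N Z^2), to leading order in 1/N."""
        log_w = np.asarray(log_w_tilde)
        n = len(log_w)
        shift = log_w.max()
        w = np.exp(log_w - shift)            # rescaled weights; the shift cancels in rel_var
        z_hat = w.mean()
        rel_var = w.var() / z_hat**2         # estimates Var(w_tilde)/Z^2
        F_hat = -temperature * (np.log(z_hat) + shift)
        F_err = temperature * np.sqrt(rel_var / n)
        F_bias = temperature * rel_var / (2.0 * n)   # bias of F_hat = -T ln Z_hat
        return F_hat, F_err, F_bias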
Step Size Analysis

As explained in the main text, the free energy difference ΔF^e_b is calculated in steps,

\[ \Delta F^e_b = \Delta F^{e}_{i_K} + \Delta F^{i_K}_{i_{K-1}} + \dots + \Delta F^{i_1}_b \,. \]

In this appendix, we will analyze the dependency of our results on the chosen steps.

We start from an initial step size corresponding to a change in hopping parameter κ of δκ = 0.05. Between κ = 0.2 and κ = 0.3, we however take a finer step size of δκ = 0.01. Since we are interested in the free energy difference ΔF^e_b between κ_b = 0 and κ_e = 0.3, this corresponds to running K = 14 Markov chains. We focus on the 16 × 8 lattice with 400k configurations for each chain. The overrelaxation is performed every 10 steps.

We then repeatedly refine the step size in a certain subregion around the critical κ value. The details can be found in Table I.

FIG. 5. Free energy at κ = 0.3 as successively finer step sizes δκ of the hopping parameter were used. Details on the step sizes δκ are summarized in Table I. For the flow-based method, we use the same number of samples as for the corresponding refined MCMC method. As a result, the error of the flow's estimate also decreases.
FIG. 6. Free energy at κ = 0.3 and λ = 0.022 using both Z_i/Z_{i+1} (down) and Z_{i+1}/Z_i (up). We use the same setup (i.e. number of steps, hopping parameter change δκ, etc.) as for Figure 2.

The results of this analysis are shown in Figure 5. We observe that the error of the estimator does not significantly decrease. We note that the error of the flow decreases during refinement because its free energy estimation uses the same number of samples as all Markov chains combined (and this number increases with the additional refinement steps).

In order to ensure that the distributions p_i and p_{i+1} in

\[ \mathbb{E}_{p_i}\!\left[ \frac{\exp(-S_{i+1})}{\exp(-S_i)} \right] = \frac{1}{Z_i} \int \mathcal{D}[\phi] \, e^{-S_i(\phi)} \, \frac{e^{-S_{i+1}(\phi)}}{e^{-S_i(\phi)}} = \frac{Z_{i+1}}{Z_i} \]

have sufficient overlap, we also estimate Z_i/Z_{i+1} by exchanging p_i with p_{i+1} in the relation above. We then check that this leads to compatible results, see Figure 6. We note that this consistency check is relatively cheap as it requires running one additional Markov chain.

We also study the dependence of the integrated autocorrelation of the free energy on this refinement procedure, see Figure 4.

TABLE I. Details on the refinement analysis. In each refinement stage, we take smaller steps δκ in a certain subregion of the hopping parameter κ trajectory (see last column). The step size taken in this region is shown in the first column. Outside of the most refined region, the same step sizes as in the previous refinement stage are taken. Since each chain is taken to be of the same length, the total number of samples (third column) grows proportionally to the number of chains (second column).

    δκ        # chains    # samples    κ region
    0.01         14          5.6 M     0.20-0.30
    0.005        24          9.6 M     0.20-0.30
    0.001        76         30.4 M     0.24-0.30

Details on Numerical Experiments
FIG. 7. Free energy estimation with error analysis by both the Jackknife and the delta method. Both lead to compatible results. We use the same data as in Figure 2.
HMC:
We use an HMC algorithm with overrelaxation. Each Markov chain has 5k thermalization steps followed by 400k estimation steps. The sign of the field configuration is flipped every ten steps.
Training of flow:
For every lattice, we use a normalizing flow with six coupling layers. Each coupling layer (5) has a neural network m with five fully-connected layers with no bias and tanh non-linearities. The hidden layers of m consist of 1000 neurons each. We train the flow for 1M steps using an 8k mini-batch. We use the ReduceLROnPlateau learning rate scheduler of PyTorch with an initial learning rate of 5 × 10⁻… and a patience of 3k steps. The minimum learning rate was set to 1 × 10⁻….
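The scheduler mentioned here is part of the standard PyTorch API; the snippet below is a sketch of such a setup with hypothetical helpers (build_flow, train_step, compute_action), an assumed choice of optimizer (Adam), and placeholder learning-rate values.

    import torch

    # Hypothetical helpers: build_flow(), train_step(), compute_action() stand in for
    # the actual model, loss, and action code; the learning rates below are placeholders.
    flow = build_flow(n_coupling_layers=6)
    optimizer = torch.optim.Adam(flow.parameters(), lr=5e-4)        # optimizer choice assumed
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, patience=3000, min_lr=1e-6)                      # placeholder min_lr

    for step in range(1_000_000):            # 1M training steps
        loss, var_C = train_step(flow, compute_action, optimizer, batch_size=8192)
        scheduler.step(loss)                 # reduce the rate when the loss plateaus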
Estimation:

As described in the main text, for HMC-based estimation we use a step size of δκ = 0.01 for κ ∈ [0.2, 0.3] and a step size of δκ = 0.05 for all other values of the hopping parameter. As a result, 14 Markov chains are run. In total, the HMC-based method therefore uses 14 × 400k = 5.6 M configurations. We use the same number of samples for the flow-based estimation. For efficiency, we sample these configurations in mini-batches of 3k samples.

Error estimation: We use both the uwerr [32] and the Jackknife method to estimate uncertainties for HMC. In order to deal with autocorrelation for the Jackknife, we perform binning with a 1k bin size. Error estimation for the flow is performed by the delta method and also by the Jackknife, see Figure 7.

From the free energy estimates, one can then derive other thermodynamic observables such as the entropy. We refer to the main text for a discussion of this. Figure 8 shows the estimation of the entropy. Errors were estimated using both the Jackknife and the uwerr method. Both error analysis methods lead to consistent results.
FIG. 8. Entropy density H/|Λ| for the different lattices, comparing the flow-based estimate with HMC; HMC errors are computed with both the uwerr and the Jackknife method.