# Quantum field-theoretic machine learning


Dimitrios Bachtis,1,∗ Gert Aarts,2,3,† and Biagio Lucini1,4,‡

1 Department of Mathematics, Swansea University, Bay Campus, SA1 8EN, Swansea, Wales, UK
2 Department of Physics, Swansea University, Singleton Campus, SA2 8PP, Swansea, Wales, UK
3 European Centre for Theoretical Studies in Nuclear Physics and Related Areas (ECT*) & Fondazione Bruno Kessler, Strada delle Tabarelle 286, 38123 Villazzano (TN), Italy
4 Swansea Academy of Advanced Computing, Swansea University, Bay Campus, SA1 8EN, Swansea, Wales, UK

(Dated: February 18, 2021)

We derive machine learning algorithms from discretized Euclidean field theories, making inference and learning possible within dynamics described by quantum field theory. Specifically, we demonstrate that the φ⁴ scalar field theory satisfies the Hammersley-Clifford theorem, therefore recasting it as a machine learning algorithm within the mathematically rigorous framework of Markov random fields. We illustrate the concepts by minimizing an asymmetric distance between the probability distribution of the φ⁴ theory and that of target distributions, by quantifying the overlap of statistical ensembles between probability distributions, and through reweighting to complex-valued actions with longer-range interactions. Neural network architectures are additionally derived from the φ⁴ theory which can be viewed as generalizations of conventional neural networks, and applications are presented. We conclude by discussing how the proposal opens up a new research avenue, that of developing a mathematical and computational framework of machine learning within quantum field theory.

## I. INTRODUCTION

Relativistic quantum fields [1] are formulated on Minkowski space, where intricate mathematical problems related to the hyperbolic geometry emerge. By recasting Minkowski space as Euclidean, significant simplifications can be obtained for certain cases: the hyperbolic problems are transformed to be elliptic, the Poincaré group becomes the Euclidean group where a positive definite scalar product emerges, noncommuting operators are expressed as random variables, and causality is formulated as a Markov property.

Of high importance is the reverse direction: that of arriving at a quantum field in Minkowski space by constructing it from one in Euclidean space. To make such prospects attainable a rigorous mathematical framework for quantum fields had to be established, and a series of relevant contributions led to advances known as constructive quantum field theory [2–4]. A connection between probability theory and quantum field theory was then established when quantum fields were constructed from Euclidean fields that satisfy Markov properties [5, 6].

Recently, applications of deep learning [7], a class of machine learning algorithms which are able to hierarchically extract abstract features in data, have emerged in the physical sciences [8], including in lattice field theories [9–16] and in the study of phase transitions [17–23]. Insights on machine learning algorithms have been obtained from the perspective of statistical physics [24–32], particularly within the theory of spin glasses [33], or in relation to Gaussian processes [34–38].
∗ [email protected]
† [email protected]
‡ [email protected]

A notable case of these algorithms is the framework of Markov random fields [39], which introduces Markov properties on a graph-based representation to encode probability distributions over high-dimensional spaces. As quantum field theory and probability theory are evidently connected analytically [6], and computational investigations of quantum fields are feasible through the framework of lattice field theory [40], a new challenge is anticipated to emerge: namely, that of investigating machine learning from the perspective of quantum fields.

In this manuscript, we derive machine learning algorithms from discretized Euclidean field theories, making inference and learning possible within dynamics described by quantum field theory. From the mathematical point of view, we explore if the φ⁴ scalar field theory on a square lattice satisfies the Hammersley-Clifford theorem, therefore recasting it as a Markov random field which can complete machine learning tasks. From the equivalent perspective of physics, we treat the φ⁴ scalar field theory as a system with inhomogeneous coupling constants and we search, based on its dynamics, which comprise local interactions, for the optimal values of the coupling constants that are able to complete a machine learning task. Specifically, we consider the minimization of an asymmetric distance between the probability distribution of the φ⁴ theory and that of target distributions. We also quantify the overlap of statistical ensembles between probability distributions and investigate if reweighting to the parameter space of complex-valued actions with longer-range interactions is possible by utilizing instead the probability distribution of the approximating local inhomogeneous action.

We then proceed to derive neural network architectures from the φ⁴ scalar field theory which can progressively extract features of increased abstraction in data.
We explore the implications of including a local symmetry-breaking term in the φ⁴ Markov random field, and rearrange the lattice topology to derive a φ⁴ neural network which can be viewed as a generalization of conventional neural network architectures. Based on the equivalence between the φ⁴ scalar field theory and the Ising model under a certain limit, we discuss how the φ⁴ neural network can provide novel physical insights into the interpretability of a notable class of machine learning algorithms. Finally, we conclude by discussing how the introduction of φ⁴ machine learning algorithms opens up a new research avenue, that of developing, computationally and analytically, a framework of machine learning within quantum field theory.

## II. THE φ⁴ SCALAR FIELD THEORY AS A MARKOV RANDOM FIELD

Let Λ be a finite set whose points represent the sites of a physical model, and let Λ have an additional structure; for instance, consider that the spacing between the sites might be known and that the sites are connected. We now consider that the points of Λ lie on the vertices of a finite graph G = (Λ, e), where e is the set of edges on G. If i, j ∈ Λ and there exists an edge between i and j, then i and j are called neighbours, and the set of all neighbours of a considered point i will be denoted by N_i. A clique is a subset of Λ where the points are pairwise connected, and a clique is called maximal if no additional point can be included such that the resulting set is still a clique. We will denote a maximal clique as c and the set of all maximal cliques as C. For an illustration of the concepts see Fig. 1, and for rigorous results see Refs. [39, 41].

In addition, we associate to each point i ∈ Λ a random variable φ_i, i ∈ Λ, and we will call φ = {φ_i} a state or configuration of the system. Given a graph G = (Λ, e), the set of random variables defines a Markov random field if the associated probability distribution p fulfills the local Markov property with respect to G. The local Markov property denotes that a variable φ_i is conditionally independent of all other variables given its neighbors N_i, i.e.:

p(φ_i | (φ_j)_{j ∈ Λ−i}) = p(φ_i | (φ_j)_{j ∈ N_i}).   (1)

A probability distribution is then related with the events generated by a Markov random field through the Hammersley-Clifford theorem [39]:

**Theorem 1 (Hammersley-Clifford).**

A strictly positive distribution p satisfies the local Markov property of an undirected graph G if and only if p can be represented as a product of nonnegative potential functions ψ_c over G, one per maximal clique c ∈ C, i.e.,

p(φ) = (1/Z) ∏_{c ∈ C} ψ_c(φ),   (2)

where Z = ∫_φ ∏_{c ∈ C} ψ_c(φ) dφ is the partition function and φ are all possible states of the system.

FIG. 1. (a) A bipartite graph, in which the maximal cliques correspond to the pairs of sites associated with the random variables {φ_i, h_j}. (b) A square lattice, in which the maximal cliques correspond to the nearest-neighbor pairs of random variables {φ_i, φ_j}.

We will demonstrate that the φ⁴ scalar field theory satisfies the Hammersley-Clifford theorem and is therefore a Markov random field. The two-dimensional φ⁴ theory is described by the Euclidean Lagrangian:

L_E = (1/2)(∇φ)² + (μ²/2) φ² + (λ/4) φ⁴,   (3)

where the action that regularizes the continuum theory on a square lattice is:

S_E = −κ_L ∑_⟨ij⟩ φ_i φ_j + ((μ_L² + 4κ_L)/2) ∑_i φ_i² + (λ_L/4) ∑_i φ_i⁴.   (4)

The quantities κ_L, μ_L², λ_L are dimensionless parameters, one of which is redundant and can be absorbed by rescaling the fields [42]. Nevertheless, consider the set of variables w = κ_L, a = (μ_L² + 4κ_L)/2, b = λ_L/4, which we promote to link- and site-dependent coupling constants to obtain the inhomogeneous action:

S(φ; θ) = −∑_⟨ij⟩ w_ij φ_i φ_j + ∑_i a_i φ_i² + ∑_i b_i φ_i⁴,   (5)

where the set of coupling constants is θ = {w_ij, a_i, b_i}, and the associated Boltzmann probability distribution is:

p(φ; θ) = exp[−S(φ; θ)] / ∫_φ exp[−S(φ; θ)] dφ.   (6)

The φ⁴ scalar field theory is formulated on a graph G = (Λ, e), where Λ is the set of lattice sites and e the set of edges or pairwise interactions. For a square lattice only nearest neighbors define a maximal clique (see Fig. 1).
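To make Eqs. (4)–(6) concrete, the inhomogeneous action and its unnormalized Boltzmann weight can be evaluated numerically. The sketch below is illustrative only and not part of the original derivation; it assumes NumPy, periodic boundary conditions, and a coupling array `w` storing one value per forward lattice edge — implementation choices of ours, not prescriptions of the text.

```python
import numpy as np

def phi4_action(phi, w, a, b):
    """Inhomogeneous phi^4 lattice action of Eq. (5) on a periodic square lattice.

    phi  : (L, L) array, one field value per site
    w    : (2, L, L) array, couplings w_ij for the two forward edges (x and y)
    a, b : (L, L) arrays, site-dependent couplings of the phi^2 and phi^4 terms
    """
    # Each nearest-neighbor pair <ij> is visited exactly once via forward shifts.
    nn = (w[0] * phi * np.roll(phi, -1, axis=0)
          + w[1] * phi * np.roll(phi, -1, axis=1))
    return -nn.sum() + (a * phi**2).sum() + (b * phi**4).sum()

def boltzmann_weight(phi, w, a, b):
    """Unnormalized weight exp[-S(phi; theta)] entering Eq. (6)."""
    return np.exp(-phi4_action(phi, w, a, b))
```

For homogeneous couplings w_ij = κ_L, a_i = (μ_L² + 4κ_L)/2 and b_i = λ_L/4, the function reduces to the action S_E of Eq. (4).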
Since we search for arbitrary, strictly positive potential functions ψ_c per maximal clique c ∈ C, we can multiply ψ_c with nonnegative functions of subsets of c [43], i.e. with functions of one-site cliques. We then arrive, after considering the imposed boundary conditions, at a nonunique choice of potential function:

ψ_c = exp[ w_ij φ_i φ_j − (1/4)( a_i φ_i² + a_j φ_j² + b_i φ_i⁴ + b_j φ_j⁴ ) ],   (7)

where i, j are nearest neighbors; the factor 1/4 accounts for each site of the square lattice belonging to four maximal cliques, so that the product over cliques reproduces exp[−S(φ; θ)]. As the potential functions ψ_c are nonnegative, the quantity ln ψ_c can be defined, and the probability distribution p(φ; θ) can be factorized as:

p(φ; θ) = exp[∑_{c ∈ C} ln ψ_c(φ)] / ∫_φ exp[∑_{c ∈ C} ln ψ_c(φ)] dφ = (1/Z) ∏_{c ∈ C} ψ_c(φ).   (8)

To summarize, the discretized φ⁴ scalar field theory satisfies the Hammersley-Clifford theorem and the local Markov property and is therefore a Markov random field. To understand intuitively the meaning of the local Markov property, consider the more familiar case satisfied by a Markov chain, P(φ^(k+1) | φ^(k), ..., φ^(1)) = P(φ^(k+1) | φ^(k)). This property declares that, given a certain state φ^(k), a future state φ^(k+1) depends only on the current state φ^(k), and not on states that preceded it, such as φ^(k−1). The local Markov property of Eq. 1 extends this concept to higher dimensions by giving it a spatial representation via a Markov random field. For the case of the φ⁴ scalar field theory the variational parameters θ are the coupling constants θ = {w_ij, a_i, b_i}. By considering that the probability p(φ; θ) of the Markov random field depends on the parameters θ, a variety of machine learning tasks can then be completed.

## III. MACHINE LEARNING WITH THE φ⁴ SCALAR FIELD THEORY

### A. Learning without predefined data

Consider a target probability distribution q(φ) of an arbitrary statistical system. An asymmetric measure of the distance between the two probability distributions p(φ; θ) and q(φ) can be defined, which is called the Kullback-Leibler divergence [39]:

KL(p || q) = ∫_{−∞}^{∞} p(φ; θ) ln [ p(φ; θ) / q(φ) ] dφ ≥ 0.   (9)

The Kullback-Leibler divergence is nonnegative and equal to zero when the two probability distributions exactly match one another. We emphasize that the Kullback-Leibler divergence is not symmetric; the symmetrized quantity KL(p || q) + KL(q || p) restores symmetry, but neither satisfies the triangle inequality, so it cannot be classified as a proper distance. The Kullback-Leibler divergence will be called an asymmetric distance to retain the intuitive picture that it establishes a measure of the difference between two probability distributions.

By searching for an optimal set of coupling constants θ = {w_ij, a_i, b_i} we can minimize the Kullback-Leibler divergence so that the probability distribution of the φ⁴ scalar field theory p(φ; θ) will converge to the target probability distribution q(φ). Once minimization is conducted, a Markov chain Monte Carlo simulation can be initiated for p(φ; θ) to draw samples that would be representative of the target distribution q(φ). Let us consider the case where the target probability distribution q(φ) is that of an arbitrary statistical system with partition function Z_A and it has a Boltzmann form q(φ) = exp[−A]/Z_A. Any additional parameter, such as the inverse temperature, is absorbed within the Hamiltonian or lattice action A. By substituting q(φ) and p(φ; θ) in Eq. 9 we arrive at:

−ln Z_A ≤ ⟨A − S⟩_{p(φ;θ)} − ln Z.   (10)

By considering that the terms F_A = −ln Z_A and F = −ln Z are the free energies of the two systems, the above equation can be equivalently expressed as:

F_A ≤ ⟨A − S⟩_{p(φ;θ)} + F ≡ 𝓕,   (11)

where 𝓕 is the variational free energy. As a result, Eq. 11 sets a rigorous upper bound to the calculation of the free energy F_A of the target system, and this bound 𝓕 depends on calculations conducted entirely on the distribution p(φ; θ) of the φ⁴ Markov random field. This indicates that one can map an arbitrary system to a φ⁴ scalar field theory by minimizing an asymmetric distance between the probability distributions of the two systems. A gradient-based approach can then be implemented to minimize the variational free energy 𝓕 via its derivatives in terms of the parameters θ:

∂𝓕/∂θ_i = ⟨A⟩⟨∂S/∂θ_i⟩ − ⟨A ∂S/∂θ_i⟩ + ⟨S ∂S/∂θ_i⟩ − ⟨S⟩⟨∂S/∂θ_i⟩,   (12)

where all expectation values are calculated under the probability distribution p(φ; θ) of the φ⁴ scalar field theory. Derivations can be found in Appendix A. The variational parameters are then updated at each epoch t of the minimization process through:

θ^(t+1) = θ^(t) − η L,   (13)

where η is the learning rate and L = ∂𝓕/∂θ^(t). After the minimization process we anticipate that 𝓕 ≈ F_A and, as a result, p(φ; θ) ≈ q(φ).

To illustrate the approach we consider as a target system a φ⁴ lattice action A with longer-range interactions and complex-valued coupling constants, defined as:

A = ∑_{k=1}^{5} g_k A^(k) = g₁ ∑_{⟨ij⟩nn} φ_i φ_j + g₂ ∑_i φ_i² + g₃ ∑_i φ_i⁴ + g₄ ∑_{⟨ij⟩nnn} φ_i φ_j + i g₅ ∑_i φ_i².   (14)

The notation nn and nnn denotes nearest-neighbor and next-nearest-neighbor interactions, and the lattice action is complex due to the i g₅ A^(5) term. The combination of the g₂ and g₅ parameters introduces a complex coupling constant in the mass term.
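The gradient of Eq. (12) involves only expectation values over samples drawn from p(φ; θ), so a single update step of Eq. (13) can be sketched in a few lines. This is a schematic illustration under assumed array shapes (the Markov chain Monte Carlo sampling itself is not shown); it is not the authors' actual implementation.

```python
import numpy as np

def free_energy_gradient(A_vals, S_vals, dS_vals):
    """Monte Carlo estimate of dF/dtheta_i, Eq. (12).

    A_vals : (N,) target action A on N configurations sampled from p(phi; theta)
    S_vals : (N,) inhomogeneous action S on the same configurations
    dS_vals: (N, P) derivatives dS/dtheta_i, one column per parameter
    """
    dS_mean = dS_vals.mean(axis=0)
    return (A_vals.mean() * dS_mean
            - (A_vals[:, None] * dS_vals).mean(axis=0)
            + (S_vals[:, None] * dS_vals).mean(axis=0)
            - S_vals.mean() * dS_mean)

def update(theta, grad, eta):
    """One epoch of the update rule of Eq. (13)."""
    return theta - eta * grad
```

Note that when S reproduces A on the sampled configurations the four terms cancel pairwise and the estimated gradient vanishes, as expected at the minimum 𝓕 = F_A.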
The coupling constants have values g₁ = −1, g₃ = 0.175 and g₅ = 0.15.

FIG. 2. Variational parameters θ = {w_ij, a_i, b_i} versus epochs t on a logarithmic scale. The figures depict the evolution of the parameters θ towards the expected values of the coupling constants in the target homogeneous action.

The values for g₁, g₂ and g₃ have been chosen near the critical point of the second-order phase transition of the system with a local homogeneous action, for which g₄ = g₅ = 0. We will present three applications for lattices of size L = 4 in each dimension. Firstly, a proof-of-principle demonstration will be conducted to verify that the inhomogeneous action S (see Eq. 5) can learn the local lattice action A^(1–3) = ∑_{k=1}^{3} g_k A^(k). Secondly, we will discuss that by considering the local lattice action A^(1–3) it is impossible to reweight to the full action A due to insufficient overlap of statistical ensembles, but that there exists an inhomogeneous representation of A^(1–4), equal to S, for which this is possible. Finally, we will demonstrate that S can approximate A sufficiently to simultaneously extrapolate observables in the parameter space of the complex action A along the trajectory of a considered coupling constant, and we will discuss how to successfully define the allowed reweighting range.

We now initialize the φ⁴ Markov random field with inhomogeneous coupling constants θ which are randomly drawn from a Gaussian distribution, and consider as a target system in Eq. 10 the local lattice action A^(1–3). We anticipate that the optimal solution is the one where the inhomogeneous coupling constants θ of the φ⁴ Markov random field converge to the homogeneous constants g₁, g₂ and g₃ of the target φ⁴ scalar field theory. The time evolution of the parameters θ is depicted in Fig. 2; details about the simulations and the training process can be found in Appendix B. After training is conducted, the parameters θ have converged to the homogeneous constants of the target system with high precision for all cases. It then becomes clear that, given sufficient training time, the two systems become identical.

The overlap of statistical ensembles can be quantified through the Kullback-Leibler divergence.
We consider the probability distribution p(φ; θ), described by the local inhomogeneous action S, and we minimize the Kullback-Leibler divergence to approximate the target distribution of the action A^(1–4), which is denoted as q(φ). In addition, we simultaneously estimate the Kullback-Leibler divergence between the distributions of A^(1–3) and A^(1–4) to quantify their overlap of statistical ensembles. The results are depicted in Fig. 3, where it is evident that the local inhomogeneous action S produces a probability distribution which approximates A^(1–4) far better than the probability distribution of A^(1–3) does.

FIG. 3. Estimated Kullback-Leibler divergence versus epoch t on a logarithmic scale. The probability distributions of actions A^(1–3) and S are compared with that of A^(1–4). Only the action S is updated at each epoch based on a finite sample of fixed size. For action A^(1–3), results are depicted based on a finite sample of equal size to allow for a direct comparison of the two quantities at each epoch t.

This tentatively indicates that, while S and A^(1–3) have the same form of lattice action, the inhomogeneity present in the former allows for the construction of richer representations of probability distributions. As a result, histogram reweighting [44] from local inhomogeneous actions to regions of parameter space that are inaccessible to the local homogeneous action might be possible.

We proceed to discuss the precise implications of the equivalence between the approximating distribution p(φ; θ) of action S and the target distribution q(φ) of action A^(1–4). The definition of the expectation value ⟨O⟩_P of an arbitrary observable O in a system that has some equilibrium occupation probabilities P is:

⟨O⟩_P = ∑_φ O_φ P(φ),   (16)

where the sum is over all possible states φ of the system. After the Kullback-Leibler divergence between the distributions p(φ; θ) and q(φ) is minimized we anticipate that:

p(φ; θ) ≈ q(φ),   (17)

which immediately implies, based on Eq. 16, that:

⟨O⟩_{p(φ;θ)} ≈ ⟨O⟩_{q(φ)}.   (18)

To clarify further: observables, such as the lattice action A^(1–4), should yield approximately equal values when calculated from samples drawn from either distribution p(φ; θ) or q(φ), even though the two distributions have different actions S and A^(1–4), respectively. To express these ideas in a more formal manner, we now consider the expectation value of an arbitrary observable as obtained during a Monte Carlo simulation (e.g. see Refs. [20, 21]) in the target system with action A^(1–4):

⟨O⟩_{q(φ)} = ∑_{l=1}^{N} p̃_l^{−1} O_l exp[−∑_{k=1}^{4} g_k A_l^(k)] / ∑_{l=1}^{N} p̃_l^{−1} exp[−∑_{k=1}^{4} g_k A_l^(k)],   (19)

where p̃ are the probabilities used to sample from the equilibrium distribution and N is the number of samples that we have obtained during the Monte Carlo simulation. There are two fundamentally different ways to proceed in calculating the expectation value of the above equation by relying instead on the approximating probability distribution p(φ; θ).

The first is to draw a subset of samples from p(φ; θ) and then conjecture, based on Eq. 17, that these N samples have been produced instead by the distribution q(φ). This is equivalent to considering p̃ = q(φ) in Eq. 19, but a systematic error would be introduced, governed by the accuracy with which the probability distribution p(φ; θ) approximates q(φ). The second approach again relies on drawing a subset of samples from the distribution p(φ; θ), but this time we will consider that p(φ; θ) ≠ q(φ) and that the samples have been produced directly from p(φ; θ) of Eq. 6 with action S. This is equivalent to conducting a reweighting step to arrive from distribution p(φ; θ) at q(φ) under a sufficient overlap of ensembles. We anticipate that this reweighting step is possible due to the minimization of the Kullback-Leibler divergence between the two distributions p(φ; θ) and q(φ) and their approximate equivalence.

We will follow the second approach and implement a reweighting technique, details of which can be found in Appendix C, to simultaneously extrapolate observables in the parameter space of the full action A, which includes complex couplings and longer-range interactions:

⟨O⟩ = ∑_{l=1}^{N} O_l exp[S_l − g'_j A_l^(j) − ∑_{k=1,k≠j}^{5} g_k A_l^(k)] / ∑_{l=1}^{N} exp[S_l − g'_j A_l^(j) − ∑_{k=1,k≠j}^{5} g_k A_l^(k)].   (20)

FIG. 4. Real part of the complex lattice action A versus coupling constant g'₄. The results are obtained by reweighting from the Markov random field distribution p to the distribution of the complex action A. The statistical errors are comparable with the width of the line. The results are compared with Monte Carlo (MC) and reweighting from the distribution of the real action A^(1–4) to A.

FIG. 5. Real part of the magnetization m versus coupling constant g'₄. The results are obtained by reweighting from the Markov random field distribution p to the target distribution of the complex action A. The associated statistical errors are depicted by the dashed lines. The results are compared with Monte Carlo (MC) and reweighting from the distribution of the real action A^(1–4) to A.

The equation above can be interpreted as two distinct simultaneous reweighting steps. Firstly, the probability distribution p(φ; θ) of the φ⁴ Markov random field with action S is reweighted to the distribution q(φ) with action A^(1–4), but with a shifted coupling constant g'_j.
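As an illustration of Eq. (20), the reweighting estimator can be written compactly once the per-configuration values of S and of the individual action terms A^(k) are stored. The sketch below assumes the couplings are passed as complex numbers (so that the imaginary g₅ term simply enters as i·g₅); the function and variable names are our own, not from the original text.

```python
import numpy as np

def reweighted_expectation(O, S, A_terms, g, j, g_shift):
    """Reweighting from p(phi; theta) to the full action A, Eq. (20).

    O       : (N,) observable on configurations sampled from p(phi; theta)
    S       : (N,) inhomogeneous action per configuration
    A_terms : (K, N) the individual action terms A^(k) per configuration
    g       : (K,) target couplings; complex entries encode imaginary terms
    j       : index of the coupling that is shifted
    g_shift : the shifted value g'_j
    """
    g_eff = np.asarray(g, dtype=complex).copy()
    g_eff[j] = g_shift
    # exponent_l = S_l - g'_j A^(j)_l - sum_{k != j} g_k A^(k)_l
    weights = np.exp(S - g_eff @ A_terms)
    return (O * weights).sum() / weights.sum()
```

A quick consistency check: when S coincides with the target exponent and g'_j is left unshifted, every weight equals one and the estimator reduces to the sample mean of O.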
This acts as a correction step to ensure that the proper distribution is reached from p(φ; θ), and it additionally allows an extrapolation along the direction of the parameter space described by the coupling g'_j. Secondly, there is a reweighting step to reach the distribution described by the complex lattice action A, which includes the imaginary part i g₅ A^(5). Any arbitrary observable can be reweighted in parameter space, such as machine-learning-derived observables [21], and Hamiltonian-agnostic reweighting [20] could additionally be explored.

We consider that j = 4 and we extrapolate observables along the trajectory of the g'₄ coupling constant for a continuous range of negative values of g'₄. The φ⁴ Markov random field was trained to approximate the action A^(1–4), where g₄ = −1. Results for the magnetization and the internal energy, obtained with reweighting from the probability distribution p(φ; θ) to the full action A, are depicted in Figs. 4 and 5. The results are compared with Monte Carlo simulations conducted on action A^(1–4), which are combined with reweighting to the full complex distribution to allow for a comparison with the ones from p(φ; θ). It is evident that the results depicted agree within statistical errors with the Monte Carlo extrapolations. Details about the statistical error analysis can be found in Appendix D.

When reweighting is implemented to extrapolate to the probability distribution of a complex action, or as a correction step in the case of an approximating distribution, the question of how to strictly define the reweighting range emerges. This can be achieved, formally, through the calculation of weight functions which are dependent on the underlying histograms. Specifically, we consider as an example in Eq. 20 the expectation value of the action S. In addition, instead of expressing Eq. 20 as a sum over each action S_l calculated on a configuration φ, we reformulate it in terms of each uniquely sampled action S in the Monte Carlo dataset after the construction of histograms. The expectation value is then:

⟨S⟩ = ∑_S S W(S),   (21)

where the sum is over uniquely sampled actions S and W(S) is a weight function which is equal to:

W(S) = ∑_{ℜ[A'],ℑ[A']} h(S, ℜ[A'], ℑ[A']) exp[S − ℜ[A'] − iℑ[A']] / ∑_{S,ℜ[A'],ℑ[A']} h(S, ℜ[A'], ℑ[A']) exp[S − ℜ[A'] − iℑ[A']],   (22)

where A' = g'_j A^(j) + ∑_{k=1,k≠j}^{5} g_k A^(k).
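A minimal version of Eqs. (21)–(22) can be obtained by letting the multi-dimensional histogram h hold one entry per sampled configuration, in which case the sums over histogram bins reduce to sums over samples. The sketch below makes that simplifying assumption (a real implementation would bin the actions); it is an illustration, not the authors' code.

```python
import numpy as np

def weight_function(S, A_prime):
    """Weight W(S) of Eq. (22) for each uniquely sampled action S.

    S       : (N,) inhomogeneous action per configuration
    A_prime : (N,) complex values A' = g'_j A^(j) + sum_{k != j} g_k A^(k)
    Returns (unique actions, normalized complex weights).
    """
    contrib = np.exp(S - A_prime)       # one histogram entry per configuration
    S_unique = np.unique(S)
    W = np.array([contrib[S == s].sum() for s in S_unique])
    return S_unique, W / contrib.sum()

def action_expectation(S, A_prime):
    """<S> of Eq. (21) as a sum over uniquely sampled actions."""
    S_u, W = weight_function(S, A_prime)
    return (S_u * W).sum()
```

Plotting the real part of W(S) for a sequence of shifted couplings g'_j, as in Fig. 6, then exposes directly whether adjacent weight functions overlap.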
The quantity h(S, ℜ[A'], ℑ[A']) is a multi-dimensional histogram of the inhomogeneous action S as well as of each action term towards which we are interested in extrapolating during reweighting. Reweighting can be achieved either by including novel terms in the action or by shifting the corresponding coupling constant if the term already exists. Of particular interest is also the quantity W'(S), in which the exponentials are set equal to one and which is proportional to the actual histograms of the action in the corresponding Monte Carlo dataset. This quantity can additionally serve as an indication of the reweighting range.

We proceed to calculate the weight functions W(S) for each uniquely sampled action S in a considered extrapolation range. The results are depicted in Fig. 6, where an overlap between distinct weight functions that are adjacent in parameter space to the coupling constant g₄ = −1 is observed, and the allowed reweighting range is thereby successfully predicted.

FIG. 6. Real part of the weight function W(S) versus lattice action S for considered coupling constants g'₄. The results are obtained by reweighting from the action S to the complex action A, which includes longer-range interactions.

We emphasize that reweighting from the local homogeneous action A^(1–3) to the full action A is not possible: the inclusion of an imaginary term and a longer-range interaction does not produce a sufficient overlap of ensembles. Results are depicted in Fig. 7. We recall that the local homogeneous action A^(1–3) has coupling constant g₄ = 0, and the target distribution of action A^(1–4) includes a term with coupling constant g'₄ = −1.0. It is clear that the values of the lattice action lie at an entirely different scale, and inconsistencies begin to emerge as g'₄ is shifted away from zero when reweighting from A^(1–3). However, the local inhomogeneous action S is able to achieve reweighting to the full distribution of the action A. Consequently, the opportunity to map improved lattice actions, which include longer-range interactions, to local inhomogeneous actions is a prospect that is open to explore. This can be achieved by minimizing the asymmetric distance between their associated probability distributions.

### B. Learning with predefined data

The preceding results do not require any predefined data to be used as input within the training process, since configurations were obtained during the gradient-based approach.

FIG. 7. Real part of the weight function W(A^(1–3)) versus lattice action A^(1–3) for considered coupling constants g'₄. The results are obtained by reweighting from action A^(1–3) to the complex action A, which includes longer-range interactions.

FIG. 8. Probability density function versus lattice value φ_i for a Euclidean action S that is Z₂ invariant and an action S_b which includes a local symmetry-breaking term.

However, there exist cases where one has already obtained a set of available data, which could comprise configurations of a system, experimental data, or a set of images, and whose probability distribution is of unknown form. The obtained dataset then explicitly encodes an empirical probability distribution q(φ) that is a representation of the complete probability distribution of the system. The empirical distribution q(φ) can still be learned by minimizing instead the opposite divergence:

KL(q || p) = ∫_{−∞}^{∞} q(φ) ln [ q(φ) / p(φ; θ) ] dφ ≥ 0.   (23)

By expanding the above equation we arrive at:

KL(q || p) = ⟨ln q(φ)⟩_{q(φ)} − ⟨ln p(φ; θ)⟩_{q(φ)}.   (24)

The first right-hand term is constant, and the minimization of KL(q || p) is therefore equivalent to the maximization of the second right-hand term under the training data:

∂ ln p(φ; θ)/∂θ = ⟨∂S/∂θ⟩_{p(φ;θ)} − ∂S/∂θ.   (25)

The variational parameters are now updated according to Eq. 13, where L = −∂ ln p(φ; θ^(t))/∂θ^(t).

To illustrate the concepts we now create a dataset from a Gaussian distribution with a negative mean μ and standard deviation σ, which serves as the target distribution q(φ). The information about the form of q(φ) will not be introduced in Eq. 23, because the training will instead be conducted on the obtained data. To clarify further, the same approach can be established for any obtained dataset, without the need to even infer the underlying form of the distribution.
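Averaged over a training set, the data-driven gradient of Eq. (25) compares the model's expectation of ∂S/∂θ, estimated from configurations sampled from p(φ; θ), with the same quantity evaluated on the data. A schematic training step under assumed array shapes of our own choosing:

```python
import numpy as np

def log_likelihood_gradient(dS_model, dS_data):
    """Gradient of <ln p(phi; theta)> over the training data, from Eq. (25).

    dS_model: (N, P) dS/dtheta on configurations sampled from p(phi; theta)
    dS_data : (M, P) dS/dtheta on the M training configurations
    """
    return dS_model.mean(axis=0) - dS_data.mean(axis=0)

def train_step(theta, dS_model, dS_data, eta):
    """Update via Eq. (13) with L = -d<ln p>/dtheta (ascent on the likelihood)."""
    return theta + eta * log_likelihood_gradient(dS_model, dS_data)
```

When the model statistics of ∂S/∂θ match those of the data, the gradient vanishes and training has converged.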
After successful training, Markov chain Monte Carlo simulations can be implemented based on the distribution p(φ; θ) of the φ⁴ Markov random field to draw samples that would be representative of the unknown target distribution q(φ). Additional details can be found in Appendix A.

We anticipate, due to the invariance under the Z₂ symmetry of the lattice action S, that the mirrored distribution with a mean of opposite sign will also be produced.

FIG. 9. Original image and equilibration of the Markov random field after 1, 10 and 50 steps.

If this feature is not desirable, then a local symmetry-breaking term of the form ∑_i r_i φ_i can be included in the action S to favor configurations that will explicitly reproduce q(φ). The Hammersley-Clifford theorem is still satisfied, and results for the symmetric action S and the action S_b which includes a symmetry-breaking term are depicted in Fig. 8. We observe for the symmetric case that, while the algorithm has been trained on one of the probable solutions, it is able to produce additional solutions that are invariant under the inherent symmetry, whereas this feature has been eliminated for the broken-symmetry case, where the probability distribution q(φ) is explicitly reproduced.

Markov random fields are widely applied to problems in computer vision, image segmentation and compression, as well as image analysis [45]. Every problem that is formulated as an energy or lattice-action minimization problem can be solved by implementing Markov random fields. Since the φ⁴ scalar field theory satisfies the Hammersley-Clifford theorem and is therefore a Markov random field, it can be implemented to complete such tasks. We therefore consider as q(φ) in Eq. 24 the configuration of an image from the CIFAR-10 dataset [46], which we map to the action of the inhomogeneous φ⁴ theory of Eq. 5. In essence, we search for the optimal values of the coupling constants, which describe the local interactions in the φ⁴ scalar field theory, that can reproduce the considered image as a configuration in the equilibrium distribution of the system. In Fig. 9, results are depicted after training the φ⁴ theory. We observe that, by initializing a Markov chain, the configurations of the equilibrium distribution converge to an accurate representation of the original image.

## IV. φ⁴ NEURAL NETWORKS

When the aim of the machine learning task is to study intricate probability distributions, deep learning algorithms that include multiple layers in the neural network architecture can be implemented. These layers progressively transform the data to arrive at increasingly abstract representations, allowing for increased expressivity and representational capacity in the model. Such deep learning algorithms can be constructed from the dynamics of the φ⁴ scalar field theory.

We consider that part of the random variables φᵢ on the lattice sites are visible and correspond to a set of observations, and that the remaining are hidden variables hⱼ, which capture dependencies in a set of training data given as input to the φᵢ. In addition, to make the connection with the computer science literature, we consider a bipartite graph which imposes the restriction that interactions occur exclusively between the φ and the h variables (see Fig. 1). We therefore recast the φ⁴ neural network as a variant of a restricted Boltzmann machine (RBM) [47–50] which is able to model continuous data. Alternative parametrizations of the graph structure are open to explore. A joint probability distribution p(φ,h;θ) is then defined, based on a lattice action S(φ,h;θ):

S(φ,h;θ) = −Σ_{i,j} wᵢⱼφᵢhⱼ + Σᵢ rᵢφᵢ + Σᵢ aᵢφᵢ² + Σᵢ bᵢφᵢ⁴ + Σⱼ sⱼhⱼ + Σⱼ mⱼhⱼ² + Σⱼ nⱼhⱼ⁴,   (26)

which also gives rise to a new expression, based on Eq. 23, for the derivative of the log-likelihood ln p(φ;θ):

∂ln p(φ;θ)/∂θ = ⟨∂S/∂θ⟩_{p(φ,h;θ)} − ⟨∂S/∂θ⟩_{p(h|φ;θ)},   (28)

where the set of variational parameters is now θ = {wᵢⱼ, rᵢ, aᵢ, bᵢ, sⱼ, mⱼ, nⱼ}. The conditional distributions of the visible and the hidden variables factorize as p(φ|h;θ) = Πᵢ p(φᵢ|h) and p(h|φ;θ) = Πⱼ p(hⱼ|φ).
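For concreteness, in the Gaussian-Bernoulli special case of this action (bᵢ = nⱼ = mⱼ = 0 and hⱼ ∈ {−1, 1}) both conditional distributions can be sampled exactly, and a contrastive-divergence update of the kind derived in Appendix A can be sketched as follows. This is an illustrative sketch rather than the authors' code; the quadratic visible couplings aᵢ are held fixed for simplicity, and all function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_h(phi, w, s):
    """Hidden conditional for h_j in {-1, +1}: p(h_j = +1 | phi) is a
    sigmoid of the local field sum_i w_ij phi_i - s_j."""
    p_plus = 1.0 / (1.0 + np.exp(-2.0 * (phi @ w - s)))
    return np.where(rng.random(p_plus.shape) < p_plus, 1.0, -1.0)

def sample_phi(h, w, r, a):
    """Visible conditional in the Gaussian case (b_i = 0): phi_i | h is
    Gaussian with mean (sum_j w_ij h_j - r_i)/(2 a_i), variance 1/(2 a_i)."""
    mean = (h @ w.T - r) / (2.0 * a)
    return mean + rng.normal(size=mean.shape) / np.sqrt(2.0 * a)

def cd1_update(phi0, w, r, s, a, lr=0.01):
    """One CD-1 step on a data batch phi0 (shape: batch x visible)."""
    h0 = sample_h(phi0, w, s)        # positive phase: hidden sample given data
    phi1 = sample_phi(h0, w, r, a)   # reconstruction of the visible units
    h0m = np.tanh(phi0 @ w - s)      # <h_j | phi0> for +-1 hidden units
    h1m = np.tanh(phi1 @ w - s)      # <h_j | phi1>
    batch = len(phi0)
    w += lr * (phi0.T @ h0m - phi1.T @ h1m) / batch  # d(-S)/dw_ij = +phi_i h_j
    r += lr * (phi1 - phi0).mean(axis=0)             # d(-S)/dr_i  = -phi_i
    s += lr * (h1m - h0m).mean(axis=0)               # d(-S)/ds_j  = -h_j
    return w, r, s
```

The signs in the parameter updates follow the CD_k expression of Eq. A25, with the data term evaluated on φ⁽⁰⁾ and the model term on the reconstruction φ⁽¹⁾.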
Derivations can be found in Appendix A. By considering certain values of the parameters in the φ⁴ neural network of Eq. 26, one can arrive at other neural network architectures, all of which are special cases of a φ⁴ Markov random field. For instance, by choosing bᵢ = nⱼ = 0 one obtains a Gaussian-Gaussian RBM [47, 49]. If bᵢ = nⱼ = mⱼ = 0 and hⱼ ∈ {−1, 1}, the architecture is a Gaussian-Bernoulli RBM [47, 49]. Of particular interest could be the choice mⱼ = nⱼ = 0 and hⱼ ∈ {−1, 1}, which reduces to a φ⁴-Bernoulli RBM, a case with a non-linear sigmoid function that, to our knowledge, has not been studied before. We emphasize that the φ⁴-Bernoulli RBM is anticipated to have substantial representational capacity due to the presence of the non-linear sigmoid function in the hidden layer [51].

It is a well-known fact that the φ⁴ scalar field theory of Eq. 4, a model with continuous degrees of freedom, reduces to an Ising model in the limit of fixed κ_L, λ_L → ∞ and μ_L → −∞ [42]. The φ⁴-Bernoulli RBM can then be interpreted as a φ⁴ neural network where certain lattice sites have reached the Ising limit, allowing for novel physical insights. It is important to recall that, with the inclusion of two hidden layers, deep variants of restricted Boltzmann machines are universal approximators of probability distributions [52].

To demonstrate the applicability of the φ⁴ neural network of Eq. 26, we train it on the first forty examples of the Olivetti faces dataset [53], using 4096 visible units and 32 hidden units, to observe if meaningful features are learned. A subset of the learned features, i.e. the coupling constants wᵢⱼ for fixed j, is depicted in Fig. 10. We observe that the neural network has learned hidden features which comprise abstract face shapes and characteristics. The hidden units can then serve as input to a new φ⁴ neural network to progressively extract abstract features in the data [54].

FIG. 10. Example features learned in the hidden layer of the φ⁴ neural network.
V. CONCLUSIONS

In this manuscript we derived machine learning algorithms from discretized Euclidean field theories. Specifically, we demonstrated that the φ⁴ scalar field theory on a square lattice satisfies the Hammersley-Clifford theorem and is therefore a Markov random field that can be used for inference and learning. By recasting the φ⁴ theory within a mathematically rigorous framework, a variety of theorems, as well as training algorithms, become available; an overview can be found in Ref. [39]. As the resulting algorithm has inhomogeneous coupling constants, it can additionally be investigated from the perspective of spin glasses [33], and enhanced sampling can be obtained based on computational techniques from statistical mechanics [55, 56] or model-specific algorithms [57, 58].

The Kullback-Leibler divergence can be utilized to quantify the overlap of statistical ensembles between probability distributions. Specifically, we demonstrated that the φ⁴ scalar field theory with inhomogeneous coupling constants is able to absorb longer-range interactions, and that observables can be reweighted to the parameter space of complex actions using the approximating probability distribution. The prospect of constructing improved lattice actions [59, 60] based on local inhomogeneous representations is open to explore.

In principle, any arbitrary system can be mapped to a φ⁴ scalar field theory with inhomogeneous coupling constants by minimizing an asymmetric distance between their probability distributions based on Eq. 10. The concepts are therefore anticipated to be generally applicable to systems within condensed matter physics, lattice field theories and statistical mechanics. To enhance the accuracy, a variant of a neural network architecture can be implemented which is proven to be a universal approximator of a probability distribution [52].
In the manuscript such variants have been presented as special cases of a φ⁴ neural network.

The resulting φ⁴ machine learning algorithm of Sections II and III retains the topology of the lattice structure and the boundary conditions, but differs from the conventional φ⁴ scalar field theory due to the inhomogeneous coupling constants. To employ the tools of quantum field theory a framework involving the replica method is required, but the theories can still be formulated in terms of the functional integral with an additional averaging over the space of couplings [61]. We note that in our formulation the couplings are inhomogeneous but not random, as they are determined during the minimization process.

We emphasize that the prior arguments concerning the Hammersley-Clifford theorem hold in arbitrary dimensions, and one could therefore construct a d-dimensional Markov random field to initiate analytical or computational investigations. The factorization of a lattice action in terms of products of potential functions, a step required to recast a system as a Markov random field, depends on the topology of the graph structure, and different topologies yield different maximal cliques. An equivalence between the local, pairwise and global Markov properties of a graph structure can also be rigorously proven [39]. Through the construction of quantum fields in Minkowski space from Markov fields in Euclidean space [6], a new research avenue is envisaged, namely that of developing a computational and mathematical framework of machine learning within quantum field theory.

VI. ACKNOWLEDGEMENTS

The authors received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme under grant agreement No 813942. The work of GA and BL has been supported in part by the UKRI Science and Technology Facilities Council (STFC) Consolidated Grant ST/P00055X/1. The work of BL is further supported in part by the Royal Society Wolfson Research Merit Award WM170010 and by the Leverhulme Foundation Research Fellowship RF-2020-461\9. Numerical simulations have been performed on the Swansea SUNBIRD system. This system is part of the Supercomputing Wales project, which is part-funded by the European Regional Development Fund (ERDF) via the Welsh Government. We thank COST Action CA15213 THOR for support.

Appendix A: Derivations

1. φ⁴ Markov Random Field

The Kullback-Leibler divergence, which is repeated here for convenience, defines an asymmetric measure of the distance between the distribution of the machine learning algorithm p(φ;θ) and an unknown target distribution q(φ):

KL(p||q) = ∫ p(φ;θ) ln[p(φ;θ)/q(φ)] dφ ≥ 0.   (A1)

By expanding the above equation we arrive at:

⟨ln p(φ;θ)⟩_{p(φ;θ)} − ⟨ln q(φ)⟩_{p(φ;θ)} ≥ 0,   (A2)

where ⟨·⟩_{p(φ;θ)} denotes the expectation value under the probability distribution p(φ;θ). If the two probability distributions are taken to be of Boltzmann form, p(φ;θ) = exp[−S]/Z and q(φ) = exp[−A]/Z_A, we arrive at:

−⟨ln Z_A⟩_{p(φ;θ)} ≤ ⟨A − S⟩_{p(φ;θ)} − ⟨ln Z⟩_{p(φ;θ)}.   (A3)

The terms ⟨ln Z⟩_{p(φ;θ)} and ⟨ln Z_A⟩_{p(φ;θ)} are constants under the expectation values, and we therefore obtain:

−ln Z_A ≤ ⟨A − S⟩_{p(φ;θ)} − ln Z.   (A4)

Denoting the right-hand side as F, the derivative with respect to a variational parameter θᵢ is:

∂F/∂θᵢ = ∂⟨A⟩_{p(φ;θ)}/∂θᵢ − ∂⟨S⟩_{p(φ;θ)}/∂θᵢ + ∂(−ln Z)/∂θᵢ,   (A5)

where each term is calculated as:

∂⟨A⟩_{p(φ;θ)}/∂θᵢ = −⟨A ∂S/∂θᵢ⟩_{p(φ;θ)} + ⟨A⟩_{p(φ;θ)} ⟨∂S/∂θᵢ⟩_{p(φ;θ)},   (A6)

∂⟨S⟩_{p(φ;θ)}/∂θᵢ = ⟨∂S/∂θᵢ⟩_{p(φ;θ)} − ⟨S ∂S/∂θᵢ⟩_{p(φ;θ)} + ⟨S⟩_{p(φ;θ)} ⟨∂S/∂θᵢ⟩_{p(φ;θ)},   (A7)

∂(−ln Z)/∂θᵢ = ⟨∂S/∂θᵢ⟩_{p(φ;θ)}.   (A8)

By substituting, the terms ⟨∂S/∂θᵢ⟩ arising from Eqs. A7 and A8 cancel and we arrive at:

∂F/∂θᵢ = −⟨A ∂S/∂θᵢ⟩ + ⟨A⟩⟨∂S/∂θᵢ⟩ + ⟨S ∂S/∂θᵢ⟩ − ⟨S⟩⟨∂S/∂θᵢ⟩.   (A9)

A gradient-based approach can be implemented based on the above equation to learn a known target probability distribution.

In the opposite direction, if a set of data is available for which the probability distribution is unknown, the alternative Kullback-Leibler divergence can be considered:

KL(q||p) = ∫ q(φ) ln[q(φ)/p(φ;θ)] dφ.   (A10)

By expanding the right-hand side we arrive at the expression:

KL(q||p) = ⟨ln q(φ)⟩_{q(φ)} − ⟨ln p(φ;θ)⟩_{q(φ)}.   (A11)

Minimizing the Kullback-Leibler divergence is then equivalent to maximizing the term ⟨ln p(φ;θ)⟩_{q(φ)}, which is:

⟨ln p(φ;θ)⟩_{q(φ)} = (1/N) Σₓ ln p(φ⁽ˣ⁾;θ),   (A12)

where x is a training example and N the number of training data. For the case of the Markov random field, the derivative of the log-likelihood is:

∂ln p(φ;θ)/∂θ = ∂(−S(φ;θ))/∂θ − ⟨∂(−S(φ;θ))/∂θ⟩_{p(φ;θ)},

where the term outside the expectation value is calculated on the training examples.

2. φ⁴ Neural Network

The case of the quantum field-theoretic neural network is more complicated due to the joint probability distribution of the visible units φ and the hidden units h:

p(φ,h;θ) = exp[−S(φ,h;θ)] / ∫∫ exp[−S(φ,h;θ)] dφ dh.   (A13)

From the joint probability distribution we can define the marginal probability distributions:

p(φ;θ) = ∫ p(φ,h;θ) dh = ∫ exp[−S(φ,h;θ)] dh / ∫∫ exp[−S(φ,h;θ)] dφ dh,
p(h;θ) = ∫ p(φ,h;θ) dφ = ∫ exp[−S(φ,h;θ)] dφ / ∫∫ exp[−S(φ,h;θ)] dφ dh,

as well as the conditional probability distributions. For the visible units:

p(φ|h;θ) = p(φ,h;θ)/p(h;θ) = exp[−S(φ,h;θ)] / ∫ exp[−S(φ,h;θ)] dφ.   (A14)

All terms that depend only on h cancel between the numerator and the denominator, and the remaining terms factorize over the lattice sites i:

p(φ|h;θ) = Πᵢ exp[φᵢ Σⱼ wᵢⱼhⱼ − rᵢφᵢ − aᵢφᵢ² − bᵢφᵢ⁴] / ∫ exp[φᵢ Σⱼ wᵢⱼhⱼ − rᵢφᵢ − aᵢφᵢ² − bᵢφᵢ⁴] dφᵢ = Πᵢ p(φᵢ|h).   (A18)

Similarly:

p(h|φ;θ) = p(φ,h;θ)/p(φ;θ) = exp[−S(φ,h;θ)] / ∫ exp[−S(φ,h;θ)] dh = Πⱼ p(hⱼ|φ).   (A19)

The gradient of the log-likelihood for the quantum field-theoretic neural network is:

∂ln p(φ;θ)/∂θ = ∂/∂θ [ ln ∫ exp[−S(φ,h;θ)] dh − ln ∫∫ exp[−S(φ,h;θ)] dφ dh ]   (A20)
= ⟨∂(−S(φ,h;θ))/∂θ⟩_{p(h|φ;θ)} − ⟨∂(−S(φ,h;θ))/∂θ⟩_{p(φ,h;θ)}.   (A24)

We approximate the last expression for each parameter θ using contrastive divergence. Specifically, the visible units φ are set equal to a specific training example φ⁽⁰⁾ and, based on the conditional distribution p(h|φ⁽⁰⁾;θ), a set of hidden units h⁽⁰⁾ is sampled. The hidden units are then utilized to sample a new set of visible units φ⁽¹⁾, and the procedure is repeated for k steps:

CD_k = ⟨∂(−S(φ⁽⁰⁾,h;θ))/∂θ⟩_{p(h|φ⁽⁰⁾;θ)} − ⟨∂(−S(φ⁽ᵏ⁾,h;θ))/∂θ⟩_{p(h|φ⁽ᵏ⁾;θ)},   (A25)

where for the considered cases we use k = 1.

Appendix B: Simulation Details and Hyper-Parameters

The φ⁴ scalar field theory is a system with continuous degrees of freedom, −∞ < φ < +∞. To sample the system we implement Markov chain Monte Carlo with the Metropolis algorithm, where we consider one step as equivalent to updating a number of lattice sites equal to the volume of the system. The question of how to properly choose a new state additionally arises. When the training data have values which lie in a specific interval, the aim of the machine learning algorithm is to learn a probability distribution which reproduces them. The new state can then be chosen by sampling uniformly between the minimum and maximum value, thereby guaranteeing that every state is reachable under an arbitrarily large number of Monte Carlo steps. For the case of the hidden units in the φ⁴ neural network we impose the same restriction, even though the hidden units could, in principle, remain unconstrained, i.e. −∞ < h < +∞. We also emphasize that during the gradient-based training of the Markov random field we retain one Markov chain to conduct the necessary calculations.

The learning rates that produced Figs. 2 and 3 are 10⁻ and 10⁻, respectively. The sample size is chosen equal to 50 before updating the variational parameters θ. The image in Fig. 9 has size 32 × 32 and its continuous values lie between [−, ]; it was trained for . ×  epochs. The parameters that produced Fig. 8 are a learning rate of 0.1, 400 epochs and a batch size of 4. For the results depicted in Fig. 10, the φ⁴ neural network has 4096 visible units, 32 hidden units, a learning rate of 0.1 and a batch size of 5, and was trained for 10 epochs on the first 40 examples of the Olivetti faces dataset.

Appendix C: Histogram Reweighting

We consider the numerical estimator for an arbitrary observable ⟨O⟩ in the full complex action A which we aim to sample:

⟨O⟩ = Σ_{l=1}^{N} O_l p̃_l⁻¹ exp[−g′ⱼ A_l^(j) − Σ_{k≠j} g_k A_l^(k)] / Σ_{l=1}^{N} p̃_l⁻¹ exp[−g′ⱼ A_l^(j) − Σ_{k≠j} g_k A_l^(k)],   (C1)

where N is the number of Monte Carlo samples and the p̃ are the probabilities used to sample from the equilibrium distribution. We have expressed the numerical estimator in a form that simultaneously allows extrapolation along the trajectory of a coupling constant g′ⱼ. We now substitute for the probabilities p̃ the probabilities of the inhomogeneous φ⁴ Markov random field:

p̃_l = exp[−S_l] / ∫ exp[−S_φ] dφ,   (C2)

where the integral is over all possible states φ of the system, and we arrive at the reweighting equation:

⟨O⟩ = Σ_{l=1}^{N} O_l exp[S_l − g′ⱼ A_l^(j) − Σ_{k≠j} g_k A_l^(k)] / Σ_{l=1}^{N} exp[S_l − g′ⱼ A_l^(j) − Σ_{k≠j} g_k A_l^(k)].   (C3)

Given a subset of samples drawn from the equilibrium distribution of the φ⁴ Markov random field, which is described by the action S, one can extrapolate observables to the full distribution of the action A, which includes longer-range interactions and complex-valued terms, along the trajectory of a coupling constant g′ⱼ. To compare the reweighting extrapolations from the φ⁴ Markov random field with those from the full action, we additionally implement reweighting from the simulated action ℜA. In this form of reweighting we again consider Eq. C1 and substitute for p̃:

p̃_l = exp[−Σ_k g_k A_l^(k)] / ∫ exp[−Σ_k g_k A_φ^(k)] dφ,   (C4)

where we consider for this specific case that g′ⱼ = gⱼ, arriving at the equation:

⟨O⟩ = Σ_{l=1}^{N} O_l exp[−ℑA_l] / Σ_{l=1}^{N} exp[−ℑA_l].   (C5)
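A minimal numerical sketch of the reweighting estimator of Eq. C3 is given below; it is not the authors' code, and the array names are illustrative: S holds the Markov-random-field action and A_terms the terms A^(k), each evaluated on the N sampled configurations.

```python
import numpy as np

def reweight(obs, S, A_terms, g):
    """Reweighted estimator: samples were drawn with weights exp(-S_l), and the
    observable is extrapolated to the action sum_k g_k A_l^(k), which may be
    complex-valued.  obs, S and each A_terms[k] are length-N arrays."""
    log_w = np.asarray(S, dtype=complex) - sum(
        gk * np.asarray(Ak) for gk, Ak in zip(g, A_terms))
    log_w -= np.max(log_w.real)   # constant shift for numerical stability
    w = np.exp(log_w)             # complex weights for complex-valued actions
    return np.sum(np.asarray(obs) * w) / np.sum(w)
```

Varying one entry of g extrapolates along the trajectory of the corresponding coupling g′ⱼ; when the target action equals the simulated one, the weights are flat and the estimator reduces to the plain sample mean.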
One observable of interest is the magnetization, defined as:

m = (1/V) |Σᵢ φᵢ|,   (C6)

where V = L × L is the volume of the system.

Appendix D: Binning Analysis

Statistical errors are calculated with the binning method on the obtained Monte Carlo datasets. Each dataset, with 10 minimally correlated configurations, is split into n = 10 datasets on which calculations of the observables O are conducted. The standard deviation for an observable O is then obtained through:

σ_O = √[ (⟨O²⟩ − ⟨O⟩²) / (n − 1) ],   (D1)

where ⟨·⟩ here denotes the average over the n datasets.

[1] J. Zinn-Justin, Quantum Field Theory and Critical Phenomena (Oxford University Press, Oxford, 2002).
[2] J. Glimm and A. Jaffe, Quantum Physics: A Functional Integral Point of View (Springer, New York, NY, 1987).
[3] G. Velo and A. Wightman, Constructive Quantum Field Theory: The 1973 "Ettore Majorana" International School of Mathematical Physics, Lecture Notes in Physics (Springer, Berlin, Heidelberg, 1973).
[4] E. Seiler, Gauge Theories as a Problem of Constructive Quantum Field Theory and Statistical Mechanics (Springer, Berlin, Heidelberg, 1982).
[5] E. Nelson, Probability theory and Euclidean field theory, in Constructive Quantum Field Theory, edited by G. Velo and A. Wightman (Springer, Berlin, Heidelberg, 1973) pp. 94–124.
[6] E. Nelson, Construction of quantum fields from Markoff fields, Journal of Functional Analysis, 97 (1973).
[7] I. J. Goodfellow, Y. Bengio, and A. Courville, Deep Learning (MIT Press, Cambridge, MA, USA, 2016).
[8] G. Carleo, I. Cirac, K. Cranmer, L. Daudet, M. Schuld, N. Tishby, L. Vogt-Maranto, and L. Zdeborová, Machine learning and the physical sciences, Reviews of Modern Physics, 10.1103/revmodphys.91.045002 (2019).
[9] G. Kanwar, M. S. Albergo, D. Boyda, K. Cranmer, D. C. Hackett, S. Racanière, D. J. Rezende, and P. E. Shanahan, Equivariant flow-based sampling for lattice gauge theory, Phys. Rev. Lett., 121601 (2020).
[10] P. E. Shanahan, D. Trewartha, and W. Detmold, Machine learning action parameters in lattice quantum chromodynamics, Phys. Rev. D, 094506 (2018).
[11] K. Zhou, G. Endrődi, L.-G. Pang, and H. Stöcker, Regressive and generative neural networks for scalar field theory, Phys. Rev. D, 011501 (2019).
[12] D. Bachtis, G. Aarts, and B. Lucini, Mapping distinct phase transitions to a neural network, Phys. Rev. E, 053306 (2020).
[13] M. N. Chernodub, H. Erbin, V. A. Goy, and A. V. Molochkov, Topological defects and confinement with machine learning: the case of monopoles in compact electrodynamics, Phys. Rev. D, 054501 (2020).
[14] S. Blücher, L. Kades, J. M. Pawlowski, N. Strodthoff, and J. M. Urban, Towards novel insights in lattice field theory with explainable machine learning, Phys. Rev. D, 094507 (2020).
[15] M. Favoni, A. Ipp, D. I. Müller, and D. Schuh, Lattice gauge equivariant convolutional neural networks (2020), arXiv:2012.12901 [hep-lat].
[16] K. A. Nicoli, C. J. Anders, L. Funcke, T. Hartung, K. Jansen, P. Kessel, S. Nakajima, and P. Stornati, Estimation of thermodynamic observables in lattice field theories with deep generative models (2021), arXiv:2007.07115 [hep-lat].
[17] E. L. van Nieuwenburg, Y.-H. Liu, and S. Huber, Learning phase transitions by confusion, Nature Physics, 435 (2017).
[18] J. Carrasquilla and R. G. Melko, Machine learning phases of matter, Nature Physics, 431 (2017).
[19] C. Giannetti, B. Lucini, and D. Vadacchino, Machine learning as a universal tool for quantitative investigations of phase transitions, Nuclear Physics B, 114639 (2019).
[20] D. Bachtis, G. Aarts, and B. Lucini, Adding machine learning within Hamiltonians: renormalization group transformations, symmetry breaking and restoration, Phys. Rev. Research, 013134 (2021).
[21] D. Bachtis, G. Aarts, and B. Lucini, Extending machine learning classification capabilities with histogram reweighting, Phys. Rev. E, 033303 (2020).
[22] L. Wang, Discovering phase transitions with unsupervised learning, Phys. Rev. B, 195105 (2016).
[23] A. Tanaka and A. Tomiya, Detection of phase transition via convolutional neural networks, Journal of the Physical Society of Japan, 063001 (2017), https://doi.org/10.7566/JPSJ.86.063001.
[24] E. Agliari, A. Barra, P. Sollich, and L. Zdeborová, Machine learning and statistical physics: preface, Journal of Physics A: Mathematical and Theoretical, 500401 (2020).
[25] L. Zdeborová and F. Krzakala, Statistical physics of inference: thresholds and algorithms, Advances in Physics, 453 (2016), https://doi.org/10.1080/00018732.2016.1211393.
[26] S. Goldt, M. S. Advani, A. M. Saxe, F. Krzakala, and L. Zdeborová, Dynamics of stochastic gradient descent for two-layer neural networks in the teacher-student setup, Journal of Statistical Mechanics: Theory and Experiment, 124010 (2020).
[27] D. Alberici, A. Barra, P. Contucci, and E. Mingione, Annealing and replica-symmetry in deep Boltzmann machines, Journal of Statistical Physics, 665 (2020).
[28] E. Agliari, D. Migliozzi, and D. Tantari, Non-convex multi-species Hopfield models, J. Stat. Phys. (2018).
[29] A. Barra, G. Genovese, P. Sollich, and D. Tantari, Phase transitions in restricted Boltzmann machines with generic priors, Phys. Rev. E (2017).
[30] A. Barra, G. Genovese, P. Sollich, and D. Tantari, Phase diagram of restricted Boltzmann machines and generalized Hopfield networks with arbitrary priors, Phys. Rev. E (2018).
[31] M. Mézard, Mean-field message-passing equations in the Hopfield model and its generalizations, Phys. Rev. E (2017).
[32] A. Barra, A. Bernacchia, E. Santucci, and P. Contucci, On the equivalence of Hopfield networks and Boltzmann machines, Neural Netw. (2012).
[33] M. Mézard, G. Parisi, and M. A. Virasoro, Spin Glass Theory and Beyond: An Introduction to the Replica Method and Its Applications (World Scientific, Singapore, 1987).
[34] J. Halverson, A. Maiti, and K. Stoner, Neural networks and quantum field theory (2020), arXiv:2008.08601 [cs.LG].
[35] J. Lee, Y. Bahri, R. Novak, S. Schoenholz, J. Pennington, and J. Sohl-Dickstein, Deep neural networks as Gaussian processes (6th International Conference on Learning Representations, Vancouver, BC, Canada, 2018).
[36] A. G. de G. Matthews, J. Hron, M. Rowland, R. E. Turner, and Z. Ghahramani, Gaussian process behaviour in wide deep neural networks, in International Conference on Learning Representations (2018).
[37] R. Novak, L. Xiao, Y. Bahri, J. Lee, G. Yang, D. A. Abolafia, J. Pennington, and J. Sohl-Dickstein, Bayesian deep convolutional networks with many channels are Gaussian processes, in International Conference on Learning Representations (2019).
[38] A. Garriga-Alonso, C. E. Rasmussen, and L. Aitchison, Deep convolutional networks as shallow Gaussian processes, in International Conference on Learning Representations (2019).
[39] D. Koller and N. Friedman, Probabilistic Graphical Models: Principles and Techniques (The MIT Press, 2009).
[40] K. G. Wilson, Confinement of quarks, Phys. Rev. D, 2445 (1974).
[41] C. J. Preston, Gibbs States on Countable Sets, Cambridge Tracts in Mathematics (Cambridge University Press, 1974).
[42] A. Milchev, D. W. Heermann, and K. Binder, Finite-size scaling analysis of the φ⁴ field theory on the square lattice, J. Stat. Phys., 749 (1986).
[43] C. M. Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics) (Springer-Verlag, Berlin, Heidelberg, 2006).
[44] A. M. Ferrenberg and R. H. Swendsen, New Monte Carlo technique for studying phase transitions, Phys. Rev. Lett., 2635 (1988).
[45] A. Blake, P. Kohli, and C. Rother, Markov Random Fields for Vision and Image Processing (The MIT Press, 2011).
[46] A. Krizhevsky, Learning multiple layers of features from tiny images (2009).
[47] P. Smolensky, Information processing in dynamical systems: foundations of harmony theory, in Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1: Foundations (MIT Press, Cambridge, MA, USA, 1986) pp. 194–281.
[48] D. H. Ackley, G. E. Hinton, and T. J. Sejnowski, A learning algorithm for Boltzmann machines, Cognitive Science, 147 (1985).
[49] A. Fischer and C. Igel, Training restricted Boltzmann machines: an introduction, Pattern Recognition, 25 (2014).
[50] G. E. Hinton, A practical guide to training restricted Boltzmann machines, in Neural Networks: Tricks of the Trade, Second Edition, edited by G. Montavon, G. B. Orr, and K.-R. Müller (Springer, Berlin, Heidelberg, 2012) pp. 599–619.
[51] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle, Greedy layer-wise training of deep networks, in Proceedings of the 19th International Conference on Neural Information Processing Systems, NIPS'06 (MIT Press, Cambridge, MA, USA, 2006) pp. 153–160.
[52] O. Krause, A. Fischer, T. Glasmachers, and C. Igel, Approximation properties of DBNs with binary hidden units and real-valued visible units, in Proceedings of the 30th International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 28, edited by S. Dasgupta and D. McAllester (PMLR, Atlanta, Georgia, USA, 2013) pp. 419–426.
[53] This dataset contains a set of face images taken between April 1992 and April 1994 at AT&T Laboratories Cambridge.
[54] G. E. Hinton and R. R. Salakhutdinov, Reducing the dimensionality of data with neural networks, Science, 504 (2006).
[55] E. Marinari and G. Parisi, Simulated tempering: a new Monte Carlo scheme, Europhysics Letters (EPL), 451 (1992).
[56] J. Lee, New Monte Carlo algorithm: entropic sampling, Phys. Rev. Lett., 211 (1993).
[57] R. C. Brower and P. Tamayo, Embedded dynamics for φ⁴ theory, Phys. Rev. Lett., 1087 (1989).
[58] W. Loinaz and R. S. Willey, Monte Carlo simulation calculation of the critical coupling constant for two-dimensional continuum φ⁴ theory, Phys. Rev. D, 076003 (1998).
[59] W. Bietenholz, R. Brower, S. Chandrasekharan, and U.-J. Wiese, Progress on perfect lattice actions for QCD, Nuclear Physics B - Proceedings Supplements, 921 (1997), Lattice 96.
[60] W. Bietenholz and U.-J. Wiese, Perfect actions with chemical potential, Physics Letters B, 114 (1998).
[61] M. Jain and V. Vanchurin, Generating functionals for quantum field theories with random potentials, Journal of High Energy Physics (2016).