MIT-CTP/5272

Introduction to Normalizing Flows for Lattice Field Theory

Michael S. Albergo,1,∗ Denis Boyda,2,3,† Daniel C. Hackett,2,3,‡ Gurtej Kanwar,2,3,§ Kyle Cranmer,1 Sébastien Racanière,4 Danilo Jimenez Rezende,4 and Phiala E. Shanahan2,3

1 Center for Cosmology and Particle Physics, New York University, New York, NY 10003, USA
2 Center for Theoretical Physics, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
3 The NSF AI Institute for Artificial Intelligence and Fundamental Interactions
4 DeepMind, London, UK

∗ [email protected]  † [email protected]  ‡ [email protected]  § [email protected]

(Dated: January 21, 2021)

This notebook tutorial demonstrates a method for sampling Boltzmann distributions of lattice field theories using a class of machine learning models known as normalizing flows. The ideas and approaches proposed in arXiv:1904.12072, arXiv:2002.02428, and arXiv:2003.06413 are reviewed and a concrete implementation of the framework is presented. We apply this framework to a lattice scalar field theory and to U(1) gauge theory, explicitly encoding gauge symmetries in the flow-based approach to the latter. This presentation is intended to be interactive and working with the attached Jupyter notebook is recommended.
I. INTRODUCTION TO NORMALIZING FLOWS FOR LATTICE FIELD THEORY
A central challenge in lattice field theory is devising algorithms to efficiently generate field configurations. In recent works [1–3] we have demonstrated a promising new method based on normalizing flows, a class of probabilistic machine-learning models for which both direct sampling and exact likelihood evaluation are computationally tractable. The aims of this tutorial are to introduce the reader to the normalizing flow method and its application to scalar and gauge field theory.

We first work through some toy examples which illustrate the underlying concept of normalizing flows as a change of variables. From there, we straightforwardly generalize to more expressive forms that can parametrize samplers for close approximations of our distributions of interest. We detail how such approximations can be corrected using MCMC methods, yielding provably correct statistics. As an important part of our toolkit, we show how we can dramatically reduce the complexity of these models by constraining them to be equivariant with respect to physical symmetries: (a subgroup of) lattice translational symmetries and, for U(1) gauge theory, local gauge invariance. Readers unfamiliar with the notebook format should read this document as a single annotated program, wherein all code is executed sequentially from start to finish without clearing the scope.

We begin by defining a few utilities and importing common packages. Readers may safely execute and skip over the remainder of this section.

import base64
import io
import pickle

import numpy as np
import torch

print(f'TORCH VERSION: {torch.__version__}')

import packaging.version
if packaging.version.parse(torch.__version__) < packaging.version.parse('1.5.0'):
    raise RuntimeError('Torch versions lower than 1.5.0 not supported')

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')

>>> TORCH VERSION: 1.6.0

if torch.cuda.is_available():
    torch_device = 'cuda'
    float_dtype = np.float32
    torch.set_default_tensor_type(torch.cuda.FloatTensor)
else:
    torch_device = 'cpu'
    float_dtype = np.float64
    torch.set_default_tensor_type(torch.DoubleTensor)
print(f"TORCH DEVICE: {torch_device}")

>>> TORCH DEVICE: cpu

def torch_mod(x):
    return torch.remainder(x, 2*np.pi)

def torch_wrap(x):
    return torch_mod(x+np.pi) - np.pi

Often we want to detach tensors from the computational graph and pull them to the CPU as a numpy array.

def grab(var):
    return var.detach().cpu().numpy()
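As a quick illustration (not part of the original notebook), torch_mod wraps angles into $[0, 2\pi)$ while torch_wrap maps them into $[-\pi, \pi)$; both are used later when working with U(1) link angles.

# Wrap a few example angles into the two conventions.
angles = torch.tensor([-0.5, 3*np.pi, 7.0])
print(grab(torch_mod(angles)))   # values wrapped into [0, 2*pi)
print(grab(torch_wrap(angles)))  # values wrapped into [-pi, pi)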
The code below makes a live-updating plot during training.

from IPython.display import display

def init_live_plot(dpi=125, figsize=(8,4)):
    fig, ax_ess = plt.subplots(1,1, dpi=dpi, figsize=figsize)
    plt.xlim(0, N_era*N_epoch)
    plt.ylim(0, 1)
    ess_line = plt.plot([0],[0], alpha=0.5)  # dummy
    plt.grid(False)
    plt.ylabel('ESS')

    ax_loss = ax_ess.twinx()
    loss_line = plt.plot([0],[0], alpha=0.5, c='orange')  # dummy
    plt.grid(False)
    plt.ylabel('Loss')
    plt.xlabel('Epoch')

    display_id = display(fig, display_id=True)
    return dict(fig=fig, ax_ess=ax_ess, ax_loss=ax_loss,
                ess_line=ess_line, loss_line=loss_line,
                display_id=display_id)

def moving_average(x, window=10):
    if len(x) < window:
        return np.mean(x, keepdims=True)
    else:
        return np.convolve(x, np.ones(window), 'valid') / window

def update_plots(history, fig, ax_ess, ax_loss, ess_line, loss_line, display_id):
    Y = np.array(history['ess'])
    Y = moving_average(Y, window=15)
    ess_line[0].set_ydata(Y)
    ess_line[0].set_xdata(np.arange(len(Y)))
    Y = history['loss']
    Y = moving_average(Y, window=15)
    loss_line[0].set_ydata(np.array(Y))
    loss_line[0].set_xdata(np.arange(len(Y)))
    ax_loss.relim()
    ax_loss.autoscale_view()
    fig.canvas.draw()
    display_id.update(fig)

II. NOTATION
This section is intended as a reference. The phrases and notation listed here will be defined in detail in the remainder of the notebook.

1. Notation for generic normalizing flows

• Coordinates $z, x \in$ some manifold $X$ (a space with local $\mathbb{R}^n$ structure). The manifolds used here are $X = \mathbb{R}^n$ (for scalar field theory) and $X = T^n$ (for U(1) gauge theory), where $T^n$ refers to the $n$-dimensional torus.
• Probability densities over those manifolds:
  – Prior density $r(z)$
  – Model density $q(x)$
  – Target density $p(x)$
• Normalizing flow $f : X \to X$, invertible and differentiable
• Jacobian factor $J(z) = |\det_{ij} \partial f_i(z)/\partial z_j|$
• Coupling layer $g : X \to X$, invertible and differentiable
• Subsets of the components of the coordinate $x = (x_1, x_2)$, where the choice of subsets will be clear from context

2. Notation for lattice field theories

• Lattice spacing $a$. We work in "lattice units" where $a = 1$.
• Spacetime dimension $N_d$. We work in this notebook with $N_d = 2$.
• Lattice extent $L$, with volume $V = L^{N_d} = L^2$, in lattice units where $a = 1$.
• Lattice position $\vec{x} = a\vec{n} \equiv (a n_x, a n_y)$, with $\vec{x} = \vec{n}$ in lattice units where $a = 1$. We use $n_x, n_y \in [0, L-1]$.

3. Notation for normalizing flows targeting scalar lattice field theory

• Field configurations $z \in \mathbb{R}^V$ or $\phi \in \mathbb{R}^V$, corresponding to $z$ or $x$ in the generic notation
• $\phi(\vec{n})$ denotes the field configuration which lives on the sites of the lattice, while $\phi_{\vec{n}}$ denotes the unraveled 1D vector of lattice DOF
• Action $S[\phi] \in \mathbb{R}$
• Discretized path integral measure $\prod_{\vec{n}} d\phi_{\vec{n}}$

4. Notation for normalizing flows targeting U(1) lattice gauge theory

• Field configurations $U \in T^{N_d V}$ or $U' \in T^{N_d V}$, corresponding to $z$ or $x$ in the generic notation
• $U_\mu(\vec{n})$ denotes the component of field configuration $U$ which lives on the link $(\vec{n}, \vec{n}+\hat{\mu})$ of the lattice, where $\mu \in [0, N_d - 1]$ indicates the Cartesian direction. $U_{\mu,\vec{n}}$ denotes the unraveled 1D vector of lattice DOF
• Action $S[U] \in \mathbb{R}$
• Angular parameterization of each component $U_{\mu,\vec{n}} \equiv e^{i\theta_{\mu,\vec{n}}}$
• Discretized path integral measure $\prod_{\mu,\vec{n}} dU_{\mu,\vec{n}}$, where $dU_{\mu,\vec{n}} = d\theta_{\mu,\vec{n}}$ is the Haar measure for U(1)

III. NORMALIZING FLOWS (FOR LATTICE QFTS)
A powerful method to generate samples from complicated distributions is to combine (1) sampling from a simpler / tractable distribution with (2) applying a deterministic change-of-variables (a normalizing flow) to the output samples. The transformed samples are distributed according to a new distribution which is determined by the initial distribution and change-of-variables. These two components together define a normalizing flow model. See [4] for a review.
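As a minimal one-dimensional illustration of this recipe (this snippet is not part of the original notebook), uniform samples pushed through the map $x = -\log u$ follow the exponential density $q(x) = e^{-x}$:

# Sample u ~ Uniform(0,1), change variables to x = -log(u), and check the
# histogram against the exponential density exp(-x).
u_toy = np.random.random(size=100000)
x_toy = -np.log(u_toy)
hist, edges = np.histogram(x_toy, bins=50, range=(0, 5), density=True)
centers = 0.5*(edges[1:] + edges[:-1])
print(np.max(np.abs(hist - np.exp(-centers))))  # small, up to statistical noise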
A. A simple example
The Box-Muller transform is an example of this trick in practice: to produce Gaussian random variables, draw two variables $U_1$ and $U_2$ from $\mathrm{unif}(0,1)$, then change variables to

$$Z_1 = \sqrt{-2 \ln U_1} \, \cos(2\pi U_2) \quad \text{and} \quad Z_2 = \sqrt{-2 \ln U_1} \, \sin(2\pi U_2). \tag{1}$$

The resulting variables $Z_1, Z_2$ are then distributed according to an uncorrelated, unit-variance Gaussian distribution.

batch_size = 2**14
u = np.random.random(size=(batch_size, 2))
z = np.sqrt(-2*np.log(u[:,0]))[:,np.newaxis] * np.stack(
    (np.cos(2*np.pi*u[:,1]), np.sin(2*np.pi*u[:,1])), axis=-1)

fig, ax = plt.subplots(1,2, dpi=125, figsize=(4,2))
for a in ax:
    a.set_xticks([-2, 0, 2])
    a.set_yticks([-2, 0, 2])
    a.set_aspect('equal')
ax[0].hist2d(u[:,0], u[:,1], bins=30, range=[[-3.0,3.0], [-3.0,3.0]])
ax[0].set_xlabel(r"$U_1$")
ax[0].set_ylabel(r"$U_2$", rotation=0, y=.46)
ax[1].hist2d(z[:,0], z[:,1], bins=30, range=[[-3.0,3.0], [-3.0,3.0]])
ax[1].set_yticklabels([])
ax[1].set_xlabel(r"$Z_1$")
ax[1].set_ylabel(r"$Z_2$", rotation=0, y=.53)
ax[1].yaxis.set_label_position("right")
ax[1].yaxis.tick_right()
plt.show()

We can analytically compute the density associated with output samples by the change-of-variables formula relating the prior density $r(U_1, U_2) = 1$ to the output density $q(Z_1, Z_2)$:

$$q(Z_1, Z_2) = r(U_1, U_2) \left| \det_{kl} \frac{\partial Z_k(U_1, U_2)}{\partial U_l} \right|^{-1} = 1 \times \left| \det \begin{pmatrix} \frac{-1}{U_1 \sqrt{-2\ln U_1}} \cos(2\pi U_2) & -2\pi \sqrt{-2\ln U_1} \sin(2\pi U_2) \\ \frac{-1}{U_1 \sqrt{-2\ln U_1}} \sin(2\pi U_2) & 2\pi \sqrt{-2\ln U_1} \cos(2\pi U_2) \end{pmatrix} \right|^{-1} = \left| \frac{2\pi}{U_1} \right|^{-1}. \tag{2}$$

Here, the term $J(U_1, U_2) \equiv \left| \det_{kl} \partial Z_k(U_1, U_2) / \partial U_l \right|$ is the determinant of the Jacobian of the transformation from $(U_1, U_2)$ to $(Z_1, Z_2)$. Intuitively, the Jacobian factor can be thought of as a change in volume element, therefore the change-of-variables formula must contain the inverse of this factor (spreading out volume decreases density). To complete the example, we can rearrange the change of variables to find $U_1 = \exp(-(Z_1^2 + Z_2^2)/2)$, giving

$$q(Z_1, Z_2) = \frac{1}{2\pi} \, e^{-(Z_1^2 + Z_2^2)/2}. \tag{3}$$

NOTE: In this example, the model has no free parameters because we didn't need any to create a transform that exactly reproduced our target distribution (independent, unit-variance Gaussian). In general, we may not know a normalizing flow that exactly produces our desired distribution, and so instead construct parametrized models that we can variationally optimize to approximate that target distribution; because we can compute the model density, these models can be corrected to nevertheless guarantee exactness.
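As a numerical cross-check of Eq. (3) (an illustrative addition, not in the original notebook), the histogram of the transformed samples should match the analytic Gaussian density:

# Compare the empirical 2D density of the Box-Muller samples with
# q(Z1, Z2) = exp(-(Z1^2 + Z2^2)/2) / (2*pi) from Eq. (3).
hist, xedges, yedges = np.histogram2d(z[:,0], z[:,1], bins=20,
    range=[[-3,3],[-3,3]], density=True)
xc = 0.5*(xedges[1:] + xedges[:-1])
yc = 0.5*(yedges[1:] + yedges[:-1])
analytic = np.exp(-(xc[:,None]**2 + yc[None,:]**2)/2) / (2*np.pi)
print('max abs deviation:', np.max(np.abs(hist - analytic)))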
B. The general approach
Generalizing this example, it is clear that any invertible and differentiable function $f(z)$ will transform a prior density $r(z)$ on the (possibly multi-dimensional) random variable $z$ to an output density $q(x)$ on $x \equiv f(z)$. If the Jacobian factor $J(z) \equiv |\det_{kl} \partial f_k(z)/\partial z_l|$ is efficiently calculable, we can compute the output density alongside any samples drawn using the change-of-variables formula,

$$q(x) = r(z) [J(z)]^{-1} = r(z) \left| \det_{kl} \frac{\partial f_k(z)}{\partial z_l} \right|^{-1}. \tag{4}$$

In some cases, it is easy to compute the Jacobian factor even when the whole Jacobian matrix is intractable; for example, only the diagonal elements are needed if the Jacobian matrix is known to be triangular. Below we will see how to construct $f$ with a triangular Jacobian using coupling layers.

In lattice field theory simulations, our goal is to draw samples from a distribution over lattice field configurations defined by the imaginary-time path integral. By optimizing the function $f$ we hope to find an output distribution that closely models this desired physical distribution. If the family of functions is expressive (i.e. includes a wide variety of possible functions) we expect the optimal choice to be a good approximation to the true distribution. Moreover, we can make the task of searching for the optimal choice more efficient by restricting to functions that guarantee certain symmetries in the output distribution. Once we have a good approximation to the output distribution, we can draw samples from it and use MCMC methods or reweighting to correct their statistics to the exact distribution of interest.

C. Prior distributions
Any probability distribution that is easy to sample from and has calculable density $r(z)$ can be used as the prior distribution.

In code, our interface mimics a subset of the pytorch Distribution interface. For example, below we define a prior distribution corresponding to uncorrelated Gaussians (one per component of the field). Any other distribution you may want to define should provide analogous methods log_prob and sample_n.

class SimpleNormal:
    def __init__(self, loc, var):
        self.dist = torch.distributions.normal.Normal(
            torch.flatten(loc), torch.flatten(var))
        self.shape = loc.shape
    def log_prob(self, x):
        logp = self.dist.log_prob(x.reshape(x.shape[0], -1))
        return torch.sum(logp, dim=1)
    def sample_n(self, batch_size):
        x = self.dist.sample((batch_size,))
        return x.reshape(batch_size, *self.shape)

The shape of loc and var determine the shape of samples drawn.

normal_prior = SimpleNormal(torch.zeros((3,4,5)), torch.ones((3,4,5)))
z = normal_prior.sample_n(17)
print(f'z.shape = {z.shape}')
print(f'log r(z) = {grab(normal_prior.log_prob(z))}')

>>> z.shape = torch.Size([17, 3, 4, 5])
... log r(z) = [-82.15322048 -92.57309315 -81.64606318 -81.02122974 -82.402781
...  -87.32209986 -89.71201831 -80.20303503 -84.56155853 -89.50678784
...  -84.31995159 -90.90156267 -82.82711487 -80.48685168 -88.90803586
...  -80.86843481 -87.73830259]

We use
SimpleNormal as the prior distribution for scalar field theory, and later define a uniform distribution as the prior distribution for U(1) gauge theory.
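As a quick consistency check (an illustrative addition, not in the original notebook), the log-density reported by SimpleNormal should agree with the analytic log-density of independent unit Gaussians, summed over all components of each sample:

# Manual computation of sum_i [ -z_i^2/2 - log(2*pi)/2 ] for each configuration,
# compared against SimpleNormal.log_prob.
manual_logr = torch.sum(-0.5*z**2 - 0.5*np.log(2*np.pi), dim=(1,2,3))
print(torch.allclose(manual_logr, normal_prior.log_prob(z)))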
D. Designing the flow f

As a reminder, a normalizing flow $f$ must be invertible and differentiable. To be useful, it should also be efficient to compute the Jacobian factor and be expressive.

Expressive functions can be built through composition of simpler ones. When each simpler function is invertible and differentiable, the composed function is as well. Schematically, this subdivides the task of learning a complicated map as below:

FIG. 1. Fig. 1 of [1]. The notation superficially differs from what we present here.
Coupling layers are one approach to defining the $g_i$ in the composed function. These functions are defined to update only a subset of the input variables, conditioned on the complementary ("frozen") subset. For example, if the input to a coupling layer was a lattice with one real number per site, the layer could be defined to update only the odd sites in a checkerboard pattern. To ensure all variables are updated, we could then compose coupling layers that alternatingly update odd sites and even sites.

In a coupling layer, the transform applied to the updated subset of variables is a simple, manifestly invertible operation such as a scaling ($x \to e^s x$) or affine transformation ($x \to e^s x + t$). For example, a coupling layer $g(x_1, x_2) = (x'_1, x'_2)$ based on a scaling transformation looks like

$$x'_1 = e^{s(x_2)} x_1, \qquad x'_2 = x_2, \tag{5}$$

where $x_1, x_2$ are subsets of the components of $x$. We say that $x_1$ is updated based on the frozen subset $x_2$, which is not changed by the coupling layer. $e^{s(x_2)}$ is a vector of the same shape as $x_1$, and $e^{s(x_2)} x_1$ denotes an elementwise product. The parameters defining the transform, $s(x_2)$, can be complicated, non-invertible functions of the frozen subset of variables. However, the inverse of this transformation $g^{-1}(x'_1, x'_2) = (x_1, x_2)$ is simply computed using the same parameters,

$$x_1 = e^{-s(x'_2)} x'_1, \qquad x_2 = x'_2. \tag{6}$$

Here the key to guaranteeing invertibility is that $x_2 = x'_2$. This "trick" is exactly what guarantees invertibility for leapfrog integrators, which alternately update position and momentum variables.

This also ensures a triangular Jacobian,

$$\frac{\partial g(x_1, x_2)}{\partial x} = \begin{pmatrix} \dfrac{\partial x'_1}{\partial x_1} & \dfrac{\partial x'_1}{\partial x_2} \\ 0 & 1 \end{pmatrix}, \tag{7}$$

which in the scaling example takes the form

$$\frac{\partial g(x_1, x_2)}{\partial x} = \begin{pmatrix} \mathrm{diag}\!\left(e^{s(x_2)}\right) & \dfrac{\partial x'_1}{\partial x_2} \\ 0 & \mathbb{1} \end{pmatrix}, \tag{8}$$

where we have expanded the blocks over $(x'_1, x'_2) \times (x_1, x_2)$ from the first expression. Therefore $J(x)$ is efficiently computed as

$$J(x) = \left| \det_{kl} \frac{\partial [g(x_1, x_2)]_k}{\partial x_l} \right| = \prod_k e^{[s(x_2)]_k}, \tag{9}$$

where $k$ runs over the components in $s(x_2)$. The Jacobian of the inverse transformation is simply $J_{\text{reverse}}(x') = 1/J(x') = \prod_k e^{-[s(x'_2)]_k}$; note that here we were able to compute the reverse Jacobian in terms of the forward Jacobian applied to $x'$ because of the simplicity of the coupling layer.

The coupling layer architecture makes it easier to guarantee invertibility while retaining expressivity: the functions which provide the parameters of the transformation are flexible, while the inverse and Jacobian factor of such coupling transformations are easy to compute. Many such coupling layers can be stacked to compose expressive functions $f$ efficiently.

E. Simple coupling layer demo
To demonstrate coupling layers in practice, we define a coupling layer using scaling (see above) for two-dimensional inputs $x \equiv (x_1, x_2)$ [i.e. in comparison to the previous section, here $x_1$ and $x_2$ are just scalars]. Because we have the freedom to make the function $s(x_2)$ arbitrarily complex without sacrificing invertibility, we parametrize $s(x_2)$ as a neural net made of alternating layers of linear transformations and ReLU ("rectified linear unit") activation functions, with a tanh activation function after the final linear transform.

We implement coupling layers as an extension of torch.nn.Module to include application of $g$ (see forward) and inverse $g^{-1}$ (see reverse). These both map from the domain of lattice degrees of freedom to itself, $X \to X$; in this case, this is just $\mathbb{R}^2 \to \mathbb{R}^2$. The superclass automatically holds references to all tunable parameters (weights) that will later be optimized.

class SimpleCouplingLayer(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.s = torch.nn.Sequential(
            torch.nn.Linear(1, 8),
            torch.nn.ReLU(),
            torch.nn.Linear(8, 8),
            torch.nn.ReLU(),
            torch.nn.Linear(8, 1),
            torch.nn.Tanh())
    def forward(self, x):
        x1, x2 = x[:,0], x[:,1]
        s = self.s(x2.unsqueeze(-1)).squeeze(-1)
        fx1 = torch.exp(s) * x1
        fx2 = x2
        logJ = s
        return torch.stack((fx1, fx2), dim=-1), logJ
    def reverse(self, fx):
        fx1, fx2 = fx[:,0], fx[:,1]
        x2 = fx2
        s = self.s(x2.unsqueeze(-1)).squeeze(-1)
        logJ = -s
        x1 = torch.exp(-s) * fx1
        return torch.stack((x1, x2), dim=-1), logJ

coupling_layer = SimpleCouplingLayer()

def set_weights(m):
    if hasattr(m, 'weight') and m.weight is not None:
        torch.nn.init.normal_(m.weight, mean=1, std=2)
    if hasattr(m, 'bias') and m.bias is not None:
        m.bias.data.fill_(-1)

torch.manual_seed(1234)
coupling_layer.s.apply(set_weights);

Let's see what our simple coupling layer $g$ does. We draw a batch of samples $x$ from an arbitrary input distribution (uniform in $[-1, 1]^2$), feed it through the coupling layer forwards to get samples $g(x)$ from a new distribution, then feed $g(x)$ backwards through the coupling layer to double-check that we recover our original sample, $x' = g^{-1}(g(x)) \overset{!}{=} x$.

batch_size = 1024
np_x = (2*np.random.random(size=(batch_size, 2)) - 1).astype(float_dtype)
x = torch.from_numpy(np_x).to(torch_device)
gx, fwd_logJ = coupling_layer.forward(x)
xp, bwd_logJ = coupling_layer.reverse(gx)

fig, ax = plt.subplots(1,3, dpi=125, figsize=(6,2.3), sharex=True, sharey=True)
np_gx, np_xp = grab(gx), grab(xp)
for a in ax:
    a.set_xlim(-1.1,1.1)
    a.set_ylim(-1.1,1.1)
ax[0].scatter(np_x[:,0], np_x[:,1], marker='.')
ax[0].set_title(r'$x$')
ax[1].scatter(np_gx[:,0], np_gx[:,1], marker='.', color='tab:orange')
ax[1].set_title(r'$g(x)$')
ax[2].scatter(np_xp[:,0], np_xp[:,1], marker='.')
ax[2].set_title(r"$g^{-1}(g(x))$")
fig.set_tight_layout(True)
plt.show()
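As an optional check (an illustrative sketch, not part of the original notebook), we can compare the log-Jacobian returned by the coupling layer against the log-determinant of the full Jacobian matrix computed by automatic differentiation for a single sample:

# The coupling layer's logJ for one sample should equal log|det(dg/dx)|
# computed by autograd.
x_single = x[:1]
gx_single, logJ_single = coupling_layer.forward(x_single)
J = torch.autograd.functional.jacobian(
    lambda v: coupling_layer.forward(v)[0], x_single)
J_matrix = J[0, :, 0, :]  # drop the two batch axes to get the 2x2 Jacobian
print(grab(logJ_single), grab(torch.logdet(J_matrix)))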
F. Composition

The Jacobian factors $J_i$ from each coupling layer simply multiply together to define the Jacobian factor of the composed function, so that the final density is

$$q(x) = r(z) \left| \det \frac{\partial f(z)}{\partial z} \right|^{-1} = r(z) \prod_i J_i^{-1}. \tag{10}$$

In practice, we'll add together log Jacobians instead. Altogether, sampling and computing the density is simple composition.

def apply_flow_to_prior(prior, coupling_layers, *, batch_size):
    x = prior.sample_n(batch_size)
    logq = prior.log_prob(x)
    for layer in coupling_layers:
        x, logJ = layer.forward(x)
        logq = logq - logJ
    return x, logq
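For example (an illustrative usage sketch, not in the original notebook; these throwaway instances are not reused later), we can compose two of the simple coupling layers defined above over a two-component Gaussian prior:

# Compose two scaling coupling layers over a 2D Gaussian prior and draw a batch.
# In a real model, successive layers would alternate which subset is updated.
demo_prior = SimpleNormal(torch.zeros(2), torch.ones(2))
demo_layers = [SimpleCouplingLayer(), SimpleCouplingLayer()]
demo_x, demo_logq = apply_flow_to_prior(demo_prior, demo_layers, batch_size=4)
print(demo_x.shape, demo_logq.shape)  # (4, 2) samples with one log-density each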
IV. APPLICATION 1: φ⁴ LATTICE SCALAR FIELD THEORY IN 2D

As an example, we consider applying normalizing flows to sampling the distributions associated with scalar field theory in two spacetime dimensions with a $\phi^4$ interaction. See [1] for details.

A. Physical theory
The continuum theory consists of a single real scalar field $\phi(\vec{x})$ as a function of 2D coordinates $\vec{x}$. To access non-perturbative results, such as behavior in the strong-coupling regime, we can regularize the theory on a 2D lattice, assigning one real degree of freedom per site of the lattice. Let's initialize some configurations of an example lattice of size 8 × 8.

L = 8
lattice_shape = (L,L)
phi_ex1 = np.random.normal(size=lattice_shape).astype(float_dtype)
phi_ex2 = np.random.normal(size=lattice_shape).astype(float_dtype)
cfgs = torch.from_numpy(np.stack((phi_ex1, phi_ex2), axis=0)).to(torch_device)
A simple discretization of the derivatives in the continuum Euclidean action gives rise to a valid lattice Euclidean action,

$$S^E_{\text{cont}}[\phi] = \int d^2\vec{x} \, \left[ (\partial_\mu \phi(\vec{x}))^2 + m^2 \phi(\vec{x})^2 + \lambda \phi(\vec{x})^4 \right] \;\rightarrow\; S^E_{\text{latt}}(\phi) = \sum_{\vec{n}} \left[ \phi(\vec{n}) \sum_{\mu \in \{1,2\}} \left( 2\phi(\vec{n}) - \phi(\vec{n}+\hat{\mu}) - \phi(\vec{n}-\hat{\mu}) \right) + m^2 \phi(\vec{n})^2 + \lambda \phi(\vec{n})^4 \right], \tag{11}$$

where now $\phi(\vec{n})$ is only defined on the sites of the $L_x \times L_y$ lattice, $\vec{n} = (n_x, n_y)$, with integer $n_x, n_y$. We have implicitly moved to "lattice units" where $a = 1$ such that $L_x, L_y, V$ are integers and all quantities are unitless. The discretized field $\phi$ can therefore be thought of as an $(L_x \times L_y)$-dimensional vector. We use periodic boundary conditions in all directions, i.e. $\phi(L_x, y) \equiv \phi(0, y)$, etc. For convenience, we typically abbreviate $S^E_{\text{latt}} \equiv S$.

More details on φ⁴ lattice scalar field theory can be found in [5].

The lattice action then defines a probability distribution over configurations $\phi$,

$$p(\phi) = \frac{1}{Z} e^{-S(\phi)}, \qquad Z \equiv \int \prod_{\vec{n}} d\phi(\vec{n}) \, e^{-S(\phi)}, \tag{12}$$

where $\prod_{\vec{n}}$ runs over all lattice sites $\vec{n}$. This is the distribution we are training the normalizing flows to reproduce. While $Z$ is difficult to calculate, in practice we only need $p(\phi)$ up to a constant. The action can be efficiently calculated on arbitrary configurations using Pytorch. Note that while the theory describes 2D spacetime, the dimensionality of the distribution $p(\phi)$ is the number of lattice sites, scaling with the volume of the lattice.

class ScalarPhi4Action:
    def __init__(self, M2, lam):
        self.M2 = M2
        self.lam = lam
    def __call__(self, cfgs):
        # potential term
        action_density = self.M2*cfgs**2 + self.lam*cfgs**4
        # kinetic term
        Nd = len(cfgs.shape)-1
        dims = range(1,Nd+1)
        for mu in dims:
            action_density += 2*cfgs**2
            action_density -= cfgs*torch.roll(cfgs, -1, mu)
            action_density -= cfgs*torch.roll(cfgs, 1, mu)
        return torch.sum(action_density, dim=tuple(dims))

print("Actions for example configs:", ScalarPhi4Action(M2=1.0, lam=1.0)(cfgs))

>>> Actions for example configs: tensor([682.8262, 295.3547])

The theory has a symmetric phase and a broken symmetry phase, corresponding respectively to nearly one mode of the distribution or two widely separated modes (with intermediate configurations suppressed exponentially in volume). The broken symmetry phase can be accessed for $m^2 < 0$ with $\lambda$ less than a critical $\lambda_c$. For simplicity, we restrict focus to the symmetric phase, but remain close to this phase transition such that the system has a non-trivial correlation length.

M2 = -4.0
lam = 8.0
phi4_action = ScalarPhi4Action(M2=M2, lam=lam)
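As a cross-check (an illustrative sketch, not in the original notebook), we can recompute the action of the first example configuration with an explicit loop over sites and compare it against the vectorized implementation:

# Explicit loop over lattice sites for one configuration; should reproduce the
# first entry of phi4_action(cfgs) up to floating-point precision.
phi = phi_ex1
S_loop = 0.0
for nx in range(L):
    for ny in range(L):
        p = phi[nx, ny]
        neighbors = (phi[(nx+1) % L, ny] + phi[(nx-1) % L, ny]
                     + phi[nx, (ny+1) % L] + phi[nx, (ny-1) % L])
        S_loop += p*(4*p - neighbors) + M2*p**2 + lam*p**4
print(S_loop, grab(phi4_action(cfgs[:1]))[0])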
B. Prior distribution
We choose the prior distribution to be I.I.D. Gaussians at each lattice site. This is easy to sample from, and intuitively gives the coupling layers a "blank slate" from which to build in correlations.

prior = SimpleNormal(torch.zeros(lattice_shape), torch.ones(lattice_shape))
We can use the sample_n method to acquire samples from the prior. Some samples drawn from the prior are visualized below.

torch_z = prior.sample_n(1024)
z = grab(torch_z)
print(f'z.shape = {z.shape}')

fig, ax = plt.subplots(4,4, dpi=125, figsize=(4,4))
for i in range(4):
    for j in range(4):
        ind = i*4 + j
        ax[i,j].imshow(np.tanh(z[ind]), vmin=-1, vmax=1, cmap='viridis')
        ax[i,j].axes.xaxis.set_visible(False)
        ax[i,j].axes.yaxis.set_visible(False)
plt.show()

>>> z.shape = (1024, 8, 8)

fig, ax = plt.subplots(4,4, dpi=125, figsize=(4,4))
for x1 in range(2):
    for y1 in range(2):
        i1 = x1*2 + y1
        for x2 in range(2):
            for y2 in range(2):
                i2 = x2*2 + y2
                ax[i1,i2].hist2d(z[:,x1,y1], z[:,x2,y2], range=[[-3,3],[-3,3]], bins=20)
                ax[i1,i2].set_xticks([])
                ax[i1,i2].set_yticks([])
                if i1 == 3:
                    ax[i1,i2].set_xlabel(rf'$\phi({x2},{y2})$')
                if i2 == 0:
                    ax[i1,i2].set_ylabel(rf'$\phi({x1},{y1})$')
fig.suptitle("Correlations in Various Lattice Sites")
plt.show()

It is also useful to compare the effective action of prior samples ($S_{\text{eff}} \equiv -\log r(z)$) and the true action ($S(z)$). If the prior distribution was already a good model for the true distribution, all samples should have identical action under the prior and true distributions, up to an overall shift. In other words, these should have linear correlation with slope 1.

S_eff = -grab(prior.log_prob(torch_z))
S = grab(phi4_action(torch_z))
fit_b = np.mean(S) - np.mean(S_eff)
print(f'slope 1 linear regression S = -logr + {fit_b:.4f}')

fig, ax = plt.subplots(1,1, dpi=125, figsize=(4,4))
ax.hist2d(S_eff, S, bins=20, range=[[-800, 800], [200,1800]])
xs = np.linspace(-800, 800, num=4, endpoint=True)
ax.plot(xs, xs + fit_b, ':', color='w', label='slope 1 fit')
ax.set_xlabel(r'$S_{\mathrm{eff}} \equiv -\log~r(z)$')
ax.set_ylabel(r'$S(z)$')
ax.set_aspect('equal')
plt.legend(prop={'size': 6})
plt.show()

>>> slope 1 linear regression S = -logr + 1455.4647

As expected, the prior action and the true action are essentially uncorrelated; building in this correlation is the job of the flow $f$.

C. Affine coupling layers
As mentioned earlier, an affine transformation is a particularly simple, yet effective, transform to use within a coupling layer acting on real degrees of freedom. The transformation of the subset of variables $\phi_1$, conditioned on the frozen subset $\phi_2$, is defined as

$$g(\phi_1, \phi_2) = \left( e^{s(\phi_2)} \phi_1 + t(\phi_2), \; \phi_2 \right), \tag{13}$$

with inverse given by:

$$g^{-1}(\phi'_1, \phi'_2) = \left( (\phi'_1 - t(\phi'_2)) \, e^{-s(\phi'_2)}, \; \phi'_2 \right), \tag{14}$$

where $s(\phi_2)$ and $t(\phi_2)$ produce vectors of the same dimension as $\phi_1$ and the operations above are element-wise on these vectors. We define the functions $s$ and $t$ using a feed-forward neural network. The coupling layer leaves $\phi_2$ unchanged. Note that this is just a simple extension of the scaling transformation introduced above, with a constant offset $t(\phi_2)$ added to the transformation.

The Jacobian factor for such an affine transformation is easy to compute (both analytically and numerically). In fact, because $\partial [t(\phi_2)] / \partial \phi_1 = 0$, the Jacobian is the same as for the scaling transformation worked out above. In practice, we work with log probabilities, so we note that the forward and reverse transformations return

$$\text{forward:} \quad \log J(\phi) = \sum_k [s(\phi_2)]_k, \tag{15}$$

$$\text{reverse:} \quad \log J_{\text{reverse}}(\phi') = \sum_k -[s(\phi'_2)]_k. \tag{16}$$

The subsets $\phi_1, \phi_2$ are defined by a mask $m(\vec{n}) \in \{0, 1\}$. In our conventions $m(\vec{n}) = 1$ implies an input to the neural net defining $s$ and $t$, and therefore that the variable on site $\vec{n}$ is an element of the frozen subset $\phi_2$. We choose checkerboard masking, as this intuitively allows sites to influence the transformation of their direct neighbors and build local correlations.

def make_checker_mask(shape, parity):
    checker = torch.ones(shape, dtype=torch.uint8) - parity
    checker[::2, ::2] = parity
    checker[1::2, 1::2] = parity
    return checker.to(torch_device)

print("For example this is the mask for an 8x8 configuration:\n",
    make_checker_mask(lattice_shape, 0))

>>> For example this is the mask for an 8x8 configuration:
...  tensor([[0, 1, 0, 1, 0, 1, 0, 1],
...         [1, 0, 1, 0, 1, 0, 1, 0],
...         [0, 1, 0, 1, 0, 1, 0, 1],
...         [1, 0, 1, 0, 1, 0, 1, 0],
...         [0, 1, 0, 1, 0, 1, 0, 1],
...         [1, 0, 1, 0, 1, 0, 1, 0],
...         [0, 1, 0, 1, 0, 1, 0, 1],
...         [1, 0, 1, 0, 1, 0, 1, 0]], dtype=torch.uint8)

We implement the described coupling layer in the code cell below.

For simplicity in our implementation below, we allow $s(\phi_2)$ and $t(\phi_2)$ to produce outputs for $\phi_2$ as well as $\phi_1$ and then mask them out. We use the same NN to parametrize $s$ and $t$, so they have shared parameters; this does not add formal complications and just makes the model simpler.

Technical note:
Pytorch's implementation of 2D CNNs requires inputs shaped like (batch_size, n_input_channels, L_x, L_y), hence the use of unsqueeze below which adds a fake n_input_channels dimension of length 1. The CNNs return an output shaped like (batch_size, n_output_channels, L_x, L_y); we use n_output_channels == 2, where the two channels are $s(x)$ and $t(x)$.

class AffineCoupling(torch.nn.Module):
    def __init__(self, net, *, mask_shape, mask_parity):
        super().__init__()
        self.mask = make_checker_mask(mask_shape, mask_parity)
        self.net = net
    def forward(self, x):
        x_frozen = self.mask * x
        x_active = (1 - self.mask) * x
        net_out = self.net(x_frozen.unsqueeze(1))
        s, t = net_out[:,0], net_out[:,1]
        fx = (1 - self.mask) * t + x_active * torch.exp(s) + x_frozen
        axes = range(1,len(s.size()))
        logJ = torch.sum((1 - self.mask) * s, dim=tuple(axes))
        return fx, logJ
    def reverse(self, fx):
        fx_frozen = self.mask * fx
        fx_active = (1 - self.mask) * fx
        net_out = self.net(fx_frozen.unsqueeze(1))
        s, t = net_out[:,0], net_out[:,1]
        x = (fx_active - (1 - self.mask) * t) * torch.exp(-s) + fx_frozen
        axes = range(1,len(s.size()))
        logJ = torch.sum((1 - self.mask)*(-s), dim=tuple(axes))
        return x, logJ
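As a quick sanity check (an illustrative sketch using a throwaway one-layer conv net; the networks actually used in the model are built in the next section), reverse should undo forward and the two log-Jacobians should cancel:

# Invertibility check of AffineCoupling with a stand-in network producing the
# two output channels (s, t).
toy_net = torch.nn.Conv2d(1, 2, kernel_size=3, padding=1, padding_mode='circular')
toy_coupling = AffineCoupling(toy_net, mask_shape=lattice_shape, mask_parity=0)
phi_test = prior.sample_n(7)
phi_fwd, logJ_fwd = toy_coupling.forward(phi_test)
phi_back, logJ_rev = toy_coupling.reverse(phi_fwd)
print(torch.allclose(phi_back, phi_test, atol=1e-5),
      grab(torch.max(torch.abs(logJ_fwd + logJ_rev))))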
D. Convolutional neural nets (CNNs)

Any continuous function can be used to define the coupling layer parameters $s$ and $t$. We'll use CNNs because they're cheap and explicitly encode partial spacetime translation symmetry: the parity-preserving translations are exact symmetries of the output distribution, due to the checkerboard subsets. (Note: in [1], fully-connected networks were used as a proof of principle; we find CNNs are generally the better choice.)

Unlike typical uses of CNNs, our flow formalism requires the input and output (spatial) shapes to be identical, so there are no pooling operations and we use stride 1. To implement periodic BCs, we employ circular padding.
Technical note:
In the code below we assume PyTorch >= 1.5.0, where padding semantics changed. If you are willing to downgrade, the requirement that kernels have odd side length can be dropped, and the padding should be changed to padding_size = kernel_size - 1.

def make_conv_net(*, hidden_sizes, kernel_size, in_channels, out_channels, use_final_tanh):
    sizes = [in_channels] + hidden_sizes + [out_channels]
    assert packaging.version.parse(torch.__version__) >= packaging.version.parse('1.5.0')
    assert kernel_size % 2 == 1, 'kernel size must be odd for PyTorch >= 1.5.0'
    padding_size = (kernel_size // 2)
    net = []
    for i in range(len(sizes) - 1):
        net.append(torch.nn.Conv2d(sizes[i], sizes[i+1], kernel_size, padding=padding_size,
            stride=1, padding_mode='circular'))
        if i != len(sizes) - 2:
            net.append(torch.nn.LeakyReLU())
        else:
            if use_final_tanh:
                net.append(torch.nn.Tanh())
    return torch.nn.Sequential(*net)
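As a shape check (illustrative, not part of the original notebook), a network built this way maps one input channel to two output channels, $s$ and $t$, while preserving the lattice shape:

# Build a small conv net and confirm the output shape (batch, 2, L, L).
test_net = make_conv_net(hidden_sizes=[8,8], kernel_size=3, in_channels=1,
    out_channels=2, use_final_tanh=True)
test_out = test_net(prior.sample_n(5).unsqueeze(1))
print(test_out.shape)  # expect torch.Size([5, 2, 8, 8])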
E. Assemble the model

We can construct our model approximating $p(\phi)$ by composing a sequence of these affine coupling layers. We'll use 16 layers, with the checkerboard parity alternating from layer to layer. The CNNs used to compute parameters have kernel size 3×3, which is sufficient to condition on local information in each transform, allowing the flow to build up local correlations in the field configuration. Larger kernel sizes are also possible at increased cost. Each network has one input channel $\phi_2$ and two output channels $(s(\phi_2), t(\phi_2))$.

def make_phi4_affine_layers(*, n_layers, lattice_shape, hidden_sizes, kernel_size):
    layers = []
    for i in range(n_layers):
        parity = i % 2
        net = make_conv_net(in_channels=1, out_channels=2, hidden_sizes=hidden_sizes,
            kernel_size=kernel_size, use_final_tanh=True)
        coupling = AffineCoupling(net, mask_shape=lattice_shape, mask_parity=parity)
        layers.append(coupling)
    return torch.nn.ModuleList(layers)

n_layers = 16
hidden_sizes = [8,8]
kernel_size = 3
layers = make_phi4_affine_layers(lattice_shape=lattice_shape, n_layers=n_layers,
    hidden_sizes=hidden_sizes, kernel_size=kernel_size)
model = {'layers': layers, 'prior': prior}
With a model in hand, we need to optimize the coupling layers to improve the model distribution $q(\phi)$. To do that, we need a way to measure how close the model and true distributions are [$q(\phi)$ vs $p(\phi)$, respectively]. We use a quantity known as the Kullback-Leibler (KL) divergence to do this. The KL divergence is minimized when $p = q$.

Those familiar with flows will note that they are usually trained with the "forward direction" of the KL divergence,

$$D_{KL}(p \| q) \equiv \int d\phi \; p(\phi) \left[ \log p(\phi) - \log q(\phi) \right], \tag{17}$$

which we can estimate with $N$ samples drawn from the target distribution ($\phi_i \sim p$) as

$$\widehat{D}_{KL}(p \| q) = \frac{1}{N} \sum_{i=1}^{N} \left[ \log p(\phi_i) - \log q(\phi_i) \right] \qquad (\phi_i \sim p), \tag{18}$$

corresponding to maximum likelihood estimation with respect to training data from the true distribution. Because training data drawn from $p$ can be scarce in simulations of lattice field theories, we instead make use of the "reverse" KL divergence,

$$D_{KL}(q \| p) \equiv \int d\phi \; q(\phi) \left[ \log q(\phi) - \log p(\phi) \right], \tag{19}$$

which we can estimate using $N$ samples drawn from the model distribution ($\phi_i \sim q$) as

$$\widehat{D}_{KL}(q \| p) = \frac{1}{N} \sum_{i=1}^{N} \left[ \log q(\phi_i) - \log p(\phi_i) \right] \qquad (\phi_i \sim q). \tag{20}$$

Because data need only be sampled from the model distribution, we can optimize $q(\phi)$ without data from $p(\phi)$ [which typically is expensive to generate using standard algorithms like HMC]. This "self-training" protocol then consists of:

1. Drawing samples and density estimates from the model
2. Estimating the reverse KL divergence
3. Using standard stochastic gradient descent methods to iteratively update neural network weights (we'll use the Adam optimizer)

A possible tradeoff of this approach is that the reverse KL is known as mode-seeking / zero-forcing (see e.g. [6]), which means it favors assigning mass to a large mode in the probability density, and places zero mass elsewhere. This could be disadvantageous for multimodal target densities. This problem will be investigated in future work.

def calc_dkl(logp, logq):
    return (logq - logp).mean()  # reverse KL, assuming samples from q

Note that the training step defined below logs a few metrics, including the effective sample size (ESS) defined and explained later.

def train_step(model, action, loss_fn, optimizer, metrics):
    layers, prior = model['layers'], model['prior']
    optimizer.zero_grad()

    x, logq = apply_flow_to_prior(prior, layers, batch_size=batch_size)
    logp = -action(x)
    loss = calc_dkl(logp, logq)
    loss.backward()

    optimizer.step()

    metrics['loss'].append(grab(loss))
    metrics['logp'].append(grab(logp))
    metrics['logq'].append(grab(logq))
    metrics['ess'].append(grab( compute_ess(logp, logq) ))

Caveat: $p(\phi)$ is only known up to normalization, $p(\phi) \propto e^{-S[\phi]}$. Using this unnormalized value shifts the KL divergence by an overall constant. This does not affect training, but without this normalization we cannot know whether we are converging to a good estimate directly from the unnormalized KL.
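As a small standalone illustration of the reverse-KL estimator in Eq. (20) (not part of the original notebook), we can compare a sample estimate against the closed-form KL divergence between two one-dimensional Gaussians:

# D_KL(q||p) for 1D Gaussians q = N(mq, sq^2), p = N(mp, sp^2) has the closed
# form log(sp/sq) + (sq^2 + (mq-mp)^2)/(2*sp^2) - 1/2; estimate it from samples.
mq, sq, mp_, sp = 0.0, 1.0, 1.0, 2.0
samples = np.random.normal(mq, sq, size=100000)
logq_1d = -0.5*((samples - mq)/sq)**2 - np.log(sq) - 0.5*np.log(2*np.pi)
logp_1d = -0.5*((samples - mp_)/sp)**2 - np.log(sp) - 0.5*np.log(2*np.pi)
print(np.mean(logq_1d - logp_1d),
      np.log(sp/sq) + (sq**2 + (mq - mp_)**2)/(2*sp**2) - 0.5)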
Telemetry

We'll measure some observables and diagnostics as we go.

For a batch of samples $\phi_i$, the effective sample size (ESS) is defined as

$$\mathrm{ESS} = \frac{ \left( \frac{1}{N} \sum_i p[\phi_i] / q[\phi_i] \right)^2 }{ \frac{1}{N} \sum_i \left( p[\phi_i] / q[\phi_i] \right)^2 }, \tag{21}$$

where $i$ indexes the samples. This definition normalizes the ESS to live in the range $[0, 1]$, and it can be evaluated knowing $p(x)$ only up to normalization, because the unknown constant cancels in the ratio. Larger values indicate a better effective sampling of the desired distribution, and ESS = 1 corresponds to a perfect independent draw from the desired distribution for each sample.

Why not use this directly to train? It's much noisier than the KL divergences, so in practice we find it's less effective as a loss function.

Caution:
The ESS is biased towards larger values when estimated using small batches of samples. Much like measures of autocorrelation time in MCMC approaches, a sufficiently large sample size is needed to determine whether any regions of sample space are missed.

def compute_ess(logp, logq):
    logw = logp - logq
    log_ess = 2*torch.logsumexp(logw, dim=0) - torch.logsumexp(2*logw, dim=0)
    ess_per_cfg = torch.exp(log_ess) / len(logw)
    return ess_per_cfg

def print_metrics(history, avg_last_N_epochs):
    print(f'== Era {era} | Epoch {epoch} metrics ==')
    for key, val in history.items():
        avgd = np.mean(val[-avg_last_N_epochs:])
        print(f'\t{key} {avgd:g}')
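As a sanity check of compute_ess (an illustrative addition, not in the original notebook), when the model density matches the target up to an overall constant, all importance weights are equal and the ESS per configuration is exactly 1:

# Constant offset between logp and logq -> equal weights -> ESS = 1.
fake_logp = torch.zeros(100)
fake_logq = torch.full((100,), 3.21)
print(grab(compute_ess(fake_logp, fake_logq)))  # expect 1.0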
Do the training!

We find that this model trains to ESS ∼ 40% after 25 eras, which takes roughly 5 minutes on a Colab GPU. You can either load a pre-trained model or train your own based on the flag below.

use_pretrained = True
Below we summarize all parameters discussed so far for the sake of convenience.
L = 8
lattice_shape = (L,L)
M2 = -4.0
lam = 8.0
phi4_action = ScalarPhi4Action(M2=M2, lam=lam)

prior = SimpleNormal(torch.zeros(lattice_shape), torch.ones(lattice_shape))

n_layers = 16
hidden_sizes = [8,8]
kernel_size = 3
layers = make_phi4_affine_layers(lattice_shape=lattice_shape, n_layers=n_layers,
    hidden_sizes=hidden_sizes, kernel_size=kernel_size)
model = {'layers': layers, 'prior': prior}

base_lr = .001
optimizer = torch.optim.Adam(model['layers'].parameters(), lr=base_lr)

As with any good cooking show, we made a trained version of the model weights ahead of time (loaded if use_pretrained == True).

if use_pretrained:
    print('Loading pre-trained model')
    phi4_trained_weights = torch.load(io.BytesIO(base64.b64decode(b"""
N_era = 25
N_epoch = 100
batch_size = 64
print_freq = N_epoch
plot_freq = 1

history = {'loss': [], 'logp': [], 'logq': [], 'ess': []}

if not use_pretrained:
    [plt.close(plt.figure(fignum)) for fignum in plt.get_fignums()]
    live_plot = init_live_plot()
    for era in range(N_era):
        for epoch in range(N_epoch):
            train_step(model, phi4_action, calc_dkl, optimizer, history)
            if epoch % print_freq == 0:
                print_metrics(history, avg_last_N_epochs=print_freq)
            if epoch % plot_freq == 0:
                update_plots(history, **live_plot)
else:
    print('Skipping training')

>>> Skipping training

Serialize the weights to distribute the model in this state.

print('Model weights blob:\n===')
serialized_model = io.BytesIO()
torch.save(model['layers'].state_dict(), serialized_model)
print(base64.b64encode(serialized_model.getbuffer()).decode('utf-8'))
print('===')
With a trained model, we now directly draw samples from the model and check their quality below. We find samples that have regions with smoother, correlated fluctuations, in comparison to the raw noise from the prior distribution seen above.
Caution:
These samples are drawn from a distribution that only approximates the desired one, so we stress that one should not measure and report observables directly using these model samples, as this would introduce bias. However, as discussed later, the reported probability density from the models allows us to either reweight or resample, producing unbiased estimates of observables when these steps are taken.

torch_x, torch_logq = apply_flow_to_prior(prior, layers, batch_size=1024)
x = grab(torch_x)

fig, ax = plt.subplots(4,4, dpi=125, figsize=(4,4))
for i in range(4):
    for j in range(4):
        ind = i*4 + j
        ax[i,j].imshow(np.tanh(x[ind]), vmin=-1, vmax=1, cmap='viridis')
        ax[i,j].axes.xaxis.set_visible(False)
        ax[i,j].axes.yaxis.set_visible(False)
plt.show()

We further see below that the model effective action ($-\log q(x)$) is very close to the true action, once we account for an overall shift. This offset corresponds to the unknown multiplicative constant $1/Z$ that our training is insensitive to. There is still some variation, especially in the regions with lower density, indicating the tails of the distribution are not perfectly modeled.

S_eff = -grab(torch_logq)
S = grab(phi4_action(torch_x))
fit_b = np.mean(S) - np.mean(S_eff)
print(f'slope 1 linear regression S = S_eff + {fit_b:.4f}')

fig, ax = plt.subplots(1,1, dpi=125, figsize=(4,4))
ax.hist2d(S_eff, S, bins=20, range=[[5, 35], [-5, 25]])
ax.set_xlabel(r'$S_{\mathrm{eff}} = -\log~q(x)$')
ax.set_ylabel(r'$S(x)$')
ax.set_aspect('equal')
xs = np.linspace(5, 35, num=4, endpoint=True)
ax.plot(xs, xs + fit_b, ':', color='w', label='slope 1 fit')
plt.legend(prop={'size': 6})
plt.show()

>>> slope 1 linear regression S = S_eff + -8.6417

We can see how the model density evolved over training time to become well-correlated with $p(x)$ (if use_pretrained == False).

if not use_pretrained:
    fig, axes = plt.subplots(1, 10, dpi=125, sharey=True, figsize=(10, 1))
    logq_hist = np.array(history['logq']).reshape(N_era, -1)[::N_era//10]
    logp_hist = np.array(history['logp']).reshape(N_era, -1)[::N_era//10]
    for i, (ax, logq, logp) in enumerate(zip(axes, logq_hist, logp_hist)):
        ax.hist2d(-logq, -logp, bins=20, range=[[5, 35], [-5, 25]])
        if i == 0:
            ax.set_ylabel(r'$S(x)$')
        ax.set_xlabel(r'$S_{\mathrm{eff}}$')
        ax.set_title(f'Era {i * (N_era//10)}')
        ax.set_xticks([])
        ax.set_yticks([])
        ax.set_aspect('equal')
    plt.show()
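Before constructing a Markov chain in the next subsection, here is a minimal sketch of the reweighting alternative (an illustrative addition, not in the original notebook): observables can be estimated directly from model samples by weighting each sample with $w_i \propto p(\phi_i)/q(\phi_i)$.

# Self-normalized importance weights from the model samples drawn above; the
# unknown normalization of p cancels after normalizing the weights.
logw = (-S) - (-S_eff)            # log p - log q, up to the constant log Z
w = np.exp(logw - np.max(logw))
w = w / np.sum(w)
obs = np.mean(x**2, axis=(1,2))   # example observable: phi^2 averaged over sites
print('reweighted <phi^2> =', np.sum(w * obs))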
Independence Metropolis

To produce unbiased estimates of observables, either reweighting or resampling can be performed according to the weights $p(\phi_i)/q(\phi_i)$. See Sec. IIA of [7] for a discussion of the tradeoffs in this choice. There are a number of possible resampling approaches; we choose to use the model samples as proposals in a Markov Chain Monte Carlo.

We'll use the Metropolis-Hastings (MH) algorithm to construct the asymptotically exact Markov chain sampler. Generally, the MH algorithm consists of proposing an updated configuration $\phi'$ from the current configuration $\phi_{i-1}$ and stochastically accepting or rejecting the configuration with probability

$$p_{\text{accept}}(\phi' | \phi_{i-1}) = \min\left( 1, \; \frac{T(\phi' \to \phi_{i-1})}{T(\phi_{i-1} \to \phi')} \frac{p(\phi')}{p(\phi_{i-1})} \right). \tag{22}$$

Here $T(x \to y)$ is the probability of proposing config $y$ starting from $x$. If accepted, we define the next configuration in the chain to be $\phi_i = \phi'$; if rejected, the last configuration is repeated and $\phi_i = \phi_{i-1}$. Often, $p(\phi) \sim e^{-S}$ is computationally tractable but $T$ is not, so algorithms are engineered to have symmetric proposal probabilities such that $T(x \to y) = T(y \to x)$ and the factors in $p_{\text{accept}}$ cancel, leading to the familiar Metropolis formula $p^{\text{symm}}_{\text{accept}} = \min(1, \exp[-\Delta S])$.

We instead propose updates by drawing samples from our model independently of the previous configuration, so $T(x \to y) = T(y) = q(y)$, where $q(y)$ is the model density computed alongside sample $y$. The resulting proposal probability $T$ is therefore not symmetric but is known. We thus must accept or reject based on

$$p_{\text{accept}}(\phi' | \phi_{i-1}) = \min\left( 1, \; \frac{q(\phi_{i-1})}{p(\phi_{i-1})} \frac{p(\phi')}{q(\phi')} \right). \tag{23}$$

This procedure is known as the independence Metropolis sampler. Note that rejections occur proportionally to how poorly the model density matches the desired density; if $p(\phi') = q(\phi')$, all (independent) proposals are accepted, and the chain is a sequence of totally uncorrelated samples. As the rejection rate increases, the autocorrelation time does as well.

Below we build the MH algorithm in two stages. First, we need some way of generating an ordered list of samples using our model. The code below defines a generator which does this by drawing batches efficiently in parallel, then iterating over them one at a time.

def serial_sample_generator(model, action, batch_size, N_samples):
    layers, prior = model['layers'], model['prior']
    layers.eval()
    x, logq, logp = None, None, None
    for i in range(N_samples):
        batch_i = i % batch_size
        if batch_i == 0:
            # we're out of samples to propose, generate a new batch
            x, logq = apply_flow_to_prior(prior, layers, batch_size=batch_size)
            logp = -action(x)
        yield x[batch_i], logq[batch_i], logp[batch_i]

Now we need to iterate over the samples and construct them into a Markov chain. The code below implements the Metropolis independence sampler to do this.
def make_mcmc_ensemble(model, action, batch_size, N_samples):
    history = {'x': [], 'logq': [], 'logp': [], 'accepted': []}

    # build Markov chain
    sample_gen = serial_sample_generator(model, action, batch_size, N_samples)
    for new_x, new_logq, new_logp in sample_gen:
        if len(history['logp']) == 0:
            # always accept the first proposal
            accepted = True
        else:
            # Metropolis acceptance condition
            last_logp = history['logp'][-1]
            last_logq = history['logq'][-1]
            p_accept = torch.exp((new_logp - new_logq) - (last_logp - last_logq))
            p_accept = min(1, p_accept)
            draw = torch.rand(1)
            if draw < p_accept:
                accepted = True
            else:
                accepted = False
                new_x = history['x'][-1]
                new_logp = last_logp
                new_logq = last_logq
        # update Markov chain
        history['logp'].append(new_logp)
        history['logq'].append(new_logq)
        history['x'].append(new_x)
        history['accepted'].append(accepted)
    return history

Finally, the cell below uses the code above to generate an ensemble of configurations using our trained flow model. You should see a 30-40% accept rate.

ensemble_size = 8192
phi4_ens = make_mcmc_ensemble(model, phi4_action, 64, ensemble_size)
print("Accept rate:", np.mean(phi4_ens['accepted']))

>>> Accept rate: 0.458984375

The generated ensemble is asymptotically unbiased. As an example of an observable measurement, we measure the two-point susceptibility below and compare against a value determined from a large HMC ensemble evaluated at the same choice of parameters.

n_therm = 512
cfgs = np.stack(list(map(grab, phi4_ens['x'])), axis=0)[n_therm:]
C = 0
for x in range(L):
    for y in range(L):
        C = C + cfgs*np.roll(cfgs, (-x, -y), axis=(1,2))
X = np.mean(C, axis=(1,2))

def bootstrap(x, *, Nboot, binsize):
    boots = []
    x = x.reshape(-1, binsize, *x.shape[1:])
    for i in range(Nboot):
        boots.append(np.mean(x[np.random.randint(len(x), size=len(x))], axis=(0,1)))
    return np.mean(boots), np.std(boots)

X_mean, X_err = bootstrap(X, Nboot=100, binsize=4)
print(f'Two-point susceptibility = {X_mean:.2f} +/- {X_err:.2f}')
print(f'... vs HMC estimate = 0.75 +/- 0.01')

>>> Two-point susceptibility = 0.79 +/- 0.02
... ... vs HMC estimate = 0.75 +/- 0.01

V. APPLICATION 2: U(1) GAUGE THEORY IN 2D
As a second example, we train a flow to sample distributions for U(1) gauge theory in two spacetime dimensions. The desired physical distributions are symmetric under a large gauge symmetry group. We can construct flows which explicitly respect this symmetry by enforcing two requirements:

1. The prior distribution is gauge-invariant. We'll use the uniform distribution (with respect to the Haar measure) on each gauge link. For U(1), this is just the uniform distribution in $[0, 2\pi]^{N_d V}$.
2. Coupling layers are gauge equivariant (commute with gauge transformations).

If both conditions are satisfied, this guarantees a gauge invariant output distribution. See [3] for details.

CAUTION: many variable names are reused from the previous section.
A. Physical theory
The continuum theory consists of a real-valued field $A_\mu(x)$ as a function of 2D coordinate $x$, with Lorentz index $\mu$. The lattice regularization of the theory replaces this field with a collection of parallel transporters

$$U_\mu(x) \equiv \exp\left[ i \int_{x'=x}^{x'=x+\hat{\mu}} A_\mu(x') \right], \tag{24}$$

with each $U_\mu(x)$ living on the lattice link connecting $x$ to $x + \hat{\mu}$, such that there are $2V$ independent links on a $V$-site 2D lattice. As unit-modulus complex numbers, we can consider $U_\mu(x)$ to live in U(1) and $A_\mu(x)$ to live in the algebra $\mathfrak{u}(1)$.

FIG. 2. Lattice discretization of a gauge theory.
We can write in the angular representation $U_\mu(\vec{n}) = \exp[i\theta_\mu(\vec{n})]$ where $\theta_\mu(\vec{n}) \in \mathbb{R}$. We will work with the real-valued angles $\theta_\mu(\vec{n}) = \arg(U_\mu(\vec{n})) \in [0, 2\pi]$ rather than the unit-modulus complex number $U_\mu(\vec{n})$. Using this parsimonious representation of lattice DOF saves us from having to worry about maintaining the normalization of the complex $U$, at the cost of having to deal with discontinuities at the boundary $\theta = 2\pi \equiv 0$. We again initialize a couple of example configurations, now of links on an 8 × 8 lattice.

L = 8
lattice_shape = (L,L)
link_shape = (2,L,L)

u1_ex1 = 2*np.pi*np.random.random(size=link_shape).astype(float_dtype)
u1_ex2 = 2*np.pi*np.random.random(size=link_shape).astype(float_dtype)
cfgs = torch.from_numpy(np.stack((u1_ex1, u1_ex2), axis=0)).to(torch_device)

The Euclidean action of the continuum theory can be written in terms of the field strength $F_{\mu\nu} \equiv \partial_\mu A_\nu - \partial_\nu A_\mu$ as

$$S^E_{\text{cont}}[A] = \int d^d x \left[ -\sum_{\mu<\nu} F_{\mu\nu}(x)^2 \right], \tag{25}$$

which can be regularized on the lattice in terms of parallel transporters $U_\mu(\vec{n})$,

$$S^E_{\text{latt}}[U] = -\beta \sum_{\vec{n}} \left[ \sum_{\mu<\nu} \mathrm{Re} \, P_{\mu\nu}(\vec{n}) \right], \quad \text{where} \quad P_{\mu\nu}(\vec{n}) \equiv U_\mu(\vec{n}) \, U_\nu(\vec{n}+\hat{\mu}) \, U^\dagger_\mu(\vec{n}+\hat{\nu}) \, U^\dagger_\nu(\vec{n}) \tag{26}$$

is the "plaquette", the simplest possible closed loop of links on the lattice, a 1 × 1 square. $S^E_{\text{latt}}$ is known as the Wilson gauge action.

FIG. 3. A plaquette.
The expressions above are valid for non-Abelian gauge theory in an arbitrary number of spacetime dimensions, but in this notebook we are interested in the Abelian U(1) gauge group. This immediately leads to some simplifications. In 2D, $P_{\mu\nu} = P_{01}$ is the only orientation of plaquette and the sum $\sum_{\mu<\nu}$ is trivial. In the angular representation, as discussed above, the link has the form $U_\mu(\vec{n}) = \exp[i\theta_\mu(\vec{n})]$, hence the plaquette can be written as $P_{\mu\nu}(\vec{n}) = \exp[i\theta_{\mu\nu}(\vec{n})]$, where $\theta_{\mu\nu}(\vec{n}) = \theta_\mu(\vec{n}) + \theta_\nu(\vec{n}+\hat{\mu}) - \theta_\mu(\vec{n}+\hat{\nu}) - \theta_\nu(\vec{n})$, and the Wilson gauge action for 2D U(1) gauge theory reduces to

$$S^E_{\text{latt}}[U] = -\beta \sum_{\vec{n}} \cos\left[ \theta_{\mu\nu}(\vec{n}) \right]. \tag{27}$$

Since we are working in the angular representation, in the code we are always dealing with $\theta_{\mu\nu}$ rather than $P_{\mu\nu}$. The function to calculate $\theta_{\mu\nu}$ in terms of link angles looks like:

def compute_u1_plaq(links, mu, nu):
    """Compute U(1) plaquettes in the (mu,nu) plane given `links` = arg(U)"""
    return (links[:,mu] + torch.roll(links[:,nu], -1, mu+1)
            - torch.roll(links[:,mu], -1, nu+1) - links[:,nu])

The gauge action in terms of angular variables is then:

class U1GaugeAction:
    def __init__(self, beta):
        self.beta = beta
    def __call__(self, cfgs):
        Nd = cfgs.shape[1]
        action_density = 0
        for mu in range(Nd):
            for nu in range(mu+1,Nd):
                action_density = action_density + torch.cos(compute_u1_plaq(cfgs, mu, nu))
        return -self.beta * torch.sum(action_density, dim=tuple(range(1,Nd+1)))

print(U1GaugeAction(beta=1.0)(cfgs))

>>> tensor([-2.3814, 3.3746])

beta = 1
u1_action = U1GaugeAction(beta)

This action is invariant under gauge transformations

$$U_\mu(\vec{n}) \to e^{i\alpha(\vec{n})} \, U_\mu(\vec{n}) \, e^{-i\alpha(\vec{n}+\hat{\mu})} \tag{28}$$

or, in terms of the angular variables,

$$\theta_\mu(\vec{n}) \to \alpha(\vec{n}) + \theta_\mu(\vec{n}) - \alpha(\vec{n}+\hat{\mu}) \tag{29}$$

for any real-valued lattice field $\alpha(\vec{n})$ (i.e. an independent real number for each lattice site $\vec{n}$). The set of gauge transforms composes a large symmetry group ($N_d V$-dimensional, for U(1)) that we will explicitly encode in the normalizing flow model. Encoding this symmetry exactly improves data efficiency of training.

We can check numerically that our Pytorch implementation of the action is invariant with respect to an arbitrary gauge transformation.

def gauge_transform(links, alpha):
    for mu in range(len(links.shape[2:])):
        links[:,mu] = alpha + links[:,mu] - torch.roll(alpha, -1, mu+1)
    return links

def random_gauge_transform(x):
    Nconf, VolShape = x.shape[0], x.shape[2:]
    return gauge_transform(x, 2*np.pi*torch.rand((Nconf,) + VolShape))

cfgs_transformed = random_gauge_transform(cfgs)
print(u1_action(cfgs), ' vs ', u1_action(cfgs_transformed))
assert np.allclose(grab(u1_action(cfgs)), grab(u1_action(cfgs_transformed))), \
    'gauge transform should be a symmetry of the action'

>>> tensor([-2.3814, 3.3746]) vs tensor([-2.3814, 3.3746])

Gauge theory in 2D is a bit peculiar: in the lattice regularization, each plaquette fluctuates independently except for exponentially-suppressed correlations due to periodic boundary conditions (i.e. in the infinite volume limit the correlation length is zero). For U(1) gauge theory in particular, there is also a well-defined topological charge on the lattice,

$$Q \equiv \frac{1}{2\pi} \sum_{\vec{n}} \arg(P(\vec{n})), \qquad Q \in \mathbb{Z}, \tag{30}$$

where $\arg(\cdot) \in [-\pi, \pi]$.
This topological charge mixes slowly with usual MCMC techniques, and we find that directly sampling using flow models vastly improves estimates for topological quantities.

def topo_charge(x):
    P01 = torch_wrap(compute_u1_plaq(x, mu=0, nu=1))
    axes = tuple(range(1, len(P01.shape)))
    return torch.sum(P01, dim=axes) / (2*np.pi)

with np.printoptions(suppress=True):
    print(f'cfg topological charges = {grab(topo_charge(cfgs))}')
Q = grab(topo_charge(cfgs))
assert np.allclose(Q, np.around(Q), atol=1e-6), 'topological charge must be an integer'

>>> cfg topological charges = [3. 0.]

Details on the formulation of Lattice Gauge Theory may be found in the books [8] and [9].
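As a further check (an illustrative sketch, not part of the original notebook), the topological charge is built from plaquettes, which are themselves gauge invariant, so it should be unchanged by a gauge transformation. We apply the check to a fresh copy of the example links, since gauge_transform updates its argument in place:

# Gauge invariance of the topological charge on a fresh copy of the links.
fresh = torch.from_numpy(np.stack((u1_ex1, u1_ex2), axis=0)).to(torch_device)
Q_before = topo_charge(fresh)
Q_after = topo_charge(random_gauge_transform(fresh.clone()))
print(grab(Q_before), grab(Q_after))  # should agree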
B. Prior distribution
We use a uniform distribution with respect to the Haar measure. For U(1), this just corresponds to a uniform distribution over $\arg(U_\mu(\vec{n})) = \theta_\mu(\vec{n}) \in [0, 2\pi]$. This is easy to sample, as well as gauge invariant. Note: in the implementation below, we define the Haar measure normalized to total volume $2\pi$; this is an irrelevant normalization having no effect on training or sampling.

class MultivariateUniform(torch.nn.Module):
    """Uniformly draw samples from [a,b]"""
    def __init__(self, a, b):
        super().__init__()
        self.dist = torch.distributions.uniform.Uniform(a, b)
    def log_prob(self, x):
        axes = range(1, len(x.shape))
        return torch.sum(self.dist.log_prob(x), dim=tuple(axes))
    def sample_n(self, batch_size):
        return self.dist.sample((batch_size,))

prior = MultivariateUniform(torch.zeros(link_shape), 2*np.pi*torch.ones(link_shape))

z = prior.sample_n(17)
print(f'z.shape = {z.shape}')
print(f'log r(z) = {grab(prior.log_prob(z))}')

>>> z.shape = torch.Size([17, 2, 8, 8])
... log r(z) = [-235.2482645 -235.2482645 -235.2482645 -235.2482645 -235.2482645
...  -235.2482645 -235.2482645 -235.2482645 -235.2482645 -235.2482645
...  -235.2482645 -235.2482645 -235.2482645 -235.2482645 -235.2482645
...  -235.2482645 -235.2482645]
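The constant value printed above is easy to verify by hand (an illustrative check, not in the original notebook): the density is $(2\pi)^{-N_d V}$, so every sample has $\log r(z) = -N_d V \log(2\pi)$.

# Expected constant log-density of the uniform prior over 2 * L^2 link angles.
print(-2 * L**2 * np.log(2*np.pi))  # compare with the values printed above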
C. Gauge equivariant coupling layers

Recall, our goal is to produce a gauge invariant output distribution, which can be achieved using gauge equivariant coupling layers combined with a gauge invariant prior distribution. A coupling layer is gauge equivariant if applying a gauge transformation commutes with application of the coupling layer. Consider abstractly factoring the degrees of freedom in the lattice gauge theory into pure-gauge and gauge-invariant degrees of freedom. Under this factorization, a gauge transformation only affects pure-gauge degrees of freedom, and thus a transformation acting only on gauge invariant quantities will be a gauge equivariant transformation.

Constructing coupling layers that transform links in a way that only affects gauge invariant quantities is not obvious, since these quantities are not necessarily in 1:1 correspondence with the gauge links, which are the lattice degrees of freedom. In [3], we presented a construction that resolves this issue by defining how links should be transformed to produce a transformation of the gauge-invariant spectra of untraced Wilson loops. The general case is worked out there; here we consider the special case of U(1) gauge theory, where we focus on 1 × 1 loops, i.e. plaquettes: an inner coupling layer $g$ transforms $P_{\mu\nu}(\vec{n}) \to P'_{\mu\nu}(\vec{n})$ (see Figure below). The update to each plaquette can be uniquely "pushed onto" a corresponding link if we transform few enough plaquettes. For U(1) gauge theory, pushing updates from a plaquette to a contained link is easy due to the Abelian nature of the group. If the inner coupling layer maps $P \to P'$, then the contained link $U$ should be updated as $U \to U' = P' P^{-1} U$. This enacts the desired transformation on the plaquette,

$$P = U V \;\to\; P' P^{-1} U V = P' (V^{-1} U^{-1}) U V = P', \tag{31}$$

where $V$ is the remaining product of links (the "staple") defining the plaquette. A similar expression applies when the plaquette is defined in terms of $U^{-1}$ instead. Note that this "passively" transforms any other plaquettes containing $U$.

FIG. 4. Equivariant action by "pushing" plaquette updates onto links in each coupling layer $g_i$.

This transformation is correctly gauge equivariant if, in addition, the inner flow updating $P_{\mu\nu}(\vec{n})$ depends only on frozen gauge invariant quantities. In comparison to scalar field theory, we must define three disjoint subsets of degrees of freedom to be passed to the inner coupling layer: active, passive, and frozen. The frozen subset works the same as before; it is not updated and can be used as input to the neural nets defining the parameters. The active subset is actively transformed by the coupling layer using those parameters, and its transformation will be pushed onto corresponding links. The passively updated subset contains plaquettes that include an updated link but are not directly transformed by the inner layer nor used as inputs to the neural nets defining the parameters.

A gauge equivariant coupling layer for U(1) gauge theory is defined below, with a specific choice of 1:1 mapping between updated links and plaquettes: each active plaquette $P(\vec{n})$ contains exactly one link to update, either the link on the left $U_1(\vec{n})$ or the link at the bottom $U_0(\vec{n})$, specified implicitly by masks (we discuss our choice of masking pattern and define the relevant functions in the next section).
Translating to the angular variables used in code to represent the links, these updates look like

    θ_0(n) → θ_0(n) + δ(n),
    θ_1(n) → θ_1(n) − δ(n),    (32)

where δ(n) = θ'_P(n) − θ_P(n) is the change in the corresponding plaquette angle.
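To make the "pushing" concrete, the small sketch below checks Eq. (32) numerically: adding δ to a single link angle θ_0(n) shifts the plaquette angle P(n) by +δ (the active plaquette), shifts the neighboring plaquette at n − ê_1 by −δ (the passive update), and leaves all others untouched. The helper plaq_angles is a throwaway stand-in, assuming the plaquette convention θ_P(n) = θ_0(n) + θ_1(n+ê_0) − θ_0(n+ê_1) − θ_1(n) used throughout; the tutorial's own compute_u1_plaq plays this role in the actual coupling layer.

def plaq_angles(theta):
    # plaquette angles theta_P(n) from link angles theta[:, mu, x0, x1]
    return torch_wrap(theta[:, 0] + torch.roll(theta[:, 1], -1, dims=1)
                      - torch.roll(theta[:, 0], -1, dims=2) - theta[:, 1])

_theta = prior.sample_n(1)
_theta_new = _theta.clone()
_delta = 0.3
_theta_new[0, 0, 3, 5] += _delta  # push delta onto the link U_0(n) at n = (3,5)
_dP = torch_wrap(plaq_angles(_theta_new) - plaq_angles(_theta))
print('change of active plaquette P(3,5): ', grab(_dP[0, 3, 5]))   # ~ +0.3
print('change of passive plaquette P(3,4):', grab(_dP[0, 3, 4]))   # ~ -0.3
print('number of plaquettes affected:     ', int(torch.sum(torch.abs(_dP) > 1e-8)))  # 2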
Note: we define the gauge equivariant coupling layer in terms of an inner coupling layer plaq_coupling, which we will define and discuss later.

class GaugeEquivCouplingLayer(torch.nn.Module):
    """U(1) gauge equiv coupling layer defined by `plaq_coupling` acting on plaquettes."""
    def __init__(self, *, lattice_shape, mask_mu, mask_off, plaq_coupling):
        super().__init__()
        link_mask_shape = (len(lattice_shape),) + lattice_shape
        self.active_mask = make_2d_link_active_stripes(link_mask_shape, mask_mu, mask_off)
        self.plaq_coupling = plaq_coupling
    def forward(self, x):
        plaq = compute_u1_plaq(x, mu=0, nu=1)
        new_plaq, logJ = self.plaq_coupling(plaq)
        delta_plaq = new_plaq - plaq
        delta_links = torch.stack((delta_plaq, -delta_plaq), dim=1)  # signs for mu=0 and mu=1 links
        fx = self.active_mask * torch_mod(delta_links + x) + (1-self.active_mask) * x
        return fx, logJ
    def reverse(self, fx):
        new_plaq = compute_u1_plaq(fx, mu=0, nu=1)
        plaq, logJ = self.plaq_coupling.reverse(new_plaq)
        delta_plaq = plaq - new_plaq
        delta_links = torch.stack((delta_plaq, -delta_plaq), dim=1)
        x = self.active_mask * torch_mod(delta_links + fx) + (1-self.active_mask) * fx
        return x, logJ

The transformation of U_0(n) follows directly from the formula above, but the formula for U_1(n) requires additional clarification. According to the explanation above we would need to update U → U' = P' P⁻¹ U for the plaquette in which U appears with positive orientation; however, it is conventional in lattice QCD to work only with positively oriented plaquettes, and U_1(n) enters P(n) daggered. Taking the dagger of the update rule (and using P^† = P⁻¹ for U(1)), the transformation becomes U^† → U'^† = U^† P⁻¹ P'. In the angular representation this simply takes the form θ_1(n) → θ_1(n) − δθ(n), where δθ(n) = θ'_P − θ_P.

The masking pattern and choice of gauge invariant quantities for the inner update could all be generalized. See also [7] for details on the non-Abelian version of this equivariant construction.

D. Gauge-equivariant masking patterns
There are many choices of masking patterns that allow updates to be pushed onto links. We use this one because it is simple and strikes a good balance between updating as many links as possible (the number of active links) and having sufficient information to make well-informed updates (the number of frozen plaquettes). Optimal masking pattern structure in higher dimensions will be explored in future work. We can update all links on the lattice by composing coupling layers with different mask offsets and directions.
FIG. 5. Masking pattern for a single coupling layer, with active links indicated with dotted lines and passive links with solid lines.
For the links, we need a mask that will pick out the "active" links to be updated. We only update links in one direction at a time, as indicated in the figure above. The masking pattern for links in the updated direction looks like stripes along the updated direction, spaced by 4 lattice units. For links in the other direction, it is all 0s.

def make_2d_link_active_stripes(shape, mu, off):
    """Stripes mask looks like in the `mu` channel (mu-oriented links)::

      1 0 0 0 1 0 0 0 1 0 0
      1 0 0 0 1 0 0 0 1 0 0
      1 0 0 0 1 0 0 0 1 0 0
      1 0 0 0 1 0 0 0 1 0 0

    where vertical is the `mu` direction, and the pattern is offset in the nu
    direction by `off` (mod 4). The other channel is identically 0.
    """
    assert len(shape) == 2+1, 'need to pass shape suitable for 2D gauge theory'
    assert shape[0] == len(shape[1:]), 'first dim of shape must be Nd'
    assert mu in (0,1), 'mu must be 0 or 1'
    mask = np.zeros(shape).astype(np.uint8)
    if mu == 0:
        mask[mu,:,0::4] = 1
    elif mu == 1:
        mask[mu,0::4] = 1
    nu = 1-mu
    mask = np.roll(mask, off, axis=nu+1)
    return torch.from_numpy(mask.astype(float_dtype)).to(torch_device)

Before we define the masking patterns for plaquettes, let's define a few helper functions.

def make_single_stripes(shape, mu, off):
    """::

      1 0 0 0 1 0 0 0 1 0 0
      1 0 0 0 1 0 0 0 1 0 0
      1 0 0 0 1 0 0 0 1 0 0
      1 0 0 0 1 0 0 0 1 0 0

    where vertical is the `mu` direction. The stripe of 1s is repeated every 4 sites.
    The pattern is offset perpendicular to the mu direction by `off` (mod 4).
    """
    assert len(shape) == 2, 'need to pass 2D shape'
    assert mu in (0,1), 'mu must be 0 or 1'
    mask = np.zeros(shape).astype(np.uint8)
    if mu == 0:
        mask[:,0::4] = 1
    elif mu == 1:
        mask[0::4] = 1
    mask = np.roll(mask, off, axis=1-mu)
    return torch.from_numpy(mask).to(torch_device)

def make_double_stripes(shape, mu, off):
    """Double stripes mask looks like::

      1 1 0 0 1 1 0 0
      1 1 0 0 1 1 0 0
      1 1 0 0 1 1 0 0
      1 1 0 0 1 1 0 0

    where vertical is the `mu` direction. The pattern is offset perpendicular
    to the mu direction by `off` (mod 4).
    """
    assert len(shape) == 2, 'need to pass 2D shape'
    assert mu in (0,1), 'mu must be 0 or 1'
    mask = np.zeros(shape).astype(np.uint8)
    if mu == 0:
        mask[:,0::4] = 1
        mask[:,1::4] = 1
    elif mu == 1:
        mask[0::4] = 1
        mask[1::4] = 1
    mask = np.roll(mask, off, axis=1-mu)
    return torch.from_numpy(mask).to(torch_device)

The masking patterns for active, passive, and frozen plaquettes are stripes in the same direction as the active link mask, with appropriate relative offsets to ensure these three subsets give a disjoint partition of all plaquettes. The active plaquettes are the ones ahead of each active link in the updated direction, while passive plaquettes are the ones behind. All other plaquettes are frozen.

def make_plaq_masks(mask_shape, mask_mu, mask_off):
    mask = {}
    mask['frozen'] = make_double_stripes(mask_shape, mask_mu, mask_off+1)
    mask['active'] = make_single_stripes(mask_shape, mask_mu, mask_off)
    mask['passive'] = 1 - mask['frozen'] - mask['active']
    return mask

_test_plaq_masks = make_plaq_masks((8,8), 0, mask_off=1)
print('Frozen (fed into NNs)')
print(_test_plaq_masks['frozen'])
print('Active (driving the link update)')
print(_test_plaq_masks['active'])
print('Passive (updated as a result of link update)')
print(_test_plaq_masks['passive'])

>>> Frozen (fed into NNs)
... tensor([[0, 0, 1, 1, 0, 0, 1, 1],
...         [0, 0, 1, 1, 0, 0, 1, 1],
...         [0, 0, 1, 1, 0, 0, 1, 1],
...         [0, 0, 1, 1, 0, 0, 1, 1],
...         [0, 0, 1, 1, 0, 0, 1, 1],
...         [0, 0, 1, 1, 0, 0, 1, 1],
...         [0, 0, 1, 1, 0, 0, 1, 1],
...         [0, 0, 1, 1, 0, 0, 1, 1]], dtype=torch.uint8)
... Active (driving the link update)
... tensor([[0, 1, 0, 0, 0, 1, 0, 0],
...         [0, 1, 0, 0, 0, 1, 0, 0],
...         [0, 1, 0, 0, 0, 1, 0, 0],
...         [0, 1, 0, 0, 0, 1, 0, 0],
...         [0, 1, 0, 0, 0, 1, 0, 0],
...         [0, 1, 0, 0, 0, 1, 0, 0],
...         [0, 1, 0, 0, 0, 1, 0, 0],
...         [0, 1, 0, 0, 0, 1, 0, 0]], dtype=torch.uint8)
... Passive (updated as a result of link update)
... tensor([[1, 0, 0, 0, 1, 0, 0, 0],
...         [1, 0, 0, 0, 1, 0, 0, 0],
...         [1, 0, 0, 0, 1, 0, 0, 0],
...         [1, 0, 0, 0, 1, 0, 0, 0],
...         [1, 0, 0, 0, 1, 0, 0, 0],
...         [1, 0, 0, 0, 1, 0, 0, 0],
...         [1, 0, 0, 0, 1, 0, 0, 0],
...         [1, 0, 0, 0, 1, 0, 0, 0]], dtype=torch.uint8)

E. Flowing plaquettes gauge invariantly
The inner coupling layer simply needs to define an expressive invertible function on the plaquette angular variables. However, we cannot directly use scaling or affine transformations because they do not map [0, 2π] back into itself.

Here we implement a non-compact projection (NCP) transformation introduced for angular variables in [2]. The transform simply changes variables from x ∈ [0, 2π] to tan(x/2) ∈ (−∞, ∞), applies a scaling transformation, then transforms back to [0, 2π] afterwards, i.e.:

    x' = 2 tan⁻¹( e^s tan(x/2) ),
    J(x) = [ e^{−s} cos²(x/2) + e^{s} sin²(x/2) ]⁻¹.    (34)

We define these transformations below.

def tan_transform(x, s):
    return torch_mod(2*torch.atan(torch.exp(s)*torch.tan(x/2)))

def tan_transform_logJ(x, s):
    return -torch.log(torch.exp(-s)*torch.cos(x/2)**2 + torch.exp(s)*torch.sin(x/2)**2)

The average of transformations with different scales s_i also defines an invertible transform. This lets us make our models more expressive, at the cost of needing numerical methods to invert each coupling layer.

def mixture_tan_transform(x, s):
    assert len(x.shape) == len(s.shape), \
        f'Dimension mismatch between x and s {x.shape} vs {s.shape}'
    return torch.mean(tan_transform(x, s), dim=1, keepdim=True)

def mixture_tan_transform_logJ(x, s):
    assert len(x.shape) == len(s.shape), \
        f'Dimension mismatch between x and s {x.shape} vs {s.shape}'
    return torch.logsumexp(tan_transform_logJ(x, s), dim=1) - np.log(s.shape[1])

Unfortunately, the mixture of NCP transformations does not have an analytic inverse, but it is easy to compute the inverse numerically. It is worth noting that the inverse transformation is required only to measure the model density on configurations not generated by the model itself, so slow numerical inversion is not an issue for training or sampling. There are alternative coupling layers that avoid this issue, if such measurements are needed.

We implement numerical inversion using the bisection algorithm below.

def invert_transform_bisect(y, *, f, tol, max_iter, a=0, b=2*np.pi):
    min_x = a*torch.ones_like(y)
    max_x = b*torch.ones_like(y)
    min_val = f(min_x)
    max_val = f(max_x)
    with torch.no_grad():
        for i in range(max_iter):
            mid_x = (min_x + max_x) / 2
            mid_val = f(mid_x)
            greater_mask = (y > mid_val).int()
            greater_mask = greater_mask.float()
            err = torch.max(torch.abs(y - mid_val))
            if err < tol:
                return mid_x
            if torch.all((mid_x == min_x) + (mid_x == max_x)):
                print('WARNING: Reached floating point precision before tolerance '
                      f'(iter {i}, err {err})')
                return mid_x
            min_x = greater_mask*mid_x + (1-greater_mask)*min_x
            min_val = greater_mask*mid_val + (1-greater_mask)*min_val
            max_x = (1-greater_mask)*mid_x + greater_mask*max_x
            max_val = (1-greater_mask)*mid_val + greater_mask*max_val
        print(f'WARNING: Did not converge to tol {tol} in {max_iter} iters! Error was {err}')
        return mid_x

As before, we'll use neural nets to parametrize the scales s_i. We will preprocess the input angles (of the frozen plaquettes) as x → (cos(x), sin(x)) to ensure the neural nets have continuous outputs with respect to the angular inputs.

def stack_cos_sin(x):
    return torch.stack((torch.cos(x), torch.sin(x)), dim=1)

On top of the mixture of NCP transformations, each inner coupling layer also applies an additive offset to the active plaquette angles, x → x + t (mod 2π). As discussed above, we partition the plaquettes into three sets: active, passive, and frozen, with the exact partitioning scheme defined by our choice of masking pattern. The inner coupling layer updates the active plaquettes, uses the frozen plaquettes as inputs for the neural nets defining s_i and t, and ignores the passive plaquettes completely.
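Before assembling these pieces into a coupling layer, a quick sanity check (an optional sketch, not used elsewhere): the closed-form log-Jacobian of Eq. (34) should agree with the derivative of tan_transform computed by autograd.

_x = (2*np.pi*torch.rand(1000)).requires_grad_(True)
_s = torch.randn(1000)
_dydx, = torch.autograd.grad(tan_transform(_x, _s).sum(), _x)
_err = torch.abs(tan_transform_logJ(_x, _s) - torch.log(_dydx)).max()
print(f'max |logJ - log dy/dx| = {grab(_err):.3g}')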
class NCPPlaqCouplingLayer(torch.nn.Module):
    def __init__(self, net, *, mask_shape, mask_mu, mask_off,
                 inv_prec=1e-6, inv_max_iter=1000):
        super().__init__()
        assert len(mask_shape) == 2, (
            f'NCPPlaqCouplingLayer is implemented only in 2D, '
            f'mask shape {mask_shape} is invalid')
        self.mask = make_plaq_masks(mask_shape, mask_mu, mask_off)
        self.net = net
        self.inv_prec = inv_prec
        self.inv_max_iter = inv_max_iter
    def forward(self, x):
        x2 = self.mask['frozen'] * x
        net_out = self.net(stack_cos_sin(x2))
        assert net_out.shape[1] >= 2, 'CNN must output n_mix (s_i) + 1 (t) channels'
        s, t = net_out[:,:-1], net_out[:,-1]
        x1 = self.mask['active'] * x
        x1 = x1.unsqueeze(1)
        local_logJ = self.mask['active'] * mixture_tan_transform_logJ(x1, s)
        axes = tuple(range(1, len(local_logJ.shape)))
        logJ = torch.sum(local_logJ, dim=axes)
        fx1 = self.mask['active'] * mixture_tan_transform(x1, s).squeeze(1)
        fx = (self.mask['active'] * torch_mod(fx1 + t) +
              self.mask['passive'] * x +
              self.mask['frozen'] * x)
        return fx, logJ
    def reverse(self, fx):
        fx2 = self.mask['frozen'] * fx
        net_out = self.net(stack_cos_sin(fx2))
        assert net_out.shape[1] >= 2, 'CNN must output n_mix (s_i) + 1 (t) channels'
        s, t = net_out[:,:-1], net_out[:,-1]
        x1 = torch_mod(self.mask['active'] * (fx - t).unsqueeze(1))
        transform = lambda x: self.mask['active'] * mixture_tan_transform(x, s)
        x1 = invert_transform_bisect(
            x1, f=transform, tol=self.inv_prec, max_iter=self.inv_max_iter)
        local_logJ = self.mask['active'] * mixture_tan_transform_logJ(x1, s)
        axes = tuple(range(1, len(local_logJ.shape)))
        logJ = -torch.sum(local_logJ, dim=axes)
        x1 = x1.squeeze(1)
        x = (self.mask['active'] * x1 +
             self.mask['passive'] * fx +
             self.mask['frozen'] * fx2)
        return x, logJ
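With the inner coupling layer defined, an optional sketch of a round-trip check: forward followed by reverse should recover the input plaquette angles, and the log-Jacobians should cancel. This reuses make_conv_net from the scalar-theory section above; the hyperparameters and the throwaway names prefixed with _ are arbitrary choices for illustration only.

_net = make_conv_net(in_channels=2, out_channels=2+1, hidden_sizes=[8,8],
                     kernel_size=3, use_final_tanh=False)
_pc = NCPPlaqCouplingLayer(_net, mask_shape=(8,8), mask_mu=0, mask_off=0)
_p = 2*np.pi*torch.rand((4, 8, 8))  # random plaquette angles
_fp, _logJ = _pc(_p)
_p_rec, _logJ_rev = _pc.reverse(_fp)
print('max round-trip error: ', grab(torch.abs(torch_wrap(_p_rec - _p)).max()))
print('max |logJ + logJ_rev|:', grab(torch.abs(_logJ + _logJ_rev).max()))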
F. Assemble the model

Finally, we'll use CNNs for this application as well. To summarize, each coupling layer:

• Computes plaquettes from the gauge links, and partitions them into active, passive, and frozen subsets
• Provides the frozen plaquette angles x as inputs (cos x, sin x) to a CNN
• Uses the resulting scales s_i and offset t to update the active plaquettes with the mixture NCP transform
• Updates the active links to induce the transformation of the active plaquettes (updating the passive plaquettes as a side effect)

The flow as a whole is made by stacking coupling layers, repeatedly scanning the masking pattern across all four distinct offsets and both directions (see the figure above). Eight layers are required to update each link once; a quick check of this tiling appears after the code below.

def make_u1_equiv_layers(*, n_layers, n_mixture_comps, lattice_shape, hidden_sizes, kernel_size):
    layers = []
    for i in range(n_layers):
        mu = i % 2
        off = (i//2) % 4
        in_channels = 2  # (cos x, sin x) of the frozen plaquette angles
        out_channels = n_mixture_comps + 1  # n_mix scales s_i plus one offset t
        net = make_conv_net(in_channels=in_channels, out_channels=out_channels,
                            hidden_sizes=hidden_sizes, kernel_size=kernel_size,
                            use_final_tanh=False)
        plaq_coupling = NCPPlaqCouplingLayer(
            net, mask_shape=lattice_shape, mask_mu=mu, mask_off=off)
        link_coupling = GaugeEquivCouplingLayer(
            lattice_shape=lattice_shape, mask_mu=mu, mask_off=off,
            plaq_coupling=plaq_coupling)
        layers.append(link_coupling)
    return torch.nn.ModuleList(layers)
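As the promised check of the tiling claim (a small sketch using the mask helper defined above), we can verify that the eight active-link masks, two directions times four offsets, cover every link exactly once:

_coverage = sum(make_2d_link_active_stripes(link_shape, mu, off)
                for mu in (0, 1) for off in range(4))
print('every link active exactly once:', bool(torch.all(_coverage == 1)))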
G. Train the model
We use the same self-training scheme with the reverse KL divergence as for φ⁴ theory. You should find that this model trains much faster than the model for φ⁴ theory above, achieving ∼40% ESS after 10 eras of training, which takes 2-3 minutes on a Google Colab GPU.

As before, if you don't want to train the model, we have provided a pre-trained example; just set the flag below to use it.

use_pretrained = True
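For reference, a minimal sketch of a single reverse-KL training step is shown below; the full loop over eras and epochs with ESS metrics follows the same pattern as in the φ⁴ section, and the names reverse_kl_loss and train_u1_step are introduced here only for illustration.

def reverse_kl_loss(logq, logp):
    # stochastic estimate of KL(q || p), up to the constant log Z
    return torch.mean(logq - logp)

def train_u1_step(model, action, optimizer, batch_size=64):
    layers, prior = model['layers'], model['prior']
    optimizer.zero_grad()
    x, logq = apply_flow_to_prior(prior, layers, batch_size=batch_size)
    logp = -action(x)
    loss = reverse_kl_loss(logq, logp)
    loss.backward()
    optimizer.step()
    return grab(loss)

When use_pretrained is False, one would call train_u1_step(model, u1_action, optimizer) repeatedly after running the setup cell below.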
For convenience, this cell reproduces all of the setup code from above:
L = 8
lattice_shape = (L,L)
link_shape = (2,L,L)
beta = 2.0
u1_action = U1GaugeAction(beta)

prior = MultivariateUniform(torch.zeros(link_shape), 2*np.pi*torch.ones(link_shape))

n_layers = 16
n_s_nets = 2
hidden_sizes = [8,8]
kernel_size = 3
layers = make_u1_equiv_layers(lattice_shape=lattice_shape, n_layers=n_layers, n_mixture_comps=n_s_nets,
                              hidden_sizes=hidden_sizes, kernel_size=kernel_size)
set_weights(layers)
model = {'layers': layers, 'prior': prior}

base_lr = .001
optimizer = torch.optim.Adam(model['layers'].parameters(), lr=base_lr)

if use_pretrained:
    print('Loading pre-trained model')
    u1_trained_weights = torch.load(io.BytesIO(base64.b64decode(b"""
We can apply the same checks to our trained U(1) model as we did to the φ⁴ model above. It is harder to visualize gauge fields because the variables are angular and there are multiple degrees of freedom per site. However, we can double-check that a random draw from the trained model produces samples with log q correlated with log p.

layers, prior = model['layers'], model['prior']
torch_x, torch_logq = apply_flow_to_prior(prior, layers, batch_size=1024)
S_eff = -grab(torch_logq)
S = grab(u1_action(torch_x))
fit_b = np.mean(S) - np.mean(S_eff)
print(f'slope 1 linear regression S = S_eff + {fit_b:.4f}')
fig, ax = plt.subplots(1,1, dpi=125, figsize=(4,4))
ax.hist2d(S_eff, S, bins=20, range=[[175, 225], [-110, -60]])
ax.set_xlabel(r'$S_{\mathrm{eff}} = -\log~q(x)$')
ax.set_ylabel(r'$S(x)$')
ax.set_aspect('equal')
xs = np.linspace(175, 225, num=4, endpoint=True)
ax.plot(xs, xs + fit_b, ':', color='w', label='slope 1 fit')
plt.legend(prop={'size': 6})
plt.show()

>>> slope 1 linear regression S = S_eff + -286.5096

We can also see how the model density evolved over training to become well-correlated with p(x) (if use_pretrained == False).

if not use_pretrained:
    fig, axes = plt.subplots(1, 10, dpi=125, sharey=True, figsize=(10, 1))
    logq_hist = np.array(history['logq']).reshape(N_era, -1)[::N_era//10]
    logp_hist = np.array(history['logp']).reshape(N_era, -1)[::N_era//10]
    for i, (ax, logq, logp) in enumerate(zip(axes, logq_hist, logp_hist)):
        ax.hist2d(-logq, -logp, bins=20, range=[[175, 225], [-110, -60]])
        if i == 0:
            ax.set_ylabel(r'$S(x)$')
        ax.set_xlabel(r'$S_{\mathrm{eff}}$')
        ax.set_title(f'Era {i * (N_era//10)}')
        ax.set_xticks([])
        ax.set_yticks([])
        ax.set_aspect('equal')
    plt.show()
else:
    print('Skipping plot because use_pretrained == True')

>>> Skipping plot because use_pretrained == True

We can reuse our Metropolis independence sampler from above to sample the theory using our model and check that we get a good acceptance rate. You should see an accept rate around 40-50% for the model trained above.

ensemble_size = 8192
u1_ens = make_mcmc_ensemble(model, u1_action, 64, ensemble_size)
print("Accept rate:", np.mean(u1_ens['accepted']))

>>> Accept rate: 0.244140625

Algorithms like HMC have a difficult time sampling from different topological sectors, exhibiting "topological freezing" where the topological charge Q moves very slowly in Markov chain time. We can measure this quantity on the ensemble of U(1) configurations we just generated and see that it mixes quickly using our direct sampling approach. See [3] for a detailed comparison against two standard approaches.

Q = grab(topo_charge(torch.stack(u1_ens['x'], axis=0)))
plt.figure(figsize=(5,3.5), dpi=125)
plt.plot(Q)
plt.xlabel(r'$t_{MC}$')
plt.ylabel(r'topological charge $Q$')
plt.show()

X_mean, X_err = bootstrap(Q**2, Nboot=100, binsize=16)
print(f'Topological susceptibility = {X_mean:.2f} +/- {X_err:.2f}')
print(f'... vs HMC estimate = 1.23 +/- 0.02')

>>> Topological susceptibility = 1.22 +/- 0.05
... ... vs HMC estimate = 1.23 +/- 0.02
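To quantify the fast mixing of Q visible in the Markov chain history, a simple (and deliberately naive) normalized autocorrelation estimate can be computed directly from the Q array; the helper naive_autocorr is introduced here only for illustration.

def naive_autocorr(x, max_lag=20):
    # naive estimator of the normalized autocorrelation rho(t) of a 1D series
    x = np.asarray(x) - np.mean(x)
    var = np.mean(x**2)
    return np.array([np.mean(x[:len(x)-t] * x[t:]) / var for t in range(max_lag)])

rho_Q = naive_autocorr(Q)
print('rho_Q at lags 0..5 =', np.round(rho_Q[:6], 3))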
[1] M. S. Albergo, G. Kanwar, and P. E. Shanahan, Phys. Rev. D 100, 034515 (2019), arXiv:1904.12072 [hep-lat].
[2] D. J. Rezende, G. Papamakarios, S. Racanière, M. S. Albergo, G. Kanwar, P. E. Shanahan, and K. Cranmer, (2020), arXiv:2002.02428 [stat.ML].
[3] G. Kanwar, M. S. Albergo, D. Boyda, K. Cranmer, D. C. Hackett, S. Racanière, D. J. Rezende, and P. E. Shanahan, Phys. Rev. Lett. 125, 121601 (2020), arXiv:2003.06413 [hep-lat].
[4] G. Papamakarios, E. Nalisnick, D. J. Rezende, S. Mohamed, and B. Lakshminarayanan, "Normalizing flows for probabilistic modeling and inference," (2019), arXiv:1912.02762 [stat.ML].
[5] I. Vierhaus, Simulation of φ⁴ theory in the strong coupling expansion beyond the Ising Limit, Master's thesis, Humboldt-Universität zu Berlin, Mathematisch-Naturwissenschaftliche Fakultät I (2010).
[6] F. Huszár, "How (not) to train your generative model: Scheduled sampling, likelihood, adversary?" (2015), arXiv:1511.05101 [stat.ML].
[7] D. Boyda, G. Kanwar, S. Racanière, D. J. Rezende, M. S. Albergo, K. Cranmer, D. C. Hackett, and P. E. Shanahan, (2020), arXiv:2008.05456 [hep-lat].
[8] C. Gattringer and C. B. Lang, Quantum chromodynamics on the lattice, Vol. 788 (Springer, Berlin, 2010).
[9] J. Smit,