Recurrent Neural Network Wave Functions
Mohamed Hibat-Allah, Martin Ganahl, Lauren E. Hayward, Roger G. Melko, Juan Carrasquilla
1 Vector Institute, MaRS Centre, Toronto, Ontario, M5G 1M1, Canada
2 Perimeter Institute for Theoretical Physics, 31 Caroline Street North, Waterloo, Ontario, N2L 2Y5, Canada
3 Department of Physics and Astronomy, University of Waterloo, Ontario, N2L 3G1, Canada
∗ [email protected]
(Dated: June 23, 2020)

A core technology that has emerged from the artificial intelligence revolution is the recurrent neural network (RNN). Its unique sequence-based architecture provides a tractable likelihood estimate with stable training paradigms, a combination that has precipitated many spectacular advances in natural language processing and neural machine translation. This architecture also makes a good candidate for a variational wave function, where the RNN parameters are tuned to learn the approximate ground state of a quantum Hamiltonian. In this paper, we demonstrate the ability of RNNs to represent several many-body wave functions, optimizing the variational parameters using a stochastic approach. Among other attractive features of these variational wave functions, their autoregressive nature allows for the efficient calculation of physical estimators by providing independent samples. We demonstrate the effectiveness of RNN wave functions by calculating ground state energies, correlation functions, and entanglement entropies for several quantum spin models of interest to condensed matter physicists in one and two spatial dimensions.
I. INTRODUCTION
The last decade has marked the start of a worldwide artificial intelligence (AI) revolution, which is dramatically affecting industry, science, and society. The source of the current AI resurgence can largely be traced back to AlexNet [1], one of the most influential breakthrough papers in computer vision, which provided a dramatic quantitative improvement in object recognition tasks and popularized the paradigm of deep learning [2]. The concept of deep learning encompasses a set of machine learning techniques where data are processed through the composition of parametrized nonlinear layers, each of which generates increasingly abstract representations of the original data [2]. This paradigm has demonstrated an unprecedented unifying power by making advances in areas as diverse as image recognition [3], natural language processing [4], drug discovery [5], self-driving cars [6], game play [7], and more.

The striking performance of deep learning methods has motivated researchers to use a machine learning perspective to reexamine problems in the physical sciences, including areas such as particle physics, cosmology, materials science, quantum chemistry, and statistical physics [8]. The exploration of machine learning techniques has been particularly prominent in the field of quantum many-body physics, where the task of elucidating the equilibrium and non-equilibrium properties of interacting many-particle systems remains at the research frontier of quantum information and condensed matter physics. One of the first successful technology transfers from machine learning into many-body physics involved the use of neural network methods in a variational calculation [9]. The variational principle is the theoretical bedrock behind many of the most powerful numerical approaches to solving many-body problems in quantum mechanics [10–12]. Modern incarnations range from well-established techniques such as variational Monte Carlo (VMC) [13] and tensor networks (TN) [14] to variational quantum eigensolvers (VQE) for quantum computation [15]. The resurgence of interest in machine learning has motivated a rich new playground for variational calculations based on neural networks [9, 16–19].

Simultaneous to the computer vision revolution, a wide array of model architectures and algorithmic advances have also emerged in the context of natural language processing (NLP), the technology that enables computers to process and understand human language. Some of the most important algorithmic advances in NLP have been developed in the context of sequence learning using recurrent neural networks (RNNs) [20–24]. These have resulted in impressive results in speech and text comprehension, as well as in state-of-the-art results in neural machine translation. With RNNs and other algorithmic and conceptual advances, algorithms are bringing machine translation and speech recognition closer to the human level with unprecedented success [23, 25–27]. Here we explore whether the power and scalability of NLP models such as the RNN can be extended to applications in physical systems, in particular to perform variational calculations to find the low-energy states of quantum many-body Hamiltonians.

RNNs have already proven to be powerful tools within the field of many-body physics. In Ref. [28], RNNs were applied in the context of quantum state tomography and were found to be capable of representing a broad range of complex quantum systems, including prototypical states in quantum information and ground states of local spin models.
Furthermore, RNNs have established similarities to matrix product states (MPS) and are capable of capturing entanglement properties of quantum many-body systems [29]. To date, however, little effort has been made to develop NLP technology for use together with the variational principle. Here we investigate the power of RNNs and their extensions for approximating the ground state of strongly correlated local Hamiltonians. We demonstrate how the variational principle can be combined with RNNs to yield highly efficient ansatz wave functions. Our proposal makes use of the autoregressive property [30–32] of RNNs, which, unlike traditional VMC methods, allows for sampling from the wave function. We variationally optimize our RNNs to approximate ground states of various strongly correlated quantum systems in one and two dimensions. We find excellent agreement for local correlation functions and entanglement entropy upon comparison with well-established state-of-the-art approaches, while requiring only a fraction of the variational parameters. Through extensive scaling studies, we show that the intrinsic bias of our ansatz can be systematically reduced to yield highly accurate ground state approximations of large quantum systems.

II. CLASSICAL AND QUANTUM RECURRENT NEURAL NETWORKS

A. RNNs for classical probability distributions
We consider probability distributions defined over a discrete sample space, where a single configuration consists of a list σ ≡ (σ_1, σ_2, ..., σ_N) of N variables σ_n, and σ_n ∈ {0, 1, ..., d_v − 1}. Here, the input dimension d_v represents the number of possible values that any given variable σ_n can take. A central task in machine learning is to use a set of empirical samples to infer probability distributions in cases where there are strong correlations among the variables σ_n. We denote the probability of a configuration σ by P(σ) ≡ P(σ_1, σ_2, ..., σ_N), and use the product rule for probabilities to express this distribution as

P(σ) = P(σ_1) P(σ_2|σ_1) · · · P(σ_N|σ_{N−1}, ..., σ_2, σ_1),   (1)

where P(σ_i|σ_{i−1}, ..., σ_2, σ_1) is the conditional probability of σ_i given the values of all variables preceding it. An RNN parametrizes these conditionals with a recurrent cell that maps a hidden vector h_{n−1} ∈ R^{d_h} and the one-hot encoding σ_{n−1} of the previous variable to a new hidden state,

h_n = f(W[h_{n−1}; σ_{n−1}] + b),   (2)

where [h_{n−1}; σ_{n−1}] denotes vector concatenation, W and b are a weight matrix and a bias, and f is a non-linear activation function. The evaluation of P(σ) is carried out by sequentially computing the conditionals, starting with P(σ_1), as

P(σ_n|σ_{n−1}, ..., σ_1) = y_n · σ_n,

where the right-hand side contains the usual scalar product between vectors and

y_n ≡ S(U h_n + c).   (3)

Here, U ∈ R^{d_v × d_h} and c ∈ R^{d_v} are weights and biases of a so-called Softmax layer, and the Softmax activation function S is given by

S(v)_n = exp(v_n) / Σ_i exp(v_i).

In Eq. (3), y_n = (y_n^1, ..., y_n^{d_v}) is a d_v-component vector of positive, real numbers summing up to 1, i.e.,

‖y_n‖_1 = 1,   (4)

and thus forms a probability distribution over the states σ_n. Once the vectors y_n have been specified, the full probability P(σ) is given by

P(σ) = ∏_{n=1}^{N} y_n · σ_n.

Note that P(σ) is already properly normalized to unity, such that

Σ_σ P(σ) = 1.   (5)

Sampling from an RNN probability distribution is achieved in a similar sequential fashion. To generate a sample σ = (σ_1, ..., σ_N) consisting of a set of N configurations σ_n, one first calculates the hidden state h_1 and the probability y_1 from the initial vectors h_0 and σ_0. A sample σ_1 from the probability distribution y_1 is drawn, which is then fed as a one-hot vector σ_1 along with h_1 back into the recurrent cell to obtain y_2, h_2 and then σ_2. The procedure is then iterated until N configurations σ_n have been obtained, as illustrated in Fig. 1(c).
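To make the autoregressive structure concrete, here is a minimal NumPy sketch of a vanilla RNN that draws one configuration and evaluates P(σ) as the product of conditionals in Eq. (1). The variable names and the Gaussian weight initialization are our own illustration, not the paper's TensorFlow implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

d_v, d_h, N = 2, 10, 8                       # input dim (spin-1/2), hidden units, chain length
W = rng.normal(0, 0.1, (d_h, d_h + d_v))     # recurrent weights of Eq. (2)
b = np.zeros(d_h)
U = rng.normal(0, 0.1, (d_v, d_h))           # Softmax layer of Eq. (3)
c = np.zeros(d_v)

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def one_hot(s):
    v = np.zeros(d_v)
    v[s] = 1.0
    return v

def sample_and_prob():
    """Draw one configuration autoregressively and return it with P(sigma)."""
    h, sigma_prev = np.zeros(d_h), np.zeros(d_v)   # h_0 and sigma_0 are null vectors
    sample, prob = [], 1.0
    for _ in range(N):
        h = np.tanh(W @ np.concatenate([h, sigma_prev]) + b)  # Eq. (2)
        y = softmax(U @ h + c)                                # Eq. (3)
        s = rng.choice(d_v, p=y)                              # draw sigma_n from y_n
        prob *= y[s]                                          # accumulate the conditionals
        sample.append(s)
        sigma_prev = one_hot(s)
    return np.array(sample), prob

sigma, p = sample_and_prob()
print(sigma, p)
```

Because each conditional is normalized by construction, the sample is drawn exactly from P(σ) with no Markov chain and no rejection step.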
B. RNNs for quantum wave functions

The previous section focused exclusively on the efficient parametrization of classical probability distributions P(σ). In contrast, quantum mechanical wave functions are in general a set of complex valued amplitudes ψ(σ), rather than conventional probabilities. Before discussing how to modify the RNN ansatz to represent complex wave functions, we note that an important class of stoquastic many-body Hamiltonians has ground states |Ψ⟩ with real and positive amplitudes in the standard product spin basis [41]. Thus, these ground states have representations in terms of probability distributions,

|Ψ⟩ = Σ_σ ψ(σ) |σ⟩ = Σ_σ √(P(σ)) |σ⟩.   (6)

This property has been exploited extensively in wave function representations using generative models such as restricted Boltzmann machines [19]. For such wave functions, it is also natural to try to approximate P(σ) with a conventional RNN, as illustrated in Fig. 2(a). For later reference we call this architecture a positive recurrent neural network wave function (pRNN wave function). For ground states endowed with a non-trivial sign or phase structure, the RNN output is supplemented with a layer that produces a phase φ_λ(σ) in addition to the probability P_λ(σ), so that the ansatz encodes complex amplitudes ψ_λ(σ) = √(P_λ(σ)) exp(iφ_λ(σ)) (see App. A for the Softsign phase layer we use). We refer to this architecture as a complex RNN wave function (cRNN wave function).

III. RESULTS

We focus our attention on the ground state properties of prototypical Hamiltonians in condensed matter physics, including the one- and two-dimensional (1D and 2D) transverse field Ising model (TFIM), as well as the 1D J_1-J_2 model, both with open boundary conditions. Their Hamiltonians are given by

Ĥ_TFIM = − Σ_{⟨i,j⟩} σ̂ᶻ_i σ̂ᶻ_j − h Σ_i σ̂ˣ_i,   (14)

where σ̂^{(x,y,z)}_i are Pauli matrices acting on site i, and

Ĥ_{J_1-J_2} = J_1 Σ_{⟨i,j⟩} Ŝ_i · Ŝ_j + J_2 Σ_{⟨⟨i,j⟩⟩} Ŝ_i · Ŝ_j,   (15)

where Ŝ_i is a spin-1/2 operator. Here, ⟨i,j⟩ and ⟨⟨i,j⟩⟩ denote nearest- and next-nearest-neighbor pairs, respectively.
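For concreteness, the local energy E_loc(σ) = Σ_{σ′} H_{σσ′} ψ(σ′)/ψ(σ) that enters the variational optimization (see App. C) takes a simple form for the TFIM of Eq. (14): a diagonal Ising term plus one amplitude ratio per spin flip. A minimal sketch, assuming a callable psi(sigma) that returns the (real, positive) pRNN amplitude and spins encoded as σ_i ∈ {0, 1}:

```python
import numpy as np

def local_energy_tfim_1d(sigma, psi, h):
    """E_loc(sigma) for the 1D TFIM of Eq. (14), open boundaries."""
    z = 1.0 - 2.0 * sigma                      # map {0,1} -> {+1,-1}, eigenvalues of sigma^z
    e = -np.sum(z[:-1] * z[1:])                # diagonal term: -sum_i z_i z_{i+1}
    amp = psi(sigma)
    for i in range(len(sigma)):                # off-diagonal term: -h sum_i psi(flip_i)/psi
        flipped = sigma.copy()
        flipped[i] = 1 - flipped[i]
        e += -h * psi(flipped) / amp
    return e
```

Averaging this quantity over autoregressive samples yields the variational energy estimate used below.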
Energies for the J_1-J_2 model are measured in units of J_1 = 1 in the results that follow.

To train our models we use the variational principle, where for a given problem Hamiltonian Ĥ, the optimization strategy involves minimizing the expectation value E_λ = ⟨Ψ_λ|Ĥ|Ψ_λ⟩ ≥ E_0 with respect to the variational parameters λ. Here, E_0 is the exact ground state energy of Ĥ. The variational parameters λ are updated using variants of the gradient descent algorithm with the objective of minimizing E_λ. We provide a detailed description of the VMC scheme and the optimization strategy with which we optimize our RNN wave functions in App. C.

Since the TFIM in Eq. (14) is stoquastic, the ground state is positive [41] and hence we use the pRNN wave function ansatz. The J_1-J_2 model with positive couplings, on the other hand, has a ground state endowed with a sign structure in the computational z-basis, and thus we use a cRNN wave function ansatz. In the following sections, we use 1D RNN wave functions to approximate the ground state of the 1D TFIM and the 1D J_1-J_2 model, whereas we use both 1D and 2D pRNN wave functions in the case of the 2D TFIM.

A. 1D transverse field Ising model

To demonstrate the power of our proposed method, we use it to target the ground state of a TFIM in one dimension with N = 1000 spins at the critical point h = 1, using a pRNN wave function that has a single-layer RNN with 50 memory units. In Fig. 3(a), we show the evolution of the relative error

ε ≡ |E_RNN − E_DMRG| / |E_DMRG|,   (16)

and the energy variance per spin

σ² ≡ (⟨Ĥ²⟩ − ⟨Ĥ⟩²) / N,   (17)

as a function of the training step. E_DMRG is the ground state energy as obtained from a density matrix renormalization group (DMRG) calculation [43, 44], and can be considered exact in one dimension. We obtain very accurate results with a modest number of parameters. For comparison, the number of parameters of a restricted Boltzmann machine (RBM) scales as ∼ MN, with M the number of hidden units and N the number of physical spins. This scaling implies that the pRNN wave function here has the same number of variational parameters as an RBM with only eight hidden units.

While energies and variances give a quantitative indication of the quality of a variational wave function, correlation functions provide a more comprehensive characterization. Indeed, correlation functions are at the heart of condensed matter theory, since many experimental probes in condensed matter physics directly relate to measurements of correlation functions. Examples include inelastic scattering, which probes density-density correlation functions, and the Green's function, out of which important thermodynamic properties of a quantum system can be computed [45]. In Fig. 3(b) we compare the RNN results for the two-point correlation functions ⟨Ŝˣ_n Ŝˣ_m⟩ and ⟨Ŝᶻ_n Ŝᶻ_m⟩ with DMRG. Here, we see consistency between the RNN and the DMRG results.
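Because the RNN provides independent samples, diagonal correlators such as ⟨Ŝᶻ_n Ŝᶻ_m⟩ reduce to plain sample averages with textbook error bars (off-diagonal correlators like ⟨Ŝˣ_n Ŝˣ_m⟩ additionally require amplitude ratios, as for the local energy above). A minimal sketch, with our own array conventions:

```python
import numpy as np

def two_point_zz(samples, n, m):
    """Estimate <S^z_n S^z_m> from independent configurations drawn from |psi|^2.

    `samples` is an (N_s, N) array of spins in {0, 1}; the S^z eigenvalues are
    +/- 1/2, so S^z = (1 - 2*sigma) / 2.  Independence of the samples makes the
    standard error a direct function of the sample variance."""
    sz = (1.0 - 2.0 * samples) / 2.0
    vals = sz[:, n] * sz[:, m]
    return vals.mean(), vals.std(ddof=1) / np.sqrt(len(vals))
```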
Figure 3. Results for the pRNN wave function compared with DMRG when targeting the ground state of a 1D TFIM at the critical point. Our pRNN wave function has one layer with 50 units. (a) The relative error ε and the energy variance per spin σ² against the number of training steps (i.e. gradient descent steps) for N = 1000 spins. We use only 200 samples per gradient step, which are enough to achieve convergence. (b) The two-point correlation function ⟨Ŝ_1 Ŝ_n⟩ along the x-axis and z-axis of the optimized pRNN wave function for sites n > 40. DMRG results are also shown for comparison. (c) The Rényi entropy S_2 against the relative size of subregion A for system sizes N = 20 and 80. In both (b) and (c), the error bars are smaller than the data points.

Extracting entanglement entropy from many-body quantum systems is a central theme in condensed matter physics, with entanglement entropy providing an additional window into the structure of complex quantum states of matter beyond what is seen from correlation functions. Of particular interest is the family of Rényi entropies of order α of a reduced density matrix ρ,

S_α(ρ) = 1/(1 − α) log(Tr ρ^α).   (18)

S_α(ρ) encodes important non-local properties of quantum many-body systems such as topological entanglement, and contains information about universal properties of quantum phases such as the central charge c [46, 47]. Due to their non-local character, extracting Rényi entropies from many-body quantum systems is notoriously difficult. Here, we use the so-called replica trick [48] to calculate the α = 2 Rényi entropy S_2(ρ) for RNN wave functions. The details of the implementation can be found in App. E. In Fig. 3(c), we show results for the Rényi entropy S_2(ρ_ℓ) for two different system sizes N = 20, 80 of the 1D TFIM. Here, ρ_ℓ is the reduced density matrix on the first ℓ sites of the spin chain, obtained by tracing out all sites n ∈ [ℓ + 1, L], such that

ρ_ℓ = Tr_{n ∈ [ℓ+1, L]} (|Ψ⟩⟨Ψ|).   (19)

Indeed, for both system sizes, Fig. 3(c) shows excellent agreement between the pRNN wave function estimation and the DMRG result. To improve the overall quality of the quantum state, we have enforced the parity symmetry on our pRNN wave function (see App. D 1), denoted by "Symmetric RNN" in Fig. 3(c). We observe that the symmetric pRNN wave function leads to a more accurate estimate of S_2(ρ_ℓ) for N = 80 sites.

B. 1D J_1-J_2 model

Moving beyond stoquastic Hamiltonians, we now investigate the performance of RNN wave functions for a Hamiltonian the ground state of which has a sign structure in the computational basis, specifically the J_1-J_2 model in one dimension. We use a variationally optimized deep cRNN wave function with three GRU layers, each with 100 memory units, to approximate the ground state of the J_1-J_2 model. The phase diagram of this model has been studied with DMRG [49], where it was found that the model exhibits a quantum phase transition at J_2^c = 0.241167 ± 0.000005 [50], from a gapless phase for J_2 ≤ J_2^c to a spontaneously dimerized gapped valence bond state phase for J_2 ≥ J_2^c.

We impose U(1) spin symmetry in the cRNN wave function (see App. D 2), and target the ground state at four different points J_2 = 0.0, 0.2, 0.5, 0.8. Note that at J_2 = 0, the Hamiltonian in Eq. (15) can be made stoquastic by a local unitary transformation that rotates every other spin by π around the z-axis. The ground state can in this case be decomposed as [52]

ψ(σ) = (−1)^{M_A(σ)} ψ̃(σ),   (20)

where M_A(σ) is given by M_A(σ) = Σ_{i∈A} σ_i with σ_i ∈ {0, 1} [52], and ψ̃(σ) is the positive amplitude of the wave function. The set A comprises the sites belonging to the sublattice of all even (or all odd) sites in the lattice. The prefactor (−1)^{M_A(σ)} is known as the Marshall sign of the wave function [52].
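A cRNN with a built-in Marshall sign simply multiplies the network amplitude by (−1)^{M_A(σ)}, at negligible cost. A minimal sketch of Eq. (20), assuming spins σ_i ∈ {0, 1} and taking A to be the even sublattice:

```python
import numpy as np

def marshall_sign(sigma):
    """(-1)^{M_A(sigma)} with M_A the sum of sigma_i on the even sublattice A, Eq. (20)."""
    m_a = int(np.sum(sigma[::2]))
    return -1.0 if m_a % 2 else 1.0

def signed_amplitude(sigma, psi_tilde):
    """Attach a prior Marshall sign on top of a positive network amplitude psi_tilde."""
    return marshall_sign(sigma) * psi_tilde(sigma)
```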
Figure 4. The relative error (compared to DMRG) of the cRNN wave function trained on the 1D J_1-J_2 model with N = 100 spins for different values of J_2, both without a prior sign (represented by "No Sign") and with a prior Marshall sign as in Eq. (20) (represented by "Marshall Sign"). We observe that applying a Marshall sign improves the accuracy.

For J_2 ≠ 0, this decomposition is no longer exact, and ψ̃(σ) acquires a non-trivial sign structure. For finite J_2, the decomposition in Eq. (20) can still be applied with the hope that the sign structure of ψ(σ) remains close to the Marshall sign [53].

In Fig. 4, we compare ground state energies of the cRNN wave function trained on the 1D J_1-J_2 model with N = 100 spins with and without applying a Marshall sign. For small values of J_2, we find a considerable improvement of the energies when applying the Marshall sign on top of the cRNN wave function. This observation highlights the importance of considering a prior "sign ansatz" to achieve better results. In the absence of a prior sign, the cRNN wave function can still achieve accurate estimations of the ground state energies, showing that cRNN wave functions can recover some of the unknown sign structure of the ground state. For J_2 = 0.8, however, the improvement is less pronounced, which is expected due to the emergence of a second sign structure in the limit J_2 → ∞ (when the system decouples into two independent unfrustrated Heisenberg chains) [54, 55] that is widely different from the Marshall sign in Eq. (20). We omit from Fig. 4 our results at the point J_2 = 0.5. In this case, the 1D J_1-J_2 model reduces to the Majumdar-Ghosh model, where the ground state is a product state of spin singlets, and we find agreement with the exact ground state energy within error bars when we apply an initial Marshall sign structure. We provide a summary of the cRNN wave function's obtained values in App. F.

C. 2D transverse field Ising model

Understanding strongly correlated quantum many-body systems in dimensions D > 1 remains one of the central challenges of condensed matter physics. As a first benchmark in this direction, we consider the 2D TFIM on the square lattice, which hosts a quantum critical point at h_c ≈ 3.044 that separates a magnetically ordered phase from a random paramagnet [61].

The simplest strategy for extending our approach to 2D geometries is to simply treat them as folded 1D chains, similar to the "snaking" approach used in 2D DMRG calculations (see Fig. 5(a)). While this approach works quite well, it has the fundamental drawback that neighboring sites on the lattice can become separated in the 1D geometry. As a consequence, local correlations in the 2D lattice are mapped into non-local correlations in the 1D geometry, which can increase the complexity of the problem considerably. For example, 2D DMRG calculations are typically restricted to 2D lattices with small width L_y. This problem has led to the development of more powerful tensor network algorithms for 2D quantum systems such as projected entangled pair states (PEPS) [57].

An advantage of RNN wave functions is their flexibility in how hidden vectors are passed between units. To obtain an RNN wave function more suited to a 2D geometry, we modify the simple 1D approach outlined above by allowing hidden vectors to also be passed vertically, instead of only horizontally, as described in App. B. This modification is illustrated by the red arrows in Fig. 5(b).
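For reference, the "snaking" map from a 2D lattice to a 1D sequence can be written as a small index helper. This is our own illustration of the sampling path of Fig. 5, not code from the paper's repository:

```python
def snake_path(Lx, Ly):
    """Return lattice sites (i, j) in zigzag order:
    left-to-right on even rows, right-to-left on odd rows."""
    path = []
    for j in range(Ly):
        row = [(i, j) for i in range(Lx)]
        path.extend(row if j % 2 == 0 else row[::-1])
    return path

print(snake_path(3, 2))  # [(0, 0), (1, 0), (2, 0), (2, 1), (1, 1), (0, 1)]
```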
We refer to this geometry in the following discussions as a 2D RNN. We optimize the 2D pRNN wave function with a single-layer 2D vanilla RNN cell that has 100 memory units at transverse fields h = 2, 3, 4. The training complexity of the 2D pRNN wave function is only quadratic in the number of memory units d_h (see App. B), which is very inexpensive compared to, e.g., the expensive variational optimization of PEPS, which scales as a high power of the PEPS bond dimension D̃ and of the bond dimension χ of the intermediate MPS [62].

For comparison, we also optimize a deep 1D pRNN wave function architecture with three layers of stacked GRU cells, each with 100 memory units. In Fig. 5(c) we compare the obtained ground state energies with results from 2D DMRG calculations (run on the same 1D geometry as for the 1D pRNN wave function) and with the PixelCNN architecture [63]. The 2D pRNN wave function is competitive with DMRG at bond dimension χ = 512, while requiring about 2.6% of the variational parameters of the PixelCNN wave function used in Ref. [56], and about 14% of the parameters used in the 1D pRNN architecture. A summary of our results in tabular form can be found in App. F.

D. Scaling of resources

The optimization results of our RNN wave function approach depend on several hyperparameters, including the number of memory units, the number of recurrent layers in deep architectures, and the number of samples used to obtain the gradient during an optimization step (see App. C). Here, we investigate how the optimized energy and the energy variance per spin σ² (see Eq. (17)) depend on these parameters. This energy variance per spin is an indicator of the quality of the optimized wave function, with exact eigenstates corresponding to σ² = 0. When targeting eigenstates, deviations from this value can be used to assess the quality of a variational wave function [13, 64, 65], as previously done in the case of matrix product state based techniques [66, 67]. For variational approaches such as DMRG, one typically expects a non-zero value of σ² that decreases when one increases the number of parameters (i.e., the expressivity) of the variational wave function. Since the number of variational parameters is directly related to the number of memory units of the pRNN wave function (see App. A), we study here the scaling of σ² with the number of memory units.

In Fig. 6, we present the dependence of σ² on the number of memory units for the 1D and 2D critical TFIMs. Fig. 6(a) shows results for σ² for a 1D critical TFIM on three system sizes N = 20, 40 and 80, and Fig. 6(b) shows results for the 2D TFIM on 4 × 4 and larger square lattices. For all system sizes N, we observe a systematic decrease of σ² (i.e., an increase in quality of the wave function) as we increase the number of memory units.

In App. G, we study the dependence of σ² on both the number of samples and the number of layers in the pRNN wave function for a critical 1D TFIM. We observe only a weak dependence on both parameters. The weak dependence on the number of samples suggests that optimizing the RNN wave functions with noisy gradients does not significantly impact the results of the optimization procedure, and yields accurate estimations of the ground state and its energy.

Figure 5. (a): Autoregressive sampling path of 2D spin configurations using 1D RNN wave functions. The 2D configurations are generated through raster scanning, such that in order to generate spin σ_i one has to condition on the spins that were previously generated.
(b): Autoregressive sampling path of 2D spin configurations using 2D RNN wave functions through a zigzag path, where each site receives two hidden states and two spins from the horizontal and the vertical neighbors that were previously generated. For both panels (a) and (b), the digits and the green dashed arrows indicate the sampling path, while the red arrows indicate how the hidden states are passed from one site to another. (c): A comparison of the variational energy per spin between a 2D pRNN wave function (labeled as 2DRNN), a 1D pRNN wave function (labeled as 1DRNN), the PixelCNN wave function [56], and DMRG with bond dimension χ for the 2D TFIM on a system with L_x × L_y = 12 × 12 spins. The shaded regions represent the error bars of each method. Note the broken y-axis on the plots for h = 3 and 4, denoting a change in scale between the upper and lower portions of the plots. These results show that 2D pRNN wave functions can achieve a performance comparable to PixelCNN wave functions and DMRG with a large bond dimension, while requiring only a fraction of their variational parameters.

From the weak dependence on the number of layers we conclude that shallow RNNs with a sufficient number of memory units have enough expressivity, and that deep architectures do not seem to be beneficial from an accuracy point of view. However, deeper networks could have potential ramifications regarding memory usage and training speed when it comes to training a large number of variational parameters, as shallow RNNs with a large number of memory units are equivalent in terms of number of parameters to deep RNNs with a smaller number of memory units. We also note that adding residual connections between layers [68] and dilated connections between RNN cells [69] to deep RNNs, which we leave for future investigations, might change our previous conclusions and make deep RNNs more beneficial compared to shallow RNNs.

Figure 6. The energy variance per spin against the number of memory units of a 1D pRNN wave function trained at the critical point of (a) the 1D TFIM and (b) the 2D TFIM. Both scalings show that we can systematically reduce the bias in the estimation of the ground state energy.

IV. CONCLUSIONS AND OUTLOOK

We have introduced recurrent neural network wave functions, a novel variational ansatz for quantum many-body systems, which we use to approximate ground state energies, correlation functions, and entanglement of many-body Hamiltonians of interest to condensed matter physics. We find that RNN wave functions are competitive with state-of-the-art methods such as DMRG and PixelCNN wave functions [56], performing particularly well on the task of finding the ground state of the transverse-field Ising model in two dimensions. By increasing the number of memory units in the RNN, the error in our results can be systematically reduced. We have shown furthermore that we can accurately model ground states endowed with a sign structure using a complex recurrent neural network (cRNN) wave function ansatz. Here, accuracy can be improved by introducing an ansatz sign structure and by enforcing symmetries such as U(1) symmetry.
The autoregressive nature of RNN wave functions makes it possible to directly generate independent samples, in contrast to methods based on Markov chain sampling, which are often plagued by long autocorrelation times that affect the optimization and the accurate estimation of correlation functions in a variational ansatz. Thanks to weight sharing among lattice sites, RNN wave functions provide very compact yet expressive representations of quantum states, while retaining the ability to easily train with millions of variational parameters, as opposed to, e.g., restricted Boltzmann machines [9]. We expect that future work incorporating additional numerical techniques such as attention [25, 70] and higher order optimization [13, 71] will make RNN wave functions a highly competitive tool for simulating quantum many-body systems, with applications to materials science, quantum chemistry [72], quantum computation [73], and beyond.

Note added. A complementary paper on recurrent neural network wave functions [74] appeared after the publication of this manuscript.

OPEN-SOURCE CODE

Our code is made publicly available at http://github.com/mhibatallah/RNNwavefunctions. The hyperparameters we use are given in App. H.

ACKNOWLEDGMENTS

We acknowledge Di Luo for his generous comments, which were extremely helpful. We also thank Estelle Inack, Dan Sehayek, Amine Natik, Matthew Beach, Bohdan Kulchytskyy, Florian Hopfmueller, Roeland Wiersema, Giuseppe Carleo and Noam Wies for useful discussions and insights. M.H. acknowledges support of the École Normale Supérieure de Lyon. M.G. acknowledges support by the Simons Foundation (Many Electron Collaboration). R.G.M. acknowledges support from the Natural Sciences and Engineering Research Council of Canada (NSERC), a Canada Research Chair, the Shared Hierarchical Academic Research Computing Network (SHARCNET) and Compute Canada. J.C. acknowledges support from NSERC, SHARCNET, Compute Canada, and the Canadian Institute for Advanced Research (CIFAR) AI chair program. Computer simulations were also made possible thanks to the Vector Institute computing cluster and Google Colaboratory. This research was supported in part by the National Science Foundation under Grant No. NSF PHY-1748958. It was also supported in part by the Perimeter Institute for Theoretical Physics. Research at Perimeter Institute is supported in part by the Government of Canada through Innovation, Science and Economic Development Canada (ISED) and by the Province of Ontario through the Ministry of Economic Development, Job Creation and Trade.

Appendix A: Gated Recurrent Neural Networks

We use the GRU model introduced in Ref. [40], which processes the spin configurations σ as

u_n = sig(W_u [h_{n−1}; σ_{n−1}] + b_u),   (A1)
r_n = sig(W_r [h_{n−1}; σ_{n−1}] + b_r),
h̃_n = tanh(W_c [r_n ⊙ h_{n−1}; σ_{n−1}] + b_c),
h_n = (1 − u_n) ⊙ h_{n−1} + u_n ⊙ h̃_n,

where sig and tanh represent the sigmoid and hyperbolic tangent activation functions, respectively.
Thus, the hidden vector h_n is updated through an interpolation between the previous hidden state h_{n−1} and a candidate hidden state h̃_n. The update gate u_n decides to what extent the contents of the hidden state are modified, and depends on how relevant the input σ_{n−1} is to the prediction (Softmax layer). The symbol ⊙ denotes the pointwise (Hadamard) product. The reset gate modeled by the vector r_n is such that if the i-th component of r_n is close to zero, it cancels out the i-th component of the hidden vector state h_{n−1}, effectively making the GRU "forget" part of the sequence that has already been encoded in the state vector h_{n−1}. The weight matrices W_{u,r,c} and the bias vectors b_{u,r,c} parametrize the GRU and are optimized using energy minimization as described in App. C. The GRU transformations in Eq. (A1) are depicted graphically in Fig. 7.

Figure 7. Graphical representation of the gated recurrent unit cell described in Eq. (A1). The magenta circles/ellipses represent pointwise operations such as vector addition or multiplication. The blue rectangles represent neural network layers labeled by the non-linearity we use. Merging lines denote vector concatenation and forking lines denote a copy operation. The sigmoid activation function is represented by σ.

To take advantage of the GPU speed-up, we use instead the cuDNN variant of GRUs implemented in TensorFlow [75], with

u_n = sig(W_u [h_{n−1}; σ_{n−1}] + b_u),   (A2)
r_n = sig(W_r [h_{n−1}; σ_{n−1}] + b_r),
h′_n = W_c^{(1)} h_{n−1} + b_c^{(1)},
h̃_n = tanh(W_c^{(2)} σ_{n−1} + r_n ⊙ h′_n + b_c^{(2)}),
h_n = (1 − u_n) ⊙ h_{n−1} + u_n ⊙ h̃_n,

which differs slightly from the above implementation of traditional GRU cells [76]. Provided that the dimensions of the hidden state h_{n−1} and input σ_{n−1} are d_h and d_v (respectively), the dimensions of the variational parameters of a GRU as in Eq. (A2) are

• dim(W_{u,r}) = d_h × (d_h + d_v),
• dim(b_{u,r}) = d_h,
• dim(W_c^{(1)}) = d_h × d_h,
• dim(W_c^{(2)}) = d_h × d_v,
• dim(b_c^{(1,2)}) = d_h.

The new hidden state h_n is fed into a Softmax layer to infer conditional probabilities, such that

y_n^{(1)} = Softmax(U^{(1)} h_n + c^{(1)}),

and also into a Softsign layer to infer the phases as

y_n^{(2)} = π Softsign(U^{(2)} h_n + c^{(2)}).

We require the outputs y_n^{(1,2)} to have dimension d_v, so that each element of y_n^{(1)} represents the conditional probability of sampling a value for the next spin σ_n ∈ {0, 1, ..., d_v − 1}, and each element of y_n^{(2)} corresponds to the phase of the chosen spin σ_n. Thus, the dimensions of the parameters introduced in the Softmax/Softsign layer are

• dim(U^{(1,2)}) = d_v × d_h,
• dim(c^{(1,2)}) = d_v.

The same reasoning can also be applied to determine the dimensions of the variational parameters of the 2D vanilla RNNs presented in App. B.
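A NumPy sketch of one GRU step of Eq. (A1); this illustrates the traditional cell, not the cuDNN variant of Eq. (A2) used in the paper's TensorFlow code, and the dictionary-based parameter passing is our own convention:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(h_prev, sigma_prev, params):
    """One GRU update, Eq. (A1); params holds W_u, b_u, W_r, b_r, W_c, b_c."""
    x = np.concatenate([h_prev, sigma_prev])            # [h_{n-1}; sigma_{n-1}]
    u = sigmoid(params["W_u"] @ x + params["b_u"])      # update gate u_n
    r = sigmoid(params["W_r"] @ x + params["b_r"])      # reset gate r_n
    xc = np.concatenate([r * h_prev, sigma_prev])       # [r_n ⊙ h_{n-1}; sigma_{n-1}]
    h_cand = np.tanh(params["W_c"] @ xc + params["b_c"])
    return (1.0 - u) * h_prev + u * h_cand              # interpolate old and candidate states
```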
Appendix B: Two-dimensional Recurrent Neural Network wave functions

Standard RNN architectures are inherently one-dimensional. However, most interesting quantum many-body systems live in higher dimensions. By taking inspiration from Refs. [21] and [77], we generalize one-dimensional RNNs to multi-dimensional RNN wave functions. In particular, we generalize to 2D vanilla RNNs that are more suitable for simulating two-dimensional square lattices than one-dimensional RNNs, which map two-dimensional lattice configurations to one-dimensional configurations and do not necessarily encode spatial information about neighboring sites in a plausible manner.

The main idea behind the implementation of 2D RNNs [21] is to replace the single hidden state that is passed from one site to another by two hidden states, with each one corresponding to the state of a neighboring site (vertical and horizontal) and hence respecting the 2D geometry of the problem. To do so, we change the one-dimensional recursion relation in Eq. (2) to the two-dimensional recursion relation

h_{i,j} = f(W^{(h)} [h_{i−1,j}; σ_{i−1,j}] + W^{(v)} [h_{i,j−1}; σ_{i,j−1}] + b),   (B1)

where h_{i,j} is the hidden state at site (i, j), W^{(v,h)} are weight matrices and b is a bias. Here f is a non-linear activation function chosen to be equal to the exponential linear unit (ELU) defined as

ELU(x) = x if x ≥ 0, and exp(x) − 1 if x < 0.

The cost of computing a new hidden state h_{i,j} is quadratic in the size of the hidden state (number of memory units d_h), and the cost of computing the gradients with respect to the variational parameters of the 2D RNN remains unchanged. This property allows us to train 2D RNNs with a relatively large d_h. To initialize the 2D RNN, we choose h_{i,0}, σ_{i,0} and h_{0,j}, σ_{0,j} to be null vectors. Once h_{i,j} is computed, we apply the same scheme as in Sec. II B to sample a spin σ_{i,j}. The scheme for computing positive or complex amplitudes from Sec. II B remains the same.

We note that generalization to higher dimensions, to other lattices, as well as to other types of RNN architectures can be done by taking inspiration from this scheme. For instance, using LSTMs [20], GRUs [40] or Transformers [25] instead of vanilla RNNs in two dimensions is expected to make a significant improvement. We also expect that using multiplicative interactions [78] might increase the expressiveness of 2D RNNs as compared to the additive interactions in Eq. (B1).
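A minimal sketch of the 2D recursion of Eq. (B1); the parameter names are ours, and in the full algorithm the neighbors' hidden states and spins are supplied in the zigzag order of Fig. 5(b):

```python
import numpy as np

def elu(x):
    """Exponential linear unit, the activation f in Eq. (B1)."""
    return np.where(x >= 0, x, np.exp(np.minimum(x, 0)) - 1.0)

def rnn2d_cell(h_left, sig_left, h_below, sig_below, W_h, W_v, b):
    """Two-dimensional hidden-state update of Eq. (B1): the new state at (i, j)
    combines the hidden states and spins of the horizontal and vertical neighbors."""
    return elu(W_h @ np.concatenate([h_left, sig_left])
               + W_v @ np.concatenate([h_below, sig_below]) + b)
```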
Appendix C: Variational Monte Carlo and Variance Reduction

The main goal of variational Monte Carlo (VMC) is to iteratively optimize an ansatz wave function to approximate, e.g., ground states of local Hamiltonians. VMC starts from a suitable trial wave function |Ψ_λ⟩ that incorporates the variational degrees of freedom of the approach. |Ψ_λ⟩ could be, for example, an MPS wave function [79], in which case the free parameters are the MPS matrices. Crucially, the ansatz |Ψ_λ⟩ has to allow for efficient sampling from the square of the amplitudes of |Ψ_λ⟩. In this paper, we choose RNN wave functions, described in Sec. II B, to parametrize the trial wave function |Ψ_λ⟩ for a VMC optimization of ground states.

The aim of the VMC optimization is to minimize the expectation value of the energy

E ≡ ⟨Ψ_λ|Ĥ|Ψ_λ⟩ / ⟨Ψ_λ|Ψ_λ⟩   (C1)

when given a family of states |Ψ_λ⟩. This minimization is carried out using the gradient descent method or any of its variants. Since the RNN wave function is normalized such that ⟨Ψ_λ|Ψ_λ⟩ = 1, the expectation value in Eq. (C1) can be written as

E = ⟨Ψ_λ|Ĥ|Ψ_λ⟩ = Σ_σ |ψ_λ(σ)|² Σ_{σ′} H_{σσ′} ψ_λ(σ′)/ψ_λ(σ) ≡ Σ_σ |ψ_λ(σ)|² E_loc(σ) ≈ (1/N_S) Σ_{σ ∼ |ψ_λ|²} E_loc(σ),   (C2)

which represents a sample average of the local energy E_loc(σ). The latter can be calculated efficiently for local Hamiltonians. Denoting λ_i to be the real variational parameters of |Ψ_λ⟩, the gradients ∂_{λ_i}E can be similarly written as

∂_{λ_i}E = Σ_σ |ψ_λ(σ)|² [∂_{λ_i}ψ*_λ(σ)/ψ*_λ(σ)] E_loc(σ) + c.c.   (C3)

An optimization step consists of drawing N_S samples {σ^(1), σ^(2), ..., σ^(N_S)} from |ψ_λ(σ)|² autoregressively using the RNN wave function, and then computing ∂_{λ_i}E from Eq. (C3) as

∂_{λ_i}E ≈ (2/N_S) Re( Σ_{i=1}^{N_S} [∂_{λ_i}ψ*_λ(σ^(i))/ψ*_λ(σ^(i))] E_loc(σ^(i)) ),   (C4)

using automatic differentiation [80], and updating the parameters (if using gradient descent) according to

λ_i ← λ_i − α ∂_{λ_i}E   (C5)

with a small learning rate α. Instead of this simple gradient descent rule, we use the Adam optimizer [81] to implement the gradient updates. We found that the latter gives better results compared to the simple gradient descent optimization shown in Eq. (C5), and without having to carefully tune the learning rate α.

We note that the stochastic evaluation of the gradients in Eq. (C4) tends to carry noise that increases their variances [82, 83]. Such high variances tend to slow down the convergence to the ground state energy. We propose to cure this limitation by introducing a new term in Eq. (C4) that helps reduce the variance of the gradients, by approximating

∂_{λ_i}E ≈ (2/N_S) Re( Σ_{i=1}^{N_S} [∂_{λ_i}ψ*_λ(σ^(i))/ψ*_λ(σ^(i))] (E_loc(σ^(i)) − E) )
        = (2/N_S) Re( Σ_{i=1}^{N_S} ∂_{λ_i} log ψ*_λ(σ^(i)) (E_loc(σ^(i)) − E) ),   (C6)

and we show below that this approximation does not introduce a bias. This new term is useful for reducing the uncertainty in the gradient estimation, as in the limit where E_loc(σ^(i)) ≈ E near convergence, the variance of the gradients ∂_{λ_i}E goes to zero, as opposed to the non-zero variance of the gradients in Eq. (C4). As a consequence, a stable convergence to the ground state is achieved, as confirmed by our experiments. This idea is similar in spirit to control variate methods in Monte Carlo [82] and to baseline methods in reinforcement learning [84].

To show that the term we add in Eq. (C6) does not bias the true gradients in Eq. (C3), it suffices to prove that

Re(⟨∂_{λ_i} log(ψ*_λ(σ))⟩ E) = 0,   (C7)

where ⟨...⟩ denotes the statistical average over the probability distribution |ψ_λ|². To prove this expression, we write ψ_λ(σ) = √(P_λ(σ)) exp(i φ_λ(σ)), which implies that

log(ψ*_λ(σ)) = (1/2) log(P_λ(σ)) − i φ_λ(σ),

and hence

Re(⟨∂_{λ_i} log(ψ*_λ(σ))⟩ E) = (1/2) ⟨∂_{λ_i} log(P_λ(σ))⟩ Re(E) + ⟨∂_{λ_i} φ_λ(σ)⟩ Im(E).   (C8)

To show that ⟨∂_{λ_i} log(P_λ(σ))⟩ = 0 [84, 85], we write

⟨∂_{λ_i} log(P_λ(σ))⟩ = Σ_σ P_λ(σ) ∂_{λ_i} log(P_λ(σ)) = Σ_σ P_λ(σ) [∂_{λ_i}P_λ(σ)/P_λ(σ)] = ∂_{λ_i} Σ_σ P_λ(σ) = ∂_{λ_i} 1 = 0,

where the fact that the RNN wave function is normalized justifies the transition from the third expression to the fourth. From here, it suffices to show that ⟨∂_{λ_i} φ_λ(σ)⟩ Im(E) = 0. Since the Hamiltonian Ĥ is Hermitian, the expectation value E is real and hence Im(E) = 0. We therefore arrive at Eq. (C7).
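A sketch of the variance-reduced estimator of Eq. (C6), assuming the per-sample log-derivatives ∂_{λ_i} log ψ*_λ(σ^(i)) have already been obtained (e.g., by automatic differentiation); the array-based interface is our own simplification:

```python
import numpy as np

def vmc_gradient(grad_log_psi_conj, e_loc):
    """Variance-reduced VMC gradient, Eq. (C6).

    grad_log_psi_conj : (N_s, n_params) array of d/d(lambda_i) log psi*_lambda(sigma^(i))
    e_loc             : (N_s,) array of (possibly complex) local energies E_loc(sigma^(i))
    """
    e_mean = e_loc.mean()                   # energy estimate E, Eq. (C2), used as baseline
    centered = e_loc - e_mean               # E_loc - E: vanishes near convergence
    grad = 2.0 * np.real(grad_log_psi_conj.T @ centered) / len(e_loc)
    return grad, e_mean
```

Subtracting the mean local energy is exactly the control-variate (baseline) term discussed above: it leaves the expectation of the gradient unchanged while driving its variance to zero as the ansatz approaches an eigenstate.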
Appendix D: Implementing Symmetries

1. Imposing discrete symmetries

Inspired by Refs. [32] and [56], we propose to implement discrete symmetries in a similar fashion for RNN wave functions without spoiling their autoregressive nature. Assuming that a Hamiltonian Ĥ has a symmetry under discrete transformations T, its ground state

|Ψ_G⟩ = Σ_σ ψ_G(σ) |σ⟩

is an eigenvector of the symmetry transformation T. The ground state transforms as ψ_G(Tσ) = ω_T ψ_G(σ), where ω_T is an eigenvalue with modulus 1 that is independent of the choice of σ. This expression implies that the transformation T changes the ground state only by a global phase term that does not affect the probability distribution, and changes the sign structure by a global phase term. It is thus desirable that the RNN wave function also has this symmetry.

To enforce a discrete symmetry {T} on an RNN wave function |Ψ_λ⟩, we propose the following scheme:

• Generate a sample σ autoregressively from the RNN wave function.
• Sample with probability 1/Card(G) a transformation T from the symmetry transformation group G = {1, T_1, ...} that leaves the Hamiltonian Ĥ invariant, and apply the transformation T to σ.
• Assign to the spin configuration σ̃ = Tσ the amplitude ψ_λ(σ̃) = √(P_λ(σ̃)) exp(iφ_λ(σ̃)), such that

  P_λ(σ̃) = (1/Card(G)) Σ_{T̃ ∈ G} P_λ(T̃σ),
  φ_λ(σ̃) = Arg( ω_T Σ_{T̃ ∈ G} exp(i φ_λ(T̃σ)) ),

  where P_λ(T̃σ) is a probability generated by the Softmax layer and φ_λ(T̃σ) is a phase generated by the Softsign layer, as explained in Sec. II B.

If the ground state is positive [41], we use the same algorithm but only symmetrize the probability P_λ, without having to worry about symmetrizing the phase φ_λ. For concreteness, we illustrate the algorithm above with "Symmetric RNNs" that have a built-in parity symmetry. We use this architecture in Sec. III A to get a more accurate estimate of the ground state of the 1D TFIM, which also obeys a parity symmetry. Indeed, symmetric RNNs show an improvement over ordinary pRNN wave functions on the task of estimating the second Rényi entropy (see App. E). Symmetric RNNs can be implemented using the following procedure (a code sketch follows below):

• Sample each configuration σ.
• Choose to apply or not to apply the parity transformation P̂ on σ, each with probability 1/2.
• Assign to σ the probability

  P = (1/2) (P_λ(σ) + P_λ(P̂σ)).

We also emphasize the possibility of carefully designing RNN wave functions to impose discrete symmetries without using the symmetrization scheme above, which we leave for future investigations.
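A minimal sketch of the parity-symmetrized probability for a pRNN (positive ground state), assuming a callable rnn_prob(sigma) that returns P_λ(σ) and a spatial parity P̂ that simply reverses the chain:

```python
def symmetric_prob(sigma, rnn_prob):
    """Parity-symmetrized probability: P = (P_lambda(sigma) + P_lambda(P sigma)) / 2."""
    return 0.5 * (rnn_prob(sigma) + rnn_prob(sigma[::-1]))
```

The symmetrization costs one extra forward pass per configuration and leaves the autoregressive sampling procedure untouched.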
2. Imposing zero magnetization

Since the ground state of the J_1-J_2 model has zero magnetization, i.e., a U(1) symmetry [52, 86], it is helpful to enforce this constraint on our RNN wave functions to get accurate estimations of the ground state energy. To do so, we propose an efficient way to generate samples with zero magnetization while maintaining the autoregressive property of the RNN wave function. The procedure effectively applies a projector P_{S^z=0} to the original state, which restricts the RNN wave function to the subspace of configurations with zero magnetization. This procedure avoids generating a large number of samples and discarding the ones that have non-zero magnetization.

The condition of zero magnetization implies that the number of up spins should be equal to the number of down spins. To satisfy this constraint, we utilize the following algorithm (a code sketch follows below):

• Sample autoregressively the first half of the spin configuration (σ_1, σ_2, ..., σ_{N/2}).
• At each step i > N/2:
  – Generate the output of the RNN wave function y_i = (ψ_i^down, ψ_i^up), where ψ_i^down and ψ_i^up are both non-zero and their moduli squared sum to 1.
  – Define the following amplitudes:

    a_i = ψ_i^down × Ξ(N/2 − N_down(i)),
    b_i = ψ_i^up × Ξ(N/2 − N_up(i)),

    where

    Ξ(x) = 1 if x > 0, and 0 if x ≤ 0,

    and

    N_down(i) = Card({j : σ_j = 0 and j < i}),
    N_up(i) = Card({j : σ_j = 1 and j < i}).

    In words, N_up(i)/N_down(i) is the number of up/down spins generated before step i.
  – Sample σ_i from |ỹ_i|², where

    ỹ_i = (a_i, b_i) / √(a_i² + b_i²),

    which is normalized, i.e., ‖ỹ_i‖ = 1.

Using this algorithm, it is clear that the RNN wave function generates a spin configuration that has the same number of up spins and down spins, and hence a zero magnetization. In fact, at each step i > N/2, the function Ξ assigns a zero amplitude for the next spin σ_i to be up if N_up(i) = N/2, and likewise to be down if N_down(i) = N/2, so that the probabilities |ỹ_i|² are also normalized. We also note that this algorithm preserves the autoregressive property of the original RNN wave function and can also be parallelized. Moreover, this scheme can be easily extended to the generation of samples with a non-zero fixed magnetization, which is useful when considering the problem of finding states that live in a non-zero fixed magnetization sector.
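A sketch of this masked sampling step. Masking the Softmax probabilities and renormalizing is mathematically equivalent to the amplitude construction with Ξ above; the Dirichlet draw is only a stand-in for the RNN output, and all names are our own:

```python
import numpy as np

def project_zero_magnetization(probs, n_down, n_up, n_total):
    """Zero out the probability of any spin value that would break S^z = 0,
    then renormalize (App. D 2).  probs = (P_down, P_up) at the current step."""
    mask = np.array([n_down < n_total // 2,      # Xi(N/2 - N_down)
                     n_up < n_total // 2], dtype=float)
    masked = probs * mask
    return masked / masked.sum()

rng = np.random.default_rng(1)
N = 6
sigma, n_down, n_up = [], 0, 0
for i in range(N):
    probs = rng.dirichlet([1.0, 1.0])            # stand-in for the RNN Softmax output
    y = project_zero_magnetization(probs, n_down, n_up, N)
    s = rng.choice(2, p=y)                       # sigma_i = 0 (down) or 1 (up)
    sigma.append(int(s))
    n_down, n_up = n_down + (s == 0), n_up + (s == 1)
print(sigma, "magnetization:", n_up - n_down)
```

The mask is inactive until one spin species reaches N/2, so the first half of the chain is sampled exactly as in the unconstrained case.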
Appendix E: Rényi entropies

Given a quantum system with a spatial bipartition (A, B), one can write the RNN wave function |Ψ_λ⟩ as

|Ψ_λ⟩ = Σ_{σ_A, σ_B} ψ_λ(σ_A σ_B) |σ_A σ_B⟩,

where σ_{A/B} denotes the spin configuration that lives in the partition A/B, and σ_A σ_B stands for a concatenation of σ_A and σ_B. The α-Rényi entropy between regions A and B is given by

S_α(A) = 1/(1 − α) log(Tr ρ_A^α),   (E1)

where ρ_A = Tr_B |Ψ_λ⟩⟨Ψ_λ| and α is an integer [48]. To estimate these entropies, we use the so-called replica trick [48], where we consider the action of the Swap_A operator on two copies of the RNN wave function, which swaps the spins in region A between the two copies (as demonstrated in Fig. 8), such that

Swap_A |Ψ_λ⟩ ⊗ |Ψ_λ⟩ = Σ_{σ, σ̃} ψ_λ(σ_A σ_B) ψ_λ(σ̃_A σ̃_B) |σ̃_A σ_B⟩ ⊗ |σ_A σ̃_B⟩.   (E2)

The expectation value of Swap_A in the double copy of the RNN wave function |Ψ_λ⟩ ⊗ |Ψ_λ⟩ is given by [42, 48]

⟨Swap_A⟩ = Σ_{σ, σ̃} ψ*_λ(σ_A σ_B) ψ*_λ(σ̃_A σ̃_B) ψ_λ(σ̃_A σ_B) ψ_λ(σ_A σ̃_B) = Tr ρ_A² = exp(−S_2(A)).   (E3)

Hence, by calculating the expectation value of the Swap operator in the double copy of the RNN wave function, we can access the second Rényi entropy. Interestingly, the Rényi entropies S_α have been shown to encode similar properties independently of α [46, 48]. Although an exact evaluation of Eq. (E3) is numerically intractable, we can use importance sampling to estimate it [48] as

⟨Swap_A⟩ = Σ_{σ, σ̃} |ψ_λ(σ_A σ_B)|² |ψ_λ(σ̃_A σ̃_B)|² [ψ_λ(σ̃_A σ_B) ψ_λ(σ_A σ̃_B)] / [ψ_λ(σ_A σ_B) ψ_λ(σ̃_A σ̃_B)]
          ≈ (1/N_s) Σ_{i=1}^{N_s} [ψ_λ(σ̃_A^(i) σ_B^(i)) ψ_λ(σ_A^(i) σ̃_B^(i))] / [ψ_λ(σ_A^(i) σ_B^(i)) ψ_λ(σ̃_A^(i) σ̃_B^(i))].   (E4)

Using this trick, for the system sizes studied in this paper we only have to generate two sets of exact samples {σ^(i)}_{i=1}^{N_s} and {σ̃^(i)}_{i=1}^{N_s} independently from |ψ_λ|², without having to use the improved ratio trick [48]. By defining

Swap_A^(i) ≡ [ψ_λ(σ̃_A^(i) σ_B^(i)) ψ_λ(σ_A^(i) σ̃_B^(i))] / [ψ_λ(σ_A^(i) σ_B^(i)) ψ_λ(σ̃_A^(i) σ̃_B^(i))],

the statistical error on the estimation of the Rényi-2 entropy can be calculated as

ε = (1/⟨Swap_A⟩) √( var({Swap_A^(i)}) / N_s ).

For the estimation of the Rényi-2 entropy for the 1D TFIM in this paper, we use samples from a trained RNN wave function with one GRU layer and 50 memory units. During the completion of this paper, we became aware of another way to estimate entanglement entropies using autoregressive models with conditional sampling [87].

Figure 8. The Swap operator acting on the tensor product of two samples σ and σ′.

Appendix F: Tables of Results

In Tab. I, we state the variational energies of the cRNN wave function for the 1D J_1-J_2 model and compare with results from DMRG. We examine two different methods of training: first, we do not impose an initial sign structure, and second, we introduce a background Marshall sign. The results suggest that using a Marshall sign improves the results significantly for the smaller values of J_2 (we set J_1 = 1 in all cases).

Table I. Variational energies per site E/N of the 1D J_1-J_2 model, for a cRNN wave function with two different methods of training (with no initial sign structure and with a background Marshall sign), compared with results from DMRG; the columns are J_2, E/N (No Sign), E/N (Marshall Sign), and E/N (DMRG). All results correspond to 100 spins and have J_1 = 1. We use three GRU layers, where each layer has 100 units.

In Tab. II, we compare the variational energies per site of the 2D TFIM with a lattice size of 12 × 12 for different values of the transverse magnetic field h, for a 1D pRNN wave function, a 2D pRNN wave function, a PixelCNN wave function [56] and DMRG.

Table II. Variational energies per site for a 1D pRNN wave function (three layers of GRUs with 100 memory units), a 2D pRNN wave function (a single layer of 2D vanilla RNNs with 100 memory units), PixelCNN wave functions with results taken from Ref. [56], and DMRG (with bond dimension χ = 512 for h = 2 and χ = 1024 for both h = 3, 4) for the 2D TFIM on a lattice of size 12 × 12 for different values of h, where the critical point is at h ≈ 3.

h | 1DRNN | 2DRNN | PixelCNN | DMRG
2 | −2.40960262(9) | −2.4096022(2) | – | −2.40960263
3 | −3.1739018(2) | −3.1739005(5) | – | −3.17389966
4 | −4.1217969(3) | −4.12179808(6) | −4.1217979(2) | −4.12179793

Appendix G: Scaling of resources (continued)

Fig. 9 shows the dependence of σ² on the number of samples used to estimate the gradients of the variational energy (see App. C). We investigate this effect for the case of the 1D TFIM, using 50 memory units in the pRNN wave function. Even though a large number of samples yields higher statistical accuracy of the gradient estimates used in our optimizations, we observe only a weak dependence of σ² on the number of samples for all studied system sizes.

Figure 9. The energy variance per spin against the number of samples, which suggests that the energy variance saturates and does not improve further by using a larger number of samples for training.

In Fig. 10 we present results for the dependence of σ² on the depth of the pRNN wave function architecture for a critical TFIM with N = 40 sites.
We investigate architectures up to a depth of four layers. The number of memory units is adapted such that we have a similar number of variational parameters at every depth. We find that σ² depends only weakly on the number of layers.

Figure 10. Scaling study of the energy variance per spin vs the number of layers of a pRNN wave function, where all pRNN wave functions with different numbers of layers have approximately the same number of variational parameters. The results show that fixing the number of parameters while changing the number of layers does not affect the energy variance obtained by the pRNN wave function.

Appendix H: Hyperparameters

In Tab. III, we present the hyperparameters used to train the RNN wave functions in this paper. We anticipate that further improvements such as the use of stochastic reconfiguration [13] or a computationally cheaper variant such as K-FAC [71] for the optimization could potentially lead to more accurate estimations of the ground state energies as compared to the Adam optimizer [81]. Seeds are listed in the table for reproducibility purposes.

Table III. Hyperparameters used for the results shown in each figure.

Fig. 3
  Architecture: one-layer 1D pRNN wave function with 50 memory units
  Number of samples: N_s = 1000 (N = 20), N_s = 500 (N = 80), N_s = 200 (N = 1000)
  Training iterations: 20000
  Learning rate: 5 × 10^{−…}
  Seed: 111

Fig. 4
  Architecture: three-layer 1D cRNN wave function with 100 memory units
  Number of samples: 500
  Training iterations: 100000
  Learning rate: (η^{−1} + 0.…t)^{−1} with η = 2.… × 10^{−…}
  Seed: 111

Fig. 5(c), 1DRNN
  Architecture: three-layer 1D pRNN wave function with 100 memory units
  Number of samples: 500
  Training iterations: 150000
  Learning rate: (η^{−1} + 0.…t)^{−1} with η = 10^{−…}
  Seed: 333

Fig. 5(c), 2DRNN
  Architecture: one-layer 2D pRNN wave function with 100 memory units
  Number of samples: 500
  Training iterations: 150000
  Learning rate: η(1 + t/…)^{−1} with η = 5 × 10^{−…}
  Seed: 111

Fig. 6(a)
  Architecture: one-layer 1D pRNN wave function
  Number of samples: 500
  Training iterations: 10000
  Learning rate: 10^{−…}
  Seeds: 111, …

Figs. 6(b), 9, and 10
  Learning rates: inverse-time schedules of the same form as above
  Seeds: 111, …
770–778.[4] Tom Young, Devamanyu Hazarika, Soujanya Poria, andErik Cambria, “Recent trends in deep learning basednatural language processing,” (2017), arXiv:1708.02709[cs.CL].[5] Jessica Vamathevan, Dominic Clark, Paul Czodrowski,Ian Dunham, Edgardo Ferran, George Lee, Bin Li, AnantMadabhushi, Parantu Shah, Michaela Spitzer, and Shan-rong Zhao, “Applications of machine learning in drug dis-covery and development,” Nature Reviews Drug Discov-ery , 463–477 (2019).[6] Claudine Badue, Rnik Guidolini, Raphael VivacquaCarneiro, Pedro Azevedo, Vinicius Brito Cardoso,Avelino Forechi, Luan Jesus, Rodrigo Berriel, ThiagoPaixo, Filipe Mutz, Lucas Veronese, Thiago Oliveira-Santos, and Alberto Ferreira De Souza, “Self-drivingcars: A survey,” (2019), arXiv:1901.04407 [cs.RO].[7] David Silver, Julian Schrittwieser, Karen Simonyan,Ioannis Antonoglou, Aja Huang, Arthur Guez, ThomasHubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yu-tian Chen, Timothy Lillicrap, Fan Hui, Laurent Sifre,George van den Driessche, Thore Graepel, and DemisHassabis, “Mastering the game of go without humanknowledge,” Nature , 354–359 (2017).[8] Giuseppe Carleo, Ignacio Cirac, Kyle Cranmer, Lau-rent Daudet, Maria Schuld, Naftali Tishby, Leslie Vogt-Maranto, and Lenka Zdeborov, “Machine learning andthe physical sciences,” Reviews of Modern Physics (2019), 10.1103/revmodphys.91.045002.[9] Giuseppe Carleo and Matthias Troyer, “Solving thequantum many-body problem with artificial neural net-works,” Science , 602606 (2017).[10] P. Hohenberg and W. Kohn, “Inhomogeneous electrongas,” Phys. Rev. , B864–B871 (1964).[11] J. Bardeen, L. N. Cooper, and J. R. Schrieffer, “Theoryof superconductivity,” Phys. Rev. , 1175–1204 (1957).[12] R. B. Laughlin, “Anomalous quantum hall effect: Anincompressible quantum fluid with fractionally chargedexcitations,” Phys. Rev. Lett. , 1395–1398 (1983).[13] F. Becca and S. Sorella, Quantum Monte Carlo Ap-proaches for Correlated Systems (Cambridge UniversityPress, 2017).[14] Rom´an Or´us, “Tensor networks for complex quantumsystems,” Nature Reviews Physics , 538–550 (2019).[15] Alberto Peruzzo, Jarrod McClean, Peter Shadbolt, Man-Hong Yung, Xiao-Qi Zhou, Peter J. Love, Al´an Aspuru-Guzik, and Jeremy L. O’Brien, “A variational eigenvaluesolver on a photonic quantum processor,” Nature Com-munications , 1–7 (2014). [16] J. Androsiuk, L. Kulak, and K. Sienicki, “Neural net-work solution of the Schr¨odinger equation for a two-dimensional harmonic oscillator,” Chemical Physics ,377–383 (1993).[17] “Artificial neural network methods in quantum mechan-ics,” Computer Physics Communications , 1 – 14(1997).[18] M. Sugawara, “Numerical solution of the Schr¨odingerequation by neural network and genetic algorithm,”Computer Physics Communications , 366–380(2001).[19] Roger G. Melko, Giuseppe Carleo, Juan Carrasquilla,and J. Ignacio Cirac, “Restricted Boltzmann machinesin quantum physics,” Nature Physics , 1 (2019).[20] Sepp Hochreiter and J¨urgen Schmidhuber, “Long short-term memory,” Neural Comput. , 1735–1780 (1997).[21] Alex Graves, “Supervised sequence labelling,” in Super-vised sequence labelling with recurrent neural networks (Springer, 2012) pp. 
[22] Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio, “Learning phrase representations using RNN encoder-decoder for statistical machine translation,” (2014), arXiv:1406.1078 [cs.CL].
[23] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio, “Empirical evaluation of gated recurrent neural networks on sequence modeling,” (2014), arXiv:1412.3555 [cs.NE].
[24] Zachary C. Lipton, John Berkowitz, and Charles Elkan, “A critical review of recurrent neural networks for sequence learning,” (2015), arXiv:1506.00019 [cs.LG].
[25] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin, “Attention is all you need,” (2017), arXiv:1706.03762 [cs.CL].
[26] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” (2018), arXiv:1810.04805 [cs.CL].
[27] Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Carbonell, Ruslan Salakhutdinov, and Quoc V. Le, “XLNet: Generalized autoregressive pretraining for language understanding,” (2019), arXiv:1906.08237 [cs.CL].
[28] Juan Carrasquilla, Giacomo Torlai, Roger G. Melko, and Leandro Aolita, “Reconstructing quantum states with generative models,” Nature Machine Intelligence 1, 155–161 (2019).
[29] Yoav Levine, Or Sharir, Nadav Cohen, and Amnon Shashua, “Quantum entanglement in deep learning architectures,” Phys. Rev. Lett. 122, 065301 (2019).
[30] S. Bengio and Y. Bengio, “Taking on the curse of dimensionality in joint distributions using neural networks,” IEEE Transactions on Neural Networks 11, 550–557 (2000).
[31] Benigno Uria, Marc-Alexandre Côté, Karol Gregor, Iain Murray, and Hugo Larochelle, “Neural autoregressive distribution estimation,” J. Mach. Learn. Res. 17, 7184–7220 (2016).
[32] Dian Wu, Lei Wang, and Pan Zhang, “Solving statistical mechanics using variational autoregressive networks,” Phys. Rev. Lett. 122, 080602 (2019).
[33] Ian Goodfellow, Yoshua Bengio, and Aaron Courville, Deep Learning (MIT Press, 2016).
[34] Ian Goodfellow, “NIPS 2016 tutorial: Generative adversarial networks,” (2016), arXiv:1701.00160 [cs.LG].
[35] Y. Bengio, P. Simard, and P. Frasconi, “Learning long-term dependencies with gradient descent is difficult,” IEEE Transactions on Neural Networks 5, 157–166 (1994).
[36] J. F. Kolen and S. C. Kremer, “Gradient flow in recurrent nets: The difficulty of learning long-term dependencies,” in A Field Guide to Dynamical Recurrent Networks (IEEE, 2001) pp. 237–243.
[37] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio, “On the difficulty of training recurrent neural networks,” in International Conference on Machine Learning (2013) pp. 1310–1318.
[38] M. Fannes, B. Nachtergaele, and R. F. Werner, “Finitely correlated states on quantum spin chains,” Communications in Mathematical Physics 144, 443–490 (1992).
[39] Huitao Shen, “Mutual information scaling and expressive power of sequence models,” (2019), arXiv:1905.04271 [cs.LG].
[40] Kyunghyun Cho, Bart van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio, “On the properties of neural machine translation: Encoder–decoder approaches,” in Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation (Association for Computational Linguistics, Doha, Qatar, 2014) pp. 103–111.
[41] Sergey Bravyi, David P. DiVincenzo, Roberto Oliveira, and Barbara M. Terhal, “The complexity of stoquastic local Hamiltonian problems,” Quantum Info. Comput. 8, 361–385 (2008).
[42] Giacomo Torlai, Guglielmo Mazzola, Juan Carrasquilla, Matthias Troyer, Roger Melko, and Giuseppe Carleo, “Neural-network quantum state tomography,” Nature Physics 14, 447 (2018).
[43] Steven R. White, “Density matrix formulation for quantum renormalization groups,” Phys. Rev. Lett. 69, 2863–2866 (1992).
[44] Chase Roberts, Ashley Milsted, Martin Ganahl, Adam Zalcman, Bruce Fontaine, Yijian Zou, Jack Hidary, Guifre Vidal, and Stefan Leichenauer, “TensorNetwork: A library for physics and machine learning,” (2019), arXiv:1905.01330 [physics.comp-ph].
[45] A. A. Abrikosov, I. Dzyaloshinskii, L. P. Gorkov, and Richard A. Silverman, Methods of Quantum Field Theory in Statistical Physics (Dover, New York, NY, 1975).
[46] Steven T. Flammia, Alioscia Hamma, Taylor L. Hughes, and Xiao-Gang Wen, “Topological entanglement Rényi entropy and reduced density matrix structure,” Phys. Rev. Lett. 103, 261601 (2009).
[47] Shinsei Ryu and Tadashi Takayanagi, “Holographic derivation of entanglement entropy from the anti-de Sitter space/conformal field theory correspondence,” Phys. Rev. Lett. 96, 181602 (2006).
[48] Matthew B. Hastings, Iván González, Ann B. Kallin, and Roger G. Melko, “Measuring Renyi entanglement entropy in quantum Monte Carlo simulations,” Phys. Rev. Lett. 104, 157201 (2010).
[49] Steven R. White and Ian Affleck, “Dimerization and incommensurate spiral spin correlations in the zigzag spin chain: Analogies to the Kondo lattice,” Phys. Rev. B 54, 9862–9869 (1996).
[50] Sebastian Eggert, “Numerical evidence for multiplicative logarithmic corrections from marginal operators,” Phys. Rev. B 54, R9612–R9615 (1996).
[51] Federico Becca, Luca Capriotti, Alberto Parola, and Sandro Sorella, “Variational wave functions for frustrated magnetic models,” (2009), arXiv:0905.4854 [cond-mat.str-el].
[52] W. Marshall, “Antiferromagnetism,” Proceedings of the Royal Society of London. Series A. Mathematical and Physical Sciences, 48–68 (1955).
[53] Kenny Choo, Titus Neupert, and Giuseppe Carleo, “Two-dimensional frustrated J1–J2 model studied with neural network quantum states,” Phys. Rev. B 100, 125124 (2019).
[54] Giacomo Torlai, Juan Carrasquilla, Matthew T. Fishman, Roger G. Melko, and Matthew P. A. Fisher, “Wave function positivization via automatic differentiation,” (2019), arXiv:1906.04654 [quant-ph].
[55] Jérôme Thibaut, Tommaso Roscilde, and Fabio Mezzacapo, “Long-range entangled-plaquette states for critical and frustrated quantum systems on a lattice,” Phys. Rev. B 100, 155148 (2019).
[56] Or Sharir, Yoav Levine, Noam Wies, Giuseppe Carleo, and Amnon Shashua, “Deep autoregressive models for the efficient variational simulation of many-body quantum systems,” Phys. Rev. Lett. 124, 020503 (2020).
[57] F. Verstraete and J. I. Cirac, “Renormalization algorithms for quantum-many body systems in two and higher dimensions,” (2004), arXiv:cond-mat/0407066 [cond-mat.str-el].
[58] Simeng Yan, David A. Huse, and Steven R. White, “Spin-liquid ground state of the S = 1/2 kagome Heisenberg antiferromagnet,” Science 332, 1173–1176 (2011).
[59] G. Evenbly and G. Vidal, “Tensor network renormalization,” Phys. Rev. Lett. 115, 180405 (2015).
[60] Anders W. Sandvik, “Computational studies of quantum spin systems,” AIP Conference Proceedings 1297, 135–338 (2010).
[61] Henk W. J. Blöte and Youjin Deng, “Cluster Monte Carlo simulation of the transverse Ising model,” Phys. Rev. E 66, 066110 (2002).
[62] Laurens Vanderstraeten, Jutho Haegeman, Philippe Corboz, and Frank Verstraete, “Gradient methods for variational optimization of projected entangled-pair states,” Phys. Rev. B 94, 155123 (2016).
[63] Aaron van den Oord, Nal Kalchbrenner, Lasse Espeholt, Koray Kavukcuoglu, Oriol Vinyals, and Alex Graves, “Conditional image generation with PixelCNN decoders,” in Advances in Neural Information Processing Systems 29, edited by D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett (Curran Associates, Inc., 2016) pp. 4790–4798.
[64] Claudius Gros, “Criterion for a good variational wave function,” Phys. Rev. B 42, 6835–6838 (1990).
[65] Roland Assaraf and Michel Caffarel, “Zero-variance zero-bias principle for observables in quantum Monte Carlo: Application to forces,” The Journal of Chemical Physics 119, 10536–10552 (2003).
[66] C. Hubig, J. Haegeman, and U. Schollwöck, “Error estimates for extrapolations with matrix-product states,” Phys. Rev. B 97, 045125 (2018).
[67] Mari Carmen Bañuls, David A. Huse, and J. Ignacio Cirac, “How much entanglement is needed to reduce the energy variance?” (2019), arXiv:1912.07639 [quant-ph].
[68] Michiel Hermans and Benjamin Schrauwen, “Training and analysing deep recurrent neural networks,” in Advances in Neural Information Processing Systems 26, edited by C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger (Curran Associates, Inc., 2013) pp. 190–198.
[69] Shiyu Chang, Yang Zhang, Wei Han, Mo Yu, Xiaoxiao Guo, Wei Tan, Xiaodong Cui, Michael Witbrock, Mark Hasegawa-Johnson, and Thomas S. Huang, “Dilated recurrent neural networks,” (2017), arXiv:1710.02224 [cs.AI].
[70] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio, “Neural machine translation by jointly learning to align and translate,” (2014), arXiv:1409.0473 [cs.CL].
[71] James Martens, Jimmy Ba, and Matt Johnson, “Kronecker-factored curvature approximations for recurrent neural networks,” in International Conference on Learning Representations (2018).
[72] Kenny Choo, Antonio Mezzacapo, and Giuseppe Carleo, “Fermionic neural-network states for ab-initio electronic structure,” (2019), arXiv:1909.12852 [physics.comp-ph].
[73] Juan Carrasquilla, Di Luo, Felipe Pérez, Ashley Milsted, Bryan K. Clark, Maksims Volkovs, and Leandro Aolita, “Probabilistic simulation of quantum circuits with the transformer,” (2019), arXiv:1912.11052 [cond-mat.str-el].
[74] Christopher Roth, “Iterative retraining of quantum spin models using recurrent neural networks,” (2020), arXiv:2003.06228 [physics.comp-ph].
[75] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng, “TensorFlow: Large-scale machine learning on heterogeneous systems,” (2015), software available from tensorflow.org.
[76] Jeremy Appleyard, Tomas Kocisky, and Phil Blunsom, “Optimizing performance of recurrent neural networks on GPUs,” (2016), arXiv:1604.01946 [cs.LG].
[77] Aaron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu, “Pixel recurrent neural networks,” (2016), arXiv:1601.06759 [cs.CV].
[78] Yuhuai Wu, Saizheng Zhang, Ying Zhang, Yoshua Bengio, and Ruslan R. Salakhutdinov, “On multiplicative integration with recurrent neural networks,” in Advances in Neural Information Processing Systems (2016) pp. 2856–2864, arXiv:1606.06630.
[79] A. W. Sandvik and G. Vidal, “Variational quantum Monte Carlo simulations with tensor-network states,” Phys. Rev. Lett. 99, 220602 (2007).
[80] Shi-Xin Zhang, Zhou-Quan Wan, and Hong Yao, “Automatic differentiable Monte Carlo: Theory and application,” (2019), arXiv:1911.09117 [physics.comp-ph].
[81] Diederik P. Kingma and Jimmy Ba, “Adam: A method for stochastic optimization,” (2014), arXiv:1412.6980 [cs.LG].
[82] Roland Assaraf and Michel Caffarel, “Zero-variance principle for Monte Carlo algorithms,” Phys. Rev. Lett. 83, 4682–4685 (1999).
[83] Bryan Clark, “Variational Monte Carlo notes for Boulder summer school 2010,” (2010).
[84] Shakir Mohamed, Mihaela Rosca, Michael Figurnov, and Andriy Mnih, “Monte Carlo gradient estimation in machine learning,” (2019), arXiv:1906.10652 [stat.ML].
[85] Richard S. Sutton, David A. McAllester, Satinder P. Singh, and Yishay Mansour, “Policy gradient methods for reinforcement learning with function approximation,” in Advances in Neural Information Processing Systems (2000) pp. 1057–1063.
[86] Elliott Lieb and Daniel Mattis, “Ordering energy levels of interacting spin systems,” Journal of Mathematical Physics 3, 749 (1962).