Teacher-student learning for a binary perceptron with quantum fluctuations
Shunta Arai,* Masayuki Ohzeki, and Kazuyuki Tanaka
Graduate School of Information Sciences, Tohoku University, Sendai 980-8579, Japan
Sigma-i Co., Ltd., Tokyo, Japan
Institute of Innovative Research, Tokyo Institute of Technology, Nagatsuta-cho 4259, Midori-ku, Yokohama, Kanagawa 226-8503, Japan
(Dated: February 18, 2021)

* [email protected]

We analysed the generalisation performance of a binary perceptron with quantum fluctuations using the replica method. An exponential number of local minima dominate the energy landscape of the binary perceptron, and local search algorithms often fail to identify its ground state. In this study, we considered teacher-student learning and computed the generalisation error of a binary perceptron with quantum fluctuations. Owing to the quantum fluctuations, we can efficiently find robust solutions that have better generalisation performance than the classical model. We validated our theoretical results through quantum Monte Carlo simulations. We adopted the replica symmetric (RS) ansatz and the static approximation. The RS solutions are consistent with our simulation results, except for relatively low strengths of the transverse field and high pattern ratios; these deviations are caused by the violation of ergodicity and of the static approximation. After accounting for the deviation between the RS solutions and the numerical results, the enhancement of generalisation performance by quantum fluctuations still holds.

I. INTRODUCTION
Deep neural networks (DNNs) have achieved excellent performance on a wide range of tasks [1]. In supervised learning, DNNs generalise to unseen data better than classical machine learning algorithms. Generally, when applying DNNs to supervised learning, the number of parameters exceeds the number of data points, which is called an 'over-parameterised setting'. In such a setting, it is difficult to explain why DNNs generalise well using classical statistical learning theory. Therefore, the theoretical understanding of the generalisation performance of DNNs is still under development [2].

For simple models, typical generalisation performance can be analysed theoretically using statistical mechanics [3, 4]. The simplest case is the binary perceptron [5, 6]. In statistical-mechanics analysis, we consider the thermodynamic limit in which the number of parameters N → ∞ and the number of data p → ∞ while the number of data per parameter α ≡ p/N = O(1) is fixed. For random pattern learning, a binary perceptron can learn random input-output patterns up to the storage capacity α_c ≃ 0.833 [7]. For teacher-student learning, we consider a student perceptron and a teacher perceptron. The teacher perceptron generates output data from input data, and the student perceptron learns the input-output patterns of the given dataset. The student perceptron can perfectly predict the outputs generated by the teacher perceptron for α ≥ α_c ≃ 1.245 [8].

The energy landscape of a binary perceptron is highly non-convex and is dominated by an exponential number of local minima [9, 10]. Ground states are geometrically isolated in the solution space: the Hamming distance between solutions is proportional to N. A local spin-flip algorithm based on free-energy minimisation, such as simulated annealing (SA) [11], can easily become stuck in metastable states and fail to find the ground state [12, 13]. In the solution space, there is a subdominant dense region in which the solutions have high local entropy. In this region, all solutions have similar energy values; we refer to these solutions as robust solutions in this paper. By incorporating local entropy, the standard SA technique can be modified so that the resulting algorithm finds subdominant solutions [14, 15]. The quality of the subdominant solutions differs from that of the typical solutions found by standard SA. In teacher-student learning, the generalisation error of the subdominant solutions is lower than that of the typical solutions [16].

Several empirical studies have demonstrated that the local energy (loss) landscape is related to generalisation performance [17–19]. Stochastic gradient descent (SGD) with small batches often finds a flat minimiser of the energy landscape, whereas SGD with large batches converges to a sharp minimiser. Although the training errors of these two solutions are the same or similar, their generalisation performances differ significantly: a flat minimiser generalises better than a sharp one. A flat region in the energy landscape is robust to perturbations of the parameters and to fluctuations of the data. By incorporating the effects of entropy into the loss function, a flat minimiser becomes algorithmically reachable. The Entropy-SGD algorithm developed in a recent study [20] computes the local entropy and exploits stochastic gradient Langevin dynamics to approximate the gradient of the local entropy, allowing it to find a flat minimiser.

To optimise DNNs, we can also utilise quantum fluctuations. In a recent study [21], the authors formulated an optimisation algorithm based on quantum fluctuations using a path-integral representation. They determined that finite quantum fluctuations improve the generalisation performance of DNNs, allowing solutions to converge to a flat minimiser. In a binary perceptron with random pattern learning, quantum annealing (QA) [22–24] can identify a flat region in the energy landscape [25]; the local entropy obtained by QA is greater than that of the typical solutions obtained by SA.

In this study, we extended previous research on random pattern learning [25] to teacher-student learning. We computed the typical behaviour of the generalisation error of a binary perceptron with quantum fluctuations using the replica method. A previous study observed that quantum fluctuations lead to a flat minimiser; here, we investigated whether the generalisation performance of a binary perceptron is enhanced by quantum fluctuations. This study is an analytical demonstration of the effectiveness of quantum fluctuations for machine learning problems.
The remainder of this paper is organised as follows. In Section II, we present the formulation of a binary perceptron with quantum fluctuations. In Section III, we derive the free energy and the saddle-point equations using the replica method. In Section IV, we numerically solve the saddle-point equations, present the phase diagram, and verify our theoretical analysis through quantum Monte Carlo simulations; we also verify the robustness of the solutions obtained by the quantum Monte Carlo simulations using the energy landscape around the solutions. Finally, in Section V, we summarise our results and discuss future research directions.

II. BINARY PERCEPTRON WITH QUANTUM FLUCTUATIONS
We consider teacher-student learning in a single-layer binary perceptron. The student perceptron learns input-output patterns from the given dataset $\mathcal{D} = \{(x^\mu, y^\mu)\}_{\mu=1}^{p}$, where $p$ is the number of data points. For each sample, an input data vector $x^\mu \in \{\pm 1\}^N$ is generated from the uniform distribution $P(x^\mu) = \prod_{i=1}^{N} \frac{1}{2}\left(\delta(x_i^\mu - 1) + \delta(x_i^\mu + 1)\right)$, where $N$ is the dimensionality of the input data. The joint probability distribution of the input data is denoted by $P(X) = \prod_{\mu=1}^{p} P(x^\mu)$. The output data are determined by the teacher perceptron as $y^\mu = \mathrm{sgn}\left(\frac{1}{\sqrt{N}}\sum_{i=1}^{N} w_i x_i^\mu\right) \in \{\pm 1\}$, where $\mathrm{sgn}(\cdot)$ is the signum function. The weight vector of the teacher perceptron is generated from the uniform distribution $P(w) = \prod_{i=1}^{N} \frac{1}{2}\left(\delta(w_i - 1) + \delta(w_i + 1)\right)$. The joint distribution of the dataset is given by

P(\mathcal{D}|w) = \prod_{\mu=1}^{p} P(y^\mu | x^\mu, w) P(x^\mu) = \prod_{\mu=1}^{p} \delta\left(y^\mu - \mathrm{sgn}\left(\frac{1}{\sqrt{N}} \sum_{i=1}^{N} w_i x_i^\mu\right)\right) P(x^\mu).   (1)

The learning problem involves finding a weight vector $\sigma = (\sigma_1, \ldots, \sigma_N) \in \{\pm 1\}^N$ such that all input data are simultaneously classified correctly. We formulate this problem as a Bayesian inference problem. By using the Bayes formula, the posterior distribution is expressed as

P(\sigma|\mathcal{D}) = \frac{P(\mathcal{D}|\sigma) P(\sigma)}{\sum_\sigma P(\mathcal{D}|\sigma) P(\sigma)}.   (2)

We define the likelihood $P(\mathcal{D}|\sigma)$ as

P(\mathcal{D}|\sigma) \propto \exp\left(-\beta E(\sigma)\right),   (3)

E(\sigma) = \sum_{\mu=1}^{p} \Theta\left(-y^\mu\, \mathrm{sgn}\left(\frac{1}{\sqrt{N}} \sum_{i=1}^{N} x_i^\mu \sigma_i\right)\right),   (4)

where $\beta = 1/T$ is the inverse temperature and $\Theta(x)$ is the Heaviside step function: $\Theta(x) = 1$ for $x > 0$ and $\Theta(x) = 0$ otherwise. The Hamiltonian $E(\sigma)$ represents the number of misclassifications.
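To make the setting concrete, the following minimal Python sketch (our illustration, not code accompanying the paper) generates a teacher, a dataset drawn according to $P(\mathcal{D}|w)$, and evaluates the misclassification energy of Eq. (4) for a candidate student; $N$ is chosen odd so that the sign function never receives a zero argument.

import numpy as np

rng = np.random.default_rng(0)
N = 101                                  # odd, so the fields are never exactly zero
alpha = 1.0
p = int(alpha * N)

w = rng.choice([-1, 1], size=N)          # teacher weights, drawn from P(w)
X = rng.choice([-1, 1], size=(p, N))     # inputs x^mu, drawn from P(x)
y = np.sign(X @ w)                       # teacher outputs; the 1/sqrt(N) scaling
                                         # does not affect the sign

def energy(sigma):
    # E(sigma) of Eq. (4): the number of misclassified patterns.
    return int(np.sum(y * np.sign(X @ sigma) < 0))

sigma = rng.choice([-1, 1], size=N)      # a random student
print(energy(sigma))                     # close to p/2 for a random guess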
We set the prior distribution $P(\sigma)$ to the uniform distribution $P(\sigma) = \prod_{i=1}^{N} \frac{1}{2}\left(\delta(\sigma_i - 1) + \delta(\sigma_i + 1)\right)$. In this case, we can omit $P(\sigma)$ from Eq. (2).

We often utilise maximum a posteriori (MAP) estimation to estimate $\sigma$. In MAP estimation, we find the state that maximises the posterior of Eq. (2) in the limit $\beta \to \infty$, which corresponds to searching for the ground state of the Hamiltonian of Eq. (4). To find the ground state, we typically adopt SA. However, SA can fail to identify the ground state because of the existence of many local minima. Therefore, instead of maximising the posterior, we consider finite-temperature estimation. At low temperatures, the probability measure in Eq. (2) concentrates on low-energy states, and the learning strategy involves sampling low-energy states from Eq. (2) at low temperatures. The estimated weight vector is expressed by the expectation over the posterior distribution, $\langle \sigma_i \rangle = \sum_\sigma \sigma_i P(\sigma|\mathcal{D})$.

The indicators of learning outcomes are the training and generalisation errors. The training error is given by

\epsilon_t(\mathcal{D}) = \frac{1}{p} \langle E(\sigma) \rangle.   (5)

To evaluate the performance for unseen data, we consider the generalisation error

\epsilon_g(\mathcal{D}) = \mathbb{E}_{\{x^{\mathrm{new}}, y^{\mathrm{new}}\}}\left[\Theta\left(-y^{\mathrm{new}}\, \mathrm{sgn}\left(\frac{1}{\sqrt{N}} \sum_{i=1}^{N} x_i^{\mathrm{new}} \langle \sigma_i \rangle\right)\right)\right],   (6)

where $\mathbb{E}_{\{x^{\mathrm{new}}, y^{\mathrm{new}}\}}[\cdot]$ denotes the expectation over $P(\mathcal{D}^{\mathrm{new}}|w) = P(x^{\mathrm{new}}, y^{\mathrm{new}}|w) = P(y^{\mathrm{new}}|x^{\mathrm{new}}, w) P(x^{\mathrm{new}})$. These quantities are expected to exhibit a 'self-averaging' property in the thermodynamic limit $N \to \infty$: the observables for a quenched realisation of $\mathcal{D}$ and $w$ are equivalent to their expectation over the data distribution $P(\mathcal{D}|w) P(w)$. For example, the generalisation error can be expressed as

\lim_{N \to \infty} \epsilon_g = \left[\epsilon_g(\mathcal{D})\right]_{\mathcal{D}} = \frac{1}{\pi} \cos^{-1} m,   (7)

where the bracket $[\cdot]_{\mathcal{D}}$ indicates the expectation over the data distribution $P(\mathcal{D}|w) P(w)$ and $m = \frac{1}{N} \sum_{i=1}^{N} w_i \langle \sigma_i \rangle$ denotes the overlap between the teacher and the student.

We can extend the above formulation to a quantum system as follows:

P(\mathcal{D}|\hat{\sigma}^z) \propto \exp\left(-\beta \hat{H}\right),   (8)

\hat{H} = E(\hat{\sigma}^z) - \Gamma \sum_{i=1}^{N} \hat{\sigma}_i^x,   (9)

E(\hat{\sigma}^z) = \sum_{\mu=1}^{p} \Theta\left(-y^\mu\, \mathrm{sgn}\left(\frac{1}{\sqrt{N}} \sum_{i=1}^{N} x_i^\mu \hat{\sigma}_i^z\right)\right),   (10)

where $\hat{\sigma}_i^z$ and $\hat{\sigma}_i^x$ are the $z$ and $x$ components of the Pauli matrices at site $i$, respectively, and $\Gamma$ is the strength of the transverse field. The learning strategy involves sampling low-energy states from the density matrix $\hat{\rho} \equiv e^{-\beta \hat{H}} / \mathrm{Tr}\, e^{-\beta \hat{H}}$, where $\mathrm{Tr}$ denotes the summation over all possible configurations in the computational basis. The estimated weight vector is expressed by $\langle \hat{\sigma}_i^z \rangle = \mathrm{Tr}(\hat{\sigma}_i^z \hat{\rho})$. The overlap between the teacher and the student is then written as $m = \frac{1}{N} \sum_{i=1}^{N} w_i \langle \hat{\sigma}_i^z \rangle$.
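For very small $N$, the expectation $\langle \hat{\sigma}_i^z \rangle = \mathrm{Tr}(\hat{\sigma}_i^z \hat{\rho})$ can be computed exactly by building $\hat{H}$ as a $2^N \times 2^N$ matrix, which provides a useful check on any sampling scheme. The sketch below is our brute-force illustration; all variable names and parameter values are ours.

import numpy as np

N, p, beta, Gamma = 8, 8, 20.0, 0.3
rng = np.random.default_rng(1)
w = rng.choice([-1, 1], size=N)
X = rng.choice([-1, 1], size=(p, N))
y = np.where(X @ w >= 0, 1, -1)          # teacher outputs (ties broken upward)

# All 2^N computational-basis configurations as rows of +-1 spins.
conf = np.arange(2 ** N)
sigma = 1 - 2 * ((conf[:, None] >> np.arange(N)) & 1)

# Diagonal part: E(sigma^z) of Eq. (10), evaluated on every basis state.
E = (y[None, :] * np.sign(sigma @ X.T) < 0).sum(axis=1).astype(float)

# Off-diagonal part: -Gamma * sigma_i^x connects states differing in one spin.
H = np.diag(E)
for i in range(N):
    H[conf, conf ^ (1 << i)] -= Gamma

# rho = exp(-beta H) / Tr exp(-beta H), via the spectral decomposition.
vals, vecs = np.linalg.eigh(H)
boltz = np.exp(-beta * (vals - vals.min()))   # shifted to avoid underflow
rho = (vecs * boltz) @ vecs.T / boltz.sum()

sz = np.diag(rho) @ sigma                # <sigma_i^z>, since sigma^z is diagonal
print(np.mean(w * sz))                   # teacher-student overlap m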
The typical behaviour of order parameters, such as gener-alisation error, can be derived the free energy. We computedthe partition function and derived the free energy at the limitof N , p → ∞ while fixing the number of data per coupling as α ≡ p / N = O (1). We assume the self-averaging property atthe thermodynamic limit. The free energy f can be evaluatedas − β f = lim N →∞ / N [ln Z ] D . We employ the Suzuki-Trotterdecomposition [26] to the following partition function: Z = Tr exp (cid:16) − β ˆ H (cid:17) = lim M →∞ Tr exp (cid:18) − β M E ( ˆ σ z ) (cid:19) exp β Γ M N X i = ˆ σ xi M = lim M →∞ Z M , (11)where Z M = Tr M Y t = exp − β M p X µ = Θ − y µ sgn √ N N X i = x i µ σ zi ( t ) + β Γ M N X i = σ xi ( t ) Y i , t D σ zi ( t ) (cid:12)(cid:12)(cid:12) σ xi ( t ) E D σ xi ( t ) (cid:12)(cid:12)(cid:12) σ zi ( t + E , (12)The symbol t is the index of the Trotter slice, and M is the Trotter number. We also impose the periodic boundary conditions σ zi (1) = σ zi ( M +
To evaluate $[\ln Z_M]_{\mathcal{D}}$, we utilise the replica method [27]:

[\ln Z_M]_{\mathcal{D}} = \lim_{n \to 0} \frac{[Z_M^n]_{\mathcal{D}} - 1}{n}.   (13)

The replicated partition function is written as

[Z_M^n]_{\mathcal{D}} = \int d\mathcal{D}\, P(\mathcal{D}|w) \int dw\, P(w)\, \mathrm{Tr} \prod_{a=1}^{n} \exp\left(-\frac{\beta}{M} \sum_{t=1}^{M} \sum_{\mu=1}^{p} \Theta\left(-y^\mu\, \mathrm{sgn}\left(\frac{1}{\sqrt{N}} \sum_{i=1}^{N} x_i^\mu \sigma_{ia}^z(t)\right)\right) + \frac{\beta \Gamma}{M} \sum_{t=1}^{M} \sum_{i=1}^{N} \sigma_{ia}^x(t)\right) \times \prod_{i,t,a} \left\langle \sigma_{ia}^z(t) \middle| \sigma_{ia}^x(t) \right\rangle \left\langle \sigma_{ia}^x(t) \middle| \sigma_{ia}^z(t+1) \right\rangle,   (14)

where $a$ denotes the replica index. We introduce the order parameters

m_a(t) = \frac{1}{N} \sum_{i=1}^{N} w_i \sigma_{ia}^z(t),   (15)

q_{ab}(t, t') = \frac{1}{N} \sum_{i=1}^{N} \sigma_{ia}^z(t) \sigma_{ib}^z(t') \quad (a < b),   (16)

R_a(t, t') = \frac{1}{N} \sum_{i=1}^{N} \sigma_{ia}^z(t) \sigma_{ia}^z(t') \quad (t \neq t'),   (17)

m_a^x(t) = \frac{1}{N} \sum_{i=1}^{N} \sigma_{ia}^x(t).   (18)

The physical meanings of the order parameters are as follows: $m_a(t)$ is the magnetisation, $q_{ab}(t,t')$ is the spin-glass order parameter, $R_a(t,t')$ is the correlation between Trotter slices, and $m_a^x(t)$ is the transverse magnetisation. Additionally, we introduce the auxiliary parameters $\tilde{m}_a(t)$, $\tilde{q}_{ab}(t,t')$, $\tilde{R}_a(t,t')$, and $\tilde{m}_a^x(t)$ conjugate to the order parameters. Under the replica symmetric (RS) ansatz and the static approximation, we set $m_a(t) = m$, $q_{ab}(t,t') = q$, $R_a(t,t') = R$, $m_a^x(t) = m^x$, $\tilde{m}_a(t) = \tilde{m}$, $\tilde{q}_{ab}(t,t') = \tilde{q}$, $\tilde{R}_a(t,t') = \tilde{R}$, and $\tilde{m}_a^x(t) = \tilde{m}^x$. The free energy is given by

-\beta f_{\mathrm{RS}} = \alpha \int Du\, H\left(-\sqrt{\frac{m^2}{q - m^2}}\, u\right) \ln \int D\nu \exp\left(-\beta H\left(\frac{\sqrt{q}\, u + \sqrt{R - q}\, \nu}{\sqrt{1 - R}}\right)\right) + \int Dz \ln \int Dy\, 2\cosh\sqrt{\left(\tilde{m} + \sqrt{\tilde{q}}\, z + \sqrt{\tilde{R} - \tilde{q}}\, y\right)^2 + (\tilde{m}^x)^2} - m\tilde{m} - m^x \tilde{m}^x - \frac{1}{2} R\tilde{R} + \frac{1}{2} q\tilde{q} + \beta \Gamma m^x,   (19)

where $Dz$ denotes the Gaussian measure $Dz = \frac{1}{\sqrt{2\pi}} e^{-z^2/2}\, dz$, and we utilise the function $H(x) = \int_x^\infty Dz$. Detailed calculations for the derivation of the RS free energy of Eq. (19) are provided in Appendix A. To simplify the notation, we define

g \equiv \tilde{m} + \sqrt{\tilde{q}}\, z + \sqrt{\tilde{R} - \tilde{q}}\, y,   (20)

u \equiv \sqrt{g^2 + (\tilde{m}^x)^2},   (21)

Y \equiv \int Dy \cosh u,   (22)

X_0 \equiv \sqrt{\frac{m^2}{q - m^2}}\, u,   (23)

X_1 \equiv \frac{\sqrt{q}\, u + \sqrt{R - q}\, \nu}{\sqrt{1 - R}},   (24)

where, in Eqs. (23) and (24), $u$ denotes the Gaussian variable of the measure $Du$. The extremisation of Eq. (19) gives rise to the following saddle-point equations:

m = \int Dz\, Y^{-1} \int Dy\, \frac{g}{u} \sinh u,   (25)

q = \int Dz \left( Y^{-1} \int Dy\, \frac{g}{u} \sinh u \right)^2,   (26)

R = \int Dz\, Y^{-1} \int Dy \left\{ \frac{(\tilde{m}^x)^2}{u^3} \sinh u + \frac{g^2}{u^2} \cosh u \right\},   (27)

m^x = \int Dz\, Y^{-1} \int Dy\, \frac{\tilde{m}^x}{u} \sinh u,   (28)

\tilde{m} = \frac{\alpha q}{\sqrt{(q - m^2)^3}} \int u\, Du\, G(X_0) \ln \int D\nu \exp\left(-\beta H(X_1)\right),   (29)

\tilde{q} = \frac{\alpha m}{\sqrt{(q - m^2)^3}} \int u\, Du\, G(X_0) \ln \int D\nu \exp\left(-\beta H(X_1)\right) + \frac{\alpha \beta}{\sqrt{1 - R}} \int Du\, H(-X_0)\, \frac{\int D\nu \exp\left(-\beta H(X_1)\right) G(X_1) \left\{ \frac{\nu}{\sqrt{R - q}} - \frac{u}{\sqrt{q}} \right\}}{\int D\nu \exp\left(-\beta H(X_1)\right)},   (30)

\tilde{R} = \frac{\alpha \beta}{\sqrt{(R - q)(1 - R)^3}} \int Du\, H(-X_0)\, \frac{\int D\nu \exp\left(-\beta H(X_1)\right) G(X_1) \left\{ (1 - q)\nu + \sqrt{q(R - q)}\, u \right\}}{\int D\nu \exp\left(-\beta H(X_1)\right)},   (31)

\tilde{m}^x = \beta \Gamma,   (32)

where we utilise $G(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$. The training error is written as

\epsilon_t = \int Du\, H(-X_0)\, \frac{\int D\nu\, H(X_1) \exp\left(-\beta H(X_1)\right)}{\int D\nu \exp\left(-\beta H(X_1)\right)}.   (33)
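Equations (25)–(32) form a closed system that can be solved by fixed-point iteration. As an illustration (our sketch, not the authors' code), the map from the conjugate parameters $(\tilde{m}, \tilde{q}, \tilde{R}, \tilde{m}^x)$ to $(m, q, R, m^x)$ in Eqs. (25)–(28) involves only two nested Gaussian integrals over $z$ and $y$, which are conveniently handled with Gauss–Hermite quadrature; Eqs. (29)–(31) would be treated analogously with the $(u, \nu)$ integrals, and the full solver alternates the two maps until convergence. The quadrature order and trial values below are arbitrary, and the map assumes $\tilde{R} > \tilde{q} > 0$.

import numpy as np

# Probabilists' Gauss-Hermite rule, adapted to Dz = exp(-z^2/2)/sqrt(2 pi) dz.
nodes, weights = np.polynomial.hermite_e.hermegauss(80)
gw = weights / np.sqrt(2.0 * np.pi)

def order_parameters(mt, qt, Rt, mxt):
    # Right-hand sides of Eqs. (25)-(28): (m~, q~, R~, m~^x) -> (m, q, R, m^x).
    z = nodes[:, None]                       # outer Gaussian variable
    y = nodes[None, :]                       # inner Gaussian variable
    g = mt + np.sqrt(qt) * z + np.sqrt(Rt - qt) * y        # Eq. (20)
    u = np.sqrt(g ** 2 + mxt ** 2)                         # Eq. (21)
    sinh_u, cosh_u = np.sinh(u), np.cosh(u)
    Y = (gw[None, :] * cosh_u).sum(axis=1)                 # Eq. (22)
    inner = lambda f: (gw[None, :] * f).sum(axis=1) / Y    # Y^{-1} int Dy (...)
    outer = lambda v: (gw * v).sum()                       # int Dz (...)
    m = outer(inner((g / u) * sinh_u))                                          # Eq. (25)
    q = outer(inner((g / u) * sinh_u) ** 2)                                     # Eq. (26)
    R = outer(inner((mxt ** 2 / u ** 3) * sinh_u + (g ** 2 / u ** 2) * cosh_u)) # Eq. (27)
    mx = outer(inner((mxt / u) * sinh_u))                                       # Eq. (28)
    return m, q, R, mx

print(order_parameters(mt=0.5, qt=0.4, Rt=0.8, mxt=2.0))  # one trial evaluation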
IV. EXPERIMENTAL RESULTS

We numerically solve the saddle-point equations of Eqs. (25)–(32) at the inverse temperature β = 20. A phase diagram of a binary perceptron with quantum fluctuations is presented in Fig. 1. The α_classical line indicates the location of the first-order phase transition between the quantum model of Eq. (9) and the classical model of Eq. (4): on the left side of this line, the classical model has the lowest free energy, whereas on the right side, the quantum model has the lowest free energy. The α_qsp line represents the location at which the freezing phenomenon occurs (R = 1); there, quantum fluctuations no longer affect the system and the model is identical to the classical model. The α_thr line represents the location of the phase transition between poor generalisation (m < 1) and perfect generalisation (m = 1). Above this line, the solution with perfect generalisation is stable. The solution with poor generalisation has negative entropy; in this case, the RS ansatz is incorrect. As in the classical model [28], we can plot the zero-entropy line in the phase diagram; to avoid confusion, we omit it in this study. The α_csp line denotes the location at which the solution with poor generalisation disappears in the classical model. The 'branch 1' and 'branch 2' curves represent the spinodal curves of the different solutions. At the 'critical' curve, the first-order phase transition occurs, and the 'branch 2' solution becomes stable above this curve. Around Γ = 0.5, the 'critical' curve intersects the α_thr line. Above this intersection, the perfect-generalisation solution has the lowest free energy and the 'branch 2' solution becomes unstable; we therefore omit the 'critical' curve above this intersection.

FIG. 1. Phase diagram of a binary perceptron with quantum fluctuations. The horizontal axis represents the strength of the transverse field Γ; the vertical axis represents the pattern ratio α. The lines α_qsp, α_classical, α_thr, and α_csp and the 'branch 1', 'branch 2', and 'critical' curves are described in the text.

Figure 2 presents a heat map of the differences in generalisation error between the quantum model and the classical model (Γ = 0). In a weak transverse field, quantum fluctuations enhance the generalisation performance over the classical model, whereas in a strong transverse field the generalisation performance is worse than that of the classical model.

FIG. 2. Heat map of the differences in generalisation error between the quantum model and the classical model (Γ = 0).

To verify the RS solutions, we performed quantum Monte Carlo simulations. Figure 3 presents the dependence of the generalisation error on the pattern ratio for the fixed transverse-field strengths Γ = 0.1, 0.2, and 0.3. The numerical results are consistent with the RS solutions, except for the high pattern ratio, where the simulations become trapped in local minima and it is difficult to estimate the order parameters.

FIG. 3. Dependence of the generalisation error on the pattern ratio for fixed strengths of the transverse field, Γ = 0.1, 0.2, and 0.3. The lines are derived from the saddle-point equations. The symbols represent the results obtained by the quantum Monte Carlo simulations. The 'classical' line represents the results of the classical model.

We investigated the robustness of the solutions obtained by the quantum Monte Carlo simulations based on the energy landscape around the solutions. We utilised the final results of the quantum Monte Carlo simulations as reference solutions, σ̂_i = sgn(Σ_{t=1}^{M} σ_i(t)) (i = 1, ..., N). We computed the local training error differences between the reference solutions and neighbouring solutions, whose Hamming distances from the reference solutions are denoted by d. In Fig. 4, we plot the local training error differences averaged over all neighbouring solutions; the error bars represent standard deviations. For comparison, we also plot the results for reference solutions of the classical model obtained by Markov-chain Monte Carlo simulations. The experimental settings and instances are the same as those shown in Fig. 3. The local training error differences are related to the flatness of the energy landscape: the training error of a flat minimiser is robust to perturbations. One can see that the local training error differences of the quantum model are smaller than those of the classical model. Overall, quantum fluctuations lead solutions toward a flat minimiser.

FIG. 4. Training error differences between the reference solutions, obtained by the quantum Monte Carlo (QMC) and Markov-chain Monte Carlo (MCMC) simulations, and the neighbouring solutions, whose Hamming distances from the reference solutions are denoted by d. The symbols are averaged over all neighbouring solutions within Hamming distance d.

Finally, we plot the behaviours of the generalisation error and the correlation between Trotter slices with respect to the strength of the transverse field for the fixed pattern ratios α = 0.2, 0.6, 1.0, and 1.4. For a strong transverse field, the effects of violating the static approximation are small, and the RS solutions for the generalisation error are valid except at low transverse-field strength. Because the correlation between Trotter slices depends on the slice indices in a weak transverse field, the RS solutions for R are valid only at high transverse-field strength. A similar behaviour occurs in the code-division multiple-access model [29].

FIG. 5. Dependence of the order parameters on the strength of the transverse field for the fixed pattern ratios α = 0.2, 0.6, 1.0, and 1.4. The vertical axes denote (a) the generalisation error and (b) the correlation between Trotter slices. Each line is derived from the saddle-point equations. The symbols are derived from the quantum Monte Carlo simulations.
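The robustness check of Fig. 4 is straightforward to reproduce schematically. The sketch below is ours: misclassifications, X, and y are as in the earlier snippets, sigma_ref would come from an actual quantum or Markov-chain Monte Carlo run, and we sample random neighbours at each Hamming distance instead of enumerating all of them, which is the practical choice once the number of neighbours C(N, d) becomes large.

import numpy as np

def local_energy_profile(sigma_ref, X, y, d_max, n_samples, rng):
    # Average training-error difference between sigma_ref and random
    # neighbours at Hamming distance d = 1, ..., d_max (cf. Fig. 4).
    E_ref = misclassifications(sigma_ref, X, y)
    profile = []
    for d in range(1, d_max + 1):
        diffs = []
        for _ in range(n_samples):
            flip = rng.choice(len(sigma_ref), size=d, replace=False)
            sigma = sigma_ref.copy()
            sigma[flip] = -sigma[flip]
            diffs.append(misclassifications(sigma, X, y) - E_ref)
        profile.append((np.mean(diffs), np.std(diffs)))
    return profile

# Reference solution from the Trotterised configuration S of the sweep sketch:
# sigma_ref = np.where(S.sum(axis=0) >= 0, 1, -1)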
V. SUMMARY
We analysed teacher-student learning of a binary perceptron with quantum fluctuations. The energy landscape of a binary perceptron contains many local minima, so standard SA often becomes stuck in local minima and fails to identify the ground state of the system. By introducing quantum fluctuations, the state evolves more efficiently because the effective energy landscape becomes smoother. Additionally, quantum fluctuations improve the robustness of solutions, which are characterised by wide, flat local minima. We determined that these robust solutions yield better generalisation performance than the dominant solutions identified by standard SA.

First, we presented a phase diagram of a binary perceptron with quantum fluctuations. This model exhibits three first-order phase transitions. The first is between the solutions of the classical and quantum models in a weak transverse field. The second is between solutions with poor generalisation and perfect generalisation. The third is between two solutions from different branches under a transverse field.

Second, we presented a heat map of the differences in generalisation error between the quantum model and the classical model. In a strong transverse field, the generalisation performance of a binary perceptron is worse than that of the classical model. In a weak transverse field, quantum fluctuations enhance the generalisation performance of a binary perceptron.

Third, we performed quantum Monte Carlo simulations to verify our results at the level of the RS ansatz. First, we presented the behaviour of the generalisation error with respect to the pattern ratio for a fixed strength of the transverse field. The numerical results were consistent with the RS solutions, except for the high pattern ratio. By introducing quantum fluctuations, spin configurations can be efficiently sampled from the posterior distribution of the weight parameters. For a high pattern ratio, the energy landscape becomes increasingly non-convex compared to the case of a low pattern ratio; in this setting, local spin-flip algorithms easily become trapped in local minima, even when incorporating quantum fluctuations.

Next, we investigated the robustness of the reference solutions obtained by the quantum Monte Carlo simulations based on the energy landscape. We calculated the average local training error differences between the reference solutions and all neighbouring solutions. The differences obtained by the quantum Monte Carlo simulations were smaller than those obtained by the Markov-chain Monte Carlo method. Overall, quantum fluctuations lead solutions toward a flat minimiser.

Finally, we analysed the behaviours of the generalisation error and the correlation between Trotter slices with respect to the strength of the transverse field for fixed pattern ratios. The numerical results are consistent with the RS solutions, except for low strengths of the transverse field and high pattern ratios; the deviations are caused by the violation of ergodicity or of the static approximation. For a high pattern ratio, the deviation between the RS solutions and the numerical results is greater than that for a low pattern ratio because of the non-convexity of the energy landscape. In a weak transverse field, the order parameters depend on the Trotter slices; in this case, the static approximation does not hold. Therefore, our RS solutions are not exact.
However, these deviations do not invalidate our qualitative conclusions regarding generalisation performance under quantum fluctuations. Despite the violation of ergodicity and of the static approximation, our numerical results support the enhancement of the generalisation performance of a binary perceptron predicted by the RS solutions.

In this study, we considered a single-layer perceptron. In the future, our results can be extended to a multilayer perceptron [30–34]. A multilayer perceptron contains a region of high local entropy [15]; therefore, we expect that quantum fluctuations will also improve the generalisation performance of a multilayer perceptron. We can also consider a rotationally invariant model [35–37]. In the classical model, orthogonal input data increase the critical capacity. Determining whether quantum fluctuations improve the generalisation performance for different types of datasets is another interesting topic for future study.

ACKNOWLEDGMENTS
M.O. was supported by KAKENHI (No. 19H01095) and by the Next Generation High-Performance Computing Infrastructures and Applications R&D Program of MEXT. K.T. was supported by JSPS KAKENHI (No. 18H03303). This work was partially supported by JST-CREST (No. JPMJCR1402).
Appendix A: DERIVATION OF FREE ENERGY
We derive the free energy under the RS ansatz and the static approximation. We introduce the following quantities:

u_a^\mu(t) = \frac{1}{\sqrt{N}} \sum_{i=1}^{N} x_i^\mu \sigma_{ia}^z(t),   (A1)

u_0^\mu = \frac{1}{\sqrt{N}} \sum_{i=1}^{N} x_i^\mu w_i.   (A2)

The energy term in Eq. (14) can be expressed as

\exp\left(-\frac{\beta}{M} \sum_{a=1}^{n} \sum_{t=1}^{M} \sum_{\mu=1}^{p} \Theta\left(-y^\mu\, \mathrm{sgn}\left(\frac{1}{\sqrt{N}} \sum_{i=1}^{N} x_i^\mu \sigma_{ia}^z(t)\right)\right)\right) = \prod_{a=1}^{n} \prod_{t=1}^{M} \prod_{\mu=1}^{\alpha N} \exp\left(-\frac{\beta}{M} \Theta\left(-u_0^\mu u_a^\mu(t)\right)\right) = \prod_{a=1}^{n} \prod_{t=1}^{M} \prod_{\mu=1}^{\alpha N} \left\{ e^{-\beta/M} + \Theta\left(u_0^\mu u_a^\mu(t)\right)\left(1 - e^{-\beta/M}\right) \right\}.   (A3)

Because the $x_i^\mu$ are i.i.d. random variables with zero mean and unit variance, the central limit theorem guarantees that $u_a^\mu(t)$ and $u_0^\mu$ become multivariate Gaussian random variables, characterised by zero mean and covariances $[u_a^\mu(t)\, u_b^\upsilon(t')]_{\mathcal{D}} = Q_{ab}(t,t')\, \delta_{\mu\upsilon}$ for fixed $w$ and $\{\sigma_a^z(t)\}$ ($a = 1, \ldots, n$; $t = 1, \ldots, M$). To simplify the notation, we write $u_0^\mu(t) = u_0^\mu$ and treat the teacher field as the $a = 0$ component. The covariance matrix can be written as

Q_{ab}(t,t') \equiv \begin{cases} q_{ab}(t,t') & (a \neq b;\ a, b = 1, \ldots, n;\ t, t' = 1, \ldots, M) \\ R_a(t,t') & (a = b = 1, \ldots, n;\ t \neq t') \\ m_a(t) & (a = 0,\ b = 1, \ldots, n\ \text{or}\ a = 1, \ldots, n,\ b = 0;\ t = 1, \ldots, M) \\ 1 & (a = b = 0;\ \text{and}\ a = b = 1, \ldots, n,\ t = t'). \end{cases}   (A4)

Therefore, the integration over the data distribution $P(\mathcal{D}|w) P(w)$ can be replaced by an integration over the multivariate Gaussian distribution $P(u_0, \{u_a(t)\})$. Next, we introduce delta functions and their Fourier integral representations:

\prod_{a,t} \int dm_a(t)\, \delta\left(m_a(t) - \frac{1}{N} \sum_{i=1}^{N} w_i \sigma_{ia}^z(t)\right) = \prod_{a,t} \int \frac{N\, dm_a(t)\, d\tilde{m}_a(t)}{2\pi i M} \exp\left(-\frac{\tilde{m}_a(t)}{M}\left(N m_a(t) - \sum_{i=1}^{N} w_i \sigma_{ia}^z(t)\right)\right),   (A5)

\prod_{a, t \neq t'} \int dR_a(t,t')\, \delta\left(R_a(t,t') - \frac{1}{N} \sum_{i=1}^{N} \sigma_{ia}^z(t) \sigma_{ia}^z(t')\right) = \prod_{a, t \neq t'} \int \frac{N\, dR_a(t,t')\, d\tilde{R}_a(t,t')}{4\pi i M^2} \exp\left(-\frac{\tilde{R}_a(t,t')}{2M^2}\left(N R_a(t,t') - \sum_{i=1}^{N} \sigma_{ia}^z(t) \sigma_{ia}^z(t')\right)\right),   (A6)

\prod_{a<b,\, t, t'} \int dq_{ab}(t,t')\, \delta\left(q_{ab}(t,t') - \frac{1}{N} \sum_{i=1}^{N} \sigma_{ia}^z(t) \sigma_{ib}^z(t')\right) = \prod_{a<b,\, t, t'} \int \frac{N\, dq_{ab}(t,t')\, d\tilde{q}_{ab}(t,t')}{2\pi i M^2} \exp\left(-\frac{\tilde{q}_{ab}(t,t')}{M^2}\left(N q_{ab}(t,t') - \sum_{i=1}^{N} \sigma_{ia}^z(t) \sigma_{ib}^z(t')\right)\right),   (A7)

\prod_{a,t} \int dm_a^x(t)\, \delta\left(m_a^x(t) - \frac{1}{N} \sum_{i=1}^{N} \sigma_{ia}^x(t)\right) = \prod_{a,t} \int \frac{N\, dm_a^x(t)\, d\tilde{m}_a^x(t)}{2\pi i M} \exp\left(-\frac{\tilde{m}_a^x(t)}{M}\left(N m_a^x(t) - \sum_{i=1}^{N} \sigma_{ia}^x(t)\right)\right).   (A8)

Finally, we can rewrite the replicated partition function as

[Z_M^n]_{\mathcal{D}} = \prod_{a,t} \int \frac{N\, dm_a(t)\, d\tilde{m}_a(t)}{2\pi i M} \prod_{a, t \neq t'} \int \frac{N\, dR_a(t,t')\, d\tilde{R}_a(t,t')}{4\pi i M^2} \prod_{a<b,\, t,t'} \int \frac{N\, dq_{ab}(t,t')\, d\tilde{q}_{ab}(t,t')}{2\pi i M^2} \prod_{a,t} \int \frac{N\, dm_a^x(t)\, d\tilde{m}_a^x(t)}{2\pi i M}\, e^{G_1 + G_2 + G_3},   (A9)

e^{G_1} \equiv \exp\left(\alpha N \ln \left[\prod_{a,t} \left\{ e^{-\beta/M} + \Theta\left(u_0^\mu u_a^\mu(t)\right)\left(1 - e^{-\beta/M}\right) \right\}\right]_u\right),   (A10)

e^{G_2} \equiv \int dw\, P(w)\, \mathrm{Tr} \exp\left(\frac{1}{M} \sum_{a,t} \tilde{m}_a(t) \sum_{i=1}^{N} w_i \sigma_{ia}^z(t) + \frac{1}{2M^2} \sum_{a,\, t \neq t'} \tilde{R}_a(t,t') \sum_{i=1}^{N} \sigma_{ia}^z(t) \sigma_{ia}^z(t') + \frac{1}{M^2} \sum_{a<b} \sum_{t,t'} \tilde{q}_{ab}(t,t') \sum_{i=1}^{N} \sigma_{ia}^z(t) \sigma_{ib}^z(t') + \frac{1}{M} \sum_{a,t} \tilde{m}_a^x(t) \sum_{i=1}^{N} \sigma_{ia}^x(t)\right) \prod_{i,t,a} \left\langle \sigma_{ia}^z(t) \middle| \sigma_{ia}^x(t) \right\rangle \left\langle \sigma_{ia}^x(t) \middle| \sigma_{ia}^z(t+1) \right\rangle,   (A11)

e^{G_3} \equiv \exp\left(-\frac{N}{M} \sum_{a,t} \tilde{m}_a(t) m_a(t) - \frac{N}{M} \sum_{a,t} \tilde{m}_a^x(t) m_a^x(t) - \frac{N}{2M^2} \sum_{a,\, t \neq t'} \tilde{R}_a(t,t') R_a(t,t') - \frac{N}{M^2} \sum_{a<b,\, t,t'} \tilde{q}_{ab}(t,t') q_{ab}(t,t') + \frac{\beta \Gamma N}{M} \sum_{a,t} m_a^x(t)\right),   (A12)

where $[\cdot]_u$ represents the expectation over the multivariate Gaussian distribution $P(u_0, \{u_a(t)\})$. We adopt the RS ansatz and the static approximation:

m_a(t) = m,\ q_{ab}(t,t') = q,\ R_a(t,t') = R,\ m_a^x(t) = m^x,\ \tilde{m}_a(t) = \tilde{m},\ \tilde{q}_{ab}(t,t') = \tilde{q},\ \tilde{R}_a(t,t') = \tilde{R},\ \tilde{m}_a^x(t) = \tilde{m}^x.   (A13)

Under the RS ansatz and the static approximation, the Gaussian random variables can be expressed as

u_a^\mu(t) = \sqrt{q}\, u + \sqrt{R - q}\, \nu_a + \sqrt{1 - R}\, \upsilon_t \quad (a = 1, \ldots, n;\ t = 1, \ldots, M),   (A14)

u_0^\mu = \sqrt{\frac{m^2}{q}}\, u + \sqrt{1 - \frac{m^2}{q}}\, \nu_0,   (A15)

where $u$, $\nu_0$, $\{\nu_a\}$, and $\{\upsilon_t\}$ are i.i.d. Gaussian random variables with zero mean and unit variance. In Eq. (A10), the integration over $P(u_0, \{u_a(t)\})$ can be performed as

\left[\prod_{a,t} \left\{ e^{-\beta/M} + \Theta\left(u_0^\mu u_a^\mu(t)\right)\left(1 - e^{-\beta/M}\right) \right\}\right]_u = \int Du \int D\nu_0\, \Theta\left(u_0^\mu\right) \prod_{a=1}^{n} \int D\nu_a \prod_{t=1}^{M} \int D\upsilon_t \left\{ e^{-\beta/M} + \Theta\left(u_a^\mu(t)\right)\left(1 - e^{-\beta/M}\right) \right\}
= \int Du\, H\left(-\sqrt{\frac{m^2}{q - m^2}}\, u\right) \prod_{a=1}^{n} \int D\nu_a \left[ e^{-\beta/M} + H\left(-\frac{\sqrt{q}\, u + \sqrt{R - q}\, \nu_a}{\sqrt{1 - R}}\right)\left(1 - e^{-\beta/M}\right) \right]^M
\simeq \exp\left(n \int Du\, H\left(-\sqrt{\frac{m^2}{q - m^2}}\, u\right) \ln \int D\nu_a \left[ 1 - \frac{\beta}{M}\left(1 - H\left(-\frac{\sqrt{q}\, u + \sqrt{R - q}\, \nu_a}{\sqrt{1 - R}}\right)\right) + O\left(\frac{1}{M^2}\right) \right]^M \right)
\simeq \exp\left(n \int Du\, H\left(-\sqrt{\frac{m^2}{q - m^2}}\, u\right) \ln \int D\nu_a \exp\left(-\beta\left(1 - H\left(-\frac{\sqrt{q}\, u + \sqrt{R - q}\, \nu_a}{\sqrt{1 - R}}\right)\right)\right)\right)
= \exp\left(n \int Du\, H\left(-\sqrt{\frac{m^2}{q - m^2}}\, u\right) \ln \int D\nu \exp\left(-\beta H\left(\frac{\sqrt{q}\, u + \sqrt{R - q}\, \nu}{\sqrt{1 - R}}\right)\right)\right),   (A16)

where we utilise the relationship $1 - H(x) = H(-x)$ and rewrite $\nu_a$ as $\nu$ in the final equation. We calculate $e^{G_2}$ under the RS ansatz and the static approximation as follows:

e^{G_2} = \int dw\, P(w)\, \mathrm{Tr} \int Dz \exp\left(\frac{\tilde{m}}{M} \sum_{a,t,i} w_i \sigma_{ia}^z(t) + \frac{\tilde{R} - \tilde{q}}{2M^2} \sum_{a,i} \left(\sum_{t=1}^{M} \sigma_{ia}^z(t)\right)^2 + \frac{\sqrt{\tilde{q}}}{M} \sum_{a,t,i} z\, \sigma_{ia}^z(t) + \frac{\tilde{m}^x}{M} \sum_{a,t,i} \sigma_{ia}^x(t)\right) \prod_{i,t,a} \left\langle \sigma_{ia}^z(t) \middle| \sigma_{ia}^x(t) \right\rangle \left\langle \sigma_{ia}^x(t) \middle| \sigma_{ia}^z(t+1) \right\rangle
= \int dw\, P(w)\, \mathrm{Tr} \int Dz \prod_{a=1}^{n} \int Dy \exp\left(\frac{\tilde{m}}{M} \sum_{t,i} w_i \sigma_{ia}^z(t) + \frac{\sqrt{\tilde{R} - \tilde{q}}}{M} \sum_{t,i} y\, \sigma_{ia}^z(t) + \frac{\sqrt{\tilde{q}}}{M} \sum_{t,i} z\, \sigma_{ia}^z(t) + \frac{\tilde{m}^x}{M} \sum_{t,i} \sigma_{ia}^x(t)\right) \prod_{i,t,a} \left\langle \sigma_{ia}^z(t) \middle| \sigma_{ia}^x(t) \right\rangle \left\langle \sigma_{ia}^x(t) \middle| \sigma_{ia}^z(t+1) \right\rangle
= \prod_{i=1}^{N} \frac{1}{2} \sum_{w_i = \pm 1} \int Dz \left( \int Dy\, 2\cosh\sqrt{\left(w_i \tilde{m} + \sqrt{\tilde{q}}\, z + \sqrt{\tilde{R} - \tilde{q}}\, y\right)^2 + (\tilde{m}^x)^2} \right)^n
\simeq \exp\left(nN \int Dz \ln \int Dy\, 2\cosh\sqrt{\left(\tilde{m} + \sqrt{\tilde{q}}\, z + \sqrt{\tilde{R} - \tilde{q}}\, y\right)^2 + (\tilde{m}^x)^2}\right).   (A17)

Here, we introduce the Hubbard-Stratonovich transformation

\exp\left(\frac{x^2}{2}\right) = \int Dz \exp(xz)   (A18)

applied to the terms $\frac{1}{2}\left(\frac{\sqrt{\tilde{q}}}{M} \sum_{a,t} \sigma_{ia}^z(t)\right)^2$ and $\frac{1}{2} \sum_a \left(\frac{\sqrt{\tilde{R} - \tilde{q}}}{M} \sum_t \sigma_{ia}^z(t)\right)^2$. $e^{G_3}$ is represented as

e^{G_3} = \exp\left(Nn\left(-m\tilde{m} - m^x \tilde{m}^x - \frac{1}{2} R\tilde{R} - \frac{n - 1}{2} q\tilde{q} + \beta \Gamma m^x + O\left(\frac{1}{M}\right)\right)\right).   (A19)

In the thermodynamic limit $N \to \infty$, the saddle-point method can be adopted and the RS free energy can be expressed as

-\beta f_{\mathrm{RS}} = \alpha \int Du\, H\left(-\sqrt{\frac{m^2}{q - m^2}}\, u\right) \ln \int D\nu \exp\left(-\beta H\left(\frac{\sqrt{q}\, u + \sqrt{R - q}\, \nu}{\sqrt{1 - R}}\right)\right) + \int Dz \ln \int Dy\, 2\cosh\sqrt{\left(\tilde{m} + \sqrt{\tilde{q}}\, z + \sqrt{\tilde{R} - \tilde{q}}\, y\right)^2 + (\tilde{m}^x)^2} + \beta \Gamma m^x - m\tilde{m} - m^x \tilde{m}^x - \frac{1}{2} R\tilde{R} + \frac{1}{2} q\tilde{q}.   (A20)
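The classical-limit reduction below leans on two Gaussian identities, $\int Dx\, H(ax + b) = H\left(b/\sqrt{1 + a^2}\right)$ and $\int Dx \cosh(ax + b) = e^{a^2/2} \cosh b$. Both are easy to confirm numerically; the following check (our sketch, using scipy's Gaussian tail function for $H$) is a quick sanity test when re-deriving Eqs. (A21) and (A22):

import numpy as np
from scipy import integrate
from scipy.stats import norm

H = norm.sf  # H(x) = int_x^infty Dz, the Gaussian tail function

def gauss_avg(f):
    # Integrate f against the Gaussian measure Dx.
    val, _ = integrate.quad(lambda x: f(x) * norm.pdf(x), -12.0, 12.0)
    return val

a, b = 0.7, -0.4
print(gauss_avg(lambda x: H(a * x + b)), H(b / np.sqrt(1.0 + a ** 2)))
print(gauss_avg(lambda x: np.cosh(a * x + b)), np.exp(a ** 2 / 2.0) * np.cosh(b))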
In the classical limit $\Gamma \to 0$, we can obtain the classical RS free energy. In this limit, $R \to 1$, $m^x \to 0$, and $M \to 1$, and $e^{G_1}$ reduces to

e^{G_1} \simeq \exp\left(\alpha N n \int Du\, H\left(-\sqrt{\frac{m^2}{q - m^2}}\, u\right) \ln \int D\nu \left[ e^{-\beta} + H\left(-\frac{\sqrt{q}\, u + \sqrt{R - q}\, \nu}{\sqrt{1 - R}}\right)\left(1 - e^{-\beta}\right) \right]\right) = \exp\left(\alpha N n \int Dz\, H\left(\sqrt{\frac{m^2}{q - m^2}}\, z\right) \ln\left[ e^{-\beta} + \left(1 - e^{-\beta}\right) H\left(\sqrt{\frac{q}{1 - q}}\, z\right)\right]\right),   (A21)

where we use the relationship $\int Dx\, H(ax + b) = H\left(b/\sqrt{1 + a^2}\right)$. We compute $e^{G_2}$ as

e^{G_2} = \exp\left(nN \int Dz \ln \int Dy\, 2\cosh\left(\tilde{m} + \sqrt{\tilde{q}}\, z + \sqrt{\tilde{R} - \tilde{q}}\, y\right)\right) = \exp\left(nN\left(\frac{\tilde{R} - \tilde{q}}{2} + \int Dz \ln 2\cosh\left(\tilde{m} + \sqrt{\tilde{q}}\, z\right)\right)\right),   (A22)

where we utilise the relationship $\int Dx \cosh(ax + b) = e^{a^2/2} \cosh b$. In the classical limit, $e^{G_3}$ is expressed as

e^{G_3} \simeq \exp\left(Nn\left(-m\tilde{m} - \frac{1}{2}\tilde{R} - \frac{n - 1}{2} q\tilde{q}\right)\right).   (A23)

The classical RS free energy can be recovered as

-\beta f_{\mathrm{RS}} = \alpha \int Dz\, H\left(\sqrt{\frac{m^2}{q - m^2}}\, z\right) \ln\left[ e^{-\beta} + \left(1 - e^{-\beta}\right) H\left(\sqrt{\frac{q}{1 - q}}\, z\right)\right] + \int Dz \ln 2\cosh\left(\tilde{m} + \sqrt{\tilde{q}}\, z\right) - m\tilde{m} - \frac{(1 - q)\tilde{q}}{2}.   (A24)

This result is consistent with the results in Ref. [28].

[1] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning (The MIT Press, 2016).
[2] Y. Bahri, J. Kadmon, J. Pennington, S. S. Schoenholz, J. Sohl-Dickstein, and S. Ganguli, Annu. Rev. Condens. Matter Phys. 11, 501 (2020).
[3] A. Engel and C. Van den Broeck, Statistical Mechanics of Learning (Cambridge University Press, Cambridge, 2001).
[4] H. Nishimori, Statistical Physics of Spin Glasses and Information Processing: An Introduction (Oxford University Press, Oxford, 2001).
[5] E. Gardner, J. Phys. A: Math. Gen. 21, 257 (1988).
[6] E. Gardner and B. Derrida, J. Phys. A: Math. Gen. 21, 271 (1988).
[7] W. Krauth and M. Mézard, J. Phys. France 50, 3057 (1989).
[8] G. Györgyi, Phys. Rev. A 41, 7097 (1990).
[9] H. Huang, K. Y. M. Wong, and Y. Kabashima, J. Phys. A: Math. Theor. 46, 375002 (2013).
[10] H. Huang and Y. Kabashima, Phys. Rev. E 90, 052813 (2014).
[11] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi, Science 220, 671 (1983).
[12] H. Horner, Z. Phys. B: Condens. Matter 87, 371 (1992).
[13] H. K. Patel, Z. Phys. B: Condens. Matter 91, 257 (1993).
[14] C. Baldassi, A. Ingrosso, C. Lucibello, L. Saglietti, and R. Zecchina, J. Stat. Mech. (2016) 023301.
[15] C. Baldassi, C. Borgs, J. T. Chayes, A. Ingrosso, C. Lucibello, L. Saglietti, and R. Zecchina, Proc. Natl. Acad. Sci. U.S.A. 113, E7655 (2016).
[16] C. Baldassi, A. Ingrosso, C. Lucibello, L. Saglietti, and R. Zecchina, Phys. Rev. Lett. 115, 128101 (2015).
[17] N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang, On large-batch training for deep learning: Generalization gap and sharp minima (2017), arXiv:1609.04836 [cs.LG].
[18] L. Dinh, R. Pascanu, S. Bengio, and Y. Bengio, in Proceedings of the 34th International Conference on Machine Learning (PMLR, International Convention Centre, Sydney, Australia, 2017), pp. 1019–1028.
[19] F. Pittorino, C. Lucibello, C. Feinauer, G. Perugini, C. Baldassi, E. Demyanenko, and R. Zecchina, Entropic gradient descent algorithms and wide flat minima (2020), arXiv:2006.07897 [cs.LG].
[20] P. Chaudhari, A. Choromanska, S. Soatto, Y. LeCun, C. Baldassi, C. Borgs, J. Chayes, L. Sagun, and R. Zecchina, J. Stat. Mech. (2019) 124018.
[21] M. Ohzeki, S. Okada, M. Terabe, and S. Taguchi, Sci. Rep. 8, 9950 (2018).
[22] T. Kadowaki and H. Nishimori, Phys. Rev. E 58, 5355 (1998).
[23] G. E. Santoro, R. Martoňák, E. Tosatti, and R. Car, Science 295, 2427 (2002).
[24] A. Das and B. K. Chakrabarti, Rev. Mod. Phys. 80, 1061 (2008).
[25] C. Baldassi and R. Zecchina, Proc. Natl. Acad. Sci. U.S.A. 115, 1457 (2018).
[26] M. Suzuki, Commun. Math. Phys. 51, 183 (1976).
[27] D. Sherrington and S. Kirkpatrick, Phys. Rev. Lett. 35, 1792 (1975).
[28] H. S. Seung, H. Sompolinsky, and N. Tishby, Phys. Rev. A 45, 6056 (1992).
[29] S. Arai, M. Ohzeki, and K. Tanaka, Mean field analysis of reverse annealing for code-division multiple-access multiuser demodulator (2020), arXiv:2004.11066 [cond-mat.dis-nn].
[30] H. Schwarze and J. Hertz, Europhys. Lett. 20, 375 (1992).
[31] H. Schwarze and J. Hertz, Europhys. Lett. 21, 785 (1993).
[32] H. Schwarze, J. Phys. A: Math. Gen. 26, 5781 (1993).
[33] R. Monasson and R. Zecchina, Phys. Rev. Lett. 75, 2432 (1995).
[34] B. Aubin, A. Maillard, J. Barbier, F. Krzakala, N. Macris, and L. Zdeborová, J. Stat. Mech. (2019) 124023.
[35] Y. Kabashima, J. Phys.: Conf. Ser. 95, 012001 (2008).
[36] T. Shinzato and Y. Kabashima, J. Phys. A: Math. Theor. 41, 324013 (2008).
[37] T. Shinzato and Y. Kabashima, J. Phys. A: Math. Theor. 42 (2009).