Teacher-student learning for a binary perceptron with quantum fluctuations
Shunta Arai,* Masayuki Ohzeki, and Kazuyuki Tanaka
Graduate School of Information Sciences, Tohoku University, Sendai 980-8579, Japan
Sigma-i Co., Ltd., Tokyo, Japan
Institute of Innovative Research, Tokyo Institute of Technology, Nagatsuta-cho 4259, Midori-ku, Yokohama, Kanagawa 226-8503, Japan
(Dated: February 18, 2021)

* [email protected]

We analysed the generalisation performance of a binary perceptron with quantum fluctuations using the replica method. An exponential number of local minima dominate the energy landscape of the binary perceptron, and local search algorithms often fail to identify its ground state. In this study, we considered teacher-student learning and computed the generalisation error of a binary perceptron with quantum fluctuations. Owing to the quantum fluctuations, we can efficiently find robust solutions that have better generalisation performance than the classical model. We validated our theoretical results through quantum Monte Carlo simulations. We adopted the replica symmetric (RS) ansatz and the static approximation. The RS solutions are consistent with our simulation results, except for relatively low strengths of the transverse field and high pattern ratios; these deviations are caused by the violation of ergodicity and of the static approximation. After accounting for the deviation between the RS solutions and the numerical results, the enhancement of generalisation performance by quantum fluctuations still holds.

I. INTRODUCTION
Deep neural networks (DNNs) have achieved excellent performance on a wide range of tasks [1]. In supervised learning, DNNs generalise to unseen data better than classical machine learning algorithms. Generally, when applying DNNs to supervised learning, the number of parameters exceeds the number of data points, which is called an 'over-parameterised setting'. In such a setting, it is difficult to explain why DNNs generalise well using classical statistical learning theory. Therefore, the theoretical understanding of the generalisation performance of DNNs is still under development [2].

For simple models, typical generalisation performance can be analysed theoretically using statistical mechanics [3, 4]. The simplest case is the binary perceptron [5, 6]. In statistical-mechanics analysis, we consider the thermodynamic limit in which the number of parameters N → ∞ and the number of data p → ∞ while the number of data per parameter α ≡ p/N = O(1) is fixed. For random pattern learning, a binary perceptron can learn random input-output patterns up to the storage capacity α_c ≃ 0.833 [7]. For teacher-student learning, we consider a student perceptron and a teacher perceptron. The teacher perceptron generates output data from input data, and the student perceptron learns the input-output patterns of the given dataset. The student perceptron can perfectly predict the outputs generated by the teacher perceptron for α ≥ α_c ≃ 1.245 [8].

The energy landscape of a binary perceptron is highly non-convex and is dominated by an exponential number of local minima [9, 10]. Ground states are geometrically isolated in the solution space: the Hamming distance between solutions is proportional to N. A local spin-flip algorithm based on free-energy minimisation, such as simulated annealing (SA) [11], can easily become stuck in metastable states and fail to find the ground state [12, 13]. In the solution space, there is a subdominant dense region in which the solutions have high local entropy. In this region, all solutions have similar energy values; we refer to these solutions as robust solutions in this paper. By incorporating local entropy, the standard SA technique can be modified so that the resulting algorithm finds subdominant solutions [14, 15]. The quality of the subdominant solutions differs from that of the typical solutions found by standard SA. In teacher-student learning, the generalisation error of the subdominant solutions is lower than that of the typical solutions [16].

Several empirical studies have demonstrated that the local energy (loss) landscape is related to generalisation performance [17–19]. Stochastic gradient descent (SGD) with small batches often finds a flat minimiser of the energy landscape, whereas SGD with large batches converges to a sharp minimiser. Although the training errors of these two solutions are the same or similar, their generalisation performances differ significantly: a flat minimiser generalises better than a sharp one. A flat region in the energy landscape is robust to perturbations of the parameters and to fluctuations of the data. By incorporating the effects of entropy into the loss function, a flat minimiser becomes algorithmically reachable. The Entropy-SGD algorithm developed in a recent study [20] computes the local entropy and exploits stochastic gradient Langevin dynamics to approximate the gradient of the local entropy, allowing it to find a flat minimiser.

To optimise DNNs, we can also utilise quantum fluctuations. In a recent study [21], the authors formulated an optimisation algorithm based on quantum fluctuations using a path-integral representation. They determined that finite quantum fluctuations improve the generalisation performance of DNNs, allowing solutions to converge to a flat minimiser. In a binary perceptron with random pattern learning, quantum annealing (QA) [22–24] can identify a flat region in the energy landscape [25]; the local entropy obtained by QA is greater than that of the typical solutions obtained by SA.

In this study, we extended previous research on random pattern learning [25] to teacher-student learning. We computed the typical behaviour of the generalisation error of a binary perceptron with quantum fluctuations using the replica method. A previous study observed that quantum fluctuations lead to a flat minimiser; here, we investigated whether the generalisation performance of a binary perceptron is enhanced by quantum fluctuations. This study is an analytical demonstration of the effectiveness of quantum fluctuations for machine learning problems.
The remainder of this paper is organised as follows. In Section II, we present the formulation of a binary perceptron with quantum fluctuations. In Section III, we derive the free energy and the saddle-point equations using the replica method. In Section IV, we numerically solve the saddle-point equations, present the phase diagram, and verify our theoretical analysis through quantum Monte Carlo simulations; we also verify the robustness of the solutions obtained by the quantum Monte Carlo simulations using the energy landscape around the solutions. Finally, in Section V, we summarise our results and discuss future research directions.

II. BINARY PERCEPTRON WITH QUANTUM FLUCTUATIONS
We consider teacher-student learning in a single-layer binary perceptron. The student perceptron learns input-output patterns from the given dataset $\mathcal{D} = \{(x^\mu, y^\mu)\}_{\mu=1}^{p}$, where $p$ is the number of data points. For each sample, an input data vector $x^\mu \in \{\pm 1\}^N$ is generated from the uniform distribution $P(x^\mu) = \prod_{i=1}^{N} \frac{1}{2}\left(\delta(x_i^\mu - 1) + \delta(x_i^\mu + 1)\right)$, where $N$ is the dimensionality of the input data. The joint probability distribution of the input data is denoted by $P(X) = \prod_{\mu=1}^{p} P(x^\mu)$. The output data are determined by the teacher perceptron as $y^\mu = \mathrm{sgn}\left(\frac{1}{\sqrt{N}}\sum_{i=1}^{N} w_i x_i^\mu\right) \in \{\pm 1\}$, where $\mathrm{sgn}(\cdot)$ is the signum function. The weight vector of the teacher perceptron is generated from the uniform distribution $P(w) = \prod_{i=1}^{N} \frac{1}{2}\left(\delta(w_i - 1) + \delta(w_i + 1)\right)$. The joint distribution of the dataset is given by

P(\mathcal{D}|w) = \prod_{\mu=1}^{p} P(y^\mu | x^\mu, w) P(x^\mu) = \prod_{\mu=1}^{p} \delta\left(y^\mu - \mathrm{sgn}\left(\frac{1}{\sqrt{N}} \sum_{i=1}^{N} w_i x_i^\mu\right)\right) P(x^\mu).   (1)

The learning problem involves finding a weight vector $\sigma = (\sigma_1, \ldots, \sigma_N) \in \{\pm 1\}^N$ such that all input data are simultaneously classified correctly. We formulate this problem as a Bayesian inference problem. By using the Bayes formula, the posterior distribution is expressed as

P(\sigma|\mathcal{D}) = \frac{P(\mathcal{D}|\sigma) P(\sigma)}{\sum_\sigma P(\mathcal{D}|\sigma) P(\sigma)}.   (2)

We define the likelihood $P(\mathcal{D}|\sigma)$ as

P(\mathcal{D}|\sigma) \propto \exp\left(-\beta E(\sigma)\right),   (3)

E(\sigma) = \sum_{\mu=1}^{p} \Theta\left(-y^\mu\, \mathrm{sgn}\left(\frac{1}{\sqrt{N}} \sum_{i=1}^{N} x_i^\mu \sigma_i\right)\right),   (4)

where $\beta = 1/T$ is the inverse temperature and $\Theta(x)$ is the Heaviside step function: $\Theta(x) = 1$ for $x > 0$ and $\Theta(x) = 0$ otherwise. The Hamiltonian $E(\sigma)$ represents the number of misclassifications.
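To make the setting concrete, the following minimal Python sketch (our illustration, not code accompanying the paper) generates a teacher, a dataset drawn according to $P(\mathcal{D}|w)$, and evaluates the misclassification energy of Eq. (4) for a candidate student; $N$ is chosen odd so that the sign function never receives a zero argument.

import numpy as np

rng = np.random.default_rng(0)
N = 101                                  # odd, so the fields are never exactly zero
alpha = 1.0
p = int(alpha * N)

w = rng.choice([-1, 1], size=N)          # teacher weights, drawn from P(w)
X = rng.choice([-1, 1], size=(p, N))     # inputs x^mu, drawn from P(x)
y = np.sign(X @ w)                       # teacher outputs; the 1/sqrt(N) scaling
                                         # does not affect the sign

def energy(sigma):
    # E(sigma) of Eq. (4): the number of misclassified patterns.
    return int(np.sum(y * np.sign(X @ sigma) < 0))

sigma = rng.choice([-1, 1], size=N)      # a random student
print(energy(sigma))                     # close to p/2 for a random guess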
We set the prior distribution $P(\sigma)$ to the uniform distribution $P(\sigma) = \prod_{i=1}^{N} \frac{1}{2}\left(\delta(\sigma_i - 1) + \delta(\sigma_i + 1)\right)$. In this case, we can omit $P(\sigma)$ from Eq. (2).

We often utilise maximum a posteriori (MAP) estimation to estimate $\sigma$. In MAP estimation, we find the state that maximises the posterior of Eq. (2) in the limit $\beta \to \infty$, which corresponds to searching for the ground state of the Hamiltonian of Eq. (4). To find the ground state, we typically adopt SA. However, SA can fail to identify the ground state because of the existence of many local minima. Therefore, instead of maximising the posterior, we consider finite-temperature estimation. At low temperatures, the probability measure in Eq. (2) concentrates on low-energy states, and the learning strategy involves sampling low-energy states from Eq. (2) at low temperatures. The estimated weight vector is expressed by the expectation over the posterior distribution, $\langle \sigma_i \rangle = \sum_\sigma \sigma_i P(\sigma|\mathcal{D})$.

The indicators of learning outcomes are the training and generalisation errors. The training error is given by

\epsilon_t(\mathcal{D}) = \frac{1}{p} \langle E(\sigma) \rangle.   (5)

To evaluate the performance for unseen data, we consider the generalisation error

\epsilon_g(\mathcal{D}) = \mathbb{E}_{\{x^{\mathrm{new}}, y^{\mathrm{new}}\}}\left[\Theta\left(-y^{\mathrm{new}}\, \mathrm{sgn}\left(\frac{1}{\sqrt{N}} \sum_{i=1}^{N} x_i^{\mathrm{new}} \langle \sigma_i \rangle\right)\right)\right],   (6)

where $\mathbb{E}_{\{x^{\mathrm{new}}, y^{\mathrm{new}}\}}[\cdot]$ denotes the expectation over $P(\mathcal{D}^{\mathrm{new}}|w) = P(x^{\mathrm{new}}, y^{\mathrm{new}}|w) = P(y^{\mathrm{new}}|x^{\mathrm{new}}, w) P(x^{\mathrm{new}})$. These quantities are expected to exhibit a 'self-averaging' property in the thermodynamic limit $N \to \infty$: the observables for a quenched realisation of $\mathcal{D}$ and $w$ are equivalent to their expectation over the data distribution $P(\mathcal{D}|w) P(w)$. For example, the generalisation error can be expressed as

\lim_{N \to \infty} \epsilon_g = \left[\epsilon_g(\mathcal{D})\right]_{\mathcal{D}} = \frac{1}{\pi} \cos^{-1} m,   (7)

where the bracket $[\cdot]_{\mathcal{D}}$ indicates the expectation over the data distribution $P(\mathcal{D}|w) P(w)$ and $m = \frac{1}{N} \sum_{i=1}^{N} w_i \langle \sigma_i \rangle$ denotes the overlap between the teacher and the student.

We can extend the above formulation to a quantum system as follows:

P(\mathcal{D}|\hat{\sigma}^z) \propto \exp\left(-\beta \hat{H}\right),   (8)

\hat{H} = E(\hat{\sigma}^z) - \Gamma \sum_{i=1}^{N} \hat{\sigma}_i^x,   (9)

E(\hat{\sigma}^z) = \sum_{\mu=1}^{p} \Theta\left(-y^\mu\, \mathrm{sgn}\left(\frac{1}{\sqrt{N}} \sum_{i=1}^{N} x_i^\mu \hat{\sigma}_i^z\right)\right),   (10)

where $\hat{\sigma}_i^z$ and $\hat{\sigma}_i^x$ are the $z$ and $x$ components of the Pauli matrices at site $i$, respectively, and $\Gamma$ is the strength of the transverse field. The learning strategy involves sampling low-energy states from the density matrix $\hat{\rho} \equiv e^{-\beta \hat{H}} / \mathrm{Tr}\, e^{-\beta \hat{H}}$, where $\mathrm{Tr}$ denotes the summation over all possible configurations in the computational basis. The estimated weight vector is expressed by $\langle \hat{\sigma}_i^z \rangle = \mathrm{Tr}(\hat{\sigma}_i^z \hat{\rho})$. The overlap between the teacher and the student is then written as $m = \frac{1}{N} \sum_{i=1}^{N} w_i \langle \hat{\sigma}_i^z \rangle$.
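For very small $N$, the expectation $\langle \hat{\sigma}_i^z \rangle = \mathrm{Tr}(\hat{\sigma}_i^z \hat{\rho})$ can be computed exactly by building $\hat{H}$ as a $2^N \times 2^N$ matrix, which provides a useful check on any sampling scheme. The sketch below is our brute-force illustration; all variable names and parameter values are ours.

import numpy as np

N, p, beta, Gamma = 8, 8, 20.0, 0.3
rng = np.random.default_rng(1)
w = rng.choice([-1, 1], size=N)
X = rng.choice([-1, 1], size=(p, N))
y = np.where(X @ w >= 0, 1, -1)          # teacher outputs (ties broken upward)

# All 2^N computational-basis configurations as rows of +-1 spins.
conf = np.arange(2 ** N)
sigma = 1 - 2 * ((conf[:, None] >> np.arange(N)) & 1)

# Diagonal part: E(sigma^z) of Eq. (10), evaluated on every basis state.
E = (y[None, :] * np.sign(sigma @ X.T) < 0).sum(axis=1).astype(float)

# Off-diagonal part: -Gamma * sigma_i^x connects states differing in one spin.
H = np.diag(E)
for i in range(N):
    H[conf, conf ^ (1 << i)] -= Gamma

# rho = exp(-beta H) / Tr exp(-beta H), via the spectral decomposition.
vals, vecs = np.linalg.eigh(H)
boltz = np.exp(-beta * (vals - vals.min()))   # shifted to avoid underflow
rho = (vecs * boltz) @ vecs.T / boltz.sum()

sz = np.diag(rho) @ sigma                # <sigma_i^z>, since sigma^z is diagonal
print(np.mean(w * sz))                   # teacher-student overlap m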
The typical behaviour of order parameters, such as gener-alisation error, can be derived the free energy. We computedthe partition function and derived the free energy at the limitof N , p → ∞ while fixing the number of data per coupling as α ≡ p / N = O (1). We assume the self-averaging property atthe thermodynamic limit. The free energy f can be evaluatedas − β f = lim N →∞ / N [ln Z ] D . We employ the Suzuki-Trotterdecomposition [26] to the following partition function: Z = Tr exp (cid:16) − β ˆ H (cid:17) = lim M →∞ Tr exp (cid:18) − β M E ( ˆ σ z ) (cid:19) exp β Γ M N X i = ˆ σ xi M = lim M →∞ Z M , (11)where Z M = Tr M Y t = exp − β M p X µ = Θ − y µ sgn √ N N X i = x i µ σ zi ( t ) + β Γ M N X i = σ xi ( t ) Y i , t D σ zi ( t ) (cid:12)(cid:12)(cid:12) σ xi ( t ) E D σ xi ( t ) (cid:12)(cid:12)(cid:12) σ zi ( t + E , (12)The symbol t is the index of the Trotter slice, and M is the Trotter number. We also impose the periodic boundary conditions σ zi (1) = σ zi ( M +
To evaluate $[\ln Z_M]_{\mathcal{D}}$, we utilise the replica method [27]:

[\ln Z_M]_{\mathcal{D}} = \lim_{n \to 0} \frac{[Z_M^n]_{\mathcal{D}} - 1}{n}.   (13)

The replicated partition function is written as

[Z_M^n]_{\mathcal{D}} = \int d\mathcal{D}\, P(\mathcal{D}|w) \int dw\, P(w)\, \mathrm{Tr} \prod_{a=1}^{n} \exp\left(-\frac{\beta}{M} \sum_{t=1}^{M} \sum_{\mu=1}^{p} \Theta\left(-y^\mu\, \mathrm{sgn}\left(\frac{1}{\sqrt{N}} \sum_{i=1}^{N} x_i^\mu \sigma_{ia}^z(t)\right)\right) + \frac{\beta \Gamma}{M} \sum_{t=1}^{M} \sum_{i=1}^{N} \sigma_{ia}^x(t)\right) \times \prod_{i,t,a} \left\langle \sigma_{ia}^z(t) \middle| \sigma_{ia}^x(t) \right\rangle \left\langle \sigma_{ia}^x(t) \middle| \sigma_{ia}^z(t+1) \right\rangle,   (14)

where $a$ denotes the replica index. We introduce the order parameters

m_a(t) = \frac{1}{N} \sum_{i=1}^{N} w_i \sigma_{ia}^z(t),   (15)

q_{ab}(t, t') = \frac{1}{N} \sum_{i=1}^{N} \sigma_{ia}^z(t) \sigma_{ib}^z(t') \quad (a < b),   (16)

R_a(t, t') = \frac{1}{N} \sum_{i=1}^{N} \sigma_{ia}^z(t) \sigma_{ia}^z(t') \quad (t \neq t'),   (17)

m_a^x(t) = \frac{1}{N} \sum_{i=1}^{N} \sigma_{ia}^x(t).   (18)

The physical meanings of the order parameters are as follows: $m_a(t)$ is the magnetisation, $q_{ab}(t,t')$ is the spin-glass order parameter, $R_a(t,t')$ is the correlation between Trotter slices, and $m_a^x(t)$ is the transverse magnetisation. Additionally, we introduce the auxiliary parameters $\tilde{m}_a(t)$, $\tilde{q}_{ab}(t,t')$, $\tilde{R}_a(t,t')$, and $\tilde{m}_a^x(t)$ conjugate to the order parameters. Under the replica symmetric (RS) ansatz and the static approximation, we set $m_a(t) = m$, $q_{ab}(t,t') = q$, $R_a(t,t') = R$, $m_a^x(t) = m^x$, $\tilde{m}_a(t) = \tilde{m}$, $\tilde{q}_{ab}(t,t') = \tilde{q}$, $\tilde{R}_a(t,t') = \tilde{R}$, and $\tilde{m}_a^x(t) = \tilde{m}^x$. The free energy is given by

-\beta f_{\mathrm{RS}} = \alpha \int Du\, H\left(-\sqrt{\frac{m^2}{q - m^2}}\, u\right) \ln \int D\nu \exp\left(-\beta H\left(\frac{\sqrt{q}\, u + \sqrt{R - q}\, \nu}{\sqrt{1 - R}}\right)\right) + \int Dz \ln \int Dy\, 2\cosh\sqrt{\left(\tilde{m} + \sqrt{\tilde{q}}\, z + \sqrt{\tilde{R} - \tilde{q}}\, y\right)^2 + (\tilde{m}^x)^2} - m\tilde{m} - m^x \tilde{m}^x - \frac{1}{2} R\tilde{R} + \frac{1}{2} q\tilde{q} + \beta \Gamma m^x,   (19)

where $Dz$ denotes the Gaussian measure $Dz = \frac{1}{\sqrt{2\pi}} e^{-z^2/2}\, dz$, and we utilise the function $H(x) = \int_x^\infty Dz$. Detailed calculations for the derivation of the RS free energy of Eq. (19) are provided in Appendix A. To simplify the notation, we define

g \equiv \tilde{m} + \sqrt{\tilde{q}}\, z + \sqrt{\tilde{R} - \tilde{q}}\, y,   (20)

u \equiv \sqrt{g^2 + (\tilde{m}^x)^2},   (21)

Y \equiv \int Dy \cosh u,   (22)

X_0 \equiv \sqrt{\frac{m^2}{q - m^2}}\, u,   (23)

X_1 \equiv \frac{\sqrt{q}\, u + \sqrt{R - q}\, \nu}{\sqrt{1 - R}},   (24)

where, in Eqs. (23) and (24), $u$ denotes the Gaussian variable of the measure $Du$. The extremisation of Eq. (19) gives rise to the following saddle-point equations:

m = \int Dz\, Y^{-1} \int Dy\, \frac{g}{u} \sinh u,   (25)

q = \int Dz \left( Y^{-1} \int Dy\, \frac{g}{u} \sinh u \right)^2,   (26)

R = \int Dz\, Y^{-1} \int Dy \left\{ \frac{(\tilde{m}^x)^2}{u^3} \sinh u + \frac{g^2}{u^2} \cosh u \right\},   (27)

m^x = \int Dz\, Y^{-1} \int Dy\, \frac{\tilde{m}^x}{u} \sinh u,   (28)

\tilde{m} = \frac{\alpha q}{\sqrt{(q - m^2)^3}} \int u\, Du\, G(X_0) \ln \int D\nu \exp\left(-\beta H(X_1)\right),   (29)

\tilde{q} = \frac{\alpha m}{\sqrt{(q - m^2)^3}} \int u\, Du\, G(X_0) \ln \int D\nu \exp\left(-\beta H(X_1)\right) + \frac{\alpha \beta}{\sqrt{1 - R}} \int Du\, H(-X_0)\, \frac{\int D\nu \exp\left(-\beta H(X_1)\right) G(X_1) \left\{ \frac{\nu}{\sqrt{R - q}} - \frac{u}{\sqrt{q}} \right\}}{\int D\nu \exp\left(-\beta H(X_1)\right)},   (30)

\tilde{R} = \frac{\alpha \beta}{\sqrt{(R - q)(1 - R)^3}} \int Du\, H(-X_0)\, \frac{\int D\nu \exp\left(-\beta H(X_1)\right) G(X_1) \left\{ (1 - q)\nu + \sqrt{q(R - q)}\, u \right\}}{\int D\nu \exp\left(-\beta H(X_1)\right)},   (31)

\tilde{m}^x = \beta \Gamma,   (32)

where we utilise $G(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$. The training error is written as

\epsilon_t = \int Du\, H(-X_0)\, \frac{\int D\nu\, H(X_1) \exp\left(-\beta H(X_1)\right)}{\int D\nu \exp\left(-\beta H(X_1)\right)}.   (33)
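Equations (25)–(32) form a closed system that can be solved by fixed-point iteration. As an illustration (our sketch, not the authors' code), the map from the conjugate parameters $(\tilde{m}, \tilde{q}, \tilde{R}, \tilde{m}^x)$ to $(m, q, R, m^x)$ in Eqs. (25)–(28) involves only two nested Gaussian integrals over $z$ and $y$, which are conveniently handled with Gauss–Hermite quadrature; Eqs. (29)–(31) would be treated analogously with the $(u, \nu)$ integrals, and the full solver alternates the two maps until convergence. The quadrature order and trial values below are arbitrary, and the map assumes $\tilde{R} > \tilde{q} > 0$.

import numpy as np

# Probabilists' Gauss-Hermite rule, adapted to Dz = exp(-z^2/2)/sqrt(2 pi) dz.
nodes, weights = np.polynomial.hermite_e.hermegauss(80)
gw = weights / np.sqrt(2.0 * np.pi)

def order_parameters(mt, qt, Rt, mxt):
    # Right-hand sides of Eqs. (25)-(28): (m~, q~, R~, m~^x) -> (m, q, R, m^x).
    z = nodes[:, None]                       # outer Gaussian variable
    y = nodes[None, :]                       # inner Gaussian variable
    g = mt + np.sqrt(qt) * z + np.sqrt(Rt - qt) * y        # Eq. (20)
    u = np.sqrt(g ** 2 + mxt ** 2)                         # Eq. (21)
    sinh_u, cosh_u = np.sinh(u), np.cosh(u)
    Y = (gw[None, :] * cosh_u).sum(axis=1)                 # Eq. (22)
    inner = lambda f: (gw[None, :] * f).sum(axis=1) / Y    # Y^{-1} int Dy (...)
    outer = lambda v: (gw * v).sum()                       # int Dz (...)
    m = outer(inner((g / u) * sinh_u))                                          # Eq. (25)
    q = outer(inner((g / u) * sinh_u) ** 2)                                     # Eq. (26)
    R = outer(inner((mxt ** 2 / u ** 3) * sinh_u + (g ** 2 / u ** 2) * cosh_u)) # Eq. (27)
    mx = outer(inner((mxt / u) * sinh_u))                                       # Eq. (28)
    return m, q, R, mx

print(order_parameters(mt=0.5, qt=0.4, Rt=0.8, mxt=2.0))  # one trial evaluation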
IV. EXPERIMENTAL RESULTS

We numerically solve the saddle-point equations of Eqs. (25)–(32) at the inverse temperature β = 20. A phase diagram of a binary perceptron with quantum fluctuations is presented in Fig. 1. The α_classical line indicates the location of the first-order phase transition between the quantum model of Eq. (9) and the classical model of Eq. (4): on the left side of this line, the classical model has the lowest free energy, whereas on the right side, the quantum model has the lowest free energy. The α_qsp line represents the location at which the freezing phenomenon occurs (R = 1); there, quantum fluctuations no longer affect the system and the model is identical to the classical model. The α_thr line represents the location of the phase transition between poor generalisation (m < 1) and perfect generalisation (m = 1). Above this line, the solution with perfect generalisation is stable. The solution with poor generalisation has negative entropy; in this case, the RS ansatz is incorrect. As in the classical model [28], we can plot the zero-entropy line in the phase diagram; to avoid confusion, we omit it in this study. The α_csp line denotes the location at which the solution with poor generalisation disappears in the classical model. The 'branch 1' and 'branch 2' curves represent the spinodal curves of the different solutions. At the 'critical' curve, the first-order phase transition occurs, and the 'branch 2' solution becomes stable above this curve. Around Γ = 0.5, the 'critical' curve intersects the α_thr line. Above this intersection, the perfect-generalisation solution has the lowest free energy and the 'branch 2' solution becomes unstable; we therefore omit the 'critical' curve above this intersection.

FIG. 1. Phase diagram of a binary perceptron with quantum fluctuations. The horizontal axis represents the strength of the transverse field Γ; the vertical axis represents the pattern ratio α. The lines α_qsp, α_classical, α_thr, and α_csp and the 'branch 1', 'branch 2', and 'critical' curves are described in the text.

Figure 2 presents a heat map of the differences in generalisation error between the quantum model and the classical model (Γ = 0). In a weak transverse field, quantum fluctuations enhance the generalisation performance over the classical model, whereas in a strong transverse field the generalisation performance is worse than that of the classical model.

FIG. 2. Heat map of the differences in generalisation error between the quantum model and the classical model (Γ = 0).

To verify the RS solutions, we performed quantum Monte Carlo simulations. Figure 3 presents the dependence of the generalisation error on the pattern ratio for the fixed transverse-field strengths Γ = 0.1, 0.2, and 0.3. The numerical results are consistent with the RS solutions, except for the high pattern ratio, where the simulations become trapped in local minima and it is difficult to estimate the order parameters.

FIG. 3. Dependence of the generalisation error on the pattern ratio for fixed strengths of the transverse field, Γ = 0.1, 0.2, and 0.3. The lines are derived from the saddle-point equations. The symbols represent the results obtained by the quantum Monte Carlo simulations. The 'classical' line represents the results of the classical model.

We investigated the robustness of the solutions obtained by the quantum Monte Carlo simulations based on the energy landscape around the solutions. We utilised the final results of the quantum Monte Carlo simulations as reference solutions, σ̂_i = sgn(Σ_{t=1}^{M} σ_i(t)) (i = 1, ..., N). We computed the local training error differences between the reference solutions and neighbouring solutions, whose Hamming distances from the reference solutions are denoted by d. In Fig. 4, we plot the local training error differences averaged over all neighbouring solutions; the error bars represent standard deviations. For comparison, we also plot the results for reference solutions of the classical model obtained by Markov-chain Monte Carlo simulations. The experimental settings and instances are the same as those shown in Fig. 3. The local training error differences are related to the flatness of the energy landscape: the training error of a flat minimiser is robust to perturbations. One can see that the local training error differences of the quantum model are smaller than those of the classical model. Overall, quantum fluctuations lead solutions toward a flat minimiser.

FIG. 4. Training error differences between the reference solutions, obtained by the quantum Monte Carlo (QMC) and Markov-chain Monte Carlo (MCMC) simulations, and the neighbouring solutions, whose Hamming distances from the reference solutions are denoted by d. The symbols are averaged over all neighbouring solutions within Hamming distance d.

Finally, we plot the behaviours of the generalisation error and the correlation between Trotter slices with respect to the strength of the transverse field for the fixed pattern ratios α = 0.2, 0.6, 1.0, and 1.4. For a strong transverse field, the effects of violating the static approximation are small, and the RS solutions for the generalisation error are valid except at low transverse-field strength. Because the correlation between Trotter slices depends on the slice indices in a weak transverse field, the RS solutions for R are valid only at high transverse-field strength. A similar behaviour occurs in the code-division multiple-access model [29].

FIG. 5. Dependence of the order parameters on the strength of the transverse field for the fixed pattern ratios α = 0.2, 0.6, 1.0, and 1.4. The vertical axes denote (a) the generalisation error and (b) the correlation between Trotter slices. Each line is derived from the saddle-point equations. The symbols are derived from the quantum Monte Carlo simulations.
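The robustness check of Fig. 4 is straightforward to reproduce schematically. The sketch below is ours: misclassifications, X, and y are as in the earlier snippets, sigma_ref would come from an actual quantum or Markov-chain Monte Carlo run, and we sample random neighbours at each Hamming distance instead of enumerating all of them, which is the practical choice once the number of neighbours C(N, d) becomes large.

import numpy as np

def local_energy_profile(sigma_ref, X, y, d_max, n_samples, rng):
    # Average training-error difference between sigma_ref and random
    # neighbours at Hamming distance d = 1, ..., d_max (cf. Fig. 4).
    E_ref = misclassifications(sigma_ref, X, y)
    profile = []
    for d in range(1, d_max + 1):
        diffs = []
        for _ in range(n_samples):
            flip = rng.choice(len(sigma_ref), size=d, replace=False)
            sigma = sigma_ref.copy()
            sigma[flip] = -sigma[flip]
            diffs.append(misclassifications(sigma, X, y) - E_ref)
        profile.append((np.mean(diffs), np.std(diffs)))
    return profile

# Reference solution from the Trotterised configuration S of the sweep sketch:
# sigma_ref = np.where(S.sum(axis=0) >= 0, 1, -1)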
V. SUMMARY
We analysed teacher-student learning of a binary perceptron with quantum fluctuations. The energy landscape of a binary perceptron contains many local minima, so standard SA often becomes stuck in local minima and fails to identify the ground state of the system. By introducing quantum fluctuations, the state evolves more efficiently because the effective energy landscape becomes smoother. Additionally, quantum fluctuations improve the robustness of solutions, which are characterised by wide, flat local minima. We determined that these robust solutions yield better generalisation performance than the dominant solutions identified by standard SA.

First, we presented a phase diagram of a binary perceptron with quantum fluctuations. This model exhibits three first-order phase transitions. The first is between the solutions of the classical and quantum models in a weak transverse field. The second is between solutions with poor generalisation and perfect generalisation. The third is between two solutions from different branches under a transverse field.

Second, we presented a heat map of the differences in generalisation error between the quantum model and the classical model. In a strong transverse field, the generalisation performance of a binary perceptron is worse than that of the classical model. In a weak transverse field, quantum fluctuations enhance the generalisation performance of a binary perceptron.

Third, we performed quantum Monte Carlo simulations to verify our results at the level of the RS ansatz. First, we presented the behaviour of the generalisation error with respect to the pattern ratio for a fixed strength of the transverse field. The numerical results were consistent with the RS solutions, except for the high pattern ratio. By introducing quantum fluctuations, spin configurations can be efficiently sampled from the posterior distribution of the weight parameters. For a high pattern ratio, the energy landscape becomes increasingly non-convex compared to the case of a low pattern ratio; in this setting, local spin-flip algorithms easily become trapped in local minima, even when incorporating quantum fluctuations.

Next, we investigated the robustness of the reference solutions obtained by the quantum Monte Carlo simulations based on the energy landscape. We calculated the average local training error differences between the reference solutions and all neighbouring solutions. The differences obtained by the quantum Monte Carlo simulations were smaller than those obtained by the Markov-chain Monte Carlo method. Overall, quantum fluctuations lead solutions toward a flat minimiser.

Finally, we analysed the behaviours of the generalisation error and the correlation between Trotter slices with respect to the strength of the transverse field for fixed pattern ratios. The numerical results are consistent with the RS solutions, except for low strengths of the transverse field and high pattern ratios; the deviations are caused by the violation of ergodicity or of the static approximation. For a high pattern ratio, the deviation between the RS solutions and the numerical results is greater than that for a low pattern ratio because of the non-convexity of the energy landscape. In a weak transverse field, the order parameters depend on the Trotter slices; in this case, the static approximation does not hold. Therefore, our RS solutions are not exact.
However, these deviations do not invalidate our qualitative conclusions regarding generalisation performance under quantum fluctuations. Despite the violation of ergodicity and of the static approximation, our numerical results support the enhancement of the generalisation performance of a binary perceptron predicted by the RS solutions.

In this study, we considered a single-layer perceptron. In the future, our results can be extended to a multilayer perceptron [30–34]. A multilayer perceptron contains a region of high local entropy [15]; therefore, we expect that quantum fluctuations will also improve the generalisation performance of a multilayer perceptron. We can also consider a rotationally invariant model [35–37]. In the classical model, orthogonal input data increase the critical capacity. Determining whether quantum fluctuations improve the generalisation performance for different types of datasets is another interesting topic for future study.

ACKNOWLEDGMENTS
M.O. was supported by KAKENHI (No. 19H01095) and by the Next Generation High-Performance Computing Infrastructures and Applications R&D Program of MEXT. K.T. was supported by JSPS KAKENHI (No. 18H03303). This work was partially supported by JST-CREST (No. JPMJCR1402).
Appendix A: DERIVATION OF FREE ENERGY
We derive the free energy under the RS ansatz and the static approximation. We introduce the following quantities:

u_a^\mu(t) = \frac{1}{\sqrt{N}} \sum_{i=1}^{N} x_i^\mu \sigma_{ia}^z(t),   (A1)

u_0^\mu = \frac{1}{\sqrt{N}} \sum_{i=1}^{N} x_i^\mu w_i.   (A2)

The energy term in Eq. (14) can be expressed as

\exp\left(-\frac{\beta}{M} \sum_{a=1}^{n} \sum_{t=1}^{M} \sum_{\mu=1}^{p} \Theta\left(-y^\mu\, \mathrm{sgn}\left(\frac{1}{\sqrt{N}} \sum_{i=1}^{N} x_i^\mu \sigma_{ia}^z(t)\right)\right)\right) = \prod_{a=1}^{n} \prod_{t=1}^{M} \prod_{\mu=1}^{\alpha N} \exp\left(-\frac{\beta}{M} \Theta\left(-u_0^\mu u_a^\mu(t)\right)\right) = \prod_{a=1}^{n} \prod_{t=1}^{M} \prod_{\mu=1}^{\alpha N} \left\{ e^{-\beta/M} + \Theta\left(u_0^\mu u_a^\mu(t)\right)\left(1 - e^{-\beta/M}\right) \right\}.   (A3)

Because the $x_i^\mu$ are i.i.d. random variables with zero mean and unit variance, the central limit theorem guarantees that $u_a^\mu(t)$ and $u_0^\mu$ become multivariate Gaussian random variables, characterised by zero mean and covariances $[u_a^\mu(t)\, u_b^\upsilon(t')]_{\mathcal{D}} = Q_{ab}(t,t')\, \delta_{\mu\upsilon}$ for fixed $w$ and $\{\sigma_a^z(t)\}$ ($a = 1, \ldots, n$; $t = 1, \ldots, M$). To simplify the notation, we write $u_0^\mu(t) = u_0^\mu$ and treat the teacher field as the $a = 0$ component. The covariance matrix can be written as

Q_{ab}(t,t') \equiv \begin{cases} q_{ab}(t,t') & (a \neq b;\ a, b = 1, \ldots, n;\ t, t' = 1, \ldots, M) \\ R_a(t,t') & (a = b = 1, \ldots, n;\ t \neq t') \\ m_a(t) & (a = 0,\ b = 1, \ldots, n\ \text{or}\ a = 1, \ldots, n,\ b = 0;\ t = 1, \ldots, M) \\ 1 & (a = b = 0;\ \text{and}\ a = b = 1, \ldots, n,\ t = t'). \end{cases}   (A4)

Therefore, the integration over the data distribution $P(\mathcal{D}|w) P(w)$ can be replaced by an integration over the multivariate Gaussian distribution $P(u_0, \{u_a(t)\})$. Next, we introduce delta functions and their Fourier integral representations:

\prod_{a,t} \int dm_a(t)\, \delta\left(m_a(t) - \frac{1}{N} \sum_{i=1}^{N} w_i \sigma_{ia}^z(t)\right) = \prod_{a,t} \int \frac{N\, dm_a(t)\, d\tilde{m}_a(t)}{2\pi i M} \exp\left(-\frac{\tilde{m}_a(t)}{M}\left(N m_a(t) - \sum_{i=1}^{N} w_i \sigma_{ia}^z(t)\right)\right),   (A5)

\prod_{a, t \neq t'} \int dR_a(t,t')\, \delta\left(R_a(t,t') - \frac{1}{N} \sum_{i=1}^{N} \sigma_{ia}^z(t) \sigma_{ia}^z(t')\right) = \prod_{a, t \neq t'} \int \frac{N\, dR_a(t,t')\, d\tilde{R}_a(t,t')}{4\pi i M^2} \exp\left(-\frac{\tilde{R}_a(t,t')}{2M^2}\left(N R_a(t,t') - \sum_{i=1}^{N} \sigma_{ia}^z(t) \sigma_{ia}^z(t')\right)\right),   (A6)

\prod_{a<b,\, t, t'} \int dq_{ab}(t,t')\, \delta\left(q_{ab}(t,t') - \frac{1}{N} \sum_{i=1}^{N} \sigma_{ia}^z(t) \sigma_{ib}^z(t')\right) = \prod_{a<b,\, t, t'} \int \frac{N\, dq_{ab}(t,t')\, d\tilde{q}_{ab}(t,t')}{2\pi i M^2} \exp\left(-\frac{\tilde{q}_{ab}(t,t')}{M^2}\left(N q_{ab}(t,t') - \sum_{i=1}^{N} \sigma_{ia}^z(t) \sigma_{ib}^z(t')\right)\right),   (A7)

\prod_{a,t} \int dm_a^x(t)\, \delta\left(m_a^x(t) - \frac{1}{N} \sum_{i=1}^{N} \sigma_{ia}^x(t)\right) = \prod_{a,t} \int \frac{N\, dm_a^x(t)\, d\tilde{m}_a^x(t)}{2\pi i M} \exp\left(-\frac{\tilde{m}_a^x(t)}{M}\left(N m_a^x(t) - \sum_{i=1}^{N} \sigma_{ia}^x(t)\right)\right).   (A8)

Finally, we can rewrite the replicated partition function as

[Z_M^n]_{\mathcal{D}} = \prod_{a,t} \int \frac{N\, dm_a(t)\, d\tilde{m}_a(t)}{2\pi i M} \prod_{a, t \neq t'} \int \frac{N\, dR_a(t,t')\, d\tilde{R}_a(t,t')}{4\pi i M^2} \prod_{a<b,\, t,t'} \int \frac{N\, dq_{ab}(t,t')\, d\tilde{q}_{ab}(t,t')}{2\pi i M^2} \prod_{a,t} \int \frac{N\, dm_a^x(t)\, d\tilde{m}_a^x(t)}{2\pi i M}\, e^{G_1 + G_2 + G_3},   (A9)

e^{G_1} \equiv \exp\left(\alpha N \ln \left[\prod_{a,t} \left\{ e^{-\beta/M} + \Theta\left(u_0^\mu u_a^\mu(t)\right)\left(1 - e^{-\beta/M}\right) \right\}\right]_u\right),   (A10)

e^{G_2} \equiv \int dw\, P(w)\, \mathrm{Tr} \exp\left(\frac{1}{M} \sum_{a,t} \tilde{m}_a(t) \sum_{i=1}^{N} w_i \sigma_{ia}^z(t) + \frac{1}{2M^2} \sum_{a,\, t \neq t'} \tilde{R}_a(t,t') \sum_{i=1}^{N} \sigma_{ia}^z(t) \sigma_{ia}^z(t') + \frac{1}{M^2} \sum_{a<b} \sum_{t,t'} \tilde{q}_{ab}(t,t') \sum_{i=1}^{N} \sigma_{ia}^z(t) \sigma_{ib}^z(t') + \frac{1}{M} \sum_{a,t} \tilde{m}_a^x(t) \sum_{i=1}^{N} \sigma_{ia}^x(t)\right) \prod_{i,t,a} \left\langle \sigma_{ia}^z(t) \middle| \sigma_{ia}^x(t) \right\rangle \left\langle \sigma_{ia}^x(t) \middle| \sigma_{ia}^z(t+1) \right\rangle,   (A11)

e^{G_3} \equiv \exp\left(-\frac{N}{M} \sum_{a,t} \tilde{m}_a(t) m_a(t) - \frac{N}{M} \sum_{a,t} \tilde{m}_a^x(t) m_a^x(t) - \frac{N}{2M^2} \sum_{a,\, t \neq t'} \tilde{R}_a(t,t') R_a(t,t') - \frac{N}{M^2} \sum_{a<b,\, t,t'} \tilde{q}_{ab}(t,t') q_{ab}(t,t') + \frac{\beta \Gamma N}{M} \sum_{a,t} m_a^x(t)\right),   (A12)

where $[\cdot]_u$ represents the expectation over the multivariate Gaussian distribution $P(u_0, \{u_a(t)\})$. We adopt the RS ansatz and the static approximation:

m_a(t) = m,\ q_{ab}(t,t') = q,\ R_a(t,t') = R,\ m_a^x(t) = m^x,\ \tilde{m}_a(t) = \tilde{m},\ \tilde{q}_{ab}(t,t') = \tilde{q},\ \tilde{R}_a(t,t') = \tilde{R},\ \tilde{m}_a^x(t) = \tilde{m}^x.   (A13)

Under the RS ansatz and the static approximation, the Gaussian random variables can be expressed as

u_a^\mu(t) = \sqrt{q}\, u + \sqrt{R - q}\, \nu_a + \sqrt{1 - R}\, \upsilon_t \quad (a = 1, \ldots, n;\ t = 1, \ldots, M),   (A14)

u_0^\mu = \sqrt{\frac{m^2}{q}}\, u + \sqrt{1 - \frac{m^2}{q}}\, \nu_0,   (A15)

where $u$, $\nu_0$, $\{\nu_a\}$, and $\{\upsilon_t\}$ are i.i.d. Gaussian random variables with zero mean and unit variance. In Eq. (A10), the integration over $P(u_0, \{u_a(t)\})$ can be performed as

\left[\prod_{a,t} \left\{ e^{-\beta/M} + \Theta\left(u_0^\mu u_a^\mu(t)\right)\left(1 - e^{-\beta/M}\right) \right\}\right]_u = \int Du \int D\nu_0\, \Theta\left(u_0^\mu\right) \prod_{a=1}^{n} \int D\nu_a \prod_{t=1}^{M} \int D\upsilon_t \left\{ e^{-\beta/M} + \Theta\left(u_a^\mu(t)\right)\left(1 - e^{-\beta/M}\right) \right\}
= \int Du\, H\left(-\sqrt{\frac{m^2}{q - m^2}}\, u\right) \prod_{a=1}^{n} \int D\nu_a \left[ e^{-\beta/M} + H\left(-\frac{\sqrt{q}\, u + \sqrt{R - q}\, \nu_a}{\sqrt{1 - R}}\right)\left(1 - e^{-\beta/M}\right) \right]^M
\simeq \exp\left(n \int Du\, H\left(-\sqrt{\frac{m^2}{q - m^2}}\, u\right) \ln \int D\nu_a \left[ 1 - \frac{\beta}{M}\left(1 - H\left(-\frac{\sqrt{q}\, u + \sqrt{R - q}\, \nu_a}{\sqrt{1 - R}}\right)\right) + O\left(\frac{1}{M^2}\right) \right]^M \right)
\simeq \exp\left(n \int Du\, H\left(-\sqrt{\frac{m^2}{q - m^2}}\, u\right) \ln \int D\nu_a \exp\left(-\beta\left(1 - H\left(-\frac{\sqrt{q}\, u + \sqrt{R - q}\, \nu_a}{\sqrt{1 - R}}\right)\right)\right)\right)
= \exp\left(n \int Du\, H\left(-\sqrt{\frac{m^2}{q - m^2}}\, u\right) \ln \int D\nu \exp\left(-\beta H\left(\frac{\sqrt{q}\, u + \sqrt{R - q}\, \nu}{\sqrt{1 - R}}\right)\right)\right),   (A16)

where we utilise the relationship $1 - H(x) = H(-x)$ and rewrite $\nu_a$ as $\nu$ in the final equation. We calculate $e^{G_2}$ under the RS ansatz and the static approximation as follows:

e^{G_2} = \int dw\, P(w)\, \mathrm{Tr} \int Dz \exp\left(\frac{\tilde{m}}{M} \sum_{a,t,i} w_i \sigma_{ia}^z(t) + \frac{\tilde{R} - \tilde{q}}{2M^2} \sum_{a,i} \left(\sum_{t=1}^{M} \sigma_{ia}^z(t)\right)^2 + \frac{\sqrt{\tilde{q}}}{M} \sum_{a,t,i} z\, \sigma_{ia}^z(t) + \frac{\tilde{m}^x}{M} \sum_{a,t,i} \sigma_{ia}^x(t)\right) \prod_{i,t,a} \left\langle \sigma_{ia}^z(t) \middle| \sigma_{ia}^x(t) \right\rangle \left\langle \sigma_{ia}^x(t) \middle| \sigma_{ia}^z(t+1) \right\rangle
= \int dw\, P(w)\, \mathrm{Tr} \int Dz \prod_{a=1}^{n} \int Dy \exp\left(\frac{\tilde{m}}{M} \sum_{t,i} w_i \sigma_{ia}^z(t) + \frac{\sqrt{\tilde{R} - \tilde{q}}}{M} \sum_{t,i} y\, \sigma_{ia}^z(t) + \frac{\sqrt{\tilde{q}}}{M} \sum_{t,i} z\, \sigma_{ia}^z(t) + \frac{\tilde{m}^x}{M} \sum_{t,i} \sigma_{ia}^x(t)\right) \prod_{i,t,a} \left\langle \sigma_{ia}^z(t) \middle| \sigma_{ia}^x(t) \right\rangle \left\langle \sigma_{ia}^x(t) \middle| \sigma_{ia}^z(t+1) \right\rangle
= \prod_{i=1}^{N} \frac{1}{2} \sum_{w_i = \pm 1} \int Dz \left( \int Dy\, 2\cosh\sqrt{\left(w_i \tilde{m} + \sqrt{\tilde{q}}\, z + \sqrt{\tilde{R} - \tilde{q}}\, y\right)^2 + (\tilde{m}^x)^2} \right)^n
\simeq \exp\left(nN \int Dz \ln \int Dy\, 2\cosh\sqrt{\left(\tilde{m} + \sqrt{\tilde{q}}\, z + \sqrt{\tilde{R} - \tilde{q}}\, y\right)^2 + (\tilde{m}^x)^2}\right).   (A17)

Here, we introduce the Hubbard-Stratonovich transformation

\exp\left(\frac{x^2}{2}\right) = \int Dz \exp(xz)   (A18)

applied to the terms $\frac{1}{2}\left(\frac{\sqrt{\tilde{q}}}{M} \sum_{a,t} \sigma_{ia}^z(t)\right)^2$ and $\frac{1}{2} \sum_a \left(\frac{\sqrt{\tilde{R} - \tilde{q}}}{M} \sum_t \sigma_{ia}^z(t)\right)^2$. $e^{G_3}$ is represented as

e^{G_3} = \exp\left(Nn\left(-m\tilde{m} - m^x \tilde{m}^x - \frac{1}{2} R\tilde{R} - \frac{n - 1}{2} q\tilde{q} + \beta \Gamma m^x + O\left(\frac{1}{M}\right)\right)\right).   (A19)

In the thermodynamic limit $N \to \infty$, the saddle-point method can be adopted and the RS free energy can be expressed as

-\beta f_{\mathrm{RS}} = \alpha \int Du\, H\left(-\sqrt{\frac{m^2}{q - m^2}}\, u\right) \ln \int D\nu \exp\left(-\beta H\left(\frac{\sqrt{q}\, u + \sqrt{R - q}\, \nu}{\sqrt{1 - R}}\right)\right) + \int Dz \ln \int Dy\, 2\cosh\sqrt{\left(\tilde{m} + \sqrt{\tilde{q}}\, z + \sqrt{\tilde{R} - \tilde{q}}\, y\right)^2 + (\tilde{m}^x)^2} + \beta \Gamma m^x - m\tilde{m} - m^x \tilde{m}^x - \frac{1}{2} R\tilde{R} + \frac{1}{2} q\tilde{q}.   (A20)
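The classical-limit reduction below leans on two Gaussian identities, $\int Dx\, H(ax + b) = H\left(b/\sqrt{1 + a^2}\right)$ and $\int Dx \cosh(ax + b) = e^{a^2/2} \cosh b$. Both are easy to confirm numerically; the following check (our sketch, using scipy's Gaussian tail function for $H$) is a quick sanity test when re-deriving Eqs. (A21) and (A22):

import numpy as np
from scipy import integrate
from scipy.stats import norm

H = norm.sf  # H(x) = int_x^infty Dz, the Gaussian tail function

def gauss_avg(f):
    # Integrate f against the Gaussian measure Dx.
    val, _ = integrate.quad(lambda x: f(x) * norm.pdf(x), -12.0, 12.0)
    return val

a, b = 0.7, -0.4
print(gauss_avg(lambda x: H(a * x + b)), H(b / np.sqrt(1.0 + a ** 2)))
print(gauss_avg(lambda x: np.cosh(a * x + b)), np.exp(a ** 2 / 2.0) * np.cosh(b))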
In the classical limit $\Gamma \to 0$, we can obtain the classical RS free energy. In this limit, $R \to 1$, $m^x \to 0$, and $M \to 1$, and $e^{G_1}$ reduces to

e^{G_1} \simeq \exp\left(\alpha N n \int Du\, H\left(-\sqrt{\frac{m^2}{q - m^2}}\, u\right) \ln \int D\nu \left[ e^{-\beta} + H\left(-\frac{\sqrt{q}\, u + \sqrt{R - q}\, \nu}{\sqrt{1 - R}}\right)\left(1 - e^{-\beta}\right) \right]\right) = \exp\left(\alpha N n \int Dz\, H\left(\sqrt{\frac{m^2}{q - m^2}}\, z\right) \ln\left[ e^{-\beta} + \left(1 - e^{-\beta}\right) H\left(\sqrt{\frac{q}{1 - q}}\, z\right)\right]\right),   (A21)

where we use the relationship $\int Dx\, H(ax + b) = H\left(b/\sqrt{1 + a^2}\right)$. We compute $e^{G_2}$ as

e^{G_2} = \exp\left(nN \int Dz \ln \int Dy\, 2\cosh\left(\tilde{m} + \sqrt{\tilde{q}}\, z + \sqrt{\tilde{R} - \tilde{q}}\, y\right)\right) = \exp\left(nN\left(\frac{\tilde{R} - \tilde{q}}{2} + \int Dz \ln 2\cosh\left(\tilde{m} + \sqrt{\tilde{q}}\, z\right)\right)\right),   (A22)

where we utilise the relationship $\int Dx \cosh(ax + b) = e^{a^2/2} \cosh b$. In the classical limit, $e^{G_3}$ is expressed as

e^{G_3} \simeq \exp\left(Nn\left(-m\tilde{m} - \frac{1}{2}\tilde{R} - \frac{n - 1}{2} q\tilde{q}\right)\right).   (A23)

The classical RS free energy can be recovered as

-\beta f_{\mathrm{RS}} = \alpha \int Dz\, H\left(\sqrt{\frac{m^2}{q - m^2}}\, z\right) \ln\left[ e^{-\beta} + \left(1 - e^{-\beta}\right) H\left(\sqrt{\frac{q}{1 - q}}\, z\right)\right] + \int Dz \ln 2\cosh\left(\tilde{m} + \sqrt{\tilde{q}}\, z\right) - m\tilde{m} - \frac{(1 - q)\tilde{q}}{2}.   (A24)

This result is consistent with the results in Ref. [28].

[1] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning (The MIT Press, 2016).
[2] Y. Bahri, J. Kadmon, J. Pennington, S. S. Schoenholz, J. Sohl-Dickstein, and S. Ganguli, Annu. Rev. Condens. Matter Phys. 11, 501 (2020).
[3] A. Engel and C. Van den Broeck, Statistical Mechanics of Learning (Cambridge University Press, Cambridge, 2001).
[4] H. Nishimori, Statistical Physics of Spin Glasses and Information Processing: An Introduction (Oxford University Press, Oxford, 2001).
[5] E. Gardner, J. Phys. A: Math. Gen. 21, 257 (1988).
[6] E. Gardner and B. Derrida, J. Phys. A: Math. Gen. 21, 271 (1988).
[7] W. Krauth and M. Mézard, J. Phys. France 50, 3057 (1989).
[8] G. Györgyi, Phys. Rev. A 41, 7097 (1990).
[9] H. Huang, K. Y. M. Wong, and Y. Kabashima, J. Phys. A: Math. Theor. 46, 375002 (2013).
[10] H. Huang and Y. Kabashima, Phys. Rev. E 90, 052813 (2014).
[11] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi, Science 220, 671 (1983).
[12] H. Horner, Z. Phys. B: Condens. Matter 87, 371 (1992).
[13] H. K. Patel, Z. Phys. B: Condens. Matter 91, 257 (1993).
[14] C. Baldassi, A. Ingrosso, C. Lucibello, L. Saglietti, and R. Zecchina, J. Stat. Mech. (2016) 023301.
[15] C. Baldassi, C. Borgs, J. T. Chayes, A. Ingrosso, C. Lucibello, L. Saglietti, and R. Zecchina, Proc. Natl. Acad. Sci. U.S.A. 113, E7655 (2016).
[16] C. Baldassi, A. Ingrosso, C. Lucibello, L. Saglietti, and R. Zecchina, Phys. Rev. Lett. 115, 128101 (2015).
[17] N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang, On large-batch training for deep learning: Generalization gap and sharp minima (2017), arXiv:1609.04836 [cs.LG].
[18] L. Dinh, R. Pascanu, S. Bengio, and Y. Bengio, in Proceedings of the 34th International Conference on Machine Learning (PMLR, International Convention Centre, Sydney, Australia, 2017), pp. 1019–1028.
[19] F. Pittorino, C. Lucibello, C. Feinauer, G. Perugini, C. Baldassi, E. Demyanenko, and R. Zecchina, Entropic gradient descent algorithms and wide flat minima (2020), arXiv:2006.07897 [cs.LG].
[20] P. Chaudhari, A. Choromanska, S. Soatto, Y. LeCun, C. Baldassi, C. Borgs, J. Chayes, L. Sagun, and R. Zecchina, J. Stat. Mech. (2019) 124018.
[21] M. Ohzeki, S. Okada, M. Terabe, and S. Taguchi, Sci. Rep. 8, 9950 (2018).
[22] T. Kadowaki and H. Nishimori, Phys. Rev. E 58, 5355 (1998).
[23] G. E. Santoro, R. Martoňák, E. Tosatti, and R. Car, Science 295, 2427 (2002).
[24] A. Das and B. K. Chakrabarti, Rev. Mod. Phys. 80, 1061 (2008).
[25] C. Baldassi and R. Zecchina, Proc. Natl. Acad. Sci. U.S.A. 115, 1457 (2018).
[26] M. Suzuki, Commun. Math. Phys. 51, 183 (1976).
[27] D. Sherrington and S. Kirkpatrick, Phys. Rev. Lett. 35, 1792 (1975).
[28] H. S. Seung, H. Sompolinsky, and N. Tishby, Phys. Rev. A 45, 6056 (1992).
[29] S. Arai, M. Ohzeki, and K. Tanaka, Mean field analysis of reverse annealing for code-division multiple-access multiuser demodulator (2020), arXiv:2004.11066 [cond-mat.dis-nn].
[30] H. Schwarze and J. Hertz, Europhys. Lett. 20, 375 (1992).
[31] H. Schwarze and J. Hertz, Europhys. Lett. 21, 785 (1993).
[32] H. Schwarze, J. Phys. A: Math. Gen. 26, 5781 (1993).
[33] R. Monasson and R. Zecchina, Phys. Rev. Lett. 75, 2432 (1995).
[34] B. Aubin, A. Maillard, J. Barbier, F. Krzakala, N. Macris, and L. Zdeborová, J. Stat. Mech. (2019) 124023.
[35] Y. Kabashima, J. Phys.: Conf. Ser. 95, 012001 (2008).
[36] T. Shinzato and Y. Kabashima, J. Phys. A: Math. Theor. 41, 324013 (2008).
[37] T. Shinzato and Y. Kabashima, J. Phys. A: Math. Theor. 42 (2009).