Optimization of neural networks via finite-value quantum fluctuations

Abstract

We numerically test an optimization method for deep neural networks (DNNs) using quantum fluctuations inspired by quantum annealing. For efficient optimization, our method utilizes the quantum tunneling effect through potential barriers. The path-integral formulation of the DNN optimization generates an attracting force that simulates the quantum tunneling effect. In the standard quantum annealing method, the quantum fluctuations vanish at the last stage of optimization. In this study, we propose a learning protocol that keeps the quantum fluctuation strength at a finite value to obtain higher generalization performance, which is a type of robustness. We demonstrate the performance of our method using two well-known open datasets: the MNIST dataset and the Olivetti face dataset. Although computational costs prevent us from testing our method on large datasets with high-dimensional data, the results show that our method can enhance generalization performance by introducing a finite value of quantum fluctuations.

arXiv [cond-mat.dis-nn], Jul

Masayuki Ohzeki, Shuntaro Okada, Masayoshi Terabe, and Shinichiro Taguchi
Graduate School of Information Sciences, Tohoku University, Sendai 980-8579, Japan
DENSO Corporation, Kariya, Aichi 474-0025, Japan

Introduction

A data-driven approach is being widely adopted in many science and engineering fields. The key technology is machine learning, which is supported by successful examples of the use of deep neural networks (DNNs). Deep neural networks have achieved state-of-the-art results in a wide variety of tasks, including computer vision, natural language processing, and reinforcement learning. The landmark event in which artificial intelligence bested a human at the game of Go exemplifies the potential power of machine learning. In DNNs, iterated linear and non-linear transformations form a pattern-recognition system that acts as a feature extractor, mapping the raw data (such as the pixel values of natural images) into a nontrivial internal representation, or feature vector. The extracted features enable us to classify the different patterns in the input data.

To promote DNN technology, various researchers have developed learning algorithms that provide faster results and better performance. The algorithms for optimizing DNNs are based on stochastic gradient descent, which partitions a large dataset into several batches and approximates the gradient of the cost function on each batch. The standard choice among the various algorithms stemming from the stochastic gradient method is the adaptive moment estimation (Adam) algorithm. This algorithm is designed to efficiently escape the saddle points that often appear in the cost functions of DNNs. In practice, however, the learning of DNNs suffers from local minima with different generalization performance, which results from the shape of the DNN cost function. A sharp minimizer has poorer generalization performance than a wide, flat minimizer. It is thus important to design a learning algorithm that finds a better solution by escaping from both the saddle points and the local minima. A recent study showed that the batch size is closely related to the generalization performance, which is characterized by the shape of the local minima.
The authors experimentally demonstrated that the large-batch stochastic gradient method and its variants tend to converge to sharp minimizers with poor generalization performance. Small-batch stochastic gradient descent, on the other hand, is likely to fall into wider minimizers, in which the DNNs have high generalization performance. The batch size is closely related to the magnitude of the stochastic noise during learning. In other words, injecting stochastic noise can be the basis of an efficient learning algorithm that converges to wider local minima. In addition, an analytical study on discrete-weight networks revealed subdominant solutions with relatively higher generalization performance than the exponentially dominant (typical) solutions, which deviate from the ground truth. The subdominant solutions are algorithmically reachable by considering the effect of entropy. An algorithm proposed in the literature computes the local entropy by injecting stochastic noise and updates the weights to take the DNN to wider local minima with better generalization performance.

The gradient descent algorithm is closely related to classical dynamics in physics, and its stochastic version is connected with Langevin dynamics, which models classical stochastic dynamics in various fields of nature. In the present study, we test the optimization of DNNs using quantum fluctuations as employed in quantum annealing (QA). Quantum annealing is being developed as a generic solver for optimization problems. The scheme was originally proposed as an algorithm that used numerical computations to optimize cost functions with discrete variables. The theoretical aspects of QA are well known. Its basic concept is derived from the quantum adiabatic theorem, and a successful experimental implementation of QA was realized using present-day technology. Since then, QA has developed rapidly and has attracted much attention.
Several protocols based on QA do not adhere to adiabatic quantum computation or keep the system in the ground state; rather, they employ a nonadiabatic counterpart. In addition, some studies have used a more sophisticated quantum effect. Although the original proposal for QA was designed for optimization problems with discrete variables, as described in the form of a spin-glass Hamiltonian, the concept of QA can be generalized to a wider range of optimization problems, even those with continuous values. Most practical optimization problems, including those in machine learning, use continuous variables. One typical instance is the optimization problem for DNNs.

Below, we apply the concept of QA to the DNN optimization problem. A previous study assessed the potential efficiency of using quantum fluctuations to cope with a non-convex cost function by means of the replica method, which is a sophisticated tool in statistical mechanics. Although that analysis discussed the learning of a discrete-weight neural network (binary variables, as in the Ising model), the essential features are not expected to differ for continuous-variable neural networks. As discussed in the previous study, the generalization performance attained by optimization with quantum fluctuations can be better than that without them. In the present study, we perform practical tests of the optimization of DNNs with quantum fluctuations and discuss its efficiency. Because the computational cost of simulating quantum dynamics is prohibitive, as shown below, our test is restricted to relatively shallow networks. However, our approach is straightforward to apply to deeper networks.

The paper is organized as follows. The second section describes our method for optimizing DNNs. The following section demonstrates the method using three simple tasks. The last section discusses the feasibility of our method.

Methods

Quantum annealing for continuous variables

The optimization problem is interpreted in the context of physics as the minimization of an energy function (potential energy) $V(w)$. We address the optimization of the weights of DNNs below. The weights are denoted by $w \in \mathbb{R}^N$. The standard gradient descent is given as the equation of motion for an overdamped system,

$$w(t+1) = w(t) - \eta \frac{\partial}{\partial w} V(w), \tag{1}$$

where $t$ is the update step and $\eta$ is the learning rate. In physical terms, this is a dynamical system in the low-temperature regime. Considering the thermal effect characterized by the temperature $T$, the weights fluctuate following the Gibbs–Boltzmann distribution

$$P(w) = \frac{1}{Z} \exp\left(-\beta V(w)\right), \tag{2}$$

where $\beta = 1/T$ is the inverse temperature and $Z$ is the partition function, which acts as a normalization constant. In this case, instead of the equation of motion, Langevin dynamics is adequate for describing weights that follow the Gibbs–Boltzmann distribution:

$$w(t+1) = w(t) - \eta \frac{\partial}{\partial w} V(w) + \sqrt{2T\eta}\, \mathcal{N}(0,1). \tag{3}$$

This procedure is known as the stochastic gradient Langevin method, in which the learning rate decreases in the same manner as in simulated annealing (SA). In QA, we introduce quantum fluctuations in addition to the energy function at extremely low temperature, $T \to 0$ ($\beta \to \infty$). We consider the following time-dependent Hamiltonian:

$$\hat{H}(t) = V(\hat{w}) + \frac{1}{2\rho(t)} \hat{p}^2, \tag{4}$$

where $\hat{w}$ denotes the degrees of freedom and $\hat{p}$ represents the momentum, which satisfies the commutation relation $[\hat{w}, \hat{p}] = i\hbar$. In addition, $\rho(t)$ represents the mass of the weights and increases from $0$ to $\infty$ over time throughout the QA process. Following quantum mechanics, the fluctuations of the weights are characterized by a density matrix rather than directly by a distribution function; it is defined as

$$\hat{\rho} = \frac{1}{Z} \exp\left(-\beta \hat{H}(t)\right), \tag{5}$$

where $Z = \mathrm{Tr}\left(\exp(-\beta \hat{H}(t))\right)$.
To specify the probability distribution of the realized configuration of the weights, we compute the diagonal matrix elements

$$P(w) = \langle w | \hat{\rho} | w \rangle, \tag{6}$$

where $\hat{w} | w \rangle = w | w \rangle$. However, the computation of the density matrix is intractable in general. We therefore employ the Suzuki–Trotter decomposition to reduce the operators to c-numbers by introducing $M$ copies, and obtain the following path-integral representation, as shown in the Appendix:

$$P(w) = \lim_{M \to \infty} \int \mathcal{D}w \prod_{k=1}^{M} \exp\left(-\frac{\beta}{M} V(w_k) - \frac{M\rho(t)}{2\beta} \|w_k - w_{k-1}\|_2^2\right), \tag{7}$$

where $\int \mathcal{D}w = \prod_{k=1}^{M-1} \int dw_k$, $M$ is the Trotter number, and $k$ is the index of the replicated system. The boundary condition is set to $w_0 = w_M = w$. The numerical implementation of the Suzuki–Trotter decomposition approximates the distribution function (7) by using a finite value of $M$. For instance, in quantum Monte Carlo simulation, the configuration of the degrees of freedom is sampled from the distribution

$$P(w_1, w_2, \cdots, w_M) = \prod_{k=1}^{M} \exp\left(-\frac{\beta}{M} V(w_k) - \frac{M\rho(t)}{2\beta} \|w_k - w_{k-1}\|_2^2\right), \tag{8}$$

in which the inverse temperature is taken to be $\beta \to \infty$ with $\beta/M$ kept finite. In other words, the quantum Monte Carlo simulation deals with many replicated realizations, or paths, $w_k(t)$ with index $k$ (imaginary time) that follow the Langevin dynamics

$$w_k(t+1) = w_k(t) - \eta \frac{\partial}{\partial w_k} V(w_k(t)) - \eta T_q^2 \rho(t) \left(2 w_k(t) - w_{k-1}(t) - w_{k+1}(t)\right) + \sqrt{2 T_q \eta}\, \mathcal{N}(0,1), \tag{9}$$

where $T_q = M/\beta$. The many DNN realizations thus interact with each other through the elastic term, which represents the quantum effect. The elastic term drives the DNN realizations toward a single condensed solution $w^*$ when $\rho(t)$ takes a relatively large value. By the boundary condition $w_0 = w_M$, we have $w^* = w$. For simplicity, let us first consider the case with a large $\rho(t)$. The path-integral formulation allows fluctuation around $w^*$.
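
As a consistency check that is not spelled out in the text, the coefficient of the elastic term in Eq. (9) can be read off from the exponent of Eq. (8). The lines below assume the factor $1/2$ in the kinetic term and introduce an auxiliary effective energy $E$ (a symbol used only here) by writing the exponent as $-E/T_q$ with $T_q = M/\beta$:

$$-\frac{\beta}{M} \sum_{k} V(w_k) - \frac{M\rho(t)}{2\beta} \sum_{k} \|w_k - w_{k-1}\|_2^2 = -\frac{1}{T_q} \left[ \sum_{k} V(w_k) + \frac{T_q^2 \rho(t)}{2} \sum_{k} \|w_k - w_{k-1}\|_2^2 \right] \equiv -\frac{E}{T_q},$$

$$\frac{\partial E}{\partial w_k} = \frac{\partial V(w_k)}{\partial w_k} + T_q^2 \rho(t) \left(2 w_k - w_{k-1} - w_{k+1}\right),$$

since $w_k$ enters both $\|w_k - w_{k-1}\|_2^2$ and $\|w_{k+1} - w_k\|_2^2$. A Langevin step of size $\eta$ at temperature $T_q$ then has drift $-\eta\, \partial E / \partial w_k$ and noise amplitude $\sqrt{2 T_q \eta}$, which matches the structure of Eq. (9).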
The action in the exponent of $P(w)$ thus has two terms: one is the cost function, which is what we originally want to optimize, and the other measures the degree of condensation of the realizations. As shown in the Appendix, $w_k - w$ follows a Gaussian distribution with a covariance determined by $\beta V_{kk'}(t)$. Thus, for a large $\rho(t)$ the distribution function is approximated by

$$P(w) \approx \int \mathcal{D}w \exp\left(-\frac{\beta}{M} \sum_{k} V(w_k)\right) \exp\left(-\frac{\beta}{2} \sum_{k,k'} (w_k - w) V_{kk'}(t) (w_{k'} - w)\right). \tag{10}$$

To simplify the analysis, we bound the logarithm of the distribution function from below:

$$\log P(w) \ge M \log \int dw' \exp\left(-\frac{\gamma(t)}{2 T_q} \|w' - w\|_2^2 - \frac{1}{T_q} V(w')\right), \tag{11}$$

where $\gamma(t)$ is a coefficient chosen so that the inequality is maintained. Up to sign and the factor $M$, the right-hand side is the cost function appearing in the entropy stochastic gradient descent (E-SGD) algorithm, which captures wider local minima. Taking the derivative of this bound with respect to $w$ in order to obtain the most probable weights, we find that the update of $w(t)$ is driven by the term

$$\gamma(t) \left(w(t) - \langle w' \rangle\right), \tag{12}$$

where $\langle \cdots \rangle$ denotes the average of $w'$ over the integrand of (11). The average cannot be computed directly and is instead estimated by the following Langevin dynamics:

$$w'(s+1) = w'(s) - \eta \left\{ \frac{\partial}{\partial w'} V(w'(s)) + \gamma(t) \left(w'(s) - w(t)\right) \right\} + \sqrt{2 T_q \eta}\, \mathcal{N}(0,1). \tag{13}$$

In the E-SGD algorithm, $\gamma(t)$ is a decreasing value, which vanishes at the completion of optimization. The time dependence of $\gamma(t)$ is closely related to that of $\rho(t)$, as described in the Appendix. In standard QA, we gradually increase $\rho(t)$, and $\gamma(t)$ increases accordingly. Thus, the E-SGD algorithm is essentially different from the standard QA procedure.
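
The two-level structure of Eqs. (12) and (13) can be sketched as follows: an inner Langevin chain estimates $\langle w' \rangle$, and an outer step moves $w$ against $\gamma (w - \langle w' \rangle)$. This is a minimal sketch under stated assumptions, not the E-SGD reference implementation. The toy potential, the outer step size `lr`, and the names `grad_v`, `local_average`, and `outer_step` are hypothetical. The restoring term is written as $\gamma (w' - w)$ so that the chain samples the integrand of Eq. (11).

```python
import math
import random


def grad_v(w):
    """Gradient of an assumed toy potential V(w) = (w - 1)**2 / 2."""
    return w - 1.0


def local_average(w, gamma, eta, t_q, n_steps, rng):
    """Estimate <w'> with an inner Langevin chain started at w (Eq. 13)."""
    w_prime = w
    running_sum = 0.0
    for _ in range(n_steps):
        # Drift: loss gradient plus the pull of the chain back toward w.
        drift = grad_v(w_prime) + gamma * (w_prime - w)
        noise = math.sqrt(2.0 * t_q * eta) * rng.gauss(0.0, 1.0)
        w_prime = w_prime - eta * drift + noise
        running_sum += w_prime
    return running_sum / n_steps


def outer_step(w, gamma, lr, eta, t_q, n_steps, rng):
    """Move w against gamma * (w - <w'>), the term in Eq. (12).

    lr is an assumed outer step size; the text does not specify one.
    """
    mean_w_prime = local_average(w, gamma, eta, t_q, n_steps, rng)
    return w - lr * gamma * (w - mean_w_prime)


if __name__ == "__main__":
    rng = random.Random(0)
    w = 0.0
    for _ in range(100):
        w = outer_step(w, gamma=1.0, lr=0.5, eta=0.1, t_q=0.01,
                       n_steps=50, rng=rng)
    print("w after 100 outer steps:", w)
```

Setting `t_q` to zero removes the noise, so the inner chain relaxes deterministically toward the minimum of $V(w') + \gamma (w' - w)^2 / 2$ and the outer step follows it.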
As the authors of E-SGD stated, this corresponds to the “reverse annealing” method considered in the literature. Reverse annealing is now implemented in the current D-Wave machine and shows better optimization performance. A similar approach for increasing the performance is to search by inducing quantum fluctuations. In these cases, reverse annealing amounts to inducing quantum fluctuations during the process, that is, a schedule in which $\rho(t)$ takes the same value at the beginning and at the end ($t = T$) of the process, with quantum fluctuations present in between.

Figure 1. Schematic pictures of two local minima and quantum effects.
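
To make the replicated dynamics of Eq. (9) concrete, the sketch below evolves $M$ coupled copies of a single scalar weight. It is a minimal illustration under stated assumptions, not the implementation used in the experiments. The quadratic toy potential, the periodic coupling of the slices (mirroring $w_0 = w_M$), all parameter values, and the names `grad_v`, `replica_langevin_step`, and `spread` are hypothetical choices made here.

```python
import math
import random


def grad_v(w):
    """Gradient of an assumed toy potential V(w) = w**2 / 2."""
    return w


def replica_langevin_step(ws, eta, rho, t_q, rng):
    """Apply one synchronous update of Eq. (9) to every Trotter slice.

    ws holds one scalar weight per slice; slices are coupled periodically,
    so slice 0 and slice M-1 are neighbours.
    """
    m = len(ws)
    new_ws = []
    for k in range(m):
        # Elastic term: discrete Laplacian along the imaginary-time index k.
        elastic = 2.0 * ws[k] - ws[(k - 1) % m] - ws[(k + 1) % m]
        noise = math.sqrt(2.0 * t_q * eta) * rng.gauss(0.0, 1.0)
        new_ws.append(
            ws[k] - eta * grad_v(ws[k]) - eta * t_q ** 2 * rho * elastic + noise
        )
    return new_ws


def spread(ws):
    """Distance between the two most distant replicas."""
    return max(ws) - min(ws)


if __name__ == "__main__":
    rng = random.Random(0)
    ws = [rng.uniform(-5.0, 5.0) for _ in range(8)]
    for _ in range(200):
        ws = replica_langevin_step(ws, eta=0.01, rho=10.0, t_q=1.0, rng=rng)
    print("replica spread after 200 steps:", spread(ws))
```

With the noise switched off, a larger `rho` makes the replicas contract toward one another faster, which is the condensation effect of the elastic term.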

Finite-value quantum annealing

As described in previous studies, there is a useful algorithm that exploits an entropic effect around a single condensed solution. That algorithm elucidates one aspect of the quantum effect, namely the entropic effect. In our study, we directly optimize the cost function that appears in the exponent of the probability distribution (8), which involves nontrivial quantum tunneling stemming from non-perturbative effects. Thus, we must deal with $M$ replicated systems when optimizing the DNNs. In this sense, our procedure is not reasonable for optimizing DNNs in practical applications. However, our trial may stimulate interest in possible applications of quantum computation. We report several simple DNN optimization tests to provide future perspectives on machine learning in relation to quantum mechanics.

From this point forward, we do not focus on cases with a large $\rho(t)$. We consider directly optimizing the cost function in (8) in the limit $T \to 0$, as

$$w_k(t+1) = w_k(t) - \eta \frac{\partial}{\partial w_k} V(w_k(t)) - \eta \rho(t) \left(2 w_k(t) - w_{k-1}(t) - w_{k+1}(t)\right). \tag{14}$$

In addition, we consider finite-value quantum annealing, in which the quantum fluctuation remains at the final stage of optimization. In standard QA, we gradually increase $\rho(t)$ to obtain a single realization among many replicas. However, as discussed later, a moderate value of $\rho(t)$ is beneficial for obtaining improved generalization performance. When we do not consider the “quality” of the solution, standard QA is one of the best choices. The theoretical guarantee that ideal QA reaches the optimal solution with the lowest cost function value is well established on the basis of the adiabatic theorem. However, in the case of DNN optimization, the quality of the solution is measured on a different scale from the cost function itself, namely the generalization performance.
Therefore, the standard QA method is not necessarily the best choice for the optimization of DNNs. We instead inject a finite quantum fluctuation to attain better generalization performance.

Here, we provide a simple schematic picture of how finite-value QA attains improved generalization performance. For simplicity, we assume that a DNN loss function has two local minima: a sharp local minimum and a wide local minimum. Both have the same depth, as shown in Fig. 1. In other words, the first term in the cost function (14) takes the same value at the two local minima. Let us consider which solution is favored in standard QA. In standard QA, we increase $\rho(t)$ to a very large value. When the optimization is successfully performed without entrapment in any saddle points or trivial local minima, we compare the two representative local minima of the cost function (14). When most of the realizations of the $M$-replicated DNNs are condensed into the sharp local minimum, the cost function (14) takes a smaller value than in the case of the wide local minimum. Thus, a successful run of standard QA is absorbed into the sharp local minimum. In this sense, standard QA is not suitable for the optimization of DNNs. Instead, in finite-value QA, the final value of $\rho(t)$ is set to be finite. Then, depending on the final value of $\rho(t)$, the resultant solution is allowed to be absorbed into the wider local minimum of the loss function. In a previous study, $\gamma(t)$ (which plays a role similar to that of $\rho(t)$) is referred to as the scoping coefficient and is gradually decreased.

The remaining problem is that, in general, we do not know a priori an adequate strength for the quantum fluctuation. We propose an adaptive approach for tuning the value of $\rho(t)$ in the next subsection.

Quantum Adam

We hereafter take the loss function $L(D|w)$ for a training dataset $D$ as the energy function. The loss function measures the discrepancy between the ground-truth labels $t$ and the output $y$ predicted by the network. The gradient of the loss function is computed using the back-propagation method. We employ the stochastic gradient descent method by dividing the training dataset into $M$ minibatches $\{D_1, D_2, \cdots, D_M\}$. This makes it convenient to process a large amount of training data and mitigates the computational cost of the gradient. We then distribute the minibatches to the Trotter slices $k$. Following the standard prescription of the Suzuki–Trotter decomposition, we should use the same energy function on each Trotter slice. However, to introduce stochasticity across the $M$-replicated DNNs and thereby learn efficiently, we employ the loss function $L(D_k|w_k)$ on each Trotter slice $k$. Thus, we divide the training dataset into $M$ minibatches, where $M$ is the number of Trotter slices. We then sweep all the minibatches over each Trotter slice in an epoch. The minibatches are randomly shuffled in each epoch.

We assume that our procedure is employed in practice in a parallel computing environment. In current machine learning practice, parallel computing is sometimes employed for learning from very large datasets. As in our case, the elastic term $\rho \|w_k - w^*\|^2$ has been used in parallel computing environments. Another study prepared a master with $w$ and updated it by summing over the gradients obtained by workers with $w_k$.

We now address the remaining problem of determining the magnitude of the coefficient $\rho(t)$ of the elastic term. We exploit the idea of the Adam method, which is often used in DNN optimization, to adaptively change the coefficient. Adam accelerates the update when the gradient tends to shrink around a saddle point.
In Adam, the standard gradient descent update (1) is replaced by

$$w(t+1) = w(t) - \frac{\eta}{\sqrt{\tilde{v}(t)} + \varepsilon}\, \tilde{m}(t), \tag{15}$$

where $\tilde{m}(t) = m(t)/(1 - \beta_1^t)$, $\tilde{v}(t) = v(t)/(1 - \beta_2^t)$, and

$$m(t) = (1 - \beta_1)\, m(t-1) + \beta_1\, g(t), \tag{16}$$

$$v(t) = (1 - \beta_2)\, v(t-1) + \beta_2\, g(t) \odot g(t). \tag{17}$$

Here, $g(t)$ is the gradient of the loss function. The hyperparameters $\beta_1$ and $\beta_2$ are chosen a priori. The small constant $\varepsilon$ avoids accidental division by zero. The product $\odot$ and the division between vectors are performed component-wise. During the update iterations, the magnitude of the gradient becomes small around a saddle point. Then, $v(t)$ becomes a vector with small-valued elements, and the coefficient $\eta/(\sqrt{\tilde{v}(t)} + \varepsilon)$ of the effective gradient $\tilde{m}(t)$ increases. The updates are thus performed efficiently even around the saddle point. This is a rough sketch of the learning acceleration provided by Adam.

For tuning $\rho(t)$, we employ a technique similar to the one in Adam, in which the coefficient of the effective gradient is adaptively changed as follows:

$$w_k(t+1) = w_k(t) - \frac{\eta}{\sqrt{\tilde{v}_k(t)} + \varepsilon}\, \tilde{m}_k(t) - \frac{\eta \rho}{\sqrt{\tilde{v}^q_k(t)} + \varepsilon}\, \tilde{m}^q_k(t), \tag{18}$$

where $\tilde{m}_k(t)$ and $\tilde{v}_k(t)$ are obtained in the same manner as in Adam, $\tilde{m}^q_k(t) = m^q_k(t)/(1 - \alpha_1^t)$, $\tilde{v}^q_k(t) = v^q_k(t)/(1 - \alpha_2^t)$, and

$$m^q_k(t) = (1 - \alpha_1)\, m^q_k(t-1) + \alpha_1\, g^q_k(t), \tag{19}$$

$$v^q_k(t) = (1 - \alpha_2)\, v^q_k(t-1) + \alpha_2\, g^q_k(t) \odot g^q_k(t). \tag{20}$$

Here, $g^q_k(t) = 2 w_k(t) - w_{k+1}(t) - w_{k-1}(t)$. As in Adam, the hyperparameters $\alpha_1$ and $\alpha_2$ are set a priori. The above update rule adaptively tunes the elastic term: the coefficient is effectively changed as $\rho(t) \to \rho/(\sqrt{\tilde{v}^q_k(t)} + \varepsilon)$.

Following standard QA, the weights are randomly initialized in order to search for good candidates for the optimal solution over a relatively wide range.
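
The structure of the quantum Adam update in Eq. (18) can be sketched as below, with one Adam accumulator per Trotter slice for the loss gradient and a second one for the elastic "gradient" $g^q_k$. This is a minimal sketch under stated assumptions rather than the authors' code. Weights are scalars, the slices are coupled periodically, and the names `AdamState` and `quantum_adam_step` are hypothetical. The moment updates follow the usual Adam convention, in which the decay rate multiplies the previous moment; this differs from the way the factors are arranged in Eqs. (16), (17), (19), and (20) as printed.

```python
import math


class AdamState:
    """Bias-corrected first and second moment estimates for one scalar."""

    def __init__(self, b1, b2):
        self.b1 = b1
        self.b2 = b2
        self.m = 0.0
        self.v = 0.0
        self.t = 0

    def direction(self, g, eps=1e-8):
        """Absorb gradient g and return m_hat / (sqrt(v_hat) + eps)."""
        self.t += 1
        self.m = self.b1 * self.m + (1.0 - self.b1) * g
        self.v = self.b2 * self.v + (1.0 - self.b2) * g * g
        m_hat = self.m / (1.0 - self.b1 ** self.t)
        v_hat = self.v / (1.0 - self.b2 ** self.t)
        return m_hat / (math.sqrt(v_hat) + eps)


def quantum_adam_step(ws, grads, loss_states, elastic_states, eta, rho):
    """Update every Trotter slice once, following the form of Eq. (18).

    ws, grads      : per-slice weights and loss gradients (scalars here)
    loss_states    : one AdamState per slice for the loss gradient
    elastic_states : one AdamState per slice for the elastic term g_q
    """
    m = len(ws)
    new_ws = []
    for k in range(m):
        # Elastic "gradient" coupling slice k to its two neighbours.
        g_q = 2.0 * ws[k] - ws[(k + 1) % m] - ws[(k - 1) % m]
        step = eta * loss_states[k].direction(grads[k])
        step += eta * rho * elastic_states[k].direction(g_q)
        new_ws.append(ws[k] - step)
    return new_ws
```

In a training loop, `grads[k]` would be the gradient of $L(D_k|w_k)$ on the minibatch currently assigned to slice $k$. Keeping separate accumulators for the two terms is what lets the effective elastic coefficient grow when the replicas have drawn close together.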
Because of the random initialization, the weights associated with the different Trotter slices differ from one another in the initial stage of optimization. Owing to the elastic term, the discrepancies between Trotter slices begin to lessen after several iterations. In other words, the tunneling effect gradually decays, and the effective coefficient $\rho/(\sqrt{\tilde{v}^q_k(t)} + \varepsilon)$ then increases to enhance the tunneling effect again. Therefore, the above update rule efficiently induces the tunneling effect without directly tuning the value of the mass $\rho$. We call the above update rule “quantum Adam” in the sense that we add the quantum effects stemming from $g^q_k(t)$ while tuning their contribution during learning. We emphasize that other gradient methods developed for machine learning, including AdaGrad, AdaDelta, RMSprop, and the Sum of Functions Optimizer, can be combined with the quantum effect in the same manner.

In the following section, we demonstrate the effectiveness of quantum Adam by testing it on two datasets: the MNIST handwritten digit dataset and the Olivetti face image dataset; both are open datasets often used in benchmark tests for machine learning.

Figure 2. Accuracy for test data (red dashed curves: classical Adam; blue solid curves: quantum Adam) in a single-layer NN for MNIST. All results from the $M$-replicated systems are indicated by light-colored curves. The bold curves denote the average, and the thin curves represent the maximum in the replicated NNs. The horizontal axis represents the epoch, and the vertical axis represents the accuracy on the test data.

Results

In this section, we demonstrate the application of quantum Adam to DNNs by using well-known open datasets. Although the datasets used in the experiments contain data that are relatively easy to analyze, implementing the $M$-replicated DNNs required for quantum Adam incurs high computational costs. In this sense, the present study is simply a proof of concept.

For simplicity, we used ReLU as the activation function in the middle layers in all experiments. We used cross entropy as the cost function for classification and the mean-squared error for auto-encoding in the results shown below. The weights are initialized with i.i.d. Gaussian samples with zero mean and a standard deviation proportional to $1/\sqrt{N_l}$, where $N_l$ is the number of inputs for each layer $l$. We use the standard choice of $\alpha_1 = \beta_1 =$ […] and $\alpha_2 = \beta_2 =$ […]. […] $M$-independent classical (standard) and quantum Adam tests for comparison. We then assessed the generalization performance in terms of the average and minimum/maximum of the loss function/accuracy.

The first task was to classify the MNIST 8 × […] $M =$ […] $\rho =$ […]0. Both the average and the maximum accuracy confirm that quantum Adam is superior to classical Adam.

The second task was to build an auto-encoder, which recovers the original input as the output, by using MNIST 8 × […] $M =$ […] $\rho =$ […]0. Both the average and the minimum of the loss function in the replicated systems confirm that quantum Adam is superior to classical Adam. However, this result might be accidental, as there were no significant improvements in several experiments in terms of the mean-squared error.

The third task was to classify the Olivetti 64 × […] data points and setting $M = 40$. We then determined the accuracy using 200 data items. In this case, we set the constant $\rho =$ […].

Discussion

We proposed quantum Adam, formulated through a path-integral representation, for the optimization of DNNs. The proposed algorithm generates an elastic term between different realizations of DNNs and could find a better solution in terms of generalization performance than classical Adam. The point is to control the quantum fluctuation by adaptively changing the coefficient and to favor the wide, flat local minimum by means of the entropic effect, as discussed in previous studies. In the present study, we directly optimize the $M$-replicated DNNs while dealing with the non-perturbative effect, which allows quantum tunneling. Although relatively small datasets are used, we demonstrate better generalization performance by considering optimization with a finite quantum fluctuation strength. In this sense, our method does not conform to the standard QA method. Ideal QA might not be the best choice of learning algorithm for DNNs because the resultant solutions are absorbed into a sharp minimum. With recent developments in the manufacturing of microdevices, QA has been successfully implemented with superconducting qubits in so-called quantum annealers. Several experiments have shown that the resultant solutions seem to fall into wide local minima. However, this is due to the freezing phenomenon in the quantum annealer, which is a problem particular to the quantum device. The resultant solutions are closely related to low-energy states with a certain value of quantum fluctuation, as pointed out in the literature. In other words, the output from the present version of the quantum annealer follows the Gibbs–Boltzmann distribution with a certain value of quantum fluctuations. In this sense, QA as performed in real experiments can be a choice of learning algorithm.

Figure 3. Loss function for test data in an auto-encoder using MNIST. All results from the replicated systems are indicated by light-colored curves. The bold and thin curves indicate the average and the minimum in the replicated NNs. The horizontal axis represents the epoch, and the vertical axis represents the loss function on the test data. The inset shows an enlarged view of the average loss functions during 800–1000 epochs.

In addition, the current version of the quantum annealer, the D-Wave 2000Q, implements two optimization techniques that manipulate the value of quantum fluctuation, namely quenching and reverse annealing. These two techniques will be available for efficiently attaining better generalization performance in real experiments, as discussed in the literature.

In the present study, we perform the optimization on classical computers. In addition, we select the strength of the quantum fluctuation by an adaptive change inspired by the Adam method. The potential performance of quantum Adam emerges in cases with a large Trotter number, which corresponds to the number of minibatches. When we use a small number of minibatches, quantum Adam does not work well, because most of the DNNs fall into sharp minimizers. In addition, the value of $\rho$ should be tuned adequately. When $\rho$ is too large, the search range becomes narrow, whereas when $\rho$ is too small, the replicas do not reach a condensed solution. We tested three different tasks to assess the performance of quantum Adam in comparison with classical Adam. The results demonstrate that quantum Adam can provide fairly good performance. We emphasize that the most important feature of quantum Adam is its generalization performance. In machine learning, the purpose of improving learning is nothing other than enhancing generalization performance with limited epochs and computational resources. In quantum Adam, the elastic term aggregates the DNNs during learning. This effect might prevent sudden falls into a valley. In other words, when most of the DNNs are in a wide minimizer, the others tend not to fall into a sharp minimizer; this can lead to improved generalization performance.

In quantum Adam, we use $M$-replicated DNNs. In a sense, this seems excessive.
However, when we process a large dataset, we distribute each batch to a number of processors or GPUs and establish a consensus to obtain DNNs with high generalization performance. Our present method is too computationally expensive to implement in the ordinary environments used in a wide range of research efforts, although it might be useful for learning from large datasets in parallel computing environments. In this sense, our algorithm might be helpful even on classical computers. In future research, we shall test quantum Adam in a parallel computing environment with a large dataset comprising high-dimensional components, and propose another simplified algorithm by elucidating the most significant part of the quantum fluctuations, as in previous studies.

Figure 4. Accuracy for test data for classification of Olivetti face images. The same curves as those in Fig. 2 are used. The horizontal axis represents the epoch, and the vertical axis represents the accuracy on the test data.

We remark on the time complexity of quantum Adam. The standard assessment of the time complexity of QA can be performed by estimating the energy gap of the time-dependent Hamiltonian. In our case, through the Suzuki–Trotter decomposition, the problem is reduced to an optimization problem for a cost function with continuous variables. Considering the rate of convergence to a minimum in the feasible set, the classical Adam method has a convergence rate of $O(1/\sqrt{T})$, as shown in the literature. We believe that a similar analysis can also be performed for quantum Adam. In addition, we emphasize that the most important feature of quantum Adam is its generalization performance. In this sense, the present study opens a new aspect of QA: not pursuing the minimum of the cost function, but pursuing a different optimality measured by an indicator other than the cost function itself.

Finally, in the present study, we demonstrate the potential power of quantum fluctuation, as exploited in QA.
It promotes the “quality” of the solution via optimization with quantum fluctuations. The performance of an optimization solver is usually evaluated by the cost function itself. In particular, the performance of QA has been discussed in terms of the decrease of the cost function. However, robustness of the solution can be attained by optimizing the cost function in conjunction with the local entropy, as discussed in the literature. Optimization with quantum fluctuations automatically, and potentially, leads to robustness of the solution, as discussed in the present study. In the context of machine learning, generalization performance is the robustness of the solution. In the future, a deeper understanding of quantum fluctuations would promote various approaches in machine learning and beyond.

References

LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nat., 436–444 (2015).

Goodfellow, I., Bengio, Y. & Courville, A. Deep Learning (MIT Press, 2016).

Robbins, H. & Monro, S. A stochastic approximation method. Ann. Math. Stat., 400–407 (1951).

Bottou, L. Online algorithms and stochastic approximations. In Saad, D. (ed.) Online Learning and Neural Networks (Cambridge University Press, Cambridge, UK, 1998). Revised, Oct 2012.

Sutskever, I., Martens, J., Dahl, G. & Hinton, G. On the importance of initialization and momentum in deep learning. In Proceedings of the 30th International Conference on Machine Learning - Volume 28, ICML’13, III–1139–III–1147 (JMLR.org, 2013).

Kingma, D. P. & Ba, J. Adam: A Method for Stochastic Optimization. In 3rd Int. Conf. for Learn. Represent. (ICLR), 2015 (2015).

Shirish Keskar, N., Mudigere, D., Nocedal, J., Smelyanskiy, M. & Tang, P. T. P. On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima. ArXiv e-prints (2016).

Baldassi, C., Ingrosso, A., Lucibello, C., Saglietti, L. & Zecchina, R. Subdominant dense clusters allow for simple learning and high computational performance in neural networks with discrete synapses. Phys. Rev. Lett., 128101 (2015).

Baldassi, C. et al. Unreasonable effectiveness of learning neural networks: From accessible states and robust ensembles to basic algorithmic schemes. Proc. Natl. Acad. Sci., E7655–E7662 (2016).

Chaudhari, P. et al. Entropy-SGD: Biasing Gradient Descent Into Wide Valleys. ArXiv e-prints (2016).

Kadowaki, T. & Nishimori, H. Quantum annealing in the transverse Ising model. Phys. Rev. E, 5355–5363 (1998). DOI 10.1103/PhysRevE.58.5355.

Suzuki, S. & Okada, M. Residual energies after slow quantum annealing. J. Phys. Soc. Jpn., 1649–1652 (2005). DOI 10.1143/JPSJ.74.1649.

Morita, S. & Nishimori, H. Mathematical foundation of quantum annealing. J. Math. Phys. (2008). DOI 10.1063/1.2995837.

Ohzeki, M. & Nishimori, H. Quantum annealing: An introduction and new developments. J. Comput. Theor. Nanosci., 963–971 (2011). DOI 10.1166/jctn.2011.1776963.

Johnson, M. W. et al. A scalable control system for a superconducting adiabatic quantum optimization processor. Supercond. Sci. Technol., 065004 (2010).

Berkley, A. J. et al. A scalable readout system for a superconducting adiabatic quantum optimization system. Supercond. Sci. Technol., 105014 (2010).

Harris, R. et al. Experimental investigation of an eight-qubit unit cell in a superconducting optimization processor. Phys. Rev. B, 024511 (2010). DOI 10.1103/PhysRevB.82.024511.

Bunyk, P. I. et al. Architectural considerations in the design of a superconducting quantum annealing processor. IEEE Transactions on Appl. Supercond., 1–10 (2014). DOI 10.1109/TASC.2014.2318294.

Ohzeki, M. Quantum annealing with the Jarzynski equality. Phys. Rev. Lett., 050401 (2010). DOI 10.1103/PhysRevLett.105.050401.

Ohzeki, M., Nishimori, H. & Katsuda, H. Nonequilibrium work on spin glasses in longitudinal and transverse fields. J. Phys. Soc. Jpn., 084002 (2011). DOI 10.1143/JPSJ.80.084002.

Ohzeki, M. & Nishimori, H. Nonequilibrium work performed in quantum annealing. J. Physics: Conf. Ser., 012047 (2011).

Somma, R. D., Nagaj, D. & Kieferová, M. Quantum speedup by quantum annealing. Phys. Rev. Lett., 050501 (2012).

Seki, Y. & Nishimori, H. Quantum annealing with antiferromagnetic fluctuations. Phys. Rev. E, 051112 (2012). DOI 10.1103/PhysRevE.85.051112.

Nishimori, H. & Takada, K. Exponential enhancement of the efficiency of quantum annealing by non-stoquastic Hamiltonians. Front. ICT, 2 (2017).

Ohzeki, M. Quantum Monte Carlo simulation of a particular class of non-stoquastic Hamiltonians in quantum annealing. Sci. Reports, 41186 (2017).

Baldassi, C. & Zecchina, R. Efficiency of quantum vs. classical annealing in nonconvex learning problems. Proc. Natl. Acad. Sci., 1457–1462 (2018). DOI 10.1073/pnas.1711456115.

Welling, M. & Teh, Y. W. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning, ICML’11, 681–688 (Omnipress, USA, 2011).

Kirkpatrick, S., Gelatt, C. D. & Vecchi, M. P. Optimization by simulated annealing. Sci., 671–680 (1983). DOI 10.1126/science.220.4598.671.

Hatano, N. Localization in non-Hermitian quantum mechanics and flux-line pinning in superconductors. Phys. A: Stat. Mech. its Appl., 317–331 (1998).

Suzuki, M. Relationship between d-dimensional quantal spin systems and (d+1)-dimensional Ising systems: Equivalence, critical exponents and systematic approximants of the partition function and spin correlations. Prog. Theor. Phys., 1454–1469 (1976). DOI 10.1143/PTP.56.1454.

Perdomo-Ortiz, A., Dickson, N., Drew-Brook, M., Rose, G. & Aspuru-Guzik, A. Finding low-energy conformations of lattice protein models by quantum annealing. Sci. Reports, 571 (2012).

Rumelhart, D. E., Hinton, G. E. & Williams, R. J. Learning representations by back-propagating errors. Nat., 533–536 (1986).

Zhang, S., Choromanska, A. & LeCun, Y. Deep learning with elastic averaging SGD. In Proceedings of the 28th International Conference on Neural Information Processing Systems, NIPS’15, 685–693 (MIT Press, Cambridge, MA, USA, 2015).

Li, M., Andersen, D. G., Smola, A. & Yu, K. Communication efficient distributed machine learning with the parameter server. In Proceedings of the 27th International Conference on Neural Information Processing Systems, NIPS’14, 19–27 (MIT Press, Cambridge, MA, USA, 2014).

Duchi, J., Hazan, E. & Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res., 2121–2159 (2011).

Zeiler, M. D. Adadelta: An adaptive learning rate method. CoRR abs/1212.5701 (2012).

Tieleman, T. & Hinton, G. Lecture 6.5 - rmsprop. COURSERA: Neural Networks for Mach. Learn. (2012).

Sohl-Dickstein, J., Poole, B. & Ganguli, S. Fast large-scale optimization by unifying stochastic gradient and quasi-Newton methods. In Xing, E. P. & Jebara, T. (eds.) Proceedings of the 31st International Conference on Machine Learning, vol. 32 of Proceedings of Machine Learning Research, 604–612 (PMLR, Beijing, China, 2014).

Lecun, Y., Bottou, L., Bengio, Y. & Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE, 2278–2324 (1998). DOI 10.1109/5.726791.

Samaria, F. S. & Harter, A. C. Parameterisation of a stochastic model for human face identification. In Proceedings of 1994 IEEE Workshop on Applications of Computer Vision, 138–142 (1994).

Johnson, M. W. et al. Quantum annealing with manufactured spins. Nat., 194 (2011).

Amin, M. H. Searching for quantum speedup in quasistatic quantum annealers. Phys. Rev. A, 052323 (2015).

Acknowledgements

The authors would like to thank Shu Tanaka and Muneki Yasuda for many fruitful discussions that contributed to the work. The present work is financially supported by MEXT KAKENHI Grant Nos. 15H03699 and 16H04382, and by JST START.

Author contributions statement

M.O. conceived and conducted the experiment and analyzed the results. S.O. tested the previous version of the optimization method, M.T. discussed possible applications of our method to industry, and S.T. directed the project and investigated the possible design of our method. All authors discussed the details of the results and reviewed the manuscript.

Additional information

Competing Interests: The authors declare that they have no competing interests.

Path integral representation

By use of the Suzuki–Trotter decomposition, we formulate the path integral representation. Let us start from the following expression of the Suzuki–Trotter decomposition:

$$ Z = \mathrm{Tr}\left\{ \exp\left( -\beta V(\hat{w}) - \frac{\beta \hat{p}^2}{2\rho} \right) \right\} = \mathrm{Tr}\left( \prod_{k=0}^{M-1} \exp\left( -\frac{\beta}{M} V(\hat{w}) \right) \exp\left( -\frac{\beta \hat{p}^2}{2\rho M} \right) \right). \tag{21} $$

We insert the complete sets $\int dw_k\, |w_k\rangle\langle w_k|$ and $\int dp_k\, |p_k\rangle\langle p_k|$, where $\hat{w}|w_k\rangle = w_k|w_k\rangle$ and $\hat{p}|p_k\rangle = p_k|p_k\rangle$. Then we obtain

$$ Z = \int dw_0\, \langle w_0| \int \mathcal{D}w\, \mathcal{D}p \prod_{k=1}^{M} \left\{ \exp\left( -\frac{\beta}{M} V(\hat{w}) \right) |w_k\rangle\langle w_k| \exp\left( -\frac{\beta \hat{p}^2}{2\rho M} \right) |p_k\rangle\langle p_k| \right\} |w_0\rangle. \tag{22} $$

This expression can be reduced to

$$ Z \propto \int dw_0 \int \mathcal{D}w\, \mathcal{D}p \prod_{k=1}^{M} \left\{ \exp\left( -\frac{\beta}{M} V(w_k) \right) \exp\left( i p_k (w_k - w_{k-1}) \right) \exp\left( -\frac{\beta p_k^2}{2\rho M} \right) \right\}, \tag{23} $$

where we have used

$$ \langle w_{k'} | p_k \rangle = \exp\left( i p_k w_{k'} \right). \tag{24} $$

Performing the Gaussian integral with respect to $p_k$ yields

$$ Z \propto \int dw_0 \int \mathcal{D}w \prod_{k=1}^{M} \exp\left( -\frac{\beta}{M} V(w_k) - \frac{M\rho}{2\beta} \| w_k - w_{k-1} \|^2 \right). \tag{25} $$

Strong limit of ρ(t)

First we consider the Fourier transformation of the deviation from the center of weights $w^*$ as

$$ w_k = w^* + \frac{1}{\sqrt{M}} \sum_{r=0}^{M-1} a_r e^{2\pi i k r / M}, \tag{26} $$

where $a_{M-r} = \bar{a}_r$ (the complex conjugate of $a_r$) because $w_k$ is a real vector. Then the elastic term is diagonalized as

$$ \sum_{k=1}^{M} \| w_k - w_{k-1} \|^2 = 4 \sum_{r=0}^{[M/2]} a_r a_{M-r} \left( 1 - \cos\left( \frac{2\pi r}{M} \right) \right), \tag{27} $$

where we have used $\sum_{k=0}^{M-1} e^{2\pi i k r / M} = M \delta(r)$. When $\rho(t) \gg 1$, the exponentiated elastic term is reduced to

$$ \prod_{k=1}^{M} \exp\left( -\frac{M\rho(t)}{2\beta} \| w_k - w_{k-1} \|^2 \right) = \prod_{r=0}^{[M/2]} \exp\left( -\frac{2M\rho}{\beta}\, a_r a_{M-r} \left( 1 - \cos\left( \frac{2\pi r}{M} \right) \right) \right). \tag{28} $$

We find that $a_r$ follows the Gaussian distribution. We then perform the inverse Fourier transformation and obtain

$$ \prod_{k=1}^{M} \exp\left( -\frac{M\rho(t)}{2\beta} \| w_k - w_{k-1} \|^2 \right) = \exp\left( -\frac{1}{2\beta} \sum_{k,k'} (w_k - w^*)\, V_{k,k'}\, (w_{k'} - w^*) \right). \tag{29} $$

In the limit $M \to \infty$, we use $2\pi r / M = x$ and $2\pi / M = dx$ to obtain

$$ \beta V^{-1}_{kk'} = \sum_{r} \frac{\beta}{2M\rho \left( 1 - \cos\left( \frac{2\pi r}{M} \right) \right)} e^{2\pi i (k - k') r / M} = \frac{\beta}{2\rho} \int_0^{2\pi} \frac{dx}{2\pi} \frac{e^{i(k - k')x}}{1 - \cos x}. \tag{30} $$
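
The Fourier diagonalization of the elastic term can be checked numerically. The following is a minimal sketch in plain Python, not the authors' code: the function names are hypothetical, the Trotter direction is assumed to be periodic (as the trace implies), and the center of weights w* is assumed to be the mean over the M replicas. It compares the direct sum of squared differences between neighbouring replicas with the mode sum 2 Σ_r |a_r|² (1 − cos(2πr/M)) over all M modes, where |a_r|² stands for a_r a_{M−r} for real weights. For odd M, pairing r with M − r gives the equivalent half-range form with prefactor 4.

```python
import cmath
import math
import random


def elastic_direct(w):
    """Direct sum_k ||w_k - w_{k-1}||^2 for replicas w[k] (lists of floats)."""
    M = len(w)
    return sum(
        (x - y) ** 2
        for k in range(M)
        for x, y in zip(w[k], w[k - 1])  # w[-1] wraps around: periodic boundary
    )


def mode_powers(w):
    """|a_r|^2 summed over weight components, for r = 0..M-1.

    a_r = M^{-1/2} sum_k (w_k - w*) exp(-2 pi i k r / M), i.e. the inverse of
    w_k = w* + M^{-1/2} sum_r a_r exp(2 pi i k r / M).
    Assumption: w* is the replica mean.
    """
    M, d = len(w), len(w[0])
    w_star = [sum(w[k][j] for k in range(M)) / M for j in range(d)]
    powers = []
    for r in range(M):
        total = 0.0
        for j in range(d):
            a_rj = sum(
                (w[k][j] - w_star[j]) * cmath.exp(-2j * math.pi * k * r / M)
                for k in range(M)
            ) / math.sqrt(M)
            total += abs(a_rj) ** 2
        powers.append(total)
    return powers


def elastic_fourier(w, half_range=False):
    """Elastic term rebuilt from the Fourier modes.

    Full range: 2 * sum_{r=0}^{M-1} |a_r|^2 (1 - cos(2 pi r / M)), exact for any M.
    Half range: 4 * sum_{r=0}^{[M/2]} of the same terms, exact for odd M only.
    """
    M = len(w)
    terms = [
        p * (1.0 - math.cos(2.0 * math.pi * r / M))
        for r, p in enumerate(mode_powers(w))
    ]
    if half_range:
        return 4.0 * sum(terms[: M // 2 + 1])
    return 2.0 * sum(terms)


if __name__ == "__main__":
    rng = random.Random(0)
    # 7 replicas of a 3-component weight vector (toy sizes chosen for illustration)
    replicas = [[rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(7)]
    print(elastic_direct(replicas))
    print(elastic_fourier(replicas))
    print(elastic_fourier(replicas, half_range=True))
```

The three printed values should agree up to floating-point error. Because the r = 0 mode carries zero weight, the result does not depend on how w* is chosen, which is why the replica mean is a harmless assumption here.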