Effectively Trainable Semi-Quantum Restricted Boltzmann Machine
Ya.S. Lyakhova,∗ E.A. Polyakov, and A.N. Rubtsov

Russian Quantum Center, Skolkovo Innovation City, 121205 Moscow, Russia
NTI Center for Quantum Communications, National University of Science and Technology MISiS, 119049 Moscow, Russia
National Research Nuclear University MEPhI, 115409 Moscow, Russia

∗ [email protected]

(Dated: February 5, 2020)

We propose a novel quantum model for the restricted Boltzmann machine (RBM), in which the visible units remain classical whereas the hidden units are quantized as noninteracting fermions. The free motion of the fermions is parametrically coupled to the classical signal of the visible units. This model possesses quantum behaviour such as coherences between the hidden units. Numerical experiments show that this fact makes it more powerful than the classical RBM with the same number of hidden units. At the same time, a significant advantage of the proposed model over other approaches to the Quantum Boltzmann Machine (QBM) is that it is exactly solvable and efficiently trainable on a classical computer: there is a closed expression for the log-likelihood gradient with respect to its parameters. This makes it interesting not only as a model of a hypothetical quantum simulator, but also as a quantum-inspired classical machine-learning algorithm.
I. INTRODUCTION
Nowadays machine learning is becoming an all-pervasive paradigm of how to obtain, process and store knowledge. Initially arising in the field of computer science, it finds numerous interdisciplinary applications. They range from commercial applications like speech and handwriting recognition and classification/recognition of video content [1], up to various scientific applications. As an example of the latter, in physics the most prominent applications are the generative modeling of quantum control and measurement protocols [2], of condensed matter problems [3], and of quantum systems [4].

One of the major approaches to machine learning is generative modeling. Here, given a certain finite data set X, a model of its probability distribution P(X) is estimated. The tuning of the model parameters (i.e. training) is usually carried out by maximizing a certain distribution-resemblance measure. There are two criteria of a successful learning model. Firstly, it has to be easily trainable. It turns out that too complex models (or machines) are challenging and impractical to train. Stacks of simple models are a way around this problem: while each layer is easy to train, their composition can model data of high statistical complexity [5]. The second criterion is imposed by the problem of overfitting. If one increases the number of model parameters for a given data set, the model tends to approximate the particular realization of the random data scatter, thus losing its predictive capability. It follows that a successful machine has to be simple enough to avoid overfitting.

In classical machine learning, such a model is the Restricted Boltzmann Machine (RBM). The RBM is a two-layer energy model with no coupling between the units of the same layer. One layer is called visible and represents the observable data, while the other one is called hidden and represents statistical correlations between the visible units. The energy of such a machine is a quadratic function of the visible and hidden variables, and their probability distribution is given by the finite-temperature Boltzmann distribution [6]. The absence of connections between the units in the same layer makes the RBM a highly efficiently trainable model. Originally, the parameters of the RBM, such as biases and weights, are assumed to be real. Nevertheless, recently a complex-valued RBM was proposed [4]. Such an extension allows the machine to treat quantum properties of the system it models, namely the wave-function amplitude and its phase.

We live in the era of the second quantum revolution, which is characterized by experimental and technological achievements in the control of individual quantum systems. The major motivation behind this activity is to devise feasible quantum computing circuits which could outperform the capabilities of classical computing devices [7]. The field of machine learning does not stand aside. Quantum models of machine learning are being proposed [8, 9]; in particular, the quantum Boltzmann machine (QBM) was proposed recently [10]. The main difference from the classical Boltzmann machine is that both the visible and hidden units (spins) are allowed to be in a state of quantum superposition. The energy function (Hamiltonian) is modified to include non-diagonal connections of the spins to an external field (bias).
While such a model is demonstrated to learn the data distribution better than the classical RBM, its major drawback is its high computational complexity, which makes its training challenging.

The purpose of this work is to present a quantum version of the Restricted Boltzmann Machine which at the same time has reasonable computational complexity, so that it can be efficiently trained. We propose to represent the hidden units by non-interacting fermions. Technically this amounts to the replacement of a vector of m hidden units with a square matrix of size m × m; thus the hidden bias and the coupling between the layers are now represented by an m × m complex-valued matrix and an n × m × m complex-valued tensor respectively, where n is the number of visible units. Analogously to the classical RBM, where the coupling between the layers is linear, the free-motion Hamiltonian of the fermionic hidden units is linearly coupled to the visible (classical) units. We call this model the semi-quantum Restricted Boltzmann Machine (sqRBM).

We evaluate the proposed sqRBM model on two training data sets: the ensemble of Bars&Stripes [11] and the Optdigits dataset of handwritten digits [12]. We compare the log-likelihood with that of the classical RBM. We also present the results of a cross-validation test comparing the overfitting of the sqRBM and the RBM.

In Section II we briefly overview the RBM model and its training procedures. Section III is devoted to a detailed description of the proposed semi-quantum RBM. In this case the probability distribution is given in terms of a density matrix, and the usual summation over the classical hidden units is replaced with a trace over the quantum fermionic hidden units. We also derive the learning rules for training via the gradient ascent algorithm and discuss the numerical implementation in detail. We show the results of numerical experiments in Section IV, where we conduct tests using the Bars&Stripes and OptDigits sets as the input training data. Our conclusions are set out in Section V.

II. CLASSICAL RESTRICTED BOLTZMANN MACHINE

A. Model
The Restricted Boltzmann Machine (RBM) is a bipartite undirected neural network (see figure 1(a)). One of its layers is conventionally called visible and the other one hidden (see figure 1(b)). Both visible (v) and hidden (h) neurons are assumed to take the values {0, 1}. The former are used to describe the observable data, whereas the latter are used to model the correlations between the observable components.

The RBM is an energy-based model whose statistics is described by the Gibbs-Boltzmann distribution:

p(\mathbf{v}, \mathbf{h}) = \frac{1}{Z} e^{-E(\mathbf{v}, \mathbf{h})}, \qquad (1)

with the conventional definition of the partition function

Z = \sum_{\mathbf{v}} \sum_{\mathbf{h}} e^{-E(\mathbf{v}, \mathbf{h})}. \qquad (2)

Physically this corresponds to the assumption that the RBM is in thermal equilibrium at the finite inverse temperature 1/T = 1. The energy E(\mathbf{v}, \mathbf{h}) of the machine is postulated to be a quadratic function of its variables. For an RBM of n visible units and m hidden ones we have

E(\mathbf{v}, \mathbf{h}) = -\sum_{i=1}^{n} b_i v_i - \sum_{j=1}^{m} c_j h_j - \sum_{i=1}^{n} \sum_{j=1}^{m} v_i w_{ij} h_j. \qquad (3)

Here, the coefficients b and c are called biases, whereas w is a weight matrix.

FIG. 1. RBM (a) and sqRBM (b) schemes. They consist of two interacting layers, one of which is called visible (v_i) and the other hidden. The dimensionality of the sqRBM hidden layer \hat{h}^{+}_{j} \hat{h}_{j'} is twice that of the corresponding RBM hidden layer h_j. The interaction between the layers is denoted by w_{ij} for the RBM and by w_{ijj'} for the sqRBM.

The absence of connections between the units of the same layer of the RBM allows one to treat their states as mutually independent in terms of probability, namely [13]

p(\mathbf{h} \,|\, \mathbf{v}) = \prod_{j=1}^{m} p(h_j \,|\, \mathbf{v}), \qquad (4)

and vice versa. This property makes the RBM a solvable and effectively trainable model via the method of direct Gibbs sampling [14].
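The conditional independence (4) is what makes block Gibbs sampling cheap. Below is a minimal NumPy sketch of Eqs. (3)-(4) and of one Gibbs sweep; the array sizes and the weight initialization are illustrative assumptions, not values taken from the paper:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
n, m = 16, 9                      # illustrative numbers of visible/hidden units
b, c = np.zeros(n), np.zeros(m)   # visible and hidden biases
w = 0.01 * rng.standard_normal((n, m))

def p_h_given_v(v):
    # Eq. (4): hidden units are conditionally independent given v,
    # with p(h_j = 1 | v) = sigmoid(c_j + sum_i v_i w_ij)
    return sigmoid(c + v @ w)

def p_v_given_h(h):
    # The symmetric statement for the visible layer
    return sigmoid(b + w @ h)

def gibbs_sweep(v):
    # One block-Gibbs sweep: sample h | v, then v | h
    h = (rng.random(m) < p_h_given_v(v)).astype(float)
    v_new = (rng.random(n) < p_v_given_h(h)).astype(float)
    return v_new, h
```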
B. Training

The main idea of RBM usage is to model the distribution of observable data p_data(\mathbf{v}). This is achieved by adjusting the biases \mathbf{b}, \mathbf{c} and the weights w (the parameters of the model) so that the marginal distribution p(\mathbf{v}) of the visible layer is as close as possible to the target distribution p_data(\mathbf{v}).

The proximity of the target and model distributions can be measured by the log-likelihood function

\mathcal{L}(\mathbf{b}, \mathbf{c}, w) = \frac{1}{N_{\rm data}} \sum_{\mathbf{v}_{\rm data}} \log p(\mathbf{v}_{\rm data}). \qquad (5)

The higher \mathcal{L}(\mathbf{b}, \mathbf{c}, w), the better the given RBM models the distribution of the observables.

Maximizing the log-likelihood \mathcal{L} (or, equivalently, minimizing the negative log-likelihood -\mathcal{L}) can be implemented by the gradient ascent (descent) algorithm [6]. It can be shown [13] that this gives the following update rules for the biases and weights:

\Delta \mathbf{b} = \eta \, \partial_{\mathbf{b}} \mathcal{L} = \eta \left( \frac{1}{N_{\rm data}} \sum_{\mathbf{v}_{\rm data}} \mathbf{v}_{\rm data} - \sum_{\mathbf{v}} p(\mathbf{v}) \, \mathbf{v} \right), \qquad (6a)

\Delta \mathbf{c} = \eta \, \partial_{\mathbf{c}} \mathcal{L} = \eta \left( \frac{1}{N_{\rm data}} \sum_{\mathbf{v}_{\rm data}} \sum_{\mathbf{h}} p(\mathbf{h} \,|\, \mathbf{v}_{\rm data}) \, \mathbf{h} - \sum_{\mathbf{v}, \mathbf{h}} p(\mathbf{v}, \mathbf{h}) \, \mathbf{h} \right), \qquad (6b)

\Delta w = \eta \, \partial_{w} \mathcal{L} = \eta \left( \frac{1}{N_{\rm data}} \sum_{\mathbf{v}_{\rm data}} \sum_{\mathbf{h}} p(\mathbf{h} \,|\, \mathbf{v}_{\rm data}) \, \mathbf{v}_{\rm data} \mathbf{h}^{T} - \sum_{\mathbf{v}, \mathbf{h}} p(\mathbf{v}, \mathbf{h}) \, \mathbf{v} \mathbf{h}^{T} \right). \qquad (6c)

Here \eta is called the learning rate and defines the size of a single gradient ascent step.

There are various techniques for actually evaluating the gradients (6). In this work we employ the fast mean-field-like algorithm called Contrastive Divergence (CD) [14] and the full Monte Carlo simulation called Persistent Contrastive Divergence (PCD) [15].
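Continuing the classical sketch above, a CD step approximates the negative phase of (6) by a single Gibbs sweep started from the data point; this is a minimal sketch of the standard CD-1 recipe [14], not of any implementation detail reported in the paper:

```python
def cd1_update(v_data, eta=0.1):
    # Positive phase: expectations with the visible layer clamped to the data
    ph_data = p_h_given_v(v_data)
    # Negative phase: one Gibbs sweep away from the data (CD-1 approximation)
    v_model, _ = gibbs_sweep(v_data)
    ph_model = p_h_given_v(v_model)
    # Parameter increments, cf. Eqs. (6a)-(6c)
    db = eta * (v_data - v_model)
    dc = eta * (ph_data - ph_model)
    dw = eta * (np.outer(v_data, ph_data) - np.outer(v_model, ph_model))
    return db, dc, dw
```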
III. SEMI-QUANTUM RESTRICTED BOLTZMANN MACHINE

A. Model

In the semi-quantum case we again propose a two-layer system (see figure 1(b)). The visible layer is the same as in the classical RBM: its units are the classical binary variables v. The hidden layer, on the contrary, is composed of quantum degrees of freedom \hat{h}. In the spirit of the Schwinger-Wigner representation, each hidden binary unit is represented by a pair of fermionic creation/annihilation operators. Such a system is described by the following Hamiltonian:

\hat{H}(\mathbf{v}, \hat{\mathbf{h}}) = -\sum_{i=1}^{n} b_i v_i - \sum_{j,j'=1}^{m} c_{jj'} \hat{h}^{+}_{j} \hat{h}_{j'} - \sum_{i=1}^{n} \sum_{j,j'=1}^{m} v_i w_{ijj'} \hat{h}^{+}_{j} \hat{h}_{j'}, \qquad (7)

where c is a Hermitian matrix and w_{ijj'} is Hermitian with respect to j, j' for every given i. We call this model the Semi-Quantum Restricted Boltzmann Machine (sqRBM).

As in the classical case, we assume that the sqRBM is in a state of thermal equilibrium at some finite inverse temperature \beta. Thus its state is described by the canonical ensemble in terms of the density matrix

\hat{\rho}(\mathbf{v}, \hat{\mathbf{h}}) = \frac{e^{-\beta \hat{H}(\mathbf{v}, \hat{\mathbf{h}})}}{Z}. \qquad (8)

Here the partition function Z of this hybrid classical-quantum system is given by

Z = \sum_{\mathbf{v}} \mathrm{Tr}_h \left( e^{-\beta \hat{H}(\mathbf{v}, \hat{\mathbf{h}})} \right), \qquad (9)

where \sum_{\mathbf{v}} stands for the ordinary summation over the classical visible configurations v, and the trace \mathrm{Tr}_h is taken over the quantum hidden degrees of freedom. Hereinafter we set \beta = 1 for simplicity.

The fermionic hidden subsystem obeys Fermi-Dirac statistics:

p(\hat{h}^{+}_{j} \hat{h}_{j'} \,|\, \mathbf{v}) = \left( \frac{1}{1 + e^{-H_h}} \right)_{jj'}, \qquad (10)

where H_h is the m × m matrix with elements (H_h)_{jj'} = c_{jj'} + \sum_{i=1}^{n} v_i w_{ijj'}. It is now clear that in the case when the Hamiltonian (7) is diagonal in the fermionic degrees of freedom, namely c_{jj'} = c_{jj} \delta_{jj'} and w_{ijj'} = w_{ijj} \delta_{jj'}, the model is fully equivalent to the classical RBM (3).
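Eq. (10) is a matrix function of the Hermitian matrix H_h and is conveniently evaluated by diagonalization. A short sketch follows; it is independent of the classical sketch above, and the sizes and random Hermitian parameters are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 16, 3                                  # illustrative sizes
b = np.zeros(n)                               # classical visible bias
a = rng.standard_normal((m, m)) + 1j * rng.standard_normal((m, m))
c = 0.5 * (a + a.conj().T)                    # Hermitian hidden bias matrix
w = rng.standard_normal((n, m, m))
w = 0.5 * (w + np.transpose(w, (0, 2, 1)))    # Hermitian in (j, j') for each i

def H_h(v):
    # (H_h)_{jj'} = c_{jj'} + sum_i v_i w_{ijj'}
    return c + np.tensordot(v, w, axes=1)

def hidden_density_matrix(v):
    # Eq. (10): p(h^+_j h_j' | v) = ((1 + exp(-H_h))^{-1})_{jj'},
    # evaluated through the eigendecomposition of the Hermitian H_h
    eps, U = np.linalg.eigh(H_h(v))
    f = 1.0 / (1.0 + np.exp(-eps))            # Fermi-Dirac occupations, beta = 1
    return (U * f) @ U.conj().T
```

In the diagonal limit c_{jj'} = c_j δ_{jj'}, w_{ijj'} = w_{ij} δ_{jj'} the matrix above reduces to diag(sigmoid(c_j + Σ_i v_i w_{ij})), recovering the classical conditional (4).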
B. Training

As in the case of the classical RBM, the goal of training is to approximate the probability distribution of the data p_data(\mathbf{v}) by the marginal probability distribution p(\mathbf{v}) of the visible variables:

p_{\rm data}(\mathbf{v}) \approx p(\mathbf{v}) = \frac{Z_{\mathbf{v}}}{Z}, \qquad (11)

where Z_{\mathbf{v}}(\mathbf{b}, c, w) is the conditional partition function (see (9)):

Z_{\mathbf{v}}(\mathbf{b}, c, w) = \mathrm{Tr}_h \, e^{-\hat{H}} = e^{\mathbf{b} \cdot \mathbf{v}} \, \mathrm{Tr}_h \exp\left( \sum_{j,j'=1}^{m} \Big( c_{jj'} + \sum_{i=1}^{n} v_i w_{ijj'} \Big) \hat{h}^{+}_{j} \hat{h}_{j'} \right) = e^{\mathbf{b} \cdot \mathbf{v}} \det\left( 1 + e^{H_h} \right), \qquad (12)

and H_h was introduced in (10). For this purpose we adjust the model parameters \mathbf{b}, c, w so that the log-likelihood of the data set \mathbf{v}_{\rm data} is maximal for the given set of model parameters (see (5)). To maximize the log-likelihood we again use the gradient ascent algorithm:

\Delta \mathbf{b} = \eta \, \partial_{\mathbf{b}} \mathcal{L}(\mathbf{b}, c, w), \qquad (13a)
\Delta c = \eta \, \partial_{c} \mathcal{L}(\mathbf{b}, c, w), \qquad (13b)
\Delta w = \eta \, \partial_{w} \mathcal{L}(\mathbf{b}, c, w). \qquad (13c)

The ascent step for the visible bias \mathbf{b} coincides with that of the RBM (6), since the visible subsystem is postulated to be purely classical. Consider now the ascent step for the weights w:

\partial_{w} \mathcal{L}(\mathbf{b}, c, w) = \frac{1}{N_{\rm data}} \sum_{\mathbf{v}_{\rm data}} \left( \partial_{w} \log Z_{\mathbf{v}_{\rm data}} - \partial_{w} \log Z \right) = \frac{1}{N_{\rm data}} \sum_{\mathbf{v}_{\rm data}} Z^{-1}_{\mathbf{v}_{\rm data}} \partial_{w} Z_{\mathbf{v}_{\rm data}} - Z^{-1} \sum_{\mathbf{v}} \partial_{w} Z_{\mathbf{v}}. \qquad (14)

Let us evaluate the gradient of Z_{\mathbf{v}_{\rm data}} separately. Here and in the following we omit the subscript "data" for simplicity:

\partial_{w} Z_{\mathbf{v}} = e^{\mathbf{b} \cdot \mathbf{v}} \, \partial_{w} \det(1 + e^{H_h}) = e^{\mathbf{b} \cdot \mathbf{v}} \, \mathrm{Tr}\left( \det(1 + e^{H_h}) \cdot (1 + e^{H_h})^{-1} \, \partial_{w} (1 + e^{H_h}) \right) = Z_{\mathbf{v}} \, \mathrm{Tr}\left( p(\hat{h}^{+}\hat{h} \,|\, \mathbf{v}) \, \mathbf{v} \, \hat{h}^{+}\hat{h} \right). \qquad (15)

Substituting this into (14) and assuming that there are N_{\rm data} samples of the visible units in the input dataset, we obtain

\partial_{w} \mathcal{L}(\mathbf{b}, c, w) = \frac{1}{N_{\rm data}} \sum_{\mathbf{v}_{\rm data}} \mathrm{Tr}\left[ p(\hat{h}^{+}\hat{h} \,|\, \mathbf{v}_{\rm data}) \, \mathbf{v}_{\rm data} \, \hat{h}^{+}\hat{h} \right] - \sum_{\mathbf{v}} \mathrm{Tr}\left[ p(\mathbf{v}, \hat{h}^{+}\hat{h}) \, \mathbf{v} \, \hat{h}^{+}\hat{h} \right]. \qquad (16)

For the gradient w.r.t. the hidden-layer bias c one can proceed in the same way to obtain

\partial_{c} \mathcal{L}(\mathbf{b}, c, w) = \frac{1}{N_{\rm data}} \sum_{\mathbf{v}_{\rm data}} \mathrm{Tr}\left[ p(\hat{h}^{+}\hat{h} \,|\, \mathbf{v}_{\rm data}) \, \hat{h}^{+}\hat{h} \right] - \sum_{\mathbf{v}} \mathrm{Tr}\left[ p(\mathbf{v}, \hat{h}^{+}\hat{h}) \, \hat{h}^{+}\hat{h} \right]. \qquad (17)

The updates of the biases \mathbf{b}, c and of the weights w are proportional to the respective gradients, with the proportionality factor \eta being the learning rate.
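These closed expressions are straightforward to evaluate numerically. Continuing the fermionic sketch above, a minimal evaluation of the conditional partition function (12) in log space; the exhaustive sum for the full Z is meant only for small n, and in practice it would be replaced by a sampling-based estimate:

```python
from itertools import product

def log_Z_v(v):
    # Eq. (12): Z_v = exp(b.v) det(1 + exp(H_h));
    # log det(1 + e^{H_h}) = sum_k log(1 + e^{eps_k}) over eigenvalues of H_h
    eps = np.linalg.eigvalsh(H_h(v))
    return v @ b + np.sum(np.logaddexp(0.0, eps))

def log_Z(n):
    # Exhaustive log Z = log sum_v Z_v; feasible only for small n
    logs = [log_Z_v(np.array(bits, dtype=float))
            for bits in product([0, 1], repeat=n)]
    return float(np.logaddexp.reduce(logs))
```

The marginal is then p(v) = exp(log_Z_v(v) - log_Z(n)), from which the log-likelihood (5) and the gradients (16)-(17) follow.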
C. Gibbs sampling

During the training of both the classical RBM and the sqRBM one needs to evaluate the mean values of different quantities, namely the visible units, the hidden units and their combinations. In practice this can be implemented by generating a large (representative) set of samples with some underlying probability distribution, which is not known explicitly, and then taking the mean value over the generated samples. To do so one may apply so-called Gibbs sampling [13], which allows one to produce samples from a joint probability distribution (v, h). Basically it allows us to update the state of the visible layer v for a given hidden layer state h and vice versa. This procedure is applicable and quite efficient because all hidden units are independent of each other, and the same holds for the visible ones.

During the training of an sqRBM, though, we cannot apply Gibbs sampling straightforwardly. This is because of the introduction of an m × m matrix for the hidden layer instead of an m-dimensional vector with mutually independent components. In this case diagonalization fixes the problem: one has to include this step in the usual Gibbs sampling algorithm. The update of the visible layer for a given hidden state stays the same as for the classical RBM.

To sum up, we suggest the following pseudo-code for the numerical realization of the sqRBM [6, 13] (Table I). Here E is the number of gradient ascent steps; K is the number of Gibbs sampling steps; σ_el / σ_m is the element-wise/matrix logistic (sigmoid) function; sample[...] is the stochastic activation of a visible neuron with probability [...]; Δ... is the increment of the corresponding quantity, which is proportional to the r.h.s. with the coefficient of proportionality equal to the learning rate [16].

TABLE I. Pseudo-code of training for the sqRBM.

1. Initialization of the machine:
     W ← gauss(µ = 0, σ ≪ 1)
     b, c ← 0
2. Training procedure:
     For e in (1 ... E) do:
         v^0 ← (sample from the training set)
         (h^+h)^0 ← σ_m(v^0 · W + c)
         For k in (1 ... K) do:
             Diagonalization[(h^+h)^{k-1}]
             v^k ← sample[σ_el((h^+h)^{k-1} · W + b)]
             (h^+h)^k ← σ_m(v^k · W + c)
         ∆W ∼ ⟨v (h^+h)⟩^0 − ⟨v (h^+h)⟩^K
         ∆b ∼ ⟨v⟩^0 − ⟨v⟩^K
         ∆c ∼ ⟨h^+h⟩^0 − ⟨h^+h⟩^K
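A rough Python rendering of one sweep of Table I, continuing the fermionic sketch above, is given below. The contraction used for the effective visible field is our reading of the pseudo-code (the single-particle density matrix contracted with the coupling tensor); it is a sketch under that assumption, not a verified implementation:

```python
def sigma_m(M):
    # Matrix logistic function sigma_m(M) = (1 + exp(-M))^{-1} for Hermitian M,
    # computed via diagonalization (the "Diagonalization" step of Table I)
    eps, U = np.linalg.eigh(M)
    f = 1.0 / (1.0 + np.exp(-eps))
    return (U * f) @ U.conj().T

def sqrbm_gibbs_sweep(v):
    # Hidden update: the single-particle density matrix of Eq. (10)
    G = sigma_m(H_h(v))
    # Visible update: elementwise sigmoid of an effective field obtained by
    # contracting G with the coupling tensor w (our interpretation of Table I)
    field = b + np.real(np.einsum('ijk,kj->i', w, G))
    v_new = (rng.random(v.shape[0]) < 1.0 / (1.0 + np.exp(-field))).astype(float)
    return v_new, G
```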
The exact evaluation of the gradient in (13) can beachieved by the Monte Carlo simulation. The positivepart of the gradient involves the averaging of the single-particle density matrix (10) and its moments over thetraining data points v data . The negative part of the gra-dient involves the averaging of the same quantities over v with the model probability p ( v ). The latter requiresto perform a separate Monte Carlo simulation via theMetropolis algorithm [17] for each gradient ascent stepwhich is impractical. The way out is provided by theobservation that when the learning rate η is small, themodel is changed only slightly, so that we can continuethe Markov Chain from the state at the previous learningstep. Provided that the change of η is sufficiently smallbetween the steps, such a chain will follow close enoughthe probability distribution of the model at the currentset of parameters. In other words, we perform a singleMetropolis simulation of the model. At each Monte Carlostep the negative part of the gradient is approximated byits value for the current configuration of the model. Thepositive part of the gradient is estimated by drawing arandom training data point v data at each Monte Carlostep. Then the model parameters are updated accord-ing to (13). This is called the Persistent Chain (PC)[15]. However the fluctuations of the gradient estimatemay hinder the convergence of the gradient-ascent pro-cedure. In order to reduce these fluctuations, one sim-ulates not one but several independent concurrent Per-sistent Chains, and the increment of model parametersis averaged over the chains. This is called the PersistentContrastive Divergence algorithm. IV. TESTS
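A persistent chain then only needs a Metropolis kernel that reuses the chain state between gradient steps. A minimal sketch, assuming single-bit-flip proposals (the paper does not specify the proposal move) and the log_Z_v helper defined earlier:

```python
def metropolis_step(v_chain):
    # Propose flipping one randomly chosen visible bit; accept with
    # probability min(1, Z_{v_trial} / Z_{v_chain}), since p(v) ∝ Z_v
    v_trial = v_chain.copy()
    i = rng.integers(v_chain.size)
    v_trial[i] = 1.0 - v_trial[i]
    if np.log(rng.random()) < log_Z_v(v_trial) - log_Z_v(v_chain):
        return v_trial
    return v_chain
```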
IV. TESTS

We perform a comparative study of the RBM and the sqRBM by training the machines on two familiar datasets. The first one is the set of 4 × 4 Bars&Stripes patterns (see figure 2a). The second one is OptDigits [12] (see figure 3a), a set of 5620 8 × 8 images of handwritten digits.

A. Training by Contrastive Divergence
The first experiment was conducted in order to compare the performance of the RBM and the sqRBM under the simple Contrastive Divergence algorithm with one Gibbs sampling step (CD-1). The measure of training quality was the log-likelihood of the training data set. For estimating the log-likelihood we used annealed importance sampling (AIS) [18]. During the experiment the learning rate followed the search-then-converge rule [19]: η(n) = η_0/(1 + n/n_0), where η_0 is the starting learning rate, n is the number of the current step, and n_0 is the decrease parameter. For the training on both datasets we used n_0 = 300. In this experiment we trained an RBM with 9 hidden units and an sqRBM with 3 hidden units on the Bars&Stripes set, and an RBM with 36 and an sqRBM with 6 hidden units on the OptDigits set. The results are presented in figures 2 and 3. A sharp peak followed by a slump of the log-likelihood is the typical behaviour of the CD algorithm [13]; increasing the number of Gibbs steps eliminates this slump, as can be seen further below. For both the Bars&Stripes and OptDigits sets the sqRBM demonstrates better results than the RBM.
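For reference, the search-then-converge schedule is a one-liner; in this minimal sketch the default value of eta0 is an illustrative placeholder, not the value used in the experiments:

```python
def eta(step, eta0=0.1, n0=300):
    # Search-then-converge rule [19]: approximately constant for step << n0,
    # decaying like eta0 * n0/step for step >> n0
    return eta0 / (1.0 + step / n0)
```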
B. Training by Persistent Contrastive Divergence

In figure 4 the results of training the RBM and the sqRBM on the 4×4 Bars&Stripes dataset are presented; the procedure is summarized in Table II. They show that making the hidden units quantum leads to a significant improvement of the learning capabilities of the model. Both machines were trained with constant learning rates: η = 0.001 for a number of hidden units ≤ 5, and a different constant rate for the larger machines. N_pcd (the number of concurrent Persistent Chains) is 100. The training dataset was 40000 random instances of the Bars&Stripes.

TABLE II. Pseudo-code of the Persistent Contrastive Divergence training procedure for the sqRBM.

1. Initialization of the machine:
     W ← gauss(µ = 0, σ ≪ 1)
     b, c ← 0
     For i in (1 ... N_pcd) do:
         v[i] ← (vector of random bits)
         P[i] ← p(v[i])
2. Training procedure:
     For e in (1 ... E) do:
         ∆W = 0; ∆b = 0; ∆c = 0
         v_data ← (random instance of the training set)
         (h^+h)_data ← σ_m(v_data · W + c)
         For i in (1 ... N_pcd) do:
             v_trial ← (generate a trial configuration from v^{e-1}[i])
             P_trial ← p(v_trial)
             If v_trial is accepted by the Metropolis rule for P_trial / P^{e-1}[i] then
                 v^e[i] ← v_trial
             Else
                 v^e[i] ← v^{e-1}[i]
             (h^+h)_pcd[i] ← σ_m(v^e[i] · W + c)
             ∆W = ∆W + v_data (h^+h)_data − v^e[i] (h^+h)_pcd[i]
             ∆b = ∆b + v_data − v^e[i]
             ∆c = ∆c + (h^+h)_data − (h^+h)_pcd[i]
         W ← W + η ∆W
         b ← b + η ∆b
         c ← c + η ∆c

FIG. 2. Log-likelihood vs. the number of ascent steps (b) for the training on the set of bars and stripes (a). Training comparison of the RBM and the sqRBM.

FIG. 3. Log-likelihood vs. the number of ascent steps (b) for the training on the OptDigits set (a). Training comparison of the RBM and the sqRBM.

In figure 5 the results of training the RBM and the sqRBM on the Optdigits dataset are presented. In all the cases
η = 0.001 was employed. The minibatch size is 100. Here we also observe an improvement due to the quantumness of the hidden units. The convergence rate of the sqRBM is also faster than that of the RBM.

We conclude with the observation that an sqRBM with N quantum hidden units tends to learn better than an RBM with N² classical hidden units.
FIG. 4. Modelling the 4×4 Bars&Stripes dataset with the RBM and the sqRBM at different numbers of hidden units (log-likelihood of the training data vs. the number of gradient ascent steps; RBM with 3, 5, 10, 30 and 64 hidden units, sqRBM with 3 and 5). The type of model and the number of hidden units are indicated by arrows.
FIG. 5. Modelling the Optdigits dataset with the RBM and the sqRBM at different numbers of hidden units (log-likelihood of the training data vs. the number of gradient ascent steps; RBM with 3, 5, 10, 40, 100 and 200 hidden units, sqRBM with 3, 5, 10 and 15). The type of model and the number of hidden units are indicated by arrows.
FIG. 6. Overfit of the RBM and sqRBM models on the Optdigits dataset (overfit of the training data vs. training progress; RBM with 9 and 36 hidden units, sqRBM with 3 and 6). The "training progress" is the number of gradient ascent steps up to the saturation of the log-likelihood on the training data; the step numbers are normalized to be 1 at the point of saturation of the training log-likelihood.
C. Overfit
The overfit (i.e. the degradation of the predictive capability of the model) may be estimated as the drop of the likelihood when the model is shown data it has not seen before. In figure 6 we present the results of such a calculation. The Optdigits dataset was partitioned into a training subset, chosen as the first 2800 samples, and a validating subset consisting of the remaining samples. The models (RBM and sqRBM) learned on the training subset. The log-likelihood of the training subset was calculated, and then the log-likelihood of the validating subset. The amount by which the log-likelihood of the validating subset decreases is the measure of the overfit: the higher, the worse. In figure 6 we present the overfits of the sqRBM and of the RBM model with a quadratically larger number of hidden units (see the end of the previous section), as a function of the training progress. The latter is defined as follows. As the models are trained, the log-likelihood first grows and then saturates at a certain level (i.e. the model no longer learns from the training dataset). As saturation is achieved, the training is stopped. This corresponds to a certain (maximal) number of gradient ascent steps. For each model, we "normalize" the training progress by dividing the current number of the gradient ascent step by the maximal number before saturation. We see that the sqRBM is slightly better than the RBM.
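The overfit measure itself is just the gap between the two log-likelihoods. A sketch, assuming a hypothetical helper avg_loglik(dataset) that returns the mean log-likelihood of a dataset under the trained model (e.g. estimated via AIS [18]):

```python
def overfit(avg_loglik, train_subset, valid_subset):
    # Drop of the log-likelihood on unseen data; larger means worse overfit
    return avg_loglik(train_subset) - avg_loglik(valid_subset)
```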
V. CONCLUSION
We proposed and examined a semi-quantum version of the Restricted Boltzmann Machine with classical visible units and fermionic hidden units, which we call the semi-quantum RBM (sqRBM). The presented sqRBM model inherits the simple and effective trainability of the classical RBM. In particular, this model can be trained using Gibbs sampling with diagonalization w.r.t. the hidden units included as an additional step.

At the same time, the introduction of quantum hidden degrees of freedom makes the model significantly more flexible than the purely classical one. Moreover, the sqRBM avoids overfitting better than the classical RBM. This was confirmed in numerical experiments with two standard data sets, namely Bars&Stripes and OptDigits. As a performance measure we used the log-likelihood estimated via annealed importance sampling. Future work should investigate whether this success persists for larger data sets.

The distinctive features of our model (a hybrid classical/quantum system and a fermionic quantum part) make it an interesting option for the development of quantum-inspired machine-learning algorithms.

[1] G. W. Taylor, G. E. Hinton, and S. T. Roweis, Modeling human motion using binary latent variables, in
Advances in Neural Information Processing Systems 19 (NIPS 2006) (MIT Press, 2006).
[2] M. Y. Niu, S. Boixo, V. N. Smelyanskiy, and H. Neven, Universal quantum control through deep reinforcement learning, npj Quantum Inf. 5, 33 (2019).
[3] J. Carrasquilla and R. Melko, Machine learning phases of matter, Nature Phys. 13, 431 (2017).
[4] G. Carleo and M. Troyer, Solving the quantum many-body problem with artificial neural networks, Science 355, 602 (2017).
[5] Y. LeCun, Y. Bengio, and G. Hinton, Deep learning, Nature 521, 436 (2015).
[6] G. E. Hinton, A practical guide to training restricted Boltzmann machines, in Neural Networks: Tricks of the Trade, Lecture Notes in Computer Science, Vol. 7700, edited by G. Montavon, G. B. Orr, and K.-R. Müller (Springer, Berlin, Heidelberg, 2012), pp. 599–619.
[7] M. A. Nielsen and I. Chuang,
Quantum Computation and Quantum Information (Cambridge University Press, Cambridge, UK, 2002).
[8] P. Rebentrost, M. Mohseni, and S. Lloyd, Quantum support vector machine for big data classification, Phys. Rev. Lett. 113, 130503 (2014).
[9] P. Rebentrost, T. R. Bromley, C. Weedbrook, and S. Lloyd, Quantum Hopfield neural network, Phys. Rev. A 98, 042308 (2018).
[10] M. H. Amin, E. Andriyash, J. Rolfe, B. Kulchytskyy, and R. Melko, Quantum Boltzmann machine, Phys. Rev. X 8, 021050 (2018).
[11] A. Fischer and C. Igel, Empirical analysis of the divergence of Gibbs sampling based learning algorithms for restricted Boltzmann machines, in Proceedings of Artificial Neural Networks - ICANN 2010 - 20th International Conference (2010), pp. 208–217.
[12] E. Alpaydin and C. Kaynak,
Optical Recognition of Handwritten Digits, UCI Machine Learning Repository (1998).
[13] A. Fischer and C. Igel, An introduction to restricted Boltzmann machines, in Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications (CIARP 2012), Lecture Notes in Computer Science, Vol. 7441 (Springer, 2012), pp. 14–36.
[14] G. Hinton, Training products of experts by minimizing contrastive divergence, Neural Computation 14, 1771 (2002).
[15] T. Tieleman, Training restricted Boltzmann machines using approximations to the likelihood gradient, in Proceedings of the 25th International Conference on Machine Learning (ICML 2008) (ACM, 2008), pp. 1064–1071.