Quantum-assisted associative adversarial network: Applying quantum annealing in deep learning
Max Wilson
Quantum Artificial Intelligence Lab., NASA Ames Research Center, Moffett Field, CA 94035, USA and Quantum Engineering CDT, Bristol University, Bristol, BS8 1TH, UK
Thomas Vandal
NASA Ames Research Center / Bay Area Environmental Research Institute, Moffett Field, CA 94035, USA
Tad Hogg and Eleanor Rieffel
Quantum Artificial Intelligence Lab., NASA Ames Research Center, Moffett Field, CA 94035, USA
(Dated: April 25, 2019)

We present an algorithm for learning a latent variable generative model via generative adversarial learning where the canonical uniform noise input is replaced by samples from a graphical model. This graphical model is learned by a Boltzmann machine which learns a low-dimensional feature representation of data extracted by the discriminator. A quantum annealer, the D-Wave 2000Q, is used to sample from this model. This algorithm joins a growing family of algorithms that use a quantum annealing subroutine in deep learning, and provides a framework to test the advantages of quantum-assisted learning in GANs. Fully connected, symmetric bipartite and Chimera graph topologies are compared on a reduced stochastically binarized MNIST dataset, for both classical and quantum annealing sampling methods. The quantum-assisted associative adversarial network successfully learns a generative model of the MNIST dataset for all topologies, and is also applied to the LSUN dataset bedrooms class for the Chimera topology. Evaluated using the Fréchet inception distance and inception score, the quantum and classical versions of the algorithm are found to have equivalent performance for learning an implicit generative model of the MNIST dataset.
I. INTRODUCTION
The ability to efficiently and accurately model a dataset, even without full knowledge of why a model is the way it is, is a valuable tool for understanding complex systems. Machine Learning (ML), the field of data analysis algorithms that create models of data, is experiencing a renaissance due to the availability of data, increased computational resources and algorithm innovations, notably in deep neural networks [1, 2]. Of particular interest are unsupervised algorithms that train generative models. These models are useful because they can be used to generate new examples representative of a dataset.

A Generative Adversarial Network (GAN) is an algorithm which trains a latent variable generative model with a range of applications including image or signal synthesis, classification and image resolution. The algorithm has been demonstrated in a range of architectures, now well over 300 types and applications, from the GAN zoo [3-5]. Two problems in GAN learning are non-convergence, oscillating and unstable parameters in the model, and mode collapse, where the generator only provides a small variety of possible samples. These problems have been addressed previously in existing work including energy based GANs [6] and the Wasserstein GAN [7, 8]. Another proposed solution involves replacing the canonical uniform noise prior of a GAN with a prior distribution modelling a low-dimensional feature representation of the dataset. Using this informed prior may alleviate the learning task of the generative network, decrease mode collapse and encourage convergence [9].

This feature distribution is a rich and low-dimensional representation of the dataset extracted by the discriminator in a GAN. A generative probabilistic graphical model can learn this feature distribution. However, given the intractability of calculating the exact distribution of the model, classical techniques often use approximate methods for sampling from restricted topologies, such as contrastive divergence, to train and sample from these models. Quantum annealing, a quantum optimisation algorithm, has been shown to sample from a Boltzmann-like distribution on near-term hardware [10, 11], which can be used in the training of these types of models. In the future, quantum annealing may decrease the cost of this training by decreasing the computation time [12] or energy usage [13], or improve performance, as quantum models [14] may better represent some datasets.

Here, we demonstrate the Quantum-Assisted Associative Adversarial Network (QAAAN) algorithm, Figure 1, a hybrid quantum-assisted GAN in which a Boltzmann Machine (BM) trains, using samples from a quantum annealer, a model of a low-dimensional feature distribution of the dataset as the prior to a generator. The model learned by the algorithm is a latent variable implicit generative model p(x|z) and an informed prior p(z), where z are latent variables and x are data space variables. The prior will contain useful information about the features of the data distribution and this information will not need to be learned by the generator. Put another way, the prior will be a model of the feature distribution containing the latent variable modes of the dataset.

Contributions
The core contribution of this work is the development of a scalable quantum-assisted GAN which trains an implicit latent variable generative model. This algorithm fulfills the criteria for inclusion of near-term quantum annealing hardware in deep learning frameworks that can learn continuous variable datasets: resistance to noise, a small number of variables, and a hybrid architecture. Additionally, in this work we explore different topologies for the latent space model. The purpose of the work is to

• compare different topologies to appropriately choose a graphical model, restricted by the connectivity of the quantum hardware, to integrate with the deep learning framework,

• design a framework for using sampling from a quantum annealer in generative adversarial networks, which may lead to architectures that encourage convergence and decrease mode collapse.

Outline
First, there is a short section on the background of GANs, quantum annealing and Boltzmann machines. In Section III an algorithm is developed to learn a latent variable generative model using samples from a quantum annealer to replace the canonical uniform noise input. We explore different models, specifically complete, symmetric bipartite and Chimera topologies, tested on a reduced stochastically binarized version of MNIST, for use in the latent space. In Section IV the results are detailed, including application of the QAAAN and a classical version of the algorithm to the MNIST dataset. The architectures are evaluated using the inception score and the Fréchet inception distance. The algorithm is also implemented on the LSUN bedrooms dataset using classical sampling methods, demonstrating the scalability.
II. BACKGROUND

Generative Adversarial Networks
Implicit generative models are those which specify a stochastic procedure with which to generate data. In the case of a GAN, the generative network maps latent variables z to images which are likely under the real data distribution, for example x = G(z), where G is the function represented by a neural network, x is the resulting image, z ∼ q(z), and q(z) is typically the uniform distribution between 0 and 1, U[0, 1]. The generator is trained against a discriminator network D; the discriminator tries to maximise an objective that the generator tries to minimise. The cost function of this minimax game is

V(D, G) = E_{x∼p(x)}[log(D(x))] + E_{z∼q(z)}[log(1 − D(G(z)))].   (1)

E_{x∼p(x)} is the expectation over the distribution of the dataset, E_{z∼q(z)} is the expectation over the latent variable distribution, and D and G are functions instantiated by a discriminative and a generative neural network, respectively; we are trying to find min_G max_D V(D, G). The model learned is a latent variable generative model P_model(x|z).

The first term in Equation 1 is the log-probability of the discriminator predicting that the real data is genuine and the second is the log-probability of it predicting that the generated data is fake. In practice, ML engineers will instead use a heuristic maximising the likelihood that the generator network produces data that trick the discriminator, instead of minimising the probability that the discriminator labels them as fake. This has the effect of stronger gradients earlier in training [15].

GANs are lauded for many reasons: the algorithm is unsupervised; the adversarial training does not require direct replication of the real dataset, resulting in samples that are sharp [16]; and it is possible to perform the weight updates through efficient backpropagation and stochastic gradient descent. There are also several known disadvantages. Primarily, the learned distribution is implicit. It is not straightforward to compute the distribution of the training set [17], unlike explicit, or prescribed, generative models which provide a parametric specification of the distribution, specifying a log-likelihood log P(x) that some observed variable x is from that distribution.

FIG. 1. The inputs to the generator network are samples from a Boltzmann distribution. A BM trains a model of the feature space in the generator network, indicated by 'Learning'. Samples from the quantum annealer, the D-Wave 2000Q, are used in the training process for the BM, and replace the canonical uniform noise input to the generator network. These discrete variables z are reparametrised to continuous variables ζ before being processed by transposed convolutional layers. Generated and real data are passed into the convolutional layers of the discriminator, which extracts a low-dimensional representation of the data. The BM learns a model of this representation. An example flow of information through the network is highlighted in green. In the classical version of this algorithm, MCMC sampling is used to sample from the discrete latent space; otherwise the architectures are identical.

FIG. 2. Bedrooms from the LSUN dataset generated with an associative adversarial network, with a fully connected latent space sampled via MCMC sampling.
This means that simple GAN implementations are limited to generation. Further, as outlined in the introduction, the training is prone to non-convergence [18] and mode collapse [19]. The stability of GAN training is an issue and there are many hacks to encourage convergence, discourage mode collapse and increase sample diversity, including using a spherical input space [20], adding noise to the real and generated samples [7], and minibatch discrimination [21]. We hypothesise that using an informed prior will decrease mode collapse and encourage convergence.
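To make Equation 1 and the non-saturating heuristic concrete, the sketch below shows both losses in PyTorch; the framework choice and function names are ours for illustration and are not taken from the paper.

```python
import torch
import torch.nn.functional as F

def discriminator_loss(real_logits, fake_logits):
    # Minimising this loss is equivalent to maximising
    # E[log D(x)] + E[log(1 - D(G(z)))] in Equation 1.
    real = F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits))
    fake = F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits))
    return real + fake

def generator_loss(fake_logits):
    # Non-saturating heuristic: maximise log D(G(z)) rather than
    # minimise log(1 - D(G(z))), giving stronger gradients early in training.
    return F.binary_cross_entropy_with_logits(fake_logits, torch.ones_like(fake_logits))
```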
Boltzmann Machines & Quantum Annealing

A BM is an energy-based graphical model composed of stochastic nodes, with weighted connections between, and biases applied to, the nodes. The energy of the network corresponds to the energy function applied to the state of the system. BMs represent multimodal and intractable distributions [22], and the internal representation of the BM, the weights and biases, can learn a generative model of a distribution [23]. A graph G = (V, E) with cardinality N describing a Boltzmann machine with model parameters λ = {ω, b} over logical variables V = {z_1, z_2, ..., z_N} connected by edges E has energy

E_λ(z) = − Σ_{z_i ∈ V} b_i z_i − Σ_{(z_i, z_j) ∈ E} ω_ij z_i z_j,   (2)

where weight ω_ij is assigned to the edge connecting variables z_i and z_j, bias b_i is assigned to variable z_i, and the possible states of the variables are z_i ∈ {−1, +1}, corresponding to 'off' and 'on', respectively. We refer to this graph as the logical graph. The distribution of the states z is

P(z) = e^{−β E_λ(z)} / Z,   (3)

with β a parameter recognized by physicists as the inverse temperature in the function defining the Boltzmann distribution.

BM training requires sampling from the distribution represented by the energy function. For fully connected variants it is an intractable problem to calculate the probability of a state occurring exactly [24], and it is computationally expensive to approximate. Exact inference of complete graph BMs is generally intractable and approximate methods including Gibbs sampling are slow. Generally, applications will use deep stacked Restricted Boltzmann Machine (RBM) architectures, which can be efficiently trained with approximate methods.

An RBM is a symmetric bipartite BM. It is possible to efficiently learn the distribution of some input data spaces through approximate methods, notably contrastive divergence [25]. Stacked RBMs form a Deep Belief Net (DBN) and can be greedily trained to learn a generative model of datasets with higher-level features, with applications in a wide range of fields from image recognition to finance [26]. Training these types of models requires sampling from the Boltzmann distribution.

Quantum Annealing (QA) has been proposed as a method for sampling from complex Boltzmann-like distributions. It is an optimisation algorithm exploiting quantum phenomena to find the ground state of a cost function. QA has been demonstrated for a range of optimisation problems [27]; however, defining and detecting speedup, especially in small and noisy hardware implementations, is challenging [28, 29]. QA has been proposed and in some cases demonstrated as a sampling subroutine in ML algorithms: a quantum Boltzmann machine [11]; training a Quantum Variational Autoencoder (QVAE) [30]; a quantum-assisted Helmholtz machine [31]; deep belief nets of stacked RBMs [32].

In order to achieve this, the framework outlined in Equation 2 can be mapped to an Ising model for a quantum system represented by the Hamiltonian

Ĥ_λ = − Σ_{i ∈ V} h_i σ̂_i^z − Σ_{(i, j) ∈ E} J_ij σ̂_i^z σ̂_j^z,   (4)

where now the variables z have been replaced by the Pauli-z operators, σ̂_i^z, which return eigenvalues in the set {−1, +1} when applied to the state of variable z_i, physically corresponding to spin-up and spin-down, respectively.

FIG. 3. (a) Complete, (b) Chimera, and (c) symmetric bipartite graphical models. These graphical models are embedded into the hardware and the nodes in these graphs are not necessarily representative of the embeddings.
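As a small illustration of Equations 2 and 3, the sketch below enumerates the exact Boltzmann distribution of a toy Boltzmann machine; this is only feasible for a handful of variables, which is precisely why approximate sampling (or a quantum annealer) is needed at scale.

```python
import itertools
import numpy as np

def energy(z, b, W):
    # E(z) = -sum_i b_i z_i - sum_{i<j} W_ij z_i z_j, with z_i in {-1, +1}.
    # W is symmetric with zero diagonal, so the quadratic term is halved.
    return -b @ z - 0.5 * z @ W @ z

def boltzmann_distribution(b, W, beta=1.0):
    # Exact enumeration of P(z) = exp(-beta * E(z)) / Z over all 2^N states.
    n = len(b)
    states = np.array(list(itertools.product([-1, 1], repeat=n)))
    energies = np.array([energy(z, b, W) for z in states])
    weights = np.exp(-beta * energies)
    return states, weights / weights.sum()
```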
Parameters b_i and ω_ij are replaced with the Ising model parameters h_i and J_ij, which are conceptually equivalent. In the hardware, these parameters are referred to as the flux bias and the coupling strength, respectively.

The full Hamiltonian describing the dynamics of the D-Wave 2000Q, equivalent to the time-dependent transverse field Ising model, is

Ĥ(t) = A(t) Ĥ_⊥ + B(t) Ĥ_λ.   (5)

The transverse field term Ĥ_⊥ is

Ĥ_⊥ = Σ_{i ∈ V} σ̂_i^x.   (6)

The σ̂^x are the Pauli-x operators acting on the Hilbert space of the N qubits. A(t) and B(t) are monotonic functions defined by the total annealing time t_max [27]. Generally, at the start of an anneal, A(0) ≈ 1 and B(0) ≈ 0; A(t) decreases and B(t) increases monotonically with t until, at the end of the anneal, A(t_max) ≈ 0 and B(t_max) ≈ 1. When A(t) > 0, the Hamiltonian contains terms that are not possible in the classical Ising model, that is, those that are normalised linear combinations of classical states.

This Hamiltonian was embedded in the D-Wave 2000Q, a system with 2048 qubits, each with degree 6. Embedding is the process of mapping the logical graph, represented by Equation 4, to hardware. If the logical graph has degree greater than 6, a logical variable z_i is represented by more than one qubit. These qubits are arranged in a 'chain' (this term is used even when the set of qubits forms a small tree). A chain is formed by setting the coupling strength J_ij between these qubits to a strong value to encourage them to take a single value by the end of the anneal, but not so strong that it overwhelms the J_ij and h_i in the original problem Hamiltonian or has a detrimental effect on the dynamics. There is a sweet spot for this value. In our case, we used the maximum value available on the D-Wave 2000Q, namely −1.
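For orientation, a hedged sketch of how such an embedded problem is typically submitted using the D-Wave Ocean SDK (minorminer and dwave-system); these interfaces are assumptions about the tooling, not the code used in this work, and a configured D-Wave API token is required.

```python
import minorminer
import networkx as nx
from dwave.system import DWaveSampler, FixedEmbeddingComposite

logical = nx.complete_graph(12)                       # logical problem graph
h = {v: 0.0 for v in logical.nodes}                   # flux biases
J = {(u, v): 0.5 for u, v in logical.edges}           # coupling strengths

qpu = DWaveSampler()                                  # hardware target (Chimera graph)
embedding = minorminer.find_embedding(logical.edges, qpu.edgelist)
sampler = FixedEmbeddingComposite(qpu, embedding)

# chain_strength=1.0 requests the strongest ferromagnetic chain coupling,
# intended to correspond to the J = -1 chain couplers described above.
sampleset = sampler.sample_ising(h, J, num_reads=1000, chain_strength=1.0)
```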
At the end of the anneal, to determine the value of a logical variable expressed as a qubit chain in the hardware, a majority vote is performed: the logical variable takes the value corresponding to the state of the majority of qubits. If there is no majority, a coin is flipped to determine the value of the logical variable.
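A minimal sketch of this chain-decoding rule, assuming a sample is a mapping from hardware qubits to ±1 values and a chain is the list of qubits representing one logical variable:

```python
import numpy as np

def unembed_majority_vote(sample, chains, rng=np.random.default_rng()):
    """Decode a hardware sample into logical variables by majority vote."""
    logical = {}
    for var, qubits in chains.items():
        total = sum(sample[q] for q in qubits)
        if total > 0:
            logical[var] = +1
        elif total < 0:
            logical[var] = -1
        else:
            logical[var] = int(rng.choice([-1, +1]))  # no majority: flip a coin
    return logical
```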
Each state found after an anneal comes from a distribution, though it is not clear what distribution the quantum annealer is sampling from. For example, in problem instances with a well defined freeze-out region, the distribution is hypothesised to follow a quantum Boltzmann distribution up to the freeze-out region, where the dynamics of the system slow down and diverge [33]. If the freeze-out region is narrow then the distribution can be modelled as the classical distribution of the problem Hamiltonian, Ĥ_λ, at s(t*) = s*, at a higher, unknown effective temperature,

ρ = e^{−β Ĥ_λ(t*)} / Z   (7)

where Z = Tr[e^{−β Ĥ_λ(t*)}] and we have performed matrix exponentiation. In the case where s* = 0 the Hamiltonian contains no off-diagonal terms and Equation 7 is equivalent to the classical Boltzmann distribution, Equation 3, at some temperature. β is a dimensionless parameter which depends on the temperature of the system, the energy scale of the superconducting flux qubits and open system quantum dynamics. However, it is an open question as to when the freeze-out hypothesis holds.

Other implementations of training graphical models have accounted for this instance-dependent effective temperature [34]. In this work, to get around the problem of the unknown effective temperature when training a probabilistic graphical model, we use a gray-box model approach proposed in [10]. In this approach, full knowledge of the effective parameters, dependent on β, is not needed to perform the weight updates, as long as the projection of the gradient is positive in the direction of the true gradient. The gray-box approach ties the model generated to the specific device used to train the model, though it is robust to noise and does not require estimating β [35] for the purposes of Equations 10 and 11. We find that under this approach performance remains good enough for deep learning applications.

Though we do not have full knowledge of the distribution the quantum annealer samples from, we have modelled it as a classical Boltzmann distribution at an unknown temperature. This allows us to train models without having to estimate the temperature of the system, providing a simple approach to integrating probabilistic graphical models into deep learning frameworks.

Algorithm I. Quantum-assisted associative adversarial network training.
1:  for epochs do
2:      Sample m Boltzmann distribution samples from ρ, ρ → φ = {φ_1, φ_2, ..., φ_m}, using the quantum annealer
3:      Sample n examples φ → φ^D and map to logical space φ ↦ z^D
4:      Sample n examples φ → φ^B
5:      Sample n examples φ → φ^G and map to logical space φ ↦ z^G
6:      Sample n training data examples x = {x_1, x_2, ..., x_n}
7:      Generate x^D = G(z^D)
8:      θ_D ← θ_D + ∇_{θ_D} (1/n) Σ_i [log D(x_i) + log(1 − D(x_i^D))]
9:      Generate z^f = D(x)
10:     Update weights of the BM via SGD with z^f and φ^B
11:     Generate x^G = G(z^G)
12:     θ_G ← θ_G − ∇_{θ_G} (1/n) Σ_i log(1 − D(x_i^G))
13: end for
14: return network G(z; θ_G)

FIG. 4. QAAAN training algorithm. ρ represents the distribution given by the quantum annealer from sampling, therefore ρ → φ represents sampling a set of vectors φ from distribution ρ. Steps 3-5 are indicative of the real-world implementation of these devices: in order to reduce sampling time we sampled from the device once and used this set for different tasks, φ^D for generating samples to train the discriminator, φ^B for training the BM, and φ^G for generating samples to train the generator. Further details on mapping to the logical space for samples from the quantum annealer can be found in Section III. x is the MNIST dataset. Steps 8 and 12 are typical of a GAN implementation; G and D are the action of the generator and discriminator network, respectively.

III. QUANTUM-ASSISTED ASSOCIATIVE ADVERSARIAL NETWORK
In this section, the QAAAN algorithm is outlined, including a novel way to learn the feature distribution generated by the discriminator network via a BM using sampling from a quantum annealer. The QAAAN architecture is similar to the classical associative adversarial network proposed in Ref. [9]; as such, the minimax game played by the QAAAN is

V(D, G, ρ) = E_{x∼p_data(x)}[log D(x)] + E_{z∼ρ(z)}[log(1 − D(G(z)))] + E_{f∼ρ_f(f)}[log ρ],   (8)

where the aim is now to find min_G max_ρ max_D V(D, G, ρ), with equivalent terms to Equation 1 plus an additional term to describe the optimisation of the model ρ, Equation 7. This term conceptually represents the probability that samples generated by the model ρ are from the feature distribution ρ_f. ρ_f is the feature distribution extracted from the interim layer of the discriminator. This distribution is assumed to be Boltzmann, a common technique for modelling a complex distribution.

The algorithm used for training ρ, a probabilistic graphical model, is a BM. Sampling from the quantum annealer, the D-Wave 2000Q, replaces a classical sampling subroutine in the BM. ρ is used in the latent space of the generator, Figure 1, and samples from this model, also generated by the quantum annealer, replace the canonical uniform noise input to the generator network. Samples from ρ are restricted to discrete values, as the measured values of qubits are z ∈ {−1, +1}. These discrete variables z are reparametrised to continuous variables ζ before being processed by the layers of the generator network, producing 'generated' data. Generated and real data are then passed into the layers of the discriminator, which extracts the low-dimensional feature distribution ρ_f. This is akin to a variational autoencoder, where an approximate posterior maps the evidence distribution to latent variables which capture features of the distribution [36]. The algorithm for training the complete network is detailed in Algorithm I.

Below, we outline the details of the BM training in the latent space, reparametrisation of discrete variables, and the networks used in this investigation. Additionally, we detail an experiment to distinguish the performance of three different topologies of probabilistic graphical models to be used in the latent space.
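To make the control flow of Algorithm I concrete, the Python skeleton below mirrors its steps. Every helper (sample_annealer, map_to_logical, reparametrise, sample_minibatch, update_discriminator, discriminator_features, update_bm, update_generator) is a hypothetical stand-in introduced for illustration, not an interface defined by the paper.

```python
def train_qaaan(epochs, m, n, dataset, G, D, bm):
    # Assumes m >= 3 * n so a single annealer read can be split three ways.
    for _ in range(epochs):
        phi = sample_annealer(bm, num_reads=m)                         # step 2
        phi_D, phi_B, phi_G = phi[:n], phi[n:2 * n], phi[2 * n:3 * n]  # steps 3-5
        z_D = reparametrise(map_to_logical(phi_D))
        z_G = reparametrise(map_to_logical(phi_G))
        x = sample_minibatch(dataset, n)                               # step 6
        x_D = G(z_D)                                                   # step 7
        update_discriminator(D, x, x_D)                                # step 8
        z_f = discriminator_features(D, x)                             # step 9
        update_bm(bm, z_f, phi_B)                                      # step 10, Eqs. (10)-(11)
        x_G = G(z_G)                                                   # step 11
        update_generator(G, D, x_G)                                    # step 12
    return G                                                           # step 14
```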
Latent space

As in Figure 1, samples from an intermediate layer of the discriminator network are used to train a model for the latent space of the generator network. Here, a BM trains this model. The cost function of this BM is the quantum relative entropy

S(ρ ‖ ρ_f) = Tr[ρ ln ρ] − Tr[ρ ln ρ_f],   (9)

equivalent to the classical Kullback-Leibler divergence when all off-diagonal elements of ρ and ρ_f are zero. This metric measures the divergence of distribution ρ from ρ_f, where ρ_f is the target distribution of features extracted by the discriminator network and ρ is the model trained by the BM, from Equation 8. Though the distributions used here are modelled classically, this framework can be extended to quantum models using the quantum relative entropy. Given this, it can be shown that the updates to the weights and biases of the model are

ΔJ_ij = ηβ [⟨z_i z_j⟩_{ρ_f} − ⟨z_i z_j⟩_ρ]   (10)
Δh_i = ηβ [⟨z_i⟩_{ρ_f} − ⟨z_i⟩_ρ].   (11)

η is the learning rate, β is an unknown parameter, and ⟨z⟩_ρ is the expectation value of z in distribution ρ. z are the logical variables of the graphical model and the expectation values ⟨z⟩_ρ are estimated by averaging 1000 samples from the quantum annealer. The quantum relative entropy is minimised by stochastic gradient descent.
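A minimal NumPy sketch of one gray-box gradient step implementing Equations 10 and 11, with the unknown effective β absorbed into the learning rate; the array layout is an assumption for illustration.

```python
import numpy as np

def bm_update(J, h, data_samples, model_samples, eta=0.03):
    """One stochastic gradient step for Eqs. (10) and (11).

    data_samples:  (N_d, n) array of +-1 feature vectors z^f from the discriminator.
    model_samples: (N_m, n) array of +-1 samples drawn from the annealer (or MCMC).
    """
    data_corr = data_samples.T @ data_samples / len(data_samples)        # <z_i z_j> under rho_f
    model_corr = model_samples.T @ model_samples / len(model_samples)    # <z_i z_j> under rho
    J += eta * (data_corr - model_corr)                                  # Delta J_ij
    h += eta * (data_samples.mean(axis=0) - model_samples.mean(axis=0))  # Delta h_i
    return J, h
```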
Topologies

We explored three different topologies of probabilistic graphical models, complete, symmetric bipartite and Chimera, for the latent space. Their performance in learning a model of a reduced stochastically binarized version of MNIST was compared, in both the classical sampling, Figure 9, and the sampling via quantum annealing, Figure 8, cases. The complete topology is self-explanatory, Figure 3a; restricted refers to a symmetric bipartite graph, Figure 3c; and the sparse graph is the graph native to the D-Wave 2000Q, or Chimera graph, where the connectivity of the model is determined by the available connections on the hardware, Figure 3b.

FIG. 5. Left to right: 28x28 continuous, 6x6 continuous, and 6x6 stochastically binarized examples from the MNIST dataset.

The models were trained by minimising the quantum relative entropy, Equation 9, and evaluated with the L1-norm,

L1-norm = Σ_{z_i, z_j ∈ V} |⟨z_i z_j⟩_{ρ_f} − ⟨z_i z_j⟩_ρ|.   (12)

The algorithm did not include temperature estimation, or methods to adjust intra-chain coupling strengths for the embedding, as in [34] and [10], respectively. The method used here makes a comparison between the different topologies, though for best performance one would want to account for the embedding and adjust algorithm parameters, such as the learning rate, to each topology.

In addition to these requirements, there are several non-functioning, 'dead', qubits and couplers in the hardware. These qubits or couplers were removed in all embeddings, which had a negligible effect on the final performance. The complete topology embedding was found using a heuristic embedder. A better choice would be a deterministic embedder, resulting in shorter chain lengths, though when adjusting for the dead qubits the symmetries are broken and the embedded graph chain length increases to be comparable to that returned by the heuristic embedder. The restricted topology was implemented using the method detailed by Adachi and Henderson [32]. The Chimera topology was implemented on a 2x2 grid of unit cells, avoiding dead qubits. Learning was run over 5 different embeddings for each topology and the results averaged. For topologies requiring chains of qubits, the couplers in the chains were set to −1.
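A short sketch of the evaluation metric of Equation 12 computed from empirical correlations; summing over the full matrix counts each pair twice and the diagonal contributes zero, which only rescales the comparison.

```python
import numpy as np

def l1_norm_metric(data_samples, model_samples):
    # Eq. (12): sum of absolute differences between pairwise correlations.
    c_data = data_samples.T @ data_samples / len(data_samples)
    c_model = model_samples.T @ model_samples / len(model_samples)
    return float(np.abs(c_data - c_model).sum())
```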
Reparametrisation
Samples from the latent space come from a discrete space. These variables are reparametrised to a continuous space, using standard techniques. There are many potential choices of reparametrisation function and a simple example case is outlined below. We chose a probability density function p(x) which rises exponentially and can be scaled by a parameter α:

p(x) = α exp(−α(1 − x)) / (1 − exp(−α)).   (13)

The cumulative distribution function of this probability density function is

F(z) = ∫_{−1}^{z} p(x) dx   for −1 < z ≤ 1,   and 0 otherwise,

and

∫_{−1}^{r} p(x) dx = [exp(−α(1 − r)) − exp(−α)] / [1 − exp(−α)].   (14)

FIG. 6. The probability density function, p(x), for different values of α. In this investigation α = 4 was used, to distinguish strongly from the uniform noise case.

Discrete samples can be reparametrised by sampling r from U(−1, 1] and inputting it into Equation 14. The value of α was set to 4.
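The exact mapping from a discrete sample z ∈ {−1, +1} to a continuous ζ is not spelled out above, so the sketch below is one plausible reading: draw the magnitude by inverting the cumulative distribution of Equation 13 and take the sign from z. The function names and the sign convention are assumptions for illustration only.

```python
import numpy as np

def inverse_cdf(u, alpha=4.0):
    # Inverts F(r) = (exp(-alpha*(1 - r)) - exp(-alpha)) / (1 - exp(-alpha)), r in (0, 1].
    return 1.0 + np.log(u * (1.0 - np.exp(-alpha)) + np.exp(-alpha)) / alpha

def reparametrise(z, alpha=4.0, rng=np.random.default_rng()):
    # Map discrete +-1 samples z to continuous zeta in [-1, 1] (assumed mapping):
    # magnitude drawn from p(x) of Eq. (13), sign taken from z.
    u = rng.uniform(0.0, 1.0, size=np.shape(z))
    return np.asarray(z) * inverse_cdf(u, alpha)
```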
Networks
The generator network consists of dense and transposed convolutional layers, stride 2 and kernel size 4, with batch normalisation and ReLU activations. The output layer has a tanh activation. These components are standard deep learning techniques found in textbooks, for example [37].

The discriminator network consists of dense and convolutional layers, stride 2 and kernel size 4, with LeakyReLU activations. The dense layer corresponding to the feature distribution was chosen to have tanh activations so that outputs could map to the BM. The hidden layer representing ρ_f was the fourth layer of the discriminator network, with 100 nodes. When sampling the training data for the BM from the discriminator, the variables were given values from the set {−1, +1}, as in the Ising model, dependent on the activation of the node being greater or less than the threshold, set at zero, respectively.

The networks were trained with an Adam optimiser with learning rate 0.0002 and the labels were smoothed with noise. For the sparse graph latent space used in learning the MNIST dataset in Section IV, the BM was embedded in the D-Wave hardware using a heuristic embedder. As there is a 1-1 mapping for the sparse graph, it was expressed in hardware using 100 qubits. An annealing schedule of 1 µs and a learning rate of 0.0002 were used. The classical architecture that was compared with the QAAAN was identical other than replacing sampling via quantum annealing with MCMC sampling techniques.
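For concreteness, a minimal PyTorch sketch of generator and discriminator blocks consistent with this description (stride-2, kernel-4 convolutions, batch normalisation, ReLU/LeakyReLU activations, a tanh output and a 100-unit tanh feature layer). The number of layers and the channel widths are not given in the paper, so those below are assumptions.

```python
import torch.nn as nn

class Generator(nn.Module):
    def __init__(self, latent_dim=100):
        super().__init__()
        self.fc = nn.Linear(latent_dim, 128 * 7 * 7)
        self.deconv = nn.Sequential(
            nn.BatchNorm2d(128), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1),  # 7x7 -> 14x14
            nn.BatchNorm2d(64), nn.ReLU(),
            nn.ConvTranspose2d(64, 1, kernel_size=4, stride=2, padding=1),    # 14x14 -> 28x28
            nn.Tanh(),
        )

    def forward(self, zeta):
        return self.deconv(self.fc(zeta).view(-1, 128, 7, 7))


class Discriminator(nn.Module):
    def __init__(self, feature_dim=100):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 64, kernel_size=4, stride=2, padding=1), nn.LeakyReLU(0.2),    # 28 -> 14
            nn.Conv2d(64, 128, kernel_size=4, stride=2, padding=1), nn.LeakyReLU(0.2),  # 14 -> 7
        )
        # Feature layer representing rho_f; thresholded at zero to produce +-1 inputs for the BM.
        self.features = nn.Sequential(nn.Flatten(), nn.Linear(128 * 7 * 7, feature_dim), nn.Tanh())
        self.classify = nn.Linear(feature_dim, 1)  # real/fake logit

    def forward(self, x):
        f = self.features(self.conv(x))
        return self.classify(f), f
```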
IV. RESULTS & DISCUSSION

For this work we performed several experiments. First, we compared three topologies of graphical models, trained using both classical and quantum annealing sampling methods. They were evaluated for performance by measuring the L1-norm over the course of learning a reduced stochastically binarized version of the MNIST dataset, Figure 5. Second, the QAAAN and the classical associative adversarial network described in Section III were both used to generate new examples of the MNIST dataset. Their performance was evaluated using the inception score and the Fréchet Inception Distance (FID). Finally, the classical associative adversarial network was used to generate new examples of the LSUN bedrooms dataset.

In the experiment comparing topologies, as expected, the BM trains a better model faster with higher connectivity, Figure 9. When trained via sampling with the quantum annealer the picture is less intuitive, Figure 8. All topologies learned a model to the same accuracy, at similar rates. This indicates that there is a noise floor preventing the learning of a better model in the more complex graphical topologies. For the purposes of this investigation the performance of the sparse graph was demonstrated to be enough to learn an informed prior for use in the QAAAN algorithm.

Second, for the classical associative adversarial network, all topologies were implemented, and the quantum-assisted algorithm was implemented with a sparse topology latent space. The generated images for sparse topology latent spaces are shown for both classical and quantum versions in Figures 7a and 7b.

We evaluated classical and quantum-assisted versions of the associative adversarial network with sparse latent spaces via two metrics, the inception score and the FID. Both metrics required an inception network, a network trained to classify images from the MNIST dataset to high accuracy. The inception score rests on two criteria. First, p(y|x) should be dominated by one value of y, indicating a high probability that an image is representative of a class. Secondly, over the whole set there should be a uniform distribution of classes, indicating diversity of the distribution. This is expressed as

IS = exp( E_{x∼ρ_D}[ D_KL( p(y|x) ‖ p(y) ) ] ).   (15)

The first criterion is satisfied by requiring that image-wise class distributions should have low entropy. The second criterion implies that the entropy of the overall distribution should be high. The method is to calculate the KL distance between these two distributions: a high value indicates that both p(y|x) is distributed over one class and p(y) is distributed over many classes. When averaged over all samples this score gives a good indication of the performance of the network. The inception scores of the classical and quantum-assisted versions were found to be equivalent.

The FID measures the similarity between features extracted by an inception network from the dataset X and the generated data G. The distributions of the features are modelled as multivariate Gaussians. Lower FID values mean the features extracted from the generated images are closer to those for the real images. In Equation 16, µ are the means of the activations of an interim layer of the inception network and Σ are the covariance matrices of these activations. The classical and quantum-assisted algorithms scored ∼29 and ∼23, respectively.

FID(X, G) = ‖µ_X − µ_G‖² + Tr( Σ_X + Σ_G − 2 (Σ_X Σ_G)^{1/2} )   (16)

The classical implementation was also used to generate images mimicking the LSUN bedrooms dataset, Figure 2. This final experiment was only performed as a demonstration of scalability, and no metrics were used to evaluate performance.
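For reference, minimal NumPy/SciPy sketches of the two metrics as defined in Equations 15 and 16; the class-probability and feature arrays are assumed to come from the MNIST-trained inception network described above.

```python
import numpy as np
from scipy import linalg

def inception_score(p_yx, eps=1e-12):
    # Eq. (15): exponential of the mean KL divergence between p(y|x) and p(y).
    p_y = p_yx.mean(axis=0, keepdims=True)
    kl = np.sum(p_yx * (np.log(p_yx + eps) - np.log(p_y + eps)), axis=1)
    return float(np.exp(kl.mean()))

def frechet_inception_distance(feats_real, feats_gen):
    # Eq. (16): distance between Gaussian fits to real and generated feature activations.
    mu_x, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    sigma_x = np.cov(feats_real, rowvar=False)
    sigma_g = np.cov(feats_gen, rowvar=False)
    covmean = linalg.sqrtm(sigma_x @ sigma_g)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # drop small imaginary parts from numerical error
    return float(np.sum((mu_x - mu_g) ** 2) + np.trace(sigma_x + sigma_g - 2.0 * covmean))
```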
Discussion

Though it is trivial to demonstrate a correlation between the connectivity of a graphical model and the quality of the learned model, Figure 9, it is not immediately clear that the benefits of increasing the complexity of the latent space can be detected easily in deep learning frameworks, such as the quantum-assisted Helmholtz machine [31] and those looking to exploit quantum models [30]. The effect of the complexity of the latent space model on the quality of the final latent variable generative model was not apparent in our investigations. Deep learning frameworks looking to exploit quantum hardware supported training in the latent spaces need to truly benefit from this application, and not iron out any potential gains with backpropagation. For example, if exploiting a quantum model gives improved performance on some small test problem, it is an open question as to whether this improvement will be detected when integrated into a deep learning framework, such as the architecture presented here.

Here, given the nature of the demonstration and a desire to avoid chaining, we use a sparse connectivity model. Avoiding chaining allows larger models to be embedded into near-term quantum hardware. Given the O(n) scaling of qubits to logical variables for a complete logical graph [38], future applications of sampling via quantum annealing will likely exploit restricted graphical models. Though the size of near-term quantum annealers has followed a Moore's law trajectory, doubling in size every two years, it is not clear what size of probabilistic graphical models will find mainstream usage in machine learning applications, and exploring the uses of different models will be an important theme of research as these devices grow in size.

There are two takeaways from the results presented here. First, though these values are not comparable to state-of-the-art GAN architectures and are on a simple MNIST implementation, they serve the purpose of highlighting that the inclusion of a near-term quantum device is not detrimental to the performance of this algorithm. Secondly, we have demonstrated the framework on the larger, more complex dataset LSUN bedrooms, Figure 2. This indicates that the algorithm can be scaled.

FIG. 7. Example MNIST characters generated by (a) classical and (b) quantum-assisted associative adversarial network architectures, with sparse topology latent spaces.
V. CONCLUSIONS

Summary
In this work, we have presented a novel and scalable quantum-assisted algorithm, based on a GAN framework, which can learn an implicit latent variable generative model of complex datasets.
FIG. 8. Comparison of the convergence of different graphical topologies trained using samples from a quantum annealer on a reduced stochastically binarized MNIST dataset. The learning rate used was 0.03. This learning rate produced the fastest learning with no loss in performance of the final model. The learning was run 5 times over different embeddings and the results averaged. The error bars describe the variance over these curves.

This work is a step in the development of algorithms that may use quantum phenomena to improve the learning of generative models of datasets. This algorithm fulfills the requirements of the three areas outlined by Perdomo-Ortiz et al. [39]: generative problems, data where quantum correlations may be beneficial, and hybrid algorithms. This implementation also allows for the use of sparse topologies, removing the need for chaining, requires a relatively small number of variables (allowing near-term quantum hardware to be applied), and is resistant to noise.

Though the key motivation of this work is to demonstrate a functional deep learning framework integrating near-term quantum hardware in the learning process, it builds on classical work by Tarik Arici and Asli Celikyilmaz [9] exploring the effect of learning the feature space and using this distribution as the input to the generator. No claims are made here on the improvements that can be made classically, though it is possible that further research into the associative adversarial architecture will yield improvements to GAN design.

In summary, we have successfully demonstrated a quantum-assisted GAN capable of learning a model of a complex dataset such as LSUN, and compared the performance of different topologies.
Further Work
There are many avenues to use quantum annealing for sampling in machine learning, in topologies and in GAN research. Here, we have outlined a framework that works on simple (MNIST) and more complex (LSUN) datasets.
FIG. 9. Comparison of different graphical topologies trained using MCMC sampling on a reduced stochastically binarized MNIST dataset. The learning rate used was 0.001. This learning rate was chosen such that the training was stable for each topology; we found that the error diverged for certain topologies at other learning rates. The learning was run 5 times and the results averaged. The error bars describe the variance over these curves.
We highlight several areas of interest that build on this work.

The first is an investigation into how the inclusion of quantum hardware in models such as this can be detected. There are two potential improvements to the model: quantum terms may improve the model of the data distribution; or graphical models which are classically intractable to learn, for example fully connected ones, integrated into the latent spaces, may improve the latent variable generative model learned. Before investing extensive time and research into integrating quantum models into latent spaces it will be important to confirm that these improvements are reflected in the overall model of the dataset, that is, that backpropagation does not erase any latent space performance gains.

There are still outstanding questions as to the distribution the quantum annealer samples from. The pause and reverse anneal features on the D-Wave 2000Q give greater control over the distribution output by the quantum annealer, and can be used to explore the relationship between the quantum nature of that distribution and the quality of the model trained by a quantum Boltzmann machine [40]. It is also not clear what distribution is the 'best' for learning a model of a distribution. It could be that efforts to decrease the operating temperature of a quantum annealer to boost performance on optimisation problems will lead to decreased performance in ML applications, as the diversity of states in a distribution decreases and probabilities accumulate at a few low energy states. There are interesting open questions as to the optimal effective temperature of a quantum annealer for ML applications. This question fits within a broad area of research in ML asking which distributions are most useful for ML and why.

For this simple implementation, the quantum sampling sparse graph performance is comparable to the complete and restricted topologies. Though in optimised implementations we expect divergent performance, the sparse graph serves the purpose of demonstrating the QAAAN architecture. Additionally, we have highlighted sparse classical graphical models for use in the architecture demonstrated on LSUN bedrooms. Though they have reduced expressive power, there are many more applications for current quantum hardware; for example, a fully connected graphical model would require in excess of 2048 qubits (the number available on the D-Wave 2000Q) to learn a model of a standard MNIST dataset, not to mention the detrimental effect of the extensive chains. A sparse D-Wave 2000Q native graph (Chimera), conversely, would only use 784 qubits. This is a stark example of how sparse models might be used in lieu of models with higher connectivity. Investigations finding the optimal balance between the complexity of a model, the resulting overhead required by embedding, and the effect of both on performance are needed to understand how future quantum annealers might be used for applications in ML.
Acknowledgements
We would like to thank Marcello Benedetti for conversations full of his expertise and good humour. We would also like to thank Thomas Vandal, Rama Nemani, Andrew Michaelis, Subodh Kalia and Salvatore Mandra for useful discussions and comments.

We are grateful for support from NASA Ames Research Center, and from the NASA Earth Science Technology Office (ESTO), the NASA Advanced Exploration Systems (AES) program, and the NASA Transformative Aeronautic Concepts Program (TACP). We also appreciate support from the AFRL Information Directorate under grant F4HBKC4162G001 and the Office of the Director of National Intelligence (ODNI) and the Intelligence Advanced Research Projects Activity (IARPA), via IAA 145483. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of ODNI, IARPA, AFRL, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purpose notwithstanding any copyright annotation thereon.

[1] The Royal Society, "Machine learning: the power and promise of computers that learn by example," 2017.
[2] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al., "Mastering the game of go with deep neural networks and tree search," Nature, vol. 529, no. 7587, pp. 484-489, 2016.
[3] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, "Image-to-image translation with conditional adversarial networks," arXiv preprint, 2017.
[4] A. Radford, L. Metz, and S. Chintala, "Unsupervised representation learning with deep convolutional generative adversarial networks," arXiv preprint arXiv:1511.06434, 2015.
[5] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang, et al., "Photo-realistic single image super-resolution using a generative adversarial network," arXiv preprint, 2016.
[6] J. Zhao, M. Mathieu, and Y. LeCun, "Energy-based generative adversarial network," arXiv preprint arXiv:1609.03126, 2016.
[7] M. Arjovsky, S. Chintala, and L. Bottou, "Wasserstein GAN," arXiv preprint arXiv:1701.07875, 2017.
[8] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville, "Improved training of Wasserstein GANs," in Advances in Neural Information Processing Systems, pp. 5769-5779, 2017.
[9] T. Arici and A. Celikyilmaz, "Associative adversarial networks," arXiv preprint arXiv:1611.06953, 2016.
[10] M. Benedetti, J. Realpe-Gómez, R. Biswas, and A. Perdomo-Ortiz, "Quantum-assisted learning of hardware-embedded probabilistic graphical models," Physical Review X, vol. 7, no. 4, p. 041052, 2017.
[11] M. H. Amin, E. Andriyash, J. Rolfe, B. Kulchytskyy, and R. Melko, "Quantum Boltzmann machine," arXiv preprint arXiv:1601.02036, 2016.
[12] J. Biamonte, P. Wittek, N. Pancotti, P. Rebentrost, N. Wiebe, and S. Lloyd, "Quantum machine learning," Nature, vol. 549, no. 7671, p. 195, 2017.
[13] C. Ciliberto, M. Herbster, A. D. Ialongo, M. Pontil, A. Rocchetto, S. Severini, and L. Wossnig, "Quantum machine learning: a classical perspective," Proc. R. Soc. A, vol. 474, no. 2209, p. 20170551, 2018.
[14] H. J. Kappen, "Learning quantum models from quantum or classical data," arXiv preprint arXiv:1803.11278, 2018.
[15] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in Advances in Neural Information Processing Systems, pp. 2672-2680, 2014.
[16] K. Wang, C. Gou, Y. Duan, Y. Lin, X. Zheng, and F.-Y. Wang, "Generative adversarial networks: introduction and outlook," IEEE/CAA Journal of Automatica Sinica, vol. 4, no. 4, pp. 588-598, 2017.
[17] S. Mohamed and B. Lakshminarayanan, "Learning in implicit generative models," arXiv preprint arXiv:1610.03483, 2016.
[18] S. A. Barnett, "Convergence problems with generative adversarial networks (GANs)," arXiv preprint arXiv:1806.11382, 2018.
[19] H. Thanh-Tung, T. Tran, and S. Venkatesh, "On catastrophic forgetting and mode collapse in generative adversarial networks," arXiv preprint arXiv:1807.04015, 2018.
[20] T. White, "Sampling generative networks: Notes on a few effective techniques," arXiv preprint arXiv:1609.04468, 2016.
[21] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen, "Improved techniques for training GANs," in Advances in Neural Information Processing Systems, pp. 2234-2242, 2016.
[22] N. Le Roux and Y. Bengio, "Representational power of restricted Boltzmann machines and deep belief networks," Neural Computation, vol. 20, no. 6, pp. 1631-1649, 2008.
[23] D. H. Ackley, G. E. Hinton, and T. J. Sejnowski, "A learning algorithm for Boltzmann machines," in Readings in Computer Vision, pp. 522-533, Elsevier, 1987.
[24] D. Koller, N. Friedman, L. Getoor, and B. Taskar, "Graphical models in a nutshell," Introduction to Statistical Relational Learning, pp. 13-55, 2007.
[25] M. A. Carreira-Perpinan and G. E. Hinton, "On contrastive divergence learning," in AISTATS, vol. 10, pp. 33-40, Citeseer, 2005.
[26] L. Deng, D. Yu, et al., "Deep learning: methods and applications," Foundations and Trends in Signal Processing, vol. 7, no. 3-4, pp. 197-387, 2014.
[27] R. Biswas, Z. Jiang, K. Kechezhi, S. Knysh, S. Mandra, B. O'Gorman, A. Perdomo-Ortiz, A. Petukhov, J. Realpe-Gómez, E. Rieffel, et al., "A NASA perspective on quantum computing: Opportunities and challenges," Parallel Computing, vol. 64, pp. 81-98, 2017.
[28] T. F. Rønnow, Z. Wang, J. Job, S. Boixo, S. V. Isakov, D. Wecker, J. M. Martinis, D. A. Lidar, and M. Troyer, "Defining and detecting quantum speedup," Science, vol. 345, no. 6195, pp. 420-424, 2014.
[29] H. G. Katzgraber, F. Hamze, and R. S. Andrist, "Glassy chimeras could be blind to quantum speedup: Designing better benchmarks for quantum annealing machines," Physical Review X, vol. 4, no. 2, p. 021008, 2014.
[30] A. Khoshaman, W. Vinci, B. Denis, E. Andriyash, and M. H. Amin, "Quantum variational autoencoder," arXiv preprint arXiv:1802.05779, 2018.
[31] M. Benedetti, J. R. Gómez, and A. Perdomo-Ortiz, "Quantum-assisted Helmholtz machines: A quantum-classical deep learning framework for industrial datasets in near-term devices," Quantum Science and Technology, 2018.
[32] S. H. Adachi and M. P. Henderson, "Application of quantum annealing to training of deep neural networks," arXiv preprint arXiv:1510.06356, 2015.
[33] M. H. Amin, "Searching for quantum speedup in quasistatic quantum annealers," Physical Review A, vol. 92, no. 5, p. 052323, 2015.
[34] M. Benedetti, J. Realpe-Gómez, R. Biswas, and A. Perdomo-Ortiz, "Estimation of effective temperatures in quantum annealers for sampling applications: A case study with possible applications in deep learning," Physical Review A, vol. 94, no. 2, p. 022308, 2016.
[35] J. Raymond, S. Yarkoni, and E. Andriyash, "Global warming: Temperature estimation in annealers," Frontiers in ICT, vol. 3, p. 23, 2016.
[36] C. Doersch, "Tutorial on variational autoencoders," arXiv preprint arXiv:1606.05908, 2016.
[37] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016.
[38] V. Choi, "Minor-embedding in adiabatic quantum computation: II. Minor-universal graph design," Quantum Information Processing, vol. 10, no. 3, pp. 343-353, 2011.
[39] A. Perdomo-Ortiz, M. Benedetti, J. Realpe-Gómez, and R. Biswas, "Opportunities and challenges for quantum-assisted machine learning in near-term quantum computers," arXiv preprint arXiv:1708.09757, 2017.
[40] J. Marshall, D. Venturelli, I. Hen, and E. G. Rieffel, "The power of pausing: advancing understanding of thermalization in experimental quantum annealers," arXiv preprint arXiv:1810.05881.