Supervised Learning with Projected Entangled Pair States
Song Cheng,1,2 Lei Wang,2,3,∗ and Pan Zhang4,†

1 Beijing Institute of Mathematical Sciences and Applications, Beijing 101407, China
2 Institute of Physics, Chinese Academy of Sciences, Beijing 100190, China
3 Songshan Lake Materials Laboratory, Dongguan, Guangdong 523808, China
4 CAS Key Laboratory of Theoretical Physics, Institute of Theoretical Physics, Chinese Academy of Sciences, Beijing 100190, China
Tensor networks, a class of models originating in quantum physics, have gradually been generalized into efficient models in machine learning in recent years. However, in order to achieve exact contraction, only tree-like tensor networks such as matrix product states and tree tensor networks have been considered, even for modeling two-dimensional data such as images. In this work, we construct supervised learning models for images using projected entangled pair states (PEPS), a two-dimensional tensor network with a structural prior similar to that of natural images. Our approach first performs a feature map, which transforms the image data to a product state on a grid, and then contracts the product state with a PEPS with trainable parameters to predict image labels. The tensor elements of the PEPS are trained by minimizing differences between training labels and predicted labels. The proposed model is evaluated on image classification using the MNIST and Fashion-MNIST datasets. We show that our model is significantly superior to existing models based on tree-like tensor networks. Moreover, using the same input features, our method performs as well as a multilayer perceptron classifier, but with far fewer parameters and greater stability. Our results shed light on potential applications of two-dimensional tensor network models in machine learning.

I. INTRODUCTION
Tensor networks are a framework that approximates a high-order tensor by contractions of low-order tensors. They have been widely used in quantum many-body physics for systems whose correlation or entanglement entropy satisfies the area law, which is traditionally related to the partition functions of local classical Hamiltonians and the gapped ground states of local quantum Hamiltonians. Recent works have uncovered that the information entropy of image datasets in machine learning also approximately satisfies the area law, which inspired several recent works introducing tensor networks to various machine learning tasks.

On the theoretical side, introducing tensor networks, which are purely linear, can bring more interpretability to the machine learning framework: the entanglement structure can be naturally introduced to characterize the expressive power of the learning model, and some neural network models can be mapped to tensor networks for theoretical investigation. On the practical side, numerical techniques of tensor networks are also useful and inspiring for the optimization and training of machine learning models; these include the normalized (or canonical) form, adaptive learning, etc. These benefits have motivated Google X of Alphabet to release the TensorNetwork library built upon Google's machine learning framework TensorFlow.

It has been shown that tensor networks have a deep connection with probabilistic graphical models in statistical learning, including hidden Markov chains, restricted Boltzmann machines, Markov random fields, etc. The study of this connection has given rise to the so-called "Born machine" as a typical quantum machine learning prototype. In this sense, tensor-network-based machine learning is closely related to quantum machine learning in terms of model structure. Some tensor-network-based machine learning algorithms can be directly regarded as classical simulations of the corresponding quantum machine learning algorithms.

In recent years, various tensor network models, such as matrix product states (MPS), tree tensor networks (TTN), and string bond states (SBS), have been introduced to machine learning and have found successful applications in image classification, image density estimation, generative modeling of natural languages, and neural network compression. These models are all based on quasi-one-dimensional, tree-like tensor networks, which are the best-understood types of tensor networks and can be efficiently contracted thanks to the existence of the canonical form. However, when dealing with data such as natural images, the spatial correlations between nearby pixels as well as the structural prior are completely ignored in tree-like tensor networks, forcing some short-range correlations to become artificially long-range and leading to unnecessary computational overhead as well as statistical bias. Notice that in physics there is one kind of tensor network with exactly the same geometric structure as natural images, known as projected entangled pair states (PEPS), which is composed of tensors located on a two-dimensional lattice, as shown in Fig. 1.

In this work, we introduce a supervised learning model based on the PEPS representation on an L × L grid with local physical dimension d and local bond dimension D for each tensor. The size L of the PEPS and the value of the local physical dimension d depend on the feature map we employ, which transforms the set of L × L pixels of an input image into a product state on an L × L grid. Given the features, a PEPS with trainable parameters is contracted with the input features to obtain a probability distribution over labels, which is used to predict the label of the input image. When the grid is small, exact contraction of the PEPS can be performed with both space and time complexity proportional to D^L. When L or D is large, the PEPS cannot be contracted exactly, and we apply an approximate contraction method based on the boundary MPS method. We evaluate the PEPS model on the standard MNIST and Fashion-MNIST datasets. We show that our model significantly outperforms existing tensor network models using MPS and tree tensor networks. When compared with a standard classifier, the multilayer perceptron (MLP), we find that they perform similarly when the same input features are used, but our PEPS-based method requires far fewer parameters and is more stable.

Figure 1. Supervised learning model with the PEPS structure. The input image x is mapped to a high-dimensional vector Φ(x) consisting of local feature maps φ_{s_i}(x_i). The label vector f_ℓ(x) comes from the contraction of Φ(x) and a tensor network W with the PEPS structure.

The rest of this paper is organized as follows: In Sec. II we give a detailed description of the PEPS model and the corresponding training algorithm. In Sec. III we evaluate our model on the MNIST and Fashion-MNIST datasets and compare the results with other tensor network models as well as classic machine learning models. We conclude in Sec. IV and discuss possible future developments along the direction of applying tensor networks to machine learning.

II. IMAGE CLASSIFICATION WITH PEPS

A. Feature map of input data
The goal of supervised learning is to learn a complex function f(x) which maps an input training (grayscale) image x ∈ R^{L×L}, with pixels defined on an L × L grid, to a given label y ∈ {1, 2, ..., T}, where T denotes the number of possible labels. Usually, such a mapping is highly nonlinear in the original space of the input data x, because nonlinearity effectively increases the dimension of the input space, where features of the data are easier to capture. In this work, we consider a classifier built from tensor networks, which is a linear model acting on a space of very large dimension. The motivation for working with a very large dimension is that nonlinearity is then no longer necessary, because all features become linearly separable, as stated in the representer theorem. So one first needs to transform the input data x into a feature tensor Φ(x) in a space of large dimension using a feature map. We consider two distinct kinds of feature maps in this work.
1. Product state feature map
A simple way to increase the dimension of the input space is to create a Hilbert space for the pixels. This amounts to representing a black pixel with x_i = 0 by the basis state |0⟩ = (1, 0)^T and a white pixel with x_i = 1 by the basis state |1⟩ = (0, 1)^T, and then converting each grayscale pixel x_i of the image x into a superposition of |0⟩ and |1⟩,

φ(x_i) = a (1, 0)^T + b (0, 1)^T,   (1)

where a and b are functions of x_i, which can for example be chosen as

a = cos(π x_i / 2),  b = sin(π x_i / 2).   (2)

For an image with N = L × L pixels, the feature tensor Φ(x) is then defined by the tensor product of the φ(x_i),

Φ(x) = φ(x_1) ⊗ φ(x_2) ⊗ ··· ⊗ φ(x_N).   (3)

This is probably the most straightforward feature map that transforms every pixel of an image in the original space R^N into a product state in a Hilbert space of dimension 2^{L×L}, and it has been widely used in the literature. We term it the product state feature map.
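As a concrete illustration, the following minimal NumPy sketch (the function names are ours, not the paper's) implements the local map of Eqs. (1)-(2) and stores one two-component local vector per pixel; the full feature tensor Φ(x) of Eq. (3), whose dimension 2^{L×L} is far too large to store, is only ever used implicitly through these local vectors.

```python
import numpy as np

def local_feature(x):
    """Map a pixel value x in [0, 1] to the two-component local feature
    phi(x) = (cos(pi*x/2), sin(pi*x/2)) of Eqs. (1)-(2)."""
    return np.array([np.cos(np.pi * x / 2.0), np.sin(np.pi * x / 2.0)])

def product_state_feature_map(image):
    """Apply the local map pixel-wise to an L x L grayscale image.

    Returns an array of shape (L, L, 2): one local vector per pixel.  The
    full feature tensor Phi(x) of Eq. (3) is the tensor product of all these
    vectors and is never formed explicitly."""
    image = np.asarray(image, dtype=float)
    return np.stack([[local_feature(p) for p in row] for row in image])

# Toy usage: a random 4 x 4 "image" with pixel values in [0, 1].
features = product_state_feature_map(np.random.rand(4, 4))
print(features.shape)  # (4, 4, 2)
```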
2. Convolution feature map
The simple product state feature map introduced in the previous section is fixed before the classifier is applied and is therefore apparently not optimal. Another option is an adaptive feature map whose parameters are learned together with the classifier. The most famous adaptive feature maps are convolution layers, which perform nonlinear transformations that turn input images into a feature tensor with multiple channels using two-dimensional convolutions.

The input of the convolution layer is a raw image x ∈ R^{L×L}. After the transformation, the convolution layer outputs a third-order feature tensor with dimensions L' × L' × d, where L' × L' refers to the output size of the features, with L' ≤ L depending on the kernel sizes and paddings, and d denotes the number of channels. That is to say, the output of the CNN feature map is also a product state with components located on a grid of size L' × L', and each component has local physical dimension d. Thus the total dimension of the space of the feature tensor is d^{L'×L'}.

In standard convolutional neural networks (CNN), the function of the convolution layers (plus pooling layers) is to extract relevant features from the input data. Following the convolution layers, a classifier, usually a multilayer perceptron (MLP) or simply a linear classifier such as logistic regression, is used to predict a label from the extracted features. Notice that the MLP cannot accept a feature tensor as input. Instead, in the MLP the feature tensor Φ(x) is flattened into a vector in the space R^{d × L' × L'}, which completely ignores the spatial structure of the feature tensor. In this work, we consider a linear classifier using two-dimensional tensor networks, which can take the raw feature tensor as input directly, as tensor networks were born to compress a large Hilbert space.
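For illustration, a trainable convolution feature map of this kind could look as follows. This is a minimal sketch assuming PyTorch; the channel count, padding, and the choice of max pooling are illustrative assumptions of ours, while the 5 × 5 kernels, stride 1, ReLU, and 2 × 2 pooling follow the setup reported in Sec. III.

```python
import torch
import torch.nn as nn

class ConvFeatureMap(nn.Module):
    """Trainable convolutional feature map: a 1 x L x L image is turned into
    an L' x L' x d product-state-like feature tensor (d = channel count)."""

    def __init__(self, channels=6):
        super().__init__()
        # 5 x 5 kernels, stride 1; the padding keeps the spatial size at L.
        self.conv = nn.Conv2d(1, channels, kernel_size=5, stride=1, padding=2)
        self.pool = nn.MaxPool2d(2)  # 2 x 2 pooling halves the size: L' = L/2

    def forward(self, x):
        # x: (batch, 1, L, L) -> (batch, channels, L', L')
        h = self.pool(torch.relu(self.conv(x)))
        # Rearrange to (batch, L', L', d): one d-dimensional local vector per
        # site of the L' x L' grid, ready to be contracted with a PEPS.
        return h.permute(0, 2, 3, 1)

features = ConvFeatureMap()(torch.rand(8, 1, 28, 28))
print(features.shape)  # torch.Size([8, 14, 14, 6])
```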
B. PEPS Classifier

Equipped with the feature map and the extracted feature tensor Φ(x) ∈ R^{d^{L×L}}, we then consider a linear map W which results in a vector representing the probability of each of the T labels given the input image. To be more specific, consider

f(x) = W · Φ(x),   (4)

where W ∈ R^{d^{L×L} × T} and · denotes the tensor contraction of the operator W and the feature tensor Φ(x). In general W is a tensor of order L × L + 1, where L × L is the order of the input, each index of which has dimension d, and the output index has dimension T. We can see that the total number of parameters, d^{L×L} × T, is far too large to use in practice. So one must make a reasonable approximation of W to arrive at a practical and efficient algorithm, or in other words, decompose W into a contraction of many adjacent tensors using a tensor network representation.

For physical systems, such as the gapped ground states of local quantum Hamiltonians, most long-range correlations of the model are irrelevant, and the tensor network approach has proved to be a successful ansatz. For systems like natural image datasets, locality has been discussed in several recent works. This suggests that the correlations of the natural images handled in machine learning may be dominated by local correlations, with rare long-range correlations. This implication has also been numerically verified by several successful tensor network machine learning models based on MPS, TTN, etc. However, we notice that previous works on compressing W are all based on tree-like tensor networks, which completely ignore the two-dimensional nature of the feature tensor Φ(x), treating it as a quasi-one-dimensional tensor. In this work, we propose to use the PEPS, a two-dimensional tensor network with a structural prior similar to that of images.

As shown in Fig. 1, the PEPS represents W using a composition of tensors T^{[i]},

W_{ℓ, s_1 s_2 ··· s_N} = Σ_{σ_1 σ_2 ··· σ_K} T^{[1] s_1}_{σ_1 σ_2} T^{[2] s_2}_{σ_3 σ_4 σ_5} ··· T^{[i] s_i, ℓ}_{σ_k σ_{k+1} σ_{k+2} σ_{k+3}} ··· T^{[N] s_N}_{σ_{K−1} σ_K},   (5)

where K is the number of bonds of the square lattice. Each tensor has a "physical" index s_i ∈ {1, 2, ..., d} connected to the input vector φ(x_i). The "virtual" indices σ_k ∈ {1, 2, ..., D} are contracted with the adjacent tensors of the square lattice. In addition, a special tensor in the center of the lattice has an extra "label" index ℓ ∈ {1, 2, ..., T} to generate the output vector of the model. The elements of these tensors are randomly initialized with real numbers ranging from 0 to 0.01 and constitute the trainable parameters of the model, denoted as θ.
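To make Eqs. (4) and (5) concrete, the toy NumPy sketch below contracts a 2 × 2 PEPS classifier exactly with the product state feature map: it first absorbs the physical indices into contracted tensors M_i = Σ_s T_i^s φ^s(x_i) and then sums over all virtual bonds. The tensor names, index ordering, and the placement of the label index on a corner tensor are illustrative choices of ours (a 2 × 2 grid has no central site); image-sized grids require the approximate contraction described in the next subsection.

```python
import numpy as np

d, D, T = 2, 3, 10            # physical, virtual (bond), and label dimensions
rng = np.random.default_rng(0)

# A toy 2 x 2 PEPS.  Bond labels: a = top horizontal, b = bottom horizontal,
# c = left vertical, e = right vertical.  The tensor at site (0, 0) carries
# the extra label index l here (the paper places it on the central tensor).
T1 = rng.uniform(0, 0.01, (d, T, D, D))   # indices (s1, l, a, c)
T2 = rng.uniform(0, 0.01, (d, D, D))      # indices (s2, a, e)
T3 = rng.uniform(0, 0.01, (d, D, D))      # indices (s3, c, b)
T4 = rng.uniform(0, 0.01, (d, D, D))      # indices (s4, b, e)

def local_feature(x):
    """Local feature map of Eqs. (1)-(2)."""
    return np.array([np.cos(np.pi * x / 2.0), np.sin(np.pi * x / 2.0)])

pixels = rng.random(4)                    # a toy 2 x 2 "image" in [0, 1]
phis = [local_feature(p) for p in pixels]

# Absorb the physical indices: M_i = sum_s T_i^{s} phi^{s}(x_i).
M1 = np.einsum('slac,s->lac', T1, phis[0])
M2 = np.einsum('sae,s->ae', T2, phis[1])
M3 = np.einsum('scb,s->cb', T3, phis[2])
M4 = np.einsum('sbe,s->be', T4, phis[3])

# Contract all virtual bonds exactly; the result is the T-dimensional f_l(x).
f = np.einsum('lac,ae,cb,be->l', M1, M2, M3, M4)
print(f.shape)   # (10,)
```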
C. Training algorithm

The purpose of training is to minimize the difference between the predicted labels and the training labels by tuning the trainable parameters θ. This is usually achieved by minimizing a loss function L representing the distance between the distribution of predicted labels and the one-hot vector corresponding to the distribution of training labels. A common choice of loss function is

L = − Σ_{(x_i, y_i)} log [ softmax( f^{[y_i]}(x_i) ) ],   (6)

where the sum runs over the images x_i and labels y_i of the training set, and

softmax[ f^{[y_i]}(x_i) ] ≡ exp f^{[y_i]}(x_i) / Σ_{ℓ=1}^{T} exp f^{[ℓ]}(x_i).   (7)

The output of the softmax function can be interpreted as the probability that the model assigns x_i to class y_i, so L is the cross entropy between the model's probabilities and the image labels, which is known to be well suited for most supervised learning models. f^{[ℓ]}(x) comes from contracting the physical indices of the PEPS model with the feature map vectors,

f^{[ℓ]}(x) = W_{ℓ, s_1 s_2 ··· s_N} · φ^{s_1}(x_1) ⊗ φ^{s_2}(x_2) ⊗ ··· ⊗ φ^{s_N}(x_N)
           = Σ_{σ_1 σ_2 ··· σ_K} M^{[1]}_{σ_1 σ_2} M^{[2]}_{σ_3 σ_4 σ_5} ··· M^{[i] ℓ}_{σ_k σ_{k+1} σ_{k+2} σ_{k+3}} ··· M^{[N]}_{σ_{K−1} σ_K},   (8)

where each M^{[i]} = Σ_{s_i} T^{[i] s_i} φ^{s_i}(x_i).

However, calculating f^{[ℓ]}(x_i) by contracting these M tensors is not trivial. Consider contracting the tensors of the bottom row with the tensors of the nearest row, as shown in Fig. 2(a): if the initial dimension of the virtual bonds is D, the bond dimension of the resulting tensors increases from D to D². As this process repeats, the computational cost of the contraction grows exponentially. More rigorously, it has been proved that exact contraction of the PEPS structure is #P-hard, so no polynomial algorithm exists in general.

If the PEPS is small, one can do the contraction exactly with computational complexity proportional to D^L. If D and the lattice length L are large, we have to use approximate methods. One of them is the boundary MPS method, which treats the bottom row of tensors as an MPS and the remaining rows of tensors as operators applied to that MPS. Each time a neighboring row of tensors is applied, a DMRG-like method is used to truncate the bond dimension of the MPS to a maximum value χ. In order to achieve a smaller truncation error, one first applies a QR decomposition to the MPS to ensure that it is in the correct canonical form, and then applies a singular value decomposition (SVD) to the central tensor of the canonicalized MPS, which ensures that the truncation is optimal for the entire row.

Figure 2. (a) Contracting the tensors of one row into another causes the bond dimension to grow exponentially; to keep it below a finite value χ, an approximate algorithm such as the boundary MPS is inevitable. (b) After several approximate contractions, the contraction of the last rows can be performed exactly at a cost polynomial in T, χ, and D.
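To make the row-by-row absorption concrete, the minimal NumPy sketch below applies one row of physical-index-absorbed PEPS tensors (a matrix product operator) to the boundary MPS and then truncates the enlarged bonds back to χ with a plain left-to-right SVD sweep. The tensor shapes, index conventions, and the simple sweep (instead of the canonical-form, DMRG-like compression described above) are our own illustrative choices, not the paper's exact implementation.

```python
import numpy as np

def apply_row(mps, mpo):
    """Absorb one row of the PEPS (physical index already contracted away,
    virtual indices left, right, up, down) into the boundary MPS, whose
    tensors carry indices (left, right, up).  Bonds grow from chi to chi*D."""
    new = []
    for A, O in zip(mps, mpo):
        # A: (chi_l, chi_r, D_up); O: (D_l, D_r, D_up, D_down); the up index
        # of A is contracted with the down index of O.
        B = np.einsum('lrd,LRud->lLrRu', A, O)
        cl, Dl, cr, Dr, u = B.shape
        new.append(B.reshape(cl * Dl, cr * Dr, u))   # merge the bond pairs
    return new

def truncate(mps, chi):
    """A plain left-to-right SVD sweep keeping at most chi singular values
    per bond (a full implementation would first canonicalize the MPS, as
    described in the text)."""
    out, carry = [], np.eye(mps[0].shape[0])
    for A in mps:
        A = np.einsum('ab,bru->aru', carry, A)       # absorb the carry matrix
        l, r, u = A.shape
        U, S, Vh = np.linalg.svd(A.transpose(0, 2, 1).reshape(l * u, r),
                                 full_matrices=False)
        k = min(chi, len(S))
        out.append(U[:, :k].reshape(l, u, k).transpose(0, 2, 1))
        carry = S[:k, None] * Vh[:k]
    out[-1] = np.einsum('lru,rb->lbu', out[-1], carry)
    return out

# Toy usage: a 4-site boundary MPS with chi = 2 absorbing a row with D = 3.
chi, D = 2, 3
mps = ([np.random.rand(1, chi, D)] + [np.random.rand(chi, chi, D)] * 2
       + [np.random.rand(chi, 1, D)])
mpo = ([np.random.rand(1, D, D, D)] + [np.random.rand(D, D, D, D)] * 2
       + [np.random.rand(D, 1, D, D)])
mps = truncate(apply_row(mps, mpo), chi)
print([t.shape for t in mps])   # bond dimensions are back to at most chi
```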
There are many other choices of approximate contraction methods, such as the coarse-grained methods closely related to renormalization group theory, which may provide a natural way to introduce the renormalization group into machine learning. However, those methods usually have a much larger computational complexity than the boundary MPS method and are more suitable for infinite systems.

The total computational cost of the approximate contraction by the boundary MPS scales linearly in the number of tensors N and polynomially in χ and D. More efficiently, we can approximately contract from the top and from the bottom in parallel. The resulting tensor network is shown in Fig. 2(b), which can be contracted exactly at a cost polynomial in T, χ, and D. The computational cost of this training algorithm for a complete forward pass thus also scales linearly in the number of input images |T|.

The gradients of the loss with respect to the parameters are obtained with the automatic differentiation technique for tensor networks. The key to this technique is to treat the tensor network algorithm, such as the boundary MPS, as a traceable computation graph of tensors and algebraic operations. Then, through the simple chain rule, one can run a backward-propagation process along this computation graph to obtain the gradient of the loss function L with respect to each parameter. Automatic differentiation is one of the core techniques of modern machine learning applications, and it is proven that its computational complexity does not exceed that of the original feed-forward algorithm.

Figure 3. Images from the MNIST and Fashion-MNIST datasets.

There are two obstacles to applying automatic differentiation to tensor networks. The first is that in the backward pass of singular value decompositions (SVD), in the case of (nearly) degenerate singular values λ_i = λ_j, a factor appearing in the perturbation analysis, F_ij ≡ 1/(λ_i − λ_j), encounters numerical instability. The solution is to replace F_ij with (λ_i − λ_j)/((λ_i − λ_j)² + ε), where ε is a small factor that does not significantly change the gradient. The second issue is that automatic differentiation of tensor networks may cause huge memory consumption. We employ two techniques to address it. One is the checkpointing technique, which stores fewer intermediate variables by recomputing them during the backward pass. The other is blocking, which uses one tensor whose physical index dimension is d^n to parameterize n neighboring pixels in order to reduce the size of the PEPS; this is equivalent to approximating an intermediate tensor of bond dimension D^n with a tensor of bond dimension D.

After obtaining the gradients, we directly apply stochastic gradient descent (SGD) and the Adam optimizer to update the trainable parameters θ.
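For concreteness, the loss of Eqs. (6) and (7) that these optimizers minimize can be written in a few lines. The sketch below is our own minimal NumPy version; it averages over a mini-batch rather than summing over the full dataset, and the function name is hypothetical.

```python
import numpy as np

def cross_entropy_loss(f, labels):
    """Negative log-likelihood of Eqs. (6)-(7), averaged over a mini-batch.

    f      : array of shape (batch, T) with the raw outputs f^l(x_i)
    labels : integer array of shape (batch,) with the training labels y_i
    """
    f = f - f.max(axis=1, keepdims=True)                 # numerical stability
    log_softmax = f - np.log(np.exp(f).sum(axis=1, keepdims=True))
    return -log_softmax[np.arange(len(labels)), labels].mean()

# Toy usage with random outputs for a batch of 5 images and T = 10 labels.
rng = np.random.default_rng(0)
print(cross_entropy_loss(rng.normal(size=(5, 10)), rng.integers(0, 10, 5)))
```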
However, we found in practice that when the feature map is a simple function like Eq. (3) rather than a CNN, keeping the parameters non-negative, for example by assigning |θ| to θ, greatly improves the stability of the optimization. Under this constraint, the PEPS model can also be regarded as a kind of Markov random field (MRF) in the language of probabilistic graphical models. The reason for this phenomenon may be that the SVD truncation error of positive matrices is usually smaller, so that the gradient values can be transmitted more accurately. This observation raises an interesting question for the machine learning community: if the feed-forward process can only be computed approximately, how should one design a more effective gradient update algorithm?
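The SVD-gradient regularization mentioned above is easy to state in code. The sketch below (the function name and default ε are our own choices) computes the stabilized factor that replaces F_ij = 1/(λ_i − λ_j); in a full implementation it would sit inside a custom backward rule for the SVD, which we do not reproduce here.

```python
import numpy as np

def stabilized_inverse_gap(s, eps=1e-12):
    """Regularized factor for the SVD backward pass.

    The naive F_ij = 1 / (s_i - s_j) diverges when two singular values
    (nearly) coincide; it is replaced by (s_i - s_j) / ((s_i - s_j)**2 + eps),
    which stays finite and barely changes well-separated pairs."""
    diff = s[:, None] - s[None, :]
    F = diff / (diff ** 2 + eps)
    np.fill_diagonal(F, 0.0)          # diagonal terms are excluded anyway
    return F

s = np.array([1.0, 0.5, 0.5 + 1e-9])  # nearly degenerate singular values
print(stabilized_inverse_gap(s))      # stays finite for the (1, 2) pair
```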
III. NUMERICAL EXPERIMENTS

In this section, we conduct numerical experiments to evaluate the expressive power of PEPS models for image classification. As we have introduced two feature maps for generating the input feature tensor of the PEPS classifier, we term the PEPS classifier with the simple product state feature map PEPS, and the PEPS classifier with the convolution-layer feature map CNN-PEPS.

In PEPS, the simple feature map of Eq. (3) is used, and the local features are further grouped into PEPS tensors using 2 × 2 blocking: for 28 × 28 images, the PEPS is therefore 14 × 14 with physical index dimension 2^4 = 16, and each tensor handles the information of the pixels within a 2 × 2 block. As discussed in Sec. II, constraining the parameters θ of the PEPS model to be positive significantly improves the stability of the optimization. CNN-PEPS shares the same PEPS structure; its feature map is a convolution layer with kernels of size 5 × 5, stride 1, ReLU activation, and 2 × 2 pooling, and in this case the positivity constraint has no obvious effect on the optimization result. In both models, the truncation dimension χ of the boundary MPS used in the approximate contraction is fixed.

As baselines we use multilayer perceptrons (MLP) with one hidden layer of n_h neurons and 10 output neurons; the output activation is softmax and the cost function is the cross entropy, the same as for the PEPS model. The CNN-MLP baseline uses a similar MLP on top of the same CNN layer that CNN-PEPS uses for feature extraction. In our experiments, the best test accuracy is achieved with a suitable number of hidden neurons n_h and learning rate α; the batch size is 100, regularization is set to 0, weight decay is 0, and we train for 100 epochs in total. To compare with one-dimensional tensor network learning models, we also experimented with the MPS model, with parameters set exactly as in the original MPS supervised learning work; the code of the MPS model is based on an open-source project.
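The 2 × 2 blocking used here simply merges the four 2-dimensional local features of each block into one 16-dimensional local vector by a tensor (Kronecker) product. The minimal NumPy sketch below, with our own function name, turns the 28 × 28 × 2 output of the product state feature map into the 14 × 14 × 16 input of the PEPS.

```python
import numpy as np

def block_features(local_feats):
    """Merge the local features of each 2 x 2 block of pixels into a single
    site of physical dimension 2**4 = 16 (the 'blocking' technique).

    local_feats : array of shape (L, L, 2) from the product state feature map.
    Returns an array of shape (L // 2, L // 2, 16)."""
    L = local_feats.shape[0]
    out = np.empty((L // 2, L // 2, 16))
    for i in range(0, L, 2):
        for j in range(0, L, 2):
            v = local_feats[i, j]
            for b in (local_feats[i, j + 1], local_feats[i + 1, j],
                      local_feats[i + 1, j + 1]):
                v = np.kron(v, b)          # tensor product of the 2-vectors
            out[i // 2, j // 2] = v
    return out

blocked = block_features(np.random.rand(28, 28, 2))
print(blocked.shape)   # (14, 14, 16)
```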
A. MNIST dataset

We first test our models on the MNIST dataset, a simple and standard dataset widely used by many supervised learning models. The MNIST dataset consists of 55,000 training images, 5,000 validation images, and 10,000 test images; each image contains 28 × 28 pixels, and the images are divided into 10 classes corresponding to the handwritten digits 0 to 9.

As shown in Fig. 4, at the same bond dimension D, the best test accuracy of the PEPS model is significantly better than that of the MPS, which reflects the superiority of the PEPS tensor network over one-dimensional tensor networks in modeling images. At D = 5, the PEPS model achieves its best test accuracy of about 97%, even though D is small. Already at D = 2, the training accuracy of PEPS is very close to 100% (99.68%). With bond dimensions as small as D = 3, PEPS and CNN-PEPS give almost the same best test accuracy as MLP and CNN-MLP, while the number of parameters of the PEPS structures is only 27.60% and 6.96% of the corresponding MLP structures, respectively. These facts imply a potential application of tensor networks in model compression. We also found that the best test accuracy of the PEPS structure is stable over a wide range of learning rates and other hyperparameter choices, while the best results of the MLP deteriorate easily under much smaller perturbations. Moreover, with bond dimension D = 5, CNN-PEPS achieves 99.31% test accuracy on the MNIST dataset, which is the state-of-the-art performance among tensor network models. Compared with the MPS, which achieves its best test accuracy of 99.03% at a much larger bond dimension, the good performance of PEPS at small D also verifies the inherent low-entanglement locality of natural image datasets. This inherent property of images may be among the physical reasons for the success of machine learning models like CNNs. Moreover, a PEPS with small D is beneficial for hardware implementations of the corresponding quantum machine learning model.

Figure 4. Best test set accuracy of different models on the MNIST dataset as a function of the bond dimension. The dashed lines refer to the best results of multilayer perceptrons with 784-n_h-10 neurons. The "CNN" prefix indicates models applying the convolution feature map of Sec. II A 2. Due to the structural prior matching images, PEPS models perform significantly better than the one-dimensional MPS model at the same bond dimension. CNN-PEPS achieves the state-of-the-art performance among tensor network models, comparable to the MLP but with fewer parameters.

B. Fashion-MNIST dataset
Another dataset we evaluate on is the Fashion-MNIST dataset, which contains grayscale images of 10 classes of clothing and is considered more challenging than the MNIST dataset. The test accuracies of different models are detailed in Table I. We can see that with bond dimension D = 5, the best test accuracy of CNN-PEPS reaches 91.2%, which is the current state-of-the-art result among tensor network machine learning models on the Fashion-MNIST dataset. It is also competitive with the AlexNet and XGBoost models, but there is still a clear gap to the most recent advanced convolutional neural networks, such as GoogLeNet, which employs many convolution layers.
Table I. Best test set accuracy of different models on the Fashion-MNIST dataset. Bold marks the models computed in this work. The "CNN" prefix indicates models applying the convolution feature map of Sec. II A 2.

Model                      Test accuracy
Support Vector Machine
MLP
PEPS
TTN
CNN-MLP
CNN-PEPS                   91.2%

IV. CONCLUSIONS AND DISCUSSIONS
We have presented a tensor network model for supervised learning based on the PEPS, which, compared with MPS and MLP classifiers, directly takes advantage of the two-dimensional structure of the feature tensor. We applied the boundary MPS method to achieve efficient approximate contraction in the feed-forward pass of the PEPS classifier, and combined it with the automatic differentiation technique for tensor networks in the backward pass to compute the gradients of the model parameters.

Using extensive numerical experiments we showed that on both the MNIST and Fashion-MNIST datasets we obtain the state-of-the-art test accuracy among tensor network learning models, which illustrates the advantage of the structural prior of two-dimensional tensor networks in modeling natural images. Compared with fully connected neural networks such as the MLP, the two-dimensional tensor network structure makes more use of the structure of the images themselves and achieves the same or even slightly better results with fewer parameters. More importantly, a machine learning algorithm based on tensor networks has the potential to be transformed into a quantum machine learning algorithm based on near-term noisy intermediate-scale quantum circuits. Notably, the PEPS structure has a geometry similar to that of several current quantum hardware platforms. Moreover, we have shown that the classical supervised learning algorithm based on PEPS works well even when D is small, so a quantum machine learning model with this structure could benefit from our result.

However, there are also several issues with two-dimensional tensor network machine learning models. First, the computation and storage costs of the PEPS model are significantly higher than those of traditional machine learning models such as neural networks. Second, although the expressive power of the model is likely to be sufficient, the current optimization methods may not be suitable for optimizing such complex tensor networks, which can only be contracted approximately, resulting in a failure of stable convergence when the model parameters are allowed to be negative.

Finally, there are many possible ways to improve the current PEPS machine learning model. For example, it would benefit from new optimization methods that are better suited to tensor networks. One could also combine the renormalization group and machine learning in a more practical way by applying a coarse-grained contraction scheme. Moreover, unsupervised generative learning models based on two-dimensional tensor networks may be a promising direction.

ACKNOWLEDGMENTS
We thank E. Miles Stoudenmire, Jing Chen, Jin-Guo Liu, Hai-Jun Liao, and Wen-Yuan Liu for inspiring discussions and collaborations. S.C. is supported by the National R&D Program of China (Grant No. 2017YFA0302901) and the National Natural Science Foundation of China (Grants No. 11190024 and No. 11474331). L.W. is supported by the Ministry of Science and Technology of China under Grant No. 2016YFA0300603 and the National Natural Science Foundation of China under Grant No. 11774398. P.Z. is supported by the Key Research Program of Frontier Sciences of CAS, Grant No. QYZDB-SSW-SYS032, and Project 11747601 of the National Natural Science Foundation of China.

∗ [email protected]
† [email protected]

R. Orús, "A practical introduction to tensor networks: Matrix product states and projected entangled pair states," Annals of Physics, 117–158 (2014).
R. Orús, "Advances on tensor network theory: Symmetries, fermions, entanglement, and holography," The European Physical Journal B (2014), 10.1140/epjb/e2014-50502-9, arXiv:1407.6552.
F. Verstraete, V. Murg, and J. I. Cirac, "Matrix product states, projected entangled pair states, and variational renormalization group methods for quantum spin systems," Advances in Physics, 143–224 (2008).
M. M. Wolf, F. Verstraete, M. B. Hastings, and J. I. Cirac, "Area laws in quantum systems: Mutual information and correlations," Phys. Rev. Lett., 070502 (2008).
F. Verstraete, M. M. Wolf, D. Perez-Garcia, and J. I. Cirac, "Criticality, the area law, and the computational power of projected entangled pair states," Phys. Rev. Lett., 220601 (2006).
J. Eisert, M. Cramer, and M. B. Plenio, "Colloquium: Area laws for the entanglement entropy," Rev. Mod. Phys., 277–306 (2010).
H. W. Lin, M. Tegmark, and D. Rolnick, "Why does deep and cheap learning work so well?" Journal of Statistical Physics (2017), 10.1007/s10955-017-1836-5.
S. Cheng, J. Chen, and L. Wang, "Information perspective to probabilistic modeling: Boltzmann machines versus Born machines," Entropy (2018), 10.3390/e20080583.
D.-L. Deng, X. Li, and S. Das Sarma, "Quantum entanglement in neural network states," Phys. Rev. X, 021021 (2017).
X. Gao and L.-M. Duan, "Efficient representation of quantum many-body states with deep neural networks," Nature Communications, 662 (2017).
J. Chen, S. Cheng, H. Xie, L. Wang, and T. Xiang, "Equivalence of restricted Boltzmann machines and tensor network states," Phys. Rev. B, 085104 (2018).
E. M. Stoudenmire and D. J. Schwab, "Supervised learning with quantum-inspired tensor networks," Advances in Neural Information Processing Systems 29, 4799 (2016), arXiv:1605.05775 [stat.ML].
E. M. Stoudenmire, "Learning relevant features of data with multi-scale tensor networks," Quantum Science and Technology, 034003 (2018).
Z.-Y. Han, J. Wang, H. Fan, L. Wang, and P. Zhang, "Unsupervised generative modeling using matrix product states," Phys. Rev. X, 031012 (2018).
D. Liu, S.-J. Ran, P. Wittek, C. Peng, R. Blázquez García, G. Su, and M. Lewenstein, "Machine learning by unitary tensor network of hierarchical tree structure," New Journal of Physics, 073059 (2019).
I. Glasser, N. Pancotti, and J. I. Cirac, "Supervised learning with generalized tensor networks," arXiv:1806.05964 (2018).
Y. Levine, O. Sharir, N. Cohen, and A. Shashua, "Quantum entanglement in deep learning architectures," Physical Review Letters, 065301 (2019).
C. Roberts, A. Milsted, M. Ganahl, A. Zalcman, B. Fontaine, Y. Zou, J. Hidary, G. Vidal, and S. Leichenauer, "TensorNetwork: A library for physics and machine learning," arXiv:1905.01330 (2019).
M. Abadi et al., "TensorFlow: Large-scale machine learning on heterogeneous systems," (2015), software available from tensorflow.org.
J.-G. Liu and L. Wang, "Differentiable learning of quantum circuit Born machine," arXiv:1804.04168 [quant-ph] (2018).
X. Gao, Z. Zhang, and L. Duan, "An efficient quantum algorithm for generative machine learning," arXiv:1711.02038 [quant-ph, stat] (2017).
S. Cheng, L. Wang, T. Xiang, and P. Zhang, "Tree tensor networks for generative modeling," Physical Review B, 155131 (2019).
C. Guo, Z. Jie, W. Lu, and D. Poletti, "Matrix product operators for sequence-to-sequence learning," Physical Review E, 042114 (2018).
A. Novikov, D. Podoprikhin, A. Osokin, and D. P. Vetrov, "Tensorizing neural networks," in Advances in Neural Information Processing Systems (2015), pp. 442–450.
Y. LeCun, C. Cortes, and C. J. Burges, "MNIST handwritten digit database," http://yann.lecun.com/exdb/mnist/ (1998).
H. Xiao, K. Rasul, and R. Vollgraf, "Fashion-MNIST: A novel image dataset for benchmarking machine learning algorithms," arXiv:1708.07747 (2017).
B. Schölkopf, R. Herbrich, and A. J. Smola, "A generalized representer theorem," in International Conference on Computational Learning Theory (Springer, 2001), pp. 416–426.
I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning (MIT Press, 2016).
Y.-H. Zhang, "Entanglement entropy of target functions for image classification and convolutional neural network," arXiv:1710.05520 [cs.LG] (2017).
N. Schuch, M. M. Wolf, F. Verstraete, and J. I. Cirac, "Computational complexity of projected entangled pair states," Phys. Rev. Lett., 140506 (2007).
J. Jordan, R. Orús, G. Vidal, F. Verstraete, and J. I. Cirac, "Classical simulation of infinite-size quantum lattice systems in two spatial dimensions," Physical Review Letters, 250602 (2008).
M. Levin and C. P. Nave, "Tensor renormalization group approach to two-dimensional classical lattice models," Phys. Rev. Lett., 120601 (2007).
H.-J. Liao, J.-G. Liu, L. Wang, and T. Xiang, "Differentiable programming tensor networks," Phys. Rev. X, 031041 (2019).
T. Chen, B. Xu, C. Zhang, and C. Guestrin, "Training deep nets with sublinear memory cost," arXiv:1604.06174 [cs.LG] (2016).
D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv:1412.6980 [cs.LG] (2014).
J. Miller, "TorchMPS," https://github.com/jemisjoky/torchmps (2019).
S. Efthymiou, J. Hidary, and S. Leichenauer, "TensorNetwork for machine learning," arXiv:1906.06329 [cs.LG] (2019).
F. Arute et al., "Quantum supremacy using a programmable superconducting processor," Nature 574 (2019).