Learning with Density Matrices and Random Features
Fabio A. González, Alejandro Gallego, Santiago Toledo-Cortés, Vladimir Vargas-Calderón
Abstract
A density matrix describes the statistical state of a quantum system. It is a powerful formalism to represent both the quantum and classical uncertainty of quantum systems and to express different statistical operations such as measurement, system combination and expectations as linear algebra operations. This paper explores how density matrices can be used as a building block for machine learning models, exploiting their ability to straightforwardly combine linear algebra and probability. One of the main results of the paper is to show that density matrices coupled with random Fourier features can approximate arbitrary probability distributions over $\mathbb{R}^n$. Based on this finding, the paper builds different models for density estimation, classification and regression. These models are differentiable, so it is possible to integrate them with other differentiable components, such as deep learning architectures, and to learn their parameters using gradient-based optimization. In addition, the paper presents optimization-less training strategies based on estimation and model averaging. The models are evaluated in benchmark tasks and the results are reported and discussed.
1. Introduction
The formalism of density operators and density matrices was developed by von Neumann as a foundation of quantum statistical mechanics (Von Neumann, 1927). From the point of view of machine learning, density matrices have an interesting feature: they combine linear algebra and probability, two of the pillars of machine learning, in a very particular but powerful way.

The main question addressed by this work is how density matrices can be used in machine learning models. One of the main approaches to machine learning is to address the problem of learning as one of estimating a probability distribution from data: joint probabilities $P(x, y)$ in generative supervised models, or conditional probabilities $P(y|x)$ in discriminative models. The central idea of this work is to use density matrices to represent these probability distributions, tackling the important question of how to encode arbitrary probability density functions in $\mathbb{R}^n$ into density matrices.

The quantum probabilistic formalism of von Neumann is based on linear algebra, in contrast with classical probability, which is based on set theory. In the quantum formalism the sample space corresponds to a Hilbert space $\mathcal{H}$ and the event space to a set of linear operators in $\mathcal{H}$, the density operators. The quantum formalism generalizes classical probability. A density matrix in an $n$-dimensional Hilbert space can be seen as a catalog of categorical distributions on the finite set $\{1, \dots, n\}$. A direct application of this fact is not very useful, as we want to efficiently model continuous probability distributions in $\mathbb{R}^n$. One of the main results of this paper is to show that it is possible to model arbitrary probability distributions in $\mathbb{R}^n$ using density matrices of finite dimension in conjunction with random Fourier features (Rahimi & Recht, 2007). In particular, the paper presents a method for non-parametric density estimation that combines density matrices and random Fourier features to efficiently learn a probability density function from data and to efficiently predict the density of new samples.

The fact that the probability density function is represented in matrix form and that the density of a sample is calculated by linear algebra operations makes it easy to implement the model in GPU-accelerated machine learning frameworks. This also facilitates using density matrices as a building block for classification and regression models, which can be trained using gradient-based optimization and can be easily integrated with conventional deep neural networks. The paper presents examples of these models and shows how they can be trained using gradient-based optimization as well as optimization-less learning based on estimation.

The paper is organized as follows: Section 2 covers the background on random features, kernel density estimation and density matrices; Section 3 presents four different methods for density estimation, classification and regression; Section 4 discusses some relevant works; Section 5 presents the experimental evaluation; finally, Section 6 discusses the conclusions of the work.
2. Background
2.1. Random Features

Random Fourier features (RFF) (Rahimi & Recht, 2007) is a method that builds an embedding $\phi_{\mathrm{rff}}: \mathbb{R}^d \to \mathbb{R}^D$ given a shift-invariant kernel $k: \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ such that, for all $x, y \in \mathbb{R}^d$, $k(x, y) \approx \langle \phi_{\mathrm{rff}}(x), \phi_{\mathrm{rff}}(y) \rangle = \phi_{\mathrm{rff}}^*(x)\phi_{\mathrm{rff}}(y)$, where $*$ is the transpose operator (later this operator will be used as the Hermitian conjugate, which for real matrices is equivalent to the transpose). One of the main applications of RFF is to speed up kernel methods, data independence being one of its main advantages.

The RFF method is based on Bochner's theorem. In layman's terms, Bochner's theorem shows that a shift-invariant positive-definite kernel $k(x - y)$ is the Fourier transform of a probability measure $p(w)$. Rahimi & Recht (2007) use this result to approximate the kernel function by designing a sampling procedure that estimates the integral of the Fourier transform. The first step is to draw $D$ iid samples $\{w_1, \dots, w_D\}$ from $p$ and $D$ iid samples $\{b_1, \dots, b_D\}$ from a uniform distribution in $[0, 2\pi]$. Then, define:

$$\phi_{\mathrm{rff}}: \mathbb{R}^d \to \mathbb{R}^D, \quad x \mapsto \sqrt{\frac{2}{D}} \left( \cos(w_1^* x + b_1), \dots, \cos(w_D^* x + b_D) \right). \quad (1)$$

Rahimi & Recht (2007) showed that the expected value of $\phi_{\mathrm{rff}}^*(x)\phi_{\mathrm{rff}}(y)$ uniformly converges to $k(x, y)$:

Theorem 2.1. (Rahimi & Recht, 2007) Let $\mathcal{M}$ be a compact subset of $\mathbb{R}^d$ with diameter $\operatorname{diam}(\mathcal{M})$. Then for the mapping $\phi_{\mathrm{rff}}$ defined above, we have

$$\Pr\left[ \sup_{x, y \in \mathcal{M}} \left|\phi_{\mathrm{rff}}^*(x)\phi_{\mathrm{rff}}(y) - k(x, y)\right| \geq \epsilon \right] \leq 2^8 \left( \frac{\sigma_p \operatorname{diam}(\mathcal{M})}{\epsilon} \right)^2 \exp\left( \frac{-D\epsilon^2}{4(d+2)} \right), \quad (2)$$

where $\sigma_p^2$ is the second moment of the Fourier transform of $k$. In particular, for the Gaussian kernel $\sigma_p^2 = 2d\gamma$, where $\gamma$ is the spread parameter (see eq. (4)).

Different approaches to compute random features for kernel approximation have been proposed based on different strategies: Monte Carlo sampling (Le et al., 2013; Yu et al., 2016), quasi-Monte Carlo sampling (Avron et al., 2016; Shen et al., 2017), and quadrature rules (Dao et al., 2017).
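As a concrete illustration, the following minimal NumPy sketch (our own, not code from the paper; the function names and parameter values are illustrative) samples RFF parameters for the Gaussian kernel $e^{-\gamma\|x-y\|^2}$ and empirically checks the approximation of eq. (1):

```python
import numpy as np

def gaussian_rff(d, D, gamma, rng):
    """Sample RFF parameters for the Gaussian kernel exp(-gamma * ||x - y||^2).

    The Fourier transform of this kernel is a Gaussian with covariance
    2 * gamma * I, so w ~ N(0, 2 * gamma * I) and b ~ Uniform[0, 2*pi].
    """
    W = rng.normal(0.0, np.sqrt(2.0 * gamma), size=(D, d))
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)
    return W, b

def phi_rff(X, W, b):
    """Map samples of shape (N, d) to RFF features of shape (N, D), eq. (1)."""
    D = W.shape[0]
    return np.sqrt(2.0 / D) * np.cos(X @ W.T + b)

rng = np.random.default_rng(0)
d, D, gamma = 5, 2000, 0.5
W, b = gaussian_rff(d, D, gamma, rng)
x, y = rng.normal(size=d), rng.normal(size=d)
exact = np.exp(-gamma * np.sum((x - y) ** 2))
approx = (phi_rff(x[None], W, b) @ phi_rff(y[None], W, b).T).item()
print(exact, approx)  # the two values should be close for large D
```

Increasing D tightens the approximation, in line with the exponential concentration of Theorem 2.1.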
2.2. Kernel Density Estimation

Kernel density estimation (KDE) (Rosenblatt, 1956; Parzen, 1962), also known as the Parzen-Rosenblatt window method, is a non-parametric density estimation method. This method does not make any particular assumption about the underlying probability density function. Given an iid set of samples $X = \{x_1, \dots, x_N\}$, the smooth Parzen's window estimate has the form

$$\hat{f}_X(x) = \frac{1}{N\lambda} \sum_{i=1}^{N} k_\lambda(x, x_i), \quad (3)$$

where $k_\lambda$ is a kernel function and $\lambda$ is the smoothing bandwidth parameter of the estimate. A small $\lambda$ implies a small degree of smoothing; in contrast, a large $\lambda$ implies a high degree of smoothing, possibly washing out some important structure.

Rosenblatt (1956) and Parzen (1962) showed that eq. (3) is an unbiased estimator of the pdf $f$. When $k_\lambda$ is the Gaussian kernel, eq. (3) takes the form

$$\hat{g}_{\gamma,X}(x) = \frac{1}{N(\pi/\gamma)^{d/2}} \sum_{i=1}^{N} e^{-\gamma\|x_i - x\|^2}. \quad (4)$$

KDE has several applications: to estimate the underlying probability density function, to estimate confidence intervals and confidence bands (Efron, 1992; Chernozhukov et al., 2014), to find local modes for geometric feature estimation (Chazal et al., 2017; Chen et al., 2016), to estimate ridges of the density function (Genovese et al., 2014), to build cluster trees (Balakrishnan et al., 2013), to estimate the cumulative distribution function (Nadaraya, 1964), and to estimate receiver operating characteristic (ROC) curves (McNeil & Hanley, 1984), among others.

2.3. Density Matrices

The state of a quantum system is represented by a vector $\psi \in \mathcal{H}$, where $\mathcal{H}$ is the Hilbert space of the possible states of the system. Usually $\mathcal{H} = \mathbb{C}^d$ (in the latter sections of this paper we will use $\mathcal{H} = \mathbb{R}^d$, but most of the results apply to the complex case).

As an example, consider a system that could be in two possible states, e.g. the spin of an electron. The state of this system is, in general, represented by a column vector $\psi = (\alpha, \beta)$, with $|\alpha|^2 + |\beta|^2 = 1$. This state represents a system that is in a superposition of two basis states, $\psi = \alpha\!\uparrow + \beta\!\downarrow$. The outcome of a measurement of this system along the $z$ axis is determined by the Born rule: the spin is up with probability $|\alpha|^2$ and down with probability $|\beta|^2$. Notice that $\alpha$ and $\beta$ could be negative or complex numbers, but the Born rule guarantees that we get valid probabilities.

The probabilities that arise from the superposition of states in the previous example are a manifestation of the uncertainty that is inherent to the nature of quantum physical systems. We call this kind of uncertainty quantum uncertainty. Another kind of uncertainty comes, for instance, from errors in the measurement or state-preparation processes; we call this uncertainty classical uncertainty. A density matrix is a formalism that allows us to represent both types of uncertainty. To illustrate it, let's go back to our previous example. The density matrix representing the state $\psi$ is:

$$\rho = \psi\psi^* = \begin{bmatrix} |\alpha|^2 & \alpha\beta^* \\ \beta\alpha^* & |\beta|^2 \end{bmatrix}, \quad (5)$$

where $\psi^*$ is the conjugate transpose of $\psi$ and $\alpha^*$ is the conjugate of $\alpha$. As a concrete example, consider $\psi_1 = \left(\frac{1}{\sqrt{2}}, -\frac{1}{\sqrt{2}}\right)$; the corresponding density matrix is:

$$\rho_1 = \psi_1\psi_1^* = \begin{bmatrix} \frac{1}{2} & -\frac{1}{2} \\ -\frac{1}{2} & \frac{1}{2} \end{bmatrix}, \quad (6)$$
which represents a superposition state where we have a $\frac{1}{2}$ probability of measuring either of the two basis states. Notice that the probabilities for each state are on the diagonal of the density matrix. $\rho_1$ is a rank-1 density matrix, which means that it represents a pure state, i.e. a state without classical uncertainty. The opposite of a pure state is a mixed state, which is represented by a density matrix of the form:

$$\rho = \sum_{i=1}^{N} p_i \psi_i \psi_i^*, \quad (7)$$

where $p_i > 0$, $\sum_{i=1}^{N} p_i = 1$, and $\{\psi_i\}$ are the states of an ensemble of quantum systems, where each system has an associated probability $p_i$. As a concrete example of a mixed state, consider two pure states $\psi_2 = (1, 0)$ and $\psi_2' = (0, 1)$, and consider a system that is prepared in state $\psi_2$ with probability $\frac{1}{2}$ and in state $\psi_2'$ with probability $\frac{1}{2}$ as well. The state of this system is represented by the following density matrix:

$$\rho_2 = \frac{1}{2}\psi_2\psi_2^* + \frac{1}{2}\psi_2'\psi_2'^* = \begin{bmatrix} \frac{1}{2} & 0 \\ 0 & \frac{1}{2} \end{bmatrix}. \quad (8)$$

At first sight, states $\rho_1$ and $\rho_2$ may seem to represent the same quantum system, one where the probability of measuring the up state (or the down state) is $\frac{1}{2}$. However, they are different systems: $\rho_1$ represents a system with only quantum uncertainty, while $\rho_2$ corresponds to a system with classical uncertainty. To observe the difference between the two systems we have to perform a measurement along a particular axis. To do so, we use the following version of the Born rule for density matrices:

$$P(\varphi|\rho) = \operatorname{Tr}(\rho\varphi\varphi^*) = \varphi^*\rho\varphi, \quad (9)$$

which calculates the probability of measuring the state $\varphi$ in a system in state $\rho$. If we set $\varphi = \left(\frac{1}{\sqrt{2}}, -\frac{1}{\sqrt{2}}\right)$ we get $P(\varphi|\rho_1) = 1$ and $P(\varphi|\rho_2) = \frac{1}{2}$, showing that in fact both systems are different.
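The difference between $\rho_1$ and $\rho_2$ is easy to reproduce numerically. The following short NumPy sketch (our own illustration, not from the paper) builds both density matrices and evaluates the Born rule of eq. (9):

```python
import numpy as np

# Pure superposition state and its rank-1 density matrix (eqs. 5-6).
psi1 = np.array([1.0, -1.0]) / np.sqrt(2.0)
rho1 = np.outer(psi1, psi1)

# Equal classical mixture of the two basis states (eq. 8).
psi2, psi2p = np.array([1.0, 0.0]), np.array([0.0, 1.0])
rho2 = 0.5 * np.outer(psi2, psi2) + 0.5 * np.outer(psi2p, psi2p)

def born(phi, rho):
    """Probability of measuring state phi on a system in state rho (eq. 9)."""
    return phi @ rho @ phi

phi = np.array([1.0, -1.0]) / np.sqrt(2.0)
print(born(phi, rho1))  # 1.0: only quantum uncertainty
print(born(phi, rho2))  # 0.5: classical uncertainty remains
```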
3. Methods
3.1. Density Matrix Kernel Density Estimation (DMKDE)

In this section we present a non-parametric method for density estimation. We start with a set of iid samples drawn from a particular distribution in $\mathbb{R}^d$. Each sample is embedded into $\mathbb{R}^D$ using RFF as a feature map (eq. (1)). Each embedded vector is treated as the state of a quantum system. A density matrix representing a mixture of all the training samples is calculated by averaging the density matrices representing the individual samples. This matrix represents the probability distribution of the samples and is used for calculating the density of new samples. The overall process is defined next:

• Input. A sample set $X = \{x_i\}_{i=1\dots N}$ with $x_i \in \mathbb{R}^d$, parameters $\gamma \in \mathbb{R}^+$ and $D \in \mathbb{N}$.

• Calculate $W_{\mathrm{rff}} = [w_1 \dots w_D]$ and $b_{\mathrm{rff}} = [b_1 \dots b_D]$ using the random Fourier features method described in Section 2.1 for approximating a Gaussian kernel with parameters $\gamma$ and $D$.

• Apply $\phi_{\mathrm{rff}}$ (eq. (1)):
$$z_i = \phi_{\mathrm{rff}}(x_i). \quad (10)$$

• Density matrix estimation:
$$\rho = \frac{1}{N} \sum_{i=1}^{N} z_i z_i^*. \quad (11)$$

The density of a new sample $x$ is calculated as:

$$\hat{f}_\rho(x) = \frac{\operatorname{Tr}(\rho\,\phi_{\mathrm{rff}}(x)\phi_{\mathrm{rff}}(x)^*)}{Z} = \frac{\phi_{\mathrm{rff}}(x)^*\rho\,\phi_{\mathrm{rff}}(x)}{Z}, \quad (12)$$

where the normalizing constant is defined as:

$$Z = \left(\frac{\pi}{\gamma}\right)^{d/2}. \quad (13)$$

The following proposition shows that $\hat{f}_\rho$, as defined in eq. (12), uniformly converges to the Gaussian kernel Parzen's estimator $\hat{g}_{\gamma,X}$ (eq. (4)). The proof can be found in the supplementary material.
Proposition 3.1.
Let $\mathcal{M}$ be a compact subset of $\mathbb{R}^d$ with diameter $\operatorname{diam}(\mathcal{M})$ and let $X = \{x_i\}_{i=1\dots N} \subset \mathcal{M}$ be a set of iid samples. Then $\hat{f}_\rho$ (eq. (12)) and $\hat{g}_{\gamma,X}$ (eq. (4)) satisfy:

$$\Pr\left[ \sup_{x \in \mathcal{M}} \left|\hat{f}_\rho(x) - \hat{g}_{\gamma,X}(x)\right| \geq \frac{\epsilon}{Z} \right] \leq 2^8 \left( \frac{2\sqrt{d\gamma}\operatorname{diam}(\mathcal{M})}{\epsilon} \right)^2 \exp\left( \frac{-D\epsilon^2}{16(d+2)} \right). \quad (14)$$

The Parzen's estimator is an unbiased estimator of the true density function from which the samples were generated, and Proposition 3.1 shows that $\hat{f}_\rho(x)$ can approximate this estimator. The density matrix estimator has an important advantage over the Parzen's estimator: its computational complexity. The time to calculate the Parzen's estimator (eq. (4)) is $O(dN)$, while the time to estimate the density based on the density matrix $\rho$ (eq. (12)) is $O(D^2)$, which is constant in the size of the sample dataset.

Notice that the vectors $z_i$ in eq. (10) are in general not normalized. We normalize these vectors, which ensures that $z_i^* z_i = 1 = e^{-\gamma\|x_i - x_i\|^2}$ and also that $\rho$ in eq. (11) is a well-defined density matrix. Empirically, we observed that this change did not affect the performance of the method and, in some cases, improved the results.

Also, the density calculation in eq. (12) can be simplified by factorizing $\rho$ as

$$\rho = V^* \Lambda V, \quad (15)$$

where $V \in \mathbb{R}^{r \times D}$, $\Lambda \in \mathbb{R}^{r \times r}$ is a diagonal matrix and $r < D$ is the reduced rank of the factorization. This is done by performing a spectral decomposition, which is possible because $\rho$ is a Hermitian matrix. The calculation in eq. (12) can then be expressed as:

$$\hat{f}_\rho(x) = \frac{1}{Z} \left\| \Lambda^{\frac{1}{2}} V \phi_{\mathrm{rff}}(x) \right\|^2. \quad (16)$$

This reduces the time to calculate the density of a new sample to $O(Dr)$.

Something interesting to notice is that the process described by eqs. (10) and (11) generalizes density estimation for variables with a categorical distribution, i.e. $x \in \{1, \dots, K\}$. To see this, we replace $\phi_{\mathrm{rff}}$ in eq. (10) by the well-known one-hot-encoding feature map:

$$\phi_{\mathrm{ohe}}: \{1, \dots, K\} \to \mathbb{R}^K, \quad i \mapsto E_i, \quad (17)$$

where $E_i$ is the unit vector with a 1 in position $i$ and 0 in the other positions. It is not difficult to see that in this case

$$\rho_{ii} = \Pr(x = i) = \frac{\left|\{x_j \mid j \in \{1, \dots, N\},\ x_j = i\}\right|}{N}. \quad (18)$$
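The whole estimation pipeline fits in a few lines. The sketch below (our own illustration, reusing gaussian_rff and phi_rff from the Section 2.1 sketch; not the authors' released code) estimates ρ from data and compares the resulting density with the Parzen estimator of eq. (4). Following our reading of Proposition 3.1, the RFF are sampled for spread γ/2, so that the squared kernel appearing in eq. (12) matches the Parzen kernel with spread γ:

```python
import numpy as np

def dmkde_fit(X, W, b):
    """Estimate the density matrix rho of eq. (11) from training samples X (N, d)."""
    feats = phi_rff(X, W, b)                               # eq. (10)
    feats /= np.linalg.norm(feats, axis=1, keepdims=True)  # normalized states
    return feats.T @ feats / X.shape[0]                    # average of z_i z_i^*

def dmkde_predict(x, rho, W, b, gamma, d):
    """Density of a new sample following eq. (12), with Z from eq. (13)."""
    z = phi_rff(x[None], W, b)[0]
    z /= np.linalg.norm(z)
    return (z @ rho @ z) / (np.pi / gamma) ** (d / 2)

rng = np.random.default_rng(1)
d, D, gamma, N = 2, 2000, 2.0, 5000
Xtr = rng.normal(size=(N, d))
# RFF sampled for spread gamma/2: the squared kernel in eq. (12) then has spread gamma.
W, b = gaussian_rff(d, D, gamma / 2.0, rng)
rho = dmkde_fit(Xtr, W, b)
x = np.zeros(d)
parzen = np.exp(-gamma * np.sum((Xtr - x) ** 2, axis=1)).mean() / (np.pi / gamma) ** (d / 2)
print(parzen, dmkde_predict(x, rho, W, b, gamma, d))       # should be close
```

Unlike the Parzen estimate, the prediction cost here does not grow with the number of training samples, which is the practical point of the method.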
3.2. Density Matrix Kernel Density Classification (DMKDC)

The extension of kernel density estimation to classification is called kernel density classification (Hastie et al., 2009). The posterior probability is calculated as

$$\hat{\Pr}(Y = j \mid X = x) = \frac{\pi_j \hat{f}_j(x)}{\sum_{k=1}^{K} \pi_k \hat{f}_k(x)}, \quad (19)$$

where $\pi_j$ and $\hat{f}_j$ are respectively the class prior and the density estimator of class $j$.

We follow this approach to define a classification model that uses the density estimation strategy based on RFF and density matrices described in the previous section. The input to the model is a vector $x \in \mathbb{R}^d$. The model is defined by the following equations:

$$z = \phi_{\mathrm{rff}}(x) = \cos(W_{\mathrm{rff}} x + b_{\mathrm{rff}}), \quad (20a)$$
$$z' = \frac{z}{\|z\|}, \quad (20b)$$
$$\tilde{y}_i = \left\|\Lambda_i^{\frac{1}{2}} V_i z'\right\|^2 \quad \forall i = 1 \dots K, \quad (20c)$$
$$\tilde{y}'_i = \frac{\pi_i \tilde{y}_i}{\sum_{j=1}^{K} \pi_j \tilde{y}_j}. \quad (20d)$$

The hyperparameters of the model are the dimension of the RFF representation $D$, the spread parameter $\gamma$ of the Gaussian kernel, the class priors $\pi_i$ and the rank $r$ of the density matrix factorization. The parameters are the weights and biases of the RFF, $W_{\mathrm{rff}} \in \mathbb{R}^{D \times d}$ and $b_{\mathrm{rff}} \in \mathbb{R}^D$, together with $V_i \in \mathbb{R}^{r \times D}$ and $\lambda_i \in \mathbb{R}^r$ for $i = 1 \dots K$.

The model can be trained using two different strategies: one, using DMKDE to estimate the density matrices of each class; two, using stochastic gradient descent over the parameters to minimize an appropriate loss function.

The training process based on density matrix estimation is as follows:

1. Use the RFF method to calculate $W_{\mathrm{rff}}$ and $b_{\mathrm{rff}}$.
2. For each class $i$:
   (a) Estimate $\pi_i$ as the relative frequency of class $i$ in the dataset.
   (b) Estimate $\rho_i$ using eq. (11) and the training samples from class $i$.
   (c) Find a factorization of rank $r$ of $\rho_i$: $\rho_i = V_i^* \Lambda_i V_i$.

Notice that this training procedure does not require any kind of iterative optimization. The training samples are used only once and the time complexity of the algorithm is linear in the number of training samples. The complexity of step 2(b) is $O(D^2 N)$ and of step 2(c) is $O(D^3)$.

Since the operations defined in eqs. (20a) to (20d) are differentiable, it is possible to use gradient descent to minimize an appropriate loss function, for instance the categorical cross entropy:

$$L = -\sum_{i=1}^{K} y_i \log(\tilde{y}'_i), \quad (21)$$

where $y = (y_1, \dots, y_K)$ corresponds to the one-hot encoding of the real label of the sample $x$. The version of the model trained with gradient descent is called DMKDC-SGD.

An advantage of this approach is that the model can be jointly trained with another differentiable architecture, such as a deep learning feature extractor.
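A minimal sketch of the estimation-based training and the prediction rule (again our own illustration built on the earlier phi_rff helper, not the authors' released code):

```python
import numpy as np

def dmkdc_fit(X, y, W, b, K, r):
    """Estimation-based DMKDC training: one class-conditional density matrix per
    class (steps 2(a)-2(c)), factorized with a rank-r spectral decomposition."""
    feats = phi_rff(X, W, b)
    feats /= np.linalg.norm(feats, axis=1, keepdims=True)
    priors, factors = [], []
    for i in range(K):
        Zi = feats[y == i]
        priors.append(len(Zi) / len(X))              # class prior, step 2(a)
        rho_i = Zi.T @ Zi / len(Zi)                  # eq. (11), step 2(b)
        vals, vecs = np.linalg.eigh(rho_i)           # spectral decomposition, step 2(c)
        factors.append((vals[-r:], vecs[:, -r:].T))  # keep the top-r components
    return np.array(priors), factors

def dmkdc_predict(x, priors, factors, W, b):
    """Posterior over the K classes following eqs. (20a)-(20d)."""
    z = phi_rff(x[None], W, b)[0]
    z /= np.linalg.norm(z)                           # eq. (20b)
    scores = np.array([lam @ (V @ z) ** 2 for lam, V in factors])  # eq. (20c)
    post = priors * scores
    return post / post.sum()                         # eq. (20d)
```

The SGD variant would instead treat the factors (and optionally W, b) as trainable tensors and minimize eq. (21) in a framework such as Tensorflow.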
3.3. Quantum Measurement Classification (QMC)

In DMKDC we assume a categorical distribution for the output variable. If we want a more general probability distribution we need to define a more general classification model. The idea is to model the joint probability of inputs and outputs using a density matrix. This density matrix represents the state of a bipartite system whose representation space is $\mathcal{H}_X \otimes \mathcal{H}_Y$, where $\mathcal{H}_X$ is the representation space of the inputs, $\mathcal{H}_Y$ is the representation space of the outputs and $\otimes$ is the tensor product. A prediction is made by performing a measurement with an operator specifically prepared from a new input sample. This model is based on the one described by González et al. (2020) and works as follows:

• Input encoding. The input $x \in \mathbb{R}^d$ is encoded using a feature map $\phi_X$:
$$z = \phi_X(x). \quad (22)$$

• Measurement operator. The effect of this measurement operator is to collapse, using a projector onto $z$, the $\mathcal{H}_X$ part of the bipartite system while keeping the $\mathcal{H}_Y$ part unmodified. This is done by defining the following operator:
$$\pi = zz^* \otimes \mathrm{Id}_{\mathcal{H}_Y}, \quad (23)$$
where $\mathrm{Id}_{\mathcal{H}_Y}$ is the identity in $\mathcal{H}_Y$.

• Apply the measurement operator to the training density matrix:
$$\rho = \frac{\pi \rho_{\mathrm{train}} \pi}{\operatorname{Tr}[\pi \rho_{\mathrm{train}} \pi]}, \quad (24)$$
where $\rho_{\mathrm{train}}$ can be built, for instance, as in eq. (11), or can be a parameter density matrix, as will be shown next.

• Calculate the partial trace of $\rho$ to obtain a density matrix that encodes the prediction:
$$\rho_Y = \operatorname{Tr}_X[\rho]. \quad (25)$$

The parameter of the model, without taking into account the parameters of the feature maps, is the density matrix $\rho_{\mathrm{train}} \in \mathbb{R}^{D_X D_Y \times D_X D_Y}$, where $D_X$ and $D_Y$ are the dimensions of $\mathcal{H}_X$ and $\mathcal{H}_Y$ respectively. As discussed in Section 3.1, the density matrix $\rho_{\mathrm{train}}$ can be factorized as:
$$\rho_{\mathrm{train}} = V_{\mathrm{train}}^* \Lambda V_{\mathrm{train}}, \quad (26)$$
where $V_{\mathrm{train}} \in \mathbb{R}^{r \times D_X D_Y}$, $\Lambda \in \mathbb{R}^{r \times r}$ is a diagonal matrix and $r < D_X D_Y$ is the reduced rank of the factorization. This factorization not only helps to reduce the space necessary to store the parameters; learning $V_{\mathrm{train}}$ and $\Lambda$, instead of $\rho_{\mathrm{train}}$, also helps to guarantee that $\rho_{\mathrm{train}}$ is a valid density matrix.

As in Subsection 3.2, we describe two different approaches to train the system: one based on estimation of $\rho_{\mathrm{train}}$ and one based on learning $\rho_{\mathrm{train}}$ using gradient descent.

In the estimation strategy, given a training dataset $\{(x_i, y_i)\}_{i=1\dots N}$, the training density matrix is calculated by:
$$\rho_{\mathrm{train}} = \frac{1}{N} \sum_{i=1}^{N} \left(\phi_X(x_i) \otimes \phi_Y(y_i)\right)\left(\phi_X(x_i) \otimes \phi_Y(y_i)\right)^*. \quad (27)$$
The computational cost is $O(N D_X^2 D_Y^2)$.

For the gradient-descent-based strategy (QMC-SGD) we can minimize the following loss function:
$$L = -\sum_{i=1}^{D_Y} y_i \log(\rho_{Y,ii}), \quad (28)$$
where $\rho_{Y,ii}$ is the $i$-th diagonal element of $\rho_Y$.

As in DMKDC-SGD, this model can be combined with a deep learning architecture and the parameters can be jointly learned using gradient descent.

QMC can be used with different feature maps for inputs and outputs. For instance, if $\phi_X = \phi_{\mathrm{rff}}$ (eq. (1)) and $\phi_Y = \phi_{\mathrm{ohe}}$ (eq. (17)), QMC corresponds to DMKDC. However, in this case DMKDC is preferred because of its reduced computational cost.
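The prediction step of QMC can be written compactly with a Kronecker product and a partial trace. The following sketch (our own; the dimensions and the toy ρ_train are hypothetical) implements eqs. (23) to (25):

```python
import numpy as np

def qmc_predict(z_x, rho_train, Dx, Dy):
    """QMC prediction: collapse the input part with the projector of eq. (23),
    renormalize (eq. 24) and take the partial trace over the inputs (eq. 25).

    z_x: input feature vector of dimension Dx; rho_train: (Dx*Dy, Dx*Dy)."""
    pi_op = np.kron(np.outer(z_x, z_x), np.eye(Dy))   # eq. (23)
    rho = pi_op @ rho_train @ pi_op
    rho /= np.trace(rho)                              # eq. (24)
    # Partial trace over H_X: reshape to (Dx, Dy, Dx, Dy) and contract the X indices.
    rho_Y = np.einsum('ijik->jk', rho.reshape(Dx, Dy, Dx, Dy))  # eq. (25)
    return np.diag(rho_Y)                             # categorical prediction

# Toy usage with a random (but valid) training density matrix.
rng = np.random.default_rng(2)
Dx, Dy = 4, 3
A = rng.normal(size=(Dx * Dy, Dx * Dy))
rho_train = A @ A.T
rho_train /= np.trace(rho_train)                      # unit trace, PSD
z = rng.normal(size=Dx); z /= np.linalg.norm(z)
print(qmc_predict(z, rho_train, Dx, Dy))              # non-negative, sums to 1
```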
3.4. Quantum Measurement Regression (QMR)

In this section we show how to use QMC to perform regression. For this we will use a feature map that allows us to encode continuous values (without loss of generality, the continuous variable to be encoded is restricted to the $[0, 1]$ interval). The feature map is defined with the help of $D$ equally distributed landmarks in the $[0, 1]$ interval:

$$\alpha_i = \frac{i - 1}{D - 1} \quad \text{for } i = 1 \dots D. \quad (29)$$

The following function (which is equivalent to a softmax) defines a set of unimodal probability density functions centered at each landmark:

$$p_i(x) = \frac{\exp\left(-\beta\|x - \alpha_i\|^2\right)}{\sum_{j=1}^{D} \exp\left(-\beta\|x - \alpha_j\|^2\right)}, \quad i = 1 \dots D, \quad (30)$$

where $\beta$ controls the shape of the density functions.

The feature map is defined as:

$$\phi_{\mathrm{sm}}: [0, 1] \to \mathbb{R}^D, \quad x \mapsto \left(\sqrt{p_1(x)}, \dots, \sqrt{p_D(x)}\right). \quad (31)$$

This feature map is used in QMC as the feature map of the output variable ($\phi_Y$). To calculate the prediction for a new sample $x$ we apply the process described in Subsection 3.3 to obtain $\rho_Y$. Then the prediction is given by:

$$\hat{y} = \mathbb{E}_{\rho_Y}[\alpha] = \sum_{i=1}^{D} \rho_{Y,ii}\,\alpha_i. \quad (32)$$

Note that this framework also allows us to easily compute confidence intervals for the prediction. The model can be trained using the strategies discussed in Subsection 3.3. For gradient-based optimization we use a mean squared error loss function augmented with a variance term:

$$L = (y - \hat{y})^2 + \alpha \sum_{i=1}^{D} \rho_{Y,ii} (\hat{y} - \alpha_i)^2, \quad (33)$$

where the second term corresponds to the variance of the prediction and $\alpha$ controls the trade-off between error and variance.
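The output feature map and the point estimate of eq. (32) are straightforward to implement. The sketch below (our own illustration with hypothetical parameter values) encodes a scalar and decodes it back:

```python
import numpy as np

def phi_sm(x, D, beta):
    """Softmax feature map of eqs. (29)-(31) for a scalar x in [0, 1]."""
    alphas = np.arange(D) / (D - 1)          # landmarks, eq. (29)
    logits = -beta * (x - alphas) ** 2
    p = np.exp(logits - logits.max())
    p /= p.sum()                             # eq. (30)
    return np.sqrt(p), alphas                # eq. (31)

def qmr_point_estimate(rho_Y_diag, alphas):
    """Expected value (eq. 32) and the variance term used in eq. (33)."""
    y_hat = rho_Y_diag @ alphas
    var = rho_Y_diag @ (alphas - y_hat) ** 2
    return y_hat, var

# Toy check: encoding a value and decoding it recovers roughly the same value.
phi, alphas = phi_sm(0.37, D=32, beta=200.0)
y_hat, var = qmr_point_estimate(phi ** 2, alphas)
print(y_hat, var)   # y_hat close to 0.37, small variance
```

The reported variance is what makes confidence estimates cheap in this framework: it is read directly off the predicted distribution.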
4. Related Work
The ability of density matrices to represent probability distributions has been used in previous works. The early work by Wolf (2006) uses the density matrix formalism to perform spectral clustering, and shows that this formalism not only is able to predict cluster labels for the objects being classified, but also provides the probability that the object belongs to each of the clusters. Similarly, Tiwari & Melucci (2019) proposed a quantum-inspired binary classifier using density matrices, where samples are encoded into pure quantum states. In a similar fashion, Sergioli et al. (2018) proposed a quantum nearest mean classifier based on the trace distance between the quantum state of a sample and a quantum centroid that is a mixed state of the pure quantum states of all samples belonging to a single class. Another class of proposals directly combines these quantum ideas with customary machine learning techniques, such as frameworks for multimodal learning for sentiment analysis (Li et al., 2021; Li et al., 2020; Zhang et al., 2018).

Since their inception, random features have been used to improve the performance of several kernel methods: kernel ridge regression (Avron et al., 2017), support vector machines (SVM) (Sun et al., 2018), and nonlinear component analysis (Xie et al., 2015). Besides, random features have been used in conjunction with deep learning architectures in different works (Arora et al., 2019; Ji & Telgarsky, 2019; Li et al., 2019).

The combination of RFF and density matrices was initially proposed by González et al. (2020). In that work, RFF are used as a quantum feature map, among others, and the QMC method (Subsection 3.3) was presented. The emphasis of that work is to show that quantum measurement can be used to do supervised learning. In contrast, the present paper addresses a wider problem with several new contributions: a new method for density estimation based on density matrices and RFF, the proof of the connection between this method and kernel density estimation, and new differentiable models for density estimation, classification and regression.

The present work can be seen as a type of quantum machine learning (QML), which generally refers to the field at the intersection of quantum computing and machine learning (Schuld et al., 2015; Schuld, 2018). In particular, the methods in this paper are in the subcategory of QML called quantum-inspired classical machine learning, where theory and methods from quantum physics are borrowed and adapted to machine learning methods intended to run on classical computers. Works in this category include: quantum-inspired recommendation systems (Tang, 2019a), quantum-inspired kernel-based classification methods (Tiwari et al., 2020; González et al., 2020), conversational sentiment analysis based on density matrix-like convolutional neural networks (Zhang et al., 2019), and dequantised principal component analysis (Tang, 2019b), among others.

Being a memory-based strategy, KDE struggles with large-scale, high-dimensional data. Due to this issue, fast approximate evaluation of non-parametric density estimation is an active research topic. Different approaches are proposed in the literature: higher-order divide-and-conquer methods (Gray & Moore, 2003), separation of near and far field (pruning) (March et al., 2015), and hashing-based estimators (HBE) (Charikar & Siminelakis, 2017).
Even though the purpose of the present work was not to design methods for fast approximation of KDE, the use of RFF to speed up KDE seems to be a promising research direction. Comparing DMKDE against fast KDE approximation methods is part of our future work.
5. Experimental Evaluation
In this section we perform experiments to evaluate the performance of the proposed methods on different benchmark tasks. The experiments are organized in two subsections: classification evaluation and ordinal regression evaluation.
5.1. Classification Evaluation

In this set of experiments, we evaluated DMKDC over different well-known benchmark classification datasets.

5.1.1. Data Sets and Experimental Setup
Six benchmark data sets were used. The details of the data sets are reported in the Supplementary Material. In the case of Gisette and Cifar, we applied principal component analysis using 400 principal components in order to reduce the dimension. DMKDC was trained using the estimation strategy (DMKDC) and an ADAM stochastic gradient descent strategy (DMKDC-SGD). As a baseline, we compared against a linear support vector machine (SVM) trained using the same RFF as DMKDC. In the case of MNIST and Cifar, we additionally built a union of a LeNet architecture (LeCun et al., 1989) as a feature extraction layer and DMKDC as the classifier layer. The LeNet layer is composed of two convolutional layers, one flatten layer and one dense layer. The former convolutional layer has 20 filters, a kernel size of 5, same padding, and ReLU as the activation function. The latter convolutional layer has 50 filters, a kernel size of 5, same padding, and ReLU as the activation function. The dense layer has 84 units and ReLU as the activation function. The dense layer is finally connected to DMKDC. We report results for the combined model (LeNet DMKDC-SGD) and the LeNet model with a softmax output layer (LeNet).

For each data set, we made a hyperparameter search using five-fold cross-validation with 25 randomly generated configurations. The number of RFF was set to 1000 for all the methods. For each dataset we calculated the inter-sample median distance µ and defined a search interval for γ around the value suggested by µ. The C parameter of the SVM was explored on an exponential scale. For the ADAM optimizer in DMKDC-SGD, with and without LeNet, the learning rate was chosen from a small positive interval. The RFF layer was always set to trainable. The number of eigen-components of the factorization was chosen from a set of four values, each representing a fraction of the number of RFF. After finding the best hyperparameter configuration using cross-validation, 10 different experiments were performed with different random initializations. The mean and the standard deviation of the accuracy are reported.

5.1.2. Results and Discussion
Table 1 shows the results of the classification experiments. DMKDC is a shallow method that uses RFF, so an SVM using the same RFF is a fair and strong baseline. In all the cases except one, DMKDC-SGD outperforms the SVM, which shows that it is a very competitive method. DMKDC trained using estimation shows less competitive results, but they are still remarkable taking into account that this is an optimization-less training strategy that passes only once over the training dataset. For MNIST and Cifar, the use of a deep learning feature extractor is mandatory to obtain competitive results. The results show that DMKDC can be integrated with deeper neural network architectures to reach these results.
5.2. Ordinal Regression Evaluation

Many multi-class classification problems can be seen as ordinal regression problems, that is, problems where labels not only indicate class membership but also an order. Ordinal regression problems are halfway between a classification problem and a regression problem, and, given the discrete probability distribution representation used in QMR, ordinal regression seems to be a suitable problem to test it.

5.2.1. Data Sets and Experimental Setup
Nine standard benchmark data sets for ordinal regression were used. The details of each data set are reported in the Supplementary Material. These data sets were originally used in metric regression tasks. To convert the task into an ordinal regression one, the target values were discretized by taking five intervals of equal length over the target range. For each set, 20 different train and test partitions were made. These partitions are the same used by Chu & Ghahramani (2005) and several posterior works, and are publicly available. The models were evaluated using the mean absolute error (MAE), which is a popular and widely used measure in ordinal regression (Gutiérrez et al., 2016; Garg & Manwani, 2020).

QMR was trained using the estimation strategy (QMR) and an ADAM stochastic gradient descent strategy (QMR-SGD). For each data set, and for each one of the 20 partitions, we made a hyperparameter search using a five-fold cross-validation procedure. The search was done by generating 25 different random configurations. The range for γ was computed in the same way as for the classification experiments, β was sampled from a fixed interval, the number of RFF was randomly chosen between the number of attributes and a fixed upper bound, and the number of eigen-components of the factorization was chosen from a set of four values, each representing a fraction of the number of RFF. For the ADAM optimizer in QMR-SGD we chose the learning rate and the trade-off parameter α from fixed intervals. The RFF layer was always set to trainable, and the criterion for selecting the best parameter configuration was the MAE performance.
Table 1. Accuracy test results for DMKDC. Columns: SVM-RFF, DMKDC, DMKDC-SGD, LeNet, LeNet DMKDC-SGD; rows: Letters, USPS, Forest, Gisette, MNIST, Cifar. [Numeric entries not recoverable from the extracted source.]

Table 2. MAE test results for QMR, QMR-SGD and different baseline methods: support vector machines (SVM), Gaussian Processes (GP), Neural Network Rank (NNRank), Ordinal Extreme Learning Machines (ORELM) and Ordinal Regression Neural Network (ORNN). The results are the mean and standard deviation of the MAE for the twenty partitions of each dataset. The best result for each data set is in bold face. Columns: QMR-SGD, QMR, ORNN, ORELM, NNRank, GP, SVM; rows: Diabetes, Pyrimidines, Triazines, Wisconsin, Machine, Auto, Boston, Stocks, Abalone. [Numeric entries not recoverable from the extracted source.]
5.2.2. Results and Discussion
For each data set, the means and standard deviations of the test MAE for the 20 partitions are reported in Table 2, together with the results of previous state-of-the-art works on ordinal regression: Gaussian Processes (GP) and support vector machines (SVM) (Chu & Ghahramani, 2005), Neural Network Rank (NNRank) (Cheng et al., 2008), Ordinal Extreme Learning Machines (ORELM) (Deng et al., 2010) and Ordinal Regression Neural Network (ORNN) (Fernandez-Navarro et al., 2014).

QMR-SGD shows a very competitive performance. It outperforms the baseline methods in six out of the nine data sets. The training strategy based on estimation, QMR, did not perform as well. This evidences that for this problem a fine-tuning of the representation is required, and that it is successfully accomplished by the gradient descent optimization.
6. Conclusions
The mathematical framework underlying quantum mechanics is a powerful formalism that harmoniously combines linear algebra and probability in the form of density matrices. This paper has shown how to use these density matrices as a building block for designing different machine learning models. The main contribution of this work is to show a novel perspective on learning that combines two very different and seemingly unrelated tools: random features and density matrices. The somewhat surprising connection of this combination with kernel density estimation provides a new way of representing and learning probability density functions from data. The experimental results showed some evidence that this building block can be used to build competitive models for some particular tasks. However, the full potential of this new perspective is still to be explored. Examples of directions of future inquiry include using complex-valued density matrices, exploring the role of entanglement, and exploiting the battery of practical and theoretical tools provided by quantum information theory.
Acknowledgements
The authors gratefully acknowledge the help of Diego Useche with code testing and optimization.
References
Arora, S., Du, S. S., Hu, W., Li, Z., Salakhutdinov, R., and Wang, R. On exact computation with an infinitely wide neural net. arXiv preprint arXiv:1904.11955, 2019.

Avron, H., Sindhwani, V., Yang, J., and Mahoney, M. W. Quasi-Monte Carlo feature maps for shift-invariant kernels. Journal of Machine Learning Research, 17(120):1–38, 2016. URL http://jmlr.org/papers/v17/14-538.html.

Avron, H., Kapralov, M., Musco, C., Musco, C., Velingker, A., and Zandieh, A. Random Fourier features for kernel ridge regression: Approximation bounds and statistical guarantees. In International Conference on Machine Learning, pp. 253–262. PMLR, 2017.

Balakrishnan, S., Narayanan, S., Rinaldo, A., Singh, A., and Wasserman, L. Cluster trees on manifolds. arXiv preprint arXiv:1307.6515, 2013.

Charikar, M. and Siminelakis, P. Hashing-based-estimators for kernel density in high dimensions. In 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS), pp. 1032–1043. IEEE, 2017.

Chazal, F., Fasy, B., Lecci, F., Michel, B., Rinaldo, A., and Wasserman, L. Robust topological inference: Distance to a measure and kernel distance. The Journal of Machine Learning Research, 18(1):5845–5884, 2017.

Chen, Y.-C., Genovese, C. R., and Wasserman, L. A comprehensive approach to mode clustering. Electronic Journal of Statistics, 10(1):210–241, 2016.

Cheng, J., Wang, Z., and Pollastri, G. A neural network approach to ordinal regression. In Proceedings of the International Joint Conference on Neural Networks, pp. 1279–1284, 2008. doi: 10.1109/IJCNN.2008.4633963.

Chernozhukov, V., Chetverikov, D., and Kato, K. Gaussian approximation of suprema of empirical processes. Annals of Statistics, 42(4):1564–1597, 2014.

Chu, W. and Ghahramani, Z. Gaussian processes for ordinal regression. Journal of Machine Learning Research, 6:1019–1041, 2005.

Dao, T., De Sa, C., and Ré, C. Gaussian quadrature for kernel features. Advances in Neural Information Processing Systems, 30:6109, 2017.

Deng, W. Y., Zheng, Q. H., Lian, S., Chen, L., and Wang, X. Ordinal extreme learning machine. Neurocomputing, 74(1-3):447–456, 2010. doi: 10.1016/j.neucom.2010.08.022.

Efron, B. Bootstrap methods: another look at the jackknife. In Breakthroughs in Statistics, pp. 569–593. Springer, 1992.

Fernandez-Navarro, F., Riccardi, A., and Carloni, S. Ordinal neural networks without iterative tuning. IEEE Transactions on Neural Networks and Learning Systems, 25(11):2075–2085, 2014. doi: 10.1109/TNNLS.2014.2304976.

Garg, B. and Manwani, N. Robust deep ordinal regression under label noise. In Pan, S. J. and Sugiyama, M. (eds.), Proceedings of The 12th Asian Conference on Machine Learning, volume 129 of Proceedings of Machine Learning Research, pp. 782–796, Bangkok, Thailand, 2020. PMLR. URL http://proceedings.mlr.press/v129/garg20a.html.

Genovese, C. R., Perone-Pacifico, M., Verdinelli, I., and Wasserman, L. Nonparametric ridge estimation. Annals of Statistics, 42(4):1511–1545, 2014.

González, F. A., Vargas-Calderón, V., and Vinck-Posada, H. Supervised learning with quantum measurements. 2020. URL http://arxiv.org/abs/2004.01227.

Gray, A. G. and Moore, A. W. Nonparametric density estimation: Toward computational tractability. In Proceedings of the 2003 SIAM International Conference on Data Mining, pp. 203–211. SIAM, 2003.

Gutiérrez, P. A., Pérez-Ortiz, M., Sánchez-Monedero, J., Fernández-Navarro, F., and Hervás-Martínez, C. Ordinal regression methods: Survey and experimental study. IEEE Transactions on Knowledge and Data Engineering, 28(1):127–146, 2016. doi: 10.1109/TKDE.2015.2457911.

Hastie, T., Tibshirani, R., and Friedman, J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Science & Business Media, 2009.

Ji, Z. and Telgarsky, M. Polylogarithmic width suffices for gradient descent to achieve arbitrarily small test error with shallow ReLU networks. arXiv preprint arXiv:1909.12292, 2019.

Le, Q., Sarlos, T., and Smola, A. Fastfood - approximating kernel expansions in loglinear time. In Proceedings of the 30th International Conference on Machine Learning, 2013. URL http://jmlr.org/proceedings/papers/v28/le13.html.

LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., and Jackel, L. D. Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4):541–551, 1989.

Li, C.-L., Chang, W.-C., Mroueh, Y., Yang, Y., and Póczos, B. Implicit kernel learning. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 2007–2016. PMLR, 2019.

Li, Q., Stefani, A., Toto, G., Di Buccio, E., and Melucci, M. Towards multimodal sentiment analysis inspired by the quantum theoretical framework. In 2020 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR), pp. 177–180, 2020. doi: 10.1109/MIPR49039.2020.00044.

Li, Q., Gkoumas, D., Lioma, C., and Melucci, M. Quantum-inspired multimodal fusion for video sentiment analysis. Information Fusion, 65:58–71, 2021. doi: 10.1016/j.inffus.2020.08.006.

March, W. B., Xiao, B., and Biros, G. ASKIT: Approximate skeletonization kernel-independent treecode in high dimensions. SIAM Journal on Scientific Computing, 37(2):A1089–A1110, 2015.

McNeil, B. J. and Hanley, J. A. Statistical approaches to the analysis of receiver operating characteristic (ROC) curves. Medical Decision Making, 4(2):137–150, 1984.

Nadaraya, E. A. Some new estimates for distribution functions. Theory of Probability & Its Applications, 9(3):497–500, 1964.

Parzen, E. On estimation of a probability density function and mode. The Annals of Mathematical Statistics, 33(3):1065–1076, 1962.

Rahimi, A. and Recht, B. Random features for large-scale kernel machines. In Proceedings of the 20th International Conference on Neural Information Processing Systems, NIPS'07, pp. 1177–1184, Red Hook, NY, USA, 2007. Curran Associates Inc. ISBN 9781605603520.

Rosenblatt, M. Remarks on some nonparametric estimates of a density function. Annals of Mathematical Statistics, 27(3):832–837, 1956. doi: 10.1214/aoms/1177728190.

Schuld, M. Supervised Learning with Quantum Computers. Springer, 2018.

Schuld, M., Sinayskiy, I., and Petruccione, F. An introduction to quantum machine learning. Contemporary Physics, 56(2):172–185, 2015.

Sergioli, G., Santucci, E., Didaci, L., Miszczak, J. A., and Giuntini, R. A quantum-inspired version of the nearest mean classifier. Soft Computing, 22(3):691–705, 2018. doi: 10.1007/s00500-016-2478-2.

Shen, W., Yang, Z., and Wang, J. Random features for shift-invariant kernels with moment matching. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 31, 2017.

Sun, Y., Gilbert, A., and Tewari, A. But how does it work in theory? Linear SVM with random features. arXiv preprint arXiv:1809.04481, 2018.

Tang, E. A quantum-inspired classical algorithm for recommendation systems. In Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing, pp. 217–228, 2019a.

Tang, E. Quantum-inspired classical algorithms for principal component analysis and supervised clustering. 2019b.

Tiwari, P. and Melucci, M. Towards a quantum-inspired binary classifier. IEEE Access, 7:42354–42372, 2019. doi: 10.1109/ACCESS.2019.2904624.

Tiwari, P., Dehdashti, S., Obeid, A. K., Melucci, M., and Bruza, P. Kernel method based on non-linear coherent state. arXiv preprint arXiv:2007.07887, 2020.

Von Neumann, J. Wahrscheinlichkeitstheoretischer Aufbau der Quantenmechanik. Nachrichten von der Gesellschaft der Wissenschaften zu Göttingen, Mathematisch-Physikalische Klasse, 1927:245–272, 1927.

Wolf, L. Learning using the Born rule. Technical report, Massachusetts Institute of Technology Computer Science and Artificial Intelligence Laboratory, 2006.

Xie, B., Liang, Y., and Song, L. Scale up nonlinear component analysis with doubly stochastic gradients. arXiv preprint arXiv:1504.03655, 2015.

Yu, F. X. X., Suresh, A. T., Choromanski, K. M., Holtmann-Rice, D. N., and Kumar, S. Orthogonal random features. In Lee, D., Sugiyama, M., Luxburg, U., Guyon, I., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 29, pp. 1975–1983. Curran Associates, Inc., 2016. URL https://proceedings.neurips.cc/paper/2016/file/53adaf494dc89ef7196d73636eb2451b-Paper.pdf.

Zhang, Y., Song, D., Zhang, P., Wang, P., Li, J., Li, X., and Wang, B. A quantum-inspired multimodal sentiment analysis framework. Theoretical Computer Science, 752:21–40, 2018. doi: 10.1016/j.tcs.2018.04.029.

Zhang, Y., Li, Q., Song, D., Zhang, P., and Wang, P. Quantum-inspired interactive networks for conversational sentiment analysis. In Proceedings of the 28th International Joint Conference on Artificial Intelligence (IJCAI), 2019. URL http://oro.open.ac.uk/61755/.

Supplementary Material

7. Proofs
Proposition 3.1.
Let $\mathcal{M}$ be a compact subset of $\mathbb{R}^d$ with diameter $\operatorname{diam}(\mathcal{M})$ and let $X = \{x_i\}_{i=1\dots N} \subset \mathcal{M}$ be a set of iid samples. Then $\hat{f}_\rho$ (eq. (12)) and $\hat{g}_{\gamma,X}$ (eq. (4)) satisfy:

$$\Pr\left[ \sup_{x \in \mathcal{M}} \left|\hat{f}_\rho(x) - \hat{g}_{\gamma,X}(x)\right| \geq \frac{\epsilon}{Z} \right] \leq 2^8 \left( \frac{2\sqrt{d\gamma}\operatorname{diam}(\mathcal{M})}{\epsilon} \right)^2 \exp\left( \frac{-D\epsilon^2}{16(d+2)} \right).$$

Proof. Unfolding eq. (12):

$$\hat{f}_\rho(x) = \frac{1}{Z}\operatorname{Tr}(\rho\,\phi_{\mathrm{rff}}(x)\phi_{\mathrm{rff}}(x)^*) = \frac{1}{Z}\operatorname{Tr}\left(\phi_{\mathrm{rff}}(x)^*\left(\frac{1}{N}\sum_{i=1}^{N} z_i z_i^*\right)\phi_{\mathrm{rff}}(x)\right) = \frac{1}{ZN}\sum_{i=1}^{N}\left(\phi_{\mathrm{rff}}(x)^*\phi_{\mathrm{rff}}(x_i)\right)^2. \quad (34)$$

Here the RFF map is built to approximate the Gaussian kernel with spread parameter $\gamma/2$, for which $\sigma_p^2 = d\gamma$. Applying Theorem 2.1 with $\epsilon/2$ we know that

$$\Pr\left[ \sup_{x,y \in \mathcal{M}} \left|\phi_{\mathrm{rff}}^*(x)\phi_{\mathrm{rff}}(y) - e^{-\frac{\gamma}{2}\|x-y\|^2}\right| \geq \frac{\epsilon}{2} \right] \leq 2^8 \left( \frac{2\sqrt{d\gamma}\operatorname{diam}(\mathcal{M})}{\epsilon} \right)^2 \exp\left( \frac{-D\epsilon^2}{16(d+2)} \right) =: B.$$

By construction $\left|\phi_{\mathrm{rff}}^*(x)\phi_{\mathrm{rff}}(y) + e^{-\frac{\gamma}{2}\|x-y\|^2}\right| \leq 2$, so

$$\left|\left(\phi_{\mathrm{rff}}^*(x)\phi_{\mathrm{rff}}(y)\right)^2 - e^{-\gamma\|x-y\|^2}\right| = \left|\phi_{\mathrm{rff}}^*(x)\phi_{\mathrm{rff}}(y) - e^{-\frac{\gamma}{2}\|x-y\|^2}\right| \cdot \left|\phi_{\mathrm{rff}}^*(x)\phi_{\mathrm{rff}}(y) + e^{-\frac{\gamma}{2}\|x-y\|^2}\right| \leq 2\left|\phi_{\mathrm{rff}}^*(x)\phi_{\mathrm{rff}}(y) - e^{-\frac{\gamma}{2}\|x-y\|^2}\right|.$$

Then

$$\Pr\left[ \sup_{x,y \in \mathcal{M}} \left|\left(\phi_{\mathrm{rff}}^*(x)\phi_{\mathrm{rff}}(y)\right)^2 - e^{-\gamma\|x-y\|^2}\right| \geq \epsilon \right] \leq B. \quad (35)$$

Combining eqs. (34) and (35), and noticing that $\hat{g}_{\gamma,X}(x) = \frac{1}{ZN}\sum_{i=1}^{N} e^{-\gamma\|x_i - x\|^2}$, we get:

$$\Pr\left[ \sup_{x \in \mathcal{M}} \left|\hat{f}_\rho(x) - \hat{g}_{\gamma,X}(x)\right| \geq \frac{\epsilon}{Z} \right] \leq B. \qquad \square$$
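As an informal numerical check of the proposition (ours, not part of the original supplementary material), one can measure the empirical sup-gap between $\hat{f}_\rho$ and $\hat{g}_{\gamma,X}$ on a grid and verify that it shrinks as $D$ grows. As in the proof, the RFF are sampled for spread $\gamma/2$:

```python
import numpy as np

# Empirical check of Proposition 3.1 in 1-D: the sup-gap between the density
# matrix estimator and the Parzen estimator should shrink as D grows.
rng = np.random.default_rng(3)
gamma, N = 4.0, 2000
Xtr = rng.normal(size=(N, 1))
grid = np.linspace(-3, 3, 200)[:, None]
Zc = (np.pi / gamma) ** 0.5                       # eq. (13) with d = 1
parzen = np.exp(-gamma * (grid - Xtr.T) ** 2).mean(axis=1) / Zc

for D in (100, 1000, 10000):
    # RFF for spread gamma/2: w ~ N(0, gamma), b ~ Uniform[0, 2*pi].
    W = rng.normal(0.0, np.sqrt(gamma), size=(D, 1))
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)
    feats = np.sqrt(2.0 / D) * np.cos(Xtr @ W.T + b)
    feats /= np.linalg.norm(feats, axis=1, keepdims=True)
    rho = feats.T @ feats / N                     # eq. (11)
    phi_g = np.sqrt(2.0 / D) * np.cos(grid @ W.T + b)
    phi_g /= np.linalg.norm(phi_g, axis=1, keepdims=True)
    f_rho = np.einsum('nd,de,ne->n', phi_g, rho, phi_g) / Zc  # eq. (12)
    print(D, np.max(np.abs(f_rho - parzen)))      # sup-gap should decrease
```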
8. Additional Experiments and Experimentation Details
In this section we present additional experiments, experimental details and experiment analyses. The source code of the methods and the scripts of the experiments are available at https://drive.google.com/drive/folders/16pHMLjIvr6v1zY6cMvo11EqMAMqjn3Xa as Jupyter notebooks.
The goal of these experiments is to evaluate the efficacy and efficiency of DMKDE at approximating a pdf. We compare it against conventional Gaussian KDE. We used three datasets:

• 1-D synthetic. The first synthetic dataset corresponds to a mixture of univariate Gaussians, as shown in Figure 1. The mixture weights are 0.3 and 0.7 respectively and the parameters are (µ = 0, σ = 1) and (µ = 5, σ = 1). We generated 10,000 samples for training and used as test dataset 1,000 equally spaced samples.

• 2-D synthetic. This dataset corresponds to three spirals, as depicted in Figure 3. The training and test datasets have 10,000 and 1,000 points respectively, all of them generated with the same stochastic procedure.

• MNIST dataset. We used PCA to reduce the original 784 dimensions to 40. The resulting vectors were scaled to [0, 1]. We used stratified sampling to choose 10,000 and 1,000 samples for training and testing respectively.

We performed two types of experiments over the three datasets. In the first, we wanted to evaluate the accuracy of DMKDE. In the second, we evaluated the time to predict the density on the test set.

In the first experiment, DMKDE was run with different numbers of RFF to see how the dimension of the RFF representation affected the accuracy of the estimation. For the 1-D dataset, both the DMKDE prediction and the KDE prediction were compared against the true pdf using root mean squared error (RMSE). For the 2-D dataset the RMSE between the DMKDE prediction and the KDE prediction was evaluated. In the case of MNIST, and because of the small values of the density, we calculated the RMSE between the log density predicted by DMKDE and KDE. The values of γ and r (number of eigen-components) for DMKDE were: (γ = 8, r = 30) for the 1-D dataset, (γ = 128, r = 100) for the 2-D dataset, and r = 150 with a small γ for the MNIST dataset. In all the cases KDE was run with a γ twice the value used for DMKDE.
Figure 1. Density estimation of DMKDE (γ = 2) and KDE (γ = 4) on the 1-D synthetic dataset (N = 1000 points), together with the true pdf.

Figure 2 shows how the accuracy of DMKDE increases with an increasing number of RFF. For each configuration 30 experiments were run; the blue solid line represents the mean RMSE of the experiments and the blue region represents the 95% confidence interval. In all three datasets, DMKDE achieved a low RMSE. The variance also decreases when the number of RFF is increased.

Figure 3 shows the 2-D spirals dataset (left) and the density estimation of both KDE (center) and DMKDE (right). The density calculated by DMKDE is very close to the one calculated with KDE.

Figure 4 shows a comparison of the log density predicted by KDE and DMKDE. Both models were applied to test samples and to samples generated randomly from a uniform distribution. As expected, points are clustered around the diagonal. The DMKDE log density of test samples (left) seems to be more accurately predicted than that of random samples. The reason is that the density of random samples is smaller than the density of test samples and the difference is amplified by the logarithm.

For the second experiment, we measured the time taken to predict 1,000 test samples for both KDE and DMKDE using different numbers of train samples. KDE was implemented in Python using linear algebra operations accelerated by numpy. At least for the experiments reported, our implementation was faster than other KDE implementations available, such as the one provided by scikit-learn (https://scikit-learn.org/stable/modules/density.html), which is probably optimized for other use cases. DMKDE was implemented in Python using Tensorflow. Both methods were run on a dual-core Intel(R) Xeon(R) CPU @ 2.20GHz without GPU to avoid any unfair advantage.

Figure 5 shows the time of both methods for different sizes of the training dataset. The prediction time of KDE depends on the size of the training dataset, while the time of DMKDE does not depend on it. The advantage of DMKDE in terms of computation time is clear for large training datasets.
Table 3. Data sets used for classification evaluation.

Data set   Attributes   Classes   Train-Test
Letters    16           26        14000-6000
USPS       256          10        7291-2007
Forest     54           3         70-30
MNIST      784          10        60000-10000
Gisette    -            -         -
Cifar      -            -         -
[Entries marked '-' not recoverable from the extracted source.]

Table 4. Specifications of the data sets used for ordinal regression evaluation. Train and Test indicate the number of samples, which is the same for all the twenty partitions.

Data set         Attributes   Train   Test
Diabetes         -            -       -
Pyrimidines      27           50      24
Triazines        60           100     86
Wisconsin        32           130     64
Machine CPU      6            150     59
Auto MPG         7            200     192
Boston Housing   13           300     206
Stocks Domain    -            -       -
Abalone          -            -       -
[Entries marked '-' not recoverable from the extracted source.]

Table 3 shows each dataset with its corresponding number of attributes, classes, and number of train and test points.

Datasets
The details of the nine benchmark datasets used for the ordinal regression evaluation are given in Table 4.
Figure 2.
Accuracy of the density estimation of DMKDE for different numbers of RFF for the 1-D dataset (left), 2-D dataset (center) and MNIST dataset (right). For the 1-D dataset both KDE and DMKDE are compared against the true density. For the two other datasets the difference between KDE and DMKDE is calculated. In all the cases the RMSE is calculated. The blue shaded zone represents the 95% confidence interval.

Figure 3. The 2-D spirals dataset (left) and the density estimation of KDE (center) and DMKDE (right).

Analysis of the Predictions
In this subsection, we analyze the behavior of the predictions given by QMR and QMR-SGD in the first partition of the Stocks Domain data set. As described in Subsection 3.4, the prediction of QMR and QMR-SGD is a probability distribution over the ordinal classes. To better understand the workings of the models, we calculated the average distribution predicted for each class by averaging the predictions of the corresponding samples. The results are shown in Figures 6 and 7.

The distributions predicted for each class by the QMR method have a higher dispersion. Because of this, QMR predicts all test samples in the second, third and fourth classes, which is explained by the fact that Stocks is slightly unbalanced towards the third class (see Figure 8). After training with SGD, it is evident that the class distributions become more distinguishable from each other. This manifests itself in the improved ability of QMR-SGD to correctly predict the class of a new sample, as can also be seen in the histogram of the predicted labels (see Figure 8).

Variance
QMR and QMR-SGD allow us to calculate the variance of the prediction. In fact, as described in Subsection 3.4, QMR-SGD is trained using a mean squared error loss function involving such variance, controlled by a parameter α. In this case, α was set after a hyper-parameter search. For both QMR and QMR-SGD we study the behavior of the variance of the predictions, grouping the test samples according to the absolute classification error. That is, a new case that is predicted in the correct class has an error equal to 0. A new case whose true label is 1 but is predicted in class 3 has an absolute error equal to 2. The results are shown in Figure 9.

From Figure 9 we can observe that for QMR the variance is not necessarily an indicator of misclassification. For QMR-SGD, the effect of the inclusion of the variance in the loss function is notable, as it is lower over the whole test set. Further, the mean of the variance of the well-classified samples is significantly lower than that of the misclassified samples. Thus the variance value associated with the prediction is a good indicator of uncertainty, and a good indicator of the quality of the prediction.
Figure 4.
Scatter-plots comparing the log density predicted by KDE and DMKDE: test samples (left), uniformly random generated samples (center), both test and random samples (right).
Figure 5.
Evaluation of the prediction time of DMKDE and KDE: 1-D dataset (left), 2-D dataset (center) and MNIST dataset (right).
Figure 6.
QMR average predicted probability distributions per class. According to the class they belong to, the Stocks test samples are grouped and the average probability distribution predicted by QMR is calculated. The plots are ordered according to the class, starting with class one at the left.
Figure 7.
QMR-SGD average predicted probability distributions per class. According to the class they belong to, the Stocks test samples are grouped and the average probability distribution predicted by QMR-SGD is calculated. The plots are ordered according to the class, starting with class one at the left.
Figure 8.
Histograms of the labels in the Stocks test set. The figure on the left shows the histogram of the true labels. The middle figure shows the histogram of the labels predicted by QMR and the right figure shows the histogram of the labels predicted by QMR-SGD.