Training a Large Scale Classifier with the Quantum Adiabatic Algorithm
Hartmut Neven, Vasil S. Denchev, Geordie Rose, William G. Macready
Hartmut Neven
Google [email protected]
Vasil S. Denchev
Purdue University [email protected]
Geordie Rose and William G. Macready
D-Wave Systems Inc. {rose,wgm}@dwavesys.com

May 29, 2018
Abstract
In a previous publication we proposed discrete global optimization as a method to train a strong binary classifier constructed as a thresholded sum over weak classifiers. Our motivation was to cast the training of a classifier into a format amenable to solution by the quantum adiabatic algorithm. Applying adiabatic quantum computing (AQC) promises to yield solutions that are superior to those which can be achieved with classical heuristic solvers. Interestingly, we found that by using heuristic solvers to obtain approximate solutions we could already gain an advantage over the standard method AdaBoost. In this communication we generalize the baseline method to large scale classifier training. By large scale we mean that either the cardinality of the dictionary of candidate weak classifiers or the number of weak learners used in the strong classifier exceeds the number of variables that can be handled effectively in a single global optimization. For such situations we propose an iterative and piecewise approach in which a subset of weak classifiers is selected in each iteration via global optimization. The strong classifier is then constructed by concatenating the subsets of weak classifiers. We show in numerical studies that the generalized method again successfully competes with AdaBoost. We also provide theoretical arguments as to why the proposed optimization method, which not only minimizes the empirical loss but also adds L0-norm regularization, is superior to versions of boosting that only minimize the empirical loss. By conducting a Quantum Monte Carlo simulation we gather evidence that the quantum adiabatic algorithm is able to handle a generic training problem efficiently.

Baseline System
In [NDRM08] we study a binary classifier of the form

$$y = H(x) = \mathrm{sign}\Bigl(\sum_{i=1}^{N} w_i h_i(x)\Bigr), \qquad (1)$$

where $x \in \mathbb{R}^M$ are the input patterns to be classified, $y \in \{-1, +1\}$ is the output of the classifier, the $h_i : x \mapsto \{-1, +1\}$ are so-called weak classifiers or feature detectors, and the $w_i \in \{0, 1\}$ are a set of weights to be optimized during training. $H(x)$ is known as a strong classifier.

Training, i.e. the process of choosing the weights $w_i$, proceeds by simultaneously minimizing two terms. The first term, called the loss $L(w)$, measures the error over a set of $S$ training examples $\{(x_s, y_s) \,|\, s = 1, \ldots, S\}$. We choose least squares as the loss function. The second term, known as regularization $R(w)$, ensures that the classifier does not become too complex. We employ a regularization term based on the L0-norm, $\|w\|_0$. This term encourages the strong classifier to be built with as few weak classifiers as possible while maintaining a low training error. Thus, training is accomplished by solving the following discrete optimization problem:

$$w^{\mathrm{opt}} = \arg\min_w \Biggl[ \underbrace{\sum_{s=1}^{S}\Bigl(\frac{1}{N}\sum_{i=1}^{N} w_i h_i(x_s) - y_s\Bigr)^2}_{L(w)} + \underbrace{\lambda \|w\|_0}_{R(w)} \Biggr]$$
$$= \arg\min_w \Biggl[ \frac{1}{N^2}\sum_{i=1}^{N}\sum_{j=1}^{N} w_i w_j \underbrace{\Bigl(\sum_{s=1}^{S} h_i(x_s)\, h_j(x_s)\Bigr)}_{\mathrm{Corr}(h_i, h_j)} + \sum_{i=1}^{N} w_i \Bigl(\lambda - \frac{2}{N}\underbrace{\sum_{s=1}^{S} h_i(x_s)\, y_s}_{\mathrm{Corr}(h_i, y)}\Bigr) \Biggr] \qquad (2)$$

Note that in our formulation the weights are binary and not positive real numbers as in AdaBoost. Even though discrete optimization could be applied to any bit depth representing the weights, we found that a small bit depth is often sufficient [NDRM08]. Here we only deal with the simplest case in which the weights are chosen to be binary.
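To make the mapping concrete, the quadratic and linear coefficients of Eqn. (2) can be assembled directly from the weak-classifier outputs. The following sketch is illustrative only (names such as `build_qubo` are ours, not the authors'); it assumes precomputed outputs `h[i][s]` in {-1, +1} and labels `y[s]`:

```python
# Sketch: build the quadratic binary objective of Eqn. (2) from precomputed
# weak-classifier outputs h[i][s] in {-1,+1} and labels y[s] in {-1,+1}.
# The names here are illustrative, not taken from the paper.

def build_qubo(h, y, lam):
    """Return (Q, c): couplings Q[i][j] and linear terms c[i] such that the
    training objective equals sum_ij Q[i][j] w_i w_j + sum_i c[i] w_i
    up to the constant sum_s y_s^2."""
    N = len(h)          # number of weak classifiers
    S = len(y)          # number of training samples
    # Corr(h_i, h_j) = sum_s h_i(x_s) h_j(x_s), scaled by 1/N^2
    Q = [[sum(h[i][s] * h[j][s] for s in range(S)) / float(N * N)
          for j in range(N)] for i in range(N)]
    # linear term: lambda - (2/N) Corr(h_i, y)
    c = [lam - (2.0 / N) * sum(h[i][s] * y[s] for s in range(S))
         for i in range(N)]
    return Q, c

def objective(Q, c, w):
    """Evaluate the quadratic binary program for a weight vector w in {0,1}^N."""
    N = len(w)
    return (sum(Q[i][j] * w[i] * w[j] for i in range(N) for j in range(N))
            + sum(c[i] * w[i] for i in range(N)))
```

The resulting `(Q, c)` pair is exactly the format a quadratic unconstrained binary optimization solver, classical or quantum, consumes.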
In the case of a finite dictionary of weak classifiers $\{h_i(x) \,|\, i = 1, \ldots, N\}$, AdaBoost can be seen as a greedy algorithm that minimizes the exponential loss [Zha04],

$$\alpha^{\mathrm{opt}} = \arg\min_\alpha \Biggl( \sum_{s=1}^{S} \exp\Bigl( -y_s \sum_{i=1}^{N} \alpha_i h_i(x_s) \Bigr) \Biggr), \qquad (3)$$

with $\alpha_i \in \mathbb{R}^+$. There are two differences between the objective of our algorithm (Eqn. 2) and the one employed by AdaBoost. The first is that we added L0-norm regularization. Second, we employ a quadratic loss function, while AdaBoost works with the exponential loss.

It can easily be shown that including L0-norm regularization in the objective in Eqn. (2) leads to improved generalization error as compared to using the quadratic loss only. The proof goes as follows. An upper bound for the Vapnik-Chervonenkis dimension of a strong classifier $H$ of the form $H(x) = \sum_{t=1}^{T} h_t(x)$ is given by

$$VC_H = 2\,\bigl(VC_{\{h_i\}} + 1\bigr)\,(T+1)\,\log_2\bigl(e\,(T+1)\bigr), \qquad (4)$$

where $VC_{\{h_i\}}$ is the VC dimension of the dictionary of weak classifiers [FS95]. The strong classifier's generalization error $\mathrm{Error}_{\mathrm{test}}$ therefore has an upper bound given by [VC71]
$$\mathrm{Error}_{\mathrm{test}} \leq \mathrm{Error}_{\mathrm{train}} + \sqrt{\frac{VC_H\bigl(\ln(2S/VC_H) + 1\bigr) + \ln(4/\delta)}{S}}. \qquad (5)$$

It is apparent that a more compact strong classifier, one that achieves a given training error $\mathrm{Error}_{\mathrm{train}}$ with a smaller number $T$ of weak classifiers (hence, with a smaller VC dimension $VC_H$), comes with a guarantee for a lower generalization error.

Figure 1: AdaBoost applied to a simple classification task. A shows the data, a separable set consisting of a two-dimensional cluster of positive examples (blue) surrounded by negative ones (red). B shows the random division into training (saturated colors) and test data (light colors). The dictionary of weak classifiers is constructed of axis-parallel one-dimensional hyperplanes. C shows the optimal classifier for this situation, which employs four weak classifiers to partition the input space into positive and negative areas. The lower row shows partitions generated by AdaBoost after T = 10, 20, and 640 iterations. The configuration at T = 640 is the asymptotic configuration that does not change anymore in subsequent training rounds. The "breakout regions" outside the bounding box of the positive cluster occur in areas in which the training set does not contain negative examples. This problem becomes more severe for higher dimensional data. Due to AdaBoost's greedy approach, the optimal configuration is not found despite the fact that the weak classifiers necessary to construct the ideal bounding box are generated. In fact, AdaBoost fails to learn higher dimensional versions of this problem altogether, with error rates approaching 50%. See section 6 for a discussion of how global optimization based learning can handle this data set.

Looking at the optimization problem in Eqn. 2, one can see that if the regularization strength $\lambda$ is chosen weak enough, i.e. $\lambda < 2/N - 1/N^2$, then the effect of the regularization is merely to thin out the strong classifier. One arrives at the condition for $\lambda$ by demanding that the reduction of the regularization term $\Delta R(w)$ that can be obtained by switching one $w_i$ to zero is smaller than the smallest associated increase in the loss term $\Delta L(w)$ that comes from incorrectly labeling a training example.
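The condition on the regularization strength can be motivated in a few lines. This is a sketch under the assumption that weights are switched off one at a time; it is not spelled out in this form in the text:

```latex
% Write a_s = (1/N) \sum_i w_i h_i(x_s). Switching one w_i from 1 to 0 lowers
% R(w) by exactly \lambda and shifts a_s by -h_i(x_s)/N, so the loss on sample
% s changes by
\Delta L_s = \Bigl(a_s - \tfrac{h_i(x_s)}{N} - y_s\Bigr)^2 - \bigl(a_s - y_s\bigr)^2
           = -\frac{2\,h_i(x_s)\,(a_s - y_s)}{N} + \frac{1}{N^2}.
% The smallest increase that flips a previously correct label occurs when the
% example sits just on the correct side of the boundary, e.g. y_s = +1,
% h_i(x_s) = +1 and a_s = 1/N, in which case
\Delta L_s = (0 - 1)^2 - \Bigl(\tfrac{1}{N} - 1\Bigr)^2 = \frac{2}{N} - \frac{1}{N^2}.
% Demanding \lambda < \Delta L_s for every such flip ensures that no weak
% classifier is removed at the price of mislabeling a training example.
```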
This condition guarantees that weak classifiers are not eliminated at the expense of a higher training error. Therefore the regularization will only keep a minimal set of components, those which are needed to achieve the minimal training error that was obtained when using the loss term only. In this regime, the VC bound of the resulting strong classifier is lower than or equal to the VC bound of a classifier trained with no regularization.

AdaBoost contains no explicit regularization term, and it can easily happen that the classifier uses a richer set of weak classifiers than needed to achieve the minimal training error, which in turn leads to degraded generalization. Fig. 1 illustrates this fact.

In practice we do not operate in the weak-$\lambda$ regime but rather determine the regularization strength $\lambda$ by using a validation set. We measure the performance of the classifier for different values of $\lambda$ on a validation set and then choose the one with the minimal validation error. In this regime, the optimization performs a trade-off and accepts a higher empirical loss if the classifier can be kept more compact. In other words, it may choose to misclassify training examples if it can keep the classifier simpler. This leads to increased robustness in the case of noisy data, and indeed we observe the most significant gains over AdaBoost for noisy data sets where the Bayes error is high. The fact that boosting in its standard formulation, with convex loss functions and no regularization, is not robust against label noise has drawn attention recently [LS08][Fre09].

The second difference to our baseline system, namely that we employ quadratic loss while AdaBoost works with exponential loss, is of smaller importance. In fact, the discussion above about the role of the regularization term would not have changed if we were to choose exponential rather than square loss.
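The compactness argument can be made concrete by evaluating the bound of Eqns. (4) and (5) numerically. The numbers below are illustrative placeholders (a weak-classifier VC dimension of 2, as for one-dimensional stumps, and arbitrary training errors and sample counts), not values from the paper:

```python
import math

# Illustrative evaluation of the VC-type bound: a strong classifier that
# reaches the same training error with fewer weak classifiers T has the
# smaller bound. Constants follow the standard Vapnik form quoted above.

def vc_strong(vc_weak, T):
    # Eqn. (4): VC_H = 2 (VC_{h_i} + 1) (T + 1) log2(e (T + 1))
    return 2.0 * (vc_weak + 1) * (T + 1) * math.log2(math.e * (T + 1))

def generalization_bound(err_train, vc_h, S, delta=0.05):
    # Eqn. (5)-style bound on the test error
    return err_train + math.sqrt((vc_h * (math.log(2.0 * S / vc_h) + 1.0)
                                  + math.log(4.0 / delta)) / S)

# same training error, different classifier sizes
compact = generalization_bound(0.05, vc_strong(2, 4), S=10000)
rich    = generalization_bound(0.05, vc_strong(2, 20), S=10000)
```

With these inputs `compact` is the smaller of the two bounds, mirroring the claim that thinning out the strong classifier can only help generalization as long as the training error is unchanged.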
The literature seems to agree that the use of exponential loss in AdaBoost is not essential and that other loss functions could be employed to yield classifiers with similar performance [FHT98][Wyn02]. From a statistical perspective, quadratic loss is satisfying since a classifier that minimizes the quadratic loss is consistent: with an increasing number of training samples it will asymptotically approach a Bayes classifier, i.e. the classifier with the smallest possible error [Zha04].

The baseline system assumes a fixed dictionary containing a number of weak classifiers small enough that all weight variables can be considered in a single global optimization. This approach needs to be modified if the goal is to train a large scale classifier. Large scale here means that at least one of two conditions is fulfilled:

1. The dictionary contains more weak classifiers than can be considered in a single global optimization.

2. The final strong classifier consists of a number of weak classifiers that exceeds the number of variables that can be handled in a single global optimization.

Let us take a look at typical problem sizes. The state-of-the-art heuristic solver ILOG CPLEX can obtain good solutions for quadratic binary programs of up to 1000 variables, depending on the coefficient matrix. The quantum hardware solvers manufactured by D-Wave can currently handle 128-variable problems. In order to train a strong classifier we often sift through millions of features. Moreover, dictionaries of weak learners often depend on a set of continuous parameters such as thresholds, which means that their cardinality is infinite. We estimate that typical classifiers employed in vision based products today use thousands of weak learners.
Therefore it is not possible to determine all weights in a single global optimization; rather, it is necessary to break the problem into smaller chunks.

Let $T$ designate the size of the final strong classifier and $Q$ the number of variables that we can handle in a single optimization. $Q$ may be determined by the number of available qubits or, if we employ classical heuristic solvers such as ILOG CPLEX or tabu search [Pal04], then $Q$ designates a problem size for which we can hope to obtain a solution of reasonable quality. We are going to consider two cases, the first with $T \leq Q$ and the second with $T > Q$.

We first describe the "inner loop" algorithm we suggest for the case in which the number of variables we can handle exceeds the number of weak learners needed to construct the strong classifier.
Algorithm 1 $T \leq Q$ (Inner Loop)

Require: Training and validation data, dictionary of weak classifiers
Ensure: Strong classifier
  Initialize the weight distribution $d_{\mathrm{inner}}$ over training samples as the uniform distribution, $\forall s: d_{\mathrm{inner}}(s) = 1/S$
  Set $T_{\mathrm{inner}} = 0$
  repeat
    Select the $Q - T_{\mathrm{inner}}$ weak classifiers $h_i$ from the dictionary that have the smallest weighted training error rates
    for $\lambda = \lambda_{\min}$ to $\lambda_{\max}$ do
      Run the optimization $w^{\mathrm{opt}} = \arg\min_w \Bigl( \sum_{s=1}^{S} \bigl( \frac{1}{Q} \sum_{i=1}^{Q} w_i h_i(x_s) - y_s \bigr)^2 + \lambda \|w\|_0 \Bigr)$
      Set $T_{\mathrm{inner}} = \|w^{\mathrm{opt}}\|_0$
      Construct the strong classifier $H(x) = \mathrm{sign}\bigl( \sum_{t=1}^{T_{\mathrm{inner}}} h_t(x) \bigr)$ by summing up the weak classifiers for which $w_i = 1$
      Measure the validation error $\mathrm{Error}_{\mathrm{val}}$ of the strong classifier on the unweighted validation set
    end for
    Keep $T_{\mathrm{inner}}$, $H(x)$ and $\mathrm{Error}_{\mathrm{val}}$ for the $\lambda$ run that yielded the smallest validation error
    Update the weights $d_{\mathrm{inner}}(s) = d_{\mathrm{inner}}(s) \bigl( \frac{1}{T_{\mathrm{inner}}} \sum_{t=1}^{T_{\mathrm{inner}}} h_t(x_s) - y_s \bigr)^2$
    Normalize $d_{\mathrm{inner}}(s) = d_{\mathrm{inner}}(s) / \sum_{s'=1}^{S} d_{\mathrm{inner}}(s')$
  until the validation error $\mathrm{Error}_{\mathrm{val}}$ stops decreasing
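One round of the inner loop can be sketched as follows. This is our illustrative code, not the authors' implementation; the exhaustive `solve_qubo_brute` stands in for tabu search or quantum hardware and is only feasible for tiny `Q`:

```python
import itertools

# Sketch of one inner-loop round: preselect the Q stumps with smallest weighted
# error, then globally optimize their binary weights. Brute force replaces the
# heuristic/quantum solver for illustration only.

def solve_qubo_brute(h, y, lam, Q):
    """Exhaustive minimization of sum_s ((1/Q) sum_i w_i h_i(x_s) - y_s)^2
    + lam * ||w||_0 over w in {0,1}^Q."""
    best_w, best_obj = None, float("inf")
    S = len(y)
    for w in itertools.product([0, 1], repeat=Q):
        loss = sum(((1.0 / Q) * sum(w[i] * h[i][s] for i in range(Q)) - y[s]) ** 2
                   for s in range(S))
        obj = loss + lam * sum(w)
        if obj < best_obj:
            best_w, best_obj = w, obj
    return best_w

def inner_loop_step(dictionary, d, y, lam, Q):
    """One round: pick the Q stumps with smallest d-weighted error, then
    optimize their weights. Returns the selected stumps and the classifier."""
    S = len(y)
    def weighted_err(h):
        return sum(d[s] for s in range(S) if h[s] != y[s])
    chosen = sorted(dictionary, key=weighted_err)[:Q]
    w = solve_qubo_brute(chosen, y, lam, Q)
    selected = [h for h, wi in zip(chosen, w) if wi == 1]
    def H(s):
        # strong classifier: sign of the vote of the selected weak classifiers
        return 1 if sum(h[s] for h in selected) >= 0 else -1
    return selected, H
```

A full implementation would wrap this in the $\lambda$ sweep, the sample reweighting, and the validation-based stopping rule of Algorithm 1.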
A way to think about this algorithm is to see it as an enrichment process. In the first round, the algorithm selects those $T_{\mathrm{inner}}$ weak classifiers out of a subset of $Q$ that produce the optimal validation error. The subset of $Q$ weak classifiers has been preselected from a dictionary with a cardinality possibly much larger than $Q$. In the next step the algorithm fills the $Q - T_{\mathrm{inner}}$ empty slots in the solver with the best weak classifiers drawn from a modified dictionary that was adapted by taking into account for which samples the strong classifier constructed in the first round is already good and where it still makes errors. This is the boosting idea. Under the assumption that the solver always finds the global minimum, it is guaranteed that for a given $\lambda$ the solutions found in the subsequent round will have lower or equal objective value, i.e. they achieve a lower loss or they represent a more compact strong classifier. The fact that the algorithm always considers groups of $Q$ weak classifiers simultaneously, rather than incrementing the strong classifier one by one, and then tries to find the smallest subset that still produces a low training error, allows it to find optimal configurations more efficiently.

If the validation error cannot be decreased any further using the inner loop, one may conclude that more weak classifiers are needed to construct the strong one. In this case the "outer loop" algorithm "freezes" the classifier obtained so far and adds another partial classifier trained again by the inner loop.

Algorithm 2 $T > Q$ (Outer Loop)

Require: Training and validation data, dictionary of weak classifiers
Ensure: Strong classifier
  Initialize the weight distribution $d_{\mathrm{outer}}$ over training samples as the uniform distribution, $\forall s: d_{\mathrm{outer}}(s) = 1/S$
  Set $T_{\mathrm{outer}} = 0$
  repeat
    Run Algorithm 1 with $d_{\mathrm{inner}}$ initialized from the current $d_{\mathrm{outer}}$ and using an objective function that takes into account the current $H(x)$:
      $w^{\mathrm{opt}} = \arg\min_w \Bigl( \sum_{s=1}^{S} \bigl( \frac{1}{T_{\mathrm{outer}} + Q} \bigl( \sum_{t=1}^{T_{\mathrm{outer}}} h_t(x_s) + \sum_{i=1}^{Q} w_i h_i(x_s) \bigr) - y_s \bigr)^2 + \lambda \|w\|_0 \Bigr)$
    Construct a strong classifier $H(x) = \mathrm{sign}\bigl( \sum_{t=1}^{T_{\mathrm{outer}}} h_t(x) + \sum_{t=T_{\mathrm{outer}}+1}^{T_{\mathrm{outer}}+T_{\mathrm{inner}}} h_t(x) \bigr)$ adding those weak classifiers for which $w_i = 1$
    Set $T_{\mathrm{outer}} = T_{\mathrm{outer}} + T_{\mathrm{inner}}$
    Update the weights $d_{\mathrm{outer}}(s) = d_{\mathrm{outer}}(s) \bigl( \frac{1}{T_{\mathrm{outer}}} \sum_{t=1}^{T_{\mathrm{outer}}} h_t(x_s) - y_s \bigr)^2$
    Normalize $d_{\mathrm{outer}}(s) = d_{\mathrm{outer}}(s) / \sum_{s'=1}^{S} d_{\mathrm{outer}}(s')$
  until the validation error $\mathrm{Error}_{\mathrm{val}}$ stops decreasing
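The "freezing" step of the outer loop can be sketched as follows: the votes of already-frozen weak classifiers enter the loss as a fixed offset, and only the $Q$ new weights are optimized. Again this is illustrative code with a brute-force stand-in for the solver, not the authors' implementation:

```python
import itertools

# Sketch of one outer-loop step: frozen_votes[s] is the accumulated vote of the
# T_outer weak classifiers frozen so far; candidates are Q new weak classifiers
# whose binary weights are optimized against the combined, rescaled output.

def outer_step(frozen_votes, candidates, y, lam, Q, T_outer):
    S = len(y)
    scale = 1.0 / (T_outer + Q)
    best_w, best_obj = None, float("inf")
    for w in itertools.product([0, 1], repeat=Q):
        loss = sum((scale * (frozen_votes[s]
                             + sum(w[i] * candidates[i][s] for i in range(Q)))
                    - y[s]) ** 2 for s in range(S))
        obj = loss + lam * sum(w)
        if obj < best_obj:
            best_w, best_obj = w, obj
    new = [h for h, wi in zip(candidates, best_w) if wi == 1]
    # concatenate: the newly selected classifiers join the frozen ones
    votes = [frozen_votes[s] + sum(h[s] for h in new) for s in range(S)]
    return new, votes
```

The returned `votes` become the frozen offset of the next round, which is how the strong classifier is built up by concatenating the subsets selected in each iteration.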
To assess the performance of binary classifiers of the form (1) trained by applying the outer loop algorithm, we measured their performance on synthetic and natural data sets. The synthetic test data consisted of 30-dimensional input vectors generated by sampling from $P(x, y) = \frac{1}{2}\delta(y - 1)\,N(x|\mu_+, I) + \frac{1}{2}\delta(y + 1)\,N(x|\mu_-, I)$, where $N(x|\mu, \Sigma)$ is a spherical Gaussian having mean $\mu$ and covariance $\Sigma$. An overlap coefficient determines the separation of the two Gaussians. See [NDRM08] for details. The natural data consists of two sets of 30- and 96-dimensional vectors of Gabor wavelet amplitudes extracted at eye locations in images showing faces. The input vectors are normalized using the L2-norm, i.e. we have $\|x\|_2 = 1$. The data sets consisted of 20,000 input vectors, which we divided evenly into a training set, a validation set to fix the parameter $\lambda$, and a test set. We used tabu search as the heuristic solver [Pal04]. For both experiments we employed a dictionary consisting of decision stumps of the form:

$$h_l(x) = \mathrm{sign}(x_l - \Theta_l^+) \quad \text{for } l = 1, \ldots, M \qquad (6)$$
$$h_{-l}(x) = \mathrm{sign}(-x_l - \Theta_l^-) \quad \text{for } l = 1, \ldots, M \qquad (7)$$
$$h_l(x) = \mathrm{sign}(x_i x_j - \Theta_{i,j}^+) \quad \text{for } l = 1, \ldots, \binom{M}{2};\; i, j = 1, \ldots, M;\; i < j \qquad (8)$$
$$h_{-l}(x) = \mathrm{sign}(-x_i x_j - \Theta_{i,j}^-) \quad \text{for } l = 1, \ldots, \binom{M}{2};\; i, j = 1, \ldots, M;\; i < j \qquad (9)$$

Here $h_l$, $h_{-l}$ are positive and negative weak classifiers of orders 1 and 2 respectively; $M$ is the dimensionality of the input vector $x$; $x_l$, $x_i$, $x_j$ are the elements of the input vector; and $\Theta_l^+$, $\Theta_l^-$, $\Theta_{i,j}^+$ and $\Theta_{i,j}^-$ are the optimal thresholds of the positive and negative weak classifiers of orders 1 and 2 respectively. For the 30-dimensional input data the dictionary employs 930 weak classifiers and for the 96-dimensional input it consists of 9312 weak learners.

Figure 2: Test errors for the synthetic data set. We ran the outer loop algorithm for three different values of $Q$: $Q = 64$, $Q = 128$, and $Q = 256$. The curves show means for 100 runs and the error bars indicate the corresponding standard deviations. All three versions outperform AdaBoost. The gain increases as the classification problem gets harder, i.e. as the overlap between the positive and negative example clouds increases. The Bayes error rate for the case of complete overlap, overlap coefficient = 1, is $\approx 0.5$. One can also see that there is a benefit to being able to run larger optimizations, since the error rate decreases with increasing $Q$. For comparison, we also included the results from the earlier article [NDRM08] for a classifier based on a fixed dictionary using quadratic loss (QP 2) for which the training was performed as per Eqn. 2. Not surprisingly, working with an adaptive set of weak classifiers yields higher accuracy.

As in [NDRM08] we compute an optimal threshold for the final strong classifier according to $\Theta = \frac{1}{S} \sum_{s=1}^{S} \sum_{i=1}^{N} w_i^{\mathrm{opt}} h_i(x_s)$. The final classifier then becomes $y = \mathrm{sign}\bigl( \sum_{i=1}^{N} w_i^{\mathrm{opt}} h_i(x) - \Theta \bigr)$. In a separate set of experiments we co-optimized $\Theta$ jointly with the weights $w_i$.
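A dictionary of the shape given in Eqns. (6)-(9) can be generated mechanically. The sketch below is illustrative (the names are ours); in particular, instead of the optimal thresholds of the paper it simply uses the midpoints between sorted feature values as candidate thresholds:

```python
import itertools

# Sketch: enumerate order-1 and order-2 decision stumps as in Eqns. (6)-(9),
# with a positive and a negative stump per threshold. Midpoint thresholds are
# a simplification; the paper uses optimized thresholds.

def sign(v):
    return 1 if v >= 0 else -1

def candidate_thresholds(values):
    u = sorted(set(values))
    return [(a + b) / 2.0 for a, b in zip(u, u[1:])] or [0.0]

def stump_dictionary(X):
    """X: list of input vectors. Returns a list of callables x -> {-1,+1}."""
    M = len(X[0])
    dictionary = []
    for l in range(M):                                   # order 1, Eqns. (6)-(7)
        for theta in candidate_thresholds([x[l] for x in X]):
            dictionary.append(lambda x, l=l, t=theta: sign(x[l] - t))
            dictionary.append(lambda x, l=l, t=theta: sign(-x[l] - t))
    for i, j in itertools.combinations(range(M), 2):     # order 2, Eqns. (8)-(9)
        prods = [x[i] * x[j] for x in X]
        for theta in candidate_thresholds(prods):
            dictionary.append(lambda x, i=i, j=j, t=theta: sign(x[i] * x[j] - t))
            dictionary.append(lambda x, i=i, j=j, t=theta: sign(-x[i] * x[j] - t))
    return dictionary
```

With one threshold per stump this reproduces the counts quoted in the text: $2M + 2\binom{M}{2}$ stumps, i.e. 930 for $M = 30$ and 9312 for $M = 96$.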
For the datasets we studied the difference was negligibly small, but we do not expect this to be generally the case. Note that in order to handle the multi-valued global threshold within the framework of discrete optimization one has to insert a binary expansion for $\Theta$, and the loss term then becomes

$$L(w) = \sum_{s=1}^{S} \Bigl( \frac{1}{N} \Bigl( \sum_{i=1}^{N} w_i h_i(x_s) - \sum_{k=0}^{\lceil \log_2 N \rceil} \Theta_k 2^k + 2^{\lceil \log_2 N \rceil - 1} \Bigr) - y_s \Bigr)^2.$$

Test results for the synthetic data set are shown in Fig. 2, and the table in Fig. 3 displays the results obtained from the natural data set.

Figure 3: Test results obtained for the natural data set with 30-dimensional input vectors. Similar to the synthetic data, we compare the outer loop algorithm for three different window sizes $Q = 64$, $Q = 128$, and $Q = 256$ to AdaBoost. The results were obtained for 1000 runs. The piecewise global optimization based training only leads to slightly lower test errors but obtains those with a significantly reduced number of weak classifiers. Also, the number of iterations needed to train the strong classifiers is more than 4 times lower than required by AdaBoost. (The table compares AdaBoost and the outer loop algorithm at $Q = 64$, $128$, $256$ on test error, number of weak classifiers, reweightings, training error, and number of outer loops; the numeric entries are not recoverable from the source.)

We did comprehensive tests of the described inner and outer loop algorithms and found that minor modifications lead to the best results. We found that, rather than adding just $Q - T_{\mathrm{inner}}$ weak classifiers, the error rates dropped slightly (about 10%) if we replaced all $Q$ classifiers from the previous round by new ones. The objective in Eqn. (2) employs a scaling factor of $1/N$ to ensure that the unthresholded output of the classifier, sometimes referred to as the score function, does not overshoot the $\{-1, +1\}$ labels. Systematic investigation of the scaling factor, however, suggested that larger scaling factors lead to a minimal improvement in accuracy and to a more significant reduction in the number of classifiers used (between 10-30%). Thus, to obtain the reported results we chose a larger scale factor.

To determine the optimal size $T$ of the strong classifier generated by AdaBoost we used a validation set. If the error did not decrease during 400 iterations we stopped and picked the $T$ for which the minimal error was obtained.
The results for the 96-dimensional natural data sets looked similar.

We used the Quantum Monte Carlo (QMC) simulator of [FGG+09] to obtain an initial estimate of the time complexity of the quantum adiabatic algorithm on our objective function. According to the adiabatic theorem [Mes99], the ground state of the problem Hamiltonian $H_P$ is found with high probability by the quantum adiabatic algorithm, provided that the evolution time $T$ from the initial Hamiltonian $H_B$ to $H_P$ is $\Omega(g_{\min}^{-2})$, where $g_{\min}$ is the minimum gap. Here $H_B$ is chosen as $H_B = \sum_{i=1}^{N} (1 - \sigma_i^x)/2$. The minimum gap is the smallest energy gap between the ground state $E_0$ and first excited state $E_1$ of the time-dependent Hamiltonian $H(t) = (1 - t/T) H_B + (t/T) H_P$ for any $0 \leq t \leq T$. For notational convenience, we also use $\tilde H(s) = (1 - s) H_B + s H_P$ with $0 \leq s \leq 1$. More details can be found in the seminal work [FGGS00].

As a consequence, to find the time complexity of AQC for a given objective function, one needs to estimate the asymptotic scaling of the minimum gap as observed on a collection of typical-case instances of this objective function. As noted in [AC09], the task of analytically extracting the minimum gap scaling has been extremely difficult in practice, except for a few special cases. The only alternative is to resort to numerical methods, which consist of diagonalization and QMC simulation. Unfortunately, diagonalization is currently limited to about $N < 30$, and QMC to about $N = 256$, where $N$ is the number of binary variables [YKS09]. Hence, the best that can be done with the currently available tools is to collect data via QMC simulations on small problem instances and attempt to extrapolate the scaling of the minimum gap for larger instances.

Using the QMC simulator of [FGG+09] we measured the quantity $\bigl| \frac{d^2 E_0}{ds^2}\, s(1-s) \bigr|$, which is related to the minimum gap [YKS08]. This quantity is an upper bound on $|V|^2/g_{\min}$, where $V = \langle \Psi_1 | d\tilde H/ds | \Psi_0 \rangle$ and $\Psi_0$, $\Psi_1$ are the eigenstates corresponding to the ground and first excited states of $\tilde H$.
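For intuition about the quantities involved, the gap of an interpolating Hamiltonian can be worked out exactly in a toy one-qubit case. This example is ours, not from the paper; for a single qubit, $\tilde H(s) = (1-s) H_B + s H_P$ with $H_B = (1 - \sigma^x)/2$ and a diagonal $H_P = \mathrm{diag}(e_0, e_1)$ is a $2 \times 2$ matrix whose spectrum is available in closed form, so $g_{\min}$ can be found by scanning $s$:

```python
import math

# Toy one-qubit illustration: the gap of H(s) = (1-s) H_B + s H_P, where
# H_B = (1 - sigma_x)/2 = [[0.5, -0.5], [-0.5, 0.5]] and H_P = diag(e0, e1).
# For a symmetric 2x2 matrix [[a, c], [c, b]] the eigenvalue gap is
# 2 * sqrt(((a - b)/2)^2 + c^2).

def gap(s, e0, e1):
    a = 0.5 * (1 - s) + s * e0
    b = 0.5 * (1 - s) + s * e1
    c = -0.5 * (1 - s)          # off-diagonal element of H(s)
    return 2.0 * math.sqrt(((a - b) / 2.0) ** 2 + c ** 2)

def min_gap(e0, e1, steps=10001):
    # scan s in [0, 1] on a grid and return the smallest gap encountered
    return min(gap(k / (steps - 1.0), e0, e1) for k in range(steps))
```

At $s = 1$ the gap is simply $|e_1 - e_0|$; the minimum over $s$ is smaller, which is exactly why the evolution time is governed by $g_{\min}$ rather than by the final spectrum. For realistic problem sizes this direct approach is unavailable, which is what motivates the QMC estimates discussed here.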
However, the quantity that one is interested in for the time scaling of AQC is $|V|/g_{\min}^2$; but assuming that the matrix element $V$ is not extremely small, the scaling of the second derivative, polynomial or exponential, can be used to infer whether the time complexity of AQC is polynomial or exponential in $N$.

Figure 4: The quantity $\bigl| \frac{d^2 E_0}{ds^2}\, s(1-s) \bigr|$ determined by quantum Monte Carlo simulation (two QMC estimators) as well as exact diagonalization for a training problem with 20 weight variables. For small problem instances with fewer than 30 variables, we can determine this quantity via exact diagonalization of the Hamiltonian $\tilde H(s)$. As one can see, the results obtained by diagonalization coincide very well with the ones determined by QMC. The training objective is given by Eqn. 2 using the synthetic data set with an overlap coefficient of 0.95.

Figure 5: A plot of the peaks of the mean of $\bigl| \frac{d^2 E_0}{ds^2}\, s(1-s) \bigr|$ against the number of qubits for the range of 10-100 qubits. The error bars indicate the standard deviation. Each point of the plot represents 20 QMC runs. The data is well fitted by a linear function. From the fact that the scaling is at most polynomial in the problem size we can infer that the minimum gap, and hence the runtime of AQC, scales polynomially as well.

Fig. 5 shows the results of a scaling analysis for the synthetic data set. The result is encouraging, as the maxima of $\bigl| \frac{d^2 E_0}{ds^2}\, s(1-s) \bigr|$ only scale linearly with the problem size. This implies that the runtime of AQC on this data set is likely to scale at most polynomially. It is not possible to make a statement about how typical this data set, and hence this scaling behavior, is. We do know from related experiments with known optimal solutions that tabu search often fails to obtain the optimal solution for a training problem for sizes as small as 64 variables. For instance, we applied tabu search to 30-dimensional separable data sets of the form depicted in Fig. 1; tabu search failed to return the minimal objective value for N = 64 and S = 9300. Obviously, scaling will depend on the input data and the dictionary used. In fact, it should be possible to take a hard problem known to defeat AQC [AC09] and encode it as a training problem, which would cause the scaling to become exponential. But even if the scaling is exponential, the solutions found by AQC for a given problem size can still be significantly better than those found by classical solvers. Moreover, newer versions of AQC with changing paths [FGG+09] may be able to solve hard training problems like this efficiently.

Discussion and future work
Building on earlier work [NDRM08] we continued our exploration of discrete global optimization as a tool to train binary classifiers. The proposed algorithms, which we would like to call QBoost, enable us to handle training tasks of the larger sizes that occur in production systems today. QBoost offers gains over standard AdaBoost in three respects:

1. The generalization error is lower.

2. Classification is faster during execution because it employs a smaller number of weak classifiers.

3. Training can be accomplished more quickly since the number of boosting steps is smaller.

In all experiments we found that the classifier constructed with global optimization was significantly more compact. The gain in accuracy, however, was more varied. The good performance of a large scale binary classifier trained using piecewise global optimization in a form amenable to AQC, but solved with classical heuristics, shows that it is possible to map the training of a classifier to AQC with negative translation costs. Any improvements to the solution of the training problem brought about by AQC will directly increase the advantage of the algorithm proposed here over conventional greedy approaches. Access to emerging hardware that realizes the quantum adiabatic algorithm is needed to establish the size of the gain over classical solvers. This gain will depend on the structure of the learning problem.

The proposed framework offers attractive avenues for extensions that we will explore in future work.
Alternative loss functions
We employed the quadratic loss function in training because it maps naturally to the quantum processors manufactured by D-Wave Systems, which support solving quadratic unconstrained binary programming problems. Other loss functions merit investigation as well, including versions that traditionally have not been studied by the machine learning community. A first candidate, for which we have already done preliminary investigations, is the 0-1 loss, since it measures the categorical training error directly and not via a convex relaxation. This loss function is usually discarded due to its computational intractability, which makes it an attractive candidate for the application of AQC. In particular, 0-1 loss will do well on separable data sets with small Bayes error. An example is the data set depicted in Fig. 1 and its higher-dimensional analogs. An objective as in Eqn. 2 employing 0-1 loss and including $\Theta$ in the optimization has the ideal solution as its minimum, while for AdaBoost, as well as for the outer loop algorithm with square loss, the error approaches 50% as the dimension of the input grows larger.

We developed two alternative objective functions that mimic the action of the 0-1 loss in a quadratic optimization framework. Unfortunately this is only possible at the expense of auxiliary variables. The first objective minimizes the norm of the labels $\bar y_s$ simultaneously with finding the smallest set of weights $w_i$ that minimizes the training error. Samples that cannot be classified correctly are flagged by error bits $e_s$:

$$(w^{\mathrm{opt}}, \bar y^{\mathrm{opt}}, e^{\mathrm{opt}}) = \arg\min_{w, \bar y, e} \Biggl( \sum_{s=1}^{S} \biggl( \Bigl(\sum_{i=1}^{N} w_i h_i(x_s) - \mathrm{sign}(y_s)\,\bar y_s\Bigr)^2 + N^2 \Bigl(\sum_{i=1}^{N} w_i h_i(x_s) - \mathrm{sign}(y_s)\,\bar y_s + \mathrm{sign}(y_s)\, N e_s\Bigr)^2 \biggr) + \lambda \sum_{i=1}^{N} w_i \Biggr), \qquad (10)$$

with $\bar y_s \in \{1, 2, 3, \ldots, N\}$. To replace the $N$-valued $\bar y_s$ with binary variables we effected a binary expansion $\bar y_s = \bar y^\dagger + \sum_{k=0}^{\lceil \log_2 N \rceil - 1} \bar y_{k,s} 2^k$, where $\bar y^\dagger$ is a constant we set to 1 for the purpose of preventing $\bar y_s$ from ever becoming 0. Expanding the squares in Eqn. (10) then yields a quadratic binary program over the $w_i$, the $\bar y_{k,s}$ and the $e_s$. The number of binary variables needed is $N$ for $w$, $S \lceil \log_2 N \rceil$ for $\bar y$ and $S$ for $e$.

The computational hardness of learning objectives based on 0-1 loss manifested itself in that, for handcrafted data sets for which we knew the solution, we could see that tabu search was not able to find the minimum assignment. We also conducted a QMC analysis but were not able to determine a finite gap size for problem sizes of 60 variables and larger. However, this was a preliminary analysis that will have to be redone with larger computational resources. The difficulty of determining the gap size led us to propose an alternative version that uses a larger number of auxiliary variables but has a smaller range of coefficients. Samples that can be classified correctly are flagged by indicator bits $e_s^+$; vice versa, samples that cannot be classified correctly are indicated by $e_s^-$:

$$(w^{\mathrm{opt}}, \bar y^{\mathrm{opt}}, e^{\mathrm{opt}}) = \arg\min_{w, \bar y, e} \Biggl( \sum_{s=1}^{S} \biggl( \Bigl(\sum_{i=1}^{N} w_i h_i(x_s) - (e_s^+ - e_s^-)\,\mathrm{sign}(y_s)\,\bar y_s\Bigr)^2 + e_s^- \biggr) + \lambda \sum_{i=1}^{N} w_i \Biggr). \qquad (11)$$

The number of binary variables needed is $N$ for $w$, $S \lceil \log_2 N \rceil$ for $\bar y$ and $2S$ for the $e_s^+$ and $e_s^-$. However, since the objective contains third-order terms, we need to effect a variable change to reduce it to second order: $y_s^+ = e_s^+ \bar y_s$ and $y_s^- = e_s^- \bar y_s$. This adds another $2S \lceil \log_2 N \rceil$ qubits. Due to the large number of binary variables we have not analyzed Eqn. 11 yet.

Co-Optimization of weak classifier parameters

The weak classifiers depend on parameters such as the thresholds $\Theta_l$. Rather than determining these parameters in a process outside of the global optimization, it would better keep with the spirit of our approach to include them in the global optimization as well. The result would look more like a perceptron, but one in which all weights are determined by global optimization. So far we have not been able to find a formulation that only uses quadratic interactions between the variables and that does not need a tremendous number of auxiliary variables. This is due to the fact that the weak classifier parameters live under the sign function, which makes the resulting optimization problem contain terms of order $N$ if no simplifications are effected. Our desire to stay with quadratic optimization stems from the fact that the current D-Wave processors are designed to support this format, and that it will be hard to represent $N$-local interactions in any physical process.

Co-Training of multiple classifiers
Co-Training of multiple classifiers

Our training framework allows for simultaneous training of multiple classifiers with feature sharing in a very elegant way. For example, if two classifiers are to learn similar tasks, then a training objective is formed that sums two objectives of the form described in Eqn. (2), one for each classifier. Then cross terms are introduced that encourage the reuse of weak classifiers and have the form \(\mu \sum_{i=1}^{N} (w_i^A - w_i^B)^2\). The \(w_i^A\) and \(w_i^B\) are the weights of classifiers A and B respectively. From the perspective of classifier A this looks like a special form of context-dependent regularization. The resulting set of classifiers is likely to exhibit higher accuracy and reduced execution times. More importantly, this framework may allow a reduction in the number of necessary training examples.
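The cross-term coupling can be sketched in a few lines (a toy illustration, not the paper's implementation): two classifiers pay an extra penalty whenever they select different weak classifiers, which encourages feature sharing.

```python
# Toy sketch of the co-training cross term: binary weight vectors w_a and
# w_b of two classifiers are coupled by mu * sum_i (wA_i - wB_i)^2.
# Names and values here are illustrative assumptions.

def sharing_penalty(w_a, w_b, mu):
    """Cross term coupling classifiers A and B; zero iff they agree."""
    return mu * sum((a - b) ** 2 for a, b in zip(w_a, w_b))

# Identical selections incur no penalty; disagreements are paid per index.
assert sharing_penalty([1, 0, 1], [1, 0, 1], mu=2.0) == 0.0
assert sharing_penalty([1, 0, 1], [0, 1, 1], mu=2.0) == 4.0
```

Since the weights are binary, \((w_i^A - w_i^B)^2\) is itself quadratic, so the coupled training problem stays in the same QUBO format as a single classifier.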
Incorporating Gestalt principles

The approach also allows us to seamlessly incorporate a priori knowledge about the structure of the classification problem, for instance in the form of Gestalt principles. For example, if the goal is to train an object detector, it may be meaningful to impose the constraint that if a feature is detected at position \(x\) in an image, then there should also be one at a nearby feature position \(x'\). Similarly, we may be able to express symmetry or continuity constraints by introducing appropriate penalty functions on the weight variables optimized during training. Formally, Gestalt principles take on the form of another regularization term, i.e. a penalty term \(G(w)\) that is a function of the weights.
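A minimal sketch of such a term, under assumptions of our own (the feature positions, radius, and pairwise form are illustrative, not from the paper): a quadratic penalty that discourages selecting a feature without also selecting a nearby one.

```python
# Hypothetical Gestalt regularizer G(w): for every pair of features whose
# positions lie within a radius of one another, selecting feature i
# without its neighbor j costs w_i * (1 - w_j). All values are toy
# assumptions for illustration.

positions = [(0, 0), (0, 1), (5, 5)]   # hypothetical feature locations

def gestalt_penalty(w, mu=1.0, radius=2.0):
    """G(w) = mu * sum over nearby pairs (i, j) of w_i * (1 - w_j)."""
    g = 0.0
    for i, (xi, yi) in enumerate(positions):
        for j, (xj, yj) in enumerate(positions):
            if i != j and (xi - xj) ** 2 + (yi - yj) ** 2 <= radius ** 2:
                g += w[i] * (1 - w[j])
    return mu * g

# Selecting both nearby features (indices 0 and 1) is free; selecting
# only one of the pair is penalized. The distant feature is unconstrained.
assert gestalt_penalty([1, 1, 0]) == 0.0
assert gestalt_penalty([1, 0, 0]) == 1.0
```

Because each term \(w_i (1 - w_j)\) is quadratic in the binary weights, such a \(G(w)\) can be folded directly into the same global optimization as the loss and the \(L_0\) regularizer.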
Acknowledgments

We would like to thank Alessandro Bissacco, Jiayong Zhang and Ulrich Buddemeier for their assistance with MapReduce, Boris Babenko for helpful discussions of approaches to boosting, Edward Farhi and David Gosset for their support with the Quantum Monte Carlo simulations, Corinna Cortes for reviewing our initial results, and Hartwig Adam, Jiayong Zhang and Edward Farhi for reviewing drafts of the paper.