Quantum Annealing for Dirichlet Process Mixture Models with Applications to Network Clustering
Issei Sato^a, Shu Tanaka^b, Kenichi Kurihara^c, Seiji Miyashita^d, Hiroshi Nakagawa^a

^a Information Technology Center, University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-0033, Japan
^b Department of Chemistry, University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-0033, Japan
^c Google, 6-10-1 Roppongi, Minato-ku, Tokyo 106-6126, Japan
^d Department of Physics, University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-0033, Japan
Abstract
We developed a new quantum annealing (QA) algorithm for Dirichlet process mixture (DPM) models based on the Chinese restaurant process (CRP). QA is a parallelized extension of simulated annealing (SA), i.e., it is a parallel stochastic optimization technique. Existing approaches (Kurihara et al., 2009; Sato et al., 2009) cannot be applied to the CRP because their QA framework is formulated using a fixed number of mixture components. The proposed QA algorithm can handle an unfixed number of classes in mixture models. We applied QA to a DPM model for clustering vertices in a network, where a CRP seating arrangement indicates a network partition. A multi-core processor was used for running QA in the experiments, the results of which show that QA is better than SA, Markov chain Monte Carlo inference, and beam search at finding a maximum a posteriori estimate of a seating arrangement in the CRP. Since our QA algorithm is as easy to implement as the SA algorithm, it is suitable for a wide range of applications.
Keywords:
Quantum annealing, Dirichlet process, Stochastic optimization, Maximum a posteriori estimation, Bayesian nonparametrics
1. Introduction
Clustering is one of the most important topics in machine learning because it is a fundamental approach to analyzing differences and similarities in data. In statistical machine learning, a probabilistic latent variable model is used for clustering. Dirichlet process mixture (DPM) models (Antoniak, 1974) are well studied, and they enable us to handle an unfixed number of classes, which means that we do not have to decide the number of classes in advance. In other words, they can estimate the number of classes according to the data. A DPM model is often represented by the Chinese restaurant process (CRP) (Aldous, 1985), in which clustering is represented as a seating arrangement of customers in a restaurant. This representation is useful for understanding the clustering process in DPM models.

A clustering problem using a probabilistic model is generally formulated as a maximum a posteriori (MAP) estimation in statistical machine learning. Since finding the exact MAP solution is difficult in many cases, we have to search for an approximate one. Markov chain Monte Carlo (MCMC) inference is widely used for the CRP (Neal, 2000), but MCMC is not necessarily appropriate for MAP estimation. When we use MCMC for MAP estimation, we extract the single class assignment with the highest probability among the class assignments sampled from the posterior distribution. The problem is that this sampling distribution (i.e., the posterior distribution) has to be stationary, and many iterations are needed before it converges.

Daume III (2007) showed that a beam search provides an attractive alternative to MCMC in the CRP. Another approach to MAP estimation is a stochastic search. One of the most well-known stochastic search algorithms is simulated annealing (SA) (Kirkpatrick et al., 1983), which is similar to MCMC but has an additional parameter, called a temperature, that controls the uncertainty of the search space. SA is known to find the global optimum when the cooling temperature reduction
schedule is slow enough (Geman and Geman, 1984), but such a schedule is too slow for practical use. SA with a practical cooling schedule is therefore also affected by the local optimum problem.

In this work, we focus on a novel stochastic search algorithm, quantum annealing (QA), which has attracted attention as an alternative annealing method for optimization problems (Kadowaki and Nishimori, 1998; Farhi et al., 2001; Santoro et al., 2002) in quantum information science (Lloyd, 1996; Nielsen and Chuang, 2000). QA has been shown experimentally to converge faster than SA and to find better local optima for Ising spin models. It has a parameter inducing quantum fluctuation, so the search space is controlled in a way different from that in SA. The details are explained below.

QA is a parallelized extension of SA in which quantum fluctuation is induced by running multiple SAs with interactions. Let us consider running $m$ SAs, and let $\sigma_j$ ($j = 1, \cdots, m$) indicate a state (e.g., a class assignment) of $N$ data points in the $j$-th simulation. In the CRP, we formulate $\sigma_j$ as a table-seating arrangement of customers in the $j$-th CRP (see Fig. 1). We denote $N$ data points as $x = (x_1, x_2, \cdots, x_N)$. The log-likelihood model given $\sigma_j$, i.e., $\log p(x, \sigma_j)$, is based on the way the data is modeled. For simplicity, we denote the log-likelihood by $\log p(\sigma_j)$. In this work, we use the Newman model (Newman and Leicht, 2007) for clustering network data (explained in Sec. 4). QA runs multiple dependent SAs, with "dependent" here meaning that there is an interaction $f$ among neighboring SAs (see the right-hand panel in Fig. 1).

Figure 1: The left-hand panel shows the running of SA, in which $\sigma$ indicates a seating arrangement of $N$ customers in the CRP (i.e., a class assignment of $N$ data points). The right-hand panel shows QA, in which multiple SAs interact through $f$. $\sigma_j$ indicates a seating arrangement of the CRP running in the $j$-th process. Note in QA that $\sigma_m$ interacts with $\sigma_{m-1}$ and $\sigma_1$ (i.e., $\sigma_{m+1} = \sigma_1$), which is mathematically derived from the QA framework (Theorem 3.1). During iterations, we control the hyperparameters.

We describe QA in terms of an optimization problem. When we run $m$ SAs with different random initializations independently, we optimize $\log p(\sigma_j)$ individually. That is, we find $\sigma_j^* = \mathrm{argmax}_{\sigma_j} \log p(\sigma_j)$ for each $j$, and we choose the $\sigma_j^*$ that has the highest $\log p(\sigma_j^*)$ of all $j$. In QA, we optimize the joint probability of the $m$ CRPs' states $\{\sigma_j\}_{j=1}^m$:

$$\max_{(\sigma_1, \sigma_2, \cdots, \sigma_m)} \log p_{\mathrm{QA}}(\{\sigma_1, \sigma_2, \cdots, \sigma_m\}), \qquad (1)$$

where $p_{\mathrm{QA}}(\cdot)$ is a probability measure over a set of states, which means that each state $\sigma_j$ ($j = 1, \cdots, m$) can take an independent state and QA gives the probability of these states. A set of states $(\sigma_1, \sigma_2, \cdots, \sigma_m)$ represents a (quantum) superposition of different states. That is, $p_{\mathrm{QA}}(\cdot)$ is a probability measure over superpositions of different states in the limit of $m \to \infty$ in quantum physics. In the CRP, $(\sigma_1, \sigma_2, \cdots, \sigma_m)$ represents a superposition of $m$ seating arrangements. The optimization problem (1) is actually formulated as

$$\max_{(\sigma_1, \sigma_2, \cdots, \sigma_m)} \sum_{j=1}^{m} \log p_{\mathrm{SA}}(\sigma_j) + f \cdot R(\sigma_1, \sigma_2, \cdots, \sigma_m), \qquad (2)$$

where the first term corresponds to the summation over $m$ SA objectives, and $R(\cdot)$ is regarded as a regularizer among the $m$ states, as described in Sec. 3.5. This optimization is derived from the QA framework explained in Sec. 3.
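To make the structure of the objective in Eq. (2) concrete, here is a minimal sketch in Python (our illustration, not the authors' code; `log_p_sa`, `similarity`, and `f` are placeholders that the paper instantiates with the CRP log-likelihood, the bit-matrix overlap $s$ of Eq. (18), and $f(\beta, \Gamma)$ of Eq. (17)):

```python
def qa_objective(states, log_p_sa, similarity, f):
    """Evaluate the QA objective of Eq. (2): the sum of m SA objectives
    plus an interaction term coupling neighboring states on a ring
    (state sigma_{m+1} is identified with sigma_1)."""
    m = len(states)
    sa_term = sum(log_p_sa(sigma) for sigma in states)
    # Regularizer R: total similarity between neighboring states.
    reg_term = sum(similarity(states[j], states[(j + 1) % m]) for j in range(m))
    return sa_term + f * reg_term
```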
QA was recently used for solving practical optimization problems, such as clustering (Kurihara et al., 2009) and variational Bayes inference (Sato et al., 2009), and it outperformed SA. Figure 2 summarizes QA and related work.

Problems:
Existing approaches (Kurihara et al., 2009; Sato et al., 2009) cannot be applied to the CRP because they need a fixed number of mixture components. Moreover, these approaches have to use a heuristic such as purity to apply their QA algorithms to the clustering problem. Therefore, a different formulation is needed.
Contributions:
The purpose of this study is to propose a QA algorithm for the CRP. The key point is how to represent the states of data in the CRP. The existing work represents the data states as "which class a data point is assigned to." That is, it requires $K$-dimensional indicator vectors to represent the data states, and $K$ (the number of classes) is given and fixed. We instead represent the states of data as "which data points a data point shares the table with" in the CRP. That is, we use an $N$-by-$N$ bit matrix to represent the data states; this matrix indicates a seating arrangement in the CRP and does not depend on $K$. This bit-matrix representation of the CRP is a novel idea and a key point in applying QA to the CRP. Note that the bit matrix is only used for mathematically deriving QA for the CRP and is not used in the actual algorithm. Mathematically, this novelty appears in the interaction function $f$ in Eq. (17), where our derived $f$ does not include the number of classes, $K$, whereas $f$ in existing work is formulated using a (fixed) $K$. Moreover, our algorithm does not require heuristics such as purity. We also use parallel processing in QA, whereas Kurihara et al. (2009) and Sato et al. (2009) used a single process.

Figure 2: QA and related algorithms (e.g., the Metropolis-Hastings, Gibbs, and slice samplers, exchange Monte Carlo (parallel tempering), and particle filters on the MCMC side; the EM algorithm, beam search, SA, and QA on the optimization side). We categorize the algorithms into two groups according to their purposes: MCMC and optimization. MCMC aims at approximating an expectation by a finite sum of samples drawn from the posterior distribution and therefore needs a large number of iterations. Optimization algorithms search for optimal solutions in a small number of iterations. The algorithms are also classified according to the presence of interactions among multiple processes. Note that expectation maximization (EM) algorithms (Dempster et al., 1977) and variational inferences (Blei and Jordan, 2005; Kurihara et al., 2007) cannot be applied to the CRP.

The idea of using a matrix that represents the relationship among data points has been used in several studies (Xu et al., 1993; Frey and Dueck, 2007; Wang and Lai, 2011). These studies used a similarity matrix based on the feature vectors of data points. For example, if $x_i$ and $x_j$ denote the feature vectors of the $i$-th and $j$-th data points, the similarity was calculated by using a Gaussian kernel between $x_i$ and $x_j$. The formulation in this paper differs from these existing approaches because we do not use a similarity matrix based on feature vectors. We use the matrix of data points to formulate the clustering state of the data during the learning process. For example, the bit matrix in this study changes during learning, while the similarity (kernel) matrix in the existing works does not change but is fixed. Moreover, we use the bit matrix only for mathematically deriving QA for the CRP; we do not directly use the matrix in the actual algorithm, whereas the existing approaches directly use the similarity matrix for clustering data.
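To make the bit-matrix representation concrete, the following minimal sketch (our own illustration; the function name is hypothetical) builds the $N$-by-$N$ matrix from a table assignment $Z$, with $B[i, n] = 1$ exactly when customers $i$ and $n$ share a table:

```python
import numpy as np

def seating_to_bit_matrix(z):
    """Convert table assignments z (z[i] = table of customer i) into the
    N-by-N bit matrix B with B[i, n] = 1 iff i and n share a table.
    B is symmetric with a unit diagonal (the "seating conditions" of Sec. 3)."""
    z = np.asarray(z)
    return (z[:, None] == z[None, :]).astype(int)

# Customers 0 and 2 share table 0; customers 1 and 3 share table 1.
print(seating_to_bit_matrix([0, 1, 0, 1]))
```

As stressed above, this matrix is only a device for the derivation; the algorithm in Sec. 3 never materializes it.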
2. Chinese Restaurant Process (CRP)
The CRP is a distribution over partitions, such as clusterings, and is composed of three elements: a customer, a table, and a restaurant. In a clustering problem, the customer denotes a data point and the table denotes a data class. A seating arrangement of customers in a restaurant indicates a class assignment of data. In QA, we run multiple CRPs, i.e., we consider the seating arrangements in multiple restaurants.

The CRP assigns a probability to the seating arrangement of the customers, in which $Z = \{z_i\}_{i=1}^N$ denotes the seating arrangement of the customers and $z_i = k$ indicates that customer $i$ sits at the $k$-th table. $N$ indicates the number of customers. When customer $i$ enters a restaurant with $K$ occupied tables at which other customers are already seated, customer $i$ sits at a table with the following probability:

$$p(z_i \mid Z \backslash z_i; \alpha) \propto \begin{cases} \dfrac{N_k}{\alpha + N - 1} & (k\text{-th occupied table}), \\ \dfrac{\alpha}{\alpha + N - 1} & (\text{new unoccupied table}), \end{cases} \qquad (3)$$

where $N_k$ denotes the number of customers sitting at the $k$-th table, and $\alpha$ is the hyperparameter of the CRP. A customer tends to select a new table when $\alpha$ takes a large value. The probability of $Z$ is given by $p(Z) = \frac{\alpha^{K(Z)}}{\prod_{l=1}^{N}(N - l + \alpha)} \prod_{k=1}^{K(Z)} (N_k - 1)!$, where $K(Z)$ is the number of occupied tables in $Z$.
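For concreteness, here is a minimal sketch (ours, not the paper's code) of drawing a seating arrangement from the CRP prior by applying Eq. (3) to one customer at a time:

```python
import random

def sample_crp(num_customers, alpha, rng=random.Random(0)):
    """Draw a seating arrangement Z from the CRP prior by seating
    customers sequentially according to Eq. (3)."""
    z, counts = [], []          # z[i] = table of customer i; counts[k] = N_k
    for _ in range(num_customers):
        # Unnormalized weights: N_k for occupied tables, alpha for a new table.
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)    # open a new table
        else:
            counts[k] += 1
        z.append(k)
    return z

print(sample_crp(10, alpha=1.0))
```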
3. Quantum Annealing for CRP
This section explains how we derive QA for the CRP (QACRP). First, we introduce some notation and explain QACRP intuitively. Second, we introduce a bit matrix to reformulate the CRP for using QA independently of the number of classes, which is a key idea in this work. Third, we formulate the CRP by using a "density matrix," which is a basic formulation in quantum mechanics. Finally, we apply the Suzuki-Trotter expansion (Trotter, 1959; Suzuki, 1976) to approximate QACRP, because the first-derived QACRP is intractable owing to the computational cost of a matrix exponential.

Figure 3: Example of approximate inference in QACRP where $\beta = m$. Let us consider adding customer 2 to restaurant $j$ (the $j$-th CRP). The classical CRP seats customers at existing tables in proportion to the number of customers already seated (see Eq. (3)). The QACRP sampler derived in Eq. (4) introduces the effect of customers who share tables with customer 2 in the $(j-1)$-th and $(j+1)$-th CRPs. In this case, since customers 1, 3, and 4 share tables with customer 2 in the $(j-1)$-th and $(j+1)$-th CRPs, the 1st table in the $j$-th CRP has these "two" customers and so takes the effect $e^{2 f(\beta, \Gamma)}$ into account ($c^-_{j,1}(2) + c^+_{j,1}(2) = 2$ in Eq. (4)). That is, in the QACRP sampler, when the interaction $f(\beta, \Gamma)$ is large, a customer tends to sit with customers sharing the table in other CRPs.
QACRP uses multiple restaurants. $z_{j,i} = k$ indicates that customer $i$ sits at the $k$-th table in the $j$-th restaurant. $Z_j = \{z_{j,i}\}$ denotes the seating arrangement of customers in the $j$-th restaurant. $c^+_{j,k}(i)$ denotes the number of customers who sit at the $k$-th table in the $j$-th restaurant and share tables with customer $i$ in the $(j+1)$-th restaurant. $c^-_{j,k}(i)$ denotes the number of customers who sit at the $k$-th table in the $j$-th restaurant and share tables with customer $i$ in the $(j-1)$-th restaurant. Customer $i$ sits at a table in the $j$-th restaurant with the following probability:

$$p_{\mathrm{QA}}(z_{j,i} \mid \{Z_d\}_{d=1}^m \backslash \{z_{j,i}\}; \beta, \Gamma) \propto \begin{cases} \left(\dfrac{N_{j,k}}{\alpha + N - 1}\right)^{\frac{\beta}{m}} e^{(c^-_{j,k}(i) + c^+_{j,k}(i)) f(\beta, \Gamma)} & (k\text{-th occupied table}), \\ \left(\dfrac{\alpha}{\alpha + N - 1}\right)^{\frac{\beta}{m}} & (\text{new unoccupied table}), \end{cases} \qquad (4)$$

where $N_{j,k}$ denotes the number of customers sitting at the $k$-th table in the $j$-th restaurant. $f(\beta, \Gamma)$ is derived in Sec. 3.5, Eq. (17), where $\beta$ and $\Gamma$ are hyperparameters called the inverse temperature and the quantum effect, respectively. The inverse temperature is also the hyperparameter of SA. To change the CRP in Eq. (3) into QACRP in Eq. (4), all you have to do is count the customers sharing tables in neighboring CRPs and introduce $f(\beta, \Gamma)$. Figure 3 shows an example of QACRP and provides an intuitive explanation. When $f(\beta, \Gamma) = 0$, QACRP is equivalent to $m$ independent CRPs with inverse temperature $\beta/m$, which we call SACRPs. We provide the details of the derivation in the next sections.

We represent a seating arrangement $Z$ by using a bit matrix $B$ in order to reformulate the CRP without fixing the number of tables (see Contributions in Sec. 1). Although this bit-matrix representation seems to have high computational complexity, the actual QA algorithm does not need any direct calculation on the bit matrix.

A bit matrix $B$ looks like an adjacency matrix of customers (see Fig. 4): $B$ denotes an $N$-by-$N$ bit matrix, where $B_i$ is the $i$-th row vector, i.e., $B_i = (B_{i,1}, B_{i,2}, \cdots, B_{i,N})$, and $B_{i,n}$ is the element in the $i$-th row and $n$-th column of $B$, or the $n$-th element of $B_i$. $\tilde{\sigma}_{i,n}$ ($i, n = 1, \cdots, N$) is a two-dimensional indicator vector, i.e., it takes $(1, 0)^\top$ or $(0, 1)^\top$. We associate $B_{i,n} = 1$ with $\tilde{\sigma}_{i,n} = (1, 0)^\top$ and $B_{i,n} = 0$ with $\tilde{\sigma}_{i,n} = (0, 1)^\top$, which means we can represent $B$ by the $2^{N^2}$-dimensional indicator vector $\sigma$ as follows:

$$B \Leftrightarrow \sigma = \bigotimes_{i=1}^N \bigotimes_{n=1}^N \tilde{\sigma}_{i,n}. \qquad (5)$$

$\bigotimes$ is the Kronecker product, which is a special case of a tensor product. If $A$ is a $k$-by-$l$ matrix and $B$ is an $m$-by-$n$ matrix, then the Kronecker product $A \otimes B$ is the $km$-by-$ln$ block matrix

$$A \otimes B = \begin{pmatrix} a_{11} B & \cdots & a_{1l} B \\ \vdots & \ddots & \vdots \\ a_{k1} B & \cdots & a_{kl} B \end{pmatrix}.$$

For example, $(1, 0)^\top \otimes (0, 1)^\top = (0, 1, 0, 0)^\top$. $\Sigma$ denotes the set of all $\sigma$, i.e., $|\Sigma| = 2^{N^2}$.

Figure 4: Example of the bit-matrix representation. A seating arrangement $Z$ is represented as a bit matrix $B$, which enables us to formulate the CRP without fixing the number of tables. For example, $\tilde{\sigma}_2$ represents the customers who share a table with customer 2. $\tilde{\Sigma}_2$ represents the set of states that customer 2 can take under the seating conditions, and $\rho^{(2)}_{2^9}$ indicates that customer 2 sits alone at a table.

The bit matrix $B$ is regarded as a seating arrangement as follows. If $B_{i,n} = 1$, the $i$-th and $n$-th customers share a table. Note that we need the following conditions to represent seating arrangements with the bit matrix:

1. $B_{i,n} = B_{n,i}$ (symmetric matrix);
2. $B_{i,i} = 1$ ($i = 1, 2, \cdots, N$), i.e., $\mathrm{Tr}\, B = N$;

3. for all $i$ and $l$, $\frac{B_i}{|B_i|} \cdot \frac{B_l}{|B_l|} = 1$ or $0$, where $\cdot$ is the inner product.

$\mathrm{Tr}\, X$ is the trace of $X$. $\tilde{\Sigma} (\subset \Sigma)$ denotes the set of $\sigma$ corresponding to $B$ satisfying the above conditions. We call these conditions "seating conditions."

Here, $\tilde{\sigma}_i$ indicates the state of the $i$-th customer, i.e., with whom the $i$-th customer shares a table (see the left-hand side of Fig. 4), and $\tilde{\sigma}_i$ is the $2^{2N-1}$-dimensional vector

$$\tilde{\sigma}_i = \bigotimes_{n=1}^{i-1} \tilde{\sigma}_{i,n} \otimes \tilde{\sigma}_{i,i} \otimes \bigotimes_{n=i+1}^{N} \tilde{\sigma}_{i,n} \otimes \bigotimes_{n=1, n \neq i}^{N} \tilde{\sigma}_{n,i}. \qquad (6)$$

Let $\tilde{\Sigma}_i$ be the set of states that $\tilde{\sigma}_i$ can take under the seating conditions (i.e., $\sigma \in \tilde{\Sigma}$) when the $i$-th row elements and the $i$-th column elements are blank and the others are filled (see the right-hand side of Fig. 4). Since $\tilde{\Sigma}_i$ is the set of table-assignment states of the $i$-th customer, $|\tilde{\Sigma}_i| = K(Z \backslash \{z_i\}) + 1$. For example, the right-hand side of Fig. 4 shows table assignments of the 2nd customer when customers 1, 3, 4, and 5 have already been seated. $\rho^{(i)}_{2^{2N-1}}$ is defined as the $2^{2N-1}$-dimensional vector

$$\rho^{(i)}_{2^{2N-1}} = \bigotimes_{n=1}^{i-1} (0, 1)^\top \otimes (1, 0)^\top \otimes \bigotimes_{n=i+1}^{N} (0, 1)^\top \otimes \bigotimes_{n=1, n \neq i}^{N} (0, 1)^\top. \qquad (7)$$

The right-hand side of Fig. 4 shows an example of $\rho^{(i)}_{2^{2N-1}}$. We use $\rho^{(i)}_{2^{2N-1}}$ only in Appendix A.

We define the energy function $E$ over $\sigma^{(l)} \in \Sigma$ ($l = 1, \cdots, 2^{N^2}$) by $E(\sigma^{(l)}) = -\log p(\sigma^{(l)})$, where if $\sigma^{(l)} \in \Sigma \backslash \tilde{\Sigma}$, then $p(\sigma^{(l)}) = 0$, i.e., $E(\sigma^{(l)}) = +\infty$. The probability of a state $\sigma (\in \Sigma)$ is given by

$$p(\sigma) = \frac{1}{Z} \sigma^\top e^{-\mathcal{H}_c} \sigma, \qquad (8)$$

where $\mathcal{H}_c = \mathrm{diag}\left[E(\sigma^{(1)}), E(\sigma^{(2)}), \cdots, E(\sigma^{(2^{N^2})})\right]$, and $\mathrm{diag}[\cdot]$ denotes a diagonal matrix. Note that $Z = \sum_\sigma \sigma^\top e^{-\mathcal{H}_c} \sigma = \mathrm{Tr}\, e^{-\mathcal{H}_c}$, where $\mathcal{H}_c$ is called the classical Hamiltonian. If $\sigma \in \tilde{\Sigma}$, then $p(\sigma)$ is equal to $p(Z)$, i.e., $p(\sigma)$ is the probability of a seating arrangement. Since $\mathcal{H}_c$ is diagonal, $e^{-\mathcal{H}_c}$ is also diagonal with the $k$-th diagonal element $e^{-E(\sigma^{(k)})}$. That is, $p(\sigma^{(k)}) = \frac{1}{Z} \sigma^{(k)\top} e^{-\mathcal{H}_c} \sigma^{(k)} = \frac{1}{Z} e^{-E(\sigma^{(k)})}$.

The basic approach to expanding a classical system into a quantum one is to make the Hamiltonian non-diagonal, i.e., to add some off-diagonal elements while keeping hermiticity. We define a non-diagonal matrix $\mathcal{H}$ by

$$\mathcal{H} = \mathcal{H}_c + \mathcal{H}_q, \qquad (9)$$

where $\mathcal{H}_q$ is a non-diagonal matrix (we give the definition of $\mathcal{H}_q$ later). Intuitively, the diagonal elements of $\mathcal{H}_q$ are zero, and some of its off-diagonal elements are filled with $\Gamma$. That is, $\mathcal{H}$ is filled with the energy $E(\sigma)$ in its diagonal elements and the quantum effect $\Gamma$ in off-diagonal elements. This scheme of adding a non-diagonal matrix ($\mathcal{H}_q$) to a diagonal matrix ($\mathcal{H}_c$) is a basic approach in quantum physics and has also worked well in (Kurihara et al., 2009; Sato et al., 2009). The meaning of this formulation was described in (Sato et al., 2009) in terms of uncertainty.

The probability of a state $\sigma (\in \Sigma)$ in a quantum system is given by
$$p_{\mathrm{QA}}(\sigma; \beta, \Gamma) = \frac{1}{Z} \sigma^\top e^{-\beta(\mathcal{H}_c + \mathcal{H}_q)} \sigma, \qquad (10)$$

where $Z = \sum_\sigma \sigma^\top e^{-\beta(\mathcal{H}_c + \mathcal{H}_q)} \sigma = \mathrm{Tr}[e^{-\beta(\mathcal{H}_c + \mathcal{H}_q)}]$. The optimization problem

$$\max_\sigma \log p_{\mathrm{QA}}(\sigma; \beta, \Gamma) \qquad (11)$$

could be solved by using the eigenvalue decomposition of the density matrix $\frac{1}{Z} e^{-\beta(\mathcal{H}_c + \mathcal{H}_q)}$, but this approach is intractable because of its large computational cost.

One approximate approach to solving the optimization problem (11) is a stochastic search that draws the state of the $i$-th customer, $\tilde{\sigma}_i$, from

$$p(\tilde{\sigma}_i \mid \sigma \backslash \tilde{\sigma}_i) = \frac{\sigma^\top e^{-\mathcal{H}_c} \sigma}{\sum_{\tilde{\sigma}_i} \sigma^\top e^{-\mathcal{H}_c} \sigma}, \qquad (12)$$

$$p_{\mathrm{QA}}(\tilde{\sigma}_i \mid \sigma \backslash \tilde{\sigma}_i; \beta, \Gamma) = \frac{\sigma^\top e^{-\beta(\mathcal{H}_c + \mathcal{H}_q)} \sigma}{\sum_{\tilde{\sigma}_i} \sigma^\top e^{-\beta(\mathcal{H}_c + \mathcal{H}_q)} \sigma}, \qquad (13)$$

where $\sigma \backslash \tilde{\sigma}_i$ indicates that the bits excluding the $i$-th row and $i$-th column elements are fixed. The summation over $\tilde{\sigma}_i$ is actually the summation over $\tilde{\sigma}_i \in \tilde{\Sigma}_i$; therefore, the classical system $p(\tilde{\sigma}_i \mid \sigma \backslash \tilde{\sigma}_i)$ is tractable, and $p(\tilde{\sigma}_i \mid \sigma \backslash \tilde{\sigma}_i)$ in Eq. (12) is another expression of Eq. (3). Calculation of the probability of the quantum system $p_{\mathrm{QA}}(\tilde{\sigma}_i \mid \sigma \backslash \tilde{\sigma}_i; \beta, \Gamma)$, however, is intractable because of the exponential operation on the non-diagonal matrix $\mathcal{H} = \mathcal{H}_c + \mathcal{H}_q$. We therefore need the alternative approach described in Sec. 3.5.

We define $\mathcal{H}_q$ as follows:

$$\mathcal{H}_q = -\Gamma \sum_{i=1}^N \sum_{n=1}^N \sigma^x_{i,n}, \quad E = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}, \quad \sigma^x = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix},$$

$$\sigma^x_{i,n} = \left(\bigotimes_{t=1}^{i-1} \bigotimes_{u=1}^{N} E\right) \otimes \left[\left(\bigotimes_{t=1}^{n-1} E\right) \otimes \sigma^x \otimes \left(\bigotimes_{t=n+1}^{N} E\right)\right] \otimes \left(\bigotimes_{t=i+1}^{N} \bigotimes_{u=1}^{N} E\right), \qquad (14)$$

where $\Gamma$ is the quantum effect parameter. This formulation means that the diagonal elements of $\mathcal{H}_q$ are zero, and some off-diagonal elements are filled with $\Gamma$. Although other definitions could be considered, we adopt this formulation so that the derivation of the search algorithm becomes tractable via an approximation method that is easy to implement.

We cannot calculate $p_{\mathrm{QA}}(\tilde{\sigma}_i \mid \sigma \backslash \tilde{\sigma}_i; \beta, \Gamma)$ in Eq. (13) because $\sigma^\top e^{-\beta \mathcal{H}} \sigma$ is intractable owing to the non-diagonal matrix $\mathcal{H}$. We use the Suzuki-Trotter expansion (Trotter, 1959; Suzuki, 1976) to approximate $p_{\mathrm{QA}}(\tilde{\sigma}_i \mid \sigma \backslash \tilde{\sigma}_i; \beta, \Gamma)$.

We consider multiple running CRPs in which $\sigma_j$ ($j = 1, \cdots, m$) indicates the seating arrangement of the $j$-th CRP and represents the $j$-th bit matrix $B_j$. We associate $B_{j,i,n} = 1$ with $\tilde{\sigma}_{j,i,n} = (1, 0)^\top$ and $B_{j,i,n} = 0$ with $\tilde{\sigma}_{j,i,n} = (0, 1)^\top$, which means that we can represent $B_j$ as $\sigma_j$ by using Eq. (5). We derive the following theorem:

Theorem 3.1. $p_{\mathrm{QA}}(\sigma; \beta, \Gamma)$ in Eq. (10) is approximated by the Suzuki-Trotter expansion as follows:
$$p_{\mathrm{QA}}(\sigma; \beta, \Gamma) = \frac{1}{Z} \sigma^\top e^{-\beta(\mathcal{H}_c + \mathcal{H}_q)} \sigma = \sum_{\sigma_j (j \geq 2)} p_{\text{QA-ST}}(\sigma_1, \sigma_2, \cdots, \sigma_m; \beta, \Gamma) + O\left(\frac{\beta^2}{m}\right), \qquad (15)$$

where we rewrite $\sigma$ as $\sigma_1$, and

$$p_{\text{QA-ST}}(\sigma_1, \sigma_2, \cdots, \sigma_m; \beta, \Gamma) = \prod_{j=1}^m \frac{1}{Z(\beta, \Gamma)} e^{-\frac{\beta}{m} E(\sigma_j)} e^{f(\beta, \Gamma)\, s(\sigma_j, \sigma_{j+1})}, \qquad (16)$$

$$f(\beta, \Gamma) = 2 \log \coth\left(\frac{\beta}{m} \Gamma\right), \qquad (17)$$

$$s(\sigma_j, \sigma_{j+1}) = \sum_{i=1}^N \sum_{n=1}^N \delta(\tilde{\sigma}_{j,i,n}, \tilde{\sigma}_{j+1,i,n}), \qquad (18)$$

$$Z(\beta, \Gamma) = \left[\sinh\left(\frac{\beta}{m} \Gamma\right)\right]^{N^2} \sum_\sigma e^{-\frac{\beta}{m} E(\sigma)}. \qquad (19)$$

Note that $\sigma_{m+1} = \sigma_1$. The proof is given in Appendix A. Note also that our derived $f$ in Eq. (17) does not include the number of classes, $K$, whereas the $f$ in existing work (Kurihara et al., 2009; Sato et al., 2009) is formulated using a fixed $K$.

Equation (15) is interpreted as follows. $p_{\mathrm{QA}}(\sigma; \beta, \Gamma)$ is approximated by marginalizing out the other states $\{\sigma_j\}_{j \geq 2}$ of $p_{\text{QA-ST}}(\sigma_1, \sigma_2, \cdots, \sigma_m; \beta, \Gamma)$. As shown in Eq. (16), $p_{\text{QA-ST}}(\sigma_1, \sigma_2, \cdots, \sigma_m; \beta, \Gamma)$ looks like the joint probability of the states of $m$ dependent CRPs. In Eq. (16), $e^{-\frac{\beta}{m} E(\sigma_j)}$ corresponds to the classical CRP with inverse temperature $\beta/m$, and $e^{f(\beta, \Gamma) s(\sigma_j, \sigma_{j+1})}$ is the quantum effect part. If $f(\beta, \Gamma) = 0$, which means the CRPs are independent, $p_{\text{QA-ST}}(\sigma_1, \sigma_2, \cdots, \sigma_m; \beta, \Gamma)$ is equal to the product of the probabilities of $m$ classical CRPs. $s(\sigma_j, \sigma_{j+1}) (> 0)$ is regarded as a similarity function between the $j$-th and $(j+1)$-th bit matrices; if they are the same matrix, then $s(\sigma_j, \sigma_{j+1}) = N^2$. In Eq. (2), $\log p_{\mathrm{SA}}(\sigma_j)$ corresponds to $\log e^{-\frac{\beta}{m} E(\sigma_j)}/Z$, and the regularizer term $f \cdot R(\sigma_1, \cdots, \sigma_m)$ is $\log \prod_{j=1}^m e^{f(\beta, \Gamma) s(\sigma_j, \sigma_{j+1})} = f(\beta, \Gamma) \sum_{j=1}^m s(\sigma_j, \sigma_{j+1})$.

Note that our aim is to derive the approximate inference for $p_{\mathrm{QA}}(\tilde{\sigma}_i \mid \sigma \backslash \tilde{\sigma}_i; \beta, \Gamma)$ in Eq. (13). Using Theorem 3.1, we can derive Eq. (4) as that approximate inference. The details of the derivation are provided in Appendix B.
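To show how little Eq. (4) changes relative to a classical CRP sweep, here is a minimal sketch of re-seating one customer in one restaurant (our own data-structure choices, not the authors' implementation; `tables[j]` maps each customer to a table id in the $j$-th restaurant):

```python
import math
import random

def qacrp_reseat(tables, j, i, alpha, beta, m, f, rng=random.Random(0)):
    """Re-seat customer i in restaurant j according to Eq. (4);
    f is the interaction f(beta, Gamma) of Eq. (17)."""
    z = tables[j]
    n = len(z)
    counts = {}                                   # N_{j,k} with customer i removed
    for c, t in z.items():
        if c != i:
            counts[t] = counts.get(t, 0) + 1
    left, right = tables[(j - 1) % m], tables[(j + 1) % m]   # ring: sigma_{m+1} = sigma_1
    cand, weights = [], []
    for t, n_jk in counts.items():
        # c^-_{j,k}(i) + c^+_{j,k}(i): customers at table t of restaurant j that
        # share a table with i in the (j-1)-th or (j+1)-th restaurant.
        share = sum((left[c] == left[i]) + (right[c] == right[i])
                    for c, tc in z.items() if c != i and tc == t)
        cand.append(t)
        weights.append((n_jk / (alpha + n - 1)) ** (beta / m) * math.exp(share * f))
    cand.append(max(counts, default=-1) + 1)      # new, unoccupied table
    weights.append((alpha / (alpha + n - 1)) ** (beta / m))
    z[i] = rng.choices(cand, weights=weights)[0]
```

A full sweep applies this update to every customer $i$ in every restaurant $j$, exactly as a classical CRP Gibbs sweep would; only the counting of shared tables in the neighboring restaurants and the factor $e^{(c^- + c^+) f}$ are new.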
4. Experiments
We evaluated QA in a real application. We applied QA to a DPM model for clustering vertices in a network, where a seating arrangement of the CRP indicates a network partition.

Figure 5: Examples of network structures.
We used the Newman model (Newman and Leicht, 2007) for network modeling in this work. The Newman model is a probabilistic generative network model. This model is flexible, which enables researchers to analyze observed graph data without specifying the network structure (disassortative or assortative) in advance.

In an assortative network, such as a social network, the members (vertices) of each class are mostly connected to other members of the same class. The communication between members of three social groups is illustrated in Fig. 5-a, where one sees that the members generally communicate more with others in the same group than they do with those outside the group. In a disassortative network, the members (vertices) have most of their connections outside their class. An election network of supporters and candidates is illustrated in Fig. 5-b, where a link indicates support for a candidate. The Newman model can model not only these two kinds of networks but also a mixture of them, such as a citation network (see Fig. 5-c), but the user must decide the number of classes in advance. We therefore used the DPM extension of the Newman model, as described in Appendix C.
4.2. Dataset

We used three social network datasets: Netscience, Citeseer, and Wikivote. Netscience is a coauthorship network of scientists working on network science that has 1,589 scientists (vertices). Citeseer is a citation network dataset of 2,110 papers (vertices). Wikivote is a bipartite network constructed using administrator elections and vote history data in Wikipedia (available at http://snap.stanford.edu/data/wiki-Vote.html); its 7,115 vertices represent Wikipedia users, and a directed edge from vertex $i$ to vertex $j$ represents that user $i$ voted for user $j$. Netscience, Wikivote, and Citeseer respectively correspond to network examples a, b, and c in Fig. 5. We used the vertices in these networks to represent customers in the CRP, and we used the negative log-likelihood as the energy function to find the MAP solution.

We tested several $\beta/m$ schedules using combinations of three values of $\beta_0$ (each proportional to $m$) and the schedules $\beta = \beta_0 \log(1 + t)$, $\beta = \beta_0 \sqrt{t}$, and $\beta = \beta_0 t$, where $t$ denotes the $t$-th iteration. The results we observed in our experiments showed that $\beta = \beta_0 \sqrt{t}$ created a better schedule in SA in terms of the MAP estimation. That is, $\beta$ increases to $\beta_0 \sqrt{T}$, where $T$ is the total number of iterations. In QA, we use the same $\beta/m$ schedule we used in the SAs. Note that since a new table is easy to sample at a very small $\beta$ (where the probability distribution becomes flat; see Eq. (4)), the SACRP has many tables at small $\beta$ and converges very slowly. That is, inverse temperatures that are too low do not work well in the CRP.

Since the interaction $f$ is a function of $\Gamma$ and $\beta$, in practice we have to consider the schedule of $f(\beta, \Gamma)$. The interaction $f(\beta, \Gamma)$ increases when $\frac{\beta \Gamma}{m}$ decreases. QA is known to work well when $f(\beta, \Gamma)$ starts from zero (i.e., "independent" multiple SAs) and gradually increases. This behavior of $f(\beta, \Gamma)$ is achieved when $\frac{\beta \Gamma}{m}$ is a decreasing function of $t$. Therefore, we use

$$\frac{\beta \Gamma}{m} = \Gamma_0 \frac{T}{t}, \qquad (20)$$

where $\Gamma_0$ is a tuning parameter.
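The following is a minimal sketch of these schedules under the reconstruction above (beta0 and gamma0 are illustrative values, not the paper's tuned ones; Gamma is obtained by solving Eq. (20) for the current beta):

```python
import math

def schedules(t, T, m, beta0, gamma0):
    """Annealing schedules for QACRP: beta grows as beta0 * sqrt(t), and
    Gamma is set so that beta * Gamma / m = gamma0 * T / t (Eq. (20)),
    so the interaction f of Eq. (17) starts near zero and grows with t."""
    beta = beta0 * math.sqrt(t)
    gamma = (gamma0 * T / t) * m / beta
    f = 2.0 * math.log(1.0 / math.tanh(beta * gamma / m))    # f(beta, Gamma), Eq. (17)
    return beta, gamma, f

for t in (1, 10, 30):
    print(schedules(t, T=30, m=16, beta0=0.5, gamma0=2.0))
```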
Figure 6: Experimental results for the maximum log-likelihoods (minimum energies) of QA and the other algorithms on Wikivote, Netscience, and Citeseer. The vertical axes show the difference of maximum log-likelihoods; a higher value is better (closer to optimal states). We denote by $L^{\max}_{\mathrm{QA}}$ and $L^{\max}_{16\,\mathrm{SAs}}$ the maximum log-likelihood of 16 CRPs in QA and SA, by $L^{\max}_{1600\,\mathrm{SAs}}$ the maximum log-likelihood of 1600 CRPs in SA, and by $L^{\max}_{\mathrm{Beam}}$ the maximum log-likelihood of the beam search with beam width 100. The solid line indicates $L^{\max}_{\mathrm{QA}} - L^{\max}_{\mathrm{Beam}}$. The lower (red) and upper dotted lines indicate $L^{\max}_{16\,\mathrm{SAs}} - L^{\max}_{\mathrm{Beam}}$ and $L^{\max}_{1600\,\mathrm{SAs}} - L^{\max}_{\mathrm{Beam}}$. When these lines take positive values, QA and SA outperform the beam search; whenever the solid line is higher than the dotted lines, QA outperforms the SAs. The horizontal axes show $\Gamma_0$ for QA.

4.4. Experimental settings

Our purpose is to search for a better MAP solution to a CRP in a small number of iterations (or a short running time). We evaluated the optimization algorithms in terms of maximum log-likelihood because we want the state with the highest log-likelihood. We compared QA with SA and the beam search. We used the beam search with an inadmissible score function, which achieved the best performance in (Daume III, 2007), and set the beam width to 100. We did not compare variational inference with QA because variational inference cannot deal with the Chinese restaurant formulation of the Dirichlet process mixture; that is, it is hard to compare them because their objective functions are different.

Since we used a multi-core processor with 16 cores, we set $m = 16$ (i.e., we ran one CRP on each core). We set $\alpha = 1$ in the CRP. $\alpha$ is easy to estimate in SAs and QA, but the beam search cannot estimate it; therefore, we fixed it in these experiments. The number of iterations, $T$, for the SAs and QA was 30. We generated 16 random seating arrangements $\{\sigma^{(\mathrm{random})}_j\}_{j=1}^{16}$ for the initial settings of QA and the 16 SAs, i.e., we used $\sigma^{(\mathrm{random})}_j$ as the initial setting of the $j$-th seating arrangement in both. Moreover, we compared QA ($m = 16$) with 1600 SAs, whose MAP solution is the best one of 1600 simulations with different random initializations. In the 1600 SAs, we tried 100 seeds and generated $m = 16$ random seating arrangements for each seed, i.e., we ran the CRPs with $100 \times m(=16) = 1600$ initial seating arrangements.

Figure 6 shows the experimental results. QA and the SAs outperformed the beam search because each line takes a positive value. QA finds a better local optimum than that of the 16 and 1600 SAs at some $\Gamma_0$. This means that it is more useful to run QA while changing $\Gamma_0$ than to run multiple SAs.

The effective $\Gamma_0$ has a positive correlation with the number of nodes. For example, the effective $\Gamma_0$ is around 2 in Netscience and around 3 in Citeseer. The quantum effect enters Eq. (4) as $C \cdot f(\beta, \Gamma)$, where $C$ is the number of customers who share tables and thus depends on the number of customers (nodes). This means that the effective parameter range can be inferred from the number of nodes, and we only have to check a few $\Gamma_0$ values.

QA needs more time and memory than SA, linearly in $m$, because QA uses $m$ CRPs. However, when a parallel processing environment can be used and we run multiple SAs in parallel, the scalability of QA is the same as that of the SAs. QA ($T = 30$, $m = 16$) and SA ($T = 30$, $m = 1$) took about 15 and 13 seconds for Netscience, about 25 and 22 seconds for Citeseer, and about 79 and 76 seconds for Wikivote, where each SA value is the averaged running time of a single simulation.
Because of the multi-core processing and the caching of customers sharing tables, the running time of QA was almost the same as that of a single SA. Therefore, QA makes the CRPs converge faster and finds a better seating arrangement than multiple runs of SA. The estimated numbers of classes achieving the best performance in QA ($m = 16$), the 16 SAs, the 1600 SAs, and the beam search are 26, 22, 65, and 61 for Netscience; 37, 35, 30, and 57 for Citeseer; and 8, 8, 8, and 27 for Wikivote.

We found that a small $\Gamma_0$ induces a fast schedule of $f$, which means that $f$ grows much faster than $\beta$. Such fast schedules make the convergence of QA too fast; therefore, QA converges at a worse optimum. QA is similar to the SAs at large $\Gamma_0$ because the interaction $f$ remains at almost 0 for a limited number of iterations. A larger $\Gamma_0$ makes the CRPs in QA more independent, which means the results of QA approach those of SA. We found that the interaction $f$ is almost zero when $\Gamma_0 = 5$ and $T = 30$, which means that the performance of QA is then similar to that of SA. Therefore, in practice, we only check values of $\Gamma_0$ in descending order from a large value, such as $\Gamma_0 = 5$. That is, the effective value range is easy to infer from a few $\Gamma_0$ values (in our experimental results, we show some non-effective values in order to provide negative examples for QA).
5. Conclusion
We proposed a QA algorithm for DPM models based on the CRP. Our algorithm differs from those of Kurihara et al. (2009) and Sato et al. (2009) in three ways: (i) it can handle an unfixed number of classes in mixture models, (ii) it does not require heuristics such as purity, and (iii) it uses parallel processing in QA. The proposed algorithm (Eq. (4)) is easy to implement because it is similar to the classical CRP (Eq. (3)). That is, it is easy to apply the proposed algorithm to other nonparametric models to which it is not easy to apply a beam search, such as the infinite relational model (Kemp et al., 2006). The proposed algorithm will therefore be a promising new optimization technique when used with rapidly advancing multi-core processors. As shown in Eq. (2), our algorithm can be regarded as optimization with a regularization term, and its performance depends on the parameter $f$, just as the performance of other regularized optimization algorithms depends on their regularization parameters (e.g., the L1- and L2-regularized optimization algorithms often used in machine learning). For future work, it will be worth analyzing what kind of schedule of $f$ enables QA to work well.

Acknowledgements
The present work was partially supported by the Mitsubishi Foundation and by the Next Generation Super Computer Project, Nanoscience Program from MEXT of Japan. The authors are partially supported by a Grant-in-Aid for JSPS Fellows (23-7601). Numerical calculations were partly performed on supercomputers at the Institute for Solid State Physics, University of Tokyo. This research was funded in part by the Joint Usage/Research Center for Interdisciplinary Large-scale Information Infrastructures in Japan. This work was supported by a JSPS Grant-in-Aid for Young Scientists (B) 24700135.
Appendix A. Proof of Theorem 3.1
We use the following Trotter product formula (Trotter, 1959) to approximate

$$p_{\mathrm{QA}}(\sigma; \beta, \Gamma) = \frac{1}{Z} \sigma^\top e^{-\beta(\mathcal{H}_c + \mathcal{H}_q)} \sigma. \qquad (A.1)$$

If $A_1, ..., A_L$ are symmetric matrices, we have

$$\exp\left(\sum_{l=1}^L A_l\right) = \left[\prod_{l=1}^L \exp(A_l/m)\right]^m + O\left(\frac{1}{m}\right). \qquad (A.2)$$

Applying the Trotter product formula to Eq. (A.1) with finite $m$, we have

$$p_{\mathrm{QA}}(\sigma; \beta, \Gamma) = \frac{1}{Z} \sigma^\top e^{-\beta(\mathcal{H}_c + \mathcal{H}_q)} \sigma \approx \frac{1}{Z} \sigma^\top \left(e^{-\frac{\beta}{m}\mathcal{H}_c} e^{-\frac{\beta}{m}\mathcal{H}_q}\right)^m \sigma. \qquad (A.3)$$

We evaluate the residual of this approximation. Since $e^{A_1 + A_2} = e^{A_1} e^{A_2}$ does not hold in general (if $A_1 A_2 = A_2 A_1$, then $e^{A_1 + A_2} = e^{A_1} e^{A_2}$), we need to use the Trotter product formula for the computation. We rewrite $\sigma$ as $\sigma_1$ and note that $\sigma^\top (e^A)^2 \sigma = \sum_{\sigma'} \sigma^\top e^A \sigma' \, \sigma'^\top e^A \sigma$. Hence, we express Eq. (A.3) by marginalizing out the auxiliary variables $\{\sigma'_1, \sigma_2, \sigma'_2, ..., \sigma_m, \sigma'_m\}$:

$$\sigma_1^\top \left(e^{-\frac{\beta}{m}\mathcal{H}_c} e^{-\frac{\beta}{m}\mathcal{H}_q}\right)^m \sigma_1 = \sum_{\sigma'_1} \sum_{\sigma_2} \cdots \sum_{\sigma_m} \sum_{\sigma'_m} \sigma_1^\top e^{-\frac{\beta}{m}\mathcal{H}_c} \sigma'_1 \, \sigma'^\top_1 e^{-\frac{\beta}{m}\mathcal{H}_q} \sigma_2 \cdots \sigma_m^\top e^{-\frac{\beta}{m}\mathcal{H}_c} \sigma'_m \, \sigma'^\top_m e^{-\frac{\beta}{m}\mathcal{H}_q} \sigma_{m+1} = \sum_{\sigma'_1} \sum_{\sigma_2} \cdots \sum_{\sigma_m} \sum_{\sigma'_m} \prod_{j=1}^m \sigma_j^\top e^{-\frac{\beta}{m}\mathcal{H}_c} \sigma'_j \, \sigma'^\top_j e^{-\frac{\beta}{m}\mathcal{H}_q} \sigma_{j+1}, \qquad (A.4)$$

where $\sigma_{m+1} = \sigma_1$. To express Eq. (A.4) more concretely, we use the following Lemma Appendix A.1 and Lemma Appendix A.2.

Lemma Appendix A.1.

$$\sigma_j^\top e^{-\frac{\beta}{m}\mathcal{H}_c} \sigma'_j = \exp\left[-\frac{\beta}{m} E(\sigma_j)\right] \delta(\sigma_j, \sigma'_j), \qquad (A.5)$$

where $\delta(\sigma_j, \sigma'_j) = 1$ if $\sigma_j = \sigma'_j$ and $\delta(\sigma_j, \sigma'_j) = 0$ otherwise.

Proof. By definition, $e^{-\frac{\beta}{m}\mathcal{H}_c}$ is diagonal with $[e^{-\frac{\beta}{m}\mathcal{H}_c}]_{kk} = e^{-\frac{\beta}{m}E(\sigma^{(k)})}$, and $\sigma_j$ and $\sigma'_j$ are binary indicator vectors, i.e., only one element in $\sigma_j$ is one and the others are zero. Thus, the above lemma holds.

Lemma Appendix A.2.

$$\sigma'^\top_j e^{-\frac{\beta}{m}\mathcal{H}_q} \sigma_{j+1} = \left[\sinh\left(\frac{\beta}{m}\Gamma\right)\right]^{N^2} \exp\left[\sum_{i=1}^N \sum_{n=1}^N \delta(\tilde{\sigma}'_{j,i,n}, \tilde{\sigma}_{j+1,i,n}) \log \coth\left(\frac{\beta}{m}\Gamma\right)\right]. \qquad (A.6)$$

Proof.
Using $(A \otimes B)(C \otimes D) = (AC) \otimes (BD)$ and the fact that $e^{A_1 + A_2} = e^{A_1} e^{A_2}$ when $A_1 A_2 = A_2 A_1$, we find

$$\sigma'^\top_j e^{-\frac{\beta}{m}\mathcal{H}_q} \sigma_{j+1} = \sigma'^\top_j e^{-\frac{\beta}{m}\{-\Gamma \sum_{i=1}^N \sum_{n=1}^N \sigma^x_{i,n}\}} \sigma_{j+1} = \sigma'^\top_j \left(\bigotimes_{i=1}^N \bigotimes_{n=1}^N e^{\frac{\beta}{m}\Gamma \sigma^x}\right) \sigma_{j+1} = \prod_{i=1}^N \prod_{n=1}^N \tilde{\sigma}'^\top_{j,i,n} e^{\frac{\beta}{m}\Gamma \sigma^x} \tilde{\sigma}_{j+1,i,n}. \qquad (A.7)$$

Note that $\sigma_j = \bigotimes_{i=1}^N \bigotimes_{n=1}^N \tilde{\sigma}_{j,i,n}$. Furthermore,

$$e^{\frac{\beta}{m}\Gamma \sigma^x} = \sum_{l=0}^\infty \frac{1}{l!} \left(\frac{\beta}{m}\Gamma \sigma^x\right)^l = \left[1 + \frac{1}{2!}\left(\frac{\beta}{m}\Gamma\right)^2 + \cdots\right] E + \left[\frac{\beta}{m}\Gamma + \frac{1}{3!}\left(\frac{\beta}{m}\Gamma\right)^3 + \cdots\right] \sigma^x = \cosh\left(\frac{\beta}{m}\Gamma\right) E + \sinh\left(\frac{\beta}{m}\Gamma\right) \sigma^x. \qquad (A.8)$$

Substituting Eq. (A.8) into Eq. (A.7), we have Eq. (A.6) because

$$\tilde{\sigma}'^\top_{j,i,n} \left[\cosh\left(\frac{\beta}{m}\Gamma\right) E + \sinh\left(\frac{\beta}{m}\Gamma\right) \sigma^x\right] \tilde{\sigma}_{j+1,i,n} = \cosh\left(\frac{\beta}{m}\Gamma\right) \delta(\tilde{\sigma}'_{j,i,n}, \tilde{\sigma}_{j+1,i,n}) + \sinh\left(\frac{\beta}{m}\Gamma\right) \left(1 - \delta(\tilde{\sigma}'_{j,i,n}, \tilde{\sigma}_{j+1,i,n})\right). \qquad (A.9)$$

Substituting Lemma Appendix A.1 and Lemma Appendix A.2 into Eq. (A.4), we obtain

$$\sigma_1^\top \left(e^{-\frac{\beta}{m}\mathcal{H}_c} e^{-\frac{\beta}{m}\mathcal{H}_q}\right)^m \sigma_1 = \sum_{\sigma_2} \cdots \sum_{\sigma_m} \prod_{j=1}^m \exp\left[-\frac{\beta}{m} E(\sigma_j)\right] \left[\sinh\left(\frac{\beta}{m}\Gamma\right)\right]^{N^2} \exp\left[\sum_{i=1}^N \sum_{n=1}^N \delta(\tilde{\sigma}_{j,i,n}, \tilde{\sigma}_{j+1,i,n}) \log \coth\left(\frac{\beta}{m}\Gamma\right)\right], \qquad (A.10)$$

and from Eq. (A.3), we have shown

$$p_{\mathrm{QA}}(\sigma_1; \beta, \Gamma) \approx \sum_{\sigma_2} \cdots \sum_{\sigma_m} p_{\text{QA-ST}}(\sigma_1, \sigma_2, ..., \sigma_m; \beta, \Gamma). \qquad (A.11)$$

The same relation holds for $\sigma_2, \sigma_3, \cdots, \sigma_m$.

Appendix B. Derivation of Eq. (4)

Since

$$p_{\mathrm{QA}}(\tilde{\sigma}_{j,i} \mid \{\sigma_d\}_{d=1}^m \backslash \{\tilde{\sigma}_{j,i}\}; \beta, \Gamma) \propto e^{-\frac{\beta}{m} E(\sigma_j)} e^{f(\beta, \Gamma)(s(\sigma_{j-1}, \sigma_j) + s(\sigma_j, \sigma_{j+1}))}, \qquad (B.1)$$

we have

$$p_{\mathrm{QA}}(\tilde{\sigma}_{j,i} \mid \{\sigma_d\}_{d=1}^m \backslash \{\tilde{\sigma}_{j,i}\}; \beta, \Gamma) \propto \begin{cases} \left(\dfrac{\sum_{n=1, n \neq i}^N \delta(\tilde{\sigma}_{j,i,n}, (1,0)^\top)}{N - 1 + \alpha}\right)^{\frac{\beta}{m}} e^{\tilde{s}(\tilde{\sigma}_{j-1,i}, \tilde{\sigma}_{j,i}, \tilde{\sigma}_{j+1,i}) f(\beta, \Gamma)}, & \text{if } \tilde{\sigma}_{j,i} \in \tilde{\Sigma}_{j,i} \backslash \{\rho^{(i)}_{2^{2N-1}}\}, \\ \left(\dfrac{\alpha}{N - 1 + \alpha}\right)^{\frac{\beta}{m}}, & \text{if } \tilde{\sigma}_{j,i} = \rho^{(i)}_{2^{2N-1}}, \end{cases} \qquad (B.2)$$

$$\tilde{s}(\tilde{\sigma}_{j-1,i}, \tilde{\sigma}_{j,i}, \tilde{\sigma}_{j+1,i}) = \sum_{n=1, n \neq i}^N \delta(\tilde{\sigma}_{j-1,i,n}, \tilde{\sigma}_{j,i,n}) + \sum_{n=1, n \neq i}^N \delta(\tilde{\sigma}_{j,i,n}, \tilde{\sigma}_{j+1,i,n}). \qquad (B.3)$$

This is easy to understand when you consider the meaning of $\tilde{s}(\tilde{\sigma}_{j-1,i}, \tilde{\sigma}_{j,i}, \tilde{\sigma}_{j+1,i})$ in multiple CRPs: it indicates the number of customers who share tables with the $i$-th customer in the $(j-1)$-th and $(j+1)$-th CRPs. Therefore, Eq. (4) is derived as another formulation of Eq. (B.2). Note that Eq. (4) is the approximation of Eq. (13).

Appendix C. Details of Network Model
In this section, we explain the Newman network model. $V$ is the vertex set, $v$ is a vertex, i.e., $v \in V$, and $V$ also denotes the number of vertices. $K$ is the number of classes. Suppose that the vertices fall into $K$ classes with probability $\pi$, where $\pi_k$ is the probability that a vertex is assigned to class $k$. Vertex $i$ belonging to class $k$ is indicated by $z_i = k$. Each class has a probability $\phi_{kv}$ that a link from a particular vertex in class $k$ connects to vertex $v$. A link from vertex $i$ to vertex $v$ is indicated by $\ell_i = v$. Each vertex links to other vertices in accordance with $\phi$; that is, vertex $i$ links to vertex $v$ in accordance with $\phi_{z_i v}$. The generation process for link $\ell_i$ is represented by

$$\ell_i \sim \mathrm{Multi}(\phi_{z_i}), \quad z_i \sim \mathrm{Multi}(\pi),$$

where $\phi_{z_i} = (\phi_{z_i 1}, \phi_{z_i 2}, \cdots, \phi_{z_i V})$ and $\mathrm{Multi}(\cdot)$ is a multinomial distribution.

Suppose that $\phi_k$ is distributed in accordance with the Dirichlet distribution $H(\tau)$, i.e., $\phi_k \sim H(\tau)$, where $\tau$ is the parameter of the Dirichlet distribution. $G$ is a random probability measure over $\phi$: $G \sim \mathrm{DP}(\alpha, H(\tau))$, where $\mathrm{DP}(\cdot)$ indicates the Dirichlet process (DP), $\alpha$ is the DP concentration parameter, which is equal to the hyperparameter of the CRP, and $H$ is the base measure, which here is the Dirichlet distribution. The generation process for link $\ell_i$ is then represented by

$$\ell_i \sim \mathrm{Multi}(\phi_{z_i}), \quad \phi_{z_i} \sim G.$$

Here, we define $A$ as an adjacency matrix with elements $A_{iv} = 1$ if there is an edge from $i$ to $v$ and $A_{iv} = 0$ otherwise. The probability of $z_i$ given $z_{-i} = \{z\} \backslash z_i$ and the adjacency matrix $A$ is

$$p(z_i = k \mid A, z_{-i}; \alpha) \propto p(A_i \mid z_i = k, A_{-i}, z_{-i}) \, p(z_i = k \mid z_{-i}; \alpha), \qquad (C.1)$$

where $A_i = (A_{i1}, A_{i2}, \cdots, A_{iV})$ and $A_{-i} = A \backslash A_i$. We can calculate the probabilities in Eq. (C.1) as follows:

$$p(A_i \mid z_i = k, A_{-i}, z_{-i}) = \frac{g(V\tau + \sum_{u \neq i} \sum_v A_{uv} z_{ku})}{g(V\tau + \sum_u \sum_v A_{uv} z_{ku})} \prod_v \frac{g(\tau + \sum_u A_{uv} z_{ku})}{g(\tau + \sum_{u \neq i} A_{uv} z_{ku})}, \qquad (C.2)$$

where $g(\cdot)$ is the gamma function and $z_{ki}$ indicates 1 if $z_i = k$ and 0 otherwise;

$$p(z_i = k \mid z_{-i}; \alpha) = \begin{cases} \dfrac{N^{-i}_k}{V - 1 + \alpha} & (\text{if } k \text{ is previously used}), \\ \dfrac{\alpha}{V - 1 + \alpha} & (\text{if } k \text{ is new}), \end{cases} \qquad (C.3)$$

where $N^{-i}_k$ is the number of vertices except vertex $i$ assigned to class $k$, i.e., $N^{-i}_k = \sum_{v \neq i} z_{kv}$. We can adopt a Gibbs sampler for estimating $z_i$ by using Eq. (C.1).
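As an illustration (a sketch under our own conventions, not the authors' code), one collapsed Gibbs update for $z_i$ combining Eqs. (C.2) and (C.3) can be written with log-gamma functions for numerical stability:

```python
import numpy as np
from scipy.special import gammaln

def log_lik_Ai(i, A, members, tau):
    """log p(A_i | z_i = k, A_-i, z_-i), Eq. (C.2); `members` lists the
    vertices currently assigned to class k, excluding vertex i."""
    V = A.shape[0]
    col = A[members].sum(axis=0)        # sum_{u != i} A_uv z_ku, for each v
    row, tot = A[i], A[members].sum()   # A_i and sum_{u != i} sum_v A_uv z_ku
    return (gammaln(V * tau + tot) - gammaln(V * tau + tot + row.sum())
            + np.sum(gammaln(tau + col + row) - gammaln(tau + col)))

def gibbs_update(i, A, z, alpha, tau, rng=np.random.default_rng(0)):
    """Resample z[i] via Eq. (C.1): Eq. (C.2) times the CRP prior Eq. (C.3)."""
    V = A.shape[0]
    others = [u for u in range(V) if u != i]
    used = sorted({z[u] for u in others})
    cand = used + [max(used, default=-1) + 1]        # existing classes + a new one
    logp = []
    for k in cand:
        members = [u for u in others if z[u] == k]
        prior = len(members) if members else alpha   # N_k^{-i}, or alpha for a new k
        logp.append(np.log(prior) + log_lik_Ai(i, A, members, tau))
    p = np.exp(np.array(logp) - np.max(logp))        # normalize in log space
    z[i] = rng.choice(cand, p=p / p.sum())
```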
References

Aldous, D., 1985. Exchangeability and related topics. École d'Été de Probabilités de Saint-Flour XIII-1983, 1-198.
Antoniak, C., 1974. Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. The Annals of Statistics 2, 1152-1174.
Blei, D.M., Jordan, M.I., 2005. Variational inference for Dirichlet process mixtures. Bayesian Analysis 1, 121-144.
Daume III, H., 2007. Fast search for Dirichlet process mixture models, in: Proceedings of Artificial Intelligence and Statistics.
Dempster, A., Laird, N., Rubin, D., 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B 39, 1-38.
Farhi, E., Goldstone, J., Gutmann, S., Lapan, J., Lundgren, A., Preda, D., 2001. A quantum adiabatic evolution algorithm applied to random instances of an NP-complete problem. Science 292, 472-476.
Frey, B.J., Dueck, D., 2007. Clustering by passing messages between data points. Science 315, 972-976.
Geman, S., Geman, D., 1984. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence 6, 721-741.
Kadowaki, T., Nishimori, H., 1998. Quantum annealing in the transverse Ising model. Physical Review E 58, 5355-5363.
Kemp, C., Tenenbaum, J.B., Griffiths, T.L., Yamada, T., Ueda, N., 2006. Learning systems of concepts with an infinite relational model, in: AAAI.
Kirkpatrick, S., Gelatt, C.D., Vecchi, M.P., 1983. Optimization by simulated annealing. Science 220, 671-680.
Kurihara, K., Tanaka, S., Miyashita, S., 2009. Quantum annealing for clustering, in: Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence.
Kurihara, K., Welling, M., Teh, Y.W., 2007. Collapsed variational Dirichlet process mixture models, in: IJCAI, pp. 2796-2801.
Lloyd, S., 1996. Universal quantum simulators. Science 273, 1073-1078.
Neal, R.M., 2000. Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics 9, 249-265.
Newman, M.E.J., Leicht, E.A., 2007. Mixture models and exploratory analysis in networks, in: Proceedings of the National Academy of Sciences of the United States of America.
Nielsen, M., Chuang, I., 2000. Quantum Computation and Quantum Information. Cambridge University Press.
Santoro, G.E., Martoňák, R., Tosatti, E., Car, R., 2002. Theory of quantum annealing of an Ising spin glass. Science 295, 2427-2430.
Sato, I., Kurihara, K., Tanaka, S., Nakagawa, H., Miyashita, S., 2009. Quantum annealing for variational Bayes inference, in: Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence.
Suzuki, M., 1976. Relationship between d-dimensional quantal spin systems and (d+1)-dimensional Ising systems: equivalence, critical exponents and systematic approximants of the partition function and spin correlations. Progress of Theoretical Physics 56, 1454-1469.
Trotter, H.F., 1959. On the product of semi-groups of operators. Proceedings of the American Mathematical Society 10, 545-551.
Wang, C., Lai, J., 2011. Energy based competitive learning. Neurocomputing 74, 2265-2275.
Xu, L., Krzyzak, A., Oja, E., 1993. Rival penalized competitive learning for clustering analysis, RBF net, and curve detection. IEEE Transactions on Neural Networks 4, 636-649.