Scalability of using Restricted Boltzmann Machines for Combinatorial Optimization
MALTE PROBST, FRANZ ROTHLAUF, AND JÖRN GRAHL
Abstract.
Estimation of Distribution Algorithms (EDAs) require flexible probability models that can be efficiently learned and sampled. Restricted Boltzmann Machines (RBMs) are generative neural networks with these desired properties. We integrate an RBM into an EDA and evaluate the performance of this system in solving combinatorial optimization problems with a single objective. We assess how the number of fitness evaluations and the CPU time scale with problem size and with problem complexity. The results are compared to the Bayesian Optimization Algorithm, a state-of-the-art EDA. Although RBM-EDA requires larger population sizes and a larger number of fitness evaluations, it outperforms BOA in terms of CPU times, in particular if the problem is large or complex. RBM-EDA requires less time for model building than BOA. These results highlight the potential of using generative neural networks for combinatorial optimization.

1. Introduction
Estimation of Distribution Algorithms (EDA, Mühlenbein and Paaß, 1996; Larrañaga and Lozano, 2002) are metaheuristics for combinatorial and continuous non-linear optimization. They maintain a population of solutions which they improve over consecutive generations. Unlike other heuristic methods, EDAs do not improve solutions with mutation, recombination, or local search. Instead, they estimate how likely it is that decisions are part of an optimal solution, and try to uncover the dependency structure between the decisions. This information is obtained from the population by the estimation of a probabilistic model. If a probabilistic model generalizes the population well, random samples drawn from the model have a structure and solution quality that is similar to the population itself. Repeated model estimation, sampling, and selection steps can solve difficult optimization problems in theory (Mühlenbein and Mahnig, 1999) and in practice (Lozano and Larrañaga, 2006).

It is important to empirically assess the efficiency of using probability models in EDAs. Simple models, such as factorizations of univariate frequencies, can be quickly estimated from a population, but they cannot represent interactions between the decision variables. As a consequence, EDAs using univariate frequencies cannot efficiently solve complex problems. Using flexible probability models such as Bayesian networks allows complex problems to be solved, but fitting the model to a population and sampling new solutions can be very time-consuming. A central goal of EDA research is the identification of probabilistic models that are flexible and can quickly be estimated and sampled. This is also a central topic
Key words and phrases.
Combinatorial Optimization, Heuristics, Evolutionary Computation, Estimation of Distribution Algorithms, Neural Networks.

in the field of machine learning. A recent focus in machine learning is the development of feasible unsupervised learning algorithms for generative neural networks. These algorithms and models can learn complex patterns from high-dimensional datasets. Moreover, generative neural networks can sample new data based on the associations that they have learned so far, and they can be fit to data (e.g., to a population of solutions) in an unsupervised manner. This makes them potentially useful for EDAs. Some of these models can also be "stacked" in several layers and be used as building blocks for "deep learning".

In this paper, we focus on Restricted Boltzmann Machines (RBM, Smolensky, 1986; Hinton, 2002). RBMs are a basic, yet powerful, type of generative neural network where the connections between the neurons form a bipartite graph (Section 3 explains RBMs in detail). Due to a recent breakthrough (Hinton et al., 2006), training RBMs is computationally tractable. They show impressive performance in classic machine learning tasks such as image or voice recognition (Dahl et al., 2012).

Given these successes, it is not surprising that researchers have integrated RBMs and similar models into EDAs and studied how these systems perform in optimization tasks. Zhang and Shin (2000) used a Helmholtz Machine in an EDA. Helmholtz Machines are predecessors of RBMs. Due to their limited performance, they are nowadays largely abandoned. Zhang and Shin (2000) evaluated their EDA by comparing it to a simple Genetic Algorithm (Goldberg, 1989). They did not study the scaling behavior for problems of different sizes and complexity. In a series of recent papers, Huajin et al. (2010); Shim et al. (2010); Shim and Tan (2012) and Shim et al. (2013) studied EDAs that use RBMs. These works are similar to ours in that an RBM is used inside an EDA.
An important difference is that they considered problems with multiple objectives. Also, they hybridized the EDA with particle swarm optimization. Thus, it is unknown whether using an RBM in an EDA leads to competitive performance in single-objective combinatorial optimization. Therefore, in this paper, we raise the following questions:

(1) How efficient are EDAs that use RBMs for single-objective combinatorial optimization?
(2) How does the runtime scale with problem size and problem difficulty?
(3) Is the performance competitive with the state-of-the-art?

To answer these questions, we integrated an RBM into an EDA (the RBM-EDA) and conducted a thorough experimental scalability study. We systematically varied the difficulty of two tunably complex single-objective test problems (concatenated trap functions and NK landscapes), and we computed how the runtime and the number of fitness evaluations scale with problem size. We then compared the results to those obtained for the Bayesian Optimization Algorithm (BOA, Pelikan et al., 1999a; Pelikan, 2005). BOA is a state-of-the-art EDA.

RBM-EDA solved the test problems in polynomial time depending on the problem size. Indeed, concatenated trap functions can only be solved in polynomial time by decomposing the overall problem into smaller parts and then solving the parts independently. RBM-EDA's polynomial scalability suggests that RBM-EDA recognized the structure of the problem correctly and that it solved the sub-problems independently from one another. The hidden neurons of the RBM (its latent variables) captured the building blocks of the optimization problem. The runtime of RBM-EDA scaled better than that of BOA on trap functions of high order and
NK landscapes with large k. RBM-EDA hence appears to be useful for complex problems. It was mostly faster than BOA if instances were large.

The paper is structured as follows: In Section 2, we introduce Estimation of Distribution Algorithms and the Bayesian Optimization Algorithm. In Section 3, we introduce Restricted Boltzmann Machines, show how an RBM samples new data, and describe how an RBM is fit to given data. Section 3.4 describes RBM-EDA. The test functions, the experimental design, and the results are presented in Section 4. Section 5 concludes the paper.

2. Estimation of Distribution Algorithms
We introduce Estimation of Distribution Algorithms (Section 2.1) and the Bayesian Optimization Algorithm (Section 2.2).

2.1.
Estimation of Distribution Algorithms.
EDAs are population-based metaheuristics (Mühlenbein and Paaß, 1996; Mühlenbein and Mahnig, 1999; Pelikan et al., 1999b; Larrañaga et al., 1999). Similar to Genetic Algorithms (GA, Holland, 1975; Goldberg, 1989), they evolve a population of solutions over a number of generations by means of selection and variation.

Algorithm 1 outlines the basic functionality of an EDA. After initializing a population P of solutions, the EDA runs for multiple generations. In each generation, a selection operator selects a subset P_parents of high-quality solutions from P. P_parents is then used as input for the variation step. In contrast to a GA, which creates new individuals using recombination and mutation, an EDA builds a probabilistic model M from P_parents, often by estimating their (joint) probability distribution. Then, the EDA draws samples from M to obtain new candidate solutions. Together with P_parents, these candidate solutions constitute P for the next generation. The algorithm stops after the population has converged or another termination criterion is met.

Algorithm 1 Estimation of Distribution Algorithm
  Initialize population P
  while not converged do
    P_parents ← Select high-quality solutions from P based on their fitness
    M ← Build a model estimating the (joint) probability distribution of P_parents
    P_candidates ← Sample new candidate solutions from M
    P ← P_parents ∪ P_candidates
  end while

EDA variants mainly differ in their probabilistic models M. The models describe dependency structures between the decision variables with different types of probability distributions. Consider a binary solution space with n decision variables. A naive EDA could attempt to store a probability for each solution; M would then contain 2^n probabilities. This could be required if all variables depended on each other. However, storing 2^n probabilities is computationally intractable for large n. If some decision variables are independent from other variables, the joint distribution can be factorized into products of marginal distributions and the space required for storing M shrinks. If all variables are independent, only n probabilities have to be stored. In most problems, some variables are independent of other variables, but the structure of the dependencies is unknown to those who want to solve the problem. Hence, model building consists of finding a network structure that matches the problem structure and estimating the model's parameters.

Simple models like the Breeder Genetic Algorithm (Mühlenbein and Paaß, 1996) or population-based incremental learning (Baluja, 1994) use univariate, fully factorized probability distributions with a vector of activation probabilities for the variables and ignore dependencies between decision variables. Slightly more complex approaches like the bivariate marginal distribution algorithm use bivariate probability distributions, which model pairwise dependencies between variables as trees or forests (Pelikan and Mühlenbein, 1999). More complex dependencies between variables can be captured by models with multivariate interactions, like the Bayesian Optimization Algorithm (Pelikan et al., 1999a, see Section 2.2) or the extended compact GA (Harik, 1999). Such multivariate models are better suited for complex optimization problems; univariate models can lead to an exponential growth of the required number of fitness evaluations (Pelikan et al., 1999a; Pelikan, 2005). However, the computational effort to build a model M increases with its complexity and representational power. Many algorithms use probabilistic graphical models with directed edges, i.e., Bayesian networks, or undirected edges, i.e., Markov random fields (Larrañaga et al., 2012).

2.2. Bayesian Optimization Algorithm.
The Bayesian Optimization Algorithm is a state-of-the-art EDA for discrete optimization problems. It was proposed by Pelikan et al. (1999a) and has been heavily used and researched since then (Pelikan and Goldberg, 2003; Pelikan, 2008; Abdollahzadeh et al., 2012).

BOA uses a Bayesian network for modeling dependencies between variables. Decision variables correspond to nodes and dependencies between variables correspond to directed edges. As the number of possible network topologies grows exponentially with the number of nodes, BOA uses a greedy construction heuristic to find a network structure G to model the training data. Starting from an unconnected (empty) network, BOA evaluates all possible additional edges, adds the one that maximally increases the fit between the model and the selected individuals, and repeats this process until no more edges can be added. The fit between the selected individuals and the model is measured by the Bayesian Information Criterion (BIC, Schwarz, 1978). BIC is based on the conditional entropy of nodes given their parent nodes and correction terms penalizing complex models. BIC assigns each network G a scalar score

(1) BIC(G) = Σ_{i=1}^{n} ( −H(X_i | Π_i) · N − 2^{|Π_i|} · log₂(N)/2 ),

where n is the number of decision variables, N is the sample size (i.e., the number of selected individuals), Π_i are the predecessors of node i in the Bayesian network (i's parents), and |Π_i| is the number of parents of node i. The term H(X_i | Π_i) is the conditional entropy of the i'th decision variable X_i given its parental nodes Π_i, defined as

(2) H(X_i | Π_i) = −Σ_{x_i, π_i} p(x_i, π_i) log p(x_i | π_i),

Figure 1.
A Restricted Boltzmann Machine as a graph. The visible neurons v_i (i ∈ 1..n) can hold a data vector of length n from the training data. In the EDA context, V represents decision variables. The hidden neurons h_j (j ∈ 1..m) represent m features. Weight w_ij connects v_i to h_j.

where p(x_i, π_i) is the observed probability of instances where X_i = x_i and Π_i = π_i, and p(x_i | π_i) is the conditional probability of instances where X_i = x_i given that Π_i = π_i. The sum in (2) runs over all possible configurations of X_i and Π_i. The BIC score depends only on the conditional entropy of a node and its parents. Therefore, it can be calculated independently for all nodes. If an edge is added to the Bayesian network, the change of the BIC can be computed quickly. The second term in (1) penalizes model complexity. BOA's greedy network construction algorithm adds the edge with the largest gain in BIC(G) until no more edges can be added. Edge additions resulting in cycles are not considered.

After the network structure has been learned, BOA calculates the conditional activation probability tables for each node. Once the model structure and conditional activation probabilities are available, BOA can produce new candidate solutions by drawing random values for all nodes in topological order.

3. Restricted Boltzmann Machines and the RBM-EDA
Restricted Boltzmann Machines (Smolensky, 1986) are stochastic neural networks that are successful in areas such as image classification, natural language processing, and collaborative filtering (Dahl et al., 2010; Hinton et al., 2006; Salakhutdinov et al., 2007). In this section, we describe the structure of Restricted Boltzmann Machines (Section 3.1), show how an RBM can sample new data (Section 3.2), and how contrastive divergence learning is used to model the probability distribution of given data (Section 3.3). Finally, we describe RBM-EDA, an EDA that uses an RBM as its probabilistic model (Section 3.4).

3.1.
Structure of RBMs.
Figure 1 illustrates the structure of an RBM. We denote V as the input (or "visible") layer. V holds the input data represented by n binary variables v_i, i = 1, . . . , n. The m binary neurons h_j, j = 1, . . . , m of the hidden layer H are called feature detectors, as they are able to model patterns in the data. A weight matrix W holds weights w_ij ∈ ℝ between all neurons v_i and h_j. Together, V, H, and W form a bipartite graph. The weights W are undirected. An RBM forms a Markov random field.

In the sampling and training phases, each neuron in V and H makes stochastic decisions about whether it is active (its value then becomes 1) or not (its value then becomes 0). To this end, it collects input from all neurons to which it is directly connected. W determines the strengths of these inputs.

An RBM encodes a joint probability distribution P(V, H). In the sampling phase, a configuration of V and H is thus sampled with probability P(V, H) (Smolensky, 1986). In the training phase, the weights W are adapted such that the marginal probability P(V) approximates the probability distribution of the training data. Training and sampling are tractable because of the bipartite structure of the RBM. Hence, it is not necessary to know the problem structure beforehand.

3.2. Sampling.
The goal of the sampling phase is to generate new values for the neurons in the visible layer V according to P(V, H). This is straightforward if the activations of the neurons in the hidden layer H are known. In this case, all v_i are independent of each other and the conditional probability that v_i is active is simple to compute. The conditional probability P(v_i = 1 | H) that the visible neuron v_i is active, given the hidden layer H, is calculated as

(3) P(v_i = 1 | H) = sigm(Σ_j w_ij h_j),

where sigm(x) = 1/(1 + e^{−x}) is the logistic function. Analogously, given the activations of the visible layer V, the conditional probability P(h_j = 1 | V) for the hidden neurons H is calculated as

(4) P(h_j = 1 | V) = sigm(Σ_i w_ij v_i).

Although the two conditional distributions P(V|H) and P(H|V) are simple to compute, sampling from the joint probability distribution P(V, H) is much more difficult, as it usually requires integrating over one of the conditional distributions. An alternative is Gibbs sampling, which approximates the joint distribution P(V, H) from the conditional distributions. Gibbs sampling starts by assigning random values to the visible neurons. Then, it iteratively samples from P(H|V) and P(V|H), respectively, each time assigning the result of the previous sampling step to the non-sampled layer. Sampling in the order V → H → V → H → . . . forms a Markov chain. Its stationary distribution is identical to the joint probability distribution P(V, H) (Geman and Geman, 1984). The quality of the approximation increases with the number of sampling steps. If Gibbs sampling is started with a V that has a high probability under the stationary distribution, only a few sampling steps are necessary to obtain a good approximation.

3.3. Training.
In the training phase, the RBM adapts the weights W such that P(V) approximates the distribution of the training data. An effective approach for adjusting the weights of an RBM is contrastive divergence (CD) learning (Hinton, 2002). CD learning maximizes the log-likelihood of the training data under the model, log(P(V)), by performing a stochastic gradient ascent. The main element of CD learning is Gibbs sampling.

In addition, all neurons are connected to a special "bias" neuron, which is always active and works like an offset to the neuron's input. Bias weights are treated like normal weights during learning. For brevity, we omit the bias terms throughout the paper. For details, see Hinton et al. (2006).
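To make the conditional distributions (3) and (4) and the alternating Gibbs chain concrete, here is a minimal NumPy sketch. It is our own illustration, not the authors' implementation; bias terms are omitted as in the text, and all names are ours:

```python
import numpy as np

def sigm(x):
    # Logistic function: sigm(x) = 1 / (1 + exp(-x))
    return 1.0 / (1.0 + np.exp(-x))

def sample_h_given_v(v, W, rng):
    # Eq. (4): P(h_j = 1 | V) = sigm(sum_i w_ij * v_i)
    p_h = sigm(v @ W)
    return (rng.random(p_h.shape) < p_h).astype(float), p_h

def sample_v_given_h(h, W, rng):
    # Eq. (3): P(v_i = 1 | H) = sigm(sum_j w_ij * h_j)
    p_v = sigm(h @ W.T)
    return (rng.random(p_v.shape) < p_v).astype(float), p_v

def gibbs_sample(v0, W, steps, rng):
    # Alternate V -> H -> V -> ...; the stationary distribution
    # of this Markov chain is the RBM's joint distribution P(V, H).
    v = v0
    for _ in range(steps):
        h, _ = sample_h_given_v(v, W, rng)
        v, _ = sample_v_given_h(h, W, rng)
    return v
```

Here `v` holds one binary data vector per row, and `W` has shape (n, m), so the bipartite structure of the RBM reduces each sampling step to a single matrix product.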
For each point V in the training data, CD learning updates w_ij in the direction of ∂ log(P(V))/∂w_ij. This partial derivative is the difference of two terms, usually referred to as the positive and negative gradient, Δ^pos_ij and Δ^neg_ij (Hinton, 2002). The total gradient Δw_ij is

(5) Δw_ij = Δ^pos_ij − Δ^neg_ij = ⟨v_i · h_j⟩_data − ⟨v_i · h_j⟩_model,

where ⟨x⟩ is the expected value of x. Δ^pos_ij is the expected value of v_i h_j when setting the visible layer V to a data vector from the training set and sampling H according to (4). Δ^pos_ij increases the marginal probability P(V) of the data point V. In contrast, Δ^neg_ij is the expected value of a configuration sampled from the joint probability distribution P(V, H), which is approximated by Gibbs sampling. If P(V) is equal to the distribution of the training data, the positive and negative gradient equal each other in expectation and the total gradient becomes zero.

Calculating Δ^neg_ij exactly is infeasible since a large number of Gibbs sampling steps is required until the RBM samples from its stationary distribution. Therefore, CD estimates Δ^neg_ij using two approximations. First, CD initializes the Markov chain with a data vector from the training set, rather than using an unbiased, random starting point. Second, only a small number of sampling steps is used. We denote CD using N sampling steps as CD-N. CD-N approximates the negative gradient Δ^neg_ij by initializing the sampling chain to the same data point V which is used for the calculation of Δ^pos_ij. Subsequently, it performs N Gibbs sampling steps. Note that the first half-step V → H has, in practice, already been performed during the calculation of Δ^pos_ij. Despite using these two approximations, CD-N usually works well (Hinton et al., 2006).

Algorithm 2 describes contrastive divergence for N = 1 (CD-1).
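A CD-1 weight update for a mini-batch, following (5) and (6), might look as follows. This is our own NumPy sketch, not the authors' Java implementation; bias terms are again omitted:

```python
import numpy as np

def sigm(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(V, W, alpha, rng):
    """One CD-1 update for a mini-batch V (one training vector per row)."""
    # Positive phase: clamp V to the data and sample H from P(H|V), Eq. (4).
    p_h = sigm(V @ W)
    H = (rng.random(p_h.shape) < p_h).astype(float)
    pos = V.T @ H                      # estimate of <v_i h_j>_data
    # Negative phase: one Gibbs step yields the "reconstruction" V_hat, Eq. (3).
    p_v_hat = sigm(H @ W.T)
    V_hat = (rng.random(p_v_hat.shape) < p_v_hat).astype(float)
    p_h_hat = sigm(V_hat @ W)
    neg = V_hat.T @ p_h_hat            # v_hat_i * P(h_hat_j | V_hat)
    # Eqs. (5) and (6), averaged over the mini-batch.
    return W + alpha * (pos - neg) / V.shape[0]
```

Using the probabilities P(ĥ_j | V̂) instead of sampled binary states in the negative term, as in the text, reduces sampling noise.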
For each training vector, V is initialized with the training vector and H is sampled according to (4). This allows the calculation of Δ^pos_ij as v_i h_j. Following this, two additional sampling steps are carried out: First, we calculate the "reconstruction" V̂ of the training vector as in (3). Subsequently, we calculate the corresponding hidden probabilities P(ĥ_j = 1 | V̂). Now, we can approximate Δ^neg_ij as v̂_i · P(ĥ_j | V̂) and obtain Δw_ij. Finally, we update the weights as

(6) w_ij := w_ij + α · Δw_ij,

where α ∈ (0, 1) is a learning rate defined by the user. This procedure is repeated for several epochs, i.e., passes through the training set. Usually, CD is implemented in a mini-batch fashion. That is, we calculate Δw_ij in (6) for multiple training examples at the same time and subsequently use the average gradient to update w_ij. This reduces sampling noise and makes the gradient more stable (Bishop, 2006; Hinton et al., 2006).

3.4. Restricted Boltzmann EDA.
This section describes how we used an RBM in an EDA. The RBM should model the properties of promising solutions and then be used to sample new candidate solutions. In each generation of the EDA, we trained the RBM to model the probability distribution of the solutions which survived the selection process. Then, we sampled candidate solutions from the RBM and evaluated their fitness.
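Putting the pieces together, one generation of RBM-EDA could be organized as follows. This is a schematic sketch only: `train_rbm` and `sample_rbm` stand in for the training and sampling phases of Sections 3.2 and 3.3, and the tournament scheme is a simplified pairwise variant:

```python
import numpy as np

def rbm_eda_generation(P, fitness, train_rbm, sample_rbm, rng):
    """One RBM-EDA generation: selection, model building, sampling.

    P: (pop_size, n) binary array; fitness: callable on one row.
    train_rbm / sample_rbm: placeholders for the RBM phases.
    """
    # Tournament selection of size two: pair up shuffled individuals
    # and keep the fitter solution of each pair.
    idx = rng.permutation(len(P))
    winners = [P[a] if fitness(P[a]) >= fitness(P[b]) else P[b]
               for a, b in zip(idx[0::2], idx[1::2])]
    parents = np.array(winners)
    model = train_rbm(parents)                         # fit P(V) to the parents
    candidates = sample_rbm(model, len(parents), rng)  # draw new solutions
    # As in Algorithm 1: P <- P_parents U P_candidates.
    return np.vstack([parents, candidates])
```

The loop over generations, the bisection-based population sizing, and the termination criteria are then wrapped around this function.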
We chose the following parameters: The number m of hidden neurons was set to half the number n of visible neurons (i.e., half the problem size). Standard values were used for the parameters that control the training and sampling phases (Hinton, 2010).

Sampling parameters. When sampling new candidate solutions from the RBM, we used the individuals in P_parents to initialize the visible neurons close to the stationary distribution. Subsequently, we performed 25 full Gibbs sampling steps.

Training parameters. The learning rate α was set to 0.05 and 0.5 for the weights and biases, respectively. We applied standard momentum (Qian, 1999), which adds a fraction β of the last parameter update to the current update, making the update less prone to fluctuations caused by noise and dampening oscillations. We started with β = 0.5 and increased it later in the training phase. Weight decay adds the penalty term −0.5 · γ · W² to the RBM's optimization objective. The partial gradient Δw_ij in Equation (5) thus includes the term −γ · w_ij, which decays large weights and thereby reduces overfitting. The weight cost γ determines the strength of the decay; we chose a small standard value for γ (Hinton, 2010). The biases of the visible neurons were initialized to log(P(v_i = 1)/(1 − P(v_i = 1))), where P(v_i = 1) is the probability that the visible neuron v_i is set to one in the training set (see Hinton, 2010). This initialization speeds up the training process.

Parameter control. We applied a simple parameter control scheme for the learning rate α, the momentum β, and the number of epochs. The scheme was based on the reconstruction error e. The reconstruction error is the difference between a training vector V and its reconstruction V̂ after a single step of Gibbs sampling (see lines 2 and 7 in Algorithm 2). e usually decreases with the number of epochs. Every second epoch t, we calculated for a fixed subset s of the training set S the relative difference

e_{s,t} = 1/|s| Σ_{j∈s} Σ_i |v_i − v̂_i| / n

between v and v̂. We measured the relative decrease γ of the reconstruction error over the last 25% of all epochs. γ was then used to adjust the learning parameters: α was initialized with a larger value in the first epoch and decreased, and β was adjusted between 0.5 and its final value, as γ fell below pre-defined thresholds. We stopped the training phase as soon as γ < 0.01. The rationale behind this was that the RBM had then learned the relevant dependencies between the variables, and further training was unlikely to improve the model considerably. Furthermore, we stopped the training if the RBM was overfitting, i.e., learning noise instead of problem structure. To this end, we split the original training set into a training set S containing 90% of all samples and a validation set S′ containing the remaining 10%. We trained the RBM only on the solutions in S and, after each epoch, calculated the reconstruction errors e_S and e_S′ for the training and validation set, respectively. We stopped the training phase as soon as |e_S − e_S′| / e_S′ ≥ 0.02 (i.e., the relative difference between the reconstruction errors was larger than 2%).

CD learning does not minimize the reconstruction error e but maximizes P(V), which cannot be calculated tractably. Nevertheless, the reconstruction error is usually a good indicator of how well the model can (re-)produce the training data.

Algorithm 2 Pseudo code for a training epoch using CD-1
  for all training examples do
    V ← set V to the current training example
    H ← sample H | V, i.e., set h_j to 1 with P(h_j = 1 | V) from (4)
    Δ^pos_ij = v_i h_j
    V̂ ← sample "reconstruction" V̂ | H, using (3)
    Ĥ ← calculate P(Ĥ | V̂) as in (4)
    Δ^neg_ij = v̂_i · P(ĥ_j | V̂)
    Δw_ij ← calculate all Δw_ij as in (5)
    w_ij ← update all weights according to (6)
  end for

4. Experiments
We describe the test problems (Section 4.1) and our experimental design (Section 4.2). The results are presented in Section 4.3.

4.1.
Test Problems.
We evaluated the performance of RBM-EDA on onemax, concatenated deceptive traps (Ackley, 1987), and NK landscapes (Kauffman and Weinberger, 1989). All three are standard benchmark problems. Their difficulty depends on the problem size, i.e., problems with more decision variables are more difficult. Furthermore, the difficulty of concatenated deceptive trap functions and NK landscapes is tunable by a parameter. All three problems are defined on binary strings of fixed length.

The onemax problem assigns a binary solution x of length l the fitness value f(x) = Σ_{i=1}^{l} x_i, i.e., the fitness of x is the number of ones in x. The onemax function is rather simple. It is unimodal and can be solved by a deterministic hill climber.

A trap function is defined on a binary solution x of length k. It assigns a solution x the fitness

f_k(x) = k if Σ_i x_i = k, and f_k(x) = k − (Σ_i x_i + 1) otherwise.

The optimal solution consists of all ones. Trap functions are difficult to solve because the second-best solution consists of all zeros and the structure of the function is deceptive: it leads search methods away from the global optimum towards the second-best solution.

A concatenated trap function places o trap functions of order k on a single bitstring x of length o · k. Its fitness is calculated as f(x) = Σ_{i=1}^{o} f_k^i (Ackley, 1987). The problem difficulty increases with k as well as with o. Concatenated traps are decomposable: dependencies exist between the k variables of a trap function, but not between variables in the o different trap functions (Deb and Goldberg, 1991).

NK landscapes are defined by two parameters N and k and N fitness components f_i^N (Kauffman and Weinberger, 1989). A solution vector x consists of l = N bits. The bits are assigned to N overlapping subsets, each of size k + 1. The fitness of a solution is the sum of the N fitness components. Each component f_i^N depends on the value of the corresponding variable x_i as well as k other variables.
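The onemax and concatenated trap functions defined above can be written down directly (our own illustration):

```python
def onemax(x):
    # f(x) = number of ones in the bitstring x
    return sum(x)

def trap_k(block):
    # Deceptive trap of order k on a single block of bits:
    # k if all bits are one, else k - (number of ones + 1).
    k, u = len(block), sum(block)
    return k if u == k else k - (u + 1)

def concatenated_traps(x, k):
    # o = len(x) / k independent traps placed on the bitstring.
    assert len(x) % k == 0
    return sum(trap_k(x[i:i + k]) for i in range(0, len(x), k))
```

Note that `trap_k` rewards each additional zero until the all-zeros block, which is exactly what makes the function deceptive for search methods that follow the local gradient.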
Each f_i^N maps each possible configuration of its k + 1 variables to a fitness value. The overall fitness function is therefore

f(x) = 1/N Σ_{i=1}^{N} f_i^N(x_i, x_{i1}, . . . , x_{ik}).

Each decision variable usually influences several f_i^N. These dependencies between subsets make NK landscapes non-decomposable. The problem difficulty increases with k. k = 0 is a special case where all decision variables are independent and the problem reduces to onemax.

4.2. Experimental Setup.
We adopted an experimental setup similar to Pelikan (2008). We studied the performance of RBM-EDA on the onemax function, on concatenated deceptive traps with trap sizes k ∈ {4, 5, 6}, and on NK landscapes with k ∈ {2, 3, 4, 5}. We report results for BOA so that RBM-EDA can be compared to the state-of-the-art.

Both EDAs used tournament selection without replacement of size two (Miller and Goldberg, 1995). We used bisection to determine the smallest population size for which a method solved a problem to optimality. For onemax and deceptive traps, we required each method to find the optimal solution in 30 out of 30 independent runs. For NK landscapes, we used 25 randomly chosen problem instances per size and determined the population size that solved five out of five independent runs of each instance.

We report the average number of fitness evaluations that were necessary to solve the problem to optimality. In addition, we report average CPU running times. CPU running time includes the time required for fitness calculation, the time required for model building (either building the Bayesian network or training the RBM), the time required for sampling new solutions, and the time required for selection. We report CPU times even though previous EDA research mostly ignored the time required for model building and sampling.

We implemented RBM-EDA and BOA in Java. The experiments were conducted on a single core of an AMD Opteron 6272 processor at 2,100 MHz. The JBLAS library was used for the linear algebra operations of the RBM.

4.3. Results.
We report the performance of RBM-EDA and BOA on onemax (Figure 2), concatenated traps (Figure 3), and NK landscapes (Figure 4). The figures have a log-log scale. Straight lines indicate polynomial scalability. Each figure shows the average number of fitness evaluations (left-hand side) and the overall CPU time (right-hand side) required until the optimal solution was found. The figures also show regression lines obtained from fitting a polynomial to the raw results (details are in Table 1).

First, we study the number of fitness evaluations required until the optimal solution was found (Figures 2-4, left). For the onemax problem, RBM-EDA needed fewer fitness evaluations than BOA (Figure 2, left-hand side) and had a slightly lower complexity. For concatenated traps, BOA needed fewer fitness evaluations (Figure 3, left-hand side). As the problem difficulty increased (larger k), the performance of the two approaches became more similar, with the complexity of BOA remaining slightly lower than that of RBM-EDA. The situation was similar for NK landscapes, where BOA needed fewer fitness evaluations and scaled better than RBM-EDA (Figure 4, left-hand side).

Figure 2. Number of evaluations (left) and CPU time (right) over problem size l for the onemax problem.

We now discuss average total CPU times (Figures 2-4, right-hand side). Besides the time required for fitness evaluation, this includes time spent for model building, sampling, and selection. For the onemax problem, RBM-EDA was faster than BOA and had a lower time complexity (Figure 2, right-hand side). For deceptive trap functions, BOA was faster on small problem instances, but its time complexity was larger than that of RBM-EDA (Figure 3, right-hand side). Hence, the absolute difference in computation time became smaller for larger and more difficult problems. For traps of size k = 5, RBM-EDA was faster for problems with more than 180 decision variables (36 concatenated traps). For traps of size k = 6, RBM-EDA was already faster for problems with more than 60 decision variables (10 concatenated traps). BOA's time complexity increased slightly with higher k, whereas the time complexity of RBM-EDA remained about the same.

The results for NK landscapes were qualitatively similar (Figure 4, right-hand side). BOA was faster than RBM-EDA, but its computational effort increased more strongly with n. Therefore, the computation times became similar for difficult and large problems (cf. the results for k = 5 for various problem sizes). We omitted the relative CPU times for selection in order to increase the readability of Figure 5; by definition, they are proportional to the CPU time for fitness evaluations and, in absolute numbers, negligible.
6) and NK landscapes of size k ∈ (3 , , F i t ne ss E v a l ua t i on s ProblemSize
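The scaling exponents and regression lines in Figures 2-4 come from fitting a straight line to the results on the log-log scale. A minimal sketch of this kind of estimate (not the authors' code; `scaling_exponent` and the synthetic data below are illustrative):

```python
import numpy as np

def scaling_exponent(sizes, costs):
    """Estimate b in cost ~ a * n^b by linear regression in log-log space.

    On a log-log plot, a polynomial cost curve is a straight line whose
    slope equals the exponent b, which is how regression lines like those
    in Figures 2-4 can be obtained.
    """
    slope, _intercept = np.polyfit(np.log(sizes), np.log(costs), deg=1)
    return slope

# Synthetic example: a quantity that grows as n^1.5.
sizes = np.array([30.0, 60.0, 120.0, 240.0, 480.0])
costs = 3.0 * sizes ** 1.5
print(round(scaling_exponent(sizes, costs), 2))  # -> 1.5
```

In practice one would fit against the measured numbers of fitness evaluations or CPU times, averaged over independent runs.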
Figure 3. Number of fitness evaluations (left) and CPU time (right) over problem size l for deceptive traps with (a) k = 4, (b) k = 5, and (c) k = 6.

First, we study the absolute CPU times for model building (Figure 5, left-hand side). For all three problem types shown (onemax, 5-traps, and NK landscapes with k = 5), RBM-EDA's model building had a lower CPU time complexity than BOA's. Model building was faster for RBM-EDA than for BOA for all onemax problems, for concatenated 5-traps of size ≥ 140, and for NK k = 5 problems of size 30 and larger.
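For reference, the concatenated deceptive trap function used in these experiments follows the standard definition of Deb and Goldberg (1991): each k-bit block rewards the all-ones pattern, but for all other patterns the fitness increases as more bits are set to zero, which misleads algorithms relying on univariate statistics. A sketch (function names are ours; the paper's exact normalization may differ):

```python
def trap_block(u, k):
    """Deceptive trap on a block of k bits with u ones (Deb & Goldberg, 1991).

    The optimum u == k is isolated: for u < k, fitness is k - 1 - u, so it
    grows as bits are *unset*, pulling search away from the optimum.
    """
    return k if u == k else k - 1 - u

def concatenated_trap(bits, k):
    """Sum of trap values over consecutive, non-overlapping k-bit blocks."""
    assert len(bits) % k == 0
    return sum(trap_block(sum(bits[i:i + k]), k)
               for i in range(0, len(bits), k))

# A 10-bit problem with k = 5: the all-ones string is the global optimum.
print(concatenated_trap([1] * 10, 5))  # -> 10
print(concatenated_trap([0] * 10, 5))  # -> 8
```

Onemax is the degenerate case without deception: the fitness is simply the number of ones in the string.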
Figure 4. Number of evaluations (left) and CPU time (right) over problem size l = N for NK landscapes with (a) k = 2, (b) k = 3, (c) k = 4, and (d) k = 5.
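The NK landscapes underlying Figure 4 assign each bit a random fitness contribution that depends on the bit itself and on k other bits (Kauffman and Weinberger, 1989). A minimal sketch using circular adjacent neighborhoods (the actual benchmark instances, e.g. those of Pelikan (2008), may define neighborhoods and tables differently):

```python
import random

def make_nk_landscape(n, k, seed=0):
    """Random NK landscape: each bit i contributes a value drawn from a
    random lookup table indexed by bit i and its k (circular) neighbors.
    Larger k couples more bits and makes the landscape more rugged.
    """
    rng = random.Random(seed)
    # One lookup table per bit: 2^(k+1) random contributions in [0, 1).
    tables = [[rng.random() for _ in range(2 ** (k + 1))] for _ in range(n)]

    def fitness(bits):
        total = 0.0
        for i in range(n):
            # Build the table index from the bit pattern of positions i..i+k.
            idx = 0
            for j in range(k + 1):
                idx = (idx << 1) | bits[(i + j) % n]
            total += tables[i][idx]
        return total / n  # average contribution, as usual for NK models

    return fitness

f = make_nk_landscape(n=8, k=2)
print(0.0 <= f([0] * 8) <= 1.0)  # -> True
```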
Figure 5. Absolute CPU time for model building (left-hand side) and relative CPU times for model building, sampling, and fitness evaluation (right-hand side) for (a) onemax, (b) concatenated traps with k = 5, and (c) NK landscapes with k = 5.

For BOA, model building clearly dominated the total CPU time; with growing problem size, its share increased monotonically (95.2-99.5% for onemax, 97.4-99.8% for 5-traps, 85.6-97.2% for NK k = 5). The relative times for sampling and fitness evaluation were very low, and their shares of the total time decreased with growing problem size, despite the growing population sizes. In contrast, the RBM spent a significantly smaller fraction of the total time on model building. Moreover, with growing problem size, this fraction decreased or stayed relatively constant (85.7-88.3% for onemax, 92.2% down to 64.3% for 5-traps, 78.8% down to 49.2% for NK k = 5).
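One reason RBM model building scales well is that a contrastive-divergence update (Hinton, 2002) reduces to a few dense matrix products over the whole batch of selected solutions. A minimal CD-1 sketch for a binary RBM (hyperparameters, batching, and initialization are illustrative assumptions, not the paper's settings):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(W, b, c, V, lr=0.05):
    """One contrastive-divergence (CD-1) step for a binary RBM.

    V is a batch of visible vectors (rows = solutions of the population).
    The whole update consists of dense matrix products, which is why
    model-building cost grows smoothly with problem size.
    """
    # Positive phase: hidden activations given the data.
    ph = sigmoid(V @ W + c)
    h = (rng.random(ph.shape) < ph).astype(float)
    # Negative phase: one Gibbs step back to the visible layer and up again.
    pv = sigmoid(h @ W.T + b)
    ph2 = sigmoid(pv @ W + c)
    n = V.shape[0]
    W += lr * (V.T @ ph - pv.T @ ph2) / n
    b += lr * (V - pv).mean(axis=0)
    c += lr * (ph - ph2).mean(axis=0)
    return W, b, c

# Toy usage: 16 visible units, 8 hidden units, a batch of 32 random bit strings.
W = 0.01 * rng.standard_normal((16, 8))
b = np.zeros(16)
c = np.zeros(8)
V = (rng.random((32, 16)) < 0.5).astype(float)
W, b, c = cd1_update(W, b, c, V)
```

Sampling new candidate solutions then amounts to running a few further Gibbs steps from the hidden layer, again as matrix products.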
Table 1. Approximated scaling behavior of the number of fitness evaluations and of the total CPU time with problem size.

            Fitness evaluations       CPU time
            BOA        RBM            BOA        RBM
Onemax      O(n^·)     O(n^·)         O(n^·)     O(n^·)
4-Traps     O(n^·)     O(n^·)         O(n^·)     O(n^·)
5-Traps     O(n^·)     O(n^·)         O(n^·)     O(n^·)
6-Traps     O(n^·)     O(n^·)         O(n^·)     O(n^·)
NK, k = 2   O(n^·)     O(n^·)         O(n^·)     O(n^·)
NK, k = 3   O(n^·)     O(n^·)         O(n^·)     O(n^·)
NK, k = 4   O(n^·)     O(n^·)         O(n^·)     O(n^·)
NK, k = 5   O(n^·)     O(n^·)         O(n^·)     O(n^·)

Correspondingly, the CPU time fractions for sampling and fitness evaluation increased with growing problem size. Hence, the total CPU time of RBM-EDA was much less dominated by model building.

In summary, we found that the performance of RBM-EDA was competitive with the state of the art, especially for large and difficult problem instances. This may be surprising, as the number of fitness evaluations necessary to solve a problem was, for most problems, higher for RBM-EDA than for BOA. Moreover, the computational effort in terms of fitness evaluations grew faster with the problem size n. The higher number of fitness evaluations used by RBM-EDA indicates that the statistical model created during training in RBM-EDA was less accurate than the statistical model in BOA. From a computational perspective, however, building the accurate model in BOA was much more expensive than learning the less accurate model used in RBM-EDA. Furthermore, the time necessary for building the model increased more slowly with problem size than for BOA. Thus, the lower time effort for learning a less accurate statistical model in RBM-EDA overcompensated for the higher number of fitness evaluations necessary to find optimal solutions. As a result, the CPU times for RBM-EDA were not only lower than those of BOA, they also increased more slowly with growing problem size. This was true especially for difficult and large problems.

5. Summary and Conclusions
We carried out an in-depth experimental analysis of using a Restricted Boltzmann Machine within an Estimation of Distribution Algorithm for combinatorial optimization. We tested RBM-EDA on standard binary benchmark problems: onemax, concatenated deceptive traps, and NK landscapes of different sizes. We carried out a scalability analysis of the number of fitness evaluations and the computation times required to solve the problems to optimality, and compared our results to those obtained with the Bayesian Optimization Algorithm, a state-of-the-art method.

Our experimental results suggest that RBM-EDA is competitive with the state of the art. We observed that it was less efficient in terms of fitness evaluations, both in absolute numbers and in complexity. (Note that in this comparison, each algorithm used the population sizes from its own bisection run, i.e., the population size for BOA was usually smaller. If we had used the same population sizes, we would expect the time dominance of model building in BOA to be even greater.) However, the estimated complexity of probabilistic model building in RBM-EDA was lower than in BOA, and RBM-EDA was able to build its probabilistic models much faster than BOA when the problem was large. This led to smaller total runtimes for RBM-EDA than for BOA, especially for large and difficult problem instances (cf. the results for concatenated traps with l > 180 and k = 5, traps with l > 60 and k = 6, or NK landscapes with N = 32 and k = 5). In sum, RBM-EDA can be an alternative if problems are large and difficult and the computational effort for fitness evaluations is low.
This highlights the potential of using generative neural networks for combinatorial optimization.

Another advantage of using neural networks in EDAs is that RBMs can be parallelized without many of the problems that occur when parallelizing other EDAs. Parallelizing an RBM-EDA on a Graphics Processing Unit leads to massive speedups, by a factor of 200 and more compared to optimized CPU code (Probst et al., 2014).

There are multiple directions for further research. We demonstrated that RBMs are useful in EDAs, but fine-tuning the parameters could improve the RBM's model quality, possibly leading to further performance improvements. It might also be beneficial to stack multiple RBMs into deep architectures and to use such systems for solving hierarchical problems.

References
Abdollahzadeh, A., Reynolds, A., Christie, M., Corne, D. W., Davies, B. J., Williams, G. J. J., et al., 2012. Bayesian optimization algorithm applied to uncertainty quantification. SPE Journal 17 (03), 865-873.

Ackley, D. H., 1987. A connectionist machine for genetic hill climbing. Kluwer Academic, Boston.

Baluja, S., 1994. Population-based incremental learning: A method for integrating genetic search based function optimization and competitive learning. Tech. Rep. CMU-CS-94-163, Carnegie Mellon University, Pittsburgh, PA.

Bishop, C. M., 2006. Pattern recognition and machine learning. Information Science and Statistics. Springer.

Dahl, G. E., Ranzato, M. A., Mohamed, A., Hinton, G. E., 2010. Phone recognition with the mean-covariance restricted Boltzmann machine. Advances in Neural Information Processing Systems 23.

Dahl, G. E., Yu, D., Deng, L., Acero, A., 2012. Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. IEEE Transactions on Audio, Speech, and Language Processing 20 (1), 30-42.

Deb, K., Goldberg, D. E., 1991. Analyzing deception in trap functions. University of Illinois, Department of General Engineering.

Geman, S., Geman, D., 1984. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence PAMI-6 (6), 721-741.

Goldberg, D. E., 1989. Genetic algorithms in search, optimization, and machine learning. Addison-Wesley Professional.

Harik, G., 1999. Linkage learning via probabilistic modeling in the ECGA. IlliGAL Report No. 99010, University of Illinois at Urbana-Champaign, Urbana, IL.

Hinton, G. E., 2002. Training products of experts by minimizing contrastive divergence. Neural Computation 14, 1771-1800.
Hinton, G. E., 2010. A practical guide to training restricted Boltzmann machines. Tech. Rep. UTML TR 2010-003, Department of Computer Science, University of Toronto.

Hinton, G. E., Osindero, S., Teh, Y.-W., 2006. A fast learning algorithm for deep belief nets. Neural Computation 18, 1527-1554.

Holland, J. H., 1975. Adaptation in natural and artificial systems. University of Michigan Press, Ann Arbor.

Huajin, T., Shim, V. A., Tan, K. C., Chia, J. Y., 2010. Restricted Boltzmann machine based algorithm for multi-objective optimization. In: IEEE Congress on Evolutionary Computation (CEC). pp. 1-8.

Kauffman, S. A., Weinberger, E. D., 1989. The NK model of rugged fitness landscapes and its application to maturation of the immune response. Journal of Theoretical Biology 141 (2), 211-245.

Larrañaga, P., Etxeberria, R., Lozano, J. A., Peña, J. M., 1999. Optimization by learning and simulation of Bayesian and Gaussian networks. Tech. Rep. EHU-KZAA-IK-4/99, Intelligent Systems Group, Dept. of Computer Science and Artificial Intelligence, University of the Basque Country.

Larrañaga, P., Karshenas, H., Bielza, C., Santana, R., 2012. A review on probabilistic graphical models in evolutionary computation. Journal of Heuristics 18 (5), 795-819.

Larrañaga, P., Lozano, J. A., 2002. Estimation of Distribution Algorithms: A New Tool for Evolutionary Computation. Genetic Algorithms and Evolutionary Computation, 2. Kluwer Academic Pub.

Lozano, J. A., Larrañaga, P., 2006. Towards a New Evolutionary Computation: Advances on Estimation of Distribution Algorithms. Studies in Fuzziness and Soft Computing. Springer London, Limited.

Miller, B. L., Goldberg, D. E., 1995. Genetic algorithms, tournament selection, and the effects of noise. Complex Systems 9, 193-212.

Mühlenbein, H., Mahnig, T., 1999. FDA - a scalable evolutionary algorithm for the optimization of additively decomposed functions. Evolutionary Computation 7 (4), 353-376.

Mühlenbein, H., Paaß, G., 1996.
From recombination of genes to the estimation of distributions I. Binary parameters. In: Voigt, H.-M., Ebeling, W., Rechenberg, I., Schwefel, H.-P. (Eds.), Parallel Problem Solving from Nature - PPSN IV. Vol. 1141 of Lecture Notes in Computer Science. Springer Berlin Heidelberg, pp. 178-187.

Pelikan, M., 2005. Bayesian Optimization Algorithm. In: Hierarchical Bayesian Optimization Algorithm. Vol. 170 of Studies in Fuzziness and Soft Computing. Springer Berlin / Heidelberg, pp. 31-48.

Pelikan, M., January 2008. Analysis of estimation of distribution algorithms and genetic algorithms on NK landscapes. Tech. Rep. 2008001, Missouri Estimation of Distribution Algorithms Laboratory (MEDAL); instances can be found at http://medal-lab.org/files/nk-instances.tar.gz.

Pelikan, M., Goldberg, D. E., 2003. Hierarchical BOA solves Ising spin glasses and MAXSAT. In: Genetic and Evolutionary Computation - GECCO 2003. Springer, pp. 1271-1282.

Pelikan, M., Goldberg, D. E., Cantu-Paz, E., 1999a. BOA: The Bayesian Optimization Algorithm. In: Genetic and Evolutionary Computation Conference. pp.

http://doi.acm.org/10.1145/2576768.2598273
Qian, N., 1999. On the momentum term in gradient descent learning algorithms. Neural Networks 12 (1), 145-151.

Salakhutdinov, R., Mnih, A., Hinton, G. E., 2007. Restricted Boltzmann machines for collaborative filtering. In: Machine Learning, Proceedings of the Twenty-fourth International Conference (ICML 2007). ACM, pp. 791-798.

Schwarz, G., 1978. Estimating the dimension of a model. The Annals of Statistics 6 (2), 461-464.

Shim, V. A., Tan, K. C., 2012. Probabilistic graphical approaches for learning, modeling, and sampling in evolutionary multi-objective optimization. In: Liu, J., Alippi, C., Bouchon-Meunier, B., Greenwood, G., Abbass, H. (Eds.), Advances in Computational Intelligence. Vol. 7311 of Lecture Notes in Computer Science. Springer Berlin Heidelberg, pp. 122-144.

Shim, V. A., Tan, K. C., Chia, J. Y., 2010. Probabilistic based evolutionary optimizers in bi-objective travelling salesman problem. In: Deb, K., Bhattacharya, A., Chakraborti, N., Chakroborty, P., Das, S., Dutta, J., Gupta, S., Jain, A., Aggarwal, V., Branke, J., Louis, S., Tan, K. (Eds.), Simulated Evolution and Learning. Vol. 6457 of Lecture Notes in Computer Science. Springer Berlin Heidelberg, pp. 588-592.

Shim, V. A., Tan, K. C., Chia, J. Y., Al Mamun, A., Mar. 2013. Multi-objective optimization with estimation of distribution algorithm in a noisy environment. Evolutionary Computation 21 (1), 149-177.

Smolensky, P., 1986. Information processing in dynamical systems: Foundations of harmony theory. In: Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Vol. 1. MIT Press, Cambridge, MA, USA, pp. 194-281.

Zhang, B.-T., Shin, S.-Y., 2000. Bayesian evolutionary optimization using Helmholtz machines. In: Parallel Problem Solving from Nature PPSN VI, pp. 827-836.
Johannes Gutenberg-Universität Mainz, Dept. of Information Systems and Business Administration, Jakob-Welder-Weg 9, 55128 Mainz, Germany

E-mail address: {probst|rothlauf|grahl}@uni-mainz.de