Scalability of using Restricted Boltzmann Machines for Combinatorial Optimization
MALTE PROBST, FRANZ ROTHLAUF, AND JÖRN GRAHL
Abstract.
Estimation of Distribution Algorithms (EDAs) require flexible probability models that can be efficiently learned and sampled. Restricted Boltzmann Machines (RBMs) are generative neural networks with these desired properties. We integrate an RBM into an EDA and evaluate the performance of this system in solving combinatorial optimization problems with a single objective. We assess how the number of fitness evaluations and the CPU time scale with problem size and with problem complexity. The results are compared to the Bayesian Optimization Algorithm, a state-of-the-art EDA. Although RBM-EDA requires larger population sizes and a larger number of fitness evaluations, it outperforms BOA in terms of CPU times, in particular if the problem is large or complex. RBM-EDA requires less time for model building than BOA. These results highlight the potential of using generative neural networks for combinatorial optimization.

1. Introduction
Estimation of Distribution Algorithms (EDA, Mühlenbein and Paaß, 1996; Larrañaga and Lozano, 2002) are metaheuristics for combinatorial and continuous non-linear optimization. They maintain a population of solutions which they improve over consecutive generations. Unlike other heuristic methods, EDAs do not improve solutions with mutation, recombination, or local search. Instead, they estimate how likely it is that decisions are part of an optimal solution, and try to uncover the dependency structure between the decisions. This information is obtained from the population by the estimation of a probabilistic model. If a probabilistic model generalizes the population well, random samples drawn from the model have a structure and solution quality that is similar to the population itself. Repeated model estimation, sampling, and selection steps can solve difficult optimization problems in theory (Mühlenbein and Mahnig, 1999) and in practice (Lozano and Larrañaga, 2006).

It is important to empirically assess the efficiency of using probability models in EDAs. Simple models, such as factorizations of univariate frequencies, can be quickly estimated from a population, but they cannot represent interactions between the decision variables. As a consequence, EDAs using univariate frequencies cannot efficiently solve complex problems. Using flexible probability models such as Bayesian networks allows complex problems to be solved, but fitting the model to a population and sampling new solutions can be very time-consuming. A central goal of EDA research is the identification of probabilistic models that are flexible and can quickly be estimated and sampled. This is also a central topic
Key words and phrases.
Combinatorial Optimization, Heuristics, Evolutionary Computation, Estimation of Distribution Algorithms, Neural Networks.

in the field of machine learning. A recent focus in machine learning is the development of feasible unsupervised learning algorithms for generative neural networks. These algorithms and models can learn complex patterns from high-dimensional datasets. Moreover, generative neural networks can sample new data based on the associations that they have learned so far, and they can be fit to data (e.g., to a population of solutions) in an unsupervised manner. This makes them potentially useful for EDAs. Some of these models can also be "stacked" in several layers and be used as building blocks for "deep learning".

In this paper, we focus on Restricted Boltzmann Machines (RBM, Smolensky, 1986; Hinton, 2002). RBMs are a basic, yet powerful, type of generative neural network where the connections between the neurons form a bipartite graph (Section 3 explains RBMs in detail). Due to a recent breakthrough (Hinton et al., 2006), training RBMs is computationally tractable. They show impressive performance in classic machine learning tasks such as image or voice recognition (Dahl et al., 2012).

Given these successes, it is not surprising that researchers have integrated RBMs and similar models into EDAs and studied how these systems perform in optimization tasks. Zhang and Shin (2000) used a Helmholtz Machine in an EDA. Helmholtz Machines are predecessors of RBMs. Due to their limited performance, they are nowadays largely abandoned. Zhang and Shin (2000) evaluated their EDA by comparing it to a simple Genetic Algorithm (Goldberg, 1989). They did not study the scaling behavior for problems of different sizes and complexity. In a series of recent papers, Huajin et al. (2010); Shim et al. (2010); Shim and Tan (2012) and Shim et al. (2013) studied EDAs that use RBMs. These works are similar to ours in that an RBM is used inside an EDA.
An important difference is that they considered problems with multiple objectives. Also, they hybridized the EDA with particle swarm optimization. Thus, it is unknown whether using an RBM in an EDA leads to competitive performance in single-objective combinatorial optimization. Therefore, in this paper, we raise the following questions:

(1) How efficient are EDAs that use RBMs for single-objective combinatorial optimization?
(2) How does the runtime scale with problem size and problem difficulty?
(3) Is the performance competitive with the state-of-the-art?

To answer these questions, we integrated an RBM into an EDA (the RBM-EDA) and conducted a thorough experimental scalability study. We systematically varied the difficulty of two tunably complex single-objective test problems (concatenated trap functions and NK landscapes), and we computed how the runtime and the number of fitness evaluations scale with problem size. We then compared the results to those obtained for the Bayesian Optimization Algorithm (BOA, Pelikan et al., 1999a; Pelikan, 2005). BOA is a state-of-the-art EDA.

RBM-EDA solved the test problems in polynomial time depending on the problem size. Indeed, concatenated trap functions can only be solved in polynomial time by decomposing the overall problem into smaller parts and then solving the parts independently. RBM-EDA's polynomial scalability suggests that RBM-EDA recognized the structure of the problem correctly and that it solved the sub-problems independently from one another. The hidden neurons of the RBM (its latent variables) captured the building blocks of the optimization problem. The runtime of RBM-EDA scaled better than that of BOA on trap functions of high order and
NK landscapes with large k. RBM-EDA hence appears to be useful for complex problems. It was mostly faster than BOA if instances were large.

The paper is structured as follows: In Section 2, we introduce Estimation of Distribution Algorithms and the Bayesian Optimization Algorithm. In Section 3, we introduce Restricted Boltzmann Machines, show how an RBM samples new data, and describe how an RBM is fit to given data. Section 3.4 describes RBM-EDA. The test functions, the experimental design, and the results are presented in Section 4. Section 5 concludes the paper.

2. Estimation of Distribution Algorithms
We introduce Estimation of Distribution Algorithms (Section 2.1) and the Bayesian Optimization Algorithm (Section 2.2).

2.1.
Estimation of Distribution Algorithms.
EDAs are population-based metaheuristics (Mühlenbein and Paaß, 1996; Mühlenbein and Mahnig, 1999; Pelikan et al., 1999b; Larrañaga et al., 1999). Similar to Genetic Algorithms (GA, Holland, 1975; Goldberg, 1989), they evolve a population of solutions over a number of generations by means of selection and variation.

Algorithm 1 outlines the basic functionality of an EDA. After initializing a population P of solutions, the EDA runs for multiple generations. In each generation, a selection operator selects a subset P_parents of high-quality solutions from P. P_parents is then used as input for the variation step. In contrast to a GA, which creates new individuals using recombination and mutation, an EDA builds a probabilistic model M from P_parents, often by estimating their (joint) probability distribution. Then, the EDA draws samples from M to obtain new candidate solutions. Together with P_parents, these candidate solutions constitute P for the next generation. The algorithm stops after the population has converged or another termination criterion is met.

Algorithm 1 Estimation of Distribution Algorithm
  Initialize population P
  while not converged do
    P_parents ← Select high-quality solutions from P based on their fitness
    M ← Build a model estimating the (joint) probability distribution of P_parents
    P_candidates ← Sample new candidate solutions from M
    P ← P_parents ∪ P_candidates
  end while

EDA variants mainly differ in their probabilistic models M. The models describe dependency structures between the decision variables with different types of probability distributions. Consider a binary solution space with n decision variables. A naive EDA could attempt to store a probability for each solution; M would then contain 2^n probabilities. This could be required if all variables depended on each other. However, storing 2^n probabilities is computationally intractable for large n. If some decision variables are independent from other variables, the joint distribution can be factorized into products of marginal distributions and the space required for storing M shrinks. If all variables are independent, only n probabilities have to be stored. In most problems, some variables are independent of other variables, but the structure of the dependencies is unknown to those who want to solve the problem. Hence, model building consists of finding a network structure that matches the problem structure and estimating the model's parameters.

Simple models like the Breeder Genetic Algorithm (Mühlenbein and Paaß, 1996) or population-based incremental learning (Baluja, 1994) use univariate, fully factorized probability distributions with a vector of activation probabilities for the variables and ignore dependencies between decision variables. Slightly more complex approaches like the bivariate marginal distribution algorithm use bivariate probability distributions, which model pairwise dependencies between variables as trees or forests (Pelikan and Mühlenbein, 1999). More complex dependencies between variables can be captured by models with multivariate interactions, like the Bayesian Optimization Algorithm (Pelikan et al., 1999a, see Section 2.2) or the extended compact GA (Harik, 1999). Such multivariate models are better suited for complex optimization problems; univariate models can lead to an exponential growth of the required number of fitness evaluations (Pelikan et al., 1999a; Pelikan, 2005). However, the computational effort to build a model M increases with its complexity and representational power. Many algorithms use probabilistic graphical models with directed edges, i.e., Bayesian networks, or undirected edges, i.e., Markov random fields (Larrañaga et al., 2012).

2.2. Bayesian Optimization Algorithm.
The Bayesian Optimization Algorithm is a state-of-the-art EDA for discrete optimization problems. It was proposed by Pelikan et al. (1999a) and has been heavily used and researched since then (Pelikan and Goldberg, 2003; Pelikan, 2008; Abdollahzadeh et al., 2012).

BOA uses a Bayesian network for modeling dependencies between variables. Decision variables correspond to nodes and dependencies between variables correspond to directed edges. As the number of possible network topologies grows exponentially with the number of nodes, BOA uses a greedy construction heuristic to find a network structure G to model the training data. Starting from an unconnected (empty) network, BOA evaluates all possible additional edges, adds the one that maximally increases the fit between the model and the selected individuals, and repeats this process until no more edges can be added. The fit between the selected individuals and the model is measured by the Bayesian Information Criterion (BIC, Schwarz, 1978). BIC is based on the conditional entropy of nodes given their parent nodes and correction terms penalizing complex models. BIC assigns each network G a scalar score

(1) BIC(G) = Σ_{i=1}^{n} ( −H(X_i | Π_i) · N − 2^{|Π_i|} · log₂(N)/2 ),

where n is the number of decision variables, N is the sample size (i.e., the number of selected individuals), Π_i are the predecessors of node i in the Bayesian network (i's parents), and |Π_i| is the number of parents of node i. The term H(X_i | Π_i) is the conditional entropy of the i'th decision variable X_i given its parental nodes Π_i, defined as

(2) H(X_i | Π_i) = −Σ_{x_i, π_i} p(x_i, π_i) log p(x_i | π_i),

Figure 1.
A Restricted Boltzmann Machine as a graph. The visible neurons v_i (i ∈ 1..n) can hold a data vector of length n from the training data. In the EDA context, V represents decision variables. The hidden neurons h_j (j ∈ 1..m) represent m features. Weight w_ij connects v_i to h_j.

where p(x_i, π_i) is the observed probability of instances where X_i = x_i and Π_i = π_i, and p(x_i | π_i) is the conditional probability of instances where X_i = x_i given that Π_i = π_i. The sum in (2) runs over all possible configurations of X_i and Π_i. The BIC score depends only on the conditional entropy of a node and its parents. Therefore, it can be calculated independently for all nodes. If an edge is added to the Bayesian network, the change of the BIC can be computed quickly. The second term in (1) penalizes model complexity. BOA's greedy network construction algorithm adds the edge with the largest gain in BIC(G) until no more edges can be added. Edge additions resulting in cycles are not considered.

After the network structure has been learned, BOA calculates the conditional activation probability tables for each node. Once the model structure and conditional activation probabilities are available, BOA can produce new candidate solutions by drawing random values for all nodes in topological order.

3. Restricted Boltzmann Machines and the RBM-EDA
Restricted Boltzmann Machines (Smolensky, 1986) are stochastic neural networks that are successful in areas such as image classification, natural language processing, and collaborative filtering (Dahl et al., 2010; Hinton et al., 2006; Salakhutdinov et al., 2007). In this section, we describe the structure of Restricted Boltzmann Machines (Section 3.1), show how an RBM can sample new data (Section 3.2), and how contrastive divergence learning is used to model the probability distribution of given data (Section 3.3). Finally, we describe RBM-EDA, an EDA that uses an RBM as its probabilistic model (Section 3.4).

3.1.
Structure of RBMs.
Figure 1 illustrates the structure of an RBM. We denote V as the input (or "visible") layer. V holds the input data represented by n binary variables v_i, i = 1, . . . , n. The m binary neurons h_j, j = 1, . . . , m of the hidden layer H are called feature detectors, as they are able to model patterns in the data. A weight matrix W holds weights w_ij ∈ ℝ between all neurons v_i and h_j. Together, V, H, and W form a bipartite graph. The weights W are undirected. An RBM forms a Markov random field.

In the sampling and training phases, each neuron in V and H makes stochastic decisions about whether it is active (its value then becomes 1) or not (its value then becomes 0). To this end, it collects input from all neurons to which it is directly connected. W determines the strengths of these inputs.

An RBM encodes a joint probability distribution P(V, H). In the sampling phase, a configuration of V and H is thus sampled with probability P(V, H) (Smolensky, 1986). In the training phase, the weights W are adapted such that the marginal probability P(V) approximates the probability distribution of the training data. Training and sampling are tractable because of the bipartite structure of the RBM. Hence, it is not necessary to know the problem structure beforehand.

3.2. Sampling.
The goal of the sampling phase is to generate new values for the neurons in the visible layer V according to P(V, H). This is straightforward if the activations of the neurons in the hidden layer H are known. In this case, all v_i are independent of each other and the conditional probability that v_i is active is simple to compute. The conditional probability P(v_i = 1 | H) that the visible neuron v_i is active, given the hidden layer H, is calculated as

(3) P(v_i = 1 | H) = sigm(Σ_j w_ij h_j),

where sigm(x) = 1/(1 + e^{−x}) is the logistic function. Analogously, given the activations of the visible layer V, the conditional probability P(h_j = 1 | V) for the hidden neurons H is calculated as

(4) P(h_j = 1 | V) = sigm(Σ_i w_ij v_i).

Although the two conditional distributions P(V|H) and P(H|V) are simple to compute, sampling from the joint probability distribution P(V, H) is much more difficult, as it usually requires integrating over one of the conditional distributions. An alternative is Gibbs sampling, which approximates the joint distribution P(V, H) from the conditional distributions. Gibbs sampling starts by assigning random values to the visible neurons. Then, it iteratively samples from P(H|V) and P(V|H), respectively, each time assigning the result of the previous sampling step to the non-sampled layer. Sampling in the order V → H → V → H → . . . forms a Markov chain. Its stationary distribution is identical to the joint probability distribution P(V, H) (Geman and Geman, 1984). The quality of the approximation increases with the number of sampling steps. If Gibbs sampling is started with a V that has a high probability under the stationary distribution, only a few sampling steps are necessary to obtain a good approximation.

3.3. Training.
In the training phase, the RBM adapts the weights W such that P(V) approximates the distribution of the training data. An effective approach for adjusting the weights of an RBM is contrastive divergence (CD) learning (Hinton, 2002). CD learning maximizes the log-likelihood of the training data under the model, log(P(V)), by performing a stochastic gradient ascent. The main element of CD learning is Gibbs sampling.

In addition, all neurons are connected to a special "bias" neuron, which is always active and works like an offset to the neuron's input. Bias weights are treated like normal weights during learning. For brevity, we omit the bias terms throughout the paper. For details, see Hinton et al. (2006).
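To make the conditional distributions (3) and (4) and the alternating Gibbs chain concrete, here is a minimal NumPy sketch. It is our own illustration, not the authors' implementation; bias terms are omitted as in the text, and all names are ours:

```python
import numpy as np

def sigm(x):
    # Logistic function: sigm(x) = 1 / (1 + exp(-x))
    return 1.0 / (1.0 + np.exp(-x))

def sample_h_given_v(v, W, rng):
    # Eq. (4): P(h_j = 1 | V) = sigm(sum_i w_ij * v_i)
    p_h = sigm(v @ W)
    return (rng.random(p_h.shape) < p_h).astype(float), p_h

def sample_v_given_h(h, W, rng):
    # Eq. (3): P(v_i = 1 | H) = sigm(sum_j w_ij * h_j)
    p_v = sigm(h @ W.T)
    return (rng.random(p_v.shape) < p_v).astype(float), p_v

def gibbs_sample(v0, W, steps, rng):
    # Alternate V -> H -> V -> ...; the stationary distribution
    # of this Markov chain is the RBM's joint distribution P(V, H).
    v = v0
    for _ in range(steps):
        h, _ = sample_h_given_v(v, W, rng)
        v, _ = sample_v_given_h(h, W, rng)
    return v
```

Here `v` holds one binary data vector per row, and `W` has shape (n, m), so the bipartite structure of the RBM reduces each sampling step to a single matrix product.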
For each point V in the training data, CD learning updates w_ij in the direction of ∂ log(P(V))/∂w_ij. This partial derivative is the difference of two terms, usually referred to as the positive and negative gradient, Δ^pos_ij and Δ^neg_ij (Hinton, 2002). The total gradient Δw_ij is

(5) Δw_ij = Δ^pos_ij − Δ^neg_ij = ⟨v_i · h_j⟩_data − ⟨v_i · h_j⟩_model,

where ⟨x⟩ is the expected value of x. Δ^pos_ij is the expected value of v_i h_j when setting the visible layer V to a data vector from the training set and sampling H according to (4). Δ^pos_ij increases the marginal probability P(V) of the data point V. In contrast, Δ^neg_ij is the expected value of a configuration sampled from the joint probability distribution P(V, H), which is approximated by Gibbs sampling. If P(V) is equal to the distribution of the training data, the positive and negative gradient equal each other in expectation and the total gradient becomes zero.

Calculating Δ^neg_ij exactly is infeasible since a large number of Gibbs sampling steps is required until the RBM samples from its stationary distribution. Therefore, CD estimates Δ^neg_ij using two approximations. First, CD initializes the Markov chain with a data vector from the training set, rather than using an unbiased, random starting point. Second, only a small number of sampling steps is used. We denote CD using N sampling steps as CD-N. CD-N approximates the negative gradient Δ^neg_ij by initializing the sampling chain to the same data point V which is used for the calculation of Δ^pos_ij. Subsequently, it performs N Gibbs sampling steps. Note that the first half-step V → H has, in practice, already been performed during the calculation of Δ^pos_ij. Despite using these two approximations, CD-N usually works well (Hinton et al., 2006).

Algorithm 2 describes contrastive divergence for N = 1 (CD-1).
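A CD-1 weight update for a mini-batch, following (5) and (6), might look as follows. This is our own NumPy sketch, not the authors' Java implementation; bias terms are again omitted:

```python
import numpy as np

def sigm(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(V, W, alpha, rng):
    """One CD-1 update for a mini-batch V (one training vector per row)."""
    # Positive phase: clamp V to the data and sample H from P(H|V), Eq. (4).
    p_h = sigm(V @ W)
    H = (rng.random(p_h.shape) < p_h).astype(float)
    pos = V.T @ H                      # estimate of <v_i h_j>_data
    # Negative phase: one Gibbs step yields the "reconstruction" V_hat, Eq. (3).
    p_v_hat = sigm(H @ W.T)
    V_hat = (rng.random(p_v_hat.shape) < p_v_hat).astype(float)
    p_h_hat = sigm(V_hat @ W)
    neg = V_hat.T @ p_h_hat            # v_hat_i * P(h_hat_j | V_hat)
    # Eqs. (5) and (6), averaged over the mini-batch.
    return W + alpha * (pos - neg) / V.shape[0]
```

Using the probabilities P(ĥ_j | V̂) instead of sampled binary states in the negative term, as in the text, reduces sampling noise.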
For each training vector, V is initialized with the training vector and H is sampled according to (4). This allows the calculation of Δ^pos_ij as v_i h_j. Following this, two additional sampling steps are carried out: First, we calculate the "reconstruction" V̂ of the training vector as in (3). Subsequently, we calculate the corresponding hidden probabilities P(ĥ_j = 1 | V̂). Now, we can approximate Δ^neg_ij as v̂_i · P(ĥ_j | V̂) and obtain Δw_ij. Finally, we update the weights as

(6) w_ij := w_ij + α · Δw_ij,

where α ∈ (0, 1) is a learning rate defined by the user. This procedure is repeated for several epochs, i.e., passes through the training set. Usually, CD is implemented in a mini-batch fashion. That is, we calculate Δw_ij in (6) for multiple training examples at the same time and subsequently use the average gradient to update w_ij. This reduces sampling noise and makes the gradient more stable (Bishop, 2006; Hinton et al., 2006).

3.4. Restricted Boltzmann EDA.
This section describes how we used an RBM in an EDA. The RBM should model the properties of promising solutions and then be used to sample new candidate solutions. In each generation of the EDA, we trained the RBM to model the probability distribution of the solutions which survived the selection process. Then, we sampled candidate solutions from the RBM and evaluated their fitness.
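Putting the pieces together, one generation of RBM-EDA could be organized as follows. This is a schematic sketch only: `train_rbm` and `sample_rbm` stand in for the training and sampling phases of Sections 3.2 and 3.3, and the tournament scheme is a simplified pairwise variant:

```python
import numpy as np

def rbm_eda_generation(P, fitness, train_rbm, sample_rbm, rng):
    """One RBM-EDA generation: selection, model building, sampling.

    P: (pop_size, n) binary array; fitness: callable on one row.
    train_rbm / sample_rbm: placeholders for the RBM phases.
    """
    # Tournament selection of size two: pair up shuffled individuals
    # and keep the fitter solution of each pair.
    idx = rng.permutation(len(P))
    winners = [P[a] if fitness(P[a]) >= fitness(P[b]) else P[b]
               for a, b in zip(idx[0::2], idx[1::2])]
    parents = np.array(winners)
    model = train_rbm(parents)                         # fit P(V) to the parents
    candidates = sample_rbm(model, len(parents), rng)  # draw new solutions
    # As in Algorithm 1: P <- P_parents U P_candidates.
    return np.vstack([parents, candidates])
```

The loop over generations, the bisection-based population sizing, and the termination criteria are then wrapped around this function.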
We chose the following parameters: The number m of hidden neurons was set to half the number n of visible neurons (i.e., half the problem size). Standard values were used for the parameters that control the training and sampling phases (Hinton, 2010).

Sampling parameters. When sampling new candidate solutions from the RBM, we used the individuals in P_parents to initialize the visible neurons close to the stationary distribution. Subsequently, we performed 25 full Gibbs sampling steps.

Training parameters. The learning rate α was set to 0.05 and 0.5 for the weights and biases, respectively. We applied standard momentum (Qian, 1999), which adds a fraction β of the last parameter update to the current update, making the update less prone to fluctuations caused by noise and dampening oscillations. We started with β = 0.5 and increased it later in the training phase. Weight decay adds the penalty term −0.5 · γ · W² to the RBM's optimization objective. The partial gradient Δw_ij in Equation (5) thus includes the term −γ · w_ij, which decays large weights and thereby reduces overfitting. The weight cost γ determines the strength of the decay; we chose a small standard value for γ (Hinton, 2010). The biases of the visible neurons were initialized to log(P(v_i = 1)/(1 − P(v_i = 1))), where P(v_i = 1) is the probability that the visible neuron v_i is set to one in the training set (see Hinton, 2010). This initialization speeds up the training process.

Parameter control. We applied a simple parameter control scheme for the learning rate α, the momentum β, and the number of epochs. The scheme was based on the reconstruction error e. The reconstruction error is the difference between a training vector V and its reconstruction V̂ after a single step of Gibbs sampling (see lines 2 and 7 in Algorithm 2). e usually decreases with the number of epochs. Every second epoch t, we calculated for a fixed subset s of the training set S the relative difference

e_{s,t} = 1/|s| Σ_{j∈s} Σ_i |v_i − v̂_i| / n

between v and v̂. We measured the relative decrease γ of the reconstruction error over the last 25% of all epochs. γ was then used to adjust the learning parameters: α was initialized with a larger value in the first epoch and decreased, and β was adjusted between 0.5 and its final value, as γ fell below pre-defined thresholds. We stopped the training phase as soon as γ < 0.01. The rationale behind this was that the RBM had then learned the relevant dependencies between the variables, and further training was unlikely to improve the model considerably. Furthermore, we stopped the training if the RBM was overfitting, i.e., learning noise instead of problem structure. To this end, we split the original training set into a training set S containing 90% of all samples and a validation set S′ containing the remaining 10%. We trained the RBM only on the solutions in S and, after each epoch, calculated the reconstruction errors e_S and e_S′ for the training and validation set, respectively. We stopped the training phase as soon as |e_S − e_S′| / e_S′ ≥ 0.02 (i.e., the relative difference between the reconstruction errors was larger than 2%).

CD learning does not minimize the reconstruction error e but maximizes P(V), which cannot be calculated tractably. Nevertheless, the reconstruction error is usually a good indicator of how well the model can (re-)produce the training data.

Algorithm 2 Pseudo code for a training epoch using CD-1
  for all training examples do
    V ← set V to the current training example
    H ← sample H | V, i.e., set h_j to 1 with P(h_j = 1 | V) from (4)
    Δ^pos_ij = v_i h_j
    V̂ ← sample "reconstruction" V̂ | H, using (3)
    Ĥ ← calculate P(Ĥ | V̂) as in (4)
    Δ^neg_ij = v̂_i · P(ĥ_j | V̂)
    Δw_ij ← calculate all Δw_ij as in (5)
    w_ij ← update all weights according to (6)
  end for

4. Experiments
We describe the test problems (Section 4.1) and our experimental design (Section 4.2). The results are presented in Section 4.3.

4.1.
Test Problems.
We evaluated the performance of RBM-EDA on onemax, concatenated deceptive traps (Ackley, 1987), and NK landscapes (Kauffman and Weinberger, 1989). All three are standard benchmark problems. Their difficulty depends on the problem size, i.e., problems with more decision variables are more difficult. Furthermore, the difficulty of concatenated deceptive trap functions and NK landscapes is tunable by a parameter. All three problems are defined on binary strings of fixed length.

The onemax problem assigns a binary solution x of length l the fitness value f(x) = Σ_{i=1}^{l} x_i, i.e., the fitness of x is the number of ones in x. The onemax function is rather simple. It is unimodal and can be solved by a deterministic hill climber.

A trap function is defined on a binary solution x of length k. It assigns a solution x the fitness

f_k(x) = k if Σ_i x_i = k, and f_k(x) = k − (Σ_i x_i + 1) otherwise.

The optimal solution consists of all ones. Trap functions are difficult to solve because the second-best solution consists of all zeros and the structure of the function is deceptive: it leads search methods away from the global optimum towards the second-best solution.

A concatenated trap function places o trap functions of order k on a single bitstring x of length o · k. Its fitness is calculated as f(x) = Σ_{i=1}^{o} f_k^i (Ackley, 1987). The problem difficulty increases with k as well as with o. Concatenated traps are decomposable: dependencies exist between the k variables of a trap function, but not between variables in the o different trap functions (Deb and Goldberg, 1991).

NK landscapes are defined by two parameters N and k and N fitness components f_i^N (Kauffman and Weinberger, 1989). A solution vector x consists of l = N bits. The bits are assigned to N overlapping subsets, each of size k + 1. The fitness of a solution is the sum of the N fitness components. Each component f_i^N depends on the value of the corresponding variable x_i as well as k other variables.
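The onemax and concatenated trap functions defined above can be written down directly (our own illustration):

```python
def onemax(x):
    # f(x) = number of ones in the bitstring x
    return sum(x)

def trap_k(block):
    # Deceptive trap of order k on a single block of bits:
    # k if all bits are one, else k - (number of ones + 1).
    k, u = len(block), sum(block)
    return k if u == k else k - (u + 1)

def concatenated_traps(x, k):
    # o = len(x) / k independent traps placed on the bitstring.
    assert len(x) % k == 0
    return sum(trap_k(x[i:i + k]) for i in range(0, len(x), k))
```

Note that `trap_k` rewards each additional zero until the all-zeros block, which is exactly what makes the function deceptive for search methods that follow the local gradient.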
Each f_i^N maps each possible configuration of its k + 1 variables to a fitness value. The overall fitness function is therefore

f(x) = 1/N Σ_{i=1}^{N} f_i^N(x_i, x_{i1}, . . . , x_{ik}).

Each decision variable usually influences several f_i^N. These dependencies between subsets make NK landscapes non-decomposable. The problem difficulty increases with k. k = 0 is a special case where all decision variables are independent and the problem reduces to onemax.

4.2. Experimental Setup.
We adopted an experimental setup similar to Pelikan (2008). We studied the performance of RBM-EDA on the onemax function, on concatenated deceptive traps with trap sizes k ∈ {4, 5, 6}, and on NK landscapes with k ∈ {2, 3, 4, 5}. We report results for BOA so that RBM-EDA can be compared to the state-of-the-art.

Both EDAs used tournament selection without replacement of size two (Miller and Goldberg, 1995). We used bisection to determine the smallest population size for which a method solved a problem to optimality. For onemax and deceptive traps, we required each method to find the optimal solution in 30 out of 30 independent runs. For NK landscapes, we used 25 randomly chosen problem instances per size and determined the population size that solved five out of five independent runs of each instance.

We report the average number of fitness evaluations that were necessary to solve the problem to optimality. In addition, we report average CPU running times. CPU running time includes the time required for fitness calculation, the time required for model building (either building the Bayesian network or training the RBM), the time required for sampling new solutions, and the time required for selection. We report CPU times even though previous EDA research mostly ignored the time required for model building and sampling.

We implemented RBM-EDA and BOA in Java. The experiments were conducted on a single core of an AMD Opteron 6272 processor at 2,100 MHz. The JBLAS library was used for the linear algebra operations of the RBM.

4.3. Results.
We report the performance of RBM-EDA and BOA on onemax (Figure 2), concatenated traps (Figure 3), and NK landscapes (Figure 4). The figures have a log-log scale. Straight lines indicate polynomial scalability. Each figure shows the average number of fitness evaluations (left-hand side) and the overall CPU time (right-hand side) required until the optimal solution was found. The figures also show regression lines obtained from fitting a polynomial to the raw results (details are in Table 1).

First, we study the number of fitness evaluations required until the optimal solution was found (Figures 2-4, left). For the onemax problem, RBM-EDA needed fewer fitness evaluations than BOA (Figure 2, left-hand side) and had a slightly lower complexity. For concatenated traps, BOA needed fewer fitness evaluations (Figure 3, left-hand side). As the problem difficulty increased (larger k), the performance of the two approaches became more similar, with the complexity of BOA remaining slightly lower than that of RBM-EDA. The situation was similar for NK landscapes, where BOA needed fewer fitness evaluations and scaled better than RBM-EDA (Figure 4, left-hand side).

Figure 2. Number of evaluations (left) and CPU time (right) over problem size l for the onemax problem.

We now discuss average total CPU times (Figures 2-4, right-hand side). Besides the time required for fitness evaluation, this includes time spent for model building, sampling, and selection. For the onemax problem, RBM-EDA was faster than BOA and had a lower time complexity (Figure 2, right-hand side). For deceptive trap functions, BOA was faster on small problem instances, but its time complexity was larger than that of RBM-EDA (Figure 3, right-hand side). Hence, the absolute difference in computation time became smaller for larger and more difficult problems. For traps of size k = 5, RBM-EDA was faster for problems with more than 180 decision variables (36 concatenated traps). For traps of size k = 6, RBM-EDA was already faster for problems with more than 60 decision variables (10 concatenated traps). BOA's time complexity increased slightly with higher k, whereas the time complexity of RBM-EDA remained about the same.

The results for NK landscapes were qualitatively similar (Figure 4, right-hand side). BOA was faster than RBM-EDA, but its computational effort increased more strongly with n. Therefore, the computation times became similar for difficult and large problems (cf. the results for k = 5 for various problem sizes). We omitted the relative CPU times for selection in order to increase the readability of Figure 5; by definition, they are proportional to the CPU time for fitness evaluations and, in absolute numbers, negligible.
6) and NK landscapes of size k ∈ (3 , , F i t ne ss E v a l ua t i on s ProblemSize
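The scaling exponents and regression lines in Figures 2-4 come from fitting a straight line to the results on the log-log scale. A minimal sketch of this kind of estimate (not the authors' code; `scaling_exponent` and the synthetic data below are illustrative):

```python
import numpy as np

def scaling_exponent(sizes, costs):
    """Estimate b in cost ~ a * n^b by linear regression in log-log space.

    On a log-log plot, a polynomial cost curve is a straight line whose
    slope equals the exponent b, which is how regression lines like those
    in Figures 2-4 can be obtained.
    """
    slope, _intercept = np.polyfit(np.log(sizes), np.log(costs), deg=1)
    return slope

# Synthetic example: a quantity that grows as n^1.5.
sizes = np.array([30.0, 60.0, 120.0, 240.0, 480.0])
costs = 3.0 * sizes ** 1.5
print(round(scaling_exponent(sizes, costs), 2))  # -> 1.5
```

In practice one would fit against the measured numbers of fitness evaluations or CPU times, averaged over independent runs.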
Figure 3. Number of fitness evaluations (left) and CPU time (right) over problem size l for deceptive traps with (a) k = 4, (b) k = 5, and (c) k = 6.

First, we study the absolute CPU times for model building (Figure 5, left-hand side). For all three problem types shown (onemax, 5-traps, and NK landscapes with k = 5), RBM-EDA's model building had a lower CPU time complexity than BOA's. Model building was faster for RBM-EDA than for BOA for all onemax problems, for concatenated 5-traps of size ≥ 140, and for NK k = 5 problems of size 30 and larger.
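For reference, the concatenated deceptive trap function used in these experiments follows the standard definition of Deb and Goldberg (1991): each k-bit block rewards the all-ones pattern, but for all other patterns the fitness increases as more bits are set to zero, which misleads algorithms relying on univariate statistics. A sketch (function names are ours; the paper's exact normalization may differ):

```python
def trap_block(u, k):
    """Deceptive trap on a block of k bits with u ones (Deb & Goldberg, 1991).

    The optimum u == k is isolated: for u < k, fitness is k - 1 - u, so it
    grows as bits are *unset*, pulling search away from the optimum.
    """
    return k if u == k else k - 1 - u

def concatenated_trap(bits, k):
    """Sum of trap values over consecutive, non-overlapping k-bit blocks."""
    assert len(bits) % k == 0
    return sum(trap_block(sum(bits[i:i + k]), k)
               for i in range(0, len(bits), k))

# A 10-bit problem with k = 5: the all-ones string is the global optimum.
print(concatenated_trap([1] * 10, 5))  # -> 10
print(concatenated_trap([0] * 10, 5))  # -> 8
```

Onemax is the degenerate case without deception: the fitness is simply the number of ones in the string.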
Figure 4. Number of evaluations (left) and CPU time (right) over problem size l = N for NK landscapes with (a) k = 2, (b) k = 3, (c) k = 4, and (d) k = 5.
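The NK landscapes underlying Figure 4 assign each bit a random fitness contribution that depends on the bit itself and on k other bits (Kauffman and Weinberger, 1989). A minimal sketch using circular adjacent neighborhoods (the actual benchmark instances, e.g. those of Pelikan (2008), may define neighborhoods and tables differently):

```python
import random

def make_nk_landscape(n, k, seed=0):
    """Random NK landscape: each bit i contributes a value drawn from a
    random lookup table indexed by bit i and its k (circular) neighbors.
    Larger k couples more bits and makes the landscape more rugged.
    """
    rng = random.Random(seed)
    # One lookup table per bit: 2^(k+1) random contributions in [0, 1).
    tables = [[rng.random() for _ in range(2 ** (k + 1))] for _ in range(n)]

    def fitness(bits):
        total = 0.0
        for i in range(n):
            # Build the table index from the bit pattern of positions i..i+k.
            idx = 0
            for j in range(k + 1):
                idx = (idx << 1) | bits[(i + j) % n]
            total += tables[i][idx]
        return total / n  # average contribution, as usual for NK models

    return fitness

f = make_nk_landscape(n=8, k=2)
print(0.0 <= f([0] * 8) <= 1.0)  # -> True
```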
Figure 5. Absolute CPU time for model building (left-hand side) and relative CPU times for model building, sampling, and fitness evaluation (right-hand side) for (a) onemax, (b) concatenated traps with k = 5, and (c) NK landscapes with k = 5.

For BOA, model building clearly dominated the total CPU time; with growing problem size, its share increased monotonically (95.2-99.5% for onemax, 97.4-99.8% for 5-traps, 85.6-97.2% for NK k = 5). The relative times for sampling and fitness evaluation were very low, and their shares of the total time decreased with growing problem size, despite the growing population sizes. In contrast, the RBM spent a significantly smaller fraction of the total time on model building. Moreover, with growing problem size, this fraction decreased or stayed relatively constant (85.7-88.3% for onemax, 92.2% down to 64.3% for 5-traps, 78.8% down to 49.2% for NK k = 5).
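One reason RBM model building scales well is that a contrastive-divergence update (Hinton, 2002) reduces to a few dense matrix products over the whole batch of selected solutions. A minimal CD-1 sketch for a binary RBM (hyperparameters, batching, and initialization are illustrative assumptions, not the paper's settings):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(W, b, c, V, lr=0.05):
    """One contrastive-divergence (CD-1) step for a binary RBM.

    V is a batch of visible vectors (rows = solutions of the population).
    The whole update consists of dense matrix products, which is why
    model-building cost grows smoothly with problem size.
    """
    # Positive phase: hidden activations given the data.
    ph = sigmoid(V @ W + c)
    h = (rng.random(ph.shape) < ph).astype(float)
    # Negative phase: one Gibbs step back to the visible layer and up again.
    pv = sigmoid(h @ W.T + b)
    ph2 = sigmoid(pv @ W + c)
    n = V.shape[0]
    W += lr * (V.T @ ph - pv.T @ ph2) / n
    b += lr * (V - pv).mean(axis=0)
    c += lr * (ph - ph2).mean(axis=0)
    return W, b, c

# Toy usage: 16 visible units, 8 hidden units, a batch of 32 random bit strings.
W = 0.01 * rng.standard_normal((16, 8))
b = np.zeros(16)
c = np.zeros(8)
V = (rng.random((32, 16)) < 0.5).astype(float)
W, b, c = cd1_update(W, b, c, V)
```

Sampling new candidate solutions then amounts to running a few further Gibbs steps from the hidden layer, again as matrix products.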
Table 1. Approximated scaling behavior of the number of fitness evaluations and of the total CPU time with problem size.

            Fitness evaluations       CPU time
            BOA        RBM            BOA        RBM
Onemax      O(n^·)     O(n^·)         O(n^·)     O(n^·)
4-Traps     O(n^·)     O(n^·)         O(n^·)     O(n^·)
5-Traps     O(n^·)     O(n^·)         O(n^·)     O(n^·)
6-Traps     O(n^·)     O(n^·)         O(n^·)     O(n^·)
NK, k = 2   O(n^·)     O(n^·)         O(n^·)     O(n^·)
NK, k = 3   O(n^·)     O(n^·)         O(n^·)     O(n^·)
NK, k = 4   O(n^·)     O(n^·)         O(n^·)     O(n^·)
NK, k = 5   O(n^·)     O(n^·)         O(n^·)     O(n^·)

Correspondingly, the CPU time fractions for sampling and fitness evaluation increased with growing problem size. Hence, the total CPU time of RBM-EDA was much less dominated by model building.

In summary, we found that the performance of RBM-EDA was competitive with the state of the art, especially for large and difficult problem instances. This may be surprising, as the number of fitness evaluations necessary to solve a problem was, for most problems, higher for RBM-EDA than for BOA. Moreover, the computational effort in terms of fitness evaluations grew faster with the problem size n. The higher number of fitness evaluations used by RBM-EDA indicates that the statistical model created during training in RBM-EDA was less accurate than the statistical model in BOA. From a computational perspective, however, building the accurate model in BOA was much more expensive than learning the less accurate model used in RBM-EDA. Furthermore, the time necessary for building the model increased more slowly with problem size than for BOA. Thus, the lower time effort for learning a less accurate statistical model in RBM-EDA overcompensated for the higher number of fitness evaluations necessary to find optimal solutions. As a result, the CPU times for RBM-EDA were not only lower than those of BOA, they also increased more slowly with growing problem size. This was true especially for difficult and large problems.

5. Summary and Conclusions
We carried out an in-depth experimental analysis of using a Restricted Boltzmann Machine within an Estimation of Distribution Algorithm for combinatorial optimization. We tested RBM-EDA on standard binary benchmark problems: onemax, concatenated deceptive traps, and NK landscapes of different sizes. We carried out a scalability analysis of the number of fitness evaluations and the computation times required to solve the problems to optimality, and compared our results to those obtained with the Bayesian Optimization Algorithm, a state-of-the-art method.

Our experimental results suggest that RBM-EDA is competitive with the state of the art. We observed that it was less efficient in terms of fitness evaluations, both in absolute numbers and in complexity. (Note that in this comparison, each algorithm used the population sizes from its own bisection run, i.e., the population size for BOA was usually smaller. If we had used the same population sizes, we would expect the time dominance of model building in BOA to be even greater.) However, the estimated complexity of probabilistic model building in RBM-EDA was lower than in BOA, and RBM-EDA was able to build its probabilistic models much faster than BOA when the problem was large. This led to smaller total runtimes for RBM-EDA than for BOA, especially for large and difficult problem instances (cf. the results for concatenated traps with l > 180 and k = 5, traps with l > 60 and k = 6, or NK landscapes with N = 32 and k = 5). In sum, RBM-EDA can be an alternative if problems are large and difficult and the computational effort for fitness evaluations is low.
This highlights the potential of using generative neural networks for combinatorial optimization.

Another advantage of using neural networks in EDAs is that RBMs can be parallelized without many of the problems that occur when parallelizing other EDAs. Parallelizing an RBM-EDA on a Graphics Processing Unit leads to massive speedups, by a factor of 200 and more compared to optimized CPU code (Probst et al., 2014).

There are multiple directions for further research. We demonstrated that RBMs are useful in EDAs, but fine-tuning the parameters could improve the RBM's model quality, possibly leading to further performance improvements. It might also be beneficial to stack multiple RBMs into deep architectures and to use such systems for solving hierarchical problems.

References
Abdollahzadeh, A., Reynolds, A., Christie, M., Corne, D. W., Davies, B. J., Williams, G. J. J., et al., 2012. Bayesian optimization algorithm applied to uncertainty quantification. SPE Journal 17 (03), 865-873.

Ackley, D. H., 1987. A connectionist machine for genetic hill climbing. Kluwer Academic, Boston.

Baluja, S., 1994. Population-based incremental learning: A method for integrating genetic search based function optimization and competitive learning. Tech. Rep. CMU-CS-94-163, Carnegie Mellon University, Pittsburgh, PA.

Bishop, C. M., 2006. Pattern recognition and machine learning. Information Science and Statistics. Springer.

Dahl, G. E., Ranzato, M. A., Mohamed, A., Hinton, G. E., 2010. Phone recognition with the mean-covariance restricted Boltzmann machine. Advances in Neural Information Processing Systems 23.

Dahl, G. E., Yu, D., Deng, L., Acero, A., 2012. Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. IEEE Transactions on Audio, Speech, and Language Processing 20 (1), 30-42.

Deb, K., Goldberg, D. E., 1991. Analyzing deception in trap functions. University of Illinois, Department of General Engineering.

Geman, S., Geman, D., 1984. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence PAMI-6 (6), 721-741.

Goldberg, D. E., 1989. Genetic algorithms in search, optimization, and machine learning. Addison-Wesley Professional.

Harik, G., 1999. Linkage learning via probabilistic modeling in the ECGA. IlliGAL Report No. 99010, University of Illinois at Urbana-Champaign, Urbana, IL.

Hinton, G. E., 2002. Training products of experts by minimizing contrastive divergence. Neural Computation 14, 1771-1800.
Hinton, G. E., 2010. A practical guide to training restricted Boltzmann machines. Tech. Rep. UTML TR 2010-003, Department of Computer Science, University of Toronto.

Hinton, G. E., Osindero, S., Teh, Y.-W., 2006. A fast learning algorithm for deep belief nets. Neural Computation 18, 1527-1554.

Holland, J. H., 1975. Adaptation in natural and artificial systems. University of Michigan Press, Ann Arbor.

Huajin, T., Shim, V. A., Tan, K. C., Chia, J. Y., 2010. Restricted Boltzmann machine based algorithm for multi-objective optimization. In: IEEE Congress on Evolutionary Computation (CEC). pp. 1-8.

Kauffman, S. A., Weinberger, E. D., 1989. The NK model of rugged fitness landscapes and its application to maturation of the immune response. Journal of Theoretical Biology 141 (2), 211-245.

Larrañaga, P., Etxeberria, R., Lozano, J. A., Peña, J. M., 1999. Optimization by learning and simulation of Bayesian and Gaussian networks. Tech. Rep. EHU-KZAA-IK-4/99, Intelligent Systems Group, Dept. of Computer Science and Artificial Intelligence, University of the Basque Country.

Larrañaga, P., Karshenas, H., Bielza, C., Santana, R., 2012. A review on probabilistic graphical models in evolutionary computation. Journal of Heuristics 18 (5), 795-819.

Larrañaga, P., Lozano, J. A., 2002. Estimation of Distribution Algorithms: A New Tool for Evolutionary Computation. Genetic Algorithms and Evolutionary Computation, 2. Kluwer Academic Pub.

Lozano, J. A., Larrañaga, P., 2006. Towards a New Evolutionary Computation: Advances on Estimation of Distribution Algorithms. Studies in Fuzziness and Soft Computing. Springer London, Limited.

Miller, B. L., Goldberg, D. E., 1995. Genetic algorithms, tournament selection, and the effects of noise. Complex Systems 9, 193-212.

Mühlenbein, H., Mahnig, T., 1999. FDA - a scalable evolutionary algorithm for the optimization of additively decomposed functions. Evolutionary Computation 7 (4), 353-376.

Mühlenbein, H., Paaß, G., 1996.
From recombination of genes to the estimation of distributions I. Binary parameters. In: Voigt, H.-M., Ebeling, W., Rechenberg, I., Schwefel, H.-P. (Eds.), Parallel Problem Solving from Nature - PPSN IV. Vol. 1141 of Lecture Notes in Computer Science. Springer Berlin Heidelberg, pp. 178-187.

Pelikan, M., 2005. Bayesian Optimization Algorithm. In: Hierarchical Bayesian Optimization Algorithm. Vol. 170 of Studies in Fuzziness and Soft Computing. Springer Berlin / Heidelberg, pp. 31-48.

Pelikan, M., January 2008. Analysis of estimation of distribution algorithms and genetic algorithms on NK landscapes. Tech. Rep. 2008001, Missouri Estimation of Distribution Algorithms Laboratory (MEDAL); instances can be found at http://medal-lab.org/files/nk-instances.tar.gz.

Pelikan, M., Goldberg, D. E., 2003. Hierarchical BOA solves Ising spin glasses and MAXSAT. In: Genetic and Evolutionary Computation - GECCO 2003. Springer, pp. 1271-1282.

Pelikan, M., Goldberg, D. E., Cantu-Paz, E., 1999a. BOA: The Bayesian Optimization Algorithm. In: Genetic and Evolutionary Computation Conference. pp.

http://doi.acm.org/10.1145/2576768.2598273
Qian, N., 1999. On the momentum term in gradient descent learning algorithms. Neural Networks 12 (1), 145-151.

Salakhutdinov, R., Mnih, A., Hinton, G. E., 2007. Restricted Boltzmann machines for collaborative filtering. In: Machine Learning, Proceedings of the Twenty-fourth International Conference (ICML 2007). ACM, pp. 791-798.

Schwarz, G., 1978. Estimating the dimension of a model. The Annals of Statistics 6 (2), 461-464.

Shim, V. A., Tan, K. C., 2012. Probabilistic graphical approaches for learning, modeling, and sampling in evolutionary multi-objective optimization. In: Liu, J., Alippi, C., Bouchon-Meunier, B., Greenwood, G., Abbass, H. (Eds.), Advances in Computational Intelligence. Vol. 7311 of Lecture Notes in Computer Science. Springer Berlin Heidelberg, pp. 122-144.

Shim, V. A., Tan, K. C., Chia, J. Y., 2010. Probabilistic based evolutionary optimizers in bi-objective travelling salesman problem. In: Deb, K., Bhattacharya, A., Chakraborti, N., Chakroborty, P., Das, S., Dutta, J., Gupta, S., Jain, A., Aggarwal, V., Branke, J., Louis, S., Tan, K. (Eds.), Simulated Evolution and Learning. Vol. 6457 of Lecture Notes in Computer Science. Springer Berlin Heidelberg, pp. 588-592.

Shim, V. A., Tan, K. C., Chia, J. Y., Al Mamun, A., Mar. 2013. Multi-objective optimization with estimation of distribution algorithm in a noisy environment. Evolutionary Computation 21 (1), 149-177.

Smolensky, P., 1986. Information processing in dynamical systems: Foundations of harmony theory. In: Parallel Distributed Processing: Explorations in the Microstructure of Cognition. Vol. 1. MIT Press, Cambridge, MA, USA, pp. 194-281.

Zhang, B.-T., Shin, S.-Y., 2000. Bayesian evolutionary optimization using Helmholtz machines. In: Parallel Problem Solving from Nature PPSN VI, pp. 827-836.
Johannes Gutenberg-Universität Mainz, Dept. of Information Systems and Business Administration, Jakob-Welder-Weg 9, 55128 Mainz, Germany

E-mail address: {probst|rothlauf|grahl}@uni-mainz.de