Scalable Parallel Numerical CSP Solver
Daisuke Ishii, Kazuki Yoshizoe, and Toyotaro Suzumura

Tokyo Institute of Technology, Tokyo, Japan; Japan Science and Technology Agency, Japan; IBM Research, Dublin, Ireland.
{dsksh,yoshizoe,suzumura}@acm.org

Abstract.
We present a parallel solver for numerical constraint satisfaction problems (NCSPs) that can scale on a number of cores. Our proposed method runs worker solvers on the available cores, and the workers simultaneously cooperate to distribute and balance the search space. In the experiments, we attained up to 119-fold speedup using 256 cores of a parallel computer.
Introduction.

Numerical constraint satisfaction problems (NCSPs, Section 2) have been successfully applied to problems described in the domain of reals [3,6]. Given an NCSP with a search space represented as a box (i.e., an interval vector), the branch and prune algorithm efficiently computes a paving, a set of boxes that encloses the solution set, yet its exponential computational complexity limits the tractable instances. Although the solving process exhibits parallelism, no parallel NCSP solver has been made available to date because of the difficulty in partitioning the search space equally.

In this research, we parallelize an NCSP solver to scale its solving process on both shared memory and distributed memory parallel computers (Section 3). Our parallel method consists of parallel worker solvers that each solve a portion of the search space on a CPU core and interact with neighbor workers via message passing for dynamic load balancing. We also propose a preprocess that accelerates the initial search-space distribution by sending sets of boxes via static routing between the workers. We have implemented the method by extending the Realpaver solver using the X10 language to realize a process-level parallelization over a number of cores. Section 4 reports experimental results when our method was deployed on two hardware environments.
Related work.
There have been several works regarding parallel solving of CSPs with either discrete or continuous domains. Parallel solving of generic CSPs on massive computer clusters and supercomputers has been explored in [12,2,18]. This work focuses on a massively parallel solver for NCSPs, which have different characteristics compared to generic CSPs. In the survey [7], existing work is classified into (i) search-space splitting methods [16,12,2,5,15,18], (ii) cooperative methods for heterogeneous solver agents (cf. portfolios) [2], and (iii) parallelization of constraint propagation. Our method belongs to the first category. A few works have used approaches (ii) [11] and (iii) [8] for the parallelization of NCSP solving. However, to the best of our knowledge, a massive parallelization method that uses the typical approach (i) has not yet been proposed. Substantial work exists regarding the parallelization of the branch and bound algorithm with search-space splitting [9,13,14]. This approach has also been applied to CSP solvers [16,5] and SAT solvers [15]. This work explores an efficient parallel method for solving NCSPs with an approach similar to [9,13].

Numerical Constraint Satisfaction Problems.

A numerical constraint satisfaction problem (NCSP) is defined as a triple (v, v0, c) that consists of a vector of variables v = (v1, . . . , vn), an initial domain in the form of a box v0 ∈ IR^n (IR denotes the set of closed real intervals), and a constraint c(v) ≡ f(v) = 0 ∧ g(v) ≥ 0, where f: R^n → R^e and g: R^n → R^i, i.e., a conjunction of e equations and i inequalities. A solution of an NCSP is an assignment of its variables ṽ ∈ v0 that satisfies its constraints. The solution set Σ of an NCSP is the region within its initial domain that satisfies its constraints, i.e., Σ(v0) := {ṽ ∈ v0 | c(ṽ)}. The target of this paper is under-constrained NCSPs such that n > e. In general, an under-constrained problem may have a continuous set of infinitely many solutions.
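As a concrete illustration, consider a hypothetical instance with n = 2, e = 1, and i = 0, where f(x, y) = x^2 + y^2 − 1; since n > e, the solution set is a continuum (the unit circle). The following C++ sketch shows how a box and the basic interval test behind pruning can be represented. It uses naive interval arithmetic without outward rounding, unlike a real interval library such as Gaol, and all names are illustrative:

```cpp
#include <algorithm>
#include <array>

// Hypothetical minimal interval/box types (not the Gaol or Realpaver API).
// Rounding is ignored here; a real solver rounds interval bounds outward.
struct Interval {
    double lo, hi;
    double width() const { return hi - lo; }
};

using Box = std::array<Interval, 2>;  // a box for n = 2 variables

// Interval square: the exact range of v^2 over the interval v.
Interval square(Interval v) {
    double a = v.lo * v.lo, b = v.hi * v.hi;
    double lo = (v.lo <= 0.0 && 0.0 <= v.hi) ? 0.0 : std::min(a, b);
    return Interval{lo, std::max(a, b)};
}

// Range enclosure of f(x, y) = x^2 + y^2 - 1 over a box.
Interval fRange(const Box& b) {
    Interval x2 = square(b[0]), y2 = square(b[1]);
    return Interval{x2.lo + y2.lo - 1.0, x2.hi + y2.hi - 1.0};
}

// A box is certainly inconsistent when 0 lies outside the enclosure of f:
// such a box contains no solution and can be discarded.
bool certainlyInconsistent(const Box& b) {
    Interval r = fRange(b);
    return r.hi < 0.0 || r.lo > 0.0;
}
```

Because interval evaluation encloses the true range of f, a box that contains a solution can never be rejected by this test; only solution-free boxes are discarded.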
Branch and Prune Algorithm.
The branch and prune algorithm [17] is the standard solving method for NCSPs. It takes an NCSP and a precision ǫ as input and outputs a set of boxes (or paving) S that approximates the solution set with precision ǫ. Examples of S are illustrated in Figure 1.

An intermediate state of the algorithm is represented as a pair of sets of boxes (L, S). The solver receives an initial state ({v0}, ∅) and iteratively applies the step computation (illustrated in Figure 2) until it reaches a final state (∅, S). In the step computation, first, the solver takes the first element of the queue L of boxes and applies the Prune procedure, a filtering procedure that shaves boundary portions of the considered box. In this work, we use an implementation proposed in [6], which provides a verification process based on an interval Newton method combined with a relatively simple filtering process based on Hull consistency [1]. As a result, a box becomes either empty, precise enough (its width is smaller than ǫ), verified as an inner box of the solution set Σ, or undecided. Precise and inner boxes are appended to S, and undecided boxes are passed to Branch. Second, the Branch procedure bisects the box at the midpoint along a component corresponding to one of the variables, and the sub-boxes are put back in the queue. In this work, we assume Branch selects variables in an order that makes the search behave in a breadth-first manner, so that the solving process gradually improves the precision of the overall paving (Figure 1).

The computation of Prune is expensive and is the bottleneck of the solving process. Under certain conditions, an application of Prune contracts a large portion of the search space into a tight box (cf. the quadratic convergence of interval Newton methods). Prune can also filter out the whole box if the considered box is verified as an inner region or is totally inconsistent. These characteristics of Prune result in the unbalanced nature of the search trees, so a straightforward parallel method does not work efficiently. Efficient NCSP solving requires executing Prune at each step of traversing the search tree, which makes it more difficult to distribute a search path among processors. These properties will be discussed in Section 4.1.

Fig. 1. Overlay of two solution box sets (pavings) computed with two precisions ǫ.
Fig. 2. A step computation of the branch and prune algorithm.
Fig. 3. Distribution of the box queue L in the preprocess.
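The step computation described above can be sketched as follows, again on the hypothetical unit-circle problem. Here Prune is reduced to an interval-evaluation rejection test; the solver described in the text additionally contracts boxes with Hull consistency and verifies inner boxes with an interval Newton method, both of which this sketch omits. All names are illustrative:

```cpp
#include <algorithm>
#include <deque>
#include <vector>

// Illustrative types; the solver in the text uses Realpaver/Gaol instead.
struct Iv { double lo, hi; double width() const { return hi - lo; } };
struct Bx { Iv x, y; };

static Iv sq(Iv v) {
    double a = v.lo * v.lo, b = v.hi * v.hi;
    double lo = (v.lo <= 0.0 && 0.0 <= v.hi) ? 0.0 : std::min(a, b);
    return Iv{lo, std::max(a, b)};
}

// Branch and prune for f(x, y) = x^2 + y^2 - 1 = 0 on [-2,2]^2.
// Prune is only a rejection test here (no contraction, no inner-box
// verification), so S contains just the "precise enough" boxes.
std::vector<Bx> branchAndPrune(double eps) {
    std::deque<Bx> L = { Bx{Iv{-2, 2}, Iv{-2, 2}} };  // queue of boxes
    std::vector<Bx> S;                                // paving under construction
    while (!L.empty()) {
        Bx b = L.front();
        L.pop_front();
        Iv f{sq(b.x).lo + sq(b.y).lo - 1.0, sq(b.x).hi + sq(b.y).hi - 1.0};
        if (f.hi < 0.0 || f.lo > 0.0) continue;          // Prune: empty box
        if (b.x.width() <= eps && b.y.width() <= eps) {  // precise enough
            S.push_back(b);
            continue;
        }
        // Branch: bisect the widest component at its midpoint; pushing to the
        // back of the queue yields the breadth-first behavior assumed above.
        if (b.x.width() >= b.y.width()) {
            double m = 0.5 * (b.x.lo + b.x.hi);
            L.push_back(Bx{Iv{b.x.lo, m}, b.y});
            L.push_back(Bx{Iv{m, b.x.hi}, b.y});
        } else {
            double m = 0.5 * (b.y.lo + b.y.hi);
            L.push_back(Bx{b.x, Iv{b.y.lo, m}});
            L.push_back(Bx{b.x, Iv{m, b.y.hi}});
        }
    }
    return S;
}
```

Even in this toy version the unbalanced nature of the search is visible: boxes far from the circle are rejected immediately, while boxes straddling it are bisected all the way down to width ǫ.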
Proposed Method.

We propose a parallel method that runs several workers on the available CPU cores p_0, . . . , p_{p-1} (we assume a single worker runs on each core and identify a worker with its core p_i). The workers homogeneously interleave the following three procedures and cooperate in a decentralized fashion: (i) breadth-first branch and prune search, (ii) distribution and workload balancing of the search space in a sender-initiated manner, and (iii) termination detection. Sub-trees in the search space of the branch and prune become unbalanced but can be searched independently: there is no confluence of multiple branches, and no information is shared between branches. The distribution of the search space among workers is done in a preprocess and a postprocess, as described in the following subsections. The preprocess distributes portions of the search space to the other workers, and the postprocess balances the load of each worker during search. Termination is detected by circulating a termination token via idling workers, based on Dijkstra's method (see Section 11.4.4 of [9]).

Preprocess: Search-Space Splitting and Distribution.
A solving process is started by a worker p_0 that possesses the initial domain (i.e., a box that contains the whole search space) in its queue L. To distribute the subsequent search space (i.e., a queue of boxes) equally to each core, the preprocess repeatedly partitions the queue into two (or more) parts and sends a portion to another worker. Figure 3 illustrates, in a downward direction, some initial transitions of the box queue L distributed among three workers. In our implementation, the distribution routing forms a binary tree whose height is ⌈log p⌉. In each node of the tree, the branch and prune process runs until the number of boxes in the queue reaches nb. Next, the queue is sorted by the volume of the boxes, and half of the content (i.e., nb/2 boxes) is sent to another worker.

Postprocess: Dynamic Load Balancing.
During search, each worker normalizes the loads within a predefined neighborhood, which consists of a small number of neighbor workers. Because there is a sufficiently large number of boxes, we simply regard the number of boxes in the queue L as the amount of load. Assume p workers are running and each worker p_i possesses l_i boxes in its queue. We also assume, for each worker p_i, that N_i is a set of |N| neighbor workers, N⁻_i is the set of workers such that p_j ∈ N⁻_i ⇔ p_i ∈ N_j, L_i is the list of loads of the neighbor workers, and ∆ is a predefined load margin. The load balancing procedure of a worker p_i performs the following steps once every ns branch and prune steps:

1. For each worker p_j in N⁻_i, inform it of the load l_i, which p_j puts in its list L_j.
2. Calculate the mean µ of the loads in L_i.
3. If µ < ∆, for each worker p_j in N_i, send at most µ − l_j boxes to p_j (to be efficient, a certain number of boxes should be kept locally).

Neighbor workers can be identified, e.g., as adjacent nodes in the |N|-dimensional mesh of workers. The routing between neighbor workers is fixed during a solving process, and thus a worker may temporarily hold an excessive load compared to the others. However, such a load imbalance is resolved by subsequent load balancing rounds.

Experimental Results.

We have implemented the proposed method and measured the speedup of the solving process on under-constrained NCSPs. The experiments were performed with an exhaustive set of parameter combinations to explore the optimal settings.
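The load-balancing round in steps 1 to 3 above can be modeled in a few lines. The sketch below simulates only the decision a worker p_i makes for its neighborhood; message passing and the actual box transfer are omitted, and the function name, signature, and local-reserve policy are assumptions of this sketch:

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// One load-balancing decision of worker p_i (illustrative, not the X10 code).
// l_i: own load (box count); neighborLoads: loads l_j of the workers in N_i;
// delta: the load margin. Returns how many boxes to send to each neighbor.
std::vector<int> balanceRound(int l_i, const std::vector<int>& neighborLoads,
                              int delta) {
    std::vector<int> sends(neighborLoads.size(), 0);
    // Step 2: mean of the neighborhood loads collected in L_i.
    double mu = std::accumulate(neighborLoads.begin(), neighborLoads.end(), 0.0)
              / neighborLoads.size();
    if (mu >= delta) return sends;  // neighborhood is busy enough: send nothing
    // Step 3: send at most mu - l_j boxes to each underloaded neighbor,
    // keeping a reserve locally so this worker does not starve itself.
    int reserve = delta;
    for (std::size_t j = 0; j < neighborLoads.size(); ++j) {
        int want = std::max(0, static_cast<int>(mu) - neighborLoads[j]);
        int give = std::min(want, std::max(0, l_i - reserve));
        sends[j] = give;
        l_i -= give;
    }
    return sends;
}
```

The sender-initiated character of the method shows up in the µ < ∆ guard: a loaded worker pushes boxes out only when its neighborhood is starving, rather than idle workers pulling work in.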
Implementation.
We have implemented the proposed method with C++ (gcc ver. 4.4.7 and 4.3.4) and X10 (ver. 2.3.1) [4], a high-productivity language for parallel computing. In the following, we use the term place, a notion of X10 that in our setting represents a CPU core. The libraries Realpaver (ver. 1.1) [10], for sequential NCSP solving, and Gaol (ver. 4.0.1, http://sourceforge.net/projects/gaol/), for basic interval computation, were used to facilitate the implementation. The Prune procedure was realized by calling the sequential implementation in Realpaver. Each run of Prune took around 0.2–1 ms, and overall it was the bottleneck of the branch and prune algorithm (it occupied more than 95% of the running time in sequential solving). The procedures for search-space distribution and load balancing were implemented in X10. The communication of boxes and loads between places was implemented with async tasks and performed in parallel to the search process so that the overhead is hidden. In the experiments, the timings t for sequential runs on a single core were measured using the C++ implementation described in [6], which worked identically to, and faster than, our X10 version.

Experiment Environments.
Two sets of experiments were run using (1) a shared-memory machine equipped with 40 cores (four 10-core Intel Xeon E7-4860, 2.26 GHz) and 256 GB of local memory, and (2) up to 256 cores of an SGI UV1000, a pseudo-shared-memory machine equipped with 2,048 cores (8-core Xeon E7-8837, 2.67 GHz) and 16 TB of memory. UV1000 works as a single shared-memory machine by emulating memory accesses with communication over a high-speed NUMAlink5 network, which has a bandwidth of 120 Gbps. We used the MPI backend of X10 with the options X10_NTHREADS = 6 and GC_NPROCS = 2.
Experiments on a Shared Memory Machine.
We solved the problems shown in [6,3] using 40 cores of machine (1). We report the results for two representative problems. The parameters of the load balancing method were set to each combination of the following values: nb = 32, |N| ∈ {2, 4}, ns ∈ {10, . . . , 1000}, and ∆ = 10. We also computed with and without the preprocess (when the preprocess is not used, the postprocess is executed from the beginning). For each problem, we solved two instances with two precisions. The specification of each instance and the computational results are presented in Table 1. In the table, the columns "problem", "size", "ǫ", "pp", and "|N|" give the name of the problem, its size (i.e., the number of projection/parameter variables), the precision, the usage of the preprocess, and the number of neighbor workers, respectively. The remaining columns give the results: t and the adjacent branch count are the running time and the number of branches on a single core, and t_{ns,i} and the corresponding branch count are the running time and the largest number of branches performed by a worker when computing with the interval ns and i X10 places (best timings are underlined). Figure 4 illustrates the speedup of the solving process.
Experiments on a Cluster with High-speed Interconnection.
We solved the problem “…” using up to 256 cores of machine (2), UV1000. The parameters of the load balancing method were set to each combination of the following values: nb = 8, |N| ∈ {2, 4}, ns = 1000, and ∆ = 10. The results are presented in Table 2. Each column of the table represents the same information as in Table 1, except that the last column gives the number of loads sent by the load balancer in the solving process with ns = 1000 and 256 X10 places. Figure 5 illustrates the speedup of the solving process.

In the experiments, our method scaled up to 256 cores with the optimal configurations. We achieved speedups of up to 32.3 fold using 40 cores of the shared memory machine and of up to 119 fold using 256 cores of the cluster machine. The best speedup of 119 fold was obtained with the preprocess. The preprocess facilitates and accelerates the workload distribution in the early stage
Experimental result on the shared memory machineproblem size ǫ pp | N | t t , t , t , ,
4D sphere 2+4 0.02 yes 2 254 9.52 7.94 10.1 238 319 37.0and plane 4 11.0 8.73 9.45 37.6( sp2-4 ) no 2 9.34 8.13 16.5 35.44 10.8 8.67 10.5 37.30.01 yes 2 721 38.6 22.8 23.9 669 601 38.04 38.1 25.1 24.0 38.7no 2 35.7 22.7 30.0 37.64 36.8 24.7 24.7 38.73-RPR robot 3+3 0.2 yes 2 1 100 199 73.9 34.1 1 936 939 33.7( ) 4 296 68.5 36.4 38.0no 2 185 64.0 34.0 33.54 244 58.3 36.1 38.00.1 yes 2 4 080 1 010 714 282 7 186 845 30.24 2 820 1 070 257 36.0no 2 971 678 244 28.54 2 630 901 231 36.0
Table 2. Experimental result on the UV1000 cluster

problem | size | ǫ | pp | |N| | t | t_{1000,i} | br_{1000,256} | loads sent
| | | yes | 4 | | 54.6 / 33.6 | 157 | 85 904
| | | no | 2 | | 39.8 / 20.4 | 43.8 | 10 856
| | | no | 4 | | 53.4 / 32.0 | 123 | 66 212
| | 0.1 | yes | 2 | 3 040 | 341 / 25.6 | 192 | 39 608
| | | yes | 4 | | 371 / 87.8 | 176 | 128 084
| | | no | 2 | | 325 / 54.7 | 105 | 34 892
| | | no | 4 | | 339 / 52.1 | 184 | 129 512

of the search process. In some of the experiments without the preprocess, the speedup ratio became saturated when using many cores (e.g., sp2-4 with ns = 1000, and the experiments on the cluster). This was because the load balancing process was too infrequent for the given number of workers, and the workload diffusion became too slow. Comparing the right-hand graphs for the instance sp2-4 with ns = 1000 in Figure 4, we can see that the point of saturation shifts according to the search-space size. On the other hand, in some other experiments, the results became worse with the preprocess (e.g., the results with ns = 10 on machine (1)). It occasionally happens that the preprocess mostly solves the problem; however, the preprocess can also result in highly unbalanced search trees because of the Prune process, and in such cases the postprocess does not have enough time for load balancing. Regarding the neighborhood sizes |N| = 2,
4, there was a trade-off between the workload balance and the amount of communication required. For the shared memory machine, it was unclear which size had the advantage. However, for UV1000, the solver was notably slower for |N| = 4 than for |N| = 2. This is understandable because the larger number of neighbors significantly increased the number of communications (see the last column of Table 2), and communications between places were much more costly than on ordinary shared memory machines, despite the high-speed network of UV1000.

Three intervals ns = 10, . . . , 1000 were compared. A large interval worked well in some settings (e.g., ǫ = 0.2 with ns = 1000 on machine (1)). Conversely, small intervals required a greater amount of communication, and we therefore used ns = 1000 to draw better performance on the cluster, where communications were more costly.

There was a large overhead caused by the workers sending a large number of boxes for load balancing when the number of workers was not sufficient for the problem size. The speedups for the 3-RPR robot problem with ǫ = 0.1, using 40 workers or fewer, show an example of such overheads (Figure 4(d)). Resolving this overhead by suppressing redundant box sends is a part of our future work.

Fig. 4. Speedups on the shared memory machine with 40 cores. Left- and right-hand side graphs correspond to computations with and without the preprocess, respectively; panels (a) and (b) show sp2-4, and panels (c) and (d) the 3-RPR robot, each at two precisions ǫ.
Fig. 5. Speedups using 256 cores of UV1000.
Conclusions.

In this paper, we proposed a parallel branch and prune algorithm based on a non-portfolio, search-space splitting approach. In the experiments, using 256 X10 places (i.e., cores), we achieved speedup factors of as much as 119. We expect that our parallel solver will be applied to large practical problems, e.g., the robotics problems in [3].
Acknowledgments.
This work was partially funded by JSPS (KAKENHI 25880008 and 25700038). The computing resources for the experiments in this paper were provided by Prof. Kazunori Ueda (Waseda University, Tokyo) and by a Compute Canada RAC award, for environments (1) and (2), respectively.

References
1. Benhamou, F., Goualard, F., Granvilliers, L., Puget, J.F.: Revising Hull and Box Consistency. In: Proc. of ICLP. pp. 230–244 (1999)
2. Bordeaux, L., Hamadi, Y., Samulowitz, H.: Experiments with Massively Parallel Constraint Solving. In: Proc. of IJCAI. pp. 443–448 (2006)
3. Caro, S., Chablat, D., Goldsztejn, A., Ishii, D., Jermann, C.: A branch and prune algorithm for the computation of generalized aspects of parallel robots. Artificial Intelligence 211, 34–50 (2014)
4. Charles, P., Grothoff, C., Saraswat, V.: X10: an object-oriented approach to non-uniform cluster computing. In: Proc. of OOPSLA. pp. 519–538 (2005)
5. Chu, G., Schulte, C., Stuckey, P.: Confidence-based work stealing in parallel constraint programming. In: Proc. of CP. pp. 226–241. LNCS