Learning Branching Heuristics for Propositional Model Counting
Pashootan Vaezipoor, Gil Lederman, Yuhuai Wu, Chris J. Maddison, Roger Grosse, Edward Lee, Sanjit A. Seshia, Fahiem Bacchus
University of Toronto, UC Berkeley, Vector Institute
Abstract
Propositional model counting, or #SAT, is the problem of computing the number of satisfying assignments of a Boolean formula. We present Neuro#, an approach for learning branching heuristics for exact #SAT solvers, which reduces both the number of branching decisions and, on several domains, the wall-clock run time of a state-of-the-art solver.
1 Introduction
Propositional model counting is the problem of counting the number of satisfying solutions to a Boolean formula. When the Boolean formula is expressed in conjunctive normal form (CNF), this problem is known as #SAT.
Figure 1: Cactus plots comparing Neuro# to SharpSAT on the grid_wrld(10, ·) benchmark. For any point t on the y axis, the plot shows the number of benchmark problems that are individually solvable by the solver within t steps (left) and seconds (right).
* Equal contribution (correspondence to <[email protected]>). Preprint. Under review.
Our approach builds on the modern exact #SAT solver SharpSAT. We cast the problem as a Markov Decision Process (MDP) in which the agent has to select the best literal for
SharpSAT to branch on next. We use a Graph Neural Network (GNN) [33] to represent the part of the input formula the solver is currently working on. The model is trained end-to-end using an Evolution Strategies algorithm, with the objective of minimizing the mean number of branching decisions required to solve instances from a given distribution of problems. We call this augmented solver Neuro#.
We found that Neuro# can generalize to unseen problem instances from the same distribution as well as to instances that were much larger than those trained on. Furthermore, despite the runtime overhead of querying the model, which Neuro# has to overcome, on some problem domains our approach achieved orders-of-magnitude improvements in the solver's wall-clock run time (Figure 1). This is quite remarkable in the context of prior related work [45, 35, 5, 14, 19, 17, 22], where using ML to improve combinatorial solvers had at best yielded modest wall-clock time improvements (less than a factor of two), and it positions this line of research as a viable path to improving the run time performance of exact model counters.
The rest of the paper is organized as follows: In Section 2 we provide some needed background and fix the terminology. We describe the learning approach in Section 3 and compare it to related work in Section 4. Section 5 details the dataset generation process that we later use in our experiments in Section 6. We conclude with a short discussion in Section 7.
2 Background
2.1 Propositional Model Counting
A propositional Boolean formula consists of a set of propositional (true/false) variables composed by applying the standard operators "and" (∧), "or" (∨) and "not" (¬). A literal is any variable v or its negation ¬v. A clause is a disjunction of literals l_1 ∨ ... ∨ l_n. A clause is said to be a unit clause if it contains only one literal. Finally, a Boolean formula is in Conjunctive Normal Form (CNF) if it is a conjunction of clauses. We denote the set of literals and clauses of a CNF formula φ by L(φ) and C(φ), respectively. We will assume that all formulas are in CNF.
A truth assignment π for a formula φ is a mapping of its variables to {0, 1} (false/true). Thus there are 2^n different truth assignments when φ has n variables. A truth assignment π satisfies a literal ℓ when ℓ is the variable v and π(v) = 1, or when ℓ = ¬v and π(v) = 0. It satisfies a clause when at least one of the clause's literals is satisfied. A CNF formula φ is satisfied when all of its clauses are satisfied under π, in which case we call π a satisfying assignment for φ.
The model counting problem for φ is to compute the number of satisfying assignments. If ℓ is a unit clause of φ then all of φ's satisfying assignments must make ℓ true. If another clause c' = ¬ℓ ∨ ℓ' is in φ, then every satisfying assignment must also make ℓ' true, since ¬ℓ ∈ c' must be false. This process of finding all literals whose truth value is forced by unit clauses is called Unit Propagation (UP), and it is used in all SAT solvers and model counters. When a literal ℓ is set to true, the formula φ can be reduced by finding all forced literals using UP (this set will include ℓ), removing all clauses containing a true literal, and finally removing all false literals from the remaining clauses. The resulting formula is denoted by UP(φ, ℓ).
Two sets of clauses are called disjoint if they share no variables.
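The UP(φ, ℓ) reduction just described can be sketched as follows, using a toy list-of-frozensets clause representation (the encoding and names are ours, not SharpSAT's):

```python
def unit_propagate(clauses, lit):
    """Reduce a CNF formula under the decision `lit` = true, as in UP(phi, l).
    `clauses` is a list of frozensets of int literals, where -v denotes the
    negation of variable v. Returns (reduced_clauses, forced_literals), or
    None if UP derives an empty clause (the branch has zero models)."""
    forced, changed = {lit}, True
    while changed:
        changed, reduced = False, []
        for c in clauses:
            if c & forced:                          # clause satisfied: drop it
                continue
            c = c - frozenset(-l for l in forced)   # delete false literals
            if not c:                               # empty clause: contradiction
                return None
            if len(c) == 1:                         # new unit forces its literal
                (u,) = c
                if u not in forced:
                    forced.add(u)
                    changed = True
            reduced.append(c)
        clauses = reduced
    return clauses, forced
```

For instance, setting the first variable true in (x1 ∨ ¬x2) ∧ (x2) ∧ (¬x1 ∨ x3) forces all three literals and leaves no clauses behind.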
A component C ⊂ C(φ) is a subset of φ's clauses that is disjoint from its complement C(φ) − C. A formula φ can be efficiently broken up into a maximal number of disjoint components C_1, ..., C_k. Although most formulas initially consist of only one component, as variables are set by branching decisions and clauses are removed, the reduced formulas will often break up into multiple components. Components are important for improving the efficiency of model counting, since COUNT(φ) = ∏_{i=1}^{k} COUNT(C_i), so each component can be counted independently. In contrast, solving the formula as a monolith takes Θ(2^n) time, where n is the number of variables in the input formula, and so is not efficient for large n.
A formula φ can be represented by a literal-clause incidence graph (LIG). This graph contains a node for every clause and a node for every literal of φ (i.e., v and ¬v for every variable v). An edge connects a clause node n_c and a literal node n_ℓ if and only if ℓ ∈ c. Figure 2 shows an example. Note that every component of φ forms a disconnected sub-graph of the LIG.
Figure 2: The Literal-Clause Incidence Graph of a small example formula.
Algorithm 1
Component Caching DPLL
 1: function COUNT(φ)
 2:   if inCache(φ) then
 3:     return cacheLookUp(φ)
 4:   pick a literal ℓ ∈ L(φ)
 5:   cnt_pos = CountSide(φ, ℓ)
 6:   cnt_neg = CountSide(φ, ¬ℓ)
 7:   addToCache(φ, cnt_pos + cnt_neg)
 8:   return cnt_pos + cnt_neg
 9: function CountSide(φ, ℓ)
10:   φ_ℓ = UP(φ, ℓ)
11:   if φ_ℓ contains an empty clause then return 0
12:   if φ_ℓ contains no clauses then k = number of unassigned variables; return 2^k
13:   K = findComponents(φ_ℓ)
14:   return ∏_{κ ∈ K} COUNT(κ)

Both exact [37, 30, 26] and approximate [9, 25] model counters have been developed. In this paper, we focus on the former, using the state-of-the-art exact model counter SharpSAT [37]. SharpSAT and other modern exact model counters rely on techniques such as clause learning and component caching [3, 2]. A simplified version of the algorithm, with the clause learning parts omitted, is given in Algorithm 1. A more detailed version, along with a more elaborate analysis, is provided in Appendix A.
The algorithm works on one component at a time. If that component's model count has already been cached, it returns the cached value. Otherwise it selects a literal to branch on (line 4) and computes the model count under each value of this literal by calling CountSide(). The sum of these two counts is the model count of the passed component φ, and so is stored in the cache (line 7). The CountSide function first unit propagates the input literal. If an empty clause is found, then the current formula φ_ℓ is unsatisfiable and has zero models. Otherwise, φ_ℓ is divided into its components, which are independently solved. The product of the sub-component model counts is returned.
Critical to the performance of the algorithm is the choice of which literal from the current formula φ to branch on. This choice affects the efficiency of clause learning, the effectiveness of component generation, and the cache lookup success rate. SharpSAT uses the VSADS heuristic [32], which is a linear combination of a heuristic aimed at making clause learning effective (VSIDS) and a count of the number of times a variable appears in the current formula.
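The control flow of Algorithm 1 can be made concrete in a few dozen lines. Below is a toy Python model counter with unit propagation, component splitting, and caching; it omits clause learning, uses a naive branching choice, and its representation and names are ours rather than SharpSAT's:

```python
def count_models(clauses, n_vars):
    """Toy exact model counter in the spirit of Algorithm 1. `clauses` is an
    iterable of iterables of int literals; -v is the negation of variable v."""
    cache = {}

    def up(cs, lit):
        forced, changed = {lit}, True
        while changed:
            changed, out = False, []
            for c in cs:
                if c & forced:
                    continue
                c = c - frozenset(-l for l in forced)
                if not c:
                    return None, forced          # empty clause: 0 models
                if len(c) == 1:
                    (u,) = c
                    if u not in forced:
                        forced.add(u)
                        changed = True
                out.append(c)
            cs = out
        return cs, forced

    def components(cs):
        # greedily merge clauses sharing variables into disjoint groups
        groups = []  # list of (variable set, clause list)
        for c in cs:
            vs = {abs(l) for l in c}
            hits = [g for g in groups if g[0] & vs]
            for g in hits:
                groups.remove(g)
                vs |= g[0]
            groups.append((vs, [c] + [cl for g in hits for cl in g[1]]))
        return groups

    def count(cs, free):
        key = (frozenset(cs), frozenset(free))
        if key in cache:
            return cache[key]
        if not cs:
            return 2 ** len(free)                # remaining vars are free
        lit = next(iter(cs[0]))                  # naive branching "heuristic"
        total = 0
        for l in (lit, -lit):
            red, forced = up(cs, l)
            if red is None:
                continue
            rem = free - {abs(x) for x in forced}
            side = 1
            for vs, comp in components(red):
                side *= count(comp, vs)          # COUNT = product of components
            touched = {abs(x) for c in red for x in c}
            side *= 2 ** len(rem - touched)      # vars no clause mentions
            total += side
        cache[key] = total
        return total

    return count([frozenset(c) for c in clauses], set(range(1, n_vars + 1)))
```

For example, count_models([[1, 2], [3, 4]], 4) splits (x1 ∨ x2) ∧ (x3 ∨ x4) into two components of 3 models each and evaluates to 9.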
2.2 Graph Neural Networks
Graph Neural Networks (GNNs) are a class of neural networks used for representation learning over graphs [16, 33]. Utilizing a neighbourhood aggregation (or message passing) scheme, GNNs map the nodes of the input graph to a vector space. Let G = (V, E) be an undirected graph with node feature vectors h_v^(0) for each node v ∈ V. GNNs use the graph structure and the node features to learn an embedding vector h_v for every node. This is done through iterative applications of a neighbourhood aggregation function. In each iteration k, the embedding of a node h_v^(k) is updated by aggregating the embeddings of its neighbours from iteration k − 1 and passing the result through a nonlinear aggregation function A parameterized by W^(k):

    h_v^(k) = A( h_v^(k−1), Σ_{u ∈ N(v)} h_u^(k−1) ; W^(k) ),    (1)

where N(v) = {u | u ∈ V ∧ (v, u) ∈ E}. After K iterations, h_v^(K) is extracted as the final node embedding h_v for node v. Through this scheme, v's node embedding at step k incorporates the structural information of all its k-hop neighbours.
2.3 Evolution Strategies
Evolution Strategies (ES) are a class of zeroth-order black-box optimization algorithms [7, 42]. Inspired by natural evolution, a population of parameter vectors (genomes) is perturbed (mutated) at every iteration, giving birth to a new generation. The resulting offspring are then evaluated by a predefined fitness function. Those offspring with higher fitness scores are selected for producing the next generation.
We adopt a version of ES that has been shown to achieve great success on standard RL benchmarks [29]: Let f : Θ → R denote the fitness function for a parameter space Θ; e.g., in an RL environment, f computes the stochastic episodic reward of a policy π_θ.
To produce the new generation of n parameters, [29] perturbs the current generation with additive Gaussian noise of standard deviation σ: θ_{t+1}^(i) = θ_t + σ ε^(i), where ε^(i) ~ N(0, I). We then evaluate every offspring with the fitness function, obtaining f(θ_{t+1}^(i)) for all i ∈ [1, ..., n]. The parameter update rule is

    θ_{t+1} = θ_t + η ∇_θ E_{θ ~ N(θ_t, σ² I)}[f(θ)] ≈ θ_t + (η / (n σ)) Σ_{i=1}^{n} f(θ_{t+1}^(i)) ε^(i),

where η is the learning rate. The update rule is intuitive: each perturbation ε^(i) is weighted by the fitness of the corresponding offspring θ_{t+1}^(i). We follow the rank-normalization and mirror-sampling techniques of [29] to scale the reward function and to reduce the variance of the gradient, respectively.
3 Approach
3.1 Model Counting as an MDP
We formalize the problem of learning the branching heuristic for #SAT as an MDP. In our setting, the environment is SharpSAT, which is deterministic except for the initial state, where an instance (a CNF formula) is chosen randomly from a given distribution. A time step t corresponds to an invocation of the branching heuristic by the solver (Algorithm 1: line 4). At time step t the agent observes state s_t, consisting of the component φ_t that the solver is operating on, and performs an action from the action space A_t = {l | l ∈ L(φ_t)}. The objective is to reduce the number of decisions the solver makes while solving the counting problem. In detail, the reward function is defined by

    R(s) = r_solved    if s is a terminal state with "instance solved" status,
           r_penalty   otherwise,

where r_solved is a terminal reward and r_penalty < 0. If not finished, episodes are aborted after a predefined maximum number of steps, without receiving the termination reward.
Training with Evolution Strategies.
With the objective defined, we observe that for our task the potential action space as well as the horizon of the episode can be quite large (up to 20,000 and 1,000, respectively). As [41] shows, the exploration complexity of an action space-exploration RL algorithm (e.g., Q-Learning, Policy Gradient) increases with the size of the action space and the problem horizon. On the other hand, a parameter space-exploration algorithm like ES is independent of these two factors. Therefore, we choose to use the version of ES proposed by [29] for optimizing our agent.
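The update above, together with the mirror sampling and rank normalization of [29], can be sketched as follows. The function and parameter names are ours; in our setting, `fitness` would run the solver on a batch of formulas and return the negated number of branching decisions:

```python
import numpy as np

def es_step(theta, fitness, n_pairs=24, sigma=0.1, lr=0.01, rng=None):
    """One ES update: mirrored Gaussian perturbations around `theta`,
    rank-normalized fitness weights, and a gradient-ascent step."""
    if rng is None:
        rng = np.random.default_rng(0)
    eps = rng.standard_normal((n_pairs, theta.size))
    eps = np.concatenate([eps, -eps])        # mirror sampling
    scores = np.array([fitness(theta + sigma * e) for e in eps])
    ranks = scores.argsort().argsort()       # rank normalization:
    weights = ranks / (len(ranks) - 1) - 0.5 # centred weights in [-0.5, 0.5]
    grad = (weights[:, None] * eps).sum(axis=0) / (len(eps) * sigma)
    return theta + lr * grad
```

Iterating this step on a simple fitness such as f(θ) = −‖θ‖² drives θ towards the maximizer at the origin, which is a quick sanity check of the estimator's sign conventions.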
3.2 Model Architecture
As the task for the neural network agent is to pick a literal l from the component φ, we opt for a literal-clause incidence graph representation of the CNF formula (see Section 2 for details). We use GNNs to compute a literal selection heuristic based on the LIG. The LIG representation is similar to the one used by [36, 14, 22], in contrast to the variable-clause incidence graph of [45]. In detail, given the literal-clause incidence graph G = (V, E) of a component φ, we denote the set of all clause nodes as C ⊂ V and the set of all literal nodes as L ⊂ V, with V = C ∪ L. The initial vector representation is denoted by h_c^(0) for each clause c ∈ C and h_l^(0) for each literal l ∈ L. Both are learnable model parameters. We run the following message passing steps iteratively:

    Literal to Clause:  h_c^(k+1) = A( h_c^(k), Σ_{l ∈ c} [h_l^(k), h_l̄^(k)] ; W_C^(k) ),  ∀c ∈ C,
    Clause to Literal:  h_l^(k+1) = A( h_l^(k), Σ_{c : l ∈ c} h_c^(k) ; W_L^(k) ),  ∀l ∈ L,

where A is a nonlinear aggregation function, parameterized by W_C^(k) for clause aggregation and W_L^(k) for literal aggregation at the k-th iteration. Following [36, 22], to ensure the graph representation is invariant under negating every literal (negation invariance), we concatenate the literal representations corresponding to the same variable, [h_l^(k), h_l̄^(k)], when running literal-to-clause message passing. After K iterations, we obtain a d-dimensional vector representation for every literal in the graph. We pass each literal representation through a policy network, a Multi-Layer Perceptron (MLP), to obtain a score, and choose the literal with the highest score. Recently, Xu et al. [43] developed a simple GNN architecture named Graph Isomorphism Network (GIN), and proved that it achieves maximum expressiveness among the class of GNNs. We hence choose GIN to parameterize the aggregation function A. Specifically, A(x, y; W) = MLP((1 + ε)x + y; W), where ε is a hyperparameter. Architectural details are included in Appendix C.
3.3 Semantic Features
In practice, CNF formulas are encoded from a higher-level problem in some other domain, with its own semantics. The features of the original problem domain, which we call semantic features, are all but lost during the encoding process. Classical constraint solvers only process CNF formulas, and so their heuristics by definition are entirely independent of any specific problem domain, and only consider internal solver properties, such as variable activities. These internal solver properties are a function of the CNF representation and internal solver dynamics, and are quite detached from the original problem domain. Thus, it is not unreasonable that semantic features of the original problem domain could contain additional useful structure that can be exploited by the low-level solver heuristic.
One such semantic feature that often naturally arises in real-world problems is time. Many problems are iterative in nature, with a distinct temporal dimension to them, e.g., dynamical systems and bounded model checking. In the original problem domain, there is often a state that is evolved through time via repeated applications of a state transition function. A structured CNF encoding of such problems usually maps every state s_t to a set of variables, and adds sets of clauses to represent the dynamical constraints between every transition (s_t, s_{t+1}). As explained, this process removes all temporal information. In contrast, with a learning-based approach, the time-step feature from the original problem can be readily incorporated as additional input to the network, effectively annotating each variable with its time step. In our experiments, we represented time by appending to each literal embedding a scalar value (the normalized time step t) before passing it through the output MLP. We perform an ablation study to investigate the impact of this additional feature in
We perform an ablation study to investigate the impact of this additional feature inSection 6. The first successful application of machine learning to propositional satisfiability solvers was the portfolio-based
SAT solver
SATZilla [44]. Equipped with a set of standard SAT solvers, a classifier was trained offline to map a given SAT instance to the solver from the set that was best suited to solve that instance. Considering that each solver from the set can be regarded as a configuration of a set of heuristics, this method was effectively performing a heuristic selection task.
Recent work has been directed along two paths: heuristic improvement [35, 21, 22, 45] and purely ML-based solvers [36, 1]. In the former, a model is trained to replace a particular solver heuristic in a standard solver; it is thus embedded as a module within the solver's framework and guides the search process. In the latter approach, the aim is to train a model that acts as a stand-alone "neural" solver. These neural solvers are inherently stochastic and often incomplete, meaning that they can only provide an estimate of the satisfiability of a given instance. This is often undesirable in applications of SAT solvers (e.g., formal verification) where an exact answer is required. In terms of functionality, our work is analogous to the first group, in that we aim at improving the branching heuristics of a standard solver. To our knowledge, no prior work has applied ML to improve exact model counters.
More concretely, our work is similar to [45], which used
Reinforcement Learning (RL) and graph neural networks to learn branching heuristics for a local search-based
SAT solver
WalkSAT [34]. Since the scope of local-search solvers is limited to small problems, their method does not scale to industrial-size instances. Our method is also related to [22] and [14], where similar techniques were used for solving quantified Boolean formulas and mixed integer programs, respectively. In contrast to [22], which incorporates a large set of hand-crafted, solver-specific features, our approach requires no prior knowledge about the dynamics of the solver.
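To make the message-passing scheme of Section 3.2 concrete, here is a minimal numpy sketch with random, untrained weights. All array names are ours, and the W_pair projection of the concatenated [h_l, h_lbar] message back to d dimensions is a simplification we introduce so the GIN form applies directly:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0)

def gin(x, y, W1, W2, eps=0.1):
    """GIN aggregation A(x, y; W) = MLP((1 + eps) * x + y; W)."""
    return relu(((1 + eps) * x + y) @ W1) @ W2

def lig_round(H_L, H_C, clause_lits, P):
    """One literal<->clause message-passing round on an LIG (a sketch).
    H_L: (2n, d) literal embeddings, rows 2i and 2i+1 holding v_i and not-v_i;
    H_C: (m, d) clause embeddings; clause_lits: literal-row indices per clause;
    P: dict of weight matrices (our naming)."""
    msg_C = np.zeros_like(H_C)
    for c, lits in enumerate(clause_lits):
        for l in lits:
            pair = np.concatenate([H_L[l], H_L[l ^ 1]])  # l ^ 1 flips polarity
            msg_C[c] += pair @ P["W_pair"]               # (2d,) -> (d,)
    H_C = gin(H_C, msg_C, P["WC1"], P["WC2"])
    msg_L = np.zeros_like(H_L)
    for c, lits in enumerate(clause_lits):
        for l in lits:
            msg_L[l] += H_C[c]
    H_L = gin(H_L, msg_L, P["WL1"], P["WL2"])
    return H_L, H_C

def score_literals(H_L, W_out):
    """Policy head: one score per literal; the agent branches on the argmax."""
    return (H_L @ W_out).ravel()
```

Stacking K such rounds and taking the argmax of score_literals gives the literal-selection policy described above.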
5 Datasets
To evaluate the versatility of our method, we generated a diverse set of problems from various domains to test on. Unlike other works in this area, which often experiment on small random instances (e.g., random graphs [45, 36, 21]), we chose our problems from either known SAT benchmarks or real-world applications:
sudoku(n, k): Randomly generated, partially filled n × n Sudoku problems (n ∈ {9, 16}) with k squares revealed (lower is harder). We allow our Sudoku problems to have more than one solution.
cell(R, n, r): Elementary (i.e., one-dimensional, binary) Cellular Automata are simple systems of computation where the cells of an n-bit binary state vector are progressed through time by repeated applications of a rule R (seen as a function on the state space). Figure 4a shows the evolution grid of rules 9, 35 and 49 for 20 iterations. Reversing Elementary Cellular Automata was proposed as a benchmark problem in SAT Competition 2018 [18]. To generate an instance, we randomly sample a terminal state T. The problem is then to compute the number of initial states I that would lead to terminal state T in r applications of R, i.e., |{I : R^r(I) = T}|. The proposed CNF encoding in [18] encodes the entire r-step evolution grid by mapping each cell to a single Boolean variable, n × r in total. The clauses impose the constraints between cells of consecutive rows as given by the rule R. The variables corresponding to T (the last row of the evolution grid) are assigned as unit clauses.
grid_wrld(s, t): This dataset is based on encoding a grid world with different types of squares (e.g., lava, water, recharge) and a formal specification such as "Do not recharge while wet" or "Avoid lava" [39, 40]. We randomly sample a grid world of size s and a random starting position I for an agent. At each step, the agent moves uniformly at random between the 4 available directions. We encode the following problem to CNF: "Count the number of trajectories of length t beginning from I that always avoid lava". This number can be used to compute the probability that the agent satisfies the specification, which can in turn be used, for example, to infer specifications from demonstrations (see [39, 40] for details).
bv_expr(n, d, w): For this dataset we randomly generate arithmetic sentences of the form e_1 ≺ e_2, where ≺ ∈ {≤, ≥, <, >, =, ≠} and e_1, e_2 are expressions of maximum depth d over n binary vector variables of size w, random constants, and the operators (+, −, ∧, ∨, ¬, XOR, |·|). The problem is to count the number of integer solutions to the resulting relation in ([0, 2^w] ∩ Z)^n.
6 Experiments
To evaluate our method, we designed experiments to answer the following questions:
1) I.I.D. Generalization: Can a model trained on instances from a given distribution generalize to unseen instances of the same distribution?
2) Upward Generalization: Can a model trained on small instances generalize to larger ones?
3) Wall-Clock Improvement: Can the model improve the run time substantially?
4) Interpretation: Does the sequence of actions taken by the model exhibit any discernible pattern at the problem level?
Additionally, we studied the impact of the trained model on a variety of solver-specific quality metrics (e.g., cache-hit rate), the results of which are in Appendix D. Our baseline is
SharpSAT's heuristic. (The parameters of the cellular automata dataset in a previous version of this paper were slightly different, causing small discrepancies in the result values while not affecting the overall conclusion.)
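For reference, the cell(R, n, r) counting task of Section 5 can be stated directly in code. The brute-force implementation below is exponential in n and for illustration only; treating cells beyond the boundary as 0 is our assumption, not necessarily the encoding of [18]:

```python
from itertools import product

def step(rule, state):
    """Apply an elementary CA rule (Wolfram numbering) to a binary tuple.
    Cells past the ends are treated as 0 (our boundary assumption)."""
    n = len(state)
    out = []
    for i in range(n):
        left = state[i - 1] if i > 0 else 0
        right = state[i + 1] if i < n - 1 else 0
        idx = (left << 2) | (state[i] << 1) | right
        out.append((rule >> idx) & 1)
    return tuple(out)

def count_preimages(rule, r, terminal):
    """|{I : R^r(I) = T}|: enumerate all 2^n candidate initial states."""
    n = len(terminal)
    count = 0
    for init in product((0, 1), repeat=n):
        state = init
        for _ in range(r):
            state = step(rule, state)
        count += state == terminal
    return count
```

A useful sanity check is that, for any rule, the preimage counts over all possible terminal states partition the 2^n initial states.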
Table 1: Neuro# generalizes to both i.i.d. test problems and larger, non-i.i.d. ones, sometimes achieving orders-of-magnitude improvements over SharpSAT's heuristics. Values are average numbers of branching steps on the test set; all episodes are capped at 100k steps.

setting   dataset             vars   clauses   SharpSAT     Neuro#
i.i.d.    sudoku(9, ·)         182      3k         220      -
upward    sudoku(16, ·)         1k     31k       2,373      -
i.i.d.    cell(9, ·, ·)        210      1k         370      -
upward    cell(9, ·, ·)        820      4k      53,349      -
i.i.d.    cell(35, ·, ·)        6k     25k         353      -
upward    cell(35, ·, ·)       12k     49k      21,166      -
upward    cell(35, ·, ·)       25k    102k      26,460      -
upward    cell(35, ·, ·)       48k    195k      33,820      -
i.i.d.    cell(49, ·, ·)        6k     25k         338      -
upward    cell(49, ·, ·)       12k     49k      24,992      -
upward    cell(49, ·, ·)       25k    102k      30,817      -
upward    cell(49, ·, ·)       48k    195k      37,345      -
i.i.d.    grid_wrld(10, ·)     329     967         195      66 (3.0x)
upward    grid_wrld(10, ·)     740      2k      13,661      367 (37x)
upward    grid_wrld(10, ·)      2k      6k      93,093      -
upward    grid_wrld(10, ·)      2k      7k     ≥ 100k       -
upward    grid_wrld(12, ·)      2k      8k     ≥ 100k       -
i.i.d.    bv_expr(5, ·, ·)      90     220         328      -
upward    bv_expr(7, ·, ·)     187     474       5,865      -

Figure 3: (a) cell(49); (b) grid_wrld. Neuro# generalizes well to larger problems; compare the robustness of Neuro# vs. SharpSAT as the problem sizes increase. Solid and dashed lines correspond to SharpSAT and Neuro#, respectively. All episodes are capped at 100k steps.
The grid_wrld problem was a natural candidate for testing the effect of adding the time feature (Section 3.3), so we report the results for that problem with the time feature included; later in this section we perform an ablation study on that feature.
Experimental Protocol.
For each dataset, we sampled 1,800 instances for training and 200 for testing. We trained for 1,000 ES iterations. At each iteration, we sampled 8 formulas from the training set and 48 perturbations. With mirror sampling, we obtained in total 96 = 48 · 2 perturbations. For each perturbation, we ran the agent on the 8 formulas (in parallel), for a total of 768 = 96 · 8 episodes per parameter update. All episodes, unless otherwise mentioned, were capped at 1,000 steps during training and 100,000 during testing. The agent received a constant negative reward r_penalty at each step. We used the Adam optimizer [20] with default hyperparameters and a learning rate of 0.01. We used a weight decay of 0.005 and the same architectural hyperparameters for our model across all datasets (details in Appendix C).
I.I.D. Generalization. Table 1 summarizes the results of the i.i.d. generalization over the four problem domains of Section 5. We report the average number of branching steps on the test set.
Neuro# outperformed the baseline across all datasets. Most notably, on grid_wrld, it reduced the number of branching steps by a factor of 3.0, from 195 down to 66. On cell, it reduced the step count by an average factor of 1.8 over the three different cellular rules. A similar improvement held for bv_expr. We observed smaller improvements on sudoku; we conjecture this is due to the dense structure of the problem. The sudoku encoding is global, in that every square is 1 hop away on the LIG from all other relevant squares, and there is no local problem structure to exploit. Appendix B.1 includes cactus plots comparing the performance of SharpSAT to Neuro# across all datasets.
Figure 4: Contrary to SharpSAT, Neuro# branches earlier on variables of the bottom rows. (a) Evolution of a bit-vector through repeated applications of Cellular Automata rules: the result of applying the rule at each iteration is placed under the previous bit-vector, creating a two-dimensional, top-down representation of the system's evolution. (b) The initial formula simplification on a single formula; yellow indicates the regions of the formula that this process prunes. (c) & (d) Variable selection ordering by SharpSAT and Neuro#, averaged over the entire dataset; lighter colours show that the corresponding variable is selected earlier on average.
Upward Generalization.
Directly training on challenging instances is costly; we instead train Neuro# on small problem instances and rely on generalization to solve the more challenging instances from the same problem domain. We created instances of larger sizes (up to an order of magnitude more clauses and variables) for each of the datasets in Section 5. We took the models trained in the previous i.i.d. setting and evaluated them directly on these larger instances without further training.
The evaluation results are shown in the upward-generalization portion of Table 1. We see that Neuro# generalized to the larger instances across all datasets, and in almost all of them it achieved substantial gains compared to the baseline as we increased the instance sizes. Figure 3 shows this effect for multiple sizes of cell(49) and grid_wrld by plotting the percentage of the problems solved within a given number of steps (plots for other problems are included in Appendix B.2). The gaps get more pronounced once we remove the cap on steps, i.e., let the episodes run to completion. In that case, on grid_wrld(10, ·), Neuro# took an average of 1,320 branching decisions, whereas
SharpSAT took 809,408 (613x improvement).
Wall-Clock Improvement.
Improvements of this scale in step count on large instances are significant enough for Neuro# to beat SharpSAT in wall-clock time, as evident in Figure 1 for grid_wrld and in Figure 5 for cell(49). Note that this is in spite of the imposed overhead of querying the model, which limits the number of steps Neuro# can take per second compared to SharpSAT. For example, while solving cell(49, ·, ·), SharpSAT took 331 steps/sec on average, whereas Neuro# was only able to take 17. We expect that this overhead could be greatly reduced, as our implementation is far from optimized: it calls out-of-process Python code from within the solver's main loop (in C++), does not utilize a GPU, and does not perform any optimizations on the neural network's inference.
Figure 5: Cactus plots comparing Neuro# to SharpSAT on the cell(49, ·, ·) benchmark (lower and to the right is better). For any point t on the y axis, the plot shows the number of benchmark problems that are individually solvable by the solver within t steps (left) and seconds (right).
Figure 6: Full-sized variable selection heatmap on dataset cell(35, ·, ·). Lighter colours show that the corresponding variable is selected earlier on average across the dataset. We show the 99th percentile for each row of the heatmap in the last column. Notice Neuro#'s tendency towards selecting variables of the bottom rows earlier.
Problem-Level Interpretation.
Encodings to CNF can be quite removed from the original problem domain. Consider grid_wrld: the problems are encoded to a state machine, then to a circuit, and finally to CNF, and many new variables are created along this process. In contrast, cell has a straightforward encoding that directly relates the CNF representation to an easy-to-visualize evolution grid, which coincides with the standard representation of Elementary Cellular Automata. This allows for interpretation of Neuro#'s policy in the original problem domain.
Our conjecture was that the model would learn to solve the problem from the bottom up. On the evolution grid, the known terminal state T is the bottom row, and the task is to count the number of distinct top rows I compatible with T. The natural way to decompose this problem is to start from the known state T and continue assigning variables to "guess" the preimage, row by row from the bottom up. Different preimages can be computed independently upwards, and indeed, this is how a human would approach the problem.
The heat maps in Figure 4 (c) and (d) depict the behaviour under SharpSAT and Neuro#, respectively. The heat map aligns with the evolution grid, with the terminal state T at the bottom. For each dataset, the hotter-coloured cells indicate that, on average, the corresponding variable tends to be branched on earlier by the policy. The cooler colours show that the variable is often selected later or not at all, meaning that its value is often inferred through UP, either initially or after some variable assignments. That is why the bottom row T and the adjacent rows are completely dark: they are simplified away by the solver before any branching happens. We show the effect of this early simplification on a single formula per dataset in Figure 4 (b). Notice that in cell(35) and cell(49) the simplification shatters the problem space into a few small components (dark triangles), while in cell(9), which is a more challenging problem, it only chips away a small region of the problem space, leaving it as a single component. Regardless, as conjectured, we can see a clear trend of Neuro# focusing its early branching on variables of the bottom rows in cell(9), and in a less pronounced way in cell(35) and cell(49). Moreover, as seen more clearly in the heatmap for the larger problem in Figure 6, the learned heuristic actually branches early according to the pattern of the rule.
Figure 7: Ablation study on the impact of the "time" feature on upward generalization on grid_wrld(10, ·).
Time Feature.
We tested the degree to which the "time" feature contributed to the upward generalization performance on grid_wrld. We compared three architectures, with SharpSAT as the baseline: GNN, the standard architecture proposed in Section 3.2; GNN+Time, the same as GNN but with the variable embeddings augmented with the "time" semantic feature (Section 3.3); and Time, where no variable embedding is computed and only the "time" feature is fed to the policy network.
As can be seen in Figure 7, we discovered that the "time" feature is responsible for most of the improvement over SharpSAT. This fact is encouraging, because it demonstrates the potential gains that could be achieved by simply utilizing problem-level data, such as "time", that would otherwise have been lost during the CNF encoding. More elaborate ablation studies can be found in Appendix E.
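The mechanism of Section 3.3, appending the normalized time step to each literal embedding before the output MLP, amounts to something like the following sketch (array names and the single linear output layer are our simplifications):

```python
import numpy as np

def score_with_time(H_L, var_time, horizon, W_out):
    """Score literals with the 'time' semantic feature appended.
    H_L: (2n, d) literal embeddings, rows 2i and 2i+1 belonging to variable i;
    var_time: length-n sequence giving the time step each variable encodes;
    horizon: total number of time steps, used for normalization;
    W_out: (d + 1, 1) output weights."""
    t = np.repeat(np.asarray(var_time) / horizon, 2)  # one entry per literal
    H = np.hstack([H_L, t[:, None]])                  # append scalar feature
    return (H @ W_out).ravel()                        # one score per literal
```

The Time-only ablation corresponds to feeding just the scalar t column to the policy network, with no learned variable embedding at all.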
7 Discussion
We studied the feasibility of enhancing the variable branching heuristic in propositional model counting via learning. We used the number of branching steps the solver makes as a measure of its performance and trained our model to minimize that measure. We demonstrated experimentally that the resulting model not only is capable of generalizing to unseen instances from the same problem distribution, but also maintains its lead relative to SharpSAT on larger problems. For certain problems, this lead widens to the degree that the trained model achieves wall-clock time improvements over the standard heuristic, in spite of the imposed run time overhead of querying the model. This is exciting, as it positions this line of research as a potential path towards building better model counters and hence broadening their application horizon.
References

[1] Saeed Amizadeh, Sergiy Matusevych, and Markus Weimer. Learning To Solve Circuit-SAT: An Unsupervised Differentiable Approach. In International Conference on Learning Representations (ICLR), 2019. https://openreview.net/forum?id=BJxgz2R9t7
[2] Fahiem Bacchus, Shannon Dalmao, and Toniann Pitassi. Algorithms and Complexity Results for #SAT and Bayesian Inference. In 44th Symposium on Foundations of Computer Science (FOCS 2003), pages 340-351. IEEE Computer Society, 2003. doi: 10.1109/SFCS.2003.1238208
[4] Fahiem Bacchus, Shannon Dalmao, and Toniann Pitassi. Value Elimination: Bayesian Inference via Backtracking Search. In Proceedings of the 19th Conference in Uncertainty in Artificial Intelligence (UAI 2003), pages 20-28. Morgan Kaufmann, 2003. https://dslpitt.org/uai/displayArticleDetails.jsp?mmnu=1&smnu=2&article_id=909&proceeding_id=19
[5] Maria-Florina Balcan, Travis Dick, Tuomas Sandholm, and Ellen Vitercik. Learning to Branch. In Proceedings of the 35th International Conference on Machine Learning (ICML 2018), volume 80 of Proceedings of Machine Learning Research, pages 353-362. PMLR, 2018. http://proceedings.mlr.press/v80/balcan18a.html
[6] Roberto J. Bayardo, Jr. and Joseph Daniel Pehoushek. Counting Models Using Connected Components. In Proceedings of the Seventeenth National Conference on Artificial Intelligence (AAAI 2000), pages 157-162. AAAI Press / The MIT Press, 2000.
[7] Hans-Georg Beyer and Hans-Paul Schwefel. Evolution Strategies - A Comprehensive Introduction. Nat. Comput., 1(1):3-52, 2002. doi: 10.1023/A:1015059928466
[8] Elazar Birnbaum and Eliezer L. Lozinskii. The Good Old Davis-Putnam Procedure Helps Counting Models. J. Artif. Intell. Res., 10:457-477, 1999. doi: 10.1613/jair.601
[9] Supratik Chakraborty, Daniel J. Fremont, Kuldeep S. Meel, Sanjit A. Seshia, and Moshe Y. Vardi. Distribution-Aware Sampling and Weighted Model Counting for SAT. In Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence (AAAI 2014), pages 1722-1730. AAAI Press, 2014.
[10] Martin Davis and Hilary Putnam. A Computing Procedure for Quantification Theory. J. ACM, 7(3):201-215, 1960. doi: 10.1145/321033.321034
[11] Martin Davis, George Logemann, and Donald W. Loveland. A Machine Program for Theorem-Proving. Commun. ACM, 5(7):394-397, 1962. doi: 10.1145/368273.368557
[12] Carmel Domshlak and Jörg Hoffmann. Fast Probabilistic Planning through Weighted Model Counting. In Proceedings of the Sixteenth International Conference on Automated Planning and Scheduling (ICAPS 2006), pages 243-252. AAAI, 2006.
[13] Carmel Domshlak and Jörg Hoffmann. Probabilistic Planning via Heuristic Forward Search and Weighted Model Counting. J. Artif. Intell. Res., 30:565-620, 2007. doi: 10.1613/jair.2289
[14] Maxime Gasse, Didier Chételat, Nicola Ferroni, Laurent Charlin, and Andrea Lodi. Exact Combinatorial Optimization with Graph Convolutional Neural Networks. In Advances in Neural Information Processing Systems 32 (NeurIPS 2019), pages 15554-15566, 2019. http://papers.nips.cc/paper/9690-exact-combinatorial-optimization-with-graph-convolutional-neural-networks
[15] Carla P. Gomes, Ashish Sabharwal, and Bart Selman. Model Counting. In Handbook of Satisfiability, volume 185 of Frontiers in Artificial Intelligence and Applications, pages 633-654. IOS Press, 2009. https://doi.org/10.3233/978-1-58603-929-5-633
[16] M. Gori, G. Monfardini, and F. Scarselli. A New Model for Learning in Graph Domains. In Proceedings of the 2005 IEEE International Joint Conference on Neural Networks, volume 2, pages 729-734, 2005.
[17] C. Hansknecht, I. Joormann, and S. Stiller. Cuts, Primal Heuristics, and Learning to Branch for the Time-Dependent Traveling Salesman Problem. Technical report, arXiv, 2018. https://arxiv.org/abs/1805.01415
[18] Marijn J. H. Heule, Matti Juhani Järvisalo, and Martin Suda, editors. Proc. of SAT Competition 2018: Solver and Benchmark Descriptions. University of Helsinki, 2018. http://hdl.handle.net/10138/237063
[19] Elias Boutros Khalil, Pierre Le Bodic, Le Song, George L. Nemhauser, and Bistra Dilkina. Learning to Branch in Mixed Integer Programming. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI 2016), pages 724-731. AAAI Press, 2016.
[20] Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. In International Conference on Learning Representations (ICLR), 2015. http://arxiv.org/abs/1412.6980
[21] Vitaly Kurin, Saad Godil, Shimon Whiteson, and Bryan Catanzaro. Improving SAT Solver Heuristics with Graph Networks and Reinforcement Learning. CoRR, abs/1909.11830, 2019. http://arxiv.org/abs/1909.11830
[22] Gil Lederman, Markus N. Rabe, Sanjit Seshia, and Edward A. Lee. Learning Heuristics for Quantified Boolean Formulas through Reinforcement Learning. In International Conference on Learning Representations (ICLR), 2020. https://openreview.net/forum?id=BJluxREKDB
[23] Wei Li, Pascal Poupart, and Peter van Beek. Exploiting Structure in Weighted Model Counting Approaches to Probabilistic Inference. J. Artif. Intell. Res., 40:729-765, 2011. http://jair.org/papers/paper3232.html
[24] João Marques-Silva. Computing with SAT Oracles: Past, Present and Future. In Sailing Routes in the World of Computation - 14th Conference on Computability in Europe (CiE 2018), volume 10936 of Lecture Notes in Computer Science, pages 264-276. Springer, 2018. https://doi.org/10.1007/978-3-319-94418-0_27
[25] Kuldeep S. Meel and S. Akshay. Sparse Hashing for Scalable Approximate Model Counting: Theory and Practice. CoRR, abs/2004.14692, 2020. https://arxiv.org/abs/2004.14692
[26] Umut Oztok and Adnan Darwiche. A Top-Down Compiler for Sentential Decision Diagrams. In Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence (IJCAI 2015), pages 3141-3148. AAAI Press, 2015. http://ijcai.org/Abstract/15/443
[27] Neil Robertson and Paul D. Seymour. Graph Minors. X. Obstructions to Tree-Decomposition. J. Comb. Theory, Ser. B, 52(2):153-190, 1991. doi: 10.1016/0095-8956(91)90061-N
[28] Neil Robertson and Paul D. Seymour. Graph Minors XXIII. Nash-Williams' Immersion Conjecture. J. Comb. Theory, Ser. B, 100(2):181-205, 2010. doi: 10.1016/j.jctb.2009.07.003
[29] Tim Salimans, Jonathan Ho, Xi Chen, and Ilya Sutskever. Evolution Strategies as a Scalable Alternative to Reinforcement Learning. CoRR, abs/1703.03864, 2017. http://arxiv.org/abs/1703.03864
[30] Tian Sang, Fahiem Bacchus, Paul Beame, Henry A. Kautz, and Toniann Pitassi. Combining Component Caching and Clause Learning for Effective Model Counting. In SAT 2004 - The Seventh International Conference on Theory and Applications of Satisfiability Testing, Online Proceedings, 2004.
[31] Tian Sang, Paul Beame, and Henry A. Kautz. Performing Bayesian Inference by Weighted Model Counting. In Proceedings of the Twentieth National Conference on Artificial Intelligence (AAAI 2005), pages 475-482. AAAI Press / The MIT Press, 2005.
[32] Tian Sang, Paul Beame, and Henry A. Kautz. Heuristics for Fast Exact Model Counting. In Theory and Applications of Satisfiability Testing (SAT 2005), volume 3569 of Lecture Notes in Computer Science, pages 226-240. Springer, 2005. https://doi.org/10.1007/11499107_17
[33] Franco Scarselli, Marco Gori, Ah Chung Tsoi, Markus Hagenbuchner, and Gabriele Monfardini. The Graph Neural Network Model. IEEE Trans. Neural Networks, 20(1):61-80, 2009. doi: 10.1109/TNN.2008.2005605
[34] Bart Selman, Henry A. Kautz, and Bram Cohen. Local Search Strategies for Satisfiability Testing. In Cliques, Coloring, and Satisfiability, volume 26 of DIMACS Series in Discrete Mathematics and Theoretical Computer Science, pages 521-531. DIMACS/AMS, 1993. doi: 10.1090/dimacs/026/25
[35] Daniel Selsam and Nikolaj Bjørner. NeuroCore: Guiding High-Performance SAT Solvers with Unsat-Core Predictions. CoRR, abs/1903.04671, 2019. http://arxiv.org/abs/1903.04671
[36] Daniel Selsam, Matthew Lamm, Benedikt Bünz, Percy Liang, Leonardo de Moura, and David L. Dill. Learning a SAT Solver from Single-Bit Supervision. In International Conference on Learning Representations (ICLR), 2019. https://openreview.net/forum?id=HJMC_iA5tm
[37] Marc Thurley. SharpSAT - Counting Models with Advanced Component Caching and Implicit BCP. In Theory and Applications of Satisfiability Testing (SAT 2006), volume 4121 of Lecture Notes in Computer Science, pages 424-429. Springer, 2006. https://doi.org/10.1007/11814948_38
[38] Seinosuke Toda. PP is as Hard as the Polynomial-Time Hierarchy. SIAM J. Comput., 20(5):865-877, 1991. https://doi.org/10.1137/0220053
[39] Marcell Vazquez-Chanlatte, Susmit Jha, Ashish Tiwari, Mark K. Ho, and Sanjit A. Seshia. Learning Task Specifications from Demonstrations. In Advances in Neural Information Processing Systems 31 (NeurIPS 2018), pages 5372-5382, 2018. http://papers.nips.cc/paper/7782-learning-task-specifications-from-demonstrations
[40] Marcell Vazquez-Chanlatte, Markus N. Rabe, and Sanjit A. Seshia. A Model Counter's Guide to Probabilistic Systems. CoRR, abs/1903.09354, 2019. http://arxiv.org/abs/1903.09354
[41] Anirudh Vemula, Wen Sun, and J. Andrew Bagnell. Contrasting Exploration in Parameter and Action Space: A Zeroth-Order Optimization Perspective. In The 22nd International Conference on Artificial Intelligence and Statistics (AISTATS 2019), volume 89 of Proceedings of Machine Learning Research, pages 2926-2935. PMLR, 2019. http://proceedings.mlr.press/v89/vemula19a.html
[42] Daan Wierstra, Tom Schaul, Tobias Glasmachers, Yi Sun, Jan Peters, and Jürgen Schmidhuber. Natural Evolution Strategies. J. Mach. Learn. Res., 15(1):949-980, 2014. http://dl.acm.org/citation.cfm?id=2638566
[43] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How Powerful are Graph Neural Networks? In International Conference on Learning Representations (ICLR), 2019. https://openreview.net/forum?id=ryGs6iA5Km
[44] Lin Xu, Frank Hutter, Holger H. Hoos, and Kevin Leyton-Brown. SATzilla: Portfolio-based Algorithm Selection for SAT. J. Artif. Intell. Res., 32:565-606, 2008. doi: 10.1613/jair.2490
[45] Emre Yolcu and Barnabás Póczos. Learning Local Search Heuristics for Boolean Satisfiability. In Advances in Neural Information Processing Systems 32 (NeurIPS 2019), pages 7990-8001, 2019. http://papers.nips.cc/paper/9012-learning-local-search-heuristics-for-boolean-satisfiability
A Exact Algorithms for #SAT

In this section we provide some more details about exact algorithms for solving #SAT.

Algorithm 2 DPLL extended to count all solutions (CDP)
  function CDP(φ)
      if φ contains an empty clause then return 0
      if φ contains no clauses then return 2^k    ▷ k = number of unset variables
      Pick a literal l ∈ φ
      return CDP(UP(φ, l)) + CDP(UP(φ, ¬l))

Algorithm 3 Using Components
  function Relsat(φ)
      Pick a literal l ∈ φ
      c_l = CountSide(φ, l)
      c_¬l = CountSide(φ, ¬l)
      return c_l + c_¬l
  function CountSide(φ, l)
      φ_l = UP(φ, l)
      if φ_l contains an empty clause then return 0
      if φ_l contains no clauses then return 2^k
      K = findComponents(φ_l)
      return ∏_{κ ∈ K} Relsat(κ)

Algorithm 2 (CDP) extends DPLL to count all solutions. When the current formula contains no clauses, each of its k unset variables can be assigned true or false, so there are 2^k models (line 6). This algorithm is not very efficient, running in Θ(2^n) time in the worst case, where n is the number of variables in the input formula. Note that the algorithm is actually a class of algorithms, each determined by the procedure used to select the next literal to branch on. The complexity bound is strong in the sense that, no matter how the branching decisions are made, we can find a sequence of input formulas on which the algorithm takes time exponential in n as the formulas get larger.

Breaking the formula into components and solving each component separately is an approach suggested by Bayardo and Pehoushek [6] and used in the Relsat solver. This approach is shown in Algorithm 3. The algorithm works on one component at a time and is identical to Algorithm 1 except that caching is not used. Breaking the formula into components can yield considerable speedups depending on the number of variables that need to be set before the formula breaks into components. If we consider a hypergraph in which every variable is a node and every clause is a hyperedge over the variables mentioned in the clause, then the branch-width [27] of this hypergraph provides an upper bound on that number. As a result we obtain a better upper bound on the run time of Relsat of n^O(w), where w is the branch-width of the input's hypergraph. However, this run time will only be achieved if the branching decisions are made in an order that respects the branch decomposition of width w. In particular, there exists a sequence of branching decisions achieving a run time of n^O(w). Computing that sequence would require time n^O(1) 2^O(w) [28], hence a run time of n^O(w) can be achieved. Finally, if component caching is used we obtain Algorithm 1, which has a better upper bound of 2^O(w). Again, this run time can be achieved with an n^O(1) 2^O(w) computation of an appropriate sequence of branching decisions.

In practice, the branch-width of most instances is very large, making a run time of 2^O(w) infeasible; computing a branching sequence to achieve that run time is also infeasible. Fortunately, on practical instances unit propagation is also very powerful: making only a few decisions (< w) often allows unit propagation to set w or more variables, thus breaking the formula apart into separate components. Furthermore, most instances are falsified by a large proportion of their truth assignments. This makes clause learning an effective addition, enabling effective use of the cache. And making decisions that allow the solver to learn more effective clauses lets the solver more efficiently traverse the often large space of non-solutions.

Figure 8: Cactus plots on (a) sudoku, (b) cell(9), (c) cell(35), (d) cell(49), (e) grid_wrld, (f) bv_expr. Neuro outperforms SharpSAT on all i.i.d. benchmarks (lower and to the right is better). A cut-off of 100k steps was imposed, though both solvers managed to solve the datasets in fewer steps.
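As a concrete illustration of Algorithm 2, the following is a minimal Python sketch of CDP. It branches naively on the first literal of the first clause (a stand-in for a real branching heuristic such as VSADS) and applies the 2^k leaf rule; with no caching or component splitting it runs in exponential time, so it is illustrative only.

```python
def cdp(clauses, variables):
    """Count models of a CNF formula, following Algorithm 2 (CDP).

    clauses:   list of clauses; each clause is a list of signed ints
               (DIMACS-style literals, e.g. -3 means "x3 is false").
    variables: frozenset of the variable indices of the formula.
    """
    if any(len(c) == 0 for c in clauses):
        return 0                              # empty clause: no models
    if not clauses:
        return 2 ** len(variables)            # k free variables -> 2^k models
    lit = clauses[0][0]                       # naive branching choice
    rest = variables - {abs(lit)}
    total = 0
    for l in (lit, -lit):                     # branch on both polarities
        # assign l true: drop satisfied clauses, delete the falsified literal
        reduced = [[x for x in c if x != -l] for c in clauses if l not in c]
        total += cdp(reduced, rest)
    return total

# (x1 or x2) and (not x1 or x3) has 4 models over {x1, x2, x3}
assert cdp([[1, 2], [-1, 3]], frozenset({1, 2, 3})) == 4
assert cdp([[1], [-1]], frozenset({1})) == 0   # unsatisfiable
```

The recursion on both polarities is exactly the CDP(UP(φ, l)) + CDP(UP(φ, ¬l)) sum; cascaded unit propagation is omitted here since it only prunes the same search tree.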
B More on the Results
In this section we present a more elaborate discussion of our results. Aggregated measures of performance, such as the average number of decisions (Table 1), only give an overall indication of Neuro's lead over SharpSAT; they cannot show whether it performs better on easier or harder instances in the dataset. Cactus plots are the standard way of comparing solver performance in the SAT community. Although they are typically used to compare wall-clock time (Figure 1b), here we use them to compare the number of steps (i.e., branching decisions).
B.1 I.I.D. Generalization
Figure 8 shows cactus plots for all of the i.i.d. benchmark problems. Unsurprisingly, the improvements on sudoku are relatively modest, albeit consistent across the dataset. On all cell datasets and on grid_wrld, Neuro's lead over SharpSAT grows exponentially as the problems get more difficult (moving right along the x axis). Lastly, on bv_expr, Neuro does better almost universally, except near the 100-problem mark and at the very end (the 3 most difficult problems).
B.2 Upwards Generalization
On some datasets, namely cell(49) and grid_wrld, Neuro's lead over SharpSAT becomes more pronounced as we test upwards generalization (using the model trained on smaller instances and testing on larger ones). The cactus plots of Figures 9 and 10 show this effect clearly for these datasets. In each figure, the i.i.d. plot is included as a reference on the left, and on the right the plots for test sets with progressively larger instances are depicted. The upward generalization lead is less striking, although still significant, on bv_expr (2.7x, up from 1.6x). On sudoku and cell(9), Neuro's lead is maintained but becomes less prominent on the more difficult datasets. Figure 11 summarizes these points by comparing the percentage of problems solvable by SharpSAT vs. Neuro within a given number of steps. Notice the robustness of the learned model on cell(35), cell(49) and grid_wrld: as these datasets get more difficult, SharpSAT either takes more steps or fails to solve the problems altogether, whereas Neuro largely sustains its performance.

Figure 9: Cactus plot on cell(49): Neuro maintains its lead over SharpSAT on larger datasets (lower and to the right is better). A cut-off of 100k steps was imposed. (a) i.i.d. generalization on cell(49, ); (b) upward generalization of the model trained on cell(49, ) over larger datasets.

Figure 10: Cactus plot on grid_wrld: Neuro maintains its lead over SharpSAT on larger datasets (lower and to the right is better). A cut-off of 100k steps was imposed. (a) i.i.d. generalization on grid_wrld(10, ); (b) upward generalization of the model trained on grid_wrld(10, ) over larger datasets.
B.3 Discussion
Many dataset attributes may contribute to the upward generalization success on the aforementioned datasets, but one of the main factors is the model's ability to observe similar components many times during training. In other words, if a problem gets shattered by the initial simplification (unit propagation) into smaller components, there is a high chance that the model learns to solve such components. If larger problems of the same domain also break down into similar components, then Neuro can generalize well to them. In Section 6.1 we discussed this phenomenon for cell via heat maps. In Figure 12 we provide full heat maps for larger datasets of both cell(35) and cell(49). Not only is the "shattering" effect evident in these plots, but we can also observe that in both datasets Neuro branches on variables from the bottom going up. This matches the conjecture presented in Section 6.1.
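The "shattering" into variable-disjoint components underlying this discussion corresponds to the findComponents step of Algorithm 3. A minimal sketch of that step (our own helper, using union-find over variables that co-occur in a clause):

```python
def split_components(clauses):
    """Partition a CNF (list of signed-integer-literal lists, all non-empty)
    into sub-formulas with disjoint variable sets. The model count of the
    whole formula is then the product of the component counts (times 2^k
    for any variables appearing in no clause)."""
    parent = {}

    def find(v):
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]   # path halving
            v = parent[v]
        return v

    def union(a, b):
        parent[find(a)] = find(b)

    # variables sharing a clause belong to the same component
    for c in clauses:
        for lit in c[1:]:
            union(abs(c[0]), abs(lit))

    comps = {}
    for c in clauses:
        comps.setdefault(find(abs(c[0])), []).append(c)
    return list(comps.values())

# (x1 v x2) & (~x1 v x2) shares no variables with (x3 v x4): two components
assert len(split_components([[1, 2], [-1, 2], [3, 4]])) == 2
```

When unit propagation removes enough clauses, this partition yields several small components, which is exactly the behaviour visible in the heat maps.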
C Architecture Details
Both our literal and clause embeddings are of size … . GNN messages are implemented by an MLP with ReLU non-linearity. Clause-to-literal messages are of dimensions … , and literal-to-clause messages are of dimensions … (as described in Section 3, we "tie" the literals to achieve negation-invariance, hence the doubled first dimension). We use … iterations in the GNN, and the final literal embeddings are passed through the MLP policy network of dimensions … to get the final score. When using the extra time feature, the first dimension of the decision layer is … . The initial (iteration 0) embeddings of both literals and clauses are trainable model parameters. In Appendix E, where we augment the literal features with "variable scores", we start with a feature vector of size … for each literal and pass it through an MLP of dimensions … to get the initial literal embedding.

Figure 11: (a) sudoku, (b) cell(9), (c) cell(35), (d) cell(49), (e) grid_wrld, (f) bv_expr. Neuro generalizes well to larger problems on almost all datasets (higher and to the left is better). Compare the robustness of Neuro vs. SharpSAT as the problem sizes increase. Solid and dashed lines correspond to SharpSAT and Neuro, respectively. All episodes are capped at 100k steps.
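The "tying" of the two polarities of a variable (the doubled first dimension in the message MLPs of Appendix C) can be sketched as a message function that always consumes the embeddings of both a literal and its negation, so the two polarities share one set of weights. The sizes, weights, and non-linearity below are illustrative assumptions, not the trained model's parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                   # illustrative embedding size

# One tied weight matrix consumes BOTH polarities of a variable (2d inputs),
# so swapping l and not-l merely swaps the order of the two input halves.
W = rng.normal(size=(2 * d, d))

def literal_message(lit_emb, neg_emb):
    """Message computed from a (literal, negation) embedding pair."""
    return np.tanh(np.concatenate([lit_emb, neg_emb]) @ W)

pos, neg = rng.normal(size=d), rng.normal(size=d)
m_pos = literal_message(pos, neg)       # message for literal  l
m_neg = literal_message(neg, pos)       # message for literal ~l
assert m_pos.shape == (d,)
```

Because both polarities flow through the same matrix, renaming a variable's positive and negative literals changes only which half of the input each embedding occupies, which is the negation-invariance property the doubled first dimension buys.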
D Trained Policy’s Impact on Solver Performance Measures
In this section we analyze the impact of Neuro on the solver's performance through the lens of a set of solver-specific performance measures: the number of conflict clauses the solver encounters while solving a problem (num conflicts), the total (hit + miss) number of cache lookups (num cache lookups), the average size of the components stored in the cache (avg(comp size stored)), the cache hit-rate (cache hit-rate), and the average size of the components successfully found in the cache (avg(comp size hit)).

A conflict clause is generated whenever the solver encounters an empty clause, indicating that the current sub-formula has zero models. Thus the number of conflict clauses generated is a measure of the amount of work the solver spends traversing the non-solution space of the formula. Cache hits and the size of the cached components, on the other hand, indicate how effectively the solver traverses the formula's solution space. In particular, when a component with k variables is found in the cache (a cache hit), the solver does not need to do any further work to count the number of solutions over those k variables, potentially saving it O(2^k) computations. This O(2^k) worst case rarely occurs in practice; nevertheless, the number of cache hits and the average size of the components in those cache hits indicate how effective the solver is at traversing the formula's solution space. Additional indicators of the solver's performance in traversing the solution space are the number of components generated and their average size. Every time the solver breaks its current sub-formula into components, it reduces the worst case complexity of solving that sub-formula. For example, when a sub-formula of m variables is broken up into two components of k1 and k2 variables, the worst case complexity drops from O(2^m) to O(2^k1) + O(2^k2). Again, the worst case rarely occurs in practice.

Figure 13 shows these measures on cell(49, ) and grid_wrld(10, ). Looking at the individual performance measures, we see that Neuro encounters fewer conflicts, meaning that it traverses the non-solution space more effectively in both datasets. The cache measures indicate that the standard heuristic traverses the solution space a bit more effectively, finding more components (num cache lookups) of similar or larger average size. However, Neuro is able to utilize the cache as efficiently (with a comparable cache hit-rate) while finding components in the cache that are considerably larger than those found by the standard heuristic. In sum, the learnt heuristic finds an effective trade-off: it learns more powerful clauses, with which the solver can more efficiently traverse the non-solution space, at the cost of a slight degradation in its efficiency at traversing the solution space. The net result is an improvement in the solver's run time.

Figure 12: (a) cell(35, ); (b) cell(49, ). Clear depiction of Neuro's pattern of variable branching. The "Units" plots show the initial formula simplification by the solvers; yellow indicates the regions of the formula that this process prunes. Heat maps show the variable selection ordering by SharpSAT and Neuro; lighter colours show that the corresponding variable is selected earlier on average across the dataset.

Figure 13: Radar charts showing the impact of each policy across the different solver-specific performance measures on (a) cell(49, ) and (b) grid_wrld(10, ).

Figure 14: Cactus plot – ablation study on the impact of the "time" and VSADS features on upward generalization on grid_wrld(10, ) (lower and to the right is better). A termination cap of 100k steps was imposed on the solver.

E Ablation Study
Variable Score
We mentioned in Section 2 that SharpSAT's default way of selecting variables is based on the VSADS heuristic, which incorporates the number of times a variable v appears in the current sub-formula and (a function of) the number of conflicts it took part in. At every branching juncture, the solver picks the variable with maximum score among the ones in the current component and branches on one of its literals (see Algorithm 1). As part of our efforts to improve the performance of our model, we performed an additional ablation study beyond that of Section 6.1. Concretely, we measured the effect of including the variable scores in our model (as detailed in Appendix C), testing on the grid_wrld(10, ) and cell(49, ) datasets (Figures 14 & 15). For both datasets, the inclusion of the variable scores produced results inferior to those achieved without them! This is surprising, though consistent with what was observed in [22].

Random Policies
As an essential sanity check, we tested how a "random policy" performs compared to the trained model, to ensure that our model's performance improvements are not trivially attainable without training. To that end, we tested two such random policies on the cell(35, ) dataset: 1) Random Literal, which chooses a literal uniformly at random; and 2) Random Network, where we randomly set our model's weights instead of training them. Both of these policies were inferior to SharpSAT's result of 353 steps (Table 1), achieving averages of 867 and 740 steps, respectively.

Figure 15: Cactus plot – inclusion of the VSADS score as a feature hurts upward generalization on cell(49, ) (lower and to the right is better). A termination cap of 100k steps was imposed on the solver.