Improving ENIGMA-Style Clause Selection While Learning From History
Martin Suda
Czech Technical University in Prague, Czech Republic
[email protected]
Abstract.
We re-examine the topic of machine-learned clause selection guidance in saturation-based theorem provers. The central idea, recently popularized by the ENIGMA system, is to learn a classifier for recognizing clauses that appeared in previously discovered proofs. In subsequent runs, clauses classified positively are prioritized for selection. We propose several improvements to this approach and experimentally confirm their viability. For the demonstration, we use a Recursive Neural Network to classify clauses based on their derivation history and the presence or absence of automatically supplied theory axioms therein. The automatic theorem prover Vampire guided by the network achieves a 41 % improvement on a relevant subset of smt-lib in a real time evaluation.
Keywords:
Saturation-based theorem proving · Clause Selection · Machine Learning · Recursive Neural Networks.
The idea to improve the performance of saturation-based Automatic Theorem Provers (ATPs) with the help of machine learning (ML), while going back at least to the early work of Schulz [8,30], has recently been enjoying a renewed interest. Most notable is the ENIGMA system [15,16] extending the ATP E [31] by machine-learned clause selection guidance. The architecture trains a binary classifier for recognizing as positive those clauses that appeared in previously discovered proofs and as negative the remaining selected ones. In subsequent runs, clauses classified positively are prioritized for selection.

A system such as ENIGMA needs to carefully balance the expressive power of the used ML model with the time it takes to evaluate its advice. For example, Loos et al. [21], who were the first to integrate state-of-the-art neural networks with E, discovered their models to be too slow to simply replace the traditional clause selection mechanism. Another interesting aspect is what features we allow the model to learn from. One could speculate that the recent success of ENIGMA on the Mizar dataset [7,17] can at least partially be explained by the involved problems sharing a common source and encoding. It is still open whether some new form of general "theorem proving knowledge" could be learned to improve the performance of an ATP across, e.g., the very diverse TPTP library.
In this paper, we propose several improvements to ENIGMA-style clause selection guidance.

– We lay out a set of possibilities for integrating the learned advice into the ATP and single out the recently developed layered clause selection [10,11,37] as particularly suitable for the task.
– We improve on the issue of evaluation speed by proposing a lazy evaluation scheme under which many generated clauses need not be evaluated by the potentially slow classifier.
– We demonstrate the importance of "positive bias", i.e., of tuning the classifier to rather err on the side of false positives than on the side of false negatives.
– Finally, we propose the use of "negative mining" for improving learning from proofs obtained while relying on previously learned guidance.

To test these ideas, we designed a Recursive Neural Network (RvNN) to classify clauses based solely on their derivation history and the presence or absence of automatically supplied theory axioms therein. (We evaluated a variation of this architecture on the Mizar dataset in [36].) This allows us to test here, as a byproduct of the conducted experiments, whether the human-engineered heuristic for controlling the amount of theory reasoning presented in [11] can be matched or even overcome by the automatically discovered neural guidance.

The rest of the paper is structured as follows. Sect. 2 recalls the necessary ATP theory, explains clause selection and how to improve it using ML. Sect. 3 covers layered clause selection and the new lazy evaluation scheme. In Sect. 4, we describe our neural architecture and in Sect. 5 we bring everything together and evaluate the presented ideas, using the prover Vampire as our workhorse and a relevant subset of smt-lib as the testing grounds. Finally, Sect. 6 concludes.
The technology behind the modern automatic theorem provers (ATPs) for first-order logic (FOL), such as E [31], Spass [41], or Vampire [20], can be roughly outlined by using the following three adjectives.
Refutational:
The task of the prover is to check whether a given conjecture G logically follows from given axioms A_1, . . . , A_n, i.e., whether

A_1, . . . , A_n ⊨ G,   (1)

where G and each A_i are FOL formulas. The prover starts by negating the conjecture G and transforming ¬G, A_1, . . . , A_n into an equisatisfiable set of clauses C. It then applies a sound logical calculus to iteratively derive further clauses, logical consequences of C, until the obvious contradiction in the form of the empty clause ⊥ is derived. This refutes the assumption that ¬G, A_1, . . . , A_n could be satisfiable and thus confirms (1).
Superposition-based:
The most popular calculus used in this context is superposition [3], an extension of ordered resolution [4] with a built-in support for handling equality (see also [23]). It consists of several inference rules, such as the resolution rule, factoring, subsumption, superposition, or demodulation.

Inference rules in general determine how to derive new clauses from old ones, where by old clauses we mean either the initial clauses C or clauses derived previously. The clauses that need to be present for a rule to be applicable are called the premises and the newly derived clause is called the conclusion. By applying the inference rules the prover gradually constructs a derivation, a directed acyclic (hyper-)graph (DAG), with the initial clauses forming the leaves and the derived clauses (labeled by the respective applied rules) forming the internal nodes. A proof is a sub-DAG of the derivation obtained by collecting the clauses and inference applications transitively incident with the final empty clause.
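Since the learning approach discussed later distinguishes exactly between proof clauses and the remaining selected ones, it may help to see the proof-extraction step spelled out. The following is a minimal sketch, assuming a dictionary-based derivation representation and illustrative clause names rather than any prover-internal data structures.

```python
from typing import Dict, List, Set

# A derivation as a DAG: each clause name maps to the list of its premises
# (empty for initial clauses). This representation and the clause names are
# illustrative assumptions, not the prover's internal data structures.
Derivation = Dict[str, List[str]]

def extract_proof(derivation: Derivation, empty_clause: str) -> Set[str]:
    """Collect the proof: the sub-DAG of clauses transitively needed
    to derive the final empty clause."""
    proof: Set[str] = set()
    stack = [empty_clause]
    while stack:
        clause = stack.pop()
        if clause in proof:
            continue
        proof.add(clause)
        stack.extend(derivation[clause])  # walk back to the premises
    return proof

# Example: C3 is derived from C1 and C2; the empty clause from C3 and C1.
derivation = {"C1": [], "C2": [], "C3": ["C1", "C2"], "empty": ["C3", "C1"]}
print(extract_proof(derivation, "empty"))  # {'C1', 'C2', 'C3', 'empty'}
```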
Saturation-based: A saturation algorithm is the concrete way of organizing the process of deriving new clauses, such that every applicable inference is eventually considered. Modern saturation-based ATPs employ some variant of the given-clause algorithm, in which clauses are selected for inferences one by one [27]. The process employs two sets of clauses, often called the active set A and the passive set P. At the beginning all the initial clauses are put to the passive set. Then in every iteration, the prover selects and removes a clause C from P, inserts it into A, and performs all the applicable inferences with premises in A such that (at least) one of the premises is C. The conclusions of these inferences are then inserted into P. This way the prover maintains (at the end of each iteration) the invariant that inferences among the clauses in the active set have been performed. The selected clause C is sometimes also called the "given clause", the active set is called "processed" and the passive set "unprocessed".

During a typical prover run, P grows much faster than A (the folklore says the growth is roughly quadratic). Analogously, although for different reasons, when a proof is discovered, it is often much smaller than A. Notice that every clause C ∈ A that is in the end not part of the proof did not need to be selected and represents a wasted effort. This explains why clause selection, i.e. the procedure for picking in each iteration the next clause to process, is one of the main heuristic decision points in the prover, one that hugely affects its performance [32].

There are two basic criteria that have been identified as generally correlating with the likelihood of a clause contributing to the yet-to-be-discovered proof. One is the clause's age or, more precisely, its "date of birth", typically implemented as an ever increasing timestamp. Preferring for selection old clauses to more recently derived ones corresponds to a breadth-first strategy and ensures fairness. The other criterion is the clause's size, referred to as weight in the ATP lingo, and is realized by some form of symbol counting. Preferring for selection small clauses to large ones is a greedy strategy, based on the observation that small conclusions typically belong to inferences with small premises and that the ultimate conclusion—the empty clause—is the smallest of all. The best results are achieved when these two criteria (or their variations) are combined [32].
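A schematic rendering of the given-clause loop with the two criteria may make the above concrete. This is only a sketch under assumed interfaces: the `infer` callback, the clause objects (exposing a `weight` attribute) and the termination test are illustrative stand-ins, not Vampire's actual implementation.

```python
import heapq
from itertools import count

def saturate(initial_clauses, infer, is_empty_clause, age_to_weight=(1, 10)):
    """Sketch of the given-clause algorithm with selection alternating
    between an age-ordered and a weight-ordered queue under a ratio."""
    birth = count()                       # ever-increasing "date of birth"
    by_age, by_weight = [], []            # the passive set P as two priority queues
    active, selected = [], set()          # the active set A
    age_picks, weight_picks = age_to_weight

    def insert(clause):
        t = next(birth)
        heapq.heappush(by_age, (t, t, clause))
        heapq.heappush(by_weight, (clause.weight, t, clause))

    for clause in initial_clauses:
        insert(clause)

    step = 0
    while by_age or by_weight:
        prefer_age = step % (age_picks + weight_picks) < age_picks
        queue = by_age if (prefer_age and by_age) or not by_weight else by_weight
        _, _, given = heapq.heappop(queue)
        if id(given) in selected:         # already picked via the other queue
            continue
        selected.add(id(given))
        step += 1
        if is_empty_clause(given):
            return given                  # refutation found
        active.append(given)
        for conclusion in infer(given, active):
            insert(conclusion)            # conclusions go back to passive
    return None                           # saturated without finding a proof
```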
To implement efficient clause selection by numerical criteria such as age and weight, an ATP represents the passive set P as a set of priority queues. A queue contains (pointers to) the clauses in P ordered by its respective criterion. Selection typically alternates between the available queues under a certain ratio. A successful strategy is, for instance, to select 10 clauses by weight for every clause selected by age, i.e., with an age-to-weight ratio of 1:10.

The idea to improve clause selection by learning from previous prover experience goes, to the best of our knowledge, back to [8,30] and has more recently been successfully employed by the ENIGMA system and others [15,16,7,14,21]. The experience is collected from successful prover runs, where each selected clause constitutes a training example and the example is marked as positive if the clause ended up in the discovered proof, and negative otherwise. A Machine Learning (ML) algorithm is then used to fit this data and produce a model M for classifying clauses into positive and negative, accordingly. A good learning algorithm produces a model M which not only accurately classifies the training data but also generalizes well to unseen examples. The computational costs of both training and evaluation are also important.

While clauses are logical formulas, i.e., discrete objects forming a countable set, ML algorithms, rooted in mathematical statistics, are primarily equipped to deal with fixed-sized real-valued vectors. Thus the question of how to represent clauses for the learning is the first obstacle that needs to be overcome before the whole idea can be made to work. In the beginning, the authors of ENIGMA experimented with various forms of hand-crafted numerical clause features [15,16]. An attractive alternative explored in later work [21,7,14] is the use of artificial neural networks, which can be understood as extracting the most relevant features automatically.

An important distinction can in both cases be made between approaches which have access to the concrete identity of predicate and function symbols (i.e., the signature) that make up the clauses, and those that do not. For example: Is the ML algorithm allowed to assume that the symbol grp_mult is used to represent the multiplication operation in a group or does it only recognize a general binary function? The first option can be much more powerful, but we need to ensure that the signature symbols are aligned and used consistently across the problems in our benchmark. Otherwise the learned advice cannot meaningfully carry over to previously unsolved problems. While the assumption of an aligned signature has been employed by the early systems [15,21], the most recent version of ENIGMA [14,24] can work in a "signature agnostic" mode. In this work we represent clauses solely by their derivation history, deliberately ignoring their logical content. This places our approach in the second group.

Once we have a trained model M, an immediate possibility for integrating it into the clause selection procedure is to introduce a new queue that will order the clauses using M. Two basic versions of this idea have been described:

"Priority": The ordering puts all the clauses classified by M as positive before those classified negatively.
Within the two classes, older clauses are preferred. Let us for the purposes of future reference "semi-formally" denote this scheme as M_{0,1}. It has been successfully used by the early ENIGMAs [15,16,7].

"Logits": Even models officially described as binary classifiers typically internally compute a real-valued estimate L of how much "positive" or "negative" an example appears to be and only turn this estimate into a binary decision in the last step, by comparing it against a fixed threshold t, most often 0. In machine-learning slang, this estimate L is called the logit. (A logit can be turned into a formal probability, i.e. a value between 0 and 1, by passing it, as is typically done, through the sigmoid function σ(x) = 1/(1 + e^{−x}).)

The second version orders the clauses on the new queue by the "raw" logits produced by a model. We denote it M_{−R} to stress that clauses with high L are treated as small from the perspective of the selection and therefore preferred. This scheme has been used by Loos et al. [21] and in the latest ENIGMA [14,38].
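As an illustration of the difference between the two schemes, the following sketch shows how they can be realized as ordering keys for the new queue (smaller keys are selected first). The `model.logit` interface and the clause attributes are assumptions made for the sake of the example.

```python
def m_priority_key(clause, model, threshold=0.0):
    # M_{0,1}: clauses classified as positive come first (key component 0);
    # within the two classes, older clauses (smaller age) are preferred.
    positive = model.logit(clause) >= threshold
    return (0 if positive else 1, clause.age)

def m_logit_key(clause, model):
    # M_{-R}: order by the raw logit; a high logit makes the clause "small",
    # i.e. preferred for selection.
    return -model.logit(clause)
```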
Combining with a traditional strategy. While it is possible to rely exclusively on selection governed by the model, it turns out to be better [7] to combine it with the traditional heuristics. The most natural choice is to take S, the original strategy that was used to generate the training data, and extend it by adding the new queue, be it M_{0,1} or M_{−R}, next to the already present queues. We then again supply a ratio under which the original selection from S and the new selection based on M get alternated. We will denote this kind of combination with the original strategy as S ⊕ M_{0,1} and S ⊕ M_{−R}, respectively.

Layered Clause Selection (LCS) is a recently developed method [10,11,37] for smoothly incorporating a categorical preference for certain clauses into a base clause selection strategy S. In this paper, we will readily use it in combination with the binary classifier advice from a trained model M. When we instantiate LCS to our particular case, its function can be summarized by the expression

S ⊕ S[M].

In words, the base selection strategy S is alternated with S[M], the same selection scheme S but applied only to clauses classified positively by M. (We rely here on the monotone mode of split from [10]. There is also a disjoint mode.) Implicit here is a convention that whenever there is no positively classified passive clause, a fallback to plain S occurs. Additionally, we again specify a "second-level" ratio to govern the alternation between pure S and S[M]. The main advantage of LCS, compared to the options outlined in the previous section, is that the original, typically well-tuned, base selection mechanism S is also applied to the clauses classified positively by M.

It is often the case that evaluating a clause by the model M is a relatively expensive operation (see, e.g., [21]). As we explain here, however, this operation can be avoided in many cases, especially when using LCS to integrate the advice. We propose the following lazy evaluation approach to be used with S ⊕ S[M]. Every clause inserted into the passive set P is initially registered by both S and S[M] without being evaluated by M. Then, whenever (as governed by the second-level ratio) it is the moment to select a clause from S[M], the algorithm

1. picks (as usual, according to S) the best clause C in S[M],
2. only then evaluates C by M, and
3. if C gets classified as negative, it forgets C and goes back to 1.

This repeats until the first positively classified clause is found, which is then returned. Note that this way the "observable behaviour" of S[M] is preserved. The power of lazy evaluation lies in the fact that not every clause needs to be evaluated before a proof is found. Indeed, recall the remark that the passive set P is typically much larger than the active set A, which also holds on a typical successful termination. Every clause left in passive at that moment is a clause that did not need to be evaluated by M thanks to lazy evaluation. We remark that lazy evaluation can similarly be used with the integration mode M_{0,1} based on priorities. We experimentally demonstrate the effect of the technique in Sect. 5.4.
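The following sketch combines the S ⊕ S[M] split with lazy evaluation. The base ordering key, the `model.classify` call and the 1:2 second-level ratio are illustrative assumptions; in particular, a real implementation would share clause storage between the two layers instead of the two independent heaps used here.

```python
import heapq
from itertools import count

class LazyLayeredSelection:
    """Sketch of S (+) S[M] with lazy evaluation: clauses are registered by
    both layers without being evaluated; the model is queried only when the
    S[M] layer is asked to select, and only until the first positively
    classified clause is found."""

    def __init__(self, model, base_key, second_level_ratio=(1, 2)):
        self.model = model                 # exposes classify(clause) -> bool
        self.base_key = base_key           # the base strategy S as a sort key
        self.plain_picks, self.guided_picks = second_level_ratio
        self.tiebreak = count()
        self.plain, self.guided = [], []   # two views of the passive set
        self.already_selected = set()
        self.step = 0

    def insert(self, clause):
        entry = (self.base_key(clause), next(self.tiebreak), clause)
        heapq.heappush(self.plain, entry)  # no model evaluation on insertion
        heapq.heappush(self.guided, entry)

    def _pop(self, queue):
        while queue:
            _, _, clause = heapq.heappop(queue)
            if id(clause) not in self.already_selected:
                return clause
        return None

    def _select_guided(self):
        while True:
            clause = self._pop(self.guided)
            if clause is None or self.model.classify(clause):
                return clause              # evaluated lazily; negatives are forgotten by S[M]

    def select(self):
        period = self.plain_picks + self.guided_picks
        use_guided = self.step % period >= self.plain_picks
        self.step += 1
        clause = self._select_guided() if use_guided else None
        if clause is None:                 # fallback to plain S
            clause = self._pop(self.plain)
        if clause is not None:
            self.already_selected.add(id(clause))
        return clause
```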
In this work we choose to represent a clause, for the purpose of learning, solely by its derivation history. Thus a clause can only be distinguished by the axioms from which it was derived and by the precise way in which these axioms interacted with each other through inferences in the derivation. This means we deliberately ignore the clause's logical content.

We decided to focus on this representation because it promises to be fast. Although an individual clause's derivation history may be large, it is a simple function of its parents' histories (just one application of an inference rule). Moreover, before a clause with a complicated history can be selected, most of its ancestors will have been selected already. (Exceptions are caused by simplifying inferences applied eagerly outside of the governance of the main clause selection mechanism.) This promises the amortised cost of evaluating a single clause to be a constant.
A second motivation comes from our recent work [11], where we have shown that theory reasoning facilitated by automatically adding theory axioms for axiomatising theories, while in itself a powerful technique, often leads the prover to unpromising parts of the search space. In [11] we developed a heuristic for controlling the amount of theory reasoning in the derivation of a clause. Our goal here is to test whether a similar or even stronger heuristic can be automatically discovered by a neural network.

Examples of axioms that Vampire uses to axiomatise theories include the commutativity or associativity axioms for the arithmetic operations, an axiomatization of the theory of arrays [6] or of the theory of term algebras [19]. For us it is mainly important that the axioms are introduced internally by the prover and can therefore be consistently identified across individual problems.
A Recursive Neural Network (RvNN) is a network created by recursively composing a finite set of neural building blocks over a structured input [12]. A general neural block is a function N_θ : R^k → R^l depending on a vector of parameters θ that can be optimized during training (see below in Section 4.3). In our case, the structured input is a clause derivation, i.e. a DAG with nodes identified with the derived clauses. To enable a recursion, an RvNN represents each node C by a real vector v_C (of a fixed dimension n) called an embedding. During training a network learns to embed the space of derivable clauses into R^n in some a priori unknown, but still useful way.

We assume that each initial clause C, a leaf of the derivation DAG, is labeled as belonging to one of the automatically added theory axioms or coming from the user input. Let these labels form a finite set of axiom origin labels L_A. Furthermore, let the applicable inference rules that label the internal nodes of the DAG form a finite set of inference rule labels L_R. The specific building blocks of our neural architecture are the following three (indexed families of) functions:

– for every axiom label l ∈ L_A, a "null-ary" init function I_l ∈ R^n which to an initial clause C labeled by l assigns its embedding v_C := I_l,
– for every inference rule r ∈ L_R, a deriv function D_r : R^n × · · · × R^n → R^n which to a conclusion clause C_c derived by r from premises (C_1, . . . , C_k) with embeddings v_{C_1}, . . . , v_{C_k} assigns the embedding v_{C_c} := D_r(v_{C_1}, . . . , v_{C_k}),
– and, finally, a single eval function E : R^n → R which evaluates an embedding v_C such that the corresponding clause C is classified as positive whenever E(v_C) ≥ t, with the threshold t set, by default, to 0.

By recursively composing the init and deriv functions, any derived clause C can be assigned an embedding v_C and also evaluated by E to see whether the network recommends it as positive, i.e., as a clause that should be preferred in proof search.
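As a concrete, framework-free reading of this recursion, the following sketch memoises the embeddings over the DAG; the clause interface (premises, axiom_label, rule_label) and the dictionaries of trained blocks are assumptions made for illustration.

```python
class DerivationEmbedder:
    """Sketch of embedding clauses by recursively composing the trained
    building blocks: init functions I_l, deriv functions D_r and the eval
    function E. Memoisation keeps the amortised cost per clause constant."""

    def __init__(self, init, deriv, evaluate, threshold=0.0):
        self.init = init              # axiom label l -> vector I_l
        self.deriv = deriv            # rule label r -> function D_r
        self.evaluate = evaluate      # embedding -> logit, the function E
        self.threshold = threshold    # classification threshold t
        self.cache = {}               # clause -> embedding v_C

    def embed(self, clause):
        if clause in self.cache:
            return self.cache[clause]
        if not clause.premises:       # a leaf: an initial clause
            v = self.init[clause.axiom_label]
        else:                         # an internal node: apply D_r to the parents
            v = self.deriv[clause.rule_label](
                *[self.embed(p) for p in clause.premises])
        self.cache[clause] = v
        return v

    def is_positive(self, clause):
        # the network recommends the clause whenever E(v_C) >= t
        return self.evaluate(self.embed(clause)) >= self.threshold
```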
Here we outline the details of our architecture for the benefit of neural network practitioners. All the used terminology is standard (see, e.g., [13]).

We realized each init function I_l as an independent learnable vector. Similarly, each deriv function D_r was independently defined. For a rule of arity two, such as resolution, we used

D_r(v_1, v_2) = LayerNorm(y),  y = W_r^2 · x + b_r^2,  x = ReLU(W_r^1 · [v_1, v_2] + b_r^1),

where [·, ·] denotes vector concatenation, ReLU is the Rectified Linear Unit non-linearity (f(x) = max{0, x}) applied component-wise, and the learnable matrices W_r^1, W_r^2 and vectors b_r^1, b_r^2 are such that x ∈ R^{2n} and y ∈ R^n. (We took inspiration for doubling the embedding size before applying the non-linearity from [29].) Finally, LayerNorm is a layer normalization [2] module, without which training often became numerically unstable for deeper derivation DAGs. (We also tried to skip LayerNorm and replace ReLU by the hyperbolic tangent function. This restores stability, but does not train or classify so well.)

For unary inference rules, such as factoring, we used an equation analogous to the above, except for the concatenation operation. We did not need to model an inference rule with a variable number of premises, but one option would be to arbitrarily "bracket" its arguments into a tree of binary applications. Finally, the eval function was

E(v) = W_2 · ReLU(W_1 · v + b) + c

with trainable W_1 ∈ R^{n×n}, b ∈ R^n, W_2 ∈ R^{1×n}, and c ∈ R.

To train a network means to find values for the trainable parameters such that it accurately classifies the training data and ideally also generalises to unseen future cases. We follow a standard methodology for training our RvNN. In particular, we use the Gradient Descent (GD) optimization algorithm (with the Adam optimiser [18]) minimising the typical Binary Cross Entropy loss, composed as a sum of contributions, for every selected clause C, of the form

−y_C · log(σ(E(v_C))) − (1 − y_C) · log(1 − σ(E(v_C))),

with y_C = 1 for the positive and y_C = 0 for the negative examples. These contributions are weighted such that each derivation DAG (corresponding to a prover run on a single problem) receives equal weight. Moreover, within each DAG we re-scale the influence of the positive versus the negative examples such that these two categories contribute evenly. The scaling is important as our training data is highly unbalanced (cf. Sect. 5.1).

We split the available successful derivations into a training set and a validation set, and only train on the first set, using the second to observe generalisation to unseen examples. As the GD algorithm progresses, iterating over the training data in rounds called epochs, we evaluate the loss on the validation set and stop the process early if this loss does not decrease for a specified period. This early stopping criterion was important to produce a model that generalizes well.

As another form of regularisation, i.e. a technique for preventing overfitting to the training data, we employ dropout [35] (independently for each "read" of a clause embedding by one of the deriv or eval functions). Dropout means that at training time each component v_i of the embedding v has a certain probability of being zeroed out. This "voluntary brain damage" makes the network more robust as it prevents neurons from forming too complex co-adaptations [35]. Finally, we experimented with using non-constant learning rates as suggested in [33,34] and [39]. The schedule used in our experiment uses a linear warmup for the first 50 epochs followed by a hyperbolic cooldown (cf. Fig. 1 in Sect. 5.2).
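For concreteness, here is a minimal PyTorch sketch of the two kinds of blocks defined above. Module and label names are illustrative; only the shapes and the ReLU/LayerNorm structure follow the equations, and no claim is made that this matches the released implementation line by line.

```python
import torch
import torch.nn as nn

class BinaryDeriv(nn.Module):
    """D_r for an inference rule of arity two:
    x = ReLU(W_r1 [v1, v2] + b_r1), y = W_r2 x + b_r2, D_r(v1, v2) = LayerNorm(y)."""
    def __init__(self, n: int):
        super().__init__()
        self.lin1 = nn.Linear(2 * n, 2 * n)   # x keeps the doubled size 2n before the non-linearity
        self.lin2 = nn.Linear(2 * n, n)       # y is back at the embedding size n
        self.norm = nn.LayerNorm(n)

    def forward(self, v1: torch.Tensor, v2: torch.Tensor) -> torch.Tensor:
        x = torch.relu(self.lin1(torch.cat([v1, v2], dim=-1)))
        return self.norm(self.lin2(x))

class Eval(nn.Module):
    """E(v) = W_2 ReLU(W_1 v + b) + c, producing the clause logit."""
    def __init__(self, n: int):
        super().__init__()
        self.lin1 = nn.Linear(n, n)
        self.lin2 = nn.Linear(n, 1)

    def forward(self, v: torch.Tensor) -> torch.Tensor:
        return self.lin2(torch.relu(self.lin1(v))).squeeze(-1)

n = 64                                        # embedding size used in Sect. 5.2
init = nn.ParameterDict({                     # one learnable vector I_l per origin label
    "input": nn.Parameter(torch.randn(n)),
    "theory_axiom_example": nn.Parameter(torch.randn(n)),   # illustrative label
})
deriv = nn.ModuleDict({"resolution": BinaryDeriv(n), "superposition": BinaryDeriv(n)})
evaluate = Eval(n)

v = deriv["resolution"](init["input"], init["theory_axiom_example"])
logit = evaluate(v)                           # classified as positive whenever logit >= t
```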
Since our representation of clauses deliberately discards information, we end up encountering distinct clauses indistinguishable from the perspective of the network. For example, every initial clause C originating from the input problem (as opposed to being added as a theory axiom) receives the same embedding v_C = I_input. Indistinguishable clauses also arise as conclusions of an inference that can be applied in more than one way to certain premises. Mathematically, we deal with an equivalence relation ∼ on clauses based on "having the same derivation tree":

C_1 ∼ C_2 ⟺ derivation(C_1) = derivation(C_2).

The "fingerprint" derivation(C) of a clause could be defined as a formal expression recording the derivation history of C using the labels from L_A as "null-ary" operators and those from L_R as operators with arities of the corresponding inference rules. For example: Resolution(axiom, Factoring(input)).

We made use of this equivalence in our implementation in two places:

1. When preparing the training data. We "compressed" each derivation DAG as a factorisation by ∼, keeping only one representative of each class. A class containing a positive example was marked as a positive example.
2. When interfacing the trained model from the ATP. We cached the embeddings (and evaluated logits) for the already encountered clauses under their class identifier.

Sect. 5.4 evaluates the effect of this technique.
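A sketch of the second use, abstraction caching, is given below; the `evaluate` callback stands for the (expensive) embedding-plus-eval computation, and the clause interface is again an illustrative assumption.

```python
class AbstractionCache:
    """Sketch of abstraction caching: clauses with the same derivation
    fingerprint (the ~-equivalence above) share one cached logit."""

    def __init__(self, evaluate):
        self.evaluate = evaluate      # clause -> logit, the expensive model call
        self.fingerprints = {}        # clause -> fingerprint
        self.logits = {}              # fingerprint -> cached logit

    def fingerprint(self, clause):
        if clause in self.fingerprints:
            return self.fingerprints[clause]
        if not clause.premises:       # a leaf keeps just its origin label
            fp = clause.axiom_label
        else:                         # e.g. ('Resolution', ('axiom', ('Factoring', ('input',))))
            fp = (clause.rule_label,
                  tuple(self.fingerprint(p) for p in clause.premises))
        self.fingerprints[clause] = fp
        return fp

    def logit(self, clause):
        fp = self.fingerprint(clause)
        if fp not in self.logits:     # evaluate one representative per class
            self.logits[fp] = self.evaluate(clause)
        return self.logits[fp]
```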
We implemented the infrastructure for training an RvNN clause derivation classifier (as described in Sect. 4) in Python, relying on the PyTorch (version 1.7) library [25] and its TorchScript extension for interfacing the trained model from C++. We modified the automatic theorem prover Vampire (version 4.5.1) 1) to optionally record the constructed derivation, including information on which clauses are being selected, to a log file (the logging mode), and 2) to be able to load a trained TorchScript model and use it for clause selection guidance under various modes of integration (detailed in sections 2.3 and 3). (The necessary supplementary materials can be found at https://git.io/JtHNl.)

We took the same subset of 20 795 problems from the smt-lib library [5] as in [11]; it was formed as the largest set of problems in a fragment supported by Vampire, not known to be satisfiable, and not solvable by Vampire in 10 s by two particularly simple (from the perspective of theory reasoning) strategies.
As the baseline strategy S we took Vampire's implementation of the Discount saturation loop under the age-to-weight ratio 1:10 (which typically performs well with Discount), keeping all other settings default, including the enabled Avatar architecture. We later enhanced this S with various forms of guidance. All the benchmarking was done using a 10 s time limit. (Running on Intel(R) Xeon(R) Gold 6140 CPUs @ 2.30 GHz.)

During an initial run, the baseline strategy S was able to solve 734 problems under the 10 s time limit. We collected the corresponding successful derivations using the logging mode (and lifting the time limit, since the logging causes a non-negligible overhead) and processed them into a form suitable for training a neural model. Vampire used 31 distinct theory axioms to facilitate theory reasoning. Including the "user input" label for clauses coming from the actual problem files, there were in total 32 distinct labels for the derivation leaves. In addition, we recorded 15 inference rules, such as resolution, superposition, backward and forward demodulation or subsumption resolution, and including one rule for the derivation of a component clause in Avatar [40,26]. Thus we obtained 15 distinct labels for the internal nodes. We compressed these derivations, identifying clauses with the same "abstract derivation history" dictated by the labels, as described in Sect. 4.4. This reduced the size of the derivation set.

Since the size of the training set is relatively small, we instantiated the architecture described in Sect. 4.2 with embedding size n = 64 and dropout probability p = 0.3.
We trained for 100 epochs, with a non-constant learning rate peaking in epoch 50. Every epoch we computed the loss on the validation set and selected the model which minimizes this quantity. This was the model from epoch 45 in our case, which we will denote M here.

The development of the training and validation loss throughout training, as well as that of the learning rate, is plotted in Fig. 1. Additionally, the right side of the figure allows us to compare the validation loss—an ML estimate of the model's ability to generalize—with the ultimate metric of practical generalization, namely the number of in-training-unseen problems solved by Vampire equipped with the corresponding model for guidance (integrated using the layered scheme with a second-level ratio 2:1; cf. Sect. 5.3).

Fig. 1. Training the neural model. Red: the training (left) and validation (right) loss as a function of training time; shaded: per-problem weighted standard deviations. Blue (left): development of the learning rate throughout training. Green (right): in-training-unseen problems solved by Vampire equipped with the corresponding model.
Vampire equipped with the corresponding model for guidance. We can see that the proxyand the target correspond quite well, at least to the degree that we measuredthe highest ATP gain with the validation-loss-minimizing M .We remark that this assurance was not cheap to obtain. While the whole100 epoch training took 45 minutes to complete (using 20 workers and 1 masterprocess in a parallel training setup), each of the 20 ATP evaluation data pointscorresponds to approximately 2 hours of 30 core computation. In this part of the experiment we tested the various ways of integrating the learntadvice as described in sections 2.3 and 3. Let us recall that these are the singlequeue schemes M − R and M , based on the raw logits and the binary decision,respectively, their combinations S ⊕ M − R and S ⊕ M , with the base strategy S under some second level ratio, and, finally, S ⊕ S [ M ], the integration of theguidance by the layered clause selection scheme.Our results are shown in Table 1. It starts by reporting on the performanceof the baseline strategy S and then compares it to the other strategies. Wecan see that the two single queue approaches are quite weak, with the better M , solving only 25 % of the baseline. Nor can the combination S ⊕ M − R beconsidered a success, as it only solves more problems when less and less adviceis taken, seemingly approaching the performance of S from below. This trend Integrated using the layered scheme with a second level ratio 2:1 (cf. Sect. 5.3). We had to switch to a different machine after producing the training data. There,a rerun of S gave a slightly better performance than the 734 solved problems usedfor training. We still used the original run’s results to compute the gained and lostvalues here; the percentage solved is with respect to the new run of S .2 M. Sudastrategy ratio M eval. time% S ) gained lost S − . M − R − . . M , − . .
Table 1. Performance results of various forms of integrating the model advice (columns: strategy, second-level ratio, M evaluation time %, problems solved as a percentage of S, problems gained, and problems lost).
This trend repeats with S ⊕ M_{0,1}, although here an interesting number of problems not solved by the baseline is gained by strategies which rely on the advice more than half of the time. With our model M, only the layered clause selection integration S ⊕ S[M] is able to improve on the performance of the baseline strategy S. In fact, it improves on it very significantly: with the second-level ratio of 1:2 we achieve over 137 % of the baseline performance.

Table 1 also shows the percentage of computation time the individual strategies spent evaluating the advice, i.e. interfacing M. A word of warning first. These numbers are hard to interpret across different strategies. It is because different guidance steers the prover to different parts of the search space. For example, notice the seemingly paradoxical situation, most pronounced with S ⊕ M_{−R}, where the more often the advice from M is nominally requested, the less time the prover spends interfacing M. Looking closely at a few problems, we discovered that in strategies relying a lot on M_{−R}, such as S ⊕ M_{−R} under the ratio 1:5, most of the time is spent performing forward subsumption. An explanation is that the guidance becomes increasingly bad and the prover slows down, processing larger and larger clauses for which the subsumption checks are expensive and dominate the runtime. (A similar experience with bad guidance has been made by the authors of ENIGMA.)

When the guidance is the same, however, we can use the eval time percentage to estimate the efficiency of the integration.
Table 2. Performance decrease caused by turning off abstraction caching and lazy evaluation, and both; demonstrated on S ⊕ S[M] under the second-level ratio 1:2.
Fig. 2. The Receiver Operating Characteristic curve (left) and a related plot with explicit threshold (right) for the selected model M; both based on validation data.

The results shown in Table 1 were obtained using both lazy evaluation and abstraction caching (as described in sections 3.1 and 4.4), with the exception of the M_{−R} guidance, with which lazy evaluation is incompatible. Taking the best performing S ⊕ S[M] under the second-level ratio 1:2, we selectively disabled: first abstraction caching, then lazy evaluation and finally both techniques, obtaining the values shown in Table 2. We can see that the techniques considerably contribute to the overall performance. Indeed, without them Vampire would spend over 73 % of the computation time interfacing M.

Two important characteristics, from the machine learning perspective, of an obtained model are the true positive rate (TPR) (also called sensitivity) and the true negative rate (TNR) (also specificity). TPR is defined as the fraction of positively labeled examples which the model also classifies as such. TNR is, analogously, the fraction of negatively labeled examples classified as negative. Our model M achieves a TPR of over 86 % on the validation set. Both rates are determined by the classification threshold t, set by default to t = 0 (recall Sect. 4.1). Changing this threshold allows us to trade TPR for TNR and vice versa in a straightforward way. The interdependence of these two values on the varied threshold is traditionally captured by the so-called Receiver Operating Characteristic (ROC) curve, shown for our model in Fig. 2 (left).
Table 3. The performance of S ⊕ S[M] under the second-level ratio 1:2 while changing the logit threshold (columns: threshold, problems solved as a percentage of S, gained, lost). A smaller threshold means more clauses classified as positive.

The tradition dictates that the x axis be labeled by the false positive rate (FPR) (also called fall-out), which is simply 1 − TNR. Under such a presentation, one generally strives to pick a threshold value at which the curve is the closest to the upper left corner of the plot. (Minimizing the standard cross entropy loss should actually automatically "bring the curve" close to that corner for the threshold t = 0.) However, this is not necessarily the best configuration for every application.

In Fig. 2 (right), we "decompose" the ROC curve by using the threshold t for the independent axis x. We also highlight, for every problem (again, in the validation set), the minimal logit value across all positively labeled examples belonging to that problem; in other words, the logit of the "least positively classified" clause from the problem's proof. We can see that for the majority of the problems these minima are below the threshold t = 0. This means that for those problems at least one clause from the original proof is getting classified as negative by M under t = 0.

These observations motivated us to experiment with non-zero values of the threshold in an ATP evaluation. Particularly promising seemed the use of a threshold t smaller than zero, with the intention of classifying more clauses as positive. The results of the experiment are shown in Table 3. Indeed, we could further improve the best performing strategy from Table 1 with both t = −0.25 and t = −0.5.
It can be seen that smaller values lead to fewer problems lost, but even the ATP gain is better with t = −0.25 than with the default t = 0, leading to the overall best improvement of over 141 % of the performance of S.
As previously unsolved problems get proven with the help of the trained guidance, the new proofs can be used to enrich the training set and potentially help obtaining even better models. This idea of alternating the training and the ATP evaluation steps in a reinforcing loop has been proposed and successfully realized by the authors of ENIGMA on the Mizar dataset [17]. Here we propose an enhancement of the idea and repeat an analogous experiment in our setting.

By collecting proofs discovered by a selection of 8 different configurations tested in the previous sections, we grew our set of solved problems from 734 to 1528. We decided to keep one proof per problem, strictly extending the original training set.
Table 4. The performance of new models learned from guided proofs. U is the set of 1528 problems used for the training. The gained and lost counts are here w.r.t. U.
We then repeated the same training procedure as described in Sect. 5.2 on this new set and on an extension of this set obtained as follows.

Negative mining: We suspected that the successful derivations obtained with the help of M might not contain enough "typical wrong decisions" from the perspective of S to provide for good enough training. (There is a related idea of "hard negative mining" used in computer vision.) We therefore logged the failing runs of S on the (1528 − 734) problems newly solved only with the guidance and used the clauses selected in these runs as an additional source of negative examples.

Table 4 shows that negative mining indeed helps to produce a better model. Mainly, however, it shows that training from additional derivations further dramatically improves the performance of the obtained strategy. (The ATP evaluation again integrated the advice via S ⊕ S[M] under the second-level ratio 1:2.)

We revisited the topic of ENIGMA-style clause selection guidance by a machine-learned binary classifier and proposed four improvements to previous work: 1) the use of layered clause selection for integrating the advice, 2) the lazy evaluation trick to reduce the overhead of interfacing a potentially expensive model, 3) the "positive bias" idea suggesting to be really careful not to discard potentially useful clauses, and 4) the "negative mining" technique to provide enough negative examples when learning from proofs obtained with previous guidance.

We have also shown that strong advice can be obtained by looking just at the derivation history to discriminate a clause. The automatically discovered neural guidance significantly improves upon the human-engineered heuristic presented in [11] under identical conditions. As reported elsewhere, we achieved a similar success with the architecture on the Mizar benchmark [36].

By deliberately focusing on the representation of clauses by their derivations, we obtained some nice properties, such as relative speed of evaluation. However, in situations where theory reasoning by automatically added theory axioms is not prevalent, such as on most of the TPTP library, we expect guidance based on derivations with just a single axiom origin label, the input, to be quite weak. Still, we see a great opportunity in using statistical methods for analyzing ATP behaviour; not only for improving prover performance with a "blackbox" guidance, but also as a tool for discovering regularities that could be "brought back" to improve our understanding of the technology on a deeper level.
Acknowledgement
This work was supported by the Czech Science Foundation project 20-06390Y and the project RICAIP no. 857306 under the EU-H2020 programme.
References
1. Avenhaus, J., Denzinger, J., Fuchs, M.: DISCOUNT: A system for distributed equational deduction. In: Hsiang, J. (ed.) Rewriting Techniques and Applications, 6th International Conference, RTA-95, Kaiserslautern, Germany, April 5-7, 1995, Proceedings. Lecture Notes in Computer Science, vol. 914, pp. 397–402. Springer (1995). https://doi.org/10.1007/3-540-59200-8_72
2. Ba, L.J., Kiros, J.R., Hinton, G.E.: Layer normalization. CoRR abs/1607.06450 (2016), http://arxiv.org/abs/1607.06450
3. Bachmair, L., Ganzinger, H.: Rewrite-based equational theorem proving with selection and simplification. J. Log. Comput. 4(3), 217–247 (1994). https://doi.org/10.1093/logcom/4.3.217
4. Bachmair, L., Ganzinger, H., McAllester, D.A., Lynch, C.: Resolution theorem proving. In: Robinson and Voronkov [28], pp. 19–99. https://doi.org/10.1016/b978-044450813-3/50004-7
5. Barrett, C., Fontaine, P., Tinelli, C.: The Satisfiability Modulo Theories Library (SMT-LIB). cs.SC/0310056 (2003), http://arxiv.org/abs/cs/0310056
23. Nieuwenhuis, R., Rubio, A.: Paramodulation-based theorem proving. In: Robinson and Voronkov [28], pp. 371–443. https://doi.org/10.1016/b978-044450813-3/50009-6
24. Olšák, M., Kaliszyk, C., Urban, J.: Property invariant embedding for automated reasoning. In: Giacomo, G.D., Catalá, A., Dilkina, B., Milano, M., Barro, S., Bugarín, A., Lang, J. (eds.) ECAI 2020 - 24th European Conference on Artificial Intelligence, Santiago de Compostela, Spain, August 29 - September 8, 2020 - Including 10th Conference on Prestigious Applications of Artificial Intelligence (PAIS 2020). Frontiers in Artificial Intelligence and Applications, vol. 325, pp. 1395–1402. IOS Press (2020). https://doi.org/10.3233/FAIA200244
25. Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., Chintala, S.: PyTorch: An imperative style, high-performance deep learning library. In: Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., Garnett, R. (eds.) Advances in Neural Information Processing Systems 32, pp. 8024–8035. Curran Associates, Inc. (2019), http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf
26. Reger, G., Suda, M., Voronkov, A.: Playing with AVATAR. In: Felty, A.P., Middeldorp, A. (eds.) Automated Deduction - CADE-25 - 25th International Conference on Automated Deduction, Berlin, Germany, August 1-7, 2015, Proceedings. Lecture Notes in Computer Science, vol. 9195, pp. 399–415. Springer (2015). https://doi.org/10.1007/978-3-319-21401-6_28
27. Riazanov, A., Voronkov, A.: Limited resource strategy in resolution theorem proving. J. Symb. Comput. 36(1–2), 101–115 (2003)
35. Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15(1), 1929–1958 (2014)