Testing a Saturation-Based Theorem Prover: Experiences and Challenges (Extended Version)
Giles Reger (University of Manchester, Manchester, UK), Martin Suda (TU Wien, Vienna, Austria), and Andrei Voronkov (University of Manchester, UK; Chalmers University of Technology, Gothenburg, Sweden; EasyChair)
Abstract.
This paper attempts to address the question of how best to assure the correctness of saturation-based automated theorem provers, using our experience developing the theorem prover Vampire. We describe the techniques we currently employ to ensure that Vampire is correct and use this to motivate future challenges that need to be addressed to make this process more straightforward and to achieve better correctness guarantees.
1 Introduction

This paper considers the problem of checking that a saturation-based automated theorem prover is correct. We consider this question within the context of the Vampire theorem prover [8], but many of our discussions generalise to similar theorem provers such as E [14], SPASS [17], and iProver [7]. We discuss what we mean by this correctness, give examples of bugs that we have encountered while developing Vampire, describe how we detect such bugs and, as our main contribution, outline the challenges that need to be addressed.

It should not be necessary to motivate the general need to ensure that any piece of software is correct. However, the cost of incorrect software is not uniform, and here we briefly motivate why ensuring that theorem provers, such as Vampire, are correct is of significant importance. Theorem provers are often used as black boxes in other techniques (e.g. program verification), and those techniques rely on the results of the theorem prover for the correctness of their own results. Another area that makes use of automated theorem provers is the application of so-called hammers [9,6] in interactive theorem proving. These combinations usually provide functionality to reconstruct the proofs of the automated theorem provers using their own trusted kernels, although they also offer users the option to skip such steps.

It is clear that correctness is important here, so how are we doing? Most theorem provers seem to be generally correct. However, cases of unsoundness are not uncommon. In SMT-COMP 2016 there were 603 conflicts (solvers returning different results) on 73 benchmarks.

⋆ This work was supported by EPSRC Grant EP/K032674/1. Martin Suda and Andrei Voronkov were partially supported by ERC Starting Grant 2014 SYMCAR 639270. Martin Suda was also partially supported by the Austrian research projects FWF S11403-N23 and S11409-N23. Andrei Voronkov was also partially supported by the Wallenberg Academy Fellowship 2014 - TheProSE.
This turned out to be three solvers giving incorrect results for various reasons (see http://smtcomp.sourceforge.net/2016/). In the CASC competition [16], there is a period of testing during which soundness issues are detected and resolved, and a number of solvers have later been disqualified from the competition due to unsoundness. In our experience, adding a new feature to a theorem prover is a highly complex task and it is very easy to introduce unsoundness, or general incorrectness, especially in areas of the code that are encountered very infrequently during proof search.

The paper makes the following contributions:

– An overview of what we mean by correctness with respect to saturation-based theorem provers (Section 2).
– A description of the approach we take to bug finding (Section 3), which involves sampling the input space of proof search heuristics and problems.
– A sample of real-world bugs that we have found in Vampire that demonstrates the complexity of this problem (Section 4).
– A presentation of proof checking approaches that can help ensure the correctness of proofs (Section 5) and a reflection on what cannot currently be checked.
– A set of challenges (Section 6) that need to be addressed to produce a better solution to this problem in general.

The aim of this paper is to provide sufficient context to explain the challenges outlined in Section 6. Addressing these challenges is part of our current ongoing research.
2 What Does it Mean to be Incorrect?

We separate four different ways in which Vampire can be incorrect:

– Incorrect Result
– Program Crash
– Assertion Violation
– Performance Bug

We will now briefly discuss what we mean by each class of incorrect behaviour. The most damaging (and interesting) of these is an incorrect result, and we discuss this first.
2.1 Incorrect Results

To understand what correct and incorrect results mean for Vampire we need to introduce some of the theoretical foundations of the underlying technique. We note that the approach used by Vampire is the same as that taken by other first-order theorem provers, so these discussions, and the challenges outlined in Section 6, generalise beyond Vampire.

Vampire accepts problems (formulas) of the form

(Premise_1 ∧ ... ∧ Premise_n) → Conjecture    (1)

and can give one of three answers:

– Theorem, if the problem is true in all models,
– Non-Theorem, if there are models in which (1) is false, and
– Unknown, if Vampire cannot deduce one of the previous answers.

Providing one of the first two results when that result does not hold is clearly incorrect. Providing Unknown as the result is incorrect in the sense that there is a known answer but, due to the undecidability of first-order logic and the general hardness of the problem, it is often unavoidable. However, as discussed below, we should understand the different ways in which Unknown can be produced as a result. Note that Unknown will be returned if Vampire exceeds either the time or memory allotted to it.

More specifically, Vampire is a refutational theorem prover; it establishes the validity of problems of the form (1) by detecting the unsatisfiability of the negation:
Premise_1 ∧ ... ∧ Premise_n ∧ ¬Conjecture.    (2)

This works by translating (2) into a set of clauses S and adding consequences of S until the contradiction false is derived or all possible consequences have been added. This process is called saturation and may, in general, not terminate for a satisfiable set S.

If Vampire derives a contradiction then it has shown that the problem (1) is valid, i.e. a theorem. Deriving a contradiction when the problem (1) is not valid is unsound and an incorrect result.

If Vampire fails to derive a contradiction and saturates the set S in finitely many steps, then there is a result [1] telling us that under certain conditions we can conclude that false cannot be a consequence of S and therefore problem (1) is a non-theorem. These conditions capture the completeness of the underlying inference system and generally require that all possible non-redundant inferences have been performed.

However, there are many things that Vampire does to heuristically improve proof search that break the completeness conditions. For example:

– Certain well-performing selection functions [4] might prevent inferences that need to be performed for the completeness conditions to hold.
– Some useful preprocessing steps explicitly remove some of the premises [5].
– Strategies such as the limited resource strategy [13] might at some point choose to delete a clause from the search space.
– Reasoning with theories such as arithmetic is in general undecidable, and the above conditions fail to hold.

If the completeness conditions do not hold then upon saturation the result is Unknown. Sometimes it is easy to detect whether the completeness conditions hold, sometimes it is non-trivial, and sometimes they are erroneously broken. In this last case (when we think the conditions hold but they do not) Vampire will incorrectly report a non-theorem, i.e. this completeness issue is another kind of incorrect result.

To ensure the requirement that all possible non-redundant inferences will eventually be performed, we impose certain fairness criteria on the saturation process. More concretely, we require that no such inference is postponed indefinitely. Notice that this is by nature a tricky condition to deal with, as it cannot be seen to have been violated after finitely many steps while the prover is running. And since, due to the semi-decidability of first-order logic, there is no upper bound on the length of the computation required to derive false, a non-fair implementation might in certain cases never be able to return Theorem, even if it is the correct answer, and instead keep computing indefinitely. Thus, this fairness issue does not lead to an incorrect result per se, but rather just negatively influences performance. As such it may be extremely hard to detect and deal with.
2.2 Program Crashes

A program crash is where Vampire terminates unexpectedly, usually due to one of the following:

– Unhandled Exception
– Floating Point Error (SIGFPE)
– Segmentation Fault (SIGSEGV)

Unhandled exceptions are bugs, as we should handle them. In general, Vampire handles all known classes of exceptions at the top level, but we have recently had issues with integrated tools (MiniSAT and Z3) producing exceptions that we did not handle. Floating point errors and segmentation faults are typical software bugs that should be detected and removed. Later we discuss the potential dangers of memory-related bugs.
2.3 Assertion Violations

Vampire is developed defensively with frequent use of assertions. For example, these are inserted wherever a function makes some assumptions about its input or the results of a nested function call, and wherever we believe a certain line to be unreachable. Vampire consists of roughly 194,000 lines of C++ code with roughly 2,500 assertions, meaning that there is roughly one assertion per 77 lines. The majority of potential errors are detected early as assertion violations.
2.4 Performance Bugs

This is not a bug in the sense of the above bugs, i.e. Vampire still returns the correct result without early termination. However, we include this category of bugs as it is something we are interested in and try to detect. A performance bug is where something should perform well but, due to a mistake or error, it does not. An easy case to detect is where a heuristic previously performed well but, after implementing an orthogonal heuristic, it no longer does. A very difficult case to detect is where a new heuristic is added but implemented incorrectly and therefore does not perform as it should. There is little hope of detecting these cases without manual inspection.
3 Detecting and Investigating Bugs

In this section we briefly describe how we 1) detect and 2) investigate bugs in Vampire. The main point of this section is that these two steps can be equally difficult. The search space for Vampire is vast, and finding the combination of inputs that triggers a bug is very difficult. Some bugs are incredibly subtle, particularly soundness bugs or those involving memory errors, and tracking them down can involve hunting through thousands of lines of output.

Fig. 1. The proof search parameters for Vampire. [The figure lists the roughly seventy parameter names, from age weight ratio and the avatar family of options, through saturation algorithm, sat solver, and selection, to time limit and use dismatching; the full list, garbled in extraction, is omitted here.]
The two inputs to Vampire are the input problem and a strategy capturing proof search parameters. The currently used proof search parameters in Vampire are summarised in Figure 1. We do not have the space (or desire) to describe them all here. More than half of these options have more than two possible values and some can take arbitrary numbers. As an illustration of the size of this search space: if each option had just two values, we could check each option combination in one second, and we had a time machine that could take us back to the beginning of the world (4.5 billion years), we would still only have covered about 0.0004% of the possible combinations. Of course, this example is somewhat of an overestimate, as there are parameter dependencies that shrink the search space and many parameter combinations will result in the same actual proof search. Nonetheless, the parameter search space is prohibitively large to explore systematically.
Given the large parameter space, our current approach is to randomly sample the parameter space, under the additional dependency constraints, and the sets of available problems. We use a cluster that enables us to carry out around a million checks a day (using varying short time limits). Once an error is detected, we must diagnose and fix the fault. Below we describe some of our methods for doing this.
Tracing.
Vampire has its own library for tracing function calls. A macro is manually inserted at the start of each significant function. This macro enables the tracing library to maintain the current call stack so that it can be printed on an assertion violation or crash.

(Explanations for most of the options in Figure 1 can be obtained by running vampire -explain.)
Memory Management.

Vampire implements its own memory management library, allowing fine-grained control of memory allocation and deallocation. This also allows Vampire to enforce soft memory limits. In debug mode, Vampire will keep track of each allocated piece of memory and check that the corresponding deallocation is as expected. Vampire will also report memory leaks by reporting, at the end of the proof search, on memory that was never deallocated.
Segmentation Faults and Silent Memory Issues.
The most difficult bugs to debug are a rogue pointer or a piece of uninitialised memory. Often such bugs are only noticed via incorrect results and much manual effort. We find that, as a first step, applying Valgrind (http://valgrind.org) will often detect the more straightforward issues.

Proof Checking.
To detect unsoundness we employ proof checking, which we discuss further in Section 5. We do not currently have a corresponding method for checking that a saturated set complies with the necessary completeness conditions.
4 Examples of Bugs

We now illustrate the kinds of bugs that can appear in Vampire. The majority of these bugs were detected during development, but still managed to exist in the development version of Vampire for some time before they were detected and fixed. We attempt to include explanations of why the bugs were not detected immediately, which informs our later discussion of what could be done better.
4.1 Incorrect Skolemisation

Skolemisation is a necessary step of the process used to translate an input formula into a clause (see [12] for how Vampire implements this process). The standard transformation is as follows:

(∃x)(ϕ[x]) −→ ϕ[f_x(y_1, ..., y_k)]

where y_1, ..., y_k are the variables universally quantified in the containing formula. An optimised version of this transformation can be produced by noticing that in

(∀x)((∃y)(p(y, y)) ∨ (∃z)(q(x, z)))

the variable y does not rely on x and (∀x) can be pushed in; this is called miniscoping. A buggy implementation of this optimisation introduced the transformation

(∃x)(ϕ[x]) −→ ϕ[f_x(y_1, ..., y_k)]

where y_1, ..., y_k are the variables universally quantified in the containing formula and occurring in ϕ[x]. To understand why this is buggy, consider

(∀u)(∃x)(p(x, u) ∧ (∃y)q(x, y));

here (∃y)q(x, y) does not contain u, so it would be Skolemised to q(f(u), g) according to the above rule. However, this is incorrect, as x is itself dependent on the universally quantified u. The correct rule should instead be

(∃x)(ϕ[x]) −→ ϕ[f_x(y_1, ..., y_k)]

where y_1, ..., y_k are the variables universally quantified in the containing formula and occurring in ϕ[x]σ, where σ is a substitution containing the previous Skolemisations. With this corrected rule, the Skolemised example formula becomes

p(f(u), u) ∧ q(f(u), g(u)).

We expected this optimisation to improve performance, so when it did we did not immediately realise that this was due to Vampire solving a different, often slightly easier, problem. This was only detected when inspecting a different proof to understand a separate issue.
4.2 Parsing Unary Minus

When the parser in Vampire was first extended to handle the unary minus arithmetic operator, it erroneously parsed −t as (t) − (−t), effectively resulting in t. This was caused by incorrectly modifying a function that previously handled binary operators only. Whilst we would expect such an error to lead to a crash, the function instead fell through a case statement and treated the expression as a binary minus with −t as the second term. This was not immediately detected, as we did not have many non-theorem problems containing unary minus in our test set.

4.3 Memory Allocation Bugs

Bugs involving memory allocation can be very difficult to debug. In one case a class for representing propositional clauses declared an array
SATLiteral _literals[1];

i.e. with a declared size of 1, but then ensured that the correct amount of memory was allocated and initialised, with one exception: in the case of the empty clause no memory was allocated. Later the class was extended with extra fields and a non-trivial constructor, which implicitly initialised the array. In the case of the empty clause this caused a constructor to be called on an unallocated piece of memory, i.e. a random piece of memory, most likely already in use, was written to, leading to non-deterministically unsound behaviour.

In a similar case, an implementation of skip lists employed a similar trick which behaved as expected in debug mode. However, in release mode a higher level of optimisation (-O3) was applied, which removed necessary code. It was unclear whether the trick or the optimisation was at fault, but in either case the bug was difficult to diagnose as it only occurred in release mode.

4.4 Problems with Hashing over Raw Data

Vampire makes heavy use of data structures which rely on hashing. For each class of objects that could be a key we typically have overloaded functions specifying what it means to hash such objects. However, there is also a fallback implementation which simply hashes the sizeof(o)-many bytes starting from the address &o.

This became the source of a hard-to-discover bug when moving to a new platform. To meet a memory alignment requirement, a new compiler decided that a struct holding an int and a pointer should occupy 16 bytes, but only 12 of these corresponded to the actually stored values. The remaining 4 bytes of padding would in principle hold arbitrary values, while still participating in the hash computation.

Here the original code worked correctly and it was moving to a new platform that introduced the bug. Therefore, it was difficult to apply the "what changed recently" question in a specific way.
4.5 Incorrect Theory Axioms

For reasoning in theories Vampire adds theory axioms (formulas such as x + y = y + x). When these were added, the incorrect axiom

0 ≤ x ∨ abs(−x) = x

was added instead of

¬(0 ≤ x) ∨ abs(−x) = x,

where abs is meant to give the absolute value of a number. This allowed Vampire to derive an inconsistency using the added theory axioms alone. However, this was not detected for some time, as the axiom described a feature that was rarely used, and never used in a non-theorem problem in our testing set of problems. At another point a more subtle inconsistency was introduced that survived for a while, as the proof of inconsistency from theory axioms was very long. Since encountering these issues we have added an assertion that a proof should contain formulas from the input!

4.6 Misusing the Z3 API

In a recently added feature we make use of the Z3 SMT solver [3], which we use via its API. A number of bugs occurred due to misusing this API and Z3 in general. The first bug was based on failing to guard statements that could represent division by zero. Z3 treats division as an underspecified function and is allowed to assign any value to t/0. This was inconsistent with our use of Z3 and led to unsound inference steps. In another case we made use of an API call that returned an object without increasing a reference counter, and consequently had memory issues as the object was sometimes deleted. A final bug was traced to Z3 itself and was quickly fixed by its developers. This demonstrates the additional issues involved with integrating other tools.

5 Proof Checking
The easiest way to confirm a result indicating that the input formula is a theorem is to check that the associated proof only performs sound inference steps. This process is called proof checking, and here we briefly describe the capabilities and limitations of the proof checking technique as currently realised in Vampire.
5.1 Proof Checking in Vampire

We introduce the idea of proof checking using an example (see our previous work [10] for more information about proofs in Vampire). Given the clauses

p(a)    ¬p(x) ∨ b = x    ¬p(b)

Vampire will produce the following proof:
1. p(a) [input]
2. ~p(X0) | b = X0 [input]
3. ~p(b) [input]
4. a = b [resolution 2,1]
5. ~p(a) [backward demodulation 4,3]
7. $false [subsumption resolution 5,1]
A proof is a directed acyclic graph printed in a linear form, where nodes that have no incoming edges are either input formulas or axioms introduced by Vampire, and the single node with no outgoing edges contains the contradiction. In the above proof each derived clause is labelled with the name of the inference and the lines of the premises.

To check a proof we just need to establish, for each inference, that its consequence logically follows from its premises. By running vampire -p proofcheck we can produce output in TPTP format which captures the three problems that need to be solved to check that the proof is correct (we show only the problem for step 4):

fof(r4,conjecture, a = b ). % resolution
fof(pr2,axiom, ( ! [X0] : (~p(X0) | b = X0) ) ).
fof(pr1,axiom, p(a) ).
We can pass these directly to an independent theorem prover, and if a step cannot be independently verified then it should be investigated.
5.2 What We are Missing

The above description suggests that we have a good method for checking the correctness of proofs. However, there are two problems with this approach. The first is that, in our experience, it is not uncommon for a proof step to fail independent verification whilst still being sound. Such false positives take a lot of time to investigate. The second, and more substantial, problem is that there are parts of the proof process that cannot be handled by the above approach. There are two main classes of inferences that cannot be handled in this way, and we describe them below.
Symbol Introducing Preprocessing.
Certain inference steps of the clausification phase, e.g. Skolemisation and formula naming [12], introduce new symbols and as such do not preserve logical equivalence. This means the conclusion of the inference does not logically follow from its premises. These steps only preserve the global satisfiability of the clause set they modify. One necessary condition for correctness is that the introduced symbols be fresh, i.e. not appearing elsewhere in the input. This cannot (in principle) be checked by the described approach and requires a non-trivial extension.
SAT and SMT solving.
Vampire makes use of SAT and SMT solvers in various ways (see [11]). This means that we have some inferences in Vampire that are of the form P_1 ∧ ... ∧ P_n → C by SAT/SMT, or even the argument that some abstraction or grounding of the premises leads to C by SAT or SMT solving. To handle such proof steps we need to collect together the premises (potentially applying the necessary abstraction or grounding) and run a SAT or SMT solver as appropriate.

6 Challenges

We now present the main contribution of this paper, a discussion of what we have identified as the main challenges left to be solved, or at least addressed. These challenges are given in order of importance, as we perceive it.
6.1 Better Proof Checking

As described in Section 5, there is already reasonable support for independently checking the correctness of proofs. However, this situation could still be greatly improved.
Missing Features.
As outlined in Section 5.2, there are parts of proofs that cannot currently be checked. Supporting the checking of these features might require adding additional information to the proofs.
Automating Proof Checking.
Having independent tools able to check the correctness of proofs is irrelevant if those tools are not used. Ideally, theorem provers should provide the functionality to check the proofs that they produce automatically. In general, the problems produced during the proof checking process are easy to solve. Therefore, one could imagine a situation where, in a certain mode, a theorem prover created these problems and called an independent solver.

Independence.
In some cases it might not be possible to find an independent solver able to handle the problems produced by proof checking. The solver might not be able to check an individual step because it is too hard, or might not be able to handle the language features the problem contains (in the case of preprocessing steps). In the former case, a weaker form of independence could be achieved by making use of a previous version of the original theorem prover in which we are more confident.
6.2 Explaining Incorrect Proofs

Checking whether a proof is correct or not is essential. However, knowing that a proof is incorrect is not, in itself, very useful. Another missing piece of this puzzle is tools that can analyse proofs and extract, summarise, or explain the reason the proof is incorrect. The proof checking process will reveal the proof step that fails to hold, but the problem of detecting the underlying reason for that proof step to have occurred is non-trivial.
6.3 Checking Satisfiability Results

So far we have completely ignored the incorrect result of reporting a problem to be satisfiable when it is not. It is not clear how to practically check whether a saturated set is indeed saturated, as the notion of saturation depends on the calculus used and its instantiation with parameters such as the term ordering and literal selection methods. One must also check that proof search never deletes anything that is not redundant. Note that this problem is significantly more complex than proof checking. In proof checking we must check that each inference of the proof is sound, i.e. that we were allowed to perform the inferences that we performed to derive a contradiction. If we have a saturated set, then we should check that every inference that Vampire chose not to perform was redundant; this is what we often have to do manually, with some intuition about which such inferences to consider. The number of such inferences is typically a few orders of magnitude larger than the length of a typical proof.
6.4 Improved Random Testing

As previously discussed, due to the enormous variability in proof search parameters and possible problem inputs, the best approach to detecting errors and incorrect results is through random search. However, the current approaches to random search are not optimal. Here we briefly outline areas of improvement.
Code Coverage.
Our current approach makes no attempt to ensure that testing covers all lines in the code. Even though this is a very weak notion of coverage, it could be used to detect areas of code that should be tested, or removed if never used.
Coverage of the Parameter Space.
Whilst random sampling of the parameter space as done in Section 3 can be effective at discovering bugs, it is not clear that all areas of the parameter space are of equal interest. Clearly, combinations of features that have not been tested together should have priority, and features added more recently should be tested more thoroughly.

Coverage of the Problem Space.
This is an area where relatively little has been done. Currently we use libraries of existing problems (such as TPTP [15] and SMT-LIB [2]) as possible inputs to the testing process. However, as we saw in Section 4, if we do not have a problem that exercises a certain feature sufficiently, we are unlikely to detect bugs related to that feature. For example, the TPTP language contains features that are very rarely, and sometimes never, used within the TPTP library. This issue is not confined to language features. Proof search is dependent on particular dimensions of the input problem (e.g. size, signature) that are difficult to quantify. If the input problems do not cover these dimensions sufficiently then certain parts of Vampire will not be tested effectively. This suggests that a useful area of research would be the automatic generation of problems, or the fuzzing of existing problems, to cover such dimensions.
7 Conclusion

This paper describes our experience of testing the Vampire theorem prover and what we see as the challenges to overcome to help us improve this effort. The ideas we discuss here generalise to other theorem provers, and some efforts, such as good proof checking techniques and better problem coverage, would be widely beneficial.

Addressing the challenges set out in this paper is part of our current research, and we are currently working on providing a publicly available proof checking tool that can fully and automatically check proofs produced by Vampire.
References
1. L. Bachmair and H. Ganzinger. Resolution theorem proving. In Handbook of Automated Reasoning, vol. I, chapter 2, pp. 19-99. Elsevier Science, 2001.
2. C. Barrett, A. Stump, and C. Tinelli. The Satisfiability Modulo Theories Library (SMT-LIB), 2010.
3. L. M. de Moura and N. Bjørner. Z3: an efficient SMT solver. In Proc. of TACAS, vol. 4963 of LNCS, pp. 337-340, 2008.
4. K. Hoder, G. Reger, M. Suda, and A. Voronkov. Selecting the selection. In Automated Reasoning: 8th International Joint Conference, IJCAR 2016, Coimbra, Portugal, June 27 - July 2, 2016, Proceedings, pp. 313-329. Springer International Publishing, 2016.
5. K. Hoder and A. Voronkov. Sine qua non for large theory reasoning. In Automated Deduction - CADE-23, 23rd International Conference on Automated Deduction, Wroclaw, Poland, July 31 - August 5, 2011, Proceedings, vol. 6803 of Lecture Notes in Computer Science, pp. 299-314. Springer, 2011.
6. C. Kaliszyk and J. Urban. HOL(y)Hammer: Online ATP service for HOL Light. Mathematics in Computer Science, 9(1):5-22, 2015.
7. K. Korovin. iProver - an instantiation-based theorem prover for first-order logic (system description). In Automated Reasoning, 4th International Joint Conference, IJCAR 2008, Sydney, Australia, August 12-15, 2008, Proceedings, vol. 5195 of Lecture Notes in Computer Science, pp. 292-298. Springer, 2008.
8. L. Kovács and A. Voronkov. First-order theorem proving and Vampire. In CAV 2013, vol. 8044 of Lecture Notes in Computer Science, pp. 1-35, 2013.
9. L. C. Paulson and J. C. Blanchette. Three years of experience with Sledgehammer, a practical link between automatic and interactive theorem provers. In IWIL 2010, The 8th International Workshop on the Implementation of Logics, vol. 2 of EPiC Series in Computing, pp. 1-11. EasyChair, 2012.
10. G. Reger. Better proof output for Vampire. In Vampire 2016, Proceedings of the 3rd Vampire Workshop, vol. 44 of EPiC Series in Computing, pp. 46-60. EasyChair, 2017.
11. G. Reger and M. Suda. The uses of SAT solvers in Vampire. In Proceedings of the 1st and 2nd Vampire Workshops, vol. 38 of EPiC Series in Computing, pp. 63-69. EasyChair, 2016.
12. G. Reger, M. Suda, and A. Voronkov. New techniques in clausal form generation. In GCAI 2016, 2nd Global Conference on Artificial Intelligence, vol. 41 of EPiC Series in Computing, pp. 11-23. EasyChair, 2016.
13. A. Riazanov and A. Voronkov. Limited resource strategy in resolution theorem proving. J. Symb. Comput., 36(1-2):101-115, 2003.
14. S. Schulz. E - a brainiac theorem prover. AI Commun., 15(2-3):111-126, 2002.
15. G. Sutcliffe. The TPTP problem library and associated infrastructure. J. Autom. Reasoning, 43(4):337-362, 2009.
16. G. Sutcliffe. The CADE ATP system competition - CASC. AI Magazine, 37(2):99-101, 2016.
17. C. Weidenbach, D. Dimova, A. Fietzke, R. Kumar, M. Suda, and P. Wischnewski. SPASS version 3.5. In Automated Deduction - CADE-22, 22nd International Conference on Automated Deduction, Montreal, Canada, August 2-7, 2009, Proceedings, vol. 5663 of Lecture Notes in Computer Science, pp. 140-145. Springer, 2009.