Vampire With a Brain Is a Good ITP Hammer (short paper)
Martin Suda, Czech Technical University in Prague, Czech Republic
Abstract
Vampire has been for a long time the strongest first-order automated theorem prover, widely used for hammer-style proof automation in ITPs such as Mizar, Isabelle, HOL, and Coq. In this work, we considerably improve the performance of Vampire in hammering over the full Mizar library by enhancing its saturation procedure with efficient neural guidance. In particular, we employ a recursive neural network classifying the generated clauses based only on their derivation history. Compared to previous neural methods that consider the logical content of the clauses, this leads to a large real-time speedup of the neural guidance. The resulting system shows good learning capability and achieves state-of-the-art performance on the Mizar library, while proving many theorems that the related ENIGMA system could not prove in a similar hammering evaluation.
Theory of computation → Automated reasoning; Computing methodologies → Theorem proving algorithms; Computing methodologies → Machine learning
Keywords and phrases proof automation, ITP hammers, automatic theorem proving, machine learning, recursive neural networks
Funding Supported by the Czech Science Foundation project 20-06390Y and the project RICAIP no. 857306 under the EU-H2020 programme.
The usability of Interactive Theorem Provers (ITPs) is significantly enhanced by proof automation. In particular, employing the so-called hammers, systems that connect the ITP to an automatic theorem prover (ATP), may greatly speed up the formalisation process [6]. There are two ingredients of the hammer technology that appear to be best implemented using machine learning, especially while taking advantage of the corresponding large ambient ITP libraries, which can be used for training. One is the premise selection task, in which the system decides on a manageable subset of the most relevant facts from the ITP library to be passed to the ATP as axioms along with the current conjecture [1, 10, 2, 45, 30]. The other is the internal guidance of the ATP's proof search [43, 11], typically focusing on the clause selection process in the predominant saturation-based proving paradigm [21, 27].

ENIGMA [21, 22, 7, 20] is a system delivering internal proof search guidance driven by state-of-the-art machine learning methods to the automatic theorem prover E [35]. In 2019, the authors of ENIGMA announced [23] a 70 % improvement of the real-time performance of E on the Mizar Mathematical Library (MML) [17]. This was achieved by the use of gradient-boosted trees trained on efficiently extracted, manually designed clause features.

In this work, we present an enhancement of the automatic theorem prover
Vampire [26] by a new form of clause-selection guidance. We employ recursive neural networks [15] and learn to classify clauses based solely on their derivation history. This means we deliberately abstract away the logical content of a clause, i.e. "what it says", and only focus on "where it is coming from". There is a pragmatic appeal in this design decision: evaluating a clause becomes relatively fast compared to other approaches based on neural networks (cf., e.g., [27, 7]). It is also very interesting that such a simple approach works at all, let alone that it is able to match or even improve on the existing "better informed" methods, as we show here.
In the rest of this paper, we first recall (in Section 2) how the saturation-based ATP technology can be enhanced by internal guidance learnt from previous proofs. Noteworthy is the use of the recently developed layered clause selection technique [13, 14, 41] for integrating the learnt advice, which is novel in this context. We then explain (in Section 3) how to construct and train recursive neural networks as successful classifiers of clause derivations. Finally, we report (in Section 4) on an experimental evaluation over the whole Mizar Mathematical Library of our extension of
Vampire with the described techniques. Using the same setting as ENIGMA in [23], we observe that the two systems are comparable in overall performance, but additionally complement each other on a large subset of the benchmark. This promises a great benefit for a combined hammering portfolio of ML-guided ATPs.
Modern automatic theorem provers (ATPs) for first-order logic such as E [35], SPASS [46], or
Vampire [26] are among the most mature tools for general reasoning in a variety of domains. In a nutshell, they work in the following way. Given a list of axioms A_1, ..., A_l and a conjecture G to prove, an ATP translates {A_1, ..., A_l, ¬G} into a set of initial clauses C and tries to derive a contradiction ⊥ from C (thus showing that A_1, ..., A_l ⊨ G) using a logical calculus, such as resolution or superposition [4, 28]. The employed process of iteratively, transitively deriving new clauses (according to the inference rules of the used calculus), logical consequences of C, is referred to as saturation and is typically implemented using some variant of a given-clause algorithm [31]: in each iteration, a single clause C is selected and inferences are performed between C and all the previously selected clauses. Deciding which clause to select next is known to be a key heuristical choice point, hugely affecting the performance of an ATP [36].

The idea to improve clause selection by learning from past prover experience goes (to the best of our knowledge) back to [9, 34] and has more recently been successfully employed by the ENIGMA system [21, 22, 7, 20]. Experience is collected from successful prover runs, where each selected clause constitutes a training example; the example is marked as positive if the clause ended up in the discovered proof, and negative otherwise. A Machine Learning (ML) algorithm is then used to fit this data and produce a model M for classifying clauses into positive and negative, accordingly. A good learning algorithm produces a model M which accurately classifies the training data but also generalizes well to unseen examples; ideally, of course, with a low computational cost of both 1) training and 2) evaluation.

When a model is prepared, we need to integrate its advice back into the prover's clause selection process.
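The given-clause loop recalled above can be written schematically as follows. This is our own illustrative sketch, not Vampire's actual implementation; `select`, `infer`, and `is_empty_clause` are hypothetical stand-ins for the clause selection heuristic, the inference rules of the calculus, and the test for the derived contradiction ⊥.

```python
# Schematic given-clause saturation loop (illustrative sketch only).

def saturate(initial_clauses, select, infer, is_empty_clause, max_iters=10000):
    unprocessed = list(initial_clauses)  # passive clauses, awaiting selection
    processed = []                       # active clauses, already selected
    for _ in range(max_iters):
        if not unprocessed:
            return "saturated"           # no contradiction derivable
        given = select(unprocessed)      # the key heuristical choice point
        unprocessed.remove(given)
        processed.append(given)
        # perform all inferences between `given` and the active clauses
        for conclusion in infer(given, processed):
            if is_empty_clause(conclusion):
                return "refutation"      # ⊥ derived: the conjecture is proved
            unprocessed.append(conclusion)
    return "timeout"
```

Everything the learnt guidance discussed below influences is concentrated in the single call to `select`.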
An ATP typically organizes this process by maintaining a set of priority queues, each ordering the yet-to-be-processed clauses by a certain criterion, and alternates, under a certain configurable ratio, between selecting the best clause from each queue. One way of integrating the learnt advice, adopted by ENIGMA, is to add another queue Q_M in which clauses are ordered such that those positively classified by M precede the negatively classified ones, and to extend the mentioned ratio such that Q_M is used for, e.g., half of all the selections (while the remaining ones fall back to the original strategy).

In this work, we rely instead on the layered clause selection paradigm [13, 14, 41], in which a clause selection mechanism inherited from an underlying strategy is applied separately to the set A of clauses classified as positive by M and to the set B of all the yet-to-be-processed clauses (i.e., A ⊆ B). A "second level" ratio then dictates how often the prover will relay to selecting from either of these two sets. For example, with a second-level ratio of 2:1, the prover selects twice from A (unless A is currently empty and a fallback to B happens) before selecting once from B. An advantage of this approach is that the original, typically well-tuned, selection mechanism is still applied within both A and B.

A Recursive Neural Network (RvNN) is a network created by composing a finite set of neural building blocks recursively over a structured input [15]. In our case, the structured input is a clause derivation: a directed acyclic (hyper-)graph (DAG) with the initial clauses C ∈ C as the leaves and the derived clauses as the internal nodes, connected by (hyper-)edges labeled by the corresponding applied inference rules. To enable the recursion, an RvNN represents each node C by a real vector v_C (of a fixed dimension n) called an embedding.
Our network effectively embeds the space of derivable clauses into R^n in some a priori unknown, but hopefully reasonable way.

We assume that each initial clause C can be identified with an axiom A_C from which it was obtained through clausification (unless it comes from the conjecture) and that these axioms form a finite set A, fixed for the domain of interest. Now, the specific building blocks of our architecture are (mainly; see below) the following three (indexed families of) functions: for every axiom A_i ∈ A, a "null-ary" init function I_i ∈ R^n, which to an initial clause C ∈ C obtained through clausification from the axiom A_i assigns its embedding v_C := I_i; for every inference rule r, a deriv function D_r : R^n × · · · × R^n → R^n, which to a conclusion clause C_c derived by r from premises (C_1, ..., C_k) with embeddings v_{C_1}, ..., v_{C_k} assigns the embedding v_{C_c} := D_r(v_{C_1}, ..., v_{C_k}); and, finally, a single eval function E : R^n → R, which evaluates an embedding v_C such that the corresponding clause C is classified as positive whenever E(v_C) ≥ 0. Thus any derived clause C can be assigned an embedding v_C and also evaluated to see whether the network recommends it as positive, meaning it should be preferred in proof search, or negative, meaning it will not likely contribute to a proof. Notice that the amortised cost of evaluating a single clause by the network is low, as it amounts to a constant number of function compositions.

Let us now spend some time to consider what kinds of information about a clause the network can take into account to perform its classification. The assumption about a fixed set of axioms enables the network to meaningfully carry over between problems observations about which axioms and their combinations quickly lead to good lemmas and which are, on the other hand, too prolific in the search while rarely useful. We believe this is the main source of information for the network to classify well.
However, it may not be feasible to represent in the network all the axioms available in the benchmark. It is then possible to only reveal a specific subset to the network and to represent all the remaining ones using a single special embedding I_unknown.

Another, less obvious, source of information are the inference rules. Since there are distinct deriv functions D_r for every rule r, the network can also take into account that different inference rules give rise to conclusions of different degrees of usefulness.

Finally, we always "tell the network" what the current conjecture G is by marking the conjecture clauses using a special embedding I_goal. Focusing the search on the conjecture is a well-known theorem proving heuristic, and we give the network the opportunity to establish how strongly this heuristic should be taken into account.
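For illustration, the recursive computation over a derivation DAG can be sketched as follows. This is a toy stand-in in plain Python: the learnt init, deriv, and eval blocks are replaced by fixed functions of the same shape (including a shared embedding for axioms not explicitly revealed, standing in for I_unknown), and all the names here are ours, not Vampire's API.

```python
# Toy sketch of the recursive embedding of a derivation DAG.

def embed(clause, parents, rule_of, axiom_of, I, D, memo=None):
    """Bottom-up embedding v_C of `clause`. `memo` caches shared
    sub-derivations, so the amortised cost per clause is a constant
    number of function applications."""
    if memo is None:
        memo = {}
    if clause in memo:
        return memo[clause]
    if clause not in parents:                      # a leaf: an initial clause
        v = I.get(axiom_of(clause), I["unknown"])  # unrevealed axioms share one embedding
    else:
        vs = [embed(p, parents, rule_of, axiom_of, I, D, memo)
              for p in parents[clause]]
        v = D[rule_of(clause)](vs)                 # rule-specific deriv function
    memo[clause] = v
    return v

def classify(clause, parents, rule_of, axiom_of, I, D, E):
    """A clause is recommended as positive whenever E(v_C) >= 0."""
    return E(embed(clause, parents, rule_of, axiom_of, I, D)) >= 0.0
```

In the real system the embeddings are vectors in R^n and I, D, E are learnt neural blocks; the control flow, however, is exactly this recursion with memoisation.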
In fact, we implemented a stronger version of this idea by precomputing for every initial clause its SInE level [13, 40, 19]. Roughly, a SInE level is a heuristical distance of a formula from the conjecture along a relation defined by sharing common symbols. We add these levels as additional input to the network in the leaves. The effect of enabling or disabling this extension is demonstrated in the experiments in Section 4.
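The level computation can be pictured as a breadth-first traversal; the following is our own rough illustration of the idea, not the exact SInE definition of [19], with `symbols_of` a hypothetical map from a formula to its set of symbols.

```python
# Rough illustration of SInE-style levels: the heuristical distance of each
# formula from the conjecture along the symbol-sharing relation.

from collections import deque

def sine_levels(symbols_of, conjecture, formulas):
    """Breadth-first search: the conjecture gets level 0, and a formula
    sharing a symbol with a level-k formula gets level at most k + 1."""
    levels = {conjecture: 0}
    queue = deque([conjecture])
    while queue:
        f = queue.popleft()
        for g in formulas:
            if g not in levels and symbols_of(f) & symbols_of(g):
                levels[g] = levels[f] + 1
                queue.append(g)
    return levels
```

Formulas unreachable from the conjecture receive no level at all; any concrete use would need to decide how to encode that case for the network.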
Our RvNN is parametrized by a tuple of learnable parameters (θ_I, θ_D, θ_E) which determine the corresponding init, deriv, and eval functions. To train the network means to find suitable values for these parameters such that it successfully classifies positive and negative clauses from the training data and ideally also generalises to unseen future cases.

We follow a standard methodology for training our networks. In particular, we use the Stochastic Gradient Descent (SGD) algorithm minimising a binary cross-entropy loss (against the final layer's sigmoid non-linearity) [16]. Every clause in a derivation DAG selected by the saturation algorithm constitutes a contribution to the loss, with clauses that participated in the found proof receiving the target label 1 and the remaining ones the target label 0. We weight these contributions such that each derivation DAG (corresponding to a prover run on a single problem) receives equal weight, and, moreover, within each DAG we scale the importance of positive and negative examples such that these two categories contribute evenly.

We split the available successful derivations into an 80 % training set and a 20 % validation set, and only train on the first set, using the second to observe generalisation to unseen examples. As the SGD algorithm progresses, iterating over the training data in rounds called epochs, we can evaluate the loss on the validation set and stop the process early if this loss does not decrease for a specified period. At least in one case in our experiments, this early stopping criterion helped to produce a better model.

It is out of the scope of this short paper to describe the precise details of our architecture or every hyper-parameter we used in the training. We anyway cannot claim to have explored the space of possibilities enough to be able to defend our choices as optimal. Nevertheless, anecdotally, we can recommend the Rectified Linear Unit (f(x) = max{0, x}) as the main non-linearity, coupled with Layer Normalization [3] for stability.
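The loss weighting described above, with each DAG receiving equal total weight and the positive and negative examples within a DAG rescaled to contribute evenly, can be sketched in plain Python as follows; `score` plays the role of the pre-sigmoid network output E(v_C), and the whole function is an illustration of the weighting scheme, not our actual PyTorch training code.

```python
# Sketch of the weighted binary-cross-entropy loss described in the text.

import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def weighted_bce(dags):
    """`dags` is a list of derivations, each a list of (score, label) pairs,
    with label 1 for proof clauses and 0 for other selected clauses."""
    total = 0.0
    for dag in dags:
        pos = [ex for ex in dag if ex[1] == 1]
        neg = [ex for ex in dag if ex[1] == 0]
        dag_loss = 0.0
        # positives and negatives each account for half of the DAG's loss
        for examples, weight in ((pos, 0.5 / max(len(pos), 1)),
                                 (neg, 0.5 / max(len(neg), 1))):
            for score, label in examples:
                p = sigmoid(score)
                dag_loss -= weight * (label * math.log(p)
                                      + (1 - label) * math.log(1 - p))
        total += dag_loss / len(dags)   # each DAG receives equal weight
    return total
```

In an actual implementation the same effect is typically achieved by passing per-example weights to the library's binary-cross-entropy loss rather than by an explicit double loop.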
We followed thephilosophy of bottlenecks [33], using wider layers than the embedding size n just beforecomputing the non-linearity. We employed dropout [39] as a form of regularisation andexperimented with non-constant learning rates, following the suggestions of [37, 38] and [44]. We implemented clause guidance by a recursive neural network classifier for clause derivationsas described in Sections 2 and 3 in the automatic theorem prover
Vampire (version 4.5.1). We used the
PyTorch (version 1.7) library [29] and a set of Python scripts for training the neural models, and the
TorchScript extension for interfacing the trained models from C++.

Following Jakubův and Urban [23], we used the Mizar40 [25] benchmark consisting of 57 880 problems from the MPTP [42] and, in particular, the small (bushy, re-proving) version. This version emulates the scenario where some form of premise selection has already taken place and allows us to directly focus on evaluating internal guidance in an ATP. Supplementary materials accompanying the experiments can be found at https://git.io/JtugQ.

Table 1 Training statistics and inference speed of model classes explored in the first experiment.

model class:        M^128  H^256  M^256  D^256  M^512
embedding size n:     128    256    256    256    512
revealed axioms m:   1000    500   1000   2000   1000

In a previous experiment with this benchmark, we identified a strong
Vampire strategy, which we here denote as V and use as a baseline. This strategy solved a total of 20 197 problems from Mizar40 under the time limit of 10 s per problem. We used the corresponding successful derivations (under the mentioned 80–20 split) as a dataset for training several neural models differing in the following two capacity-related parameters: the size of the embedding n, where, starting off with the default 256, we also tried 128 and 512; and the number of revealed axioms m, for which the model learns a specific embedding, where the default was 1000, denoted M, and we also tried 500, denoted H, and 2000, denoted D. In total, there are approximately 100 000 named axioms in Mizar40. We ranked them starting from those occurring most often in the obtained derivations and applied the respective mentioned cutoffs m. We trained the models using a parallel setup on 60 cores, generally for 500 epochs, with two exceptions mentioned below. We then reran Vampire's strategy V, now equipped with these models for guidance (integrated under the layered clause selection scheme with the second-level ratio, cf. Section 2, set to 10:1), again using the time limit of 10 s on the same machine, a server with Intel(R) Xeon(R) Gold 6140 CPUs @ 2.30 GHz.

The training statistics for the individual models are shown in Table 1. We can see that the model sizes are dictated mainly by the embedding size n and not so much by the number of revealed axioms m. (Roughly, Θ(n²) of space is needed for storing the matrices representing the deriv and eval functions, while Θ(n · m) is required for storing the axiom embeddings.) We note that the sizes are comparable to those of the gradient-boosted trees used by Jakubův and Urban in [23]. Concerning the training times, the 230 s per epoch recorded for M^256 corresponds in 500 epochs to approximately 32 hours of 60-core computation, almost 80 single-core days. In [23], a similarly sized model was trained in about 8 single-core days, which indicates that training neural networks is much more computation intensive.

Table 2 presents the performance statistics, starting with the baseline V, followed by seven variations of neural guidance and ending with the performance of their union (N-union). In the middle, configuration M^256_500 is the default one. Its subscript e = 500 signifies that the model was obtained after 500 epochs of training. Turning the SInE levels feature off in −SInE M^256_500 leads to a degradation of performance, which shows that the use of SInE levels is beneficial. Also, doubling the number of revealed axioms with D^256_500 versus having them halved with H^256_500 indicates that revealing more axioms pays off. The fact that the model M^128_500 with the smaller embedding size n = 128 fared better than the default suggests that the larger capacity of M^256_500 allowed it to overfit to the training data, impairing generalisation. We selected M^256_370, a predecessor of M^256_500 obtained after only 370 epochs of training, which minimises the loss on the 20 % validation set (0.468). This model was able to further improve on M^256_500, being the best in our table with a 21.8 % improvement over V.

Table 2 The number of Mizar problems solved in 10 s by the base strategy V and seven strategies enhancing V by the use of various neural models independently trained on V's successful runs.

strategy:  V       M^256_370  M^128_500  D^256_500  M^256_500  −SInE M^256_500  H^256_500  M^512_350  N-union
solved:    20 197  24 597     24 531     24 467     24 198     23 815           23 486     22 169     28 645
V%:        +0 %    +21.8 %    +21.5 %    +21.1 %    +19.8 %    +17.9 %          +16.3 %    +9.8 %     +41.8 %
V+:        +0      +5507      +5496      +5466      +5275      +4982            +4643      +4058      +8679
V−:        −0      −1107      −1162      −1196      −1274      −1364            −1354      −2086      −231

At the low end of the ranking lies M^512_350. Since training a model with embedding size n = 512 was much slower (cf. Table 1), we stopped it for M^512 at epoch 350. However, the relatively low performance should mainly be ascribed to a much lower evaluation speed. Returning to Table 1, the measured inference speed drops significantly between M^256 and M^512, with the latter achieving only 17 % of the speed of V. Roughly, we can say that due to evaluating the clauses with M^512, Vampire is only running at 17 % of its usual pace. That M^512_350 is still solving almost 10 % more problems than V while running under such a slowdown confirms yet again that the learnt guidance is very useful. It is surprising that there is no similarly large difference in inference speed between M^128 and M^256. Our best guess is that, up to a certain threshold, vectorisation of the floating point operations in the CPU is making the actual cost of performing the evaluation computations "effectively constant time". (We estimated the inference speed by computing the average number of generated clauses by each of the respective strategies on the 27 335 problems on which all the strategies timed out.)

Table 2 also shows the number of problems gained (V+) and lost (V−) with respect to the baseline. We can understand the "gained" metric as an indicator of generalisation, since these are solved problems the proofs of which were not present in the training set. However, in our particular case, there is no difference between the rank by V+ and the rank by the total number solved.

In [23], the authors continue to improve the guidance by iteratively learning from the growing body of proofs obtained in the preceding evaluation runs.
We found that repeating this looping procedure with neural networks is not straightforward: 1) the growing training set becomes harder to manage and training times increase; 2) simply taking the union of all the seen successful derivations seems to be biased towards proof searches which already had good guidance from some of the previously trained models and thus may not best cater for training a model that will only get integrated with the plain baseline strategy V.

So far, we have been able to solve 25 447 Mizar problems by a single guided 10 s run, slightly more than the best comparable result of 25 397 reported in [23]. However, the union of these two runs reaches 29 154, showing a high degree of complementarity and a great potential for a combined use in hammering portfolios of ML-guided ATPs [5, 25, 24, 8].

References

[1] Jesse Alama, Tom Heskes, Daniel Kühlwein, Evgeni Tsivtsivadze, and Josef Urban. Premise selection for mathematics by corpus analysis and kernel methods.
J. Autom. Reason., 52(2):191–213, 2014. doi:10.1007/s10817-013-9286-5.
[2] Alexander A. Alemi, François Chollet, Geoffrey Irving, Christian Szegedy, and Josef Urban. DeepMath - deep sequence models for premise selection. CoRR, abs/1606.04442, 2016. URL: http://arxiv.org/abs/1606.04442.
[3] Lei Jimmy Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. CoRR, abs/1607.06450, 2016. URL: http://arxiv.org/abs/1607.06450.
[4] Leo Bachmair, Harald Ganzinger, David A. McAllester, and Christopher Lynch. Resolution theorem proving. In Robinson and Voronkov [32], pages 19–99. doi:10.1016/b978-044450813-3/50004-7.
[5] Jasmin Christian Blanchette, David Greenaway, Cezary Kaliszyk, Daniel Kühlwein, and Josef Urban. A learning-based fact selector for Isabelle/HOL. J. Autom. Reason., 57(3):219–244, 2016. doi:10.1007/s10817-016-9362-8.
[6] Jasmin Christian Blanchette, Cezary Kaliszyk, Lawrence C. Paulson, and Josef Urban. Hammering towards QED. J. Formaliz. Reason., 9(1):101–148, 2016. doi:10.6092/issn.1972-5787/4593.
[7] Karel Chvalovský, Jan Jakubův, Martin Suda, and Josef Urban. ENIGMA-NG: efficient neural and gradient-boosted inference guidance for E. In Fontaine [12], pages 197–215. doi:10.1007/978-3-030-29436-6_12.
[8] Lukasz Czajka and Cezary Kaliszyk. Hammer for Coq: Automation for dependent type theory. J. Autom. Reason., 61(1-4):423–453, 2018. doi:10.1007/s10817-018-9458-4.
[9] J. Denzinger and S. Schulz. Learning Domain Knowledge to Improve Theorem Proving. In M. A. McRobbie and J. K. Slaney, editors, Proc. of the 13th CADE, New Brunswick, number 1104 in LNAI, pages 62–76. Springer, 1996.
[10] Michael Färber and Cezary Kaliszyk. Random forests for premise selection. In Carsten Lutz and Silvio Ranise, editors, Frontiers of Combining Systems - 10th International Symposium, FroCoS 2015, Wroclaw, Poland, September 21-24, 2015. Proceedings, volume 9322 of Lecture Notes in Computer Science, pages 325–340. Springer, 2015. doi:10.1007/978-3-319-24246-0_20.
[11] Michael Färber, Cezary Kaliszyk, and Josef Urban. Monte Carlo tableau proof search. In Leonardo de Moura, editor,
Automated Deduction - CADE 26 - 26th International Conference on Automated Deduction, Gothenburg, Sweden, August 6-11, 2017, Proceedings, volume 10395 of Lecture Notes in Computer Science, pages 563–579. Springer, 2017. doi:10.1007/978-3-319-63046-5_34.
[12] Pascal Fontaine, editor. Automated Deduction - CADE 27 - 27th International Conference on Automated Deduction, Natal, Brazil, August 27-30, 2019, Proceedings, volume 11716 of Lecture Notes in Computer Science. Springer, 2019. doi:10.1007/978-3-030-29436-6.
[13] Bernhard Gleiss and Martin Suda. Layered clause selection for saturation-based theorem proving. In Pascal Fontaine, Konstantin Korovin, Ilias S. Kotsireas, Philipp Rümmer, and Sophie Tourret, editors, Joint Proceedings of the 7th Workshop on Practical Aspects of Automated Reasoning (PAAR) and the 5th Satisfiability Checking and Symbolic Computation Workshop (SC-Square), 2020, co-located with the 10th International Joint Conference on Automated Reasoning (IJCAR 2020), Paris, France, June-July, 2020 (Virtual), volume 2752 of CEUR Workshop Proceedings, pages 34–52. CEUR-WS.org, 2020. URL: http://ceur-ws.org/Vol-2752/paper3.pdf.
[14] Bernhard Gleiss and Martin Suda. Layered clause selection for theory reasoning (short paper). In Nicolas Peltier and Viorica Sofronie-Stokkermans, editors, Automated Reasoning - 10th International Joint Conference, IJCAR 2020, Paris, France, July 1-4, 2020, Proceedings, Part I, volume 12166 of Lecture Notes in Computer Science, pages 402–409. Springer, 2020. doi:10.1007/978-3-030-51074-9_23.
[15] Christoph Goller and Andreas Küchler. Learning task-dependent distributed representations by backpropagation through structure. In Proceedings of International Conference on Neural Networks (ICNN'96), Washington, DC, USA, June 3-6, 1996, pages 347–352. IEEE, 1996. doi:10.1109/ICNN.1996.548916.
[16] Ian J. Goodfellow, Yoshua Bengio, and Aaron C. Courville.
Deep Learning. Adaptive computation and machine learning. MIT Press, 2016.
[17] Adam Grabowski, Artur Kornilowicz, and Adam Naumowicz. Mizar in a nutshell. J. Formaliz. Reason., 3(2):153–245, 2010. doi:10.6092/issn.1972-5787/1980.
[18] Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett, editors. Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, 2017. URL: https://proceedings.neurips.cc/paper/2017.
[19] Krystof Hoder and Andrei Voronkov. Sine qua non for large theory reasoning. In Nikolaj Bjørner and Viorica Sofronie-Stokkermans, editors, Automated Deduction - CADE-23, volume 6803 of Lecture Notes in Computer Science, pages 299–314. Springer, 2011. doi:10.1007/978-3-642-22438-6_23.
[20] Jan Jakubův, Karel Chvalovský, Miroslav Olsák, Bartosz Piotrowski, Martin Suda, and Josef Urban. ENIGMA anonymous: Symbol-independent inference guiding machine (system description). In Nicolas Peltier and Viorica Sofronie-Stokkermans, editors, Automated Reasoning - 10th International Joint Conference, IJCAR 2020, Paris, France, July 1-4, 2020, Proceedings, Part II, volume 12167 of Lecture Notes in Computer Science, pages 448–463. Springer, 2020. doi:10.1007/978-3-030-51054-1_29.
[21] Jan Jakubův and Josef Urban. ENIGMA: efficient learning-based inference guiding machine. In Herman Geuvers, Matthew England, Osman Hasan, Florian Rabe, and Olaf Teschke, editors, Intelligent Computer Mathematics - 10th International Conference, CICM 2017, Edinburgh, UK, July 17-21, 2017, Proceedings, volume 10383 of Lecture Notes in Computer Science, pages 292–302. Springer, 2017. doi:10.1007/978-3-319-62075-6_20.
[22] Jan Jakubův and Josef Urban. Enhancing ENIGMA given clause guidance. In Florian Rabe, William M. Farmer, Grant O. Passmore, and Abdou Youssef, editors, Intelligent Computer Mathematics - 11th International Conference, CICM 2018, Hagenberg, Austria, August 13-17, 2018, Proceedings, volume 11006 of Lecture Notes in Computer Science, pages 118–124. Springer, 2018. doi:10.1007/978-3-319-96812-4_11.
[23] Jan Jakubův and Josef Urban. Hammering Mizar by learning clause guidance (short paper). In John Harrison, John O'Leary, and Andrew Tolmach, editors, 10th International Conference on Interactive Theorem Proving, ITP 2019, volume 141 of
LIPIcs, pages 34:1–34:8. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2019. doi:10.4230/LIPIcs.ITP.2019.34.
[24] Cezary Kaliszyk and Josef Urban. HOL(y)Hammer: Online ATP service for HOL Light. Math. Comput. Sci., 9(1):5–22, 2015. doi:10.1007/s11786-014-0182-0.
[25] Cezary Kaliszyk and Josef Urban. MizAR 40 for Mizar 40. J. Autom. Reason., 55(3):245–256, 2015. doi:10.1007/s10817-015-9330-8.
[26] Laura Kovács and Andrei Voronkov. First-order theorem proving and Vampire. In Natasha Sharygina and Helmut Veith, editors, Computer Aided Verification - 25th International Conference, CAV 2013, Saint Petersburg, Russia, July 13-19, 2013. Proceedings, volume 8044 of Lecture Notes in Computer Science, pages 1–35. Springer, 2013. doi:10.1007/978-3-642-39799-8_1.
[27] Sarah M. Loos, Geoffrey Irving, Christian Szegedy, and Cezary Kaliszyk. Deep network guided proof search. In Thomas Eiter and David Sands, editors, LPAR-21, 21st International Conference on Logic for Programming, Artificial Intelligence and Reasoning, Maun, Botswana, May 7-12, 2017, volume 46 of EPiC Series in Computing, pages 85–105. EasyChair, 2017. URL: https://easychair.org/publications/paper/ND13.
[28] Robert Nieuwenhuis and Albert Rubio. Paramodulation-based theorem proving. In Robinson and Voronkov [32], pages 371–443. doi:10.1016/b978-044450813-3/50009-6.
[29] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf,
Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc., 2019. URL: http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf.
[30] Bartosz Piotrowski and Josef Urban. Stateful premise selection by recurrent neural networks. In Elvira Albert and Laura Kovács, editors, LPAR 2020: 23rd International Conference on Logic for Programming, Artificial Intelligence and Reasoning, Alicante, Spain, May 22-27, 2020, volume 73 of EPiC Series in Computing, pages 409–422. EasyChair, 2020. URL: https://easychair.org/publications/paper/g38n.
[31] Alexandre Riazanov and Andrei Voronkov. Limited resource strategy in resolution theorem proving. J. Symb. Comput., 36(1-2):101–115, 2003. doi:10.1016/S0747-7171(03)00040-3.
[32] John Alan Robinson and Andrei Voronkov, editors. Handbook of Automated Reasoning (in 2 volumes). Elsevier and MIT Press, 2001.
[33] Mark Sandler, Andrew G. Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, pages 4510–4520. IEEE Computer Society, 2018. URL: http://openaccess.thecvf.com/content_cvpr_2018/html/Sandler_MobileNetV2_Inverted_Residuals_CVPR_2018_paper.html, doi:10.1109/CVPR.2018.00474.
[34] S. Schulz. Learning Search Control Knowledge for Equational Deduction. Number 230 in DISKI. Akademische Verlagsgesellschaft Aka GmbH Berlin, 2000.
[35] Stephan Schulz, Simon Cruanes, and Petar Vukmirović. Faster, higher, stronger: E 2.3. In Pascal Fontaine, editor, Proc. of the 27th CADE, Natal, Brasil, number 11716 in LNAI, pages 495–507. Springer, 2019.
[36] Stephan Schulz and Martin Möhrmann. Performance of clause selection heuristics for saturation-based theorem proving. In Nicola Olivetti and Ashish Tiwari, editors, Automated Reasoning - 8th International Joint Conference, IJCAR 2016, Coimbra, Portugal, June 27 - July 2, 2016, Proceedings, volume 9706 of Lecture Notes in Computer Science, pages 330–345. Springer, 2016. doi:10.1007/978-3-319-40229-1_23.
[37] Leslie N. Smith. Cyclical learning rates for training neural networks. In 2017 IEEE Winter Conference on Applications of Computer Vision, WACV 2017, pages 464–472. IEEE Computer Society, 2017. doi:10.1109/WACV.2017.58.
[38] Leslie N. Smith and Nicholay Topin. Super-convergence: Very fast training of residual networks using large learning rates.
CoRR, abs/1708.07120, 2017. URL: http://arxiv.org/abs/1708.07120.
[39] Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res., 15(1):1929–1958, 2014. URL: http://dl.acm.org/citation.cfm?id=2670313.
[40] Martin Suda. Aiming for the goal with SInE. In Laura Kovács and Andrei Voronkov, editors, Vampire 2018 and Vampire 2019. The 5th and 6th Vampire Workshops, volume 71 of EPiC Series in Computing, pages 38–44. EasyChair, 2020. URL: https://easychair.org/publications/paper/lZfv, doi:10.29007/q4pt.
[41] Tanel Tammet. GKC: A reasoning system for large knowledge bases. In Fontaine [12], pages 538–549. doi:10.1007/978-3-030-29436-6_32.
[42] Josef Urban. MPTP 0.2: Design, implementation, and initial experiments. J. Autom. Reason., 37(1-2):21–43, 2006. doi:10.1007/s10817-006-9032-3.
[43] Josef Urban, Jirí Vyskocil, and Petr Stepánek. MaLeCoP: machine learning connection prover. In Kai Brünnler and George Metcalfe, editors, Automated Reasoning with Analytic Tableaux and Related Methods - 20th International Conference, TABLEAUX 2011, Bern, Switzerland, July 4-8, 2011. Proceedings, volume 6793 of Lecture Notes in Computer Science, pages 263–277. Springer, 2011. doi:10.1007/978-3-642-22119-4_21.
[44] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Guyon et al. [18], pages 5998–6008. URL: https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html.
[45] Mingzhe Wang, Yihe Tang, Jian Wang, and Jia Deng. Premise selection for theorem proving by deep graph embedding. In Guyon et al. [18], pages 2786–2796. URL: https://proceedings.neurips.cc/paper/2017/hash/18d10dc6e666eab6de9215ae5b3d54df-Abstract.html.
[46] Christoph Weidenbach, Dilyana Dimova, Arnaud Fietzke, Rohit Kumar, Martin Suda, and Patrick Wischnewski. SPASS version 3.5. In Renate A. Schmidt, editor, Automated Deduction - CADE-22, 22nd International Conference on Automated Deduction, Montreal, Canada, August 2-7, 2009. Proceedings, volume 5663 of Lecture Notes in Computer Science, pages 140–145. Springer, 2009. doi:10.1007/978-3-642-02959-2_10.