Theory Exploration Powered By Deductive Synthesis
EYTAN SINGHER, Technion
SHACHAR ITZHAKY, Technion

Recent years have seen tremendous growth in the amount of verified software. Proofs for complex properties can now be achieved using higher-order theories and calculi. Complex properties lead to an ever-growing number of definitions and associated lemmas, which constitute an integral part of proof construction. Following this, methods for computer-aided lemma discovery, whether automatic or semi-automatic, have emerged. In this work, we introduce a new symbolic technique for bottom-up lemma discovery, that is, the generation of a library of lemmas from a base set of inductive data types and recursive definitions. This is known as the theory exploration problem, and so far, solutions have been proposed based either on counter-example generation or on the more prevalent random testing combined with first-order solvers. Our new approach, being purely deductive, eliminates the need for random testing as a filtering phase and for SMT solvers. It is therefore amenable to compositional reasoning and to the treatment of user-defined higher-order functions. Our implementation has been shown to find more lemmas than prior art, while avoiding redundancy.
Most forms of verification and synthesis rely on some form of semantic knowledge in order to carry out the reasoning required to complete their task, whether it is checking entailment, deriving specifications for sub-problems (Albarghouthi et al. 2016; Feser et al. 2015), or equivalence reduction for making enumeration tractable (Smith and Albarghouthi 2019). Recent years have shown that domain-specific knowledge can become invaluable for such tasks, whether via the design of a domain-specific language (Ragan-Kelley et al. 2018), specialized decision procedures (Milder et al. 2012), or frameworks for integrating domain knowledge (Polozov and Gulwani 2015). While this approach enables the treatment of whole classes of problems with a well-crafted technique, it is far from covering the space of programs that can benefit from analysis and automated reasoning. Every library or module contributes a collection of new primitives, requiring tweaking or extending the underlying procedures. In a sense, every code base is its own sub-domain, with a vocabulary of symbols to reason about. Our goal in this work is to empower automated reasoning tools with program-specific knowledge to make them more versatile; this knowledge cannot be hand-crafted and has to be generated, in turn, from analysis of the code itself.

In the context of verification tools, such as Dafny (Leino 2010) and Leon (Blanc et al. 2013) (Leon is also a program synthesis tool, but this aspect of it is not relevant to this paper), as well as interactive proof assistants, such as Coq (Coq Development Team 2017) and Isabelle/HOL (Nipkow et al. 2002), knowledge typically takes the form of a set of lemmas that encode reusable facts about the functions defined in the code. Human engineers and researchers are tasked with writing down these lemmas and proving their correctness from basic principles, with the help of the existing, underlying proof automation (whether it is SMT, type inference, or proof search). While lemmas are truths that follow logically from definitions and axioms, automated provers are not keenly adept at finding them; when required lemmas are missing from the input, verification will fail or diverge. Most non-trivial verification tasks require auxiliary lemmas that have to be supplied to them: for example, both Dafny and Leon fail to prove the associativity and commutativity properties of addition starting from the basis of an algebraic construction of the natural numbers. (In fact, these properties are hard-wired into decision procedures for linear integer arithmetic in SMT solvers.) However, when given the knowledge of these properties (i.e., encoded as lemmas), they readily prove composite facts such as (x + 1) + y = 1 + (x + y).

Standard libraries, especially ones included with proof assistants, tend to include a collection of useful lemmas for functions defined therein. As the vocabulary of types and functions grows, the number of such axioms in the library may grow polynomially, since many useful lemmas involve two or more function symbols from it. Even if proving each individual lemma is easy (indeed, modern proof assistants offer sufficient automation to make constructing some of these routine proofs almost trivial), merely writing down all the possible lemmas is a tall order. Moreover, writing a variant of an existing function, even a very simple one, may bring an avalanche of new lemmas that must be proved for the new variant w.r.t. all the other functions, including correspondence between the variant and the original function.
As a result, theories contained in libraries are most often incomplete, and lemmas are added to them as the need arises, e.g. as they emerge in the course of developing larger proofs that use some definitions from said library.

An alternative approach is to eagerly generate valid lemmas entailed by a small, initial knowledge base, and to do so automatically, offline, as a precursor to any work that would be built on top of the library. This paradigm is known as theory exploration (Buchberger 2000; Buchberger et al. 2006), and is the focus of this paper. Moreover, we focus on the formation of equational theories, that is, lemmas that assert the equivalence of two terms, with universal quantification over all free variables. The concept of theory exploration was shown to be useful by Claessen et al. (2013a), specifically for equational reasoning. The state-of-the-art theory exploration mechanism relies on random testing to generate candidate conjectures based on functional definitions (e.g. in Haskell) of the symbols. In this work, we explore a fully-symbolic method for theory exploration that takes advantage of the characteristics of induction-based proofs.

While random testing is simple and requires nothing more than an interpreter for the language of the underlying vocabulary, there are certain downsides to it:
• The user is required to supply random generators for the types of the free variables that may occur in the lemmas. While a template-based approach can provide generators for a wide range of tree-like structures (in essence, any algebraic data structure), the space of values grows exponentially, impairing the scalability of the technique to more complex data structures.
• The quality of the generated conjectures is only as good as the set of sampled values. A poorly selected sample may generate a high yield of candidate conjectures, making verifying them infeasible.
• Executing functions with longer run-time can limit scalability.

The first two points are closely related, as there is a trade-off between the cost of sampling and testing many values and the cost of checking a flood of false positives. This is especially true when some of the free variables represent functions, as is the case when the vocabulary contains higher-order operators such as filter, map, fold, scan, etc., which are pervasive in functional libraries. Previous research effort revealed that testing-based discovery is sensitive to the number and size of type definitions occurring in the code base. QuickSpec uses a heuristic to restrict the set of types allowed in terms in order to make the checker's job easier. For example, lists can be nested up to two levels (lists of lists, but not lists of lists of lists). This presents an obstacle towards scaling the approach to real software libraries, since "QuickCheck's size control interacts badly with deeply nested types [...] will generate extremely large test data." (Smallbone et al. 2017)
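To make the first downside concrete, here is a minimal sketch, in Haskell with QuickCheck, of the kind of per-type random generator that testing-based tools rely on; the Lst type and the generator weights are illustrative assumptions, not code from QuickSpec:

```haskell
import Test.QuickCheck

-- An algebraic list type, standing in for any tree-like user-defined structure.
data Lst a = Nil | Cons a (Lst a) deriving (Eq, Show)

-- A size-bounded random generator: the user (or a template mechanism) must
-- supply one of these for every type over which lemma variables may range.
instance Arbitrary a => Arbitrary (Lst a) where
  arbitrary = sized gen
    where
      gen 0 = pure Nil
      gen n = frequency
        [ (1, pure Nil)                             -- stop early
        , (3, Cons <$> arbitrary <*> gen (n - 1)) ] -- keep growing
```

A nested instantiation such as Lst (Lst (Lst Int)) draws a size at every level of nesting, which illustrates the size-control interaction that the quote above points to.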
Main contributions
This paper provides the following contributions:
[Figure 1: the four phases Term Generation (SyGuE), Conjecture Inference (SOE), Conjecture Screening, and Induction Prover (congruence closure), run with iterative deepening; proven lemmas augment the knowledge base, yielding new knowledge.]
Fig. 1. TheSy system overview: breakdown into phases, with feedback loop.

• A definition of a knowledge base, and what it means to be a knowledge extension relative to an underlying reasoning method. New knowledge is said to extend existing knowledge if it can be used as an enabler for deriving new conclusions using the same underlying reasoning, conclusions that could not be reached using the initial knowledge base. This provides a conceptual framework for theory exploration through use of formal methods.
• A system for theory synthesis on top of canonical equational reasoning via congruence closure. Our implementation, TheSy, can discover more lemmas than were found by random testing-based tools, despite the preliminary implementation stage in which it currently is.
• Definition of an interesting sub-problem,
Syntax-Guided Enumeration (SyGuE), which emerges from our treatment of lemma discovery, but is useful for synthesis tasks as well as verification (such as for quantifier instantiation and invariant inference). Broadly, it refers to an exhaustive generation of programs defined by a given vocabulary, modulo equivalence.
Our theory exploration method, named TheSy (Theory Synthesizer, pronounced Tessy), is based on syntax-guided enumerative synthesis. Similarly to previous approaches (Claessen et al. 2013a; Johansson 2017; Smallbone et al. 2017), TheSy generates a comprehensive set of terms from the given vocabulary, and looks for pairs that seem equivalent based on a relatively lightweight but unsound reasoning procedure. The pairs that seem equivalent are passed as conjectures to a theorem prover. The process (as shown in Figure 1) is separated into four stages, which run in an iterative deepening fashion and depend on each other's results. A quick review is given here to help the reader understand their context later on; a code sketch of the full loop is given at the end of this subsection.
(1) Term Generation: based on the given vocabulary and known equalities, build symbolic terms of incremented depth.
(2) Conjecture Inference: TheSy will only attempt to prove that two terms are equal after showing they are equal on symbolic inputs. The comparison is done using known equalities. Terms that are found to be equal are then passed on to the screening phase as conjectures.
(3) Conjecture Screening: some of the conjectures, although true, will not contribute to the system. They are screened out using the known equalities.
(4) Induction Prover: the prover attempts to prove conjectures that passed screening using induction. Conjectures that are successfully proven are added to the known equalities, to be used in future proof attempts and in the next iteration of the previous stages.

The phases are run iteratively in a loop so that discovered lemmas are fed back to earlier phases. This feedback may contribute to discovering more lemmas due to three factors:
(i) Conjecture inference is dependent upon known equalities. Additional equalities enable finding new conjectures.
(ii) Screening becomes more accurate by merging equivalence classes based on known equalities. Through efficient and effective screening, TheSy is able to retry proofs to find more useful lemmas.
(iii) The prover is based on known equalities with a congruence closure procedure. The more lemmas known to the system, the more lemmas become provable by this method.
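The following sketch (ours; the phase functions are illustrative stubs, not TheSy's actual API) renders this loop in Haskell, showing how proven lemmas feed back into the knowledge base before the depth is increased:

```haskell
type Vocabulary = [String]          -- function symbols (types elided in this sketch)
type Equation   = (String, String)  -- an equality between two term representations
type Term       = String

-- Stubs standing in for the four phases described above.
generateTerms :: Vocabulary -> [Equation] -> Int -> [Term]
generateTerms _ _ _ = []            -- (1) enumerate terms up to the given depth

inferConjectures :: [Equation] -> [Term] -> [Equation]
inferConjectures _ _ = []           -- (2) SOE: pair terms equal on symbolic inputs

screen :: [Equation] -> [Equation] -> [Equation]
screen _ cs = cs                    -- (3) drop conjectures provable without induction

proveByInduction :: [Equation] -> [Equation] -> [Equation]
proveByInduction _ _ = []           -- (4) attempt one level of structural induction

explore :: Int -> Vocabulary -> [Equation] -> [Equation]
explore maxDepth vocab = go 1
  where
    go depth known
      | depth > maxDepth = known
      | otherwise =
          let terms      = generateTerms vocab known depth
              candidates = screen known (inferConjectures known terms)
              proved     = proveByInduction known candidates
          in if null proved
               then go (depth + 1) known        -- deepen when nothing new was learned
               else go depth (known ++ proved)  -- feed lemmas back at the same depth
```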
Knowledge and the quality of discovered theories.
As a metric for evaluating the efficacy of results obtained from theory exploration, and therefore their quality or usefulness, we use the notion of knowledge. A theory T in a given logical proof system induces a collection of attainable knowledge, K_T = { φ | T ⊢ φ }, that is, characterized by the set of (true) statements that can be proven based on T. In practice, a "pure" notion of knowledge based on provability is impractical, because most interesting logics are undecidable, and automated proving techniques cannot feasibly find proofs for all true statements. We therefore parameterize knowledge relative to a prover: a procedure that always terminates and can prove a subset of true statements. Termination can be achieved by restricting the space of proofs or by size or resource bounds. We say that T ⊢_S φ when a prover S is able to verify the validity of φ in a theory T. A more realistic characterization of knowledge would then be K_T^S = { φ | T ⊢_S φ }. Assuming that the prover S is fixed, a theory T′ is said to increase knowledge over T when K_{T′}^S ⊃ K_T^S.

The input to TheSy is a base theory inducing base knowledge; successful output would be increased knowledge obtained via extension of this base theory. In our current implementation, the underlying prover is congruence closure-based equality reasoning over universally quantified first-order formulas with uninterpreted functions. This procedure is weak but fast, and constitutes one of the core procedures in SMT solvers. Notably, it cannot reason about recursive definitions, since such reasoning routinely requires the use of induction. To that end, TheSy is geared towards discovering lemmas that can be proven by induction; a lemma is considered useful if it cannot be proven from existing lemmas by congruence closure alone, that is, without induction. Such lemmas are guaranteed to increase knowledge, since they at least add the fact of their own truth to the knowledge base.

To formalize the procedure of generating and comparing the terms, in an attempt to discover new equality conjectures, we introduce the definition of SyGuE (Syntax-Guided Enumeration), which is a variant of SyGuS (Syntax-Guided Synthesis) (Alur et al. 2015). While SyGuS finds a program that meets a certain specification, the goal in SyGuE is to enumerate all the programs in a given grammar, while eliminating redundancy incurred by having multiple representations of the same "pure" function (put simply, semantically equivalent programs). In any enumerative synthesis task, some form of equivalence reduction is essential (Smith and Albarghouthi 2019), since even small sets of equivalent expressions incur a prohibitive blowup in the size of the search space when scaling up the size of enumerated expressions.
Running example.
We introduce a simple running example using a list ADT with the usual two constructors and two additional functions, ++ and filter. The initial state is shown in Figure 2. The remainder of this section provides a high-level description of the theory synthesis procedure, using this example to demonstrate the concepts. To simplify the presentation, we describe each phase first, then explain how the output from the last phase is fed back to the first phase to complete a feedback loop. The first two phases are more tightly coupled, as are the last two; each of the following two subsections therefore describes two consecutive phases.

TheSy maintains a state (Figure 2), consisting of the following elements:
• A sorted vocabulary V.
• A subset C ⊆ V of constructors for some or all of the types.
• A set of equations E, initially consisting only of the definitions of the (non-constructor) functions in V.
• A term height bound k, and a value height bound c.

V = { [] : list T,  (::) : T → list T → list T,  (++) : list T → list T → list T,  filter : (T → bool) → list T → list T }
C = { [], (::) }        k = ⋯        c = ⋯
E = { [] ++ l = l,  (x :: xs) ++ l = x :: (xs ++ l),  filter p [] = ⋯ }
Fig. 2. An example starting state.

The first thing the synthesizer does is generate symbolic values used to discriminate different terms. There are two kinds of symbolic values at play: uninterpreted constants (sometimes called symbolic constants) and symbolic expressions. For all types, uninterpreted constants are generated to represent arbitrary, opaque values inhabiting them. For symbolic expressions, we make a distinction between types. Types whose constructors are in C we call open types; the rest we call closed types. For open types, symbolic expressions are constructed by composing the constructors up to depth c. Each symbolic expression is therefore an expression containing a mixture of symbols from C and (typed) uninterpreted constants.

In the running example, only the type family list T is open. Type variables are not instantiated, which leaves the types T and T → bool as closed types (in principle, also bool itself, but since it does not occur in any function argument position, it does not contribute to the synthesis space). Let e₁, e₂ be symbolic constants of type T, l₁, l₂ of type list T, and p₁, p₂ of type T → bool. In addition to those, additional symbolic constants v₁, v₂ of type T are created, to be used solely in the symbolic expressions [], v₁ :: [], and v₁ :: v₂ :: [], which are also generated.

Next, the synthesizer generates terms up to depth k using the entire vocabulary V, with the addition of special variables called placeholders, denoted τ∘ᵢ. Each placeholder is annotated by a type and an index. For example, with the state in Figure 2 (writing l∘ᵢ for (list T)∘ᵢ and p∘ᵢ for (T → bool)∘ᵢ), we will have terms such as:

    filter p∘₁ l∘₁                    l∘₁ ++ (filter p∘₁ l∘₂)
    filter p∘₁ (l∘₁ ++ l∘₂)           (filter p∘₁ l∘₁) ++ (filter p∘₁ l∘₂)        (1)

The system then replaces the placeholders with all available symbolic values previously constructed for their respective types, and merges terms that are (symbolically) equivalent on all of these values, by using the equations E as rewrite rules and applying them to each assigned term. This effectively induces a partition of the synthesized terms into equivalence classes modulo the symbolic examples that were generated beforehand. This process is very similar to observational equivalence as used by program synthesis tools (Albarghouthi et al. 2013; Udupa et al. 2013), but since it uses symbolic value terms instead of concrete values, we call it symbolic observational equivalence (SOE).

This way, for example, the two terms on the bottom row of (1) will be merged into the same equivalence class, yielding the candidate equation

    filter p (l₁ ++ l₂) ?= (filter p l₁) ++ (filter p l₂)

The two terms on the top row will not be merged.
This is because, for example, for the valuation { l∘₁ ↦ [], l∘₂ ↦ l₂ }, the two sides reduce to different symbolic values ([] on the left, filter p₁ l₂ on the right).

The reasoning procedure differs from previous work, specifically work based on testing techniques, in that:
(1) The lightweight reasoning is purely symbolic and does not require running the terms in some execution environment; TheSy is platform agnostic.
(2) Functions are naturally treated as first-class objects, without a specific support implementation.
(3) The only needed input is the code defining the functions involved; no supporting code such as an SMT solver or random value generators is required.
(4) TheSy has a unique feedback loop between the prover and the synthesizer, allowing more conjectures to be found and proofs to succeed. (Existing tools also contain feedback loops aiming to reduce false and duplicate conjectures. This is also done in TheSy, but TheSy includes additional feedback, as described below.)

In principle, within each equivalence class generated in Subsection 2.1, any two terms are candidates for generating a new equation, since they were proven equal for all assignments of symbolic expressions to placeholders. But doing so for every pair potentially creates many other, "obvious" equalities. For example,

    filter p (l₁ ++ l₂) ?= filter p (l₁ ++ ([] ++ l₂))

which follows from the definition of ++ and has nothing to do with filter. In fact, this equality can be obtained via congruence closure, meaning it does not extend the knowledge base. The synthesizer therefore avoids generating such candidates, by first refining the equivalence classes from Subsection 2.1 using the known equations from E. So, first, [] ++ l∘₂ is immediately merged with l∘₂, and then, following the standard rules of congruence closure, the containing terms are equated as well.

Following this refinement, just one representative of each sub-class is picked, and pairs of these representatives become speculated lemmas and are passed to the prover, which attempts to prove them by induction. Each such conjecture, if true, is guaranteed to increase the knowledge in the system, as the equality was not provable using congruence closure. For practical reasons, the prover employs the following induction scheme:
• Structural induction based on the provided constructors (C).
• The first placeholder of the inductive type is selected as the decreasing argument.
• Exactly one level of induction is attempted for each candidate.
The reasoning behind this design choice is that for every multi-variable term, e.g. l∘₁ ++ l∘₂, the synthesizer also generates the symmetric counterpart l∘₂ ++ l∘₁. So electing to perform induction on l∘₁ does not impede generality. In addition, if more than one level of induction is needed, the proof can (almost) always be revised by making the inner induction an auxiliary lemma. Since the synthesizer produces all candidate
equalities, that inner lemma will also be speculated and proved with one level of induction. Lemmas so proven are added to E and are available to the prover, so that multiple passes over the candidates can gradually grow the set of provable equalities.

When starting a proof, the prover never needs to look at the base case, as this case has already been checked during conjecture inference. Recall that placeholders such as l∘₁ are instantiated with symbolic values such as []. So, for the example discussed above, the case of filter p ([] ++ l) = filter p [] ++ filter p l has already been discharged. We therefore focus on the induction step, which is pretty routine, but is included in Figure 3 for completeness of the presentation.

Assume   filter p (xs ++ l) = filter p xs ++ filter p l
Prove    filter p ((x :: xs) ++ l) = filter p (x :: xs) ++ filter p l
via
(1)  filter p ((x :: xs) ++ l) = filter p (x :: (xs ++ l))
(2)    = match (p x) with true ⇒ x :: filter p (xs ++ l) | false ⇒ filter p (xs ++ l)
(3)    =(IH)  match (p x) with true ⇒ x :: (filter p xs ++ filter p l) | false ⇒ filter p xs ++ filter p l
(4)  filter p (x :: xs) ++ filter p l = (match (p x) with true ⇒ x :: filter p xs | false ⇒ filter p xs) ++ filter p l
(5)    = match (p x) with true ⇒ x :: (filter p xs ++ filter p l) | false ⇒ filter p xs ++ filter p l        □
Fig. 3. Example proof by induction based on congruence closure and case splitting.

It is worth noting that the conjecture inference, screening and induction phases utilize a common reasoning core based on rewriting and congruence closure. In situations where the definitions include conditions, such as match (p x) above, the prover also performs an automatic case split and distributes equalities over the branches. Details and specific optimizations are described in the following sections.

The equations obtained from Subsection 2.2 are fed back in three different but interrelated ways. The first, inner feedback loop is from the induction prover to itself: the system attempts to prove the smaller lemmas first, so that when proving the larger ones, these will already be available as part of E. This enables more proofs to go through. The second feedback loop uses the lemmas obtained to filter out proofs that are no longer needed. The third, outer loop is more interesting: as equalities are made into rewrite rules, additional equations may now pass the inference phase, since the symbolic evaluation core can equate more terms based on this additional knowledge.

It is worth noting that while concrete observational equivalence uses a trivially simple equivalence checking mechanism, with the trade-off that it may generate many incorrect equalities, our symbolic observational equivalence is conservative, in the sense that a symbolic value may represent infinitely many concrete inputs, and only if the synthesizer can prove that two terms will evaluate to equal values on all of them, by way of constructing a small proof, are they marked as equivalent. This means that some actually-equivalent terms may be "blocked" by the inference phase, which cannot happen when using concrete values. But it also means that having additional inference rules (E) can
This propertyof TheSy is appealing because it allows an explored theory to evolve from basic lemmas to morecomplex ones.To understand this last point, consider the standard definition of + over the nat datatype:0 + x = x ( S x ) + y = S ( x + y ) Given the terms t = nat ◦ + nat ◦ and t = nat ◦ + nat ◦ , SOE will not find t = t with the valuations (cid:8) nat ◦ (cid:55)→{ , , } , nat ◦ (cid:55)→ n (cid:9) . This is because using the + definition t can be rewritten, e.g. + n → ∗ S ( S n ) ,while t , e.g. n +
2, cannot be elaborated using this definition alone. However, we can find (and prove) the auxiliary lemmas x + 0 = x and (S x) + y = x + (S y), enabling this rewriting of t₂:

    n + 2 → (S n) + 1 → (S (S n)) + 0 → S (S n)

thus finding that t₁ = t₂ and moving it to the prover (which will then succeed).

This work relies heavily on term rewriting techniques. We are about to use term rewriting as the core of the main phases of the exploration. In the conjecture inference phase, rewriting serves as an equality checking mechanism for symbolic values. In the screening phase, rewriting helps filter out trivial lemmas. In the induction phase, rewriting is used to establish the induction step through congruence closure reasoning. In this section we present the basic notation used and some definitions, as well as properties that will be relevant for the exploration procedure.
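To fix intuitions before the formal definitions, here is a minimal sketch (ours, not TheSy's implementation) of terms, substitutions, and single-step rule application inside a context; it assumes left-linear rules, i.e. no repeated variables in a pattern:

```haskell
import qualified Data.Map as M

data Term = Var String | App String [Term] deriving (Eq, Show)
type Subst = M.Map String Term
data Rule  = Rule { lhs :: Term, rhs :: Term }   -- a directed rule t1 ~> t2

-- Match a pattern against a term, producing a substitution for the pattern's
-- free variables (assumes left-linear patterns; a full implementation would
-- check consistency of repeated variables).
match :: Term -> Term -> Maybe Subst
match (Var x) t = Just (M.singleton x t)
match (App f ts) (App g us)
  | f == g && length ts == length us
  = M.unions <$> sequence (zipWith match ts us)
match _ _ = Nothing

apply :: Subst -> Term -> Term
apply s (Var x)    = M.findWithDefault (Var x) x s
apply s (App f ts) = App f (map (apply s) ts)

-- All single-step rewrites of a term: at the root, or inside some subterm
-- (the "context" C of Definition 3.1 below).
step :: Rule -> Term -> [Term]
step r t = atRoot ++ inSubterms
  where
    atRoot = [ apply s (rhs r) | Just s <- [match (lhs r) t] ]
    inSubterms = case t of
      App f ts -> [ App f (pre ++ u' : post)
                  | (pre, u : post) <- splits ts, u' <- step r u ]
      _        -> []
    splits xs = [ (take i xs, drop i xs) | i <- [0 .. length xs - 1] ]
```

For instance, the defining equation [] ++ l = l of Figure 2 becomes Rule (App "++" [App "[]" [], Var "l"]) (Var "l"); as described next, an equality contributes rules in both directions whenever its free variables permit.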
Consider a formal language L of terms over some vocabulary of symbols. We use the notation R = t₁ ↝ t₂ to denote a rewrite rule from t₁ to t₂. For a (universally quantified) semantic equality law t₁ = t₂, we would normally create both t₁ ↝ t₂ and t₂ ↝ t₁. We refrain from assigning a direction to equalities, since we do not wish to restrict the procedure to strongly normalizing systems, as is traditionally done in frameworks based on the Knuth-Bendix algorithm (Knuth and Bendix 1983). Instead, we define equivalence when a sequence of rewrites can identify the terms in either direction. A small caveat involves situations where FV(t₁) ≠ FV(t₂), that is, one side of the equality contains variables that do not occur on the other. We choose to admit only rules tᵢ ↝ tⱼ where FV(tⱼ) ⊆ FV(tᵢ), because when FV(tᵢ) ⊂ FV(tⱼ), applying the rewrite would have to create new symbols for the unassigned variables in tⱼ, which results in a large growth in the number of symbols and typically makes rewrites much slower as a result. This slight asymmetry is what motivates the following definitions.
Definition 3.1. Given a rewrite rule R = t₁ ↝ t₂, we define a corresponding relation →_R such that

    s₁ →_R s₂  ⟺  s₁ = C[t₁σ] ∧ s₂ = C[t₂σ]

for some context C and substitution σ for the free variables of t₁, t₂. (A context is a term with a single hole, and C[t] denotes the term obtained by filling the hole with t.)
Definition 3.2. Given a relation →_R, we define its symmetric closure ↔_R:

    t₁ ↔_R t₂  ⟺  t₁ →_R t₂ ∨ t₂ →_R t₁
Definition 3.3. Given a set of rewrite rules G_R = {Rᵢ}, we define a relation as the union of the relations of the individual rules: ↔_{G_R} ≜ ⋃ᵢ ↔_{Rᵢ}; when the rule set is clear from context, we write simply ↔. In the sequel, we will mostly use its reflexive transitive closure, ↔*.

The relation ↔* is reflexive, transitive, and symmetric, so it is an equivalence relation over L. Under the assumption that all rewrite rules in G_R are semantics preserving, for any equivalence class [t] ∈ L/↔*, all terms belonging to [t] are definitionally equal. However, since L may be infinite, it is essentially impossible to compute ↔*. An algorithm can explore a finite subset T ⊆ L, and in turn, construct a subset of ↔*.
Definition 3.4. Given a rewrite rule group G_R and a set of terms from a language T ⊆ L, we define the restriction

    ↔*|_T ≜ ↔* ∩ (T × T)

Even ↔*|_T is too hard to compute, since for t₁, t₂ ∈ T satisfying t₁ ↔* t₂, the path between them may include terms from L \ T. Even if there exists a path that lies properly in T, it may be very long (worst case |T|). We therefore define the rewrite search space as

    ↔_T(d) ≜ (↔)^{≤d} ∩ (T × T)        (2)

where R^{≤d} = ⋃_{1≤i≤d} Rⁱ. Notice that t₁ ↔_T(d) t₂ does not necessarily mean that there exists a path of rewrites t₁ → t′ → ⋯ → t₂ of length ≤ d. This is because → itself may not be symmetric. It is not even necessarily the case that there exists a common term t such that both t₁ and t₂ rewrite to t; the direction of rewrites may interleave arbitrarily in ↔_T(d).

In order to be able to cover a large set T, we introduce a compact data structure that can efficiently represent many terms. Normally, terms are represented by their ASTs; but as there would be many instances of common subterms among the terms of T, this would be highly inefficient. Instead, we adopt the concept of a Program Expression Graph (PEG) (Panchekha et al. 2015; Tate et al. 2009), previously used for compiler optimizations, to support efficient processing of rewrite rules over a large number of terms. A PEG is essentially a hypergraph where each vertex represents a set of equivalent terms (programs), and labeled, directed hyperedges represent function applications. Hyperedges therefore have exactly one target and zero or more sources, which form an ordered multiset (a vector, basically). Just to illustrate, the expression filter p (l₁ ++ l₂) is represented by the nodes and edges shown in dark in Figure 4. The nullary edges represent the constant symbols (p, l₁, l₂), and the node u represents the entire term. The expression filter p l₁ ++ filter p l₂, which is equivalent, is represented by the light nodes and edges, and the equivalence is captured by sharing of the node u.

When used in combination with a rewrite system {Rᵢ}, each rewrite rule is compiled to a premise pattern hypergraph and a conclusion pattern hypergraph. Applying a rewrite rule is then reduced to finding a subgraph g₁ that is isomorphic to the premise, and if one is found, extending it with a
Therefore, whensuch duplication is found, it is beneficial to merge v and v , eliminating the duplicate hyperedge.The PEG data structure therefore supports a vertex merge operation and a special transformation, compaction , that finds vertices eligible for merge in order to keep the overall graph size small. This section deals with the first phases of the theory exploration method. The objective is to findnew conjectures using the source code without additional knowledge from the user. Let us considera (possibly infinite) language L given by a sorted vocabulary V . The terms may contain freevariables, with the semantics that these will be universally quantified. For the rest of this paper,equality between terms will be definitional: two terms t , t ∈ L will be considered equivalent iffor any valuation σ of the free variables to concrete values of the appropriate types, t σ and t σ evaluate to the same value.For practical reasons, SyGuS solvers usually limit generated terms by a height bound k . Thisserves to bound the space of possible terms, and in our case, has the additional justification thatlemmas containing complex terms will be less general, and therefore, less useful. So the benefit ofincreasing the height drops for each level introduced. The size of the space, however, grows rapidly:the size of the term space T ( k ) (cid:98) = { t | t ∈ L ∧ height ( t ) ≤ k } is proportional to |V | k . This would bevery large even for small vocabularies and small k . Furthermore, the number of equality conjecturesis quadratic in the number of terms, roughly |T ( k ) | · (cid:16) |T ( k ) − | (cid:17) = O (cid:0) |V | k (cid:1) . Checking all pairsfor equality is clearly intractable.For this reason, the goal of the screening phase is to effectively approximate the set of validequality lemmas and produce a small set of conjectures for the prover to check. This must bedone efficiently while maintaining a low rate of false positives, to avoid overloading the verifier tothe point where the exploration becomes infeasible. In this section we will describe a techniqueto reduce the number of checked conjectures without undermining the subsequent steps of theexploration and proof process. this can be reduced by only considering terms of the same type, but with a relatively small number of term types, will stillbe asymptotically quadratic. heory Exploration Powered By Deductive Synthesis 1:11 The main idea is to split the terms into equivalence classes whose members may be equalfor all possible valuations, by replacing free variables with bounded symbolic values (symbolicexpressions), thus over-appoximating “real” definitional equality. We use a compact representationof these equivalence classes, without having to explicitly hold all the terms in each class. This helpsmitigate the term space explosion and support efficient enumeration and symbolic evaluation ofterms for the purpose of partitioning them into classes.
We would like an efficient term comparison procedure that is able to work without additional input from the user and is able to evaluate terms of all defined types, including function types. This problem is generally undecidable; therefore, solutions will always be incomplete (meaning that some equalities will fail to be detected), with a tradeoff between precision and performance. The general approach taken both here and in (Claessen et al. 2013b) is to enumerate all terms and, by evaluating them, find sets of equal terms. From these sets, equality conjectures are derived and passed to a theorem prover for verification.

Previous work (Smallbone et al. 2017) addressed this by applying random testing with concrete values. This requires value generators for each type τ of free variables occurring in the terms being compared, and an execution environment for L. A generator for type τ is a function F_τ generating a random distribution X_τ of values from the domain of τ. Given that all the relevant F_τ are present and they create a sufficient cover of the space of input values, the correct equivalence classes will be found by using observational equivalence (Albarghouthi et al. 2013).

Although each type requires only a single generator, the comparison will be limited by the distributions X_τ. The ramification is that a generic value generator may not fit every scenario, and one or more specialized generators are needed for full exploration of different vocabularies. As an example, consider a function such as C's strstr, which searches for an occurrence of a substring within a larger string. The probability of generating proper substrings with plain random sampling is very low, so with random testing it will be hard to distinguish strstr from string equality. The problem is exacerbated when free variables may take function types, as is the case when higher-order functions such as filter are present in the vocabulary.

Therefore, while the random testing approach has been applied successfully before (Claessen et al. 2013b; Einarsdóttir et al. 2018; Johansson 2017; Smallbone et al. 2017; Valbuena and Johansson 2015), it is useful to consider a different approach. This work proposes to overcome these issues using a symbolic evaluation method based on term rewriting. Using semantics-preserving rewrite rules allows us to infer that terms are equal without having to substitute concrete values for the free variables, eliminating the need for random generators. It also allows us to incorporate automatically discovered equality laws as simplification rules. Obviously, if this inference were complete, we would not even need induction or theory exploration in the first place. Since rewrite-based definitional equality is inherently weak, we would not gain much by applying it directly to the terms with the free variables. Instead, we do substitute the variables, but with symbolic expressions that, while not representative of all the values of a given type, still represent infinitely many such values. For example, the symbolic expression [v₁], with v₁ : int, represents all lists of length one, and [v₁, v₂] all the lists of length two.

This method overcomes the limitations of using concrete values by describing (infinitely) many concrete evaluations at once. In fact, most if not all of the variables can be left as-is, that is, as uninterpreted values. Only variables whose values influence the recursion depth need to be restricted.
Consider that by restricting the values relevant to recursion, the rewriting may unroll the terms until equality is found. For example, the term map id [x, y, z] can be rewritten by the definitions of map and id into [x, y, z]. The entire evaluation can then be carried out symbolically via a rewrite search, using the known equations E as the source for rewrite rules {Rᵢ} and approximating ↔_T(d) for some depth parameter d. As mentioned in Subsection 2.1, E is initialized to the definitions of the functions of the vocabulary V, but can be extended by discovered equalities, so long as these are proven correct by the process of Section 5, thus making the approximation tighter.

On top of that, a deductive approach can take advantage of pre-existing knowledge when analyzing new definitions. In this way, knowledge collected by analyzing a library or a module can be used when exploring client code that interacts with that library or module. This can enable reasoning about larger code bases without having to run all the available functions, some of which may perform expensive computations.

In this subsection, we introduce the concept of Syntax-Guided Enumeration (SyGuE). SyGuE is similar to Syntax-Guided Synthesis (SyGuS for short (Alur et al. 2015)) in that both use a formal definition of a language to find program terms solving a problem. They differ in the problem definition: while SyGuS is defined as a search for a correct program over the well-formed programs in the language, SyGuE is the problem of iterating over all distinct programs in the language. SyGuS solvers may be improved using a smart search algorithm, while SyGuE solvers need an efficient way to eliminate duplicate terms, which may depend on the definition of program equivalence.
Definition 4.1 (SyGuE).
Given a grammar G and a term equivalence relation Q, the SyGuE problem is defined as generating all the equivalence classes of terms produced by G, modulo Q:

    SyGuE(G, Q) := { [tᵢ] ∈ L/Q | tᵢ ∈ L_G }

Since the class [tᵢ] may well be infinite, a representative is chosen to stand for it. Moreover, it is usually desirable to generate these representative terms in some ascending order, such as by increasing size or height. Typically, synthesizers generate terms in layers; that is, first leaf terms (height = 0), then iteratively terms of height = 1, 2, etc., built from the terms of lower heights. This technique from SyGuS is carried over to SyGuE.

As the goal is finding equality conjectures that require induction to prove, we attempt to find all pairs of terms that might be equal. This is done by creating all the terms and dividing them into equivalence classes, using an over-approximation of the "true" equivalence relation Q. Two terms t₁, t₂ satisfying t₁ Q t₂ will form a candidate for a new equation. If we choose the relation Q = ↔* (which is not an over-approximation), then t₁, t₂ are provably equal by {Rᵢ}. These are the less interesting equalities: what we are really interested in are equalities that are not directly provable using {Rᵢ}, but can be proven by applying induction. To this end, Q = ↔* would be too strong.

Before we define the equivalence relation that we are actually going to use for this instance of SyGuE, we first describe the space of terms that will be generated. As mentioned in Subsection 2.1, at depth = 0 the synthesizer generates a collection of placeholders τ∘ᵢ for the available types τ. These placeholders will become universally quantified variables in the eventual equations inferred, but for the time being they are treated as uninterpreted symbols of the corresponding types. We store them in the PEG as vertices with nullary edges (similar to p, l₁, and l₂ in the example of Subsection 3.2). In addition, we create vertices for all the constants in V. For each vertex, we also record its type. These vertices constitute level 0 of the enumeration.

What follows is an iterative deepening process that is common in enumerative synthesis. For each function symbol f ∈ V of type α₁ → α₂ → ⋯ → α_k → τ whose arity is k, we go over all the k-tuples of vertices ⟨u₁, ⋯, u_k⟩ such that uᵢ : αᵢ according to the recorded types. We then generate a new vertex v with associated type τ, and the hyperedge ⟨u₁, ⋯, u_k⟩ →_f v. (As a common optimization, to avoid recreating terms at higher levels, we also require that at least one of u₁, ⋯, u_k be new, that is, generated at the previous level.)

Now we describe the first part of how equivalence classes are formed and how this is used to find conjectures. Remember that our goal is to find equivalences that cannot be found in the current search space. This cannot be done directly with the uninterpreted values τ∘ᵢ created above, since most rewrite rules have to destructure at least one of the arguments to a function. For example, the term x ++ y has no redexes w.r.t. the rewrites induced by the definition of ++ given in Figure 2. To overcome this, we need to introduce some structure into the values represented by the placeholders. We generate small symbolic expressions to be used as symbolic examples.
They are created from the recursive data types used as placeholder types, by combining uninterpreted values of the base types with constructors. These symbolic examples are used to discriminate programs according to their functional behavior. Programs that evaluate to the same (symbolic) value on all the examples are merged in the PEG. We take advantage of the compaction mechanism already present in the PEG data structure to detect when symbolic values are equal, without ever having to construct those values explicitly or define a normal form. We repeat this process for a set of possible symbolic valuations to the placeholders: for each valuation, essentially, a working copy of the PEG is created, placeholders are filled in, and rewrites are applied up to the rewrite depth d. Whenever two vertices u₁, u₂ were successfully merged in all of these derived PEGs, we merge them in the master PEG as well, to indicate that they are now considered equivalent. This creates an ever-growing equivalence relation over the set T of terms generated so far.

It would be tremendously expensive, and also unnecessary, to create all the combinations of symbolic values (that would be exponential in the number of placeholders). Instead, we found that it is sufficient, sometimes even desirable, to only expand a single placeholder. For the purpose of presentation, assume there is a single type τ (e.g. list) over which we plan to perform structural induction. We associate symbolic values with the placeholder τ∘₁, e.g. l∘₁ ↦ { [], v₁ :: [], v₁ :: v₂ :: [] }. The remaining placeholders remain uninterpreted. When multiple algebraic data types are present, the process is just repeated for each of them.
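A small sketch (ours) of how such symbolic examples can be generated for the list case: constructors are composed up to the value height bound c, with fresh uninterpreted constants v1, v2, ... at the element positions.

```haskell
-- Symbolic list values: an uninterpreted element constant, nil, or cons.
data Sym = V String | SNil | SCons Sym Sym deriving (Eq, Show)

-- All symbolic lists up to depth c, threading an index for fresh constants.
symbolicLists :: Int -> [Sym]
symbolicLists c = go c 1
  where
    go 0 _ = [SNil]
    go d i = SNil : [ SCons (V ("v" ++ show i)) tl | tl <- go (d - 1) (i + 1) ]

-- symbolicLists 2 == [SNil, SCons (V "v1") SNil, SCons (V "v1") (SCons (V "v2") SNil)]
-- i.e. the set { [], v1 :: [], v1 :: v2 :: [] } used above.
```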
Definition 4.2 (Symbolic Observational Equivalence). Given a set S_τ ⊆ L of terms representing symbolic values of type τ, we define symbolic observational equivalence (SOE) over terms as

    t₁ ≈ t₂  ⟺  ⋀_{s ∈ S_τ} t₁[s/τ∘₁] ↔(d) t₂[s/τ∘₁]

That is, the symbolic values of t₁ and t₂ converge for all the symbolic values S_τ that τ∘₁ can take. Equality between symbolic values is determined by a rewrite search of depth d.

Automatic case-splitting.
Even with a single induction variable, and with size-bounded symbolic values, symbolic evaluation can get blocked, crippling the equality checks. This is a known phenomenon stemming from the presence of conditional branches in definitions. For example, in the expression filter p (v₁ :: []), the definition of filter would introduce a term of the form "match (p v₁) with ⋯". Evaluation cannot continue without knowing the value of p v₁. The rewrite-based comparator overcomes this using a case split, in a way that is commonly employed by symbolic execution environments (Cadar et al. 2008; Cadar and Sen 2013). Evaluation is forked and both branches are explored, where each fork assumes an additional path condition: e.g., one fork assumes p v₁ = true, and the other p v₁ = false. Two terms will then be merged only if they are merged on both forks.

Evidently, the number of forked states grows exponentially with the number of conditions that may be encountered during evaluation. It is important to apply case splitting judiciously, so a tighter bound is placed on the number of nested forking steps than the bound d on regular rewrite steps. Luckily, a small number of splits is sufficient in many scenarios. Consider the following two terms from the running example in Subsection 2.1 (1):

    filter p∘₁ (l∘₁ ++ l∘₂)        (filter p∘₁ l∘₁) ++ (filter p∘₁ l∘₂)

The screener will have to compare, e.g.,

    filter p ((v₁ :: []) ++ l) ?= (filter p (v₁ :: [])) ++ (filter p l)

While filter would eventually apply p to all the elements of both v₁ :: [] and l, it is enough to split into two cases: when p v₁ = true, both terms evaluate to v₁ :: filter p l; when p v₁ = false, they both evaluate to filter p l.

At the end of each iteration, every vertex represents an equivalence class defined by the rewrite search modulo the set of symbolic valuations that was used. At this point, any combination of two terms from a single class can be an equality conjecture, which still has to be checked by the prover, since the symbolic examples do not cover all possible inputs. This leads to the next part of the process, where we further filter the conjectures and try to prove the ones left using induction.

This section describes the final conjecture filtering and proof attempt process. After running symbolic observational equivalence (Subsection 4.3), equivalence classes created from symbolic examples are obtained. Terms that were found equal by the (rewriting-based) approximation of ≈ might be provably equal without requiring induction. At this point it is necessary to create finer classes, by a relation unifying conjectures that are provable without induction.

Our experiments show that this filtering passes through only a small set of candidate lemmas. We use the same rewrite search to discharge the induction step. While simple, we found it to be quite robust when integrated with the exploration system as a whole. Instead of guessing auxiliary lemmas that are needed to carry a proof to completion, we rely on discovery of provable lemmas during search. The robustness also stems from an ordering of the conjectures being explored.
Given a set of terms T, we start by building an over-approximation of the provable term equalities. This is done by approximating the equivalence classes C ∈ T/≈ (see Definition 4.2). As mentioned, this step is meant to screen out conjectures that do not require induction to prove, before applying induction, which can be done given the current set of rewrite rules {Rᵢ}. This screening helps prevent the addition of unneeded rewrite rules into {Rᵢ}, keeping it minimal and reducing run time.

We screen the conjectures by taking all the terms in C and creating a finer partitioning C_F = C/↔_C(d). Note that as the equivalence classes are represented in the compact structure (Subsection 3.2), they are all represented by a single vertex. Therefore, to be able to perform rewrite search and find the finer classes, the terms need to be reconstructed into a new compact representation, where each term is represented by a different vertex. We still wish to reduce the number of terms we enumerate; although it does not affect the asymptotic complexity of the problem, it results in significant speedups. The reconstruction is done bottom-up, taking advantage of memoization for common subterms (which are numerous), reducing the number of edges that have to be inserted into the graph.

By creating C_F using a rewrite search, we take advantage of the fact that running a rewrite search and finding that t₁ = t₂ constitutes a proof. This is true when uninterpreted values are used, as is the case while approximating C_F. From the definition of C_F, all terms in [tⱼ] ∈ C_F are proven to be equal to each other by the rewrite search. For each such [tⱼ] we want to choose a single representative. This representative term will be used in creating the final conjectures to be passed to the prover. It actually matters which representative we choose, even though they were all proven equal. Consider the conjecture c_good := x + y = y + x; we could have chosen different representatives and written it as c_bad := 0 + x + y = 0 + y + x. The chosen representative will affect the created rewrite rule. Choosing the larger term might lead to a rewrite rule that won't match all needed contexts when applied.

At this phase, our main concern is to choose the representatives and order the conjectures well. We want to avoid having both rules c_good and c_bad proven, or worse, having c_bad proven and c_good discarded because it is redundant. Continuing with the example, if c_bad was proved before c_good, it means c_bad is now represented in {Rᵢ}. During screening, c_good will be deemed unnecessary due to the presence of c_bad. This can happen since the following is a valid rewrite chain:

    x + y →(x = 0 + x) 0 + x + y →(c_bad) 0 + y + x →(0 + y = y) y + x

Because the system has a feedback loop, during which the proven lemmas are used again for later proofs, it is important to avoid such bad choices.

We define an ordering on terms, in an effort to always choose first the term that will match the most situations: first by size (smallest first), then by generality (most variables first); a comparator rendering of this order is sketched below, followed by the formal definition.
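The following is a direct rendering (ours) of this preference as a comparator; size and numFreeVars are assumed helpers over the term representation.

```haskell
import Data.List (minimumBy)
import Data.Ord (comparing, Down (..))

-- Prefer smaller terms; among equal sizes, prefer MORE free variables,
-- hence the Down wrapper reversing the second comparison.
preference :: (t -> Int)        -- size: number of AST nodes (assumed helper)
           -> (t -> Int)        -- numFreeVars: |FV(t)| (assumed helper)
           -> t -> t -> Ordering
preference size numFreeVars =
  comparing size <> comparing (Down . numFreeVars)

-- The representative of a refined class is a minimum w.r.t. this order:
representative :: (t -> Int) -> (t -> Int) -> [t] -> t
representative size numFreeVars = minimumBy (preference size numFreeVars)
```

This order picks x + y over 0 + x + y (by size), and between equal-sized terms prefers the more general one.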
Definition 5.1 (Term Ordering).
For a term t ∈ L, let |t| denote its size, that is, the number of AST nodes in its syntactic representation. We define a preference ordering ⪯ of terms lexicographically over the tuple ⟨|t|, |FV(t)|⟩, where term sizes are compared with ≤ and free-variable set sizes are compared with ≥:

    t₁ ⪯ t₂  ⟺  ⟨|t₁|, |FV(t₁)|⟩ ≤_{(≤,≥)} ⟨|t₂|, |FV(t₂)|⟩

We then use this ordering as a heuristic to choose a representative from each [tⱼ] ∈ C_F, which will be minimal according to this order.

The overall process can be described as running SyGuE(G, ≈), followed by SyGuE(G[tⱼ], ↔(d)) for each of the equivalence classes, where G[tⱼ] is the input grammar restricted to the equivalence class [tⱼ]. Finally, the number of resulting representatives to enumerate is much smaller than the number of pairs in each [tⱼ], and certainly much smaller than |T|².

The last step of each iteration in this process is attempting to prove the conjectures using induction. Given the equivalence class C and representatives for each fine equivalence class [tⱼ] ∈ C_F, a conjecture c := t₁ = t₂ is built for each combination of two representatives. As will be shown in the results, only a small number of classes [tⱼ] exist in each C_F, which goes to show that this screening technique is effective. For each conjecture c, we already know at this point that t₁ is not provably equal to t₂ in d rewrite steps. This is the point in the process where we assume that an induction proof is needed. Each conjecture is used as the induction hypothesis of its own proof; intuitively, if there is a need for a stronger induction hypothesis, we rely on the system itself to eventually discover a sufficient auxiliary lemma.

At this point only the induction step is needed, because the base cases of the induction proof were already taken care of in Subsection 4.3: the symbolic examples were proved to be equal, and they include all the base cases of the inductive data type. It remains to carry out the induction step.

Given the conjecture c := t₁ = t₂, during induction on type τ, an induction-step proof is needed for each constructor ctor_k(v₁, ..., v_{n_k}) of τ. We construct a search space representing the proof obligation for ctor_k, to be discharged using rewrite search. The search space is just the compact structure containing the terms t₁, t₂, which are updated to represent a step using ctor_k. The induction variable will always be chosen as the first placeholder of the type τ, which we will refer to as the induction variable, or iv. A term tᵢ is updated by substituting tᵢ[ctor_k(v₁, ..., v_{n_k})/iv], where the constructor parameters are new uninterpreted variables of the appropriate types. We define a well-founded order ≺ to automatically recognize terms that are sub-structures of iv, to which the induction hypothesis can be applied soundly. For example, given iv = v₁ :: v₂ (for v₁ : τ′, v₂ : list τ′), we have v₂ ≺ iv. Both the hypothesis and the order ≺ are represented as rewrite rules added into {Rᵢ}. Some proofs will require that the inducted value be in a certain position in the term, e.g.
the first or the second argument of a recursive function. This is solved ahead of time by generating all possible terms, as we also generate a term with the placeholder in the right position for the induction. (Recall that some of the rewrite rules are directed, requiring that both terms be included.)

The order of the conjectures being tested can affect the results of the theorem search. Consider plus associativity: the system can find that c₁ := x + (y + z) = (x + y) + z and also that c₂ := x + (y + x) = (x + y) + x. If we succeed in proving c₂ before c₁, we will end up with an unneeded lemma. However, by proving c₁ first and adding it to {Rᵢ}, we can dismiss c₂, as it can be obtained from c₁ without induction. The first step to solving this is using the order defined in Definition 5.1. Another problem that may arise is proofs failing due to insufficient rewrite rules. Take the example of fold and sum (Example 5.2). The following three conjectures are at play:

    c₁ := fold (+) 0 l = sum l        (3)
    c₂ := x + fold (+) 0 l = fold (+) x l
    c₃ := x + fold (+) 0 l = x + sum l

Before proving the conjecture c₁, the conjecture c₂ needs to be proved. Due to the ordering, the smaller conjecture's proof, i.e. c₁, will be attempted first and will fail; then c₂ will be proved. Without any further rearrangement of the conjectures, a proof attempt of conjecture c₃ will succeed. This makes the system find that c₁ is unnecessary and will prevent the system from proving it.

Order is very important in this system; therefore, if a lemma is found, all conjecture proofs that failed up to this point are reattempted. Before reattempting the proof of a conjecture c, we recheck whether the conjecture c is still needed, by running a rewrite search over the terms of c. This way the system returns to the original smaller conjecture and succeeds in proving it, finding fold (+) 0 l = sum l and making c₃ redundant, thus proving only the smallest needed theorems for this case.

This proof technique is robust, as it can work for user-defined functions and data structures, while proving the needed auxiliary lemmas along the way. Although most of the proof process is dependent on the previous steps, the induction proof is independent of its predecessors. This means that any proof technique that can work with two given terms can be used instead. Having said that, the naive approach solved all the needed proofs and does not majorly affect the run time. As only a few term combinations reach this stage, the run time of a single proof search has little effect on the total system run time.
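For reference, here is a Haskell rendering (ours; TheSy consumes analogous functional definitions rather than this exact code) of the definitions behind conjectures c₁, c₂, c₃, with sum' avoiding the Prelude name:

```haskell
-- Accumulator-style fold, matching the unfolding used in Example 5.2:
-- fold (+) 0 (x : xs) rewrites to fold (+) (0 + x) xs.
fold :: (b -> a -> b) -> b -> [a] -> b
fold _ a []       = a
fold f a (x : xs) = fold f (f a x) xs

sum' :: [Int] -> Int
sum' []       = 0
sum' (x : xs) = x + sum' xs

-- c1: fold (+) 0 l     == sum' l
-- c2: x + fold (+) 0 l == fold (+) x l    (the auxiliary lemma proved first)
-- c3: x + fold (+) 0 l == x + sum' l
```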
Example 5.2.
Proof of (3), c1 := fold (+) 0 l = sum l, by induction on l, once (4), c2, has been proven. Assume fold (+) 0 xs = sum xs (the induction hypothesis, IH).

Induction step:
    fold (+) 0 (x :: xs)
      ⇒ fold (+) (0 + x) xs
      ⇒ fold (+) x xs
      ⇒ (by c2) x + fold (+) 0 xs
      ⇒ (by IH) x + sum xs
      ⇒ sum (x :: xs)    □

Notice that this example, variants of which occur in some tutorials on formal proofs using proof assistants, typically requires a strengthening of the inductive property. This was completely avoided here thanks to the auxiliary lemma, which was automatically discovered.
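For concreteness, here are the definitions this example assumes, written in Haskell (sum is renamed sum' to avoid clashing with the Prelude), together with conjectures (3) and (4) phrased as Boolean properties, a concrete-testing stand-in for TheSy's symbolic screening:

```haskell
-- fold is the accumulator-style (left) fold; sum' is plain structural recursion.
fold :: (b -> a -> b) -> b -> [a] -> b
fold _ acc []       = acc
fold f acc (x : xs) = fold f (f acc x) xs

sum' :: [Integer] -> Integer
sum' []       = 0
sum' (x : xs) = x + sum' xs

-- The target lemma (3): fold (+) 0 l = sum l.
prop_c1 :: [Integer] -> Bool
prop_c1 l = fold (+) 0 l == sum' l

-- The auxiliary lemma (4): x + fold (+) 0 l = fold (+) x l.
prop_c2 :: Integer -> [Integer] -> Bool
prop_c2 x l = x + fold (+) 0 l == fold (+) x l
```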
Evaluating a theory exploration system poses some challenges. One challenge is that different combinations of lemmas can express the same knowledge (in the sense described in Section 2). Testing whether a discovered theory expresses the correct and full expected knowledge could be done by checking that the knowledge obtained from exploration subsumes the expected knowledge. However, this check is not feasible because, essentially, the knowledge K_S(T) is infinite; notice that it is not sufficient to check whether all of the lemmas in some T can be proven using some other T′. There may still be some formula φ such that T ⊢_S φ but T′ ⊬_S φ, due to the bounds necessarily imposed on proof search in order to guarantee termination. For this reason, we resort to a manual investigation of the result relative to a set of known lemmas that we expect to be found (for example, l ++ [] = l). Another challenge arises when comparing with existing tools for automated reasoning, since different tools work in different scenarios in which they expect different kinds of user input. As a matter of principle, our design goal for TheSy was to perform exploration using function definitions alone, without additional user guidance. This affects the interpretation of any comparison to theorem provers or verification systems such as Leon, as the input is meaningfully different. To evaluate TheSy, we compare with Hipster (Johansson et al. 2014), a fully automated exploration system with several follow-ups (Einarsdóttir et al. 2018; Johansson 2017; Valbuena and Johansson 2015).

As a comparison of TheSy and Hipster requires manual analysis, we perform a case study of a handful of test cases that we consider representative, adding a discussion of the results for each of them and how they relate. In addition, all lemmas found during exploration are passed to Leon, which attempts to verify them. This last step is also a demonstration of how theory exploration can strengthen and augment software verification.

We present the implementation details and experimental setup, then showcase the capabilities of TheSy using a collection of benchmarks representative of functional programming practices. These are shown independently of preexisting tools. This is followed by a discussion of the comparison with other systems.

As TheSy relies heavily on term rewriting, our implementation of the PEG data structure includes a few additions to improve performance. The rewriting engine is incremental, relying on the PEG's support for incremental processing: our incremental PEG keeps track of when edges are added, and during the pattern-match step of rewrite-rule application, at least one matched edge must be new, that is, created after the last time that rule was applied. This optimization avoids recomputation of pattern matching across the ever-growing hypergraph, and prevents unnecessary edge insertions and node merges.
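The incrementality bookkeeping can be illustrated with a small Haskell sketch; the `Edge` type and the timestamp discipline shown here are our illustration, not TheSy's actual hypergraph implementation:

```haskell
-- A hyperedge annotated with the rewrite iteration at which it was inserted.
data Edge = Edge
  { label   :: String  -- the function symbol labeling the hyperedge
  , nodes   :: [Int]   -- target and argument equivalence classes
  , addedAt :: Int     -- iteration at which the edge was inserted
  }

-- A candidate match is worth processing only if at least one of its edges
-- is newer than the last application of the rule in question; otherwise the
-- same rewrite would just be recomputed on an unchanged part of the graph.
worthApplying :: Int -> [Edge] -> Bool
worthApplying lastApplied match = any (\e -> addedAt e > lastApplied) match
```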
As mentioned, TheSy supports automatic case splitting; this is done for functions that contain match expressions (if being just a special case of match). A special rewrite rule is created to enable split cases during term evaluation, marking terms to split together with their split values. For each split marker and its associated split values, the hypergraph is copied once per split value, replacing the marked term with that value. Case splitting is applied recursively, performing a rewrite search on each split value at each level until reaching the maximum split depth. After the search finishes for all branches, conclusions are merged to find which equalities were proved in all of them. This implementation ends up repeating the same branches due to the ordering, which worsens with deeper split depth. Since there is no consideration of whether a case split will be useful, depending on where it is applied, many unnecessary case splits are performed. While this limits scale in its current form, TheSy was able to find lemmas about the functional operator filter with small vocabularies.

TheSy has a few control parameters governing the term generation and search processes. For experiments in which we use non-default parameters, the decision is explained in the accompanying discussion. We now review the parameters and their impact on the exploration results. The rewrite search depth d affects which equalities we might find: a shallow search can lead to missed lemmas, but a deeper search may be prohibitively expensive, as some of the rewrite rules can run indefinitely, such as x → x + 0. The default search depth is d = 8. Terms are always generated up to depth 2 (k = 2). The number of placeholder values created per type defaults to 2; in some experiments it is reduced to 1. (This reduction negatively impacts the robustness of the inference phase; e.g., it admits reverse l = l for lists of size 1. Such spurious conjectures have to be discarded by the induction prover.)

All benchmarks were executed on a Xeon E3-1231 v3 (quad-core) machine with 32GB of RAM. The cases being checked are defined by the sorted vocabulary used, which encapsulates the relevant function and data type definitions. For each experiment, we report the number of lemmas that were: found and proven, passed screening but failed the proof, and proved only on retry. If a proof failed when a lemma was first discovered but succeeded later on, it means that a necessary auxiliary lemma was found in the meantime, enabling the proof to go through. This situation is common, as auxiliary lemmas are sometimes larger and therefore occur later in our ordering. Some of the lemmas proven in early test cases are reused later to prevent TheSy from reproving them. Lemmas for ∨ and ∧, which do not require induction to prove, are also encoded as rewrite rules. (TheSy can in fact prove them, but they are discarded during screening for being too simple.)

The experimental results are listed in Table 1, which presents, for each given vocabulary, all the lemmas found by TheSy. Additionally, a breakdown of the run time is presented in Table 2. As the term generation and conjecture inference phases are highly coupled, and term generation is relatively insignificant, the total time for both phases is reported together. The conjecture screening phase happens many times, as it is needed whenever proofs are retried; we therefore separate its run time from the prover's, presenting the accumulated time spent screening. The prover column includes two measurements: the left one is the total time spent in the prover, and the right one is the portion of that time spent retrying conjectures. All measurements are given in seconds.

Two measurements are given for experiments containing fold in the vocabulary. When using the default parameters, TheSy creates many unneeded terms for those cases.
To demonstrate the impact of this, we added a second run with a reduced number of placeholder values. Certainly, it would be possible to optimize every experiment by choosing the right parameters, but since the goal remains to work fully automatically, it is preferable to keep the defaults.

Another design choice impacting performance is the addition of a conjecture filtering phase at the end of the iterative deepening, after new rules have been found. This is important, as shown at the end of the overview (Section 2), because it allows finding new lemmas: the additional knowledge discovered during exploration helps the conjecture inference phase discover more equalities. The effect on run time can be dramatic when working with larger vocabularies, because continued rewriting of large hypergraphs becomes expensive as they keep growing. For example, in the stress test, finding the lemma map f (rev l) = rev (map f l) is possible only due to this additional run, which changed the run time from 1 hour to 4. Still, from a theory exploration standpoint, the additional lemma can be very useful in the future and should not be overlooked.

A few concepts are demonstrated by the experiments:
• Finding basic facts about new data types such as Nat, List, and Tree.
• Exploring higher-order functions such as filter, map, and fold.
• Working with bigger vocabularies, as seen in the stress test.
• Knowledge transfer: some of the experiments successfully reuse knowledge to find new results (this can be seen in the "filter ++" test, which uses facts about ++).

V: 0 succ +
    x + 0 = x;  succ x = x + (succ 0);  succ x + y = x + succ y;  (x + y) + z = x + (y + z);  x + y = y + x
V: [] :: ++ :+
    l = l ++ [];  l1 ++ (x :: l2) = (l1 :+ x) ++ l2;  l1 ++ (l2 :+ y) = (l1 ++ l2) :+ y;  l1 ++ (l2 ++ l3) = (l1 ++ l2) ++ l3
V: rev :: :+
    x :: rev l = rev (l :+ x);  rev (rev l) = rev l
V: rev :+ ++ ::
    rev (rev l) = l;  rev (l :+ x) = x :: rev l;  rev (l1 ++ l2) = (rev l2) ++ (rev l1)
V: filter
    filter p (filter p l) = filter p l;  filter p (filter q l) = filter q (filter p l)
V: filter ++
    filter p (filter p l) = filter p l;  filter p (l1 ++ l2) = (filter p l1) ++ (filter p l2)
V: 0 sum + fold
    fold (+) 0 l = sum l;  fold (+) x l = x + sum l;  fold (+) x l = x + fold (+) 0 l;  x + sum l = x + fold (+) 0 l
V: true all-true ∧ fold
    all-true l = fold (∧) true l;  all-true l ∧ b = b ∧ fold (∧) true l;  all-true l ∧ b = fold (∧) b l;  b ∧ fold (∧) true l = fold (∧) b l
V: false any-true ∨ fold
    fold (∨) false l = any-true l;  fold (∨) b l = b ∨ fold (∨) false l;  fold (∨) b l = any-true l ∨ b;  b ∨ fold (∨) false l = any-true l ∨ b
V: map ◦
    map g (map f l) = map (g ◦ f) l
V: + rev :+ len
    len l = len (rev l);  1 + (len l) = len (l :+ x)
V: + ++ len
    len (l1 ++ l2) = len l1 + len l2
V: rev switch flatten tmap id
    t = tmap id t;  switch (switch t) = t;  tmap g (switch t) = switch (tmap g t)
V: map :: :+ ++ fold rev
    map f (l1 ++ l2) = (map f l1) ++ (map f l2);  rev (rev l) = l;  map f (rev l) = rev (map f l);  rev (l1 ++ l2) = (rev l2) ++ (rev l1);  fold g (fold g x l1) l2 = fold g x (l1 ++ l2);  x :: (rev l) = rev (l :+ x)

Table 1. A list of the lemmas that were discovered using different vocabularies (V).
V                                      Generation + Inference   Filtering   Induction Prover (retries)   found   failed   retried
0 succ + / [] :: ++ :+                        2692.57             293.33          <1 (0.00)                 4       0        0
rev :: :+                                       13.01               2.27          <1 (0.09)                 2       1        1
rev :+ ++ ::                                  1444.45              10.68          <1 (0.10)                 3       1        1
filter                                           1.34               1.12          <1 (0.00)                 2       0        0
filter ++                                       30.20               5.04           1 (0.00)                 2       8        0
0 sum fold + / true all-true ∧ fold            501.94             169.13           3 (1.07)                 4       3        2
                                               (15.02)            (12.31)         (1) (0.92)               (3)     (2)
false any-true ∨ fold                          517.29             231.81           2 (0.34)                 4       2        1
                                               (30.57)            (27.26)         (3) (1.19)               (3)     (2)
map ◦ / + rev :+ len                           162.44              34.61          48 (41.64)               10       2        1
+ ++ len                                        39.66               5.15          <1 (0.00)                 1       0        0
rev switch flatten tmap id                       7.76               1.54           1 (0.00)                 3       0        0
map :: :+ ++ fold rev                        15020.65              33.70           2 (0.94)                 7       3        3

Table 2. Statistics for evaluation benchmarks (times in seconds). In each experiment, we report the total number of lemmas found, as well as candidates that passed the initial screening but failed the prover, and the number of proof retry attempts. Numbers in parentheses represent faster runs with tweaked parameters (see text).
To compare the performance of TheSy with that of existing tools, we selected a set of sorted vocabularies that we believe express different aspects of, and difficulties in, theory exploration. These include different ADTs (lists, trees, and naturals), lemmas that rely on auxiliary lemmas for their proofs, and conditional lemmas (the latter are currently not supported by TheSy, but are nevertheless a pervasive component of automated theorem proving). We compared TheSy with Hipster (Johansson 2017), a theory exploration add-on for Isabelle/HOL, and Leon (Blanc et al. 2013), a verification and synthesis tool for Scala. All three share a domain of purely functional programs with algebraic data types and recursive definitions. We added Leon to our comparison to see which of the lemmas can be proved automatically without further intervention, that is, whether exploration can be beneficial and prove, or help prove, lemmas that are not trivial for it. As the theory exploration systems stop when they finish exploring the defined space, we gave Leon a timeout of half an hour to keep the comparison close. Table 3 shows the results of running the three tools on different vocabularies.

What follows is a case study comparing the three systems. Some cases benefit from previously obtained knowledge, so results from previous cases are retained. Throughout the study, Hipster and TheSy receive only the sorted vocabulary, while Leon additionally receives the lemma definitions, as it has no concept of theory exploration. As the results show, there are many cases, usually the more intricate ones, in which exploration can serve a system like Leon.
Case study 1 — natural numbers.
We included the canonical construction of natural numbers based on 0 and succ (successor), and a recursive definition of +. By default, two placeholders are created per type; but + has two arguments of the inducted type, requiring three placeholders to fully explore the space. If only two placeholders were used, commutativity (x + y = y + x) could be discovered, but not associativity (x + (y + z) = (x + y) + z); with two placeholders, TheSy can find only limited cases such as x + (x + y) = (x + x) + y. Therefore, the number of placeholders used for this vocabulary is increased to 3. TheSy and Hipster found the same knowledge, with a slight difference: TheSy added an extra, unneeded lemma for x + 1. This happened due to the ordering of the lemmas, where the smaller one is added first. Leon cannot prove commutativity, as it requires an auxiliary lemma; although Leon proved that lemma, user intervention is required to make use of it.

V: Nat (0 succ +)
    TheSy:   x + 0 = x;  S x = x + (S 0);  x + (S y) = S (x + y);  (x + z) + y = x + (z + y);  z + x = x + z
    Hipster: x + 0 = x;  x + (S y) = S (x + y);  (x + z) + y = x + (z + y);  z + x = x + z
    Leon:    x + 0 = x;  x + (S y) = S (x + y);  (x + z) + y = x + (z + y);  z + x = x + z
V: map filter ++
    TheSy:   map f (xs ++ ys) = map f xs ++ map f ys;  filter p (xs ++ ys) = filter p xs ++ filter p ys
    Hipster: filt z (filt y x2) = filt y (filt z x2);  filt y (filt y z) = filt y z
    Leon:    map f (xs ++ ys) = map f xs ++ map f ys;  filter p (xs ++ ys) = filter p xs ++ filter p ys
V: leq
    TheSy:   (conditional lemmas not supported)
    Hipster: leq x x;  leq x (Succ x);  not (leq (Succ x) x);  leq y x ⇒ not (leq (Succ x) y);  leq x y ⇒ leq x (Succ y)
    Leon:    timeout

Table 3. Results of the case study comparing TheSy, Hipster, and Leon.
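One way to write down this vocabulary, in Haskell (an illustration of the construction, not TheSy's input format):

```haskell
-- Peano naturals with recursive addition.
data Nat = Z | S Nat
  deriving (Eq, Show)

-- plus recurses on its first argument; this asymmetry is exactly why
-- proving commutativity needs auxiliary lemmas such as x + 0 = x.
plus :: Nat -> Nat -> Nat
plus Z     y = y
plus (S x) y = S (plus x y)
```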
Case study 2 — lists.
This experiment involves polymorphic lists constructed with [] and ::, and the definitions of ++ (list concatenation) and rev (reverse). As in Case 1, ++ has a signature containing the inducted type in more than one argument, but here more function symbols are present. In this case, TheSy runs twice: first building basic knowledge on ++ alone with 3 placeholders, and only later running on the full vocabulary with the default parameters. In this setting, TheSy discovers the identity and associativity properties of ++, as well as distributivity with rev (rev (l1 ++ l2) = rev l2 ++ rev l1) and the nice property rev (rev l) = l. Hipster generates most of the expected results but fails to discover rev (rev l) = l. Hipster did find a similar lemma, rev (l1 ++ rev l2) = l2 ++ rev l1, which, when instantiated with l1 = [], implies the expected lemma. However, requiring this extra reasoning step makes the lemma harder to use in automated provers as well as in interactive environments. We do note that when we re-ran Hipster with V = {rev}, then, based on lemmas discovered in the previous step, Hipster recovered the missing lemma; the additional execution leads to a total runtime of 18 minutes. Leon manages to verify the simple auxiliary lemmas while failing at those that require using them.
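For reference, the definitions this case study assumes, in Haskell; `snoc` plays the role of the paper's (:+) operator, and rev is the naive reverse:

```haskell
-- Append a single element at the end of a list.
snoc :: [a] -> a -> [a]
snoc []       y = [y]
snoc (x : xs) y = x : snoc xs y

-- Naive list reversal, defined via snoc.
rev :: [a] -> [a]
rev []       = []
rev (x : xs) = snoc (rev xs) x   -- equivalently, rev xs ++ [x]
```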
Case study 3 — functional operators.

With the basic functional primitives for lists, map and filter, along with ++ from before, TheSy successfully discovers the distributivity properties. Hipster fails to discover these properties. Hipster can find the commutativity lemma filter p (filter q l) = filter q (filter p l), which TheSy was unable to reach due to the number of case splits required by filter (a consequence of the conditional statement in the definition of filter). With filter alone, TheSy can recover this property in 7 seconds via an additional case-split depth. With map and ++ alone, TheSy finds the same lemmas in 40 seconds. This case study shows the importance of automatic case splitting when reasoning about functional programs. Further research and engineering effort should lead to more efficient treatment of this aspect, enabling a full exploration of theories containing filter and similar functions. The lemmas found in this case study can be proven without any supporting lemmas, and Leon succeeded in proving all of them.

We then explored a third functional operator, fold. (As an aside, libraries often contain efficient implementations of fold, so rewriting a user program to use fold can significantly improve its performance; this was one of the motivations for this case study.) TheSy finds four lemmas, one of which is, in fact, redundant (due to an implementation bug). Hipster finds a single lemma, showing the equality of fold (+) 0 and sum, but again includes redundant symbols (x + (sum y) instead of sum y). fold is another case where having many placeholders is unfavorable to TheSy's run time: if we limit the number of placeholders to one per type, TheSy finishes in 42 seconds. Leon is unable to prove any lemma in this case.
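Returning to filter: its definition contains a conditional, which is what forces the case splitting discussed above when terms containing filter are compared symbolically. A sketch in Haskell (renamed filter' to avoid the Prelude clash):

```haskell
-- The guard on p x is the branch point: a symbolic argument gives no
-- concrete Boolean, so the rewrite search must split into both cases.
filter' :: (a -> Bool) -> [a] -> [a]
filter' _ []       = []
filter' p (x : xs)
  | p x       = x : filter' p xs
  | otherwise = filter' p xs
```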
Case study 4 — trees.

An important aspect is dealing with custom data structures. This case investigates the behaviour on an additional data structure, to check the effect on run time. All tools succeed here; although this case does not contribute much to the comparison, it does demonstrate the robustness of the exploration.
Case study 5 — conditions.
A difficult challenge in theory exploration is creating conjectures containing logical implications, as this quickly leads to space explosion. By allowing the user to provide predicates to use as premises, it is possible to reduce the lemma space significantly while retaining suitable conjectures. TheSy does not yet support this feature and therefore cannot produce such lemmas. Hipster uses a feature of QuickSpec to generate conjectures based on given predicates (Valbuena and Johansson 2015). This works well in the context of leq, but is limited for other forms of predicates (such as strstr, as described in Subsection 4.1).
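To illustrate the shape of such conditional lemmas, here is one of the leq lemmas from Table 3 written as a guarded Boolean property in Haskell (Nat is repeated from the earlier sketch for self-containment):

```haskell
data Nat = Z | S Nat
  deriving (Eq, Show)

-- Recursive ≤ on Peano naturals.
leq :: Nat -> Nat -> Bool
leq Z     _     = True
leq (S _) Z     = False
leq (S x) (S y) = leq x y

-- leq x y ⇒ leq x (S y), encoded as premise-implies-conclusion.
prop_leqSucc :: Nat -> Nat -> Bool
prop_leqSucc x y = not (leq x y) || leq x (S y)
```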
Program Expression Graphs.
Originally brought into use for the purpose of low-level compiler optimizations (Tate et al. 2009), PEGs can be used to represent a large program space compactly by packing together equivalent programs. In that sense, they are similar to Version Space Algebras (Lau et al. 2003), but their prime objective is entirely different: while VSAs focus on efficient intersections, PEGs are used to saturate a space of expressions with all equality relations that can be inferred. They have found use in optimizing expressions for more than just speed, for example to increase the numerical stability of floating-point programs in Herbie (Panchekha et al. 2015). There are two key differences in the way PEGs are used in this work compared to prior work: (i) equality laws are not hard-coded or fixed; they are fertilized as the system proves more lemmas automatically; (ii) saturation cannot be obtained in some of our cases, which we overcome with a bound on rewrite-rule application depth.
Automatic theorem provers.
Many systems rely on known theorems or are designed to support users in semi-automated proof. Congruence closure is also a proven method for tautology checking in automated theorem provers, such as Vampire (Kovács and Voronkov 2013), and is used as a decision procedure for reasoning about equality in the leading SMT solvers Z3 (De Moura and Bjørner 2008) and CVC4 (Barrett et al. 2011). There, it is limited mostly to first-order reasoning, but it can essentially be applied unchanged to higher-level scenarios such as ours.
Theory exploration.
IsaCoSy (Johansson et al. 2010) pioneered the use of synthesis techniques for bottom-up lemma discovery. IsaCoSy combines equivalence reduction with counterexample-guided inductive synthesis (CEGIS (Solar-Lezama et al. 2006)) for filtering candidate lemmas. This requires a solver capable of generating counterexamples to equivalence. Subsequent developments were based on random generation of test values, as implemented in QuickSpec (Smallbone et al. 2017) for reasoning about Haskell programs, later combined with automated provers for checking the generated conjectures (Claessen et al. 2013b; Johansson 2017). We explored the tradeoffs between using concrete vs. symbolic values in Subsection 4.1 and Subsection 6.3.
Inductive synthesis.
In the area of SyGuS (Alur et al. 2015), tractable bottom-up enumerationis commonly achieved by some form of equivalence reduction (Smith and Albarghouthi 2019).When dealing with concrete input-output examples, observational equivalence (Albarghouthi et al.2013; Udupa et al. 2013) is very effective. The use of symbolic examples in synthesis has beensuggested (Drachsler-Cohen et al. 2017), but to the best of our knowledge, ours is the only settingwhere symbolic observational equivalence has been applied. Inductive synthesis, in combinationwith abduction (Dillig et al. 2017), has also been used to infer specifications (Albarghouthi et al.2016), although not as an exploration method but as a supporting mechanism for verification.
TheSy has been built from the ground up based on term rewriting, and its current results are on par with what previous, non-symbolic tools have been able to accomplish. However, it is our opinion that its advantages lie in the potential to extend to new domains that cannot be handled by concrete testing and SMT solvers.

Much of the appeal of TheSy as a new technique for theory exploration is its versatility in handling abstract values. Since concrete data elements are not required for the lemma vetting process, TheSy can be extended to arbitrary families of types. Our experiments have shown that it successfully handles first-class functions without generating the function bodies, based solely on their signatures. The next step would be to add support for refinement types (Freeman and Pfenning 1991), and notably, dependent types. These have been shown to be excellent tools for reasoning about the correctness of software (Chlipala 2013; Hamza et al. 2019; Vazou et al. 2014). In particular, their combination, dependent refinement types, is most interesting because it exposes the tight interconnection between types and propositions. This makes it an ideal challenge for symbolic, constructive reasoning.

A second pain point, exposed by our experiments, is handling case splits, both when comparing terms and when carrying out proofs. Testing technology has evolved several tools targeted specifically at handling conditional control flow over 40 years of research in the field (King 1976), with some commercial outcomes such as Pex (Tillmann and de Halleux 2008). The appeal of applying a proof-theoretic approach is its ability to employ mathematical abstractions to describe classes of values. The current implementation of TheSy essentially forks execution when splitting by case, copying the entire state. A more careful treatment could distinguish facts that are case-specific from global ones, making the search more focused and allowing deeper proof exploration.
We described a new method for theory exploration that differentiates itself from existing work by basing its reasoning on a novel engine built on term rewriting. By creating a feedback loop between its four phases (term generation, conjecture inference, conjecture screening, and induction proving), the system manages to explore many theories efficiently. Such a method can be applied to many practical tasks, especially in verification and optimization. This term-rewriting-based method, while basic in its implementation, has shown results comparable to existing exploration methods. Our main conclusion is that deductive techniques can contribute to theory exploration, in addition to their existing applications in invariant and auxiliary-lemma inference.
REFERENCES
Aws Albarghouthi, Isil Dillig, and Arie Gurfinkel. 2016. Maximal Specification Synthesis. In Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL '16). Association for Computing Machinery, New York, NY, USA, 789–801. https://doi.org/10.1145/2837614.2837628
Aws Albarghouthi, Sumit Gulwani, and Zachary Kincaid. 2013. Recursive program synthesis. In International Conference on Computer Aided Verification. Springer, 934–950.
Rajeev Alur, Rastislav Bodik, Garvit Juniwal, Milo M. K. Martin, Mukund Raghothaman, Sanjit A. Seshia, Rishabh Singh, Armando Solar-Lezama, Emina Torlak, and Abhishek Udupa. 2015. Syntax-guided synthesis. Dependable Software Systems Engineering 40 (2015), 1–25.
Clark Barrett, Christopher L. Conway, Morgan Deters, Liana Hadarean, Dejan Jovanović, Tim King, Andrew Reynolds, and Cesare Tinelli. 2011. CVC4. In International Conference on Computer Aided Verification. Springer, 171–177.
Régis Blanc, Viktor Kuncak, Etienne Kneuss, and Philippe Suter. 2013. An Overview of the Leon Verification System: Verification by Translation to Recursive Functions. In Proceedings of the 4th Workshop on Scala (SCALA '13). Association for Computing Machinery, New York, NY, USA, Article 1, 10 pages. https://doi.org/10.1145/2489837.2489838
Bruno Buchberger. 2000. Theory exploration with Theorema. Analele Universitatii Din Timisoara, ser. Matematica-Informatica 38, 2 (2000), 9–32.
Bruno Buchberger, Adrian Crăciun, Tudor Jebelean, Laura Kovács, Temur Kutsia, Koji Nakagawa, Florina Piroi, Nikolaj Popov, Judit Robu, Markus Rosenkranz, et al. 2006. Theorema: Towards computer-aided mathematical theory exploration. Journal of Applied Logic 4, 4 (2006), 470–504.
Cristian Cadar, Daniel Dunbar, and Dawson Engler. 2008. KLEE: Unassisted and Automatic Generation of High-Coverage Tests for Complex Systems Programs. In Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation (OSDI '08). USENIX Association, USA, 209–224.
Cristian Cadar and Koushik Sen. 2013. Symbolic execution for software testing: three decades later. Commun. ACM 56, 2 (2013), 82–90. https://doi.org/10.1145/2408776.2408795
Adam Chlipala. 2013. Certified Programming with Dependent Types: A Pragmatic Introduction to the Coq Proof Assistant. The MIT Press, Cambridge, MA, USA.
Koen Claessen, Moa Johansson, Dan Rosén, and Nicholas Smallbone. 2013a. Automating inductive proofs using theory exploration. In International Conference on Automated Deduction. Springer, 392–406.
Koen Claessen, Moa Johansson, Dan Rosén, and Nicholas Smallbone. 2013b. Automating inductive proofs using theory exploration. In International Conference on Automated Deduction. Springer, 392–406.
The Coq Development Team. 2017. The Coq Proof Assistant Reference Manual, version 8.7. http://coq.inria.fr
Leonardo De Moura and Nikolaj Bjørner. 2008. Z3: An efficient SMT solver. In International Conference on Tools and Algorithms for the Construction and Analysis of Systems. Springer, 337–340.
Isil Dillig, Thomas Dillig, Boyang Li, Kenneth L. McMillan, and Mooly Sagiv. 2017. Synthesis of circular compositional program proofs via abduction. Int. J. Softw. Tools Technol. Transf. 19, 5 (2017), 535–547. https://doi.org/10.1007/s10009-015-0397-7
Dana Drachsler-Cohen, Sharon Shoham, and Eran Yahav. 2017. Synthesis with Abstract Examples. In Computer Aided Verification - 29th International Conference, CAV 2017, Heidelberg, Germany, July 24-28, 2017, Proceedings, Part I. 254–278. https://doi.org/10.1007/978-3-319-63387-9_13
Sólrún Halla Einarsdóttir, Moa Johansson, and Johannes Åman Pohjola. 2018. Into the Infinite - Theory Exploration for Coinduction. In Artificial Intelligence and Symbolic Computation, Jacques Fleuriot, Dongming Wang, and Jacques Calmet (Eds.). Springer International Publishing, Cham, 70–86.
John K. Feser, Swarat Chaudhuri, and Isil Dillig. 2015. Synthesizing data structure transformations from input-output examples. In Proceedings of the 36th ACM SIGPLAN Conference on Programming Language Design and Implementation, Portland, OR, USA, June 15-17, 2015, David Grove and Steve Blackburn (Eds.). ACM, 229–239. https://doi.org/10.1145/2737924.2737977
Tim Freeman and Frank Pfenning. 1991. Refinement Types for ML. In Proceedings of the ACM SIGPLAN 1991 Conference on Programming Language Design and Implementation (PLDI '91). Association for Computing Machinery, New York, NY, USA, 268–277. https://doi.org/10.1145/113445.113468
Jad Hamza, Nicolas Voirol, and Viktor Kunčak. 2019. System FR: Formalized Foundations for the Stainless Verifier. Proc. ACM Program. Lang. 3, OOPSLA, Article 166 (Oct. 2019), 30 pages. https://doi.org/10.1145/3360592
Moa Johansson. 2017. Automated Theory Exploration for Interactive Theorem Proving: An Introduction to the Hipster System. In Interactive Theorem Proving - 8th International Conference, ITP 2017, Brasília, Brazil, September 26-29, 2017, Proceedings. 1–11. https://doi.org/10.1007/978-3-319-66107-0_1
Moa Johansson, Lucas Dixon, and Alan Bundy. 2010. Conjecture Synthesis for Inductive Theories. Journal of Automated Reasoning 47 (2010), 251–289.
Moa Johansson, Dan Rosén, Nicholas Smallbone, and Koen Claessen. 2014. Hipster: Integrating Theory Exploration in a Proof Assistant. In CICM.
James C. King. 1976. Symbolic Execution and Program Testing. Commun. ACM 19, 7 (July 1976), 385–394. https://doi.org/10.1145/360248.360252
Donald E. Knuth and Peter B. Bendix. 1983. Simple word problems in universal algebras. In Automation of Reasoning. Springer, 342–376.
Laura Kovács and Andrei Voronkov. 2013. First-order theorem proving and Vampire. In International Conference on Computer Aided Verification. Springer, 1–35.
Tessa Lau, Steven A. Wolfman, Pedro Domingos, and Daniel S. Weld. 2003. Programming by demonstration using version space algebra. Machine Learning 53, 1-2 (2003), 111–156.
K. Rustan M. Leino. 2010. Dafny: An Automatic Program Verifier for Functional Correctness. In Logic for Programming, Artificial Intelligence, and Reasoning. Springer Berlin Heidelberg, 348–370. https://doi.org/10.1007/978-3-642-17511-4_20
Peter Milder, Franz Franchetti, James C. Hoe, and Markus Püschel. 2012. Computer Generation of Hardware for Linear Digital Signal Processing Transforms. ACM Trans. Des. Autom. Electron. Syst. 17, 2, Article 15 (April 2012), 33 pages. https://doi.org/10.1145/2159542.2159547
Tobias Nipkow, Lawrence C. Paulson, and Markus Wenzel. 2002. Isabelle/HOL: A Proof Assistant for Higher-Order Logic. Vol. 2283. Springer Science & Business Media.
Pavel Panchekha, Alex Sanchez-Stern, James R. Wilcox, and Zachary Tatlock. 2015. Automatically improving accuracy for floating point expressions. In PLDI, Vol. 50. ACM, New York, NY, USA, 1–11.
Oleksandr Polozov and Sumit Gulwani. 2015. FlashMeta: A Framework for Inductive Program Synthesis. SIGPLAN Not. 50, 10 (2015), 107–126.
Jonathan Ragan-Kelley, Andrew Adams, Dillon Sharlet, Connelly Barnes, Sylvain Paris, Marc Levoy, Saman Amarasinghe, and Frédo Durand. 2018. Halide: Decoupling Algorithms from Schedules for High-Performance Image Processing. Commun. ACM 61, 1 (2018), 106–115. https://doi.org/10.1145/3150211
Nicholas Smallbone, Moa Johansson, Koen Claessen, and Maximilian Algehed. 2017. Quick specifications for the busy programmer. J. Funct. Program. 27 (2017), e18. https://doi.org/10.1017/S0956796817000090
Calvin Smith and Aws Albarghouthi. 2019. Program Synthesis with Equivalence Reduction. In Verification, Model Checking, and Abstract Interpretation - 20th International Conference, VMCAI 2019, Cascais, Portugal, January 13-15, 2019, Proceedings. 24–47. https://doi.org/10.1007/978-3-030-11245-5_2
Armando Solar-Lezama, Liviu Tancau, Rastislav Bodík, Sanjit A. Seshia, and Vijay A. Saraswat. 2006. Combinatorial sketching for finite programs. In Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS 2006, San Jose, CA, USA, October 21-25, 2006. 404–415. https://doi.org/10.1145/1168857.1168907
Ross Tate, Michael Stepp, Zachary Tatlock, and Sorin Lerner. 2009. Equality Saturation: A New Approach to Optimization. In Proceedings of the 36th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL '09). Association for Computing Machinery, New York, NY, USA, 264–276. https://doi.org/10.1145/1480881.1480915
Nikolai Tillmann and Jonathan de Halleux. 2008. Pex - White Box Test Generation for .NET. In Tests and Proofs, Bernhard Beckert and Reiner Hähnle (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 134–153.
Abhishek Udupa, Arun Raghavan, Jyotirmoy V. Deshmukh, Sela Mador-Haim, Milo M. K. Martin, and Rajeev Alur. 2013. TRANSIT: specifying protocols with concolic snippets. ACM SIGPLAN Notices 48, 6 (2013), 287–296.
Irene Lobo Valbuena and Moa Johansson. 2015. Conditional Lemma Discovery and Recursion Induction in Hipster. ECEASST 72 (2015).
Niki Vazou, Eric L. Seidel, Ranjit Jhala, Dimitrios Vytiniotis, and Simon Peyton-Jones. 2014. Refinement Types for Haskell. In Proceedings of the 19th ACM SIGPLAN International Conference on Functional Programming (ICFP '14).