BUSTLE: Bottom-Up Program Synthesis Through Learning-Guided Exploration
Augustus Odena∗, Kensen Shi∗, David Bieber, Rishabh Singh, Charles Sutton
Google Brain
∗ Equal contribution

Preprint. Under review.
Abstract
Program synthesis is challenging largely because of the difficulty of search in a large space of programs. Human programmers routinely tackle the task of writing complex programs by writing sub-programs and then analysing their intermediate results to compose them in appropriate ways. Motivated by this intuition, we present a new synthesis approach that leverages learning to guide a bottom-up search over programs. In particular, we train a model to prioritize compositions of intermediate values during search conditioned on a given set of input-output examples. This is a powerful combination because of several emergent properties: First, in bottom-up search, intermediate programs can be executed, providing semantic information to the neural network. Second, given the concrete values from those executions, we can exploit rich features based on recent work on property signatures. Finally, bottom-up search allows the system substantial flexibility in what order to generate the solution, allowing the synthesizer to build up a program from multiple smaller sub-programs. Overall, our empirical evaluation finds that the combination of learning and bottom-up search is remarkably effective, even with simple supervised learning approaches. We demonstrate the effectiveness of our technique on a new data set for synthesis of string transformation programs.
Program synthesis is a longstanding goal of artificial intelligence research (Manna & Waldinger, 1971; Summers, 1977), but it remains difficult in part because of the challenges of search (Alur et al., 2013; Gulwani et al., 2017a). The objective in program synthesis is to automatically write a program given a specification of its intended behaviour, and current state-of-the-art methods typically perform some form of search over a space of possible programs. Many different search methods have been explored in the literature, both with and without learning. These include search within a version-space algebra (Gulwani, 2011), bottom-up enumerative search (Udupa et al., 2013), stochastic search (Schkufza et al., 2013), genetic programming (Koza, 1994), reducing the synthesis problem to logical satisfiability (Solar-Lezama et al., 2006), beam search within a sequence-to-sequence neural network (Devlin et al., 2017), learning to perform premise selection to guide search (Balog et al., 2016), learning to prioritize grammar rules within top-down search (Lee et al., 2018), and learned search based on partial executions (Ellis et al., 2019; Zohar & Wolf, 2018; Chen et al., 2019).

While these approaches have yielded significant progress, none of them completely capture the following important intuition: human programmers routinely write complex programs by first writing sub-programs and then analysing their intermediate results to compose them in appropriate ways. We propose a new learning-guided system for synthesis, called BUSTLE, which follows this intuition in perhaps the simplest way possible. Given a specification of a program's intended behavior (in this paper given by input-output examples), BUSTLE performs bottom-up enumerative search for a satisfying program, following Udupa et al. (2013). Each program explored during the bottom-up search is an expression that can be executed on the inputs, so we apply a machine learning model to the resulting value to guide the search. The model is simply a classifier trained to predict whether the intermediate value produced by a partial program is part of an eventual solution. This combination of learning and bottom-up search has several key advantages. First, because the input to the model is a value produced by executing a partial program, the model's predictions can depend on semantic information about the program. Second, because the search is bottom-up, compared to previous work on execution-guided synthesis, the search procedure has flexibility about which order to generate the program in, and this flexibility can be exploited by machine learning.

One big challenge with this approach is that exponentially many intermediate programs are explored during search, so the model needs to be both very fast and very accurate to yield wall-clock time speed-ups. We can afford to incur some slowdown per expression, because if the model is accurate enough, we can search many fewer values before finding solutions. However, executing a program is still orders of magnitude faster than performing inference on even a small machine learning model, so this problem is severe. We employ two techniques to deal with this: first, we arrange both the synthesizer and the model so that we can batch model prediction across hundreds of intermediate values; second, we process intermediate expressions using property signatures (Odena & Sutton, 2020), which featurize program inputs using another set of programs.

A second big challenge is that neural networks require large amounts of data to train, but there is no available data source of intermediate expressions.
We can generate programs at random to train the model following previous work (Balog et al., 2016; Devlin et al., 2017), but models trained on random programs do not always transfer to human-written benchmarks (Shin et al., 2019). We show that our use of property signatures helps with this distribution mismatch problem as well.

In summary, this paper makes the following contributions:
• We present BUSTLE, which integrates machine learning into bottom-up program synthesis.
• We show how to efficiently add machine learning in the synthesis loop using property signatures and batched predictions. With these techniques, adding the model to the synthesizer provides an end-to-end improvement in synthesis time.
• We present an evaluation on a synthesis task of string transformations. We show that BUSTLE leads to improvements in synthesis time compared to a baseline synthesizer without learning, a DeepCoder-style learning-based synthesizer (Balog et al., 2016), and a hand-designed set of heuristics. Even though our model is trained on random programs, we show that the model's performance transfers to a set of human-written synthesis benchmarks.
In a Programming-by-Example (PBE) task (Winston, 1970; Menon et al., 2013; Gulwani, 2011), we are given a set of input/output pairs, and the goal is to find a program such that for each pair, the synthesized program, when executed on the input, generates the corresponding output. To restrict the search space, the programs are typically restricted to a small domain-specific language (DSL). As an example PBE specification, consider the "io_pairs" given in Listing 2.
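The core check in PBE, whether a candidate program satisfies every example, can be sketched in a few lines of Python. This is a minimal illustration; the function and variable names here are ours, not part of the system described in the paper:

```python
def satisfies_spec(program, io_pairs):
    """Return True iff `program` maps every example input to its output."""
    return all(program(inp) == out for inp, out in io_pairs)

# A tiny PBE specification: append an underscore to the input string.
io_pairs = [("abc", "abc_"), ("xyz", "xyz_")]

candidate = lambda s: s + "_"
assert satisfies_spec(candidate, io_pairs)        # consistent with both pairs
assert not satisfies_spec(lambda s: s.upper(), io_pairs)  # fails the spec
```

A synthesizer's job is then to find, within the DSL, some `program` for which this check passes.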
Following previous work (Gulwani, 2011; Devlin et al., 2017), we deal with string and number transformations commonly used in spreadsheets. Such transformations sit at a nice point on the complexity spectrum as a benchmark task; they are simpler than programs in general-purpose languages, but still expressive enough for many common string transformation tasks.

The domain-specific language we use (shown in Figure 1) is broadly similar to those of Parisotto et al. (2017) and Devlin et al. (2017), but compared to these, our DSL is expanded in several ways that make the synthesis task more difficult. First, our DSL includes integers and integer arithmetic. Second, our DSL allows for arbitrarily nested expressions, rather than having a maximum size. Finally, and most importantly, our DSL removes the restriction of having Concat at the top level, as is the case in previous works (Gulwani, 2011; Devlin et al., 2017). Without this constraint, version-space-based and dynamic-programming-based beam search approaches cannot exploit the prefix-output property to prune partial programs during search.

Expression E := S | I
String expression S := Concat(S, S) | Left(S, I) | Right(S, I) | Substr(S, I, I) | Replace(S, I, I, S) | Trim(S) | Repeat(S, I) | Substitute(S, S, S) | Substitute(S, S, S, I) | ToText(I) | LowerCase(S) | UpperCase(S) | ProperCase(S) | T | X
Integer expression I := I + I | I − I | Find(S, S) | Find(S, S, I) | Len(S) | J | X
String constants T := ␣ | , | . | - | /
Integer constants J := 0 | . . .
Input X := x1 | . . . | xk

Figure 1: Domain-specific language (DSL) of expressions considered in this paper. The inputs to the program are x1, . . . , xk, and ␣ indicates the space character.

Our DSL allows for nesting and compositions of common string transformation functions. These functions include string concatenation (Concat) and other familiar string operations (see Supplementary Material for a full list). Integer functions include arithmetic, returning the index of the first occurrence of a substring within a string (Find), and string length. Finally, a few commonly useful string and integer constants are included.
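For intuition, a few of the DSL's string operations can be rendered in Python. This is a sketch only: the implementation described in the paper is in Java, and we assume spreadsheet-style 1-based indexing for Find, which is our reading of the spreadsheet convention rather than a definition taken from the paper:

```python
def left(s, i):
    """Left(S, I): the first i characters of s."""
    return s[:i]

def right(s, i):
    """Right(S, I): the last i characters of s."""
    return s[-i:] if i > 0 else ""

def find(sub, s, start=1):
    """Find(S, S, I): 1-based index of `sub` in `s`, searching from `start`.
    Raises ValueError when `sub` is absent, mirroring a spreadsheet error."""
    return s.index(sub, start - 1) + 1

def concat(a, b):
    """Concat(S, S): string concatenation."""
    return a + b

# Compose them as the DSL would: extract the text before the first space.
s = "Vice president"
assert left(s, find(" ", s) - 1) == "Vice"
```

Nesting these calls arbitrarily, as in the last line, is exactly what the removal of the top-level Concat restriction permits.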
The baseline synthesizer on top of which we build our approach is a bottom-up enumerative search inspired by Udupa et al. (2013), which enumerates DSL expressions from smallest to largest. The search follows Algorithm 1 if the lines colored in blue (4, 16, and 17) are removed. This baseline uses a value-based search: during the search, each candidate expression is executed to see if it meets the specification. Then, rather than storing the expressions that have been produced in the search, we store the values produced by executing the expressions. This allows the search to avoid separately extending sub-expressions that are semantically equivalent on the given input examples.

More specifically, every expression has an integer weight, which for the baseline synthesizer is the number of nodes in the AST. The search maintains a table mapping weights to a list of all the values of previously explored sub-expressions of that weight. The search is initialized with a set of input values and constant values given by the user, all of which have weight 1. The search then proceeds starting from expressions of weight 1, gradually creating all expressions of weight 2, then of weight 3, and so on. To create all values of weight n, we loop over all available functions, calling each function with all combinations of arguments that would yield the desired weight. For example, if we are trying to construct all values of weight 10 of the form Concat(x, y), we iterate over all values where x has weight 1 and y has weight 8, then where x has weight 2 and y has weight 7, and so forth. (The Concat function itself contributes weight 1.) Every time a new expression is constructed, we evaluate it on the given inputs, terminating the search when the expression produces the desired outputs.

In order to perform machine learning on values encountered during the enumeration, we make use of recent work on property signatures (Odena & Sutton, 2020).
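Before turning to the learned guidance, the baseline value-based enumeration can be sketched in Python. This is an illustrative model only (the paper's implementation is in Java); the toy two-operation DSL, the weight bound, and all names below are assumptions made for the example:

```python
import itertools

def bottom_up_search(inputs, output, ops, max_weight=10):
    """Enumerate expressions from smallest to largest weight, storing the
    values they produce so semantically equivalent terms are merged."""
    # terms[w] maps each distinct value tuple (one entry per example) to
    # one expression string of weight w.
    terms = {1: {tuple(vals): name for name, vals in inputs.items()}}
    seen = set(terms[1])
    for w in range(2, max_weight + 1):
        terms[w] = {}
        for op_name, arity, fn in ops:
            # Argument weights must sum to w - 1: the operation itself
            # contributes weight 1.
            for ws in itertools.product(range(1, w), repeat=arity):
                if sum(ws) != w - 1:
                    continue
                pools = [list(terms.get(wi, {}).items()) for wi in ws]
                for args in itertools.product(*pools):
                    arg_vals = [a[0] for a in args]
                    vals = tuple(fn(*vs) for vs in zip(*arg_vals))
                    if vals in seen:
                        continue  # a semantically equivalent term exists
                    seen.add(vals)
                    expr = f"{op_name}({', '.join(a[1] for a in args)})"
                    terms[w][vals] = expr
                    if list(vals) == list(output):
                        return expr
    return None

# A toy DSL with two operations.
ops = [("Concat", 2, lambda a, b: a + b),
       ("Upper", 1, lambda a: a.upper())]
result = bottom_up_search({"x": ["ab", "cd"]}, ["ABAB", "CDCD"], ops)
print(result)
```

Note that `seen` is keyed on value tuples, not expression strings: `Upper(Upper(x))` is never stored because it evaluates to the same values as `Upper(x)`.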
Consider a function with input type τin and output type τout. In this context, a property is a function of type (τin, τout) → Bool that describes some aspect of the function under consideration. If we have a list of such properties and some inputs and outputs of the correct type, we can evaluate all the properties on the input-output pairs to get a vector of results that we call the property signature. More precisely, given a list of n input-output pairs and a list of k properties, the property signature is a length-k vector. The element of the vector corresponding to a given property has one of the values AllTrue, AllFalse, or Mixed, depending on whether the property returned True for all n pairs, False for all n pairs, or True for some and False for others, respectively. Concretely, then, any property signature can be identified with a trit-vector, and we represent signatures in computer programs as arrays containing the values {−1, 0, 1}. An example of property signatures is shown in Listing 1.

io_pairs = [
    ("butter", "butterfly"),
    ("abc", "abc_"),
    ("xyz", "XYZ_"),
]
p1 = lambda inp, outp: inp in outp
p2 = lambda inp, outp: outp.endswith(inp)
p3 = lambda inp, outp: inp.lower() in outp.lower()

Listing 1: An example set of input-output pairs, along with three properties that can act on them. The first returns True for the first two pairs and False for the third. The second returns False for all pairs. The third returns True for all pairs. The resulting property signature is {Mixed, AllFalse, AllTrue}. These examples are written in Python for clarity, but our implementation is in Java.

Algorithm 1: The BUSTLE synthesis algorithm.
Input: input/output examples (I, O)
Output: a program P such that P(i) = o for all inputs i ∈ I with corresponding output o ∈ O
Auxiliary data: supported operations Ops, supported properties Props, and a model M trained using Props as described in Section 3.1

 1: E ← ∅                                ▷ E maps integer weights to terms with that weight
 2: C ← ExtractConstants(I, O)
 3: E[1] ← I ∪ C                         ▷ inputs and constants have weight 1
 4: s_io ← PropertySignature(I, O, Props)
 5: for w = 2, 3, . . . do               ▷ loop over all possible term weights
 6:   for all op ∈ Ops do
 7:     n ← op.arity
 8:     A ← ∅                            ▷ A holds all argument tuples
 9:     for all [w1, . . . , wn] s.t. Σi wi = w − 1, wi ∈ Z+ do
10:       ▷ make all argument tuples with these weights that type-check
11:       A ← A ∪ {(a1, . . . , an) | ai.weight = wi ∧ ai.type = op.argtypes_i}
12:     for all (a1, . . . , an) ∈ A do
13:       V ← Execute(op, (a1, . . . , an))
14:       if V ∉ E then
15:         w′ ← w                       ▷ use model to reweight terms
16:         s_vo ← PropertySignature(V, O, Props)
17:         w′ ← ReweightWithModel(M, s_io, s_vo, w)
18:         E[w′] ← E[w′] ∪ {V}
19:         if V = O then return Expression(V)

We evaluate BUSTLE using a suite of 56 human-written benchmarks. These benchmarks were inspired by questions on help forums for spreadsheet programs, and designed to be difficult enough to stress our system; in fact, our best system solves less than half of them. The search space explored by the synthesizers to solve these benchmarks is quite large: on average, BUSTLE (orange line in Figure 2) searches over 6 million expressions per benchmark attempt, and 1.5 million expressions per successful attempt. Most benchmarks have 2 or 3 input-output pairs, and no benchmark solved by BUSTLE has more than 4, though for some unsolved benchmarks we did feel that we needed more than 4 pairs to fully specify the semantics of the desired program. In each case, we gave what we felt was the number of pairs a user of such a system would find reasonable (though of course this is a subjective judgment). Three representative benchmarks are shown in Listing 2. See the Supplementary Material for a full list.

expectedProgram = "TO_TEXT(MINUS(LEN(var_0), LEN(SUBSTITUTE(var_0, \"/\", \"\"))))"
io_pairs = {"/this/is/a/path": "4", "/home": "1", "/a/b": "2"}

solution = "CONCATENATE(MID(var_0, 3, 2), \"/\", REPLACE(var_0, 3, 2, \"/\"))"
io_pairs = {"08092019": "09/08/2019", "12032020": "03/12/2020"}

solution = "UPPER(CONCATENATE(LEFT(var_0, 1), MID(var_0, ADD(FIND(\" \", var_0), 1), 1)))"
io_pairs = {"product area": "PA", "Vice president": "VP"}

Listing 2: Three of our benchmark problems (all solved by BUSTLE).
BUSTLE: Bottom-Up Synthesis with Learning
Each intermediate expression we encounter during search can be evaluated to produce concrete results, and we can pass those results into a machine learning model to guide the search. The model is a binary classifier p(y | I, V, O) that predicts whether a set of values V, which result from evaluating an expression on each input example i ∈ I, is an intermediate value of a program that maps the inputs I to the outputs O. Given those predictions, we de-prioritize sub-expressions that are unlikely to appear in the final result, which, when done correctly, dramatically speeds up the synthesizer.
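The construction of the classifier's input can be sketched as follows: two property signatures (inputs vs. outputs, and intermediate value vs. outputs) are computed and concatenated before being fed to the network. The trit encoding {−1, 0, 1} and the three toy properties below are illustrative assumptions, and the embedding plus feed-forward network that consumes this vector is elided:

```python
ALL_TRUE, ALL_FALSE, MIXED = 1, -1, 0  # one possible trit encoding

def property_signature(pairs, props):
    """Evaluate each property on every pair; summarize as one trit per property."""
    sig = []
    for p in props:
        results = [p(a, b) for a, b in pairs]
        if all(results):
            sig.append(ALL_TRUE)
        elif not any(results):
            sig.append(ALL_FALSE)
        else:
            sig.append(MIXED)
    return sig

props = [
    lambda v, o: v in o,            # is v a substring of o?
    lambda v, o: len(v) == len(o),  # do v and o have the same length?
    lambda v, o: " " in o,          # does the output contain a space?
]

io_pairs = [("butter", "butterfly"), ("abc", "abc_")]
value_output_pairs = [("but", "butterfly"), ("abc", "abc_")]

# Model input: concatenation of the two signatures.
x = property_signature(io_pairs, props) + property_signature(value_output_pairs, props)
assert x == [ALL_TRUE, ALL_FALSE, ALL_FALSE, ALL_TRUE, ALL_FALSE, ALL_FALSE]
```

Because `x` has a fixed length regardless of how many examples or how large the strings are, a plain fully connected network suffices downstream.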
Because we want the classifier to learn whether a value is intermediate between an input and an output, the model receives two property signatures as input: one from the inputs and one from the intermediate value. Recall from Section 2.4 that a property signature is computed by applying a list of properties to a list of input-output pairs. So one signature is computed by applying all of the properties to input-output pairs, and the other by applying them to intermediate-value-output pairs. A few example properties that we use include: (a) if v and o are both strings, is v an initial substring of o; (b) do v and o have the same length; (c) does the output contain a space; and so on. (See Supplementary Material for the full list of properties.) We then concatenate these two vectors to obtain the model's input. The rest of the model is straightforward: each element of the property signature is one of AllTrue, AllFalse, or Mixed, so we embed the ternary property signature into a higher-dimensional dense vector and then feed it into a fully connected neural network for binary classification.

This model is simple, but we are only able to use such a simple model due to our particular design choices: our form of bottom-up search guarantees that all intermediate expressions yield a value comparable to the inputs and outputs, and property signatures do a lot of the "representational work" that would otherwise require a larger model.

The classifier is trained by behavior cloning on a set of training problems. However, obtaining a training dataset is challenging. Ideally, the training set would contain synthesis tasks that are interesting to humans, but such datasets are small compared to what is needed to train deep networks. Instead, we train on randomly generated synthetic data (similar to Devlin et al. (2017)). This choice does come with a risk of poor performance on human-written tasks due to domain mismatch (Shin et al., 2019), but we show in Section 4 that BUSTLE can overcome this issue.

Generating the random synthetic data is itself non-trivial: because different DSL functions have argument preconditions and invariants (e.g., for Substr the position indices I1 and I2 should not exceed the string length: I1 ≤ Len(S) ∧ I2 ≤ Len(S)), randomly sampling DSL programs and inputs would lead to a large number of training examples where the program cannot be applied to the sampled inputs. Instead, we use the idea of search-driven data generation from TF-Coder (Shi et al., 2020): First, we generate a large number of random inputs. For each of those inputs, we run bottom-up search using a dummy output, so that the search will never succeed but just keeps generating expressions. For each generated expression, we compute all of its sub-expressions. This results in a dataset of triples, each containing an input, a sub-expression, and a larger expression to which the sub-expression belongs. These comprise the positive examples for the classifier. We also create an analogous dataset of negative triples, of the same size as the positive dataset, where the sub-expression in question is not actually a sub-expression of the larger expression. The data set used in our experiments was created by performing searches on 1000 random inputs and keeping 100 positive and 100 negative values at random from each search.
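The positive/negative triple construction can be sketched as follows. This is a simplified stand-in: expressions and sub-expressions are represented as strings here, and the helper names and sample counts are ours, not taken from the implementation:

```python
import random

def make_training_triples(search_expressions, num_pos, num_neg, rng):
    """Build labeled (sub-expression, expression, label) triples from one search.

    `search_expressions` maps each expression explored by the dummy-output
    search to the set of its sub-expressions.
    """
    positives, negatives = [], []
    exprs = list(search_expressions)
    for expr in exprs:
        # Positives: every true sub-expression of `expr`.
        for sub in search_expressions[expr]:
            positives.append((sub, expr, 1))
        # Negatives: pair `expr` with a value that is NOT one of its
        # sub-expressions, drawn from some other explored expression.
        for other in exprs:
            if other == expr:
                continue
            for sub in search_expressions[other]:
                if sub not in search_expressions[expr]:
                    negatives.append((sub, expr, 0))
                    break
    rng.shuffle(positives)
    rng.shuffle(negatives)
    return positives[:num_pos] + negatives[:num_neg]

search = {
    "Concat(x, Upper(x))": {"x", "Upper(x)"},
    "Len(Trim(x))": {"x", "Trim(x)"},
}
data = make_training_triples(search, num_pos=2, num_neg=2, rng=random.Random(0))
labels = [lbl for _, _, lbl in data]
assert labels.count(1) == 2 and labels.count(0) == 2
```

Each triple would then be featurized with property signatures before training, as described above.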
Incorporating the model into bottom-up synthesis is straightforward, and can be accomplished by adding the blue lines to Algorithm 1. Lines 4 and 16 compute the property signatures required for the model input, as described previously. The main challenge is that the model produces a probability p(y | I, V, O), but the search is organized by integer weights. We resolve this with a simple heuristic: at the time we generate a new value, we have the weight w of the expression that generates it. We discretize the model's output probability into an integer δ ∈ {0, . . . , 5} by binning it into six bins, and the new weight is computed from the discretized model output as w′ = w + 5 − δ. This function is denoted ReweightWithModel in Algorithm 1.

A key challenge is making the model fast enough. Evaluating the model for every intermediate value could cause the synthesizer to slow down so much that the overall performance would be worse than with no model at all. However, BUSTLE actually outperforms our baselines even when measured strictly in terms of wall-clock time (Section 4). There are several reasons for this. First, computing property signatures for the expressions allows us to take some of the work of representing the intermediate state out of the neural network (which is slow) and do it in the JVM (which is fast). Second, because a property signature is a fixed-length representation, it can be fed into a simple feed-forward neural network, rather than requiring a recurrent model, as would be necessary if we passed in, e.g., the AST representation. Third, because of this fixed-length representation, it is easy to batch many calls to the machine learning model and process them using CPU vector instructions. Inference calls to the machine learning model could, in principle, be made in parallel with the rest of the synthesizer, either on a separate CPU core or on an accelerator, which would further improve wall-clock results. Due to these optimizations, computing property signatures and running the model on them accounts for only roughly 20% of the total time spent; the rest is spent in the symbolic portions of the bottom-up search.
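The discretized reweighting and the batching can be sketched together. The specific bin boundaries below are an illustrative modeling choice, not values taken from the paper, and the "model" is a stand-in that scores a whole batch in one call:

```python
def discretize(prob, boundaries):
    """Map a probability to a bin index δ ∈ {0, ..., len(boundaries) - 2}."""
    for delta in range(len(boundaries) - 1):
        if prob <= boundaries[delta + 1]:
            return delta
    return len(boundaries) - 2

def reweight_with_model(model, signature_batch, weights, boundaries):
    """Reweight a batch of candidate terms using w' = w + 5 - δ.

    `model` scores the whole batch at once, so a single inference call
    covers many intermediate values.
    """
    probs = model(signature_batch)
    return [w + 5 - discretize(p, boundaries) for p, w in zip(probs, weights)]

# Illustrative six-bin boundaries spanning [0, 1] (an assumption here).
BOUNDARIES = [0.0, 0.1, 0.2, 0.3, 0.4, 0.6, 1.0]

# Stand-in "model": returns fixed probabilities for a batch of three terms.
dummy_model = lambda batch: [0.05, 0.55, 0.95]
new_weights = reweight_with_model(dummy_model, [None] * 3, [7, 7, 7], BOUNDARIES)
assert new_weights == [12, 8, 7]
```

A term the model is confident about (δ = 5) keeps its original weight and is explored immediately, while a low-scoring term is pushed up to five weight levels later in the enumeration.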
We evaluate BUSTLE on a suite of human-written benchmarks inspired by string transformation questions asked by users on online help forums. (See Supplementary Material for the full list of benchmarks.) We measure performance in two ways. First, we consider the number of benchmarks successfully solved as a function of the number of candidates considered, which gives insight into how much the machine learning model has been able to reduce the search space. Second, we consider the number of benchmarks solved as a function of wall-clock time, which takes into account the computational cost of calling the model.

We compare BUSTLE to three other methods. First, we compare to the baseline bottom-up synthesizer without machine learning. Without learning, the synthesizer explores expressions in order of size, which is reasonable in this domain, because the correct programs are fairly small.

Second, we compare to the baseline synthesizer augmented with simple but effective human-written heuristics, to quantify whether the improvement is due to learning or due to having good features. Specifically, we reweight the intermediate string values encountered during the search based on whether they are a substring of the desired output and on the edit distance to the output, normalized to a fraction.

Finally, we explore whether learning within the synthesis loop provides benefits compared to simply using learning once at the beginning of the search to choose which operations to prioritize, as in Balog et al. (2016). Specifically, we train a model similar to the model trained for BUSTLE, but larger (since inference time is essentially inconsequential in this context), on the same data set used to train the model for BUSTLE; but instead of predicting whether an expression is a sub-expression of a solution, we predict which operations will be used in a solution. Then, for each benchmark attempt, we simply do not use the 4 operations that the model predicts are least likely to appear (where 4 was chosen using a hyper-parameter search over the values in {1, 2, 3, 4, 5}).

Figure 2: Plots of human-written benchmarks solved over time. (Left) The number of benchmarks solved as a function of intermediate expressions considered. This metric makes BUSTLE look somewhat better than it is, because it ignores slowdowns in wall-clock time, but it is still important to look at: first, it gives an upper bound on how well we could do in wall-clock terms by putting more engineering effort into speeding up the model; second, it gives a sense of how well the model generalizes to the test set that is invariant to engineering considerations. (Right) Benchmarks solved per unit of elapsed wall-clock time. BUSTLE still outperforms the baselines here, but not by quite as much, since inference takes non-zero time.

The results as a function of the number of candidate expressions considered are shown in Figure 2 (left). By this metric, BUSTLE (orange) performs very well, solving 24 tasks within 10 million candidate expressions, significantly outperforming the simple synthesizer (blue), for example, which only solves 17 tasks.

As for wall-clock time (Figure 2, right), for any fixed time budget, BUSTLE (orange) solves more tasks than the baselines. Despite the overhead of calling the model, BUSTLE still solves 24 tasks within 10 seconds, outperforming the baseline synthesizer (blue), which only solves 17 tasks. BUSTLE also outperforms the domain-specific heuristics, despite the model being slower to execute. Furthermore, when using both the model and the heuristics (red), the synthesizer performs best overall, solving 25 tasks within 10 seconds, while the heuristics-only approach solves only 23 tasks. This indicates that although learning in the loop outperforms domain-specific heuristics, the two approaches can be combined to achieve better performance than either alone. Finally, we can see that premise selection (purple) was slightly helpful, leading to 2 more solved tasks within 10 seconds compared to the baseline, but with much less improvement than BUSTLE. This is evidence that using learning from partial executions within the synthesis loop is crucial to BUSTLE's success.

We conduct two additional analyses to better understand the performance of BUSTLE. First, we investigate the predictions of the model when it is run on the intermediate values actually encountered during synthesis of the human-written benchmarks. For all benchmarks that were solved, we compute separate histograms of the model's predictions on expressions that do appear in the solution and on expressions that do not. Predictions for true sub-expressions skew positive, and predictions for non-sub-expressions skew negative. This provides further evidence that our model generalizes well to human benchmarks, despite the domain mismatch with the synthetic training data. The detailed results are in Figure 3. Finally, we evaluated whether all benchmarks solved by the baseline (no model, no heuristics) were also solved by BUSTLE. The answer is mostly yes: BUSTLE fails to solve only one benchmark that was solved by the baseline synthesizer.
For surveys on program synthesis and machine learning for software engineering, see Gottschlich et al. (2018); Solar-Lezama (2018); Gulwani et al. (2017b); Allamanis et al. (2018).

Figure 3: Histograms of model predictions for expressions seen while solving benchmarks. (Left) For expressions that were sub-expressions of a solution, the vast majority received predictions close to 1, showing that the model can identify the correct expressions to prioritize during search. (Right) For expressions that were not sub-expressions of a solution, predictions skewed close to 0, although a fraction of predictions were close to 1.

A well-known synthesizer for spreadsheet programs is FlashFill (Gulwani, 2011), a version of which is deployed in Microsoft Excel. FlashFill is based on version-space algebras (VSAs), which are powerful and fast but only apply to restricted DSLs; e.g., the top-most function in the program must be Concat, which allows FlashFill to perform an efficient divide-and-conquer style of search. Our technique has no such restrictions.

Early work on machine learning for program synthesis includes DeepCoder (Balog et al., 2016), which uses ML to select operations once at the beginning of search. Although this idea is pragmatic, the disadvantage is that once the search has started, the model can give no further feedback. Odena & Sutton (2020) use property signatures within a DeepCoder-style model for premise selection.

One can also train a machine learning model to emit whole programs token-by-token using an encoder-decoder neural architecture (Bunel et al., 2018; Devlin et al., 2017; Parisotto et al., 2017), but doing so loses the ability to inspect the outputs of intermediate programs. Previous work has also considered using learning within syntax-guided search over programs (Yin & Neubig, 2017; Lee et al., 2018), but because these methods are top-down, it is much more difficult to guide them with execution information. Finally, Nye et al. (2019) learn to emit a partial program and fill in the holes with a symbolic search.

The most closely related work to ours is neural execution-guided synthesis, which, like us, uses values produced by intermediate programs within a neural network. Zohar & Wolf (2018) process intermediate values of a program using a neural network for a small, straight-line DSL, but they do not use the model to evaluate intermediate programs. Another approach is to rewrite a programming language so that it can be evaluated "left-to-right", allowing values to be used to prioritize the search in an actor-critic framework (Ellis et al., 2019). Similarly, Chen et al. (2019) use intermediate values while synthesizing a program with a neural encoder-decoder model, but again this work proceeds in a variant of left-to-right search that is modified to handle conditionals and loops.
None of these approaches exploit our main insight, which is that bottom-up search allows the model to prioritize and combine small programs that solve different subtasks.
Learning to search has been an active area in machine learning, especially in imitation learning (Daumé et al., 2009; Ross et al., 2011; Chang et al., 2015). Combining more sophisticated imitation learning strategies with BUSTLE is an interesting direction for future work.
We have introduced BUSTLE, a technique for using machine learning to guide a bottom-up search for program synthesis. BUSTLE exploits the fact that bottom-up search makes it easy to evaluate partial programs, and it uses machine learning to predict the likelihood that a given intermediate value is a sub-expression of the desired solution. We have shown that BUSTLE improves over various baselines on human-written benchmarks, not only in terms of benchmarks solved per candidate program considered, but also in terms of wall-clock time. In fact, we view showing that learning-in-the-loop can be made fast enough for program synthesis as perhaps the major contribution of this work; the idea of learning-in-the-loop, though novel as far as we are aware, is relatively obvious, but we were quite surprised to learn that it can be made efficient enough to provide overall end-to-end speedups.

References
Allamanis, M., Barr, E. T., Devanbu, P., and Sutton, C. A survey of machine learning for big code and naturalness. ACM Computing Surveys (CSUR), 51(4):81, 2018.

Alur, R., Bodík, R., Juniwal, G., Martin, M. M. K., Raghothaman, M., Seshia, S. A., Singh, R., Solar-Lezama, A., Torlak, E., and Udupa, A. Syntax-guided synthesis. In Formal Methods in Computer-Aided Design, FMCAD 2013, Portland, OR, USA, October 20-23, 2013, pp. 1-8. IEEE, 2013. URL http://ieeexplore.ieee.org/document/6679385/.

Balog, M., Gaunt, A. L., Brockschmidt, M., Nowozin, S., and Tarlow, D. DeepCoder: Learning to write programs. arXiv preprint arXiv:1611.01989, 2016.

Bunel, R., Hausknecht, M., Devlin, J., Singh, R., and Kohli, P. Leveraging grammar and reinforcement learning for neural program synthesis. In International Conference on Learning Representations, 2018.

Chang, K.-W., Krishnamurthy, A., Agarwal, A., Daumé III, H., and Langford, J. Learning to search better than your teacher. In International Conference on Machine Learning (ICML), 2015.

Chen, X., Liu, C., and Song, D. Execution-guided neural program synthesis. OpenReview.net, 2019. URL https://openreview.net/forum?id=H1gfOiAqYm.

Daumé III, H., Langford, J., and Marcu, D. Search-based structured prediction. Machine Learning Journal, 2009.

Devlin, J., Uesato, J., Bhupatiraju, S., Singh, R., Mohamed, A., and Kohli, P. RobustFill: Neural program learning under noisy I/O. CoRR, abs/1703.07469, 2017. URL http://arxiv.org/abs/1703.07469.

Ellis, K., Nye, M., Pu, Y., Sosa, F., Tenenbaum, J., and Solar-Lezama, A. Write, execute, assess: Program synthesis with a REPL. In NeurIPS, 2019.

Gottschlich, J., Solar-Lezama, A., Tatbul, N., Carbin, M., Rinard, M., Barzilay, R., Amarasinghe, S., Tenenbaum, J. B., and Mattson, T. The three pillars of machine programming. In Proceedings of the 2nd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, pp. 69-80. ACM, 2018.

Gulwani, S. Automating string processing in spreadsheets using input-output examples. In Proceedings of the 38th Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, POPL '11, pp. 317-330, New York, NY, USA, 2011. ACM. ISBN 978-1-4503-0490-0. doi: 10.1145/1926385.1926423. URL http://doi.acm.org/10.1145/1926385.1926423.

Gulwani, S., Polozov, O., and Singh, R. Program synthesis. Foundations and Trends in Programming Languages, 4(1-2):1-119, 2017a. doi: 10.1561/2500000010. URL https://doi.org/10.1561/2500000010.

Gulwani, S., Polozov, O., Singh, R., et al. Program synthesis. Foundations and Trends® in Programming Languages, 4(1-2):1-119, 2017b.

Koza, J. R. Genetic programming as a means for programming computers by natural selection. Statistics and Computing, 4(2):87-112, 1994.

Lee, W., Heo, K., Alur, R., and Naik, M. Accelerating search-based program synthesis using learned probabilistic models. In Conference on Programming Language Design and Implementation (PLDI), pp. 436-449, June 2018.

Manna, Z. and Waldinger, R. J. Toward automatic program synthesis. Commun. ACM, 14(3):151-165, 1971. doi: 10.1145/362566.362568. URL https://doi.org/10.1145/362566.362568.

Menon, A., Tamuz, O., Gulwani, S., Lampson, B., and Kalai, A. A machine learning framework for programming by example. In International Conference on Machine Learning, pp. 187-195, 2013.

Nye, M. I., Hewitt, L. B., Tenenbaum, J. B., and Solar-Lezama, A. Learning to infer program sketches. In Chaudhuri, K. and Salakhutdinov, R. (eds.), Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning Research, pp. 4861-4870. PMLR, 2019. URL http://proceedings.mlr.press/v97/nye19a.html.

Odena, A. and Sutton, C. Learning to represent programs with property signatures. In International Conference on Learning Representations, 2020.

Parisotto, E., Mohamed, A.-r., Singh, R., Li, L., Zhou, D., and Kohli, P. Neuro-symbolic program synthesis. In ICLR, 2017.

Ross, S., Gordon, G., and Bagnell, D. A reduction of imitation learning and structured prediction to no-regret online learning. In Conference on Artificial Intelligence and Statistics (AISTATS), 2011.

Schkufza, E., Sharma, R., and Aiken, A. Stochastic superoptimization. In Proceedings of the Eighteenth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS '13, pp. 305-316, New York, NY, USA, 2013. Association for Computing Machinery. ISBN 9781450318709. doi: 10.1145/2451116.2451150. URL https://doi.org/10.1145/2451116.2451150.

Shi, K., Bieber, D., and Singh, R. TF-Coder: Program synthesis for tensor manipulations, 2020.

Shin, R., Kant, N., Gupta, K., Bender, C., Trabucco, B., Singh, R., and Song, D. Synthetic datasets for neural program synthesis. OpenReview.net, 2019. URL https://openreview.net/forum?id=ryeOSnAqYm.

Solar-Lezama, A. Introduction to program synthesis. https://people.csail.mit.edu/asolar/SynthesisCourse/TOC.htma, 2018. Accessed: 2018-09-17.

Solar-Lezama, A., Tancau, L., Bodík, R., Seshia, S. A., and Saraswat, V. A. Combinatorial sketching for finite programs. In Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS 2006, San Jose, CA, USA, October 21-25, 2006, pp. 404-415. ACM, 2006.

Summers, P. D. A methodology for LISP program construction from examples. Journal of the ACM (JACM), 24(1):161-175, 1977.

Udupa, A., Raghavan, A., Deshmukh, J. V., Mador-Haim, S., Martin, M. M. K., and Alur, R. TRANSIT: Specifying protocols with concolic snippets. In Conference on Programming Language Design and Implementation (PLDI), pp. 287-296. Association for Computing Machinery, 2013.

Winston, P. H. Learning structural descriptions from examples. Technical report, Cambridge, MA, USA, 1970.

Yin, P. and Neubig, G. A syntactic neural model for general-purpose code generation. In Association for Computational Linguistics (ACL), 2017.

Zohar, A. and Wolf, L. Automatic program synthesis of long programs with a learned garbage collector. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 31, pp. 2094-2103. Curran Associates, Inc., 2018.

A Listing of Properties Used

BUSTLE computes two types of Property Signatures: the signature involving the inputs and the outputs, and the signature involving the intermediate state and the outputs. In this paper, the inputs and outputs are always strings, but the intermediate state may be either an integer or a string, so we have properties dealing with both integers and strings. In an abstract sense, a property acts on an input and an output, but some properties will simply ignore the input, and so we implement those as functions with only one argument. Thus, we have four types of properties:

• properties acting on a single string (Listing 3).
• properties acting on a single integer (Listing 4).
• properties acting on a string and the output string (Listing 5).
• properties acting on an integer and the output string (Listing 6).

For a program with multiple inputs, we simply loop over all the inputs and the output, computing all of the relevant types of properties for each. For example, a program taking two string inputs and yielding one string output will have single-argument string properties for the output and two sets of double-argument string properties, one for each input. We fix a maximum number of inputs and pad the signatures so that they are all the same size.

str.isEmpty()                // is empty?
str.length() == 1            // is single char?
str.length() <= 5            // is short string?
str.equals(lower)            // is lowercase?
str.equals(upper)            // is uppercase?
str.contains(" ")            // contains space?
str.contains(",")            // contains comma?
str.contains(".")            // contains period?
str.contains("-")            // contains dash?
str.contains("/")            // contains slash?
str.matches(".*\\d.*")       // contains digits?
str.matches("\\d+")          // only digits?
str.matches(".*[a-zA-Z].*")  // contains letters?
str.matches("[a-zA-Z]+")     // only letters?

Listing 3: Java code for all Properties acting on single Strings.

integer == 0  // is zero?
integer == 1  // is one?
integer == 2  // is two?
integer < 0   // is negative?

Listing 4: Java code for all Properties acting on single Integers.

outputStr.contains(str)             // output contains input?
outputStr.startsWith(str)           // output starts with input?
outputStr.endsWith(str)             // output ends with input?
str.contains(outputStr)             // input contains output?
str.startsWith(outputStr)           // input starts with output?
str.endsWith(outputStr)             // input ends with output?
outputStrLower.contains(lower)      // output contains input ignoring case?
outputStrLower.startsWith(lower)    // output starts with input ignoring case?
outputStrLower.endsWith(lower)      // output ends with input ignoring case?
lower.contains(outputStrLower)      // input contains output ignoring case?
lower.startsWith(outputStrLower)    // input starts with output ignoring case?
lower.endsWith(outputStrLower)      // input ends with output ignoring case?
str.equals(outputStr)               // input equals output?
lower.equals(outputStrLower)        // input equals output ignoring case?
str.length() == outputStr.length()  // input same length as output?
str.length() < outputStr.length()   // input shorter than output?
str.length() > outputStr.length()   // input longer than output?

Listing 5: Java code for all Properties acting on a String and the output String.

integer < outputStr.length()                 // is less than output length?
integer <= outputStr.length()                // is less or equal to output length?
integer == outputStr.length()                // is equal to output length?
integer >= outputStr.length()                // is greater or equal to output length?
integer > outputStr.length()                 // is greater than output length?
Math.abs(integer - outputStr.length()) <= 1  // is very close to output length?
Math.abs(integer - outputStr.length()) <= 3  // is close to output length?

Listing 6: Java code for all Properties acting on an Integer and the output String.
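To make the signature computation concrete, the following hypothetical Java sketch evaluates a handful of the properties from Listings 3 and 5 on one value/output pair. The class and method names are illustrative only; the real system evaluates all properties, for every example, and reduces the per-example booleans into a single signature.

```java
import java.util.*;

// Hypothetical sketch: compute a partial property signature for one string
// value paired with the example output, mirroring a few of the properties
// from Listings 3 and 5.
class PropertySignature {
    static List<Boolean> signature(String str, String outputStr) {
        String lower = str.toLowerCase();
        String outputStrLower = outputStr.toLowerCase();
        List<Boolean> sig = new ArrayList<>();
        // Single-string properties (Listing 3).
        sig.add(str.isEmpty());
        sig.add(str.length() == 1);
        sig.add(str.length() <= 5);
        sig.add(str.contains(" "));
        // String-vs-output properties (Listing 5).
        sig.add(outputStr.contains(str));
        sig.add(outputStr.startsWith(str));
        sig.add(outputStrLower.contains(lower));
        sig.add(str.length() < outputStr.length());
        return sig;
    }

    public static void main(String[] args) {
        // "butter" vs. "Butterfly": case-sensitive containment fails, but the
        // case-insensitive property and the length comparison hold.
        System.out.println(signature("butter", "Butterfly"));
        // prints [false, false, false, false, false, false, true, true]
    }
}
```

Each boolean is a cheap, execution-based feature: it tells the model something about how an intermediate value relates to the desired output without the model ever seeing the raw strings.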
B Expanded Description of DSL
Our DSL allows for nesting and composition of common string transformation functions. These functions include string concatenation (Concat); returning a substring at the beginning (Left), middle (Substr), or end (Right) of a string; and other familiar string functions: replacing a substring of one string, indicated by start and end position, with another string (Replace); removing whitespace from the beginning and end of a string (Trim); concatenating a string with itself a specified number of times (Repeat); substituting the first k occurrences of a substring with another (Substitute); converting an integer to a string (ToText); and converting a string to LowerCase, UpperCase, or every word capitalized (ProperCase). Integer functions include arithmetic, returning the index of the first occurrence of a substring within a string (Find), and string length. Finally, a few commonly useful string and integer constants are included.
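To pin down the semantics of a few of these operations, here is a small, hypothetical Java sketch. The 1-based, inclusive, spreadsheet-style index conventions used below are an assumption on our part; the paper does not spell out corner-case behavior, so treat this as an illustration rather than the DSL's definitive semantics.

```java
// Hypothetical sketch of a few DSL operations with assumed 1-based,
// spreadsheet-style index conventions.
class DslOps {
    // Left(s, k): the first k characters (clamped to the string length).
    static String left(String s, int k) {
        return s.substring(0, Math.min(k, s.length()));
    }

    // Right(s, k): the last k characters (clamped to the string length).
    static String right(String s, int k) {
        return s.substring(Math.max(0, s.length() - k));
    }

    // Substr(s, start, end): characters from position start through end,
    // 1-based and inclusive (assumed convention).
    static String substr(String s, int start, int end) {
        return s.substring(start - 1, end);
    }

    // Repeat(s, n): s concatenated with itself n times.
    static String repeat(String s, int n) {
        return s.repeat(n);
    }

    // Find(needle, haystack): 1-based index of the first occurrence,
    // mirroring the spreadsheet FIND convention (assumed).
    static int find(String needle, String haystack) {
        return haystack.indexOf(needle) + 1;
    }

    public static void main(String[] args) {
        System.out.println(left("program", 3));      // pro
        System.out.println(right("program", 4));     // gram
        System.out.println(substr("program", 2, 4)); // rog
        System.out.println(find("gram", "program")); // 4
    }
}
```

Nesting these functions, e.g. left(s, find(" ", s) - 1) to take everything before the first space, is exactly the kind of composition the bottom-up search enumerates.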
C List of Benchmark Programs
Here we show each of our human-written benchmark problems, and a possible solution written in a DSL that is a superset of the DSL used by our synthesizer. We have separated them into Listing 7, Listing 8, and Listing 9 for space reasons. Note that the synthesizer can and does solve problems with programs different from the programs given here, and that it does not solve all of the problems.

// abbreviate day of week and month names with capital letters
CONCATENATE(LEFT(UPPER(var_0), 3), ", ", MID(UPPER(var_0), ADD(FIND(",", var_0), 2), 3), MID(var_0, FIND(" ", var_0, ADD(FIND(",", var_0), 2)), 3))

// add decimal point if not present
IF(ISERROR(FIND(".", var_0)), CONCATENATE(var_0, ".0"), var_0)

// add plus sign to positive integers
IF(EXACT(LEFT(var_0, 1), "-"), var_0, CONCATENATE("+", var_0))

// add thousands separator to number
TEXT(var_0, "#,##0")

// append AM or PM to the hour depending on if it is morning
CONCATENATE(LEFT(var_0, MINUS(FIND(":", var_0), 1)), IF(EXACT(var_1, "morning"), "AM", "PM"))

// fix capitalization of city and state
CONCATENATE(LEFT(PROPER(var_0), MINUS(LEN(var_0), 1)), UPPER(RIGHT(var_0, 1)))

// capitalize the first word and lowercase the rest
REPLACE(LOWER(var_0), 1, 1, UPPER(LEFT(var_0, 1)))

// concatenate 3 strings in alphabetical order
JOIN(" ", TRANSPOSE(SORT(TRANSPOSE({var_0, var_1, var_2}))))

// concatenate a variable number of strings
IFNA(JOIN(", ", FILTER(TRANSPOSE({var_0, var_1, var_2}), NOT(EQ(0, LEN(TRANSPOSE({var_0, var_1, var_2})))))), "")

// concatenate column values with underscore except when value is empty
CONCATENATE(var_0, IF(OR(var_0="", AND(var_1="", var_2="")), "", "_"), var_1, IF(OR(var_1="", var_2=""), "", "_"), var_2)

// whether the first string contains the second
TO_TEXT(ISNUMBER(FIND(var_1, var_0)))

// whether the first string contains the second, ignoring case
TO_TEXT(ISNUMBER(FIND(LOWER(var_1), LOWER(var_0))))

// count the number of times the second string appears in the first
TO_TEXT(DIVIDE(MINUS(LEN(var_0), LEN(SUBSTITUTE(var_0, var_1, ""))), LEN(var_1)))

// create email address from name and company
LOWER(CONCATENATE(LEFT(var_0, 1), var_1, "@", var_2, ".com"))

// change DDMMYYYY date to MM/DD/YYYY
CONCATENATE(MID(var_0, 3, 2), "/", REPLACE(var_0, 3, 2, "/"))

// change YYYY-MM-DD date to YYYY/MM/DD
SUBSTITUTE(var_0, "-", "/")

// change YYYY-MM-DD date to MM/DD
SUBSTITUTE(RIGHT(var_0, 5), "-", "/")

// extract the text between > and <
REGEXEXTRACT(var_0, ">(.*)<")

// extract date and time from two different input formats
CONCATENATE(REGEXEXTRACT(var_0, "\d+"), REGEXEXTRACT(var_0, " \w+"), REGEXEXTRACT(var_0, ", \d+pm"))

// extract the single number
REGEXEXTRACT(var_0, "[0-9]+")

// extract the number between the second parenthesis
REGEXEXTRACT(var_0, ".*\(.*\).*\((\d+)\).*")

// extract the number starting with 20 in the input string
TRIM(REGEXEXTRACT(var_0, "\s20\d+"))

// extract the part of a URL between the 2nd and 3rd slash
MID(var_0, ADD(FIND("//", var_0), 2), MINUS(MINUS(FIND("/", var_0, 9), FIND("/", var_0)), 2))

// extract the part of a URL starting from the 3rd slash
RIGHT(var_0, ADD(1, MINUS(LEN(var_0), FIND("/", var_0, ADD(FIND("//", var_0), 2)))))
Listing 7: Potential solutions for our benchmarks, along with comments describing the semantics of the solution.

// extract the top-level domain of a URL
REGEXEXTRACT(var_0, "//[^/]*\.(.+?)/")

// get first name from second column
LEFT(var_1, MINUS(FIND(" ", var_1), 1))

// get last name from first column
RIGHT(var_0, MINUS(LEN(var_0), FIND(" ", var_0)))

// find the longest common prefix of two words
LEFT(var_0, ARRAYFORMULA(SUM(IF(EQ(LEFT(var_0, SEQUENCE(MIN(LEN(var_0), LEN(var_1)))), LEFT(var_1, SEQUENCE(MIN(LEN(var_0), LEN(var_1))))), 1, 0))))

// find the longest common suffix of two words
RIGHT(var_0, ARRAYFORMULA(SUM(IF(EQ(RIGHT(var_0, SEQUENCE(MIN(LEN(var_0), LEN(var_1)))), RIGHT(var_1, SEQUENCE(MIN(LEN(var_0), LEN(var_1))))), 1, 0))))

// create acronym from multiple words in one cell
JOIN("", ARRAYFORMULA(LEFT(SPLIT(var_0, " "), 1)))

// create capitalized acronym from multiple words in one cell
UPPER(JOIN("", ARRAYFORMULA(LEFT(SPLIT(var_0, " "), 1))))

// output "Completed" if 100%, "Not Yet Started" if 0%, and "In Progress" if between 0% and 100%
IF(var_0="100%", "Completed", IF(var_0="0%", "Not Yet Started", "In Progress"))

// enclose negative numbers in parentheses
IF(EXACT(LEFT(var_0, 1), "-"), CONCATENATE(SUBSTITUTE(var_0, "-", "("), ")"), var_0)

// create currency string
CONCATENATE("$", TEXT(var_0, "0.00"))

// determine if the text is a word or a number
IF(REGEXMATCH(var_0, "^[[:alpha:]]+$"), "word", "number")

// determine if the text is a word, number, or neither
IF(REGEXMATCH(var_0, "^[[:alpha:]]+$"), "word", IF(REGEXMATCH(var_0, "^\d+(\.\d+)?$"), "number", "neither"))

// pad text with spaces to a given width
CONCATENATE(REPT(" ", MINUS(VALUE(var_1), LEN(var_0))), var_0)

// pad number with 0 to width 5
CONCATENATE(REPT("0", MINUS(5, LEN(var_0))), var_0)

// the depth of a path, i.e., count the number of /
TO_TEXT(MINUS(LEN(var_0), LEN(SUBSTITUTE(var_0, "/", ""))))

// extract the rest of a word given a prefix
RIGHT(var_0, MINUS(LEN(var_0), LEN(var_1)))

// prepend Mr. to last name
CONCATENATE("Mr. ", RIGHT(var_0, MINUS(LEN(var_0), FIND(" ", var_0))))

// prepend Mr. or Ms. to last name depending on gender
CONCATENATE(IF(EXACT(var_1, "male"), "Mr. ", "Ms. "), RIGHT(var_0, MINUS(LEN(var_0), FIND(" ", var_0))))

// yes if at least one input string is yes
IF(OR(ARRAYFORMULA(EXACT("yes", {var_0, var_1, var_2}))), "yes", "no")

// remove duplicate numbers
JOIN(" ", UNIQUE(TRANSPOSE(SPLIT(var_0, " "))))