Programming with a Differentiable Forth Interpreter
Matko Bošnjak, Tim Rocktäschel, Jason Naradowsky, Sebastian Riedel
Abstract
Given that in practice training data is scarce for all but a small set of problems, a core question is how to incorporate prior knowledge into a model. In this paper, we consider the case of prior procedural knowledge for neural networks, such as knowing how a program should traverse a sequence, but not what local actions should be performed at each step. To this end, we present an end-to-end differentiable interpreter for the programming language Forth which enables programmers to write program sketches with slots that can be filled with behaviour trained from program input-output data. We can optimise this behaviour directly through gradient descent techniques on user-specified objectives, and also integrate the program into any larger neural computation graph. We show empirically that our interpreter is able to effectively leverage different levels of prior program structure and learn complex behaviours such as sequence sorting and addition. When connected to outputs of an LSTM and trained jointly, our interpreter achieves state-of-the-art accuracy for end-to-end reasoning about quantities expressed in natural language stories.
1. Introduction
A central goal of Artificial Intelligence is the creation of machines that learn as effectively from human instruction as they do from data. A recent and important step towards this goal is the invention of neural architectures that learn to perform algorithms akin to traditional computers, using primitives such as memory access and stack manipulation (Graves et al., 2014; Joulin & Mikolov, 2015; Grefenstette et al., 2015; Kaiser & Sutskever, 2015; Kurach et al., 2016; Graves et al., 2016). These architectures can be trained through standard gradient descent methods, and enable machines to learn complex behaviour from input-output pairs or program traces. In this context, the role of the human programmer is often limited to providing training data. However, training data is a scarce resource for many tasks. In these cases, the programmer may have partial procedural background knowledge: one may know the rough structure of the program, or how to implement several subroutines that are likely necessary to solve the task. For example, in programming by demonstration (Lau et al., 2001) or query language programming (Neelakantan et al., 2015a) a user establishes a larger set of conditions on the data, and the model needs to set out the details. In all these scenarios, the question then becomes how to exploit various types of prior knowledge when learning algorithms.

To address the above question we present an approach that enables programmers to inject their procedural background knowledge into a neural network. In this approach, the programmer specifies a program sketch (Solar-Lezama et al., 2005) in a traditional programming language. This sketch defines one part of the neural network behaviour. The other part is learned using training data. The core insight that enables this approach is the fact that most programming languages can be formulated in terms of an abstract machine that executes the commands of the language. We implement these machines as neural networks, constraining parts of the networks to follow the sketched behaviour. The resulting neural programs are consistent with our prior knowledge and optimised with respect to the training data.

In this paper, we focus on the programming language Forth (Brodie, 1980), a simple yet powerful stack-based language that facilitates factoring and abstraction. Underlying Forth's semantics is a simple abstract machine. We introduce
∂4, an implementation of this machine that is differentiable with respect to the transition it executes at each time step, as well as distributed input representations. Sketches that users write define underspecified behaviour which can then be trained with backpropagation.

For two neural programming tasks introduced in previous work (Reed & de Freitas, 2015) we present Forth sketches that capture different degrees of prior knowledge. For example, we define only the general recursive structure of a sorting problem. We show that given only input-output pairs, ∂4 learns to fill in the underspecified slots and generalises well beyond the sequence lengths seen during training. Finally, when connected to an upstream LSTM and trained jointly, ∂4 learns to read natural language narratives, extract important numerical quantities, and reason with these, ultimately answering corresponding mathematical questions without the need for explicit intermediate representations used in previous work.

The contributions of our work are as follows: i) We present a neural implementation of a dual stack machine underlying Forth, ii) we introduce Forth sketches for programming with partial procedural background knowledge, iii) we apply Forth sketches as a procedural prior on learning algorithms from data, iv) we introduce program code optimisations based on symbolic execution that can speed up neural execution, and v) using Forth sketches we obtain state-of-the-art results for end-to-end reasoning about quantities expressed in natural language narratives.
2. The Forth Abstract Machine
Forth is a simple Turing-complete stack-based programming language (ANSI, 1994; Brodie, 1980). We chose Forth as the host language of our work because i) it is an established, general-purpose high-level language relatively close to machine code, ii) it promotes highly modular programs through use of branching, loops and function calls, thus striking a good balance between assembly and higher-level languages, and importantly iii) its abstract machine is simple enough for a straightforward creation of its continuous approximation. Forth's underlying abstract machine is represented by a state S = (D, R, H, c), which contains two stacks: a data evaluation pushdown stack D (data stack) holds values for manipulation, and a return address pushdown stack R (return stack) assists with return pointers and subroutine calls. These are accompanied by a heap or random memory access buffer H, and a program counter c.

A Forth program P is a sequence of Forth words (i.e. commands) P = w_1 ... w_n. The role of a word varies, encompassing language keywords, primitives, and user-defined subroutines (e.g. DROP discards the top element of the data stack, DUP duplicates the top element of the data stack). Each word w_i defines a transition function between machine states, w_i : S → S. Therefore, a program P itself defines a transition function by simply applying the word at the current program counter to the current state. Although usually considered as a part of the heap H, we consider Forth programs P separately to ease the analysis. Forth is a concatenative language; in this work, we restrict ourselves to a subset of all Forth words, detailed in Appendix A.

An example of a Bubble sort algorithm implemented in Forth is shown in Listing 1 (all lines except 3b-4c).

    1  : BUBBLE ( a1 ... an n-1 -- one pass )
    2    DUP IF >R
    3a     OVER OVER < IF SWAP THEN
    3b     { observe D0 D-1 -> permute D-1 D0 R0 }
    3c     { observe D0 D-1 -> choose NOP SWAP }
    4a     R> SWAP >R 1- BUBBLE R>
    4b     1- BUBBLE R>
    4c     R> SWAP >R 1- BUBBLE R>
    5    ELSE
    6      DROP
    7    THEN
    8  ;
    9  : SORT ( a1 .. an n -- sorted )
    10   1- DUP 0 DO >R R@ BUBBLE R> LOOP
    11   DROP ;
    12 2 4 2 7 4 SORT    \ Example call

Listing 1: Three code alternatives (white lines are common to all, lettered lines are alternative-specific): i) Bubble sort in Forth (a lines – green), ii) Permute sketch (b lines – blue), and iii) Compare sketch (c lines – yellow).

The execution starts from line 12, where literals are pushed on the data stack and SORT is called. Line 10 executes the main loop over the sequence. Lines 2-7 denote the
BUBBLE procedure: a comparison of the top two stack numbers (line 3a), and the recursive call to itself (line 4a). A detailed description of how this program is executed by the Forth abstract machine is provided in Appendix B. Notice that while Forth provides common control structures such as looping and branching, these can always be reduced to low-level code that uses jumps and conditional jumps (using the words BRANCH and BRANCH0, respectively). Likewise, we can think of subroutine definitions as labelled code blocks, and their invocation amounts to jumping to the code block with the respective label.
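To make the idea of words as state transition functions concrete, the following minimal Python sketch (our own illustration, not part of ∂4 or of standard Forth tooling) models a discrete machine state S = (D, R, H, c) and two words, DUP and DROP, as functions from state to state.

    # Minimal illustration: a discrete Forth-style machine state S = (D, R, H, c)
    # and two words modelled as state transition functions.
    from dataclasses import dataclass, replace
    from typing import Callable, List

    @dataclass(frozen=True)
    class State:
        D: tuple = ()      # data stack (rightmost element is top-of-stack)
        R: tuple = ()      # return stack
        H: tuple = ()      # heap
        c: int = 0         # program counter

    def dup(s: State) -> State:
        # DUP: duplicate the top element of the data stack, advance the counter.
        return replace(s, D=s.D + (s.D[-1],), c=s.c + 1)

    def drop(s: State) -> State:
        # DROP: discard the top element of the data stack, advance the counter.
        return replace(s, D=s.D[:-1], c=s.c + 1)

    # A program is a sequence of words; executing it chains the transitions.
    program: List[Callable[[State], State]] = [dup, drop, drop]
    s = State(D=(2, 4, 2, 7))
    for word in program:
        s = word(s)
    print(s.D)   # (2, 4, 2)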
3. ∂4: Differentiable Abstract Machine
When a programmer writes a Forth program, they define a sequence of Forth words, i.e., a sequence of known state transition functions. In other words, the programmer knows exactly how computation should proceed. To accommodate cases when the developer's procedural background knowledge is incomplete, we extend Forth to support the definition of a program sketch. As is the case with Forth programs, sketches are sequences of transition functions. However, a sketch may contain transition functions whose behaviour is learned from data.

To learn the behaviour of transition functions within a program we would like the machine output to be differentiable with respect to these functions (and possibly representations of inputs to the program). This enables us to choose parameterised transition functions such as neural networks. To this end, we introduce ∂4, a TensorFlow (Abadi et al., 2015) implementation of a differentiable abstract machine with continuous state representations, differentiable words and sketches. Program execution in ∂4 is modelled by a recurrent neural network over these continuous machine states, described below.
Figure 1: Left: the Neural Forth Abstract Machine. A Forth sketch P is translated to low-level code P_θ, with slots { ... } substituted by parametrised neural networks. Slots are learnt from input-output examples (x, y) through the differentiable machine, whose state S_i comprises the low-level code, program counter c, data stack D (with pointer d), return stack R (with pointer r), and the heap H. Right: a BiLSTM trained on Word Algebra Problems. Output vectors corresponding to a representation of the entire problem, as well as context representations of numbers and the numbers themselves, are fed into H to solve tasks. The entire system is end-to-end differentiable.

We map the symbolic machine state S = (D, R, H, c) to a continuous representation S = (D, R, H, c) comprising two differentiable stacks (with pointers), the data stack D = (D, d) and the return stack R = (R, r), a heap H, and an attention vector c indicating which word of the sketch P_θ is being executed at the current time step. Figure 1 depicts the machine together with its elements. All three memory structures, the data stack, the return stack and the heap, are based on differentiable flat memory buffers M ∈ {D, R, H}, where D, R, H ∈ R^{l×v}, for a stack size l and a value size v. Each has a differentiable read operation

    read_M(a) = a^T M

and write operation

    write_M(x, a) : M ← M − (a 1^T) ⊙ M + x a^T

akin to the Neural Turing Machine (NTM) memory (Graves et al., 2014), where ⊙ is the element-wise multiplication and a is the address pointer. In addition to the memory buffers D and R, the data stack and the return stack contain pointers to the current top-of-the-stack (TOS) element, d, r ∈ R^l, respectively. This allows us to implement pushing as writing a value x into M and incrementing the TOS pointer:

    push_M(x) : write_M(x, p)    (side-effect: p ← inc(p))

where p ∈ {d, r}, inc(p) = p^T R^+, dec(p) = p^T R^−, and R^+ and R^− are increment and decrement matrices (left and right circular shift matrices). The equal widths of H and D allow us to directly move vector representations of values between the heap and the stack. Popping is realised by multiplying the TOS pointer and the memory buffer, and decreasing the TOS pointer:

    pop_M() = read_M(p)          (side-effect: p ← dec(p))

Finally, the program counter c ∈ R^p is a vector that, when one-hot, points to a single word in a program of length p, and is equivalent to the c vector of the symbolic state machine (during training, c can become distributed and is then treated as attention over the program code). We use S to denote the space of all continuous representations S.
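As a concrete illustration of these buffer operations, the following NumPy sketch (our own example; the stack size, value size and helper names are arbitrary choices rather than details of the ∂4 code) implements read, write, push and pop with circular-shift pointer arithmetic.

    # Illustrative NumPy sketch of the differentiable buffer operations.
    import numpy as np

    l, v = 8, 8                                  # stack size, value size (example choices)
    R_plus = np.roll(np.eye(l), 1, axis=1)       # increment: circular shift
    R_minus = np.roll(np.eye(l), -1, axis=1)     # decrement: circular shift

    def inc(p): return p @ R_plus
    def dec(p): return p @ R_minus

    def read(M, a):                              # read_M(a) = a^T M
        return a @ M

    def write(M, x, a):                          # zero the addressed row, then write x there
        return M - np.outer(a, np.ones(v)) * M + np.outer(a, x)

    def push(M, p, x):                           # increment the pointer, then write x (cf. the DUP example)
        p = inc(p)
        return write(M, x, p), p

    def pop(M, p):                               # read at the pointer, then decrement it
        return read(M, p), dec(p)

    # Example: push a one-hot value onto an empty data stack D, then pop it.
    D = np.zeros((l, v))
    d = np.eye(l)[0]                             # TOS pointer
    value = np.eye(v)[3]                         # one-hot encoding of the value 3
    D, d = push(D, d, value)
    top, d = pop(D, d)
    print(top.argmax())                          # 3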
Neural Forth Words    It is straightforward to convert Forth words, defined as functions on discrete machine states, to functions operating on the continuous space S. For example, consider the word DUP, which duplicates the top of the data stack. A differentiable version of DUP first calculates the value e on the TOS address of D, as e = d^T D. It then shifts the stack pointer via d ← inc(d), and writes e to D using write_D(e, d). The complete description of implemented Forth words and their differentiable counterparts can be found in Appendix A.

We define a Forth sketch P_θ as a sequence of continuous transition functions P = w_1 ... w_n. Here, w_i ∈ S → S either corresponds to a neural Forth word or a trainable transition function (a neural network in our case). We will call these trainable functions slots, as they correspond to underspecified "slots" in the program code that need to be filled by learned behaviour.

We allow users to define a slot w by specifying a pair of a state encoder w_enc and a decoder w_dec. The encoder produces a latent representation h of the current machine state using a multi-layer perceptron, and the decoder consumes this representation to produce the next machine state. We hence have w = w_dec ∘ w_enc. To use slots within Forth program code, we introduce a notation that reflects this decomposition. In particular, slots are defined by the syntax { encoder -> decoder } where encoder and decoder are specifications of the corresponding slot parts as described below.
Encoders    We provide the following options for encoders:

    static: produces a static representation, independent of the actual machine state.

    observe e_1 ... e_m: concatenates the elements e_1 ... e_m of the machine state. An element can be a stack item Di at relative index i, a return stack item Ri, etc.

    linear N, sigmoid, tanh: represent chained transformations, which enable the multilayer perceptron architecture. linear N projects to N dimensions, and sigmoid and tanh apply the same-named functions elementwise.
Decoders    Users can specify the following decoders:

    choose w_1 ... w_m: chooses from the Forth words w_1 ... w_m. Takes an input vector h of length m to produce a weighted combination of machine states Σ_{i=1}^{m} h_i w_i(S).

    manipulate e_1 ... e_m: directly manipulates the machine state elements e_1 ... e_m by writing the appropriately reshaped and softmaxed output of the encoder over the machine state elements with write_M.

    permute e_1 ... e_m: permutes the machine state elements e_1 ... e_m via a linear combination of m! state vectors.

We model execution using an RNN which produces a state S_{n+1} conditioned on a previous state S_n. It does so by first passing the current state to each function w_i in the program, and then weighing each of the produced next states by the component of the program counter vector c_i that corresponds to program index i, effectively using c as an attention vector over code. Formally we have:

    S_{n+1} = RNN(S_n, P_θ) = Σ_{i=1}^{|P|} c_i w_i(S_n)

Clearly, this recursion, and its final state, are differentiable with respect to the program code P_θ and its inputs. Furthermore, for differentiable Forth programs the final state of this RNN will correspond to the final state of a symbolic execution (when no slots are present, and one-hot values are used).
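The recursion can be pictured in a few lines of code. The following Python sketch (a simplification we provide for illustration, not the TensorFlow implementation) applies every word's transition to the current state and mixes the results using the program counter as attention over the code.

    # Illustrative sketch of one execution-RNN step.
    def rnn_step(state, words):
        """state: dict of numpy arrays, including the program counter 'c'.
           words: list of |P| functions, each mapping a state dict to a new state dict."""
        c = state["c"]                               # attention over program positions
        next_states = [w(state) for w in words]      # apply every word transition
        mixed = {}
        for key in state:                            # weight each component by c_i and sum
            mixed[key] = sum(c[i] * next_states[i][key] for i in range(len(words)))
        return mixed

    def run(state, words, steps):
        for _ in range(steps):
            state = rnn_step(state, words)
        return state

Each w here stands for a differentiable word or slot network; every transition also produces its own updated program counter, so the mixed counter remains a distribution over the code.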
Figure 2: ∂4 execution of a Forth sketch fragment (described in Appendix D). The pointers (d, r) and values (rows of R and D) are all in one-hot state (colours simply denote values observed, defined by the top scale), while the program counter maintains the uncertainty. Subsequent states are discretised for clarity. Here, the slot { ... } has learned its optimal behaviour.

3.4. Program Code Optimisations

We further employ two optimisations that speed up the execution of ∂4: symbolic execution and interpolation of if-branches.
Symbolic Execution
Whenever we have a sequence of Forth words that contains no branch entry or exit points, we can collapse this sequence into a single transition instead of naively interpreting words one-by-one. We symbolically execute (King, 1976) a sequence of Forth words to calculate a new machine state. We then use the difference between the new and the initial state to derive the transition function of the sequence. For example, the sequence R> SWAP >R that swaps the top elements of the data and the return stack yields the symbolic state D = r_1 d_2 ... d_l and R = d_1 r_2 ... r_l. Comparing it to the initial state, we derive a single neural transition that only needs to swap the top elements of D and R.
Interpolation of If-Branches    We cannot apply symbolic execution to code with branching points, as the branching behaviour depends on the current machine state and we cannot resolve it symbolically. However, we can still collapse if-branches that involve no function calls or loops by executing both branches in parallel and weighing their output states by the value of the condition. If the if-branch does contain function calls or loops, we simply fall back to execution of all words weighted by the program counter.
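A minimal sketch of this interpolation (our own illustration, assuming the state-as-dictionary representation from the earlier snippets): both branch bodies are executed on the current state, and their output states are mixed by the soft truth value of the condition.

    # Illustrative if-branch interpolation: run both branches and mix their
    # output states by the soft truth value p of the condition.
    def interpolate_if(state, p, then_branch, else_branch):
        """p in [0, 1]: differentiable truth value of the condition (e.g. derived
           from the popped top-of-stack); the branches map state dicts to state dicts."""
        s_then = then_branch(dict(state))
        s_else = else_branch(dict(state))
        return {k: p * s_then[k] + (1.0 - p) * s_else[k] for k in state}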
Training    Our training procedure assumes input-output pairs of machine start and end states (x_i, y_i) only. The output y_i defines a target memory Y_i^D and a target pointer y_i^d on the data stack D. Additionally, we have a mask K_i that indicates which components of the stack should be included in the loss (e.g. we do not care about values above the stack depth). We use D_T(θ, x_i) and d_T(θ, x_i) to denote the final state of D and d after T steps of the execution RNN with initial state x_i. We define the loss function as

    L(θ) = H(K_i ⊙ D_T(θ, x_i), K_i ⊙ Y_i^D) + H(K_i ⊙ d_T(θ, x_i), K_i ⊙ y_i^d)

where H(x, y) = −x log y is the cross-entropy loss, and θ are the parameters of the slots in the program P_θ. We can use backpropagation and any variant of gradient descent to optimise this loss function. Note that at this point it would be possible to include supervision of the intermediate states (trace-level), as done by the Neural Programmer-Interpreter (Reed & de Freitas, 2015).
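Written out in code, the loss looks as follows. This NumPy sketch is our own illustration; in particular, deriving a per-row pointer mask from K_i is an assumption made for the example, not a detail taken from the paper.

    # Illustrative masked cross-entropy loss over the final data stack and pointer.
    import numpy as np

    def cross_entropy(target, prediction, eps=1e-8):
        # H(x, y) = -sum x * log y over the (masked) components
        return -np.sum(target * np.log(prediction + eps))

    def d4_loss(D_final, d_final, D_target, d_target, mask):
        # mask (K_i) zeroes out stack cells above the relevant stack depth
        loss = cross_entropy(mask * D_target, mask * D_final)
        row_mask = mask.max(axis=1)          # assumed per-row mask for the pointer
        loss += cross_entropy(row_mask * d_target, row_mask * d_final)
        return loss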
4. Experiments
We evaluate ∂4 on three tasks: sorting sequences of digits, elementary addition, and answering word algebra problems. Specific to the transduction tasks, we discretise memory elements during testing. This effectively allows the trained model to generalise to any sequence length if the correct sketch behaviour has been learned. We also compare against a Seq2Seq (Sutskever et al., 2014) baseline. Full details of the experimental setup can be found in Appendix E.
Sorting sequences of digits is a hard task for RNNs, as they fail to generalise to sequences even marginally longer than the ones they have been trained on (Reed & de Freitas, 2015). We investigate several strong priors based on Bubble sort for this transduction task and present two ∂4 sketches.

Table 1: Accuracy of the models on the Bubble sort task for varying training and test sequence lengths.

                      Test Length 8          Test Length 64
    Train Length:     2      3      4        2      3      4
    Seq2Seq           26.2   29.2   39.1     13.3   13.6   15.9
    ∂4 Permute
    ∂4 Compare

Permute. A sketch specifying that the top two elements of the stack, and the top of the return stack, must be permuted based on the values of the former (line 3b). Both the value comparison and the permutation behaviour must be learned. The core of this sketch is depicted in Listing 1 (b lines), and the sketch is explained in detail in Appendix D.
Compare. This sketch provides additional prior procedural knowledge to the model. In contrast to Permute, only the comparison between the top two elements on the stack must be learned (line 3c). The core of this sketch is depicted in Listing 1 (c lines).

In both sketches, the outer loop can be specified in ∂4 (Listing 1, line 10), which repeatedly calls a function BUBBLE. In doing so, it defines sufficient structure so that the behaviour of the network is invariant to the input sequence length.
Results on Bubble sort
A quantitative comparison of our models on the Bubble sort task is provided in Table 1. For a given test sequence length, we vary the training set lengths to illustrate the model's ability to generalise to sequences longer than those it observed during training. We find that ∂4 is able to learn the correct sketch behaviour and sort sequences of 64 elements after observing only sequences of length two and three during training. In comparison, the Seq2Seq baseline falters when attempting similar generalisations, and performs close to chance when tested on longer sequences. Both ∂4 sketches learn the correct behaviour, although their accuracies vary with the training sequence length (the Compare sketch performs better due to the more structure it imposes). We discuss this issue further in Section 5.
Next, we applied ∂4 to the task of learning to add two numbers digit by digit, where the overflow of each column (the carry) must be carried over to the subsequent column.

    1  : ADD-DIGITS ( a1 b1 ... an bn carry n -- r1 r2 ... r_{n+1} )
    2    DUP 0 = IF
    3      DROP
    4    ELSE
    5      >R                      \ put n on R
    6a     { observe D0 D-1 D-2 -> tanh -> linear 70 -> manipulate D-1 D-2 }
    6b     { observe D0 D-1 D-2 -> tanh -> linear 10 -> choose ... }
    7a     DROP
    7b     { observe D-1 D-2 D-3 -> tanh -> linear 50 -> choose ... }
    8      R> 1- SWAP >R           \ new_carry n-1
    9      ADD-DIGITS              \ call add-digits on n-1 subseq.
    10     R>                      \ put remembered results back on the stack
    11   THEN
    12 ;

Listing 2: Manipulate sketch (a lines – green) and Choose sketch (b lines – blue) for elementary addition. Input data is used to fill the data stack externally.

We assume aligned pairs of digits as input, with a carry for the least significant digit (potentially 0), and the length of the respective numbers. The sketches define the high-level operations through recursion, leaving the core addition to be learned from data. The specified high-level behaviour includes the recursive call template and the halting condition of the recursion (no remaining digits, lines 2-3). The underspecified addition operation must take three digits from the previous call, the two digits to sum and a previous carry, and produce a single digit (the sum) and the resultant carry (lines 6a, 6b and 7a, 7b). We introduce two sketches for inducing this behaviour:

Manipulate. This sketch provides little prior procedural knowledge as it directly manipulates the ∂4 machine state, writing over the relevant elements of the data stack (lines 6a and 7a). Depicted in Listing 2 (a lines).

Choose. Incorporating additional prior information, Choose exactly specifies the results of the computation, namely the output of the first slot (line 6b) is the carry, and the output of the second one (line 7b) is the result digit, both conditioned on the two digits and the carry on the data stack. Depicted in Listing 2 (b lines).

The rest of the sketch code reduces the problem size by one and returns the solution by popping it from the return stack.
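For reference, the behaviour that the addition sketches scaffold corresponds to the following plain Python recursion (our own illustration; in ∂4, the carry and digit computation marked below is exactly what the slots must learn).

    # Plain-Python rendering of the recursion the addition sketches scaffold:
    # consume aligned digit pairs plus an incoming carry, emit result digits
    # and a final carry.
    def add_digits(digits_a, digits_b, carry=0):
        if not digits_a:                     # halting condition: no remaining digits
            return [carry]
        a, b = digits_a[0], digits_b[0]
        total = a + b + carry                # <- the behaviour the slots must learn
        digit, new_carry = total % 10, total // 10
        # recursive call on the remaining (shorter) problem
        return add_digits(digits_a[1:], digits_b[1:], new_carry) + [digit]

    # 27 + 45, least-significant digit first in each list
    print(add_digits([7, 2], [5, 4]))        # [0, 7, 2]: final carry 0, then the digits of 72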
Quantitative Evaluation on Addition
In a set of experiments analogous to those in our evaluation on Bubble sort, we demonstrate the performance of ∂4 on the addition task (Table 2). As before, the Seq2Seq baseline fails to generalise to longer sequences than those observed during training. In comparison, both the Choose sketch and the less structured Manipulate sketch learn the correct sketch behaviour and generalise to all test sequence lengths (with an exception of Manipulate, which required more data to train perfectly). In additional experiments, we were able to successfully train both the Choose and the Manipulate sketches on short input sequences and test them on much longer sequences, confirming their perfect training and generalisation capabilities.

Table 2: Accuracy of the models on the addition task for varying training and test sequence lengths.

                      Test Length 8          Test Length 64
    Train Length:     2      4      8        2      4      8
    Seq2Seq           37.9   57.8   99.8     15.0   13.5   13.3
    ∂4 Manipulate
    ∂4 Choose

Word algebra problems (WAPs) are often used to assess the numerical reasoning abilities of schoolchildren. Questions are short narratives which focus on numerical quantities, culminating with a question. For example:
A florist had 50 roses. If she sold 15 of them and then laterpicked 21 more, how many roses would she have?
Answering such questions requires both the understanding of language and of algebra: one must know which numeric operations correspond to which phrase and how to execute these operations.

Previous work cast WAPs as a transduction task by mapping a question to a template of a mathematical formula, thus requiring manually labelled formulas. For instance, one formula that can be used to correctly answer the question in the example above is (50 - 15) + 21 = 56. In previous work, local classifiers (Roy & Roth, 2015; Roy et al., 2015), hand-crafted grammars (Koncel-Kedziorski et al., 2015), and recurrent neural models (Bouchard et al., 2016) have been used to perform this task. Predicted formula templates may be marginalised during training (Kushman et al., 2014), or evaluated directly to produce an answer. In contrast to these approaches, ∂4 learns both the templates and their execution, without the need for manually labelled equations and with no explicit symbolic representation of a formula.
Model description    Our model is a fully end-to-end differentiable structure, consisting of a ∂4 sketch (Listing 3) and a Bidirectional LSTM (BiLSTM) reader.

    \ first copy data from H: vectors to R and numbers to D
    { observe R0 R-1 R-2 R-3 -> permute D0 D-1 D-2 }
    { observe R0 R-1 R-2 R-3 -> choose + - * / }
    { observe R0 R-1 R-2 R-3 -> choose SWAP NOP }
    { observe R0 R-1 R-2 R-3 -> choose + - * / }
    \ lastly, empty out the return stack
Listing 3: Core of the Word Algebra Problem sketch. The full sketch can be found in the Appendix (Listing 4).

The BiLSTM reader reads the text of the problem and produces a vector representation (word vectors) for each word, concatenated from the forward and the backward pass of the BiLSTM network. We use the resulting word vectors corresponding only to numbers in the text, the numerical values of those numbers (encoded as one-hot vectors), and a vector representation of the whole problem (concatenation of the last and the first vector of the opposite passes) to initialise the ∂4 heap H. This is done in an end-to-end fashion, enabling gradient propagation through the BiLSTM to the vector representations. The process is depicted in Figure 1.

The sketch, depicted in Listing 3, dictates the differentiable computation. First, it copies values from the heap H: word vectors to the return stack R, and numbers (as one-hot vectors) to the data stack D. Second, it contains four slots that define the space of all possible operations of four operators on three operands, all conditioned on the vector representations on the return stack. These slots are i) a permutation of the elements on the data stack, ii) the choice of the first operator, iii) a possible swap of the intermediate result and the last operand, and iv) the choice of the second operator. The final set of commands simply empties out the return stack R. These slots define the space of possible operations; however, the model needs to learn how to utilise these operations in order to calculate the correct result.
Results    We evaluate the model on the Common Core (CC) dataset, introduced by Roy & Roth (2015). CC is notable for having the most diverse set of equation patterns, consisting of four operators (+, -, ×, ÷), with up to three operands. We compare against three baseline systems: (1) a local classifier with hand-crafted features (Roy & Roth, 2015), (2) a Seq2Seq baseline, and (3) the same model with a data generation component (GeNeRe) (Bouchard et al., 2016). All baselines are trained to predict the best equation, which is executed outside of the model to obtain the answer. In contrast, ∂4 is trained fully end-to-end and produces the answer directly, without an intermediate symbolic equation.
Table 3: Accuracies of models on the CC dataset. Asterisk denotes results obtained from Bouchard et al. (2016). Note that GeNeRe makes use of additional data.

    Model                                     Accuracy (%)
    Template Mapping
      Roy & Roth (2015)                       55.5
      Seq2Seq* (Bouchard et al., 2016)        95.0
      GeNeRe* (Bouchard et al., 2016)         98.5
    Fully End-to-End
      ∂4
5. Discussion

The Compare sketch, which provides more structure, achieves higher accuracies when trained on longer sequences. Similarly, employing softmax on the directly manipulated memory elements enabled perfect training for the Manipulate sketch for addition. Furthermore, it is encouraging to see that ∂4 can be trained jointly with an upstream LSTM, learning to solve word algebra problems end-to-end.
6. Related Work
Program Synthesis
The idea of program synthesis is as old as Artificial Intelligence, and has a long history in computer science (Manna & Waldinger, 1971). Whereas a large body of work has focused on using genetic programming (Koza, 1992) to induce programs from the given input-output specification (Nordin, 1997), there are also various Inductive Programming approaches (Kitzelmann, 2009) aimed at inducing programs from incomplete specifications of the code to be implemented (Albarghouthi et al., 2013; Solar-Lezama et al., 2006). We tackle the same problem of sketching, but in our case, we fill the sketches with neural networks able to learn the slot behaviour.
Probabilistic and Bayesian Programming
Our work is closely related to probabilistic programming languages such as Church (Goodman et al., 2008). They allow users to inject random choice primitives into programs as a way to define generative distributions over possible execution traces. In a sense, the random choice primitives in such languages correspond to the slots in our sketches. A core difference lies in the way we train the behaviour of slots: instead of calculating their posteriors using probabilistic inference, we estimate their parameters using backpropagation and gradient descent. This is similar in kind to TerpreT's FMGD algorithm (Gaunt et al., 2016), which is employed for code induction via backpropagation. In comparison, our model, which optimises slots of neural networks surrounded by continuous approximations of code, enables the combination of procedural behaviour and neural networks. In addition, the underlying programming and probabilistic paradigm in these programming languages is often functional and declarative, whereas our approach focuses on a procedural and discriminative view. By using an end-to-end differentiable architecture, it is easy to seamlessly connect our sketches to further neural input and output modules, such as an LSTM that feeds into the machine heap.
Neural approaches
Recently, there has been a surge of research in program synthesis and execution in deep learning, with increasingly elaborate deep models. Many of these models were based on differentiable versions of abstract data structures (Joulin & Mikolov, 2015; Grefenstette et al., 2015; Kurach et al., 2016), and a few abstract machines, such as the NTM (Graves et al., 2014), Differentiable Neural Computers (Graves et al., 2016), and Neural GPUs (Kaiser & Sutskever, 2015). All these models are able to induce algorithmic behaviour from training data. Our work differs in that our differentiable abstract machine allows us to seamlessly integrate code and neural networks, and to train the neural networks specified by slots via backpropagation. Also related to our efforts is Autograd (Maclaurin et al., 2015), which enables automatic gradient computation in pure Python code, but does not define nor use differentiable access to its underlying abstract machine.

The work in neural approximations to abstract structures and machines naturally leads to more elaborate machinery able to induce and call code or code-like behaviour. Neelakantan et al. (2015a) learned simple SQL-like behaviour: querying tables from natural language with simple arithmetic operations. Although sharing similarities on a high level, the primary goal of our model is not induction of (fully expressive) code but its injection. Andreas et al. (2016) learn to compose neural modules to produce the desired behaviour for a visual QA task. Neural Programmer-Interpreters (Reed & de Freitas, 2015) learn to represent and execute programs, operating on different modes of an environment, and are able to incorporate decisions better captured in a neural network than in many lines of code (e.g. using an image as an input). Users inject prior procedural knowledge by training on program traces and hence require full procedural knowledge. In contrast, we enable users to use their partial knowledge in sketches.

Neural approaches to language compilation have also been researched, from compiling a language into neural networks (Siegelmann, 1994), over building neural compilers (Gruau et al., 1995) to adaptive compilation (Bunel et al., 2016). However, that line of research did not perceive neural interpreters and compilers as a means of injecting procedural knowledge as we did. To the best of our knowledge, ∂4 is the first differentiable interpreter for an existing general-purpose programming language that supports injecting partial procedural knowledge in this way.
7. Conclusion and Future Work
We have presented ∂4, a differentiable abstract machine for the Forth programming language, and showed how it can be used to complement programmers' prior knowledge through the learning of unspecified behaviour in Forth sketches. On transduction tasks and on word algebra problems, ∂4 sketches learn the missing behaviour from input-output data alone and generalise beyond the training conditions.

Acknowledgments
We thank Guillaume Bouchard, Danny Tarlow, Dirk Weissenborn, Johannes Welbl and the anonymous reviewers for fruitful discussions and helpful comments on previous drafts of this paper. This work was supported by a Microsoft Research PhD Scholarship, an Allen Distinguished Investigator Award, and a Marie Curie Career Integration Award.
References
Abadi, Martín, Agarwal, Ashish, Barham, Paul, Brevdo, Eugene, Chen, Zhifeng, Citro, Craig, Corrado, Greg S., Davis, Andy, Dean, Jeffrey, Devin, Matthieu, Ghemawat, Sanjay, Goodfellow, Ian, Harp, Andrew, Irving, Geoffrey, Isard, Michael, Jia, Yangqing, Jozefowicz, Rafal, Kaiser, Lukasz, Kudlur, Manjunath, Levenberg, Josh, Mané, Dan, Monga, Rajat, Moore, Sherry, Murray, Derek, Olah, Chris, Schuster, Mike, Shlens, Jonathon, Steiner, Benoit, Sutskever, Ilya, Talwar, Kunal, Tucker, Paul, Vanhoucke, Vincent, Vasudevan, Vijay, Viégas, Fernanda, Vinyals, Oriol, Warden, Pete, Wattenberg, Martin, Wicke, Martin, Yu, Yuan, and Zheng, Xiaoqiang. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. URL http://tensorflow.org/. Software available from tensorflow.org.

Albarghouthi, Aws, Gulwani, Sumit, and Kincaid, Zachary. Recursive program synthesis. In Computer Aided Verification, pp. 934-950. Springer, 2013.

Andreas, Jacob, Rohrbach, Marcus, Darrell, Trevor, and Klein, Dan. Neural module networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

ANSI. Programming Languages - Forth, 1994. American National Standard for Information Systems, ANSI X3.215-1994.

Bouchard, Guillaume, Stenetorp, Pontus, and Riedel, Sebastian. Learning to generate textual data. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1608-1616, 2016.

Brodie, Leo. Starting Forth. Forth Inc., 1980.

Bunel, Rudy, Desmaison, Alban, Kohli, Pushmeet, Torr, Philip HS, and Kumar, M Pawan. Adaptive neural compilation. In Proceedings of the Conference on Neural Information Processing Systems (NIPS), 2016.

Gaunt, Alexander L, Brockschmidt, Marc, Singh, Rishabh, Kushman, Nate, Kohli, Pushmeet, Taylor, Jonathan, and Tarlow, Daniel. TerpreT: A Probabilistic Programming Language for Program Induction. arXiv preprint arXiv:1608.04428, 2016.

Goodman, Noah, Mansinghka, Vikash, Roy, Daniel M, Bonawitz, Keith, and Tenenbaum, Joshua B. Church: a language for generative models. In Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI), pp. 220-229, 2008.

Graves, Alex, Wayne, Greg, and Danihelka, Ivo. Neural Turing Machines. arXiv preprint arXiv:1410.5401, 2014.

Graves, Alex, Wayne, Greg, Reynolds, Malcolm, Harley, Tim, Danihelka, Ivo, Grabska-Barwińska, Agnieszka, Colmenarejo, Sergio Gómez, Grefenstette, Edward, Ramalho, Tiago, Agapiou, John, et al. Hybrid computing using a neural network with dynamic external memory. Nature, 538(7626):471-476, 2016.

Grefenstette, Edward, Hermann, Karl Moritz, Suleyman, Mustafa, and Blunsom, Phil. Learning to Transduce with Unbounded Memory. In Proceedings of the Conference on Neural Information Processing Systems (NIPS), pp. 1819-1827, 2015.

Gruau, Frédéric, Ratajszczak, Jean-Yves, and Wiber, Gilles. A Neural compiler. Theoretical Computer Science, 141(1):1-52, 1995.

Hochreiter, Sepp and Schmidhuber, Jürgen. Long short-term memory. Neural Computation, 9(8):1735-1780, 1997.

Joulin, Armand and Mikolov, Tomas. Inferring Algorithmic Patterns with Stack-Augmented Recurrent Nets. In Proceedings of the Conference on Neural Information Processing Systems (NIPS), pp. 190-198, 2015.

Kaiser, Łukasz and Sutskever, Ilya. Neural GPUs learn algorithms. In Proceedings of the International Conference on Learning Representations (ICLR), 2015.

King, James C. Symbolic Execution and Program Testing. Communications of the ACM, 19(7):385-394, 1976.

Kingma, Diederik and Ba, Jimmy. Adam: A Method for Stochastic Optimization. In Proceedings of the International Conference on Learning Representations (ICLR), 2015.

Kitzelmann, Emanuel. Inductive Programming: A Survey of Program Synthesis Techniques. In International Workshop on Approaches and Applications of Inductive Programming, pp. 50-73, 2009.

Koncel-Kedziorski, Rik, Hajishirzi, Hannaneh, Sabharwal, Ashish, Etzioni, Oren, and Ang, Siena. Parsing Algebraic Word Problems into Equations. Transactions of the Association for Computational Linguistics (TACL), 3:585-597, 2015.

Koza, John R. Genetic Programming: On the Programming of Computers by Means of Natural Selection, volume 1. MIT Press, 1992.

Kurach, Karol, Andrychowicz, Marcin, and Sutskever, Ilya. Neural Random-Access Machines. In Proceedings of the International Conference on Learning Representations (ICLR), 2016.

Kushman, Nate, Artzi, Yoav, Zettlemoyer, Luke, and Barzilay, Regina. Learning to Automatically Solve Algebra Word Problems. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), pp. 271-281, 2014.

Lau, Tessa, Wolfman, Steven A., Domingos, Pedro, and Weld, Daniel S. Learning repetitive text-editing procedures with SMARTedit. In Your Wish is My Command, pp. 209-226. Morgan Kaufmann Publishers Inc., 2001.

Maclaurin, Dougal, Duvenaud, David, and Adams, Ryan P. Gradient-based Hyperparameter Optimization through Reversible Learning. In Proceedings of the International Conference on Machine Learning (ICML), 2015.

Manna, Zohar and Waldinger, Richard J. Toward automatic program synthesis. Communications of the ACM, 14(3):151-165, 1971.

Neelakantan, Arvind, Le, Quoc V, and Sutskever, Ilya. Neural Programmer: Inducing latent programs with gradient descent. In Proceedings of the International Conference on Learning Representations (ICLR), 2015a.

Neelakantan, Arvind, Vilnis, Luke, Le, Quoc V, Sutskever, Ilya, Kaiser, Lukasz, Kurach, Karol, and Martens, James. Adding Gradient Noise Improves Learning for Very Deep Networks. arXiv preprint arXiv:1511.06807, 2015b.

Nordin, Peter. Evolutionary Program Induction of Binary Machine Code and its Applications. PhD thesis, Universität Dortmund, Fachbereich Informatik, 1997.

Reed, Scott and de Freitas, Nando. Neural programmer-interpreters. In Proceedings of the International Conference on Learning Representations (ICLR), 2015.

Roy, Subhro and Roth, Dan. Solving General Arithmetic Word Problems. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1743-1752, 2015.

Roy, Subhro, Vieira, Tim, and Roth, Dan. Reasoning about quantities in natural language. Transactions of the Association for Computational Linguistics (TACL), 3:1-13, 2015.

Siegelmann, Hava T. Neural Programming Language. In Proceedings of the Twelfth AAAI National Conference on Artificial Intelligence, pp. 877-882, 1994.

Solar-Lezama, Armando, Rabbah, Rodric, Bodík, Rastislav, and Ebcioğlu, Kemal. Programming by Sketching for Bit-streaming Programs. In Proceedings of Programming Language Design and Implementation (PLDI), pp. 281-294, 2005.

Solar-Lezama, Armando, Tancau, Liviu, Bodik, Rastislav, Seshia, Sanjit, and Saraswat, Vijay. Combinatorial Sketching for Finite Programs. In ACM SIGPLAN Notices, volume 41, pp. 404-415, 2006.

Sutskever, Ilya, Vinyals, Oriol, and Le, Quoc V. Sequence to Sequence Learning with Neural Networks. In Proceedings of the Conference on Neural Information Processing Systems (NIPS), pp. 3104-3112, 2014.
Appendix

A. Forth Words and their Implementation

We implemented a small subset of available Forth words in ∂4. The table of these words, together with their descriptions, is given in Table 4, and their implementation is given in Table 5. The commands are roughly divided into 7 groups. These groups, line-separated in the table, are:

    Data stack operations: { num }, 1+, 1-, DUP, SWAP, OVER, DROP, +, -, *, /
    Heap operations: @, !
    Comparators: >, <, =
    Return stack operations: >R, R>, @R
    Control statements: IF..ELSE..THEN, BEGIN..WHILE..REPEAT, DO..LOOP
    Subroutine control: :, { sub }, ;, MACRO
    Variable creation: VARIABLE, CREATE..ALLOT
Table 4: Forth words and their descriptions. TOS denotes top-of-stack, NOS denotes next-on-stack, DSTACK denotes the data stack, RSTACK denotes the return stack, and HEAP denotes the heap.

    { num }                 Pushes { num } to DSTACK.
    1+                      Increments DSTACK TOS by 1.
    1-                      Decrements DSTACK TOS by 1.
    DUP                     Duplicates DSTACK TOS.
    SWAP                    Swaps TOS and NOS.
    OVER                    Copies NOS and pushes it on the TOS.
    DROP                    Pops the TOS (non-destructive).
    +, -, *, /              Consumes DSTACK NOS and TOS. Returns NOS operator TOS.
    @                       Fetches the HEAP value from the DSTACK TOS address.
    !                       Stores DSTACK NOS to the DSTACK TOS address on the HEAP.
    >, <, =                 Consumes DSTACK NOS and TOS. Returns 1 (TRUE) if NOS > | < | = TOS respectively, 0 (FALSE) otherwise.
    >R                      Pushes DSTACK TOS to RSTACK TOS, removes it from DSTACK.
    R>                      Pushes RSTACK TOS to DSTACK TOS, removes it from RSTACK.
    @R                      Copies the RSTACK TOS to DSTACK TOS.
    IF..ELSE..THEN          Consumes DSTACK TOS; if it equals a non-zero number (TRUE), executes commands between IF and ELSE. Otherwise executes commands between ELSE and THEN.
    BEGIN..WHILE..REPEAT    Continually executes commands between WHILE and REPEAT while the code between BEGIN and WHILE evaluates to a non-zero number (TRUE).
    DO..LOOP                Consumes NOS and TOS, assumes NOS as a limit and TOS as a current index. Increases the index by 1 until equal to NOS. At every increment, executes commands between DO and LOOP.
    :                       Denotes a subroutine, followed by a word defining it.
    { sub }                 Subroutine invocation; puts the program counter PC on RSTACK, sets PC to the subroutine address.
    ;                       Subroutine exit. Consumes TOS from the RSTACK and sets the PC to it.
    MACRO                   Treats the subroutine as a macro function.
    VARIABLE                Creates a variable with a fixed address. Invoking the variable name returns its address.
    CREATE..ALLOT           Creates a variable with a fixed address. Does not allocate the next N addresses to any other variable (effectively reserving that portion of the heap for the variable).
Table 5: Implementation of the Forth words described in Table 4. Note that the variable creation words are implemented as fixed address allocation, and MACRO words are implemented with inlining.

Symbols:
    M                     a stack, M ∈ {D, R}
    M                     a memory buffer, M ∈ {D, R, H}
    p                     a pointer, p ∈ {d, r, c}
    R^+, R^-              increment and decrement matrices (circular shifts): R^±_{ij} = 1 if i ± 1 ≡ j (mod n), 0 otherwise
    R^+, R^-, R^*, R^/    circular arithmetic operation tensors: R^{op}_{ijk} = 1 if i {op} j ≡ k (mod n), 0 otherwise

Pointer and value manipulation:
    Increment a (or value x):           inc(a) = a^T R^+
    Decrement a (or value x):           dec(a) = a^T R^-
    Algebraic operation application:    {op}(a, b) = a^T R^{op} b
    Conditional jump:                   jump(c, a): p = (pop_D() = TRUE), c ← p c + (1 − p) a
    Next on stack:                      a_{-1} = a^T R^-

Buffer manipulation:
    READ from M:       read_M(a) = a^T M
    WRITE to M:        write_M(x, a): M ← M − (a 1^T) ⊙ M + x a^T
    PUSH x onto M:     push_M(x): write_M(x, p)    [side-effect: p ← inc(p)]
    POP from M:        pop_M() = read_M(p)         [side-effect: p ← dec(p)]

Forth words (shown for the data stack D with TOS pointer d; return stack operations are analogous):
    Literal x              push_D(x)
    1+                     write_D(inc(read_D(d)), d)
    1-                     write_D(dec(read_D(d)), d)
    DUP                    push_D(read_D(d))
    SWAP                   x = read_D(d), y = read_D(d_{-1}): write_D(y, d), write_D(x, d_{-1})
    OVER                   push_D(read_D(d_{-1}))
    DROP                   pop_D()
    +, -, *, /             write_D({op}(read_D(d_{-1}), read_D(d)), d)
    @                      read_H(d)
    !                      write_H(d, d_{-1})
    >, <, =                the expected values e_1 = Σ_{i=0}^{n-1} i · read_D(d)_i and e_2 = Σ_{i=0}^{n-1} i · read_D(d_{-1})_i are compared through the piecewise-linear function φ_pwl(x) = min(max(0, x + 0.5), 1)
    >R                     push_R(pop_D())
    R>                     push_D(pop_R())
    @R                     write_D(read_R(r), d)
    IF..ELSE..THEN         p = (pop_D() = 0): next state = p · [ELSE branch] + (1 − p) · [IF branch]
    BEGIN..WHILE..REPEAT   conditional jump(c, ..) back to the loop entry while the condition holds
    DO..LOOP               start = c, current = inc(pop_D()), limit = pop_D(); p = (current = limit); jump(p, ..), jump(c, start)
An example of a Forth program that implements the Bubble sort algorithm is shown in Listing 1. We provide a description of how the first iteration of this algorithm is executed by the Forth abstract machine.

The program begins at line 11, putting the sequence [2 4 2 7] on the data stack D, followed by the sequence length 4. It then calls the SORT word.
D R c comment1 [] [] 11 execution start2 [2 4 2 7 4] [] 8 pushing sequence to D , callingSORT subroutine puts A SORT to R For a sequence of length , SORT performs a do-loop in line 9 that calls the
BUBBLE sub-routine times. It does so bydecrementing the top of D with the word to . Subsequently, is duplicated on D by using DUP , and is pushed onto D .3 [2 4 2 7 3] [A SORT ] 9 1-4 [2 4 2 7 3 3] [A
SORT ] 9 DUP6 [2 4 2 7 3 3 0] [A
SORT ] 9 0 DO consumes the top two stack elements and as the limit and starting point of the loop, leaving the stack D to be[2,4,2,7,3]. We use the return stack R as a temporary variable buffer and push onto it using the word >R . This drops from D , which we copy from R with R@ SORT ] 9 DO8 [2 4 2 7] [Addr
SORT
3] 9 > R9 [2 4 2 7 3] [Addr
SORT
3] 9 @RNext, we call
BUBBLE to perform one iteration of the bubble pass, (calling
BUBBLE times internally), and consuming .Notice that this call puts the current program counter onto R , to be used for the program counter c when exiting BUBBLE .Inside the
BUBBLE subroutine,
DUP duplicates on R . IF consumes the duplicated and interprets is as TRUE. >R puts on R .10 [2 4 2 7 3] [A SORT
BUBBLE ] 0 calling BUBBLE subroutine putsA
BUBBLE to R
11 [2 4 2 7 3 3] [A
SORT
BUBBLE ] 1 DUP12 [2 4 2 7 3] [A
SORT
BUBBLE ] 1 IF13 [2 4 2 7] [A
SORT
BUBBLE
3] 1 > RCalling
OVER twice duplicates the top two elements of the stack, to test them with < , which tests whether < . IF tests ifthe result is TRUE ( ), which it is, so it executes SWAP .14 [2 4 2 7 2 7] [A
SORT
BUBBLE
3] 2 OVER OVER15 [2 4 2 7 1] [A
SORT
BUBBLE
3] 2 <
16 [2 4 2 7] [A
SORT
BUBBLE
3] 2 IF17 [2 4 7 2] [A
SORT
BUBBLE
3] 2 SWAPTo prepare for the next call to
BUBBLE we move back from the return stack R to the data stack D via R> , SWAP it with thenext element, put it back to R with >R , decrease the TOS with and invoke BUBBLE again. Notice that R will accumulatethe analysed part of the sequence, which will be recursively taken back.18 [2 4 7 2 3] [A SORT
BUBBLE ] 3 R >
19 [2 4 7 3 2] [A
SORT
BUBBLE ] 3 SWAP20 [2 4 7 3] [A
SORT
BUBBLE
2] 3 > R21 [2 4 7 2] [A
SORT
BUBBLE
2] 3 1-22 [2 4 7 2] [A
SORT
BUBBLE
2] 0 ...BUBBLEWhen we reach the loop limit we drop the length of the sequence and exit
SORT using the ; word, which takes the return address from R. At the final point, the stack should contain the ordered sequence [7 4 2 2]. (Note that Forth uses Reverse Polish Notation and that the top of the data stack is 4 in this example.)
C. Learning and Run Time Efficiency
C.1. Accuracy per Training Examples

Sorter    When measuring the performance of the model as the number of training instances varies, we can observe the benefit of additional prior knowledge to the optimisation process. We find that when stronger prior knowledge is provided (Compare), the model quickly maximises the training accuracy. Providing less structure (Permute) results in lower testing accuracy initially; however, both sketches learn the correct behaviour and generalise equally well after seeing enough training instances. Additionally, it is worth noting that the Permute sketch was not always able to converge to a result of the correct length, and both sketches are not trivial to train. In comparison, the Seq2Seq baseline is able to generalise only to the sequence length it was trained on (Seq2Seq trained and tested on sequence length 3); when trained on sequence length 3 and tested on the much longer sequence length of 8, it fails to generalise.

Figure 3: Accuracy of the models for a varying number of training examples, trained on input sequences of length 3 for the Bubble sort task. Compare, Permute, and Seq2Seq (test 8) were tested on sequence length 8, and Seq2Seq (test 3) was tested on sequence length 3.

Adder    We tested the models on training datasets of increasing size on the addition task. The results, depicted in Figure 4, show that both the Choose and the Manipulate sketch are able to generalise perfectly from few examples when trained on short sequences and tested on longer ones. In comparison, the Seq2Seq baseline performs well only when tested on inputs of the same length it was trained on; if we test Seq2Seq as we tested the sketches, it is unable to generalise.

Figure 4: Accuracy of the models for a varying number of training examples, trained on input sequences of length 8 for the addition task. Manipulate, Choose, and Seq2Seq (test 16) were tested on sequence length 16, and Seq2Seq (test 8) was tested on sequence length 8.
C.2. Program Code Optimisations
We measure the runtime of Bubble sort on sequences of varying length with and without the optimisations described in Section 3.4. The results of ten repeated runs are shown in Figure 5 and demonstrate large relative improvements for symbolic execution and interpolation of if-branches compared to non-optimised ∂4 code.

Figure 5: Relative speed improvements of program code optimisations for different input sequence lengths.

D. The ∂4 Permute Sketch

Listing 1 (lines 3b and 4b) defines the BUBBLE word as a sketch capturing several types of prior knowledge. In this section, we describe the Permute sketch. In it, we assume BUBBLE involves a recursive call that terminates at length 1, and that the next BUBBLE call takes as input some function of the current length and the top two stack elements. The inputs to this sketch are the sequence to be sorted and its length decremented by one, n − 1 (line 1). These inputs are expected on the data stack. After the length (n − 1) is duplicated for further use with DUP, the machine tests whether it is non-zero (using IF, which consumes the TOS during the check). If n − 1 > 0, it is stored on the R stack for future use (line 2).

At this point (line 3b) the programmer only knows that a decision must be made based on the top two data stack elements D0 and D-1 (comparison elements), and the top of the return stack, R0 (length decremented by 1). Here the precise nature of this decision is unknown but is limited to variants of permutation of these elements, the output of which produces the input state to the decrement (1-) and the recursive BUBBLE call (line 4b). At the culmination of the call, R0, the output of the learned slot behaviour, is moved onto the data stack using R>, and execution proceeds to the next step.

Figure 2 illustrates how portions of this sketch are executed on the ∂4 RNN. The figure shows the machine state around the execution of >R (line 3 in P), as indicated by the vector c, next to the program P. Both data and return stacks are partially filled (R has 1 element, D has 4), and we show the content both through horizontal one-hot vectors and their corresponding integer values (colour coded). The vectors d and r point to the top of both stacks, and are in a one-hot state as well. In this execution trace, the slot at line 4 is already showing optimal behaviour: it remembers that the element on the return stack (4) is larger and executes BUBBLE on the remaining sequence with the counter n subtracted by one, to 1.

E. Experimental details
The parameters of each sketch are trained using Adam (Kingma & Ba, 2015), with gradient clipping (set to 1.0) and gradient noise (Neelakantan et al., 2015b). We tuned the learning rate, batch size, and the parameters of the gradient noise in a random search on a development variant of each task.
E.1. Seq2Seq baseline
The Seq2Seq baseline models are single-layer networks with LSTM cells. The training procedure for these models consists of 500 epochs of Adam optimisation, with a batch size of 128, a learning rate of 0.01, and gradient clipping when the L2 norm of the model parameters exceeded 5.0. We vary the size of training and test data (Fig. 3), but observe no indication of the models failing to reach convergence under these training conditions.

E.2. Sorting
The Permute and Compare sketches in Table 1 were trained on a randomly generated train, development and test set containing 256, 32 and 32 instances, respectively. Note that the low number of dev and test instances was due to the computational complexity of the sketch. The batch size was set to a value dependent on the problem size, and the initial learning rate was tuned in the random search described above.

E.3. Addition
We trained the addition Choose and Manipulate sketches presented in Table 2 on randomly generated train, development and test sets of sizes 512, 256, and 1024, respectively. The batch size and the initial learning rate were set as described above.

E.4. Word Algebra Problem
The Common Core (CC) dataset (Roy & Roth, 2015) is partitioned into a train, dev, and test set containing 300, 100, and 200 questions, respectively. The BiLSTM word vectors were initialised randomly; the remaining hyperparameters (batch size, initial learning rate, word vector length, stack width and stack size) were chosen in the same manner as above.

F. Qualitative Analysis of PC Traces on Bubble Sort
In Figure 6 we visualise the program counter traces. The trace follows a single example from the start, to the middle, and to the end of the training process. In the beginning of training, the program counter starts to deviate from the one-hot representation in the first steps (not observed in the figure due to unobservable changes), and after two iterations of SORT the uncertainty of the program counter increases.

Figure 6: Program Counter traces for a single example at different stages of training Bubble sort in Listing 1: (a) early stages of training, (b) middle of training, (c) end of training (red: successive recursion calls to BUBBLE, green: successive returns from the recursion, and blue: calls to SORT). The last element in the last row is the halting command, which only gets executed after learning the correct slot behaviour.
G. The complete Word Algebra Problem sketch
The Word Algebra Problem (WAP) sketch described in Listing 3 is the core of the model that we use for WAP problems. However, there were additional words before and after the core which took care of copying the data from the heap to the data and return stacks, and finally emptying out the return stack. The full WAP sketch is given in Listing 4.

We define a QUESTION variable which will denote the address of the question vector on the heap. Lines 4 and 5 create the REPR_BUFFER and NUM_BUFFER variables and denote that they will occupy four sequential memory slots on the heap, where we will store the representation vectors and numbers, respectively. Lines 7 and 8 create variables REPR and NUM which will denote addresses of the current representations and numbers on the heap. Lines 10 and 11 store REPR_BUFFER to REPR and NUM_BUFFER to NUM, essentially setting the values of the variables REPR and NUM to the starting addresses allotted in lines 4 and 5. Lines 14-16 and 19-20 create macro functions STEP_NUM and STEP_REPR which increment the NUM and REPR values on call. These macro functions will be used to iterate through the heap space. Lines 24-25 define macro functions CURRENT_NUM for fetching the current number, and CURRENT_REPR for fetching representation values. Lines 28-32 essentially copy the values of numbers from the heap to the data stack by using the CURRENT_NUM and STEP_NUM macros. After that, line 35 pushes the question vector, and lines 36-40 push the word representations of numbers onto the return stack.

Following that, we define the core operations of the sketch. Line 43 permutes the elements on the data stack (numbers) as a function of the elements on the return stack (vector representations of the question and numbers). Line 45 chooses an operator to execute over the TOS and NOS elements of the data stack (again, conditioned on elements on the return stack). Line 47 executes a possible swap of the two elements on the data stack (the intermediate result and the last operand), conditioned on the return stack. Finally, line 49 chooses the last operator to execute on the data stack, conditioned on the return stack. The sketch ends with lines 52-55, which empty out the return stack.

    \ address of the question on H
    VARIABLE QUESTION

    \ allotting H for representations and numbers
    CREATE REPR_BUFFER 4 ALLOT
    CREATE NUM_BUFFER 4 ALLOT

    \ addresses of the first representation and number
    VARIABLE REPR
    VARIABLE NUM
    REPR_BUFFER REPR !
    NUM_BUFFER NUM !

    \ macro function for incrementing the pointer to numbers in H
    MACRO: STEP_NUM NUM @ 1+ NUM ! ;

    \ macro function for incrementing the pointer to representations in H
    MACRO: STEP_REPR REPR @ 1+ REPR ! ;

    \ macro functions for fetching current numbers and representations
    MACRO: CURRENT_NUM NUM @ @ ;
    MACRO: CURRENT_REPR REPR @ @ ;

    \ copy numbers to D
    CURRENT_NUM STEP_NUM
    CURRENT_NUM STEP_NUM
    CURRENT_NUM

    \ copy question vector, and representations of numbers to R
    QUESTION @ >R
    CURRENT_REPR >R STEP_REPR
    CURRENT_REPR >R STEP_REPR
    CURRENT_REPR >R

    \ permute stack elements, based on the question and number representations
    { observe R0 R-1 R-2 R-3 -> permute D0 D-1 D-2 }
    \ choose the first operation
    { observe R0 R-1 R-2 R-3 -> choose + - * / }
    \ choose whether to swap intermediate result and the bottom number
    { observe R0 R-1 R-2 R-3 -> choose SWAP NOP }
    \ choose the second operation
    { observe R0 R-1 R-2 R-3 -> choose + - * / }
    \ lastly, empty out the return stack

Listing 4: The full Word Algebra Problem sketch.