BF++: a language for general-purpose program synthesis
*Correspondence: [email protected]

Preprint, compiled February 22, 2021
Vadim Liventsev*, Aki Härmä, and Milan Petković

Eindhoven University of Technology
Philips Research Eindhoven

Abstract

Most state of the art decision systems based on Reinforcement Learning (RL) are data-driven black-box neural models, where it is often difficult to incorporate expert knowledge into the models or let experts review and validate the learned decision mechanisms. Knowledge insertion and model review are important requirements in many applications involving human health and safety. One way to bridge the gap between data and knowledge driven systems is program synthesis: replacing a neural network that outputs decisions with a symbolic program generated by a neural network or by means of genetic programming. We propose a new programming language, BF++, designed specifically for automatic programming of agents in a Partially Observable Markov Decision Process (POMDP) setting, and apply neural program synthesis to solve standard OpenAI Gym benchmarks. Source code is available at https://github.com/vadim0x60/cibi

Keywords: Reinforcement Learning · Program Synthesis · Programming Languages

1 Introduction
Reinforcement Learning (RL) has been successfully used to beat the best human players in games like Go [1] and Dota 2 [2] as well as to solve complex real-world tasks like controlling robots [3], optimizing chemical reactions [4] and management of traffic lights [5]. RL methods have interesting potential for many applications in healthcare [6]. The use of pure data-driven RL methods in healthcare faces challenges similar to those of other safety-critical domains such as autonomous driving and robotics. First, we should be able to initialize the system using expert knowledge for an acceptable baseline performance; this information is difficult to learn from the data, or to ingest into the parameters of a neural model. Secondly, explainability, often lacking in black-box models, is required for acceptability in clinical use cases.

In this work we focus on an alternative approach to RL based on program induction, known as Programmatically Interpretable Reinforcement Learning [7]. We introduce BF++, a new programming language tailor-made for this approach (section 4.1). We then demonstrate that neural program synthesis with BF++ can solve arbitrary reinforcement learning challenges and gives us an avenue for knowledge sharing between domain experts and data-driven systems via the mechanism of expert inspiration (section 5.5) and case studies of successful programs (section 6.2).

2 Background

In this paper we define a Reinforcement Learning environment as a Partially Observable Markov Decision Process [8, 9]: when at step $i$ the agent takes action $a_i \in A$, it has an impact on the state of the environment $s_i \in S$ via the distribution $p_s(s_{i+1} | s_i, a_i)$ of conditional probabilities of possible subsequent states. State is a latent variable that the agent cannot observe. Instead, the agent can see an observation $o_i \in O$, a random variable that depends on the latent state via the distribution $p_o(o_i | s_i, a_i)$. $A$, $S$ and $O$ are the sets of all possible actions, states and observations respectively. Finally, at every step the agent observes a reward $r_i = R(s_i, a_i)$.

(Safety requirements in healthcare are the main motivation for our research. However, in this paper we use conventional OpenAI Gym benchmarks to enable comparison between methods.)

Given this limited toolset, without full (or any) prior knowledge of how the agent's actions influence the environment (distributions $p_s(s_{i+1} | s_i, a_i)$ and $p_o(o_i | s_i, a_i)$), the agent has to come up with a strategy that will maximize the $n$-step return

$$R_n = \sum_{t=i}^{n} r_t \quad (1)$$

where $n$ is the agent's planning horizon. It is, in the general sense, a hyperparameter; however, if an environment has a limit on how many steps an episode can last, it is reasonable to set $n$ equal to the step limit.

Conventional solutions [10] introduce a parametrized policy function $\pi_\phi(a|s)$ that defines the agent's behavior as a probability distribution over actions and/or a function $Q_\phi(a|s)$ that defines what $R_n$ the agent expects to receive if they take action $a$. Parameters $\phi$ are learned empirically, using gradient descent or evolutionary methods [11, 12].
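For concreteness, here is a minimal sketch of this interaction loop and the return of eq. (1), written against the classic (pre-0.26) OpenAI Gym API used by the benchmarks later in the paper; the random action stands in for a policy $\pi(a|s)$ and all names are ours:

import gym

def n_step_return(env_name="CartPole-v1", n=500):
    """Run one episode and accumulate the n-step return R_n = sum of rewards."""
    env = gym.make(env_name)
    obs = env.reset()          # o_i: the agent sees observations, not states
    R_n, done = 0.0, False
    for _ in range(n):         # n: planning horizon / episode step limit
        action = env.action_space.sample()           # placeholder for pi(a|o)
        obs, reward, done, info = env.step(action)   # env samples s_{i+1}, o_{i+1}
        R_n += reward          # r_i = R(s_i, a_i)
        if done:
            break
    return R_n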
This approach has been applied extensively and with great success [13] in Partially Observable Markov Decision Process (POMDP) settings; however, it does have major limitations:

1. The agent is defined as stateless. As such, when making a decision $a_i$ the agent is unable to take into account any observations it made prior to step $i$. Long-term dependencies like "this patient should not receive this drug since she has shown signs of allergy when this drug was administered to her 17 iterations ago" cannot be captured by a memoryless model.

2. The agent is represented as a set of model weights $\phi$, often with millions of parameters. Such a program can be used as a black box decision system, but domain experts cannot review it and/or make their contributions to the agent's programming.

In this paper, we address these limitations by representing an RL agent with a program in a specialized language, to be introduced in section 4.1, as opposed to $\pi_\phi$ and $Q_\phi$.

3 Related work

Despite Program Synthesis being one of the most challenging tasks in Computer Science, many solutions exist, see [14]. They can be roughly classified by what kind of data they make use of.

One can leverage large datasets of code snippets annotated with natural language like CoNaLa [15] and treat program synthesis as a machine translation task [16].

Given a dataset of program inputs and expected outputs, one can search for programs that satisfy the given examples [17, 18] using techniques like neural-guided program search [19]. One can also generate input-output pairs artificially [20].

Models like Neural Turing Machines [21], Memory Networks [22] and Neural Random Access Machines [23] are also trained with input-output pairs, and even though they don't explicitly generate code, they fit the definition of a program.

One can use one's own domain knowledge to write a sketch of the necessary program in one of several specialized programming languages [24] that let the developer leave out certain parameters to be selected via machine learning.

But in this work our goal is to synthesize programs with no training data - only an environment where the program can be tested - and this seems to be an underexplored area. The task of general RL has been introduced, e.g., in [7, 25], but the model in [7] uses sketches, and the system introduced by Abolafia et al. [25] only supported non-interactive programs. That is, in their model, a program is a function from an input string to an output string, while the POMDP setting is much more general.

4 BF++

Abolafia et al. [25] picked BF [26] as their language for program synthesis for the following reasons:

• In industry-grade programming languages like Python or Java, program code can contain a very large variety of characters, since any of the 143859 Unicode [27] characters can be used in string literals. In BF, however, only 8 characters can be used: they can be one-hot-encoded with vectors of size 8.

• BF's simple syntax means that an arbitrary string of valid characters is likely to be a valid program. In more complex languages, most possible strings result in a syntax error. A generative model being trained to write programs in such a language risks being stuck in a long exploration phase when all the programs it generates are invalid and it has no positive examples in the dataset.

• Despite all of the above, it is a Turing-complete language.

(BF is also known as Brainfuck.)

The simplicity of the language also means that it is relatively easy to develop a compiler that translates programs from industry-standard programming languages like Java and Python to BF, thus making use of the expert knowledge existing in those languages.

In the current paper, we introduce an extended version of the original BF language, BF++.
As explained below, the extensions to the original BF syntax are particularly useful in the RL use cases.

BF's runtime model is inspired by the classic Turing Machine [28]: at any point during the program's execution, the state of the program consists of:

• An infinite tape of cells $T$ where each cell holds an integer number.

• A memory pointer $p_T$ that points to a certain cell in the tape (the active cell $T_{p_T}$).

• A string of characters $C$ that represents program code.

• A code pointer $p_C$ pointing to a character about to be executed.

The code pointer starts at the first character, then this character gets executed and the pointer is incremented (moved to the next character). There are 8 possible characters:

>  Move the memory pointer one cell right. $p_T := p_T + 1$
<  Move the memory pointer one cell left. $p_T := p_T - 1$
+  Increment the active cell. $T_{p_T} := T_{p_T} + 1$
-  Decrement the active cell. $T_{p_T} := T_{p_T} - 1$
.  Write $T_{p_T}$ from the active cell to the output stream.
,  Read $x$ from the input stream to the active cell. $T_{p_T} := x$
[  If the active cell $T_{p_T} = 0$, jump (move $p_C$) to the matching ].
]  If the active cell $T_{p_T} \neq 0$, jump (move $p_C$) to the matching [.

The [ and ] commands constitute a loop that will be executed repeatedly until the active cell becomes zero. They are also the only way to write a BF program with a syntax error: a valid BF program is one that doesn't contain non-matching [ or ].
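For concreteness, a minimal sketch of a classic BF interpreter implementing these eight commands (our illustration, not the paper's cibi implementation; read and write are caller-supplied streams):

def run_bf(code, read, write, steps=10_000):
    """Minimal classic BF interpreter: T is the tape, pT/pC the two pointers."""
    # Precompute matching-bracket positions for [ and ]
    match, stack = {}, []
    for i, ch in enumerate(code):
        if ch == '[':
            stack.append(i)
        elif ch == ']':
            j = stack.pop()
            match[i], match[j] = j, i
    T, pT, pC = {}, 0, 0  # a dict emulates the infinite integer tape
    for _ in range(steps):
        if pC >= len(code):
            break
        ch = code[pC]
        if ch == '>': pT += 1
        elif ch == '<': pT -= 1
        elif ch == '+': T[pT] = T.get(pT, 0) + 1
        elif ch == '-': T[pT] = T.get(pT, 0) - 1
        elif ch == '.': write(T.get(pT, 0))
        elif ch == ',': T[pT] = read()
        elif ch == '[' and T.get(pT, 0) == 0: pC = match[pC]
        elif ch == ']' and T.get(pT, 0) != 0: pC = match[pC]
        pC += 1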
In BF, memory cells $T_i$ hold non-negative values only. In BF++, $T_i \in \mathbb{Z}$, a negation operator ~ is introduced, and the operators [ ] are redefined to loop while the active cell is non-positive, i.e.

~  Negate the active cell. $T_{p_T} := -T_{p_T}$
[  If the active cell $T_{p_T} \geq 0$, jump (move $p_C$) to the matching ].
]  If the active cell $T_{p_T} < 0$, jump (move $p_C$) to the matching [.

(If you happen to be executing a BF program on a computer with finite memory, the tape will be finite due to your hardware limitations. The definition of input and output streams is purposefully under-specified; it may depend on the particular implementation.)
This decision was taken because negative observations are common in control problems (see section 5), as is branching on whether the observed value is positive or negative.

The main issue of BF as a language for Reinforcement Learning is its input-output system. It assumes that the program can freely decide on the relative frequency of inputs to outputs. For example, the following program

+[,,,,,.]

inputs 5 integers, outputs the 5th character it read, then goes back to the beginning and proceeds indefinitely, outputting every 5th character it inputs. Thus it assumes a 5:1 frequency of inputs to outputs. If we simply assume that inputs are observations and outputs are actions, such a program will not be able to operate in a POMDP environment where I/O frequency is fixed at 1:1 and the agent that has made an observation has to act before it can make the next observation. In other words, operators . and , are blocking: . stops program execution and waits until new input is received to resume execution, , stops program execution and waits until there is an opportunity to act in the environment.

To address this, in BF++ the . operator is non-blocking. It outputs the current value of the active cell by placing it at the bottom of the action queue $S$ - a sequence of integer numbers that represent actions the program is planning to take in the environment. We also introduce a non-blocking operator ! that places $T_{p_T}$ on top of the action queue:

$$. \quad S := S \frown (T_{p_T}) \qquad\qquad ! \quad S := (T_{p_T}) \frown S \quad (2)$$

where $\frown$ denotes concatenation of tuples.

The program can thus decide, by using . or !, whether the newly added action takes precedence over the ones already in the queue. As soon as an opportunity to act arises, the top of the action queue (item $S_1$, or several items $S_1, S_2, \ldots$, see section 5.2) defines which action the program takes and is then removed from the queue. If $S_k$ does not exist (the queue is empty or shorter than $k$), a default value of $S_k = 0$ is used.

The , operator, on the other hand, is blocking. Thus its function is more important than just reading an observation into memory: executing , is when the program moves to the next step of the POMDP.
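The queue semantics of eq. (2) and the default-to-zero rule can be sketched as follows (our illustration; names are ours):

from collections import deque

class ActionQueue:
    """Action queue S: '.' appends at the bottom, '!' pushes on top (eq. 2)."""
    def __init__(self):
        self.S = deque()

    def dot(self, value):        # .  S := S ⌢ (T_pT)
        self.S.append(value)

    def bang(self, value):       # !  S := (T_pT) ⌢ S
        self.S.appendleft(value)

    def pop_action(self, k=1):
        """Take the top k items as the next action; missing items default to 0."""
        action = [self.S.popleft() if self.S else 0 for _ in range(k)]
        return action if k > 1 else action[0]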
Naively implemented, a system where the only way to proceed to the following iteration is the , operator means that to be successful in any POMDP environment a program has to contain an infinite loop with a , operator in it. Any program that has a finite number of , steps will terminate prematurely in an environment that supports an arbitrarily large number of iterations. Since we originally set out to develop a language where most random programs would be valid, this had to be addressed. We decided to turn any BF++ program into an infinite loop with a , operator by default:

1. Every BF++ program starts with a virtual , operator at address $p_C = -1$: it is executed before all operators in the code of the program, which are indexed starting from $p_C = 0$.

2. When the code pointer $p_C$ reaches the end of the program, it loops back to the virtual comma ($p_C := -1$), turning every program into an endless cycle of observation / decision-making.

Another issue complicating applications of BF to Reinforcement Learning is that since its memory tape holds only integer numbers, its inputs and outputs have to be integer as well. And this issue cannot be fixed simply by replacing the integer tape with a tape of floating point numbers, as BF's only operations for manipulating numbers are + and - - increment and decrement. Non-integer action and observation spaces are fairly common in reinforcement learning tasks, hence BF++ implements coercion mechanisms for reading and writing continuous vectors into discrete memory.

We assume that the vector observation space $O$ is a hypercube defined as an intersection of $n$ separate scalar observation spaces $O^k$ such that

$$o_1 \in O^1, o_2 \in O^2, \ldots, o_n \in O^n \Leftrightarrow (o_1, o_2, \ldots, o_n) \in O \quad (3)$$

This assumption theoretically excludes some possible observation spaces, but almost all POMDP tasks discussed in the research literature and all OpenAI Gym tasks conform to it.

To write an observation onto the memory tape, the observation vector of size $n$ is aligned with memory cells $T_{p_T}, T_{p_T+1}, \ldots, T_{p_T+n-1}$ and turned into integers with the use of $d$ discretization bins delimited by thresholds $\tau^k_1 \le \cdots \le \tau^k_d$:

$$T_{p_T+k-1} := \min_{\omega \in 0,\ldots,d-1 \,:\, o_k < \tau^k_{\omega+1}} \omega \quad (4)$$

If $O^k$ is an interval $O^k = [o_{low}, o_{high}]$, it is split into discretization bins evenly, as in eq. 5:

$$\tau_\omega = \begin{cases} o_{low} + \frac{o_{high} - o_{low}}{d}\,\omega, & \omega = 1, 2, \ldots, d-1 \\ +\infty, & \omega = d \end{cases} \quad (5)$$

Some environments, however, have unbounded observation spaces: $O^k = (-\infty; +\infty)$, $O^k = (-\infty; o_{high}]$, $O^k = [o_{low}; +\infty)$. These spaces are challenging because the formal description of $O^k$ does not in any way reflect the actual underlying distribution of observations. It can be the case, for example, that $O^k = (-\infty; +\infty)$ but most observations found in the environment fall in the interval [42; 43]. For such observation spaces, BF++ uses a fluid discretization system that learns the true distribution of observations online. The idea was inspired by the work of Touati et al. [29], although their assumptions about $O^k$ differ from ours. Thresholds $\tau_\omega$ can be arbitrary: with each new observation, thresholds $\tau_\omega$ are readjusted so that among the $h$ prior observations roughly a fraction $\omega/d$ are lower than $\tau_\omega$:

$$\min_\tau \sum_{\omega \in 1, 2, \ldots, d} \left| \frac{\omega}{d} - \frac{\sum_{i' \in i-h, i-h+1, \ldots, i-1} I(o_{k i'} < \tau_\omega)}{h} \right| \quad (6)$$

To solve this optimization problem, one has to sort the previous $h$ observations in ascending order, so that

$$\text{sort}: \{o_{i'} \mid i' \in i-h, i-h+1, \ldots, i-1\} \longrightarrow \{s_j \mid j \in 1, 2, \ldots, h\} \quad (7)$$

is a bijection such that $s_1 < s_2 < \cdots < s_h$ holds, and set

$$\tau_\omega = s_{\lceil \omega h / d \rceil} \quad (8)$$

See figure 1 for a visual example.

This system has 2 hyperparameters: $d$ and $h$. With a low $d$, a lot of the information observed from the environment is lost, while when $d$ is in the hundreds, the generated programs can become very complex. $h$ switches between relative and absolute observations: with a very high $h$, thresholds reflect the distribution of observations over the whole history, while a low $h$ makes them track the most recent observations. Either way, the first $h$ iterations present an additional challenge: how to correctly discretize observations before a full history is available? We implemented burn-in: before training or evaluation we run $h$ iterations of a random agent (see section 4.8) to collect a history of $h$ observations and pick correct thresholds.
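A sketch of this fluid discretization, eqs. (4) and (6)-(8), in Python (our illustrative code, with the d and h defaults used later in the experiments; names are ours):

import math
from collections import deque

class FluidDiscretizer:
    """Online quantile discretization: bin boundaries track the empirical
    distribution of the last h observations (eqs. 6-8)."""
    def __init__(self, d=5, h=500):
        self.d, self.history = d, deque(maxlen=h)

    def observe(self, o):
        self.history.append(o)
        s = sorted(self.history)              # eq. (7): ascending sort
        h = len(s)
        # eq. (8): tau_w = s_ceil(w*h/d) for w = 1..d-1, and tau_d = +inf
        tau = [s[min(math.ceil(w * h / self.d), h) - 1] for w in range(1, self.d)]
        tau.append(math.inf)
        # eq. (4): smallest bin whose upper threshold exceeds o
        for w, t in enumerate(tau):
            if o < t:
                return w
        return self.d - 1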
A symmetrical problem arises with the actions taken by the agent. The memory tape holds integer numbers $T_k \in \mathbb{Z}$ and any value can be pushed onto the action queue. However, the action that's output to the environment has to belong to an $N$-dimensional action space $A$, an intersection of unidimensional action spaces $A^k$. The "act" operation thus includes a coercion system and is defined as:

$$a_k := \begin{cases} S_k - \frac{d-1}{2}, & A^k = (-\infty; +\infty) \\ a_{min} + \left|S_k - \frac{d-1}{2} - a_{min}\right|, & A^k = [a_{min}; +\infty) \\ a_{max} - \left|a_{max} - S_k + \frac{d-1}{2}\right|, & A^k = (-\infty; a_{max}] \\ a_{min} + \frac{S_k \bmod d}{d-1}\,(a_{max} - a_{min}), & A^k = [a_{min}; a_{max}] \\ S_k, & A^k \subset \mathbb{Z} \end{cases} \qquad S := (S_{N+1}, S_{N+2}, \ldots) \quad (9)$$
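The bounded-interval case of eq. (9) is the one exercised by the MountainCarContinuous environment later in the paper; a minimal sketch of that case (our illustration of the formula as reconstructed above, not code from the paper):

def coerce_bounded(s_k, a_min, a_max, d=5):
    """Map an integer queue item s_k onto [a_min, a_max] via d bins,
    as in the bounded case of eq. (9)."""
    bin_index = s_k % d                       # bins 0 .. d-1
    return a_min + bin_index / (d - 1) * (a_max - a_min)

# Example: a torque space [-1, 1] with d = 5 bins:
# s_k = 0 -> -1.0, s_k = 2 -> 0.0, s_k = 4 -> 1.0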
It is notoriously hard to introduce any kind of branching behavior in BF [30]. To facilitate if-then style programs we introduce a goto operator ^ defined as

$$p_T := T_{p_T} \quad (10)$$

Note that it is not a goto in the traditional C sense, since the memory pointer is being moved, not the code pointer. Still, it lets the agent preemptively store potential actions in memory cells and then branch between these actions based on the observation.

Operator @ writes a random number into the active cell. A random agent is often used as a starting point for exploration, and in BF++ a random agent can be implemented as @!

With all the commands we introduced in sections 4.1 - 4.7 it is still surprisingly hard to encode relatively simple decisions like "add action 5 to the top of the action queue":

[>]+++++!

This program moves the memory pointer right until it hits a cell that contains zero, increments it five times, and then pushes $T_{p_T}$ to the top of the action queue. It also loses the current value of the memory pointer, which might be meaningful. Our experiments have shown that it takes a very long time for the neural model to learn to write this kind of combination.

To mitigate this issue we introduce shorthands: commands 01234 mean "write the respective number (0, 1, 2, 3 or 4) into the cell" and commands abcde mean "move the memory pointer to cell a, b, c, d or e", where cells a, b, c, d and e are the first 5 cells of the memory tape. We intentionally made the number of shorthands equal to the discretization constant d = 5. Due to our method of discretization of continuous action spaces (see sections 4.5, 4.6), the program will often encounter situations where it can choose between d different actions, and thanks to shorthands taking them can be encoded as 0!, 1!, ..., 4!

In total (assuming 5 shorthands) BF++ has 22 commands:

><^@+~-[].,!01234abcde

Commands @^~01234abcde are considered optional and can be disabled if the task at hand calls for it. The number of shorthand commands can be increased or decreased.

Observation discretization and action coercion techniques built into the language mean that BF++ is compatible with any POMDP environment. However, in practice, there is one important limitation: the complexity of the program required to operate in an environment is directly proportional to the dimensionality of its action and observation spaces $A$ and $O$. If, for example, the observation space is 10000-dimensional, once an observation is read onto tape $T$, it takes 9999 > operators to reach the second to last observation. Thus, in practice, BF++ should be used with low-dimensional POMDPs.

An extension of our methodology to high-dimensional POMDPs (such as Atari games [31], where the observation is a matrix of pixels on a simulated game screen) can be achieved by adding a scene encoder neural network that maps the observed image to a low-dimensional vector, as proposed in [32].

5 Experimental setup

Our experiments were designed to test the following hypotheses:

H1 BF++ can be used in conjunction with a program synthesis algorithm to solve arbitrary reinforcement learning challenges (POMDPs)

H2 BF++ can be used to take a program written by an expert and use program synthesis to automatically improve it

H3 BF++ can be used to generate interpretable solutions to Reinforcement Learning challenges that experts can learn from

H4 Optional commands @^~01234abcde introduced for convenience make it easier for experts to write programs in BF++

H5 Optional commands @^~01234abcde improve the quality of programs synthesised by neural models

Hence we

1. Pick several commonly studied reinforcement learning environments

2. Employ an expert (the first author of this paper) to write BF++ programs to solve them

3. Develop a program synthesis model following [25] (a sketch of how programs are encoded for such a model follows this list)

4. Compare the best programs generated by the model with expert programs in terms of program quality

5. Perform ablation studies: remove some of the optional commands from the language (the resulting language is called BF+), remove the expert program from the model's program pool, compare program quality

6. Perform case studies: analyze programs generated by the model to gain insight into how the model approached the problem
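The synthesis model of step 3 generates and consumes programs over the 22-command alphabet; precisely because the alphabet is small, programs can be one-hot encoded for a neural model, as in the following minimal sketch (our illustration, not code from the paper):

import numpy as np

ALPHABET = "><^@+~-[].,!01234abcde"  # the 22 BF++ commands
CHAR_TO_ID = {c: i for i, c in enumerate(ALPHABET)}

def one_hot(program: str) -> np.ndarray:
    """Encode a BF++ program as a (length, 22) one-hot matrix for an LSTM."""
    encoding = np.zeros((len(program), len(ALPHABET)))
    for t, char in enumerate(program):
        encoding[t, CHAR_TO_ID[char]] = 1.0
    return encoding

# Example: one_hot(">!a").shape == (3, 22)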
We evaluate our framework on 4 low-dimensional (see section 4.10) POMDPs sampled from the OpenAI Gym [33] leaderboard (https://github.com/openai/gym/wiki/Leaderboard):

1. CartPole-v1 [34]. A pole is attached to a cart which moves along a frictionless track. The agent observes cart position, cart velocity, pole angle and pole velocity at tip. The goal is to keep the pole upright by applying force between -1 and 1 to the cart. At every step the agent receives a +1 reward.

2. MountainCarContinuous-v0 [35]. A car is on a one-dimensional track, positioned between two "mountains". The goal is to drive up the mountain consuming a minimal amount of fuel by controlling the engine, setting its torque in the range [-1; 1]; however, the engine is not strong enough to scale the mountain in a single pass. Therefore, the only way to succeed is to drive back and forth to build up momentum. We picked MountainCarContinuous-v0 as opposed to MountainCar-v0 to demonstrate the performance of our discretization system.

3. Taxi-v3 [36]. There are 4 locations (labeled by different letters) and the goal is to pick up the passenger at one location and drop him off at another in as few timesteps as possible, spending as little fuel as possible.

4. BipedalWalker-v2. A simulated 2D robot with legs has to learn how to walk. Moving rightwards is rewarded, falling is penalized. The observation vector consists of speeds, angular speeds and joint positions collected by the robot's sensors. These observations do not, however, include any global coordinates - they can only be inferred from sensor inputs. With an action vector of size 4, the agent controls the speeds of the robot's hip and knee motors.
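To ground the setup, here is a sketch of how a candidate program, wrapped as an agent, could be scored on one of these environments by averaging total reward over episodes as done later in this section (our illustration, assuming the classic pre-0.26 Gym API and a hypothetical agent object with observe/act methods):

import gym

def evaluate(agent, env_name="MountainCarContinuous-v0", episodes=100):
    """Average total reward over several episodes (one test is high-variance)."""
    env = gym.make(env_name)
    total = 0.0
    for _ in range(episodes):
        obs, done = env.reset(), False
        while not done:
            agent.observe(obs)                            # discretize onto tape
            obs, reward, done, _ = env.step(agent.act())  # pop the action queue
            total += reward
    return total / episodes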
For observation discretization (section 4.5) we picked d = 5 and h = 500 for our experiments, hence when an observation is among the highest 20% of the last 500 observations it is written into memory as 4, while if it falls between the 40th and 60th percentiles it is written as 2.
For CartPole we wrote 2 programs. One completely ignores all observations and just alternates between "move right" and "move left".
Another calculates the difference between the velocity of the cart and the angular velocity of the pole. If it's positive, the cart is pushed to the right (the cart has to catch up with the pole), if it's negative the cart is pushed to the left, and if it's zero it is pushed randomly:

[a0>0>0>0>0>@>1>1>1>1>1>,>[->>-<<]>>+++++^!1]

(Figure 2: Selected environments, visualized. (a) CartPole-v1 (b) MountainCarContinuous-v0 (c) Taxi-v3 (d) BipedalWalker-v2)

The first part of this program sets up an action map on the tape where every possible value of the velocity differential has a respective cell with 0, 1 or (in the center) a random number. Then the [->>-<<] block does subtraction, +++++ adds 5 to the result so that it belongs to 0..10 and not -5..5, ^ moves the memory pointer to the correct cell in the action map, and ! puts the action onto the action queue.

For Mountain Car we wrote an elegant algorithm that reads the observation vector into the tape, goes to the second observation (car velocity) and outputs it as action:

>!a
In other words, we apply motor torque in the same direction we're currently headed, thus always accelerating the car. If we're headed right, that helps us get to the destination, and if we're headed left, that helps us get as high as possible onto the hill, so that when the direction reverses, the car has more energy to push through the right hill.

For Taxi we introduce 2 programs. The first program:

1. Finds the coordinates of the current destination (passenger to pick up or current passenger's destination)

2. Subtracts the taxi's current position

3. Moves in the resulting direction

The problem with this approach is that it always gets stuck when it hits a wall. To compensate for that, the second program alternates between the strategy above (for 5 iterations) and random movements (for 5 iterations) so that it eventually gets unstuck. See the source code repository for the programs.

Optional commands @^~01234abcde have all been invaluable in developing these programs - a fact in support of H4. A more rigorous way to confirm it would be employing several human experts to develop programs with and without optional operators, but finding volunteer BF++ developers has proven difficult.

Developing programs for Bipedal Walker is, unfortunately, above our expert's paygrade.
In order to train a generative model $g$ to write BF++ programs, we treat the writing process as a reinforcement learning episode in its own right [25]. Every character of a program is an action taken by the writer agent; programs are terminated by a NULL character. When the NULL character is written, a BF++ agent is created in the target POMDP environment (e.g. CartPole) and the sum total of rewards $Q$ collected in that episode is assigned as a reward to the writer agent for the NULL character. All other characters are rewarded with zero.

The writer agent's policy is modeled with an LSTM [37] neural network and is trained with a modified version of the REINFORCE [38] algorithm. While standard REINFORCE optimizes the Policy Gradient objective

$$O_{PG}(\phi) = E_{\pi(C; \phi)}(Q) \quad (11)$$

where $\phi$ are the LSTM parameters, $C$ the program, and $Q$ the reward obtained by the program in the target environment, we optimize

$$O(\phi) = O_{PG}(\phi) + O_{PQT}(\phi) \quad (12)$$

where

$$O_{PQT}(\phi) = \frac{1}{K} \sum_{k=1}^{K} \log \pi(C_k; \phi) \quad (13)$$

where $C_1$ is the best (highest $Q$) known program, $C_2$ the second best, and so on.

Intuitively, both $O_{PG}(\phi)$ and $O_{PQT}(\phi)$, when optimized, update the weights of the LSTM so that programs that we have found to be successful become more likely. But Policy Gradient weighs programs proportionately to their respective rewards, while PQT creates a priority queue of the $K$ best known programs and assigns a high importance to them and zero to the rest.

The $O_{PQT}$ component has been shown to have "a stabilizing affect and helps reduce catastrophic forgetting in the policy" [25]. In addition to this, we use $O_{PQT}$ to implement expert inspiration. By default, the priority queue of the best known programs is initialized as an empty set. But if expert-written programs are available, it can be prepopulated with these programs, which act as useful positive examples for teaching the writer agent. This approach is used to incorporate the programs from section 5.4 and transfer knowledge from experts to the neural developer.

In all experiments below, the writer agent's LSTM has a hidden size of 50, a batch size of 4 and is trained with the RMSProp [39] optimizer.
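As an illustration of eqs. (11)-(13), a framework-agnostic sketch of the combined objective (our code, not the paper's TensorFlow implementation; log_prob is an assumed callable returning the differentiable log-likelihood of a whole program under the LSTM policy):

def pqt_loss(log_prob, sampled_programs, rewards, priority_queue):
    """Combined REINFORCE + Priority Queue Training objective (eqs. 11-13).

    sampled_programs, rewards: the current batch and its environment returns Q.
    priority_queue: the K best programs found so far (possibly expert-seeded).
    """
    # Policy gradient term: log-likelihood weighted by reward (eq. 11)
    o_pg = sum(q * log_prob(c) for c, q in zip(sampled_programs, rewards))
    # PQT term: plain log-likelihood of the K best programs (eq. 13)
    o_pqt = sum(log_prob(c) for c in priority_queue) / len(priority_queue)
    # Maximize O = O_PG + O_PQT, i.e. minimize its negation (eq. 12)
    return -(o_pg / len(sampled_programs) + o_pqt)

Expert inspiration then amounts to prepopulating priority_queue with the hand-written programs of section 5.4 before training begins.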
All experiments were run with an upper limit of 100000 training episodes. Environments other than Taxi also used the Exponential Variance Elimination [40] early stopping technique: training was stopped when the positive trend in the quality of the best found program stopped, i.e. when the exponential moving average of program quality was lower than it was 1000 episodes ago. Agents for Taxi are trained for a fixed number of episodes, because we noticed that in this environment the longest part of the training process is learning to pick up your first passenger, and until that happens $Q = -200$ holds.
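A simplified sketch of this stopping criterion (our illustration; the actual evestop [40] library may differ):

def should_stop(quality_history, lag=1000, alpha=0.01):
    """Stop when the exponential moving average of program quality
    is no better than it was `lag` episodes ago."""
    if len(quality_history) <= lag:
        return False
    ema, ema_lagged = 0.0, 0.0
    for i, q in enumerate(quality_history):
        ema = alpha * q + (1 - alpha) * ema
        if i == len(quality_history) - lag - 1:
            ema_lagged = ema      # EMA as it stood `lag` episodes ago
    return ema <= ema_lagged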
Once the training process is finished, we take the best known programs, and since each of them was only tested once (leading to high variance) we test them again, averaging total rewards over 100 episodes. We use this averaged reward to pick the best program.

The BF++ interpreter and the training system were written in Python with TensorFlow for the neural models. GPU resources weren't used, because the performance bottleneck of the system is not backpropagation but rather testing a BF++ program in the environment; single experiment runtime was between 1 hour (CartPole) and 10 hours (Taxi).

6 Results

Table 1 presents the quality metric (average 100-episode reward) of the best program in every category, compared to that of a fully random agent and, for context, the result required to join the OpenAI Gym leaderboard. Note that the expert programs used a lot of optional operators (shorthands and @^!), so it wasn't possible to implement expert inspiration with limited command sets.

Table 1: Q achieved by best programs found, averaged over 100 episodes

Environment                         CartPole-v1  MountainCarContinuous-v0  Taxi-v3  BipedalWalker-v2
Random agent                        9.3          0                         -200     -91.92
BF++ expert program 1               20.48        -6.55                     -179.49  -
BF++ expert program 2               18.23        -                         -150.44  -
BF+ (without shorthands) LSTM       44.55        91.57                     -57.93   -91.9
BF+ (without @^~) LSTM              48.14        81.16                     -42.21   -31.79
BF++ LSTM                           71.38        88.41                     -199.82  -26.97
BF++ LSTM with expert inspiration   96.64        91.39                     -60.65   -
Leaderboard threshold               195          90                        0        300

These results support (see section 5.1) hypothesis H1 - we have obtained functional programs for all environments, H2 - when expert inspiration was used, the resulting programs were better than the expert programs and better than programs generated without expert inspiration, and H5 - ablation studies for optional operators do indeed show that those operators are useful.

We have established that the program synthesis model is able to learn from human experts. But can experts learn from the model? (H3) To confirm this, we offer a detailed explanation of the most successful program of all the experiments listed in section 5.

This program achieved the highest score on Mountain Car:

-..~+

The trailing ~ and + do not affect the behavior of the agent: they modify the value of the active cell only for it to be immediately rewritten by the virtual comma (section 4.4) before it has any chance to influence actions. One can think about these commands as inactive genes in DNA - we have found many resulting programs to contain such commands. If necessary, this effect can be accounted for by incorporating program length into the loss function. So this program is equivalent to:

-..

(Figure 3: Visual summary of the strategy enacted by -.. on Mountain Car)

When the virtual comma is executed, car position and car velocity are read into memory, discretized into integers 0...4.
The position is read into the active memory cell $T_{p_T}$, while the velocity is in cell $T_{p_T+1}$. Then the active cell is decremented and the resulting number is put onto the action queue twice. There is 1 read operation and 2 write operations to the end of the action queue per iteration, which introduces a delay before the actions get executed. When it's time to act, the number on the action queue is coerced to one of the actions possible in this environment (0 for going left, 1 for doing nothing, 2 for going right).

A strategy emerges, illustrated in figure 3, in which the car puts "going right" onto the agenda if it's on the far left or the center right of the landscape, puts "going left" onto the agenda when it's on the far right or center left, and schedules doing nothing if it's in the center. This strategy helps the car successfully reach the right fringe every time it is applied.
7 Conclusions

In this paper, we have introduced a new programming language tailored to the task of programmatically interpretable reinforcement learning. We have shown experimentally that this language can facilitate program synthesis as well as knowledge transfer between expert-based systems and data-driven systems.

The results on the OpenAI Gym test examples show that the proposed system is able to find a functional solution to the problem. In some cases the performance is similar to the best deep learning solutions, but the obtained program still remains explainable. This is a very encouraging result and suggests that the use of program induction methods may indeed be a viable way towards explainable solutions in RL applications.

We propose the following directions for future work:

1. Develop translation mechanisms between BF++ and other languages. Potentially, BF++ can be used as a bytecode [41] for reinforcement learning. The expert would write a program in a higher-level language and transpile it into BF++ so that the program can then be improved with reinforcement learning.

2. Use other neural network architectures as well as non-neural evolution methods like genetic programming [42] in conjunction with BF++.
3. Apply the framework to problems in healthcare, where expert inspiration is important for crossing the AI chasm [43].

4. Use Natural Language Generation techniques to automatically translate BF++ code into a friendly human-readable text description, as in [44, 45].

Acknowledgements

This work was funded by the European Union's Horizon 2020 research and innovation programme under grant agreement n° 812882. This work is part of the "Personal Health Interfaces Leveraging HUman-MAchine Natural interactionS" (PhilHumans) project.

References

[1] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy Lillicrap, Fan Hui, Laurent Sifre, George Van Den Driessche, Thore Graepel, and Demis Hassabis. Mastering the game of Go without human knowledge. Nature, 550(7676):354-359, 2017. doi: 10.1038/nature24270.

[2] OpenAI, Christopher Berner, Greg Brockman, Brooke Chan, Vicki Cheung, Przemysław Dębiak, Christy Dennison, David Farhi, Quirin Fischer, Shariq Hashme, Chris Hesse, Rafal Józefowicz, Scott Gray, Catherine Olsson, Jakub Pachocki, Michael Petrov, Henrique Pondé de Oliveira Pinto, Jonathan Raiman, Tim Salimans, Jeremy Schlatter, Jonas Schneider, Szymon Sidor, Ilya Sutskever, Jie Tang, Filip Wolski, and Susan Zhang. Dota 2 with large scale deep reinforcement learning. arXiv preprint, 2019.

[3] Springer Tracts in Advanced Robotics, volume 97, pages 9-67. 2014.

[4] Zhenpeng Zhou, Xiaocheng Li, and Richard N. Zare. Optimizing chemical reactions with deep reinforcement learning. ACS Central Science, 3(12):1337-1344, 2017. doi: 10.1021/acscentsci.7b00492.

[5] I. Arel, C. Liu, T. Urbanik, and A. G. Kohls. Reinforcement learning-based multi-agent system for network traffic signal control. 2009.

[6] Chao Yu, Jiming Liu, and Shamim Nemati. Reinforcement learning in healthcare: A survey. arXiv preprint arXiv:1908.08796, 2019.

[7] Abhinav Verma, Vijayaraghavan Murali, Rishabh Singh, Pushmeet Kohli, and Swarat Chaudhuri. Programmatically interpretable reinforcement learning. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 5045-5054, Stockholmsmässan, Stockholm, Sweden, 2018. PMLR. URL http://proceedings.mlr.press/v80/verma18a.html.

[8] K. J. Åström. Optimal control of Markov processes with incomplete state information. Journal of Mathematical Analysis and Applications, 10(1):174-205, 1965.

[10] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. Second edition in progress, 2017.

[11] Seyed Sajad Mousavi, Michael Schukat, and Enda Howley. Deep reinforcement learning: An overview. In Lecture Notes in Networks and Systems, volume 16, pages 426-440. 2018.

[12] K. Arulkumaran, M. P. Deisenroth, M. Brundage, and A. A. Bharath. Deep reinforcement learning: A brief survey. IEEE Signal Processing Magazine, 34(6):26-38, 2017. doi: 10.1109/MSP.2017.2743240.

[13] Leslie Pack Kaelbling, Michael L. Littman, and Andrew W. Moore. Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4:237-285, 1996.

[14] Sumit Gulwani, Oleksandr Polozov, and Rishabh Singh. Program synthesis. Foundations and Trends in Programming Languages, 4(1-2):1-119, 2017.

[15] Pengcheng Yin, Bowen Deng, Edgar Chen, Bogdan Vasilescu, and Graham Neubig. Learning to mine aligned code and natural language pairs from Stack Overflow. In International Conference on Mining Software Repositories, MSR, pages 476-486. ACM, 2018.

[17] Oleksandr Polozov and Sumit Gulwani. FlashMeta: A framework for inductive program synthesis. In Proceedings of the 2015 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications, OOPSLA 2015, pages 107-126, New York, NY, USA, 2015. Association for Computing Machinery. ISBN 9781450336895.

[19] Microsoft Program Synthesis using Examples (PROSE): Impact, 2018. URL https://microsoft.github.io/prose/impact/.

[20] Richard Shin, Neel Kant, Kavi Gupta, Christopher Bender, Brandon Trabucco, Rishabh Singh, and Dawn Song. Synthetic datasets for neural program synthesis. Technical report, 2019.

[21] Wojciech Zaremba and Ilya Sutskever. Reinforcement learning neural Turing machines - revised. Technical report. URL https://github.com/ilyasu123/rlntm.

[22] Jason Weston, Sumit Chopra, and Antoine Bordes. Memory networks. 2015.

[23] Karol Kurach, Marcin Andrychowicz, and Ilya Sutskever. Neural random-access machines. 2016.

[24] Alexander L. Gaunt, Marc Brockschmidt, Rishabh Singh, Nate Kushman, Pushmeet Kohli, Jonathan Taylor, and Daniel Tarlow. TerpreT: A probabilistic programming language for program induction. Technical report.

[25] Daniel A. Abolafia, Mohammad Norouzi, Jonathan Shen, Rui Zhao, and Quoc V. Le. Neural program synthesis with priority queue training. 2018.

[26] Brainfuck, 1993. URL http://en.wikipedia.org/wiki/Brainfuck.

[27] Julie D. Allen, Deborah Anderson, Joe Becker, Richard Cook, Mark Davis, Peter Edberg, Michael Everson, Asmus Freytag, Laurentiu Iancu, Richard Ishida, et al. The Unicode standard. Mountain View, CA, 2012.

[28] A. M. Turing. On computable numbers, with an application to the Entscheidungsproblem. A correction. Proceedings of the London Mathematical Society, s2-43(1):544-546, 1938. doi: 10.1112/plms/s2-43.6.544.

[29] Ahmed Touati, Adrien Ali Taiga, and Marc G. Bellemare. Zooming for efficient model-free reinforcement learning in metric spaces. arXiv preprint arXiv:2003.04069, 2020.

[30] Mats Linander. Control flow in brainfuck. 2016. URL http://calmerthanyouare.org/.../control-flow-in-brainfuck.html.

[31] M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253-279, 2013.

[32] Daiki Kimura. DAQN: Deep auto-encoder and Q-network. arXiv preprint arXiv:1806.00630, 2018.

[33] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. Technical report.

[34] A. G. Barto, R. S. Sutton, and C. W. Anderson. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, SMC-13(5):834-846, 1983. doi: 10.1109/TSMC.1983.6313077.

[35] Andrew William Moore. Efficient memory-based learning for robot control. Technical report, 1990.

[36] Thomas G. Dietterich. Hierarchical reinforcement learning with the MAXQ value function decomposition. Journal of Artificial Intelligence Research, 13:227-303, 2000.

[37] Felix A. Gers, Jürgen Schmidhuber, and Fred Cummins. Learning to forget: Continual prediction with LSTM. 1999.

[38] Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229-256, 1992.

[39] Geoffrey Hinton, Nitish Srivastava, and Kevin Swersky. Neural networks for machine learning, lecture 6a: Overview of mini-batch gradient descent.

[40] vadim0x60/evestop: Early stopping with exponential variance elimination. URL https://github.com/vadim0x60/evestop.

[41] Bytecode. URL https://en.wikipedia.org/w/index.php?title=Bytecode.

[42] Riccardo Poli, William B. Langdon, and Nicholas F. McPhee. A Field Guide to Genetic Programming. Lulu.com, 2008.

[43] P. A. Keane and E. J. Topol. With an eye to AI and autonomous diagnosis. NPJ Digital Medicine, 1:40, 2018. doi: 10.1038/s41746-018-0048-y. [PubMed: 29618526].

[44] Kyle Richardson, Sina Zarrieß, and Jonas Kuhn. The code2text challenge: Text generation in source code libraries. CoRR, 2017.

[45] Alexander LeClair, Siyuan Jiang, and Collin McMillan. A neural model for generating natural language summaries of program subroutines. In ACM/IEEE 41st International Conference on Software Engineering (ICSE), pages 795-806, 2019.