Neurogenetic Programming Framework for Explainable Reinforcement Learning
Correspondence: [email protected]
Preprint, compiled February 9, 2021

Vadim Liventsev*, Aki Härmä, and Milan Petković
Eindhoven University of Technology
Philips Research Eindhoven

Abstract

Automatic programming, the task of generating computer programs compliant with a specification without a human developer, is usually tackled either via genetic programming methods based on mutation and recombination of programs, or via neural language models. We propose a novel method that combines both approaches using the concept of a virtual neuro-genetic programmer, or scrum team. We demonstrate its ability to provide performant and explainable solutions for various OpenAI Gym tasks, as well as to inject expert knowledge into the otherwise data-driven search for solutions. Source code is available at https://github.com/vadim0x60/cibi

Keywords: Reinforcement Learning · Program Synthesis · Genetic Programming

Introduction
Automatic programming is a discipline that studies the application of mathematical models to infer programs from data and the use of these generated programs to solve various tasks. For example: given a dataset of clinical decision-making, generate a program that describes how doctors make treatment decisions based on the clinical history available to them.

One of the primary motivations behind automatic programming is enabling an exchange of knowledge between human experts and machine learning models: black box models have achieved impressive results in diverse decision support settings [2], sometimes competing with human experts in the field. The advantage of automatic programming systems is that they can also cooperate with the experts:

• Code generation models can be trained to produce programs similar to what experts wrote, incorporating expert knowledge into the model
• Code generation models can generate new programs by applying modifications to expert-written programs, using them as the basis
• Experts can examine the generated programs, understand the algorithm suggested by the system and learn from it

The benefits of regularizing the program induction task, especially in limited data conditions, have been demonstrated earlier [3]. The programming language used in program induction is a regularizing constraint that may be expected to guide the learning algorithm to favor solutions with operations that are most natural for the given language [4]. For example, a strictly sequential language such as C seems most natural in cases where the expected solution is a sequential protocol. A parallel language, such as various cellular automata models or deep neural networks, favors concurrent solutions.

We may also approach this from the angle of algorithmic information theory. The Kolmogorov complexity of the solution corresponds to the length of the program and the number of free parameters [5]. In the case of a black box deep model, the Kolmogorov complexity is an architectural constant of the parallel network model, while in program induction it is optimized by the learning algorithm based on the available training data. When data is limited, program induction training may use Occam's razor ("We consider it a good principle to explain the phenomena by the simplest hypothesis possible" [6, book 3, chapter 2], a principle misattributed [7] to William of Ockham) to find a minimal-complexity solution matching the training data, while a standard neural network solution always has the same complexity. In fact, the use of program synthesis for solving a machine learning problem can be seen as generalized hyperparameter optimization: the program optimization may, in principle, converge to a program that implements a particular deep neural network optimized for the problem.

In genetic programming [8, 9] new programs are generated by mutating and mixing a population of programs. A more recent approach, largely drawing on the earlier success of deep neural language models (see CodeBERT [10], inspired by BERT [11]), has been to train black box neural models that generate executable programs as text [12, 13, 14]. Neural program synthesis and genetic programming both have unique advantages [15]. In this paper we propose a novel hybrid of the two families of methods (not to be confused with neuroevolution [1]: using evolutionary methods as an alternative to gradient descent for neural network training). We call the method Instant Scrum in reference to a popular Agile software teamwork model [16]. We show that Instant Scrum yields performant and explainable solutions for a range of OpenAI Gym tasks.

Background

Specification is central to the field of automatic programming: if no requirements on the generated programs are specified, the task becomes random program generation, which is not useful in most real-world settings beyond testing compilers [17, 18]. The field of automatic programming can be subdivided by what type of specification is used.

One way to specify the expected behavior of a program is a dataset of input-output pairs [13, 12]. A lot of work in this area has been inspired [19, 20, 21] by the task of learning a formula that fits a series of cells in spreadsheet applications [22].

Alternatively, the program's pre- and postcondition predicates can be described in a formal language: proof-theoretic synthesis [23] is the task of generating a program such that if its input conforms to the precondition, its output has to conform to the postcondition.

In the task of semantic parsing [24, 25, 26], the specification is a textual description of an algorithm that has to be translated from natural language into machine language.

Finally, the specification can manifest in the form of a reinforcement learning environment [27]: a program defines the behavior of an agent that interacts with its environment and receives positive and negative rewards from it. The goal is to find a program that maximises rewards.
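This last, environment-based form of specification can be made concrete with a small, self-contained sketch. The environment and names below (`GuessParityEnv`, `fitness`) are illustrative toys, not part of the paper's setup: a candidate program is any policy mapping observations to actions, and its fitness is the total episode reward.

```python
import random

class GuessParityEnv:
    """Toy episodic environment: the agent observes an integer and is
    rewarded for acting with its parity (0 = even, 1 = odd)."""

    def __init__(self, length=10, seed=0):
        self.rng = random.Random(seed)
        self.length = length

    def reset(self):
        self.t = 0
        self.obs = self.rng.randrange(100)
        return self.obs

    def step(self, action):
        reward = 1.0 if action == self.obs % 2 else -1.0
        self.t += 1
        done = self.t >= self.length
        self.obs = self.rng.randrange(100)
        return self.obs, reward, done

def fitness(program, env):
    """Total episode reward of `program`, a policy: observation -> action."""
    obs, total, done = env.reset(), 0.0, False
    while not done:
        obs, reward, done = env.step(program(obs))
        total += reward
    return total

# A candidate "program" that satisfies this specification perfectly:
print(fitness(lambda obs: obs % 2, GuessParityEnv()))  # 10.0
```

Any specification expressible as such a reward loop is compatible with this search setting, which is what motivates the choice below.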
In this work, we choose the RL specification because it generalizes the other methods: input-output pairs can be seen as an environment that negatively reinforces a metric of difference between expected and observed program output, while in proof-theoretic synthesis the reward is determined by the postcondition. Many real-world problems, such as robot control [28] and clinical decision making [29, 30], are formulated as reinforcement learning as well.

More concretely, we model the task as an Episodic Partially Observable Markov Decision Process:

M = (S_nt, S_t, A, O, p_o(o|s), p_s(s_next|s_prev, a), p_r(r|s, a), p_init(s))    (1)

Here, S_nt is the set of non-terminal (environment) states and S_t is the set of terminal states. A is the set of actions that the learning agent can perform, and O is the set of observations about the current state that the agent can make.

A reinforcement learning episode starts in a state s ∈ S_nt sampled from p_init, the initial distribution over environment states. An agent action a ∈ A at the state s causes the environment state to change probabilistically, and the destination state follows the distribution p_s(·|s, a). At state s, the probability of making observation o is p_o(o|s) and the probability of obtaining reward r is p_r(r|s, a). The process continues until p_s yields a terminal state s ∈ S_t. This episodic structure is what sets the episodic POMDP popularized by OpenAI Gym [31] apart from the more traditional approach [32, 33] where the process is infinite.

A program is a sequence of tokens

c = (c^(1), c^(2), ...)    (2)

that defines the behavior of an agent. Depending on the programming language and implementation choices, the tokens can be characters or higher-level tokens, e.g. keywords. We denote the language's alphabet, i.e. the set of all possible tokens, as L.

An interpreter is a tuple ⟨α, µ⟩ where µ(c, m_k, o_k) is the memorization function that defines how the agent's memory updates upon making an observation, and α(c, m_k) is the action function that defines which action the agent takes at a certain memory state in the POMDP. Memory is initialized at state m_init.

The agent's goal is maximizing the total reward collected in the environment, calculated as follows:
Algorithm 1 Evaluating total reward for a program

function Eval(c)
    R_tot ← 0
    m ← m_init
    s ∼ p_init(s)
    while s ∈ S_nt do
        ▷ Observe
        o ∼ p_o(o|s)
        m ← µ(c, m, o)
        ▷ Act
        a ← α(c, m)
        ▷ Get rewarded
        r ∼ p_r(r|s, a)
        R_tot ← R_tot + r
        ▷ Next state
        s_next ∼ p_s(s_next|s, a)
        s ← s_next
    end while
    return R_tot
end function

Since the algorithm for computing this function involves repeatedly sampling values from distributions, the function Eval(c) is a mapping from the set of programs to the set of real-valued random variables. Programmatically Interpretable Reinforcement Learning [27] is the task of maximizing expected total reward with respect to c, i.e. finding a program c that is best for a given POMDP environment:

E(Eval(c)) → max_c    (3)

Methodology

How does one manage a composition of code generators in such a way that the composition yields better programs than the individual contributors are capable of? This question is studied extensively in the software project management literature [34]. And while, admittedly, project management literature is concerned with human developers, and there exist considerable differences between human developers and mathematical models of code generation [35], we mitigate these differences with several simplifying assumptions.

Following traditional genetic programming, we define a population of programs. The codebase is a tuple of 2-tuples, each representing a program C^(i)_c = c and the total reward it collected, C^(i)_R = R_tot ∼ Eval(c) (see section 3.3):

C = ⟨⟨C^(1)_c, C^(1)_R⟩, ⟨C^(2)_c, C^(2)_R⟩, ..., ⟨C^(|C|)_c, C^(|C|)_R⟩⟩    (4)

However, unlike in traditional genetic programming, the initial population can (optionally) be empty.

A software developer can:

1. Check out programs from the codebase C
2. Output a new program c
3. Receive feedback on their program's quality q
4. Learn from the feedback by modifying its strategy

Thus, a developer is a 2-tuple of a program distribution p_dev(c|θ, C) and a parameter update procedure Update(θ, c, q). The distribution p_dev(c|θ, C) is defined over programs and is parametrized with learnable parameters θ as well as the codebase C. Having the codebase as a parameter enables the developer to generate new programs as a modification and/or combination of existing programs, i.e. to apply genetic programming. The learnable parameters θ encode the developer's current methodology of programming, which can be modified upon receipt of positive or negative feedback using the developer's update procedure.

The team of developers is a tuple of 2-tuples:

T = ⟨⟨T^(1)_p, T^(1)_upd⟩, ⟨T^(2)_p, T^(2)_upd⟩, ..., ⟨T^(|T|)_p, T^(|T|)_upd⟩⟩    (5)

We define two empirical metrics of overall program fitness. The first is empirical total reward:

R(c|C) = ( Σ_{i=1}^{|C|} I[C^(i)_c = c] C^(i)_R ) / ( Σ_{i=1}^{|C|} I[C^(i)_c = c] )    (6)

If a program has been tested in the environment (the Eval() function) several times, there will be several copies of it in the codebase with different quality samples. Averaging over them yields an unbiased estimate of the expectation from equation 3, E(Eval(c)), that we set out to maximize.

The second is empirical program quality, defined as

Q(c|C) = ( Σ_{i=1}^{|C|} I[C^(i)_c = c] e^{C^(i)_R} ) / ( Σ_{i=1}^{|C|} I[C^(i)_c = c] )    (7)

Empirical program quality is an unbiased estimate of E(e^{Eval(c)}). The idea behind exponentiating the total reward is to encourage exploration [36]. Programs that on average perform poorly, but sometimes, stochastically, collect high rewards, will have a higher Q(c|C) than R(c|C).
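Assuming the codebase is stored as a list of (program, total reward) pairs, the two estimates might be computed as follows (function names are ours, not from the paper's implementation):

```python
import math

def empirical_total_reward(c, codebase):
    """R(c|C), eq. 6: mean total reward over all evaluations of program c."""
    rewards = [R for program, R in codebase if program == c]
    return sum(rewards) / len(rewards)

def empirical_quality(c, codebase):
    """Q(c|C), eq. 7: mean exponentiated total reward. Always positive, and
    inflated by rare high-reward episodes, which encourages exploration."""
    rewards = [R for program, R in codebase if program == c]
    return sum(math.exp(R) for R in rewards) / len(rewards)

codebase = [("+>", 1.0), ("+>", 3.0), ("-<", 2.0)]
print(empirical_total_reward("+>", codebase))  # 2.0
print(empirical_quality("+>", codebase))       # (e^1 + e^3) / 2
```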
We consider these programs to be high-quality additions to the codebase because they contain the knowledge necessary for solving environment M, even if on average they do not solve it. We hypothesize that applying genetic operators (section 3.5) to programs with high Q(c|C) can yield programs with high R(c|C). For this reason we train developers to maximize Q, but when the training is complete, we pick the programs from the codebase with the highest R as "best programs".

Q(c|C) has an additional technical advantage over R(c|C): the invariant Q(c|C) > 0 holds. This lets one sample programs from the codebase with probabilities proportional to their quality, see eq. 19.

Just like instant run-off voting achieves similar results to exhaustive ballot runoff voting, but does it much faster by replacing a series of ballots cast in a series of elections with a single ballot that goes on to participate in a series of virtual elections [37], our Instant Scrum algorithm does the same to Scrum [16]: it simulates the iterative software development process recommended by the Scrum methodology without humans in the loop, making it possible to run many sprints per second:
Algorithm 2 Instant Scrum with a team of developers

procedure InstantScrum(T, C, N_max)
    N ← 0
    while N < N_max do
        ▷ For each developer in the team
        for i = 1, 2, ..., |T| do
            ▷ Sample a program from the developer
            c_new ∼ p_i(c|θ_i, C)
            ▷ Test the program
            R_tot ← Eval(c_new)
            ▷ Save the code and test result to the codebase
            C ← C ∪ {⟨c_new, R_tot⟩}
            ▷ Train the developer
            Update_i(θ_i, c_new, q)
            ▷ Increment sprint counter
            N ← N + 1
        end for
    end while
end procedure

Figure 1: Genetic operators by example. (a) 1-point and 2-point crossover [38]. (b) All operators applied to a pair of BF++ [39] programs, parent 1 "ae>>>>>34+" and parent 2 "a[e>-a-]b[e>>-b-]", yielding for example ">>4+>3>e>a" (shuffle mutation), "ae@>!>>35+" (uniform mutation), "ae>>>>-]b[e>>-b-]" and "ae>>-a-34+" (1-point and 2-point crossover), "aee>->>3b+" (uniform crossover), "ae>>>>e>-a-]b[e>>-b-]" (messy crossover) and "e>>>>>4+" (pruning).

To combine genetic programming and neural program synthesis we introduce 3 types of developers (genetic, neural, and dummy), create a team that contains developers of all types, and run
Instant Scrum.

A genetic developer writes programs by:

1. Selecting one of the 7 available stochastic genetic operators (described below)
2. Selecting two programs from the codebase C (parents c_1 and c_2)
3. Using the operator to modify the parents and yield a new (child) program

A genetic operator is a probability distribution p_op over child programs given 2 parent programs:

p_op(c_child | c_1, c_2)    (8)

Operators whose p_op is invariant to c_2 and depends on c_1 only are called mutation operators: they generate a new program by mutating one program, c_1. The rest are called combination operators, as they combine both parent programs.

The simplest method for randomly modifying a program is shuffle mutation: randomly re-order the tokens of c_1. Let A be the set of all possible permutations of size |c_1|, so that |A| = |c_1|!. Then

p_shuffle(c_child | c_1, c_2) = ( Σ_{α∈A} I[α(c_1) = c_child] ) / |c_1|!    (9)

Another approach is uniform mutation, where a loaded coin is tossed for every token in c_1. With probability p_ind the token is replaced with a random token from the alphabet L of the programming language; with probability 1 − p_ind it stays the same. The evolution of a single token under uniform mutation is defined by the distribution

p(c_new | c_old) = p_ind / |L| + (1 − p_ind) I[c_new = c_old]    (10)

Hence over full programs the operator is defined as

p_unimut(c_child | c_1, c_2) = I[|c_child| = |c_1|] Π_{i=1}^{|c_1|} ( p_ind / |L| + (1 − p_ind) I[c_child^(i) = c_1^(i)] )    (11)

The combination operators we propose are all variants of crossover, a classic genetic programming technique rooted in the way a pair of DNA molecules exchanges genes during mitosis and meiosis, shown in figure 1a. In DNA [38], as well as in most of the genetic programming literature [8, 9], the crossover operator combines 2 parent sequences to produce 2 children.
In this section, in order to reduce complexity, we define the distributions as if only the first child program is generated. Since ⟨c_1, c_2⟩ is equally likely to be selected for combination as ⟨c_2, c_1⟩ (see eq. 19), this modification does not affect the resulting genetic developer distribution.

In one-point crossover a random cut position k is selected and the trailing sections of the 2 parent programs beginning with the cut point are swapped with each other. If the parent programs have different lengths, the cut point has to fit within both programs:

|c_1, c_2| = min{|c_1|, |c_2|}    (12)

2 ≤ k ≤ |c_1, c_2|    (13)

Hence the probability of c_child being born out of one-point crossover is

p_1point(c_child | c_1, c_2) = ( I[|c_child| = |c_2|] / (|c_1, c_2| − 1) ) Σ_{k=2}^{|c_1,c_2|} Π_{i=1}^{k−1} I[c_child^(i) = c_1^(i)] Π_{i=k}^{|c_2|} I[c_child^(i) = c_2^(i)]    (14)

Two-point crossover is similar, but instead of swapping the trailing ends of the programs, a section in the middle of the programs is chosen, determined by randomly selected cut-off indices k_1 and k_2, and swapped:

p_2point(c_child | c_1, c_2) = ( I[|c_child| = |c_1|] / ((|c_1, c_2| − 1)(|c_1, c_2| − 2)) ) Σ_{k_1=2}^{|c_1,c_2|−1} Σ_{k_2=k_1+1}^{|c_1,c_2|} Π_{i=1}^{k_1−1} I[c_child^(i) = c_1^(i)] Π_{i=k_1}^{k_2−1} I[c_child^(i) = c_2^(i)] Π_{i=k_2}^{|c_1|} I[c_child^(i) = c_1^(i)]    (15)

Uniform crossover mirrors uniform mutation in that a loaded coin is tossed for each token in c_1. With probability p_ind the token is replaced, but the replacement is not drawn randomly from the alphabet. Instead, the replacement comes from c_2:

p_unicx(c_child | c_1, c_2) = I[|c_child| = |c_1|] Π_{i=1}^{|c_1|} ( p_ind I[c_child^(i) = c_2^(i)] + (1 − p_ind) I[c_child^(i) = c_1^(i)] )    (16)

Finally, messy crossover is a version of one-point crossover without the assumption that both parent programs have to be cut at the same index k.
In messy crossover, one parent is cut at index k_1, the other is cut at index k_2, and the head of one is attached to the tail of the other:

p_messy(c_child | c_1, c_2) = ( 1 / (|c_1, c_2| − 1)^2 ) Σ_{k_1=2}^{|c_1,c_2|} Σ_{k_2=2}^{|c_1,c_2|} I[|c_child| = k_1 + |c_2| − k_2] Π_{i=1}^{k_1−1} I[c_child^(i) = c_1^(i)] Π_{i=0}^{|c_2|−k_2} I[c_child^(k_1+i) = c_2^(k_2+i)]    (17)

After initial experiments we found that generated programs often contain sections of unreachable code, or code that makes changes to the execution state and then fully reverses them. To address this, we introduced an additional operator for removing dead code (pruning): when Instant Scrum encounters a successful program, pruning helps separate the sections of this program that led to its success from the sections that appeared in a highly-rated program by accident.

The implementation of the pruning operator depends on the programming language at hand. Here we define it as a pruning function c_pruned = Prune(c_1) that outputs a program functionally equivalent to c_1 (the action and memorization functions ⟨α, µ⟩ of c_pruned are equal to those of c_1) with |c_pruned| ≤ |c_1|, and a degenerate probability distribution:

p_prune(c_child | c_1, c_2) = 1 if c_child = Prune(c_1), 0 otherwise    (18)

Let P_genetic be the tuple of all available genetic operators, in order of introduction, e.g. P^(1)_genetic = p_shuffle.

The genetic developer's program distribution is a mixture distribution, combining the different operators that can be applied, weighted by learnable parameters, and the different programs that can be sampled from the codebase, weighted by empirical quality (eq. 7).
p_genetic(c | θ, C) = Σ_{c_1∈C} Σ_{c_2∈C} ( Q(c_1|C) Q(c_2|C) / (Σ_{c'∈C} Q(c'|C))^2 ) Σ_{i=1}^{|P_genetic|} θ_i P^(i)_genetic(c | c_1, c_2)    (19)

This is a true probability distribution if and only if Σ_{i=1}^{|P_genetic|} θ_i = 1.

One challenge that remains to be solved to fully define the genetic developer (following section 3.2) is to define a learning procedure Update_genetic. To do this, we notice that equation 19 contains a multi-armed bandit [40] hiding in plain sight. Indeed, once the genetic developer samples c_1 and c_2 from the codebase, it has to pick one of 7 available options (pull one of 7 levers) to then receive a reward Eval(c_child). This subproblem can be represented with a POMDP of its own and solved using one of the standard bandit algorithms [41].

Following Occam's razor, we picked the simplest method, epsilon-greedy optimization: we calculate the value of each operator as the mean total reward of programs generated with this operator:

V^(i) = ( 1 / |C(P^(i)_genetic)| ) Σ_{k=1}^{|C(P^(i)_genetic)|} C(P^(i)_genetic)^(k)_R    (20)

where C(P^(i)_genetic) is the subset of the codebase produced via operator P^(i)_genetic. The Update_genetic procedure recalculates the values V and sets the operator probabilities to

θ_i = ε / |P_genetic| + I[i = argmax_{i'} V^(i')] (1 − ε)    (21)

where ε is a hyperparameter responsible for regulating the exploration-exploitation tradeoff [42]. In future work, however, other bandit optimization algorithms can be used in its place.

The genetic developer, as described above, has 2 hyperparameters:

1. p_ind defines the severity of mutation in p_unimut and p_unicx
2. ε defines the learnability of the genetic operator distribution

Note that the team mechanism afforded by Instant Scrum can be used not only to combine genetic and neural program synthesis, but also to combine several genetic developers with different hyperparameters.
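A minimal sketch of the genetic developer's machinery, under the assumption of a string program representation: uniform mutation, one-point crossover, and the epsilon-greedy choice between operators. The alphabet, the two-operator subset, and all names are illustrative simplifications, not the paper's full BF++ operator set.

```python
import random

ALPHABET = list("+-<>[]")  # illustrative stand-in for the language alphabet L

def uniform_mutation(c1, c2, p_ind=0.1, rng=random):
    """Eq. 10-11: each token of c1 is replaced by a random symbol with
    probability p_ind; c2 is ignored (mutation operators depend on c1 only)."""
    return "".join(rng.choice(ALPHABET) if rng.random() < p_ind else t for t in c1)

def one_point_crossover(c1, c2, rng=random):
    """Eq. 12-14: pick a cut point that fits in both parents and glue the
    head of c1 to the tail of c2."""
    k = rng.randint(2, min(len(c1), len(c2)))
    return c1[:k - 1] + c2[k - 1:]

OPERATORS = [uniform_mutation, one_point_crossover]  # subset of the 7 operators

def pick_operator(values, counts, eps=0.1, rng=random):
    """Eq. 20-21: epsilon-greedy bandit over operators. `values` holds summed
    child rewards per operator, `counts` how often each operator was used."""
    if rng.random() < eps:
        return rng.randrange(len(OPERATORS))
    means = [v / max(n, 1) for v, n in zip(values, counts)]
    return max(range(len(OPERATORS)), key=means.__getitem__)

rng = random.Random(0)
child = one_point_crossover("ae>>>>>34+", "a[e>-a-]b[e>>-b-]", rng)
print(child)  # head of parent 1 glued to the tail of parent 2
```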
The neural developer, also known as the senior developer because of their unique ability to write original programs, is an LSTM [43] network followed by a linear layer that generates a sequence of vectors h_1, h_2, h_3, ..., where h_i ∈ R^{|L|+1} for all i, and the j-th element of vector h_i, denoted h^(j)_i, represents the probability of the i-th token of the program being the j-th token in the alphabet, p(c^(i) = L^(j)). The last element of the vector represents a special end-of-program symbol. This vector depends deterministically on the full set of neural network parameters (LSTM and linear layer) θ and can be represented as a function h_i(θ). Then

p_neural(c | θ, C) = h^{(|L|+1)}_{|c|+1}(θ) Π_{i=1}^{|c|} Σ_{j=1}^{|L|} I[c^(i) = L^(j)] h^(j)_i(θ)    (22)

For the Update_neural procedure we use the algorithm proposed in [12]. The subproblem of generating a program c is considered as a reinforcement learning episode of its own, where tokens are actions and token number |c| + 1 (the end-of-program token) is assigned reward q = e^R, R ∼ Eval(c). In this subenvironment h_i(θ) is the policy network [44, chapter 13] trained using the REINFORCE algorithm with Priority Queue Training. This algorithm involves a priority queue of best known programs: we implement it as the programs from C with the highest Q(c|C), which means that the neural developer can train on programs written by other developers.

h_i(θ) can also represent several stacked LSTM layers or a different type of recurrent neural network, e.g. a GRU [45]. Hyperparameters of this neural network, such as hidden state size and/or number of stacked layers, are hyperparameters of the neural developer.

The last developer we introduce is the simplest one:

p_dummy(c | C) = Q(c | C) / Σ_{c'∈C} Q(c' | C)    (23)

The dummy developer does not generate novel programs. Instead, it uses the same quality-weighted program sampling as in equation 19 to decide which existing program to copy.
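The dummy developer's quality-weighted copying (eq. 23), which is also how eq. 19 selects parent programs, can be sketched as follows; the aggregation helper and all names are illustrative:

```python
import math, random

def program_qualities(codebase):
    """Aggregate Q(c|C) (eq. 7) for every distinct program in the codebase."""
    totals = {}
    for program, reward in codebase:
        exp_sum, count = totals.get(program, (0.0, 0))
        totals[program] = (exp_sum + math.exp(reward), count + 1)
    return {p: s / n for p, (s, n) in totals.items()}

def dummy_developer(codebase, rng=random):
    """Eq. 23: copy an existing program with probability proportional to its
    quality Q, which is always positive thanks to exponentiation."""
    qualities = program_qualities(codebase)
    programs = list(qualities)
    return rng.choices(programs, weights=[qualities[p] for p in programs])[0]

rng = random.Random(0)
codebase = [("+>", 5.0), ("-<", -5.0)]
print(dummy_developer(codebase, rng))  # "+>" is chosen almost surely
```

Re-sampling high-quality programs this way is what drives their reward estimates toward the true expectation, as the next paragraph explains.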
Their utility may not be obvious at first, but note (section 3.3) that when the same program is added to the codebase several times, its total reward and quality estimates are averaged and grow more accurate. The dummy developer is thus a compromise between the speed at which Instant Scrum (algorithm 2) searches the program space and the quality of its working map of the program space, focusing on its most "interesting" (high Q(c|C)) parts. Without the dummy developer, all empirical total rewards R(c|C) would be low-quality estimates of the true fitness E[Eval(c)], and one spurious success of an otherwise bad program could steer the search in the wrong direction. On the other hand, we could test each program many times before adding it to the codebase, but that would slow down the search prohibitively.

Experimental setup

In the table below, we introduce 5 teams. Neural developers are denoted lstm(hidden state dimensionality), where several numbers mean a stacked LSTM. Genetic developers are denoted gen(p_ind, ε), see section 3.5.6. T_small and T_large are recommended configurations, while T_genetic and T_neural are ablation studies to prove that the combination of neural and genetic methods is useful. (Our open-source software implementation allows for drop-in replacement of bandit algorithms.)

The teams are assembled from the following developers: single-layer LSTMs lstm(10), lstm(50) and lstm(256), stacked LSTMs lstm(10,10), lstm(50,50) and lstm(256,256), four genetic developers gen(p_ind, ε) with different hyperparameter settings, and the dummy developer, which is a member of every listed team. T_genetic contains only genetic developers (plus dummy) and T_neural only neural developers (plus dummy).

Instant Scrum can be used to generate programs in any programming language provided:

1. An interpreter ⟨α, µ⟩, see section 2.1
2. A known finite alphabet L
3. A pruning function Prune(c)

The complexity of the chosen language is important, since in complex languages random perturbations of program source code often produce grammatically invalid programs. This issue has been addressed with structural models [46, 14] [9, chapter 4]; however, we sidestep the issue entirely by using BF++ [39], a simple language developed for programmatically interpretable reinforcement learning in which most random combinations of characters are valid programs. Each BF++ command is represented with a single character, thus the only way to tokenize a program is to let the tokens c^(1), c^(2), c^(3), ... be single characters.

Following [39], we synthesize programs for
the CartPole-v1 [47], MountainCarContinuous-v0 [48], Taxi-v3 [49] and BipedalWalker-v2 OpenAI Gym [31] environments, see figure 2.
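The interpreter requirement ⟨α, µ⟩ above can be illustrated with a drastically simplified BF-like agent language. This is not the actual BF++ semantics of [39]; the command set, cell count and action encoding are invented for illustration: µ writes the observation onto a memory tape, and α executes the tape commands and reads an action off the tape.

```python
N_CELLS, N_ACTIONS = 8, 2

def mu(program, memory, observation):
    """Memorization function: write the (discretized) observation to cell 0."""
    tape, pointer = memory
    tape = list(tape)  # keep the previous memory state intact
    tape[0] = int(observation)
    return tape, pointer

def alpha(program, memory):
    """Action function: execute the tape commands, then read the action
    from the cell under the pointer."""
    tape, pointer = list(memory[0]), memory[1]
    for command in program:
        if command == ">":
            pointer = (pointer + 1) % N_CELLS
        elif command == "<":
            pointer = (pointer - 1) % N_CELLS
        elif command == "+":
            tape[pointer] += 1
        elif command == "-":
            tape[pointer] -= 1
    return tape[pointer] % N_ACTIONS, (tape, pointer)

m_init = ([0] * N_CELLS, 0)
memory = mu("+", m_init, observation=3)
action, memory = alpha("+", memory)
print(action)  # (3 + 1) mod 2 = 0
```

A real BF++ interpreter additionally handles loops, dedicated observation/action commands and a bounded value range; the point here is only the ⟨α, µ⟩ factorization that Instant Scrum requires of any target language.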
Where possible, we run all experiments twice: a control experiment with an empty initial codebase, and an experiment where the codebase is pre-populated with human-written programs from [39]. Exceptions to this rule are:

• Teams T_genetic and T_pure, which only have code modification (not generation) capability and thus require initialization
• The BipedalWalker-v2 environment, because no programs for this environment were provided in [39]
For Taxi we set N_max to 100000 · |T| sprints, meaning every developer in the team trains for 100000 iterations. For the other tasks we used the Exponential Variance Elimination [50] early stopping algorithm to stop the process when a positive trend in Eval(c) has not been present for 10000 sprints. This approach rules out the hypothesis that Instant Scrum is equivalent to enumerative search and finds good programs by exhaustion as opposed to learning: if that were the case, early stopping would fire immediately. The Taxi environment is treated differently because programs that cannot pick up and drop off at least one passenger are always rewarded with -200, and at first it takes many iterations to synthesize at least one program that can. In addition to these stopping rules, a hard time limit was set.

After the process is stopped, we pick the 100 programs with the highest R(c|C) and make sure each of them has been tested at least 100 times; otherwise we run Eval(c) and add the result to the codebase until 100 samples are reached.

We implemented the framework with Python and TensorFlow, as well as DEAP [51] for the genetic operators. It is available at https://github.com/vadim0x60/cibi

Results
See table 1 for a summary of the best programs generated. The metric used, average R over 100 evaluations, is the same metric that is used on the OpenAI Gym leaderboard, so we include the threshold required to join the leaderboard for context. "Initial programs" refers to the best program in the codebase before
Instant Scrum starts, when it is pre-populated with programs from [39].

The main hypothesis of this paper is confirmed: the neurogenetic approach is superior to neural program induction or genetic programming taken separately. Besides, one unintuitive result of our experiments is that initializing the codebase with previously available programs can be harmful, see T_large. Overall, the best results were achieved without inspiration from human experts; however, such inspiration is very valuable for lightweight teams with few small (in terms of |θ|) developers.

Additionally, we can explore T_large to see which of its many developers actually produced the best programs:

Task               Init   R(c|C)    Developer
CartPole-v1               157.35    lstm(256)
Taxi-v3            yes    -150.44   human
BipedalWalker-v2          8.13      lstm(256,256)

The same is true for T_small:

Task               Init   R(c|C)    Developer
CartPole-v1               60.93     lstm(50,50)
CartPole-v1        yes              gen (via p_unicx)
Taxi-v3            yes    -150.44   human
BipedalWalker-v2          -0.15     lstm(50,50)

However, comparing the results for T_small versus T_neural proves that genetic developers have been instrumental to the quality of the generated programs.

Figure 2: Selected tasks, visualized. (a) CartPole-v1, (b) MountainCarContinuous-v0, (c) Taxi-v3, (d) BipedalWalker-v2.

Table 1: Best generated programs (average R over 100 evaluations)

Environment        CartPole-v1   MountainCarContinuous-v0   Taxi-v3    BipedalWalker-v2
Initial programs   20.48         -6.55                      -150.44
T_small
T_large                          -32.12                     -150.44
T_genetic          59.12         0                          -47.54
T_neural

Discussion

We have introduced a neurogenetic programming framework and demonstrated its efficacy and advantages over simpler program induction methods.

We believe that this framework can become a basis for many future methods: new methods of program synthesis can be built into the Instant Scrum framework as developers and combined with existing ones as necessary.
In particular, one type of developer currently absent from our experiments is a neural mutation operator: a neural network that modifies existing programs and can be trained to modify them in a way that improves their performance. Another important direction is applying the framework to more specialized tasks like robotics or healthcare decision support.

Acknowledgements

This work was funded by the European Union's Horizon 2020 research and innovation programme under grant agreement 812882. This work is part of the "Personal Health Interfaces Leveraging HUman-MAchine Natural interactionS" (PhilHumans) project.

References

[1] Dario Floreano, Peter Dürr, and Claudio Mattiussi. Neuroevolution: from architectures to learning. Evolutionary Intelligence, 1(1):47–62, 2008.
[2] Yuxi Li. Deep reinforcement learning: An overview. arXiv preprint arXiv:1701.07274, 2017.
[3] Jacob Devlin, Rudy Bunel, Rishabh Singh, Matthew J. Hausknecht, and Pushmeet Kohli. Neural program meta-induction.
CoRR , abs / http://arxiv.org/abs/1710.04157 .[4] Max Garzon. Models of Massive Parallelism: Analysis ofCellular Automata and Neural Networks . Springer Pub-lishing Company, Incorporated, 1st edition, 2012. ISBN3642779077.[5] Andrei N Kolmogorov. Three approaches to the quanti-tative definition ofinformation’.
Problems of informationtransmission , 1(1):1–7, 1965. [6] Claudius Ptolemaeus, Nicolaus Copernicus, Johannes Ke-pler, and Charles Glenn tr Wallis.
The almagest . Ency-clopaedia Britannica, 1952.[7] William M Thorburn. The myth of occam’s razor.
Mind ,27(107):345–353, 1918.[8] Wolfgang Banzhaf, Peter Nordin, Robert E Keller, andFrank D Francone.
Genetic programming: an introduction ,volume 1. Morgan Kaufmann Publishers San Francisco,1998.[9] John R Koza.
Genetic programming: on the programmingof computers by means of natural selection , volume 1. MITpress, 1992.[10] Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xi-aocheng Feng, Ming Gong, Linjun Shou, Bing Qin, TingLiu, Daxin Jiang, and Ming Zhou. Codebert: A pre-trainedmodel for programming and natural languages, 2020.[11] Jacob Devlin, Ming-Wei Chang, Kenton Lee, andKristina Toutanova. BERT: pre-training of deep bidirec-tional transformers for language understanding.
CoRR, abs/1810.04805, 2018. URL http://arxiv.org/abs/1810.04805.
[12] Daniel A. Abolafia, Mohammad Norouzi, Jonathan Shen, Rui Zhao, and Quoc V. Le. Neural program synthesis with priority queue training. arXiv preprint arXiv:1801.03526, 2018. URL http://arxiv.org/abs/1801.03526.
[13] Matej Balog, Alexander L. Gaunt, Marc Brockschmidt, Sebastian Nowozin, and Daniel Tarlow. DeepCoder: Learning to write programs. CoRR, abs/1611.01989, 2016. URL http://arxiv.org/abs/1611.01989.
[14] Uri Alon, Roy Sadaka, Omer Levy, and Eran Yahav. Structural language models of code. In International Conference on Machine Learning, pages 245–256. PMLR, 2020.
[15] Algorithm synthesis: Deep learning and genetic programming. http://iao.hfuu.edu.cn/blogs/33-algorithm-synthesis-deep-learning-and-genetic-programming. (Accessed on 02/…).
[16] Ken Schwaber and Mike Beedle. Agile software development with Scrum, volume 1. Prentice Hall, Upper Saddle River, 2002.
[17] Vsevolod Livinskii, Dmitry Babokin, and John Regehr. Random testing for C and C++ compilers with YARPGen. Proc. ACM Program. Lang., 4(OOPSLA), November 2020. doi: 10.1145/3428264.
[18] Gergö Barany. Liveness-driven random program generation. CoRR, abs/1709.04421, 2017. URL http://arxiv.org/abs/1709.04421.
[19] Oleksandr Polozov and Sumit Gulwani. FlashMeta: A framework for inductive program synthesis. In Proceedings of the 2015 ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications, pages 107–126, 2015.
[20] Jacob Devlin, Jonathan Uesato, Surya Bhupatiraju, Rishabh Singh, Abdel-rahman Mohamed, and Pushmeet Kohli. RobustFill: Neural program learning under noisy I/O. In International Conference on Machine Learning, pages 990–998. PMLR, 2017.
[21] Vu Le and Sumit Gulwani. FlashExtract: A framework for data extraction by examples. In
Proceedings of the 35th ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 542–553, 2014.
[22] Sumit Gulwani. Automating string processing in spreadsheets using input-output examples. ACM SIGPLAN Notices, 46(1):317–330, 2011.
[23] Saurabh Srivastava, Sumit Gulwani, and Jeffrey S. Foster. From program verification to program synthesis. SIGPLAN Not., 45(1):313–326, January 2010. ISSN 0362-1340. doi: 10.1145/1707801.1706337.
[24] Wang Ling, Phil Blunsom, Edward Grefenstette, Karl Moritz Hermann, Tomáš Kočiský, Fumin Wang, and Andrew Senior. Latent predictor networks for code generation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 599–609, 2016.
[25] Pengcheng Yin and Graham Neubig. A syntactic neural model for general-purpose code generation. In
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 440–450, 2017.
[26] Maxim Rabinovich, Mitchell Stern, and Dan Klein. Abstract syntax networks for code generation and semantic parsing. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1139–1149, 2017.
[27] Abhinav Verma, Vijayaraghavan Murali, Rishabh Singh, Pushmeet Kohli, and Swarat Chaudhuri. Programmatically interpretable reinforcement learning. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 5045–5054, Stockholmsmässan, Stockholm, Sweden, 10–15 Jul 2018. PMLR. URL http://proceedings.mlr.press/v80/verma18a.html.
[28] Jens Kober, J. Andrew Bagnell, and Jan Peters. Reinforcement learning in robotics: A survey. The International Journal of Robotics Research, 32(11):1238–1274, 2013. doi: 10.1177/0278364913495721.
[29] Vadim Liventsev. HeartPole: A transparent task for reinforcement learning in healthcare. https://github.com/vadim0x60/heartpole/blob/master/HeartPole_abstract.pdf. (Accessed on 01/…).
[30] https://github.com/akiani/rlsepsis234/blob/master/writeup.pdf. (Accessed on 01/…).
[32] Karl Johan Åström. Optimal control of Markov processes with incomplete state information. Journal of Mathematical Analysis and Applications, 10(1):174–205, 1965. ISSN 0022-247X.
[33] J. David R. Kramer, Jr. Partially Observable Markov Processes, 1964.
[34] Frederick P Brooks Jr. The mythical man-month: essays on software engineering. Pearson Education, 1995.
[35] Claire Le Goues, Michael Dewey-Vogt, Stephanie Forrest, and Westley Weimer. A systematic study of automated program repair: Fixing 55 out of 105 bugs for $8 each. In 2012 34th International Conference on Software Engineering (ICSE), pages 3–13. IEEE, 2012.
[36] Thomas Rückstiess, Frank Sehnke, Tom Schaul, Daan Wierstra, Yi Sun, and Jürgen Schmidhuber. Exploring parameter space in reinforcement learning.
Paladyn, 1(1):14–24, 2010.
[37] Voting Systems: From Method to Algorithm. PhD thesis, California State University Channel Islands.
[38] Thomas Hunt Morgan. A Critique of the Theory of Evolution. Princeton University Press, 1916.
[39] Vadim Liventsev, Aki Härmä, and Milan Petković. BF++: a language for general-purpose neural program synthesis, 2021.
[40] Herbert Robbins. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58(5):527–535, 1952.
[41] Volodymyr Kuleshov and Doina Precup. Algorithms for multi-armed bandit problems. CoRR, abs/1402.6028, 2014. URL http://arxiv.org/abs/1402.6028.
[42] William G Macready and David H Wolpert. Bandit problems and the exploration/exploitation tradeoff. IEEE Transactions on Evolutionary Computation, 2(1):2–22, 1998.
[43] Felix A Gers, Jürgen Schmidhuber, and Fred Cummins. Learning to forget: Continual prediction with LSTM. 1999.
[44] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT Press.