Towards Mixed Optimization for Reinforcement Learning with Program Synthesis
Surya Bhupatiraju * 1
Kumar Krishna Agrawal * 1
Rishabh Singh

Abstract
Deep reinforcement learning has led to several recent breakthroughs, though the learned policies are often based on black-box neural networks. This makes them difficult to interpret and to impose desired specifications or constraints on during learning. We present an iterative framework, MORL, for improving the learned policies using program synthesis. Concretely, we propose to use synthesis techniques to obtain a symbolic representation of the learned policy, which can be debugged manually or automatically using program repair. After the repair step, we distill the policy corresponding to the repaired program, which is further improved using gradient descent. This process continues until the learned policy satisfies the constraints. We instantiate MORL for the simple CartPole problem and show that the programmatic representation allows for high-level modifications which in turn lead to improved learning of the policies.
1. Introduction
There have been many recent successes in using deep reinforcement learning (DRL) to solve challenging problems such as learning to play Go and Atari games (Silver et al., 2016; 2017; Mnih et al., 2015). While the effectiveness of reinforcement learning methods in these domains has been impressive, they have some shortcomings. The learned policies are based on black-box deep neural networks, which are difficult to interpret. Additionally, it is challenging to impose and validate certain desirable policy specifications, such as worst-case guarantees or safety constraints. This makes it difficult to debug and improve these policies, therefore hindering their use in safety-critical domains.

* Equal contribution. Google Brain, USA. Correspondence to: Surya Bhupatiraju <[email protected]>, Kumar Krishna Agrawal <[email protected]>.

Published at the ICML workshop Neural Abstract Machines & Program Induction v2 (NAMPI), Stockholm, Sweden, 2018. Copyright 2018 by the author(s).
There has been some recent work on using program synthesis techniques to interpret learned policies using higher-level programs (Verma et al., 2018) and decision trees (Bastani et al., 2018). The key idea in PIRL (Verma et al., 2018) is to first train a DRL policy using standard methods and then use an imitation-learning-like approach to search for a program in a domain-specific language (DSL) that conforms to the behavior traces sampled from the policy. Similarly, VIPER (Bastani et al., 2018) uses imitation learning (a modified form of the DAGGER algorithm (Ross et al., 2011)) to extract a decision tree corresponding to the learned policy. The main goal of these works is to extract a symbolic high-level representation of the policy (as a DSL program or a decision tree) which is more interpretable and also amenable to program verification techniques.

We build upon these recent advances to propose an iterative framework for learning interpretable and safe policies. The main steps in the workflow of our framework are as follows. We start with a random initial policy π. We use program synthesis techniques similar to PIRL and VIPER to learn a symbolic representation of the learned policy as a program P. After obtaining a programmatic representation of the policy, we perform program repair (Weimer et al., 2009; Jobstmann et al., 2005) to obtain a repaired program P′ that satisfies some set of constraints. Note that the program repair step can be performed either automatically using a safety specification constraint, or manually by a human expert who modifies P to remove undesirable behaviors (or add desired behaviors). We then use behavioral cloning (Bratko et al., 1995) to obtain the corresponding improved policy π′, which is then further improved using standard gradient descent to obtain π. This process of improving policies from π_t → P_t → P′_t → π′_t → π_{t+1} is repeated until achieving desirable performance and safety guarantees. We name this iterative procedure a mixed optimization scheme for reinforcement learning, or MORL.

Figure 1. An overview of the proposed method. We decompose policy learning into alternating between policy optimization and program repair. Starting from a black-box policy π_t, we consider the following steps: (1) Synthesis, which generates a program P_t corresponding to the policy π_t; the program is sampled from an underlying domain-specific language (DSL) D. (2) Repair, which corresponds to debugging the program, allowing us to impose high-level constraints on the learned program. (3) Imitation, which corresponds to distilling the program back into a reactive representation. (4) Policy Optimization, which in this case corresponds to gradient-based policy optimization.

As a first step towards a full realization of MORL, we present a simple instantiation of our framework for the CartPole (Barto et al., 1983) problem. We demonstrate the efficacy of our approach in learning near-optimal policies while enabling the user to better interpret the learned policy. In addition, we argue that the scheme has a natural interpretation and can be readily extended to capture more notions of policy improvement, and we discuss the potential benefits and obstacles of using such an approach.

This paper makes the following key contributions:

• We propose a simple framework for iterative policy refinement by performing repair at the level of a programmatic representation of learned policies.

• We instantiate the framework for the CartPole problem and show the effectiveness of performing modifications in the symbolic representation.
2. Mixed Optimization for Reinforcement Learning
Our goal is to improve policy learning by decomposing the usual gradient-based optimization scheme into an iterative two-stage algorithm. In this context, we view improvement as making the policies (1) safe, ensuring performance under safety constraints, (2) interpretable, allowing some level of introspection into the policy's decisions, (3) sample efficient, or (4) aligned with priors. While there are other notions of improvement, for the remainder of the paper we focus on sample efficiency as the notion of policy improvement. We include a discussion of the other approaches as they apply to our framework.
Consider the typical Markov decision process (MDP) setup (S, A, R, T, ρ, γ), with a state space S, an action space A, a reward function R, the transition dynamics of the environment T, the initial state distribution ρ, and the discount factor γ. The goal is to find a policy, a function π : S → A, that achieves the maximum expected reward. Normally, reward design and specification for a task T corresponds to defining the reward function R(s, a) such that an optimal policy π* solves the task.

An alternative view of solving the task is to assume access to an oracle policy π, or a fixed number of trajectories from it. In this setting, our goal is to learn a policy by imitation learning, which would equivalently solve the task. In this work, we focus on improving policy learning using imitation learning (Abbeel & Ng, 2004; Ho & Ermon, 2016), though the framework is more general and extends to reinforcement learning.

We consider a symbolic representation D (such as a DSL) that is expressive enough to represent different policies. The synthesis problem can then be defined as learning a program P ∈ D such that ∀s ∈ S : π(s) ≈ P(s), i.e., the learned program P produces approximately the same output actions as the actions produced by the policy π for all (or a sampled set of) input states. In MORL we maintain two representations of a policy:

• a reactive, black-box policy, where we represent the policy as a differentiable function, such as a neural network, allowing us to use gradient-based optimization methods like TRPO (Schulman et al., 2015) or PPO (Schulman et al., 2017);

• a symbolic program, which represents the policy as an interpretable program. The symbolic program representation is amenable to analysis and transformation using automated program verification and repair techniques, or human inspection.

With these intermediate representations, we alternate between the following two stages: the first allows us to finetune policies in function space, and the second allows us to impose constraints or incorporate human debugging. The procedure (Fig. 1) consists of four key steps, as detailed below.

Figure 2. Evaluating the usefulness of maintaining differentiable and symbolic representations of the policy. Each plot corresponds to finetuning a policy cloned from a program (in this case a decision tree) with TRPO (averaged over 5 runs). Here, Near-Optimal is obtained by manual debugging of the Intermediate policy, which is obtained from the Worst policy.

Synthesis: Given a task T, we consider a domain-specific language D such that there exists some program P ∈ D that is a sufficient representation of the task. In the first step of MORL, we seek to synthesize such a program that is equivalent to the policy π. A programmatic representation of the policy allows us to leverage approaches such as program repair and verification to provide guarantees for the underlying policy. For this step, and in the scope of this paper, we assume that we can utilize existing program synthesis methods such as VIPER or PIRL, so we do not attempt to perform this step explicitly. We focus on the following steps in the MORL scheme.
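Although we do not perform this step explicitly, the idea of fitting a symbolic surrogate to a black-box policy can be made concrete. Below is a minimal sketch, our illustration rather than the PIRL or VIPER implementation, that samples (state, action) pairs from a policy and fits a shallow decision tree to them; the `policy` callable, the classic Gym-style `env` interface, and the use of scikit-learn are all assumptions made for the sketch.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def synthesize_program(policy, env, n_rollouts=50, max_depth=3):
    """Fit a decision tree P such that P(s) approximates policy(s).

    `policy` is any callable mapping an observation to a discrete action;
    `env` is assumed to follow the classic Gym API
    (reset() -> obs, step(a) -> (obs, reward, done, info)).
    """
    states, actions = [], []
    for _ in range(n_rollouts):
        obs, done = env.reset(), False
        while not done:
            act = policy(obs)
            states.append(obs)
            actions.append(act)
            obs, _, done, _ = env.step(act)
    # A shallow tree keeps the symbolic representation small and readable.
    program = DecisionTreeClassifier(max_depth=max_depth)
    program.fit(np.array(states), np.array(actions))
    return program
```

Note that VIPER additionally reweights and resamples states in a DAGGER-style loop rather than fitting a single batch of traces; the single-pass fit above is only the simplest possible stand-in.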
Repair: In this step, we modify the synthesized program to satisfy constraints imposed either on D or on the synthesized program P. This step allows us to meaningfully debug the policy, either through human-in-the-loop verification for interpretability, or through automated program repair techniques that involve defining constraint satisfaction problems (CSPs), typically solved using SAT/SMT solvers (Singh et al., 2013). For the scope of this paper, we mimic the repair process by manually modifying the initial program to obtain three programs that achieve three different levels of success at the task of interest.

Imitation: Following the program synthesis and repair steps, we distill (Rusu et al., 2015) the program back into a reactive policy using imitation learning. Given that we have access to an oracle P′_t, we find that we can reliably imitate the program (Ross et al., 2011). Note that it is possible to stop the optimization here. Indeed, a user may end the MORL procedure at this point, if certain performance or safety bounds have been reached, and skip the last step.
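A minimal sketch of this distillation step, assuming PyTorch, a discrete action space, and a `program` callable that maps a state to an action (none of which are prescribed by the paper): the repaired program labels a batch of sampled states, and a small network is trained with a cross-entropy loss to reproduce those labels.

```python
import torch
import torch.nn as nn

def behavioral_cloning(program, states, n_epochs=200, lr=1e-2):
    """Distill a symbolic program into a small differentiable policy.

    `states` is an (N, state_dim) float tensor of observations collected
    from the environment; labels come from running the program on each state.
    """
    actions = torch.tensor([program(s.numpy()) for s in states])
    policy = nn.Sequential(
        nn.Linear(states.shape[1], 64), nn.Tanh(),
        nn.Linear(64, int(actions.max()) + 1),  # logits over discrete actions
    )
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(n_epochs):
        opt.zero_grad()
        loss = loss_fn(policy(states), actions)
        loss.backward()
        opt.step()
    return policy  # a reactive, differentiable stand-in for the program
```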
Figure 3. An important step in the algorithm is alternating between symbolic and policy representations. Here we plot the convergence rate of randomly initialized policies to the program behavior. In this work, we used simple behavioral cloning to retrain the policies. We note that more sample-efficient algorithms would be able to emulate the behavior of the program more quickly.
Policy Optimization: Finally, we finetune the policy using gradient descent. We posit that by optimizing both in program space and over the space of policies in a differentiable space, we are better able to escape local minima while still maintaining an underlying intuition for how the policy is performing through inspection of the program.
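Our experiments finetune with TRPO; as a much simpler stand-in for any gradient-based policy optimizer, the sketch below applies REINFORCE-style updates to the cloned policy, again assuming PyTorch, a discrete action space, and the classic Gym reset/step interface.

```python
import torch

def finetune(policy, env, n_episodes=250, lr=1e-3, gamma=0.99):
    """Gradient-based finetuning of a cloned policy (REINFORCE stand-in for TRPO)."""
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    for _ in range(n_episodes):
        obs, done = env.reset(), False
        log_probs, rewards = [], []
        while not done:
            logits = policy(torch.as_tensor(obs, dtype=torch.float32))
            dist = torch.distributions.Categorical(logits=logits)
            act = dist.sample()
            log_probs.append(dist.log_prob(act))
            obs, rew, done, _ = env.step(act.item())
            rewards.append(rew)
        # Discounted return for each step, normalized, then the REINFORCE loss.
        returns, g = [], 0.0
        for r in reversed(rewards):
            g = r + gamma * g
            returns.append(g)
        returns = torch.tensor(list(reversed(returns)))
        returns = (returns - returns.mean()) / (returns.std() + 1e-8)
        loss = -(torch.stack(log_probs) * returns).sum()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return policy
```

Starting this loop from the distilled policy rather than a random initialization is what lets the structure of the repaired program carry over into the gradient-based phase.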
3. Experiments
We evaluate our framework on the CartPole-v0 problem in the OpenAI Gym environment for discrete control (Brockman et al., 2016). We present a first simple instantiation of the framework to showcase its usefulness compared to direct reinforcement learning. In our preliminary evaluation, we address the following research questions:

• Does program repair lead to faster convergence?

• Does a programmatic representation help humans provide better repair insights?

To this end, we train an initial policy π (Worst) that performs poorly, and then extract the corresponding symbolic representation P. For the symbolic representation, we chose VIPER's (Bastani et al., 2018) decision tree representation of the policy. We then modify the symbolic program to get a new program P′, which performs better than the original program, by repairing certain values in the decision tree. This is followed by behavioral cloning to obtain π′ (corresponding to P′), which is optimized to obtain π.

Figure 4. Debugging Worst (red) to Intermediate (green). In one step of debugging the policy, we fix the policy to make the cart shift in the same direction as the pole.
To simulate the iterative optimization of the framework, we perform two different modifications in the program repair step to obtain P (Intermediate) and P (Near-Optimal), which have different characteristics in terms of repair improvements. For example, the modification to obtain the Intermediate program from the Worst program is shown in Fig. 4, where we manually provide the insight of making the cart shift in the same direction as the pole.

In our experiments, we first measure the average performance of each of the three levels of policies across 25 runs. The Worst policy gets an average reward of 9.28, the Intermediate policy gets an average reward of 104.0, and the Near-Optimal policy gets an average reward of 200.0. When we distill the programs into continuous policies π, we find that the resulting policies attain 10.64, 66, and 185, respectively, after 15,000 epochs, as shown in Figure 3. Lastly, when we take the resulting distilled policies and finetune them with TRPO, we find that the resulting average rewards are 38.65, 79.03, and 176.8 after 25 episodes of training with 10 trajectories of length 200. In Figure 2, we run TRPO for a total of 250 episodes to see the limiting behavior.

From our results, we validate our hypothesis that under bad initialization (Worst), TRPO takes an order of magnitude longer to converge to a near-optimal policy than policies initialized after program repair. We believe that providing high-level insights programmatically can help policies discover better or safer behaviors.
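For concreteness, the insight behind the Figure 4 repair (push the cart in the same direction the pole is leaning) can be written as a two-line CartPole rule. This is only our paraphrase of that insight for illustration; the actual repaired program in the experiments is a decision tree extracted by VIPER and edited at particular nodes.

```python
def repaired_cartpole_rule(obs):
    """Push the cart in the direction the pole is leaning (Figure 4 insight, paraphrased)."""
    cart_pos, cart_vel, pole_angle, pole_vel = obs  # CartPole-v0 observation layout
    return 1 if pole_angle > 0.0 else 0             # 1 = push right, 0 = push left
```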
4. Related Work
Our framework is inspired by the recent works of PIRL (Verma et al., 2018) and VIPER (Bastani et al., 2018), which use program synthesis to learn symbolic, interpretable representations of learnt policies, and then use program verification to verify certain properties of the program.

PIRL first trains a DRL policy for a domain and then uses an imitation-learning-like approach to generate specifications (input-output behaviors) for the synthesis problem. It then uses a Bayesian optimization technique to search for programs in a DSL that conform to the specification. It iteratively builds up new behaviors by executing the initial policy as an oracle to obtain outputs for inputs that were not originally sampled but are observed when executing the learnt programs. It maintains a family of programs consistent with the specification and outputs the one that achieves the maximum reward on the task.

VIPER uses a modified form of the DAGGER imitation learning algorithm to extract a decision tree corresponding to the learnt policy. It then uses program verification techniques to validate correctness, stability, and robustness properties of the extracted programs (represented as decision trees).

While previous approaches stop at learning a verifiable symbolic representation of policies, our framework aims at iterative improvement of policies. In particular, if the extracted symbolic program does not satisfy certain desirable verification constraints, unlike previous approaches, our framework allows for repairing the programs in symbolic space and distilling the programs back into policies for further optimization.
5. Discussion and Future Work
We presented a preliminary instantiation of the MORL framework showing the benefits of learning a symbolic representation of the policy. Namely, by optimizing the policy while iterating between two representations, we were able to converge faster to near-optimal performance starting from a poor initialization.

There are a number of assumptions we make in this paper in order to instantiate our framework. While the MORL framework is general enough to encapsulate many different approaches to synthesis, repair, and imitation, we only consider the simplest forms of these. For instance, we hand-design the candidate repaired programs and use a simple supervised approach for imitation learning. Each of these aspects could be significantly scaled up to handle larger programs and more complicated tasks. While CartPole was a simple sandbox in which we could test symbolic programs, for more complicated tasks, automated program repair and verification techniques would be more efficient.

Reward design (Clark & Amodei, 2016) and safety (Hadfield-Menell et al., 2017) are another exciting research direction. Note that we could instead use the reward function R as the program representation for MORL; this would provide a procedure for more interpretable or verifiable inverse reinforcement learning.
References
Abbeel, Pieter and Ng, Andrew Y. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the Twenty-First International Conference on Machine Learning, pp. 1. ACM, 2004.

Barto, A. G., Sutton, R. S., and Anderson, C. W. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, SMC-13(5):834–846, Sept 1983. ISSN 0018-9472.

Bastani, Osbert, Pu, Yewen, and Solar-Lezama, Armando. Verifiable reinforcement learning via policy extraction. arXiv preprint arXiv:1805.08328, 2018.

Bratko, Ivan, Urbančič, Tanja, and Sammut, Claude. Behavioural cloning: phenomena, results and problems. IFAC Proceedings Volumes, 28(21):143–149, 1995.

Brockman, Greg, Cheung, Vicki, Pettersson, Ludwig, Schneider, Jonas, Schulman, John, Tang, Jie, and Zaremba, Wojciech. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.

Clark, Jack and Amodei, Dario. Faulty reward functions in the wild. https://blog.openai.com/faulty-reward-functions/, 2016.

Hadfield-Menell, Dylan, Milli, Smitha, Abbeel, Pieter, Russell, Stuart J, and Dragan, Anca. Inverse reward design. In Advances in Neural Information Processing Systems, pp. 6768–6777, 2017.

Ho, Jonathan and Ermon, Stefano. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems, pp. 4565–4573, 2016.

Jobstmann, Barbara, Griesmayer, Andreas, and Bloem, Roderick. Program repair as a game. In CAV, pp. 226–238, Berlin, Heidelberg, 2005. Springer-Verlag. doi: 10.1007/11513988_23. URL http://dx.doi.org/10.1007/11513988_23.

Mnih, Volodymyr, Kavukcuoglu, Koray, Silver, David, Rusu, Andrei A., Veness, Joel, Bellemare, Marc G., Graves, Alex, Riedmiller, Martin A., Fidjeland, Andreas, Ostrovski, Georg, Petersen, Stig, Beattie, Charles, Sadik, Amir, Antonoglou, Ioannis, King, Helen, Kumaran, Dharshan, Wierstra, Daan, Legg, Shane, and Hassabis, Demis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

Ross, Stéphane, Gordon, Geoffrey, and Bagnell, Drew. A reduction of imitation learning and structured prediction to no-regret online learning. In AISTATS, pp. 627–635, 2011.

Rusu, Andrei A, Colmenarejo, Sergio Gomez, Gulcehre, Caglar, Desjardins, Guillaume, Kirkpatrick, James, Pascanu, Razvan, Mnih, Volodymyr, Kavukcuoglu, Koray, and Hadsell, Raia. Policy distillation. arXiv preprint arXiv:1511.06295, 2015.

Schulman, John, Levine, Sergey, Abbeel, Pieter, Jordan, Michael, and Moritz, Philipp. Trust region policy optimization. In International Conference on Machine Learning, pp. 1889–1897, 2015.

Schulman, John, Wolski, Filip, Dhariwal, Prafulla, Radford, Alec, and Klimov, Oleg. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

Silver, David, Huang, Aja, Maddison, Chris J., Guez, Arthur, Sifre, Laurent, van den Driessche, George, Schrittwieser, Julian, Antonoglou, Ioannis, Panneershelvam, Vedavyas, Lanctot, Marc, Dieleman, Sander, Grewe, Dominik, Nham, John, Kalchbrenner, Nal, Sutskever, Ilya, Lillicrap, Timothy P., Leach, Madeleine, Kavukcuoglu, Koray, Graepel, Thore, and Hassabis, Demis. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.

Silver, David, Hubert, Thomas, Schrittwieser, Julian, Antonoglou, Ioannis, Lai, Matthew, Guez, Arthur, Lanctot, Marc, Sifre, Laurent, Kumaran, Dharshan, Graepel, Thore, Lillicrap, Timothy P., Simonyan, Karen, and Hassabis, Demis. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. CoRR, abs/1712.01815, 2017.

Singh, Rishabh, Gulwani, Sumit, and Solar-Lezama, Armando. Automated feedback generation for introductory programming assignments. In PLDI, pp. 15–26, 2013.

Verma, Abhinav, Murali, Vijayaraghavan, Singh, Rishabh, Kohli, Pushmeet, and Chaudhuri, Swarat. Programmatically interpretable reinforcement learning. arXiv preprint arXiv:1804.02477, 2018.

Weimer, Westley, Nguyen, ThanhVu, Le Goues, Claire, and Forrest, Stephanie. Automatically finding patches using genetic programming. In ICSE, 2009.