Towards Mixed Optimization for Reinforcement Learning with Program Synthesis
Surya Bhupatiraju * 1
Kumar Krishna Agrawal * 1
Rishabh Singh

Abstract
Deep reinforcement learning has led to several recent breakthroughs, though the learned policies are often based on black-box neural networks. This makes them difficult to interpret and to impose desired specifications or constraints on during learning. We present an iterative framework, MORL, for improving the learned policies using program synthesis. Concretely, we propose to use synthesis techniques to obtain a symbolic representation of the learned policy, which can be debugged manually or automatically using program repair. After the repair step, we distill the policy corresponding to the repaired program, which is further improved using gradient descent. This process continues until the learned policy satisfies the constraints. We instantiate MORL for the simple CartPole problem and show that the programmatic representation allows for high-level modifications which in turn lead to improved learning of the policies.
1. Introduction
There have been many recent successes in using deep reinforcement learning (DRL) to solve challenging problems such as learning to play Go and Atari games (Silver et al., 2016; 2017; Mnih et al., 2015). While the effectiveness of reinforcement learning methods in these domains has been impressive, they have some shortcomings. The learned policies are based on black-box deep neural networks, which are difficult to interpret. Additionally, it is challenging to impose and validate certain desirable policy specifications, such as worst-case guarantees or safety constraints. This makes it difficult to debug and improve these policies, therefore hindering their use in safety-critical domains.

* Equal contribution. Google Brain, USA. Correspondence to: Surya Bhupatiraju <[email protected]>, Kumar Krishna Agrawal <[email protected]>.

Published at the ICML workshop Neural Abstract Machines & Program Induction v2 (NAMPI), Stockholm, Sweden, 2018. Copyright 2018 by the author(s).
There has been some recent work on using program synthesis techniques to interpret learned policies using higher-level programs (Verma et al., 2018) and decision trees (Bastani et al., 2018). The key idea in PIRL (Verma et al., 2018) is to first train a DRL policy using standard methods and then use an imitation-learning-like approach to search for a program in a domain-specific language (DSL) that conforms to the behavior traces sampled from the policy. Similarly, VIPER (Bastani et al., 2018) uses imitation learning (a modified form of the DAGGER algorithm (Ross et al., 2011)) to extract a decision tree corresponding to the learned policy. The main goal of these works is to extract a symbolic high-level representation of the policy (as a DSL program or a decision tree) which is more interpretable and also amenable to program verification techniques.

We build upon these recent advances to propose an iterative framework for learning interpretable and safe policies. The main steps in the workflow of our framework are as follows. We start with a random initial policy π. We use program synthesis techniques similar to PIRL and VIPER to learn a symbolic representation of the learned policy as a program P. After obtaining a programmatic representation of the policy, we perform program repair (Weimer et al., 2009; Jobstmann et al., 2005) to obtain a repaired program P′ that satisfies some set of constraints. Note that the program repair step can be performed either automatically using a safety specification constraint, or manually by a human expert who modifies P to remove undesirable behaviors (or add desired behaviors). We then use behavioral cloning (Bratko et al., 1995) to obtain the corresponding improved policy π′, which is then further improved using standard gradient descent to obtain π. This process of improving policies from π_t → P_t → P′_t → π′_t → π_{t+1} is repeated until achieving desirable performance and safety guarantees. We name this iterative procedure a mixed optimization scheme for reinforcement learning, or MORL.

Figure 1. An overview of the proposed method. We decompose policy learning into alternating between policy optimization and program repair. Starting from a black-box policy π_t, we consider the following steps: (1) Synthesis, which generates a program P_t corresponding to the policy π_t; the program is sampled from an underlying domain-specific language (DSL) D. (2) Repair, which corresponds to debugging the program, allowing us to impose high-level constraints on the learned program. (3) Imitation, which corresponds to distilling the program back into a reactive representation. (4) Policy Optimization, which in this case corresponds to gradient-based policy optimization.

As a first step towards a full realization of MORL, we present a simple instantiation of our framework for the CartPole (Barto et al., 1983) problem. We demonstrate the efficacy of our approach in learning near-optimal policies while enabling the user to better interpret the learned policy. In addition, we argue that the scheme has a natural interpretation and can be readily extended to capture more notions of policy improvement, and we discuss the potential benefits and obstacles of using such an approach.

This paper makes the following key contributions:

• We propose a simple framework for iterative policy refinement by performing repair at the level of a programmatic representation of learned policies.

• We instantiate the framework for the CartPole problem and show the effectiveness of performing modifications in the symbolic representation.
2. Mixed Optimization for Reinforcement Learning
Our goal is to improve policy learning by decomposing the usual gradient-based optimization scheme into an iterative two-stage algorithm. In this context, we view improvement as making the policies (1) safe, ensuring performance under safety constraints, (2) interpretable, allowing some level of introspection into the policy's decisions, (3) sample efficient, or (4) aligned with priors. While there are other notions of improvement, for the remainder of the paper we focus on sample efficiency as the notion of policy improvement. We include a discussion of the other approaches as they apply to our framework.
Consider the typical Markov decision process (MDP) setup (S, A, R, T, ρ, γ), with a state space S, an action space A, a reward function R, the transition dynamics of the environment T, the initial state distribution ρ, and the discount factor γ. The goal is to find a policy, a function π : S → A, that achieves the maximum expected reward. Normally, reward design and specification for a task T corresponds to defining the reward function R(s, a) such that an optimal policy π* solves the task.

An alternative view of solving the task is to assume access to an oracle policy π, or a fixed number of trajectories from it. In this setting, our goal is to learn a policy by imitation learning, which would equivalently solve the task. In this work, we focus on improving policy learning using imitation learning (Abbeel & Ng, 2004; Ho & Ermon, 2016), though the framework is more general and extends to reinforcement learning.

We consider a symbolic representation D (such as a DSL) that is expressive enough to represent different policies. The synthesis problem can then be defined as learning a program P ∈ D such that ∀s ∈ S : π(s) ≈ P(s), i.e., the learned program P produces approximately the same output actions as the actions produced by the policy π for all (or a sampled set of) input states. In MORL we maintain two representations of a policy:

• a reactive, black-box policy, where we represent the policy as a differentiable function, such as a neural network, allowing us to use gradient-based optimization methods like TRPO (Schulman et al., 2015) or PPO (Schulman et al., 2017);

• a symbolic program, which represents the policy as an interpretable program. The symbolic program representation is amenable to analysis and transformation using automated program verification and repair techniques, or human inspection.

With these intermediate representations, we alternate between the following two stages: the first allows us to finetune policies in function space, and the second allows us to impose constraints or incorporate human debugging. The procedure (Fig. 1) consists of four key steps, as detailed below.

Figure 2. Evaluating the usefulness of maintaining differentiable and symbolic representations of the policy. Each plot corresponds to finetuning a policy cloned from a program (in this case a decision tree) with TRPO (averaged over 5 runs). Here, Near-Optimal is obtained by manual debugging of the Intermediate policy, which is obtained from the Worst policy.

Synthesis: Given a task T, we consider a domain-specific language D such that there exists some program P ∈ D that is a sufficient representation of the task. In the first step of MORL, we seek to synthesize such a program that is equivalent to the policy π. A programmatic representation of the policy allows us to leverage approaches such as program repair and verification to provide guarantees for the underlying policy. For this step, and in the scope of this paper, we assume that we can utilize existing program synthesis methods such as VIPER or PIRL, so we do not attempt to perform this step explicitly. We focus on the following steps in the MORL scheme.
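Although we do not perform this step explicitly, the idea of fitting a symbolic surrogate to a black-box policy can be made concrete. Below is a minimal sketch, our illustration rather than the PIRL or VIPER implementation, that samples (state, action) pairs from a policy and fits a shallow decision tree to them; the `policy` callable, the classic Gym-style `env` interface, and the use of scikit-learn are all assumptions made for the sketch.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def synthesize_program(policy, env, n_rollouts=50, max_depth=3):
    """Fit a decision tree P such that P(s) approximates policy(s).

    `policy` is any callable mapping an observation to a discrete action;
    `env` is assumed to follow the classic Gym API
    (reset() -> obs, step(a) -> (obs, reward, done, info)).
    """
    states, actions = [], []
    for _ in range(n_rollouts):
        obs, done = env.reset(), False
        while not done:
            act = policy(obs)
            states.append(obs)
            actions.append(act)
            obs, _, done, _ = env.step(act)
    # A shallow tree keeps the symbolic representation small and readable.
    program = DecisionTreeClassifier(max_depth=max_depth)
    program.fit(np.array(states), np.array(actions))
    return program
```

Note that VIPER additionally reweights and resamples states in a DAGGER-style loop rather than fitting a single batch of traces; the single-pass fit above is only the simplest possible stand-in.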
Repair: In this step, we modify the synthesized program to satisfy constraints imposed either on D or on the synthesized program P. This step allows us to meaningfully debug the policy, either through human-in-the-loop verification for interpretability, or through automated program repair techniques that involve defining constraint satisfaction problems (CSPs), typically solved using SAT/SMT solvers (Singh et al., 2013). For the scope of this paper, we mimic the repair process by manually modifying the initial program to obtain three programs that achieve three different levels of success at the task of interest.

Imitation: Following the program synthesis and repair steps, we distill (Rusu et al., 2015) the program back into a reactive policy using imitation learning. Given that we have access to an oracle P′_t, we find that we can reliably imitate the program (Ross et al., 2011). Note that it is possible to stop the optimization here. Indeed, a user may end the MORL procedure at this point, if certain performance or safety bounds have been reached, and skip the last step.
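A minimal sketch of this distillation step, assuming PyTorch, a discrete action space, and a `program` callable that maps a state to an action (none of which are prescribed by the paper): the repaired program labels a batch of sampled states, and a small network is trained with a cross-entropy loss to reproduce those labels.

```python
import torch
import torch.nn as nn

def behavioral_cloning(program, states, n_epochs=200, lr=1e-2):
    """Distill a symbolic program into a small differentiable policy.

    `states` is an (N, state_dim) float tensor of observations collected
    from the environment; labels come from running the program on each state.
    """
    actions = torch.tensor([program(s.numpy()) for s in states])
    policy = nn.Sequential(
        nn.Linear(states.shape[1], 64), nn.Tanh(),
        nn.Linear(64, int(actions.max()) + 1),  # logits over discrete actions
    )
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(n_epochs):
        opt.zero_grad()
        loss = loss_fn(policy(states), actions)
        loss.backward()
        opt.step()
    return policy  # a reactive, differentiable stand-in for the program
```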
Figure 3. An important step in the algorithm is alternating between symbolic and policy representations. Here we plot the convergence rate of randomly initialized policies to the program behavior. In this work, we used simple behavioral cloning to retrain the policies. We note that more sample-efficient algorithms would be able to emulate the behavior of the program more quickly.
Policy Optimization: Finally, we finetune the policy using gradient descent. We posit that by optimizing both in program space and over the space of policies in a differentiable space, we are better able to escape local minima while still maintaining an underlying intuition for how the policy is performing through inspection of the program.
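Our experiments finetune with TRPO; as a much simpler stand-in for any gradient-based policy optimizer, the sketch below applies REINFORCE-style updates to the cloned policy, again assuming PyTorch, a discrete action space, and the classic Gym reset/step interface.

```python
import torch

def finetune(policy, env, n_episodes=250, lr=1e-3, gamma=0.99):
    """Gradient-based finetuning of a cloned policy (REINFORCE stand-in for TRPO)."""
    opt = torch.optim.Adam(policy.parameters(), lr=lr)
    for _ in range(n_episodes):
        obs, done = env.reset(), False
        log_probs, rewards = [], []
        while not done:
            logits = policy(torch.as_tensor(obs, dtype=torch.float32))
            dist = torch.distributions.Categorical(logits=logits)
            act = dist.sample()
            log_probs.append(dist.log_prob(act))
            obs, rew, done, _ = env.step(act.item())
            rewards.append(rew)
        # Discounted return for each step, normalized, then the REINFORCE loss.
        returns, g = [], 0.0
        for r in reversed(rewards):
            g = r + gamma * g
            returns.append(g)
        returns = torch.tensor(list(reversed(returns)))
        returns = (returns - returns.mean()) / (returns.std() + 1e-8)
        loss = -(torch.stack(log_probs) * returns).sum()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return policy
```

Starting this loop from the distilled policy rather than a random initialization is what lets the structure of the repaired program carry over into the gradient-based phase.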
3. Experiments
We evaluate our framework on the CartPole-v0 problem in the OpenAI Gym environment for discrete control (Brockman et al., 2016). We present a first simple instantiation of the framework to showcase its usefulness compared to direct reinforcement learning. In our preliminary evaluation, we address the following research questions:

• Does program repair lead to faster convergence?

• Does a programmatic representation help humans provide better repair insights?

To this end, we train an initial policy π (Worst) that performs poorly, and then extract the corresponding symbolic representation P. For the symbolic representation, we chose VIPER's (Bastani et al., 2018) decision tree representation of the policy. We then modify the symbolic program to get a new program P′, which performs better than the original program, by repairing certain values in the decision tree. This is followed by behavioral cloning to obtain π′ (corresponding to P′), which is optimized to obtain π.

Figure 4. Debugging Worst (red) to Intermediate (green). In one step of debugging the policy, we fix the policy to make the cart shift in the same direction as the pole.
To simulate the iterative optimization of the framework, we perform two different modifications in the program repair step to obtain P (Intermediate) and P (Near-Optimal), which have different characteristics in terms of repair improvements. For example, the modification to obtain the Intermediate program from the Worst program is shown in Fig. 4, where we manually provide the insight of making the cart shift in the same direction as the pole.

In our experiments, we first measure the average performance of each of the three levels of policies across 25 runs. The Worst policy gets an average reward of 9.28, the Intermediate policy gets an average reward of 104.0, and the Near-Optimal policy gets an average reward of 200.0. When we distill the programs into continuous policies π, we find that the resulting policies attain 10.64, 66, and 185, respectively, after 15,000 epochs, as shown in Figure 3. Lastly, when we take the resulting distilled policies and finetune them with TRPO, we find that the resulting average rewards are 38.65, 79.03, and 176.8 after 25 episodes of training with 10 trajectories of length 200. In Figure 2, we run TRPO for a total of 250 episodes to see the limiting behavior.

From our results, we validate our hypothesis that under bad initialization (Worst), TRPO takes an order of magnitude longer to converge to a near-optimal policy than policies initialized after program repair. We believe that providing high-level insights programmatically can help policies discover better or safer behaviors.
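For concreteness, the insight behind the Figure 4 repair (push the cart in the same direction the pole is leaning) can be written as a two-line CartPole rule. This is only our paraphrase of that insight for illustration; the actual repaired program in the experiments is a decision tree extracted by VIPER and edited at particular nodes.

```python
def repaired_cartpole_rule(obs):
    """Push the cart in the direction the pole is leaning (Figure 4 insight, paraphrased)."""
    cart_pos, cart_vel, pole_angle, pole_vel = obs  # CartPole-v0 observation layout
    return 1 if pole_angle > 0.0 else 0             # 1 = push right, 0 = push left
```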
4. Related Work
Our framework is inspired by the recent works of PIRL (Verma et al., 2018) and VIPER (Bastani et al., 2018), which use program synthesis to learn symbolic, interpretable representations of learnt policies, and then use program verification to verify certain properties of the program.

PIRL first trains a DRL policy for a domain and then uses an imitation-learning-like approach to generate specifications (input-output behaviors) for the synthesis problem. It then uses a Bayesian optimization technique to search for programs in a DSL that conform to the specification. It iteratively builds up new behaviors by executing the initial policy as an oracle to obtain outputs for inputs that were not originally sampled but are observed when executing the learnt programs. It maintains a family of programs consistent with the specification and outputs the one that achieves the maximum reward on the task.

VIPER uses a modified form of the DAGGER imitation learning algorithm to extract a decision tree corresponding to the learnt policy. It then uses program verification techniques to validate correctness, stability, and robustness properties of the extracted programs (represented as decision trees).

While previous approaches stop at learning a verifiable symbolic representation of policies, our framework aims at iterative improvement of policies. In particular, if the extracted symbolic program does not satisfy certain desirable verification constraints, unlike previous approaches, our framework allows for repairing the programs in symbolic space and distilling the programs back into policies for further optimization.
5. Discussion and Future Work
We presented a preliminary instantiation of the MORL framework showing the benefits of learning a symbolic representation of the policy. Namely, by optimizing the policy while iterating between two representations, we were able to converge faster to near-optimal performance starting from a poor initialization.

There are a number of assumptions we make in this paper in order to instantiate our framework. While the MORL framework is general enough to encapsulate many different approaches to synthesis, repair, and imitation, we only consider the simplest forms of these. For instance, we hand-design the candidate repaired programs and use a simple supervised approach for imitation learning. Each of these aspects could be significantly scaled up to handle larger programs and more complicated tasks. While CartPole was a simple sandbox in which we could test symbolic programs, for more complicated tasks, automated program repair and verification techniques would be more efficient.

Reward design (Clark & Amodei, 2016) and safety (Hadfield-Menell et al., 2017) are another exciting research direction. Note that we could instead use the reward function R as the program representation for MORL; this would provide a procedure for more interpretable or verifiable inverse reinforcement learning.
References
Abbeel, Pieter and Ng, Andrew Y. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the Twenty-First International Conference on Machine Learning, pp. 1. ACM, 2004.

Barto, A. G., Sutton, R. S., and Anderson, C. W. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, SMC-13(5):834–846, Sept 1983. ISSN 0018-9472.

Bastani, Osbert, Pu, Yewen, and Solar-Lezama, Armando. Verifiable reinforcement learning via policy extraction. arXiv preprint arXiv:1805.08328, 2018.

Bratko, Ivan, Urbančič, Tanja, and Sammut, Claude. Behavioural cloning: phenomena, results and problems. IFAC Proceedings Volumes, 28(21):143–149, 1995.

Brockman, Greg, Cheung, Vicki, Pettersson, Ludwig, Schneider, Jonas, Schulman, John, Tang, Jie, and Zaremba, Wojciech. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.

Clark, Jack and Amodei, Dario. Faulty reward functions in the wild. https://blog.openai.com/faulty-reward-functions/, 2016.

Hadfield-Menell, Dylan, Milli, Smitha, Abbeel, Pieter, Russell, Stuart J, and Dragan, Anca. Inverse reward design. In Advances in Neural Information Processing Systems, pp. 6768–6777, 2017.

Ho, Jonathan and Ermon, Stefano. Generative adversarial imitation learning. In Advances in Neural Information Processing Systems, pp. 4565–4573, 2016.

Jobstmann, Barbara, Griesmayer, Andreas, and Bloem, Roderick. Program repair as a game. In CAV, pp. 226–238, Berlin, Heidelberg, 2005. Springer-Verlag. doi: 10.1007/11513988_23. URL http://dx.doi.org/10.1007/11513988_23.

Mnih, Volodymyr, Kavukcuoglu, Koray, Silver, David, Rusu, Andrei A., Veness, Joel, Bellemare, Marc G., Graves, Alex, Riedmiller, Martin A., Fidjeland, Andreas, Ostrovski, Georg, Petersen, Stig, Beattie, Charles, Sadik, Amir, Antonoglou, Ioannis, King, Helen, Kumaran, Dharshan, Wierstra, Daan, Legg, Shane, and Hassabis, Demis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

Ross, Stéphane, Gordon, Geoffrey, and Bagnell, Drew. A reduction of imitation learning and structured prediction to no-regret online learning. In AISTATS, pp. 627–635, 2011.

Rusu, Andrei A, Colmenarejo, Sergio Gomez, Gulcehre, Caglar, Desjardins, Guillaume, Kirkpatrick, James, Pascanu, Razvan, Mnih, Volodymyr, Kavukcuoglu, Koray, and Hadsell, Raia. Policy distillation. arXiv preprint arXiv:1511.06295, 2015.

Schulman, John, Levine, Sergey, Abbeel, Pieter, Jordan, Michael, and Moritz, Philipp. Trust region policy optimization. In International Conference on Machine Learning, pp. 1889–1897, 2015.

Schulman, John, Wolski, Filip, Dhariwal, Prafulla, Radford, Alec, and Klimov, Oleg. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

Silver, David, Huang, Aja, Maddison, Chris J., Guez, Arthur, Sifre, Laurent, van den Driessche, George, Schrittwieser, Julian, Antonoglou, Ioannis, Panneershelvam, Vedavyas, Lanctot, Marc, Dieleman, Sander, Grewe, Dominik, Nham, John, Kalchbrenner, Nal, Sutskever, Ilya, Lillicrap, Timothy P., Leach, Madeleine, Kavukcuoglu, Koray, Graepel, Thore, and Hassabis, Demis. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.

Silver, David, Hubert, Thomas, Schrittwieser, Julian, Antonoglou, Ioannis, Lai, Matthew, Guez, Arthur, Lanctot, Marc, Sifre, Laurent, Kumaran, Dharshan, Graepel, Thore, Lillicrap, Timothy P., Simonyan, Karen, and Hassabis, Demis. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. CoRR, abs/1712.01815, 2017.

Singh, Rishabh, Gulwani, Sumit, and Solar-Lezama, Armando. Automated feedback generation for introductory programming assignments. In PLDI, pp. 15–26, 2013.

Verma, Abhinav, Murali, Vijayaraghavan, Singh, Rishabh, Kohli, Pushmeet, and Chaudhuri, Swarat. Programmatically interpretable reinforcement learning. arXiv preprint arXiv:1804.02477, 2018.

Weimer, Westley, Nguyen, ThanhVu, Le Goues, Claire, and Forrest, Stephanie. Automatically finding patches using genetic programming. In ICSE, 2009.