Program Synthesis Guided Reinforcement Learning

Yichen Yang, Jeevana Priya Inala, Osbert Bastani, Yewen Pu, Armando Solar-Lezama, Martin Rinard

Abstract
A key challenge for reinforcement learning is solving long-horizon planning and control problems. Recent work has proposed leveraging programs to help guide the learning algorithm in these settings. However, these approaches impose a high manual burden on the user, since they must provide a guiding program for every new task they seek to achieve. We propose an approach that leverages program synthesis to automatically generate the guiding program. A key challenge is how to handle partially observable environments. We propose model predictive program synthesis, which trains a generative model to predict the unobserved portions of the world, and then synthesizes a program based on samples from this model in a way that is robust to its uncertainty. We evaluate our approach on a set of challenging benchmarks, including a 2D Minecraft-inspired "craft" environment where the agent must perform a complex sequence of subtasks to achieve its goal, a box-world environment that requires abstract reasoning, and a variant of the craft environment where the agent is a MuJoCo ant. Our approach significantly outperforms several baselines, and performs essentially as well as an oracle that is given an effective program.
1. Introduction
Reinforcement learning has been applied to solving challenging planning and control problems (Mnih et al., 2015; Arulkumaran et al., 2017). Despite a significant amount of recent progress, solving long-horizon problems remains a significant challenge due to the combinatorial explosion of possible strategies.

One promising approach to addressing these issues is to leverage programs to guide the behavior of the agents (Andreas et al., 2017; Sun et al., 2020). In this paradigm, the user provides a sequence of high-level instructions designed to guide the agent. For instance, the program might encode intermediate subgoals that the agent should aim to achieve, but leave the reinforcement learning algorithm to discover how exactly to achieve these subgoals. In addition, to handle partially observable environments, these programs might encode conditionals that determine the course of action based on the agent's observations.

The primary drawback of these approaches is that the user becomes burdened with providing such a program for every new task. Not only is this process time-consuming for the user, but a poorly written program may hamper learning. A natural question is whether we can automatically synthesize these programs. That is, rather than require the user to provide the program, we instead have them provide a high-level specification that encodes only the desired goal. Then, our framework automatically synthesizes a program that achieves this specification. Finally, this program is used to guide the reinforcement learning algorithm.

The key challenge to realizing our approach is how to handle partially observable environments. In the fully observed setting, the program synthesis problem reduces to STRIPS planning (Fikes & Nilsson, 1971)—i.e., search over the space of possible plans to find one that achieves the goal. However, these techniques are hard to apply in settings where the environment is initially unknown.

To address this challenge, we propose an approach called model predictive program synthesis (MPPS). At a high level, our approach synthesizes the guiding program based on a conditional generative model of the environment, but in a way that is robust to the uncertainty in this model. In particular, for a user-provided goal specification φ, the agent chooses its actions using the following three steps:

• Hallucinator: First, inspired by world models (Ha & Schmidhuber, 2018), the agent keeps track of a conditional generative model g over possible realizations of the unobserved portions of the environment.

• Synthesizer: Next, given the world predicted by g, the agent synthesizes a program p that achieves φ assuming this prediction is accurate. Since world predictions are stochastic in nature, it samples multiple predicted worlds and computes the program that maximizes the probability of success according to these samples.
Figure 1. (a) An initial state for the craft environment. Bright regions are observed and dark ones are unobserved. This map has two zones separated by a stone boundary (blue line). The first zone contains the agent, 2 irons, and 1 wood; the second contains 1 iron and 1 gem. The goal is to get the gem. The agent represents the high-level structure of the map (e.g., resources in each zone) as abstraction variables. The ground-truth abstraction variables are in the top-right; we only show the counts of gems, irons, and woods in each zone and the zone containing the agent. The two thought bubbles below are abstraction variables hallucinated by the agent based on the observed parts of the map. In both, the zone that the agent is in contains a gem, so the synthesized program is "get gem". However, this program cannot achieve the goal. (b) The state after the agent took 20 actions, failed to obtain the gem, and is now synthesizing a new program. The agent has explored more of the map, so the hallucinations are more accurate, and the new program is a valid strategy for obtaining the gem.
• Executor: Finally, the agent executes the strategy encoded by p for a fixed number of steps N. Concretely, p is a sequence of components p = c_1; ...; c_k, where each component is an option c_τ = (π_τ, β_τ) (Sutton et al., 1999), which says to execute policy π_τ until condition β_τ holds.

If φ is not satisfied after N steps, then the above process is repeated. Since the hallucinator now has more information (assuming the agent has explored more of the environment), the agent now has a better chance of achieving its goal. Importantly, the agent is implicitly encouraged to explore, since it must do so to discover whether the current program can successfully achieve the goal φ. (A code sketch of this loop appears at the end of this section.)

Similar to Sun et al. (2020), the user instantiates our framework in a new domain by providing a set of prototype components c̃, where c̃ is a logical formula encoding a useful subtask for that domain. For instance, c̃ may encode that the agent should navigate to a goal position. The user does not need to provide a policy to achieve c̃; our framework uses reinforcement learning to automatically train such a policy c. Our executor reuses these policies c to solve different tasks in varying environments within the same domain. In particular, for a new task and/or environment, the user only needs to provide a specification φ, which is a logical formula encoding the goal of that task.

We instantiate this approach in the context of a 2D Minecraft-inspired environment (Andreas et al., 2017; Sohn et al., 2018; Sun et al., 2020), which we call the "craft environment", and a "box-world" environment (Zambaldi et al., 2019). We demonstrate that our approach significantly outperforms existing approaches for partially observable environments, while performing essentially as well as using handcrafted programs to guide the agent. In addition, we demonstrate that the policy we learn can be transferred to a continuous variant of the craft environment, where the agent is replaced by a MuJoCo (Todorov et al., 2012) ant.
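The following sketch summarizes the replanning loop in Python. It is a minimal illustration, not the authors' implementation: hallucinator.sample, synthesize, execute_program, and env are hypothetical stand-ins for the three components above (execute_program is sketched in Section 4).

# Minimal sketch of the hallucinate / synthesize / execute loop; all names
# are illustrative placeholders, not the paper's actual API.
def run_agent(env, hallucinator, prototypes, phi, N=20, m=3, T=100):
    obs, t = env.reset(), 0
    while t < T:
        # Hallucinator: sample m possible worlds consistent with obs so far.
        worlds = [hallucinator.sample(obs) for _ in range(m)]
        # Synthesizer: find the program that achieves the goal phi in the
        # maximum number of sampled worlds.
        program = synthesize(prototypes, phi, worlds)
        # Executor: run the program for at most N steps, then replan.
        obs, t, done = execute_program(env, obs, t, program, N, T)
        if done:  # beta_k held, so phi has been achieved
            return True
    return False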
Related work. There has been recent interest in program-guided reinforcement learning, where a program encoding high-level instructions on how to achieve the goal (essentially, a sequence of options) is used to guide the agent. Andreas et al. (2017) uses programs to guide agents that are initially unaware of any semantics of the programs (i.e., the program is just a sequence of symbols), with the goal of understanding whether the structure of the program alone is sufficient to improve learning. Jothimurugan et al. (2019) enables users to write specifications in a high-level language based on temporal logic. Then, they show how to translate these specifications into shaped reward functions to guide learning. Most closely related is recent work (Sun et al., 2020) that has demonstrated how program semantics can be used to guide reinforcement learning in the craft environment. As with this work, we assume that the user provides the semantics of each option in the program (i.e., the subgoal that should be achieved by that option), but not an actual policy implementing this option (which is learned using reinforcement learning). However, we do not assume that the user provides the program, just the overall goal.

More broadly, our work fits into the literature on combining high-level planning with reinforcement learning. In particular, there is a long literature on planning with options (Sutton et al., 1999) (also known as skills (Hausman et al., 2018)), including work on inferring options (Stolle & Precup, 2002). However, these approaches cannot be applied to MDPs with continuous state and action spaces or to partially observed MDPs. Recent work has addressed the former (Abel et al., 2020; Jothimurugan et al., 2021) by combining high-level planning with reinforcement learning to handle low-level control, but not the latter, whereas our work tackles both challenges. Similarly, classical planning algorithms such as STRIPS (Fikes & Nilsson, 1971) cannot handle uncertainty in the realization of the environment. There has also been work on replanning (Stentz et al., 1995) to handle small changes to an initially known environment, but it cannot handle environments that are initially completely unknown. Alternatively, there has been work on hierarchical planning in POMDPs (Charlin et al., 2007; Toussaint et al., 2008), but these approaches are not designed to handle continuous state and action spaces. We leverage program synthesis (Solar-Lezama, 2008) in conjunction with the world models approach (Ha & Schmidhuber, 2018) to address these issues.

Finally, there has broadly been recent interest in using program synthesis to learn programmatic policies that are more interpretable (Verma et al., 2018; Inala et al., 2021), verifiable (Bastani et al., 2018; Verma, 2019), and generalizable (Inala et al., 2020). In contrast, we are not directly synthesizing the policy, but a program to guide the policy.
2. Motivating Example
Figure 1a shows a 2D Minecraft-inspired crafting game. In this grid world, the agent can navigate and collect resources (e.g., wood), build tools (e.g., a bridge) at workshops using collected resources, and use the tools to achieve subtasks (e.g., use a bridge to cross water). The agent can only observe a small square of grid cells around its current position; since the environment is static, it also memorizes locations it has seen before. A single task consists of a randomly generated map (i.e., the environment) and goal (i.e., obtain a certain resource or build a certain tool).

To instantiate our framework, we provide prototype components that specify high-level behaviours such as getting wood or using a toolshed to build a bridge. Figure 2 shows the domain-specific language that encodes the set of prototypes. For each prototype, we need to provide a logical formula c̃ that formally specifies its desired behavior. Rather than specifying behavior over concrete states s, we instead specify it over abstraction variables that encode subsets of the state space. For instance, we divide the map into zones, which are regions separated by obstacles such as water and stone. As an example, the map in Figure 1a has two zones: the region containing the agent and the region blocked off by stones. Then, the zone the agent is currently in is represented by an abstraction variable z—i.e., the set of states s where the agent is in zone i is represented by the logical predicate z = i.

The prototype components are logical formulas over these abstraction variables—e.g., the prototype for "get wood" is

∀i, j. (z⁻ = i ∧ z⁺ = j) ⇒ (b⁻_{i,j} = connected) ∧ (ρ⁺_{j,wood} = ρ⁻_{j,wood} − 1) ∧ (ι⁺_{wood} = ι⁻_{wood} + 1).

In this formula, b_{i,j} indicates whether zones i and j are connected, ρ_{i,r} denotes the count of resource r in zone i, and ι_r denotes the count of resource r in the agent's inventory. The − and + superscripts on each abstraction variable indicate that it represents the state before the execution of the prototype and the state after the execution of the prototype, respectively.

Thus, this formula says that (i) the agent goes from zone i to j, (ii) i and j are connected, (iii) the count of wood in the agent's inventory increases by one, and (iv) the count of wood in zone j decreases by one. All of the prototype components we use are summarized in Appendix A.

Before solving any tasks, for each prototype c̃, our framework uses reinforcement learning to train a component c that implements c̃—i.e., an option c = (π, β) that attempts to satisfy the behavior encoded by the logical formula c̃.

To solve a new task, the user provides a logical formula φ encoding the goal of this task. Then, the agent acts in the environment to try to achieve φ. For example, Figure 1a shows the initial state of an agent where the task is to obtain a gem.

First, based on the observations so far, the agent uses the hallucinator g to predict multiple potential worlds, each of which represents a possible realization of the full map. One convenient aspect of our approach is that rather than predicting concrete states, it suffices to predict the abstraction variables used in the prototype components c̃ and goal specification φ. For instance, Figure 1a shows two samples of the world predicted by g; here, the only values it predicts are the number of zones in the map, the type of the boundary between the zones, and the counts of the resources and workshops in each zone.
In this example, the first predicted world contains two zones, and the second contains one zone. Note that in both predicted worlds, there is a gem located in the same zone as the agent.

Next, the agent synthesizes a program p that achieves the goal in the maximum possible number of predicted worlds. The synthesized program in Figure 1a is a single component "get gem", which is an option that searches the current zone (or zones already connected with the current zone) for a gem. Note that this program achieves the goal for the predicted worlds shown in Figure 1a.

Finally, the agent executes the program p = c_1; ...; c_k for a fixed number N of steps. In particular, it executes the policy π_τ of component c_τ = (π_τ, β_τ) until β_τ holds, upon which it switches to executing c_{τ+1}. In our example, there is only one component "get gem", so it executes the policy for this component until the agent finds a gem.

In this case, the agent fails to achieve φ since there is no gem in the same zone as the agent. Thus, the agent repeats the above process. Since the agent now has more observations, g more accurately predicts the world. For instance, Figure 1b shows the intermediate step when the agent does its first replanning. Note that it now correctly predicts that the only gem is in the second zone. As a result, the newly synthesized program is

p = get wood; use workbench; get iron; use factory; use axe; get gem,

where the first four components serve to build an axe. That is, it builds an axe to break the stone so it can get to the zone containing the gem. Finally, the agent executes this new program, which successfully finds the gem.

C := get R | use T | use W
R := wood | iron | grass | gold | gem
T := bridge | axe
W := factory | workbench | toolshed

Figure 2. Prototype components for the craft environment; the three kinds of prototypes are get resource (R), use tool (T), and use workshop (W).
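To make the abstraction concrete, the following sketch shows one way the "get wood" prototype above could be encoded as a Z3 formula, assuming a fixed number of zones and an integer encoding of boundary types; the variable layout (b_pre, rho_pre, etc.) is our own illustrative choice, not the paper's implementation.

from z3 import And, Implies, IntVal

CONNECTED = IntVal(0)  # assumed integer code for the "connected" boundary type

def get_wood_prototype(n_zones, z_pre, z_post, b_pre, rho_pre, rho_post,
                       iota_pre, iota_post):
    """Encode: for all i, j, (z- = i and z+ = j) implies zones i, j are
    connected, wood in zone j decreases by one, and inventory wood increases
    by one. The forall over zones is unrolled since the zone count is finite."""
    cases = []
    for i in range(n_zones):
        for j in range(n_zones):
            cases.append(Implies(
                And(z_pre == i, z_post == j),
                And(b_pre[i][j] == CONNECTED,
                    rho_post[j] == rho_pre[j] - 1,  # wood count in zone j
                    iota_post == iota_pre + 1)))    # wood in the inventory
    return And(cases)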
3. Problem Formulation
POMDP.
We consider a partially observed Markov decision process (POMDP) with states S ⊆ ℝⁿ, actions A ⊆ ℝᵐ, observations O ⊆ ℝ^q, initial state distribution P₀, observation function h : S → O, and transition function f : S × A → S. Given initial state s₀ ∼ P₀, policy π : O → A, and time horizon T ∈ ℕ, the generated trajectory is (s₀, a₀, s₁, a₁, ..., s_T, a_T), where o_t = h(s_t), a_t = π(o_t), and s_{t+1} = f(s_t, a_t).

We assume that the state includes the unobserved parts of the environment—e.g., in our craft environment, it represents both the entire map as well as the agent's current position.
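For concreteness, the trajectory semantics above transcribe directly into Python as follows; f, h, and pi are placeholder callables for the transition, observation, and policy functions.

def rollout(s0, f, h, pi, T):
    trajectory, s = [], s0
    for _ in range(T + 1):
        o = h(s)                  # o_t = h(s_t)
        a = pi(o)                 # a_t = pi(o_t)
        trajectory.append((s, a))
        s = f(s, a)               # s_{t+1} = f(s_t, a_t)
    return trajectory             # (s_0, a_0, ..., s_T, a_T)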
Programs. We consider programs p = c_1; ...; c_k that are composed of components c_τ ∈ C. Each component c represents an option c = (π, β), where π : O → A is a policy and β : O → {0, 1}. To execute p, the agent uses the options c_1, ..., c_k in sequence. To use option c_τ = (π_τ, β_τ), it takes actions π_τ(o) until β_τ(o) = 1; at this point, the agent switches to option c_{τ+1} and continues this process.

User-provided prototype components.
Rather than have the user directly provide the components C used in our programs, we instead have them provide prototype components c̃ ∈ C̃.

Figure 3. Architecture of our agent (the blue box): the hallucinator (C-VAE), the synthesizer, and the executor (modular network).

Importantly, prototypes can be shared across closely related tasks. Each prototype component is a logical formula that encodes the expected desired behavior of a component. More precisely, c̃ is a logical formula over variables s⁻ and s⁺, where s⁻ denotes the initial state before executing the option and s⁺ denotes the final state after executing the option. For instance, the prototype component

c̃ ≡ (s⁻ = s₁ ⇒ s⁺ = s₂) ∨ (s⁻ = s₃ ⇒ s⁺ = s₄)

says that if the POMDP is currently in state s₁, then c should transition it to s₂, and if it is currently in state s₃, then c should transition it to s₄.

Rather than directly define c̃ over the states s, we can instead define it over abstraction variables that represent subsets of the state space. This approach can improve the scalability of our synthesis algorithm—e.g., it enables us to operate over continuous state spaces as long as the abstraction variables themselves are discrete.

User-provided specification.
To specify a task, the user provides a specification φ, which is a logical formula over states s; in general, φ may not directly refer to s but to other variables that represent subsets of S. Our goal is to design an agent that achieves any given φ (i.e., acts in the POMDP to reach a state that satisfies φ) as quickly as possible.
4. Model Predictive Program Synthesis
We describe the architecture of our agent, depicted in Figure 3. It is composed of three parts: the hallucinator g, which predicts possible worlds; the synthesizer, which generates a program p that succeeds with high probability according to worlds sampled from g; and the executor, which uses p to act in the POMDP. These parts are run once every N steps to generate a program p to execute for the subsequent N steps, until the user-provided specification φ is achieved.
Hallucinator. First, the hallucinator is a conditional generative model trained to predict the environment given the observations so far. For simplicity, we assume the observation o on the current step already encodes all observations so far. Since our craft environment is static, o simply encodes the portion of the map that has been revealed so far, with a special symbol indicating parts that are unknown. To be precise, the hallucinator g encodes a distribution g(s | o), which is trained to approximate the actual distribution P(s | o). Then, at each iteration (i.e., once every N steps), our agent samples m worlds ŝ_1, ..., ŝ_m ∼ g(· | o). We choose g to be a conditional variational auto-encoder (CVAE) (Sohn et al., 2015).

When using abstraction variables to represent the states, we can have g directly predict the values of these abstraction variables instead of having g predict the concrete state. Intuitively, this approach works since, as described below, the synthesizer only needs to know the values of the abstraction variables to generate a program.
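As a sketch, sampling from a trained CVAE decoder might look as follows in PyTorch; the decoder returning Gaussian parameters over the abstraction variables is an assumption of this illustration, and the class is not the paper's implementation.

import torch

class Hallucinator:
    def __init__(self, decoder, latent_dim):
        self.decoder = decoder          # trained decoder g_theta(s | z, o)
        self.latent_dim = latent_dim

    @torch.no_grad()
    def sample(self, obs, m=3):
        """Sample m worlds s_hat ~ g(. | o); obs has shape (1, obs_dim)."""
        obs = obs.expand(m, -1)                     # repeat the observation
        z = torch.randn(m, self.latent_dim)         # z ~ N(0, I)
        mu, sigma = self.decoder(z, obs)            # assumed Gaussian outputs
        return mu + sigma * torch.randn_like(mu)    # reparameterized sample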
Synthesizer. The synthesizer aims to compute the program that maximizes the probability of satisfying the goal φ:

p* = argmax_p E_{P(s|o)}[1(p solves φ for s)] ≈ argmax_p (1/m) Σ_{j=1}^m 1(p solves φ for ŝ_j),   (1)

where the ŝ_j are samples from g. The objective (1) can be expressed as a MaxSAT problem (Krentel, 1986). In particular, suppose for now that we are searching over programs p = c_1; ...; c_k of fixed length k. Then, consider the constrained optimization problem

argmax_{ξ_1,...,ξ_k} (1/m) Σ_{j=1}^m 1(∃ s₁⁻, s₁⁺, ..., s_k⁻, s_k⁺. ψ_j),   (2)

where the ξ_τ and s_τ^δ (for τ ∈ {1, ..., k} and δ ∈ {−, +}) are the optimization variables. Intuitively, ξ_1, ..., ξ_k encode the program p = c_1; ...; c_k, and ψ_j encodes the event that p solves φ for world ŝ_j. In particular, we have

ψ_j ≡ ψ_{j,start} ∧ [∧_{τ=1}^k ψ_{j,τ}] ∧ [∧_{τ=1}^{k−1} ψ′_{j,τ}] ∧ ψ_{j,end},

where ψ_{j,start} ≡ (s₁⁻ = ŝ_j) encodes that the initial state is ŝ_j,

ψ_{j,τ} ≡ ((ξ_τ = c̃) ⇒ c̃(s_τ⁻, s_τ⁺))

encodes that if the τth component has prototype c̃, then the τth component should transition the system from s_τ⁻ to s_τ⁺,

ψ′_{j,τ} ≡ (s_τ⁺ = s_{τ+1}⁻)

encodes that the final state of component τ should equal the initial state of component τ + 1, and

ψ_{j,end} ≡ φ(s_k⁺)

encodes that the final state of the last component should satisfy the user-provided goal φ.

We use a MaxSAT solver to solve (2) (De Moura & Bjørner, 2008). Given a solution ξ_1 = c̃_1, ..., ξ_k = c̃_k, the synthesizer returns the corresponding program p = c_1; ...; c_k. We incrementally search for longer and longer programs, starting from k = 1 and incrementing k until either we find a program that achieves at least a minimum objective value, or we reach a maximum program length k_max, at which point we use the best program found so far.
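A sketch of objective (2) using Z3's Optimize interface follows; adding one soft clause per sampled world makes the solver maximize the number of worlds in which the program succeeds. The helpers make_abstract_state, encode_initial, encode_prototype, and encode_goal are hypothetical, and sharing s[t+1] between consecutive components folds in the chaining constraint ψ′_{j,τ}.

from z3 import Optimize, Int, And, Implies, sat

def synthesize_fixed_length(prototypes, worlds, k,
                            make_abstract_state, encode_initial,
                            encode_prototype, encode_goal):
    opt = Optimize()
    xi = [Int(f"xi_{t}") for t in range(k)]  # xi_t: prototype index of step t
    for j, world in enumerate(worlds):
        # Fresh abstract-state variables per world; s[t] plays both the
        # role of s_t^+ and s_{t+1}^-, which enforces psi'_{j,t}.
        s = [make_abstract_state(f"w{j}_s{t}") for t in range(k + 1)]
        psi = [encode_initial(s[0], world)]              # psi_{j,start}
        for t in range(k):
            for c, proto in enumerate(prototypes):       # psi_{j,t}
                psi.append(Implies(xi[t] == c,
                                   encode_prototype(proto, s[t], s[t + 1])))
        psi.append(encode_goal(s[k]))                    # psi_{j,end}
        opt.add_soft(And(psi), weight=1)                 # one clause per world
    if opt.check() == sat:
        model = opt.model()
        return [model[x].as_long() for x in xi]          # chosen prototypes
    return None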
Executor. The executor runs the synthesized program p = c_1; ...; c_k for the subsequent N steps. It iteratively uses each component c_τ = (π_τ, β_τ), starting from τ = 1. In particular, it uses action a_t = π_τ(o_t) at each time step t, where o_t is the observation on that step. It does so until β_τ(o_t) = 1, at which point it increments τ ← τ + 1. Finally, it continues until either it has completed running the program (i.e., β_k(o_t) = 1), or after N time steps. In the former case, by construction, the goal φ has been achieved, so the agent terminates. In the latter case, the agent iteratively reruns the above three steps based on the current observation to synthesize a new program. At this point, the hallucinator likely has additional information about the environment, so the new program has a greater chance of achieving φ.
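A sketch of this executor (the execute_program placeholder used in the sketch at the end of Section 1) follows; the env.step(action) -> observation interface is an assumption of the illustration.

def execute_program(env, obs, t, program, N, T):
    """Run p = c_1; ...; c_k for at most N steps (and at most T in total)."""
    deadline, tau = t + N, 0
    while t < min(deadline, T):
        policy, beta = program[tau]
        if beta(obs):                 # component tau finished: advance
            tau += 1
            if tau == len(program):   # beta_k held, so phi is achieved
                return obs, t, True
            continue
        obs = env.step(policy(obs))   # a_t = pi_tau(o_t)
        t += 1
    return obs, t, False              # budget exhausted: replan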
5. Learning Algorithm
Next, we describe our algorithm for learning the parameters of the models used by our agent. In particular, there are two parts that need to be learned: (i) the parameters of the conditional variational auto-encoder (CVAE) hallucinator g, and (ii) the components c based on the user-provided prototype components c̃.

Hallucinator.
We choose the hallucinator g to be a conditional variational auto-encoder (CVAE) (Sohn et al., 2015) trained to estimate the distribution P(s | o) of states given the current observation. First, we obtain samples (o_t, s_t) using rollouts collected by a random agent. Then, we train the CVAE using the standard evidence lower bound (ELBo) on the log likelihood (Kingma & Welling, 2013):

ℓ(θ, θ̃) = E_{P(s,o)}[ E_{h_θ̃(z | s,o)}[log g_θ(s | z, o)] − D_KL(h_θ̃(z | s, o) ‖ N(z; 0, I)) ],   (3)

where h_θ̃ is the encoder and g_θ is the decoder:

h_θ̃(z | s, o) = N(z; μ̃_θ̃(s, o), σ̃_θ̃(s, o) · I),
g_θ(s | z, o) = N(s; μ_θ(z, o), σ_θ(z, o) · I),

where μ_θ, σ_θ, μ̃_θ̃, and σ̃_θ̃ are neural networks, and I is the identity matrix. We train h_θ̃ and g_θ by jointly optimizing (3), and then choose the hallucinator to be g = g_θ.
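As a sketch, the ELBo objective (3) for Gaussian encoder and decoder can be written in PyTorch as follows; the convention that the networks return means and log-variances is ours, not necessarily the paper's.

import torch

def elbo_loss(encoder, decoder, s, o):
    """Negative ELBo for one batch of (state, observation) pairs."""
    mu_z, logvar_z = encoder(s, o)                      # h(z | s, o)
    z = mu_z + torch.exp(0.5 * logvar_z) * torch.randn_like(mu_z)
    mu_s, logvar_s = decoder(z, o)                      # g(s | z, o)
    # log g(s | z, o) for a diagonal Gaussian, up to an additive constant.
    rec = -0.5 * ((s - mu_s) ** 2 / logvar_s.exp() + logvar_s).sum(dim=-1)
    # Closed-form KL(h(z | s, o) || N(0, I)).
    kl = -0.5 * (1 + logvar_z - mu_z ** 2 - logvar_z.exp()).sum(dim=-1)
    return (kl - rec).mean()                            # minimize this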
Executor. Our framework uses reinforcement learning to learn components c that implement the user-provided prototype components c̃. The learned components c ∈ C can be shared across multiple tasks. Our approach is based on neural module networks for reinforcement learning (Andreas et al., 2017). In particular, we train a neural module π for each component c. In addition, we construct a monitor β that checks when to terminate execution, and take c = (π, β).

First, β is constructed from c̃—in particular, it returns whether c̃ is satisfied based on the current observation o. Note that we have assumed that c̃ can be checked based only on o; this assumption holds for all prototypes in our craft environment. If it does not hold, we additionally train π to explore in a way that enables it to check c̃.

Now, to train the policies π, we generate random initial states s and goal specifications φ. For training, we use programs synthesized from the fully observed environments; such a program p is guaranteed to achieve φ from s. We use this approach since it avoids the need to run the synthesizer repeatedly during training.

Then, we sample a rollout ((o_1, a_1, r_1), ..., (o_T, a_T, r_T)) by using the executor in conjunction with the program p and the current options c_τ = (π_τ, β_τ) (where each π_τ is randomly initialized). We give the agent a reward r̃ on each time step where it achieves the subgoal of a single component c_τ—i.e., where the executor increments τ ← τ + 1. Then, we use actor-critic reinforcement learning (Konda & Tsitsiklis, 2000) to update the parameters of each policy π.

Finally, as in Andreas et al. (2017), we use curriculum learning to speed up training—i.e., we train using goals that can be achieved with shorter programs first.
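The sketch below illustrates how the subgoal reward r̃ could be assigned while collecting a rollout; the env interface and reward bookkeeping are illustrative assumptions, and the actor-critic update itself is omitted.

def collect_rollout(env, program, r_tilde=1.0, T=100):
    """Reward r_tilde each time beta_tau first holds (subgoal achieved)."""
    obs, tau, rollout = env.reset(), 0, []
    for _ in range(T):
        policy, beta = program[tau]
        action = policy(obs)
        next_obs = env.step(action)
        reward = 0.0
        if beta(next_obs):            # component tau's subgoal achieved
            reward, tau = r_tilde, tau + 1
        rollout.append((obs, action, reward, tau))
        obs = next_obs
        if tau == len(program):       # whole program finished
            break
    return rollout                    # consumed by actor-critic updates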
6. Experiments
In this section, we describe empirical evaluations of our approach. As we show, it significantly outperforms non-program-guided baselines, while performing essentially as well as an oracle that is given the ground truth program.
2D-craft. We consider a 2D Minecraft-inspired crafting game based on the ones in Andreas et al. (2017); Sun et al. (2020) (Figure 1a). A map in this domain is a square grid, where each grid cell either is empty or contains a resource (e.g., wood or gold), an obstacle (e.g., water or stone), or a workshop. In each episode, we randomly sample a map from a predefined distribution, a random initial position for the agent, and a random task (one of 14 possibilities, each of which involves getting a certain resource or building a certain tool). The more complicated tasks may require the agent to build intermediate tools (e.g., a bridge or an axe) to reach initially inaccessible regions to achieve its goal. In contrast to prior work, our agent does not initially observe the entire map; instead, it can only observe the grid cells in a small square around it. Since the environment is static, any previously visited cells remain visible. The agent has a discrete action space, including move actions in four directions and a special "use" action that can pick up a resource, use a workshop, or use a tool. The maximum length of each episode is T = 100.

Ant-craft.
Next, we consider a variant of 2D-craft where the agent is replaced with a MuJoCo (Todorov et al., 2012) ant (Schulman et al., 2016) (illustrated in Figure 5a). For simplicity, we do not model the physics of the interaction between the ant and its environment—e.g., the ant automatically picks up resources in the grid cell it currently occupies. The policy needs to learn the continuous control to walk the ant as well as the strategy to perform the tasks. This environment is designed to demonstrate that our approach can be applied to continuous control tasks.
Box-world.
Finally, we consider the box-world environment (Zambaldi et al., 2019), which requires abstract relational reasoning. It is a grid world with locks and boxes randomly scattered throughout (visualized in Figure 5b). Each lock occupies a single grid cell, and the box it locks occupies the adjacent grid cell. The box contains a key that can open a subsequent lock. Each lock and box is colored; the key needed to open a lock is contained in the box of the same color. The agent is given a key to get started, and its goal is to unlock the box of a given color. The agent can move in the room in four directions; it opens a lock for which it has the key simply by walking over it, at which point it can pick up the adjacent key. We assume that once the agent has the key of a given color, it can unlock multiple locks of that color. We modify the original environment to be partially observable; in particular, the agent can observe a small square of grid cells around it (as well as the previously observed grid cells). In each episode, we sample a random configuration of the map, where the number of boxes in the path to the goal is randomly chosen between 1 and 4, and the number of "distractor branches" (i.e., boxes that the agent can open but that do not help it reach the goal) is also randomly chosen between 1 and 4.

End-to-end. An end-to-end neural policy trained with the same actor-critic algorithm and curriculum learning as discussed in Section 5. It uses one actor network per task.
World models.
The world models approach (Ha & Schmidhuber, 2018) handles partial observability by using a generative model to predict the future. It trains a V model (VAE) and an M model (MDN-RNN) to learn a compressed spatial and temporal representation of the environment. The V model takes the observation at each step t and encodes it into a latent vector z_t. The M model is a recurrent model that takes the latent vectors z_1, ..., z_t as input and predicts z_{t+1}. The latent states of the M model and the latent vectors z_t from the V model together form the world model features, which are used as inputs to the controller (C model).

Figure 4. (a,b) Training curves for the 2D-craft environment. (c,d) Training curves for the box-world environment. (a,c) The average reward on the test set over the course of training; the agent gets a reward of 1 if it successfully finishes the task within the time horizon, and 0 otherwise. (b,d) The average number of steps taken to complete the task on the test set. We show our approach ("Ours"), the program guided agent ("Oracle"), the end-to-end neural policy ("End-to-end"), world models ("World models"), and relational deep RL ("Relational").

Figure 5. (a) The Ant-craft environment. The policy needs to control the ant to perform the crafting tasks. (b) The box-world environment. The grey pixel denotes the agent. The goal is to get the white key. The unobserved parts of the map are marked with "x". The key currently held by the agent is shown in the top-left corner. In this map, the number of boxes in the path to the goal is 4, and the map contains 1 distractor branch.

Program guided agent.
This technique uses a program to guide the agent policy (Sun et al., 2020). Unlike our approach, the ground truth program (i.e., a program guaranteed to achieve the goal) is provided to the agent at the beginning; we synthesize this program using the full map (i.e., including parts of the map that are unobserved by the agent). This baseline can be viewed as an oracle, since it is strictly more powerful than our approach.
Relational Deep RL.
For the box-world environment, we also compare with the relational deep RL approach (Zambaldi et al., 2019), which replaces the policy network with a relational module based on the multi-head attention mechanism (Vaswani et al., 2017) operating over the map features. The output of the relational module is used as input to an MLP network that computes the action.

Figure 6. Comparison of behaviors between the optimistic approach (left) and our MPPS approach (right), in a scenario where the goal is to get the gem. (a) This state is the point at which the optimistic approach first synthesizes the correct program instead of the (incorrect) one "get gem". It only does so after the agent has observed all the squares in its current zone (the green arrows show the agent's trajectory so far). (b) The initial state of our MPPS strategy. It directly synthesizes the correct program, since the hallucinator knows the gem is most likely in the other zone. Thus, the agent completes the task much more quickly.
Implementation details. For our approach, we use a CVAE as the hallucinator, with MLPs (a hidden layer of dimension 200) for the encoder and the decoder. We pre-train the CVAE on 100 rollouts with 100 timesteps in each rollout—i.e., 10,000 (s, o) pairs. We use the Z3 SMT solver (De Moura & Bjørner, 2008) to solve the MaxSAT synthesis formula. We set the number of sampled worlds to m = 3, and the number of steps between replanning to N = 20. We use the same architecture for the actor networks and critic networks across our approach and all baselines: for actor networks, we use an MLP with a hidden layer of dimension 128, and for critic networks, an MLP with a hidden layer of dimension 32. We train each model on 400K episodes, and evaluate on a test set containing 10 scenarios per task.

Ant-craft.
We first pre-train a goal-following policy for the ant: given a randomly chosen goal position, this policy controls the ant to move to that position. We use the soft actor-critic algorithm (Haarnoja et al., 2018) for pre-training. The executor in our approach, as well as our baseline policies, outputs actions that are translated into goal positions as inputs to this ant controller. We let the ant controller run for 50 timesteps in the simulator to execute each move action from the upstream policies. We initialize each policy with the trained model from the 2D-craft environment, and fine-tune it on the Ant-craft environment for 40K episodes.

Table 1. Average rewards and average completion times (i.e., number of steps) on the test set for the Ant-craft environment, for the best policy found by each approach.

Approach        Avg. reward    Avg. finish step
End-to-end      0.49           60.7
World models    0.50           59.3
Ours            –              –
Box-world.
Following Zambaldi et al. (2019), we use a one-layer CNN with 32 kernels to process the raw map inputs before feeding into the downstream networks across all approaches. For the programs in our approach, we have a prototype component for each color, where the desired behavior of the component is to get the key of that color. The full definition of the prototype components we use for box-world is in Appendix B. For the hallucinator CVAE, we use the same architecture as in the craft environment with a hidden dimension of 300, trained with 100K (s, o) pairs. For the synthesizer, we set m = 3 and N = 10. We train each model for 200K episodes, and evaluate on a test set containing 10 scenarios per level. Each level has a specific number of boxes in the path to the goal (i.e., the goal length). Our test set contains four levels with goal lengths between 1 and 4.

2D-craft. Figures 4a and 4b show the training curves of each approach. As can be seen, our approach learns a substantially better policy than the unsupervised baselines; it solves a larger percentage of test scenarios while also taking less time. Compared with the program guided agent (i.e., the oracle), our approach achieves a similar average reward with a slightly longer average finish time. These results demonstrate that our approach significantly outperforms non-program-guided baselines, while performing nearly as well as an oracle that knows the ground truth program.
Ant-craft.
Table 1 shows results for the best policy found using each approach. As before, our approach significantly outperforms the baseline approaches, while performing comparably with the oracle approach.
Box-world.
Figures 4c and 4d show the training curves. As before, our approach performs substantially better than the baselines, and achieves a similar performance as the program guided agent (i.e., the oracle).

Table 2. Comparison to the optimistic ablation on challenging tasks for the 2D-craft environment.

Approach      Avg. reward    Avg. finish step
Optimistic    0.60           53.7
Ours          –              –
Finally, we compare our model predictive program synthesis with an alternative, optimistic synthesis strategy: it considers the unobserved parts of the map to be in any possible configuration, and synthesizes the shortest program that achieves the goal for at least one such configuration. We compare on the most challenging tasks for 2D-craft (i.e., get gold or get gem), since for these tasks, the ground truth program depends heavily on the map. We show results in Table 2. As can be seen, our approach significantly outperforms the optimistic synthesis approach, and performs comparably to the oracle. Finally, in Figure 6, we illustrate the difference in behavior between our approach and the optimistic strategy.
7. Conclusion
We have proposed an algorithm for synthesizing programs to guide reinforcement learning. Our algorithm, called model predictive program synthesis, handles partially observed environments by leveraging the world models approach: it learns a generative model over the remainder of the world conditioned on the observations thus far. In particular, it synthesizes a guiding program that accounts for the uncertainty in the world model. Our experiments demonstrate that our approach significantly outperforms non-program-guided approaches, while performing comparably to an oracle that is given access to the ground truth program. These results demonstrate that our approach can obtain the benefits of program-guided reinforcement learning without requiring the user to provide a guiding program for every new task and world configuration.
References
Abel, D., Umbanhowar, N., Khetarpal, K., Arumugam, D., Precup, D., and Littman, M. Value preserving state-action abstractions. In International Conference on Artificial Intelligence and Statistics, pp. 1639–1650. PMLR, 2020.

Andreas, J., Klein, D., and Levine, S. Modular multitask reinforcement learning with policy sketches. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 166–175. PMLR, 2017. URL http://proceedings.mlr.press/v70/andreas17a.html.

Arulkumaran, K., Deisenroth, M. P., Brundage, M., and Bharath, A. A. Deep reinforcement learning: A brief survey. IEEE Signal Processing Magazine, 34(6):26–38, 2017.

Bastani, O., Pu, Y., and Solar-Lezama, A. Verifiable reinforcement learning via policy extraction. arXiv preprint arXiv:1805.08328, 2018.

Charlin, L., Poupart, P., and Shioda, R. Automated hierarchy discovery for planning in partially observable environments. Advances in Neural Information Processing Systems, 19:225, 2007.

De Moura, L. and Bjørner, N. Z3: An efficient SMT solver. In Proceedings of the Theory and Practice of Software, 14th International Conference on Tools and Algorithms for the Construction and Analysis of Systems, TACAS'08/ETAPS'08, pp. 337–340. Springer-Verlag, 2008.

Fikes, R. E. and Nilsson, N. J. STRIPS: A new approach to the application of theorem proving to problem solving. Artificial Intelligence, 2(3-4):189–208, 1971.

Ha, D. and Schmidhuber, J. World models. CoRR, abs/1803.10122, 2018. URL http://arxiv.org/abs/1803.10122.

Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 1861–1870. PMLR, 2018. URL http://proceedings.mlr.press/v80/haarnoja18b.html.

Hausman, K., Springenberg, J. T., Wang, Z., Heess, N., and Riedmiller, M. Learning an embedding space for transferable robot skills. In International Conference on Learning Representations, 2018.

Inala, J. P., Bastani, O., Tavares, Z., and Solar-Lezama, A. Synthesizing programmatic policies that inductively generalize. In International Conference on Learning Representations, 2020.

Inala, J. P., Yang, Y., Paulos, J., Pu, Y., Bastani, O., Kumar, V., Rinard, M., and Solar-Lezama, A. Neurosymbolic transformers for multi-agent communication. arXiv preprint arXiv:2101.03238, 2021.

Jothimurugan, K., Alur, R., and Bastani, O. A composable specification language for reinforcement learning tasks. In NeurIPS, 2019.

Jothimurugan, K., Bastani, O., and Alur, R. Abstract value iteration for hierarchical reinforcement learning. In AISTATS, 2021.

Kingma, D. P. and Welling, M. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.

Konda, V. and Tsitsiklis, J. Actor-critic algorithms. In Advances in Neural Information Processing Systems, volume 12, pp. 1008–1014. MIT Press, 2000. URL https://proceedings.neurips.cc/paper/1999/file/6449f44a102fde848669bdd9eb6b76fa-Paper.pdf.

Krentel, M. W. The complexity of optimization problems. In Proceedings of the Eighteenth Annual ACM Symposium on Theory of Computing, STOC '86, pp. 69–76. Association for Computing Machinery, 1986. doi: 10.1145/12130.12138. URL https://doi.org/10.1145/12130.12138.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

Schulman, J., Moritz, P., Levine, S., Jordan, M., and Abbeel, P. High-dimensional continuous control using generalized advantage estimation. In Proceedings of the International Conference on Learning Representations (ICLR), 2016.

Sohn, K., Lee, H., and Yan, X. Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems, volume 28, pp. 3483–3491. Curran Associates, Inc., 2015. URL https://proceedings.neurips.cc/paper/2015/file/8d55a249e6baa5c06772297520da2051-Paper.pdf.

Sohn, S., Oh, J., and Lee, H. Hierarchical reinforcement learning for zero-shot generalization with subtask dependencies. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS'18, pp. 7156–7166. Curran Associates Inc., 2018.

Solar-Lezama, A. Program synthesis by sketching. Citeseer, 2008.

Stentz, A. et al. The focussed D* algorithm for real-time replanning. In IJCAI, volume 95, pp. 1652–1659, 1995.

Stolle, M. and Precup, D. Learning options in reinforcement learning. In International Symposium on Abstraction, Reformulation, and Approximation, pp. 212–223. Springer, 2002.

Sun, S.-H., Wu, T.-L., and Lim, J. J. Program guided agent. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=BkxUvnEYDH.

Sutton, R. S., Precup, D., and Singh, S. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1-2):181–211, 1999.

Todorov, E., Erez, T., and Tassa, Y. MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033, 2012. doi: 10.1109/IROS.2012.6386109.

Toussaint, M., Charlin, L., and Poupart, P. Hierarchical POMDP controller optimization by likelihood maximization. In UAI, volume 24, pp. 562–570, 2008.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention is all you need, 2017.

Verma, A. Verifiable and interpretable reinforcement learning through program synthesis. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 9902–9903, 2019.

Verma, A., Murali, V., Singh, R., Kohli, P., and Chaudhuri, S. Programmatically interpretable reinforcement learning. In International Conference on Machine Learning, pp. 5045–5054. PMLR, 2018.

Zambaldi, V., Raposo, D., Santoro, A., Bapst, V., Li, Y., Babuschkin, I., Tuyls, K., Reichert, D., Lillicrap, T., Lockhart, E., Shanahan, M., Langston, V., Pascanu, R., Botvinick, M., Vinyals, O., and Battaglia, P. Deep reinforcement learning with relational inductive biases. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=HkxaFoC9KQ.

A. Prototype Components for Craft
In this section, we describe the prototype components (i.e., logical formulas encoding option pre/postconditions) that we use for the craft environment. First, recall that the domain-specific language that encodes the set of prototypes for the craft environment is

C := get R | use T | use W
R := wood | iron | grass | gold | gem
T := bridge | axe
W := factory | workbench | toolshed

Also, the set of possible artifacts (objects that can be made in some workshop using resources or other artifacts) in the craft environment is

A = { bridge, axe, plank, stick, cloth, rope, bed, shears, ladder }.

We define the following abstraction variables:

• Zone: z = i indicates the agent is in zone i.

• Boundary: b_{i,j} = b indicates how zones i and j are connected, where b ∈ { connected, water, stone, not adjacent }.

• Resource: ρ_{i,r} = n indicates that there are n units of resource r in zone i.

• Workshop: ω_{i,r} = b, where b ∈ { true, false }, indicates whether there exists a workshop r in zone i.

• Inventory: ι_r = n indicates that there are n objects r (either a resource or an artifact) in the agent's inventory.

We use z⁻, b⁻, ρ⁻, ω⁻, ι⁻ and z⁺, b⁺, ρ⁺, ω⁺, ι⁺ to denote the initial state and final state for a prototype component, respectively. Now, the logical formulas for each prototype component are defined as follows.

(1) "get r" (for any resource r ∈ R). First, we have the following prototype component telling the agent to obtain a specific resource r:

∀i, j. (z⁻ = i ∧ z⁺ = j) ⇒ (b⁻_{i,j} = connected) ∧ (ρ⁺_{j,r} = ρ⁻_{j,r} − 1) ∧ (ι⁺_r = ι⁻_r + 1) ∧ Q.

Here, Q refers to the conditions that the other fields of the abstract state stay the same—i.e.,

Q ≡ (b⁺ = b⁻) ∧ (ω⁺ = ω⁻) ∧ (ι⁺_{\r} = ι⁻_{\r}) ∧ (ρ⁺_{\(j,r)} = ρ⁻_{\(j,r)}),

where ι_{\r} means all the other fields in ι except ι_r, and similarly for ρ_{\(j,r)}. In particular, Q addresses the frame problem from classical planning.

(2) "use r" (for any workshop r ∈ W). Next, we have a prototype component telling the agent to use a workshop to create an artifact. To do so, we introduce a set of auxiliary variables to denote the number of artifacts made in this component: m_o = n indicates that n units of artifact o are made. We denote the set of artifacts that can be made at workshop r as A_r, and the number of units of ingredient q needed to make 1 unit of artifact o as k_{o,q}, where q ∈ R ∪ A; note that { A_r } and { k_{o,q} } come from the rules of the game. Then, the logical formula for "use r" is

∀i, j. (z⁻ = i ∧ z⁺ = j) ⇒ (b⁻_{i,j} = connected) ∧ (ω_{j,r} = true)
∧ (Σ_{o ∈ A_r} m_o ≥ 1) ∧ (Σ_{o ∉ A_r} m_o = 0)
∧ (∀q ∈ R. ι⁺_q = ι⁻_q − Σ_{o ∈ A_r} k_{o,q} m_o)
∧ (∀q ∈ A. ι⁺_q = ι⁻_q − Σ_{o ∈ A_r} k_{o,q} m_o + m_q)
∧ (∀o ∈ A_r. ¬(∧_q ι⁺_q ≥ k_{o,q})) ∧ Q,

where

Q ≡ (b⁺ = b⁻) ∧ (ω⁺ = ω⁻) ∧ (ρ⁺ = ρ⁻).

This formula reflects the game setting that when the agent uses a workshop, it will make artifacts until the ingredients in the inventory are depleted.

(3) "use r" (r = bridge/axe). Next, we have the following prototype component for telling the agent to use a tool. The formula for this prototype component encodes the logic of zone connectivity. In particular, it is

∀i, j. (z⁻ = i ∧ z⁺ = j) ⇒ (b⁻_{i,j} = water/stone) ∧ (b⁺_{i,j} = connected) ∧ (ι⁺_r = ι⁻_r − 1)
∧ (∀i′, j′. (b⁺_{i′,j′} = connected) ⇒ ((b⁻_{i′,j′} = connected) ∨ X))
∧ (∀i′, j′. (b⁺_{i′,j′} ≠ connected) ⇒ (b⁺_{i′,j′} = b⁻_{i′,j′})) ∧ Q,

where

X ≡ (b⁻_{i′,i} = connected ∨ b⁻_{i′,j} = connected) ∧ (b⁻_{j′,i} = connected ∨ b⁻_{j′,j} = connected),
Q ≡ (ω⁺ = ω⁻) ∧ (ρ⁺ = ρ⁻) ∧ (ι⁺_{\r} = ι⁻_{\r}).

B. Prototype Components for Box World
In this section, we describe the prototype components for the box world. They are all of the form "get k", where k ∈ K is a color in the set of possible colors in the box world. First, we define the following abstraction variables:

• Box: b_{k₁,k₂} = n indicates that there are n boxes with key color k₁ and lock color k₂ in the map.

• Loose key: ℓ_k = b, where b ∈ { true, false }, indicates whether there exists a loose key of color k in the map.

• Agent's key: ι_k = b, where b ∈ { true, false }, indicates whether the agent holds a key of color k.

As in the craft environment, we use b⁻, ℓ⁻, ι⁻ and b⁺, ℓ⁺, ι⁺ to denote the initial state and final state for a prototype component, respectively. Since a configuration of the map in the box world can contain at most one loose key, we add the cardinality constraint Card(ℓ) ≤ 1, where Card(·) counts the number of variables that are true. Then, the logical formula defining the prototype component "get k" is X ∨ Y, where

X ≡ ℓ⁻_k ∧ ι⁺_k ∧ (Card(ℓ⁺) = 0) ∧ (b⁺ = b⁻),
Y ≡ (Card(ι⁻) = 1) ∧ ι⁺_k ∧ ¬ι⁻_k ∧ (ℓ⁺ = ℓ⁻)
  ∧ (∀k′. ι⁻_{k′} ⇒ ((b⁺_{k,k′} = b⁻_{k,k′} − 1) ∧ (b⁺_{\(k,k′)} = b⁻_{\(k,k′)}))).

In particular, X encodes the desired behavior when the agent picks up a loose key k, and Y encodes the desired behavior when the agent unlocks a box to get key k.