Learning Navigation Subroutines from Egocentric Videos
Ashish Kumar, Saurabh Gupta, Jitendra Malik
UC Berkeley, Facebook AI Research, UIUC
ashish [email protected], [email protected], [email protected]
Abstract:
Planning at a higher level of abstraction instead of low level torques improves the sample efficiency in reinforcement learning, and computational efficiency in classical planning. We propose a method to learn such hierarchical abstractions, or subroutines, from egocentric video data of experts performing tasks. We learn a self-supervised inverse model on small amounts of random interaction data to pseudo-label the expert egocentric videos with agent actions. Visuomotor subroutines are acquired from these pseudo-labeled videos by learning a latent intent-conditioned policy that predicts the inferred pseudo-actions from the corresponding image observations. We demonstrate our proposed approach in the context of navigation, and show that we can successfully learn consistent and diverse visuomotor subroutines from passive egocentric videos. We demonstrate the utility of our acquired visuomotor subroutines by using them as is for exploration, and as sub-policies in a hierarchical RL framework for reaching point goals and semantic goals. We also demonstrate the behavior of our subroutines in the real world by deploying them on a real robotic platform. Project website: https://ashishkumar1993.github.io/subroutines/.
Keywords:
Subroutines, Passive Data, Hierarchical Reinforcement Learning
Every morning, when you decide to get a cup of coffee from the kitchen, you go down the hallway, turn left into the corridor and then enter the room on the right. Instead of deciding the exact muscle torques, you reason at this higher level of abstraction by composing reusable lower-level visuomotor subroutines to reach your goal. These visuomotor subroutines are classically known as operators in STRIPS planning [1], or more recently as options in RL [2]. Once these subroutines are learned, they can be composed to solve novel tasks, e.g. exiting the building, finding an object, etc., enabling an agent to quickly learn new tasks by simply learning how to compose them together. These subroutines, or short-horizon policies with consistent behavior, can be manually designed as done in classical robotics or STRIPS, or can be learned through interaction by training a hierarchical agent through reward-based reinforcement learning. Learning through environment interactions is extremely slow, making it prohibitively expensive to operationalize in the real world. We propose a third way of learning these subroutines: using imitation learning on egocentric videos of experts performing tasks. We expect that these videos contain subroutines that have been appropriately combined to solve some tasks, and that an appropriate clustering algorithm can be used to isolate and extract these subroutines. For example, in indoor navigation, such clusters could be exiting doors or walking down hallways; for driving, they could be following a lane, changing lanes, etc. Once isolated, these subroutines just need to be fine-tuned through reward-based RL for downstream tasks, which is very sample efficient, as we show in our experiments.

Figure 1: Approach Overview: We propose an approach that combines learning from direct environmental interaction with learning from first-person videos collected over the Internet. Inverse models built using a small number of environmental interactions are used to interpret videos, and learn affordances (what can I do) and subroutines (how can I do it). Affordances predict which subset of subroutines are feasible given the current image. These subroutines can then be executed in novel environments.

To imitate an expert at a visuomotor task, we need to know both the perceptual input to the expert and the action taken. One way to do this is by instrumenting the agent to collect the perceptual input as well as the action executed, as done in autonomous driving [3]. However, this limits the scalability of the data collection procedure. To scale it up, we could instead learn from videos of people performing tasks uploaded to websites such as YouTube. Such videos fall in two categories: first-person (egocentric) or third-person video. Third-person videos have the benefit of having action information, but don't have the perceptual input. Skills learned from such videos don't depend on the perceptual input [4, 5]. But our focus is on navigation tasks, for which the perceptual input is important. We thus face the opposite challenge when using egocentric video of a person performing a task (e.g. biking with a head-mounted GoPro camera): the perceptual input to the agent is available, but the action information is typically missing. In this paper, we address this case and demonstrate our technique in the navigation domain.

We start with egocentric videos of experts navigating to achieve some task unknown to us. Given these videos, we want our robot to learn meaningful and useful subroutines.

We obtain action labels for the egocentric videos by training a self-supervised inverse model on random interaction data. Egocentric videos can then be pseudo-labeled by running the inverse model on consecutive image pairs. Note that the action space of the experts might be different from the action space of the robot (for example, if we were to download egocentric navigation videos available online). Hence, these pseudo-action labels are not the actual actions taken, but actions imagined by the robot to make the transition between the observations as closely as possible (Section 3).

Once we have the pseudo action labels, we need to label the subroutines in the videos and learn a controller which can be used in downstream tasks. For this, we slice up a trajectory into smaller sub-trajectories. The slicing length is a hyper-parameter that controls the complexity of the subroutines learned (longer sub-trajectories will lead to more complex subroutines). We then encode each sub-trajectory into a discrete latent variable which should be predictive of the action given the video frame, for every frame in the sub-trajectory (Section 4).

To effectively use the learned subroutines in downstream tasks, we must also infer which subroutines can be applied where. For this, we additionally train an affordance model to predict which subroutines can be invoked for a given input image from our repertoire of learned subroutines. We do this by predicting the inferred one-hot latent encoding of the trajectory from the first image (Section 4).

We evaluate our learned subroutines and the affordance model on downstream navigation tasks, which are unknown to our method during the subroutine learning phase. We show that our learned subroutines can be composed together for zero-shot exploration in novel environments, with a 50% improvement in exploration over several learning and non-learning baselines. We also evaluate our learned subroutines on downstream point navigation and area goal tasks. We fine-tune our affordance model and subroutines through reward-based RL and observe up to a 4x improvement in learning sample complexity over alternate initializations (Section 6).
Classical Navigation. Classical approaches to navigation employ geometric reasoning to solve the task [6, 7]. While most works optimize in the base action space of the agent, a few works employ hand-crafted motion primitives to speed up planning [8]. Dynamic Motion Primitives (DMPs) propose a framework for specifying macro actions [9], which can be learned from a demonstration [10] for a specific macro action. However, the set of macro actions is still manually specified. In contrast, these behaviors automatically emerge as a consequence of our algorithm.
Learned Navigation.
Recent learning-based efforts use reinforcement learning or imitation learning to learn policies for solving specific locomotion, navigation or manipulation tasks [11, 12, 13, 14, 15, 16, 17]. While these works learn to leverage high-level semantics, they still directly operate in the base action space of the robot. Learned skills are task and environment specific, and it takes a large number of interaction samples to even solve the same task in a new environment (in navigation [11, 12] as well as in manipulation [16, 17]). To address this, some works [18] use intrinsic rewards such as prediction error. However, these approaches don't distill out composable skills to solve novel tasks. Works like [19] distill out composable skills but do not scale to realistic setups, as we show in our experiments. Moreover, as all training signal is derived from interaction with the environment, skill acquisition is extremely expensive. In fact, our experiments show that our use of passive videos for learning skills is more sample efficient and results in better performance than such purely interactive approaches.
Learning from State-Action Trajectories.
Several works use learning from demonstration in scenarios where they have access to both the observations and the ground truth actions to solve the task [20, 21]. Works like [22] extend these formulations to work with trajectory collections that have multiple modes. However, this line of work relies on ground truth action labels. In contrast, we only assume observation data (without paired actions), and evaluate on novel tasks in novel environments.
Learning from State-Only Trajectories.
Contemporary works [23, 24, 25] study the problem of learning from state-only trajectories, similar to our work here. However, all of these works only study the scenario where the agent solves the task in exactly the same environment that they have state-only demonstrations for. In contrast, we do not assume access to the environments for which we have videos, making learning more challenging and rendering these past techniques ineffective. Additionally, our goal is to learn subroutines that work in previously unseen environments, which goes beyond the focus of these works. Works like [26, 27, 28] focus on imitation from visual data and learn a monolithic policy from expert data for the task at hand. We focus on learning composable subroutines which can then be used in several downstream tasks.
Sub-policies and Options in Hierarchical RL.
Hierarchical RL is an active area of research [29, 30, 2], with a number of recent papers (such as [31, 32]). These works acquire sub-policies in a top-down manner while interacting with the environment to solve a reward-based task. Our approach, on the other hand, investigates a bottom-up development of subroutines, and can learn from relatively inexpensive unlabelled passive data. Our learned subroutines are complementary to these frameworks and can be used to initialize any of these top-down HRL methods to accelerate learning.
Affordance Learning from Videos.
Researchers have studied affordance learning from Internet videos [33] by leveraging YouTube videos to learn about affordances. While this is a great first step, it does not learn a controller that can be used in downstream tasks. In contrast, we learn a controller for each subroutine for the specific robot at hand, allowing immediate deployment.
We need action labels on expert videos to learn a controller for downstream tasks. We build a self-supervised inverse model which takes two consecutive image observations, $o_t$ and $o_{t+1}$, and predicts the action $\hat{a}_t$ which the agent took to transition from $o_t$ to $o_{t+1}$. This inverse model is then used to pseudo-label egocentric videos of experts. However, since the videos may come from diverse sources (e.g. internet videos), the expert uploading the video may have a different action space than the agent $S$. To handle this mismatch, we expect the inverse model to predict the action which $S$ should have taken to go from $o^e_t$ to $o^e_{t+1}$, and not the action actually taken by the expert.

We first build a self-supervised one-step inverse model $\psi$ [34, 35] for the agent from random interaction data in an environment (a simulation environment or the real world). More concretely, given a pair of consecutive image observations, $o_t$ and $o_{t+1}$, we train the model $\psi$ to predict the action $a_t$ that was executed to transition from image $o_t$ to $o_{t+1}$, i.e. $\hat{a}_t = \psi(o_t, o_{t+1})$. The agent $S$ collects interaction data $\{\ldots, o_t, a_t, o_{t+1}, a_{t+1}, \ldots\}$ to train $\psi$ by sampling $a_t$ uniformly from {left, right, forward} and executing it in the environment, conveying it from $o_t$ to $o_{t+1}$.
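The paper does not include code; the following is a minimal PyTorch sketch of such a one-step inverse model, assuming RGB observations and the three-way action space named above. The small CNN encoder and the helper names are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

ACTIONS = ["left", "right", "forward"]  # actions used to collect random interaction data

class InverseModel(nn.Module):
    """Predicts the action a_t that transitions o_t -> o_{t+1} (psi in the paper)."""
    def __init__(self, num_actions=len(ACTIONS)):
        super().__init__()
        # Illustrative encoder; the paper's exact architecture is not reproduced here.
        self.encoder = nn.Sequential(
            nn.Conv2d(6, 32, 5, stride=2), nn.ReLU(),   # 6 channels: o_t and o_{t+1} stacked
            nn.Conv2d(32, 64, 5, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(64, num_actions)

    def forward(self, o_t, o_tp1):
        x = torch.cat([o_t, o_tp1], dim=1)              # (B, 6, H, W)
        return self.head(self.encoder(x))               # logits over actions

def train_step(model, optimizer, o_t, o_tp1, a_t):
    """One cross-entropy update on a batch of (o_t, a_t, o_{t+1}) triplets."""
    logits = model(o_t, o_tp1)
    loss = F.cross_entropy(logits, a_t)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def pseudo_label(model, frames):
    """Pseudo-label an expert video (tensor of frames, shape (T, 3, H, W))."""
    with torch.no_grad():
        logits = model(frames[:-1], frames[1:])
        return logits.argmax(dim=1)                     # one \hat{a}_t per consecutive pair
```

Running pseudo_label over every clip in $D$ yields the pseudo-labeled dataset $\hat{D}$ used in the next section.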
Figure 2: Pseudo-labeling: (a) We execute random actions in an environment to obtain image-action sequences $\{\ldots, o_t, a_t, o_{t+1}, \ldots\}$. We use triplets $(o_t, a_t, o_{t+1})$ to train an inverse model to predict the action $a_t$ given consecutive images $o_t, o_{t+1}$. (b) We use this inverse model to pseudo-label egocentric videos of navigating agents. Grey circles with an = sign represent the cross-entropy loss.

We then use this learned inverse model $\psi$ to pseudo-label the dataset $D$ of egocentric expert videos. Given a sequence of images $\{o^e_1, o^e_2, \ldots\}$ from a video, we evaluate $\psi$ on consecutive pairs of images to obtain $\hat{a}_t = \psi(o^e_t, o^e_{t+1})$. We use observations as a means of implicitly mapping equivalent actions between agents. This generates a pseudo-labeled dataset $\hat{D}$ that contains image-action sequences $\{o^e_1, \hat{a}_1, o^e_2, \hat{a}_2, \ldots, o^e_T\}$.

We formally define the subroutines and the affordance model for an agent $S$ as $\langle \alpha, \{\pi_i\}_{i=1..N} \rangle$, where $N$ is the number of subroutines available to $S$, $\pi_i$ is the $i$-th subroutine, and $\alpha$ is the affordance model. The affordance model predicts a probability distribution $\tilde{p}_t$ given an input observation $o_t$, where $(\tilde{p}_t)_i$ is the probability that $\pi_i$ is applicable given the observation $o_t$. Each subroutine $\pi_i$ is a closed-loop policy which takes the current observation $o_t$ and predicts a distribution over actions $\tilde{a}_t$. Thus, $\tilde{p}_t = \alpha(o_t)$ and $\tilde{a}_t = \pi_i(o_t)$.

We isolate these subroutines from $\hat{D}$ by clustering them to improve the action prediction accuracy of the visuomotor trajectories in $\hat{D}$. Intuitively, if the observation contains a T-junction with a possible left and right turn, the two need to be clustered separately to unambiguously predict the future given the observation and the cluster id.

To effectively use these subroutines, we learn another model to infer which subroutines are applicable in what scenario. For example, given an image of a hallway with two doors, one on the left and the other on the right, the model should learn to assign high probabilities to both the go into left door and go into right door subroutines, whereas when there are no doors, it should simply peak on the subroutine go down a hallway.

We slice up each trajectory (from the pseudo-labeled dataset $\hat{D}$) into smaller overlapping trajectories of fixed length $T$, where $T$ is a hyper-parameter which determines the complexity of the subroutines. We encode the actions in each sub-trajectory into a discrete latent vector $z$. This $z$ is then used to predict the actions corresponding to different frames in the video. We implement the trajectory encoder as a network $f$ and the subroutines as a network $\pi$ parametrized by the subroutine-id $z$:

$e = f(\hat{a}_1, \hat{a}_2, \ldots, \hat{a}_T)$   (1)
$z \sim \mathrm{softmax}(e)$   (2)
$\tilde{a}_t, h_{t+1} = \pi(o_t, z, h_t) \quad \forall\, t \in \{1, \ldots, T-1\}$   (3)

where $h_t$ and $h_{t+1}$ are the current and updated hidden states respectively, $o_t$ is the current observation, and $z$ is a discrete latent vector which specifies the subroutine to invoke. We train the affordance model $\alpha$ to predict the subroutine id given the first image of the video sequence:

$\tilde{z} = \alpha(o_1)$   (4)

A smaller $T$ leads to simpler subroutines. We show ablations over $T$ in the supplementary.
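A minimal PyTorch sketch of the joint training described by Eqs. (1)-(4) follows. It assumes per-frame image features have already been extracted (obs_feats), that the affordance model alpha is any module mapping such a feature to logits over subroutine ids, and that the recurrent policy uses a GRU cell; the 1D CNN encoder mirrors Figure 3, but the specific architectures, clip length and number of subroutines are illustrative choices, not the authors' exact ones.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_SUBROUTINES = 4   # N; the paper ablates this choice
CLIP_LEN = 8          # T; illustrative value, the paper ablates the clip length

class TrajectoryEncoder(nn.Module):
    """f: encodes a pseudo-action sub-trajectory into logits over subroutine ids (Eq. 1)."""
    def __init__(self, num_actions=4, num_subroutines=NUM_SUBROUTINES):
        super().__init__()
        self.conv = nn.Sequential(               # 1D CNN over the action sequence (cf. Figure 3)
            nn.Conv1d(num_actions, 32, 3, padding=1), nn.ReLU(),
            nn.Conv1d(32, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
        )
        self.head = nn.Linear(32, num_subroutines)

    def forward(self, actions_onehot):           # (B, T, num_actions)
        return self.head(self.conv(actions_onehot.transpose(1, 2)))

class SubroutinePolicy(nn.Module):
    """pi: recurrent policy conditioned on the subroutine id z (Eq. 3). GRU is an assumption."""
    def __init__(self, obs_dim, num_actions=4, num_subroutines=NUM_SUBROUTINES, hidden=128):
        super().__init__()
        self.rnn = nn.GRUCell(obs_dim + num_subroutines, hidden)
        self.head = nn.Linear(hidden, num_actions)

    def forward(self, obs_feat, z_onehot, h):
        h = self.rnn(torch.cat([obs_feat, z_onehot], dim=1), h)
        return self.head(h), h                   # action logits \tilde{a}_t, next hidden state

def training_step(f, pi, alpha, obs_feats, pseudo_actions, tau=1.0):
    """Jointly train f and pi on a batch of clips; alpha imitates the inferred z (Eq. 4).
    obs_feats: (B, T, obs_dim) image features; pseudo_actions: (B, T) pseudo-labels."""
    B, T = pseudo_actions.shape
    e = f(F.one_hot(pseudo_actions, 4).float())                     # Eq. 1
    z = F.gumbel_softmax(e, tau=tau, hard=True)                     # Eq. 2, differentiable sample [36]
    h = obs_feats.new_zeros(B, 128)                                 # matches the policy's hidden size
    policy_loss = 0.0
    for t in range(T - 1):
        logits, h = pi(obs_feats[:, t], z, h)                       # Eq. 3
        policy_loss = policy_loss + F.cross_entropy(logits, pseudo_actions[:, t])
    afford_loss = F.cross_entropy(alpha(obs_feats[:, 0]), z.argmax(dim=1))
    return policy_loss / (T - 1) + afford_loss
```

The Gumbel-Softmax sample keeps the pipeline differentiable, so gradients reach the encoder $f$ despite the discrete choice of subroutine id.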
Figure 3: Learning Visuomotor Routines and Affordances:
We want to mine visuomotor routines with the ability to explicitly invoke them. We implement this as a recurrent network that takes in the current observation and a one-hot vector $z$ that specifies the subroutine to invoke. Since we don't have labels for the subroutine id $z$, we obtain it by jointly training two networks: one that looks at the entire future sequence of actions to predict the subroutine id, and another which takes the subroutine id $z$ and the image as input to predict the action to take. Both networks are jointly trained to minimize a cross-entropy loss. Finally, we also train an affordance model that predicts the inferred subroutine id $z$ from the first image.

At test time, $\alpha$ takes the current observation as input and predicts which subroutines are applicable, and $\pi$ then executes the selected subroutine in a closed loop, executing the predicted action $\tilde{a}_t$ and receiving the next observation $o_{t+1}$ as it proceeds.

Training: Both networks, $\pi$ and $f$, are jointly optimized to maximize the likelihood of the pseudo-labeled action sequence. $\alpha$ is optimized to maximize the likelihood of $z$ given the first observation of the video sequence. The subroutine id $z$ is sampled from the trajectory encoding $e$ through a Gumbel-Softmax distribution [36]. This allows estimating gradients for the parameters of $f$ despite the sampling. Figure 3 shows the network diagram.

Our experiments involve the use of environments (where the agent can actively interact with the environment) $E_{train}$ and $E_{test}$, and a dataset of first-person videos $D$. We describe the choices for the environment, the agent, and this video dataset below.

Environments: We model environments using a visually realistic simulator derived from scans of real world indoor environments from the Stanford Building Parser Dataset [37] (SBPD) and the Matterport 3D Dataset [38] (MP3D). These scans have been used to study navigation tasks in [13, 39, 40], and we adapt publicly available simulation code from [13]. We split these environments into four disjoint sets: $E_{train}$, $E_{video}$, $E_{val}$ and $E_{test}$. $E_{train}$ is used to train the inverse model, $E_{val}$ is used for development of policies for downstream tasks, and $E_{test}$ is used for evaluating the performance of our policies on downstream tasks. $E_{video}$ is used to create a dataset of egocentric videos.

Agent Model:
Our agent is modeled as a cylinder with 4 actions: a) stay in place, b, c) rotate left or right by $\theta$ (= 30°), and d) move forward by $x$ (= 40 cm). The robot is equipped with an RGB camera mounted at a height $h$ (= 120 cm) from the ground with an elevation $\phi$ (= -15°) from the horizontal.
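For concreteness, a small sketch of this discrete action space and the pose-update arithmetic it implies; the class and function names are hypothetical, and the numbers simply mirror the values above.

```python
import math
from dataclasses import dataclass
from enum import IntEnum

class Action(IntEnum):
    STAY = 0
    LEFT = 1
    RIGHT = 2
    FORWARD = 3

@dataclass
class AgentConfig:
    turn_deg: float = 30.0          # theta
    step_cm: float = 40.0           # x
    camera_height_cm: float = 120.0 # h
    camera_elevation_deg: float = -15.0  # phi

def apply_action(x_cm, y_cm, heading_deg, action, cfg=AgentConfig()):
    """Return the next (x, y, heading) after one discrete action (collisions ignored)."""
    if action == Action.LEFT:
        heading_deg += cfg.turn_deg
    elif action == Action.RIGHT:
        heading_deg -= cfg.turn_deg
    elif action == Action.FORWARD:
        x_cm += cfg.step_cm * math.cos(math.radians(heading_deg))
        y_cm += cfg.step_cm * math.sin(math.radians(heading_deg))
    return x_cm, y_cm, heading_deg
```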
Dataset D: We create the MP3D Walks Dataset of egocentric videos. The MP3D Walks Dataset is auto-generated using the $E_{video}$ environments, by rendering out images along the path taken by an expert navigator to navigate between given pairs of random points. We implement this expert as an analytical path planner which has access to the ground truth free space map. We additionally ensure that the experts have a different action space than our agent (see supplementary for specifics). The MP3D Walks Dataset consists of around K clips of steps each, without any action labels.

The agent starts at 1.5K different locations spread over 4 environments ($E_{train}$) and executes random actions for 30 steps. The collected data (45K interaction samples) is used to train the inverse model. See supplementary for ablations over the number of interaction samples and the generalization performance of the inverse model over different camera heights of the test images.
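A sketch of this random-interaction data collection is below; env.reset() and env.step() are hypothetical simulator hooks, and the default counts mirror the numbers above (1.5K starting locations, 30 steps each).

```python
import random

def collect_interaction_data(env, num_starts=1500, steps_per_start=30):
    """Collect (o_t, a_t, o_{t+1}) triplets by executing random actions from random starts."""
    triplets = []
    for _ in range(num_starts):
        obs = env.reset()                          # re-spawn at a new random location
        for _ in range(steps_per_start):
            action = random.choice([1, 2, 3])      # left, right, forward
            next_obs = env.step(action)            # hypothetical simulator call
            triplets.append((obs, action, next_obs))
            obs = next_obs
    return triplets                                # ~45K samples with the defaults above
```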
Figure 4: (a) Robustness and Diversity of Subroutines: Subroutines SubR2 (top row) and SubR1 (bottom row), when deployed on a real robot, demonstrate robustness and consistency over perturbations to the starting location. (b) Consistency of Subroutines: Learned subroutines, when deployed on a real robot, demonstrate consistent behavior over different starting locations, as demonstrated for SubR2.
Figure 5: Affordance Model Visualization:
The images in the first 3 columns were assigned high probability for SubR1 by the affordance model (which goes rightwards), while the images in the next three columns were assigned high probability for SubR2 (which goes leftward). These images indeed afford the predicted subroutines.
This model is then used to pseudo-label the videos in $D$ to obtain the dataset $\hat{D}$, as described in Section 3.2. $\hat{D}$ is used to learn the subroutines $\pi(\cdot, z)$ and the affordance model (Section 4.1).

Subroutine Training: We slice each of the K videos into clips of length $T$ steps with a sliding window. This gives us a total of . M clips to train our subroutines. We experiment with using 4 subroutines (i.e. the $z$ vector is 4-dimensional; we show ablations over the number of subroutines as well as the length of each subroutine in the supplementary). This model is trained by minimizing the cross-entropy loss between the actions output by the policy ($\tilde{a}$) and the pseudo-labels ($\hat{a}$) obtained from the inverse model.

Affordance Training: We train the affordance model to predict the inferred subroutine id $z$ given the first image of the length-$T$ trajectory, by minimizing the cross-entropy loss over the inferred $z$ label.

Real World Deployment: We deployed our subroutines (learned in simulation) in the real world on a real robot (an iCreate2 platform equipped with an RGB camera). Figure 4a shows the diversity between two of our learned subroutines, SubR1 and SubR2. We observe that SubR1 prefers turning rightward into doors and corridors, and SubR2 prefers turning leftward. The subroutines show robustness to perturbations in the starting location and consistently enter the door. Figure 4b shows that SubR2 consistently turns left into doors and corridors across different starting locations. See supplementary for simulation results showing the diversity and consistency of our learned subroutines.
Affordance Prediction: We show observations from $E_{test}$ for which the affordance model prediction is high for SubR1 and SubR2 in Figure 5. The top row shows observations that cause a high prediction for SubR1, while the bottom row shows images that excite SubR2.

The exploration task requires the agent to explore a novel environment efficiently.
Task Setup: We randomly initialize the agent at different locations in the novel test environment $E_{test}$. For each location, we perform random executions (each from a randomly chosen initial orientation) of a fixed length.

Table 1: Exploration Metrics: VMSR beats 3 hand-crafted baselines and two state-of-the-art learning based techniques [19, 18]. See text for details.

Method                                Samples ↓   ADT ↓   Max Distance ↑   Collision Rate (%) ↓
Random                                0           18.09   7.5              65
Forward Bias Policy                   0           15.25   13.11            82
Always Forward, Rotate on Collision   0           14.89   13.31            72
Skills from Diversity [19]            10 M
Skills from Curiosity [18]            10 M
VMSR (Ours)                           45 K

Metrics: We measure different aspects of exploration via the following metrics. Samples ↓: the number of environment interactions used for training. Average Distance to Trajectory (ADT) ↓: given the executed trajectories from a given starting location, we compute the mean geodesic distance of points in the environment to the closest point on a trajectory. If we wanted to visit a point in the environment, this metric measures how far we would need to go off the trajectory to get to this point, in expectation. We report the average over all starting locations. Maximum Distance ↑: measures how far the executed trajectories convey the agent. For each trajectory, we measure the maximum geodesic distance from the starting location to all points on the trajectory. We report the average maximum geodesic distance over trials. Collision Rate ↓: the fraction of forward actions that result in collisions. We emphasize that VMSR is not trained to optimize for any of these metrics.

Exploration via Subroutines: Given the visual observation from the current location, we repeat the following two steps: a) we use the affordance model $\alpha$ to sample the subroutine $z$ to execute, and b) we execute the sampled subroutine for 10 steps.
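A sketch of this exploration procedure is shown below, reusing the affordance model and subroutine policy from the earlier sketches; the feature extractor feat, the env.step() interface, and the number of macro steps are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def explore(env, alpha, pi, feat, num_macro_steps=50, steps_per_subroutine=10):
    """Zero-shot exploration: repeatedly sample a feasible subroutine from the affordance
    model and roll it out in closed loop for a fixed number of steps."""
    obs = env.reset()
    for _ in range(num_macro_steps):
        with torch.no_grad():
            probs = torch.softmax(alpha(feat(obs)), dim=-1)        # which subroutines apply here
            z = torch.multinomial(probs, 1)                        # sample one subroutine id
            z_onehot = F.one_hot(z.squeeze(-1), probs.shape[-1]).float()
            h = torch.zeros(1, 128)                                # reset the subroutine's hidden state
            for _ in range(steps_per_subroutine):
                logits, h = pi(feat(obs), z_onehot, h)             # closed-loop subroutine policy
                action = torch.distributions.Categorical(logits=logits).sample()
                obs = env.step(action.item())                      # hypothetical simulator call
```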
Baselines: We compare with three hand-crafted baselines: a) a Random policy (randomly execute one of the 4 actions), b) a Forward Bias policy (biased to more frequently execute the forward action), and c) an Always Forward, Rotate on Collision policy. We also compare to the state-of-the-art unsupervised RL-based skill learning methods, d) DIAYN [19] and e) Curiosity [18]. These learning-based techniques were trained with comparable networks (ResNet 18 models pre-trained on ImageNet) for over 10 M samples. More details about these baselines are in the supplementary.
Results: Table 1 shows that VMSR outperforms all the baselines on all three metrics. VMSR successfully learned to bias towards the forward action in navigation, as well as the notion of obstacle avoidance, leading to a high maximum distance and a low collision rate. It outperforms hand-crafted baselines that were designed with these insights in mind. Furthermore, it outperforms state-of-the-art learning-based techniques for learning skills [19, 18] that were trained on far more interaction samples (45 K vs. 10 M).

We next investigate how we can use VMSR to solve goal-driven tasks. We do this by setting up hierarchical RL policies based on our learned subroutines and affordance models.
Task Setup:
We set up two goal-driven navigation tasks, PointGoal and AreaGoal, as defined in [41]. For the PointGoal task, the agent is required to reach a given goal location (specified as a relative offset from the robot's current location). For the AreaGoal task, the agent is required to go to the washroom. We study both tasks in sparse and dense reward settings. RL and HRL policies are developed on the validation environment $E_{val}$, and finally trained on the test environment $E_{test}$ with 3 random seeds to assess sample efficiency for learning.
We follow the framework in [30] and initialize the sub-policies with our subroutines and the meta-controller with our affordance model. We then fine-tune the sub-policies and the meta-controller via reinforcement learning.
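A sketch of how the pretrained pieces could be slotted into such a two-level policy follows; the option length, the actor-critic style bookkeeping, and the env interface are illustrative assumptions rather than the authors' exact training code.

```python
import torch
import torch.nn.functional as F

def hrl_rollout(env, alpha, pi, feat, option_len=10, horizon=500):
    """Hierarchical rollout: the meta-controller (initialized from the affordance model)
    picks a subroutine every `option_len` steps; the sub-policies (initialized from the
    learned subroutines) emit low-level actions. Log-probs are collected for an
    actor-critic update (e.g. A2C, as used in the paper's RL experiments)."""
    obs, log_probs, rewards = env.reset(), [], []
    h = torch.zeros(1, 128)
    z_onehot = None
    for t in range(horizon):
        if t % option_len == 0:                                    # meta decision
            meta_logits = alpha(feat(obs))
            meta_dist = torch.distributions.Categorical(logits=meta_logits)
            z = meta_dist.sample()
            log_probs.append(meta_dist.log_prob(z))
            z_onehot = F.one_hot(z, meta_logits.shape[-1]).float()
            h = torch.zeros(1, 128)                                # reset subroutine state
        logits, h = pi(feat(obs), z_onehot, h)
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        obs, reward, done = env.step(action.item())                # hypothetical env interface
        rewards.append(reward)
        if done:
            break
    return log_probs, rewards
```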
Figure 6: (a) Subroutines and Affordances for Hierarchical RL: Initializing from VMSR leads to up to 4x more sample efficient learning in downstream navigation tasks. The first column shows results for PointGoal (go to an (x, y) coordinate), the second column shows results for AreaGoal (go to the washroom). We see improvements across these tasks for both sparse and dense reward scenarios, with larger gains in the harder case of sparser rewards. (b) Subroutines for RL: Initializing a flat RL policy with VMSR (with only 1 SubR) leads to improved sample complexity for AreaGoal navigation (go to the washroom), compared to alternate initializations.
Comparisons:
We compare with the following alternates for initializing the meta-controller and the sub-policies: a) Random Initialization, b) ImageNet Initialization, and c) Initialization from skills via DIAYN [19] pre-training. (c) doesn't provide an affordance model, so we initialize the meta-controller image CNN with the sub-policy CNN. We also compare VMSR initialization to initialization obtained from Curiosity [18]. Pathak et al. [18] use a monolithic policy (i.e. without any handle to control what it does), and study an AreaGoal task. Thus, for a fair comparison, we limit the comparison to the AreaGoal task and use a monolithic RL policy instead of the hierarchical policy. We report three training curves: a) Random Initialization, b) Initialization from Curiosity [18], and c) VMSR with 1 subroutine (to obtain a monolithic policy).
Results:
Training rewards are plotted in Figure 6a and Figure 6b. Figure 6a shows the comparison among hierarchical policies. We observe up to 4x faster training when initialized with VMSR and affordance models as compared to the next best baseline (which is ImageNet initialization). Improvements are generally larger for the harder case of sparser rewards. DIAYN [19] based initialization entirely fails, as it collapses to a trivial policy (more details in supplementary). Even among non-hierarchical policies (Figure 6b), initializing with VMSR (VMSR (1 SubR)) performs the best on the AreaGoal task, outperforming random initialization and initialization from the curiosity policy [18], which also learns a trivial policy (see supplementary).
In this paper, we developed a technique that combined learning from interaction with learning from videos, to extract meaningful and useful subroutines from egocentric videos of experts performing different tasks. We showed how these extracted subroutines can be used as is for exploration, or can be specialized using hierarchical RL for solving other downstream navigation tasks. We believe the advances made in this paper will enable scaling up of policy learning in robotics.

Acknowledgments
The authors would like to thank Allan Jabri, Shiry Ginosar, Devendra Singh Chaplot, Ashvin Nair and Angjoo Kanazawa for feedback on the manuscript. This work was supported by Berkeley DeepDrive.
References

[1] R. E. Fikes and N. J. Nilsson. STRIPS: A new approach to the application of theorem proving to problem solving. Artificial Intelligence, 1971.
[2] R. S. Sutton, D. Precup, and S. Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 1999.
[3] H. Xu, Y. Gao, F. Yu, and T. Darrell. End-to-end learning of driving models from large-scale video datasets. In CVPR, pages 2174-2182, 2017.
[4] X. B. Peng, A. Kanazawa, J. Malik, P. Abbeel, and S. Levine. SFV: Reinforcement learning of physical skills from videos. In SIGGRAPH Asia 2018, page 178. ACM, 2018.
[5] X. B. Peng, P. Abbeel, S. Levine, and M. van de Panne. DeepMimic: Example-guided deep reinforcement learning of physics-based character skills. ACM Transactions on Graphics (TOG), 37(4):143, 2018.
[6] S. M. LaValle. Planning Algorithms. Cambridge University Press, Cambridge, U.K., 2006. Available at http://planning.cs.uiuc.edu/.
[7] S. Thrun, W. Burgard, and D. Fox. Probabilistic Robotics. MIT Press, 2005.
[8] K. Hauser, T. Bretl, K. Harada, and J.-C. Latombe. Using motion primitives in probabilistic sample-based planning for humanoid robots. In Algorithmic Foundation of Robotics. 2008.
[9] S. Schaal. Dynamic movement primitives - a framework for motor control in humans and humanoid robotics. In Adaptive Motion of Animals and Machines, pages 261-280. Springer, 2006.
[10] A. J. Ijspeert, J. Nakanishi, H. Hoffmann, P. Pastor, and S. Schaal. Dynamical movement primitives: Learning attractor models for motor behaviors. Neural Computation, 25(2):328-373, 2013.
[11] Y. Zhu, R. Mottaghi, E. Kolve, J. J. Lim, A. Gupta, L. Fei-Fei, and A. Farhadi. Target-driven visual navigation in indoor scenes using deep reinforcement learning. In ICRA, 2017.
[12] P. Mirowski, R. Pascanu, F. Viola, H. Soyer, A. Ballard, A. Banino, M. Denil, R. Goroshin, L. Sifre, K. Kavukcuoglu, et al. Learning to navigate in complex environments. In ICLR, 2017.
[13] S. Gupta, J. Davidson, S. Levine, R. Sukthankar, and J. Malik. Cognitive mapping and planning for visual navigation. In CVPR, 2017.
[14] F. Sadeghi. DIViS: Domain invariant visual servoing for collision-free goal reaching. RSS, 2019.
[15] F. Sadeghi and S. Levine. CAD2RL: Real single-image flight without a single real image. arXiv preprint arXiv:1611.04201, 2016.
[16] S. Levine, C. Finn, T. Darrell, and P. Abbeel. End-to-end training of deep visuomotor policies. JMLR, 2016.
[17] S. Levine, P. Pastor, A. Krizhevsky, J. Ibarz, and D. Quillen. Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection. The International Journal of Robotics Research, 37(4-5):421-436, 2018.
[18] D. Pathak, P. Agrawal, A. A. Efros, and T. Darrell. Curiosity-driven exploration by self-supervised prediction. In ICML, 2017.
[19] B. Eysenbach, A. Gupta, J. Ibarz, and S. Levine. Diversity is all you need: Learning skills without a reward function. In ICLR, 2019.
[20] B. D. Argall, S. Chernova, M. Veloso, and B. Browning. A survey of robot learning from demonstration. Robotics and Autonomous Systems, 2009.
[21] A. Billard, S. Calinon, R. Dillmann, and S. Schaal. Robot programming by demonstration. In Springer Handbook of Robotics. 2008.
[22] K. Hausman, Y. Chebotar, S. Schaal, G. Sukhatme, and J. J. Lim. Multi-modal imitation learning from unstructured demonstrations using generative adversarial nets. In NIPS, 2017.
[23] Y. Aytar, T. Pfaff, D. Budden, T. L. Paine, Z. Wang, and N. de Freitas. Playing hard exploration games by watching YouTube. arXiv preprint arXiv:1805.11592, 2018.
[24] F. Torabi, G. Warnell, and P. Stone. Behavioral cloning from observation. In IJCAI, 2018.
[25] A. D. Edwards, H. Sahni, Y. Schroeker, and C. L. Isbell. Imitating latent policies from observation. arXiv preprint arXiv:1805.07914, 2018.
[26] D. Pathak, P. Mahmoudieh, G. Luo, P. Agrawal, D. Chen, Y. Shentu, E. Shelhamer, J. Malik, A. A. Efros, and T. Darrell. Zero-shot visual imitation. In ICLR, 2018.
[27] C. Finn, T. Yu, T. Zhang, P. Abbeel, and S. Levine. One-shot visual imitation learning via meta-learning. arXiv preprint arXiv:1709.04905, 2017.
[28] T. Yu, C. Finn, A. Xie, S. Dasari, T. Zhang, P. Abbeel, and S. Levine. One-shot imitation from observing humans via domain-adaptive meta-learning. arXiv preprint arXiv:1802.01557, 2018.
[29] P. Dayan and G. E. Hinton. Feudal reinforcement learning. In NIPS, 1993.
[30] A. G. Barto and S. Mahadevan. Recent advances in hierarchical reinforcement learning. Discrete Event Dynamic Systems, 2003.
[31] A. S. Vezhnevets, S. Osindero, T. Schaul, N. Heess, M. Jaderberg, D. Silver, and K. Kavukcuoglu. FeUdal networks for hierarchical reinforcement learning. arXiv preprint arXiv:1703.01161, 2017.
[32] A. Levy, R. Platt, and K. Saenko. Hierarchical actor-critic. arXiv preprint arXiv:1712.00948, 2017.
[33] D. F. Fouhey, V. Delaitre, A. Gupta, A. A. Efros, I. Laptev, and J. Sivic. People watching: Human actions as a cue for single view geometry. IJCV, 2014.
[34] M. I. Jordan and D. E. Rumelhart. Forward models: Supervised learning with a distal teacher. Cognitive Science, 16(3):307-354, 1992.
[35] P. Agrawal, A. V. Nair, P. Abbeel, J. Malik, and S. Levine. Learning to poke by poking: Experiential learning of intuitive physics. In NIPS, pages 5074-5082, 2016.
[36] E. Jang, S. Gu, and B. Poole. Categorical reparameterization with Gumbel-Softmax. arXiv preprint arXiv:1611.01144, 2016.
[37] I. Armeni, O. Sener, A. R. Zamir, H. Jiang, I. Brilakis, M. Fischer, and S. Savarese. 3D semantic parsing of large-scale indoor spaces. In CVPR, 2016.
[38] A. Chang, A. Dai, T. Funkhouser, M. Halber, M. Niessner, M. Savva, S. Song, A. Zeng, and Y. Zhang. Matterport3D: Learning from RGB-D data in indoor environments. In 3DV, 2017.
[39] A. Kumar*, S. Gupta*, D. Fouhey, S. Levine, and J. Malik. Visual memory for robust path following. In Advances in Neural Information Processing Systems, 2018.
[40] T. Swedish and R. Raskar. Deep visual teach and repeat on path networks. In CVPRW, 2018.
[41] P. Anderson, A. Chang, D. S. Chaplot, A. Dosovitskiy, S. Gupta, V. Koltun, J. Kosecka, J. Malik, R. Mottaghi, M. Savva, and A. Zamir. On evaluation of embodied navigation agents. arXiv preprint arXiv:1807.06757, 2018.
[42] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[43] A. Murali, T. Chen, K. V. Alwala, D. Gandhi, L. Pinto, S. Gupta, and A. Gupta. PyRobot: An open-source robotics framework for research and benchmarking. arXiv preprint arXiv:1906.08236, 2019.

Supplementary Material
A1 Subroutines and Affordances Full Training Details
Inverse Model Training and Pseudo-labeling.
The agent starts at 1.5K different locations spread over 4 environments ($E_{train}$) and executes random actions for 30 steps. The collected data (45K interaction samples) is used to train the inverse model. We use a cross-entropy loss between the actual action and the predicted action. We use Adam [42] with batch size and . learning rate. Ablations over the number of interaction samples, obtained by varying the number of starting locations and the number of steps per starting location, are shown in Figure A4.

This model is then used to pseudo-label the videos in $D$ to obtain the dataset $\hat{D}$. $\hat{D}$ is used to learn the subroutines $\pi(\cdot, z)$ and the affordance model.

Subroutine Training: We slice each of the K videos into clips of length $T$ steps with a sliding window (ablations over the length of subroutines are shown in Figure A4). This gives us a total of . M clips to train our subroutines. We experiment with using 4 subroutines (i.e. the $z$ vector is 4-dimensional) and show ablations over this hyper-parameter in Figure A4. This model is trained by minimizing the cross-entropy loss between the actions output by the policy ($\tilde{a}$) and the pseudo-labels ($\hat{a}$) obtained from the inverse model.

Affordance Training: We train the affordance model to predict the inferred subroutine id $z$ given the first image of the length-$T$ trajectory, by minimizing the cross-entropy loss over the inferred $z$ label.

A2 Consistency and Diversity Visualizations
We unroll different subroutines from different locations in the test environment $E_{test}$, and visualize the trajectories followed by each of them in the top view in Figure A2. We show multiple rollouts of each subroutine from each of the starting locations. Randomness in behavior comes from the sampling of the actions from the network output. The three top-view figures in each column of Figure A2 correspond to one subroutine at three different starting locations, and we roll out 8 trajectories from each starting location. Thus, each column demonstrates that a specific subroutine does similar things when initialized at different locations, showing the consistency of our learned subroutines. For example, SubR1 always turns right and SubR2 always turns left. Rollouts shown in different rows of Figure A2 show that different subroutines exhibit diverse behaviors when started from the same location. This shows the diversity across our learned subroutines.

We also quantitatively compute disentanglement: we unroll the different subroutines from the same starting location and compute the intersection over union (IoU) between trajectories from SubR_i and SubR_j (for example, the IoU between the green region and the blue region in the plots in the top row of Figure A2). A higher IoU implies similar areas are traversed by two sets of sub-policies; a lower IoU implies the two sub-policies are distinct. Thus, we should expect a higher IoU between trajectories from the same subroutine and a lower IoU between trajectories from different subroutines. The average IoU between different subroutines is 0.42, and the average IoU between trajectories from the same subroutine is 0.58. Thus, different subroutines are indeed disentangled.
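A small sketch of this disentanglement measure, assuming each rollout has already been rasterized into the set of top-view grid cells it visits (the rasterization itself is simulator specific and omitted):

```python
def coverage(rollouts):
    """Union of visited top-view grid cells over a set of rollouts from one start location."""
    cells = set()
    for rollout in rollouts:          # each rollout is an iterable of (row, col) grid cells
        cells.update(rollout)
    return cells

def trajectory_iou(rollouts_i, rollouts_j):
    """IoU between the areas covered by two sets of rollouts (e.g. SubR_i vs. SubR_j)."""
    a, b = coverage(rollouts_i), coverage(rollouts_j)
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```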
A3 Affordance Model Entropy Visualization

We also look at the entropy of the output of the affordance model in Figure A1 in top view. The arrow shows the direction in which the agent is facing, and we plot the entropy of the prediction of the affordance model when the first-person (egocentric) observation is given as input. A higher entropy implies that more subroutines apply in the given scenario. The observed entropy is consistent with our expectations, as explained in the figure caption.
Figure A2: Subroutine Consistency and Diversity:
Each top-view figure shows multiple roll-outs of a subroutine from a given location. The black arrow in the white circle shows the starting position and the black dots show the ending locations of the rollouts. Columns show the same subroutine over different starting locations, illustrating the consistency of our subroutines, while rows show different subroutines unrolled from the same location, illustrating their diversity. It appears that SubR1 prefers turning right and SubR2 prefers turning left. Note that the policies only use first-person views.
A4 Baselines for Exploration

1. Random Policy: We randomly sample an action from the four possible actions (stay, left, right, forward) at every step.
2. Forward Bias Policy: Since motion is typically dominated by forward motion, we compare to another policy that samples the forward action more preferably. We use the distribution of actions in the MP3D Walks Dataset; the probabilities for stop, turn left, turn right and forward were [0. , . , . , . ] respectively.

3. Always Forward, Rotate on Collision: This baseline repeats the following procedure: rotate by a random angle sampled from (-π, π], then move straight until collision (a sketch of these two hand-crafted baselines follows after this list).

4. Diversity Policy (DIAYN) [19]: We use the state-of-the-art RL-based unsupervised skill learning algorithm from Eysenbach et al. [19] to learn 4 diverse skills on the $E_{train}$ environments. We test the learned skills for exploration by randomly sampling a skill, and then executing it for steps, where we sample actions from the probabilities output by the selected skill. The policy architecture is the same as that of our subroutines; the discriminator is based on a ResNet 18 model. Both models are initialized from ImageNet. The policy is trained for over 10 million interactions; the best performance occurs at around 1M interaction samples.

5. Curiosity Policy [18]: We train a curiosity-based agent that seeks regions of space where its forward model has high prediction error [18]. The policy architecture is the same as that of our subroutines (except that it does not take in the latent vector $z$), and is initialized from ImageNet. The forward model is learned in the conv average-pooled feature space of a fixed ResNet 18 model pre-trained on ImageNet. Trajectories are executed by sampling from the action probabilities output by the policy. Once again, the policy is trained for over 10 million interactions; the best performance occurs at around 1M interaction samples.
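A minimal sketch of the two stronger hand-crafted baselines; the sampling probabilities are placeholders (the exact values used are not reproduced here), and env.step() / env.collided() are hypothetical simulator hooks.

```python
import math
import random

FORWARD_BIAS = [0.05, 0.15, 0.15, 0.65]   # placeholder probabilities for stay/left/right/forward

def forward_bias_policy():
    """Sample an action index with a bias towards moving forward."""
    return random.choices([0, 1, 2, 3], weights=FORWARD_BIAS, k=1)[0]

def always_forward_rotate_on_collision(env, num_iterations):
    """Rotate by a random angle in (-pi, pi] after a collision, otherwise move forward."""
    for _ in range(num_iterations):
        if env.collided():                                   # hypothetical collision query
            turns = round(random.uniform(-math.pi, math.pi) / env.turn_angle_rad)
            for _ in range(abs(turns)):
                env.step(1 if turns > 0 else 2)              # 1: rotate left, 2: rotate right
        else:
            env.step(3)                                      # 3: move forward
```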
Figure A1: Multi-modality in Affordance Predictions: We visualize the entropy of the distribution output by the affordance model in the test environment. A larger circle denotes a higher entropy, meaning more subroutines can be invoked at that location. We observe that the affordance model has a higher entropy as the agent approaches hallway intersections or room entrances. This multi-modality collapses as the agent crosses the decision junctions.

Curiosity Model: Pathak et al. [18] proposed the use of the prediction error of a forward model as an intrinsic reward for learning skills using RL. We were surprised at the rather poor performance of the curiosity model. We found that the model converges to the policy of simply rotating in place. Such a degenerate solution makes sense, as rotating in place has a higher prediction error than staying in place or moving forward. In-place rotations cause new parts of the environment to become visible, which makes for a harder prediction task. Staying in place and moving forward cause only minor changes to the image, or no changes at all. Thus, the curiosity model rightly learns to simply rotate in place. We saw this same behavior across different runs with different hyper-parameters and different architectures: policies collapse to outputting just the rotation actions. Entropy-based regularization is used to prevent such a collapse. We used such regularization and cross-validated various choices for the trade-off in loss between entropy regularization and the policy gradient loss, but did not find it to alleviate this issue. We selected the best model for the task of exploration across different runs and different numbers of training iterations. This selected model ended up being a heavily regularized model that would pick actions almost uniformly at random, as that gets higher performance than simply rotating in place. As both extremes (taking actions randomly, or picking only the rotate-in-place action) are trivial solutions, the curiosity model starts to ignore the image and consequently performs on par with uninitialized models for reinforcement learning tasks.
Diversity Model:
The diversity model from Eysenbach et al. [19] seeks to classify states with the skill id that was used to get to them (see Algorithm 1 in [19]). While this works well for the environments studied in [19], it breaks down for visual navigation. This is because the same state can be reached via different skills depending on the starting state. This causes the skill classifier $q$ to only perform at chance. Consequently, the reward for the skill policies is uniform, causing the policies to collapse (all actions produce the same reward, and hence no learning happens). We observed this empirically in our experiments as well: accuracy for state classification was at chance (25% for four skills), and the reward stayed constant. The best performing policy (based on validation of the exploration metrics) always predicted the following probabilities for the different actions for the different skills: [0. , . , . , . ] (for stop, left, right, forward respectively). As this can be done without looking at the image, the policy learns to ignore the image. Thus, the model performs on par with uninitialized models in the hierarchical reinforcement learning experiments.

A5 Exploration Visualization
We show the coverage for each method in Figure A3. The figure overlays the trajectories executed by the different policies onto the map (which is only used for visualization). We see a wider coverage for VMSR over the other methods, and also observe that the trajectories avoid the walls of the hallways when going down them.
A6 Ablations
We show ablations over 4 hyper-parameters in Figure A4 (see caption for more details). We compare VMSR initialization to inverse-features initialization for a downstream HRL task in Figure A5. We show the ability of the trained inverse model to generalize across various camera heights of the reference images in Figure A6.
Figure A3: Coverage Visualization:
We show the coverage of the overall space after sampling 20 roll-outs from 11 different locations in the test environment $E_{test}$. Note that VMSR covers more of the environment. It is able to come out of rooms, and different roll-outs go towards different areas. The Curiosity, Diversity and Random policies spend most of their time inside rooms. Policies that are biased to move forward do come out, but do not show diverse behavior. Visualizations show the top view; however, the policies only use first-person views.
Figure A4: Dependence on active environment interaction samples, length of reference videos, and number of subroutines specified: Columns 1 and 2: We plot the exploration metrics against the number of self-supervision interaction samples. There are two orthogonal ways of varying this: increasing the number of restarts while keeping the episode length fixed (Col 1), and increasing the length of each self-supervision episode while keeping the number of restarts fixed (Col 2). We see that visual diversity improves performance on the Max Dist metric, but saturates at 45K interaction samples (1500 restarts with 30 steps each). Performance remains roughly the same as we increase the episode length.
Column 3: We vary the number of subroutines learned on the x-axis and compare using the affordance model for sampling subroutines to randomly sampling subroutines. The affordance model shows an improvement in collision rate over random sampling, indicating that the affordance model better respects the constraints of the physical space. We don't see an improvement in the exploration metric or the max distance metric.
Column 4: We observe improvements as we increase the path length of the reference trajectories. Longer trajectories presumably allow VMSR to learn more complex subroutines.
Figure A5:
We compare VMSR initialization to initializing the image features of the sub-policies with the features from the inverse model for the downstream HRL task of PointGoal with sparse rewards. We see that VMSR is 3x more sample efficient compared to this baseline.
Figure A6:
We test the generalization of the learned inverse model on images from $E_{val}$, which is unseen during training. We plot the prediction accuracy (y-axis) as we increase the camera height from the ground (x-axis). The agent is trained on heights from 90 cm to 150 cm in $E_{train}$ and evaluated on heights from 10 cm to 300 cm. We observe very consistent performance even in the range not seen during training. Note that the agent starts touching the ceiling of the room in some places at 300 cm.

A7 RL Experimental Setup
We use $E_{test}$ for the RL experiments. We use A2C to train all our algorithms on the PointGoal task and the AreaGoal task; an illustrative sketch of reward functions for these two tasks follows the task descriptions below.

• Area Goal: The task is to find the nearest washroom. $E_{test}$ contains 2 washrooms, and we start the agent 10-23 steps away from the nearest washroom. We randomly start the agent at a different location for every episode.

• Point Goal: We specify the goal coordinates relative to the start position, and randomly sample the start and goal locations every episode. The goal is 10-17 steps away from the start location.
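A minimal sketch of sparse and dense reward functions consistent with these task definitions; the success threshold, reward magnitudes, and the geodesic-distance helper are illustrative assumptions, not values taken from the paper.

```python
def point_goal_reward(prev_dist, dist, success_thresh=0.4, sparse=True):
    """PointGoal: reach a goal given as a relative offset.
    Sparse: reward only on success. Dense: additionally reward progress (shaping)."""
    success = dist < success_thresh                 # e.g. within one forward step of the goal
    if sparse:
        return 1.0 if success else 0.0
    return (prev_dist - dist) + (1.0 if success else 0.0)

def area_goal_reward(agent_cell, washroom_cells):
    """AreaGoal: sparse reward for entering the target area (here, the washroom)."""
    return 1.0 if agent_cell in washroom_cells else 0.0

# prev_dist / dist would come from a geodesic distance to the goal computed on the
# simulator's free-space map (not shown here).
```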
A8 Video Results
The enclosed video vmsr.mp4 contains video results of the real robot deployment, followed by an explanation of our method. We use [43] for the real robot deployment. Note that along with the 4 primitive actions (rotate left, rotate right, move forward, and stay in place), the robot also moves slightly backward in case of a collision.

Table A1:
Split of environments between the different sets used in the paper. These environments are from the Stanford Building Parser Dataset (SBPD) [37] and the Matterport 3D Dataset (MP3D) [38]. We fix a step size ($x$) and rotation angle ($\theta$) for each area by randomly sampling from the list. For the elevation angle and the height of the robot, we resample a value from the given ranges for every video.

Split     Environments                                                          Step Sizes (x in cm)   Rotation Angles (θ)   Elevations (φ)   Height (h in cm)
E_train   area1, area6, B6ByNegPMKs, Vvot9Ly1tCj                                20, 50, 80             36°, 24°, 18°         [-25°, 5°]       [90, 150]
E_video   area5a, area5b, p5wJjkQkbXX, VFuaQ6m2Qom, 2n8kARJN3HM, SN83YJsR3w2    30, 60, 90             40°, 30°, 24°, 20°    [-35°, -5°]      [80, 160]
E_val     area3                                                                 40                     30°                   -15°             120
E_test    area4                                                                 40                     30°                   -15°             120

A9 Area Splits and Agent Settings
We give details of the area splits and the action space of the experts which generate the reference videos in Table A1. In the table, the step size ($x$) refers to the length of a single forward step, $\theta$ refers to the rotation angle for a left/right turn, $\phi$ refers to the elevation angle of the onboard RGB camera from the horizontal, and $h$ refers to the height of the camera above the ground.