RoboTHOR: An Open Simulation-to-Real Embodied AI Platform

Matt Deitke*, Winson Han*, Alvaro Herrasti*, Aniruddha Kembhavi*, Eric Kolve*, Roozbeh Mottaghi*, Jordi Salvador*, Dustin Schwenk*, Eli VanderBilt*, Matthew Wallingford*, Luca Weihs*, Mark Yatskar*, Ali Farhadi
* Alphabetically listed, equal contribution

PRIOR @ Allen Institute for AI, University of Washington
ai2thor.allenai.org/robothor
Figure 1: We present RoboTHOR, a platform to develop and test embodied AI agents with corresponding environments in simulation and the physical world. The complexity of environments in RoboTHOR along with disparities in appearance and control dynamics between simulation and reality pose new challenges and open many avenues for further research. (The panels contrast simulation and real: the same object with a different appearance, and the same action with a different result.)
Abstract
Visual recognition ecosystems (e.g. ImageNet, Pascal, COCO) have undeniably played a prevailing role in the evolution of modern computer vision. We argue that interactive and embodied visual AI has reached a stage of development similar to visual recognition prior to the advent of these ecosystems. Recently, various synthetic environments have been introduced to facilitate research in embodied AI. Notwithstanding this progress, the crucial question of how well models trained in simulation generalize to reality has remained largely unanswered. The creation of a comparable ecosystem for simulation-to-real embodied AI presents many challenges: (1) the inherently interactive nature of the problem, (2) the need for tight alignments between real and simulated worlds, (3) the difficulty of replicating physical conditions for repeatable experiments, and (4) the associated cost. In this paper, we introduce RoboTHOR to democratize research in interactive and embodied visual AI. RoboTHOR offers a framework of simulated environments paired with physical counterparts to systematically explore and overcome the challenges of simulation-to-real transfer, and a platform where researchers across the globe can remotely test their embodied models in the physical world. As a first benchmark, our experiments show there exists a significant gap between the performance of models trained in simulation when they are tested in both simulations and their carefully constructed physical analogs. We hope that RoboTHOR will spur the next stage of evolution in embodied computer vision.
1. Introduction
For decades, the AI community has sought to create perceptive, communicative and collaborative agents that can augment human capabilities in real world tasks. While the advent of deep learning has led to remarkable breakthroughs in computer vision [33, 24, 48] and natural language processing [45, 16], creating active and intelligent embodied agents continues to be immensely challenging. The widespread availability of large and open computer vision and natural language datasets [50, 34, 47, 66], massive amounts of compute, and standardized benchmarks have been critical to this fast progress. In stark contrast, the considerable costs involved in acquiring physical robots and experimental environments, compounded by the lack of standardized benchmarks, are proving to be principal hindrances to progress in embodied AI. In addition, current state of the art supervised and reinforcement learning algorithms are data and time inefficient, impeding the training of embodied agents in the real world.

Recently, the vision community has leveraged progress in computer graphics and created a host of simulated perceptual environments such as AI2-THOR [32], Gibson [72], MINOS [53] and Habitat [54], with the promise of training models in simulation that can be deployed on robots in the physical world. These environments are free to use, continue to be improved and lower the barrier of entry to research in real world embodied AI, democratizing research in this direction. This has led to progress on a variety of tasks in simulation, including visual navigation [22, 69], instruction following [5, 68] and embodied question answering [21, 15]. But the elephant in the room remains: How well do these models trained in simulation generalize to the real world?
While progress has been ongoing, the large costs involved in undertaking this research have restricted pursuits in this direction to a small group of well resourced organizations. We believe that creating a free and accessible framework that pairs agents acting in simulated environments with robotic counterparts acting in the physical world will open up this important research topic to all, bringing faster progress and potential breakthroughs. As a step towards this goal, we present RoboTHOR.

RoboTHOR is a platform to develop artificial embodied agents in simulated environments and test them in both simulation and the real world. A key promise of RoboTHOR is to serve as an open and accessible benchmarking platform to stimulate reproducible research in embodied AI. With this in mind, it has been designed with the following properties:

• Simulation and Real Counterparts - RoboTHOR currently consists of a training and validation corpus of 75 scenes in simulation. A total of 14 test-dev and test-standard scenes are present in simulation, with their counterparts constructed in the physical world. Scenes are designed to have diverse wall and furniture layouts, and all are densely populated by a variety of object categories. Figure 1 shows a view of the dining room in one of the test-dev scenes, in simulation as well as real.

• Modular - Scenes in RoboTHOR are built in a modular fashion, drawing from an asset library containing wall structures, flooring components, ceilings, lighting elements, furniture pieces and objects; altogether totaling 731 unique assets distributed across scenes. This enables scene augmentation and easy expansion of RoboTHOR to fit the needs of researchers.

• Re-configurable - The physical environments are also built using modular and movable components, allowing us to host scenes with vastly different layouts and furniture arrangements within a single physical space. This allows us to scale our test corpora while limiting their cost and physical footprint. Reconfiguring the space to a new scene can be accomplished in roughly 30 minutes.

• Accessible to all - The simulation environment, assets and algorithms we develop will be open source. More critically, researchers from all over the world will be able to remotely deploy their models on our hardware at no cost to them. We will be setting up a systematic means of reserving time in our environment.

• Replicable - The physical space has been designed to be easily replicable by other researchers should they wish to construct their own physical environment. This is achieved through open sourced plans, readily available building components, IKEA furniture, and a low cost amounting to roughly $10,000 in materials and assets to create the physical space. In addition, we use LoCoBot, an inexpensive and easily obtainable robot.

• Benchmarked - In addition to open sourcing baseline models, we will host challenges involving several embodied AI tasks with a focus on the ability to transfer these models successfully onto robots running in a physical environment.

RoboTHOR has been designed to support a variety of embodied AI tasks. In this work we benchmark models for semantic navigation, the task of navigating to an instance of a specified category in the environment. The complexity and density of scenes in RoboTHOR render this task quite challenging, with humans requiring 49.5 steps (median statistic) to find the target object. We train a set of competitive models using a pure reinforcement learning (RL) approach with asynchronous advantage actor-critic (A3C) on the simulated training environments, measure their performance on the simulated as well as real validation environments and arrive at the following revealing findings. (1) Similar to findings in past works such as [69], semantic navigation models struggle with generalizing to unseen environments in simulation. We show that their performance takes an even larger hit when deployed onto a physical robot in the real world. (2) We analyze simulated and real world egocentric views and find a disparity in feature space in spite of the images from the two modalities looking fairly similar to the naked eye; this is a key factor affecting the transfer of policies to the real world. (3) As expected and noted in previous works, control dynamics in the real world vary significantly owing to motor noise, slippage, and collisions. (4) Off the shelf image correction mechanisms such as image-to-image translation do not improve performance.

These findings reveal that training embodied AI models that generalize to unseen simulated environments, and further yet to the real world, remains a daunting challenge; but they also open up exciting research frontiers. We hope that RoboTHOR will allow more research teams from across the globe to participate in this research, which will result in new model architectures and learning paradigms that can only benefit the field.
2. Related Work
Embodied AI Environments.
In recent years, several synthetic frameworks have been proposed to investigate tasks including visual navigation, task completion and question answering in indoor scenes [32, 53, 70, 8, 74, 72, 46, 54]. These free virtual environments provide excellent testbeds for embodied AI research by abstracting away the noise in low-level control, manipulation and appearance and allowing models to focus on the high-level end goal. RoboTHOR provides a framework for studying these problems as well as for addressing the next frontier: transferring models from simulation to the real world.

Robotics research platforms [1, 2] have traditionally been expensive to acquire. More recent efforts have led to low cost robot solutions [58, 3, 43], opening up the space to more research entities. There has been a long history of using simulators in conjunction with physical robots. These largely address tasks such as object manipulation using robotic arms [11] and autonomous vehicles [58, 55].
Visual Navigation.
In this paper, we explore models for the task of visual navigation, a popular topic in the robotics and computer vision communities. The navigation problem can be divided into two broad categories: spatial navigation and semantic navigation. Spatial navigation approaches [62, 17, 64, 23, 25, 56, 79, 49, 30, 14, 10] typically address navigating towards a pre-specified coordinate or a frame of a scene, and they focus on understanding the geometry of the scene and learning better exploration strategies. For example, [79] address navigation towards a given input image, [10] address navigation towards a point in a scene and [30] learn a collision-free navigation policy. Semantic navigation approaches [22, 75, 69, 71, 39, 52, 41] attempt to learn the semantics of the target in conjunction with navigation. For example, [75] use prior knowledge of object relations to learn a policy that generalizes better to unseen scenes or objects, [69] use meta-learning to learn a self-supervised navigation policy toward a specified object category, and [71] use prior knowledge of scene layouts to navigate to a specific type of room. We benchmark models on the task of semantic navigation.

Navigation using language instructions has been explored by [5, 68, 18, 31, 67, 36, 37]. This line of work has primarily been tested in simulation; transferability to the real world remains an open question, and addressing this via RoboTHOR is a promising future endeavour. Navigation has also been explored in other contexts such as autonomous driving (e.g., [12, 73]) or city navigation (e.g., [38, 13]). In this work, we focus on indoor navigation.
Sim2Real Transfer.
Domain adaptation in general, as well as Sim2Real in particular, have a long history in computer vision. There are different techniques to adapt models from a source domain to a target domain. The main approaches are based on randomization of the source domain to better generalize to the target domain [63, 29, 51, 44, 61], learning the mapping between some abstraction or higher order statistics of the source and target domains [27, 59, 19, 35, 76, 42, 78], interpolating between the source and the target domain on a learned manifold [20, 9], or generating the target domain using generative adversarial training [7, 60, 57, 6, 26, 28]. RoboTHOR enables source randomization via scene diversity and asset diversity. We also experiment with using an off the shelf target domain mapping method, the GAN-based model of [77].
3. RoboTHOR
State of the art learning algorithms for embodied AI use reinforcement learning based approaches to train models, which typically require millions of iterations to converge to a reasonable policy. Training policies in the real world with real robots would take years to complete, due to the mechanical constraints of robots. Synthetic environments, on the other hand, provide a suitable platform for such training strategies, but how well models trained in simulation transfer to the real world remains an open question. RoboTHOR is a platform, built upon the AI2-THOR framework [32], to build and test embodied agents with an emphasis on studying this problem of domain transfer from simulation to the real world.
Scenes.
RoboTHOR consists of a set of 89 apartments: 75 in train/val (we use 60 for training and 15 for validation), 4 in test-dev (which are used for validation in the real world) and 10 in test-standard (a blind physical test set), drawn from a set of 15, 2 and 5 wall layouts respectively. Apartments that share the same wall layout have completely different room assignments and furniture placements (for example, a bedroom in one apartment might be an office in another). Apartment layouts were designed to encompass a wide variety of realistic living spaces. Figure 3 shows a heatmap of wall placements across the train/val subset. This set of apartments is only instantiated in simulation, while the test-dev and test-standard apartments are also instantiated in the physical world. The layouts, furniture, objects, lighting, etc. of the simulation environments have been designed carefully so as to closely resemble the corresponding scenes in their physical counterparts, while avoiding any overlap between the wall layouts and object instances among train/val, test-dev and test-standard. This resemblance will enable researchers to study the discrepancies between the two modalities and systematically identify the challenges of the domain transfer.

Figure 2: Distribution of object categories in RoboTHOR.
Assets.
A guiding design principle of RoboTHOR is modularity, which allows us to easily augment and scale scenes. A large asset library was created by digital artists, from which scenes were created by selectively drawing from these assets. This is in contrast to environments that are based on 3D scans of rooms, which are challenging to alter and interact with. The framework includes 11 types of furniture (e.g. TV stands and dining tables) and 32 types of small objects (e.g. mugs and laptops) across all scenes. The majority of real furniture and objects were gathered from IKEA. Among the small object categories, 14 are designated as targets and guaranteed to be found in all scenes for use in semantic navigation tasks. In total there are 731 unique object instances in the asset library, with no overlap among train/val, test-dev and test-standard scenes. Figure 2 shows the distribution of object categories amongst the asset library. We distribute object categories as uniformly as possible to avoid bias toward specific locations. Figure 3 shows the spatial distribution of target objects, background objects and furniture in the scenes. Figure 4 shows the distribution of the number of visible objects in a single frame. A large number of frames consist of the agent looking at a wall, as is common in apartments, but outside these views many objects are visible to the agent at any given point as it navigates the environment.
Physical space.
The physical space for RoboTHOR is . m × . m. The space is partitioned into rooms and corridors using ProPanel walls, which are designed to be lightweight and easy to set up and tear down, allowing us to easily configure a new apartment layout in a few minutes.

Figure 3: Spatial distribution of objects and walls. Heatmaps illustrate the diverse spatial distribution of target objects, background objects, furniture, and walls.

Figure 4: Object visibility statistics. The distribution of objects visible to an agent at a single time instant.

Agent.
The physical robot used is a LoCoBot, which is equipped with an Intel RealSense RGB-D camera. We replicate the robot in simulation with the same physical and camera properties. To mimic the noisy dynamics of the robot movements, we add noise to the controller in simulation. The noise parameters are estimated by manually measuring the error in orientation and displacement over multiple runs.
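As a concrete illustration, the sketch below shows one way such calibrated actuation noise could be injected into simulated move and rotate commands. The Gaussian parameters are the ones reported in Section 5; the function names and the 0.25 m translation step are assumptions for illustration, not the released controller API.

```python
import random

# Illustrative noise parameters (reported in Section 5): translation noise with
# mean 0.001 m and std 0.005 m; rotation noise with mean 0 deg and std 0.5 deg.
TRANSLATION_NOISE = (0.001, 0.005)   # meters
ROTATION_NOISE = (0.0, 0.5)          # degrees

def noisy_move_ahead(step_meters: float = 0.25) -> float:
    """Return the distance actually moved after adding actuation noise (step size assumed)."""
    mu, sigma = TRANSLATION_NOISE
    return step_meters + random.gauss(mu, sigma)

def noisy_rotate(angle_degrees: float = 45.0) -> float:
    """Return the rotation actually applied after adding actuation noise (45 deg per Section 5)."""
    mu, sigma = ROTATION_NOISE
    return angle_degrees + random.gauss(mu, sigma)
```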
API.
To enable a seamless switch between the synthetic and the real environments, we provide an API that is agnostic to the underlying platform. Hence, agents trained in simulation can be easily deployed onto the LoCoBot for testing. The API was built upon the PyRobot [43] framework to manage control of the LoCoBot base as well as the camera.
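A minimal sketch of what such a platform-agnostic interface could look like is given below. The class and method names (Controller, SimController, RealController, step) are illustrative assumptions rather than the actual RoboTHOR or PyRobot API; the point is that a policy only ever sees a common step interface, whether actions run in the simulator or on the physical LoCoBot.

```python
from abc import ABC, abstractmethod

class Controller(ABC):
    """Common interface so a trained policy never needs to know whether it is
    driving the simulator or the physical robot."""

    @abstractmethod
    def step(self, action: str) -> dict:
        """Execute a discrete action (e.g. 'MoveAhead') and return an observation."""

class SimController(Controller):
    def step(self, action: str) -> dict:
        ...  # forward the action to the AI2-THOR-based simulator (details omitted)

class RealController(Controller):
    def step(self, action: str) -> dict:
        ...  # translate the action into PyRobot base/camera commands on the LoCoBot

def run_episode(controller: Controller, policy, target: str, max_steps: int = 200) -> None:
    obs: dict = {}  # observation returned by the environment
    for _ in range(max_steps):
        action = policy(obs, target)
        obs = controller.step(action)
        if action == "Done":
            break
```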
Connectivity.
A main goal of this framework is to provide researchers across the globe with access to deploy their models onto this physical environment. With this in mind, we developed the infrastructure for connecting to the physical robot or the simulated agent via HTTP. A scheduler prevents multiple parties from accessing the same physical hardware at once.
Localization.
We installed Super-NIA-3D localization modules across the physical environment to estimate the location of the robot and return it to the users. For our experiments, we do not use the location information for training, as this type of training signal is not usually available in real world scenes. We use this location information only for evaluation and visualization.
4. Visual Semantic Navigation
In this paper, we benchmark models for the task of visual semantic navigation, i.e. navigating towards an instance of a pre-specified category. RoboTHOR enables various embodied tasks such as question answering, task completion and instruction following. Navigation is a key component of all these tasks and is a necessary and important first step towards studying transfer in the context of embodied tasks.

Visual semantic navigation evaluates the agent's capabilities not only in avoiding obstacles and making the right moves towards the target, but also in understanding the semantics of the scene and targets. The agent should learn what different instances of an object category look like and should be able to reason about occlusion, scale changes and other variations in object appearance.

More specifically, our goal is to navigate towards an instance of an object category specified by a noun (e.g., Apple) given ego-centric sensory inputs. The sensory input can be an RGB image, a depth image, or a combination of both. At each time step the agent must issue one of the following actions: Move Ahead, Rotate Right, Rotate Left, Look Up, Look Down, Done. The action Done signifies that the agent reports that it has reached its goal and leads to the end of the episode. We consider an episode successful if (a) the object is in view, (b) the agent is within a threshold distance of the target, and (c) the agent reports that it observes the object. The starting location of the agent is a random location in the scene.

The motion of the agent in the simulated world is stochastic in nature, mirroring its behavior in the real world. This renders the task more challenging. Previous works such as [69] consider agent motion along the axes of a grid. But given the end goal of navigating in the real world with motor noise and wheel slippage, deterministic movements in training lead to suboptimal performance during testing.
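The three success criteria above can be summarized in a small check like the following sketch; the helper arguments and the 1.0 m threshold are illustrative assumptions, since the paper only states that a distance threshold is used.

```python
SUCCESS_DISTANCE = 1.0  # meters (assumed value; the paper specifies only "a threshold")

def episode_successful(agent_reported_done: bool,
                       target_visible: bool,
                       distance_to_target: float) -> bool:
    """An episode succeeds only if the agent itself issues Done while the target
    is in view and within the distance threshold."""
    return (agent_reported_done
            and target_visible
            and distance_to_target <= SUCCESS_DISTANCE)
```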
The semantic navigation task is very challenging owing to the size and complexity of the scenes. Figure 5 shows the lengths of the shortest paths to the target objects, in terms of the Move Ahead and Rotate actions. But shortest path statistics are a bit misleading for the task of semantic navigation because they assume that the agent already knows the location it must travel towards. In fact, the agent must explore until it observes the target, and then move swiftly towards it. We conducted a study where humans were posed with the problem of navigating in RoboTHOR scenes (in simulation) to find targets. The median number of steps was 49.5 (compared to 22.0 for shortest paths), illustrating the exploratory nature of the task. Figure 6 shows an example trajectory from a human compared to the corresponding shortest path.
Figure 5: Histogram of actions along the shortest path. The number of actions invoked along the shortest paths to targets in the training scenes. Note that the shortest path is very difficult to obtain in practice, since it assumes a priori knowledge of the scene.
Figure 6: Example human trajectory. The shortest path to a target vs. the path taken by a human from the same starting location are visualized. The human wanders around looking for the TV and, on seeing it, walks straight towards it.
We measure the performance of the following baseline models:

Random - This model chooses an action randomly amongst the set of possible actions. This includes invoking the Done action. The purpose of this baseline is to ascertain that the scenes and starting locations are not overly simplistic.

Instant Done - This model invokes the Done action at the very first time step. The purpose of this baseline is to measure the percentage of trivial starting locations.

Blind - This model receives no sensory input. It consists only of an LSTM with simply a target embedding as input. Its purpose is to establish a baseline that can only leverage starting and target location bias in the dataset.
Image-A3C - The agent perceives the scene at time t in the form of an image o_t. The image is fed into a pre-trained and frozen ResNet-18 to obtain a spatial feature tensor, followed by two convolution layers that reduce the channel depth to 32; this is then concatenated with a 64 dimensional embedding of the target word and provided to an LSTM which generates the policy. The agent is trained using the asynchronous advantage actor-critic (A3C) [40] formulation, to act to maximize its expected discounted cumulative reward. Two kinds of rewards are provided: a positive success reward and a negative step reward to encourage efficient paths.

Image+Detection-A3C - The image is fed into a Faster-RCNN [48] trained on the MSCOCO dataset. Each resulting detection has a category, a probability and box location and dimensions, which are converted to an embedding using an MLP. The resulting set of embeddings is converted into a spatial tensor by aligning detection boxes to spatial grid cells and considering only the top 3 boxes per location. This tensor is concatenated with the tensor obtained for the image modality and then processed as above.

We report Success Rate and Success weighted by Path Length (SPL) [4], common metrics for evaluating navigation models. In addition, we also report path lengths in terms of the number of actions taken as well as the distance travelled. This quantifies the exploration carried out by the agent.
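For reference, SPL as defined by Anderson et al. [4] is the average over episodes of S_i * l_i / max(p_i, l_i), where S_i indicates success, l_i is the shortest-path length and p_i is the length of the path the agent actually took. A small sketch of that computation:

```python
from typing import List

def spl(successes: List[bool],
        shortest_lengths: List[float],
        taken_lengths: List[float]) -> float:
    """Success weighted by Path Length, averaged over all evaluation episodes."""
    total = 0.0
    for s, l, p in zip(successes, shortest_lengths, taken_lengths):
        if s:
            total += l / max(p, l)
    return total / len(successes)
```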
5. Experiments
Training.
We train models to navigate towards a single target, a Television, with an action space of 4 actions: Move Ahead, Rotate Right, Rotate Left and Done. The rotation actions rotate the agent by 45 degrees. The agent must issue the Done action when it observes the object. If an agent is successful, it receives a reward of +5. The step penalty at each time step is -0.01. To mimic the noisy dynamics of the real robot, we add noise to the movement of the virtual agent. For translation, we add Gaussian noise with mean 0.001 m and standard deviation 0.005 m, and for rotation we use a mean of 0° and a standard deviation of 0.5°. Models were trained on 8 TITAN X GPUs for 100,000 episodes using A3C with 32 threads. For each episode, we sample a random training scene and a random starting location in the scene for the agent. We use the Adam optimizer with a learning rate of 0.0001.

We train the models on a subset of 50 train scenes and report numbers on 2 of the test-dev scenes. We report metrics on starting locations categorized into easy, medium and hard. If the length of the shortest path from the starting point to the target is among the lowest 20% of path lengths in that scene, it is considered easy. If it is between 20% and 60%, it is considered medium, and longer paths are considered hard.
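Putting the Section 4 architecture and the training details above together, the following is a rough PyTorch sketch of the Image-A3C model, trained with the rewards described here (+5 on success, -0.01 per step). Exact tensor shapes, kernel sizes and hidden dimensions are assumptions for illustration; only the overall structure (frozen ResNet-18 features, channel reduction to 32, a 64-dim target embedding, an LSTM, actor and critic heads) follows the paper's description.

```python
import torch
import torch.nn as nn
import torchvision

class ImageA3C(nn.Module):
    def __init__(self, num_targets: int, num_actions: int = 4, hidden: int = 512):
        super().__init__()
        resnet = torchvision.models.resnet18(pretrained=True)
        # Keep the convolutional trunk only, so the output is a spatial feature map.
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])
        for p in self.backbone.parameters():
            p.requires_grad = False  # frozen, as in the paper
        # Two convolution layers reduce the channel depth to 32 (kernel size assumed).
        self.reduce = nn.Sequential(
            nn.Conv2d(512, 64, kernel_size=1), nn.ReLU(),
            nn.Conv2d(64, 32, kernel_size=1), nn.ReLU(),
        )
        self.target_embed = nn.Embedding(num_targets, 64)     # 64-dim target word embedding
        self.lstm = nn.LSTMCell(32 * 7 * 7 + 64, hidden)      # assumes a 7x7 feature map
        self.actor = nn.Linear(hidden, num_actions)           # policy logits
        self.critic = nn.Linear(hidden, 1)                    # value estimate for A3C

    def forward(self, image, target_idx, state=None):
        feats = self.reduce(self.backbone(image)).flatten(1)
        tgt = self.target_embed(target_idx)
        h, c = self.lstm(torch.cat([feats, tgt], dim=1), state)
        return self.actor(h), self.critic(h), (h, c)
```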
Sim-to-Sim
Table 1 shows the results of our benchmarked models when trained in simulation and evaluated on the test-dev scenes in simulation. The trivial baselines Random, Instant Done and Blind perform very poorly, indicating that the dataset lacks a bias that is trivial to exploit using these models. The Image only model performs reasonably well, succeeding at over 50% of easy trajectories but doing very poorly at hard ones. Adding object detection does not help performance. Our analysis in Section 6 shows that object detectors trained in the real world show a drop in performance in simulation, which might be contributing to their ineffectiveness.
Sim-to-Real
Due to the slow motion of the robot, episodes on the robot last as long as 10 minutes. This limits the amount of testing that can be done in the real world. Table 2 shows the result of the best performing model from Sim-to-Sim, evaluated on a real scene on a subset of the same starting locations as those reported in Table 1. There is a sizeable drop in performance for the real robot, especially in SPL. The robot, however, does learn to navigate around and explore its environment fairly safely, and over 80% of the trajectories have no collisions with obstacles. This leads to high values for episode lengths in all 3 cases: easy, medium and hard.
Overfit Sim-to-Real
The semantic navigation sim-to-real task in RoboTHOR tests two kinds of generalization: moving to new scenes, and moving from simulation to real. To factor out the former and focus on the latter, we trained a policy on a test-dev scene (which expectedly led to overfitting on sim) and then deployed it on the robot. Table 3 shows these results and demonstrates the upper bound of current models in the real world if they had memorized the test-dev scene perfectly in simulation. The robot does very well on the easy targets, but is affected a lot on hard targets. For all three modes, the SPL is affected tremendously. Appearance and control variations often lead the robot to spaces away from the target, which does not happen in simulation due to overfitting.

                   Easy                                    Medium                                  Hard
                   Success  SPL    Episode  Path           Success  SPL    Episode  Path           Success  SPL    Episode  Path
                                   length   length                         length   length                         length   length
Random             7.58     5.32   4.36     0.34           0.00     0.00   4.27     0.30           0.00     0.00   3.06     0.19
Instant Done       4.55     3.79   1.00     0.00           0.00     0.00   1.00     0.00           0.00     0.00   1.00     0.00
Blind              4.55     3.79   1.00     0.00           0.00     0.00   1.00     0.00           0.00     0.00   1.00     0.00
Image              55.30    38.12  45.87    9.26           28.79    19.12  78.49    14.82          1.47     0.97   81.09    14.22
Image+Detection    36.36    19.89  63.41    11.39          11.36    5.25   90.37    16.65          0.74     0.61   83.01    14.00

Table 1: Benchmark results for Sim-to-Sim.
                   Easy                                    Medium                                  Hard
                   Success  SPL    Episode  Path           Success  SPL    Episode  Path           Success  SPL    Episode  Path
                                   length   length                         length   length                         length   length
Image              33.33    3.53   53.16    7.18           16.66    3.70   43.83    5.33           0.00     0.00   67.83    7.00

Table 2: Benchmark results for Sim-to-Real.
6. Analysis
We now dig deeper into the appearance disparities between real and simulation images via t-SNE embeddings, object detection results and the output policies for both modalities. We also provide a study showing the effect of changing camera parameters between real and simulation on the transfer problem. Finally, we evaluate using an image translation method for the purposes of domain adaptation.
Appearance disparities.
To the naked eye, images from the real and simulation worlds in RoboTHOR look quite similar. However, when we look at embeddings provided by networks, the disparity becomes more visible. We considered 846 images each from simulation and real, collected from the same locations in the scene, passed these images through ResNet-18 to obtain a 512 dimensional feature vector, and then used t-SNE [65] to reduce the dimensionality to 3. Figure 7 shows these embeddings. One can clearly see the separation of points into real and simulation clusters. This indicates that embeddings of images in the two modalities are different, which goes towards explaining the drop in performance between simulation and real test-dev scenes (see Table 1 and Table 2). Figure 7 also shows the nearest neighbor (cosine similarity) simulation image to one of the real images. We found that although nearest neighbors are not far away spatially, they are still slightly different, sometimes having a different view, or a different distance to an obstacle. This might explain why the robot takes different actions in the real world. This analysis indicates that methods of representation learning might need to be revisited, especially when these representations must work across real and simulation modalities.
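A minimal sketch of this embedding analysis is given below: extract ResNet-18 features for paired real and simulated frames and project them to a low-dimensional space with t-SNE. Data loading and preprocessing are omitted, and the variable names are illustrative.

```python
import torch
import torchvision
from sklearn.manifold import TSNE

resnet = torchvision.models.resnet18(pretrained=True)
resnet.fc = torch.nn.Identity()   # expose the 512-dim pooled features
resnet.eval()

@torch.no_grad()
def embed(frames: torch.Tensor) -> torch.Tensor:
    """frames: a batch of preprocessed images, shape (N, 3, 224, 224)."""
    return resnet(frames)

# real_frames and sim_frames are assumed to be preprocessed image batches taken
# from corresponding locations in the real and simulated apartments.
# features = torch.cat([embed(real_frames), embed(sim_frames)]).numpy()
# coords = TSNE(n_components=3).fit_transform(features)
```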
Object detection transfer.
Since we leverage an off the shelf object detector (Faster-RCNN) trained on natural images (MS-COCO), it is imperative to compare the accuracy of this model on images collected in the real and simulated apartments. We collected 761 corresponding images from both environments, ran Faster-RCNN on them and obtained ground truth annotations for 10 object classes (all of which intersect with MS-COCO classes) on Amazon Mechanical Turk. At an Intersection-over-Union (IoU) of 0.5, we obtained a mAP of 0.338 for real images and 0.255 for simulated ones, demonstrating that there is a performance hit going across modalities. Furthermore, detection probabilities tend to differ between the two modalities, as demonstrated in Figure 8, rendering transfer more challenging for models that exploit these probabilities. Note that this transfer is in the opposite direction (Real to Sim) compared to our current transfer setup, but is revealing nonetheless.

Figure 7: Comparison of embeddings for real and synthetic images. The scatter plot shows a t-SNE visualization of ResNet-18 (pre-trained on ImageNet) features for images from real and simulated apartments. Also shown are the nearest neighbor in feature space and the spatial nearest neighbor, which differ slightly in the viewpoint of the agent.

                   Easy                                    Medium                                  Hard
                   Success  SPL    Episode  Path           Success  SPL    Episode  Path           Success  SPL    Episode  Path
                                   length   length                         length   length                         length   length
Image Sim-2-Sim    100      82.17  8.09     1.05           100      86.17  27.52    4.77           94.12    83.18  42.26    7.90
Image Sim-2-Real   100      12.28  15.00    2.33           83.33    18.68  43.33    5.5            50       28.53  30.16    7.54

Table 3: Benchmark results for Sim-to-Real trained on a single test-dev scene.

Figure 8: Object detection. Results of object detection in a real and a simulated image (detected categories include Laptop, Cup, Clock and Apple). Solid lines denote high confidence detections whereas dashed lines denote low confidence detections.
Modifying camera parameters.
Since the real and simulation apartments were both designed by us, we were able to model the simulation camera closely after the one present on the LoCoBot. However, to test the sensitivity of the learnt policy to the camera parameters, we performed an experiment where we trained with a Field Of View (FOV) of 90° and tested with an FOV of 42.5°. The resulting real world experiments for the Image only model show a huge drop in performance, to 16% for easy and 0% for medium and hard. Interestingly, the robot tends to move very little and instead rotates endlessly. We hypothesize that models trained in simulation likely overfit to the image distribution observed at training, and the images captured at a different FOV vary so significantly in feature space that they invoke unexpected behaviors. Since popular image representation models are trained on images from the internet, they are biased towards the distribution of cameras that people use to take pictures. Cameras on robots are usually quite different. This suggests that we should fine tune representations to match robot cameras and also consider different camera parameters as a camera augmentation step during training in simulation.

Domain adaptation via image translation.
Since appearance statistics vary between real and simulation, we experimented with applying an image-to-image translation technique, CycleGAN [77], to translate real world images towards simulation images. This would enable us to train in simulation and apply the policy on the real robot while processing the translated images. We needed to use a translation model that could use unpaired image data, since we are unable to obtain paired images with zero error in the placement of the agent. Paired images were obtained for 3 test-dev scenes to train CycleGAN. The policy (trained in simulation) was run on the robot on the remaining test-dev scene. Interestingly, the CycleGAN model does learn to flatten textures and adjust lighting and shadows, as seen in Figure 9. However, the resultant robot performance is very poor and obtains 0% accuracy. While image translation looks pleasing to the eye, it does introduce spurious errors which hugely affect the image embeddings and thus the resultant policy.
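Schematically, the evaluation pipeline described above pushes each real camera frame through a real-to-sim translator before the simulation-trained policy sees it. In the sketch below, real_to_sim_generator stands in for the trained CycleGAN generator; it and the other names are illustrative assumptions, not part of a released API.

```python
def act_with_translation(frame, target, policy, real_to_sim_generator):
    """Translate a real frame towards the simulation domain, then query the policy."""
    translated = real_to_sim_generator(frame)   # make the real frame look simulated
    return policy(translated, target)           # policy was trained purely in simulation
```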
Figure 9: Examples of real to simulation transfer. We use CycleGAN [77] to translate real images towards simulated ones (panels show the real input and its real-to-sim translation). The model learns to flatten out the texture and adjust the shadows to look like a simulated image.
7. Conclusion
In this paper, we presented RoboTHOR, an open, modular, re-configurable and replicable embodied AI platform with counterparts in simulation and the real world, where researchers across the globe can remotely deploy their models onto physical robots and test their algorithms in the physical world. Our preliminary findings show that the performance of models drops significantly when transferring from simulation to real. We hope that RoboTHOR will enable more research towards this important problem.
Acknowledgements.
This work is in part supported by NSF IIS 1652052, IIS 17303166, DARPA N66001-19-2-4031, 67102239 and gifts from Allen Institute for AI.

References

[1] Baxter. https://en.wikipedia.org/wiki/Baxter_(robot).
[2] Sawyer.
[3] The MIT RACECAR. https://mit-racecar.github.io, 2016.
[4] Peter Anderson, Angel X. Chang, Devendra Singh Chaplot, Alexey Dosovitskiy, Saurabh Gupta, Vladlen Koltun, Jana Kosecka, Jitendra Malik, Roozbeh Mottaghi, Manolis Savva, and Amir Roshan Zamir. On evaluation of embodied navigation agents. arXiv, 2018.
[5] Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko Sünderhauf, Ian Reid, Stephen Gould, and Anton van den Hengel. Vision-and-Language Navigation: Interpreting visually-grounded navigation instructions in real environments. In CVPR, 2018.
[6] Konstantinos Bousmalis, Alex Irpan, Paul Wohlhart, Yunfei Bai, Matthew Kelcey, Mrinal Kalakrishnan, Laura Downs, Julian Ibarz, Peter Pastor, Kurt Konolige, Sergey Levine, and Vincent Vanhoucke. Using simulation and domain adaptation to improve efficiency of deep robotic grasping. In ICRA, 2018.
[7] Konstantinos Bousmalis, Nathan Silberman, David Dohan, Dumitru Erhan, and Dilip Krishnan. Unsupervised pixel-level domain adaptation with generative adversarial networks. In CVPR, 2017.
[8] Simon Brodeur, Ethan Perez, Ankesh Anand, Florian Golemo, Luca Celotti, Florian Strub, Jean Rouat, Hugo Larochelle, and Aaron C. Courville. HoME: A household multimodal environment. arXiv, 2017.
[9] Rui Caseiro, Joao F. Henriques, Pedro Martins, and Jorge Batista. Beyond the shortest path: Unsupervised domain adaptation by sampling subspaces along the spline flow. In CVPR, 2015.
[10] Devendra Singh Chaplot, Saurabh Gupta, Dhiraj Gandhi, Abhinav Gupta, and Ruslan Salakhutdinov. Learning to explore using active neural mapping. In CVPR Workshop on Habitat Embodied Agents, 2019.
[11] Yevgen Chebotar, Ankur Handa, Viktor Makoviychuk, Miles Macklin, Jan Issac, Nathan D. Ratliff, and Dieter Fox. Closing the sim-to-real loop: Adapting simulation randomization with real world experience. In ICRA, 2019.
[12] Chenyi Chen, Ari Seff, Alain Kornhauser, and Jianxiong Xiao. DeepDriving: Learning affordance for direct perception in autonomous driving. In ICCV, 2015.
[13] Howard Chen, Alane Suhr, Dipendra Misra, Noah Snavely, and Yoav Artzi. Touchdown: Natural language navigation and spatial reasoning in visual street environments. In CVPR, 2019.
[14] Tao Chen, Saurabh Gupta, and Abhinav Gupta. Learning exploration policies for navigation. In ICLR, 2019.
[15] Abhishek Das, Samyak Datta, Georgia Gkioxari, Stefan Lee, Devi Parikh, and Dhruv Batra. Embodied question answering. In CVPR, 2018.
[16] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, 2019.
[17] Dieter Fox, Wolfram Burgard, and Sebastian Thrun. Markov localization for reliable robot navigation and people detection. In Sensor Based Intelligent Robots, 1999.
[18] Daniel Fried, Ronghang Hu, Volkan Cirik, Anna Rohrbach, Jacob Andreas, Louis-Philippe Morency, Taylor Berg-Kirkpatrick, Kate Saenko, Dan Klein, and Trevor Darrell. Speaker-follower models for vision-and-language navigation. In NeurIPS, 2018.
[19] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. JMLR, 2016.
[20] Boqing Gong, Yuan Shi, Fei Sha, and Kristen Grauman. Geodesic flow kernel for unsupervised domain adaptation. In CVPR, 2012.
[21] Daniel Gordon, Aniruddha Kembhavi, Mohammad Rastegari, Joseph Redmon, Dieter Fox, and Ali Farhadi. IQA: Visual question answering in interactive environments. In CVPR, 2018.
[22] Saurabh Gupta, James Davidson, Sergey Levine, Rahul Sukthankar, and Jitendra Malik. Cognitive mapping and planning for visual navigation. In CVPR, 2017.
[23] Raia Hadsell, Pierre Sermanet, Jan Ben, Ayse Erkan, Marco Scoffier, Koray Kavukcuoglu, Urs Muller, and Yann LeCun. Learning long-range vision for autonomous off-road driving. J. of Field Robotics, 2009.
[24] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[25] Peter Henry, Christian Vollmer, Brian Ferris, and Dieter Fox. Learning to navigate through crowded environments. In ICRA, 2010.
[26] Judy Hoffman, Eric Tzeng, Taesung Park, Jun-Yan Zhu, Phillip Isola, Kate Saenko, Alexei Efros, and Trevor Darrell. CyCADA: Cycle-consistent adversarial domain adaptation. In ICML, 2018.
[27] Judy Hoffman, Dequan Wang, Fisher Yu, and Trevor Darrell. FCNs in the wild: Pixel-level adversarial and constraint-based adaptation. arXiv, 2016.
[28] Haoshuo Huang, Qixing Huang, and Philipp Krahenbuhl. Domain transfer through deep activation matching. In ECCV, 2018.
[29] Stephen James, Andrew J. Davison, and Edward Johns. Transferring end-to-end visuomotor control from simulation to real world for a multi-stage task. In CORL, 2017.
[30] Gregory Kahn, Adam Villaflor, Bosen Ding, Pieter Abbeel, and Sergey Levine. Self-supervised deep reinforcement learning with generalized computation graphs for robot navigation. In ICRA, 2018.
[31] Liyiming Ke, Xiujun Li, Yonatan Bisk, Ari Holtzman, Zhe Gan, Jingjing Liu, Jianfeng Gao, Yejin Choi, and Siddhartha Srinivasa. Tactical rewind: Self-correction via backtracking in vision-and-language navigation. In CVPR, 2019.
[32] Eric Kolve, Roozbeh Mottaghi, Winson Han, Eli VanderBilt, Luca Weihs, Alvaro Herrasti, Daniel Gordon, Yuke Zhu, Abhinav Gupta, and Ali Farhadi. AI2-THOR: An Interactive 3D Environment for Visual AI. arXiv, 2017.
[33] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In NeurIPS, 2012.
[34] Tsung-Yi Lin, Michael Maire, Serge J. Belongie, Lubomir D. Bourdev, Ross B. Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, and Piotr Dollár. Microsoft COCO: Common objects in context. In ECCV, 2014.
[35] Mingsheng Long, Yue Cao, Jianmin Wang, and Michael I. Jordan. Learning transferable features with deep adaptation networks. In ICML, 2015.
[36] Chih-Yao Ma, Jiasen Lu, Zuxuan Wu, Ghassan AlRegib, Zsolt Kira, Richard Socher, and Caiming Xiong. Self-monitoring navigation agent via auxiliary progress estimation. In ICLR, 2019.
[37] Chih-Yao Ma, Zuxuan Wu, Ghassan AlRegib, Caiming Xiong, and Zsolt Kira. The regretful agent: Heuristic-aided navigation through progress estimation. In CVPR, 2019.
[38] Piotr Mirowski, Matthew Koichi Grimes, Mateusz Malinowski, Karl Moritz Hermann, Keith Anderson, Denis Teplyashin, Karen Simonyan, Koray Kavukcuoglu, Andrew Zisserman, and Raia Hadsell. Learning to navigate in cities without a map. In NeurIPS, 2018.
[39] Piotr Mirowski, Razvan Pascanu, Fabio Viola, Hubert Soyer, Andrew J. Ballard, Andrea Banino, Misha Denil, Ross Goroshin, Laurent Sifre, Koray Kavukcuoglu, Dharshan Kumaran, and Raia Hadsell. Learning to navigate in complex environments. In ICLR, 2017.
[40] Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Timothy P. Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In ICML, 2016.
[41] Arsalan Mousavian, Alexander Toshev, Marek Fiser, Jana Kosecka, and James Davidson. Visual representations for semantic target driven navigation. In ECCV Workshop on Visual Learning and Embodied Agents in Simulation Environments, 2018.
[42] Matthias Mueller, Alexey Dosovitskiy, Bernard Ghanem, and Vladlen Koltun. Driving policy transfer via modularity and abstraction. In CORL, 2018.
[43] Adithyavairavan Murali, Tao Chen, Kalyan Vasudev Alwala, Dhiraj Gandhi, Lerrel Pinto, Saurabh Gupta, and Abhinav Gupta. PyRobot: An open-source robotics framework for research and benchmarking. arXiv, 2019.
[44] Xue Bin Peng, Marcin Andrychowicz, Wojciech Zaremba, and Pieter Abbeel. Sim-to-real transfer of robotic control with dynamics randomization. In ICRA, 2018.
[45] Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke S. Zettlemoyer. Deep contextualized word representations. In NAACL, 2018.
[46] Xavier Puig, Kevin Ra, Marko Boben, Jiaman Li, Tingwu Wang, Sanja Fidler, and Antonio Torralba. VirtualHome: Simulating household activities via programs. In CVPR, 2018.
[47] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. In EMNLP, 2016.
[48] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NeurIPS, 2015.
[49] Charles Richter and Nicholas Roy. Safe visual navigation via deep learning and novelty detection. In RSS, 2017.
[50] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael S. Bernstein, Alexander C. Berg, and Fei-Fei Li. ImageNet large scale visual recognition challenge. IJCV, 2014.
[51] Fereshteh Sadeghi and Sergey Levine. CAD2RL: Real single-image flight without a single real image. In RSS, 2017.
[52] Nikolay Savinov, Alexey Dosovitskiy, and Vladlen Koltun. Semi-parametric topological memory for navigation. In ICLR, 2018.
[53] Manolis Savva, Angel X. Chang, Alexey Dosovitskiy, Thomas A. Funkhouser, and Vladlen Koltun. MINOS: Multimodal indoor simulator for navigation in complex environments. arXiv, 2017.
[54] Manolis Savva, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, Devi Parikh, and Dhruv Batra. Habitat: A platform for embodied AI research. In ICCV, 2019.
[55] Shital Shah, Debadeepta Dey, Chris Lovett, and Ashish Kapoor. AirSim: High-fidelity visual and physical simulation for autonomous vehicles. In Field and Service Robotics, 2017.
[56] Shaojie Shen, Nathan Michael, and Vijay Kumar. Autonomous multi-floor indoor navigation with a computationally constrained MAV. In ICRA, 2011.
[57] Ashish Shrivastava, Tomas Pfister, Oncel Tuzel, Josh Susskind, Wenda Wang, and Russell Webb. Learning from simulated and unsupervised images through adversarial training. In CVPR, 2017.
[58] Siddhartha S. Srinivasa, Patrick Lancaster, Johan Michalove, Matt Schmittle, Colin Summers, Matthew Rockett, Joshua R. Smith, Sanjiban Chouhury, Christoforos Mavrogiannis, and Fereshteh Sadeghi. MuSHR: A low-cost, open-source robotic racecar for education and research. arXiv, 2019.
[59] Baochen Sun, Jiashi Feng, and Kate Saenko. Return of frustratingly easy domain adaptation. In AAAI, 2016.
[60] Yaniv Taigman, Adam Polyak, and Lior Wolf. Unsupervised cross-domain image generation. In ICLR, 2017.
[61] Jie Tan, Tingnan Zhang, Erwin Coumans, Atil Iscen, Yunfei Bai, Danijar Hafner, Steven Bohez, and Vincent Vanhoucke. Sim-to-real: Learning agile locomotion for quadruped robots. In RSS, 2018.
[62] Charles Thorpe, Martial H. Hebert, Takeo Kanade, and Steven A. Shafer. Vision and navigation for the Carnegie-Mellon Navlab. TPAMI, 1988.
[63] Joshua Tobin, Rachel H. Fong, Alex Ray, Jonas Schneider, Wojciech Zaremba, and Pieter Abbeel. Domain randomization for transferring deep neural networks from simulation to the real world. In IROS, 2017.
[64] Chris Urmson, Joshua Anhalt, Drew Bagnell, Christopher Baker, Robert Bittner, M. N. Clark, John Dolan, Dave Duggins, Tugrul Galatali, Chris Geyer, Michele Gittleman, Sam Harbaugh, Martial Hebert, Thomas M. Howard, Sascha Kolski, Alonzo Kelly, Maxim Likhachev, Matt McNaughton, Nick Miller, Kevin Peterson, Brian Pilnick, Raj Rajkumar, Paul Rybski, Bryan Salesky, Young-Woo Seo, Sanjiv Singh, Jarrod Snider, Anthony Stentz, William "Red" Whittaker, Ziv Wolkowicki, Jason Ziglar, Hong Bae, Thomas Brown, Daniel Demitrish, Bakhtiar Litkouhi, Jim Nickolaou, Varsha Sadekar, Wende Zhang, Joshua Struble, Michael Taylor, Michael Darms, and Dave Ferguson. Autonomous driving in urban environments: Boss and the urban challenge. J. of Field Robotics, 2008.
[65] Laurens van der Maaten and Geoffrey E. Hinton. Visualizing data using t-SNE. JMLR, 2008.
[66] Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In BlackboxNLP@EMNLP, 2018.
[67] Xin Wang, Qiuyuan Huang, Asli Celikyilmaz, Jianfeng Gao, Dinghan Shen, Yuan-Fang Wang, William Yang Wang, and Lei Zhang. Reinforced cross-modal matching and self-supervised imitation learning for vision-language navigation. In CVPR, 2019.
[68] Xin Wang, Wenhan Xiong, Hongmin Wang, and William Yang Wang. Look before you leap: Bridging model-free and model-based reinforcement learning for planned-ahead vision-and-language navigation. In ECCV, 2018.
[69] Mitchell Wortsman, Kiana Ehsani, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. Learning to learn how to learn: Self-adaptive visual navigation using meta-learning. In CVPR, 2019.
[70] Yi Wu, Yuxin Wu, Georgia Gkioxari, and Yuandong Tian. Building generalizable agents with a realistic and rich 3D environment. arXiv, 2018.
[71] Yi Wu, Yuxin Wu, Aviv Tamar, Stuart Russell, Georgia Gkioxari, and Yuandong Tian. Bayesian relational memory for semantic visual navigation. In ICCV, 2019.
[72] Fei Xia, Amir R. Zamir, Zhiyang He, Alexander Sax, Jitendra Malik, and Silvio Savarese. Gibson Env: Real-world perception for embodied agents. In CVPR, 2018.
[73] Huazhe Xu, Yang Gao, Fisher Yu, and Trevor Darrell. End-to-end learning of driving models from large-scale video datasets. In CVPR, 2017.
[74] Claudia Yan, Dipendra Kumar Misra, Andrew Bennett, Aaron Walsman, Yonatan Bisk, and Yoav Artzi. CHALET: Cornell house agent learning environment. arXiv, 2018.
[75] Wei Yang, Xiaolong Wang, Ali Farhadi, Abhinav Gupta, and Roozbeh Mottaghi. Visual semantic navigation using scene priors. In ICLR, 2019.
[76] Yang Zhang, Philip David, and Boqing Gong. Curriculum domain adaptation for semantic segmentation of urban scenes. In ICCV, 2017.
[77] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, 2017.
[78] Xinge Zhu, Hui Zhou, Ceyuan Yang, Jianping Shi, and Dahua Lin. Penalizing top performers: Conservative loss for semantic segmentation adaptation. In ECCV, 2018.
[79] Yuke Zhu, Roozbeh Mottaghi, Eric Kolve, Joseph J. Lim, Abhinav Gupta, Li Fei-Fei, and Ali Farhadi. Target-driven visual navigation in indoor scenes using deep reinforcement learning. In