MOCA: A Modular Object-Centric Approach for Interactive Instruction Following
https://github.com/gistvision/moca
Kunal Pratap Singh ∗ IIT Roorkee & GIST [email protected]
Suvaansh Bhambri ∗ IIT Roorkee & GIST [email protected]
Byeonghwi Kim ∗ GIST [email protected]
Roozbeh Mottaghi Allen Institute for AI [email protected]
Jonghyun Choi GIST [email protected]

∗ indicates equal contribution. This work was done while KPS and SB were on a remote internship at GIST.
Abstract
Performing simple household tasks based on language directives is very natural to humans, yet it remains an open challenge for an AI agent. Recently, an 'interactive instruction following' task has been proposed to foster research in reasoning over long instruction sequences that require object interactions in a simulated environment. It involves solving open problems in the vision, language, and navigation literature at each step. To address this multifaceted problem, we propose a modular architecture that decouples the task into visual perception and action policy, and name it MOCA, a Modular Object-Centric Approach. We evaluate our method on the ALFRED benchmark and empirically validate that it outperforms prior arts by significant margins on all metrics with good generalization performance (high success rate in unseen environments). Our code is available at https://github.com/gistvision/moca.
1. Introduction
The prospect of having a robotic assistant that can carry out household tasks based on natural language instructions is a distant dream that has eluded the research community for decades. With recent progress in computer vision, natural language processing, and visual navigation, several benchmarks have been developed to encourage research on different components of such assistants, including navigation [2, 5, 4, 21], object interaction [45, 29], and interactive reasoning [8, 12] in simulated environments [20, 3, 28].
Figure 1: Overview of MOCA. It consists of the visual perception module (VPM) and the action policy module (APM). The VPM predicts an interaction mask for the object that the agent interacts with using object-centric mask prediction. The APM predicts the current action of the agent. a_t and m_t denote the action and the interaction mask at time step t.

Recently, taking a step forward to model real-world scenarios, ALFRED (Action Learning from Realistic Environments and Directives) [35] has been proposed to tackle an aggregated challenge of vision, language, and navigation: following a long sequence of instructions while interacting with objects through egocentric visual observations to accomplish real-world household tasks such as Put the pen from the cabinet drawer to the side table. To alleviate the burden of such complex reasoning from a high-level 'goal statement', ALFRED also provides low-level 'step-by-step instructions' that an agent must follow to achieve these goals. In addition, it poses various realistic constraints such as long trajectory horizons, irreversible state changes, partial observability, and egocentric vision, each of which is individually an open problem in the literature [45, 9, 5, 42]. Successful task completion requires addressing several individual challenges, including pixel-level scene understanding and sequential action prediction with multi-modal information.

To address these multifaceted challenges in an end-to-end manner, we present a Modular Object-Centric Approach (MOCA), which is illustrated in Figure 1. Modularity addresses the bottlenecks in each aspect by well-defined individual components that are learned together in an end-to-end neural architecture. Specifically, we first decouple the two major yet distinct aspects involved in interactive instruction following, i.e., policy and visual perception. The action policy module (APM) is responsible for sequential action prediction, whereas the visual perception module (VPM) generates a pixel-wise interaction mask for the objects of interest for manipulation. As the VPM requires pixel-level understanding while the APM abstracts the input to a single action label, we propose to learn them in separate branches (see Sec. 3.1).

The ability to interact with objects in the environment is central to interactive instruction following. Following [35], we let the agent interact with objects by predicting a pixel-wise interaction mask of the target object. Accurate mask generation for an object of interest is crucial for successful task completion. Taking an object-centric viewpoint to ensure good localisation, we propose a class-aware mask prediction mechanism for interaction mask generation (Sec. 3.2). It bifurcates the task of interaction mask prediction into inferring the target object class and mask generation to ensure accurate pixel-level perception at every time step. We further improve the localizing ability in time by using the spatial relationship among the objects that are interacted with over consecutive time steps (Sec. 3.2.2).

Visual grounding by natural language is also an important building block in this task but unfortunately remains one of the long-standing open problems in the literature [43, 33, 17, 16]. Inspired by the use of dynamic filters [18] to encode multi-modal information either with a single image or with a single sentence [11], we extend them to a continuous stream of varying visual observations and instructions (Sec. 3.3).
In addition, we observe that sometimes immovable objects like walls, tables, and kitchen counters block the agent's path, and after multiple failed attempts to pass through them, the trajectory is declared as failed. To address this, we further propose an obstruction detection module in the APM. It detects when the agent is stuck around an obstacle and forces it to take an action that moves away from it (Sec. 3.4). Finally, we use a simple color swapping augmentation to curb the sample complexity of imitation learning, specifically the behaviour cloning used to train our agent, MOCA. It is found to be particularly effective in training a good agent for our task (Sec. 4.2).

We empirically validate the proposed method, MOCA, on the ALFRED benchmark [35] with all provided evaluation metrics. In comparison to prior arts, published or not, MOCA outperforms them by large margins on all metrics and ranks first on the public leaderboard at the time of submission (leaderboard: https://bit.ly/2UrT3ur, entry details: https://bit.ly/3pyzHSK).
2. Related Work
Vision and language navigation tasks require an agent to reach a goal by following natural or templated language instructions in a simulated environment through visual observations [2, 4, 5, 27]. [2] proposed the Vision-and-Language Navigation (VLN) task on the Room2Room (R2R) benchmark, where an agent navigates on a fixed underlying navigation graph based on natural language instructions to reach the goal. Substantial improvements [41, 10, 26, 19, 25, 37, 23, 22] have been achieved on this benchmark by various proposals such as progress monitoring [25], augmenting trajectories via instruction generation [10], and environment dropout [37]. Recently, [21] proposed Vision and Language Navigation in Continuous Environments (VLN-CE), which lifts the assumption of a known navigation graph and perfect agent localisation from the R2R [2] benchmark. It also requires the agent to navigate via egocentric visual inputs as opposed to panoramic images. Both VLN and VLN-CE address the problem of navigation, but ALFRED [35] requires an agent to navigate via egocentric visual observations and also interact with objects by producing a pixel-wise interaction mask to complete a task.

ALFRED [35] provides a high-level goal statement and low-level step-by-step instructions that the agent needs to follow to accomplish a task. Shridhar et al. [35] proposed a sequence-to-sequence model with progress monitoring [25]. Even though such models perform reasonably well on VLN [2, 25], they do not generalize at all to unseen environments (near-zero unseen success rate), indicating the difficulty of ALFRED as compared to previous benchmarks. Recently, Nguyen et al. [38] presented an approach where they relax the egocentric vision constraint of ALFRED by collecting multiple views per time step, essentially making it similar to the panoramic views in VLN; these visual features are then processed via hierarchical attention with the step-by-step instructions. Recently, [36] presented TextWorld [7] based environments corresponding to the embodied ones in ALFRED and named it ALFWorld. Here, we propose to take a modular approach and show how decoupling various factors helps us analyze the bottlenecks involved and, in turn, learn an effective and superior performing agent on this benchmark. Note that we do not relax any constraints set by the original ALFRED benchmark and still outperform all existing methods [35, 38].
To accomplish the tasks in ALFRED [35], the agent needs to effectively interact with the right objects. Following [35], we perform object interaction by predicting a pixel-wise interaction mask of the target object. This requires the agent to localise and produce a pixel-wise mask of the target object it intends to interact with. Visual grounding refers to localizing a specific object in an image using a natural language description. Previous methods leverage a pre-trained segmentation model [13, 46] to generate a set of candidate regions and either encode the visual and textual information separately using CNN-LSTM based methods [17, 15, 30, 44] or use a joint embedding [24, 33, 40, 6] to predict the best candidate proposal corresponding to the language query. [24, 33] reconstruct the language query based on joint vision and language features for better grounding. [43, 15] divide the input referring expression into modular phrases and process them using a modular attention network. Motivated by these works, we propose to split interaction mask prediction into two stages, class prediction and mask generation (Sec. 3.2), and leverage a pre-trained instance segmentation model [13].

Previous works [6, 15, 30] have used simple arithmetic operations such as concatenation or element-wise product for grounding tasks but fail to fully capture the vision-language correspondence. [34, 31] have employed dynamic parameters for fully connected layers and batch-norm statistics, respectively, for visual grounding. [39] use conditional batch normalization based on linguistic input for object tracking. [11] uses the language query to predict hybrid convolution kernels for visual question answering. Contrary to these works, which operate in a static environment, we employ dynamic filters in an embodied environment with varying egocentric visual observations for the same language query over multiple time steps, as further discussed in Sec. 3.3.
3. Approach
Towards building an instruction-following AI agent in a near-realistic scenario, we use the ALFRED benchmark [35] to train and evaluate our model. It provides a high-level goal statement (S_goal) and step-by-step instructions (S_instr) to accomplish each task. We tackle the problem of predicting low-level actions and pixel-wise interaction masks to achieve the goal in a modular fashion, separating them into individual branches but training the entire architecture in an end-to-end manner. For better visual-language grounding, we propose to employ language-guided dynamic filters to help the agent generalise to unseen environments. Further, we propose a simple obstacle avoidance module that facilitates smooth navigation through the environment. The objective function of our agent is:

L = \underbrace{-\sum_{t=1}^{T} y_t^c \log(p_{c,t})}_{\text{VPM}} \; \underbrace{-\,\lambda_a \sum_{t=1}^{T} y_t^a \log(p_{k,t})}_{\text{APM}} \; + \; \lambda_s AL_s(s, s^*) + \lambda_p AL_p(p, p^*),   (1)

where p_{c,t} and p_{k,t} denote the class probability and action probability, respectively; t and T denote the current time step and total trajectory duration, respectively; and y_t^a and y_t^c denote the ground-truth action and class, respectively. VPM and APM denote the visual perception module (Sec. 3.1.1) and the action policy module (Sec. 3.1.2). Subgoal progress and overall progress sequences are denoted by s and p, respectively; s^* and p^* denote the ground-truth subgoal and overall progress sequences. AL_s(s, s^*) and AL_p(p, p^*) are the auxiliary subgoal and overall progress monitoring losses, respectively, as in [35]. λ_a, λ_s, and λ_p are balancing hyper-parameters.
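For concreteness, a minimal sketch of how the objective in Eq. (1) could be assembled is shown below. The tensor names, the regression form of the auxiliary progress-monitoring losses, and the PyTorch formulation are assumptions for illustration, not the authors' released implementation; the λ values are placeholders since the exact settings are partially elided in the text.

import torch
import torch.nn.functional as F

def moca_loss(class_logits, action_logits, gt_classes, gt_actions,
              subgoal_pred, subgoal_gt, progress_pred, progress_gt,
              lambda_a=1.0, lambda_s=0.1, lambda_p=0.1):
    """Sketch of Eq. (1).

    class_logits:  (T, N_class)  VPM target-class logits per time step
    action_logits: (T, N_action) APM action logits per time step
    gt_classes:    (T,) ground-truth object class indices
    gt_actions:    (T,) ground-truth action indices
    subgoal_*/progress_*: (T,) auxiliary progress-monitoring signals as in [35]
    """
    # VPM term: softmax cross entropy over target object classes
    vpm = F.cross_entropy(class_logits, gt_classes)
    # APM term: softmax cross entropy over actions
    apm = F.cross_entropy(action_logits, gt_actions)
    # Auxiliary sub-goal and overall progress monitoring (regression assumed here)
    aux_s = F.mse_loss(subgoal_pred, subgoal_gt)
    aux_p = F.mse_loss(progress_pred, progress_gt)
    return vpm + lambda_a * apm + lambda_s * aux_s + lambda_p * aux_p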
3.1 Decoupling Policy and Perception

Action prediction requires global semantic visual cues, whereas visual perception for interaction requires local object-specific features that enable precise localisation of the desired objects. On the language front, the low-level action-oriented information in step-by-step instructions is important for action prediction, whereas the object category information in the goal statement is sufficient for mask prediction (i.e., the interaction). The contrasting nature of the two tasks motivates separating the branches for action and interaction mask prediction as shown in Figure 2. MOCA has two high-level modules: the visual perception module (VPM) and the action policy module (APM). Subscripts m and a indicate whether a component belongs to the VPM or the APM, respectively. We present a quantitative analysis of the benefits of this decoupling in Sec. 4.1 through input and model ablations.

3.1.1 Visual Perception Module (VPM)

The visual perception module (VPM) predicts the interaction mask of the object the agent wants to interact with. The VPM takes the visual features and the language features extracted by the goal encoder.

Action-conditioned visual perception.

To capture the cross-modal information between the goal statement and the visual observation at each time step, we use language-guided dynamic filters to generate the attended visual features (see Sec. 3.3 for details). Additionally, we propose to use the previous action embedding, along with the visual and language input, to predict the correct object. For example, given the goal statement Wash the spatula, put it in the first drawer, the agent first needs to wash the spatula in the sink, so there are two object classes, namely spatula and sink, that the agent needs to interact with, but in a particular order. Here the action information becomes important: if the action is PutObject, the agent needs to predict the sink's (receptacle) mask, whereas if it is PickObject, it needs to predict the spatula's (object) mask. MOCA conditions the object interaction on the previous action embedding, which helps the agent to temporally align the object class with its corresponding interaction action at the respective time step among the multiple objects present in the goal statement.
Figure 2:
Detailed architecture of MOCA.
The input frame, goal statement, and step-by-step instructions are denoted by I_t, S_goal, and S_instr, respectively. Subscripts m and a indicate whether a component belongs to the VPM or the APM, respectively. h_{t,m} and h_{t,a} denote the hidden states of the class and action decoders, respectively. I_t is encoded by a ResNet-18. Dynamic filters convolve over the visual features, v_t, to give attended visual features, \hat{v}_{goal} and \hat{v}_{instr}. The target class c_t and action a_t are predicted on the basis of the attended visual and language features together with the previous action embedding. Blue dashed lines denote the input from the previous time step.

In summary, as shown in Figure 2, the hidden state h_{t,m} of the class decoder, LSTM_m, is updated with three different pieces of information concatenated as:

h_{t,m} = \mathrm{LSTM}_m([\hat{v}_{t,goal};\, \hat{x}_{t,goal};\, a_{t-1}]),   (2)

where [ ; ] denotes concatenation, and \hat{x}_{t,goal} and \hat{v}_{t,goal} are the attended language and visual features, respectively. The class decoder's current hidden state h_{t,m} is then used to predict the interaction mask m_t by invoking the object-centric mask prediction (Sec. 3.2).

3.1.2 Action Policy Module (APM)

We propose a module to predict the action sequence based on multi-modal information and call it the 'action policy module' (APM), depicted by the lower block in Figure 2. It takes visual features and step-by-step instructions but does not use the goal statement as input, since the goal statement lacks the low-level task-oriented information that is crucial for achieving the goal, unlike [35].

As illustrated in Figure 2, the attended language features are generated by the instruction encoder. Similar to the VPM, we employ language-guided dynamic filters to generate attended visual features (Sec. 3.3). The action decoder then takes the attended visual and language features, along with the previous action embedding, to update the action decoder hidden state. After the action decoder hidden state is updated, a fully connected layer is used to predict the next action a_t as follows:

h_{t,a} = \mathrm{LSTM}_a([\hat{v}_{t,instr};\, \hat{x}_{t,instr};\, a_{t-1}]),
a_t = \operatorname{argmax}_k \big( \mathrm{FC}_a([\hat{v}_{t,instr};\, \hat{x}_{t,instr};\, a_{t-1};\, h_{t,a}]) \big), \; k \in [1, N_{action}],   (3)

where \hat{v}_{t,instr}, \hat{x}_{t,instr}, a_{t-1}, and h_{t,a} denote the attended visual features, attended language features, previous action embedding, and current action decoder hidden state, respectively. FC_a takes \hat{v}_{t,instr}, \hat{x}_{t,instr}, a_{t-1}, and h_{t,a} as input and predicts the next action a_t. N_action denotes the number of actions. We keep the same action space as [35]. The APM is further equipped with our obstruction detection mechanism (Sec. 3.4). The APM is trained using softmax cross entropy as shown in Equation 1.
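A minimal sketch of the decoder updates in Eqs. (2) and (3) is given below; the module shapes, the assumption that attended visual and language features share a dimension, and the use of PyTorch LSTM cells are illustrative choices, not the released implementation (fc_m is the class head used later for target-class prediction).

import torch
import torch.nn as nn

class Decoders(nn.Module):
    """Sketch of the VPM class decoder (Eq. 2) and APM action decoder (Eq. 3)."""
    def __init__(self, feat_dim, act_emb_dim, n_class, n_action, hid=512):
        super().__init__()
        in_dim = 2 * feat_dim + act_emb_dim           # [v̂; x̂; a_{t-1}]
        self.lstm_m = nn.LSTMCell(in_dim, hid)        # class decoder (VPM)
        self.lstm_a = nn.LSTMCell(in_dim, hid)        # action decoder (APM)
        self.fc_m = nn.Linear(hid, n_class)           # target-class head
        self.fc_a = nn.Linear(in_dim + hid, n_action) # FC_a in Eq. (3)

    def step(self, v_goal, x_goal, v_instr, x_instr, a_prev, state_m=None, state_a=None):
        # Eq. (2): h_{t,m} = LSTM_m([v̂_goal; x̂_goal; a_{t-1}])
        in_m = torch.cat([v_goal, x_goal, a_prev], dim=-1)
        h_m, c_m = self.lstm_m(in_m, state_m)
        class_logits = self.fc_m(h_m)

        # Eq. (3): h_{t,a} = LSTM_a([v̂_instr; x̂_instr; a_{t-1}])
        in_a = torch.cat([v_instr, x_instr, a_prev], dim=-1)
        h_a, c_a = self.lstm_a(in_a, state_a)
        action_logits = self.fc_a(torch.cat([in_a, h_a], dim=-1))
        return class_logits, action_logits, (h_m, c_m), (h_a, c_a)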
3.2 Object-Centric Mask Prediction

In the VPM, we perform object interaction by predicting a pixel-wise interaction mask of the object of interest. Inspired by past work in language-guided localisation [43, 17], we separate the task of interaction mask prediction into two stages: target class prediction and instance association. This bifurcation enables us to leverage the quality of pre-trained instance segmentation models while also ensuring accurate localisation. We refer to this mechanism as 'object-centric mask prediction', and it facilitates object manipulation for our agent. We ablate the effect of this component in Table 2 in Sec. 4.2.

3.2.1 Target Class Prediction

We take an object-centric viewpoint on interaction mask prediction by explicitly encoding the ability to reason about object categories in MOCA. To make the mask prediction object-centric, we first predict the target object class, c_t, that the agent intends to interact with at the current time step t. Specifically, FC_m takes as input the hidden state h_{t,m} of the class decoder and outputs the target object class c_t at time step t, as shown in Equation 4. The predicted class is then used as an index for the mask generator to acquire the set of instance masks corresponding to the predicted class:

c_t = \operatorname{argmax}_k \, \mathrm{FC}_m(h_{t,m}), \; k \in [1, N_{class}],   (4)

where FC_m(·) is a fully connected layer and N_class denotes the number of target object classes. The target object prediction network is trained as part of the VPM with a softmax cross-entropy loss using the ground-truth object class as supervision. Note that the VPM and APM are trained together in an end-to-end manner to align the predicted objects with their corresponding interaction actions.

3.2.2 Instance Association in Time

During inference, we employ a two-way criterion to select the best instance mask: 'confidence based' and 'association based'. If the agent interacts with an object for the first time, i.e., the target object class changes between successive time steps, we pick the instance mask with the highest confidence score of mask prediction ('confidence-based'). We use a pre-trained mask generator to obtain the instance masks and confidence scores, {(m_{i,c_t}, s_{i,c_t})}_{i=1}^{M_t}, where i and M_t denote the index and the number of predicted instance masks, respectively, for the target object class c_t predicted by our agent at the current time step. This ensures that the best possible mask from our mask generator is used and our agent does not interact with wrong objects. On the other hand, the agent may intend to interact with the same object instance over an interval, i.e., the same object class is predicted over consecutive time steps ('association-based'). For instance, as illustrated in Figure 3, when an agent is trying to open a drawer and put a knife in it, the drawer selected for putting the knife must be the same one that was opened in the previous time step. To ensure this, we select the instance mask with the shortest Euclidean distance between its center and the center of the instance mask selected in the previous time step. The 'association-based' criterion ensures that the agent picks the same instance to interact with over consecutive time steps, even though its confidence may not be the highest over that period. In summary, instance association predicts the current time step's interaction mask m_t = m_{\hat{i},c_t} with center coordinate d^*_t = d_{\hat{i},c_t}, where \hat{i} is obtained as:

\hat{i} = \begin{cases} \operatorname{argmax}_i \; s_{i,c_t}, & \text{if } c_t \neq c_{t-1}, \\ \operatorname{argmin}_i \; \lVert d_{i,c_t} - d^*_{t-1} \rVert, & \text{if } c_t = c_{t-1}, \end{cases}   (5)

where c_t is the predicted target object class and d_{i,c_t} is the center coordinate of a mask instance m_{i,c_t} of the predicted class. Table 3 in Sec. 4.2 ablates instance association in time to highlight its empirical significance.
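A minimal sketch of the two-way selection in Eq. (5) follows; it assumes the mask generator returns per-instance masks, confidence scores, and mask centers, and the names are illustrative.

import numpy as np

def select_instance(masks, scores, centers, pred_class, prev_class, prev_center):
    """Two-way mask selection (Eq. 5).

    masks:   list of binary masks for the predicted class c_t
    scores:  confidence score per mask
    centers: (x, y) center per mask
    Returns the chosen mask and its center.
    """
    if prev_class is None or pred_class != prev_class:
        # Confidence-based: first interaction with this class
        i = int(np.argmax(scores))
    else:
        # Association-based: same class as the previous step, keep the same instance
        dists = [np.linalg.norm(np.asarray(c) - np.asarray(prev_center))
                 for c in centers]
        i = int(np.argmin(dists))
    return masks[i], centers[i]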
Figure 3:
Qualitative illustration of 'Instance Association in Time'.
The goal statement is Put a cleaned knife in a drawer. The generated masks of the drawers are labelled red, green, and blue with their corresponding confidence scores. ✓ denotes the target object interacted with at that time step. ✗ denotes that the target object with the highest confidence score is replaced by Instance Association in Time. Using the single-fold confidence-based approach makes the agent interact with different drawers over consecutive time steps, as the closed drawer has a higher confidence score. Instance Association in Time helps the agent interact with the same drawer over time and place the knife in it.

3.3 Language-Guided Dynamic Filters

It is a straightforward and common practice to concatenate the flattened visual and language features when addressing the two modalities of vision and language [17, 35, 15]. However, this naïve approach often fails to capture the vision-language correspondence, which bottlenecks performance in unseen environments [35]. To improve the generalization of our agent, MOCA, to unseen environments, we propose to use language-guided dynamic filters to capture spatial information attended on the language instructions. Dynamic filters [18] have been successfully applied in various downstream vision-language modelling tasks such as VQA [11] and natural language moment retrieval [32].

Contrary to these tasks, which are performed either with a single image or with a single sentence over a predetermined video sequence, our task has varying visual observations and language features at each time step. Thus, we extend the usage of dynamic filters to generate kernels that attempt to capture various aspects of the language from the attended language features. These dynamic kernels are convolved with the visual features to obtain attended cross-modal feature maps, as shown in the 'Dynamic Filters' block of Figure 2. The filter generator network, f_DF, takes the language features x as input and produces N_DF dynamic filters. These filters convolve with the visual features v_t to output multiple joint embeddings, \hat{v}_t = DF(v_t, x), as follows:

w_i = f_{DF_i}(x), \;\forall i \in \{1, \ldots, N_{DF}\}, \quad \hat{v}_{i,t} = v_t \ast w_i, \quad \hat{v}_t = [\hat{v}_{1,t}; \ldots; \hat{v}_{N_{DF},t}],   (6)

where N_DF, \ast, and [ ; ] denote the number of dynamic filters, the convolution operation, and the concatenation operation, respectively. This multimodal encoding helps the agent better utilise the correspondence between underspecified natural language instructions and visual features in unseen environments, thereby reducing the agent's dependence on a particular modality. The dynamic filters are conditioned on the language features, which makes them more adaptive and flexible to varying inputs while performing tasks in unseen environments. This is in contrast with traditional convolutions, which have fixed weights after training and hence cannot adapt to the input. We empirically investigate the benefit of using language-guided dynamic filters in Sec. 4.2.
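A minimal sketch of language-guided dynamic filters as in Eq. (6) is shown below; the 1×1 kernel shape, the number of filters, and the PyTorch formulation are assumptions for illustration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicFilters(nn.Module):
    """Generate N_DF convolution kernels from language features and apply
    them to visual features (Eq. 6)."""
    def __init__(self, lang_dim, vis_channels, n_filters=4):
        super().__init__()
        self.n_filters = n_filters
        # One 1x1 kernel (a weight vector over channels) per dynamic filter
        self.gen = nn.Linear(lang_dim, n_filters * vis_channels)

    def forward(self, v, x):
        # v: (B, C, H, W) visual features; x: (B, lang_dim) language features
        B, C, H, W = v.shape
        w = self.gen(x).view(B, self.n_filters, C, 1, 1)  # w_i = f_DF_i(x)
        outs = []
        for b in range(B):
            # v̂_{i,t} = v_t * w_i, computed per batch element
            outs.append(F.conv2d(v[b:b + 1], w[b]))        # (1, N_DF, H, W)
        return torch.cat(outs, dim=0)                       # [v̂_1; ...; v̂_N_DF]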
3.4 Obstruction Detection

We observe that our agent, MOCA, tends to get stranded around immovable obstacles, which eventually leads to trajectory failure. To mitigate this, we propose an obstruction detection mechanism in the APM to avoid obstacles. While navigating to a certain location, at every time step the agent computes the difference between the observations at the current time step, I_t, and the previous time step, I_{t-1}, with a tolerance hyper-parameter ε:

\lVert I_t - I_{t-1} \rVert < \epsilon.   (7)

When this inequality holds and the agent has predicted the same navigation action over consecutive time steps yet fails to move past an object, the agent detects an obstruction. When an obstruction is detected, the agent's action space is narrowed down to two navigation actions, i.e., {RotateLeft, RotateRight}, to let the agent escape from the obstacle, as illustrated in Figure 4. We empirically investigate its effect in Sec. 4.2.
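A minimal sketch of the check in Eq. (7): the frame-difference normalization, the ε value, and the function names are assumptions; only the overall criterion (unchanged observation plus a repeated navigation action) follows the text.

import numpy as np

def detect_obstruction(frame_t, frame_prev, action_t, action_prev, eps=1e-3):
    """Return True if the agent appears stuck (Eq. 7): the observation barely
    changed although the same navigation action was repeated."""
    same_action = (action_t == action_prev)
    diff = np.linalg.norm(
        frame_t.astype(np.float32) - frame_prev.astype(np.float32)
    ) / frame_t.size  # per-pixel normalization is an assumption
    return same_action and diff < eps

def escape_actions():
    # When obstructed, the action space is narrowed to rotations only
    return ["RotateLeft", "RotateRight"]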
4. Experiments
Dataset.
For a near-realistic simulation of the interactive instruction following task, we use the ALFRED benchmark, which runs in the AI2-THOR [20] virtual environment.
Figure 4:
Obstruction detection. ✓ denotes the action taken at that time step. The MoveAhead action marked by ✗ shows that our agent detects an obstruction at time step t by comparing the previous and current observations, I_{t-1} and I_t. Therefore, instead of taking the MoveAhead action again, our agent predicts the RotateRight action to detour around it.

The scenes in AI2-THOR are partitioned into 'train', 'validation', and 'test' sets. To evaluate the generalization ability of an agent, the validation and test scenes are split into two sections: seen and unseen folds.
Evaluation metrics.
We follow the evaluation metrics proposed in [35]: the Success Rate, denoted by Task, and the Goal-Condition Success Rate, denoted by Goal-Cond. Additionally, to measure the efficiency of an agent, the above metrics are penalized by the length of the action sequence to compute a path-length-weighted (PLW) score for each metric [1]. For more details on the evaluation metrics, refer to [35].
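As a concrete reference, the path-length weighting of [1] scales a score by the ratio of the expert path length to the larger of the expert and agent path lengths; a minimal sketch with our own variable names:

def path_length_weighted(score, agent_path_len, expert_path_len):
    """Penalize a success/goal-condition score by trajectory length, as in [1]."""
    return score * expert_path_len / max(agent_path_len, expert_path_len)

# Example: a successful episode (score 1.0) whose trajectory is twice as long
# as the expert demonstration receives a PLW score of 0.5.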
Implementation details.
The egocentric visual observations are resized to  × . For the visual encoder, we use a pre-trained ResNet-18 [14]. For the mask generator, we use a Mask R-CNN [13], which takes a  ×  visual observation and outputs instance masks over 119 object classes along with confidence scores. For training the Mask R-CNN, we use 2.1M frames and corresponding instance segmentation masks collected by replaying the training-set expert trajectories. Note that we do not use any frames or segmentation masks from validation or test trajectories for Mask R-CNN training. MOCA is trained end-to-end using Adam with an initial learning rate of  and a batch size of 4 for 50 epochs. The balancing hyper-parameters in (1) are set to λ_a = 1. , λ_s = 0. , and λ_p = 0. . We will release our code and pre-trained agents publicly.
Rows: Shridhar et al. [35]; Nguyen et al. [38] (test folds only, validation scores N/A); MOCA (Ours); the input ablations No Language, No Vision, Goal-Only, and Instructions-Only; and Human (test unseen fold only). Columns: Task and Goal-Cond success rates on the Validation and Test sets, each split into Seen and Unseen folds.

Table 1:
Task and Goal-Condition Success Rate.
For each metric, the corresponding path-length-weighted metrics are given in parentheses. The highest values per fold and metric are shown in blue. 'N/A' denotes 'not available', as the scores are not reported on the leaderboard.

We first conduct a quantitative analysis of the performance in terms of task success rate (Task) and goal-condition success rate (Goal-Cond) and summarise the results in Table 1 alongside previous methods. As shown in the table, MOCA shows significant improvement over the previous methods [35, 38] on all metrics. The higher success rate in the unseen scenes indicates its ability to generalize to novel environments. We achieve a relative improvement of 77.96% in Seen Task SR and 19.10% in Unseen Task SR over Nguyen et al. [38], which won the ALFRED challenge at ECCV 2020. MOCA outperforms them in both
Seen and
Unseen 'Goal-Condition' metrics and gives relative improvements of 36.79% and 15.72%, respectively. This implies that our method improves the agent's general understanding of all sub-tasks required for the successful completion of a task. As indicated in the parentheses in Table 1, our approach provides better path-length-weighted results for all metrics, which shows the efficiency of the agent's trajectories. We also present a sub-goal ablation in the supplementary.
Input ablations.
We ablate the inputs to our model in Table 1 to investigate the vision and language bias of our agent. When the agent is given only visual inputs (No Language), without the goal and step-by-step instructions, we observe that the agent is still able to perform some tasks in the seen environments by memorising familiar visual and target-class sequences, but it fails to generalize on the unseen fold. The No Vision setting is able to achieve some goal-condition success by following navigation instructions, but the lack of visual input handicaps the interaction ability of the agent, preventing it from achieving any success in both seen and unseen environments.

The Goal-Only setting highlights the ability of our agent to utilise the goal statement better than Shridhar et al. [35]. Since the action policy module (APM) of MOCA does not use the goal statement, due to its lack of low-level action-specific information, the action prediction ability of this setting is equivalent to the No Language setting. However, since the goal statement is used in the visual perception module (VPM), it allows the agent to perform accurate object interaction and hence achieves much better performance than No Language. This result is a direct benefit of the policy and perception decoupling discussed in Sec. 3.1.

The Instructions-Only ablation in Table 1 indicates the performance when the agent does not receive the goal statement. The low-level instructions drastically improve the action prediction ability over the Goal-Only setting, as the APM can now leverage the low-level action information. However, the VPM is deprived of its language input, which depletes the target-class prediction ability (Sec. 3.2.1) of the object-centric mask prediction module. This results in numerous failed interactions, and thus it performs worse than MOCA and the Goal-Only setting.

It is interesting to note that despite using only one of the two language inputs, the goal statement (Goal-Only) or the step-by-step instructions (Instructions-Only), MOCA is still able to outperform [35] on all metrics, as shown in Table 1. This highlights the flexibility and robustness of our modular architecture, which helps the agent fully exploit each language input, pivotal for achieving good performance on ALFRED. We would also like to highlight that in the input ablations the agent is deprived of the dynamic filters for either the APM or the VPM, or both, due to which it fails to perform well in unseen environments in all input ablation settings.
Model ablations.
To investigate the significance of each module, we perform a series of ablation studies on MOCA and summarize the results in Table 2. We begin by removing the color-swap augmentation, which reduces the performance on all metrics, highlighting its importance in training a better agent by curbing the sample complexity of imitation learning. Due to computational constraints, we use the non-color-swap variant to ablate the remaining modules in the further analysis.

Second, we remove the language-guided dynamic filters (Sec. 3.3). The ablated model shows a significant decrease in both seen and unseen metrics. This drop can be attributed to the lack of cross-modal correspondence between the visual and language inputs in the absence of dynamic filters.
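Returning to the color-swap augmentation removed in the first ablation above: its exact form is not detailed in this text. A minimal sketch under the assumption that it randomly permutes the RGB channels of each training frame:

import random
import torch

def color_swap(frame):
    """Randomly permute the channels of an image tensor (C, H, W).

    This is an assumed realization of the 'color swapping augmentation'
    used to curb the sample complexity of behaviour cloning.
    """
    perm = list(range(frame.shape[0]))
    random.shuffle(perm)
    return frame[perm]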
Rows toggle the components Decoupling Policy & Perception, Object-centric Mask Prediction, Dynamic Filters, and Color Swap; columns report Task and Goal-Cond. success rates on Validation-Seen and Validation-Unseen.

Table 2: Ablation study for each component of MOCA. For each metric, we report the corresponding path-weighted scores in parentheses. The absence of a checkmark denotes that the corresponding component is removed from MOCA.
Rows: MOCA (Ours), – w/o I.A.T., – w/o O.D.; columns: Task and Goal-Cond success rates on Validation-Seen and Validation-Unseen, with path-weighted scores in parentheses.

Table 3: Ablation for Instance Association in Time and Obstruction Detection. Both Instance Association in Time and Obstruction Detection are ablated on the validation dataset.
Figure 5:
Qualitative comparison of identifying target objects by mask prediction.
(a) Shridhar et al. [35]; (b) MOCA (ours). Green regions denote the interaction masks predicted by the model. The ground-truth object class the agent needs to interact with is shown in the top-left corner.

Third, we demonstrate the importance of decoupling the APM and VPM (Sec. 3.1). To remove the benefit of the decoupling, we take the concatenation of the goal statement and instructions as the language input and perform action and mask prediction from the same pipeline, similar to [35], while keeping the other modules the same. The ablated model exhibits a drastic decrease in task success rates, indicating its inability to fully utilise the language inputs due to the joint processing of the mask and action modules.

Finally, we remove object-centric mask prediction (Sec. 3.2). Instead, we directly upsample the joint vision-language-action embedding using deconvolution layers to predict the interaction mask, similar to Shridhar et al. [35]. We observe that the performance drops drastically on both seen and unseen folds due to poor mask generation, as indicated in Table 2, highlighting the effectiveness of our object-centric mask prediction module in consistently predicting an accurate interaction mask.

Table 3 ablates the obstruction detection module from Sec. 3.4. The performance drop indicates that it helps the agent avoid obstacles effectively. We also ablate Instance Association in Time (IAT), presented in Sec. 3.2.2. In this setting, instead of picking the mask instance for the predicted target class using IAT, we pick a random instance of that class. This setting achieves almost half the performance of MOCA, which highlights that merely predicting the right object class is not sufficient; the correct instance must also be selected.
We conduct a qualitative analysis of the interaction mask prediction ability (Sec. 3.2) of our agent, MOCA. Our object-centric mask prediction allows MOCA to reason about object classes (Sec. 3.2.1), which ensures that it interacts with the right object. This is in contrast with [35], which upsamples a linear embedding via a deconvolution network and trains it to predict class-agnostic masks, thereby not preserving any information about the object category. In Figure 5a, since [35] lacks the ability to reason about the object class, it predicts inaccurately localized masks even though both objects are fully observable. In contrast, in Figure 5b, MOCA successfully predicts which objects it intends to interact with (i.e., the cellphone and the plate). Identifying the correct objects enables it to predict an accurately localised mask with the mask generator's help. We present a further ablation on the importance of reasoning about object classes, as well as qualitative example videos of our agent's task completion ability, in the supplementary.

5. Conclusion
We explore the problem of interactive instruction following on the ALFRED benchmark. To address the individual challenges in this compositional task, we propose MOCA, which exploits a modular design, an object-centric perspective on interaction, and flexible multimodal correspondence. MOCA outperforms all prior arts by large margins with superior generalization performance. Our framework presents a pathway for future work on this benchmark by giving the flexibility to upgrade individual components of the architecture.
References

[1] Peter Anderson, Angel X. Chang, Devendra Singh Chaplot, Alexey Dosovitskiy, Saurabh Gupta, Vladlen Koltun, Jana Kosecka, Jitendra Malik, Roozbeh Mottaghi, Manolis Savva, and Amir R. Zamir. On evaluation of embodied navigation agents. arXiv, arXiv:1807.06757, 2018. 6
[2] Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko Sünderhauf, Ian Reid, Stephen Gould, and Anton van den Hengel. Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In
CVPR , 2018. 1, 2[3] Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Hal-ber, Matthias Niessner, Manolis Savva, Shuran Song, AndyZeng, and Yinda Zhang. Matterport3d: Learning from rgb-d data in indoor environments. arXiv , arXiv:1709.06158,2017. 1[4] Devendra Singh Chaplot, Kanthashree Mysore Sathyendra,Rama Kumar Pasumarthi, Dheeraj Rajagopal, and Rus-lan Salakhutdinov. Gated-attention architectures for task-oriented language grounding. In
AAAI , 2017. 1, 2[5] Howard Chen, Alane Suhr, Dipendra Misra, Noah Snavely,and Yoav Artzi. Touchdown: Natural language navigationand spatial reasoning in visual street environments. In
CVPR ,2019. 1, 2[6] Kan Chen, Rama Kovvuri, and Ram Nevatia. Query-guidedregression network with context policy for phrase grounding.In
ICCV , 2017. 3[7] Marc-Alexandre Cˆot´e, ´Akos K´ad´ar, Xingdi Yuan, BenKybartas, Tavian Barnes, Emery Fine, James Moore,Matthew J. Hausknecht, Layla El Asri, Mahmoud Adada,Wendy Tay, and Adam Trischler. Textworld: A learning en-vironment for text-based games. In
CGW@IJCAI , 2018. 2[8] Abhishek Das, Samyak Datta, Georgia Gkioxari, Stefan Lee,Devi Parikh, and Dhruv Batra. Embodied Question Answer-ing. In
CVPR , 2018. 1[9] Kuan Fang, Alexander Toshev, Li Fei-Fei, and SilvioSavarese. Scene memory transformer for embodied agentsin long-horizon tasks. In
CVPR , 2019. 1[10] Daniel Fried, Ronghang Hu, Volkan Cirik, Anna Rohrbach,Jacob Andreas, Louis-Philippe Morency, Taylor Berg-Kirkpatrick, Kate Saenko, Dan Klein, and Trevor Darrell.Speaker-follower models for vision-and-language naviga-tion. In
NeurIPS , 2018. 2 [11] Peng Gao, Pan Lu, Hongsheng Li, Shuang Li, Yikang Li,Steven Hoi, and Xiaogang Wang. Question-guided hy-brid convolution for visual question answering. arXiv ,arXiv:1808.02632, 2018. 2, 3, 5[12] Daniel Gordon, Aniruddha Kembhavi, Mohammad Raste-gari, Joseph Redmon, Dieter Fox, and Ali Farhadi. Iqa:Visual question answering in interactive environments. In
CVPR , 2018. 1[13] Kaiming He, Georgia Gkioxari, Piotr Doll´ar, and Ross Gir-shick. Mask r-cnn. In
ICCV , 2017. 3, 6[14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun.Deep residual learning for image recognition. In
CVPR ,2016. 6[15] Ronghang Hu, Marcus Rohrbach, Jacob Andreas, TrevorDarrell, and Kate Saenko. Modeling relationships in refer-ential expressions with compositional modular networks. In
CVPR , 2017. 3, 5[16] Ronghang Hu, Marcus Rohrbach, and Trevor Darrell. Seg-mentation from natural language expressions. In
ECCV ,2016. 2[17] Ronghang Hu, Huazhe Xu, Marcus Rohrbach, Jiashi Feng,Kate Saenko, and Trevor Darrell. Natural language objectretrieval. In
CVPR , 2016. 2, 3, 4, 5[18] Xu Jia, Bert De Brabandere, Tinne Tuytelaars, and Luc Gool.Dynamic filter networks. In
NeurIPS , 2016. 2, 5[19] Liyiming Ke, Xiujun Li, Yonatan Bisk, Ari Holtzman, ZheGan, Jingjing Liu, Jianfeng Gao, Yejin Choi, and SiddharthaSrinivasa. Tactical rewind: Self-correction via backtrackingin vision-and-language navigation. In
CVPR , 2019. 2[20] Eric Kolve, Roozbeh Mottaghi, Winson Han, Eli VanderBilt,Luca Weihs, Alvaro Herrasti, Daniel Gordon, Yuke Zhu, Ab-hinav Gupta, and Ali Farhadi. AI2-THOR: An Interactive 3DEnvironment for Visual AI. arXiv , arXiv:1712.05474, 2017.1, 6[21] Jacob Krantz, Erik Wijmans, Arjun Majumdar, Dhruv Ba-tra, and Stefan Lee. Beyond the nav-graph: Vision-and-language navigation in continuous environments. arXiv ,arXiv:2004.02857, 2020. 1, 2[22] Federico Landi, Lorenzo Baraldi, Massimiliano Corsini, andRita Cucchiara. Embodied vision-and-language navigationwith dynamic convolutional filters. In
BMVC , 2019. 2[23] Xiujun Li, Chunyuan Li, Qiaolin Xia, Yonatan Bisk, AsliC¸ elikyilmaz, Jianfeng Gao, Noah A. Smith, and Yejin Choi.Robust navigation with language pretraining and stochasticsampling. In
EMNLP/IJCNLP , 2019. 2[24] Jingyu Liu, Liang Wang, and Ming-Hsuan Yang. Referringexpression generation and comprehension via attributes. In
ICCV , 2017. 3[25] Chih-Yao Ma, Jiasen Lu, Zuxuan Wu, Ghassan AlRegib,Zsolt Kira, Richard Socher, and Caiming Xiong. Self-monitoring navigation agent via auxiliary progress estima-tion. In
ICLR , 2019. 2[26] Chih-Yao Ma, Zuxuan Wu, Ghassan AlRegib, CaimingXiong, and Zsolt Kira. The regretful agent: Heuristic-aidednavigation through progress estimation. In
CVPR , 2019. 2[27] Matt MacMahon, Brian Stankiewicz, and Benjamin Kuipers.Walk the talk: Connecting language, knowledge, and actionin route instructions. In
AAAI , 2006. 2
[28] Manolis Savva*, Abhishek Kadian*, Oleksandr Maksymets*, Yili Zhao, Erik Wijmans, Bhavana Jain, Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, Devi Parikh, and Dhruv Batra. Habitat: A Platform for Embodied AI Research. In
ICCV , 2019. 1[29] Dipendra Misra, John Langford, and Yoav Artzi. Mappinginstructions and visual observations to actions with rein-forcement learning. In
EMNLP , 2017. 1[30] Varun K Nagaraja, Vlad I Morariu, and Larry S Davis. Mod-eling context between objects for referring expression under-standing. In
ECCV , 2016. 3[31] Hyeonwoo Noh, Paul Hongsuck Seo, and Bohyung Han. Im-age question answering using convolutional neural networkwith dynamic parameter prediction. In
CVPR , 2016. 3[32] Cristian Rodriguez-Opazo, Edison Marrese-Taylor, FatemehSaleh, Hongdong Li, and Stephen Gould. Proposal-free tem-poral moment localization of a natural-language query invideo using guided attention. In
WACV , 2020. 5[33] Anna Rohrbach, Marcus Rohrbach, Ronghang Hu, TrevorDarrell, and Bernt Schiele. Grounding of textual phrases inimages by reconstruction. In
ECCV , 2016. 2, 3[34] Evan Shelhamer, Jonathan Long, and Trevor Darrell. Fullyconvolutional networks for semantic segmentation. In
IEEE TPAMI, 2017. 3
[35] Mohit Shridhar, Jesse Thomason, Daniel Gordon, Yonatan Bisk, Winson Han, Roozbeh Mottaghi, Luke Zettlemoyer, and Dieter Fox. ALFRED: A benchmark for interpreting grounded instructions for everyday tasks. In
CVPR , 2020.1, 2, 3, 4, 5, 6, 7, 8, 11, 12[36] Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Cˆot´e,Yonatan Bisk, Adam Trischler, and Matthew Hausknecht.ALFWorld: Aligning Text and Embodied Environments forInteractive Learning. arXiv , arXiv:2010.03768, 2020. 2[37] Hao Tan, Licheng Yu, and Mohit Bansal. Learning to nav-igate unseen environments: Back translation with environ-mental dropout. In
NAACL , 2019. 2[38] Takayuki Okatani Van-Quang Nguyen. A hierarchi-cal attention model for action learning from realisticenvironments and directives.
ECCV EVAL Workshop,https://askforalfred.com/EVAL/ , 2020. 2, 7[39] Harm D. Vries, Florian Strub, J´er´emie Mary, HugoLarochelle, Olivier Pietquin, and Aaron C. Courville.Modulating early visual processing by language. arXiv ,arXiv:1707.00683, 2017. 3[40] Liwei Wang, Yin Li, and Svetlana Lazebnik. Learningdeep structure-preserving image-text embeddings. In
CVPR ,2016. 3[41] Xin Wang, Wenhan Xiong, Hongmin Wang, and WilliamYang Wang. Look before you leap: Bridging model-freeand model-based reinforcement learning for planned-aheadvision-and-language navigation. In
ECCV , 2018. 2[42] Fanbo Xiang, Yuzhe Qin, Kaichun Mo, Yikuan Xia, HaoZhu, Fangchen Liu, Minghua Liu, Hanxiao Jiang, Yifu Yuan,He Wang, Li Yi, Angel X. Chang, Leonidas J. Guibas, andHao Su. Sapien: A simulated part-based interactive environ-ment. In
CVPR , 2020. 1 [43] Licheng Yu, Zhe Lin, Xiaohui Shen, Jimei Yang, Xin Lu,Mohit Bansal, and Tamara L. Berg. Mattnet: Modular at-tention network for referring expression comprehension. In
CVPR , 2018. 2, 3, 4[44] Licheng Yu, Patrick Poirson, Shan Yang, Alexander C. Berg,and Tamara L. Berg. Modeling context in referring expres-sions. In
ECCV , 2016. 3[45] Yuke Zhu, Daniel Gordon, Eric Kolve, Dieter Fox, Li Fei-Fei, Abhinav Gupta, Roozbeh Mottaghi, and Ali Farhadi. Vi-sual semantic planning using deep successor representations.In
ICCV , 2017. 1[46] Charles Lawrence Zitnick and Piotr Doll´ar. Edge boxes: Lo-cating object proposals from edges. In
ECCV, 2014. 3

Appendix
A. Quantitative Analysis on Task-Type and Sub-goal Performance
The tasks in ALFRED [35] are divided into 7 high-level categories. Table 4 shows the performance of MOCA on each task type. On short-horizon tasks such as Pick & Place and Examine, Shridhar et al. achieve some success in seen environments but have near-zero unseen success rates. MOCA outperforms them on both of these task types in both seen and unseen scenes by large margins. Stack & Place and Pick Two & Place are the two most complex and longest-horizon task types in ALFRED. MOCA achieves 5.2% and 11.2% seen success rates, compared to 0.9% and 0.8% for Shridhar et al. [35]. It also achieves some success in unseen scenes, whereas Shridhar et al. show zero unseen success rates.

Following [35], we also examine the performance of MOCA on individual sub-goals. For the sub-goal analysis, we use the expert trajectory to move the agent to the starting time step of the respective sub-goal; the agent then starts inference from the current observations. Table 5 shows the agent's performance on individual sub-goals. The Goto sub-goal is indicative of the navigation ability of an agent. Even though navigation in unseen, visually complex environments is more challenging, our model achieves 32% as opposed to 22% for Shridhar et al. [35]. Although the gap between the average sub-goal performance of Shridhar et al. and MOCA is relatively small, MOCA drastically outperforms it on full task completion, as shown in Table 1 of the main paper. This indicates MOCA's ability to succeed on overall task completion rather than limiting itself to memorizing short-term sub-goals.
B. Analysis on Object Class Reasoning
We investigate the importance of reasoning about object categories by removing the target class prediction stage from our object-centric mask prediction. For this ablation, our agent selects the mask instance with the highest confidence score across all classes (i.e., without class prediction of a target object). We observe that this leads to a huge drop in performance, as the agent tries to interact with incorrect objects and hence fails to accomplish most of the tasks; the success rate drops substantially on both the Seen and Unseen folds. This is indicative of the importance of explicitly enabling our agent to condition its mask prediction on object class information.
Rows of Table 4: Pick & Place, Cool & Place, Stack & Place, Heat & Place, Clean & Place, Examine, Pick Two & Place, and Average; columns report Shridhar et al. [35] and MOCA (ours) on the Seen and Unseen folds.

Table 4: Success rates across the 7 task types in ALFRED. All values are in percentage. The agent is evaluated on the validation set. The highest values per fold are indicated in blue.

Sub-Goal    Shridhar et al. [35]    MOCA (ours)
            Seen    Unseen          Seen    Unseen
Goto        51      22              54      32
Pickup      32      21              53      44
Put         81      46              62      39
Cool        88      92              87      38
Heat        85      89              84      86
Clean       –       57              79      –
Slice       25      12              51      55
Toggle      100     32              93      11
Average     68      46              70      47

Table 5: Sub-goal success rate. The highest values per fold and task are shown in blue.

C. Qualitative Analysis of Task Completion

We present qualitative examples of the task completion ability of our agent and contrast it with Shridhar et al. [35] in the attached videos. Each frame of the videos includes the goal statement and step-by-step instructions.
The step-by-step instruction that the agent tries to accomplish at the current time step is highlighted in yellow. When our agent MOCA performs an interaction, the predicted target class of the object at that time step is shown in the top-left corner of the egocentric frame. Note that we do not show the object class for Shridhar et al. [35], since they produce class-agnostic masks. We present both success and failure cases of our agent.

In success 1.mp4, while Shridhar et al. [35] fails to navigate to the right objects, i.e., the yellow spray bottles, MOCA successfully navigates and places both of them on top of the toilet, thereby satisfying the goal statement. This indicates our action policy module's (APM) ability to predict an accurate action sequence based on the vision and language inputs. In success 2.mp4, both MOCA and Shridhar et al. [35] navigate correctly to the right locations at various stages of the task. However, when the instruction asks to pick up the lettuce, MOCA correctly localizes and picks up the correct object; the visual perception module (VPM) of MOCA, which enables it to reason about object classes, helps it predict the mask of the correct object, i.e., the lettuce. On the contrary, Shridhar et al. [35] picks up a cup that was not mentioned in the instruction at all, thereby failing the task even though it performs all the other actions accurately. This can be attributed to the class-agnostic nature of its interaction mask prediction. Similarly, in success 3.mp4, while Shridhar et al. [35] fails to pick up the knife, due to an inaccurately localized mask under limited visibility, and picks up the spatula instead, MOCA rightly picks up the knife and successfully accomplishes the task. success 4.mp4 shows a similar example.

success 5.mp4 demonstrates the ability to perform tasks in a more efficient manner. Even though Shridhar et al. [35] successfully navigates to the cup, it takes many unnecessary navigation actions, which harms the path-length-weighted score. After picking up the cup, it fails to navigate further, ends up stuck at a desk, and therefore fails. Had our agent MOCA faced a similar scenario, our obstruction detection module would have kicked in and helped the agent evade the obstacle. MOCA, on the other hand, navigates to the correct objects of interest, i.e., the cup, the refrigerator, and a counter. It also performs accurate interactions and therefore accomplishes the given task.

In the fail.mp4 video, Shridhar et al. [35] tries to interact with an irrelevant object (a cloth) instead of the tissue box and fails to complete the task. Similarly, our agent MOCA also tries to interact with a wrong target object (a soap bottle), as it fails to navigate to a position from which the intended object is visible. This misleads the VPM into perceiving the soap bottle as a tissue box; the agent therefore tries to place an unintended object on top of the toilet and fails the task.
Figure 6:
Language attention at various frames with and without decoupling policy and perception.
The colors of the frame borders and words denote that the agent at that frame focuses on the same-colored words. a_t denotes the action taken at time step t. (a) Without decoupling, the language attention of the agent keeps focusing on 'apple' irrespective of the action taken. (b) With decoupling, the language attention focuses on the words that correspond to the action taken at that time step.

D. Benefit of Decoupling Policy and Perception