Utilising Prior Knowledge for Visual Navigation: Distil and Adapt

M. Mahdi Kazemi Moghaddam, Qi Wu, Ehsan Abbasnejad, and Javen Shi

The Australian Institute for Machine Learning, The University of Adelaide
Abstract.
We, as humans, can impeccably navigate to localise a target object, even in an unseen environment. We argue that this impressive ability is largely due to the incorporation of prior knowledge (or experience) and visual cues that current visual navigation approaches lack. In this paper, we propose to use externally learned prior knowledge of object relations, which is integrated into our model via a neural graph. To combine appropriate assessment of the states and the prior (knowledge), we propose to decompose the value function in the actor-critic reinforcement learning algorithm and incorporate the prior in the critic in a novel way that reduces the model complexity and improves model generalisation. Our approach outperforms the current state-of-the-art on the AI2THOR visual navigation dataset.
Keywords:
Visual Navigation, Knowledge Graph, GTN, AI2THOR
1 Introduction

We, human beings, are capable of finding an object in an unexplored environment, e.g. a room in a new house. We primarily rely on our prior knowledge, in addition to other sensory information. For example, we know that in any bathroom, bar soap or liquid hand wash is probably near the basin, thus observing one helps finding the other. Our belief also needs to be adjusted upon observations; for example, bar soap or liquid hand wash might be misplaced. It is desirable to develop a navigation robot or agent that can utilise prior knowledge while being prepared to update its beliefs and adapt to a new environment.

Most current visual navigation approaches use either supervised learning or reinforcement learning (RL) to learn the visual associations. That is, during training the agent explores the environment to seek the optimal mapping from (primarily) the ego-centric observational inputs to a series of actions. There are two main issues with this approach. Firstly, no prior knowledge about the environment is provided to or used by the navigating agent, which significantly limits its use and generalisability. Secondly, the agent learns everything from scratch (i.e. forgets everything and has to re-learn) when it encounters a new environment, hence limiting reusability.

To address the first issue, recent works [18,19,38] used graph neural networks, as the most natural representation of the prior knowledge, to encode the object-object relationships in a pre-trained knowledge graph.
Fig. 1. Motivated by humans' navigation system, our agent is able to benefit from beliefs about object relationships, stored in its knowledge graph, while updating them in order to navigate towards a given object, e.g. a TV.
On the other hand, to handle the second issue, there are works [35] that use meta-learning to allow the agent to quickly adapt to a new environment. They have shown success in handling the train-test distribution shift while being more sample efficient. Their efficiency in adapting external knowledge to unobserved scenes, however, has remained unexplored.

While these recent advances address parts of the problem, the main issue of incorporating a prior graph (e.g. a semantic graph of relationships between objects) into a navigating agent trained with reinforcement learning (RL) persists. In addition, the semantic prior graph has to be grounded to the visual cues in RL and updated when presented with a new environment.

To that end, we propose an approach to efficiently benefit from external knowledge while dynamically updating it as our agent observes new scenes. The external knowledge presents the agent with general rules about the semantic relationships of the objects in indoor environments. Thus, our graph contains as the nodes both the names of the objects and scene representations, and as the edges the existence of a relation between a pair of objects, hence capturing the correlation between the semantics of the scene and the agent's ego-centric view. We further use Graph Transformer Networks (GTN) [39] to learn a representation of that graph to be subsequently used in our agent's decisions. GTN, contrary to counterparts such as Graph Convolutional Networks (GCN) [15], allows for heterogeneous objects and generates new structures, corresponding to new connections that are not in the initial graph. This is particularly useful in our case, where an agent may encounter objects with relations different to those of the prior: for instance, while dish-washing soap and hand wash may be semantically very close, they are typically in different locations in an indoor environment (e.g. bathroom and kitchen, which can be far apart). As such, the desired graph solution has to be able to learn these connections from the data, rendering GTN the better choice.

To incorporate the graph into the RL algorithm, we further found that simply conditioning the RL policy on the prior graph does not lead to better performance.
To remedy the issue, we first noticed that, intuitively, the prior graph should be used to guide the RL training, rather than providing a signal for each individual action. Secondly, the success of the agent, which is reflected in its expected (accumulated) reward, is partially due to a proper prior rather than the current state alone. As such, we devise an approach to decompose the reward to account for both the state (as is the convention) and the prior graph's contribution. This, moreover, enables the policy to distil the prior's knowledge rather than seeking to exploit all the details that might not necessarily be related to its navigation decisions.

This naturally fits a variant of an actor-critic RL algorithm (e.g.
Asynchronous Advantage Actor-Critic, A3C [25], in our case), where the critic's role is divided between its evaluation of the current state and the prior graph. We show that this reduces the variance of the gradients, leading to a guided learning that improves the performance. Finally, we employ Model-Agnostic Meta-Learning (MAML) [9] to enable test-time adaptation of the prior and the policy. All in all, this leads to a principled and modular approach that can be employed in conjunction with other approaches for navigation. We show that the combination of these three components leads to state-of-the-art results on the AI2THOR dataset.

In summary, our main contributions are:

– For the first time, we introduce a method to distil and adapt prior knowledge for RL-based visual navigation;
– We theoretically prove, and empirically demonstrate, how to efficiently inject the prior for value estimation in actor-critic RL models, which leads to lower variance and higher performance;
– Finally, our proposed method outperforms the existing state-of-the-art on the AI2THOR public navigation dataset in all four evaluation metrics.
2 Related Work

Classical approaches to robotic navigation mainly divide the problem into localisation, mapping and planning [5,21,24]. Besides computational complexity issues, those approaches lack semantic scene understanding [6]. Semantic understanding is especially important for real-world navigation scenarios, for example where the robot is asked to navigate to an object [11,42] or an outdoor location [12,22]. End-to-end visual navigation has recently been extensively studied [1,23] and many novel tasks have been introduced. The main approaches can be divided into supervised (imitation learning) [2,3] and unsupervised (RL) [23]. The target is also given in different modalities: some tasks consider target images [11,42] while most others consider language instructions [1–3,34,35,38]. Providing the target as an image simplifies the task by introducing similarity measures between observation and target. Multi-modality, however, increases the challenge. This is mainly because the agent has
to ground language instructions on observations while performing planning and navigation. This is further complicated where the agent has to learn through exploration, e.g.
RL. A few valuable simulators have been released recently for various visual (or vision-and-language) navigation tasks [3,16,29,36]. Among them, AI2THOR is of especial interest to us, mainly due to its high-quality, near photo-realistic design and continuous state space. The latter renders the task specifically more challenging compared to other environments like [3]: rather than traversing graph nodes in the environment, the agent is placed in a near-real-world setup with a hugely expanded state space. In this environment, it is very likely for a sub-optimal agent to stand in front of a blocked path, e.g. an obstacle, and continuously perform a failing action, e.g. move forward, until the maximum step limit is exhausted. In our task, we use RL to train an agent to navigate to target objects given object names as the language instruction.
Graph Neural Networks

Graph neural networks have recently been applied successfully to different supervised and semi-supervised tasks [15,31,39,41]. While most of these tasks are classification of structured data, such as citation graphs, they have also been used to represent structured knowledge and reasoning [37,40]. Graph neural networks have also been used in RL-based navigation tasks to represent topological environment maps [20] and to help more efficient exploration [7]. In [20] the authors use the graph to localise the agent in the environment. Generating and incorporating scene graphs [10] is also closely related to our problem. However, here we construct our knowledge graph externally and refine the node relationships without explicit object detection. This also separates our approach from [26], where an off-the-shelf object detector is used. These methods, while improving the performance, have been explored before and can be applied to any other approach for further improvements, including ours. A similar work to ours is proposed by Vijay et al. [32], where prior knowledge is injected into RL for navigation. In that work, the authors learn various edge features encoded as one-hot vectors, and apply their approach to a 2D fully-observable environment, while in our case the environment is more challenging. The most relevant work to ours is proposed by Yang et al. [38]. In that work, the authors use a similarly trained knowledge graph for navigation in a different scenario, where all the objects in the graph are used as targets, to show the ability of the agent to navigate to objects it was not trained for. There are a few major differences to our approach, however. Firstly, we have five layers of graphs, as discussed in Section 3.2. Secondly, our graph is adaptively used for value estimation rather than as observation input. Thirdly, our graph embeddings are fundamentally different, being inspired by the recent Graph Transformer Networks [39].
Meta-Learning

Generalisability of trained neural networks has always been a major challenge due to the distribution gap between training and test environments. In some cases this gap is between simulation and the real world [27,28]. In simulation the gap is narrower, which makes it a good fit for meta-learning approaches [9,30]. In meta-learning the aim is to learn a loss that bridges the two distributions. In [9] the authors propose to learn an initialisation of the whole network for faster and more efficient adaptation using second-order derivatives. Meta-learning has also been applied to RL in different scenarios [8,33]. Of especial interest to us is the recent work of Wortsman et al. [35], where the authors benefit from Model-Agnostic Meta-Learning [9] and design a trainable self-adaptation loss.
3 Method

In this section, we first define the problem and then discuss our proposed method in more depth.
3.1 Problem Definition

Our proposed method is based on actor-critic methods in RL. The navigation task is divided into episodes. The overall episode scenario is as follows: the agent is randomly spawned at a position in one of the available scenes of a randomly selected room type (out of four). There are many different scenes in each room type, each with its specific design and configuration. Then a randomly sampled target object, from among the visible objects in the scene, is presented to the agent in plain language, e.g. "fridge" or "soap". The only observation accessible to the agent is its egocentric RGB image at each time step. The agent has to take actions sampled from its policy, based on the observation at each time step, to find the target object. An episode ends if either the agent stops within a defined distance of an instance of the target object or the maximum number of actions is exhausted. In our RL-based method we define the problem as a Partially-Observable Markov Decision Process (POMDP), a tuple $\{X, A, r, \gamma\}$. Here $X$ is the state space comprising the RGB observation images, $A$ is the action space, $r$ is the reward and $\gamma$ is the discount factor. This setup follows our main baseline [35].

Following the recent conventions in visual navigation tasks, we measure the performance of our method based on success rate and SPL; the former considers just the outcome while the latter measures the quality of navigation relative to the optimal trajectory, using the following formula: $\frac{1}{N}\sum_{i=1}^{N} S_i \frac{O_i}{\max(O_i, L_i)}$, where $S_i$ is a binary value for success, $O_i$ is the optimal length of the $i$-th trajectory and $L_i$ is the actual length traversed by the agent.

In this setup, the agent is trained to maximise the accumulated expected return, $\mathbb{E}_{\tau \sim \pi}\big[\sum_{t=0}^{T} \gamma^t r_t\big]$, where $\tau$ is the trajectory and $\pi$ is the agent's policy. The policy is approximated by a neural network, here a CNN-LSTM variant, $\pi = f(x_t, Y; \theta, \theta_\pi, \theta_v)$; $x_t$ is the state observation, $Y$ is the target vector and $[\theta, \theta_v, \theta_\pi]$ the network parameters, which are defined in more detail in the following sections. More details on the network architecture are presented in Section 4.2.

3.2 Prior Knowledge Graph

Our prior knowledge graph $G(V, E)$ encodes the semantics and the correlation of objects in the scene. The set of nodes $V$ includes features related to all the objects in the environment (whether used as a navigation target or not). Each node feature, $v_i \in \mathbb{R}^d$, encodes the concatenation of the observation features $x_t$ and the semantic vector embedding of the object. For the edges, $e_{ij} = 1$ if and only if the concerned objects appear in the same egocentric view of the agent. All the edge weights are initialised as one, which means different distances are considered equal initially and just the co-occurrence is injected as prior.

Furthermore, different from [38], our adjacency matrix, $A \in \mathbb{R}^{n \times n \times C}$, is a three-dimensional tensor where each channel $C$ encodes the knowledge specific to a scene type. In practice we also add a last channel of self-connections to ensure the trivial relation is included. The graph separation enables the agent to mainly attend to one of the graph channels in each scene and avoid distraction; this way, more scene-specific knowledge can be encoded. Intuitively, the agent should be able to reason about kitchen utensils differently from living room furniture. Despite the separation of channels, our method enables cross-channel reasoning, which is necessary when objects are shared between scenes. A sketch of the adjacency construction is given below.
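As a concrete illustration, a minimal sketch of how such a multi-channel adjacency tensor could be assembled; the function name, argument names and the co-occurrence input format are our own assumptions, not part of the original implementation:

```python
import numpy as np

def build_adjacency(n_objects=89, scene_types=4, cooccurrence=None):
    """Hypothetical helper: builds the prior adjacency tensor A described
    above, with one channel per scene type plus one of self-connections.
    `cooccurrence[s]` is assumed to be a set of object-index pairs seen
    together in scene type s."""
    cooccurrence = cooccurrence or {}
    c = scene_types + 1                        # +1 self-connection channel
    adj = np.zeros((c, n_objects, n_objects), dtype=np.float32)
    for s in range(scene_types):
        for i, j in cooccurrence.get(s, set()):
            # e_ij = 1 iff the objects co-occur; all weights start at one
            adj[s, i, j] = adj[s, j, i] = 1.0
    adj[-1] = np.eye(n_objects, dtype=np.float32)  # trivial self-relations
    return adj
```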
For cross-channel reasoning we adopt the recently proposed Graph Transformer Networks (GTN) of Yun et al. [39], which suit our purpose.
Fig. 2. We acquire our knowledge graph from the relationships between objects of our environment in the Visual Genome dataset [17].
Prior Knowledge Initialisation
Inspired by [38], we initialise our graph using the knowledge existing in the Visual Genome [17] dataset. The graph encodes the co-occurrence of objects as edges, if it is higher than a threshold frequency [38]. However, we consider this only a coarse initialisation, for two main reasons. First, any prior knowledge obtained from external sources may not contain all the relevant information for the target environments we are interested in; this is what is referred to as dataset bias. Second, we argue that even a knowledge graph built from the given training environments is unreliable, mainly because the objects change visually in every new scene.
Therefore, rather than relying on the prior graph, we propose to adapt it dynamically, starting from the prior and distilling it into the policy network parameters. A sketch of the coarse initialisation is given below.
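A minimal sketch of this coarse prior extraction, assuming a per-image list of object annotations from Visual Genome; the function name, input format and threshold value are illustrative assumptions:

```python
from collections import Counter

def prior_edges(image_objects, threshold=3):
    """Connect two objects if their co-occurrence frequency across
    Visual Genome images exceeds a threshold, following [38].
    `image_objects` is assumed to be an iterable of per-image object
    name lists; the threshold value here is arbitrary."""
    counts = Counter()
    for objects in image_objects:
        objects = sorted(set(objects))            # de-duplicate per image
        for idx, a in enumerate(objects):
            for b in objects[idx + 1:]:
                counts[(a, b)] += 1
    return {pair for pair, n in counts.items() if n >= threshold}

# Example: prior_edges([["basin", "soap"], ["basin", "soap", "towel"]], 2)
# returns {("basin", "soap")}.
```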
Knowledge Learning and Adaptation
In order to overcome the distributional shift across scenes, our method updates the prior knowledge and adapts it to new environments. This shift is a particularly major challenge in our task, where no perfect prior applicable to all the scene objects can be found. Therefore, we use the prior knowledge as just an initial belief about new scenes and not as a strict rule.

Using GTN [39], our agent is able to learn new edges and weights among all the nodes, whether inside an adjacency layer or across different ones. This way, depending on the scene, we are able to extract features from the graph that help more efficient navigation. Furthermore, the graph structure along with the extracted features are adapted to the episode at hand. Formally, we have:

$$H_i^l = \mathrm{softmax}(W_i^l A), \qquad H_i = \prod_l H_i^l, \tag{1}$$

where we learn multiple normalised (softmax) weighted sums across the channels of the adjacency matrix $A$, in $H_i^l$; $l$ and $M$ are hyper-parameters. Here, $H_i$ learns an adjacency matrix as the result of the matrix multiplication of the $H_i^l$s; we can learn up to $M$ different new adjacency matrices. In addition, using $\Vert$, we concatenate the $M$ learnt graph representations using node feature extractor weights $G_\psi$, i.e.

$$Q = \big\Vert_{i=1}^{M} \sigma\big(\tilde{D}_i^{-1} \tilde{H}_i G_\psi(X)\big). \tag{2}$$

The input node feature matrix is $X \in \mathbb{R}^{n \times d}$. Also, $\tilde{H}_i = H_i + I$ is the augmented $i$-th adjacency matrix with self-connections, and $\tilde{D}_i^{-1}$ is its inverse degree matrix for normalisation. Therefore, the output graph representation vector $Q$ is the result of both node and edge operations dynamically learnt during training, before being distilled and adapted by the policy. The edge operations on the adjacency matrix allow us to go beyond prior knowledge incorporation to learning and adapting it, as we will discuss in the subsequent sections.

For the adaptation, we adopt MAML [9] to continuously adapt the learnt knowledge during test time. To do so, in our RL setup, we divide the training trajectories into meta-train $D$ and meta-validation $D'$ domains. Then, a loss function parameterised by $\phi$ is learnt to compensate for the domain shift during test. The overall optimisation objective for each training trajectory sample is:

$$\min_{\theta_{total},\, \phi} \; \mathcal{L}_{RL}\big([\theta_{total}, \phi] - \alpha_{meta} \nabla_{\theta_{total}} \mathcal{L}_\phi^D(\theta_{total}, D), \; D'\big). \tag{3}$$

We define $\theta_{total} = [\theta, \theta_\pi, \theta_v, \psi]$ in this equation for readability. In addition, $\alpha_{meta}$ is a learning-rate hyper-parameter for adaptation relative to each parameter set in $\theta_{total}$. A sketch of the graph module of Eqs. (1) and (2) is given below.
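A minimal PyTorch sketch of Eqs. (1) and (2), with illustrative sizes (5 channels, M = 2 learnt graphs, l = 2 composition steps, matching the numbers in Section 4.2); the class and argument names are our own:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphModule(nn.Module):
    """Sketch of the GTN-style embedding of Eqs. (1)-(2). Each of the M
    output graphs is a product of l softmax-mixed combinations of the C
    adjacency channels; node features pass through G_psi."""

    def __init__(self, c=5, m=2, l=2, d_in=1024, d_out=512):
        super().__init__()
        self.mixers = nn.Parameter(torch.randn(m, l, c))  # W_i^l of Eq. (1)
        self.g_psi = nn.Linear(d_in, d_out)               # G_psi of Eq. (2)

    def forward(self, adj, x):
        # adj: (C, n, n) prior adjacency tensor; x: (n, d_in) node features
        outs = []
        for graph_mixers in self.mixers:                  # M learnt graphs
            h = None
            for w in graph_mixers:                        # Eq. (1): product of
                mix = torch.einsum('c,cjk->jk', F.softmax(w, dim=0), adj)
                h = mix if h is None else h @ mix         # softmax-mixed layers
            h = h + torch.eye(h.shape[0])                 # H~_i = H_i + I
            d_inv = torch.diag(1.0 / h.sum(dim=1).clamp(min=1e-6))
            outs.append(torch.relu(d_inv @ h @ self.g_psi(x)))  # Eq. (2), sigma = ReLU
        return torch.cat(outs, dim=-1)                    # concatenation over M
```

A pooling or linear layer over the node dimension (omitted here) would reduce this output to the 512-dimensional vector consumed by the critic, as described in Section 4.2.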
3.3 Graph-Based Value Estimation

We employ a variant of the actor-critic RL algorithm, A3C, as the RL core of our method to build upon. Even though we experiment with a single algorithm, we observe no specific requirement preventing our method from being applied to other actor-critic algorithms.

In our method, the actions are sampled from $\pi = f(x_t, Y; \theta, \theta_\pi, \theta_v)$, where $x_t$ is the RGB image observation at time $t$, $Y$ is the semantic embedding vector of the target object in the current episode, $\theta$ is the set of parameters of the backbone (CNN-LSTM) embedding network, $\theta_\pi$ is the set of parameters of the policy sub-network (the actor) and $\theta_v$ is the set of parameters of the value sub-network (the critic).

Policy gradient methods are known to have high variance in practice. Therefore, in actor-critic methods the gradient variance is reduced by using the bootstrapped estimates of the state-value function as the baseline. This estimate is provided by $V = g(x_t, Y, G; \theta_v, \psi)$, where $G$ is our graph neural network parameterised by $\psi$. An overall illustration of our method is given in Figure 3.

Conventionally, the policy and value functions share the network parameters, except for the last layer. In our proposed method, however, we augment the critic sub-network with the knowledge features extracted from our proposed graph. These features lead to a more accurate value estimate, which then reduces the variance of the final policy updates. This is because the advantage function at each state is defined as $A(x_t) = r(a_t|x_t) + V(x_{t+1}) - V(x_t)$ and the policy is updated using the gradients from $\mathcal{L}_\pi = -\log(\pi(a_t|x_t)) \times A(x_t) - \beta \times H_t(\pi)$, where $H_t$ is the entropy and $\beta$ its hyper-parameter to encourage exploration, which are not our concern at this point. Therefore, a more accurate value estimate reduces the variance in the gradients of the policy, which we hypothesise can then improve the optimality of the learned policy. We empirically validate this hypothesis; a sketch of the decomposed critic is given below.
Fig. 3. Overview of our approach. Inspired by MAML [9], during training we learn an adaptation loss that can adapt the knowledge to unseen scenes. The target is encoded with GloVe embeddings ($Y = \mathrm{Glove(target)}$); the CNN-LSTM backbone feeds the actor and the graph-augmented critic.
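To make the decomposed critic concrete before the formal definition below, a minimal sketch; layer sizes follow Section 4.2, the class and argument names are ours, and the two linear maps are equivalent to one linear layer over the concatenation used in our implementation:

```python
import torch
import torch.nn as nn

class DecomposedCritic(nn.Module):
    """Sketch of the graph-augmented value head: the return estimate is
    decomposed into a state term and a prior-knowledge term, aggregated
    by the linear maps W1 and W2 (see Eq. (5) below)."""

    def __init__(self, d_state=512, d_graph=512):
        super().__init__()
        self.w1 = nn.Linear(d_graph, 1, bias=False)  # prior-graph contribution
        self.w2 = nn.Linear(d_state, 1, bias=False)  # state contribution

    def forward(self, lstm_hidden, graph_repr):
        # V(x_t) = W1 G(x_t, X; psi) + W2 f(x_t, Y; theta, theta_v)
        return self.w1(graph_repr) + self.w2(lstm_hidden)
```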
In A3C, the variance of the policy gradients is reduced by integrating estimates of the state value into the advantage function. Unlike common practice, here we partially separate the critic head's parameters $\theta_v$ from the policy head's $\theta_\pi$ to improve the value estimation and, as a result, the variance. Intuitively, there is a correlation between the objects present in the observation at the current time step $x_t$ and the target object to navigate to. This correlation shows up in the estimated value of the current state, $V(x_t) = \mathbb{E}\big[\sum_t \gamma^t r_t\big]$, as the relationships defined by the edges of our graph. Therefore, in our method we regress the state-value function according to:

$$V(x_t) = \mathbb{E}\Big[\sum_{t=0}^{T-1} \gamma^t r_t\Big], \tag{4}$$

$$V(x_t) = W_1\, G(x_t, X; \psi) + W_2\, f(x_t, Y; \theta, \theta_v), \tag{5}$$

where $W_1$ and $W_2$ are aggregation parameters of the two sub-networks, implemented as linear layers in practice. Theoretically, we view this as the decomposition of the reward (or return, i.e. the expectation of cumulative future rewards) into two components: one estimated by the main backbone network parameterised with $\theta$, and the other estimated by the relations between the semantic target, the semantic correlation of the available objects in the scene and the correlation of those with the current observation $x_t$. This way we reduce the variance of gradients in our actor-critic algorithm, as defined in the following per-step loss function:

$$\mathcal{L}_{A3C}(a_t|x_t) = -\log \pi(a_t|x_t; \theta, \theta_\pi)\big(r_t(a_t|x_t) + (V(x_{t+1}) - V(x_t))\big).$$

It should be noted that we continue following the original A3C method by augmenting the above loss with the entropy regulariser to encourage exploration.

4 Experiments

4.1 Experimental Setup

We use the AI2THOR environment as our experimental framework. This simulator consists of photo-realistic indoor environments (e.g. houses) categorised into four different room types: kitchen, bedroom, bathroom and living room. For a fair comparison, we follow the same setup as SAVN [35]. In this setup, 20 scenes of each room type are used for training, 5 scenes of each for validation and 5 for test. We train all our methods until convergence or a maximum of seven million episodes, whichever occurs first. The target objects for each room type are as follows: kitchen: toaster, microwave, fridge, coffee maker, garbage can, box and bowl; living room: pillow, laptop, TV, garbage can, box and bowl; bedroom: plant, book, lamp and alarm clock; and bathroom: sink, toilet paper, soap bottle and light switch, totalling 23. All 89 objects available in the dataset are included in the graph. The objects are chosen to be small enough not to be easily seen from a distance without exploration. The same target object list is shared between training, validation and test, but the scenes are unique. We implement our model in the PyTorch framework. We use SGD as the adaptation optimiser and Adam [14] otherwise. The loss function of our A3C algorithm is the same as in the original approach. For the reward, we use 5 for reaching the target and -0.01 for each step, with episodes limited to 50 steps.
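Before turning to the quantitative comparison, a small self-contained sketch of the SPL metric defined in Section 3.1 and used in the tables below; the function and argument names are ours:

```python
def spl(successes, optimal_lengths, actual_lengths):
    """Success weighted by Path Length:
    SPL = (1/N) * sum_i S_i * O_i / max(O_i, L_i)."""
    terms = [
        s * o / max(o, l)
        for s, o, l in zip(successes, optimal_lengths, actual_lengths)
    ]
    return sum(terms) / len(terms)

# Example: spl([1, 0, 1], [4.0, 6.0, 10.0], [5.0, 20.0, 10.0]) == 0.6
```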
Table 1. Comparison of our results with the baselines. Our approach improves over all the baselines in all four evaluation metrics conventionally used in previous SOTA.
Method     SPL    Success  SPL (L≥5)  Success (L≥5)
A3C
GCN [38]   15.47  35.13    11.37      22.25
SAVN [35]  16.15  40.86    13.91      28.70
Ours       17.27  43.8     15.39      33.68
4.2 Network Architecture

We extract the observation features $x_t$ using a pre-trained ResNet-18 at each time step. For computational efficiency, these features are extracted and saved once for later use. We use GloVe [13] to generate 300-dimensional semantic embeddings for the target as well as the graph objects. The input to our actor-critic network is therefore the concatenation of the target object and the observation features, as a 1024-dimensional feature vector.

Our actor-critic network comprises an LSTM with 512 hidden states and two fully-connected layers, one for the actor and the other for the critic. The actor outputs a 6-dimensional distribution $\pi(a_t|x_t)$ over actions using a softmax, while the critic estimates a single value. As mentioned before, another novelty of our approach is an unconventional value estimation network, in which the hidden state of the LSTM is concatenated with a 512-dimensional representation vector extracted from the graph.

The input to the graph, as node features, is a 1024-dimensional vector: a concatenation of 512 observation features with 512-dimensional GloVe [13] embeddings of the objects in the simulator (the GloVe embeddings are mapped from 300 to 512 dimensions using linear layers). There are 89 nodes in each layer of the graph's adjacency matrix and 5 layers in total: 4 layers dedicated to the edges between objects in each scene type and one self-connection layer for regularisation. We learn a two-layer adjacency matrix using GTN. A sketch of the full network follows.
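A minimal sketch of the architecture just described; the class and argument names are ours, and the 512-dimensional graph representation is assumed to come from the graph module sketched earlier, feeding only the critic:

```python
import torch
import torch.nn as nn

class NavActorCritic(nn.Module):
    """Sketch of the CNN-LSTM actor-critic of Section 4.2; exact layer
    names are our own assumptions."""

    def __init__(self, d_obs=512, d_glove=300, n_actions=6, d_graph=512):
        super().__init__()
        self.embed_target = nn.Linear(d_glove, d_obs)   # GloVe 300 -> 512
        self.lstm = nn.LSTMCell(2 * d_obs, 512)         # 1024-d input
        self.actor = nn.Linear(512, n_actions)          # policy head
        self.critic = nn.Linear(512 + d_graph, 1)       # graph-augmented head

    def forward(self, obs_feat, target_glove, graph_repr, hc):
        x = torch.cat([obs_feat, self.embed_target(target_glove)], dim=-1)
        h, c = self.lstm(x, hc)
        pi = torch.softmax(self.actor(h), dim=-1)       # pi(a_t | x_t)
        v = self.critic(torch.cat([h, graph_repr], dim=-1))
        return pi, v, (h, c)
```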
4.3 Comparison with Baselines

To show the contribution and necessity of each component of our final method, we first compare it with several baselines (Table 1): first, the previous state-of-the-art introduced by Wortsman et al. [35], abbreviated as SAVN; second, a similar work proposed by Yang et al. [38], where the authors use a fixed knowledge graph structure, abbreviated as GCN. In SAVN, the authors use MAML [9] to learn a loss function $\mathcal{L}_{int}$, approximated by an instance of Temporal Convolutional Networks [4], over training episodes. The learnt loss is then used during validation/test to produce gradients that update the network weights according to the task at hand. We share this general adaptation framework while performing the adaptation quite differently, on our knowledge graph. In GCN [38], the authors use a single graph for all the objects existing in the dataset. The embedding output of the graph is concatenated to the target vector embedding and the observation features, comprising the input to the policy network.
Fig. 4. Qualitative comparison. Our approach improves both the effective trajectory length (SPL) and the success rate (top: SAVN fails after 21 steps where ours succeeds in 13 steps; bottom: SAVN fails after 21 steps where ours succeeds in 34 steps). Trajectory steps are randomly sub-sampled for visualisation purposes.

We conjecture, and experimentally show, that this approach increases the difficulty of policy optimisation by enlarging the observed state space. In our approach, we benefit from the graph information more efficiently using our proposed value estimation method. In addition, our graph structure and embedding architecture are also different. To further show the capabilities of our method, we also compare our results with trivial methods. One is a random agent whose policy is to uniformly sample an action at all times. Another relatively trivial baseline is named A3C: it is the result of removing both the knowledge graph and the adaptation framework, and therefore acts as the simplest RL-based agent. We present the new state-of-the-art results on the AI2THOR public visual navigation dataset achieved using our method. All the baselines are trained in the same setting for a fair comparison.
5 Ablation Study

In this section we seek to answer a few principal questions about our proposed method, shedding more light on its strengths as well as its weaknesses. We answer these questions, and some minor unlisted ones along the way, using extensive experiments.

– What is the quantitative improvement with respect to the baselines and previous state-of-the-art results?
– What is the contribution of the graph adaptation method?

As can be seen in Table 1, ours improves over the previous SOTA by almost 3% on success rate and more than 1% on SPL. This shows that our graph adaptation helps the agent find the targets in a smaller number of steps. This is further emphasised on longer trajectories, where the knowledge graph can receive more adaptation gradients (recall that we perform test-time gradient updates every six steps). As shown, this adaptation further increases the performance gain to more than 5% on success rate and more than 2% on SPL, which is double the gain on shorter distances. Thus, the shorter-trajectory performance gain can be attributed to the adaptation learnt by the knowledge graph, while the longer-trajectory gain can be related to the test-time gradient-based adaptation. This confirms the effectiveness of our two-level adaptation approach. Furthermore, it also confirms the intuitive idea behind our approach that a fixed prior knowledge has limitations, which are to some extent compensated by our approach. We also hope this can encourage future research in this promising area.

– What is the contribution of the graph-based value estimation? How would the approach perform without it?

To observe the effectiveness of our graph-based value estimation, we analyse the results shown in Table 2. In this table, we compare our final method with two variants. The first variant, termed ours-input, simply adds our graph as part of the state-space observation to the main network, similar to the approach proposed in [38]. As can be seen, this still improves the baseline results; however, we do not observe a significant gain compared with the two other methods. We conjecture that estimating the value using our graph can, instead, directly draw relations between the states for better action sampling and policy training. As another variant, termed ours-policy, we study the effect of removing the state-space expansion by directly using the graph as a side knowledge base for the policy to condition the actions upon. This approach can be viewed as a weighted ensemble of policy functions representing different distributions. Again, the performance gain is limited compared to our final model. This further confirms the effectiveness of integrating the knowledge graph for value estimation: the size of the state space observed by the model is kept limited, while the algorithm learns how best to employ and adapt the provided knowledge.

– What is the contribution of some of our design choices, such as the graph node features?
Table 2. Our three different methods for knowledge incorporation. This shows the effectiveness of graph-based value estimation.
Method       SPL  Success  SPL (L≥5)  Success (L≥5)
Ours-input
Ours-policy
Ours-value
There are different hyper-parameters in our method that are optimised using conventional routines. Due to computational complexity, an extensive study of all these parameters is practically infeasible (a single training of our method from scratch takes up to six days on a single Quadro RTX8000 GPU with 12 parallel agents). Therefore, we believe the current results could be further improved by more careful hyper-parameter tuning. Among these parameters, however, the graph's node features are a significant design choice. To show the integrity of the current design, we show the effect of removing the observation (egocentric image) features from the node features. The graph then reduces to fixed correlations among the objects; it can also be viewed as a sub-network for the value estimation that stores value decomposition information without considering the observational correlations. As can be seen in Table 3, there is a significant drop in performance. This further demonstrates the contribution of the prior knowledge to value estimation. Additionally, it counters the following argument: that the graph sub-network merely acts as additional parameters for the value function to decompose the return, irrespective of the knowledge stored in the graph.

– Under what circumstances does the model gain performance improvements, and what are its weaknesses?

Finally, in this section we provide an analysis of the practical performance of our method compared to the previous SOTA [35] using sample test-set trajectories. As shown in Figure 4 (top), the agent is navigating towards an instance of "box" in a kitchen scene. SAVN passes the target location (green star) without successfully stopping.
Table 3. Ours-no-image is the variant of our model where the image features are removed from the graph node features. The graph is highly reliant on the observations to learn the relationships.
Method         SPL  Success
Ours-no-image
Ours-best
In contrast, our method successfully stops at the target after 13 steps. In this trajectory only two adaptation steps are performed (one every six steps). A similar scenario occurs in Figure 4 (bottom), where the agent is navigating towards a "book" in a bedroom scene. In this example, our agent misses the target once; however, it is able to return after more adaptation steps are taken to conform the prior knowledge to the current scene. For a more detailed comparison, we also provide results per room type in Table 4.
Table 4. Detailed comparison with the previous SOTA; SPL/success rate are reported per room type. Our method is general enough to improve the performance in three of the four room types, with marginal performance on the fourth.
Method  Bathroom  Bedroom  Kitchen  Living room
SAVN
Ours
Table 5. Ours-unlimited is the variant of our model where, during test time, unlimited adaptation steps are taken (one every six steps). Ours-best is when this is limited to four updates.
Method          SPL  Success
Ours-unlimited
Ours-best
Adaptation Steps

How many adaptation steps are enough during testing? We answer this question experimentally. As can be seen in Table 5, we limit the number of adaptation steps, using a learning rate of 0.01, to only four. This value was chosen experimentally as giving the best performance compared to higher values: if we continue the adaptation during testing, the performance declines. We conjecture that this is due to two different reasons. One is the well-known forgetting problem associated with meta-learning approaches: the agent updates itself to the extent that it loses the useful information stored in the network weights. The second is a limitation of our approach that we plan to investigate further in the future: in longer episodes, the agent experiences more diverse observation states on which the adaptation loss is no longer able to provide useful feedback, so the gradient updates hurt more than they improve the performance. A sketch of this capped test-time adaptation follows.
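A minimal sketch of the capped test-time adaptation loop described above, assuming a learnt loss module and a segment iterator over the episode; all names, including the `next_segment` helper, are hypothetical:

```python
import torch

def test_time_adapt(adaptation_loss, params, episode, lr=0.01, max_updates=4):
    """Applies at most `max_updates` gradient steps of the learnt
    adaptation loss, one per six-step interval. `adaptation_loss`
    stands in for the learnt L_phi; `episode` is a hypothetical
    iterator yielding six-step trajectory segments."""
    for _ in range(max_updates):
        segment = episode.next_segment(6)        # six-step interval
        if segment is None:                      # episode ended early
            break
        loss = adaptation_loss(params, segment)  # learnt self-supervised loss
        grads = torch.autograd.grad(loss, params)
        params = [p - lr * g for p, g in zip(params, grads)]
    return params
```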
6 Conclusion

In this paper we presented, for the first time, the use of knowledge graphs in conjunction with meta-learning for visual navigation, without explicitly employing off-the-shelf object detectors. Using extensive experiments and ablation studies, we demonstrated the efficiency of our approach in benefiting from externally gained prior knowledge while adapting it to new environments where necessary. We showed, for the first time, that distilling the prior knowledge graph through the critic improves the performance of navigating agents.

Having empirically demonstrated the efficacy of our approach, as part of future work we plan to extend it to other RL algorithms. Furthermore, we plan to investigate incorporating the various knowledge bases required for the task of navigation, such as object categories, relative locations, etc. We believe this work opens new avenues for future research on improved knowledge distillation techniques for navigation.
References
1. Anderson, P., Chang, A., Chaplot, D.S., Dosovitskiy, A., Gupta, S., Koltun, V., Kosecka, J., Malik, J., Mottaghi, R., Savva, M., et al.: On evaluation of embodied navigation agents. arXiv preprint arXiv:1807.06757 (2018)
2. Anderson, P., Wu, Q., Teney, D., Bruce, J., Johnson, M., Sünderhauf, N., Reid, I., Gould, S., van den Hengel, A.: Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 3674–3683 (2018)
3. Anderson, P., Wu, Q., Teney, D., Bruce, J., Johnson, M., Sünderhauf, N., Reid, I., Gould, S., van den Hengel, A.: Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (Jun 2018). https://doi.org/10.1109/cvpr.2018.00387
4. Bai, S., Kolter, J.Z., Koltun, V.: An empirical evaluation of generic convolutional and recurrent networks for sequence modeling (2018)
5. Cadena, C., Carlone, L., Carrillo, H., Latif, Y., Scaramuzza, D., Neira, J., Reid, I., Leonard, J.J.: Past, present, and future of simultaneous localization and mapping: Toward the robust-perception age. IEEE Transactions on Robotics 32(6), 1309–1332 (2016)
6. Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The Cityscapes dataset for semantic urban scene understanding. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 3213–3223 (2016)
7. Dai, H., Li, Y., Wang, C., Singh, R., Huang, P.S., Kohli, P.: Learning transferable graph exploration (2019)
8. Duan, Y., Schulman, J., Chen, X., Bartlett, P.L., Sutskever, I., Abbeel, P.: RL2: Fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779 (2016)
9. Finn, C., Abbeel, P., Levine, S.: Model-agnostic meta-learning for fast adaptation of deep networks (2017)
10. Gu, J., Zhao, H., Lin, Z., Li, S., Cai, J., Ling, M.: Scene graph generation with external knowledge and image reconstruction. In: 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (Jun 2019). https://doi.org/10.1109/cvpr.2019.00207
11. Gupta, S., Davidson, J., Levine, S., Sukthankar, R., Malik, J.: Cognitive mapping and planning for visual navigation. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (Jul 2017). https://doi.org/10.1109/cvpr.2017.769
12. Gupta, S., Fouhey, D., Levine, S., Malik, J.: Unifying map and landmark based representations for visual navigation. arXiv preprint arXiv:1712.08125 (2017)
13. Pennington, J., Socher, R., Manning, C.D.: GloVe: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014)
14. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
15. Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016)
16. Kolve, E., Mottaghi, R., Han, W., VanderBilt, E., Weihs, L., Herrasti, A., Gordon, D., Zhu, Y., Gupta, A., Farhadi, A.: AI2-THOR: An interactive 3D environment for visual AI (2017)
17. Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata, K., Kravitz, J., Chen, S., Kalantidis, Y., Li, L.J., Shamma, D.A., et al.: Visual Genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision 123(1), 32–73 (Feb 2017). https://doi.org/10.1007/s11263-016-0981-7
18. Li, R., Tapaswi, M., Liao, R., Jia, J., Urtasun, R., Fidler, S.: Situation recognition with graph neural networks. In: 2017 IEEE International Conference on Computer Vision (ICCV) (Oct 2017). https://doi.org/10.1109/iccv.2017.448
19. Marino, K., Salakhutdinov, R., Gupta, A.: The more you know: Using knowledge graphs for image classification. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (Jul 2017). https://doi.org/10.1109/cvpr.2017.10
20. Meng, X., Ratliff, N., Xiang, Y., Fox, D.: Scaling local control to large-scale topological navigation (2019)
21. Milford, M., Wyeth, G.: Persistent navigation and mapping using a biologically inspired SLAM system. The International Journal of Robotics Research 29(9), 1131–1153 (2010)
22. Mirowski, P., Grimes, M., Malinowski, M., Hermann, K.M., Anderson, K., Teplyashin, D., Simonyan, K., Zisserman, A., Hadsell, R., et al.: Learning to navigate in cities without a map. In: Advances in Neural Information Processing Systems. pp. 2419–2430 (2018)
23. Mirowski, P., Pascanu, R., Viola, F., Soyer, H., Ballard, A.J., Banino, A., Denil, M., Goroshin, R., Sifre, L., Kavukcuoglu, K., et al.: Learning to navigate in complex environments. arXiv preprint arXiv:1611.03673 (2016)
24. Mishkin, D., Dosovitskiy, A., Koltun, V.: Benchmarking classic and learned navigation in complex 3D environments (2019)
25. Mnih, V., Badia, A.P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., Kavukcuoglu, K.: Asynchronous methods for deep reinforcement learning. In: Proceedings of The 33rd International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 48, pp. 1928–1937. PMLR (2016), http://proceedings.mlr.press/v48/mniha16.html
26. Mousavian, A., Toshev, A., Fiser, M., Kosecka, J., Wahid, A., Davidson, J.: Visual representations for semantic target driven navigation. In: 2019 International Conference on Robotics and Automation (ICRA) (May 2019). https://doi.org/10.1109/icra.2019.8793493
27. Pan, X., You, Y., Wang, Z., Lu, C.: Virtual to real reinforcement learning for autonomous driving. In: Proceedings of the British Machine Vision Conference 2017 (2017). https://doi.org/10.5244/c.31.11
28. Pan, X., You, Y., Wang, Z., Lu, C.: Virtual to real reinforcement learning for autonomous driving. arXiv preprint arXiv:1704.03952 (2017)
29. Savva, M., Kadian, A., Maksymets, O., Zhao, Y., Wijmans, E., Jain, B., Straub, J., Liu, J., Koltun, V., Malik, J., Parikh, D., Batra, D.: Habitat: A platform for embodied AI research (2019)
30. Thrun, S., Pratt, L.: Learning to Learn. Springer Science & Business Media (2012)
31. Veličković, P., Cucurull, G., Casanova, A., Romero, A., Liò, P., Bengio, Y.: Graph attention networks (2017)
32. Vijay, V.K., Ganesh, A., Tang, H., Bansal, A.: Generalization to novel objects using prior relational knowledge (2019)
33. Wang, J.X., Kurth-Nelson, Z., Tirumala, D., Soyer, H., Leibo, J.Z., Munos, R., Blundell, C., Kumaran, D., Botvinick, M.: Learning to reinforcement learn. arXiv preprint arXiv:1611.05763 (2016)
34. Wang, X., Xiong, W., Wang, H., Wang, W.Y.: Look before you leap: Bridging model-free and model-based reinforcement learning for planned-ahead vision-and-language navigation. Lecture Notes in Computer Science, pp. 38–55 (2018)
35. Wortsman, M., Ehsani, K., Rastegari, M., Farhadi, A., Mottaghi, R.: Learning to learn how to learn: Self-adaptive visual navigation using meta-learning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 6750–6759 (2019)
36. Xia, F., Zamir, A.R., He, Z., Sax, A., Malik, J., Savarese, S.: Gibson Env: Real-world perception for embodied agents. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 9068–9079 (2018)
37. Xiong, W., Hoang, T., Wang, W.Y.: DeepPath: A reinforcement learning method for knowledge graph reasoning. arXiv preprint arXiv:1707.06690 (2017)
38. Yang, W., Wang, X., Farhadi, A., Gupta, A., Mottaghi, R.: Visual semantic navigation using scene priors. arXiv preprint arXiv:1810.06543 (2018)
39. Yun, S., Jeong, M., Kim, R., Kang, J., Kim, H.J.: Graph transformer networks (2019)
40. Zhang, Y., Dai, H., Kozareva, Z., Smola, A.J., Song, L.: Variational reasoning for question answering with knowledge graph. In: Thirty-Second AAAI Conference on Artificial Intelligence (2018)
41. Zhou, J., Cui, G., Zhang, Z., Yang, C., Liu, Z., Wang, L., Li, C., Sun, M.: Graph neural networks: A review of methods and applications. arXiv preprint arXiv:1812.08434 (2018)
42. Zhu, Y., Mottaghi, R., Kolve, E., Lim, J.J., Gupta, A., Fei-Fei, L., Farhadi, A.: Target-driven visual navigation in indoor scenes using deep reinforcement learning. In: 2017 IEEE International Conference on Robotics and Automation (ICRA) (May 2017). https://doi.org/10.1109/icra.2017.7989381