Deep Active Object Recognition by Joint Label and Action Prediction
Mohsen Malmir, Karan Sikka, Deborah Forster, Ian Fasel, Javier R. Movellan, Garrison W. Cottrell
Mohsen Malmir a,∗, Karan Sikka b, Deborah Forster c, Ian Fasel d, Javier R. Movellan d, Garrison W. Cottrell a

a Computer Science and Engineering Department, University of California San Diego, 9500 Gilman Dr., San Diego, CA 92093, USA
b Electrical and Computer Engineering Department, University of California San Diego, 9500 Gilman Dr., San Diego, CA 92093, USA
c Qualcomm Institute, University of California San Diego, 9500 Gilman Dr., San Diego, CA 92093, USA
d Emotient.com, 4435 Eastgate Mall, Suite 320, San Diego, CA 92121, USA
Abstract
An active object recognition system has the advantage of being able to act in the environment to capture images that are more suited for training and that lead to better performance at test time. In this paper, we propose a deep convolutional neural network for active object recognition that simultaneously predicts the object label and selects the next action to perform on the object with the aim of improving recognition performance. We treat active object recognition as a reinforcement learning problem and derive the cost function to train the network for joint prediction of the object label and the action. A generative model of object similarities based on the Dirichlet distribution is proposed and embedded in the network for encoding the state of the system. The training is carried out by simultaneously minimizing the label and action prediction errors using gradient descent. We empirically show that the proposed network is able to predict both the object label and the actions on GERMS, a dataset for active object recognition. We compare the test label prediction accuracy of the proposed model with Dirichlet and Naive Bayes state encoding. The results of the experiments suggest that the proposed model equipped with Dirichlet state encoding is superior in performance, and selects images that lead to better training and higher accuracy of label prediction at test time.

∗ Corresponding Author
Email addresses: [email protected] (Mohsen Malmir), [email protected] (Karan Sikka), [email protected] (Deborah Forster), [email protected] (Ian Fasel), [email protected] (Javier R. Movellan), [email protected] (Garrison W. Cottrell)
Preprint submitted to Computer Vision and Image Understanding, October 9, 2018

Keywords:
Active Object Recognition, Deep Learning, Q-learning
1. Introduction
A robot interacting with its environment can collect large volumes of dynamic sensory input to overcome many challenges presented by static data. A robot manipulating an object while controlling its camera orientation, for example, is an active object recognition system. In such dynamic interactions, the robot can select the training data for its models of the environment, with the goal of maximizing the accuracy with which it perceives its surroundings. In this paper, we focus on active object recognition (AOR) with the goal of developing a model that can be used by a robot to recognize an object held in its hand.

There are a variety of approaches to active object recognition, the goal of which is to re-position sensors or change the environment so that the new inputs to the system become less ambiguous for label prediction [1, 2, 19]. An issue with previous approaches to active object recognition is that they mostly used small, simplistic datasets, which were not reflective of the challenges in real-world applications [15]. To avoid this problem, we have collected a large dataset for active object recognition, called GERMS (available at http://rubi.ucsd.edu/GERMS/), which contains more than 120K high-resolution (1920x1080) RGB images of 136 different plush toys. This paper extends our previous work, Deep Q-learning [15], where an action selection network was trained on top of a pre-trained convolutional neural network. In this paper we extend the model to train the network end-to-end using GERMS images to jointly predict object labels and action values.

This paper makes two primary contributions. First, we develop a deep active object recognition (DAOR) model to jointly predict the label and the best next action on an input image. We propose a deep convolutional neural network that outputs the object labels and action-values in different layers of the network. We use reinforcement learning to teach the network to predict the action values, and minimize the action value prediction error along with the label prediction cross-entropy error. Second, we propose a generative model of beliefs based on the Dirichlet distribution, which is embedded in the network to encode the state of the system and is trained jointly with the rest of the network parameters.
2. Literature Review
Active object recognition systems include two modules: a recognition module and a control module. Given a sequence of images, the recognition module produces a belief state about the objects that generated those images. Given this belief state, the control module produces actions that will affect the images observed in the future [19]. The controller is typically designed to improve the speed and accuracy of the recognition module.

One of the earliest active systems for object recognition was developed by Wilkes and Tsotsos [3]. They used a heuristic procedure to bring the object into a 'standard' view by a robotic-arm-mounted camera. In a series of experiments on 8 origami objects, they qualitatively report promising results for achieving the standard view and retrieving the correct object labels. Seibert and Waxman explicitly model the views of an object by clustering the images acquired from the view-sphere of the object into aspects [4]. The correlation matrices between these aspects are then used in an aspect network to predict the correct object label. Using three model aircraft objects, they show that the belief over the correct object improves with the number of observed transitions compared to randomly generated paths on the view sphere of these objects.

Schiele and Crowley developed a framework for active object recognition by making an analogy between object recognition and information transmission [5]. They try to minimize the conditional entropy H(O | M) between the original object O and the observed signal M. They used the COIL-100 dataset for their experiments, which consists of 7200 images of 100 toy objects rotated in depth [6]. This dataset has been appealing for active object recognition because it provides systematically defined views of objects. At test time, by sequentially moving to the most and second most discriminative views of each object, Schiele and Crowley achieved almost perfect recognition accuracy on this dataset.

Borotschnig et al. formulate observation planning in terms of maximization of the expected entropy loss over actions [7]. Larger entropy loss is equivalent to less ambiguity in interpreting the image. With an active vision system consisting of a turntable and a moving camera, they report improvements in object recognition over random selection of the next viewing pose on a small set of objects. Callari and Ferrie take into account the object modeling error and search for actions that simultaneously minimize both modeling variance and uncertainty of belief over objects [9]. Using a set of 10 custom clay objects, they report decreases in the entropy of the classifier output and in the Kullback-Leibler divergence between the posterior distribution of each object and the corresponding true distribution.

Browatzki et al. use a particle filter approach to determine the viewing pose of an object held in-hand by an iCub humanoid robot [10, 11]. For selecting the next best action, instead of maximizing the expected information gain, which is computationally expensive, they maximize a measure of variance of observations across different objects. They show that their method is superior to random action selection on small sets of custom objects. Atanasov et al. focus on the comparison of myopic greedy action selection, which looks ahead only one step, and non-myopic action selection, which considers several time steps into the future [12].
They formulate the problem as a Partially Observable Markov Decision Process, showing their method is superior to random and greedy selection of actions on a small set of household objects.

Rebguns et al. used acoustic properties of objects to learn an infomax controller to recognize a set of 10 objects [18]. In this work, they proposed a Dirichlet-based model to fuse information from different observations into a single belief vector. Using this latent variable mixture model for acoustic similarities, the robot learned to rapidly reduce uncertainty about the categories of the objects in a room. The state encoding of our system is similar to the mixture model of this work; however, we embed this model into the network and train its parameters using gradient descent, which is more suited for neural networks.

Paletta and Pinz [8] treat active object recognition as an instance of a reinforcement learning problem, using Q-learning to find the optimal policy. They used an RBF neural network with the reward function depending on the amount of entropy loss between the current and the next state.

A common trend in many of these approaches is the use of small, sometimes custom-designed sets of objects. There are medium-sized datasets such as COIL-100, which consists of 7200 images of 100 toy objects rotated in depth [6]. This dataset is not adequately challenging for several reasons, including the simplicity of the image background and the high similarity of different views of the objects due to single-track recording sessions. What is missing is a challenging dataset for active object recognition with inherent similarities among different object categories. The dataset should also be large enough to train models with large numbers of parameters, such as deep convolutional neural networks. In the next section, we describe GERMS, a large and challenging dataset for active object recognition that we used for the experiments in this paper.
3. The GERMS Dataset
The GERMS dataset was collected in the context of the RUBI project, whose goal is to develop robots that interact with toddlers in early childhood education environments [13, 14, 15]. The dataset consists of 1365 video recordings of give-and-take trials using 136 different objects. The objects are soft toys depicting various human cell types, microbes and disease-related organisms. Figure 1 shows the entire set of these toys. Each video consists of the robot (RUBI) bringing the grasped object to its center of view, rotating it by 180° and then returning it. The dataset was recorded from RUBI's head-mounted camera at 30 frames per second.

The data for GERMS were collected in two days. On the first day, each object was handed to RUBI in one of 6 pre-determined poses, 3 to each arm, after which RUBI grabbed the object and captured images while rotating it. The robot also captured the positions of its joints for every captured image. On the second day, we asked a set of human subjects to hand the GERM objects to RUBI in poses they considered natural. A total of 12 subjects participated in test data collection, each subject handing between 10 and 17 objects to RUBI. For each object, at least 4 different test poses were captured. The background of the GERMS dataset was provided by a large screen TV displaying video scenes from the classroom in which RUBI operates, including toddlers and adults moving around.

Figure 1: The GERMS dataset. The objects represent human cell types, microbes and disease-related organisms.
Table 1: GERMS dataset statistics (mean ± std).

        Number of tracks   Images per track   Total number of images
Day 1   816                157 ± 12           76,722
Day 2   549                145 ± 19           51,561
We use half of the data collected on each of Day 1 and Day 2 for training and the other half for testing. More specifically, three random tracks out of the six tracks for each object in Day 1 and two randomly selected tracks for each object from Day 2 were used for training the network, and the rest were used for testing. Table 1 shows the statistics of the training and testing data for the experiments in this paper.
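For concreteness, this track-based split can be written as a short script. The sketch below assumes the dataset is available as a list of (object_id, day, track_id) records; the record format and function names are illustrative and not part of the GERMS distribution.

    import random

    def split_tracks(records, seed=0):
        """Split GERMS tracks into train/test following the paper's protocol:
        3 of the 6 Day-1 tracks and 2 of the Day-2 tracks per object are used
        for training; the remaining tracks are used for testing."""
        rng = random.Random(seed)
        tracks = {}
        for obj, day, track in records:
            tracks.setdefault((obj, day), set()).add(track)
        train, test = [], []
        for (obj, day), ids in sorted(tracks.items()):
            n_train = 3 if day == 1 else 2          # tracks kept for training
            chosen = set(rng.sample(sorted(ids), n_train))
            for t in sorted(ids):
                (train if t in chosen else test).append((obj, day, t))
        return train, test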
4. Proposed Network
The traditional view of an active object recognition pipeline usually treats the visual recognition and action learning problems separately, with visual features being fixed when learning actions. In this work, we try to solve both problems simultaneously to reduce the training time of an AOR model. By incorporating the errors from action prediction into visual feature extraction, we hope to acquire features that are suited for both label and action prediction.

The proposed network is shown in figure 2. The input image is first transformed into a set of beliefs over different object labels by a classification network. The belief is then combined with the previously observed belief vectors to produce an encoding of the state of the system. This is done by the mixture belief update layer in the network. The accumulated belief is then transformed into action-values, which are used to select the next input image.

Figure 2: The proposed network for active object recognition. The red arrows representing the target values indicate the layer at which the target values are used to train the network. The numbers represent the number of units in each layer of the network. See table 2 for more details.
We next detail each part of the network, describing the challenges and their corresponding solutions. We first address the transformation of images into beliefs over object classes. We then tackle belief accumulation over observed images, followed by action learning, and finally present the full description of the algorithm used to train this model.
The goal of this part of the network is to transform a single image into beliefs over different object labels. The feature extraction stage is comprised of 3 convolution layers followed by 3 fully connected layers. The dimensions of each layer are shown in figure 2, and the number of units and parameters of each layer are listed in table 2.

Table 2: Number of units and parameters for the proposed network.

Layer   Number of Units   Input to Unit   Num. Parameters
Conv1   64x30x30          3 × … × …       …
…

The training data is a set D = {I_i, y_i, P_i}, i = 1, ..., N, where I_i is the image captured by the robot camera, y_i ∈ {o_1, o_2, ..., o_C} is the object label, and P_i is a positive integer denoting the pose of the robot's gripper [15]. In order to learn the weights of the single-image classification part, we perform gradient descent on the action prediction and cross-entropy costs, denoted by C_RL and C_CL respectively. The cross-entropy classification cost C_CL is

    C_CL = − Σ_i Σ_{j=1}^{C} I(y_i = o_j) log B_ij    (1)

Here I is the indicator function for the class of the object and B_ij = P(o_j | I_i) is the predicted label belief for the i-th image corresponding to the j-th class. The next subsection describes the action prediction cost C_RL.
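As a concrete illustration of Eq. (1), the sketch below computes C_CL from a batch of predicted belief vectors; this is a minimal NumPy version and the variable names are ours, not the paper's.

    import numpy as np

    def classification_cost(beliefs, labels):
        """Cross-entropy cost of Eq. (1).
        beliefs: (N, C) array; row i is the belief vector B_i (non-negative, sums to 1).
        labels:  (N,) integer array with y_i in {0, ..., C-1}.
        """
        n = beliefs.shape[0]
        # The indicator I(y_i = o_j) keeps only the belief assigned to the true class.
        true_class_beliefs = beliefs[np.arange(n), labels]
        return -np.sum(np.log(true_class_beliefs + 1e-12))  # epsilon guards log(0)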
Active object recognition can be treated as a reinforcement learning problem, whose goal is to learn an optimal policy π*: S → A from states S to actions A. The optimal policy is expected to maximize the total reward for every interaction sequence s^π_{1...T} with the environment,

    s_1 →(π(s_1)) s_2 →(π(s_2)) s_3 →(π(s_3)) ... →(π(s_{T−1})) s_T

where s_i →(π(s_i)) s_{i+1} is the transition from s_i to s_{i+1} by performing the action a_i = π(s_i). The total reward for an interaction sequence s^π_{1...T} is TR(s^π_{1...T}) = Σ_{t=1}^{T} γ^t R(s_t), where R: S → R is a reward function and γ, 0 < γ < 1, is the discount factor.

We use the Q(λ) algorithm to train the network to predict actions for improved classification [16]. This is a model-free method that learns to predict the expected reward of actions in each state. More specifically, let Q^π(s, a) be the action value for state s and action a,

    Q^π(s, a) = E_π { TR(s^π_{1...T}) | s_1 = s, a_1 = a },

which is the expected reward for performing action a in state s. Let the agent interact with the environment to produce a set of interaction sequences {s^π_{1...T}}. Then Q(λ) learns a policy by applying the following update rule to every observed transition s_t →(π(s_t)) s_{t+1}:

    Q^π(s_t, a_t) ← (1 − α) Q^π(s_t, a_t) + α [ R(s_{t+1}) + γ max_a Q^π(s_{t+1}, a) ]    (2)

where 0 < α < 1 is the learning rate and a_t is selected using an epsilon-greedy version of the learned policy. We interpret this iterative update in the following way to be useful for training a neural network. Let the output layer of the network predict Q(s, a) for the learned policy π for every possible action a in s. Then a practical approximation of the optimal policy is obtained by minimizing the reinforcement learning cost

    C_RL = Σ_{(s_t, s_{t+1})} [ R(s_{t+1}) + γ max_a Q^π(s_{t+1}, a) − Q^π(s_t, a_t) ]^2    (3)

where the sum runs over the observed transitions in {s^π_{1...T}}. In the proposed network, action value prediction is done by transforming the state of the system s_t at time t through layers ReLU3, ReLU4 and LU2. We train the weights of the network in these layers by minimizing C_RL. In the next subsection, we go into the details of state encoding, and after that we describe the set of actions.
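Before moving on, to make the cost in Eq. (3) concrete, the following sketch computes the bootstrapped targets and the squared TD error for a batch of transitions. It covers only the one-step (λ = 0) case; the full Q(λ) algorithm additionally maintains eligibility traces over multi-step sequences. All names here are ours.

    import numpy as np

    def rl_cost(q_values, actions, rewards, next_q_values, gamma=0.9):
        """Squared TD error of Eq. (3) for a batch of N transitions.
        q_values:      (N, H) predicted Q(s_t, .) for each transition.
        actions:       (N,)   index of the action a_t actually taken.
        rewards:       (N,)   reward R(s_{t+1}).
        next_q_values: (N, H) predicted Q(s_{t+1}, .).
        gamma:         discount factor (an illustrative value, not the paper's).
        """
        n = q_values.shape[0]
        # Bootstrapped target: R(s_{t+1}) + gamma * max_a Q(s_{t+1}, a).
        targets = rewards + gamma * next_q_values.max(axis=1)
        predicted = q_values[np.arange(n), actions]
        return np.sum((targets - predicted) ** 2)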
State encoding has a prominent effect on the performance of an AOR system. Based on the current state of the system, an action is selected that is expected to decrease the ambiguity about the object label. An appealing choice is to transform images into beliefs over the different target classes and use them as the state of the system. Based on the target label beliefs, the system decides to perform an action to improve its target label prediction. What we expect from the AOR system is to guide the robot to pick object views that are more discriminative among target classes.

We first transform the input image I_i into a belief vector B_i = [B_ij], j = 1, ..., C, using the first 7 layers of the network, where B_ij ≥ 0 and Σ_{j=1}^{C} B_ij = 1. The produced label belief vector is then combined with the previously observed belief vectors from this interaction sequence to form the state of the system. The motivation for this encoding is that the combined belief encodes the ambiguity of the system about target classes and thus can be used to navigate to more discriminative views of objects.

Active object recognition methods usually adopt a Naive Bayes approach to combining beliefs from different observations. Assume that in an interaction sequence, a sequence of images I_t = {I_1, I_2, ..., I_t} has been observed and the corresponding beliefs B_t = {B_1, B_2, ..., B_t} have been calculated. The state of the system at time t is calculated using Naive Bayes belief combination, which is to take the product of the individual belief vectors and then normalize:

    s_t = P(O | I_t) = P(O, I_t) / P(I_t) ∝ P(O) ∏_{i=1}^{t} P(I_i | O) ∝ ∏_{i=1}^{t} P(O | I_i)    (4)

where O is the target label and P(O | I_i) is the vector of beliefs produced by single-image classification. Here we assumed a uniform prior over images and target labels. The problem with Naive Bayes is that if an image is observed repeatedly in I_t, the result changes with the number of repetitions. This is undesirable, since the state of the system changes under repeated observations of an image even though no new information is added to the system. If a specific image is good for classification, the system can visit that image more often to artificially increase its apparent performance. To avoid this problem, we adopt a generative model based on the Dirichlet distribution to combine different belief vectors.

We use a generative model similar to [18] to calculate the state of the system given a set of images. The intuition behind this model is that performing an action on an object will produce a distribution of belief vectors. We model the observed belief vectors given the object and action as a Dirichlet distribution, the parameters of which are learned from the data. The model is shown in figure 3. Here A is a discrete variable representing the action from the repertoire of actions {a_1, a_2, ..., a_H}, O ∈ {o_1, o_2, ..., o_C} represents the object label, and α ∈ R^C is the vector of parameters of the Dirichlet distribution from which the belief vector B ∈ R^C over target labels is drawn:

    P(B | α) = Dir(B; α) = [ Γ(Σ_{j=1}^{C} [α]_j) / ∏_{j=1}^{C} Γ([α]_j) ] ∏_{j=1}^{C} [B]_j^{[α]_j − 1}    (5)

The state of the system is calculated by computing the posterior probability of object-action beliefs under the model in figure 3. Let P_Oa(a_i, B_i) = P(O, a | a_i, B_i) denote the posterior probability of an object-action pair given the performed action and the observed belief vector. Assuming a uniform prior over objects and α, and a deterministic policy for choosing actions,

    P(O, a | B) = ∫_α P(O, a, B, α) dα / P(B) ∝ ∫_α P(O) P(a) P(α | O, a) P(B | α) dα ∝ ∫_{α_Oa} Dir(B; α_Oa) dα_Oa    (6)

The notation α_Oa makes clear that there is a separate α for each object-action pair. Instead of the full posterior probability, we use α̂_Oa, the maximum likelihood estimate of α, and replace the integral above by

    P(O, a | B) ≈ Dir(B | α̂_Oa)    (7)

For an interaction sequence with beliefs B_t and actions A_t = {a_1, a_2, ..., a_t}, the posterior probability of an object-action pair is

    P(O, a | A_t, B_t) = ∏_{i=1}^{t} P(O, a | B_i)^{I(a, a_i)}    (8)

where I(a, a_i) indicates whether the performed action a_i matches a. The state of the system is comprised of the vector of object posterior beliefs for every object and action, plus the features and belief extracted from the latest image I_t:

    s_t = { [P(O, a | A_t, B_t)], B_t },  O ∈ {o_1, ..., o_C},  a ∈ {a_1, ..., a_H}    (9)

Note that the object-action posterior component of s_t is a vector of length C × H.
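The sketch below illustrates Eqs. (5)-(9) and, anticipating Eq. (11) below, the maximum-likelihood gradient used to learn the α parameters. It is a minimal NumPy/SciPy version; accumulating log-posteriors rather than products is our numerical choice, and all names are illustrative.

    import numpy as np
    from scipy.special import gammaln, digamma

    def dirichlet_logpdf(b, alpha):
        """log Dir(b; alpha) of Eq. (5); b and alpha are length-C vectors."""
        return (gammaln(alpha.sum()) - gammaln(alpha).sum()
                + np.sum((alpha - 1.0) * np.log(b + 1e-12)))

    def state_vector(beliefs, actions, alphas):
        """State of Eqs. (8)-(9).
        beliefs: list of belief vectors B_1 .. B_t observed so far.
        actions: list of action indices a_1 .. a_t (a_i produced B_i).
        alphas:  (C, H, C) array; alphas[o, a] is alpha_Oa of Eq. (7).
        """
        C, H, _ = alphas.shape
        log_post = np.zeros((C, H))
        for b, a in zip(beliefs, actions):
            for o in range(C):
                # Eq. (8): only observations whose action matches a contribute.
                log_post[o, a] += dirichlet_logpdf(np.asarray(b), alphas[o, a])
        # Eq. (9): object-action posteriors concatenated with the latest belief.
        return np.concatenate([log_post.ravel(), np.asarray(beliefs[-1])])

    def alpha_gradient(alpha, observed_beliefs):
        """Gradient of the Dirichlet log-likelihood, Eq. (11), for one (O, a) pair."""
        B = np.asarray(observed_beliefs)      # (N, C) beliefs from action a on object O
        n = B.shape[0]
        return n * digamma(alpha.sum()) - n * digamma(alpha) + np.log(B + 1e-12).sum(axis=0)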
Our goal is to train the network jointly for action and label prediction. We achieve this by minimizing the total cost, which is the sum of the costs for label prediction (1) and action prediction (3). Note that the errors for action value prediction are backpropagated through the entire network, reaching the visual feature extraction units. The total cost function for action value and label prediction is
    Cost = C_RL + C_CL    (10)

The weights of the network in the visual feature extraction layers (Conv1, Conv2, Conv3, ReLU1, ReLU2, LU1) are trained using backpropagation on (10), while the action prediction layers (ReLU3, ReLU4 and LU2) are trained by gradient descent on the action prediction error (3).

To learn the parameters of the belief update, that is α_Oa, we use gradient descent on the likelihood of the data. The negative log-likelihood of the Dirichlet distribution is a convex function of its parameters, so it can be minimized using gradient descent. For a set of N beliefs B_1, ..., B_N observed by performing action a on object O, the gradient of the log-likelihood with respect to the parameters is

    ∂ log P(B_1, ..., B_N | α_Oa) / ∂[α_Oa]_k
        = Σ_{n=1}^{N} [ d/d[α_Oa]_k log Γ(Σ_{j=1}^{C} [α_Oa]_j) − d/d[α_Oa]_k log Γ([α_Oa]_k) + log [B_n]_k ]
        = N Ψ(Σ_{j=1}^{C} [α_Oa]_j) − N Ψ([α_Oa]_k) + Σ_{n=1}^{N} log [B_n]_k    (11)

where Ψ(x) = d/dx log Γ(x) is the digamma function. We use one unit per Dirichlet distribution Dir(· | α_Oa) in the belief update layer. These units receive the current belief and their output for the previously observed beliefs, and produce an updated belief. A schematic of the belief update layer of the network is shown in figure 3. Learning α_Oa is carried out simultaneously with the rest of the network weights in one training procedure.

Figure 3: Dirichlet belief update layer. Each unit in this layer represents a Dirichlet distribution for an object-action pair. The parameters of this layer are the vectors of Dirichlet parameters α_Oa for each unit.

Another component that has an important effect on the performance of our AOR system is the reward function, which maps the state of the system into rewards. A simple choice for the reward function is

    R(s_t) = +1 if argmax_i [B_t]_i = Target-Label(I_t), and −1 otherwise.

We refer to this as the correct-label reward function. A reward of +1 (−1) is given to the system if at time step t the action a_t brings the object to a pose for which the predicted label is correct (wrong). The intention behind this reward function is to drive the AOR system to pick actions that lead to the best next view of the object in terms of label prediction.

4.6. Action Coding

In order to be able to reach every position in the robot's gripper joint range, we use a set of relative rotations as the actions of the system. More specifically, we use 10 actions that rotate the gripper from its current position by one of five fixed offsets in either direction, {±δ_1, ..., ±δ_5}, each a fraction of π. The total range of rotation for each of the robot's grippers is ±π. The actions are selected to be fine-grained enough that the robot can reach any position with the minimum number of movements possible. This encoding is simple and flexible in the range of positions the robot can reach; however, we found that the policies can become stuck on a few actions without trying the rest. Encoding the states with the Dirichlet belief update helps alleviate this issue to some degree, but it does not completely remove the problem. We deal with this problem by forcing the algorithm to pick the next best action whenever the best action leads to an image that has already been seen.
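A minimal sketch of this duplicate-avoidance rule: among the predicted action values, take the highest-valued action whose resulting joint pose has not yet been visited in the current interaction sequence. The function and variable names are ours.

    import numpy as np

    def select_action(q_values, current_pose, offsets, visited_poses):
        """Pick the highest-valued action leading to an unvisited joint pose.
        q_values:      (H,) predicted action values for the current state.
        current_pose:  current gripper rotation.
        offsets:       (H,) relative rotation of each action.
        visited_poses: set of (rounded) poses already seen in this sequence.
        """
        for a in np.argsort(q_values)[::-1]:          # best actions first
            pose = round(current_pose + offsets[a], 6)
            if pose not in visited_poses:
                return int(a)
        return int(np.argmax(q_values))               # all poses visited: fall back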
5. Experimental Results
We trained the network by minimizing the costs of classification (1), action value prediction (3) and the negative log-likelihood of the Dirichlet distributions (11). We used backpropagation with minibatches of size 128 to train the network. For Q(λ), we used a small initial learning rate, and an ε-greedy policy in the training stage, with ε decreasing step-wise from 0.9 to 0.1.

Figure 4 shows the average negative log-likelihood of the data under the Dirichlet distributions during training of a DN model. The negative log-likelihood decreases quickly over the first 1000 iterations, after which the rate of change slows but does not stop.
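The step-wise ε schedule described above can be implemented as follows; only the endpoints (0.9 and 0.1) are given in the text, so the number of plateaus here is our assumption.

    def epsilon_schedule(iteration, total_iterations, start=0.9, end=0.1, steps=9):
        """Step-wise decay of epsilon from `start` to `end` over training.
        `steps` (the number of plateaus) is illustrative, not from the paper."""
        step = min(steps * iteration // total_iterations, steps - 1)
        return start - (start - end) * step / (steps - 1)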
Algorithm 1: Training the network for joint label and action prediction.

    procedure TRAIN
        for iteration = 1 to N do
            I_1, y ← NextImage(iteration)
            s_0 ← [ ]
            Actions ← RandomActions(NumActions)
            for t = 1 to NumMoves do
                s_t, predictedActions ← FeedForward(I_t, s_{t−1}, Actions)
                I_{t+1} ← NextImage(I_t, predictedActions)
                targetActionVals, ŷ ← LookAhead(I_{t+1}, s_t, Actions)
                if t = NumMoves then
                    targetActionVals ← targetActionVals + R(s_t)
                for W ∈ {ReLU3, ReLU4, LU2} do
                    W ← W − λ_W ∂/∂W {C_RL}
                for W ∈ {Conv1, Conv2, Conv3, ReLU1, ReLU2, Softmax} do
                    W ← W − λ_W ∂/∂W {C_RL + C_CL}
                for O ∈ {o_1, ..., o_C}, a ∈ {a_1, ..., a_H} do
                    α_Oa ← α_Oa + λ ∂/∂α_Oa log P(B_t | α_Oa)

In the first experiment, we compare the effectiveness of the Dirichlet and Naive Bayes state encodings in terms of label prediction accuracy. For Naive Bayes models (NB), the state of the system is updated using (4), while the size and configuration of the rest of the network remain the same. Dirichlet state encoding is implemented using (9); we refer to Dirichlet models as DR. For each encoding and for each arm, we train 10 different models and report the average test label prediction accuracy as a function of the number of observed images, comparing the Deep Active Object Recognition (DAOR) and random (Rnd) action selection policies. Figure 5 plots the performance of these models. The Dirichlet model is clearly superior to Naive Bayes in label prediction accuracy.

The first point to notice in figure 5 is the performance difference between the Naive Bayes and Dirichlet belief updates on single images (action 0). NB models achieve a performance below 35%, while Dirichlet achieves higher than 40%. One interpretation of this result is that the Naive Bayes models pick actions that bounce between a subset of training images, leading to underfitting of the model; in the next subsection, we provide some evidence for this interpretation. On the other hand, the performance of the DR-DAOR model tends to saturate after 3 actions, while DR-Rnd keeps improving with subsequent actions. This might be because DR-DAOR also bounces between subsets of images at test time. We can avoid such behavior by forcing the policies to pick actions that lead to joint poses that have not already been visited in the same interaction sequence.

Figure 4: Average negative log-likelihood of the data under the Dirichlet distributions. The decrease in negative log-likelihood indicates learning in the belief update layer.

Figure 5: Test label prediction accuracy as a function of the number of observed images for the left and right arms, for Naive Bayes (NB) and Dirichlet (DR) state encoding.
We train a set of models using Dirichlet state encoding while forcing the policy to pick non-duplicate joint poses in every action of an interaction sequence. This approach is easy to implement by keeping a history of visited joint poses during an interaction sequence and picking the action with the highest action value that does not lead to a visited joint position. We refer to this model as Dirichlet with non-repeated visits (DN). A comparison between DN and DR for the Rnd and DAOR policies (both forced to visit novel poses) is shown in figure 6.
Figure 6: Test label prediction accuracy as a function of the number of observed images for the left and right arms, for Dirichlet state encoding with repeated visits (DR) and non-repeated visits (DN).
A comparison between the models mentioned above is shown in table 3. We see that the best performing model is DN-DAOR, with the exception of action 1 for the right arm, for which DR-DAOR achieves the best performance. For both arms, the Dirichlet models perform significantly better than Naive Bayes, improving the model's performance on average by 10% for the right arm and 14% for the left arm.
Table 3: Test label prediction accuracy (%) of the different state encodings and action selection policies.

                Observed Frames
Policy          0     1     2     3     4     5

Right Arm
NB-Rnd          31.3  38.1  41.3  43.4  45.0  46.1
NB-DAOR         31.3  42.1  45.8  48.0  48.3  49.0
DR-Rnd          40.3  48.7  51.9  53.6  54.6  55.2
DR-DAOR         40.3  …     …     …     …     …
DN-Rnd          …     …     …     …     …     …
DN-DAOR         …     …     …     …     …     …

Left Arm
NB-Rnd          32.7  39.5  42.9  44.9  46.3  47.4
NB-DAOR         32.7  43.7  47.5  49.6  50.0  50.6
DR-Rnd          43.7  52.5  55.8  57.5  58.6  59.3
DR-DAOR         43.7  53.0  54.9  55.9  55.5  55.4
DN-Rnd          45.4  54.5  58.0  60.0  61.1  61.9
DN-DAOR         45.4  …     …     …     …     …
It may help us understand the weaknesses and strengths of the different models if we take a closer look at the learned policies. For this purpose, we visualize the consecutive actions in interaction sequences of length 5, as shown for training data in figure 7 and for test data in figure 8. Each plot represents actions in different rows, with the magnitude and orientation of the action depicted by the length and direction of the corresponding arrow on the left side. Each time step of the interaction sequence is shown as a numbered column. The colored lines in each plot connect one action in column i to another action in column i + 1.

Figure 7: Visualization of (left) NB and (right) DN models for training data. Each row represents an action and each column represents a move performed by the policy in an interaction sequence. The colors of the lines connecting two columns differ between consecutive time steps for clarity, while the thickness of a line indicates the frequency of that transition in the interaction sequences.
Figure 8 visualizes the learned policies at test time for NB-DAOR and DN-DAOR. We see on the left side that NB-DAOR only swings between the two large rotations in opposite directions, while DN-DAOR prefers to perform a few larger actions (thick purple and blue lines connecting columns 2, 3 and 4) followed by a few smaller actions in different directions. There is no back-and-forth between visited joint positions for DN-DAOR, which leads to better performance on the test set.
Figure 8: Visualization of (left) NB and (right) DN models for test data. The NB model prefers to repeat the same two actions, swinging between two joint poses at one end of the joint range. The DN model usually performs a few larger rotations on the object, followed by a few smaller rotations in different directions.
6. Conclusions
In this paper, we proposed a model for deep active object recognition based on convolutional neural networks. The model is trained by jointly minimizing the action and label prediction costs, so that the visual features in the early stages of the network are shaped by both objectives. The difference between the work presented here and deeply supervised networks [20] is that in the latter, training is carried out by minimizing the classification error in different layers, while in our approach we minimized the action learning cost along with the classification error.

We also adopted an alternative to the common Naive Bayes belief update rule for the state encoding of the system. Naive Bayes has the potential to overfit to subsets of training images, which can lead to lower accuracy at test time. We used a generative model based on the Dirichlet distribution to model the belief over target classes and the actions performed on them. This model was embedded into the network, which allowed training the network in one pass jointly for label and action prediction. The results of the experiments confirmed that the proposed Dirichlet model is superior in test label prediction to the Naive Bayes approach to state encoding.

A common trend we observed in the models trained in this paper was a strong preference for a few actions, which led to limited examination of the objects and thus lower label prediction performance. This preference was strongest in the Naive Bayes state encoding models. Employing the Dirichlet state encoding helped alleviate this problem, mainly for the training data and less so for the test data. We observed that the strong preference for a limited set of actions weakens during the training of the DR-DAOR model, and as a result the test label prediction accuracy improves. We hypothesize that, in addition to the state encoding, learning actions on training images for which label prediction accuracy is already high contributes to this strong preference. In training our models, the training accuracy reaches above 90% after 1000 iterations. This may cause Q(λ) to reward every action, which may finally lead to one action taking over and always producing the highest action value.
7. Acknowledgments
The research presented here was funded by NSF IIS 0968573 SoCS, IIS INT2-Large 0808767, and NSF SBE-0542013, and in part by US NSF ACI-1541349 and OCI-1246396, the University of California Office of the President, and the California Institute for Telecommunications and Information Technology (Calit2).
8. References