Grounded Language Understanding for Manipulation Instructions Using GAN-Based Classification
Komei Sugiura and Hisashi Kawai
National Institute of Information and Communications Technology, Japan
ABSTRACT
The target task of this study is grounded language understanding for domestic service robots (DSRs). In particular, we focus on instruction understanding for short sentences where verbs are missing. This task is of critical importance to build communicative DSRs because manipulation is essential for DSRs. Existing instruction understanding methods usually estimate missing information only from non-grounded knowledge; therefore, whether the predicted action is physically executable or not is unclear. In this paper, we present a grounded instruction understanding method to estimate appropriate objects given an instruction and situation. We extend the Generative Adversarial Nets (GAN) and build a GAN-based classifier using latent representations. To quantitatively evaluate the proposed method, we developed a data set based on the standard data set used for Visual QA. Experimental results show that the proposed method gives better results than baseline methods.
Index Terms — grounded language understanding, human-robot communication, domestic service robots
1. INTRODUCTION
Based on increasing demands to improve the quality of life of those who need support, many DSRs are being developed [1]. The target task of this study is grounded language understanding for DSRs. In particular, we focus on instruction understanding for short sentences where verbs are missing. This task is of critical importance to build communicative DSRs because manipulation is essential for DSRs. An example situation where a user asks a DSR to fetch a bottle is shown in the left-hand figure of Fig. 1. The right-hand figure of Fig. 1 shows a standard DSR platform.

Existing instruction understanding methods usually estimate missing information only from non-grounded knowledge; therefore, whether the predicted action is physically executable or not is unclear. Moreover, the interaction sometimes takes more than one minute until the robot starts to execute the task in a typical setting. Such interactions are inconvenient for the user.

In this paper, we present a grounded instruction understanding method to estimate appropriate objects given an instruction and situation. We extend the GAN [2] and build a GAN-based classifier using latent representations. Unlike other methods, the user does not need to directly specify which action to take in the utterance, because it is estimated.

There have been many studies on variations of GANs (e.g., [3, 4]). Recently, some studies applied GANs to classification tasks [5, 6]. Our study is inspired by these methods; however, the difference is that our method has an Extractor. The Extractor's task is to convert raw input to latent representations that are more informative for classification. The Generator's task is data augmentation to improve generalization ability.

The following are our key contributions:

• We propose a novel GAN-based classifier called LAtent Classifier Generative Adversarial Nets (LAC-GAN). LAC-GAN is composed of three main components: Extractor E, Generator G, and Discriminator D. The method is explained in Section 5.
• LAC-GAN is applied to manipulation instruction understanding. To quantitatively evaluate the proposed method, we developed a data set based on the standard data set used for Visual QA [7]. The results are shown in Section 7.

This work was partially supported by JST CREST and JSPS KAKENHI Grant Number 15K16074. The authors thank Dr. Peng Shen for his suggestions.
Fig. 1. Left: Sample input used in the experiment. The verb is missing in the instruction. Right: Standard domestic service robot platform [8].

2. RELATED WORK

The classic dialogue management mechanisms adopted for DSRs process linguistic and non-linguistic information separately. More recently, the robotics community has started to pay more attention to the mapping between language and real-world information, mainly focusing on motion [9–11]. Kollar et al. proposed a path planning method from natural language commands [12]. However, most of the SLU methods used for DSRs are still rule-based [13]. In the dialogue community, Lison et al. presented a model for priming speech recognition using visual and contextual information [14].

We developed LCore [15], which is a multimodal SLU method. In LCore, the SLU was integrated with image, motion prediction, and object relationships. However, the grammar and vocabulary were limited and the situation was artificial. We also developed Rospeex (http://rospeex.org), a multilingual spoken dialogue toolkit for robots [16]. Rospeex has been used by 45,000 unique users. In recent years, there have been many studies on image-based caption generation and visual QA [7, 17]. Our study has a deep relationship with these studies; however, the difference is that our focus is not on caption generation.

There have been many studies on variations of GANs (e.g., [2–4]). A GAN is usually composed of two main components: Generator G and Discriminator D. AC-GAN [18] uses category labels as well as the estimated source (real or fake) as the Discriminator's output. Most GAN-related studies focus on generating pseudo samples, e.g., images and text. Recently, some studies applied GANs to classification tasks [5, 6]. Our study is inspired by these methods; however, the difference is that LAC-GAN has an Extractor.
3. TASK DEFINITION
In this paper, we focus on grounded language understanding for manipulation instructions. This problem is of critical importance to build communicative DSRs because manipulation is essential for DSRs. In particular, we focus on instruction understanding for short sentences missing verbs.

Specifically, the following is a typical use case considered in this study:

U: "Robot, bottle please."
R: "Please select from the list (GUI shows a list of manipulable bottles)."

The task here is to estimate whether candidate objects are likely to be manipulable. This is challenging because the physical situation should be modeled to understand the instruction.

On the other hand, most DSRs use non-grounded language understanding to solve such a task. Missing information is estimated only from linguistic knowledge; therefore, whether the predicted action is physically executable or not is unclear. Moreover, the interaction sometimes takes more than one minute until the robot starts to execute the task in a typical setting [1]. Such interactions are inconvenient for the user.

Here, we define the terminology used in this study as follows:

• Manipulation instruction understanding is defined as classifying the target object (trajector) as manipulable or not, given the situation.
• The situation is defined as a set of sentences explaining a (camera) image.
• A trajector is the target object that is focused on in the scene [19].
• The trajector is manipulable if a typical DSR is technically capable of manipulating it given the situation.

The key evaluation metric in this study is the classification accuracy.
4. GENERATIVE ADVERSARIAL NETS
A GAN [2] is composed of two main components: Generator G and Discriminator D. The input to G is a $d_z$-dimensional random variable, $z$. The output from G is $x_{\mathrm{fake}}$, defined as follows:

$x_{\mathrm{fake}} = G(z)$.   (1)

The input source of D is denoted as $S$, which is selected from the set $\{real, fake\}$. When $S = real$, a real training sample denoted as $x_{\mathrm{real}}$ is input to D. D's task is to discriminate the source, $S \in \{real, fake\}$, given $x \in \{x_{\mathrm{real}}, x_{\mathrm{fake}}\}$. On the other hand, G's task is to fool D. D outputs the likelihood of $S$ being real given $x$ as follows:

$D(x) = p(S = real \mid x)$.   (2)

The following cost functions are used to optimize the GAN's network parameters:

$J^{(D)} = -\mathbb{E}_{x_{\mathrm{real}}} \log D(x_{\mathrm{real}}) - \mathbb{E}_{z} \log(1 - D(G(z)))$,
$J^{(G)} = -J^{(D)}$,

where $J^{(D)}$ and $J^{(G)}$ denote the cost functions of D and G, respectively. In the training process, the training of D and G is conducted alternately. First, D's parameters are trained, and then G's parameters are trained. D's parameters are fixed while G's parameters are trained.
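To make the alternating optimization concrete, the following is a minimal PyTorch sketch of one training step. It is an illustration, not the paper's implementation: the layer sizes and optimizer settings are placeholders, and the widely used non-saturating form is substituted for the literal $J^{(G)} = -J^{(D)}$.

```python
# Minimal sketch of alternating GAN training (assumed shapes, not the paper's).
import torch
import torch.nn as nn

d_z, d_x = 100, 50  # placeholder dimensions
G = nn.Sequential(nn.Linear(d_z, 128), nn.ReLU(), nn.Linear(128, d_x), nn.Tanh())
D = nn.Sequential(nn.Linear(d_x, 128), nn.ReLU(), nn.Linear(128, 1))  # logit of p(S=real|x)

opt_d = torch.optim.Adam(D.parameters())
opt_g = torch.optim.Adam(G.parameters())
bce = nn.BCEWithLogitsLoss()

def train_step(x_real):
    b = x_real.size(0)
    # Step 1: update D with G fixed (minimize J(D)); detach() keeps G's graph out.
    x_fake = G(torch.randn(b, d_z)).detach()
    loss_d = bce(D(x_real), torch.ones(b, 1)) + bce(D(x_fake), torch.zeros(b, 1))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()
    # Step 2: update G with D's parameters fixed (only opt_g steps).
    # Non-saturating stand-in for J(G) = -J(D): push D(G(z)) toward "real".
    loss_g = bce(D(G(torch.randn(b, d_z))), torch.ones(b, 1))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_d.item(), loss_g.item()
```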
5. LATENT CLASSIFIER GAN

5.1. Model Structure

Fig. 2. Model structure of LAC-GAN. The numbers on the layers represent the node numbers.
We extend the GAN for classification and propose a novel method called LAtent Classifier Generative Adversarial Networks (LAC-GAN). Our approach is inspired by the fact that G does not have to generate raw representations, e.g., images or text, in classification tasks. Instead, in our approach, G is used for data augmentation and is asked to generate latent representations of the data. LAC-GAN's model structure is shown in Fig. 2. LAC-GAN is composed of three main components: Extractor E, Generator G, and Discriminator D.

Suppose we obtain a training sample $(x_{\mathrm{raw}}, y)$, where $x_{\mathrm{raw}} \in \mathbb{R}^{d_{\mathrm{raw}}}$ and $y$ denote the raw features and the label, respectively. We assume that $y$ is a categorical variable, which is a $d_y$-dimensional binary vector.

Unlike other studies where GANs are used for sample generation, our focus is on GAN-based classification. From this background, it is reasonable to convert D's input to features that are more informative in terms of classification. Such features are extracted by E from $x_{\mathrm{raw}}$.

The input to E is $x_{\mathrm{raw}}$, and the output from E is $p_E(y)$, which is the likelihood of $y$ given $x_{\mathrm{raw}}$. In the optimization process of E, we simply minimize the following cross-entropy-based cost function:

$J_C = -\sum_j y_j \log p_E(y_j)$,   (3)

where $y_j$ denotes the label for the $j$-th category. We designed E's structure as a bottleneck network, in which the output of the bottleneck layer is extracted as a $d_{\mathrm{real}}$-dimensional vector, $x_{\mathrm{real}}$.

The input to G is a category denoted as $c$ and a $d_z$-dimensional random vector denoted as $z$. In each mini-batch, new random samples are drawn as $c$ and $z$ from a categorical distribution and a continuous distribution, respectively. The standard normal distribution was used as the continuous distribution: $z \sim \mathcal{N}(0, I)$. The output from G is denoted as $x_{\mathrm{fake}}$, which is a $d_{\mathrm{fake}}$-dimensional continuous vector. In LAC-GAN, G's role is data augmentation; therefore, G is expected to generate $x_{\mathrm{real}}$-like samples to improve generalization ability.

The OR gate in Fig. 2 shows that either $x_{\mathrm{real}}$ or $x_{\mathrm{fake}}$ is input to D. D's task is two-fold. The first is to discriminate the source, $S \in \{real, fake\}$, given $x \in \{x_{\mathrm{real}}, x_{\mathrm{fake}}\}$. The other is to categorize the input. Therefore, D has two types of output: the likelihood of $S$ given $x$, $p_D(S)$, and the likelihood of $y$ given $x_{\mathrm{real}}$, $p_D(y)$.

Similar to other GAN-based classification models [6, 18], we define separate cost functions for $p_D(S)$ and $p_D(y)$. The former is defined as follows:

$J_S = -\mathbb{E}_{x_{\mathrm{real}}} \log D(x_{\mathrm{real}}) - \mathbb{E}_{z,c} \log(1 - D(G(z, c)))$,   (4)

where $G(z, c)$ is the output of G given $z$ and $c$ [3]. The same cost function as Equation (3) is used for the latter, where $p_E$ is rewritten as $p_D$. The total cost function for D is defined as the weighted sum of the two, with weight parameter $\lambda$. Thus, the cost functions for LAC-GAN are defined as follows:

$J^{(E)}_{\mathrm{lacgan}} = J_C$,   (5)
$J^{(D)}_{\mathrm{lacgan}} = J_S + \lambda J_C$,   (6)
$J^{(G)}_{\mathrm{lacgan}} = -J_S$,   (7)

where $J^{(E)}_{\mathrm{lacgan}}$, $J^{(D)}_{\mathrm{lacgan}}$, and $J^{(G)}_{\mathrm{lacgan}}$ denote the cost functions of E, D, and G, respectively.
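The sketch below shows one hypothetical way to compute the three cost functions (5)–(7) in PyTorch. The layer sizes follow Table 2 in Section 7, but everything else is an assumption: batch normalization and dropout are omitted for brevity, the value of λ is a placeholder, and the non-saturating form is again substituted for $J^{(G)}_{\mathrm{lacgan}} = -J_S$.

```python
# Illustrative LAC-GAN losses; layer sizes follow Table 2, the rest is assumed.
import torch
import torch.nn as nn

d_raw, d_real, d_z, d_y = 400, 50, 100, 4
lam = 1.0  # weight lambda in Eq. (6); placeholder value

class Extractor(nn.Module):
    """Bottleneck network 400-400-100-50-100-4; the 50-dim bottleneck is x_real."""
    def __init__(self):
        super().__init__()
        self.to_bottleneck = nn.Sequential(
            nn.Linear(d_raw, 400), nn.ReLU(),
            nn.Linear(400, 100), nn.ReLU(),
            nn.Linear(100, d_real))
        self.to_label = nn.Sequential(
            nn.ReLU(), nn.Linear(d_real, 100), nn.ReLU(), nn.Linear(100, d_y))
    def forward(self, x_raw):
        x_real = self.to_bottleneck(x_raw)   # latent features fed to D as "real"
        return self.to_label(x_raw if False else x_real), x_real

E = Extractor()
G = nn.Sequential(nn.Linear(d_z + d_y, 100), nn.ReLU(),   # 104-100-100-50
                  nn.Linear(100, 100), nn.ReLU(),
                  nn.Linear(100, d_real), nn.Tanh())
D = nn.Sequential(nn.Linear(d_real, 100), nn.ReLU(),      # 50-100-100-5
                  nn.Linear(100, 100), nn.ReLU(),
                  nn.Linear(100, 1 + d_y))                 # [source logit | class logits]

xent, bce = nn.CrossEntropyLoss(), nn.BCEWithLogitsLoss()

def lacgan_losses(x_raw, y):                # y: class indices of shape (b,)
    b = x_raw.size(0)
    class_logits, x_real = E(x_raw)
    j_e = xent(class_logits, y)                                  # Eq. (5) = J_C
    z = torch.randn(b, d_z)                                      # z ~ N(0, I)
    c = torch.eye(d_y)[torch.randint(d_y, (b,))]                 # categorical c as one-hot
    x_fake = G(torch.cat([z, c], dim=1))
    out_real, out_fake = D(x_real.detach()), D(x_fake.detach())
    j_s = (bce(out_real[:, :1], torch.ones(b, 1)) +
           bce(out_fake[:, :1], torch.zeros(b, 1)))              # Eq. (4)
    j_d = j_s + lam * xent(out_real[:, 1:], y)                   # Eq. (6)
    j_g = bce(D(x_fake)[:, :1], torch.ones(b, 1))                # stand-in for Eq. (7)
    return j_e, j_d, j_g
```

As in Section 4, the three losses would be minimized alternately, updating only the corresponding component's parameters at each step.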
We use batch normalization (BN [20]) to regularize the parameters of a layer. BN reduces internal covariate shift and stabilizes training by converting the input to mean 0 and variance 1 within each mini-batch. Since BN acts as a standardization method, dropout does not have to be used where BN is applied. We do not apply BN in the first layer of D, which is standard in GAN-based approaches; we use dropout instead for that layer.

BN is usually applied after the addition, which is called post-activation. In pre-activation (PA), BN is applied before the addition. PA outperformed post-activation in the CIFAR-10 task when the network is very deep [21]. As explained in Section 7, $x_{\mathrm{raw}}$ is represented as a paragraph vector [22], which is not standardized. We apply PA to standardize the data within each mini-batch.

We use ReLU, softmax, and tanh as activation functions. Since the outputs of E and D are categorical variables, we use softmax in their final layers. The output of G consists of positive/negative continuous values; therefore, we use tanh in G's final layer. We use ReLU for the other layers. Leaky ReLU has been reported to show better performance than ReLU; however, we did not obtain statistically significant differences in pilot experiments on our task.
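As a small illustration of the ordering difference, the following contrasts a post-activation block with one reading of the PA ordering described above, in which the BN placed at the front of a fully connected block standardizes the unstandardized paragraph-vector input within each mini-batch. The sizes are placeholders.

```python
# Post-activation vs. pre-activation ordering for one fully connected block.
# Sizes are placeholders; this is one possible reading of the PA scheme above.
import torch.nn as nn

post_act = nn.Sequential(nn.Linear(400, 100), nn.BatchNorm1d(100), nn.ReLU())
pre_act = nn.Sequential(nn.BatchNorm1d(400), nn.Linear(400, 100), nn.ReLU())
```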
6. DEVELOPING OBJECT MANIPULATION MULTIMODAL DATA SET
As far as we know, no standard data set exists for object-manipulation instruction understanding. Therefore, we first explain the data set developed for this study. To avoid creating a data set that is too artificial, we extracted a subset from the standard Visual Genome dataset [17].

The Visual Genome dataset contains over 100k images, where each image has an average of 21 objects. The bounding boxes of the objects are given by human annotators, and they are canonicalized to WordNet synsets [23]. Each image also contains regions with descriptions given by human annotators. Unlike other datasets such as MS-COCO [24], the Visual Genome dataset contains more bounding boxes and descriptions per image. This is suitable for our problem setting because a rich representation is available for a situation. Another advantage is that the data set contains a wide variety of images; therefore, we can empirically validate classification methods in various situations.

Next, we selected target synsets that were likely to be used in DSR use cases. In this paper, we extracted a bounding box as a sample if its label is one of the following synsets:

• apple, ball, bottle, can, cellular telephone, cup, glass, paper, remote control, shoe, or teddy (bear),

where the ".n.01" suffix is not written for readability. These synsets were randomly selected from the synsets that are often used in object manipulation tasks by DSRs. The images were filtered by the above synsets, and they were also filtered by the minimum height and width of the objects. In this study, both are set to 50 pixels. The images were randomly extracted from the Visual Genome data set, and the number of images was balanced among the above synsets.

Fig. 3. Two representative samples for the synset "bottle.n.01". Each yellow box shows the bounding box of the trajector. The left sample is labeled as "positive" because the trajector is manipulable under the situation. However, the right sample is labeled as "negative" because the trajector is already grasped and the robot cannot manipulate it.

Next, the samples were labeled using the following criteria:

(E1) The bounding box contains multiple objects of the same kind, e.g., several shoes in a basket.
(E2) The bounding box does not contain the necessary information about the object, e.g., the handle of a glass.
(N) The bounding box sufficiently contains the trajector; however, the trajector is not suitable for grasping. For example, a meatball is categorized as part of the synset "ball.n.01"; however, the robot should not grasp it.
(M0) The object is not manipulable in that situation. In other words, path planning for manipulation fails. This category includes cases where the trajector is surrounded by many obstacles, is held by a human, or is moving.
(M1) The object is manipulable; however, autonomous grasping could fail in the situation. If the robot is remotely controlled, the object can be safely grasped.
(M2) The object is manipulable, and the robot can autonomously manipulate the object in that situation.
(O) None of the above.

Examples of the images are shown in Fig. 3. The annotator was one of the authors and a robotics expert. To label each sample exclusively, the criteria were checked in the same order as the above list. For example, if a sample was labeled as (E1), it was never labeled as (M0).

In the experiments, the samples categorized as (N), (M0), (M1), and (M2) were used; therefore, the task is a four-class classification problem.
Categories (E1) and (E2) were not used because it is unlikely that sufficient situation information would be available. The data samples were shuffled and divided into the training (80%), validation (10%), and test (10%) sets. The statistics of the original and labeled data set are shown in Table 1. Hereafter, we call the labeled data set the "Object Manipulation Multimodal Data Set".
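The extraction rules above (target synsets plus the 50-pixel minimum size) can be summarized as a short sketch. The suffixed synset keys and the field layout follow the structure of the Visual Genome object annotations, but both are assumptions here; `keep_box` is a hypothetical helper.

```python
# Sketch of the bounding-box filter; synset keys and fields are assumptions.
TARGET_SYNSETS = {
    "apple.n.01", "ball.n.01", "bottle.n.01", "can.n.01",
    "cellular_telephone.n.01", "cup.n.01", "glass.n.01", "paper.n.01",
    "remote_control.n.01", "shoe.n.01", "teddy.n.01",
}
MIN_SIZE = 50  # minimum bounding-box height and width in pixels

def keep_box(obj):
    """obj: one entry of a Visual Genome per-image object list,
    e.g. {"synsets": ["bottle.n.01"], "w": 120, "h": 200, ...}."""
    return (bool(set(obj.get("synsets", [])) & TARGET_SYNSETS)
            and obj["w"] >= MIN_SIZE and obj["h"] >= MIN_SIZE)
```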
7. EXPERIMENTS

7.1. Setup
In the experiments, we assumed that the input was given as the linguistic expressions of an instruction and a situation. Here, the instruction did not contain a verb, but contained the trajector's ID. The linguistic expressions were obtained from the Object Manipulation Multimodal Data Set.

The input to LAC-GAN is given as follows:

$x_{\mathrm{raw}} = \{x_{\mathrm{name}}, x_{\mathrm{situation}}\}$,

where $x_{\mathrm{name}}$ and $x_{\mathrm{situation}}$ denote the embedded representations of the trajector's name and the situation, respectively. We used the distributed memory model of paragraph vectors (PV-DM [22]) to obtain the embedded representations. First, the name expression was obtained based on the trajector's ID. Most of the name expressions contained only one candidate consisting of a noun; however, some expressions contained multiple candidates consisting of multiple words, e.g., "cups in stack | stacked cups." Regardless of the number of words, the name expressions were converted to a 200-dimensional paragraph vector, $x_{\mathrm{name}}$, by averaging the paragraph vectors of the candidates. The situation was composed of multiple descriptions of other objects in the scene. Those descriptions were converted to a 200-dimensional vector, $x_{\mathrm{situation}}$.

To train the PV-DM, we extracted descriptions from the Visual Genome data set and built a corpus. The corpus consisted of 4.72 million sentences, where the average length of a sentence was 5.18 words.

Table 2 shows the experimental setup. The random variable $z$ was sampled from $\mathcal{N}(0, I)$, where $d_z$ was set to $d_z = 100$; preliminary experimental results showed that, among all hyper-parameters, the effect of $d_z$ was small. The dimensions $d_{\mathrm{raw}}$, $d_{\mathrm{real}}$, $d_{\mathrm{fake}}$, and $d_y$ were set to $d_{\mathrm{raw}} = 400$, $d_{\mathrm{real}} = d_{\mathrm{fake}} = 50$, and $d_y = 4$, respectively.

Table 1. Statistics of the Object Manipulation Multimodal Data Set. The abbreviated categories are defined in Section 6.
Data set size (all categories): 896
Number of unique words describing situations: 7926
Average number of words describing situations: 305
Training-set size (N, M0, M1, M2): 539 (80%)
Validation-set size (N, M0, M1, M2): 67 (10%)
Test-set size (N, M0, M1, M2): 67 (10%)
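The feature extraction described in the setup could be sketched with gensim's Doc2Vec, whose dm=1 mode is PV-DM. The corpus below is placeholder data standing in for the 4.72 million extracted descriptions, and the tokenization and training hyper-parameters other than vector_size=200 and dm=1 are assumptions.

```python
# Sketch of building x_raw from PV-DM embeddings (gensim); data is placeholder.
import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

sentences = ["a green bottle on the table", "a man holding a bottle"]  # stand-in corpus
corpus = [TaggedDocument(words=s.split(), tags=[i]) for i, s in enumerate(sentences)]
pv_dm = Doc2Vec(corpus, vector_size=200, dm=1, min_count=1, epochs=20)

def embed(candidates):
    """Average the inferred 200-dim paragraph vectors of candidate strings."""
    return np.mean([pv_dm.infer_vector(c.split()) for c in candidates], axis=0)

x_name = embed(["cups in stack", "stacked cups"])   # name candidates, averaged
x_situation = embed(["a man holding a bottle"])     # scene descriptions
x_raw = np.concatenate([x_name, x_situation])       # 400-dim input to LAC-GAN
```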
Table 2. Experimental setup. Extractor, Generator, and Discriminator are denoted as E, G, and D, respectively.

Optimization method: Adam (learning rate = 0. , β1 = 0. , β2 = 0. )
d_raw: Name (200) + Situation (200)
Num. nodes (E): 400 (input), 400, 100, 50, 100, 4 (output)
Num. nodes (G): 104 (input), 100, 100, 50 (output)
Num. nodes (D): 50 (input), 100, 100, 5 (output)
Batch size: 50 (E), 20 (G and D)
Weight λ:

Table 3. Test-set accuracy obtained from the best models of each method. The best model is defined as the model that gives the highest validation-set accuracy.
Method | Test-set accuracy
Baseline (AC-GAN [18], without PA) | 50.7%
Baseline (AC-GAN, with PA) | 58.2%
Extractor only | 61.1%
Ours (LAC-GAN) | 67.1%
7.2. Results

We compared our method (LAC-GAN) with baseline methods, including AC-GAN [18], using the Object Manipulation Multimodal Data Set. In general, when training deep networks, the accuracy does not monotonically increase with the number of epochs. Due to the cost of cross-validation for deep networks, the best model in the standard experimental protocol is taken to be the model that gives the highest validation-set accuracy. Following this protocol, the test-set accuracy obtained by each best model was compared. The results are shown in Table 3.

To make the comparison fair, the proposed and baseline methods were given the same structure and the same number of nodes, except for the input layer. "With/without PA" represents whether pre-activation was used. "Extractor only" represents the test-set accuracy based on the output of the Extractor, $p_E(y)$; that is, the test-set accuracy obtained by a simple six-layered feed-forward network.

Table 3 shows that LAC-GAN outperformed the baseline methods, including AC-GAN and "Extractor only". The comparison between LAC-GAN and AC-GAN indicates that better performance can be obtained by extracting informative features. The comparison between LAC-GAN and "Extractor only" indicates that GAN-based data generation can improve the test-set accuracy, i.e., that generalization ability is enhanced by LAC-GAN.
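The model-selection protocol above can be sketched as follows: track the validation-set accuracy across epochs, keep the best checkpoint, and report its test-set accuracy. `train_one_epoch`, `evaluate`, and the data-set variables are hypothetical placeholders.

```python
# Sketch of the best-model selection protocol; helper functions are assumed.
import copy

best_val_acc, best_state = -1.0, None
for epoch in range(num_epochs):
    train_one_epoch(model)                       # placeholder training routine
    val_acc = evaluate(model, validation_set)    # placeholder accuracy function
    if val_acc > best_val_acc:                   # accuracy is not monotone in epochs
        best_val_acc = val_acc
        best_state = copy.deepcopy(model.state_dict())
model.load_state_dict(best_state)                # restore the best checkpoint
print("test-set accuracy:", evaluate(model, test_set))
```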
8. CONCLUSION
Based on increasing demands to improve the quality of life of those who need support, many DSRs are being developed. Although there are still many tasks that DSRs cannot do, they have advantages over human support staff and service dogs: a human carer cannot work without rest, and training a service dog requires nearly two years.

In this paper, we presented a grounded language understanding method to estimate manipulability from short instructions. We extended the GAN [2] to build LAC-GAN, a GAN-based classifier using latent representations. To quantitatively evaluate LAC-GAN, we developed the Object Manipulation Multimodal Data Set. Linguistic expressions on the trajector and situation are extracted from the data set and converted into paragraph vectors by PV-DM [22]. The manipulability is predicted from the paragraph vectors by LAC-GAN. We experimentally validated LAC-GAN and found that it gives better results than baseline methods, including AC-GAN [18]. Future directions include integrating the proposed method with object detection and caption generation.
9. REFERENCES

[1] Luca Iocchi, Dirk Holz, Javier Ruiz-del Solar, Komei Sugiura, and Tijn van der Zant, "RoboCup@Home: Analysis and Results of Evolving Competitions for Domestic and Service Robots," Artificial Intelligence, vol. 229, pp. 258–281, 2015.
[2] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio, "Generative Adversarial Nets," in Advances in Neural Information Processing Systems, 2014, pp. 2672–2680.
[3] Mehdi Mirza and Simon Osindero, "Conditional Generative Adversarial Nets," arXiv preprint arXiv:1411.1784, 2014.
[4] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel, "InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets," in Advances in Neural Information Processing Systems, 2016, pp. 2172–2180.
[5] Jost Tobias Springenberg, "Unsupervised and Semi-Supervised Learning with Categorical Generative Adversarial Networks," arXiv preprint arXiv:1511.06390, 2015.
[6] Peng Shen, Xugang Lu, Sheng Li, and Hisashi Kawai, "Conditional Generative Adversarial Nets Classifier for Spoken Language Identification," in Proc. of Interspeech, 2017.
[7] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan, "Show and Tell: A Neural Image Caption Generator," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3156–3164.
[8] Kunimatsu Hashimoto, Fuminori Saito, Takashi Yamamoto, and Koichi Ikeda, "A Field Study of the Human Support Robot in the Home Environment," in Advanced Robotics and its Social Impacts (ARSO), 2013 IEEE Workshop on, 2013, pp. 143–150.
[9] Volker Krüger, Danica Kragic, Ales Ude, and Christopher Geib, "The Meaning of Action: A Review on Action Recognition and Mapping," Advanced Robotics, vol. 21, no. 13, pp. 1473–1501, 2007.
[10] Yuuya Sugita and Jun Tani, "Learning Semantic Combinatoriality from the Interaction between Linguistic and Behavioral Processes," Adaptive Behavior, vol. 13, no. 1, pp. 33–52, 2005.
[11] Tetsunari Inamura, Iwaki Toshima, Hiroaki Tanie, and Yoshihiko Nakamura, "Embodied Symbol Emergence Based on Mimesis Theory," International Journal of Robotics Research, vol. 23, no. 4, pp. 363–377, 2004.
[12] T. Kollar, S. Tellex, D. Roy, and N. Roy, "Toward Understanding Natural Language Directions," in Proceedings of the 5th ACM/IEEE International Conference on Human-Robot Interaction, 2010, pp. 259–266.
[13] Andrea Vanzo, Danilo Croce, Emanuele Bastianelli, Roberto Basili, and Daniele Nardi, "Robust Spoken Language Understanding for House Service Robots," in Proceedings of the 17th International Conference on Intelligent Text Processing and Computational Linguistics, 2016, pp. 3–9.
[14] P. Lison and G.J.M. Kruijff, "Salience-driven Contextual Priming of Speech Recognition for Human-Robot Interaction," in Proceedings of the 18th European Conference on Artificial Intelligence, 2008.
[15] Komei Sugiura, Naoto Iwahashi, Hideki Kashioka, and Satoshi Nakamura, "Bayesian Learning of Confidence Measure Function for Generation of Utterances and Motions in Object Manipulation Dialogue Task," in Proceedings of Interspeech, 2009, pp. 2483–2486.
[16] Komei Sugiura and Koji Zettsu, "Rospeex: A Cloud Robotics Platform for Human-Robot Spoken Dialogues," in Proc. of IEEE/RSJ IROS, 2015, pp. 6155–6160.
[17] Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, et al., "Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations," arXiv preprint arXiv:1602.07332, 2016.
[18] Augustus Odena, Christopher Olah, and Jonathon Shlens, "Conditional Image Synthesis with Auxiliary Classifier GANs," arXiv preprint arXiv:1610.09585, 2016.
[19] Ronald W. Langacker, Foundations of Cognitive Grammar: Theoretical Prerequisites, Stanford University Press, 1987.
[20] Sergey Ioffe and Christian Szegedy, "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift," in Proc. of ICML, 2015, pp. 448–456.
[21] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, "Identity Mappings in Deep Residual Networks," in Proc. of European Conference on Computer Vision, 2016, pp. 630–645.
[22] Quoc Le and Tomas Mikolov, "Distributed Representations of Sentences and Documents," in Proc. of ICML, 2014, pp. 1188–1196.
[23] George A. Miller et al., "WordNet: A Lexical Database for English," Communications of the ACM, vol. 38, no. 11, pp. 39–41, 1995.
[24] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick, "Microsoft COCO: Common Objects in Context," in Proc. of European Conference on Computer Vision, 2014, pp. 740–755.