Zero-Shot Visual Slot Filling as Question Answering
Larry Heck, Simon Heck
Viv Labs; Department of Computer Science, Pomona College
[email protected], [email protected]
Abstract
This paper presents a new approach to visual zero-shot slot filling. The approach extends previous approaches by reformulating the slot filling task as Question Answering. Slot tags are converted to rich natural language questions that capture the semantics of visual information and lexical text on the GUI screen. These questions are paired with the user's utterance, and slots are extracted from the utterance using a state-of-the-art ALBERT-based Question Answering system trained on the Stanford Question Answering dataset (SQuAD2). An approach to further refine the model with multi-task training is presented. The multi-task approach facilitates the incorporation of a large number of successive refinements and transfer learning across similar tasks. A new Visual Slot dataset and a visual extension of the popular ATIS dataset are introduced to support research and experimentation on visual slot filling. Results show F1 scores between 0.52 and 0.60 on the Visual Slot and ATIS datasets with no training data (zero-shot).
Introduction
The last decade has seen the development and broad deployment of personal digital assistants (PDAs) including Apple Siri, Microsoft Cortana, Amazon Alexa, Google Assistant, and Samsung Bixby. A primary component of these PDAs is Natural Language Understanding (NLU): understanding the meaning of the user's utterance. Referring to Figure 1, the NLU task typically consists of determining the domain of the user's request (e.g., travel), the user's intent (e.g., find flight), and information-bearing parameters commonly referred to as semantic slots (e.g., City-departure, City-arrival, and Date). The task of determining the semantic slots is typically called slot filling (Tur and De Mori 2011).

Figure 1: An example semantic representation with domain, intent, and semantic slot annotations.

State-of-the-art methods for semantic slot filling are predominately based on dynamic deep learning models. Early slot filling models used recurrent neural networks (RNNs) (Mesnil et al. 2014), then progressed to long short-term memory (LSTM) neural networks (Liu and Lane 2016), and most recently transformer-based neural networks (Chen, Zhuo, and Wang 2019).

While these dynamic deep learned approaches have achieved high accuracy in slot filling, they require large sets of domain-dependent supervised training data. However, many applications cannot afford large supervised training sets. For example, in the design of AI skills for modern digital assistants, developers are often not highly skilled in NLU. As a result, the quality of the AI skills is often limited by the lack of high-quality NLU annotations. In general, the requirement for large supervised training sets has limited the broad expansion of AI skills to adequately cover the long tail of user goals and intents.

Recent work has focused on developing models and approaches that require less supervised training data. Zero- and few-shot learning methods have been developed across NLP tasks (Dauphin et al. 2013; Yann et al. 2014; Upadhyay et al. 2018). Methods can be broadly categorized into transfer learning (Jaech, Heck, and Ostendorf 2016; El-Kahky et al. 2014; Hakkani-Tür et al. 2016), reinforcement learning (Liu et al. 2017; Kumar et al. 2017; Shah, Hakkani-Tür, and Heck 2016), and more recently, synthetic training (Xu et al. 2020; Campagna et al. 2020).

In many cases, the user interacts with an App screen or web page and therefore uses multiple modalities such as voice, vision, and/or touch (Heck et al. 2013; Hakkani-Tür et al. 2014; Li et al. 2019; Zhang et al. 2020; Selvaraju et al. 2019). For these settings, zero- and few-shot learning can be achieved by leveraging the semantics contained in the screen. In (Bapna et al. 2017), the authors incorporated visual slot names or descriptions in a domain-agnostic slot tagging model called a Concept Tagger. The Concept Tagger models the visual slot description (e.g., "destination") as a Bag-of-Words (BOW) embedding vector and injects it via a Feed-Forward network inside the original deep LSTM network that processes the user's utterance (e.g., "Get a cab to 1945 Charleston"). Results showed the inclusion of slot descriptions significantly out-performed the previous state-of-the-art multi-task transfer learning approach (Hakkani-Tür et al. 2016).

The Concept Tagger in (Bapna et al. 2017) is limited in several ways. First, the BOW semantic representation of the visual slot description is static and does not model dynamics of the description language.
Second, the method is limited to only visual slots with text descriptions and does not incorporate other semantic information from the visual elements (i.e., whether the element is a form field or a radio button with choices). Third, the Concept Tagger incorporates multi-task learning only through the visual slot description.

This paper addresses all three limitations of the Concept Tagger work. The next section describes a new approach that formulates the visual slot filling problem as Question Answering (Heck, Heck, and Osborne 2020; Namazifar et al. 2020). Richer semantics extracted from the visual screen are incorporated, and an iterative multi-task fine-tuning of a transformer-based (ALBERT) Question Answering system for visual slot filling is introduced. In the Experiments Section, we introduce a new corpus collected for visual slot filling called the Visual Slot dataset. We compare the new zero-shot visual slot filling as QA approach to state-of-the-art methods on this new corpus as well as ATIS (Tur, Hakkani-Tür, and Heck 2010). Finally, we summarize our findings and suggest next steps in the Conclusions and Future Work Section.

Approach
The foundation of the approach presented in this paper is the utilization of deeper semantics in the visual representation of the slot on the user's screen. While previous slot filling methods treated the slot label as simply a classification tag with no semantic information, the approach of this paper extracts meaning from the visual slot representation. For simple text fields, the semantics of the visual slot is derived from the lexical content and reformulated as a question, e.g., the visual slot text "Departure City" becomes "What is the Departure City?". By formulating the slot description as a Question and the user's utterance as the Paragraph, we can directly utilize powerful transformer-based extractive Question Answering models such as ALBERT (Lan et al. 2019), as shown in Figure 2. The Start/End Span of the extracted Answer is the filled slot. We call our approach Visual Slot Filling as Question Answering (QA).
In addition to the lexical semantics contained in the text slot description, the type of the visual graphical user interface (GUI) element on the App or web page provides additional semantic information. For example, a Trip Logging App might have a set of Radio Buttons where the user can choose the type of trip: Business, Personal, or Other. Translating the Radio Buttons to natural language would take the form of a question: "Would you like to log this trip as business, personal or other?"

Figure 2: ALBERT Question Answering (Lan et al. 2019).

Figure 3: Semantically annotated mobile GUI using computer vision to identify 25 UI component categories, 197 text button concepts, and 99 icon classes (Liu et al. 2018).

The set of GUI design elements of a mobile App that are available to translate into Questions are shown in Figure 3. These GUI design elements are automatically classified via a convolutional deep neural network computer vision system trained on the RICO dataset (Deka et al. 2017; Liu et al. 2018). The computer vision classifier identifies 25 UI component categories (e.g., Ads, Checkboxes, On/Off switches, Radio Buttons), 197 text button concepts (e.g., login, back, add/delete/save/continue), and 99 icon classes (e.g., arrows forward/backward, dots, plus symbol) with 94% accuracy.

In our visual slot filling as QA method, a rule is created to translate each GUI design element into an appropriate Question. Each type of GUI design element has a unique rule type that fires depending on the visual presence of the GUI design element. If multiple GUI design elements are visible, then multiple translation rules fire, generating simultaneous Questions to be paired with the user's utterance.
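To make the rule firing and span extraction concrete, the sketch below pairs rule-generated questions with an utterance using an extractive QA model from the Hugging Face transformers library. The checkpoint path, the element schema, the confidence threshold, and the helper functions are illustrative assumptions rather than the authors' released code; any ALBERT model fine-tuned on SQuAD2 could be substituted.

```python
# Minimal sketch of Visual Slot Filling as QA (not the authors' released code).
# Assumes an ALBERT checkpoint fine-tuned on SQuAD2; the model name is a placeholder.
from transformers import pipeline

qa = pipeline("question-answering", model="path/to/albert-squad2-checkpoint")

def question_for_element(element):
    """Translate one visible GUI design element into a natural language question."""
    if element["type"] == "text_field":
        return f"What is the {element['label'].lower()}?"
    if element["type"] == "checkbox":
        return f"Should I {element['label'].lower()}?"
    if element["type"] == "radio_group":
        options = ", ".join(element["options"][:-1]) + " or " + element["options"][-1]
        return f"Would you like to choose {options}?"
    return None

def fill_visual_slots(utterance, visible_elements, null_threshold=0.5):
    """Fire one question per visible element and keep confident, answerable spans."""
    slots = {}
    for element in visible_elements:
        question = question_for_element(element)
        if question is None:
            continue
        # handle_impossible_answer lets the SQuAD2-trained model reject questions
        # whose answer is not present in the utterance.
        result = qa(question=question, context=utterance,
                    handle_impossible_answer=True)
        if result["answer"] and result["score"] >= null_threshold:
            slots[element["label"]] = result["answer"]
    return slots

elements = [
    {"type": "text_field", "label": "Departure City"},
    {"type": "text_field", "label": "Arrival City"},
]
print(fill_visual_slots("book me a flight from Atlanta to Boston", elements))
```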
Figure 4: Single (N=2) and Multi-Task (N > 2) Visual Slot Filling as QA.

Finally, our Visual Slot Filling as QA method can be formulated as both single- and multi-task training. Single-task (ST) training for our method is shown in Figure 4 with two fine-tuning steps (N=2). The first step is general-purpose Question Answering trained with SQuAD2, and the second fine-tuning step is trained with supervised (annotated) slot filling data from the visual App or web page GUI. Single-task zero-shot visual slot filling as QA is achieved by simply using the model from the first fine-tuning step (trained on SQuAD2), without the second, supervised step.
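The second fine-tuning step consumes annotated slot filling data in the same extractive QA format. A minimal sketch of that data conversion is shown below; the dictionary layout and helper name are our assumptions (standard SQuAD2-style examples), not the authors' released pipeline.

```python
# Sketch of preparing data for the second fine-tuning step (assumed format): each
# annotated slot becomes a SQuAD2-style example whose question is the visual slot's
# question and whose answer is the slot value span inside the utterance; slots absent
# from the utterance become unanswerable examples.
def to_squad2_examples(utterance, slot_questions, gold_slots):
    """slot_questions: {slot_name: question}; gold_slots: {slot_name: value or None}."""
    examples = []
    for slot_name, question in slot_questions.items():
        value = gold_slots.get(slot_name)
        start = utterance.find(value) if value else -1
        examples.append({
            "question": question,
            "context": utterance,
            "answers": ({"text": [value], "answer_start": [start]} if start >= 0
                        else {"text": [], "answer_start": []}),  # unanswerable
            "is_impossible": start < 0,
        })
    return examples

examples = to_squad2_examples(
    "log a business trip with odometer value 48200",
    {"trip_type": "Would you like to log this trip as business, personal or other?",
     "odometer": "What is the odometer value?",
     "fuel_cost": "What is the fuel cost?"},
    {"trip_type": "business", "odometer": "48200", "fuel_cost": None},
)
```

In the multi-task (N > 2) setting, additional fine-tuning steps of this form would be chained for related tasks before the final visual slot filling task.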
Experiments

Setup
Our base QA system is based on the PyTorch implementation of ALBERT (xxlarge) in the Hugging Face transformers library (https://github.com/huggingface/transformers). We use the pre-trained LM weights in the encoder module, trained with all of the official hyper-parameters.
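At the mechanics level, the setup above can be reproduced with the transformers API; the snippet below loads an ALBERT xxlarge encoder with a QA head and shows how the Start/End span described in the Approach section is read off the logits. The checkpoint name is the public pre-trained one, and in the paper's pipeline this QA head is first fine-tuned on SQuAD2, so treat this as a structural sketch only.

```python
# Sketch of ALBERT span extraction (public checkpoint name assumed; the QA head
# must first be fine-tuned on SQuAD2 before the spans are meaningful).
import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

tokenizer = AutoTokenizer.from_pretrained("albert-xxlarge-v2")
model = AutoModelForQuestionAnswering.from_pretrained("albert-xxlarge-v2")

inputs = tokenizer("What is the departure city?",
                   "get me a flight from atlanta to boston",
                   return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# The predicted slot value is the token span between the argmax start and end logits.
start = int(outputs.start_logits.argmax())
end = int(outputs.end_logits.argmax())
answer = tokenizer.decode(inputs["input_ids"][0][start:end + 1])
```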
Visual Slot Dataset

Amazon Mechanical Turk (AMT) was used to collect a new Visual Slot dataset to support zero-shot visual slot filling experiments; the data will be published on GitHub with this paper. The AMT crowd workers were asked to formulate requests to mobile App screens from the Trip Logger App (Michaelsoft) in the RICO dataset, as shown in Figure 5. A total of 750 queries were annotated with 10 slot types corresponding to the GUI design semantics derived from the previously described computer vision system (e.g., fuel cost, trip description). The data was divided to maintain balanced samples per slot type, with 500 queries reserved for training and 250 queries for the test set.

The visual GUI elements of the Trip Logger App include Text fields (e.g., Odometer Value), Radio Buttons (e.g., Business, Personal, Other), a Checkbox (e.g., Track distance with GPS), and Text buttons (e.g., TRIP, DAY, FUEL, OTHER, Start trip). For the Text fields, questions were generated by simple templates such as "What is the odometer value?" for "Odometer Value". The Checkbox GUI element used a binary-valued "Yes/No" template for the question, e.g., "Should I track distance with GPS?". Radio Buttons use templates that generate questions in list form, e.g., "Would you like to log this trip as business, personal or other?".
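The balanced per-slot-type division into 500 training and 250 test queries can be done with a simple stratified split; the helper below is an illustrative assumption, since the paper does not describe the exact splitting procedure.

```python
# Sketch of a per-slot-type balanced split like the one described for the Visual Slot
# dataset (750 queries -> 500 train / 250 test); illustrative, not the authors' script.
import random
from collections import defaultdict

def balanced_split(queries, train_frac=2 / 3, seed=0):
    """queries: list of dicts with a 'slot_type' key. Returns (train, test)."""
    by_type = defaultdict(list)
    for q in queries:
        by_type[q["slot_type"]].append(q)
    rng = random.Random(seed)
    train, test = [], []
    for slot_type, items in by_type.items():
        rng.shuffle(items)
        cut = round(len(items) * train_frac)  # keep the same train/test ratio per slot type
        train.extend(items[:cut])
        test.extend(items[cut:])
    return train, test
```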
Figure 5: Trip Logger Mobile Application (Michaelsoft)
ATIS with Simulated Visual Elements Dataset
The ATIS dataset is a widely used NLU benchmark for users interacting through natural language with a flight booking system (Tur, Hakkani-Tür, and Heck 2010). To use ATIS for visual slot filling as QA, we extended the dataset in two ways. First, as shown in Figure 6, each slot tag was reformulated as a natural language Question. For example, the ATIS slot tag "B-aircraft_code" is translated into the question "what is the aircraft code?" Second, each slot is classified into a simple visual text field such as those used in App forms (e.g., a field description in the American Airlines flight reservation App form).

Figure 6: ATIS simulated visual elements translated to Questions.
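Both extensions can be scripted directly from the BIO annotations. The sketch below follows the example above ("B-aircraft_code" becomes "what is the aircraft code?"); the helper names and exact question wording are our assumptions rather than the authors' released conversion script.

```python
# Sketch of the ATIS extension (illustrative): each BIO slot tag becomes a question,
# and the tagged tokens become the gold answer span for that question.
def tag_to_question(slot_tag):
    """e.g. 'B-aircraft_code' -> 'what is the aircraft code?'"""
    name = slot_tag.split("-", 1)[1].replace("_", " ").replace(".", " ")
    return f"what is the {name}?"

def bio_to_qa(tokens, bio_tags):
    """Collect (question, answer) pairs from one BIO-annotated ATIS utterance."""
    pairs, current_tag, current_tokens = [], None, []
    for token, tag in list(zip(tokens, bio_tags)) + [("", "O")]:
        if tag.startswith("B-") or tag == "O":
            if current_tag is not None:
                pairs.append((tag_to_question("B-" + current_tag), " ".join(current_tokens)))
            current_tag, current_tokens = ((tag[2:], [token]) if tag.startswith("B-")
                                           else (None, []))
        elif tag.startswith("I-") and current_tag == tag[2:]:
            current_tokens.append(token)
    return pairs

tokens = "show flights from atlanta to boston".split()
tags = ["O", "O", "O", "B-fromloc.city_name", "O", "B-toloc.city_name"]
print(bio_to_qa(tokens, tags))
# [('what is the fromloc city name?', 'atlanta'), ('what is the toloc city name?', 'boston')]
```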
Evaluation
Table 1 summarizes zero-shot F1 scores (harmonic mean of precision and recall) on the new Visual Slot dataset and the ATIS with simulated visual elements dataset. For comparison, the F1 score is shown for the state-of-the-art supervised approach, BERT for Joint Intent Classification and Slot Filling (Chen, Zhuo, and Wang 2019). Our new Visual Slots as QA approach yields an F1 score of 0.48 for zero-shot single-task (ST) training on the Visual Slot dataset and 0.60 for ATIS. This is achieved by simply using the model trained on SQuAD2 in Figure 4 (the first fine-tuning step), with no further supervised slot filling data.
Table 1: Zero-shot F1 scores on the Visual Slot dataset and ATIS (simulated visuals). The table shows Joint BERT and Tag-QA (no visuals) versus our new Visual Slot Filling as Question Answering (single- and multi-task).

               Joint BERT   Tag-QA     Visual Slots as QA
               (no vis)     (no vis)   (ST)    (MT)
Vis-Slot       0.00         0.01       0.48    0.52
ATIS (sim.)    0.00         0.00       0.60    -

To explore the effects of small amounts of training data (few-shot), we varied the number of supervised training samples in the Visual Slot dataset as shown in Table 2. We also trained a model, called "Tag-QA (no visual)", to determine the effect of removing the semantically rich Questions that represent the visual slots. In this model, the Questions used in Visual Slots as QA were replaced with tag symbols, where the tag symbol had no semantic information (e.g., "XYZ"). Otherwise, Tag-QA (no visual) is the same model as Visual Slots as QA. The training samples were randomly chosen across all 10 slot types from the complete set of 500 utterances. When all 500 utterances are used, the difference between the F1 scores of the competing methods and our new Visual Slots as QA methods becomes smaller, but both the single-task (ST) and multi-task (MT) models still perform best.

Table 2: Slot F1 scores on the Visual Slot dataset for varying numbers of supervised training samples. Joint BERT and Tag-QA versus our new Visual Slot Filling as Question Answering (single- and multi-task).

Tag-QA (no visual): 0.01, 0.07, 0.29, 0.32, 0.71
Visual Slots as QA (ST): 0.48, 0.46, 0.73

Tables 3 and 4 show zero-shot F1 scores on the Visual Slot and ATIS datasets, respectively, for our new approach when varying the number of visual GUI elements that are displayed to the user. For example, when 2 visual elements are displayed, the model must not only parse the slots from the utterance for one of the visual elements but also correctly reject the filling of slots into the other element. For both the Visual Slot and ATIS datasets, the models degrade gracefully. This robustness is likely the result of the initial fine-tuning on the SQuAD2 dataset, which trains the model to reject false questions - questions that do not have a correct answer to extract from the given Paragraph.
Table 3: Zero-shot Slot F1 scores on the Visual Slot dataset for varying numbers of visual elements shown to the user simultaneously.
Table 4: Zero-shot Slot F1 scores on ATIS (with simulated visual elements) for varying numbers of visual elements shown to the user simultaneously.
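The reported metric is slot F1, the harmonic mean of precision and recall. The exact matching criterion is not spelled out in the text, so the sketch below assumes exact-match comparison of predicted slot values against the gold annotations.

```python
# Sketch of slot F1 scoring (assumed exact-match criterion; the paper specifies only
# that F1 is the harmonic mean of precision and recall over the filled slots).
def slot_f1(predicted, gold):
    """predicted, gold: lists of {slot_name: value} dicts, one per test utterance."""
    tp = fp = fn = 0
    for pred, ref in zip(predicted, gold):
        for name, value in pred.items():
            if ref.get(name) == value:
                tp += 1       # slot filled with the correct value
            else:
                fp += 1       # slot filled with a wrong or spurious value
        fn += sum(1 for name, value in ref.items() if pred.get(name) != value)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```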
Conclusions and Future Work
This paper presented a new approach to visual zero-shot slot filling. The approach extends previous approaches by reformulating the slot filling task as Question Answering. Slot tags are converted to rich natural language questions that capture the semantics of visual information and lexical text on the GUI screen. These questions are paired with the user's utterance, and slots are extracted from the utterance using a state-of-the-art ALBERT-based Question Answering system trained on the Stanford Question Answering dataset (SQuAD2). An approach to further refine the model with multi-task training is presented. The multi-task approach facilitates the incorporation of a large number of successive refinements and transfer learning across similar tasks. A new Visual Slot dataset and a visual extension of the popular ATIS dataset are introduced to support research and experimentation on visual slot filling. Results show F1 scores between 0.52 and 0.60 on the Visual Slot and ATIS datasets with no training data (zero-shot).

Future work will complete a comprehensive study of the new Visual Slot Filling as QA approach across a broader set of visual GUI screens spanning both mobile Apps and web pages. In addition, we plan to explore improved rejection methods for screens with high-density competing visual GUI elements.

References
Bapna, A.; Tur, G.; Hakkani-Tur, D.; and Heck, L. 2017. Towards zero-shot frame semantic parsing for domain scaling. arXiv preprint arXiv:1707.02363.

Campagna, G.; Foryciarz, A.; Moradshahi, M.; and Lam, M. S. 2020. Zero-Shot Transfer Learning with Synthesized Data for Multi-Domain Dialogue State Tracking. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.

Chen, Q.; Zhuo, Z.; and Wang, W. 2019. BERT for Joint Intent Classification and Slot Filling. arXiv:1902.10909.

Dauphin, Y. N.; Tur, G.; Hakkani-Tur, D.; and Heck, L. 2013. Zero-shot learning for semantic utterance classification. arXiv preprint arXiv:1401.0509.

Deka, B.; Huang, Z.; Franzen, C.; Hibschman, J.; Afergan, D.; Li, Y.; Nichols, J.; and Kumar, R. 2017. Rico: A Mobile App Dataset for Building Data-Driven Design Applications. In Proceedings of the 30th Annual Symposium on User Interface Software and Technology, UIST '17.

El-Kahky, A.; Liu, X.; Sarikaya, R.; Tur, G.; Hakkani-Tur, D.; and Heck, L. 2014. Extending domain coverage of language understanding systems via intent transfer between domains using knowledge graphs and search query click logs. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 4067–4071. IEEE.

Hakkani-Tür, D.; Slaney, M.; Celikyilmaz, A.; and Heck, L. 2014. Eye gaze for spoken language understanding in multi-modal conversational interactions. In Proceedings of the 16th International Conference on Multimodal Interaction, 263–266.

Hakkani-Tür, D.; Tur, G.; Celikyilmaz, A.; Chen, Y.-N. V.; Gao, J.; Deng, L.; and Wang, Y.-Y. 2016. Multi-Domain Joint Semantic Frame Parsing using Bi-directional RNN-LSTM. In Proceedings of The 17th Annual Meeting of the International Speech Communication Association (INTERSPEECH 2016). ISCA.

Heck, L.; Hakkani-Tür, D.; Chinthakunta, M.; Tur, G.; Iyer, R.; Parthasarathy, P.; Stifelman, L.; Shriberg, E.; and Fidler, A. 2013. Multi-Modal Conversational Search and Browse. In Proceedings of the First Workshop on Speech, Language and Audio in Multimedia (SLAM 2013), 96–101.

Heck, L.; Heck, S.; and Osborne, J. 2020. Visual Natural Language Understanding as Question Answering. In Pomona College Undergraduate Symposium, Summer 2020.

Kumar, S.; Shah, P.; Hakkani-Tur, D.; and Heck, L. 2017. Federated control with hierarchical multi-agent deep reinforcement learning. In Conference on Neural Information Processing Systems (NeurIPS), Hierarchical Reinforcement Learning Workshop.

Lan, Z.; Chen, M.; Goodman, S.; Gimpel, K.; Sharma, P.; and Soricut, R. 2019. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. arXiv:1909.11942.

Li, D.; Tasci, S.; Ghosh, S.; Zhu, J.; Zhang, J.; and Heck, L. 2019. RILOD: near real-time incremental learning for object detection at the edge. In Proceedings of the 4th ACM/IEEE Symposium on Edge Computing, 113–126.

Liu, B.; and Lane, I. 2016. Attention-Based Recurrent Neural Network Models for Joint Intent Detection and Slot Filling. In Proceedings of Interspeech 2016.

Liu, B.; Tur, G.; Hakkani-Tur, D.; Shah, P.; and Heck, L. 2017. End-to-end optimization of task-oriented dialogue model with deep reinforcement learning. arXiv preprint arXiv:1711.10712.

Liu, T. F.; Craft, M.; Situ, J.; Yumer, E.; Mech, R.; and Kumar, R. 2018. Learning Design Semantics for Mobile Apps. In The 31st Annual ACM Symposium on User Interface Software and Technology, UIST '18, 569–579.

Mesnil, G.; Dauphin, Y.; Yao, K.; Bengio, Y.; Deng, L.; Hakkani-Tur, D.; He, X.; Heck, L.; Tur, G.; Yu, D.; et al. 2014. Using recurrent neural networks for slot filling in spoken language understanding. IEEE/ACM Transactions on Audio, Speech, and Language Processing.

Selvaraju, R. R.; Lee, S.; Shen, Y.; Jin, H.; Ghosh, S.; Heck, L.; Batra, D.; and Parikh, D. 2019. Taking a hint: Leveraging explanations to make vision and language models more grounded. In Proceedings of the IEEE International Conference on Computer Vision, 2591–2600.

Shah, P.; Hakkani-Tür, D.; and Heck, L. 2016. Interactive reinforcement learning for task-oriented dialogue management. In Conference on Neural Information Processing Systems (NIPS), Workshop on Deep Learning for Action and Interaction.

Tur, G.; and De Mori, R., eds. 2011. Spoken Language Understanding: Systems for Extracting Semantic Information from Speech. Wiley.

Tur, G.; Hakkani-Tür, D.; and Heck, L. 2010. What is left to be understood in ATIS? In 2010 IEEE Spoken Language Technology Workshop (SLT), 19–24. IEEE.

Upadhyay, S.; Faruqui, M.; Tur, G.; Hakkani-Tür, D.; and Heck, L. 2018. (Almost) zero-shot cross-lingual spoken language understanding. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 6034–6038. IEEE.

Xu, S. X.; Semnani, S. J.; Campagna, G.; and Lam, M. S. 2020. AutoQA: From Databases To QA Semantic Parsers With Only Synthetic Training Data. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing.

Yann, D.; Tur, G.; Hakkani-Tur, D.; and Heck, L. 2014. Zero-shot learning and clustering for semantic utterance classification using deep learning. In International Conference on Learning Representations.

Zhang, J.; Zhang, J.; Ghosh, S.; Li, D.; Tasci, S.; Heck, L.; Zhang, H.; and Kuo, C.-C. J. 2020. Class-incremental learning via deep model consolidation. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision (WACV).