Generalized Zero-shot Intent Detection via Commonsense Knowledge
A.B. Siddique, Fuad Jamour, Luxun Xu, Vagelis Hristidis
University of California, Riverside
{msidd005,fuadj,lxu051}@ucr.edu, [email protected]
Abstract
Identifying user intents from natural language utterances is a crucial step in conversational systems that has been extensively studied as a supervised classification problem. However, in practice, new intents emerge after deploying an intent detection model. Thus, these models should seamlessly adapt and classify utterances with both seen and unseen intents – unseen intents emerge after deployment and they do not have training data. The few existing models that target this setting rely heavily on the scarcely available training data and overfit to seen intents data, resulting in a bias to misclassify utterances with unseen intents into seen ones. We propose RIDE: an intent detection model that leverages commonsense knowledge in an unsupervised fashion to overcome the issue of training data scarcity. RIDE computes robust and generalizable relationship meta-features that capture deep semantic relationships between utterances and intent labels; these features are computed by considering how the concepts in an utterance are linked to those in an intent label via commonsense knowledge. Our extensive experimental analysis on three widely-used intent detection benchmarks shows that relationship meta-features significantly increase the accuracy of detecting both seen and unseen intents and that RIDE outperforms the state-of-the-art model for unseen intents.
Introduction

Virtual assistants such as Amazon Alexa and Google Assistant allow users to perform a variety of tasks (referred to as 'skills' in Alexa) through an intuitive natural language interface. For example, a user can set an alarm by simply issuing the utterance "Wake me up tomorrow at 10 AM" to a virtual assistant, and the assistant is expected to understand that the user's intent (i.e., "AddAlarm") is to invoke the alarm module, then set the requested alarm accordingly. Detecting the intent implied in a natural language utterance (i.e., intent detection) is typically the first step towards performing any task in conversational systems.

Intent detection (or classification) is a challenging task due to the vast diversity in user utterances. The challenge is further exacerbated in the more practically relevant setting where the full list of possible intents (or classes) is not available before deploying the conversational system, or intents are added over time. This setting is an instance of the generalized zero-shot classification problem (Felix et al., 2018): labeled training utterances are available for seen intents but are unavailable for unseen ones, and at inference time, models do not have prior knowledge on whether the utterances they receive imply seen or unseen intents; i.e., unseen intents emerge after deploying the model. This setting is the focus of this paper.

Little research has been conducted on building generalized zero-shot (GZS) models for intent detection, with little success. The authors in (Liu et al., 2019) proposed integrating a dimensional attention mechanism into a capsule neural network (Sabour et al., 2017) and computing transformation matrices for unseen intents to accommodate the GZS setting. This model overfits to seen classes and exhibits a strong bias towards classifying utterances into seen intents, resulting in poor performance. Most recently, the authors in (Yan et al., 2020) extended the previous model by utilizing the density-based outlier detection algorithm LOF (Breunig et al., 2000), which allows distinguishing utterances with seen intents from those with unseen ones and partially mitigates the overfitting issue. Unfortunately, the performance of this model is sensitive to that of LOF, which fails in cases where intent labels are semantically close. We propose
RIDE (Relationship Meta-features Assisted Intent DEtection), a model for GZS intent detection that utilizes commonsense knowledge to compute robust and generalizable unsupervised relationship meta-features. (This work is currently under review; the GitHub code repository will be shared upon acceptance.) These meta-features capture deep semantic associations between an utterance and an intent, resulting in two advantages: (i) they significantly decrease the bias towards seen intents, as they are computed similarly for both seen and unseen intents, and (ii) they infuse commonsense knowledge into our model, which significantly reduces its reliance on training data without jeopardizing its ability to distinguish semantically close intent labels. Relationship meta-features are computed by analyzing how the phrases in an utterance are linked to an intent label via commonsense knowledge.

Figure 1 shows how the words (or phrases) in an example utterance are linked to the words in an example intent label through the nodes (i.e., concepts) of a commonsense knowledge graph. In this example, the link ⟨look for, Synonym, find⟩ indicates that "look for" and "find" are synonyms, and the links ⟨feeling hungry, CausesDesire, eat⟩ and ⟨restaurant, UsedFor, eat⟩ can be used to infer the existence of the link ⟨feeling hungry, IsRelated, restaurant⟩, which indicates that "feeling hungry" and "restaurant" are related. These two links, the direct one and the inferred one, carry a significant amount of semantic relatedness, which indicates that the given utterance-intent pair is compatible. Note that this insight holds regardless of whether the intent is seen or not. RIDE utilizes this insight to build relationship meta-feature vectors that quantify the relatedness between an utterance and an intent in an unsupervised fashion.
RIDE combines relationship meta-features with contextual word embeddings (Peters et al., 2018), and feeds the combined feature vectors into a trainable prediction function to finally detect intents in utterances. Thanks to our relationship meta-features, RIDE is able to accurately detect both seen and unseen intents in utterances. Our extensive experimental analysis using three widely used benchmarks, SNIPS (Coucke et al., 2018), SGD (Rastogi et al., 2019), and MultiWOZ (Zang et al., 2020), shows that our model outperforms the state-of-the-art model in detecting unseen intents in the GZS setting by at least 30.36%.

A secondary contribution of this paper is that we managed to further increase the accuracy of GZS intent detection by employing Positive-Unlabeled
Figure 1: Example utterance, intent, and a small commonsense knowledge graph. The presence of direct links such as ⟨look for, Synonym, find⟩ and inferred ones such as ⟨feeling hungry, IsRelated, restaurant⟩ between phrases in the utterance and those in the intent label indicates utterance-intent compatibility.

(PU) learning (Elkan and Noto, 2008) to predict whether a new utterance belongs to a seen or unseen intent. PU learning assists intent detection models by mitigating their bias towards classifying most utterances into seen intents. A PU classifier is able to perform binary classification after being trained using only positive and unlabeled examples. We found that the use of a PU classifier also improves the accuracy of existing GZS intent detection works, but our model again outperforms these works.
Problem Formulation

Let S = {I_1, ..., I_k} be a set of seen intents and U = {I_{k+1}, ..., I_n} be a set of unseen intents, where S ∩ U = ∅. Let X = {X_1, X_2, ..., X_m} be a set of labeled training utterances, where each training utterance X_i ∈ X is described with a tuple (X_i, I_j) such that I_j ∈ S. An intent I_j is comprised of an Action and an Object and takes the form "ActionObject" (e.g., "FindRestaurant"); an Action describes a user's request or activity, and an Object describes the entity pointed to by an Action (Chen et al., 2013; Wang et al., 2015; Vedula et al., 2020). In both the zero-shot (ZS) and the GZS settings, the training examples have intent labels from the set S only; however, the two settings differ as follows.

ZS Intent Detection.
Given a test utterance X'_i whose true label I_j is known to be in U a priori, predict a label I'_j ∈ U. (If intents are described using a complex textual description, Actions and Objects can be extracted using existing NLP tools such as dependency parsers.)
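Since intent labels follow the "ActionObject" form, splitting a label into its two parts can be sketched as below. This is a minimal illustration assuming single-token CamelCase Actions; labels with complex textual descriptions would need the dependency-parsing tooling mentioned above.

```python
import re

def parse_intent(label: str):
    """Split an "ActionObject" intent label into (Action, Object).

    Assumes the first CamelCase token is the Action and the remainder
    is the Object (an illustrative simplification).
    """
    parts = re.findall(r"[A-Z][a-z]*", label)
    action, obj = parts[0], " ".join(parts[1:])
    return action, obj

print(parse_intent("FindRestaurant"))  # ('Find', 'Restaurant')
print(parse_intent("AddAlarm"))        # ('Add', 'Alarm')
```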
Figure 2: Overview of RIDE.

GZS Intent Detection.
Given a test utterance X'_i, predict a label I'_j ∈ S ∪ U. Note that unlike in the ZS setting, it is not known whether the true label of X'_i belongs to S or U, which exacerbates the challenge in this setting; we focus on this setting in this paper.

Knowledge graphs (KGs) are structures that capture relationships between entities, and are typically used to capture knowledge in a semi-structured format; i.e., they are used as knowledge bases. Knowledge graphs can be viewed as collections of triples, each representing a fact of the form ⟨head, relation, tail⟩, where head and tail describe entities and relation describes the relationship between the respective entities. In this work, we use ConceptNet (Speer et al., 2016), which is a rich and widely-used commonsense knowledge graph. Interested readers can check Appendix A.1 for more details on ConceptNet.
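The triple view of a knowledge graph can be sketched with the Figure 1 facts; the tiny graph and `has_fact` helper below are illustrative, not part of ConceptNet's API. Note that membership is directional, which matters for the bi-directional meta-features described later.

```python
from collections import namedtuple

# A commonsense fact as a <head, relation, tail> triple (ConceptNet-style).
Triple = namedtuple("Triple", ["head", "relation", "tail"])

# Toy graph built from the Figure 1 example.
kg = {
    Triple("look for", "Synonym", "find"),
    Triple("feeling hungry", "CausesDesire", "eat"),
    Triple("restaurant", "UsedFor", "eat"),
}

def has_fact(head, relation, tail):
    """Directional membership test: <h, r, t> does not imply <t, r, h>."""
    return Triple(head, relation, tail) in kg

print(has_fact("restaurant", "UsedFor", "eat"))  # True
print(has_fact("eat", "UsedFor", "restaurant"))  # False (direction matters)
```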
While large knowledge graphs may capture a large subset of knowledge, they are incomplete: some relationships (or links) are missing. Link prediction (Kazemi and Poole, 2018) augments knowledge graphs by predicting missing relations using existing ones. In the context of this work, we pre-train a state-of-the-art link prediction model (LP) on the ConceptNet KG to score novel facts that are not necessarily present in the knowledge graph. Given a triple (i.e., fact) of the form ⟨head, relation, tail⟩, a link prediction model scores the triple with a value between 0 and 1, which quantifies the validity of the given triple. The details of training our link predictor are available in Appendix A.2.
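The LP(head, relation, tail) interface can be sketched with a toy stand-in: known facts score 1.0, and a crude two-hop heuristic gives partial credit to inferable facts (as with ⟨feeling hungry, IsRelated, restaurant⟩ in Figure 1). The scoring rule here is purely illustrative; the actual model is a learned embedding-based link predictor.

```python
def lp_score(head, relation, tail, kg):
    """Toy stand-in for the trained link predictor LP(head, relation, tail).

    Returns 1.0 for facts present in the graph, a fixed bonus when head
    and tail share a neighbour (inferred relatedness), and 0.0 otherwise.
    A real model returns a learned probability in [0, 1].
    """
    if (head, relation, tail) in kg:
        return 1.0
    # Inferred relatedness: head and tail point to a common concept, as in
    # <feeling hungry, CausesDesire, eat> + <restaurant, UsedFor, eat>.
    head_neighbours = {t for (h, _, t) in kg if h == head}
    tail_neighbours = {t for (h, _, t) in kg if h == tail}
    return 0.5 if head_neighbours & tail_neighbours else 0.0

kg = {("look for", "Synonym", "find"),
      ("feeling hungry", "CausesDesire", "eat"),
      ("restaurant", "UsedFor", "eat")}
print(lp_score("look for", "Synonym", "find", kg))                # 1.0
print(lp_score("feeling hungry", "IsRelated", "restaurant", kg))  # 0.5
```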
Positive-Unlabeled (PU) classifiers learn a standard binary classifier in the unconventional setting where labeled negative training examples are unavailable. The state-of-the-art PU classifier (Elkan and Noto, 2008), which we integrate into our model, learns a decision boundary based on the positive and unlabeled examples, and thus can classify novel test examples into positive or negative. The aim of the PU classifier is to learn a probabilistic function f(X_i) that estimates P(I_j ∈ S | X_i) as closely as possible. In this work, we train a PU classifier using our training set (utterances with only seen intents, labeled as positive) and validation set (utterances with both seen and unseen intents, treated as unlabeled). We use 512-dimensional sentence embeddings as features for the PU classifier, generated using a pre-trained universal sentence encoder (Cer et al., 2018).

Figure 2 shows an overview of our model: given an input utterance X_i, we first invoke the PU classifier (if it is available) to predict whether X_i implies a seen or an unseen intent; i.e., whether X_i's intent belongs to the set S or U. Then, an instance of our core model (the red box in Figure 2) is invoked for each intent in S or U based on the PU classifier's prediction. Our core model predicts the level of compatibility between the given utterance X_i and intent I_j, i.e., the probability that the given utterance implies the given intent, P(I_j | X_i) ∈ [0, 1]. Finally, our model outputs the intent with the highest compatibility probability, i.e., argmax_{I_j} P(I_j | X_i).

Our core model concatenates relationship meta-features, utterance embedding, and intent embedding, and feeds them into a trainable prediction function. The Relationship Meta-features Generator (RMG) is at the heart of our model, and it is the most influential component. Given an utterance and an intent, RMG generates meta-features that capture deep semantic associations between the given utterance and intent in the form of a meta-feature vector.
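The calibration step at the core of the Elkan and Noto (2008) PU approach can be sketched as follows. A "nontraditional" classifier g is trained to separate labeled positives from unlabeled data, so it estimates P(labeled | x); dividing by c = E[g(x) | x positive] recovers P(positive | x). The identity classifier and scores below are toy stand-ins, not the sentence-encoder features used in the paper.

```python
def elkan_noto_calibrate(g, labeled_positives):
    """Turn a nontraditional classifier g (predicting P(labeled | x)) into
    an estimator of P(positive | x) by dividing by c = E[g(x) | positive]."""
    c = sum(g(x) for x in labeled_positives) / len(labeled_positives)
    return lambda x: min(1.0, g(x) / c)

# Toy nontraditional classifier: identity over scores already in [0, 1].
g = lambda x: x
f = elkan_noto_calibrate(g, [0.5, 0.6, 0.7])  # c = 0.6
print(round(f(0.3), 2))  # 0.5
print(f(0.9))            # 1.0 (clipped)
```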
RMG extracts relationship meta-features by utilizing the "ActionObject" structure of intent labels and commonsense knowledge graphs. Relationship meta-features are not only generalizable, but also discriminative: while the example utterance "Look for something nearby. I am feeling hungry." may be related to the intent "ReserveRestaurant", the Action part of this intent is not related to any phrase in the utterance; thus, "ReserveRestaurant" is less related to the utterance than "FindRestaurant".

RMG takes the following inputs: a set of relations in a knowledge graph (35 in the case of ConceptNet), R = {r_1, r_2, ..., r_t}; the set of n-grams G_i = {g_1, g_2, ..., g_q} that correspond to the input utterance X_i, where |G_i| = q; and an intent label I_j = {A, O}, where A and O are the Action and Object components of the intent, respectively. RMG computes a relationship meta-features vector in four steps, where each step results in a vector of size |R|. The four smaller vectors are e_{A→Xi}, e_{O→Xi}, e_{A←Xi}, and e_{O←Xi}, where e_{A→Xi} captures the Action-to-utterance semantic relationships and e_{O→Xi} captures the Object-to-utterance relationships. The remaining two vectors capture relationships in the other direction, i.e., utterance to Action and utterance to Object, respectively. Capturing bi-directional relationships is important because a relationship in one direction does not necessarily imply one in the other direction – for example, ⟨table, AtLocation, restaurant⟩ does not imply ⟨restaurant, AtLocation, table⟩. The final output of RMG is the relationship meta-features vector e_relationship, which is the concatenation of the four aforementioned vectors. We explain next how the smaller vectors are computed.

RMG computes e_{A→Xi} by considering the strength of each relation in R between A and each n-gram in G_i. That is, e_{A→Xi} has |R| cells, where each cell corresponds to a relation r ∈ R.
Each cell is computed by taking max(LP(A, r, g)) over all g ∈ G_i, where LP(head, relation, tail) outputs the probability that the fact represented by the triple ⟨head, relation, tail⟩ exists. The vector e_{O→Xi} is computed similarly, but with O passed instead of A when invoking the link predictor; i.e., taking max(LP(O, r, g)) over all g ∈ G_i to compute each cell. The vectors e_{A←Xi} and e_{O←Xi} are computed similarly, but with the head and tail swapped when invoking the link predictor; i.e., utterance phrases are passed as head and the Action/Object parts are passed as tail. Algorithm 1 outlines the previous process. Finally, the generated meta-features are passed through a linear layer with sigmoid activation before concatenation with the utterance and intent embeddings.

Algorithm 1: RMG
  Input: R = {r_1, ..., r_t}: relations in KG; G_i = {g_1, ..., g_q}: utterance n-grams; I_j = {A, O}: intent's Action and Object
  Output: e_relationship: X_i-I_j relationship meta-features
  Let e_{A→Xi} = RM(A, G_i, →)   // Action to utterance
  Let e_{O→Xi} = RM(O, G_i, →)   // Object to utterance
  Let e_{A←Xi} = RM(A, G_i, ←)   // utterance to Action
  Let e_{O←Xi} = RM(O, G_i, ←)   // utterance to Object
  Let e_relationship = [e_{A→Xi}, e_{O→Xi}, e_{A←Xi}, e_{O←Xi}]
  return e_relationship

  Function RM(concept, phrases, direction):
    Let e = []
    foreach r ∈ R do
      if direction = → then Let p = Max(LP(concept, r, g)) for g ∈ phrases
      if direction = ← then Let p = Max(LP(g, r, concept)) for g ∈ phrases
      e.append(p)
    return e

Utterance and Intent Encoders. Given an utterance X_i = {w_1, w_2, ..., w_u} with u words, we first compute an embedding emb(w_j) ∈ R^dim for each word w_j in the utterance, where emb(w_j) is the concatenation of a contextual embedding obtained from a pre-trained ELMo model and a part-of-speech (POS) tag embedding. Then, we use a bi-directional LSTM to produce a d-dimensional representation as follows:

  h→_j = LSTM_fw(h→_{j-1}, emb(w_j))
  h←_j = LSTM_bw(h←_{j+1}, emb(w_j))

Finally, we concatenate the last hidden states to form the utterance embedding e_utterance = [h→_u; h←_u] ∈ R^d. We encode intent labels similarly to produce an intent embedding e_intent ∈ R^d.

Training the Model. Our model has two trainable components: the LSTM units in the utterance and intent encoders, and the compatibility probability prediction function. We jointly train these components using training data prepared as follows. The training examples are of the form ((X_i, I_j), Y), where Y is a binary label representing whether the utterance-intent pair (X_i, I_j) is compatible: 1 means they are compatible, and 0 means they are not.
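The RMG procedure (Algorithm 1) can be sketched in a few lines; the relation subset and the toy link predictor below are illustrative stand-ins for ConceptNet's 35 relations and the trained LP model.

```python
# Illustrative subset of ConceptNet's 35 relation types.
RELATIONS = ["Synonym", "IsA", "UsedFor", "CausesDesire", "AtLocation"]

def rm(concept, phrases, forward, lp):
    """One |R|-sized half-vector of Algorithm 1: for each relation, the
    maximum link-prediction score between `concept` and any n-gram."""
    return [max(lp(concept, r, g) if forward else lp(g, r, concept)
                for g in phrases)
            for r in RELATIONS]

def rmg(action, obj, ngrams, lp):
    """Relationship meta-features: concatenation of the four half-vectors
    (Action->utterance, Object->utterance, and the two reverse directions)."""
    return (rm(action, ngrams, True, lp) + rm(obj, ngrams, True, lp) +
            rm(action, ngrams, False, lp) + rm(obj, ngrams, False, lp))

# Toy link predictor standing in for the ConceptNet-trained LP model.
lp = lambda h, r, t: 1.0 if (h, r, t) == ("Find", "Synonym", "look for") else 0.0
vec = rmg("Find", "Restaurant", ["look for", "feeling hungry"], lp)
print(len(vec), vec[0])  # 20 1.0
```

The output length is 4·|R|; only the Synonym cell of the Action-to-utterance half-vector fires in this toy run, mirroring how a single strong commonsense link contributes to the meta-features.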
For example, the utterance-intent pair ("I want to play this song", "PlaySong") gets a label of 1, and the same utterance paired with another intent such as "BookHotel" gets a label of 0. We prepare our training data by assigning a label of 1 to the available utterance-intent pairs (where intents are seen ones); these constitute positive training examples. We create a negative training example for each positive one by corrupting the example's intent. We corrupt intents by modifying their Action, Object, or both; for example, the utterance-intent pair ("Look for something nearby. I am hungry.", "FindRestaurant") may result in the negative examples (..., "ReserveRestaurant"), (..., "FindHotel"), or (..., "RentMovies"). We train our core model by minimizing the cross-entropy loss over all the training examples.

Experiments

In this section, we describe the datasets, evaluation settings and metrics, competing methods, and implementation details of our proposed method.
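The intent-corruption step for generating negative examples can be sketched as follows; the Action/Object pools are illustrative, and a real pipeline would draw them from the seen intents of the dataset.

```python
import random

def corrupt(intent, actions, objects, k=2):
    """Generate up to k negative intents for a positive (utterance, intent)
    pair by swapping the Action, the Object, or both."""
    negatives = set()
    while len(negatives) < k:
        candidate = (random.choice(actions), random.choice(objects))
        if candidate != intent:          # never emit the true intent
            negatives.add(candidate)
    return list(negatives)

random.seed(0)
negs = corrupt(("Find", "Restaurant"),
               ["Find", "Reserve", "Rent"],
               ["Restaurant", "Hotel", "Movies"])
print(negs)
```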
Table 1 presents important statistics on the datasets we used in our experiments.
SNIPS (Coucke et al., 2018). A crowd-sourced single-turn NLU benchmark with 7 intents across different domains.

SGD (Rastogi et al., 2019). A recently published dataset for the Eighth Dialog System Technology Challenge, Schema Guided Dialogue (SGD) track. It contains dialogues from many domains covering a large number of intents, and it is one of the most comprehensive and challenging publicly available datasets. The dialogues contain user intents as part of dialogue states. We only kept utterances where users express an intent, by comparing two consecutive dialogue states to check for the expression of a new intent.

MultiWOZ (Zang et al., 2020). Multi-Domain Wizard-of-Oz (MultiWOZ) is a well-known and
Table 1: Dataset statistics (dataset size, vocabulary size, and average utterance length for SNIPS, SGD, and MultiWOZ).

publicly available dataset. We used the most recent version 2.2 of MultiWOZ in our experiments. Similarly to the pre-processing of the SGD dataset, we only kept utterances that express an intent, to maintain consistency with the previous work.

We use standard classification evaluation measures: accuracy and F1 score. The values for all the metrics are per-class averages weighted by their respective support. We present evaluation results for the following intent detection settings:
ZS Intent Detection.
In this setting, a model is trained on all the utterances with seen intents, i.e., all samples (X_i, I_j) where I_j ∈ S, whereas at inference time, the utterances are drawn only from those with unseen intents; the model has to classify a given utterance into one of the unseen intents. Note that this setting is less challenging than the GZS setting because models know that utterances received at inference time imply intents that belong to the set of unseen intents only, thus naturally reducing their bias towards classifying utterances into seen intents. For each dataset, we randomly place varying percentages of the intents in the seen set for training and the rest in the unseen set for testing, and report the average results over multiple runs. It is important to highlight that selecting seen/unseen sets in this fashion is more challenging for models because all the intents get an equal chance to appear in the unseen set, which exposes models that are capable of detecting only certain unseen intents.

GZS Intent Detection.
In this setting, models are trained on a subset of utterances implying seen intents. At inference time, test utterances are drawn from a set that contains utterances implying a mix of seen and unseen intents (disjoint from the training set), and the model is expected to select the correct intent from all seen and unseen intents for a given test utterance. This is the most realistic and challenging problem setting, and it is the main focus of this work. For the GZS setting, we decided the train/test splits for each dataset as follows. For SNIPS, we first randomly selected 5 out of 7 intents and designated them as seen intents; the remaining 2 intents were designated as unseen intents. We then selected a portion of the utterances that imply any of the 5 seen intents for training. The test set consists of the remaining utterances, in addition to all utterances that imply one of the 2 unseen intents. Previous work (Liu et al., 2019) used the same number of seen/unseen intents, but selected the seen/unseen intents manually, whereas we picked unseen intents randomly and report results over multiple runs, resulting in a more challenging and thorough evaluation. That is, each intent gets an equal chance to appear as an unseen intent in our experiments, which allows testing each model more comprehensively. For SGD, we used the standard splits proposed by the dataset authors; specifically, the test set includes utterances that imply unseen intents as well as seen intents, and we report average results over multiple runs. For MultiWOZ, we used a portion of the utterances that imply randomly selected seen intents for training, and the rest of the utterances (i.e., the remaining seen intents' utterances and all utterances implying unseen intents) for testing. We compare our model
RIDE against the following state-of-the-art (SOTA) models and several strong baselines:
SEG (Yan et al., 2020). A semantic-enhanced Gaussian mixture model that uses a large margin loss to learn class-concentrated embeddings, coupled with the density-based outlier detection algorithm LOF to detect unseen intents.
ReCapsNet-ZS (Liu et al., 2019). A model that employs a capsule neural network and a dimensional attention module to learn generalizable transformation matrices from seen intents.
IntentCapsNet (Xia et al., 2018). A model that utilizes capsule neural networks to learn low-level features and routing-by-agreement to adapt to unseen intents. This model was originally proposed for detecting intents in the standard ZS setting; we extended it to support the GZS setting with the help of its authors.
Other Baseline Models. (i) Zero-shot DDN (Kumar et al., 2017): a model for ZS intent detection that achieves zero-shot capabilities by projecting utterances and intent labels into the same high-dimensional embedding space. (ii) CDSSM (Chen et al., 2016): a model for ZS intent detection that utilizes a convolutional deep structured semantic model to generate embeddings for unseen intents. (iii) CMT (Socher et al., 2013): a model for ZS intent detection that employs non-linearity in the compatibility function between utterances and intents to find the most compatible unseen intents. (iv) DeViSE (Frome et al., 2013): a model originally proposed for zero-shot image classification that learns a linear compatibility function. Note that the baseline ZS models have been extended to support the GZS setting.

We lemmatize the ConceptNet KG, which has millions of nodes (English only, after lemmatization), millions of edges, and 35 relation types. The link predictor is trained on the lemmatized version of the ConceptNet KG; it has two embedding layers and is trained with negative sampling for a fixed number of epochs using the Adam optimizer with L2 regularization. Our relationship meta-features generator takes in an utterance's n-grams (up to a maximum n) and an intent label, and uses the pre-trained link predictor to produce relationship meta-features. Our utterance and intent encoders use pre-trained ELMo contextual word embeddings and POS tag embeddings, and a two-layer RNN with bidirectional LSTM recurrent units. Our prediction function has two dense layers with relu and softmax activations. Our core model is trained with early stopping using the Adam optimizer and a cross-entropy loss, with a ReduceLROnPlateau scheduler (PyTorch, 2020); it uses dropout and a fixed batch size.
A negative sampling ratio is also used when constructing training examples. We use the same embedding generation and training mechanism for all competing models.

Standard ZS Intent Detection.
Figure 3 presents the F1 scores averaged over multiple runs for all competing models with varying percentages of seen intents in the ZS setting. The performance of all models improves as the percentage of seen intents increases, which is expected because increasing the percentage of seen intents gives models access to more training data and intents.

Figure 3: F1 scores for the competing models (DeViSE, CMT, CDSSM, Zero-shot DNN, IntentCapsNet, ReCapsNet, SEG, RIDE) on the (a) SNIPS, (b) SGD, and (c) MultiWOZ datasets in the ZS setting with varying percentages of seen intents. In the ZS setting, utterances with only unseen intents are encountered at inference time, and models are aware of this. Our model RIDE consistently outperforms all other models for any given percentage of seen intents.
Method      | SNIPS Unseen (Acc/F1) | SNIPS Seen (Acc/F1) | SGD Unseen (Acc/F1) | SGD Seen (Acc/F1) | MultiWOZ Unseen (Acc/F1) | MultiWOZ Seen (Acc/F1)
DeViSE      | 0.0311/0.0439         | 0.9487/0.6521       | 0.0197/0.0177       | 0.8390/0.5451     | 0.0119/0.0270            | 0.8980/0.5770
CMT         | 0.0427/0.0910         | …                   | …                   | …                 | …                        | …
RIDE w/o PU | 0.8728/0.9103         | 0.8906/0.8799       | 0.3865/0.4634       | 0.8126/0.8295     | 0.3704/0.4645            | 0.8558/0.8816
RIDE w/ PU  | …                     | …                   | …                   | …                 | …                        | …

Table 2: Main results: accuracy and F1 scores for competing models in the GZS setting (i.e., models receive both seen and unseen intents at inference time, which makes the setting more challenging than the ZS setting). We present results for two variants of our model: RIDE w/o PU, which does not use a PU classifier, and RIDE w/ PU, which uses one. Our model consistently achieves the best F1 score for both seen and unseen intents across all datasets, regardless of whether the PU classifier is integrated or not.

Our model
RIDE consistently outperforms the SOTA model SEG (Yan et al., 2020) and all other models in the ZS setting with a large margin across all the datasets. Specifically, it achieves a substantially higher F1 score than the second-best model for any percentage of seen intents on all the datasets. Note that all models perform worse on SGD and MultiWOZ compared to SNIPS because these two datasets are more challenging: they contain closely related intent labels such as "FindRestaurant" and "FindHotel".

GZS Intent Detection.
Table 2 shows accuracy and F1 scores averaged over multiple runs for all competing models in the GZS setting. For unseen intents, our model RIDE outperforms all other competing models on accuracy with a large margin; specifically, RIDE is substantially more accurate than the SOTA model SEG on SNIPS, SGD, and MultiWOZ for unseen intents. Moreover, our model consistently achieves the highest F1 score on seen as well as unseen intents, which confirms its generalizability. CMT and IntentCapsNet achieve the highest accuracy for utterances with seen intents on all datasets, but their F1 scores are among the worst due to their bias towards misclassifying utterances with unseen intents into seen ones. RIDE outperforms the SOTA model SEG regardless of whether a PU classifier is incorporated or not. For SNIPS, the role of the PU classifier is negligible, as it causes only a slight improvement in accuracy and F1 score. For SGD and MultiWOZ, which are more challenging datasets, the PU classifier is responsible for significant accuracy improvements on unseen intents.

Effect of PU Classifier on Other Models.
We observed that one of the main sources of error for most models in the GZS setting is their tendency to misclassify utterances with unseen intents into seen ones due to overfitting to seen intents. We investigated whether existing models can be adapted to

Figure 4: F1 scores for unseen intents for the competing models (DeViSE, CMT, CDSSM, Zero-shot DNN, IntentCapsNet, ReCapsNet, RIDE) on the (a) SNIPS, (b) SGD, and (c) MultiWOZ datasets in the GZS setting after integrating a PU classifier.
Configuration   | SNIPS  | SGD    | MultiWOZ
UI-Embed w/o PU | 0.2367 | 0.1578 | 0.1723
Rel-M w/o PU    | 0.7103 | 0.3593 | 0.3321
RIDE w/o PU     | 0.9103 | 0.4634 | 0.4645
UI-Embed w/ PU  | 0.7245 | 0.4202 | 0.4124
Rel-M w/ PU     | 0.8463 | 0.5167 | 0.4781
RIDE w/ PU      | …      | …      | …

Table 3: Ablation study: F1 scores for unseen intents in the GZS setting; the key reason behind our model's accuracy is our relationship meta-features.

accurately classify utterances with unseen intents by partially eliminating their bias towards seen intents. Figure 4 presents the F1 scores of all the models with and without a PU classifier. A PU classifier significantly improves the results of all the competing models. For instance, the IntentCapsNet model with a PU classifier achieves an F1 score of 74% for unseen intents on the SNIPS dataset in the GZS setting, compared to an F1 score of less than 0.01% without the PU classifier. Note that the PU classifier itself (i.e., correctly predicting whether an utterance implies a seen or an unseen intent) performs well on the SNIPS, SGD, and MultiWOZ datasets. Interestingly, our model
RIDE (without the PU classifier) outperforms all the competing models even when a PU classifier is incorporated into them, which highlights that the PU classifier is not the main source of our model's performance. We did not incorporate the PU classifier into the SEG model because it already incorporates an equivalent mechanism to distinguish seen intents from unseen ones (i.e., outlier detection).
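The support-weighted F1 score reported throughout these comparisons can be sketched without external libraries; this matches the usual "weighted average" protocol (as in scikit-learn's average='weighted'), shown here on a tiny toy example.

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to each class's support."""
    support = Counter(y_true)
    total = 0.0
    for cls, n in support.items():
        tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
        fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
        fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        total += n * f1
    return total / len(y_true)

print(round(weighted_f1(["a", "a", "b"], ["a", "b", "b"]), 4))  # 0.6667
```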
Ablation Study.
To quantify the effectiveness of each component in our model, we present the results of our ablation study in Table 3 (in the GZS setting). Utilizing utterance and intent embeddings only (i.e., UI-Embed) results in a very low F1 score: 23.67% on the SNIPS dataset. Employing relationship meta-features only (i.e., Rel-M) results in significantly better results: an F1 score of 71.03% on the SNIPS dataset. When utterance and intent embeddings are used in conjunction with relationship meta-features (i.e., RIDE w/o PU), the model achieves a better F1 score than either the Rel-M or UI-Embed configuration. A similar trend can be observed on the other datasets as well. Finally, when our entire model is deployed (i.e., utterance and intent embeddings, relationship meta-features, and the PU classifier, i.e., RIDE w/ PU), it achieves the best results on all the datasets.
Related Work.

Deep neural networks have proved highly effective for many critical NLP tasks (Siddique et al., 2020; Farooq et al., 2020; Zhang et al., 2018; Williams, 2019; Ma et al., 2019; Siddique et al., 2021; Liu and Lane, 2016; Gupta et al., 2019). We organize the related work on intent detection into three categories: (i) supervised intent detection, (ii) standard zero-shot intent detection, and (iii) generalized zero-shot intent detection.

Supervised Intent Detection.
Recurrent neural networks (Ravuri and Stolcke, 2015) and semantic lexicon-enriched word embeddings (Kim et al., 2016) have been employed for supervised intent detection. Recently, researchers have proposed solving the related problems of intent detection and slot filling jointly (Liu and Lane, 2016; Zhang et al., 2018; Xu and Sarikaya, 2013). Supervised intent classification works assume the availability of a large amount of labeled training data for all intents to learn discriminative features, whereas we focus on the more challenging and more practically relevant setting where intents evolve and training data is not available for all intents.
Standard Zero-shot Intent Detection.
The authors in (Yazdani and Henderson, 2015) proposed using label ontologies (Ferreira et al., 2015) (i.e., manually annotated intent attributes) to help a model generalize to unseen intents. The authors in (Dauphin et al., 2013; Kumar et al., 2017; Williams, 2019) map utterances and intents to the same semantic vector space, and then classify utterances based on their proximity to intent labels in that space. Similarly, the authors in (Gangal et al., 2019) employ the outlier detection algorithm LOF (Breunig et al., 2000) and likelihood ratios to identify out-of-domain test examples. While these works showed promising results for intent detection when training data is unavailable for some intents, they assume that all utterances faced at inference time imply unseen intents only. Extending such works to remove this assumption is nontrivial. Our model does not assume knowledge of whether an utterance implies a seen or an unseen intent at inference time.
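The shared-semantic-space idea described above can be sketched as follows. The toy word vectors and the embedding-by-averaging scheme are illustrative assumptions, not the method of any specific cited system; real approaches use pretrained sentence or word embeddings.

```python
# Zero-shot classification by proximity in a shared embedding space:
# embed both utterance and intent labels, pick the closest label.
import numpy as np

TOY_VECS = {  # made-up 3-d word vectors for illustration only
    "play":  np.array([0.9, 0.1, 0.0]),
    "music": np.array([0.8, 0.2, 0.1]),
    "song":  np.array([0.85, 0.15, 0.05]),
    "book":  np.array([0.1, 0.9, 0.0]),
    "rate":  np.array([0.0, 0.8, 0.3]),
    "stars": np.array([0.1, 0.7, 0.4]),
}

def embed(text):
    """Average the word vectors of in-vocabulary words, L2-normalized."""
    vecs = [TOY_VECS[w] for w in text.lower().split() if w in TOY_VECS]
    v = np.mean(vecs, axis=0)
    return v / np.linalg.norm(v)

def classify(utterance, intent_labels):
    """Return the intent label whose embedding is closest (cosine)."""
    u = embed(utterance)
    sims = {label: float(u @ embed(label)) for label in intent_labels}
    return max(sims, key=sims.get)

print(classify("play a song", ["play music", "rate book"]))  # -> play music
```

Because unseen intent labels can be embedded the same way as seen ones, this scheme needs no labeled utterances for unseen intents; its weakness, as noted above, is that it cannot tell seen from unseen at inference time.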
Generalized Zero-shot Intent Detection.
To the best of our knowledge, the authors in (Liu et al., 2019) proposed the first work that specifically targets the GZS intent detection setting. They attempt to make their model generalizable to unseen intents by adding a dimensional attention module to a capsule network and learning generalizable transformation matrices from seen intents. Recently, the authors in (Yan et al., 2020) proposed using the density-based outlier detection algorithm LOF (Breunig et al., 2000) and a semantic-enhanced Gaussian mixture model with a large margin loss to learn class-concentrated embeddings for detecting unseen intents. In contrast, we leverage a rich commonsense knowledge graph to capture deep semantic and discriminative relationships between utterances and intents, which significantly reduces the bias towards classifying unseen intents into seen ones. In a related, but orthogonal, line of research, the authors in (Ma et al., 2019; Li et al., 2020; Gulyaev et al., 2020) addressed the problem of intent detection in the context of dialog state tracking, where the dialog state and conversation history are available in addition to the input utterance. In contrast, this work and the SOTA models we compare against in our experiments only consider an utterance, without access to any dialog state elements.
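The LOF-based unseen-intent detection used by the baselines above can be sketched with scikit-learn's implementation. The embeddings below are synthetic and the hyperparameters are illustrative, not those of the cited systems.

```python
# LOF (Breunig et al., 2000) for unseen-intent detection: fit on embeddings
# of seen-intent training utterances, then flag low-density test points.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(1)
seen_embeddings = rng.normal(loc=0.0, scale=1.0, size=(300, 16))

# novelty=True enables predict() on new data: 1 = inlier, -1 = outlier.
lof = LocalOutlierFactor(n_neighbors=20, novelty=True)
lof.fit(seen_embeddings)

near = rng.normal(loc=0.0, scale=1.0, size=(5, 16))  # looks like seen data
far = rng.normal(loc=8.0, scale=1.0, size=(5, 16))   # far-away cluster

print(lof.predict(near))  # mostly inliers ("seen")
print(lof.predict(far))   # outliers (likely "unseen")
```

Such density-based detectors make a hard seen/unseen routing decision, whereas the relationship meta-features described above let a single model score both kinds of intents directly.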
Conclusion.

We have presented an accurate generalized zero-shot intent detection model. Our extensive experimental analysis on three intent detection benchmarks shows that our model is 30.36% to 58.50% more accurate than the SOTA model for unseen intents. The main novelty of our model is its utilization of relationship meta-features to accurately identify matching utterance-intent pairs with very limited reliance on training data, and without making any assumption on whether utterances imply seen or unseen intents at inference time. Furthermore, our idea of integrating Positive-Unlabeled learning into GZS intent detection models further improves our model's performance, and significantly improves the accuracy of existing models as well.
References
Antoine Bordes, Nicolas Usunier, Alberto Garcia-Duran, Jason Weston, and Oksana Yakhnenko. 2013. Translating embeddings for modeling multi-relational data. In Advances in Neural Information Processing Systems, pages 2787–2795.
Markus M. Breunig, Hans-Peter Kriegel, Raymond T. Ng, and Jörg Sander. 2000. LOF: identifying density-based local outliers. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, pages 93–104.
Daniel Cer, Yinfei Yang, Sheng-yi Kong, Nan Hua, Nicole Limtiaco, Rhomni St John, Noah Constant, Mario Guajardo-Cespedes, Steve Yuan, Chris Tar, et al. 2018. Universal sentence encoder. arXiv preprint arXiv:1803.11175.
Yun-Nung Chen, Dilek Hakkani-Tür, and Xiaodong He. 2016. Zero-shot learning of intent embeddings for expansion by convolutional deep structured semantic models. Pages 6045–6049. IEEE.
Zhiyuan Chen, Bing Liu, Meichun Hsu, Malu Castellanos, and Riddhiman Ghosh. 2013. Identifying intention posts in discussion forums. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1041–1050.
Alice Coucke, Alaa Saade, Adrien Ball, Théodore Bluche, Alexandre Caulier, David Leroy, Clément Doumouro, Thibault Gisselbrecht, Francesco Caltagirone, Thibaut Lavril, et al. 2018. Snips voice platform: an embedded spoken language understanding system for private-by-design voice interfaces. arXiv preprint arXiv:1805.10190.
Yann N. Dauphin, Gokhan Tur, Dilek Hakkani-Tur, and Larry Heck. 2013. Zero-shot learning for semantic utterance classification. arXiv preprint arXiv:1401.0509.
Charles Elkan and Keith Noto. 2008. Learning classifiers from only positive and unlabeled data. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 213–220.
Umar Farooq, A. B. Siddique, Fuad Jamour, Zhijia Zhao, and Vagelis Hristidis. 2020. App-aware response synthesis for user reviews. Pages 699–708.
Rafael Felix, Vijay B. G. Kumar, Ian Reid, and Gustavo Carneiro. 2018. Multi-modal cycle-consistent generalized zero-shot learning. In Proceedings of the European Conference on Computer Vision (ECCV), pages 21–37.
Emmanuel Ferreira, Bassam Jabaian, and Fabrice Lefevre. 2015. Online adaptative zero-shot learning spoken language understanding using word-embedding. Pages 5321–5325. IEEE.
Andrea Frome, Greg S. Corrado, Jon Shlens, Samy Bengio, Jeff Dean, Marc'Aurelio Ranzato, and Tomas Mikolov. 2013. DeViSE: A deep visual-semantic embedding model. In Advances in Neural Information Processing Systems, pages 2121–2129.
Varun Gangal, Abhinav Arora, Arash Einolghozati, and Sonal Gupta. 2019. Likelihood ratios and generative classifiers for unsupervised out-of-domain detection in task oriented dialog. arXiv preprint arXiv:1912.12800.
Pavel Gulyaev, Eugenia Elistratova, Vasily Konovalov, Yuri Kuratov, Leonid Pugachev, and Mikhail Burtsev. 2020. Goal-oriented multi-task BERT-based dialogue state tracker. arXiv preprint arXiv:2002.02450.
Arshit Gupta, John Hewitt, and Katrin Kirchhoff. 2019. Simple, fast, accurate intent classification and slot labeling for goal-oriented dialogue systems. In Proceedings of the 20th Annual SIGdial Meeting on Discourse and Dialogue, pages 46–55.
Seyed Mehran Kazemi and David Poole. 2018. SimplE embedding for link prediction in knowledge graphs. In Advances in Neural Information Processing Systems, pages 4284–4295.
Joo-Kyung Kim, Gokhan Tur, Asli Celikyilmaz, Bin Cao, and Ye-Yi Wang. 2016. Intent detection using semantically enriched word embeddings. Pages 414–419. IEEE.
Anjishnu Kumar, Pavankumar Reddy Muddireddy, Markus Dreyer, and Björn Hoffmeister. 2017. Zero-shot learning across heterogeneous overlapping domains. In INTERSPEECH, pages 2914–2918.
Miao Li, Haoqi Xiong, and Yunbo Cao. 2020. The SPPD system for schema guided dialogue state tracking challenge. arXiv preprint arXiv:2006.09035.
Bing Liu and Ian Lane. 2016. Attention-based recurrent neural network models for joint intent detection and slot filling. arXiv preprint arXiv:1609.01454.
Han Liu, Xiaotong Zhang, Lu Fan, Xuandi Fu, Qimai Li, Xiao-Ming Wu, and Albert Y. S. Lam. 2019. Reconstructing capsule networks for zero-shot intent classification. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4801–4811.
Yue Ma, Zengfeng Zeng, Dawei Zhu, Xuan Li, Yiying Yang, Xiaoyuan Yao, Kaijie Zhou, and Jianping Shen. 2019. An end-to-end dialogue state tracking system with machine reading comprehension and wide & deep classification. arXiv preprint arXiv:1912.09297.
Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proc. of NAACL.
PyTorch. 2020. torch.optim — PyTorch 1.3.0 documentation. https://pytorch.org/docs/stable/optim.html. (Accessed on 11/12/2020).
Abhinav Rastogi, Xiaoxue Zang, Srinivas Sunkara, Raghav Gupta, and Pranav Khaitan. 2019. Towards scalable multi-domain conversational agents: The schema-guided dialogue dataset. arXiv preprint arXiv:1909.05855.
Suman Ravuri and Andreas Stolcke. 2015. Recurrent neural network and LSTM models for lexical utterance classification. In Sixteenth Annual Conference of the International Speech Communication Association.
Sara Sabour, Nicholas Frosst, and Geoffrey E. Hinton. 2017. Dynamic routing between capsules. In Advances in Neural Information Processing Systems, pages 3856–3866.
A. B. Siddique, Samet Oymak, and Vagelis Hristidis. 2020. Unsupervised paraphrasing via deep reinforcement learning. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD '20, pages 1800–1809, New York, NY, USA. Association for Computing Machinery.
A. B. Siddique, Fuad Jamour, and Vagelis Hristidis. 2021. Linguistically-enriched and context-aware zero-shot slot filling. In Proceedings of the Web Conference 2021, New York, NY, USA. Association for Computing Machinery.
Richard Socher, Milind Ganjoo, Christopher D. Manning, and Andrew Ng. 2013. Zero-shot learning through cross-modal transfer. In Advances in Neural Information Processing Systems, pages 935–943.
Robyn Speer, Joshua Chin, and Catherine Havasi. 2016. ConceptNet 5.5: An open multilingual graph of general knowledge. arXiv preprint arXiv:1612.03975.
Robyn Speer, Joshua Chin, and Catherine Havasi. 2017. ConceptNet 5.5: An open multilingual graph of general knowledge.
Théo Trouillon, Johannes Welbl, Sebastian Riedel, Éric Gaussier, and Guillaume Bouchard. 2016. Complex embeddings for simple link prediction. International Conference on Machine Learning (ICML).
Nikhita Vedula, Nedim Lipka, Pranav Maneriker, and Srinivasan Parthasarathy. 2020. Open intent extraction from natural language interactions. In Proceedings of The Web Conference 2020, pages 2009–2020.
Jinpeng Wang, Gao Cong, Wayne Xin Zhao, and Xiaoming Li. 2015. Mining user intents in Twitter: A semi-supervised approach to inferring intent categories for tweets. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, AAAI'15, pages 318–324. AAAI Press.
Kyle Williams. 2019. Zero shot intent classification using long-short term memory networks. Proc. Interspeech 2019, pages 844–848.
Congying Xia, Chenwei Zhang, Xiaohui Yan, Yi Chang, and Philip S. Yu. 2018. Zero-shot user intent detection via capsule neural networks. arXiv preprint arXiv:1809.00385.
Puyang Xu and Ruhi Sarikaya. 2013. Convolutional neural network based triangular CRF for joint intent detection and slot filling. Pages 78–83. IEEE.
Guangfeng Yan, Lu Fan, Qimai Li, Han Liu, Xiaotong Zhang, Xiao-Ming Wu, and Albert Y. S. Lam. 2020. Unknown intent detection using Gaussian mixture model with an application to zero-shot intent classification. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1050–1060.
Majid Yazdani and James Henderson. 2015. A model of zero-shot learning of spoken language understanding. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 244–249.
Xiaoxue Zang, Abhinav Rastogi, Srinivas Sunkara, Raghav Gupta, Jianguo Zhang, and Jindong Chen. 2020. MultiWOZ 2.2: A dialogue dataset with additional annotation corrections and state tracking baselines. In Proceedings of the 2nd Workshop on Natural Language Processing for Conversational AI, pages 109–117, Online. Association for Computational Linguistics.
Chenwei Zhang, Yaliang Li, Nan Du, Wei Fan, and Philip S. Yu. 2018. Joint slot filling and intent detection via capsule neural networks. arXiv preprint arXiv:1812.09471.

A Appendices
This section provides supplementary details on various aspects of this paper. First, we provide more details on the commonsense knowledge graph we use as a source of knowledge on the semantic relatedness of concepts. Then, we describe the specifics of the datasets used in our evaluation and our preprocessing procedures. Finally, we provide details on handling the special case where utterances do not imply intents.
A.1 Knowledge Graph Details
Although creating and maintaining knowledgegraphs is laborious and time consuming, theimmense utility of such graphs has led manyresearchers and institutions to make the effortof building and maintaining knowledge graphsin many domains, which lifts the burden off ofother researchers and developers who utilize thesegraphs. For tasks that involve commonsensereasoning such as generalized zero-shot intentdetection, the ConceptNet (Speer et al., 2016)commonsense knowledge graph stands out as oneof the most popular and freely available resources.ConceptNet originated from the crowdsourcingproject Open Mind Common Sense, and includesknowledge not only from crowdsourced resourcesbut also expert-curated resources. It is availablein core languages, and more common lan-guages. It was employed to show state-of-the-artresults at SemEval 2017 (Speer et al., 2017). In thiswork, we considered relation types to generateour relationship meta-features. The relationtypes are: RelatedT o , F ormOf , IsA , P artOf , HasA , U sedF or , CapableOf , AtLocation , Causes , HasSubevent , HasF irstSubevent , HasLastSubevent , HasP rerequisite , HasP roperty , M otivatedByGoal , ObstructedBy , Desires , CreatedBy , Synonym , Antonym , DistinctF rom , erivedF rom , SymbolOf , Def inedAs , M annerOf , LocatedN ear , HasContext , SimilarT o , EtymologicallyRelatedT o , EtymologicallyDerivedF rom , CausesDesire , M adeOf , ReceivesAction , ExternalU RL , and
Self .The relationship meta-feature generator pro-duces × dimension vector for eachutterance-intent pair. Specifically, we generate rela-tionships: ( i ) from utterance to Object (i.e., Objectpart in intent label); ( ii ) utterance to Action (i.e.,Action part in intent label); and ( iii ) Object to ut-terance; ( iv ) Action to utterance.A knowledge graph may not have redundant, butnecessary information. For example, a knowledgegraph may have the entry (cid:104) movie , IsA, film (cid:105) but not (cid:104) film , IsA, movie (cid:105) or vice-versa, be-cause one triple can be inferred from the otherbased on background knowledge (i.e., symmetricnature of the IsA relation). Similarly, the triple (cid:104) movie , HasA, subtitles (cid:105) can be used toinfer the triple (cid:104) subtitles , P artOf, movie (cid:105) based on the background knowledge (i.e., inverserelation between
HasA and
P artOf ). So, if thiskind of redundant information (i.e., complement-ing entries for all such triples) is not available inthe knowledge graph itself, there is no way forthe model to learn these relationships automati-cally. To overcome this issue, we incorporate thebackground knowledge that each of the relationtypes
IsA , RelatedT o , Synonym , Antonym , DistinctF rom , LocatedN ear , SimilarT o , and
EtymologicallyRelatedT o is symmetric; andthat the relation types
P artOf and
HasA are in-versely related in our link prediction model as de-scribed in (Kazemi and Poole, 2018).
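The completion step described above can be sketched as a closure over the triple set. The triple format and the helper function are illustrative; the relation sets follow the text.

```python
# Add the triples implied by symmetric relations and by the PartOf/HasA
# inverse pair, so complementary entries exist for the link predictor.
SYMMETRIC = {"IsA", "RelatedTo", "Synonym", "Antonym", "DistinctFrom",
             "LocatedNear", "SimilarTo", "EtymologicallyRelatedTo"}
INVERSES = {"PartOf": "HasA", "HasA": "PartOf"}

def complete(triples):
    """Return the input triples plus their symmetric/inverse counterparts."""
    closed = set(triples)
    for head, rel, tail in triples:
        if rel in SYMMETRIC:
            closed.add((tail, rel, head))
        if rel in INVERSES:
            closed.add((tail, INVERSES[rel], head))
    return closed

kg = {("movie", "HasA", "subtitles"), ("film", "SimilarTo", "movie")}
print(sorted(complete(kg)))
```

On the toy graph above, the closure adds ⟨subtitles, PartOf, movie⟩ and ⟨movie, SimilarTo, film⟩, mirroring the examples in the text.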
A.2 Training the Link Predictor
The training data for a link prediction model is prepared as follows. First, the triples in the input knowledge graph are assigned a label of 1. Then, negative examples are generated by corrupting true triples (i.e., modifying the head or tail of existing triples) and assigning them a label of −1 (Bordes et al., 2013). Finally, we train our LP using the generated training data by minimizing the L2-regularized negative log-likelihood loss of the training triples (Trouillon et al., 2016).

A.3 Datasets Preprocessing
The SNIPS Natural Language Understanding benchmark (SNIPS) (Coucke et al., 2018) is a commonly used dataset for intent detection, whereas the Dialogue System Technology Challenge 8 Schema Guided Dialogue dataset (SGD) (Rastogi et al., 2019) and the Multi-Domain Wizard-of-Oz dataset (MultiWOZ) (Zang et al., 2020) were originally proposed for the task of dialogue state tracking. For SGD and MultiWOZ, we perform a few trivial preprocessing steps to extract utterances that contain intents, along with their labels, and use them for the task of generalized zero-shot intent detection. First, we provide the details of the preprocessing steps specific to the SGD and MultiWOZ datasets, and then describe the preprocessing steps common to all datasets.
Steps for SGD and MultiWOZ.
To maintain consistency with previous work on intent detection, we extract only the utterances where the user or system expresses an intent, and discard the rest of the original SGD and MultiWOZ datasets. The dialogue state contains a property "active_intent" that keeps track of the user's current intent. After each user utterance, we compare dialogue states to check for the expression of a new user intent, i.e., whether the value of "active_intent" has been modified; whenever the user expresses a new intent, the value of "active_intent" is updated. Moreover, the bot (i.e., system) sometimes offers new intents to the user (e.g., offering to reserve a table for a user who has successfully searched for a restaurant), which is tracked in the system actions properties "act = OFFER_INTENT" and "values = <new_intent>". We also keep such system utterances.

Common Steps.
We perform some standard preprocessing steps on all the datasets. We use spaCy to tokenize the sentences. Since intent labels are given in the "ActionObject" format, we tokenize them into "Action Object" phrases before feeding them into our model. For example, the intent labels "FindHotel" and "RateBook" are transformed into "Find Hotel" and "Rate Book", respectively. Note that the Object parts of some intent labels are compound. Consider the intent label "SearchOneWayFlight", whose Action is "Search" and Object is "OneWayFlight". In such cases, our relationship meta-features generator computes meta-features for each part of the compound Object and then averages them to produce the Object meta-features vector. In the previous example, the "OneWayFlight" meta-features vector is computed as the average of the meta-features of "OneWay" and "Flight".
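The label tokenization and compound-Object averaging described above can be sketched as below. Note that the simple camel-case regex splits "OneWayFlight" into "One", "Way", and "Flight" rather than the paper's "OneWay" and "Flight"; the averaging works the same way either way, and the meta-feature function is a placeholder.

```python
# Split camel-case intent labels into word parts, and average per-part
# meta-feature vectors for compound Objects.
import re
import numpy as np

def split_label(label):
    """'RateBook' -> ['Rate', 'Book'] via a simple camel-case regex."""
    return re.findall(r"[A-Z][a-z]*", label)

def object_meta_features(object_parts, meta_feature_fn):
    """Average the per-part meta-feature vectors of a compound Object."""
    return np.mean([meta_feature_fn(p) for p in object_parts], axis=0)

print(split_label("FindHotel"))           # -> ['Find', 'Hotel']
print(split_label("SearchOneWayFlight"))  # -> ['Search', 'One', 'Way', 'Flight']

# Placeholder meta-feature lookup for illustration only.
toy = {"Hotel": np.array([1.0, 0.0]), "Flight": np.array([0.0, 1.0])}
print(object_meta_features(["Hotel", "Flight"], lambda p: toy[p]))  # averages to [0.5, 0.5]
```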
Method   SGD      MultiWOZ
CNN      0.9497   0.9512
GRU

Table 4: F1 score for intent existence binary classifiers.
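The SGD/MultiWOZ extraction described in "Steps for SGD and MultiWOZ" above can be sketched as follows. The dictionary layout loosely mirrors the SGD annotation format; field names other than "active_intent" and "OFFER_INTENT" are simplified placeholders, not the dataset's exact schema.

```python
# Keep a user utterance whenever the dialogue state's "active_intent"
# changes, plus system utterances that carry an OFFER_INTENT action.
def extract_intent_utterances(dialogue):
    examples = []
    prev_intent = None
    for turn in dialogue:
        if turn["speaker"] == "USER":
            intent = turn["state"]["active_intent"]
            if intent != prev_intent and intent != "NONE":
                examples.append((turn["utterance"], intent))
            prev_intent = intent
        else:  # SYSTEM turn: keep offers of new intents
            for act, value in turn.get("actions", []):
                if act == "OFFER_INTENT":
                    examples.append((turn["utterance"], value))
    return examples

dialogue = [
    {"speaker": "USER", "utterance": "Find me an Italian restaurant",
     "state": {"active_intent": "FindRestaurant"}},
    {"speaker": "SYSTEM", "utterance": "Want me to reserve a table?",
     "actions": [("OFFER_INTENT", "ReserveRestaurant")]},
    {"speaker": "USER", "utterance": "Yes, book it for two",
     "state": {"active_intent": "ReserveRestaurant"}},
]
print(extract_intent_utterances(dialogue))
```

Each extracted (utterance, intent) pair then goes through the common preprocessing steps (spaCy tokenization and intent-label splitting) before being used for generalized zero-shot intent detection.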