Zero-shot User Intent Detection via Capsule Neural Networks
Congying Xia*, Chenwei Zhang*, Xiaohui Yan, Yi Chang, Philip S. Yu
Department of Computer Science, University of Illinois at Chicago, Chicago, IL 60607 USA
Huawei Technologies, San Jose, CA 95050 USA
College of Artificial Intelligence, Jilin University, Changchun, China
College of Computer Science and Technology, Jilin University, Changchun, China
Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, China
{cxia8,czhang99,psyu}@uic.edu, [email protected], [email protected]

Abstract
User intent detection plays a critical role in question-answering and dialog systems. Most previous works treat intent detection as a classification problem where utterances are labeled with predefined intents. However, it is labor-intensive and time-consuming to label users' utterances, as intents are diversely expressed and novel intents will continually be involved. Instead, we study the zero-shot intent detection problem, which aims to detect emerging user intents where no labeled utterances are currently available. We propose two capsule-based architectures: INTENTCAPSNET, which extracts semantic features from utterances and aggregates them to discriminate existing intents, and INTENTCAPSNET-ZSL, which gives INTENTCAPSNET the zero-shot learning ability to discriminate emerging intents via knowledge transfer from existing intents. Experiments on two real-world datasets show that our model not only can better discriminate diversely expressed existing intents, but is also able to discriminate emerging intents when no labeled utterances are available.

* Indicates equal contribution. Previously available at http://doi.org/10.13140/RG.2.2.11739.46889

Introduction

With the increasing complexity and accuracy of speech recognition technology, companies are striving to deliver intelligent conversation understanding systems as people interact with software agents that run on speaker devices or smart phones via a natural language interface (Hoy, 2018). Products like Apple's Siri, Amazon's Alexa and Google Assistant are able to interpret human speech and respond to it via synthesized voices. With recent developments in deep neural networks, user intent detection models (Hu et al., 2009; Xu and Sarikaya, 2013; Zhang et al., 2016;
Liu and Lane, 2016; Chen et al., 2016b) have been proposed to classify user intents given their diversely expressed utterances in natural language. The decent performances on intent detection usually come with deep neural network classifiers optimized on large-scale utterances which are human-labeled among existing predefined user intents.

As more features and skills are being added to devices, expanding their capabilities to new programs, it is common for voice assistants to encounter the scenario where no labeled utterance of an emerging user intent is available in the training data, as illustrated in Figure 1. Current intent detection methods train classifiers in a supervised fashion, and they are good at discriminating existing intents such as GetWeather and PlayMusic whose labeled utterances are already available. However, these models, by the nature of their designs, are incapable of detecting utterances of emerging intents like AddToPlaylist and RateABook, since no labeled utterances are available. Moreover, it is labor-intensive and time-consuming to annotate utterances of emerging intents and retrain the whole intent detection model.

Thus, it is imperative to develop intent detection models with the zero-shot learning (ZSL) ability (Lampert et al., 2014; Socher et al., 2013; Changpinyo et al., 2016): the ability to expand classifiers and the intent detection space beyond the existing intents, of which we have labeled utterances during training, to emerging intents, of which no labeled utterances are available.

The research on zero-shot intent detection is still in its infancy. Previous zero-shot learning methods for intent detection utilize external resources such as label ontologies (Ferreira et al., 2015a,b) or manually defined attributes that describe intents (Yazdani and Henderson, 2015) to associate existing and emerging intents, which require extra annotation.

Figure 1: Illustration of the proposed INTENTCAPSNET-ZSL model for zero-shot intent detection: labeled utterances with existing intents like GetWeather and PlayMusic are used to train an intent detection classifier among existing intents, in which SemanticCaps extract interpretable semantic features and DetectionCaps dynamically aggregate semantic features for intent detection using a novel routing-by-agreement mechanism. For emerging intents, INTENTCAPSNET-ZSL builds Zero-shot DetectionCaps that utilize (1) the outputs of SemanticCaps, (2) the routing information on existing intents from DetectionCaps, and (3) similarities of the emerging intent label to existing intent labels, in order to discriminate emerging intents like AddToPlaylist from RateABook. Solid lines indicate the training process and dashed lines indicate the zero-shot inference process.

Compatibility-based methods for zero-shot intent detection (Chen et al., 2016a; Kumar et al., 2017) assume the capability of learning a high-quality mapping from the utterance to its intent directly, so that such mapping can be further capitalized to measure the compatibility of an utterance with emerging intents. However, the diverse semantic expressions may impede the learning of such a mapping.

In this work, we make the very first attempt to tackle the zero-shot intent detection problem with a capsule-based (Hinton et al., 2011; Sabour et al., 2017) model. A capsule houses a vector representation of a group of neurons: the orientation of the vector encodes properties of an object (like the shape/color of a face), while the length of the vector reflects its probability of existence (how likely a face with certain properties exists). The capsule model learns a hierarchy of feature detectors via a routing-by-agreement mechanism: capsules for detecting low-level features (like nose/eyes) send their outputs to high-level capsules (such as faces) only when there is a strong agreement of their predictions with the high-level capsules.

The aforementioned properties of capsule models could be quite appealing for text modeling, specifically, in this case, modeling the user utterance for intent detection: low-level semantic features such as the get action, the time and the city name collectively contribute to a more abstract intent (GetWeather). A semantic feature, which may be expressed quite differently among users, can contribute more to one intent than to others. The dynamic routing-by-agreement mechanism can be used to dynamically assign a proper contribution to each semantic feature and aggregate them to get an intent representation.

More importantly, we discover the potential of the zero-shot learning ability of the capsule model, which is not yet widely recognized. It makes the capsule model even more suitable for text modeling when no labeled utterances are available for emerging intents. The ability to neglect the disagreed output of low-level semantics for certain intents during routing-by-agreement encourages the learning of generalizable semantic features that can be adapted to emerging intents. For each emerging intent with no labeled utterances, a Zero-shot DetectionCaps is constructed explicitly by using not only the semantic features that SemanticCaps extract, but also the existing routing agreements from DetectionCaps and the similarities of the emerging intent label to existing intent labels.

In summary, the contributions of this work are:
• Expanding capsule neural networks to text modeling, by extracting and aggregating semantics from utterances in a hierarchical manner;
• Proposing a novel and effective capsule-based model for zero-shot intent detection;
• Showing and interpreting the effectiveness of our model on two real-world datasets.
Problem Formulation

In this section, we first define related concepts, and formally state the problem.

Intent. An intent is a purpose, or a goal, that underlies a user-generated utterance (Watson Assistant, 2017). An utterance can be associated with one or multiple intents. We only consider the basic case in which an utterance has a single intent. However, utterances with multiple intents can be handled by segmenting them into single-intent snippets using sequential tagging tools like CRF (Lafferty et al., 2001), which we leave for future work.

Intent Detection. Given a labeled training dataset where each sample has the format (x, y), where x is an utterance and y is its intent label, each training example is associated with one of K existing intents y ∈ Y = {y_1, y_2, ..., y_K}. The intent detection task tries to associate an utterance x_existing with its correct intent category among the existing intent classes Y.

Zero-shot Intent Detection. Given the labeled training set {(x, y)} where y ∈ Y, the zero-shot intent detection task aims to detect an utterance x_emerging which belongs to one of L emerging intents z ∈ Z = {z_1, z_2, ..., z_L}, where Y ∩ Z = ∅.

Proposed Models

We propose two architectures based on capsule models: INTENTCAPSNET, which is trained to discriminate among utterances with existing labels, e.g., existing intents for intent detection; and INTENTCAPSNET-ZSL, which gives zero-shot learning ability to INTENTCAPSNET for discriminating unseen labels, i.e., emerging intents in this case. As shown in Figure 2, the cores of the proposed architectures are three types of capsules: SemanticCaps that extract interpretable semantic features from the utterance, DetectionCaps that aggregate semantic features for intent detection, and Zero-shot DetectionCaps that discriminate emerging intents.

SemanticCaps. In the original capsule model (Sabour et al., 2017), convolution-based PrimaryCaps are introduced as the first layer to obtain different vectorized features from the raw input image. In this work, an intrinsically similar motivation is adopted to extract different semantic features from the raw utterance by a new type of capsule named SemanticCaps. Unlike the PrimaryCaps, which use convolution operators with a large reception field to extract spatially proximate features, the SemanticCaps are based on a bi-directional recurrent neural network with multiple self-attention heads, where each self-attention head focuses on a certain part of the utterance and extracts a semantic feature that may not be expressed by words in proximity.

Given an input utterance x = (w_1, w_2, ..., w_T) of T words, each word is represented by a vector
of dimension D_W that can be pre-trained using a skip-gram language model (Mikolov et al., 2013). A recurrent neural network such as a bidirectional LSTM (Hochreiter and Schmidhuber, 1997) is applied to sequentially encode the utterance into hidden states:

  →h_t = LSTM_fw(w_t, →h_{t-1}),   ←h_t = LSTM_bw(w_t, ←h_{t+1}).   (1)

For each word w_t, we concatenate the forward hidden state →h_t obtained from LSTM_fw with the backward hidden state ←h_t from LSTM_bw to obtain a hidden state h_t for the word w_t. The whole hidden state matrix can be defined as H = (h_1, h_2, ..., h_T) ∈ R^{T×D_H}, where D_H is the number of hidden units in each LSTM.

Figure 2: The architecture of INTENTCAPSNET and INTENTCAPSNET-ZSL. During training, utterances with existing intents are fed into the SemanticCaps, which output vectorized semantic features, i.e., semantic vectors. Then DetectionCaps combine these features into higher-level prediction vectors and output an activation vector for intent detection on each existing intent. During inference, emerging utterances take advantage of the SemanticCaps trained in INTENTCAPSNET to extract semantic features from the utterance (shown in 1); then the vote vectors on the existing intents are transferred to emerging intents (shown in 2) using similarities between existing and emerging intents (shown in 3). The obtained activation vectors for emerging intents are used for zero-shot intent detection.

Inspired by the success of self-attention mechanisms (Vaswani et al., 2017; Lin et al., 2017) for sentence embedding, we adopt a multi-head self-attention framework where each self-attention head is encouraged to be attentive to a specific semantic feature of the utterance, such as certain sets of keywords or phrases: one self-attention head may be attentive to the "get" action in GetWeather, while another one may be attentive to the city name in
GetWeather: each head decides for itself what semantics to be attentive to.

A self-attention weight matrix A is computed as:

  A = softmax(W_{s2} tanh(W_{s1} H^T)),   (2)

where W_{s1} ∈ R^{D_A×D_H} and W_{s2} ∈ R^{R×D_A} are weight matrices for the self-attention. D_A is the hidden unit number of the self-attention and R is the number of self-attention heads. The softmax function makes sure that, for each self-attention head, the attentive scores over all the words sum to one.

A total of R semantic features are extracted from the input utterance, each from a separate self-attention head: M = AH, where M = (m_1, m_2, ..., m_R) ∈ R^{R×D_H}. Each m_r is a D_H-dimensional semantic vector.

Each semantic vector will have a distinguishable orientation when the objective is properly regularized (details in Equation 6), as we want each attention head to be attentive to a unique semantic feature of the utterance. The vector representation adopted in capsules is suitable to portray the low-level semantic properties as well as the high-level intents of the utterance, where the orientation of a vector represents semantic/intent properties that may slightly vary depending on the expressions. The capsule encourages the learning of generalizable semantic vectors: less informative semantic properties for one intent may not be penalized by their orientations; they simply possess small norms, as they are less likely to exist.

The output of SemanticCaps is a set of low-level vector representations of R different semantic features extracted from the utterance.
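To make Eq. (2) concrete, here is a minimal NumPy sketch of the multi-head self-attention step inside SemanticCaps; the hidden states H are assumed to come from a BiLSTM encoder, and all sizes and weights below are illustrative placeholders, not the paper's trained settings.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # numerically stable
    return e / e.sum(axis=axis, keepdims=True)

def semantic_caps(H, W_s1, W_s2):
    """Eq. (2): A = softmax(W_s2 tanh(W_s1 H^T)); semantic vectors M = A H.

    H:    (T, D_H)  BiLSTM hidden states of a T-word utterance
    W_s1: (D_A, D_H), W_s2: (R, D_A) self-attention weight matrices
    """
    A = softmax(W_s2 @ np.tanh(W_s1 @ H.T), axis=-1)  # (R, T), rows sum to 1
    M = A @ H                                         # (R, D_H) semantic vectors m_r
    return M, A

# Toy usage with illustrative sizes (T words, R heads).
rng = np.random.default_rng(0)
T, D_H, D_A, R = 9, 16, 8, 3
H = rng.normal(size=(T, D_H))
M, A = semantic_caps(H, rng.normal(size=(D_A, D_H)), rng.normal(size=(R, D_A)))
```

Each row of A is one head's distribution over the T words, so the softmax guarantees the per-head attention scores sum to one, exactly as stated below Eq. (2).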
DetectionCaps. To combine these features into higher-level representations, we build DetectionCaps that choose different semantic features dynamically so as to form an intent representation for each intent via an unsupervised routing-by-agreement mechanism.

As a semantic feature may contribute differently to detecting different intents, the DetectionCaps first encode semantic features with respect to each intent:

  p_{k|r} = m_r W_{k,r},   (3)

where k ∈ {1, 2, ..., K} and r ∈ {1, 2, ..., R}. W_{k,r} ∈ R^{D_H×D_P} is a weight matrix of the DetectionCaps, p_{k|r} is the prediction vector of the r-th semantic feature for an existing intent k, and D_P is the dimension of the prediction vector.

Dynamic Routing-by-agreement.
The prediction vectors obtained from SemanticCaps are routed dynamically to DetectionCaps. The DetectionCaps compute a weighted sum over all prediction vectors:

  s_k = Σ_{r=1}^{R} c_{kr} p_{k|r},   (4)

where c_{kr} is the coupling coefficient that determines how informative, or how much contribution, the r-th semantic feature is to the intent y_k. c_{kr} is calculated by an unsupervised, iterative dynamic routing-by-agreement algorithm (Sabour et al., 2017), which is briefly recalled in Algorithm 1. As shown in this algorithm, b_{kr} is the initial logit representing the log prior probability that a SemanticCap r is coupled to a DetectionCap k.

Algorithm 1: Dynamic routing algorithm

  procedure DYNAMICROUTING(p_{k|r}, iter)
    for all SemanticCaps r and DetectionCaps k: b_{kr} ← 0
    for iter iterations do
      for all SemanticCaps r: c_r ← softmax(b_r)
      for all DetectionCaps k: s_k ← Σ_r c_{kr} p_{k|r}
      for all DetectionCaps k: v_k ← squash(s_k)
      for all SemanticCaps r and DetectionCaps k: b_{kr} ← b_{kr} + p_{k|r} · v_k
    end for
    return v_k
  end procedure

The squashing function squash(·) is applied on s_k to get an activation vector v_k for each existing intent class k:

  v_k = (||s_k||^2 / (1 + ||s_k||^2)) · (s_k / ||s_k||),   (5)

where the orientation of the activation vector v_k represents intent properties, while its norm indicates the activation probability. The dynamic routing-by-agreement mechanism assigns a low c_{kr} when there is inconsistency between p_{k|r} and v_k, which ensures that the outputs of the SemanticCaps get sent to appropriate subsequent DetectionCaps.
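As a runnable illustration of Eqs. (3)-(5) and Algorithm 1, the sketch below implements the prediction-vector transform and routing-by-agreement in NumPy; the shapes and random inputs are illustrative only, not the trained model.

```python
import numpy as np

def squash(s):
    """Eq. (5): v = (||s||^2 / (1 + ||s||^2)) * s / ||s||, applied row-wise."""
    n = np.linalg.norm(s, axis=-1, keepdims=True)
    return (n ** 2 / (1.0 + n ** 2)) * s / (n + 1e-9)

def dynamic_routing(p, n_iter=3):
    """Algorithm 1. p[k, r] = p_{k|r}, shape (K, R, D_P).

    Returns activation vectors v (K, D_P) and coupling coefficients c (K, R).
    """
    K, R, _ = p.shape
    b = np.zeros((K, R))                                 # routing logits b_kr
    for _ in range(n_iter):
        e = np.exp(b - b.max(axis=0, keepdims=True))
        c = e / e.sum(axis=0, keepdims=True)             # softmax over intents k, per head r
        s = np.einsum('kr,krd->kd', c, p)                # Eq. (4): s_k = sum_r c_kr p_{k|r}
        v = squash(s)                                    # Eq. (5)
        b = b + np.einsum('krd,kd->kr', p, v)            # agreement update: b_kr += p_{k|r} . v_k
    return v, c

# Eq. (3): p_{k|r} = m_r W_{k,r}, one (D_H, D_P) matrix per (intent, head) pair.
rng = np.random.default_rng(1)
K, R, D_H, D_P = 4, 3, 16, 10
M = rng.normal(size=(R, D_H))               # semantic vectors from SemanticCaps
W = rng.normal(size=(K, R, D_H, D_P)) * 0.1
p = np.einsum('rh,krhd->krd', M, W)         # prediction vectors
v, c = dynamic_routing(p)
```

Note that the squash nonlinearity keeps every ||v_k|| strictly below one, so the norm can be read directly as an activation probability.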
Max-margin Loss for Existing Intents. The loss function considers both the max-margin loss on each labeled utterance and a regularization term that encourages each self-attention head to be attentive to a different semantic feature of the utterance:

  L = Σ_{k=1}^{K} { [[y = y_k]] · max(0, m+ − ||v_k||)² + λ [[y ≠ y_k]] · max(0, ||v_k|| − m−)² } + α ||AA^T − I||²_F,   (6)

where [[·]] is an indicator function, y is the ground-truth intent label for the utterance x, λ is a down-weighting coefficient, and m+ and m− are margins. α is a non-negative trade-off coefficient for the regularizer, which encourages discrepancies among different attention heads.

Zero-shot DetectionCaps. To detect emerging intents effectively, Zero-shot DetectionCaps are designed to transfer knowledge from existing intents to emerging intents.
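A small sketch of Eq. (6), using the margin and coefficient values quoted later in the implementation details (m+ = 0.9, m− = 0.1, λ = 0.5); the squaring of the margin terms and of the Frobenius penalty follows the convention of Sabour et al. (2017) and Lin et al. (2017), since the extracted equation's superscripts are not legible.

```python
import numpy as np

def intent_loss(v_norms, y, A=None, m_pos=0.9, m_neg=0.1, lam=0.5, alpha=0.01):
    """Eq. (6): max-margin loss over K existing intents, plus the optional
    self-attention discrepancy regularizer alpha * ||A A^T - I||_F^2.

    v_norms: (K,) activation-vector norms ||v_k||; y: gold intent index;
    A: optional (R, T) self-attention matrix.
    """
    onehot = np.eye(len(v_norms))[y]
    loss = np.sum(onehot * np.maximum(0.0, m_pos - v_norms) ** 2
                  + lam * (1.0 - onehot) * np.maximum(0.0, v_norms - m_neg) ** 2)
    if A is not None:
        loss += alpha * np.sum((A @ A.T - np.eye(A.shape[0])) ** 2)
    return loss

# Correct intent 0 is strongly activated, intent 1 barely: only the
# negative-margin term fires, 0.5 * max(0, 0.2 - 0.1)^2 = 0.005.
loss = intent_loss(np.array([0.95, 0.2]), y=0)
```

The positive term pushes the correct intent's activation norm above m+, while the down-weighted negative term pushes all other norms below m−.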
Knowledge Transfer Strategies. As SemanticCaps are trained to extract semantic features from utterances with various existing intents, a self-attention head which has similar extraction behavior on existing and emerging intents may help transfer knowledge. For example, a self-attention head that extracts the "play" action mentioned by turn on / I want to hear in the beginning of an utterance for PlayMusic is helpful if it is also attentive to expressions for the "add" action like add / I want to have in the beginning of an utterance with an emerging intent AddToPlaylist.

The coupling coefficient c_{kr}, learned by DetectionCaps in a totally unsupervised fashion, embodies rich knowledge of how informative the r-th semantic feature is to the existing intent k. We can capitalize on this existing routing information for emerging intents. For example, how the word play routes to GetWeather can be helpful in routing the word add to AddToPlaylist.

The intent labels also contain knowledge of how similar two intents are to each other. For example, an emerging intent AddToPlaylist can be closer to the existing intent PlayMusic than to GetWeather, due to the proximity of the embedding of Playlist to Play or Music, rather than to Weather.

Build Vote Vectors.
As the routing information and the semantic extraction behavior are strongly coupled (c_{kr} is calculated from p_{k|r} iteratively in Algorithm 1), and their products are summed to obtain the activation vector v_k for intent k, we denote the vectors before summation as vote vectors:

  g_{k,r} = c_{kr} p_{k|r},   (7)

where g_{k,r} is the r-th vote vector for an existing intent k.

Zero-shot Dynamic Routing.
The zero-shot dynamic routing utilizes vote vectors from existing intents to build intent representations for emerging intents via a similarity metric between existing and emerging intents.

Since there are K existing intents and L emerging intents, the similarities between existing and emerging intents form a matrix Q ∈ R^{L×K}. Specifically, the similarity between an emerging intent z_l ∈ Z and an existing intent y_k ∈ Y is computed as:

  q_{lk} = exp{−d(e_{z_l}, e_{y_k})} / Σ_{k'=1}^{K} exp{−d(e_{z_l}, e_{y_k'})},   (8)

where

  d(e_{z_l}, e_{y_k}) = (e_{z_l} − e_{y_k})^T Σ^{−1} (e_{z_l} − e_{y_k}).   (9)

e_{z_l}, e_{y_k} ∈ R^{D_I×1} are intent embeddings computed as the sum of the word embeddings of the intent label. Σ models the correlations among intent embedding dimensions, and we use Σ = σ²I, where σ is a hyper-parameter for scaling. The prediction vectors for emerging intents are thus computed as:

  u_{l|r} = Σ_{k=1}^{K} q_{lk} g_{k,r}.   (10)

We feed the prediction vectors u_{l|r} to Algorithm 1 and derive activation vectors n_l on the emerging intents as the output. The final intent representation n_l for each emerging intent is updated toward the direction where it coincides with the representative vote vectors. We can then easily classify an utterance of emerging intents by choosing the activation vector with the largest norm: ẑ = argmax_{z_l ∈ Z} ||n_l||.

To demonstrate the effectiveness of our proposed models, we apply INTENTCAPSNET to detect existing intents in an intent detection task, and use INTENTCAPSNET-ZSL to detect emerging intents in a zero-shot intent detection task.
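Putting Eqs. (7)-(10) together, the zero-shot inference step can be sketched end-to-end as below; the intent embeddings, vote vectors and all sizes are random placeholders rather than trained values, and the routing step re-runs Algorithm 1 on the transferred prediction vectors.

```python
import numpy as np

def squash(s):
    n = np.linalg.norm(s, axis=-1, keepdims=True)
    return (n ** 2 / (1.0 + n ** 2)) * s / (n + 1e-9)

def route(p, n_iter=3):
    """Routing-by-agreement (Algorithm 1); p has shape (C, R, D)."""
    b = np.zeros(p.shape[:2])
    for _ in range(n_iter):
        e = np.exp(b - b.max(axis=0, keepdims=True))
        c = e / e.sum(axis=0, keepdims=True)
        v = squash(np.einsum('cr,crd->cd', c, p))
        b = b + np.einsum('crd,cd->cr', p, v)
    return v

def zero_shot_detect(g, E_exist, E_emerg, sigma=1.0):
    """Eqs. (8)-(10): transfer vote vectors g_{k,r} (K, R, D_P) to L emerging intents.

    E_exist (K, D_I), E_emerg (L, D_I): label embeddings (sums of word vectors).
    Returns the predicted emerging-intent index for this utterance.
    """
    diff = E_emerg[:, None, :] - E_exist[None, :, :]
    d = (diff ** 2).sum(-1) / sigma ** 2                # Eq. (9) with Sigma = sigma^2 I
    q = np.exp(-d)
    q /= q.sum(axis=1, keepdims=True)                   # Eq. (8): rows sum to one
    u = np.einsum('lk,krd->lrd', q, g)                  # Eq. (10): u_{l|r}
    n = route(u)                                        # activation vectors n_l
    return int(np.argmax(np.linalg.norm(n, axis=-1)))   # z_hat = argmax_l ||n_l||

rng = np.random.default_rng(2)
K, L, R, D_P, D_I = 5, 2, 3, 10, 20
pred = zero_shot_detect(rng.normal(size=(K, R, D_P)),
                        rng.normal(size=(K, D_I)), rng.normal(size=(L, D_I)))
```

No emerging-intent parameters are learned here: everything comes from the trained vote vectors and the label-embedding similarities, which is exactly what makes the detection zero-shot.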
Datasets. For each task, we evaluate our proposed models by applying them on two real-world datasets: the SNIPS Natural Language Understanding benchmark (SNIPS-NLU, https://github.com/snipsco/nlu-benchmark/) and a Commercial Voice Assistant (CVA) dataset. The statistical information on the two datasets is shown in Table 2. SNIPS-NLU is an English natural language corpus collected in a crowdsourced fashion to benchmark the performance of voice assistants. CVA is a Chinese natural language corpus collected anonymously from a commercial voice assistant on smart phones.

Table 1: Intent detection results using INTENTCAPSNET on two datasets. All the metrics (accuracy, precision, recall and F1) are reported as the average value weighted by the support of each class. (Values marked — were lost in extraction.)

  Model                    SNIPS-NLU (on 5 existing intents)           CVA (on 80 existing intents)
                           Accuracy  Precision  Recall  F1             Accuracy  Precision  Recall  F1
  TFIDF-LR                 0.9546    0.9551     0.9546  0.9545         0.7979    0.8104     0.7979  0.7933
  TFIDF-SVM                0.9584    0.9586     0.9584  0.9581         0.7989    0.8111     0.7989  0.7942
  CNN                      0.9595    0.9596     0.9595  0.9595         0.8223    0.8288     0.8223  0.8210
  RNN                      0.9516    0.9522     0.9516  0.9518         0.8286    0.8330     0.8286  0.8275
  GRU                      0.9535    0.9535     0.9535  0.9534         0.8239    0.8281     0.8239  0.8216
  LSTM                     0.9569    0.9573     0.9569  0.9569         0.8319    0.8387     0.8319  0.8306
  Bi-LSTM                  0.9501    0.9502     0.9501  0.9502         0.8428    0.8479     0.8428  0.8419
  Self-attention Bi-LSTM   0.9524    0.9522     0.9524  0.9522         0.8521    0.8590     0.8521  0.8513
  INTENTCAPSNET            —         —          —       —              —         —          —       —

Table 2: Dataset statistics.

  Dataset                       SNIPS-NLU   CVA
  Vocab Size                    10,896      1,709
  Number of Samples             13,802      9,992
  Average Sentence Length       9.05        4
  Number of Existing Intents    5           80
  Number of Emerging Intents    2           20
Baselines.
We first compare the proposed capsule-based model INTENTCAPSNET with other text classification alternatives on the detection of existing intents: 1) TFIDF-LR/TFIDF-SVM: we use TF-IDF to represent the utterance and use logistic regression/support vector machines as classifiers. 2) CNN: a convolutional neural network (Kim, 2014) that uses convolution and pooling operations, which is popular for text classification. 3) RNN/GRU/LSTM/Bi-LSTM: we adopt different types of recurrent neural networks: the vanilla recurrent neural network (RNN), gated recurrent units (GRU) (Tang et al., 2015), long short-term memory networks (LSTM) (Hochreiter and Schmidhuber, 1997), and bi-directional long short-term memory (Bi-LSTM) (Schuster and Paliwal, 1997); their last hidden states are used for classification. 4) Self-attention Bi-LSTM: we apply a Bi-LSTM model with the self-attention mechanism (Lin et al., 2017), and the output sentence embedding is used for classification.

We also compare our proposed model INTENTCAPSNET-ZSL with different zero-shot learning strategies: 1) DeViSE (Frome et al., 2013) finds the most compatible emerging intent label for an utterance by learning a linear compatibility function between utterances and intents; 2) CMT (Socher et al., 2013) introduces non-linearity in the compatibility function. CMT and DeViSE were originally designed for zero-shot image classification based on pretrained CNN features; we use an LSTM to encode the utterance and adopt their zero-shot learning strategies in our task. 3) CDSSM (Chen et al., 2016a) uses a CNN to extract character-level sentence features, where the utterance encoder shares its weights with the label encoder; 4) Zero-shot DNN (Kumar et al., 2017) further improves the performance of CDSSM by using separate encoders for utterances and intents.

The proposed model INTENTCAPSNET-ZSL can be seen as a hybrid model: it has the advantage of the compatibility models in modeling the correlations between utterances and intents directly; it also explicitly derives intent representations for emerging intents without labeled utterances.

Table 3: Hyperparameter settings.

  Dataset     D_W   D_H   D_A   R   σ   α
  SNIPS-NLU   300   32    20    3   4   0.0001
  CVA         200   200   100   8   1   0.01
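For reference, the settings in Table 3, together with the fixed values from the implementation details, can be collected into a single configuration sketch; the key names below are our own shorthand, not identifiers from any released codebase.

```python
# Hypothetical consolidation of Table 3 plus the fixed settings quoted in
# the Implementation Details paragraph; names are ours, not the authors'.
HPARAMS = {
    "SNIPS-NLU": dict(D_W=300, D_H=32, D_A=20, R=3, sigma=4, alpha=1e-4,
                      dropout_keep=0.8),   # input dropout used on SNIPS-NLU only
    "CVA":       dict(D_W=200, D_H=200, D_A=100, R=8, sigma=1, alpha=0.01),
}
SHARED = dict(D_P=10, lam=0.5, m_pos=0.9, m_neg=0.1, routing_iters=3)
```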
Implementation Details.
The hyperparameters used for the experiments are shown in Table 3. We use three-fold cross-validation to choose hyperparameters. The dimension of the prediction vector D_P is 10 for both datasets. D_I = D_W, because we use the summed word embeddings contained in the intent label as the intent embedding. An additional input dropout layer with a dropout keep rate of 0.8 is applied to the SNIPS-NLU dataset. In the loss function, the down-weighting coefficient λ is 0.5, and the margins m+ and m− are set to 0.9 and 0.1 for all the existing intents. The iteration number iter used in the dynamic routing algorithm is 3. The Adam optimizer (Kingma and Ba, 2014) is used to minimize the loss.

Table 4: Zero-shot intent detection results using INTENTCAPSNET-ZSL on two datasets. All the metrics (accuracy, precision, recall and F1) are reported as the average value weighted by the support of each class. (Values marked — were lost in extraction.)

  Model                                  SNIPS-NLU (on 2 emerging intents)      CVA (on 20 emerging intents)
                                         Accuracy  Precision  Recall  F1        Accuracy  Precision  Recall  F1
  DeViSE (Frome et al., 2013)            0.7447    0.7448     0.7447  0.7446    0.7809    0.8060     0.7809  0.7617
  CMT (Socher et al., 2013)              0.7396    —          —       —         —         —          —       —
  CDSSM (Chen et al., 2016a)             —         —          —       —         —         —          —       —
  Zero-shot DNN (Kumar et al., 2017)     —         —          —       —         —         —          —       —
  INTENTCAPSNET-ZSL w/o Self-attention   0.7587    0.7764     0.7588  0.7547    0.8103    0.8512     0.8103  0.8115
  INTENTCAPSNET-ZSL w/o Bi-LSTM          0.7619    0.7631     0.7619  0.7616    0.8366    —          —       —
  INTENTCAPSNET-ZSL w/o Regularizer      0.7675    0.7676     0.7675  0.7675    0.8544    0.8730     0.8544  0.8553
  INTENTCAPSNET-ZSL                      —         —          —       —         —         —          —       —

Quantitative Evaluation.
The intent detection results on the two datasets are reported in Table 1, where the proposed capsule-based model INTENTCAPSNET performs consistently better than the bag-of-words classifiers using TF-IDF, as well as the various neural network models designed for text classification. These results demonstrate the novelty and effectiveness of the proposed capsule-based model INTENTCAPSNET in modeling text for intent detection.

We also report results on the zero-shot intent detection task in Table 4, where our model INTENTCAPSNET-ZSL outperforms the other baselines that adopt different zero-shot learning strategies. CMT has higher precision but lower accuracy and recall on the SNIPS-NLU dataset. CDSSM fails on the CVA dataset, probably because the character-level model is suitable for an English corpus but not for CVA, which is in Chinese.

Ablation Study.
To study the contribution of different modules of INTENTCAPSNET-ZSL for zero-shot intent detection, we also report ablation test results in Table 4. "w/o Self-attention" is the model without self-attention: the last forward/backward hidden states of the Bi-LSTM recurrent encoder are used; "w/o Bi-LSTM" uses the LSTM with only a forward pass; "w/o Regularizer" does not encourage discrepancies among different self-attention heads: it adopts α = 0 in the loss function. Generally, from the lower part of Table 4 we can see that all modules contribute to the effectiveness of the model. On the SNIPS-NLU dataset, each of the three modules has a comparable contribution to the whole model (around 2-3% improvement in F1 score), while on the CVA dataset the self-attention plays the most important role, giving the model a 5.2% improvement in F1 score.

Discriminative Emerging Intent Representations. Besides the quantitative evidence supporting the effectiveness of INTENTCAPSNET-ZSL, we visualize activation vectors of emerging intents in Figure 3. Since the activation vectors of utterances with emerging intents are of high dimension, and we are interested in their orientations, which indicate their intent properties, t-SNE is applied on the normalized activation vectors to reduce the dimension to 2. We color the utterances according to their ground-truth emerging intent labels.

Figure 3: t-SNE visualization of normalized activation vectors of utterances with 20 emerging intents in CVA.
As illustrated in Figure 3, INTENTCAPSNET-ZSL has the ability to learn discriminative intent representations for emerging intents in the Zero-shot DetectionCaps, so that utterances with different intents naturally have different orientations. Meanwhile, utterances of the same emerging intent but with nuances in expression end up in proximity in the t-SNE space. However, we do observe less satisfying cases where the model mistakes an emerging intent DecreaseScreenBrightness (No. 9) for ReduceFontSize (No. 10) and SetColdColor (No. 11). When we check the activation vectors of these intents in Figure 3, we also find that they tend to have similar representations around the area (15, -5). We think this is due to their inherent similarity, as these three intents all try to tune display configurations.
Capsule models try to bring more interpretability compared with traditional deep neural networks. We provide case studies here on the interpretability of the proposed model in 1) extracting meaningful semantic features and 2) transferring knowledge from existing intents to emerging intents.
Extracting Meaningful Semantic Features.
To show that SemanticCaps have the ability to extract meaningful semantic features from the utterance, we study the self-attention matrix A within the SemanticCaps and visualize the attention scores of utterances with both existing and emerging intents.

Table 5: Attention heads on utterances with existing intents on SNIPS-NLU.

  Existing Intent: PlayMusic
  • Play Action:
      play music by charlie adams from
      i want to hear any tune from twenties
      open up music on last fm
  • Musician Name:
      i want to hear music by madeleine peyroux from on youtube
      play me a song by charles neidich
      use itunes to play artist ringo shiina track in heaven

  Existing Intent: SearchCreativeWork
  • Search Action:
      find fields of sacrifice movie
      i m looking for music of nashville season saga
      show me television show children in need rocks
  • Creative Work Type:
      please find me platinum box ii song ?
      show me a picture called heart like a hurricane
      where can i buy a photograph called feel love ?
From Table 5 we can see that each self-attention head almost always focuses on one unique semantic feature of the utterance. For example, in the intent PlayMusic, one self-attention head always focuses on the "play" action while another head focuses on musician names. We also observe that the learned attention adapts well to diverse expressions. For example, the self-attention head in PlayMusic is attentive to various mentions of musician names when they follow words like by, play and artist, even when named entities are not tagged and given to the model. The self-attention head that extracts the "search" action in SearchCreativeWork is able to be attentive to various expressions such as find, looking for and show.

Extraction-behavior Transfer by SemanticCaps. More importantly, we observe appealing extraction behaviors of SemanticCaps on utterances of emerging intents as well, even though they were not trained to perform semantic extraction on utterances of emerging intents.
Emerging Intent: RateBook
• Rate Action
  i d rate this novel a five
  add the rating for this current series a four out of points
  i give ruled britannia a rating of five out of
• Book Name
  give the televised morality series a one
  i want to give the coming of the terraphiles a rating of
  the chronicle charlie peace earns stars from me
• Rating Score
  rate the grisly wife three points out of five
  i would give this current chronicle three points
  this saga deserves a score of four

Emerging Intent: AddToPlaylist
• Song/Artist Name
  add star light star bright to my jazz classics playlist
  i want a song by john schlitt in the bajo las estrellas playlist
  put sungmin into my summer playlist
• Playlist Name
  add an album to my list la mejor música dance
  can you add danny carey to my masters of metal playlist
  i want to put a copy of this tune into skatepark punks

Table 6: Attentions on utterances with emerging intents on SNIPS-NLU.
From Table 6 we observe that the same self-attention head that extracts the "play" action in the existing intent PlayMusic is also attentive to words or phrases referring to the "rate" action in the emerging intent RateBook, such as rate, add the rating, and give. Other self-attention heads almost always focus on other aspects of the utterances, such as the book name or the actual rating score. Such behavior not only shows that SemanticCaps have the capacity to learn an intent-independent semantic feature extractor, which extracts generalizable semantic features that either existing or emerging intent representations are built upon, but also indicates that SemanticCaps have the ability to transfer extraction behaviors among utterances of different intents.

Knowledge Transfer via Intent Similarity.
Besides extracting semantic features and utilizing existing routing information, we use similarities between intent embeddings to help transfer vote vectors from IntentCapsNet to IntentCapsNet-ZSL. We study the similarity distribution of each emerging intent to all existing intents in Figure 4.

Figure 4: Accuracy vs. variance of the similarity distribution for 20 emerging intents in the CVA dataset.
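The statistic var(q_l) plotted in Figure 4 can be computed directly from the similarity distribution of an emerging intent to the existing intents. The sketch below is a minimal illustration: the softmax normalization of cosine similarities between intent embeddings is our assumption about the exact form of q_l, and the embeddings are random placeholders, not the trained model's.

```python
import numpy as np

def similarity_distribution(emb_emerging, emb_existing):
    """Cosine similarities of one emerging-intent embedding to all
    existing-intent embeddings, softmax-normalized into a distribution q_l."""
    e = emb_emerging / np.linalg.norm(emb_emerging)
    E = emb_existing / np.linalg.norm(emb_existing, axis=1, keepdims=True)
    sims = E @ e                       # one cosine similarity per existing intent
    q = np.exp(sims) / np.exp(sims).sum()
    return q

# toy example: one emerging intent vs. 4 existing intents, embedding dim 6
rng = np.random.default_rng(1)
q_l = similarity_distribution(rng.normal(size=6), rng.normal(size=(4, 6)))
var_q = q_l.var()  # a peaked distribution yields a larger var(q_l)
assert np.isclose(q_l.sum(), 1.0)
```

A near-uniform q_l gives var(q_l) close to zero (the model is unsure which existing intents to transfer from), while a peaked q_l gives a larger variance.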
The y axis is the zero-shot detection accuracy on each emerging intent in the CVA dataset. The x axis measures var(q_l), the variance of the similarity distribution of each emerging intent l to all the existing intents. If an emerging intent has a high variance in the similarity distribution, it means that some existing intents have higher similarities with this emerging intent than others: the model is more certain about which existing intents to transfer the similarity knowledge from, based on intent label similarities. In this case, 13 out of 20 emerging intents with high variances, where var(q_l) > …, always have a decent performance (Accuracy ≥ …).

In this paper, a capsule-based model, namely IntentCapsNet, is first introduced to harness the advantages of capsule models for text modeling in a hierarchical manner: semantic features are extracted from the utterances with self-attention, and aggregated via the dynamic routing-by-agreement mechanism to obtain utterance-level intent representations. We believe that the inductive biases subsumed in such a capsule-based hierarchical learning schema have broader applicability to various text modeling tasks, beyond its evidenced performance on the intent detection task studied in this paper. The proposed IntentCapsNet-ZSL model further introduces zero-shot learning ability to the capsule model via various means of knowledge transfer from existing intents, for discriminating emerging intents where no labeled utterances or excessive external resources are available. Experiments on two real-world datasets show the effectiveness and interpretability of the proposed models.

We thank the reviewers for their valuable comments. This work is supported in part by NSF through grants IIS-1526499, IIS-1763325, and CNS-1626432, and NSFC 61672313. Xiaohui Yan's work is funded by the National Natural Science Foundation of China (NSFC) under Grant No. 61502447.
References
Soravit Changpinyo, Wei-Lun Chao, Boqing Gong, and Fei Sha. 2016. Synthesized classifiers for zero-shot learning. In CVPR, pages 5327–5336.
Yun-Nung Chen, Dilek Hakkani-Tür, and Xiaodong He. 2016a. Zero-shot learning of intent embeddings for expansion by convolutional deep structured semantic models. In ICASSP, pages 6045–6049.
Yun-Nung Chen, Dilek Hakkani-Tür, Gökhan Tür, Jianfeng Gao, and Li Deng. 2016b. End-to-end memory networks with knowledge carryover for multi-turn spoken language understanding. In INTERSPEECH, pages 3245–3249.
Emmanuel Ferreira, Bassam Jabaian, and Fabrice Lefèvre. 2015a. Online adaptative zero-shot learning spoken language understanding using word-embedding. In ICASSP, pages 5321–5325.
Emmanuel Ferreira, Bassam Jabaian, and Fabrice Lefèvre. 2015b. Zero-shot semantic parser for spoken language understanding. In INTERSPEECH, pages 1403–1407.
Andrea Frome, Greg S Corrado, Jon Shlens, Samy Bengio, Jeff Dean, Tomas Mikolov, et al. 2013. DeViSE: A deep visual-semantic embedding model. In NIPS, pages 2121–2129.
Geoffrey E Hinton, Alex Krizhevsky, and Sida D Wang. 2011. Transforming auto-encoders. In ICANN, pages 44–51.
Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.
Matthew B Hoy. 2018. Alexa, Siri, Cortana, and more: An introduction to voice assistants. Medical Reference Services Quarterly, 37(1):81–88.
Jian Hu, Gang Wang, Fred Lochovsky, Jian-tao Sun, and Zheng Chen. 2009. Understanding user's query intent with Wikipedia. In WWW, pages 471–480.
Yoon Kim. 2014. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882.
Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Anjishnu Kumar, Pavankumar Reddy Muddireddy, Markus Dreyer, and Björn Hoffmeister. 2017. Zero-shot learning across heterogeneous overlapping domains. In INTERSPEECH, pages 2914–2918.
John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML, pages 282–289.
Christoph H Lampert, Hannes Nickisch, and Stefan Harmeling. 2014. Attribute-based classification for zero-shot visual object categorization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(3):453–465.
Zhouhan Lin, Minwei Feng, Cicero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou, and Yoshua Bengio. 2017. A structured self-attentive sentence embedding. In ICLR.
Bing Liu and Ian Lane. 2016. Attention-based recurrent neural network models for joint intent detection and slot filling. In INTERSPEECH, pages 685–689.
Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
Sara Sabour, Nicholas Frosst, and Geoffrey E Hinton. 2017. Dynamic routing between capsules. In NIPS, pages 3859–3869.
Mike Schuster and Kuldip K Paliwal. 1997. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11):2673–2681.
Richard Socher, Milind Ganjoo, Christopher D Manning, and Andrew Ng. 2013. Zero-shot learning through cross-modal transfer. In NIPS, pages 935–943.
Duyu Tang, Bing Qin, and Ting Liu. 2015. Document modeling with gated recurrent neural network for sentiment classification. In EMNLP, pages 1422–1432.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NIPS, pages 6000–6010.
IBM Watson Assistant. 2017. Defining intents. https://console.bluemix.net/docs/services/conversation/intents.html.
Puyang Xu and Ruhi Sarikaya. 2013. Convolutional neural network based triangular CRF for joint intent detection and slot filling. In ASRU, pages 78–83.
Majid Yazdani and James Henderson. 2015. A model of zero-shot learning of spoken language understanding. In EMNLP, pages 244–249.
Chenwei Zhang, Wei Fan, Nan Du, and Philip S Yu. 2016. Mining user intentions from medical queries: A neural network based heterogeneous jointly modeling approach. In