Improving Device Directedness Classification of Utterances with Semantic Lexical Features
Kellen Gillespie†, Ioannis C. Konstantakopoulos†, Xingzhi Guo⋆, Vishal Thanvantri Vasudevan†, Abhinav Sethy†

† Amazon.com, Inc.    ⋆ Stony Brook University
ABSTRACT
User interactions with personal assistants like Alexa, Google Home and Siri are typically initiated by a wake term or wakeword. Several personal assistants feature "follow-up" modes that allow users to make additional interactions without the need of a wakeword. For the system to only respond when appropriate, and to ignore speech not intended for it, utterances must be classified as device-directed or non-device-directed. State of the art systems have largely used acoustic features for this task, while others have used only lexical features or have added LM-based lexical features. We propose a directedness classifier that combines semantic lexical features with a lightweight acoustic feature and show it is effective in classifying directedness. The mixed-domain lexical and acoustic feature model is able to achieve a 14% relative reduction of EER over a state of the art acoustic-only baseline model. Finally, we successfully apply transfer learning and semi-supervised learning to the model to improve accuracy even further.
Index Terms: Directedness, Semantic Classification, LSTM, Semi-supervised Learning, Word Embeddings
1. INTRODUCTION
Personal assistants such as Alexa, Google Home, and Siri are becoming commonplace in modern households [1]. These systems respond to user commands prefixed by a wake command or wakeword, e.g. "Alexa, what time is it?" or "Hey Google, what's the weather?". This wakeword acts as a signal that the utterance is intended for the assistant, in other words device-directed.

To make consecutive commands more natural and conversational, personal assistants have added "follow-up" modes that allow users to give consecutive commands without repeated use of the wakeword. Recent examples include Baidu DuerOS's "Full-Duplex" feature, Google Home's "Continued Conversation" feature and Amazon Alexa's "Follow-up Mode" [2, 3, 4]. These features allow for more natural conversations, as shown in Table 1.

STANDARD MODE                        FOLLOW-UP MODE
COMPUTER, WHAT'S THE WEATHER?        COMPUTER, WHAT'S THE WEATHER?
(DEVICE RESPONDS)                    (DEVICE RESPONDS)
COMPUTER, WHAT ABOUT TOMORROW?       WHAT ABOUT TOMORROW?
(DEVICE RESPONDS)                    (DEVICE RESPONDS)
COMPUTER, THANK YOU                  THANK YOU
Table 1. More Natural Conversation with Follow-up Modes

To achieve this behavior, the device's microphone is automatically re-opened after the device responds to the user. Users can then give device-directed follow-up commands without the need of a wakeword. However, if a user has no follow-up commands to give, the system could pick up speech not intended for it, or non-device-directed speech. Non-device-directed speech includes background speech, speech from media sources, and any other utterance not intended for the device. To alleviate this problem, and to avoid responding to non-device-directed utterances, virtual assistants employ directedness classifiers to separate utterances intended for the device from those that are not.

Directedness classification is an important task in many systems. Many such systems use acoustic features to determine directedness. In the personal assistant space, Mallidi et al. [5] leverage ASR decoder features as well as acoustic embeddings from the audio signal to perform directedness classification for one of the aforementioned follow-up modes. We use this work as the acoustic-only baseline in our evaluations. Another system in the personal assistant space makes use of acoustic features and improves performance through the addition of an attention mechanism [6]. In related smart-environment applications, Reich et al. [7] use ASR decoder and prosodic features to perform directedness classification, and Paek et al. [8] combine signal and user behavior characteristics to determine directedness in a continuous-listening presentation domain.

Other systems rely on language features to determine directedness [9]. In a simulated planetary exploration scenario, Dowding et al. [10] develop a system to help robotic members of mixed human-robot teams determine whether utterances were intended for them or for other members. Shriberg et al. [11] add LM-based lexical features and contextual similarity features to supplement acoustic features in a Human-Human-Computer dialog environment.

Our proposed model will leverage signals from both the acoustic and lexical domains. Other systems that have combined these domains have used LM-based or character-level lexical features [5, 11]. We propose a system that combines acoustic features with word-level semantic lexical features to improve directedness classification. We further improve the model with contextual features and attention mechanisms [6, 11]. To reduce the data labeling burden required by the model, we use transfer learning from a related dataset and experiment with semi-supervised learning techniques.

The paper is organized as follows. In Section 2, we look at the follow-up dataset to motivate the features and architectures to be evaluated. In Section 3, we describe our baseline feed-forward model as well as our proposed LSTM architectures. In Section 4, we evaluate the discussed models and features and discuss the results. In Section 5, we examine the effectiveness of a simple semi-supervised learning technique on our model training. In Section 6, we discuss our overall findings and plans to improve our model.
2. DATA ANALYSIS
The primary feature of our proposed model is token-level semantic embeddings of the current turn utterance text, derived from pre-trained word embeddings. Sentence embeddings can typically be derived by averaging token-level embeddings, but in this section we examine the data to identify additional features and architectures to develop a better model.
2.1. Sentence Structure

Many non-device-directed utterances in our data, due to factors such as containing multiple speakers or ASR errors, show a lack of sentence structure. Table 2 shows some examples of structured and unstructured utterances from the non-device-directed class.

STRUCTURED                          UNSTRUCTURED
DID YOU REORDER YOUR PILLS          IT'S IT'S JUST FLASHING YELLOW
I DON'T KNOW WHAT SHE JUST ORDER    CAN YOU HEY COMPUTER
WELL YOU GOT FOUR MORE HOURS        BREAK AT A BIGGER
MOM WHAT DID YOU SAY                TELL US MOVING
WHAT ARE YOU DOING                  WEATHER TALKING ABOUT HAL

Table 2. Structured and Unstructured Utterances Misidentified as Device-directed

A model that accounts for word order, and thus indirectly models grammatical structure, can be expected to learn to reject such unstructured utterances as non-device-directed. For this reason, we expect a recurrent model such as an LSTM to outperform a baseline DNN model on this task.
2.2. Contextual Relevance

While the current turn utterance can normally provide most of the information needed to perform directedness classification, there is a subset of our dataset that contains contextually-relevant utterances. We give a few examples in Table 3.

PREVIOUS TURN                               CURRENT TURN
COMPUTER ADD BANANAS TO MY SHOPPING LIST    ADD CHICKEN
COMPUTER WHAT'S THE WEATHER FOR TODAY       AND FOR TOMORROW
COMPUTER WHO IS ADELE?                      PLAY
COMPUTER ARE YOU UNIQUE?                    ARE YOU ONE OF A KIND?

Table 3. Utterances with Relevant Previous Turns

Looking solely at the current turn, these examples may be difficult to classify. With the context of the previous turn, however, they more clearly become interpretable as device-directed utterances. For this reason, we experiment with incorporating the previous turn text into the features to aid in classification.
2.3. Textual Ambiguity

While we expect semantic lexical features to aid in directedness classification, they do not on their own solve the problem of textual ambiguity present in our data. Many utterances appear in both classes; in fact, 29.7% of the dataset consists of utterances belonging to both classes. We highlight some examples in Table 4, showing the most frequently occurring utterances belonging to both classes. For each utterance, we look at class frequencies to determine the prior probability of it belonging to each class.

UTTERANCE       P(DD)    P(NDD)
THANK YOU
STOP
OKAY
CANCEL
WHAT
NEXT
GOOD NIGHT
PLAY

Table 4. Textual ambiguity of utterances common to device-directed (DD) and non-device-directed (NDD) classes.

It is clear that for many frequent utterances, text-only input will be insufficient for distinguishing between classes. For this reason, we will fuse the lexical features with acoustic features to assist in separating these ambiguous cases. Our acoustic feature will be token-level confidence scores from an upstream ASR decoding component [12]. We also expect a subset of these cases to be assisted by the contextual features described in Section 2.2.
3. CANDIDATE MODELS

3.1. Feature Representation
We experiment with three features for our candidate models: the ASR-produced text of the current turn, the text of the previous turn, and token-level ASR confidence scores. The turns are separated by special tokens similar to those used in other systems [13]. We concatenate the lexical features, the word embeddings of each token, with their corresponding acoustic feature in the form of the ASR confidence scores. For pre-trained embeddings we use fastText wiki-news word vectors [14] with an embedding size of 300. The only token embedding we specifically train for our data is an out-of-vocabulary or OOV token, to better represent the OOV tendencies of personal assistant commands, while all other embeddings are left as-is. Due to the trained OOV token we do not make use of subword vectors in fastText. Figure 1 shows the overall input representation of our features.

Fig. 1. Input Feature Representations
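As a concrete illustration, the sketch below builds per-token input vectors under stated assumptions: 300-dimensional fastText vectors, a single learned OOV vector, a separator token between the previous and current turn, and one ASR confidence score appended to each embedding. The names (build_input_frames, the <sep> and <oov> tokens, the neutral confidence for previous-turn tokens) are illustrative, not the authors' code.

```python
import numpy as np

EMB_DIM = 300          # fastText wiki-news vectors are 300-dimensional
SEP = "<sep>"          # assumed special token separating previous / current turn
OOV = "<oov>"          # the one embedding trained specifically for this data

def build_input_frames(prev_tokens, curr_tokens, curr_confidences,
                       embeddings, oov_vector, sep_vector):
    """Return a (num_tokens x 301) matrix: word embedding + ASR confidence.

    Previous-turn tokens carry no ASR confidence here, so they are padded with
    a neutral score of 1.0 (an assumption; the paper does not specify this).
    """
    tokens = prev_tokens + [SEP] + curr_tokens
    confs = [1.0] * len(prev_tokens) + [1.0] + list(curr_confidences)

    frames = []
    for tok, conf in zip(tokens, confs):
        emb = sep_vector if tok == SEP else embeddings.get(tok, oov_vector)
        frames.append(np.concatenate([emb, [conf]]))
    return np.stack(frames)                          # shape: (T, EMB_DIM + 1)

# Toy usage with random vectors standing in for real fastText embeddings.
rng = np.random.default_rng(0)
embeddings = {w: rng.normal(size=EMB_DIM) for w in
              ["computer", "what's", "the", "weather", "and", "for", "tomorrow"]}
oov_vec, sep_vec = rng.normal(size=EMB_DIM), rng.normal(size=EMB_DIM)

frames = build_input_frames(["computer", "what's", "the", "weather"],
                            ["and", "for", "tomorrow"],
                            [0.92, 0.88, 0.95],
                            embeddings, oov_vec, sep_vec)
print(frames.shape)   # (8, 301)
```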
3.2. Average Embedding DNN

The non-recurrent proposed model is a fully-connected feed-forward network that uses a word-averaged sentence representation of the utterance as input. The sentence embedding is simply the average of the token features shown in Figure 1. The DNN has 3 layers: an input layer, a dense hidden layer of size 150, and an output layer.
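A minimal PyTorch sketch of this average-embedding baseline, assuming the 301-dimensional token frames from the previous sketch; everything beyond the stated 150-unit hidden layer (activation choice, padding handling) is an assumption.

```python
import torch
import torch.nn as nn

class AvgEmbeddingDNN(nn.Module):
    """Feed-forward baseline: mean-pool token features, then classify."""
    def __init__(self, feat_dim=301, hidden_dim=150, num_classes=2):
        super().__init__()
        self.hidden = nn.Linear(feat_dim, hidden_dim)
        self.output = nn.Linear(hidden_dim, num_classes)

    def forward(self, frames):              # frames: (batch, T, feat_dim)
        # Average over tokens; padded positions would need a mask in practice.
        sent = frames.mean(dim=1)           # -> (batch, feat_dim)
        return self.output(torch.relu(self.hidden(sent)))

# Example: a batch of 4 utterances, 8 tokens each.
logits = AvgEmbeddingDNN()(torch.randn(4, 8, 301))
print(logits.shape)                         # torch.Size([4, 2])
```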
3.3. LSTM

The LSTM model [15] makes use of the sequential information in the input features. For this reason, the token-level features in Figure 1 are used as-is, with each token representing an input frame. The LSTM has 3 layers of size 150, followed by a single dense layer and an output layer. Bidirectional LSTM (BLSTM) models were tested but did not improve over the single-direction versions.
3.4. LSTM with Attention

Attention mechanisms are a common addition to LSTM architectures in neural translation and even in directedness classification tasks [16, 6]. Instead of using the final LSTM hidden state as the sentence embedding, the attention model performs a soft selection over the hidden states of the entire sequence to generate a sentence embedding. The attention mechanism is a simple affine layer activated with a tanh function before being passed to a softmax selection layer.
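A sketch of the attention variant under the same assumptions: a 3-layer, 150-unit LSTM over the 301-dimensional token frames, an affine-plus-tanh scoring layer, and a softmax that weights the hidden states into a sentence embedding. This is one reading of the description above, not the authors' implementation.

```python
import torch
import torch.nn as nn

class AttnLSTMClassifier(nn.Module):
    def __init__(self, feat_dim=301, hidden_dim=150, num_layers=3, num_classes=2):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, num_layers, batch_first=True)
        self.attn_score = nn.Linear(hidden_dim, 1)      # affine layer for attention
        self.dense = nn.Linear(hidden_dim, hidden_dim)
        self.output = nn.Linear(hidden_dim, num_classes)

    def forward(self, frames):                          # frames: (batch, T, feat_dim)
        states, _ = self.lstm(frames)                   # (batch, T, hidden_dim)
        scores = torch.tanh(self.attn_score(states))    # (batch, T, 1)
        weights = torch.softmax(scores, dim=1)          # soft selection over time steps
        sent = (weights * states).sum(dim=1)            # attention-weighted sentence embedding
        return self.output(torch.relu(self.dense(sent)))

logits = AttnLSTMClassifier()(torch.randn(4, 8, 301))
print(logits.shape)                                     # torch.Size([4, 2])
```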
4. EXPERIMENTATION
Our dataset consists of follow-up utterances with ground truth annotations of device-directed or non-device-directed. The dataset contains roughly 240K utterances and is split into train, dev, and test partitions with 200K, 20K, and 20K utterances, respectively. The class split is roughly 5:1 in favor of the device-directed class in each partition, as non-device-directed speech is far less frequent.
While the dataset described above is not small, there is far more data available for first-turn interactions. We pre-train all models on a very large (2M utterance) ASR false-wake dataset that contains first-turn device-directed speech and non-device-directed false-wake speech. False-wake speech consists of instances where the device mistakenly heard the wakeword. While first-turn and false-wake data are not exactly like follow-up mode data, the tasks are related enough that we consistently see improvement when pre-training on this data. We then fine-tune the models on the proper follow-up dataset.
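A minimal sketch of the pre-train/fine-tune recipe. The stand-in model, toy data loaders, learning rates, and epoch counts are placeholders; in practice the model would be the LSTM classifier and the loaders would hold the false-wake and follow-up datasets.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def run_epochs(model, loader, lr, epochs):
    """Plain SGD / cross-entropy loop used for both pre-training and fine-tuning."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for frames, labels in loader:
            optimizer.zero_grad()
            loss_fn(model(frames), labels).backward()
            optimizer.step()
    return model

# Stand-in model and toy data (random tensors shaped like the token frames).
model = nn.Sequential(nn.Flatten(), nn.Linear(8 * 301, 150), nn.ReLU(), nn.Linear(150, 2))
false_wake_loader = DataLoader(TensorDataset(torch.randn(64, 8, 301),
                                             torch.randint(0, 2, (64,))), batch_size=16)
follow_up_loader = DataLoader(TensorDataset(torch.randn(32, 8, 301),
                                            torch.randint(0, 2, (32,))), batch_size=16)

# Pre-train on the large first-turn/false-wake data with a higher learning rate,
# then fine-tune the same weights on the follow-up data with a smaller one.
model = run_epochs(model, false_wake_loader, lr=0.1, epochs=2)
model = run_epochs(model, follow_up_loader, lr=0.01, epochs=5)
```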
The models were trained using SGD with the cross-entropy loss function. We allowed for a decaying learning rate, and chose maximum and minimum learning rates using LR range tests [17]. For transfer learning, we pre-train with higher learning rates and fine-tune with smaller learning rates.
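For reference, a compact sketch of an LR range test in the spirit of [17]: the learning rate is swept geometrically from a minimum to a maximum candidate value over a fixed number of mini-batches while the loss is recorded, and the usable range is read off where the loss is still decreasing. The bounds and step count below are placeholders, not the paper's actual training configuration; the function expects any PyTorch model and DataLoader.

```python
import torch

def lr_range_test(model, loader, loss_fn, lr_min=1e-5, lr_max=1.0, num_steps=200):
    """Sweep the learning rate geometrically and record (lr, loss) pairs."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr_min)
    gamma = (lr_max / lr_min) ** (1.0 / num_steps)   # multiplicative LR step
    history, lr = [], lr_min
    data_iter = iter(loader)
    for _ in range(num_steps):
        try:
            frames, labels = next(data_iter)
        except StopIteration:                         # recycle the loader if exhausted
            data_iter = iter(loader)
            frames, labels = next(data_iter)
        for group in optimizer.param_groups:          # apply the current learning rate
            group["lr"] = lr
        optimizer.zero_grad()
        loss = loss_fn(model(frames), labels)
        loss.backward()
        optimizer.step()
        history.append((lr, loss.item()))
        lr *= gamma
    return history   # inspect where the loss starts rising to pick the max LR
```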
Table 5 shows the initial evaluation results for the candidate models, as well as an ablation test for our best model. We compare with the acoustic-only baseline model [5], but due to implementation and feature differences this model was trained once on a combination of our pre-training and follow-up datasets. The DNN model is compared with and without the previous turn feature, as its average sentence embedding is unable to distinguish between turns. For the best model, we perform an ablation study to measure feature contributions and compare accuracy with and without transfer learning. Since our classes are imbalanced, we utilize the equal error rate (EER) metric.

MODEL                          FEATURES    EER %
ACOUSTIC-ONLY BASELINE [5] †   –           10.6
AVG-DNN                        C, T
AVG-DNN                        C, P, T
LSTM                           C, P, T
LSTM+ATTN                      C, P, T
LSTM+ATTN                      C
LSTM+ATTN                      P
LSTM+ATTN                      T
LSTM+ATTN-NOTL                 C, P, T

Table 5. Model evaluation with feature ablation for the best model. C: current utterance text, P: previous utterance text, T: token-level ASR confidences, †: trained in a different manner.

The LSTM is able to outperform the average embedding model by a significant margin. The LSTM model with attention performs even better, though the improvement is much smaller. The best model with all features is able to reduce EER by 14% relative over the baseline model based on acoustic features.

The ablation study with the best model shows that all features add complementary predictive power to the model, albeit at different magnitudes. The token-level ASR confidences are the most important feature, quickly followed by the current turn text. The previous turn is less critical, likely because only a subset of the dataset contains contextually-relevant utterances. Finally, our best model reduces EER by 14% relative over the same model trained without transfer learning (LSTM+Attn-NoTL).
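Since EER is the headline metric, the short sketch below shows one common way to compute it from model scores: scan candidate thresholds and report the rate where the false-accept and false-reject rates cross. This is a generic implementation for illustration, not tied to the paper's evaluation code.

```python
import numpy as np

def equal_error_rate(scores, labels):
    """EER for binary scores (higher score = more device-directed).

    labels: 1 for device-directed, 0 for non-device-directed.
    Returns the rate at the threshold where FAR and FRR are closest.
    """
    scores, labels = np.asarray(scores, float), np.asarray(labels, int)
    pos, neg = (labels == 1).sum(), (labels == 0).sum()
    best_gap, eer = np.inf, 1.0
    for t in np.unique(scores):
        far = ((scores >= t) & (labels == 0)).sum() / neg   # false accepts
        frr = ((scores < t) & (labels == 1)).sum() / pos    # false rejects
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer

# Toy example with well-separated classes: EER should be 0.
print(equal_error_rate([0.9, 0.8, 0.7, 0.3, 0.2], [1, 1, 1, 0, 0]))
```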
5. SEMI-SUPERVISED LEARNING
Labeling data is time-consuming and costly. Semi-supervised learning approaches allow a model trained on labeled data to continue training on unlabeled data. Many approaches exist, but for simplicity we first investigate a self-teaching methodology. We use two models to aid in labeling, similar to a co-training approach [18]. In this section we describe our method and the results in detail.
We employ the self-teaching semi-supervised learning approach. In this method, a model is first trained in a standard supervised fashion with labeled data L. Next, the trained model performs inference on a large set of unlabeled data U. The unlabeled cases where the model is highly confident, U_L, are labeled, removed from the unlabeled data and folded into the training set. The model is then retrained with the new dataset L + U_L, and the process is repeated until a stopping criterion is met.

We expand on the method by combining the scores of two distinct models, our proposed model and the acoustic-only baseline model, to reduce risk by labeling only those cases where the models agree. We apply a non-linear transformation to the acoustic model score to lower its contribution to the overall score, notably in high-confusion posterior ranges.
Our unlabeled dataset U consists of 500K utterances. We select the LSTM without contextual features or attention; the unlabeled data does not contain previous turns, and the attention model, likely due to hyperparameter issues, did not achieve good results during semi-supervised training. At each iteration we label the highest 1% of scores as DD and the lowest 0.2% of scores as NDD, to maintain the original class priors of the labeled dataset.

The stopping criterion is met once dev loss is no longer improving, in this case after 20 passes of semi-supervised training. We choose pass 15, the point of lowest dev loss, as our selected model. We evaluate this model, as well as several other versions, on the original test set and compare to the original model. Table 6 shows test set EER and dev set loss over selected passes during semi-supervised training. Test EER improves 5% relative due to semi-supervised learning, and our stopping point based on the dev loss achieves the best EER.

Table 6. Semi-supervised model evaluation (test EER % and dev set loss by semi-supervised pass, with the chosen stopping point marked)
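The self-teaching loop described above can be summarized in a few lines of Python. The thresholds (top 1% labeled DD, bottom 0.2% labeled NDD) and the lowest-dev-loss selection rule come from the text; the train, score, and dev_loss helpers are caller-supplied placeholders, and in the paper the score would be a fusion of the lexical model and the acoustic baseline, whose exact transformation is not specified. The toy usage at the bottom only demonstrates that the loop runs.

```python
import random

def self_teach(labeled, unlabeled, train, score, dev_loss,
               max_passes=20, dd_frac=0.01, ndd_frac=0.002):
    """Self-teaching loop: each pass, label the most confident unlabeled cases
    (top dd_frac as device-directed, bottom ndd_frac as non-device-directed),
    fold them into the training set, retrain, and track dev loss.

    train(data) -> model, score(model, utt) -> DD posterior, dev_loss(model) -> float.
    """
    model = train(labeled)
    history = [(0, dev_loss(model), model)]
    for p in range(1, max_passes + 1):
        ranked = sorted(unlabeled, key=lambda u: score(model, u), reverse=True)
        n_dd, n_ndd = int(dd_frac * len(ranked)), int(ndd_frac * len(ranked))
        labeled = labeled + [(u, 1) for u in ranked[:n_dd]] \
                          + [(u, 0) for u in ranked[len(ranked) - n_ndd:]]
        unlabeled = ranked[n_dd:len(ranked) - n_ndd]
        model = train(labeled)
        history.append((p, dev_loss(model), model))
    # Select the pass with the lowest dev loss (the stopping rule used above).
    return min(history, key=lambda h: h[1])

# Toy usage: "utterances" are scalars, the "model" is the mean of positive examples.
random.seed(0)
labeled = [(random.random(), 1) for _ in range(50)] + [(random.random() - 1, 0) for _ in range(10)]
unlabeled = [random.random() - 0.5 for _ in range(1000)]
train = lambda data: sum(u for u, y in data if y == 1) / max(1, sum(y for _, y in data))
score = lambda m, u: -abs(u - m)
dev_loss = lambda m: abs(m - 0.5)
best_pass, loss, _ = self_teach(labeled, unlabeled, train, score, dev_loss, max_passes=5)
print(best_pass, round(loss, 3))
```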
6. CONCLUSIONS & FUTURE WORK
We have proposed a directedness classifier for spoken utterances based on acoustic and semantic lexical features. We have shown that token-level semantic information, informed by word embeddings, can classify utterances with reasonable accuracy. Supplementing these lexical features with a lightweight acoustic feature and contextual information improves accuracy even further. Our results show that a recurrent model, such as an LSTM with attention, can utilize these features to reduce EER by 14% relative over an acoustic-only baseline model. Finally, we show that transfer learning and semi-supervised techniques can further improve the model without increasing the burden of data labeling.

Our future work will be along several dimensions. First, we plan to investigate more model architectures, such as CNN-LSTMs, FLSTMs, and Transformer-style models. For semi-supervised learning, we plan to experiment with much larger unlabeled datasets and more advanced graph-based techniques for exploration during training. We further plan to leverage contextual features and attention in our semi-supervised experimentation.

7. REFERENCES

[1] Ruhi Sarikaya, "The technology behind personal digital assistants: An overview of the system architecture and key components," IEEE Signal Processing Magazine, vol. 34, no. 1, pp. 67–81, Jan. 2017.

[2] Kyle Wiggers, "Baidu's DuerOS voice platform is now on 400 million devices," July 2019, [Online; posted 2-July-2019].

[3] Andrew Gebhart, "Google Home Hub listens longer with Continued Conversation," March 2019, [Online; posted 8-March-2019].

[4] Nick Statt, "Amazon adds Follow-up Mode for Alexa to let you make back-to-back requests," March 2018, [Online; posted 9-March-2018].

[5] Sri Harish Reddy Mallidi, Roland Maas, Kyle Goehner, Ariya Rastrow, Spyros Matsoukas, and Björn Hoffmeister, "Device-directed utterance detection," in Proc. Annual Conference of the International Speech Communication Association (INTERSPEECH), 2018, arXiv:1808.02504.

[6] Atta Norouzian, Bogdan Mazoure, Dermot Connolly, and Daniel Willett, "Exploring attention mechanism for acoustic-based classification of speech utterances into system-directed and non-system-directed," CoRR, vol. abs/1902.00570, 2019.

[7] Daniel S. Reich, Felix Putze, Dominic Heger, Joris IJsselmuiden, Rainer Stiefelhagen, and Tanja Schultz, "A real-time speech command detector for a smart control room," in Proc. Annual Conference of the International Speech Communication Association (INTERSPEECH), 2011.

[8] Tim Paek, Eric Horvitz, and Eric K. Ringger, "Continuous listening for unconstrained spoken dialog," in Proc. Annual Conference of the International Speech Communication Association (INTERSPEECH), 2000.

[9] Heeyoung Lee, Andreas Stolcke, and Elizabeth Shriberg, "Using out-of-domain data for lexical addressee detection in human-human-computer dialog," in Proc. Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), 2013, pp. 221–229.

[10] John Dowding, Richard Alena, William J. Clancey, Maarten Sierhuis, and Jeffrey Graham, "Are you talking to me? Dialogue systems supporting mixed teams of humans and robots," in AAAI Fall Symposium: Aurally Informed Performance, 2006.

[11] Elizabeth Shriberg, Andreas Stolcke, Dilek Hakkani-Tür, and Larry Heck, "Learning when to listen: Detecting system-addressed speech in human-human-computer dialog," in Proc. Annual Conference of the International Speech Communication Association (INTERSPEECH), September 2012, pp. 334–337.

[12] Mark Gales and Steve Young, Application of Hidden Markov Models in Speech Recognition, Now Foundations and Trends, 2008.

[13] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova, "BERT: Pre-training of deep bidirectional transformers for language understanding," CoRR, vol. abs/1810.04805, 2018.

[14] Tomas Mikolov, Edouard Grave, Piotr Bojanowski, Christian Puhrsch, and Armand Joulin, "Advances in pre-training distributed word representations," in Proc. of the International Conference on Language Resources and Evaluation (LREC), 2018.

[15] Sepp Hochreiter and Jürgen Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, Nov. 1997.

[16] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio, "Neural machine translation by jointly learning to align and translate," in Proc. 3rd International Conference on Learning Representations (ICLR), San Diego, CA, USA, May 7-9, 2015.

[17] Leslie N. Smith, "Cyclical learning rates for training neural networks," 2015, pp. 464–472.

[18] Avrim Blum and Tom Mitchell, "Combining labeled and unlabeled data with co-training," in Proc. of the Annual Conference on Computational Learning Theory (COLT), 1998.