The NPU System for the 2020 Personalized Voice Trigger Challenge
Jingyong Hou, Li Zhang, Yihui Fu, Qing Wang, Zhanheng Yang, Qijie Shao, Lei Xie
Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, China
{jyhou,lxie}@nwpu-aslp.org

Abstract
This paper describes the system developed by the NPU team for the 2020 personalized voice trigger challenge. Our submitted system consists of two independently trained sub-systems: a small footprint keyword spotting (KWS) system and a speaker verification (SV) system. For the KWS system, a multi-scale dilated temporal convolutional (MDTC) network is proposed to detect the wake-up word (WuW). For the SV system, we adopt the ArcFace loss and the supervised contrastive loss to optimize the speaker verification model. The KWS system predicts posterior probabilities of whether an audio utterance contains the WuW and estimates the location of the WuW at the same time. When the posterior probability of the WuW reaches a predefined threshold, the identity of the triggered segment is determined by the subsequent SV system. On the evaluation dataset, our submitted system obtains final scores of 0.081 and 0.091 in the close talking and far-field tasks, respectively.
Index Terms: Keyword spotting, Voice Trigger, Speaker Verification
1. Introduction
Keyword spotting (KWS) aims to detect pre-defined keyword(s) in audio. One important application of KWS is wake-up word (WuW) detection, which is usually used to trigger a speech interface in various devices such as smartphones, smart speakers and different kinds of IoT gadgets. For some personal devices, e.g., smartphones, smart watches and ear-buds, users usually do not want other people to wake up their devices. To build a personalized WuW system that can only be triggered by the device owner, a speaker verification (SV) module is usually used to perform authentication after WuW detection. The SV system identifies whether a test utterance matches the speaker's enrollment utterance and accepts or rejects the identity claim of the speaker accordingly.

KWS and SV have been extensively studied (mostly independently) in the past. As a flagship event of ISCSLP, the 2020 personalized voice trigger challenge (PVTC2020) combines the WuW and SV tasks, providing a sizable dataset and a common testbed on which a personalized voice trigger system can be trained and tested. More details about the task setting, dataset and evaluation plan can be found in [1]. We participated in the challenge and our submitted system ranks 2nd among all submitted systems in both the close talking and far-field tasks, with detection costs of 0.081 and 0.091, respectively.
2. Methods
Our system consists of two independently trained sub-systems: a small footprint KWS system and an SV system, as shown in Figure 1. The KWS system takes test audio as input and outputs a triggered segment. After that, the triggered segment is fed to the SV system. Then, the SV system decides whether the triggered segment is spoken by the enrolled target speaker.

Figure 1: Framework of our proposed personalized voice trigger system. The KWS system compares the WuW posterior of the test audio with the KWS threshold and outputs a triggered segment; the SV system compares the segment against the enrollment audio, using the SV threshold to reject or accept.

2.1. KWS system

2.1.1. Model structure

Figure 2: MDTC based KWS system. (a) DTC block: Dilated-Depth TCN, BN, Point-Conv, BN+ReLU, Point-Conv, BN, SE, scale, residual add and ReLU. (b) Overall network: stacked DTC stacks (s = 4), each containing DTC blocks with dilation rates d = 1, 2, 4, 8, whose outputs are added and fed to an FC layer with a sigmoid output.
Our proposed end-to-end (E2E) KWS system is shown in Figure 2. It takes T frames of Mel-filter bank features (80 banks per frame) X = (x_1, x_2, ..., x_T) as input and outputs T WuW posterior probabilities Y = (y_1, y_2, ..., y_T). For each time frame t, once its WuW posterior probability y_t ≥ γ, the wake-up word is judged to have occurred, where γ ∈ (0, 1) is a threshold.

To model the acoustic sequence in our keyword detector, a multi-scale dilated temporal convolutional (MDTC) network is proposed. A basic block, namely the DTC block, is shown in Figure 2 (a). First, a dilated depthwise 1d-convolution network (Dilated-Depth TCN) is used to capture temporal context; the filter size is 5×1 and the dilation rate can be set accordingly. Since simple depthwise 1d-convolution is used, the number of training parameters and the computation cost are greatly reduced. After the Dilated-Depth TCN, two layers of pointwise convolution (Point Conv) are used to integrate the features from different channels. We insert batch normalization (BN) and ReLU activation functions between the convolutional layers. In addition, we add a squeeze-and-excitation (SE) module after the last Point Conv layer to learn attention information between channels. A residual connection between the input and the last ReLU activation function is also adopted to prevent gradient vanishing and explosion. Four DTC blocks are stacked to form a DTC stack, as shown in Figure 2 (b). The dilation rates of the four DTC blocks are set to 1, 2, 4 and 8, respectively, so the receptive field of each DTC stack is 60 frames. In our submitted system, we use 4 DTC stacks as the feature extractor; the receptive field of the feature extractor is 4 × 60 = 240 frames, which is big enough to cover a WuW. We extract feature maps from DTC stacks with different receptive fields and sum them up as the input of a keyword classifier. For the keyword classifier, a simple fully-connected layer followed by a sigmoid output layer is used to predict the posterior of the WuW.
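To make the block structure concrete, the following is a minimal PyTorch sketch of one DTC block and a DTC stack as described above. It is an illustration under our own assumptions: the channel width, the SE reduction ratio, the causal left-padding and all identifier names are ours, not values from the system itself.

import torch
import torch.nn as nn

class DTCBlock(nn.Module):
    """One DTC block: dilated depthwise 1-D conv, then two pointwise
    convs with BN/ReLU in between, an SE module, and a residual add.
    Channel width and SE reduction ratio are assumptions."""
    def __init__(self, channels: int, dilation: int,
                 kernel_size: int = 5, se_reduction: int = 4):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation  # causal left-padding (assumption)
        self.depthwise = nn.Conv1d(channels, channels, kernel_size,
                                   dilation=dilation, groups=channels)
        self.bn1 = nn.BatchNorm1d(channels)
        self.point1 = nn.Conv1d(channels, channels, 1)
        self.bn2 = nn.BatchNorm1d(channels)
        self.point2 = nn.Conv1d(channels, channels, 1)
        self.bn3 = nn.BatchNorm1d(channels)
        self.relu = nn.ReLU()
        # Squeeze-and-excitation: global pooling + bottleneck + sigmoid gate
        self.se = nn.Sequential(
            nn.AdaptiveAvgPool1d(1),
            nn.Conv1d(channels, channels // se_reduction, 1), nn.ReLU(),
            nn.Conv1d(channels // se_reduction, channels, 1), nn.Sigmoid(),
        )

    def forward(self, x):                       # x: (batch, channels, time)
        y = nn.functional.pad(x, (self.pad, 0)) # keep the time length fixed
        y = self.relu(self.bn1(self.depthwise(y)))
        y = self.relu(self.bn2(self.point1(y)))
        y = self.bn3(self.point2(y))
        y = y * self.se(y)                      # channel attention (scale)
        return self.relu(y + x)                 # residual connection

def dtc_stack(channels: int) -> nn.Sequential:
    """A DTC stack: four DTC blocks with dilation rates 1, 2, 4, 8,
    covering roughly 60 frames of context per stack."""
    return nn.Sequential(*[DTCBlock(channels, d) for d in (1, 2, 4, 8)])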
2.1.2. Training targets and data augmentation

For a positive training utterance, we select up to 40 frames around the middle frame of the WuW region as positive training samples and assign them the label 1. Other frames in the positive training utterance are regarded as ambiguous and are not used in training. For negative training utterances, all frames are regarded as negative training samples and assigned the label 0. Our KWS system is thus modeled as a sequence binary classification problem. To train the model, the binary cross entropy (BCE) loss is used:

Loss_BCE = − y_i* ln(y_i) − (1 − y_i*) ln(1 − y_i),   (1)

where y_i* ∈ {0, 1} is the ground-truth class label for frame i, and y_i = M(x_i; θ) ∈ (0, 1) is the WuW posterior predicted by the KWS model M with parameters θ.

We augment the original training set of the challenge, which is critical for our system to generalize to the development set as well as the evaluation set. In the original training data, the keyword always appears at the beginning of the utterance and the rest of the utterance is speech spoken by the same speaker. In the development and evaluation sets, the keyword always appears at the end of the utterance, and there may be speech from other speakers before the keyword. Using only the original training data, it is hard to generalize the model to the development and evaluation sets, which results in a very low recall.

The positive keyword training set is composed of the following parts:
1) the keyword segments in the positive training utterances;
2) randomly selected non-keyword speech segments padded before the keyword segments in 1);
3) non-keyword speech segments padded both before and after the keyword segments in 1).

In addition, we also create more negative training utterances. The specific strategy is to cut a positive utterance from 3) at the middle frame of the keyword into two segments, which are subsequently used as negative training examples. This kind of negative example also improves the generalization ability of the model.

SpecAugment [2] is applied during training as well; it was first proposed for end-to-end (E2E) ASR to alleviate over-fitting and has recently proven effective for training E2E KWS systems too [3]. Specifically, we apply both time and frequency masking during training. For time masking, we randomly select a span of consecutive frames and set all of their Mel-filter banks to zero. For frequency masking, we randomly select a span of consecutive dimensions of the 80 Mel-filter banks and set their values to zero for all frames of the utterance.

2.1.3. Keyword location estimation

As mentioned before, the keywords always appear at the end of positive utterances. Based on this, we do not explicitly predict the starting and ending frames of the keywords. Instead, in an utterance, we take the frame with the largest keyword posterior as the middle position of the keyword, and use this estimated middle position together with the end frame of the keyword to estimate the starting frame of the keyword. Although this trick cannot be applied to real applications, it is effective under the specific conditions of this challenge. Note that several previous studies explicitly model the location of the keyword in a positive utterance [4, 5, 6, 7].
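This location trick can be summarized in a few lines. The sketch below is our illustrative reading of the description above (function and variable names are hypothetical): the argmax frame of the posterior sequence is taken as the keyword middle, and the start frame is mirrored back from the utterance-final end frame.

import numpy as np

def locate_keyword(posteriors: np.ndarray, gamma: float):
    """Challenge-specific keyword localization (a sketch, names are ours):
    the frame with the largest WuW posterior is taken as the keyword's
    middle; since keywords end at the utterance end in this challenge,
    the start frame is estimated by mirroring around the middle."""
    if posteriors.max() < gamma:
        return None                      # posterior never crosses the threshold
    end = len(posteriors) - 1            # keyword assumed to end at utterance end
    middle = int(posteriors.argmax())    # estimated middle frame of the keyword
    start = max(0, 2 * middle - end)     # mirror: middle - (end - middle)
    return start, end                    # the triggered segment fed to the SV system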
2.2. SV system

2.2.1. Model structure

We use a model structure similar to the baseline [1]. Our SV system consists of a front-end feature extractor, a statistics pooling layer and a back-end classifier. ResNet34 [8] with SE-blocks [9] is used as the feature extractor. For the back-end classifier, the ArcFace loss [10] is used (the baseline system uses AM-Softmax). The supervised contrastive loss [11] is also adopted to further decrease the intra-class distance and increase the inter-class distance. We trained two models with the same structure except for the statistics pooling layer: one uses attentive statistics pooling (ASP) [12], and the other uses self-attentive pooling (SAP) [13].
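For readers unfamiliar with ArcFace, the following is a minimal PyTorch sketch of an additive angular margin classifier head of the kind used here. The scale s and margin m are common defaults rather than the values used in our system, and the class name is hypothetical.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ArcFaceHead(nn.Module):
    """Additive angular margin softmax head (ArcFace), a sketch.
    s (scale) and m (margin) are common defaults, not this system's values."""
    def __init__(self, embed_dim: int, num_speakers: int,
                 s: float = 32.0, m: float = 0.2):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_speakers, embed_dim))
        self.s, self.m = s, m

    def forward(self, emb: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # cosine similarity between L2-normalized embeddings and class weights
        cos = F.linear(F.normalize(emb), F.normalize(self.weight))
        theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
        # add the angular margin m only on the target-class logits
        target = F.one_hot(labels, cos.size(1)).bool()
        logits = torch.where(target, torch.cos(theta + self.m), cos)
        return F.cross_entropy(self.s * logits, labels)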
2.2.2. Training data

We choose SLR38, SLR47, SLR62, SLR82 and SLR33 from OpenSLR (https://openslr.org) to pretrain our SV model; speakers with fewer than 10 utterances are excluded from model training. Data augmentation strategies from Kaldi [14] are adopted during training. The MUSAN [15] noise corpus and the room impulse response (RIR) database from [16] are used for noisy speech simulation. Eighty-dimensional Mel-filter bank features with 25 ms window size and 10 ms window shift are extracted as model inputs.
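Feature extraction of this kind can be reproduced, for example, with torchaudio's Kaldi-compatible frontend; the sketch below reflects our choice of tooling, not necessarily what was used to build the system.

import torchaudio

def extract_fbank(wav_path: str):
    """80-dim Mel-filter bank features, 25 ms window / 10 ms shift.
    torchaudio's Kaldi-compatible frontend is our own tool choice."""
    waveform, sample_rate = torchaudio.load(wav_path)
    return torchaudio.compliance.kaldi.fbank(
        waveform,
        num_mel_bins=80,
        frame_length=25.0,   # window size in ms
        frame_shift=10.0,    # window shift in ms
        sample_frequency=sample_rate,
    )  # shape: (num_frames, 80)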
3. Experimental Setups
3.1. KWS system training

We randomly separate some speakers' data from the original training set as a validation set, and the rest participates in back propagation. Adam optimization is used with a mini-batch size of M = 150. After each epoch, we evaluate the loss on the validation set; if there is no reduction in the loss, the learning rate is decayed by a fixed factor. After a minimum number of epochs of training, we stop training if there is no further decrease in the loss on the validation set.

3.2. SV system training

To train the SV system, we keep the same training-validation data configuration as for the KWS system. We first pretrain our base model using the selected data from OpenSLR. After that, we fine-tune the model twice using the challenge training data. The original training dataset is used to adapt the base model first. Then, to obtain a text-dependent speaker verification system, the keyword segments (cut from the whole utterances) are used to fine-tune the model again. A stochastic gradient descent (SGD) optimizer is used for model training. During pretraining, the initial learning rate is 0.1, decayed 10 times every 5 epochs. After 30 epochs, the loss converges to around 0.2. In the fine-tuning stage, we freeze the parameters of the pretrained model at the beginning and only train the final layer. When the loss reaches around 0.2, all the parameters are tuned, until the loss converges to a steady state.
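The KWS schedule described above (decay the learning rate when the validation loss stops improving, stop after a minimum number of epochs without further improvement) resembles a standard plateau-based scheduler. A minimal PyTorch sketch follows; the initial learning rate, decay factor and epoch budget are placeholder assumptions, and evaluate() stands in for a user-supplied validation routine.

import torch

def train_kws(model, train_loader, val_loader, evaluate,
              epochs: int = 100, min_epochs: int = 10):
    """Plateau-style training loop; lr, factor and epoch counts are assumed."""
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)            # assumed lr
    sched = torch.optim.lr_scheduler.ReduceLROnPlateau(opt, factor=0.5)
    best = float("inf")
    for epoch in range(epochs):
        for feats, labels in train_loader:
            opt.zero_grad()
            # model outputs per-frame WuW posteriors in (0, 1)
            loss = torch.nn.functional.binary_cross_entropy(model(feats), labels)
            loss.backward()
            opt.step()
        val_loss = evaluate(model, val_loader)  # user-supplied validation loss
        sched.step(val_loss)                    # decay lr when the loss plateaus
        if val_loss < best:
            best = val_loss
        elif epoch >= min_epochs:               # stop once enough epochs have run
            break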
3.3. Evaluation metric and decision thresholds

In this challenge, a special detection cost function C_d defined by the organizers is used to evaluate a personalized voice trigger system. False rejection (FR) and false alarm (FA) are the two typical detection errors in this task. The detection cost function is a weighted sum of the FA rate (FAR) and the FR rate (FRR):

C_d = FRR + α · FAR,   (2)

where α is a factor used to adjust the relative costs of FAR and FRR, and it is fixed to a constant specified by the organizers [1]. Usually, by tuning the detection threshold of the SV system, FAR and FRR change and there is a tradeoff between the two. If we know the detection labels of a dataset, we can compute minC_d, the minimum C_d on the dataset:

minC_d = min_δ ( FRR(δ) + α · FAR(δ) ),   (3)

where δ is the threshold of the SV system.

We have two thresholds to be determined: γ for the KWS system and δ for the SV system. On the development set, we find that a lower KWS FRR makes minC_d smaller. To this end, we choose a small γ to ensure that the KWS FRR is relatively low. On the development set, we traverse all SV thresholds to get the δ values which minimize C_d for the close talking and far-field tasks. On the evaluation set, the SV threshold for each task is directly borrowed from the one tuned on the development set.
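Given SV scores and trial labels, minC_d from Eq. (3) can be computed by sweeping the threshold δ over the observed scores, as in the short sketch below (names are ours; α is passed in rather than fixed).

import numpy as np

def min_cd(scores: np.ndarray, labels: np.ndarray, alpha: float) -> float:
    """Sweep the SV threshold delta and return min C_d = FRR + alpha * FAR.
    scores: SV scores for all trials; labels: 1 = target, 0 = impostor."""
    best = float("inf")
    for delta in np.unique(scores):
        accept = scores >= delta
        frr = np.mean(~accept[labels == 1])   # target trials rejected
        far = np.mean(accept[labels == 0])    # impostor trials accepted
        best = min(best, frr + alpha * far)
    return best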
4. Experimental Results
4.1. KWS results

Figures 3 and 4 show DET curves for different KWS systems on the development set. We reproduce the official KWS system ('PVTC2020 Baseline' in Figures 3 and 4) according to the official source code. Compared to the PVTC2020 Baseline, our proposed KWS system achieves clearly better DET curves in both the close talking and far-field tasks.

To further improve the performance of our KWS system, the negative utterances of the development set are used to train our KWS model, for two reasons: 1) we assume that the recording condition of the development set is better matched to that of the evaluation set than the training set is; 2) we want to use data from more speakers to improve the generalization ability of the model.

Figure 3: DET curves of different KWS systems on the development set, close talking task.

Figure 4: DET curves of different KWS systems on the development set, far-field task.
We do not have the labels of the evaluation set, so we still have to report results on the development set even though we have used its negative utterances to train our KWS model. As expected, using development negative utterances to train our KWS model improves the performance by a large margin ('PVTC2020 Baseline + dev' in Figures 3 and 4).

In model training, a sizable fraction of the training data is far-field data. However, comparing Figure 3 and Figure 4, we find that the performance on the far-field task is much worse than that on the close talking task. This indicates that the far-field task is more challenging than the close talking task.

4.2. SV results

SV results on the development set are shown in Table 1. SV-ASP and SV-SAP are the two models with different statistics pooling layers mentioned in Sec. 2.2.1. Here, the equal error rate (EER) is used as the SV evaluation metric. Compared to the baseline SV system, our systems (with the ArcFace loss and the supervised contrastive loss) obtain better EERs on both tasks. We also notice that ASP and SAP have comparable EER performance. Finally, simple score fusion of the two SV systems brings further EER reduction.
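The score fusion reported as SV-Fusion in Table 1 can be as simple as averaging per-system normalized scores; the equal weighting and z-normalization below are our assumptions, since the text only says "simple score fusion".

import numpy as np

def fuse_scores(scores_asp: np.ndarray, scores_sap: np.ndarray) -> np.ndarray:
    """Score-level fusion of the two SV systems ('SV-Fusion'), a sketch.
    Equal weighting after per-system z-normalization is our assumption."""
    def znorm(s):                        # put both systems on a common scale
        return (s - s.mean()) / s.std()
    return 0.5 * (znorm(scores_asp) + znorm(scores_sap))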
Table 1: EERs (%) of different SV systems on the development set.

Model Name      Close talking data    Far-field data
PVTC Baseline   1.326                 1.904
SV-ASP          0.961                 1.564
SV-SAP          1.001                 1.624
SV-Fusion       0.821                 1.423

Table 2: Minimum detection costs minC_d on the development set and actual detection costs C_d on the evaluation set.

Dataset       Close talking data    Far-field data
Development   0.042                 0.055
Evaluation    0.081                 0.091
4.3. Personalized voice trigger results

Table 2 shows the personalized voice trigger performance of our system on different datasets and tasks. The values in the table are minimum detection costs minC_d on the development set and actual detection costs C_d on the evaluation set. On both the development and evaluation sets, our system's performance in the far-field task is worse than that in the close talking task, which is consistent with the KWS results on the two tasks.

Another observation is that, on both tasks, performance on the development set is always better than that on the evaluation set. Our SV threshold determination method could be one reason for this. Another possible reason is that, compared to the development set, there are more speakers in the evaluation set, and more speakers in the trials may make the SV task more challenging.

4.4. Inference efficiency

An Intel(R) Xeon(R) E5-2620 v3 CPU with a main frequency of 2.4 GHz is used to evaluate the inference efficiency of our system. Our KWS model has around 180k parameters. On the evaluation set, the normalized real-time factor (RTF) of our KWS sub-system is 0.05. The SV sub-system does not need to process all the samples in the evaluation set; only the triggered segments are sent to the SV system. To calculate the real-time factor of the SV part in the whole system, we compute the processing time of the triggered segments on the evaluation set, and then divide it by the total duration of the evaluation set. Our SV system's normalized RTF is 0.07.
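The normalized RTF computation described above reduces to a single ratio; the sketch below simply spells it out (the function name is ours).

def realtime_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Normalized real-time factor: total processing time divided by the
    total duration of the evaluation-set audio. For the SV sub-system,
    processing_seconds covers only the triggered segments, while
    audio_seconds is still the full evaluation-set duration."""
    return processing_seconds / audio_seconds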
5. Conclusions
Our personalized voice trigger system submitted to PVTC2020 is introduced in this paper. The system consists of a KWS system and an SV system, which are independently optimized but eventually cooperate well. For the KWS system, a novel MDTC network is proposed and data augmentation strategies are used. For the SV system, we revise the baseline system with the ArcFace loss and the supervised contrastive loss, which are shown to be effective for performance gains. On the evaluation dataset, our submitted system obtains final scores of 0.081 and 0.091 in the close talking and far-field tasks, respectively. In the future, we will focus on joint optimization of the KWS and SV systems, which we believe will bring further performance gains.
6. References

[1] Y. Jia, X. Wang, X. Qin, Y. Zhang, X. Wang, J. Wang, and M. Li, "The 2020 personalized voice trigger challenge: Open database, evaluation metrics and the baseline systems," arXiv preprint arXiv:2101.01935, 2021.
[2] D. S. Park, W. Chan, Y. Zhang, C.-C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le, "SpecAugment: A simple data augmentation method for automatic speech recognition," in Proc. INTERSPEECH, 2019, pp. 2613–2617.
[3] J. Hou, Y. Shi, M. Ostendorf, M.-Y. Hwang, and L. Xie, "Mining effective negative training samples for keyword spotting," in Proc. ICASSP, 2020.
[4] J. Hou, Y. Shi, M. Ostendorf, M.-Y. Hwang, and L. Xie, "Region proposal network based small-footprint keyword spotting," IEEE Signal Processing Letters, vol. 26, no. 10, pp. 1471–1475, 2019.
[5] Y. Segal, T. S. Fuchs, and J. Keshet, "SpeechYOLO: Detection and localization of speech objects," arXiv preprint arXiv:1904.07704, 2019.
[6] T. Maekaku, Y. Kida, and A. Sugiyama, "Simultaneous detection and localization of a wake-up word using multi-task learning of the duration and endpoint," in Proc. INTERSPEECH, 2019.
[7] C. Jose, Y. Mishchenko, T. Sénéchal, A. Shah, A. Escott, and S. N. P. Vitaladevuni, "Accurate detection of wake word start and end using a CNN," in Proc. INTERSPEECH, 2020, pp. 3346–3350.
[8] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. CVPR, 2016, pp. 770–778.
[9] J. Hu, L. Shen, and G. Sun, "Squeeze-and-excitation networks," in Proc. CVPR, 2018, pp. 7132–7141.
[10] J. Deng, J. Guo, N. Xue, and S. Zafeiriou, "ArcFace: Additive angular margin loss for deep face recognition," in Proc. CVPR, 2019, pp. 4690–4699.
[11] P. Khosla, P. Teterwak, C. Wang, A. Sarna, Y. Tian, P. Isola, A. Maschinot, C. Liu, and D. Krishnan, "Supervised contrastive learning," arXiv preprint arXiv:2004.11362, 2020.
[12] K. Okabe, T. Koshinaka, and K. Shinoda, "Attentive statistics pooling for deep speaker embedding," arXiv preprint arXiv:1803.10963, 2018.
[13] F. A. R. R. Chowdhury, Q. Wang, I. L. Moreno, and L. Wan, "Attention-based models for text-dependent speaker verification," in Proc. ICASSP, 2018, pp. 5359–5363.
[14] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz et al., "The Kaldi speech recognition toolkit," in Proc. ASRU, 2011.
[15] D. Snyder, G. Chen, and D. Povey, "MUSAN: A music, speech, and noise corpus," arXiv preprint arXiv:1510.08484, 2015.
[16] J. B. Allen and D. A. Berkley, "Image method for efficiently simulating small-room acoustics," The Journal of the Acoustical Society of America, vol. 65, no. 4, pp. 943–950, 1979.