Analyzing Learned Representations of a Deep ASR Performance Prediction Model
Zied Elloumi, Laurent Besacier, Olivier Galibert, Benjamin Lecouteux
Laboratoire national de métrologie et d'essais (LNE), France
Univ. Grenoble Alpes, CNRS, Grenoble INP, LIG, F-38000 Grenoble, France

Abstract
This paper addresses a relatively new task: prediction of ASR performance on unseen broadcast programs. In a previous paper, we presented an ASR performance prediction system using CNNs that encode both text (ASR transcript) and speech, in order to predict word error rate. This work is dedicated to the analysis of speech signal embeddings and text embeddings learnt by the CNN while training our prediction model. We try to better understand which information is captured by the deep model and its relation with different conditioning factors. It is shown that hidden layers convey a clear signal about speech style, accent and broadcast type. We then try to leverage these 3 types of information at training time through multi-task learning. Our experiments show that this allows us to train slightly more efficient ASR performance prediction systems that, in addition, simultaneously tag the analyzed utterances according to their speech style, accent and broadcast program origin.
1 Introduction

Predicting automatic speech recognition (ASR) performance on unseen speech recordings is an important Grail of speech research. In a previous paper (Elloumi et al., 2018), we presented a framework for modeling and evaluating ASR performance prediction on unseen broadcast programs. CNNs were very efficient at encoding both text (ASR transcript) and speech to predict ASR word error rate (WER). However, while achieving state-of-the-art performance prediction results, our CNN approach is more difficult to understand compared to conventional approaches based on engineered features such as TranscRater (https://github.com/hlt-mt/TranscRater) for instance. This lack of interpretability of the representations learned by deep neural networks is a general problem in AI. Recent papers started to address this issue and analyzed hidden representations learned during training of different natural language processing models (Mohamed et al., 2012; Wu and King, 2016; Belinkov and Glass, 2017; Shi et al., 2016; Belinkov et al., 2017; Wang et al., 2017).

Contribution. This work is dedicated to the analysis of speech signal embeddings and text embeddings learnt by the CNN during training of our ASR performance prediction model. Our goal is to better understand which information is captured by the deep model and its relation with conditioning factors such as speech style, accent or broadcast program type. For this, we use a data set presented in (Elloumi et al., 2018) which contains a large amount of speech utterances taken from various collections of French broadcast programs. Following a methodology similar to (Belinkov and Glass, 2017), our deep performance prediction model is used to generate utterance-level features that are given to a shallow classifier trained to solve secondary classification tasks. It is shown that hidden layers convey a clear signal about speech style, accent and show. We then try to leverage these 3 types of information at training time through multi-task learning. Our experiments show that this allows us to train slightly more efficient ASR performance prediction systems that, in addition, simultaneously tag the analyzed utterances according to their speech style, accent and broadcast program origin.
Outline. The paper is organized as follows. In section 2, we present a brief overview of related works and present our ASR performance prediction system in section 3. Then, we detail our methodology to evaluate learned representations in section 4. Our multi-task learning experiments for ASR performance prediction are presented in section 5. Finally, section 6 concludes this work.

2 Related Works
Several works tried to understand learned representations for NLP tasks such as Automatic Speech Recognition (ASR) and Neural Machine Translation (NMT). (Shi et al., 2016) and (Belinkov et al., 2017) tried to better understand the hidden representations of NMT models, which were given to a shallow classifier in order to predict syntactic labels (Shi et al., 2016), part-of-speech labels or semantic ones (Belinkov et al., 2017). It was shown that lower layers are better at POS tagging, while higher layers are better at learning semantics. (Mohamed et al., 2012) and (Belinkov and Glass, 2017) analyzed the feature representations from a deep ASR model using t-SNE visualization (Maaten and Hinton, 2008) and tried to understand which layers better capture the phonemic information by training a shallow phone classifier. Also relevant is the work of (Wang et al., 2017) who proposed an in-depth investigation of three kinds of speaker embeddings learned for a speaker recognition task, i.e. i-vector, d-vector and RNN/LSTM based sequence-vector (s-vector). Classification tasks were designed to facilitate better understanding of the encoded speaker representations. Multi-task learning was also proposed to integrate different speaker embeddings and improve speaker verification performance.
3 ASR Performance Prediction System

In (Elloumi et al., 2018), we proposed a new approach using convolutional neural networks (CNNs) to predict ASR performance from a collection of heterogeneous broadcast programs (both radio and TV). We particularly focused on the combination of text (ASR transcription) and signal (raw speech) inputs, which both proved useful for CNN prediction. We also observed that our system remarkably predicts the WER distribution on a collection of speech recordings.

To obtain speech transcripts (ASR outputs) for the prediction model, we built our own French ASR system based on the KALDI toolkit (Povey et al., 2011). A hybrid HMM-DNN system was trained using 100 hours of broadcast news from the Quaero, ETAPE (Gravier et al., 2012), ESTER 1 & ESTER 2 (Galliano et al., 2005) and REPERE (Kahn et al., 2012) collections. ASR performance was evaluated on the held-out corpora presented in Table 2 (used to train and evaluate ASR prediction) and its averaged value was 22.29% on the TRAIN set, 22.35% on the DEV set and 31.20% on the TEST set (which contains more challenging broadcast programs).

Figure 1 shows our network architecture. The network input can be either a pure text input, a pure signal input (raw signal) or a dual (text+speech) input. To avoid memory issues, signals are downsampled to 8 kHz and models are trained on six-second speech turns (shorter speech turns are padded with zeros). For text input, the architecture is inspired from (Kim, 2014) (green in Figure 1): the input is a matrix of dimensions 296x100 (296 is the longest ASR hypothesis length in our corpus; 100 is the dimension of word embeddings pre-trained on a large held-out text corpus of 3.3G words). For speech input, we use the best architecture (m18) proposed in (Dai et al., 2017) (colored in red in Figure 1) of dimensions 48000x1 (48000 samples correspond to 6 s of speech).

Figure 1: Architecture of our CNN with text (green) and signal (red) inputs for WER prediction.

For WER prediction, our best approach (called CNN Softmax) used softmax probabilities and an external fixed WER vector which corresponds to a discretization of the WER output space (see (Elloumi et al., 2018) for more details). The best performance obtained is 19.24% MAE (Mean Absolute Error, a common metric to evaluate WER prediction; it computes the absolute deviation between the true and predicted WERs, averaged over the number of utterances in the test set) using text+speech input. Our ASR prediction system is built using both Keras (Chollet et al., 2015) and TensorFlow.
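A minimal sketch of this dual-input setup is given below, assuming tf.keras: the text branch takes the 296x100 embedding matrix, the raw-signal branch takes the 48000-sample waveform, and the final softmax over discretized WER bins is turned into a WER estimate by taking its expectation against a fixed WER vector. The filter sizes, the number of bins and the reduced depth of the signal branch are illustrative assumptions, not the exact (Kim, 2014) and m18 configurations used in the paper.

```python
# Sketch of the "CNN Softmax" idea: two CNN encoders, a joint representation,
# a softmax over discretized WER bins, and the expected WER as prediction.
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers

NUM_BINS = 50                                   # assumed discretization of [0, 100] WER
wer_vector = np.linspace(0.0, 100.0, NUM_BINS).astype("float32")  # fixed WER vector

# Text branch: 296 words x 100-d pre-trained embeddings (already looked up).
text_in = layers.Input(shape=(296, 100), name="txt")
t = layers.Conv1D(128, 5, activation="relu")(text_in)
t = layers.GlobalMaxPooling1D()(t)

# Speech branch: 6 s of raw signal at 8 kHz (very shallow stand-in for m18).
sig_in = layers.Input(shape=(48000, 1), name="raw_sig")
s = layers.Conv1D(64, 80, strides=4, activation="relu")(sig_in)
s = layers.MaxPooling1D(4)(s)
s = layers.Conv1D(128, 3, activation="relu")(s)
s = layers.GlobalMaxPooling1D()(s)

# Joint representation -> softmax over WER bins -> expected WER.
h = layers.Concatenate()([t, s])
h = layers.Dense(128, activation="relu")(h)
probs = layers.Dense(NUM_BINS, activation="softmax", name="wer_bins")(h)
wer = layers.Lambda(lambda p: tf.reduce_sum(p * wer_vector, axis=-1, keepdims=True),
                    name="predicted_wer")(probs)

model = tf.keras.Model([text_in, sig_in], wer)
model.compile(optimizer="adam", loss="mae")     # MAE is the evaluation metric used here
model.summary()
```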
In the next section, we analyze the representations learnt in the higher layers (3 blocks colored in yellow and dotted in Figure 1) for pure text (TXT), pure speech (RAW-SIG) and both (TXT+RAW-SIG).

4 Evaluating Learned Representations

In this section, we attempt to understand what our best ASR performance prediction system (Elloumi et al., 2018) learned. We analyze the text and speech representations obtained by our architecture. As in (Belinkov and Glass, 2017), the joint text+speech model is used to generate utterance-level features (hidden representations of speech turns colored in yellow in Figure 1) that are given to a shallow classifier trained to solve secondary classification tasks such as:

• STYLE: classify the utterances between spontaneous and non spontaneous styles (see Table 1),
• ACCENT: classify the utterances between native and non native speech (see also Table 1; we used the speaker annotations provided with our datasets in order to label our utterances as native/non native speech),
• SHOW: classify the utterances into different broadcast programs (as described in Table 2, each utterance of our corpus is labeled with a broadcast program name).

As a more visual analysis, we also plot an example of hidden representations projected to a 2-D space using t-distributed Stochastic Neighbor Embedding (t-SNE) (Maaten and Hinton, 2008) (https://lvdmaaten.github.io/tsne/code/tsne_python.zip).

We built three shallow classifiers (SHOW, STYLE, ACCENT) with a similar architecture. The classifier is a feed-forward neural network with one hidden layer (the size of the hidden layer is set to 128) followed by dropout (rate of 0.5) and a ReLU non-linearity. Finally, a softmax layer is used for mapping onto the label set size. We chose this simple formulation as we are interested in evaluating the quality of the representations learned by our ASR prediction model, rather than optimizing the secondary classification tasks. The network input size depends on which layer is analyzed (see Figure 1).
Training is performed using Adam (Kingma and Ba, 2014) (with default parameters) over shuffled mini-batches in order to minimize the cross-entropy loss. The models are trained for 30 epochs with a batch size of 16 speech utterances. After training, we keep the model with the best performance on the DEV set and report its performance on the TEST set. The classifier outputs are evaluated in terms of accuracy.
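A possible implementation of such a probe, assuming tf.keras and that the utterance-level activations of a given layer have already been extracted into feature arrays (the names train_feats, dev_feats, etc. are hypothetical):

```python
# Shallow probe classifier: one hidden layer of 128 units (ReLU), dropout 0.5,
# softmax over the task labels, trained with Adam and cross-entropy.
import tensorflow as tf
from tensorflow.keras import layers

def build_probe(input_dim: int, num_labels: int) -> tf.keras.Model:
    model = tf.keras.Sequential([
        layers.Input(shape=(input_dim,)),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(num_labels, activation="softmax"),
    ])
    model.compile(optimizer="adam",                     # Keras default parameters
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# probe = build_probe(input_dim=128, num_labels=2)      # e.g. C2 features, STYLE task
# probe.fit(train_feats, train_labels, validation_data=(dev_feats, dev_labels),
#           epochs=30, batch_size=16, shuffle=True)
```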
A data set from (Elloumi et al., 2018) was employed in our experiments, divided into three subsets: training (TRAIN), development (DEV) and test (TEST). Speech utterances come from various French broadcast collections gathered during projects or shared tasks: Quaero, ETAPE, ESTER 1 & ESTER 2 and REPERE. The TEST set contains unseen broadcast programs that are different from those present in TRAIN and DEV (Elloumi et al., 2018).
Table 1: Distribution of our utterances between non spontaneous and spontaneous styles, native and non native accents (TRAIN/DEV/TEST).

Table 2: Number of utterances for each broadcast program (TRAIN/DEV/TEST; programs include FINTER-DEBATE, FRANCE3-DEBATE and LCP-PileEtFace, among others).

Tables 1 and 2 show the whole data set in terms of speech turns available for each classification task. We clearly see that the data is unbalanced for the three categories (STYLE, ACCENT, SHOW). Since we are interested in evaluating the discriminative power of our learned representations for these 3 tasks, we extracted a balanced version of our TRAIN/DEV/TEST sets by filtering among over-represented labels (the final number of kept utterances corresponds to the bold numbers in Tables 1 and 2), as sketched below. Table 3 shows the distribution of our final balanced TRAIN/DEV/TEST sets as well as the number of categories for each task.

Task    TRAIN  DEV  TEST
SHOW    x      x    -
STYLE   x      x    x
ACCENT  x      x    x

Table 3: Description of our balanced data set for each category.
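The filtering step could look as follows; this is only an illustrative pandas sketch with assumed column names (style, accent, show), not the exact procedure used to build the corpus:

```python
# Balance a split by random down-sampling of over-represented labels so that
# each label keeps at most as many utterances as the rarest label has.
import pandas as pd

def balance(df: pd.DataFrame, label_col: str, seed: int = 0) -> pd.DataFrame:
    n_min = df[label_col].value_counts().min()          # size of the rarest label
    return (df.groupby(label_col, group_keys=False)
              .apply(lambda g: g.sample(n=n_min, random_state=seed)))

# balanced_style = balance(train_df, "style")            # spontaneous / non spontaneous
# balanced_accent = balance(train_df, "accent")          # native / non native
# balanced_show = balance(train_df, "show")              # broadcast program name
```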
For each classification task, we build a shallow classifier using the hidden representations of the TXT, RAW-SIG and TXT+RAW-SIG blocks as input. The experimental results are presented in Table 4 for both DEV and TEST sets, separated by two vertical bars (||).

Classification performance is always above a random baseline accuracy (> 50% for STYLE and ACCENT and > 20% for SHOW). (For the SHOW classification task, the FRANCE3-DEBATE shows were removed since they represent a too small amount of speech turns.) This shows that training a deep WER prediction system gives representation layers that contain a meaningful amount of information about speech style, speech accent and broadcast program label. Predicting utterance style (spontaneous/non spontaneous) is slightly easier than predicting accent (native/non native), especially from text input. One explanation might be that speech utterances are short (at most 6 s in our setup) while accent identification probably needs longer sequences. We also observe that using both text and speech improves the learned representations for the STYLE task, while this is less clear for the ACCENT task (for which the improvement seen on DEV is not confirmed on TEST). Finally, text input is significantly better than speech input, whereas we could have expected better performance from speech for the SHOW task (speech signals convey information about the audio characteristics of a broadcast program). It means that text input contains information correlated with broadcast program type, speech style and speaker accent. In the case of the SHOW task, our performance prediction system is able to capture information (vocabulary, topic, syntax, etc.) about a specific broadcast program type based on textual features and to distinguish it from others (radio programs, TV debate programs, phone calls, broadcast news programs, etc.). Likewise, the textual information captured is very different between spontaneous/non-spontaneous speech styles and native/non-native speaker accents.

Among the representations analyzed, the outputs of the CNNs (A1, B1) lead to the best classification results, in line with previous findings about convolutions as feature extractors. Performance then drops using the higher (fully connected) layers, which do not generate better representations for detecting style, accent or show.
Table 4: Show/Style/Accent classification accuracies (DEV || TEST) using representations from different layers (A1, B1, B2, A3+B4, C2) learned during the training of our ASR WER prediction system, compared to a random baseline.

We visualize an example of utterance representations from the C2 (TXT+RAW-SIG) layer in Figure 2 using t-SNE, for two fixed utterance duration ranges: 4s ≤ D < 5s (716 speech turns) and 5s ≤ D < 6s.

Figure 2: Visualization of utterance representations from the C2 layer for different speech styles (S: spontaneous, NS: non spontaneous); (a) utterance length 4s ≤ D < 5s, (b) 5s ≤ D < 6s.

Figure 3 shows the confusion matrix for SHOW classification on DEV. The TELSONNE category is particularly well classified (accuracy of 82%): it contains many phone calls from the radio listeners, and this show is rather different from the 4 other shows in DEV (broadcast debates and news).
Figure 3: Confusion matrix for SHOW classification using the C2 (TXT+RAW-SIG) layer as input, evaluated on DEV.
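A small sketch of how such a 2-D projection can be produced, using scikit-learn's TSNE rather than the original tsne_python script; feature and label variable names are placeholders:

```python
# Project utterance-level C2 activations to 2-D with t-SNE and color by style.
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(features, labels, title):
    # features: (n_utterances, 128) C2 activations; labels: one style per utterance
    points = TSNE(n_components=2, random_state=0).fit_transform(features)
    for style in set(labels):
        idx = [i for i, l in enumerate(labels) if l == style]
        plt.scatter(points[idx, 0], points[idx, 1], s=8, label=style)
    plt.legend()
    plt.title(title)
    plt.show()

# plot_tsne(c2_feats_4to5s, style_labels_4to5s, "C2 representations, 4s <= D < 5s")
```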
5 Multi-task Learning Experiments

We have seen in the previous section that, while training an ASR performance prediction system, hidden layers convey a clear signal about speech style, accent and show. This suggests that these 3 types of information might be useful to structure deep ASR performance prediction models. In this section, we investigate the effect of knowledge of these labels (style, accent, show) at training time on the quality of the prediction systems. For this, we perform multi-task learning, providing the additional information about broadcast type, speech style and speaker accent during training. The architecture of the multi-task model is similar to the single-task WER prediction model of Figure 1, but we add additional outputs: a softmax function is added for each new classification task after the last fully connected layer (C2). The output dimension depends on the task: 6 for SHOW and 2 for the STYLE and ACCENT tasks.

We use the full (unbalanced) data set described in Tables 1 and 2. Training of the multi-task model uses the Adadelta update rule and all parameters (8.70M) are initialized from scratch. Models are trained for 50 epochs with a batch size of 32. MAE is used as the loss function for the WER prediction task, while cross-entropy loss is used for the classification tasks. In the composite (multi-task) loss, we assign a weight of 1 to the MAE loss (main task) and a smaller weight of 0.3 (tuned using a grid search on the DEV dataset) to the cross-entropy (secondary classification task) losses.
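A compact tf.keras sketch of this multi-task configuration is shown below. The backbone is a strongly simplified stand-in for the Figure 1 model (the WER head is reduced to a plain regression output for brevity); what matters here is the extra softmax heads attached after C2 and the 1.0 / 0.3 loss weighting:

```python
# Multi-task WER prediction: one regression-style WER output (MAE, weight 1.0)
# plus one softmax head per auxiliary task (cross-entropy, weight 0.3 each).
import tensorflow as tf
from tensorflow.keras import layers

# Simplified stand-in backbone (see the Section 3 sketch for the full dual input).
text_in = layers.Input(shape=(296, 100), name="txt")
sig_in = layers.Input(shape=(48000, 1), name="raw_sig")
t = layers.GlobalMaxPooling1D()(layers.Conv1D(128, 5, activation="relu")(text_in))
s = layers.GlobalMaxPooling1D()(layers.Conv1D(64, 80, strides=4, activation="relu")(sig_in))
c2 = layers.Dense(128, activation="relu", name="C2")(layers.Concatenate()([t, s]))
wer_out = layers.Dense(1, name="wer")(c2)                             # main task

# One softmax head per auxiliary task after C2.
show_out = layers.Dense(6, activation="softmax", name="show")(c2)     # 6 programs
style_out = layers.Dense(2, activation="softmax", name="style")(c2)   # spont. / non spont.
accent_out = layers.Dense(2, activation="softmax", name="accent")(c2) # native / non native

multi = tf.keras.Model([text_in, sig_in], [wer_out, show_out, style_out, accent_out])
multi.compile(optimizer="adadelta",
              loss={"wer": "mae",
                    "show": "sparse_categorical_crossentropy",
                    "style": "sparse_categorical_crossentropy",
                    "accent": "sparse_categorical_crossentropy"},
              loss_weights={"wer": 1.0, "show": 0.3, "style": 0.3, "accent": 0.3})
# multi.fit([txt_x, sig_x], {"wer": wers, "show": shows, "style": styles, "accent": accents},
#           epochs=50, batch_size=32)
```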
After training, we take the model that leads to the best MAE on the DEV set and report its performance on TEST. We build several models that simultaneously address 1, 2, 3 and 4 tasks. The models are evaluated with a specific metric for each task: MAE and Kendall (rank correlation between the true and predicted ASR values) for the WER prediction task and accuracy for the classification tasks.

Table 5 summarizes the experimental results on the DEV and TEST sets, separated by two vertical bars (||). We consider the mono-task model described in (Elloumi et al., 2018) (and summarized in section 3) as a baseline system. We recall that we evaluated the SHOW classification task only on the DEV set (the TEST broadcast programs are new and were unseen in TRAIN).

Table 5: Evaluation of ASR performance prediction with multi-task models (DEV || TEST), computed with MAE and Kendall; secondary classification task accuracies are also reported. The baseline mono-task WER model (Elloumi et al., 2018) obtains 15.24 MAE on DEV; combining the predictions of all multi-task systems reaches 14.50 MAE on DEV.

First of all, we notice that the performance of the classification tasks in multi-task scenarios is very good: we are able to train efficient ASR performance prediction systems that simultaneously tag the analyzed utterances according to their speech style, accent and broadcast program origin. Such multi-task systems might be useful diagnostic tools to analyze and predict ASR performance on large speech collections. Moreover, our best multi-task systems display a better performance (MAE, Kendall) than the baseline system, which means that the implicit information given about style, accent and broadcast program type can be helpful to structure the system's predictions. For example, in the 2-task case, the best model is obtained on the WER+SHOW tasks, with a difference of +0.41% and +2.25% for MAE and Kendall respectively (on DEV) compared to the baseline on the WER prediction task. However, it is also important to mention that the impact of multi-task learning on the main task (ASR performance prediction) is limited: only slight improvements on the test set are observed for the MAE and Kendall metrics. Still, the trained systems seem complementary, since their combination (averaging, over all multi-task systems, the predicted WERs at utterance level) leads to significant performance improvements (MAE and Kendall).
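For reference, the evaluation metrics and the combination step can be written in a few lines; array names are placeholders:

```python
# MAE and Kendall's tau between true and predicted WERs, plus system combination
# by averaging the utterance-level predictions of all multi-task systems.
import numpy as np
from scipy.stats import kendalltau

def evaluate(true_wer: np.ndarray, pred_wer: np.ndarray):
    mae = np.mean(np.abs(true_wer - pred_wer))   # mean absolute error
    tau, _ = kendalltau(true_wer, pred_wer)      # rank correlation
    return mae, tau

# predictions_per_system: list of arrays, one per multi-task system
# combined = np.mean(np.stack(predictions_per_system), axis=0)
# print(evaluate(test_wer, combined))
```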
6 Conclusion

This paper presented an analysis of the learned representations of our deep ASR performance prediction system. Experiments show that hidden layers convey a clear signal about speech style, accent and broadcast type. We also proposed a multi-task learning approach to simultaneously predict WER and classify utterances according to style, accent and broadcast program origin.
References
Yonatan Belinkov and James Glass. 2017. Analyzing hidden representations in end-to-end automatic speech recognition systems. In Advances in Neural Information Processing Systems, pages 2438–2448.

Yonatan Belinkov, Lluís Màrquez, Hassan Sajjad, Nadir Durrani, Fahim Dalvi, and James Glass. 2017. Evaluating layers of representation in neural machine translation on part-of-speech and semantic tagging tasks. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), volume 1, pages 1–10.

François Chollet et al. 2015. Keras. https://github.com/fchollet/keras.

Wei Dai, Chia Dai, Shuhui Qu, Juncheng Li, and Samarjit Das. 2017. Very deep convolutional neural networks for raw waveforms. In Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on, pages 421–425. IEEE.

Zied Elloumi, Laurent Besacier, Olivier Galibert, Juliette Kahn, and Benjamin Lecouteux. 2018. ASR performance prediction on unseen broadcast programs using convolutional neural networks. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

Sylvain Galliano, Edouard Geoffrois, Djamel Mostefa, Khalid Choukri, Jean-François Bonastre, and Guillaume Gravier. 2005. The ESTER phase II evaluation campaign for the rich transcription of French broadcast news. In Interspeech, pages 1149–1152.

Guillaume Gravier, Gilles Adda, Niklas Paulson, Matthieu Carré, Aude Giraudel, and Olivier Galibert. 2012. The ETAPE corpus for the evaluation of speech-based TV content processing in the French language. In LREC - Eighth International Conference on Language Resources and Evaluation.

Juliette Kahn, Olivier Galibert, Ludovic Quintard, Matthieu Carré, Aude Giraudel, and Philippe Joly. 2012. A presentation of the REPERE challenge. In Content-Based Multimedia Indexing (CBMI), 2012 10th International Workshop on, pages 1–6. IEEE.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. CoRR, abs/1412.6980.

Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605.

Abdel-rahman Mohamed, Geoffrey Hinton, and Gerald Penn. 2012. Understanding how deep belief networks perform acoustic modelling. In Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on, pages 4273–4276. IEEE.

Daniel Povey, Arnab Ghoshal, Gilles Boulianne, Lukas Burget, Ondrej Glembek, Nagendra Goel, Mirko Hannemann, Petr Motlicek, Yanmin Qian, Petr Schwarz, et al. 2011. The Kaldi speech recognition toolkit. In IEEE 2011 Workshop on Automatic Speech Recognition and Understanding, EPFL-CONF-192584. IEEE Signal Processing Society.

Xing Shi, Inkit Padhi, and Kevin Knight. 2016. Does string-based neural MT learn source syntax? In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1526–1534.

Shuai Wang, Yanmin Qian, and Kai Yu. 2017. What does the speaker embedding encode? In Interspeech, volume 2017, pages 1497–1501.

Zhizheng Wu and Simon King. 2016. Investigating gated recurrent neural networks for speech synthesis.