DNN adaptation by automatic quality estimation of ASR hypotheses
Daniele Falavigna, Marco Matassoni, Shahab Jalalvand, Matteo Negri, Marco Turchi
a FBK - Fondazione Bruno Kessler, Trento, Italy
b University of Trento, Trento, Italy
Abstract
In this paper we propose to exploit the automatic Quality Estimation (QE) of ASR hypotheses to perform the unsupervised adaptation of a deep neural network modeling acoustic probabilities. Our hypothesis is that significant improvements can be achieved by: i) automatically transcribing the evaluation data we are currently trying to recognise, and ii) selecting from it a subset of "good quality" instances based on the word error rate (WER) scores predicted by a QE component. To validate this hypothesis, we run several experiments on the evaluation data sets released for the CHiME-3 challenge. First, we operate in oracle conditions in which manual transcriptions of the evaluation data are available, thus allowing us to compute the true sentence WER. In this scenario, we perform the adaptation with variable amounts of data, which are characterised by different levels of quality. Then, we move to realistic conditions in which the manual transcriptions of the evaluation data are not available. In this case, the adaptation is performed on data selected according to the WER scores predicted by a QE component. Our results indicate that: i) QE predictions allow us to closely approximate the adaptation results obtained in oracle conditions, and ii) the overall ASR performance based on the proposed QE-driven adaptation method is significantly better than the strong, most recent, CHiME-3 baseline.
Keywords:
Deep neural networks, DNN adaptation, ASR quality estimation.
1. Introduction
(Please cite this article as: D. Falavigna et al., DNN adaptation by automatic quality estimation of ASR hypotheses, Computer Speech & Language (2016), http://dx.doi.org/10.1016/j.csl.2016.11.002)
Email addresses: [email protected] (Daniele Falavigna), [email protected] (Marco Matassoni), [email protected] (Shahab Jalalvand), [email protected] (Matteo Negri), [email protected] (Marco Turchi)
Automatic speech recognition (ASR) with microphone arrays is gaining increasing interest in a variety of application scenarios, such as home and office automation, smart cars and humanoid robots. In such applications, ASR should be able to operate in environments
where noises of various types, competing speakers and reverberation heavily affect recognition performance, which is usually satisfactory in controlled acoustic conditions. To cope with the above scenarios, most of the current approaches are based on the implementation of a variety of enhancement techniques such as beamforming, denoising and dereverberation [7].
The last CHiME challenge (CHiME-3, http://spandh.dcs.shef.ac.uk/chime_challenge/) provided an excellent framework to evaluate signal enhancement approaches and noise-robust acoustic modelling techniques for ASR. Participants' results [5] evidenced the effectiveness of signal enhancement approaches, mostly based on "beamforming", combined with the use of hybrid acoustic models based on deep neural network hidden Markov models (DNN-HMMs) [19, 33, 47, 39]. The effectiveness of acoustic modelling based on context-dependent DNN-HMMs was also demonstrated in several works dealing with applications spanning from mobile voice search [8] to the transcription of broadcast news and YouTube videos [19], conversational (at the telephone or in live scenarios) speech recognition [41] and ASR in noisy environments [42].
In [22], we observed a significant WER reduction on the CHiME-3 real test data by retraining the baseline DNN on the evaluation set itself and using the automatic transcriptions resulting from a first decoding pass to align acoustic observations with DNN outputs. After this "unsupervised" DNN retraining step (unsupervised as it relies on automatic transcriptions, which are not revised by a human) we achieved a WER reduction from 20.2% to 15.5%. (This performance was achieved using filter-bank log-energies as acoustic features, as included in a previous CHiME-3 ASR baseline used in the competition held in 2015. A stronger baseline, which employs speaker-normalized features, has been successively delivered; this stronger baseline is the one used in the experiments documented in this paper.) These positive results motivate the research proposed in this work, which further explores unsupervised techniques for the "self" adaptation of a DNN (i.e. by tuning its parameters on the same test set we are currently trying to recognize) with an improved, automatically generated supervision. In particular, the full DNN retraining step adopted in [22] is now substituted by a more sophisticated solution, which enhances the adaptation with effective instance weighing and selection criteria.
At its core, our adaptation method is similar to the one described in [55], which adds to the objective function to optimize a regularization term based on the Kullback-Leibler divergence (KLD) between the original (non adapted) and the current DNN output distribution. However, departing from Yu et al.'s method, we explore different ways to enhance the process with automatic ASR quality predictions. In particular, building on the outcomes of previous research on automatic quality estimation for ASR [34, 23], we focus on two alternative solutions. The first one is based on weighing the KLD regularization term with coefficients that depend on the predicted quality of each transcribed sentence. The second one is based on filtering the adaptation set by removing the utterances that, in terms of predicted quality, seem to be less reliable.
(The reason for choosing KLD regularization is that it can be implemented by directly constraining the target output DNN probabilities, making it possible to easily integrate the approach in existing open-source ASR toolkits like KALDI [38].)
Our experiments are organized in two rounds. In the first one, we perform DNN adaptation in cross conditions. In this setting, the manually transcribed development set is distinct from the evaluation set and it is used for a supervised DNN adaptation process. The available supervision allows us to compare the resulting performance with the results achieved when automatic (hence less accurate) transcriptions are used in place of the manual ones. With an oracle-based sentence selection (obtained by using the manual transcriptions as references to calculate the true sentence WER) we observe significant performance improvements when subsets of good quality adaptation instances are used instead of the whole data. This finding supports the intuition that automatic techniques to estimate transcription quality (e.g. by filtering out the transcriptions with higher WER) can be used to inform the adaptation process and increase its robustness.
In the second round of experiments we switch to self DNN adaptation in homogeneous conditions, in which the adaptation is performed with automatic transcriptions of the same evaluation data we are currently trying to recognise (i.e. the real test set used in the CHiME-3 challenge). In this scenario we achieve the most interesting results, with a relative WER reduction of 11.7% (from 15.4% to 13.6%) obtained with the automatic sentence-level WER predictions returned by our ASR QE component. Besides this, another interesting result of our experiments is that the ASR performance gain achieved through self DNN adaptation has to be mostly attributed to the automatic selection of the adaptation utterances rather than to weighing the KLD regularization term, which has a limited effect on the overall performance. To complete our analysis, the usefulness of the proposed QE-based adaptation method is verified not only with filter-bank features, but also with feature normalization via maximum likelihood linear regression (fMLLR) transformations, which characterize the best performing systems in the CHiME-3 challenge [20, 54] as well as the most recent baseline. This is an interesting result since, while DNN adaptation has already proven to be effective with filter-bank features (see Section 2 for a review of DNN adaptation approaches), the benefits yielded by adaptation using acoustic features (speaker-)normalized through fMLLR transformations are still questionable.
To the best of our knowledge, this paper represents the first investigation of the use of QE-based sentence selection for unsupervised DNN adaptation in the framework of ASR decoding. Overall, its main contributions include:
• A new application of the ASR QE procedure described in [34] to predict the WER of automatic transcription hypotheses;
• An extension of the KLD regularization approach for unsupervised DNN adaptation [55], which could be easily integrated in the KALDI speech recognition toolkit [38];
• Significant improvements over the strong, most recent CHiME-3 baseline.
All the experiments described in this paper have been carried out with the TranscRater open-source tool described in [24].
The paper is organized as follows. In Section 2 we summarize relevant previous works related to our research. In Section 3 we describe our approach to DNN adaptation. In Sections 4 and 5 we present the adopted automatic WER prediction method and the ASR system architectures.
After the description of our experimental setup in Section 6, our results are presented in Section 7 and discussed in Section 8.
2. Related work
Detailed overviews about the CHiME challenges of years 2011 (CHiME-1), 2013 (CHiME-2) and 2015 (CHiME-3) can be respectively found in [6, 52, 5]. While the first round in 2011 was mostly focused on a speech separation task, the last two editions addressed large vocabulary speech recognition in noisy environments. The last one, in particular, involved many participants addressing a variety of topics such as noise reduction, de-reverberation, speaker/noise adaptation, system combination and rescoring with long-spanning LMs. Our submission [22] mostly focused on three aspects: i) the automatic selection of the best channel, ii) DNN retraining and iii) rescoring of word lattices with a linear combination of 4-gram LMs and RNNLMs. As mentioned in the introduction, the significant improvements achieved with unsupervised DNN retraining motivated us to further investigate the DNN adaptation issue.
In the past, several adaptation techniques have been proposed for artificial neural networks employed in ASR hybrid systems. These are mostly based on the estimation of linear transformations of their input, output or hidden units [14, 1, 35, 28, 43]. Feature discriminative linear regression (fDLR) [40] and output-feature discriminative linear regression (oDLR) [53] are other approaches specifically investigated for DNN adaptation. Regardless of the layer to which the transformation is applied, in all the mentioned approaches only the weights of the linear transformation are updated in order to optimize an objective function computed on the adaptation data. In this way, the probability that the DNN model overfits the adaptation data is reduced. Note that both fMLLR and fDLR are linear transformations applied to the input features; the difference between the two lies in the estimation criterion they adopt. fMLLR maximizes the likelihood of the adaptation observations, while fDLR optimizes a discriminative criterion computed on the same adaptation observations (e.g. it minimizes the mean squared error between the target and the actual output-state network distribution).
A variant of fDLR is described in [21], which proposes to adapt the DNN parameters within a maximum a posteriori (MAP) framework. Basically, the method consists in adding to the objective function to optimize a term representing the prior density of the linear transformation weights. This approach was demonstrated to be equivalent to L2 norm regularization [29] if the prior distribution of the transformation weights is assumed to be Gaussian N(0, I). In general, adding a regularization term to the objective function proved to be effective to reduce model overfitting. An excellent review of "conservative training" approaches for artificial neural networks can be found in [2]. The use of a momentum term to update the DNN weights, the use of small values for the learning rate, as well as of an early stopping criterion, can also be considered as adaptation methods.
In [55], the Kullback-Leibler divergence (KLD) between the original unadapted distribution of the DNN outputs and the related distribution estimated on the adaptation set is considered as a regularization term. As reported in [55], this approach, also employed in our work, allowed obtaining significant WER reductions compared to fDLR transformation of the input features on two different tasks: voice search and lecture transcription.
The weight assigned to the regularization term in the objective function is an important parameter to choose when using regularized learning.
The use of fMLLR features in combination with hybrid DNN-HMMs has been studied in [36]. On a private clean speech evaluation set, the authors observed that: i) filter-bank features and fMLLR features achieved comparable performance, and ii) only the combination of the two types of features, either at an early or late fusion stage, provided significant WER reductions. These results are somehow in contrast with those obtained in the CHiME-3 challenge, in which fMLLR normalization gives significant improvements compared to speaker-independent filter-bank features. However, we have to consider that participants in the CHiME-3 challenge experimented on noisy data and did not apply any automatic speaker diarization module, since both utterance segmentation and speaker labels were manually checked.
An approach for unsupervised speaker adaptation of DNNs using fMLLR features is also reported in [48]. The authors propose to train speaker-dependent amplitude parameters associated to the hidden units of the network, obtaining significant performance improvements on the recognition of English TED talks.
In the context of speaker-adaptive training (SAT) via fMLLR [12], recently proposed approaches make use of i-vectors [26] as speaker representations to perform acoustic feature normalization. In [32] an adaptation neural network is trained to convert i-vectors into speaker-dependent linear shifts which, in turn, are used to generate speaker-normalized features for SAT-DNN training/decoding. The work reported in [13] proposes to process HMM-based i-vectors with specific hidden layers of a DNN before combining them with the hidden layers processing standard acoustic features. The work reported in [25] proposes to incorporate prior statistics (derived from gender clustering of the training data) into i-vector estimation, showing significant performance improvements when the approach is used for DNN adaptation of a hybrid ASR system.
The automatic selection of training data for acoustic modelling in speech recognition has been previously addressed in the context of lightly supervised training [27] and active learning approaches [18, 10]. The use of confidence measures for improving MLLR transformations has also been investigated by [37] in a German conversational speech recognition task. The authors showed significant WER reductions, for an ASR system based on a Gaussian Mixture Model (GMM), by removing the low-confidence frames from the adaptation data. More recently, [49] proposed an automatic sentence selection method based on different types of confidence measures for the semi-supervised training of DNNs in a low-resource setting.
The use of QE as a quality prediction method alternative to confidence estimation is inspired by previous research on QE for machine translation [31, 44, 50, 45]. In the ASR field it has been first proposed in [34]. In such previous work, the objective was to bypass the dependency of confidence estimation on knowledge about the inner workings of the decoder that produces the transcriptions and, in turn, to avoid the risk of biased (often overestimated) quality estimates [9].
In [34] ASR QE is explored as a supervised regression problem in which the WER of an utterance transcription has to be automatically predicted. The extensive experiments in different testing conditions discussed in [34] indicate that regression models based on Extremely Randomized Trees (XRT) [15] can achieve competitive performance, being able to outperform strong baselines and to approximate the true WER scores computed against reference transcripts. In [46], our basic approach was refined in order to achieve robustness to large differences between training and test data. The proposed domain-adaptive approach based on multitask learning was intrinsically evaluated on multi-domain data, achieving good results both in regression and in classification mode. In order to explore the possible applications of ASR QE, in [23] we proposed its use for successfully improving hypothesis combination with ROVER [11]. Finally, in [24], we described TranscRater, our recently released open-source ASR QE tool. The tool is the one used for the experiments described in this paper.
3. KL-divergence based regularization
The DNNs considered in this work estimate the posterior probability of an output unit s_i associated to a HMM output probability density function (PDF). The state posterior probability p[s_i | o_t], being o_t an observation at time t, is then converted into a PDF using the following Bayes formula:

p[o_t | s_i] = \frac{p[s_i | o_t] \, p[o_t]}{p[s_i]}, \qquad 1 \le i \le I \qquad (1)

where I is the total number of output PDFs and p[o_t] is discarded since it does not depend on the state.
A possible criterion for estimating the weights and biases of the DNN is to minimize over a training sample the cross-entropy C(\hat{p}, p) between a target distribution \hat{p} and the estimated one:

C(\hat{p}, p) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{i=1}^{I} \hat{p}[s_i | o_t] \log p[s_i | o_t] \qquad (2)

where T is the total number of frames in the training utterances. Usually, the entries \hat{p}[s_i | o_t] in the target distribution are obtained by forced alignment using an existing ASR system and assume the value of 1 over the aligned states.
The usual way to adapt a DNN trained on a large set of data (e.g. some thousands of hours of speech), given a much smaller set of adaptation data (e.g. some minutes of speech, uttered by a new speaker or recorded in a new acoustic environment), is to retrain the DNN over the adaptation set. This approach assumes that either manual or automatic transcriptions of the adaptation sentences are available but, due to the limited size of the adaptation set, plain retraining can easily overfit it. To reduce this risk, following [55] the cross-entropy is regularized with the KLD between the original (unadapted) output distribution and the adapted one, which leads to the following objective function:

D(p^*, p) = (1 - \alpha) \, C(\hat{p}, p) - \frac{\alpha}{N} \sum_{t=1}^{N} \sum_{i=1}^{I} p^*[s_i | o_t] \log p[s_i | o_t] \qquad (3)

where N is the number of adaptation frames, p^*[s_i | o_t] is the posterior probability computed with the original DNN and \alpha is the regularization coefficient. As reported in [55], Equation 3 can be rewritten as follows:

D(p^*, p) = -\frac{1}{N} \sum_{t=1}^{N} \sum_{i=1}^{I} P[s_i | o_t] \log p[s_i | o_t] \qquad (4)

where

P[s_i | o_t] = (1 - \alpha) \, \hat{p}[s_i | o_t] + \alpha \, p^*[s_i | o_t], \qquad 0 \le \alpha \le 1 \qquad (5)

In this form, Equation 4 is the cross-entropy between a new target distribution P and the current probability distribution p. The new target distribution is obtained as a linear interpolation of the original distribution p^* and the distribution \hat{p} computed via forced alignment with the adaptation data. Note that, in Equation 5, a value of \alpha = 0 is equivalent to doing a "pure" retraining of the DNN over the adaptation data (i.e. completely trusting them), while a value of \alpha = 1 means that the output probability distribution of the adapted DNN is forced to follow that of the original DNN (i.e. completely trusting the original model). Usually, the value of \alpha is estimated on a development set, together with the value of the learning rate, and does not change across the test utterances. What one can expect is that the optimal value of \alpha is close to 0 when the size of the adaptation set is large and the transcriptions of the adaptation sentences are not affected by errors (i.e. in supervised conditions). Otherwise, when the size of the adaptation set is small and/or its transcription can be affected by errors (i.e. in the case of unsupervised adaptation), the optimal value of \alpha should increase.
It is worth remarking that the original DNN, producing the distribution p^* used in the above equations, could have been trained by optimizing a criterion different from cross-entropy minimization. This is actually the approach used in this study (see Section 5 for details on baseline DNN training).
Finally, unlike other methods, note that KLD-based regularization binds directly the DNN output probabilities rather than the model parameters. In this way, the method can be easily implemented with any software tool based on back-propagation (e.g.
the KALDI toolkit), without introducing any modification.

3.1. Soft DNN adaptation

Experiments in [55] have shown a dependency of the optimal value of \alpha in Equation 5 on the size of the adaptation data. However, as mentioned above, one could also expect that the optimal value of \alpha depends on the quality of the supervision. Starting from this intuition, here we propose to compute \alpha on a sentence basis, as a function of sentence WER estimates. To this end, we take advantage of previous research we conducted on ASR quality estimation (WER prediction) and word error detection.
In principle, we could simply use as sentence-dependent regularization coefficient the following value: \alpha(k) = WER^{pred}_k, 1 \le k \le K, where 0 \le WER^{pred}_k \le 1 is the predicted WER of the k-th sentence and K is the total number of adaptation sentences. However, note that in doing this, if the value of K is small and WER^{pred}_k \cong 0 for all k, the original distribution p^* in Equation 5 is weighted by \alpha \cong 0 (i.e. we completely trust the adaptation data), augmenting the risk that the adapted DNN overfits the adaptation data. To avoid this effect we can simply add a bias to the sentence WER estimate as follows:

\alpha(k) = \beta + (1 - \beta) \times WER^{pred}_k, \qquad 1 \le k \le K, \quad 0 \le \beta \le 1 \qquad (6)

A value of \beta = 0 gives \alpha(k) = WER^{pred}_k, i.e. the regularization coefficient depends only on the sentence transcription quality. A value of \beta = 1 gives \alpha(k) = \beta, i.e. the regularization coefficient remains fixed over all adaptation sentences (this is the case of Equation 5). Therefore, optimizing over \beta allows us to control the trade-off between the quality of the supervision and the size of the adaptation set.
We refer to the DNN adaptation method based on Equation 6 as "soft" adaptation (in which the coefficients vary sentence by sentence), in contrast with the "hard" DNN adaptation approach based on Equation 5 (in which the coefficients are fixed).
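To make Equations 4-6 concrete, the following is a minimal sketch (not the KALDI-based recipe actually used in our experiments) of how the interpolated targets and the sentence-dependent regularization coefficient can be computed; the array shapes and the cross-entropy comment are illustrative assumptions.

import numpy as np

def soft_alpha(pred_wer, beta):
    """Equation 6: alpha(k) = beta + (1 - beta) * WER_pred_k, with 0 <= beta <= 1."""
    return beta + (1.0 - beta) * np.clip(pred_wer, 0.0, 1.0)

def kld_targets(hard_targets, original_posteriors, alpha):
    """Equation 5: P = (1 - alpha) * p_hat + alpha * p_star, frame by frame.
    hard_targets: [frames x states] one-hot rows from forced alignment (p_hat).
    original_posteriors: [frames x states] outputs of the unadapted DNN (p_star)."""
    return (1.0 - alpha) * hard_targets + alpha * original_posteriors

# Adaptation then minimizes the cross-entropy of Equation 4 against these targets, e.g.:
#   P = kld_targets(p_hat, p_star, soft_alpha(pred_wer_of_sentence, beta))
#   loss = -np.mean(np.sum(P * np.log(dnn_posteriors + 1e-10), axis=1))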
4. ASR quality estimation
The simplest approach to roughly estimate transcription quality (without reference transcripts) is to consider sentence confidence scores, which describe how certain the system is about the quality of its own hypotheses. Sentence confidence scores can be computed by averaging the confidence of the words in the best output string. Such information, however, often reflects a biased perspective influenced by individual ASR decoder features. Indeed, confidence scores are usually close to the maximum value, thus shifting the predicted WER (computed as 1 - confidence) to scores that are close to zero.
To obtain more objective and reliable sentence-level WER predictions, in [34] we proposed ASR quality estimation as a supervised regression method that effectively exploits a combination of "glass-box" and "black-box" features. Glass-box features, similar to confidence scores, capture information inherent to the inner workings of the ASR system that produced the transcriptions. The black-box ones, instead, are extracted by looking only at the signal and the transcription. On one side, they try to capture the difficulty of transcribing the signal while, on the other side, they try to capture the plausibility of the output transcriptions. In both cases, the information used is independent from knowledge about the ASR system, making ASR QE applicable to a wide range of scenarios in which the only elements available for quality prediction are the signal and the transcription.
In this paper, we trained XRT-based models [15] with a combination of 41 ASR (glass-box) and textual (black-box) features. The ASR features are extracted from the confusion network (CN) [30] derived from the word lattices generated by the ASR decoder (the one employed in this work is based on the KALDI toolkit [38]), while the textual features are the same of [34]. Table 1 provides the complete list of the features used. In our experiments, the regressors are respectively trained and tested on the CHiME-3 dt05_real and et05_real transcriptions described in Section 6.1. Their parameters, such as the number of bags, the number of trees per bag and the number of leaves per tree, are tuned to minimize the mean absolute error (MAE) between the true and predicted WER scores using k-fold cross-validation on the dt05_real data.

ASR (9): from each CN bin: the log of the first word posterior (1), the log of the first word posterior of the previous/next bin (2), the mean/std/min/max of the log posteriors in the bin (4), whether the first word of the previous/next bin is silence (2).
Sentence level (10): from each transcribed sentence: number of words (1), LM log probability (1), LM log probability of part of speech (POS) (1), log perplexity (1), LM log perplexity of POS (1), percentage (%) of numbers (1), % of tokens which do not contain only "[a-z]" (1), % of content words (1), % of nouns (1), % of verbs (1).
Word level (22): from each transcribed word: part-of-speech tag/score of the previous/current/next words (6), RNNLM probabilities given by models trained on in-domain/out-of-domain data (2), in-domain/out-of-domain 4-gram LM probability (2), number of phoneme classes including fricatives, liquids, nasals, stops and vowels (5), number of homophones (1), number of lexical neighbors (heteronyms) (1), binary features answering the three questions "is the current word a stop word?"/"is the current word before/after repetition?"/"is the current word before/after silence?" (5).

Table 1: Features (41 in total) for sentence-level WER prediction.
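As an illustration of the regression setup just described (Table 1 features fed to an XRT regressor tuned by cross-validation), a minimal sketch based on scikit-learn's ExtraTreesRegressor could look as follows. This is a stand-in for the XRT implementation inside TranscRater, not the actual code: the grid values are examples, and the plain KFold splitter is an assumption (the folds used in the experiments are partitioned so as to avoid speaker or sentence overlaps).

import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import GridSearchCV, KFold

def train_wer_regressor(X_dev, y_dev, n_folds=8, seed=0):
    """X_dev: [n_sentences x 41] feature matrix; y_dev: true sentence WERs in [0, 1]."""
    grid = {
        "n_estimators": [100, 300, 500],   # number of trees (stands in for bags x trees/bag)
        "min_samples_leaf": [1, 5, 10],    # indirectly controls the number of leaves per tree
    }
    search = GridSearchCV(
        ExtraTreesRegressor(random_state=seed),
        grid,
        scoring="neg_mean_absolute_error",  # minimize MAE between true and predicted WER
        cv=KFold(n_splits=n_folds, shuffle=True, random_state=seed),
    )
    search.fit(X_dev, y_dev)
    return search.best_estimator_

# Usage: model = train_wer_regressor(X_dt05, wer_dt05); pWER = model.predict(X_et05)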
5. ASR system
The architecture of our ASR system is depicted in Figures 1 and 2. In the former one, it uses fMLLR normalized features; in the latter one, it uses filter-bank features. The system is mainly based on the KALDI CHiME-3 v2 package (derived from the ASR system described in [20]), with the addition of a second decoding pass that performs unsupervised DNN adaptation as described in Section 3.
In our submission to CHiME-3 [22], we reached the best performance on the evaluation set, et05_real, with a simple delay-and-sum (DS) beamforming consisting in uniform weighting of the rephased signals of the 5 frontal microphones. A similar approach, although based on the well known BeamformIt toolkit [4], is also included in the recent software package implementing the CHiME-3 baseline. Hence, in order to comply with the baseline, for this work we used BeamformIt to implement signal enhancement.
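As an illustration of the simple DS scheme mentioned above (uniform weighting of re-phased channels), a minimal sketch could be the following; in the actual experiments BeamformIt is used instead, and the per-channel delays are assumed to be estimated elsewhere (e.g. with a cross-correlation method such as GCC-PHAT). Edge handling is simplified.

import numpy as np

def delay_and_sum(signals, delays):
    """signals: list of 1-D arrays, one per microphone; delays: ints in samples."""
    n = min(len(x) for x in signals)
    shifted = []
    for x, d in zip(signals, delays):
        y = np.roll(x[:n], -d)        # re-phase the channel by its estimated delay
        shifted.append(y)
    return np.mean(shifted, axis=0)   # uniform weights across the channels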
Figure 1: ASR architecture based on the KALDI CHiME-3 v2 package plus ASR QE hypotheses selection and DNN adaptation.
Figure 2: ASR architecture based on the KALDI CHiME-3 package and standard filter-bank features.
After beamforming, both filter-bank and fMLLR features are computed and processed by a corresponding hybrid DNN-HMM system that produces the supervision for adapting the DNN in the final decoding pass.
The employed filter-bank consists of 40 log Mel scaled filters. Feature vectors are computed every 10 ms by using a Hamming window of 25 ms length and are mean/variance normalized on a speaker-by-speaker basis. The baseline DNN is trained using Karel's setup [51] included in the KALDI toolkit. To this aim, the 8,738 training utterances were aligned to their transcriptions by means of the baseline GMM-HMM models. An 11-frame context window (5 frames on each side) is used as input to form a 440-dimensional feature vector.
(The initial GMM system makes use of the KALDI recipe associated to the earlier CHiME challenges [6, 52].)
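As an illustration of the input preparation just described, a minimal sketch is given below: per-speaker mean/variance normalization of 40-dimensional log-Mel features and splicing with an 11-frame context window into 440-dimensional input vectors. The filter-bank extraction itself (25 ms Hamming window, 10 ms shift) is assumed to be done by an external tool such as KALDI's compute-fbank-feats.

import numpy as np

def cmvn_per_speaker(frames):
    """frames: [n_frames x 40] features of all utterances of one speaker."""
    mean = frames.mean(axis=0)
    std = frames.std(axis=0) + 1e-8
    return (frames - mean) / std

def splice(frames, left=5, right=5):
    """Stack each frame with its context: output is [n_frames x 40*(left+1+right)]."""
    padded = np.pad(frames, ((left, right), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + len(frames)] for i in range(left + 1 + right)])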
For training, 13 mel-frequency cepstral coefficients (MFCCs) are computed every 10 ms by using a Hamming window of 25 ms length. These features are mean/variance normalized on a speaker-by-speaker basis, spliced by +/- 3 frames next to the central frame and projected down to 40 dimensions using linear discriminant analysis (LDA) and maximum likelihood linear transformation (MLLT) [38]. Then, a single speaker-dependent fMLLR transform is estimated and applied for speaker adaptive training of triphone HMM-GMMs. The DNN-HMM hybrid systems are built on top of LDA+MLLT+fMLLR features and SAT triphone HMM-GMMs.
During decoding, the LDA+MLLT+fMLLR features are first derived using auxiliary HMM-GMMs. To this end, a preliminary decoding pass with speaker-independent (SI) HMM-GMMs is conducted to produce a word lattice for each input utterance. Then, a single fMLLR transform for each speaker is estimated from sufficient statistics collected from the SI word lattices in order to maximize the likelihood of the acoustic observations given the SAT triphone HMM-GMMs. These transforms are used with the SAT triphone HMM-GMMs to produce new word lattices. A second set of fMLLR transforms is estimated from the new word lattices and combined with the first set of transforms. Finally, the resulting transforms are applied to normalize the features processed by the DNN-HMM hybrid system in the first decoding pass of Figure 1. The training of the corresponding baseline DNN, as well as DNN adaptation by KLD regularization, use the recipe adopted for filter-bank features.
The LM employed in the experiments is the 3-gram LM provided with the CHiME-3 v2 package release, which uses the Kneser-Ney smoothing method for estimating back-off probabilities. It was trained with around 37 million words. After pruning low-frequency words, the vocabulary size is approximately 5,000 words and the perplexity value (measured over dt05_real reference transcriptions) is 119.2.
Finally, although not depicted in the figures, we also run a final rescoring of the n-best lists generated in the second decoding pass with the 5-gram LM and the RNNLM included in the CHiME-3 v2 package.
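To summarize the overall flow of Figures 1 and 2, the following is a hedged, high-level sketch of the two-pass, QE-driven adaptation loop. The callables passed in (decode, qe_features, adapt_dnn_kld) are hypothetical placeholders for the KALDI recipes and the ASR QE component described in Sections 3-5, not real APIs; the threshold and alpha values are example settings, not the tuned ones.

def qe_driven_second_pass(utterances, baseline_dnn, qe_model,
                          decode, qe_features, adapt_dnn_kld,
                          wer_threshold=0.10, alpha=0.3):
    # 1st pass: decode with the baseline DNN to obtain the automatic supervision
    first_pass = [decode(u, baseline_dnn) for u in utterances]

    # ASR QE: predict a sentence-level WER for every first-pass hypothesis
    scored = [(hyp, qe_model.predict([qe_features(hyp)])[0]) for hyp in first_pass]

    # Selection: keep only the "good quality" hypotheses as adaptation data
    adaptation_set = [hyp for hyp, pred_wer in scored if pred_wer < wer_threshold]

    # KLD-regularized adaptation of the baseline DNN on the selected utterances
    adapted_dnn = adapt_dnn_kld(baseline_dnn, adaptation_set, alpha=alpha)

    # 2nd pass: decode the same data with the adapted DNN to produce the final output
    return [decode(u, adapted_dnn) for u in utterances]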
6. Experimental setup

For our experiments, we use the multiple-microphone evaluation data collected for the CHiME-3 challenge, which is publicly available (http://spandh.dcs.shef.ac.uk/chime_challenge/download.html). Complete details about this data set, the overall challenge and its outcomes can be found in the related overview paper [5], which also reports the performance of the 26 participating systems.
Six different microphones, placed on a tablet PC, were used to record sentences of the Wall Street Journal (WSJ) corpus, uttered by different speakers in four different environments (bus, cafe, pedestrian area and street junction). The training corpus consists of 1,600 "real" noisy sentences uttered by 4 speakers, and of 7,138 "simulated" noisy sentences uttered by the 83 speakers forming the WSJ SI-84 training set. Simulated noisy sentences are generated by convolving clean signals with impulse responses of the above mentioned environments and summing the corresponding pre-recorded background noises.
Two evaluation corpora were collected in this scenario: the dt05_real development set, formed by 1,640 sentences uttered by 4 different speakers, and the et05_real test set, formed by 1,320 utterances acquired from 4 other speakers. In addition, two parallel sets of "simulated" noisy utterances (namely dt05_simu and et05_simu) were generated as previously described. There is no speaker overlap between training, development and test sets. The number of utterances in the evaluation corpora is equally distributed among speakers and types of noise, that is, every speaker uttered the same number of sentences in each of the four environments. In both training and evaluation data sets, utterance segmentation was manually checked and the corresponding speaker identity was annotated. Therefore, no automatic speaker diarization module was employed in the experiments. Table 2 shows some statistics of the CHiME-3 training, real development and test sets (the corresponding simulated development and test sets exhibit the same statistics of the real data).

           tr05_simu   tr05_real   dt05_real   et05_real
duration   15h9m       2h54m       2h16m       1h50m

Table 2: Duration of the CHiME-3 training, development and test sets.
All the experiments were conducted on the "real" subsets of the CHiME-3 evaluation data: dt05_real and et05_real. For brevity, henceforth we will respectively refer to them as DT05 and
ET05. We report performance for ASR systems employing both fMLLR normalized features (Figure 1) and filter-bank features (Figure 2). ASR parameters (LM weight, α and β coefficients in Equations 5 and 6 respectively) are tuned on the development set DT05.
The soft adaptation approach described in Section 3 was applied in both "oracle" and "predicted" conditions. Oracle WER scores (oWER henceforth) are computed from reference transcriptions, while predicted WER scores (pWER) are estimated by the ASR QE system described in Section 4. Both values are used as WER estimates in Equation 6 to compute the target probability distribution. The performance achieved by oracle sentence WER represents the upper bound of the soft adaptation approach.
The QE model used for WER prediction is trained and optimized on the development set. The XRT parameters are tuned in 8-fold cross-validation, minimizing the mean absolute error (MAE) between the predicted and the true WER. The partitioning is done to avoid speaker or sentence overlaps between training and test folds.
Table 3 gives the complete list of DNN adaptation experiments we performed. Each experiment is identified by: i) a combination of adaptation/evaluation sets, ii) the supervision used (manual or automatic), and iii) the features employed (filter-bank or fMLLR normalized). For instance, the experiment named DT05+man+fMLLR+ET05 in the first row of the table indicates that the baseline DNN is adapted using
DT05 as adaptation set, the manual supervision, fMLLR features, and the evaluation set is
ET05.

experiment               adaptation set   type of supervision   features type   evaluation set
cross conditions
DT05+man+fMLLR+ET05      DT05             manual                fMLLR           ET05
DT05+man+fbank+ET05      DT05             manual                filter-bank     ET05
DT05+auto+fMLLR+ET05     DT05             automatic             fMLLR           ET05
DT05+auto+fbank+ET05     DT05             automatic             filter-bank     ET05
homogeneous conditions
DT05+auto+fMLLR+DT05     DT05             automatic             fMLLR           DT05
DT05+auto+fbank+DT05     DT05             automatic             filter-bank     DT05
ET05+auto+fMLLR+ET05     ET05             automatic             fMLLR           ET05
ET05+auto+fbank+ET05     ET05             automatic             filter-bank     ET05

Table 3: List of DNN adaptation experiments.

The table is divided in two parts, respectively describing experiments carried out in cross and homogeneous conditions. In cross conditions (first four rows), the adaptation and evaluation sets are distinct. Here, our goal is to compare performance by varying the type of supervision (manual or automatic) of the adaptation data and, in case the automatic supervision is used, by varying the size of the adaptation set according to its quality. In homogeneous conditions (last four rows), the adaptation set coincides with the evaluation set. In this case, our goal is to compare performance achieved by selecting adaptation sets with different levels of quality.
Note that DNN adaptation with manual supervision (first two rows of Table 3) is only meaningful in cross conditions, since we assume it is not available for the evaluation set
ET05 . The automatic supervisions of the adaptation sets (i.e
DT05 or ET05, depending on the experiment type) are produced by the first decoding passes of the ASR systems depicted in Figures 1 and 2. KLD regularization with manual supervision is applied according to Equation 5. Instead, with automatic supervision, both hard (based on Equation 5) and soft (based on Equation 6) DNN adaptation approaches are applied.
Furthermore, it is worth observing that the cross-condition situation fits an "offline" application scenario, where a DNN can be adapted using data and the corresponding automatic transcriptions collected in the field while the ASR system is running. In a subsequent phase, the adapted DNN can be loaded into the ASR system itself.
Finally, as mentioned in Section 1, and as will be shown below, top performance is achieved by properly selecting subsets of the adaptation data. Therefore, similarly to the other tuning parameters, the optimal selection thresholds used to estimate the final performance are computed on the development set
DT05 .
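For reference, the two selection criteria applied in the experiments of Section 7 (keeping the utterances whose predicted or oracle WER is below a threshold, or keeping a fixed number of best-ranked utterances) can be sketched as follows. The scored_utts structure (utterance paired with its WER estimate) is a hypothetical convenience used only for this illustration.

def select_by_threshold(scored_utts, wer_thr=0.10):
    """Keep the utterances whose (predicted or oracle) WER is below a threshold."""
    return [u for u, wer in scored_utts if wer < wer_thr]

def select_top_k(scored_utts, k=600):
    """Keep the k utterances with the lowest (predicted or oracle) WER."""
    ranked = sorted(scored_utts, key=lambda item: item[1])
    return [u for u, _ in ranked[:k]]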
7. Results
In this section we present the experimental results obtained in the different settings outlined in Table 3, varying: i) the conditions (cross or homogeneous), ii) the type of supervision, iii) the size of the adaptation data and iv) the way the adaptation data is selected.
In the analysis and in the subsequent discussion, our WER scores are not compared against those achieved by the best ASR system participating in the CHiME-3 challenge [54], which uses a far more complex architecture for signal pre-processing and cross-system combination, as well as an augmented set of training data. Indeed, implementing such a state-of-the-art system was out of the scope of this work, whose objective is to show the effectiveness of QE-based DNN adaptation in improving the performance of a standard, less complex but still strong ASR system. For this reason, our term of comparison is represented by the reference CHiME-3 baseline, which results in 15.4% WER on ET05 and 8.2% WER on
DT05. Despite the generality of the proposed approach, integrating our method into a state-of-the-art ASR system like the one described in [54] and quantifying the performance gains yielded by QE-based DNN adaptation is left as a possible direction for future work.

7.1. DNN adaptation in cross conditions
In this section we analyse the performance achieved in cross conditions, both with manual and automatic supervision. First, we use all the sentences in
DT05 for adapting the DNN. Then, we show the performance achieved with the automatic supervision provided by adaptation data derived from
DT05 after removing the utterances with the highest WER.

7.1.1. Using all the adaptation utterances
Figure 3a shows the WERs (as functions of the regularization coefficient α in Equation 5) achieved on the evaluation set ET05 by using fMLLR features, both with manual and automatic supervision. In a similar way, the performance reached with filter-bank features is given in Figure 3b. The horizontal line in both figures corresponds to the baseline performance.
[Figure 3 plots WER (%) versus α: (a) DT05+auto+fMLLR+ET05 and DT05+man+fMLLR+ET05; (b) DT05+auto+fbank+ET05 and DT05+man+fbank+ET05.]
Figure 3: WER achieved on evaluation set ET05 as a function of the regularization coefficient α, using DT05 as adaptation set.
As can be seen, the use of manual supervision, or equivalently the supervised adaptation, allows us to improve baseline performance with both types of features. In both cases, there is an intermediate optimal value of α in the interval (0, 1). With the best value, we gain about 1% WER point, indicating the efficacy of the interpolation procedure expressed by Equation 5.
Note the substantial performance reduction in Figure 3b at α = 0 (both for supervised and unsupervised adaptation), suggesting that a data overfitting effect has probably occurred. The same behavior is not observed with supervised adaptation using fMLLR features (Figure 3a), where at α = 0 no significant performance degradation is observed. This result can be explained by considering that fMLLR transformations already reduce the acoustic mismatch between adaptation (DT05) and evaluation (
ET05) sets. This is confirmed by the fact that, in Figure 3b, the curve labelled
DT05+man+fbank+ET05 is shifted towards the right part of the graph more than the corresponding curve
DT05+man+fMLLR+ET05 in Figure 3a, meaning that the adaptation procedure trusts the fMLLR normalized features more than the filter-bank ones. (As explained in Section 3, a value of α = 0 corresponds to completely ignoring the contribution of the original DNN output distribution in the construction of the cross-entropy function, i.e. we completely trust the adaptation data, while a value α = 1 forces the DNN parameters to follow those of the original distribution, i.e. we completely trust the original model.) Referring to the same Figure 3a, data overfitting (at α = 0) instead occurs with unsupervised adaptation, as if the errors in the supervision acted similarly to an acoustic mismatch between adaptation and evaluation sets. Based on these outcomes, we decided to investigate the effects of reducing the errors in the automatic transcription.
To check the possible impact of automatic transcription errors in the supervision, we extracted from the adaptation set
DT05 the utterances whose true WER, computed from the reference transcriptions, is lower than 10%. Then, we adapted the baseline DNN with the hard approach and by varying the value of the regularization coefficient α. The results are shown in Figures 4a and 4b. For comparison purposes, the two figures also include the same curves of Figures 3a and 3b related to the use of the whole adaptation set. As can be seen, the selection of adaptation utterances with WER <
10% produces curves that approach those obtained using manual supervision, showing the benefits of reducing the transcription errors in the supervision.
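For clarity, the oracle sentence WER used for this kind of selection is the standard word-level edit distance between hypothesis and reference, normalized by the reference length; a minimal sketch (an illustration, not the scoring tool actually used to obtain the reported numbers) is the following.

def sentence_wer(hyp, ref):
    """hyp, ref: lists of words; returns (S + D + I) / len(ref)."""
    h, r = len(hyp), len(ref)
    d = [[0] * (h + 1) for _ in range(r + 1)]
    for i in range(r + 1):
        d[i][0] = i
    for j in range(h + 1):
        d[0][j] = j
    for i in range(1, r + 1):
        for j in range(1, h + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[r][h] / max(len(ref), 1)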
[Figure 4 plots WER (%) versus α: (a) DT05+auto+fMLLR+ET05, DT05+man+fMLLR+ET05 and DT05+auto+fMLLR+ET05 (oWER<10%); (b) DT05+auto+fbank+ET05, DT05+man+fbank+ET05 and DT05+auto+fbank+ET05 (oWER<10%).]
Figure 4: WER achieved on evaluation set
ET05 as a function of the regularization coefficient α, using as adaptation set the subset of DT05 with oWER ≤ 10%.

7.2. DNN adaptation in homogeneous conditions

In this section we report and discuss the performance achieved in homogeneous conditions with automatic supervision. First, we use the whole available set of adaptation utterances. Then, by applying ASR QE for WER prediction, we experiment with different subsets of the data selected according to their estimated quality.

7.2.1. Using all the adaptation utterances
The experiments were conducted by performing two decoding passes, as explained in Section 5. The transcriptions resulting from the first pass, based on the baseline DNNs, provide the supervision for the following adaptation steps. Then, the adapted DNNs are exploited in the second decoding pass to produce the final transcriptions. Performance results achieved on both development and evaluation sets, with hard and soft adaptation, are given in Table 4 (in parentheses, we show the absolute WER reduction with respect to baseline performance). In the case of soft adaptation, both oracle and automatically-predicted sentence WERs were tested. Similarly to the experiments in Figure 3, we measured performance as a function of the coefficient α. We also carried out the same set of experiments evaluating performance as a function of the coefficient β defined in Equation 6. However, for reasons of compactness, we do not provide the whole set of results and Table 4 only refers to the top WER values achieved.

experiment code          HARD         SOFT (oWER)   SOFT (pWER)
                         adaptation   adaptation    adaptation
DT05+auto+fMLLR+DT05     8.0 (0.2)    7.9 (0.3)     8.0 (0.2)
DT05+auto+fbank+DT05     9.5 (1.6)    9.2 (1.9)     9.3 (1.8)
ET05+auto+fMLLR+ET05     14.5 (0.9)   14.3 (1.1)    14.4 (1.0)
ET05+auto+fbank+ET05     17.7 (3.2)   17.1 (3.8)    17.6 (3.3)

Table 4: %WER achieved by unsupervised DNN adaptation in homogeneous conditions.

First of all, differently from the performance shown in Figures 3a and 3b, the experiments in homogeneous conditions do not exhibit clear minimum values of the corresponding WER. Basically, no significant WER variations are observed for α and β coefficients ranging over a broad interval; the values reported in Table 4 are obtained with α = β = 0.7, while for (α, β) > 0.7 the WER tends to move back towards the baseline.
In Table 4 it is worth noting the significant WER reductions, compared to baseline performance, yielded by filter-bank features on both
DT05 and
ET05. Although similar performance gains are not observed with fMLLR features, especially on
DT05 (as just pointed out above, probably due to their capability of reducing the acoustic mismatch between training and testing conditions), these results confirm the effectiveness of the two-pass decoding method. (These experiments were motivated by the significant performance improvement obtained in [22] using "full" retraining of the DNN in a two-pass ASR architecture.) Note also that no substantial advantages are brought by the soft adaptation approach compared to the hard one. (To see examples of this trend of performance the reader can refer to Figures 5a, 5b, 6a and 6b, which report the WER scores achieved with hard adaptation and fMLLR features in homogeneous conditions; specifically, refer to the curves obtained without automatic sentence selection, respectively DT05+auto+fMLLR+DT05 and ET05+auto+fMLLR+ET05.) Despite this fact, it is worth observing the very close results obtained in Table 4 with oracle (oWER) and predicted (pWER) sentence WERs.
Summing up, the outcomes of the experiments discussed so far are that: i) DNN adaptation in homogeneous conditions with two passes of decoding, using the whole set of adaptation utterances, yields performance improvements, ii) automatic utterance selection based on oracle WER values is effective in cross conditions, suggesting, together with the outcome above, to repeat the corresponding experiment in homogeneous conditions using ASR QE, and iii) no significant performance gain can be achieved with the soft adaptation method based on Equation 6. The latter result suggests investigating measures for expressing sentence quality that are alternative to the sentence WER used in Equation 6. In addition, the weighing of output probabilities in Equation 5 could be applied at a finer granularity than that of the sentence, e.g. at the level of words or even of single frames. Though interesting, a formal verification of these hypotheses is out of the scope of this paper and is left for future work.

[Figure 5 plots WER (%) versus α: (a) DT05+auto+fMLLR+DT05 and its oWER-based selections of 300, 600, 900 and 1,200 utterances; (b) the corresponding pWER-based selections.]
Figure 5: WER achieved with oracle (oWER) and ASR QE (pWER) selection of adaptation utterances, on the development set DT05, as functions of the regularization coefficient α.
For conciseness, in the next set of experiments we report performance results only for fMLLR normalized features, since they are the most effective ones. However, the same, and even more evident, trends were also observed using filter-bank features. Figures 5a and 5b report the performance achieved on
DT05 using subsets of adaptation utterances of different size. The utterances of the development set
DT05 were sorted according to the WER resulting from the first decoding pass. For sorting, we used both oracle WER values and WER predictions obtained with the ASR QE approach described in Section 4. We extracted from
DT05 four adaptation sets, respectively containing the "best" 300, 600, 900 and 1,200 utterances. The various subsets, together with their automatic transcriptions, were used to adapt the baseline DNN by means of the hard approach. The reason for putting thresholds on the size of the adaptation set to compute our results lies in the fact that we want to do a fair comparison between the two selection methods adopted (oWER and pWER). In fact, sentence selection according to a preassigned WER threshold produces unbalanced adaptation sets of different sizes in correspondence to the application of each of the two methods. Figures 6a and 6b were derived, similarly to Figures 5a and 5b, from the evaluation corpus ET05.
[Figure 6 plots WER (%) versus α: (a) ET05+auto+fMLLR+ET05 and its oWER-based selections of 300, 600 and 900 utterances; (b) the corresponding pWER-based selections.]
Figure 6: WER achieved with oracle (oWER) and ASR QE (pWER) selection of adaptation utterances, on the evaluation set ET05, as functions of the regularization coefficient α.
From the above figures, the efficacy of using only subsets of mid-high quality transcriptions for adapting the DNNs employed in the second decoding pass is evident. Indeed, in each figure the minimum WER is reached with an optimal pair of values (α, K), where K is the size of the adaptation set. This value is 900 for DT05 (Figures 5a and 5b) and 600 for
ET05 (Figures 6a and 6b). The total improvement with respect to i) the baseline performance and ii) the performance achieved using the whole set of adaptation utterances is remarkable. The difference in the optimal values of K for DT05 and
ET05 is probably due to the different size of the two corpora (
DT05 contains 1,640 utterances,
ET05 contains 1,320 utterances). Unsurprisingly, the performance achieved with the ASR QE approach is lower than the upper-bound results obtained with oracle WER estimates. However, especially on the evaluation corpus
ET05, the improvements over the baseline are considerable.
Note that, in all figures, the optimal values of α result to be quite low, ranging in the interval [0.1 - 0.3].

[Figure 7 plots WER (%) versus α: (a) DT05+auto+fMLLR+DT05 with oWER thresholds of 5%, 10%, 15% and 20%; (b) the same selections based on pWER thresholds.]
Figure 7: WER achieved with oracle (oWER) and ASR QE (pWER) selection of adaptation utterances, on the development set DT05, varying the WER thresholds.
Figure 7a shows the performance achieved on the
DT05 corpus with hard DNN adaptation as a function of the α coefficient, varying the thresholds applied to oWER values to select the adaptation utterances. Figure 7b, instead, shows the performance reached when pWER estimates are employed. Also in this case, the performance improvements with respect to the baseline, with both oWER and pWER, are evident. The optimal values for the pair (oWER_thr, α) (Figure 7a) resulted to be oWER_thr = 10%, α = 0.1, where oWER_thr indicates the selection threshold. Similarly, using pWER estimates the corresponding optimal values are pWER_thr = 10%, α = 0.3. With pWER, the higher value of α compared to the value resulting from the use of the oWER approach (α = 0.1) is probably due to errors in the automatic WER predictions that have to be compensated.
Table 5 gives the final performance achieved on both
DT05 and
ET05 using the two-pass decoding approach and the automatic selection of adaptation utterances by means of automatically predicted WERs. For both sets, the optimal values of the pairs (α, WER_thr) are those estimated on the
DT05 development corpus, i.e. (α = 0.1, oWER_thr = 10%) and (α = 0.3, pWER_thr = 10%).

            DT05                 ET05
            fMLLR   fbank        fMLLR   fbank
baseline    8.2     11.1         15.4    20.9
oWER        7.1     7.7          12.4    14.2
pWER        7.5     8.3          13.6    15.0

Table 5: %WER achieved in homogeneous conditions using the optimal parameter pairs (α, WER_thr) estimated on
DT05.
Table 5 confirms the effectiveness of the proposed two-pass, QE-based adaptation approach when QE parameter optimization is carried out on the development set. In the table it can be noticed that, although filter-bank features exhibit higher WER than the fMLLR normalized ones, after the unsupervised adaptation procedure the performance gap is significantly reduced (less than 2% absolute WER on
ET05). In all cases, the small differences between the performance yielded by the use of oracle WERs and the corresponding predicted WERs are noteworthy.
The results in Table 5 are noticeable, considering that they outperform those given by a strong ASR baseline, implemented with state-of-the-art ASR technologies, i.e.: BeamformIt for speech enhancement, hybrid DNN-HMMs for acoustic modeling and speaker-dependent fMLLR transformations for acoustic model adaptation.
Table 6 illustrates the results achieved on
ET05 with the LM rescoring procedure released in the updated CHiME-3 recipe. This procedure rescores the final word lattices produced in the second decoding pass in two consecutive steps: first by using a 5-gram LM, then by means of a linear combination of a 5-gram LM and a RNNLM.

         3-gram   5-gram   RNNLM
oWER     12.4     10.8     9.9
pWER     13.6     11.9     10.9

Table 6: %WER achieved, in homogeneous conditions on
ET05, with automatic data selection and using the baseline LM rescoring passes (see [20]).
The significant performance gains demonstrate the additive effect of LM rescoring over DNN adaptation, allowing us to reach a significant 10.9% WER on
ET05. (See http://spandh.dcs.shef.ac.uk/chime_challenge/results.html for the official results of the challenge.)

8. Discussion

The results achieved so far allow us to claim that, regardless of the type of acoustic features employed in the experiments (filter-bank or fMLLR normalized):
a) The benefits yielded by KLD-based regularization, compared with DNN retraining without any regularization, are limited. This is probably due to the fact that the size of the adaptation sets considered in our experiments is large enough to prevent data overfitting (actually, previous research on KLD regularization [55] demonstrates its effectiveness using only a few minutes of adaptation data);
b) The presence of errors in the automatic transcription of the adaptation data is detrimental, especially when DNN adaptation is carried out in homogeneous conditions. In fact, comparing the results in the last two rows of Table 4 (achieved by using the whole ET05 corpus as adaptation set) with those in Table 5 (obtained by using a subset of adaptation utterances with "few" transcription errors), we notice, in oracle conditions, absolute WER reductions of around 2% with fMLLR and 4% with filter-bank features. Coherent WER reductions of around 1% and 3% are also achieved when applying our ASR QE-based selection method. This demonstrates the effectiveness of the proposed QE-informed approach for DNN adaptation.
It is worth pointing out that, till now, we have only considered KLD regularization for implementing DNN adaptation. However, as mentioned in Section 2, several previous works proposed alternative approaches based on the use of a single linear transformation, which can be applied either to the input or the output layer of the network. Therefore, in order to assess the effectiveness and the general applicability of the proposed QE-based approach, we also experimented with the output-feature discriminative linear regression (oDLR) transformation, in a way similar to that described in [53]. The results obtained in homogeneous conditions, both with and without ASR QE, are given in Table 7 (for comparison purposes, the baseline performance is also reported in the table). Similarly to the results shown in Table 5, the optimal thresholds for both oracle and predicted sentence WER values are empirically estimated on
DT05. The resulting values for oWER_thr and pWER_thr are respectively 10% and 20%.

               DT05                 ET05
               fMLLR   fbank        fMLLR   fbank
baseline       8.2     11.1         15.4    20.9
oDLR           7.9     9.6          13.8    17.5
oDLR+oWER      7.4     9.2          13.0    16.8
oDLR+pWER      7.7     9.5          13.6    17.2

Table 7: %WER achieved in homogeneous conditions with oDLR-based adaptation, without using ASR QE (oDLR), using utterance selection based on oracle WERs (oDLR+oWER) and on predicted WERs (oDLR+pWER).

As shown in the table, the use of oDLR alone (even without ASR QE) always results in noticeable improvements over the baseline. The considerable WER reductions measured in oracle conditions (oDLR+oWER), however, indicate the high potential of a QE-driven selection of the adaptation utterances also with this simpler DNN adaptation method. In general, the performance improvements are smaller than the corresponding results for KLD regularization reported in Table 5. Such lower results can be explained by the findings reported in [16], in which the authors compared approaches based on MLLR and maximum a posteriori probability (MAP) for GMM-HMM adaptation: the impact of errors in the supervision is directly proportional to the number of transformation parameters to estimate. Indeed, while in the experiments reported in Section 7 all the parameters of the original DNN are adapted, with oDLR only a small fraction of them (around 13%) is updated. The reduced sensitivity to errors in the supervision is also reflected by the higher value of the threshold used to select the adaptation data (20% for oDLR vs 10% for KLD). The results measured in oracle conditions suggest a higher potential for the application of QE to KLD-based regularization rather than to oDLR. This intuition, however, is partially contradicted by the last row of Table 7 (oDLR+pWER). With predicted WER scores, indeed, the values achieved with fMLLR are only slightly worse than or identical to those in Table 5. To put this unexpected "exception" in the results into perspective, it is worth remarking that the impact of QE in DNN adaptation is proportional to the acoustic mismatch between training and test data. As observed in Sections 7.1.1 and 7.2.1, fMLLR features have the capability to reduce such mismatch, making the gains brought by QE-based adaptation less evident than those achieved with filter-bank features. In light of this, although on
ET05 and with fMLLR features oDLR is competitive with the more complex KLD-based regularization proposed in this paper, we believe that more challenging data (featuring a higher mismatch between training and test) would increase the distance between the two approaches and reward our method. A comparison between different DNN adaptation methods across multiple datasets featuring variable degrees of acoustic mismatch is definitely an interesting direction for future research.
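To make the selection criterion discussed in this section concrete, the following minimal sketch (our illustration, not the code used in the experiments) shows how adaptation utterances could be filtered according to a sentence-level WER estimate. The predict_wer regressor and the data structures are hypothetical placeholders; the default threshold corresponds to the pWER_thr value tuned on DT05 for oDLR.

```python
# Minimal sketch of QE-driven selection of adaptation utterances (illustrative
# only; function and variable names are hypothetical, not the authors' code).

def select_adaptation_utterances(utterances, predict_wer, pwer_thr=0.20):
    """Return the utterances whose predicted sentence WER is below pwer_thr.

    utterances : iterable of (utt_id, features, first_pass_hypothesis) tuples
    predict_wer: callable returning a WER estimate in [0, 1] for a hypothesis
                 (e.g. a regressor trained on transcription-quality features)
    pwer_thr   : selection threshold tuned on the development set (DT05);
                 0.20 was used for oDLR and 0.10 for KLD regularization
    """
    selected = []
    for utt_id, feats, hypothesis in utterances:
        if predict_wer(feats, hypothesis) <= pwer_thr:
            # the first-pass hypothesis serves as (noisy) supervision
            selected.append((utt_id, feats, hypothesis))
    return selected
```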
9. Conclusions
In this paper, we proposed to exploit automatic Quality Estimation (QE) of ASR hypotheses to perform the unsupervised adaptation of a deep neural network modeling acoustic probabilities. We developed our approach motivated by the following two hypotheses:

1. The adaptation process does not necessarily require the supervision of a manually-transcribed development set. Manual supervision can be replaced by a two-pass decoding procedure, in which the evaluation data we are currently trying to recognise are automatically transcribed and used to inform the adaptation process;

2. The whole process can benefit from methods that take into account the quality of the supervision. In particular, automatic quality predictions can be used either to weigh the adaptation instances or to discard the less reliable ones.

To implement our approach, we retrained a (baseline) DNN by minimizing an objective function defined as a linear combination of the usual cross-entropy measure (evaluated on a given adaptation set) and a regularization term, namely the Kullback-Leibler divergence between the output distribution of the original DNN and the current output distribution.

First, we experimented in “cross conditions”, by adapting on the real development set of the CHiME-3 challenge and testing on the corresponding real evaluation set. In this scenario, we found that, when using all the manually-transcribed adaptation data, the KLD-based approach is effective. Then, moving to the automatically-generated supervision of the adaptation data, we discovered a correlation between performance results and the quality of the adaptation data. In particular, in “oracle” conditions (i.e. with true WER scores), DNN adaptation benefits from removing utterances with a WER score above a given threshold.

Building on this result, we focused on “self” DNN adaptation in “homogeneous conditions”, in which the baseline DNN is adapted on the same evaluation set (
ET05) by exploiting the automatic supervision derived from a first ASR decoding pass. Similarly to the cross-condition scenario, this approach allowed us to significantly improve the performance when “low quality” sentences (i.e. sentences that exhibit oracle WERs higher than an optimal threshold) are removed from the adaptation set. Improvements were measured not only in “oracle” conditions (i.e. with true WER scores), but also in realistic conditions in which manual references are not available and the only viable solution is to rely upon predicted WERs. To this aim, building on previous positive results on quality estimation for ASR [34, 46, 23], we used automatic WER prediction as a criterion to isolate subsets of the adaptation data featuring variable quality. The results of an extensive set of experiments allowed us to conclude that:

• Exploiting ASR QE for DNN adaptation in a two-pass decoding architecture yields significant performance improvements over the strong, most recent CHiME-3 baseline;

• Self DNN adaptation is more effective with filter-bank acoustic features than with fMLLR normalized features. This behavior is probably due to the smaller mismatch between training and test data resulting from the fMLLR transformations, and indicates a higher potential of the QE-driven approach in scenarios where fMLLR is less effective at reducing such mismatch (e.g. with small adaptation sets);

• ASR QE is less effective with the output discriminative linear regression (oDLR) transformation for DNN adaptation, due to the lower number of parameters to adapt compared to KLD regularization. This demonstrates the portability of our method, while suggesting a higher effectiveness when a larger number of DNN parameters is adapted.

Finally, we applied the LM rescoring procedure delivered with the CHiME-3 baseline to the word lattices produced after the second, DNN-adapted, decoding pass. The resulting WER reductions demonstrate the independent effects of LM rescoring and of the proposed DNN adaptation approach. Our full-fledged system for DNN adaptation, integrating KLD regularization and ASR QE for data selection, allows us to outperform the strong CHiME-3 baseline with a 1.7% absolute WER reduction (from 12.6% to 10.9%).

Some interesting directions for future work already emerged in the course of this research. One is to further explore the portability of the proposed ASR QE approach by integrating it into other state-of-the-art ASR systems. To validate our working hypotheses, in this paper we started from the strong CHiME-3 baseline; now it would be interesting to test its effectiveness within more powerful DNNs (e.g. capable of modeling time dependencies among acoustic observations, such as bidirectional recurrent neural networks [17, 3]) having a higher number of parameters to adapt. Another interesting direction is to investigate ways to express hypotheses’ quality at a granularity finer than that of the sentence, e.g. at the level of words or even single frames. To do this, we also plan to replace the “ad hoc” formula of Equation 6, used to weigh the KLD regularization term in Equation 4, by jointly optimizing both the DNN weights and the sentence-dependent regularization coefficients. Based on the successful results obtained in the specific CHiME-3 application framework (read speech acquired by multiple microphones in noisy conditions), we also plan to extend our approach to other domains, possibly featuring higher degrees of acoustic mismatch, for which the KLD-based regularization proposed in this paper seems to have the highest potential.
Finally, an interesting direction to investigate is “incremental” DNN adaptation, where the DNN is periodically adapted on speech utterances and related transcriptions stored after they have been processed by the ASR system. This application scenario reflects the cross-condition experimental situation defined in Section 6.2 and, based on our current results, represents a promising and natural extension of this research.
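As a final illustration of the adaptation objective summarized in this section, the sketch below shows one possible PyTorch formulation of the KLD-regularized cross-entropy loss: the cross-entropy on the adaptation data is combined with the KL divergence between the output distribution of the original (frozen) DNN and that of the adapted DNN. This is a schematic rendering under our own assumptions, not the original implementation; in particular, a single scalar weight rho is used here, whereas in the paper the regularization coefficient is made sentence-dependent through the QE-based formula of Equation 6.

```python
# Schematic PyTorch rendering (not the original implementation) of a
# KLD-regularized adaptation loss: (1 - rho) * CE + rho * KL(p_original || p_adapted).

import torch
import torch.nn.functional as F

def kld_adaptation_loss(adapted_logits, original_logits, targets, rho=0.5):
    """adapted_logits : outputs of the DNN being adapted, shape (batch, senones)
    original_logits: outputs of the frozen, non-adapted DNN on the same frames
    targets        : senone labels obtained by aligning the first-pass hypotheses
    rho            : regularization weight in [0, 1]
    """
    ce = F.cross_entropy(adapted_logits, targets)
    # KL divergence between the original output distribution (target) and the
    # current one (input); it pulls the adapted DNN towards the original model.
    kld = F.kl_div(F.log_softmax(adapted_logits, dim=-1),
                   F.softmax(original_logits, dim=-1),
                   reduction="batchmean")
    return (1.0 - rho) * ce + rho * kld

# Example usage (random tensors only, for shape checking):
# logits_a = torch.randn(8, 2000, requires_grad=True)
# logits_o = torch.randn(8, 2000)
# labels = torch.randint(0, 2000, (8,))
# loss = kld_adaptation_loss(logits_a, logits_o, labels, rho=0.5)
```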
References

[1] Abrash, V., Franco, H., Sankar, A., Cohen, M., 1995. Connectionist Speaker Normalization and Adaptation, in: Proc. of Interspeech, Madrid, Spain. pp. 2183–2186.
[2] Albesano, D., Gemello, R., Laface, P., Mana, F., Scanzio, S., 2006. Adaptation of Artificial Neural Networks Avoiding Catastrophic Forgetting, in: Proc. of International Joint Conference on Neural Networks, Vancouver, Canada. pp. 2863–2870.
[3] Amodei, D., et al., 2016. Deep Speech 2: End-to-End Speech Recognition in English and Mandarin, in: Proc. of International Conference on Machine Learning, New York, USA.
[4] Anguera, X., Wooters, C., Hernando, J., 2007. Acoustic Beamforming for Speaker Diarization of Meetings. IEEE Transactions on Audio, Speech, and Language Processing 15, 2011–2022.
[5] Barker, J., Marxer, R., Vincent, E., Watanabe, S., 2015. The third 'CHiME' Speech Separation and Recognition Challenge: Dataset, task and baselines, in: Proc. of IEEE ASRU Workshop, Scottsdale, Arizona, USA.
[6] Barker, J., Vincent, E., Ma, N., Christensen, H., Green, P., 2013. The PASCAL CHiME speech separation and recognition challenge. Computer Speech and Language 27, 621–633.
[7] Brandstein, M., Ward, D., 2001. Microphone Arrays: Signal Processing Techniques and Applications. Springer.
[8] Dahl, G., Yu, D., Deng, L., Acero, A., 2012. Context Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition. IEEE Trans. on Audio Speech and Language Processing 20, 30–42.
[9] Evermann, G., Woodland, P.C., 2000. Large Vocabulary Decoding and Confidence Estimation Using Word Posterior Probabilities, in: Proc. of ICASSP, Istanbul, Turkey. pp. 2366–2369.
[10] Facco, A., Falavigna, D., Gretter, R., Vigano, M., 2006. Design and Evaluation of Acoustic and Language Models for Large Scale Telephone Services. Speech Communication 48, 176–190.
[11] Fiscus, J.G., 1997. A Post-processing System to Yield Reduced Word Error Rates: Recognizer Output Voting Error Reduction (ROVER), in: Proc. of the IEEE Workshop on Automatic Speech Recognition and Understanding, Santa Barbara, CA, USA. pp. 347–354.
[12] Gales, M.J.F., 1998. Maximum Likelihood Linear Transformations for HMM-based Speech Recognition. Computer Speech and Language 12, 75–98.
[13] Garimella, S., Mandal, A., Strom, N., Hoffmeister, B., Matsoukas, S., Parthasarathi, S.H.K., 2015. Robust i-vector based Adaptation of DNN Acoustic Model for Speech Recognition, in: Proc. of Interspeech, Dresden, Germany.
[14] Gemello, R., Mana, F., Scanzio, S., Laface, P., Mori, R.D., 2007. Linear Hidden Transformations for Adaptation of Hybrid ANN/HMM Models. Speech Communication 49, 827–835.
[15] Geurts, P., Ernst, D., Wehenkel, L., 2006. Extremely Randomized Trees. Machine Learning 63, 3–42.
[16] Gollan, C., Bacchiani, M., 2008. Confidence Scores for Acoustic Model Adaptation, in: Proc. of ICASSP, Las Vegas, Nevada, USA. pp. 4289–4292.
[17] Graves, A., Jaitly, N., 2014. Towards End-to-End Speech Recognition with Recurrent Neural Networks, in: Proc. of International Conference on Machine Learning, Beijing, China. pp. 1764–1772.
[18] Hakkani-Tur, D., Riccardi, G., Gorin, A., 2002. Active Learning for Automatic Speech Recognition, in: Proc. of ICASSP, Orlando, FL, USA. pp. 3904–3907.
[19] Hinton, G., Deng, L., Yu, D., Wang, Y., 2012. Deep Neural Networks for Acoustic Modeling in Speech Recognition. IEEE Signal Processing Magazine 29, 82–97.
[20] Hori, T., Chen, Z., Erdogan, H., Hershey, J.R., Roux, J.L., Mitra, V., Watanabe, S., 2015. The MERL/SRI System for the 3rd CHiME Challenge using Beamforming, Robust Feature Extraction, and Advanced Speech Recognition, in: Proc. of the IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). pp. 475–481.
[21] Huang, Z., Li, J., Siniscalchi, M., Chen, I., Weng, C., Lee, C., 2014. Feature Space Maximum a Posteriori Linear Regression for Adaptation of Deep Neural Networks, in: Proc. of Interspeech, Singapore. pp. 2992–2996.
[22] Jalalvand, S., Falavigna, D., Matassoni, M., Svaizer, P., Omologo, M., 2015a. Boosted Acoustic Model Learning and Hypotheses Rescoring on the CHiME-3 Task, in: Proc. of the IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Scottsdale, Arizona, USA. pp. 409–415.
[23] Jalalvand, S., Negri, M., Falavigna, D., Turchi, M., 2015b. Driving ROVER With Segment-based ASR Quality Estimation, in: Proc. of ACL, Beijing, China. pp. 1095–1105.
[24] Jalalvand, S., Negri, M., Turchi, M., C. de Souza, J.G., Falavigna, D., Qwaider, M.R.H., 2016. TranscRater: a Tool for Automatic Speech Recognition Quality Estimation, in: Proc. of ACL-2016 System Demonstrations, Berlin, Germany. pp. 43–48.
[25] Karanasou, P., Gales, M.J.F., Woodland, P.C., 2015. I-vector Estimation using Informative Priors for Adaptation of Deep Neural Networks, in: Proc. of Interspeech, Dresden, Germany. pp. 2872–2876.
[26] Kenny, P., Oullet, P., Dehak, N., Gupta, V., Dumouchel, P., 2008. A Study of Interspeaker Variability in Speaker Verification. IEEE Transactions on Audio, Speech, and Language Processing 16, 980–988.
[27] Lamel, L., Gauvain, J., Adda, G., 2001. Investigating Lightly Supervised Acoustic Model Training, in: Proc. of ICASSP, Salt Lake City, USA. pp. 477–480.
[28] Li, B., Sim, K., 2010. Comparison of Discriminative Input and Output Transformation for Speaker Adaptation in the Hybrid NN/HMM Systems, in: Proc. of Interspeech, Makuhari, Japan. pp. 526–529.
[29] Li, X., Bilmes, J., 2006. Regularized Adaptation of Discriminative Classifiers, in: Proc. of ICASSP, Toulouse, France.
[30] Mangu, L., Brill, E., Stolcke, A., 2000. Finding Consensus in Speech Recognition: Word Error Minimization and Other Applications of Confusion Networks. Computer Speech and Language 14, 373–400.
[31] Mehdad, Y., Negri, M., Federico, M., 2012. Match without a Referee: Evaluating MT Adequacy without Reference Translations, in: Proc. of the Machine Translation Workshop (WMT2012), Montréal, Canada. pp. 171–180.
[32] Miao, Y., Zhang, H., Metze, F., 2015. Speaker Adaptive Training of Deep Neural Network Acoustic Models using I-vectors. IEEE/ACM Transactions on Audio, Speech, and Language Processing 23.
[53] Yao, K., Yu, D., Seide, F., Su, H., Deng, L., Gong, Y., 2012. Adaptation of Context-Dependent Deep Neural Networks for Automatic Speech Recognition, in: Proc. of SLT, Miami, Florida, USA.
[54] Yoshioka, T., Ito, N., Delcroix, M., Ogawa, A., Kinoshita, K., Fujimoto, M., Yu, C., Fabian, W., Espi, M., Higuchi, T., Araki, S., Nakatani, T., 2015. Advances in speech enhancement and recognition for mobile multi-microphone devices, in: Proc. of the IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). pp. 436–443.
[55] Yu, D., Yao, K., Su, H., Li, G., Seide, F., 2013. KL-Divergence Regularized Deep Neural Network Adaptation for Improved Large Vocabulary Speech Recognition, in: Proc. of ICASSP, Vancouver, Canada. pp. 7893–7897.