Multimodal Integration for Large-Vocabulary Audio-Visual Speech Recognition
Wentao Yu, Steffen Zeiler, Dorothea Kolossa
Institute of Communication Acoustics, Ruhr University Bochum, Germany
{wentao.yu, steffen.zeiler, dorothea.kolossa}@rub.de

Abstract—For many small- and medium-vocabulary tasks, audio-visual speech recognition can significantly improve the recognition rates compared to audio-only systems. However, there is still an ongoing debate regarding the best combination strategy for multi-modal information, which should allow for the translation of these gains to large-vocabulary recognition. While an integration at the level of state-posterior probabilities, using dynamic stream weighting, is almost universally helpful for small-vocabulary systems, in large-vocabulary speech recognition, the recognition accuracy remains difficult to improve. In the following, we specifically consider the large-vocabulary task of the LRS2 database, and we investigate a broad range of integration strategies, comparing early integration and end-to-end learning with many versions of hybrid recognition and dynamic stream weighting. One aspect, which is shown to provide much benefit here, is the use of dynamic stream reliability indicators, which allow hybrid architectures to profit strongly from the inclusion of visual information whenever the audio channel is distorted even slightly.
Index Terms—Audiovisual Speech Recognition, Multi-modal Integration, Dynamic Stream Weighting
I. INTRODUCTION
Large Vocabulary Continuous Speech Recognition (LVCSR) remains difficult as a lipreading task, because many pairs of phonemes correspond to a single viseme, making many pairs of words almost indistinguishable to a vision-only system, as for example "do" and "to". Due to this intrinsic difficulty, an integration of lipreading into speech recognition becomes difficult in large- or open-vocabulary applications [1]. Nonetheless, lip-reading gives a great benefit to human listening [2]. In this work, we use an exemplary large-vocabulary dataset, the LRS2 corpus described in [3], to test whether and how similar benefits are attainable for automatic systems.

Many studies have shown that video information can dramatically improve small-vocabulary speech recognition performance when the audio signals are recorded in a noisy environment. Often, stream weighting proved to be an effective method to combine audio and video information. As in [4], separate models are then trained for each of the modalities, and possibly for each of the feature streams per modality. Stream weighting is realized through a weighted combination of the DNN state posteriors of each modality:

\log \tilde{p}(s \mid o_t) = \sum_i \lambda_{i,t} \cdot \log p(s \mid o_{i,t}),   (1)

where \log p(s \mid o_{i,t}) is the log-posterior of state s in stream i and \log \tilde{p}(s \mid o_t) is its estimated combined log-posterior. The stream weights \lambda_{i,t} are typically predicted by appropriate reliability measures. In most state-based multi-modal integration studies, only two streams with few reliability measures are used. For example, in [5] the weights are estimated from an entropy estimate alone. In this work, we consider a broad range of possible reliability metrics and apply them to fuse the information of three models, one acoustic and two visual.

Overall, we compare the performance of three different integration methods. The paper is organized as follows: Section II discusses the differences between end-to-end and hybrid speech recognition models. Early integration and state-based integration are discussed in Section III. Different reliability indicators are introduced in Section IV. Section V explains the experimental setup, while Section VI shows the results. Finally, in Section VII, we discuss the overall performance of all systems and give an outlook on future work.

This project has received funding from the German Research Foundation DFG under grant number KO3434/4-2.
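To make Eq. (1) concrete, here is a minimal NumPy sketch that combines per-stream log-posteriors with frame-wise weights; all array shapes and weight values are purely illustrative, since the actual weights come from the reliability-driven integration network of Sec. III:

    import numpy as np

    rng = np.random.default_rng(0)
    T, S = 4, 6  # toy sizes: 4 frames, 6 states
    # three posterior streams (audio, video appearance, video shape), rows sum to one
    streams = [rng.dirichlet(np.ones(S), size=T) for _ in range(3)]
    # frame-wise stream weights lambda_{i,t}, here drawn at random and summing to one
    lam = rng.dirichlet(np.ones(3), size=T)
    # Eq. (1): weighted sum of log-posteriors per frame and state
    log_p_tilde = sum(lam[:, i:i + 1] * np.log(streams[i]) for i in range(3))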
II. END-TO-END VS. HYBRID MODELS

In recent years, end-to-end acoustic speech recognition has quickly gained widespread popularity. In its original form, this model predicts character sequences directly from the audio signal. Different from the end-to-end model, in a hybrid uni-modal speech recognition system, an acoustic model is trained to calculate log-pseudo-posteriors log p(s | o_t). A decoder then uses these pseudo-posteriors to obtain the best word sequence by graph search through a language model [6].

While the hybrid model has the disadvantage of higher complexity compared to end-to-end learning, many recent papers have still shown superior performance of hybrid ASR compared to end-to-end recognition, for example in [7]. For our application of multi-modal recognition, hybrid frameworks are advantageous for multi-modal fusion, because they allow for an integration at the level of the pseudo-posteriors, and for using reliability information, which has proven beneficial for multimodal integration in many studies, see e.g. [8]–[10]. End-to-end models, in contrast, typically use an attention mechanism rather than reliabilities for the multi-modal fusion, which is also the case in the baseline models that we consider [11], [12]. Recently, [13] has extended the work of [11] and improved the performance by using a loss function that explicitly considers facial action units.

Fig. 1: Audio-visual fusion strategy for audio and two streams of different video models. (The figure shows the reliability measures σ^A_t, σ^VA_t, σ^VS_t entering a stream integration net, whose output weights λ^A_t, λ^VA_t, λ^VS_t scale the log-posteriors log p(s|o^A_t), log p(s|o^VA_t), log p(s|o^VS_t) before summation and exponentiation to yield p̃(s|o_t).)

III. SYSTEM OVERVIEW
A. System framework
As shown in Fig. 1, a set of reliability measures σ^A_t, σ^VA_t, σ^VS_t (described in more detail in Sec. IV) is used as the input for the stream integration net, which uses these to obtain weights λ_{i,t} for all streams over time. We then fuse the audio and video models through Eq. (1).

For training the stream integration net, we carry out forced alignment on the clean audio training set to obtain target state sequences p*(s | o_t), in which the reference state probability is set to one and all other state probabilities are set to zero. Two state-based loss functions are employed here, the cross-entropy (CE)

CE = -\frac{1}{T} \sum_{t=1}^{T} \sum_{s=1}^{S} p^*(s \mid o_t) \cdot \log \tilde{p}(s \mid o_t),   (2)

and the mean squared error (MSE)

MSE = \frac{1}{T \cdot S} \sum_{t=1}^{T} \sum_{s=1}^{S} \big( p^*(s \mid o_t) - \tilde{p}(s \mid o_t) \big)^2.   (3)

We also consider maximum mutual information (MMI) as a sequence-based criterion [14] via

\frac{\partial F_{MMI}}{\partial \log \tilde{p}(s \mid o_t)} = \kappa \big( p^*(s \mid o_t) - \gamma^{DEN}_t(s) \big).   (4)

Here, γ^{DEN}_t(s) is the posterior probability of state s at time t, computed over the denominator lattices that are obtained from the state pseudo-posteriors p̃(s | o_t), and κ is the acoustic scaling factor.
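The two frame-wise losses of Eqs. (2) and (3) are simple to compute once the combined posteriors from Eq. (1) are available. A minimal NumPy sketch, with illustrative variable names that are not taken from our implementation:

    import numpy as np

    def state_losses(log_p_tilde, target_states):
        """log_p_tilde: (T, S) combined log-posteriors from Eq. (1);
        target_states: (T,) state indices from forced alignment."""
        T, S = log_p_tilde.shape
        p_star = np.zeros((T, S))
        p_star[np.arange(T), target_states] = 1.0            # one-hot targets p*(s|o_t)
        ce = -np.sum(p_star * log_p_tilde) / T               # Eq. (2)
        mse = np.sum((p_star - np.exp(log_p_tilde)) ** 2) / (T * S)  # Eq. (3)
        return ce, mse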
B. Oracle weight baseline

To get an estimate of the best achievable word error rate (WER), we use convex optimization via CVX [15], [16] to optimize the cross-entropy in Eq. (2), which yields a set of oracle stream weights.
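The oracle optimization was done with the Matlab package CVX [15], [16]. For illustration, a per-frame version of this convex problem can be written with the Python package cvxpy; the simplex constraint on the weights and the renormalization of the combined posterior are our assumptions, since the exact problem formulation is not spelled out above:

    import cvxpy as cp
    import numpy as np

    def oracle_weights(log_posts, target):
        """log_posts: (num_streams, S) array of per-stream log-posteriors
        for one frame; target: forced-alignment state index."""
        lam = cp.Variable(log_posts.shape[0], nonneg=True)
        combined = lam @ log_posts                 # Eq. (1) for this frame
        # cross-entropy of the renormalized combined posterior vs. the one-hot target
        loss = cp.log_sum_exp(combined) - combined[target]
        cp.Problem(cp.Minimize(loss), [cp.sum(lam) == 1]).solve()
        return lam.value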
C. Early integration baseline
Early integration fuses audio and video information directly at the input of the system, using stacked feature vectors via

o_t = [(o^A_t)^T, (o^VS_t)^T, (o^VA_t)^T]^T,   (5)

where o^A_t are audio features, o^VS_t are shape-based video features, and o^VA_t are appearance-based video features, described in more detail in Sec. V-C, and T denotes vector transposition. As the audio and video features have different frame rates, we use a digital differential analyzer, similar to Bresenham's algorithm [17], to synchronize the video features before applying Eq. (5).
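One simple way to realize such a DDA-style synchronization is to accumulate the frame-rate ratio in integer arithmetic and repeat each video frame for all audio frames that fall inside it. The sketch below is an illustrative variant under that assumption, not the exact scheme of the recognizer; inputs are NumPy arrays of shape (frames, feature_dim):

    import numpy as np

    def upsample_video(video_feats, num_audio_frames):
        """Map each audio frame to its nearest preceding video frame
        with a Bresenham-like integer index accumulation."""
        num_video_frames = len(video_feats)
        idx = (np.arange(num_audio_frames) * num_video_frames) // num_audio_frames
        return video_feats[idx]

    def early_integration(audio, video_shape, video_app):
        """Stack per-frame features as in Eq. (5) after rate synchronization."""
        T = len(audio)
        return np.hstack([audio, upsample_video(video_shape, T),
                          upsample_video(video_app, T)])

With 10 ms audio frames and 40 ms video frames, for example, this repeats each video feature vector for four consecutive audio frames.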
D. End-to-end baselines

In addition to the early integration baseline, we compare the performance of our suggested hybrid audiovisual ASR to end-to-end models with attention mechanisms [11], [12], which offer an alternative approach to multimodal fusion. As described in [12], both audio and video encoders are LSTM networks. The decoder is an LSTM transducer [18], which uses the encoded audio and video sequences, either via a dual attention mechanism [12], or, in [11], using a multi-head attention mechanism that is specifically optimized towards audio-visual integration.
IV. RELIABILITY MEASURES
As shown in Tab. I, the proposed reliability measures can be grouped into the ones that are model-based (MB) and the signal-based (SB) measures. For the model-based measures, the audio and video models are considered separately. The signal-based measures can also be subdivided into audio-based (AB) and video-based (VB) measures.

TABLE I: Proposed reliability measures

Model-based (MB):
  Entropy; Dispersion; Posterior difference; Temporal divergence; Entropy and dispersion ratio

Signal-based (SB), audio-based (AB):
  MFCC; ∆MFCC; SNR; Signal and noise energy; Soft VAD

Signal-based (SB), video-based (VB):
  IDCT; Image distortion
A. Model-based reliability measures
The entropy is a proxy for the model's uncertainty about the state s, given the current observation o_t. It is calculated for each stream i via

H_{i,t} = -\sum_{s=1}^{S_i} p(s \mid o_{i,t}) \cdot \log p(s \mid o_{i,t}),   (6)

with S_i as the number of states in stream model i.

Similarly, the dispersion is related to the decoder's discriminative power. It is computed by

D_{i,t} = \frac{2}{K(K-1)} \sum_{l=1}^{K-1} \sum_{m=l+1}^{K} \log \frac{\hat{p}(l \mid o_{i,t})}{\hat{p}(m \mid o_{i,t})},   (7)

where the probabilities p are sorted in descending order to obtain p̂. K is set to 15.

The K-largest posterior difference, defined via

Diff_{i,t} = \frac{1}{K-1} \sum_{k=2}^{K} \log \frac{\hat{p}(1 \mid o_{i,t})}{\hat{p}(k \mid o_{i,t})},   (8)

is also considered, showing the average ratio between the largest posterior and the next K-1 values.

The temporal divergence is computed as the Kullback-Leibler divergence between two posterior vectors p(s | o_{i,t}) and p(s | o_{i,t+∆t}), i.e.

Div^i_{∆t}(t) = D_{KL}\big( p(s \mid o_{i,t}) \,\|\, p(s \mid o_{i,t+∆t}) \big),   (9)

with ∆t as a fixed time offset. As described in [19], the mean of the temporal divergence is also an interesting measure of reliability; it is used here by averaging Div^i_{∆t}(t) over fixed-length segments.

The entropy ratio is described in [20]. The strongly related dispersion ratio ω^i_{D,t} is estimated based on the average dispersion D̄_t:

ω^i_{D,t} = \frac{\tilde{D}_{i,t}}{\sum_{k \in \{A,VA,VS\}} \tilde{D}_{k,t}},   (10)

where

\tilde{D}_{i,t} = \begin{cases} 0 & D_{i,t} < \bar{D}_t \\ D_{i,t} & D_{i,t} \ge \bar{D}_t \end{cases}.   (11)

A, VA, and VS represent the audio, video appearance and video shape stream, respectively. D_{i,t} is obtained from Eq. (7).
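The per-frame, per-stream measures of Eqs. (6)-(8) can be computed directly from a posterior vector. A compact NumPy sketch (a straightforward reading of the equations, not the recognizer's code):

    import numpy as np

    def model_based_reliabilities(p, K=15):
        """p: (S,) posterior vector of one stream at one frame.
        Returns entropy (Eq. 6), dispersion (Eq. 7) and the
        K-largest posterior difference (Eq. 8)."""
        eps = 1e-12
        entropy = -np.sum(p * np.log(p + eps))
        log_top = np.log(np.sort(p)[::-1][:K] + eps)  # log of the K largest, descending
        # mean pairwise log-ratio of the K largest posteriors
        dispersion = 2.0 / (K * (K - 1)) * sum(
            log_top[l] - log_top[m] for l in range(K - 1) for m in range(l + 1, K))
        diff = np.mean(log_top[0] - log_top[1:])      # Eq. (8)
        return entropy, dispersion, diff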
B. Signal-based reliability measures

The first 5 MFCC coefficients and their temporal derivatives, ∆MFCC, are related to the audio quality.

The estimated Signal-to-Noise Ratio (SNR) also represents the quality of the audio signal; it is computed in each frame via

SNR_t = 10 \log_{10} \left( \frac{S_t}{N_t} \right).   (12)

The signal energy S_t is estimated as the sum of squared amplitudes of the Hamming-windowed frame t. The noise energy N_t is estimated by a variant of the minima-controlled recursive averaging algorithm (MCRA-2) [21].

The ratio between the energy of the speech band and the total energy of each frame is used as a soft voice-activity detection (soft VAD) cue.

The first 5 Inverse Discrete Cosine Transform (IDCT) coefficients of the mouth region represent low-level image properties.

The image distortion measures comprise the lighting condition, the degree of blurring and the head pose, all computed as in [22]. The lighting condition represents the mean brightness of the image. The degree of blurring is estimated as the variance of the image after high-pass filtering. To obtain an indicator for head rotation and tilt, the cross-correlation between the original image and its horizontally mirrored version is computed.
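The three image distortion cues can be approximated in a few lines; the sketch below follows the verbal descriptions above (the box-filter high-pass and the normalized cross-correlation are our choices, and the exact filters of [22] may differ):

    import numpy as np
    from scipy.signal import convolve2d

    def image_distortion_measures(gray):
        """gray: 2-D array, grayscale mouth-region image scaled to [0, 1]."""
        brightness = gray.mean()                      # lighting condition
        # high-pass filtering as difference from a local (box-filtered) mean
        k = 5
        lowpass = convolve2d(gray, np.ones((k, k)) / k**2,
                             mode='same', boundary='symm')
        blur = np.var(gray - lowpass)                 # low variance -> strong blurring
        # normalized cross-correlation with the horizontally mirrored image
        a = gray - gray.mean()
        b = a[:, ::-1]
        pose = np.sum(a * b) / (np.sum(a**2) + 1e-12)
        return brightness, blur, pose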
V. EXPERIMENTAL SETUP
A. Dataset
Our experiments are based on the LRS2 corpus. The training set contains 45,839 spoken sentences and 17,660 words, with a test set of 1,243 sentences and 1,698 words. To analyze the performance in different acoustic noise conditions, we have artificially created noisy versions of the LRS2 database. The augmentation recipe from Kaldi's Voxceleb example is employed for this purpose, using the MUSAN corpus [23] as the noise dataset. It contains 3 different kinds of noise, i.e. ambient noise, music and babble noise. Seven different SNRs are selected, from -9 dB to 9 dB in steps of 3 dB. Each audio signal is augmented with these three noise types, and the SNR is randomly chosen from all SNRs.
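The core of such an augmentation is scaling a noise signal to a target SNR before adding it to the speech. An illustrative recreation of this step (the actual Kaldi Voxceleb recipe handles it internally):

    import numpy as np

    def mix_at_snr(speech, noise, snr_db):
        """Add noise to speech at a given SNR in dB; signals are 1-D arrays."""
        noise = np.resize(noise, speech.shape)  # loop or truncate noise to match length
        p_speech = np.mean(speech ** 2)
        p_noise = np.mean(noise ** 2) + 1e-12
        scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
        return speech + scale * noise

    snrs = list(range(-9, 10, 3))  # the seven SNR conditions: -9 ... 9 dB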
B. Implementation details

All hybrid recognition experiments are carried out using the Kaldi toolkit [24], with the training set of the LRS2 corpus employed in the training of the acoustic and visual models. The initial HMM-GMM training follows the standard Kaldi AMI recipe; subsequent HMM-DNN training uses the nnet2 recipe. The output dimension of all three models is 3784. For performance reasons, the audio model alignments are also used for the HMM-DNN training of both video models. The integration model has 5 hidden layers, with 43, 25, 17, 10 and 3 units, respectively, each using ReLU activation functions. The output is the predicted weight of each stream, λ_{i,t}. A sigmoid function is used to limit the weights to values between 0 and 1. Finally, we normalize the multi-modal posterior probabilities to one at each time t. To avoid overfitting, early stopping is used if the training loss does not improve for 1200 iterations. The stream integration model is trained on the development set and performance is tested on the evaluation set.
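A sketch of this integration network in PyTorch is given below. The description above leaves some wiring open; we assume the input size equals the number of reliability measures and read the final 3-unit layer, followed by the sigmoid, as the output that produces the three stream weights:

    import torch
    import torch.nn as nn

    class StreamIntegrationNet(nn.Module):
        """Layers of 43/25/17/10/3 ReLU units, sigmoid-limited stream weights."""
        def __init__(self, num_reliabilities):
            super().__init__()
            dims = [num_reliabilities, 43, 25, 17, 10, 3]
            layers = []
            for d_in, d_out in zip(dims[:-1], dims[1:]):
                layers += [nn.Linear(d_in, d_out), nn.ReLU()]
            layers[-1] = nn.Sigmoid()  # weights in (0, 1) for the three streams
            self.net = nn.Sequential(*layers)

        def forward(self, sigma, log_posts):
            """sigma: (T, num_reliabilities) reliability inputs;
            log_posts: (T, 3, S) per-stream log-posteriors."""
            lam = self.net(sigma)                               # (T, 3) stream weights
            fused = torch.einsum('ti,tis->ts', lam, log_posts)  # Eq. (1)
            p = torch.exp(fused)
            return p / p.sum(dim=-1, keepdim=True)              # normalize per frame t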
C. Feature extraction

The audio model uses 13-dimensional MFCCs as features. MFCCs are extracted with 25 ms frame size and 10 ms frame shift. The video frame is 40 ms long without overlap. The mouth region is detected via OpenFace [25]. The video appearance model (VA) uses 43-dimensional IDCT coefficients of the mouth region in the grayscale image as features. The video shape model (VS) uses the 34-dimensional non-rigid shape parameters [25] as features.
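Selecting low-order 2-D cosine-transform coefficients of the mouth region can be sketched as follows; the use of the forward DCT and the zigzag ordering of the "first" 43 coefficients are our assumptions for illustration:

    import numpy as np
    from scipy.fft import dctn

    def appearance_features(mouth_gray, num_coeffs=43):
        """mouth_gray: 2-D grayscale mouth-region image;
        returns the lowest-order 2-D DCT coefficients in zigzag order."""
        c = dctn(mouth_gray, norm='ortho')
        h, w = c.shape
        # zigzag: sort indices by antidiagonal, then by row
        order = sorted(((i, j) for i in range(h) for j in range(w)),
                       key=lambda ij: (ij[0] + ij[1], ij[0]))
        return np.array([c[i, j] for i, j in order[:num_coeffs]])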
VI. RESULTS

Here, we provide the performance comparisons between our proposed hybrid model and the end-to-end AVSR models, which are described in [12] and [11].

Fig. 2 shows the performance of all considered baseline hybrid models (more details in Tab. II). The audio-only model (AO) has a much better performance than the video-shape (VS) and video-appearance (VA) models alone. Early integration (EI) can already improve the WER at lower SNR conditions (≤ 0 dB), but there is no improvement if we compare the average WER over all SNRs. As the oracle weight model (OW) shows, there is much room for improvement through optimal stream integration.

Fig. 2: Word error rate on LRS2 corpus

Fig. 2 also depicts the word error rate results of different end-to-end models, which can be found in the appendix of [11]. The Watch Listen Attend and Spell model [12] (WLAS) and the proposed fusion strategy in [11] (AV Align) cannot improve the WER on the LRS2 dataset. We also find that the hybrid audio-only model (AO) offers a better performance than the end-to-end audio-only model (AO e2e) from [11].

TABLE II: WER (%) on LRS2.
SNR   -9     -6     -3     0      3      6      9      clean  avg.
AO    64.45  61.05  48.90  44.73  34.28  27.57  23.54  14.20  39.84
VA    83.44  83.44  83.44  83.44  83.44  83.44  83.44  83.44  83.44
VS    88.18  88.18  88.18  88.18  88.18  88.18  88.18  88.18  88.19
EI    58.01  54.69  47.27  43.77  38.99  33.53  32.60  24.58  41.68
OW    42.67  37.08  26.61  24.88  18.22  14.74  12.64  9.02   23.23
MSE   50.66  47.10  35.81  33.56  25.20  19.93
Tab. II summarizes all results of the Kaldi experiments: the first 3 rows show the performance of all single-modality models. The performance metrics of the video appearance and shape models are far from satisfying. We have also employed the pre-trained spatio-temporal visual front-end from [26] to extract high-level visual features, without seeing improvements. We hypothesize that the unsatisfying video model performance is due to an insufficient amount of training data. The 4th and 5th rows show the results of the early-integration baseline, and of the oracle weighting that gives an upper bound of achievable performance for the considered hybrid architecture. The final 4 rows show the WERs for our proposed experiments, using all reliability indicators. Comparing the different loss functions for training the stream integration net, cf. Sec. III-A, the mean squared error (MSE) has the best performance at lower SNR conditions (≤ 0 dB), which can be carried over into the sequence-based optimization by adding an MSE-based fine-tuning after pre-training with the original MMI loss (MM). Comparing the best performance between our proposed hybrid audio-visual model (MM) and the end-to-end audio-visual model (AV Align) in Fig. 2, we find that the hybrid model offers clear performance benefits, and that, in contrast to the end-to-end integration mechanism, it is indeed able to profit strongly from the inclusion of the visual modality under all noisy conditions.

To compare the value of the different types of reliability indicators for multi-modal integration, we have repeated this experiment with different reliability sets, using the best combination of loss functions, MMI training with MSE fine-tuning (MM). Tab. III gives the average WER over all SNR conditions. Here, we find that audio-based reliabilities have a slightly higher benefit, but using all reliabilities simultaneously achieves the best performance overall, corresponding to a relative word-error-rate reduction of 24.97% over the best audio-only model.

TABLE III: WER (%) of different reliability sets, abbreviations as introduced in Tab. I

      MB     SB     AB     VB     All
MM    31.40  31.33  31.03  31.76  29.89
VII. CONCLUSION

Improving the performance of large-vocabulary speech recognition through the inclusion of video data has remained challenging despite much progress in deep learning models for speech recognition and image processing. In this paper, we address this issue by learning an explicit stream integration network for audio-visual speech recognition. This network utilizes stream reliability indicators to optimize stream fusion time-frame by time-frame, ultimately providing a discriminatively optimized fusion of state posteriors for hybrid speech recognition. We have compared the performance of this learned integration model to that of early integration as well as to a baseline end-to-end model. All experiments show that the proposed model dramatically outperforms both of these baseline systems and that it is capable of providing clear improvements in accuracy compared to audio-only recognition, as hoped.

However, experiments based on oracle knowledge for stream fusion also point at the possibility of significant further gains. Achieving similar improvements without oracle information will be the natural next goal of our work, where we will focus on both the topology and loss function of the fusion network as well as on the integration of deeper, pre-trained image recognition models.
REFERENCES

[1] K. Thangthai and R. Harvey, "Building large-vocabulary speaker-independent lipreading systems," in Interspeech, 2018.
[2] M. J. Crosse, G. M. Di Liberto, and E. C. Lalor, "Eye can hear clearly now: inverse effectiveness in natural audiovisual speech processing relies on long-term crossmodal temporal integration," Journal of Neuroscience, vol. 36, no. 38, pp. 9888–9895, 2016.
[3] T. Afouras, J. S. Chung, A. Senior, O. Vinyals, and A. Zisserman, "Deep audio-visual speech recognition," arXiv:1809.02108, 2018.
[4] H. Meutzner, N. Ma, R. Nickel, C. Schymura, and D. Kolossa, "Improving audio-visual speech recognition using deep neural networks with dynamic stream reliability estimates," in Proc. ICASSP, 2017, pp. 5320–5324.
[5] M. Gurban, J. Thiran, T. Drugman, and T. Dutoit, "Dynamic modality weighting for multi-stream HMMs in audio-visual speech recognition," in Proc. ICMI, 2008, pp. 237–240.
[6] H. Bourlard and N. Morgan, Connectionist Speech Recognition: A Hybrid Approach, vol. 247, Springer Science & Business Media, 2012.
[7] C. Lüscher, E. Beck, K. Irie, M. Kitza, W. Michel, A. Zeyer, R. Schlüter, and H. Ney, "RWTH ASR systems for LibriSpeech: Hybrid vs attention," in Interspeech, Graz, Austria, 2019.
[8] V. Estellers, M. Gurban, and J. Thiran, "On dynamic stream weighting for audio-visual speech recognition," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 4, pp. 1145–1157, 2011.
[9] A. H. Abdelaziz, S. Zeiler, and D. Kolossa, "Learning dynamic stream weights for coupled-HMM-based audio-visual speech recognition," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 5, pp. 863–876, May 2015.
[10] S. Zeiler, R. Nickel, N. Ma, G. Brown, and D. Kolossa, "Robust audiovisual speech recognition using noise-adaptive linear discriminant analysis," in Proc. ICASSP, 2016.
[11] G. Sterpu, C. Saam, and N. Harte, "Attention-based audio-visual fusion for robust automatic speech recognition," in Proc. ICMI, 2018, pp. 111–115.
[12] J. S. Chung, A. Senior, O. Vinyals, and A. Zisserman, "Lip reading sentences in the wild," in IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 6447–6456.
[13] G. Sterpu, C. Saam, and N. Harte, "How to teach DNNs to pay attention to the visual modality in speech recognition," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 1052–1064, 2020.
[14] K. Veselý, A. Ghoshal, L. Burget, and D. Povey, "Sequence-discriminative training of deep neural networks," in Interspeech, 2013, pp. 2345–2349.
[15] M. Grant and S. Boyd, "CVX: Matlab software for disciplined convex programming, version 2.1," http://cvxr.com/cvx, Mar. 2014.
[16] M. Grant and S. Boyd, "Graph implementations for nonsmooth convex programs," in Recent Advances in Learning and Control, V. Blondel, S. Boyd, and H. Kimura, Eds., Lecture Notes in Control and Information Sciences, pp. 95–110, Springer-Verlag Limited, 2008, http://stanford.edu/~boyd/graph_dcp.html.
[17] R. Sproull, "Using program transformations to derive line-drawing algorithms," ACM Transactions on Graphics (TOG), vol. 1, no. 4, pp. 259–273, 1982.
[18] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," Proc. ICLR, 2014.
[19] H. Hermansky, "Multistream recognition of speech: Dealing with unknown unknowns," Proceedings of the IEEE, vol. 101, no. 5, pp. 1076–1088, 2013.
[20] H. Misra, H. Bourlard, and V. Tyagi, "New entropy based combination rules in HMM/ANN multi-stream ASR," in Proc. ICASSP, vol. 2, 2003, pp. II-741.
[21] S. Rangachari and P. Loizou, "A noise-estimation algorithm for highly non-stationary environments," Speech Communication, vol. 48, no. 2, pp. 220–231, 2006.
[22] L. Schönherr, D. Orth, M. Heckmann, and D. Kolossa, "Environmentally robust audio-visual speaker identification," in Proc. SLT, 2016, pp. 312–318.
[23] D. Snyder, G. Chen, and D. Povey, "MUSAN: A music, speech, and noise corpus," arXiv:1510.08484, 2015.
[24] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, et al., "The Kaldi speech recognition toolkit," in Proc. ASRU, IEEE Signal Processing Society, 2011.
[25] B. Amos, B. Ludwiczuk, and M. Satyanarayanan, "OpenFace: A general-purpose face recognition library with mobile applications," Tech. Rep. CMU-CS-16-118, CMU School of Computer Science, 2016.
[26] T. Stafylakis and G. Tzimiropoulos, "Combining residual networks with LSTMs for lipreading," arXiv preprint arXiv:1703.04105, 2017.