Deep-Learning-Based Audio-Visual Speech Enhancement in Presence of Lombard Effect
Daniel Michelsanti, Zheng-Hua Tan, Sigurdur Sigurdsson, Jesper Jensen
Daniel Michelsanti a,∗, Zheng-Hua Tan a, Sigurdur Sigurdsson b and Jesper Jensen a,b
a Department of Electronic Systems, Aalborg University, Denmark
b Oticon A/S, Denmark
∗ Corresponding author: [email protected] (D. Michelsanti); [email protected] (Z.-H. Tan); [email protected] (S. Sigurdsson); [email protected], [email protected] (J. Jensen)
ARTICLE INFO
Keywords: Lombard effect, audio-visual speech enhancement, deep learning, speech quality, speech intelligibility
ABSTRACT
When speaking in presence of background noise, humans reflexively change their way of speaking in order to improve the intelligibility of their speech. This reflex is known as Lombard effect. Collecting speech in Lombard conditions is usually hard and costly. For this reason, speech enhancement systems are generally trained and evaluated on speech recorded in quiet to which noise is artificially added. Since these systems are often used in situations where Lombard speech occurs, in this work we perform an analysis of the impact that Lombard effect has on audio, visual and audio-visual speech enhancement, focusing on deep-learning-based systems, since they represent the current state of the art in the field. We conduct several experiments using an audio-visual Lombard speech corpus consisting of utterances spoken by different talkers. The results show that training deep-learning-based models with Lombard speech is beneficial in terms of both estimated speech quality and estimated speech intelligibility at low signal to noise ratios, where the visual modality can play an important role in acoustically challenging situations. We also find that a performance difference between genders exists due to the distinct Lombard speech exhibited by males and females, and we analyse it in relation to acoustic and visual features. Furthermore, listening tests conducted with audio-visual stimuli show that the speech quality of the signals processed with systems trained using Lombard speech is statistically significantly better than the one obtained using systems trained with non-Lombard speech at a signal to noise ratio of −5 dB. Regarding speech intelligibility, we find a general tendency towards a benefit of training the systems with Lombard speech.
1. Introduction
Speech is perhaps the most common way that people use to communicate with each other. Often, this kind of communication is harmed by several sources of disturbance that may have different nature, such as the presence of competing speakers, the loud music during a party, and the noise inside a car cabin. We refer to the sounds other than the speech of interest as background noise.

Background noise is known to affect two attributes of speech: intelligibility and quality (Loizou, 2007). Both of these aspects are important in a conversation, since poor intelligibility makes it hard to comprehend what a speaker is saying and poor quality may affect speech naturalness and listening effort (Loizou, 2007). Humans tend to tackle the negative effects of background noise by instinctively changing the way of speaking, their speaking style, in a process known as
Lombard effect (Lombard, 1911; Zollinger and Brumm, 2011). The changes that can be observed vary widely across individuals (Junqua, 1993; Marxer et al., 2018) and affect multiple dimensions: acoustically, the average fundamental frequency (F0) and the sound energy increase, the spectral tilt flattens due to an energy increment at high frequencies, and the centre frequencies of the first and second formants (F1 and F2) shift (Junqua, 1993; Lu and Cooke, 2008);
visually, head and face motion are more pronounced and the movements of the lips and jaw are amplified (Vatikiotis-Bateson et al., 2007; Garnier et al., 2010, 2012); temporally, the speech rate changes due to an increase of the vowel duration (Junqua, 1993; Cooke et al., 2014).

Although Lombard effect improves the intelligibility of speech in noise (Summers et al., 1988; Pittman and Wiley, 2001), effective communication might still be challenged by some particular conditions, e.g. the hearing impairment of the listener. In these situations, speech enhancement (SE) algorithms may be applied to the noisy signal aiming at improving speech quality and speech intelligibility. In the literature, several SE techniques have been proposed. Some approaches consider SE as a statistical estimation problem (Loizou, 2007), and include some well-known methods, like the Wiener filtering (Lim and Oppenheim, 1979) and the minimum mean square error estimator of the short-time magnitude spectrum (Ephraim and Malah, 1984). Many improved methods have been proposed, which primarily distinguish themselves by refined statistical speech models (Martin, 2005; Erkelens et al., 2007; Gerkmann and Martin, 2009) or noise models (Martin and Breithaupt, 2003; Loizou, 2007). These techniques, which make statistical assumptions on the distributions of the signals, have been reported to be largely unable to provide speech intelligibility improvements (Hu and Loizou, 2007; Jensen and Hendriks, 2012). As an alternative, data-driven techniques, especially deep learning, do not make any assumptions on the distribution of the speech, of the noise, or on the way they are mixed: a learning algorithm is used to find a function that best maps features from degraded speech to features from clean speech. Over the years, the speech processing community has put a considerable effort into designing training targets and objective functions (Wang et al., 2014; Erdogan et al., 2015; Williamson et al., 2016; Michelsanti et al., 2019b) for different neural network models, including deep neural networks (Xu et al., 2014; Kolbæk et al., 2017), denoising autoencoders (Lu et al., 2013), recurrent neural networks (Weninger et al., 2014), fully convolutional neural networks (Park and Lee, 2017), and generative adversarial networks (Michelsanti and Tan, 2017). These methods represent the current state of the art in the field (Wang and Chen, 2018), and since they use only audio signals, we refer to them as audio-only SE (AO-SE) systems.

Previous studies show that observing the speaker's facial and lip movements contributes to speech perception (Sumby and Pollack, 1954; Erber, 1975; McGurk and MacDonald, 1976). This finding suggests that a SE system could tolerate higher levels of background noise, if visual cues could be used in the enhancement process. This intuition is confirmed by a pioneering study on audio-visual SE (AV-SE) by Girin et al. (2001), where simple geometric features extracted from the video of the speaker's mouth are used.
Later, more complex frameworks based on classical statistical approaches have been proposed (Almajai and Milner, 2011; Abel and Hussain, 2014; Abel et al., 2014), and very recently deep learning methods have been used for AV-SE (Hou et al., 2018; Gabbay et al., 2018; Ephrat et al., 2018; Afouras et al., 2018; Owens and Efros, 2018; Morrone et al., 2019).

It is reasonable to think that visual features are mostly helpful for SE when the speech is so degraded that AO-SE systems achieve poor performance, i.e. when background noise heavily dominates over the speech of interest. Since in such an acoustical environment spoken communication is particularly hard, we can assume that the speakers are under the influence of Lombard effect. In other words, the input to SE systems in this situation is Lombard speech. Despite this consideration, state-of-the-art SE systems do not take Lombard effect into account, because collecting Lombard speech is usually expensive. The training and the evaluation of the systems are usually performed with speech recorded in quiet and afterwards degraded with additive noise. Previous work shows that speaker (Hansen and Varadarajan, 2009) and speech recognition (Junqua, 1993) systems that ignore Lombard effect achieve sub-optimal performance, also in visual (Heracleous et al., 2013; Marxer et al., 2018) and audio-visual settings (Heracleous et al., 2013). It is therefore of interest to conduct a similar study also in a SE context.

With the objective of providing a more extensive analysis of the impact of Lombard effect on deep-learning-based SE systems, the present work extends a preliminary study (Michelsanti et al., 2019a), providing the following novel contributions. First, new experiments are conducted, where deep-learning-based SE systems trained with Lombard or non-Lombard speech are evaluated on Lombard speech using a cross-validation setting to avoid that a potential intra-speaker variability of the adopted dataset leads to biased conclusions.

Command   Colour*   Preposition   Letter*      Digit*   Adverb
bin       blue      at            A–Z (no W)   0–9      again
lay       green     by                                  now
place     red       in                                  please
set       white     with                                soon
Table 1
Sentence structure for the Lombard GRID corpus (Alghamdi et al., 2018). The '*' indicates a keyword. Adapted from (Cooke et al., 2006).

Then, an investigation of the effect that the inter-speaker variability has on the systems is carried out, both in relation to acoustic as well as visual features. Next, as an example application, a system trained with both Lombard and non-Lombard data using a wide signal-to-noise-ratio (SNR) range is compared with a system trained only on non-Lombard speech, as it is currently done for the state-of-the-art models. Finally, especially since existing objective measures are limited to predicting speech quality and intelligibility from the audio signals in isolation, listening tests using audio-visual stimuli have been performed. This test setup, which is generally not employed to evaluate SE systems, is closer to a real-world scenario, where a listener is usually able to look at the face of the talker.
2. Materials: Audio-Visual Speech Corpus and Noise Data
The speech material used in this study is the Lombard GRID corpus (Alghamdi et al., 2018), which is an extension of the popular audio-visual GRID dataset (Cooke et al., 2006). It consists of 54 native speakers of British English (24 males and 30 females). The sentences pronounced by the talkers adhere to the syntax from the GRID corpus: six-word sentences with the structure shown in Table 1.
Figure 1:
Pipeline of the audio-visual speech enhancement framework used in this study, adapted from (Gabbay et al., 2018), and identical to (Michelsanti et al., 2019a). The deep-learning-based system estimates an ideal amplitude mask from the video of the speaker's mouth and the magnitude spectrogram of the noisy speech. The estimated mask is used to enhance the speech in the time-frequency domain.
STFT indicates the short-time Fourier transform.

To account for the auditory feedback loop (Lane and Tranel, 1971), the speech signal was mixed with the SSN at a carefully adjusted level, providing a self-monitoring feedback to the speakers.

In our study, the audio and the video signals from the frontal camera were arranged as explained in Section 4 to build training, validation, and test sets. The frontal video stream has a variable frame rate, and the audio and video signals are temporally aligned.

To generate speech in noise, SSN was added to the audio signals of the Lombard GRID database. SSN was chosen to match the kind of noise used in the database, since, as reported by Hansen and Varadarajan (2009), Lombard effect occurs differently across noise types, although other studies (Lu and Cooke, 2009; Garnier and Henrich, 2014) failed to find such evidence. The SSN we used was generated as in (Kolbæk et al., 2016), by filtering white noise with a low-order linear predictor, whose coefficients were found using 100 random sentences from the Akustiske Databaser for Dansk (ADFD) speech database.
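To illustrate this procedure, a minimal sketch is given below: it fits a low-order all-pole model to concatenated speech and shapes white noise with it. The file name, the predictor order, and the use of librosa are our assumptions; only the overall approach follows (Kolbæk et al., 2016).

```python
import numpy as np
import librosa
from scipy.signal import lfilter

# Concatenation of random ADFD sentences (hypothetical file, assumed order).
speech, sr = librosa.load("adfd_sentences_concat.wav", sr=16000)

order = 12                               # low-order linear predictor (assumed)
a = librosa.lpc(speech, order=order)     # all-pole coefficients [1, a_1, ..., a_p]

white = np.random.randn(60 * sr)         # 60 s of white noise
ssn = lfilter([1.0], a, white)           # shape the noise with the speech spectrum
ssn *= np.std(speech) / np.std(ssn)      # roughly match the long-term speech level
```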
3. Methodology
In this study, we train and evaluate systems that perform spectral SE using deep learning, as illustrated in Figure 1. The processing pipeline is inspired by Gabbay et al. (2018) and the same as the one used in (Michelsanti et al., 2019a). To have a self-contained exposition, we report the main details of it in this section.
We assume to have access to two streams of information: the video of the talker's face, and an audio signal, 𝑦(𝑛) = 𝑥(𝑛) + 𝑑(𝑛), where 𝑥(𝑛) is the clean signal of interest, 𝑑(𝑛) is an additive noise signal, and 𝑛 indicates the discrete-time index. The additive noise model, presented here in the time domain, can also be expressed in the time-frequency (TF) domain as 𝑌(𝑘,𝑙) = 𝑋(𝑘,𝑙) + 𝐷(𝑘,𝑙), where 𝑌(𝑘,𝑙), 𝑋(𝑘,𝑙), and 𝐷(𝑘,𝑙) are the short-time Fourier transform (STFT) coefficients at frequency bin 𝑘 and time frame 𝑙 of 𝑦(𝑛), 𝑥(𝑛), and 𝑑(𝑛), respectively. Our models adopt a mask approximation approach (Michelsanti et al., 2019b), producing an estimate 𝑀̂(𝑘,𝑙) of the ideal amplitude mask, defined as 𝑀(𝑘,𝑙) = |𝑋(𝑘,𝑙)| ∕ |𝑌(𝑘,𝑙)|, with the following objective function:

$$J = \frac{1}{TF} \sum_{k,l} \left( M(k,l) - \hat{M}(k,l) \right)^2, \qquad (1)$$

with 𝑘 ∈ {1, …, 𝐹}, 𝑙 ∈ {1, …, 𝑇}, and 𝑇 × 𝐹 being the dimension of the training target. Recent preliminary experiments have shown that using this objective function leads to better performance for AV-SE than competing methods (Michelsanti et al., 2019b).

In this work, each audio signal was peak-normalised. We used a sample rate of 16 kHz and a 640-point STFT, with a Hamming window of 640 samples (40 ms) and a hop size of 160 samples (10 ms). Only the 321 bins that cover the positive frequencies were used, because of the conjugate symmetry of the STFT.

Each video signal was resampled at a frame rate of 25 FPS using motion interpolation as implemented in FFMPEG (http://ffmpeg.org). The face of the speaker was detected in every frame using the frontal face detector implemented in the dlib toolkit (King, 2009), consisting of histogram of oriented gradients (HOG) filters and a linear support vector machine (SVM). The bounding boxes of the single-frame detections were tracked using a Kalman filter. The face was aligned based on landmarks, using a model that estimated the position of the corners of the eyes and of the bottom of the nose (King, 2009), and rescaled.
The mouth was extracted by cropping the central lower face region to a size of 128 × 128 pixels. Each segment of 5 consecutive grayscale video frames, spanning a total of 200 ms, was paired with the respective 20 consecutive audio frames.

The preprocessed audio and video signals, standardised using the mean and the variance from the training set, were used as input to an audio and a video encoder, respectively. Both encoders consisted of convolutional layers, each of them followed by a leaky-ReLU activation function (Maas et al., 2013) and batch normalisation (Ioffe and Szegedy, 2015). For the video encoder, max-pooling and 0.25 dropout (Hinton et al., 2012) were also adopted. The fusion of the two modalities was accomplished by applying a sub-network, consisting of 3 fully connected layers followed by leaky-ReLU activations, to the outputs of the 2 encoders.
The 321 × 20 estimated mask was obtained with an audio decoder having 6 transposed convolutional layers followed by leaky-ReLU activations and a ReLU activation as output layer. Skip connections between layers of the audio encoder and the corresponding decoder layers were used to prevent the bottleneck from hindering the information flow (Isola et al., 2017). The values of the training target, 𝑀(𝑘,𝑙), were limited to the [0, 10] interval (Wang et al., 2014).

The weights of the network were initialised with the Xavier approach (Glorot and Bengio, 2010). The training was performed using the Adam optimiser (Kingma and Ba, 2015) with the objective function in Equation (1). The learning rate, initially set to the order of 10⁻⁴, was scaled down when the loss increased on the validation set. An early stopping technique was used, by selecting the network that performed the best on the validation set across the epochs used for training.

The estimated ideal amplitude mask of an utterance was obtained by concatenating the outputs of the network, obtained by processing non-overlapping consecutive audio-visual paired segments. The estimated mask was point-wise multiplied with the complex-valued STFT spectrogram of the noisy signal, and the result was inverted using an overlap-add procedure to get the time-domain signal (Allen, 1977; Griffin and Lim, 1984).
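A minimal sketch of this masking step, assuming the STFT parameters above and representing the trained network by a placeholder model callable, could look as follows; the use of librosa is our assumption:

```python
import numpy as np
import librosa

def enhance(noisy, model, n_fft=640, hop=160):
    """Apply an estimated amplitude mask to the noisy STFT and invert it."""
    Y = librosa.stft(noisy, n_fft=n_fft, hop_length=hop, window="hamming")
    mask = model(np.abs(Y))             # placeholder: (321, T) mask in [0, 10]
    X_hat = mask * Y                    # point-wise masking of the complex STFT
    # inverse STFT with overlap-add to recover the time-domain signal
    return librosa.istft(X_hat, hop_length=hop, window="hamming",
                         length=len(noisy))
```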
Until now, we only presented AV-SE systems. In order to understand the relative contribution of the audio and the visual modalities, we also trained networks to perform mono-modal SE, by removing one of the two encoders from the neural network architecture, without changing the other explained settings and procedures. Both AO-SE and video-only SE (VO-SE) systems estimate a mask and apply it to the noisy speech, but they differ in the signals used as input.
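To make the encoder-fusion-decoder structure concrete, the following PyTorch sketch mirrors the description above: a strided convolutional audio encoder, a 3D-convolutional video encoder with max-pooling and dropout, a fusion sub-network of three fully connected layers, and a transposed-convolutional decoder with skip connections and a ReLU output. Layer counts, channel widths, and kernel sizes are illustrative assumptions, not the exact configuration of the paper.

```python
import torch
import torch.nn as nn

def down(cin, cout):
    # strided convolution roughly halving frequency and time dimensions
    return nn.Sequential(nn.Conv2d(cin, cout, 3, stride=2, padding=1),
                         nn.LeakyReLU(0.2), nn.BatchNorm2d(cout))

class AVMaskNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.a1, self.a2, self.a3 = down(1, 32), down(32, 64), down(64, 128)
        self.video_enc = nn.Sequential(      # 5 grayscale mouth frames -> vector
            nn.Conv3d(1, 32, 3, padding=1), nn.LeakyReLU(0.2), nn.BatchNorm3d(32),
            nn.MaxPool3d((1, 2, 2)), nn.Dropout(0.25),
            nn.Conv3d(32, 64, 3, padding=1), nn.LeakyReLU(0.2), nn.BatchNorm3d(64),
            nn.AdaptiveAvgPool3d(1), nn.Flatten())
        flat = 128 * 41 * 3                  # audio bottleneck for a 321 x 20 input
        self.fusion = nn.Sequential(         # three fully connected layers
            nn.Linear(flat + 64, 512), nn.LeakyReLU(0.2),
            nn.Linear(512, 512), nn.LeakyReLU(0.2),
            nn.Linear(512, flat), nn.LeakyReLU(0.2))
        self.d3 = nn.ConvTranspose2d(256, 64, 3, stride=2, padding=1)
        self.d2 = nn.ConvTranspose2d(128, 32, 3, stride=2, padding=1,
                                     output_padding=(0, 1))
        self.d1 = nn.ConvTranspose2d(64, 1, 3, stride=2, padding=1,
                                     output_padding=(0, 1))
        self.act = nn.LeakyReLU(0.2)

    def forward(self, spec, video):
        # spec: (B, 1, 321, 20) noisy magnitudes; video: (B, 1, 5, 128, 128)
        e1 = self.a1(spec)
        e2 = self.a2(e1)
        e3 = self.a3(e2)
        v = self.video_enc(video)
        f = self.fusion(torch.cat([e3.flatten(1), v], dim=1)).view(-1, 128, 41, 3)
        x = self.act(self.d3(torch.cat([f, e3], dim=1)))   # skip connection
        x = self.act(self.d2(torch.cat([x, e2], dim=1)))   # skip connection
        # ReLU output layer; the training target mask is limited to [0, 10]
        return torch.relu(self.d1(torch.cat([x, e1], dim=1)))

net = AVMaskNet()
mask = net(torch.randn(2, 1, 321, 20), torch.randn(2, 1, 5, 128, 128))
print(mask.shape)   # torch.Size([2, 1, 321, 20])
```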
4. Experiments
The experiments conducted in this study compare the performance of AO-SE, VO-SE, and AV-SE systems in terms of two widely adopted objective measures: perceptual evaluation of speech quality (PESQ) (Rix et al., 2001), specifically the wideband extension (ITU, 2005) as implemented by Loizou (2007), and extended short-time objective intelligibility (ESTOI) (Jensen and Taal, 2016). PESQ scores, used to estimate speech quality, lie between −0.5 and 4.5, where high values correspond to high speech quality. However, the wideband extension that we use maps these scores to mean opinion score (MOS) values, on a scale from approximately 1 to 4.64. ESTOI scores, which estimate speech intelligibility, practically range from 0 to 1, where high values correspond to high speech intelligibility.

As mentioned before (Section 2), clean speech signals were mixed with SSN to match the noise type used in the Lombard GRID corpus. Current state-of-the-art SE systems are trained with signals at several SNRs to make them robust to various noise levels. We followed a similar methodology and trained our models with two different SNR ranges, narrow (between −20 dB and 5 dB) and wide (between −20 dB and 30 dB). We used these two ranges because on the one hand we would like to assess the performance of SE systems when Lombard speech occurs, and on the other hand we would like to have SNR-independent systems, i.e. systems that also work well at higher SNRs. Such a setup allows us to better understand whether Lombard speech, which is usually not available because it is hard to collect, should be used to train SE systems, and what the advantages and disadvantages of the various training configurations are. The models used in this work are shown in Table 2.

                   Non-Lombard Speech           Lombard Speech
System Input    Narrow SNR    Wide SNR       Narrow SNR    Wide SNR
Vision          VO-NL         VO-NL (w)      VO-L          VO-L (w)
Audio           AO-NL         AO-NL (w)      AO-L          AO-L (w)
Audio-Visual    AV-NL         AV-NL (w)      AV-L          AV-L (w)

Table 2
Models used in this study. The '(w)' is used to distinguish the systems trained with a wide SNR range from the ones trained with a narrow SNR range.

Similarly to the work by Marxer et al. (2018), the experiments were conducted adopting a multi-speaker setup, in which all the speakers in the database were used for both training and evaluating the systems. This choice was made for a practical reason. People may exhibit speech characteristics that differ considerably from each other when they speak in presence of noise (Junqua, 1993; Marxer et al., 2018). It is possible to model these differences by training speaker-dependent systems, but this requires a large set of Lombard speech for every speaker. Unfortunately, the audio-visual speech corpus that we use, despite being one of the largest existing audio-visual databases for Lombard speech, only contains 100 utterances per speaker, which are not enough to train a deep-learning-based model.

The experiments were performed according to a stratified five-fold cross-validation procedure (Liu and Özsu, 2009). Specifically, the data was divided into five folds of approximately the same size, four of them used for training and validation, and one for testing. This process was repeated five times for different test sets in order to evaluate the systems on the whole dataset. Before the split, the signals were rearranged to have about the same amount of data for each speaker across the training (∼35 utterances), the validation (∼5 utterances), and the test (∼10 utterances) sets. This ensured that each fold was a good representative of the inter-speaker variations of the whole dataset. For some speakers, some data was missing or corrupted, so we used fewer utterances. The recordings from some speakers were discarded by the database collectors due to technical issues, and the data from one speaker was used only in the training set, because only 40 of the utterances could be used. Effectively, the remaining speakers were used to evaluate our systems.

Since we would like to assess the performance of SE systems when Lombard speech occurs, SSN is added to the speech signals from the Lombard GRID corpus at 6 different SNRs, in uniform steps between −20 dB and 5 dB.
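As a minimal sketch of this mixing and scoring setup (our assumptions: numpy arrays at 16 kHz, and the open-source pesq and pystoi packages standing in for the PESQ implementation by Loizou (2007) and for ESTOI):

```python
import numpy as np
from pesq import pesq            # PyPI "pesq": wideband mode maps to MOS-LQO
from pystoi import stoi          # PyPI "pystoi": extended=True gives ESTOI

FS = 16000
SNRS_DB = range(-20, 10, 5)      # -20, -15, -10, -5, 0, 5 dB

def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so that the speech-to-noise power ratio is snr_db."""
    noise = noise[:len(speech)]
    gain = np.sqrt(np.mean(speech**2) / (np.mean(noise**2) * 10**(snr_db / 10)))
    return speech + gain * noise

def evaluate(clean, enhanced):
    """Wideband PESQ (MOS-LQO, ~1 to 4.64) and ESTOI (~0 to 1)."""
    return pesq(FS, clean, enhanced, "wb"), stoi(clean, enhanced, FS, extended=True)
```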
This choice was driven by the following considerations (Michelsanti et al., 2019a). Since Lombard and non-Lombard utterances from the Lombard GRID corpus differ in energy (Marxer et al., 2018), the actual SNR can be computed assuming a typical conversational speech level in terms of sound pressure level (SPL) (Raphael et al., 2007; Moore et al., 2012) and the noise level used in the recording conditions of the database. The SNR range obtained in this way reaches down to approximately −17 dB. In the experiments, we used a slightly wider range because of the possible speech level variations caused by the distance between the listener and the speaker.

For all the systems, Lombard speech was used to build the test set, while for training and validation we used Lombard speech for VO-L, AO-L, and AV-L, and non-Lombard speech for VO-NL, AO-NL, and AV-NL (Table 2).

Figure 2 shows the cross-validation results in terms of PESQ and ESTOI for all the different systems. On average, every model improves the estimated speech quality and the estimated speech intelligibility of the unprocessed signals, with the exception of VO-NL at 5 dB SNR, which shows an ESTOI score comparable with the one of noisy speech. Another general trend that can be observed is that AV systems outperform the respective AO and VO systems, an expected result, since the information that can be exploited using two modalities is no less than the information of the single modalities taken separately.

It is worth noting that the performance of VO systems changes across SNRs, although they do not use the audio signal to estimate the ideal amplitude mask. This is because the estimated mask is applied to the noisy input signal, so the performance depends on the noise level of the input audio signal.

PESQ scores show that the performance that can be obtained with AO systems is comparable with VO systems' performance at very low SNRs. Only for SNR ≥ −10 dB, AO models start to perform substantially better than VO models. The difference increases with higher SNRs. Also for ESTOI, this pattern can be observed when SNR ≥ −10 dB, but for SNR ≤ −15 dB VO systems perform better than the respective AO systems, especially at −20 dB SNR, where the performance gap is very large.

PESQ          VO-L    VO-NL   AO-L    AO-NL   AV-L    AV-NL
−20 to 5 dB   1.163   1.113   1.353   1.283   1.446   1.331

ESTOI         VO-L    VO-NL   AO-L    AO-NL   AV-L    AV-NL
−20 to 5 dB   0.372   0.335   0.448   0.423   0.528   0.488

Table 3
Average scores for the systems trained on a narrow SNR range.

This can be explained by the fact that the noise level is so high that recovering the clean speech only using the noisy audio input is very challenging, and that the visual modality provides a richer information source at this noise level.

For all the modalities, L systems tend to be better than the respective NL systems. The only exception is AO-NL, which has a higher PESQ score than AO-L at −20 dB SNR, but this difference is very modest. AV-L always outperforms AV-NL in terms of PESQ by a large margin, if we consider the performance between −20 dB and −10 dB SNR. On average (Table 3), the performance gap in terms of PESQ between L and NL systems is greater for the audio-visual case (0.115) than for the audio-only (0.070) and the video-only (0.050) cases, meaning that the speaking style mismatch is more detrimental when both the modalities are used. Regarding ESTOI, the gap between AV-L and AV-NL (0.040) is still the largest, but the one between VO-L and VO-NL (0.037) is greater than the gap between AO-L and AO-NL (0.025): this suggests that the impact of visual differences between Lombard and non-Lombard speech on estimated speech intelligibility is higher than the impact of acoustic differences.

These results suggest that training systems with Lombard speech is beneficial in terms of both estimated speech quality and estimated speech intelligibility. This is in line with and extends our preliminary study (Michelsanti et al., 2019a), where only a subset of the whole database was used to evaluate the models.

Previous work found a large inter-speaker variability for Lombard speech, especially between male and female speakers (Junqua, 1993). Here, we investigate whether this variability affects the performance of SE systems.

Figure 3 shows the average PESQ and ESTOI scores by gender. Since the scores are computed on different speech material, it may be hard to make a direct comparison between males and females by looking at the absolute performance. Instead, we focus on the gap between L and NL systems averaged across SNRs for the same gender. At a first glance, the trends of the different conditions are as expected: L systems are better than the respective NL ones, and AV systems outperform AO systems trained with speech of the same speaking style, in terms of both estimated speech quality and estimated speech intelligibility. We also notice that the scores of VO systems are worse than the AO ones, also for ESTOI. This is because we average across all the SNRs, and VO is better than AO only at very low SNRs, but considerably worse for SNR ≥ −5 dB (Figure 2).
Figure 2:
Cross-validation results in terms of PESQ and ESTOI for the systems trained on a narrow SNR range. At every SNR, there are three pairs of coloured bars with error bars, each of them referring to VO, AO, and AV systems (from left to right). The wide bars in dark colours represent L systems, while the narrow ones in light colours represent NL systems. The heights of each bar and the error bars indicate the average scores and the 95% confidence intervals computed on the pooled data, respectively. The transparent boxes with black edges, overlaying the bars of the other systems, and the error bars indicate the average scores of the unprocessed signals (Unproc.) and their 95% confidence intervals, respectively.
Figure 3:
Cross-validation results for male and female speakers in terms of PESQ and ESTOI.
The difference between L and NL systems is larger for females than it is for males. This can be observed for all the modalities, and it is more noticeable for AV systems, most likely because they account for both audio and visual differences. In order to better understand this behaviour, we provide a more in-depth analysis, investigating the impact that some acoustic and geometric articulatory features have on estimated speech quality and estimated speech intelligibility. We consider three different features that have already been used to study Lombard speech in previous work (Garnier et al., 2006, 2012; Tang et al., 2015; Alghamdi, 2017): F0, mouth aperture (MA) and mouth spreading (MS).
Figure 4:
Mouth aperture (MA) and mouth spreading (MS) from facial landmarks.

The average F0 for each speaker was estimated with Praat (Boersma and Weenink, 2001), using the default settings for pitch estimation. The average MA and MS per speaker were computed from facial landmarks (Figure 4) obtained with the pose estimation algorithm (Kazemi and Sullivan, 2014), trained on the iBUG 300-W database (Sagonas et al., 2016), implemented in the dlib toolkit (King, 2009).

Let ΔF0, ΔMA, and ΔMS denote the average difference in audio and visual features, respectively, between Lombard and non-Lombard speech. Similarly, let ΔPESQ and ΔESTOI denote the increment in PESQ and ESTOI, respectively, of AV-L with respect to AV-NL.
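For illustration, MA and MS can be measured per frame with the dlib 68-point landmark model (the ensemble-of-regression-trees predictor of Kazemi and Sullivan (2014) trained on iBUG 300-W); the choice of lip points 48/54 and 51/57 below is our assumption, and the predictor file path is hypothetical.

```python
import numpy as np
import dlib

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def mouth_features(gray_frame):
    """Return (MA, MS) in pixels from the first detected face."""
    rect = detector(gray_frame)[0]
    shape = predictor(gray_frame, rect)
    pt = lambda i: np.array([shape.part(i).x, shape.part(i).y])
    ms = np.linalg.norm(pt(54) - pt(48))   # mouth spreading: lip-corner distance
    ma = np.linalg.norm(pt(57) - pt(51))   # mouth aperture: top-to-bottom outer lip
    return ma, ms
```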
Figure 5:
Scatter plots showing the relationship between the audio/visual features and PESQ/ESTOI. Each circle refers to a particular speaker: the 𝑦-coordinate indicates the average performance increment of AV-L with respect to AV-NL in terms of PESQ or ESTOI, while the 𝑥-coordinate indicates the average increment of the audio (fundamental frequency) or visual (mouth aperture and mouth spreading) features in Lombard condition with respect to the respective feature in non-Lombard condition. The lines show the least-squares lines for male speakers (blue), female speakers (red), and all the speakers (yellow). MA, MS, and F0 indicate mouth aperture, mouth spreading, and fundamental frequency, respectively.

Figure 5 illustrates the relationship between ΔF0, ΔMA, and ΔMS and ΔPESQ and ΔESTOI. We notice that, on average, for each speaker ΔPESQ and ΔESTOI are both positive, with only one exception represented by a male speaker, whose ΔESTOI is slightly less than 0. This indicates that no matter how different the speaking style of a person is in presence of noise, there is a benefit in training a system with Lombard speech. Focusing on the range of the features' variations, most of the speakers have positive ΔMA, ΔMS, and ΔF0. This is in accordance with previous research, which suggests that in Lombard condition there is a tendency to amplify the lips' movements and raise the pitch (Garnier et al., 2010, 2012; Junqua, 1993). The ΔMA and ΔMS values lie between −2 and 6 pixels, and between −2 and 4 pixels, respectively, for both male and female speakers. The ΔF0 range is wider for females, reaching up to around 50 Hz, than it is for males.

Among the three features considered, ΔF0 is the one that seems to be related the most with ΔPESQ and ΔESTOI. This can be seen by comparing the distributions of the circles with the least-squares lines in the plots of Figure 5, or by analysing the correlation between PESQ/ESTOI increments and audio/visual feature increments, using Pearson's and Spearman's correlation coefficients.

Given 𝑛 pairs of (𝑥ᵢ, 𝑦ᵢ) observations, with 𝑖 ∈ {1, …, 𝑛}, from two variables 𝑥 and 𝑦, whose sample means are denoted as 𝑥̄ and 𝑦̄, respectively, we refer to the Pearson's correlation coefficient as 𝜌P(𝑥, 𝑦). We have that −1 ≤ 𝜌P(𝑥, 𝑦) ≤ 1, where 0 denotes the absence of a linear relationship between the two variables, and 1 and −1 denote a perfect positive linear relationship and a perfect negative linear relationship, respectively. To complement the Pearson's correlation coefficient, we also consider the Spearman's correlation coefficient, 𝜌S(𝑥, 𝑦), defined as (Sharma, 2005):

$$\rho_S(x, y) = \rho_P(r_x, r_y), \qquad (2)$$

where 𝑟ₓ and 𝑟ᵧ indicate rank variables. The advantage of using ranks is that 𝜌S allows to assess whether the relationship between 𝑥 and 𝑦 is monotonic (not limited to linear).

As shown in Table 4, for AV systems, ΔF0 has a higher correlation with ΔPESQ and ΔESTOI than ΔMA and ΔMS. We observe that for female speakers, the correlation between the features' increments and the performance measures' increments is usually higher, especially when considering ΔMS, suggesting that some inter-gender difference should be present not only for ΔF0 (whose range is way wider for females, as previously stated), but also for visual features.

In Table 4 we also report the correlation coefficients for the single modalities.
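Both coefficients can be computed directly with SciPy; the per-speaker increment arrays below are placeholder data standing in for the values behind Table 4:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

rng = np.random.default_rng(0)
delta_f0 = rng.normal(15.0, 10.0, size=50)                        # placeholder ΔF0
delta_estoi = 0.002 * delta_f0 + rng.normal(0.0, 0.02, size=50)   # placeholder ΔESTOI

rho_p, _ = pearsonr(delta_f0, delta_estoi)    # linear relationship
rho_s, _ = spearmanr(delta_f0, delta_estoi)   # Pearson's rho on the ranks (Eq. 2)
```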
[Table 4 lists, for each of the comparisons ΔPESQ/ΔESTOI (VO, AO, AV) versus ΔMA, ΔMS, and ΔF0, the coefficients 𝜌P and 𝜌S for all speakers (all), males (m), and females (f).]

Table 4
Pearson's (𝜌P) and Spearman's (𝜌S) correlation coefficients between PESQ/ESTOI increments and audio/visual feature increments for male speakers (m), female speakers (f), and all the speakers. MA, MS, and F0 indicate mouth aperture, mouth spreading, and fundamental frequency, respectively.

The correlation of the visual features' increments with ΔPESQ or ΔESTOI is sometimes higher for AO systems than it is for VO systems. This might seem counter-intuitive, because AO systems do not use visual information. However, correlation does not imply causation (Field, 2013): since visual and acoustic features are correlated (Almajai et al., 2006), it is possible that other acoustic features, which are not considered in this study even though they might be correlated with ΔMA and ΔMS, play a role in the enhancement. Similar considerations can be made for ΔF0, which has a higher correlation with ΔESTOI for VO systems than for AO systems. By looking at the inter-gender differences, we find that, in general, the correlation coefficients computed for female speakers are higher than the ones computed for male speakers, especially when considering ΔMS.

In general, a performance difference between genders exists when L systems are compared with NL ones, with a gap that is larger for females. This is unlikely to be caused by the small gender imbalance in the training set (23 males and 30 females). Instead, it is reasonable to assume that this result is due to the characteristics of the Lombard speech of female speakers, which shows a large increment of F0, the feature that correlates the most with the estimated speech quality and the estimated speech intelligibility increases, among the ones considered.
The models presented in Section 4.1 have been trained to enhance signals when Lombard effect occurs, i.e. at SNRs between −20 and 5 dB. However, from a practical perspective, SNR-independent systems, capable of enhancing both Lombard and non-Lombard speech, are preferred.

PESQ           VO-L (w)  VO-NL (w)  AO-L (w)  AO-NL (w)  AV-L (w)  AV-NL (w)
−20 to 5 dB     1.153     1.080      1.346     1.295      1.424     1.323
10 to 30 dB     2.348     2.418      3.127     3.155      3.151     3.169

ESTOI          VO-L (w)  VO-NL (w)  AO-L (w)  AO-NL (w)  AV-L (w)  AV-NL (w)
−20 to 5 dB     0.376     0.330      0.442     0.422      0.517     0.483
10 to 30 dB     0.844     0.825      0.927     0.929      0.928     0.930

Table 5
Average scores for the systems trained on a wide SNR range.
There are several ways to achieve this goal. For example, it is possible to train a system (with Lombard speech) that works at low SNRs, and another one (with non-Lombard speech) that works at high SNRs. This approach requires switching between the two systems, which can be problematic, because it involves an online estimation of the SNR. An alternative approach is to train general systems with Lombard speech at low SNRs and non-Lombard speech at high SNRs. We followed this alternative approach, building such systems and studying their strengths and limitations (a sketch of the training-material selection is given at the end of this section). We also compared them with systems trained only with non-Lombard speech for the whole SNR range, because this is what current state-of-the-art systems do.

The test set was built by mixing additive SSN with Lombard speech at SNRs between −20 and 5 dB, and with non-Lombard speech at SNRs between 10 and 30 dB. For VO-NL (w), AO-NL (w), and AV-NL (w), only non-Lombard speech was used during training, while for VO-L (w), AO-L (w), and AV-L (w), Lombard speech was used with SNR ≤ 5 dB and non-Lombard speech with SNR ≥ 10 dB, to match the speaking style of the test set (Table 2). The results in terms of PESQ and ESTOI are shown in Figure 6.

The relative performance of the systems at SNR ≤ 5 dB is similar to the one observed for the systems trained on a narrow SNR range (Section 4.1): L (w) systems outperform the respective NL (w) systems, AV performance is higher than AO and VO performance, and VO is considerably better than AO only in terms of ESTOI at very low SNRs.

When SNR ≥ 10 dB, NL (w) systems perform better than L (w) systems in terms of PESQ. The difference is on average (Table 5) larger for VO (0.070) than it is for AO (0.028) and AV (0.018). This can be explained by the fact that it is harder for VO-L (w) to recognise when non-Lombard speech occurs using only the video of the speaker. However, these performance gaps are smaller than the ones between L (w) and NL (w) systems at SNR ≤ 5 dB (0.073 for VO, 0.051 for AO, and 0.101 for AV).

Regarding ESTOI at SNR ≥ 10 dB, the difference between AO and AV becomes negligible, with VO systems performing considerably worse. This is because audio features are more informative than visual ones at high SNRs, making AO-SE systems already good at recovering speech intelligibility. In addition, the average gaps between NL (w) and L (w) are quite small: 0.002 for AO and AV, while for VO it is actually −0.019.

In general, at SNR ≤ 5 dB, the systems that use both Lombard and non-Lombard speech for training perform better than the ones that only use non-Lombard speech.
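The selection of the training material for the L (w) systems can be summarised by a small sketch (the utterance lists and the uniform 5 dB SNR grid are assumptions consistent with the ranges above):

```python
import random

SNR_GRID_DB = list(range(-20, 35, 5))      # -20, -15, ..., 30 dB

def pick_training_example(lombard_utts, non_lombard_utts):
    """For the L (w) systems: Lombard speech at SNR <= 5 dB,
    non-Lombard speech at SNR >= 10 dB."""
    snr = random.choice(SNR_GRID_DB)
    pool = lombard_utts if snr <= 5 else non_lombard_utts
    return random.choice(pool), snr
```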
Figure 6:
As Figure 2, but for the systems trained on a wide SNR range.

At higher SNRs, their PESQ and ESTOI scores are slightly worse than the ones of the systems trained only with non-Lombard speech. However, this performance gap is small, and seems to be larger for the estimated speech quality than for the estimated speech intelligibility. The way we combined non-Lombard and Lombard speech for training seems to be the best solution for an SNR-independent system, although a small performance loss may occur at high SNRs.
5. Listening Tests
Although it has been shown that visual cues have an impact on speech perception (Sumby and Pollack, 1954; McGurk and MacDonald, 1976), the currently available objective measures used to estimate speech quality and speech intelligibility, e.g. PESQ and ESTOI, only take into account the audio signals. Even when listening tests are performed to evaluate the performance of a SE system, visual stimuli are usually ignored and not presented to the participants (Hussain et al., 2017), despite the fact that visual inputs are typically available during practical deployment of SE systems.

For these reasons, and in an attempt to evaluate the proposed AV enhancement systems in a setting as realistic as possible, we performed two listening tests, one to assess the speech quality and the other to assess the speech intelligibility, where the processed audio signals from the Lombard GRID corpus were accompanied by their corresponding visual stimuli. Both tests were conducted in a silent room, where a MacBookPro11,4 equipped with an external monitor, a sound card (Focusrite Scarlett 2i2) and a set of closed headphones (Beyerdynamic DT770) was used for audio and video playback. The multimedia player (VLC media player 3.0.4) was controlled by the subjects with a graphical user interface (GUI) modified from MUSHRAM (Vincent, 2005). The processed signals used in this test were from the systems trained on the narrow SNR range previously described (Section 4.1). All the audio stimuli were normalised according to the two-pass EBU R128 loudness normalisation procedure (EBU, 2014), as implemented in ffmpeg-normalize (https://github.com/slhck/ffmpeg-normalize), to guarantee that signals of different conditions were perceived as having the same volume. The subjects were allowed to adjust the general loudness to a comfortable level during the training session of each test.

The quality test was carried out by experienced listeners, who volunteered to be part of the study. The participants were between 26 and 44 years old, and had self-reported normal hearing and normal (or corrected to normal) vision.
Figure 7:
Box plots showing the results of the MUSHRA experiments for the signals at −5 dB SNR (left) and at 5 dB SNR (right). The red horizontal lines and the diamond markers indicate the median and the mean values, respectively. Outliers (identified according to the 1.5 interquartile range rule) are displayed as red crosses. Ref. indicates the reference signals.
The test used the MUlti Stimulus test with Hidden Reference and Anchor (MUSHRA) (ITU, 2003) paradigm to assess the speech quality on a scale from 0 to 100, divided into five equal intervals labelled as bad, poor, fair, good, and excellent. No definition of speech quality was provided to the participants. Each subject was presented with sequences of trials to evaluate the systems at −5 dB SNR and at 5 dB SNR. Lower SNRs were not considered to ensure that the perceptual quality assessment was not influenced too much by the decrease in intelligibility. One trial consisted of one reference (clean speech signal) and seven other signals to be rated with respect to the reference: the hidden reference, the four systems under test (AO-L, AO-NL, AV-L, AV-NL), the unprocessed signal, and a hidden anchor (the unprocessed signal at −10 dB SNR). The participants were allowed to switch at will between any of the signals inside the same trial. The order of presentation of both the trials and the conditions was randomised, and signals from different randomly chosen speakers were used for each sequence of trials.

Before the actual test, the participants were trained in a special separate session, with the purpose of exposing them to the nature of the impairments and making them familiar with the equipment and the grading system.

The scores assigned by the subjects for each condition are shown in Figure 7 in the form of box plots.

Non-parametric approaches are used to analyse the data (Mendonça and Delikaris-Manias, 2018; Winter et al., 2018), since the assumption of normal distribution of the data is invalid, given the number of participants and their different interpretation of the MUSHRA scale. Specifically, the paired two-sided Wilcoxon signed-rank test (Wilcoxon, 1945) is adopted to determine whether there exists a median difference between the MUSHRA scores obtained for two different conditions. Differences in median are considered significant for 𝑝 < 𝛼∕𝑚 = 0.0083 (𝛼 = 0.05, 𝑚 = 6), where the significance level is corrected with the Bonferroni method to compensate for multiple hypothesis tests (Field, 2013). The use of 𝑝-values as the only analysis strategy has been heavily criticised (Hentschke and Stüttgen, 2011), because statistical significance can be obtained with a big sample size (Sullivan and Feinn, 2012; Moore et al., 2012) even if the magnitude of the effect is negligible (Hentschke and Stüttgen, 2011). For this reason, we complement 𝑝-values with a non-parametric measure of the effect size, the Cliff's delta (Cliff, 1993):

$$d_C = \frac{\sum_{i=1}^{m}\sum_{j=1}^{n}[x_i > y_j] \; - \; \sum_{i=1}^{m}\sum_{j=1}^{n}[x_i < y_j]}{mn}, \qquad (3)$$

where 𝑥ᵢ and 𝑦ⱼ are the observations of the samples of sizes 𝑚 and 𝑛 to be compared, and [𝑃] indicates the Iverson bracket, which is 1 if 𝑃 is true and 0 otherwise. As reported in Table 6, we consider the effect size to be small if 0.11 ≤ |𝑑C| < 0.28, medium if 0.28 ≤ |𝑑C| < 0.43, and large if |𝑑C| ≥ 0.43, according to the indication by Vargha and Delaney (2000).

Small Effect Size      Medium Effect Size     Large Effect Size
0.11 ≤ |𝑑C| < 0.28     0.28 ≤ |𝑑C| < 0.43     0.43 ≤ |𝑑C| ≤ 1

Table 6
Interpretation of the effect size (Cliff's delta, 𝑑C). Adapted from (Vargha and Delaney, 2000).

The 𝑝-values and the effect sizes for the comparisons considered in this study are shown in Table 7, for the six comparisons AO-L vs. AO-NL, AV-L vs. AV-NL, AO-L vs. AV-L, AO-NL vs. AV-NL, AO-L vs. the unprocessed signals, and AV-L vs. the unprocessed signals, at −5 dB and 5 dB SNR.

Table 7
𝑝-values (𝑝) and effect sizes (Cliff's delta, 𝑑C) for the MUSHRA experiments. The significance level (0.0083) for the 𝑝-values is corrected with the Bonferroni method.

At SNR = −5 dB, a significant (𝑝 < 0.0083) medium (0.28 ≤ |𝑑C| < 0.43) difference exists between Lombard and non-Lombard systems for both the audio-only and the audio-visual cases. The increment in quality when using vision with respect to audio-only systems is perceived by the subjects (|𝑑C| > 0.11), but it has only a relatively small effect (|𝑑C| < 0.28). This was expected, since at low SNRs visual cues affect intelligibility more than quality, as also shown by the objective measures (Figure 2). More specifically, for non-Lombard systems, this difference is significant and greater than the one found for Lombard systems, meaning that vision contributes more when the enhancement of Lombard speech is performed with systems that were not trained with it. We can notice that there is a large (|𝑑C| > 0.43) difference between the unprocessed signals and the version enhanced with Lombard systems. However, this difference is not significant, probably due to the heterogeneous interpretation of the MUSHRA scale by the subjects and their preference regarding the different natures of the impairment (presence of noise or artefacts caused by the enhancement).

At an SNR of 5 dB, a small difference between Lombard and non-Lombard systems is observed, despite being not significant in the audio-only case. At this noise level, audio-visual systems appear to be indistinguishable (|𝑑C| < 0.11) from the respective audio-only systems. This confirms the intuition that vision does not help in improving the speech quality at high SNRs. Finally, the difference between the unprocessed signals and the respective enhanced versions using Lombard systems is both large (|𝑑C| > 0.43) and significant (𝑝 < 0.0083), which makes it clear that both AO-L and AV-L improve the speech quality.
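Equation (3) and the significance test translate directly into code; the score arrays below are placeholder per-subject MUSHRA ratings:

```python
import numpy as np
from scipy.stats import wilcoxon

def cliffs_delta(x, y):
    """Cliff's delta (Equation 3) via the Iverson brackets [x_i > y_j], [x_i < y_j]."""
    x, y = np.asarray(x), np.asarray(y)
    greater = np.sum(x[:, None] > y[None, :])
    smaller = np.sum(x[:, None] < y[None, :])
    return (greater - smaller) / (x.size * y.size)

# Paired two-sided Wilcoxon signed-rank test on two conditions (placeholder data).
scores_l = np.array([72, 65, 80, 58, 77, 69, 74, 63])
scores_nl = np.array([60, 55, 71, 50, 70, 66, 68, 52])
_, p = wilcoxon(scores_l, scores_nl, alternative="two-sided")
print(p < 0.05 / 6, cliffs_delta(scores_l, scores_nl))   # Bonferroni: 0.0083
```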
The intelligibility test was carried out by listeners who volunteered to be part of the study. The participants were between 24 and 65 years old, and had self-reported normal hearing and normal (or corrected to normal) vision.

Each subject was presented with sequences of audio-visual stimuli from the Lombard GRID corpus, covering 4 SNRs (−20, −15, −10, and −5 dB) × 5 processing conditions (unprocessed, AO-L, AO-NL, AV-L, AV-NL) for several speakers. The participants were asked to listen to each stimulus only once and, based on what they heard, they had to select the colour and the digit from a list of options and to write the letter (Table 1). The order of presentation of the stimuli was randomised.

Before the actual test, the participants were trained in a special separate session consisting of a sequence of audio-visual stimuli.

The mean percentage of correctly identified keywords as a function of the SNR is shown in Figure 8. We can see that among the three fields, the colour is the easiest word to be identified by the participants. In general, the following trends can be observed. At low SNRs, the intelligibility of the signals enhanced with AV systems is higher than the intelligibility obtained with AO systems. This difference substantially diminishes when the SNR increases. There is no big performance difference between L and NL systems, but in general AV-L tends to have higher percentage scores than the other systems. AV-L is also the only system that does not decrease the mean intelligibility scores for all the fields if compared to the unprocessed signals.

Table 8
𝑝-values (𝑝) and effect sizes (Cliff's delta, 𝑑C) for the mean intelligibility scores for all the keywords obtained in the listening tests, for the same six comparisons as in Table 7 at −20, −15, −10, and −5 dB SNR.

Table 8 shows Cliff's deltas and 𝑝-values, computed with the paired two-sided Wilcoxon signed-rank test, as in the MUSHRA experiments. The effect sizes support the observations made from Figure 8. Medium and large differences (|𝑑C| > 0.28) exist between AO and AV systems, especially at low SNRs. While AO-L and AO-NL are indistinguishable (|𝑑C| < 0.11) for SNR < −10 dB, there is a medium (0.28 ≤ |𝑑C| < 0.43) difference between AV-L and AV-NL, except at −15 dB SNR. Moreover, the intelligibility increase of AV-L over the unprocessed signals is perceived by the subjects at SNR ≤ −15 dB.

Regarding the 𝑝-values, if we focus on each SNR separately, the difference between two approaches can be considered significant for 𝑝 < 0.0083 (cf. Section 5.1.2). This condition is met only when we compare AO-L with AV-L at −20 dB SNR and AV-L with the noisy speech at −15 dB SNR.

There are three main sources of variability that most likely prevent the differences from being significant. First, the variation in lipreading ability among individuals is large and does not directly reflect the variation found in auditory speech perception skills (Summerfield, 1992). Secondly, individuals have very different fusion responses to discrepancy between the auditory and visual syllables (Mallick et al., 2015), which in our case might occur due to the artefacts produced in the enhancement process. Finally, the participants were not exposed to the same utterances processed with the different approaches, like in MUSHRA. Since the vocabulary set of the Lombard GRID corpus is small and some words are easier to understand because they contain unambiguous visemes, the intelligibility scores are affected not only by the various processing conditions, but also by the different sentences used.
Figure 8:
Percentage of correctly identified words obtained in the listening tests for the colour, the letter, and the digit fields, averaged across subjects. The mean intelligibility scores for all the fields are also reported.
6. Conclusion
In this paper, we presented an extensive analysis of the impact of Lombard effect on audio, visual and audio-visual speech enhancement systems based on deep learning. We conducted several experiments using a database consisting of 54 speakers and showed the general benefit of training a system with Lombard speech.

In more detail, we first trained systems with Lombard or non-Lombard speech and evaluated them on Lombard speech adopting a cross-validation setup. The results showed that systems trained with Lombard speech outperform the respective systems trained with non-Lombard speech in terms of both estimated speech quality and estimated speech intelligibility. We also observed a performance difference across speakers, with an evident gap between genders: the performance difference between the systems trained with Lombard speech and the ones trained with non-Lombard speech is larger for females than it is for males. The analysis that we performed suggests that this difference might be primarily due to the large increment in the fundamental frequency that female speakers exhibit from non-Lombard to Lombard conditions.

With the objective of building more general systems able to deal with a wider SNR range, we then trained systems using Lombard and non-Lombard speech and compared them with systems trained only on non-Lombard speech. As in the narrow SNR case, systems that include Lombard speech perform considerably better than the others at low SNRs. At high SNRs, the estimated speech quality and the estimated speech intelligibility obtained with systems trained only with non-Lombard speech are higher, even though the performance gap is very small for the audio and the audio-visual cases. Combining non-Lombard and Lombard speech for training in the way we did guarantees a good compromise for the enhancement performance across all the SNRs.

We also performed subjective listening tests with audio-visual stimuli, in order to evaluate the systems in a situation closer to the real-world scenario, where the listener can see the face of the talker. For the speech quality test, we found significant differences between Lombard and non-Lombard systems at all the used SNRs for the audio-visual case and only at −5 dB SNR for the audio-only case. Regarding the speech intelligibility test, we observed that on average the scores obtained with the audio-visual system trained with Lombard speech are higher than the other processing conditions. However, we were unable to find significant differences in most of the cases, suggesting that in future works more effort should be put into designing new paradigms for speech intelligibility tests to control the several sources of variability caused by the combination of auditory and visual stimuli.
7. Acknowledgements
This work was supported, in part, by the Oticon Foundation.
References
Abel, A., Hussain, A., 2014. Novel two-stage audiovisual speech filtering in noisy environments. Cognitive Computation 6 (2), 200–217.
Abel, A., Hussain, A., Luo, B., 2014. Cognitively inspired speech processing for multimodal hearing technology. In: Proceedings of CICARE. IEEE, pp. 56–63.
Afouras, T., Chung, J. S., Zisserman, A., 2018. The conversation: Deep audio-visual speech enhancement. In: Proceedings of Interspeech. pp. 3244–3248.
Alghamdi, N., 2017. Visual speech enhancement and its application in speech perception training. Ph.D. thesis, University of Sheffield.
Alghamdi, N., Maddock, S., Marxer, R., Barker, J., Brown, G. J., 2018. A corpus of audio-visual Lombard speech with frontal and profile views. The Journal of the Acoustical Society of America 143 (6), EL523–EL529.
Allen, J., 1977. Short term spectral analysis, synthesis, and modification by discrete Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing 25 (3), 235–238.
Almajai, I., Milner, B., 2011. Visually derived Wiener filters for speech enhancement. IEEE Transactions on Audio, Speech, and Language Processing 19 (6), 1642–1651.
Almajai, I., Milner, B., Darch, J., 2006. Analysis of correlation between audio and visual speech features for clean audio feature prediction in noise. In: Proceedings of Interspeech/ICSLP. p. 1634.
Boersma, P., Weenink, D., 2001. Praat: doing phonetics by computer. http://www.praat.org/, accessed: March 20, 2019.
Cliff, N., 1993. Dominance statistics: Ordinal analyses to answer ordinal questions. Psychological Bulletin 114 (3), 494.
Cooke, M., Barker, J., Cunningham, S., Shao, X., 2006. An audio-visual corpus for speech perception and automatic speech recognition. The Journal of the Acoustical Society of America 120 (5), 2421–2424.
Cooke, M., King, S., Garnier, M., Aubanel, V., 2014. The listening talker: A review of human and algorithmic context-induced modifications of speech. Computer Speech & Language 28 (2), 543–571.
EBU, 2014. EBU recommendation R128 - Loudness normalisation and permitted maximum level of audio signals.
Ephraim, Y., Malah, D., 1984. Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator. IEEE Transactions on Acoustics, Speech, and Signal Processing 32 (6), 1109–1121.
Ephrat, A., Mosseri, I., Lang, O., Dekel, T., Wilson, K., Hassidim, A., Freeman, W. T., Rubinstein, M., 2018. Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation. ACM Transactions on Graphics 37 (4), 112:1–112:11.
Erber, N. P., 1975. Auditory-visual perception of speech. Journal of Speech and Hearing Disorders 40 (4), 481–492.
Erdogan, H., Hershey, J. R., Watanabe, S., Le Roux, J., 2015. Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks. In: Proceedings of ICASSP. IEEE, pp. 708–712.
Erkelens, J. S., Hendriks, R. C., Heusdens, R., Jensen, J., 2007. Minimum mean-square error estimation of discrete Fourier coefficients with generalized Gamma priors. IEEE Transactions on Audio, Speech, and Language Processing 15 (6), 1741–1752.
Field, A., 2013. Discovering statistics using IBM SPSS statistics. Sage.
Gabbay, A., Shamir, A., Peleg, S., 2018. Visual speech enhancement. In: Proceedings of Interspeech. pp. 1170–1174.
Garnier, M., Bailly, L., Dohen, M., Welby, P., Lœvenbruck, H., 2006. An acoustic and articulatory study of Lombard speech: Global effects on the utterance. In: Proceedings of Interspeech/ICSLP. pp. 2246–2249.
Garnier, M., Henrich, N., 2014. Speaking in noise: How does the Lombard effect improve acoustic contrasts between speech and ambient noise? Computer Speech & Language 28 (2), 580–597.
Garnier, M., Henrich, N., Dubois, D., 2010. Influence of sound immersion and communicative interaction on the Lombard effect. Journal of Speech, Language, and Hearing Research 53 (3), 588–608.
Garnier, M., Ménard, L., Richard, G., 2012. Effect of being seen on the production of visible speech cues. A pilot study on Lombard speech. In: Proceedings of Interspeech/ICSLP. pp. 611–614.
Gerkmann, T., Martin, R., 2009. On the statistics of spectral amplitudes after variance reduction by temporal cepstrum smoothing and cepstral nulling. IEEE Transactions on Signal Processing 57 (11), 4165–4174.
Girin, L., Schwartz, J.-L., Feng, G., 2001. Audio-visual enhancement of speech in noise. The Journal of the Acoustical Society of America 109 (6), 3007–3020.
The Journal of the Acous-tical Society of America 109 (6), 3007–3020.Glorot, X., Bengio, Y., 2010. Understanding the difficulty oftraining deep feedforward neural networks. In: Proceed-ings of AISTATS. pp. 249–256.Grancharov, V., Kleijn, W., 2008. Speech Quality Assess-ment. Springer Berlin Heidelberg, pp. 83–100.Griffin, D., Lim, J., 1984. Signal estimation from modi-fied short-time Fourier transform. IEEE Transactions onAcoustics, Speech, and Signal Processing 32 (2), 236–243.Hansen, J. H., Varadarajan, V., 2009. Analysis and compen-sation of Lombard speech across noise type and levelswith application to in-set/out-of-set speaker recognition.IEEE Transactions on Audio, Speech, and Language Pro-cessing 17 (2), 366–378.Hentschke, H., Stüttgen, M. C., 2011. Computation of mea-sures of effect size for neuroscience data sets. EuropeanJournal of Neuroscience 34 (12), 1887–1894.Heracleous, P., Ishi, C. T., Sato, M., Ishiguro, H., Hagita, N.,2013. Analysis of the visual Lombard effect and automaticrecognition experiments. Computer Speech & Language27 (1), 288–300. - Page 13 of 16eep-Learning-Based Audio-Visual Speech Enhancement in Presence of Lombard Effect
Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., Salakhutdinov, R. R., 2012. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580.
Hou, J.-C., Wang, S.-S., Lai, Y.-H., Lin, J.-C., Tsao, Y., Chang, H.-W., Wang, H.-M., 2018. Audio-visual speech enhancement based on multimodal deep convolutional neural network. IEEE Transactions on Emerging Topics in Computational Intelligence 2 (2), 117–128.
Hu, Y., Loizou, P. C., 2007. A comparative intelligibility study of single-microphone noise reduction algorithms. The Journal of the Acoustical Society of America 122 (3), 1777–1786.
Hussain, A., Barker, J., Marxer, R., Adeel, A., Whitmer, W., Watt, R., Derleth, P., 2017. Towards multi-modal hearing aid design and evaluation in realistic audio-visual settings: Challenges and opportunities. In: Proceedings of CHAT. pp. 29–34.
Ioffe, S., Szegedy, C., 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: Proceedings of ICML. pp. 448–456.
Isola, P., Zhu, J.-Y., Zhou, T., Efros, A. A., 2017. Image-to-image translation with conditional adversarial networks. In: Proceedings of CVPR. pp. 1125–1134.
ITU, 2003. Recommendation ITU-R BS.1534-1: Method for the subjective assessment of intermediate quality level of coding systems.
ITU, 2005. Recommendation P.862.2: Wideband extension to recommendation P.862 for the assessment of wideband telephone networks and speech codecs.
Jensen, J., Hendriks, R. C., 2012. Spectral magnitude minimum mean-square error estimation using binary and continuous gain functions. IEEE Transactions on Audio, Speech, and Language Processing 20 (1), 92–102.
Jensen, J., Taal, C. H., 2016. An algorithm for predicting the intelligibility of speech masked by modulated noise maskers. IEEE/ACM Transactions on Audio, Speech, and Language Processing 24 (11), 2009–2022.
Junqua, J.-C., 1993. The Lombard reflex and its role on human listeners and automatic speech recognizers. The Journal of the Acoustical Society of America 93 (1), 510–524.
Kazemi, V., Sullivan, J., 2014. One millisecond face alignment with an ensemble of regression trees. In: Proceedings of CVPR. pp. 1867–1874.
King, D. E., 2009. Dlib-ml: A machine learning toolkit. Journal of Machine Learning Research 10, 1755–1758.
Kingma, D. P., Ba, J., 2015. Adam: A method for stochastic optimization. In: Proceedings of ICLR.
Kolbæk, M., Tan, Z.-H., Jensen, J., 2016. Speech enhancement using long short-term memory based recurrent neural networks for noise robust speaker verification. In: Proceedings of SLT. IEEE, pp. 305–311.
Kolbæk, M., Tan, Z.-H., Jensen, J., 2017. Speech intelligibility potential of general and specialized deep neural network based speech enhancement systems. IEEE/ACM Transactions on Audio, Speech, and Language Processing 25 (1), 153–167.
Lane, H., Tranel, B., 1971. The Lombard sign and the role of hearing in speech. Journal of Speech and Hearing Research 14 (4), 677–709.
Lim, J. S., Oppenheim, A. V., 1979. Enhancement and bandwidth compression of noisy speech. Proceedings of the IEEE 67 (12), 1586–1604.
Liu, L., Özsu, M. T., 2009. Encyclopedia of database systems. Vol. 6. Springer, New York, NY, USA.
Loizou, P. C., 2007. Speech enhancement: Theory and practice. CRC Press.
Lombard, E., 1911. Le signe de l'élévation de la voix. Annales des Maladies de l'Oreille et du Larynx 37 (2), 101–119.
Lu, X., Tsao, Y., Matsuda, S., Hori, C., 2013. Speech enhancement based on deep denoising autoencoder. In: Proceedings of Interspeech. pp. 436–440.
Lu, Y., Cooke, M., 2008. Speech production modifications produced by competing talkers, babble, and stationary noise. The Journal of the Acoustical Society of America 124 (5), 3261–3275.
Lu, Y., Cooke, M., 2009. Speech production modifications produced in the presence of low-pass and high-pass filtered noise. The Journal of the Acoustical Society of America 126 (3), 1495–1499.
Maas, A. L., Hannun, A. Y., Ng, A. Y., 2013. Rectifier nonlinearities improve neural network acoustic models. In: ICML Workshop on Deep Learning for Audio, Speech and Language Processing.
Mallick, D. B., Magnotti, J. F., Beauchamp, M. S., 2015. Variability and stability in the McGurk effect: Contributions of participants, stimuli, time, and response type. Psychonomic Bulletin & Review 22 (5), 1299–1307.
Martin, R., 2005. Speech enhancement based on minimum mean-square error estimation and supergaussian priors. IEEE Transactions on Speech and Audio Processing 13 (5), 845–856.
Martin, R., Breithaupt, C., 2003. Speech enhancement in the DFT domain using Laplacian speech priors. In: Proceedings of IWAENC. Vol. 3. pp. 87–90.
Marxer, R., Barker, J., Alghamdi, N., Maddock, S., 2018. The impact of the Lombard effect on audio and visual speech recognition systems. Speech Communication 100, 58–68.
McGurk, H., MacDonald, J., 1976. Hearing lips and seeing voices. Nature 264 (5588), 746–748.
Mendonça, C., Delikaris-Manias, S., 2018. Statistical tests with MUSHRA data. In: Audio Engineering Society Convention 144. Audio Engineering Society.
Michelsanti, D., Tan, Z.-H., 2017. Conditional generative adversarial networks for speech enhancement and noise-robust speaker verification. In: Proceedings of Interspeech. pp. 2008–2012.
Michelsanti, D., Tan, Z.-H., Sigurdsson, S., Jensen, J., 2019a. Effects of Lombard reflex on the performance of deep-learning-based audio-visual speech enhancement systems. In: Proceedings of ICASSP. pp. 6615–6619.
Michelsanti, D., Tan, Z.-H., Sigurdsson, S., Jensen, J., 2019b. On training targets and objective functions for deep-learning-based audio-visual speech enhancement. In: Proceedings of ICASSP. pp. 8077–8081.
Moore, D. S., McCabe, G. P., Craig, B. A., 2012. Introduction to the Practice of Statistics. W. H. Freeman, New York.
Morrone, G., Pasa, L., Tikhanoff, V., Bergamaschi, S., Fadiga, L., Badino, L., 2019. Face landmark-based speaker-independent audio-visual speech enhancement in multi-talker environments. In: Proceedings of ICASSP. pp. 6900–6904.
Owens, A., Efros, A. A., 2018. Audio-visual scene analysis with self-supervised multisensory features. In: Proceedings of ECCV. pp. 631–648.
Park, S. R., Lee, J., 2017. A fully convolutional neural network for speech enhancement. In: Proceedings of Interspeech. pp. 1993–1997.
Pittman, A. L., Wiley, T. L., 2001. Recognition of speech produced in noise. Journal of Speech, Language, and Hearing Research.
Raphael, L. J., Borden, G. J., Harris, K. S., 2007. Speech science primer: Physiology, acoustics, and perception of speech. Lippincott Williams & Wilkins.
Rix, A. W., Beerends, J. G., Hollier, M. P., Hekstra, A. P., 2001. Perceptual evaluation of speech quality (PESQ) - a new method for speech quality assessment of telephone networks and codecs. In: Proceedings of ICASSP. Vol. 2. IEEE, pp. 749–752.
Sagonas, C., Antonakos, E., Tzimiropoulos, G., Zafeiriou, S., Pantic, M., 2016. 300 faces in-the-wild challenge: Database and results. Image and Vision Computing 47, 3–18.
Sharma, A., 2005. Text book of correlations and regression. Discovery Publishing House.
Sullivan, G. M., Feinn, R., 2012. Using effect size - or why the P value is not enough. Journal of Graduate Medical Education 4 (3), 279–282.
Sumby, W. H., Pollack, I., 1954. Visual contribution to speech intelligibility in noise. The Journal of the Acoustical Society of America 26 (2), 212–215.
Summerfield, Q., 1992. Lipreading and audio-visual speech perception. Philosophical Transactions of the Royal Society of London. Series B: Biological Sciences 335 (1273), 71–78.
Summers, W. V., Pisoni, D. B., Bernacki, R. H., Pedlow, R. I., Stokes, M. A., 1988. Effects of noise on speech production: Acoustic and perceptual analyses. The Journal of the Acoustical Society of America 84 (3), 917–928.
Tang, L. Y., Hannah, B., Jongman, A., Sereno, J., Wang, Y., Hamarneh, G., 2015. Examining visible articulatory features in clear and plain speech. Speech Communication 75, 1–13.
Vargha, A., Delaney, H. D., 2000. A critique and improvement of the CL common language effect size statistics of McGraw and Wong. Journal of Educational and Behavioral Statistics 25 (2), 101–132.
Vatikiotis-Bateson, E., Barbosa, A. V., Chow, C. Y., Oberg, M., Tan, J., Yehia, H. C., 2007. Audiovisual Lombard speech: Reconciling production and perception. In: Proceedings of AVSP. p. 41.
Vincent, E., 2005. MUSHRAM: A MATLAB interface for MUSHRA listening tests. http://c4dm.eecs.qmul.ac.uk/downloads/, accessed: March 20, 2019.
Wang, D., Chen, J., 2018. Supervised speech separation based on deep learning: An overview. IEEE/ACM Transactions on Audio, Speech, and Language Processing 26 (10), 1702–1726.
Wang, Y., Narayanan, A., Wang, D., 2014. On training targets for supervised speech separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 22 (12), 1849–1858.
Weninger, F., Hershey, J. R., Le Roux, J., Schuller, B., 2014. Discriminatively trained recurrent neural networks for single-channel speech separation. In: Proceedings of GlobalSIP. IEEE, pp. 577–581.
Wilcoxon, F., 1945. Individual comparisons by ranking methods. Biometrics Bulletin 1 (6), 80–83.
Williamson, D. S., Wang, Y., Wang, D., 2016. Complex ratio masking for monaural speech separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 24 (3), 483–492.