Improved Lite Audio-Visual Speech Enhancement
Shang-Yi Chuang, Hsin-Min Wang, Yu Tsao
Research Center for Information Technology Innovation, Academia Sinica, Taiwan
Institute of Information Science, Academia Sinica, Taiwan
Abstract
Numerous studies have investigated the effectiveness of audio-visual multimodal learning for speech enhancement (AVSE) tasks, seeking a solution that uses visual data as auxiliary and complementary input to reduce the noise of noisy speech signals. Recently, we proposed a lite audio-visual speech enhancement (LAVSE) algorithm. Compared to conventional AVSE systems, LAVSE requires less online computation and moderately solves the user privacy problem on facial data. In this study, we extend LAVSE to improve its ability to address three practical issues often encountered in implementing AVSE systems, namely, the requirement for additional visual data, audio-visual asynchronization, and low-quality visual data. The proposed system is termed improved LAVSE (iLAVSE), which uses a convolutional recurrent neural network architecture as the core AVSE model. We evaluate iLAVSE on the Taiwan Mandarin speech with video dataset. Experimental results confirm that compared to conventional AVSE systems, iLAVSE can effectively overcome the aforementioned three practical issues and can improve enhancement performance. The results also confirm that iLAVSE is suitable for real-world scenarios, where high-quality audio-visual sensors may not always be available.
Keywords: speech enhancement, audio-visual, data compression, asynchronous multimodal learning, low-quality data
Email address: [email protected], [email protected], [email protected] (Shang-Yi Chuang, Hsin-Min Wang, Yu Tsao)

1. Introduction

Speech is the most natural and convenient means for human-human and human-machine communications. In recent years, various speech-related applications have been developed and have facilitated our daily lives. For most of these applications, however, the performance may be affected by acoustic distortions, which may lower the quality of the input speech. These acoustic distortions may come from different sources, such as recording sensors, background noise, and reverberations. To alleviate the distortion issue, many approaches have been proposed, and speech enhancement (SE) is one of them. The goal of SE is to enhance low-quality speech signals to improve quality and intelligibility. SE systems have been widely used as front-end processes in automatic speech recognition [1, 2, 3], speaker recognition [4], speech coding [5], hearing aids [6, 7, 8], and cochlear implants [9, 10] to improve the performance of target tasks.

Traditional SE methods are generally designed based on the properties of speech and noise signals. A class of approaches estimates the statistics of speech and noise signals to design a gain/filter function, which is then used to suppress the noise components in noisy speech. Notable examples belonging to this class include the Wiener filter [11, 12] and its extensions [13], such as the minimum mean square error spectral estimator [14, 15], maximum a posteriori spectral amplitude estimator [16, 17], and maximum likelihood spectral amplitude estimator [18, 19]. Another class of approaches considers the temporal properties or data distributions of speech and noise signals. Notable examples include harmonic models [20], linear prediction models [21, 22], hidden Markov models [23], singular value decomposition [24], and the Karhunen-Loeve transform [25]. In recent years, numerous machine-learning-based SE methods have been proposed. These approaches generally learn a model from training data in a data-driven manner. Then, the trained model is used to convert the noisy speech signals into clean speech signals. Notable machine-learning-based SE methods include compressive sensing [26], sparse coding [27, 28], non-negative matrix factorization [29], and robust principal component analysis [30, 31].

More recently, deep learning (DL) has become a popular and effective machine learning algorithm [32, 33, 34] and has brought significant progress in the SE field [35, 36, 37, 38, 39, 40, 41, 42, 43]. Based on the deep structure, an effective representation of the noisy input signal can be extracted and used to reconstruct a clean signal [44, 45, 46, 47, 48, 49, 50]. Various DL-based model structures, including deep denoising autoencoders [51, 52], fully connected neural networks [53, 54, 55], convolutional neural networks (CNNs) [56, 57], recurrent neural networks (RNNs), and long short-term memory (LSTM) [58, 59, 60, 61, 62, 63], have been used as the core model of an SE system and have been proven to provide better performance than traditional statistical and machine-learning methods. Another well-known advantage of DL models is that they can flexibly fuse data from different domains. Recently, researchers have tried to incorporate text [64], bone-conducted signals [65], and visual cues [66, 67, 68, 69] into SE systems as auxiliary and complementary information to achieve better SE performance.
Among them, visual cues are the most common and intuitive because most devices can capture audio and visual data simultaneously. Numerous audio-visual SE (AVSE) systems have been proposed and confirmed to be effective [70, 71, 72, 73]. In our previous work, a lite AVSE (LAVSE) approach was proposed to handle the immense visual data and potential privacy issues [74]. The LAVSE system uses an autoencoder (AE)-based compression network along with a latent feature quantization unit [75, 76] to successfully reduce the size of visual data and handle the privacy issues.

In this study, we intend to further explore three practical issues that are often encountered when implementing AVSE systems in real-world scenarios; they are: (1) the requirement of additional visual data (usually much larger than audio data), (2) audio-visual asynchronization, and (3) low-quality visual data. In order to address these issues, we extend the LAVSE system to an improved LAVSE (iLAVSE) system, which is formed by a multimodal convolutional RNN (CRNN) architecture in which the recurrent part is realized by an LSTM layer. The audio data are provided as input directly to the SE model, while the visual input is first processed by a three-unit data compression module CRQ (C for color channel, R for resolution, and Q for bit quantization) and a pretrained AE module. In CRQ, we adopt three data compression units: reducing the number of channels, reducing the resolution, and reducing the number of bits. The AE is formed by a deep convolutional architecture and can extract meaningful and compact representations, which are then quantized and used as input of the CRNN AVSE model. Based on the visual data compression CRQ module and AE module, the size of the visual input is significantly reduced, and the privacy issue can be addressed well.

Audio-visual asynchronization is a common issue that may arise from low-quality audio-visual sensors. We propose to handle this issue based on a data augmentation scheme. The problem of low-quality visual data also includes the failure of the sensor to capture the visual signal. A practical example is the use of an AVSE system in a car driving scenario. When the car passes through a tunnel, the visual information disappears due to insufficient light. We solve this problem through a zero-pad training scheme. The proposed iLAVSE was evaluated on the Taiwan Mandarin speech with video (TMSV) dataset [74]. Based on the special design of the model architecture and data augmentation, iLAVSE can effectively overcome the above three issues and provide more robust SE performance than LAVSE and several related SE methods.

The remainder of this paper is organized as follows. Section 2 reviews related work on AVSE systems and data quantization techniques. Section 3 introduces the proposed iLAVSE system. Section 4 presents our experimental setup and results. Finally, Section 5 provides the concluding remarks.
Figure 1: Previous AVSE systems: (a) AVDCNN [70]; (b) LAVSE [74].
2. Related Work
In this section, we review several existing AVSE systems. In [77], a fully connected network was used to jointly process audio and visual inputs to perform SE. Since the fully connected architecture cannot effectively process visual information, the AVSE system in [77] is only slightly better than its audio-only SE counterpart. In order to further improve the performance, a multimodal deep CNN SE (termed AVDCNN) system [70] was subsequently proposed. As shown in Fig. 1a (ISTFT denotes inverse short time Fourier transform; FC denotes fully connected layers; Conv denotes convolutional layers; Pool denotes max-pooling layers), the AVDCNN system consists of several convolutional layers to process audio and visual data. Experimental results show that compared with the audio-only deep CNN system, the AVDCNN system can effectively improve the SE performance. Later, Gabbay et al. proposed another visual SE (VSE) model, which has a similar architecture to AVDCNN, but does not reconstruct the visual part in the output layer [78]. In the meantime, a looking-to-listen (L2L) system was proposed, which uses estimated complex masks to reconstruct enhanced spectral features [79]. In [80], a variational AE (VAE) model was used as the basis model to build the AVSE system. The authors also investigated the possibility of using a strong pretrained model for visual feature extraction and performing SE in an unsupervised manner.

Unlike audio-only SE systems, the above-mentioned AVSE systems require additional visual input, which causes additional hardware and computational costs. In addition, the use of facial or lip images may cause privacy issues. Some work has been done to deal with these two issues. In [74], the LAVSE system was proposed to effectively reduce the size of visual input and user identifiability. As shown in Fig. 1b, the LAVSE system uses an AE to extract meaningful and compact representations of visual data as the input of the SE model to reduce computational costs and appropriately solve the privacy problem in facial information.

Figure 2: Single-precision floating-point format: sign (S), exponential (Exp), and mantissa (Man) bits.
Quantization is a simple and effective way to reduce the size of data. Fig. 2 shows the data format of single-precision floating-point numbers in IEEE 754 [81]. There are 32 base-2 bits, including 1 sign bit, 8 exponential bits, and 23 mantissa bits. The decimal value of a single-precision floating-point representation is calculated as

$$
\begin{aligned}
value_{10} &= (-1)^{S} \times 2^{(Exp_{10} - bias)} \times Man_{10}, \\
S &= s_0, \\
Exp_2 &= e_1 e_2 \cdots e_8, \qquad Exp_{10} = \sum_{i=1}^{8} e_i \times 2^{(8-i)}, \\
Man_2 &= m_9 m_{10} \cdots m_{31}, \qquad Man_{10} = 1 + \sum_{i=9}^{31} m_i \times 2^{(8-i)},
\end{aligned} \tag{1}
$$

where the subscripts 2 and 10 of $value$, $Exp$, and $Man$ denote base-2 and base-10, respectively. The sign bit determines whether the value represented is positive ($S = 0$) or negative ($S = 1$). The exponential bits represent a 2's complement, which can store negative values with a bias of 127 ($2^7 - 1$).
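As an illustration (our own sketch, not the authors' code), the simplest form of such bit quantization keeps the sign and exponent bits of each 32-bit float and zeroes out some or all of the 23 mantissa bits; the function name and the NumPy bit-view trick below are ours. Further reducing the number of exponent bits, as done later in this paper, would require an additional step not shown here.

```python
import numpy as np

def quantize_mantissa(x, keep_mantissa_bits=0):
    """Keep the sign and exponent bits of float32 values and zero out all but
    the highest `keep_mantissa_bits` of the 23 mantissa bits (see Eq. (1))."""
    x = np.asarray(x, dtype=np.float32)
    bits = x.view(np.uint32)                        # reinterpret the raw 32-bit patterns
    drop = 23 - keep_mantissa_bits                  # number of low-order mantissa bits to clear
    mask = np.uint32((0xFFFFFFFF << drop) & 0xFFFFFFFF)
    return (bits & mask).view(np.float32)

# Example: exponent-only representation of a few values.
print(quantize_mantissa(np.array([0.123456, -3.14159, 1e-6], dtype=np.float32)))
```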
3. Proposed iLAVSE System
As mentioned earlier, this study investigates three practical issues: (1) the requirement of additional visual data, (2) audio-visual data asynchronization, and (3) low-quality visual data. We propose three approaches to address these issues respectively: (1) visual data compression, (2) compensation for audio-visual asynchronization, and (3) zero-pad training. By integrating the above three approaches with the CRNN AVSE architecture, the proposed iLAVSE can perform SE well even under unfavorable testing conditions. In this section, we first present the overall system of iLAVSE. Then, we describe the three issues and our solutions.

Figure 3: The proposed iLAVSE system.
Figure 4: The proposed CRQ module (lip images → grayscale images → low-resolution images → quantized images).

The proposed iLAVSE system is demonstrated in Fig. 3. As shown in the figure, the iLAVSE system includes three stages: a data preprocessing stage, a CRNN-based AVSE stage, and a data reconstruction stage. The functions of iLAVSE are shown as follows,
$$
\begin{aligned}
CRQ(I_{i,n}) &= Qua_{img}(Res_{img}(Col_{img}(I_{i,n}))), \\
Z_{i,n} &= Encoder_{AE}(CRQ(I_{i,n})), \\
A_{i,n} &= Conv_a(Conv_a(Pool_a(Conv_a(X_{i,n-L:n+L})))), \\
V_{i,n} &= Qua_{latent}(Z_{i,n}), \\
AV_{i,n} &= [A^{T}_{i,n}; V^{T}_{i,n-L:n+L}]^{T}, \\
F_{i,n} &= FC(LSTM(AV_{i,n})), \\
\hat{Y}_{i,n} &= FC_a(F_{i,n}), \\
\hat{Z}_{i,n} &= FC_v(F_{i,n}),
\end{aligned} \tag{2}
$$

where $i \in \{1, ..., K\}$ denotes the $i$-th training utterance, and $K$ is the number of training utterances; $n \in \{L, ..., F-L\}$ denotes the $n$-th sample frame, $L$ is the size of the concatenated frames for a context window, and $F$ is the number of frames of the $i$-th utterance. We have implemented three data compression functions in iLAVSE, which are outlined in green blocks in Fig. 3. CRQ is a three-unit data compression module used to compress the visual image data. As shown in Fig. 4, the CRQ module consists of Col_img, Res_img, and Qua_img, denoting color channel reduction, resolution reduction, and bit quantization, respectively. Qua_latent stands for the bit quantization of the latent feature extracted by Encoder_AE, the encoder part of a pretrained AE.

In the data preprocessing stage, the waveform of the noisy data is transformed into log1p spectral features (X) by using the short time Fourier transform (STFT), while the visual image data (I) are compressed and transformed into latent features (Z) by the CRQ module and Encoder_AE. Next, in the CRNN AVSE stage, the audio spectral features X pass through an audio net composed of convolutional and pooling layers to extract the audio latent features (A), and the Qua_latent unit further quantizes the visual input Z to V. Then, the audio latent features A and the quantized visual latent features V are concatenated as AV, which is then sent into the fusion net and turned into F. Then, the fused features F are decoded into the audio spectral features (Ŷ) and the visual latent features (Ẑ) respectively through a linear layer. During testing, the former (with the phase of the noisy speech) is reconstructed into the speech waveform using the inverse STFT in the data reconstruction stage.

Note that we choose the log1p feature [82] because its projecting range can avoid some minimum values in the data. If we take 10⁻⁶ for example: if log is applied, the projected value is −6, but if log1p is applied, the projected value is approximately 0. This characteristic enables the log1p feature to be easily normalized and trained.
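To make the data flow in Eq. (2) concrete, the following PyTorch sketch mirrors the audio net, the audio-visual concatenation, the LSTM-based fusion net, and the two linear output heads. All layer widths, kernel sizes, and the class name are our assumptions for illustration only; they are not the exact configuration used in the paper.

```python
import torch
import torch.nn as nn

class CRNNAVSE(nn.Module):
    """Illustrative sketch of the CRNN AVSE model in Eq. (2); sizes are placeholders."""
    def __init__(self, n_freq=257, vis_dim=128, hidden=256):
        super().__init__()
        # Audio net: Conv -> Pool -> Conv -> Conv along the time axis (A in Eq. (2)).
        self.audio_net = nn.Sequential(
            nn.Conv1d(n_freq, 256, kernel_size=5, padding=2), nn.LeakyReLU(0.3),
            nn.MaxPool1d(kernel_size=3, stride=1, padding=1),
            nn.Conv1d(256, 256, kernel_size=5, padding=2), nn.LeakyReLU(0.3),
            nn.Conv1d(256, 256, kernel_size=5, padding=2), nn.LeakyReLU(0.3),
        )
        # Fusion net: LSTM over the concatenated audio/visual features, then a linear layer.
        self.lstm = nn.LSTM(256 + vis_dim, hidden, batch_first=True)
        self.fc_fusion = nn.Linear(hidden, hidden)
        # Output heads: enhanced spectrogram (Y_hat) and reconstructed visual latent (Z_hat).
        self.fc_a = nn.Linear(hidden, n_freq)
        self.fc_v = nn.Linear(hidden, vis_dim)

    def forward(self, noisy_spec, vis_latent):
        # noisy_spec: (B, n_freq, T) log1p spectra; vis_latent: (B, vis_dim, T) quantized AE features.
        a = self.audio_net(noisy_spec)               # audio latent features A
        av = torch.cat([a, vis_latent], dim=1)       # AV = [A; V]
        f, _ = self.lstm(av.transpose(1, 2))         # fusion over time, (B, T, hidden)
        f = self.fc_fusion(f)                        # fused features F
        return self.fc_a(f), self.fc_v(f)            # (Y_hat, Z_hat)
```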
For AVSE systems, the main goal is to use visual data as an auxiliary input to retrieve the clean speech signals from the distorted speech signals. However, the size of visual data is generally much larger than that of audio data, which may cause unfavorable hardware and computational costs when implementing the AVSE system. Our previous work has proven that visual data may not require very high precision, and the original image sequence can be replaced by meaningful and compact representations extracted by an AE [74]. In this study, we further explore directly reducing the size of visual data by the CRQ compression module. The AE is directly applied to the compressed image sequence to extract a compact representation. The extracted representation is then further compressed by Qua_latent and sent to the CRNN-based AVSE stage in iLAVSE.

Figure 5: The AE model for visual input data compression.
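A rough sketch of the three CRQ units is given below, assuming Pillow and NumPy and reusing the hypothetical `quantize_mantissa` helper from the earlier snippet; the target resolution and the degree of quantization are placeholders, and the 5-bit image format used later in this paper (1 sign bit and 4 exponential bits) additionally truncates the exponent, which this simplified sketch does not do.

```python
import numpy as np
from PIL import Image

def crq_compress(lip_image, size=16):
    """CRQ: color channel reduction -> resolution reduction -> bit quantization.
    `lip_image` is an H x W x 3 uint8 RGB lip crop."""
    img = Image.fromarray(lip_image)
    img = img.convert("L")                            # Col_img: RGB -> grayscale (3 channels -> 1)
    img = img.resize((size, size), Image.BILINEAR)    # Res_img: downsample to size x size pixels
    x = np.asarray(img, dtype=np.float32) / 255.0     # scale pixel values to [0, 1]
    return quantize_mantissa(x)                       # Qua_img: keep only sign/exponent bits
```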
Visual Feature Extraction by a CNN-based AE.
As mentioned earlier, iLAVSE uses the three visual data compression units in the CRQ module, namely Col_img, Res_img, and Qua_img, to perform color channel reduction, resolution reduction, and bit quantization, respectively. The size of the original image sequence can be notably reduced by the three units. The compressed visual data is then passed to Encoder_AE, and the latent representation is used as the visual representation. As shown in Fig. 5, we use a 2D-convolution-layer-only AE to process the visual input data. For a given visual input, the AE is trained to reconstruct the input images.

Generally, captured images are saved in RGB (three channels) or grayscale (one channel) format. Therefore, to make the iLAVSE system applicable to different scenarios, we consider both RGB and grayscale visual inputs to train the AE model. As a result, this AE model can reconstruct RGB and grayscale images.

Figure 6: Original and quantized visual latent features: (a) 32-bit AE features; (b) 3-bit AE features.
Figure 7: The distributions of original and quantized visual latent features.
In addition, we use images with different resolutions to train the AE model. Since the lip images are about 100 to 250 pixels square, we designed three settings to reduce the resolution: 64, 32, and 16 (pixels square). When using a resolution of 64, for example, the original image at sizes of 100 to 250 pixels square is resized to 64 pixels square.

For data quantization, we first quantize the values of an input image by removing the mantissa bits in the floating-point representation. To train the AE, we place the quantized and original images at the input and output, respectively. In real-world applications, the AE model can reconstruct the original visual data from the quantized version.
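A minimal sketch of such a 2D-convolutional AE is shown below; the channel counts, strides, and use of instance normalization are our assumptions, and the comment at the bottom only indicates the quantized-input / original-target training setup described above.

```python
import torch
import torch.nn as nn

class LipAE(nn.Module):
    """Sketch of a 2-D convolution-only AE for compressed lip images (illustrative sizes)."""
    def __init__(self, in_ch=1):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, 16, 3, stride=2, padding=1), nn.InstanceNorm2d(16), nn.LeakyReLU(0.3),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.InstanceNorm2d(32), nn.LeakyReLU(0.3),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 3, stride=2, padding=1, output_padding=1), nn.LeakyReLU(0.3),
            nn.ConvTranspose2d(16, in_ch, 3, stride=2, padding=1, output_padding=1),
        )

    def forward(self, x):
        z = self.encoder(x)         # compact latent representation (output of Encoder_AE)
        return self.decoder(z), z   # reconstructed image and latent feature

# Training sketch: the quantized (CRQ) image is the input, the original image is the target.
# recon, _ = LipAE()(quantized_batch); loss = nn.functional.mse_loss(recon, original_batch)
```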
Latent Feature Compression.

After extracting the latent feature by passing the compressed images to the AE, Qua_latent in Fig. 3 can further reduce the number of bits of each latent feature element. The quantized visual latent features are then used in the CRNN AVSE stage. Fig. 6 shows the visual latent features before and after the Qua_latent module. In real-world applications, the Encoder_AE module and Qua_latent unit can be installed in a low-quality visual sensor, thereby improving the online computing efficiency and greatly reducing the transmission costs.

To further confirm that the quantized latent representation can be used to replace the original latent representation, we plotted the distributions of the latent representations before and after applying bit quantization in Fig. 7. The lighter green bins represent the feature before Qua_latent is applied, and the darker green bins represent the feature after Qua_latent is applied. We can see that the darker green bins cover the range of the lighter green bins well, indicating that we can use the quantized latent feature to replace the original latent feature.

Figure 8: Synchronous and asynchronous audio and visual data: (a) synchronous; (b) asynchronous.

Multimodal data asynchronization is a common issue in multimodal learning. We also encountered this problem when implementing the AVSE system. The ideal situation is that the audio and visual data are precisely synchronized in time. Otherwise, the ancillary visual information may not be helpful or may even worsen the SE performance. Fig. 8 shows the synchronous and asynchronous situations of audio and visual data. Owing to audio-visual asynchronization, the video frames are not aligned with the speech well. In this study, we propose a data augmentation approach to alleviate this audio-visual asynchronization issue. The main idea is to artificially simulate various asynchronous audio-visual data to train the AVSE systems.
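A simple realization of this augmentation, under our own assumptions (frame-level offsets, edge frames repeated to keep the two streams the same length), randomly shifts the visual feature stream relative to the audio frames during training:

```python
import numpy as np

def random_av_offset(vis_feats, max_offset=2):
    """Randomly shift the visual feature stream (T x D) by up to ±max_offset frames,
    repeating the edge frame so that the length stays aligned with the audio frames."""
    offset = np.random.randint(-max_offset, max_offset + 1)
    if offset > 0:    # visual stream lags behind the audio
        return np.concatenate([np.repeat(vis_feats[:1], offset, axis=0), vis_feats[:-offset]], axis=0)
    if offset < 0:    # visual stream runs ahead of the audio
        return np.concatenate([vis_feats[-offset:], np.repeat(vis_feats[-1:], -offset, axis=0)], axis=0)
    return vis_feats
```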
Figure 9: Low-quality visual data: (a) low-quality lip images; (b) low-quality latent features.
Because visual data are regarded as an auxiliary input to the AVSE systems, a necessary requirement is that low-quality visual conditions will not degrade the SE performance. Under poor lighting conditions, such as in a tunnel or at a night market, the quality of video frames may be poor. Fig. 9a shows an example, where a segment of frames (in the middle region) has very poor quality. Using the entire video frames directly may degrade the AVSE performance. To overcome this problem, we intend to let iLAVSE dynamically decide whether video data should be used. More specifically, when the quality of a segment of image frames is poor (which can be determined using an additional light sensor), iLAVSE can directly discard the visual information. In this study, we prepare the training data by replacing the visual latent features of low-quality frames with zero, as shown in Fig. 9b. In this way, iLAVSE can perform SE based on only audio input without considering visual information, when the video frames are of low quality. Note that this study only considered the case where a low-quality situation occurs in a consecutive segment of frames, not in sporadic frames. However, it is believed that the proposed zero-pad training method is suitable for different low-quality visual data scenarios.
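The zero-pad training scheme can be sketched as follows; the consecutive-segment assumption mirrors the description above, while the uniform sampling of the missing proportion and the variable names are ours (the experiments in Section 4 instead draw it from a discrete set of percentages):

```python
import numpy as np

def zero_pad_visual(vis_feats, max_missing_ratio=0.5):
    """Replace a random consecutive segment of visual latent features (T x D) with zeros,
    simulating a stretch of low-quality (unusable) video frames."""
    t = vis_feats.shape[0]
    seg_len = int(round(t * np.random.uniform(0.0, max_missing_ratio)))
    if seg_len > 0:
        start = np.random.randint(0, t - seg_len + 1)
        vis_feats = vis_feats.copy()
        vis_feats[start:start + seg_len] = 0.0   # discard the visual information here
    return vis_feats
```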
4. Experiments
This section presents the experimental setup and results. Two standardized evaluation metrics were used to evaluate the SE performance: perceptual evaluation of speech quality (PESQ) [83] and the short-time objective intelligibility measure (STOI) [84]. PESQ was developed to evaluate the quality of processed speech, and the score ranges from -0.5 to 4.5. A higher PESQ score indicates that the enhanced speech has better speech quality. STOI was designed to evaluate speech intelligibility. The score typically ranges from 0 to 1. A higher STOI value indicates better speech intelligibility.
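For reference, both metrics can be computed with the open-source `pesq` and `pystoi` Python packages (our choice of tooling; the scoring mode and the exact score scale may differ slightly from the raw PESQ range quoted above):

```python
from pesq import pesq      # pip install pesq
from pystoi import stoi    # pip install pystoi

def evaluate(clean, enhanced, fs=16000):
    """Return (PESQ, STOI) for one utterance; `clean` and `enhanced` are 1-D float arrays."""
    return pesq(fs, clean, enhanced, 'wb'), stoi(clean, enhanced, fs, extended=False)
```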
Two audio-only baseline SE systems were implemented for comparison. Their model architectures are illustrated in Fig. 10. Fig. 10a is a system with the visual part in the iLAVSE system deleted, and Fig. 10b is a system with a dual-path audio model. The additional audio net in Fig. 10b is to increase the number of model parameters to be the same as in the iLAVSE model. This system tests whether additional improvements can be achieved by simply increasing the number of model parameters.

Figure 10: Architectures of two audio-only SE systems: (a) audio-only SE; (b) dual-path audio-only SE.

The loss function for training iLAVSE is based on the mean square error computed from both the audio and visual parts,
$$
\begin{aligned}
Loss_a &= \frac{1}{KF}\sum_{i=1}^{K}\sum_{n=1}^{F} \|\hat{Y}_{i,n} - Y_{i,n}\|^2, \\
Loss_v &= \frac{1}{KF}\sum_{i=1}^{K}\sum_{n=1}^{F} \|\hat{Z}_{i,n} - Z_{i,n}\|^2, \\
Loss &= Loss_a + \mu \times Loss_v,
\end{aligned} \tag{3}
$$

where $\mu$ is empirically determined as $10^{-}$. For training the two audio-only SE systems, only $Loss_a$ is used.

In this study, all the SE models were implemented using the PyTorch [85] library. The optimizer is Adam [86] with a learning rate of $5 \times 10^{-}$.
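A PyTorch sketch of Eq. (3) is given below; the tensor shapes and the default value of µ are placeholders (the paper sets µ empirically):

```python
import torch.nn.functional as F

def ilavse_loss(y_hat, y, z_hat, z, mu=1e-4):
    """Eq. (3): MSE on the enhanced spectra plus a weighted MSE on the visual latents.
    `mu` is a placeholder default; the paper determines it empirically."""
    loss_a = F.mse_loss(y_hat, y)   # audio reconstruction loss
    loss_v = F.mse_loss(z_hat, z)   # visual latent reconstruction loss
    return loss_a + mu * loss_v
```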
In this section, the details of the dataset and the implementation steps of iLAVSE and other SE systems are introduced.

We evaluated the proposed system on the TMSV dataset (https://bio-asplab.citi.sinica.edu.tw/Opensource.html). The dataset contains video recordings of 18 native speakers (13 males and 5 females), each speaking 320 utterances of Mandarin sentences, with the script of the Taiwan Mandarin hearing in noise test [87]. Each sentence has 10 Chinese characters, and the length of each utterance is approximately 2 to 4 seconds. The utterances were recorded in a recording studio with sufficient light, and the speakers were filmed from the front view. The video was recorded at a resolution of 1920 pixels in width.

The recorded speech signals were downsampled to 16 kHz and mixed into monaural waveforms. The speech waveforms were converted into spectrograms with STFT. The window size of STFT was 512, corresponding to 32 milliseconds. The hop length was 320, so the interval between each frame was 20 milliseconds. The audio data was formatted at 50 frames per second and was aligned with the video data. For each speech frame, the log1p magnitude spectrum [82] was extracted, and the value was normalized to zero mean and unit standard deviation. The normalization process was conducted at the utterance level; that is, the mean and standard deviation vectors were calculated on all frames of an utterance. The length of the context window L was 5, i.e., ±5 frames.
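The acoustic feature extraction described above can be sketched as follows, assuming librosa as the STFT tool (the authors' exact pipeline may differ):

```python
import numpy as np
import librosa

def log1p_features(wav_path):
    """Load speech at 16 kHz, take a 512-point STFT with a 320-sample hop (20 ms),
    and return utterance-level normalized log1p magnitude spectra plus the noisy phase."""
    y, _ = librosa.load(wav_path, sr=16000)
    spec = librosa.stft(y, n_fft=512, hop_length=320)
    mag = np.log1p(np.abs(spec))                        # log1p magnitude spectrum
    mean = mag.mean(axis=1, keepdims=True)              # per-frequency statistics over all frames
    std = mag.std(axis=1, keepdims=True)
    return (mag - mean) / (std + 1e-8), np.angle(spec)  # features and phase (kept for ISTFT)
```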
4.2. Experimental Results

4.2.1. AVSE Versus Audio-Only SE
The two audio-only SE systems shown in Fig. 10 were used as the baselines. The results of the audio-only SE (denoted as AOSE) and dual-path audio-only SE (denoted as AOSE(DP)) systems are shown in Table 1. As mentioned earlier, AOSE(DP) has a similar number of model parameters to LAVSE. From the results in Table 1, we note that AOSE and AOSE(DP) yield similar performance in terms of PESQ and STOI. The result suggests that the additional path with extra parameters cannot provide improvements for the audio-only SE system in this task. Table 1 also lists the results of two existing AVSE systems, namely AVDCNN [70] and LAVSE [74]. Compared to AOSE and AOSE(DP), both AVDCNN and LAVSE yield higher PESQ and STOI scores, confirming the effectiveness of incorporating visual data into the SE system.
In this set of experiments, we first examined the ability of iLAVSE to incorporate compressed visual data. Fig. 11 shows a sequence of original lip images. As shown in Fig. 3, the visual data preprocessing is carried out by a CRQ module, which implements three units: Col_img, Res_img, and Qua_img. Then, after the latent representation is extracted by Encoder_AE, Qua_latent further quantizes the bits of the latent representation. In other words, there are four units that compress the visual data. We denote a compression setup as {Col_img; Res_img; Qua_img; Qua_latent} = {A; B; C; D}, where A is either RGB or GRAY (for grayscale), B denotes the image resolution, C indicates the image data quantization, and D stands for the latent feature quantization.

Table 1: Average PESQ and STOI scores of the two audio-only SE systems (AOSE and AOSE(DP)) and the two existing AVSE systems (AVDCNN and LAVSE) over SNRs of -1, -4, -7 and -10 dB. (LAVSE: PESQ 1.374, STOI 0.646.)

Figure 11: Original uncompressed images of lips.

Figure 12: The lip images (input and output of the AE module) with different resolutions in RGB or GRAY: (a)-(f) RGB and GRAY inputs at resolutions 64, 32, and 16; (g)-(l) the corresponding AE outputs.

In Fig. 12, several versions of compressed visual data are presented. In the top row, the three figures from left to right (i.e., Fig. 12a, Fig. 12b, and Fig. 12c) denote the RGB images with resolutions 64, 32, and 16, respectively. In the second row, the three figures from left to right (i.e., Fig. 12d, Fig. 12e, and Fig. 12f) denote the GRAY images with resolutions of 64, 32, and 16. The reconstructed output generated by the autoencoder corresponding to each input is shown in Fig. 12g, Fig. 12h, Fig. 12i, Fig. 12j, Fig. 12k, and Fig. 12l, respectively. Comparing each pair of input and output, we confirmed that the AE can reconstruct the input images well at different resolutions (64, 32, and 16) in either RGB or GRAY.

Then, we evaluated iLAVSE with different types of compressed visual data. The results are listed in Table 2. From the table, we first see that iLAVSE outperforms AOSE(DP) in terms of PESQ and STOI with different compressed visual data. Moreover, compared to LAVSE (the underlined scores), we note that iLAVSE can still achieve comparable performance even though the resolution of the visual data has been notably reduced.
Table 2: The performance of iLAVSE using lip images with reduced channel numbers and resolutions (PESQ and STOI for RGB and GRAY inputs). The underlined scores are the same as those of LAVSE in Table 1 because iLAVSE with the {RGB, 64} setup is equivalent to LAVSE.

For example, the {GRAY, 16} case in Table 2 strikes a good balance between the data compression ratio of 48 ((3 ÷ 1) × ((64 × 64) ÷ (16 × 16))) and the SE performance. Therefore, we use {GRAY, 16} as a representative setup in the following discussion.

Figure 13: AE lip images in 5 bits (1 sign bit and 4 exponential bits): (a) RGB input; (b) RGB output; (c) GRAY input; (d) GRAY output.

Next, we investigated quantized images.
The input and output (reconstructed) images in RGB and GRAY are shown in the left and right columns of Fig. 13, respectively. The original 32-bit images were reduced to 5-bit images (1 sign bit and 4 exponential bits). From the figures, we observe that the AE can reconstruct the quantized image well. We also evaluated iLAVSE with the quantized images. The results are shown in Table 3. The PESQ and STOI scores reveal that when the numerical precision of the input image is reduced to 5 bits (1 sign bit and 4 exponential bits), iLAVSE still maintains satisfactory performance. When the number of bits is further reduced, the PESQ and STOI scores both decrease notably. Compared to LAVSE, which uses raw visual data, the overall compression ratio R_comp of the CRQ module from {RGB, 64, 32bits(i)} to {GRAY, 16, 5bits(i)} is 307.2 times, which is calculated as follows:

$$
\begin{aligned}
R_{comp} &= R_{color} \times R_{res} \times R_{Qua}, \\
R_{color} &= 3 \div 1 = 3, \\
R_{res} &= (64 \times 64) \div (16 \times 16) = 16, \\
R_{Qua} &= 32 \div 5 = 6.4, \\
R_{comp} &= 3 \times 16 \times 6.4 = 307.2.
\end{aligned}
$$

In this set of experiments, we investigated the impact of the bit quantization in the Qua_latent unit on the visual latent representation. We intended to use fewer bits to represent the original 32-bit latent representation. The compressed representation was used as the visual feature input of the AVSE model. In Fig. 6a and Fig. 6b, the latent representations of lip features before and after applying data quantization (from 32 bits to 3 bits) are depicted. As can be seen from the figures, the user identity has been almost completely removed, thereby moderately addressing the privacy problem.
Table 3: The performance of iLAVSE with or without image quantization (the original image is with 32 bits), R: {RGB, 64} and G: {GRAY, 16}. The underlined scores are the same as those of LAVSE.
We further evaluated iLAVSE with latent representation quantization. The number of bits was reduced from 32 to 1, 3, 5, 7, and 9 (1 sign bit and 0, 2, 4, 6, and 8 exponential bits). The results are listed in Table 4. From the table, we can note that for different types of visual input, latent representations with different levels of quantization provide similar performance in terms of PESQ and STOI. For example, when quantizing the latent representation to 3 bits, PESQ = 1.410 and STOI = 0.641 under the condition of {GRAY, 16, 5bits(i)}, which are much better than the performance of AOSE(DP) (PESQ = 1.283 and STOI = 0.610) and comparable to the performance of LAVSE (PESQ = 1.374 and STOI = 0.646).

Table 4: The performance of iLAVSE with or without latent quantization, R: {RGB, 64, 32bits(i)} and G: {GRAY, 16, 5bits(i)} (1 sign bit + 4 exponential bits).

In this set of experiments, we evaluated the SE systems compared in this study at different SNR levels. The AVDCNN system using the original high-quality images is denoted as "AVSE". For LAVSE, we used the {RGB, 64, 32bits(i), 32bits(l)} setup. For iLAVSE, we used {GRAY, 16, 5bits(i), 3bits(l)}, where (i) and (l) denote the quantization unit applied to the images and the latent features, respectively. The PESQ and STOI scores for different SNR levels are shown in Fig. 14.

Figure 14: The performance of different SE systems at different SNR levels: (a) PESQ; (b) STOI. LAVSE: {RGB, 64, 32bits(i), 32bits(l)}; iLAVSE: {GRAY, 16, 5bits(i), 3bits(l)}.
It can be seen from the figure that all four SE systems have higher PESQ and STOI scores than the "Noisy" speech. In addition, the iLAVSE system is always better than the other three SE systems at different SNR levels in terms of PESQ, and maintains satisfactory performance in terms of STOI.

We further examined the spectrogram and waveform of the "Noisy" speech and the enhanced speech provided by AOSE(DP), LAVSE, and iLAVSE. An example under the condition of street noise at -7 dB is shown in Fig. 15. The spectrogram and waveform of the clean speech are also plotted for comparison. From the figure, we see that iLAVSE can suppress the noise components in the noisy speech more effectively than AOSE(DP), thus confirming the effectiveness of using the visual information. In addition, we note that the output plots of iLAVSE and LAVSE are very similar, which suggests that iLAVSE can still provide satisfactory performance even with compressed visual data.

We simulated the audio-visual asynchronization condition by offsetting the visual and audio data streams of each utterance in the time domain. We designed 5 asynchronization conditions, i.e., 5 specific offset ranges (OFR): [-1, 1], [-2, 2], [-3, 3], [-4, 4], and [-5, 5]. For example, for OFR = [-1, 1], the offset range is from -1 to 1. An offset of -1, 0, or 1 frame (each frame = 20 ms) was randomly selected.
Figure 15: The waveforms and spectrograms of an example speech utterance under the con-dition of street noise at -7 dB. P E S Q Test Offset
Figure 16: The PESQ and STOI scores of iLAVSE trained and tested with different audio-visual asynchronous data: (a) PESQ; (b) STOI (systems compared: Noisy, AOSE(DP), and iLAVSE trained with OFR0-OFR5).
We simulated the low-quality visual data condition by applying a low-quality percentage range (LPR) to the visual data. The low-quality percentage (LP) determines the percentage of missing frames in the visual data, and the LPR indicates the range of randomly assigned LPs for each batch. For example, if LPR is set to 10, LP will be randomly selected from 0% to 10%; if LP is set to 4% for a batch with a length of 150 frames, a sequence of 6 (150 × 4%) frames is treated as missing. We set LPRs ∈ {0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100} for training, and set LPs ∈ {0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100} to test the performance on specific percentages of missing visual data. The starting point of the missing visual part was randomly assigned for each batch.

The iLAVSE models trained with the 11 different LPRs are denoted as iLAVSE(LPR_i), where i = 0, ..., 10. The training set of iLAVSE(LPR0) did not contain missing visual data. A larger value of i in LPR_i indicates a more severe low-quality visual data condition. The results are presented in Fig. 17, where the x-axis represents the LP value used for testing. The results in the figure show that without involving low-quality visual data in training (iLAVSE(LPR0)), the performance drops rapidly when visual data loss occurs in the testing data.
NoisyAOSE(DP)iLAVSE(LPR0)iLAVSE(LPR10)iLAVSE(LPR20)iLAVSE(LPR30)iLAVSE(LPR40)iLAVSE(LPR50)iLAVSE(LPR60)iLAVSE(LPR70)iLAVSE(LPR80)iLAVSE(LPR90)iLAVSE(LPR100) (a) PESQ. S T O I Test Low-Quality Percentage
NoisyAOSE(DP)iLAVSE(LPR0)iLAVSE(LPR10)iLAVSE(LPR20)iLAVSE(LPR30)iLAVSE(LPR40)iLAVSE(LPR50)iLAVSE(LPR60)iLAVSE(LPR70)iLAVSE(LPR80)iLAVSE(LPR90)iLAVSE(LPR100) (b) STOI. P E S Q Test Low-Quality Percentage
NoisyAOSE(DP)iLAVSE(LPR0)iLAVSE(LPR10)iLAVSE(LPR20)iLAVSE(LPR30)iLAVSE(LPR40)iLAVSE(LPR50)iLAVSE(LPR60)iLAVSE(LPR70)iLAVSE(LPR80)iLAVSE(LPR90)iLAVSE(LPR100) (c) PESQ zoom in. S T O I Test Low-Quality Percentage
NoisyAOSE(DP)iLAVSE(LPR0)iLAVSE(LPR10)iLAVSE(LPR20)iLAVSE(LPR30)iLAVSE(LPR40)iLAVSE(LPR50)iLAVSE(LPR60)iLAVSE(LPR70)iLAVSE(LPR80)iLAVSE(LPR90)iLAVSE(LPR100) (d) PESQ zoom in.
Figure 17: The PESQ and STOI scores of iLAVSE trained with different LPRs and tested onspecific LP conditions.
The PESQ and STOI scores are even worse than those of Noisy and AOSE(DP). On the other hand, the iLAVSE models trained with low-quality visual data (even with low LPRs) are robust against all LP testing conditions. When the LP of the testing data is very high, the performance of iLAVSE converges to that of AOSE(DP), which shows that the benefit from visual information becomes negligible.
5. Conclusions
The proposed iLAVSE system includes three stages: a data preprocessing stage, a CRNN-based AVSE stage, and a data reconstruction stage. The preprocessing stage uses a CRQ module and an AE module to extract a compact latent representation as the visual input to the AVSE stage. In our experiments, instead of sending the original visual image to the AVSE stage, we can notably reduce the input size to 0.33% (1 ÷ 307.2 × 100%).
This work was supported by the Ministry of Science and Technology [109-2221-E-001-016-, 109-2634-F-008-006-, 109-2218-E-011-010-].
References

[1] A. El-Solh, A. Cuhadar, R. A. Goubran, Evaluation of speech enhancement techniques for speaker identification in noisy environments, in: Proc. ISM 2007.
[2] J. Li, L. Deng, R. Haeb-Umbach, Y. Gong, Robust automatic speech recognition: a bridge to practical applications, Academic Press, 2015.
[3] E. Vincent, T. Virtanen, S. Gannot, Audio source separation and speech enhancement, John Wiley & Sons, 2018.
[4] J. Li, L. Yang, J. Zhang, Y. Yan, Y. Hu, M. Akagi, P. C. Loizou, Comparative intelligibility investigation of single-channel noise-reduction algorithms for Chinese, Japanese, and English, The Journal of the Acoustical Society of America 129 (5) (2011) 3291-3301.
[5] J. Li, S. Sakamoto, S. Hongo, M. Akagi, Y. Suzuki, Two-stage binaural speech enhancement with Wiener filter for high-quality speech communication, Speech Communication 53 (5) (2011) 677-689.
[6] H. Levit, Noise reduction in hearing aids: An overview, J. Rehabil. Res. Develop. 38 (1) (2001) 111-121.
[7] T. Venema, Compression for clinicians, chapter 7, The many faces of compression, Thomson Delmar Learning (2006).
[8] E. W. Healy, M. Delfarah, E. M. Johnson, D. Wang, A deep learning algorithm to increase intelligibility for hearing-impaired listeners in the presence of a competing talker and reverberation, The Journal of the Acoustical Society of America 145 (3) (2019) 1378-1388.
[9] F. Chen, Y. Hu, M. Yuan, Evaluation of noise reduction methods for sentence recognition by Mandarin-speaking cochlear implant listeners, Ear and Hearing 36 (1) (2015) 61-71.
[10] Y.-H. Lai, F. Chen, S.-S. Wang, X. Lu, Y. Tsao, C.-H. Lee, A deep denoising autoencoder approach to improving the intelligibility of vocoded speech in cochlear implant simulation, IEEE Transactions on Biomedical Engineering 64 (7) (2016) 1568-1578.
[11] P. Scalart, et al., Speech enhancement based on a priori signal to noise estimation, in: Proc. ICASSP 1996.
[12] J. Chen, J. Benesty, Y. A. Huang, E. J. Diethorn, Fundamentals of noise reduction, in: Springer Handbook of Speech Processing, Springer, 2008, pp. 843-872.
[13] E. Hänsler, G. Schmidt, Topics in acoustic echo and noise control: selected methods for the cancellation of acoustical echoes, the reduction of background noise, and speech processing, Springer Science & Business Media, 2006.
[14] J. Makhoul, Linear prediction: A tutorial review, Proceedings of the IEEE 63 (4) (1975) 561-580.
[15] T. F. Quatieri, R. J. McAulay, Shape invariant time-scale and pitch modification of speech, IEEE Transactions on Signal Processing 40 (3) (1992) 497-510.
[16] T. Lotter, P. Vary, Speech enhancement by MAP spectral amplitude estimation using a super-Gaussian speech model, EURASIP Journal on Advances in Signal Processing 2005 (7) (2005) 354850.
[17] S. Suhadi, C. Last, T. Fingscheidt, A data-driven approach to a priori SNR estimation, IEEE Transactions on Audio, Speech, and Language Processing 19 (1) (2010) 186-195.
[18] R. McAulay, M. Malpass, Speech enhancement using a soft-decision noise suppression filter, IEEE Transactions on Acoustics, Speech, and Signal Processing 28 (2) (1980) 137-145.
[19] U. Kjems, J. Jensen, Maximum likelihood based noise covariance matrix estimation for multi-microphone speech enhancement, in: Proc. EUSIPCO 2012.
[20] R. Frazier, S. Samsam, L. Braida, A. Oppenheim, Enhancement of speech by adaptive filtering, in: Proc. ICASSP 1976.
[21] B. Atal, M. Schroeder, Predictive coding of speech signals and subjective error criteria, IEEE Transactions on Acoustics, Speech, and Signal Processing 27 (3) (1979) 247-254.
[22] Y. Ephraim, Statistical-model-based speech enhancement systems, Proceedings of the IEEE 80 (10) (1992) 1526-1555.
[23] L. Rabiner, B. Juang, An introduction to hidden Markov models, IEEE ASSP Magazine 3 (1) (1986) 4-16.
[24] Y. Hu, P. C. Loizou, A subspace approach for enhancing speech corrupted by colored noise, in: Proc. ICASSP 2002.
[25] A. Rezayee, S. Gazor, An adaptive KLT approach for speech enhancement, IEEE Transactions on Speech and Audio Processing 9 (2) (2001) 87-95.
[26] J.-C. Wang, Y.-S. Lee, C.-H. Lin, S.-F. Wang, C.-H. Shih, C.-H. Wu, Compressive sensing-based speech enhancement, IEEE/ACM Transactions on Audio, Speech, and Language Processing 24 (11) (2016) 2122-2131.
[27] J. Eggert, E. Korner, Sparse coding and NMF, in: Proc. IJCNN 2004.
[28] Y.-H. Chin, J.-C. Wang, C.-L. Huang, K.-Y. Wang, C.-H. Wu, Speaker identification using discriminative features and sparse representation, IEEE Transactions on Information Forensics and Security 12 (8) (2017) 1979-1987.
[29] N. Mohammadiha, P. Smaragdis, A. Leijon, Supervised and unsupervised speech enhancement using nonnegative matrix factorization, IEEE Transactions on Audio, Speech, and Language Processing 21 (10) (2013) 2140-2151.
[30] E. J. Candès, X. Li, Y. Ma, J. Wright, Robust principal component analysis?, Journal of the ACM 58 (3) (2011) 1-37.
[31] P.-S. Huang, S. D. Chen, P. Smaragdis, M. Hasegawa-Johnson, Singing-voice separation from monaural recordings using robust principal component analysis, in: Proc. ICASSP 2012.
[32] O. Ronneberger, P. Fischer, T. Brox, U-Net: Convolutional networks for biomedical image segmentation, in: Proc. MICCAI 2015.
[33] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: Proc. CVPR 2016.
[34] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, in: Proc. NIPS 2017.
[35] X.-L. Zhang, D. Wang, A deep ensemble learning method for monaural speech separation, IEEE/ACM Transactions on Audio, Speech, and Language Processing 24 (5) (2016) 967-977.
[36] S. Pascual, A. Bonafonte, J. Serrà, SEGAN: Speech enhancement generative adversarial network, in: Proc. Interspeech 2017.
[37] D. Michelsanti, Z.-H. Tan, Conditional generative adversarial networks for speech enhancement and noise-robust speaker verification, in: Proc. Interspeech 2017.
[38] Y. Luo, N. Mesgarani, TasNet: time-domain audio separation network for real-time, single-channel speech separation, in: Proc. ICASSP 2018.
[39] Y. Zhang, Q. Duan, Y. Liao, J. Liu, R. Wu, B. Xie, Research on speech enhancement algorithm based on SA-UNet, in: Proc. ICMCCE 2019.
[40] S. Xu, E. Fosler-Lussier, Spatial and channel attention based convolutional neural networks for modeling noisy speech, in: Proc. ICASSP 2019.
[41] J. Kim, M. El-Khamy, J. Lee, T-GSA: Transformer with Gaussian-weighted self-attention for speech enhancement, in: Proc. ICASSP 2020.
[42] Y. Hu, Y. Liu, S. Lv, M. Xing, S. Zhang, Y. Fu, J. Wu, B. Zhang, L. Xie, DCCRN: Deep complex convolution recurrent network for phase-aware speech enhancement, in: Proc. Interspeech 2020.
[43] C.-H. Yang, J. Qi, P.-Y. Chen, X. Ma, C.-H. Lee, Characterizing speech adversarial examples using self-attention U-Net enhancement, in: Proc. ICASSP 2020.
[44] D. S. Williamson, Y. Wang, D. Wang, Complex ratio masking for monaural speech separation, IEEE/ACM Transactions on Audio, Speech, and Language Processing 24 (3) (2015) 483-492.
[45] D. Wang, J. Chen, Supervised speech separation based on deep learning: An overview, IEEE/ACM Transactions on Audio, Speech, and Language Processing 26 (10) (2018) 1702-1726.
[46] N. Zheng, X.-L. Zhang, Phase-aware speech enhancement based on deep neural networks, IEEE/ACM Transactions on Audio, Speech, and Language Processing 27 (1) (2018) 63-76.
[47] P. Plantinga, D. Bagchi, E. Fosler-Lussier, Phonetic feedback for speech enhancement with and without parallel speech data, in: Proc. ICASSP 2020.
[48] S. Wang, W. Li, S. M. Siniscalchi, C.-H. Lee, A cross-task transfer learning approach to adapting deep speech enhancement models to unseen background noise using paired senone classifiers, in: Proc. ICASSP 2020.
[49] J. Qi, H. Hu, Y. Wang, C.-H. H. Yang, S. M. Siniscalchi, C.-H. Lee, Exploring deep hybrid tensor-to-vector network architectures for regression based speech enhancement, in: Proc. Interspeech 2020.
[50] G. Carbajal, R. Serizel, E. Vincent, E. Humbert, Joint NN-supported multichannel reduction of acoustic echo, reverberation and noise (2020).
[51] X. Lu, Y. Tsao, S. Matsuda, C. Hori, Speech enhancement based on deep denoising autoencoder, in: Proc. Interspeech 2013.
[52] B. Xia, C. Bao, Wiener filtering based speech enhancement with weighted denoising auto-encoder and noise classification, Speech Communication 60 (2014) 13-29.
[53] D. Liu, P. Smaragdis, M. Kim, Experiments on deep learning for speech denoising, in: Proc. Interspeech 2014.
[54] Y. Xu, J. Du, L.-R. Dai, C.-H. Lee, A regression approach to speech enhancement based on deep neural networks, IEEE/ACM Transactions on Audio, Speech and Language Processing 23 (1) (2015) 7-19.
[55] M. Kolbæk, Z.-H. Tan, J. Jensen, Speech intelligibility potential of general and specialized deep neural network based speech enhancement systems, IEEE/ACM Transactions on Audio, Speech and Language Processing 25 (1) (2017) 153-167.
[56] S.-W. Fu, T.-Y. Hu, Y. Tsao, X. Lu, Complex spectrogram enhancement by convolutional neural network with multi-metrics learning, in: Proc. MLSP 2017.
[57] A. Pandey, D. Wang, A new framework for CNN-based speech enhancement in the time domain, IEEE/ACM Transactions on Audio, Speech, and Language Processing 27 (7) (2019) 1179-1188.
[58] P. Campolucci, A. Uncini, F. Piazza, B. D. Rao, On-line learning algorithms for locally recurrent neural networks, IEEE Transactions on Neural Networks 10 (2) (1999) 253-271.
[59] F. Weninger, F. Eyben, B. Schuller, Single-channel speech separation with memory-enhanced recurrent neural networks, in: Proc. ICASSP 2014.
[60] H. Erdogan, J. R. Hershey, S. Watanabe, J. Le Roux, Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks, in: Proc. ICASSP 2015.
[61] Z. Chen, S. Watanabe, H. Erdogan, J. R. Hershey, Speech enhancement and recognition using multi-task learning of long short-term memory recurrent neural networks, in: Proc. Interspeech 2015.
[62] F. Weninger, H. Erdogan, S. Watanabe, E. Vincent, J. Le Roux, J. R. Hershey, B. Schuller, Speech enhancement with LSTM recurrent neural networks and its application to noise-robust ASR, in: Proc. LVA/ICA 2015.
[63] L. Sun, J. Du, L.-R. Dai, C.-H. Lee, Multiple-target deep learning for LSTM-RNN based speech enhancement, in: Proc. HSCMA 2017.
[64] K. Kinoshita, M. Delcroix, A. Ogawa, T. Nakatani, Text-informed speech enhancement with deep neural networks, in: Proc. Interspeech 2015.
[65] C. Yu, K.-H. Hung, S.-S. Wang, Y. Tsao, J.-w. Hung, Time-domain multi-modal bone/air conducted speech enhancement, IEEE Signal Processing Letters 27 (2020) 1035-1039.
[66] J. Wu, Y. Xu, S.-X. Zhang, L.-W. Chen, M. Yu, L. Xie, D. Yu, Time domain audio visual speech separation, in: Proc. ASRU 2019.
[67] D. Michelsanti, Z.-H. Tan, S. Sigurdsson, J. Jensen, Deep-learning-based audio-visual speech enhancement in presence of Lombard effect, Speech Communication 115 (2019) 38-50.
[68] M. L. Iuzzolino, K. Koishida, AV(SE)2: Audio-visual squeeze-excite speech enhancement, in: Proc. ICASSP 2020.
[69] R. Gu, S.-X. Zhang, Y. Xu, L. Chen, Y. Zou, D. Yu, Multi-modal multi-channel target speech separation, IEEE Journal of Selected Topics in Signal Processing 14 (3) (2020) 530-541.
[70] J.-C. Hou, S.-S. Wang, Y.-H. Lai, Y. Tsao, H.-W. Chang, H.-M. Wang, Audio-visual speech enhancement using multimodal deep convolutional neural networks, IEEE Transactions on Emerging Topics in Computational Intelligence 2 (2) (2018) 117-128.
[71] E. Ideli, B. Sharpe, I. V. Bajić, R. G. Vaughan, Visually assisted time-domain speech enhancement, in: Proc. GlobalSIP 2019.
[72] A. Adeel, M. Gogate, A. Hussain, W. M. Whitmer, Lip-reading driven deep learning approach for speech enhancement, IEEE Transactions on Emerging Topics in Computational Intelligence (2019) 1-10.
[73] D. Michelsanti, Z.-H. Tan, S.-X. Zhang, Y. Xu, M. Yu, D. Yu, J. Jensen, An overview of deep-learning-based audio-visual speech enhancement and separation, arXiv preprint arXiv:2008.09586 (2020).
[74] S.-Y. Chuang, Y. Tsao, C.-C. Lo, H.-M. Wang, Lite audio-visual speech enhancement, in: Proc. Interspeech 2020.
[75] S. Wu, G. Li, F. Chen, L. Shi, Training and inference with integers in deep neural networks.
[76] Y.-T. Hsu, Y.-C. Lin, S.-W. Fu, Y. Tsao, T.-W. Kuo, A study on speech enhancement using exponent-only floating point quantized neural network (EOFP-QNN), in: Proc. SLT 2018.
[77] J.-C. Hou, S.-S. Wang, Y.-H. Lai, J.-C. Lin, Y. Tsao, H.-W. Chang, H.-M. Wang, Audio-visual speech enhancement using deep neural networks, in: Proc. APSIPA 2016.
[78] A. Gabbay, A. Shamir, S. Peleg, Visual speech enhancement, in: Proc. Interspeech 2018.
[79] A. Ephrat, I. Mosseri, O. Lang, T. Dekel, K. Wilson, A. Hassidim, W. T. Freeman, M. Rubinstein, Looking to listen at the cocktail party: a speaker-independent audio-visual model for speech separation, ACM Transactions on Graphics 37 (4) (2018) 1-11.
[80] M. Sadeghi, S. Leglaive, X. Alameda-Pineda, L. Girin, R. Horaud, Audio-visual speech enhancement using conditional variational auto-encoders, IEEE/ACM Transactions on Audio, Speech, and Language Processing 28 (2020) 1788-1800.
[81] Institute of Electrical and Electronics Engineers, IEEE Standard for Binary Floating-point Arithmetic, ANSI/IEEE Std 754-1985, 1985.
[82] Y.-J. Lu, C.-F. Liao, X. Lu, J.-w. Hung, Y. Tsao, Incorporating broad phonetic information for speech enhancement, in: Proc. Interspeech 2020.
[83] A. W. Rix, J. G. Beerends, M. P. Hollier, A. P. Hekstra, Perceptual evaluation of speech quality (PESQ) - a new method for speech quality assessment of telephone networks and codecs, in: Proc. ICASSP 2001.
[84] C. H. Taal, R. C. Hendriks, R. Heusdens, J. Jensen, An algorithm for intelligibility prediction of time-frequency weighted noisy speech, IEEE Transactions on Audio, Speech, and Language Processing 19 (7) (2011) 2125-2136.
[85] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al., PyTorch: An imperative style, high-performance deep learning library, in: Proc. NIPS 2019.
[86] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, in: Proc. ICLR 2015.
[87] M. Huang, Development of Taiwan Mandarin hearing in noise test, Department of Speech Language Pathology and Audiology, National Taipei University of Nursing and Health Science (2005).
[88] G. Hu, 100 nonspeech environmental sounds, available: http://web.cse.ohio-state.edu/pnl/corpus/HuNonspeech/HuCorpus.html