SEANet: A Multi-modal Speech Enhancement Network
Marco Tagliasacchi, Yunpeng Li, Karolis Misiunas, Dominik Roblek
Google Research
{mtagliasacchi, yunpeng, kmisiunas, droblek}@google.com

Abstract
We explore the possibility of leveraging accelerometer data to perform speech enhancement in very noisy conditions. Although it is possible to only partially reconstruct the user's speech from the accelerometer, the latter provides a strong conditioning signal that is not influenced by noise sources in the environment. Based on this observation, we feed a multi-modal input to SEANet (Sound EnhAncement Network), a wave-to-wave fully convolutional model, which adopts a combination of feature losses and adversarial losses to reconstruct an enhanced version of the user's speech. We trained our model with data collected by sensors mounted on an earbud and synthetically corrupted by adding different kinds of noise sources to the audio signal. Our experimental results demonstrate that it is possible to achieve very high quality results, even in the case of interfering speech at the same level of loudness. A sample of the output produced by our model is available at https://google-research.github.io/seanet/multimodal/speech.
Index Terms: speech denoising, multimodal, accelerometers.
1. Introduction
Enhancing the quality of speech is of paramount importance in digital communications. Speech degradation can occur for various reasons, e.g., from the interference of background noise, which can also contain overlapping speakers, to the effect of reverberations caused by room acoustics, to the artifacts introduced by compression and network impairments. This has motivated a very rich literature on speech enhancement and denoising. Traditional signal processing methods adopt spectral noise subtraction [1, 2], spectral masking [3, 4], statistical methods based on Wiener filtering [5] and Bayesian estimators [6, 7]. These methods make different assumptions about the underlying noise model (e.g., known signal-to-noise ratio (SNR), stationary noise, limited noise types, etc.), and are therefore unable to cope with the challenging noisy conditions that emerge when systems are deployed "in the wild".

In recent years, data-driven methods based on deep model architectures have emerged. Early works include methods based on denoising auto-encoders [8] and recurrent models [9]. More recently, deep architectures have been adopted to improve speech enhancement based on spectral masking [10]. Alternatively, generative models based on GANs [11, 12] and WaveNet [13] have been proposed. Speech denoising can also be seen as a special case of source separation, in which one of the sources represents the speech signal of interest [14, 15, 16, 17]. Our work belongs to the family of multi-modal models, which leverage additional conditioning signals to enhance the target speech. For example, the work in [18] uses tight crops of mouth images to denoise speech. This approach was later extended by Looking2Listen [19], which uses visual information from facial crops segmented from videos to disentangle different speakers that talk simultaneously. A similar approach is presented in [20], which adopts an attention mechanism to weight the contribution of the audio and visual modalities. Multi-modal cues can also be exploited for voice activity detection [21].

In this paper we consider the problem of multi-modal speech denoising. Instead of leveraging video as an additional modality, we consider data collected with a bone-conductance accelerometer mounted in an earbud, which operates synchronously with the microphone but at a lower sampling frequency. The sensor captures the local vibrations induced by the voice of the speaker, while being relatively insensitive to external sources. Hence, it can be used as a conditioning signal to enhance the user's speech and suppress noise. The fact that inertial measurement sensors mounted in mobile devices can be sensitive to speech has been recognized in the past literature. For example, gyroscope signals were used to recognize speech in [22], while [23] reconstructs speech from accelerometer-sensed reverberations induced by smartphone loudspeakers. The work in [24] combines signals from a microphone and a bone sensor using a Gaussian mixture model on the high-resolution log spectra of each sensor. Similarly, multi-modal inputs are combined in [25] using deep denoising autoencoders that reconstruct Mel-scale features fed to an ASR system. An ad-hoc speech recovery stage is needed to reconstruct the time-domain denoised waveforms.

The proposed multi-modal SEANet (Speech EnhAncement Network) model receives two waveforms, one acquired with a microphone and one with an accelerometer, and produces as output a denoised speech waveform.
The model is fully convolutional and maps waveforms to waveforms, without resorting to explicit time-frequency representations like the short-time Fourier Transform (STFT) or mel spectrograms. To train the model, we adopt a combination of adversarial and reconstruction losses inspired by the recent MelGAN model [26], which synthesizes waveforms from mel spectrograms. The adversarial losses induce the model to produce output waveforms that a discriminator cannot distinguish from clean speech. The reconstruction losses operate in the feature space defined by the discriminator and preserve speech content while suppressing noise.

In our experiments we consider challenging scenarios in which the target speech signal is mixed with that of other speakers, or with different kinds of background noise sampled from Freesound [27]. We demonstrate that by leveraging the conditioning signal collected by the accelerometer, it is possible to denoise speech even in very adverse conditions. We collected a dataset that contains speech and the corresponding accelerometer readings and observed an improvement in scale-invariant signal-to-distortion ratio (SI-SDRi) of 9.6dB when the interferer is mixed with a unit gain.
2. Method
The proposed SEANet model is trained in a fully supervised fashion using pairs ⟨(x_m, x_a), y_m⟩, where x_m denotes the input noisy speech collected by the microphone, x_a the accelerometer signal used as conditioning, and y_m the target audio signal corresponding to clean speech. Note that x_a might have one or more channels, depending on the number of accelerometer axes used. We assume that x_m, x_a and y_m are time-aligned and available at the same sampling rate. Since the sampling rate of accelerometers is typically lower than that of the microphone, the former signal is interpolated before being fed to the model.

Figure 1: SEANet model overview. A noisy speech signal, obtained by superimposing clean speech with a noise source, is fed to the multi-modal encoder together with the accelerometer signal. Spectrograms are shown only for illustration purposes, as they are not explicitly computed by the proposed wave-to-wave model.

The model architecture consists of a UNet generator G(x_m, x_a), which takes as its input an audio waveform x_m and one or more accelerometer readings x_a time-aligned with the audio. In Figure 1 we illustrate the case in which a single accelerometer axis is used. The generator produces as output a single-channel waveform ŷ_m, which represents the denoised speech. The discriminator is asked to determine whether its input comes from the distribution of clean speech, or from the output of the generator.

Model architecture: Our UNet generator is a symmetric encoder-decoder network with skip-connections. The decoder adopts the same architecture as the generator in [26], while the encoder mirrors the decoder in its layout. A skip-connection is added between each encoder block and its mirrored decoder block. The outermost skip-connection carries only the speech channel needed by the output. The encoder and the decoder each have four blocks stacked together, which are sandwiched between two plain convolution layers. The encoder follows a down-sampling scheme of (2, 2, 8, 8), while the decoder up-samples in the reverse order. The number of channels is doubled whenever down-sampling and halved whenever up-sampling. Each decoder block consists of an up-sampling layer, in the form of a transposed 1D convolution, followed by three residual units containing 1D convolutions with dilation rates of 1, 3, and 9, respectively. The encoder block mirrors the decoder block, and consists of the same residual units followed by a strided 1D convolution for down-sampling. The overall structure of the generator is illustrated in Figure 2.

For the discriminator, we use the same multi-resolution convolutional architecture as [26]. Three structurally identical discriminators are applied to the input audio at different resolutions: original, 2x down-sampled, and 4x down-sampled. Each discriminator consists of an initial plain convolution followed by four grouped convolutions [28], each of which has a group size of 4, a down-sampling factor of 4, and a channel multiplier of 4, up to a maximum of 1024 output channels. They are followed by two more plain convolution layers that produce the final output, i.e., the logits. Note that since the discriminator is fully convolutional, the number of logits in the output is larger than one and proportional to the length of the input audio. Each logit judges the plausibility of the segment of the input that corresponds to its receptive field. We refer interested readers to [26] for more architectural details. We use weight normalization [29] and ELU activation [30] in the generator, while layer normalization and Leaky ReLU activation [31] with α = 0. are used in the discriminator.
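As an illustrative sketch of these blocks (not the original implementation), the following PyTorch-style code follows the kernel sizes reported in Figure 2 (k=3 and k=1 inside residual units, k=2S for the strided and transposed convolutions); the framework choice, padding, and exact placement of the ELU activations are assumptions made for this example:

```python
import torch.nn as nn
from torch.nn.utils import weight_norm


class ResidualUnit(nn.Module):
    """Dilated Conv1d (k=3) -> ELU -> pointwise Conv1d (k=1), with a residual connection."""
    def __init__(self, channels, dilation):
        super().__init__()
        self.block = nn.Sequential(
            nn.ELU(),
            weight_norm(nn.Conv1d(channels, channels, kernel_size=3,
                                  dilation=dilation, padding=dilation)),
            nn.ELU(),
            weight_norm(nn.Conv1d(channels, channels, kernel_size=1)),
        )

    def forward(self, x):
        return x + self.block(x)


class EncoderBlock(nn.Module):
    """Three residual units (dilations 1, 3, 9) followed by a strided Conv1d for down-sampling."""
    def __init__(self, out_channels, stride):
        super().__init__()
        self.layers = nn.Sequential(
            ResidualUnit(out_channels // 2, dilation=1),
            ResidualUnit(out_channels // 2, dilation=3),
            ResidualUnit(out_channels // 2, dilation=9),
            nn.ELU(),
            weight_norm(nn.Conv1d(out_channels // 2, out_channels,
                                  kernel_size=2 * stride, stride=stride,
                                  padding=stride // 2)),
        )

    def forward(self, x):
        return self.layers(x)


class DecoderBlock(nn.Module):
    """Transposed Conv1d for up-sampling followed by three residual units (dilations 1, 3, 9)."""
    def __init__(self, out_channels, stride):
        super().__init__()
        self.layers = nn.Sequential(
            nn.ELU(),
            weight_norm(nn.ConvTranspose1d(2 * out_channels, out_channels,
                                           kernel_size=2 * stride, stride=stride,
                                           padding=stride // 2)),
            ResidualUnit(out_channels, dilation=1),
            ResidualUnit(out_channels, dilation=3),
            ResidualUnit(out_channels, dilation=9),
        )

    def forward(self, x):
        return self.layers(x)
```

A full generator would stack four such blocks per side with strides (2, 2, 8, 8), doubling or halving the channel count at each resampling step, and add the skip-connections described above.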
Loss functions: SEANet combines adversarial and reconstruction losses to train simultaneously the generator and the discriminators. The adversarial loss is a hinge loss averaged over multiple resolutions and over time. More formally, let k ∈ {1, ..., K} index the individual discriminators for different resolutions, and t index over the length of the output, i.e., the number of logits T_k, of discriminator k. The discriminator loss can be written as

L_D = E_{y_m}\Big[\frac{1}{K}\sum_{k,t}\frac{1}{T_k}\max(0, 1 - D_{k,t}(y_m))\Big] + E_{(x_m,x_a)}\Big[\frac{1}{K}\sum_{k,t}\frac{1}{T_k}\max(0, 1 + D_{k,t}(G(x_m,x_a)))\Big],   (1)

while the adversarial loss for the generator is

L_G^{adv} = E_{(x_m,x_a)}\Big[\frac{1}{K}\sum_{k,t}\frac{1}{T_k}\max(0, 1 - D_{k,t}(G(x_m,x_a)))\Big].   (2)

For the reconstruction loss we use the "feature" loss proposed in [26], namely the normalized L1 distance between the discriminator internal layer outputs for the generated audio and those for the corresponding target audio:

L_G^{rec} = E_x\Big[\frac{1}{K L}\sum_{k,l}\frac{\|D_k^{(l)}(y_m) - D_k^{(l)}(G(x_m,x_a))\|_1}{T_{k,l}}\Big],   (3)

where x ≜ ⟨(x_m, x_a), y_m⟩ denotes a training example, L is the number of internal layers, D_k^{(l)} for l ∈ {1, ..., L} is the output of layer l of discriminator k, and T_{k,l} is the length of the feature layer D_k^{(l)}. Compared with per-sample losses, such as the average L1 distance between waveforms, the feature loss tends to be less sensitive to small misalignments. The overall generator loss is a weighted sum of the adversarial and the reconstruction losses, i.e.,

L_G = L_G^{adv} + \lambda \cdot L_G^{rec}.   (4)

Figure 2: Generator architecture.

For all our experiments, we fix the weight λ of the reconstruction loss and use a discriminator with K = 3 scales. We train with the Adam optimizer, with a batch size of 16 and a constant learning rate of 0.0001, with β_1 = 0. and β_2 = 0. . We train for 200k iterations (2M iterations when training on Librispeech) on a single GPU. We evaluate results using the last checkpoint of each training run. No parameter tuning or early stopping was performed.
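As an illustrative reading of Eqs. (1)-(4) (a sketch, not the original training code), the snippet below computes the multi-resolution hinge and feature losses for one batch; it assumes each discriminator returns its internal feature maps together with its logits, and the exact normalization of the feature distance is approximated:

```python
import torch.nn.functional as F


def discriminator_loss(discriminators, y_m, y_hat):
    """Hinge loss of Eq. (1), averaged over the K resolutions and over time."""
    loss = 0.0
    for d in discriminators:                 # K multi-resolution discriminators
        _, real_logits = d(y_m)              # one logit per receptive field
        _, fake_logits = d(y_hat.detach())   # stop gradients into the generator
        loss = loss + F.relu(1.0 - real_logits).mean() + F.relu(1.0 + fake_logits).mean()
    return loss / len(discriminators)


def generator_loss(discriminators, y_m, y_hat, lambda_rec):
    """Adversarial term of Eq. (2) plus the feature loss of Eq. (3), combined as in Eq. (4)."""
    adv, rec = 0.0, 0.0
    for d in discriminators:
        real_feats, _ = d(y_m)
        fake_feats, fake_logits = d(y_hat)
        adv = adv + F.relu(1.0 - fake_logits).mean()
        # L1 distance between internal discriminator features of real and generated audio.
        rec = rec + sum(F.l1_loss(ff, fr)
                        for ff, fr in zip(fake_feats, real_feats)) / len(real_feats)
    k = len(discriminators)
    return adv / k + lambda_rec * (rec / k)
```

Training would then alternate Adam updates of the discriminators and of the generator, with the batch size and learning rate stated above.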
3. Experiments
Datasets: We collected an in-house dataset with sensors mounted on an earbud, since a dataset with these characteristics is not available in the literature. The microphone sampled audio waveforms at 16kHz, while the 2-axis accelerometer operated at 4kHz. We selected one of the two axes and interpolated the accelerometer signal at 16kHz before feeding it to the model. We then applied high-pass filtering with a cut-off of 20Hz to all signals and normalized the amplitudes, dividing all samples by a fixed multiple of a quantile of the signal and clipping the result to the [−1, +1] range. This is necessary to deal with isolated spikes which were present in the raw output of the accelerometer. We asked 25 subjects to speak while wearing one earbud in a relatively quiet office environment. In total we collected ∼ .

Figure 3: Power spectral density: microphone vs. accelerometer.
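A minimal NumPy/SciPy sketch of this preprocessing is given below; the normalization constant and quantile are placeholders chosen for the example, since the exact values are not reported here:

```python
import numpy as np
from scipy import signal

FS_MIC, FS_ACC = 16_000, 4_000


def preprocess(mic, acc, norm_factor=4.0, quantile=0.95):
    """Align sampling rates, remove low-frequency drift, and normalize amplitudes.
    norm_factor and quantile are illustrative placeholders, not the paper's values."""
    # Interpolate the accelerometer channel to the microphone rate (4 kHz -> 16 kHz).
    acc = signal.resample_poly(acc, up=FS_MIC // FS_ACC, down=1)

    # High-pass filter both signals with a 20 Hz cut-off.
    sos = signal.butter(4, 20.0, btype="highpass", fs=FS_MIC, output="sos")
    mic, acc = signal.sosfilt(sos, mic), signal.sosfilt(sos, acc)

    # Divide by a scaled quantile of the absolute amplitude and clip to [-1, +1]
    # to suppress isolated spikes in the raw accelerometer output.
    def normalize(x):
        scale = norm_factor * np.quantile(np.abs(x), quantile) + 1e-8
        return np.clip(x / scale, -1.0, 1.0)

    return normalize(mic), normalize(acc)
```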
To explore the quality potentially achievable if we had access to more data, we created a synthetically generated multi-modal dataset. First, we trained a variant of SEANet which learns to map audio waveforms to the corresponding accelerometer waveform, using the in-house dataset described above. This model uses the same architecture described in Section 2, with the only difference that it receives one input channel with clean audio and produces one output channel with the corresponding accelerometer signal. Note that learning this mapping is much easier than reconstructing audio samples from the accelerometer signal alone. Then, we fed audio samples from Librispeech [32] to synthesize the corresponding accelerometer signal. In this case, we followed the canonical split provided by Librispeech, using train-clean-100 for training and test-clean for testing.

To generate the noisy input x_m, we mix the clean microphone recording y_m with other noise sources. We consider two scenarios: i) mixed speech, in which an utterance from a different speaker is mixed with the clean source; ii) mixed noise, in which we mix with samples taken at random from Freesound [27], to mimic a wide and diverse range of noise sources, with a unit mixing gain. In one of the experiments, we also limit the bandwidth of the accelerometer to simulate a sensor operating at lower sampling rates. In this case we use the following downsampling factors {16, 20, 32, 40, 50, 64, 80, 100}, corresponding to sampling frequencies of {1000, 800, 500, 400, 320, 250, 200, 160} Hz. We also report results of an audio-only SEANet model, in which the accelerometer input is not used.

Metrics and baselines: In order to evaluate the quality of the enhanced speech, we measure the scale-invariant signal-to-distortion ratio (SI-SDR), which accommodates an amplitude gain mismatch between the estimated signal ŷ_m and the ground-truth clean reference signal y_m. The SI-SDR is computed as described in [17] (see the sketch below). We evaluated models recently proposed in the speech enhancement and separation literature, which receive as input only the audio signal. It is worth noting that a direct comparison with these methods is not meaningful, as SEANet receives as input an additional conditioning signal. However, this evaluation is useful to gauge the level of complexity of the dataset, highlighting the added value of leveraging the accelerometer signal. Namely, we include in our evaluation iTDCN++ [17] and Wavesplit [33]. The iTDCN++ model is inspired by Conv-TasNet and predicts a mask, with a sigmoid activation, that is applied to the mixture STFT coefficients. Wavesplit infers and clusters representations of each speaker and then estimates each source signal conditioned on the inferred representations.
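A minimal sketch of the mixture generation and of the SI-SDR computation of [17] is given below; applying the gain to the interferer, rather than to the clean signal, is one possible reading of the mixing setup:

```python
import numpy as np


def mix(clean, interferer, gain_db=0.0):
    """Create the noisy input x_m = y_m + g * interferer; gain_db = 0 gives a unit mixing gain."""
    g = 10.0 ** (gain_db / 20.0)
    return clean + g * interferer[: len(clean)]


def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant SDR: project the estimate onto the reference before computing the
    SDR, which removes any overall amplitude mismatch between the two signals."""
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference
    residual = estimate - target
    return 10.0 * np.log10((np.sum(target ** 2) + eps) / (np.sum(residual ** 2) + eps))


def si_sdr_improvement(enhanced, noisy, reference):
    """SI-SDRi: improvement of the enhanced signal over the unprocessed noisy input."""
    return si_sdr(enhanced, reference) - si_sdr(noisy, reference)
```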
Table 1: Mean SI-SDRi for the in-house dataset, per split and averaged, for SEANet with audio + accelerometer input and for the audio-only variant, in the mixed noise and mixed speech scenarios.

Table 2: Mean SI-SDRi for Librispeech (test split), for SEANet with audio + accelerometer input and for the audio-only variant, in the mixed noise and mixed speech scenarios.

Results: Table 1 reports the results obtained repeating five replicas on each of the five splits for the two scenarios. The average SI-SDRi is 8.9dB when mixing with background noise from Freesound and 9.6dB when mixing with speech. Note that the variability across replicas is small, while there is a more significant variability across splits. We repeated the experiment by changing the gain used during mixing and observed that the SI-SDRi varies between 3.7dB (6.2dB), at 10dB mixing gain, and 15.0dB (15.1dB), at −10dB mixing gain, for mixed noise (mixed speech). Table 1 also includes results when SEANet is trained using audio only. In the mixed noise scenario, the model is still able to enhance speech, although attaining a lower SI-SDRi (7.9dB vs. 8.9dB). Conversely, in the mixed speech scenario the audio-only variant of SEANet is unable to separate the speakers. This is not surprising, since the model as described in this paper does not include a permutation invariant loss, which is needed to separate sources of the same kind. Using audio only, iTDCN++ attains 7.5dB on mixed noise (trained on synthetically reverberated Libri-Light speech + synthetically reverberated Freesound) and 4.2dB on mixed speech (trained on synthetically reverberated Libri-Light speech mixtures), while Wavesplit attains 8.8dB on mixed speech (trained on Librispeech mixtures, with no reverberation). This demonstrates the inherent difficulty of the in-house dataset and the fact that the availability of the conditioning signal makes the denoising problem significantly easier, especially in the scenario with mixed speech.

We also evaluate a model trained on Librispeech with synthetically generated accelerometer signals. Table 2 shows that this model achieves an SI-SDRi of 12.4dB on both mixed noise and mixed speech, thus hinting at the fact that better accuracy can be attained using a larger dataset during training. Examples of the denoised results produced by SEANet are publicly available at the following page: https://github.com/google-research/seanet/multimodal/speech.

Figure 4: Improvement in SI-SDR for different accelerometer sampling rates (each point represents one replica). (a) In-house dataset. (b) Librispeech.

We investigated the contribution of the conditioning provided by the accelerometer. To this end, we progressively decimated the accelerometer signal before feeding it to our model during both training and evaluation. Figure 4a shows an interesting result. In the scenario with two overlapping speakers, a rapid decrease in SI-SDRi is observed when the sampling rate drops below 400Hz, and our model is unable to separate the speakers when the sampling rate is smaller than 200Hz. Conversely, for the scenario with background noise, only a small decrease in SI-SDRi is observed, even when the sampling rate of the accelerometer is drastically reduced. The average SI-SDRi across the splits drops from 8.9dB to 8.0dB. We can argue that this is a simpler scenario, given the distinct acoustic characteristics of the background noise. These results are confirmed when training and evaluating on the multi-modal dataset generated from Librispeech, as illustrated in Figure 4b. In this case the average SI-SDRi drops from 12.4dB to 9.8dB.
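The band-limiting used in this ablation can be simulated by decimating the accelerometer channel and interpolating it back to the model input rate, e.g. with the following sketch (one possible construction, not necessarily the exact pipeline used):

```python
from scipy import signal


def bandlimit_accelerometer(acc, factor, fs=16_000):
    """Simulate a lower-rate sensor: decimate by `factor` (e.g. 16 ... 100, i.e. effective
    rates from 1000 Hz down to 160 Hz) and interpolate back to `fs` so the model input
    keeps its original length and sampling rate."""
    low_rate = signal.resample_poly(acc, up=1, down=factor)   # anti-aliased decimation
    return signal.resample_poly(low_rate, up=factor, down=1)  # back to the original rate
```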
4. Conclusions
In this paper we show that the accelerometer data collected from sensors mounted on earbuds provides a strong conditioning signal for speech denoising. This is especially useful in the challenging scenario with overlapping speakers. In our future work we plan to expand the multi-modal aspect of SEANet by exploring how to combine multiple microphone signals, accelerometer axes and visual cues.
5. Acknowledgements
We would like to thank Kevin Wilson, Scott Wisdom, John Hershey, Dick Lyon and Neil Zeghidour for their help with and feedback on this work. We also thank Alina Mihaela Stan for the help collecting the in-house dataset.

6. References

[1] M. Berouti, R. Schwartz, and J. Makhoul, "Enhancement of speech corrupted by acoustic noise," in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 4, 1979, pp. 208–211.
[2] S. Kamath and P. Loizou, "A multi-band spectral subtraction method for enhancing speech corrupted by colored noise," in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 4, 2002.
[3] A. M. Reddy and B. Raj, "Soft mask methods for single-channel speaker separation," IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 6, pp. 1766–1776, 2007.
[4] E. M. Grais and H. Erdogan, "Single channel speech music separation using nonnegative matrix factorization and spectral masks," in International Conference on Digital Signal Processing (DSP), 2011, pp. 1–6.
[5] P. Scalart and J. V. Filho, "Speech enhancement based on a priori signal to noise estimation," in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 2, 1996, pp. 629–632.
[6] H. Attias, J. C. Platt, A. Acero, and L. Deng, "Speech denoising and dereverberation using probabilistic models," in Advances in Neural Information Processing Systems 13, 2001, pp. 758–764.
[7] P. C. Loizou, "Speech enhancement based on perceptually motivated Bayesian estimators of the magnitude spectrum," IEEE Transactions on Speech and Audio Processing, vol. 13, no. 5, pp. 857–869, 2005.
[8] X. Feng, Y. Zhang, and J. Glass, "Speech feature denoising and dereverberation via deep autoencoders for noisy reverberant speech recognition," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014, pp. 1759–1763.
[9] F. Weninger, H. Erdogan, S. Watanabe, E. Vincent, J. Le Roux, J. R. Hershey, and B. Schuller, "Speech enhancement with LSTM recurrent neural networks and its application to noise-robust ASR," in Latent Variable Analysis and Signal Separation, E. Vincent, A. Yeredor, Z. Koldovský, and P. Tichavský, Eds. Cham: Springer International Publishing, 2015, pp. 91–99.
[10] S. Wisdom, J. R. Hershey, K. Wilson, J. Thorpe, M. Chinen, B. Patton, and R. A. Saurous, "Differentiable consistency constraints for improved deep speech enhancement," in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2019, pp. 900–904.
[11] S. Pascual, A. Bonafonte, and J. Serrà, "SEGAN: Speech enhancement generative adversarial network," in INTERSPEECH, 2017, pp. 3642–3646.
[12] C. Donahue, B. Li, and R. Prabhavalkar, "Exploring speech enhancement with generative adversarial networks for robust speech recognition," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 5024–5028.
[13] D. Rethage, J. Pons, and X. Serra, "A wavenet for speech denoising," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 5069–5073.
[14] Y. Luo and N. Mesgarani, "TaSNet: Time-domain audio separation network for real-time, single-channel speech separation," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 696–700.
[15] M. Kolbæk, D. Yu, Z. Tan, and J. Jensen, "Multitalker speech separation with utterance-level permutation invariant training of deep recurrent neural networks," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 10, pp. 1901–1913, 2017.
[16] J. L. Roux, G. Wichern, S. Watanabe, A. M. Sarroff, and J. R. Hershey, "The Phasebook: Building complex masks via discrete representations for source separation," in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). IEEE, 2019, pp. 66–70.
[17] I. Kavalerov, S. Wisdom, H. Erdogan, B. Patton, K. W. Wilson, J. L. Roux, and J. R. Hershey, "Universal sound separation," in IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), New Paltz, NY, USA, October 20-23, 2019. IEEE, 2019, pp. 175–179.
[18] J. Hou, S. Wang, Y. Lai, Y. Tsao, H. Chang, and H. Wang, "Audio-visual speech enhancement using multimodal deep convolutional neural networks," IEEE Transactions on Emerging Topics in Computational Intelligence, vol. 2, no. 2, pp. 117–128, 2018.
[19] A. Ephrat, I. Mosseri, O. Lang, T. Dekel, K. Wilson, A. Hassidim, W. T. Freeman, and M. Rubinstein, "Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation," ACM Trans. Graph., vol. 37, no. 4, Jul. 2018. [Online]. Available: https://doi.org/10.1145/3197517.3201357
[20] T. Ochiai, M. Delcroix, K. Kinoshita, A. Ogawa, and T. Nakatani, "Multimodal speakerbeam: Single channel target speech extraction with audio-visual speaker clues," in INTERSPEECH, 2019, pp. 2718–2722.
[21] I. Ariav and I. Cohen, "An end-to-end multimodal voice activity detection using wavenet encoder and residual networks," IEEE Journal of Selected Topics in Signal Processing, vol. 13, no. 2, pp. 265–274, 2019.
[22] Y. Michalevsky, D. Boneh, and G. Nakibly, "Gyrophone: recognizing speech from gyroscope signals," in Proceedings of the 23rd USENIX Conference on Security Symposium, 2014, pp. 1053–1067.
[23] S. A. Anand, C. Wang, J. Liu, N. Saxena, and Y. Chen, "Spearphone: A speech privacy exploit via accelerometer-sensed reverberations from smartphone loudspeakers," CoRR, vol. abs/1907.05972, 2019. [Online]. Available: http://arxiv.org/abs/1907.05972
[24] J. R. Hershey, T. T. Kristjansson, and Z. Zhang, "Model-based fusion of bone and air sensors for speech enhancement and robust speech recognition," in ISCA Tutorial and Research Workshop on Statistical and Perceptual Audio Processing, ICC, Jeju, Korea, October 3, 2004. ISCA, 2004, p. 139.
[25] H. Liu, Y. Tsao, and C. Fuh, "Bone-conducted speech enhancement using deep denoising autoencoder," Speech Commun., vol. 104, pp. 106–112, 2018. [Online]. Available: https://doi.org/10.1016/j.specom.2018.06.002
[26] K. Kumar, R. Kumar, T. de Boissiere, L. Gestin, W. Z. Teoh, J. Sotelo, A. de Brebisson, Y. Bengio, and A. Courville, "MelGAN: Generative adversarial networks for conditional waveform synthesis," in Advances in Neural Information Processing Systems, 2019.
[27] F. Font, G. Roma, and X. Serra, "Freesound technical demo," in ACM International Conference on Multimedia (MM'13), Barcelona, Spain, 2013, pp. 411–412.
[28] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[29] T. Salimans and D. P. Kingma, "Weight normalization: A simple reparameterization to accelerate training of deep neural networks," in Advances in Neural Information Processing Systems, 2016, pp. 901–909.
[30] D.-A. Clevert, T. Unterthiner, and S. Hochreiter, "Fast and accurate deep network learning by exponential linear units (ELUs)," in International Conference on Learning Representations, 2016.
[31] A. L. Maas, A. Y. Hannun, and A. Y. Ng, "Rectifier nonlinearities improve neural network acoustic models," in ICML Workshop on Deep Learning for Audio, Speech and Language Processing, 2013.
[32] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "Librispeech: An ASR corpus based on public domain audio books," in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2015, pp. 5206–5210.
[33] N. Zeghidour and D. Grangier, "Wavesplit: End-to-end speech separation by speaker clustering,"