Speech enhancement with weakly labelled data from AudioSet
Qiuqiang Kong, Haohe Liu, Xingjian Du, Li Chen, Rui Xia, Yuxuan Wang
ByteDance, Shanghai, China
{kongqiuqiang, liuhaohe.0379, duxingjian.real, chenli.cloud, rui.xia, wangyuxuan.11}@bytedance.com

ABSTRACT
Speech enhancement is a task to improve the intelligibility and perceptual quality of degraded speech signals. Recently, neural network based methods have been applied to speech enhancement. However, many neural network based methods require paired noisy and clean speech for training. We propose a speech enhancement framework that can be trained with the large-scale weakly labelled AudioSet dataset. Weakly labelled data only contain the audio tags of audio clips, but not the onset or offset times of speech. We first apply pretrained audio neural networks (PANNs) to detect anchor segments that contain speech or sound events in audio clips. Then, we randomly mix two detected anchor segments, one containing speech and one containing sound events, to create a mixture, and build a conditional source separation network that uses PANNs predictions as soft conditions for speech enhancement. In inference, we input a noisy speech signal with the one-hot encoding of "Speech" as a condition to the trained system to predict the enhanced speech. Our system achieves a PESQ of 2.28 and an SSNR of 8.75 dB on the VoiceBank-DEMAND dataset, outperforming the previous SEGAN system, which achieved 2.16 and 7.73 dB respectively.
Index Terms— Speech enhancement, weakly labelled data, AudioSet.
1. INTRODUCTION
Speech enhancement (SE) is a task to improve the intelligibility and perceptual quality of degraded speech signals. Speech enhancement has many applications in daily life, such as teleconferencing, mobile phone calls, automatic speech recognition and hearing aids [1]. Early works on speech enhancement applied signal processing methods such as the minimum mean-square error short-time spectral amplitude estimator [2] and non-negative matrix factorization (NMF) [3]. Those conventional methods perform well under stationary noise, but have limited performance under non-stationary noise or in low signal-to-noise ratio (SNR) environments. Recently, neural network based methods have been proposed for speech enhancement, such as denoising autoencoders [4], fully connected neural networks [5], recurrent neural networks (RNNs) [6], convolutional neural networks (CNNs) [7, 8], time-domain CNNs [9, 10, 11] and generative adversarial networks (GANs) [12, 13]. Those neural network based speech enhancement methods require clean speech and background noise for training. Recently, universal source separation systems [14, 15] have been proposed for source separation without clean training data.

However, previous neural network based speech enhancement methods require clean speech and background noise for training, and collecting clean speech and background noise can be difficult and time consuming. For example, background noises recorded in laboratories [16] can be different from real-world sounds, and it is difficult to collect a large-scale dataset covering a wide range of sounds in our world. In addition, speech datasets such as TIMIT [17] and VoiceBank [18] contain speech with neutral emotion, while speech in real life can carry various emotions. Recently, the large-scale AudioSet [19] dataset, containing hundreds of different sound classes from YouTube, was released; it provides a larger variety of sounds than previous speech and noise datasets. However, the difficulty of using AudioSet for speech enhancement is that audio clips in AudioSet are weakly labelled. That is, each audio clip is only labelled with the presence or absence of sound events, without their onset and offset times. Also, AudioSet does not indicate which clips contain clean speech, and speech is usually mixed with other sound events.

In this article, we propose a speech enhancement framework trained with weakly labelled data. First, we apply pretrained audio neural networks (PANNs) [20] to select 2-second anchor segments that are most likely to contain speech or sound events in an audio clip. One contribution of this work is an anchor segment mining algorithm that better selects anchor segments for creating mixtures. Two randomly selected anchor segments are mixed to constitute an input mixture. Then, a convolutional UNet [21] is used to predict the waveforms of the individual anchor segments. We extend the loss function calculated on spectrograms [22] to a loss function calculated in the waveform domain. For the speech enhancement task, we evaluate metrics including PESQ and CSIG that were not discussed in [22].

This paper is organized as follows: Section 2 introduces our speech enhancement system trained with weakly labelled data. Section 3 shows the experimental results. Section 4 concludes this work.
2. SPEECH ENHANCEMENT WITH WEAKLY LABELLED DATA

2.1. Neural Network Based Speech Enhancement
Recently, neural network based methods have been applied to speech enhancement and have outperformed conventional speech enhancement methods [5]. Neural network based speech enhancement methods require pairs of noisy speech and clean speech for training. We denote a noisy speech as $x \in \mathbb{R}^L$ and its corresponding clean speech as $s \in \mathbb{R}^L$, where $L$ is the number of samples in an audio clip. Then, a neural network learns a mapping $f: x \mapsto s$, where $f$ can be modelled by a neural network with learnable parameters, such as fully connected neural networks [5], RNNs [6], CNNs [7, 8] and time-domain CNNs [9, 10, 11]. We denote the enhanced speech as $\hat{s} = f(x)$. In training, the parameters of $f$ can be optimized by minimizing a loss function $l(\hat{s}, s)$, such as a mean absolute error (MAE) loss:

$l_{\text{MAE}} = \lVert \hat{s} - s \rVert_1$,        (1)

where $\lVert \cdot \rVert_1$ is an $l_1$ norm. In inference, the enhanced speech $\hat{s}$ can be calculated by $\hat{s} = f(x)$. However, one disadvantage of the above neural network based speech enhancement methods is that noisy and clean speech pairs are required for training, which can be difficult and time consuming to obtain. To address this problem, we propose a speech enhancement framework that can be trained with weakly labelled data, that is, trained from audio clips containing noisy speech.

2.2. Speech Enhancement with Weakly Labelled Data

Our speech enhancement system is trained with the large-scale weakly labelled AudioSet [19] dataset containing 527 kinds of sound classes. Most audio clips have durations of 10 seconds. AudioSet is weakly labelled; that is, each audio clip is only labelled with tags, without the onset and offset times of sound events. Also, AudioSet does not indicate clean speech, and speech is usually mixed with other sounds under unknown SNR. Previous work has investigated general source separation with weakly labelled data [22]. Our improvement over [22] is a novel anchor segment mining algorithm described in Section 2.5. To begin with, we denote two anchor segments containing different sound events as $s_1$ and $s_2$ respectively. The anchor segments $s_1$ and $s_2$ are selected from two audio clips as the parts most likely to contain speech or sound events, and are selected to have disjoint audio tags as described in Section 2.5. In training, we build a neural network to learn a mapping:

$f(s_1 + s_2, c) \mapsto s_1$,        (2)

where $c \in [0, 1]^K$ is a conditional vector that controls what source to separate, and $K$ is the number of sound classes in AudioSet. In training, there is no need for $s_1$ or $s_2$ to be clean. The conditional vector $c$ is the audio tagging probability calculated on $s_1$. For example, if $s_1$ contains both "Speech" and "Water", then when conditioning on the audio tagging probability $c$, the system (2) will separate both "Speech" and "Water". In inference, the enhanced speech $\hat{s}$ can be obtained by inputting a noisy speech $x$ and setting the conditional vector $c$ to the one-hot encoding of "Speech":

$\hat{s} = f(x, c)$.        (3)

That is, the training of the speech enhancement system described in (2) does not require clean speech, yet we can still obtain clean speech from noisy speech using the trained system in (3).
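The following is a minimal PyTorch sketch of this training and inference procedure. It is an illustration under our own assumptions, not the authors' implementation: a toy one-dimensional convolutional model stands in for the conditional UNet of Section 2.6, random tensors stand in for real anchor segments and PANNs tagging probabilities, and the "Speech" class index is hypothetical.

# Minimal sketch of the conditional training objective (1)-(2) and inference (3).
import torch
import torch.nn as nn

K = 527          # number of AudioSet sound classes
L = 2 * 32000    # 2-second anchor segments at 32 kHz

class ToyConditionalSeparator(nn.Module):
    """Tiny 1-D convolutional separator with FiLM-like conditioning: the
    condition vector c is projected and added to the feature maps as a bias."""
    def __init__(self, channels=16):
        super().__init__()
        self.conv1 = nn.Conv1d(1, channels, kernel_size=9, padding=4)
        self.conv2 = nn.Conv1d(channels, 1, kernel_size=9, padding=4)
        self.film = nn.Linear(K, channels)   # condition -> per-channel bias

    def forward(self, mixture, c):
        h = torch.relu(self.conv1(mixture.unsqueeze(1)))
        h = h + self.film(c).unsqueeze(-1)   # inject the conditional vector
        return self.conv2(h).squeeze(1)      # estimated waveform of s1

model = ToyConditionalSeparator()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# One training step on a fake mini-batch (stand-ins for anchor segments).
s1, s2 = torch.randn(4, L), torch.randn(4, L)    # anchor segments (eq. 2)
c = torch.rand(4, K)                             # tagging probabilities of s1
loss = (model(s1 + s2, c) - s1).abs().mean()     # waveform MAE loss (eq. 1)
optimizer.zero_grad()
loss.backward()
optimizer.step()

# Inference (eq. 3): condition on the one-hot encoding of "Speech".
speech_index = 0                                 # hypothetical class index
c_speech = torch.zeros(1, K)
c_speech[0, speech_index] = 1.0
enhanced = model(torch.randn(1, L), c_speech)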
2.3. Anchor Segments

The anchor segments $s_1$ and $s_2$ are 2-second segments used to constitute a mixture as input. To begin with, we randomly select two sound classes from AudioSet. For each sound class, we randomly select an audio clip from AudioSet. However, there is no information on when the sound classes occur within the audio clips. Therefore, we apply a sound event detection (SED) system [20] to predict the frame-wise presence probability of the sound class. The SED system is the DecisionLevelMax system from PANNs [20], which uses the log mel spectrogram as the input feature and a 14-layer CNN as the classification model. Each convolutional layer has a kernel size of 3 × 3. The convolutional layers are followed by a time-distributed fully connected layer with $K$ outputs to predict the frame-wise presence probabilities of sound classes. The frame-wise predictions are max-pooled along the time axis to obtain clip-wise predictions. We denote the weak labels of an audio clip as $y \in \{0, 1\}^K$ and its clip-wise prediction as $\hat{y} \in [0, 1]^K$. The SED system is trained by minimizing the binary cross-entropy loss [20] between the predicted and target weak label tags:

$\text{loss} = -\sum_{k=1}^{K} \left[ y_k \ln \hat{y}_k + (1 - y_k) \ln (1 - \hat{y}_k) \right]$.        (4)

The first row of Fig. 1 shows the log mel spectrogram of a 10-second audio clip from AudioSet containing "Speech" and other sound classes. The second row shows the frame-wise SED prediction of "Speech". We select the anchor segment $s_1$ that is most likely to contain the selected sound event, as shown by the red block in Fig. 1. Similarly, we select the anchor segment $s_2$ from another audio clip. Then, we mix $s_1 + s_2$ as input to (2).

Fig. 1. Top: log mel spectrogram of a 10-second audio clip from AudioSet; Middle: predicted SED probability of "Speech", where the red block shows the selected anchor segment; Bottom: predicted audio tagging probabilities of the anchor segment.

2.4. Conditional Vector

The conditional vector $c$ controls what sources to separate from $s_1 + s_2$. However, there is no ground truth label of $s_1$, so $c$ is unknown. In addition, there can be multiple sound events in $s_1$. We apply an audio tagging system to $s_1$, $c = g_{\text{AT}}(s_1)$, to estimate the conditional vector $c$. The audio tagging system $g_{\text{AT}}$ is the 14-layer CNN of PANNs [20]. The 14-layer CNN consists of several convolutional layers; then, global average and max pooling are applied to summarize the feature maps into a fixed-dimension embedding vector; finally, a fully connected layer is applied to the embedding vector to predict the presence probabilities of sound events. The audio tagging system is trained with the binary cross-entropy loss described in (4). The advantage of using the audio tagging prediction rather than the one-hot encoding of labels to build $c$ is that $g_{\text{AT}}(s_1)$ provides a better estimate of the sound event probabilities in $s_1$ than the labels do. From top to bottom, Fig. 1 shows the log mel spectrogram of a 10-second audio clip, the SED result of the audio clip, and the audio tagging probabilities of the selected anchor segment. For example, the predominant sound events in $s_1$ are "Music" and "Speech"; other sound classes in $s_1$ include "Basketball" and "Slam".
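As a concrete illustration of this selection step, below is a minimal NumPy sketch. It assumes the frame-wise and clip-wise probabilities have already been produced by SED and audio tagging models such as PANNs; the 10 ms frame hop and the fixed threshold of 0.3 are illustrative assumptions, whereas the paper derives per-class thresholds from PANNs (Section 2.5).

# Sketch of anchor segment selection from frame-wise SED probabilities, and of
# thresholding clip-wise tagging probabilities into a set of predicted tags.
import numpy as np

def select_anchor_segment(framewise_prob, frames_per_second=100, seg_seconds=2.0):
    """Return (start_frame, end_frame) of the window with the highest mean
    presence probability for the selected sound class."""
    seg_frames = int(seg_seconds * frames_per_second)
    # Mean probability of every candidate window via a moving average.
    window_means = np.convolve(framewise_prob,
                               np.ones(seg_frames) / seg_frames, mode="valid")
    start = int(np.argmax(window_means))
    return start, start + seg_frames

def predict_tags(clipwise_prob, thresholds):
    """Predicted tag set: classes whose probability exceeds their threshold."""
    return set(np.flatnonzero(clipwise_prob >= thresholds))

# Toy usage with random numbers standing in for model outputs.
rng = np.random.default_rng(0)
framewise_speech_prob = rng.random(1000)   # 10 s of frame-wise "Speech" probability
start, end = select_anchor_segment(framewise_speech_prob)
tags = predict_tags(rng.random(527), thresholds=np.full(527, 0.3))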
2.5. Anchor Segment Mining

Previous work [22] proposed to select $s_1$ and $s_2$ randomly from AudioSet. However, if $s_1$ and $s_2$ contain mutual sound classes, the separation result of (2) can be incorrect. For example, if both $s_1$ and $s_2$ contain clean "Speech" and the conditional vector $c$ is the one-hot encoding of "Speech", the system in (2) will learn to separate "Speech" only from $s_1$, but not from $s_2$. For a speech enhancement system, we aim to separate all "Speech" from both $s_1$ and $s_2$. To address this problem, we propose an anchor segment mining method that selects $s_1$ and $s_2$ to have disjoint conditional vectors. In training, we randomly select $B$ anchor segments to constitute a mini-batch $\{s_1, ..., s_B\}$, where $B$ is the mini-batch size. Then, we calculate the conditional vectors $\{c_1, ..., c_B\}$ by $g_{\text{AT}}(\cdot)$. For each conditional vector $c_b$, we apply thresholds to predict its present tags $r_b$, where the thresholds are calculated from PANNs [20] to give equal precision and recall for each sound class. Then, we use the mining procedure described in Algorithm 1 to select pairs of anchor segments with disjoint predicted tags from the mini-batch to constitute $s_1$ and $s_2$.

Algorithm 1 Anchor segment mining.
Input: mini-batch of anchor segments $S = \{s_1, ..., s_B\}$ and their predicted tags $R = \{r_1, ..., r_B\}$.
for $r_i \in R$ do
    for $r_j \in R$ do
        if $r_i \cap r_j = \emptyset$ then
            Collect the anchor segments of $r_i$ and $r_j$ to constitute $s_1$ and $s_2$.
            Remove $r_i$ and $r_j$ from $R$.
        end if
    end for
end for

2.6. Conditional Source Separation System

We apply convolutional UNets [21, 22] on the spectrogram of the mixture to build the separation system. To begin with, the waveform of a mixture is transformed into a spectrogram. A UNet consists of an encoder and a decoder. The encoder consists of 12 convolutional layers with kernel sizes of 3 × 3 to extract high-level representations. Downsampling layers with sizes of 2 × 2 are applied after every two convolutional layers. The decoder is symmetric to the encoder, with 12 convolutional layers; transposed convolutional layers are used to upsample the feature maps after every two convolutional layers. Shortcut connections are added between encoder and decoder layers of the same hierarchy. In each convolutional layer, the conditional vector $c$ is multiplied by a learnable matrix and added to the feature maps as biases. This bias information controls what sound events to separate from a mixture. The decoder outputs a spectrogram mask with values between 0 and 1, which is multiplied with the mixture spectrogram to obtain the separated spectrogram of $s_1$. Then, an inverse short-time Fourier transform (ISTFT) is applied to the separated spectrogram, using the phase of the mixture, to obtain $\hat{s}_1$. The separation system is trained by minimizing the loss function (1).

3. EXPERIMENTS

Our speech enhancement system is trained on the balanced subset of the weakly labelled AudioSet [19], containing 20,550 audio clips covering 527 sound classes. The audio clips have durations of 10 seconds. Audio clips are weakly labelled, and there can be multiple sound events in an audio clip. There are 5,251 audio clips containing "Speech". To begin with, we resample all audio clips to 32 kHz to be consistent with the configuration of PANNs [20]. The sound event detection and audio tagging systems from PANNs are used to select anchor segments as described in Section 2.5. To build the separation system, we extract spectrograms of mixtures using the short-time Fourier transform (STFT) with a window size of 1024 and a hop size of 320. All anchor segments have durations of 2 seconds. We set the mini-batch size to 24. The Adam optimizer [23] is used for training. We trained the system for 1 million iterations on a single Tesla V100-SXM2-32GB GPU in one week.
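To make this feature extraction concrete, below is a small sketch under our own assumptions: librosa is used as the STFT implementation (the paper does not specify one), and random waveforms stand in for real anchor segments. It reproduces the sample rate, window size and hop size stated above and shows how the mixture phase is reused for the ISTFT.

# Sketch of mixture spectrogram extraction: two 2-second anchor segments at
# 32 kHz are summed and converted to a spectrogram (window 1024, hop 320).
import numpy as np
import librosa

SAMPLE_RATE = 32000
SEGMENT_SECONDS = 2
WINDOW_SIZE = 1024
HOP_SIZE = 320

rng = np.random.default_rng(0)
s1 = rng.standard_normal(SAMPLE_RATE * SEGMENT_SECONDS).astype(np.float32)
s2 = rng.standard_normal(SAMPLE_RATE * SEGMENT_SECONDS).astype(np.float32)
mixture = s1 + s2                      # input to the separation system (eq. 2)

# Complex STFT and magnitude spectrogram of the mixture.
stft = librosa.stft(mixture, n_fft=WINDOW_SIZE, hop_length=HOP_SIZE)
magnitude, phase = np.abs(stft), np.angle(stft)
print(magnitude.shape)                 # (513, 201): frequency bins x frames

# The separation network predicts a mask on the magnitude; reusing the mixture
# phase, an ISTFT recovers the separated waveform (a dummy all-ones mask here).
mask = np.ones_like(magnitude)
separated = librosa.istft(mask * magnitude * np.exp(1j * phase),
                          hop_length=HOP_SIZE, win_length=WINDOW_SIZE)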
We evaluate our proposed speech enhancement system directly on the test set of the VoiceBank [18] and DEMAND [16] datasets, without training on them. There are 824 paired noisy and clean speech utterances for testing in VoiceBank-DEMAND. Each audio clip has a sample rate of 48 kHz. The noisy speech has four SNR settings of 15, 10, 5 and 0 dB. There are 10 types of noise, including 2 types of synthetic noise and 8 types of noise from DEMAND, and 28 speakers from VoiceBank. The major difference between our speech enhancement method and previous works is that we do not use the training data from VoiceBank-DEMAND, and directly evaluate our speech enhancement system on the test clips.

Following previous works on speech enhancement [24, 12, 25], we apply the perceptual evaluation of speech quality (PESQ) [26], the mean opinion score (MOS) predictor of signal distortion (CSIG), the MOS predictor of background-noise intrusiveness (CBAK), the MOS predictor of overall signal quality (COVL) [27] and the segmental signal-to-noise ratio (SSNR) [28] to evaluate speech enhancement performance. Table 1 shows that noisy speech without enhancement achieves PESQ, CSIG, CBAK, COVL and SSNR of 1.97, 3.35, 2.44, 2.63 and 1.68 dB respectively. Our proposed speech enhancement system achieves a PESQ of 2.28, outperforming the Wiener [24] and SEGAN [12] systems. Our system achieves a CBAK of 2.96 and an SSNR of 8.75 dB, outperforming the Wiener system (2.68 and 5.07 dB) and the SEGAN system (2.94 and 7.73 dB), indicating the effectiveness of training speech enhancement with weakly labelled data. On the other hand, our system achieves a CSIG of 2.43 and a COVL of 2.30, lower than the other systems, indicating that our speech enhancement may lose details of speech, especially the high-frequency components shown in Fig. 2. The left and right columns of Fig. 2 visualize two speech enhancement examples of our proposed system; from top to bottom, the rows show the log mel spectrograms of the noisy speech, the target clean speech and the enhanced speech respectively. Considering that our system is trained with weakly labelled data only and does not use any training data from VoiceBank-DEMAND, these results show that training a speech enhancement system from weakly labelled data is possible. We provide our speech enhancement demos in the following links.
Fig. 2. The left and right columns show two examples of speech enhancement. Top: log mel spectrogram of noisy speech; Middle: ground truth clean speech; Bottom: enhanced speech.
Table 1. Speech enhancement results.

                 PESQ   CSIG   CBAK   COVL   SSNR (dB)
Noisy            1.97   3.35   2.44   2.63   1.68
Wiener [24]      2.22   3.23   2.68   2.67   5.07
SEGAN [12]       2.16   3.48   2.94   2.80   7.73
Wave-U-Net [25]  2.40   3.52   3.24   2.96   9.97
Proposed         2.28   2.43   2.96   2.30   8.75
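For reproducibility, the sketch below shows one way the PESQ and SSNR numbers could be computed for a single test utterance. It is our own illustration, not the authors' evaluation code: it assumes the open-source pesq package, and the 16 kHz resampling, frame length and [-10, 35] dB clipping range follow common practice rather than values specified in the paper.

# Hedged sketch of per-utterance PESQ and segmental SNR evaluation.
import numpy as np
import librosa
from pesq import pesq   # pip install pesq

def segmental_snr(ref, est, frame_len=512, eps=1e-10):
    """Mean per-frame SNR in dB, clipped to the conventional [-10, 35] dB range."""
    snrs = []
    for start in range(0, len(ref) - frame_len + 1, frame_len):
        r = ref[start:start + frame_len]
        e = est[start:start + frame_len]
        snr = 10 * np.log10(np.sum(r ** 2) / (np.sum((r - e) ** 2) + eps) + eps)
        snrs.append(np.clip(snr, -10.0, 35.0))
    return float(np.mean(snrs))

def evaluate(clean_48k, enhanced_48k, sr=48000):
    # Wideband PESQ is defined at 16 kHz, so both signals are resampled first.
    clean = librosa.resample(clean_48k, orig_sr=sr, target_sr=16000)
    enhanced = librosa.resample(enhanced_48k, orig_sr=sr, target_sr=16000)
    return {"PESQ": pesq(16000, clean, enhanced, "wb"),
            "SSNR": segmental_snr(clean, enhanced)}

# Usage: pass the clean reference and the enhanced output of one test utterance
# (both as 1-D float arrays at 48 kHz) to evaluate().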
4. CONCLUSION
In this work, we propose a speech enhancement system trained with weakly labelled data from AudioSet. Our system does not require clean speech or background noise to train the speech enhancement system. We propose to use the sound event detection and audio tagging systems from pretrained audio neural networks (PANNs), together with an anchor segment mining algorithm, to select anchor segments. We build conditional UNet source separation systems for speech enhancement. Our proposed system outperforms the Wiener and SEGAN systems on the VoiceBank-DEMAND dataset in the PESQ, CBAK and SSNR metrics without using any training data from VoiceBank-DEMAND. In the future, we will continue to investigate general source separation with weakly labelled data.

5. REFERENCES

[1] P. C. Loizou, Speech Enhancement: Theory and Practice, CRC Press, 2013.

[2] Y. Ephraim and D. Malah, "Speech enhancement using a minimum mean-square error short-time spectral amplitude estimator," IEEE/ACM Transactions on Acoustics, Speech, and Signal Processing (TASLP), vol. 32, no. 6, pp. 1109–1121, 1984.
[3] N. Mohammadiha, P. Smaragdis, and A. Leijon, "Supervised and unsupervised speech enhancement using nonnegative matrix factorization," IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP), vol. 21, no. 10, pp. 2140–2151, 2013.

[4] X. Lu, Y. Tsao, S. Matsuda, and C. Hori, "Speech enhancement based on deep denoising autoencoder," in INTERSPEECH, 2013, pp. 436–440.

[5] Y. Xu, J. Du, L. Dai, and C. Lee, "A regression approach to speech enhancement based on deep neural networks," IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP), vol. 23, no. 1, pp. 7–19, 2014.

[6] F. Weninger, H. Erdogan, S. Watanabe, E. Vincent, J. Le Roux, J. R. Hershey, and B. Schuller, "Speech enhancement with LSTM recurrent neural networks and its application to noise-robust ASR," in International Conference on Latent Variable Analysis and Signal Separation (LVA/ICA), Springer, 2015, pp. 91–99.

[7] S. R. Park and J. Lee, "A fully convolutional neural network for speech enhancement," in INTERSPEECH, 2017.

[8] S. Fu, Y. Tsao, and X. Lu, "SNR-aware convolutional neural network modeling for speech enhancement," in INTERSPEECH, 2016, pp. 3768–3772.

[9] Y. Luo and N. Mesgarani, "Conv-TasNet: Surpassing ideal time–frequency magnitude masking for speech separation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 8, pp. 1256–1266, 2019.

[10] D. Rethage, J. Pons, and X. Serra, "A WaveNet for speech denoising," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 5069–5073.

[11] A. Pandey and D. Wang, "TCNN: Temporal convolutional neural network for real-time speech enhancement in the time domain," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 6875–6879.

[12] S. Pascual, A. Bonafonte, and J. Serra, "SEGAN: Speech enhancement generative adversarial network," in INTERSPEECH, 2017.

[13] C. Donahue, B. Li, and R. Prabhavalkar, "Exploring speech enhancement with generative adversarial networks for robust speech recognition," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 5024–5028.

[14] S. Wisdom, E. Tzinis, H. Erdogan, R. J. Weiss, K. Wilson, and J. R. Hershey, "Unsupervised sound separation using mixture invariant training," in Conference on Neural Information Processing Systems (NeurIPS), 2020.

[15] S. Wisdom, E. Tzinis, H. Erdogan, R. J. Weiss, K. Wilson, and J. R. Hershey, "Unsupervised speech separation using mixtures of mixtures," in Workshop on International Conference on Machine Learning (ICML), 2020.

[16] J. Thiemann, N. Ito, and E. Vincent, "DEMAND: A collection of multi-channel recordings of acoustic noise in diverse environments," in Proceedings of Meetings on Acoustics, 2013.

[17] J. S. Garofolo, "TIMIT acoustic phonetic continuous speech corpus," Linguistic Data Consortium, 1993.

[18] C. Veaux, J. Yamagishi, and S. King, "The Voice Bank corpus: Design, collection and data analysis of a large regional accent speech database," in International Conference Oriental COCOSDA with Conference on Asian Spoken Language Research and Evaluation (O-COCOSDA/CASLRE), 2013.

[19] J. F. Gemmeke, D. P. W. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, "Audio Set: An ontology and human-labeled dataset for audio events," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 776–780.

[20] Q. Kong, Y. Cao, T. Iqbal, Y. Wang, W. Wang, and M. D. Plumbley, "PANNs: Large-scale pretrained audio neural networks for audio pattern recognition," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 2880–2894, 2020.

[21] A. Jansson, E. Humphrey, N. Montecchio, R. Bittner, A. Kumar, and T. Weyde, "Singing voice separation with deep U-Net convolutional networks," in International Society for Music Information Retrieval (ISMIR), 2017, pp. 745–751.

[22] Q. Kong, Y. Wang, X. Song, Y. Cao, W. Wang, and M. D. Plumbley, "Source separation with weakly labelled data: An approach to computational auditory scene analysis," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 101–105.

[23] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in International Conference on Learning Representations (ICLR), 2015.

[24] P. Scalart and J. V. Filho, "Speech enhancement based on a priori signal to noise estimation," in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 1996, vol. 2, pp. 629–632.

[25] C. Macartney and T. Weyde, "Improved speech enhancement with the Wave-U-Net," arXiv preprint arXiv:1811.11307, 2018.

[26] ITU-T Recommendation P.862, "Perceptual evaluation of speech quality (PESQ): An objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs," 2001.

[27] Y. Hu and P. C. Loizou, "Evaluation of objective quality measures for speech enhancement," IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP), vol. 16, no. 1, pp. 229–238, 2007.

[28] S. R. Quackenbush, T. P. Barnwell, and M. A. Clements, Objective Measures of Speech Quality, Prentice-Hall, 1988.