High Fidelity Speech Regeneration with Application to Speech Enhancement
Adam Polyak, Lior Wolf, Yossi Adi, Ori Kabeli, Yaniv Taigman
Facebook AI Research; The School of Computer Science, Tel Aviv University

* The contribution of Adam Polyak is part of a Ph.D. thesis research conducted at Tel Aviv University.
ABSTRACT
Speech enhancement has seen great improvement in recent years, mainly through contributions in denoising, speaker separation, and dereverberation methods that mostly deal with environmental effects on vocal audio. To enhance speech beyond the limitations of the original signal, we take a regeneration approach, in which we recreate the speech from its essence, including the semi-recognized speech, prosody features, and identity. We propose a wav-to-wav generative model for speech that can generate 24kHz speech in a real-time manner and which utilizes a compact speech representation, composed of ASR and identity features, to achieve a higher level of intelligibility. Inspired by voice conversion methods, we train to augment the speech characteristics while preserving the identity of the source using an auxiliary identity network. Perceptual acoustic metrics and subjective tests show that the method obtains valuable improvements over recent baselines.
Index Terms — speech enhancement, audio generation
1. INTRODUCTION
Speech is the primary means of human communication. The importance of enhancing speech audio for better communication and collaboration has increased substantially amid the COVID-19 pandemic due to the need for physical distancing.

In the domain of speech enhancement, denoising and dereverberation methods have received much of the attention. The vast majority of these methods deal with environmental effects and train masking filters in order to remove unwanted sources, while assuming that the existing vocals are intelligible enough. However, in common settings where the recorded speech comes from a low-fidelity microphone or a poorly treated acoustic space, such methods struggle to reconstruct a clear-sounding natural voice that resembles a voice recorded in a professional studio.

Speech recognition and generation have seen remarkable progress in recent years, mainly due to advances in the robustness of neural-based Automatic Speech Recognition (ASR) and neural vocoders. We utilize these advances and introduce a speech regeneration pipeline, in which speech is encoded at the semantic level through ASR, and a speech synthesizer is used to produce an output that is not only cleaner than the input, but also has better perceptual metrics.

Our main contributions are: (i) we present a novel generative model that utilizes ASR and identity information in order to recreate speech in high fidelity through comprehension; (ii) we present quantitative and subjective evaluations in the application of speech enhancement; and (iii) we provide engineering details on how to implement our method efficiently so it can be used in real-time communication. Samples can be found at https://speech-regeneration.github.io.
2. RELATED WORK
Speech enhancement and speech dereverberation have been widely explored over the years [1, 2]. Due to the success of deep networks, there has been a growing interest in deep learning-based methods for both speech enhancement and dereverberation, working either in the frequency domain or directly on the raw waveform [3, 4, 5]. Deep generative models such as Generative Adversarial Networks (GANs) or WaveNet were also suggested for the task [6, 7, 8].

A different way to improve speech quality is by using speech Bandwidth Extension (BWE) algorithms. In BWE, one is interested in increasing the sampling rate of a given speech utterance. Early attempts were based on Gaussian Mixture Models and Hidden Markov Models [9, 10], and, more recently, deep learning methods were suggested for the task [11, 12].

Despite the success of the BWE methods, these were mainly applied to lower sampling rates (e.g., 8kHz, 16kHz). Recently, several studies suggested Audio Super Resolution algorithms for upsampling to higher sample rates (e.g., 22.5kHz, 44.1kHz). The authors in [13] introduce an end-to-end GAN-based system for speech bandwidth extension for use in downstream automatic speech recognition. The authors in [14] suggest using a WaveNet model to directly output a high-sample-rate speech signal, while the authors in [15] suggest using GANs to estimate the mel-spectrogram and then apply a vocoder to generate the enhanced waveform.

Unlike previous BWE methods, which mainly use generative models for a sequence-to-sequence mapping in the waveform or spectrum domains, our model is conditioned on several high-level features. The proposed model utilizes a speech synthesis model [16], an ASR model [17], a pitch extraction model [18], and a loudness feature.

Features extracted from an ASR network were utilized in [19] for the task of voice conversion. In addition to differences that arise from the different nature of the task, there are multiple technical differences between the approaches: the generator of [19] is autoregressive, the previous method did not include an identity network and modelled new speakers inaccurately, and lastly the loss terms used to optimize the models were different.
3. SPEECH REGENERATION
Our regeneration pipeline is an encoder-decoder network. First, the raw input speech is passed through a background removal method that masks out non-vocal audio. The output is then passed through several subnetworks to generate disentangled speech representations. The output of these subnetworks, together with the spectral features, conditions the speech generative decoder network, which synthesizes the final output. The architecture is depicted in Figure 1.
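To make the data flow concrete, the following is a minimal sketch of the inference pipeline described above, written in PyTorch. The module names, tensor shapes, noise dimension, and the use of linear interpolation for the temporal upsampling are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn.functional as F

def regenerate(x, modules, z_dim=128):
    """Hypothetical inference flow; `modules` bundles the pre-trained components."""
    x_tilde = modules.denoiser(x)                     # background removal [5]
    content = modules.asr_encoder(x_tilde)            # E_asr(x~): (B, C_asr, T_frames)
    identity = modules.id_encoder(x_tilde)            # E_id(x~): (B, C_id) d-vector
    f0 = modules.f0_extractor(x_tilde)                # fundamental frequency: (B, 1, T_f0)
    loudness = modules.loudness(x_tilde)              # A-weighted loudness: (B, 1, T_loud)
    # bring the prosody features to the content frame rate before concatenation
    T = content.shape[-1]
    f0 = F.interpolate(f0, size=T, mode='linear', align_corners=False)
    loudness = F.interpolate(loudness, size=T, mode='linear', align_corners=False)
    cond = torch.cat([content, f0, loudness], dim=1)  # temporal conditioning signal
    z = torch.randn(identity.shape[0], z_dim, device=identity.device)
    global_cond = torch.cat([z, identity], dim=1)     # noise + speaker identity
    return modules.decoder(cond, global_cond)         # regenerated 24kHz waveform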
Denote the domain of audio samples by $\mathcal{X} \subset \mathbb{R}$. The representation of a raw noisy speech signal is therefore a sequence of samples $x = (x_1, \dots, x_T)$, where $x_t \in \mathcal{X}$ for all $1 \le t \le T$, and $T$ varies between input sequences. We denote by $\mathcal{X}^*$ the set of all finite-length sequences over $\mathcal{X}$. Consider a single-channel recording with additive noise as follows: $x = y \ast h + n$, where $y$ is the clean signal, $h$ is the Acoustic Transfer Function, $n$ is a non-stationary additive noise with an unknown Signal-to-Noise Ratio (SNR), and $\ast$ is the convolution operator.

Given a training set of $n$ examples, $S = \{(x_i, y_i)\}_{i=1}^{n}$, we first remove background noise using a pre-trained state-of-the-art speech enhancement model [5]. Denoting the denoised signal as $\tilde{x}$, we define the set of such samples as $\tilde{S} = \{(\tilde{x}_i, y_i)\}_{i=1}^{n}$.

Our encoding extracts several different representations in order to capture content, prosody, and identity features separately. Specifically, given an input signal $\tilde{x}$, the content representation is extracted using a pre-trained ASR network, $E_{asr}$. In our implementation, we use the public implementation [20] of Wav2Letter [17]. The identity representation is obtained in the form of d-vectors [21], using an identity encoder $E_{id}$. The d-vector extractor is pre-trained on the VoxCeleb2 [22] dataset, achieving a 7.4% EER for speaker verification on the test split of the VoxCeleb1 dataset. The average activations of the penultimate layer form the speaker representation. Lastly, the prosody representation includes both the fundamental frequency and a loudness feature. The former, $F_0(\tilde{x})$, is extracted using YAAPT [18], which was found to be robust against input distortions. The loudness measurement of the signal, $F_{loud}(\tilde{x})$, is extracted using A-weighting of the signal frequencies. The F0 and loudness features are upsampled and concatenated to form the prosody conditioning signal. To summarize, the encoding of a denoised signal is given as $E(\tilde{x}) = [E_{id}(\tilde{x}), E_{asr}(\tilde{x}), F_0(\tilde{x}), F_{loud}(\tilde{x})]$.

The generative decoder network is optimized using the least-squares GAN [23], where the decoder $G$ and discriminator $D$ minimize the following objectives:

$L_{adv}(D, G, \tilde{S}) = \sum_{\tilde{x} \in \tilde{S}} \| 1 - D(\hat{x}) \|_2^2$,
$L_D(D, G, \tilde{S}) = \sum_{\tilde{x} \in \tilde{S}} \left[ \| 1 - D(\tilde{x}) \|_2^2 + \| D(\hat{x}) \|_2^2 \right]$,    (1)

where $\hat{x} = G(z, E(\tilde{x}))$ is the audio sample synthesized from a random noise vector sampled from a normal distribution, $z \sim \mathcal{N}(0, I)$.

The decoder $G$ is additionally optimized with a spectral distance loss, computed at various FFT resolutions between the decoder output $\hat{x}$ and the target clean signal $y$, as suggested in [24]. For a single FFT scale $m$, the loss component is defined as follows:

$L_{spec}^{(m)}(y, \hat{x}) = \frac{\| S(y) - S(\hat{x}) \|_F}{\| S(y) \|_F} + \frac{\| \log S(y) - \log S(\hat{x}) \|_1}{N}$,    (2)

where $\| \cdot \|_F$ and $\| \cdot \|_1$ denote the Frobenius and the $L_1$ norms, $S$ is the magnitude of the Short-Time Fourier Transform (STFT), and $N$ is the number of elements. The multi-scale spectral loss is obtained by summing Equation (2) over a set $M$ of FFT resolutions (the largest being 2048), as follows:

$L_{spec}(y, \hat{x}) = \frac{1}{|M|} \sum_{m \in M} L_{spec}^{(m)}(y, \hat{x})$.    (3)
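Equations (2)-(3) can be implemented directly with framework STFT routines. The sketch below follows the multi-resolution formulation of [24]; the specific FFT sizes, hop lengths, and window choice are assumptions made for illustration.

import torch

def stft_mag(x, n_fft):
    # magnitude spectrogram; hop and window lengths are illustrative choices
    window = torch.hann_window(n_fft, device=x.device)
    spec = torch.stft(x, n_fft=n_fft, hop_length=n_fft // 4,
                      win_length=n_fft, window=window, return_complex=True)
    return spec.abs().clamp(min=1e-7)

def spectral_loss(y, y_hat, fft_sizes=(2048, 1024, 512, 256, 128, 64)):
    """L_spec of Eq. (3): average of the per-scale losses of Eq. (2)."""
    total = 0.0
    for n_fft in fft_sizes:
        S_y, S_hat = stft_mag(y, n_fft), stft_mag(y_hat, n_fft)
        sc = torch.norm(S_y - S_hat, p='fro') / torch.norm(S_y, p='fro')  # spectral convergence
        mag = torch.abs(torch.log(S_y) - torch.log(S_hat)).mean()         # log-magnitude term (L1 / N)
        total = total + sc + mag
    return total / len(fft_sizes)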
Inspired by the recently suggested spectral energy distance formulation of the spectral loss [16], the spectral loss is applied as part of a compound loss term:

$L_{sed}(G, \tilde{S}) = \sum_{(\tilde{x}, y) \in \tilde{S}} L_{spec}(y, G(z_1, E(\tilde{x}))) + L_{spec}(y, G(z_2, E(\tilde{x}))) - L_{spec}(G(z_1, E(\tilde{x})), G(z_2, E(\tilde{x})))$,    (4)

where $z_1, z_2$ are two different normally distributed random noise vectors. Intuitively, the energy loss maximizes the discrepancy between outputs generated with different values of $z$.
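A sketch of the compound term in Equation (4), assuming the spectral_loss function from the previous sketch and the hypothetical decoder call signature decoder(cond, global_cond) used earlier:

import torch

def sed_loss(y, cond, identity, decoder, z_dim=128):
    """Spectral energy distance of Eq. (4): two attractive terms that pull the
    generations toward the target y, and one repulsive term between two
    generations that differ only in the sampled noise vector."""
    z1 = torch.randn(y.shape[0], z_dim, device=y.device)
    z2 = torch.randn(y.shape[0], z_dim, device=y.device)
    x1 = decoder(cond, torch.cat([z1, identity], dim=1))
    x2 = decoder(cond, torch.cat([z2, identity], dim=1))
    return spectral_loss(y, x1) + spectral_loss(y, x2) - spectral_loss(x1, x2)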
Fig. 1. The proposed speech regeneration architecture. The noisy speech signal, $x$, is initially enhanced using a background removal method. The enhanced signal, $\tilde{x}$, is decomposed into four components: (i) speech features, $E_{asr}(\tilde{x})$, using a pre-trained ASR encoder; (ii) fundamental frequency values, $F_0(\tilde{x})$, which provide prosody features; (iii) loudness features, $F_{loud}(\tilde{x})$; and (iv) a d-vector, $E_{id}(\tilde{x})$, which embeds the speaker identity. The speech, pitch, and loudness features are upsampled across the temporal domain and concatenated. The decoder $G$ receives the concatenated signal as input and is conditioned on the concatenation of the identity vector with a random noise vector sampled from a normal distribution. Finally, $G$ outputs the regenerated speech.

Replacing $L_{spec}$ with $L_{sed}$ improved the quality of the generated audio. Specifically, in our experiments, we noticed the removal of metallic effects from the generated audio. Overall, the objective function of the decoder generator, $G$, is defined as:

$L_g(G, D, \tilde{S}) = L_{sed}(G, \tilde{S}) + \lambda \cdot L_{adv}(D, G, \tilde{S})$,    (5)

where $\lambda$ is a tradeoff parameter set to 4 in our experiments.

The preliminary enhancement of the input signal, $x$, produces the denoised signal, $\tilde{x}$, sampled at 16kHz. The decoder then receives as input the concatenated and upsampled conditioning signal, $E(\tilde{x})$, sampled at 250Hz. $G$ is conditioned on the concatenation of the noise vector, $z$, and the speaker identity, $E_{id}(\tilde{x})$, while the input to the model is $E_{asr}(\tilde{x})$ concatenated with $F_0(\tilde{x})$ and $F_{loud}(\tilde{x})$. Finally, the proposed model outputs a raw audio signal sampled at 24kHz.

The architecture of the decoder $G$ is based on the GAN-TTS [25] architecture and consists of seven GBlocks. A GBlock contains a sequence of two residual blocks, each with two convolutional layers. The convolutional layers employ a kernel size of 3 and increasing dilation factors to enlarge the network's receptive field. Before each convolutional layer, the input is passed through a Conditional Batch Normalization [26] layer, conditioned on a linear projection of the noise vector and speaker identity, followed by a ReLU activation. The final five GBlocks upsample the input signal to reach the target sample rate. Figure 1 includes the hyperparameters of each GBlock.

While GAN-TTS [25] was originally trained with both conditional and unconditional discriminators operating at multiple scales, we found in preliminary experiments that the proposed method can generate high-quality audio using a single unconditional discriminator, $D$. Moreover, our architecture of $D$ is much simpler than the one proposed in previous work. It consists of seven convolutional layers, each followed by a leaky ReLU activation with a leakiness factor of 0.2, except for the final layer; the number of filters and the kernel size vary per layer. Finally, to stabilize the adversarial training, both the discriminator and the decoder employ spectral normalization [27].
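The following is a simplified sketch of a single residual unit in the spirit of a GAN-TTS GBlock, with the Conditional Batch Normalization [26] modulated by a projection of the noise-and-identity vector. The channel count, kernel size, and dilation are placeholders, and the upsampling path and second residual block of the full GBlock are omitted.

import torch
import torch.nn as nn

class ConditionalBatchNorm1d(nn.Module):
    """BatchNorm whose scale and shift are predicted from a conditioning vector [26]."""
    def __init__(self, channels, cond_dim):
        super().__init__()
        self.bn = nn.BatchNorm1d(channels, affine=False)
        self.scale = nn.Linear(cond_dim, channels)
        self.shift = nn.Linear(cond_dim, channels)

    def forward(self, x, cond):
        gamma = self.scale(cond).unsqueeze(-1)  # (B, C, 1)
        beta = self.shift(cond).unsqueeze(-1)
        return (1 + gamma) * self.bn(x) + beta

class ResidualUnit(nn.Module):
    """One conv pair of a GBlock-style residual block (upsampling omitted)."""
    def __init__(self, channels, cond_dim, kernel_size=3, dilation=1):
        super().__init__()
        pad = dilation * (kernel_size - 1) // 2
        self.norm1 = ConditionalBatchNorm1d(channels, cond_dim)
        self.conv1 = nn.Conv1d(channels, channels, kernel_size, padding=pad, dilation=dilation)
        self.norm2 = ConditionalBatchNorm1d(channels, cond_dim)
        self.conv2 = nn.Conv1d(channels, channels, kernel_size, padding=pad, dilation=dilation)

    def forward(self, x, cond):
        h = self.conv1(torch.relu(self.norm1(x, cond)))
        h = self.conv2(torch.relu(self.norm2(h, cond)))
        return x + h

In a full decoder, stacks of such units would be interleaved with upsampling layers and wrapped with spectral normalization, as described above.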
4. EVALUATION
We present a series of experiments evaluating the proposed method using both objective and subjective metrics. We start by presenting the datasets used for evaluation. Next, a comparison to several competitive baselines is presented, and we conclude with a discussion of the computational efficiency of the proposed method.
We evaluated our method for speech regeneration on two different datasets. The first one is the Device and Produced Speech (DAPS) [28] dataset. DAPS is comprised of twelve different recordings for each sample, spanning two devices and seven acoustic environments. For a fair comparison, we used the same test partition employed by HiFi-GAN [4]. For this benchmark, we optimized the generator on the public VCTK [29] dataset, which consists of 44 hours of clear, professionally recorded voices of 109 speakers.

Table 1. MUSHRA and perceptual metrics on the DAPS dataset [28]. (Columns: Method, MUSHRA ↑, FDSD ↓, cFDSD ↓; the Clean reference scores 4.88 MUSHRA and our method 3.76.)

The second dataset is a standard benchmark in the speech enhancement literature [30]. The dataset contains clean and artificially added noisy samples. The clean samples are based on the VCTK dataset [29], while the noises are sampled from [31]. The dataset is split into predefined train, validation, and test sets. In both settings, we resample the audio to 24kHz.

We evaluated our method using both objective and subjective metrics. As a subjective metric, we used MUSHRA [32]: human raters are asked to compare samples created from the same test signal. The clean sample is presented to the rater before the processed files and is labeled with a 5.0 score. For objective metrics, we used two distances proposed in [25]: (i) Fréchet Deep Speech Distance (FDSD), a distance measure calculated between the activations of two randomly sampled sets of output and target signals using the DeepSpeech2 [33] ASR model; (ii) conditional Fréchet Deep Speech Distance (cFDSD), similar to FDSD, but computed between the generated output and its matched clean target. Unlike cFDSD, where the generated output should match the target signal, in FDSD the random sets do not have to match the target utterance. Note that the Fréchet distances are computed using a different ASR network than the one used for conditioning the proposed model.

We do not employ the PESQ metric [34], which was designed to quantify degradation due to codecs and transmission channel errors. PESQ's validity for other tasks is questionable, e.g., it shows a low correlation with MOS [35]. Moreover, it is defined for narrowband and wideband signals only.
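Both FDSD and cFDSD reduce to the standard Fréchet distance between Gaussians fitted to ASR-network activations. Below is a minimal sketch, assuming the DeepSpeech2 activations of the two sets of utterances have already been stacked into matrices with one row per example; it illustrates the formula, not the authors' evaluation code.

import numpy as np
from scipy import linalg

def frechet_distance(acts_a, acts_b):
    """Fréchet distance between Gaussians fitted to two activation sets
    (rows are examples, columns are activation dimensions)."""
    mu_a, mu_b = acts_a.mean(axis=0), acts_b.mean(axis=0)
    cov_a = np.cov(acts_a, rowvar=False)
    cov_b = np.cov(acts_b, rowvar=False)
    covmean, _ = linalg.sqrtm(cov_a @ cov_b, disp=False)  # matrix square root
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    diff = mu_a - mu_b
    return diff @ diff + np.trace(cov_a + cov_b - 2.0 * covmean)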
Results.
Table 1 shows the performance of our method on the DAPS dataset. We compared our regeneration method to several competitive baselines in the domain of speech enhancement. As can be seen, our method outperforms the baselines in all metrics. The regenerated speech has substantially lower perceptual distances and scored convincingly higher on the MUSHRA test than the baselines.
Table 2. MUSHRA and perceptual metrics on the noisy VCTK dataset [30]. (Columns: Method, MUSHRA ↑, FDSD ↓, cFDSD ↓; the Clean reference scores 4.74 MUSHRA and our method 4.00.)

Results for the noisy VCTK dataset are presented in Table 2. Our method is superior to the baseline models on the noisy VCTK dataset as well; however, the improvements are modest compared to DAPS. This is due to VCTK being a less challenging dataset for enhancement tasks. Unlike DAPS, whose test split was recorded in real noisy-reverberant environments, the noisy VCTK samples were generated artificially by adding noise files to the clean recordings.

The accessibility of speech enhancement greatly relies on its efficiency and ability to be applied while streaming. We have efficiently implemented a server-based module using PyTorch JIT that is able to fetch speech audio and regenerate it with a
Real-Time Factor of 0.94. All modules required to compute the conditioning vector run in parallel on either an NVIDIA V100 GPU or an Intel Xeon E5 CPU. The final pipeline currently has a latency of about 100 milliseconds (ms), which takes into account the receptive field, future context, and computation time of every module. All modules operate with a receptive field of up to 40ms. The ASR network and the denoiser use a future context of 20ms and 32ms, respectively. The rest of the modules require no future context and are trained to be fully causal. Such latency can fit Voice-over-IP applications.
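As an illustration of the deployment tooling (not the authors' serving code), a module can be compiled ahead of time with TorchScript and loaded by a server process; the tiny stand-in decoder below only demonstrates the mechanics.

import torch
import torch.nn as nn

class TinyDecoder(nn.Module):
    """Stand-in for the generator G, used only to illustrate JIT compilation."""
    def __init__(self):
        super().__init__()
        self.net = nn.Conv1d(80, 1, kernel_size=3, padding=1)

    def forward(self, cond: torch.Tensor) -> torch.Tensor:
        return self.net(cond)

scripted = torch.jit.script(TinyDecoder().eval())  # compile to TorchScript
scripted.save("decoder_jit.pt")                    # servable artifact
# at serving time: torch.jit.load("decoder_jit.pt")(cond_chunk)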
5. CONCLUSIONS
We present an enhancement method that goes beyond the limitations of a given speech signal by extracting the components that are essential to the communication and recreating the audio signal. The method takes advantage of recent advances in ASR technology, as well as in pitch detection and in identity-mimicking TTS and voice conversion technologies.

The recreation approach can also be applied when one of the components is manipulated, for example, when editing the content, modifying the pitch in post-production, or replacing the identity. An interesting application is to create super-intelligible speech, which would enhance the audience's perception and could be used, e.g., to improve educational presentations.

6. REFERENCES
[1] Jacob Benesty et al., Noise reduction in speech processing, Springer Science & Business Media, 2009.
[2] Hagai Attias et al., "Speech denoising and dereverberation using probabilistic models," NeurIPS, 2001.
[3] Daniel Stoller, Sebastian Ewert, and Simon Dixon, "Wave-U-Net: A multi-scale neural network for end-to-end audio source separation," arXiv:1806.03185, 2018.
[4] Jiaqi Su et al., "HiFi-GAN: High-fidelity denoising and dereverberation based on speech deep features in adversarial networks," INTERSPEECH, 2020.
[5] Alexandre Defossez et al., "Real time speech enhancement in the waveform domain," INTERSPEECH, 2020.
[6] Meet H Soni et al., "Time-frequency masking-based speech enhancement using generative adversarial network," ICASSP, 2018.
[7] Dario Rethage, Jordi Pons, and Xavier Serra, "A wavenet for speech denoising," ICASSP, 2018.
[8] Szu-Wei Fu et al., "MetricGAN: Generative adversarial networks based black-box metric scores optimization for speech enhancement," ICML, 2019.
[9] Kun-Youl Park et al., "Narrowband to wideband conversion of speech using GMM based transformation," ICASSP, 2000.
[10] Guo Chen et al., "HMM-based frequency bandwidth extension for speech enhancement using line spectral frequencies," ICASSP, 2004.
[11] Jonas Sautter et al., "Artificial bandwidth extension using a conditional generative adversarial network with discriminative training," ICASSP, 2019.
[12] X. Hao et al., "Time-domain neural network approach for speech bandwidth extension," ICASSP, 2020.
[13] Xinyu Li et al., "Speech audio super-resolution for speech recognition," INTERSPEECH, 2019.
[14] Mu Wang et al., "Speech super-resolution using parallel WaveNet," ISCSLP, 2018.
[15] Leyuan Sheng et al., "High-quality speech synthesis using super-resolution mel-spectrogram," arXiv:1912.01167, 2019.
[16] Alexey A Gritsenko et al., "A spectral energy distance for parallel speech synthesis," arXiv:2008.01160, 2020.
[17] Ronan Collobert et al., "Wav2Letter: an end-to-end ConvNet-based speech recognition system," arXiv:1609.03193, 2016.
[18] K. Kasi and S. A. Zahorian, "Yet another algorithm for pitch tracking," ICASSP, 2002.
[19] Adam Polyak et al., "TTS skins: Speaker conversion via ASR," INTERSPEECH, 2020.
[20] Jason Li et al., "Jasper: An end-to-end convolutional neural acoustic model," INTERSPEECH, 2019.
[21] Li Wan et al., "Generalized end-to-end loss for speaker verification," ICASSP, 2018.
[22] J. S. Chung, A. Nagrani, and A. Zisserman, "VoxCeleb2: Deep speaker recognition," INTERSPEECH, 2018.
[23] Xudong Mao et al., "Least squares generative adversarial networks," ICCV, 2017.
[24] R. Yamamoto et al., "Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram," ICASSP, 2020.
[25] Mikołaj Bińkowski et al., "High fidelity speech synthesis with adversarial networks," ICLR, 2020.
[26] Vincent Dumoulin et al., "A learned representation for artistic style," ICLR, 2017.
[27] Takeru Miyato et al., "Spectral normalization for generative adversarial networks," ICLR, 2018.
[28] Gautham J Mysore, "Can we automatically transform speech recorded on common consumer devices in real-world environments into professional production quality speech?—a dataset, insights, and challenges," IEEE Signal Processing Letters, vol. 22, no. 8, pp. 1006–1010.
[29] C. Veaux et al., "CSTR VCTK Corpus: English multi-speaker corpus for CSTR voice cloning toolkit," 2017.
[30] Cassia Valentini-Botinhao et al., "Noisy speech database for training speech enhancement algorithms and TTS models," 2017.
[31] Joachim Thiemann et al., "DEMAND: a collection of multi-channel recordings of acoustic noise in diverse environments," Proc. Meetings Acoust., 2013.
[32] F. Ribeiro et al., "CROWDMOS: An approach for crowdsourcing MOS studies," ICASSP, 2011.
[33] D. Amodei et al., "Deep speech 2: End-to-end speech recognition in English and Mandarin," ICML, 2016.
[34] A. W. Rix et al., "Perceptual evaluation of speech quality (PESQ)—a new method for speech quality assessment of telephone networks and codecs," ICASSP, 2001.
[35] C. Reddy et al., "A scalable noisy speech dataset and online subjective test framework," arXiv:1909.08050.