Enhancing into the Codec: Noise Robust Speech Coding with Vector-Quantized Autoencoders

Jonah Casebeer*†, Vinjai Vale*‡, Umut Isik§, Jean-Marc Valin§, Ritwik Giri§, Arvindh Krishnaswamy§

† University of Illinois at Urbana-Champaign, ‡ Stanford University, § Amazon Web Services

* Equal contribution. Work performed while at Amazon Web Services.
ABSTRACT
Audio codecs based on discretized neural autoencoders have recently been developed and shown to provide significantly higher compression levels for comparable-quality speech output. However, these models are tightly coupled with speech content, and produce unintended outputs in noisy conditions. Based on VQ-VAE autoencoders with WaveRNN decoders, we develop compressor-enhancer encoders and accompanying decoders, and show that they operate well in noisy conditions. We also observe that a compressor-enhancer model performs better on clean speech inputs than a compressor model trained only on clean speech.
Index Terms — speech enhancement, speech coding, audio compression
1. INTRODUCTION
Audio codecs compress speech signals by eliminating redundant and unnecessary information, with their design often leveraging extensive domain expertise to keep compression rates high while keeping artifacts at a minimum. The most popular codecs, like the Opus codec in wideband mode, can produce high-quality compressed speech at low bitrates [1]. Recently, there have been successful efforts in building learned codecs, starting with replacing the decoders of fixed encoders with learned decoders, which can operate at rates as low as 1.6 kb/s to 2.4 kb/s [2, 3]. These learned decoders leverage advances in speech-synthesizing generative models such as WaveNet, WaveRNN, and LPCNet [4, 5, 6]. More recently, in [7, 8], the encoder and decoder were both learned in a joint fashion, using quantized bottlenecks based on Vector-Quantized Variational Auto-Encoders (VQ-VAE) [9] and soft-to-hard quantizers. The VQ-VAE based model improved, at 1.6 kb/s, on hand-designed encoders running at higher rates.

VQ-VAE models are auto-encoders whose latent vectors are quantized using a learned vector quantization scheme. These discrete representations have been shown to provide a good inductive bias for speech and to perform well on unsupervised acoustic unit discovery tasks [10, 11]. VQ-VAE models are apt for low-bitrate compression, as an input can be represented by a sequence of discrete codebook vector indices, while the locations of the codebook vectors can be hard-coded.

Fully learned codecs like [7, 8] open new avenues for learning-based compression and demonstrate strong results for compressing clean speech. However, they are tightly coupled with speech content and do not perform well under the noisy conditions a codec might encounter in the wild. In this work, we focus on making them robust to speech corrupted with noise. In [12], this issue was addressed by training a noise-robust feature extractor based on the Siamese learning paradigm, and then training a WaveNet model conditioned on those features. Motivated by the performance of modern neural speech enhancers in removing unwanted noise and reverberation from audio signals [13, 14, 15, 16, 17], we combine the learning-based compression and learning-based enhancement paradigms. We call the resulting paradigm "enhancing into the codec". The proposed model is based on a VQ-VAE with a WaveRNN decoder and, trained end-to-end as a speech enhancer, can simultaneously compress and enhance noisy speech signals, independent of speaker identity. We refer to such a model as a compressor-enhancer: a model which jointly compresses and enhances speech.

We measure the performance of our models using Mean Opinion Scores (MOS) from a crowd-sourced study on Amazon Mechanical Turk. Across a range of compression rates and noise levels, we compare our model to a non-enhancing learned compressor, both with and without additional enhancement preprocessing, as well as to the LPCNet neural codec [6]. We find that the proposed model performs well in noisy scenarios compared to both non-enhancing codecs and the composition, with comparable total compute cost, of a separate speech enhancer and a non-enhancing codec. We also find that the compressor-enhancer VQ-VAE performs significantly better than a clean-speech-trained VQ-VAE codec on clean speech inputs.
2. METHOD

The task of speech enhancement is to recover a clean speech signal s from a noisy and possibly reverberant mixture x = s ∗ h + n, where n represents an additive noise signal and s ∗ h represents the convolution of a room impulse response h with the speech signal. The goal of speech compression is to reconstruct a speech signal s after encoding it to a smaller representation Ê and decoding it to the reconstruction ŝ. A neural compression model is composed of an encoder network N_e, a coding step C, and a decoder network N_d, which balance a trade-off between reconstruction fidelity and the size of Ê.

Thus, we define the joint compression-enhancement task. In this task, the model receives a noisy input x, which it encodes to a smaller representation Ê and then decodes to an estimated clean and decompressed speech signal ŝ. The full procedure is therefore

    \hat{s} = N_d\big(\overbrace{C(\underbrace{N_e(x)}_{E})}^{\hat{E}}\big).    (1)

The proposed autoencoder model (Fig. 1) is comprised of a convolutional encoder, a VQ-VAE bottleneck, and a recurrent decoder, corresponding to N_e, C, and N_d respectively. The model takes as input a 16 kHz noisy speech signal, which is processed by the encoder and quantized by the bottleneck. The decoder autoregressively reconstructs the original 16 kHz waveform using the quantized speech and speaker encodings. As such, the encoder is encouraged to produce a compressed representation that gives the most information for the decoder to conditionally model the clean speech signal.

The encoder first computes a log-Mel representation and then applies a series of 1D convolutional layers, treating the mel bins as features. Each convolutional layer is followed by batch normalization and then a ReLU non-linearity. The strides of the log-Mel representation and convolutional layers are selected to produce encodings at a rate of 50 Hz. The encoder also estimates one additional "speaker embedding" vector by performing a simple average across time over the output of a separate set of encoding layers. The output of these steps is called E.

The vector-quantized bottleneck quantizes the outputs of the encoding layer using a set of codebooks, where a separate codebook, constant over the entire input, is used to quantize the speaker embedding. We represent a codebook containing K codes by C = {c_1, ..., c_K}. In the forward pass, the encoder outputs E = {e_1, ..., e_N} are quantized by replacing each e_i with the closest c_j to get the quantized encoding ê_i, where j = argmin_k ||e_i − c_k||. Due to the non-differentiability of the argmin operation, VQ-VAE uses two additional loss terms. These terms encourage each encoding e_i to be close to its selected c_j, and each code c_j to minimize the quantization error incurred by the encodings that selected it. They are summarized below using the stop-gradient operator sg, which is the identity in the forward pass but stops gradients in the backward pass. In practice, we optimize the second term using an exponential-moving-average k-means. For additional details, see [9].

    \mathcal{L}_{vq} = \lambda \,\|\mathrm{sg}[\hat{E}] - E\|^2 + \|\mathrm{sg}[E] - \hat{E}\|^2.    (2)
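To make the quantization step concrete, the following is a minimal PyTorch-style sketch of the bottleneck's forward pass and the loss of Eq. (2). It is our own illustration, not the authors' implementation: the straight-through gradient copy and the commitment weight `beta` follow the standard VQ-VAE recipe of [9], and, as noted above, the paper replaces the codebook term with an EMA k-means update in practice.

```python
# Minimal sketch of a VQ bottleneck, assuming encodings of shape (T, D)
# and a codebook of shape (K, D). Not the paper's code.
import torch
import torch.nn.functional as F

def vq_bottleneck(encodings, codebook, beta=0.25):
    """Quantize encodings against the codebook; return straight-through
    quantized encodings, the selected indices, and the loss of Eq. (2)."""
    dists = torch.cdist(encodings, codebook)            # (T, K) pairwise distances
    idx = dists.argmin(dim=1)                           # j = argmin_k ||e_i - c_k||
    quantized = codebook[idx]                           # nearest codes, (T, D)

    # Eq. (2): commitment term (pulls E toward the chosen codes) plus
    # codebook term (pulls the chosen codes toward E). `beta` plays the
    # role of lambda; in the paper the second term is handled by EMA k-means.
    commit = F.mse_loss(encodings, quantized.detach())  # ||sg[E_hat] - E||^2
    codebk = F.mse_loss(quantized, encodings.detach())  # ||sg[E] - E_hat||^2
    vq_loss = beta * commit + codebk

    # Straight-through estimator: forward pass uses the quantized values,
    # backward pass copies gradients past the non-differentiable argmin.
    quantized = encodings + (quantized - encodings).detach()
    return quantized, idx, vq_loss
```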
We use an RNN-based model to synthesize raw 16 kHz audio. The model, which is described in [5], contains two Gated Recurrent Units (GRUs) and two dense layers. We first concatenate the quantized speaker embedding to the quantized encoding and pass the resulting tensor through the first GRU. Then, we up-sample the GRU output to match the desired output length (in raw audio samples) and pass the upsampled tensor through the second GRU and two final dense layers. We apply a softmax to the final dense layer and train the model to predict a distribution over 8-bit mu-law quantized values.
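The decoder's 256-way softmax targets are standard 8-bit mu-law codes. For reference, a small NumPy sketch of the companding (our illustration, not code from the paper):

```python
# Standard 8-bit mu-law companding, assumed to match the decoder targets.
import numpy as np

MU = 255  # 8-bit mu-law

def mulaw_encode(x):
    """Map a waveform in [-1, 1] to integer classes 0..255."""
    y = np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)  # compand to [-1, 1]
    return ((y + 1) / 2 * MU + 0.5).astype(np.int64)          # discretize

def mulaw_decode(idx):
    """Invert the companding: classes 0..255 back to floats in [-1, 1]."""
    y = 2 * (idx.astype(np.float64) / MU) - 1
    return np.sign(y) * ((1 + MU) ** np.abs(y) - 1) / MU
```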
The final forward pass is composed of passing raw noisy audio x to the encoder, quantizing the resulting encodings and speaker embedding, and running the autoregressive model to produce an estimated clean waveform ŝ. The full loss function, shown below, is composed of L_vq and a cross-entropy term L_ce(s, ŝ), which measures the KL-divergence between the predicted distribution and the one-hot value of the mu-law quantized clean speech s:

    \mathcal{L} = \mathcal{L}_{vq} + \mathcal{L}_{ce}(s, \hat{s}).    (3)
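As a shape-level illustration of Eq. (3), the sketch below (ours; the tensor shapes are assumptions) combines the VQ loss with the cross-entropy over mu-law classes of the clean target:

```python
# Eq. (3) as code: total loss = VQ terms + cross-entropy against the
# mu-law classes of the clean speech. Shapes are illustrative assumptions.
import torch
import torch.nn.functional as F

def total_loss(logits, clean_mulaw_targets, vq_loss):
    """logits: (T, 256) decoder outputs; clean_mulaw_targets: (T,) int64."""
    ce = F.cross_entropy(logits, clean_mulaw_targets)  # L_ce(s, s_hat)
    return vq_loss + ce                                # L = L_vq + L_ce

# Toy usage with random tensors, just to show the shapes involved:
T = 16000                                   # one second of 16 kHz audio
logits = torch.randn(T, 256)
targets = torch.randint(0, 256, (T,))
loss = total_loss(logits, targets, vq_loss=torch.tensor(0.1))
```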
3. EXPERIMENTS AND RESULTS
We train the Codec Only and Enhancing Codec models at two different bitrates, 0.9 kb/s and 1.35 kb/s, and for the higher-rate (1.35 kb/s) Enhancing Codec we also experiment with modifying the training setup to use higher-SNR mixtures.
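These rates follow from the configuration described below (encodings at 50 Hz, quantized with 512-entry, i.e. 9-bit, codebooks); a quick back-of-the-envelope check of our own, treating the once-per-file speaker code as negligible:

```python
# Speech-code bitrate under the stated configuration: 50 Hz encodings,
# 9-bit (512-entry) codebooks; the speaker code is sent once per file.
frame_rate_hz = 50
bits_per_codebook = 9        # log2(512)

for n_codebooks in (2, 3):
    kbps = frame_rate_hz * n_codebooks * bits_per_codebook / 1000
    print(f"{n_codebooks} codebooks -> {kbps} kb/s")
# 2 codebooks -> 0.9 kb/s
# 3 codebooks -> 1.35 kb/s
```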
The encoder first computes a log-mel representation of the 16 kHz input audio, which is passed to the speech encoder, a stack of five convolutional layers. The first, second, fourth, and fifth layers preserve the temporal resolution, while the third layer downsamples in time so that, together with the log-mel hop size, encodings are produced at 50 Hz. The speaker encoder has an identical architecture but uses a different number of filters and omits the fifth layer. Both encoder outputs are passed to separate VQ bottlenecks, which apply a linear layer before quantizing. The speaker encoding for an entire input file is quantized using a single code from a 9-bit (512-entry) codebook, while the speech encoding is quantized using two 9-bit codebooks for the 0.9 kb/s model and three 9-bit codebooks for the 1.35 kb/s model. When several codebooks are used, each codebook uses its own linear layer, and the resulting output quantizations are stacked. In our current implementation these steps are non-causal, but they can easily be made causal, or given a custom look-ahead, by using causal convolutions and adapting speaker encodings over time. The WaveRNN model's first GRU produces the coarse representation and its second GRU the fine representation.
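For illustration, a front-end sketch in the spirit of this description; the bin count, window, and hop below are our placeholder assumptions, chosen so that 100 Hz frames reach 50 Hz after one 2x downsampling layer:

```python
# Log-mel front end on 16 kHz audio. n_mels, win, and hop are placeholder
# choices, not the paper's values; they yield frames at 100 Hz.
import librosa
import numpy as np

def logmel(x, sr=16000, n_mels=80, win=400, hop=160):
    m = librosa.feature.melspectrogram(
        y=x, sr=sr, n_fft=win, win_length=win, hop_length=hop, n_mels=n_mels)
    return np.log(m + 1e-6).T   # (frames at 100 Hz, n_mels), mel bins as features
```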
Fig. 1. Block diagram of the compressor-enhancer. The speech and speaker encoders are made up of several convolutional layers with batch normalization and ReLU. The VQ bottleneck has separate quantizers for the speech and speaker encodings. In our experiments the speech quantizer is made up of several codebooks of size 512 and the speaker quantizer is made up of a single codebook of size 512. The decoder is a WaveRNN-based model and uses the quantized speech and speaker information to reconstruct first a coarse and then a fine waveform. The waveform output is mu-law quantized.

All VQ codebooks are trained using the exponential moving average technique from [9]. We train the models with a batch size of 80 per GPU and a sample length of 1 second, using the Adam optimizer on 8 NVidia V100 GPUs for 3 days.

To generate a training mixture, we retrieve clean speech data from the LibriSpeech dataset [18] and noise data from AudioSet [19]. When selecting noise clips from AudioSet, we avoid any clips with speech-related tags. To increase the prevalence of challenging noise, we sample clips with non-stationary noise more frequently. The noisy mixtures are created with a random SNR drawn from a range spanning negative to positive dB values; a code sketch of this mixing procedure is shown below. Finally, all room impulse responses are synthetically generated using the image-source method. For additional details, consult [16].

To evaluate, we use the test mixtures from the VCTK dataset [20], which contains mixtures at SNRs of 17.5 dB, 12.5 dB, 7.5 dB, and 2.5 dB across a variety of speakers and noises. When evaluating our models on clean speech, we use the clean speech samples from the VCTK test set.

Since the compression-based models in this paper resynthesize waveforms, their performance is not aptly measured by standard numerical metrics; we therefore measure model performance using a Mean Opinion Score (MOS) from a crowd-sourced study on Mechanical Turk that uses the P.808 evaluation method [21].
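The mixing procedure referenced above can be sketched as follows. This is our own illustration of constructing x = s ∗ h + n at a target SNR; the paper's exact SNR sampling range was not given, so `snr_db` is left as a parameter:

```python
# Sketch of noisy-mixture generation x = s * h + n at a target SNR.
# The scipy-based convolution and the SNR definition are our assumptions.
import numpy as np
from scipy.signal import fftconvolve

def make_mixture(speech, rir, noise, snr_db):
    reverberant = fftconvolve(speech, rir)[: len(speech)]    # s * h
    # Scale the noise so 10*log10(P_speech / P_noise) equals snr_db.
    p_s = np.mean(reverberant ** 2)
    p_n = np.mean(noise[: len(speech)] ** 2) + 1e-12
    gain = np.sqrt(p_s / (p_n * 10 ** (snr_db / 10)))
    return reverberant + gain * noise[: len(speech)]
```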
We compare the compressor-enhancer model with a compressor-only counterpart of identical architecture and size at bitrates of 0.9 kb/s and 1.35 kb/s. The compressor-only model is trained on the clean speech from the training setup described above. We also experiment with the combination of a speech enhancement model (RNNoise) [17] and the compression-only model. We refer to these baselines as "Codec Only" and "Enhancement, then Codec" respectively. As a final point of comparison, we also evaluate a pre-trained LPCNet vocoder [6].

Fig. 2 displays the MOS scores of the models at different bandwidths. Within the examined bandwidths, our enhancement-trained compression model (blue diamonds and circle) scores consistently above its non-enhancing counterpart (black square). Interestingly, sequential speech enhancement and compression scores worse than compression alone; we suspect this stems from the compression-only model being susceptible to out-of-distribution errors. Finally, the LPCNet model serves as another comparison point for a compression-only neural codec. The model denoted "Enhancing Codec LN" is identical to "Enhancing Codec" in architecture but was trained on lower-noise mixtures drawn from a higher SNR range.
Fig. 2. MOS scores on the VCTK test set compared across models (Codec Only; Enhancing Codec; Enhancing Codec LN; Enhance, then Codec; LPCNet) running at a range of kb/s. Our proposed joint compressor-enhancer models outperform the compression-only baselines as well as the sequential enhancement-compression baselines. The LN suffix denotes a model trained on mixtures with lower noise content. The results are statistically significant, with small 95% confidence intervals.
Model \ SNR (dB)            17.5    12.5    7.5    2.5
Codec Only
Enhancing Codec LN
Enhancement, then Codec
Table 1. MOS comparison across SNRs on VCTK at 1.35 kb/s. The largest performance difference is in the low-SNR scenes.

To see under what acoustic conditions enhancement-trained compression has the largest effect, we split MOS scores by SNR and compare the 1.35 kb/s versions of the three models discussed above. We display these results in Table 1. The MOS scores show that enhancement-trained compression compares favorably to both baselines across all SNRs. In the 2.5 dB scene, the enhancing codec has a sizable lead over the baselines; this margin shrinks in the higher-SNR scenes.

Observing the results at high SNR, we also evaluated the MOS performance of enhancement-trained compression on clean speech. We compare the compression-only model against our enhancement-trained compression model at the same bitrate and display the results in Table 2.
Model                       Clean Speech MOS
Codec Only
Enhancing Codec LN

Table 2. MOS comparison on VCTK clean speech. 95% confidence intervals are small.

We chose to omit the sequential enhancement-then-compression model, since the input speech is already clean. The enhancement-trained model outperforms the compression-only model, leading us to suspect that training with noise helps the model learn a more robust bottleneck and thus generalize better.

We also attempted a two-stage joint compression and enhancement approach, in which we first trained a compression-only model and then trained an encoder-only enhancement model to output, given noisy speech, the discretized latent representation of clean speech. The goal was to reduce out-of-domain errors for speech enhancement models, as a two-stage approach would mean fewer parameters need to be trained on enhancement. However, we found that these models do not perform well, possibly because the clean-trained autoregressive decoder is too sensitive to out-of-domain inputs from the enhancer-encoder, making simultaneous training of the decoder a key component of joint compression and enhancement.
4. CONCLUSION
In this work we presented a model that performs joint compression and enhancement of a noisy speech signal using a VQ-VAE with a convolutional encoder and a WaveRNN decoder. Through a set of mean opinion score based experiments, we found that joint compression and enhancement performs better in the presence of noise, including in low-SNR scenarios, than stand-alone compression, and also outperforms a sequential combination of speech enhancement and a compression-only neural codec. We also found that enhancement training improves codec performance on clean speech signals.
5. REFERENCES

[1] Jean-Marc Valin, Koen Vos, and Timothy Terriberry, "Definition of the Opus audio codec," IETF RFC 6716, September 2012.

[2] W. Bastiaan Kleijn, Felicia S. C. Lim, Alejandro Luebs, Jan Skoglund, Florian Stimberg, Quan Wang, and Thomas C. Walters, "Wavenet based low rate speech coding," in ICASSP. IEEE, 2018, pp. 676–680.

[3] Jean-Marc Valin and Jan Skoglund, "A real-time wideband neural vocoder at 1.6 kb/s using LPCNet," Proc. Interspeech 2019, pp. 3406–3410, 2019.

[4] Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu, "WaveNet: A generative model for raw audio," arXiv preprint arXiv:1609.03499, 2016.

[5] Jaime Lorenzo-Trueba, Thomas Drugman, Javier Latorre, Thomas Merritt, Bartosz Putrycz, Roberto Barra-Chicote, Alexis Moinet, and Vatsal Aggarwal, "Towards achieving robust universal neural vocoding," arXiv preprint arXiv:1811.06292, 2018.

[6] Jean-Marc Valin and Jan Skoglund, "LPCNet: Improving neural speech synthesis through linear prediction," in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 5891–5895.

[7] Cristina Gârbacea, Aäron van den Oord, Yazhe Li, Felicia S. C. Lim, Alejandro Luebs, Oriol Vinyals, and Thomas C. Walters, "Low bit-rate speech coding with VQ-VAE and a WaveNet decoder," in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 735–739.

[8] Kai Zhen, Mi Suk Lee, Jongmo Sung, Seungkwon Beack, and Minje Kim, "Psychoacoustic calibration of loss functions for efficient end-to-end neural audio coding," IEEE Signal Processing Letters, vol. 27, pp. 2159–2163, 2020.

[9] Aaron van den Oord, Oriol Vinyals, et al., "Neural discrete representation learning," in Advances in Neural Information Processing Systems, 2017, pp. 6306–6315.

[10] Jan Chorowski, Ron J. Weiss, Samy Bengio, and Aäron van den Oord, "Unsupervised speech representation learning using WaveNet autoencoders," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 12, pp. 2041–2053, 2019.

[11] Mingjie Chen and Thomas Hain, "Unsupervised acoustic unit representation learning for voice conversion using WaveNet auto-encoders," arXiv preprint arXiv:2008.06892, 2020.

[12] Felicia S. C. Lim, W. Bastiaan Kleijn, Michael Chinen, and Jan Skoglund, "Robust low rate speech coding based on cloned networks and WaveNet," in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 6769–6773.

[13] Bingyin Xia and Changchun Bao, "Speech enhancement with weighted denoising auto-encoder," in Interspeech, 2013, pp. 3444–3448.

[14] Yong Xu, Jun Du, Li-Rong Dai, and Chin-Hui Lee, "A regression approach to speech enhancement based on deep neural networks," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 1, pp. 7–19, 2014.

[15] Felix Weninger, Hakan Erdogan, Shinji Watanabe, Emmanuel Vincent, Jonathan Le Roux, John R. Hershey, and Björn Schuller, "Speech enhancement with LSTM recurrent neural networks and its application to noise-robust ASR," in International Conference on Latent Variable Analysis and Signal Separation. Springer, 2015, pp. 91–99.

[16] Umut Isik, Ritwik Giri, Neerad Phansalkar, Jean-Marc Valin, Karim Helwani, and Arvindh Krishnaswamy, "PoCoNet: Better speech enhancement with frequency-positional embeddings, semi-supervised conversational data, and biased loss," arXiv preprint arXiv:2008.04470, 2020.

[17] Jean-Marc Valin, "A hybrid DSP/deep learning approach to real-time full-band speech enhancement," in MMSP. IEEE, 2018, pp. 1–5.

[18] Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur, "LibriSpeech: An ASR corpus based on public domain audio books," in ICASSP. IEEE, 2015, pp. 5206–5210.

[19] Jort F. Gemmeke, Daniel P. W. Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R. Channing Moore, Manoj Plakal, and Marvin Ritter, "Audio Set: An ontology and human-labeled dataset for audio events," in ICASSP. IEEE, 2017, pp. 776–780.

[20] Cassia Valentini-Botinhao et al., "Noisy speech database for training speech enhancement algorithms and TTS models," University of Edinburgh, School of Informatics, Centre for Speech Technology Research (CSTR), 2017.

[21] ITU-T, "Subjective evaluation of speech quality with a crowdsourcing approach," Recommendation P.808, 2018.