Low Bit-Rate Wideband Speech Coding: A Deep Generative Model based Approach
Gang Min, Xiongwei Zhang, Xia Zou, Xiangyang Liu
Institute of Information and Communication, National University of Defense Technology, China
Army Engineering University of PLA, China
ABSTRACT
Traditional low bit-rate speech coding approaches only handle narrowband speech sampled at 8 kHz, which limits further improvements in speech quality. Motivated by recent successful explorations of deep learning methods for image and speech compression, this paper presents a new approach that combines vector quantization (VQ) of mel-frequency cepstral coefficients (MFCCs) with a deep generative model called WaveGlow to provide efficient and high-quality speech coding. The coding feature is solely an 80-dimensional MFCC vector for 16 kHz wideband speech, so speech coding at bit-rates throughout 1000-2000 bit/s can be scalably implemented by applying different VQ schemes to the MFCC vector. This new deep generative network based codec works fast because the WaveGlow model abandons the sample-by-sample autoregressive mechanism. We evaluated this new approach on the multi-speaker TIMIT corpus, and experimental results demonstrate that it provides better speech quality than the state-of-the-art classic MELPe codec at a lower bit-rate.
Index Terms — speech coding, mel-frequency cepstral coefficients, vector quantization, WaveGlow
1. INTRODUCTION
Low bit-rate speech coding, which encodes speech signals at bit rates below 4800 bit/s, has widespread applications in both satellite and secure communications. Many successful low bit-rate speech coding algorithms have been proposed in the literature, such as linear predictive coding (LPC-10) [1], code-excited linear prediction (CELP) [2], and mixed excitation linear prediction (MELP) [3]. However, high-quality speech coding under low bit-rate conditions still faces great challenges, especially for wideband speech and in the presence of background acoustic noise. All of the classic speech vocoders mentioned above belong to the source-filter speech coding framework, in which the speech coding parameters include linear prediction coefficients (LPCs), pitch, energy, etc. Different types of speech coding parameters are rarely quantized together, so it is very difficult to further reduce the speech coding rate. Therefore, many other speech coding methods have been studied as alternatives to the classic linear prediction coding model.

The MFCC codec encodes speech signals through scalar quantization (SQ) or vector quantization (VQ) of MFCCs, which provides a promising new scheme for speech coding at low bit-rate conditions [4][5]. However, some limitations still need to be resolved to further improve its overall performance. The first is that the quality of coded speech needs further improvement, since there exists a spectrum smearing problem, especially in the high-frequency region, caused by using overlapped triangular windows on the mel-frequency scale for MFCC extraction. Another is that the processing efficiency also needs improvement, since the traditional MFCC codec uses the Griffin-Lim algorithm (GLA) to estimate the lost phase information via the discrete Fourier transform (DFT) and inverse discrete Fourier transform (IDFT) iteratively [6].
However, GLA suffers from a slow convergence problem when the random initialization of the phase spectrogram is not ideal. Moreover, current MFCC codecs are rarely able to handle 16 kHz wideband speech signals [4].

In the last decade, deep learning methods have dramatically improved the performance of many speech processing applications, such as speech enhancement (SE), text-to-speech (TTS), and automatic speech recognition (ASR). Most recently, deep neural networks have shown promise in handling the traditional speech coding task [7]. One of the most representative works is the WaveNet based codec [8][9], which uses WaveNet as a generative model to synthesize speech waveforms from the bitstream generated by traditional speech codecs, such as codec2, MELP, etc. WaveNet is an autoregressive neural network model that generates high-quality speech waveforms; however, it suffers from very slow inference, which prevents real-time speech coding applications. Besides, other models such as SampleRNN and LPCNet have also been explored for speech and audio coding applications [10]-[12]. The authors in [13] presented Deep Vocoder, which compresses narrowband speech with a deep autoencoder and uses GLA to recover speech signals from the decoded speech spectrogram; similar work in [14] presented DeepVoCoder, which uses a convolutional neural network (CNN)-based encoder model to compress speech signals. However, the quality of coded speech and the efficiency of speech decoding need further improvement for real-world communication applications.

Recent research on TTS using deep generative models conditioned on mel-spectrograms motivates our study in this paper, which combines quantization of MFCCs with an efficient and high-quality speech generative model for the speech coding task.
WaveGlow is a flow-based deep generative network that delivers speech quality almost as good as WaveNet; however, its inference speed is much faster than that of WaveNet because it abandons the sample-by-sample autoregressive mechanism [15]. A recent comparison of neural vocoders for speech reconstruction from mel-spectrograms also confirmed the superiority of WaveGlow in trading off speech quality against computational complexity [16]. Therefore, we choose WaveGlow as the generative model to synthesize speech waveforms from the quantized mel-spectrogram. The coding feature in our vocoder is an 80-dimensional MFCC vector for the 16 kHz wideband speech signal, so speech coding at bit-rates throughout 1000-2000 bit/s can be scalably implemented with different quantization schemes for the MFCC vector.
2. ALGORITHM

2.1. Speech Coding with MFCCs and WaveGlow
The speech coding model is the basis for converting speech signals into a bitstream. Like traditional speech vocoders, there are mainly three steps for speech coding with quantization of MFCCs and WaveGlow: extraction of the speech coding features, quantization of these features, and speech synthesis from the quantized feature parameters, as shown in Fig. 1.

Fig. 1. Overview of the proposed vocoder.

Let s[n] denote the speech waveform; it is enframed by a window w[n],

    s_m[n] = s[mR + n] w[n]    (1)

where L (0 ≤ n ≤ L − 1) denotes the window length, R denotes the frame shift, and m (m = 1, 2, ..., M) denotes the frame index. Each speech frame is then concisely denoted as

    s_m = [s_m(0), s_m(1), ..., s_m(L − 1)]^T    (2)

The log mel-spectrogram of each speech frame can be computed as

    y_m = M log(|F{s_m}|)    (3)

where F{s_m} is the N-point fast Fourier transform (FFT) of s_m and |·| denotes the modulus of a complex number. Due to the symmetry, the latter N/2 − 1 elements of |F{s_m}| are discarded. M ∈ R^{K×(N/2+1)} denotes the mel-filter weighting matrix, where K is the number of mel-filter bands.

Furthermore, the MFCC vector of each speech frame can be computed as

    z_m = DCT{y_m}    (4)

where DCT{·} denotes the discrete cosine transform.

At the transmitter, the quantizer Q(·) uses the SQ or VQ technique to quantize the MFCC vector z_m and converts it into a bitstream, which is then modulated for transmission. At the receiver, the quantized MFCC vector ẑ_m is recovered by searching the codebook with the dequantizer Q^{-1}(·). The reconstructed log mel-spectrogram ŷ_m is then computed by the inverse discrete cosine transform (IDCT) of ẑ_m and used as the conditioning input of WaveGlow to synthesize each speech frame. Finally, the speech waveform ŝ[n] is reconstructed by the overlap-add operation.

The quantization step of the feature parameters is crucial for reducing the bit-rate of speech coding while maintaining high quality of the coded speech. Conventional speech vocoders contain different types of speech coding parameters, which are rarely quantized together.
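Before turning to quantization, the analysis and synthesis steps around Eqs. (1), (2) and (4) can be sketched in plain Python. This is only an illustrative sketch: the FFT and mel-filtering of Eq. (3) are omitted, and the function names and the orthonormal DCT-II normalization are our own choices, not taken from the paper.

```python
import math

def enframe(s, L, R, w):
    """Eqs. (1)-(2): split s[n] into M overlapping frames s_m,
    each multiplied by the analysis window w[n]."""
    M = (len(s) - L) // R + 1
    return [[s[m * R + n] * w[n] for n in range(L)] for m in range(M)]

def dct(y):
    """Eq. (4): orthonormal DCT-II of a log mel-spectrogram vector y_m."""
    K = len(y)
    return [(math.sqrt(1.0 / K) if k == 0 else math.sqrt(2.0 / K)) *
            sum(y[n] * math.cos(math.pi * (n + 0.5) * k / K) for n in range(K))
            for k in range(K)]

def overlap_add(frames, R):
    """Decoder side: rebuild the waveform by overlap-adding synthesized frames."""
    L = len(frames[0])
    out = [0.0] * ((len(frames) - 1) * R + L)
    for m, frame in enumerate(frames):
        for n, v in enumerate(frame):
            out[m * R + n] += v
    return out
```

For example, with a rectangular window, `enframe(list(range(10)), 4, 2, [1]*4)` yields four frames starting at samples 0, 2, 4 and 6, matching the index s[mR + n] in Eq. (1).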
However, the speech coding parameters in the proposed vocoder are solely the MFCC vector, so scalable speech coding schemes at different bit-rates can be implemented conveniently using the SQ or VQ technique. The first element of the MFCC vector represents energy; its value and variance are significantly greater than those of the other elements, so it is quantized independently using the SQ technique. The other elements of the MFCC vector represent the vocal tract and excitation parameters, and they are quantized together using the VQ technique.
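As a minimal sketch of these two quantizers (the quantization range and the codebook here are hypothetical; the paper does not specify them, and in practice the VQ codebook would be trained offline, e.g. with the LBG algorithm):

```python
def scalar_quantize(x, lo, hi, bits):
    """Uniform SQ of the energy element: map x in [lo, hi] onto 2**bits
    levels, returning the level index and the reconstructed value."""
    levels = 2 ** bits
    step = (hi - lo) / (levels - 1)
    idx = min(levels - 1, max(0, round((x - lo) / step)))
    return idx, lo + idx * step

def vector_quantize(x, codebook):
    """VQ of the remaining elements: return the index of the nearest
    codeword under squared Euclidean distance, and the codeword itself."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    idx = min(range(len(codebook)), key=lambda i: dist(codebook[i]))
    return idx, codebook[idx]
```

Only the transmitted index needs the bit budget: a 4-bit SQ index plus a 12-bit VQ index gives the 16 bits per frame used at 1000 bit/s.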
WaveGlow is a flow-based deep neural generative model for synthesizing high-quality speech signals conditioned on mel-spectrograms. A previous study has shown that the Mean Opinion Score (MOS) of speech synthesized by WaveGlow reaches up to 3.9 on the LJ Speech corpus [15], so using WaveGlow as a decoder for speech coding is very attractive. WaveGlow consists of a series of invertible flow layers that transform a simple zero-mean spherical Gaussian distribution into the desired speech distribution [15]. The WaveGlow network can be trained directly by minimizing the negative log-likelihood of the training set. Once the WaveGlow network is trained, inference to generate speech waveforms from the quantized mel-spectrogram is implemented by sampling from a Gaussian distribution and passing the samples through the WaveGlow network.

Table 1. Bit allocation scheme for MFCCs quantization.

f_s (Hz) | L (samples) | R (samples) | Rate (bit/s) | Bits/frame | Energy (z_0) | Formant and Pitch (z_1 ∼ z_79)
16000 | 1024 | 256 | 1000 | 16 | 4-bit SQ | 12-bit VQ
16000 | 1024 | 256 | 2000 | 32 | 6-bit SQ | (13-13)-bit MSVQ

Bit allocation is an important procedure for determining the bit-rate of speech coding. As previously discussed, the first element of the MFCC vector, z_0, and the other elements, z_1 ∼ z_79, are quantized using different methods. The proposed vocoder processes wideband speech signals (16 kHz sampling rate); when the frame length is set to 64 msec (1024 samples) and the frame shift to 16 msec (256 samples), we can design the bit allocation scheme shown in Tab. 1. Speech coding at different bit-rates can thus be flexibly implemented given the corresponding bit allocation schemes. When the bit-rate is 1000 bit/s, there are in total 16 bits per speech frame, so 4 bits are allocated for scalar quantization of the energy parameter z_0 and the remaining 12 bits are allocated for direct vector quantization of the formant and pitch parameters z_1 ∼ z_79. When the bit-rate is 2000 bit/s, there are in total 32 bits per speech frame, so 6 bits are allocated for scalar quantization of the energy parameter and the other 26 bits are allocated for quantizing the remaining parameters. In order to reduce the codebook searching complexity at the bit-rate of 2000 bit/s, we use the multistage vector quantization (MSVQ) method to encode z_1 ∼ z_79 efficiently.
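The rates in Tab. 1 follow directly from the frame shift: one coded frame is emitted every R = 256 samples, i.e. every 16 ms at f_s = 16 kHz, so the bit-rate is (bits per frame) × f_s / R:

```python
fs, R = 16000, 256                  # sampling rate (Hz) and frame shift (samples)
for bits_per_frame in (16, 32):
    rate = bits_per_frame * fs / R  # bit/s = bits per frame / frame period
    print(bits_per_frame, "bits/frame ->", rate, "bit/s")
# prints: 16 bits/frame -> 1000.0 bit/s
#         32 bits/frame -> 2000.0 bit/s
```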
To make a tradeoff between the quantization distortion and the codebook searching burden, two cascaded codebooks are trained, with the codebook at each stage consisting of 2^13 codewords; the quantization result of z_1 ∼ z_79 is obtained by comparing the quantization distortion of the different combinations of the reserved codewords at each stage.
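A minimal sketch of this two-stage M-best search (with tiny hypothetical codebooks for illustration; the actual 2^13-entry codebooks would be trained offline):

```python
def dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((u - v) ** 2 for u, v in zip(a, b))

def msvq_encode(x, cb1, cb2, keep=2):
    """Two-stage MSVQ: keep the `keep` best stage-1 codewords (the reserved
    survivors), quantize each residual with the stage-2 codebook, and return
    the index pair with the lowest total distortion."""
    survivors = sorted(range(len(cb1)), key=lambda i: dist(x, cb1[i]))[:keep]
    best = None
    for i in survivors:
        residual = [xv - cv for xv, cv in zip(x, cb1[i])]
        for j, cw in enumerate(cb2):
            d = dist(residual, cw)
            if best is None or d < best[0]:
                best = (d, i, j)
    return best[1], best[2]
```

The decoder simply adds the two indexed codewords, ẑ = cb1[i] + cb2[j], so only the two 13-bit indices need to be transmitted.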
3. EXPERIMENTS AND RESULTS

3.1. Dataset and Evaluation Metrics
We carry out our experiments on the widely used TIMIT corpus to evaluate the performance of the proposed vocoder. TIMIT is a multi-speaker corpus, which contains 462 speakers in the training dataset and 168 speakers in the test dataset. At the training stage, the whole TIMIT training set of 4620 utterances was used for extracting mel-spectrograms and training the WaveGlow network model; the duration of the training speech is ∼16 minutes.

All speech waveforms are sampled at 16 kHz. The speech signal was enframed to 1024 samples using a Hamming window with a frame shift of 256 samples. The dimension of the MFCC vector for each speech frame is 80, i.e., K = 80.

Two objective metrics were used for evaluating the quality of the coded speech. The first is the perceptual evaluation of speech quality (PESQ) [17], which is adopted as the ITU-T P.862 standard and widely used for evaluating speech quality. The other is the short-time objective intelligibility (STOI) [18], which is also a popular objective measure. PESQ reflects the overall speech quality, while STOI reflects speech intelligibility. For both metrics, a higher score indicates better performance. We also conducted subjective listening experiments to further demonstrate the performance of the proposed method.

In previous studies, the WaveGlow model was usually trained on a single-speaker corpus for speech synthesis. However, speech coding for multiple speakers is much more common in real-world communication applications. Therefore, to obtain a good multi-speaker WaveGlow model on the TIMIT corpus, the hyper-parameters should be carefully configured. Considering both the performance of the WaveGlow network and the capacity of our hardware platform (Intel Xeon CPU (2.2 GHz), 128 GB RAM and NVIDIA GeForce GTX 1080Ti GPUs), we configured the hyper-parameters of WaveGlow as shown in Tab. 2. The quantized and unquantized mel-spectrograms were independently used as the input for WaveGlow training, and the ADAM algorithm was chosen as the optimizer with the learning rate given in Tab. 2. After 1,110,000 epochs of training, we obtained a WaveGlow network model, which was used as the decoder for low bit-rate speech coding.
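To illustrate why inference with a trained flow is fast, the generation step can be caricatured with a toy invertible map. This is not WaveGlow's actual architecture (real WaveGlow stacks affine coupling layers and invertible 1×1 convolutions whose parameters depend on the conditioning mel-spectrogram); it only shows the principle that all samples are produced in one parallel pass by inverting the flow on Gaussian noise.

```python
import random

def forward(x, layers):
    """Training direction: map audio toward the latent Gaussian."""
    for a, b in layers:
        x = [a * xi + b for xi in x]
    return x

def inverse(z, layers):
    """Synthesis direction: invert every layer, newest first."""
    for a, b in reversed(layers):
        z = [(zi - b) / a for zi in z]
    return z

layers = [(2.0, 0.5), (0.5, -1.0)]              # toy invertible affine "flows"
z = [random.gauss(0.0, 1.0) for _ in range(4)]  # sample the latent Gaussian
audio = inverse(z, layers)                      # all samples generated at once
# exact invertibility: forward(inverse(z)) recovers z
assert all(abs(u - v) < 1e-9 for u, v in zip(forward(audio, layers), z))
```

By contrast, an autoregressive model such as WaveNet must generate the waveform one sample at a time, which is what makes it too slow for real-time coding.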
For simplicity, we denote the proposed speech coding algorithm via quantization of MFCCs and WaveGlow as the WaveGlow codec. Some other notations are as follows:

• OS: original speech signal
• UQ: speech synthesis from unquantized MFCCs
• UQT2000: WaveGlow codec at 2000 bit/s with unquantized MFCCs as input for training the WaveGlow model
• UQT1000: WaveGlow codec at 1000 bit/s with unquantized MFCCs as input for training the WaveGlow model
• QT2000: WaveGlow codec at 2000 bit/s with quantized MFCCs as input for training the WaveGlow model
• QT1000: WaveGlow codec at 1000 bit/s with quantized MFCCs as input for training the WaveGlow model

Fig. 2. Comparison of spectrograms of the TIMIT utterance "She had your dark suit and greasy wash water all year".

Table 2. Hyper-parameter settings for WaveGlow training.

hyper-parameter | value
number of flows | 12
number of mel-channels | 80
number of groups | 8
number of layers for coupling module | 8
number of mel-channels for coupling module | 256
kernel size for coupling module | 3
learning rate | 1 × 10⁻⁴
batch size | 12
Fig. 3. Speech quality in terms of PESQ score.
Fig. 4. Speech quality in terms of STOI score.

Fig. 2 shows the spectrograms of the speech reconstructed by the WaveGlow codec for a typical TIMIT utterance. We can see that both the harmonic structure and the frequency formants are well preserved, which demonstrates that the original speech and the coded speech sound close. Fig. 3 and Fig. 4 show the speech quality in terms of PESQ and STOI scores on the test set. It should be noted that WaveGlow trained with quantized MFCCs performs better than WaveGlow trained with unquantized MFCCs, because this overcomes the mismatch between the WaveGlow training and inference stages. We can also see that the output speech quality for QT2000 and QT1000 is acceptable, as the PESQ scores of the output speech are about 2.75 and 2.52, respectively. We listened to these coded speech signals and found that the output speech of the WaveGlow codec preserves high intelligibility and some naturalness, though a few audible artifacts exist.

We also conducted subjective listening tests. Ten volunteers rated the coded speech on the standard five-point mean opinion score (MOS) scale [19]. Each volunteer was presented with 20 speech files encoded by the WaveGlow codec and the MELPe codec. The results are illustrated in Fig. 5, which shows that the WaveGlow codec provides substantially improved speech quality over the MELPe codec at similar bit-rates. In detail, the MOS scores for QT2000 and QT1000 are about 3.25 and 2.96, respectively.
Fig. 5. Speech quality in terms of MOS score.
4. CONCLUSIONS
This paper presented a new low bit-rate wideband speech coding approach through vector quantization of MFCCs. WaveGlow was used as the decoder in order to provide efficient and high-quality speech coding at 1000-2000 bit/s. Experimental results demonstrate that the WaveGlow codec is promising for low bit-rate source coding of speech signals with high-speed inference. In the future, other efficient generative models conditioned on mel-spectrograms, such as generative adversarial networks (GANs) [20][21], are also worth exploring for speech coding. Moreover, post-filtering techniques are also worth studying to reduce the audible artifacts.
5. ACKNOWLEDGEMENT
This work is partially supported by the Natural Science Foundation of China (61701535, 61871471) and the Key Research and Development Project of Shaanxi Province (2020GY-015).
6. REFERENCES

[1] T. E. Tremain, "The government standard linear predictive coding algorithm: LPC-10," Speech Technol., vol. 1, pp. 40-49, 1982.
[2] M. R. Schroeder and B. S. Atal, "Code-excited linear prediction (CELP): High-quality speech at very low bit rates," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 1985, pp. 937-940.
[3] A. V. McCree and T. P. Barnwell, "Mixed excitation LPC vocoder model for low bit rate speech coding," IEEE Trans. Speech Audio Process., vol. 3, no. 4, pp. 443-445, 1995.
[4] L. E. Boucheron, P. L. De Leon, and S. Sandoval, "Low bit-rate speech coding through quantization of mel-frequency cepstral coefficients," IEEE Trans. Audio, Speech, Lang. Process., vol. 20, no. 2, pp. 610-619, Feb. 2012.
[5] G. Min, X. W. Zhang, X. Zou, et al., "Perceptually weighted analysis-by-synthesis vector quantization for MFCC codec," IEEE Signal Process. Lett., vol. 23, no. 10, pp. 1379-1383, Oct. 2016.
[6] D. W. Griffin and J. S. Lim, "Signal estimation from modified short-time Fourier transform," IEEE Trans. Acoust., Speech, Signal Process., vol. 32, no. 2, pp. 236-243, Apr. 1984.
[7] T. Bäckström, "End-to-end optimization of source models for speech and audio coding using a machine learning framework," in Proc. Interspeech, 2019, pp. 3401-3405.
[8] W. B. Kleijn, F. S. C. Lim, A. Luebs, et al., "WaveNet based low bit rate speech coding," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 2018, pp. 676-680.
[9] C. Gârbacea, A. van den Oord, Y. Li, et al., "Low bit-rate speech coding with VQ-VAE and a WaveNet decoder," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 2019, pp. 735-739.
[10] J. Klejsa, P. Hedelin, C. Zhou, et al., "High-quality speech coding with SampleRNN," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 2019, pp. 7155-7159.
[11] J. Valin and J. Skoglund, "LPCNet: Improving neural speech synthesis through linear prediction," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 2019, pp. 5891-5895.
[12] R. Fejgin, J. Klejsa, L. Villemoes, et al., "Source coding of audio signals with a generative model," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 2020, pp. 341-345.
[13] G. Min, C. Q. Zhang, X. W. Zhang, et al., "Deep Vocoder: Low bit rate speech compression with deep autoencoder," in Proc. IEEE Int. Conf. Multimedia and Expo Workshops (ICMEW), 2019, pp. 1-6.
[14] H. Yalim Keles, J. Rozhon, H. Gokhan Ilk, et al., "DeepVoCoder: A CNN model for compression and coding of narrow band speech," IEEE Access, vol. 7, pp. 75081-75089, Jun. 2019.
[15] R. Prenger, R. Valle, and B. Catanzaro, "WaveGlow: A flow-based generative network for speech synthesis," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 2019, pp. 3617-3621.
[16] P. Govalkar, J. Fischer, F. Zalkow, et al., "A comparison of recent neural vocoders for speech signal reconstruction," in Proc. 10th ISCA Speech Synthesis Workshop (SSW), 2019, pp. 7-12.
[17] A. W. Rix, J. G. Beerends, M. P. Hollier, et al., "Perceptual evaluation of speech quality (PESQ) - a new method for speech quality assessment of telephone networks and codecs," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 2001, vol. II, pp. 749-752.
[18] C. H. Taal, R. C. Hendriks, R. Heusdens, et al., "An algorithm for intelligibility prediction of time-frequency weighted noisy speech," IEEE Trans. Audio, Speech, Lang. Process., vol. 19, no. 7, pp. 2125-2136, Jul. 2011.
[19] P. C. Loizou, Speech Enhancement: Theory and Practice. Boca Raton, FL: CRC Press, 2007.
[20] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, et al., "Generative adversarial networks," arXiv preprint arXiv:1406.2661, 2014.
[21] K. Kumar, R. Kumar, T. de Boissiere, et al., "MelGAN: Generative adversarial networks for conditional waveform synthesis," in Proc. Advances in Neural Information Processing Systems (NeurIPS), 2019.