End-to-End Optimized Speech Coding with Deep Neural Networks
Srihari Kankanahalli
Bloomberg [email protected]
ABSTRACT
Modern compression algorithms are often the result of laborious domain-specific research; industry standards such as MP3, JPEG, and AMR-WB took years to develop and were largely hand-designed. We present a deep neural network model which optimizes all the steps of a wideband speech coding pipeline (compression, quantization, entropy coding, and decompression) end-to-end directly from raw speech data: no manual feature engineering is necessary, and it trains in hours. In testing, our DNN-based coder performs on par with the AMR-WB standard at a variety of bitrates (∼9 kbps up to ∼24 kbps).

Index Terms — speech coding, deep learning, neural networks, end-to-end training, compression
1. INTRODUCTION
The everyday applications of data compression are ubiquitous: streaming live videos and music in realtime across the planet, storing thousands of images and songs on a single tiny thumb drive, and more. In a way, improved compression was what made these innovations possible in the first place, and designing better and more efficient methods of compression could help expand them even further (to developing nations with slower Internet speeds, for example).

Essentially all modern compression standards are hand-designed, including the most prominent wideband speech coder: AMR-WB [1]. It was created by eight speech coding researchers working at the VoiceAge Corporation (in Montreal) and the Nokia Research Center (in Finland) over two years, and it provides speech at a wide variety of bitrates ranging from 7 kbps through 24 kbps. (For reference, uncompressed wideband speech has a bitrate of 256 kbps.)

Recently, deep neural networks have shown an incredible ability to learn directly from data, circumventing traditional feature engineering to produce state-of-the-art results in a variety of areas [2]. Neural networks have seen significant historical interest from compression researchers, but almost always as an intermediate pipeline step, or as a way to optimize the parameters of an intermediate step [3]. For example, Krishnamurthy et al. [4] used a neural network to perform vector quantization on speech features; Wu et al. [5] used an ANN as part of a predictive speech coder; and Cernak et al. [6] used a deep neural network as a phonological vocoder.

Our proposal is different in nature from all of these: we reframe the entire compression pipeline, from start to finish, as a neural network optimization problem (along the lines of classical autoencoders). As far as we know, this is only the second published work to learn an audio compression pipeline end-to-end (the previous being an obscure early attempt by Morishima et al. in 1990 [7]), and the first to compete with a contemporary standard. Cernak et al. [8] proposed a nearly end-to-end design for a very-low-bitrate, low-quality speech coder in 2016; however, their pipeline still required extraction of acoustic features and pitch (and was also quite complex, composing several different deep and spiking neural networks together). All other related designs we know of employ ANNs as a mere component of a larger hand-designed system.

In the domain of image compression, there has been some interest in training ANN-based systems since the 1990s [9], but this did not yield state-of-the-art results until fairly recently either (starting in August 2016, when Toderici et al. trained a neural network model outperforming JPEG [10]). Thus, it seems our work is on the cutting edge of both deep learning research and compression research.
2. NETWORK ARCHITECTURE AND TRAINING METHODOLOGY
Our network architecture, shown in Figure 1, is inspired by both residual neural networks [11] and autoencoders. The model is composed of an encoder subnetwork and a decoder subnetwork; it takes in a vector of 512 speech samples (a 32 ms speech window) and outputs another vector of 512 speech samples (the reconstructed window after compression and decompression). The network is composed of 4 different types of residual blocks [11], shown in Figure 2. All convolutions use 1D filters of size 9 and PReLU activations [12]; the upsample block uses subpixel convolutions [13]. (We were unable to successfully incorporate batch normalization.)
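To make the block structure concrete, here is a minimal sketch in TensorFlow/Keras of three of the four block types. The channel counts, strides, and upsampling factor are illustrative assumptions; the paper specifies only the size-9 1D filters, the PReLU activations, and the subpixel-convolution upsampling.

```python
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, channels):
    # Two size-9 1D convolutions with PReLU, plus an identity skip.
    # Assumes x already has `channels` channels so the addition type-checks.
    y = layers.Conv1D(channels, 9, padding="same")(x)
    y = layers.PReLU(shared_axes=[1])(y)
    y = layers.Conv1D(channels, 9, padding="same")(y)
    return layers.PReLU(shared_axes=[1])(layers.Add()([x, y]))

def downsample_block(x, channels):
    # Strided size-9 convolution that halves the temporal resolution.
    y = layers.Conv1D(channels, 9, strides=2, padding="same")(x)
    return layers.PReLU(shared_axes=[1])(y)

def upsample_block(x, channels, factor=2):
    # Subpixel ("pixel shuffle") upsampling in 1D: convolve to
    # factor*channels filters, then fold the extra channels into
    # extra time steps.
    y = layers.Conv1D(channels * factor, 9, padding="same")(x)
    y = layers.PReLU(shared_axes=[1])(y)
    def shuffle(t):
        shape = tf.shape(t)
        # (batch, time, channels*factor) -> (batch, time*factor, channels)
        return tf.reshape(t, [shape[0], shape[1] * factor, channels])
    return layers.Lambda(shuffle)(y)
```

The fourth type, the "channel change" block, would presumably add a convolution on the skip path so the residual addition still type-checks when channel counts differ; the paper leaves those details to Figure 2.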
Quantization, i.e. mapping the real-valued output of a neural network into discrete bins, is an essential part of our pipeline. However, quantization is inherently non-differentiable, and therefore incompatible with the standard gradient-descent-based methods used to train neural networks.

[Fig. 1: Simplified network architecture.]

In order to circumvent this, we use a differentiable approximation first discussed by Agustsson et al. [14]. Specifically, we reframe scalar quantization as nearest-neighbor assignment: given a list B of N bins, we quantize a scalar x by assigning it to the nearest quantization bin. This operation still isn't differentiable, but it can be approximated as follows:

$D = [\,|x - B_1|, \ldots, |x - B_N|\,] \in \mathbb{R}^N$    (1)

$S = \mathrm{softmax}(-\sigma D)$    (2)

S is a soft assignment over the N quantization bins, which becomes a hard assignment as σ → ∞ (and can later be rounded into one). On the decoder side, we can "dequantize" S back into a real value Ŝ by taking the dot product of S and B. Since Agustsson et al. did not give this approximation a name, we hereby dub it softmax quantization.

In practice, we noticed no problems training with very high temperature values from the start. For all experiments, we initialized with σ = 300, making σ and B trainable parameters of the network. (We also found that scalar quantization gave better-sounding results than the vector quantization more prominently discussed by Agustsson et al.)
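The whole scheme takes only a few lines of TensorFlow. The sketch below is ours, not code from the paper; in the real network, `bins` (B) and `sigma` (σ) would be trainable variables rather than constants.

```python
import tensorflow as tf

def softmax_quantize(x, bins, sigma):
    # Eq. (1): distance from each scalar in x to each of the N bins in B.
    d = tf.abs(tf.expand_dims(x, -1) - bins)      # shape (..., N)
    # Eq. (2): soft nearest-neighbor assignment; hardens as sigma grows.
    return tf.nn.softmax(-sigma * d, axis=-1)

def dequantize(soft_assign, bins):
    # Decoder side: dot product of the assignment with the bin values B.
    return tf.tensordot(soft_assign, bins, axes=[[-1], [0]])

def hard_assign(soft_assign):
    # At test time, the soft assignment is rounded to a hard one-hot symbol.
    n = soft_assign.shape[-1]
    return tf.one_hot(tf.argmax(soft_assign, axis=-1), depth=n)

# Example: quantize a batch of encoder outputs to 32 bins with sigma = 300.
bins = tf.linspace(-1.0, 1.0, 32)
s = softmax_quantize(tf.constant([[0.13, -0.52]]), bins, sigma=300.0)
x_hat = dequantize(s, bins)   # approximately the inputs, snapped toward bins
```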
The network's objective function is as follows:

$O(x, y, c) = \lambda_{mse}\,\ell(x, y) + \lambda_{perceptual}\,P(x, y) + \lambda_{quantization}\,Q(c) + \lambda_{entropy}\,E(c)$    (3)

where x is the original signal, y is the reconstructed signal, c is the encoder's output (the soft assignments to quantization bins), ℓ(x, y) is mean-squared error, and each λ is the weight for the corresponding loss. P(x, y), Q(c), and E(c) are supplemental losses, which we now discuss in more depth.

[Fig. 2: The four block types used in our network architecture: (a) residual, (b) channel change, (c) downsample, (d) upsample.]

• Perceptual loss. Training a model solely to minimize mean-squared error often leads to blurry reconstructions lacking in high-frequency content [15][16]. Therefore, we augment our model with a perceptual loss. We compute MFCCs [17] for both the original and reconstructed signals, and use the distance ℓ between MFCC vectors as a proxy for perceptual distance. To allow for both coarse and fine differentiation, we use 4 MFCC filterbanks of sizes 8, 16, 32, and 128 (all three supplemental losses are sketched in code after this list):

$P(x, y) = \sum_{i=1}^{4} \ell(M_i(x), M_i(y))$    (4)

where $M_i$ is the MFCC function for filterbank i.
• Quantization penalty. Because softmax quantization is a continuous approximation, it is possible for the network to learn how to generate values outside the intended quantization bins, and it almost always will if there is no additional penalty for doing so. Therefore, we define a loss function favoring soft assignments close to one-hot vectors:

$Q(c) = \frac{1}{256} \sum_{i=1}^{256} \left[\left(\sum_{j=1}^{N} \sqrt{c_{i,j}}\right) - 1\right]$    (5)

Q(c) is zero when all 256 encoded symbols are one-hot vectors, and nonzero otherwise. (For any probability vector $c_i$, $\sum_j \sqrt{c_{i,j}} \geq 1$, with equality exactly when $c_i$ is one-hot.)
• Entropy control. We apply entropy coding to the quantized symbols, which provides a simple way to specify different bitrates without having to engineer entirely different network architectures for each one. Depending on our desired bitrate, we can constrain the entropy of the encoder's output to be higher or lower (by modifying the loss weight λ_entropy appropriately). To estimate the encoder's entropy, we compute a probability distribution h specifying how often each quantized symbol appears in the encoder's output, by averaging all of the soft assignments the encoder generates over one minibatch. Thus, our entropy estimate is:

$h = \mathrm{histogram}(c), \qquad E(c) = -\sum_i h_i \log_2(h_i)$    (6)
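All three supplemental losses are straightforward to express in TensorFlow. The sketch below is our reading of Eqs. (4)–(6); the STFT frame parameters, the small ε terms for numerical stability, and the base-2 logarithm (so that E(c) comes out in bits, matching Eq. (7) below) are assumptions not fixed by the text.

```python
import tensorflow as tf

def mfcc(signal, num_mel_bins, sample_rate=16000):
    # MFCCs for one filterbank size, built from tf.signal primitives.
    # The frame length/step here are illustrative assumptions.
    stft = tf.signal.stft(signal, frame_length=512, frame_step=256)
    spectrogram = tf.abs(stft)
    mel_matrix = tf.signal.linear_to_mel_weight_matrix(
        num_mel_bins=num_mel_bins,
        num_spectrogram_bins=spectrogram.shape[-1],
        sample_rate=sample_rate)
    mel = tf.tensordot(spectrogram, mel_matrix, 1)
    return tf.signal.mfccs_from_log_mel_spectrograms(tf.math.log(mel + 1e-6))

def perceptual_loss(x, y):
    # Eq. (4): MFCC distance summed over the four filterbank resolutions.
    return tf.add_n([
        tf.reduce_mean(tf.math.squared_difference(mfcc(x, n), mfcc(y, n)))
        for n in (8, 16, 32, 128)])

def quantization_penalty(soft_assign):
    # Eq. (5): for a probability vector c_i, sum_j sqrt(c_ij) >= 1 with
    # equality exactly when c_i is one-hot, so this mean is zero iff every
    # symbol's soft assignment is one-hot.
    return tf.reduce_mean(
        tf.reduce_sum(tf.sqrt(soft_assign + 1e-12), axis=-1) - 1.0)

def entropy_estimate(soft_assign):
    # Eq. (6): average the soft assignments over the minibatch to form a
    # histogram h over symbols, then compute its entropy in bits.
    n = soft_assign.shape[-1]
    h = tf.reduce_mean(tf.reshape(soft_assign, [-1, n]), axis=0)
    return -tf.reduce_sum(h * tf.math.log(h + 1e-12)) / tf.math.log(2.0)
```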
We train the network on samples from the TIMIT speech corpus [18], which contains over 6,000 wideband recordings of 630 American English speakers from 8 major dialects. We create smaller training/validation/test sets from the pre-existing train/test split: our training set consists of 3,000 files from the original train set, our validation set consists of 200 files from the original train set, and our test set consists of 500 files from the original test set. Each set contains an even distribution over the 8 dialects, and they do not share any speakers. Additionally, we preprocess each speech file by maximizing its volume.

We extract raw speech windows of length 32 ms (512 speech samples), with an overlap of 2 ms (32 samples), using a Hann window in the overlap region. This means that each speech window covers a total of 480 unique samples, or 30 ms of speech.

Our training process takes place in two stages:

1. Quantization off. The network is trained without quantization; in this stage, only the mean-squared error (ℓ) and perceptual losses are enabled. After 5 epochs, the quantization bins are initialized using K-means clustering, λ_entropy is set to an initial value τ_initial, and quantization is turned on. We found that this "pre-training" period improved the stability and quality of the network's output.

2. Quantization on. The network is trained for 145 more epochs, targeting a specified bitrate. At the end of each epoch, we evaluate the model's mean PESQ over our validation set, and save the best-performing one. We also estimate the average bitrate of the encoder:

$\mathrm{bitrate} = (\mathrm{windows/sec}) \times (\mathrm{symbols/window}) \times (\mathrm{bits/symbol}) = \frac{16000}{512 - 32} \times 256 \times E(c)\ \mathrm{bps}$    (7)

If the estimated bitrate is above the target bitrate region, then λ_entropy is increased by a small value τ_change; if it is below the target region, then λ_entropy is decreased by τ_change. This removes the need to manually find the optimal λ_entropy for each target bitrate. (The target region is defined as our target bitrate plus or minus a fixed tolerance; a code sketch of this control loop follows below.)

During training, the learning rate is annealed from an initial value α_initial to a final value α_final, using cosine annealing [19][20]. We repeat the training process for each bitrate we want to target; for example, if we want to target 4 different bitrates, we train 4 networks (using the same architecture, but ending up with different sets of weights). The training process takes about 20 hours per network, on a GeForce GTX 1080 Ti.
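Here is a minimal sketch of the per-epoch bitrate control loop from stage 2, using Eq. (7). The names `target_bitrate`, `tolerance`, and `tau_change` are our stand-ins for the paper's target region and τ parameters.

```python
SAMPLE_RATE = 16000
WINDOW = 512            # samples per speech window
OVERLAP = 32            # samples shared between adjacent windows
SYMBOLS_PER_WINDOW = 256

def estimated_bitrate(bits_per_symbol):
    # Eq. (7): (windows/sec) * (symbols/window) * (bits/symbol).
    windows_per_sec = SAMPLE_RATE / (WINDOW - OVERLAP)   # = 33.33... Hz
    return windows_per_sec * SYMBOLS_PER_WINDOW * bits_per_symbol

def update_entropy_weight(lambda_entropy, bits_per_symbol,
                          target_bitrate, tolerance, tau_change):
    # Run once per epoch: nudge the entropy weight until the estimated
    # bitrate falls inside the target region (target +/- tolerance).
    bitrate = estimated_bitrate(bits_per_symbol)
    if bitrate > target_bitrate + tolerance:
        lambda_entropy += tau_change   # too many bits: penalize entropy more
    elif bitrate < target_bitrate - tolerance:
        lambda_entropy -= tau_change   # too few bits: relax the penalty
    return lambda_entropy
```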
3. RESULTS

3.1. Objective Quality Evaluation
We evaluated the average PESQ of our speech coder versus the AMR-WB standard around 4 different target bitrates. The results are shown in Figure 3, and we reproduce them below (bitrates in kbps):
                     AMR-WB              DNN
Dataset          Bitrate   PESQ     Bitrate   PESQ
Training set       8.85    3.478      9.02    3.643
                  15.85    4.012     16.24    4.123
                  19.85    4.103     20.06    4.202
                  23.85    4.138     24.06    4.283
Validation set     8.85    3.674      9.02    3.730
                  15.85    4.176     16.24    4.225
                  19.85    4.244     19.70    4.298
                  23.85    4.290     23.71    4.372
Test set           8.85    3.521      9.02    3.629
                  15.85    4.063     16.24    4.133
                  19.85    4.145     20.06    4.215
                  23.85    4.178     24.06    4.296

Our speech coder outperforms AMR-WB at all bitrates, especially higher rates. The gap is bigger on the training set than on the validation or test sets, indicating possible overfitting (note that we did not use dropout or weight regularization).
[Fig. 3: Mean PESQ of our encoder, compared with AMR-WB at different bitrates: (a) training set, (b) validation set, (c) test set.]
3.2. Subjective Quality Evaluation

We conducted a simple preference test using Amazon Mechanical Turk. 20 speech files were randomly selected from the test set and processed with both AMR-WB and our method, at the same 4 target bitrates as before. Then, 20 listeners were presented the original speech signal plus both processed versions (unlabeled and randomly switched). Each listener was asked to pick which of the two he or she preferred. The subjects' average preferences are recorded below:
Target Bitrate DNN No Preference AMR-WB
3.3. Runtime Performance

We evaluated the average time our model takes to encode and decode one 30 ms window, on an Intel i7-4970K CPU (3.8 GHz) and a GeForce GTX 1080 Ti GPU:
Processor    Encoder     Decoder     Total
CPU         10.52 ms    10.90 ms    21.42 ms
GPU          2.43 ms     2.35 ms     4.78 ms

Our speech coder runs in realtime (under 30 ms for combined encode and decode) without any optimizations beyond those already provided by TensorFlow and Keras. However, it is important to note that real speech coders will need to run on processors much slower than the CPU we used.
4. CONCLUSION
We have shown a proof-of-concept applying deep neural networks (DNNs) to speech coding, with very promising results. Our wideband speech coder is learned end-to-end from the raw signal, with almost no audio-specific processing aside from a relatively simple perceptual loss; nevertheless, it manages to compete with current standards.

The key to further increasing quality probably lies in our perceptual model, which could be significantly more complex and nuanced. This is where psychoacoustic theory can come into the picture once again: to develop a differentiable perceptual loss for this and other audio processing tasks. In addition, expanding the training data to include music and background noise instead of solely clean speech may be fruitful.

Finally, while our DNN-based coder already runs in realtime on a modern desktop CPU, it is still a far cry from running on embedded systems or cellphones. Model compression, transfer learning, and clever architecture designs are all interesting areas which could be explored here.
5. HYPERPARAMETERS
For purposes of reproducibility, the parameters used for all experiments (σ_initial, α_initial, α_final, λ_perceptual, λ_quantization, λ_mse, τ_initial, τ_change, and N) are available at http://srik.tk/speech-coding.
6. REFERENCES

[1] Bruno Bessette, Redwan Salami, Roch Lefebvre, Milan Jelinek, Jani Rotola-Pukkila, Janne Vainio, Hannu Mikkola, and Kari Jarvinen, "The adaptive multi-rate wideband speech codec (AMR-WB)," IEEE Transactions on Speech and Audio Processing, vol. 10, no. 8, pp. 620–636, 2002.

[2] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.

[3] Robert D. Dony and Simon Haykin, "Neural network approaches to image compression," Proceedings of the IEEE, vol. 83, no. 2, pp. 288–303, 1995.

[4] Ashok K. Krishnamurthy, Stanley C. Ahalt, Douglas E. Melton, and Prakoon Chen, "Neural networks for vector quantization of speech and images," IEEE Journal on Selected Areas in Communications, vol. 8, no. 8, pp. 1449–1457, 1990.

[5] Lizhong Wu, Mahesan Niranjan, and Frank Fallside, "Fully vector-quantized neural network-based code-excited nonlinear predictive speech coding," IEEE Transactions on Speech and Audio Processing, vol. 2, no. 4, pp. 482–489, 1994.

[6] Milos Cernak, Blaise Potard, and Philip N. Garner, "Phonological vocoding using artificial neural networks," in Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on. IEEE, 2015, pp. 4844–4848.

[7] Shigeo Morishima, H. Harashima, and Y. Katayama, "Speech coding based on a multi-layer neural network," in Communications, 1990. ICC '90, Including Supercomm Technical Sessions. SUPERCOMM/ICC '90. Conference Record., IEEE International Conference on. IEEE, 1990, pp. 429–433.

[8] Milos Cernak, Alexandros Lazaridis, Afsaneh Asaei, and Philip N. Garner, "Composition of deep and spiking neural networks for very low bit rate speech coding," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 12, pp. 2301–2312, 2016.

[9] J. Jiang, "Image compression with neural networks – a survey," Signal Processing: Image Communication, vol. 14, no. 9, pp. 737–760, 1999.

[10] George Toderici, Damien Vincent, Nick Johnston, Sung Jin Hwang, David Minnen, Joel Shor, and Michele Covell, "Full resolution image compression with recurrent neural networks," arXiv preprint arXiv:1608.05148, 2016.

[11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.

[12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, "Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1026–1034.

[13] Wenzhe Shi, Jose Caballero, Ferenc Huszár, Johannes Totz, Andrew P. Aitken, Rob Bishop, Daniel Rueckert, and Zehan Wang, "Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1874–1883.

[14] Eirikur Agustsson, Fabian Mentzer, Michael Tschannen, Lukas Cavigelli, Radu Timofte, Luca Benini, and Luc Van Gool, "Soft-to-hard vector quantization for end-to-end learned compression of images and neural networks," arXiv preprint arXiv:1704.00648, 2017.

[15] Michael Mathieu, Camille Couprie, and Yann LeCun, "Deep multi-scale video prediction beyond mean square error," arXiv preprint arXiv:1511.05440, 2015.

[16] Alexey Dosovitskiy and Thomas Brox, "Generating images with perceptual similarity metrics based on deep networks," in Advances in Neural Information Processing Systems, 2016, pp. 658–666.

[17] Lindasalwa Muda, Mumtaj Begam, and Irraivan Elamvazuthi, "Voice recognition algorithms using mel frequency cepstral coefficient (MFCC) and dynamic time warping (DTW) techniques," arXiv preprint arXiv:1003.4083, 2010.

[18] John S. Garofolo, Lori F. Lamel, William M. Fisher, Jonathan G. Fiscus, David S. Pallett, Nancy L. Dahlgren, and Victor Zue, "TIMIT acoustic-phonetic continuous speech corpus," Linguistic Data Consortium, 1993.

[19] Ilya Loshchilov and Frank Hutter, "SGDR: Stochastic gradient descent with restarts," arXiv preprint arXiv:1608.03983, 2016.

[20] Xavier Gastaldi, "Shake-shake regularization," arXiv preprint arXiv:1705.07485, 2017.