Audio Dequantization for High Fidelity Audio Generation in Flow-based Neural Vocoder
Hyun-Wook Yoon, Sang-Hoon Lee, Hyeong-Rae Noh, Seong-Whan Lee
Department of Computer and Radio Communications Engineering, Korea University, Seoul, Korea
Department of Brain and Cognitive Engineering, Korea University, Seoul, Korea
Department of Artificial Intelligence, Korea University, Seoul, Korea
{hw_yoon, sh_lee, hr_noh, sw.lee}@korea.ac.kr

Abstract
In recent works, flow-based neural vocoders have shown significant improvement in real-time speech generation tasks. A sequence of invertible flow operations allows the model to convert samples from a simple distribution into audio samples. However, training a continuous density model on discrete audio data can degrade model performance due to the topological difference between the latent and the actual distribution. To resolve this problem, we propose audio dequantization methods for flow-based neural vocoders for high fidelity audio generation. Data dequantization is a well-known method in image generation but has not yet been studied in the audio domain. For this reason, we implement various audio dequantization methods in a flow-based neural vocoder and investigate their effect on the generated audio. We conduct various objective performance assessments and a subjective evaluation to show that audio dequantization can improve audio generation quality. In our experiments, using audio dequantization produces waveform audio with a better harmonic structure and fewer digital artifacts.
Index Terms: audio synthesis, neural vocoder, flow-based generative models, data dequantization, deep learning
1. Introduction
Most speech synthesis models take a two-stage procedure to generate waveform audio from text. The first stage generates a spectrogram conditioned on linguistic features such as text or phonemes [1–5]. The second stage, generally referred to as the vocoder stage, generates audio samples through a model capable of estimating them from the acoustic features. Traditional approaches estimated audio samples either directly from a spectral density model [6] or from a hand-crafted acoustic model [7, 8], but these approaches tended to produce low-quality audio. After the emergence of WaveNet [9], models that generate each audio sample conditioned on previously generated samples have shown exceptional results in the field [10–12]. Nevertheless, the dilated causal convolution networks used in these models require a sequential generation process during inference, which means that real-time speech synthesis is hard to achieve because parallel inference cannot be utilized. For this reason, generating high-quality waveform audio in real time has become a challenging task. To overcome the structural limitation of the autoregressive model, most recent works have focused on
non-autoregressive models such as knowledge distillation [13, 14], generative adversarial networks [15–19], and flow-based generative models [20, 21]. We focus on the flow-based generative model since it can model a highly flexible approximate posterior distribution in variational inference [22]. The transformation from a single data point to Gaussian noise is one-to-one, which makes parallel generation possible. However, we have to acknowledge that audio samples are discrete data. In other words, naively modeling a continuous probability density on discrete data can produce arbitrarily high likelihood at the discrete locations [23, 24]. This can lead to degraded generation performance in a flow-based neural vocoder. Therefore, dequantization is required before the transformation.

In this paper, we present various audio dequantization schemes that can be implemented in a flow-based neural vocoder. In image generation, adding continuous noise to data points to dequantize the data is commonly used. However, to the best of our knowledge, the effectiveness of data dequantization in the audio domain is still unexplored, so further investigation is needed. Unlike the pixels of an image, audio samples are bounded to signed integers. To overcome this domain issue, we normalize either the range of the noise values or the range of the audio samples with different normalization methods. In addition, we adapt the flow block from a flow-based neural vocoder to generate more flexible noise, known as variational dequantization [25].

This work was supported by the Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2019-0-00079, Department of Artificial Intelligence, Korea University), the Magellan Division of Netmarble Corporation, and the Seoul R&BD Program (CY190019).
2. Flow-based Neural Vocoder
FloWaveNet [21] and WaveGlow [20] are two pioneers among flow-based neural vocoders. Both models are based on normalizing flows [22]. Two main contributions that they share are training simplicity and faster generation. Since they use a single invertible flow network repeatedly, the model structure is intuitive. Moreover, optimization can be easily done with a single log-likelihood loss function. During inference, random noise whose length equals the product of the number of mel-spectrogram frames and the hop size is sampled from a spherical Gaussian distribution and simultaneously converted to audio samples. As a result, flow-based neural vocoders can produce waveform signals as fast as other non-autoregressive models.

In general, a flow-based neural vocoder requires three steps: squeeze, flow, and shuffle. In the squeezing step, the temporal dimension of the feature is reduced while its channel dimension is increased. According to FloWaveNet [21], this operation increases the size of the receptive field, like the dilated convolution layers of WaveNet [9]. During the flow step, multiple blocks of flow apply an affine transformation to half of the input vector. In detail, one half of the input vector is used to predict shift and scale parameters for the other half in each flow operation.
Lastly, the shuffle step mixes the elements of the data, giving flexibility to the transformation.

Although both models share similar concepts, they use different techniques in detail. FloWaveNet [21] defines one squeeze operation and multiple flow operations as a single context block and duplicates this context block to form the flow process. The model also implements the activation normalization layer suggested in Glow [26] before the coupling layer to stabilize training. After each flow, the model simply swaps the odd and even elements of the vector to shuffle features. WaveGlow [20] operates the squeezing step only once during the process. Also, the model adapts the invertible 1x1 convolution from Glow to operate the shuffle step before each flow operation.

Figure 1: Examples of audio dequantization for flow-based neural vocoder. (a) represents uniform dequantization. (b) represents Gaussian dequantization. (c) represents variational dequantization.
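To make the squeeze and flow steps concrete, below is a minimal PyTorch sketch of a squeeze operation and an affine coupling step. The plain convolutional conditioner, the hidden size, and the omission of mel-spectrogram conditioning are our simplifying assumptions, not the architecture used in FloWaveNet or WaveGlow:

```python
import torch
import torch.nn as nn

def squeeze(x: torch.Tensor, factor: int = 2) -> torch.Tensor:
    # Trade time for channels: [B, C, T] -> [B, C * factor, T // factor].
    b, c, t = x.size()  # T must be divisible by factor
    x = x.view(b, c, t // factor, factor)
    return x.permute(0, 1, 3, 2).contiguous().view(b, c * factor, t // factor)

class AffineCoupling(nn.Module):
    # One flow step: half of the channels predicts shift/scale for the other half.
    def __init__(self, channels: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(  # stand-in for the WaveNet-like conditioner
            nn.Conv1d(channels // 2, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv1d(hidden, channels, 3, padding=1),
        )

    def forward(self, x):
        xa, xb = x.chunk(2, dim=1)
        log_s, t = self.net(xa).chunk(2, dim=1)
        yb = xb * torch.exp(log_s) + t           # affine transformation
        logdet = log_s.sum(dim=(1, 2))           # enters the log-likelihood loss
        return torch.cat([xa, yb], dim=1), logdet

    def inverse(self, y):
        ya, yb = y.chunk(2, dim=1)
        log_s, t = self.net(ya).chunk(2, dim=1)
        xb = (yb - t) * torch.exp(-log_s)        # exact inverse enables parallel synthesis
        return torch.cat([ya, xb], dim=1)
```

Because every step is exactly invertible, sampling noise and running `inverse` through the stacked flows yields audio in a single parallel pass.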
3. Audio Dequantization
Raw audio is stored digitally in a computer. In other words, the values of the audio form a discrete representation. Therefore, naively transforming audio samples into Gaussian noise can lead to arbitrarily high likelihood at the data values in a flow-based neural vocoder. To resolve this issue, we adapt the idea of adding noise to each data point to dequantize discretely distributed data from the image generation task [23]. In an image, a pixel $x$ is represented as a single discrete value in $\{0, 1, \ldots, 255\}$, so the dequantized data $y$ can be formulated as $y = x + u$, where $u$ represents the $D$ components of noise bounded to $[0, 1)^D$.

Unlike images, raw audio encoded in 16-bit WAV contains 15 bits of negative and positive integer values, which can be represented as $\{-32768, -32767, \ldots, 32766, 32767\}$. To apply data dequantization to raw audio, either the range of the audio samples must be compressed to 8-bit unsigned integers, or the range of the dequantized data has to lie within $(-1, 1)^D$. For proper audio dequantization, we present three different methods in the following sections.

3.1. Uniform Dequantization

In [23], the authors note that optimizing the continuous model $p_{model}(y)$ on the dequantized data $y \sim p_{data}$ can closely optimize the discrete model $P_{model}(x)$ on the original data $x \sim P_{data}$ through Jensen's inequality, which can be formulated as below:

$\int p_{data}(y) \log p_{model}(y) \, dy$  (1)
$= \sum_{x} P_{data}(x) \int_{[0,1)^D} \log p_{model}(x + u) \, du$  (2)
$\leq \sum_{x} P_{data}(x) \log \int_{[0,1)^D} p_{model}(x + u) \, du$  (3)
$= \mathbb{E}_{x \sim P_{data}}[\log P_{model}(x)]$  (4)

Since the uniform noise is bounded to $[0, 1)^D$, the values of the audio samples have to be bounded to unsigned integers. For this purpose, we preprocess the raw audio with the nonlinear companding method called 'mu-law companding' [27]. This method can significantly reduce the range of audio samples while minimizing the quantization error. The redistribution equation of the method can be expressed as:

$\hat{x} = \mathrm{sign}(x) \, \frac{\ln(1 + \mu|x|)}{\ln(1 + \mu)}$  (5)

where the sign function represents the sign of the value $x$, and $\mu$ represents the number of integers that the $x$ values are mapped to. We set $\mu$ to 255 to apply 8-bit mu-law companding. Then, we add random noise drawn from the uniform distribution $\mathrm{Unif}(0, 1)$.

We assume that companding audio with a lossy compression can produce noisy output. Therefore, we implement the importance-weighted (iw) dequantization proposed in [24] to improve generation quality. In that paper, the authors demonstrate that sampling the noise multiple times can directly approximate the objective log-likelihood, which can lead to better log-likelihood performance. As a result, we define the uniformly dequantized data $y_u$ as:

$y_u = \hat{x} + \frac{1}{K} \sum_{k=1}^{K} \mathrm{Unif}(0, 1)_k$  (6)

where $\mathrm{Unif}(0, 1)_k$ denotes noise sampled uniformly from $[0, 1)^D$. We set $K$ to 10. To compare the performance of the model depending on iw dequantization, we refer to uniform dequantization with iw dequantization as Uniform IW and to the model with only uniform dequantization as Uniform.
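As a rough illustration of Eqs. (5) and (6), the following NumPy sketch applies mu-law companding followed by uniform or importance-weighted dequantization. The quantization of the companded signal into integer bins is our assumption about the preprocessing, not a detail stated above:

```python
import numpy as np

MU = 255  # 8-bit mu-law

def mu_law_compand(x, mu=MU):
    # Eq. (5): compress audio in [-1, 1], then map to unsigned 8-bit bins (assumed step).
    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)  # y in [-1, 1]
    return np.floor((y + 1) / 2 * mu).astype(np.int64)        # bins {0, ..., 255}

def uniform_dequantize(x_hat, K=1):
    # Eq. (6): K = 1 gives plain uniform dequantization, K = 10 the iw variant.
    noise = np.random.uniform(0.0, 1.0, size=(K,) + x_hat.shape).mean(axis=0)
    return x_hat + noise

# Example: dequantize one second of a 440 Hz sine at 22,050 Hz.
t = np.linspace(0, 1, 22050, endpoint=False)
y_u = uniform_dequantize(mu_law_compand(np.sin(2 * np.pi * 440 * t)), K=10)
```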
3.2. Gaussian Dequantization

In a flow-based neural vocoder, the discrete data distribution is transformed into a spherical Gaussian distribution. In other words, dequantizing the data distribution toward a normal distribution can be the more natural choice. With this thought in mind, we formulate Gaussian dequantization, motivated by logistic-normal distributions [28]. Random noise samples are generated from a normal distribution $N(\mu, \sigma)$, where the mean and variance are calculated from the given data batch. To properly implement this in the audio domain, we apply a hyperbolic tangent function to normalize the noise boundary to $(-1, 1)^D$. As a result, the normally dequantized data $y_n$ can be formulated as:

$y_n = x + \tanh(N(M(x_b), \Sigma(x_b)))$  (7)

where $M(x_b)$ and $\Sigma(x_b)$ represent the mean and variance of the batch group $x_b$. To compare model performance between the conventional and the improved method, we refer to the conventional method suggested in [28] as Gaussian Sig and to the proposed method as Gaussian Tanh.
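A minimal sketch of Eq. (7), assuming $x$ is a batch of integer-valued samples and that the statistics are taken over the whole batch; the sigmoid-based normalization of the conventional Gaussian Sig variant is not reproduced here:

```python
import torch

def gaussian_tanh_dequantize(x: torch.Tensor) -> torch.Tensor:
    # Eq. (7): the noise follows the batch statistics, and tanh bounds it
    # to (-1, 1), i.e., within one quantization step of integer-valued samples.
    mu, sigma = x.float().mean(), x.float().std()
    noise = torch.tanh(mu + sigma * torch.randn_like(x.float()))
    return x.float() + noise
```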
3.3. Variational Dequantization

Instead of adding noise from a known distribution, the noise distribution can be formulated through a neural network such as a flow-based network. Flow++ [25] suggests that if the noise samples $u$ are generated from a conditional probability model $q(u|x)$, the probability distribution of the original data can be estimated as follows:

$P_{model}(x) := \int_{[0,1)^D} q(u|x) \frac{p_{model}(x + u)}{q(u|x)} \, du$  (8)

Then, we can obtain the variational lower bound on the log-likelihood function by applying Jensen's inequality as below:

$\mathbb{E}_{x \sim P_{data}}[\log P_{model}(x)]$  (9)
$= \mathbb{E}_{x \sim P_{data}}\left[\log \int_{[0,1)^D} q(u|x) \frac{p_{model}(x + u)}{q(u|x)} \, du\right]$  (10)
$\geq \mathbb{E}_{x \sim P_{data}}\left[\int_{[0,1)^D} q(u|x) \log \frac{p_{model}(x + u)}{q(u|x)} \, du\right]$  (11)
$= \mathbb{E}_{x \sim P_{data}} \mathbb{E}_{u \sim q(u|x)}\left[\log \frac{p_{model}(x + u)}{q(u|x)}\right]$  (12)

As a result, the dequantized data $y_p$ from variational dequantization can be defined as:

$y_p = x + q_x(\epsilon)$  (13)

where $\epsilon \sim p(\epsilon) = N(\epsilon; 0, I)$.

To implement variational dequantization in a flow-based neural vocoder, we modify the flow model from FloWaveNet [21]. We set the initial input as a 1-dimensional noise vector generated from the spherical Gaussian distribution $N(\epsilon; 0, I)$, whose length is equal to that of the target audio. In each context block, a single squeeze step and multiple flow steps are operated. In each flow step, an affine transformation conditioned on the target audio is applied to half of the squeezed vector. At the end, the vector is flattened, and a hyperbolic tangent function is applied to fit the range of the audio domain. The negative log-likelihood of the dequantizer is trained jointly with the flow-based neural vocoder.
We set a total of 16 flow stacks as Flow Shallow and 48 flow stacks as Flow Dense to examine whether the depth of the dequantization model is critical to model performance.
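The joint objective in Eqs. (9)-(12) can be estimated with a single Monte-Carlo sample per datum. The sketch below assumes two hypothetical interfaces of our own naming, not the authors': `dequantizer`, which maps Gaussian noise conditioned on the audio to $u$ together with $\log q(u|x)$, and `vocoder_flow`, which returns $\log p_{model}(y)$:

```python
import torch

def variational_dequant_loss(x, dequantizer, vocoder_flow):
    # One-sample Monte-Carlo estimate of the lower bound in Eq. (12).
    eps = torch.randn_like(x)             # eps ~ N(0, I)
    u, log_q = dequantizer(eps, cond=x)   # u = q_x(eps), tanh-bounded; Eq. (13)
    y = x + u                             # dequantized audio
    log_p = vocoder_flow(y)               # flow log-likelihood of y
    return -(log_p - log_q).mean()        # negate the bound to minimize
```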
Table 1: Mean opinion score (MOS) results with 95% confidence intervals on 150 randomly selected sentences in the test set.

Methods         MOS ± 95% CI
Ground Truth    4.489 ±
Baseline [21]   ±
Uniform IW      ±
Gaussian Tanh   ±
Flow Dense      ±
4. Experimental Results and Analysis
We set FloWaveNet [21] as our baseline model and trained the baseline with 6 different dequantization methods. Each model was trained on the VCTK Corpus [29], an English dataset containing 109 native speakers. Since FloWaveNet and WaveGlow [20] were evaluated with only a single-speaker dataset, we expanded the experiment to the multi-speaker case, where audio generation is much harder due to the larger variation among different speakers. From the dataset, we withdrew some corrupted audio files and used 44,070 audio clips. For each speaker, 70% of the data was used as training data, 20% as validation data, and the rest as test data. All clips were down-sampled from 48,000 Hz to 22,050 Hz. From each audio clip, chunks of 16,000 samples were randomly extracted.

All models were trained on 4 Nvidia Titan Xp GPUs with a batch size of 8. We used the Adam optimizer with a step size of 1 × 10⁻³ and decayed the learning rate by a factor of 0.5 every 200K iterations. We trained each model for 600K iterations.

For subjective evaluation, we conducted a 5-scale MOS test on Amazon Mechanical Turk. Each participant was instructed to wear either earbuds or headphones for eligible testing. They then had to listen to 5 audio clips at least twice and rate the naturalness of the audio on a scale of 1 to 5 with 0.5-point increments. We explicitly instructed the participants to focus on the quality of the audio. We collected approximately 3,000 samples for the evaluation. In Table 1, the models implementing audio dequantization show higher MOS than the baseline model, which shows that audio dequantization can improve audio quality. Except for the real audio, variational dequantization with a deeper layer receives the highest MOS. This shows that injecting noise with a more complex distribution can produce more natural audio.

Audio generated from the baseline model tended to have digital artifacts such as reverberation, a trembling sound, and periodic noise. We assumed that the occurrence of these artifacts was due to the unnatural collapsing of the continuous density model on discrete data points. To verify that audio dequantization can remove such artifacts, we conducted several quantitative signal-processing evaluations to compare audio quality. First, we randomly selected 400 sentences from the test set. Then, we computed the mel-cepstral distortion (MCD) [30], the global signal-to-noise ratio (GSNR) [31], the segmental signal-to-noise ratio (SSNR) [11], and the root mean square error of the fundamental frequency (RMSE_f0) [11]. All equations for the evaluation are given below (Eqs. 14-17).

Figure 2: Mel-spectrogram converted from audio samples generated by the baseline [21] and the proposed dequantization models. The blue bounding box indicates the area where periodic noise appears and harmonic frequencies are presented.
Table 2: MCD (dB) results with 95% confidence intervals.

Methods         MCD ± 95% CI
Baseline [21]   3.455 ±
Uniform         ±
Uniform IW      ±
Gaussian Sig    ±
Gaussian Tanh   ±
Flow Shallow    ±
Flow Dense      ±

$\mathrm{MCD\,[dB]} = \frac{1}{T} \sum_{t=0}^{T-1} \sqrt{\sum_{k=1}^{K} (c_{t,k} - c'_{t,k})^2}$  (14)

$\mathrm{GSNR\,[dB]} = 10 \log_{10} \frac{\sigma_s^2}{\sigma_r^2}$  (15)

$\mathrm{SSNR\,[dB]} = 10 \log_{10} \left(\frac{\sum_{n=0}^{M} x_s(n)^2}{\sum_{n=0}^{M} (x_s(n) - y_r(n))^2}\right)$  (16)

$\mathrm{RMSE}_{f0}\,[\mathrm{cent}] = 1200 \sqrt{(\log_2 F_r - \log_2 F_s)^2}$  (17)

where $c_{t,k}$ and $c'_{t,k}$ represent the original and synthesized $k$-th mel-frequency cepstral coefficients (MFCCs) of the $t$-th frame, $\sigma_s^2$ and $\sigma_r^2$ represent the power of the speech signal and of the noise, $x_s(n)$ and $y_r(n)$ represent the raw and synthesized waveform samples at time $n$, and $F_r$ and $F_s$ represent the fundamental frequencies of the raw and synthesized waveforms.
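For reference, here is a NumPy sketch of Eqs. (14)-(17) exactly as written above; frame alignment, MFCC extraction, and voiced-frame selection are assumed to be handled by the caller:

```python
import numpy as np

def mcd_db(c_ref: np.ndarray, c_syn: np.ndarray) -> float:
    # Eq. (14): c_* are [T, K] arrays of per-frame MFCCs.
    return float(np.mean(np.sqrt(np.sum((c_ref - c_syn) ** 2, axis=1))))

def gsnr_db(signal_power: float, noise_power: float) -> float:
    # Eq. (15): global signal power over noise power.
    return 10 * np.log10(signal_power / noise_power)

def ssnr_db(x_ref: np.ndarray, y_syn: np.ndarray) -> float:
    # Eq. (16), one segment: signal power over residual power.
    return 10 * np.log10(np.sum(x_ref ** 2) / np.sum((x_ref - y_syn) ** 2))

def f0_rmse_cents(f0_ref: np.ndarray, f0_syn: np.ndarray) -> float:
    # Eq. (17): pitch error in cents, root-mean-squared over voiced frames.
    d = 1200 * (np.log2(f0_ref) - np.log2(f0_syn))
    return float(np.sqrt(np.mean(d ** 2)))
```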
In MCD, we compared all models, including Uniform and Gaussian Sig, to see the improvement brought by our modifications. In Table 2, the dequantizations with modified methods show better performance than the conventional methods. Gaussian Tanh and Flow Dense score lower MCD than the baseline, which shows that both models can produce better audio quality than the baseline model. There was no significant performance difference between the two variational dequantization methods. Uniform IW shows slightly higher MCD than the baseline because of the remaining audible noise generated by mu-law companding.
Table 3: GSNR (dB), SSNR (dB), and RMSE_f0 (Hz) results. Higher is better for the SNRs, and lower is better for RMSE_f0.

Methods         GSNR     SSNR     RMSE_f0
Baseline [21]   -2.127   -2.284   44.881
Uniform IW      -1.902   -1.990   38.359
Gaussian Tanh   -2.112   -2.186   37.208
Flow Dense      -2.048   -2.141   44.066

Table 3 presents the results for SNR and RMSE_f0. All proposed methods show higher SNR than the baseline, indicating that audio dequantization can help reduce noise. In addition, the Uniform IW and Gaussian Tanh dequantizations show better performance in modeling the fundamental frequency, while the Flow Dense dequantization shows a result comparable to the baseline model. We also visualized test outputs for qualitative evaluation in Figure 2. Figures 2(f) and 2(h) show clearer harmonic structures than Figure 2(b). Although Figure 2(d) shows less clear harmonic structures than the other approaches, we can see that the periodic noise is reduced significantly. We provide audio results on our online demo webpage.
5. Conclusions
In this paper, we proposed various audio dequantization schemes that can be implemented in a flow-based neural vocoder. For uniform dequantization, we compressed the range of the audio domain to match the conventional uniform dequantization method by using mu-law companding. In addition, we implemented iw dequantization to resolve the noise issue that arises from the lossy compression. For Gaussian dequantization, we applied hyperbolic tangent normalization to data-oriented Gaussian noise to properly fit the data within the audio range. Lastly, we modified the flow block in the flow-based neural vocoder to construct a variational dequantization model that applies more flexible noise. From the experiments, we demonstrate that implementing audio dequantization helps the flow-based neural vocoder produce better audio quality with fewer artifacts.

Audio samples are available at https://claudin92.github.io/deqflow_webdemo/.

6. References

[1] Y. Jia, Y. Zhang, R. Weiss, Q. Wang, J. Shen, F. Ren, P. Nguyen, R. Pang, I. L. Moreno, Y. Wu et al., "Transfer learning from speaker verification to multispeaker text-to-speech synthesis," in Advances in Neural Information Processing Systems, 2018, pp. 4480-4490.
[2] Y. Wang, R. Skerry-Ryan, D. Stanton, Y. Wu, R. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio, Q. Le, Y. Agiomyrgiannakis, R. Clark, and R. Saurous, "Tacotron: Towards end-to-end speech synthesis," in Interspeech, 2017, pp. 4006-4010.
[3] J. Park, K. Han, Y. Jeong, and S. W. Lee, "Phonemic-level duration control using attention alignment for natural speech synthesis," in IEEE International Conference on Acoustics, Speech and Signal Processing, 2019, pp. 5896-5900.
[4] Y. Taigman, L. Wolf, A. Polyak, and E. Nachmani, "VoiceLoop: Voice fitting and synthesis via a phonological loop," in International Conference on Learning Representations, 2018.
[5] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerrv-Ryan et al., "Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions," in IEEE International Conference on Acoustics, Speech and Signal Processing, 2018, pp. 4779-4783.
[6] D. Griffin and J. Lim, "Signal estimation from modified short-time Fourier transform," IEEE Transactions on Acoustics, Speech, and Signal Processing, 1984, pp. 236-243.
[7] H. Kawahara, "STRAIGHT, exploitation of the other aspect of vocoder: Perceptually isomorphic decomposition of speech sounds," Acoustical Science and Technology, vol. 27, no. 6, pp. 349-353, 2006.
[8] M. Morise, F. Yokomori, and K. Ozawa, "WORLD: A vocoder-based high-quality speech synthesis system for real-time applications," IEICE Transactions on Information and Systems, vol. 99, no. 7, pp. 1877-1884, 2016.
[9] A. v. d. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, "WaveNet: A generative model for raw audio," arXiv preprint arXiv:1609.03499, 2016.
[10] A. Tamamori, T. Hayashi, K. Kobayashi, K. Takeda, and T. Toda, "Speaker-dependent WaveNet vocoder," in Interspeech, 2017, pp. 1118-1122.
[11] T. Hayashi, A. Tamamori, K. Kobayashi, K. Takeda, and T. Toda, "An investigation of multi-speaker training for WaveNet vocoder," in IEEE Automatic Speech Recognition and Understanding Workshop, 2017, pp. 712-718.
[12] S. Ö. Arik, M. Chrzanowski, A. Coates, G. Diamos, A. Gibiansky, Y. Kang, X. Li, J. Miller, A. Ng, J. Raiman et al., "Deep Voice: Real-time neural text-to-speech," in International Conference on Machine Learning, vol. 70, 2017, pp. 195-204.
[13] W. Ping, K. Peng, and J. Chen, "ClariNet: Parallel wave generation in end-to-end text-to-speech," in International Conference on Learning Representations, 2018.
[14] A. v. d. Oord, Y. Li, I. Babuschkin, K. Simonyan, O. Vinyals, K. Kavukcuoglu, G. v. d. Driessche, E. Lockhart, L. C. Cobo, F. Stimberg et al., "Parallel WaveNet: Fast high-fidelity speech synthesis," in International Conference on Machine Learning, 2018, pp. 3918-3926.
[15] J. Engel, K. K. Agrawal, S. Chen, I. Gulrajani, C. Donahue, and A. Roberts, "GANSynth: Adversarial neural audio synthesis," in International Conference on Learning Representations, 2019.
[16] P. Neekhara, C. Donahue, M. Puckette, S. Dubnov, and J. McAuley, "Expediting TTS synthesis with adversarial vocoding," in Interspeech, 2019, pp. 186-190.
[17] R. Yamamoto, E. Song, and J.-M. Kim, "Probability density distillation with generative adversarial networks for high-quality parallel waveform generation," in Interspeech, 2019, pp. 699-703.
[18] K. Kumar, R. Kumar, T. de Boissiere, L. Gestin, W. Z. Teoh, J. Sotelo, A. de Brébisson, Y. Bengio, and A. C. Courville, "MelGAN: Generative adversarial networks for conditional waveform synthesis," in Advances in Neural Information Processing Systems, 2019, pp. 14881-14892.
[19] R. Yamamoto, E. Song, and J.-M. Kim, "Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram," in IEEE International Conference on Acoustics, Speech and Signal Processing, 2020, pp. 6199-6203.
[20] R. Prenger, R. Valle, and B. Catanzaro, "WaveGlow: A flow-based generative network for speech synthesis," in IEEE International Conference on Acoustics, Speech and Signal Processing, 2019, pp. 3617-3621.
[21] S. Kim, S.-g. Lee, J. Song, J. Kim, and S. Yoon, "FloWaveNet: A generative flow for raw audio," in International Conference on Machine Learning, 2019, pp. 3370-3378.
[22] D. J. Rezende and S. Mohamed, "Variational inference with normalizing flows," in International Conference on Machine Learning, 2015.
[23] L. Theis, A. v. d. Oord, and M. Bethge, "A note on the evaluation of generative models," in International Conference on Learning Representations, 2015.
[24] E. Hoogeboom, T. S. Cohen, and J. M. Tomczak, "Learning discrete distributions by dequantization," arXiv preprint arXiv:2001.11235, 2020.
[25] J. Ho, X. Chen, A. Srinivas, Y. Duan, and P. Abbeel, "Flow++: Improving flow-based generative models with variational dequantization and architecture design," in International Conference on Machine Learning, 2019, pp. 2722-2730.
[26] D. P. Kingma and P. Dhariwal, "Glow: Generative flow with invertible 1x1 convolutions," in Advances in Neural Information Processing Systems, 2018, pp. 10215-10224.
[27] T. Yoshimura, K. Hashimoto, K. Oura, Y. Nankaku, and K. Tokuda, "Mel-cepstrum-based quantization noise shaping applied to neural-network-based speech waveform synthesis," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 7, pp. 1177-1184, 2018.
[28] J. Atchison and S. M. Shen, "Logistic-normal distributions: Some properties and uses," Biometrika, pp. 261-272, 1980.
[29] C. Veaux, J. Yamagishi, K. MacDonald et al., "Superseded-CSTR VCTK corpus: English multi-speaker corpus for CSTR voice cloning toolkit," 2016.
[30] R. Kubichek, "Mel-cepstral distance measure for objective speech quality assessment," in IEEE Pacific Rim Conference on Communications, Computers and Signal Processing, vol. 1, 1993, pp. 125-128.
[31] M. Vondrasek and P. Pollak, "Methods for speech SNR estimation: Evaluation tool and analysis of VAD dependency," Radioengineering, vol. 14, no. 1, pp. 6-11, 2005.