Neural source-filter-based waveform model for statistical parametric speech synthesis
Xin Wang, Shinji Takaki, Junichi Yamagishi*
National Institute of Informatics, Japan
[email protected], [email protected], [email protected]
ABSTRACT
Neural waveform models such as the WaveNet are used in many recent text-to-speech systems, but the original WaveNet is quite slow in waveform generation because of its autoregressive (AR) structure. Although faster non-AR models were recently reported, they may be prohibitively complicated due to the use of a distilling training method and the blend of other disparate training criteria. This study proposes a non-AR neural source-filter waveform model that can be directly trained using spectrum-based training criteria and the stochastic gradient descent method. Given the input acoustic features, the proposed model first uses a source module to generate a sine-based excitation signal and then uses a filter module to transform the excitation signal into the output speech waveform. Our experiments demonstrated that the proposed model generated waveforms at least 100 times faster than the AR WaveNet and that the quality of its synthetic speech is close to that of speech generated by the AR WaveNet. Ablation test results showed that both the sine-wave excitation signal and the spectrum-based training criteria were essential to the performance of the proposed model.
Index Terms — speech synthesis, neural network, waveform modeling
1. INTRODUCTION
Text-to-speech (TTS) synthesis, a technology that converts texts into speech waveforms, has been advanced by using end-to-end architectures [1] and neural-network-based waveform models [2, 3, 4]. Among those waveform models, the WaveNet [2] directly models the distribution of waveform sampling points and has demonstrated outstanding performance. The vocoder version of WaveNet [5], which converts acoustic features into the waveform, also outperformed other vocoders for pipeline TTS systems [6]. As an autoregressive (AR) model, the WaveNet is quite slow in waveform generation because it has to generate the waveform sampling points one by one. To improve the generation speed, the Parallel WaveNet [3] and the ClariNet [4] introduce a distilling method to transfer 'knowledge' from a teacher AR WaveNet to a student non-AR model that simultaneously generates all the waveform sampling points. However, the concatenation of two large models and the mix of distilling and other training criteria reduce the model interpretability and raise the implementation cost.

In this paper, we propose a neural source-filter waveform model that converts acoustic features into speech waveforms. Inspired by classical speech modeling methods [7, 8], we used a source module to generate a sine-based excitation signal with a specified fundamental frequency (F0). We then used a dilated-convolution-based filter module to transform the sine-based excitation into the speech waveform. The proposed model was trained by minimizing spectral amplitude and phase distances, which can be efficiently implemented using discrete Fourier transforms (DFTs). Because the proposed model is a non-AR model, it generates waveforms much faster than the AR WaveNet.

* This work was partially supported by JST CREST Grant Number JPMJCR18A6, Japan and by MEXT KAKENHI Grant Numbers (16H06302, 16K16096, 17H04687, 18H04120, 18H04112, 18KT0051), Japan.
A large-scale listening test showed that the proposed model was close to the AR WaveNet in terms of the mean opinion score (MOS) on the quality of synthetic speech. An ablation test showed that both the sine-wave excitation and the spectral amplitude distance were crucial to the proposed model. The model structure and training criteria are explained in Section 2, after which the experiments are described in Section 3. Finally, this paper is summarized and concluded in Section 4.
2. PROPOSED MODEL AND TRAINING CRITERIA

2.1. Model structure
The proposed model (shown in Figure 1) converts an input acoustic feature sequence $c_{1:B}$ of length $B$ into a speech waveform $\hat{o}_{1:T}$ of length $T$. It includes a source module that generates an excitation signal $e_{1:T}$, a filter module that transforms $e_{1:T}$ into the speech waveform, and a condition module that processes the acoustic features for the source and filter modules. None of the modules takes the previously generated waveform sample as input. The waveform is assumed to be real-valued, i.e., $\hat{o}_t \in \mathbb{R}, 1 \le t \le T$.

The condition module takes as input the acoustic feature sequence $c_{1:B} = \{c_1, \cdots, c_B\}$, where each $c_b = [f_b, s_b^\top]^\top$ contains the F0 $f_b$ and the spectral features $s_b$ of the $b$-th speech frame. The condition module upsamples the F0 by duplicating $f_b$ to every time step within the $b$-th frame and feeds the upsampled F0 sequence $f_{1:T}$ to the source module. Meanwhile, it processes $c_{1:B}$ using a bi-directional recurrent layer with long short-term memory (LSTM) units [9] and a convolutional (CONV) layer, after which the processed features are upsampled and sent to the filter module. The LSTM and CONV layers were used so that the condition module was similar to that of the WaveNet-vocoder [10] in the experiment. They can be replaced with a feedforward layer in practice.

Given the input F0 sequence $f_{1:T}$, the source module generates a sine-based excitation signal $e_{1:T} = \{e_1, \cdots, e_T\}$, where $e_t \in \mathbb{R}, \forall t \in \{1, \cdots, T\}$. Suppose the F0 value of the $t$-th time step
Fig. 1. Structure of the proposed model. $B$ and $T$ denote the lengths of the input feature sequence and the output waveform, respectively. FF, CONV, and Bi-LSTM denote feedforward, convolutional, and bi-directional recurrent layers, respectively. DFT denotes discrete Fourier transform.

is $f_t \in \mathbb{R}_{\ge 0}$, where $f_t = 0$ denotes being unvoiced. By treating $f_t$ as the instantaneous frequency [11], a signal $e^{<0>}_{1:T}$ can be generated as

$$e^{<0>}_t = \begin{cases} \alpha \sin\big(\sum_{k=1}^{t} 2\pi \frac{f_k}{N_s} + \phi\big) + n_t, & \text{if } f_t > 0 \\ \frac{1}{3\sigma} n_t, & \text{if } f_t = 0 \end{cases} \quad (1)$$

where $n_t \sim \mathcal{N}(0, \sigma^2)$ is Gaussian noise, $\phi \in [-\pi, \pi]$ is a random initial phase, and $N_s$ is equal to the waveform sampling rate. Although we can directly set $e_{1:T} = e^{<0>}_{1:T}$, we tried two additional tricks. First, a 'best' phase $\phi^*$ for $e^{<0>}_{1:T}$ can be determined in the training stage by maximizing the correlation between $e^{<0>}_{1:T}$ and the natural waveform $o_{1:T}$. During generation, $\phi$ is randomly generated. The second trick is to generate harmonics by increasing $f_k$ in Equation (1) and to use a feedforward (FF) layer to merge the harmonics and $e^{<0>}_{1:T}$ into $e_{1:T}$. In this paper we use 7 harmonics and set $\sigma = 0.003$ and $\alpha = 0.1$.

Given the excitation signal $e_{1:T}$ from the source module and the processed acoustic features from the condition module, the filter module modulates $e_{1:T}$ using multiple stages of dilated convolution and affine transformations similar to those in ClariNet [4]. For example, the first stage takes $e_{1:T}$ and the processed acoustic features as input and produces two signals $a_{1:T}$ and $b_{1:T}$ using dilated convolution. The signal $e_{1:T}$ is then transformed as $e_{1:T} \odot b_{1:T} + a_{1:T}$, where $\odot$ denotes element-wise multiplication.
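As an illustration, the sine-based excitation of Equation (1) can be sketched in a few lines of plain Python. This is a minimal, unoptimized sketch, not the paper's CUDA implementation; the per-sample F0 input format, the function name, and the default sample rate are illustrative assumptions, while alpha and sigma follow the values stated in the text.

```python
import math
import random

def sine_excitation(f0, sample_rate=16000.0, alpha=0.1, sigma=0.003, seed=0):
    """Sketch of Equation (1): sine-based excitation from a per-sample F0 track.

    f0[t] is the F0 in Hz at sample t; f0[t] == 0 marks unvoiced samples.
    """
    rng = random.Random(seed)
    phi = rng.uniform(-math.pi, math.pi)  # random initial phase
    phase = 0.0
    e = []
    for f_t in f0:
        # cumulative sum of instantaneous frequency
        phase += 2.0 * math.pi * f_t / sample_rate
        n_t = rng.gauss(0.0, sigma)       # Gaussian noise n_t ~ N(0, sigma^2)
        if f_t > 0:
            # voiced: noisy sine with amplitude alpha
            e.append(alpha * math.sin(phase + phi) + n_t)
        else:
            # unvoiced: scaled noise, std = 1/3
            e.append(n_t / (3.0 * sigma))
    return e
```

In the model itself this signal is either used directly or merged with its harmonics through an FF layer; the harmonics can be produced by the same routine with the F0 track multiplied by 2, 3, and so on.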
The transformed signal is further processed in the following stages, and the output of the final stage is used as the generated waveform $\hat{o}_{1:T}$. The dilated convolution blocks are similar to those in Parallel WaveNet [3]. Specifically, each block contains multiple dilated convolution layers with a filter size of 3. The outputs of the convolution layers are merged with the features from the condition module through gated activation functions [3]. After that, the merged features are transformed into $a_{1:T}$ and $\tilde{b}_{1:T}$. To make sure that $b_{1:T}$ is positive, it is parameterized as $b_{1:T} = \exp(\tilde{b}_{1:T})$. Unlike ClariNet or Parallel WaveNet, the proposed model does not use the distilling method. It is unnecessary to compute the mean and standard deviation of the transformed signal. Neither is it necessary to form the convolution and transformation blocks as an inverse autoregressive flow [12].

Because speech perception heavily relies on acoustic cues in the frequency domain, we define training criteria that minimize spectral amplitude and phase distances, both of which can be implemented using DFTs. Given these criteria, the proposed model is trained using the stochastic gradient descent (SGD) method.
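Returning to the filter module, a single affine modulation stage with $b_{1:T} = \exp(\tilde{b}_{1:T})$ can be sketched as below. In the model, $a_{1:T}$ and $\tilde{b}_{1:T}$ come from dilated convolutions conditioned on the acoustic features; here they are passed in as plain lists purely for illustration.

```python
import math

def affine_modulation_stage(e, a, b_tilde):
    """Sketch of one filter-module stage: e ⊙ exp(b̃) + a (element-wise).

    Parameterizing the multiplicative term as exp(b̃) keeps it positive.
    """
    return [e_t * math.exp(bt_t) + a_t
            for e_t, a_t, bt_t in zip(e, a, b_tilde)]
```

Stacking several such stages, each fed by its own dilated convolution block, yields the full filter module; with $\tilde{b}_{1:T} = 0$ the stage reduces to a skip-connection-like additive update, which is the N1/N2 variant examined in the ablation test.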
Following the convention of short-time Fourier analysis, we conduct waveform framing and windowing before producing the spectrum of each frame. For the generated waveform $\hat{o}_{1:T}$, we use $\hat{x}^{(n)} = [\hat{x}^{(n)}_1, \cdots, \hat{x}^{(n)}_M]^\top \in \mathbb{R}^M$ to denote the $n$-th waveform frame of length $M$. We then use $\hat{y}^{(n)} = [\hat{y}^{(n)}_1, \cdots, \hat{y}^{(n)}_K]^\top \in \mathbb{C}^K$ to denote the spectrum of $\hat{x}^{(n)}$ calculated using a $K$-point DFT. We similarly define $x^{(n)}$ and $y^{(n)}$ for the natural waveform $o_{1:T}$.

Suppose the waveform is sliced into $N$ frames. Then the log spectral amplitude distance is defined as follows:

$$\mathcal{L}_s = \frac{1}{2}\sum_{n=1}^{N}\sum_{k=1}^{K}\left[\log\frac{\mathrm{Re}(y^{(n)}_k)^2 + \mathrm{Im}(y^{(n)}_k)^2}{\mathrm{Re}(\hat{y}^{(n)}_k)^2 + \mathrm{Im}(\hat{y}^{(n)}_k)^2}\right]^2, \quad (2)$$

where $\mathrm{Re}(\cdot)$ and $\mathrm{Im}(\cdot)$ denote the real and imaginary parts of a complex number, respectively.

Although $\mathcal{L}_s$ is defined on complex-valued spectra, the gradient $\frac{\partial \mathcal{L}_s}{\partial \hat{o}_{1:T}} \in \mathbb{R}^T$ for SGD training can be efficiently calculated. Let us consider the $n$-th frame and compose a complex-valued vector $g^{(n)} = \frac{\partial \mathcal{L}_s}{\partial \mathrm{Re}(\hat{y}^{(n)})} + j\frac{\partial \mathcal{L}_s}{\partial \mathrm{Im}(\hat{y}^{(n)})} \in \mathbb{C}^K$, where the $k$-th element is $g^{(n)}_k = \frac{\partial \mathcal{L}_s}{\partial \mathrm{Re}(\hat{y}^{(n)}_k)} + j\frac{\partial \mathcal{L}_s}{\partial \mathrm{Im}(\hat{y}^{(n)}_k)} \in \mathbb{C}$. It can be shown that, as long as $g^{(n)}$ is Hermitian symmetric, the inverse DFT of $g^{(n)}$ is equal to $\frac{\partial \mathcal{L}_s}{\partial \hat{x}^{(n)}} = [\frac{\partial \mathcal{L}_s}{\partial \hat{x}^{(n)}_1}, \cdots, \frac{\partial \mathcal{L}_s}{\partial \hat{x}^{(n)}_M}]^\top \in \mathbb{R}^M$. Using the same method, $\frac{\partial \mathcal{L}_s}{\partial \hat{x}^{(n)}}$ for $n \in \{1, \cdots, N\}$ can be computed in parallel. Given $\{\frac{\partial \mathcal{L}_s}{\partial \hat{x}^{(1)}}, \cdots, \frac{\partial \mathcal{L}_s}{\partial \hat{x}^{(N)}}\}$, the value of each $\frac{\partial \mathcal{L}_s}{\partial \hat{o}_t}$ in $\frac{\partial \mathcal{L}_s}{\partial \hat{o}_{1:T}}$ can be easily accumulated, since the relationship between $\hat{o}_t$ and each $\hat{x}^{(n)}_m$ has been determined by the framing and windowing operations.
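The log spectral amplitude distance of Equation (2) can be sketched for a single frame pair as follows. A naive O(KM) DFT is written out for clarity (a real implementation would use an FFT), and the small epsilon guarding log(0) is an implementation assumption not stated in the text.

```python
import cmath
import math

def dft(x, K):
    """Naive K-point DFT of a real frame x, zero-padded to length K."""
    x = list(x) + [0.0] * (K - len(x))
    return [sum(x[m] * cmath.exp(-2j * math.pi * k * m / K)
                for m in range(K))
            for k in range(K)]

def log_spectral_amplitude_distance(x_nat, x_gen, K=16):
    """Sketch of Equation (2) for one (natural, generated) frame pair.

    Frames are assumed to be already windowed.
    """
    eps = 1e-12  # guards log(0); an assumption, not from the paper
    y, y_hat = dft(x_nat, K), dft(x_gen, K)
    ls = 0.0
    for yk, yhk in zip(y, y_hat):
        num = yk.real ** 2 + yk.imag ** 2 + eps
        den = yhk.real ** 2 + yhk.imag ** 2 + eps
        ls += 0.5 * math.log(num / den) ** 2
    return ls
```

Summing this quantity over all N frames gives the full criterion; identical frames give a distance of zero.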
In the implementation using the fast Fourier transform, $\hat{x}^{(n)}$ of length $M$ is zero-padded to length $K$ before the DFT. Accordingly, the inverse DFT of $g^{(n)}$ also gives the gradients w.r.t. the zero-padded part, which should be discarded (see https://arxiv.org/abs/1810.11946).

Table 1. Three framing and DFT configurations for $\mathcal{L}_s$ and $\mathcal{L}_p$

                   L_s1 & L_p1     L_s2 & L_p2     L_s3 & L_p3
DFT bins K         512             128             2048
Frame length M     320 (20 ms)     80 (5 ms)       1920 (120 ms)
Frame shift        80 (5 ms)       40 (2.5 ms)     640 (40 ms)

Note: all configurations use the Hann window.

In fact, $\frac{\partial \mathcal{L}_s}{\partial \hat{o}_{1:T}} \in \mathbb{R}^T$ can be calculated in the same manner no matter how we set the framing and DFT configuration, i.e., the values of $N$, $M$, and $K$. Furthermore, multiple $\mathcal{L}_s$ terms with different configurations can be computed, and the gradients $\frac{\partial \mathcal{L}_s}{\partial \hat{o}_{1:T}}$ can simply be summed up. For example, using the three $\mathcal{L}_s$ terms in Table 1 was found to be essential to the proposed model (see Section 3.3).

The Hermitian symmetry of $g^{(n)}$ is satisfied if $\mathcal{L}_s$ is carefully defined. For example, $\mathcal{L}_s$ can be the square error or Kullback-Leibler divergence (KLD) of the spectral amplitudes [13, 14]. The phase distance defined below also satisfies the requirement.

Given the spectra, a phase distance [15] is computed as

$$\mathcal{L}_p = \frac{1}{2}\sum_{n=1}^{N}\sum_{k=1}^{K}\left|1 - \exp\big(j(\hat{\theta}^{(n)}_k - \theta^{(n)}_k)\big)\right|^2 = \sum_{n=1}^{N}\sum_{k=1}^{K}\left[1 - \frac{\mathrm{Re}(\hat{y}^{(n)}_k)\mathrm{Re}(y^{(n)}_k) + \mathrm{Im}(\hat{y}^{(n)}_k)\mathrm{Im}(y^{(n)}_k)}{|\hat{y}^{(n)}_k||y^{(n)}_k|}\right], \quad (3)$$

where $\hat{\theta}^{(n)}_k$ and $\theta^{(n)}_k$ are the phases of $\hat{y}^{(n)}_k$ and $y^{(n)}_k$, respectively. The gradient $\frac{\partial \mathcal{L}_p}{\partial \hat{o}_{1:T}}$ can be computed by the same procedure as $\frac{\partial \mathcal{L}_s}{\partial \hat{o}_{1:T}}$. Multiple $\mathcal{L}_p$ and $\mathcal{L}_s$ terms with different framing and DFT configurations can be added up as the ultimate training criterion $\mathcal{L}$. For the different $\mathcal{L}$ terms, additional DFT/iDFT and framing/windowing blocks should be added to the model in Figure 1.
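The phase distance of Equation (3) can likewise be sketched per frame. Skipping bins whose amplitude product is near zero is an implementation assumption to avoid division by zero; the paper does not specify how such bins are handled.

```python
def phase_distance(y_nat, y_hat):
    """Sketch of Equation (3) for one frame's spectra (lists of complex bins).

    Per bin this is 1 - cos(θ̂ - θ), written via real/imaginary parts so
    that no explicit phase extraction is needed.
    """
    lp = 0.0
    for yk, yhk in zip(y_nat, y_hat):
        denom = abs(yk) * abs(yhk)
        if denom < 1e-12:  # near-zero amplitude: skip (assumption)
            continue
        lp += 1.0 - (yhk.real * yk.real + yhk.imag * yk.imag) / denom
    return lp
```

Each bin contributes 0 when the two phases agree and 2 when they are opposite, so the distance is bounded per bin regardless of amplitude.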
3. EXPERIMENTS

3.1. Corpus and features
This study used the same Japanese speech corpus and data division recipe as our previous study [16]. This corpus [17] contains neutral reading speech uttered by a female speaker. Both the validation and test sets contain 480 randomly selected utterances. Among the 48-hour training data, 9,000 randomly selected utterances (15 hours) were used as the training set in this study. For the ablation test in Section 3.3, the training set was further reduced to 3,000 utterances (5 hours). Acoustic features, including 60 dimensions of Mel-generalized cepstral coefficients (MGCs) [18] and 1 dimension of F0, were extracted from the 48 kHz waveforms at a frame shift of 5 ms using WORLD [19]. The natural waveforms were then downsampled to 16 kHz for model training and the listening test.
3.2. Comparison test

The first experiment compared the four models listed in Table 2. (The models were implemented using a modified CURRENNT toolkit [20] on a single Nvidia P100 GPU card. Code, recipes, and generated speech samples can be found at https://nii-yamagishilab.github.io.)

Table 2. Models for comparison test in Section 3.2
WOR   WORLD vocoder
WAD   WaveNet-vocoder for 10-bit discrete µ-law waveform
WAC   WaveNet-vocoder using Gaussian dist. for raw waveform
NSF   Proposed model for raw waveform

Fig. 2. MOS scores of natural speech, synthetic speech given natural acoustic features (blue), and synthetic speech given acoustic features generated from acoustic models (red). White dots are mean values.

Table 3. Average number of waveform points generated in 1 s
WAD: 0.19k     NSF (memory-save mode): 20k     NSF (normal mode): 227k

The WAD model, which was trained in our previous study [6], contained a condition module, a post-processing module, and 40 dilated CONV blocks, where the $k$-th CONV block had a dilation size of $2^{\mathrm{modulo}(k,10)}$. WAC was similar to WAD but used a Gaussian distribution to model the raw waveform at the output layer [4]. The proposed NSF contained 5 stages of dilated CONV and transformation, each stage including 10 convolutional layers with a dilation size of $2^{\mathrm{modulo}(k,10)}$ and a filter size of 3. Its condition module was the same as that of WAD and WAC. NSF was trained using $\mathcal{L} = \mathcal{L}_{s1} + \mathcal{L}_{s2} + \mathcal{L}_{s3}$, where the configuration of each $\mathcal{L}_s$ is listed in Table 1. The phase distances $\mathcal{L}_p$ were not used in this test.

Each model generated waveforms given natural and generated acoustic features, where the generated acoustic features were produced by the acoustic models in our previous study [6]. The generated and natural waveforms were then evaluated by paid native Japanese speakers. In each evaluation round the evaluator listened to one speech waveform per screen and rated the speech quality on a 1-to-5 MOS scale. Each evaluator could take at most 10 evaluation rounds and could replay the samples during evaluation. The waveforms in an evaluation round were for the same text and were played in a random order. Note that the waveforms generated from NSF and
WAC were converted to 16-bit PCM format before evaluation.

A total of 245 evaluators conducted 1444 valid evaluation rounds in all, and the results are plotted in Figure 2. Two-sided Mann-Whitney tests showed that the difference between any pair of models is statistically significant (p < 0.01), except between NSF and WAC when the two models used generated acoustic features. In general, NSF outperformed WOR and WAC but performed slightly worse than WAD. The gap in mean MOS scores between NSF and WAD was about 0.12, given either natural or generated acoustic features. A possible reason for this result may be the difference between the non-AR and AR model structures, which is similar to the difference between finite and infinite impulse response filters. WAC performed worse than WAD because some syllables were perceived to be trembling in pitch, which may be caused by the random-sampling generation method. WAD alleviated this artifact by using a one-best generation method in voiced regions [6].

After the MOS test, we compared the waveform generation speed of NSF and WAD. The implementation of NSF has a normal
NSF has a normal k4k8k F r equen cy ( H z ) Natural2k4k8k F r equen cy ( H z ) NSF s L3(NSF s w/o L s nor L s ) L4(NSF s w/ L s , , , L p , , ) S3(NSF s noise excitation) N2(NSF s with b T = 0 ) Fig. 3 . Spectrogram (top) and instantaneous frequency (bottom) of natural waveform and waveforms generated from models in Table 4 givennatural acoustic features in test set (utterance AOZORAR 03372 T01). Figures are plotted using 5 ms frame length and 2.5 ms frame shift.
Table 4 . Models for ablation test (Section 3.3)
NSF s NSF trained on 5-hour data
L1 NSF s without using L s ssssssssss (i.e., L = L s + L s ) L2 NSF s without using L s ssssssssss (i.e., L = L s + L s ) L3 NSF s without using L s nor L s sssssssss (i.e., L = L s ) L4 NSF s using L = L s + L s + L s + L p + L p + L p L5 NSF s using KLD of spectral amplitudes S1 NSF s without harmonics S2 NSF s without harmonics or ‘best’ phase φ ∗ S3 NSF s only using noise as excitation N1 NSF s with b T = 1 in filter’s transformation layers N2 NSF s with b T = 0 in filter’s transformation layersgeneration mode and a memory-save one. The normal modeallocates all the required GPU memory once but cannot generatewaveforms longer than 6 seconds because of the insufficient memoryspace in a single GPU card. The memory-save mode can generatelong waveforms because it releases and allocates the memory layerby layer, but the repeated memory operations are time consuming.We evaluated NSF using both modes on a smaller test set, inwhich each of the 80 generated test utterances was around 5 secondslong. As the results in Table 3 show,
NSF is much faster than
WAD .Note that
WAD allocates and re-uses a small size of GPU memory,which needs no repeated memory operation.
WAD is slow mainlybecause of the AR generation process. Of course, both
WAD and
NSF can be improved if our toolkit is further optimized. Particularly, ifthe memory operation can be sped up, the memory-save mode of
NSF will be much faster.
3.3. Ablation test

This experiment was an ablation test on NSF. Specifically, the 11 variants of NSF listed in Table 4 were trained using the 5-hour training set. For a fair comparison, NSF was re-trained using the 5-hour data, and this variant is referred to as NSF-s. The speech waveforms were generated given the natural acoustic features and rated in 1444 evaluation rounds by the same group of evaluators as in Section 3.2. This test excluded the natural waveform from evaluation.

Fig. 4. MOS scores of synthetic samples from NSF-s and its variants given natural acoustic features. White dots are mean MOS scores.

The results are plotted in Figure 4. The difference between NSF-s and any other model except S2 was statistically significant (p < 0.01). Comparison among NSF-s, L1, L2, and L3 shows that using the multiple $\mathcal{L}_s$ terms listed in Table 1 is beneficial. For L3, which used only $\mathcal{L}_{s1}$, the generated waveform points clustered around one peak in each frame, and the waveform suffered from a pulse-train noise. This can be observed from the L3 panel of Figure 3, whose spectrogram in the high frequency band shows clearer vertical strips than the other models. Accordingly, this artifact can be alleviated by adding an $\mathcal{L}_s$ with a frame length of 5 ms for model training, which explains the improvement in L1. Using the phase distance (L4) did not improve the speech quality, even though the value of the phase distance consistently decreased on both the training and validation data.

The good result of S2 indicates that a single sine-wave excitation with a random initial phase also works. Without the sine-wave excitation, S3 generated waveforms that were intelligible but lacked a stable harmonic structure. N1 slightly outperformed NSF-s, while N2 produced unstable harmonic structures. Because the transformation in N1 is equivalent to a skip-connection [21], the result indicates that the skip-connection may help the model training.
4. CONCLUSION
In this paper, we proposed a neural waveform model with separate source and filter modules. The source module produces a sine-wave excitation signal with a specified F0, and the filter module uses dilated convolution to transform the excitation into a waveform. Our experiments demonstrated that the sine-wave excitation was essential for generating waveforms with harmonic structures. We also found that the multiple spectrum-based training criteria and the transformation in the filter module contributed to the performance of the proposed model. Compared with the AR WaveNet, the proposed model generated speech of similar quality at a much faster speed.

The proposed model can be improved in many aspects. For example, it is possible to simplify the dilated convolution blocks. It is also possible to try classical speech modeling methods, including glottal waveform excitations [22, 23] and two-band or multi-band approaches [24, 25], on waveforms. When applying the model to convert linguistic features into the waveform, we observed an over-smoothing effect in the high-frequency band and will investigate this issue in future work.

5. REFERENCES

[1] Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, et al., "Natural TTS synthesis by conditioning WaveNet on Mel spectrogram predictions," in
Proc. ICASSP, 2018, pp. 4779–4783.

[2] Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu, "WaveNet: A generative model for raw audio," arXiv preprint arXiv:1609.03499, 2016.

[3] Aaron van den Oord, Yazhe Li, Igor Babuschkin, Karen Simonyan, Oriol Vinyals, Koray Kavukcuoglu, George van den Driessche, Edward Lockhart, Luis Cobo, Florian Stimberg, Norman Casagrande, Dominik Grewe, Seb Noury, Sander Dieleman, Erich Elsen, Nal Kalchbrenner, Heiga Zen, Alex Graves, Helen King, Tom Walters, Dan Belov, and Demis Hassabis, "Parallel WaveNet: Fast high-fidelity speech synthesis," in Proc. ICML, 2018, pp. 3918–3926.

[4] Wei Ping, Kainan Peng, and Jitong Chen, "ClariNet: Parallel wave generation in end-to-end text-to-speech," in Proc. ICLR, 2019.

[5] Akira Tamamori, Tomoki Hayashi, Kazuhiro Kobayashi, Kazuya Takeda, and Tomoki Toda, "Speaker-dependent WaveNet vocoder," in Proc. Interspeech, 2017, pp. 1118–1122.

[6] Xin Wang, Jaime Lorenzo-Trueba, Shinji Takaki, Lauri Juvela, and Junichi Yamagishi, "A comparison of recent waveform generation and acoustic modeling methods for neural-network-based speech synthesis," in Proc. ICASSP, 2018, pp. 4804–4808.

[7] Per Hedelin, "A tone oriented voice excited vocoder," in Proc. ICASSP. IEEE, 1981, vol. 6, pp. 205–208.

[8] Robert McAulay and Thomas Quatieri, "Speech analysis/synthesis based on a sinusoidal representation," IEEE Trans. on Acoustics, Speech, and Signal Processing, vol. 34, no. 4, pp. 744–754, 1986.

[9] Alex Graves, Supervised Sequence Labelling with Recurrent Neural Networks, Ph.D. thesis, Technische Universität München, 2008.

[10] Xin Wang, Shinji Takaki, and Junichi Yamagishi, "Investigation of WaveNet for text-to-speech synthesis," Tech. Rep. 6, SIG Technical Reports, Feb. 2018.

[11] John R. Carson and Thornton C. Fry, "Variable frequency electric circuit theory with application to the theory of frequency-modulation," Bell System Technical Journal, vol. 16, no. 4, pp. 513–540, 1937.

[12] Diederik P. Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling, "Improved variational inference with inverse autoregressive flow," in Proc. NIPS, 2016, pp. 4743–4751.

[13] Daniel D. Lee and H. Sebastian Seung, "Algorithms for non-negative matrix factorization," in Proc. NIPS, 2001, pp. 556–562.

[14] Shinji Takaki, Hirokazu Kameoka, and Junichi Yamagishi, "Direct modeling of frequency spectra and waveform generation based on phase recovery for DNN-based speech synthesis," in Proc. Interspeech, 2017, pp. 1128–1132.

[15] Shinji Takaki, Toru Nakashika, Xin Wang, and Junichi Yamagishi, "STFT spectral loss for training a neural speech waveform model," in Proc. ICASSP, 2019, p. (accepted).

[16] Hieu-Thi Luong, Xin Wang, Junichi Yamagishi, and Nobuyuki Nishizawa, "Investigating accuracy of pitch-accent annotations in neural-network-based speech synthesis and denoising effects," in Proc. Interspeech, 2018, pp. 37–41.

[17] Hisashi Kawai, Tomoki Toda, Jinfu Ni, Minoru Tsuzaki, and Keiichi Tokuda, "XIMERA: A new TTS from ATR based on corpus-based technologies," in Proc. SSW5, 2004, pp. 179–184.

[18] Keiichi Tokuda, Takao Kobayashi, Takashi Masuko, and Satoshi Imai, "Mel-generalized cepstral analysis - a unified approach," in Proc. ICSLP, 1994, pp. 1043–1046.

[19] Masanori Morise, Fumiya Yokomori, and Kenji Ozawa, "WORLD: A vocoder-based high-quality speech synthesis system for real-time applications," IEICE Trans. on Information and Systems, vol. 99, no. 7, pp. 1877–1884, 2016.

[20] Felix Weninger, Johannes Bergmann, and Björn Schuller, "Introducing CURRENNT: The Munich open-source CUDA recurrent neural network toolkit," The Journal of Machine Learning Research, vol. 16, no. 1, pp. 547–551, 2015.

[21] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, "Deep residual learning for image recognition," in Proc. CVPR, 2016, pp. 770–778.

[22] Gunnar Fant, Johan Liljencrants, and Qi-guang Lin, "A four-parameter model of glottal flow," STL-QPSR, vol. 4, no. 1985, pp. 1–13, 1985.

[23] Lauri Juvela, Bajibabu Bollepalli, Manu Airaksinen, and Paavo Alku, "High-pitched excitation generation for glottal vocoding in statistical parametric speech synthesis using a deep neural network," in Proc. ICASSP, 2016, pp. 5120–5124.

[24] John Makhoul, R. Viswanathan, Richard Schwartz, and A. W. F. Huggins, "A mixed-source model for speech compression and synthesis," The Journal of the Acoustical Society of America, vol. 64, no. 6, pp. 1577–1581, 1978.

[25] D. W. Griffin and J. S. Lim, "Multiband excitation vocoder," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 36, no. 8, pp. 1223–1235, Aug. 1988.

A. FORWARD COMPUTATION
Figure 5 plots the two steps to derive the spectrum from the generated waveform $\hat{o}_{1:T}$. We use $\hat{x}^{(n)} = [\hat{x}^{(n)}_1, \cdots, \hat{x}^{(n)}_M]^\top \in \mathbb{R}^M$ to denote the $n$-th waveform frame of length $M$. We then use $\hat{y}^{(n)} = [\hat{y}^{(n)}_1, \cdots, \hat{y}^{(n)}_K]^\top \in \mathbb{C}^K$ to denote the spectrum of $\hat{x}^{(n)}$ calculated using a $K$-point DFT, i.e., $\hat{y}^{(n)} = \mathrm{DFT}_K(\hat{x}^{(n)})$. For the fast Fourier transform, $K$ is set to a power of 2, and $\hat{x}^{(n)}$ is zero-padded to length $K$ before the DFT.
Fig. 5. Framing/windowing and DFT steps. $T$, $M$, and $K$ denote the waveform length, frame length, and number of DFT bins, respectively.

A.1. Framing and windowing
The framing and windowing operation is parallelized over $\hat{x}^{(n)}_m$ using the for_each command in CUDA/Thrust. For explanation, however, let us use the matrix operation in Figure 6. In other words, we compute

$$\hat{x}^{(n)}_m = \sum_{t=1}^{T} \hat{o}_t w^{(n,m)}_t, \quad (4)$$

where $w^{(n,m)}_t$ is the element in the $[(n-1) \times M + m]$-th row and the $t$-th column of the transformation matrix $W$.

Fig. 6. A matrix format of the framing/windowing operation, where $w_1, \cdots, w_M$ denote the coefficients of the Hann window.
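For a sliding window, the sparse matrix multiplication of Equation (4) reduces to the following sketch. The symmetric Hann-window convention is an assumption; the paper only states that a Hann window is used.

```python
import math

def frame_and_window(o, frame_len, frame_shift):
    """Sketch of the framing/windowing step (Equation (4)).

    Slices waveform o into overlapping frames starting at multiples of
    frame_shift and applies a symmetric Hann window of length frame_len.
    """
    window = [0.5 - 0.5 * math.cos(2.0 * math.pi * m / (frame_len - 1))
              for m in range(frame_len)]
    frames = []
    start = 0
    while start + frame_len <= len(o):
        frames.append([o[start + m] * window[m] for m in range(frame_len)])
        start += frame_shift
    return frames
```

With the Table 1 configurations this would be called, e.g., as frame_and_window(o, 320, 80) for the 20 ms / 5 ms setting.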
Our implementation uses cuFFT (cufftExecR2C and cufftPlan1d) to compute $\{y^{(1)}, \cdots, y^{(N)}\}$ in parallel (see https://docs.nvidia.com/cuda/cufft/index.html).

B. BACKWARD COMPUTATION

For back-propagation, we need to compute the gradient $\frac{\partial \mathcal{L}}{\partial \hat{o}_{1:T}} \in \mathbb{R}^T$ following the steps plotted in Figure 7.
Fig. 7. Steps to compute the gradients.
B.1. The second step: from $\partial\mathcal{L}/\partial\hat{x}^{(n)}_m$ to $\partial\mathcal{L}/\partial\hat{o}_t$

Suppose we have $\{\frac{\partial\mathcal{L}}{\partial\hat{x}^{(1)}}, \cdots, \frac{\partial\mathcal{L}}{\partial\hat{x}^{(N)}}\}$, where each $\frac{\partial\mathcal{L}}{\partial\hat{x}^{(n)}} \in \mathbb{R}^M$ and $\frac{\partial\mathcal{L}}{\partial\hat{x}^{(n)}_m} \in \mathbb{R}$. Then, on the basis of Equation (4), we can compute

$$\frac{\partial\mathcal{L}}{\partial\hat{o}_t} = \sum_{n=1}^{N}\sum_{m=1}^{M} \frac{\partial\mathcal{L}}{\partial\hat{x}^{(n)}_m} w^{(n,m)}_t, \quad (5)$$

where $w^{(n,m)}_t$ are the framing/windowing coefficients. This equation explains what we mean by saying that '$\frac{\partial\mathcal{L}}{\partial\hat{o}_t}$ can be easily accumulated since the relationship between $\hat{o}_t$ and each $\hat{x}^{(n)}_m$ has been determined by the framing and windowing operations'.

Our implementation uses the CUDA/Thrust for_each command to launch $T$ threads and compute $\frac{\partial\mathcal{L}}{\partial\hat{o}_t}, t \in \{1, \cdots, T\}$ in parallel. Because $\hat{o}_t$ is only used in a few frames and there is only one $w^{(n,m)}_t \neq 0$ for each pair $\{n, t\}$, Equation (5) can be optimized as

$$\frac{\partial\mathcal{L}}{\partial\hat{o}_t} = \sum_{n=N_{t,\min}}^{N_{t,\max}} \frac{\partial\mathcal{L}}{\partial\hat{x}^{(n)}_{m_{t,n}}} w^{(n,m_{t,n})}_t, \quad (6)$$

where $[N_{t,\min}, N_{t,\max}]$ is the range of frames in which $\hat{o}_t$ appears, and $m_{t,n}$ is the position of $\hat{o}_t$ in the $n$-th frame.

B.2. The first step: computing $\partial\mathcal{L}/\partial\hat{x}^{(n)}_m$

Remember that $\hat{y}^{(n)} = [\hat{y}^{(n)}_1, \cdots, \hat{y}^{(n)}_K]^\top \in \mathbb{C}^K$ is the $K$-point DFT spectrum of $\hat{x}^{(n)} = [\hat{x}^{(n)}_1, \cdots, \hat{x}^{(n)}_M]^\top \in \mathbb{R}^M$. Therefore we know

$$\mathrm{Re}(\hat{y}^{(n)}_k) = \sum_{m=1}^{M} \hat{x}^{(n)}_m \cos\big(\tfrac{2\pi}{K}(k-1)(m-1)\big), \quad (7)$$

$$\mathrm{Im}(\hat{y}^{(n)}_k) = -\sum_{m=1}^{M} \hat{x}^{(n)}_m \sin\big(\tfrac{2\pi}{K}(k-1)(m-1)\big), \quad (8)$$

where $k \in [1, K]$. Note that, although the sum should be $\sum_{m=1}^{K}$, the summation over the zero-padded part, $\sum_{m=M+1}^{K} \hat{x}^{(n)}_m \cos(\tfrac{2\pi}{K}(k-1)(m-1))$ and its sine counterpart, can be safely ignored because those samples are zero. Although we can avoid zero-padding by setting $K = M$, in practice $K$ is usually a power of 2 while the frame length $M$ is not.
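Equation (5) is an overlap-add of per-frame gradients, which the following sketch makes explicit. Frames are assumed to start at multiples of the frame shift, matching the framing sketch used earlier.

```python
def accumulate_waveform_gradient(frame_grads, frame_shift, window, T):
    """Sketch of Equation (5): fold per-frame gradients dL/dx^(n)_m
    back into the waveform gradient dL/do_t.

    frame_grads[n][m] is the gradient w.r.t. the m-th windowed sample of
    the n-th frame; each contribution is scaled by its windowing
    coefficient and added at waveform position t = n*frame_shift + m.
    """
    grad_o = [0.0] * T
    for n, g in enumerate(frame_grads):
        for m, g_m in enumerate(g):
            t = n * frame_shift + m
            if t < T:
                grad_o[t] += g_m * window[m]
    return grad_o
```

This is the dense Equation (5) specialized to the sliding-window case; the per-thread form of Equation (6) simply inverts the loop order so that each output position $t$ gathers its few nonzero terms independently.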
Suppose we compute a log spectral amplitude distance $\mathcal{L}$ over the $N$ frames as

$$\mathcal{L} = \frac{1}{2}\sum_{n=1}^{N}\sum_{k=1}^{K}\left[\log\frac{\mathrm{Re}(y^{(n)}_k)^2 + \mathrm{Im}(y^{(n)}_k)^2}{\mathrm{Re}(\hat{y}^{(n)}_k)^2 + \mathrm{Im}(\hat{y}^{(n)}_k)^2}\right]^2. \quad (9)$$

Because $\mathcal{L}$, $\hat{x}^{(n)}_m$, $\mathrm{Re}(\hat{y}^{(n)}_k)$, and $\mathrm{Im}(\hat{y}^{(n)}_k)$ are real-valued, we can compute the gradient $\frac{\partial\mathcal{L}}{\partial\hat{x}^{(n)}_m}$ using the chain rule:

$$\frac{\partial\mathcal{L}}{\partial\hat{x}^{(n)}_m} = \sum_{k=1}^{K} \frac{\partial\mathcal{L}}{\partial\mathrm{Re}(\hat{y}^{(n)}_k)} \frac{\partial\mathrm{Re}(\hat{y}^{(n)}_k)}{\partial\hat{x}^{(n)}_m} + \sum_{k=1}^{K} \frac{\partial\mathcal{L}}{\partial\mathrm{Im}(\hat{y}^{(n)}_k)} \frac{\partial\mathrm{Im}(\hat{y}^{(n)}_k)}{\partial\hat{x}^{(n)}_m} \quad (10)$$

$$= \sum_{k=1}^{K} \frac{\partial\mathcal{L}}{\partial\mathrm{Re}(\hat{y}^{(n)}_k)} \cos\big(\tfrac{2\pi}{K}(k-1)(m-1)\big) - \sum_{k=1}^{K} \frac{\partial\mathcal{L}}{\partial\mathrm{Im}(\hat{y}^{(n)}_k)} \sin\big(\tfrac{2\pi}{K}(k-1)(m-1)\big). \quad (11)$$

Once we compute $\frac{\partial\mathcal{L}}{\partial\hat{x}^{(n)}_m}$ for each $m$ and $n$, we can use Equation (6) to compute the gradient $\frac{\partial\mathcal{L}}{\partial\hat{o}_t}$.

B.3. Implementation of the first step using inverse DFT
Because $\frac{\partial\mathcal{L}}{\partial\mathrm{Re}(\hat{y}^{(n)}_k)}$ and $\frac{\partial\mathcal{L}}{\partial\mathrm{Im}(\hat{y}^{(n)}_k)}$ are real numbers, we can directly implement Equation (11) using matrix multiplication. However, a more efficient way is to use the inverse DFT (iDFT).

Suppose we have a complex-valued signal $g=[g_1,g_2,\cdots,g_K]^\top\in\mathbb{C}^K$. We compute $b=[b_1,\cdots,b_K]$ as the $K$-point inverse DFT of $g$ by
$$b_m=\sum_{k=1}^{K}g_k e^{j\frac{2\pi}{K}(k-1)(m-1)} \qquad(12)$$
$$=\sum_{k=1}^{K}\big[\mathrm{Re}(g_k)+j\mathrm{Im}(g_k)\big]\Big[\cos\big(\tfrac{2\pi}{K}(k-1)(m-1)\big)+j\sin\big(\tfrac{2\pi}{K}(k-1)(m-1)\big)\Big] \qquad(13)$$
$$=\sum_{k=1}^{K}\mathrm{Re}(g_k)\cos\big(\tfrac{2\pi}{K}(k-1)(m-1)\big)-\sum_{k=1}^{K}\mathrm{Im}(g_k)\sin\big(\tfrac{2\pi}{K}(k-1)(m-1)\big) \qquad(14)$$
$$+j\Big[\sum_{k=1}^{K}\mathrm{Re}(g_k)\sin\big(\tfrac{2\pi}{K}(k-1)(m-1)\big)+\sum_{k=1}^{K}\mathrm{Im}(g_k)\cos\big(\tfrac{2\pi}{K}(k-1)(m-1)\big)\Big]. \qquad(15)$$
For the first term in Line (15), we can write
$$\sum_{k=1}^{K}\mathrm{Re}(g_k)\sin\big(\tfrac{2\pi}{K}(k-1)(m-1)\big) \qquad(16)$$
$$=\mathrm{Re}(g_1)\sin\big(\tfrac{2\pi}{K}\cdot 0\cdot(m-1)\big)+\mathrm{Re}(g_{\frac{K}{2}+1})\sin\big(\tfrac{2\pi}{K}\cdot\tfrac{K}{2}\cdot(m-1)\big) \qquad(17)$$
$$+\sum_{k=2}^{K/2}\mathrm{Re}(g_k)\sin\big(\tfrac{2\pi}{K}(k-1)(m-1)\big)+\sum_{k=K/2+2}^{K}\mathrm{Re}(g_k)\sin\big(\tfrac{2\pi}{K}(k-1)(m-1)\big) \qquad(18)$$
$$=\sum_{k=2}^{K/2}\Big[\mathrm{Re}(g_k)\sin\big(\tfrac{2\pi}{K}(k-1)(m-1)\big)+\mathrm{Re}(g_{K+2-k})\sin\big(\tfrac{2\pi}{K}(K+2-k-1)(m-1)\big)\Big] \qquad(19)$$
$$=\sum_{k=2}^{K/2}\Big[\mathrm{Re}(g_k)-\mathrm{Re}(g_{K+2-k})\Big]\sin\big(\tfrac{2\pi}{K}(k-1)(m-1)\big). \qquad(20)$$
Note that in Line (17), $\mathrm{Re}(g_1)\sin\big(\tfrac{2\pi}{K}\cdot 0\cdot(m-1)\big)=\mathrm{Re}(g_1)\sin(0)=0$, and $\mathrm{Re}(g_{\frac{K}{2}+1})\sin\big(\tfrac{2\pi}{K}\cdot\tfrac{K}{2}\cdot(m-1)\big)=\mathrm{Re}(g_{\frac{K}{2}+1})\sin\big((m-1)\pi\big)=0$.

It is easy to see that Line (20) is equal to 0 if $\mathrm{Re}(g_k)=\mathrm{Re}(g_{K+2-k})$ for any $k\in[2,\frac{K}{2}]$. Similarly, it can be shown that $\sum_{k=1}^{K}\mathrm{Im}(g_k)\cos\big(\tfrac{2\pi}{K}(k-1)(m-1)\big)=0$ if $\mathrm{Im}(g_k)=-\mathrm{Im}(g_{K+2-k})$ for $k\in[2,\frac{K}{2}]$ and $\mathrm{Im}(g_1)=\mathrm{Im}(g_{\frac{K}{2}+1})=0$. When these two terms are equal to 0, the imaginary part in Line (15) will be 0, and $b_m=\sum_{k=1}^{K}g_k e^{j\frac{2\pi}{K}(k-1)(m-1)}$ in Line (12) will be a real number. Note that the iDFT here is un-normalized; cuFFT also performs un-normalized FFTs, i.e., the scaling factor $\frac{1}{K}$ is not applied.
To summarize, if $g$ satisfies the conditions
$$\mathrm{Re}(g_k)=\mathrm{Re}(g_{K+2-k}),\quad k\in[2,K], \qquad(21)$$
$$\mathrm{Im}(g_k)=\begin{cases}-\mathrm{Im}(g_{K+2-k}), & k\in[2,K]\\ 0, & k\in\{1,\frac{K}{2}+1\}\end{cases} \qquad(22)$$
the inverse DFT of $g$ will be real-valued:
$$\sum_{k=1}^{K}g_k e^{j\frac{2\pi}{K}(k-1)(m-1)}=\sum_{k=1}^{K}\mathrm{Re}(g_k)\cos\big(\tfrac{2\pi}{K}(k-1)(m-1)\big)-\sum_{k=1}^{K}\mathrm{Im}(g_k)\sin\big(\tfrac{2\pi}{K}(k-1)(m-1)\big). \qquad(23)$$
This is a basic concept in signal processing: the iDFT of a conjugate-symmetric (Hermitian) signal is a real-valued signal.

We can observe from Equations (23) and (11) that, if $\big[\frac{\partial\mathcal{L}}{\partial\mathrm{Re}(\hat{y}^{(n)}_1)}+j\frac{\partial\mathcal{L}}{\partial\mathrm{Im}(\hat{y}^{(n)}_1)},\cdots,\frac{\partial\mathcal{L}}{\partial\mathrm{Re}(\hat{y}^{(n)}_K)}+j\frac{\partial\mathcal{L}}{\partial\mathrm{Im}(\hat{y}^{(n)}_K)}\big]^\top$ is conjugate-symmetric, the gradient vector $\frac{\partial\mathcal{L}}{\partial\hat{x}^{(n)}}=[\frac{\partial\mathcal{L}}{\partial\hat{x}^{(n)}_1},\cdots,\frac{\partial\mathcal{L}}{\partial\hat{x}^{(n)}_M}]^\top$ can be computed using the iDFT:
$$\Big[\frac{\partial\mathcal{L}}{\partial\hat{x}^{(n)}_1},\cdots,\frac{\partial\mathcal{L}}{\partial\hat{x}^{(n)}_M},\frac{\partial\mathcal{L}}{\partial\hat{x}^{(n)}_{M+1}},\cdots,\frac{\partial\mathcal{L}}{\partial\hat{x}^{(n)}_K}\Big]^\top=\mathrm{iDFT}\Big(\Big[\frac{\partial\mathcal{L}}{\partial\mathrm{Re}(\hat{y}^{(n)}_1)}+j\frac{\partial\mathcal{L}}{\partial\mathrm{Im}(\hat{y}^{(n)}_1)},\cdots,\frac{\partial\mathcal{L}}{\partial\mathrm{Re}(\hat{y}^{(n)}_K)}+j\frac{\partial\mathcal{L}}{\partial\mathrm{Im}(\hat{y}^{(n)}_K)}\Big]^\top\Big). \qquad(24)$$
Note that $\{\frac{\partial\mathcal{L}}{\partial\hat{x}^{(n)}_{M+1}},\cdots,\frac{\partial\mathcal{L}}{\partial\hat{x}^{(n)}_K}\}$ are the gradients w.r.t. the zero-padded part, which will not be used and can safely be set to 0. The iDFT of a conjugate-symmetric signal can be executed using the cuFFT cufftExecC2R command. This is more efficient than other implementations of Equation (11) because:
• there is no need to compute the imaginary part;
• there is no need to compute or allocate GPU memory for $g_k$ where $k\in[\frac{K}{2}+2,K]$, because of the conjugate symmetry;
• the iDFT can be executed for all the $N$ frames in parallel.
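The shortcut in Equation (24) can be sketched with NumPy's inverse real FFT, which plays the same role as cufftExecC2R: only the first $K/2+1$ bins of the conjugate-symmetric gradient are needed, and the output is real by construction. Note that, unlike cuFFT, `np.fft.irfft` does apply the $1/K$ scaling, so we multiply by $K$; the quadratic loss is again only an illustration:

```python
import numpy as np

np.random.seed(1)
M, K = 10, 16
x = np.random.randn(M)
y = np.fft.fft(x, n=K)

# toy loss L = sum(|y_k|^2): dL/dRe = 2*Re(y), dL/dIm = 2*Im(y);
# this packed vector is conjugate-symmetric because x is real
g = 2 * y.real + 1j * (2 * y.imag)

# Equation (24): one C2R inverse FFT over the first K/2+1 bins
grad = K * np.fft.irfft(g[:K // 2 + 1], n=K)   # undo numpy's 1/K scaling

dL_dx = grad[:M]     # the remaining bins belong to the zero-padded part
```

For this toy loss the result should again equal $2Kx$ (Parseval), and only $K/2+1$ complex bins are ever stored, mirroring the memory saving listed above.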
B.4. Conjugate symmetry of the complex-valued gradient vector
The conjugate symmetry of $\big[\frac{\partial\mathcal{L}}{\partial\mathrm{Re}(\hat{y}^{(n)}_1)}+j\frac{\partial\mathcal{L}}{\partial\mathrm{Im}(\hat{y}^{(n)}_1)},\cdots,\frac{\partial\mathcal{L}}{\partial\mathrm{Re}(\hat{y}^{(n)}_K)}+j\frac{\partial\mathcal{L}}{\partial\mathrm{Im}(\hat{y}^{(n)}_K)}\big]^\top$ is satisfied if $\mathcal{L}$ is carefully chosen. Luckily, most of the common distance metrics can be used.

B.4.1. Log spectral amplitude distance
Given the log spectral amplitude distance $\mathcal{L}_s$ in Equation (9), we can compute
$$\frac{\partial\mathcal{L}_s}{\partial\mathrm{Re}(\hat{y}^{(n)}_k)}=\Big[\log\big(\mathrm{Re}(\hat{y}^{(n)}_k)^2+\mathrm{Im}(\hat{y}^{(n)}_k)^2\big)-\log\big(\mathrm{Re}(y^{(n)}_k)^2+\mathrm{Im}(y^{(n)}_k)^2\big)\Big]\frac{2\mathrm{Re}(\hat{y}^{(n)}_k)}{\mathrm{Re}(\hat{y}^{(n)}_k)^2+\mathrm{Im}(\hat{y}^{(n)}_k)^2}, \qquad(25)$$
$$\frac{\partial\mathcal{L}_s}{\partial\mathrm{Im}(\hat{y}^{(n)}_k)}=\Big[\log\big(\mathrm{Re}(\hat{y}^{(n)}_k)^2+\mathrm{Im}(\hat{y}^{(n)}_k)^2\big)-\log\big(\mathrm{Re}(y^{(n)}_k)^2+\mathrm{Im}(y^{(n)}_k)^2\big)\Big]\frac{2\mathrm{Im}(\hat{y}^{(n)}_k)}{\mathrm{Re}(\hat{y}^{(n)}_k)^2+\mathrm{Im}(\hat{y}^{(n)}_k)^2}. \qquad(26)$$
Because $\hat{y}^{(n)}$ is the DFT spectrum of a real-valued signal, $\hat{y}^{(n)}$ is conjugate-symmetric, and $\mathrm{Re}(\hat{y}^{(n)}_k)$ and $\mathrm{Im}(\hat{y}^{(n)}_k)$ satisfy the conditions in Equations (21) and (22), respectively. Because the scalar amplitude term $\mathrm{Re}(\hat{y}^{(n)}_k)^2+\mathrm{Im}(\hat{y}^{(n)}_k)^2$ does not change the symmetry, $\frac{\partial\mathcal{L}_s}{\partial\mathrm{Re}(\hat{y}^{(n)}_k)}$ and $\frac{\partial\mathcal{L}_s}{\partial\mathrm{Im}(\hat{y}^{(n)}_k)}$ also satisfy the conditions in Equations (21) and (22), respectively, and $\big[\frac{\partial\mathcal{L}_s}{\partial\mathrm{Re}(\hat{y}^{(n)}_1)}+j\frac{\partial\mathcal{L}_s}{\partial\mathrm{Im}(\hat{y}^{(n)}_1)},\cdots,\frac{\partial\mathcal{L}_s}{\partial\mathrm{Re}(\hat{y}^{(n)}_K)}+j\frac{\partial\mathcal{L}_s}{\partial\mathrm{Im}(\hat{y}^{(n)}_K)}\big]^\top$ is conjugate-symmetric. (Strictly speaking, this should be called circular conjugate symmetry.)
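Equations (25) and (26), and the claimed conjugate symmetry of the packed gradient, can be checked numerically for a single frame. This is a NumPy sketch with random frames; in practice a small floor on the power terms may be needed to avoid division by zero, which we omit here:

```python
import numpy as np

np.random.seed(2)
M, K = 10, 16
x_hat, x_ref = np.random.randn(M), np.random.randn(M)   # generated / natural frames
yh, yr = np.fft.fft(x_hat, n=K), np.fft.fft(x_ref, n=K)

ph = yh.real ** 2 + yh.imag ** 2     # power of the generated spectrum
pr = yr.real ** 2 + yr.imag ** 2     # power of the natural spectrum

Ls = 0.5 * np.sum(np.log(pr / ph) ** 2)        # Equation (9), one frame

# Equations (25) and (26)
diff = np.log(ph) - np.log(pr)
dLs_dRe = diff * 2 * yh.real / ph
dLs_dIm = diff * 2 * yh.imag / ph
```

Here `dLs_dRe` is even and `dLs_dIm` is odd across the spectrum, i.e., `dLs_dRe + 1j*dLs_dIm` satisfies Equations (21) and (22) as claimed.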
B.4.2. Phase distance

Let $\hat{\theta}^{(n)}_k$ and $\theta^{(n)}_k$ be the phases of $\hat{y}^{(n)}_k$ and $y^{(n)}_k$, respectively. Then, the phase distance is defined as
$$\mathcal{L}_p=\frac{1}{2}\sum_{n=1}^{N}\sum_{k=1}^{K}\Big|1-\exp\big(j(\hat{\theta}^{(n)}_k-\theta^{(n)}_k)\big)\Big|^2 \qquad(27)$$
$$=\frac{1}{2}\sum_{n=1}^{N}\sum_{k=1}^{K}\Big|1-\cos(\hat{\theta}^{(n)}_k-\theta^{(n)}_k)-j\sin(\hat{\theta}^{(n)}_k-\theta^{(n)}_k)\Big|^2 \qquad(28)$$
$$=\frac{1}{2}\sum_{n=1}^{N}\sum_{k=1}^{K}\Big[\big(1-\cos(\hat{\theta}^{(n)}_k-\theta^{(n)}_k)\big)^2+\sin^2(\hat{\theta}^{(n)}_k-\theta^{(n)}_k)\Big] \qquad(29)$$
$$=\frac{1}{2}\sum_{n=1}^{N}\sum_{k=1}^{K}\Big[1-2\cos(\hat{\theta}^{(n)}_k-\theta^{(n)}_k)+\cos^2(\hat{\theta}^{(n)}_k-\theta^{(n)}_k)+\sin^2(\hat{\theta}^{(n)}_k-\theta^{(n)}_k)\Big] \qquad(30)$$
$$=\sum_{n=1}^{N}\sum_{k=1}^{K}\Big(1-\cos(\hat{\theta}^{(n)}_k-\theta^{(n)}_k)\Big) \qquad(31)$$
$$=\sum_{n=1}^{N}\sum_{k=1}^{K}\Big[1-\big(\cos(\hat{\theta}^{(n)}_k)\cos(\theta^{(n)}_k)+\sin(\hat{\theta}^{(n)}_k)\sin(\theta^{(n)}_k)\big)\Big] \qquad(32)$$
$$=\sum_{n=1}^{N}\sum_{k=1}^{K}\Big[1-\frac{\mathrm{Re}(\hat{y}^{(n)}_k)\mathrm{Re}(y^{(n)}_k)+\mathrm{Im}(\hat{y}^{(n)}_k)\mathrm{Im}(y^{(n)}_k)}{\sqrt{\mathrm{Re}(\hat{y}^{(n)}_k)^2+\mathrm{Im}(\hat{y}^{(n)}_k)^2}\,\sqrt{\mathrm{Re}(y^{(n)}_k)^2+\mathrm{Im}(y^{(n)}_k)^2}}\Big], \qquad(33)$$
where
$$\cos(\hat{\theta}^{(n)}_k)=\frac{\mathrm{Re}(\hat{y}^{(n)}_k)}{\sqrt{\mathrm{Re}(\hat{y}^{(n)}_k)^2+\mathrm{Im}(\hat{y}^{(n)}_k)^2}},\qquad \cos(\theta^{(n)}_k)=\frac{\mathrm{Re}(y^{(n)}_k)}{\sqrt{\mathrm{Re}(y^{(n)}_k)^2+\mathrm{Im}(y^{(n)}_k)^2}}, \qquad(34)$$
$$\sin(\hat{\theta}^{(n)}_k)=\frac{\mathrm{Im}(\hat{y}^{(n)}_k)}{\sqrt{\mathrm{Re}(\hat{y}^{(n)}_k)^2+\mathrm{Im}(\hat{y}^{(n)}_k)^2}},\qquad \sin(\theta^{(n)}_k)=\frac{\mathrm{Im}(y^{(n)}_k)}{\sqrt{\mathrm{Re}(y^{(n)}_k)^2+\mathrm{Im}(y^{(n)}_k)^2}}. \qquad(35)$$
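The equivalence of the forms in Equations (27), (31), and (33), and the closed-form gradients that the derivation arrives at in Equations (40) and (42), can be checked numerically. A single-frame NumPy sketch with random frames (no numerical flooring of the amplitudes):

```python
import numpy as np

np.random.seed(3)
M, K = 10, 16
x_hat, x_ref = np.random.randn(M), np.random.randn(M)
yh, yr = np.fft.fft(x_hat, n=K), np.fft.fft(x_ref, n=K)
ah, ar = np.abs(yh), np.abs(yr)          # spectral amplitudes
th, tr = np.angle(yh), np.angle(yr)      # phases theta_hat and theta

# Equation (27) and Equation (33): two equivalent forms of L_p
Lp_a = 0.5 * np.sum(np.abs(1 - np.exp(1j * (th - tr))) ** 2)
Lp_b = np.sum(1 - (yh.real * yr.real + yh.imag * yr.imag) / (ah * ar))

# Equations (40) and (42)
cross = yr.real * yh.imag - yr.imag * yh.real
dLp_dRe = -cross * yh.imag / (ar * ah ** 3)
dLp_dIm = cross * yh.real / (ar * ah ** 3)
```

Both loss values should agree, and the analytic gradients should match finite differences bin by bin.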
Therefore, we get
$$\frac{\partial\mathcal{L}_p}{\partial\mathrm{Re}(\hat{y}^{(n)}_k)}=-\cos(\theta^{(n)}_k)\frac{\partial\cos(\hat{\theta}^{(n)}_k)}{\partial\mathrm{Re}(\hat{y}^{(n)}_k)}-\sin(\theta^{(n)}_k)\frac{\partial\sin(\hat{\theta}^{(n)}_k)}{\partial\mathrm{Re}(\hat{y}^{(n)}_k)} \qquad(36)$$
$$=-\cos(\theta^{(n)}_k)\frac{\sqrt{\mathrm{Re}(\hat{y}^{(n)}_k)^2+\mathrm{Im}(\hat{y}^{(n)}_k)^2}-\mathrm{Re}(\hat{y}^{(n)}_k)\frac{2\mathrm{Re}(\hat{y}^{(n)}_k)}{2\sqrt{\mathrm{Re}(\hat{y}^{(n)}_k)^2+\mathrm{Im}(\hat{y}^{(n)}_k)^2}}}{\mathrm{Re}(\hat{y}^{(n)}_k)^2+\mathrm{Im}(\hat{y}^{(n)}_k)^2}-\sin(\theta^{(n)}_k)\frac{-\mathrm{Im}(\hat{y}^{(n)}_k)\frac{2\mathrm{Re}(\hat{y}^{(n)}_k)}{2\sqrt{\mathrm{Re}(\hat{y}^{(n)}_k)^2+\mathrm{Im}(\hat{y}^{(n)}_k)^2}}}{\mathrm{Re}(\hat{y}^{(n)}_k)^2+\mathrm{Im}(\hat{y}^{(n)}_k)^2} \qquad(37)$$
$$=-\cos(\theta^{(n)}_k)\frac{\mathrm{Re}(\hat{y}^{(n)}_k)^2+\mathrm{Im}(\hat{y}^{(n)}_k)^2-\mathrm{Re}(\hat{y}^{(n)}_k)^2}{\big(\mathrm{Re}(\hat{y}^{(n)}_k)^2+\mathrm{Im}(\hat{y}^{(n)}_k)^2\big)^{3/2}}-\sin(\theta^{(n)}_k)\frac{-\mathrm{Im}(\hat{y}^{(n)}_k)\mathrm{Re}(\hat{y}^{(n)}_k)}{\big(\mathrm{Re}(\hat{y}^{(n)}_k)^2+\mathrm{Im}(\hat{y}^{(n)}_k)^2\big)^{3/2}} \qquad(38)$$
$$=-\frac{\mathrm{Re}(y^{(n)}_k)\mathrm{Re}(\hat{y}^{(n)}_k)^2+\mathrm{Re}(y^{(n)}_k)\mathrm{Im}(\hat{y}^{(n)}_k)^2-\mathrm{Re}(y^{(n)}_k)\mathrm{Re}(\hat{y}^{(n)}_k)^2-\mathrm{Im}(y^{(n)}_k)\mathrm{Im}(\hat{y}^{(n)}_k)\mathrm{Re}(\hat{y}^{(n)}_k)}{\sqrt{\mathrm{Re}(y^{(n)}_k)^2+\mathrm{Im}(y^{(n)}_k)^2}\,\big(\mathrm{Re}(\hat{y}^{(n)}_k)^2+\mathrm{Im}(\hat{y}^{(n)}_k)^2\big)^{3/2}} \qquad(39)$$
$$=-\frac{\big(\mathrm{Re}(y^{(n)}_k)\mathrm{Im}(\hat{y}^{(n)}_k)-\mathrm{Im}(y^{(n)}_k)\mathrm{Re}(\hat{y}^{(n)}_k)\big)\mathrm{Im}(\hat{y}^{(n)}_k)}{\sqrt{\mathrm{Re}(y^{(n)}_k)^2+\mathrm{Im}(y^{(n)}_k)^2}\,\big(\mathrm{Re}(\hat{y}^{(n)}_k)^2+\mathrm{Im}(\hat{y}^{(n)}_k)^2\big)^{3/2}}, \qquad(40)$$
$$\frac{\partial\mathcal{L}_p}{\partial\mathrm{Im}(\hat{y}^{(n)}_k)}=-\cos(\theta^{(n)}_k)\frac{\partial\cos(\hat{\theta}^{(n)}_k)}{\partial\mathrm{Im}(\hat{y}^{(n)}_k)}-\sin(\theta^{(n)}_k)\frac{\partial\sin(\hat{\theta}^{(n)}_k)}{\partial\mathrm{Im}(\hat{y}^{(n)}_k)} \qquad(41)$$
$$=-\frac{\big(\mathrm{Im}(y^{(n)}_k)\mathrm{Re}(\hat{y}^{(n)}_k)-\mathrm{Re}(y^{(n)}_k)\mathrm{Im}(\hat{y}^{(n)}_k)\big)\mathrm{Re}(\hat{y}^{(n)}_k)}{\sqrt{\mathrm{Re}(y^{(n)}_k)^2+\mathrm{Im}(y^{(n)}_k)^2}\,\big(\mathrm{Re}(\hat{y}^{(n)}_k)^2+\mathrm{Im}(\hat{y}^{(n)}_k)^2\big)^{3/2}}. \qquad(42)$$
Because both $y^{(n)}$ and $\hat{y}^{(n)}$ are conjugate-symmetric, it can be easily observed that $\frac{\partial\mathcal{L}_p}{\partial\mathrm{Re}(\hat{y}^{(n)}_k)}$ and $\frac{\partial\mathcal{L}_p}{\partial\mathrm{Im}(\hat{y}^{(n)}_k)}$ satisfy the conditions in Equations (21) and (22), respectively.

C. MULTIPLE DISTANCE METRICS

Different distance metrics can be merged easily. For example, we can define
$$\mathcal{L}=\mathcal{L}_{s_1}+\cdots+\mathcal{L}_{s_S}+\mathcal{L}_{p_1}+\cdots+\mathcal{L}_{p_P}, \qquad(43)$$
where each $\mathcal{L}_{s_*}\in\mathbb{R}$ and $\mathcal{L}_{p_*}\in\mathbb{R}$ may use a different number of DFT bins, frame length, or frame shift.
Although the dimensions of the gradient vectors $\frac{\partial\mathcal{L}_*}{\partial\hat{x}^{(n)}}$ may be different, each gradient $\frac{\partial\mathcal{L}_*}{\partial\hat{o}_{1:T}}\in\mathbb{R}^T$ will always be a real-valued vector of dimension $T$ after de-framing/windowing. The gradients can then simply be merged together as
$$\frac{\partial\mathcal{L}}{\partial\hat{o}_{1:T}}=\frac{\partial\mathcal{L}_{s_1}}{\partial\hat{o}_{1:T}}+\cdots+\frac{\partial\mathcal{L}_{s_S}}{\partial\hat{o}_{1:T}}+\frac{\partial\mathcal{L}_{p_1}}{\partial\hat{o}_{1:T}}+\cdots+\frac{\partial\mathcal{L}_{p_P}}{\partial\hat{o}_{1:T}}. \qquad(44)$$
[Figure: computing the merged gradient, where each distance metric uses its own framing/windowing and DFT configuration.]
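Putting Equations (43) and (44) together: the whole framing/DFT pipeline is simply run once per configuration, and the resulting per-sample gradients are summed. A NumPy sketch with two hypothetical configurations (the sizes are illustrative, not the paper's settings) and the same toy quadratic loss used earlier:

```python
import numpy as np

np.random.seed(4)
T = 64
o = np.random.randn(T)
configs = [(16, 8, 32), (8, 4, 16)]   # (frame length M, shift S, DFT bins K)

dL_do = np.zeros(T)                    # Equation (44): sum over configurations
for M, S, K in configs:
    win = np.hanning(M)
    N = (T - M) // S + 1
    for n in range(N):
        seg = o[n * S:n * S + M] * win
        y = np.fft.fft(seg, n=K)
        # toy per-frame loss sum(|y|^2): gradient via the C2R trick (Eq. (24))
        g = 2 * y.real + 1j * (2 * y.imag)
        dseg = (K * np.fft.irfft(g[:K // 2 + 1], n=K))[:M]
        dL_do[n * S:n * S + M] += dseg * win   # de-framing, Equation (5)
```

Each configuration contributes a length-$T$ real vector regardless of its own $M$, $S$, and $K$, which is why the sum in Equation (44) is well defined.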