STFT SPECTRAL LOSS FOR TRAINING A NEURAL SPEECH WAVEFORM MODEL
Shinji Takaki, Toru Nakashika, Xin Wang, Junichi Yamagishi
National Institute of Informatics, Japan; The University of Electro-Communications, Japan; The University of Edinburgh, UK
[email protected], [email protected], [email protected], [email protected]
ABSTRACT
This paper proposes a new loss using short-time Fourier transform (STFT) spectra for training a high-performance neural speech waveform model that predicts raw continuous speech waveform samples directly. Not only amplitude spectra but also phase spectra obtained from generated speech waveforms are used to calculate the proposed loss. We also show mathematically that training the waveform model on the basis of the proposed loss can be interpreted as maximum likelihood training in which the amplitude and phase spectra of generated speech waveforms are assumed to follow Gaussian and von Mises distributions, respectively. Furthermore, this paper presents a simple network architecture as the speech waveform model, which is composed of uni-directional long short-term memories (LSTMs) and an auto-regressive structure. Experimental results showed that the proposed neural model synthesized high-quality speech waveforms.
Index Terms — speech synthesis, neural waveform modeling, WaveNet
1. INTRODUCTION
Research on speech waveform modeling is advancing because of neural networks [1, 2]. The WaveNet [2] directly models waveform signals and demonstrates excellent performance. The WaveNet can also be used as a vocoder, which converts acoustic features, e.g., a mel-spectrogram sequence, into speech waveform signals [3, 4]. Such a neural speech waveform model used as a vocoder can be integrated into a text-to-speech synthesis system, and it has been shown to outperform conventional signal-processing-based vocoders [5]. Neural speech waveform models will be essential components for many speech synthesis applications.

There have been several investigations of neural speech waveform models regarding output distribution and training criteria. A categorical distribution for discrete waveform samples and cross entropy are introduced to train the original WaveNet [2]. A mixture of logistic distributions and a discretized logistic mixture likelihood are also used to train the improved WaveNet [6]. The parallel WaveNet [6] and the ClariNet [7] use a distilling approach to transfer the knowledge of an auto-regressive (AR) WaveNet to a simpler non-AR student model. The distilling approach typically introduces an amplitude spectral loss as an auxiliary loss in the distillation in order to avoid generating devoiced/whisper voices.
We believe that losses using short-time Fourier transform (STFT) spectra can be used beyond the distillation, i.e., for training speech waveform models themselves. Because it is known that (complex-valued) STFT spectra represent speech characteristics well, such spectra should be useful for efficient training of neural waveform models.

In this paper, we propose a new loss using the STFT spectra for training a high-performance neural speech waveform model that predicts raw continuous speech waveform samples directly (rather than as an auxiliary loss for a distillation process). Because not only amplitude spectra but also phase spectra represent speech characteristics [8], the proposed loss considers both amplitude and phase spectral losses obtained from generated speech waveforms. We also give an interpretation of training with the proposed loss based on probability distributions. This interpretation provides a better understanding of the proposed method and of its relationship to various spectral losses such as the Kullback-Leibler divergence or the Itakura–Saito divergence [9, 10]. This paper also presents a simple network architecture as a speech waveform model. The proposed network is composed of uni-directional long short-term memories (LSTMs) and an auto-regressive structure, unlike the WaveNet, which uses a relatively complicated convolutional neural network (CNN) with stacked dilated convolutions.

The rest of this paper is organized as follows. Section 2 presents a neural speech waveform model and the proposed loss to train it. Section 3 describes the proposed network architecture. Experimental results are presented in Section 4. We conclude in Section 5 with a summary and mention of our future work.

This work was partially supported by JST CREST Grant Number JPMJCR18A6, Japan, and by MEXT KAKENHI Grant Numbers 16H06302, 16K16096, 17H04687, 18H04120, 18H04112, 18KT0051, and 18K18069, Japan.
2. THE PROPOSED LOSS FOR A WAVEFORM MODEL
A neural speech waveform model targeted in this paper is as follows:

$\mathbf{y} = f^{(\lambda)}(\mathbf{x})$,   (1)

where $\mathbf{y} \in \mathbb{R}^{M}$, $\mathbf{x} = [\mathbf{x}_{1}^{\top}, \ldots, \mathbf{x}_{I}^{\top}]^{\top}$, $\mathbf{x}_{i} \in \mathbb{R}^{D}$, and $\lambda$ represent the neural network's outputs (i.e., speech waveform samples), an input sequence (e.g., a log-mel spectrogram), an input feature, and the parameters of the neural network, respectively. A sample index, a frame index, and the dimension of an input feature are represented by $m$, $i$, and $D$, respectively. Back-propagation is generally used to obtain optimal parameters $\lambda$ for a loss function. Because training the model is a regression task, a simple loss function is the squared error between natural speech waveform samples and the neural network's outputs $\mathbf{y}$:

$E^{(\mathrm{wav})} = \sum_{m} (\hat{y}_{m} - y_{m})^{2}$,   (2)

where $\hat{\cdot}$ denotes natural data. However, this criterion does not consider any information related to the frequency characteristics of speech. In this paper, we therefore propose loss functions using STFT spectra to effectively train a neural speech waveform model instead of using the above squared error as a loss function.

Fig. 1. STFT complex spectra calculated by using a matrix $\mathbf{W}$. Here, $\mathbf{W} \in \mathbb{C}^{LT \times M}$ represents an STFT operation, and $L$, $S$, and $\mathbf{W}^{(\mathrm{DFT})}$ denote frame length, frame shift, and a discrete Fourier transform (DFT) matrix, respectively. White parts in the matrix $\mathbf{W}$ represent 0.

In this section, we describe the representations of amplitude and phase spectra used in the proposed loss. As shown in Fig. 1, an STFT complex spectral sequence $\mathbf{Y} = [\mathbf{Y}_{1}^{\top}, \ldots, \mathbf{Y}_{T}^{\top}]^{\top}$ is represented by using a matrix $\mathbf{W}$ as follows:

$\mathbf{Y} = \mathbf{W}\mathbf{y}$,   (3)

where $t$ and $\mathbf{W}$ represent a frame index and a matrix that performs the STFT operation, respectively. The complex-valued, amplitude, and phase spectra of frequency bin $n$ at frame $t$ are represented as follows:

$Y_{t,n} = \mathbf{W}_{t,n}\mathbf{y}$,   (4)
$A_{t,n} = |Y_{t,n}| = (\mathbf{y}^{\top}\mathbf{W}_{t,n}^{H}\mathbf{W}_{t,n}\mathbf{y})^{\frac{1}{2}}$,   (5)
$\exp(i\theta_{t,n}) = \exp(i\angle Y_{t,n}) = \dfrac{Y_{t,n}}{A_{t,n}} = \dfrac{\mathbf{W}_{t,n}\mathbf{y}}{(\mathbf{y}^{\top}\mathbf{W}_{t,n}^{H}\mathbf{W}_{t,n}\mathbf{y})^{\frac{1}{2}}}$,   (6)

where $A_{t,n}$, $\theta_{t,n}$, $\mathbf{W}_{t,n}$, $\cdot^{H}$, and $i$ represent an amplitude spectrum, a phase spectrum, a row vector of $\mathbf{W}$, the Hermitian transpose, and the imaginary unit, respectively. In this paper, Euler's formula is applied to a phase spectrum instead of using the phase spectrum directly. We use the amplitude and phase spectra shown in Eq. (5) and Eq. (6) for model training.

A loss function for the amplitude spectrum of frequency bin $n$ at frame $t$ is defined as a squared error:

$E^{(\mathrm{amp})}_{t,n} = \frac{1}{2}(\hat{A}_{t,n} - A_{t,n})^{2}$   (7)
$\phantom{E^{(\mathrm{amp})}_{t,n}} = \frac{1}{2}\left(\hat{A}_{t,n} - (\mathbf{y}^{\top}\mathbf{W}_{t,n}^{H}\mathbf{W}_{t,n}\mathbf{y})^{\frac{1}{2}}\right)^{2}$.   (8)

We then obtain the partial derivative of Eq. (7) w.r.t. $\mathbf{y}$ as

$\dfrac{\partial E^{(\mathrm{amp})}_{t,n}}{\partial \mathbf{y}} = (A_{t,n} - \hat{A}_{t,n})\,\Re\!\left(\exp(i\theta_{t,n})\,\mathbf{W}_{t,n}^{H}\right)$,   (9)

where $\Re(z)$ is the real part of a complex value $z$. Here the Wirtinger derivative is used to calculate the partial derivative in the complex domain. $\sum_{n}\partial E^{(\mathrm{amp})}_{t,n}/\partial\mathbf{y}$ can be efficiently calculated by using an inverse FFT operation.
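To make the spectral representations and the amplitude loss above concrete, the following minimal NumPy sketch frames a waveform, computes the complex spectra of Eq. (4), the amplitude and phase spectra of Eqs. (5) and (6), and the amplitude loss of Eq. (7). The frame length and frame shift are illustrative assumptions (no analysis window is applied) rather than the settings used in the experiments of Section 4.

    import numpy as np

    def stft_frames(y, frame_len=400, frame_shift=80):
        # Frame the waveform and apply an FFT to each frame; row t corresponds
        # to the block of the STFT matrix W acting on y.
        n_frames = 1 + (len(y) - frame_len) // frame_shift
        frames = np.stack([y[t * frame_shift: t * frame_shift + frame_len]
                           for t in range(n_frames)])
        return np.fft.rfft(frames, axis=1)                # complex spectra Y[t, n], Eq. (4)

    def amplitude_loss(y_hat, y):
        # E^(amp) of Eq. (7), summed over frames t and frequency bins n.
        A_hat = np.abs(stft_frames(y_hat))                # amplitude spectra, Eq. (5)
        A = np.abs(stft_frames(y))
        return 0.5 * np.sum((A_hat - A) ** 2)

    rng = np.random.default_rng(0)
    y_nat = rng.standard_normal(16000)                    # stand-ins for natural and
    y_gen = rng.standard_normal(16000)                    # generated 1-s waveforms at 16 kHz
    theta = np.angle(stft_frames(y_gen))                  # phase spectra, Eq. (6)
    print(amplitude_loss(y_nat, y_gen))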
A phase spectrum is a periodic variable with a period of $2\pi$. A loss function for the phase spectrum of frequency bin $n$ at frame $t$ is therefore defined as follows to account for this periodicity:

$E^{(\mathrm{ph})}_{t,n} = \frac{1}{2}\left|1 - \exp\!\left(i(\hat{\theta}_{t,n} - \theta_{t,n})\right)\right|^{2}$   (10)
$\phantom{E^{(\mathrm{ph})}_{t,n}} = 1 - \frac{1}{2}\left(\dfrac{\hat{Y}_{t,n}\,\overline{\mathbf{W}_{t,n}\mathbf{y}}}{\hat{A}_{t,n}\,(\mathbf{y}^{\top}\mathbf{W}_{t,n}^{H}\mathbf{W}_{t,n}\mathbf{y})^{\frac{1}{2}}} + \dfrac{\overline{\hat{Y}_{t,n}}\,\mathbf{W}_{t,n}\mathbf{y}}{\hat{A}_{t,n}\,(\mathbf{y}^{\top}\mathbf{W}_{t,n}^{H}\mathbf{W}_{t,n}\mathbf{y})^{\frac{1}{2}}}\right)$,   (11)

where $\overline{\,\cdot\,}$ denotes the complex conjugate. We obtain the partial derivative of Eq. (10) w.r.t. $\mathbf{y}$ as

$\dfrac{\partial E^{(\mathrm{ph})}_{t,n}}{\partial \mathbf{y}} = \sin(\hat{\theta}_{t,n} - \theta_{t,n})\,\Im\!\left(\dfrac{\mathbf{W}_{t,n}^{H}}{\overline{Y_{t,n}}}\right)$,   (12)

where $\Im(z)$ denotes the imaginary part of a complex value $z$. $\sum_{n}\partial E^{(\mathrm{ph})}_{t,n}/\partial\mathbf{y}$ can also be efficiently calculated by using an inverse FFT operation.

The partial derivative of the amplitude loss function (Eq. (9)) includes the phase spectrum of the outputs (i.e., $\theta_{t,n}$), and that of the phase loss function (Eq. (12)) includes the amplitude spectrum of the outputs (i.e., $A_{t,n}$, since $\overline{Y_{t,n}}$ can be rewritten as $A_{t,n}\exp(-i\theta_{t,n})$). Thus, the amplitude and phase losses are related to each other through the neural network's outputs during model training, although each loss function focuses on only the amplitude or the phase spectra.

In this paper, a combination of the amplitude and phase spectral loss functions is used for training a neural speech waveform model:

$E^{(\mathrm{sp})} = \sum_{t,n}\left(E^{(\mathrm{amp})}_{t,n} + \alpha_{t,n}E^{(\mathrm{ph})}_{t,n}\right)$,   (13)

where $\alpha_{t,n}$ denotes a weight parameter. We use three types of $\alpha_{t,n}$ for training a neural speech waveform model, illustrated in the sketch below:

$\alpha_{t,n} = 0$: only the amplitude spectral loss function is used for model training.
$\alpha_{t,n} = 1$: a simple combination of the amplitude and phase spectral loss functions is used.
$\alpha_{t,n} = v_{t}$: here, $v_{t}$ represents a voiced/unvoiced flag (1: voiced, 0: unvoiced). In this case, we assume that phase spectra in unvoiced parts are random values, and hence the phase spectral loss computed in unvoiced parts is omitted from model training.
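Continuing the sketch above (it reuses stft_frames, y_nat, y_gen, and rng), the snippet below evaluates the phase loss in the 1 − cos form implied by Eq. (10) and the combined loss of Eq. (13) for the three weighting schemes. The voiced/unvoiced flags are dummy values introduced only for illustration.

    def phase_loss(Y_hat, Y):
        # E^(ph) of Eq. (10): 0.5 * |1 - exp(i(theta_hat - theta))|^2 = 1 - cos(theta_hat - theta)
        return 1.0 - np.cos(np.angle(Y_hat) - np.angle(Y))     # per-bin values, shape (T, N)

    def combined_loss(y_hat, y, alpha):
        # E^(sp) of Eq. (13); alpha is a scalar or an array broadcastable to (T, N).
        Y_hat, Y = stft_frames(y_hat), stft_frames(y)
        amp = 0.5 * (np.abs(Y_hat) - np.abs(Y)) ** 2
        return np.sum(amp + alpha * phase_loss(Y_hat, Y))

    T = stft_frames(y_gen).shape[0]
    vuv = rng.integers(0, 2, size=(T, 1)).astype(float)        # dummy voiced/unvoiced flags v_t
    for alpha in (0.0, 1.0, vuv):                               # the three choices of alpha_{t,n}
        print(combined_loss(y_nat, y_gen, alpha))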
Training with the proposed loss can also be interpreted in terms of probability distributions. First, Eq. (7) and Eq. (10) can be rewritten as follows:

$E^{(\mathrm{amp})}_{t,n} = \log P_{g}(\hat{A}_{t,n}\,|\,\hat{A}_{t,n}, 1) - \log P_{g}(\hat{A}_{t,n}\,|\,A_{t,n}, 1)$,   (14)
$E^{(\mathrm{ph})}_{t,n} = 1 - \cos(\hat{\theta}_{t,n} - \theta_{t,n}) = \log P_{vm}(\hat{\theta}_{t,n}\,|\,\hat{\theta}_{t,n}, 1) - \log P_{vm}(\hat{\theta}_{t,n}\,|\,\theta_{t,n}, 1)$,   (15)

where $P_{g}(\cdot)$ and $P_{vm}(\cdot)$ are the probability density functions of the Gaussian distribution and the von Mises distribution:

$P_{g}(x\,|\,\mu,\sigma) = \dfrac{1}{\sqrt{2\pi}\,\sigma}\exp\!\left(-\dfrac{(x-\mu)^{2}}{2\sigma^{2}}\right)$,   (16)
$P_{vm}(x\,|\,\mu,\beta) = \dfrac{\exp\!\left(\beta\cos(x-\mu)\right)}{2\pi I_{0}(\beta)}$,   (17)

and $I_{0}(\beta)$ is the modified Bessel function of the first kind of order 0. From the relationships of Eqs. (7), (10), (14), and (15), minimization of Eq. (13) is equivalent to minimization of the negative log-likelihood $\mathcal{L}$ below:

$\mathcal{L} = -\log \prod_{t,n} P_{g}(\hat{A}_{t,n}\,|\,A_{t,n}, 1)\, P_{vm}^{\alpha_{t,n}}(\hat{\theta}_{t,n}\,|\,\theta_{t,n}, 1)$.   (18)

In other words, the amplitude and phase spectra of the neural network's outputs represent the mean parameters of the Gaussian distribution and the von Mises distribution, respectively. Thus, optimizing a neural network on the basis of the proposed loss function is maximum likelihood estimation that uses the above probabilistic functions for the speech waveform model.

In this initial investigation we used the Gaussian distribution and the von Mises distribution to define the loss functions for amplitude and phase spectra, but other meaningful spectral losses can easily be defined by replacing these distributions. For instance, if we use the Poisson or exponential distribution instead of the Gaussian distribution, we can define the Kullback-Leibler divergence or the Itakura–Saito divergence as an amplitude spectral loss [9, 10]. We can also define a different loss function for phase spectra by replacing the von Mises distribution with other distributions such as the generalized cardioid distribution [11], the generalized version of the von Mises distribution.
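The equivalences in Eqs. (14) and (15) can be checked numerically: up to additive constants that do not depend on the model output, the negative log-density of the Gaussian with unit variance reproduces the squared amplitude error, and that of the von Mises distribution with unit concentration reproduces the 1 − cos phase error. A small self-contained check, assuming unit variance and concentration as in the equations above:

    import numpy as np
    from scipy.special import i0                     # modified Bessel function I_0

    def neg_log_gauss(x, mu, sigma=1.0):
        return 0.5 * np.log(2 * np.pi * sigma ** 2) + 0.5 * (x - mu) ** 2 / sigma ** 2

    def neg_log_von_mises(x, mu, beta=1.0):
        return np.log(2 * np.pi * i0(beta)) - beta * np.cos(x - mu)

    A_hat, A = 1.3, 0.7                              # example amplitude pair
    theta_hat, theta = 0.4, 2.9                      # example phase pair (radians)

    amp_loss = 0.5 * (A_hat - A) ** 2                                                        # Eq. (7)
    amp_nll = neg_log_gauss(A_hat, A) - neg_log_gauss(A_hat, A_hat)                          # Eq. (14)
    ph_loss = 1.0 - np.cos(theta_hat - theta)                                                # Eq. (10)
    ph_nll = neg_log_von_mises(theta_hat, theta) - neg_log_von_mises(theta_hat, theta_hat)   # Eq. (15)

    print(np.isclose(amp_loss, amp_nll), np.isclose(ph_loss, ph_nll))                        # True True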
3. NETWORK ARCHITECTURE
Fig. 2. Overview of the proposed network. An input net processes the input features at a clock rate of 200 Hz (frame shift = 5 ms); after duplication, the hidden representations are fed, together with amplitude-removed feedback samples, into an output net running at a clock rate of 16 kHz.

Fig. 2 shows an overview of the proposed network. The processing below the dotted line in Fig. 2 is the same as that used in the WaveNet vocoder [12], in which input features are converted to hidden representations through an input network and then duplicated to adjust the time resolution. The processing above the dotted line is conceptually the same as that used in the WaveNet vocoder, in which feedback samples and hidden representations are fed into an output network with an auto-regressive structure to output the next speech waveform sample. However, the proposed network has the following two remarkable differences.

First, the output network is simply based on uni-directional LSTMs; CNNs with stacked dilated convolution are not used in our method. (We also investigated a further improved network without the auto-regressive structure in order to significantly reduce the computational cost at the synthesis phase; see our next paper [13].)

Second, spectral amplitude information is removed from the feedback waveform samples. The output network with an auto-regressive structure feeds natural waveform samples back to the network during teacher-forced training [14]. A network composed of uni-directional LSTMs tends to rely on the feedback waveform samples while ignoring the input features if natural waveform samples are directly fed into it. To solve this problem, the feedback waveform samples are converted as follows:

$\mathbf{y}'_{m-1-\tau:m-1} = \mathcal{F}^{-1}\!\left(\dfrac{\mathcal{F}(\mathbf{y}_{m-1-\tau:m-1})}{|\mathcal{F}(\mathbf{y}_{m-1-\tau:m-1})|}\right)$,   (19)

where $\mathcal{F}$, $\mathcal{F}^{-1}$, the fraction bar, and $|\cdot|$ denote the FFT operation, the inverse FFT operation, element-wise division, and the element-wise absolute value, respectively. Speech waveform samples whose amplitude spectra are 1 are obtained through this conversion. The conversion can be regarded as data dropout [15], although the amplitude spectra are dropped and replaced with 1 instead of 0.
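A minimal sketch of the conversion in Eq. (19): the feedback segment is transformed with an FFT, each complex coefficient is divided by its magnitude so that the amplitude spectrum becomes 1, and the result is transformed back. The segment length and the small epsilon guarding against division by zero are illustrative assumptions.

    import numpy as np

    def remove_amplitude(segment, eps=1e-8):
        # Eq. (19): keep only the phase information of the feedback samples.
        spec = np.fft.fft(segment)
        flat = spec / (np.abs(spec) + eps)        # amplitude spectrum forced to (approximately) 1
        return np.real(np.fft.ifft(flat))         # imaginary part is numerically ~0 for real input

    feedback = np.random.default_rng(1).standard_normal(80)            # previous samples (length assumed)
    converted = remove_amplitude(feedback)
    print(np.allclose(np.abs(np.fft.fft(converted)), 1.0, atol=1e-3))  # amplitude spectra are ~1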
4. EXPERIMENT

4.1. Experimental conditions
The proposed neural speech waveform models were evaluated as vocoders. We used a female speaker (slt) from the CMU-ARCTIC database [16], whose utterances were divided into training and test sets. The speech waveforms have a sampling frequency of 16 kHz and a 16-bit PCM format.

As the input feature, an 80-dim log-mel spectrogram extracted with a frame shift of 5 ms was used. Log-mel spectrograms and waveform samples were normalized to have zero mean and unit variance for training the proposed speech waveform model.

The input features, i.e., the log-mel spectrogram sequence, were first converted into hidden representations through an input network composed of a bi-directional LSTM and a CNN with 2-D filters spanning the time and frequency directions. The hidden representations were then duplicated to adjust the time resolution. The previous waveform samples and the hidden representations obtained from the input network were fed into an output network composed of three uni-directional LSTM layers. The frame shift, frame length, and FFT size used to obtain the STFT spectra of the network's outputs for the proposed loss were fixed throughout training. Mini-batches were created from randomly selected speech segments; each mini-batch contained a total of 15 s of speech waveform samples (each segment was 0.125 s long, i.e., 120 segments per mini-batch). We used the Adam optimizer [17].

We trained four speech waveform models with the proposed network architecture. They differed only in the loss function used for training, namely Eq. (2) and Eq. (13) with the three types of $\alpha_{t,n}$. For comparison, a WaveNet was trained by using the 80-dim log-mel spectrograms as input and mu-law discretized waveform samples with 1,024 levels as output training data; its network architecture was the same as that used in [12]. The WORLD vocoder [18] was also used as a baseline signal-processing-based vocoder. Spectral envelopes and aperiodicity measurements obtained with WORLD were converted to a 59-dim mel-cepstrum and 21-dim band aperiodicity, so the WORLD acoustic feature vector had 82 dimensions in total (59 mel-cepstrum + 1 voiced/unvoiced flag + 1 log F0 + 21 band aperiodicity). In total, we used six vocoders (the four proposed models, the WaveNet, and WORLD) in the experiment. Synthetic speech samples and the code for model training can be found at https://nii-yamagishilab.github.io/TSNetVocoder/index.html and https://github.com/nii-yamagishilab/TSNetVocoder, respectively.

We evaluated analysis-by-synthesis (AbS) systems and text-to-speech (TTS) synthesis systems based on the six vocoders. For the TTS systems, acoustic models that convert linguistic features to acoustic features (i.e., the 80-dim log-mel spectrogram or the 82-dim WORLD acoustic features) were trained separately; deep auto-regressive (DAR) models [15] were used as the acoustic models. The TTS experimental results show the robustness of the proposed waveform model against the degraded input features generated by the acoustic models.
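As a rough illustration of how the components described above fit together, the PyTorch sketch below wires a bi-directional LSTM plus a small CNN as the input network, duplicates the frame-rate hidden representations to the sample rate, and stacks three uni-directional LSTMs with a linear projection as the output network. All layer sizes, the filter shape, and the feedback dimensionality are placeholders, since the exact values are not given here; the amplitude-removed feedback samples are assumed to be prepared as in Eq. (19).

    import torch
    import torch.nn as nn

    class TSNetSketch(nn.Module):
        # Illustrative only: layer sizes are placeholders, not the paper's settings.
        def __init__(self, n_mels=80, hidden=256, frame_shift=80, n_feedback=80):
            super().__init__()
            self.frame_shift = frame_shift                       # 80 samples = 5 ms at 16 kHz
            self.conv = nn.Conv2d(1, 1, kernel_size=3, padding=1)
            self.input_lstm = nn.LSTM(n_mels, hidden, batch_first=True, bidirectional=True)
            self.output_lstm = nn.LSTM(2 * hidden + n_feedback, hidden,
                                       num_layers=3, batch_first=True)
            self.proj = nn.Linear(hidden, 1)

        def forward(self, logmel, feedback):
            # logmel: (B, T_frames, n_mels); feedback: (B, T_samples, n_feedback),
            # amplitude-removed previous samples aligned with each output sample.
            x = self.conv(logmel.unsqueeze(1)).squeeze(1)        # local time-frequency filtering
            h, _ = self.input_lstm(x)                            # (B, T_frames, 2 * hidden)
            h = h.repeat_interleave(self.frame_shift, dim=1)     # duplicate to the waveform rate
            h = h[:, :feedback.size(1)]                          # align with the waveform length
            out, _ = self.output_lstm(torch.cat([h, feedback], dim=-1))
            return self.proj(out).squeeze(-1)                    # (B, T_samples) predicted samples

    model = TSNetSketch()
    logmel = torch.randn(2, 10, 80)                              # 10 frames of log-mel features
    feedback = torch.randn(2, 800, 80)                           # matching 800 waveform samples
    print(model(logmel, feedback).shape)                         # torch.Size([2, 800])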
Fig. 3. Spectrograms of analysis-by-synthesis speech waveform segments (time axis 0 to 0.4 s, frequency axis 0 to 8 kHz): (a) NAT, (b) WORLD, (c) WN, (d) E^(wav), (e) α = 0, (f) α = 1, (g) α = v. Here, NAT and WN denote natural speech and the WaveNet. Spectrograms of the synthetic speech obtained by using the proposed models are shown in (d)-(g); Eq. (2) and Eq. (13) with the three types of α were used as their loss functions, respectively.

Fig. 3 shows spectrograms of analysis-by-synthesis speech waveform segments. There is an unvoiced part around 0.2 seconds. We can see from Fig. 3 (d) and (e) that the proposed models trained by using $E^{(\mathrm{wav})}$ and by using $E^{(\mathrm{sp})}$ without the phase spectral loss (i.e., α = 0) generated noisy spectrograms. It can also be seen from Fig. 3 (f) that large amplitude values are observed around a particular frequency band in the unvoiced part, whereas they are not observed in the natural spectrogram. Such artifacts are not observed in the spectrogram shown in Fig. 3 (g). These results indicate that training the proposed model with both the amplitude and phase spectral loss functions is adequate and that adjusting the weight parameter α by using the voiced/unvoiced flag further improves the performance. The spectrograms of the WaveNet and of the proposed model trained by adjusting $\alpha_{t,n}$ with the voiced/unvoiced flags (i.e., Fig. 3 (c) and (g)) are both similar to the natural one, although the detailed spectral structures are changed.

The subjective evaluation was conducted using 158 crowd-sourced listeners. Natural samples and samples synthesized from the 12 systems (the six vocoders, each used for analysis-by-synthesis and for TTS) were evaluated. Participants rated speech naturalness on a five-point mean opinion score (MOS) scale. Each synthetic sample was evaluated multiple times and, thanks to the large number of resulting data points, the differences between all combinations of the evaluated systems are statistically significant.

Fig. 4. Box plots of the naturalness (MOS) evaluation results for NAT, WORLD, WN, E^(wav), α = 0, α = 1, and α = v. Blue and red dots represent the mean results of the analysis-by-synthesis and text-to-speech synthesis systems, respectively.

First, among the proposed models, those trained together with the phase spectral loss outperformed the others. Using not only amplitude spectra but also phase spectra to calculate the loss is therefore useful for training a neural speech waveform model. The proposed model trained by adjusting $\alpha_{t,n}$ with the voiced/unvoiced flags obtained the best score among the proposed models.

Second, it can be seen from Fig. 4 that the best-performing proposed model (i.e., α = v) is better than WORLD. For WORLD, the TTS result drops drastically from the AbS result, whereas the proposed model with the best configuration (i.e., α = v) is robust against acoustic features predicted by the acoustic model.

Finally, compared with the WaveNet, the best performance of the proposed models (i.e., α = v) is slightly worse. Note, however, that the number of parameters of the proposed model (1.8M) is smaller than that of the WaveNet (2.3M). Although the loss functions and network architecture still need to be improved to achieve performance comparable with the WaveNet, the proposed model clearly achieved high-performance speech waveform modeling.
5. CONCLUSION
We proposed a new STFT spectral loss to directly train a high-performance speech waveform model. We also presented a simple network architecture for the model, which is composed of uni-directional LSTMs and an auto-regressive structure. Experimental results showed that the proposed model can synthesize high-quality speech waveforms.

This is a part of our sequential work. In our next paper [13], we will show that the above model can be improved further to achieve performance comparable with the WaveNet. Our other future work includes training based on other time-frequency analyses such as the modified discrete cosine transform.

6. REFERENCES

[1] Shinji Takaki, Hirokazu Kameoka, and Junichi Yamagishi, "Direct modeling of frequency spectra and waveform generation based on phase recovery for DNN-based speech synthesis," in
Proc. Interspeech, 2017, pp. 1128–1132.
[2] Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew W. Senior, and Koray Kavukcuoglu, "WaveNet: A generative model for raw audio," CoRR, vol. abs/1609.03499, 2016.
[3] Akira Tamamori, Tomoki Hayashi, Kazuhiro Kobayashi, Kazuya Takeda, and Tomoki Toda, "Speaker-dependent WaveNet vocoder," in Proc. Interspeech, 2017, pp. 1118–1122.
[4] Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, et al., "Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions," in Proc. ICASSP, 2018, pp. 4779–4783.
[5] Xin Wang, Jaime Lorenzo-Trueba, Shinji Takaki, Lauri Juvela, and Junichi Yamagishi, "A comparison of recent waveform generation and acoustic modeling methods for neural-network-based speech synthesis," in Proc. ICASSP, 2018, pp. 4804–4808.
[6] Aäron van den Oord, Yazhe Li, Igor Babuschkin, Karen Simonyan, Oriol Vinyals, Koray Kavukcuoglu, George van den Driessche, Edward Lockhart, Luis C. Cobo, Florian Stimberg, et al., "Parallel WaveNet: Fast high-fidelity speech synthesis," arXiv preprint arXiv:1711.10433, 2017.
[7] Wei Ping, Kainan Peng, and Jitong Chen, "ClariNet: Parallel wave generation in end-to-end text-to-speech," arXiv preprint arXiv:1807.07281, 2018.
[8] Pejman Mowlaee, Rahim Saeidi, and Yannis Stylianou, "INTERSPEECH 2014 special session: Phase importance in speech processing applications," in Proc. Interspeech, 2014, pp. 1623–1627.
[9] Daniel D. Lee and H. Sebastian Seung, "Algorithms for non-negative matrix factorization," in Proc. NIPS, 2001, pp. 556–562.
[10] P. Smaragdis, B. Raj, and M. Shashanka, "Supervised and semi-supervised separation of sounds from single-channel mixtures," in Proc. 7th Int. Conf. on Independent Component Analysis and Signal Separation, 2007, pp. 414–421.
[11] M. C. Jones and Arthur Pewsey, "A family of symmetric distributions on the circle," Journal of the American Statistical Association, vol. 100, no. 472, pp. 1422–1428, 2005.
[12] Jaime Lorenzo-Trueba, Fuming Fang, Xin Wang, Isao Echizen, Junichi Yamagishi, and Tomi Kinnunen, "Can we steal your vocal identity from the Internet?: Initial investigation of cloning Obama's voice using GAN, WaveNet and low-quality found data," in Proc. Odyssey 2018 The Speaker and Language Recognition Workshop, 2018, pp. 240–247.
[13] Xin Wang, Shinji Takaki, and Junichi Yamagishi, "Neural source-filter-based waveform model for statistical parametric speech synthesis," Submitted to ICASSP 2019, 2019.
[14] Ronald J. Williams and David Zipser, "A learning algorithm for continually running fully recurrent neural networks," Neural Computation, vol. 1, no. 2, pp. 270–280, 1989.
[15] Xin Wang, Shinji Takaki, and Junichi Yamagishi, "Autoregressive neural F0 model for statistical parametric speech synthesis," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 8, pp. 1406–1419, 2018.
[16] J. Kominek and A. W. Black, "The CMU ARCTIC speech databases," in Proc. Fifth ISCA Workshop on Speech Synthesis, 2004.
[17] Diederik P. Kingma and Jimmy Ba, "Adam: A method for stochastic optimization," CoRR, vol. abs/1412.6980, 2014.
[18] Masanori Morise, Fumiya Yokomori, and Kenji Ozawa, "WORLD: A vocoder-based high-quality speech synthesis system for real-time applications," IEICE Transactions on Information and Systems, vol. 99, no. 7, pp. 1877–1884, 2016.

A. DETAILS OF THE PARTIAL DERIVATIVES
The Wirtinger derivative is used to calculate the partial derivative with respect to a complex value $z$ in the complex domain as

$\dfrac{dE}{dz} = \frac{1}{2}\left(\dfrac{\partial E}{\partial \Re(z)} - i\dfrac{\partial E}{\partial \Im(z)}\right)$,   (20)
$\dfrac{dE}{d\bar{z}} = \frac{1}{2}\left(\dfrac{\partial E}{\partial \Re(z)} + i\dfrac{\partial E}{\partial \Im(z)}\right)$.   (21)

If $E$ is a real function, the complex gradient vector is given by

$\nabla E = 2\dfrac{dE}{d\bar{z}}$   (22)
$\phantom{\nabla E} = \dfrac{\partial E}{\partial \Re(z)} + i\dfrac{\partial E}{\partial \Im(z)}$.   (23)

For a non-analytic function, the chain rule is given by

$\dfrac{\partial E}{\partial x} = \dfrac{\partial E}{\partial u}\dfrac{\partial u}{\partial x} + \dfrac{\partial E}{\partial \bar{u}}\dfrac{\partial \bar{u}}{\partial x}$.   (24)

A.1. Derivative of the amplitude spectral loss $E^{(\mathrm{amp})}_{t,n}$

Given the amplitude spectral loss

$E^{(\mathrm{amp})}_{t,n} = \frac{1}{2}(\hat{A}_{t,n} - A_{t,n})^{2}$,   (25)

the chain rule gives

$\dfrac{\partial E^{(\mathrm{amp})}_{t,n}}{\partial \mathbf{y}} = (A_{t,n} - \hat{A}_{t,n})\cdot\dfrac{\partial A_{t,n}}{\partial \mathbf{y}}$.   (26)

For $\partial A_{t,n}/\partial \mathbf{y}$, since $A_{t,n} = (\mathbf{y}^{\top}\mathbf{W}^{H}_{t,n}\mathbf{W}_{t,n}\mathbf{y})^{\frac{1}{2}} \in \mathbb{R}$, we can compute

$\dfrac{\partial A_{t,n}}{\partial \mathbf{y}} = \dfrac{\partial (\mathbf{y}^{\top}\mathbf{W}^{H}_{t,n}\mathbf{W}_{t,n}\mathbf{y})^{\frac{1}{2}}}{\partial \mathbf{y}}$   (27)
$= \frac{1}{2}(\mathbf{y}^{\top}\mathbf{W}^{H}_{t,n}\mathbf{W}_{t,n}\mathbf{y})^{-\frac{1}{2}}\cdot(\mathbf{W}^{H}_{t,n}\mathbf{W}_{t,n} + \mathbf{W}^{\top}_{t,n}\overline{\mathbf{W}_{t,n}})\,\mathbf{y}$   (28)
$= \frac{1}{2}(\mathbf{y}^{\top}\mathbf{W}^{H}_{t,n}\mathbf{W}_{t,n}\mathbf{y})^{-\frac{1}{2}}\cdot(\mathbf{W}^{H}_{t,n}\mathbf{W}_{t,n}\mathbf{y} + \mathbf{W}^{\top}_{t,n}\overline{\mathbf{W}_{t,n}}\,\mathbf{y})$   (29)
$= \frac{1}{2}(\mathbf{y}^{\top}\mathbf{W}^{H}_{t,n}\mathbf{W}_{t,n}\mathbf{y})^{-\frac{1}{2}}\cdot 2\,\Re(\mathbf{W}^{H}_{t,n}\mathbf{W}_{t,n}\mathbf{y})$   (30)
$= \dfrac{1}{A_{t,n}}\,\Re(Y_{t,n}\mathbf{W}^{H}_{t,n})$   (31)
$= \Re\!\left(\dfrac{Y_{t,n}}{A_{t,n}}\mathbf{W}^{H}_{t,n}\right)$   (32)
$= \Re\!\left(\exp(i\theta_{t,n})\,\mathbf{W}^{H}_{t,n}\right)$,   (33)

where $\cdot^{\top}$ and $\cdot^{H}$ denote transpose and Hermitian transpose, respectively. Note that, from Eq. (27) to (28), because $\mathbf{W}_{t,n}$ is a complex-valued vector, we need to use

$\dfrac{\partial\, \mathbf{y}^{\top}\mathbf{W}^{H}_{t,n}\mathbf{W}_{t,n}\mathbf{y}}{\partial \mathbf{y}} = (\mathbf{W}^{H}_{t,n}\mathbf{W}_{t,n} + \mathbf{W}^{\top}_{t,n}\overline{\mathbf{W}_{t,n}})\,\mathbf{y}$.   (34)

From Eq. (29) to (30), we use the fact that $\overline{\mathbf{W}^{H}_{t,n}\mathbf{W}_{t,n}\mathbf{y}} = \mathbf{W}^{\top}_{t,n}\overline{\mathbf{W}_{t,n}}\,\mathbf{y}$ and therefore

$\mathbf{W}^{H}_{t,n}\mathbf{W}_{t,n}\mathbf{y} + \mathbf{W}^{\top}_{t,n}\overline{\mathbf{W}_{t,n}}\,\mathbf{y} = 2\,\Re(\mathbf{W}^{H}_{t,n}\mathbf{W}_{t,n}\mathbf{y})$.   (35)

Based on Eqs. (33) and (26), we finally get

$\dfrac{\partial E^{(\mathrm{amp})}_{t,n}}{\partial \mathbf{y}} = (A_{t,n} - \hat{A}_{t,n})\,\Re\!\left(\exp(i\theta_{t,n})\,\mathbf{W}^{H}_{t,n}\right)$.   (36)

A.2. Derivative of the phase spectral loss $E^{(\mathrm{ph})}_{t,n}$

Based on the definition of $E^{(\mathrm{ph})}_{t,n}$, we get

$E^{(\mathrm{ph})}_{t,n} = \frac{1}{2}\left|1 - \exp(i(\hat{\theta}_{t,n} - \theta_{t,n}))\right|^{2}$   (37)
$= \frac{1}{2}\left|1 - \exp(i\hat{\theta}_{t,n})\,\dfrac{1}{\exp(i\theta_{t,n})}\right|^{2}$   (38)
$= \frac{1}{2}\left|1 - \dfrac{\hat{Y}_{t,n}}{\hat{A}_{t,n}}\dfrac{A_{t,n}}{Y_{t,n}}\right|^{2}$   (39)
$= \frac{1}{2}\left(1 - \dfrac{\hat{Y}_{t,n}A_{t,n}}{\hat{A}_{t,n}Y_{t,n}}\right)\overline{\left(1 - \dfrac{\hat{Y}_{t,n}A_{t,n}}{\hat{A}_{t,n}Y_{t,n}}\right)}$   (40)
$= \frac{1}{2}\left(1 - \dfrac{\hat{Y}_{t,n}A_{t,n}}{\hat{A}_{t,n}Y_{t,n}}\right)\left(1 - \dfrac{\overline{\hat{Y}_{t,n}}A_{t,n}}{\hat{A}_{t,n}\overline{Y_{t,n}}}\right)$   (41)
$= \frac{1}{2}\left(1 + \dfrac{\hat{Y}_{t,n}\overline{\hat{Y}_{t,n}}A^{2}_{t,n}}{\hat{A}^{2}_{t,n}Y_{t,n}\overline{Y_{t,n}}} - \left(\dfrac{\hat{Y}_{t,n}A_{t,n}}{\hat{A}_{t,n}Y_{t,n}} + \dfrac{\overline{\hat{Y}_{t,n}}A_{t,n}}{\hat{A}_{t,n}\overline{Y_{t,n}}}\right)\right)$   (42)
$= \frac{1}{2}\left(2 - \left(\dfrac{\hat{Y}_{t,n}A_{t,n}}{\hat{A}_{t,n}Y_{t,n}} + \dfrac{\overline{\hat{Y}_{t,n}}A_{t,n}}{\hat{A}_{t,n}\overline{Y_{t,n}}}\right)\right)$   (43)
$= 1 - \frac{1}{2}\left(\dfrac{\hat{Y}_{t,n}A_{t,n}}{\hat{A}_{t,n}Y_{t,n}} + \dfrac{\overline{\hat{Y}_{t,n}}A_{t,n}}{\hat{A}_{t,n}\overline{Y_{t,n}}}\right)$   (44)
$= 1 - \frac{1}{2}\,\dfrac{1}{\hat{A}_{t,n}A_{t,n}}\left(\hat{Y}_{t,n}\overline{Y_{t,n}} + \overline{\hat{Y}_{t,n}}Y_{t,n}\right)$.   (45)

Note that $\hat{A}_{t,n}$, $A_{t,n}$, and $\hat{Y}_{t,n}\overline{Y_{t,n}} + \overline{\hat{Y}_{t,n}}Y_{t,n} \in \mathbb{R}$, while $\hat{Y}_{t,n}$ and $Y_{t,n} \in \mathbb{C}$. Also note that $Y_{t,n}\overline{Y_{t,n}} = A^{2}_{t,n}$ and $\hat{Y}_{t,n}\overline{\hat{Y}_{t,n}} = \hat{A}^{2}_{t,n}$. Thus, we can calculate $\partial E^{(\mathrm{ph})}_{t,n}/\partial \mathbf{y}$ as follows:

$\dfrac{\partial E^{(\mathrm{ph})}_{t,n}}{\partial \mathbf{y}} = -\frac{1}{2}\left(\dfrac{\partial\, \hat{A}^{-1}_{t,n}A^{-1}_{t,n}}{\partial \mathbf{y}}\left(\hat{Y}_{t,n}\overline{Y_{t,n}} + \overline{\hat{Y}_{t,n}}Y_{t,n}\right) + \dfrac{1}{\hat{A}_{t,n}A_{t,n}}\,\dfrac{\partial\left(\hat{Y}_{t,n}\overline{Y_{t,n}} + \overline{\hat{Y}_{t,n}}Y_{t,n}\right)}{\partial \mathbf{y}}\right)$   (46)
$= -\frac{1}{2}\left(-\dfrac{1}{2\hat{A}_{t,n}A^{3}_{t,n}}\left(Y_{t,n}\mathbf{W}^{H}_{t,n} + \overline{Y_{t,n}}\mathbf{W}^{\top}_{t,n}\right)\left(\hat{Y}_{t,n}\overline{Y_{t,n}} + \overline{\hat{Y}_{t,n}}Y_{t,n}\right) + \dfrac{1}{\hat{A}_{t,n}A_{t,n}}\left(\hat{Y}_{t,n}\mathbf{W}^{H}_{t,n} + \overline{\hat{Y}_{t,n}}\mathbf{W}^{\top}_{t,n}\right)\right)$   (47)
$= -\frac{1}{2}\left(-\dfrac{1}{2\hat{A}_{t,n}A^{3}_{t,n}}\left(\hat{Y}_{t,n}A^{2}_{t,n}\mathbf{W}^{H}_{t,n} + \overline{\hat{Y}_{t,n}}Y^{2}_{t,n}\mathbf{W}^{H}_{t,n} + \hat{Y}_{t,n}\overline{Y_{t,n}}^{2}\mathbf{W}^{\top}_{t,n} + \overline{\hat{Y}_{t,n}}A^{2}_{t,n}\mathbf{W}^{\top}_{t,n}\right) + \dfrac{1}{\hat{A}_{t,n}A_{t,n}}\left(\hat{Y}_{t,n}\mathbf{W}^{H}_{t,n} + \overline{\hat{Y}_{t,n}}\mathbf{W}^{\top}_{t,n}\right)\right)$   (48)
$= -\frac{1}{2}\left(-\dfrac{1}{2\hat{A}_{t,n}A_{t,n}}\left(\hat{Y}_{t,n}\mathbf{W}^{H}_{t,n} + \overline{\hat{Y}_{t,n}}\dfrac{Y_{t,n}}{\overline{Y_{t,n}}}\mathbf{W}^{H}_{t,n} + \hat{Y}_{t,n}\dfrac{\overline{Y_{t,n}}}{Y_{t,n}}\mathbf{W}^{\top}_{t,n} + \overline{\hat{Y}_{t,n}}\mathbf{W}^{\top}_{t,n}\right) + \dfrac{1}{\hat{A}_{t,n}A_{t,n}}\left(\hat{Y}_{t,n}\mathbf{W}^{H}_{t,n} + \overline{\hat{Y}_{t,n}}\mathbf{W}^{\top}_{t,n}\right)\right)$   (49)
$= \dfrac{1}{4\hat{A}_{t,n}A_{t,n}}\left(\overline{\hat{Y}_{t,n}}\dfrac{Y_{t,n}}{\overline{Y_{t,n}}}\mathbf{W}^{H}_{t,n} + \hat{Y}_{t,n}\dfrac{\overline{Y_{t,n}}}{Y_{t,n}}\mathbf{W}^{\top}_{t,n} - \hat{Y}_{t,n}\mathbf{W}^{H}_{t,n} - \overline{\hat{Y}_{t,n}}\mathbf{W}^{\top}_{t,n}\right)$   (50)
$= \frac{1}{4}\left(\dfrac{\hat{Y}_{t,n}}{\hat{A}_{t,n}A_{t,n}Y_{t,n}}\left(\overline{Y_{t,n}}\mathbf{W}^{\top}_{t,n} - Y_{t,n}\mathbf{W}^{H}_{t,n}\right) + \dfrac{\overline{\hat{Y}_{t,n}}}{\hat{A}_{t,n}A_{t,n}\overline{Y_{t,n}}}\left(Y_{t,n}\mathbf{W}^{H}_{t,n} - \overline{Y_{t,n}}\mathbf{W}^{\top}_{t,n}\right)\right)$   (51)
$= \frac{1}{2}\,\Re\!\left(\dfrac{\hat{Y}_{t,n}}{\hat{A}_{t,n}A_{t,n}Y_{t,n}}\left(\overline{Y_{t,n}}\mathbf{W}^{\top}_{t,n} - Y_{t,n}\mathbf{W}^{H}_{t,n}\right)\right)$   (52)
$= \Im\!\left(\exp\!\left(i(\angle\hat{Y}_{t,n} - \angle Y_{t,n})\right)\right)\,\Im\!\left(\dfrac{\mathbf{W}^{H}_{t,n}}{\overline{Y_{t,n}}}\right)$   (53)
$= \sin\!\left(\angle\hat{Y}_{t,n} - \angle Y_{t,n}\right)\,\Im\!\left(\dfrac{\mathbf{W}^{H}_{t,n}}{\overline{Y_{t,n}}}\right)$.   (54)

Here, from Eq. (48) to (49), the common factor $A^{2}_{t,n}$ is divided out using $Y_{t,n}\overline{Y_{t,n}} = A^{2}_{t,n}$, $Y^{2}_{t,n}/A^{2}_{t,n} = Y_{t,n}/\overline{Y_{t,n}}$, and $\overline{Y_{t,n}}^{2}/A^{2}_{t,n} = \overline{Y_{t,n}}/Y_{t,n}$. From Eq. (51) to (52), the two terms are complex conjugates of each other, so their sum equals twice the real part of the first term. From Eq. (52) to (53), we use $\overline{Y_{t,n}}\mathbf{W}^{\top}_{t,n} - Y_{t,n}\mathbf{W}^{H}_{t,n} = -2i\,\Im(Y_{t,n}\mathbf{W}^{H}_{t,n})$, $\hat{Y}_{t,n}/(\hat{A}_{t,n}A_{t,n}Y_{t,n}) = \exp(i(\hat{\theta}_{t,n}-\theta_{t,n}))/A^{2}_{t,n}$, and $Y_{t,n}/A^{2}_{t,n} = 1/\overline{Y_{t,n}}$. Eq. (54) coincides with Eq. (12).
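The closed-form gradients above can be verified numerically. The sketch below compares Eq. (36) and Eq. (54) with central finite differences for a single frame and frequency bin; the complex row vector w stands in for W_{t,n}, and the amplitude and phase targets are arbitrary values chosen for illustration.

    import numpy as np

    rng = np.random.default_rng(0)
    M = 16
    w = rng.standard_normal(M) + 1j * rng.standard_normal(M)   # one row W_{t,n} of the STFT matrix
    y = rng.standard_normal(M)                                  # real waveform samples
    A_hat, theta_hat = 1.5, 0.3                                 # "natural" amplitude and phase targets

    def amp_loss(y):
        return 0.5 * (A_hat - np.abs(w @ y)) ** 2               # Eq. (25)

    def ph_loss(y):
        return 1.0 - np.cos(theta_hat - np.angle(w @ y))        # Eq. (10)

    Y = w @ y
    A, theta = np.abs(Y), np.angle(Y)
    grad_amp = (A - A_hat) * np.real(np.exp(1j * theta) * np.conj(w))       # Eq. (36)
    grad_ph = np.sin(theta_hat - theta) * np.imag(np.conj(w) / np.conj(Y))  # Eq. (54)

    def numerical_grad(f, y, eps=1e-6):
        g = np.zeros_like(y)
        for k in range(len(y)):
            d = np.zeros_like(y)
            d[k] = eps
            g[k] = (f(y + d) - f(y - d)) / (2 * eps)
        return g

    print(np.allclose(grad_amp, numerical_grad(amp_loss, y), atol=1e-5))    # True
    print(np.allclose(grad_ph, numerical_grad(ph_loss, y), atol=1e-5))      # True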