Deep Griffin-Lim Iteration
Yoshiki Masuyama, Kohei Yatabe, Yuma Koizumi, Yasuhiro Oikawa, Noboru Harada
This paper has been accepted to the 44th International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2019).
DEEP GRIFFIN–LIM ITERATION

Yoshiki Masuyama†, Kohei Yatabe†, Yuma Koizumi‡, Yasuhiro Oikawa†, Noboru Harada‡

† Department of Intermedia Art and Science, Waseda University, Tokyo, Japan
‡ NTT Media Intelligence Laboratories, Tokyo, Japan
ABSTRACT
This paper presents a novel phase reconstruction method (only from a given amplitude spectrogram) by combining a signal-processing-based approach and a deep neural network (DNN). To retrieve a time-domain signal from its amplitude spectrogram, the corresponding phase is required. One of the popular phase reconstruction methods is the Griffin–Lim algorithm (GLA), which is based on the redundancy of the short-time Fourier transform. However, GLA often involves many iterations and produces low-quality signals owing to the lack of prior knowledge of the target signal. In order to address these issues, in this study, we propose an architecture which stacks a sub-block including two GLA-inspired fixed layers and a DNN. The number of stacked sub-blocks is adjustable, and we can trade the performance and computational load based on requirements of applications. The effectiveness of the proposed method is investigated by reconstructing phases from amplitude spectrograms of speech.
Index Terms — Phase reconstruction, spectrogram consistency, deep neural network, residual learning.
1. INTRODUCTION
In recent years, phase reconstruction has gained much attention in the signal processing community [1, 2]. Many ordinary speech processing methods defined in the time-frequency domain have considered only amplitude spectrograms and utilized the phase of the observed signal without modifying it. Meanwhile, recent studies have proven that phase reconstruction can improve the quality of the reconstructed signal [3], and thus several methods have been proposed for that [4, 5, 6, 7, 8]. Phase reconstruction solely from an amplitude spectrogram has also received increasing attention along with the development of short-time Fourier transform (STFT)-based speech synthesis [9, 10], which generates an amplitude spectrogram and requires phase reconstruction for generating a time-domain signal. This paper focuses on such a situation where only an amplitude spectrogram is available for reconstructing the phase.

When only an amplitude spectrogram is available and no explicit information is given for the phase, such as in STFT-based speech synthesis, the Griffin–Lim algorithm (GLA) is one of the popular methods for phase reconstruction [11]. GLA promotes the consistency of a spectrogram by iterating two projections (see Section 2.1), where a spectrogram is said to be consistent when its inter-bin dependency owing to the redundancy of STFT is retained [12]. GLA is based only on the consistency and does not take any prior knowledge about the target signal into account. Consequently, GLA often requires many iterations and results in low-quality signals.

For incorporating prior knowledge of target signals into phase reconstruction, deep neural networks (DNNs) have been applied recently [13, 14, 15, 16]. There exist a number of approaches to reconstruct phase using DNNs. One approach is to treat it as a classification problem by discretizing the candidates of phase [13, 14],
Fig. 1. A block diagram of the proposed architecture for reconstructing phase from a given amplitude spectrogram (top), which stacks a common sub-block (bottom). The sub-block consists of two fixed GLA-inspired layers (red, blue) and a trainable DNN (green).

which is effectively utilized in speech separation. Other approaches handle phase as a continuous periodic variable [15] or treat complex-valued spectrograms [16]. While these DNN-based phase reconstruction methods have obtained successful results, the number of layers is determined when they are trained. That is, their performance and computational load are fixed at the training. It should be beneficial if one can easily trade the performance and computational load at the time of inference depending on requirements of applications.

In this study, we propose a phase reconstruction method which incorporates a DNN into GLA. The proposed method stacks a common sub-block motivated by the iterative procedure of GLA, which constructs a deep architecture, named deep Griffin–Lim iteration (DeGLI), as illustrated in Fig. 1. In the proposed architecture, the number of total layers corresponds to the number of stackings, and its depth can be adjusted afterward based on the allowable computational load in applications. Its training procedure is also proposed to effectively train the DNN within the sub-block. Our main contributions are twofold: (1) proposing a deep architecture whose sub-block contains the fixed GLA-inspired layers which enable reduction of the amount of trainable parameters (Section 3.1); and (2) proposing its training procedure which instructs the sub-block to be a denoiser, instead of requiring it to reconstruct the phase (Section 3.2). Thanks to this training procedure, the difficulty of training a DNN in phase reconstruction arising from the periodic nature of phase is circumvented. To evaluate the effectiveness of the proposed method, the quality of the reconstructed signal by GLA and the proposed method is compared.
2. RELATED WORKS

2.1. Griffin–Lim Algorithm (GLA)
GLA is a popular phase recovery algorithm based on the consistency of a spectrogram [11]. This algorithm expects to recover a complex-valued spectrogram, which is consistent and maintains the given amplitude A, by the following alternating projection procedure:

    X[m+1] = P_C( P_A( X[m] ) ),   (1)

where X is a complex-valued spectrogram updated through the iteration, P_S is the metric projection onto a set S, and m is the iteration index. Here, C is the set of consistent spectrograms, and A is the set of spectrograms whose amplitude is the same as the given one. The metric projections onto these sets C and A are given by

    P_C(X) = G G† X,   (2)
    P_A(X) = A ⊙ X ⊘ |X|,   (3)

where G represents STFT, G† is the pseudo inverse of STFT (iSTFT), ⊙ and ⊘ are element-wise multiplication and division, respectively, and division by zero is replaced by zero. GLA is obtained as an algorithm for the following optimization problem [12]:

    min_X ‖X − P_C(X)‖²_Fro   s.t.  X ∈ A,   (4)

where ‖·‖_Fro is the Frobenius norm. This equation minimizes the energy of the inconsistent components under the constraint on amplitude, which must be equal to the given one. Although GLA has been widely utilized because of its simplicity, GLA often involves many iterations until it converges to a certain spectrogram and results in low reconstruction quality. This is because the cost function in Eq. (4) only requires the consistency, and the characteristics of the target signal are not taken into account. Introducing prior knowledge of the target signal into the algorithm can improve the quality of reconstructed signals, as discussed in [17, 7].
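The two projections and the GLA iteration of Eqs. (1)–(3) can be sketched in a few lines of NumPy/SciPy. This is an illustrative sketch, not the authors' implementation; the STFT settings (16 kHz, 512-sample Hann window, 75% overlap) and the random phase initialization are assumptions for the example.

```python
import numpy as np
from scipy.signal import stft, istft

# Assumed STFT settings (illustrative only).
FS, NPERSEG, NOVERLAP = 16000, 512, 384

def P_C(X):
    """Eq. (2): projection onto consistent spectrograms, G G† X (iSTFT then STFT)."""
    _, x = istft(X, fs=FS, nperseg=NPERSEG, noverlap=NOVERLAP)
    _, _, Xc = stft(x, fs=FS, nperseg=NPERSEG, noverlap=NOVERLAP)
    return Xc

def P_A(X, A):
    """Eq. (3): replace the amplitude by A; division by zero is replaced by zero."""
    mag = np.abs(X)
    phase = np.divide(X, mag, out=np.zeros_like(X), where=mag > 0)
    return A * phase

def griffin_lim(A, n_iter=50, seed=0):
    """Eq. (1): alternate the two projections, starting from a random phase."""
    rng = np.random.default_rng(seed)
    X = A * np.exp(1j * rng.uniform(0.0, 2.0 * np.pi, A.shape))
    for _ in range(n_iter):
        X = P_C(P_A(X, A))
    _, x = istft(P_A(X, A), fs=FS, nperseg=NPERSEG, noverlap=NOVERLAP)
    return x
```

With enough iterations, the amplitude of the STFT of the returned signal approaches the given A, while the cost of Eq. (4) is non-increasing over the iterations.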
2.2. DNNs Including Fixed STFT Layers

Recently, DNNs including fixed STFT (and iSTFT) layers were considered for treating phase information within the networks. A generative adversarial network (GAN)-based approach to reconstruct a complex-valued spectrogram solely from a given amplitude spectrogram was presented in [16]. The output of the generator (a complex-valued spectrogram) is converted back to the time domain by an iSTFT layer and inputted to the discriminator, where this iSTFT layer is essential for its training, as discussed in [16]. As another example, a DNN for speech separation [18] employed the multiple input spectrogram inverse (MISI) layer, which consists of the pair of STFT and iSTFT as in GLA. The MISI layer is applied to the output of the DNN for speech separation to improve its performance by considering the effect of the phase reconstruction together with the separation. In addition, in [19], the time-frequency representation was also trained with the DNN for speech separation. The success of these DNNs indicates that considering STFT (and iSTFT) together with a DNN is important for treating phase.

The common strategy for these DNNs is that fixed STFT-related layers are placed after a rich DNN. Their loss functions are evaluated after going through such STFT-related layers, and their effect is propagated for updating the parameters of the DNNs. Based on this observation, loss functions tied with STFT (and iSTFT) seem important in phase reconstruction because such loss functions are related to the concept of the consistency. At the same time, fixed STFT-related layers have several benefits for training. Since they do not contain trainable parameters, adding STFT-related layers does not increase the number of trainable parameters while they capture the structure of complex-valued spectrograms efficiently. Therefore, use of the STFT-related layers within DNNs may be recommended for treating phase information. However, there is little research on such DNNs containing STFT within the network.
3. PROPOSED DEEP ARCHITECTURE
Based on the above discussions, we propose an architecture for phase reconstruction, named deep Griffin–Lim iteration (DeGLI), which is a unification of GLA and a DNN. As illustrated in Fig. 1, the proposed architecture consists of a common sub-block, and it is stacked to form the whole deep architecture based on the iterative procedure of GLA. The architecture of DeGLI is introduced in Section 3.1, while its training procedure is described in Section 3.2.
3.1. Architecture of DeGLI

One interesting trend of research in deep learning is to interpret an optimization algorithm as a recurrent neural network (RNN) and construct a DNN architecture following it [20, 21, 22]. The DNNs introduced in the previous section [18, 19] were also obtained by a similar approach called deep unfolding [23, 24]. In this context, the iterative procedure of GLA in Eq. (1) is interpreted as an RNN which stacks the fixed linear layer P_C and the target-dependent nonlinear layer P_A. By looking closely at Eq. (1), it can be seen that the complex-valued spectrogram at the m-th iteration, X[m], is inputted into the nonlinear layer P_A, and then its output passes through the fixed linear layer P_C consisting of STFT G and iSTFT G† as in Eq. (2). That is, GLA is a parameter-fixed RNN consisting of STFT and iSTFT layers within the network. Inspired by the above observations, the proposed deep architecture for phase reconstruction, or DeGLI, is defined through a sub-block based on GLA.

Let us consider the intermediate representations of GLA,

    Y[m] = P_A( X[m] ),   (5)
    Z[m] = P_C( Y[m] ),   (6)

where the combination of these equations recovers Eq. (1). Since Y[m] is the amplitude-replaced version of X[m], their difference indicates the amount of mismatch between the amplitude of the current spectrogram |X[m]| and the desired amplitude A. Similarly, since Z[m] is the closest consistent spectrogram to Y[m] (in the Euclidean sense), the difference between them indicates the amount of inconsistent components [12]. Such differences should be quite informative for phase reconstruction because the aim of GLA is to reduce them as much as possible. However, such intermediate information is not considered in the original GLA in Eq.
(1).

To effectively use this intermediate information in a learning scheme, we propose DeGLI as the following architecture:

    X[m+1] = B( X[m] )   (7)
           = Z[m] − F( X[m], Y[m], Z[m] ),   (8)

where B is the proposed DeGLI-block inspired by GLA as in Fig. 1, and F is a DNN. The whole architecture can also be viewed as an RNN or a feed-forward network in which the weights are shared. By stacking M DeGLI-blocks (which is equivalent to iterating Eq. (7) M times), the whole DeGLI architecture becomes M-times deeper without increasing the number of trainable parameters. That is, the total depth of the DeGLI architecture can be adjusted afterward, which enables one to easily trade its performance and computational load for adapting to the allowable computational time of various applications. Note that, as a specific case, DeGLI reduces to the ordinary GLA when F(X[m], Y[m], Z[m]) = O, where O is the zero matrix. A variant of GLA in [25] can also be obtained by setting F(X[m], Y[m], Z[m]) = γ(X[m] − Z[m]) for a positive step size γ within the range given in [25], which indicates that DeGLI is a general architecture including several GLA-type algorithms as special cases.
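A DeGLI-block of Eqs. (5)–(8) is just the two fixed projections followed by a residual correction, which can be sketched as follows. The projections P_A and P_C are passed in as callables (as defined in Section 2.1), and F stands in for the trainable DNN; this is an illustrative sketch, not the authors' implementation.

```python
import numpy as np

def degli_block(X, A, P_A, P_C, F):
    """One DeGLI-block B (Eq. (8)): two fixed GLA-inspired layers plus a DNN residual."""
    Y = P_A(X, A)          # Eq. (5): replace the amplitude by the given one
    Z = P_C(Y)             # Eq. (6): project onto the set of consistent spectrograms
    return Z - F(X, Y, Z)  # Eq. (8): subtract the DNN-estimated residual

def degli(A, P_A, P_C, F, n_blocks=5):
    """Stack n_blocks copies of the same block; the depth is chosen at inference time."""
    X = A.astype(complex)  # the amplitude itself serves as the initial value (zero phase)
    for _ in range(n_blocks):
        X = degli_block(X, A, P_A, P_C, F)
    return P_A(X, A)       # final amplitude replacement

# With F == 0, each block reduces to one ordinary GLA iteration.
zero_F = lambda X, Y, Z: np.zeros_like(Z)
```

Because the same block (with the same parameters) is reused, increasing `n_blocks` deepens the network without adding trainable parameters.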
Fig. 2. The block diagram for training the sub-block.

One of the key points of the DeGLI architecture is that Z[m] (= P_C(P_A(X[m]))) is the output of the GLA-inspired layers at the m-th iteration, and the proposed DeGLI-block B is defined as the subtraction of the DNN output F(X[m], Y[m], Z[m]) from the output of the GLA-inspired layers, Z[m]. Defining the DeGLI-block in this way is based on two reasons: (1) the differences between the intermediate representations indicate the residuals to the desired ones, as discussed in the above paragraph; and (2) it is known that treating a residual is easier than directly estimating the target, according to the literature on residual learning [26, 27].

3.2. Training of the Sub-block

Since the proposed DeGLI architecture can be interpreted as a large DNN, a simple strategy for training the DeGLI-block is directly minimizing the loss of phase reconstruction measured at the output:

    min_θ D( G_θ(A), P_C( G_θ(A) ) ),   (9)

where G(·) = P_A(B(· · · B(B(·)))) represents the whole DeGLI architecture, θ represents all trainable parameters in G (i.e., the parameters in F), D(·, ·) is a measure of mismatch such as a norm of the difference, and the minimization is considered for all A. This problem is related to the optimization problem for GLA in Eq. (4) when D is the squared Frobenius norm of the difference of the variables (note that, since P_A is applied at the last stage of G, the constraint in Eq. (4) is always satisfied). Although the above training strategy is straightforward, the number of blocks must be defined in advance for applying it. In addition, it did not work well in our preliminary experiments.

In order to tackle this issue, we train the DeGLI-block B to be a denoiser by the training procedure illustrated in Fig. 2. Let X* be a complex-valued spectrogram of a target signal, and X̃ = X* + N be its noisy counterpart degraded by complex-valued noise N.
Then, the DeGLI-block B is trained so that B(X̃) ≈ X*, i.e.,

    B(X̃) = Z̃ − F( X̃, Ỹ, Z̃ ) ≈ X*,   (10)

based on the definition in Eq. (8). Since Z̃ is obtained only from the fixed layers P_A and P_C, the optimization problem for training the DNN F is given by

    min_θ D( Z̃ − X*, F_θ( X̃, Ỹ, Z̃ ) ).   (11)

In such denoising, the DNN F estimates the residual components Z̃ − X* which should not be contained in the GLA output Z̃ = P_C(P_A(X̃)). To be specific, the DNN takes the mismatch to the consistency and amplitude into account by inputting Ỹ and Z̃, and it implicitly eliminates the latent target signal (such as clean speech) through the hidden layers in F. This training strategy is closely related to the residual learning strategy. It has been shown that a denoising sub-block with the residual learning strategy is robust to the type and level of noise, and it can be applied to a variety of tasks, as discussed in [27, 28]. The idea of applying a denoising DNN for general tasks can also be found in [29, 30].

Fig. 3. The illustration of the DNN used in the experiment. It maps real and imaginary parts of three complex-valued spectrograms (X, Y, and Z) to those of the residual. Here, "Conv" indicates a convolutional layer with zero padding for keeping the input size, where k, s, and c are the kernel size, stride size, and the number of channels, respectively. "GLU" represents the gated linear unit.

Note that, after passing through the fixed nonlinear layer P_A, the amplitude of the complex-valued spectrogram is always replaced by the desired one.
That is, the difference between Ỹ and the target X* is only in the phase, and thus denoising of Ỹ (= P_A(X̃)) corresponds to phase reconstruction. It can be expected that the denoising sub-block including the GLA-inspired layers also works well in phase reconstruction. In any case, the trained DNN F (and thus B) only affects the phase of the final output because the amplitude is always set to the given one by P_A after the last DeGLI-block.
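The denoiser training objective of Eqs. (10)–(11) can be sketched as a loss function over one sub-block. The ℓ1-type mismatch and the use of the clean amplitude |X*| as the given amplitude are assumptions for this illustration; F is any callable standing in for the DNN.

```python
import numpy as np

def denoiser_loss(X_clean, N, A, P_A, P_C, F):
    """Loss of Eq. (11): F should predict the residual Z~ - X* left by the fixed layers."""
    X_noisy = X_clean + N        # complex-valued noisy spectrogram X~ = X* + N
    Y = P_A(X_noisy, A)          # fixed layer 1: amplitude replacement
    Z = P_C(Y)                   # fixed layer 2: consistency projection
    target = Z - X_clean         # the residual the DNN should estimate
    return np.sum(np.abs(target - F(X_noisy, Y, Z)))  # l1-type mismatch D
```

An oracle F that returned exactly Z − X_clean would drive this loss to zero, which is the sense in which the trained block acts as a denoiser.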
4. EXPERIMENT
In order to validate the effectiveness of DeGLI, the quality of reconstructed speech was evaluated by objective measures. The proposed method was compared with GLA as a baseline method.
The DNN F used in the DeGLI-block B for the experiment is illustrated in Fig. 3. The 2-D convolutional layers (Conv) and the gated linear units (GLU) [31] are stacked with skip connections. In the Conv layers, the complex-valued spectrograms are treated as images, where the real and imaginary parts are concatenated along the channel direction. Note that the input of the DNN is three complex-valued spectrograms, as in Figs. 1 and 2, which results in six channels, as each of the three consists of the real and imaginary parts.

As the training dataset for denoising, the Wall Street Journal (WSJ) corpus, recorded at a sampling rate of 16 kHz, was utilized.
14 250 speech files were randomly selected from the database to form a training set, and the rest of the data was used as a validation set. During the mini-batch training, the utterances were divided into about 2-second-long segments (
32 768 samples), and the Adam optimizer was utilized as the optimization solver. The network was trained with a learning rate control, where the learning rate was decayed by a constant factor if the loss function on the validation set did not decrease for several consecutive epochs. As the noise utilized for training in the time-frequency domain (described in Section 3.2), complex Gaussian noise was added so that the signal-to-noise ratio was randomly selected from a predetermined range in dB, and the measure of mismatch D as the loss function in Eq. (11) was set to the ℓ1-norm of the difference. STFT was implemented with the Hann window. As the test dataset, randomly selected utterances from the TIMIT dataset were utilized for obtaining amplitude spectrograms for phase reconstruction, where the initial phases were set to zero in the time-frequency domain (i.e., the amplitude spectrogram was directly inputted as the initial value).
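The noise injection used for training above can be sketched as follows: circularly symmetric complex Gaussian noise is added in the time-frequency domain at an SNR drawn uniformly at random. The SNR endpoints below are placeholders, since the exact range is not recoverable here.

```python
import numpy as np

def add_complex_noise(X, snr_db_range=(-6.0, 12.0), rng=None):
    """Add circularly symmetric complex Gaussian noise at a random SNR (in dB)."""
    if rng is None:
        rng = np.random.default_rng()
    snr_db = rng.uniform(*snr_db_range)          # draw the target SNR uniformly
    sig_power = np.mean(np.abs(X) ** 2)
    noise_power = sig_power / (10.0 ** (snr_db / 10.0))
    scale = np.sqrt(noise_power / 2.0)           # variance split between Re and Im
    N = rng.normal(0.0, scale, X.shape) + 1j * rng.normal(0.0, scale, X.shape)
    return X + N, snr_db
```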
Fig. 4. An example of the spectrograms within the proposed DeGLI-block. "GLA" represents the output of the GLA-inspired layers, Z = P_C(P_A(X)), and "Original" is the clean speech signal X* to be recovered. The difference between them, Z − X*, is shown as "Residual", while its estimation F(X, Y, Z) is denoted by "DNN output". The DNN was able to accurately estimate the Residual.

An example of the results of the residual learning is shown in Fig. 4 for illustrating how the DNN in the proposed DeGLI-block works. As shown in the figure, the DNN appropriately estimated the residual, which is the difference between the output of the GLA-inspired layers and the target spectrogram. Such an estimated residual is subtracted so that the difference to the ideal spectrogram is reduced. We expect that the estimation by the DNN is reasonably accurate to improve the output of the DeGLI-block.

The performance of phase reconstruction was evaluated by STOI [32] and PESQ [33]. The scores per iteration averaged over the test set are shown in the upper row of Fig. 5. Both STOI and PESQ of the proposed method were always higher than those of GLA at each iteration, and the performance improved as the number of iterations increased. Since the iteration corresponds to the depth of the whole DeGLI architecture, this result indicates that one can iterate the DeGLI-block until the quality of the reconstructed signal becomes satisfactory. Namely, one can eliminate unnecessary computation, or decide the depth based on the available computational resources at that time. We stress that this unique feature of the proposed method cannot be achieved by a single rich DNN directly mapping an inputted amplitude spectrogram into the final reconstructed signal.

Since the computational time per iteration is different between GLA and the proposed DeGLI-block, the performance was also investigated in terms of computational time for a fair comparison. In this experiment, an Intel Core CPU and an NVIDIA GeForce GTX GPU were employed, and for both methods, STFT and iSTFT were implemented in TensorFlow. The scores per computational time are illustrated in the bottom row of Fig. 5. Since an iteration of the proposed method was slower than one of GLA on both GPU and CPU, the difference of the scores between the methods is closer than in the top row. Nevertheless, the proposed method notably outperformed GLA, especially in PESQ. To see the scores at some specific iterations, box plots of the scores are also shown in Fig. 6, where the iteration numbers for GLA and for the proposed method were chosen so that the computational times of the two methods were roughly the same. It can be seen that the tendencies of the scores are the same as the averaged values in Fig. 5, and the effectiveness of the proposed DeGLI architecture was confirmed by a paired one-sided t-test.

Fig. 5. Average scores of STOI and PESQ per iteration (top) and per computational time (bottom) for GLA (blue, circles) and the proposed method (red, cross marks). The yellow dashed line indicates the real-time factor. For measuring the computational time, both methods were implemented using both CPU and GPU.

Fig. 6. Box plots of the scores of STOI and PESQ, where GLA and DeGLI were evaluated at iteration numbers with roughly the same computational time. The red lines are the medians, and the boxes indicate the first and third quartiles.

In summary, it was confirmed that the proposed DeGLI architecture can be trained so that utilizing the common block for every iteration improves the performance, which should be because of the training as a denoiser and the residual learning strategy. Note that the trainable DNN used in this experiment was merely an example, and it must be possible to improve the performance by considering a DNN more suitable for phase reconstruction.
5. CONCLUSION
In this study, we proposed a deep architecture, named DeGLI, which combines a DNN with the iterative procedure of GLA. The key idea was to stack the same sub-block so that the depth of the whole architecture can be adjusted without increasing the number of trainable parameters. This feature enables one to trade the quality of the reconstructed signal and the computational load depending on applications. The residual learning strategy was applied to train the sub-block as a denoiser, where the DNN removes the undesired components introduced by GLA. Experimental results confirmed that a denoising sub-block is applicable to phase reconstruction, which indicates that the task of training can be different from phase reconstruction itself, which is not an easy task for a DNN owing to the periodic nature of phase. Investigation of a DNN more suitable for the proposed DeGLI remains as future work.
6. REFERENCES

[1] T. Gerkmann, M. Krawczyk-Becker, and J. Le Roux, "Phase processing for single-channel speech enhancement: History and recent advances," IEEE Signal Process. Mag., vol. 32, no. 2, pp. 55–66, Mar. 2015.
[2] P. Mowlaee, R. Saeidi, and Y. Stylianou, "Advances in phase-aware signal processing in speech communication," Speech Commun., vol. 81, pp. 1–29, July 2016.
[3] K. Paliwal, K. Wójcicki, and B. Shannon, "The importance of phase in speech enhancement," Speech Commun., vol. 53, no. 4, pp. 465–494, Apr. 2011.
[4] M. Krawczyk and T. Gerkmann, "STFT phase reconstruction in voiced speech for an improved single-channel speech enhancement," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 22, no. 12, pp. 1931–1940, Dec. 2014.
[5] Y. Wakabayashi, T. Fukumori, M. Nakayama, T. Nishiura, and Y. Yamashita, "Single-channel speech enhancement with phase reconstruction based on phase distortion averaging," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 26, no. 9, pp. 1559–1569, Sept. 2018.
[6] Y. Masuyama, K. Yatabe, and Y. Oikawa, "Model-based phase recovery of spectrograms via optimization on Riemannian manifolds," in Int. Workshop Acoust. Signal Enhance. (IWAENC), Sept. 2018, pp. 126–130.
[7] K. Yatabe, Y. Masuyama, and Y. Oikawa, "Rectified linear unit can assist Griffin–Lim phase recovery," in Int. Workshop Acoust. Signal Enhance. (IWAENC), Sept. 2018, pp. 555–559.
[8] Y. Masuyama, K. Yatabe, and Y. Oikawa, "Griffin–Lim like phase recovery via alternating direction method of multipliers," IEEE Signal Process. Lett., vol. 26, no. 1, pp. 184–188, Jan. 2019.
[9] S. Takaki, H. Kameoka, and J. Yamagishi, "Direct modeling of frequency spectra and waveform generation based on phase recovery for DNN-based speech synthesis," in INTERSPEECH, Aug. 2017, pp. 1128–1132.
[10] Y. Saito, S. Takamichi, and H. Saruwatari, "Text-to-speech synthesis using STFT spectra based on low-/multi-resolution generative adversarial networks," in IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), Apr. 2018, pp. 5299–5303.
[11] D. Griffin and J. Lim, "Signal estimation from modified short-time Fourier transform," IEEE Trans. Acoust., Speech, Signal Process., vol. 32, no. 2, pp. 236–243, Apr. 1984.
[12] J. Le Roux, H. Kameoka, N. Ono, and S. Sagayama, "Fast signal reconstruction from magnitude STFT spectrogram based on spectrogram consistency," in Proc. 13th Int. Conf. Digit. Audio Eff. (DAFx-10), Sept. 2010, pp. 397–403.
[13] N. Takahashi, P. Agrawal, N. Goswami, and Y. Mitsufuji, "PhaseNet: Discretized phase modeling with deep neural networks for audio source separation," in INTERSPEECH, Sept. 2018, pp. 2713–2717.
[14] J. Le Roux, G. Wichern, S. Watanabe, A. Sarroff, and J. R. Hershey, "Phasebook and friends: Leveraging discrete representations for source separation," arXiv preprint arXiv:1810.01395, 2018.
[15] S. Takamichi, Y. Saito, N. Takamune, D. Kitamura, and H. Saruwatari, "Phase reconstruction from amplitude spectrograms based on von-Mises-distribution deep neural network," in Int. Workshop Acoust. Signal Enhance. (IWAENC), Sept. 2018, pp. 286–290.
[16] K. Oyamada, H. Kameoka, T. Kaneko, K. Tanaka, N. Hojo, and H. Ando, "Generative adversarial network-based approach to signal reconstruction from magnitude spectrograms," in Eur. Signal Process. Conf. (EUSIPCO), Sept. 2018.
[17] P. Magron, J. Le Roux, and T. Virtanen, "Consistent anisotropic Wiener filtering for audio source separation," in IEEE Workshop Appl. Signal Process. Audio Acoust. (WASPAA), Oct. 2017, pp. 269–273.
[18] Z.-Q. Wang, J. Le Roux, D. Wang, and J. R. Hershey, "End-to-end speech separation with unfolded iterative phase reconstruction," in INTERSPEECH, Sept. 2018, pp. 2708–2712.
[19] G. Wichern and J. Le Roux, "Phase reconstruction with learned time-frequency representations for single-channel speech separation," in Int. Workshop Acoust. Signal Enhance. (IWAENC), Sept. 2018, pp. 396–400.
[20] K. Gregor and Y. LeCun, "Learning fast approximations of sparse coding," in Int. Conf. Mach. Learn. (ICML), 2010, pp. 399–406.
[21] M. Borgerding, P. Schniter, and S. Rangan, "AMP-inspired deep networks for sparse linear inverse problems," IEEE Trans. Signal Process., vol. 65, no. 16, pp. 4294–4308, Aug. 2017.
[22] Y. Yang, J. Sun, H. Li, and Z. Xu, "Deep ADMM-Net for compressive sensing MRI," in Adv. Neural Inf. Process. Syst. (NIPS), 2016, pp. 10–18.
[23] J. R. Hershey, J. Le Roux, and F. Weninger, "Deep unfolding: Model-based inspiration of novel deep architectures," arXiv preprint arXiv:1409.2574, 2014.
[24] S. Wisdom, J. Hershey, J. Le Roux, and S. Watanabe, "Deep unfolding for multichannel source separation," in IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), Mar. 2016, pp. 121–125.
[25] N. Perraudin, P. Balazs, and P. L. Søndergaard, "A fast Griffin–Lim algorithm," in IEEE Workshop Appl. Signal Process. Audio Acoust. (WASPAA), Oct. 2013, pp. 1–4.
[26] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in IEEE Conf. Comput. Vision Pattern Recognit. (CVPR), June 2016.
[27] K. Zhang, W. Zuo, Y. Chen, D. Meng, and L. Zhang, "Beyond a Gaussian denoiser: Residual learning of deep CNN for image denoising," IEEE Trans. Image Process., vol. 26, no. 7, pp. 3142–3155, July 2017.
[28] K. Zhang, W. Zuo, S. Gu, and L. Zhang, "Learning deep CNN denoiser prior for image restoration," in IEEE Conf. Comput. Vision Pattern Recognit. (CVPR), July 2017, pp. 2808–2817.
[29] S. V. Venkatakrishnan, C. A. Bouman, and B. Wohlberg, "Plug-and-Play priors for model based reconstruction," in IEEE Glob. Conf. Signal Inf. Process., Dec. 2013, pp. 945–948.
[30] Y. Romano, M. Elad, and P. Milanfar, "The little engine that could: Regularization by denoising (RED)," SIAM J. Imaging Sci., vol. 10, no. 4, pp. 1804–1844, 2017.
[31] Y. N. Dauphin, A. Fan, M. Auli, and D. Grangier, "Language modeling with gated convolutional networks," arXiv preprint arXiv:1612.08083, 2016.
[32] C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, "An algorithm for intelligibility prediction of time-frequency weighted noisy speech," IEEE Trans. Audio, Speech, Lang. Process., vol. 19, no. 7, pp. 2125–2136, 2011.
[33] A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra, "Perceptual evaluation of speech quality (PESQ) — A new method for speech quality assessment of telephone networks and codecs," in IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), 2001.