A Waveform Representation Framework for High-quality Statistical Parametric Speech Synthesis
Bo Fan∗, Siu Wa Lee†, Xiaohai Tian‡§, Lei Xie∗ and Minghui Dong†
∗School of Computer Science, Northwestern Polytechnical University, Xi'an, China
†Human Language Technology Department, Institute for Infocomm Research, Singapore
‡School of Computer Engineering, Nanyang Technological University (NTU), Singapore
§Joint NTU-UBC Research Center of Excellence in Active Living for the Elderly, NTU, Singapore
Email: {bofan,lxie}@nwpu-aslp.org, {swylee,mhdong}@i2r.a-star.edu.sg, [email protected]

Abstract—State-of-the-art statistical parametric speech synthesis (SPSS) generally uses a vocoder to represent speech signals and parameterize them into features for subsequent modeling. The magnitude spectrum has been the dominant feature over the years. Although perceptual studies have shown that the phase spectrum is essential to the quality of synthesized speech, it is often discarded by using a minimum-phase filter during synthesis, and the speech quality suffers. To bypass this bottleneck of vocoded speech, this paper proposes a phase-embedded waveform representation framework and establishes a magnitude-phase joint modeling platform for high-quality SPSS. Our experiments on waveform reconstruction show that its performance is better than that of the widely used STRAIGHT. Furthermore, the proposed modeling and synthesis platform outperforms a leading-edge, vocoded, deep bidirectional long short-term memory recurrent neural network (DBLSTM-RNN)-based baseline system in various objective evaluation metrics.
I. INTRODUCTION
Statistical parametric speech synthesis (SPSS) has become increasingly popular due to its compact and flexible representation of voice characteristics [1]. Conventionally, in an SPSS system, we first extract parametric representations of speech, including spectral and excitation parameters, from a speech database and then model them with a set of statistical models [2]. Several statistical generative models have been applied to SPSS successfully, e.g., hidden Markov model (HMM)-based SPSS [2], deep neural network (DNN)-based SPSS [3] and deep bidirectional long short-term memory recurrent neural network (DBLSTM-RNN)-based SPSS [4].

To parameterize speech signals into features for subsequent synthesis processes, a vocoder has typically been used. It is based on the source-filter model [5], which assumes that a stationary speech segment is generated by passing a sound source through a vocal tract filter. With a vocoder, the resultant speech features are regular and suitable for modeling. However, the subjective listening test in [6] shows a clear degradation of quality in vocoded speech, and further indicates that the source and filter parameters have to be jointly modelled for high-quality synthesis. Besides, to assure interframe coherence [7], a minimum-phase hypothesis [7] has been adopted in most vocoders; it ignores the natural mixed-phase characteristics of speech signals, resulting in apparent degradation of waveform quality.

More and more works have reported the importance of phase information in different speech processing applications, such as speech synthesis [8, 9], iterative signal reconstruction [10], automatic speech recognition [11, 12], speech coding [13] and pitch extraction [14]. Paliwal et al. [15] investigated the relative importance of the short-time magnitude and phase spectra in speech perception through human listening tests. Their results show that the phase spectrum clearly contributes to speech intelligibility, sometimes as much as the magnitude spectrum. Koutsogiannaki et al. [16] proposed the phase distortion deviation feature, which captures voice irregularities and highlights the importance of the phase spectrum in voice quality assessment. These two works indicate that phase information matters for both human perception and voice quality assessment. Combining the phase spectrum with the magnitude spectrum in the frequency domain is equivalent to the speech waveform in the time domain. Therefore, phase information is a focus of our speech waveform representation framework.

There are some approaches that represent the waveform directly in the time domain. Time-domain pitch-synchronous overlap-add (TD-PSOLA) [17] performs pitch-synchronous analysis, modification and synthesis; during synthesis, speech frames are summed up. The quality of the reconstructed waveform under typical pitch or timing modification is similar to that of the original waveform. Multi-band re-synthesis pitch-synchronous overlap-add (MBR-PSOLA) [18] identifies three mismatches in TD-PSOLA: phase mismatch, pitch mismatch and spectral envelope mismatch. It further suggests solving these mismatches by re-synthesizing the voiced parts of the speech database with constant phase and constant pitch. This artificial processing in MBR-PSOLA decreases the quality of speech and leads to a buzzy sound [19]. Alternatively, there are a few recent works for SPSS directly in the time domain. Tokuda et al. [20] proposed an approach that models cepstral coefficients to approximate the speech waveform.
In their framework, however, periodic voiced components have not been properly generated yet. In [21], the complex cepstrum has been used to embed phase information for hidden semi-Markov model (HSMM)-based speech modelling.

In this paper, we propose a phase-embedded waveform representation framework, and establish a magnitude-phase joint modeling platform for SPSS. This work uses a glottal-synchronous overlap-add approach for speech analysis and synthesis, where glottal closure instants (GCIs) are employed. GCIs refer to the moments of most significant excitation that occur at the level of the vocal folds during each glottal period [22]. Short-term segments are defined by any two consecutive GCI periods. In order to produce the smooth feature trajectories required in SPSS, we design a cost function with a global smoothness constraint; the selected GCI locations are finally determined by dynamic programming over a list of probable GCI candidates. Consequently, these segments are very regular, with stable magnitude and matched phase spectra. With this waveform representation framework, the bottleneck suffered by vocoded speech is bypassed, and the framework is hence capable of delivering better speech quality than vocoded speech. We then propose an approach for magnitude-phase joint spectrum modeling. The full spectrum is used in this framework, in line with the satisfactory performance of recent deep learning-based TTS [23]. To leverage the modeling power of deep learning, we use a DBLSTM-RNN to learn the magnitude and phase spectra simultaneously. Bidirectional recurrent connections fully exploit the speech contextual information in both forward and backward directions, and with purpose-built memory cells to store information, the long short-term memory (LSTM) architecture is better at finding and exploiting long-range context.

II. TD-PSOLA

Time-domain pitch-synchronous overlap-add (TD-PSOLA) is used for pitch and timing modification of speech signals [17], [24]. It is also popular for concatenation-based TTS. As no source-filter decomposition or vocoding is performed, the quality of the resultant speech after analysis and reconstruction is highly similar to that of the original speech.

Given an arbitrary speech waveform signal x(n), TD-PSOLA is carried out in the time domain. It first decomposes x(n) into a sequence of overlapping, pitch-synchronized segments. Each segment x_s(n) lasts for two pitch periods, running from one pitch period before to one pitch period after the segment centre. A window function h_s(n), such as a Hanning window, is then applied to each segment. Let S denote the total number of segments, where s = 1, 2, ..., S:

x_s(n) = h_s(n) x(n)    (1)

h_s(n) is non-zero over this two-pitch-period span. This is how x_s(n) is extracted for voiced speech; for unvoiced speech, the segment length is set to a constant. Any modification in pitch or timing can then be performed on these extracted segments. Finally, the modified segments are overlapped and added to produce the speech output [24]; a minimal sketch of the extraction and overlap-add steps closes this section.

Although TD-PSOLA generates pitch- and timing-modified output signals with satisfactory speech quality, using TD-PSOLA in speech synthesis, where statistical averaging, modeling or signal modification are common, is not sufficient. This is because matched attributes on phase and pitch are needed [18].
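As a concrete illustration of Eq. (1) and the overlap-add step, here is a minimal NumPy sketch of TD-PSOLA-style analysis and re-synthesis. It assumes the pitch marks are already known and sit away from the signal edges; the function names and the simple mark-shift interface are our own illustration, not part of [17].

```python
import numpy as np

def extract_segments(x, marks):
    """Cut two-pitch-period segments x_s(n) = h_s(n) x(n) (Eq. 1):
    each segment runs from the previous pitch mark to the next one
    and is weighted by a Hanning window h_s(n)."""
    segments = []
    for s in range(1, len(marks) - 1):
        left, centre, right = marks[s - 1], marks[s], marks[s + 1]
        seg = x[left:right] * np.hanning(right - left)
        segments.append((seg, centre - left))  # remember the centre offset
    return segments

def overlap_add(segments, new_marks, length):
    """Place each windowed segment at its (possibly modified) mark and sum;
    the overlapping halves of neighbouring segments add up to the output."""
    y = np.zeros(length)
    for (seg, offset), centre in zip(segments, new_marks):
        start = centre - offset
        y[start:start + len(seg)] += seg
    return y
```

Pitch or timing modification then amounts to feeding overlap_add a new mark sequence, e.g. marks moved closer together to raise the pitch.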
Fig. 1. Our proposed waveform representation framework.

III. WAVEFORM REPRESENTATION FRAMEWORK
In this work, a glottal-synchronous waveform representation framework is proposed for speech modelling. As in TD-PSOLA, glottal closure instants (GCIs) represent both the pitch contours and the boundaries of individual cycles of speech. Existing GCI detection approaches generally estimate the GCI locations in a local manner, ignoring the resultant utterance-level trajectories of various acoustic attributes, i.e., segment length (representing the fundamental frequency, F0), magnitude and phase spectrum. As smooth trajectories of these attributes are necessary for SPSS, we revise a state-of-the-art GCI detection approach so as to facilitate satisfactory modelling of these attributes.

A. System Overview
The proposed framework, as shown in Fig. 1, consists of two parts: analysis and synthesis. In the analysis stage, given an arbitrary waveform, the GCI locations are first detected by the revised GCI detection module described below. The waveform is then decomposed into overlapping short-term segments, where each segment is defined by any two consecutive GCI periods. Finally, segment lengths, magnitude and phase spectra are used to represent these segments.

In the synthesis stage, given the corresponding segment lengths, magnitude and phase spectra, we convert them into overlapping short-term segments. The waveform is then reconstructed using a technique similar to TD-PSOLA [17]. A sketch of both stages is given below.
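To make the two stages of Fig. 1 concrete, here is a minimal NumPy sketch under our reading of the text: each segment spans two consecutive GCI periods and is represented by its length plus the FFT magnitude and phase spectra. The FFT size and function names are illustrative assumptions, not fixed by the paper.

```python
import numpy as np

N_FFT = 512  # assumed analysis FFT size; segments are zero-padded to it

def analyze(x, gcis):
    """Analysis stage: decompose the waveform into overlapping short-term
    segments bounded by consecutive GCIs (two glottal periods each) and
    represent each by its length, magnitude spectrum and phase spectrum."""
    lengths, mags, phases = [], [], []
    for i in range(len(gcis) - 2):
        seg = x[gcis[i]:gcis[i + 2]]  # segment = two consecutive GCI periods
        spec = np.fft.rfft(seg, n=N_FFT)
        lengths.append(len(seg))
        mags.append(np.abs(spec))
        phases.append(np.angle(spec))
    return lengths, mags, phases

def synthesize_segment(length, mag, phase):
    """Synthesis stage: recombine magnitude and phase, invert the FFT and
    truncate to the stored segment length; the resulting segments are then
    overlap-added as in TD-PSOLA [17]."""
    spec = mag * np.exp(1j * phase)
    return np.fft.irfft(spec, n=N_FFT)[:length]
```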
B. Glottal Closure Instant Detection
The GCI positions determine the features, including the segment lengths, magnitude and phase spectra; the GCI detection method is therefore of great importance.

Among existing GCI detection techniques, the Speech Event Detection using the Residual Excitation And a Mean-based Signal (SEDREAMS) algorithm [25] is widely used; in [26], SEDREAMS was shown to have the highest robustness and reliability. During detection, SEDREAMS outputs only one GCI location for each GCI segment [25]. This is a local estimation process that ignores the GCI detection results in the neighborhood. However, SPSS requires smooth trajectories of the speech features, which are fixed once the GCI locations are determined. By considering lists of probable GCI candidates and estimating the optimal GCI locations in a global manner, the trajectories of these features are stabilized. Based on SEDREAMS, our modified GCI detection method contains the following steps (for the detailed implementations of the moving average filter and the interval determination, please refer to [25]):

a) Given a waveform x(n) (Fig. 2(a)), calculate the moving average signal (Fig. 2(b)).
b) Determine the intervals of possible GCI locations (Fig. 2(c)).
c) In each interval, choose M candidates as the possible GCI locations, based on the top M linear predictive coding (LPC) residual values in the LPC residual signal (Fig. 2(d)). Suppose there are N intervals; the k-th candidate of the i-th interval is denoted g_{i,k}.
d) Transfer all the possible segment lengths into F0. For the i-th segment, the j-th F0 value is

F0_{i,j} = F_s / (g_{i+1,s} − g_{i,t}),    (2)

where F_s is the sampling frequency, i = 1, 2, ..., N, j = 1, 2, ..., M × M, s = 1, 2, ..., M and t = 1, 2, ..., M.
e) Given the reference F0_ref, the optimal segment lengths are determined by dynamic programming with the following criterion:

E = arg min_j Σ_{i=1}^{N} ||F0_ref − F0_{i,j}||.    (3)

f) Finally, the GCI locations are deduced accordingly (Fig. 2(e)).

Fig. 2. (a) A section of voiced waveform; (b) the corresponding moving average signal; (c) short intervals in the moving average signal; (d) the LPC residual signal in one interval, with candidates marked by red crosses; (e) the final GCI locations marked by red stems.

In our implementation, M is five and the reference F0 is extracted by STRAIGHT [27], which is robust for F0 tracking and generates a highly accurate and smooth F0 trajectory. The dynamic programming is implemented with the Viterbi algorithm. In a voiced segment, the pitch located in the middle is more stable than at the ends; consequently, the Viterbi search starts at this middle position and proceeds toward both ends. A sketch of the candidate selection is given at the end of this subsection.

A comparison of the F0 trajectories from our GCI detection and from SEDREAMS is depicted in Fig. 3. From Fig. 3(a), it is observed that the F0 given by our GCI detection is smoother than the one from SEDREAMS, and from Fig. 3(b), it is clear that our approach removes some abnormal jumps in the F0 trajectory (around the 247-th frame) that occur with SEDREAMS.

Fig. 3. A comparison of the F0 trajectories between our GCI detection and SEDREAMS. (a) and (b) are two segments in the voiced parts.
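Steps c)–f) amount to a shortest-path search over per-interval candidate lists. The sketch below is our simplified forward recursion, with an absolute F0 deviation standing in for the norm in Eq. (3); the paper's actual Viterbi search starts from the middle of each voiced segment, which is omitted here.

```python
def select_gcis(cands, f0_ref, fs=16000):
    """cands[i]: the M candidate GCI sample positions in interval i
    (top LPC-residual peaks); f0_ref[i]: reference F0 for the segment
    between intervals i and i+1.  Returns one GCI per interval such that
    the implied F0 track stays close to the reference (Eqs. 2-3)."""
    cost = {k: 0.0 for k in range(len(cands[0]))}  # best path cost per candidate
    back = []                                      # back-pointers per interval
    for i in range(1, len(cands)):
        new_cost, ptr = {}, {}
        for s, g_next in enumerate(cands[i]):
            # candidate segment F0 = fs / (g_{i,s} - g_{i-1,t})  (Eq. 2)
            new_cost[s], ptr[s] = min(
                (cost[t] + abs(f0_ref[i - 1] - fs / (g_next - g_prev)), t)
                for t, g_prev in enumerate(cands[i - 1])
            )
        cost, back = new_cost, back + [ptr]
    k = min(cost, key=cost.get)   # cheapest final candidate
    path = [k]
    for ptr in reversed(back):    # backtrack to the first interval
        k = ptr[k]
        path.append(k)
    path.reverse()
    return [cands[i][k] for i, k in enumerate(path)]
```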
IV. WAVEFORM MODELING

State-of-the-art SPSS usually models the magnitude spectrum of speech signals and discards the phase spectrum.
During synthesis, a vocoder based on a minimum-phase or zero-phase filter is often used together with the generated magnitude spectra to produce the synthesized output. Nevertheless, the phase spectrum has recently been found to be essential for speech perception, and the quality of vocoded outputs has been found to be degraded from the original speech recordings [6]. This may shed light on SPSS, where the speech waveform, with phase information in addition to the existing magnitude spectrum, is modeled.

In our work, speech signals are modeled by the corresponding magnitude and phase spectra, without the use of a vocoder. Consequently, reconstruction of the speech waveform is facilitated. We use a recently emerging learning technique, DBLSTM-RNN, to jointly model the two spectra. A DBLSTM-RNN is well suited to learning sequential events despite long time lags of unknown size [28], and promising performance in various speech applications has been observed [29], [4].
Fig. 4. Static and dynamic phase spectra as well as their corresponding magnitude spectrum.

Our joint model of magnitude and phase is constructed as follows. We employ the line spectrum pair (LSP) as the feature representation of the magnitude spectrum. LSP, an alternative LPC spectral representation, is robust and suitable for interpolation and modeling [30], [31].

For the phase spectrum, we propose to use the dynamic phase spectrum in this waveform-learning TTS framework. It is also called the group delay: the group delay τ_k(n) at time n and frequency bin k is calculated as the frequency derivative of the instantaneous phase θ_k(n), i.e.,

τ_k(n) = θ_k(n) − θ_{k−1}(n).    (4)

To enable reconstruction of the phase spectrum after DBLSTM-RNN modeling, the instantaneous phase at the first frequency bin is kept, together with the group delays of the remaining frequency bins. In other words, our phase representation consists of θ_1(n), τ_2(n), τ_3(n), ..., τ_K(n), where K is the total number of frequency bins.

This group-delay-based phase representation is found to be stable and facilitates statistical modeling in the subsequent TTS process, as shown in Fig. 4. Comparing the spectra of static and dynamic phase, the distribution of the dynamic phase often exhibits a smaller range. Comparing the log magnitude spectrum with the dynamic phase spectrum, the patterns of voiced and unvoiced portions are consistent, and the spectral patterns of individual speech sounds are quite similar in the two. This is important and useful for our joint modeling. On the contrary, the static phase spectrum shows no clear difference between individual speech sounds; moving along the time axis, the static phase spectra look much the same.
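As an illustration of Eq. (4) and its inversion, the sketch below derives the feature [θ_1(n), τ_2(n), ..., τ_K(n)] from one frame of instantaneous phase and restores the phase by cumulative summation. Wrapping the differences to (−π, π] is our assumption for keeping the feature range small; it leaves the complex spectrum unchanged after reconstruction.

```python
import numpy as np

def phase_to_feature(theta):
    """theta: instantaneous phase theta_k(n) over the K frequency bins of
    one frame.  Returns [theta_1, tau_2, ..., tau_K], where
    tau_k = theta_k - theta_{k-1} (Eq. 4), wrapped to (-pi, pi]."""
    tau = np.angle(np.exp(1j * np.diff(theta)))  # wrapped group delay
    return np.concatenate(([theta[0]], tau))

def feature_to_phase(feat):
    """Invert the representation: a cumulative sum restores a phase that
    differs from the original only by multiples of 2*pi per bin."""
    return np.cumsum(feat)
```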
V. EXPERIMENTS

We conducted two experiments to assess the efficacy of our waveform representation framework. In the experiment on waveform reconstruction, objective and subjective evaluations were carried out to compare the performance of our framework with three vocoders: STRAIGHT, Tandem-STRAIGHT [32] and AHOCoder [33]. STRAIGHT is a very popular vocoder for speech analysis and reconstruction, Tandem-STRAIGHT is an upgraded version of STRAIGHT, and AHOCoder is reported to deliver quality similar to STRAIGHT. In the experiment on waveform modeling, we trained a text-to-speech (TTS) system based on our framework, together with a baseline TTS system [4] for comparison. The baseline is a leading-edge approach based on DBLSTM-RNN that generates high-quality synthesized speech; it uses STRAIGHT as its vocoder.

A corpus of 4,936 Chinese utterances (around 6 hours) spoken by a native male speaker in a neutral style was used in our experiments. The speech waveforms are sampled at 16 kHz. The contextual labels are both phonetically and prosodically rich, including quinphone, prosody, tone and syllable information. For the TTS systems, the training, validation and test data consist of 3,949, 494 and 493 utterances, respectively.
A. Experiment on Waveform Reconstruction
Speech waveform in the test set of the corpus was analyzed and re-synthesized using our waveform representation framework and the three vocoders. The reconstructed speech waveform was then used for objective and subjective evaluations.
1) Objective Evaluation:
In the objective evaluation, we calculated the root mean square error (RMSE) between the reconstructed and original speech waveforms over the voiced parts (RMSE voiced), the unvoiced parts (RMSE unvoiced) and the entire waveform (RMSE), respectively. The results are shown in Table I. The voiced and unvoiced results from our framework and the three vocoders roughly reflect the performance on vowels and consonants, respectively.
TABLE I
RECONSTRUCTION PERFORMANCE: OUR FRAMEWORK VS. THE THREE VOCODERS.

Methods                 RMSE voiced   RMSE unvoiced   RMSE
Our framework           –             –               –
STRAIGHT [27]           0.173         0.044           0.152
Tandem-STRAIGHT [32]    0.177         0.044           0.156
AHOCoder [33]           0.182         0.049           0.160
The objective evaluation shows that the performance of our framework is much better than that of the three vocoders, especially in the voiced parts. In the unvoiced parts, our framework extracts the short-term segments at a constant rate, similar to STRAIGHT; taking the waveform in Fig. 5 around the 5000-th sample as an example, the absolute difference between (a) and (b) is very close to that between (a) and (c). In the voiced parts, our framework performs much better than STRAIGHT, because our framework retains the full phase spectrum while STRAIGHT discards it and uses a minimum-phase setting instead. It is clear from Fig. 5 that the absolute difference between (a) and (b) is much smaller than that between (a) and (c) in the voiced parts.

Fig. 5. The reconstructed waveform using our framework and STRAIGHT.
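For reference, the three RMSE figures of Table I can be computed as in the sketch below, which assumes time-aligned original and reconstructed waveforms and a per-sample boolean voicing mask; it is our illustration, not the authors' evaluation script.

```python
import numpy as np

def waveform_rmse(ref, rec, voiced_mask):
    """RMSE between original and reconstructed waveforms over the voiced
    samples, the unvoiced samples and the entire waveform (Table I)."""
    sq_err = (ref - rec) ** 2
    rms = lambda e: float(np.sqrt(np.mean(e)))
    return {
        "RMSE voiced": rms(sq_err[voiced_mask]),
        "RMSE unvoiced": rms(sq_err[~voiced_mask]),
        "RMSE": rms(sq_err),
    }
```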
2) Subjective Evaluation:
Twenty pairs of speech waveforms were randomly selected from the reconstructed waveforms, and a group of 20 subjects was asked to perform an ABX preference test. We put the original waveform in X, while the waveform reconstructed using our framework and that from each of the three vocoders were placed in A and B at random. Each subject was asked to answer which one (A or B) was more similar to X; a third option, Neutral, meant the subject had no preference for A or B. The ABX results are shown in Fig. 6: the speech waveform reconstructed using our framework is significantly preferred over all three vocoders.

Fig. 6. The ABX results of the reconstructed speech waveforms using our framework and the three vocoders. We conducted a t-test with a significance level of p < 0.05, depicted with the orange error bars.
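A preference score as plotted in Fig. 6 is just a normalized tally of the ABX answers; a minimal sketch, with answer labels of our own choosing:

```python
from collections import Counter

def preference_scores(answers):
    """answers: one label per (listener, pair) trial, e.g. 'ours',
    'vocoder' or 'neutral'.  Returns the percentage preference scores."""
    counts = Counter(answers)
    total = sum(counts.values())
    return {choice: 100.0 * n / total for choice, n in counts.items()}
```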
B. Experiment on Waveform Modeling
In the baseline DBLSTM-RNN-based TTS [4], STRAIGHT is used to vocode the speech waveform with a 25-ms moving window shifted every 5 ms. The magnitude spectrum generated by STRAIGHT was converted into LSP. The dimensionality of the input contextual label is 427. The output feature contains a voiced/unvoiced flag (1 dimension), log F0 (1 dimension), LSP (40 dimensions) and gain (1 dimension), 43 dimensions in total. As suggested in [4], a neural network with two BLSTM layers sitting on two feed-forward layers, with 256 nodes in each layer, is employed to train the DBLSTM-RNN-based TTS.

For our TTS system, features were extracted from the short-term segments specified by the GCI locations. The format of the input label is the same as in the baseline. The segment length is transformed into F0. The output feature comprises several components: a voiced/unvoiced flag (1 dimension), log F0 (1 dimension), LSP (40 dimensions), gain (1 dimension) and the dynamic phase feature (257 dimensions), 300 dimensions in total. The same network topology as the baseline is used to train our TTS system.

To evaluate the performance of the two TTS systems, five metrics are used for objective evaluation (a sketch of the three spectral distances follows the list):

• RMSE F0: root mean square error in F0 estimation;
• Voiced/unvoiced (V/U) error rate;
• Log spectral distance (LSD):
LSD(S_p, S_g) = √( (1/N) Σ_{j=1}^{N} Σ_{k=1}^{M_s} [10 log10 s_p(j,k) − 10 log10 s_g(j,k)]² ),    (5)

where S_p and S_g are the predicted and ground-truth magnitude spectra, respectively; N is the total number of frames in the voiced parts; M_s is the dimensionality of the magnitude spectrum; and s_p(j,k) is the k-th magnitude value in the j-th frame;

• Mel cepstral distance (MCD):
MCD(c_p, c_g) = (10/ln 10) √( 2 Σ_{k=1}^{M_c} [c_p(k) − c_g(k)]² ),    (6)

where c_p and c_g are the predicted and ground-truth Mel cepstral coefficient vectors, respectively, and M_c is the dimensionality of the Mel cepstral coefficients;

• Dynamic phase distance (DPD):
DPD(d_p, d_g) = √( (1/M_d) Σ_{k=1}^{M_d} [d_p(k) − d_g(k)]² ),    (7)

where d_p and d_g are the predicted and ground-truth dynamic phase feature vectors, respectively, and M_d is the dimensionality of the dynamic phase feature.
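Under our reconstruction of Eqs. (5)–(7) (the 10/ln 10 and √2 constants in MCD, and the per-bin averaging in DPD, are assumptions based on common conventions), the three spectral distances can be sketched as:

```python
import numpy as np

def lsd(S_p, S_g):
    """Log spectral distance (Eq. 5): S_p, S_g are (N frames x M_s bins)
    magnitude spectra over the voiced frames."""
    d = 10.0 * (np.log10(S_p) - np.log10(S_g))
    return float(np.sqrt(np.mean(np.sum(d ** 2, axis=1))))

def mcd(c_p, c_g):
    """Mel cepstral distance (Eq. 6) for one pair of cepstrum vectors."""
    return float((10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum((c_p - c_g) ** 2)))

def dpd(d_p, d_g):
    """Dynamic phase distance (Eq. 7): RMS difference of the group-delay
    features over the M_d phase dimensions."""
    return float(np.sqrt(np.mean((d_p - d_g) ** 2)))
```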
The synthesized speech waveforms are generated from the labels in the test set using the ground-truth durations. The five metrics are calculated at the GCI level, i.e., the short-term segments are specified by the GCI locations. To make the systems comparable, GCI detection is performed on every speech waveform synthesized by a system under comparison, and the detected GCIs are then aligned to the ground-truth GCIs by finding the closest one.

The objective evaluation results are shown in Table II: our TTS system is better than the baseline in terms of all five metrics. In particular, for DPD, the average absolute difference in one frequency bin is about 0.70 rad for our TTS system, versus 0.91 rad for the baseline TTS system.

TABLE II
OBJECTIVE EVALUATION ON WAVEFORM MODELING, WITH t-TEST USING A SIGNIFICANCE LEVEL OF p < 0.05.

Measures          Our TTS system    Baseline [4]
RMSE F0 (Hz)      –                 –
V/U error rate    –                 –
LSD               –                 –
MCD               –                 –
DPD (rad)         –                 –
VI. CONCLUSIONS AND FUTURE WORK
This paper proposed a glottal-synchronous waveform representation framework for high-quality statistical parametric speech synthesis. The speech signal is represented by full-spectrum magnitude and phase components, without the use of a vocoder. We revised the SEDREAMS GCI detection approach to improve feature stability for statistical modelling. Both objective and subjective evaluations were conducted to assess the reconstruction performance of our framework. The results indicate that, compared with the reconstructed signals obtained by three popular vocoders, the proposed framework achieves promising results in time-domain waveform RMSE and in preference score.

We also proposed a platform for speech modelling. A DBLSTM-RNN is applied to jointly model the corresponding magnitude and phase spectra, and a group-delay-based phase representation is used to facilitate statistical modelling. Objective results show that the TTS system based on the proposed framework generates features, specifically the phase feature, with lower distortion than a vocoder-based system. Future work includes studying the quality of the synthesized speech and the associated factors, as well as experiments on subjective evaluation.
ACKNOWLEDGMENT
This work was supported by the National Natural Science Foundation of China (61175018 and 61571363).
REFERENCES

[1] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura, "Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis," in Proc. Eurospeech, 1999, pp. 2347–2350.
[2] H. Zen, K. Tokuda, and A. Black, "Statistical parametric speech synthesis," Speech Communication, vol. 51, no. 11, pp. 1039–1064, 2009.
[3] H. Zen, A. Senior, and M. Schuster, "Statistical parametric speech synthesis using deep neural networks," in Proc. ICASSP. IEEE, 2013, pp. 7962–7966.
[4] Y. Fan, Y. Qian, F. Xie, and F. Soong, "TTS synthesis with bidirectional LSTM based recurrent neural networks," in Proc. Interspeech, 2014, pp. 1964–1968.
[5] G. Fant, Acoustic Theory of Speech Production, The Hague: Mouton, 1960.
[6] T. Merritt, T. Raitio, and S. King, "Investigating source and filter contributions, and their interaction, to statistical parametric speech synthesis," in Proc. Interspeech, 2014, pp. 1509–1513.
[7] M. Crespo, P. Velasco, L. Serrano, and J. Sardina, "On the use of a sinusoidal model for speech synthesis in text-to-speech," in Progress in Speech Synthesis, pp. 57–70. Springer, 1997.
[8] H. Banno, K. Takeda, and F. Itakura, "The effect of group delay spectrum on timbre," Acoustical Science and Technology, vol. 23, no. 1, pp. 1–9, 2002.
[9] G. Degottex, A. Roebel, and X. Rodet, "Phase minimization for glottal model estimation," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 5, pp. 1080–1090, 2011.
[10] L. Alsteris and K. Paliwal, "Iterative reconstruction of speech from short-time Fourier transform phase and magnitude spectra," Computer Speech & Language, vol. 21, no. 1, pp. 174–186, 2007.
[11] R. Schluter and H. Ney, "Using phase spectrum information for improved speech recognition performance," in Proc. ICASSP. IEEE, 2001, vol. 1, pp. 133–136.
[12] G. Shi, M. Shanechi, and P. Aarabi, "On the importance of phase in human speech recognition," IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 5, pp. 1867–1874, 2006.
[13] H. Pobloth and W. Kleijn, "Squared error as a measure of perceived phase distortion," The Journal of the Acoustical Society of America, vol. 114, no. 2, pp. 1081–1094, 2003.
[14] T. Nakatani, T. Irino, and P. Zolfaghari, "Dominance spectrum based V/UV classification and F0 estimation," in Proc. Eurospeech, 2003, pp. 2313–2316.
[15] K. Paliwal and L. Alsteris, "Usefulness of phase spectrum in human speech perception," in Proc. Eurospeech, 2003, pp. 2117–2120.
[16] M. Koutsogiannaki, O. Simantiraki, G. Degottex, and Y. Stylianou, "The importance of phase on voice quality assessment," in Proc. Interspeech, 2014, pp. 1653–1657.
[17] E. Moulines and F. Charpentier, "Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones," Speech Communication, vol. 9, no. 5, pp. 453–467, 1990.
[18] T. Dutoit and H. Leich, "MBR-PSOLA: Text-to-speech synthesis based on an MBE re-synthesis of the segments database," Speech Communication, vol. 13, no. 3, pp. 435–440, 1993.
[19] Y. Stylianou, "Removing linear phase mismatches in concatenative speech synthesis," IEEE Transactions on Speech and Audio Processing, vol. 9, no. 3, pp. 232–239, 2001.
[20] K. Tokuda and H. Zen, "Directly modeling speech waveforms by neural networks for statistical parametric speech synthesis," in Proc. ICASSP, 2015, pp. 4215–4219.
[21] R. Maia, M. Akamine, and M. Gales, "Complex cepstrum as phase information in statistical parametric speech synthesis," in Proc. ICASSP. IEEE, 2012, pp. 4581–4584.
[22] R. Smits and B. Yegnanarayana, "Determination of instants of significant excitation in speech using group delay function," IEEE Transactions on Speech and Audio Processing, vol. 3, no. 5, pp. 325–333, 1995.
[23] Z. Ling, L. Deng, and D. Yu, "Modeling spectral envelopes using restricted Boltzmann machines and deep belief networks for statistical parametric speech synthesis," IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 10, pp. 2129–2139, 2013.
[24] P. Taylor, Text-to-Speech Synthesis, United Kingdom: University of Cambridge, 2007.
[25] T. Drugman and T. Dutoit, "Glottal closure and opening instant detection from speech signals," in Proc. Interspeech, 2009, pp. 2891–2894.
[26] T. Drugman, M. Thomas, J. Gudnason, P. Naylor, and T. Dutoit, "Detection of glottal closure instants from speech signals: a quantitative review," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 3, pp. 994–1006, 2012.
[27] H. Kawahara, I. Masuda-Katsuse, and A. De Cheveigne, "Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds," Speech Communication, vol. 27, no. 3, pp. 187–207, 1999.
[28] A. Graves, Supervised Sequence Labelling with Recurrent Neural Networks, Springer, 2012.
[29] A. Graves, A. Mohamed, and G. Hinton, "Speech recognition with deep recurrent neural networks," in Proc. ICASSP. IEEE, 2013, pp. 6645–6649.
[30] F. K. Soong and B.-H. Juang, "Line spectrum pair (LSP) and speech data compression," in Proc. ICASSP. IEEE, 1984, pp. 37–40.
[31] F. Itakura, "Line spectrum representation of linear predictor coefficients of speech signals," The Journal of the Acoustical Society of America, vol. 57, p. S35, 1975.
[32] H. Kawahara, M. Morise, T. Takahashi, R. Nisimura, T. Irino, and H. Banno, "TANDEM-STRAIGHT: A temporally stable power spectral representation for periodic signals and applications to interference-free spectrum, F0, and aperiodicity estimation," in Proc. ICASSP, 2008, pp. 3933–3936.
[33] D. Erro, I. Sainz, E. Navas, and I. Hernaez, "Harmonics plus noise model based vocoder for statistical parametric speech synthesis," IEEE Journal of Selected Topics in Signal Processing, vol. 8, no. 2, pp. 184–194, 2014.