Fast Griffin Lim based Waveform Generation Strategy for Text-to-Speech Synthesis
Ankit Sharma (a) · Puneet Kumar (a) · Vikas Maddukuri (b*) · Nagasai Madamshetti (b) · Kishore KG (b) · Sahit Sai Sriram Kavuru (b) · Balasubramanian Raman (a) · Partha Pratim Roy (a)

(a) Computer Science and Engg. Dept., Indian Institute of Technology, Roorkee, India, 247667
(b) Electronics and Comm. Engg. Dept., Indian Institute of Technology, Roorkee, India, 247667
* Corresponding author, E-mail: [email protected]
Abstract
The performance of text-to-speech (TTS) systems heavily depends on spectrogram to waveform generation, also known as the speech reconstruction phase. The time required for the same is known as synthesis delay. In this paper, an approach to reduce speech synthesis delay has been proposed. It aims to enhance TTS systems for real-time applications such as digital assistants, mobile phones, embedded devices, etc. The proposed approach applies the Fast Griffin Lim Algorithm (FGLA) instead of the Griffin Lim algorithm (GLA) as the vocoder in the speech synthesis phase. GLA and FGLA are both iterative, but the convergence rate of FGLA is faster than that of GLA. The proposed approach is tested on the LJSpeech, Blizzard and Tatoeba datasets, and the results for FGLA are compared against GLA and a neural Generative Adversarial Network (GAN) based vocoder. The performance is evaluated based on synthesis delay and speech quality. A 36.58% reduction in speech synthesis delay has been observed. The quality of the output speech has improved, which is advocated by higher Mean Opinion Scores (MOS) and faster convergence with FGLA as opposed to GLA.
Keywords
Tacotron · Vocoder · Text to Speech Synthesis Delay · Dilated Convolutional Neural Network
1 Introduction

The conclusive step in a text-to-speech (TTS) system is the generation of speech from the spectrogram representation of the signal. This process is known as waveform reconstruction, while the generation of the intermediate signal from input text is called the construction process. The waveform generated in the reconstruction process is the time-domain signal obtained from its intermediate spectrogram. The overall performance of a TTS system depends on the waveform processing involved in the reconstruction phase [1]. The main challenge in TTS systems is to optimize the waveform processing time while maintaining or improving the quality of the generated speech [2].

TTS systems have emerged as valuable tools for day-to-day applications such as digital assistants, mobile phones, embedded devices, etc. Most of these devices have limited computational capacity and they are sometimes used in offline mode. Reducing the speech synthesis delay for them could be very useful in real-life applications such as human-computer interfaces, navigation systems, telecommunication and multimedia, aids to physically challenged people, daily appliances like TVs, washing machines, etc. [3]. For such applications, TTS systems are expected to have a quick response time and generate speech with good quality. Hence, it becomes even more important to speed up spectrogram to speech reconstruction for real-time applications. With the aim to make TTS systems more suitable for real-time applications, it is important to improve their response time while retaining the quality of the synthesized speech.

Many software and hardware-based techniques have been suggested in the past for waveform optimization during speech synthesis [4, 5]. Most of the traditionally used techniques were based on software components such as concatenative speech synthesis, parametric speech synthesis, etc. The current computing trend has shifted towards deep learning due to the availability of hardware resources and training data. State-of-the-art TTS systems have also started leveraging deep neural network (DNN) based techniques for speech synthesis [6]. The performance of text-to-speech systems has significantly improved, especially after the introduction of end-to-end neural waveform generation methods [7, 8]. In spite of the significant performance boost, even end-to-end neural waveform generation approaches suffer from a sluggish speech reconstruction process. Therefore, there is a need to look for more efficient waveform optimization approaches to enhance the speed and quality of machine synthesized speech.

There are three stages in the text-to-speech process: text analysis, linguistic analysis and waveform generation. Traditional TTS systems are based on complex multi-stage hand-engineered pipelines. The present state-of-the-art is end-to-end neural speech synthesis, which puts together these stages of the TTS process into a single layered pipeline through the use of DNNs. All three phases of TTS take place without human intervention for acoustic feature crafting.
However, the hyper-parameters and configuration settings to cater to a specific stage of the TTS can be set up-front. In that context, the implementation settings for waveform generation can be set in the form of an appropriate choice of the reconstruction algorithm. The Griffin-Lim algorithm (GLA) is the most predominantly used reconstruction algorithm for speech synthesis [9].

GLA is an iterative algorithm that tries to produce a signal from the spectrogram without any information about the phase. However, GLA needs many iterations and the perceptual quality of the output speech is not always very good [10]. An optimized version of GLA is available in the literature, known as the Fast Griffin Lim Algorithm (FGLA) [11]. FGLA naturally requires fewer iterations to construct the phase from the spectrogram representation for general signal processing applications. However, it has not been applied and tested for speech synthesis. In this paper, we have applied FGLA to the speech synthesis process of neural TTS systems. We have formulated an experiment to optimize the waveform processing of linear spectrograms in the Tacotron TTS system. The FGLA based reconstruction strategy has been applied to reduce the speech synthesis delay, and observations have been made in the context of the quality of the synthesized speech and the number of iterations required for the convergence of the reconstruction algorithm.

The proposed speech synthesis system generates the speech from the magnitude spectral envelope. We have conducted a Mean Opinion Score (MOS) study to test the quality of audio produced by FGLA with a smaller number of iterations against GLA with a greater number of iterations, with an optimal number of training steps. The experiments have been conducted on the LJSpeech, Blizzard and Tatoeba datasets. They resulted in a 36.58% reduction in speech synthesis time. The results have reflected higher quality of the output speech in terms of improved MOS. The number of training steps and iterations were determined by experimental observations. The convergence patterns of the Fourier transform plots of the resultant waveforms are found to be in line with the choice of the number of training iterations.

1.1 Contribution

The major contributions of the current research work are:

– An FGLA based method has been proposed to reconstruct .wav speech files from linear spectrograms. In TTS applications, reconstruction of a waveform from the spectrogram plays an important role because synthesis time is equivalent to the waiting time for application users. Users expect the speech output promptly. The proposed method has resulted in reduced synthesis time, which is likely to enhance the experience of TTS application users.
– The quality of the synthesized speech has been maintained while reducing the synthesis time. A market-based application cannot compromise on the quality of the synthesized speech. Speech quality also depends on the trained model. Hence, the model is trained up to an optimal number of steps and the speech quality checking process has been carried out on three datasets. On all three datasets, FGLA based speech reconstruction produced better quality speech than GLA based reconstruction.
– TTS models have been trained on the LJSpeech, Tatoeba and Blizzard datasets and the waveform reconstruction has been carried out for GLA, FGLA and the GAN based vocoder. The optimal number of training steps and iterations have been determined experimentally.
The .wav files generated by the TTS models have been evaluated based on the quality of the output speech and the synthesis time. The speech quality has been analyzed in terms of Mean Opinion Score (MOS), and the synthesis delay has been analyzed by measuring the time needed for the TTS model to synthesize the output speech.

1.2 Organization

The rest of the paper is organized as follows. Existing work on waveform processing for TTS systems has been surveyed in Section 2. Section 3 formulates the problem statement. The details of the proposed methodology have been outlined in Section 4. Section 5.1 presents the experimental setup. Analysis of the observed results has been presented in Section 5.2. Finally, Section 6 concludes the paper and highlights the scope for future research.
2 Related Work

In recent years, text-to-speech processing has witnessed significant improvements. Traditionally, concatenative and parametric speech synthesis methods have been used for the task of text-to-speech conversion. In the last couple of years, neural TTS systems have provided a substantial performance boost in the quality of machine synthesized speech. A review of various research attempts in the context of the aforementioned methods, along with the waveform optimization strategies followed by them, is provided in the following sections, and their summary is presented in Table 1.

2.1 Various Speech Synthesis Methods
Concatenative models have dominated the speech synthesis process since the 1970s. They are based on searching and collecting small samples of speech components from a voice database [12]. The voice quality of the speech synthesized by them is more natural; however, they require a huge voice database. There are two types of costs associated with them: a) searching cost and b) concatenation cost. Searching cost deals with searching specific voice segments corresponding to the required broken portion, and concatenation cost is related to joining these segments. As pointed out by G. Coorman et al. [13], one of the problems with these models is that they do not produce good quality speech if a suitable match in the database corresponding to the required segment is not obtained. Another challenge with concatenative models is that they require the complete dataset for generating a new set of voices. It is difficult to select the target unit from the voice database in order to minimize the difference between the required and selected samples [14].
Parametric speech synthesis is another widely used process to generate speech from text. These TTS models use statistical models derived from the data [15]. They follow a parameter generation approach instead of fetching speech samples from a database. Hidden Markov Model (HMM) based TTS architectures are among the most famous parametric models. The primary step involved in them is to find a parametric form of speech, including spectral and excitation parameters, from the voice corpus and then model them by using a set of generative models [16]. In the context of using HMM-based TTS systems, J. Yamagishi et al. [17] predicted the parameters and then synthesized the speech for a given text. The benefit of this approach is that it does not require the complete dataset at synthesis time. T. Masuko et al. [18] succeeded in changing the speaker's voice easily using parametric speech synthesis, while S. Lee et al. [19] tried a statistical parametric method for enhancement of the speech. However, a disadvantage associated with parametric methods is that the voice quality of the synthesized speech is not as natural as in the case of concatenative speech synthesis.
Concatenative and parametric TTS systems have practical difficulties; for example, their different components need to be modeled and processed separately [20]. Deep learning-based end-to-end neural TTS systems, which are the current state-of-the-art, solve this problem. They put together the intermediate stages of the TTS process into a single, layered pipeline through the use of DNNs, which is carried out without human intervention for acoustic feature engineering. Recent neural TTS systems include Wavenet [21], Char2Wav [20], Tacotron [22], Tacotron 2 [23], DeepVoice [24], DeepVoice 2 [25], DeepVoice 3 [26] and VoiceLoop [27]. Wavenet is based on a generative model that predicts samples based on a probability distribution. Tacotron [22] produces spectrograms from the text and then produces the corresponding waveform using a vocoder. However, waveform generation has been a time-consuming process for the initial TTS systems, and the speech output was not human-like. The use of improved spectrogram methods such as the mel-spectrogram and better vocoders such as WORLD, GLA, etc. has helped to solve these problems [28].

Char2Wav predicts the parameters of the WORLD vocoder and uses a SampleRNN conditioned upon the WORLD parameters for waveform generation. WORLD [29], on the other hand, consists of three analysis algorithms for determining the fundamental frequency (F0), spectral envelope and aperiodic parameters. Tacotron is another end-to-end model that uses seq-to-seq learning to map the text to a spectrogram as intermediate data, from which audio is then generated using a vocoder. It uses Griffin Lim as the vocoder, which generates audio waveforms from the linear spectrogram. It takes a linear scale magnitude spectrogram and the number of iterations as input and produces the corresponding waveform. It was observed that GLA in Tacotron converges in about 60 iterations [22]. Tacotron incorporates GLA for phase estimation, followed by an inverse Short-Time Fourier Transform (STFT) for waveform reconstruction. Tacotron 2 is an entirely neural network-based approach for speech synthesis which combines the seq-to-seq model feature used in Tacotron, generates the mel-spectrogram, and performs speech synthesis using a modified Wavenet vocoder.

In DeepVoice [24], the Wavenet architecture is modified and a fast synthesis system is developed for the audio synthesis stage. DeepVoice 2 [25] is a multi-speaker model that has taken the Tacotron architecture as a base and replaced the Griffin Lim algorithm with a Wavenet based vocoder. DeepVoice 3 [26] is a fully convolutional attention-based neural end-to-end TTS system. Its architecture is capable of transforming several textual features into vocoder parameters such as mel-spectrograms, linear scale log spectrograms, spectral envelope, fundamental frequency (F0), aperiodicity parameters, etc. These vocoder features are given as input to the waveform synthesis models. It uses three different vocoders: WORLD, Griffin-Lim and Wavenet. Both WORLD and Griffin Lim use the linear spectrogram, whereas the modified Wavenet in Tacotron 2 uses the mel-spectrogram for waveform synthesis. VoiceLoop [27] is an attention-based neural text to speech system referenced by a working memory model called a phonological loop. It is capable of producing voices that are sampled in the wild. VoiceLoop replaces convolutional RNNs with a memory buffer.
In the context of using neural vocoders, K. Oyamada et al. [30] focused on a DNN based architecture to recover the phase information from the magnitude spectrogram. K. Kumar et al. [31] proposed MelGAN, which is a fully convolutional, non-autoregressive vocoder. It generalized well to unseen speakers and showed a significant speed-up in speech construction from mel-spectrograms. In a similar work, WaveGlow [32] was proposed by replacing the vocoder part of Wavenet with a deep neural architecture. It performed efficiently on large utterances, but its performance degraded while converting small text samples into speech. TTS systems with neural vocoders such as WaveGlow have to repetitively go through serial steps of waveform construction, which causes them to take more time while constructing small sentences. As observed by K. Oyamada et al. [30], on a CPU, some of the neural vocoders took three times longer than GLA for the speech synthesis. Their training time is as high as a few weeks, and although their inference is fast on a GPU, the large size of the trained model makes their application very difficult on real-time devices having a CPU with constrained memory [31, 32, 33].

2.2 Waveform Processing in TTS Systems

The final speech generated by the TTS systems is in the form of a waveform, while the intermediate representation is called a spectrogram. The raw text input is converted into a sampled embedding vector in the pre-processing phase, from which the intermediate frequency-time representation, that is, the spectrogram, is generated [34]. This is called the 'Construction Phase'. Then, the waveform is generated in the 'Reconstruction Phase' using vocoders. The efficiency of the reconstruction algorithm majorly determines the overall performance of the TTS system. GLA has been the most predominantly used reconstruction algorithm for speech synthesis [10]. A time-domain signal can be reconstructed from its amplitude spectrogram using the information about its phase. When no information is available about the phase and only the amplitude spectrogram is available, GLA is particularly suited for phase reconstruction. However, GLA needs many iterations and the perceptual quality of the output speech is not always very good [35].

There have been a number of attempts to optimize the waveform processing for speech reconstruction. For instance, Sercan et al. [35] implemented transposed convolution layers along with non-linear interpolation, which resulted in better utilization of modern multi-core processors than a simple iterative strategy. In another work, Y. Fisher [36] used multi-scale context aggregation by dilated convolutions, which resulted in a simplified network along with increased state-of-the-art accuracy. In the context of waveform processing based applications, Z. Cheng and J. Shen [37] used the properties of the audio waveforms to recommend music based on the venue and surroundings of the user. As an attempt to enhance the vocoder module, M. Morise et al. [29] proposed a new vocoder, WORLD, for feature extraction and waveform synthesis. Y. Masuyama [38] proposed an enhanced phase reconstruction technique by combining a DNN with GLA to build GLA-inspired neural network layers for waveform generation.

Some of the distinctly related work in the area of signal processing maps to the utilization of Fourier transformation techniques such as the Gabor Transform [39, 40]. It is a special form of Fourier transform that is used to determine the frequency and phase content of signals represented in the form of spectrograms.
In this direction, a real-time fast Fourier transform algorithm was proposed by H. Sorensen et al. [41]. Successful research has also been carried out to achieve phase recovery with a smaller number of iterations as compared to GLA [10]. It gives a hint to look for alternative reconstruction algorithms requiring fewer iterations while maintaining the quality of the synthesized speech.

There have been various attempts to optimize the waveform processing in the context of GLA. However, better alternatives for waveform reconstruction have not been explored to their best potential. An optimized version of GLA, known as the Fast Griffin Lim Algorithm (FGLA), is available in the literature [11]. It requires fewer iterations to construct the phase from the spectrogram representation for general signal processing applications. However, it has not been applied and tested for speech synthesis applications. In this paper, an FGLA based waveform generation method has been proposed with the aim to reduce synthesis delay. It aims to overcome the challenges faced by concatenative and parametric TTS systems by getting rid of the need for human intervention for acoustic feature engineering. Some of the challenges of using neural vocoders for real-time TTS applications, such as slow speech synthesis on CPU, larger model size, complex architecture, etc., have also been considered and addressed.
3 Problem Formulation

The major objective of the proposed research work is to optimize the waveform generation process during speech synthesis by TTS systems. The speech synthesis time should be reduced without changing the quality of the output speech. The training phase for a TTS device is performed once in a given system until there is a change in the algorithm. The synthesis phase is executed on real-time speech synthesis devices having low computational power. This phase is repeated every time a text is converted into speech. System resources at the training stage are generally of high computing power. However, most real-time speech synthesis systems have limited computing capabilities. Hence, the synthesis algorithm should take a small amount of memory to make the speech synthesis more suitable for real-time applications. The problem statement is subjected to the following constraints:

i) The average time taken (T) to convert the corresponding spectrogram to the waveform should be minimized. T corresponds to the synthesis delay for n samples.

ii) The number of iterations required (itr) for the output waveform to converge should be minimized. That is, their plots should reach an optimal state as soon as possible.
iii) The quality of the synthesized speech (qual) should be maintained. The reduction in speech synthesis time should not affect it.
iv) The speech synthesis process should result in optimal resource utilization (util). It should cater to the limited computational resources of real-time TTS systems.

The aforementioned constraints can be modeled mathematically as shown in Eq. 1.

Table 1: Summary of literature review.

| TTS Model | Basic Method | Properties | Text Pre-processing | Waveform Pre-processing | Spectrogram Type | Construction | Reconstruction |
| Concatenative | Searches the audio segment from the speech database that is most relevant to the text | Simple to implement. Results in good quality speech | - | - | - | - | - |
| Parametric | Keeps track of a parametric form of speech including spectral parameters | Speaker voice can be changed with minimum effort | - | - | - | - | - |
| Wavenet [21] | Uses dilated regressive CNN to predict the present sample from past samples | High quality. Can generate multi-speaker voice | Yes | No | Mel-spectrogram | - | - |
| Char2Wav [20] | Uses bidirectional RNN to produce a waveform from textual content only | Expert linguistic knowledge is not required | No | Yes | Linear-spectrogram | - | SampleRNN |
| Tacotron [22] | Encoder-decoder architecture based on RNNs | Fully end-to-end; robust and fast processing | Yes | Approximate | Linear-spectrogram | CBHG | GLA |
| Tacotron 2 [23] | Encoder-decoder architecture based on RNNs | Better speech quality and smaller model size than Tacotron | Yes | Approximate | Mel-spectrogram | Convolution based | Modified Wavenet |
| DeepVoice [24] | Uses five different DNNs for TTS; needs fewer parameters and is faster than Wavenet | Faster processing | No | No | Linear-spectrogram | CBHG | GLA |
| DeepVoice 2 [25] | Wavenet based spectrogram to audio generation | Can generate multi-speaker voices with less training | No | Approximate | Linear-spectrogram | Attention based encoder | GLA, Wavenet |
| VoiceLoop [27] | Uses shifting buffer memory instead of RNNs | Robust; produces fewer errors | Yes | Exact | Linear-spectrogram | Buffer shallow network | WORLD |

Here, GLA: Griffin Lim Algorithm; CBHG: (1-D convolution bank + highway network + bidirectional GRU).
Subject to constraints:

    minimize (T, itr)
    maximize (qual, util)                                          (1)

Where:
    n: number of inputs.
    itr: number of iterations required for the output waveform to converge.
    qual: quality of the synthesized speech.
    util: utilization of the computing capacity of the TTS platform.
    T = {t_1, t_2, t_3, ..., t_n}: times to convert coefficients into waveforms; t_1, t_2, ..., t_n are the times taken to convert the spectrograms to waveforms.
    S = {u_1, u_2, u_3, ..., u_n}: set of spectrograms generated by the TTS system; u_1, u_2, ..., u_n are the coefficient matrices of the spectrograms.
    W = {v_1, v_2, v_3, ..., v_n}: set of corresponding waveforms generated by a vocoder; v_1, v_2, ..., v_n are the waveforms produced from u_1, u_2, ..., u_n.

4 Proposed Methodology

In general, a TTS system contains three phases: a) text analysis (text to words), b) linguistic analysis (words to phonemes) and c) waveform generation (phonemes to sound). The first and second phases are carried out during the training phase. The TTS model is trained on text data and an intermediate spectrogram is generated from the trained model for the given input text. The third phase takes place during synthesis, when the waveform is synthesized from this spectrogram. The proposed method aims to optimize the reconstruction of the original speech signal from the intermediate spectrogram. Generally, a magnitude spectrogram does not contain the phase information. A reconstruction algorithm such as GLA iteratively recovers that information. GLA is an iterative algorithm that takes a high number of steps to recover the phase information. The proposed approach applies an optimized alternative, FGLA, for phase reconstruction in the waveform generation phase.

The waveform analysis can be performed easily in the frequency domain [39]. The Fourier transform is the most widely used transformation that converts a time domain signal into a frequency domain signal. That is why the proposed methodology utilizes the Fourier transform and its variants such as the Short-Time Fourier Transform (STFT), Discrete Fourier Transform (DFT), Gabor Transform, etc. during various steps of waveform generation. The STFT is a series of Fourier transforms of subsets of the signal. When the frequency components of a signal vary with time, the STFT is used to retrieve the time-localized frequency information. The Gabor Transform is a special kind of STFT representation which is used to discover the phase information and sinusoidal frequency of the subsets of a time varying signal. The time-frequency analysis is carried out by first multiplying the function by a Gaussian function and then transforming it with a Fourier transform. During the synthesis phase, the input is a spectrogram and the output is a waveform. The STFT can be represented as a matrix of coefficients where the column index represents time and the row index represents the frequency of the respective DFT coefficient. The magnitude of each coefficient at the respective index is computed, and this matrix can be treated as an image known as the spectrogram of the signal. The choice of the base implementation has been explained in Section 4.1. The proposed methodology is described in the following sections. It is visually represented in Fig. 1 and mathematically depicted in Algorithm 1.
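The construction of this intermediate representation can be illustrated with a short sketch. The snippet below computes a linear magnitude spectrogram via the STFT; the file name and the FFT size of 1024 are hypothetical placeholders, while the 20000 Hz sampling rate and the 12.5 ms frame shift (250 samples) follow the hyper-parameters listed later in Table 2.

```python
import numpy as np
import librosa

# Minimal sketch: derive the |STFT| matrix (the linear magnitude
# spectrogram) that the reconstruction phase will later invert.
y, sr = librosa.load("sample.wav", sr=20000)        # sampling rate as in Table 2
stft = librosa.stft(y, n_fft=1024, hop_length=250)  # 12.5 ms frame shift at 20 kHz
spectrogram = np.abs(stft)                          # rows: frequency, columns: time
```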
4.1 Rationale Behind Selecting the Fast Griffin Lim Algorithm with Tacotron

The research work presented in this paper primarily aims to reduce the speech synthesis delay in text-to-speech systems. A fundamental experiment to apply FGLA based waveform generation from the linear-spectrogram has been formulated. The most commonly used waveform reconstruction algorithm for speech synthesis is GLA. The most acclaimed TTS system that uses the linear-spectrogram as the intermediate representation is Tacotron, developed by Google [22]. GLA was used as the vocoder in Tacotron; it produces an approximate waveform corresponding to the input spectrogram, not the exact waveform. FGLA has been chosen instead of GLA because FGLA is known to take fewer iterations to recover the phase from the spectrogram [11]. Other vocoders such as Wavenet [21], WORLD [29], etc. are available in the literature as potential choices, but they process mel-spectrograms, while the current research targets linear-spectrograms during the intermediate step of the speech synthesis process. Thus, FGLA with Tacotron emerged as the most suitable choice for the experiment under consideration.

4.2 Strategy for Waveform Optimization

The STFT is used for the observable comparative analysis in the frequency domain [9]. FGLA attempts to reconstruct the speech signal from the intermediate spectrogram of the signal. For that purpose, it finds the real signal x* ∈ R^L from a given set of spectral magnitude coefficients s, such that the magnitude of the STFT of x* is as close as possible to the input. This helps in a more accurate reconstruction of the signal. An arbitrary set of complex numbers cannot be chosen as STFT coefficients, i.e., only a certain set of complex numbers corresponds to the STFT of a waveform. In the same way, the input that we get may not be a valid spectrogram. A valid spectrogram S would have the magnitude of the coefficients as close as possible to the input.

The relevant terms have been defined in Section 3. Two important concepts that are utilized by the proposed method are the Gabor Transform and Projection. G*x is the inverse Gabor Transform of x. It is a special case of the STFT that is helpful in extracting the feature patterns from the spectrogram representation [42]. It helps in finding the time needed to convert the spectrogram into the waveform [43]. The three phases of the proposed method are described as follows.

Initialization

(i) First, we initialize the coefficient matrix c, which has the same dimensions as the input spectrogram and contains elements of set C2.
Fig. 1: Flow diagram depicting the proposed methodology. Here, green colored boxes show various stages of the methodology; the red box shows the loop decision and the yellow box shows the loop iteration. Theoretical analysis of the complexity and the effect of the convergence rate α is discussed in Section 4.2. α is an important hyper-parameter impacting the complexity and causing the speed-up in waveform processing. Its appropriate value is determined in Section 5.1.1.

(ii) The magnitude of every element is made equal to the element of the input matrix in the corresponding position. That is, projecting onto set C2.
(iii) Then we initialize another matrix t of the same dimensions.
(iv) The projection of the modified coefficients of the transform onto set C2 is defined as follows:

    P_C2(c) = s · e^(i∠c)                                          (2)

The above matrix is projected onto set C2, followed by projection onto set C1 to get t. Here, t_i and c_i denote the matrices after i iterations.

    c = s · e^(i∠c),  t = P_C1(P_C2(c))                            (3)

(v) Here, C1 is the set of possible coefficients of the STFT, and C2 is the set of complex numbers whose magnitude is equal to the magnitude spectrum coefficients. Gx is the Gabor Transform of x.

    C1 = {c : ∃ x ∈ R^L such that c = Gx}
    C2 = {c ∈ C^(M×N) : |c| = s}                                   (4)

Iteration

(i) The magnitude of the elements of the coefficient matrix is made equal to the input matrix, keeping the phase unchanged.
(ii) The inverse Gabor Transform is then applied on the resultant coefficients, followed by the Gabor Transform. The projection of the modified coefficients onto set C1 is defined as follows. This is the influential step that makes FGLA faster than GLA [11].

    P_C1(c) = G G* c,  t_n = P_C1(P_C2(c_(n-1)))                   (5)

(iii) The projection onto set C1 in the current step is subtracted from the projection in the previous step and multiplied by a factor, the convergence rate α. Choosing α close to one, but not exactly one, yields better results. This product is added to the projection in the current step, and the initialized coefficient is updated to this value:

    c_n = t_n + α (t_n − t_(n-1))                                  (6)

(iv) The above steps are repeated iteratively, and in each step the coefficients converge closer to a real signal whose magnitude spectrum is approximately equal to the input spectrum.

Finalization

(i) The inverse Gabor Transform is applied on the final coefficients to get the waveform:

    x* = G* c_n                                                    (7)

The aforementioned phases of the proposed methodology are depicted in Algorithm 1, and the theoretical analysis of its complexity is discussed below.
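To make the two projections concrete, here is a minimal numpy/librosa sketch of P_C1 and P_C2, under the assumption that the STFT/iSTFT pair stands in for the Gabor analysis/synthesis operators G and G*; the FFT size and hop length are illustrative placeholders.

```python
import numpy as np
import librosa

def project_C2(c, s):
    """Eq. 2: keep the phase of c, impose the target magnitude s."""
    return s * np.exp(1j * np.angle(c))

def project_C1(c, n_fft=1024, hop_length=250):
    """Eq. 5: G G* c -- inverse transform followed by the forward
    transform projects c onto the set of consistent spectrograms."""
    x = librosa.istft(c, hop_length=hop_length)
    return librosa.stft(x, n_fft=n_fft, hop_length=hop_length)
```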
Complexity Analysis

As shown in Algorithm 1, FGLA has a single iteration loop, and as per the GLA paper [9], GLA also involves a single loop. Hence, the theoretical time complexities for both FGLA and GLA are O(n). Experiments revealed that FGLA could produce a waveform of the same quality with 30 iterations as GLA with 60 iterations. The smaller number of iterations for FGLA is a determining factor in the reduction of the synthesis delay. The convergence rate α is another important hyper-parameter impacting the complexity and causing the speed-up in the waveform processing. Its value ranges from 0 to 1. For α = 0, FGLA behaves as GLA. As its value increases, the speed-up also increases up to a limit with faster convergence, and then it starts decreasing. As discussed in Section 5.1.1, the appropriate value of α has been determined as 0.2.
Algorithm 1: FGLA based Waveform Optimization for Text-to-Speech

Input n: Number of iterations.
Input k: Number of input variables.
Input C1: Set of possible coefficients of the STFT.
Input C2: Set of complex numbers with magnitude the same as the spectrum coefficients.
Define s: Spectral magnitude coefficients.
Define c: Coefficient matrix.
Define t: Matrix of the same dimensions as c.
Define α: Convergence rate.
Define x*: Final waveform.
Define G*: Inverse Gabor Transform.
Input S = {u_1, u_2, ..., u_k}: set of spectrograms generated by the TTS model; u_1, u_2, ..., u_k are the coefficient matrices of the spectrograms.
Output W = {v_1, v_2, ..., v_k}: set of waveforms generated by the vocoder; v_1, v_2, ..., v_k are the produced waveforms.
Define T = {t_1, t_2, ..., t_k}: times to convert the spectrograms to waveforms.

Procedure WaveOpti
1:  Stage 1: Initialisation
2:  // Projection of modified transform coefficients onto set C2
3:  P_C2(c) = s · e^(i∠c)
4:  // Projection of the above matrix onto set C2 and then C1
5:  c = s · e^(i∠c)
6:  t = P_C1(P_C2(c))
7:  Stage 2: Iteration
8:  // Projection of modified transform coefficients onto set C1
9:  P_C1(c) = G G* c
10: // Update t and c in each iteration
11: for i in n do
12:     t_i = P_C1(P_C2(c_(i-1)))
13:     c_i = t_i + α (t_i − t_(i-1))
14: end for
15: Stage 3: Waveform Generation
16: // Inverse Gabor Transform of the final coefficients
17: x* = G* c_n
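A compact Python sketch of Algorithm 1 is given below, assuming the STFT/iSTFT pair (here via librosa) as a stand-in for the Gabor analysis/synthesis operators; the function name, FFT size and hop length are illustrative choices rather than the authors' implementation.

```python
import numpy as np
import librosa

def fgla(mag, n_iter=30, alpha=0.2, n_fft=1024, hop_length=250):
    """Sketch of FGLA [11]: reconstruct a waveform from a linear
    magnitude spectrogram `mag` of shape (1 + n_fft // 2, frames)."""
    # Stage 1: initialise coefficients with the target magnitude (zero phase).
    c = mag.astype(np.complex64)
    t_prev = c.copy()
    for _ in range(n_iter):
        # P_C2 (Eq. 2): impose the target magnitude, keep the current phase.
        proj = mag * np.exp(1j * np.angle(c))
        # P_C1 (Eq. 5): inverse then forward transform (G G*) for consistency.
        x = librosa.istft(proj, hop_length=hop_length)
        t = librosa.stft(x, n_fft=n_fft, hop_length=hop_length)
        # Eq. 6: accelerated update c_n = t_n + alpha * (t_n - t_{n-1}).
        c = t + alpha * (t - t_prev)
        t_prev = t
    # Stage 3 (Eq. 7): waveform from the final coefficients.
    return librosa.istft(c, hop_length=hop_length)
```

For reference, librosa ships a built-in librosa.griffinlim whose momentum argument applies the same acceleration; setting momentum to 0 recovers plain GLA.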
5 Experiments

This section discusses and evaluates the experiments to apply FGLA based waveform generation from the linear-spectrogram.

5.1 Experimental Set-up

This section demonstrates the experimental implementation and analyses the results. A fundamental experiment to optimize the waveform processing of linear spectrograms in the Tacotron TTS system has been formulated. The model training is done on an Nvidia Tesla K80 GPU machine with 24 GB RAM and 4992 CUDA cores. Text to speech synthesis is done on an Intel(R) Core(TM) i7-7700, 4.2 GHz CPU with 16 GB RAM and a 64-bit Windows 10 OS machine. The machine learning libraries used in this implementation are Numpy, Tensorflow and Keras. The choice of parameters and datasets is detailed in the following sections.
Table 2: Hyper-parameter choices.

| Parameter | Value |
| Convergence rate, α | 0.2 |
| Sampling rate | 20000 |
| Frame shift | 12.5 ms |
| Learning rate | 0.0002 |

The questions that the experimental set-up tries to answer are: "What is the ideal number of training steps for TTS model training?"; "How to determine the appropriate number of iterations?"; "How to evaluate the speech synthesized by the TTS system in terms of quality, speed and convergence?" Suitable numbers of training steps and iterations are experimentally determined in Sections 5.1.3 and 5.1.4. Then, as per Table 7, use-case sentences are formulated according to various complexity levels. In Section 5.2, speech synthesis for these sentences has been evaluated in terms of speech quality, synthesis delay and convergence.

5.1.1 Determination of the Convergence Rate

During the synthesis, the convergence rate α was experimentally chosen as 0.2. We experimented with α values starting from 0.1 with a learning rate of 0.0002, performed the iterations for spectrogram to waveform construction, and analysed the construction time. The most suitable value of α corresponding to the optimal construction time emerged as 0.2.
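A minimal sketch of such a sweep is shown below; it assumes the hypothetical fgla function from the sketch after Algorithm 1, and a synthetic magnitude spectrogram stands in for a Tacotron output so the snippet runs in isolation.

```python
import time
import numpy as np

# Synthetic magnitude spectrogram (513 frequency bins x 200 frames).
mag = np.abs(np.random.randn(513, 200)).astype(np.float32)

# Time the spectrogram-to-waveform construction for several alpha values;
# the grid is illustrative, the paper's sweep started from 0.1.
for alpha in np.round(np.arange(0.1, 1.0, 0.1), 1):
    start = time.perf_counter()
    fgla(mag, n_iter=30, alpha=alpha)
    print(f"alpha={alpha}: {time.perf_counter() - start:.3f} s")
```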
The basic entities for training the model are the .text and .wav files. The wav file signal has to be sampled for the analysis. Table 2 lists the important parameters for the current analysis and their selected values. Here, 'Sampling rate' denotes the number of samples per second, 'Frame shift' specifies the amount by which the window slides, and 'Learning rate' shows how fast the network learns by adjusting weights. The experimentally determined values for the sampling rate, frame shift and learning rate are 20000, 12.5 ms and 0.0002 respectively. Tacotron has been trained from scratch on the datasets described in Section 5.1.2 for various numbers of iterations, and the speech synthesis time has been observed for both GLA and FGLA.

5.1.2 Datasets

The original Tacotron paper had used the LJSpeech dataset. We have trained and tested the TTS model with the Blizzard and Tatoeba datasets as well. The details of all the datasets used in the implementation are provided in the following sections.
(i) LJ Speech Dataset [44]: A single-speaker, public domain speech dataset containing 13100 audio samples ranging from 1 to 10 seconds. The total duration of the dataset is about 24 hours. Each audio file is a single-channel 16-bit PCM WAV with a sample rate of 22050 Hz. The properties of the LJSpeech dataset are detailed in Table 3.
(ii) Tatoeba Dataset [45]: This audio corpus is a crowdsourced dataset of sentences and translations. It contains a subset of the English sentences of Tatoeba. We have not used the complete dataset; sentences have been filtered out of it.
Table 3: Details of the LJSpeech dataset.

| Parameter | Value |
| Total Clips | 13,100 |
| Total Words | 225,715 |
| Total Characters | 1,308,678 |
| Total Duration | 23:55:17 |
| Mean Clip Duration | 6.57 sec |
| Min Clip Duration | 1.11 sec |
| Max Clip Duration | 10.10 sec |
| Mean Words per Clip | 17.23 |
| Distinct Words | 13,821 |
Table 4: Details of the Blizzard dataset.

| Audiobook Name | Total Audio Length |
| Tramp Abroad | 15:46:01 |
| Life on the Mississippi | 14:47:27 |
| The Man That Corrupted Hadleyburg and Other Stories | 13:04:00 |
(iii) Blizzard Dataset [46]: It is available under the Creative Commons Attribution Share-Alike license. It contains the samples of three audiobooks read by a single American English narrator. The book names and recording times are given in Table 4. The audio file format of the Blizzard corpus is 16-bit mono WAV with a sampling frequency of 44100 Hz.
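As an illustration of how such corpora feed the training pipeline, the sketch below pairs each LJSpeech clip with its transcript; the directory layout follows the public LJSpeech-1.1 release, whose metadata.csv is a pipe-separated index of clip id, raw transcript and normalized transcript.

```python
import csv

# Collect (wav path, normalized transcript) training pairs from LJSpeech.
pairs = []
with open("LJSpeech-1.1/metadata.csv", encoding="utf-8") as f:
    reader = csv.reader(f, delimiter="|", quoting=csv.QUOTE_NONE)
    for clip_id, _raw, normalized in reader:
        pairs.append((f"LJSpeech-1.1/wavs/{clip_id}.wav", normalized))
```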
5.1.3 Determination of the Appropriate Number of Training Steps

The quality of the synthesized speech also depends on the number of training steps. However, training the model for more steps requires more computation and time. Hence, it is important to determine the optimal number of steps. We have trained the Tacotron model up to 400K steps and observed the MOS values of the speech synthesized with it. MOS is a subjective evaluation score that denotes the quality of a speech utterance [47]. The MOS scores corresponding to various checkpoints are shown in Table 5. It has been observed that the MOS values improve rapidly up to 250k steps, but their convergence slows down significantly after that. Hence, the TTS model has been trained for at least 250k steps for the final training of each use-case. This analysis is performed with the same configuration as the original Tacotron paper, i.e., using the GLA algorithm with 60 iterations.
5.1.4 Determination of the Appropriate Number of Iterations

FGLA is supposed to converge faster than GLA. However, the suitable value for FGLA's number of iterations needs to be determined effectively. With that aim, we trained Tacotron with both GLA and FGLA for 20, 30 and 60 iterations, and then checked the MOS values of the audio samples synthesized with the models thus trained.
Table 5: Determination of the appropriate no. of steps (MOS).

| Steps | LJSpeech | Tatoeba | Blizzard |
| 40k | 7.0 | 7.1 | 5 |
| 80K | 7.5 | 7.3 | 5.2 |
| 120K | 7.6 | 7.5 | 5.4 |
| 160K | 7.6 | 7.6 | 6.0 |
| 200K | 7.65 | 7.8 | 6.3 |
| 240K | 7.9 | 8 | 6.5 |
| 280K | 8 | 8.1 | 6.7 |
| 320K | 8.2 | 8.25 | 7 |
Table 6: Determination of the appropriate no. of iterations (MOS).

| Dataset | Iterations | GLA | FGLA |
| LJSpeech | 20 | 6.0 | 6.8 |
| | 30 | 7.3 | 8.2 |
| | 60 | 7.6 | 8.2 |
| Tatoeba | 20 | 7.6 | 7.1 |
| | 30 | 7.6 | 8.25 |
| | 60 | 8.1 | 8.25 |
| Blizzard | 20 | 6.0 | 6.2 |
| | 30 | 6.4 | 7.1 |
| | 60 | 6.9 | 7.3 |

These values are illustrated in Table 6. The speech quality for FGLA in terms of MOS scores is observed to be better than that for GLA. Moreover, FGLA is also observed to take a smaller number of iterations to reach the same MOS score as compared to GLA. FGLA with 30 iterations converged to MOS values equivalent to GLA with 60 iterations. Hence, 30 was selected as the appropriate number of iterations for FGLA to be used in Tacotron's training. The correctness of this choice is justified in Section 5.2.
5.1.5 Use-case Sentences

During speech synthesis, the test sentences are chosen according to various complexity levels. Various verbal and lingual combinations in terms of punctuation marks, abbreviations, special characters, exclamation and question marks, etc. have been included to form five use-case sentences of varying lengths. As mentioned earlier, the main objective of the work presented in this paper is to reduce the synthesis delay without affecting the quality. The trained models are tested by synthesizing these sentences, and their synthesis delay and speech quality have been observed. The time taken in the synthesis process is proportional to the length of the text, so the test sentences have variable lengths. Every sentence has been synthesized 10 times and then the average synthesis time has been considered. To make periodic observations, a model checkpoint has been saved after every 1000 training steps. The size of the trained model has been observed to be of the order of 80 MB.
Table 7: List of use-case sentences.

| S.No. | Sentence |
| 1. | He said to him, "Is not your name Ahmed?" |
| 2. | All of a sudden, there was a loud screaming, "Please help me!" |
| 3. | I think I lost my wallet! I can't find it anywhere! Oh, I could just kick myself! |
| 4. | "Sunshine on my shoulders makes me happy, sunshine in my eyes can make me cry." |
| 5. | As the stranger entered the town, he was met by a policeman who asked, "Are you a traveler?" "So it would appear", he replied carelessly. |

The aforementioned use cases are depicted in Table 7 and the detailed analysis is presented in Table 9.

5.2 Result Analysis

The implementation has been carried out considering the number of training steps and iterations determined in the above section. This section presents and verifies the results of speech synthesis using the TTS model thus trained. The results have been evaluated for the five use-case sentences described in Table 7. During the result evaluation, we have checked in parallel the quality of the audio generated for all three corpora at the different training intervals given in Table 5. The Mean Opinion Score (MOS) is calculated based on the wave files generated by synthesis on the trained model. The MOS values for the different datasets at various training steps are shown in Table 5. The results have been analysed based on the quality of the output speech and the synthesis delay. The choice of the number of iterations made in Section 5.1 has also been verified by observing the convergence of the waveform plots for GLA and FGLA.
Quality Analysis

Quality analysis results are shown in Tables 5 and 6. It was observed that after 250k steps of model training, the output speech included prosody features. That made the voice more feasible for real-time speech synthesis. Speech quality also depends on the corpus used for training. The model is trained up to 400K steps for all datasets and the results have been generated. The synthesized speech is expected to be clear to understand and non-robotic in nature: the more human-like the voice is, the higher its MOS value and the easier it is to understand. In Figs. 2, 3 and 4, the X axis shows the duration of model training and the Y axis represents the MOS of the speech as rated by 10 evaluators. The output speech has been evaluated at regular training step intervals for the different datasets. The speech quality was observed to depend mainly on the number of iterations used in the algorithm. The graphs show that in the initial stage the learning progress is very fast, but after 250k training steps it becomes very slow. The observed results in the context of the quality of the output speech are visually illustrated in Figs. 2, 3 and 4.
Accuracy of MOS Determination
To further evaluate the quality of the proposed methodology's results, 100 text samples with known MOS scores are considered. The corresponding speech is synthesized for them using the GLA, FGLA and GAN based vocoders, and the MOS of the synthesized speech is evaluated. If the variation between the MOS of the synthesized speech, i.e. MOS_s, and the known MOS, i.e. MOS_g, is less than the error margin e, then the sample is assumed to be accurately determined. The error margin is taken as 0.45, which is 5% of the MOS of a natural human voice [48]. The calculations are done as per Eq. 8 and the results for the considered samples are summarized in Table 8.

    |MOS_g − MOS_s| < e  =>  Accurately determined.                (8)

Where:
    e: error margin.
    MOS_g: ground truth MOS.
    MOS_s: MOS of synthesized speech.
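A minimal sketch of this accuracy computation, with the hypothetical helper name mos_accuracy:

```python
def mos_accuracy(mos_ground, mos_synth, e=0.45):
    """Eq. 8: a sample is accurately determined when |MOS_g - MOS_s| < e,
    with e = 0.45, i.e. 5% of the MOS of a natural human voice [48]."""
    hits = sum(abs(g - s) < e for g, s in zip(mos_ground, mos_synth))
    return 100.0 * hits / len(mos_ground)   # accuracy as a percentage
```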
Table 8: Accuracy of MOS determination.

| Vocoder | Avg. MOS | Min. MOS | Max. MOS | Accuracy |
| GLA | | | | |
| FGLA | | | | |
| GAN Vocoder | | | | |
The accuracy for FGLA came out to be much better than that of GLA and comparable to the GAN based neural vocoder. It should be noted that the GAN vocoder has been evaluated on a CPU, in line with the goal of optimizing the waveform processing from linear spectrograms for real-time devices with limited processing capabilities.

Synthesis Delay Analysis

'Synthesis Delay' is the time required for the output speech to start getting produced by the TTS system. It has been computed and compared for all the use-case sentences described in Table 7 and the datasets mentioned in Section 5.1.2. The observations have been drawn for GLA with 60 iterations, FGLA with 30 iterations and the GAN based neural vocoder [30]. Every sentence is synthesized 10 times and then the average synthesis delay has been calculated, with the same computational configuration maintained on the testing machines. As our aim is to optimize the waveform processing for CPU based low-memory devices in real-time usage, the waveform reconstruction has been carried out on the CPU for GLA, FGLA and the GAN based vocoder.
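The measurement protocol can be sketched as follows; synthesize stands for any text-to-waveform callable (for instance the trained Tacotron pipeline) and is an assumption of this illustration.

```python
import time

def average_synthesis_delay(synthesize, text, runs=10):
    """Average wall-clock synthesis time in ms over repeated runs,
    mirroring the 10-run averaging protocol described above."""
    total = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        synthesize(text)                      # text-to-waveform call under test
        total += (time.perf_counter() - start) * 1000.0
    return total / runs
```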
Fig. 2: Quality analysis of synthesized speech for the LJSpeech dataset (MOS vs. training steps, 40k to 320k).
Fig. 3: Quality analysis of synthesized speech for the Tatoeba dataset (MOS vs. training steps, 40k to 320k).
Fig. 4: Quality analysis of synthesized speech for the Blizzard dataset (MOS vs. training steps, 40k to 320k).

The observed synthesis times for the aforementioned cases are depicted numerically in Table 9 and visually in Fig. 5. FGLA came out to be 49.12%, 33.57% and 26.52% faster than GLA in terms of synthesis delay for the LJSpeech, Tatoeba and Blizzard datasets respectively. The overall reduction in the synthesis delay has been observed to be 36.58%, while the average speech synthesis time for the GAN based vocoder on the CPU came out to be 3.65, 2.88 and 2.76 times more than that of FGLA. It should also be noted that FGLA produced better quality speech with fewer iterations as compared to GLA. With the proposed waveform generation strategy, the LJSpeech dataset has shown more reduction in the synthesis delay than the other datasets. In the context of MOS scores, the Tatoeba dataset showed faster progress while the Blizzard dataset showed lower values as compared to LJSpeech.
Convergence Analysis

As discussed in Section 5.1.4, the appropriate number of iterations for FGLA was determined as 30. Here, we have verified that choice by observing the convergence of the plots of the resulting waveforms. The comparative analysis is easier to observe in the frequency domain; that is why the Fourier transforms of the waveforms are considered. The transforms of the speech produced by FGLA with 30 iterations and FGLA with 60 iterations are plotted and compared, and the two plots are found to overlap with each other. Similarly, the Fourier transforms of the waveforms produced by GLA with 30 iterations and GLA with 60 iterations are plotted and compared, and these plots do not overlap. This means that the waveforms produced by FGLA with 30 iterations and FGLA with 60 iterations are the same; thus, good quality speech is produced using FGLA with 30 iterations itself, whereas GLA with 30 iterations does not yet match the speech quality of GLA with 60 iterations.

Table 9: Synthesis time (ms) for various use-cases and datasets.
| S.N. | Use-case Sentence | LJSpeech GLA(60) | LJSpeech FGLA(30) | LJSpeech GAN | Tatoeba GLA(60) | Tatoeba FGLA(30) | Tatoeba GAN | Blizzard GLA(60) | Blizzard FGLA(30) | Blizzard GAN |
| 1. | He said to him, "Is not your name Ahmed?" | 10026 | 5173 | 18900 | 9684 | 6456 | 17950 | 9800 | 6725 | 16850 |
| 2. | All of a sudden, there was a loud screaming, "Please help me!" | 9698 | 4940 | 18190 | 10416 | 6880 | 19950 | 8997 | 6726 | 18870 |
| 3. | I think I lost my wallet! I can't find it anywhere! Oh, I could just kick myself! | 9702 | 4920 | 17000 | 10180 | 6799 | 20010 | 8957 | 6702 | 19010 |
| 4. | "Sunshine on my shoulders makes me happy, sunshine in my eyes can make me cry." | 9849 | 4959 | 18700 | 10550 | 7010 | 19880 | 8940 | 6757 | 18550 |
| 5. | As the stranger entered the town, he was met by a policeman who asked, "Are you a traveler?" "So it would appear", he replied carelessly. | 9828 | 4992 | 18560 | 10300 | 6821 | 20150 | 9129 | 6763 | 19950 |
Fig. 5: Synthesis delay for the use-cases formulated in Table 7 (average time in ms per dataset for GLA with 60 iterations, FGLA with 30 iterations and the GAN based vocoder).

The convergence plots for the fourth sentence from Table 7 are shown in Fig. 6 and Fig. 7 for GLA and FGLA respectively. The fourth sentence captures sufficient variations in terms of sentence length, special characters, punctuation marks, etc. The plots for the rest of the sentences are included in the supplementary material.
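A sketch of how such convergence plots can be produced is given below; compare_spectra is a hypothetical helper, and the commented usage assumes the fgla sketch from Section 4.

```python
import numpy as np
import matplotlib.pyplot as plt

def compare_spectra(waves, labels):
    """Overlay the magnitude spectra of synthesized waveforms, as in
    Figs. 6 and 7; overlapping curves indicate convergence."""
    for wave, label in zip(waves, labels):
        plt.plot(np.abs(np.fft.rfft(wave)), label=label)
    plt.xlabel("Frequency")
    plt.ylabel("Amplitude")
    plt.legend()
    plt.show()

# e.g.: compare_spectra([fgla(mag, n_iter=30), fgla(mag, n_iter=60)],
#                       ["FGLA 30 itr", "FGLA 60 itr"])
```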
6 Conclusion

In this paper, an FGLA based method is proposed to optimize the waveform generation process and reduce the speech synthesis delay. The final speech, i.e., the waveform, is reconstructed from the intermediate spectrogram. GLA has mostly been used for this, especially when the phase information of the waveform is missing. But GLA is slow, which causes delay in the speech synthesis. A faster alternative to GLA, i.e., FGLA, has been used in the proposed method, resulting in a 36.58% reduction in speech synthesis time. In the presented work, experiments were performed to optimize waveform generation from the linear spectrogram in single-speaker TTS systems. The proposed approach is compared against GLA and a GAN based neural vocoder in terms of speech quality and synthesis delay. The quality of the synthesized speech has been checked using MOS based evaluation and is observed to be retained in spite of the reduction in the synthesis time. The number of training steps and iterations were determined by experimental observation, and this choice was verified through the convergence of the Fourier transform plots of the resultant waveforms.
Fig. 6: GLA convergence plots. Frequency analysis of output waveforms (amplitude vs. frequency): (a) GLA 20 iterations, (b) GLA 30 iterations, (c) GLA 60 iterations, (d) GLA 20, 30 and 60 iterations overlaid.
In future, we will work to optimize the waveform processing for TTS systems trained with multi-speaker datasets using mel-spectrograms as the intermediate representation. We will also explore more neural vocoders for speech synthesis in real-time applications. It is planned to work on the challenges involved with them, such as reducing the model size and the computational requirements, especially for synthesizing small sentences.

Fig. 7: FGLA convergence plots. Frequency analysis of output waveforms (amplitude vs. frequency): (a) FGLA 20 iterations, (b) FGLA 30 iterations, (c) FGLA 60 iterations, (d) FGLA 20, 30 and 60 iterations overlaid.
References
1. Hideyuki Mizuno, Masanobu Abe, and Tomohisa Hirokawa. Waveform-based speech synthesis approach with a formant frequency modification. In IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 2, pages 195–198, 1993.
2. Yi Zhao, Shinji Takaki, Hieu-Thi Luong, Junichi Yamagishi, Daisuke Saito, and Nobuaki Minematsu. Wasserstein GAN and waveform loss-based acoustic model training for multi-speaker text-to-speech synthesis systems using a WaveNet vocoder. IEEE Access, 6:60478–60488, 2018.
3. Boland T Jones, David Michael Guthrie, Laurence Schaefer, and J Douglas Martin. Real-time speech-to-text conversion in an audio conference session, January 31 2017. US Patent 9,560,206.
4. Pravin Ghate and S D Shirbahadurkar. A survey on methods of TTS and various tests for evaluating the quality of synthesized speech. International Journal of Development Research, 07:15236–15239, 2017.
5. Andy Aaron, Raimo Bakis, Ellen M Eide, and Wael M Hamza. Systems and methods for text-to-speech synthesis using spoken example, November 11 2014. US Patent 8,886,538.
6. Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.
7. Suyoun Kim, Takaaki Hori, and Shinji Watanabe. Joint CTC-attention based end-to-end speech recognition using multi-task learning. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4835–4839, 2017.
8. Ye Jia, Yu Zhang, Ron Weiss, Quan Wang, Jonathan Shen, Fei Ren, Patrick Nguyen, Ruoming Pang, Ignacio Lopez Moreno, Yonghui Wu, et al. Transfer learning from speaker verification to multispeaker text-to-speech synthesis. In Advances in Neural Information Processing Systems (NeurIPS), pages 4480–4490, 2018.
9. Daniel Griffin and Jae Lim. Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2):236–243, 1984.
10. Yoshiki Masuyama, Kohei Yatabe, and Yasuhiro Oikawa. Griffin–Lim like phase recovery via alternating direction method of multipliers. IEEE Signal Processing Letters, 26(1):184–188, 2018.
11. N. Perraudin, P. Balazs, and P. L. Søndergaard. A fast Griffin-Lim algorithm. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pages 1–4, 2013.
12. A. J. Hunt and A. W. Black. Unit selection in a concatenative speech synthesis system using a large speech database. In IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings, volume 1, pages 373–376, 1996.
13. Geert Coorman, Filip Deprez, Mario De Bock, Justin Fackrell, Steven Leys, Peter Rutten, Jan De Moortel, Andre Schenk, and Bert Van Coile. Speech synthesis using concatenation of speech waveforms, May 15 2007. US Patent 7,219,060.
14. Keiichi Tokuda, Yoshihiko Nankaku, Tomoki Toda, Heiga Zen, Junichi Yamagishi, and Keiichiro Oura. Speech synthesis based on hidden Markov models. Proceedings of the IEEE, 101(5):1234–1252, 2013.
15. Heiga Zen, Keiichi Tokuda, and Alan W Black. Statistical parametric speech synthesis. Elsevier Speech Communication, 51(11):1039–1064, 2009.
16. Keiichi Tokuda and Heiga Zen. Directly modeling speech waveforms by neural networks for statistical parametric speech synthesis. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4215–4219, 2015.
17. Junichi Yamagishi, Bela Usabaev, Simon King, Oliver Watts, John Dines, Jilei Tian, Rile Hu, Yong Guan, Keiichiro Oura, Keiichi Tokuda, Reima Karhila, and Mikko Kurimo. Thousands of voices for HMM-based speech synthesis–analysis and application of TTS systems built on various ASR corpora. IEEE Transactions on Audio, Speech, and Language Processing, 18:984–1004, 2009.
18. Takashi Masuko, Keiichi Tokuda, Takao Kobayashi, and Satoshi Imai. Voice characteristics conversion for HMM-based speech synthesis system. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), volume 3, pages 1611–1614, 1997.
19. Soojeong Lee and Joon-Hyuk Chang. Spectral difference for statistical model-based speech enhancement in speech recognition. Springer Multimedia Tools and Applications, 76(23):24917–24929, 2017.
20. Jose Sotelo, Soroush Mehri, Kundan Kumar, Joao Felipe Santos, Kyle Kastner, Aaron Courville, and Yoshua Bengio. Char2Wav: End-to-end speech synthesis. 2017.
21. Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.
22. Yuxuan Wang, RJ Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron J Weiss, Navdeep Jaitly, Zongheng Yang, Ying Xiao, Zhifeng Chen, Samy Bengio, et al. Tacotron: Towards end-to-end speech synthesis. arXiv preprint arXiv:1703.10135, 2017.
23. Jonathan Shen, Ruoming Pang, Ron J Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, et al. Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4779–4783, 2018.
24. Sercan O Arik, Mike Chrzanowski, Adam Coates, Gregory Diamos, Andrew Gibiansky, Yongguo Kang, Xian Li, John Miller, Andrew Ng, Jonathan Raiman, et al. Deep Voice: Real-time neural text-to-speech. In Proceedings of the 34th International Conference on Machine Learning (ICML), volume 70, pages 195–204, 2017.
25. Andrew Gibiansky, Sercan Arik, Gregory Diamos, John Miller, Kainan Peng, Wei Ping, Jonathan Raiman, and Yanqi Zhou. Deep Voice 2: Multi-speaker neural text-to-speech. In Advances in Neural Information Processing Systems, pages 2962–2970, 2017.
26. Wei Ping, Kainan Peng, Andrew Gibiansky, Sercan O Arik, Ajay Kannan, Sharan Narang, Jonathan Raiman, and John Miller. Deep Voice 3: Scaling text-to-speech with convolutional sequence learning. arXiv preprint arXiv:1710.07654, 2017.
27. Yaniv Taigman, Lior Wolf, Adam Polyak, and Eliya Nachmani. VoiceLoop: Voice fitting and synthesis via a phonological loop. arXiv preprint arXiv:1707.06588, 2017.
28. D. Griffin and Jae Lim. Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2):236–243, April 1984.
29. Masanori Morise, Fumiya Yokomori, and Kenji Ozawa. WORLD: a vocoder-based high-quality speech synthesis system for real-time applications. IEICE Transactions on Information and Systems, 99(7):1877–1884, 2016.
30. Keisuke Oyamada, Hirokazu Kameoka, Takuhiro Kaneko, Kou Tanaka, Nobukatsu Hojo, and Hiroyasu Ando. Generative adversarial network-based approach to signal reconstruction from magnitude spectrogram. In European Signal Processing Conference (EUSIPCO), pages 2514–2518, 2018.
31. Kundan Kumar, Rithesh Kumar, Thibault de Boissiere, Lucas Gestin, Wei Zhen Teoh, Jose Sotelo, Alexandre de Brebisson, Yoshua Bengio, and Aaron C Courville. MelGAN: Generative adversarial networks for conditional waveform synthesis. In Advances in Neural Information Processing Systems, pages 14881–14892, 2019.
32. Ryan Prenger, Rafael Valle, and Bryan Catanzaro. WaveGlow: A flow-based generative network for speech synthesis. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 3617–3621, 2019.
33. Aaron van den Oord, Yazhe Li, Igor Babuschkin, Karen Simonyan, Oriol Vinyals, Koray Kavukcuoglu, George van den Driessche, Edward Lockhart, Luis C Cobo, Florian Stimberg, et al. Parallel WaveNet: Fast high-fidelity speech synthesis. arXiv preprint arXiv:1711.10433, 2017.
34. Kishore Prahallad. Speech technology: Spectrogram, cepstrum and mel-frequency analysis. https://archive.org/details/SpectrogramCepstrumAndMel-frequency_636522, 2016.
35. Sercan O Arik, Heewoo Jun, and Gregory Diamos. Fast spectrogram inversion using multi-head convolutional neural networks. IEEE Signal Processing Letters, 26(1):94–98, 2018.
36. Fisher Yu and Vladlen Koltun. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122, 2015.
37. Zhiyong Cheng and Jialie Shen. On effective location-aware music recommendation. ACM Transactions on Information Systems (TOIS), 34(2):1–32, 2016.
38. Yoshiki Masuyama, Kohei Yatabe, Yuma Koizumi, Yasuhiro Oikawa, and Noboru Harada. Deep Griffin–Lim iteration. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 61–65, 2019.
39. Ronald N. Bracewell. The Fourier Transform and Its Applications. McGraw-Hill, New York, 1986.
40. T Malathi and Manas Kamal Bhuyan. Performance analysis of Gabor wavelet for extracting most informative and efficient features. Springer Multimedia Tools and Applications, 76(6):8449–8469, 2017.
41. H V Sorensen, D Jones, Michael Heideman, and C Burrus. Real-valued fast Fourier transform algorithms. IEEE Transactions on Acoustics, Speech, and Signal Processing, 35(6):849–863, 1987.
42. Shie Qian and Dapang Chen. Discrete Gabor transform. IEEE Transactions on Signal Processing, 41(7):2429–2438, 1993.
43. Marc Levoy. Volume Rendering Using the Fourier Projection-Slice Theorem. Computer Systems Laboratory, Stanford University, 1992.
44. Keith Ito. The LJ Speech dataset. https://keithito.com/LJ-Speech-Dataset/, 2017.
45. Sysko. Tatoeba speech dataset. https://tatoeba.org/eng/, 2013.
46. Norbert Braunschweiler, Mark J. F. Gales, and Sabine Buchholz. Lightly supervised recognition for automatic alignment of large coherent speech recordings. In INTERSPEECH, 2010.
47. Pier Luigi Salza, Enzo Foti, Luciano Nebbia, and Mario Oreglia. MOS and pair comparison combined methods for quality evaluation of text-to-speech systems. Acta Acustica united with Acustica, 82(4):650–656, 1996.
48. Bret Kinsella. Speech synthesis becomes more humanlike. https://voicebot.ai/2017/12/21/speech-synthesis-becomes-humanlike/, 2017.