Fast Griffin Lim based Waveform Generation Strategy for Text-to-Speech Synthesis
Ankit Sharma (a) · Puneet Kumar (a) · Vikas Maddukuri (b*) · Nagasai Madamshetti (b) · Kishore KG (b) · Sahit Sai Sriram Kavuru (b) · Balasubramanian Raman (a) · Partha Pratim Roy (a)

(a) Computer Science and Engg. Dept., Indian Institute of Technology, Roorkee, India, 247667
(b) Electronics and Comm. Engg. Dept., Indian Institute of Technology, Roorkee, India, 247667
* Corresponding author, E-mail: [email protected]
Abstract
The performance of text-to-speech (TTS) systems heavily depends on spectrogram to waveform generation, also known as the speech reconstruction phase. The time required for the same is known as synthesis delay. In this paper, an approach to reduce speech synthesis delay has been proposed. It aims to enhance TTS systems for real-time applications such as digital assistants, mobile phones, embedded devices, etc. The proposed approach applies the Fast Griffin Lim Algorithm (FGLA) instead of the Griffin Lim algorithm (GLA) as the vocoder in the speech synthesis phase. GLA and FGLA are both iterative, but the convergence rate of FGLA is faster than that of GLA. The proposed approach is tested on the LJSpeech, Blizzard and Tatoeba datasets, and the results for FGLA are compared against GLA and a neural Generative Adversarial Network (GAN) based vocoder. The performance is evaluated based on synthesis delay and speech quality. A 36.58% reduction in speech synthesis delay has been observed. The quality of the output speech has improved, which is advocated by higher Mean Opinion Scores (MOS) and faster convergence with FGLA as opposed to GLA.
Keywords
Tacotron · Vocoder · Text to Speech Synthesis Delay · Dilated Convolutional Neural Network
1 Introduction

The conclusive step in a text-to-speech (TTS) system is the generation of speech from the spectrogram representation of the signal. This process is known as waveform reconstruction, while the generation of the intermediate signal from input text is called the construction process. The waveform generated in the reconstruction process is the time-domain signal obtained from its intermediate spectrogram. The overall performance of a TTS system depends on the waveform processing involved in the reconstruction phase [1]. The main challenge in TTS systems is to optimize the waveform processing time while maintaining or improving the quality of the generated speech [2].

TTS systems have emerged as valuable tools for day-to-day applications such as digital assistants, mobile phones, embedded devices, etc. Most of these devices have limited computational capacity and they are sometimes used in offline mode. Reducing the speech synthesis delay for them could be very useful in real-life applications such as human-computer interfaces, navigation systems, telecommunication and multimedia, aids to physically challenged people, daily appliances like TVs, washing machines, etc. [3]. For such applications, TTS systems are expected to have a quick response time and generate speech with good quality. Hence, it becomes even more important to speed up spectrogram to speech reconstruction for real-time applications. With the aim to make TTS systems more suitable for real-time applications, it is important to improve their response time while retaining the quality of the synthesized speech.

Many software and hardware-based techniques have been suggested in the past for waveform optimization during speech synthesis [4, 5]. Most of the traditionally used techniques were based on software components such as concatenative speech synthesis, parametric speech synthesis, etc. The current computing trend has shifted towards deep learning due to the availability of hardware resources and training data. State-of-the-art TTS systems have also started leveraging deep neural network (DNN) based techniques for speech synthesis [6]. The performance of text-to-speech systems has significantly improved, especially after the introduction of end-to-end neural waveform generation methods [7, 8]. In spite of the significant performance boost, even end-to-end neural waveform generation approaches suffer from a sluggish speech reconstruction process. Therefore, there is a need to look for more efficient waveform optimization approaches to enhance the speed and quality of machine synthesized speech.

There are three stages in the text-to-speech process: text analysis, linguistic analysis and waveform generation. Traditional TTS systems are based on complex multi-stage hand-engineered pipelines. The present state-of-the-art is end-to-end neural speech synthesis, which puts together these stages of the TTS process into a single layered pipeline through the use of DNNs. All three phases of TTS take place without human intervention for acoustic feature crafting.
However, the hyper-parameters and configuration settings to cater to a specific stage of the TTS can be set up-front. In that context, the implementation settings for waveform generation can be set in the form of an appropriate choice of the reconstruction algorithm. The Griffin-Lim algorithm (GLA) is the most predominantly used reconstruction algorithm for speech synthesis [9].

GLA is an iterative algorithm that tries to produce a signal from the spectrogram without any information about the phase. However, GLA needs many iterations and the perceptual quality of the output speech is not always very good [10]. An optimized version of GLA is available in the literature, known as the Fast Griffin Lim Algorithm (FGLA) [11]. FGLA naturally requires fewer iterations to construct the phase from the spectrogram representation for general signal processing applications. However, it has not been applied and tested for speech synthesis. In this paper, we have applied FGLA to the speech synthesis process of neural TTS systems. We have formulated an experiment to optimize the waveform processing of linear spectrograms in the Tacotron TTS system. The FGLA based reconstruction strategy has been applied to reduce the speech synthesis delay, and observations have been made in the context of the quality of the synthesized speech and the number of iterations required for the convergence of the reconstruction algorithm.

The proposed speech synthesis system generates the speech from the magnitude spectral envelope. We have conducted a Mean Opinion Score (MOS) study to test the quality of audio produced by FGLA with a smaller number of iterations against GLA with a greater number of iterations, with an optimal number of training steps. The experiments have been conducted on the LJSpeech, Blizzard and Tatoeba datasets. They resulted in a 36.58% reduction in speech synthesis time. The results have reflected higher quality of the output speech in terms of improved MOS. The number of training steps and iterations were determined by experimental observations. The convergence patterns of the Fourier transform plots of the resultant waveforms are found to be in line with the choice of the number of training iterations.

1.1 Contribution

The major contributions of the current research work are:

– An FGLA based method has been proposed to reconstruct .wav speech files from linear spectrograms. In TTS applications, reconstruction of a waveform from the spectrogram plays an important role because synthesis time is equivalent to the waiting time for application users. Users expect the speech output promptly. The proposed method has resulted in reduced synthesis time, which is likely to enhance the experience of TTS application users.
– The quality of the synthesized speech has been maintained while reducing the synthesis time. A market-based application cannot compromise on the quality of the synthesized speech. Speech quality also depends on the trained model. Hence, the model is trained up to an optimal number of steps and the speech quality checking process has been carried out on three datasets. On all three datasets, FGLA based speech reconstruction produced better quality speech than GLA based reconstruction.
– TTS models have been trained on the LJSpeech, Tatoeba and Blizzard datasets and the waveform reconstruction has been carried out for GLA, FGLA and the GAN based vocoder. The optimal number of training steps and iterations have been determined experimentally.
The .wav files generated by the TTS models have been evaluated based on the quality of the output speech and the synthesis time. The speech quality has been analyzed in terms of Mean Opinion Score (MOS), and the synthesis delay has been analyzed by measuring the time needed for the TTS model to synthesize the output speech.

1.2 Organization

The rest of the paper is organized as follows. Existing work on waveform processing for TTS systems has been surveyed in Section 2. Section 3 formulates the problem statement. The details of the proposed methodology have been outlined in Section 4. Section 5.1 presents the experimental setup. Analysis of the observed results has been presented in Section 5.2. Finally, Section 6 concludes the paper and highlights the scope for future research.
2 Related Work

In recent years, text-to-speech processing has witnessed significant improvements. Traditionally, concatenative and parametric speech synthesis methods have been used for the task of text-to-speech conversion. In the last couple of years, neural TTS systems have provided a substantial performance boost in the quality of machine synthesized speech. A review of various research attempts in the context of the aforementioned methods, along with the waveform optimization strategies followed by them, is provided in the following sections, and their summary is presented in Table 1.

2.1 Various Speech Synthesis Methods
Concatenative models have dominated the speech synthesis process since the 1970s. They are based on searching and collecting small samples of speech components from a voice database [12]. The voice quality of the speech synthesized by them is more natural; however, they require a huge voice database. There are two types of costs associated with them: a) searching cost and b) concatenation cost. Searching cost deals with searching specific voice segments corresponding to the required broken portion, and concatenation cost is related to joining these segments. As pointed out by G. Coorman et al. [13], one of the problems with these models is that they do not produce good quality speech if a suitable match in the database corresponding to the required segment is not obtained. Another challenge with concatenative models is that they require the complete dataset for generating a new set of voices. It is difficult to select the target unit from the voice database in order to minimize the difference between the required and selected samples [14].
Parametric speech synthesis is another widely used process to generate speech from text. These TTS models use statistical models derived from the data [15]. They follow a parameter generation approach instead of fetching speech samples from a database. Hidden Markov Model (HMM) based TTS architectures are among the most famous parametric models. The primary step involved in them is to find a parametric form of speech, including spectral and excitation parameters, from the voice corpus and then model them by using a set of generative models [16]. In the context of using HMM-based TTS systems, J. Yamagishi et al. [17] predicted the parameters and then synthesized the speech for a given text. The benefit of this approach is that it does not require the complete dataset at synthesis time. T. Masuko et al. [18] succeeded in changing the speaker's voice easily using parametric speech synthesis, while S. Lee et al. [19] tried a statistical parametric method for enhancement of the speech. However, a disadvantage associated with parametric methods is that the voice quality of the synthesized speech is not as natural as in the case of concatenative speech synthesis.
Concatenative and parametric TTS systems have practical difficulties; for example, their different components need to be modeled and processed separately [20]. Deep learning-based end-to-end neural TTS systems, which are the current state-of-the-art, solve this problem. They put together the intermediate stages of the TTS process into a single, layered pipeline through the use of DNNs, which is carried out without human intervention for acoustic feature engineering. Recent neural TTS systems include Wavenet [21], Char2Wav [20], Tacotron [22], Tacotron 2 [23], DeepVoice [24], DeepVoice 2 [25], DeepVoice 3 [26] and VoiceLoop [27]. Wavenet is based on a generative model that predicts samples based on a probability distribution. Tacotron [22] produces spectrograms from the text and then produces the corresponding waveform using a vocoder. However, waveform generation has been a time-consuming process for the initial TTS systems, and the speech output was not human-like. The use of improved spectrogram methods such as the mel-spectrogram and better vocoders such as WORLD, GLA, etc. has helped to solve these problems [28].

Char2Wav predicts the parameters of the WORLD vocoder and uses a SampleRNN conditioned upon the WORLD parameters for waveform generation. WORLD [29], on the other hand, consists of three analysis algorithms for determining the fundamental frequency (F0), spectral envelope and aperiodic parameters. Tacotron is another end-to-end model that uses seq-to-seq learning to map the text to a spectrogram as intermediate data, from which audio is then generated using a vocoder. It uses Griffin Lim as the vocoder, which generates audio waveforms from the linear spectrogram. It takes a linear scale magnitude spectrogram and the number of iterations as input and produces the corresponding waveform. It was observed that GLA in Tacotron converges in about 60 iterations [22]. Tacotron incorporates GLA for phase estimation, followed by an inverse Short-Time Fourier Transform (STFT) for waveform reconstruction. Tacotron 2 is an entirely neural network-based approach for speech synthesis which combines the seq-to-seq model feature used in Tacotron, generates the mel-spectrogram, and performs speech synthesis using a modified Wavenet vocoder.

In DeepVoice [24], the Wavenet architecture is modified and a fast synthesis system is developed for the audio synthesis stage. DeepVoice 2 [25] is a multi-speaker model that has taken the Tacotron architecture as a base and replaced the Griffin Lim algorithm with a Wavenet based vocoder. DeepVoice 3 [26] is a fully convolutional attention-based neural end-to-end TTS system. Its architecture is capable of transforming several textual features into vocoder parameters such as mel-spectrograms, linear scale log spectrograms, spectral envelope, fundamental frequency (F0), aperiodicity parameters, etc. These vocoder features are given as input to the waveform synthesis models. It uses three different vocoders: WORLD, Griffin-Lim and Wavenet. Both WORLD and Griffin Lim use the linear spectrogram, whereas the modified Wavenet in Tacotron 2 uses the mel-spectrogram for waveform synthesis. VoiceLoop [27] is an attention-based neural text to speech system referenced by a working memory model called a phonological loop. It is capable of producing voices that are sampled in the wild. VoiceLoop replaces convolutional RNNs with a memory buffer.
In the context of using neural vocoders, K. Oyamada et al. [30] focused on a DNN based architecture to recover the phase information from the magnitude spectrogram. K. Kumar et al. [31] proposed MelGAN, which is a fully convolutional, non-autoregressive vocoder. It generalized well to unseen speakers and showed a significant speed-up in speech construction from mel-spectrograms. In a similar work, WaveGlow [32] was proposed by replacing the vocoder part of Wavenet with a deep neural architecture. It performed efficiently on large utterances, but its performance degraded while converting small text samples into speech. TTS systems with neural vocoders such as WaveGlow have to repetitively go through serial steps of waveform construction, which causes them to take more time while constructing small sentences. As observed by K. Oyamada et al. [30], on a CPU, some of the neural vocoders took three times longer than GLA for the speech synthesis. Their training time is as high as a few weeks, and although their inference is fast on a GPU, the large size of the trained model makes their application very difficult on real-time devices having a CPU with constrained memory [31, 32, 33].

2.2 Waveform Processing in TTS Systems

The final speech generated by the TTS systems is in the form of a waveform, while the intermediate representation is called a spectrogram. The raw text input is converted into a sampled embedding vector in the pre-processing phase, from which the intermediate frequency-time representation, that is, the spectrogram, is generated [34]. This is called the 'Construction Phase'. Then, the waveform is generated in the 'Reconstruction Phase' using vocoders. The efficiency of the reconstruction algorithm majorly determines the overall performance of the TTS system. GLA has been the most predominantly used reconstruction algorithm for speech synthesis [10]. A time-domain signal can be reconstructed from its amplitude spectrogram using the information about its phase. When no information is available about the phase and only the amplitude spectrogram is available, GLA is particularly suited for phase reconstruction. However, GLA needs many iterations and the perceptual quality of the output speech is not always very good [35].

There have been a number of attempts to optimize the waveform processing for speech reconstruction. For instance, Sercan et al. [35] implemented transposed convolution layers along with non-linear interpolation, which resulted in better utilization of modern multi-core processors than a simple iterative strategy. In another work, Y. Fisher [36] used multi-scale context aggregation by dilated convolutions, which resulted in a simplified network along with increased state-of-the-art accuracy. In the context of waveform processing based applications, Z. Cheng and J. Shen [37] used the properties of the audio waveforms to recommend music based on the venue and surroundings of the user. As an attempt to enhance the vocoder module, M. Morise et al. [29] proposed a new vocoder, WORLD, for feature extraction and waveform synthesis. Y. Masuyama [38] proposed an enhanced phase reconstruction technique by combining a DNN with GLA to build GLA-inspired neural network layers for waveform generation.

Some of the distinctly related work in the area of signal processing maps to the utilization of Fourier transformation techniques such as the Gabor Transform [39, 40]. It is a special form of Fourier transform that is used to determine the frequency and phase content of signals represented in the form of spectrograms.
In this direction, a real-time fast Fourier transform algorithm was proposed by H. Sorensen et al. [41]. Successful research has also been carried out to achieve phase recovery with a smaller number of iterations as compared to GLA [10]. It gives a hint to look for alternative reconstruction algorithms requiring fewer iterations while maintaining the quality of the synthesized speech.

There have been various attempts to optimize the waveform processing in the context of GLA. However, better alternatives for waveform reconstruction have not been explored to their best potential. An optimized version of GLA, known as the Fast Griffin Lim Algorithm (FGLA), is available in the literature [11]. It requires fewer iterations to construct the phase from the spectrogram representation for general signal processing applications. However, it has not been applied and tested for speech synthesis applications. In this paper, an FGLA based waveform generation method has been proposed with the aim to reduce synthesis delay. It aims to overcome the challenges faced by concatenative and parametric TTS systems by getting rid of the need for human intervention for acoustic feature engineering. Some of the challenges of using neural vocoders for real-time TTS applications, such as slow speech synthesis on CPU, larger model size, complex architecture, etc., have also been considered and addressed.
3 Problem Formulation

The major objective of the proposed research work is to optimize the waveform generation process during speech synthesis by TTS systems. The speech synthesis time should be reduced without changing the quality of the output speech. The training phase for a TTS device is performed once in a given system until there is a change in the algorithm. The synthesis phase is executed on real-time speech synthesis devices having low computational power. This phase is repeated every time a text is converted into speech. System resources at the training stage are generally of high computing power. However, most real-time speech synthesis systems have limited computing capabilities. Hence, the synthesis algorithm should take a small amount of memory to make the speech synthesis more suitable for real-time applications. The problem statement is subjected to the following constraints:

i) The average time taken (T) to convert the corresponding spectrogram to the waveform should be minimized. T corresponds to the synthesis delay for n samples.

ii) The number of iterations required (itr) for the output waveform to converge should be minimized. That is, their plots should reach an optimal state as soon as possible.
iii) The quality of the synthesized speech (qual) should be maintained. The reduction in speech synthesis time should not affect it.
iv) The speech synthesis process should result in optimal resource utilization (util). It should cater to the limited computational resources of real-time TTS systems.

The aforementioned constraints can be modeled mathematically as shown in Eq. 1.

Table 1: Summary of literature review.

| TTS Model | Basic Method | Properties | Text Pre-processing | Waveform Pre-processing | Spectrogram Type | Construction | Reconstruction |
| Concatenative | Searches the audio segment from the speech database that is most relevant to the text | Simple to implement. Results in good quality speech | - | - | - | - | - |
| Parametric | Keeps track of a parametric form of speech including spectral parameters | Speaker voice can be changed with minimum effort | - | - | - | - | - |
| Wavenet [21] | Uses dilated regressive CNN to predict the present sample from past samples | High quality. Can generate multi-speaker voice | Yes | No | Mel-spectrogram | - | - |
| Char2Wav [20] | Uses bidirectional RNN to produce a waveform from textual content only | Expert linguistic knowledge is not required | No | Yes | Linear-spectrogram | - | SampleRNN |
| Tacotron [22] | Encoder-decoder architecture based on RNNs | Fully end-to-end; robust and fast processing | Yes | Approximate | Linear-spectrogram | CBHG | GLA |
| Tacotron 2 [23] | Encoder-decoder architecture based on RNNs | Better speech quality and smaller model size than Tacotron | Yes | Approximate | Mel-spectrogram | Convolution based | Modified Wavenet |
| DeepVoice [24] | Uses five different DNNs for TTS; needs fewer parameters and is faster than Wavenet | Faster processing | No | No | Linear-spectrogram | CBHG | GLA |
| DeepVoice 2 [25] | Wavenet based spectrogram to audio generation | Can generate multi-speaker voices with less training | No | Approximate | Linear-spectrogram | Attention based encoder | GLA, Wavenet |
| VoiceLoop [27] | Uses shifting buffer memory instead of RNNs | Robust; produces fewer errors | Yes | Exact | Linear-spectrogram | Buffer shallow network | WORLD |

Here, GLA: Griffin Lim Algorithm; CBHG: (1-D convolution bank + highway network + bidirectional GRU).
Subject to constraints:

    minimize (T, itr)
    maximize (qual, util)                                          (1)

Where:
    n: number of inputs.
    itr: number of iterations required for the output waveform to converge.
    qual: quality of the synthesized speech.
    util: utilization of the computing capacity of the TTS platform.
    T = {t_1, t_2, t_3, ..., t_n}: times to convert coefficients into waveforms; t_1, t_2, ..., t_n are the times taken to convert the spectrograms to waveforms.
    S = {u_1, u_2, u_3, ..., u_n}: set of spectrograms generated by the TTS system; u_1, u_2, ..., u_n are the coefficient matrices of the spectrograms.
    W = {v_1, v_2, v_3, ..., v_n}: set of corresponding waveforms generated by a vocoder; v_1, v_2, ..., v_n are the waveforms produced from u_1, u_2, ..., u_n.

4 Proposed Methodology

In general, a TTS system contains three phases: a) text analysis (text to words), b) linguistic analysis (words to phonemes) and c) waveform generation (phonemes to sound). The first and second phases are carried out during the training phase. The TTS model is trained on text data and an intermediate spectrogram is generated from the trained model for the given input text. The third phase takes place during synthesis, when the waveform is synthesized from this spectrogram. The proposed method aims to optimize the reconstruction of the original speech signal from the intermediate spectrogram. Generally, a magnitude spectrogram does not contain the phase information. A reconstruction algorithm such as GLA iteratively recovers that information. GLA is an iterative algorithm that takes a high number of steps to recover the phase information. The proposed approach applies an optimized alternative, FGLA, for phase reconstruction in the waveform generation phase.

The waveform analysis can be performed easily in the frequency domain [39]. The Fourier transform is the most widely used transformation that converts a time domain signal into a frequency domain signal. That is why the proposed methodology utilizes the Fourier transform and its variants such as the Short-Time Fourier Transform (STFT), Discrete Fourier Transform (DFT), Gabor Transform, etc. during various steps of waveform generation. The STFT is a series of Fourier transforms of subsets of the signal. When the frequency components of a signal vary with time, the STFT is used to retrieve the time-localized frequency information. The Gabor Transform is a special kind of STFT representation which is used to discover the phase information and sinusoidal frequency of the subsets of a time varying signal. The time-frequency analysis is carried out by first multiplying the function by a Gaussian function and then transforming it with a Fourier transform. During the synthesis phase, the input is a spectrogram and the output is a waveform. The STFT can be represented as a matrix of coefficients where the column index represents time and the row index represents the frequency of the respective DFT coefficient. The magnitude of each coefficient at the respective index is computed, and this matrix can be treated as an image known as the spectrogram of the signal. The choice of the base implementation has been explained in Section 4.1. The proposed methodology is described in the following sections. It is visually represented in Fig. 1 and mathematically depicted in Algorithm 1.
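The construction of this intermediate representation can be illustrated with a short sketch. The snippet below computes a linear magnitude spectrogram via the STFT; the file name and the FFT size of 1024 are hypothetical placeholders, while the 20000 Hz sampling rate and the 12.5 ms frame shift (250 samples) follow the hyper-parameters listed later in Table 2.

```python
import numpy as np
import librosa

# Minimal sketch: derive the |STFT| matrix (the linear magnitude
# spectrogram) that the reconstruction phase will later invert.
y, sr = librosa.load("sample.wav", sr=20000)        # sampling rate as in Table 2
stft = librosa.stft(y, n_fft=1024, hop_length=250)  # 12.5 ms frame shift at 20 kHz
spectrogram = np.abs(stft)                          # rows: frequency, columns: time
```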
4.1 Rationale Behind Selecting the Fast Griffin Lim Algorithm with Tacotron

The research work presented in this paper primarily aims to reduce the speech synthesis delay in text-to-speech systems. A fundamental experiment to apply FGLA based waveform generation from the linear-spectrogram has been formulated. The most commonly used waveform reconstruction algorithm for speech synthesis is GLA. The most acclaimed TTS system that uses the linear-spectrogram as the intermediate representation is Tacotron, developed by Google [22]. GLA was used as the vocoder in Tacotron; it produces an approximate waveform corresponding to the input spectrogram, not the exact waveform. FGLA has been chosen instead of GLA because FGLA is known to take fewer iterations to recover the phase from the spectrogram [11]. Other vocoders such as Wavenet [21], WORLD [29], etc. are available in the literature as potential choices, but they process mel-spectrograms, while the current research targets linear-spectrograms during the intermediate step of the speech synthesis process. Thus, FGLA with Tacotron emerged as the most suitable choice for the experiment under consideration.

4.2 Strategy for Waveform Optimization

The STFT is used for the observable comparative analysis in the frequency domain [9]. FGLA attempts to reconstruct the speech signal from the intermediate spectrogram of the signal. For that purpose, it finds the real signal x* ∈ R^L from a given set of spectral magnitude coefficients s, such that the magnitude of the STFT of x* is as close as possible to the input. This helps in a more accurate reconstruction of the signal. An arbitrary set of complex numbers cannot be chosen as STFT coefficients, i.e., only a certain set of complex numbers corresponds to the STFT of a waveform. In the same way, the input that we get may not be a valid spectrogram. A valid spectrogram S would have the magnitude of the coefficients as close as possible to the input.

The relevant terms have been defined in Section 3. Two important concepts that are utilized by the proposed method are the Gabor Transform and Projection. G*x is the inverse Gabor Transform of x. It is a special case of the STFT that is helpful in extracting the feature patterns from the spectrogram representation [42]. It helps in finding the time needed to convert the spectrogram into the waveform [43]. The three phases of the proposed method are described as follows.

Initialization

(i) First, we initialize the coefficient matrix c, which has the same dimensions as the input spectrogram and contains elements of set C2.
Fig. 1: Flow diagram depicting the proposed methodology. Here, green colored boxes show various stages of the methodology; the red box shows the loop decision and the yellow box shows the loop iteration. Theoretical analysis of the complexity and the effect of the convergence rate α is discussed in Section 4.2. α is an important hyper-parameter impacting the complexity and causing the speed-up in waveform processing. Its appropriate value is determined in Section 5.1.1.

(ii) The magnitude of every element is made equal to the element of the input matrix in the corresponding position. That is, projecting onto set C2.
(iii) Then we initialize another matrix t of the same dimensions.
(iv) The projection of the modified coefficients of the transform onto set C2 is defined as follows:

    P_C2(c) = s · e^(i∠c)                                          (2)

The above matrix is projected onto set C2, followed by projection onto set C1 to get t. Here, t_i and c_i denote the matrices after i iterations.

    c = s · e^(i∠c),  t = P_C1(P_C2(c))                            (3)

(v) Here, C1 is the set of possible coefficients of the STFT, and C2 is the set of complex numbers whose magnitude is equal to the magnitude spectrum coefficients. Gx is the Gabor Transform of x.

    C1 = {c : ∃ x ∈ R^L such that c = Gx}
    C2 = {c ∈ C^(M×N) : |c| = s}                                   (4)

Iteration

(i) The magnitude of the elements of the coefficient matrix is made equal to the input matrix, keeping the phase unchanged.
(ii) The inverse Gabor Transform is then applied on the resultant coefficients, followed by the Gabor Transform. The projection of the modified coefficients onto set C1 is defined as follows. This is the influential step that makes FGLA faster than GLA [11].

    P_C1(c) = G G* c,  t_n = P_C1(P_C2(c_(n-1)))                   (5)

(iii) The projection onto set C1 in the current step is subtracted from the projection in the previous step and multiplied by a factor, the convergence rate α. Choosing α close to one, but not exactly one, yields better results. This product is added to the projection in the current step, and the initialized coefficient is updated to this value:

    c_n = t_n + α (t_n − t_(n-1))                                  (6)

(iv) The above steps are repeated iteratively, and in each step the coefficients converge closer to a real signal whose magnitude spectrum is approximately equal to the input spectrum.

Finalization

(i) The inverse Gabor Transform is applied on the final coefficients to get the waveform:

    x* = G* c_n                                                    (7)

The aforementioned phases of the proposed methodology are depicted in Algorithm 1, and the theoretical analysis of its complexity is discussed below.
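To make the two projections concrete, here is a minimal numpy/librosa sketch of P_C1 and P_C2, under the assumption that the STFT/iSTFT pair stands in for the Gabor analysis/synthesis operators G and G*; the FFT size and hop length are illustrative placeholders.

```python
import numpy as np
import librosa

def project_C2(c, s):
    """Eq. 2: keep the phase of c, impose the target magnitude s."""
    return s * np.exp(1j * np.angle(c))

def project_C1(c, n_fft=1024, hop_length=250):
    """Eq. 5: G G* c -- inverse transform followed by the forward
    transform projects c onto the set of consistent spectrograms."""
    x = librosa.istft(c, hop_length=hop_length)
    return librosa.stft(x, n_fft=n_fft, hop_length=hop_length)
```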
Complexity Analysis

As shown in Algorithm 1, FGLA has a single iteration loop, and as per the GLA paper [9], GLA also involves a single loop. Hence, the theoretical time complexities for both FGLA and GLA are O(n). Experiments revealed that FGLA could produce a waveform of the same quality with 30 iterations as GLA with 60 iterations. The smaller number of iterations for FGLA is a determining factor in the reduction of the synthesis delay. The convergence rate α is another important hyper-parameter impacting the complexity and causing the speed-up in the waveform processing. Its value ranges from 0 to 1. For α = 0, FGLA behaves as GLA. As its value increases, the speed-up also increases up to a limit with faster convergence, and then it starts decreasing. As discussed in Section 5.1.1, the appropriate value of α has been determined as 0.2.
Algorithm 1: FGLA based Waveform Optimization for Text-to-Speech

Input n: Number of iterations.
Input k: Number of input variables.
Input C1: Set of possible coefficients of the STFT.
Input C2: Set of complex numbers with magnitude the same as the spectrum coefficients.
Define s: Spectral magnitude coefficients.
Define c: Coefficient matrix.
Define t: Matrix of the same dimensions as c.
Define α: Convergence rate.
Define x*: Final waveform.
Define G*: Inverse Gabor Transform.
Input S = {u_1, u_2, ..., u_k}: set of spectrograms generated by the TTS model; u_1, u_2, ..., u_k are the coefficient matrices of the spectrograms.
Output W = {v_1, v_2, ..., v_k}: set of waveforms generated by the vocoder; v_1, v_2, ..., v_k are the produced waveforms.
Define T = {t_1, t_2, ..., t_k}: times to convert the spectrograms to waveforms.

Procedure WaveOpti
1:  Stage 1: Initialisation
2:  // Projection of modified transform coefficients onto set C2
3:  P_C2(c) = s · e^(i∠c)
4:  // Projection of the above matrix onto set C2 and then C1
5:  c = s · e^(i∠c)
6:  t = P_C1(P_C2(c))
7:  Stage 2: Iteration
8:  // Projection of modified transform coefficients onto set C1
9:  P_C1(c) = G G* c
10: // Update t and c in each iteration
11: for i in n do
12:     t_i = P_C1(P_C2(c_(i-1)))
13:     c_i = t_i + α (t_i − t_(i-1))
14: end for
15: Stage 3: Waveform Generation
16: // Inverse Gabor Transform of the final coefficients
17: x* = G* c_n
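A compact Python sketch of Algorithm 1 is given below, assuming the STFT/iSTFT pair (here via librosa) as a stand-in for the Gabor analysis/synthesis operators; the function name, FFT size and hop length are illustrative choices rather than the authors' implementation.

```python
import numpy as np
import librosa

def fgla(mag, n_iter=30, alpha=0.2, n_fft=1024, hop_length=250):
    """Sketch of FGLA [11]: reconstruct a waveform from a linear
    magnitude spectrogram `mag` of shape (1 + n_fft // 2, frames)."""
    # Stage 1: initialise coefficients with the target magnitude (zero phase).
    c = mag.astype(np.complex64)
    t_prev = c.copy()
    for _ in range(n_iter):
        # P_C2 (Eq. 2): impose the target magnitude, keep the current phase.
        proj = mag * np.exp(1j * np.angle(c))
        # P_C1 (Eq. 5): inverse then forward transform (G G*) for consistency.
        x = librosa.istft(proj, hop_length=hop_length)
        t = librosa.stft(x, n_fft=n_fft, hop_length=hop_length)
        # Eq. 6: accelerated update c_n = t_n + alpha * (t_n - t_{n-1}).
        c = t + alpha * (t - t_prev)
        t_prev = t
    # Stage 3 (Eq. 7): waveform from the final coefficients.
    return librosa.istft(c, hop_length=hop_length)
```

For reference, librosa ships a built-in librosa.griffinlim whose momentum argument applies the same acceleration; setting momentum to 0 recovers plain GLA.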
5 Experiments

This section discusses and evaluates the experiments to apply FGLA based waveform generation from the linear-spectrogram.

5.1 Experimental Set-up

This section demonstrates the experimental implementation and analyses the results. A fundamental experiment to optimize the waveform processing of linear spectrograms in the Tacotron TTS system has been formulated. The model training is done on an Nvidia Tesla K80 GPU machine with 24 GB RAM and 4992 CUDA cores. Text to speech synthesis is done on an Intel(R) Core(TM) i7-7700, 4.2 GHz CPU with 16 GB RAM and a 64-bit Windows 10 OS machine. The machine learning libraries used in this implementation are Numpy, Tensorflow and Keras. The choice of parameters and datasets is detailed in the following sections.
Table 2: Hyper-parameter choices.

| Parameter | Value |
| Convergence rate, α | 0.2 |
| Sampling rate | 20000 |
| Frame shift | 12.5 ms |
| Learning rate | 0.0002 |

The questions that the experimental set-up tries to answer are: "What is the ideal number of training steps for TTS model training?"; "How to determine the appropriate number of iterations?"; "How to evaluate the speech synthesized by the TTS system in terms of quality, speed and convergence?" Suitable numbers of training steps and iterations are experimentally determined in Sections 5.1.3 and 5.1.4. Then, as per Table 7, use-case sentences are formulated according to various complexity levels. In Section 5.2, speech synthesis for these sentences has been evaluated in terms of speech quality, synthesis delay and convergence.

5.1.1 Determination of the Convergence Rate

During the synthesis, the convergence rate α was experimentally chosen as 0.2. We experimented with α values starting from 0.1 with a learning rate of 0.0002, performed the iterations for spectrogram to waveform construction, and analysed the construction time. The most suitable value of α corresponding to the optimal construction time emerged as 0.2.
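A minimal sketch of such a sweep is shown below; it assumes the hypothetical fgla function from the sketch after Algorithm 1, and a synthetic magnitude spectrogram stands in for a Tacotron output so the snippet runs in isolation.

```python
import time
import numpy as np

# Synthetic magnitude spectrogram (513 frequency bins x 200 frames).
mag = np.abs(np.random.randn(513, 200)).astype(np.float32)

# Time the spectrogram-to-waveform construction for several alpha values;
# the grid is illustrative, the paper's sweep started from 0.1.
for alpha in np.round(np.arange(0.1, 1.0, 0.1), 1):
    start = time.perf_counter()
    fgla(mag, n_iter=30, alpha=alpha)
    print(f"alpha={alpha}: {time.perf_counter() - start:.3f} s")
```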
The basic entities for training the model are the .text and .wav files. The wav file signal has to be sampled for the analysis. Table 2 lists the important parameters for the current analysis and their selected values. Here, 'Sampling rate' denotes the number of samples per second, 'Frame shift' specifies the amount by which the window slides, and 'Learning rate' shows how fast the network learns by adjusting weights. The experimentally determined values for the sampling rate, frame shift and learning rate are 20000, 12.5 ms and 0.0002 respectively. Tacotron has been trained from scratch on the datasets described in Section 5.1.2 for various numbers of iterations, and the speech synthesis time has been observed for both GLA and FGLA.

5.1.2 Datasets

The original Tacotron paper had used the LJSpeech dataset. We have trained and tested the TTS model with the Blizzard and Tatoeba datasets as well. The details of all the datasets used in the implementation are provided in the following sections.
(i) LJ Speech Dataset [44]: A single-speaker, public domain speech dataset containing 13100 audio samples ranging from 1 to 10 seconds. The total duration of the dataset is about 24 hours. Each audio file is a single-channel 16-bit PCM WAV with a sample rate of 22050 Hz. The properties of the LJSpeech dataset are detailed in Table 3.
(ii) Tatoeba Dataset [45]: This audio corpus is a crowdsourced dataset of sentences and translations. It contains a subset of the English sentences of Tatoeba. We have not used the complete dataset; sentences have been filtered out of it.
Table 3: Details of the LJSpeech dataset.

| Parameter | Value |
| Total Clips | 13,100 |
| Total Words | 225,715 |
| Total Characters | 1,308,678 |
| Total Duration | 23:55:17 |
| Mean Clip Duration | 6.57 sec |
| Min Clip Duration | 1.11 sec |
| Max Clip Duration | 10.10 sec |
| Mean Words per Clip | 17.23 |
| Distinct Words | 13,821 |
Table 4: Details of the Blizzard dataset.

| Audiobook Name | Total Audio Length |
| Tramp Abroad | 15:46:01 |
| Life on the Mississippi | 14:47:27 |
| The Man That Corrupted Hadleyburg and Other Stories | 13:04:00 |
(iii) Blizzard Dataset [46]: It is available under the Creative Commons Attribution Share-Alike license. It contains the samples of three audiobooks read by a single American English narrator. The book names and recording times are given in Table 4. The audio file format of the Blizzard corpus is 16-bit mono WAV with a sampling frequency of 44100 Hz.
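As an illustration of how such corpora feed the training pipeline, the sketch below pairs each LJSpeech clip with its transcript; the directory layout follows the public LJSpeech-1.1 release, whose metadata.csv is a pipe-separated index of clip id, raw transcript and normalized transcript.

```python
import csv

# Collect (wav path, normalized transcript) training pairs from LJSpeech.
pairs = []
with open("LJSpeech-1.1/metadata.csv", encoding="utf-8") as f:
    reader = csv.reader(f, delimiter="|", quoting=csv.QUOTE_NONE)
    for clip_id, _raw, normalized in reader:
        pairs.append((f"LJSpeech-1.1/wavs/{clip_id}.wav", normalized))
```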
5.1.3 Determination of the Appropriate Number of Training Steps

The quality of the synthesized speech also depends on the number of training steps. However, training the model for more steps requires more computation and time. Hence, it is important to determine the optimal number of steps. We have trained the Tacotron model up to 400K steps and observed the MOS values of the speech synthesized with it. MOS is a subjective evaluation score that denotes the quality of a speech utterance [47]. The MOS scores corresponding to various checkpoints are shown in Table 5. It has been observed that the MOS values improve rapidly up to 250k steps, but their convergence slows down significantly after that. Hence, the TTS model has been trained for at least 250k steps for the final training of each use-case. This analysis is performed with the same configuration as the original Tacotron paper, i.e., using the GLA algorithm with 60 iterations.
5.1.4 Determination of the Appropriate Number of Iterations

FGLA is supposed to converge faster than GLA. However, the suitable value for FGLA's number of iterations needs to be determined effectively. With that aim, we trained Tacotron with both GLA and FGLA for 20, 30 and 60 iterations, and then checked the MOS values of the audio samples synthesized with the models thus trained.
Table 5: Determination of the appropriate no. of steps (MOS).

| Steps | LJSpeech | Tatoeba | Blizzard |
| 40k | 7.0 | 7.1 | 5 |
| 80K | 7.5 | 7.3 | 5.2 |
| 120K | 7.6 | 7.5 | 5.4 |
| 160K | 7.6 | 7.6 | 6.0 |
| 200K | 7.65 | 7.8 | 6.3 |
| 240K | 7.9 | 8 | 6.5 |
| 280K | 8 | 8.1 | 6.7 |
| 320K | 8.2 | 8.25 | 7 |
Table 6: Determination of the appropriate no. of iterations (MOS).

| Dataset | Iterations | GLA | FGLA |
| LJSpeech | 20 | 6.0 | 6.8 |
| | 30 | 7.3 | 8.2 |
| | 60 | 7.6 | 8.2 |
| Tatoeba | 20 | 7.6 | 7.1 |
| | 30 | 7.6 | 8.25 |
| | 60 | 8.1 | 8.25 |
| Blizzard | 20 | 6.0 | 6.2 |
| | 30 | 6.4 | 7.1 |
| | 60 | 6.9 | 7.3 |

These values are illustrated in Table 6. The speech quality for FGLA in terms of MOS scores is observed to be better than that for GLA. Moreover, FGLA is also observed to take a smaller number of iterations to reach the same MOS score as compared to GLA. FGLA with 30 iterations converged to MOS values equivalent to GLA with 60 iterations. Hence, 30 was selected as the appropriate number of iterations for FGLA to be used in Tacotron's training. The correctness of this choice is justified in Section 5.2.
5.1.5 Use-case Sentences

During speech synthesis, the test sentences are chosen according to various complexity levels. Various verbal and lingual combinations in terms of punctuation marks, abbreviations, special characters, exclamation and question marks, etc. have been included to form five use-case sentences of varying lengths. As mentioned earlier, the main objective of the work presented in this paper is to reduce the synthesis delay without affecting the quality. The trained models are tested by synthesizing these sentences, and their synthesis delay and speech quality have been observed. The time taken in the synthesis process is proportional to the length of the text, so the test sentences have variable lengths. Every sentence has been synthesized 10 times and then the average synthesis time has been considered. To make periodic observations, a model checkpoint has been saved after every 1000 training steps. The size of the trained model has been observed to be of the order of 80 MB.
Table 7: List of use-case sentences.

| S.No. | Sentence |
| 1. | He said to him, "Is not your name Ahmed?" |
| 2. | All of a sudden, there was a loud screaming, "Please help me!" |
| 3. | I think I lost my wallet! I can't find it anywhere! Oh, I could just kick myself! |
| 4. | "Sunshine on my shoulders makes me happy, sunshine in my eyes can make me cry." |
| 5. | As the stranger entered the town, he was met by a policeman who asked, "Are you a traveler?" "So it would appear", he replied carelessly. |

The aforementioned use cases are depicted in Table 7 and the detailed analysis is presented in Table 9.

5.2 Result Analysis

The implementation has been carried out considering the number of training steps and iterations determined in the above section. This section presents and verifies the results of speech synthesis using the TTS model thus trained. The results have been evaluated for the five use-case sentences described in Table 7. During the result evaluation, we have checked in parallel the quality of the audio generated for all three corpora at the different training intervals given in Table 5. The Mean Opinion Score (MOS) is calculated based on the wave files generated by synthesis on the trained model. The MOS values for the different datasets at various training steps are shown in Table 5. The results have been analysed based on the quality of the output speech and the synthesis delay. The choice of the number of iterations made in Section 5.1 has also been verified by observing the convergence of the waveform plots for GLA and FGLA.
Quality Analysis

Quality analysis results are shown in Tables 5 and 6. It was observed that after 250k steps of model training, the output speech included prosody features. That made the voice more feasible for real-time speech synthesis. Speech quality also depends on the corpus used for training. The model is trained up to 400K steps for all datasets and the results have been generated. The synthesized speech is expected to be clear to understand and non-robotic in nature: the more human-like the voice is, the higher its MOS value and the easier it is to understand. In Figs. 2, 3 and 4, the X axis shows the duration of model training and the Y axis represents the MOS of the speech as rated by 10 evaluators. The output speech has been evaluated at regular training step intervals for the different datasets. The speech quality was observed to depend mainly on the number of iterations used in the algorithm. The graphs show that in the initial stage the learning progress is very fast, but after 250k training steps it becomes very slow. The observed results in the context of the quality of the output speech are visually illustrated in Figs. 2, 3 and 4.
Accuracy of MOS Determination
To further evaluate the quality of the proposed methodology's results, 100 text samples with known MOS scores are considered. The corresponding speech is synthesized for them using the GLA, FGLA and GAN based vocoders, and the MOS of the synthesized speech is evaluated. If the variation between the MOS of the synthesized speech, i.e. MOS_s, and the known MOS, i.e. MOS_g, is less than the error margin e, then the sample is assumed to be accurately determined. The error margin is taken as 0.45, which is 5% of the MOS of a natural human voice [48]. The calculations are done as per Eq. 8 and the results for the considered samples are summarized in Table 8.

    |MOS_g − MOS_s| < e  =>  Accurately determined.                (8)

Where:
    e: error margin.
    MOS_g: ground truth MOS.
    MOS_s: MOS of synthesized speech.
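A minimal sketch of this accuracy computation, with the hypothetical helper name mos_accuracy:

```python
def mos_accuracy(mos_ground, mos_synth, e=0.45):
    """Eq. 8: a sample is accurately determined when |MOS_g - MOS_s| < e,
    with e = 0.45, i.e. 5% of the MOS of a natural human voice [48]."""
    hits = sum(abs(g - s) < e for g, s in zip(mos_ground, mos_synth))
    return 100.0 * hits / len(mos_ground)   # accuracy as a percentage
```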
Table 8: Accuracy of MOS determination.

| Vocoder | Avg. MOS | Min. MOS | Max. MOS | Accuracy |
| GLA | | | | |
| FGLA | | | | |
| GAN Vocoder | | | | |
The accuracy for FGLA came out to be much better than that of GLA and comparable to the GAN based neural vocoder. It should be noted that the GAN vocoder has been evaluated on a CPU, in line with the goal of optimizing the waveform processing from linear spectrograms for real-time devices with limited processing capabilities.

Synthesis Delay Analysis

'Synthesis Delay' is the time required for the output speech to start getting produced by the TTS system. It has been computed and compared for all the use-case sentences described in Table 7 and the datasets mentioned in Section 5.1.2. The observations have been drawn for GLA with 60 iterations, FGLA with 30 iterations and the GAN based neural vocoder [30]. Every sentence is synthesized 10 times and then the average synthesis delay has been calculated, with the same computational configuration maintained on the testing machines. As our aim is to optimize the waveform processing for CPU based low-memory devices in real-time usage, the waveform reconstruction has been carried out on the CPU for GLA, FGLA and the GAN based vocoder.
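The measurement protocol can be sketched as follows; synthesize stands for any text-to-waveform callable (for instance the trained Tacotron pipeline) and is an assumption of this illustration.

```python
import time

def average_synthesis_delay(synthesize, text, runs=10):
    """Average wall-clock synthesis time in ms over repeated runs,
    mirroring the 10-run averaging protocol described above."""
    total = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        synthesize(text)                      # text-to-waveform call under test
        total += (time.perf_counter() - start) * 1000.0
    return total / runs
```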
Fig. 2: Quality analysis of synthesized speech for the LJSpeech dataset (MOS vs. training steps, 40k to 320k).
Fig. 3: Quality analysis of synthesized speech for the Tatoeba dataset (MOS vs. training steps, 40k to 320k).
Fig. 4: Quality analysis of synthesized speech for the Blizzard dataset (MOS vs. training steps, 40k to 320k).

The observed synthesis times for the aforementioned cases are depicted numerically in Table 9 and visually in Fig. 5. FGLA came out to be 49.12%, 33.57% and 26.52% faster than GLA in terms of synthesis delay for the LJSpeech, Tatoeba and Blizzard datasets respectively. The overall reduction in the synthesis delay has been observed to be 36.58%, while the average speech synthesis time for the GAN based vocoder on the CPU came out to be 3.65, 2.88 and 2.76 times more than that of FGLA. It should also be noted that FGLA produced better quality speech with fewer iterations as compared to GLA. With the proposed waveform generation strategy, the LJSpeech dataset has shown more reduction in the synthesis delay than the other datasets. In the context of MOS scores, the Tatoeba dataset showed faster progress while the Blizzard dataset showed lower values as compared to LJSpeech.
Convergence Analysis

As discussed in Section 5.1.4, the appropriate number of iterations for FGLA was determined as 30. Here, we have verified that choice by observing the convergence of the plots of the resulting waveforms. The comparative analysis is easier to observe in the frequency domain; that is why the Fourier transforms of the waveforms are considered. The transforms of the speech produced by FGLA with 30 iterations and FGLA with 60 iterations are plotted and compared, and the two plots are found to overlap with each other. Similarly, the Fourier transforms of the waveforms produced by GLA with 30 iterations and GLA with 60 iterations are plotted and compared, and these plots do not overlap. This means that the waveforms produced by FGLA with 30 iterations and FGLA with 60 iterations are the same; thus, good quality speech is produced using FGLA with 30 iterations itself, whereas GLA with 30 iterations does not yet match the speech quality of GLA with 60 iterations.

Table 9: Synthesis time (ms) for various use-cases and datasets.
| S.N. | Use-case Sentence | LJSpeech GLA(60) | LJSpeech FGLA(30) | LJSpeech GAN | Tatoeba GLA(60) | Tatoeba FGLA(30) | Tatoeba GAN | Blizzard GLA(60) | Blizzard FGLA(30) | Blizzard GAN |
| 1. | He said to him, "Is not your name Ahmed?" | 10026 | 5173 | 18900 | 9684 | 6456 | 17950 | 9800 | 6725 | 16850 |
| 2. | All of a sudden, there was a loud screaming, "Please help me!" | 9698 | 4940 | 18190 | 10416 | 6880 | 19950 | 8997 | 6726 | 18870 |
| 3. | I think I lost my wallet! I can't find it anywhere! Oh, I could just kick myself! | 9702 | 4920 | 17000 | 10180 | 6799 | 20010 | 8957 | 6702 | 19010 |
| 4. | "Sunshine on my shoulders makes me happy, sunshine in my eyes can make me cry." | 9849 | 4959 | 18700 | 10550 | 7010 | 19880 | 8940 | 6757 | 18550 |
| 5. | As the stranger entered the town, he was met by a policeman who asked, "Are you a traveler?" "So it would appear", he replied carelessly. | 9828 | 4992 | 18560 | 10300 | 6821 | 20150 | 9129 | 6763 | 19950 |
Fig. 5: Synthesis delay for the use-cases formulated in Table 7 (average time in ms per dataset for GLA with 60 iterations, FGLA with 30 iterations and the GAN based vocoder).

The convergence plots for the fourth sentence from Table 7 are shown in Fig. 6 and Fig. 7 for GLA and FGLA respectively. The fourth sentence captures sufficient variations in terms of sentence length, special characters, punctuation marks, etc. The plots for the rest of the sentences are included in the supplementary material.
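A sketch of how such convergence plots can be produced is given below; compare_spectra is a hypothetical helper, and the commented usage assumes the fgla sketch from Section 4.

```python
import numpy as np
import matplotlib.pyplot as plt

def compare_spectra(waves, labels):
    """Overlay the magnitude spectra of synthesized waveforms, as in
    Figs. 6 and 7; overlapping curves indicate convergence."""
    for wave, label in zip(waves, labels):
        plt.plot(np.abs(np.fft.rfft(wave)), label=label)
    plt.xlabel("Frequency")
    plt.ylabel("Amplitude")
    plt.legend()
    plt.show()

# e.g.: compare_spectra([fgla(mag, n_iter=30), fgla(mag, n_iter=60)],
#                       ["FGLA 30 itr", "FGLA 60 itr"])
```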
6 Conclusion

In this paper, an FGLA based method is proposed to optimize the waveform generation process and reduce the speech synthesis delay. The final speech, i.e., the waveform, is reconstructed from the intermediate spectrogram. GLA has mostly been used for this, especially when the phase information of the waveform is missing. But GLA is slow, which causes delay in the speech synthesis. A faster alternative to GLA, i.e., FGLA, has been used in the proposed method, resulting in a 36.58% reduction in speech synthesis time. In the presented work, experiments were performed to optimize waveform generation from the linear spectrogram in single-speaker TTS systems. The proposed approach is compared against GLA and a GAN based neural vocoder in terms of speech quality and synthesis delay. The quality of the synthesized speech has been checked using MOS based evaluation and is observed to be retained in spite of the reduction in the synthesis time. The number of training steps and iterations were determined by experimental observation, and this choice was verified through the convergence of the Fourier transform plots of the resultant waveforms.
Fig. 6: GLA convergence plots. Frequency analysis of output waveforms (amplitude vs. frequency): (a) GLA 20 iterations, (b) GLA 30 iterations, (c) GLA 60 iterations, (d) GLA 20, 30 and 60 iterations overlaid.
In future, we will work to optimize the waveform processing for TTS systems trained with multi-speaker datasets using mel-spectrograms as the intermediate representation. We will also explore more neural vocoders for speech synthesis in real-time applications. It is planned to work on the challenges involved with them, such as reducing the model size and the computational requirements, especially for synthesizing small sentences.

Fig. 7: FGLA convergence plots. Frequency analysis of output waveforms (amplitude vs. frequency): (a) FGLA 20 iterations, (b) FGLA 30 iterations, (c) FGLA 60 iterations, (d) FGLA 20, 30 and 60 iterations overlaid.
References
1. Hideyuki Mizuno, Masanobu Abe, and Tomohisa Hirokawa. Waveform-based speech synthesis approach with a formant frequency modification. In IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 2, pages 195–198, 1993.
2. Yi Zhao, Shinji Takaki, Hieu-Thi Luong, Junichi Yamagishi, Daisuke Saito, and Nobuaki Minematsu. Wasserstein GAN and waveform loss-based acoustic model training for multi-speaker text-to-speech synthesis systems using a WaveNet vocoder. IEEE Access, 6:60478–60488, 2018.
3. Boland T Jones, David Michael Guthrie, Laurence Schaefer, and J Douglas Martin. Real-time speech-to-text conversion in an audio conference session, January 31 2017. US Patent 9,560,206.
4. Pravin Ghate and S D Shirbahadurkar. A survey on methods of TTS and various tests for evaluating the quality of synthesized speech. International Journal of Development Research, 07:15236–15239, 2017.
5. Andy Aaron, Raimo Bakis, Ellen M Eide, and Wael M Hamza. Systems and methods for text-to-speech synthesis using spoken example, November 11 2014. US Patent 8,886,538.
6. Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.
7. Suyoun Kim, Takaaki Hori, and Shinji Watanabe. Joint CTC-attention based end-to-end speech recognition using multi-task learning. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4835–4839, 2017.
8. Ye Jia, Yu Zhang, Ron Weiss, Quan Wang, Jonathan Shen, Fei Ren, Patrick Nguyen, Ruoming Pang, Ignacio Lopez Moreno, Yonghui Wu, et al. Transfer learning from speaker verification to multispeaker text-to-speech synthesis. In Advances in Neural Information Processing Systems (NeurIPS), pages 4480–4490, 2018.
9. Daniel Griffin and Jae Lim. Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2):236–243, 1984.
10. Yoshiki Masuyama, Kohei Yatabe, and Yasuhiro Oikawa. Griffin–Lim like phase recovery via alternating direction method of multipliers. IEEE Signal Processing Letters, 26(1):184–188, 2018.
11. N. Perraudin, P. Balazs, and P. L. Søndergaard. A fast Griffin-Lim algorithm. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, pages 1–4, 2013.
12. A. J. Hunt and A. W. Black. Unit selection in a concatenative speech synthesis system using a large speech database. In IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings, volume 1, pages 373–376, 1996.
13. Geert Coorman, Filip Deprez, Mario De Bock, Justin Fackrell, Steven Leys, Peter Rutten, Jan De Moortel, Andre Schenk, and Bert Van Coile. Speech synthesis using concatenation of speech waveforms, May 15 2007. US Patent 7,219,060.
14. Keiichi Tokuda, Yoshihiko Nankaku, Tomoki Toda, Heiga Zen, Junichi Yamagishi, and Keiichiro Oura. Speech synthesis based on hidden Markov models. Proceedings of the IEEE, 101(5):1234–1252, 2013.
15. Heiga Zen, Keiichi Tokuda, and Alan W Black. Statistical parametric speech synthesis. Elsevier Speech Communication, 51(11):1039–1064, 2009.
16. Keiichi Tokuda and Heiga Zen. Directly modeling speech waveforms by neural networks for statistical parametric speech synthesis. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4215–4219, 2015.
17. Junichi Yamagishi, Bela Usabaev, Simon King, Oliver Watts, John Dines, Jilei Tian, Rile Hu, Yong Guan, Keiichiro Oura, Keiichi Tokuda, Reima Karhila, and Mikko Kurimo. Thousands of voices for HMM-based speech synthesis–analysis and application of TTS systems built on various ASR corpora. IEEE Transactions on Audio, Speech, and Language Processing, 18:984–1004, 2009.
18. Takashi Masuko, Keiichi Tokuda, Takao Kobayashi, and Satoshi Imai. Voice characteristics conversion for HMM-based speech synthesis system. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), volume 3, pages 1611–1614, 1997.
19. Soojeong Lee and Joon-Hyuk Chang. Spectral difference for statistical model-based speech enhancement in speech recognition. Springer Multimedia Tools and Applications, 76(23):24917–24929, 2017.
20. Jose Sotelo, Soroush Mehri, Kundan Kumar, Joao Felipe Santos, Kyle Kastner, Aaron Courville, and Yoshua Bengio. Char2Wav: End-to-end speech synthesis. 2017.
21. Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.
22. Yuxuan Wang, RJ Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron J Weiss, Navdeep Jaitly, Zongheng Yang, Ying Xiao, Zhifeng Chen, Samy Bengio, et al. Tacotron: Towards end-to-end speech synthesis. arXiv preprint arXiv:1703.10135, 2017.
23. Jonathan Shen, Ruoming Pang, Ron J Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, et al. Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4779–4783, 2018.
24. Sercan O Arik, Mike Chrzanowski, Adam Coates, Gregory Diamos, Andrew Gibiansky, Yongguo Kang, Xian Li, John Miller, Andrew Ng, Jonathan Raiman, et al. Deep Voice: Real-time neural text-to-speech. In Proceedings of the 34th International Conference on Machine Learning (ICML), volume 70, pages 195–204, 2017.
25. Andrew Gibiansky, Sercan Arik, Gregory Diamos, John Miller, Kainan Peng, Wei Ping, Jonathan Raiman, and Yanqi Zhou. Deep Voice 2: Multi-speaker neural text-to-speech. In Advances in Neural Information Processing Systems, pages 2962–2970, 2017.
26. Wei Ping, Kainan Peng, Andrew Gibiansky, Sercan O Arik, Ajay Kannan, Sharan Narang, Jonathan Raiman, and John Miller. Deep Voice 3: Scaling text-to-speech with convolutional sequence learning. arXiv preprint arXiv:1710.07654, 2017.
27. Yaniv Taigman, Lior Wolf, Adam Polyak, and Eliya Nachmani. VoiceLoop: Voice fitting and synthesis via a phonological loop. arXiv preprint arXiv:1707.06588, 2017.
28. D. Griffin and Jae Lim. Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2):236–243, April 1984.
29. Masanori Morise, Fumiya Yokomori, and Kenji Ozawa. WORLD: a vocoder-based high-quality speech synthesis system for real-time applications. IEICE Transactions on Information and Systems, 99(7):1877–1884, 2016.
30. Keisuke Oyamada, Hirokazu Kameoka, Takuhiro Kaneko, Kou Tanaka, Nobukatsu Hojo, and Hiroyasu Ando. Generative adversarial network-based approach to signal reconstruction from magnitude spectrogram. In European Signal Processing Conference (EUSIPCO), pages 2514–2518, 2018.
31. Kundan Kumar, Rithesh Kumar, Thibault de Boissiere, Lucas Gestin, Wei Zhen Teoh, Jose Sotelo, Alexandre de Brebisson, Yoshua Bengio, and Aaron C Courville. MelGAN: Generative adversarial networks for conditional waveform synthesis. In Advances in Neural Information Processing Systems, pages 14881–14892, 2019.
32. Ryan Prenger, Rafael Valle, and Bryan Catanzaro. WaveGlow: A flow-based generative network for speech synthesis. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 3617–3621, 2019.
33. Aaron van den Oord, Yazhe Li, Igor Babuschkin, Karen Simonyan, Oriol Vinyals, Koray Kavukcuoglu, George van den Driessche, Edward Lockhart, Luis C Cobo, Florian Stimberg, et al. Parallel WaveNet: Fast high-fidelity speech synthesis. arXiv preprint arXiv:1711.10433, 2017.
34. Kishore Prahallad. Speech technology: Spectrogram, cepstrum and mel-frequency analysis. https://archive.org/details/SpectrogramCepstrumAndMel-frequency_636522, 2016.
35. Sercan O Arik, Heewoo Jun, and Gregory Diamos. Fast spectrogram inversion using multi-head convolutional neural networks. IEEE Signal Processing Letters, 26(1):94–98, 2018.
36. Fisher Yu and Vladlen Koltun. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122, 2015.
37. Zhiyong Cheng and Jialie Shen. On effective location-aware music recommendation. ACM Transactions on Information Systems (TOIS), 34(2):1–32, 2016.
38. Yoshiki Masuyama, Kohei Yatabe, Yuma Koizumi, Yasuhiro Oikawa, and Noboru Harada. Deep Griffin–Lim iteration. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 61–65, 2019.
39. Ronald N. Bracewell. The Fourier Transform and Its Applications. McGraw-Hill, New York, 1986.
40. T Malathi and Manas Kamal Bhuyan. Performance analysis of Gabor wavelet for extracting most informative and efficient features. Springer Multimedia Tools and Applications, 76(6):8449–8469, 2017.
41. H V Sorensen, D Jones, Michael Heideman, and C Burrus. Real-valued fast Fourier transform algorithms. IEEE Transactions on Acoustics, Speech, and Signal Processing, 35(6):849–863, 1987.
42. Shie Qian and Dapang Chen. Discrete Gabor transform. IEEE Transactions on Signal Processing, 41(7):2429–2438, 1993.
43. Marc Levoy. Volume Rendering Using the Fourier Projection-Slice Theorem. Computer Systems Laboratory, Stanford University, 1992.
44. Keith Ito. The LJ Speech dataset. https://keithito.com/LJ-Speech-Dataset/, 2017.
45. Sysko. Tatoeba speech dataset. https://tatoeba.org/eng/, 2013.
46. Norbert Braunschweiler, Mark J. F. Gales, and Sabine Buchholz. Lightly supervised recognition for automatic alignment of large coherent speech recordings. In INTERSPEECH, 2010.
47. Pier Luigi Salza, Enzo Foti, Luciano Nebbia, and Mario Oreglia. MOS and pair comparison combined methods for quality evaluation of text-to-speech systems. Acta Acustica united with Acustica, 82(4):650–656, 1996.
48. Bret Kinsella. Speech synthesis becomes more humanlike. https://voicebot.ai/2017/12/21/speech-synthesis-becomes-humanlike/, 2017.