A Transfer Learning End-to-End Arabic Text-To-Speech (TTS) Deep Architecture
Fady K. Fahmy, Mahmoud I. Khalil, and Hazem M. Abbas
Ain Shams University, Dept. of Computer and Systems Engineering, Cairo, Egypt
[email protected], {mahmoud.khalil,hazem.abbas}@eng.asu.edu.eg

Abstract.
Speech synthesis is the artificial production of human speech. A typical text-to-speech (TTS) system converts a language text into a waveform. There exist many English TTS systems that produce mature, natural, and human-like speech synthesizers. In contrast, other languages, including Arabic, have not been considered until recently. Existing Arabic speech synthesis solutions are slow, of low quality, and the naturalness of the synthesized speech is inferior to that of English synthesizers. They also lack essential speech key factors such as intonation, stress, and rhythm. Different works were proposed to solve those issues, including the use of concatenative methods such as unit selection or parametric methods. However, they required a lot of laborious work and domain expertise. Another reason for the poor performance of Arabic speech synthesizers is the lack of speech corpora, unlike English, which has many publicly available corpora and audiobooks (e.g., LJSpeech, https://keithito.com/LJ-Speech-Dataset/, and Blizzard 2012). This work describes how to generate high-quality, natural, and human-like Arabic speech using an end-to-end neural deep network architecture. It uses only ⟨text, audio⟩ pairs with a relatively small amount of recorded audio samples totalling 2.41 hours. It illustrates how to use English character embedding despite using diacritic Arabic characters as input, and how to preprocess these audio samples to achieve the best results.
Keywords:
Tacotron 2, WaveGlow, Arabic text-to-speech, speech synthesis, deep learning, neural networks
1 Introduction

Speech synthesis has been a challenging task for decades. Conventional text-to-speech (TTS) systems are usually made up of several components connected through a pipeline that includes text analysis frontends, acoustic models, and audio synthesis models. Building each component in a conventional TTS system often requires comprehensive domain expertise and a lot of laborious work like feature engineering and annotation. Besides, errors generated by each component propagate to later stages, making it hard to identify the source of the final perceived error.

Researchers have adopted concatenative speech synthesis [1, 2] for years. The idea is based on selecting and concatenating units (phonemes) from a large database to generate intelligible speech. Such units could be any of the following: phones (a distinct speech sound or gesture, regardless of whether the exact sound is critical to the meanings of words), diphones (two connected half-phones that start in the middle of the first phone and end in the middle of the second phone), half-phones, syllables, morphemes, words, phrases, or sentences. Generally, the longer the unit, the larger the database that must cover the unit with different prosodies. The drawbacks of concatenative methods for speech synthesis are (a) they need massive databases for large unit sizes, (b) noise captured while recording units may degrade the quality of synthesized speech, since the recorded units are reproduced as-is during synthesis, and (c) the massive amount of labeling and recording required.

Statistical parametric speech synthesis based on Hidden Markov Models (HMM) [3, 4] showed an increasing adoption rate and popularity over time. It solved many problems of concatenative methods: (a) it models prosodic variation by modifying HMM parameters, thus solving the problem of large databases, (b) it has proved to have lower word error rates, which leads to better understandability, and (c) it is more robust, because the pre-recorded units in unit selection synthesis could be recorded in different environments with different noise profiles. The drawbacks of HMM-based synthesis include (a) requiring a lot of feature engineering and domain expertise, and (b) generated speech that sounds more robotic than speech generated by unit selection speech synthesis.

Deep neural network architectures have proved extraordinarily efficient at learning the inherent features of data. WaveNet [5] is a generative model for generating waveforms based on PixelCNN [6]. It has outperformed production-level parametric methods in terms of naturalness. Still, it has two significant drawbacks: (a) it requires conditioning on linguistic features from an existing TTS system, so it is not a fully end-to-end system, and (b) it synthesizes speech very slowly due to the auto-regressive nature of the architecture. Deep Voice [7] is another example of a deep neural architecture. It has proven high performance, production-level quality, and real-time synthesis. It consists of five stages, namely a segmentation model for locating phoneme boundaries, a grapheme-to-phoneme conversion model, a phoneme duration prediction model, a fundamental frequency prediction model, and an audio synthesis model. Deep Voice is a step towards a genuinely end-to-end neural network architecture.

With the introduction of end-to-end architectures such as Tacotron [8], much of the laborious work to synthesize speech is alleviated.
Such examples of laborious work include feature engineering and human annotation (although a little human annotation is still needed to prepare the ⟨text, audio⟩ pairs for training). Tacotron is a generative text-to-speech model based on a sequence-to-sequence model with an attention mechanism [9], taking characters as input and producing audio waveforms. Tacotron uses content-based attention [10], where the context vector is concatenated with the attention RNN cell output to provide an input to the decoder RNNs. Tacotron 2 [11] is a natural evolution of Tacotron. It offers a unified, purely neural network approach and eliminates the non-neural-network parts used previously by Tacotron, such as the Griffin-Lim reconstruction algorithm used to synthesize speech. Tacotron 2 consists of two main components: (a) a recurrent sequence-to-sequence generative model with attention, and (b) a modified WaveNet acting as a vocoder to synthesize the speech signal. Tacotron 2 uses hybrid attention [12] (both location-based and content-based attention).

This paper describes how to use a modified deep architecture from Tacotron 2 [11] to generate mel-spectrograms from Arabic diacritic text as an intermediate feature representation, followed by a WaveGlow architecture acting as a vocoder to produce high-quality Arabic speech. The proposed model is trained from a published pre-trained Tacotron 2 English model using a dataset with a total of 2.41 hours of recorded speech.

The rest of this paper is organized as follows: Sec. 2 presents a review of related works in the Arabic TTS domain. Sec. 3 describes the proposed model architecture, including its two main components, the feature prediction network and WaveGlow, while Sec. 4 introduces the training setup and procedures, the issues faced in training, and the quantitative and qualitative evaluation of the results. Finally, the paper is concluded in Sec. 5.

2 Related Work

Many works cover Arabic text-to-speech synthesis aiming to generate good and human-like speech. In [13], Y. A. El-Imam uses a set of sub-phonetic elements as the synthesis units to allow synthesis of unlimited-vocabulary speech of good quality. The input to the system is Arabic diacritic spelling or simple numeric expressions.

Ossama Abdel-Hamid et al., in [14], managed to improve synthesized Arabic speech using an HMM-based approach. They used a statistical model to generate Arabic speech parameters such as the spectrum, the fundamental frequency (F0), and phoneme durations. Then, the authors applied a multi-band excitation model and used samples extracted from the spectral envelope as spectral parameters.

Speech synthesis from diacritic text, such as [15], has gained a lot of momentum because of the lack of Arabic diacritic databases for speech synthesis. That work discusses two methods to recognize appropriate diacritic marks for Arabic text: a machine learning approach and a dictionary method. It uses a statistical parametric approach with non-uniform unit sizes for speech synthesis. It employs variable-sized units, as this has proven to be more effective than using fixed-size units such as phonemes and diphones. It partially solves some problems of classical statistical parametric methods.
Such issues are speech quality, articulatory effects, and discontinuity effects. This work aimed to build an Arabic TTS system with the integration of a diacritization system.

Studying Arabic phonetics [16] for speech synthesis and corpus design is vital to provide a corpus that has excellent coverage of phonetics and phonology. We have used the corpus generated from [16] in the training phase of the spectrogram prediction network model. We have also used another technique in this work to phonetize diacritic Arabic characters as part of training the spectrogram prediction network.

The work [17], by Imene Zangar and Zied Mnasri, uses deep neural networks (DNN) for duration modeling for Arabic speech synthesis. The authors compare duration modeling using Hidden Markov Models (HMM) with duration modeling based on deep neural networks of different architectures to minimize the root mean square prediction error (RMSE). They concluded that DNN-based duration modeling outperformed both the HMM-based modeling from the HTS toolkit and the DNN-based modeling from the MERLIN toolkit.
3 Proposed Model Architecture

Unlike conventional methods for speech synthesis, end-to-end neural network architectures not only alleviate the need for extensive domain expertise and laborious feature engineering, but they also require minimal human annotation. They can be conditioned on any language, gender, or sentiment. Conventional TTS synthesizers consist of many stages, each trained separately, which can cause each component's error to cascade to later stages. End-to-end architectures are structured as a single component and thus can be more robust.

In this work, a slightly modified version of the model described in [11] is adopted, where the WaveNet part is replaced with a flow-based implementation of WaveGlow [18].
Hence, the proposed model shown in Fig. 1 consists of two components:

1. A sequence-to-sequence spectrogram prediction network with attention, which takes diacritic Arabic text as input and predicts the corresponding mel-spectrogram as output.
2. A flow-based implementation of WaveGlow, which takes the mel-spectrograms as input and generates a time-domain waveform of the input text.

Fig. 1. Block diagram of the spectrogram prediction network with WaveGlow; it takes diacritic Arabic characters as input and produces an audio waveform as output [11].

There are many advantages of using mel-frequency spectrograms as an intermediate feature representation between the spectrogram prediction network and WaveGlow. They include:

(a) Mel-spectrograms can be computed easily from time-domain waveforms (see the sketch after this list), making it easy to train each of the two components separately.
(b) They are easier to train on than waveforms because they are phase invariant, so training can use simple loss functions such as the squared loss.
(c) Mel-frequency spectrograms are related to linear-frequency spectrograms (a spectrogram is a visual representation of the spectrum of frequencies of a signal as it varies with time). One can obtain a mel-frequency spectrogram from a linear-frequency spectrogram by converting the frequency axis to a log scale and the "colour" axis, the amplitude, to decibels.
(d) Because mel-frequency spectrograms use the mel-frequency scale, they emphasize details in the lower frequencies, which are essential for speech naturalness, and give less attention to the higher frequencies, which are less critical for human perception.
(e) It is straightforward to condition WaveGlow on a mel-frequency spectrogram to generate good-quality speech.
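To make advantage (a) concrete, the following is a minimal sketch (not the authors' code) of computing a mel-spectrogram from a time-domain waveform and converting its amplitude to decibels. The file name, the 80 mel bands, and the frequency range are assumptions for illustration; the 1024-sample filter length and 256-sample hop echo the values mentioned in Sec. 4.

```python
# Minimal sketch: mel-spectrogram extraction from a waveform (assumed parameters).
import librosa
import numpy as np

y, sr = librosa.load("sample.wav", sr=22050)   # hypothetical file; resampled on load
mel = librosa.feature.melspectrogram(
    y=y, sr=sr,
    n_fft=1024,          # filter length (assumed, as in the open-source setup)
    hop_length=256,      # hop length (assumed)
    n_mels=80,           # 80 mel bands, a common Tacotron 2 choice (assumed)
    fmin=0.0, fmax=8000.0,
)
mel_db = librosa.power_to_db(mel, ref=np.max)  # the "colour" axis in decibels
print(mel_db.shape)                            # (n_mels, n_frames)
```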
3.1 Spectrogram Prediction Network

As shown in Fig. 1, the spectrogram prediction network is a sequence-to-sequence architecture. It consists of an encoder that creates an internal representation of the input, which is fed to the decoder to generate the predicted mel-spectrogram. The encoder is made of three parts: a character embedding, three convolutional layers, and a bidirectional LSTM. It takes a character sequence as input and produces a hidden feature vector representation. The decoder is made of a two-layer LSTM network, a two-layer pre-net, a five-convolutional-layer post-net, and a linear projection. It consumes the hidden feature vector representation produced by the encoder and generates the mel-spectrograms of the given input characters. Since diacritic Arabic text is used as input, a text phonetization block is employed to transform the Arabic characters into another Unicode character set. The following block in the architecture is an embedding layer (a 512-dimensional vector) which represents each character symbol numerically. The output of the embedding layer is fed to three convolutional layers, each with 512 filters of dimension 5 × 1, followed by batch normalization [19] and ReLU activations [20], and then to the bidirectional LSTM that produces the encoded representation.

The attention context vector c_i is computed by multiplying the encoder outputs h_j and the attention weights \alpha_{ij}, as in Eq. 1:

c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j    (1)

\alpha_{ij} = \exp(e_{ij}) / \sum_{k=1}^{T_x} \exp(e_{ik})    (2)

e_{ij} = w^T \tanh(W s_{i-1} + V h_j + b)    (3)

where \alpha_{ij} is an attention weight and e_{ij} is an energy function. W and V are matrices, while w and b are vectors, and they are all trainable parameters.

Fig. 2. Hybrid attention mechanism used in the spectrogram prediction network [12].
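As a concrete illustration of Eqs. 1–3, here is a minimal sketch of the content-based part of the attention computation for one decoder step. It is a toy, not the authors' implementation: the attention dimension, the random weights, and the sequence length are assumptions; the encoder and decoder sizes follow the 512-dimensional and 1024-unit values quoted in the text.

```python
# Toy content-based attention step (Eqs. 1-3); shapes and values are illustrative.
import torch

T_x, enc_dim, dec_dim, attn_dim = 50, 512, 1024, 128   # attn_dim is an assumption
h = torch.randn(T_x, enc_dim)        # encoder outputs h_j
s_prev = torch.randn(dec_dim)        # previous decoder state s_{i-1}

W = torch.randn(attn_dim, dec_dim)   # trainable matrices and vectors (random here)
V = torch.randn(attn_dim, enc_dim)
w = torch.randn(attn_dim)
b = torch.randn(attn_dim)

e = torch.tanh(h @ V.T + s_prev @ W.T + b) @ w   # energies e_{ij}, Eq. 3
alpha = torch.softmax(e, dim=0)                  # attention weights, Eq. 2
c = alpha @ h                                    # context vector c_i, Eq. 1
print(c.shape)                                   # torch.Size([512])
```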
The output of the decoder layer is then fed to a pre-net, which consists of two fully connected layers of 256 hidden ReLU units, and then passed through two unidirectional LSTM layers of 1024 units. The concatenation of the LSTM output and the context vector is projected through a linear transformation to predict the mel-spectrogram, which is passed to a five-layer post-net. A scalar (stop token) is calculated in parallel by projecting the concatenation of the context vector with the decoder LSTM output through a sigmoid activation to predict when to stop generating speech at inference time. Mel-spectrograms are computed using a 50 ms frame size, a 12.5 ms frame hop, and a Hann window function. All convolutional layers are regularized using dropout [21], while the LSTM layers are regularized using zoneout [22].

3.2 WaveGlow

Fig. 3. Block diagram of the WaveGlow vocoder [18]; it takes a spectrogram as input and produces an audio waveform.
WaveGlow is a flow-based generative network that combines insights from Glow [23] and WaveNet. According to the authors of [18], it generates speech with quality as good as the best open-source implementations of WaveNet. However, it is much faster, as it is not auto-regressive and can fully utilize GPUs. It is trained alongside the spectrogram prediction network using the original mel-spectrograms as input and the audio clips as output. WaveGlow can be easily conditioned on mel-spectrograms to generate high-quality waveforms. The forward path, as shown in Fig. 3, takes a group of eight audio samples as a vector, as in [23], then passes the output through twelve steps of flow, each step consisting of an invertible 1 × 1 convolution followed by an affine coupling layer [24]. The weights W of the invertible convolutions are initialized to be orthogonal (an orthogonal matrix is a square matrix whose columns and rows are orthogonal unit vectors) and thus they are also invertible.
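As a rough illustration (not the authors' or NVIDIA's code) of why orthogonal initialization makes the 1 × 1 convolution invertible, the sketch below groups channels the way the text describes and shows that the transform can be undone exactly. The group size of eight follows the text; the random weights and tensor sizes are assumptions, and the affine coupling half of each flow step is omitted.

```python
# Sketch: an invertible 1x1 convolution over grouped audio samples (assumed setup).
import torch

n_group = 8                                            # eight samples grouped per vector
W = torch.linalg.qr(torch.randn(n_group, n_group))[0]  # orthogonal weight initialization
conv = torch.nn.Conv1d(n_group, n_group, kernel_size=1, bias=False)
conv.weight.data = W.unsqueeze(-1)                     # weight shape (out, in, 1)

audio = torch.randn(1, n_group, 2000)                  # (batch, grouped channels, time)
z = conv(audio)                                        # forward pass of the 1x1 convolution
# Since W is orthogonal (hence invertible), the step can be reversed exactly:
recovered = torch.nn.functional.conv1d(z, torch.inverse(W).unsqueeze(-1))
print(torch.allclose(audio, recovered, atol=1e-4))     # True
```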
Fig. 4. Phonetization examples. The left side of the figure shows diacritic Arabic words, while the right side shows the corresponding Unicode character symbols.
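To illustrate the kind of mapping Fig. 4 depicts, here is a toy sketch that transliterates a few diacritic Arabic characters to Latin symbols. The table below is a simplified, hypothetical mapping for illustration only; it is not the scheme produced by the actual phonetiser referenced in Sec. 4.

```python
# Toy transliteration of diacritic Arabic text to Latin symbols (illustrative mapping only).
ARABIC_TO_LATIN = {
    "ب": "b", "ت": "t", "س": "s", "ل": "l", "م": "m", "د": "d", "ر": "r", "ن": "n",
    "َ": "a",  # fatha
    "ُ": "u",  # damma
    "ِ": "i",  # kasra
    "ْ": "",   # sukun (no vowel)
}

def phonetize(word: str) -> str:
    """Map each character to its Latin symbol, keeping unknown characters as-is."""
    return "".join(ARABIC_TO_LATIN.get(ch, ch) for ch in word)

print(phonetize("دَرَسَ"))   # -> "darasa"
```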
4 Training and Evaluation

We have trained the spectrogram prediction network on Nawar Halabi's Arabic dataset [16] (http://en.arabicspeechcorpus.com/), which contains about 2.41 hours of Arabic speech, a total of 906 utterances, and 694,556 frames. The dataset consists of ⟨text, audio⟩ pairs. The input text is diacritic Arabic characters, while the output is a 16-bit 48 kHz PCM audio clip with a bit rate of 768 kbps. Since the dataset is relatively small, it is split into a 95% training set and a 5% validation set. The training was executed on a supercomputing environment (https://hpc.bibalex.org/).

The spectrogram prediction network was trained separately using diacritic Arabic characters as input and the original mel-spectrograms at the decoder side as the target. Because of the small dataset size, we were not able to learn the character embedding, nor the attention between encoder and decoder, perfectly, and the quality of the generated speech was poor. As a result, we utilized transfer learning from English by (a) transforming diacritic Arabic words into English characters using an open-source phonetization algorithm (https://github.com/nawarhalabi/Arabic-Phonetiser); refer to the Text Phonetization block in Fig. 1 and the phonetization examples in Fig. 4, and (b) using a pre-trained English model (https://drive.google.com/file/d/1c5ZTuT7J08wLUoVZ2KkUs VdZuJ86ZqA/view) with the learned English character embedding to be able to fully train the attention mechanism. The audio training clips were down-sampled to 22050 Hz in order to use the same audio parameters, such as the hop length and the filter length, as those in the open-source implementation (https://github.com/NVIDIA/tacotron2) trained on the LJSpeech dataset. Silent moments (below 60 decibels) of each training sample were removed using a frame size of 1024 and a hop size of 256, which greatly helped to align the attention graph shown in Fig. 5.

Other training parameters are: a batch size of 8 on 2 GPUs, the Adam optimizer [25] with β₁ = 0.9, β₂ = 0.999, and ε = 10⁻⁶, a constant learning rate of 10⁻³, and L2 regularization with weight 10⁻⁶. A training epoch took, on average, about 15 minutes, while only about 2 seconds were needed to generate a waveform.
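The following is a minimal sketch, under assumed file names (not the authors' script), of the preprocessing just described: resampling to 22050 Hz and removing segments quieter than 60 dB using a 1024-sample frame and a 256-sample hop.

```python
# Sketch of the described preprocessing: resample to 22050 Hz and drop silent segments.
import librosa
import numpy as np
import soundfile as sf

y, sr = librosa.load("utterance_048.wav", sr=22050)   # hypothetical file; resampled on load
intervals = librosa.effects.split(
    y, top_db=60,            # anything more than 60 dB below the peak counts as silence
    frame_length=1024,       # frame size used for the energy computation
    hop_length=256,          # hop size between frames
)
y_trimmed = np.concatenate([y[start:end] for start, end in intervals])
sf.write("utterance_048_trimmed.wav", y_trimmed, sr)
```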
Fig. 5. Alignment graphs at different steps of training.
For quantitative analysis, both training and validation losses were assessed as metrics. A simple mean squared error (MSE) loss between the predicted and target mel-spectrograms was calculated.

For qualitative analysis, the attention alignment graph was used as a metric. The attention alignment graph indicates how correctly the decoder is attending to the encoder input. The encoder reads the input step by step and produces state vectors, and the decoder reads all state vectors and produces audio frames step by step. A good alignment simply means that an "A" sound generated by the decoder should be the result of focusing on the vector generated by the encoder from reading an "A" character. The diagonal line is the result of audio frames being generated by focusing (paying attention) on the correct input characters. Fig. 5 shows that the spectrogram prediction network was continually improving in learning attention throughout the training process. This helped to eliminate some pronunciation errors as well as to remove some pauses in the generated speech. Our model started to pick up alignment after about 40 epochs of training.

Further qualitative analysis was carried out using human ratings in a setup similar to Amazon's Mechanical Turk. We used a pre-trained WaveGlow model to synthesize ten randomly selected samples of spoken sentences (https://drive.google.com/file/d/1rpK8CzAAirq9sWZhe9nlfvxMF1dRgFbF/view). Each sample was rated by 26 raters on a scale from 1 to 5 with a step of 0.5 to calculate a subjective mean opinion score (MOS) for audio naturalness, and each evaluation was conducted independently. Table 1 compares the proposed architecture with other architectures sampled from [26], such as concatenative methods with HMMs and Tacotron with the Griffin-Lim algorithm as the synthesizer. Fig. 6 shows the detailed raters' reviews for each of the test samples, where each entry is the sum of all 26 ratings divided by the number of raters (26).
Table 1. MOS evaluation for different system architectures.

System architecture                      MOS
Concatenative methods with HMMs          3.89
Tacotron 1 with Griffin-Lim algorithm    4.02
Tacotron 2 with WaveGlow (proposed)      4.21
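As a small illustration of how the per-sample scores in Fig. 6 and the overall MOS in Table 1 are aggregated, here is a sketch using made-up ratings (the real rating matrix is not reproduced here); the shapes follow the described setup of 10 samples and 26 raters.

```python
# Sketch of MOS aggregation: 10 test samples, 26 raters, ratings 1-5 in steps of 0.5.
import numpy as np

rng = np.random.default_rng(0)
ratings = rng.choice(np.arange(1.0, 5.5, 0.5), size=(10, 26))  # made-up ratings matrix

per_sample_mos = ratings.mean(axis=1)   # each Fig. 6 entry: sum of 26 ratings / 26
overall_mos = per_sample_mos.mean()     # single MOS as reported in Table 1

print(np.round(per_sample_mos, 2))
print(round(float(overall_mos), 2))
```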
Fig. 6. Human judgement over ten randomly selected test samples.
5 Conclusion

This paper describes how to use the Tacotron 2 architecture to generate an intermediate feature representation from Arabic diacritic text using a pre-trained English model and a total of 2.41 hours of recorded speech, followed by WaveGlow as a vocoder to synthesize high-quality Arabic speech. It shows the viability of successfully applying transfer learning from English text-to-speech to Arabic text-to-speech, in spite of the fact that the two languages are quite different in terms of character-level embedding and language phonemes. It also describes how to preprocess the audio speech training data to obtain plausible generated speech.

There are many possible future enhancements for this work. They may include integrating an Arabic diacritizer, which would reduce the amount of manual work needed to diacritize a given Arabic text. Another possible enhancement is to model speech prosody (intonation, stress, and rhythm) for expressive and more human-like speech. Modeling prosody could be done using an architecture similar to Tacotron but with additional neural networks that embed prosody into the encoded text before it is processed by the same sequence-to-sequence architecture. Last but not least, training the model on a much larger dataset would generally produce more plausible speech quality.
Acknowledgment
The authors would like to thank the Bibliotheca Alexandrina for providing the computing resources through their Supercomputing Facility (https://hpc.bibalex.org).

References
1. A. J. Hunt and A. W. Black, “Unit selection in a concatenative speech synthesis system using a large speech database,” in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 1996, pp. 373–376.
2. C. Hamon, E. Moulines, and F. Charpentier, “A diphone synthesis system based on time-domain prosodic modifications of speech,” in International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 1989, pp. 238–241.
3. K. Tokuda, Y. Nankaku, T. Toda, H. Zen, J. Yamagishi, and K. Oura, “Speech synthesis based on hidden Markov models,” Proceedings of the IEEE, pp. 1234–1252, 2013.
4. K. Yu and S. Young, “Continuous F0 modeling for HMM based statistical parametric speech synthesis,” IEEE Transactions on Audio, Speech, and Language Processing, pp. 1071–1079, 2011.
5. A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. W. Senior, and K. Kavukcuoglu, “WaveNet: A generative model for raw audio,” CoRR, vol. abs/1609.03499, 2016.
6. A. van den Oord, N. Kalchbrenner, and K. Kavukcuoglu, “Pixel recurrent neural networks,” CoRR, vol. abs/1601.06759, 2016.
7. S. Ö. Arik, M. Chrzanowski, A. Coates, G. Diamos, A. Gibiansky, Y. Kang, X. Li, J. Miller, J. Raiman, S. Sengupta, and M. Shoeybi, “Deep Voice: Real-time neural text-to-speech,” CoRR, vol. abs/1702.07825, 2017.
8. Y. Wang, R. J. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio, Q. V. Le, Y. Agiomyrgiannakis, R. Clark, and R. A. Saurous, “Tacotron: A fully end-to-end text-to-speech synthesis model,” CoRR, vol. abs/1703.10135, 2017.
9. I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” Advances in Neural Information Processing Systems, vol. abs/1409.3215, pp. 3104–3112, 2014.
10. S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer, “Scheduled sampling for sequence prediction with recurrent neural networks,” Advances in Neural Information Processing Systems, vol. abs/1506.03099, pp. 1171–1179, 2015.
11. J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. J. Skerry-Ryan, R. A. Saurous, Y. Agiomyrgiannakis, and Y. Wu, “Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions,” CoRR, vol. abs/1712.05884, 2017.
12. J. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, “Attention-based models for speech recognition,” CoRR, vol. abs/1506.07503, 2015.
13. Y. A. El-Imam, “An unrestricted vocabulary Arabic speech synthesis system,” IEEE Transactions on Acoustics, Speech, and Signal Processing, pp. 1829–1845, 1989.
14. O. Abdel-Hamid, S. Abdou, and M. Rashwan, “Improving Arabic HMM based speech synthesis quality,” in INTERSPEECH 2006 and 9th International Conference on Spoken Language Processing, INTERSPEECH 2006 – ICSLP, 2006.
15. I. Rebai and Y. BenAyed, “Arabic speech synthesis and diacritic recognition,” International Journal of Speech Technology, pp. 485–494, 2016.
16. N. Halabi and M. Wald, “Phonetic inventory for an Arabic speech corpus,” in Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), 2016, pp. 734–738.
17. I. Zangar, Z. Mnasri, V. Colotte, D. Jouvet, and A. Houidhek, “Duration modeling using DNN for Arabic speech synthesis,” 2018.
18. R. Prenger, R. Valle, and B. Catanzaro, “WaveGlow: A flow-based generative network for speech synthesis,” CoRR, vol. abs/1811.00002, 2018.
19. S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” International Conference on Machine Learning (ICML), vol. abs/1502.03167, 2015.
20. Y. Li and Y. Yuan, “Convergence analysis of two-layer neural networks with ReLU activation,” vol. abs/1705.09886, 2017.
21. N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” Journal of Machine Learning Research, pp. 1929–1958, 2014.
22. D. Krueger, T. Maharaj, J. Kramár, M. Pezeshki, N. Ballas, N. R. Ke, A. Goyal, Y. Bengio, H. Larochelle, A. C. Courville, and C. Pal, “Zoneout: Regularizing RNNs by randomly preserving hidden activations,” International Conference on Learning Representations (ICLR), vol. abs/1606.01305, 2016.
23. D. P. Kingma and P. Dhariwal, “Glow: Generative flow with invertible 1 × 1 convolutions,” CoRR, vol. abs/1807.03039, 2018.
24. L. Dinh, J. Sohl-Dickstein, and S. Bengio, “Density estimation using real NVP,” CoRR, vol. abs/1605.08803, 2016.
25. D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in International Conference on Learning Representations (ICLR), Y. Bengio and Y. LeCun, Eds., 2015.
26. A. H. Ali, M. Magdy, M. Alfawzy, M. Ghaly, and H. Abbas, “Arabic speech synthesis using deep neural networks.”