VARA-TTS: Non-Autoregressive Text-to-Speech Synthesis based on Very Deep VAE with Residual Attention
Peng Liu, Yuewen Cao, Songxiang Liu, Na Hu, Guangzhi Li, Chao Weng, Dan Su
Abstract
This paper proposes VARA-TTS, a non-autoregressive (non-AR) end-to-end text-to-speech (TTS) model using a Very Deep Variational Autoencoder (VDVAE) with a Residual Attention mechanism, which refines the textual-to-acoustic alignment layer by layer. Hierarchical latent variables with different temporal resolutions from the VDVAE are used as queries for the residual attention module. By leveraging the coarse global alignment from the previous attention layer as an extra input, the following attention layer can produce a refined version of the alignment. This amortizes the burden of learning the textual-to-acoustic alignment among multiple attention layers and outperforms the use of a single attention layer in robustness. An utterance-level speaking speed factor is computed by a jointly trained speaking speed predictor, which takes the mean-pooled latent variables of the coarsest layer as input, to determine the number of acoustic frames at inference. Experimental results show that VARA-TTS achieves slightly inferior speech quality to an AR counterpart, Tacotron 2, but an order-of-magnitude speed-up at inference, and outperforms an analogous non-AR model, BVAE-TTS, in terms of speech quality.
1. Introduction
In recent years, text-to-speech (TTS) synthesis technologies have made rapid progress with deep learning techniques. Previous attention-based encoder-decoder TTS models have achieved state-of-the-art results in speech quality and intelligibility (Wang et al., 2017; Shen et al., 2018; Gibiansky et al., 2017; Ping et al., 2018b; Li et al., 2019).

*Work done during internship at Tencent AI Lab. Tencent AI Lab; Human-Computer Communications Laboratory, The Chinese University of Hong Kong. Correspondence to: Peng Liu <[email protected]>. Code will be released soon.

These models first generate acoustic features (e.g., mel-spectrograms) autoregressively from text input using an attention mechanism, and then synthesize audio samples from the acoustic features with a neural vocoder (Oord et al., 2016; Kalchbrenner et al., 2018). However, the autoregressive (AR) structure has two major limitations: (1) it greatly limits the inference speed, since the inference time of AR models grows linearly with the output length; (2) AR models usually suffer from robustness issues, e.g., word skipping and repetition, due to accumulated prediction error.

To avoid the aforementioned limitations of AR TTS models, researchers have proposed various non-autoregressive (non-AR) TTS models (Ren et al., 2019; Peng et al., 2020; Ren et al., 2020; Kim et al., 2020; Miao et al., 2020; Lee et al., 2021). These models synthesize acoustic features significantly faster than AR models and reduce robustness issues, while achieving speech quality comparable to their AR counterparts. However, these models usually contain a separate duration module that does not propagate information to the acoustic module, which may lead to a training-inference mismatch. Moreover, the duration module needs duration labels as supervision, which may come from pre-trained AR TTS models (Ren et al., 2019; Peng et al., 2020), a forced aligner (Ren et al., 2020), a dynamic-programming backtrack path (Kim et al., 2020) or jointly trained attention modules (Miao et al., 2020; Lee et al., 2021).

In attention-based encoder-decoder TTS models, the keys and values are calculated from text, while the queries are constructed differently across models. For AR TTS models like Tacotron (Wang et al., 2017), the queries are constructed from the AR hidden states. This is a natural choice, since the AR hidden states contain the acoustic information up to the current frame. For non-AR TTS models like Flow-TTS (Miao et al., 2020), only position information is used as the query. For BVAE-TTS, the query is constructed from VAE latent variables. If the correlation between the queries and keys is not strong enough, it is difficult for the attention module to learn the alignment between them. Therefore, the key to building a non-AR TTS model is constructing queries that are highly correlated with the keys while also enabling parallel attention computation.
The corresponding text transcription can be considered a lossy compression of the acoustic features. Meanwhile, the latent variables of a hierarchical variational autoencoder (VAE) are also lossy compressions of the acoustic features when trained on acoustic data. These latent variables may therefore be correlated with the text and suitable for constructing the queries. Moreover, a hierarchical VAE provides latent variables at different resolutions, which can be used as queries to build a coarse-to-fine, layer-wise refined residual attention module. Specifically, each attention layer can use the alignment produced by the previous layer as an extra input and produce a refined version of the alignment. In this way, the burden of predicting the textual-to-acoustic alignment is reduced and amortized across the coarse-to-fine attention layers. For the first attention layer, we can use a "nearly diagonal" initial alignment as a good inductive bias, considering the monotonic relationship between text and acoustic features.

In this work, we present a novel non-AR TTS model based on a specific hierarchical VAE model called the Very Deep VAE (VDVAE) and a coarse-to-fine, layer-wise refined residual attention module. Our main contributions are as follows:

• We adopt VDVAE to model mel-spectrograms with bottom-up and top-down paths. The hierarchical latent variables serve as queries to the attention modules.

• We propose a novel residual attention mechanism that learns layer-wise alignments from coarse granularity to fine granularity along the top-down path.

• We propose a detailed Kullback-Leibler (KL) gain for VDVAE to avoid posterior collapse in some hierarchical layers.

• The proposed model is fully parallel and trained in an end-to-end manner, while obtaining high speech quality. Inference of the proposed model is 16 times faster than the AR model Tacotron 2 on a single NVIDIA GeForce RTX 2080 Ti GPU. Our proposed model also outperforms an analogous non-AR model, BVAE-TTS, in terms of speech quality.

The rest of the paper is organized as follows. Section 2 discusses related work. VAEs are introduced in Section 3. We present the model architecture in Section 4. Experimental results and analyses are reported in Sections 5 and 6. The conclusion is drawn in Section 7.
2. Related Work
The goal of text-to-speech (TTS) synthesis is to convert an input text sequence into an intelligible and natural-sounding speech utterance. Most previous work divides the task into two steps. The first step is text-to-acoustic (e.g., mel-spectrogram) modeling. Tacotron 1 & 2 (Wang et al., 2017; Shen et al., 2018), Deep Voice 2 & 3 (Gibiansky et al., 2017; Ping et al., 2018b), TransformerTTS (Li et al., 2019) and Flowtron (Valle et al., 2020) are AR models among the best-performing TTS models. These models employ an encoder-decoder framework with an attention mechanism, where the encoder converts the input text sequence into hidden representations and the decoder takes a weighted sum of the hidden representations to generate the output acoustic features frame by frame. Learning the alignment between the text sequence and the acoustic features (e.g., mel-spectrograms) is challenging for TTS models. Various attention mechanisms have been proposed to improve the stability and monotonicity of the alignment in AR models, such as location-sensitive attention (Chorowski et al., 2015), forward attention (Zhang et al., 2018), multi-head attention (Li et al., 2019), stepwise monotonic attention (He et al., 2019) and location-relative attention (Battenberg et al., 2020). However, the low inference efficiency of AR models hinders their application in real-time services. Recently, non-AR models have been proposed to synthesize the output in parallel. The key to designing a non-AR acoustic model is parallel alignment prediction. ParaNet (Peng et al., 2020) also adopts a layer-wise refined attention mechanism, where the queries are only positional encodings in the first attention layer and, in the following attention layers, the previous attention layer's output processed by a convolution block. However, attention distillation from a pre-trained AR TTS model is still needed to guide the training of alignments. FastSpeech (Ren et al., 2019) likewise requires knowledge distillation from a pre-trained AR TTS model to learn alignments, while FastSpeech 2 bypasses the requirement of a teacher model by using an external forced aligner to provide duration labels (Ren et al., 2020). Glow-TTS (Kim et al., 2020) and Flow-TTS (Miao et al., 2020) are both flow-based non-AR TTS models. Glow-TTS enforces hard monotonic alignments through the properties of flows and dynamic programming. Flow-TTS adopts positional attention to learn the alignment during training and uses a length predictor to predict the spectrogram length during inference.

The second step is modeling from acoustic features to time-domain waveform samples. WaveNet (Oord et al., 2016) is the first of these AR neural vocoders and produces high-quality audio. Since WaveNet inference is computationally challenging, several AR models have been proposed to improve the inference speed while retaining quality (Arik et al., 2017; Jin et al., 2018; Kalchbrenner et al., 2018). Non-AR vocoders have also attracted increasing research interest (Oord et al., 2018; Ping et al., 2018a; Prenger et al., 2019; Yamamoto et al., 2020); they generate high-fidelity speech much faster than real time.

Recently, end-to-end generation of audio samples from a text sequence has been proposed (Donahue et al., 2020; Ren et al., 2020; Weiss et al., 2020).
Wave-Tacotron (Weiss et al., 2020) extends Tacotron by incorporating a normalizing flow into the AR decoder loop. Both EATS (Donahue et al., 2020) and FastSpeech 2s (Ren et al., 2020) are non-AR models, which use various adversarial feedbacks and auxiliary prediction losses, respectively.
The VAE (Kingma & Welling, 2013; Rezende et al., 2014) is a widely used generative model. Both the variational RNN (VRNN) (Chung et al., 2015) and the vector-quantised VAE (VQ-VAE) (Van Den Oord et al., 2017) adopt an AR structure to model the generative process of audio samples. Child (2020) verifies that VDVAE outperforms the AR model PixelCNN (Van den Oord et al., 2016) in log-likelihood on all natural image benchmarks, while using fewer parameters and generating samples thousands of times faster. Inspired by the success of the VDVAE architecture, we employ it for the parallel speech synthesis task.

In parallel to our work, BVAE-TTS (Lee et al., 2021) has been proposed, where a bidirectional hierarchical VAE architecture is also adopted for non-AR TTS. However, only latent variables from the topmost layer are used as queries, and in BVAE-TTS there is a gap between the attention-based mel-spectrogram generation during training and the duration-based mel-spectrogram generation during inference. Therefore, various empirical and carefully designed techniques are needed to bridge this gap. By employing a residual attention mechanism, our proposed model eliminates this training-inference mismatch.
3. Variational Autoencoders
VAEs consist of the following parts: a generator p(x|z), a prior p(z) and a posterior approximator q(z|x). Typically, the posteriors and priors in VAEs are assumed to be normally distributed with diagonal covariance, which allows the Gaussian reparameterization trick to be used (Kingma & Welling, 2013). The generator p(x|z) and approximator q(z|x) are jointly trained by maximizing the evidence lower bound (ELBO):

$\log p(x) \geq \mathbb{E}_{z \sim q(z|x)}[\log p(x|z)] - \mathrm{KL}[q(z|x) \,\|\, p(z)]$  (1)

where the first term on the right-hand side can be seen as the expectation of the negative reconstruction error and the second term, the KL divergence, can be seen as a regularizer.

Hierarchical VAEs attain better generative performance than earlier VAEs with fully factorized posteriors and priors. One typical hierarchical VAE is introduced in (Sønderby et al., 2016), where both the prior p(z) and the posterior approximator q(z|x) are conditionally dependent across layers:

$p(z) = p(z_1)\, p(z_2 \mid z_1) \cdots p(z_N \mid z_{<N})$  (2)

$q(z \mid x) = q(z_1 \mid x)\, q(z_2 \mid x, z_1) \cdots q(z_N \mid x, z_{<N})$  (3)

4. Model Architecture

We adopt the VDVAE (Child, 2020) with a novel residual attention mechanism for non-AR TTS. The overall architecture is shown in Figure 1. Hierarchical latent variables at decreasing time scales are extracted from the input mel-spectrograms along the bottom-up path. These hierarchical latent variables are processed from top to bottom and serve as queries (Q). A text encoder takes the phoneme sequence and an optional speaker ID as input and outputs the text encoding as keys (K) and values (V). Q, K, V and the attention weights from the previous attention block (A_prev) are sent to the following residual attention module to produce refined attention weights and a context vector. The context vector is a weighted average of V and is passed to the top-down block for variational inference and mel-spectrogram reconstruction, along with the stored hierarchical latent variables. A speaking speed predictor takes the mean-pooled latent variables of the coarsest layer as input and predicts the utterance-level average speaking speed factor, which determines the number of acoustic frames at inference. A sketch of one top-down stage is shown below.

Figure 1. Overall model architecture.
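To make the data flow above concrete, here is a minimal, illustrative PyTorch sketch of one top-down stage. This is not the authors' implementation: the conv-based parameterization, module shapes and names are assumptions. During training the latent is sampled from the approximate posterior (which additionally sees the bottom-up activations); at inference it is sampled from the prior, and the attention context over the text encoding is fused into the top-down state.

```python
import torch
import torch.nn as nn

class TopDownBlock(nn.Module):
    """One top-down stage of a VDVAE-style decoder (illustrative sketch)."""

    def __init__(self, dim: int, z_dim: int):
        super().__init__()
        self.prior = nn.Conv1d(dim, 2 * z_dim, 1)          # (mu_p, logvar_p) from the top-down state
        self.posterior = nn.Conv1d(2 * dim, 2 * z_dim, 1)  # additionally sees bottom-up activations
        self.out = nn.Conv1d(dim + z_dim, dim, 1)          # fuses the sampled latent back in

    def forward(self, h_top, h_bottom=None, ctx=None):
        # h_top: (B, dim, T) top-down state; h_bottom: (B, dim, T) bottom-up activations;
        # ctx: (B, dim, T) residual-attention context over the text encoding.
        mu_p, lv_p = self.prior(h_top).chunk(2, dim=1)
        if h_bottom is not None:  # training: sample z from the approximate posterior
            mu_q, lv_q = self.posterior(torch.cat([h_top, h_bottom], dim=1)).chunk(2, dim=1)
            z = mu_q + torch.randn_like(mu_q) * torch.exp(0.5 * lv_q)
            # per-element KL between two diagonal Gaussians; summed/weighted by the caller
            kl = 0.5 * (lv_p - lv_q + (lv_q.exp() + (mu_q - mu_p) ** 2) / lv_p.exp() - 1.0)
        else:                     # inference: sample z from the prior
            z = mu_p + torch.randn_like(mu_p) * torch.exp(0.5 * lv_p)
            kl = None
        h = self.out(torch.cat([h_top, z], dim=1))
        if ctx is not None:
            h = h + ctx           # inject the context vector (weighted average of V)
        return h, z, kl
```

Stacking such blocks from the coarsest to the finest temporal resolution, with unpooling in between, yields the top-down path of Figure 1.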
We use the β-VAE training objective and a mel-spectrogram loss inspired by (Rezende & Viola, 2018). The hierarchical VAE, speaking speed predictor, text encoder and residual attention modules are jointly trained with the following objective:

$\mathcal{L} = \alpha \mathcal{L}_{\text{speaking speed}} + \mathcal{L}_{\text{recon}} + \beta \mathcal{L}_{\text{KL}}$  (5)

where

$\mathcal{L}_{\text{speaking speed}} = \mathbb{E}\big[(d - \hat{d})^2\big]$,  (6)

$\mathcal{L}_{\text{recon}} = \dfrac{\|x - \hat{x}\|}{\|x\|} + \|\log x - \log \hat{x}\|$,  (7)

$\mathcal{L}_{\text{KL}} = \mathrm{KL}\big[q(z_1 \mid x, t) \,\|\, p(z_1 \mid t)\big] + \sum_{n=2}^{N} \mathbb{E}_{q(z_{<n} \mid x, t)} \mathrm{KL}\big[q(z_n \mid x, z_{<n}, t) \,\|\, p(z_n \mid z_{<n}, t)\big]$,  (8)

with x the mel-spectrogram, x̂ its reconstruction, t the text, d the speaking rate and d̂ its prediction.

Most existing non-AR TTS models train a separate duration module with the alignment obtained from the acoustic module and text encoding module as supervision. This may lead to the issue that the acoustic module is not adapted well to the duration module. In VARA-TTS, the speaking speed predictor adopts two fully connected (FC) layers, between which we add a ReLU activation function, a layer-normalization layer and a dropout layer in sequence (more details are presented in Appendix D). Unlike most existing non-AR TTS models, which require token-level duration information as supervision for duration model training, we use a readily computed speaking rate d for each training utterance as the target of the speaking speed predictor:

$d = \mathrm{Normalize}_{\text{min-max}}\big(T_{\text{mel}} / L_{\text{text}}\big) \in [0, 1]$,  (12)

where Normalize_min-max(·) denotes min-max normalization across the training set, T_mel is the number of frames in a mel-spectrogram and L_text is the number of tokens in the phoneme sequence. We apply a sigmoid function to the speaking speed predictor output to obtain the predicted speaking rate d̂; an MSE loss between d and d̂ serves as the training signal for the speaking speed predictor.

As shown in Figure 2, the phoneme sequence and speaker ID are transformed into a phoneme embedding and a speaker embedding through two lookup tables. Feature-wise linear modulation (FiLM) (Perez et al., 2018) is used to fuse the speaker embedding with the phoneme embedding. Two FC layers compute the scale and shift vectors from the speaker embedding vector, respectively, and the feature-wise affine operation is conducted as

$\gamma_{\text{spk}} \times U_{\text{phn}} + \xi_{\text{spk}}$,  (13)

where γ_spk and ξ_spk represent the scale and shift vectors and U_phn represents the phoneme embedding. The fused output passes through four convolution layers. The convolution output and a sinusoidal positional encoding (Vaswani et al., 2017) are added to form the text encoder output.

Figure 2. Network detail of the text encoder.

For the initial attention module in Figure 1, we simply set the attention weight matrix A ∈ R^{T_maxred × L_text} to be "nearly diagonal", where T_maxred is the number of feature frames before the mean pooling layer on the bottom-up path. This "nearly diagonal" attention matrix is a good inductive bias, since the alignment between text and acoustic features is monotonic and almost diagonal. Inspired by (Tachibana et al., 2018), we set the attention weight matrix as follows:

$S_{tl} = \exp\big[-(t / T_{\text{maxred}} - l / L_{\text{text}})^2 / (2g^2)\big]$,  (14)

$A_{tl} = S_{tl} \big/ \textstyle\sum_{l} S_{tl}$,  (15)

where l is the text position index and t is the mel-spectrogram position index. We generate four attention matrices using four different values of g and compute four context vectors by multiplying the generated attention matrices with the text encoder output, respectively. The four context vectors are then concatenated and projected into one context vector along the temporal axis. During training, T_maxred is obtained by dividing the number of ground-truth mel-spectrogram frames by the maximum reduction factor, which is the product of the reduction factors between neighbouring residual block groups. During inference, it is obtained by multiplying the predicted speaking speed d̂ by L_text and then dividing by the maximum reduction factor.
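As a concrete reference, the following sketch constructs the "nearly diagonal" initial alignment of Eqs. (14)-(15). The paper's four g values are not legible in this copy, so the default g below is an assumption, as are the helper and variable names; the exponent's denominator follows the guided-attention form of Tachibana et al. (2018).

```python
import torch

def initial_alignment(num_frames_red: int, num_tokens: int, g: float = 0.2) -> torch.Tensor:
    """Nearly diagonal initial attention weights per Eqs. (14)-(15).

    num_frames_red: mel frames after maximum temporal reduction (T_maxred).
    num_tokens: phoneme sequence length (L_text). g controls the band width.
    """
    t = torch.arange(num_frames_red, dtype=torch.float32).unsqueeze(1) / num_frames_red
    l = torch.arange(num_tokens, dtype=torch.float32).unsqueeze(0) / num_tokens
    S = torch.exp(-((t - l) ** 2) / (2.0 * g ** 2))  # Eq. (14): Gaussian band around the diagonal
    return S / S.sum(dim=1, keepdim=True)            # Eq. (15): normalize over text positions

# At inference (hypothetical names), the reduced frame count comes from the
# predicted speaking rate: num_frames_red = round(d_hat * L_text / max_reduction_factor).
A_init = initial_alignment(num_frames_red=120, num_tokens=40)
```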
The remaining residual attention modules take the text encoder output (K, V), the attention weights from the previous attention module (A_prev) and the latent variables from the previous top-down group (Q) as input. The multi-head attention mechanism (Vaswani et al., 2017) is used, with A_prev added as an additional input:

$\mathrm{ResidualMultiHead}(Q, K, V, A_{\text{prev}}) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^O$,  (16)

where Q, K, V are matrices with dimensions d_k, d_k and d_v, respectively, and head_i = ResidualAttention(Q W_i^Q, K W_i^K, V W_i^V, A_prev). W^O is the transformation matrix that linearly projects the concatenation of all head outputs, and W_i^Q, W_i^K and W_i^V are the projection matrices of the i-th head. A_prev is the attention weights of the previous attention module averaged over heads. We use scaled dot-product residual attention:

$\mathrm{ResidualAttention}(Q, K, V, A_{\text{prev}}) = \mathrm{Softmax}\Big(\dfrac{Q K^{\top}}{\sqrt{d_k}} + A_{\text{prev}}\Big)\, V$.  (17)

We find that using the attention weights A_prev makes the training process more stable than using the pre-softmax attention scores, as is done in RealFormer (He et al., 2020). The previous attention weights A_prev are first upsampled to fit the time dimension and then processed by a convolution layer before being sent to the next layer. Unlike location-sensitive attention (Chorowski et al., 2015), which only takes into account the attention weights at the previous decoder time step, our residual attention mechanism incorporates the attention weights of all time steps from the previous residual attention module, which provides a global view of the attention history.
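A minimal sketch of Eq. (17) follows. For brevity it omits the per-head projections of Eq. (16) and the upsampling/convolution applied to A_prev between layers; the tensor shapes are assumptions.

```python
import math
import torch
import torch.nn.functional as F

def residual_attention(Q, K, V, A_prev):
    """Scaled dot-product attention with a residual attention term (Eq. 17).

    Q: (batch, heads, T, d_k); K: (batch, heads, L, d_k);
    V: (batch, heads, L, d_v); A_prev: (batch, 1, T, L), broadcast over heads.
    """
    scores = Q @ K.transpose(-2, -1) / math.sqrt(Q.size(-1))
    A = F.softmax(scores + A_prev, dim=-1)        # previous-layer weights bias the new alignment
    context = A @ V                               # weighted sum of values (text encoding)
    return context, A.mean(dim=1, keepdim=True)   # head-averaged weights feed the next layer
```

Adding A_prev inside the softmax means each layer only needs to learn a correction to the previous alignment, which is what lets the first layer start from the rule-based diagonal.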
5. Experiments

We conduct both single-speaker and multi-speaker TTS experiments to evaluate the proposed VARA-TTS model. For single-speaker TTS, the LJSpeech corpus (Ito & Johnson, 2017) is used, which contains 13,100 speech samples with a total duration of about 24 hours. We up-sample the speech signals from 22.05 kHz to 24 kHz. The dataset is randomly partitioned into training, validation and test sets according to a 12900/100/100 scheme. For multi-speaker TTS, we use an internal Mandarin Chinese multi-speaker corpus, which contains 55 hours of speech data from 7 female speakers.

In all experiments, mel-spectrograms are computed with a 1024-sample window length and a 256-sample frame shift. We convert text sentences into phoneme sequences with Festival for English and with an internal grapheme-to-phoneme toolkit for Chinese.

We compare VARA-TTS with a well-known AR TTS model, Tacotron 2 (Shen et al., 2018), and an analogous non-AR TTS model, BVAE-TTS (Lee et al., 2021), under the single-speaker TTS setting. We use an open-source Tacotron 2 implementation (https://github.com/NVIDIA/tacotron2) and the official BVAE-TTS implementation (https://github.com/LEEYOONHYUNG/BVAE-TTS). Some key hyperparameters of VARA-TTS are presented in Appendix C. For a fair comparison, we use the same neural vocoder, HiFi-GAN (Kong et al., 2020) with the HiFi-GAN-V1 configuration, to convert mel-spectrograms into waveforms for all three models.

We train the VARA-TTS model with a batch size of 32 on two NVIDIA V100 GPUs, and the models used for evaluation are trained for 90k iterations. The Adam optimizer is adopted for parameter updating, with a maximum learning rate of 1.5e-4 scheduled in the same manner as in (Vaswani et al., 2017) with 10k warm-up steps.

6. Results and Analysis

A mean opinion score (MOS) test is conducted to evaluate speech naturalness. We invite 10 raters, who are asked to give each presented stimulus a score from 1 to 5 (least natural to most natural) with 0.5-point increments. The sentences for evaluation are sampled from the LibriTTS (Zen et al., 2019) test set. The MOS results are presented in Table 1. Our proposed VARA-TTS obtains a higher MOS than BVAE-TTS, while it is inferior to Tacotron 2 in naturalness. We also conduct a MOS test for a multi-speaker VARA-TTS trained on the multi-speaker Mandarin corpus; the result is 4.49±0.11. The MOS on the multi-speaker Mandarin corpus is much higher than that on LJSpeech. One possible reason is that the multi-speaker Mandarin corpus is of higher quality and contains a larger amount of data; VARA-TTS is data-hungry and tends to over-fit on LJSpeech. Audio samples can be found online (https://vara-tts.github.io/VARA-TTS/).

Table 1. MOS with 95% confidence intervals and inference speed.

Model        MOS         Inference speed (ms)
Tacotron 2   4.11±0.22   526.52
BVAE-TTS     3.33±0.18   18.06
VARA-TTS     3.88±0.20   32.01

Benefiting from its non-AR structure, VARA-TTS enjoys fast inference. We use the test set to compare the inference speed of the three models; the average over 10 runs on one NVIDIA GeForce RTX 2080 Ti GPU is presented in Table 1. The inference speed of VARA-TTS is 16x faster than that of Tacotron 2 and on the same scale as that of BVAE-TTS.

Figure B.1 in Appendix B shows an example of the alignment refinement process for an utterance by the well-trained multi-speaker VARA-TTS model. The bottom row shows the initial, rule-generated diagonal alignments. In the second row, a blurry alignment is generated; the blurriness indicates that the alignment is not yet reliable. However, VARA-TTS learns to refine the coarse alignment and obtains clearer alignments in the succeeding hierarchies, as can be seen from the upper plots in Figure B.1. Moreover, we observe that the alignment refinement process varies with the β value used: for β = 1.0 the alignments disperse in the upper layers, while they remain clear for β = 1.8. This indicates that hierarchical latent variables from the same layers encode different information when trained with different values of β.

To show the effectiveness of the model design of VARA-TTS, we conduct the following ablation studies: (i) training without the detailed KL gain; (ii) using different values of β in Equation 11 during training; (iii) separately training the speaking speed predictor.

Detailed KL gain. Figure 3 shows the cumulative KL divergence curves for different values of β and λ.

Figure 3. Cumulative KL for different β and λ. The y-axis is the cumulative KL value per data point by layer and the x-axis is the layer index. A flat horizontal segment indicates that posterior collapse occurs in the corresponding layers.
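The diagnostic plotted in Figure 3 is straightforward to reproduce. A minimal sketch, assuming each layer's per-element KL tensor is available (as in the top-down sketch earlier):

```python
import torch

def cumulative_kl_per_layer(kl_layers, num_datapoints):
    """kl_layers: list of per-layer KL tensors, coarsest layer first.

    Returns the running (cumulative) KL per data point by layer index;
    a flat segment between consecutive layers suggests those layers collapsed.
    """
    per_layer = torch.tensor([kl.sum().item() / num_datapoints for kl in kl_layers])
    return torch.cumsum(per_layer, dim=0)
```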
We can see that there are some horizontal segments in the cumulative KL divergence curve when λ = 0. This indicates that the KL values are near zero and posterior collapse occurs in the corresponding layers. When the detailed KL gain is applied (λ = 1.0), the cumulative KL increases smoothly with the layer index and contains no horizontal segments. The KL value for λ = 1.0 is larger than that for λ = 0 in Figure 3; this is because the detailed KL gain mechanism enlarges the hierarchical KL values smaller than KL_ref, which can be compensated by a larger β. We also observe that, when trained with the detailed KL gain, VARA-TTS learns clearer alignments in the coarse hierarchical layers, as shown in Appendix B. This indicates that posterior collapse does not happen in these layers and the latent variables encode meaningful information.

Different values of β. At training, the queries are sampled from the posterior, but at inference they are sampled from the prior. A smaller KL value indicates a smaller gap between the prior and posterior of the VDVAE, and the smaller this gap, the more reliable the queries are at inference. The hyperparameter β in Equation 11 adjusts the relative weight between the reconstruction error and the KL term. The curves of feature reconstruction error and KL on the validation set for different values of β and λ are shown in Figures 4 (a) and (b), respectively. Enlarging β results in a larger feature reconstruction error and a smaller KL value. However, we find that small changes in the feature reconstruction error do not influence the perceptual results much. Figure 4 (c) shows the speaking speed error on the validation set; different values of β have little effect on it. We use β = 1.8 and λ = 1.0 for model evaluation.

Figure 4. Feature reconstruction error (a), KL (b), and speaking speed error (c) curves on the validation set for different values of β and λ. The x-axis is the training step in thousands and the y-axis is the corresponding value.

Joint speaking speed modeling. The separate speaking speed predictor has the same structure as the joint one, except that its input is the mean-pooled text embedding, detached from the computational graph by a stop-gradient operation. The stop-gradient operation is also applied in BVAE-TTS and Glow-TTS to avoid affecting the training objective. As shown in Figure 5, the joint training strategy obtains a similar ELBO to the separately trained counterpart but attains a much smaller speaking speed error on the validation set. This validates the effectiveness of jointly training the speaking speed predictor with the whole model.

Figure 5. Speaking speed error (a) and negative ELBO (b) for models trained with a joint speaking speed predictor and a separate one.
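The distinction between the two ablation settings reduces to where the predictor input comes from and whether gradients flow through it. A small sketch under assumed tensor layouts (the function and argument names are ours):

```python
import torch

def predictor_input(z_coarsest: torch.Tensor, text_emb: torch.Tensor, joint: bool) -> torch.Tensor:
    """Build the speaking speed predictor input for the two ablation settings.

    z_coarsest: (batch, channels, frames) coarsest-layer latents;
    text_emb: (batch, tokens, channels) text embeddings.
    """
    if joint:
        # Jointly trained: mean-pool the latents over time; gradients flow back into the VAE.
        return z_coarsest.mean(dim=-1)
    # Separately trained: mean-pool the text embeddings and detach (stop-gradient),
    # so the speaking speed loss cannot influence the rest of the training objective.
    return text_emb.mean(dim=1).detach()
```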
7. Conclusion

In this work, we propose VARA-TTS, a novel non-AR end-to-end TTS model that generates mel-spectrograms from text with a VDVAE and a residual attention mechanism. The hierarchical latent variables from the VDVAE are used as queries for the residual attention modules, which generate the textual-to-acoustic alignment in a layer-by-layer, coarse-to-fine manner. Experimental results show that VARA-TTS attains better perceptual results than BVAE-TTS at a similar inference speed, and achieves a 16x inference speed-up over Tacotron 2 with slightly inferior naturalness. We also demonstrate its extensibility to a multi-speaker setting.

VARA-TTS could be extended to text-to-waveform generation by adding more layers. However, we found this hard to optimize in our preliminary experiments: the model learns a clear alignment between text and waveform but cannot generate intelligible waveforms. This is consistent with the results reported in (Child, 2020), where VDVAE cannot generate consistent and sharp high-resolution images. We also propose the detailed KL gain for VDVAE, which avoids non-informative hierarchical latent variables; analyzing it theoretically is an interesting direction that we leave for future work.

References

Alemi, A. A., Poole, B., Fischer, I., Dillon, J. V., Saurous, R. A., and Murphy, K. Fixing a broken ELBO. In Proceedings of the 35th International Conference on Machine Learning (ICML), volume 80 of Proceedings of Machine Learning Research, pp. 159–168. PMLR, 2018.

Arik, S. O., Chrzanowski, M., Coates, A., Diamos, G., Gibiansky, A., Kang, Y., Li, X., Miller, J., Ng, A., Raiman, J., et al. Deep Voice: Real-time neural text-to-speech. arXiv preprint arXiv:1702.07825, 2017.

Battenberg, E., Skerry-Ryan, R., Mariooryad, S., Stanton, D., Kao, D., Shannon, M., and Bagby, T. Location-relative attention mechanisms for robust long-form speech synthesis. In ICASSP, pp. 6194–6198. IEEE, 2020.

Burgess, C. P., Higgins, I., Pal, A., Matthey, L., Watters, N., Desjardins, G., and Lerchner, A. Understanding disentangling in β-VAE. CoRR, abs/1804.03599, 2018.

Chen, X., Duan, Y., Houthooft, R., Schulman, J., Sutskever, I., and Abbeel, P. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In NIPS, 2016.

Child, R. Very deep VAEs generalize autoregressive models and can outperform them on images. arXiv preprint arXiv:2011.10650, 2020.

Chorowski, J. K., Bahdanau, D., Serdyuk, D., Cho, K., and Bengio, Y. Attention-based models for speech recognition. Advances in Neural Information Processing Systems, 28:577–585, 2015.

Chung, J., Kastner, K., Dinh, L., Goel, K., Courville, A. C., and Bengio, Y. A recurrent latent variable model for sequential data. Advances in Neural Information Processing Systems, 28:2980–2988, 2015.

Donahue, J., Dieleman, S., Bińkowski, M., Elsen, E., and Simonyan, K. End-to-end adversarial text-to-speech. arXiv preprint arXiv:2006.03575, 2020.

Gibiansky, A., Arik, S., Diamos, G., Miller, J., Peng, K., Ping, W., Raiman, J., and Zhou, Y. Deep Voice 2: Multi-speaker neural text-to-speech. In Advances in Neural Information Processing Systems, pp. 2962–2970, 2017.

He, M., Deng, Y., and He, L. Robust sequence-to-sequence acoustic modeling with stepwise monotonic attention for neural TTS. Proc. Interspeech 2019, pp. 1293–1297, 2019.

He, R., Ravula, A., Kanagal, B., and Ainslie, J.
RealFormer: Transformer likes residual attention. CoRR, abs/2012.11747, 2020.

Hendrycks, D. and Gimpel, K. Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415, 2016.

Ito, K. and Johnson, L. The LJ Speech dataset. https://keithito.com/LJ-Speech-Dataset/, 2017.

Jin, Z., Finkelstein, A., Mysore, G. J., and Lu, J. FFTNet: A real-time speaker-dependent neural vocoder. In ICASSP, pp. 2251–2255. IEEE, 2018.

Kalchbrenner, N., Elsen, E., Simonyan, K., Noury, S., Casagrande, N., Lockhart, E., Stimberg, F., Oord, A. v. d., Dieleman, S., and Kavukcuoglu, K. Efficient neural audio synthesis. arXiv preprint arXiv:1802.08435, 2018.

Kim, J., Kim, S., Kong, J., and Yoon, S. Glow-TTS: A generative flow for text-to-speech via monotonic alignment search. arXiv preprint arXiv:2005.11129, 2020.

Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

Kong, J., Kim, J., and Bae, J. HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis. In Advances in Neural Information Processing Systems 33 (NeurIPS 2020), 2020.

Lee, Y., Shin, J., and Jung, K. Bidirectional variational inference for non-autoregressive text-to-speech. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=o3iritJHLfO.

Li, N., Liu, S., Liu, Y., Zhao, S., and Liu, M. Neural speech synthesis with transformer network. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 6706–6713, 2019.

Miao, C., Liang, S., Chen, M., Ma, J., Wang, S., and Xiao, J. Flow-TTS: A non-autoregressive network for text to speech based on flow. In ICASSP, pp. 7209–7213. IEEE, 2020.

Oord, A., Li, Y., Babuschkin, I., Simonyan, K., Vinyals, O., Kavukcuoglu, K., Driessche, G., Lockhart, E., Cobo, L., Stimberg, F., et al. Parallel WaveNet: Fast high-fidelity speech synthesis. In International Conference on Machine Learning, pp. 3918–3926. PMLR, 2018.

Oord, A. v. d., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.

Peng, K., Ping, W., Song, Z., and Zhao, K. Non-autoregressive neural text-to-speech. In International Conference on Machine Learning, pp. 7586–7598. PMLR, 2020.

Perez, E., Strub, F., De Vries, H., Dumoulin, V., and Courville, A. FiLM: Visual reasoning with a general conditioning layer. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.

Ping, W., Peng, K., and Chen, J. ClariNet: Parallel wave generation in end-to-end text-to-speech. In International Conference on Learning Representations, 2018a.

Ping, W., Peng, K., Gibiansky, A., Arik, S. O., Kannan, A., Narang, S., Raiman, J., and Miller, J. Deep Voice 3: 2000-speaker neural text-to-speech. Proc. ICLR, pp. 214–217, 2018b.

Prenger, R., Valle, R., and Catanzaro, B. WaveGlow: A flow-based generative network for speech synthesis. In ICASSP, pp. 3617–3621. IEEE, 2019.

Ren, Y., Ruan, Y., Tan, X., Qin, T., Zhao, S., Zhao, Z., and Liu, T.-Y.
FastSpeech: Fast, robust and controllable text to speech. Advances in Neural Information Processing Systems, 32:3171–3180, 2019.

Ren, Y., Hu, C., Qin, T., Zhao, S., Zhao, Z., and Liu, T.-Y. FastSpeech 2: Fast and high-quality end-to-end text-to-speech. arXiv preprint arXiv:2006.04558, 2020.

Rezende, D. J. and Viola, F. Taming VAEs. CoRR, abs/1810.00597, 2018.

Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, pp. 1278–1286, 2014.

Shen, J., Pang, R., Weiss, R. J., Schuster, M., Jaitly, N., Yang, Z., Chen, Z., Zhang, Y., Wang, Y., Skerry-Ryan, R., et al. Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. In ICASSP, pp. 4779–4783. IEEE, 2018.

Sønderby, C. K., Raiko, T., Maaløe, L., Sønderby, S. K., and Winther, O. Ladder variational autoencoders. In Advances in Neural Information Processing Systems, pp. 3738–3746, 2016.

Tachibana, H., Uenoyama, K., and Aihara, S. Efficiently trainable text-to-speech system based on deep convolutional networks with guided attention. In ICASSP, pp. 4784–4788. IEEE, 2018.

Valle, R., Shih, K., Prenger, R., and Catanzaro, B. Flowtron: An autoregressive flow-based generative network for text-to-speech synthesis. arXiv preprint arXiv:2005.05957, 2020.

Van den Oord, A., Kalchbrenner, N., Espeholt, L., Vinyals, O., Graves, A., et al. Conditional image generation with PixelCNN decoders. Advances in Neural Information Processing Systems, 29:4790–4798, 2016.

Van Den Oord, A., Vinyals, O., et al. Neural discrete representation learning. In Advances in Neural Information Processing Systems, pp. 6306–6315, 2017.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems 30 (NeurIPS 2017), pp. 5998–6008, 2017.

Wang, Y., Skerry-Ryan, R., Stanton, D., Wu, Y., Weiss, R. J., Jaitly, N., Yang, Z., Xiao, Y., Chen, Z., Bengio, S., et al. Tacotron: Towards end-to-end speech synthesis. Proc. Interspeech 2017, pp. 4006–4010, 2017.

Weiss, R. J., Skerry-Ryan, R., Battenberg, E., Mariooryad, S., and Kingma, D. P. Wave-Tacotron: Spectrogram-free end-to-end text-to-speech synthesis. arXiv preprint arXiv:2011.03568, 2020.

Yamamoto, R., Song, E., and Kim, J.-M. Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram. In ICASSP, pp. 6199–6203. IEEE, 2020.

Zen, H., Dang, V., Clark, R., Zhang, Y., Weiss, R. J., Jia, Y., Chen, Z., and Wu, Y. LibriTTS: A corpus derived from LibriSpeech for text-to-speech. CoRR, abs/1904.02882, 2019.

Zhang, J.-X., Ling, Z.-H., and Dai, L.-R. Forward attention in sequence-to-sequence acoustic modeling for speech synthesis. In ICASSP, pp. 4789–4793. IEEE, 2018.

Supplement Materials

A. Auxiliary speaking speed predictor proof

Minimizing the prediction error for the speaking speed d from the top-most latent variables z paves the way to maximizing a lower bound of their mutual information.
Proof.

$\begin{aligned} I(z; \mathrm{d}) &= -H(\mathrm{d} \mid z) + H(\mathrm{d}) \\ &= \mathbb{E}_{z \sim p(z)}\big[\mathbb{E}_{\mathrm{d} \sim p(\mathrm{d} \mid z)}[\log p(\mathrm{d} \mid z)]\big] + H(\mathrm{d}) \\ &= \mathbb{E}_{z \sim p(z)}\big[D_{\mathrm{KL}}(p(\mathrm{d} \mid z) \,\|\, q(\mathrm{d} \mid z)) + \mathbb{E}_{\mathrm{d} \sim p(\mathrm{d} \mid z)}[\log q(\mathrm{d} \mid z)]\big] + H(\mathrm{d}) \\ &\geq \mathbb{E}_{z \sim p(z)}\big[\mathbb{E}_{\mathrm{d} \sim p(\mathrm{d} \mid z)}[\log q(\mathrm{d} \mid z)]\big] + H(\mathrm{d}) \\ &= \mathbb{E}_{\mathrm{d} \sim p(\mathrm{d})}\big[\mathbb{E}_{z \sim p(z \mid \mathrm{d})}[\log q(\mathrm{d} \mid z)]\big] + H(\mathrm{d}) \end{aligned}$

I(z; d) is the mutual information between z and d. H(d|z) is the conditional entropy of d given z and H(d) is the entropy of d. Each piece of data (x and t) has its corresponding d. p(z) is the prior distribution of z given t, with the conditioning on t omitted for conciseness. p(d) is the prior of d; it is not actually sampled from, since d is determined when a piece of data is sampled. q(d|z) is the output distribution of the auxiliary speaking speed predictor. p(z|d) is in fact the posterior of z given x and t.

B. Alignments for different values of β and λ

Figure B.1 shows alignments for different values of β and λ.

Figure B.1. Alignments for different values of β and λ. All alignments are interpolated to the highest temporal resolution; rows from bottom to top correspond to coarse-to-fine hierarchical layers. The alignments in the bottom row are the diagonal alignments generated according to the lengths of the text and acoustic features; the alignments above are generated by the residual attention mechanism. The alignments become clearer at finer hierarchical layers when β = 1.8, but become blurry at finer layers when β = 1.0. When λ = 1.0, the alignments at the coarse layers are clearer than those when λ = 0.

C. Detailed model configuration

Details of some key hyperparameters of VARA-TTS are listed in Table C.1. For the remaining model configuration, please refer to the source code accompanying this manuscript.

Table C.1. Hyperparameters of VARA-TTS.

Hyperparameter                                     VARA-TTS (EN)         VARA-TTS (ZH)
Number of mel banks in mel-spectrogram             80                    80
Mel-spectrogram pre-conv layer                     Conv1D, k=11, c=384   Conv1D, k=11, c=512
Number of phoneme tokens                           55                    148
Text embedding dimension                           384                   384
Total number of bottom-up stacks                   6                     7
Number of res blocks in each bottom-up stack       4/6/8/12/9/5          6/12/16/10/8/8/5
Temporal reduction rate in each bottom-up stack    repeat/2/2/2/2/1      repeat/2/2/2/2/2/1
Bottom-up residual block conv dimensions           384/96/96/384         512/128/128/512
Number of heads in multi-head attention module     8                     8
Attention dimension                                384                   384
Latent variable dimension                          16                    16
Text encoder conv dimensions                       384/96/96/384         384/96/96/384
Number of speakers                                 1                     7
Speaker embedding dimension                        -                     384

D. Speaking speed predictor network structure

The network structure of the speaking speed predictor used in VARA-TTS is illustrated in Figure D.1.

Figure D.1. Network structure of the speaking speed predictor, which outputs the speaking rate; the sigmoid output is used at both training and inference.
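A minimal sketch of this predictor, following the description in Section 4 and Figure D.1 (two FC layers with ReLU, layer normalization and dropout in between, and a sigmoid output); the hidden width and dropout rate are assumptions.

```python
import torch
import torch.nn as nn

class SpeakingSpeedPredictor(nn.Module):
    """Predicts the utterance-level speaking rate d_hat in [0, 1] (cf. Eq. 12)."""

    def __init__(self, in_dim: int, hidden_dim: int = 256, p_dropout: float = 0.5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.ReLU(),
            nn.LayerNorm(hidden_dim),
            nn.Dropout(p_dropout),
            nn.Linear(hidden_dim, 1),
            nn.Sigmoid(),  # keeps d_hat in the [0, 1] range of the min-max-normalized target
        )

    def forward(self, z_pooled: torch.Tensor) -> torch.Tensor:
        # z_pooled: (batch, in_dim) mean-pooled coarsest-layer latent variables
        return self.net(z_pooled).squeeze(-1)
```

The training target is d = Normalize_min-max(T_mel / L_text) per Eq. (12), with an MSE loss between d and the predictor output.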