LightSpeech: Lightweight and Fast Text to Speech with Neural Architecture Search
Renqian Luo, Xu Tan, Rui Wang, Tao Qin, Jinzhu Li, Sheng Zhao, Enhong Chen, Tie-Yan Liu
University of Science and Technology of China, Microsoft Research Asia, Microsoft Azure Speech
[email protected], [email protected], {xuta,ruiwa,taoqin,tyliu}@microsoft.com, {jinzl,sheng.zhao}@microsoft.com

ABSTRACT
Text to speech (TTS) has been broadly used to synthesize natural and intelligible speech in different scenarios. Deploying TTS in various end devices such as mobile phones or embedded devices requires extremely small memory usage and inference latency. While non-autoregressive TTS models such as FastSpeech have achieved significantly faster inference speed than autoregressive models, their model size and inference latency are still too large for deployment in resource-constrained devices. In this paper, we propose LightSpeech, which leverages neural architecture search (NAS) to automatically design more lightweight and efficient models based on FastSpeech. We first profile the components of the current FastSpeech model and carefully design a novel search space containing various lightweight and potentially effective architectures. Then NAS is utilized to automatically discover well-performing architectures within the search space. Experiments show that the model discovered by our method achieves a 15x model compression ratio and a 6.5x inference speedup on CPU with on-par voice quality. Audio demos are provided at https://speechresearch.github.io/lightspeech.
Index Terms — Text to speech, lightweight, fast, neural architecture search
1. INTRODUCTION
Text to speech (TTS) has been widely used to synthesize natural and intelligible speech audio given text, and has been deployed in many services such as audio navigation, newscasting, tourism interpretation, etc. While neural network based TTS models [1, 2, 3, 4] have greatly improved voice quality over conventional TTS systems, they usually adopt autoregressive generation with large inference latency, which makes them difficult to deploy on various end devices such as mobile phones or embedded devices. Recently, non-autoregressive TTS models [5, 6, 7, 8, 9] have significantly accelerated inference over previous autoregressive systems. Despite their success, these models still have relatively large model size, inference latency and power consumption when deployed in resource-constrained scenarios (e.g., mobile phones, embedded devices and low-budget services where mainly CPU is available).

There have been many techniques for designing lightweight and efficient neural networks, such as shrinking, tensor decomposition [10], quantization [11] and pruning [12]. These methods have achieved significant success in compressing big models into smaller ones with less computational cost. However, most of them are designed for convolutional neural networks in computer vision tasks, involve much area-specific knowledge, and cannot be easily extended to sequence learning tasks (e.g., natural language processing and speech processing) with recurrent neural networks, attention networks, etc. For example, manually reducing the depth and width of the network brings a severe performance drop (as shown in Table 2). Recently, neural architecture search (NAS) [13, 14] has been leveraged to automatically design lightweight models with promising performance [15, 16]. However, applying NAS to a new area and task is challenging and requires careful design and choice of the search space, the search algorithm and the evaluation metric: 1) The search space determines the potential upper bound of the performance and requires careful design; a good search space is efficient to search, while a poorly designed one makes it hard to find promising architectures. 2) The search algorithm should be carefully chosen to fit the task and its specific characteristics. 3) The metric should be designed or chosen to best represent the final evaluation metric. Directly applying existing NAS algorithms may lead to no improvement.

In this work, we propose LightSpeech, which leverages NAS to design lightweight and fast TTS models with much smaller size and faster inference speed on CPU. Firstly, we carefully profile the bottlenecks of the current TTS model [9]. Secondly, according to the observations from the profiling, we design a novel search space that covers a range of lightweight models. Thirdly, among many well-performing NAS algorithms, we adopt accuracy-prediction-based NAS [17, 18], which is straightforward and efficient. Specifically, we adopt GBDT-NAS [18] to perform the search due to its promising performance and fitness to our task (the chain structure of our search space is well fitted by GBDT-NAS). Experiments show that, compared to the original FastSpeech 2 model [9], the architecture discovered by LightSpeech achieves a 15x compression ratio (1.8M vs. 27M parameters), 16x fewer MACs (0.76G vs. 12.50G) and a 6.5x inference speedup on CPU.
2. METHOD
Considering that FastSpeech [5, 9] is one of the most popular non-autoregressive TTS models with fast and high-quality speech synthesis, we mainly adopt it as the model backbone. First we analyze the memory and latency of each component of the current models in Section 2.1. Then we design a novel search space including various operations in Section 2.2. Finally we introduce the search algorithm used to find efficient models in Section 2.3.
2.1. Model Profiling

Non-autoregressive TTS models [5, 9, 7, 8, 6] generate speech in parallel and greatly speed up inference compared to autoregressive TTS models [1, 2, 3]. However, the models are still large, which causes high memory usage and inference latency when deployed on end devices with limited computing resources (e.g., mobile phones and embedded devices). Taking FastSpeech 2 [9] (which further improves the voice quality of FastSpeech by introducing more variance information) as an example, it has 27M parameters with more than 100M memory footprint and is 10x slower on CPU than on GPU.

In order to determine the network backbone and the search space, we profile each component of the FastSpeech 2 model to identify the bottlenecks in memory and inference speed. The core model of FastSpeech 2 contains 5 parts: encoder, decoder, duration predictor, pitch predictor and energy predictor. The encoder and the decoder each consist of 4 feed-forward Transformer blocks [5]. The duration predictor is a 2-layer 1D-convolutional network, the pitch predictor is a 5-layer 1D-convolutional network, and the energy predictor has the same structure as the pitch predictor. We measure the size (i.e., number of parameters) and the inference speed of each component as shown in Table 1.

Name               | Structure     | #Params | RTF
Encoder            | 4 Trans Block | 11.56M  | –
Decoder            | 4 Trans Block | 11.54M  | –
Duration Predictor | 2 Conv Layer  | 0.40M   | –
Pitch Predictor    | 5 Conv Layer  | 1.64M   | –
Energy Predictor   | 5 Conv Layer  | 1.64M   | –

Table 1. Profiling of the model size and the inference latency of components in the FastSpeech 2 model. RTF denotes the real-time factor, i.e., the time (in seconds) required for the system to synthesize one second of waveform. The RTF of each component is calculated by measuring the inference latency of the component and dividing it by the total generated audio length. The latency is measured with a single thread and a single core on an Intel Xeon CPU E5-2690 v4 @ 2.60 GHz, with 256 GB memory, and a batch size of 1.

We have several observations from Table 1: 1) The encoder and the decoder take most of the model size and the inference time. Therefore we mainly aim to reduce the encoder and decoder size, and perform architecture search to discover better and more efficient architectures. 2) The predictors take about 1/3 of the total size and inference time. Therefore we manually design the variance predictors with more lightweight operations rather than searching their architectures.
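As an illustration of how such a profile can be obtained, the sketch below counts per-module parameters and measures the real-time factor of a PyTorch module; the sub-module names in the commented usage are assumptions for illustration, not the authors' actual code.

```python
import time
import torch

def count_parameters(module: torch.nn.Module) -> int:
    """Total number of trainable parameters in a module."""
    return sum(p.numel() for p in module.parameters() if p.requires_grad)

@torch.no_grad()
def real_time_factor(module: torch.nn.Module, dummy_input: torch.Tensor,
                     audio_seconds: float, n_runs: int = 10) -> float:
    """RTF = synthesis time / duration of the generated audio (lower is faster)."""
    module.eval()
    start = time.perf_counter()
    for _ in range(n_runs):
        module(dummy_input)
    elapsed = (time.perf_counter() - start) / n_runs
    return elapsed / audio_seconds

# Hypothetical usage, assuming a FastSpeech 2-like model with these sub-modules:
# for name, sub in [("encoder", model.encoder), ("decoder", model.decoder),
#                   ("duration_predictor", model.duration_predictor)]:
#     print(name, count_parameters(sub) / 1e6, "M parameters")
```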
2.2. Search Space Design

There are 4 feed-forward Transformer blocks [5] in both the encoder and the decoder in [9], where each feed-forward Transformer block contains a multi-head self-attention (MHSA) [4, 5, 9] layer and a feed-forward network (FFN); the FFN in [4, 5, 9] for the TTS task consists of a 1D convolution layer and a fully connected layer. We use this encoder-decoder framework as our network backbone and set the number of layers in both the encoder and the decoder to 4. For the variance predictors (duration, pitch and energy) in FastSpeech 2, our preliminary experiments show that removing the energy predictor causes only a very marginal drop in voice quality. Accordingly, we directly remove the energy predictor in our design.

After setting the number of layers in each component, we search for different combinations of diverse architectures in the encoder and the decoder. We carefully design the candidate operations for our task: 1) LSTM is not considered due to its slow inference speed. 2) We decouple the original Transformer block into MHSA and FFN as separate operations, and adopt MHSA with 3 different numbers of attention heads. 3) Considering that depthwise separable convolution (SepConv) [19] is much more memory and computation efficient than vanilla convolution, we use SepConv as an alternative to vanilla convolution. The parameter size of a vanilla convolution is K × I_d × O_d, where K is the kernel size, I_d is the input dimension and O_d is the output dimension, while the size of a SepConv is K × I_d + I_d × O_d. Following [20], we adopt 7 different kernel sizes. In total, our candidate operations include 11 different choices: MHSA with 3 choices of attention heads, SepConv with 7 choices of kernel size, and FFN. This yields a search space of 11^8 = 214358881 different candidate architectures.

To reduce the model size and latency of the variance predictors (the duration predictor and the pitch predictor), we directly replace the convolution operation in the variance predictors with SepConv of the same kernel size, without searching for other operations, which proves to be very effective in our experiments.
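A minimal PyTorch sketch of a depthwise separable 1D convolution and its parameter saving over a vanilla convolution (biases ignored) is shown below; this is a generic illustration of the SepConv operation described above, not the authors' implementation.

```python
import torch.nn as nn

class SepConv1d(nn.Module):
    """Depthwise separable 1D convolution: K*I_d + I_d*O_d weights
    instead of K*I_d*O_d for a vanilla Conv1d (biases ignored)."""
    def __init__(self, in_dim: int, out_dim: int, kernel_size: int):
        super().__init__()
        self.depthwise = nn.Conv1d(in_dim, in_dim, kernel_size,
                                   padding=kernel_size // 2, groups=in_dim)
        self.pointwise = nn.Conv1d(in_dim, out_dim, kernel_size=1)

    def forward(self, x):  # x: [batch, channels, time]
        return self.pointwise(self.depthwise(x))

# Example with kernel size 9 and hidden size 256:
vanilla_weights = 9 * 256 * 256        # 589,824
sepconv_weights = 9 * 256 + 256 * 256  # 67,840, roughly 8.7x fewer
```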
2.3. Search Algorithm

There have been many methods for searching neural architectures [13, 21, 14, 18]. For our task, we adopt a very recent method [18] based on accuracy prediction, which is efficient and effective and fits our task well (the chain-structured search space). Specifically, it uses a gradient boosting decision tree (GBDT) trained on some architecture-accuracy pairs to predict the accuracy of numerous other candidate architectures. Then the architectures with the top predicted accuracy are further evaluated by training on the training set and evaluating on a held-out dev set. Finally, the architecture with the best evaluated accuracy is selected.

In our task, since evaluating the voice quality of a TTS system involves human labor, it is impractical to evaluate each candidate architecture during the search. We instead use the validation loss on the dev set as a proxy of accuracy to guide the search; in non-autoregressive models (e.g., FastSpeech and FastSpeech 2), the validation loss on the dev set is highly correlated with the final quality, while in autoregressive models it is not. Therefore, we search for architectures with as small a validation loss as possible.
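To make the prediction step concrete, here is a hedged sketch of ranking candidate architectures with a GBDT regressor. LightGBM is used as one possible GBDT implementation; the encoding of each architecture as a vector of per-layer operation indices and the stand-in loss values are assumptions for illustration, not the GBDT-NAS code.

```python
import numpy as np
from lightgbm import LGBMRegressor

rng = np.random.default_rng(0)

# Assumed encoding: one operation index per searchable layer
# (8 layers, 11 candidate operations each, matching Section 2.2).
measured_archs = rng.integers(0, 11, size=(1000, 8))  # architectures already evaluated
measured_losses = rng.random(1000)                     # stand-in dev-set losses

predictor = LGBMRegressor(n_estimators=100, num_leaves=31)
predictor.fit(measured_archs, measured_losses)

# Predict the loss of a large pool of unseen candidates and keep the
# most promising ones for re-evaluation with the weight-sharing supernet.
candidates = rng.integers(0, 11, size=(200_000, 8))
predicted_losses = predictor.predict(candidates)
top_candidates = candidates[np.argsort(predicted_losses)[:300]]
```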
3. EXPERIMENTS

3.1. Experimental Setup

Dataset
We evaluate our method on the LJSpeech dataset [22]. LJSpeech contains 13,100 pairs of text and speech data with approximately 24 hours of speech audio. We split the dataset into three parts: 12,900 samples as the training set, 100 samples as the dev set and 100 samples as the test set. Following [9], we convert the source text sequence into a phoneme sequence with an open-source grapheme-to-phoneme tool (https://github.com/Kyubyong/g2p). We transform the raw waveform into mel-spectrograms following [9], and set the frame size and the hop size to 1024 and 256 with respect to the sample rate of 22050.
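A sketch of this preprocessing step with librosa is shown below; only the frame size, hop size and sample rate come from the text above, while the number of mel bins and the log compression are assumptions.

```python
import librosa
import numpy as np

def wav_to_mel(wav_path: str, sr: int = 22050, n_fft: int = 1024,
               hop_length: int = 256, n_mels: int = 80) -> np.ndarray:
    """Load a waveform and convert it to a log mel-spectrogram.
    n_mels=80 is an assumed value, not stated in the paper."""
    wav, _ = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_fft=n_fft,
                                         hop_length=hop_length, n_mels=n_mels)
    return np.log(np.maximum(mel, 1e-10))  # shape: [n_mels, frames]
```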
Search Configuration

For the GBDT-NAS algorithm, we follow the default settings and hyper-parameters in [18]. The weight-sharing mechanism [23, 21] adopted in [18] trains and evaluates thousands of candidate architectures efficiently within a supernet that contains all the candidate architectures. Specifically, we train the supernet for 20k steps with a batch size of 28000 tokens per GPU and evaluate the validation loss of 1000 candidate architectures. Then, the GBDT predictor is trained on the 1000 architecture-loss pairs with 100 trees and 31 leaves per tree. Since the search space is of moderate size, we use the trained GBDT to predict the accuracy of all the candidate architectures in the search space, and re-evaluate the top 300 architectures by training them and evaluating their losses with the supernet. Finally, the architecture with the smallest validation loss is selected. The whole search process takes only 4 hours on 4 NVIDIA P40 GPUs.
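The weight-sharing idea can be sketched as follows: each searchable layer holds all candidate operations with shared weights, and a sampled architecture simply selects which operation is active in each layer, so many candidates can be scored without training each from scratch. This is a simplified toy illustration, not the GBDT-NAS implementation; the stand-in operations and sizes are assumptions.

```python
import torch
import torch.nn as nn

class SupernetLayer(nn.Module):
    """One searchable layer holding all candidate operations with shared weights."""
    def __init__(self, candidate_ops):
        super().__init__()
        self.ops = nn.ModuleList(candidate_ops)

    def forward(self, x, op_index):
        return self.ops[op_index](x)  # only the sampled operation runs

class Supernet(nn.Module):
    def __init__(self, num_layers, make_candidates):
        super().__init__()
        self.layers = nn.ModuleList(SupernetLayer(make_candidates())
                                    for _ in range(num_layers))

    def forward(self, x, architecture):  # architecture: one op index per layer
        for layer, op_index in zip(self.layers, architecture):
            x = layer(x, op_index)
        return x

# Toy stand-ins for the 11 candidate operations (MHSA / SepConv / FFN), hidden size 256.
make_ops = lambda: [nn.Linear(256, 256) for _ in range(11)]
supernet = Supernet(num_layers=8, make_candidates=make_ops)
output = supernet(torch.randn(4, 50, 256), architecture=[3, 0, 7, 1, 2, 9, 4, 5])
```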
Training and Inference

We train the searched TTS models on 4 NVIDIA P40 GPUs, with a batch size of 28000 tokens per GPU, for 100k steps. During inference, the output mel-spectrograms are transformed into audio samples using a pre-trained Parallel WaveGAN [24] following [9].
Table 2. The CMOS comparison between LightSpeech (our searched model), FastSpeech 2* (manually designed lightweight FastSpeech 2 model) and FastSpeech 2.
Audio Quality
To evaluate the quality of the synthesized speech, we perform CMOS [25] evaluation on the test set. We compare our searched model (denoted as LightSpeech) with two baselines: 1) standard FastSpeech 2 [9], and 2) a manually designed lightweight FastSpeech 2 model (denoted as FastSpeech 2*) whose model size and inference latency match those of LightSpeech, with 2 feed-forward Transformer blocks in both the encoder and the decoder, hidden size 128, filter size 256, no energy predictor, and SepConv in the variance predictors. The other settings and configurations are the same as in [9]. The results are shown in Table 2. We can see that the model discovered by LightSpeech achieves audio quality comparable to FastSpeech 2 with no performance drop (a CMOS within [-0.05, +0.05] is commonly regarded as on-par performance), while using far fewer parameters (1.8M vs. 27M), a 15x compression ratio. Meanwhile, the manually designed lightweight FastSpeech 2 model (FastSpeech 2*) has a similar model size (1.8M) but suffers a severe performance drop compared to standard FastSpeech 2 (-0.230 CMOS). This demonstrates the advantage of LightSpeech over human design in finding lightweight TTS models.

Speedup and Complexity
Further, we measure the speedup and computation complexity in Table 3. The inference speed on CPU in terms of RTF is reduced by 6.5x. (The RTF of FastSpeech 2 may already seem small on CPU since it is measured on a powerful server; however, on many devices with very constrained computation capability the inference speed can be much slower, and LightSpeech makes deployment on such devices feasible with its 6.5x inference speedup.) For computation complexity, we use the number of multiply-accumulate operations (MACs) to quantify the computation cost. LightSpeech has 16x fewer MACs than FastSpeech 2 (0.76G MACs vs. 12.50G MACs).

Model        | #Params | MACs   | RTF
FastSpeech 2 | 27M     | 12.50G | –
LightSpeech  | 1.8M    | 0.76G  | –

Table 3. Comparison of model size, MACs and inference speed between LightSpeech and FastSpeech 2. The inference speed is measured in RTF (real-time factor) with the same method as in Table 1, using a single thread and a single core on an Intel Xeon CPU E5-2690 v4 @ 2.60 GHz. MACs are measured on a single sample.

In summary, Table 2 and Table 3 show that the architecture discovered by LightSpeech achieves on-par audio quality compared to FastSpeech 2 with a 15x compression ratio, 16x fewer MACs and a 6.5x inference speedup on CPU. Accordingly, it is more feasible to deploy in many resource-constrained scenarios (e.g., mobile platforms, embedded devices).
We now study the effect of the designs proposed in Section 2.2.
Shallowing
The manually designed FastSpeech 2* model shallows the FastSpeech 2 model to 2 feed-forward Transformer blocks in both the encoder and the decoder. From the results in Table 2, we see that the model size of FastSpeech 2* is largely reduced from 27M to 1.8M, while the corresponding audio quality drops severely, with -0.230 CMOS compared to FastSpeech 2. This indicates that simply compressing the model by making it shallower results in a performance drop.
SepConv
We replace the convolution in the variance predictors with SepConv and report the losses of the duration predictor and the pitch predictor in Table 4. We can see that the loss does not increase, indicating the effectiveness of using SepConv to reduce model size while maintaining model capacity.
Setting                | Duration Loss | Pitch Loss
FastSpeech 2           | 0.12          | 0.85
FastSpeech 2 + SepConv | 0.12          | 0.85
Table 4 . Analysis on the effectiveness of replacing convolu-tion with SepConv in the variance predictors.
Search Space and NAS
To evaluate the effectiveness of our search space design for the encoder and the decoder, we randomly sample 10 architectures from the search space and calculate their average validation loss. The results are shown in Table 5. We can see that random architectures achieve a much lower loss than the manually designed FastSpeech 2* (0.2753 vs. 0.2956). This shows the effectiveness of our search space, which contains many promising architectures that are better than manually designed FastSpeech models. Further, the architecture discovered by NAS (LightSpeech) achieves an even lower loss (0.2561), on par with that of FastSpeech 2 (0.2575), demonstrating the effectiveness of NAS.
Setting                              | Validation Loss
FastSpeech 2                         | 0.2575
FastSpeech 2*                        | 0.2956
Random architectures (average of 10) | 0.2753
LightSpeech                          | 0.2561
Table 5. Analysis of the effectiveness of the search space and NAS.
Discovered Architecture

Finally, we show the architecture discovered by LightSpeech. The encoder consists of SepConv (k=5), SepConv (k=25), SepConv (k=13) and SepConv (k=9), and the decoder consists of SepConv (k=17), SepConv (k=21), SepConv (k=9) and SepConv (k=13), where k is the kernel size. The hidden size is 256, the same as in [9]. The other parts follow the descriptions in Section 2.2.
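For illustration, the discovered encoder and decoder can be written as plain stacks of SepConv blocks with the kernel sizes listed above; the residual connection, activation and layer normalization around each SepConv are assumptions here, since the paper specifies only the operation type, kernel size and hidden size.

```python
import torch
import torch.nn as nn

class SepConvBlock(nn.Module):
    """Depthwise separable conv block; the residual + ReLU + LayerNorm wrapping
    is an assumed detail, not taken from the paper."""
    def __init__(self, dim: int, kernel_size: int):
        super().__init__()
        self.depthwise = nn.Conv1d(dim, dim, kernel_size,
                                   padding=kernel_size // 2, groups=dim)
        self.pointwise = nn.Conv1d(dim, dim, kernel_size=1)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):  # x: [batch, time, dim]
        y = self.pointwise(self.depthwise(x.transpose(1, 2))).transpose(1, 2)
        return self.norm(x + torch.relu(y))

HIDDEN = 256  # hidden size used in the paper
encoder = nn.Sequential(*[SepConvBlock(HIDDEN, k) for k in (5, 25, 13, 9)])
decoder = nn.Sequential(*[SepConvBlock(HIDDEN, k) for k in (17, 21, 9, 13)])
hidden_out = encoder(torch.randn(2, 100, HIDDEN))  # [batch, time, hidden]
```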
4. CONCLUSION
In this paper, we propose LightSpeech, which leverages neural architecture search to discover lightweight and fast TTS models. We carefully analyze the memory and latency of each module in the original FastSpeech 2 model and then design corresponding improvements, including the model backbone and the search space. We then adopt GBDT-NAS to search for well-performing and efficient architectures. Experiments show that the discovered lightweight model achieves a 15x compression ratio, 16x fewer MACs and a 6.5x inference speedup on CPU with on-par audio quality compared to FastSpeech 2. For future work, we will further combine NAS with other compression methods such as pruning and quantization for more efficient TTS models.

5. REFERENCES
[1] Yuxuan Wang, RJ Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron J Weiss, Navdeep Jaitly, Zongheng Yang, Ying Xiao, Zhifeng Chen, Samy Bengio, et al., "Tacotron: Towards end-to-end speech synthesis," Proc. Interspeech 2017, pp. 4006–4010, 2017.
[2] Jonathan Shen, Ruoming Pang, Ron J Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, et al., "Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions," in ICASSP, 2018.
[3] Wei Ping, Kainan Peng, Andrew Gibiansky, Sercan O Arik, Ajay Kannan, Sharan Narang, Jonathan Raiman, and John Miller, "Deep Voice 3: Scaling text-to-speech with convolutional sequence learning," in ICLR, 2018.
[4] Naihan Li, Shujie Liu, Yanqing Liu, Sheng Zhao, and Ming Liu, "Neural speech synthesis with Transformer network," in AAAI, 2019, vol. 33, pp. 6706–6713.
[5] Yi Ren, Yangjun Ruan, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu, "FastSpeech: Fast, robust and controllable text to speech," in Advances in Neural Information Processing Systems, 2019, pp. 3165–3174.
[6] Kainan Peng, Wei Ping, Zhao Song, and Kexin Zhao, "Parallel neural text-to-speech," arXiv preprint arXiv:1905.08459, 2019.
[7] Dan Lim, Won Jang, Hyeyeong Park, Bongwan Kim, Jesam Yoon, et al., "JDI-T: Jointly trained duration informed Transformer for text-to-speech without explicit alignment," arXiv preprint arXiv:2005.07799, 2020.
[8] Chenfeng Miao, Shuang Liang, Minchuan Chen, Jun Ma, Shaojun Wang, and Jing Xiao, "Flow-TTS: A non-autoregressive network for text to speech based on flow," in ICASSP, 2020, pp. 7209–7213.
[9] Yi Ren, Chenxu Hu, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu, "FastSpeech 2: Fast and high-quality end-to-end text to speech," arXiv preprint arXiv:2006.04558, 2020.
[10] Nicholas D Sidiropoulos, Lieven De Lathauwer, Xiao Fu, Kejun Huang, Evangelos E Papalexakis, and Christos Faloutsos, "Tensor decomposition for signal processing and machine learning," IEEE Transactions on Signal Processing, vol. 65, no. 13, pp. 3551–3582, 2017.
[11] Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David, "BinaryConnect: Training deep neural networks with binary weights during propagations," in Neural Information Processing Systems, 2015, pp. 3123–3131.
[12] Yihui He, Xiangyu Zhang, and Jian Sun, "Channel pruning for accelerating very deep neural networks," in ICCV, 2017, pp. 1389–1397.
[13] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V Le, "Learning transferable architectures for scalable image recognition," in CVPR, 2018.
[14] Renqian Luo, Fei Tian, Tao Qin, Enhong Chen, and Tie-Yan Liu, "Neural architecture optimization," in Advances in Neural Information Processing Systems, 2018.
[15] Mingxing Tan and Quoc Le, "EfficientNet: Rethinking model scaling for convolutional neural networks," in ICML, 2019, pp. 6105–6114.
[16] Daoyuan Chen, Yaliang Li, Minghui Qiu, Zhen Wang, Bofang Li, Bolin Ding, Hongbo Deng, Jun Huang, Wei Lin, and Jingren Zhou, "AdaBERT: Task-adaptive BERT compression with differentiable neural architecture search," arXiv preprint arXiv:2001.04246, 2020.
[17] Wei Wen, Hanxiao Liu, Hai Li, Yiran Chen, Gabriel Bender, and Pieter-Jan Kindermans, "Neural predictor for neural architecture search," arXiv preprint arXiv:1912.00848, 2019.
[18] Renqian Luo, Xu Tan, Rui Wang, Tao Qin, Enhong Chen, and Tie-Yan Liu, "Neural architecture search with GBDT," arXiv preprint arXiv:2007.04785, 2020.
[19] Lukasz Kaiser, Aidan N Gomez, and Francois Chollet, "Depthwise separable convolutions for neural machine translation," in ICLR, 2018.
[20] Renqian Luo, Xu Tan, Rui Wang, Tao Qin, Enhong Chen, and Tie-Yan Liu, "Semi-supervised neural architecture search," arXiv preprint arXiv:2002.10389, 2020.
[21] Hieu Pham, Melody Guan, Barret Zoph, Quoc Le, and Jeff Dean, "Efficient neural architecture search via parameter sharing," in ICML, 2018, pp. 4092–4101.
[22] Keith Ito, "The LJ Speech dataset," https://keithito.com/LJ-Speech-Dataset/, 2017.
[23] Gabriel Bender, Pieter-Jan Kindermans, Barret Zoph, Vijay Vasudevan, and Quoc Le, "Understanding and simplifying one-shot architecture search," in ICML, 2018, pp. 549–558.
[24] Ryuichi Yamamoto, Eunwoo Song, and Jae-Min Kim, "Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram," in ICASSP, 2020.
[25] Philipos C Loizou, "Speech quality assessment," in