Peking Opera Synthesis via Duration Informed Attention Network
Yusong Wu, Shengchen Li, Chengzhu Yu, Heng Lu, Chao Weng, Liqiang Zhang, Dong Yu
Beijing University of Posts and Telecommunications, Tencent AI Lab
{wuyusong, shengchen.li}@bupt.edu.cn, {czyu, bearlu, cweng, tatelqzhang, dyu}@tencent.com

Abstract
Peking Opera has been the most dominant form of Chinese performing art for around 200 years. A Peking Opera singer usually exhibits a very strong personal style by introducing improvisation and expressiveness on stage, which leads the actual rhythm and pitch contour to deviate significantly from the original music score. This inconsistency poses a great challenge in synthesizing Peking Opera singing voice from a music score. In this work, we propose to deal with this issue and synthesize expressive Peking Opera singing from the music score based on the Duration Informed Attention Network (DurIAN) framework. To tackle the rhythm mismatch, a Lagrange multiplier is used to find the optimal output phoneme duration sequence under the constraint of the note durations given by the music score. As for the pitch contour mismatch, instead of inferring directly from the music score, we adopt a pseudo music score generated from the real singing and feed it as input during training. Experiments demonstrate that the proposed system can synthesize Peking Opera singing voice with high-quality timbre, pitch and expressiveness.
Index Terms: singing synthesis, expressive singing synthesis, machine learning, deep learning, Lagrange multiplier
1. Introduction
Peking Opera, also known as Beijing Opera or Jingju, is a Chinese traditional performing art which combines music, vocal performance, mime, dance and acrobatics. Singing in Peking Opera has various styles, each differing widely depending on the role type and music style. Strong personal styles also make the actual singing differ from the given music notes. Like a dialect relative to Mandarin, it even has its own unique way of pronunciation. Moreover, the melody in singing often consists of arias with variations of complex transitions and vibratos, which makes the singing very expressive and difficult to learn. Another difference from normal singing is that the note length has great variance; sometimes very long notes can appear (more than 10 seconds). All the above factors make it very challenging to model and generate Peking Opera singing compared to normal singing.

Although there are few works focusing on the synthesis of Peking Opera, or more broadly opera, the synthesis of singing voice has been researched since 1962, when Kelly and Lochbaum [1] used an acoustic tube model to synthesize singing voice with success. Recently, several works [2–7] have used deep neural networks to synthesize singing voice; known as parametric systems, they process fundamental frequency (or pitch contour, f0) and harmonic features (or timbre) separately. As a typical case among such systems, the Neural Parametric Singing Synthesizer (NPSS) [2] uses a phoneme timing model, a pitch model and a timbre model, each consisting of a set of neural networks, to generate the acoustic parameters of the singing. In NPSS, a Fitting Heuristic method is introduced to eliminate the mismatch between the music note duration and the predicted phoneme duration. However, the Fitting Heuristic method is entirely rule-based, and it requires locating the principal vowel before adjusting phoneme durations.

* Yusong Wu performed the work while at Tencent.
This may be acceptable in most English or Japanese singing cases, but can cause huge duration errors when synthesizing Peking Opera. Different from normal speech or singing, in Peking Opera one syllable can last a very long time and contain a long sequence of phonemes, e.g. “l-j-E-a-a-N”. More importantly, one cannot simply tell which phoneme amongst all of these is the principal one; there could be multiple equally important phonemes in Peking Opera singing.

To better synthesize expressive Peking Opera, this paper proposes a Peking Opera singing synthesis system based on the Duration Informed Attention Network (DurIAN) [8]. The main contributions of this study lie in the two following points: 1) To tackle the rhythm mismatch between music note durations and predicted phoneme durations, a contextual mixture density network (MDN) [9] followed by a Lagrange multiplier optimization is proposed and implemented for duration modelling. This method is completely data-driven and, more importantly, skips the step of locating the principal phoneme required by the conventional Fitting Heuristic method. 2) To deal with the melody mismatch between the original music score and the actual singing, and also to better model the expressive variations and vibratos in Peking Opera, a pseudo music score is generated from the real singing and fed as input during DurIAN [8] model training. Experimental results show that the proposed duration modelling and prediction method outperforms the Fitting Heuristic method by a large margin, and the generated pitch contours demonstrate our system's ability to synthesize the singing variations and vibratos of Peking Opera.

The following sections of this paper are organized as follows. The proposed model architecture, the Lagrange multiplier-based duration prediction and the pseudo score generation are introduced in Section 2. In Section 3, experiments are conducted on a unique Peking Opera database. Finally, a discussion and conclusion are given in Section 4.
2. Methods
In the proposed system, three efforts make it possible to synthesize expressive Peking Opera singing from a music score: 1) a DurIAN-based [8] auto-regressive framework is used to generate the output spectrogram features, with the generated pitch as condition; 2) under a Lagrange multiplier constraint, a contextually dependent mixture density network (MDN) based phoneme duration modelling and generation method is introduced for accurate duration modelling; and 3) a melody transcription method is used to obtain a pseudo score from the training waveform samples during training. This last step ensures that the extracted acoustic features and the input music score are consistent in training as well as in inference.

Figure 1: The overview diagram of the proposed system. The proposed system is trained using data annotation and melody transcription results. In the synthesis stage, note pitch and phoneme sequence are parsed from the score, while the phoneme duration is predicted by the proposed duration model.

Fig. 1 shows the training and synthesis diagram. In the training stage, the note pitch of the singing is provided by the melody transcription results, while the phoneme sequence and phoneme duration are obtained from data annotation. In synthesis, the input note pitch comes from the real music score, and the input phoneme sequence comes from the lyric parsing module, which also uses the music score as reference.
2.1. Synthesis framework

The Duration Informed Attention Network (DurIAN) [8] structure is used as the main framework. As in DurIAN, the temporal dependencies in the phoneme sequence are first modelled by a phoneme encoder consisting of a two-layer prenet followed by a CBHG module [10] (1-D convolution bank + highway network + bidirectional GRU), which learns the contextual dependencies of the phonemes. The encoded phoneme sequence is aligned with the output spectrogram in the alignment model, where the singer identity and the frame-wise generated pitch are further added. The final Mel-spectrogram frames are generated auto-regressively by a decoder with Gated Recurrent Units (GRU) [11], which models the temporal dependencies of the Mel-spectrogram, and a post-net is used to refine the output. Different from DurIAN, a CBHG module is used as the post-net instead of a fully convolutional network to improve generation quality.

In the training stage, phoneme sequences are obtained from data annotation, and the ground-truth phoneme duration is used to expand the phoneme sequence in the alignment model. The input music note is generated from the real singing by the melody transcription module introduced in Section 2.3. In the synthesis stage, the phoneme duration is predicted by the trained duration model introduced in Section 2.2, and the input music note comes from the real music score.
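The duration-informed alignment can be sketched in a few lines. This is a minimal illustration with assumed array shapes and function names, not the authors' implementation: each encoded phoneme vector is simply repeated for its assigned number of frames, replacing learned attention with an explicit hard alignment.

```python
# Minimal sketch of DurIAN-style duration-informed expansion
# (names and shapes are illustrative assumptions, not the authors' code).
import numpy as np

def expand_by_duration(encodings: np.ndarray, durations) -> np.ndarray:
    """Repeat each phoneme encoding for its duration in frames.

    encodings: (num_phonemes, dim) array from the phoneme encoder.
    durations: per-phoneme frame counts (ground truth in training,
               predicted by the duration model at synthesis time).
    """
    assert len(encodings) == len(durations)
    return np.repeat(encodings, durations, axis=0)  # (total_frames, dim)

enc = np.arange(6, dtype=float).reshape(3, 2)  # 3 phonemes, 2-dim encodings
frames = expand_by_duration(enc, [2, 1, 3])    # phoneme 0 spans 2 frames, etc.
assert frames.shape == (6, 2)
```

The resulting frame-level sequence can then be combined with the singer identity and frame-wise pitch before being fed to the auto-regressive decoder.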
2.2. Duration model

The music score gives the duration of each note and the lyric aligned with it. However, the duration of each phoneme needs to be predicted in order to generate syllable or word pronunciation in the synthesis stage. Thus, a phoneme duration model is trained and then used to generate phoneme durations at synthesis time. Unlike speech synthesis, where no music score exists, in singing synthesis the sum of the predicted phoneme durations within a note should always equal the duration of that note, to make sure the synthesized singing follows the rhythm on the score. To ensure the phoneme durations add up to the note duration, a Fitting Heuristic method was introduced in [2, 12]. However, Fitting Heuristic duration scaling relies heavily on the selection of the primary vowel. In Peking Opera singing, very long notes are prevalent; when scaling phoneme durations for a very long note, which can contain more than 5 equally important phonemes, an incorrect choice of primary vowel leads to huge phoneme duration prediction errors. Instead, our system generates phoneme durations under a Lagrange multiplier constraint, which has been used in speech synthesis [13–15]. All scaling or compensation factors are obtained through data-driven methods rather than rules.

A bi-directional Long Short-Term Memory (Bi-LSTM) network [16, 17] with a mixture density output layer [9] is used as our phoneme duration model. For a note containing a total of M phonemes, the duration distribution L(d_i, \lambda) of each phoneme is represented as

L(d_i, \lambda) = \sum_{k=1}^{K} \Pi_i^k \, \mathcal{N}(d_i; \mu_i^k, (\sigma_i^k)^2)    (1)

subject to

\sum_{i=1}^{M} d_i = T, \quad i = 1, 2, \dots, M    (2)

where K is the total number of Gaussian mixtures per phoneme; \Pi_i^k, \mu_i^k and \sigma_i^k are the weight, mean and standard deviation of the k-th mixture of the i-th phoneme, respectively; and \lambda denotes the full set of MDN parameters mentioned above.
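As a concrete illustration of this constrained duration generation, the sketch below (our own reading of the computation, with illustrative helper names, not the released code) keeps the maximum-weight Gaussian per phoneme and shifts each mean by its standard deviation times a shared factor α = (T − Σ μ̂_i) / Σ σ̂_i, so the scaled durations sum exactly to the note duration T:

```python
# Hedged sketch of Lagrange-multiplier duration scaling (helper names are
# illustrative). Given per-phoneme MDN mixtures, keep the max-weight
# Gaussian and shift each mean by its std times a shared factor alpha.
import numpy as np

def constrained_durations(weights, means, stds, T):
    """weights/means/stds: (M, K) mixture parameters; T: note length in frames."""
    weights = np.asarray(weights, float)
    k = weights.argmax(axis=1)                       # max-weight mixture per phoneme
    mu = np.asarray(means, float)[np.arange(len(k)), k]
    sigma = np.asarray(stds, float)[np.arange(len(k)), k]
    alpha = (T - mu.sum()) / sigma.sum()             # shared compensation term
    return mu + sigma * alpha                        # sums to T exactly

# Two mixtures (K = 2) for three phonemes (M = 3):
d = constrained_durations(weights=[[0.7, 0.3]] * 3,
                          means=[[10, 99], [40, 99], [30, 99]],
                          stds=[[2, 1], [8, 1], [10, 1]],
                          T=100)
assert abs(d.sum() - 100) < 1e-9   # constraint satisfied
```

Phonemes with larger predicted variance absorb more of the stretch or compression, mirroring the Fitting Heuristic's rule that vowels stretch more, but in a data-driven way.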
In the training stage, the parameters of the proposed Bi-LSTM-based mixture density duration model are learned by back-propagating the negative log-likelihood error.

In the synthesis stage, to maximize the log MDN likelihood under the constraint of a given total note duration, a Lagrange multiplier is introduced and the task becomes solving

\frac{\partial}{\partial d_i, \alpha} \left[ \sum_{i=1}^{M} \log L(d_i, \lambda) - \alpha \left( \sum_{i=1}^{M} d_i - T \right) \right] = 0    (3)

where \alpha is the Lagrange multiplier. Because there is no closed-form solution for the maximum-likelihood optimization of a multi-Gaussian distribution, to simplify the form of the final solution we always choose the Gaussian with the maximum weight instead of using all mixtures. Thus,

d_i \sim L(d_i, \{\mu_i^k, \sigma_i^k\}) \quad \text{where} \quad k = \arg\max_k \Pi_i^k    (4)

Let \hat{\mu}_i, \hat{\sigma}_i be the mean and standard deviation of the Gaussian with the largest weight for the i-th phoneme; the solution of Eq. 3 is then

d_i = \hat{\mu}_i + \hat{\sigma}_i \cdot \alpha \quad \text{where} \quad \alpha = \frac{T - \sum_{i=1}^{M} \hat{\mu}_i}{\sum_{i=1}^{M} \hat{\sigma}_i}    (5)

For comparison, without the Lagrange multiplier constraint in Eq. 3, the maximum-likelihood result would simply be the Gaussian means \hat{\mu}_i. Eq. 5 can be interpreted as scaling each phoneme duration prediction \hat{\mu}_i by a compensation term \alpha, which is determined by 1) the gap between the music score note duration and the sum of the predicted phoneme durations, and 2) the phoneme's standard deviation. This indicates that phonemes with large variance, such as some vowels, can be extended or compressed more than those with smaller variance. This result is consistent with the duration scaling rules in the Fitting Heuristic, but is more generally formulated and obtained through a purely data-driven method.

2.3. Melody transcription

Although a published version of the Peking Opera score is available, the improvisation and expressiveness of the singer make the actual singing inconsistent with the score.
Efforts have been made to build a Peking Opera singing synthesis system directly from the original music scores, but the model turned out to be badly trained and the generation results were of poor quality. Alternatively, a pseudo music score is automatically transcribed from the actual Peking Opera singing to replace the original music score in the training stage. The melody transcription method proposed in [18] is used here. It is based on a Hidden Markov Model (HMM) [19] pitch tracking module which uses probabilistic YIN [20] pitch estimation. First, the frame-wise fundamental frequency of the singing, with temporal smoothness, is estimated by probabilistic YIN [20] pitch estimation, where an HMM is applied to decode from multiple pitch candidates. Then, the estimated pitch track is fed into another HMM to render frame-wise discrete note pitches. The HMM used in the transcription process has 3 states: Attack, Stable and Silent. When transitioning from the current note to the next, the note transition probabilities are calculated by a note transition probability function. A Peking Opera genre-specific note transition distribution is used for better note pitch decoding, achieving a note transcription F-score of 0.73. The output of the transcription algorithm is the frame-wise discrete note pitch of the singing, where the possible pitch values range from MIDI pitch 35 (B1) to MIDI pitch 85 (C♯6).

Figure 2: An example of the melody transcription results on Peking Opera singing, where the f0 curve shows the singing pitch and the blue dotted line indicates the transcribed result.
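As an illustration of the final quantization idea, the sketch below (a deliberate simplification under our own assumptions, with hypothetical names) snaps a frame-wise f0 track to the nearest MIDI note within the range used above; the actual system additionally relies on pYIN tracking and HMM note decoding, which are not reproduced here.

```python
# Naive frame-wise note quantization (illustrative only; the paper's
# transcription uses pYIN plus HMM note decoding on top of this idea).
import numpy as np

def f0_to_note(f0_hz, low=35, high=85):
    """Snap f0 in Hz to the nearest MIDI note, clipped to [low, high]."""
    midi = 69.0 + 12.0 * np.log2(np.asarray(f0_hz, float) / 440.0)
    return np.clip(np.rint(midi), low, high).astype(int)

f0 = np.array([220.0, 222.0, 440.0, 446.0])  # toy pitch track in Hz
notes = f0_to_note(f0)                       # [57 57 69 69]
```

An HMM decoder would additionally smooth over spurious single-frame note changes that this direct rounding leaves in place.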
3. Experiments and results
To demonstrate the effectiveness of the proposed methods, two experiments are conducted. In the phoneme duration prediction experiment, the objective phoneme duration prediction error is compared between the proposed duration model and the Fitting Heuristic method. In the Peking Opera synthesis experiment, generated pitch contours are drawn and a subjective Mean Opinion Score (MOS) test shows our system is capable of synthesizing Peking Opera with fair quality.
3.1. Dataset

Although popular for centuries, Peking Opera has received little research effort, and collecting Peking Opera training data is not a trivial task. One available data set is “Jingju a cappella singing” [21–24], which contains over 10 hours of a cappella Peking Opera recordings and annotations, but only 2 hours of the recordings are phoneme-annotated. These 2 hours contain 71 Peking Opera singing fragments, and the phoneme annotation adopts a modified X-SAMPA (Extended Speech Assessment Methods Phonetic Alphabet) phoneme set containing 51 phonemes. The singing data is further segmented according to line boundaries, resulting in a set of 606 singing phrases. All experiments are conducted using these 2 hours of annotated data.
3.2. Phoneme duration prediction

In order to compare our Lagrange multiplier-based phoneme duration prediction method with Fitting Heuristic-based methods, six phoneme duration prediction models are implemented and compared. First, the duration model used in the NPSS system [2] is implemented (CNN-Softmax-Fit), which consists of 1-D CNNs and a softmax output predicting durations discretized into 50 log-scale bins. To compare different duration scaling methods, the softmax output in CNN-Softmax-Fit is then replaced with a mixture density output with Lagrange multiplier duration optimization (CNN-MDN-Lag). Next, the duration model proposed in this paper is built (BiLSTM-MDN-Lag), which uses a BiLSTM with a mixture density output and Lagrange multiplier optimization. For comparison, the output layer of the proposed duration model is replaced with a scalar prediction and Fitting Heuristic scaling, trained with the mean square error (MSE) criterion (BiLSTM-MSE-Fit). To further compare the duration scaling methods, a variant of the proposed system using Fitting Heuristic scaling instead of the Lagrange multiplier optimization is built (BiLSTM-MDN-Fit), which uses the mean of the predicted distribution as the duration prediction. Last, the duration prediction method originally used in [12] is implemented (Mean-Fit), which uses the mean duration of each phoneme as a look-up table for duration prediction and the Fitting Heuristic for scaling.

Table 1: The mean phoneme duration prediction error, in number of frames, for different duration models.

Model               | all notes | notes < 2 s
BiLSTM-MSE-Fit      | 13.63     | 4.23
BiLSTM-MDN-Fit      | 17.15     | 4.16
CNN-MDN-Lag         | 14.74     | 4.75
CNN-Softmax-Fit [2] | 19.61     | 5.83
Mean-Fit [12]       | 18.69     | 5.52

In the experiment, the frame length is set to 10 ms. 64 Peking Opera singing phrases are randomly chosen from the database for testing, while the rest of the data is used for training. The CNN blocks are built according to [2]. All bidirectional LSTMs use 2 hidden layers with a 256-dimensional hidden layer size and a training dropout rate of 0.5, and all Fitting Heuristic scaling sets the second phoneme as the primary vowel. The number of Gaussian mixtures predicted by the model is set to 2, which we find yields the best results.

The results are shown in Table 1 as the average duration prediction error per phoneme, in number of frames. Because Peking Opera singing can contain extremely long notes lasting for seconds, two kinds of results are counted: 1) only music score notes shorter than 2 seconds, and 2) all notes. The objective average phoneme prediction errors show that the proposed duration model outperforms the other methods, achieving the minimum prediction error. Probably because the dynamic range of phoneme durations in Peking Opera is much larger than in normal speech or singing, 50 discrete log-scale bins are too coarse as a prediction target and thus introduce the largest prediction error in CNN-Softmax-Fit. From the results, we can see that the proposed Lagrange multiplier-based phoneme duration scaling consistently outperforms the Fitting Heuristic scaling by a large margin, and our proposed mixture density network based Lagrange multiplier phoneme duration generation method, BiLSTM-MDN-Lag, renders the best performance. It is worth noting that BiLSTM-MDN-Fit outperforms BiLSTM-MSE-Fit in predicting phoneme durations for shorter notes while performing worse when long notes are included. Further analysis shows this happens when the music note is long and Fitting Heuristic scaling is employed: the generated phoneme duration is dominated by the stretching length and, when the principal vowel is located incorrectly, the prediction error can be huge.
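The objective score reported above can be computed as in the sketch below; this is our reading of the metric's form (mean absolute per-phoneme error in 10 ms frames, optionally restricted to notes shorter than 2 s), not code from the paper, and the function name is hypothetical.

```python
# Assumed form of the Table 1 metric: mean absolute phoneme-duration
# prediction error in frames (frame length 10 ms, so 2 s = 200 frames).
import numpy as np

def mean_duration_error(pred, gt, note_frames=None, max_note_frames=None):
    """pred/gt: per-phoneme durations in frames; note_frames: length of the
    note each phoneme belongs to, used for the short-note-only condition."""
    err = np.abs(np.asarray(pred, float) - np.asarray(gt, float))
    if max_note_frames is not None:            # keep only short-note phonemes
        err = err[np.asarray(note_frames) < max_note_frames]
    return float(err.mean())

assert mean_duration_error([10, 20], [12, 26]) == 4.0
assert mean_duration_error([10, 20], [12, 26],
                           note_frames=[50, 300], max_note_frames=200) == 2.0
```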
3.3. Mean opinion score test

In order to evaluate the subjective naturalness of the proposed Peking Opera synthesis system, a Mean Opinion Score test is conducted, in which participants are asked to score singing samples from 1 to 5 according to the naturalness of the singing: 1 stands for most unnatural and unintelligible, 5 stands for most natural and almost the same as sung by a real person, and 2, 3 and 4 lie in between. Peking Opera music scores in MusicXML format are used as score input; they are first parsed into notes and corresponding phonemes and then fed to the duration model to obtain the duration of each phoneme. Using the predicted phoneme durations along with the notes and phonemes as input, singing samples are generated by the proposed system, and the MOS test is conducted on the generated singing. A MOS of 3.34 is obtained, which indicates that the proposed system can generate Peking Opera singing with fair quality. Generated audio samples can be found on the web page: https://lukewys.github.io/files/Peking-Opera-Synthesis-2020.html.
3.4. Generated pitch contour

The f0 contour of the synthesized singing is drawn together with the input music notes in Fig. 3. Different from pitch contours generated by normal TTS or singing systems, the pitch contours generated by our proposed system for Peking Opera show more ups and downs and larger variation. The vibrato and transition curves in Fig. 3 are consistent with the Peking Opera singing style and show the ability of the proposed system to generate the expressiveness of Peking Opera.

Figure 3: Generated Peking Opera singing f0 contour by the proposed system.
4. Conclusions
Improvisation and expressiveness in Peking Opera singing make it extremely difficult to synthesize this classical performing art. With the proposed MDN-based phoneme duration generation with Lagrange multiplier optimization, our system generates more accurate phoneme durations than the Fitting Heuristic phoneme duration scaling method. Pseudo music notes are generated through the melody transcription algorithm to solve the score inconsistency problem in training. Both the objective average predicted phoneme duration error and the generated pitch contours show that our system performs well in generating Peking Opera singing. However, as one can see from the MOS and the generated samples, there is still a gap between the generated singing and real performances in terms of naturalness. Our further work includes collecting and labelling more Peking Opera singing data, conducting larger-scale MOS tests with subjects with a musical background, and improving the quality and pitch accuracy of the generated singing.

5. References

[1] C. Lochbaum and J. Kelly, “Speech synthesis,” in Proceedings of the Speech Communication Seminar, 1962, pp. 583–596.
[2] M. Blaauw and J. Bonada, “A neural parametric singing synthesizer modeling timbre and expression from natural songs,” Applied Sciences, vol. 7, no. 12, p. 1313, 2017.
[3] Y.-H. Yi, Y. Ai, Z.-H. Ling, and L.-R. Dai, “Singing voice synthesis using deep autoregressive neural networks for acoustic modeling,” arXiv preprint arXiv:1906.08977, 2019.
[4] K. Kaewtip, F. Villavicencio, F.-Y. Kuo, M. Harvilla, I. Ouyang, and P. Lanchantin, “Enhanced virtual singers generation by incorporating singing dynamics to personalized text-to-speech-to-singing,” IEEE, 2019, pp. 6960–6964.
[5] Y. Wada, R. Nishikimi, E. Nakamura, K. Itoyama, and K. Yoshii, “Sequential generation of singing f0 contours from musical note sequences based on WaveNet,” IEEE, 2018, pp. 983–989.
[6] P. Chandna, M. Blaauw, J. Bonada, and E. Gomez, “WGANSing: A multi-voice singing voice synthesizer based on the Wasserstein-GAN,” arXiv preprint arXiv:1903.10729, 2019.
[7] Y. Hono, S. Murata, K. Nakamura, K. Hashimoto, K. Oura, Y. Nankaku, and K. Tokuda, “Recent development of the DNN-based singing voice synthesis system — Sinsy,” IEEE, Nov. 2018.
[8] C. Yu, H. Lu, N. Hu, M. Yu, C. Weng, K. Xu, P. Liu, D. Tuo, S. Kang, G. Lei et al., “DurIAN: Duration informed attention network for multimodal synthesis,” arXiv preprint arXiv:1909.01700, 2019.
[9] C. M. Bishop et al., Neural Networks for Pattern Recognition. Oxford University Press, 1995.
[10] Y. Wang, R. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio et al., “Tacotron: Towards end-to-end speech synthesis,” in Proc. Interspeech, 2017, pp. 4006–4010.
[11] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using RNN encoder–decoder for statistical machine translation,” 2014, pp. 1724–1734.
[12] M. Blaauw and J. Bonada, “Sequence-to-sequence singing synthesis using the feed-forward transformer,” IEEE, 2020, pp. 7229–7233.
[13] J. Yamagishi, “An introduction to HMM-based speech synthesis,” Technical Report, 2006.
[14] H. Zen, T. Masuko, K. Tokuda, T. Yoshimura, T. Kobayashi, and T. Kitamura, “State duration modeling for HMM-based speech synthesis,” IEICE Transactions on Information and Systems, vol. 90, no. 3, pp. 692–693, 2007.
[15] H. Lu, Y.-J. Wu, K. Tokuda, L.-R. Dai, and R.-H. Wang, “Full covariance state duration modeling for HMM-based speech synthesis,” IEEE, 2009, pp. 4033–4036.
[16] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[17] M. Schuster and K. K. Paliwal, “Bidirectional recurrent neural networks,” IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673–2681, 1997.
[18] R. Gong, Y. Yang, and X. Serra, “Pitch contour segmentation for computer-aided jingju singing training,” 2016, pp. 172–178.
[19] M. Mauch, C. Cannam, R. Bittner, G. Fazekas, J. Salamon, J. Dai, J. Bello, and S. Dixon, “Computer-aided melody note transcription using the Tony software: Accuracy and efficiency,” 2015.
[20] M. Mauch and S. Dixon, “pYIN: A fundamental frequency estimator using probabilistic threshold distributions,” IEEE, 2014, pp. 659–663.
[21] R. Gong, R. Caro, and T. Zhu, “Jingju a cappella recordings collection,” https://doi.org/10.5281/zenodo.3251760, 2019, accessed: 2020-05-01.
[22] R. Gong, R. C. Repetto, Y. Yang, and X. Serra, “Jingju a cappella singing dataset part 1,” https://doi.org/10.5281/zenodo.1323561, 2018, accessed: 2020-05-01.
[23] R. Gong, R. C. Repetto, and X. Serra, “Jingju a cappella singing dataset part 2,” https://doi.org/10.5281/zenodo.1421692, 2018, accessed: 2020-05-01.
[24] R. Gong and X. Serra, “Jingju a cappella singing dataset part 3,” https://doi.org/10.5281/zenodo.1286350, 2018, accessed: 2020-05-01.
[25] D. A. Black, M. Li, and M. Tian, “Automatic identification of emotional cues in Chinese opera singing,” 2014, pp. 250–255.
[26] M. Umbert, J. Bonada, M. Goto, T. Nakano, and J. Sundberg, “Expression control in singing voice synthesis: Features, approaches, evaluation, and challenges.”