Disentangling Timbre and Singing Style with Multi-singer Singing Synthesis System
Juheon Lee, Hyeong-Seok Choi, Junghyun Koo, Kyogu Lee
Music and Audio Research Group, Seoul National University
{juheon2, kekepa15, dg22302, kglee}@snu.ac.kr

ABSTRACT
In this study, we define the identity of a singer with two independent concepts – timbre and singing style – and propose a multi-singer singing synthesis system that can model them separately. To this end, we extend our single-singer model into a multi-singer model in the following ways: first, we design a singer identity encoder that can adequately reflect the identity of a singer. Second, we use the encoded singer identity to condition two independent decoders that model timbre and singing style, respectively. Through a user study with listening tests, we experimentally verify that the proposed framework is capable of generating a natural singing voice of high quality while independently controlling the timbre and singing style. Also, by changing the singing style while fixing the timbre, we suggest that our proposed network can produce a more expressive singing voice.
Index Terms — Singing voice synthesis, Singer identity, Timbre, Singing style
1. INTRODUCTION
Singing voice synthesis (SVS) is the task of generating a natural singing voice from given sheet music and lyrics. SVS is similar to text-to-speech (TTS) in that it synthesizes natural speech from text, but differs in that it requires controllability of the duration and pitch of each syllable. Following the development of TTS [1, 2], deep-neural-network-based methodologies have recently been studied for SVS, and their performance is comparable with existing concatenative methods [3].

After the successful development of single-singer models, research has been conducted to extend existing models to multi-singer systems. A multi-singer SVS system should not only produce natural pronunciation and pitch contours but also suitably reflect the identity of a particular singer. To achieve this, methods for adding conditional inputs reflecting the singer's identity to the network have been proposed [4, 5].

In this study, we break down a singer's identity into two independent factors: timbre and singing style. Timbre is defined as the factor that allows us to distinguish between two voices even when the singers are singing with the same pitch and pronunciation, and it is generally known to be related to the singer's formant frequencies [6, 7]. Meanwhile, singing style can be defined as the expression of a singer, that is, the natural realization of a pitch sequence from sheet music, including singing skills such as legato, vibrato, and so on. An expressive SVS system should be able to synthesize both elements effectively, and it becomes more powerful if the user can control them independently.

Fig. 1. Proposed method to reflect singer identity in the multi-singer SVS system.

To this end, we propose a conditioning method that can model timbre and singing style separately while extending our existing single-singer SVS system [8] to a multi-singer system. First, we add a singer identity encoder to the baseline model to capture the singer's global identity. Then we independently condition the two decoders responsible for formant frequency and pitch contour on the encoded singer identity so that timbre and singing style can be reflected, as shown in Fig. 1. Because the proposed network can control the two factors independently, cross-generation that combines different singers' timbres and singing styles is also possible. Using this, we generated singing voices that reflect the timbre or singing style of a particular singer and conducted a listening test, confirming that the network can generate a high-quality singing voice while actually reflecting each identity.

The contributions of this paper are as follows: we propose a multi-singer SVS system that produces a natural singing voice, and we propose a new perspective on singer identity – timbre and singing style – together with an independent conditioning method that can model them effectively.

Fig. 2. The overview of the proposed multi-singer SVS system.
2. RELATED WORKS

2.1. Single-singer SVS system
The concatenative method, a typical SVS approach such as [9, 10, 11], synthesizes the singing voice for a given query based on pre-recorded singing data. This method has the advantage of high sound quality because it uses the human voice directly, but it has the limitation of requiring an extensive data set every time a new system is designed. For more flexible systems, parametric methods have been proposed that directly predict the parameters that make up the singing voice [12, 13, 3]. These methods overcome the disadvantages of the concatenative method but are limited by the performance of the vocoder itself. Recently, research has been conducted on directly generating spectrograms with fully end-to-end methods [8], and on designing vocoders as trainable neural networks [5]. In this study, we build on an end-to-end network that directly generates a linear spectrogram.
2.2. Multi-singer SVS system

Research on extending the SVS system to multi-singer systems has been conducted relatively recently. [4] proposes a method of expressing each singer's identity with a one-hot embedding. This method is straightforward, but has the limitation of requiring re-training each time a new singer is added. A method of learning a trainable embedding directly from the singer's singing query, for a more general singer identity, is proposed in [2]. Our proposed method differs from previous works in that it directly maps the singing query into an embedding and defines the singer identity as two independent factors, timbre and singing style.
3. PROPOSED SYSTEM
We propose a multi-singer SVS system that can model timbre and singing style independently. We designed the network with [8] as the baseline and extended the existing model to a multi-singer model by adding 1) a singer identity encoder and 2) a timbre/singing style conditioning method. As shown in Fig. 2, our model uses text T_{1:L}, pitch P_{1:L}, previously generated mel-spectrogram frames M_{1:L-1}, and a singing voice query Q as inputs. Each input is encoded via an encoder and then decoded with the formant mask decoder and the pitch skeleton decoder. The formant mask decoder generates a pronunciation- and timbre-related feature FM from the encoded text E_T and query E_Q. The pitch skeleton decoder generates a pitch- and style-related feature PS from the encoded mel-spectrogram E_M, pitch E_P, and query E_Q. The estimated mel-spectrogram M̂_{1:L}, which is the element-wise product of FM and PS, is converted to an estimated linear spectrogram Ŝ_{1:L'} via a super-resolution network. Finally, to make the generated linear spectrogram more realistic, we apply adversarial training with an added discriminator. Please refer to [8] for more detailed information on each module of the network. The generation process of the entire network is summarized as follows:

Ŝ = SR(M̂) = SR(FM(T, Q) ⊙ PS(M, P, Q)).   (1)

Expanding the single-singer model to the multi-singer model requires an additional input carrying singer identity information. To achieve this, we designed a singer identity encoder that directly maps the singer's singing voice into an embedding vector. The network structure is shown in Fig. 3. A singing query is passed through two 1d-convolutional layers and an average-time-pooling layer to capture global time-invariant characteristics while eliminating changes over time. Then, the pooled embedding is converted into a 256-dimensional embedding vector through a dense layer and tiled to match the number of time frames of the features. Finally, it is used as a conditioning embedding vector for the pitch skeleton decoder and the formant mask decoder, respectively.

Fig. 3. Singer identity encoder structure and conditioning method. HWC and HWNC denote the highway causal and non-causal convolutional modules proposed in [14], and Conv1d, Dense, relu, and sigmoid denote a 1d-convolutional layer, a fully connected layer, a rectified linear unit, and a sigmoid activation, respectively.
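The paper does not provide reference code for this encoder; the following PyTorch sketch is only a rough illustration of the structure just described. The 80-bin mel query, the two 1-d convolutions, the average time pooling, the dense layer, the 256-dimensional embedding, and the tiling follow the text, while the kernel sizes, channel widths, and the activation between layers are our assumptions.

```python
import torch
import torch.nn as nn

class SingerIdentityEncoder(nn.Module):
    """Sketch of the singer identity encoder in Fig. 3 (hyperparameters assumed)."""

    def __init__(self, n_mels=80, hidden=256, emb_dim=256):
        super().__init__()
        self.conv1 = nn.Conv1d(n_mels, hidden, kernel_size=3, padding=1)
        self.conv2 = nn.Conv1d(hidden, hidden, kernel_size=3, padding=1)
        self.dense = nn.Linear(hidden, emb_dim)

    def forward(self, query_mel, n_frames):
        # query_mel: (batch, n_mels, T_query) mel-spectrogram of the singing query
        h = torch.relu(self.conv1(query_mel))
        h = torch.relu(self.conv2(h))
        h = h.mean(dim=2)                  # average time pooling -> time-invariant summary
        emb = self.dense(h)                # 256-dim singer identity embedding
        # tile along time so it matches the number of decoder feature frames
        return emb.unsqueeze(2).expand(-1, -1, n_frames)   # (batch, emb_dim, n_frames)
```

In the full system, this tiled embedding would be passed, through the conditioning described next, to both the formant mask decoder and the pitch skeleton decoder.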
In this section, we provide details of our conditioning method for modeling timbre and singing style separately. Our baseline network generates a mel-spectrogram by multiplying two different features, the formant mask and the pitch skeleton. The formant mask is responsible for regulating formant frequencies to model the pronunciation information corresponding to the input text, while the pitch skeleton creates natural pitch contours from the input pitch. We noted that the singer identity embedding can be reflected in each of these features in different ways. In other words, we assumed that the singer identity embedding had to be conditioned on the formant mask decoder to control the timbre, and that, to control the singing style, it had to be conditioned on the pitch skeleton decoder that forms the shape of the pitch contour. Based on this assumption, we conditioned the singer identity embedding independently on each of the two decoders. We used the global conditioning method proposed in [15], formulated as follows:

z(x, c) = σ(W ∗ x + V ∗ c) ⊙ relu(W ∗ x + V ∗ c),   (2)

where x is the target to be conditioned, c is the condition vector, and W ∗ x and V ∗ c denote 1d-convolutions.
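To make Eq. (2) concrete, the sketch below implements the global conditioning of [15] with the sigmoid/relu gating used here. The channel sizes and 1×1 kernels are illustrative assumptions, as is the use of separate weights for the gate and activation branches (the collapsed notation in Eq. (2) reuses W and V for both).

```python
import torch
import torch.nn as nn

class GatedGlobalConditioning(nn.Module):
    """Sketch of Eq. (2): z(x, c) = sigmoid(W*x + V*c) ⊙ relu(W*x + V*c)."""

    def __init__(self, x_channels=256, c_channels=256):
        super().__init__()
        self.w_gate = nn.Conv1d(x_channels, x_channels, kernel_size=1)
        self.v_gate = nn.Conv1d(c_channels, x_channels, kernel_size=1)
        self.w_act = nn.Conv1d(x_channels, x_channels, kernel_size=1)
        self.v_act = nn.Conv1d(c_channels, x_channels, kernel_size=1)

    def forward(self, x, c):
        # x: (batch, x_channels, T) decoder features to be conditioned
        # c: (batch, c_channels, T) singer identity embedding tiled over time
        gate = torch.sigmoid(self.w_gate(x) + self.v_gate(c))
        act = torch.relu(self.w_act(x) + self.v_act(c))
        return gate * act   # element-wise product, as in Eq. (2)
```

Because each decoder receives its own copy of this conditioning, cross-generation amounts to feeding the formant mask decoder the embedding of one singer and the pitch skeleton decoder the embedding of another before the two features are multiplied as in Eq. (1).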
4. EXPERIMENT

4.1. Dataset and preprocessing
For training, we used 255 songs of singing voice from a total of 15 singers. Three inputs (text, pitch, and mel-spectrogram) were extracted from the lyrics text, MIDI, and audio data, respectively. The query singing voice for singer identity embedding was randomly chosen from other singing sources of the same singer. One of each singer's recorded songs was used as test data, and the rest were used to train the network.

We preprocessed the training data in the same way as [8] for all input features except the singing query for singer identity embedding. The sampling rate was set to 22,050 Hz. The preprocessing step for the singing query is as follows. First, we randomly selected an approximately 12-second section from the singer's singing voice source. Then, we set both the window size and hop length to 1,024 and converted the singing voice waveform into a mel-spectrogram of 80 dimensions and 256 frames, which was used as the singing query.
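The paper fixes the sampling rate, window/hop size, mel dimension, and query length, but not the rest of the pipeline (for example, whether a log scale or normalization is applied), so the librosa-based sketch below is only one plausible realization; the function make_singing_query and its arguments are our own naming, and it assumes the source is longer than 12 seconds.

```python
import numpy as np
import librosa

def make_singing_query(wav_path, sr=22050, n_fft=1024, hop=1024,
                       n_mels=80, n_frames=256, seed=None):
    """Randomly crop ~12 s from a singing source and return an 80 x 256 mel query."""
    rng = np.random.default_rng(seed)
    y, _ = librosa.load(wav_path, sr=sr)
    seg_len = hop * n_frames                      # 1024 * 256 samples ~= 11.9 s at 22.05 kHz
    start = rng.integers(0, max(1, len(y) - seg_len))
    segment = y[start:start + seg_len]
    mel = librosa.feature.melspectrogram(y=segment, sr=sr, n_fft=n_fft,
                                         hop_length=hop, n_mels=n_mels)
    return mel[:, :n_frames]                      # (80, 256) singing query
```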
We trained the network in the same way as proposed in [8], except that samples from different singers were distributed evenly in each mini-batch. Inference was also conducted in the same way as in the previous study, but for the tests showing that timbre and singing style can be controlled separately, we generated test samples through cross-generation, which generates the pitch skeleton and the formant mask from different singer embeddings (audio samples are available at https://juheo.github.io/DTS).

We compared the spectrograms generated with different singer embeddings for the same pitch and text to see the effect of the singer identity embedding. As shown in Fig. 4, each spectrogram has a similar overall shape but includes local differences. In the case of the formant mask, female vocals have strong intensity in high-frequency areas, while the corresponding frequency areas for male vocals are shifted down. This is in line with the fact that males generally have lower formant frequencies even under the same-pitch condition. Even within the same gender, the shape of the formant mask differs, from which we confirm that the singer embedding appropriately reflects the timbre of each singer. Likewise, the pitch skeleton differs depending on the singer, as seen in the position of the onsets/offsets, the slope near them, the intensity of vibrato, and the shape of the unvoiced areas. From this, we confirm that the singer identity embedding effectively affects the style of the pitch skeleton. Note that although the conditioning embedding is identical across time, changes in the style of the pitch skeleton over time are still observed. We argue that this is because our network generates the singing voice in an auto-regressive way, so it can reflect the style differences of different singers along the time axis.

We also observed several changes as we interpolated between two different singer identity embeddings, from a female to a male vocalist. For example, the high-frequency area of the formant mask was gradually lowered, and the vibrato of the pitch skeleton was gradually strengthened. From this, we confirmed that the singer embedding not only reflects the identities of different singers but also contains appropriate information about the transition between them.

Fig. 4. Generated mel-spectrograms with various singer embeddings (top) and interpolated singer embeddings (bottom). FM, PS, and M̂ denote the formant mask, pitch skeleton, and estimated mel-spectrogram, respectively.

We conducted a listening test with a total of six different male and female singers' voices for qualitative evaluation. We generated two vocal samples for each singer for a randomly selected song. To show that the proposed network has no degradation in performance even when it independently controls singing style and timbre, we also created two samples combining each singer's formant mask with another singer's pitch skeleton and used them for evaluation. Twenty-six participants were asked to evaluate the pronunciation accuracy, sound quality, and naturalness of the test samples. The results are shown in Table 1.
Table 1. Listening test results (9-point scale, mean ± standard deviation).

Model                   Pronun. acc.   Sound quality   Naturalness
proposed (w/o cross)    . ± .44        5. ± .44        5. ± .
proposed (w/ cross)     . ± .39        5. ± .76        5. ± .
Ground truth            . ± .50        6. ± .96        6. ± .

A paired t-test [16] shows no significant difference for any item, regardless of whether cross-generation was carried out. We also confirmed that there is no significant difference from the ground-truth samples in pronunciation accuracy. From this, we conclude that our proposed network can combine different timbres and singing styles without performance degradation, and can generate a singing voice whose pronunciation accuracy matches the ground-truth samples.
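The paper reports only the outcome of the paired t-test; a minimal sketch of how such a per-participant comparison could be run with SciPy is given below. The function name, variable names, and the 0.05 threshold are our assumptions, not details from the paper.

```python
from scipy import stats

def compare_listening_scores(scores_a, scores_b, alpha=0.05):
    """Paired t-test over per-participant ratings of the same item
    (e.g., naturalness with vs. without cross-generation)."""
    res = stats.ttest_rel(scores_a, scores_b)
    # "no significant difference" in the paper corresponds to p >= alpha
    return res.statistic, res.pvalue, res.pvalue < alpha
```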
We also conducted a classification test to verify that the network generates results that reflect timbre and singing style independently. We prepared a total of 20 test sets, 10 each for judging timbre and singing style, and each test set consisted of three sources, A, B, and C. A and B are singing voices generated without cross-generation, and C is cross-generated using its own timbre/style while referencing the style/timbre of either A or B. By comparing these samples, participants were asked to judge whether sample C's timbre/style was a closer match to A or to B. Considering gender differences, we divided the three singers' genders evenly across every possible combination; the results are shown in Fig. 5.
Fig. 5. Timbre and style classification test results.

According to the results of the experiment, 8% of participants chose incorrect answers for timbre and 31% for singing style. The correct-answer rate for singing style was lower than that for timbre, which we attribute to the training data consisting of amateur vocals whose styles are relatively indistinct. Nevertheless, more than half of the participants gave the correct answer, from which we conjecture that our network is able to generate a timbre and singing style that match a given singer identity query at a level that humans can perceive.
5. CONCLUSION
In this study, we proposed a multi-singer SVS system that can independently model and control a singer's timbre and singing style. We disentangled the identity of the singer by conditioning the singer identity embedding independently on two decoders. The listening test showed that our system can produce high-quality and accurate singing comparable to the ground-truth singing voice. Through listening tests that classify the timbre and singing style of the generated samples, we showed that both elements can be controlled independently.

6. REFERENCES

[1] Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, et al., "Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 4779–4783.

[2] Yutian Chen, Yannis Assael, Brendan Shillingford, David Budden, Scott Reed, Heiga Zen, Quan Wang, Luis C. Cobo, Andrew Trask, Ben Laurie, et al., "Sample efficient adaptive text-to-speech," arXiv preprint arXiv:1809.10460, 2018.

[3] Merlijn Blaauw and Jordi Bonada, "A neural parametric singing synthesizer modeling timbre and expression from natural songs," Applied Sciences, vol. 7, no. 12, p. 1313, 2017.

[4] Pritish Chandna, Merlijn Blaauw, Jordi Bonada, and Emilia Gomez, "WGANSing: A multi-voice singing voice synthesizer based on the Wasserstein-GAN," arXiv preprint arXiv:1903.10729, 2019.

[5] Merlijn Blaauw, Jordi Bonada, and Ryunosuke Daido, "Data efficient voice cloning for neural singing synthesis," in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 6840–6844.

[6] Johan Sundberg, "Level and center frequency of the singer's formant," Journal of Voice, vol. 15, no. 2, pp. 176–186, 2001.

[7] Thomas F. Cleveland, "Acoustic properties of voice timbre types and their influence on voice classification," The Journal of the Acoustical Society of America, vol. 61, no. 6, pp. 1622–1629, 1977.

[8] Juheon Lee, Hyeong-Seok Choi, Chang-Bin Jeon, Junghyun Koo, and Kyogu Lee, "Adversarially trained end-to-end Korean singing voice synthesis system," in Proc. Interspeech 2019, 2019, pp. 2588–2592.

[9] Michael Macon, Leslie Jensen-Link, E. Bryan George, James Oliverio, and Mark Clements, "Concatenation-based MIDI-to-singing voice synthesis," in Audio Engineering Society Convention 103. Audio Engineering Society, 1997.

[10] Hideki Kenmochi and Hayato Ohshita, "VOCALOID - commercial singing synthesizer based on sample concatenation," in Eighth Annual Conference of the International Speech Communication Association, 2007.

[11] Jordi Bonada, Alex Loscos, and H. Kenmochi, "Sample-based singing voice synthesizer by spectral concatenation," in Proceedings of the Stockholm Music Acoustics Conference, 2003, pp. 1–4.

[12] Keijiro Saino, Heiga Zen, Yoshihiko Nankaku, Akinobu Lee, and Keiichi Tokuda, "An HMM-based singing voice synthesis system," in Ninth International Conference on Spoken Language Processing, 2006.

[13] Kazuhiro Nakamura, Keiichiro Oura, Yoshihiko Nankaku, and Keiichi Tokuda, "HMM-based singing voice synthesis and its application to Japanese and English," in 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2014, pp. 265–269.

[14] Hideyuki Tachibana, Katsuya Uenoyama, and Shunsuke Aihara, "Efficiently trainable text-to-speech system based on deep convolutional networks with guided attention," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 4784–4788.

[15] Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu, "WaveNet: A generative model for raw audio," arXiv preprint arXiv:1609.03499, 2016.

[16] Graeme D. Ruxton, "The unequal variance t-test is an underused alternative to Student's t-test and the Mann–Whitney U test," Behavioral Ecology, vol. 17, no. 4, pp. 688–690, 2006.