PJS: phoneme-balanced Japanese singing voice corpus
Junya Koguchi (Meiji University, Japan) and Shinnosuke Takamichi (Graduate School of Information Science and Technology, The University of Tokyo, Japan). [email protected], shinnosuke [email protected]
Abstract
This paper presents a free Japanese singing voice corpus that can be used for highly applicable and reproducible singing voice synthesis research. A singing voice corpus helps develop singing voice synthesis, but existing corpora have two critical problems: data imbalance (singing voice corpora do not guarantee phoneme balance, unlike speaking-voice corpora) and copyright issues (data cannot be legally shared). To avoid these problems, we constructed the PJS (phoneme-balanced Japanese singing voice) corpus, which guarantees phoneme balance and is licensed under CC BY-SA 4.0; we composed its melodies using a phoneme-balanced speaking-voice corpus. This paper describes how we built the corpus.
Index Terms: singing voice corpus, singing voice synthesis, music information processing, phoneme balance
1. Introduction
With the recent developments in deep learning and signal processing, we can now synthesize high-quality singing voices. Various deep learning architectures have been utilized (e.g., feed-forward [1], recurrent [2], and auto-regressive types [3]), and many products have been launched (e.g., Sinsy [4] and NEUTRINO [5]).
Freely available singing voice corpora contribute to applicable and reproducible singing voice synthesis research. Corpora are being developed in many languages (e.g., Chinese [6], English [7], etc. [8]). The leading Japanese corpus, the large RWC Music Database [9, 10], was developed 15 years ago. While the RWC corpus was designed for more general use in music information research, the recently developed Tohoku Kiritan database [11] was designed for singing voice synthesis; it contains a selection of 50 songs made up of children's songs and anime songs. In contrast to these corpora, we aim to develop a smaller corpus that is easy to train machine learning models on. Existing corpora such as the HTS demo [12] and JVS-MuSiC [13] do not guarantee phoneme balance, which is an important factor in creating a smaller corpus: phoneme imbalance typically results in phonetic deficiencies in synthesized singing voices.
This paper describes the construction of a phoneme-balanced singing voice corpus named the phoneme-balanced Japanese singing voice (PJS) corpus. Using the Voice Actress Corpus [14], a phoneme-balanced speaking voice corpus, we composed melodies for 100 sentences. Additionally, our corpus contributes the following:
Singing and speaking voices: We recorded both singing voices and parallel speaking voices. This paired data contributes to speaking-singing research (e.g., [15]).
Descriptions of compositions: We provide descriptions of the melody compositions. These descriptions contribute to natural-language-based music information research.
CC BY-SA 4.0 license: All the data in our corpus is licensed under CC BY-SA 4.0. Therefore, our corpus is available for both research and commercial use, unlike existing corpora [6, 7, 8, 9, 10, 11, 12].
Availability online: Our corpus can be freely downloaded from our project page [16].
The following sections describe the details of the corpus.
2. Corpus design
Here, we list the directory structure of our corpus. [SENTENCE ID] in the directory name PJS100_[SENTENCE ID] is the sentence ID of the original speaking voice corpus [14].
PJS100_001/
    PJS100_001_song.wav
    PJS100_001_speech.wav
    PJS100_001.mid
    PJS100_001.xml
    PJS100_001.lab
    PJS100_001.txt
PJS100_002/
    ...
PJS100_100/
The directory PJS100_[SENTENCE ID] consists of the following files:
• PJS100_[SENTENCE ID]_song.wav: singing voice we composed using a sentence from the phoneme-balanced speaking-voice corpus [14] as the lyric
• PJS100_[SENTENCE ID]_speech.wav: speaking voice that utters a sentence from the phoneme-balanced speaking-voice corpus [14]
• PJS100_[SENTENCE ID].mid: MIDI file we used as the guide melody during recording
• PJS100_[SENTENCE ID].xml: MusicXML file that describes musical note information
• PJS100_[SENTENCE ID].txt: musical information of the song (e.g., genre, scale, artist, etc.)
We composed and recorded 100 phoneme-balanced sentences [14]. The following sections describe the composition and recording conditions.
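Given this layout, one corpus entry can be located programmatically. The following is a minimal sketch; the helper name `entry_paths` and the corpus root location are our own assumptions for illustration, not part of the corpus distribution:

```python
import os

def entry_paths(root, sentence_id):
    """Build the expected file paths for one PJS corpus entry.

    `root` is the directory containing PJS100_001 ... PJS100_100;
    `sentence_id` is the sentence ID (1-100) of the original
    speaking-voice corpus.  (Hypothetical helper for illustration.)
    """
    name = "PJS100_%03d" % sentence_id  # e.g., PJS100_001
    d = os.path.join(root, name)
    return {
        "song":   os.path.join(d, name + "_song.wav"),    # singing voice
        "speech": os.path.join(d, name + "_speech.wav"),  # parallel speaking voice
        "midi":   os.path.join(d, name + ".mid"),         # guide melody
        "xml":    os.path.join(d, name + ".xml"),         # MusicXML note information
        "lab":    os.path.join(d, name + ".lab"),         # label file
        "txt":    os.path.join(d, name + ".txt"),         # composition description
    }

paths = entry_paths("PJS", 1)
print(paths["song"])  # path of the singing-voice file of entry 001
```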
2.1. Composition

A native Japanese male in his twenties composed all the songs. He is not a professional composer but has work experience using his singing, composing, and recording skills. He composed melodies within his vocal range using each of the phoneme-balanced sentences. The musical notes he composed were written in PJS100_[SENTENCE ID].xml. He composed a variety of melodies (varying in genre, scale, etc.). Descriptions of the compositions were written in PJS100_[SENTENCE ID].txt. He also made a MIDI file (PJS100_[SENTENCE ID].mid) of the composed melody to guide the recording described below.

2.2. Recording

The composer was also the singer. While listening to the guide melody generated from the MIDI file, he recorded his singing voice so that his pitch and tempo would be as in sync with the guide as possible. To avoid the proximity effect of the microphone, we let him maintain 15 cm between the microphone and his mouth. The recording environment was a simple soundproof room in which we attached sound-absorbing materials to the walls. Because the room was not an anechoic chamber, we recorded background noise on each recording day for noise reduction after the recording. We used a Lewitt LCT 441 FLEX (cardioid mode) [17] microphone, a JZ MICROPHONES Pop Filter [18] windscreen, and an RME Fireface UCX [19] audio interface.
We also let him record his speaking voice in the same manner. We saved the singing and speaking voices in the 48 kHz-sampled RIFF WAV format.
3. Corpus specifications
The data size of the singing voice is larger than that of the speaking voice: the singing voice recordings total 27.20 minutes, while the speaking voice recordings total 12.09 minutes. Thus, although the texts are shared between the singing and speaking voices, the duration of the singing voice is longer than that of the speaking voice. This is consistent with existing work [15].
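Durations like these can be checked directly from the WAV headers with Python's standard `wave` module. This is a sketch under the assumption that the files follow the layout of Section 2; the function name is ours:

```python
import wave

def wav_duration_seconds(path):
    """Return the duration of a RIFF WAV file in seconds,
    computed from its header as frames / sample rate."""
    with wave.open(path, "rb") as w:
        return w.getnframes() / float(w.getframerate())

# Demo on a synthetic 1-second, 48 kHz silent file
# (stands in for a corpus file such as PJS100_001_song.wav):
import os, tempfile
tmp = os.path.join(tempfile.mkdtemp(), "demo.wav")
with wave.open(tmp, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)                    # 16-bit samples
    w.setframerate(48000)                # same sample rate as the corpus
    w.writeframes(b"\x00\x00" * 48000)   # one second of silence
print(wav_duration_seconds(tmp))  # 1.0
```

Summing this over all `*_song.wav` (or `*_speech.wav`) files and dividing by 60 reproduces the per-voice totals reported above.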
Figure 1 and Figure 2 show histograms of the keys and tempos of our corpus, respectively. As Figure 1 shows, the tonics are well balanced, while there are fewer songs in minor keys than in major keys. Moreover, as Figure 2 shows, the tempos are distributed in a range between 80 and 160 beats per minute (BPM), indicating that this corpus may be unsuitable for synthesizing songs with extremely slow or fast tempos or in a minor key.
Japanese songs typically use one musical note per Japanese syllable, but not always. Figure 3 shows an example of such an exception, PJS100_001.xml. The multisyllabic notes to-o-ji and myo-o-o-o can be found in the first and second musical bars, respectively, where "-" indicates the syllable boundary. This means that special processes (e.g., copying notes to each syllable [20]) are needed to train singing voice synthesizers.
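The note-copying idea can be sketched as follows. This is a minimal illustration of one interpretation of the process in [20], not the corpus's official tooling; the note representation (a syllable list plus a duration in beats, with the duration divided evenly) is our own assumption:

```python
def copy_note_to_syllables(syllables, duration):
    """Split one multisyllabic note into one note per syllable,
    dividing the note's duration evenly among the syllables.
    (Hypothetical preprocessing step for illustration.)"""
    share = duration / len(syllables)
    return [(syl, share) for syl in syllables]

# "to-o-ji": three syllables sung on a single note of 1.5 beats
print(copy_note_to_syllables(["to", "o", "ji"], 1.5))
# [('to', 0.5), ('o', 0.5), ('ji', 0.5)]
```

After this step, every note carries exactly one syllable, which matches the one-note-per-syllable assumption of typical singing voice synthesizers.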
4. Conclusion
This paper presented the PJS corpus, a freely available phoneme-balanced Japanese singing voice corpus. We guaranteed the phoneme balance of our corpus by composing music based on a phoneme-balanced speaking-voice corpus. Our corpus consists of singing voice data, parallel speaking-voice data, and the musical information of the songs. Therefore, our corpus can contribute to research areas beyond singing voice synthesis. In our future work, we will add a variety of singing styles, such as falsetto and growl voices.
Figure 1: Key histogram of our corpus. There are fewer songs in minor keys than in major keys.

Figure 2: Tempo histogram of our corpus. The songs only range from 80 to 160 beats per minute (BPM).
The PJS corpus is available on our project page [16]. All the data is licensed under the CC BY-SA 4.0 license.
Acknowledgements: Part of this research was supported by the GAP Foundation Program of the University of Tokyo.
Figure 3: Score of PJS100_001.xml. The lyrics are "mata tooji no yoo ni godai myoooo to yobareru shuyoo na myoooo no chuuoo ni haisareru koto mo ooi." Most (but not all) individual notes correspond to a single syllable. Some notes correspond to multiple syllables, such as to-o-ji in the first musical bar and myo-o-o-o in the second musical bar, where "-" indicates the syllable boundary.

5. References

[1] M. Nishimura, K. Hashimoto, K. Oura, Y. Nankaku, and K. Tokuda, "Singing voice synthesis based on deep neural networks," in Proc. INTERSPEECH, San Francisco, U.S.A., Sep. 2016, pp. 2478–2482.
[2] J. Kim, H. Choi, J. Park, M. Hahn, S. Kim, and J.-J. Kim, "Korean singing voice synthesis system based on an LSTM recurrent neural network," in Proc. INTERSPEECH, Hyderabad, India, Sep. 2018, pp. 1551–1555.
[3] M. Blaauw and J. Bonada, "A neural parametric singing synthesizer modeling timbre and expression from natural songs," Applied Sciences, vol. 7, no. 12, Dec. 2017.
[4] "Sinsy," .
[5] "NEUTRINO," https://n3utrino.work/.
[6] C.-L. Hsu and J.-S. R. Jang, "On the improvement of singing voice separation for monaural recordings using the MIR-1K dataset," IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 2, pp. 310–319, Feb. 2010.
[7] Z. Duan, H. Fang, B. Li, K. C. Sim, and Y. Wang, "The NUS sung and spoken lyrics corpus: A quantitative comparison of singing and speech," in Proc. APSIPA ASC, Kaohsiung, Taiwan, Oct. 2013, pp. 1–8.
[8] A. Liutkus, F.-R. Stöter, Z. Rafii, D. Kitamura, B. Rivet, N. Ito, N. Ono, and J. Fontecave, "The 2016 signal separation evaluation campaign," in Proc. LVA/ICA, Cham, Aug. 2017, pp. 323–332, Springer International Publishing.
[9] M. Goto and T. Nishimura, "AIST Humming Database: Music database for singing research," The Special Interest Group Notes of IPSJ (MUS), vol. 82, pp. 7–12, Aug. 2005.
[10] M. Goto, H. Hashiguchi, T. Nishimura, and R. Oka, "RWC music database: Popular, classical, and jazz music databases," in Proc. ISMIR, Paris, France, Oct. 2002, vol. 2, pp. 287–288.
[11] M. Morise, "Tohoku Kiritan singing voice corpus," https://zunko.jp/kiridev/login.php.
[12] "HMM-based speech synthesis system (HTS)," http://hts.sp.nitech.ac.jp/.
[13] H. Tamaru, S. Takamichi, N. Tanji, and H. Saruwatari, "JVS-MuSiC: Free Japanese multispeaker singing-voice corpus," arXiv preprint 2001.07044, Jan. 2020.
[14] y_benjo and MagnesiumRibbon, "Voice actress corpus," http://voice-statistics.github.io.
[15] Y. Ohishi, M. Goto, K. Itou, and K. Takeda, "Discrimination between singing and speaking voices," in Proc. EUROSPEECH, Lisbon, Portugal, Sep. 2005, pp. 1141–1144.
[16] "PJS: Phoneme-balanced Japanese singing voice corpus," https://sites.google.com/site/shinnosuketakamichi/research-topics/pjs_corpus.
[17] Lewitt, "LCT 441 FLEX," .
[18] JZ MICROPHONE, "Pop filter," https://intshop.jzmic.com/collections/accesories/products/pop-filter.
[19] RME, "Fireface UCX," .
[20] K. Nakamura, K. Oura, Y. Nankaku, and K. Tokuda, "HMM-based singing voice synthesis and its application to Japanese and English," in Proc. ICASSP, 2014.