Singing voice phoneme segmentation by hierarchically inferring syllable and phoneme onset positions
Rong Gong, Xavier Serra
Music Technology Group, Universitat Pompeu Fabra, Barcelona, Spain [email protected], [email protected]
Abstract
In this paper, we tackle the singing voice phoneme segmentation problem in the singing training scenario by using language-independent information: onsets and prior coarse durations. We propose a two-step method. In the first step, we jointly calculate the syllable and phoneme onset detection functions (ODFs) using a convolutional neural network (CNN). In the second step, the syllable and phoneme boundaries and labels are inferred hierarchically by using a duration-informed hidden Markov model (HMM). To achieve the inference, we incorporate the a priori duration model as the transition probabilities and the ODFs as the emission probabilities into the HMM. The proposed method is designed in a language-independent way such that no phoneme class labels are used. For the model training and algorithm evaluation, we collect a new jingju (also known as Beijing or Peking opera) solo singing voice dataset and manually annotate the boundaries and labels at the phrase, syllable and phoneme levels. The dataset is publicly available. The proposed method is compared with a baseline method based on hidden semi-Markov model (HSMM) forced alignment. The evaluation results show that the proposed method outperforms the baseline by a large margin on both the segmentation and the onset detection tasks.
Index Terms: singing voice phoneme segmentation, onset detection, convolutional neural network, multi-task learning, duration-informed hidden Markov model
1. Introduction
The objective of this work lies in the background of automatic singing voice pronunciation assessment (ASVPA) for jingju music. Jingju singing is sung in the Mandarin language varied by two Chinese dialects. In a professional jingju singing training situation, the teacher is very demanding of the students regarding the precise pronunciation of each syllable and phoneme.

We design an ASVPA system according to the learning-by-imitation method, which is used as the basic training method by many musical traditions, as well as by jingju singing [1]. In practice, this method contains three steps for teaching how to sing a musical phrase: (i) the teacher first gives a demonstrative singing; (ii) the student is then asked to imitate it; (iii) the teacher provides feedback by assessing the student's singing at the syllable or phoneme level. Steps (ii) and (iii) are repeated until the teacher is satisfied with the student's singing. We design a three-step ASVPA system in consideration of the above training method: (a) the teacher's demonstrative and the student's imitative singing voice audios are recorded. The former is manually pre-segmented and labeled at the phrase, syllable and phoneme levels; the latter is manually pre-segmented and labeled only at the phrase level. (b) The student's singing voice is then automatically segmented and labeled at the phoneme level using the method proposed in this paper. (c) The corresponding phonemes between the teacher's and the student's recordings are finally compared by a phonetic pronunciation similarity algorithm, and the similarity score is given to the student as her/his pronunciation score.

In this paper, we approach step (a) manually so that the coarse durations and labels of the teacher's demonstrative singing are available as prior information for step (b). We tackle the phoneme segmentation problem of step (b). Step (c) remains a work in progress.
The first topic highly related to our research is speech forced alignment. Speech forced alignment is the process of aligning an orthographic transcription with the speech audio at the word or phone level. Most non-commercial alignment tools are built on the HTK [2] or Kaldi [3] frameworks, such as the Montreal Forced Aligner [4] and the Penn Forced Aligner [5]. These tools implement a part of the automatic speech recognition (ASR) pipeline: they train the HMM acoustic models iteratively using the Viterbi algorithm and align audio features (e.g. MFCCs) to the HMM states. Brognaux and Drugman [6] explored forced alignment on a small dataset using supplementary acoustic features and initializing the silence model with a voice activity detection algorithm. To predict the confidence measure of the aligned word boundaries and to fine-tune their time positions, Serrière et al. [7] explored an alignment post-processing method using a deep neural network (DNN). Forced alignment is language-dependent, in that the acoustic models have to be trained on the corpus of a certain language. Another category of speech segmentation methods is language-independent and relies on detecting phoneme boundary changes in the temporal-spectral domain [8, 9]. The drawback of these methods is that their segmentation accuracies are poorer than those of the language-dependent counterparts [10].

The second topic related to our research is singing voice lyrics-to-audio alignment. Most of these works [11, 12, 13, 14, 15, 16, 17, 18] used the forced alignment method accompanied by music-related techniques. Loscos et al. [12] used MFCCs with additional features and also explored specific HMM topologies. Fujihara et al. [13] used voice/accompaniment separation to deal with mixed recordings, and vocal detection and fricative detection to increase the alignment performance. Additional musical side information extracted from the musical score is used in many works. Mauch et al. [14] used chord information such that each HMM state contains both chord and phoneme labels. Iskandar et al. [15] constrained the alignment by using a musical note length distribution. Gong et al. [16], Kruspe [17], and Dzhambazov and Serra [18] all used syllable/phoneme durations extracted from the musical score and decoded the alignment path with duration-explicit HMM models. Chien et al. [19] introduced an approach based on vowel likelihood models. Chang and Lee [20] used canonical time warping and repetitive vowel patterns to find the alignment of the vowel sequence. Some other works achieved the alignment at the music structure level [21] or line level [22].

Our research is also related to multi-task learning (MTL) because we want to achieve the segmentation on jointly learned syllable and phoneme ODFs. MTL means learning by optimizing more than one loss function [23, 24]. Hard parameter sharing is the most commonly used MTL approach; it is applied by sharing the hidden layers between all tasks and keeping several task-specific output layers [23]. Baxter argued that hard parameter sharing can reduce the risk of overfitting in the order of the number of tasks. In the music information retrieval (MIR) domain, Yang et al. proposed an MTL framework based on neural networks to jointly consider the chord and root note recognition problems [25]. Vogl et al. showed that learning beats jointly with drums can be beneficial for the task of drum detection [26].
In this paper, we present a new jingju solo singing voice dataset for phoneme segmentation, which is manually annotated at three hierarchical levels: phrase, syllable and phoneme (section 2). We propose a language-independent phoneme segmentation method which jointly learns syllable and phoneme onsets and hierarchically infers the phoneme boundaries and labels with a duration-informed HMM (section 3). Finally, we build a forced alignment baseline method based on an HSMM and compare it with the proposed method (section 4).
2. Dataset
The jingju solo singing voice dataset focuses on the two most important jingju role-types (performing profiles) [27]: dan (female) and laosheng (old man). It has been collected by researchers at the Centre for Digital Music, Queen Mary University of London [28] and the Music Technology Group, Universitat Pompeu Fabra.
Table 1: Statistics of the dataset.
The dataset contains 95 recordings split into train and test sets (table 1). The recordings in the test set are student imitative singing. Their teacher demonstrative recordings can be found in the train set, which guarantees that the coarse syllable/phoneme durations and labels are available for the algorithm testing. Audios are pre-segmented into singing phrases. The syllable/phoneme ground truth boundaries (onsets/offsets) and labels are manually annotated in Praat [29] by two Mandarin native speakers and a jingju musicologist. 29 phoneme categories are annotated, which include a silence category and a non-identifiable phoneme category, e.g. throat-clearing. The category table can be found on the Github page (https://goo.gl/fFr9XU). The dataset is publicly available (https://doi.org/10.5281/zenodo.1185123).
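As an illustration of how the Praat annotations could be consumed, here is a minimal sketch using the third-party textgrid Python package; the file name and the tier name are hypothetical placeholders and may not match the actual annotation files.

```python
# Minimal sketch: reading boundaries and labels from a Praat TextGrid.
# Assumes the third-party "textgrid" package (pip install textgrid); the file
# path and tier name below are hypothetical.
import textgrid

tg = textgrid.TextGrid.fromFile("example_phrase.TextGrid")

for tier in tg.tiers:
    if tier.name != "phoneme":          # hypothetical tier name
        continue
    for interval in tier.intervals:
        onset, offset, label = interval.minTime, interval.maxTime, interval.mark
        if label.strip():               # skip empty (unlabeled) intervals
            print(f"{onset:.3f}\t{offset:.3f}\t{label}")
```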
3. Proposed method
We introduce a coarse duration-informed phoneme segmentation method. The syllable and phoneme onset ODFs are jointly learned by a hard parameter sharing multi-task CNN model. The syllable/phoneme boundaries and labels are then inferred by an HMM using the a priori duration model as the transition probabilities and the ODFs as the emission probabilities.

Log-mel context input: We use the MADMOM Python package (https://github.com/CPJKU/madmom) to calculate the log-mel spectrogram of the student's singing audio. The frame size and hop size of the spectrogram are respectively 46.4 ms (2048 samples) and 10 ms (441 samples). The low and high frequency bounds are 27.5 Hz and 16 kHz. We use a log-mel context as the CNN model input, where the context size is 80 × 15 (log-mel bins × frames). Thus the CNN model takes a binary onset/non-onset decision sequentially for every frame given its context of ±7 frames.

Preparing target labels: The target labels of the training set are prepared according to the ground truth annotations. We set the label of a certain context to 1 if an onset has been annotated for its corresponding frame, otherwise to 0. To compensate for human annotation inaccuracy and to augment the positive sample size, we also set the labels of the two neighboring contexts to 1. However, the importance of the neighboring contexts should not be equal to that of their center context, so we compensate by setting the sample weights of the neighboring contexts to 0.25. A similar sample weighting strategy has been presented in Schlüter's paper [30]. Finally, for each log-mel context, we have its syllable and phoneme labels. They will be used as the training targets in the CNN model to predict the onset presence.
Figure 1: Diagram of the multi-task CNN model.
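The sketch below illustrates, under stated assumptions, how such log-mel contexts and weighted onset targets could be prepared. It is not the authors' code: it uses librosa instead of MADMOM for brevity, and all function and variable names are hypothetical.

```python
# Illustrative input/target preparation following the description above:
# 2048-sample frames, 441-sample hop, 80 mel bins, 27.5 Hz - 16 kHz, +/-7 frame context.
import numpy as np
import librosa

def logmel_contexts(audio_path, context=7):
    y, sr = librosa.load(audio_path, sr=44100)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048, hop_length=441,
                                         n_mels=80, fmin=27.5, fmax=16000)
    logmel = librosa.power_to_db(mel)                       # shape (80, n_frames)
    padded = np.pad(logmel, ((0, 0), (context, context)), mode="edge")
    # One 80 x 15 context per frame, centered on that frame.
    return np.stack([padded[:, t:t + 2 * context + 1]
                     for t in range(logmel.shape[1])])      # (n_frames, 80, 15)

def onset_targets(onset_times, n_frames, hop_s=0.01):
    """Frame-level 0/1 onset targets plus sample weights (0.25 for the neighbors)."""
    labels = np.zeros(n_frames, dtype=np.float32)
    weights = np.ones(n_frames, dtype=np.float32)
    centers = [int(round(t / hop_s)) for t in onset_times]  # onset times in seconds
    for f in centers:
        if 0 <= f < n_frames:
            labels[f] = 1.0                                  # annotated onset frame
    for f in centers:
        for g in (f - 1, f + 1):                             # two neighboring contexts
            if 0 <= g < n_frames and labels[g] == 0.0:
                labels[g] = 1.0
                weights[g] = 0.25                            # down-weighted positives
    return labels, weights
```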
Hard parameter sharing multi-task CNN model: We build a CNN for classifying each log-mel context and outputting the syllable and phoneme ODFs. We extend the CNN architecture presented in Schlüter's work [30] by using two prediction objectives, syllable and phoneme (figure 1). The two objectives share the same parameters, and both use the sigmoid activation function. Binary cross-entropy is used as the loss function. The loss weighting coefficients for the two objectives are set equal, since no significant effect was found in a preliminary experiment. The model parameters are learned with mini-batch training (batch size 256), the Adam [31] update rule and early stopping (training is stopped if the validation loss has not decreased after 15 epochs). The ODFs output by the CNN model are used as the emission probabilities for the syllable/phoneme boundary inference.
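A minimal sketch of such a hard parameter sharing model is shown below, assuming a Keras/TensorFlow implementation. The shared trunk is illustrative only and does not reproduce the exact architecture of [30]; only the two sigmoid heads, the binary cross-entropy losses with equal weights, the Adam optimizer and the early stopping criterion follow the description above.

```python
# Illustrative hard parameter sharing multi-task CNN: shared convolutional trunk,
# two sigmoid heads for the syllable and phoneme ODFs, equal loss weights,
# Adam update rule and early stopping.
import tensorflow as tf
from tensorflow.keras import layers, Model, callbacks

inp = layers.Input(shape=(80, 15, 1))                  # one log-mel context
x = layers.Conv2D(10, (3, 7), activation="relu")(inp)  # illustrative filter sizes
x = layers.MaxPooling2D((3, 1))(x)
x = layers.Conv2D(20, (3, 3), activation="relu")(x)
x = layers.MaxPooling2D((3, 1))(x)
x = layers.Flatten()(x)
x = layers.Dropout(0.5)(x)
x = layers.Dense(256, activation="relu")(x)            # shared hidden layer

syl = layers.Dense(1, activation="sigmoid", name="syllable")(x)
pho = layers.Dense(1, activation="sigmoid", name="phoneme")(x)

model = Model(inp, [syl, pho])
model.compile(optimizer="adam",
              loss={"syllable": "binary_crossentropy",
                    "phoneme": "binary_crossentropy"},
              loss_weights={"syllable": 1.0, "phoneme": 1.0})

early_stop = callbacks.EarlyStopping(monitor="val_loss", patience=15)
# X: (n, 80, 15, 1) contexts; y_*/w_*: per-context labels and sample weights.
# model.fit(X, {"syllable": y_syl, "phoneme": y_pho},
#           sample_weight={"syllable": w_syl, "phoneme": w_pho},
#           batch_size=256, epochs=200, validation_split=0.1,
#           callbacks=[early_stop])
```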
The inference algorithm receives the syllable and phoneme durations and labels of the teacher's singing phrase as the prior input and infers the syllable and phoneme boundaries and labels for the student's singing phrase.

The syllable durations of the teacher's singing phrase are stored in an array M_s = μ_1 · · · μ_n · · · μ_N, where μ_n is the duration of the nth syllable. The phoneme durations are stored in a nested array M_p = M_p^1 · · · M_p^n · · · M_p^N, where M_p^n is the sub-array with respect to the nth syllable and can be further expanded to M_p^n = μ_{n1} · · · μ_{nk} · · · μ_{nK_n}, where K_n is the number of phonemes contained in the nth syllable. The phoneme durations of the nth syllable sum to its syllable duration: μ_n = Σ_{k=1}^{K_n} μ_{nk} (figure 2). In both the syllable and phoneme duration sequences M_s and M_p, the duration of the silence is not treated separately and is merged with its previous syllable or phoneme.

Figure 2: Illustration of the syllable M_s and phoneme M_p coarse duration sequences and their a priori duration models N_s, N_p. The blank rectangles in M_p represent the phonemes.

The a priori duration model is shaped with a Gaussian function N(d; μ_n, σ_n). It provides the prior likelihood of an onset to occur according to the syllable/phoneme duration of the teacher's singing. The mean μ_n of the Gaussian represents the expected duration of the nth teacher's syllable/phoneme. Its standard deviation σ_n is proportional to μ_n: σ_n = γ μ_n, where γ is heuristically set to 0.35. Figure 2 provides an intuitive example of how the a priori duration model works. The a priori duration model is incorporated into a duration-informed HMM as the state transition probabilities to inform where syllable/phoneme onsets are likely to occur in the student's singing phrase.

We present an HMM configuration which makes use of the coarse duration and label input (section 3.2.1) and can be applied to inferring first (i) the syllable boundaries and labels on the ODF of the whole singing phrase, then (ii) the phoneme boundaries and labels on the ODF segment constrained by the inferred syllable boundaries. To use the same inference formulation, we unify the notations N, K_n (both introduced in section 3.2.1) to N, and M_s, M_p^n to M. The unification of the notations has a practical meaning because we use the same algorithm for both the syllable and phoneme inference. The HMM is characterized by the following:

1. The hidden state space is a set of T candidate onset positions S_1, S_2, · · · , S_T discretized by the hop size, where S_T is the offset position of the last syllable or the last phoneme within a syllable.
2. The state transition probability at the time instant t associated with state changes is defined by the a priori duration distribution N(d_{ij}; μ_t, σ_t), where d_{ij} is the time distance between states S_i and S_j (j > i). The length of the inferred state sequence is equal to N.
3. The emission probability for the state S_j is represented by its value in the ODF, which is denoted as p_j.

The goal is to find the best onset state sequence Q = q_1 q_2 · · · q_{N−1} for a given duration sequence M and to impose the corresponding segment labels, where q_i denotes the onset of the (i+1)th inferred syllable/phoneme. The onset of the current segment is assigned as the offset of the previous segment.
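The following is a minimal sketch of the duration-informed decoding described above and formalized in Algorithm 1 below: transitions follow the Gaussian duration prior with σ_n = 0.35 μ_n, emissions are the ODF values, and the first onset and last offset are fixed at the segment boundaries. It is an illustration written from the paper's description, not the authors' implementation; the emission floor (1e-10) and all names are assumptions.

```python
# Illustrative duration-informed Viterbi decoding (log domain): states are
# candidate onset frames, transitions use the Gaussian duration prior
# N(d; mu_n, sigma_n) with sigma_n = gamma * mu_n, emissions are ODF values.
import numpy as np
from scipy.stats import norm

def duration_informed_viterbi(odf, durations, hop_s=0.01, gamma=0.35):
    """odf: 1-D ODF of one phrase (or one syllable segment).
    durations: coarse teacher durations (seconds) of the N segments.
    Returns the frames of the N-1 interior onsets q_1..q_{N-1}; the first onset
    q_0 and the last offset q_N are fixed at the segment start and end."""
    T, N = len(odf), len(durations)
    if N < 2:
        return []
    log_p = np.log(np.clip(odf, 1e-10, None))            # emission log-probs
    frames = np.arange(T)

    def log_prior(dist_frames, mu):                       # Gaussian duration prior
        return norm.logpdf(dist_frames * hop_s, mu, gamma * mu)

    delta = log_prior(frames, durations[0]) + log_p       # q_1 reached from q_0 = frame 0
    psi = np.zeros((N, T), dtype=int)                      # backpointers

    for n in range(2, N):                                  # q_2 .. q_{N-1}
        new_delta = np.full(T, -np.inf)
        for j in range(1, T):
            scores = delta[:j] + log_prior(j - frames[:j], durations[n - 1])
            psi[n, j] = int(np.argmax(scores))
            new_delta[j] = scores[psi[n, j]] + log_p[j]
        delta = new_delta

    # Close the path with the fixed last offset q_N at frame T-1, then backtrack.
    scores = delta + log_prior((T - 1) - frames, durations[N - 1])
    onsets = [int(np.argmax(scores))]
    for n in range(N - 1, 1, -1):
        onsets.append(psi[n, onsets[-1]])
    return onsets[::-1]
```

In the hierarchical setting, this routine would first be run on the syllable ODF of the whole phrase, and then on the phoneme ODF of each inferred syllable segment.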
q_0 and q_N are fixed as S_1 and S_T, as we expect that the onset of the first syllable (or phoneme) is located at the beginning of the singing phrase (or syllable) and the offset of the last syllable (or phoneme) is located at the end of the phrase (or syllable). One can fulfill this assumption by truncating the silences at both ends of the incoming audio. The best onset sequence can be inferred by the logarithmic form of the Viterbi algorithm [32], given as Algorithm 1:
Algorithm 1: Logarithmic form of the Viterbi algorithm using the a priori duration model

  δ_n(i) ← max_{q_1, q_2, ..., q_{n−1}} log P[q_1 q_2 · · · q_n, μ_1 μ_2 · · · μ_n]

  procedure LogFormViterbi(M, p)
    Initialization:
      δ_1(i) ← log(N(d_{1i}; μ_1, σ_1)) + log(p_i)
      ψ_1(i) ← S_1
    Recursion:
      tmp_var(i, j) ← δ_{n−1}(i) + log(N(d_{ij}; μ_n, σ_n))
      δ_n(j) ← max_{1 ≤ i < j} [tmp_var(i, j)] + log(p_j)
      ψ_n(j) ← argmax_{1 ≤ i < j} [tmp_var(i, j)]
    Termination and backtracking:
      q_{N−1} ← argmax_i [δ_{N−1}(i) + log(N(d_{iT}; μ_N, σ_N))]
      q_n ← ψ_{n+1}(q_{n+1}), for n = N−2, ..., 1
  end procedure

4. Evaluation

The phoneme segmentation task consists of determining the time positions of phoneme onsets and offsets and their labels. As the onset of the current phoneme is assigned as the offset of the previous phoneme, the evaluation consists in comparing the detected (i) onsets and (ii) segments to their reference ones. As the syllable segmentation is a prerequisite for the proposed method, we also report the syllable segmentation results. To compare with the proposed method, we first introduce a forced alignment-based baseline method.

The baseline is a 1-state monophone DNN/HSMM model. We use a monophone model because our small dataset does not have enough phoneme instances for exploring a context-dependent triphone model; also, Brognaux and Drugman [6] and Pakoci et al. [10] argued that a context-dependent model cannot bring significant alignment improvement. It is convenient to apply a 1-state model because each phoneme can be represented by a semi-Markovian state carrying a state occupancy time distribution. The audio preprocessing step is the same as in section 3.1.

We construct an HSMM for the phoneme segment inference. The topology is a left-to-right semi-Markov chain, where the states represent sequentially the phonemes of the teacher's singing phrase. As we are dealing with forced alignment, we constrain the inference to start at the leftmost state and terminate at the rightmost state. The self-transition probabilities are set to 0 because the state occupancy depends on the predefined distribution. The other transitions, from current states to subsequent states, are set to 1. We use a one-layer CNN with multiple filter shapes as the acoustic model [33] and the Gaussian N(d; μ_n, σ_n) introduced in section 3.2.1 as the state occupancy distribution. The inference goal is to find the best state sequence, and we use Guédon's HSMM Viterbi algorithm [34] for this purpose. The baseline details and code can be found on the Github page. Finally, the segments are labeled by the alignment path, and the phoneme onsets are taken at the state transition time positions.

To define a correctly detected onset, we choose a tolerance of τ = 25 ms. If the detected onset o_d lies within the tolerance of its ground truth counterpart o_g, i.e. |o_d − o_g| < τ, we consider it correctly detected. To measure the segmentation correctness, we use the ratio between the duration of correctly labeled segments and the total duration of the singing phrase. This metric was suggested by Fujihara et al. [13] in their lyrics alignment work. We trained both the proposed and the baseline models 5 times with different random seeds, and report the mean and the standard deviation scores on the test set. The evaluation results are shown in table 2.

Table 2: Evaluation results. Table cell: mean score ± standard deviation score.

               Onset F1-measure (%)        Segmentation (%)
               phoneme      syllable       phoneme      syllable
  Proposed     75.2 ± …     …              …            …
  Baseline     …            …              …            …

On both metrics, onset detection and segmentation, the proposed method outperforms the baseline. The proposed method uses the ODF, which provides the time "anchors" for the onset detection. Besides, the ODF calculation is a binary classification task.
Thus the training data for both the positive and negative classes is more than abundant. In contrast, phonetic classification is a harder task because many sung interpretations of different phonemes have similar temporal-spectral patterns. Our relatively small training dataset might not be sufficient to train a proper discriminative acoustic model with 29 phoneme categories. We believe that these reasons lead to the better onset detection and segmentation performance of the proposed method.

Figure 3 shows a result example for a testing singing phrase. The syllable/phoneme labels, the baseline emission probability matrix and the alignment path are omitted for plot clarity, and can be found in the link. Notice that there are some extra or missing onsets in the detection. This is due to the inconsistency between the coarse duration input and the ground truth: students might add or delete some phonemes in the actual singing. Also notice that, in the 3rd row, the two detected phoneme onsets within the last syllable are not at the peak positions of the ODF. This is because the onsets are inferred by taking into account both the ODF and the a priori duration model, and the latter partially constrains the detected onsets.

Figure 3: An illustration of the result for a testing singing phrase. The red solid and black dashed vertical lines are respectively the syllable and phoneme onset positions (1st row: ground truth, 2nd and 3rd rows: proposed method, 4th row: baseline method). The blue curves in the 2nd and 3rd rows are respectively the syllable and phoneme ODFs.

The biggest advantage of the proposed method is its language independence, which means that the pre-trained CNN model can eventually be applied to the singing voice of various languages, since they could share similar temporal-spectral patterns of phoneme transitions. Besides, the Viterbi decoding of the proposed method (time complexity O(TS), T: time, S: states) is much faster than the HSMM counterpart (time complexity O(TS + T²S)). An interactive Jupyter notebook demo showcasing the proposed algorithm is provided for running in Google Colab (https://goo.gl/BzajRy).

5. Conclusions

In this paper, we presented a language-independent singing voice segmentation method that first jointly learns the syllable and phoneme ODFs using a CNN model, and then infers the onsets and segment labels using a duration-informed HMM. We also presented a jingju solo singing voice dataset with manual boundary and label annotations. For the evaluation, we compared the proposed method with a baseline forced alignment method based on a language-dependent HSMM. The evaluation results showed that the proposed method outperforms the baseline in both the segmentation and the onset detection tasks. We attribute the performance improvement of the proposed method to the efficient use of the onset and duration information for our relatively small dataset. However, the proposed method is not able to solve the phoneme insertion or deletion problems when there is a mismatch between the prior coarse duration information and the actual singing. We are improving the algorithm to overcome this limitation by using recognition-based methods.

6. Acknowledgements

This work is supported by the CompMusic project (ERC grant agreement 267583).

7. References

[1] R. Gong and X. Serra, "Identification of potential music information retrieval technologies for computer-aided jingju singing training," in , Suzhou, China, Oct 2017.
[2] S. J. Young, D. Kershaw, J. Odell, D. Ollason, V. Valtchev, and P. Woodland, The HTK Book Version 3.4. Cambridge University Press, 2006.
[3] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely, "The Kaldi speech recognition toolkit," in IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society, Dec. 2011.
[4] M. McAuliffe, M. Socolof, S. Mihuc, M. Wagner, and M. Sonderegger, "Montreal Forced Aligner: Trainable text-speech alignment using Kaldi," in Proceedings of Interspeech 2017, Stockholm, Sweden, 2017, pp. 498–502. [Online]. Available: http://dx.doi.org/10.21437/Interspeech.2017-1386
[5] "Penn phonetics lab forced aligner," https://web.sas.upenn.edu/phonetics-lab/facilities/, accessed: 2018-03-08.
[6] S. Brognaux and T. Drugman, "HMM-based speech segmentation: Improvements of fully automatic approaches," IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 24, no. 1, pp. 5–15, 2016.
[7] G. Serrière, C. Cerisara, D. Fohr, and O. Mella, "Weakly-supervised text-to-speech alignment confidence measure," in International Conference on Computational Linguistics (COLING), Osaka, Japan, 2016.
[8] A. Esposito and G. Aversano, "Text independent methods for speech segmentation," in Nonlinear Speech Modeling and Applications. Springer, 2005, pp. 261–290.
[9] G. Almpanidis, M. Kotti, and C. Kotropoulos, "Robust detection of phone boundaries using model selection criteria with few observations," IEEE Transactions on Audio, Speech, and Language Processing, vol. 17, no. 2, pp. 287–298, Feb 2009.
[10] E. Pakoci, B. Popović, N. Jakovljević, D. Pekar, and F. Yassa, "A phonetic segmentation procedure based on hidden Markov models," in International Conference on Speech and Computer. Budapest, Hungary: Springer, 2016, pp. 67–74.
[11] A. Mesaros and T. Virtanen, "Automatic alignment of music audio and lyrics," in Proceedings of the 11th Int. Conference on Digital Audio Effects (DAFx-08), Espoo, Finland, 2008.
[12] A. Loscos, P. Cano, and J. Bonada, "Low-delay singing voice alignment to text," in International Computer Music Conference, Beijing, China, 1999.
[13] H. Fujihara, M. Goto, J. Ogata, and H. G. Okuno, "LyricSynchronizer: Automatic synchronization system between musical audio signals and lyrics," IEEE Journal of Selected Topics in Signal Processing, vol. 5, no. 6, pp. 1252–1261, 2011.
[14] M. Mauch, H. Fujihara, and M. Goto, "Integrating additional chord information into HMM-based lyrics-to-audio alignment," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 1, pp. 200–210, 2012.
[15] D. Iskandar, Y. Wang, M.-Y. Kan, and H. Li, "Syllabic level automatic synchronization of music signals and text lyrics," in Proceedings of the 14th ACM International Conference on Multimedia, Santa Barbara, CA, USA, 2006, pp. 659–662.
[16] R. Gong, P. Cuvillier, N. Obin, and A. Cont, "Real-time audio-to-score alignment of singing voice based on melody and lyric information," in Proceedings of Interspeech 2015, Dresden, Germany, 2015.
[17] A. M. Kruspe, "Keyword spotting in singing with duration-modeled HMMs," in , Nice, France, Aug 2015, pp. 1291–1295.
[18] G. B. Dzhambazov and X. Serra, "Modeling of phoneme durations for alignment between polyphonic audio and lyrics," in , Maynooth, Ireland, 2015.
[19] Y. R. Chien, H. M. Wang, and S. K. Jeng, "Alignment of lyrics with accompanied singing audio based on acoustic-phonetic vowel likelihood modeling," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 11, pp. 1998–2008, Nov 2016.
[20] S. Chang and K. Lee, "Lyrics-to-audio alignment by unsupervised discovery of repetitive patterns in vowel acoustics," IEEE Access, vol. 5, pp. 16635–16648, 2017.
[21] M. Müller, F. Kurth, D. Damm, C. Fremerey, and M. Clausen, "Lyrics-based audio retrieval and multimodal navigation in music collections," in International Conference on Theory and Practice of Digital Libraries. Budapest, Hungary: Springer, 2007, pp. 112–123.
[22] Y. Wang, M.-Y. Kan, T. L. Nwe, A. Shenoy, and J. Yin, "LyricAlly: Automatic synchronization of acoustic musical signals and textual lyrics," in Proceedings of the 12th Annual ACM International Conference on Multimedia. New York, NY, USA: ACM, 2004, pp. 212–219.
[23] S. Ruder, "An overview of multi-task learning in deep neural networks," arXiv preprint arXiv:1706.05098, 2017.
[24] R. Caruana, "Multitask learning," in Learning to Learn. Springer, 1998, pp. 95–133.
[25] M. H. Yang, L. Su, and Y. H. Yang, "Highlighting root notes in chord recognition using cepstral features and multi-task learning," in , Dec 2016, pp. 1–8.
[26] R. Vogl, M. Dorfer, G. Widmer, and P. Knees, "Drum transcription via joint beat and drum modeling using convolutional recurrent neural networks," in , Suzhou, China, 2017.
[27] R. C. Repetto and X. Serra, "Creating a corpus of jingju (Beijing opera) music and possibilities for melodic analysis," in ISMIR, Taipei, Taiwan, 2014.
[28] D. A. A. Black, M. Li, and M. Tian, "Automatic identification of emotional cues in Chinese opera singing," in ICMPC, Seoul, South Korea, Aug. 2014.
[29] P. Boersma, "Praat, a system for doing phonetics by computer," Glot International, vol. 5, no. 9/10, pp. 341–345, 2001.
[30] J. Schlüter and S. Böck, "Improved musical onset detection with convolutional neural networks," in Proceedings of Acoustics, Speech and Signal Processing (ICASSP), Florence, Italy, 2014, pp. 6979–6983.
[31] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[32] L. R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proceedings of the IEEE, vol. 77, no. 2, pp. 257–286, 1989.
[33] J. Pons, O. Slizovskaia, R. Gong, E. Gómez, and X. Serra, "Timbre analysis of music audio signals with convolutional neural networks," in , Kos, Greece, Aug 2017, pp. 2744–2748.
[34] Y. Guédon, "Exploring the state sequence space for hidden Markov and semi-Markov chains,"