The Use of Voice Source Features for Sung Speech Recognition
Gerardo Roa Dabike, Jon Barker
Department of Computer Science, University of Sheffield, UK

© 2021 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
ABSTRACT
In this paper, we ask whether vocal source features (pitch, shimmer, jitter, etc.) can improve the performance of automatic sung speech recognition, arguing that conclusions previously drawn from spoken speech studies may not be valid in the sung speech domain. We first use a parallel singing/speaking corpus (NUS-48E) to illustrate differences in sung vs spoken voicing characteristics, including pitch range, syllable duration, vibrato, jitter and shimmer. We then use this analysis to inform speech recognition experiments on the sung speech DSing corpus, using a state-of-the-art acoustic model and augmenting conventional features with various voice source parameters. Experiments are run with three standard (increasingly large) training sets, DSing1 (15.1 hours), DSing3 (44.7 hours) and DSing30 (149.1 hours). Pitch combined with degree of voicing produces a significant decrease in WER from 38.1% to 36.7% when training with DSing1; however, the smaller WER decreases observed when training with the larger, more varied DSing3 and DSing30 sets were not found to be statistically significant. Voicing quality characteristics did not improve recognition performance, although analysis suggests that they do contribute to improved discrimination between voiced/unvoiced phoneme pairs.
Index Terms — Sung speech, voice source, speech recognition
1. INTRODUCTION
Automatic sung speech recognition is attracting increased research attention, driven in part by the release of large sung speech datasets, e.g., in 2020, the Music Information Retrieval Evaluation eXchange (MIREX) ran its first Lyrics Transcription task. Lyric transcription is an interesting problem in its own right, but equally importantly, adapting speech technology to sung speech may also provide insight into how to extend it to other types of atypical speech (e.g., dysarthric speech). Further, poorly-intelligible sung speech acts as a useful stress-test for state-of-the-art (SOTA) acoustic modelling (AM) techniques.

Existing automatic sung speech recognition systems are typically based on successful approaches for spoken speech [1, 2, 3]. In particular, they use the same acoustic features. This has been motivated by the idea that spoken and sung speech share the same production system, and that semantic information is conveyed in the same way in both speech styles. However, there are several differences between sung and spoken speech that make the former more difficult to recognise, such as the pitch range, the syllable duration, the existence of vibrato in singing, and the fundamental fact that in singing, intelligibility is often less important than artistic interpretation. Since most of these differences centre around the use of the voice source, informing recognition systems with voice source information may help to improve their performance.

Traditional SOTA ASR systems designed for spoken speech typically use Mel frequency cepstral coefficient (MFCC) or filterbank acoustic features to capture the vocal filter characteristics, and i-Vector speaker-specific representations [4] for speaker [5, 6] and environment adaptation [7]. These systems produce excellent results. However, in unaccompanied sung speech ASR scenarios, systems that use a combination of MFCCs and i-Vectors have not achieved the same level of performance [8, 3]. This reduced performance may be partly explained by the relative lack of training data, the use of inappropriate language models, and the (often) lower intelligibility of the signal. Additionally, sung speech possesses a higher pitch variability than spoken speech, e.g., females average 342 Hz when singing and 237 Hz when speaking (Section 2). This variability means that one singer can effectively have different 'voices' (guided by the song structure), which may make it difficult to characterise the speakers adequately.

The inclusion of pitch information has improved the performance of ASR systems in different spoken speech scenarios. In tonal languages, such as Mandarin [9], it has been widely used since tone has a direct relation to word meaning. In non-tonal languages, like English [10, 11, 12], pitch-strength information helps to improve the discrimination between voiced and unvoiced sounds [10, 11]. Additionally, pitch information has been used for vocal tract normalisation (VTN), by exploiting the relationship between pitch values and vocal tract length, e.g., male speakers have a larger vocal tract than females and produce a lower pitch [12]. For children's ASR, pitch-informed VTN has also been employed to train on adult speech data and test on children's data, via a pitch-adaptive front-end [13, 14].

However, the role of the voice source in singing is fundamentally different from its role in speaking, meaning it has very different characteristics.
First, for sung speech, the pitch range is larger, and the average pitch is higher, than in spoken speech [15]. Second, spoken speech pitch varies freely, rising and falling within one syllable, with changes of up to 12 semitones [16]. In contrast, the pitch in sung speech is expected to remain steady during a note, with controlled and discrete variations, and changes greater than two semitones are rather infrequent [17]. Third, in singing, in order to carry the melody, speech is more heavily voiced, and this may interfere with the role of voicing as a phonetic cue. This may lead to difficulties in reliably discriminating between voiced/unvoiced phoneme pairs. Fourth, the duration of sung vowels is often longer than in spoken speech, as needed to achieve the rhythm dictated by the musical composition and to convey the pitch assigned to the syllable. Finally, for artistic expression, singers employ a frequency modulation called vibrato, a musical effect corresponding to small and periodic variations of the pitch between 5.5 and 7.5 Hertz (Hz), with characteristics that are specific to the singer [18].

Considering these differences, voice source features may be less useful as a phonetic cue for sung speech than for spoken speech. Alternatively, with sufficient data, voice source features, and in particular pitch, may be, on balance, beneficial for sung speech ASR systems, not only by improving voiced/unvoiced sound discrimination and sung vowel classification [19], but also by helping the system to normalise for different speakers or for systematic changes in phoneme quality apparent when a singer is singing in a different part of his or her vocal range. Additionally, using characterisations of voice source quality, which are strongly associated with speaker identity, may also help the system by providing conditioning variables that improve speaker normalisation.

This paper aims to critically evaluate the usefulness of various types of voice source information for improving the performance of sung speech ASR systems. This is the first study using a system trained and tested on large amounts of sung speech data. (Pitch information was previously evaluated in [19], but in a system trained using spoken data and evaluated only with male singers.) Section 2 presents a data-driven analysis of some of the differences between sung and spoken speech. Section 3 details the voice source-based features chosen for use in the ASR experiments, informed by this analysis. Section 4 describes the speech recognition dataset and the baseline system. The experimental results and analysis are presented in Section 5 and conclusions in Section 6.
2. SUNG SPEECH ANALYSIS
Using the sung and spoken lyrics corpus NUS-48E [20], we examined some of the sung speech characteristics that make it more challenging than spoken speech. The NUS-48E corpus is composed of 48 sung and spoken (read) English songs, performed by 12 singers with different levels of musical training, where each singer speaks and sings four different songs from a selection of 20 unique songs. The data is manually annotated at the phone level, and has about 2800 annotated phones per subject for each speech style.

First, we considered vowel duration differences (Figure 1a). It can be seen that whereas spoken vowel durations are typically less than 300 milliseconds (ms), sung vowels are often over 500 ms and can extend over 1 second. Word recognition accuracy may be affected by this phone lengthening. In [19], it was found that the extended duration can lead to phone insertion and substitution errors, especially if the pitch varies during the vowel. Further, vowel length is an important phonetic cue (e.g., aiding discrimination between short and long vowel pairs such as /I/ vs /i:/). Artificially extending a vowel may therefore disrupt identification.

Figure 1b shows the vowels' fundamental frequency (F0) distribution by speech style and gender. Sung speech has a much larger F0 range than spoken speech. In contrast to spoken speech, which has virtually no overlap in F0 range between genders, the extended F0 ranges mean that the top end of the male range (200-300 Hz) overlaps with the lower female F0 values. For sung speech, different peaks can be seen in the male and female distributions. These peaks occur at the position of specific musical notes, e.g., some of the peaks in the male singer plot correspond to the notes E, F and G. Notice that unlike spoken speech, which can be well represented with a single Gaussian on a log frequency scale, the sung speech distribution would be better represented by a Gaussian mixture model with each component centred at a discrete note in the vocal range. At the highest sung F0 values, vowel identification will become challenging, as the sparseness of the harmonics means that the formant positions and bandwidths will be hard to determine.

Fig. 1. Distribution of (a) vowel duration and (b) vowel pitch values, separated by sung and spoken speech styles.

We used the Kaldi 'probability of voicing' measure (POV) (Section 3) to examine the importance of voicing information for the discrimination between voiced and unvoiced phones. Figure 2 shows the POV measure's probability distribution for the voiced/unvoiced fricative sounds (note, low POV implies more strongly voiced). For each condition, the Bhattacharyya distance (D_B) was measured to show the similarity between the voiced and unvoiced distributions. For sung speech, although both classes show more voicing (i.e., a shift of the distributions to the left), the shift is greater for voiced phonemes, leading to a higher D_B value. This suggests that the POV could be more informative for phoneme differentiation in sung speech than in spoken speech. No variation in the D_B distance was observed for voiced versus unvoiced stop phonemes.

Fig. 2. Distribution of the POV measure for voiced and unvoiced fricative phones, for (a) sung speech and (b) spoken speech. The Bhattacharyya distance (D_B) is annotated in both plots.
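The Bhattacharyya distance used above can be computed directly from the empirical POV distributions. The following is a minimal sketch, assuming per-frame POV values for the two phone classes are already available as arrays; the variable names, binning and synthetic data are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def bhattacharyya_distance(x, y, n_bins=50):
    """Bhattacharyya distance between two empirical distributions.

    x, y: 1-D arrays of per-frame POV values for the two phone classes.
    The distributions are approximated by normalised histograms over a
    shared set of bins, then D_B = -ln(sum_i sqrt(p_i * q_i)).
    """
    lo = min(x.min(), y.min())
    hi = max(x.max(), y.max())
    p, _ = np.histogram(x, bins=n_bins, range=(lo, hi))
    q, _ = np.histogram(y, bins=n_bins, range=(lo, hi))
    p = p / p.sum()
    q = q / q.sum()
    bc = np.sum(np.sqrt(p * q))  # Bhattacharyya coefficient
    return -np.log(bc)

# Illustrative usage with synthetic POV values for a voiced vs unvoiced
# fricative class (real values would come from the Kaldi POV feature).
rng = np.random.default_rng(0)
pov_voiced = rng.normal(-0.4, 0.2, 5000)
pov_unvoiced = rng.normal(0.2, 0.2, 5000)
print(f"D_B = {bhattacharyya_distance(pov_voiced, pov_unvoiced):.3f}")
```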
Finally, we analysed the voice quality (VQ) production differences between speech styles using the fundamental frequency perturbation measurements jitter and shimmer, plus the harmonics-to-noise ratio (HNR); parameters typically used for voice pathology detection [21]. These features have also proved useful for speaker recognition [22]. VQ measurements are divided into three sub-groups: jitter, which measures the frequency variation from cycle to cycle; shimmer, which relates to the amplitude variation of the sound wave; and HNR, which measures the ratio between the harmonics and the glottal noise. Analysis using NUS-48E showed that sung speech jitter and shimmer ranges are lower, and HNR is higher, than in spoken speech. Among the factors that may explain these variations are vocal tract muscle "warm-up" during singing [23], interactions between jitter, shimmer and the degree of vibrato [24], and the correlation between these parameters and the fundamental frequency, i.e., jitter decreases at higher frequencies. The voice quality parameter ranges showed a clear gender distinction in spoken speech. In contrast, in singing, the gender distinction virtually disappears: only jitter maintains some degree of gender specificity, while shimmer and HNR become largely gender-independent parameters. This weaker link to speaker identity may reduce the value of these parameters as speaker normalisation conditioning variables.
3. VOICE SOURCE FEATURES
This section describes the voice source-based features that will be evaluated in the ASR experiments.
For pitch feature extraction, we utilised the Kaldi pitch tracker [25], which is based on the getf0 (RAPT) algorithm [26]. Unlike getf0, the Kaldi pitch tracker does not make a hard voicing decision; instead, it treats all frames as voiced sounds with an associated pitch, but with a varying probability of voicing (POV). A Viterbi search is used to interpolate pitch estimates across unvoiced frames. The algorithm outputs a log pitch representation, a normalised log pitch obtained using short-time mean subtraction (as in [27]), a Gaussian-distributed POV feature and a two-frame-context delta-pitch feature.

The Kaldi pitch tracker possesses a maximum pitch value (max-f0) and a low-pass frequency cut-off (lowpass-cutoff) that are tuned for spoken speech. These were re-tuned for sung speech using the pitch-annotated sung speech dataset MIR-1K [28]. MIR-1K is a collection of 1000 male and female song clips totalling 133 minutes, extracted from 110 karaoke songs selected from 5,000 Chinese pop songs. Pitch values have been manually annotated.

Employing a grid-search technique, we evaluated seven values for max-f0 (between 400 and 1,000 Hz, with a 100 Hz step) and three values for lowpass-cutoff (1,000, 1,500 and 2,000 Hz). Performance was assessed using the gross pitch error (GPE) [29] and fine pitch error (FPE) [30] measurements. GPE calculates the proportion of frames classified as voiced by both the estimation and the ground truth where the estimated pitch deviates by more than one semitone from the ground truth [31]. However, as the Kaldi pitch tracker does not make a hard voicing decision, we calculate the GPE using all the voiced frames from the ground truth. FPE is defined as the mean absolute error derived from the voiced frames where the estimate deviates from the reference by less than the GPE threshold [30].

Fig. 3. Kaldi pitch tracker pitch and POV estimation contrasted with the ground truth, for 3.5 seconds of one MIR-1K excerpt.

Best performance was found with a max-f0 of 1,000 Hz and a lowpass-cutoff of 1,500 Hz (compared to 400 and 1,000 respectively for the original spoken speech tuning). Figure 3 shows an example of the pitch predicted by the Kaldi pitch tracker using these parameters, compared to the ground truth. Notice that the Kaldi pitch tracker predicts the values with high accuracy and interpolates between voiced frames to assign a pitch to the unvoiced ones. Additionally, this plot shows the POV estimation (right y-axis, inverted); notice that in unvoiced regions the POV value dips below a voicing decision threshold estimated from MIR-1K.
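As a rough illustration of how the two error measures are computed, the sketch below evaluates GPE and FPE over frame-aligned reference and estimated pitch tracks. The semitone-deviation formula, function names and synthetic tracks are assumptions for illustration; this is not code from the paper or from Kaldi:

```python
import numpy as np

def semitone_deviation(f_est, f_ref):
    """Absolute deviation between two pitch values, in semitones."""
    return np.abs(12.0 * np.log2(f_est / f_ref))

def gpe_fpe(ref_f0, est_f0, ref_voiced, threshold_semitones=1.0):
    """Gross and fine pitch error for frame-aligned pitch tracks.

    ref_f0, est_f0 : per-frame pitch values in Hz (est_f0 is defined for
                     every frame, as with the Kaldi pitch tracker).
    ref_voiced     : boolean mask of frames marked voiced in the reference.
    GPE: fraction of reference-voiced frames whose estimate deviates by
         more than `threshold_semitones` from the reference.
    FPE: mean absolute deviation (in semitones) over the remaining
         reference-voiced frames.
    """
    ref = ref_f0[ref_voiced]
    est = est_f0[ref_voiced]
    dev = semitone_deviation(est, ref)
    gross = dev > threshold_semitones
    gpe = gross.mean()
    fpe = dev[~gross].mean() if np.any(~gross) else np.nan
    return gpe, fpe

# Illustrative usage with short synthetic tracks (0.0 marks unvoiced frames).
ref_f0 = np.array([220.0, 220.0, 247.0, 262.0, 0.0, 294.0])
est_f0 = np.array([221.0, 230.0, 300.0, 263.0, 180.0, 296.0])
print(gpe_fpe(ref_f0, est_f0, ref_f0 > 0))
```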
The VQ parameters were extracted using Praat [32]. First, we estimated jitter by analysing the timings of individual fundamental periods. In our experiments, we evaluated two jitter parameters, namely average absolute jitter (jitta) and relative average perturbation (rap) [21]. The jitta parameter measures the average absolute variation of the fundamental period between consecutive cycles, and rap measures the relative difference between a period and the average of its two neighbours. Second, we estimated shimmer, defined as the average absolute difference between the amplitudes of pairs of consecutive periods. Finally, HNR is a single parameter calculated from the auto-correlation function of the voice signal and is a measure of the degree of aperiodicity [21].
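To make the jitta, rap and shimmer definitions above concrete, here is a minimal sketch computing them directly from arrays of consecutive fundamental periods and peak amplitudes. It illustrates the textbook formulas (cf. [21]); it is not the Praat implementation used in the paper, and the normalisation choices and synthetic data are assumptions:

```python
import numpy as np

def jitta(periods):
    """Average absolute difference between consecutive fundamental periods (seconds)."""
    periods = np.asarray(periods, dtype=float)
    return np.mean(np.abs(np.diff(periods)))

def rap(periods):
    """Relative average perturbation: mean |T_i - mean(T_{i-1}, T_i, T_{i+1})|,
    normalised by the mean period."""
    periods = np.asarray(periods, dtype=float)
    three_point_mean = (periods[:-2] + periods[1:-1] + periods[2:]) / 3.0
    return np.mean(np.abs(periods[1:-1] - three_point_mean)) / np.mean(periods)

def shimmer(amplitudes):
    """Average absolute difference between amplitudes of consecutive periods,
    normalised here by the mean amplitude (relative shimmer)."""
    amplitudes = np.asarray(amplitudes, dtype=float)
    return np.mean(np.abs(np.diff(amplitudes))) / np.mean(amplitudes)

# Illustrative usage on a slightly perturbed 200 Hz voiced segment.
rng = np.random.default_rng(1)
periods = 1.0 / (200.0 + rng.normal(0.0, 1.0, 50))   # period durations in seconds
amps = 1.0 + rng.normal(0.0, 0.02, 50)               # per-period peak amplitudes
print(jitta(periods), rap(periods), shimmer(amps))
```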
4. THE ASR BASELINE

4.1. The task
Recognition experiments have been performed using the DSing sung speech recognition corpus [8]. DSing is composed of 4,460 karaoke performances and is the English subset of the larger multi-language karaoke Smule Sing!300x30x2 dataset (Sing!) [33], released by Smule in early 2018. The performances are equally distributed by gender and organised by the country (30 in total) where the singer was located during the recording. Using the country information, DSing conveniently splits the data into three progressively larger training sets, named DSing1 (15.1 hours), DSing3 (44.7 hours) and DSing30 (149.1 hours). The smallest training set, DSing1, is constructed using 80% of the recordings from the UK. The remaining 20% is split into two balanced test sets, one for development (dev) and one for evaluation (eval) (1.5 hours in total). DSing3 extends DSing1 with recordings from Australia and the US. Although there is no information about the nationality of the singers, it can safely be assumed that most of the singers in DSing1 and DSing3 are native English speakers. The third and most extensive set, DSing30, covers all the English recordings from the 30 countries, including a large number of English recordings from non-native English speakers.
As a baseline, we employ the sung speech recognition system from [8], built using the Kaldi ASR toolkit [34]. The acoustic features employed are 13 Mel frequency cepstral coefficients (MFCCs) plus delta, delta-delta and energy, with a 25 ms frame length and 15 ms of overlap. Initial alignments are performed with a tri-phone speaker-adapted Gaussian mixture model (GMM) and feature-space maximum likelihood linear regression (fMLLR). The tri-phone model is used to clean the training data using the standard Kaldi cleanup process (https://github.com/kaldi-asr/kaldi/blob/master/egs/wsj/s5/steps/cleanup/clean_and_segment_data.sh). A factorised time-delay neural network (TDNN-F) [35] acoustic model is then trained using 40 MFCCs with two frames of context plus 100-dimensional i-Vectors, and a lattice-free maximum mutual information (LF-MMI) loss function [36].
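For readers who want a comparable front-end outside Kaldi, the following is a minimal sketch of a 13-MFCC + delta + delta-delta + energy front-end with a 25 ms window and 10 ms frame shift, using librosa as a stand-in. It approximates, and is not identical to, the Kaldi feature pipeline used in the paper; the file name, sample rate and energy stand-in are illustrative assumptions:

```python
import librosa
import numpy as np

# Load a (hypothetical) a cappella performance at 16 kHz.
y, sr = librosa.load("performance.wav", sr=16000)

win = int(0.025 * sr)   # 25 ms analysis window
hop = int(0.010 * sr)   # 10 ms frame shift (25 ms window, 15 ms overlap)

# 13 MFCCs per frame, plus first- and second-order deltas.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=win, hop_length=hop)
delta = librosa.feature.delta(mfcc)
delta2 = librosa.feature.delta(mfcc, order=2)

# Per-frame log energy, used here as a simple stand-in for the energy term.
rms = librosa.feature.rms(y=y, frame_length=win, hop_length=hop)
log_energy = np.log(rms + 1e-10)

# Stack into a (num_frames, 40) matrix: 13 + 13 + 13 + 1 dimensions.
features = np.vstack([mfcc, delta, delta2, log_energy]).T
print(features.shape)
```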
Table 1. Evaluation set performances (WER with 95% confidence intervals) from experiments trained on DSing1, DSing3 and DSing30, decoded with 3-gram and 4-gram LMs. Rows correspond to the Baseline, Kaldi LN, Kaldi L, Kaldi N and Kaldi LN + VQ feature configurations; columns give results for each training set with the 3-gram and 4-gram LMs. Values in bold show WERs that are significantly better than the baseline.
A 3-gram MaxEnt language model (LM) is trained on an in-domain lyrics corpus sourced from the Lyrics Wiki website (https://lyrics.fandom.com/wiki/LyricWiki). This corpus is composed of the lyrics of songs from all artists in the DSing3 training set, excluding songs that actually appear in DSing itself, plus the lyrics from all the artists in the Billboard Hot 100 charts for the 31st of December of the years 2015 to 2018. Language model rescoring is then performed with a 4-gram MaxEnt model trained on the same data. The selected vocabulary was the 26K most frequent words.

Using the 3-gram LM, the evaluation set performance was 42.28%, 28.67% and 22.32% WER for models trained with DSing1, DSing3 and DSing30 respectively, reducing to 37.63%, 24.27% and 19.60% WER after applying 4-gram LM rescoring. For full system details see [8].
5. EXPERIMENT AND RESULTS

5.1. Experiments
In order to establish the significance of our results, we have repeated experiments multiple times. In particular, the AM training depends on the random initialisation of parameters, the random presentation order of the training data, etc. Each training run, though equally valid, can produce WER results that vary appreciably, and this variation can be mistaken for genuine performance differences. Therefore, for all experiments we retrain systems 11 times, allowing us to treat the evaluation of the model performances statistically.

We first replicated the baseline system, training it eleven times to calculate confidence intervals using the mean and the standard error of the mean of the scores. After each training run, the development set was used to select Kaldi's LM-weight and word-insertion-penalty decoding parameters, and these parameters were then used to decode the evaluation set. These parameters were first estimated when decoding with the 3-gram LM and then again when using the 4-gram LM for re-scoring.
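As an illustration of the statistics used here, the sketch below computes a mean WER and a 95% confidence interval half-width for one system configuration from the 11 retrained models, using the standard error of the mean with a normal approximation; the WER values shown are made up for the example:

```python
import numpy as np

def wer_confidence_interval(wers, z=1.96):
    """Mean WER and 95% CI half-width from repeated training runs.

    wers: WER (%) of each independently retrained system on the same
    evaluation set. The half-width is z * standard error of the mean.
    """
    wers = np.asarray(wers, dtype=float)
    mean = wers.mean()
    sem = wers.std(ddof=1) / np.sqrt(len(wers))
    return mean, z * sem

# Eleven hypothetical evaluation WERs from retraining the same system.
wers = [38.3, 37.9, 38.0, 38.4, 38.2, 37.8, 38.1, 38.3, 37.9, 38.2, 38.0]
mean, half_width = wer_confidence_interval(wers)
print(f"{mean:.2f} ± {half_width:.2f} % WER")
```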
Following the above procedure, we performed experiments expanding the baseline MFCC + i-Vector feature vector with different voice source-based feature combinations. The first experiment, Kaldi LN, evaluated the effect of including both Kaldi pitch representations: the log pitch and the normalised log pitch. The second, Kaldi L, evaluated the effect of using the log pitch without the normalised log pitch. The third, Kaldi N, evaluated the effect of using only the normalised log pitch. For these first three experiments, both the delta pitch and POV were included. The final experiment expanded the best combination of MFCC plus pitch features, obtained from the results when training with the smallest set, DSing1, with the four VQ features.

Fig. 4. t-SNE plot constructed with the posterior probabilities from a model trained on DSing30.
Table 1 presents the system performances (WER) along with the 95% confidence intervals for the baseline system and the systems using the various combinations of voice source features. For each system, results are shown for each training dataset and for each of the 3-gram and 4-gram LMs. The baseline results for the 4-gram LM, for DSing1, DSing3 and DSing30 respectively, are consistent with the results reported in [8].

For the DSing1 model, using the four Kaldi pitch features (Kaldi LN experiment) gives statistically significant error reductions of about 2.0% with the 3-gram LM and 1.4% with the 4-gram LM. A similar, significant improvement is obtained by expanding Kaldi LN with the VQ features. For the larger DSing3 training set, reductions of 0.5% with the 3-gram LM and 0.7% with the 4-gram LM are obtained when combining the pitch features with the VQ features. Voice source features did not help to improve the models trained on the largest set, DSing30. The VQ features do not produce a significant improvement over using the pitch features alone, for any training set size.

Figure 4 shows a t-SNE plot of the /s/-/z/ fricative and /p/-/b/ plosive sounds, constructed using the posterior probabilities from one model trained with DSing30. The subscript 0 represents the baseline model, subscript 1 Kaldi LN and subscript 2 Kaldi LN + VQ. In the baseline model there are separate clusters for the voiced and unvoiced fricative and plosive pairs, but the clusters overlap. When the model is informed with pitch features and VQ information, the distance between classes becomes wider and the classes become more compact. The best discrimination is seen when the VQ features are included, giving some hint that they may be useful even though the improvements in WER were not significant.

6. CONCLUSIONS

This paper presented an analysis of the differences in voice source characteristics between the sung and spoken speech styles, and their effect on sung speech ASR performance. Using voice source features improved ASR performance, reducing WERs by between 0.7% and 1.4% over an existing state-of-the-art baseline. The ASR results and t-SNE analysis suggested that voice source features help to compress the phoneme classes and to increase the distance between pairs of voiced/unvoiced phonemes. However, this effect only improves performance for ASR systems with the smaller training sets. When using the more extensive DSing30 (149.1 hrs), the voice source-based features were, surprisingly, found to provide no significant benefit for ASR performance. This suggests that, with enough training data, systems are able to learn the phonetic cues carried in the voice source in a less direct manner (e.g., via the temporal dynamics of the MFCC features).
7. REFERENCES

[1] A. Mesaros and T. Virtanen, "Recognition of phonemes and words in singing," in Proc. IEEE ICASSP, 2010.
[2] A. M. Kruspe, "Bootstrapping a system for phoneme recognition and keyword spotting in unaccompanied singing," in Proc. International Society for Music Information Retrieval Conference (ISMIR), 2016.
[3] C. Tsai, Y. Tuan, and L. Lee, "Transcribing lyrics from commercial song audio: The first step towards singing content processing," in Proc. IEEE ICASSP, 2018.
[4] N. Dehak, P. J. Kenny, R. Dehak, P. Dumouchel, and P. Ouellet, "Front-end factor analysis for speaker verification," IEEE Trans. Acoust., Speech, Signal Process., vol. 19, no. 4, pp. 788–798, 2011.
[5] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "Librispeech: An ASR corpus based on public domain audio books," in Proc. IEEE ICASSP, 2015.
[6] K. J. Han, A. Chandrashekaran, J. Kim, and I. R. Lane, "The CAPIO 2017 conversational speech recognition system," CoRR, vol. abs/1801.00059, 2017.
[7] M. Rouvier and B. Favre, "Speaker adaptation of DNN-based ASR with i-vectors: Does it actually adapt models to speakers?" in Proc. INTERSPEECH, 2014.
[8] G. Roa Dabike and J. Barker, "Automatic lyric transcription from Karaoke vocal tracks: Resources and a baseline system," in Proc. Interspeech, 2019.
[9] H. Huang, Y. Hu, and H. Xu, "Mandarin tone modeling using recurrent neural networks," arXiv:1711.01946, 2017.
[10] M. M. Doss, T. A. Stephenson, and H. Bourlard, "Using pitch frequency information in speech recognition," in Proc. EUROSPEECH, 2003.
[11] G. Cloarec, D. Jouvet, and J. Monné, "Analysis of the modeling of pitch and voicing parameters for speaker-independent speech recognition systems," in Speech Recognition and Intrinsic Variation (SRIV), 2006.
[12] M. Płonkowski and P. Urbanovich, "The use of pitch in large-vocabulary continuous speech recognition system," Przeglad Elektrotechniczny, vol. 92, no. 8, pp. 78–81, 2016.
[13] S. Shahnawazuddin, A. Dey, and R. Sinha, "Pitch-adaptive front-end features for robust children's ASR," in Proc. Interspeech, 2016.
[14] S. Shahnawazuddin, R. Sinha, and G. Pradhan, "Pitch-normalized acoustic features for robust children's speech recognition," IEEE Signal Process. Lett., vol. 24, no. 8, pp. 1128–1132, 2017.
[15] J. Merrill and P. Larrouy-Maestri, "Vocal features of song and speech: Insights from Schoenberg's Pierrot lunaire," Frontiers in Psychology, vol. 8, 2017.
[16] A. D. Patel, M. Wong, J. Foxton, A. Lochy, and I. Peretz, "Speech intonation perception deficits in musical tone deafness (congenital amusia)," Music Perception, vol. 25, pp. 357–368, 2008.
[17] P. G. Vos and J. M. Troost, "Ascending and descending melodic intervals: Statistical findings and their perceptual relevance," Music Perception: An Interdisciplinary Journal, vol. 6, no. 4, pp. 383–396, 1989.
[18] J. Sundberg, The Science of the Singing Voice. Northern Illinois University Press, 1987.
[19] D. Kawai, K. Yamamoto, and S. Nakagawa, "Lyric recognition in monophonic singing using pitch-dependent DNN," in Proc. IEEE ICASSP, 2017.
[20] Z. Duan, H. Fang, B. Li, K. C. Sim, and Y. Wang, "The NUS sung and spoken lyrics corpus: A quantitative comparison of singing and speech," in Proc. APSIPA ASC, 2013.
[21] J. P. Teixeira, C. Oliveira, and C. Lopes, "Vocal acoustic analysis – jitter, shimmer and HNR parameters," Procedia Technology, vol. 9, pp. 1112–1122, 2013.
[22] M. Farrus, J. Hernando, and P. Ejarque, "Jitter and shimmer measurements for speaker recognition," in Proc. INTERSPEECH, 2007.
[23] C. Mezzedimi, M. Spinosi, T. Massaro, F. Ferretti, and J. Cambi, "Singing voice: acoustic parameters after vocal warm-up and cool-down," Logopedics Phoniatrics Vocology, pp. 1–9, 2018.
[24] J. Hakes, T. Shipp, and E. T. Doherty, "Acoustic characteristics of vocal oscillations: vibrato, exaggerated vibrato, trill, and trillo," Journal of Voice, vol. 1, no. 4, pp. 326–331, 1988.
[25] P. Ghahremani, B. Babaali, D. Povey, K. Riedhammer, J. Trmal, and S. Khudanpur, "A pitch extraction algorithm tuned for ASR," in Proc. IEEE ICASSP, 2014.
[26] D. Talkin and W. B. Kleijn, "A robust algorithm for pitch tracking (RAPT)," in Speech Coding and Synthesis, 1995, pp. 495–518.
[27] X. Lei, "Modeling lexical tones for Mandarin large vocabulary continuous speech recognition," Ph.D. dissertation, University of Washington, 2006.
[28] C. Hsu and J. R. Jang, "On the improvement of singing voice separation for monaural recordings using the MIR-1K dataset," IEEE Trans. Acoust., Speech, Signal Process., vol. 18, no. 2, pp. 310–319, 2010.
[29] T. Drugman, "Joint robust voicing detection and pitch estimation based on residual harmonics," in Proc. INTERSPEECH, 2011.
[30] M. Asgari and I. Shafran, "Improving the accuracy and the robustness of harmonic model for pitch estimation," in Proc. INTERSPEECH, 2013.
[31] O. Babacan, T. Drugman, N. d'Alessandro, N. Henrich, and T. Dutoit, "A comparative study of pitch extraction algorithms on a large variety of singing sounds," in Proc. IEEE ICASSP, 2013.
[32] P. Boersma, "A system for doing phonetics by computer," Glot International, vol. 5, no. 9/10, pp. 341–345, 2000.
[33] Smule Sing!300x30x2 Dataset, https://ccrma.stanford.edu/damp/, accessed September 2018.
[34] D. Povey et al., "The Kaldi speech recognition toolkit," in Proc. IEEE ASRU, 2011.
[35] D. Povey, G. Cheng, Y. Wang, K. Li, H. Xu, M. Yarmohammadi, and S. Khudanpur, "Semi-orthogonal low-rank matrix factorization for deep neural networks," in Proc. Interspeech, 2018.
[36] D. Povey, V. Peddinti, D. Galvez, P. Ghahremani, V. Manohar, X. Na, Y. Wang, and S. Khudanpur, "Purely sequence-trained neural networks for ASR based on lattice-free MMI," in Proc. Interspeech, 2016.