Convolutional Speech Recognition with Pitch and Voice Quality Features
Guillermo Cámbara, Jordi Luque, and Mireia Farrús
Universitat Pompeu Fabra, Barcelona, Spain
Universitat Politècnica de Catalunya, Barcelona, Spain
Universitat de Barcelona, Barcelona, Spain
Telefónica Research, Barcelona, Spain
[email protected], [email protected], [email protected]
ABSTRACT
The effects of adding pitch and voice quality features such as jitter and shimmer to a state-of-the-art CNN model for Automatic Speech Recognition are studied in this work. Pitch features have previously been used for improving classical HMM and DNN baselines, while jitter and shimmer parameters have proven to be useful for tasks like speaker or emotion recognition. To the best of our knowledge, this is the first work combining such pitch and voice quality features with modern convolutional architectures, showing relative WER improvements of up to 7% and 3% for the publicly available Spanish Common Voice and LibriSpeech 100h datasets, respectively. In particular, our work combines these features with mel-frequency spectral coefficients (MFSCs) to train a convolutional architecture with Gated Linear Units (Conv GLUs). Such models have been shown to yield low word error rates while being very suitable for parallel processing in online streaming recognition use cases. We have added pitch and voice quality functionality to Facebook's wav2letter speech recognition framework, and we provide this code and the recipes to the community, so that further experiments can be carried out. Besides, to the best of our knowledge, our Spanish Common Voice recipe is the first public Spanish recipe for wav2letter.
Index Terms: automatic speech recognition, convolutional neural networks, pitch, jitter, shimmer
1. INTRODUCTION
Neural network models applied to the automatic speech recognition (ASR) task are consistently achieving state-of-the-art results in the field. Some of the best scoring architectures involve transformer-based acoustic models [1], LAS models [2] with SpecAugment data augmentation [3], or models strongly based on convolutional neural networks, like the ResNets and TDS ones in [4]. Such convolutional approaches have the advantage of being able to look at larger context windows without the risk of vanishing gradients, as in pure LSTM approaches, and of being suitable for online streaming applications, while attaining low word error rate (WER) scores. Furthermore, following the trend of making systems as end-to-end as possible, even fully convolutional neural approaches have been proposed, and have shown state-of-the-art performance [5]. This fully convolutional architecture takes advantage of stacking convolutional layers for efficient parallelization, with gated linear units that prevent the gradients from vanishing as architectures go deeper [6].

Recently, Facebook has open-sourced wav2letter [7], a very fast speech recognition framework with recipes prepared for training and decoding with some of these modern models, with an emphasis on the convolutional ones. Most of the modern architectures work only on cepstral (MFCC) and mel-frequency spectral coefficient (MFSC) inputs, or even directly on the raw waveform, and tend towards increasing granularity at the input level, usually by augmenting the number of spectral parameters. While it seems evident that current end-to-end deep network architectures are able to automatically perform relevant feature extraction for speech tasks, physical or functional properties related to the underlying speech production system become fuzzy or difficult to connect with speech recognition performance. In addition, it is still unclear whether, and to what degree, the great number of different hand-crafted voice features, carefully developed over past years and based on our linguistic knowledge, might help current speech network architectures.

Some well-known speech recognition frameworks, like Kaldi [8], have incorporated the use of additional prosodic features, such as the pitch or the probability of voicing (POV). These are stacked into an input vector together with the cepstral/spectral ones, and then forwarded to classifiers like HMM or DNN ensembles. Nevertheless, the newest convolutional architectures have not yet been extensively applied along with such prosodic features, and frameworks like wav2letter, which reach state-of-the-art performance in ASR tasks, do not yet provide integrated pitch functionality within their feature extraction modules. Furthermore, in the last decades, jitter and shimmer have been shown to be useful in a wide range of applications, e.g. detection of different speaking styles [9], age and gender classification [10], emotion detection [11], speaker recognition [12], speaker diarization [13], and Alzheimer's and Parkinson's disease detection [14, 15], among others.

The main contribution of this work focuses on assessing the value of adding pitch and voice quality features to the spectral coefficients commonly employed in most deep speech recognition systems. To this end, the dimension of the MFSC vector at the input layer is augmented with the prosodic features, and error rates are reported for both Spanish and English speech recognition tasks. Experiments are carried out using the Conv GLU model.
It was proposed by [16] within wav2letter's WSJ recipe and has reported state-of-the-art performance for both the LibriSpeech and WSJ datasets. To the best of our knowledge, this is the first attempt to use jitter and shimmer features within a modern deep neural speech recognition system while keeping easy-to-identify physical/functional properties of the voice and linking them to ASR performance. Furthermore, the recipes employed in this research have been published in a GitHub repository, aiming to make it easy to reproduce experiments on two public and freely available corpora: the Spanish Common Voice dataset [17] and the English LibriSpeech 100h partition [18].
2. PROSODIC AND VOICE QUALITY FEATURES
Pitch –or perceived fundamental frequency– has been proved to increase the performance of ASR systems, significantly for tonal languages like Punjabi [19], as well as for non-tonal ones like English [20]. Jitter and shimmer represent the cycle-to-cycle variations of the fundamental frequency and amplitude, respectively. They have long been considered relevant for, and often applied to, the detection of voice pathologies, as in [21, 22], and are thus regarded as measurements of voice quality. Although voice quality features differ intrinsically from suprasegmental prosodic features, they have been shown to be related to prosody. In [23], the authors showed that voice quality features are relevant markers signaling paralinguistic information, and that they should even be considered as prosodic carriers along with pitch and duration, for instance.

In the ASR literature, some works have reported that prosodic information can raise the performance of speech recognition systems. For instance, the authors of [24] built an ASR system for dysarthric speech, and [25] reports benefits from the use of jitter and shimmer for noisy speech recognition, both systems being based on classical HMM acoustic modelling. In the case of deep networks, LSTMs were used in [26] for an acoustic emotion recognition task; however, the authors did not perform the ASR task itself.

Following previous works, we hypothesize that prosodic and voice quality features may boost robustness in convolutional ASR systems. Moreover, they could play an even more important role in further speech tasks, including punctuation mark prediction, emotion recognition or musical contexts, where additional prosodic information would be useful.
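For reference, these measures are commonly defined as follows; this is a sketch of the usual formulations (see, e.g., [12]), where T_i denotes the duration of the i-th pitch period, A_i its peak amplitude, and N the number of extracted periods:

    \mathrm{Jitter_{abs}} = \frac{1}{N-1}\sum_{i=1}^{N-1}\lvert T_i - T_{i+1}\rvert,
    \qquad
    \mathrm{Jitter_{rel}} = 100\cdot\frac{\mathrm{Jitter_{abs}}}{\frac{1}{N}\sum_{i=1}^{N} T_i}

    \mathrm{Shimmer_{rel}} = 100\cdot\frac{\frac{1}{N-1}\sum_{i=1}^{N-1}\lvert A_i - A_{i+1}\rvert}{\frac{1}{N}\sum_{i=1}^{N} A_i},
    \qquad
    \mathrm{Shimmer_{dB}} = \frac{1}{N-1}\sum_{i=1}^{N-1}\left\lvert 20\log_{10}\frac{A_{i+1}}{A_i}\right\rvert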
3. METHODOLOGY
The effect of adding pitch and voice quality features is evaluated by means of the Common Voice corpus in Spanish [17] and the LibriSpeech 100h dataset in English [18]. The Common Voice corpus is an open-source dataset that consists of recordings of volunteer contributors pronouncing scripted sentences, recorded at a 48 kHz rate on their own devices. The sentences come from original contributor donations and public domain movie scripts, and the corpus is continuously growing. Although there are already more than 100 hours of validated audio, we have kept a reduced partition of approximately 19.0 h for the training set, 2.7 h for the development set and 2.2 h for the test set. The main criterion for the stratification of these partitions is to ensure that each one has exclusive speakers, while trying to keep an 80-10-10% proportion. Every sample can be down-voted by contributors if it is not clear enough, so we have discarded all samples containing at least one down vote, to keep the selected recordings as clean as possible. Afterwards, we try to keep the distributions by age, gender and accent as balanced as possible. The Python scripts for obtaining this partition are provided in our public Git repository (https://github.com/gcambara/wav2letter/tree/wav2letter_pitch), along with the other code necessary to reproduce our ASR recipes. To the best of our knowledge, this is the first public repository with a wav2letter recipe for a publicly available Spanish dataset. Besides, in order to provide results for a popular benchmark, the proposal is also assessed on the aforementioned LibriSpeech 100h partition in English, consisting of audio book recordings sampled at 16 kHz.

As recommended by wav2letter's Conv GLU recipes, raw audio is processed to extract static MFSCs, applying 40 filterbanks. This serves as our baseline, on top of which we append pitch and voice quality related features. From now on in this work, when we talk about pitch features we refer to the following three features: the extracted pitch itself, the POV for each frame, and the variation of pitch across two frames (delta-pitch). Thus, 40 MFSCs are always computed for each time frame and, if specified by the user in the configuration, the three pitch features (pitch, POV and delta-pitch) can be appended to them, plus relative jitter, absolute jitter, shimmer in dB and/or relative shimmer.

There are various pitch extraction algorithms, such as Yin [27] or getF0 [28]. However, we have decided to refactor Kaldi's algorithm from [29] into the feature extractor C++ class from wav2letter. The latter algorithm has been frequently tested over recent years within a wide variety of ASR tasks. It is inspired by getF0 and finds the sequence of lags that maximizes the Normalized Cross Correlation Function (NCCF). It makes use of the Viterbi algorithm for obtaining the optimal lags and, in our implementation, it applies the logarithm to the pitch values as the only post-processing step. The logarithm compresses pitch values to the same order of magnitude as the MFSCs, which are compressed by the logarithm as well, thus improving numerical stability later on during the training phase. A simplified sketch of the resulting input representation is shown below.
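The following Python sketch illustrates the shape of the augmented input. It is not the paper's C++ implementation: librosa's pYIN is assumed here as a stand-in for the Kaldi pitch tracker [29] (its voicing probability playing the role of the POV), unvoiced frames are given a dummy 1 Hz pitch, and a longer 64 ms analysis window is assumed for pitch so that low F0 values remain detectable; the 25 ms window and 10 ms stride follow the paper.

    # Sketch of the 43-dimensional input: 40 log mel filterbank energies
    # (MFSCs) plus log-pitch, probability of voicing and delta-pitch.
    import librosa
    import numpy as np

    def mfsc_pitch_features(wav_path, sr=16000, n_mels=40):
        y, sr = librosa.load(wav_path, sr=sr)
        n_fft, hop = int(0.025 * sr), int(0.010 * sr)   # 25 ms window, 10 ms stride

        # 40 MFSCs: log-compressed mel filterbank energies, shape (40, T).
        mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                             hop_length=hop, n_mels=n_mels)
        mfsc = np.log(mel + 1e-10)

        # Pitch and voicing probability; pYIN approximates the POV here.
        f0, _, pov = librosa.pyin(y, fmin=60, fmax=400, sr=sr,
                                  frame_length=1024, hop_length=hop)
        f0 = np.nan_to_num(f0, nan=1.0)                 # unvoiced -> dummy 1 Hz
        log_f0 = np.log(f0)                             # same scale as log-mels
        delta_f0 = np.diff(log_f0, prepend=log_f0[0])   # pitch change across frames

        T = min(mfsc.shape[1], len(log_f0))             # align frame counts
        return np.vstack([mfsc[:, :T], log_f0[:T], pov[:T], delta_f0[:T]])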
Subtracting the weighted average pitch during post-processing has been discarded, since the gains in WER reported by Kaldi are only around 0.1%, but we may implement it in future iterations.

Shimmer is computed by measuring the peak-to-peak waveform amplitude at each period where the pitch is extracted, and then performing the corresponding operations, depending on whether we deal with shimmer in dB or relative shimmer; see reference [12]. With the pitch extracted at each period, the same can be done for absolute and relative jitter, by calculating the fundamental frequency differences between such cycles.
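A minimal NumPy sketch of these per-window measures, assuming the pitch periods have already been located (consecutive period durations in seconds in `periods`, positive peak-to-peak amplitudes in `amps`, both from the pitch extractor), following the definitions given in Section 2:

    import numpy as np

    def jitter_shimmer(periods, amps):
        periods, amps = np.asarray(periods), np.asarray(amps)
        dp = np.abs(np.diff(periods))   # cycle-to-cycle period change
        da = np.abs(np.diff(amps))      # cycle-to-cycle amplitude change

        jitter_abs = dp.mean()                            # seconds
        jitter_rel = 100.0 * jitter_abs / periods.mean()  # percent
        shimmer_rel = 100.0 * da.mean() / amps.mean()     # percent
        shimmer_db = np.mean(np.abs(20.0 * np.log10(amps[1:] / amps[:-1])))
        return jitter_abs, jitter_rel, shimmer_rel, shimmer_db

In the experiments, these measures are averaged over 500 ms windows before being appended to each frame's feature vector, as described later in this section.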
Since our purpose is to study how pitch and voice quality features contribute to a convolutional acoustic model (AM), we have used the Conv GLU AM from wav2letter's Wall Street Journal (WSJ) recipe [16]. This model has approximately 17M parameters, with dropout applied after each of its 17 layers. The WSJ dataset contains around 80 hours of audio recordings, which is closer to the magnitude of our data than the full LibriSpeech recipe (about 1000 hours). We have not done an extensive exploration of architecture parameters, since the model yields decent out-of-the-box results with the Common Voice and LibriSpeech 100h data. A minimal sketch of a gated convolutional block is shown below.

Regarding Common Voice's lexicon, we use a grapheme-based one extracted from the approximately 9000 words in the training and development partitions. We use the standard Spanish alphabet as tokens, plus the "ç" letter from Catalan and the vowels with diacritical marks, making a total of 37 tokens. The "ç" character is included because of the presence of some Catalan words in the dataset, like "Barça". The language model (LM) is a 4-gram model extracted with KenLM [30] from the training set.
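For illustration, a minimal gated convolutional block in PyTorch is sketched here. It is an assumed stand-in for wav2letter's C++ Conv GLU layers, not the actual model: the convolution outputs twice the channels and the GLU uses one half to gate the other, which keeps gradients healthy in deep stacks [6]; the channel count, kernel size and dropout below are illustrative values, not the recipe's.

    import torch
    import torch.nn as nn

    class ConvGLUBlock(nn.Module):
        def __init__(self, in_ch, out_ch, kernel_size, dropout=0.25):
            super().__init__()
            # 2x output channels: half are features, half are gates.
            self.conv = nn.Conv1d(in_ch, 2 * out_ch, kernel_size,
                                  padding=kernel_size // 2)
            self.drop = nn.Dropout(dropout)

        def forward(self, x):  # x: (batch, channels, time)
            return self.drop(nn.functional.glu(self.conv(x), dim=1))

    # e.g. a first block consuming 43-dim MFSC+pitch frames:
    block = ConvGLUBlock(in_ch=43, out_ch=200, kernel_size=13)
    out = block(torch.randn(8, 43, 100))  # -> (8, 200, 100)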
Table 1: WER percentages obtained by augmenting spectral features with prosody and voice quality ones. The results are reported on Common Voice's development (Dev) and test (Test) sets, comprising 2.7 hours and 2.2 hours, respectively. Error rates are obtained by greedy decoding without a language model (NoLM) and by beam-search decoding using a 4-gram LM trained either on the Common Voice training subset (CVLM) or on the training partition of the Spanish Fisher-Callhome corpus (FCLM).
AM WER (%)

Features                      NoLM-Dev   NoLM-Test   CVLM-Dev   CVLM-Test   FCLM-Dev   FCLM-Test
MFSC                          64.92      70.07       20.29      24.72       38.58      44.–
+ Pitch                       63.18      68.79       –.56       24.–        37.57      –
+ Pitch + Jitter              –.83       69.56       20.28      23.97       38.07      43.–
+ Pitch + Shimmer             –.18       77.04       23.30      25.10       46.90      50.–
+ Pitch + Jitter + Shimmer    –.46       69.–        20.01      22.90       –          42.95

Since most of the sentences are shared across partitions, due to the scripted nature of the dataset, we expected optimistically biased results after applying this LM. Therefore, we also report results given by another 4-gram LM, extracted from the Spanish Fisher+Callhome corpus. The Fisher corpus splitting is taken from Kaldi's recipe [31]. Decoding across AM, lexicon and LM is done with the beam-search decoder provided by wav2letter [32]. Furthermore, in order to assess the capacity of the AM by itself, we also evaluate without an LM, choosing the final characters with the greedy best path from the predictions of the AM. For the LibriSpeech evaluation, the lexicon and the language model are the same as provided by wav2letter's Conv GLU LibriSpeech recipe. The lexicon is obtained from the training corpus, and the language model is a 4-gram model also trained with KenLM.

After some initial simulations, we have found that the most stable voice quality features are relative jitter and relative shimmer. Therefore, we try 5 different feature configurations:

1. 40 MFSCs only
2. 40 MFSCs + 3 pitch features
3. 40 MFSCs + 3 pitch + 1 relative jitter
4. 40 MFSCs + 3 pitch + 1 relative shimmer
5. 40 MFSCs + 3 pitch + 1 relative jitter + 1 relative shimmer

For each one, we compute WERs on Common Voice's dev and test sets. Decodings are performed without an LM (NoLM) and with both in-domain and out-of-domain LMs, from the Common Voice (CVLM) and Fisher+Callhome (FCLM) databases, respectively. Therefore, we obtain 6 WERs for each of the 5 feature configurations.

Apart from the features, the training configuration for each experiment is the same, all based on wav2letter's WSJ recipe. The inferred segmentation is obtained with wav2letter's Auto Segmentation Criterion (ASG) [16], inspired by the CTC loss [33]. The learning rate, tuned on the dev set, is decayed by 20% every 10 epochs. A 25 ms rolling window with a 10 ms stride is used for extracting all the features; jitter and shimmer are averaged across 500 ms windows. For beam-search decoding, the following settings are tuned on the dev set: LM weight set to 2.5, word score set to 1, beam size set to 2500, beam threshold set to 25 and silence weight set to -0.4 (an illustrative flag file collecting these settings is shown at the end of this section). We have not run an extensive exploration of these hyperparameters, but after a shallow search we found these values to provide good results for both LMs.

Furthermore, LibriSpeech WER is evaluated on the dev-clean/other and test-clean/other partitions, using the same AM training recipe as for Common Voice. As we are scaling up to a bigger dataset demanding a higher computational cost, the top three feature configurations found in the Common Voice experiments are selected for these evaluations. Decoding parameters are taken from wav2letter's LibriSpeech recipe.
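For reference, the tuned decoder settings above can be collected in a wav2letter flag file. The flag names below follow wav2letter++'s decoding options as commonly documented (lmweight, wordscore, beamsize, beamthreshold, silweight); they are an assumption to be checked against the framework version in use, since the exact flags are not listed in this paper.

    # Hypothetical decode flag file mirroring the tuned settings above
    --lmweight=2.5
    --wordscore=1
    --beamsize=2500
    --beamthreshold=25
    --silweight=-0.4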
4. RESULTS AND DISCUSSION
Table 1 reports the word error rates (WER, %) for each of the 5 feature configurations, for the proposed decodings of Common Voice's dev and test sets: without LM (NoLM), with the corpus's own LM (CVLM) and with the Fisher+Callhome LM (FCLM). For every evaluated case, the best WER score is always provided by one of the models using pitch features, or pitch together with voice quality (jitter + shimmer) features, with gains between 1.38% and 7.36% relative WER points.

For the cases without LM, the model with MFSC and pitch features is the one with the best performance, with relative gains of 2.68% and 1.83% on the dev and test sets, respectively. Additional features on the other models also improve the WER score, except for the case with pitch and shimmer only, which yields worse results across all experiments. On the other hand, decoding with the CVLM achieves the best WER scores when training with all the proposed features together: MFSCs, the 3 pitch features, relative jitter and relative shimmer. A 20.01% WER is obtained on the dev set, and a 22.90% WER on the test set. As expected, the CVLM improves the predictions drastically because, even though it is obtained from the training partition only, many sentences are shared with the dev and test sets, due to the reduced vocabulary of this dataset.

A more realistic approach is to decode using an external LM. The FCLM language model is built from the training partition of the LDC Spanish Fisher+Callhome corpus. Although the LM enrollment is performed with less than 20 hours of audio (approximately 16k sentences), it still yields a reasonable performance compared to the CVLM decodings. With respect to the prosodic features, the FCLM beam decoding reaches the lowest WER on the development set by using MFSCs augmented with pitch features only, that is, 37.57% WER. The lowest WER score on the test set, 42.95%, is given by the combination of all pitch and voice quality characteristics. Once again, the best results in terms of WER are provided by models with pitch features, or pitch features combined with jitter and shimmer, showing the potential of pitch and voice quality features to improve the performance of an ASR system based on convolutional neural networks.

Figure 1: Common Voice and LibriSpeech dev set WER (%) during training, as a function of the epoch number. For the latter, dev-clean and dev-other are evaluated. Curves are shown across the 5 different feature configurations for the same acoustic model architecture.

Nonetheless, it is worth noticing how the use of only pitch and shimmer features leads to worse performance for both the AM-only and AM/LM decoding models. This behaviour is depicted in Figure 1(a), where using only shimmer dramatically affects the training stage of the model, making it worse and slower. However, training with pitch features, or with pitch and jitter features, seems to help reach better WER plateaus at a faster pace. While jitter is a measure of frequency instability in the wave, shimmer is a measure of amplitude instability. Thus, pitch and jitter characteristics might contribute independent information to the MFSC spectral features, simply by synchronising them in a plain concatenation like the proposed one. However, the inclusion of shimmer, which is related to amplitude –as opposed to pitch and jitter, which are related to frequency–, is more likely to be understood as a perturbation throughout the convolutional layers, and this might hinder acoustic model training.

Even so, it is interesting to see that, if shimmer is coupled with jitter and pitch characteristics altogether, the performance obtained is more robust compared to the baseline, independently of whether decoding is done with the CVLM or the FCLM language model. Other studies already relate jitter and shimmer to the same index, namely the Voice Handicap Index (VHI) [34], so the convolutional filters may be finding similar correlations, thus improving mutual information when these measures are coupled together with spectral features, and promoting such voice measurements as good feature candidates for enhancing the speech recognition of pathological voices. The latter is an interesting hypothesis for which to seek further evidence.

The impact of pitch and voice quality features is also reported in the LibriSpeech experiments; see Table 2 and Figure 1(b). We observe relative improvements of 2.94% and 2.06% for dev-clean and dev-other, respectively, and about 0.96% and 2.87% for test-clean and test-other. Gains seem to be more consistent for the "other" partitions, where there is more accent and prosody diversity than in the "clean" ones.

Appending pitch characteristics to MFSCs seems to slightly improve ASR performance. Among the configurations, the MFSC+pitch and MFSC+pitch+jitter+shimmer combinations are the ones that provide the most robust behavior across all the experiments. All assessed features carry prosodic information and might aid the network by complementing the information conveyed by the magnitude spectrum alone, for instance by helping to reduce the MFSC distortion which appears at the lower frequency region of the spectrum [35]. Overall, they help boost the performance of the convolutional acoustic model for both the databases and the languages studied in this work.
Table 2: LibriSpeech WER values for the three best performing feature combinations proposed for the acoustic model (AM) in the Common Voice experiments: MFSC, MFSC+Pitch and MFSC+Pitch+Shimmer+Jitter (shortened as "All"). Decoding is done with a 4-gram LM trained on the LibriSpeech training set transcripts.
AM WER (%)
Features    dev-clean   dev-other   test-clean   test-other
MFSC        10.22       31.59       10.38        34.–
+ Pitch     9.94        30.–        10.28        33.–
+ All       9.92        30.–        10.37        33.–

Note that the approach employed for the feature combination has been quite simple: the new features are just appended to the spectral ones at the input layer, without extensive post-processing or adaptation of the model architecture. It is therefore reasonable to think that there is still room for improvement in the application of pitch and voice quality measurements to state-of-the-art convolutional neural models. Possible strategies comprise adapting the feature concatenation, perhaps by dedicating exclusive filters to the new pitch and voice quality features with the POV as a gating mechanism, especially after experimentally realising (not reported in this work) that the estimation of measurements like shimmer may benefit from different post-processing techniques.
5. CONCLUSIONS
In this study, we performed a preliminary exploration of the effects of pitch and jitter/shimmer voice quality measurements within the framework of the ASR task performed by convolutional neural network models. The experiments reported on a publicly available Spanish speech corpus showed consistent improvements in model robustness, achieving a 7% relative WER reduction in some scenarios. Besides, such feature extraction functionalities are provided and integrated with the wav2letter code, to easily replicate our findings or to directly apply pitch and voice quality features to wav2letter models. We also provide the recipe for the Common Voice Spanish dataset, the first wav2letter recipe for a publicly available Spanish dataset. The recipe for the LibriSpeech experiments is also provided, which achieves up to a 2.94% relative WER improvement. Further steps in the research of convolutional ASR with pitch and voice quality would imply adapting architectures for feature processing, or applying such characteristics to tasks involving punctuation marks, emotion recognition, and even pathological or singing voices. For the latter tasks, the importance of pitch and voice quality features is expected to become even more relevant.
6. ACKNOWLEDGMENTS
This work is a part of the INGENIOUS project, funded by the European Union's Horizon 2020 Research and Innovation Programme and the Korean Government under Grant Agreement No 833435. The third author has been funded by the Agencia Estatal de Investigación (AEI), Ministerio de Ciencia, Innovación y Universidades and the Fondo Social Europeo (FSE) under grant RYC-2015-17239 (AEI/FSE, UE).

7. REFERENCES

[1] Y. Wang, A. Mohamed, D. Le, C. Liu, A. Xiao, J. Mahadeokar, H. Huang, A. Tjandra, X. Zhang, F. Zhang et al., "Transformer-based acoustic modeling for hybrid speech recognition," in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 6874–6878.
[2] W. Chan, N. Jaitly, Q. V. Le, and O. Vinyals, "Listen, attend and spell," CoRR, vol. abs/1508.01211, 2015. [Online]. Available: http://arxiv.org/abs/1508.01211
[3] D. S. Park, W. Chan, Y. Zhang, C.-C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le, "SpecAugment: A simple data augmentation method for automatic speech recognition," arXiv preprint arXiv:1904.08779, 2019.
[4] G. Synnaeve, Q. Xu, J. Kahn, T. Likhomanenko, E. Grave, V. Pratap, A. Sriram, V. Liptchinsky, and R. Collobert, "End-to-end ASR: from supervised to semi-supervised learning with modern architectures," 2019.
[5] N. Zeghidour, Q. Xu, V. Liptchinsky, N. Usunier, G. Synnaeve, and R. Collobert, "Fully convolutional speech recognition," 2018.
[6] Y. N. Dauphin, A. Fan, M. Auli, and D. Grangier, "Language modeling with gated convolutional networks," CoRR, vol. abs/1612.08083, 2016. [Online]. Available: http://arxiv.org/abs/1612.08083
[7] V. Pratap, A. Hannun, Q. Xu, J. Cai, J. Kahn, G. Synnaeve, V. Liptchinsky, and R. Collobert, "wav2letter++: The fastest open-source speech recognition system," 12 2018.
[8] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlíček, Y. Qian, P. Schwarz, J. Silovský, G. Stemmer, and K. Veselý, "The Kaldi speech recognition toolkit," IEEE 2011 Workshop on Automatic Speech Recognition and Understanding, 01 2011.
[9] R. E. Slyh, W. T. Nelson, and E. G. Hansen, "Analysis of mrate, shimmer, jitter, and F0 contour features across stress and speaking style in the SUSAS database," vol. 4. IEEE, 1999, pp. 2091–2094.
[10] F. Wittig and C. Müller, "Implicit feedback for user-adaptive systems by analyzing the users' speech," in Proceedings of the Workshop on Adaptivität und Benutzermodellierung in interaktiven Softwaresystemen (ABIS), Karlsruhe, Germany, 2003.
[11] X. Li, J. Tao, M. T. Johnson, J. Soltis, A. Savage, K. M. Leong, and J. D. Newman, "Stress and emotion classification using jitter and shimmer features," vol. 4. IEEE, 2007, pp. IV-1081.
[12] M. Farrús, J. Hernando, and P. Ejarque, "Jitter and shimmer measurements for speaker recognition," in Proceedings of the Interspeech, Antwerp, Belgium, 2007.
[13] A. W. Zewoudie, J. Luque, and F. J. Hernando Pericás, "Jitter and shimmer measurements for speaker diarization," in VII Jornadas en Tecnología del Habla and III Iberian SLTech Workshop, Las Palmas de Gran Canaria, Spain, November 19-21, 2014, pp. 21–30.
[14] S. Mirzaei, M. El Yacoubi, S. Garcia-Salicetti, J. Boudy, C. Kahindo, V. Cristancho-Lacroix, H. Kerhervé, and A.-S. Rigaud, "Two-stage feature selection of voice parameters for early Alzheimer's disease prediction," IRBM, vol. 39, no. 6, pp. 430–435, 2018.
[15] A. Benba, A. Jilbab, and A. Hammouch, "Hybridization of best acoustic cues for detecting persons with Parkinson's disease," IEEE, 2014, pp. 622–625.
[16] R. Collobert, C. Puhrsch, and G. Synnaeve, "Wav2letter: an end-to-end convnet-based speech recognition system," CoRR, vol. abs/1609.03193, 2016. [Online]. Available: http://arxiv.org/abs/1609.03193
[17] R. Ardila, M. Branson, K. Davis, M. Henretty, M. Kohler, J. Meyer, R. Morais, L. Saunders, F. M. Tyers, and G. Weber, "Common Voice: A massively-multilingual speech corpus," 2019.
[18] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "LibriSpeech: An ASR corpus based on public domain audio books," 04 2015, pp. 5206–5210.
[19] J. Guglani and A. Mishra, "Automatic speech recognition system with pitch dependent features for Punjabi language on Kaldi toolkit," Applied Acoustics, vol. 167, p. 107386, 2020.
[20] M. Magimai-Doss, T. Stephenson, and H. Bourlard, "Using pitch frequency information in speech recognition," 01 2003.
[21] J. Kreiman and B. R. Gerratt, "Perception of aperiodicity in pathological voice," The Journal of the Acoustical Society of America, vol. 117, no. 4, pp. 2201–2211, 2005.
[22] D. Michaelis, M. Fröhlich, H. W. Strube, E. Kruse, B. Story, and I. R. Titze, "Some simulations concerning jitter and shimmer measurement," 1998, pp. 744–754.
[23] N. Campbell and P. Mokhtari, "Voice quality: the 4th prosodic dimension," pp. 2417–2420, 2003.
[24] B.-F. Zaidi, M. Boudraa, S.-A. Selouani, D. Addou, and M. S. Yakoub, "Automatic recognition system for dysarthric speech based on MFCC's, PNCC's, jitter and shimmer coefficients," in Science and Information Conference. Springer, 2019, pp. 500–510.
[25] H. Rahali, Z. Hajaiej, and N. Ellouze, "Robust features for noisy speech recognition using jitter and shimmer," International Journal of Innovative Computing, Information and Control, vol. 11, pp. 955–963, 01 2015.
[26] J. Cho, R. Pappagari, P. Kulkarni, J. Villalba, Y. Carmiel, and N. Dehak, "Deep neural networks for emotion recognition combining audio and transcripts," 2019.
[27] A. de Cheveigné and H. Kawahara, "YIN, a fundamental frequency estimator for speech and music," The Journal of the Acoustical Society of America, vol. 111, no. 4, pp. 1917–1930, 2002.
[28] D. Talkin, "A robust algorithm for pitch tracking (RAPT)," 2005.
[29] P. Ghahremani, B. BabaAli, D. Povey, K. Riedhammer, J. Trmal, and S. Khudanpur, "A pitch extraction algorithm tuned for automatic speech recognition," pp. 2494–2498, 2014.
[30] K. Heafield, "KenLM: Faster and smaller language model queries," in Proceedings of the Sixth Workshop on Statistical Machine Translation, ser. WMT '11. USA: Association for Computational Linguistics, 2011, pp. 187–197.
[31] R. J. Weiss, J. Chorowski, N. Jaitly, Y. Wu, and Z. Chen, "Sequence-to-sequence models can directly translate foreign speech," 2017.
[32] V. Liptchinsky, G. Synnaeve, and R. Collobert, "Letter-based speech recognition with gated convnets," ArXiv, vol. abs/1712.09444, 2017.
[33] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, "Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks." New York, NY, USA: Association for Computing Machinery, 2006.
[34] A. Schindler, F. Mozzanica, M. Vedrody, P. Maruzzi, and F. Ottaviani, "Correlation between the voice handicap index and voice measurements in four groups of patients with dysphonia," Otolaryngology–Head and Neck Surgery, vol. 141, no. 6, pp. 762–769, 2009.
[35] I. C. Yadav, S. Shahnawazuddin, and G. Pradhan, "Addressing noise and pitch sensitivity of speech recognition system through variational mode decomposition based spectral smoothing."