A Deep Learning Algorithm for Objective Assessment of Hypernasality in Children with Cleft Palate
Vikram C. Mathad, Nancy Scherer, Kathy Chapman, Julie M. Liss, Visar Berisha
Abstract—Objectives: Evaluation of hypernasality requires extensive perceptual training by clinicians, and extending this training on a large scale internationally is untenable; this compounds the health disparities that already exist among children with cleft. In this work, we present the objective hypernasality measure (OHM), a speech analytics algorithm that automatically measures hypernasality in speech, and validate it relative to a group of trained clinicians. Methods: We trained a deep neural network (DNN) on approximately 100 hours of a publicly-available healthy speech corpus to detect the presence of nasal acoustic cues generated through the production of nasal consonants and nasalized phonemes in speech. Importantly, this model does not require any clinical data for training. The posterior probabilities of the deep learning model were aggregated at the sentence and speaker levels to compute the OHM. Results: The results showed that the OHM was significantly correlated with the perceptual hypernasality ratings in the Americleft database (r = 0.797, p < .) and with the New Mexico Cleft Palate Center (NMCPC) database (r = 0.713, p < .). In addition, we evaluated the relationship between the OHM and articulation errors, the sensitivity of the OHM in detecting the presence of very mild hypernasality, and the internal reliability of the metric. Further, the performance of the OHM was compared with a DNN regression algorithm directly trained on the hypernasal speech samples. Significance: The results indicate that the OHM is able to rate the severity of hypernasality on par with Americleft-trained clinicians on this dataset.

Index Terms—Cleft palate, clinical speech analysis, deep neural networks, hypernasality, speech assessment, vocal biomarkers
I. INTRODUCTION
Cleft palate (CP), with or without cleft lip, is a craniofacial anomaly and the most common birth disorder, with 1 in every 700 live births presenting with craniofacial clefts [1]. In healthy craniofacial development, the bilateral bony palatal shelves fuse horizontally at midline to create the roof of the mouth (hard palate) and provide points of muscular attachment for the soft palate (velum). These velar muscles, along with those in the upper pharynx, allow for modulation of the opening between the oral and nasal cavities (velopharyngeal port) during respiration, swallowing, and speaking. The failure of the palatal shelves to fuse at midline during embryological development (cleft) means there is no hard or soft palate and no separation between the oral and nasal cavities. The primary intervention involves surgical repair of the palatal cleft to produce anatomical closure and to create the ability to modulate the velopharyngeal port aperture. When velopharyngeal dysfunction (VPD) persists post primary palate surgery [2], a secondary surgery (e.g., pharyngeal flap, dynamic sphincter pharyngoplasty) is required. In the presence of VPD, the velopharyngeal port fails to close off the nasal tract completely during speech production for non-nasal sounds, and air and acoustic energy escape through the nasal cavity, resulting in reduced speech intelligibility. Twenty to thirty percent of children with clefts require a secondary surgery to rectify VPD [3] for the exclusive purpose of improving speech outcomes.

Vikram C. Mathad is with the Department of Speech & Hearing Sciences, Arizona State University, Tempe, AZ 85281 (email: [email protected]). Nancy Scherer is with the Department of Speech & Hearing Sciences, Arizona State University, Tempe, AZ 85281 (email: [email protected]). Kathy Chapman is with the Department of Communication Sciences and Disorders, University of Utah, Salt Lake City, UT 84112 (email: [email protected]). Julie M. Liss is with the Department of Speech & Hearing Sciences, Arizona State University, Tempe, AZ 85281 (email: [email protected]). Visar Berisha is with the College of Health Solutions and the School of Electrical, Computer, & Energy Engineering, Arizona State University, Tempe, AZ 85281 (email: [email protected]). This work is funded in part by NIH grant NIDCR DE026252.

The inability to achieve adequate velopharyngeal closure during speech results in the percept of hypernasality, characterized by excessive nasal resonance due to passage of the vibrating column of air through the nasal cavity (see supplementary material to listen to hypernasal speech). The perception of hypernasality in speech, secondary to VPD, is considered a primary outcome measure in CP as it drives decisions related to secondary surgery and speech therapy, and is an important determinant of long-term educational and social outcomes [4], [5]. As a result, it is considered a primary outcome by the American Cleft Palate-Craniofacial Association and the Cleft Palate Committee of the International Association of Logopedics and Phoniatrics [6].

Instrumental methods such as a nasometer, magnetic resonance imaging, and cineradiography can be used to assess VPD [7]. These instruments require special training and expensive equipment; furthermore, they only show a moderate correlation with perceptual impressions of hypernasality. As a result, they are rarely used in clinical practice [7]. Instead, clinicians rely on their perception of hypernasality to assess VPD. Perception of hypernasality is a complex task that requires the clinician to infer, from the acoustic signal, the ratios of resonances across the pharyngeal, oral, and nasal cavities. The clinician then maps the perceived ratios to equal-interval or visual-analog scales of hypernasality. However, this percept is vulnerable to other co-modulating variables such as the words being spoken, the quality and loudness of the voice, audible turbulence and escape of air through the nose (nasal emission), and the idiosyncratic shape of an individual's resonating cavities [8], [9]. This results in a highly nonlinear mapping between the percept and the actual acoustic nasal resonance, and considerable inter-rater and intra-rater variability in the assessment. Fundamentally, this limits the reliability and validity of the ratings obtained from untrained clinicians [10]. The Americleft Speech Project was developed to address this dilemma by facilitating inter-center collaborations for speech outcomes research [11]. The first step included the development of a standardized protocol and calibration of craniofacial speech-language pathologists (SLPs) on perceptual ratings of hypernasality. Over the study period, recalibration was required to maintain high levels of inter-rater reliability. To date, only a small number of clinicians have participated in this program and applying this training on a large scale internationally is untenable.

In this paper, we present the objective hypernasality measure (OHM) to assess hypernasality in CP speech and show that it tracks with the clinical perception of Americleft-trained SLPs. When clinicians make judgements of hypernasality, they focus on specific acoustic cues that are hallmarks of hypernasal speech. Similarly, we design an automatic assessment tool based on deep learning and demonstrate that the learned features from speech correlate with the clinical ratings of hypernasality.

A. Related work
The development of speech technology-based systems involves the extraction of acoustic features that reflect the abnormal nasal resonances present in hypernasal speech. Spectral measures, such as the addition of an extra nasal formant around 250 Hz, increased spectral flatness, reduced first formant amplitude, voice low-to-high tone ratio, and vowel space area, have previously shown a correlation with perceived hypernasality [12]–[17]. Acoustic features in combination with machine learning algorithms have been used to develop automatic hypernasality assessment systems. Mel-frequency cepstral coefficients (MFCCs), jitter, shimmer, vowel space area, and wavelet transform-based features have been used to train classifiers (e.g., support vector machines (SVMs), Gaussian mixture models (GMMs)) that detect hypernasal speech [15], [16], [18], [19]. Recently, convolutional neural networks and recurrent neural networks have also been used for the same purpose [20], [21].

Most of these automatic algorithms for the detection of hypernasality were developed in a binary classification setting, i.e., healthy vs. hypernasal speech [15], [16], [18]–[21]. This is inconsistent with clinical practice, where clinicians require more fine-grained information (e.g., evaluation of hypernasality on a scale that ranges from normal to severe) for decision making. For example, a secondary surgery may only be required for treating moderate-severe hypernasal cases [22]. There are a limited number of multi-class classification [23]–[25] and regression-based approaches for predicting hypernasality severity [26], [27]. These approaches rely on analysis of sustained phonations or segmented phonemes from utterances. This is limiting in two ways: (1) Sustained phonations don't capture phonetic context and don't provide a reliable percept of hypernasality. That's why clinicians prefer to use connected speech for reliable estimation of hypernasality [28]. (2) For approaches that rely on connected speech, the phonetic segmentation was achieved either by manual marking or by forced alignment using orthographic transcriptions. However, orthographic transcription is time consuming and the forced alignment procedure is prone to errors for children's speech [29].

Hypernasality estimation based on supervised machine learning requires large, labeled speech corpora. Most of the existing speech-based hypernasality evaluation methods use speech samples and corresponding perceptual ratings to train machine learning models, including k-nearest neighbor classifiers [25], Gaussian mixture models [24], support vector machines [18], [19], [23], [30], and deep neural networks [20], [21], [31]. The performance of these systems critically depends on the availability of clinical hypernasal speech databases that include speech samples from patients and corresponding hypernasality ratings from trained SLPs. However, the development of a large hypernasal speech database is difficult in practice due to the limited availability of patients' speech and the associated SLP clinical ratings. As a result, the models run the risk of overfitting to a particular database and rating scale.

B. The proposed approach
While large-scale databases of CP speech are untenable, healthy speech provides us with clues as to the acoustic manifestation of hypernasality. For example, the voiced sounds /M/ and /N/, and the sounds that precede and follow them, require opening of the velopharyngeal port to shunt the vibrating column of air through the nasal cavity. Thus, /M/ and /N/ are nasal consonants (NC). Because the velum is a relatively sluggish articulator in comparison with the tongue and lips, the velopharyngeal port opens and closes more slowly, creating nasalization of vowels adjacent to the NCs, or nasalized vowels (NV). For example, the vowel /AE/ is nasalized in the word "man".

This is in contrast to the production of the oral consonants (OC), which involve closure of the velopharyngeal port to impound oral air pressure that creates a burst upon release of the articulatory closure (plosive). The voiced stop consonants, /B/ and /D/, and the unvoiced stop consonants, /P/ and /T/, share the exact same places of articulation as /M/ and /N/, respectively, but are produced completely orally. This means that vowels (oral vowels, OV) adjacent to these OCs are also not nasalized, as in the /AE/ in the word "cat". Since the effect of nasalization is evident in healthy speech, the acoustic manifestation of the velopharyngeal port can be modeled using a healthy speech corpus. Compared to CP speech, a large number of healthy speech corpora are available in the public domain. In our algorithm, we make use of a publicly available healthy speech corpus and train a nasality feature extraction model using only healthy speech. This results in an objective measure of hypernasality (OHM) that can be computed frame-by-frame and aggregated at the level of an utterance or speaker.

An overview of the proposed algorithm for estimating the OHM is described in Fig. 1. To learn the acoustic manifestation of the velopharyngeal port opening, we utilize 100 hours
of speech from the publicly-available Librispeech corpus to train a deep neural network (DNN) model that classifies among nasalized consonants (NC), oral consonants (OC), nasalized vowels (NV), and oral vowels (OV). Training the DNN to classify among these classes forces it to learn the acoustic manifestation of an open velopharyngeal port. As a result, we refer to this DNN as the nasality model and use the four DNN posteriors of this model as "features" for assessing the presence of nasalization in CP speech.

The input children's speech was pre-processed, and we extracted the four posterior features using the pre-trained DNN nasality model. These features were combined to derive the OHM for each speech utterance. The details of the algorithm can be found in the Methods section. We established the construct validity of the OHM through several experiments using cleft speech samples and gold-standard clinical ratings from the Americleft project; then we evaluated the external validity of the model using data from the New Mexico Cleft Palate Center (NMCPC) database.

Fig. 1. Overview of the proposed approach for hypernasality prediction. First, the input children's speech is pre-processed and passed through the pre-trained DNN nasality model. The DNN model posteriors are combined to form the objective hypernasality measure.
II. DATABASES
The details of the healthy speech corpus and the two CP speech databases are described below.
A. Healthy speech corpus
One hundred hours of healthy speech from the Librispeech database (train-clean-100) was used to train the DNN [32]. The database contains English read speech samples recorded from 251 healthy adult speakers (125 male and 126 female). In addition to the speech samples, the database also contains orthographic transcriptions for each read sentence. A separate test set (test-clean), comprised of 5.4 hours of speech, was used as a validation set.
B. Americleft database
The Americleft database was collected as a part of the Americleft Speech Project at the University of Utah. The database consists of 60 children with CP (37 boys and 23 girls) with an average age of . ± . years. The control group consisted of 10 typically developing children (6 boys and 4 girls) with typically-developing speech characteristics (as determined by a speech-language pathologist), with an average age of . ± . years. The recorded stimuli were comprised of 24 sentences containing different target consonants [11]. The Americleft database was used with approval from the Institutional Review Board (IRB) with IRB ID: STUDY00008224, and written consent was obtained from all participants.

Fig. 2. The distribution of the number of speakers over different ground-truth hypernasality levels. Histograms of weighted averaged ratings for the (a) Americleft (70 speakers), (b) balanced Americleft (38 speakers), and (c) NMCPC (51 speakers) databases.
The hypernasality of the recorded speech samples was perceptually evaluated by 4 SLPs from the Americleft speech outcomes group (ASOG) according to a standardized protocol [11]. The speaker-level hypernasality was rated on the Americleft Speech Protocol 5-point scale (0-normal, 1-borderline, 2-mild, 3-moderate, 4-severe) [11]. The Pearson correlation coefficient was computed between different pairs of raters to evaluate the inter-rater reliability. The average inter-rater correlation coefficient was found to be . ± . . The ratings of all 4 SLPs were averaged to obtain a single "ground-truth" rating per speaker [33]. The histogram in Fig. 2(a) shows the distribution of hypernasality ratings for the 70 speakers from the Americleft database. It is clear from the figure that the database is skewed towards the normal ('0') end of the scale. We balance the original Americleft data by randomly removing a subset of speakers rated with normal hypernasality. The balanced Americleft database, Americleft(38), is comprised of 38 speakers, and the histogram of the ground-truth ratings is shown in Fig. 2(b). The average inter-rater correlation for Americleft(38) was . ± . .

In addition to hypernasality, the Americleft samples were evaluated for articulation errors. The sentence stimuli were phonetically transcribed by the four Americleft raters using the International Phonetic Alphabet. The numbers of active errors (glottal, pharyngeal, nasal fricatives, palatal, dental, lateral, and double articulation errors) and passive errors (nasal substitutions, weak pressure consonants) were computed by the four SLP raters. In the present work, the active and passive errors were reported against the target consonants present in the 24 sentence-level recordings. Finally, for each speaker, the ratio between the number of active errors and the total number of target consonants was used to compute the percentage of active errors. Similarly, for each speaker, we computed the percentage of passive errors.
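These per-speaker percentages reduce to simple ratios of error counts to target-consonant counts. A minimal sketch in Python is shown below; the count variables are hypothetical placeholders, not code or data from the study.

```python
# Illustrative sketch of the per-speaker error percentages described above.
# The counts are assumed to have been tallied from the IPA transcriptions
# of the 24 sentences; variable names are hypothetical.

def articulation_error_percentages(n_active, n_passive, n_target_consonants):
    """Return (PAAE, PPAE): active and passive errors as % of target consonants."""
    paae = 100.0 * n_active / n_target_consonants
    ppae = 100.0 * n_passive / n_target_consonants
    return paae, ppae

# Example: 12 active and 5 passive errors over 80 target consonants.
print(articulation_error_percentages(12, 5, 80))  # (15.0, 6.25)
```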
C. The New Mexico Cleft Palate Center database

The New Mexico Cleft Palate Center (NMCPC) database is described in [30]. The database is comprised of speech samples from 10 controls (8 boys and 2 girls) and 41 children with CP (22 boys and 19 girls) with an average age of . ± . years. Each child was asked to repeat a random subset of sentences selected from a larger set of 76 sentences. The number of sentences per participant ranged from 7 to 69. The recorded samples were perceptually evaluated by 5 listeners on a continuous scale ranging from 0 to 3, where 0 stands for normal and 3 for severe hypernasality. These raters were not speech-language pathologists, but speech processing experts who listened to the samples together after some self-training on how to evaluate hypernasality. The average inter-rater correlation of this database was . ± . . The ratings of the 5 raters were averaged to obtain a single "ground-truth" rating per speaker [33]. A histogram of the average ratings is shown in Fig. 2(c). This database is nicely balanced across the different hypernasality levels. The NMCPC database is a publicly available database that can be acquired upon request to Dr. Luis Cuadros, New Mexico Cleft Palate Center.

III. METHODS
A. DNN Nasality Model
Below we describe the development of the DNN nasality model, including its architecture and the training procedure.
DNN model architecture:
The architecture of the DNN nasality model is shown in Fig. 3. The model has an input layer with 39 nodes, corresponding to the 39-dimensional MFCC input speech features. The model is comprised of 3 hidden layers, where each layer has 1024 hidden neurons with rectified linear unit (ReLU) activation. The output layer consists of 4 softmax nodes, each interpreted as a posterior probability corresponding to the nasal consonant (NC), oral consonant (OC), nasal vowel (NV), and oral vowel (OV) classes.

Fig. 3. The architecture of the DNN nasality model. The model is a feed-forward neural network consisting of a 39-dimensional input layer, three hidden layers with 1024 hidden neurons in each layer, and a 4-dimensional softmax output layer. The output layer yields posterior probabilities corresponding to nasal consonants (NC), oral consonants (OC), nasal vowels (NV), and oral vowels (OV).
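A minimal Keras sketch of a network with the shape in Fig. 3 is given below; it is an illustration of the described architecture with the training settings stated later in this section (Adam at its default learning rate of 0.001, categorical cross-entropy), not the authors' code.

```python
# Feed-forward nasality model: 39-dim MFCC input, three 1024-unit ReLU hidden
# layers, and a 4-way softmax over the NC, OC, NV, and OV classes.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

def build_nasality_model(n_inputs=39, n_hidden=1024, n_classes=4):
    model = Sequential([
        Dense(n_hidden, activation="relu", input_shape=(n_inputs,)),
        Dense(n_hidden, activation="relu"),
        Dense(n_hidden, activation="relu"),
        Dense(n_classes, activation="softmax"),  # posteriors P(NC), P(OC), P(NV), P(OV)
    ])
    # "adam" uses the default learning rate of 0.001 reported in the text.
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```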
Training the DNN:
First, the 100 hours of healthy speech recordings of the
Librispeech corpus and the corresponding orthographic transcriptions were passed through the Montreal forced aligner [34] to align the speech acoustics to the transcript at the phoneme level. The segmented phonemes were grouped into the four classes of interest: nasal consonants (NC), oral consonants (OC), nasalized vowels (NV), and oral vowels (OV). The NC group was formed by combining across the nasal consonants (/N/, /M/, and /NG/); the OC group was formed by combining across the oral consonants, including plosives (/B/, /D/, /G/, /P/, /T/, /K/), fricatives (/Z/, /ZH/, /V/, /S/, /SH/, /F/, /H/), affricates (/JH/, /CH/), and glides and liquids (/L/, /R/). The NV group was formed by combining thirty percent of the vowel segments that follow and precede a nasal consonant. The OV group was formed by combining across the remaining vowel segments, which were not surrounded by nasal consonants. An example grouping of phonemes in a healthy speech sample is illustrated in Fig. 4. The speech waveform corresponding to the phrase "no one who had ever seen" and its spectrogram are shown in Fig. 4(a) and (b), respectively. The English phonemes in ARPABET-encoded form, along with their time boundaries, are marked on the speech waveform in Fig. 4(a). Based on the velopharyngeal activity, the phonemes are grouped into the NC, OC, NV, and OV categories. In the example shown in Fig. 4, the nasal consonant (/N/) and the vowels (/OW/, /IY/) surrounding it are grouped into the NC and NV classes, respectively. The oral consonants (/W/, /HH/, /D/, /V/, /S/) and the vowels (/UW/, /AE/, /EH/, /ER/) adjacent to them are grouped as the OC and OV classes, respectively.

Fig. 4. An illustration of the phoneme mapping procedure used for DNN training. (a) The speech waveform corresponding to the text "no one who had ever seen" and (b) its spectrogram. Overlaid on the waveform, we show the transcription in English ARPABET format and the mappings to the four classes of interest, i.e., nasal consonant (NC), oral consonant (OC), nasalized vowel (NV), and oral vowel (OV).

The input speech to the DNN was sampled at a 16 kHz sampling rate and short-time processed using a 20 ms Hamming window with 10 ms overlap. From each frame, 13-dimensional MFCCs and their velocity (∆) and acceleration (∆∆) coefficients were computed. This 39-dimensional feature vector was the input to the DNN nasality model; the label for each 20 ms frame corresponded to the category to which that frame belonged. The classifier was trained to classify between the four phoneme categories described above. The error between the predicted and ground-truth labels was computed using a categorical cross-entropy loss function. The ADAM optimizer was used to estimate the optimum parameters of the network. The network was trained for 25 epochs with a learning rate of 0.001. The MFCC features were computed using the Librosa package in Python and the DNN was implemented using the Keras 2.2.4 toolkit with a TensorFlow 1.13.1 backend.
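The frame-level feature extraction described above can be sketched with librosa as follows; parameters not stated in the text (e.g., the FFT size) are assumptions.

```python
# 13 MFCCs + delta + delta-delta per 20 ms Hamming-windowed frame (10 ms shift)
# at 16 kHz, stacked into the 39-dimensional input vectors used by the DNN.
import librosa
import numpy as np

def extract_frame_features(wav_path, sr=16000):
    y, sr = librosa.load(wav_path, sr=sr)                 # resample to 16 kHz
    mfcc = librosa.feature.mfcc(
        y=y, sr=sr, n_mfcc=13,
        n_fft=512,                                        # assumed FFT size
        win_length=int(0.020 * sr),                       # 20 ms window
        hop_length=int(0.010 * sr),                       # 10 ms shift
        window="hamming",
    )
    delta = librosa.feature.delta(mfcc)                   # velocity coefficients
    delta2 = librosa.feature.delta(mfcc, order=2)         # acceleration coefficients
    return np.vstack([mfcc, delta, delta2]).T             # shape: (num_frames, 39)
```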
DNN posteriors as nasality features:
For a given input speech frame, the DNN produces 4 posterior probabilities corresponding to the NC, OC, NV, and OV classes. As described in Fig. 5, increased values of P(NC) and P(NV) indicate the presence of nasalization in consonants and vowels, respectively. Hence, we consider these posteriors as the nasality features and use them to compute an objective measure of hypernasality.

Evaluating the OHM:
Evaluating the OHM requires the pre-trained DNN nasality model described above. The DNN was trained on adult speech; however, we aim to use it to evaluate hypernasality in children's speech. To compensate for the acoustic mismatch between children's and adult speech, we used the pitch modification algorithm proposed in [35]. The pitch modification pre-processing step lowers the pitch and speaking rate of the children's speech per the details in the paper. This same approach was used to improve the performance of speech recognition algorithms trained on adult speech and evaluated on children's speech [35]. The pitch modification algorithm was implemented in MATLAB 2019a.

The pitch-modified speech signal was resampled at 16 kHz and short-time processed using a 20 ms Hamming window with 10 ms overlap. The frame size and frame shift were consistent with the parameters used during DNN nasality model training. As before, a 39-dimensional MFCC feature vector was computed and provided as input to the pre-trained DNN nasality model. The 4 DNN posterior probabilities were computed for each frame of children's speech.

The DNN posteriors obtained for the pre-processed children's speech were used to compute the OHM. Let x_i be the feature vector corresponding to the MFCC input features for the i-th frame; the DNN outputs probabilities corresponding to the NC, OC, NV, and OV classes for that frame, i.e., P(NC|x_i), P(OC|x_i), P(NV|x_i), and P(OV|x_i), respectively. Then the objective hypernasality measure OHM(x_i) for the i-th frame is computed as

OHM(x_i) = max( log[ P(NC|x_i) / P(OC|x_i) ], log[ P(NV|x_i) / P(OV|x_i) ] )    (1)

In Eq. (1), we compute the ratios of the posterior probabilities of nasal to oral consonants and of nasalized to oral vowels. If either ratio is larger than 1, it indicates the presence of a nasalized sound. The frame-level OHM, OHM(x_i), is built by logarithmically transforming each ratio and taking the maximum across the two.

Fig. 5. The four frame-wise DNN posteriors and the corresponding frame-wise OHM for healthy and hypernasal speech: (a) the waveform of a control speech sample ("buy baby a bib"), (b) P(NC) and P(OC), (c) P(NV) and P(OV), and (d) the OHM for the target sentence produced by a child from the control group; (e) the waveform of a CP speech sample ("buy baby a bib"), (f) P(NC) and P(OC), (g) P(NV) and P(OV), and (h) the OHM for the target oral sentence produced by a participant with CP.

For clarity, it is useful to demonstrate the OHM with two examples. The speech waveform, frame-wise DNN nasality posterior features, and the OHM contours from a control sample and a CP sample are plotted in Fig. 5(a)-(h). The target sentence is "buy baby a bib". Since the sentence does not contain any nasal consonants, no nasal cues are expected in the speech from the control group; this is consistent with panels (b) and (c), where P(OC) > P(NC) and P(OV) > P(NV) for the speech from the control group. For the case of hypernasal speech from the CP group in panels (f) and (g), we see that P(NC) > P(OC) and P(NV) > P(OV). Although the target text does not contain any nasal consonants, large P(NC) and P(NV) values indicate the presence of abnormal nasal resonances in the CP speech indicative of hypernasality. As expected, the OHM obtained by combining P(NC), P(OC), P(NV), and P(OV) takes relatively higher values for the hypernasal speech (Fig. 5(h)) than for the healthy speech (Fig. 5(d)).
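A small NumPy sketch of Eq. (1) applied to the per-frame posteriors is given below; the `eps` floor is an assumption added for numerical stability and is not part of the paper's formulation.

```python
# Frame-level OHM: the larger of the two log posterior ratios in Eq. (1).
# `posteriors` is assumed to hold P(NC), P(OC), P(NV), P(OV) per frame,
# in that column order, with shape (num_frames, 4).
import numpy as np

def frame_level_ohm(posteriors, eps=1e-10):
    p_nc, p_oc, p_nv, p_ov = posteriors.T
    consonant_term = np.log((p_nc + eps) / (p_oc + eps))   # nasal vs. oral consonant
    vowel_term = np.log((p_nv + eps) / (p_ov + eps))       # nasalized vs. oral vowel
    return np.maximum(consonant_term, vowel_term)
```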
The frame-level OHM scores, OHM(x_i), were averaged over all the frames of a given utterance to obtain sentence-level OHM scores. Similarly, the sentence-level OHM scores were averaged over all utterances spoken by the same speaker to obtain speaker-level OHM scores.
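The aggregation just described is a pair of simple averages; a brief sketch with illustrative variable names is:

```python
import numpy as np

def sentence_level_ohm(frame_scores):
    """Average the frame-level OHM scores over one utterance."""
    return float(np.mean(frame_scores))

def speaker_level_ohm(utterance_scores):
    """Average the sentence-level OHM scores over all utterances of one speaker."""
    return float(np.mean(utterance_scores))
```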
TABLE I
LIST OF SENTENCES IN THE AMERICLEFT DATABASE, TARGET CONSONANTS, AND THE SENTENCE-LEVEL CORRELATION VALUES TO THE OHM.

Sl. No   Target      Sentence                  r        p-value   Category     r (Category)   p-value (Category)
1        P           Puppy will pull a rope    0.414    < .       Plosives     0.703          < .
2-9      -           -                         -        < .
10       DH          The other feather         0.63     < .       Fricatives   0.793          < .
11       S           Sissy saw sally race      0.691    < .
12       Z           Zoey has roses            0.471    < .
13       SH          She washed a dish         0.724    < .
14       S-cluster   I spy a starry sky        0.540    < .
20       H           Hurry ahead harry         0.663    < .
15       CH          Watch a choo-choo         0.715    < .       Affricates   0.752          < .
16       J           George saw gigi           0.698    < .
17       L           Laura will yell           0.487    < .       Liquids      0.564          < .
18       R           Ray will arrive early     0.507    < .
19       W           We were away              0.640    < .       Glides       0.640          < .
21       M           Mom and amy are home      0.069    0.682     Nasals       0.108          0.518
22       N           Anna knew no one          -0.126   0.452
23       NG          We are hanging on         0.269    0.103
24       N, M, NG    We ran a long mile        0.162    0.331
IV. EXPERIMENTS AND RESULTS
The experiments were conducted using the Americleft database to evaluate the sentence- and speaker-level performance of the OHM, the robustness of the OHM to active errors, the sensitivity of the OHM, and the internal reliability of the OHM. The external validity of the OHM was evaluated using the NMCPC database. Further, a comparison of the OHM with a representative supervised learning method based on DNN regression was also conducted. The details of the experiments and the results are presented in the following subsections.
A. Validation of sentence-level OHM scores
The frame-level results in Fig. 5 are averaged over the entire duration of the utterance to generate a sentence-level OHM score. The correlation between the sentence-level OHM and the speaker-level perceptual ratings (the ground-truth rating obtained from the Americleft-trained SLPs) was evaluated using Pearson's correlation coefficient (r). In Table I, we list the sentences from the Americleft database, grouped by target consonant category; for each sentence, we also list the correlation between the sentence-level OHM and the perceptual hypernasality level. As expected, the oral sentences, i.e., the sentences containing oral consonants (plosives, fricatives, affricates, liquids, and glides), showed a high correlation with the perceptual ratings, whereas the OHM calculated from nasal sentences shows a low correlation. This makes sense as the OHM demonstrates a ceiling effect for nasal sentences, since it is expected that they are nasalized.
B. Validation of speaker-level OHM scores

We compute the speaker-level OHM scores by averaging across all oral sentences produced by a speaker. The average Pearson correlation between the speaker-level OHM and each rater's perceptual ratings, along with the average inter-rater Pearson correlation, are shown in Table II.

For a finer-grained analysis, we compare the OHM with the ground-truth rating obtained by averaging the clinical ratings from the 4 raters. Fig. 6 shows the relationship between the speaker-level OHM and the perceptual ratings for the Americleft database. The OHM shows a significant correlation (r = 0.797, p < .) with the ground-truth perceptual ratings. A scatter plot of sample-level data is shown in Fig. 6.

TABLE II
A COMPARISON OF THE AVERAGE PAIRWISE CORRELATION BETWEEN AMERICLEFT-TRAINED RATERS AND THE AVERAGE CORRELATION BETWEEN THE OHM AND EACH RATER.

                              Mean ± std.
Inter-rater correlation       0.776 ±
OHM vs. each rater            ±

Fig. 6. A scatter plot of speaker-level OHM scores vs. ground-truth perceptual ratings for the Americleft database.
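The speaker-level validation above amounts to a Pearson correlation between per-speaker OHM scores and the rater-averaged ground truth; a brief SciPy sketch, with placeholder array names, is:

```python
# speaker_ohm: one OHM score per speaker; rater_scores: (num_speakers, num_raters)
# array of perceptual ratings. Both are hypothetical placeholders.
import numpy as np
from scipy.stats import pearsonr

def speaker_level_validation(speaker_ohm, rater_scores):
    ground_truth = np.mean(rater_scores, axis=1)   # average across raters
    r, p = pearsonr(speaker_ohm, ground_truth)
    return r, p
```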
C. Robustness to active articulation errors

We analyzed the effect of articulation errors on the estimated OHM scores. Articulation errors in CP cases are broadly categorized into active and passive errors [28], [36]. The percentage (%) of active articulation errors (PAAE) and the percentage (%) of passive articulation errors (PPAE) were computed at the speaker level using the 24 Americleft sentences. A correlation between the Americleft ratings and the PAAE and PPAE was evaluated, as was a correlation between the OHM scores and the PAAE and PPAE.

The bar plot in Fig. 7(a) shows the correlation of the perceptual hypernasality ratings with respect to the PAAE and PPAE. The Americleft ratings showed a moderate correlation (r = 0. , p < .) with respect to the PPAE. Passive errors include nasalized consonants, which carry nasal cues; therefore, it is expected that the presence of passive errors increases the severity of perceived hypernasality. In fact, the perception of nasalized consonants was considered an important criterion in developing the hypernasality rating scale in [28]. Nasal resonances are not evident in active errors, such as glottal stops, pharyngeal stops, and nasal fricatives. Hence, the Americleft ratings showed a low correlation (r = 0. , p = .) with respect to the PAAE.

Similar to the hypernasality ratings, the OHM scores also showed a moderate correlation (r = 0. , p > .) with the PPAE and a low correlation (r = -. , p = 0. ) with respect to the PAAE. However, when compared to the perceptual ratings, the OHM showed a relatively lower correlation for active errors and a higher correlation for passive errors. These results provide additional evidence that the OHM captures the nasal cues present in the speech signal and is robust to co-existing active articulation errors.

Fig. 7. (a) Correlation of OHM scores and perceptual hypernasality scores with respect to active and passive articulation errors. (b) OHM scores for the control group and for children with CP rated as having normal hypernasality in the Americleft database.
D. Evaluating the sensitivity of the OHM
In our analysis, we considered a balanced set of 38 speakers in the Americleft database to evaluate the sentence-level and speaker-level OHM scores relative to the perceptual ratings. Additionally, we analyzed the OHM for the remaining 32 'CP speakers rated as normal' (no hypernasality) and compared them with the control group. Here, 'CP rated as normal' corresponds to speakers whose SLP hypernasality rating was considered normal. Fig. 7(b) shows the range of the OHM for the two groups. The OHM scores of speakers with CP rated '0' were greater than those of the controls. A t-test reveals a statistically significant difference between the two groups (t = -. , p < .). This result may indicate the presence of very mild hypernasality in the CP group that was not detected by the SLPs.
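The group comparison described above is an independent-samples t-test; a minimal SciPy sketch (the two score arrays are hypothetical placeholders) is:

```python
from scipy.stats import ttest_ind

def compare_ohm_groups(ohm_controls, ohm_cp_rated_normal):
    """Two-sided t-test between control and 'CP rated as normal' OHM scores."""
    t, p = ttest_ind(ohm_controls, ohm_cp_rated_normal)
    return t, p
```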
E. Assessing the internal reliability of the OHM

To evaluate the internal reliability of the OHM scores, we grouped the 20 oral sentences from the Americleft database into set-1 and set-2, where set-1 contains the first 10 sentences and the remaining 10 form set-2. We computed speaker-level OHM scores for the set-1 and set-2 data. Fig. 8 shows the scatter plot of OHM scores for set-1 vs. set-2. The OHM scores of set-1 are significantly correlated (r = 0. , p < .) with those of set-2.

Fig. 8. Analysis of internal reliability. A scatter plot of speaker-level OHM scores computed for two independent sets of sentences.

F. Assessing the external validity of the OHM
We also evaluated the performance of the OHM on the NMCPC database in order to evaluate how well the OHM generalizes to data collected in other studies and evaluated by other raters. The hypernasality level of the NMCPC speech samples was evaluated by 5 raters. The correlation of the OHM with respect to each rater and the inter-rater correlation are shown in Table III. The average correlation of the OHM vs. individual raters was found equal to r = 0. , p < . , whereas the inter-rater correlation was r = 0.872, p < . .
The scatter plot of the speaker-level OHM vs. the average of the 5 clinical ratings is shown in Fig. 9. The OHM showed a significant correlation (r = 0.713, p < .) with the average hypernasality ratings provided by the raters.

TABLE III
A COMPARISON OF THE AVERAGE PAIRWISE CORRELATION BETWEEN NMCPC RATERS AND THE AVERAGE CORRELATION BETWEEN THE OHM AND EACH RATER.

                              Mean ± std.
Inter-rater correlation       0.872 ±
OHM vs. each rater            ±

Fig. 9. A scatter plot of speaker-level OHM scores vs. ground-truth perceptual ratings for the NMCPC database.
G. Comparison with a fully supervised approach
Conventionally, automatic evaluation of hypernasality involves the supervised training of models such as SVMs [18], [19], artificial neural networks [21], and recurrent neural networks [20] for the binary classification of healthy and hypernasal speech samples. In all of these existing approaches, the supervised training of models was carried out using a perceptually labeled CP speech database. In the proposed approach, the OHM was computed using the nasality DNN, whose parameters were estimated using only healthy speech samples. The existing deep learning-based implementations [20], [21] were aimed at binary classification from segmented vowels. We found only one DNN-based work in the literature [31] that predicts hypernasality severity from connected speech samples. In that work [31], the DNN was trained directly on a labeled CP speech database. To compare the OHM with a conventional supervised approach, we implemented a DNN regressor, which was directly trained on the MFCC features extracted from the speech samples and the perceptual ratings of the Americleft database.

We used only oral sentences to train and test the DNN regressor using leave-one-speaker-out (LOSO) cross-validation. The sample size of the Americleft database (20 oral sentences x 38 speakers = 760) is very small for training a DNN. To address this issue, we used data augmentation via (a) addition of noise: white, babble, and factory noise at 5, 10, 15, and 20 dB SNR; (b) speaking rate modification using the factors 0.8, 0.9, 1.1, and 1.2; and (c) vocal tract length perturbation (VTLP) using the perturbation factors 0.9, 0.95, 1.05, and 1.1 [37]. After the data augmentation, the sample size of the database increased from 760 to 9120 sentence-level recordings. The 39-dimensional MFCC features, extracted using 20 ms Hamming-windowed speech frames with a shift of 10 ms, were fed to a feed-forward DNN regressor. The architecture details of the DNN regressor are as follows: 39 input nodes, three hidden layers with 512 neurons each with ReLU activation, and 1 output node with a linear activation function. The error between the predicted outputs and the ground-truth labels was estimated using the mean squared error (MSE) loss function. The ADAM optimizer was used to estimate the optimum parameters of the network. The network was trained for 25 epochs with a learning rate of 0.001.

The MFCCs were computed at the frame level, but the ground truth was available at the speaker level. During training, we assigned the speaker-level hypernasality ratings to every frame-level feature vector belonging to that particular speaker. In the testing phase, speaker-wise averaging of the DNN outputs was carried out to get a single score per speaker. The performance of the DNN regressor was evaluated using leave-one-speaker-out (LOSO) cross-validation. In LOSO cross-validation, the samples of all speakers except one were used to train the DNN and the held-out speaker's samples were used for testing. Note that the augmented samples were used only during the training phase; during testing we used only the original samples.
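The LOSO protocol can be sketched as follows; `build_regressor`, `frames`, and `ratings` are hypothetical placeholders, and details such as the batch size are assumptions rather than settings reported in the paper.

```python
# Leave-one-speaker-out evaluation of the DNN regressor: train on the frames of
# all other speakers (speaker-level rating copied to each frame) and average the
# frame-level predictions of the held-out speaker into a single score.
import numpy as np

def loso_predictions(frames, ratings, build_regressor):
    """frames: dict speaker -> (num_frames, 39); ratings: dict speaker -> float."""
    predictions = {}
    for held_out in frames:
        train_x = np.vstack([frames[s] for s in frames if s != held_out])
        train_y = np.concatenate(
            [np.full(len(frames[s]), ratings[s]) for s in frames if s != held_out]
        )
        model = build_regressor()                       # fresh model per fold
        model.fit(train_x, train_y, epochs=25, batch_size=256, verbose=0)
        preds = model.predict(frames[held_out])         # frame-level outputs
        predictions[held_out] = float(np.mean(preds))   # one score per speaker
    return predictions
```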
The correlation coefficient computed between the scores predicted by the DNN regressor and the perceptual ratings is shown in Table IV. The correlation was found to be statistically significant (r = 0. , p < .), but it is well below that of the OHM. The DNN regressor trained on the Americleft samples was also used to evaluate the hypernasality in the NMCPC samples, and the results are presented in Table IV. The predicted scores for the NMCPC database showed a weak correlation (r = 0. , p = 0. ) with the perceptual ratings. These results indicate an overfitting effect. The OHM scores resulted in a strong correlation (p < .) for both databases, and these results empirically show that the OHM was robust to a variety of recording conditions, sentence contexts, and gold-standard perceptual ratings.

TABLE IV
COMPARISON BETWEEN THE OHM AND THE DNN REGRESSOR.

Database      OHM                      DNN Regressor
Americleft    r = 0.797, p < .         r = 0. , p < .
NMCPC         r = 0.713, p < .         r = 0. , p = 0.

V. DISCUSSION
In this work, we introduced an objective measure of hypernasality based on a DNN nasality model trained on healthy speakers with no clinical labels. First, we modeled the acoustic cues related to an open velum (nasalization) from healthy speech by training a DNN classifier to classify among the NC, OC, NV, and OV classes. This DNN, pre-trained on healthy speech samples, was used to characterize the presence of abnormal nasal resonances in the speech of children with CP. The OHM was computed at the level of a speech frame and aggregated into sentence-level and speaker-level scores. The OHM is found to have several advantages over the existing hypernasality evaluation methods, as the approach does not require a labeled clinical database or orthographic transcriptions.
Comparison with existing hypernasality evaluation approaches:
Most of the existing approaches for automatic hypernasality evaluation in children with CP were aimed at the classification of normal and hypernasal speech [18], [21] or at multi-class classification (normal, mild, moderate, and severe) [23], [30]. Since these supervised machine learning models were directly trained on CP speech samples and the perceptual hypernasality ratings, their performance critically depends on the availability of a labeled hypernasal speech database. To analyze the risk of overfitting in the supervised training of the existing approaches and the advantages of the OHM, we implemented and evaluated a hypernasality estimation method based on DNN regression. The regressor was trained directly on the speech samples and ratings of the Americleft database. Even though we increased the database size via data augmentation, the DNN regressor's performance was well below that of the OHM. The availability of a perceptually labeled database is a practical limitation in clinical speech research due to the limited number of subjects and trained SLPs. The OHM bypasses the need for clinical labels, as the scores are estimated from a DNN trained on a large, publicly available healthy speech corpus. We only use the clinical data to evaluate the OHM against the perceptual ratings.

The DNN regressor trained on the Americleft database was also tested on the NMCPC database, with poor results. Since the model was trained on the Americleft samples and ratings, it overfit to the training examples and showed a lack of generalization ability on another database. In contrast, the results in Table IV reveal that the OHM is robust to differences in speech samples and rating scales. This is evidenced by the fact that the perceived hypernasality of the Americleft samples was rated on a 5-point scale, whereas the NMCPC data were rated on a continuous scale ranging from 0 to 3. Since the training of the DNN does not use any perceptual ratings, the OHM itself is not sensitive to the rating scales. In fact, our results empirically show that the OHM was robust to a variety of recording conditions, sentence contexts, and gold-standard perceptual ratings.

Another important advantage of the OHM is that the approach does not require phonetic segmentation. Most of the automated hypernasality evaluation methods rely on phonetic segmentation, where the segmentation was performed either by manual labeling or by forced alignment using orthographic transcriptions. The OHM is computed directly on connected speech samples, and this makes the evaluation process simpler and faster, as manually labeling and transcribing speech is a time-consuming process.
The role of the speech stimuli for estimating hypernasality:
The choice of stimuli or target sentence plays an important role in the assessment of hypernasality. In the case of healthy speakers, nasal resonances are inherently present during the production of nasal sentences and completely absent in the case of oral sentences. The presence of nasal resonances during the production of oral sentences is considered to be abnormal, which indicates the presence of hypernasality [28]. The sentence-wise correlation values listed in Table I reveal that the proposed measure yields a good correlation for the oral sentences and a poor correlation for nasal sentences. For both CP subjects and controls, the acoustic energy passes through the nasal tract while producing nasal sentences; hence, it is difficult to evaluate hypernasality using nasal sentences. These results indicate that, to reliably compute the speaker-level OHM, the target speech samples should only contain oral consonants. These results closely match the perceptual assessment guidelines in [28], where only the sentences with oral consonants were suggested for assessing hypernasality.

High-pressure and low-pressure consonants play an important role in the clinical evaluation of hypernasality [22], [28]. Pressure-sensitive or high-pressure consonants (plosives: /P/, /T/, /K/, /B/, /D/, /G/; fricatives: /S/, /F/, /SH/, /Z/, /V/, /TH/, /DH/; and affricates: /CH/, /JH/) require adequate intraoral pressure, which is developed by the closure of the oral and nasal tracts, whereas low-pressure consonants (glides: /Y/, /W/; liquids: /L/, /R/) do not require high intra-oral pressure. Since the loss of airflow in patients with CP and VPD affects the ability to build up and maintain intra-oral air pressure, the production of high-pressure consonants is severely affected. The substitution of nasal consonants for target high-pressure consonants is most commonly reported in speakers with CP because of the escape of air through the incompletely closed VP port. Therefore, in clinical settings, target speech containing high-pressure oral consonants is highly recommended for the perceptual assessment of hypernasality [22], [28]. Our results are consistent with this, as the OHM shows a higher correlation for sentences containing high-pressure consonants (plosives, fricatives, and affricates) than for those with low-pressure consonants.

Active and passive articulation errors in the assessment of hypernasality:
The articulation errors in speakers with CP are broadly divided into active and passive articulation errors. In the case of passive articulation errors, the speaker tries to retain the place of articulation of the target phoneme but the air escapes through the velopharyngeal port. Hence, the consonant is perceived to be weak or nasalized. In the case of severe VPD, the target consonant is completely replaced by a nasal consonant (/T/, /D/ -> /N/; /P/, /B/ -> /M/) [28]. In the case of active errors (also known as compensatory errors), a speaker attempts to compensate for the effect of nasalization by shifting the location of the articulatory constriction. The shift in the place of articulation may be within the oral or non-oral cavity. In many cases, the presence of VPD leads to a shift towards the glottal and pharyngeal regions [28]. Although active and passive articulation errors are produced as a consequence of VPD, their contributions to the perception of hypernasality are different. The passive errors contain nasal cues and, hence, their presence contributes to the perception of hypernasality. In contrast, the active articulation errors, such as glottal and/or pharyngeal substitutions, do not carry nasal cues; therefore they do not contribute to the perception of hypernasality. However, the presence of active errors can create variability in perceptual evaluation, as it is difficult to perceptually decouple active articulation errors from the presence of hypernasality. As shown in Fig. 7(a), the OHM scores do not have this bias, as they show no correlation with the active errors. This result provides evidence that the OHM scores capture only the nasal cues present in the speech signal and are robust to co-existing active errors in CP speech.

Differences in OHM performance on the Americleft and NMCPC databases:
For the Americleft database, the inter-rater reliabilities are on average a little higher than the OHM-to-rater reliabilities, but they are within 1 standard deviation of each other. For the NMCPC database, the inter-rater reliabilities are significantly greater than the OHM-to-rater reliabilities. Also, the OHM scores showed a relatively higher correlation with the averaged perceptual ratings for the Americleft database (r = 0.797) when compared to the NMCPC database (r = 0.713). The reasons for this difference in the OHM's performance are multifold. The NMCPC database is not balanced in terms of the number of sentences per speaker. This imbalanced nature of the NMCPC database affects the estimation of the speaker-level OHM, where the speaker-level OHM was computed by averaging the sentence-level scores. The average signal-to-noise ratio (SNR) of the Americleft samples was found equal to 23.88 ± .

VI. SUMMARY AND FUTURE WORK
In this work, we introduced an objective measure of hypernasality based on a DNN nasality model trained on healthy speakers with no clinical labels. First, we modeled the acoustic cues related to an open velum (nasalization) from healthy speech by training a DNN classifier to classify among the NC, OC, NV, and OV classes. This DNN, pre-trained on healthy speech samples, was used to characterize the presence of abnormal nasal resonances in the speech of children with CP. The OHM was computed at the level of a speech frame and aggregated into sentence-level and speaker-level scores.

In the current implementation of the OHM, the model is completely tuned and implemented using only the healthy speech database. Future work can focus on further refinement of the implementation by using a corpus of cleft speech to tune model parameters, or perhaps on changing it to a supervised model by using linear regression across different sentences to produce speaker-level hypernasality scores. The present work uses a pitch modification algorithm to compensate for the acoustic mismatch between adult and children's speech. However, better speaker adaptation methods, such as identity vectors (i-vectors) and transfer learning approaches, can be explored to improve the system's performance. Another limitation of the current work is that the algorithm assumes the presence of information related to nasality over the entire duration of the utterance and simply averages the frame-level OHM scores over an entire utterance. However, depending on the severity level, the hypernasality information may be distributed unevenly over different phonemes. Therefore, instead of a simple averaging operation, recurrent neural networks and attention models can be used to capture unevenly distributed nasality information.

Furthermore, hypernasality is not specific only to CP speech. It can also be present in speech from individuals with neurological disorders, such as Huntington's and Parkinson's disease. Therefore, future work can focus on extending the validation of this model for evaluating hypernasality in dysarthric speech.
VII. ACKNOWLEDGEMENTS
The authors would like to thank Dr. Luis Cuadros, New Mexico Cleft Palate Center, and Dr. Anil Kumar Vuppala, IIIT Hyderabad, India, for providing the NMCPC database. The authors would also like to thank Dr. Nagaraj Adiga, a co-author of the work [35], for sharing the pitch modification program.

REFERENCES

[1] P. A. Mossey, E. E. Catilla et al., "Global registry and database on craniofacial anomalies: Report of a WHO registry meeting on craniofacial anomalies," 2003.
[2] A. W. Kummer, Cleft palate & craniofacial anomalies: Effects on speech and resonance. Nelson Education, 2013.
[3] M. A. Hardin-Jones and D. L. Jones, "Speech production of preschoolers with cleft palate," The Cleft Palate-Craniofacial Journal, vol. 42, no. 1, pp. 7–13, 2005.
[4] K. L. Chapman, "The relationship between early reading skills and speech and language performance in young children with cleft lip and palate," The Cleft Palate-Craniofacial Journal, vol. 48, no. 3, pp. 301–311, 2011.
[5] J. C. Bell, C. Raynes-Greenow, R. Turner, C. Bower, A. Dodson, W. Nicholls, and N. Nassar, "School performance for children with cleft lip and palate: a population-based study," Child: Care, Health and Development, vol. 43, no. 2, pp. 222–231, 2017.
[6] "Parameters for evaluation and treatment of patients with cleft lip/palate or other craniofacial differences," The Cleft Palate-Craniofacial Journal, vol. 55, no. 1, pp. 137–156, 2018. [Online]. Available: https://doi.org/10.1177/1055665617739564
[7] K. Bettens, F. L. Wuyts, and K. M. Van Lierde, "Instrumental assessment of velopharyngeal function and resonance: A review," Journal of Communication Disorders, vol. 52, pp. 170–183, 2014.
[8] A. Baylis, K. Chapman, and T. L. Whitehill, "Validity and reliability of visual analog scaling for assessment of hypernasality and audible nasal emission in children with repaired cleft palate," The Cleft Palate-Craniofacial Journal, vol. 52, no. 6, pp. 660–670, 2015.
[9] R. P. Yamashita, E. Borg, S. Granqvist, and A. Lohmander, "Reliability of hypernasality rating: comparison of 3 different methods for perceptual assessment," The Cleft Palate-Craniofacial Journal, vol. 55, no. 8, pp. 1060–1071, 2018.
[10] T. Sweeney and D. Sell, "Relationship between perceptual ratings of nasality and nasometry in children/adolescents with cleft palate and/or velopharyngeal dysfunction," International Journal of Language & Communication Disorders, vol. 43, no. 3, pp. 265–282, 2008.
[11] K. L. Chapman, A. Baylis, J. Trost-Cardamone, K. N. Cordero, A. Dixon, C. Dobbelsteyn, A. Thurmes, K. Wilson, A. Harding-Bell, T. Sweeney et al., "The Americleft speech project: a training and reliability study," The Cleft Palate-Craniofacial Journal, vol. 53, no. 1, pp. 93–108, 2016.
[12] R. Kataoka, K.-I. Michi, K. Okabe, T. Miura, and H. Yoshida, "Spectral properties and quantitative evaluation of hypernasality in vowels," The Cleft Palate-Craniofacial Journal, vol. 33, no. 1, pp. 43–50, 1996.
[13] C. Vikram, N. Adiga, and S. M. Prasanna, "Spectral enhancement of cleft lip and palate speech," 2016.
[14] P. Vijayalakshmi, M. R. Reddy, and D. O'Shaughnessy, "Acoustic analysis and detection of hypernasality using a group delay function," IEEE Transactions on Biomedical Engineering, vol. 54, no. 4, pp. 621–629, 2007.
[15] G.-S. Lee, C.-P. Wang, and S. Fu, "Evaluation of hypernasality in vowels using voice low tone to high tone ratio," The Cleft Palate-Craniofacial Journal, vol. 46, no. 1, pp. 47–52, 2009.
[16] G.-S. Lee, C.-P. Wang, C. C. Yang, and T. B. Kuo, "Voice low tone to high tone ratio: a potential quantitative index for vowel [a:] and its nasalization," IEEE Transactions on Biomedical Engineering, vol. 53, no. 7, pp. 1437–1439, 2006.
[17] K. Nikitha, S. Kalita, C. Vikram, M. Pushpavathi, and S. M. Prasanna, "Hypernasality severity analysis in cleft lip and palate speech using vowel space area," in INTERSPEECH, 2017, pp. 1829–1833.
[18] M. Golabbakhsh, F. Abnavi, M. Kadkhodaei Elyaderani, F. Derakhshandeh, F. Khanlar, P. Rong, and D. P. Kuehn, "Automatic identification of hypernasality in normal and cleft lip and palate patients with acoustic analysis of speech," The Journal of the Acoustical Society of America, vol. 141, no. 2, pp. 929–935, 2017.
[19] J. R. Orozco-Arroyave, J. Vargas-Bonilla, J. D. Arias-Londoño, S. Murillo-Rendón, G. Castellanos-Domínguez, and J. Garcés, "Nonlinear dynamics for hypernasality detection in Spanish vowels and words," Cognitive Computation, vol. 5, no. 4, pp. 448–457, 2013.
[20] X. Wang, S. Yang, M. Tang, H. Yin, H. Huang, and L. He, "HypernasalityNet: Deep recurrent neural network for automatic hypernasality detection," International Journal of Medical Informatics, vol. 129, pp. 1–12, 2019.
[21] X. Wang, M. Tang, S. Yang, H. Yin, H. Huang, and L. He, "Automatic hypernasality detection in cleft palate speech using CNN," Circuits, Systems, and Signal Processing, pp. 1–27, 2019.
[22] A. W. Kummer and L. Lee, "Evaluation and treatment of resonance disorders," Language, Speech, and Hearing Services in Schools, vol. 27, no. 3, pp. 271–281, 1996.
[23] A. K. Dubey, S. M. Prasanna, and S. Dandapat, "Detection and assessment of hypernasality in repaired cleft palate speech using vocal tract and residual features," The Journal of the Acoustical Society of America, vol. 146, no. 6, pp. 4211–4223, 2019.
[24] L. He, J. Zhang, Q. Liu, H. Yin, and M. Lech, "Automatic evaluation of hypernasality and consonant misarticulation in cleft palate speech," IEEE Signal Processing Letters, vol. 21, no. 10, pp. 1298–1301, 2014.
[25] L. He, J. Zhang, Q. Liu, H. Yin, M. Lech, and Y. Huang, "Automatic evaluation of hypernasality based on a cleft palate speech database," Journal of Medical Systems, vol. 39, no. 5, p. 61, 2015.
[26] V. C. Mathad, K. Chapman, J. Liss, N. Scherer, and V. Berisha, "Deep learning based prediction of hypernasality for clinical applications," in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 6554–6558.
[27] M. Saxon, J. Liss, and V. Berisha, "Objective measures of plosive nasalization in hypernasal speech," in ICASSP, 2019, Brighton, United Kingdom, pp. 6520–6524.
[28] G. Henningsson, D. P. Kuehn, D. Sell, T. Sweeney, J. E. Trost-Cardamone, and T. L. Whitehill, "Universal parameters for reporting speech outcomes in individuals with cleft palate," The Cleft Palate-Craniofacial Journal, vol. 45, no. 1, pp. 1–17, 2008.
[29] T. Knowles, M. Clayards, and M. Sonderegger, "Examining factors influencing the viability of automatic acoustic analysis of child speech," Journal of Speech, Language, and Hearing Research, vol. 61, no. 10, pp. 2487–2501, 2018.
[30] M. H. Javid, K. Gurugubelli, and A. K. Vuppala, "Single frequency filter bank based long-term average spectra for hypernasality detection and assessment in cleft lip and palate speech," in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 6754–6758.
[31] C. M. Vikram, A. Tripathi, S. Kalita, and S. R. M. Prasanna, "Estimation of hypernasality scores from cleft lip and palate speech," in Interspeech, 2018, Hyderabad, India, pp. 1701–1705.
[32] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "Librispeech: an ASR corpus based on public domain audio books," in ICASSP, 2015, pp. 5206–5210.
[33] M. Grimm, K. Kroschel, E. Mower, and S. Narayanan, "Primitives-based evaluation and estimation of emotions in speech," Speech Communication, vol. 49, no. 10-11, pp. 787–800, 2007.
[34] M. McAuliffe, M. Socolof, S. Mihuc, M. Wagner, and M. Sonderegger, "Montreal Forced Aligner: Trainable text-speech alignment using Kaldi," in Interspeech, 2017, pp. 498–502.
[35] S. Shahnawazuddin, N. Adiga, and H. K. Kathania, "Effect of prosody modification on children's ASR," IEEE Signal Processing Letters, vol. 24, no. 11, pp. 1749–1753, 2017.
[36] A. Harding and P. Grunwell, "Active versus passive cleft-type speech characteristics," International Journal of Language & Communication Disorders, vol. 33, no. 3, pp. 329–352, 1998.
[37] T.-H. Lo, F.-A. Chao, S.-Y. Weng, and B. Chen, "The NTNU system at the Interspeech 2020 non-native children's speech ASR challenge," arXiv preprint arXiv:2005.08433, 2020.