I-vector Based Within Speaker Voice Quality Identification on connected speech

Chuyao Feng, Eva van Leer, Mackenzie Lee Curtis, and David V. Anderson

School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA 30332, USA
Department of Communication Sciences and Disorders, Georgia State University, Atlanta, GA 30303, USA

(Dated: 16 February 2021)
Voice disorders affect a large portion of the population, especially heavy voice users such as teachers or call-center workers. Most voice disorders can be treated effectively with behavioral voice therapy, which teaches patients to replace problematic, habituated voice production mechanics with optimal voice production techniques, yielding improved voice quality. However, treatment often fails because patients have difficulty differentiating their habitual voice from the target technique independently, when clinician feedback is unavailable between therapy sessions. Therefore, with the long-term aim of extending clinician feedback to extra-clinical settings, we built two systems that automatically differentiate various voice qualities produced by the same individual. We hypothesized that 1) a system based on i-vectors could classify these qualities as if they represent different speakers and 2) such a system would outperform one based on traditional voice signal processing algorithms. Training recordings were provided by thirteen amateur actors, each producing 5 perceptually different voice qualities in connected speech: normal, breathy, fry, twang, and hyponasal. As hypothesized, the i-vector system outperformed the acoustic measure system in classification accuracy (97.5% compared to 77.2%, respectively). These findings are expected because the i-vector system maps features to an integrated space that represents each voice quality better than the 22-feature space of the baseline system. Therefore, an i-vector based system has potential for clinical application in voice therapy and voice training.

I. INTRODUCTION
Voice disorders are changes in voice quality, pitch, loudness, effort, and endurance (Ramig and Verdolini, 1998). They affect approximately 7.5 million Americans at a time and can occur throughout the lifespan, with multiple etiologies and a prevalence of up to 47% in teachers (Cohen, 2010; Cohen et al., 2012; Thibeault et al., 2004). Most voice disorders can be treated effectively with behavioral voice therapy provided by speech-language pathologists (SLPs) (Schwartz et al., 2009; Ramig and Verdolini, 1998). However, as with other behavioral interventions (Glanz et al., 2008) (e.g., diet, exercise, smoking cessation), patient adherence to voice therapy is often poor (Smith et al., 2010; van Leer and Connor, 2012), and drop-out rates exceed 50% (Hapner et al., 2009; Portone et al., 2008). This problem largely stems from patients' difficulty in altering their voice use outside of therapy sessions, when they do not have access to clinician feedback.

In weekly voice therapy sessions, the speech-language pathologist teaches the patient to replace their habitual, sub-optimal voice production mechanics with an improved voice production technique such as resonant, breathy, twang, or loud voice (Ramig and Verdolini, 1998). This target technique is individualized to the patient and associated with voice quality improvement. For example, patients with a weak, excessively breathy voice due to aging or unilateral vocal fold paresis may be taught to apply techniques of increased vocal effort and twang resonance to reduce breathiness (Lombard and Steinhauer, 2007; Ziegler et al., 2014). Conversely, patients with a strained voice quality due to adducted hyperfunction (i.e., excessively effortful voice production) might reduce strain by increasing breathiness (Ziegler et al., 2014). Thus, a particular voice quality such as breathiness may be problematic in one patient and a therapeutic target in another. Furthermore, use of the target technique may alter untargeted vocal parameters such as loudness, fundamental frequency, and speaking rate (Bonilha and Dawson, 2012; Dejonckere and Lebacq, 2001). Thus, the so-called target voice varies from the patient's habitual voice along multiple parameters or features of voice and speech.

While patients may produce the target technique successfully in weekly treatment sessions, they commonly struggle to replicate it between sessions, without the therapist's feedback (van Leer and Connor, 2010). Specifically, patients report difficulty differentiating their habitual voice from the therapeutic target in dedicated technique
practice as well as in daily conversation. Unlike the bull's-eye in a game of darts, a vocal target is not visually apparent, making it hard to judge one's progress toward it (van Leer and Connor, 2010). Consequently, patients become demoralized, practice infrequently, and fail to replace their habitual voice with the target voice. Even those who succeed in therapy may fail to maintain the voice change in the long term, relapsing to a pre-therapy vocal status.

Since patients have difficulty producing the target voice without clinician feedback, they may benefit from automated supplemental feedback systems. At a minimum, such systems could differentiate the patient's habitual voice quality from their therapeutic target in running speech. Such feedback would extend the clinician's judgement from the therapy session to the extra-clinical environment. Since each patient's habitual and target voice qualities are unique, the system would aim for an individualized comparison between the habitual and target production for each patient, rather than a comparison to standardized norms. Furthermore, the system should remain robust in noisy speaking environments. No such system presently exists.

Extensive research into objective acoustic measures of voice has been conducted in an effort to replace subjective expert judgement of voice quality with objective numbers. Several measures of the voice signal have been associated with expert judgement of breathiness, overall clarity of voice, and, to a lesser degree, roughness and strain. In order to replace the expert clinician's judgement of voice quality, several systems have been built to classify voice samples. Approaches have included using acoustic features with machine learning methods on sustained vowels. The majority of such studies have aimed to automatically detect features associated with emotion, e.g., acoustic signs of depression or anger (Scherer et al., 2000). Another group of studies has sought to predict the pathology label (i.e., diagnosis) associated with a database of voice samples, without examining voice quality. A handful of studies have aimed to classify voice qualities across speakers. Mehta and colleagues developed a system that differentiates modal (i.e., normal) voice quality from breathy, rough, and strained production on sustained vowels (Mehta et al., 2012). In that study, 28 actors provided samples of modal, breathy, strained, and rough quality. The actors' variability in producing the targets limited classification accuracy, since multiple recordings did not fully or consistently reflect the target voice quality. Wang and colleagues automatically classified breathy and rough voice from normal voice in sustained phonation recordings of disordered voices, perceptually labeled with the GRB parameters of the GRBAS scale, with best accuracy around 80% (Wang et al., 2016). Similarly, Arias-Londoño et al. (2019) used a deep learning framework with the GRB scale and MFCC features, which improved accuracy by 18.1%. In sum, past approaches have focused largely on automatically detecting specific voice qualities in sustained phonation but not in connected speech.

Objective acoustic measures of the voice signal have been developed to supplement or replace human perceptual judgment for research and outcome purposes, but few have been examined as extra-clinical patient feedback.
Mobile feedback regarding vocal loudness and pitch can be provided through a variety of existing technologies and apps. However, objective electronic measurement of voice quality (i.e., timbre) has proven more challenging, in turn making provision of voice quality feedback more difficult. Cepstral peak prominence (CPP), which quantifies aspects of signal periodicity and harmonic energy, is significantly and inversely correlated with perceived dysphonia, such that it may be used as a measure of overall voice quality. CPP has demonstrated utility as feedback for patients practicing a resonant voice target. However, results of CPP and other voice measures require interpretation to determine whether the patient is speaking in his or her problematic, habitual voice quality or the prescribed target voice quality (i.e., good voice). Furthermore, depending on the patient and the target, additional parameters may be needed to consistently differentiate a patient's habitual and target voice, yielding more measures for the patient to interpret. Most recently, algorithms comprised of multiple weighted acoustic measures have been developed to best correlate with human perception of overall vocal clarity or overall degree of dysphonia (Van Leer et al., 2017), but these cannot be applied to connected speech and have not been tested for utility in providing voice quality feedback to patients. A more comprehensive feedback system is under development to provide information regarding relevant aspects of voice production mechanics, but it requires the patient to wear an accelerometer taped to his or her neck (Ghassemi et al., 2014). In sum, systems that capture vocal clarity are under development for patient use.

There are two limitations to the current approaches. First, in therapy, qualities that constitute the problematic habitual voice for one patient may represent the target voice for another, depending on their particular disorders and individual differences. Thus, voice quality goals should be set in reference to within-speaker parameters rather than in reference to between-speaker norms. Second, clinical judgment of voice quality may involve multiple parameters that require knowledge from the treating clinician. Therefore, there is a need for a system that does not replace the clinician's ability to distinguish a bad (i.e., habitual) voice from a good (i.e., target) voice, but rather extends the clinical judgment outside of the treatment room. Lastly, previous approaches have classified sustained phonation samples that do not represent the patient's target voice use in connected speech.

With the larger aim of developing an automated feedback system for patient use, the objective of the present study was to build and validate a system that can differentiate various within-speaker voice qualities in connected speech. This system would yield a proof of concept that within-speaker voice quality targets can be detected and differentiated automatically in running speech. In the present study, thirteen actors each produced 5 different voice qualities to yield a data set for developing and testing the system: normal (clear), breathy, fry, twang, and hyponasal (i.e., the voice quality associated with having a head cold). Achieving automatic differentiation of these quality extremes represents a first step in the development of a system that can differentiate good versus bad voice qualities within the running speech of voice patients.
If these grossly differing voices can be differentiated automatically, future work is indicated for detection of more subtle differences.

The aim will be explored through two different methodological approaches. Our first approach incorporates common acoustic measures of vocal parameters as ingredients for voice quality differentiation, including pitch strength (PS), CPPS, harmonic-to-noise ratio (HNR), and other fundamental-frequency-related statistics. Since these measures are commonly used to detect vocal parameters, it is reasonable to expect that their combination may successfully differentiate within-speaker voices. This system serves as a baseline against which our second, i-vector based approach is compared. Our second method uses i-vectors, which work well in speaker recognition; we hypothesize that i-vectors can be applied to identify different intra-speaker voice qualities. In past years, i-vector methods have demonstrated state-of-the-art performance in text-independent speaker verification systems (Dehak et al., 2011). These were traditionally developed to detect differences between speakers (i.e., speaker identification and verification) but did not detect intra-speaker voice quality differences. Common voice qualities that may constitute either target or habitual voice qualities include vocal fry, breathiness, twang, hyponasal resonance, and an overall clear (normal) voice quality. Depending on the patient, some of these qualities may represent bad or good technique. The system effectively examines a speaker's different vocal qualities as if they represent different speakers (i.e., your good-voice self vs. your bad-voice self). The approach also differs conceptually from current patient-centered acoustic analyses that require the patient to wear an accelerometer taped to their neck (Mehta et al., 2012) or that seek to entirely replace the therapist (Maryn et al., 2010) rather than extend the therapist's judgment (i.e., "now you are in your good voice!") to the patient's environment (Ghassemi et al., 2014). Once an SLP is involved, those voice quality metrics do not provide additional benefit. In contrast, the presently presented system allows the therapist to use their judgment to set targets and provides feedback to the patient on whether they are speaking in the target voice. The purpose of this paper is to develop, test, and compare two methods for automatic voice quality differentiation within speakers.

This paper is organized as follows: Section II describes the data collection process. Section III describes the pipeline of the first system, based on common acoustic measures. Section IV describes the general pipeline of the second system, based on i-vectors. The experimental setup and results with clean and noisy data are described in Sections V and VI, followed by the discussion in Section VII and the conclusion in Section VIII.

II. DATA COLLECTION
Thirteen adults (10 women and 3 men) ages 18-49 (mean=26.9, SD=9.7) were enrolled in the study to provide data for algorithm development. Participants were screened for normal voice quality by the third author, a licensed speech-language pathologist, and a graduate research assistant. Additionally, participants summarized their voice as "normal" (i.e., absence of impairment) on the Voice Handicap Index (VHI) (Jacobson et al., 1997). The average VHI score was 12.4 (SD=16.38) on this self-report measure; this was slightly higher than normal due to personality factors that elevated the score for 2 participants. To provide voice data for algorithm development, participants were instructed to produce four different vocal qualities in addition to their habitual voice in spontaneous speech: breathy, fry, twang, and hyponasal. The qualities of strain and roughness were excluded from this initial study because actors complained of excessive laryngeal effort when attempting them. Resonant voice was omitted because the learning curve for this target technique was too long in duration. Breathiness is the presence of audible air escape in the voice (Kempster et al., 2009), common to voices such as Marilyn Monroe's. Twang is the use of a loud, bright (i.e., "brassy") voice quality common in country-western singing (Lombard and Steinhauer, 2007) and to the character Fran Drescher in the comedy series The Nanny. Fry is phonation in a low-frequency register below modal voice (Hollien et al., 1966), common in voice disorders, more prevalent in young women than men (Hornibrook et al., 2018), and associated with the "Valley Girl" sociolect. A hyponasal quality is a denasal use of resonance that sounds like one is speaking with a stuffed-up nose (Behrman et al., 2002). Actors were taught to use each voice quality via modeling, direct instruction, and cueing by the second author. Only the quality "hyponasal" was difficult for a subset of speakers, who were allowed to imitate the cartoon character Barney Rubble instead. The resulting voice quality label was verified by a research assistant trained to detect each quality in a forced-choice task. Once these were elicited successfully, uncompressed audio was recorded at a sampling rate of 44.1 kHz in a quiet room using the Roland RO-5 wave/mp3 recorder. Participants were asked to read 26 standard sentences aloud and subsequently produce 4 minutes of extemporaneous speech in their normal voice quality, followed by the same tasks in the 4 targeted voice qualities. Standard sentences included the tasks of the Consensus Auditory Perceptual Evaluation of Voice (CAPE-V) (Kempster et al., 2009), and ten sentences each from the TIMIT (Zue et al., 1990) and Harvard (Rothauser, 1969) sentence inventories. FIG. 1 shows spectrograms of the five different voice qualities of the same speaker. Notice that for FIG. 1(a) and FIG. 1(b), the harmonics at higher frequencies appear blurred in the spectrogram. However, for FIG. 1(d) and FIG. 1(e), there are strong harmonics, and voice quality differences are not immediately apparent.
FIG. 1. Spectrogram of five different voice quality modes. (a) Vocal fry. (b) Breathiness. (c) Twang. (d) Hyponasal resonance. (e) Normal.
III. ACOUSTIC MEASURE SYSTEM (BASELINE)
There has been a great amount of research searching for acoustic features that could measure the effectiveness of voice therapy treatment outcomes and quantify the severity of dysphonia along a variety of parameters. We explored acoustic measures of voice that have been validated in past research to detect different vocal parameters related to quality and pitch. These were incorporated to create a baseline voice quality detection system for our study. Most published measures can only be applied to sustained vowels because they may not be accurate on running speech. For example, H1-H2 extracted from vowels has been shown to be effective in measuring breathiness associated with dysphonia (Narasimhan and Vishal, 2017). Also, relative fundamental frequency measures have been shown to be sensitive to hyperfunction-related voice disorders (Stepp et al., 2010). Our study focuses on classifying different voice qualities in unscripted speech, so only features that can be applied to connected speech are studied in this paper. We used these acoustic measures with a support vector machine (SVM) to build a voice quality identification system that serves as the baseline for this task.
A. Pitch Strength (PS)
Pitch strength (PS) has been used widely in sound perception, and recent studies have shown that PS may serve as a perceptual measure of voice quality (Rubin et al., 2019). It is an estimate of pitch saliency or pitch tonality, which is usually reported subjectively. PS is different from pitch height, as two notes with the same pitch height can have different pitch strength (Kopf et al., 2017). Previous studies have shown that PS has an inverse relationship with breathiness (Eddins et al., 2016). Pitch strength is extracted using the Auditory Sawtooth Waveform Inspired Pitch Estimator-Prime (Auditory-SWIPE'). Details of the pitch strength calculation can be found in (Camacho and Harris, 2008). The average pitch strength of each recording is used.
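To make the per-recording feature concrete, the following is a minimal sketch. The function `aud_swipe_prime` is a hypothetical stand-in for an Auditory-SWIPE' implementation (the paper names only the algorithm of Camacho and Harris, 2008, not a specific toolkit); it is assumed to return per-frame F0 estimates and their pitch-strength values.

```python
import numpy as np
import soundfile as sf

def mean_pitch_strength(wav_path):
    """Average pitch strength of one recording (one scalar feature)."""
    x, fs = sf.read(wav_path)
    # Hypothetical Auditory-SWIPE' call: per-frame F0 and pitch strength.
    f0, strength = aud_swipe_prime(x, fs)
    return float(np.mean(strength))
```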
B. Cepstral Peak Prominence (CPP)
Cepstral Peak Prominence (CPP) measures the amount of cepstral energy in a voice sample. It can be applied to both sustained vowels and connected speech (Heman-Ackah et al., 2002; Rubin et al., 2019). Similar to the MFCC, the audio signal is first converted to a spectrum, and the spectrum is then converted to the cepstral domain. The cepstral peak is subsequently located, and the difference between the cepstral peak and the regression value under this peak is calculated. CPP has been shown to have a positive association with perceived clarity of voice, such that higher values represent a clearer voice and lower values are associated with a more aperiodic and breathy voice quality. CPP also varies with the loudness of the signal, such that higher CPP scores can indicate a louder signal (Hillenbrand et al., 1994). CPP is also used for calculating other composite measures of overall voice quality, including the Cepstral Spectral Index of Dysphonia (CSID) and the Acoustic Voice Quality Index (AVQI) (Rubin et al., 2019). The AVQI algorithm is a multivariate measure that includes the CPPS, harmonics-to-noise ratio (HNR), and other acoustic features such as shimmer. However, it is not applicable to connected speech (Lee et al., 2018). Unlike AVQI, CSID can be used to analyze connected speech and sustained vowels separately and has been extensively validated (Peterson et al., 2013). We used the equation in (Lee et al., 2018) to calculate CSID. Praat is used to extract both CPP and CSID.
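As an illustration of the computation just described, the sketch below computes a simplified single-frame CPP in numpy: log-magnitude spectrum, real cepstrum, peak search over the quefrency range of plausible F0, and subtraction of a regression line fitted under the peak. This is an illustrative approximation, not Praat's smoothed CPPS measure used in the paper.

```python
import numpy as np

def cpp_frame(frame, fs, f0_min=60.0, f0_max=300.0):
    """Simplified CPP (dB) of one frame: cepstral peak height above
    a regression line fitted over the searched quefrency range."""
    n = len(frame)
    spectrum = np.abs(np.fft.fft(frame * np.hamming(n)))
    log_spectrum = np.log(spectrum + 1e-12)           # log-magnitude spectrum
    cepstrum = np.abs(np.fft.ifft(log_spectrum))      # real cepstrum
    cep_db = 20.0 * np.log10(cepstrum + 1e-12)
    lo, hi = int(fs / f0_max), int(fs / f0_min)       # quefrency search range
    peak = lo + np.argmax(cep_db[lo:hi])              # cepstral peak location
    q = np.arange(lo, hi)
    slope, intercept = np.polyfit(q, cep_db[lo:hi], 1)  # regression line
    return cep_db[peak] - (slope * peak + intercept)
```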
C. Harmonic-to-Noise Ratio (HNR)
Harmonicity can be used to measure the signal-to-noise ratio (SNR) of anything that generates a periodic signal. It is also a good indicator of different voice qualities (Jannetts and Lowit, 2014). Calculation of the harmonic-to-noise ratio (HNR) is based on the assumption that the signal has a periodic component and an additive noise component. HNR is the energy ratio of the periodic and noise components. Praat is used to generate HNR.
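A minimal sketch of per-clip HNR extraction, assuming the parselmouth Python interface to Praat (the paper states only that Praat is used):

```python
import numpy as np
import parselmouth

def mean_hnr(wav_path):
    """Mean harmonic-to-noise ratio (dB) over the voiced frames of a clip."""
    snd = parselmouth.Sound(wav_path)
    harmonicity = snd.to_harmonicity()   # Praat's "To Harmonicity (cc)..."
    values = harmonicity.values          # -200 dB marks unvoiced frames
    return float(np.mean(values[values != -200]))
```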
D. Fundamental frequency (F0) statistics
Fundamental frequency (F0) is the lowest frequency of a periodic signal. It has been shown to index aging in women's voices (da Silva et al., 2011). Both elevated and lowered F0 may be associated with a variety of pathologies, as well as with improved vocal outcomes (da Silva et al., 2011). The fundamental frequency is extracted using Praat. The mean, standard deviation, maximum, minimum, and slope of F0 are calculated afterward and used as a combined measure for voice quality.
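A sketch of the five F0 statistics, again assuming parselmouth as the Praat interface. The slope is computed here as a linear fit of F0 over time in the voiced frames, which is one plausible reading of the measure; the paper does not define its slope computation.

```python
import numpy as np
import parselmouth

def f0_statistics(wav_path):
    """Mean, SD, max, min, and slope of F0 for one clip (5 features)."""
    snd = parselmouth.Sound(wav_path)
    pitch = snd.to_pitch()
    f0 = pitch.selected_array['frequency']   # Hz; 0 where unvoiced
    times = pitch.xs()
    voiced = f0 > 0
    f0_v, t_v = f0[voiced], times[voiced]
    slope = np.polyfit(t_v, f0_v, 1)[0]      # Hz per second (assumed definition)
    return np.array([f0_v.mean(), f0_v.std(), f0_v.max(), f0_v.min(), slope])
```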
IV. I-VECTOR BASED IDENTIFICATION SYSTEM
In this paper, we extracted i-vectors as features to identify intra-speaker voice quality variations using the data described above. The problem is defined as a closed-set identification problem with five voice qualities per speaker, with the goal of identifying different voice qualities within speakers. The i-vector pipeline of the voice quality identification system is a series of generative models. A universal background model (UBM) is used to collect sufficient information to compute Baum-Welch statistics (Kenny, 2012). A total variability matrix is trained for i-vector extraction using voice features, and a classifier back-end is used for identification. The proposed system is shown in FIG. 2.

FIG. 2. Pipeline of the i-vector system.
A. Voice Features
Mel-frequency cepstral coefficients (MFCCs) are used as the voice features; these are pre-designed features intended to capture the characteristics of human speech. They are computed from mean-normalized audio clips. Delta and acceleration coefficients of the MFCCs are appended to generate a larger feature vector. Cepstral mean and variance normalization (CMVN) is applied to the feature vector to remove channel effects. For practical application, a voice activity detector is not applied, so silence segments are kept in training.
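The following sketches this front end with librosa as a stand-in for the HTK extraction described in Sec. V.B (13 MFCCs, 25 ms frames, 10 ms hop, delta and double-delta coefficients, per-clip CMVN):

```python
import numpy as np
import librosa

def mfcc_features(wav_path, sr=44100, cmvn=True):
    """39-dim MFCC + delta + double-delta features, optionally CMVN-normalized."""
    y, _ = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=int(0.025 * sr),       # 25 ms frames
                                hop_length=int(0.010 * sr))  # 10 ms hop
    feats = np.vstack([mfcc,
                       librosa.feature.delta(mfcc),            # delta
                       librosa.feature.delta(mfcc, order=2)])  # acceleration
    if cmvn:  # per-clip cepstral mean and variance normalization
        feats = (feats - feats.mean(axis=1, keepdims=True)) \
                / (feats.std(axis=1, keepdims=True) + 1e-8)
    return feats                                               # shape (39, T)
```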
B. Universal Background Model
For the majority of speaker identification systems, the UBM is trained with the maximum amount of data from publicly available speech databases. However, there is no concrete evidence that using the maximum amount of data guarantees the best overall performance (Hasan and Hansen, 2011). Therefore, we trained a Gaussian mixture model (GMM) on only the training data, with model parameters λ = {π_c, m_c, Σ_c}. For each mixture component c = 1, ..., C, π_c, m_c, and Σ_c denote the weight, mean vector, and covariance matrix, respectively. The dimension F of the mean vector m_c is the same as the dimension of the feature vectors, and the covariance matrix Σ_c has dimension F × F. The UBM is trained using the expectation-maximization (EM) algorithm (Kenny, 2005).
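A minimal sketch of UBM training, using scikit-learn's EM-based GaussianMixture as a stand-in for the authors' toolkit. The diagonal covariance structure and the placeholder training array are assumptions; the paper does not state either.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Placeholder for the pooled 39-dim MFCC frames of all training sub-segments.
rng = np.random.default_rng(0)
train_frames = rng.standard_normal((20000, 39))

ubm = GaussianMixture(n_components=256,        # C = 256 components (Sec. V.D)
                      covariance_type='diag',  # assumption; not stated in paper
                      max_iter=100)
ubm.fit(train_frames)                          # EM training
# Model parameters lambda = {pi_c, m_c, Sigma_c}:
pi, m, Sigma = ubm.weights_, ubm.means_, ubm.covariances_
```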
C. I-vector Representation and LDA

Following the notation in (Kenny, 2005), m denotes the supervector concatenating the means of the mixture components m_1, ..., m_C, with dimension CF × 1. For each different voice v (from the same or different speakers), M(v) denotes the voice-adapted supervector for that voice; M_c(v), the sub-vector of M(v) for mixture component c = 1, ..., C, is the adapted counterpart of m_c. The i-vector representation of the system is then

M(v) = m + Tw(v),    (1)

where w(v) denotes the i-vector of dimension M, and T denotes the total variability matrix, which has dimension CF × M. The total variability matrix maps the high-dimensional GMM supervector into w(v), a lower-dimensional representation of different voices. The total variability matrix is trained on all of the training data using the EM method, and the i-vectors are computed from the same training data.

LDA is then performed on the i-vectors to maximize inter-class variation and minimize intra-class variance. This algorithm is widely used for dimension reduction in classification problems. After LDA dimension reduction, the low-dimensional i-vectors are centered and length-normalized (Garcia-Romero and Espy-Wilson, 2011) before being fed into the classifier.
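A sketch of this back-end post-processing under the stated dimensions: with 13 speakers × 5 qualities = 65 classes, LDA can retain at most 64 dimensions, which matches the setup in Sec. V.D. scikit-learn is used as a stand-in, and the placeholder i-vectors and labels are assumed inputs from the extractor of Eq. (1).

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
train_ivectors = rng.standard_normal((650, 100))   # placeholder 100-dim i-vectors
labels = np.repeat(np.arange(65), 10)              # 65 classes: speaker x quality

lda = LinearDiscriminantAnalysis(n_components=64)  # at most n_classes - 1 = 64
reduced = lda.fit_transform(train_ivectors, labels)
center = reduced.mean(axis=0)

def postprocess(ivectors):
    """LDA projection, centering, then length normalization."""
    w = lda.transform(ivectors) - center
    return w / np.linalg.norm(w, axis=1, keepdims=True)
```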
V. EXPERIMENTS

A. Data Pre-processing

Thirteen actors (3 males and 10 females) were recorded reading aloud and speaking extemporaneously in 5 different voice qualities for approximately 5 minutes per voice. All speakers recorded fry, breathy, normal, hyponasal, and twang voice qualities. For all recordings, we cut off the first and last 30 seconds, since they are most likely to contain conversation between the SLP and the speaker. Each recording was then segmented into eight-second sub-segments, yielding a total of 30 audio segments per voice quality per speaker. These were further divided into four-second and two-second segments to evaluate performance on short recordings.
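A minimal sketch of this pre-processing, assuming the soundfile library for I/O:

```python
import soundfile as sf

def make_segments(wav_path, seg_seconds=8, trim_seconds=30):
    """Drop the first/last 30 s, then split into fixed-length sub-segments."""
    audio, fs = sf.read(wav_path)
    audio = audio[trim_seconds * fs : len(audio) - trim_seconds * fs]
    seg_len = seg_seconds * fs
    return [audio[i:i + seg_len]
            for i in range(0, len(audio) - seg_len + 1, seg_len)]
```

Calling the same function with seg_seconds=4 or 2 yields the shorter test segments.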
B. Feature Extraction
For the traditional baseline system, we extracted MFCCs, PS, CPPS, CSID, HNR, and F0 statistics. The MFCCs are 13-dimensional and averaged across time, the F0 statistics are five-dimensional, and the other measures are one-dimensional each. In total, we obtained a 22-dimensional feature set for each audio clip.

For our i-vector identification system, 13 MFCCs were extracted with a frame length of 25 ms and a sliding length of 10 ms. The 13 delta and 13 double-delta coefficients are concatenated with the MFCCs to create 39-dimensional spectral feature vectors. The MFCCs were extracted using the HTK toolkit (Young et al., 2002).
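Putting the baseline together, the sketch below assembles the 22-dimensional per-clip vector. `mfcc_features`, `f0_statistics`, `mean_pitch_strength`, and `mean_hnr` are the sketches given above; `cpps` and `csid` are hypothetical helpers standing in for the Praat CPPS extraction and the CSID equation of Lee et al. (2018), which are not reproduced here.

```python
import numpy as np

def baseline_features(wav_path):
    """22-dim baseline vector: 13 time-averaged MFCCs + 5 F0 stats
    + PS + CPPS + CSID + HNR (one scalar each)."""
    # cmvn=False: the baseline averages raw MFCCs over time (assumption).
    mfcc_mean = mfcc_features(wav_path, cmvn=False)[:13].mean(axis=1)  # 13 dims
    f0_stats = f0_statistics(wav_path)                                 # 5 dims
    scalars = np.array([mean_pitch_strength(wav_path),  # PS
                        cpps(wav_path),                 # hypothetical helper
                        csid(wav_path),                 # hypothetical helper
                        mean_hnr(wav_path)])            # HNR
    return np.concatenate([mfcc_mean, f0_stats, scalars])  # 22 dims
```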
C. Classifiers
Our experiments used two types of classifiers: PLDA scoring (Kenny, 2010) and a multi-class SVM. The PLDA model was trained using the same training data and implemented using the MSR Identity Toolbox (Sadjadi et al., 2013); we simply select the model with the highest score. The multi-class SVM was implemented using the LIBSVM library (Chang and Lin, 2011). We use a linear-kernel SVM with the complexity parameter set to one. We use both PLDA scoring and the SVM in the i-vector based system but only the SVM in our baseline system.
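A sketch of the SVM back-end; scikit-learn's SVC wraps the same LIBSVM library cited in the paper. The placeholder arrays stand in for the post-processed i-vectors (or 22-dim baseline vectors) and their voice-quality labels.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
train_x = rng.standard_normal((100, 64))     # placeholder feature vectors
train_y = np.repeat(np.arange(5), 20)        # 5 voice-quality classes

svm = SVC(kernel='linear', C=1.0)            # linear kernel, complexity C = 1
svm.fit(train_x, train_y)
predicted_quality = svm.predict(train_x[:3]) # test segments scored the same way
```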
D. I-vector System Setup
The UBM is a 256-component Gaussian mixture model and the i-vector dimension is 100, since our dataset is small. We trained the UBM and the total variability matrix using the first 20 sub-segments of each speaker. The dimension of the i-vectors was reduced to 64 after applying LDA. We also computed the dimension-reduced i-vectors for the last ten testing sub-segments. The dimension-reduced i-vectors are used as the input features for training the classifier.
VI. EXPERIMENTAL RESULTS

A. Intra-speaker Classification
In this experiment, we evaluated the performance of the two systems on intra-speaker voice quality variations. For a given testing audio segment, we performed classification only against the five voice models (breathy, fry, twang, hyponasal, and normal) from the same speaker. In the first experiment, we used eight-second audio segments from each actor. However, for practical application, it is important to provide feedback to patients after only a minimal delay. Therefore, we further segmented the testing audio into 4-second segments (20 clips per voice) and 2-second segments (40 clips per voice) and ran the same experiment on the shorter segments to evaluate the feasibility of streaming feedback. TABLE I summarizes the performance of the proposed i-vector system on intra-speaker voice quality variations. As seen from the table, both PLDA scoring and the multi-class SVM performed well, with average accuracies of 97.5% and 96.4%, respectively, for 8-second audio clips. Our results also demonstrate that longer testing utterances work better than shorter ones. This meets our expectations, since the percentage of non-speech frames in the shorter audio segments is higher. Although performance degraded with shorter clips, the average accuracy remained above 90%. FIG. 3 shows a scatter plot of the first two LDA dimensions of all five models of actor nine in our i-vector system. As seen in the plot, the i-vectors effectively separate the different voice qualities. TABLE II shows the results of our proposed system compared to the traditional system. Our system outperforms the traditional system by more than 20%; the traditional system achieves only 77.2% on 8-second segments. On 2-second segments, the i-vector system outperforms it by more than 30%.
FIG. 3. First two LDA dimensions of intra-speaker voice qualities.
B. Inter-speaker Classification
In this experiment, we examined inter-speaker differences with the i-vector system. That is, for a given testing segment, we scored it against all of the possible models (13 speakers with 5 voice qualities each). When recording voice therapy sessions or home practice, it is inevitable that there will be speech from other people (e.g., the SLP or family members). Therefore, it is important for the system to differentiate the patient's voice from other people's voices while also detecting the patient's voice quality (i.e., combined tasks). The results of this experiment are summarized in TABLE III. As expected, the performance of the system is worse because a portion of the testing clips are classified as other actors. Although the system performs worse on the combined tasks, the highest accuracy achieved (97.1%) is still sufficient for practical use.
VII. DISCUSSION
From the experiments we can see that the i-vector system is more accurate: it can detect more subtle voice qualities than the traditional acoustic measure system.
TABLE I. Accuracy of the intra-speaker experiment of the i-vector system.

            PLDA Scoring            Multi-class SVM
            8 s    4 s    2 s       8 s    4 s    2 s
Actor 1      98%    94%    90%       98%    95%    90.5%
Actor 2     100%    99%    97%       96%    93%    91%
Actor 3      98%    97%    93.5%     98%    96%    92.5%
Actor 4      92%    90%    79.5%     78%    77%    70%
Actor 5     100%    96%    90%       96%    92%    85.5%
Actor 6     100%    97%    94.5%    100%    93%    91.5%
Actor 7     100%   100%    96%      100%    99%    96.5%
Actor 8      98%    95%    88.5%     94%    90%    83.5%
Actor 9     100%   100%    99%      100%   100%    99.5%
Actor 10     82%    79%    77.5%     92%    92%    89.5%
Actor 11    100%   100%    99%      100%   100%    98%
Actor 12    100%    97%    91.5%    100%    97%    92%
Actor 13    100%   100%   100%      100%    99%    99%
Average      97.5%   -      -        96.4%   -      -

The i-vector based system exceeded the system built on traditional measures in both intra-speaker and inter-speaker voice quality identification. This may be because, when individuals vary their voice quality, they vary many other aspects of speech production that are not captured by traditional voice measures alone.

In this study, vocal qualities were separated by instructing participants to speak in a consistent voice quality.
TABLE III. Accuracy of the inter-speaker experiment for the two classifiers.

            PLDA Scoring            Multi-class SVM
            8 s    4 s    2 s       8 s    4 s    2 s
Average       -      -      -         -      -      -

For example, in the "fry" quality, participants attempted to use fry in every word of each sentence, whereas normal speakers will typically do so only in the final words of a phrase. Thus, detection of problematic vocal qualities may be more challenging when they are intermittently produced. Future work will therefore include development of an approach that scales the particular mode after its classification; for example, the system may classify a voice as "23% fry" rather than classifying the entire segment as "fry." Furthermore, the present system may be used to automatically analyze patient home practice or mobile ambulatory recordings. In this manner, voice qualities can be quantified without requiring the clinician's or investigator's perceptual analysis of all recordings, a time-consuming process. Since patients are likely to use target qualities to a degree rather than consistently, this approach holds utility for automatic analysis of extra-clinical patient practice recordings in studies of patient adherence to voice therapy. Furthermore, the target voice of a voice patient may be only subtly different from their habitual voice, so we will ultimately need a system that can classify voices that are only subtly different, based on even shorter segments. For data collection with patients, this means they would not need to sustain the target voice for two minutes straight; 20-second chunks of good talking, interleaved with coaching and re-recording, would suffice, making the system easier to use. To build a practical application, we also need to analyze the performance of the system under noisy conditions. MFCC features are vulnerable to noise, and performance might degrade for low signal-to-noise ratio (SNR) utterances. Augmentation with noise might be necessary when turning the system into a practical application.
VIII. CONCLUSION
We have demonstrated that it is possible to classify dramatically different intra-speaker voice qualities, including normal, breathy, fry, twang, and hyponasal, using i-vectors. This successful classification provides proof of concept for classifying voice differences within clinical patients. The next step may be to train the system on more subtle voice quality differences and, subsequently, to detect the habitual voice quality and treatment target in actual voice patients. This work may ultimately lead to the production of feedback mechanisms for extra-clinical patient voice use, as the system is able to detect whether a person is using their normal habitual voice or any of a variety of qualities that may represent problematic vocal use or, conversely, vocal targets. This shows the potential of this system to be built into an actual product for voice therapy.

IRB approval was obtained from Georgia Institute of Technology and Georgia State University.

Arias-Londoño, J. D., Gómez-García, J. A., and Godino-Llorente, J. I. (2019). "Multimodal and multi-output deep learning architectures for the automatic assessment of voice quality using the GRB scale," IEEE Journal of Selected Topics in Signal Processing (2), 413–422.
Behrman, A., Shikowitz, M. J., and Dailey, S. (2002). "The effect of upper airway surgery on voice," Otolaryngology-Head and Neck Surgery (1), 36–42.
Bonilha, H. S., and Dawson, A. E. (2012). "Creating a mastery experience during the voice evaluation," Journal of Voice (5), 665–e1.
Camacho, A., and Harris, J. G. (2008). "A sawtooth waveform inspired pitch estimator for speech and music," The Journal of the Acoustical Society of America (3), 1638–1652.
Chang, C.-C., and Lin, C.-J. (2011). "LIBSVM: A library for support vector machines," ACM Transactions on Intelligent Systems and Technology (TIST) (3), 27.
Cohen, S. M. (2010). "Self-reported impact of dysphonia in a primary care population: An epidemiological study," The Laryngoscope (10), 2022–2032.
Cohen, S. M., Kim, J., Roy, N., Asche, C., and Courey, M. (2012). "The impact of laryngeal disorders on work-related dysfunction," The Laryngoscope (7), 1589–1594.
da Silva, P. T., Master, S., Andreoni, S., Pontes, P., and Ramos, L. R. (2011). "Acoustic and long-term average spectrum measures to detect vocal aging in women," Journal of Voice (4), 411–419.
Dehak, N., Kenny, P. J., Dehak, R., Dumouchel, P., and Ouellet, P. (2011). "Front-end factor analysis for speaker verification," IEEE Transactions on Audio, Speech, and Language Processing (4), 788–798.
Dejonckere, P., and Lebacq, J. (2001). "Plasticity of voice quality: a prognostic factor for outcome of voice therapy?," Journal of Voice (2), 251–256.
Eddins, D. A., Anand, S., Camacho, A., and Shrivastav, R. (2016). "Modeling of breathy voice quality using pitch-strength estimates," Journal of Voice (6), 774–e1.
Garcia-Romero, D., and Espy-Wilson, C. Y. (2011). "Analysis of i-vector length normalization in speaker recognition systems," in Twelfth Annual Conference of the International Speech Communication Association.
Ghassemi, M., Van Stan, J. H., Mehta, D. D., Zañartu, M., Cheyne II, H. A., Hillman, R. E., and Guttag, J. V. (2014). "Learning to detect vocal hyperfunction from ambulatory neck-surface acceleration features: Initial results for vocal fold nodules," IEEE Transactions on Biomedical Engineering (6), 1668–1675.
Glanz, K., Rimer, B. K., and Viswanath, K. (2008). Health Behavior and Health Education: Theory, Research, and Practice (John Wiley & Sons).
Hapner, E., Portone-Maira, C., and Johns III, M. M. (2009). "A study of voice therapy dropout," Journal of Voice (3), 337–340.
Hasan, T., and Hansen, J. H. (2011). "A study on universal background model training in speaker verification," IEEE Transactions on Audio, Speech, and Language Processing (7), 1890–1899.
Heman-Ackah, Y. D., Michael, D. D., and Goding Jr, G. S. (2002). "The relationship between cepstral peak prominence and selected parameters of dysphonia," Journal of Voice (1), 20–27.
Hillenbrand, J., Cleveland, R. A., and Erickson, R. L. (1994). "Acoustic correlates of breathy vocal quality," Journal of Speech, Language, and Hearing Research (4), 769–778.
Hollien, H., Moore, P., Wendahl, R. W., and Michel, J. F. (1966). "On the nature of vocal fry," Journal of Speech and Hearing Research (2), 245–247.
Hornibrook, J., Ormond, T., and Maclagan, M. (2018). "Creaky voice or extreme vocal fry in young women," The New Zealand Medical Journal (Online) (1486), 36–40.
Jacobson, B. H., Johnson, A., Grywalski, C., Silbergleit, A., Jacobson, G., Benninger, M. S., and Newman, C. W. (1997). "The Voice Handicap Index (VHI): development and validation," American Journal of Speech-Language Pathology (3), 66–70.
Jannetts, S., and Lowit, A. (2014). "Cepstral analysis of hypokinetic and ataxic voices: correlations with perceptual and other acoustic measures," Journal of Voice (6), 673–680.
Kempster, G. B., Gerratt, B. R., Abbott, K. V., Barkmeier-Kraemer, J., and Hillman, R. E. (2009). "Consensus auditory-perceptual evaluation of voice: development of a standardized clinical protocol," American Journal of Speech-Language Pathology.
Kenny, P. (2005). "Joint factor analysis of speaker and session variability: Theory and algorithms," CRIM, Montreal, (Report) CRIM-06/08-13, 28–29.
Kenny, P. (2010). "Bayesian speaker verification with heavy-tailed priors," in Odyssey, Vol. 14.
Kenny, P. (2012). "A small footprint i-vector extractor," in Odyssey 2012 - The Speaker and Language Recognition Workshop.
Kopf, L. M., Jackson-Menaldi, C., Rubin, A. D., Skeffington, J., Hunter, E. J., Skowronski, M. D., and Shrivastav, R. (2017). "Pitch strength as an outcome measure for treatment of dysphonia," Journal of Voice (6), 691–696.
Lee, J. M., Roy, N., Peterson, E., and Merrill, R. M. (2018). "Comparison of two multiparameter acoustic indices of dysphonia severity: the acoustic voice quality index and cepstral spectral index of dysphonia," Journal of Voice (4), 515–e1.
Lombard, L. E., and Steinhauer, K. M. (2007). "A novel treatment for hypophonic voice: Twang therapy," Journal of Voice (3), 294–299.
Maryn, Y., Corthals, P., Van Cauwenberge, P., Roy, N., and De Bodt, M. (2010). "Toward improved ecological validity in the acoustic measurement of overall voice quality: combining continuous speech and sustained vowels," Journal of Voice (5), 540–555.
Mehta, D. D., Zañartu, M., Feng, S. W., Cheyne II, H. A., and Hillman, R. E. (2012). "Mobile voice health monitoring using a wearable accelerometer sensor and a smartphone platform," IEEE Transactions on Biomedical Engineering (11), 3090–3096.
Narasimhan, S., and Vishal, K. (2017). "Spectral measures of hoarseness in persons with hyperfunctional voice disorder," Journal of Voice (1), 57–61.
Peterson, E. A., Roy, N., Awan, S. N., Merrill, R. M., Banks, R., and Tanner, K. (2013). "Toward validation of the cepstral spectral index of dysphonia (CSID) as an objective treatment outcomes measure," Journal of Voice (4), 401–410.
Portone, C., Johns III, M. M., and Hapner, E. R. (2008). "A review of patient adherence to the recommendation for voice therapy," Journal of Voice (2), 192–196.
Ramig, L. O., and Verdolini, K. (1998). "Treatment efficacy: voice disorders," Journal of Speech, Language, and Hearing Research (1), S101–S116.
Rothauser, E. (1969). "IEEE recommended practice for speech quality measurements," IEEE Transactions on Audio and Electroacoustics, 225–246.
Rubin, A. D., Jackson-Menaldi, C., Kopf, L. M., Marks, K., Skeffington, J., Skowronski, M. D., Shrivastav, R., and Hunter, E. J. (2019). "Comparison of pitch strength with perceptual and other acoustic metric outcome measures following medialization laryngoplasty," Journal of Voice (5), 795–800.
Sadjadi, S. O., Slaney, M., and Heck, L. (2013). "MSR Identity Toolbox v1.0: A MATLAB toolbox for speaker-recognition research," Speech and Language Processing Technical Committee Newsletter (4), 1–32.
Scherer, K. R. et al. (2000). "Psychological models of emotion," The Neuropsychology of Emotion (3), 137–162.
Schwartz, S. R., Cohen, S. M., Dailey, S. H., Rosenfeld, R. M., Deutsch, E. S., Gillespie, M. B., Granieri, E., Hapner, E. R., Kimball, C. E., Krouse, H. J. et al. (2009). "Clinical practice guideline: hoarseness (dysphonia)," Otolaryngology–Head and Neck Surgery (1 suppl), 1–31.
Smith, B. E., Kempster, G. B., and Sims, H. S. (2010). "Patient factors related to voice therapy attendance and outcomes," Journal of Voice (6), 694–701.
Stepp, C. E., Hillman, R. E., and Heaton, J. T. (2010). "The impact of vocal hyperfunction on relative fundamental frequency during voicing offset and onset," Journal of Speech, Language, and Hearing Research.
Thibeault, S. L., Merrill, R. M., Roy, N., Gray, S. D., and Smith, E. M. (2004). "Occupational risk factors associated with voice disorders among teachers," Annals of Epidemiology (10), 786–792.
van Leer, E., and Connor, N. P. (2010). "Patient perceptions of voice therapy adherence," Journal of Voice (4), 458–469.
van Leer, E., and Connor, N. P. (2012). "Use of portable digital media players increases patient motivation and practice in voice therapy," Journal of Voice (4), 447–453.
Van Leer, E., Pfister, R. C., and Zhou, X. (2017). "An iOS-based cepstral peak prominence application: Feasibility for patient practice of resonant voice," Journal of Voice (1), 131–e9.
Wang, Z., Yu, P., Yan, N., Wang, L., and Ng, M. L. (2016). "Automatic assessment of pathological voice quality using multidimensional acoustic analysis based on the GRBAS scale," Journal of Signal Processing Systems (2), 241–251.
Young, S., Evermann, G., Gales, M., Hain, T., Kershaw, D., Liu, X., Moore, G., Odell, J., Ollason, D., Povey, D. et al. (2002). "The HTK Book," Cambridge University Engineering Department, 175.
Ziegler, A., Verdolini Abbott, K., Johns, M., Klein, A., and Hapner, E. R. (2014). "Preliminary data on two voice therapy interventions in the treatment of presbyphonia," The Laryngoscope (8), 1869–1876.
Zue, V., Seneff, S., and Glass, J. (1990). "Speech database development at MIT: TIMIT and beyond," Speech Communication (4), 351–356.