Voice Gender Scoring and Independent Acoustic Characterization of Perceived Masculinity and Femininity
Fuling Chen (a), Roberto Togneri (a), Murray Maybery (b), Diana Tan (b,c)
(a) Dept. of Electrical, Electronic and Computer Engineering, University of Western Australia
(b) School of Psychological Science, University of Western Australia
(c) Telethon Kids Institute, Perth, Australia
Abstract
Previous research has found that voices can provide reliable information to be used for gender classification with a high level of accuracy. In social psychology, perceived vocal masculinity and femininity (i.e., masculinity/femininity rated by humans) has often been considered as an important feature when investigating the influence of vocal features on social behaviours. While previous studies have characterised acoustic features that contributed to perceivers' judgements of speakers' vocal masculinity or femininity, there is limited research on developing an objective masculinity/femininity scoring model and characterizing the independent acoustic factors that contribute to perceivers' judgements of the degree of masculinity or femininity in speakers' voices. In this work, we first propose an objective masculinity/femininity scoring system based on the Extreme Random Forest and then characterize the independent and meaningful acoustic factors contributing to perceivers' judgements by using a correlation matrix based hierarchical clustering method. The results show that the objective masculinity/femininity ratings strongly correlated with the perceived masculinity/femininity ratings when we used an optimal speech duration of 7 seconds, with a correlation coefficient of up to .63 for females and .77 for males. Nine independent clusters of acoustic measures were generated from our modelling of femininity judgements for female voices and eight clusters were found for masculinity judgements for male voices. The results revealed that, for both sexes, the F0 mean is the most important acoustic measure affecting the judgement of vocal masculinity and femininity. The F3 mean, F4 mean and VTL estimators are found to be highly inter-correlated and appeared in the same cluster, forming the second most significant factor influencing the assessment of vocal masculinity and femininity. Next, F1 mean, F2 mean and F0 standard deviation are independent factors that share similar importance. The voice perturbation measures, including HNR, jitter and shimmer, are of lesser importance in influencing masculinity/femininity judgements.

Keywords: masculinity, femininity, Extreme Random Forest, Hierarchical Clustering, acoustic, regression
1. Introduction
The human voice conveys a wide range of information. For instance, it contains information that reflects the degree of perceived masculinity or femininity of the speaker. In general, differences in vocal masculinity and femininity are associated with differences between males and females in the development of secondary sex characteristics, such as the length of the vocal tract [1]. These sex differences are influenced by biological factors (e.g., higher levels of masculinizing hormones like testosterone). Furthermore, variations in secondary sex characteristics have been found to correlate with health status, physical strength, mating success and the presence of psychological disorders [2, 3, 4]. Several studies have shown that perceived vocal masculinity and femininity play important roles in human social preferences such as attractiveness [5, 6, 7, 8, 9, 10]. These studies characterized human perception of masculinity and femininity by employing two types of research methods: (1) observing the influence of manipulations of various acoustic properties of voice samples on perceived masculinity and femininity ratings, and (2) examining the relationships between acoustic measures and perceived masculinity and femininity ratings. However, the acquisition of perceived masculinity or femininity ratings is resource intensive.

Email addresses: [email protected] (Fuling Chen), [email protected] (Roberto Togneri), [email protected] (Murray Maybery), [email protected] (Diana Tan)

Preprint submitted to Speech Communication, February 17, 2021

Several studies have investigated the relationships between the perceived masculinity/femininity of voices and various acoustic measures, as summarized in Table 1. Cartei et al. [5] examined relationships between acoustic measures (e.g. fundamental frequency (F0) and resonances (∆F)) and perceived judgements of speakers' masculinity. Male speakers with lower F0 and ∆F were rated as more masculine, with F0 showing as a more salient cue for perceived masculinity than ∆F. Similarly, Ko's study [11] demonstrated that F0 was highly and positively correlated with perceived femininity in both males and females. Furthermore, the study showed that ∆F and F0 variance had weaker correlations with vocal femininity, compared to the correlation between F0 and vocal femininity, and F0 variance was related to vocal femininity only in males and not in females. Other acoustic features associated with perceived masculinity and femininity have also been reported in the literature. For instance, perceived masculinity was found to be related to F0 and formant frequencies (Fn) [10, 12]. Specifically, speakers with higher F0 and higher Fn were associated with more feminine and less masculine voices. These results imply that F0 has a stronger effect on perceived masculinity/femininity than other acoustic measures. The Fn has also been shown to correlate with perceived masculinity/femininity ratings, with the second formant frequency (F2) correlating more strongly with masculinity/femininity ratings than either the first (F1) or the third (F3) formant frequency [13].
Of a range of acoustic measures examined in [14], F0 and F2 were found to account for 49.60% and 19.60% of the variance in masculinity ratings for males respectively, and for 40.60% and 24.10% of the variance in femininity ratings for females respectively. Feinberg et al. [6, 9] showed that F0 and apparent vocal tract length (VTL) both influenced the masculinity ratings for male speakers. The voice perturbation measures, including jitter, shimmer and Harmonic-to-Noise Ratio (HNR), have also been investigated in [15, 16, 17, 18] as contributors in classifying males and females. Among them, only jitter parameters were found to be statistically significantly higher in males than in females [15], and HNR correlated more strongly with femininity ratings in females than with masculinity ratings in males [18]. Such voice perturbation measures may only be valid for differentiating females and males, and may not be significantly associated with the masculinity ratings only or femininity ratings only [19, 20]. In summary, the existing literature suggests that acoustic measures such as F0, F0 variance, Fn, ∆F, VTL, HNR, jitter and shimmer could be valid cues to human listeners in assessing vocal masculinity and femininity.

In the past two decades, the development of machine learning models has advanced the field of speech science, offering the potential to overcome the limitation of relying on human ratings, the collection of which is resource intensive. Machine learning models are capable of addressing several challenges including: (1) the gender classification problem, (2) objectively rating speakers' voices for masculinity/femininity, and (3) characterizing cues that affect the determination of perceived masculinity/femininity.

Several studies have addressed the first issue of the gender classification problem.
Gender classification for adults using speech signals can achieve an accuracy of approximately 95%, by applying machine learning classifiers designed for voice-based gender classification, including Random Forest (RF), Linear Discriminant Analysis (LDA), and K-Nearest Neighbour (KNN) [21, 22]. RF-based models have been found to be a powerful tool for both classification and regression purposes [23] with multiple applications, such as emotion recognition [24] as well as gender classification [25]. Harb et al. [26] proposed a set of neural networks using acoustic and pitch related features for gender classification and achieved 98.5% accuracy.

Considering the length of speech to use in either gender classification or in finding relationships to masculinity/femininity ratings, long-term speech was found to be more suitable than short-term speech. Harb et al. [26] found that classification was more accurate using segments of 5 seconds compared to 1- and 3-second segments (98.5% vs 90% and 93%). Similarly, Cartei et al. [5] demonstrated that correlations between perceived masculinity for males and both F0 and ∆F were of larger magnitude when the voice samples were connected speech of multiple sentences rather than word-level speech or single-sentence speech.

To the best of our knowledge, limited research has addressed the second and third challenges of developing an objective means of rating masculinity and femininity and characterizing the salient cues affecting perceived masculinity/femininity. In the computer vision field, Gilani et al. [27] proposed an objective gender classification model based on an LDA algorithm using facial cues.
The objective facial masculinity/femininity scores, generated by the algorithm, significantly correlated with the subjective masculinity/femininity scores for males (r = .79) and for females (r = .90).

Table 1: Summary of research on vocal masculinity and femininity

Reference | Acoustic Measures | Subjects | Raters | Method+ | Stimuli Type | Findings
[5] | F0, ∆F | 37M* | 20F* | 1 | Isolated word; sentence; connected speech | F0 mediated correlation between perceived masculinity and testosterone levels.
[6, 9] | F0, VTL | 4M, 4F [6]; 10M [9] | 26F [6]; 89F [9] | 2 | Vowels | F0 and VTL were independent and correlated with perceived masculinity.
[7] | F0 | 10M, 10F | 24M, 25F | 2 | Not mentioned | Men with lower F0 were rated more masculine; women with higher F0 were rated more feminine.
[10] | F0 and Fn (F1, F2, F3, F4) | 57M, 57F | 30M, 31F | 2 | Single-syllable bVt words | F0 and Fn were independent and correlated with perceived masculinity.
[13] | F0, F0 range, intonation for two sentence samples; F1, F2, F3 for vowels | 15T*, 3M, 6F | 20 | 1 | Two sentences; vowels each for 5 seconds | F2 was more statistically significant in perceived masculinity/femininity ratings than either F1 or F3.
[14] | F0, F1, F2, F0 range, VTL, /s/ center of gravity, /s/ skewness, H2-H1 amplitude for /æ/ /a/ | | | | | F0 and jitter discriminated males and females.
[18] | F0, HNR, jitter, shimmer | 57M, 57F | Self-assessment | 1 | 45 seconds of speech reduced from 30-minute interactions | F0, HNR, jitter and shimmer are contributors to classify males and females.
[17] | Jitter, shimmer | 20F | None | 4 | 3 seconds, vowels | Females had more jitter and less shimmer than males.
[12] | F0, average formant frequency, shimmer, HNR | 22T, 10F, 10M | 10M, 10F | 1 | 5 seconds, vowels | Higher F0 and average formant frequency were more strongly associated with feminine ratings than masculinity ratings.
[19, 20] | F0, HNR, jitter, shimmer [19]; F0, Fn, jitter, shimmer [20] | 20T, 5F, 5M [19]; 21T, 9F [20] | 12M, 13F [19]; 15F, 5M [20] | 1 [19]; 4 [20] | Passage reading, 20-25 seconds | F0 was strongly correlated with speaker's self-rated and listener-rated femininity, but HNR, jitter and shimmer did not correlate with these ratings.

* M - males; F - females; T - Transsexual; G - Gay men; L - Lesbian
+ 1. Correlation of acoustic measures and perceived masculinity/femininity ratings; 2. Manipulation of acoustic characteristics; 3. Characterization using regression models; 4. Statistical analysis using t-test.

Similar to Gilani's study, in our previous work [28], we objectively rated speakers' vocal masculinity and femininity based on gender classification. We did this by deriving the objective masculinity/femininity scores for individuals representing where they were positioned in the classification space between extreme "maleness" (the male model) and extreme "femaleness" (the female model). The results of our study demonstrated a close correspondence between the objective scores and human listeners' ratings of masculinity for males (r = .67), and femininity for females (r = .51). Besides the LDA algorithm, RF-based regression algorithms have also been widely used to predict human ratings on recommendation of movies [29] and on word prominence judgments [30]. Regarding the characterization of the salient features in prediction, RF-based models have been popularly used in analysing important features in voice-based emotion recognition [31, 32], detection of Parkinson's Disease [33] and sleep stage classification [34].

For the purpose of characterisation, it has been noted that severe multicollinearity increases the difficulty of interpreting regression results [35, 36]. Several studies have been conducted to investigate the relationships among various acoustic measures. Cartei et al. [5], consistent with other studies [1, 37, 38], showed that F0 and ∆F are largely independent of each other as they are affected by different constraints on the speech production system. Apart from ∆F, formant dispersion, as another estimator of VTL, was shown to be independent of F0 [6, 9]. Regarding the relationship between F0 and Fn, the results of Pisanski et al.
[10] were consistent with other research [37, 39, 40] in that F0 and Fn were largely independent, both within and across utterances by the same speaker and in the average values of these measures in different speakers. Other studies showed strong intercorrelations existed in some other acoustic measures. For instance, higher formant frequencies, such as F3 and F4, were shown to be strongly correlated with VTL estimators [41, 42, 43]. Of note, the severity of multicollinearity among different VTL estimators, including ∆F, formant dispersion, formant position and formant spacing, has not been investigated. It is notable that different VTL estimators, all derived from formant frequencies, may represent the same source, but can vary in their measurement emphasis. There is another potential for severe multicollinearity that may occur with the sources of periodicity perturbations in voiced speech signals, including HNR, jitter and shimmer. Some studies showed HNR depended on both jitter and shimmer [44, 45, 46]. However, the intercorrelation may vary between males and females; for example, jitter can be independent of other measures in males, while moderately correlated with HNR in females [15, 16]. To summarize, the above research provided evidence of the relationships between some acoustic measures, but what was lacking is a comprehensive and thorough examination of the intercorrelations of all the acoustic measures that contribute to perceived voice masculinity of male voices and femininity of female voices.

Given the existence of multicollinearity, several approaches have been proposed to reduce the degree of multicollinearity, such as adopting ridge regression and principal components regression [36]. However, a major limitation of ridge regression is that the choice of the biasing constant k is a subjective one and the exact distributional properties are not known [47, 48].
As the degree of multicollinearity is likely to vary among the multiple acoustic measures, the choice of k may not be ideal for all combinations of the acoustic measures. It may cause over-reduction for weakly correlated acoustic measures but under-reduction for highly correlated measures. Principal component regression addresses multicollinearity by using less than the full set of principal components to explain the variation in the response variable. However, the principal components lose the original natural meanings of the variables, so this method is not ideal for characterization purposes. Therefore, in research by Ketchen [49], clustering methods were designed to address the multicollinearity problem while retaining the natural meanings of variables. In the study of hierarchical cluster analysis [50], a recommendation was made to cluster the variables with the highest average intercorrelations in the correlation matrix. This correlation matrix based hierarchical clustering method was shown to have higher sensitivity than a method using independent component analysis (ICA) for identifying correlation structures with relatively weak connections, and its outcomes are easy to interpret as the strength of functional connectivity [51].

The above mentioned studies highlight the importance of developing models for objective masculinity/femininity rating and acoustic characterization to understand various psychological and social processes. The summary of the existing literature also highlights the resource-intensive nature of obtaining these ratings and of understanding the salient acoustic measures that affect perceived masculinity and femininity.

A key motivation of this study is that there has been limited research to build an objective masculinity/femininity scoring model based on a set of comprehensive acoustic measures which can be used to characterize and predict listeners' perceived voice masculinity/femininity.
In investigating the viability of an objective masculinity/femininity scoring model, our research used a database of voices substantially larger than the databases used in previous research (see Table 1; the largest existing database consisted of 57 male and 57 female speakers [18]). The present study used a new dataset of speech segments from 225 adult speakers which were rated for the speakers' vocal masculinity and femininity by 25-30 listeners. This large dataset enabled rigorous testing in model development. We also extended our previous study [28], in which the LDA model was trained for gender classification rather than directly modelling human perceptual ratings. The present study proposes a novel masculinity/femininity rating model based on the Extreme Random Forest (ERF) algorithm for predicting the degree of masculinity of males and the degree of femininity of females, given a comprehensive set of acoustic measures derived from recordings of passage reading and listeners' masculinity/femininity ratings of those recordings.

Furthermore, little attention has been given to addressing the problem of multicollinearity among acoustic measures. For example, in the studies of [10, 15, 16], it was unclear whether F0, Fn, jitter, shimmer and HNR were independent of one another. The weights of individual measures may be unreliable and imprecise if severe multicollinearity exists. This also raises another question: is it possible to group the highly correlated acoustic measures into several clusters which retain the connection with the physical speech production model of the original measures? Data reduction strategies such as Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA) do address the multicollinearity problem, though at the expense of not being able to easily interpret the physical meaning of the resulting features.
This study proposes a novel computational clustering framework to address the multicollinearity issue, which retains the physical meanings of each independent group of acoustic measures.

Last but not least, it is noticeable from Table 1 that considerable effort has been invested in studying the correlations between the masculinity/femininity ratings of human listeners and the speakers' vocal characteristics, and in ranking the importance of the characteristics. However, statistically, the weights of acoustic measures or groups of acoustic measures were not provided in these studies, especially when multiple acoustic measures were used. To address this issue, the current study proposes an ERF-based model to characterize independent groups of acoustic measures that dominate the prediction of perceived masculinity in males and femininity in females, in conjunction with the clustering of acoustic measures.

The remainder of this paper is organized as follows: Section 2 describes the methodology, including the datasets and acoustic measures used, the pre-processing of the data and labels, the proposed objective masculinity/femininity scoring model, the solution of the multicollinearity problem, the acoustic characterization and our evaluation methods. Section 3 presents the results of the modelling and discussion of the outcomes. Finally, Section 4 draws some conclusions based on the study.
2. Methodology
The proposed system (see Figure 1) achieves the following three goals: (1) generation of objective masculinity/femininity ratings by training on a set of acoustic measures with known perceived masculinity/femininity ratings for both classes (males and females); (2) building independent clusters of acoustic measures with clearly interpretable meanings for males and females to eliminate multicollinearity; and (3) characterization of the salient clusters of acoustic measures that are associated with perceived masculinity/femininity ratings for both classes. In the remainder of this section we describe the methods behind each of the goals, based on the application of ERF to carry out the objective masculinity/femininity rating, a novel hierarchical clustering of feature correlations to build the meaningful clusters, and the characterization of the independent clusters.

Figure 1: Block diagram of the proposed system
The datasets were obtained from the School of Psychological Science at the University of Western Australia. Voice recordings were collected for the purpose of investigating the association between perceived masculinity/femininity ratings and autistic traits [52]. This database was chosen because it contains more speakers with available perceived masculinity/femininity ratings than any other public database.

The database (see description in Table 2) is composed of two cohorts totalling 225 adult participants (96 males and 129 females) who were typically developing undergraduates and fluent in English. Tested individually in a soundproof room, each participant provided two voice recordings by reading the Rainbow passage [53] using a conversational tone. Only the second sentence from the passage was used for the masculinity and femininity ratings.

Human masculinity/femininity ratings were provided by raters who did not know the speakers. For each rater, the voices were presented in a random order. Following the presentation of each voice through enclosed headphones, a rating scale appeared on the screen. The scale ranged from 1 to 10 for Cohort 1 and 1 to 100 for Cohort 2, with the extreme points labelled 'not at all masculine' and 'extremely masculine' for male voices, and 'not at all feminine' and 'extremely feminine' for female voices.

Table 2: Database description

Cohort No.         | 1            | 2
Collected year     | 2015         | 2019
Speakers' mean age | 18.9 years   | 19.09 years
Number of speakers | 22 M*, 22 F* | 74 M, 107 F
Number of raters   | 30           | 25
Rating scale       | 1-10         | 1-100
* M - males; F - females

The recruitment and testing of all participants were conducted in accordance with the ethics approval obtained for this study from the Human Research Ethics Committee at the University of Western Australia.
The set of 23 widely known acoustic measures in Table 3 was used to describe the vocal characteristics of each speaker. Among these measures, the mean value of F0 (F0 mean), standard deviation of F0 (F0 SD), HNR, all jitter measures (local Jitter, local absolute Jitter, rap Jitter, ppq5 Jitter and ddp Jitter) and all shimmer measures (local Shimmer, apq3 Shimmer, apq5 Shimmer, apq11 Shimmer and dda Shimmer) were obtained from Parselmouth 0.3.3, which is a Python library for the Praat software. The mean values of F1, F2, F3 and F4 measured the corresponding formants at each glottal pulse using the formant position formula [54]. The apparent VTL was estimated in six measures: formant position (pF) [54], formant dispersion (fdisp) [55], average formant frequency (avgFormant) [10], geometric mean formant frequency (mff) [56], Fitch formant estimate [55] and formant spacing (∆F) [57].
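Five of the six VTL estimators above are simple functions of the four formant means (formant position, which involves standardized formants, is omitted here). The following sketch uses formulas commonly attributed to the cited literature; the exact forms, and the conventional 35000 cm/s speed of sound, are assumptions for illustration, not the authors' verified implementation.

```python
SPEED_OF_SOUND = 35000.0  # cm/s, a conventional value (assumed here)

def vtl_estimators(formants):
    """formants: mean F1..F4 in Hz; returns a dict of apparent-VTL estimators."""
    f1, f2, f3, f4 = formants
    # Formant dispersion: average spacing between adjacent formants.
    fdisp = (f4 - f1) / 3.0
    # Arithmetic and geometric mean formant frequencies.
    avg_formant = (f1 + f2 + f3 + f4) / 4.0
    mff = (f1 * f2 * f3 * f4) ** 0.25
    # Formant spacing dF: least-squares fit of Fi ~ (2i-1)/2 * dF,
    # i.e. the formant pattern of a uniform tube closed at one end.
    halves = [(2 * i - 1) / 2.0 for i in (1, 2, 3, 4)]
    delta_f = (sum(f * h for f, h in zip(formants, halves))
               / sum(h * h for h in halves))
    # Apparent vocal tract length derived from dF (Fitch-style estimate).
    vtl = SPEED_OF_SOUND / (2.0 * delta_f)
    return {"fdisp": fdisp, "avgFormant": avg_formant,
            "mff": mff, "deltaF": delta_f, "fitchVTL": vtl}

est = vtl_estimators([500.0, 1500.0, 2500.0, 3500.0])
```

For this idealized uniform-tube input the fit is exact: ∆F comes out as 1000 Hz and the apparent VTL as 17.5 cm, which illustrates why these estimators, all derived from the same formant means, tend to be highly inter-correlated.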
Table 3: Acoustic measures

Pitch related measures: 1 F0 mean; 2 F0 SD
Acoustic perturbation measure - HNR: 3 HNR
Acoustic perturbation measures - jitter: 4 local Jitter; 5 local absolute Jitter; 6 rap Jitter; 7 ppq5 Jitter; 8 ddp Jitter
Acoustic perturbation measures - shimmer: 9 local Shimmer; 10 apq3 Shimmer; 11 apq5 Shimmer; 12 apq11 Shimmer; 13 dda Shimmer
Formant frequencies: 14 F1 mean (F1); 15 F2 mean (F2); 16 F3 mean (F3); 17 F4 mean (F4)
VTL estimators: 18 formant position (pF); 19 formant dispersion (fdisp); 20 average formant frequency (avgFormant); 21 geometric mean formant frequency (mff); 22 fitch VTL; 23 formant spacing (∆F)

http: // / praat / manual / Voice 2 Jitter.html
http: // / praat / manual / Voice 3 Shimmer.html

The Extreme Random Forest (ERF) is one of the most popular machine learning algorithms used for classification and regression purposes, providing good predictive performance, low over-fitting and easy interpretability [58]. In the case of regression, the ERF works by creating a large number of unpruned decision trees from the training dataset. Predictions are made by averaging the predictions of the decision trees.

Furthermore, it is easy to obtain the contributions of each variable to the decision by computing the impurity of each node. In regression mode, the measure of impurity is the variance. The principal idea is that the more a feature decreases the impurity, the more important the feature is. In the Random Forest (RF), the impurity decreases provided
by each feature can be averaged across trees to determine the feature's importance. In other words, features that are selected at the top of trees are in general more important than features that are selected at the end nodes of trees, as top splits lead to bigger information gains. The ERF can be regarded as an extension of the RF [59, 60, 58]. The main difference is that the RF computes the locally optimal feature/split combination, while the ERF selects a random value for the split for each feature under consideration. Thus, the ERF uses more diversified trees and fewer splitters, so the ERF is much faster than the RF with a reduced tendency to overfit.

In this study, the most suitable hyper-parameters of each ERF model were obtained by exhaustive search over specified parameter values with a cross-validation splitting strategy of 4 folds, evaluated by the mean square error (MSE). Considering that the hyper-parameters would vary with different types of input data, a set of ERF models was designed based on the input data size and the number of input data dimensions. The ERF models applied for the purpose of objective masculinity/femininity rating (ERF Model A in Figure 1) were trained on the 23 acoustic measures extracted from each segment of the input training data. The duration of the speech segments was varied from 1 second to 10 seconds to establish which duration yielded the best objective masculinity/femininity scoring performance (as described in Section 2.2.2). Then, beginning with 23 clusters (corresponding to the 23 acoustic measures), the number of clusters was progressively reduced down to just 1 cluster by the hierarchical clustering method described in Section 2.4. These different numbers of clusters were then used to train a subset of ERF models (ERF Model Sets B in Figure 1) to assess the quality of each cluster reduction.
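The hyper-parameter search described above (exhaustive grid search, 4-fold cross-validation, MSE scoring) maps naturally onto scikit-learn's ExtraTreesRegressor. The grid values and synthetic data below are hypothetical placeholders, not the study's actual settings:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import GridSearchCV

# Toy stand-in for the segments x 23 z-normalised acoustic measures,
# with a rating label driven mainly by the first measure.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 23))
y = 0.8 * X[:, 0] + rng.normal(scale=0.3, size=200)

# Exhaustive search over a (hypothetical) grid, 4-fold CV, scored by MSE.
param_grid = {"n_estimators": [100, 300], "max_depth": [None, 10]}
search = GridSearchCV(
    ExtraTreesRegressor(random_state=0),
    param_grid,
    cv=4,
    scoring="neg_mean_squared_error",
)
search.fit(X, y)
model = search.best_estimator_

# Impurity-based (variance-reduction) importances, one per acoustic measure,
# averaged across trees as described in the text.
importances = model.feature_importances_
```

On this toy data the first feature carries most of the signal, so it dominates the impurity-based importances, mirroring how the final model's feature weights are later read off for characterization.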
The optimal number of clusters was then determined as the minimum number of clusters with negligible reduction in performance (as described in Section 2.4.5). The optimal speech duration and the optimal number of clusters were then used for training a final ERF model (ERF Model C in Figure 1) to extract the feature weights for the acoustic factor characterization.

As mentioned above in Section 2.1.1, two different scales were applied in collecting perceived masculinity/femininity ratings, with the scales ranging from 1 to 10 for Cohort 1, and from 1 to 100 for Cohort 2. To correct for how listeners may have used the masculinity/femininity rating scales differently, the ratings provided by each listener were converted to z-scores. This also enabled the merging of ratings across the two cohorts. The label of each speaker was the mean value of all the z-scored perceived masculinity/femininity ratings given by the listeners.

The audio files in the datasets were separated into the male and female sets, which were used to build the ERF models for each sex independently of the other sex. Based on the literature from Table 1, long-term speech segments were shown to be most suitable for studying the acoustic measures that affect perceived masculinity/femininity ratings, ranging from word-level utterances to utterances of multiple sentences. However, it is not evident what the best speech duration should be. In order to compare performances based on various speech durations, all the recordings were pre-processed to obtain the targeted speakers' utterances and were segmented into 1, 2, 5, 7 and 10 second speech duration datasets.
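The label construction above (per-listener z-scoring so that the differently used 1-10 and 1-100 scales become comparable, then averaging across listeners per speaker) can be sketched as follows; the listener and rating layout is hypothetical:

```python
from statistics import mean, stdev

# Hypothetical raw ratings: {listener_id: {speaker_id: rating}}.
# A Cohort 1 listener rates on 1-10, a Cohort 2 listener on 1-100.
ratings = {
    "L1": {"s1": 2, "s2": 9, "s3": 5},     # 1-10 scale
    "L2": {"s1": 20, "s2": 95, "s3": 50},  # 1-100 scale
}

# Step 1: z-score each listener's ratings, removing scale and usage differences.
zscored = {}
for listener, r in ratings.items():
    mu, sd = mean(r.values()), stdev(r.values())
    zscored[listener] = {spk: (x - mu) / sd for spk, x in r.items()}

# Step 2: each speaker's label is the mean z-scored rating across listeners.
speakers = {spk for r in ratings.values() for spk in r}
labels = {spk: mean(z[spk] for z in zscored.values()) for spk in speakers}
```

After z-scoring, the two listeners' ratings agree closely despite the different scales, so their per-speaker means form a single merged label set.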
Additionally, as the raters provided their perceived masculinity/femininity ratings only on the second sentence of the Rainbow passage (2-3 seconds in duration), the second sentences alone were also regarded as one set of input data, which was compared with the various speech duration segments extracted from all the available utterances.

Given the assumption that the perceived vocal masculinity/femininity rating of one speaker would not vary across different utterances, all the segments provided by the same speaker shared the same label, obtained as described in Section 2.2.1.

For each input dataset (5 durations and the 2nd sentence), the set of 23 widely known acoustic measures (see Section 2.1.2) was extracted for each segment. Each of the 23 acoustic measures was then z-normalised across each input dataset.

2.3. Objective Masculinity/Femininity Rating
To investigate the optimal speech duration which achieves the best performance in objective masculinity/femininity rating, ERF Model A (Figure 1) was trained on the 5 datasets with various speech durations and on the 2nd sentence dataset.

The utterances provided by the speakers were composed of two voice recordings of the Rainbow passage, with a total duration of N_i seconds for speaker i. The utterances of each speaker were then segmented into L-second segments, and for each segment the 23 acoustic measures were extracted as the input data samples. The input dataset details are specified in Table 4. The number of samples was calculated using Equation 1, where S denotes the set of female speakers or male speakers, respectively. The input data size is the number of samples × the 23 acoustic measures.

    Number of Samples = Σ_{i ∈ S} ⌈ N_i / L ⌉    (1)

Table 4: Number of samples for each segment duration across the 96 male and 129 female speakers
Speech Duration L | 96 Males | 129 Females
1 second          | 4341     | 6933
2 seconds         | 2593     | 3720
5 seconds         | 1167     | 1636
7 seconds         | 871      | 1240
10 seconds        | 676      | 957
2nd sentence      | 96       | 129

To evaluate the performance of the ERF model, standard k-fold cross validation was applied for each speaker, with k = 4. The evaluation of the objective masculinity/femininity rating model was carried out by investigating the mean values over the 4 folds of the R, MSE and Pearson correlation coefficient (r) values, using either the training data or the testing data with the subjective masculinity/femininity rating labels. Specifically, within each fold iteration, the ERF model was trained on the training data together with the corresponding subjective masculinity/femininity rating labels. R_train, MSE_train and r_train were calculated from the predictions on the training data, and R_test, MSE_test and r_test from the predictions on the testing data, both using the same trained ERF model. These procedures were repeated 4 times, and the final R_train, R_test, MSE_train, MSE_test, r_train and r_test for each input dataset were obtained as the mean values over the 4 folds.

The optimal speech durations for both sexes were then determined based on the performances of the models. The final objective masculinity/femininity ratings were generated by the model using the optimal speech duration. The details of the implementation are provided in Appendix A.

Using all the samples with the optimal speech duration obtained from the previous step (Section 2.3), this section focuses on addressing the problem of multicollinearity and building independent clusters of acoustic measures.
Figure 2 is a visual depiction of the correlation matrix for the pairs of acoustic measures, u and v, calculated using Pearson's correlation coefficient r(u, v). The correlation matrix is used to (1) apply hierarchical clustering and generate the dendrograms, which will be discussed in Section 3.2.1, and (2) supply an initial overview of the intercorrelations, which will be compared with the final correlation matrix of independent clusters / acoustic measures. From Figure 2 it is evident that severe multicollinearity exists among the 23 acoustic measures, as some of the measures in the same group are highly correlated with each other, such as the measures in the jitter group, the shimmer group and the group of VTL estimators. Although the presence of multicollinearity is a common problem in both females and males, the severity may vary between females and males. Such differences may result in different clustering patterns.

Figure 2: Correlation matrix of acoustic measures for females (left) and males (right), where the (u,v)th element is r(u, v). The numerical values are provided in Appendix D.

Correlation matrix based hierarchical clustering was proposed in study [51] to identify the correlation structures among multiple features. We applied the same hierarchical clustering method to group the acoustic measures into clusters based on the acoustic measures' similarity. The idea is to regard each acoustic measure as an individual cluster and then merge the nearest clusters, until only one cluster remains. The hierarchical clustering approach achieves this by generating a dendrogram, which is a tree-based representation of the 23 acoustic measures. The iterations (shown as the loop in Figure 1), including multicollinearity monitoring, cluster representation and training the ERF Model Sets B, were carried out based on the dendrogram.
The method of generating the dendrogram is as follows. Firstly, the similarities for all pairs of the acoustic measures were measured, by investigating the correlation matrix of these 23 acoustic measures (see Figure 2). Before clustering, we needed to define the distance between two acoustic measures. Differing from study [51], which measured the distance by d(u, v) = 1 − r(u, v), the present study considered another distance measure. The distance d(u, v), measuring the similarity between the two acoustic measures u and v, was defined by the Euclidean distance between all the points of u and v, given in Eq 2. The more similar the two acoustic measures are, the shorter the distance is. In study [61], it was shown that such a distance measure performed better than the one used in [51] in hierarchical clustering tasks. This is because the distance measure d(u, v) = 1 − r(u, v) only considers the correlation between the two acoustic measures, whereas by using the distance measure of Eq 2 we considered all the correlations of the targeted acoustic measure with the other 22 acoustic measures.

$d(u,v) = \sqrt{\sum_{k=1}^{23} \left( |r(u,k)| - |r(v,k)| \right)^2}$   (2)

More generally, it is necessary to introduce a distance measurement called the cophenetic distance, D(s, t), to measure the distance between any two clusters s and t, which may contain multiple acoustic measures. There are different ways to calculate D(s, t); among them, the single, complete, average and centroid linkages have been the most commonly used. In this study, the average linkage method was used to compute D(s, t) between any two clusters s and t, as defined by Eq 3, where u and v denote all the points in the clusters s and t, and N_s and N_t are the cardinalities of the clusters s and t, respectively. The average linkage was chosen because it yielded higher cophenetic correlations than the other methods.
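The row-wise distance of Eq 2 can be sketched directly from a correlation matrix. The 3×3 matrix below is a toy example, not the study's 23-measure matrix:

```python
import numpy as np

def corr_distance_matrix(R):
    """Eq. 2: the distance between measures u and v is the Euclidean
    distance between their rows of absolute correlations, so two
    measures are close when they correlate similarly with every other
    measure, not merely with each other."""
    A = np.abs(R)
    k = A.shape[0]
    D = np.zeros((k, k))
    for u in range(k):
        for v in range(k):
            D[u, v] = np.sqrt(np.sum((A[u] - A[v]) ** 2))
    return D

# Toy correlation matrix: measures 0 and 1 are highly correlated.
R = np.array([[1.0, 0.9, 0.1],
              [0.9, 1.0, 0.2],
              [0.1, 0.2, 1.0]])
D = corr_distance_matrix(R)
print(D)
```

The resulting matrix is symmetric with a zero diagonal, and the highly correlated pair (0, 1) ends up much closer than either is to measure 2, which is what drives the early merges in the dendrogram.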
Besides, it reduced the tendency to produce chain-shaped clusters, which often occurs with single linkage, and average linkage has a higher tolerance for outliers than complete linkage [51].

$D(s,t) = \frac{\sum_{u=1}^{N_s} \sum_{v=1}^{N_t} d(u,v)}{N_s \cdot N_t}$   (3)

The flow of the hierarchical clustering started by finding the shortest distance d_min(u, v) across all pair combinations {u, v} of the 23 initial acoustic measures and merging them into the first cluster c{u, v}. This resulted in 22 clusters: cluster c{u, v} and 21 singleton clusters of the remaining, yet to be merged, acoustic measures. Eq 3 was used to calculate each pair-wise cophenetic distance D among the 22 clusters to find the clusters s and t with the minimum distance D_min(s, t), which were then merged into a new cluster. The clustering stopped when all 23 acoustic measures had formed one cluster. The pseudo-code implementation is provided in Appendix B.

The above procedure generates the dendrogram, which will be discussed in Section 3.2.1. The dendrogram provides the essential information for the subsequent iterations, including the clustering order of the acoustic measures, the clustering patterns and the cophenetic distances.

The severity of multicollinearity was monitored throughout the iterations, which supplies important information for determining the optimal number of clusters. The multicollinearity issue would not significantly affect the prediction of the ERF Model Sets B (see Figure 1), but would strongly and negatively affect the identification of feature importance when using the ERF model. As mentioned above, in the application of the ERF, each tree of the ERF would pick the most discriminant variable, the one with the lowest impurity, while the other correlated variables would be less important in bringing any further variation in masculinity / femininity ratings.
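The merge loop above is what SciPy's average-linkage routine computes from a condensed distance matrix. A minimal sketch, using a toy 4×4 correlation matrix in place of the study's 23 measures:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, cophenet
from scipy.spatial.distance import squareform

# Toy correlation matrix: two correlated pairs, (0,1) and (2,3).
R = np.array([[1.0, 0.9, 0.1, 0.2],
              [0.9, 1.0, 0.2, 0.1],
              [0.1, 0.2, 1.0, 0.8],
              [0.2, 0.1, 0.8, 1.0]])

# Eq. 2 distances between rows of absolute correlations.
A = np.abs(R)
D = np.sqrt(((A[:, None, :] - A[None, :, :]) ** 2).sum(axis=2))

# Average linkage (Eq. 3): the cophenetic distance between two clusters
# is the mean pairwise distance between their members.
Z = linkage(squareform(D, checks=False), method='average')
coph_dists = cophenet(Z)
print(Z)
```

Each row of Z records one merge (the two cluster indices, the merge distance, and the new cluster size), which is exactly the information the dendrograms in Section 3.2.1 visualise; average linkage guarantees the merge distances are non-decreasing.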
Across a large number of trees in the ERF, the overall importance of any highly correlated variables would be reduced. Therefore, addressing the multicollinearity problem is essential for meaningful acoustic characterization using the ERF model.

The Variance Inflation Factor (VIF) is the quotient of the variance in a model with multiple terms by the variance of a model with one term alone [62], and has been widely used to assess the severity of multicollinearity among measures and clusters. In this study, assuming there are k variables $(x_1, x_2, \dots, x_k)$, $1 \le k \le 23$, the data is the correlation matrix of the k variables, with size k × k. Firstly, each variable $x_i$ was regressed on the remaining variables:

$x_i = a_0 + a_1 x_1 + \dots + a_{i-1} x_{i-1} + a_{i+1} x_{i+1} + \dots + a_k x_k + e$   (4)

where $a_0$ is a constant and e is the error term. Secondly, each VIF value $VIF_i$ corresponding to $x_i$ was calculated as

$VIF_i = 1 / (1 - R_i)$   (5)

where $R_i$ is the coefficient of determination of the regression equation. We analysed the magnitude of multicollinearity by considering the size of the VIFs. A rule of thumb is that if the largest VIF is greater than 5, then multicollinearity is high [63, 64].

Initially, before clustering, a VIF was calculated for each of the 23 acoustic measures (k = 23). At the i-th iteration, the acoustic measure variables were grouped into k clusters. Among the k clusters, k_m clusters were composed of multiple acoustic measures, while the remaining (k − k_m) clusters were the initial single acoustic measures. The multiple acoustic measures in each of the k_m clusters were considered to be correlated, and were replaced by a single variable using Principal Component Analysis (PCA) (see Section 2.4.4). In the next iteration, these newly generated variables for each of the k_m clusters, together with the k − k_m single measures, were used to update the correlation matrix, with size k × k, as well as the latest VIFs.
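The per-variable regressions of Eq 4 and Eq 5 have a standard shortcut: the VIFs are the diagonal of the inverse of the correlation matrix. A sketch on synthetic data (not the study's measures), where two columns are nearly collinear and one is independent:

```python
import numpy as np

def vifs(X):
    """VIF_i = 1/(1 - R_i^2), where R_i^2 comes from regressing column i
    on all the other columns; equivalently, the diagonal of the inverse
    of the correlation matrix."""
    R = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(R))

rng = np.random.default_rng(0)
n = 500
a = rng.normal(size=n)
b = a + 0.1 * rng.normal(size=n)   # nearly collinear with a
c = rng.normal(size=n)             # independent of both
X = np.column_stack([a, b, c])
print(vifs(X))
```

The collinear pair produces VIFs far above the rule-of-thumb threshold of 5, while the independent column stays near 1, which is the pattern the iteration loop monitors as clusters are merged.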
According to the dendrogram generated in Section 2.4.2, the iteration started at the first joint point, defined as where the two most correlated acoustic measures were merged, and ended at the last joint point, where all acoustic measures were merged into one cluster.

For each cluster among the k_m clusters, a PCA model was applied to generate one principal component to represent the cluster, by creating one new variable that maximizes the variance of the enclosed acoustic measure values. These newly generated variables, together with the k − k_m single acoustic measures, yielded a k-dimensional vector S_new, which was used to replace the sample S before clustering. By applying this procedure to all the samples, the newly generated S_new were then used to train the ERF Model Sets B (see Figure 1).

2.4.5. Evaluation

The same data splitting strategy and model performance indicators as used in Section 2.3 were used for the purpose of building independent clusters. The evaluation in this section concerned assessing the following three factors as the number of clusters was reduced from 23 to 1: (1) the severity of multicollinearity, (2) the system performance, and (3) clearly interpretable meanings of the clusters. The optimal clusters were then obtained based on the assessments of these three factors.

Firstly, the severity of multicollinearity was monitored by investigating the VIFs within each iteration. It is expected that the VIFs would be extremely high at the beginning of clustering, according to the correlation matrix shown in Figure 2, but the severity of multicollinearity would be relieved once some of the highly correlated measures were clustered and represented by single variables using the PCA models.
The absence of severe multicollinearity was only considered to be achieved when the maximum VIF value was lower than 5 [63, 64].

Secondly, the six performance indicators from Section 2.3 (R_train, R_test, MSE_train, MSE_test, r_train and r_test) were recorded at each iteration, for the training data and the testing data respectively. The best system performance is expected to occur before any clustering, because no information is lost at that stage. The performance should remain unchanged, or only degrade slightly, until any critical independent acoustic measures (such as F0 and F1) are clustered into unrelated groups.

Thirdly, the interpretation of the clusters was evaluated by investigating the enclosed acoustic measures of each cluster. At the beginning of clustering, as the most correlated acoustic measures are grouped, the physical meanings are expected to be very explicit and interpretable, such as the cluster composed of mff and Fitch f (r = .99 for both males and females, as shown in Figure 2 and Appendix D), which are estimators of the physical vocal-tract length (VTL). With larger and fewer clusters, the interpretation of each cluster becomes less clear. According to the current literature, the final determined optimal clusters should confirm that: (1) F0 mean and the VTL estimators appear in two independent clusters [6, 9]; (2) F0 mean and Fn mean are in different clusters [10, 37, 39]; and (3) the 5 jitter measures are grouped in one cluster, as similarly are the 4 shimmer measures.

As stated above, our method of building independent clusters of acoustic measures aims to eliminate multicollinearity while preserving the best system performance and retaining the clearly interpretable meanings of the clusters. The optimal number of clusters was determined by assessing the above three aspects and was used for the characterisation purpose.
The criteria for judging the optimal number of clusters include: • No severe multicollinearity exists, i.e. all the VIF values of the clusters must be lower than 5. • The system performance must be no worse than the best performance obtained before clustering. • The physical meaning of each cluster must be easy to interpret, without ambiguity.
In the characterization, the best speech duration (from Section 2.3) and the optimal number of clusters (from Section 2.4) were used to train the ERF model C (see Figure 1) and generate the cluster weights that determine the feature importance of each cluster. The ERF model C was fine-tuned by using the most suitable hyper-parameters for the clustered data, with dimensionality equal to the optimal number of clusters, as described in Section 2.4.5. The results of the characterization were evaluated by comparison with what is known from the literature, summarised in Table 1. The details of the implementation of building independent clusters (Section 2.4) and the characterization are provided in Appendix C.
3. Results and Discussion

3.1. Objective Masculinity / Femininity Rating
The best performance of each input dataset occurs prior to clustering. This is because the ERF explores all the information across the full dimension range, whereas the clustering together with the PCA representation reduces the dimensions, which results in some information being lost. Table 5 shows the system performance using different speech durations (1-, 2-, 5-, 7- and 10-second durations). As the human raters provided their ratings of masculinity and femininity based on listening to only the second sentence of speech for each speaker, the result of using the second sentence speech is also provided in Table 5 for comparison. The hyper-parameters of all the ERF models were fine-tuned to minimise model over-fitting.

Table 5: System performance using different speech durations in females and males

Female
Time frame (s)   1     2     5     7     10    2nd sentence   7 (LR)
R_train          .43   .52   .61   .73   .55   .68            .30
R_test           .20   .27   .35   .37   .35   .19            .26
MSE_train        .20   .17   .14   .09   .16   .11            .24
MSE_test         .27   .25   .23   .22   .23   .28            .26
r_train          .72   .79   .84   .91   .81   .90            .55
r_test           .46   .54   .61   .63   .62   .46            .51

Male
Time frame (s)   1     2     5     7     10    2nd sentence   7 (LR)
R_train          .65   .77   .53   .78   .50   .84            .50
R_test           .40   .48   .47   .57   .44   .35            .46
MSE_train        .21   .13   .28   .13   .30   .09            .29
MSE_test         .35   .31   .32   .25   .33   .37            .31
r_train          .83   .91   .75   .90   .74   .95            .70
r_test           .64   .70   .70   .77   .68   .63            .68

The analyses showed that model performance was better for the speech duration of 7 seconds than for the other durations, in both sexes. This outcome is consistent with the findings of previous studies [5, 26] that long-term speech is better than short-term speech for investigating the relationship between perceived vocal masculinity / femininity ratings and features extracted from acoustic signal analysis. Using just binary labels for biological sex, our previous work [28] demonstrated that speakers' vocal masculinity and femininity could be objectively rated by an LDA model (r = .51 in females and r = .67 in males). In contrast, the current work, using an ERF regression model to predict speakers' vocal masculinity and femininity, demonstrated a more promising result, with r_test = .63 in females and r_test = .77 in males.

As shown in Table 5, when the 2nd sentence speech data were used, there was a large discrepancy in system performance between the training data and the testing data. The discrepancy indicates that even though the ERF model was correctly trained, the model exhibited over-fitting when the 2nd sentence speech data were used. One possible explanation is that the 2nd sentence dataset is much smaller than the other input datasets with different speech durations. In fact, each speaker provided 2 recordings of the same passage of approximately 30 seconds duration each, while only the speech of the 2nd sentence selected from one recording was rated by the human listeners.
The speech duration of the rated 2nd sentence was only about 3 seconds, approximately 1/20 of each speaker's total speech. A discrepancy between training and testing performance was also observed for the 7-second duration data, with R_train = .73 and R_test = .37 for females, and R_train = .78 and R_test = .57 for males. Therefore, we evaluated the model for possible over-fitting by comparing the ERF model results with those obtained from a Linear Regression (LR) model, using the 7-second duration data for training and testing, and 4-fold cross validation for both models. The results demonstrated that the performance of the ERF using the 7-second duration data (see column '7' in Table 5) is better than that of the LR (see column '7 (LR)' in Table 5), for both the training and testing data, even though there is less over-fitting using the LR (e.g. R_train = .30 and R_test = .26).

We present a possible explanation for the discrepancy between the results for the training data and the testing data in the ERF model using the 7-second speech duration data. The perceived masculinity / femininity ratings were provided based on only the second sentence of the speech (almost 3 seconds), which was about 5% of the entire speech length for each speaker (almost 60 seconds). The mean value of the perceived masculinity / femininity ratings for this speech segment was used to label all the speech segments from each speaker and was regarded as the "ground truth". By using this generalization method, we were not limited to using only the second sentence (i.e. 225 samples, each 2-3 seconds in duration) in model training and testing. Instead, a much larger dataset of speech samples could be used (e.g. 2111 samples for the 7-second duration), which significantly benefited the ERF model training. However, the side effect was that all the segments from one speaker shared one perceived masculinity / femininity rating. Variation may exist throughout a 30 second recording which may influence the masculinity / femininity rating for any 2-3 second segment.
In the present study, the ERF model was given a collection of samples from each speaker, with different values of the acoustic measures expressed across the samples, but was forced to regress these different samples onto one specific masculinity / femininity rating. This limited the model's ability to precisely predict the rating for an unknown sample, which may have contributed to the gap between the training data and testing data results.

Also of note is the difference in model performance for males and females, with r_test = .63 in females and r_test = .77 in males for the 7-second duration data. This suggests that the ERF model tends to predict masculinity ratings for males better than femininity ratings for females. One possible reason for the difference in R, MSE and r between males and females is that the human raters in this study were better at using vocal cues in judging masculinity in males than in judging femininity in females. Lippa [65] reported that the correlation between observers' masculinity ratings of speakers' voices and speakers' masculinity in terms of their personalities was stronger for males (r = .59) than for females (r = .49). In contrast, Lippa reported a stronger correlation between observers' masculinity ratings of speakers' faces (presented without voice) and speakers' masculinity in personality styles for females (r = .64) than for males (r = .39). These results suggest that observers relied on vocal cues in making judgements of masculinity in males and relied on visual cues in making femininity judgements in females. Similarly, in a study of facial gender scoring [27], the objective ratings were more correlated with human perceived ratings in females (r = .90) than in males (r = .79). The results of the present study are consistent with the prior literature in that machine learning models, using acoustic measures as vocal cues, are more powerful in predicting masculinity for males than in predicting femininity for females.
Using all the samples with the optimal speech duration of 7 seconds, this section provides the results and analysisof the hierarchical clustering and the determined independent clusters.
Figure 3: Dendrograms of acoustic measures in females (left) and males (right)
Figure 3 shows the dendrograms of the hierarchical clustering for the female and male data. The common patterns shown in both males and females are: (1) all jitter-related measures were grouped into one cluster (D = .43 in females and D = .59 in males, shown in green in Figure 3); (2) all shimmer-related measures except for apq11shimmer were grouped into one cluster (D = .42 in females and D = .41 in males, shown in yellow in Figure 3); (3) F4 mean and fdisp were highly correlated and formed one cluster (D = .15 in females and D = .32 in males, shown in red in Figure 3); and (4) each single measure of F0 mean, F0 SD, F1 mean and F2 mean had D over 1 when merged with other clusters and measures (shown as the leaves of the blue branches in Figure 3), and so was retained as an individual measure.

As mentioned in Section 2.4.3, it is only when the maximum value of the VIFs is below 5 that the risk of multicollinearity is not considered a serious problem when using regression algorithms. Figure 4 presents the VIF values as the number of clusters decreases gradually, where the solid blue line shows the mean of the VIF values, the blue ribbon illustrates the range of the VIF values, and the red horizontal line indicates the VIF threshold, which is set to 5. It is not surprising that the VIF values decreased more sharply at the beginning of clustering (as the highly correlated acoustic measures were grouped first) and converged more gradually to a low value at the end. For females, the maximum VIF value was below 5 when the number of clusters was at most 9, and for males at most 8. This yielded 9 clusters for females and 8 clusters for males as the meaningful acoustic factors.
Figure 4: Relationships between VIFs and the number of clusters in females (left) and males (right)
The PCA was applied to each cluster to generate a new variable representing the cluster. The correlation matrices of the clustered acoustic measures, with 9 clusters for females and 8 clusters for males, are shown in Figure 5. Table 6 shows the clusters, their enclosed acoustic measures and the VIF values for females and males. Comparing Figure 5 with the correlation matrix before clustering (illustrated in Figure 2 and Appendix D), Figure 5 shows that no severe multicollinearity existed after clustering the 23 acoustic measures into 9 independent clusters for females and 8 independent clusters for males. It should be noted that the highest intercorrelations, appearing among three clusters in females (see Figure 5), namely the jitter measures, the shimmer measures and HNR, also yielded the highest VIF values after clustering, but these were all less than 5 (see Table 6).
Figure 5: Correlation matrix of clustered acoustic measures in females (left) and males (right)

Table 6: Enclosed acoustic measures and VIFs of the independent clusters for females (left) and males (right)

Female
Enclosed acoustic measures            VIF
F2 mean                               1.86
jitter measures                       2.22
shimmer measures                      2.86
HNR                                   2.50
apq11 shimmer                         1.65
F0 mean                               1.46
F0 SD                                 1.09
F1 mean                               1.09
F3 mean, F4 mean and VTL estimators   1.81

Male
Enclosed acoustic measures            VIF
F0 SD                                 1.61
jitter measures                       1.69
HNR, shimmer measures                 2.05
apq11 shimmer                         1.42
F3 mean, F4 mean and VTL estimators   1.50
F2 mean                               1.45
F1 mean                               1.14
F0 mean                               1.62
As the acoustic measures were merged and the VIFs were updated in each iteration loop, the 9 independent clusters for females and the 8 independent clusters for males were formed at the nodes above the threshold cophenetic distance (shown as the black line in Figure 3), corresponding to Table 6, where the VIF values were all lower than 5. The optimal clusters and their enclosed acoustic measures included three big clusters: {all the jitter measures}, {all the shimmer measures (except for apq11shimmer)}, and {all the VTL estimators, together with F3 mean and F4 mean}, as described in Section 3.2.1. F0 mean, F0 SD, F1 mean, F2 mean and apq11 shimmer were the five acoustic measures that remained independent of the other measures. The only difference between the two sexes is that HNR was grouped into the cluster of shimmer measures for males, but not for females. It is interesting that, for females, the clustering of HNR with the group of shimmer measures would have occurred at the next clustering iteration, when the number of clusters reduced to 8.

As shown in Table 6, the interpretable meanings of the independent groups of acoustic measures still hold after the 9 clusters for females and the 8 clusters for males were formed. The results were evaluated by reviewing their consistency or inconsistency with the published literature.

• This study supports previous findings that F0 mean and the VTL estimators are independent of each other [6, 9], as they appeared in two independent clusters.
• This study is consistent with a previous study [40] in that F0 mean, F1 mean and F2 mean are independent of each other and of the other acoustic measures. The correlation between F3 mean and F4 mean was found to be strong in females (r = .75) and moderate in males (r = .47).
• Regarding the relationships between F3 mean and the VTL estimators, F3 mean was highly correlated with the VTL estimators in both females and males, with r(F3 mean, ∆F) = .87 and r(F3 mean, avgFormant) = .87 in females, and r(F3 mean, ∆F) = .70 and r(F3 mean, avgFormant) = .69 in males, as shown in the original correlation matrix in Appendix D. Consequently, F3 mean was grouped with the VTL estimators in clustering. This finding aligns with the results of [41] that F3 mean varies as a function of VTL across talkers and is present in different types of speech (e.g., whispered, nonphonated speech). Our result that F3 mean is highly correlated with the VTL estimators supports the conclusion drawn in a previous study [42] that F3 mean provides a good estimator of VTL in automatic speech recognition.
• In the present study, F4 mean was found to be more highly correlated with the VTL estimators than was F3 mean. In females, r(F4 mean, fdisp) = .95, r(F4 mean, ∆F) = .92, r(F4 mean, avgFormant) = .88, and the absolute values of the correlations with the rest of the VTL estimators were all above .72. In males, r(F4 mean, fdisp) = .93, r(F4 mean, ∆F) = .89, r(F4 mean, avgFormant) = .81, and the absolute values of the correlations with the rest of the VTL estimators were all above .54. This evidence that F4 mean was more correlated with the VTL estimators than was F3 mean is consistent with the results of [43], where VTL was strongly correlated with individual formant values, particularly the higher formants (r(F4 mean, Fitch f) in a range of -.49 to -.95). Therefore, F4 mean was grouped with the VTL estimators at an earlier stage than F3 mean, as shown in the dendrogram (Figure 3). Furthermore, F3 mean and F4 mean were found to be more highly correlated with the VTL estimators in females than in males.
Specifically, as shown in the correlation matrix (see Appendix D), in females r(F3 mean, ∆F) = .87 and r(F4 mean, ∆F) = .92, while in males r(F3 mean, ∆F) = .70 and r(F4 mean, ∆F) = .89.
• To the best of our knowledge, this is the first study statistically demonstrating that F0 SD can be regarded as independent of all the other acoustic measures, with very low VIFs of 1.09 and 1.61 for females and males respectively (Table 6). It is noticeable that F0 SD was moderately correlated with F0 mean for males (r = .41), but had negligible correlation with F0 mean for females (r = -.15).
• The voice perturbation measures, including HNR, jitter and shimmer, were found to be independent factors for females, but not for males, for whom HNR was grouped with shimmer. In fact, however, the correlations among these three perturbation measures are still moderate rather than low for females (r(HNR, jitter) = -.63, r(HNR, shimmer) = -.66, r(jitter, shimmer) = .67, as observed in Figure 5). A reason for some correlation among these three factors was given in study [66]: the sources of periodicity perturbations can be divided into four classes — (1) pulse frequency perturbations, i.e. jitter, (2) pulse amplitude perturbations, i.e. shimmer, (3) additive noise, and (4) waveform variations — and these four classes exhibit correlation. Furthermore, HNR has been proposed as a measure of the amount of additive noise in the acoustic waveform, and many studies have stated that HNR depends not only on additive noise, but also on jitter and shimmer [44, 45, 46]. These three kinds of measures may represent different aspects of voice quality; for example, shimmer and HNR can be used to classify the degree of roughness and breathiness in voice production [67].

Figure 6 presents the relationships between model performance (R, MSE and correlation) and the number of clusters, with orange lines representing females and blue lines representing males.
The solid lines show the performance when using the training data and the dashed lines show the performance when using the testing data.

Figure 6: Relationships between model performance and number of clusters
Table 7 shows the VIFs, R scores, MSE and correlations without clustering and with clustering in males and females, using the 7-second speech data. These results show that the best system performance was achieved when using all the measures rather than the independent clusters. This is because, after clustering, the data dimensionality was drastically reduced, from the original 23 dimensions to 9 dimensions for females and 8 dimensions for males. However, after clustering, the redundant information carried by the highly correlated measures was eliminated, and the essential information was highly condensed within each independent group. Therefore, the R curves in Figure 6 become flat as the number of clusters drops from 22 to 9 for females and 8 for males, and the system performance (Table 7) does not reduce dramatically compared to the results before clustering. It is interesting to point out that the results demonstrated in Figure 6 are consistent with the findings reported in Section 3.2.2 that the optimal number of clusters was 9 for females and 8 for males (shown as the star points in Figure 6).

Table 7: VIFs, R, MSE and correlations without clustering and with clustering in males and females

Gender   Cluster type         Data       VIFs   R     MSE   Correlation
Female   without clustering   training   > 5    .73   .09   .91
                              testing           .37   .22   .63
         9 clusters           training   < 5    …     …     …
                              testing           …     …     …
Male     without clustering   training   > 5    .78   .13   .90
                              testing           .57   .25   .77
         8 clusters           training   < 5    …     …     …
                              testing           …     …     …

These results also imply that the objective masculinity / femininity rating is more dependent on the discriminative quality of the measures, or of the groups of measures, than on the number of measures, as is evident when comparing the use of the optimal number of highly independent groups of measures with the use of the 23 highly inter-dependent measures. It also implies that F0 mean, F0 SD, F2 mean and the group of F3 mean, F4 mean and the VTL estimators are the important factors in judging masculinity and femininity, as the system performance became poor when any of them was merged with other measures.

Using these independent and interpretable clusters, we next analysed their contributions in predicting the masculinity / femininity ratings. The cluster weights were obtained by applying the ERF model to the 9 clusters for the female data and the 8 clusters for the male data, with a speech duration of 7 seconds. The weights of each independent cluster are shown in Figure 7, where the rank of importance of each cluster is presented counter-clockwise, starting from the 12 o'clock direction, and the legend shows the rank and the enclosed acoustic measures of each cluster.
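Obtaining cluster weights of this kind can be sketched with scikit-learn's impurity-based feature importances, assuming ExtraTreesRegressor as the ERF implementation. The four cluster representatives and the simulated rating below are hypothetical stand-ins, not the study's data:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

rng = np.random.default_rng(3)
n = 400
# Hypothetical independent cluster representatives (stand-ins for
# F0 mean, the VTL-related cluster, etc. after PCA representation).
X = rng.normal(size=(n, 4))
# Simulated rating dominated by the first representative.
y = 1.0 * X[:, 0] + 0.3 * X[:, 1] + 0.1 * rng.normal(size=n)

erf = ExtraTreesRegressor(n_estimators=200, random_state=0).fit(X, y)
weights = erf.feature_importances_   # normalised importances, summing to 1
print(weights)
```

Because the representatives are decorrelated beforehand, the importance mass is not split across redundant copies of the same information, which is precisely why the multicollinearity removal of Section 2.4 matters for this step.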
Figure 7: Cluster weights in females (left) and males (right)

The weight of the most important factor differs between males and females: F0 mean plays a more important role in assessing the degree of masculinity in males (43.54%) than in assessing the degree of femininity in females (23.8%). The second most important factor in masculinity / femininity rating for both males and females is the group of VTL-related acoustic measures, including F3 mean, F4 mean and the VTL estimators (12.95% in females, 16.36% in males). This result agrees with the previous literature [5, 6, 9, 10, 13, 12] that these enclosed acoustic measures (F3 mean, F4 mean and the VTL estimators) are correlated with perceived masculinity / femininity ratings.

Apart from the same top two factors in males and females, F1 mean, F2 mean and F0 SD are the next three important factors, with roughly equivalent weights (≈ 11% each in females and ≈ 7% each in males). One might expect F2 mean to contribute more to masculinity / femininity ratings than either F1 mean or F3 mean. In fact, it was found in the present study that the weights of F2 mean and F1 mean were equivalent, whether predicting femininity for females (11.14% vs. 11.23%) or predicting masculinity for males (7.74% vs. 7.17%). Further, F3 mean, together with the other highly correlated acoustic measures (F4 mean and the VTL estimators), carries more weight than either F2 mean or F1 mean in the masculinity / femininity rating prediction, especially in the judgement of femininity. Surprisingly, the present study is the first to find the importance of F0 SD in predicting masculinity / femininity ratings. F0 SD, as one of the independent factors, is as critical as F1 mean and F2 mean in accounting for variance in masculinity / femininity ratings. F0 SD has received little attention in previous research on the classification of sex from speech or the assessment of vocal masculinity / femininity.

Finally, the vocal perturbation related acoustic measures, which include HNR, the shimmer measures and the jitter measures, are the least important factors in predicting masculinity / femininity ratings. This finding aligns with the studies [19, 20] in suggesting that perturbation measures do not affect perceived vocal masculinity or femininity. According to studies [15, 18], perturbation measures are able to differentiate between males and females because the female voice has significantly greater jitter and less shimmer than the male voice. This between-class difference was validated by Andrews and Schmidt [68], who reported that voice samples produced by females were perceived to be more breathy and less hoarse than those produced by males; breathiness has been associated with increased jitter [20], and shimmer values have been suspected to be a measure of vocal hoarseness.
However, it seems that jitter and shimmer may onlybe regarded as discriminators between the two classes of males and female, but not salient factors that influence theperceived masculinity for males or perceived femininity for females.
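The independent clusters discussed above were obtained with correlation-matrix-based hierarchical clustering. As a rough, hypothetical sketch of that idea (synthetic data and made-up column names standing in for the study's 23 acoustic measures, and SciPy's average-linkage routines standing in for the exact implementation), strongly correlated measures can be grouped like this:

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Synthetic stand-in data: five made-up columns are enough to show
# how correlated measures end up in the same cluster.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
f3 = rng.normal(size=200)
X = pd.DataFrame({
    "F0_mean": base + 0.05 * rng.normal(size=200),
    "F0_SD": base + 0.05 * rng.normal(size=200),
    "F3_mean": f3,
    "F4_mean": f3 + 0.05 * rng.normal(size=200),
    "jitter": rng.normal(size=200),
})

# Distance between two measures is 1 - |r|, so strongly correlated
# measures (e.g. F3_mean and F4_mean) are merged first.
dist = 1.0 - X.corr().abs().values
np.fill_diagonal(dist, 0.0)
Z = linkage(squareform(dist, checks=False), method="average")

# Cut the dendrogram into three clusters; correlated pairs share a label.
labels = dict(zip(X.columns, fcluster(Z, t=3, criterion="maxclust")))
print(labels)
```

With the 1 − |r| distance, measures that nearly duplicate each other's information (here F3_mean/F4_mean and F0_mean/F0_SD) collapse into single clusters, which is what removes the multicollinearity before the regression stage.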
4. Conclusions
This study investigated a novel model framework with the following objectives: (1) objectively rating speakers' vocal masculinity and femininity from a set of acoustic measures, (2) building independent groups of acoustic measures with interpretable physical meanings to eliminate multicollinearity, and (3) characterizing the salient acoustic measures associated with the subjective masculinity/femininity ratings. The model provided promising results in that the objective masculinity/femininity ratings were strongly correlated with the perceived masculinity/femininity ratings when using data with the optimal speech duration of 7 seconds. Moreover, the model built 9 and 8 independent, meaningful groups of acoustic measures for females and males respectively, which performed no worse in predicting the subjective masculinity/femininity ratings than the 23 acoustic measures before clustering. Based on the optimal speech duration and the independent, meaningful groups of acoustic measures, the model provides the weights of the groups in predicting perceived vocal masculinity and femininity. The results revealed that F0 mean and the group comprising F3 mean, F4 mean and the VTL estimators are the top two characteristics affecting the judgement of speakers' vocal masculinity and femininity, with F0 mean being the most significant factor in assessing masculinity in males. F1 mean, F2 mean and F0 standard deviation share similar importance, and the voice perturbation measures, including HNR, jitter and shimmer, are the least important. A key limitation of this study is that the labels of the subjective masculinity/femininity ratings in the datasets were provided for a specific utterance that represented only about 5% of the total utterances. Datasets with fully labelled subjective masculinity/femininity ratings would be more ideal for building such a machine learning model.
Furthermore, this study was limited by the stimulus type, considering that human raters may use other paralinguistic features, such as tone, to assess the speakers' masculinity/femininity. Therefore, future studies should consider alternative features, such as higher-level prosodic or linguistic features.

References

[1] W. T. Fitch, J. Giedd, Morphology and development of the human vocal tract: A study using magnetic resonance imaging, The Journal of the Acoustical Society of America 106 (1999) 1511–1522.
[2] B. Fink, N. Neave, H. Seydel, Male facial appearance signals physical strength to women, American Journal of Human Biology 19 (2007) 82–87.
[3] J. Hönekopp, U. Rudolph, L. Beier, A. Liebert, C. Müller, Physical attractiveness of face and body as indicators of physical fitness in men, Evolution and Human Behavior 28 (2007) 106–111.
[4] M. M. Samson, I. Meeuwsen, A. Crowe, J. Dessens, S. A. Duursma, H. Verhaar, Relationships between physical performance measures, age, height and body weight in healthy adults, Age and Ageing 29 (2000) 235–242.
[5] V. Cartei, R. Bond, D. Reby, What makes a voice masculine: Physiological and acoustical correlates of women's ratings of men's vocal masculinity, Hormones and Behavior 66 (2014) 569–576.
[6] D. R. Feinberg, B. C. Jones, M. L. Smith, F. R. Moore, L. M. DeBruine, R. E. Cornwell, S. Hillier, D. I. Perrett, Menstrual cycle, trait estrogen level, and masculinity preferences in the human voice, Hormones and Behavior 49 (2006) 215–222.
[7] A. C. Little, J. Connely, D. R. Feinberg, B. C. Jones, S. C. Roberts, Human preference for masculinity differs according to context in faces, bodies, voices, and smell, Behavioral Ecology 22 (2011) 862–868.
[8] D. R. Feinberg, L. M. DeBruine, B. C. Jones, A. C. Little, Correlated preferences for men's facial and vocal masculinity, Evolution and Human Behavior 29 (2008) 233–241.
[9] D. R. Feinberg, B. C. Jones, A. C. Little, D. M. Burt, D. I. Perrett, Manipulations of fundamental and formant frequencies influence the attractiveness of human male voices, Animal Behaviour 69 (2005) 561–568.
[10] K. Pisanski, D. Rendall, The prioritization of voice fundamental frequency or formants in listeners' assessments of speaker size, masculinity, and attractiveness, The Journal of the Acoustical Society of America 129 (2011) 2201–2212.
[11] S. J. Ko, C. M. Judd, I. V. Blair, What the voice reveals: Within- and between-category stereotyping on the basis of voice, Personality and Social Psychology Bulletin 32 (2006) 806–819.
[12] T. L. Hardy, J. M. Rieger, K. Wells, C. A. Boliek, Acoustic predictors of gender attribution, masculinity–femininity, and vocal naturalness ratings amongst transgender and cisgender speakers, Journal of Voice 34 (2020) 300.e11.
[13] M. P. Gelfer, K. J. Schofield, Comparison of acoustic and perceptual measures of voice in male-to-female transsexuals perceived as female versus those perceived as male, Journal of Voice 14 (2000) 22–33.
[14] B. Munson, The acoustic correlates of perceived masculinity, perceived femininity, and perceived sexual orientation, Language and Speech 50 (2007) 125–142.
[15] J. P. Teixeira, P. O. Fernandes, Jitter, shimmer and HNR classification within gender, tones and vowels in healthy voices, Procedia Technology 16 (2014) 1228–1237.
[16] A. Lovato, W. De Colle, L. Giacomelli, A. Piacente, L. Righetto, G. Marioni, C. de Filippis, Multi-dimensional voice program (MDVP) vs Praat for assessing euphonic subjects: a preliminary study on the gender-discriminating power of acoustic analysis software, Journal of Voice 30 (2016) 765.e1.
[17] D. Sorensen, Y. Horii, Frequency and amplitude perturbation in the voices of female speakers, Journal of Communication Disorders 16 (1983) 57–61.
[18] M. Biemans, Gender variation in voice quality, Netherlands Graduate School of Linguistics, 2000.
[19] K. Owen, A. B. Hancock, The role of self- and listener perceptions of femininity in voice therapy, International Journal of Transgenderism 12 (2010) 272–284.
[20] R. S. King, G. R. Brown, C. R. McCrea, Voice parameters that result in identification or misidentification of biological gender in male-to-female transgender veterans, International Journal of Transgenderism 13 (2012) 117–130.
[21] M. Sedaghi, A comparative study of gender and age classification in speech signals (2009).
[22] A. Raahul, R. Sapthagiri, K. Pankaj, V. Vijayarajan, Voice based gender classification using machine learning, in: Materials Science and Engineering Conference Series, volume 263, 2017, p. 042083.
[23] V. Svetnik, A. Liaw, C. Tong, J. C. Culberson, R. P. Sheridan, B. P. Feuston, Random forest: a classification and regression tool for compound classification and QSAR modeling, Journal of Chemical Information and Computer Sciences 43 (2003) 1947–1958.
[24] Y. Li, L. Chao, Y. Liu, W. Bao, J. Tao, From simulated speech to natural speech, what are the robust features for emotion recognition?, in: 2015 International Conference on Affective Computing and Intelligent Interaction (ACII), IEEE, 2015, pp. 368–373.
[25] M. M. Ramadhan, I. S. Sitanggang, F. R. Nasution, A. Ghifari, Parameter tuning in random forest based on grid search method for gender classification based on voice frequency, DEStech Transactions on Computer Science and Engineering (2017).
[26] H. Harb, L. Chen, Voice-based gender identification in multimedia applications, Journal of Intelligent Information Systems 24 (2005) 179–198.
[27] S. Z. Gilani, K. Rooney, F. Shafait, M. Walters, A. Mian, Geometric facial gender scoring: objectivity of perception, PLoS ONE 9 (2014).
[28] F. Chen, R. Togneri, M. Maybery, D. Tan, An objective voice gender scoring system and identification of the salient acoustic measures, Proc. Interspeech 2020 (2020) 1848–1852.
[29] A. Ajesh, J. Nair, P. Jijin, A random forest approach for rating-based recommender system, in: 2016 International Conference on Advances in Computing, Communications and Informatics (ICACCI), IEEE, 2016, pp. 1293–1297.
[30] S. Baumann, B. Winter, What makes a word prominent? Predicting untrained German listeners' perceptual judgments, Journal of Phonetics 70 (2018) 20–38.
[31] J. Rong, G. Li, Y.-P. P. Chen, Acoustic feature selection for automatic emotion recognition from speech, Information Processing & Management 45 (2009) 315–328.
[32] W.-H. Cao, J.-P. Xu, Z.-T. Liu, Speaker-independent speech emotion recognition based on random forest feature selection algorithm, in: 2017 36th Chinese Control Conference (CCC), IEEE, 2017, pp. 10995–10998.
[33] E. Vaiciukynas, A. Verikas, A. Gelzinis, M. Bacauskiene, K. Vaskevicius, V. Uloza, E. Padervinskis, J. Ciceliene, Fusing various audio feature sets for detection of Parkinson's disease from sustained voice and speech recordings, in: International Conference on Speech and Computer, Springer, 2016, pp. 328–337.
[34] M. Xiao, H. Yan, J. Song, Y. Yang, X. Yang, Sleep stages classification based on heart rate variability and random forest, Biomedical Signal Processing and Control 8 (2013) 624–633.
[35] D. E. Farrar, R. R. Glauber, Multicollinearity in regression analysis: the problem revisited, The Review of Economics and Statistics (1967) 92–107.
[36] R. K. Paul, Multicollinearity: Causes, effects and remedies, IASRI, New Delhi (2006) 58–65.
[37] G. Fant, Acoustic theory of speech production, 2, Walter de Gruyter, 1970.
[38] W. T. Fitch, The evolution of speech: a comparative review, Trends in Cognitive Sciences 4 (2000) 258–267.
[39] J. Müller, W. Baly, The physiology of the senses, voice, and muscular motion, with the mental faculties, Taylor, Walton & Maberly, 1848.
[40] E. N. MacDonald, D. W. Purcell, K. G. Munhall, Probing the independence of formant control using altered auditory feedback, The Journal of the Acoustical Society of America 129 (2011) 955–965.
[41] P. J. Monahan, W. J. Idsardi, Auditory sensitivity to formant ratios: Toward an account of vowel normalisation, Language and Cognitive Processes 25 (2010) 808–839.
[42] T. Claes, I. Dologlou, L. ten Bosch, D. Van Compernolle, A novel feature transformation for vocal tract length normalization in automatic speech recognition, IEEE Transactions on Speech and Audio Processing 6 (1998) 549–557.
[43] W. Tecumseh Fitch, D. Reby, The descended larynx is not uniquely human, Proceedings of the Royal Society of London. Series B: Biological Sciences 268 (2001) 1669–1675.
[44] G. de Krom, A cepstrum-based technique for determining a harmonics-to-noise ratio in speech signals, Journal of Speech, Language, and Hearing Research 36 (1993) 254–266.
[45] H. Muta, T. Baer, K. Wagatsuma, T. Muraoka, H. Fukuda, A pitch-synchronous analysis of hoarseness in running speech, The Journal of the Acoustical Society of America 84 (1988) 1292–1301.
[46] Y. Qi, R. E. Hillman, Temporal and spectral estimations of harmonics-to-noise ratio in human voice signals, The Journal of the Acoustical Society of America 102 (1997) 537–543.
[47] J. Neter, W. Wasserman, M. Kutner, et al., Simultaneous inferences and other topics in regression analysis-1, Applied Linear Regression Models (1983) 150–153.
[48] R. H. Myers, Classical and modern regression with applications, volume 2, Duxbury Press, Belmont, CA, 1990.
[49] D. J. Ketchen, C. L. Shook, The application of cluster analysis in strategic management research: an analysis and critique, Strategic Management Journal 17 (1996) 441–458.
[50] C. C. Bridges Jr, Hierarchical cluster analysis, Psychological Reports 18 (1966) 851–854.
[51] X. Liu, X.-H. Zhu, P. Qiu, W. Chen, A correlation-matrix-based hierarchical clustering method for functional connectivity analysis, Journal of Neuroscience Methods 211 (2012) 94–102.
[52] D. W. Tan, S. N. Russell-Smith, J. M. Simons, M. T. Maybery, D. Leung, H. L. Ng, A. J. Whitehouse, Perceived gender ratings for high and low scorers on the autism-spectrum quotient consistent with the extreme male brain account of autism, PLoS ONE 10 (2015).
[53] G. Fairbanks, Voice and articulation drillbook, Harper & Brothers, 1940.
[54] D. A. Puts, C. L. Apicella, R. A. Cárdenas, Masculine voices signal men's threat potential in forager and industrial societies, Proceedings of the Royal Society B: Biological Sciences 279 (2012) 601–609.
[55] W. T. Fitch, Vocal tract length and formant frequency dispersion correlate with body size in rhesus macaques, The Journal of the Acoustical Society of America 102 (1997) 1213–1222.
[56] D. R. Smith, R. D. Patterson, The interaction of glottal-pulse rate and vocal-tract length in judgements of speaker size, sex, and age, The Journal of the Acoustical Society of America 118 (2005) 3177–3186.
[57] D. Reby, K. McComb, Anatomical constraints generate honesty: acoustic cues to age and weight in the roars of red deer stags, Animal Behaviour 65 (2003) 519–530.
[58] P. Geurts, D. Ernst, L. Wehenkel, Extremely randomized trees, Machine Learning 63 (2006) 3–42.
[59] L. Breiman, J. Friedman, C. J. Stone, R. A. Olshen, Classification and regression trees, CRC Press, 1984.
[60] L. Breiman, Random forests, Machine Learning 45 (2001) 5–32.
[61] Y. Gu, C. Wang, A study of hierarchical correlation clustering for scientific volume data, in: International Symposium on Visual Computing, Springer, 2010, pp. 437–446.
[62] G. James, D. Witten, T. Hastie, R. Tibshirani, An introduction to statistical learning, volume 112, Springer, 2013.
[63] T. A. Craney, J. G. Surles, Model-dependent variance inflation factor cutoff values, Quality Engineering 14 (2002) 391–403.
[64] S. Sheather, A modern approach to regression with R, Springer Science & Business Media, 2009.
[65] R. Lippa, The naive perception of masculinity-femininity on the basis of expressive cues, Journal of Research in Personality 12 (1978) 1–14.
[66] P. J. Murphy, Perturbation-free measurement of the harmonics-to-noise ratio in voice signals using pitch synchronous harmonic analysis, The Journal of the Acoustical Society of America 105 (1999) 2866–2881.
[67] L. W. Lopes, D. P. Cavalcante, P. O. d. Costa, Severity of voice disorders: integration of perceptual and acoustic data in dysphonic patients, in: CoDAS, volume 26, SciELO Brasil, 2014, pp. 382–388.
[68] M. L. Andrews, C. P. Schmidt, Gender presentation: Perceptual and acoustical analyses of voice, Journal of Voice 11 (1997) 307–313.

Appendix A. Pseudocode of Objective Masculinity/Femininity Rating
Algorithm 1: Pseudocode of objective masculinity/femininity rating

Data: 23 acoustic measures extracted from the 1/…/10-second segments and the 2nd-sentence segments for both males and females, together with perceived masculinity/femininity ratings; 12 sets of data in total
Input: one set of Data
Output: R_train, R_test, MSE_train, MSE_test, r_train and r_test
/* Data size = m samples × 23 acoustic measures, where m = number of participants × total speech duration per participant / speech duration */
begin
    X, Y = StandardScaler(input);  /* standardize the data by removing the mean and scaling to unit variance */
    kf = KFold(n_splits = 4, shuffle = True, on = speaker IDs);  /* 4-fold cross-validation, splitting the dataset into 4 consecutive folds with shuffling; training and testing data are split by individual speaker */
    param_grid = {'max_depth', 'min_samples_leaf', 'min_samples_split', 'max_leaf_nodes'};
    ERF = ExtraTreesRegressor(n_estimators = …);  /* initialize an ERF model */
    clf = GridSearchCV(ERF, param_grid, cv = 10);  /* exhaustive search over the specified parameter values, with a 10-fold cross-validation splitting strategy */
    clf.fit(X, Y);  /* train the ERF model on the entire input dataset to obtain the best-fitting hyper-parameters */
    best_param = clf.best_params_;
    ERF_best = ExtraTreesRegressor(n_estimators = …, **best_param);
    for train, test in kf.split(X, Y) do
        ERF_best.fit(X[train], Y[train]);  /* apply the best-fitting hyper-parameters to the training folds */
        train_prediction = ERF_best.predict(X[train]);
        test_prediction = ERF_best.predict(X[test]);
        R_train_kf.append(R2(Y[train], train_prediction));
        R_test_kf.append(R2(Y[test], test_prediction));
        MSE_train_kf.append(mean_squared_error(Y[train], train_prediction));
        MSE_test_kf.append(mean_squared_error(Y[test], test_prediction));
        r_train_kf.append(pearsonr(Y[train], train_prediction));
        r_test_kf.append(pearsonr(Y[test], test_prediction));
    /* intermediate outputs; each list contains 4 values from the 4-fold cross-validation */
    R_train, R_test, MSE_train, MSE_test, r_train, r_test = mean(R_train_kf), mean(R_test_kf), mean(MSE_train_kf), mean(MSE_test_kf), mean(r_train_kf), mean(r_test_kf)

Appendix B. Pseudocode of Hierarchical Clustering Algorithm

Algorithm 2:
Hierarchical clustering algorithm
Data: correlation matrix of the 23 acoustic measures, size 23 × 23, with labels m_1, m_2, …, m_23
Output: stepwise dendrogram; breakdown data structure of the dendrogram, L, an (N−1) × … matrix
begin
    def dist(c_i, c_j);  /* the distance function of Eq. 2 */
    for i = 1 to N do
        c_i = {m_i}
    C = {c_1, …, c_N}
    while C.size > 1 do
        (c_min1, c_min2) = the pair minimizing dist(c_i, c_j) over all c_i, c_j in C
        remove c_min1 and c_min2 from C
        add c_new = {c_min1, c_min2} to C
        L.append(c_min1, c_min2, dist(c_min1, c_min2))
    return L

Appendix C. Pseudocode of Building Independent Clusters and Characterization

Algorithm 3:
Pseudocode for building independent clusters and characterization
Data: 23 acoustic measures extracted from the segments with the optimal speech duration for both males and females, together with subjective gender ratings
Input: one set of Data
Output: VIFs, R_train, R_test, MSE_train, MSE_test, r_train and r_test, cluster importance
/* Data size = m samples × 23 acoustic measures, where m = number of participants × total speech duration per participant / the optimal speech duration */
begin
    X, Y = StandardScaler(input);  /* standardize the data by removing the mean and scaling to unit variance */
    r_matrix = abs(X.corr());  /* absolute values of the pairwise correlations, initial size 23 × 23 */
    dendrogram = Hierarchical_Clustering_Algorithm(data = r_matrix, method = 'average');
    for n_cluster ← … do  /* number of clusters, 2 ≤ n_cluster ≤ 23 */
        VIFs = vif(r_matrix);  /* a list of VIF values; list length = n_cluster */
        clusters = dendrogram(n = n_cluster);  /* generate a list of clusters; the list length is n_cluster */
        for cluster_i in clusters do
            if length(cluster_i) > 1 then
                new_X[cluster_i] = PCA(n_components = 1, [X[measure_i] for measure_i in cluster_i])
                /* cluster_i - one element of clusters; a cluster may contain one or multiple acoustic measures, and together the clusters cover the 23 measures with no repetition */
                /* measure_i - the index of the i-th acoustic measure, 0 ≤ measure_i < 23 */
                /* new_X[cluster_i] - the values of the i-th cluster over the samples, size m samples × 1 */
            else
                new_X[cluster_i] = X[measure_i]
                /* X[measure_i] - the values of the i-th acoustic measure over the samples, size m samples × 1 */
        /* new_X size = m samples × n_cluster */
        r_matrix = abs(new_X.corr());  /* update r_matrix */
        kf = KFold(n_splits =