Joint gender and age estimation based on speech signals using x-vectors and transfer learning
ICASSP 2021
D. Kwasny, D. Hemmerling
AGH University of Science and Technology, Department of Measurement and Electronics, al. Mickiewicza 30, 30-059 Kraków
ABSTRACT
In this paper we extend the x-vector framework to the tasks of speaker age estimation and gender classification. In particular, we replace the baseline multilayer-TDNN architecture with QuartzNet, a convolutional architecture that has gained success in the field of speech recognition. We further propose a two-stage transfer learning scheme utilizing two large-scale speech datasets, VoxCeleb and Common Voice, and use multitask learning to allow joint age estimation and gender classification with a single system. We train and evaluate the system on the TIMIT dataset. The proposed transfer learning scheme yields consecutive performance improvements in terms of both age estimation error and gender classification accuracy, and the best performing system achieves new state-of-the-art results on the task of age estimation on the TIMIT TEST set, with an MAE of 5.12 and 5.29 years and an RMSE of 7.24 and 8.12 years for male and female speakers respectively, while maintaining a gender classification accuracy of 99.6%.
Index Terms — speech processing, neural networks, gender classification, age estimation, x-vector
1. INTRODUCTION
Speech plays an important role in interpersonal communication. In addition to its content, speech also carries information about the speaker's identity, emotions, gender, origin, etc. A system that automatically estimates speaker attributes such as gender and age with the greatest possible precision is desired by companies that use telephony to communicate with customers.
Related work: In recent years, DNNs have become very effective at identifying information relevant to the classification and prediction of gender and age from speech. The paper [13] presents the use of a deep bottleneck extractor and a GMM-UBM classifier for speaker age and gender classification; the overall accuracy achieved by the proposed classification system is 57.63%. The authors of [17] proposed a convolutional-recurrent neural network architecture for gender and age prediction from speech. Gender is recognized with an average error below 1.55%, while the probability of a speaker's age being correctly classified into three age groups is higher than 80% on average. The paper [7] presents a DNN with a transfer learning strategy in a convolutional-layer-based network for gender classification from movie audio data; the authors achieved 85% weighted accuracy in the best setup. The authors of [4] applied the x-vector neural network architecture, which has been shown to offer state-of-the-art results in the realm of speaker recognition [18], to the age estimation task; the reported mean absolute error (MAE) was 4.92 years on the NIST SRE10 dataset. In a recent paper from 2019 [8], the authors proposed a novel, unified DNN architecture for a joint height/age estimation system that improved over a baseline support vector regression solution by at least 0.6 years in terms of the root mean square error (RMSE); for age estimation, the RMSE is 7.60 and 8.63 years for males and females respectively, evaluated on the TIMIT database. The same database was used in the research of [9], which explored different feature streams, derived from the speech spectrum, for age and body build estimation using support vector regression; the MAE of age estimation in this approach is 5.2 and 5.6 years for males and females respectively.
Contribution: In this work we propose an x-vector-based DNN system for joint age estimation and gender classification. In particular, the proposed system uses a different embedding network architecture (QuartzNet [10]) compared to the previous attempts presented in [4] and [14], which used the TDNN network. Moreover, unlike the system from [4], which can only estimate the age of the speaker, and [14], which performs joint age/gender classification but distinguishes only between 4 age classes, our system learns to jointly estimate the exact age of the speaker and classify the gender. Furthermore, we propose transfer learning from speaker recognition and from age estimation/gender classification on different datasets. The proposed transfer learning scheme yields consecutive result improvements, and the system achieves new state-of-the-art results on the TIMIT dataset.

2. METHODS

The general idea behind the x-vector architecture is the use of a few convolutional layers responsible for capturing local dependencies, followed by a pooling layer which computes statistics (mean and standard deviation) over the time dimension. The result of the pooling layer is then passed through an additional affine layer to form the final embedding vector, which is of fixed size regardless of the length of the input sequence. Recent results in the field of speaker recognition [18] show that the performance of x-vector-based systems depends on the architecture of the embedder network, and that deeper architectures such as TDNN-F [16] and ResNet [6] offer performance gains compared to the baseline TDNN-based system. Encouraged by these results, we follow the x-vector framework for the joint gender classification and age estimation task. The scheme of the proposed system is shown in figure 1.
Fig. 1. High-level representation of the system for joint age estimation and gender classification.

We distinguish two groups of modules making up the overall system: the embedder, which, given a variable-length audio sequence, outputs a fixed-size embedding of the input utterance, and the front-end classifier and regressor networks, which, given the embedding, perform the classification/regression tasks. Instead of the TDNN-based embedder used in [4], we use a deeper network, heavily based on the QuartzNet [10] architecture.
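The fixed-size-embedding property described above comes from the statistics pooling layer. As an illustrative sketch (not the authors' code), the operation reduces a variable-length sequence of frame-level features to a single vector by concatenating the per-channel mean and standard deviation over time:

```python
# Illustrative sketch of x-vector statistics pooling: a (T, D) sequence of
# frame-level features becomes a fixed (2*D,) vector, regardless of T.
import numpy as np

def stats_pool(features: np.ndarray) -> np.ndarray:
    """features: (T, D) frame-level features; returns a (2*D,) vector."""
    mean = features.mean(axis=0)   # per-channel mean over time
    std = features.std(axis=0)     # per-channel standard deviation over time
    return np.concatenate([mean, std])

# Utterances of different lengths map to embeddings of the same size;
# 1500 channels matches the pooling input size reported in Table 1.
short = stats_pool(np.random.randn(120, 1500))
long_ = stats_pool(np.random.randn(480, 1500))
assert short.shape == long_.shape == (3000,)   # 1500 + 1500
```

In the full system this pooled vector is then passed through the affine layers to produce the final embedding.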
Block Name | Kernel Length | Repeats | Residual | Size
Input | | | |
Block 1 | | 2 | |
Block 2 | | 2 | |
Block 3 | | 2 | |
Final | | | |
Stats pooling (Mean + Std) | - | 1 | FALSE | 1500 + 1500
Dense + BatchNorm + ReLU | - | 1 | FALSE | 512
Dense + BatchNorm + ReLU | - | 1 | FALSE | 512

Table 1. QuartzNet embedder architecture.

The QuartzNet architecture we use is composed of a convolutional layer followed by a sequence of 3 groups of blocks. The blocks in a group are identical and are repeated 2 times. One additional convolutional layer is followed by a pooling layer, which aggregates the mean and standard deviation across the time dimension. The output of the pooling layer is then passed through a stack of two affine layers to form the final embedding. The embedder network architecture is summarized in table 1. The details of the front-end modules are presented in table 2.
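The building block of QuartzNet is the 1D time-channel separable convolution: a depthwise convolution over time (one kernel per channel) followed by a pointwise (1x1) convolution that mixes channels. The sketch below is illustrative only; the channel counts and kernel length are made-up values, not the paper's configuration:

```python
# Illustrative sketch of a 1D time-channel separable convolution, the building
# block of QuartzNet. Sizes here are hypothetical, not the paper's settings.
import numpy as np

def separable_conv1d(x, depthwise_k, pointwise_w):
    """x: (C_in, T); depthwise_k: (C_in, K), odd K, 'same' padding over time;
    pointwise_w: (C_out, C_in). Returns (C_out, T)."""
    c_in, t = x.shape
    k = depthwise_k.shape[1]
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad)))
    # Depthwise step: each channel is convolved with its own length-K kernel.
    dw = np.stack([np.convolve(xp[c], depthwise_k[c], mode="valid")[:t]
                   for c in range(c_in)])
    # Pointwise step: a 1x1 convolution mixing channels at every time step.
    return pointwise_w @ dw

x = np.random.randn(64, 100)                 # 64 channels, 100 frames
out = separable_conv1d(x, np.random.randn(64, 33), np.random.randn(256, 64))
assert out.shape == (256, 100)
```

Compared with a full 1D convolution, this factorization sharply reduces the parameter count, which is why QuartzNet can be deep yet compact.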
Network | Layer | Input Size | Output Size
BinaryClassifier | Dense + ReLU + BatchNorm | 512 | 512
 | Dense + Sigmoid | 512 | 1
MultiClassifier | Dense + ReLU + BatchNorm | 512 | 512
 | Dense + Softmax | 512 | 8
Regressor | Dense + ReLU + BatchNorm | 512 | 512
 | Dense | 512 | 1
Table 2. Front-end module details.

One of the most challenging aspects of using more complex DNN architectures is the difficult training process, especially when the amount of training data is limited and the network tends to overfit. To address these limitations we use two techniques that have found great success in various areas of deep learning: transfer learning and multitask learning. In particular, we explore various pretraining schemes, involving pretraining the embedder for the task of speaker recognition, for age/gender classification on a different dataset with more data, or a combination of both. As for multitask learning, the proposed system jointly learns to predict both the gender and the age of the speaker. This is beneficial in two ways: only a single system needs to be trained to predict multiple speaker characteristics, and the front-end networks may benefit from the implicit information about the gender and age of the speaker contained in the embeddings. We also employ an additional cross-entropy loss for an age-group classification objective, as it has been shown to stabilize the training of the age estimator [4].
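The auxiliary age-group objective requires mapping each speaker's exact age to one of the 8 classes handled by the MultiClassifier of Table 2. The paper does not list the bin boundaries, so the edges below are purely hypothetical, chosen only to illustrate the mapping:

```python
# Hedged sketch: map an exact age to one of 8 age-group classes for the
# auxiliary cross-entropy loss. The bin edges are NOT from the paper; they
# are illustrative decade-style upper bounds.
AGE_BIN_UPPER_EDGES = [20, 30, 40, 50, 60, 70, 80]   # hypothetical

def age_to_group(age: float) -> int:
    """Return a group index in 0..7 for the 8-way age-group classifier."""
    for idx, upper in enumerate(AGE_BIN_UPPER_EDGES):
        if age < upper:
            return idx
    return len(AGE_BIN_UPPER_EDGES)   # ages of 80 and above: last bin

assert age_to_group(25) == 1
assert age_to_group(85) == 7
```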
For the purpose of this research we used 3 datasets: VoxCeleb1 [15], the Common Voice dataset [2, 1] and the DARPA-TIMIT dataset [3]. The VoxCeleb1 dataset was created primarily to accelerate research in the field of speaker identification and verification. It is roughly gender-balanced, with around 55% male speakers. We used this dataset to pretrain the embedder network for the speaker recognition task. For the sake of reproducibility, we used the subset of the English part of the Common Voice dataset available through the Kaggle website [1]. We used only those entries which contained information about both the age group and the gender of the speaker, further restricting gender to be either male or female to align with the genders present in the TIMIT dataset. In total, this corresponds to approximately 80 hours of data in the train set and 1.5 hours of data in each of the validation and test sets. For final training and validation we used the DARPA-TIMIT dataset. A random TRAIN-TEST split is performed on the default TRAIN subset of the data, which corresponds to roughly 3.5 hours of training data and half an hour of validation data. For the final evaluation of results the default TEST set is used, which corresponds to roughly 1500 utterances and 1.5 hours of data, all spoken by speakers not present in the training set. Using the default TEST set lets us compare fairly with results obtained by previous works on this dataset.
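The Common Voice filtering described above can be sketched as a simple record filter. This is illustrative only: the field names and values below are assumptions, not the dataset's actual schema:

```python
# Illustrative sketch of the Common Voice filtering: keep only entries that
# carry both an age group and a gender label, with gender restricted to
# male/female to match TIMIT. Field names and values are assumptions.
records = [
    {"path": "a.mp3", "age": "twenties", "gender": "male"},
    {"path": "b.mp3", "age": "",         "gender": "female"},  # no age: drop
    {"path": "c.mp3", "age": "forties",  "gender": "other"},   # gender: drop
    {"path": "d.mp3", "age": "sixties",  "gender": "female"},
]

kept = [r for r in records
        if r["age"] and r["gender"] in ("male", "female")]
assert [r["path"] for r in kept] == ["a.mp3", "d.mp3"]
```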
Before feature extraction, the following processing steps were applied to the raw waveforms. First, Voice Activity Detection was used to remove the non-speech frames from the input utterance, and then a random, 5-second-long crop was extracted, as this has been shown to improve the results of the x-vector based system presented in [4]. The cropped signal was then volume-normalized to a common level of -30 dB relative to full scale (dBFS) [19].

Two feature sets commonly used in various speech classification tasks were considered: MFCCs and mel spectrograms. In particular, we tried 30-dimensional MFCC features and 64-dimensional mel spectrograms. In both cases we use a 25 ms-long Hamming window with a slide of 10 ms, a lower cut-off frequency of 40 Hz and an upper cut-off frequency of 8000 Hz.

We use a combination of a binary cross-entropy loss for gender classification, a cross-entropy loss for age-group prediction, which stabilizes the training of the age estimator [4], and a mean square error loss for the age estimation task. The losses are combined according to the weights shown in table 3. To allow pretraining the regressor on the Common Voice dataset, we approximate the speaker's exact age as the middle value of the corresponding age bin.

BCE loss (gender classification) | CE loss (age-group classification) | MSE loss (age estimation)
1.0 | 1.0 | 0.001

Table 3. Loss weights for joint gender classification and age estimation.
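The loss combination and the age-bin midpoint approximation can be sketched as follows, using the Table 3 weights (1.0 / 1.0 / 0.001); the scalar inputs stand in for batch-averaged BCE, CE, and MSE values:

```python
# Sketch of the weighted multitask loss (weights from Table 3) and the
# age-bin midpoint approximation used to pretrain the regressor on
# Common Voice. Scalars stand in for batch-averaged loss values.
LOSS_WEIGHTS = {"bce_gender": 1.0, "ce_age_group": 1.0, "mse_age": 0.001}

def total_loss(bce_gender: float, ce_age_group: float, mse_age: float) -> float:
    """Weighted sum of the three task losses."""
    return (LOSS_WEIGHTS["bce_gender"] * bce_gender
            + LOSS_WEIGHTS["ce_age_group"] * ce_age_group
            + LOSS_WEIGHTS["mse_age"] * mse_age)

def bin_midpoint(lower: float, upper: float) -> float:
    """Approximate a speaker's exact age as the middle of their age bin."""
    return (lower + upper) / 2.0

# The small MSE weight keeps the (large-magnitude) squared-error term from
# dominating the two classification losses.
assert abs(total_loss(0.5, 1.2, 100.0) - 1.8) < 1e-9
assert bin_midpoint(20, 30) == 25.0
```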
Optimizer parameter | Value
Name | NovoGrad [5]
Learning Rate |
Learning Rate Policy | Cosine Annealing [12]
Weight Decay |
Beta 1 |
Beta 2 |
Warmup Steps |
Batch Size |
Num Epochs (CV) |
Num Epochs (TIMIT) |

Table 4. Optimizer configuration.

The network was trained with the optimizer configuration shown in table 4 on a single NVIDIA Quadro RTX 5000 graphics card. Before the experiments, all the data was converted to a common wav format and down-sampled to 16 kHz. All experiments were conducted using the NVIDIA NeMo framework [11].
3. RESULTS

3.1. Gender classification results
Results of the gender classification task evaluated on the TIMIT TEST dataset with different pretraining schemes and feature sets are shown in table 5.

Features | Pretrained on | Group | Accuracy
MFCC | - | all | 98.30%
 | | female | 97.00%
 | | male | 98.90%
 | VoxCeleb | all | 98.50%
 | | female | 96.40%
 | | male | 99.50%
 | VoxCeleb + Common Voice | all | 99.60%
 | | female | 98.80%
 | | male |
Mel spectrogram | - | all | 95.40%
 | | female | 86.40%
 | | male | 99.90%
 | VoxCeleb | all | 98.90%
 | | female | 97.50%
 | | male | 99.60%
 | VoxCeleb + Common Voice | all | 99.40%
 | | female |
 | | male | 99.60%

Table 5. Gender classification results on the TIMIT TEST dataset.

All networks were pretrained on the datasets shown in the "Pretrained on" column and then trained on the TIMIT TRAIN dataset. The proposed transfer learning scheme resulted in improved accuracy in all cases; in particular, the system that uses mel spectrograms benefited from pretraining on the VoxCeleb dataset (its accuracy rose from 95.40% to 98.90%). On the other hand, the MFCC-based system benefited more from the additional pretraining on the Common Voice dataset, achieving the highest gender classification accuracy of 99.60%.
The results of the age estimation evaluated on the TIMIT TEST dataset are shown in table 6. As in the gender classification case, the proposed pretraining yields improvements in every scenario. While the results of the systems whose embedders were pretrained on the VoxCeleb dataset are very similar, the system that uses MFCCs shows superior performance both when no pretraining is applied and when additional pretraining on Common Voice is used. The MFCC-based system pretrained only on VoxCeleb already offers better results, in terms of the RMSE, than the best results obtained with a DNN-based system on this dataset, presented in [8], while the MAE matches that achieved by the authors of [9] with a feature-engineering based approach. Additional pretraining of the whole MFCC-based system on the Common Voice dataset yields another performance improvement and allows us to report a new state-of-the-art on the task of age estimation on the TIMIT TEST dataset, with an MAE of 5.12 and 5.29 years and an RMSE of 7.24 and 8.12 years for male and female speakers respectively.

Features | Pretrained on | Group | MAE | RMSE
MFCC | - | all | 5.98 | 8.47
 | | female | 6.28 | 9.42
 | | male | 5.83 | 7.96
 | VoxCeleb | all | 5.37 | 7.74
 | | female | 5.65 | 8.53
 | | male | 5.23 | 7.31
 | VoxCeleb + Common Voice | all | |
 | | female | 5.29 | 8.12
 | | male | 5.12 | 7.24
Mel spectrogram | - | all | 6.40 | 9.12
 | | female | 7.38 | 9.90
 | | male | 8.71 | 8.71
 | VoxCeleb | all | 5.40 | 7.78
 | | female | 5.59 | 8.35
 | | male | 5.31 | 7.48
 | VoxCeleb + Common Voice | all | 5.34 | 7.70
 | | female | 5.36 |
 | | male | 5.32 | 7.65

Table 6. Age estimation results on the TIMIT TEST dataset.
4. DISCUSSION AND CONCLUSION
In this work we proposed replacing the basic TDNN-based x-vector architecture with a deeper QuartzNet-based x-vector architecture and training it for joint speaker age estimation and gender classification. We proposed a novel, two-stage transfer learning scheme utilizing data available in large-scale speech corpora: VoxCeleb and Common Voice. We trained and evaluated the system on the popular TIMIT dataset, using the default TRAIN-TEST split. The proposed transfer learning scheme resulted in consecutive performance gains in every scenario. Using an embedder network pretrained on a different task (speaker recognition) and dataset (VoxCeleb), we obtained an RMSE of 7.31 and 8.53 years and an MAE of 5.23 and 5.65 years for male and female speakers respectively. These results are already better than the best results on this task in terms of the RMSE metric (7.60 male / 8.63 female) achieved with a DNN-based system [8], and on par, in terms of MAE, with the results obtained with a feature-engineering based approach reported in [9]. Additional pretraining on the Common Voice dataset yielded another improvement, and the system achieved new state-of-the-art results in terms of both MAE and RMSE metrics, with an MAE of 5.12 and 5.29 years and an RMSE of 7.24 and 8.12 years for male and female speakers respectively, while maintaining a gender classification accuracy of 99.6% on the TIMIT TEST dataset. These results further confirm the effectiveness of the proposed approach.

5. REFERENCES

[1] Common voice. , 2017.
[2] Rosana Ardila, Megan Branson, Kelly Davis, Michael Henretty, Michael Kohler, Josh Meyer, Reuben Morais, Lindsay Saunders, Francis M. Tyers, and Gregor Weber. Common Voice: A massively-multilingual speech corpus. arXiv preprint arXiv:1912.06670, 2019.
[3] J. Garofolo, Lori Lamel, W. Fisher, Jonathan Fiscus, D. Pallett, N. Dahlgren, and V. Zue. TIMIT acoustic-phonetic continuous speech corpus. Linguistic Data Consortium, 1992.
[4] Pegah Ghahremani, Phani Sankar Nidadavolu, Nanxin Chen, Jesús Villalba, Daniel Povey, Sanjeev Khudanpur, and Najim Dehak. End-to-end deep neural network age estimation. In Interspeech, pages 277-281, 2018.
[5] Boris Ginsburg, Patrice Castonguay, Oleksii Hrinchuk, Oleksii Kuchaiev, Vitaly Lavrukhin, Ryan Leary, Jason Li, Huyen Nguyen, Yang Zhang, and Jonathan M. Cohen. Stochastic gradient methods with layer-wise adaptive moments for training of deep networks. arXiv preprint arXiv:1905.11286, 2019.
[6] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770-778, 2016.
[7] Rajat Hebbar, Krishna Somandepalli, and Shrikanth S. Narayanan. Improving gender identification in movie audio using cross-domain data. In Interspeech, pages 282-286, 2018.
[8] Shareef Babu Kalluri, Deepu Vijayasenan, and Sriram Ganapathy. A deep neural network based end to end model for joint height and age estimation from short duration speech. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6580-6584. IEEE, 2019.
[9] Shareef Babu Kalluri, Deepu Vijayasenan, and Sriram Ganapathy. Automatic speaker profiling from short duration speech data. Speech Communication, 2020.
[10] Samuel Kriman, Stanislav Beliaev, Boris Ginsburg, Jocelyn Huang, Oleksii Kuchaiev, Vitaly Lavrukhin, Ryan Leary, Jason Li, and Yang Zhang. QuartzNet: Deep automatic speech recognition with 1D time-channel separable convolutions. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6124-6128. IEEE, 2020.
[11] Oleksii Kuchaiev, Jason Li, Huyen Nguyen, Oleksii Hrinchuk, Ryan Leary, Boris Ginsburg, Samuel Kriman, Stanislav Beliaev, Vitaly Lavrukhin, Jack Cook, Patrice Castonguay, Mariya Popova, Jocelyn Huang, and Jonathan M. Cohen. NeMo: a toolkit for building AI applications using neural modules, 2019.
[12] Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.
[13] Arafat Abu Mallouh, Zakariya Qawaqneh, and Buket D. Barkana. New transformed features generated by deep bottleneck extractor and a GMM-UBM classifier for speaker age and gender classification. Neural Computing and Applications, 30(8):2581-2593, 2018.
[14] Maxim Markitantov. Transfer learning in speaker's age and gender recognition. In International Conference on Speech and Computer, pages 326-335. Springer, 2020.
[15] Arsha Nagrani, Joon Son Chung, and Andrew Zisserman. VoxCeleb: a large-scale speaker identification dataset. arXiv preprint arXiv:1706.08612, 2017.
[16] Daniel Povey, Gaofeng Cheng, Yiming Wang, Ke Li, Hainan Xu, and Sanjeev Khudanpur. Semi-orthogonal low-rank matrix factorization for deep neural networks.
[17] Héctor A. Sánchez-Hevia, Roberto Gil-Pita, Manuel Utrilla-Manso, and Manuel Rosa-Zurera. Convolutional-recurrent neural network for age and gender prediction from speech. In , pages 242-245. IEEE, 2019.
[18] Jesús Villalba, Nanxin Chen, David Snyder, Daniel Garcia-Romero, Alan McCree, Gregory Sell, Jonas Borgstrom, Fred Richardson, Suwon Shon, François Grondin, et al. State-of-the-art speaker recognition for telephone and video speech: The JHU-MIT submission for NIST SRE18. In Interspeech, pages 1488-1492, 2019.
[19] Li Wan, Quan Wang, Alan Papir, and Ignacio Lopez Moreno. Generalized end-to-end loss for speaker verification. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).