Joint gender and age estimation based on speech signals using x-vectors and transfer learning
ICASSP 2021
D. Kwasny, D. Hemmerling
AGH University of Science and Technology, Department of Measurement and Electronics, al. Mickiewicza 30, 30-059 Kraków
ABSTRACT
In this paper we extend the x-vector framework to the tasks of speaker age estimation and gender classification. In particular, we replace the baseline multilayer-TDNN architecture with QuartzNet, a convolutional architecture that has gained success in the field of speech recognition. We further propose a two-stage transfer learning scheme utilizing two large-scale speech datasets, VoxCeleb and Common Voice, and use multitask learning to allow joint age estimation and gender classification with a single system. We train and evaluate the system on the TIMIT dataset. The proposed transfer learning scheme yields consecutive performance improvements in terms of both age estimation error and gender classification accuracy, and the best performing system achieves new state-of-the-art results on the task of age estimation on the TIMIT TEST set, with an MAE of 5.12 and 5.29 years and an RMSE of 7.24 and 8.12 years for male and female speakers respectively, while maintaining a gender classification accuracy of 99.6%.
Index Terms — speech processing, neural networks, gender classification, age estimation, x-vector
1. INTRODUCTION
Speech plays an important role in interpersonal communication. In addition to its content, speech also carries information about the speaker's identity, emotions, gender, origin, etc. A system that automatically estimates speaker attributes such as gender and age with the greatest possible precision is desired by companies that use telephony to communicate with customers.
Related work: In recent years, DNNs have become very effective at identifying information relevant to the classification and prediction of gender and age from speech. The paper [13] presents the use of a deep bottleneck extractor and a GMM-UBM classifier for speaker age and gender classification; the overall accuracy achieved by the proposed classification system is 57.63%. The authors of [17] proposed a convolutional-recurrent neural network architecture for gender and age prediction from speech. Gender is recognized with an average error below 1.55%, while the probability of a speaker's age being correctly classified into three age groups is higher than 80% on average. The paper [7] presents a DNN with a transfer learning strategy in a convolutional-layer-based network for gender classification from movie audio data; the authors achieved 85% weighted accuracy in the best setup. The authors of [4] applied the x-vector neural network architecture, which has been shown to offer state-of-the-art results in the realm of speaker recognition [18], to the age estimation task; the reported mean absolute error (MAE) was 4.92 years on the NIST SRE10 dataset. In a recent paper from 2019 [8], the authors proposed a novel, unified DNN architecture for a joint height/age estimation system that improved over a baseline support vector regression solution by at least 0.6 years in terms of the root mean square error (RMSE); for age estimation, the RMSE is 7.60 and 8.63 years for males and females respectively, evaluated on the TIMIT database. The same database was used in the research of [9], which explored different feature streams, derived from the speech spectrum, for age and body build estimation using support vector regression; the MAE of age estimation in this approach is 5.2 and 5.6 years for males and females respectively.
Contribution: In this work we propose an x-vector-based DNN system for joint age estimation and gender classification. In particular, the proposed system uses a different embedding network architecture (QuartzNet [10]) compared to the previous attempts presented in [4] and [14], which used the TDNN network. Moreover, unlike the system from [4], which can only estimate the age of the speaker, and [14], which performs joint age/gender classification but distinguishes only between 4 age classes, our system learns to jointly estimate the exact age of the speaker and classify the gender. Furthermore, we propose transfer learning from speaker recognition and from age estimation/gender classification on different datasets. The proposed transfer learning scheme yields consecutive result improvements, and the system achieves new state-of-the-art results on the TIMIT dataset.

2. METHODS

The general idea behind the x-vector architecture is the use of a few convolutional layers responsible for capturing local dependencies, followed by a pooling layer which computes statistics (mean and standard deviation) over the time dimension. The result of the pooling layer is then passed through an additional affine layer to form the final embedding vector, which is of fixed size regardless of the length of the input sequence. Recent results in the field of speaker recognition [18] show that the performance of x-vector-based systems depends on the architecture of the embedder network, and that deeper architectures such as TDNN-F [16] and ResNet [6] offer performance gains compared to the baseline TDNN-based system. Encouraged by these results, we follow the x-vector framework for the joint gender classification and age estimation task. The scheme of the proposed system is shown in figure 1.
Fig. 1. High-level representation of the system for joint age estimation and gender classification.

We distinguish two groups of modules making up the overall system: the embedder, which, given a variable-length audio sequence, outputs a fixed-size embedding of the input utterance, and the front-end classifier and regressor networks, which, given the embedding, perform the classification/regression tasks. Instead of the TDNN-based embedder used in [4], we use a deeper network, heavily based on the QuartzNet [10] architecture.
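The fixed-size-embedding property described above comes from the statistics pooling layer. As an illustrative sketch (not the authors' code), the operation reduces a variable-length sequence of frame-level features to a single vector by concatenating the per-channel mean and standard deviation over time:

```python
# Illustrative sketch of x-vector statistics pooling: a (T, D) sequence of
# frame-level features becomes a fixed (2*D,) vector, regardless of T.
import numpy as np

def stats_pool(features: np.ndarray) -> np.ndarray:
    """features: (T, D) frame-level features; returns a (2*D,) vector."""
    mean = features.mean(axis=0)   # per-channel mean over time
    std = features.std(axis=0)     # per-channel standard deviation over time
    return np.concatenate([mean, std])

# Utterances of different lengths map to embeddings of the same size;
# 1500 channels matches the pooling input size reported in Table 1.
short = stats_pool(np.random.randn(120, 1500))
long_ = stats_pool(np.random.randn(480, 1500))
assert short.shape == long_.shape == (3000,)   # 1500 + 1500
```

In the full system this pooled vector is then passed through the affine layers to produce the final embedding.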
Block Name | Kernel Length | Repeats | Residual | Size
Input | | | |
Block 1 | | 2 | |
Block 2 | | 2 | |
Block 3 | | 2 | |
Final | | | |
Stats pooling (Mean + Std) | - | 1 | FALSE | 1500 + 1500
Dense + BatchNorm + ReLU | - | 1 | FALSE | 512
Dense + BatchNorm + ReLU | - | 1 | FALSE | 512

Table 1. QuartzNet embedder architecture.

The QuartzNet architecture we use is composed of a convolutional layer followed by a sequence of 3 groups of blocks. The blocks in a group are identical and are repeated 2 times. One additional convolutional layer is followed by a pooling layer, which aggregates the mean and standard deviation across the time dimension. The output of the pooling layer is then passed through a stack of two affine layers to form the final embedding. The embedder network architecture is summarized in table 1. The details of the front-end modules are presented in table 2.
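The building block of QuartzNet is the 1D time-channel separable convolution: a depthwise convolution over time (one kernel per channel) followed by a pointwise (1x1) convolution that mixes channels. The sketch below is illustrative only; the channel counts and kernel length are made-up values, not the paper's configuration:

```python
# Illustrative sketch of a 1D time-channel separable convolution, the building
# block of QuartzNet. Sizes here are hypothetical, not the paper's settings.
import numpy as np

def separable_conv1d(x, depthwise_k, pointwise_w):
    """x: (C_in, T); depthwise_k: (C_in, K), odd K, 'same' padding over time;
    pointwise_w: (C_out, C_in). Returns (C_out, T)."""
    c_in, t = x.shape
    k = depthwise_k.shape[1]
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad)))
    # Depthwise step: each channel is convolved with its own length-K kernel.
    dw = np.stack([np.convolve(xp[c], depthwise_k[c], mode="valid")[:t]
                   for c in range(c_in)])
    # Pointwise step: a 1x1 convolution mixing channels at every time step.
    return pointwise_w @ dw

x = np.random.randn(64, 100)                 # 64 channels, 100 frames
out = separable_conv1d(x, np.random.randn(64, 33), np.random.randn(256, 64))
assert out.shape == (256, 100)
```

Compared with a full 1D convolution, this factorization sharply reduces the parameter count, which is why QuartzNet can be deep yet compact.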
Network | Layer | Input Size | Output Size
BinaryClassifier | Dense + ReLU + BatchNorm | 512 | 512
 | Dense + Sigmoid | 512 | 1
MultiClassifier | Dense + ReLU + BatchNorm | 512 | 512
 | Dense + Softmax | 512 | 8
Regressor | Dense + ReLU + BatchNorm | 512 | 512
 | Dense | 512 | 1
Table 2. Front-end module details.

One of the most challenging aspects of using more complex DNN architectures is the difficult training process, especially when the amount of training data is limited and the network tends to overfit. To address these limitations we use two techniques that have found great success in various areas of deep learning: transfer learning and multitask learning. In particular, we explore various pretraining schemes, involving pretraining the embedder for the task of speaker recognition, for age/gender classification on a different dataset with more data, or a combination of both. As for multitask learning, the proposed system jointly learns to predict both the gender and the age of the speaker. This is beneficial in two ways: only a single system needs to be trained to predict multiple speaker characteristics, and the front-end networks may benefit from the implicit information about the gender and age of the speaker contained in the embeddings. We also employ an additional cross-entropy loss for an age-group classification objective, as it has been shown to stabilize the training of the age estimator [4].
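The auxiliary age-group objective requires mapping each speaker's exact age to one of the 8 classes handled by the MultiClassifier of Table 2. The paper does not list the bin boundaries, so the edges below are purely hypothetical, chosen only to illustrate the mapping:

```python
# Hedged sketch: map an exact age to one of 8 age-group classes for the
# auxiliary cross-entropy loss. The bin edges are NOT from the paper; they
# are illustrative decade-style upper bounds.
AGE_BIN_UPPER_EDGES = [20, 30, 40, 50, 60, 70, 80]   # hypothetical

def age_to_group(age: float) -> int:
    """Return a group index in 0..7 for the 8-way age-group classifier."""
    for idx, upper in enumerate(AGE_BIN_UPPER_EDGES):
        if age < upper:
            return idx
    return len(AGE_BIN_UPPER_EDGES)   # ages of 80 and above: last bin

assert age_to_group(25) == 1
assert age_to_group(85) == 7
```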
For the purpose of this research we used 3 datasets: VoxCeleb1 [15], the Common Voice dataset [2, 1] and the DARPA-TIMIT dataset [3]. The VoxCeleb1 dataset was created primarily to accelerate research in the field of speaker identification and verification. It is roughly gender-balanced, with around 55% male speakers. We used this dataset to pretrain the embedder network for the speaker recognition task. For the sake of reproducibility, we used the subset of the English part of the Common Voice dataset available through the Kaggle website [1]. We used only those entries which contained information about both the age group and the gender of the speaker, further restricting gender to be either male or female to align with the genders present in the TIMIT dataset. In total, this corresponds to approximately 80 hours of data in the train set and 1.5 hours of data in each of the validation and test sets. For final training and validation we used the DARPA-TIMIT dataset. A random TRAIN-TEST split is performed on the default TRAIN subset of the data, which corresponds to roughly 3.5 hours of training data and half an hour of validation data. For the final evaluation of results the default TEST set is used, which corresponds to roughly 1500 utterances and 1.5 hours of data, all spoken by speakers not present in the training set. Using the default TEST set lets us compare fairly with results obtained by previous works on this dataset.
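The Common Voice filtering described above can be sketched as a simple record filter. This is illustrative only: the field names and values below are assumptions, not the dataset's actual schema:

```python
# Illustrative sketch of the Common Voice filtering: keep only entries that
# carry both an age group and a gender label, with gender restricted to
# male/female to match TIMIT. Field names and values are assumptions.
records = [
    {"path": "a.mp3", "age": "twenties", "gender": "male"},
    {"path": "b.mp3", "age": "",         "gender": "female"},  # no age: drop
    {"path": "c.mp3", "age": "forties",  "gender": "other"},   # gender: drop
    {"path": "d.mp3", "age": "sixties",  "gender": "female"},
]

kept = [r for r in records
        if r["age"] and r["gender"] in ("male", "female")]
assert [r["path"] for r in kept] == ["a.mp3", "d.mp3"]
```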
Before feature extraction, the following processing steps were applied to the raw waveforms. First, Voice Activity Detection was used to remove the non-speech frames from the input utterance, and then a random, 5-second-long crop was extracted, as this has been shown to improve the results of the x-vector based system presented in [4]. The cropped signal was then volume-normalized to a common level of -30 dB relative to full scale (dBFS) [19].

Two feature sets commonly used in various speech classification tasks were considered: MFCCs and mel spectrograms. In particular, we tried 30-dimensional MFCC features and 64-dimensional mel spectrograms. In both cases we use a 25 ms-long Hamming window with a slide of 10 ms, a lower cut-off frequency of 40 Hz and an upper cut-off frequency of 8000 Hz.

We use a combination of a binary cross-entropy loss for gender classification, a cross-entropy loss for age-group prediction, which stabilizes the training of the age estimator [4], and a mean square error loss for the age estimation task. The losses are combined according to the weights shown in table 3. To allow pretraining the regressor on the Common Voice dataset, we approximate the speaker's exact age as the middle value of the corresponding age bin.

BCE loss (gender classification) | CE loss (age-group classification) | MSE loss (age estimation)
1.0 | 1.0 | 0.001

Table 3. Loss weights for joint gender classification and age estimation.
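The loss combination and the age-bin midpoint approximation can be sketched as follows, using the Table 3 weights (1.0 / 1.0 / 0.001); the scalar inputs stand in for batch-averaged BCE, CE, and MSE values:

```python
# Sketch of the weighted multitask loss (weights from Table 3) and the
# age-bin midpoint approximation used to pretrain the regressor on
# Common Voice. Scalars stand in for batch-averaged loss values.
LOSS_WEIGHTS = {"bce_gender": 1.0, "ce_age_group": 1.0, "mse_age": 0.001}

def total_loss(bce_gender: float, ce_age_group: float, mse_age: float) -> float:
    """Weighted sum of the three task losses."""
    return (LOSS_WEIGHTS["bce_gender"] * bce_gender
            + LOSS_WEIGHTS["ce_age_group"] * ce_age_group
            + LOSS_WEIGHTS["mse_age"] * mse_age)

def bin_midpoint(lower: float, upper: float) -> float:
    """Approximate a speaker's exact age as the middle of their age bin."""
    return (lower + upper) / 2.0

# The small MSE weight keeps the (large-magnitude) squared-error term from
# dominating the two classification losses.
assert abs(total_loss(0.5, 1.2, 100.0) - 1.8) < 1e-9
assert bin_midpoint(20, 30) == 25.0
```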
Optimizer parameter | Value
Name | NovoGrad [5]
Learning Rate |
Learning Rate Policy | Cosine Annealing [12]
Weight Decay |
Beta 1 |
Beta 2 |
Warmup Steps |
Batch Size |
Num Epochs (CV) |
Num Epochs (TIMIT) |

Table 4. Optimizer configuration.

The network was trained with the optimizer configuration shown in table 4 on a single NVIDIA Quadro RTX 5000 graphics card. Before the experiments, all the data was converted to a common wav format and down-sampled to 16 kHz. All experiments were conducted using the NVIDIA NeMo framework [11].
3. RESULTS

3.1. Gender classification results
Results of the gender classification task evaluated on the TIMIT TEST dataset with different pretraining schemes and feature sets are shown in table 5.

Features | Pretrained on | Group | Accuracy
MFCC | - | all | 98.30%
 | | female | 97.00%
 | | male | 98.90%
 | VoxCeleb | all | 98.50%
 | | female | 96.40%
 | | male | 99.50%
 | VoxCeleb + Common Voice | all | 99.60%
 | | female | 98.80%
 | | male |
Mel spectrogram | - | all | 95.40%
 | | female | 86.40%
 | | male | 99.90%
 | VoxCeleb | all | 98.90%
 | | female | 97.50%
 | | male | 99.60%
 | VoxCeleb + Common Voice | all | 99.40%
 | | female |
 | | male | 99.60%

Table 5. Gender classification results on the TIMIT TEST dataset.

All networks were pretrained on the datasets shown in the "Pretrained on" column and then trained on the TIMIT TRAIN dataset. The proposed transfer learning scheme resulted in improved accuracy in all cases; in particular, the system that uses mel spectrograms benefited from pretraining on the VoxCeleb dataset (its accuracy rose from 95.40% to 98.90%). On the other hand, the MFCC-based system benefited more from the additional pretraining on the Common Voice dataset, achieving the highest gender classification accuracy of 99.60%.
The results of the age estimation evaluated on the TIMIT TEST dataset are shown in table 6. As in the gender classification case, the proposed pretraining yields improvements in every scenario. While the results of the systems whose embedders were pretrained on the VoxCeleb dataset are very similar, the system that uses MFCCs shows superior performance both when no pretraining is applied and when additional pretraining on Common Voice is used. The MFCC-based system pretrained only on VoxCeleb already offers better results, in terms of the RMSE, than the best results obtained with a DNN-based system on this dataset, presented in [8], while the MAE matches that achieved by the authors of [9] with a feature-engineering based approach. Additional pretraining of the whole MFCC-based system on the Common Voice dataset yields another performance improvement and allows us to report a new state-of-the-art on the task of age estimation on the TIMIT TEST dataset, with an MAE of 5.12 and 5.29 years and an RMSE of 7.24 and 8.12 years for male and female speakers respectively.

Features | Pretrained on | Group | MAE | RMSE
MFCC | - | all | 5.98 | 8.47
 | | female | 6.28 | 9.42
 | | male | 5.83 | 7.96
 | VoxCeleb | all | 5.37 | 7.74
 | | female | 5.65 | 8.53
 | | male | 5.23 | 7.31
 | VoxCeleb + Common Voice | all | |
 | | female | 5.29 | 8.12
 | | male | 5.12 | 7.24
Mel spectrogram | - | all | 6.40 | 9.12
 | | female | 7.38 | 9.90
 | | male | 8.71 | 8.71
 | VoxCeleb | all | 5.40 | 7.78
 | | female | 5.59 | 8.35
 | | male | 5.31 | 7.48
 | VoxCeleb + Common Voice | all | 5.34 | 7.70
 | | female | 5.36 |
 | | male | 5.32 | 7.65

Table 6. Age estimation results on the TIMIT TEST dataset.
4. DISCUSSION AND CONCLUSION
In this work we proposed replacing the basic TDNN-based x-vector architecture with a deeper QuartzNet-based x-vector architecture and training it for joint speaker age estimation and gender classification. We proposed a novel, two-stage transfer learning scheme utilizing data available in large-scale speech corpora: VoxCeleb and Common Voice. We trained and evaluated the system on the popular TIMIT dataset, using the default TRAIN-TEST split. The proposed transfer learning scheme resulted in consecutive performance gains in every scenario. Using an embedder network pretrained on a different task (speaker recognition) and dataset (VoxCeleb), we obtained an RMSE of 7.31 and 8.53 years and an MAE of 5.23 and 5.65 years for male and female speakers respectively. These results are already better than the best results on this task in terms of the RMSE metric (7.60 male / 8.63 female) achieved with a DNN-based system [8], and on par, in terms of MAE, with the results obtained with a feature-engineering based approach reported in [9]. Additional pretraining on the Common Voice dataset yielded another improvement, and the system achieved new state-of-the-art results in terms of both MAE and RMSE metrics, with an MAE of 5.12 and 5.29 years and an RMSE of 7.24 and 8.12 years for male and female speakers respectively, while maintaining a gender classification accuracy of 99.6% on the TIMIT TEST dataset. These results further confirm the effectiveness of the proposed approach.

5. REFERENCES

[1] Common voice. , 2017.
[2] Rosana Ardila, Megan Branson, Kelly Davis, Michael Henretty, Michael Kohler, Josh Meyer, Reuben Morais, Lindsay Saunders, Francis M. Tyers, and Gregor Weber. Common Voice: A massively-multilingual speech corpus. arXiv preprint arXiv:1912.06670, 2019.
[3] J. Garofolo, Lori Lamel, W. Fisher, Jonathan Fiscus, D. Pallett, N. Dahlgren, and V. Zue. TIMIT acoustic-phonetic continuous speech corpus. Linguistic Data Consortium, 1992.
[4] Pegah Ghahremani, Phani Sankar Nidadavolu, Nanxin Chen, Jesús Villalba, Daniel Povey, Sanjeev Khudanpur, and Najim Dehak. End-to-end deep neural network age estimation. In Interspeech, pages 277-281, 2018.
[5] Boris Ginsburg, Patrice Castonguay, Oleksii Hrinchuk, Oleksii Kuchaiev, Vitaly Lavrukhin, Ryan Leary, Jason Li, Huyen Nguyen, Yang Zhang, and Jonathan M. Cohen. Stochastic gradient methods with layer-wise adaptive moments for training of deep networks. arXiv preprint arXiv:1905.11286, 2019.
[6] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770-778, 2016.
[7] Rajat Hebbar, Krishna Somandepalli, and Shrikanth S. Narayanan. Improving gender identification in movie audio using cross-domain data. In Interspeech, pages 282-286, 2018.
[8] Shareef Babu Kalluri, Deepu Vijayasenan, and Sriram Ganapathy. A deep neural network based end to end model for joint height and age estimation from short duration speech. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6580-6584. IEEE, 2019.
[9] Shareef Babu Kalluri, Deepu Vijayasenan, and Sriram Ganapathy. Automatic speaker profiling from short duration speech data. Speech Communication, 2020.
[10] Samuel Kriman, Stanislav Beliaev, Boris Ginsburg, Jocelyn Huang, Oleksii Kuchaiev, Vitaly Lavrukhin, Ryan Leary, Jason Li, and Yang Zhang. QuartzNet: Deep automatic speech recognition with 1D time-channel separable convolutions. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6124-6128. IEEE, 2020.
[11] Oleksii Kuchaiev, Jason Li, Huyen Nguyen, Oleksii Hrinchuk, Ryan Leary, Boris Ginsburg, Samuel Kriman, Stanislav Beliaev, Vitaly Lavrukhin, Jack Cook, Patrice Castonguay, Mariya Popova, Jocelyn Huang, and Jonathan M. Cohen. NeMo: a toolkit for building AI applications using neural modules, 2019.
[12] Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.
[13] Arafat Abu Mallouh, Zakariya Qawaqneh, and Buket D. Barkana. New transformed features generated by deep bottleneck extractor and a GMM-UBM classifier for speaker age and gender classification. Neural Computing and Applications, 30(8):2581-2593, 2018.
[14] Maxim Markitantov. Transfer learning in speaker's age and gender recognition. In International Conference on Speech and Computer, pages 326-335. Springer, 2020.
[15] Arsha Nagrani, Joon Son Chung, and Andrew Zisserman. VoxCeleb: a large-scale speaker identification dataset. arXiv preprint arXiv:1706.08612, 2017.
[16] Daniel Povey, Gaofeng Cheng, Yiming Wang, Ke Li, Hainan Xu, and Sanjeev Khudanpur. Semi-orthogonal low-rank matrix factorization for deep neural networks.
[17] Héctor A. Sánchez-Hevia, Roberto Gil-Pita, Manuel Utrilla-Manso, and Manuel Rosa-Zurera. Convolutional-recurrent neural network for age and gender prediction from speech. In , pages 242-245. IEEE, 2019.
[18] Jesús Villalba, Nanxin Chen, David Snyder, Daniel Garcia-Romero, Alan McCree, Gregory Sell, Jonas Borgstrom, Fred Richardson, Suwon Shon, François Grondin, et al. State-of-the-art speaker recognition for telephone and video speech: The JHU-MIT submission for NIST SRE18. In Interspeech, pages 1488-1492, 2019.
[19] Li Wan, Quan Wang, Alan Papir, and Ignacio Lopez Moreno. Generalized end-to-end loss for speaker verification. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).