NISP: A Multi-lingual Multi-accent Dataset for Speaker Profiling
Shareef Babu Kalluri, Deepu Vijayasenan, Sriram Ganapathy, Ragesh Rajan M, Prashant Krishnan
National Institute of Technology Karnataka, Surathkal, India
Learning and Extraction of Acoustic Patterns (LEAP) lab, Indian Institute of Science, Bangalore
{shareefbabu1, deepu.senan, sriram.iisc, mrageshrajan, gillyprash29}@gmail.com

Abstract
Many commercial and forensic applications of speech demand the extraction of information about speaker characteristics, which falls into the broad category of speaker profiling. The speaker characteristics needed for profiling include physical traits like height, age and gender, along with the native language of the speaker. Many of the available datasets have only partial information for speaker profiling. In this paper, we attempt to overcome this limitation by developing a new dataset which has speech data from five different Indian languages along with English. Metadata for speaker profiling applications, such as linguistic information, regional information and physical characteristics of each speaker, is also collected. We call this dataset the NITK-IISc Multilingual Multi-accent Speaker Profiling (NISP) dataset. The description of the dataset, its potential applications and baseline results for speaker profiling on this dataset are provided in this paper.
Index Terms: NISP dataset, Speaker profiling, Voice forensics.
1. Introduction
In recent years, speech has emerged as a reliable biometric for various commercial and surveillance applications. Speech contains the speaker identity along with textual information, geographical information (the region an individual belongs to) in the form of accent, age (child / teenager / adult), gender (male / female), social information, and also the emotional state of the person (angry, happy, sad, anxious, etc.) [1]. The extraction of such speaker-related meta information is known as speaker profiling. This metadata can be used in commercial applications like voice agents and dialog systems to deliver content targeted to the user [2]. In forensic scenarios, speaker profiling could provide clues about a caller. Such applications have resulted in increased interest in the area of speaker profiling [3], which makes the creation of datasets in this domain essential. Building effective speaker profiling systems requires a large amount of good quality speech data along with metadata such as gender, age, physical characteristics and accent.

Existing speech corpora have limited speaker metadata. Most of them have either physical characteristics or accent information, but often not both. For example, the widely used TIMIT dataset [4] has only age, height and gender information about the speakers; there is no information about other physical parameters or about accents. The popular Speaker Recognition Evaluation (SRE) challenge datasets [5, 6, 7] additionally have information about smoking habits and native country, but no linguistic information. Other datasets such as the 2010 Interspeech Paralinguistic Challenge (ComParE) dataset [2], the Fisher English Corpus [8] and the SpeechDat II dataset [9] provide only the gender and age group of the speaker. The CMU Kids dataset [10] just provides the grade in which the kids are studying. None of these datasets provide any details about physical parameters beyond height and age. The only exception is the Copycat corpus [11], which has details of height, weight and age, but its speakers are limited to children. Similarly, there are datasets that provide only the accent information of the speakers, such as the Accents of the British Isles (ABI-1) corpus [12] and the CSLU Foreign Accented English (FAE) dataset [13]. In this context, there is a need for a dataset with richer metadata for speaker profiling systems.

Another limitation is that most of the available datasets are monolingual (English). On the other hand, the multilingual data that is available (for example, the Babel dataset [14]) does not have detailed speaker profiling information. In order to build a speaker profiling system robust to multiple languages and accents, we require a dataset that has all the required speaker profiling metadata along with the speech data.

In this paper, we attempt to overcome some of the limitations of the available datasets by collecting a multilingual, multi-accent dataset from five Indian states. This dataset is called the
NITK-IISc Multilingual Multi-accent Speaker Profiling (NISP) dataset. We describe the details of the dataset in this paper along with baseline systems for speaker profiling.

The rest of the paper is organized as follows. Sec. 2 describes the design of the dataset. Sec. 3 provides details about the statistics of the dataset. Sec. 4 lists potential applications where the NISP data can be useful. Sec. 5 contains the discussion of the baseline experimental results on physical parameter estimation. This is followed by a discussion and summary in Sec. 6.
2. Design of Database
The NISP dataset creation involved collecting speech and metadata from Indian speakers belonging to five Indian language groups. The entire data collection took place over the course of a year. The speakers who contributed speech data for this database were students, academic staff and faculty members of different educational institutions across southern India. Informed consent was obtained from the speakers to use the data for academic and research activities. The dataset is publicly available at https://github.com/iiscleap/NISP-Dataset, free for academic and research purposes only, with standard license agreements.

The linguistic, regional and physical traits were collected from each speaker along with the speech data. The metadata collected in this dataset is the following:

1. Native language (L1) of the speaker, and whether the speaker can read text in L1.
2. Language used in the schooling years.
3. Second language (L2): the most commonly spoken language other than L1.
4. Regional information: the geographic location of the native place (or the place where the subject has lived dominantly).
5. Current place of residence.
6. Physical characteristics: age and gender of the speaker, and body build parameters like height, shoulder size and weight. The age of the speaker was noted in years and the height was measured in centimetres. The shoulder size was measured in centimetres at the widest point of the shoulders, between the acromion bones, with the individual's arms at their side. The weight of a speaker was measured in kilograms using a standard digital weighing machine.

The audio recordings were collected in a quiet environment, such as a normal classroom or seminar hall, in each of the educational institutions. All necessary precautions were taken to avoid ambient noise and reverberation; any fans or air conditioners were switched off during the data collection process. The speech data was collected using a high quality microphone (a Focusrite Scarlett Solo Studio with a CM25 large-diaphragm condenser microphone). The data was sampled at . kHz with a bit-rate of bits per sample. In order to avoid any channel variations across recordings, all the speech samples were collected using the same microphone device.

The text read by the speakers was presented in the L1 language as well as in English, in two different sessions. The text provided to the speakers was taken from daily news articles as unique sentences, without any contextual continuity from one sentence to another, in both the native language and English texts. This setting was designed to avoid any prosodic continuity in the reading task. Separately, a continuous short story section was given to the speakers in both the L1 language and English, to capture contextual continuity effects in the reading task. Along with these sentences, we also used five common sentences for every speaker: the two TIMIT sa1 and sa2 sentences, and three general news article sentences in English (to enable speaker profiling in a text dependent manner). Similarly, two common sentences were also included in the native language text. Overall, each subject provided - unique sentences in L1 and English, - contextual sentences in L1 and English, five common sentences in English, and two common sentences in L1. Each speaker was instructed to read aloud in a clear voice with a close talking microphone.
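To make the per-speaker metadata described above concrete, the sketch below models one speaker's record in Python. The field names, types and example values are our own illustrative choices, not the layout of the released metadata files.

```python
from dataclasses import dataclass

@dataclass
class NISPSpeaker:
    """Illustrative per-speaker record for the metadata listed above.

    All field names are hypothetical; the released files may use a
    different layout.
    """
    speaker_id: str
    native_language: str      # L1, e.g. "Kannada"
    reads_l1: bool            # whether the speaker can read text in L1
    schooling_language: str   # language used in the schooling years
    second_language: str      # L2: most commonly spoken language other than L1
    native_place: str         # region where the subject has lived dominantly
    current_residence: str
    gender: str               # "male" / "female"
    age_years: float
    height_cm: float
    shoulder_size_cm: float   # measured at the widest point of the shoulders
    weight_kg: float

# Example record (values are invented for illustration):
spk = NISPSpeaker("spk001", "Kannada", True, "English", "Hindi",
                  "Mysuru", "Surathkal", "male", 24.0, 172.0, 44.0, 68.5)
```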
The audio recording setup used a publicly available software tool, namely "Speech Recorder", together with the Focusrite Scarlett Solo Studio audio recording device connected to a laptop. The audio recorder device has a gain controller to adjust the gain and amplitude of the speech signal while recording. The software provides a graphical user interface (GUI) that displays one sentence at a time on the speaker's screen, and it is monitored and controlled from another display. The participant was asked to read out the displayed text aloud in a comfortable sitting posture. The controller also verified the content being read, in order to catch any reading errors made by the speaker.

The technical specifications and statistics of the collected dataset are detailed in the following section.

Table 1: Distribution of native languages, and the number of male and female speakers in the NISP dataset

Sl. No.  Native Language   Male  Female  Total
1.       Hindi               76      27    103
2.       Kannada             33      27     60
3.       Malayalam           35      25     60
4.       Telugu              35      22     57
5.       Tamil               40      25     65
         Total Speakers     219     126    345
Figure 1: Number of utterances and speech duration of each language (both native language and English speech data) in the NISP dataset. (Panels: duration of the speech corpus per language, in hours; number of utterances per language.)
3. Dataset Statistics
The NISP dataset has 345 speakers, which includes 219 male and 126 female speakers. The dataset covers five native Indian languages (namely Hindi, Kannada, Malayalam, Tamil and Telugu) as well as Indian accented English. Each speaker provided around - minutes of speech data in each language. The distribution of speakers across the different native languages, as well as the gender wise distribution, is shown in Table 1. The total number of utterances in this dataset is 28,268, out of which 17,844 are male speaker utterances and 10,424 are female speaker utterances. The total number of native language utterances is , and there are English utterances in the dataset. This dataset has a total of . hours of native language speech data and . hours of English speech data. The total duration of speech in hours and the total number of utterances for each native language, along with the English speech, are shown in Fig. 1. The gender wise statistics of each physical parameter are given in Table 2. The total number of speakers from each region per accent is shown in Fig. 2.

Figure 2: Native geographic region of the speakers in the NISP dataset.
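The gender wise summary of Table 2 can be recomputed from the metadata in a few lines. The sketch below assumes the metadata has been collected into a CSV file with one row per speaker; the file name and column names are assumptions for illustration, not the released format.

```python
import pandas as pd

# Hypothetical per-speaker metadata table; file and column names are assumed.
meta = pd.read_csv("nisp_metadata.csv")

cols = ["height_cm", "shoulder_size_cm", "weight_kg", "age_years"]
stats = ["min", "max", "mean", "std"]

per_gender = meta.groupby("gender")[cols].agg(stats)  # Table 2, per-gender blocks
pooled = meta[cols].agg(stats)                        # Table 2, pooled block
print(per_gender.round(1))
print(pooled.round(1))
```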
4. Potential Applications
The NISP dataset supports a wide range of applications depending on the task requirement. It enables the exploration of speaker profiling in a multilingual setting, in both text dependent and text independent fashion, as well as accent/language identification, speaker recognition and multilingual speech recognition experiments.
While most of the available speaker profiling databases are specific to one language (English), the NISP dataset has speech data in multiple native languages of India (Hindi, Kannada, Malayalam, Tamil and Telugu) along with English recordings from each native speaker.
Identifying the accent and L1 of a speaker is an important cue in voice forensic applications as well as in smart speakers and dialog systems. The NISP dataset enables research into accent related effects on speech. The database allows both L1 identification from L2 speech and language identification among the L1 languages. Note that many of the L1 languages come from geographically connected regions of the country, and we therefore hypothesize that language identification will itself be challenging in this setting.
The large scale speaker recognition datasets ([15, 16], etc.) are monolingual (English). Many of these datasets are currently used to build large neural network based embedding extractors. The NISP dataset, while much smaller in scale, can be used to fine-tune such large neural network models with more multi-accent and multilingual variability. We hypothesize that this can improve the robustness of speaker recognition systems. In addition, multilingual speaker verification with mismatched languages in enrollment and test data can be useful for benchmarking speaker verification systems.
Table 2: Gender wise statistics of each physical parameter in the NISP dataset

Physical Characteristic     Min     Max    Mean   Std. Dev.

Male Speakers
Height (cm)               151.0   191.0   171.6      6.7
Shoulder width (cm)        32.0    55.0    44.7      3.2
Weight (kg)                43.4   116.5    69.4     11.9
Age (y)                    18.0    47.5    24.4      5.6

Female Speakers
Height (cm)               143.0   180.0   158.9      6.8
Shoulder width (cm)        30.0    53.0    39.7      3.4
Weight (kg)                34.1    86.2    56.5     10.5
Age (y)                    18.3    46.5    25.1      6.1

Male and Female Speakers
Height (cm)               143.0   191.0   166.9      9.1
Shoulder width (cm)        30.0    55.0    42.9      4.0
Weight (kg)                34.1   116.5    64.7     13.0
Age (y)                    18.0    47.5    24.7      5.8
Figure 3: Gender wise MAE of each feature (Fstat, formants (Fmnts), frequency locations (F-loc), amplitudes (Amp) and harmonic features (amplitudes + frequency locations, Harm)) compared with the Training data Mean Predictor (TMP) on the NISP dataset. (Panels: height prediction, MAE in cm; shoulder prediction, MAE in cm; weight prediction, MAE in kg; age prediction, MAE in years.)

This dataset also has rich text information in both English and all the native languages (Hindi, Kannada, Malayalam, Tamil and Telugu). All the transcriptions, after manual verification, are stored in UTF-8 format. The dataset thus also enables accented speech recognition research along with multilingual ASR experiments.
5. Baseline Experiments and Results
For evaluation purposes, the dataset is divided into train and test splits with no overlapping speakers. The training split has 210 speakers with 17,161 utterances, comprising 134 male speakers with 10,911 utterances and 76 female speakers with 6,250 utterances. The test split has 135 speakers with 11,107 utterances, comprising 85 male speakers with 6,933 utterances and 50 female speakers with 4,174 utterances.
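A speaker-disjoint split of this kind can be reproduced by grouping utterances on the speaker label, so that no speaker contributes to both sides. Below is a minimal sketch using scikit-learn's GroupShuffleSplit; the file name, column names and test fraction are illustrative assumptions, not the released split definition.

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# One row per utterance; file and column names are assumed for illustration.
utts = pd.read_csv("nisp_utterances.csv")

# Hold out roughly 40% of the speakers for testing; grouping on the
# speaker label guarantees no speaker overlap between the splits.
gss = GroupShuffleSplit(n_splits=1, test_size=0.4, random_state=0)
train_idx, test_idx = next(gss.split(utts, groups=utts["speaker_id"]))
train, test = utts.iloc[train_idx], utts.iloc[test_idx]

assert set(train["speaker_id"]).isdisjoint(test["speaker_id"])
```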
Table 3: Statistics of the train and test splits for each physical parameter in the NISP dataset

Physical Characteristic     Min     Max    Mean   Std. Dev.

Train Speakers
Height (cm)                 143     191   167.1      9.5
Shoulder width (cm)          32      55    42.9      4.2
Weight (kg)                36.9   116.5    65.4     14.0
Age (y)                      18    47.5    24.8      6.0

Test Speakers
Height (cm)               146.5   182.5   166.7      8.5
Shoulder width (cm)        30.0    53.0    42.9      3.7
Weight (kg)                34.1    93.8    63.5     11.3
Age (y)                    18.3    43.6    24.4      5.5

The statistics of the train and test splits of the dataset are given in Table 3. The standard error metrics, Mean Absolute Error (MAE) and Root Mean Square Error (RMSE), are used to measure the error between the actual and the predicted targets.

We estimate physical parameters, namely height, age, shoulder size and weight, on the NISP dataset. We perform the physical parameter estimation task using three different features: mel filter bank features, formants and harmonics. More details about the feature extraction setup are given in [17]. We computed first order statistics (Fstat) from the mel filter bank features using a -component diagonal covariance Gaussian Mixture Model - Universal Background Model (GMM-UBM). The GMM was trained using Mel Frequency Cepstral Coefficients (MFCC) together with their deltas and double deltas, constituting -dimensional features. The formant and fundamental frequency features are extracted from the wide band spectral components with a th order all pole model. The percentiles ( , , , and ) are computed for the extracted features over the entire utterance. The harmonic features, including both frequency locations (F-loc) and amplitude features (Amp), are extracted from the narrow band spectral components using a th order all pole model. The same set of percentiles is computed for the harmonic features over the entire utterance. The statistics computed from each individual feature are given to a linear Support Vector Regression (SVR) model to predict each physical parameter.
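As a rough illustration of the Fstat feature described above, the sketch below trains a diagonal-covariance UBM on pooled frames and computes centred first-order statistics per utterance with scikit-learn. The component count is a placeholder (the exact value is not recoverable from this copy), and the normalisation follows the usual Baum-Welch convention, which may differ in detail from the recipe in [17].

```python
import numpy as np
from sklearn.mixture import GaussianMixture

N_COMPONENTS = 512  # placeholder; the paper's exact component count is not stated here

def train_ubm(pooled_frames: np.ndarray) -> GaussianMixture:
    """Diagonal-covariance GMM-UBM trained on pooled MFCC(+delta) frames."""
    ubm = GaussianMixture(n_components=N_COMPONENTS, covariance_type="diag",
                          max_iter=100, random_state=0)
    ubm.fit(pooled_frames)                        # pooled_frames: (n_frames, feat_dim)
    return ubm

def first_order_stats(ubm: GaussianMixture, frames: np.ndarray) -> np.ndarray:
    """Centred first-order Baum-Welch statistics of one utterance (Fstat-like)."""
    post = ubm.predict_proba(frames)              # (n_frames, K) responsibilities
    n_k = post.sum(axis=0) + 1e-10                # zeroth-order statistics
    f_k = post.T @ frames                         # (K, feat_dim) first-order statistics
    centred = f_k / n_k[:, None] - ubm.means_     # centre around the UBM means
    return centred.ravel()                        # fixed-length utterance representation
```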
The MAE of each individual feature is shown in Fig. 3. This is compared with the default approach, the Training data Mean Predictor (TMP), which predicts the target physical parameter using the mean of the training data for that parameter. To improve the final predictions, we take the simple average of the predicted targets from the individual regression outputs of these features. The three Support Vector Regression outputs from the first order statistics, the formants and the harmonic features (both frequency and amplitude features) were combined in this way (Comb–3). These results are tabulated in comparison with the default predictor in Table 4. This simple average of predicted targets improves the error metrics over the individual ones. The MAE and RMSE over all speakers (male and female) improved relatively by about − % in the body build parameter estimation tasks (height, shoulder width and weight). Similarly, in age estimation, we observe a relative improvement of % in MAE. There is a relative improvement over the TMP with the three feature combination (Comb–3) for all the physical parameters, except in the RMSE of female speakers' shoulder size and male speakers' age.

Table 4: Comparison of the three feature combination with the default predictor – Comb–3 (Fstat + formant + harmonic features (amplitudes + frequency locations))

                   Male            Female          All
                 MAE   RMSE     MAE   RMSE     MAE   RMSE
Height (cm) Estimation
TMP              5.22   6.17    5.30   6.93    7.14   8.47
Comb–3
Shoulder (cm) Estimation
TMP              1.98   2.58
Comb–3
Weight (kg) Estimation
TMP              7.74   9.57    7.88   9.76    9.08  11.35
Comb–3
Age (y) Estimation
TMP              4.40
Comb–3
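The Comb–3 fusion is a plain average of the per-feature SVR predictions. A minimal sketch of the regression, fusion and error metrics is given below; the hyper-parameters and variable names are illustrative assumptions, with a linear-kernel SVR as stated in the text.

```python
import numpy as np
from sklearn.svm import LinearSVR

def fit_predict(train_x, train_y, test_x):
    """One linear SVR per feature stream (Fstat, formants, harmonics)."""
    svr = LinearSVR(C=1.0, max_iter=10000)  # hyper-parameters are placeholders
    svr.fit(train_x, train_y)
    return svr.predict(test_x)

def comb3(stream_predictions):
    """Comb-3: simple average of the per-stream predicted targets."""
    return np.mean(np.stack(stream_predictions), axis=0)

def mae(y_true, y_pred):
    return float(np.mean(np.abs(y_true - y_pred)))

def rmse(y_true, y_pred):
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
```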
6. Conclusions
A multilingual speaker profiling dataset is presented in this paper, with data recorded in five different Indian native languages (Hindi, Kannada, Malayalam, Tamil and Telugu) along with English. The dataset has the linguistic information, regional information and physical characteristics of each speaker, all of which are useful in commercial and forensic applications of speaker profiling. The dataset has 345 speakers (219 males and 126 females) and contains 28,268 utterances (17,844 from male speakers and 10,424 from female speakers). Overall, the dataset has . hours of speech data, of which . hours come from the native languages of the speakers and . hours are English data. For speaker profiling tasks on this dataset, the baseline combination of three features (Fstat, formants and harmonics) performs better in MAE and RMSE measures when compared to the training data mean predictor.
7. Acknowledgments
This work was partially funded by the Science and Engineering Research Board (SERB) under grant no. EMR/2016/007934. The authors would like to acknowledge support from the following institutions: National Institute of Technology Karnataka (NITK), Surathkal; Indian Institute of Science (IISc), Bangalore; Sree Vidyanikethan Engineering College, Tirupathi, Andhra Pradesh; KSR College of Engineering, Tiruchengode, Tamilnadu; and College of Engineering Thalassery, Kerala. We also acknowledge the support of staff and students from these institutions for the smooth conduct of the data collection.

8. References

[1] Lawrence R. Rabiner and Ronald W. Schafer, Digital Processing of Speech Signals, vol. 100, Prentice-Hall, Englewood Cliffs, NJ, 1978.
[2] Björn Schuller, Stefan Steidl, Anton Batliner, Felix Burkhardt, Laurence Devillers, Christian Müller, and Shrikanth Narayanan, "Paralinguistics in speech and language: state-of-the-art and the challenge," Computer Speech & Language, vol. 27, no. 1, pp. 4–39, 2013.
[3] Amir Hossein Poorjam, Mohamad Hasan Bahari, et al., "Multitask speaker profiling for estimating age, height, weight and smoking habits from spontaneous telephone speech signals," in Computer and Knowledge Engineering (ICCKE), 2014 4th International eConference on. IEEE, 2014, pp. 7–12.
[4] John S. Garofolo, Lori F. Lamel, William M. Fisher, Jonathon G. Fiscus, and David S. Pallett, "DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM. NIST speech disc 1-1.1," NASA STI/Recon Technical Report N, vol. 93, 1993.
[5] "NIST Speaker Recognition Evaluation (SRE) series."
[6] Alvin F. Martin and Craig S. Greenberg, "NIST 2008 speaker recognition evaluation: Performance across telephone and room microphone channels," in Tenth Annual Conference of the International Speech Communication Association, 2009.
[7] Alvin F. Martin and Craig S. Greenberg, "The NIST 2010 speaker recognition evaluation," in Eleventh Annual Conference of the International Speech Communication Association, 2010.
[8] Christopher Cieri, David Miller, and Kevin Walker, "The Fisher Corpus: a resource for the next generations of speech-to-text," in LREC, 2004, vol. 4, pp. 69–71.
[9] "German SpeechDat(II)," https://catalogue.elra.info/en-us/repository/browse/ELRA-S0096.
[10] Maxine Eskenazi, Jack Mostow, and David Graff, "The CMU Kids Corpus," https://catalog.ldc.upenn.edu/LDC97S63.
[11] Jill Fain Lehman and Rita Singh, "Estimation of children's physical characteristics from their voices," in INTERSPEECH, 2016, pp. 1417–1421.
[12] Shona M. D'Arcy, Martin J. Russell, Sue R. Browning, and Mike J. Tomlinson, "The accents of the British Isles (ABI) corpus," Proceedings of Modélisations pour l'Identification des Langues, pp. 115–119, 2004.
[13] T. Lander, "CSLU: Foreign Accented English Release 1.2," https://catalog.ldc.upenn.edu/LDC2007S08.
[14] Mary Harper, "The BABEL program and low resource speech technology," Proc. of ASRU 2013, 2013.
[15] A. Nagrani, J. S. Chung, and A. Zisserman, "VoxCeleb: a large-scale speaker identification dataset," in INTERSPEECH, 2017.
[16] J. S. Chung, A. Nagrani, and A. Zisserman, "VoxCeleb2: Deep speaker recognition," in INTERSPEECH, 2018.
[17] Shareef Babu Kalluri, Deepu Vijayasenan, and Sriram Ganapathy, "Automatic speaker profiling from short duration speech data."