VOTE400 (Voice Of The Elderly 400 Hours): A Speech Dataset to Study Voice Interfaces for Elderly Care
Minsu Jang, Sangwon Seo, Dohyung Kim, Jaeyeon Lee, Jaehong Kim, Jun-Hwan Ahn
Abstract — This paper introduces a large-scale Korean speech dataset, called VOTE400, that can be used for analyzing and recognizing the voices of elderly people. The dataset includes about 300 hours of continuous dialog speech and 100 hours of read speech, both recorded by elderly people aged 65 years or over. A preliminary experiment showed that a speech recognition system trained with VOTE400 can outperform conventional systems in recognizing elderly people's voices. This work is a multi-organizational effort led by ETRI and MINDs Lab Inc. for the purpose of advancing the speech recognition performance of elderly-care robots.
I. INTRODUCTION

Voice interfaces are the most intuitive, comfortable, and universal way of interacting with service robots. Recent advances in commercial cloud-based speech-to-text (STT) services have made devising a voice interface for a service robot a very simple process of integrating an STT service API into the robot software system. While these commercial systems work very well with adults aged between 20 and 60, they easily fail with voices from older adults aged 65 years or over. It is known that speech signals from older adults pose difficulties for automatic speech recognition, as they tend to be imprecise in consonant pronunciation, include tremors, and have slower articulation [1].

Motivated by the need for a speech recognition system specialized to the speech of older adults, we built a speech dataset by collecting large-scale dialogue and read speech from older adults. The result of our effort is 400 hours of Korean speech data, which we named VOTE400 (Voice Of The Elderly 400 Hours) and open-sourced for any non-commercial research project (https://ai4robot.github.io/mindslab-etri-vote400).

II. DATASET DESCRIPTION
A. Dataset Collection
To recruit older adults and collect their voice data, we received assistance from a Korean governmental office called the Dok-Geo-No-In-Jong-Haap-Ji-Won Center (Ji-Won Center), which is devoted to the support of older adults living alone. With the support of the Ji-Won Center, we were able to collect large-scale dialog speech and read speech from a number of older adults across various regions of South Korea.

Minsu Jang, Dohyung Kim, Jaeyeon Lee and Jaehong Kim are with the Electronics and Telecommunications Research Institute, Daejeon-si, South Korea (minsu at etri.re.kr). Sangwon Seo and Jun-Hwan Ahn are with MINDs Lab Inc., Kyungki-do, South Korea (asdn9353 at mindslab.ai).
1) Dialog Speech:
To collect spontaneous speech data from older adults, we were able to utilize a support program of the Ji-Won Center called Saa-Raang-It-Gi, in which social workers regularly visit elderly people's homes to consult on health-related issues and relieve loneliness. After explaining the data collection experiment and obtaining consent to participate from the elderly, conversations between a social worker and an elderly person were recorded using a smartphone. The recordings from these program sessions were sent to the Ji-Won Center, and a screening process was performed to remove every dialogue involving sensitive personal information. Then, a quality assurance process was followed to filter out speech segments incomprehensible to a human listener due to imprecise pronunciation or significant noise.
2) Read Speech:
To compensate for the relatively low quality of the dialog speech dataset, we launched another data collection process to acquire read speech from older adults. For this purpose we built and used a dedicated speech collection system, in which a tablet-based client program presents a sentence for an elderly user to read, makes a recording, and sends it to a server, where the recording is inspected and accepted or rejected. In total, the number of unique sentences chosen to be read by participants was 2,250. These sentences were selected by considering how often they could be casually uttered by older adults in daily life.
TABLE I
RAW DATA COLLECTION OF DIALOG SPEECH

Region (R)               No. Participants   Len. (hrs)
Seoul-si (SE)            620                122
Busan-si (PS)            242                90
Daegu-si (DG)            202                33
Gwangju-si (GJ)          179                63
Daejeon-si (DJ)          275                66
Ulsan-si (WS)            80                 28
Goyang-si (GG)           335                69
Gangwon-do (GW)          178                45
Chungcheongbuk-do (CB)   252                92
Chungcheongnam-do (CN)   317                46
Jeollanam-do (JN)        323                103
Gyeongsangbuk-do (GB)    378                116
Total                    3,381              873

B. Dialog Speech Data
The total number of elderly participants is 3,381 and the total length of recordings is 873 hours. This is the result of collective efforts by regional senior citizens' welfare institutes collaborating with the Ji-Won Center. Table I shows the regional distribution of participants and the length of recordings per region. After the screening and QA process mentioned in the previous subsection, we finalized 300 hours of dialog speech to be included in the VOTE400 dataset. Every finalized recording was transcribed by human annotators. In VOTE400, each recording session is provided as a WAV file. The audio format of the WAV files is shown in Table II. Every WAV file is accompanied by a transcription text file encoded in ISO-8859. The transcription does not include audio-text alignment information.
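The transcripts are plain text files shipped alongside the audio, and the two subsets use different encodings (ISO-8859 for dialog speech, EUC-KR for read speech). The following is a minimal loading sketch; the side-by-side .txt layout and the function name are our assumptions, as the paper does not specify the directory layout:

```python
def load_transcript(wav_path, encoding="euc-kr"):
    """Read the transcript paired with a VOTE400 WAV recording.

    Assumes (hypothetically) that the transcript sits next to the WAV
    with a .txt extension. Pass encoding="latin-1" for the ISO-8859
    dialog transcripts and "euc-kr" for the read-speech transcripts.
    """
    txt_path = wav_path.rsplit(".", 1)[0] + ".txt"
    with open(txt_path, "r", encoding=encoding) as f:
        return f.read().strip()
```

Decoding with the wrong codec typically raises `UnicodeDecodeError` or yields mojibake, so pinning the encoding per subset is worth doing up front.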
TABLE II
VOTE400 DIALOG SPEECH AUDIO FORMAT

Property          Value
Format            PCM
Format Settings   Little / Signed
Codec ID          1
Bit Rate Mode     Constant
Bit Rate          256 kb/s
Channel(s)        1
Sampling Rate     16 kHz
Bit Depth         16 bits
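A data loader can sanity-check each WAV file against this format before feeding it to a recognizer. Below is a minimal sketch using only Python's standard-library wave module; the function name is ours, not part of any dataset tooling:

```python
import wave

def check_dialog_wav(path):
    """Verify a WAV file matches the VOTE400 dialog audio format:
    mono, 16-bit signed PCM, 16 kHz. Returns the duration in seconds."""
    with wave.open(path, "rb") as w:
        assert w.getnchannels() == 1, "expected 1 channel (mono)"
        assert w.getsampwidth() == 2, "expected 16-bit samples"
        assert w.getframerate() == 16000, "expected 16 kHz sampling rate"
        return w.getnframes() / w.getframerate()
```

At this format, duration can also be cross-checked against file size: 256 kb/s works out to 32,000 bytes of sample data per second.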
The file name of each recording encodes the following fields: P-ID is a unique participant ID; G is a gender value (F for female, M for male); A is an age value; R is a regional code; and DT is the date-time of the recording session. Participant and speech audio statistics for the VOTE400 dialog dataset are shown in Table III and Table IV.

TABLE III
DEMOGRAPHICS OF VOTE400 DIALOG SPEECH

Region (R)               No. Participants       Age (µ/σ)
Seoul-si (SE)            251 (F:210, M:41)      78.98/5.13
Daegu-si (DG)            108 (F:95, M:13)       80.33/6.08
Gyoungki-do (GG)         110 (F:83, M:27)       80.17/5.41
Chungcheongnam-do (CN)   6 (F:6, M:0)           77.00/3.69
Jeollanam-do (JN)        70 (F:56, M:14)        80.76/4.90
Busan-si (PS)            160 (F:137, M:23)      78.70/5.51
Daejeon-si (DJ)          96 (F:72, M:24)        78.81/5.24
Gangwon-do (GW)          109 (F:94, M:15)       80.07/5.50
Gyeongsangbuk-do (GB)    98 (F:95, M:3)         80.87/4.48
Gwangju-si (GJ)          87 (F:70, M:17)        79.39/5.77
Chungcheongbuk-do (CB)   17 (F:17, M:0)         80.47/5.51
Ulsan-si (WS)            58 (F:49, M:9)         76.97/4.48
Total                    1,170 (F:984, M:186)   79.47/5.37

C. Read Speech Data
The total number of elderly participants is 104 and the total length of recordings is 100 hours. Table VI shows the statistics of the VOTE400 read speech data. The audio format of the read speech data is shown in Table V; it differs slightly from that of the dialog speech data.
TABLE IV
SPEECH AUDIO STATISTICS FOR VOTE400 DIALOG SPEECH

Region (R)               Len. (secs)   Len. (µ/σ)
Seoul-si (SE)            151,010       601.63/239.83
Daegu-si (DG)            60,740        562.42/228.14
Gyoungki-do (GG)         107,935       981.23/357.19
Chungcheongnam-do (CN)   5,193         865.62/293.98
Jeollanam-do (JN)        81,767        1,168.10/294.85
Busan-si (PS)            200,207       1,251.30/255.85
Gangwon-do (GW)          95,420        875.42/158.18
Daejeon-si (DJ)          123,138       1,282.70/293.83
Gyeongsangbuk-do (GB)    71,175        726.28/308.80
Gwangju-si (GJ)          92,699        1,065.52/276.53
Chungcheongbuk-do (CB)   20,135        1,184.41/309.54
Ulsan-si (WS)            70,754        1,219.90/254.43
Total                    1,080,179     923.23/380.17

TABLE V
VOTE400 READ SPEECH AUDIO FORMAT

Property          Value
Format            PCM
Format Settings   Little / Signed
Codec ID          1
Bit Rate Mode     Constant
Bit Rate          705.6 kb/s
Channel(s)        1
Sampling Rate     44.1 kHz
Bit Depth         16 bits
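Since the read subset is recorded at 44.1 kHz while the dialog subset uses 16 kHz, a pipeline that trains on both will typically resample one subset to match the other. The sketch below uses deliberately naive linear interpolation to show the idea; a real pipeline should use an anti-aliased resampler such as scipy.signal.resample_poly:

```python
def resample_linear(samples, src_rate=44100, dst_rate=16000):
    """Resample a sequence of samples by linear interpolation.
    No anti-aliasing filter is applied, so this is only a sketch."""
    n_out = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate      # fractional source index
        j = int(pos)
        frac = pos - j
        nxt = samples[min(j + 1, len(samples) - 1)]
        out.append(samples[j] * (1 - frac) + nxt * frac)
    return out
```

For example, 441 input samples at 44.1 kHz map to 160 output samples at 16 kHz, matching the 441:160 ratio of the two rates.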
The file name of each read speech recording encodes the following fields: P-ID is a unique participant ID, DATE is the date of recording, SENTENCE-NO is a serial number assigned to each recorded sentence, and R is the region code shown in Table VI. Each WAV file contains a single sentence and is accompanied by a transcription text file encoded in EUC-KR. Though the number of sentences originally chosen and presented to the participants was 2,250, the final total number of unique sentences in the VOTE400 read speech data is 7,832, owing to mistakes and slight variations in the participants' actual utterances.

III. PRELIMINARY EXPERIMENT
We conducted a preliminary experiment by fine-tuning MINDs Lab Inc.'s proprietary baseline speech recognizer (M), which is based on an LSTM architecture, and estimating its STT accuracy using VOTE400. After fine-tuning the baseline with 50 hours each of VOTE400 dialog speech and read speech data, a simple test with 100 sentences from different regions was performed. The results are shown in Table VII, along with the results obtained when the same sentences were tested on a commercial cloud-based STT engine (C).

IV. SUMMARY
We described VOTE400, a Korean speech dataset collected entirely from older adults aged 65 years or over. VOTE400 contains 300 hours of dialogue speech data and 100 hours of read speech data, with broad variety in gender and region. To our knowledge, VOTE400 is by far one of the largest voice datasets oriented to the voices of the elderly. We hope that this dataset will be useful for studying older adults' voice features and for realizing voice technologies that work sufficiently well in elderly-care robotics.

TABLE VI
REGIONAL DISTRIBUTIONS OF VOTE400 READ DATASET

Region (R)              No. Persons   No. Sent.   Len. (µ/σ)
Gyeongsangnam-do (GB)   20            22,575      3.18/1.38
Seoul-si (SE)           18            19,220      3.31/1.49
Jeollanam-do (JN)       21            21,393      3.36/1.52
Daegu-si (DG)           25            26,950      3.60/1.87
Gangwon-do (GW)         20            21,676      2.73/1.12
Total                   104           111,814     3.25/1.54

TABLE VII
STT PERFORMANCE TEST RESULTS WITH VOTE400

Region (R)   Gender   Acc. M (%)   Acc. C (%)
SE           M        90           90
SE           F        90           80
GW           M        80           90
GW           F        90           80
DG           M        70           80
DG           F        90           80
GN           M        90           80
GN           F        80           80
JN           M        70           50
JN           F        80           60

ACKNOWLEDGMENT

This work was supported by the Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2017-0-00162, Development of Human-care Robot Technology for Aging Society).