
Age-VOX-Celeb: Multi-Modal Corpus for Facial and Speech Age Estimation


Abstract


Estimating a speaker’s age from their speech is more challenging than estimating it from their face, in part because sufficiently large public corpora are not available. To tackle this problem, we construct a new audio-visual age corpus, AgeVoxCeleb, by annotating VoxCeleb2 videos with age labels. AgeVoxCeleb is the first large-scale, balanced, multi-modal age corpus that contains both video and speech of the same speakers across a wide age range. Using AgeVoxCeleb, this paper makes the following contributions: (i) By comparing state-of-the-art models in each task, we show that a facial age estimation model outperforms a speech age estimation model. (ii) Facial age estimation is more robust to mismatch between training and test sets. (iii) We develop cross-modal transfer learning from face to speech age estimation, showing that ages estimated by a facial age estimation model can be used to train a speech age estimation model. The proposed AgeVoxCeleb will be published at https://github.com/nttcslab-sp/agevoxceleb.
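The third contribution describes a pseudo-labeling scheme: a facial age estimator predicts ages for the video track, and those predictions supervise a speech age estimator on the paired audio. Below is a minimal sketch of that idea in PyTorch, assuming paired face/audio clips; the class and function names (SpeechAgeNet, train_speech_model_on_face_labels) are hypothetical placeholders, not the authors' released code.

```python
# Hypothetical sketch of cross-modal transfer from face to speech age estimation:
# a frozen facial age model pseudo-labels each clip, and a speech model is
# trained to regress those ages. Names and architecture are illustrative only.
import torch
import torch.nn as nn


class SpeechAgeNet(nn.Module):
    """Toy speech age regressor over mean-pooled filterbank features."""

    def __init__(self, feat_dim=80, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, fbank):                # fbank: (batch, time, feat_dim)
        pooled = fbank.mean(dim=1)           # utterance-level average pooling
        return self.net(pooled).squeeze(-1)  # predicted age in years


def train_speech_model_on_face_labels(face_model, speech_model, loader, epochs=5):
    """Pseudo-label each clip with the frozen facial model, then fit the
    speech model to those ages with an L1 regression loss."""
    optimizer = torch.optim.Adam(speech_model.parameters(), lr=1e-4)
    face_model.eval()
    for _ in range(epochs):
        for face_frames, fbank in loader:         # paired video/audio clips
            with torch.no_grad():
                pseudo_age = face_model(face_frames)  # face-estimated age targets
            pred_age = speech_model(fbank)
            loss = nn.functional.l1_loss(pred_age, pseudo_age)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return speech_model
```

This sketch only illustrates the transfer direction (face labels supervising speech); the paper's actual models, features, and training details are described in the full text.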

Pages 6963-6967
DOI 10.1109/ICASSP39728.2021.9414272
Language English
Journal ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
