LRS3-TED: a large-scale dataset for visual speech recognition
Triantafyllos Afouras, Joon Son Chung, Andrew Zisserman
Visual Geometry Group, Department of Engineering Science, University of Oxford, UK
{afourast, joon, az}@robots.ox.ac.uk

Abstract
This paper introduces a new multi-modal dataset for visual and audio-visual speech recognition. It includes face tracks from over 400 hours of TED and TEDx videos, along with the corresponding subtitles and word alignment boundaries. The new dataset is substantially larger in scale than other public datasets available for general research.
Index Terms: lip reading, visual speech recognition, large-scale, dataset
1. Introduction
Visual speech recognition (or lip reading) is a very challenging task, and a difficult skill for a human to learn. In recent years, there has been significant progress [1, 2, 3, 4] in the performance of automated lip reading due to the application of deep neural network models and the availability of large-scale datasets. However, most of these datasets are subject to some restrictions (e.g. LRW [5] or LRS2-BBC [6] cannot be used by industrial research labs), and this has made it difficult to compare the performance of one lip-reading system to another, as there is no large-scale common benchmark dataset. Our aim in releasing the LRS3-TED dataset is to provide such a benchmark dataset, and one that is larger in size than any available dataset in this field.

The LRS3-TED dataset can be downloaded from .
2. LRS3-TED dataset
The dataset consists of over 400 hours of video, extracted from 5594 TED and TEDx talks in English, downloaded from YouTube.

The cropped face tracks are provided as .mp4 files with a resolution of 224 × 224 and a frame rate of 25 fps, encoded using the h264 codec. The audio tracks are provided in single-channel 16-bit 16 kHz format, while the corresponding text transcripts, as well as the alignment boundaries of every word, are included in plain text files.
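As a concrete illustration of the video format described above, a minimal sketch of reading one face track with OpenCV might look as follows. The file name "sample.mp4" is a hypothetical placeholder, and the snippet is illustrative only, not part of any released tooling.

# A minimal sketch of loading one cropped face track
# (expected: 224x224 frames at 25 fps, h264-encoded).
import cv2  # pip install opencv-python

def load_face_track(mp4_path):
    """Read the cropped face track into a list of 224x224 BGR frames."""
    cap = cv2.VideoCapture(mp4_path)
    fps = cap.get(cv2.CAP_PROP_FPS)   # should report 25 for this dataset
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(frame)          # one 224x224x3 array per video frame
    cap.release()
    return frames, fps

frames, fps = load_face_track("sample.mp4")  # hypothetical file name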
The dataset is organized into three sets: pre-train, train-val and test. The first two overlap in terms of content but the last is completely independent. The statistics for each set are given in Table 1.

We use a multi-stage pipeline for automatically generating the large-scale dataset for audio-visual speech recognition. Using this pipeline, we have been able to collect hundreds of hours of spoken sentences and phrases along with the corresponding facetrack.

We start from the TED and TEDx videos that are available on their respective YouTube channels. These videos were selected for multiple reasons: (1) a wide range of speakers appears in the videos, unlike movies or dramas with a fixed cast; (2) shot changes are less frequent, therefore there are more full sentences with continuous facetracks; (3) the speakers usually talk without interruption, allowing us to obtain longer face tracks. TED videos have previously been used for audio-visual datasets for these reasons [9].

The pipeline is based on the methods described in [1, 6], but we give a brief sketch of the method here.
Video preparation.
We use a CNN face detector based on the Single Shot MultiBox Detector (SSD) [10] to detect face appearances in the individual frames. The time boundaries of a shot are determined by comparing color histograms across consecutive frames [11], and within each shot, face tracks are generated from face detections based on their positions.
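To make the shot-detection step concrete, the sketch below flags a boundary when the color histograms of consecutive frames diverge sharply. It is a simplified stand-in for the method of [11]; the histogram bin counts and the similarity threshold are illustrative assumptions.

import cv2

def shot_boundaries(frames, threshold=0.5):
    """Return indices of frames judged to start a new shot."""
    boundaries, prev_hist = [], None
    for i, frame in enumerate(frames):
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        # 2D histogram over hue and saturation channels
        hist = cv2.calcHist([hsv], [0, 1], None, [50, 60], [0, 180, 0, 256])
        cv2.normalize(hist, hist)
        if prev_hist is not None:
            # correlation near 1 means similar frames; a sharp drop signals a cut
            if cv2.compareHist(prev_hist, hist, cv2.HISTCMP_CORREL) < threshold:
                boundaries.append(i)
        prev_hist = hist
    return boundaries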
Audio and text preparation.
Only the videos providing English subtitles created by humans were used. The subtitles in the YouTube videos are in sync with the audio only at sentence level, therefore the Penn Phonetics Lab Forced Aligner (P2FA) [12] is used to obtain a word-level alignment between the subtitle and the audio signal. The alignment is double-checked against an off-the-shelf Kaldi-based ASR model.
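One simple way to realise such a double-check is to compare the subtitle words against the independent ASR hypothesis and reject segments where the two disagree too much. The sketch below uses a word-level edit distance; the input format and the mismatch threshold are assumptions for illustration, not the exact procedure used to build the dataset.

def words_agree(subtitle_words, asr_words, max_mismatch=0.1):
    """subtitle_words / asr_words: lists of lower-cased word strings.
    Returns True if the word error rate between them is below the threshold."""
    n, m = len(subtitle_words), len(asr_words)
    # Levenshtein distance over words (substitutions, insertions, deletions)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if subtitle_words[i - 1] == asr_words[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[n][m] / max(n, 1) <= max_mismatch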
AV sync and speaker detection.
In YouTube or broadcast videos, the audio and the video streams can be out of sync by up to around one second, which can introduce temporal offsets between the videos and the text labels (aligned to the audio). We use a two-stream network (SyncNet) described in [13] to synchronise the two streams. The same network is also used to determine which face's lip movements match the audio, and if none matches, the clip is rejected as being a voice-over.
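The offset search itself can be sketched as follows: given per-frame embeddings from the two streams of a SyncNet-style model, slide one sequence against the other (up to about one second, i.e. 25 frames at 25 fps) and keep the shift with the highest average similarity. The embedding inputs here are placeholders; the actual model is the network of [13].

import numpy as np

def best_av_offset(video_emb, audio_emb, max_shift=25):
    """video_emb, audio_emb: (T, D) arrays of per-frame embeddings.
    Returns the shift (in frames) that maximises mean cosine similarity."""
    best_shift, best_score = 0, -np.inf
    for shift in range(-max_shift, max_shift + 1):
        # align the two sequences under the candidate shift
        v = video_emb[shift:] if shift >= 0 else video_emb[:shift]
        a = audio_emb[:len(audio_emb) - shift] if shift >= 0 else audio_emb[-shift:]
        t = min(len(v), len(a))
        if t == 0:
            continue
        v, a = v[:t], a[:t]
        # mean cosine similarity over the overlapping window
        sims = np.sum(v * a, axis=1) / (
            np.linalg.norm(v, axis=1) * np.linalg.norm(a, axis=1) + 1e-8)
        if sims.mean() > best_score:
            best_score, best_shift = sims.mean(), shift
    return best_shift, best_score

The same score can also serve the speaker-detection role described above: compute it per candidate face, take the face with the highest score, and reject the clip as a voice-over when no face scores above some threshold.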
Sentence extraction.
The videos are divided into individual sentences/phrases using the punctuation in the transcript. The sentences are separated by full stops, commas and question marks. The sentences in the train-val and test sets are clipped to 100 characters or 6 seconds (a simplified sketch of this rule is given below).

The train-val and test sets are divided by videos (extracted from disjoint sets of original videos). Although we do not explicitly label the identities, it is unlikely that there are many identities that appear in both training and test sets, since the speakers do not generally appear on TED programs repeatedly. This is in contrast to the LRW and LRS2-BBC datasets, which are based on regular TV programs, hence the same characters are likely to appear in common from one episode to the next.

The pre-train set is more extensive, as it contains videos spanning the full duration of the face track, along with the corresponding subtitles. It is extracted from the same set of original YouTube videos as the train-val set. However, these videos may be shorter or longer than the full sentences included in the train-val and test sets, and are annotated with the alignment boundaries of every word.
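The sentence-extraction rule referenced above can be sketched as follows: cut the word-aligned transcript at full stops, commas and question marks, then keep only clips within the character and duration limits. The (word, start, end) tuple format is an assumption for illustration.

def extract_clips(words, max_chars=100, max_secs=6.0):
    """words: list of (token, start_time, end_time); tokens keep punctuation."""
    clips, current = [], []
    for token, start, end in words:
        current.append((token, start, end))
        if token and token[-1] in ".,?":  # sentence/phrase boundary
            clips.append(current)
            current = []
    if current:
        clips.append(current)
    kept = []
    for clip in clips:
        text = " ".join(t.strip(".,?") for t, _, _ in clip)
        duration = clip[-1][2] - clip[0][1]
        # enforce the limits used for the train-val and test sets
        if len(text) <= max_chars and duration <= max_secs:
            kept.append((text, clip[0][1], clip[-1][2]))
    return kept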
Dataset        Source        Split      Dates              Spk    Utt     Word inst.  Vocab   Hours
GRID [7]       -             -          -                  51     33,000  165k        51      27.5
MODALITY [8]   -             -          -                  35     5,880   8,085       182     31
LRW [5]        BBC           Train-val  01/2010 - 12/2015  -      514k    514k        500     165
                             Test       01/2016 - 09/2016  -      25k     25k         500     8
LRS2-BBC [6]   BBC           Pre-train  01/2010 - 02/2016  -      96k     2M          41k     195
                             Train-val  01/2010 - 02/2016  -      47k     337k        18k     29
                             Test       03/2016 - 09/2016  -      1,243   6,663       1,693   0.5
                             Text-only  01/2016 - 02/2016  -      8M      26M         60k     -
LRS3-TED       TED & TEDx    Pre-train  -                  5,090  119k    3.9M        51k     407
               (YouTube)     Train-val  -                  4,004  32k     358k        17k     30
                             Test       -                  451    1,452   11k         2,136   1
                             Text-only  -                  5,543  1.2M    7.2M        57k     -

Table 1: A comparison of publicly available lip reading datasets. Division of training, validation and test data; and the number of utterances, number of word instances and vocabulary size of each partition. Utt: Utterances.
3. Conclusion
In this document, we have briefly described the LRS3-TED audio-visual corpus. The dataset is useful for many applications including lip reading, audio-visual speech recognition, video-driven speech enhancement, as well as other audio-visual learning tasks. [6] reports the performance of some of the latest lip reading models on this dataset.
Acknowledgements.
Funding for this research is provided by the UK EPSRC CDT in Autonomous Intelligent Machines and Systems, the Oxford-Google DeepMind Graduate Scholarship, and by the EPSRC Programme Grant Seebibyte EP/M013774/1.
4. References

[1] J. S. Chung, A. Senior, O. Vinyals, and A. Zisserman, "Lip reading sentences in the wild," in Proc. CVPR, 2017.
[2] Y. M. Assael, B. Shillingford, S. Whiteson, and N. de Freitas, "LipNet: Sentence-level lipreading," arXiv preprint arXiv:1611.01599, 2016.
[3] T. Stafylakis and G. Tzimiropoulos, "Combining residual networks with LSTMs for lipreading," in Interspeech, 2017.
[4] B. Shillingford, Y. Assael, M. W. Hoffman, T. Paine, C. Hughes, U. Prabhu, H. Liao, H. Sak, K. Rao, L. Bennett et al., "Large-scale visual speech recognition," arXiv preprint arXiv:1807.05162, 2018.
[5] J. S. Chung and A. Zisserman, "Lip reading in the wild," in Proc. ACCV, 2016.
[6] T. Afouras, J. S. Chung, A. Senior, O. Vinyals, and A. Zisserman, "Deep audio-visual speech recognition," arXiv preprint, 2018.
[7] M. Cooke, J. Barker, S. Cunningham, and X. Shao, "An audio-visual corpus for speech perception and automatic speech recognition," The Journal of the Acoustical Society of America, vol. 120, no. 5, pp. 2421–2424, 2006.
[8] A. Czyzewski, B. Kostek, P. Bratoszewski, J. Kotus, and M. Szykulski, "An audio-visual corpus for multimodal automatic speech recognition," Journal of Intelligent Information Systems, pp. 1–26, 2017.
[9] A. Ephrat, I. Mosseri, O. Lang, T. Dekel, K. Wilson, A. Hassidim, W. T. Freeman, and M. Rubinstein, "Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation," CoRR, vol. abs/1804.03619, 2018.
[10] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg, "SSD: Single shot multibox detector," in Proc. ECCV. Springer, 2016, pp. 21–37.
[11] R. Lienhart, "Reliable transition detection in videos: A survey and practitioner's guide," International Journal of Image and Graphics, Aug 2001.
[12] J. Yuan and M. Liberman, "Speaker identification on the SCOTUS corpus," Journal of the Acoustical Society of America, vol. 123, no. 5, p. 3878, 2008.
[13] J. S. Chung and A. Zisserman, "Out of time: automated lip sync in the wild," in Workshop on Multi-view Lip-reading, ACCV, 2016.