Lip-reading with Densely Connected Temporal Convolutional Networks
Pingchuan Ma, Yujiang Wang†, Jie Shen, Stavros Petridis, Maja Pantic
Imperial College London, Facebook London, Samsung AI Center Cambridge
{pingchuan.ma16, yujiang.wang14, jie.shen07, stavros.petridis04, m.pantic}@imperial.ac.uk
* Equal Contribution. † Corresponding author.

Abstract
In this work, we present the Densely Connected Temporal Convolutional Network (DC-TCN) for lip-reading of isolated words. Although Temporal Convolutional Networks (TCN) have recently demonstrated great potential in many vision tasks, their receptive fields are not dense enough to model the complex temporal dynamics in lip-reading scenarios. To address this problem, we introduce dense connections into the network to capture more robust temporal features. Moreover, our approach utilises the Squeeze-and-Excitation block, a light-weight attention mechanism, to further enhance the model's classification power. Without bells and whistles, our DC-TCN method achieves 88.36% accuracy on the Lip Reading in the Wild (LRW) dataset and 43.65% on the LRW-1000 dataset, surpassing all baseline methods and setting the new state of the art on both datasets.
1. Introduction
Visual Speech Recognition, also known as lip-reading, is the task of recognising a speaker's speech content from visual information alone, typically the movement of the lips. Lip-reading can be extremely useful in scenarios where the audio is unavailable, and it has a broad range of applications, such as silent speech control systems [42], speech recognition with multiple simultaneous speakers, and aiding people with hearing impairment. In addition, lip-reading can also be combined with an acoustic recogniser to improve its accuracy.

Despite many recent advances, lip-reading is still a challenging task. Traditional methods usually follow a two-step approach. The first stage is to apply a feature extractor such as the Discrete Cosine Transform (DCT) [18, 36, 37] to the mouth region of interest (RoI), and then feed the extracted features into a sequential model (usually a Hidden Markov Model, or HMM in short) [14, 38, 13] to capture the temporal dynamics. Readers are referred to [57] for more details about these older methods.

The rise of deep learning has led to significant improvements in the performance of lip-reading methods. Similarly to traditional approaches, deep-learning-based methods usually consist of a feature extractor (front-end) and a sequential model (back-end). Autoencoder models were applied as the front-end in the works of [15, 27, 30] to extract deep bottleneck features (DBF), which are more discriminative than DCT features. Recently, the 3D-CNN (typically a 3D convolutional layer followed by a deep 2D Convolutional Network) has gradually become a popular front-end choice [41, 26, 31, 49]. As for the back-end models, Long Short-Term Memory (LSTM) networks were applied in [30, 41, 34] to capture both global and local temporal information. Other widely used back-end models include attention mechanisms [10, 32], self-attention modules [1], and Temporal Convolutional Networks (TCN) [5, 26].

Unlike Recurrent Neural Networks (RNN) such as LSTMs or Gated Recurrent Units (GRUs) [9], which rely on recurrent structures and gating mechanisms, Temporal Convolutional Networks (TCN) adopt fully convolutional architectures and have the advantage of faster convergence with longer temporal memory. The authors of [5] described a simple yet effective TCN architecture which outperformed baseline RNN methods, suggesting that TCNs can be a reasonable alternative to RNNs for sequence modelling problems. Following this work, it was further demonstrated in [26] that a multi-scale TCN can achieve better performance than RNNs on lip-reading of isolated words, and this is also the state-of-the-art model so far. Such a multi-scale TCN stacks the outputs from convolutions with multiple kernel sizes to obtain more robust temporal features, a strategy that has already been shown to be effective in other computer vision tasks utilising multi-scale information, such as semantic segmentation [6, 55, 7]. The TCN architectures in both works [5, 26] adopt dilated convolutions [53] to enlarge the receptive fields of the models. In lip-reading scenarios, a video sequence usually contains various subtle syllables that are essential to distinguish the word or sentence, and thus the model's ability to compactly cover those syllables is necessary and important.
However, the TCN architectures in [5, 26] utilise sparse connections and thus may not observe the temporal features thoroughly and densely.

Inspired by the recent success of Densely Connected Networks [20, 51, 16], we introduce dense connections into the TCN structure and propose the Densely Connected TCN (DC-TCN) for word-level lip-reading. DC-TCNs are able to cover the temporal scales in a denser fashion and are thus more sensitive to words that may be challenging for previous TCN architectures [5, 26]. Specifically, we explore two approaches of adding dense connections in this paper. One is a fully dense (FD) TCN model, where the input of each Temporal Convolutional (TC) layer is the concatenation of the feature maps from all preceding TC layers. The other DC-TCN variant employs a partially dense (PD) structure. We further utilise the Squeeze-and-Excitation (SE) attention mechanism [19] in both DC-TCN variants, which further enhances their classification power.

To validate the effectiveness of the proposed DC-TCN models, we have conducted experiments on the Lip Reading in the Wild (LRW) dataset [11] and the LRW-1000 dataset [52], which are the largest publicly available benchmark datasets for unconstrained lip-reading in English and in Mandarin, respectively. Our final model achieves 88.36% accuracy on LRW, surpassing the current state-of-the-art method [26] (85.3%) by around 3.1%. On LRW-1000, our method attains 43.65% accuracy and also outperforms all baselines, demonstrating the generality and strength of the proposed DC-TCN.

In general, this paper presents the following contributions:
1. We propose a Densely Connected Temporal Convolutional Network (DC-TCN) for lip-reading of isolated words, which provides denser and more robust temporal features.
2. Two DC-TCN variants with Squeeze-and-Excitation blocks [19], namely the fully dense (FD) and partially dense (PD) architectures, are introduced and evaluated.
3. Our method achieves 88.36% top-1 accuracy on the LRW dataset and 43.65% on LRW-1000, surpassing all baseline methods and setting a new record on these two datasets.
2. Related Works
Early deep learning methods for lip-reading of isolated words were mainly evaluated on small-scale datasets recorded in constrained environments, such as OuluVS2 [2] and CUAVE [29]. The authors of [30] proposed to use the combination of deep bottleneck features (DBF) and DCT features to train an LSTM classifier, which is not end-to-end trainable. An end-to-end trainable model was first demonstrated in [46] using Histogram of Oriented Gradients (HOG) features with LSTMs, and later Petridis et al. [34] trained an end-to-end model with a Bidirectional LSTM back-end where the first and second derivatives of the temporal features are also computed, achieving much better results than [46]. The lip-reading accuracy on those small-scale datasets was further improved by the introduction of multi-view visual information [34] and audio-visual fusion [33, 35].

The Lip Reading in the Wild (LRW) dataset [11] is the first and the largest publicly available dataset for unconstrained lip-reading, with a 500-word English vocabulary. It has encouraged the emergence of numerous deep learning models with increasingly powerful word-recognition abilities. The WLAS sequence-to-sequence model [10] consists of a VGG network [39] and an LSTM with dual attention on the visual and audio streams, respectively. LipNet [3] is the first approach to employ a 3D-CNN to extract spatio-temporal features, which are classified by Bidirectional Gated Recurrent Units (BGRUs). A 2D Residual Network (ResNet) [17] on top of a 3D convolutional layer is used as the front-end in [41] (with an LSTM as the back-end). Two 3D ResNets are organised in a two-stream fashion (one stream for images and another for optical flow) in the work of [49], learning more robust spatio-temporal features at the cost of a larger network size. Cheng et al. [8] propose a pose-augmentation technique to enhance lip-reading performance under extreme poses. Zhang et al. [54] propose to incorporate other facial parts in addition to the mouth region for lip-reading of isolated words, and mutual-information constraints are added in [56] to produce more discriminative features. The current state-of-the-art performance on LRW is achieved by [26], which replaced RNN back-ends with a multi-scale Temporal Convolutional Network (TCN). The same model achieves state-of-the-art performance on the LRW-1000 dataset [52], which is currently the largest lip-reading dataset for Mandarin.
Although RNN networks such as LSTMs or GRUs have commonly been used in lip-reading methods to model the temporal dependencies, alternative light-weight, faster-converging CNN models have started to gain attention in recent works. Such efforts can be traced back to the Time-Delay Neural Networks (TDNN) [45] of the 1980s. Subsequently, models with temporal convolutions were developed, including WaveNet [43] and Gated ConvNets [12]. Lately, Bai et al. [5] described a simple and generic Temporal Convolutional Network (TCN) architecture that outperformed baseline RNN models in various sequence modelling problems. Although the TCN introduced in [5] is a causal one, in which no future features beyond the current time step can be seen in order to prevent the leakage of future information, the model can also be modified into a non-causal variant without such constraints. The work of [26] adopts a non-causal TCN design, where the linear TC block architecture is replaced with a multi-scale one. To the best of our knowledge, this work [26] has achieved the current state-of-the-art performance on the LRW dataset. However, the receptive field scales in such TCN architectures may not be able to cover the full temporal range in lip-reading scenarios, which can be solved by the employment of dense connections.
Densely connected networks have received broad attention since their inception in [20], where a convolutional layer receives inputs from all its preceding layers. Such a densely connected structure can effectively alleviate the vanishing-gradient problem, since deeper layers are directly connected to shallower ones, thus benefiting feature propagation. The authors of [51] applied dense connections to dilated convolutions to enlarge the receptive field sizes and to extract a denser feature pyramid for semantic segmentation. Recently, a simple dense TCN for Sign Language Translation was presented in [16]. Our work is the first to explore a densely connected TCN for word-level lip-reading, where we present both a fully dense (FD) and a partially dense (PD) block architecture, together with the channel-wise attention method described in [19].
Attention mechanisms [4, 25, 44, 48] can be used to teach the network to focus on the more informative locations of the input features. In lip-reading, attention mechanisms have mainly been developed for sequence models like LSTMs or GRUs. A dual attention mechanism is proposed in [10] for the visual and audio input signals of LSTM models. Petridis et al. [32] have coupled the self-attention block [44] with a CTC loss to improve the performance of Bidirectional LSTM classifiers. Those attention methods are somewhat computationally expensive and are inefficient to integrate into TCN structures. In this paper, we adopt a light-weight attention block, the Squeeze-and-Excitation (SE) network [19], to introduce channel-wise attention into the DC-TCN.

Figure 1. The general framework of our method. We utilise a 3D convolutional layer plus a 2D ResNet-18 to extract features from the input sequence, while the proposed Densely Connected TCN (DC-TCN) models the temporal dependencies. C_1, C_2 and C_3 denote different channel numbers, while C_4 refers to the total number of word classes to be predicted. The batch size dimension is ignored for simplicity.

In particular, denote the input tensor of an SE block as U ∈ R^{C×H×W}, where C is the channel number. Its channel-wise descriptor z ∈ R^{C×1×1} is first obtained by a global average pooling operation that squeezes the spatial dimensions H × W, i.e. z = GlobalPool(U), where GlobalPool denotes global average pooling. After that, an excitation operation is applied to z to obtain the channel-wise dependencies s ∈ R^{C×1×1}, which can be expressed as s = σ(W_u δ(W_v z)). Here, W_v ∈ R^{(C/r)×C} and W_u ∈ R^{C×(C/r)} are learnable weights, σ and δ stand for the sigmoid and ReLU activation functions, and r represents the reduction ratio. The final output of the SE block is simply the channel-wise broadcast multiplication of s and U. Readers are referred to [19] for more details.
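For concreteness, a minimal PyTorch sketch of such an SE block, following the equations above, might look as follows; the layer names and the default reduction ratio r = 16 are our own choices.

import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Excitation: two fully connected layers (W_v, W_u) with a C/r bottleneck.
        self.fc_v = nn.Linear(channels, channels // reduction)
        self.fc_u = nn.Linear(channels // reduction, channels)

    def forward(self, u: torch.Tensor) -> torch.Tensor:
        # u: (batch, C, H, W). Squeeze: global average pooling over H x W.
        z = u.mean(dim=(2, 3))                                   # (batch, C)
        # Excitation: s = sigmoid(W_u * ReLU(W_v * z)).
        s = torch.sigmoid(self.fc_u(torch.relu(self.fc_v(z))))   # (batch, C)
        # Channel-wise re-weighting of the input tensor.
        return u * s.unsqueeze(-1).unsqueeze(-1)

x = torch.randn(2, 64, 7, 7)
print(SEBlock(64)(x).shape)    # torch.Size([2, 64, 7, 7])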
3. Methodology
Fig. 1 depicts the general framework of our method. The input is the cropped grey-scale mouth RoIs with shape T × H × W, where T stands for the temporal dimension and H, W represent the height and width of the mouth RoIs, respectively. Note that we have ignored the batch size for simplicity. Following [41, 26], we first utilise a 3D convolutional layer to obtain spatio-temporal features with shape T × H_1 × W_1 × C_1, where C_1 is the feature channel number. On top of this layer, a 2D ResNet-18 [17] is applied to produce features with shape T × H_2 × W_2 × C_2. The next layer applies average pooling to summarise the spatial knowledge and reduce the dimensionality to T × C_2. After this pooling operation, the proposed Densely Connected TCN (DC-TCN) is employed to model the temporal dynamics. The output tensor (T × C_3) is passed through another average pooling layer to summarise the temporal information, and the word class probabilities over the C_4 classes are produced by the succeeding softmax layer. The whole model is end-to-end trainable.

Figure 2. An illustration of the non-causal temporal convolution layers, where k is the filter size and d is the dilation rate. The receptive fields for the filled neurons are shown.

To introduce the proposed Densely Connected TCN (DC-TCN), we start from a brief explanation of the temporal convolution in [5]. A temporal convolution is essentially a 1-D convolution operating on the temporal dimension, and a dilation [53] is usually inserted into the convolutional filter to enlarge the receptive field. Particularly, for a 1-D feature x ∈ R^T, where T is the temporal dimensionality, define a discrete function g : Z⁺ → R such that g(s) = x_s, where s ∈ [1, T] ∩ Z⁺; let Λ_k = [1, k] ∩ Z⁺ and let f : Λ_k → R be a 1-D discrete filter of size k. A temporal convolution ∗_d with dilation rate d is then described as

y_p = (g ∗_d f)(p) = Σ_{s + dt = p} g(s) f(t),    (1)

where y ∈ R^T is the 1-D output feature and y_p refers to its p-th element. Note that zero padding is used to keep the temporal dimensionality unchanged in y. The temporal convolution described in Eq. 1 is non-causal, i.e. the filters can observe features from every time step, similarly to that of [26]. Fig. 2 provides an intuitive example of non-causal temporal convolution layers.

Let TC_l be the l-th temporal convolution layer, and let x_l ∈ R^{T×C_i} and y_l ∈ R^{T×C_o} be its input and output tensors with C_i and C_o channels, respectively, i.e. y_l = TC_l(x_l). In common TCN architectures, y_l is directly fed into the (l+1)-th temporal convolution layer TC_{l+1} to produce its output y_{l+1}:

x_{l+1} = y_l,   y_{l+1} = TC_{l+1}(x_{l+1}).    (2)

In DC-TCN, dense connections [20] are utilised and the input of the following TC layer (TC_{l+1}) is the concatenation of x_l and y_l:

x_{l+1} = [x_l, y_l],   y_{l+1} = TC_{l+1}(x_{l+1}).    (3)

Note that x_{l+1} ∈ R^{T×(C_i + C_o)} has C_o additional channels compared with x_l, where C_o is defined as the growth rate following [20].

We embed the dense connections of Eq. 3 to constitute the block of DC-TCN. More formally, we define a DC-TCN block to consist of temporal convolution (TC) layers with arbitrary but unique combinations of filter size k ∈ K and dilation rate d ∈ D, where K and D stand for the sets of all available filter sizes and dilation rates for this block, respectively.
For example, if we define a block to have a filter-size set K containing two sizes and a dilation-rate set D containing two rates, there will be four TC layers in this block, one for each combination of filter size and dilation rate.

In this paper, we study two approaches for constructing DC-TCN blocks. The first approach applies dense connections to all TC layers; we denote this as the fully dense (FD) block, illustrated at the top of Fig. 3, where the block uses two filter sizes and two dilation rates. As shown in the figure, the output tensor of each TC layer is consistently concatenated to its input tensor, increasing the input channels by C_o (the growth rate) each time. Note that a Squeeze-and-Excitation (SE) block [19] is applied to the input tensor of each TC layer to introduce channel-wise attention for better performance. Since the output of the top TC layer in the block typically has many more channels than the block input (e.g. C_i + 4C_o channels in Fig. 3), we employ a 1×1 convolution to reduce the channel number from C_i + 4C_o to C_r for efficiency (the "Reduce Layer" in Fig. 3). A 1×1 convolution is also used when C_i ≠ C_r. In the fully dense architecture, TC layers are stacked in a receptive-field-ascending order.

Another DC-TCN block design is depicted at the bottom of Fig. 3, which we denote as the partially dense (PD) block. In the PD block, filters with identical dilation rates are employed in a multi-scale fashion, such as the two parallel TC layers in Fig. 3 (bottom), and their outputs are concatenated to the input simultaneously. The PD block is essentially a hybrid of the multi-scale architecture and densely connected networks, and is thus expected to benefit from both. As in the FD architecture, SE attention is attached to every input tensor, and the reduce layer is used in the same way as in fully dense blocks.

A DC-TCN block, either fully or partially dense, can be seamlessly stacked with another block to obtain finer features. A fully dense / partially dense DC-TCN model is formed by stacking B identical FD / PD blocks, where B denotes the number of blocks.

There are several important network parameters to be determined for a DC-TCN model, including the filter sizes K and dilation rates D in each block, the growth rate C_o, the reduce-layer channels C_r and the number of blocks B. The optimal DC-TCN architecture, along with the process of determining it, can be found in Sec. 4.3.

Figure 3. The architectures of the fully dense block (top) and the partially dense block (bottom) in DC-TCN. For simplicity, the block shown uses two filter sizes and two dilation rates. In both blocks, Squeeze-and-Excitation (SE) attention is attached after each input tensor, and a reduce layer is involved for channel reduction.

Figure 4. The block receptive field sizes when combining four TC layers (with receptive field sizes of 3, 5, 9 and 17) in a linear (left) or in a multi-scale (right) manner.

The receptive field size R for a filter with kernel size k and dilation rate d can be calculated as

R = k + (d − 1)(k − 1).    (4)

Stacking two TC layers with receptive fields R_1 and R_2 produces a receptive field of size R_1 + R_2 − 1. The receptive field sizes for the four TC layers described in Fig. 3 are 3, 5, 9 and 17, respectively. If they are connected linearly as in [5], the resulting model can see a temporal range of (3, 7, 15, 31). A multi-scale structure [26] will lead to receptive fields of (3, 5, 9, 17). The linearly connected architecture retains a larger maximum receptive field size than the multi-scale one; however, it also generates sparser temporal features.
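The receptive-field arithmetic above can be checked with a short script. The four (k, d) pairs below are illustrative placeholders chosen only to reproduce the single-layer receptive fields of 3, 5, 9 and 17 discussed above; the dense-connection enumeration anticipates the comparison made next.

from itertools import combinations

def rf(k, d):
    # Eq. 4: receptive field of a single TC layer.
    return k + (d - 1) * (k - 1)

def stack(rfs):
    # Stacking rule: combining receptive fields R_1 and R_2 gives R_1 + R_2 - 1.
    total = 1
    for r in rfs:
        total += r - 1
    return total

layers = [rf(3, 1), rf(5, 1), rf(3, 4), rf(5, 4)]     # e.g. 3, 5, 9, 17
linear = {stack(layers[:i + 1]) for i in range(len(layers))}
multi_scale = set(layers)
# With dense connections, a path may pass through any subset of the stacked layers.
dense = {stack(c) for n in range(1, len(layers) + 1) for c in combinations(layers, n)}

print(sorted(linear))              # scales seen by a linearly connected TCN
print(sorted(multi_scale))         # scales seen by a multi-scale TCN
print(sorted(dense), len(dense))   # denser pyramid of scales with dense connections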
We have illustrated the receptive fields of these two block architectures in Fig. 4. Unlike the linearly connected [5] or multi-scale [26] TCN, our DC-TCN can extract temporal features at denser scales and thus increase the features' robustness without reducing the maximum receptive field size. Fig. 5 depicts the temporal ranges covered by our partially dense and fully dense blocks, which consist of the same four TC layers as shown in Fig. 4. Since we have introduced dense connections ("DC" in the figure) into the structure, a TC layer can see all the preceding layers, and therefore the variety of its receptive field sizes is significantly enhanced. As shown in Fig. 5 (left), our partially dense block can observe a total of eight different ranges, which is double that of the linear or multi-scale architectures (only 4 scales). The fully dense block in Fig. 5 (right) can produce a feature pyramid from 15 different receptive fields, with the maximum being 31 (larger than multi-scale and equal to linear). Such dense receptive fields ensure that information from a wide range of scales can be observed by the model, and thus strengthen the model's expressive power.
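Putting the pieces above together, the following is a minimal PyTorch sketch of one fully dense DC-TCN block under our reading of the description: dense concatenation as in Eq. 3, an SE block applied to each TC layer's input, and a 1×1 reduce layer. Normalisation, activation functions, dropout and the exact ordering inside each TC layer are simplified, and the kernel sizes and dilation rates are placeholders; the growth rate 128 and the 512 reduce channels follow the values reported in Sec. 4.

import torch
import torch.nn as nn

class SE1d(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(channels, channels // reduction), nn.ReLU(),
                                nn.Linear(channels // reduction, channels), nn.Sigmoid())
    def forward(self, x):                        # x: (batch, C, T)
        s = self.fc(x.mean(dim=2))               # squeeze over time, excite over channels
        return x * s.unsqueeze(-1)

class FullyDenseBlock(nn.Module):
    def __init__(self, in_ch, growth, reduce_ch, kernel_sizes=(3, 5), dilations=(1, 4)):
        super().__init__()
        self.layers = nn.ModuleList()
        ch = in_ch
        # One TC layer per (kernel size, dilation) pair, in receptive-field-ascending order.
        for d in dilations:
            for k in kernel_sizes:
                pad = d * (k - 1) // 2            # non-causal: pad both sides, keep length T
                self.layers.append(nn.Sequential(
                    SE1d(ch),
                    nn.Conv1d(ch, growth, kernel_size=k, dilation=d, padding=pad)))
                ch += growth                      # dense connection grows the channel count
        self.reduce = nn.Conv1d(ch, reduce_ch, kernel_size=1)   # "Reduce Layer"

    def forward(self, x):                         # x: (batch, C_in, T)
        for layer in self.layers:
            x = torch.cat([x, layer(x)], dim=1)   # Eq. 3: concatenate input and output
        return self.reduce(x)

x = torch.randn(2, 512, 29)                       # e.g. 29 frames of 512-d features
print(FullyDenseBlock(512, growth=128, reduce_ch=512)(x).shape)   # (2, 512, 29)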
4. Experiments
We have conducted our experiments on the Lip Reading in the Wild (LRW) [11] and the LRW-1000 [52] datasets, which are the largest publicly available datasets for lip-reading of isolated words in English and in Mandarin, respectively.

Figure 5. The block receptive field sizes when combining four TC layers (with receptive field sizes of 3, 5, 9 and 17) in our partially dense (left) or fully dense (right) block. The dense connection described in Eq. 3 is denoted as "DC". Compared with the structures in Fig. 4, our DC-TCN can observe denser temporal scales without shrinking the maximum receptive field and thus can produce more robust features.

The LRW dataset has a vocabulary of 500 English words. The sequences in LRW are captured from more than 1 000 speakers in BBC programs, and each sequence has a duration of 1.16 seconds (29 video frames). There are a total of 538 766 sequences in this dataset, which are split into 488 766/25 000/25 000 for training/validation/testing. This is a challenging dataset due to the large number of subjects and the variations in head poses and lighting conditions.

The LRW-1000 dataset contains a total of 718 018 samples for 1 000 Mandarin words, recorded from more than 2 000 subjects. The average duration of each sequence is 0.3 seconds, and the total length of all sequences is about 57 hours. The training/validation/testing splits consist of 603 193/63 237/51 588 samples, respectively. This dataset is even more challenging than LRW considering its huge variations in speaker properties, background clutter, scale, etc.
Pre-processing
We have pre-processed the LRW dataset following the same method as described in [26]. We first detect 68 facial landmarks using dlib [22]. Based on the coordinates of these landmarks, the face images are warped to a reference mean face shape. Finally, a mouth RoI of size 96 × 96 is cropped from each warped face image and converted into grayscale. For LRW-1000, we simply use the provided pre-cropped mouth RoIs and resize them to 122 × 122 following [52].
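As a rough illustration only, the sketch below crops a fixed-size mouth RoI around the centroid of the dlib mouth landmarks (points 48 to 67 of the 68-point annotation); it skips the mean-face warping step described above, and the landmark-model path is an assumption.

import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")  # hypothetical path

def crop_mouth(frame_bgr: np.ndarray, size: int = 96) -> np.ndarray:
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    face = detector(gray)[0]                              # assume one face per frame
    shape = predictor(gray, face)
    pts = np.array([[shape.part(i).x, shape.part(i).y] for i in range(48, 68)])
    cx, cy = pts.mean(axis=0).astype(int)                 # mouth centre
    half = size // 2
    roi = gray[cy - half:cy + half, cx - half:cx + half]
    return cv2.resize(roi, (size, size))                  # guard against border clipping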
Evaluation metric
Top-1 accuracy is used to evaluate the model performance, since we are solving a word-level lip-reading classification problem.
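For reference, top-1 accuracy simply takes the arg-max over the word logits; the function name and tensor layout below are our own.

import torch

def top1_accuracy(logits: torch.Tensor, labels: torch.Tensor) -> float:
    # logits: (batch, num_classes), labels: (batch,)
    return (logits.argmax(dim=1) == labels).float().mean().item()

print(top1_accuracy(torch.randn(8, 500), torch.randint(0, 500, (8,))))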
Training settings
The whole model in Fig. 1, including the proposed DC-TCN, is trained in an end-to-end fashion, where the weights are randomly initialised. We employ identical training settings for the LRW and LRW-1000 datasets, except for some slight differences to cope with their different input dimensions. We train for 80 epochs with a batch size of 32/16 on LRW/LRW-1000, respectively, and measure the top-1 accuracy on the validation set to determine the best-performing checkpoint weights. We adopt AdamW [24] as the optimiser, where the initial learning rate is set to 0.0003/0.0015 for LRW and LRW-1000, respectively. A cosine scheduler [23] is used to steadily decrease the learning rate from the initial value to 0. BatchNorm layers [21] are embedded to accelerate training convergence, and we use dropout with a dropping probability of 0.2 for regularisation. The reduction ratio in the SE block is set to 16, and the channel number C_2 of the DC-TCN's input tensor is set to 512. Besides, we adopt the variable-length augmentation proposed in [26] to increase the model's temporal robustness.
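A minimal sketch of the optimisation settings described above (AdamW with a cosine schedule decaying to zero over 80 epochs); the network and the data here are trivial stand-ins for the real model and data loader, and only the LRW values (batch size 32, initial learning rate 0.0003) are used.

import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(29 * 512, 500))     # stand-in for the real network
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)        # LRW initial learning rate
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=80, eta_min=0.0)
criterion = nn.CrossEntropyLoss()

for epoch in range(80):
    for _ in range(2):                                             # stand-in for the data loader
        feats = torch.randn(32, 29, 512)                           # batch of 32, 29-frame features
        labels = torch.randint(0, 500, (32,))
        optimizer.zero_grad()
        criterion(model(feats), labels).backward()
        optimizer.step()
    scheduler.step()                                               # cosine decay towards 0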
Explorations of DC-TCN structures

We evaluate DC-TCN with different structure parameters on the LRW dataset to determine the best-performing one. In particular, we first validate the effectiveness of different filter sizes K and dilation rates D in each DC-TCN block while freezing the other hyper-parameters, such as the growth rate C_o and the reduce-layer channels C_r. We then select the most effective K and D values and fine-tune the other structural options, including the growth rate C_o and whether to use SE attention. We explore structures for both FD and PD blocks.
Baseline methods

On the LRW dataset, the performance of the proposed DC-TCN model is compared with the following baselines: 1) the method proposed in the LRW paper [11] with a VGG backbone [39], 2) the WLAS model [10], 3) the work of [41] where a ResNet [17] and an LSTM are used, 4) the End-to-End Audio-Visual network [31], 5) the multi-grained spatio-temporal model in [47], 6) the two-stream 3D CNN in [49], 7) the Global-Local Mutual Information Maximisation method in [56], 8) the face-region cutout approach of [54], 9) the multi-modality speech recognition method in [50], and 10) the multi-scale TCN proposed in [26].

LRW-1000 is a relatively new dataset and there are somewhat fewer works on it. We have selected the following methods as baselines on this dataset: 1) the work of [41], 2) the multi-grained spatio-temporal model in [47], 3) the GLMIM method in [56], and 4) the multi-scale TCN [26].
Implementations
We implement our method in the PyTorch framework [28]. Experiments are conducted on a server with eight 1080Ti GPUs. It takes around four days to train a single model end-to-end on LRW using one GPU, and five days for LRW-1000. Note that this training time is significantly lower than that of other works such as [31], which requires approximately three weeks per GPU for a training cycle.

Table 1. Performance on the LRW dataset of DC-TCN with different filter sizes K and dilation rates D. The top-1 accuracy of fully dense (FD) and partially dense (PD) blocks is reported. Other network parameters are fixed, and for simplicity all SE attention is temporarily disabled.

Growth rate C_o | Adding SE | Acc. (%, FD) | Acc. (%, PD)
64 | – | 87.11 | 87.50

Table 2. Performance on the LRW dataset of DC-TCN with different growth rates C_o and with or without SE. The top-1 accuracy of fully dense (FD) and partially dense (PD) blocks is reported. The filter sizes K and dilation rates D are set to the optimal values found in Table 1, while the reduce-layer channels C_r and the total block number B are set to 512 and 4, respectively.

DC-TCN architectures
To find an optimal structure of DC-TCN, we first evaluate the impact of various filter sizes K and dilation rates D on LRW while keeping the other hyper-parameters fixed. In particular, we fix the growth rate C_o and the reduce-layer channels C_r to 64 and 512, respectively, and stack a total of 4 DC-TCN blocks without SE attention. As shown in Table 1, both fully dense (FD) and partially dense (PD) blocks achieve optimal performance with the same setting of K and D (three filter sizes and three dilation rates), and we therefore use this setting for K and D in the subsequent experiments.

Once the optimal values of K and D are found, we further investigate the effect of different growth rate C_o settings and the addition of the SE block, while the reduce-layer channels C_r and the total block number B are set to 512 and 4, respectively. As shown in Table 2, it is evident that: 1) the performance of using 128 for C_o exceeds that of using 64, and 2) the effectiveness of adding SE in the block is validated, since it consistently leads to higher accuracy when C_o stays the same.

Methods | Front-end | Back-end | Acc. (%)
LRW [11] | VGG-M | – | 61.1
WLAS [10] | VGG-M | LSTM | 76.2
ResNet+BLSTM [41] | 3D Conv + ResNet34 | BLSTM | 83.0
End-to-End AVR [31] | 3D Conv + ResNet34 | BLSTM | 83.4
Multi-grained ST [47] | ResNet34 + DenseNet3D | Conv-BLSTM | 83.3
Two-stream 3D CNN [49] | (3D Conv) × … | … | …
Multi-scale TCN [26] | 3D Conv + ResNet18 | MS-TCN | 85.3
Ours | 3D Conv + ResNet18 | DC-TCN (FD) | 88.01
Ours | 3D Conv + ResNet18 | DC-TCN (PD) | 88.36
Table 3. A comparison of the performance between the baseline methods and ours on the LRW dataset. We report the best results from the fully dense (FD) and the partially dense (PD) blocks, respectively.
To sum up, we select the following hyper-parameters as the final DC-TCN model configuration for both FD and PD: the filter sizes K and dilation rates D in each block are set to the optimal three-element sets identified in Table 1, the growth rate is C_o = 128, the reduce-layer channels are C_r = 512 and the block number is B = 4, with SE attention added after each input tensor.

Performance on the LRW and LRW-1000 datasets
In Table 3 and Table 4 we report the performance of our method and various baseline approaches on the LRW and LRW-1000 datasets, respectively. On LRW, our method achieves an accuracy of 88.01% (FD blocks) and 88.36% (PD blocks), which is the new state-of-the-art performance on this dataset, with an absolute improvement of around 3.1% over the current state-of-the-art method [26]. Besides, our method also produces higher top-1 accuracy (43.65% and 43.11% using PD and FD, respectively) than the best baseline method [26] (41.4%) on LRW-1000, which further validates the generality and effectiveness of the proposed DC-TCN model.

Methods | Front-end | Back-end | Acc. (%)
ResNet+LSTM [52] | 3D Conv + ResNet34 | LSTM | 38.2
Multi-grained ST [47] | ResNet34 + DenseNet3D | Conv-BLSTM | 36.9
GLMIM [56] | 3D Conv + ResNet18 | BGRU | 38.79
Multi-scale TCN [26] | 3D Conv + ResNet18 | MS-TCN | 41.4
Ours | 3D Conv + ResNet18 | DC-TCN (PD) | 43.65
Ours | 3D Conv + ResNet18 | DC-TCN (FD) | 43.11

Table 4. A comparison of the performance between the baseline methods and ours on the LRW-1000 dataset.

Table 5. Performance when N frames (N = 0, 1, 2, 3, 4, 5) are randomly removed from each testing sequence, for End-to-End AVR [31], MS-TCN [26] and our DC-TCN (PD and FD).

Difficulty Categories
To intuitively illustrate why our DC-TCN can outperform the baseline methods, we examine the classification rates of different methods on five word categories with various difficulty levels. To be specific, we divide the 500 classes in the LRW test set into five categories (100 words per category) based on their classification difficulty in [26], namely "very easy", "easy", "medium", "difficult" and "very difficult". We then compare the performance of our DC-TCN (FD and PD) with two baseline methods (End-to-End AVR [31] and Multi-scale TCN [26]) on those five difficulty categories, as demonstrated in Fig. 6. We observe that our methods result in slightly better performance than the baselines on the "very easy" and "easy" categories; however, the improvements over the baselines are more significant on the other three groups, especially on the "difficult" and "very difficult" categories. Since the improvement of our methods is mainly achieved on those more difficult words, it is reasonable to deduce that our DC-TCN can extract more robust temporal features.

Figure 6. A comparison of our method and two baseline methods (End-to-End AVR [31] and Multi-scale TCN [26]) on the five difficulty categories of the LRW test set. Our method shows significant improvement over the baselines on the more challenging word classes, which demonstrates that our DC-TCN models can provide more robust temporal features.
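A sketch of how such difficulty categories could be formed: the 500 LRW classes are ranked by a reference per-class accuracy and split into five groups of 100 words each, from "very easy" to "very difficult". The accuracies below are random placeholders, not the real values from [26].

import numpy as np

rng = np.random.default_rng(0)
ref_acc = rng.uniform(0.4, 1.0, size=500)        # placeholder per-class accuracies
order = np.argsort(-ref_acc)                     # easiest (highest accuracy) classes first
names = ["very easy", "easy", "medium", "difficult", "very difficult"]
categories = {name: order[i * 100:(i + 1) * 100] for i, name in enumerate(names)}

for name, cls in categories.items():
    print(name, "mean reference accuracy:", ref_acc[cls].mean().round(3))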
Variable Lengths
We further evaluate the temporal robustness of different models against video sequences with variable lengths, i.e. N frames are randomly dropped from each testing sequence in the LRW dataset, where N ranges from 0 to 5. As shown in Table 5, the performance of End-to-End AVR [31] (here with a 3D Conv + ResNet18 front-end and a BGRU back-end) drops significantly as more frames are randomly removed from the testing sequences. In contrast, MS-TCN [26] and our DC-TCN (both PD and FD) demonstrate better tolerance to such frame removals, mainly owing to the use of variable-length augmentation [26] during training. Besides, the accuracy of our models (both PD and FD) consistently outperforms that of MS-TCN [26] regardless of the number of frames removed, which verifies the superior temporal robustness of our method.
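A sketch of the frame-dropping protocol used in this evaluation: N frames are removed at random temporal positions from each test clip before it is fed to the model. The clip layout (channels, frames, height, width) is an assumption.

import torch

def drop_n_frames(clip: torch.Tensor, n: int) -> torch.Tensor:
    # clip: (C, T, H, W); returns a clip with T - n frames, temporal order preserved.
    t = clip.shape[1]
    keep = torch.randperm(t)[: t - n].sort().values
    return clip[:, keep]

clip = torch.randn(1, 29, 96, 96)                 # one grayscale 29-frame sequence
print(drop_n_frames(clip, 3).shape)               # torch.Size([1, 26, 96, 96])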
5. Conclusion
We have introduced a Densely Connected Temporal Convolutional Network (DC-TCN) for word-level lip-reading in this paper. Characterised by the dense connections and the SE attention mechanism, the proposed DC-TCN can capture more robust features at denser temporal scales and therefore improves upon the performance of the original TCN architectures. DC-TCN has surpassed the performance of all baseline methods on both the LRW dataset and the LRW-1000 dataset. To the best of our knowledge, this is the first attempt to adopt a densely connected TCN with SE attention for lip-reading of isolated words, resulting in new state-of-the-art performance.
Acknowledgements
The work of Pingchuan Ma has been partially supported by Honda and the "AWS Cloud Credits for Research" program. The work of Yujiang Wang has been partially supported by the China Scholarship Council (No. 201708060212) and the EPSRC project EP/N007743/1 (FACER2VM).

References
[1] Triantafyllos Afouras, Joon Son Chung, Andrew Senior, Oriol Vinyals, and Andrew Zisserman. Deep audio-visual speech recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.
[2] Iryna Anina, Ziheng Zhou, Guoying Zhao, and Matti Pietikäinen. OuluVS2: A multi-view audiovisual database for non-rigid mouth motion analysis. In Proceedings of FG, volume 1, pages 1–5, 2015.
[3] Yannis M. Assael, Brendan Shillingford, Shimon Whiteson, and Nando de Freitas. LipNet: Sentence-level lipreading. CoRR, abs/1611.01599, 2016.
[4] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. In Proceedings of ICLR, 2015.
[5] Shaojie Bai, J. Zico Kolter, and Vladlen Koltun. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271, 2018.
[6] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4):834–848, 2017.
[7] Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587, 2017.
[8] Shiyang Cheng, Pingchuan Ma, Georgios Tzimiropoulos, Stavros Petridis, Adrian Bulat, Jie Shen, and Maja Pantic. Towards pose-invariant lip-reading. In Proceedings of ICASSP, pages 4357–4361, 2020.
[9] Kyunghyun Cho, Bart van Merrienboer, Çağlar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of EMNLP, pages 1724–1734, 2014.
[10] Joon Son Chung, Andrew Senior, Oriol Vinyals, and Andrew Zisserman. Lip reading sentences in the wild. In Proceedings of CVPR, pages 3444–3453, 2017.
[11] Joon Son Chung and Andrew Zisserman. Lip reading in the wild. In Proceedings of ACCV, pages 87–103, 2016.
[12] Yann N. Dauphin, Angela Fan, Michael Auli, and David Grangier. Language modeling with gated convolutional networks. In Proceedings of ICML, volume 70, pages 933–941, 2017.
[13] Stéphane Dupont and Juergen Luettin. Audio-visual speech modeling for continuous speech recognition. IEEE Transactions on Multimedia, 2(3):141–151, 2000.
[14] Virginia Estellers, Mihai Gurban, and Jean-Philippe Thiran. On dynamic stream weighting for audio-visual speech recognition. IEEE Transactions on Audio, Speech, and Language Processing, 20(4):1145–1157, 2011.
[15] Jonas Gehring, Yajie Miao, Florian Metze, and Alex Waibel. Extracting deep bottleneck features using stacked auto-encoders. In Proceedings of ICASSP, pages 3377–3381, 2013.
[16] Dan Guo, Shuo Wang, Qi Tian, and Meng Wang. Dense temporal convolution network for sign language translation. In Proceedings of IJCAI, pages 744–750, 2019.
[17] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of CVPR, pages 770–778, 2016.
[18] Xiaopeng Hong, Hongxun Yao, Yuqi Wan, and Rong Chen. A PCA based visual DCT feature extraction method for lip-reading. Pages 321–326, 2006.
[19] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-excitation networks. In Proceedings of CVPR, pages 7132–7141, 2018.
[20] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks. In Proceedings of CVPR, pages 4700–4708, 2017.
[21] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of ICML, volume 37, pages 448–456, 2015.
[22] Davis E. King. Dlib-ml: A machine learning toolkit. Journal of Machine Learning Research, 2009.
[23] Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts. In Proceedings of ICLR, 2017.
[24] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In Proceedings of ICLR, 2019.
[25] Thang Luong, Hieu Pham, and Christopher D. Manning. Effective approaches to attention-based neural machine translation. In Proceedings of EMNLP, pages 1412–1421, 2015.
[26] Brais Martinez, Pingchuan Ma, Stavros Petridis, and Maja Pantic. Lipreading using temporal convolutional networks. In Proceedings of ICASSP, pages 6319–6323, 2020.
[27] Kuniaki Noda, Yuki Yamaguchi, Kazuhiro Nakadai, Hiroshi G. Okuno, and Tetsuya Ogata. Audio-visual speech recognition using deep learning. Applied Intelligence, 42(4):722–737, 2015.
[28] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. PyTorch: An imperative style, high-performance deep learning library. In Proceedings of NIPS, pages 8026–8037, 2019.
[29] Eric K. Patterson, Sabri Gurbuz, Zekeriya Tufekci, and John N. Gowdy. Moving-talker, speaker-independent feature study, and baseline results using the CUAVE multimodal speech corpus. EURASIP Journal on Advances in Signal Processing, 2002(11):1189–1201, 2002.
[30] Stavros Petridis and Maja Pantic. Deep complementary bottleneck features for visual speech recognition. In Proceedings of ICASSP, pages 2304–2308, 2016.
[31] Stavros Petridis, Themos Stafylakis, Pingchuan Ma, Feipeng Cai, Georgios Tzimiropoulos, and Maja Pantic. End-to-end audiovisual speech recognition. In Proceedings of ICASSP, pages 6548–6552, 2018.
[32] Stavros Petridis, Themos Stafylakis, Pingchuan Ma, Georgios Tzimiropoulos, and Maja Pantic. Audio-visual speech recognition with a hybrid CTC/attention architecture. In Proceedings of SLT, pages 513–520, 2018.
[33] Stavros Petridis, Yujiang Wang, Zuwei Li, and Maja Pantic. End-to-end audiovisual fusion with LSTMs. In Proceedings of AVSP, pages 36–40, 2017.
[34] Stavros Petridis, Yujiang Wang, Zuwei Li, and Maja Pantic. End-to-end multi-view lipreading. In Proceedings of BMVC, 2017.
[35] Stavros Petridis, Yujiang Wang, Pingchuan Ma, Zuwei Li, and Maja Pantic. End-to-end visual speech recognition for small-scale datasets. Pattern Recognition Letters, 131:421–427, 2020.
[36] Gerasimos Potamianos, Chalapathy Neti, Guillaume Gravier, Ashutosh Garg, and Andrew W. Senior. Recent advances in the automatic recognition of audiovisual speech. Proceedings of the IEEE, 91(9):1306–1326, 2003.
[37] Gerasimos Potamianos, Chalapathy Neti, Giridharan Iyengar, Andrew W. Senior, and Ashish Verma. A cascade visual front end for speaker independent automatic speechreading. International Journal of Speech Technology, 4(3-4):193–208, 2001.
[38] Xu Shao and Jon Barker. Stream weight estimation for multistream audio-visual speech recognition in a multispeaker environment. Speech Communication, 50(4):337–353, 2008.
[39] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In Proceedings of ICLR, 2015.
[40] Themos Stafylakis, Muhammad Haris Khan, and Georgios Tzimiropoulos. Pushing the boundaries of audiovisual word recognition using residual networks and LSTMs. Computer Vision and Image Understanding, 176-177:22–32, 2018.
[41] Themos Stafylakis and Georgios Tzimiropoulos. Combining residual networks with LSTMs for lipreading. In Proceedings of Interspeech, pages 3652–3656, 2017.
[42] Ke Sun, Chun Yu, Weinan Shi, Lan Liu, and Yuanchun Shi. Lip-Interact: Improving mobile device interaction with silent speech commands. In Proceedings of the 31st Annual ACM Symposium on User Interface Software and Technology, pages 581–593, 2018.
[43] Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew W. Senior, and Koray Kavukcuoglu. WaveNet: A generative model for raw audio. In ISCA Speech Synthesis Workshop, page 125, 2016.
[44] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Proceedings of NIPS, pages 5998–6008, 2017.
[45] Alex Waibel, Toshiyuki Hanazawa, Geoffrey Hinton, Kiyohiro Shikano, and Kevin J. Lang. Phoneme recognition using time-delay neural networks. IEEE Transactions on Acoustics, Speech, and Signal Processing, 37:328–339, 1989.
[46] Michael Wand, Jan Koutník, and Jürgen Schmidhuber. Lipreading with long short-term memory. In Proceedings of ICASSP, pages 6115–6119, 2016.
[47] Chenhao Wang. Multi-grained spatio-temporal modeling for lip-reading. In Proceedings of BMVC, page 276, 2019.
[48] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He. Non-local neural networks. In Proceedings of CVPR, pages 7794–7803, 2018.
[49] Xinshuo Weng and Kris Kitani. Learning spatio-temporal features with two-stream deep 3D CNNs for lipreading. In Proceedings of BMVC, page 269, 2019.
[50] Bo Xu, Cheng Lu, Yandong Guo, and Jacob Wang. Discriminative multi-modality speech recognition. In Proceedings of CVPR, 2020.
[51] Maoke Yang, Kun Yu, Chi Zhang, Zhiwei Li, and Kuiyuan Yang. DenseASPP for semantic segmentation in street scenes. In Proceedings of CVPR, pages 3684–3692, 2018.
[52] Shuang Yang, Yuanhang Zhang, Dalu Feng, Mingmin Yang, Chenhao Wang, Jingyun Xiao, Keyu Long, Shiguang Shan, and Xilin Chen. LRW-1000: A naturally-distributed large-scale benchmark for lip reading in the wild. In Proceedings of FG, pages 1–8, 2019.
[53] Fisher Yu and Vladlen Koltun. Multi-scale context aggregation by dilated convolutions. In Proceedings of ICLR, 2016.
[54] Yuanhang Zhang, Shuang Yang, Jingyun Xiao, Shiguang Shan, and Xilin Chen. Can we read speech beyond the lips? Rethinking RoI selection for deep visual speech recognition. In Proceedings of FG, volume 1, pages 851–858, 2020.
[55] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing network. In Proceedings of CVPR, pages 2881–2890, 2017.
[56] Xing Zhao, Shuang Yang, Shiguang Shan, and Xilin Chen. Mutual information maximization for effective lip reading. In Proceedings of FG, volume 1, pages 843–850, 2020.
[57] Ziheng Zhou, Guoying Zhao, Xiaopeng Hong, and Matti Pietikäinen. A review of recent advances in visual speech decoding. Image and Vision Computing, 32(9):590–605, 2014.