3D Feature Pyramid Attention Module for Robust Visual Speech Recognition
Jingyun Xiao, Department of Computer Science, University of Chinese Academy of Sciences, Beijing, China; Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing, China
Abstract — Visual speech recognition is the task of decoding speech content from a video based on visual information, especially the movements of the lips; it is also referred to as lipreading. Motivated by two problems in lipreading, words with similar pronunciations and the variation of word duration, we propose a novel 3D Feature Pyramid Attention (3D-FPA) module to jointly improve the representation power of features in both the spatial and temporal domains. Specifically, the input features are downsampled three times in both the spatial and temporal dimensions to construct spatiotemporal feature pyramids. High-level features are then upsampled and combined with low-level features, finally generating a pixel-level soft attention mask that is multiplied with the input features. This enhances the discriminative power of the features and exploits temporal multi-scale information while decoding visual speech. The module also provides a new method to construct and utilize temporal pyramid structures in video analysis tasks; temporal feature pyramids remain underexplored compared to the plentiful work on spatial feature pyramids for image analysis tasks. To validate the effectiveness and adaptability of our proposed module, we embed it in a sentence-level lipreading model, LipNet [2], obtaining an absolute decrease in word error rate, and in the word-level model proposed in [3], obtaining an absolute improvement in accuracy.

I. INTRODUCTION
Visual speech recognition, also known as lipreading, is a developing topic in the field of video understanding that has received increasing attention in recent years. It has broad application prospects in hearing aids and special education for hearing-impaired people, in complementing speech recognition in noisy environments, in new human-machine interaction methods, and in other potential application scenarios.

The target of the lipreading task is to decode the speech content from videos based on visual information, including how the lips, tongue and teeth move and interact during the speaking process. Different words have different pronunciations corresponding to different lip-region movements; it is therefore possible to learn the consistent latent patterns of the same words spoken by different speakers and to discriminate between different words based on these patterns.

Figure 1 shows the lip crop sequences of the same word spoken by different speakers, indicating a consistent visual pattern that makes the sequence express the word "blue" or "with" instead of "red" or "at". Figure 2 shows different words spoken by the same person. As shown, only tiny differences exist in the initial or terminal frames of different words, which is a common case in the lipreading task.
Fig. 1. The same words spoken by different people. The target of lipreading is to exploit the latent patterns of the image sequences of the same words.
Lipreading shares common features and processing steps with other video tasks such as activity detection. Moreover, it has its own characteristics and faces specific difficulties. One major difficulty is how to distinguish words with similar pronunciations. As shown in Figure 3, such words are difficult to distinguish, especially short words. To address this problem, we aim to enhance the discriminative power of the spatial features on each frame. Another challenge is that many words have a short duration, usually no more than 0.02 seconds, and are therefore unable to provide enough information to learn the latent patterns. In our experiments, short words such as "a", "an", "eight" and "bin" suffer higher error rates. To address this problem, the temporal context information should be consulted while decoding the words. Moreover, a good model is expected to recognize words at different temporal scales, which requires temporal multi-scale information.

Fig. 2. Different words spoken by the same person. There are only tiny differences between frames of different words, which is a common case in the lipreading task.

Fig. 3. Words with similar pronunciations are difficult to distinguish.

Motivated by the two problems above, we seek a method that enhances the spatial features in each frame and also exploits the multi-scale information in the temporal dimension of the whole video. The multi-scale problem has long been a concern in computer vision research. In image processing tasks such as object detection and semantic segmentation, plentiful methods have been proposed to detect target objects at different scales. Feature Pyramid Networks [9] and related methods have shown great effect on this problem. Recently, [1] proposed a Feature Pyramid Attention module that combines the attention mechanism with spatial pyramids for the semantic segmentation task. It consults local and global information at different levels to enhance the localization information of features, improving prediction accuracy especially on small objects, which exactly coincides with our requirements. We adapt the FPA module of [1] to the lipreading task, generalize it to the temporal dimension and propose a novel 3-Dimensional Feature Pyramid Attention (3D-FPA) module that enhances the discriminative power on each frame and exploits the temporal multi-scale information of the whole video. The core idea is to utilize high-level features as guidance for the attention of low-level features and to generate a pixel-level attention mask on the input features. This method is inspired by FPN [9] and SENet [10].

The main contributions of this paper are summarized as follows. (1) We propose a 3D Feature Pyramid Attention (3D-FPA) module to improve the representation power of spatiotemporal features for the lipreading task. (2) Our proposed module provides a new method to construct and utilize temporal pyramid structures, which has referential value for other video analysis tasks. (3) We demonstrate the effectiveness and adaptability of our module by embedding it into a sentence-level prediction model, LipNet [2], and a word-level classification model, CRNL [3], and performing experiments on the GRID corpus [4] and the LRW corpus [5] respectively.

The rest of this paper is organized as follows.
In Section 2, we review recent works on feature pyramids and lipreading. In Section 3, we present the details of the proposed module and the modified architectures of the two lipreading models. In Section 4, we evaluate the models on two large-scale databases and provide a detailed analysis of the experimental results. Finally, we conclude our work in Section 5.

II. RELATED WORK
In this section, we summarize our literature review on feature pyramids and approaches to lipreading. In the first part, we review works on spatial pyramids, which are widely used in image processing tasks. We then outline recent works on temporal pyramids used in video analysis tasks. In the last part, we give a basic review of existing methods for lipreading.
A. Spatial Pyramid
In the object detection task, detecting and recognizing objects at various scales has always been a concern. In the era of hand-engineered features, the original image was resized to a range of scales to construct a pyramidal structure called an image pyramid, and models detected objects on each level of the image pyramid.

With the advance of deep learning methods, different approaches have been proposed to solve the multi-scale problem, among which two categories are popular. The first category runs detectors of different scales on the same feature map extracted by neural networks. Fast R-CNN [7] and Faster R-CNN [8] use anchors of 3 different sizes and 3 aspect ratios, 9 anchors in all, to detect objects at different scales. The second category utilizes the inherent pyramidal structure of the feature maps produced by different layers of deep neural networks in the forward propagation, which is called a feature pyramid, and performs detection on different levels of the pyramid with a shared detector.

One representative work is Feature Pyramid Networks (FPN) [9], which provides a novel and efficient way to build feature pyramids with high-level semantic information at all levels by combining low-level, semantically weak features with high-level, semantically strong features via a top-down pathway and lateral connections (a minimal code sketch is given at the end of this subsection). Compared to the first category, feature pyramid methods provide multi-scale information at marginal extra time and memory cost, because the models reuse feature maps already computed in the forward propagation. Moreover, feature pyramids can be utilized in other image analysis tasks.

Since feature pyramids have achieved appealing results in object detection, pyramidal structures have been introduced to scene parsing and semantic segmentation to explore the multi-scale information of the image. PSPNet [14] upsamples feature pyramids of different levels to a certain size and then concatenates all the upsampled features to obtain a fine feature map rich in both local and global context information.

Inspired by FPN [9] and SENet [10], Hanchao Li et al. [1] propose a Feature Pyramid Attention (FPA) module that combines feature maps at different levels by upsampling and lateral connections, finally generating a pixel-level soft attention mask to be multiplied with the input feature maps. It combines the attention mechanism and spatial pyramids to enhance important features based on local and global context information, and shows great performance in the semantic segmentation task, especially for small objects. We modify the FPA module and generalize it from 2D to 3D to obtain better representations in both the spatial and temporal domains.
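To make the top-down pathway concrete, the following is a minimal PyTorch sketch of one FPN-style merge step. The class name, the fixed 256 output channels and the nearest-neighbor upsampling are illustrative assumptions, not the exact code of [9].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownMerge(nn.Module):
    def __init__(self, low_channels: int, out_channels: int = 256):
        super().__init__()
        # Lateral connection: project the low-level map to a common width.
        self.lateral = nn.Conv2d(low_channels, out_channels, kernel_size=1)

    def forward(self, higher: torch.Tensor, lower: torch.Tensor) -> torch.Tensor:
        # Upsample the semantically strong higher level to the lower level's
        # resolution, then add the laterally connected low-level features.
        up = F.interpolate(higher, size=lower.shape[-2:], mode="nearest")
        return up + self.lateral(lower)
```

Repeating this merge from the coarsest map downward yields a pyramid whose every level carries high-level semantics, while reusing feature maps the backbone has already computed.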
B. Temporal Pyramid
Inspired by the success of spatial pyramid structures in object detection, a few recent works attempt to apply pyramidal structures in the temporal dimension for activity detection, expecting to better detect activity instances of different temporal scales. Temporal pyramid structures have shown their potential power in exploiting the multi-scale information of videos. However, since video analysis is much more complicated than image analysis, how to design effective temporal pyramid architectures and combine them properly with backbone networks is still being explored, which is a main concern of this paper. In this part, we give a general review of three recent works that employ temporal pyramid structures in different manners for the task of activity detection.

Zhang et al. [12] propose the Dynamic Temporal Pyramid Network (DTPN), which samples an input image sequence at different frequencies to generate multiple image sequences of different temporal sizes, thus constructing pyramidal input data. The pyramidal input is passed forward to the following modules to extract features independently and build feature pyramids. The feature maps at different levels are upsampled to the same scale and concatenated together to combine local and global information. DTPN achieves state-of-the-art results on the ActivityNet dataset. This practice resembles the early image-pyramid method for object detection, which resizes the input image to different scales to build pyramidal structures. However, building feature pyramids from pyramidal input incurs a high computation cost.

The Contextual Multi-Scale Region Convolutional 3D Network (CMS-RC3D) [11] uses the C3D ConvNet as the backbone architecture to extract features and adds two additional downsampling layers to construct temporal feature pyramids. On each level of the temporal feature pyramids, an activity proposal detector and an activity classifier are learned to detect activities of specific temporal scales independently. Compared to [12], this method builds feature pyramids by reusing feature maps computed by previous layers, and therefore reduces computation cost. However, it performs detection and classification on each level of the pyramid independently, without fusing the multi-scale information across levels.

Structured Segment Networks (SSN) [13] propose structured temporal pyramid pooling (STPP) to produce a global representation of each generated proposal. For each proposal, it extracts spatiotemporal features, builds 2-level feature pyramids and concatenates them to obtain a global representation of the proposal, which is used for the following activity and completeness classification. Combining the multi-scale information of a proposal and its surrounding snippets brings a good balance between expressive power and complexity, since feature pyramids are not built on the whole video.

The methods above work well for the activity detection task but have limited instructive value for other video analysis tasks such as lipreading and video description, which require refined and consecutive spatiotemporal information. In our proposed method, we build feature pyramids by reusing feature maps computed in previous layers, which requires marginal extra computation cost. Since lipreading is a decoding task rather than a detection task, we build pyramids on the whole video, rather than focusing on several snippets of it, to provide intact information about the whole sequence.
Also, in order not to lose temporal distribution information, we do not concatenate features together, but instead generate a pixel-wise attention mask on the original feature maps to select important features, inspired by the channel-wise mask of SENet [10].
C. Lipreading
In the early stage, most works on visual speech recognition focused on how to design proper feature extractors to represent the complicated lip movement sequences. A classical pipeline employs Hidden Markov Models to exploit the temporal relationships among the extracted features [15], [16], [17], [18]. With the development of deep learning technologies and the appearance of large-scale lipreading databases, a few works started to introduce convolutional neural networks to extract the features of each frame and employ recurrent units to model the temporal relationships of the frame sequence in the speaking process [19], [20], [21]. In 2016, [5] proposed the first end-to-end lipreading model, which performs word-level classification, as well as a large-scale word-level database, LRW. After that, most recent lipreading approaches follow an end-to-end fashion.

Fig. 4. The structure of the 2D-FPA module we adopt for lipreading. The green and orange arrows denote the downsampling and upsampling operations respectively. This module utilizes high-level features as guidance for low-level features, combining information at different levels step by step. Finally, it generates an attention mask for the input feature maps.

Based on the observation that human lipreaders perform better on long words and sentences, [2] proposed LipNet, the first end-to-end lipreading model that performs sentence-level prediction. It takes frame sequences of variable length as input and outputs character sequences. It uses three cascaded spatiotemporal convolutional neural networks to extract spatiotemporal features and recurrent units to perform character prediction at each time-step. Notably, it employs the CTC loss [6], which is widely used in speech recognition to train on unaligned data. It attains a word error rate of 0.114 on the sentence-level database GRID [4]. Compared to word-level models, LipNet can exploit temporal context when predicting sentences, and therefore attains much higher accuracy. This result accords with the fact that human lipreaders perform better on longer words or sentences. The detailed architecture of LipNet is presented in Section 3.

In 2017, [3] proposed a more complicated word-level model and attained a word accuracy of 0.830 on the LRW corpus, setting a new state-of-the-art result. It is a combination of spatiotemporal convolutional units, ResNet [22] and bidirectional LSTM networks [23]. It introduces ResNet to cope with the massive amount of data with extraordinarily high variability in the LRW corpus. The detailed architecture is presented in Section 3.

As stated in Section 1, lipreading has two difficulties to deal with: words with similar pronunciations are hard to distinguish, and short words are unable to provide enough information for precise recognition. Inspired by the works on spatial and temporal pyramids, we propose a 3D-FPA module to enhance the discriminative power on each single frame as well as exploit the temporal multi-scale information of the whole frame sequence.

III. METHOD
In this section, we first present the 2D-FPA module proposed in [1]. We then describe the architecture of our 3D-FPA module. After that, we introduce two lipreading architectures, performing sentence-level prediction and word-level classification respectively, and explain how we embed the FPA modules into these architectures to verify the effectiveness and adaptability of our proposed module.
A. 2D-FPA
In semantic segmentation, models employing spatial pyramid pooling, such as PSPNet [14], may lose local information. Inspired by the attention mechanism and by SENet [10], which uses channel-wise attention to weight the feature maps, [1] proposes the Feature Pyramid Attention (FPA) module, which produces pixel-level attention with context prior information for each pixel to select features pixel-wisely.

FPA deploys bottom-up and top-down branches similar to Feature Pyramid Networks [9]. In the bottom-up branch, feature maps are extracted at 3 different scales, using 7×7, 5×5 and 3×3 convolution kernels at the different pyramid levels. In the top-down branch, the context information of different scales is integrated step by step, from global to local, finally generating an attention mask with the same shape as the original feature map. The original feature map is passed through a 1×1 convolution and multiplied pixel-wisely with the attention mask. The authors also introduce a global pooling branch that is concatenated with the output features to further improve performance.

For our lipreading task, we aim to improve the discriminative ability on each single frame to address the problem of different words with similar lip shapes. We employ the FPA module to exploit the nuanced information and focus on important features with the pixel-level attention mask. To adapt the FPA module for our task, we ran several experiments with different settings to find the best structure. We remove the 1×1 convolution before the multiplication between the mask and the input feature map, to protect the input feature map from losing information. We also remove the global pooling branch for lower computation cost, and we use 3×3 convolution kernels at all pyramid levels. The final structure we adopt is presented in Figure 4.

B. 3D-FPA
To address the temporal multi-scale problem, a temporal pyramid structure is the natural choice. However, improper structures may cause a loss of temporal distribution information, which is especially harmful to lipreading performance. Inspired by the mechanism of the FPA module in [1], generating a soft attention mask with prior information from different temporal scales is a practical solution. We extend the 2D-FPA to the temporal dimension and design a 3D-FPA module for the lipreading task.
[Figure 5 diagram: pyramid levels of shapes (T, W, H, C) → (T/2, W/2, H/2, C) → (T/4, W/4, H/4, C) → (T/8, W/8, H/8, C), built with 3×3×3 convolutions, max-pooling and upsampling; the resulting mask of shape (T, W, H, C) multiplies the input to produce the output of shape (T, W, H, C).]
Fig. 5. The structure of the 3D-FPA module. T, W, H, C denote the sizes of the time, width, height and channel dimensions respectively. The module constructs feature pyramids in the spatial and temporal domains simultaneously; high-level features are utilized to guide the attention of low-level features, exploiting the multi-scale information in the spatial and temporal dimensions jointly.
The module takes a sequence of feature maps as input and outputs pixel-wisely weighted feature maps with the same shape as the input. The structure is similar to that of the 2D-FPA; notably, the temporal dimension is also downsampled and upsampled along the pathway. Figure 5 shows the detailed structure of 3D-FPA. Since high-level features carry stronger semantic information, they are expected to guide the attention of low-level features, in the temporal and spatial domains simultaneously. The proposed 3D-FPA module is simple, lightweight and effective. Since its input and output have the same shape, it can be embedded into different backbone architectures. We perform most experiments with the LipNet [2] backbone on the GRID corpus [4], performing sentence-level prediction. To validate the adaptability of 3D-FPA, we also apply it to Combining Residual Networks with LSTMs [3], a much more complicated architecture, for word-level classification on the more challenging LRW corpus [5].
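The following is a minimal PyTorch sketch of the 3D-FPA computation of Figure 5, assuming an (N, C, T, H, W) layout. The single Conv3d+BatchNorm block per level and the sigmoid bounding the mask are our assumptions for illustration; the paper fixes the pyramid shapes and the final multiplication, not these exact details.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FPA3D(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        def block():
            return nn.Sequential(
                nn.Conv3d(channels, channels, kernel_size=3, padding=1),
                nn.BatchNorm3d(channels),
                nn.ReLU(inplace=True))
        self.down1, self.down2, self.down3 = block(), block(), block()
        self.up2, self.up1, self.up0 = block(), block(), block()
        self.pool = nn.MaxPool3d(kernel_size=2, stride=2)  # halves T, H and W

    def _upsample_to(self, x, ref):
        # Trilinear upsampling restores both the spatial and temporal sizes.
        return F.interpolate(x, size=ref.shape[2:], mode="trilinear",
                             align_corners=False)

    def forward(self, x):                      # x: (N, C, T, H, W)
        d1 = self.down1(self.pool(x))          # (N, C, T/2, H/2, W/2)
        d2 = self.down2(self.pool(d1))         # (N, C, T/4, H/4, W/4)
        d3 = self.down3(self.pool(d2))         # (N, C, T/8, H/8, W/8)
        u2 = self.up2(d2 + self._upsample_to(d3, d2))  # top-down merges
        u1 = self.up1(d1 + self._upsample_to(u2, d1))
        mask = torch.sigmoid(self.up0(self._upsample_to(u1, x)))
        return x * mask                        # pixel-level soft attention

# e.g. FPA3D(32)(torch.randn(2, 32, 75, 12, 25)) preserves the input shape.
```

Because the output shape equals the input shape, the module can be dropped between any two layers of a backbone without changing the layers around it.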
C. LipNet
LipNet is the first end-to-end sentence-level lipreading model. It starts with 3 sets of spatiotemporal convolution layers, dropout layers and spatial max-pooling layers. The extracted features are passed forward to two Bi-GRUs. Finally, a linear transformation and a SoftMax are applied at each time-step, followed by the CTC loss. The number of SoftMax output classes at each time-step is 28: 26 letters, a blank symbol and the CTC blank token. The structure of LipNet is illustrated in Figure 6 (a). We embed 2D-FPA and 3D-FPA after the input images, after F1 and after F2 respectively, as shown in Figure 7 (a), (b) and (c). The backbone remains the same as the original LipNet architecture.
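A minimal sketch (shapes from Figure 6 (a), layout (N, C, T, H, W)) of embedding 3D-FPA at position "F1" in a LipNet-like front-end, i.e. the variant of Figure 7 (b). The FPA3D class is the sketch given earlier; the module names and the omission of activations and dropout are our simplifications.

```python
import torch
import torch.nn as nn

class LipNetFrontEndWithFPA(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv3d(3, 32, (3, 5, 5), stride=(1, 2, 2), padding=(1, 2, 2))
        self.pool1 = nn.MaxPool3d((1, 2, 2))
        self.fpa   = FPA3D(32)                 # attention on F1, shape preserved
        self.conv2 = nn.Conv3d(32, 64, (3, 5, 5), stride=(1, 1, 1), padding=(1, 2, 2))
        self.pool2 = nn.MaxPool3d((1, 2, 2))

    def forward(self, x):                      # x: (N, 3, 75, 50, 100)
        f1 = self.pool1(self.conv1(x))         # F1: (N, 32, 75, 12, 25)
        f1 = self.fpa(f1)                      # re-weighted, same shape
        return self.pool2(self.conv2(f1))      # F2: (N, 64, 75, 6, 12)
```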
D. Combining ResNet with LSTMs

[3] proposes a word-level classification model. It starts with a spatiotemporal convolutional front-end, and a Residual Network is then applied at each time-step. Since the model is trained and evaluated on LRW, a much more complex database, the ResNet helps cope with the complicated image features. It is followed by a Bidirectional Long Short-Term Memory (Bi-LSTM) network, and finally by a SoftMax layer with 500 output classes. The detailed structure is illustrated in Figure 6 (b). We apply 2D-FPA and 3D-FPA modules prior to and inside the ResNet, as shown in Figure 8.

Fig. 6. (a) The structure of LipNet [2], which performs sentence-level prediction on the GRID corpus [4]: Input (75,100,50,3) → Conv (3×5×5, strides (1,2,2), 32) → Pool (1,2,2) → F1 (75,25,12,32) → Conv (3×5×5, strides (1,1,1), 64) → Pool (1,2,2) → F2 (75,12,6,64) → Conv (3×3×3, strides (1,1,1), 96) → Pool (1,2,2) → F3 (75, (6×3×96)) → Bi-GRU 256 → Bi-GRU 256 → FC (26+1+1) → Output (75, 1), trained with the CTC loss. (b) The structure of CRNL [3], which performs word-level classification on the LRW corpus [5]: Input (29,112,112,1) → Conv (5×7×7, strides (1,2,2), 64) → Pool (1,2,2) → F1 (29,28,28,64) → ResNet-34 → F2 (29, (1×1×512)) → Linear 256 → Bi-LSTM 256 → FC 500 → Output, trained with the cross-entropy loss.
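Since the ResNet (and the 2D-FPA of Figure 8) operates on each time-step independently, a common implementation trick is to fold the temporal dimension into the batch dimension. The helper below is a minimal sketch of this pattern under our assumed (N, C, T, H, W) layout; it is not code from [3].

```python
import torch
import torch.nn as nn

def apply_per_frame(module2d: nn.Module, x: torch.Tensor) -> torch.Tensor:
    """Run a 2D module (e.g. a ResNet stage or a 2D-FPA) on every frame
    of a video tensor x of shape (N, C, T, H, W)."""
    n, c, t, h, w = x.shape
    frames = x.permute(0, 2, 1, 3, 4).reshape(n * t, c, h, w)  # fold T into batch
    frames = module2d(frames)                                  # per-frame computation
    _, c2, h2, w2 = frames.shape
    return frames.reshape(n, t, c2, h2, w2).permute(0, 2, 1, 3, 4)
```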
Fig. 7. The modified LipNet with FPA modules embedded after the input, after F1 and after F2 respectively.
IV. EXPERIMENTS
We verify the effectiveness and adaptability of the proposed module by embedding it into the backbone lipreading models and evaluating the modified models on the corresponding datasets. In this section, we present our sentence-level and word-level experiments respectively, including the datasets, implementation details, results and analysis.
A. Sentence-level

1) Dataset:
The GRID corpus is a sentence-level dataset containing audio and video recordings of 34 speakers, each producing 1000 sentences, for a total duration of about 28 hours. The sentences are produced with a fixed grammar: command + color + preposition + letter + digit + adverb. The detailed components of the 6 categories are shown in Table I. For example, two sentences in the dataset are "Bin blue at A zero again" and "Place green with B eight soon".
Fig. 8. The modified CRNL with FPA modules embedded prior to and inside the ResNet respectively.

TABLE I
COMPONENTS OF THE SENTENCES IN THE GRID CORPUS

Category      Content                        Number
Command       bin, lay, place, set           4
Color         blue, green, red, white        4
Preposition   at, by, in, with               4
Letter        A, B, ..., Y, Z (without W)    25
Digit         zero, one, ..., eight, nine    10
Adverb        again, now, please, soon       4
The original LipNet and our modified LipNet with FPA modules are trained and evaluated on the GRID corpus. We follow the dataset split setting in [2] and use two male speakers (speakers 1 and 2) and two female speakers (speakers 20 and 22) for evaluation (3971 videos are usable after filtering out invalid videos). The remaining videos are used for training (28775 videos are usable). Under this setting, the models are evaluated on speakers that have not appeared in the training process, testing the generalization ability of the models in a more convincing way.

2) Implementation Details:
We use the Keras implementation of LipNet from GitHub and attain a word error rate (WER) of 0.17741 on the evaluation set, higher than the 0.114 reported in the original paper. The gap may be attributed to not introducing video clips of individual words as additional training instances, or to other deficiencies in the training process. Since the goal of our work is to test the validity of 3D-FPA, we did not fine-tune the network further to achieve the WER declared in [2].

Inside the FPA module, we apply Batch Normalization layers and Dropout layers after each convolutional layer. Since GRID is not very large and the model overfits easily, the Dropout layers alleviate this problem. When upsampling the high-level feature maps, the upsampled features may not have the same shape as the previous feature maps, so the two cannot be added directly. We therefore employ bilinear upsampling when the spatial sizes do not match, and pad the last frame to the end before upsampling when the temporal sizes do not match.

We use Adam as the optimizer, with a learning rate of 0.0001, a first-moment momentum coefficient of 0.9 and a second-moment momentum coefficient of 0.999.
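A minimal sketch of the shape-matching rule just described, assuming (N, C, T, H, W) tensors; the function name is ours, and this follows the paper's stated rule (repeat the last frame, then per-frame bilinear upsampling) rather than the trilinear interpolation assumed in the earlier module sketch.

```python
import torch
import torch.nn.functional as F

def match_shape(x: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Grow x to target's (T, H, W) so the two feature maps can be added."""
    n, c, t, h, w = x.shape
    tt, th, tw = target.shape[2:]
    if t < tt:
        # Temporal mismatch: pad by repeating the last frame.
        last = x[:, :, -1:].expand(n, c, tt - t, h, w)
        x = torch.cat([x, last], dim=2)
    t = x.shape[2]
    if (h, w) != (th, tw):
        # Spatial mismatch: bilinear upsampling, applied frame by frame.
        frames = x.permute(0, 2, 1, 3, 4).reshape(n * t, c, h, w)
        frames = F.interpolate(frames, size=(th, tw), mode="bilinear",
                               align_corners=False)
        x = frames.reshape(n, t, c, th, tw).permute(0, 2, 1, 3, 4)
    return x
```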
3) Results and Analysis:
We embed 2D-FPA and 3D-FPA at different positions in the original architecture, as shown in Figure 7 (a), (b) and (c) respectively, and train and evaluate them the same way as the baseline. The results are shown in Table II. CER, WER and BLEU are short for character-level error rate, word-level error rate and Bilingual Evaluation Understudy respectively. "Input" in the parentheses means the FPA module is embedded after the input frames, and "F1" means it is embedded after F1; "F1, F2" means we simultaneously apply 3D-FPA modules after F1 and after F2. The results indicate that the performance of LipNet improves considerably with the 3D-FPA module embedded.
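For reference, a minimal sketch of the WER metric used above: the Levenshtein (edit) distance between the predicted and reference word sequences, normalized by the reference length. CER is the same computation over characters.

```python
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j]: edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[-1][-1] / max(len(ref), 1)

print(wer("bin blue at a zero again", "bin blue at e zero again"))  # 1/6
```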
TABLE II
RESULTS OF SENTENCE-LEVEL EXPERIMENTS ON THE GRID CORPUS

Method                    CER       WER       BLEU
LipNet                    0.11192   0.17741   0.82409
LipNet + 2D-FPA(Input)    0.08721   0.15924   0.85144
LipNet + 3D-FPA(Input)    0.07571   0.14543   0.86840
LipNet + 2D-FPA(F1)       0.08983   0.16033   0.84863
LipNet + 3D-FPA(F1)       0.08019   0.14224   0.86725
LipNet + 2D-FPA(F2)       0.08002   0.14660   0.86441
LipNet + 3D-FPA(F2)       -         -         -
LipNet + 3D-FPA(F1,F2)    0.07798   0.14199   0.86956
Figure 9 shows the training loss and validation loss during training. The training loss of the model with the 3D-FPA module starts to decrease earlier than that of the original model, after a few epochs of stasis. Also, the validation loss of the modified model is lower than that of the original model, showing that the 3D-FPA module helps prevent overfitting.

Moreover, we check the error rate of each word in the sentence. All the sentences in the GRID corpus follow a fixed grammar: command + color + preposition + letter + digit + adverb, for example, "Set blue by A four please" and "Place red at C zero again". Table III shows the performance on each word of the sentences, where the number in parentheses denotes the position of the word in the sentence. The results show that our method reduces the WER of each word to different degrees. It is noteworthy that it brings a large improvement on the fifth and sixth words, with a substantial decline in WER.
Fig. 9. (a) Training loss and (b) validation loss of the original LipNet and the modified LipNet with 3D-FPA (F1). The curves indicate that 3D-FPA works well for preventing overfitting.
B. Word-level

1) Dataset:
LRW is the most challenging public word-level database. It contains large amounts of audiovisual speech segments extracted from BBC TV broadcasts. Several characteristics make it challenging. (1) It has 500 target words, compared with 24 words in GRID and 26 words in CUAVE. For each target word, it provides a training set of 1000 segments and a validation set and an evaluation set of 50 segments each. The total duration of the corpus is 173 hours, compared to 28 hours for GRID. (2) The videos have high variation in the pose, angle and age of the speakers. Also, the footage is recorded in the wild, while most lipreading datasets are recorded indoors in a controlled lab environment. (3) The target words are not isolated in the video segments but appear with a few context words, which adds to the difficulty of recognizing them: the model is expected to focus on the key frames and ignore the distracting context. Some cropped frames from LRW are presented in Figure 10.

The model proposed in [3] and our modified models with FPA embedded are trained and evaluated on LRW.
2) Implementation Details:
We use the PyTorch implementation of CRNL from GitHub and follow the training instructions in [3]. Initially, a temporal convolutional back-end is used in place of the Bi-LSTM. After convergence, the temporal convolutional back-end is removed and the Bi-LSTM back-end is attached. Keeping the weights of the spatiotemporal convolution front-end and the ResNet fixed, the Bi-LSTM back-end is trained for 5 epochs. Finally, the overall system is trained end-to-end. Our model reaches an accuracy of 0.7780 on the test set after convergence, lower than the 0.830 reported in the original paper. Since LRW is much more complex, we only apply Batch Normalization layers after each convolutional layer in the FPA and do not apply Dropout layers. We employ a ResNet-34 network in the architecture and use the standard SGD training algorithm with an initial learning rate of 0.003 and a momentum of 0.9.

TABLE III
WER OF EACH WORD ON THE GRID CORPUS

Method                    WER       WER(1)    WER(2)    WER(3)    WER(4)    WER(5)    WER(6)
LipNet                    0.17741   0.03123   0.09166   0.21884   0.36540   0.23218   0.12188
LipNet + 2D-FPA(Input)    0.15924   -         -         -         -         -         -
LipNet + 3D-FPA(F1,F2)    0.14199   0.04357   0.09771   0.19970   -         -         -
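A minimal sketch of the staged training schedule described above, with assumed module names (frontend, resnet, backend): stage 1 trains only the Bi-LSTM back-end with the rest frozen; stage 2 fine-tunes everything end-to-end.

```python
import torch
import torch.nn as nn

model = nn.ModuleDict({
    "frontend": nn.Conv3d(1, 64, (5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3)),
    "resnet":   nn.Identity(),   # stand-in for the ResNet-34
    "backend":  nn.LSTM(512, 256, bidirectional=True, batch_first=True),
})

def set_trainable(module: nn.Module, flag: bool) -> None:
    for p in module.parameters():
        p.requires_grad = flag

# Stage 1: freeze the front-end and ResNet, train the back-end for 5 epochs.
set_trainable(model["frontend"], False)
set_trainable(model["resnet"], False)
opt = torch.optim.SGD((p for p in model.parameters() if p.requires_grad),
                      lr=0.003, momentum=0.9)

# Stage 2: unfreeze everything and train end-to-end.
set_trainable(model["frontend"], True)
set_trainable(model["resnet"], True)
opt = torch.optim.SGD(model.parameters(), lr=0.003, momentum=0.9)
```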
3) Results and Analysis:
We embed 2D-FPA and 3D-FPA modules before and inside the ResNet, as shown in Figure 8, import the pretrained weights of the original model and train on the LRW corpus. The results are shown in Table IV; the performance of the CRNL model is improved with FPA modules embedded.

Table V shows the 20 words predicted with the highest accuracy by the original CRNL model. Table VI shows 10 words predicted with the lowest accuracy by the original model, together with the corresponding accuracy of the modified models. It indicates that the model performs better on longer words, which accords with the fact that human lipreaders perform better on longer words or sentences. Table VI also shows that our proposed module improves the performance of the basic architecture on these short words.

We compare the accuracy of the original model and of the model with one 3D-FPA module embedded, shown in Figure 8 (a), to find which types of words improve the most. Table VII shows the words with the highest growth in accuracy, as well as the words they are most likely to be mistaken for. The results indicate that our proposed module improves the ability to distinguish words with different suffixes, such as "Difference" and "Different", and words that differ in the initial phonemes, such as "Million" and "Billion", which is one of the concerns of the lipreading task.
TABLE IV
RESULTS OF WORD-LEVEL EXPERIMENTS ON THE LRW CORPUS

Method              Accuracy
CRNL                0.7780
CRNL + 2D-FPA       0.7837
CRNL + 3D-FPA*1     0.7895
CRNL + 3D-FPA*4     -
TABLE V
THE WORDS WITH HIGHEST ACCURACY BY THE ORIGINAL MODEL

Ground Truth    Acc     Ground Truth   Acc
WESTMINSTER     1.00    INVESTMENT     0.98
TEMPERATURES    1.00    CHIEF          0.98
PRIVATE         1.00    WOMEN          0.98
GERMANY         1.00    MIGRANTS       0.98
BEFORE          1.00    TOMORROW       0.98
WEAPONS         1.00    FOLLOWING      0.96
PROVIDE         1.00    INFORMATION    0.96
WELFARE         1.00    PARLIAMENT     0.96
SUNSHINE        0.98    POTENTIAL      0.96
EUROPEAN        0.98    AFTERNOON      0.96

TABLE VI
THE WORDS WITH LOWEST ACCURACY BY THE ORIGINAL MODEL AND RELATIVE ACCURACY BY MODIFIED MODELS

Ground Truth   Acc(CRNL)   Acc(CRNL+2D-FPA)   Acc(CRNL+3D-FPA*1)   Acc(CRNL+3D-FPA*4)
UNDER          0.31        -                  -                    -
UNTIL          0.35        -                  -                    -
THERE          0.38        0.38               -                    -
TAKING         0.44        -                  0.60                 -
TABLE VII
THE WORDS WITH HIGHEST GROWTH IN ACCURACY

Ground Truth   Acc(CRNL)   Acc(CRNL+3D-FPA*1)   Growth   Confusing Word   Error Rate
WORDS          0.58        0.76                 0.18     WORLD            0.10
PRICE          0.52        0.68                 0.16     PRESS            0.08
DIFFERENCE     0.72        0.88                 0.16     DIFFERENT        0.12
TAKING         0.44        0.60                 0.16     TAKEN            0.14
BANKS          0.64        0.80                 0.16     PLACE            0.10
RUSSIAN        0.61        0.76                 0.15     RUSSIA           0.27
LEADERS        0.58        0.72                 0.14     LEAST            0.08
CLOSE          0.66        0.80                 0.14     ALLOWED          0.06
MORNING        0.86        0.98                 0.12     POINT            0.04
MILLION        0.64        0.76                 0.12     BILLION          0.10
V. CONCLUSION

We propose a spatiotemporal pixel-level attention module for visual speech recognition. It utilizes high-level features to guide the attention of low-level features, exploring the multi-scale context information in both the temporal and spatial domains. Notably, it offers a creative usage of temporal pyramids by combining pyramidal structures with the attention mechanism. Owing to its concise structure and low computation cost, 3D-FPA can be embedded into different backbone architectures, and therefore has great potential in other video understanding tasks besides visual speech recognition.
REFERENCES

[1] Hanchao Li, Pengfei Xiong, Jie An, and Lingxue Wang. Pyramid Attention Network for Semantic Segmentation. pages 1–12, 2018.
[2] Yannis M. Assael, Brendan Shillingford, Shimon Whiteson, and Nando de Freitas. LipNet: End-to-End Sentence-level Lipreading. pages 1–13, 2016.
[3] Themos Stafylakis and Georgios Tzimiropoulos. Combining Residual Networks with LSTMs for Lipreading. Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, pages 3652–3656, 2017.
[4] M. Cooke, J. Barker, S. Cunningham, and X. Shao. An audio-visual corpus for speech perception and automatic speech recognition. Journal of the Acoustical Society of America, 120(1):2421–2424, 2006.
[5] Joon Son Chung and Andrew Zisserman. Lip reading in the wild, 2017.
[6] Alex Graves. Connectionist Temporal Classification, pages 61–93. Springer Berlin Heidelberg, Berlin, Heidelberg, 2012.
[7] Ross Girshick. Fast R-CNN. Computer Science, 2015.
[8] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: towards real-time object detection with region proposal networks. pages 91–99, 2015.
[9] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature Pyramid Networks for Object Detection. CVPR, 2017.
[10] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-Excitation Networks. pages 1–11, 2017.
[11] Yancheng Bai, Huijuan Xu, Kate Saenko, and Bernard Ghanem. Contextual Multi-Scale Region Convolutional 3D Network for Activity Detection. 2018.
[12] Da Zhang, Xiyang Dai, and Yuan-Fang Wang. Dynamic Temporal Pyramid Network: A Closer Look at Multi-Scale Modeling for Activity Detection. 2018.
[13] Yue Zhao, Yuanjun Xiong, Limin Wang, Zhirong Wu, Xiaoou Tang, and Dahua Lin. Temporal Action Detection with Structured Segment Networks. Proceedings of the IEEE International Conference on Computer Vision, pages 2933–2942, 2017.
[14] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid Scene Parsing Network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[15] A. J. Goldschen, O. N. Garcia, and E. D. Petajan. Continuous automatic speech recognition by lipreading. In Motion-Based Recognition, pages 321–343. Springer, 1997.
[16] G. I. Chiou and J.-N. Hwang. Lipreading from color video. IEEE Transactions on Image Processing, 6(8):1192–1195, 1997.
[17] G. Potamianos, C. Neti, G. Gravier, A. Garg, and A. W. Senior. Recent advances in the automatic recognition of audiovisual speech. Proceedings of the IEEE, 91(9):1306–1326, 2003.
[18] C. Chandrasekaran, A. Trubanova, S. Stillittano, A. Caplier, and A. A. Ghazanfar. The natural statistics of audiovisual speech. PLoS Computational Biology, 5(7), 2009.
[19] K. Noda, Y. Yamaguchi, K. Nakadai, H. G. Okuno, and T. Ogata. Audio-visual speech recognition using deep learning. Applied Intelligence, 42(4):722–737, 2015.
[20] K. Thangthai, R. W. Harvey, S. J. Cox, and B.-J. Theobald. Improving lip-reading performance for robust audiovisual speech recognition using DNNs. In AVSP, pages 127–131, 2015.
[21] I. Almajai, S. Cox, R. Harvey, and Y. Lan. Improved speaker independent lip reading using speaker adaptive training and deep neural networks. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2722–2726. IEEE, 2016.
[22] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[23] A. Graves, S. Fernández, and J. Schmidhuber. Bidirectional LSTM networks for improved phoneme classification and recognition. In International Conference on Artificial Neural Networks (ICANN), 2005.