3D Feature Pyramid Attention Module for Robust Visual Speech Recognition
Jingyun Xiao, Department of Computer Science, University of Chinese Academy of Sciences, Beijing, China; Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing, China
Abstract — Visual speech recognition is the task of decoding speech content from a video based on visual information, especially the movements of the lips; it is also referred to as lipreading. Motivated by two problems in lipreading, words with similar pronunciations and the variation of word duration, we propose a novel 3D Feature Pyramid Attention (3D-FPA) module to jointly improve the representation power of features in both the spatial and temporal domains. Specifically, the input features are downsampled three times in both the spatial and temporal dimensions to construct spatiotemporal feature pyramids. High-level features are then upsampled and combined with low-level features, finally generating a pixel-level soft attention mask that is multiplied with the input features. This enhances the discriminative power of the features and exploits temporal multi-scale information while decoding visual speech. The module also provides a new method to construct and utilize temporal pyramid structures in video analysis tasks; temporal feature pyramids remain underexplored compared to the plentiful work on spatial feature pyramids for image analysis tasks. To validate the effectiveness and adaptability of our proposed module, we embed it in a sentence-level lipreading model, LipNet [2], obtaining an absolute decrease in word error rate, and in the word-level model proposed in [3], obtaining an absolute improvement in accuracy.

I. INTRODUCTION
Visual speech recognition, also known as lipreading, is a developing topic in the field of video understanding that has received increasing attention in recent years. It has broad application prospects in hearing aids and special education for hearing-impaired people, in complementing speech recognition in noisy environments, in new human-machine interaction methods, and in other potential application scenarios.

The target of the lipreading task is to decode the speech content from videos based on visual information, including how the lips, tongue and teeth move and interact during the speaking process. Different words have different pronunciations corresponding to different lip-region movements; it is therefore possible to learn the consistent latent patterns of the same words spoken by different speakers and to discriminate between different words based on these patterns.

Figure 1 shows the lip crop sequences of the same word spoken by different speakers, indicating a consistent visual pattern that makes the sequence express the word "blue" or "with" instead of "red" or "at". Figure 2 shows different words spoken by the same person. As shown, only tiny differences exist in the initial or terminal frames of different words, which is a common case in the lipreading task.
Fig. 1. The same words spoken by different people. The target of lipreading is to exploit the latent patterns of the image sequences of the same words.
Lipreading shares common features and processing steps with other video tasks such as activity detection. Moreover, it has its own characteristics and faces specific difficulties. One major difficulty is how to distinguish words with similar pronunciations. As shown in Figure 3, such words are difficult to distinguish, especially short words. To address this problem, we aim to enhance the discriminative power of the spatial features on each frame. Another challenge is that many words have a short duration, usually no more than 0.02 seconds, and are therefore unable to provide enough information to learn the latent patterns. In our experiments, short words such as "a", "an", "eight" and "bin" suffer higher error rates. To address this problem, the temporal context information should be consulted while decoding the words. Moreover, a good model is expected to recognize words at different temporal scales, which requires temporal multi-scale information.

Fig. 2. Different words spoken by the same person. There are only tiny differences between frames of different words, which is a common case in the lipreading task.

Fig. 3. Words with similar pronunciations are difficult to distinguish.

Motivated by the two problems above, we seek a method that enhances the spatial features in each frame and also exploits the multi-scale information in the temporal dimension of the whole video. The multi-scale problem has long been a concern in computer vision research. In image processing tasks such as object detection and semantic segmentation, plentiful methods have been proposed to detect target objects at different scales. Feature Pyramid Networks [9] and related methods have shown great effect on this problem. Recently, [1] proposed a Feature Pyramid Attention module that combines the attention mechanism with spatial pyramids for the semantic segmentation task. It consults local and global information at different levels to enhance the localization information of features, improving prediction accuracy especially on small objects, which exactly coincides with our requirements. We adapt the FPA module of [1] to the lipreading task, generalize it to the temporal dimension and propose a novel 3-Dimensional Feature Pyramid Attention (3D-FPA) module that enhances the discriminative power on each frame and exploits the temporal multi-scale information of the whole video. The core idea is to utilize high-level features as guidance for the attention of low-level features and to generate a pixel-level attention mask on the input features. This method is inspired by FPN [9] and SENet [10].

The main contributions of this paper are summarized as follows. (1) We propose a 3D Feature Pyramid Attention (3D-FPA) module to improve the representation power of spatiotemporal features for the lipreading task. (2) Our proposed module provides a new method to construct and utilize temporal pyramid structures, which has referential value for other video analysis tasks. (3) We demonstrate the effectiveness and adaptability of our module by embedding it into a sentence-level prediction model, LipNet [2], and a word-level classification model, CRNL [3], and performing experiments on the GRID corpus [4] and the LRW corpus [5] respectively.

The rest of this paper is organized as follows.
In Section 2, we review recent works on feature pyramids and lipreading. In Section 3, we present the details of the proposed module and the modified architectures of the two lipreading models. In Section 4, we evaluate the models on two large-scale databases and provide a detailed analysis of the experimental results. Finally, we conclude our work in Section 5.

II. RELATED WORK
In this section, we summarize our literature review on feature pyramids and approaches to lipreading. In the first part, we review works on spatial pyramids, which are widely used in image processing tasks. We then outline recent works on temporal pyramids used in video analysis tasks. In the last part, we give a basic review of existing methods for lipreading.
A. Spatial Pyramid
In the object detection task, detecting and recognizing objects at various scales has always been a concern. In the era of hand-engineered features, the original image was resized to a range of scales to construct a pyramidal structure called an image pyramid, and models detected objects on each level of the image pyramid.

With the advance of deep learning methods, different approaches have been proposed to solve the multi-scale problem, among which two categories are popular. The first category runs detectors of different scales on the same feature map extracted by neural networks. Fast R-CNN [7] and Faster R-CNN [8] use anchors of 3 different sizes and 3 aspect ratios, 9 anchors in all, to detect objects at different scales. The second category utilizes the inherent pyramidal structure of the feature maps produced by different layers of deep neural networks in the forward propagation, which is called a feature pyramid, and performs detection on different levels of the pyramid with a shared detector.

One representative work is Feature Pyramid Networks (FPN) [9], which provides a novel and efficient way to build feature pyramids with high-level semantic information at all levels by combining low-level, semantically weak features with high-level, semantically strong features via a top-down pathway and lateral connections (a minimal code sketch is given at the end of this subsection). Compared to the first category, feature pyramid methods provide multi-scale information at marginal extra time and memory cost, because the models reuse feature maps already computed in the forward propagation. Moreover, feature pyramids can be utilized in other image analysis tasks.

Since feature pyramids have achieved appealing results in object detection, pyramidal structures have been introduced to scene parsing and semantic segmentation to explore the multi-scale information of the image. PSPNet [14] upsamples feature pyramids of different levels to a certain size and then concatenates all the upsampled features to obtain a fine feature map rich in both local and global context information.

Inspired by FPN [9] and SENet [10], Hanchao Li et al. [1] propose a Feature Pyramid Attention (FPA) module that combines feature maps at different levels by upsampling and lateral connections, finally generating a pixel-level soft attention mask to be multiplied with the input feature maps. It combines the attention mechanism and spatial pyramids to enhance important features based on local and global context information, and shows great performance in the semantic segmentation task, especially for small objects. We modify the FPA module and generalize it from 2D to 3D to obtain better representations in both the spatial and temporal domains.
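To make the top-down pathway concrete, the following is a minimal PyTorch sketch of one FPN-style merge step. The class name, the fixed 256 output channels and the nearest-neighbor upsampling are illustrative assumptions, not the exact code of [9].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownMerge(nn.Module):
    def __init__(self, low_channels: int, out_channels: int = 256):
        super().__init__()
        # Lateral connection: project the low-level map to a common width.
        self.lateral = nn.Conv2d(low_channels, out_channels, kernel_size=1)

    def forward(self, higher: torch.Tensor, lower: torch.Tensor) -> torch.Tensor:
        # Upsample the semantically strong higher level to the lower level's
        # resolution, then add the laterally connected low-level features.
        up = F.interpolate(higher, size=lower.shape[-2:], mode="nearest")
        return up + self.lateral(lower)
```

Repeating this merge from the coarsest map downward yields a pyramid whose every level carries high-level semantics, while reusing feature maps the backbone has already computed.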
B. Temporal Pyramid
Inspired by the success of spatial pyramid structures in object detection, a few recent works attempt to apply pyramidal structures in the temporal dimension for activity detection, expecting to better detect activity instances of different temporal scales. Temporal pyramid structures have shown their potential power in exploiting the multi-scale information of videos. However, since video analysis is much more complicated than image analysis, how to design effective temporal pyramid architectures and combine them properly with backbone networks is still being explored, which is a main concern of this paper. In this part, we give a general review of three recent works that employ temporal pyramid structures in different manners for the task of activity detection.

Zhang et al. [12] propose the Dynamic Temporal Pyramid Network (DTPN), which samples an input image sequence at different frequencies to generate multiple image sequences of different temporal sizes, thus constructing pyramidal input data. The pyramidal input is passed forward to the following modules to extract features independently and build feature pyramids. The feature maps at different levels are upsampled to the same scale and concatenated together to combine local and global information. DTPN achieves state-of-the-art results on the ActivityNet dataset. This practice resembles the early image-pyramid method for object detection, which resizes the input image to different scales to build pyramidal structures. However, building feature pyramids from pyramidal input incurs a high computation cost.

The Contextual Multi-Scale Region Convolutional 3D Network (CMS-RC3D) [11] uses the C3D ConvNet as the backbone architecture to extract features and adds two additional downsampling layers to construct temporal feature pyramids. On each level of the temporal feature pyramids, an activity proposal detector and an activity classifier are learned to detect activities of specific temporal scales independently. Compared to [12], this method builds feature pyramids by reusing feature maps computed by previous layers, and therefore reduces computation cost. However, it performs detection and classification on each level of the pyramid independently, without fusing the multi-scale information across levels.

Structured Segment Networks (SSN) [13] propose structured temporal pyramid pooling (STPP) to produce a global representation of each generated proposal. For each proposal, it extracts spatiotemporal features, builds 2-level feature pyramids and concatenates them to obtain a global representation of the proposal, which is used for the following activity and completeness classification. Combining the multi-scale information of a proposal and its surrounding snippets brings a good balance between expressive power and complexity, since feature pyramids are not built on the whole video.

The methods above work well for the activity detection task but have limited instructive value for other video analysis tasks such as lipreading and video description, which require refined and consecutive spatiotemporal information. In our proposed method, we build feature pyramids by reusing feature maps computed in previous layers, which requires marginal extra computation cost. Since lipreading is a decoding task rather than a detection task, we build pyramids on the whole video, rather than focusing on several snippets of it, to provide intact information about the whole sequence.
Also, in order not to lose temporal distribution information, we do not concatenate features together, but instead generate a pixel-wise attention mask on the original feature maps to select important features, inspired by the channel-wise mask of SENet [10].
C. Lipreading
In the early stage, most works on visual speech recognition focused on how to design proper feature extractors to represent the complicated lip movement sequences. A classical pipeline employs Hidden Markov Models to exploit the temporal relationships among the extracted features [15], [16], [17], [18]. With the development of deep learning technologies and the appearance of large-scale lipreading databases, a few works started to introduce convolutional neural networks to extract the features of each frame and employ recurrent units to model the temporal relationships of the frame sequence in the speaking process [19], [20], [21]. In 2016, [5] proposed the first end-to-end lipreading model, which performs word-level classification, as well as a large-scale word-level database, LRW. After that, most recent lipreading approaches follow an end-to-end fashion.

Fig. 4. The structure of the 2D-FPA module we adopt for lipreading. The green and orange arrows denote the downsampling and upsampling operations respectively. This module utilizes high-level features as guidance for low-level features, combining information at different levels step by step. Finally, it generates an attention mask for the input feature maps.

Based on the observation that human lipreaders perform better on long words and sentences, [2] proposed LipNet, the first end-to-end lipreading model that performs sentence-level prediction. It takes frame sequences of variable length as input and outputs character sequences. It uses three cascaded spatiotemporal convolutional neural networks to extract spatiotemporal features and recurrent units to perform character prediction at each time-step. Notably, it employs the CTC loss [6], which is widely used in speech recognition to train on unaligned data. It attains a word error rate of 0.114 on the sentence-level database GRID [4]. Compared to word-level models, LipNet can exploit temporal context when predicting sentences, and therefore attains much higher accuracy. This result accords with the fact that human lipreaders perform better on longer words or sentences. The detailed architecture of LipNet is presented in Section 3.

In 2017, [3] proposed a more complicated word-level model and attained a word accuracy of 0.830 on the LRW corpus, setting a new state-of-the-art result. It is a combination of spatiotemporal convolutional units, ResNet [22] and bidirectional LSTM networks [23]. It introduces ResNet to cope with the massive amount of data with extraordinarily high variability in the LRW corpus. The detailed architecture is presented in Section 3.

As stated in Section 1, lipreading has two difficulties to deal with: words with similar pronunciations are hard to distinguish, and short words are unable to provide enough information for precise recognition. Inspired by the works on spatial and temporal pyramids, we propose a 3D-FPA module to enhance the discriminative power on each single frame as well as exploit the temporal multi-scale information of the whole frame sequence.

III. METHOD
In this section, we first present the 2D-FPA module proposed in [1]. We then describe the architecture of our 3D-FPA module. After that, we introduce two lipreading architectures, performing sentence-level prediction and word-level classification respectively, and explain how we embed the FPA modules into these architectures to verify the effectiveness and adaptability of our proposed module.
A. 2D-FPA
In semantic segmentation, models employing spatial pyramid pooling, such as PSPNet [14], may lose local information. Inspired by the attention mechanism and by SENet [10], which uses channel-wise attention to weight the feature maps, [1] proposes the Feature Pyramid Attention (FPA) module, which produces pixel-level attention with context prior information for each pixel to select features pixel-wisely.

FPA deploys bottom-up and top-down branches similar to Feature Pyramid Networks [9]. In the bottom-up branch, feature maps are extracted at 3 different scales, using 7×7, 5×5 and 3×3 convolution kernels at the different pyramid levels. In the top-down branch, the context information of different scales is integrated step by step, from global to local, finally generating an attention mask with the same shape as the original feature map. The original feature map is passed through a 1×1 convolution and multiplied pixel-wisely with the attention mask. The authors also introduce a global pooling branch that is concatenated with the output features to further improve performance.

For our lipreading task, we aim to improve the discriminative ability on each single frame to address the problem of different words with similar lip shapes. We employ the FPA module to exploit the nuanced information and focus on important features with the pixel-level attention mask. To adapt the FPA module for our task, we ran several experiments with different settings to find the best structure. We remove the 1×1 convolution before the multiplication between the mask and the input feature map, to protect the input feature map from losing information. We also remove the global pooling branch for lower computation cost, and we use 3×3 convolution kernels at all pyramid levels. The final structure we adopt is presented in Figure 4.

B. 3D-FPA
To address the temporal multi-scale problem, a temporal pyramid structure is the natural choice. However, improper structures may cause a loss of temporal distribution information, which is especially harmful to lipreading performance. Inspired by the mechanism of the FPA module in [1], generating a soft attention mask with prior information from different temporal scales is a practical solution. We extend the 2D-FPA to the temporal dimension and design a 3D-FPA module for the lipreading task.
[Figure 5 diagram: pyramid levels of shapes (T, W, H, C) → (T/2, W/2, H/2, C) → (T/4, W/4, H/4, C) → (T/8, W/8, H/8, C), built with 3×3×3 convolutions, max-pooling and upsampling; the resulting mask of shape (T, W, H, C) multiplies the input to produce the output of shape (T, W, H, C).]
Fig. 5. The structure of the 3D-FPA module. T, W, H, C denote the sizes of the time, width, height and channel dimensions respectively. The module constructs feature pyramids in the spatial and temporal domains simultaneously; high-level features are utilized to guide the attention of low-level features, exploiting the multi-scale information in the spatial and temporal dimensions jointly.
The module takes a sequence of feature maps as input and outputs pixel-wisely weighted feature maps with the same shape as the input. The structure is similar to that of the 2D-FPA; notably, the temporal dimension is also downsampled and upsampled along the pathway. Figure 5 shows the detailed structure of 3D-FPA. Since high-level features carry stronger semantic information, they are expected to guide the attention of low-level features, in the temporal and spatial domains simultaneously. The proposed 3D-FPA module is simple, lightweight and effective. Since its input and output have the same shape, it can be embedded into different backbone architectures. We perform most experiments with the LipNet [2] backbone on the GRID corpus [4], performing sentence-level prediction. To validate the adaptability of 3D-FPA, we also apply it to Combining Residual Networks with LSTMs [3], a much more complicated architecture, for word-level classification on the more challenging LRW corpus [5].
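The following is a minimal PyTorch sketch of the 3D-FPA computation of Figure 5, assuming an (N, C, T, H, W) layout. The single Conv3d+BatchNorm block per level and the sigmoid bounding the mask are our assumptions for illustration; the paper fixes the pyramid shapes and the final multiplication, not these exact details.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FPA3D(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        def block():
            return nn.Sequential(
                nn.Conv3d(channels, channels, kernel_size=3, padding=1),
                nn.BatchNorm3d(channels),
                nn.ReLU(inplace=True))
        self.down1, self.down2, self.down3 = block(), block(), block()
        self.up2, self.up1, self.up0 = block(), block(), block()
        self.pool = nn.MaxPool3d(kernel_size=2, stride=2)  # halves T, H and W

    def _upsample_to(self, x, ref):
        # Trilinear upsampling restores both the spatial and temporal sizes.
        return F.interpolate(x, size=ref.shape[2:], mode="trilinear",
                             align_corners=False)

    def forward(self, x):                      # x: (N, C, T, H, W)
        d1 = self.down1(self.pool(x))          # (N, C, T/2, H/2, W/2)
        d2 = self.down2(self.pool(d1))         # (N, C, T/4, H/4, W/4)
        d3 = self.down3(self.pool(d2))         # (N, C, T/8, H/8, W/8)
        u2 = self.up2(d2 + self._upsample_to(d3, d2))  # top-down merges
        u1 = self.up1(d1 + self._upsample_to(u2, d1))
        mask = torch.sigmoid(self.up0(self._upsample_to(u1, x)))
        return x * mask                        # pixel-level soft attention

# e.g. FPA3D(32)(torch.randn(2, 32, 75, 12, 25)) preserves the input shape.
```

Because the output shape equals the input shape, the module can be dropped between any two layers of a backbone without changing the layers around it.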
C. LipNet
LipNet is the first end-to-end sentence-level lipreading model. It starts with 3 sets of spatiotemporal convolution layers, dropout layers and spatial max-pooling layers. The extracted features are passed forward to two Bi-GRUs. Finally, a linear transformation and a SoftMax are applied at each time-step, followed by the CTC loss. The number of SoftMax output classes at each time-step is 28: 26 letters, a blank symbol and the CTC blank token. The structure of LipNet is illustrated in Figure 6 (a). We embed 2D-FPA and 3D-FPA after the input images, after F1 and after F2 respectively, as shown in Figure 7 (a), (b) and (c). The backbone remains the same as the original LipNet architecture.
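A minimal sketch (shapes from Figure 6 (a), layout (N, C, T, H, W)) of embedding 3D-FPA at position "F1" in a LipNet-like front-end, i.e. the variant of Figure 7 (b). The FPA3D class is the sketch given earlier; the module names and the omission of activations and dropout are our simplifications.

```python
import torch
import torch.nn as nn

class LipNetFrontEndWithFPA(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv3d(3, 32, (3, 5, 5), stride=(1, 2, 2), padding=(1, 2, 2))
        self.pool1 = nn.MaxPool3d((1, 2, 2))
        self.fpa   = FPA3D(32)                 # attention on F1, shape preserved
        self.conv2 = nn.Conv3d(32, 64, (3, 5, 5), stride=(1, 1, 1), padding=(1, 2, 2))
        self.pool2 = nn.MaxPool3d((1, 2, 2))

    def forward(self, x):                      # x: (N, 3, 75, 50, 100)
        f1 = self.pool1(self.conv1(x))         # F1: (N, 32, 75, 12, 25)
        f1 = self.fpa(f1)                      # re-weighted, same shape
        return self.pool2(self.conv2(f1))      # F2: (N, 64, 75, 6, 12)
```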
D. Combining ResNet with LSTMs

[3] proposes a word-level classification model. It starts with a spatiotemporal convolutional front-end, and a Residual Network is then applied at each time-step. Since the model is trained and evaluated on LRW, a much more complex database, the ResNet helps cope with the complicated image features. It is followed by a Bidirectional Long Short-Term Memory (Bi-LSTM) network, and finally by a SoftMax layer with 500 output classes. The detailed structure is illustrated in Figure 6 (b). We apply 2D-FPA and 3D-FPA modules prior to and inside the ResNet, as shown in Figure 8.

Fig. 6. (a) The structure of LipNet [2], which performs sentence-level prediction on the GRID corpus [4]: Input (75,100,50,3) → Conv (3×5×5, strides (1,2,2), 32) → Pool (1,2,2) → F1 (75,25,12,32) → Conv (3×5×5, strides (1,1,1), 64) → Pool (1,2,2) → F2 (75,12,6,64) → Conv (3×3×3, strides (1,1,1), 96) → Pool (1,2,2) → F3 (75, (6×3×96)) → Bi-GRU 256 → Bi-GRU 256 → FC (26+1+1) → Output (75, 1), trained with the CTC loss. (b) The structure of CRNL [3], which performs word-level classification on the LRW corpus [5]: Input (29,112,112,1) → Conv (5×7×7, strides (1,2,2), 64) → Pool (1,2,2) → F1 (29,28,28,64) → ResNet-34 → F2 (29, (1×1×512)) → Linear 256 → Bi-LSTM 256 → FC 500 → Output, trained with the cross-entropy loss.
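Since the ResNet (and the 2D-FPA of Figure 8) operates on each time-step independently, a common implementation trick is to fold the temporal dimension into the batch dimension. The helper below is a minimal sketch of this pattern under our assumed (N, C, T, H, W) layout; it is not code from [3].

```python
import torch
import torch.nn as nn

def apply_per_frame(module2d: nn.Module, x: torch.Tensor) -> torch.Tensor:
    """Run a 2D module (e.g. a ResNet stage or a 2D-FPA) on every frame
    of a video tensor x of shape (N, C, T, H, W)."""
    n, c, t, h, w = x.shape
    frames = x.permute(0, 2, 1, 3, 4).reshape(n * t, c, h, w)  # fold T into batch
    frames = module2d(frames)                                  # per-frame computation
    _, c2, h2, w2 = frames.shape
    return frames.reshape(n, t, c2, h2, w2).permute(0, 2, 1, 3, 4)
```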
Fig. 7. The modified LipNet with FPA modules embedded after the input, after F1 and after F2 respectively.
IV. EXPERIMENTS
We verify the effectiveness and adaptability of the proposed module by embedding it into the backbone lipreading models and evaluating the modified models on the corresponding datasets. In this section, we present our sentence-level and word-level experiments respectively, including the datasets, implementation details, results and analysis.
A. Sentence-level

1) Dataset:
The GRID corpus is a sentence-level dataset containing audio and video recordings of 34 speakers, each producing 1000 sentences, for a total duration of about 28 hours. The sentences are produced with a fixed grammar: command + color + preposition + letter + digit + adverb. The detailed components of the 6 categories are shown in Table I. For example, two sentences in the dataset are "Bin blue at A zero again" and "Place green with B eight soon".
Fig. 8. The modified CRNL with FPA modules embedded prior to and inside the ResNet respectively.

TABLE I
COMPONENTS OF THE SENTENCES IN THE GRID CORPUS

Category      Content                        Number
Command       bin, lay, place, set           4
Color         blue, green, red, white        4
Preposition   at, by, in, with               4
Letter        A, B, ..., Y, Z (without W)    25
Digit         zero, one, ..., eight, nine    10
Adverb        again, now, please, soon       4
The original LipNet and our modified LipNet with FPA modules are trained and evaluated on the GRID corpus. We follow the dataset split setting in [2] and use two male speakers (speakers 1 and 2) and two female speakers (speakers 20 and 22) for evaluation (3971 videos are usable after filtering out invalid videos). The remaining videos are used for training (28775 videos are usable). Under this setting, the models are evaluated on speakers that have not appeared in the training process, testing the generalization ability of the models in a more convincing way.

2) Implementation Details:
We use the Keras implementation of LipNet from GitHub and attain a word error rate (WER) of 0.17741 on the evaluation set, higher than the 0.114 reported in the original paper. The gap may be attributed to not introducing video clips of individual words as additional training instances, or to other deficiencies in the training process. Since the goal of our work is to test the validity of 3D-FPA, we did not fine-tune the network further to achieve the WER declared in [2].

Inside the FPA module, we apply Batch Normalization layers and Dropout layers after each convolutional layer. Since GRID is not very large and the model overfits easily, the Dropout layers alleviate this problem. When upsampling the high-level feature maps, the upsampled features may not have the same shape as the previous feature maps, so the two cannot be added directly. We therefore employ bilinear upsampling when the spatial sizes do not match, and pad the last frame to the end before upsampling when the temporal sizes do not match.

We use Adam as the optimizer, with a learning rate of 0.0001, a first-moment momentum coefficient of 0.9 and a second-moment momentum coefficient of 0.999.
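A minimal sketch of the shape-matching rule just described, assuming (N, C, T, H, W) tensors; the function name is ours, and this follows the paper's stated rule (repeat the last frame, then per-frame bilinear upsampling) rather than the trilinear interpolation assumed in the earlier module sketch.

```python
import torch
import torch.nn.functional as F

def match_shape(x: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Grow x to target's (T, H, W) so the two feature maps can be added."""
    n, c, t, h, w = x.shape
    tt, th, tw = target.shape[2:]
    if t < tt:
        # Temporal mismatch: pad by repeating the last frame.
        last = x[:, :, -1:].expand(n, c, tt - t, h, w)
        x = torch.cat([x, last], dim=2)
    t = x.shape[2]
    if (h, w) != (th, tw):
        # Spatial mismatch: bilinear upsampling, applied frame by frame.
        frames = x.permute(0, 2, 1, 3, 4).reshape(n * t, c, h, w)
        frames = F.interpolate(frames, size=(th, tw), mode="bilinear",
                               align_corners=False)
        x = frames.reshape(n, t, c, th, tw).permute(0, 2, 1, 3, 4)
    return x
```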
3) Results and Analysis:
We embed 2D-FPA and 3D-FPA at different positions in the original architecture, as shown in Figure 7 (a), (b) and (c) respectively, and train and evaluate them the same way as the baseline. The results are shown in Table II. CER, WER and BLEU are short for character-level error rate, word-level error rate and Bilingual Evaluation Understudy respectively. "Input" in the parentheses means the FPA module is embedded after the input frames, and "F1" means it is embedded after F1; "F1, F2" means we simultaneously apply 3D-FPA modules after F1 and after F2. The results indicate that the performance of LipNet improves considerably with the 3D-FPA module embedded.
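For reference, a minimal sketch of the WER metric used above: the Levenshtein (edit) distance between the predicted and reference word sequences, normalized by the reference length. CER is the same computation over characters.

```python
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j]: edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[-1][-1] / max(len(ref), 1)

print(wer("bin blue at a zero again", "bin blue at e zero again"))  # 1/6
```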
TABLE II
RESULTS OF SENTENCE-LEVEL EXPERIMENTS ON THE GRID CORPUS

Method                    CER       WER       BLEU
LipNet                    0.11192   0.17741   0.82409
LipNet + 2D-FPA(Input)    0.08721   0.15924   0.85144
LipNet + 3D-FPA(Input)    0.07571   0.14543   0.86840
LipNet + 2D-FPA(F1)       0.08983   0.16033   0.84863
LipNet + 3D-FPA(F1)       0.08019   0.14224   0.86725
LipNet + 2D-FPA(F2)       0.08002   0.14660   0.86441
LipNet + 3D-FPA(F2)       -         -         -
LipNet + 3D-FPA(F1,F2)    0.07798   0.14199   0.86956
Figure 9 shows the training loss and validation loss during training. The training loss of the model with the 3D-FPA module starts to decrease earlier than that of the original model, after a few epochs of stasis. Also, the validation loss of the modified model is lower than that of the original model, showing that the 3D-FPA module helps prevent overfitting.

Moreover, we check the error rate of each word in the sentence. All the sentences in the GRID corpus follow a fixed grammar: command + color + preposition + letter + digit + adverb, for example, "Set blue by A four please" and "Place red at C zero again". Table III shows the performance on each word of the sentences, where the number in parentheses denotes the position of the word in the sentence. The results show that our method reduces the WER of each word to different degrees. It is noteworthy that it brings a large improvement on the fifth and sixth words, with a substantial decline in WER.
Fig. 9. (a) Training loss and (b) validation loss of the original LipNet and the modified LipNet with 3D-FPA (F1). The curves indicate that 3D-FPA works well for preventing overfitting.
B. Word-level

1) Dataset:
LRW is the most challenging public word-level database. It contains large amounts of audiovisual speech segments extracted from BBC TV broadcasts. Several characteristics make it challenging. (1) It has 500 target words, compared with 24 words in GRID and 26 words in CUAVE. For each target word, it provides a training set of 1000 segments and a validation set and an evaluation set of 50 segments each. The total duration of the corpus is 173 hours, compared to 28 hours for GRID. (2) The videos have high variation in the pose, angle and age of the speakers. Also, the footage is recorded in the wild, while most lipreading datasets are recorded indoors in a controlled lab environment. (3) The target words are not isolated in the video segments but appear with a few context words, which adds to the difficulty of recognizing them: the model is expected to focus on the key frames and ignore the distracting context. Some cropped frames from LRW are presented in Figure 10.

The model proposed in [3] and our modified models with FPA embedded are trained and evaluated on LRW.
2) Implementation Details:
We use the PyTorch implementation of CRNL from GitHub and follow the training instructions in [3]. Initially, a temporal convolutional back-end is used in place of the Bi-LSTM. After convergence, the temporal convolutional back-end is removed and the Bi-LSTM back-end is attached. Keeping the weights of the spatiotemporal convolution front-end and the ResNet fixed, the Bi-LSTM back-end is trained for 5 epochs. Finally, the overall system is trained end-to-end. Our model reaches an accuracy of 0.7780 on the test set after convergence, lower than the 0.830 reported in the original paper. Since LRW is much more complex, we only apply Batch Normalization layers after each convolutional layer in the FPA and do not apply Dropout layers. We employ a ResNet-34 network in the architecture and use the standard SGD training algorithm with an initial learning rate of 0.003 and a momentum of 0.9.

TABLE III
WER OF EACH WORD ON THE GRID CORPUS

Method                    WER       WER(1)    WER(2)    WER(3)    WER(4)    WER(5)    WER(6)
LipNet                    0.17741   0.03123   0.09166   0.21884   0.36540   0.23218   0.12188
LipNet + 2D-FPA(Input)    0.15924   -         -         -         -         -         -
LipNet + 3D-FPA(F1,F2)    0.14199   0.04357   0.09771   0.19970   -         -         -
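A minimal sketch of the staged training schedule described above, with assumed module names (frontend, resnet, backend): stage 1 trains only the Bi-LSTM back-end with the rest frozen; stage 2 fine-tunes everything end-to-end.

```python
import torch
import torch.nn as nn

model = nn.ModuleDict({
    "frontend": nn.Conv3d(1, 64, (5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3)),
    "resnet":   nn.Identity(),   # stand-in for the ResNet-34
    "backend":  nn.LSTM(512, 256, bidirectional=True, batch_first=True),
})

def set_trainable(module: nn.Module, flag: bool) -> None:
    for p in module.parameters():
        p.requires_grad = flag

# Stage 1: freeze the front-end and ResNet, train the back-end for 5 epochs.
set_trainable(model["frontend"], False)
set_trainable(model["resnet"], False)
opt = torch.optim.SGD((p for p in model.parameters() if p.requires_grad),
                      lr=0.003, momentum=0.9)

# Stage 2: unfreeze everything and train end-to-end.
set_trainable(model["frontend"], True)
set_trainable(model["resnet"], True)
opt = torch.optim.SGD(model.parameters(), lr=0.003, momentum=0.9)
```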
3) Results and Analysis:
We embed 2D-FPA and 3D-FPA modules before and inside the ResNet, as shown in Figure 8, import the pretrained weights of the original model and train on the LRW corpus. The results are shown in Table IV; the performance of the CRNL model is improved with FPA modules embedded.

Table V shows the 20 words predicted with the highest accuracy by the original CRNL model. Table VI shows 10 words predicted with the lowest accuracy by the original model, together with the corresponding accuracy of the modified models. It indicates that the model performs better on longer words, which accords with the fact that human lipreaders perform better on longer words or sentences. Table VI also shows that our proposed module improves the performance of the basic architecture on these short words.

We compare the accuracy of the original model and of the model with one 3D-FPA module embedded, shown in Figure 8 (a), to find which types of words improve the most. Table VII shows the words with the highest growth in accuracy, as well as the words they are most likely to be mistaken for. The results indicate that our proposed module improves the ability to distinguish words with different suffixes, such as "Difference" and "Different", and words that differ in the initial phonemes, such as "Million" and "Billion", which is one of the concerns of the lipreading task.
TABLE IV
RESULTS OF WORD-LEVEL EXPERIMENTS ON THE LRW CORPUS

Method              Accuracy
CRNL                0.7780
CRNL + 2D-FPA       0.7837
CRNL + 3D-FPA*1     0.7895
CRNL + 3D-FPA*4     -
TABLE V
THE WORDS WITH HIGHEST ACCURACY BY THE ORIGINAL MODEL

Ground Truth    Acc     Ground Truth   Acc
WESTMINSTER     1.00    INVESTMENT     0.98
TEMPERATURES    1.00    CHIEF          0.98
PRIVATE         1.00    WOMEN          0.98
GERMANY         1.00    MIGRANTS       0.98
BEFORE          1.00    TOMORROW       0.98
WEAPONS         1.00    FOLLOWING      0.96
PROVIDE         1.00    INFORMATION    0.96
WELFARE         1.00    PARLIAMENT     0.96
SUNSHINE        0.98    POTENTIAL      0.96
EUROPEAN        0.98    AFTERNOON      0.96

TABLE VI
THE WORDS WITH LOWEST ACCURACY BY THE ORIGINAL MODEL AND RELATIVE ACCURACY BY MODIFIED MODELS

Ground Truth   Acc(CRNL)   Acc(CRNL+2D-FPA)   Acc(CRNL+3D-FPA*1)   Acc(CRNL+3D-FPA*4)
UNDER          0.31        -                  -                    -
UNTIL          0.35        -                  -                    -
THERE          0.38        0.38               -                    -
TAKING         0.44        -                  0.60                 -
TABLE VII
THE WORDS WITH HIGHEST GROWTH IN ACCURACY

Ground Truth   Acc(CRNL)   Acc(CRNL+3D-FPA*1)   Growth   Confusing Word   Error Rate
WORDS          0.58        0.76                 0.18     WORLD            0.10
PRICE          0.52        0.68                 0.16     PRESS            0.08
DIFFERENCE     0.72        0.88                 0.16     DIFFERENT        0.12
TAKING         0.44        0.60                 0.16     TAKEN            0.14
BANKS          0.64        0.80                 0.16     PLACE            0.10
RUSSIAN        0.61        0.76                 0.15     RUSSIA           0.27
LEADERS        0.58        0.72                 0.14     LEAST            0.08
CLOSE          0.66        0.80                 0.14     ALLOWED          0.06
MORNING        0.86        0.98                 0.12     POINT            0.04
MILLION        0.64        0.76                 0.12     BILLION          0.10
V. CONCLUSION

We propose a spatiotemporal pixel-level attention module for visual speech recognition. It utilizes high-level features to guide the attention of low-level features, exploring the multi-scale context information in both the temporal and spatial domains. Notably, it offers a creative usage of temporal pyramids by combining pyramidal structures with the attention mechanism. Owing to its concise structure and low computation cost, 3D-FPA can be embedded into different backbone architectures, and therefore has great potential in other video understanding tasks besides visual speech recognition.
REFERENCES

[1] Hanchao Li, Pengfei Xiong, Jie An, and Lingxue Wang. Pyramid Attention Network for Semantic Segmentation. pages 1–12, 2018.
[2] Yannis M. Assael, Brendan Shillingford, Shimon Whiteson, and Nando de Freitas. LipNet: End-to-End Sentence-level Lipreading. pages 1–13, 2016.
[3] Themos Stafylakis and Georgios Tzimiropoulos. Combining Residual Networks with LSTMs for Lipreading. Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, pages 3652–3656, 2017.
[4] M. Cooke, J. Barker, S. Cunningham, and X. Shao. An audio-visual corpus for speech perception and automatic speech recognition. Journal of the Acoustical Society of America, 120(1):2421–2424, 2006.
[5] Joon Son Chung and Andrew Zisserman. Lip reading in the wild, 2017.
[6] Alex Graves. Connectionist Temporal Classification, pages 61–93. Springer Berlin Heidelberg, Berlin, Heidelberg, 2012.
[7] Ross Girshick. Fast R-CNN. Computer Science, 2015.
[8] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: towards real-time object detection with region proposal networks. pages 91–99, 2015.
[9] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature Pyramid Networks for Object Detection. CVPR, 2017.
[10] Jie Hu, Li Shen, and Gang Sun. Squeeze-and-Excitation Networks. pages 1–11, 2017.
[11] Yancheng Bai, Huijuan Xu, Kate Saenko, and Bernard Ghanem. Contextual Multi-Scale Region Convolutional 3D Network for Activity Detection. 2018.
[12] Da Zhang, Xiyang Dai, and Yuan-Fang Wang. Dynamic Temporal Pyramid Network: A Closer Look at Multi-Scale Modeling for Activity Detection. 2018.
[13] Yue Zhao, Yuanjun Xiong, Limin Wang, Zhirong Wu, Xiaoou Tang, and Dahua Lin. Temporal Action Detection with Structured Segment Networks. Proceedings of the IEEE International Conference on Computer Vision, pages 2933–2942, 2017.
[14] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid Scene Parsing Network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[15] A. J. Goldschen, O. N. Garcia, and E. D. Petajan. Continuous automatic speech recognition by lipreading. In Motion-Based Recognition, pages 321–343. Springer, 1997.
[16] G. I. Chiou and J.-N. Hwang. Lipreading from color video. IEEE Transactions on Image Processing, 6(8):1192–1195, 1997.
[17] G. Potamianos, C. Neti, G. Gravier, A. Garg, and A. W. Senior. Recent advances in the automatic recognition of audiovisual speech. Proceedings of the IEEE, 91(9):1306–1326, 2003.
[18] C. Chandrasekaran, A. Trubanova, S. Stillittano, A. Caplier, and A. A. Ghazanfar. The natural statistics of audiovisual speech. PLoS Computational Biology, 5(7), 2009.
[19] K. Noda, Y. Yamaguchi, K. Nakadai, H. G. Okuno, and T. Ogata. Audio-visual speech recognition using deep learning. Applied Intelligence, 42(4):722–737, 2015.
[20] K. Thangthai, R. W. Harvey, S. J. Cox, and B.-J. Theobald. Improving lip-reading performance for robust audiovisual speech recognition using DNNs. In AVSP, pages 127–131, 2015.
[21] I. Almajai, S. Cox, R. Harvey, and Y. Lan. Improved speaker independent lip reading using speaker adaptive training and deep neural networks. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2722–2726. IEEE, 2016.
[22] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[23] A. Graves, S. Fernández, and J. Schmidhuber. Bidirectional LSTM networks for improved phoneme classification and recognition. In International Conference on Artificial Neural Networks (ICANN), 2005.