FastLR: Non-Autoregressive Lipreading Model with Integrate-and-Fire
Jinglin Liu, Yi Ren, Zhou Zhao, Chen Zhang, Baoxing Huai, Nicholas Jing Yuan
Jinglin Liu∗ Zhejiang University, [email protected]
Yi Ren∗ Zhejiang University, [email protected]
Zhou Zhao† Zhejiang University, [email protected]
Chen Zhang Zhejiang University, [email protected]
Baoxing Huai Huawei Technologies Co., Ltd., [email protected]
Jing Yuan Huawei Cloud, [email protected]
ABSTRACT
Lipreading is an impressive technique and there has been a definite improvement of accuracy in recent years. However, existing methods for lipreading mainly build on the autoregressive (AR) model, which generates target tokens one by one and suffers from high inference latency. To break through this constraint, we propose FastLR, a non-autoregressive (NAR) lipreading model which generates all target tokens simultaneously. NAR lipreading is a challenging task with many difficulties: 1) the discrepancy of sequence lengths between source and target makes it difficult to estimate the length of the output sequence; 2) the conditionally independent behavior of NAR generation lacks the correlation across time, which leads to a poor approximation of the target distribution; 3) the feature representation ability of the encoder can be weak due to the lack of an effective alignment mechanism; and 4) the removal of the AR language model exacerbates the inherent ambiguity problem of lipreading. Thus, in this paper, we introduce three methods to reduce the gap between FastLR and the AR model: 1) to address challenges 1 and 2, we leverage an integrate-and-fire (I&F) module to model the correspondence between source video frames and the output text sequence. 2) To tackle challenge 3, we add an auxiliary connectionist temporal classification (CTC) decoder on top of the encoder and optimize it with an extra CTC loss. We also add an auxiliary autoregressive decoder to help the feature extraction of the encoder. 3) To overcome challenge 4, we propose a novel Noisy Parallel Decoding (NPD) for I&F and bring Byte-Pair Encoding (BPE) into lipreading. Our experiments show that FastLR achieves a speedup of up to 10.97× compared with the state-of-the-art lipreading model, with a slight WER absolute increase of 1.5% and 5.5% on the GRID and LRS2 lipreading datasets respectively, which demonstrates the effectiveness of our proposed method.

CCS CONCEPTS
• Computing methodologies → Computer vision; Speech recognition.

∗ Equal contribution.
† Corresponding author.
MM ’20, October 12–16, 2020, Seattle, WA, USA
© 2020 Copyright held by the owner/author(s). Publication rights licensed to ACM. This is the author’s version of the work. It is posted here for your personal use. Not for redistribution. The definitive Version of Record was published in Proceedings of the 28th ACM International Conference on Multimedia (MM ’20), October 12–16, 2020, Seattle, WA, USA, https://doi.org/10.1145/3394171.3413740.
KEYWORDS
Lip Reading; Non-autoregressive generation; Deep Learning
ACM Reference Format:
Jinglin Liu∗, Yi Ren∗, Zhou Zhao†, Chen Zhang, Baoxing Huai, and Jing Yuan. 2020. FastLR: Non-Autoregressive Lipreading Model with Integrate-and-Fire. In Proceedings of the 28th ACM International Conference on Multimedia (MM ’20), October 12–16, 2020, Seattle, WA, USA. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3394171.3413740
1 INTRODUCTION

Lipreading aims to recognize sentences being spoken by a talking face. It is widely used in many scenarios, including dictating instructions or messages in a noisy environment, transcribing archival silent films, resolving multi-talker speech [1] and understanding dialogue from surveillance videos. However, it is widely considered a challenging task, and even experienced human lipreaders cannot master it perfectly [3, 24]. Thanks to the rapid development of deep learning in recent years, there has been a line of works studying lipreading, and salient achievements have been made.

Existing methods mainly adopt the autoregressive (AR) model, either based on RNN [25, 33] or Transformer [1, 2]. Those systems generate each target token conditioned on the sequence of tokens generated previously, which hinders parallelizability. Thus, they all without exception suffer from high inference latency, especially when dealing with massive video data containing hundreds of hours (like long films and surveillance videos) or real-time applications such as dictating messages in a noisy environment.

To tackle the low parallelizability caused by AR generation, many non-autoregressive (NAR) models [13–17, 21, 31] have been proposed in the machine translation field. The most typical one is NAT-FT [13], which modifies the Transformer [29] by adding a fertility module to predict the number of words in the target sequence aligned to each source word. Besides NAR translation, many researchers bring NAR generation into other sequence-to-sequence tasks, such as video caption [20, 22], speech recognition [5] and speech synthesis [18, 22]. These works focus on generating the target sequence in parallel and mostly achieve more than an order of magnitude lower inference latency than their corresponding AR models.

However, it is very challenging to generate the whole target sequence simultaneously in the lipreading task, in the following aspects:
• The considerable discrepancy of sequence length between the input video frames and the target text tokens makes it difficult to estimate the length of the output sequence or to define a proper decoder input during the inference stage. This is different from the machine translation model, which can even simply adopt the way of uniformly mapping the source word embeddings as the decoder input [31] due to the analogous text sequence lengths.
• The true target sequence distributions show a strong correlation across time, but the NAR model usually generates target tokens conditionally independently of each other. This is a poor approximation and may generate repeated words. Gu et al. [13] term this the "multimodality problem".
• The feature representation ability of the encoder could be weak when just training the raw NAR model, due to the lack of an effective alignment mechanism.
• The removal of the autoregressive decoder, which usually acts as a language model, makes it much more difficult for the model to tackle the inherent ambiguity problem in lipreading.

In our work, we propose FastLR, a non-autoregressive lipreading model based on Transformer. To handle the challenges mentioned above and reduce the gap between FastLR and the AR model, we introduce three methods as follows:
• To estimate the length of the output sequence and alleviate the problem of time correlation in the target sequence, we leverage an integrate-and-fire (I&F) module to encode the continuous video signal into discrete token embeddings by locating the acoustic boundaries, which is inspired by Dong and Xu [10]. These discrete embeddings retain the timing information and correspond to the target tokens directly.
• To enhance the feature representation ability of the encoder, we add a connectionist temporal classification (CTC) decoder on top of the encoder and optimize it with a CTC loss, which could force monotonic alignments.
Besides, we add an auxiliary AR decoder during training to facilitate the feature extraction ability of the encoder.
• To tackle the inherent ambiguity problem and reduce the spelling errors in NAR inference, we first propose a novel Noisy Parallel Decoding (NPD) for I&F method. The re-scoring method in NPD takes advantage of the language model in the well-trained AR lipreading teacher without harming parallelizability. Then we bring Byte-Pair Encoding (BPE) into lipreading, which compresses the target sequence and makes each token contain more language information, reducing the dependency among tokens compared with character-level encoding.

The core contribution of this work is that we are the first to propose a non-autoregressive lipreading system, and we present several elaborate methods mentioned above to bridge the gap between FastLR and state-of-the-art autoregressive lipreading models. The experimental results show that FastLR achieves a speedup of up to 10.97× compared with the state-of-the-art lipreading model, with a slight WER increase of 1.5% and 5.5% on the GRID and LRS2 lipreading datasets respectively, which demonstrates the effectiveness of our proposed method. We also conduct ablation experiments to verify the significance of all proposed methods in FastLR.

2 RELATED WORK

Prior works utilizing deep learning for lipreading mainly adopt the autoregressive model. The first typical approach is LipNet [3] based on CTC [12], which takes advantage of a spatio-temporal convolutional front-end feature generator and GRU [6]. Further, Stafylakis and Tzimiropoulos [25] propose a network combining a modified 3D/2D-ResNet architecture with LSTM. Afouras et al. [1] introduce the Transformer self-attention architecture into lipreading and build TM-seq2seq and TM-CTC; the former surpasses the performance of all previous work on the LRS2-BBC dataset by a large margin. To boost the performance of lipreading, Petridis et al.
[19] present a hybrid CTC/Attention architecture aiming to obtain better alignment than an attention-only mechanism, and Zhao et al. [33] provide the idea of transferring knowledge from an audio speech recognition model to a lipreading model by distillation. However, these methods, either based on recurrent neural networks or Transformer, all adopt the autoregressive decoding method, which takes in the input video sequence and generates the tokens of the target sentence y one by one during the inference process. And they all suffer from high latency.

An autoregressive model takes in a source sequence x = (x_1, x_2, ..., x_{T_x}) and generates the words of the target sentence y = (y_1, y_2, ..., y_{T_y}) one by one with a causal structure during the inference process [26, 29]. To reduce the inference latency, Gu et al. [13] introduce the non-autoregressive model based on Transformer into the machine translation field, which generates all target words in parallel. The conditional probability can be defined as

P(y | x) = P(T_y | x) · ∏_{t=1}^{T_y} P(y_t | x),   (1)

where T_y is the length of the target sequence obtained from the fertility prediction function conditioned on the source sentence. Due to the multimodality problem [13], the performance of the NAR model is usually inferior to the AR model. Recently, a line of works aiming to bridge the performance gap between NAR and AR models for the translation task has been presented [11, 14]. Besides the study of NAR translation, many works bring the NAR model into other sequence-to-sequence tasks, such as video caption [32], speech recognition [5] and speech synthesis [18, 22].

The integrate-and-fire neuron model describes the membrane potential of a neuron according to the synaptic inputs and the injected current [4]. It is biologically plausible and widely used in spiking neural networks. Concretely, the neuron integrates the input signal forwardly and increases the membrane potential.
Once the membrane potential reaches a threshold, a spike signal is generated, which means an event takes place. Henceforth, the membrane potential is reset and then grows in response to the subsequent input signal again. This enables the encoding of continuous signal sequences into discrete signal sequences, while retaining the timing information.

Figure 1: The overview of the model architecture for FastLR.
Recently, Dong and Xu [10] introduced the integrate-and-fire model into the speech recognition task. They use continuous functions that support back-propagation to simulate the process of integrate-and-fire. In this setting, a fired spike represents the event of locating an acoustic boundary.
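To make the accumulate-and-fire behaviour concrete, here is a minimal NumPy sketch, not the paper's implementation: it assumes the per-frame weights come from a sigmoid (so each lies in (0, 1)) and uses the firing threshold of 1.0 that FastLR adopts; all names are illustrative.

```python
import numpy as np

def integrate_and_fire(h, w, threshold=1.0):
    """Integrate weighted hidden states h (m x d) with weights w (m,);
    fire one embedding each time the accumulated weight reaches the
    threshold. Assumes each w_i < threshold (sigmoid outputs)."""
    fired = []                         # fired embeddings f_1, f_2, ...
    acc_w = 0.0                        # weight accumulated since last firing
    acc_f = np.zeros(h.shape[1])       # partially integrated embedding
    for h_i, w_i in zip(h, w):
        if acc_w + w_i < threshold:    # no boundary inside this frame
            acc_w += w_i
            acc_f = acc_f + w_i * h_i
        else:                          # boundary detected inside frame i
            w_i1 = threshold - acc_w   # part that completes f_j
            w_i2 = w_i - w_i1          # remainder starts f_{j+1}
            fired.append(acc_f + w_i1 * h_i)
            acc_w = w_i2
            acc_f = w_i2 * h_i
    # any residual weight below the threshold is discarded at inference
    return np.stack(fired) if fired else np.zeros((0, h.shape[1]))
```

The number of fired embeddings equals ⌊∑w⌋, which is how the module simultaneously predicts the output length and segments the frames.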
3 METHOD

In this section, we introduce FastLR and describe our methods thoroughly. As shown in Figure 1, FastLR is composed of a spatio-temporal convolutional neural network for video feature extraction (the visual front-end) and a sequence processing model (the main model) based on Transformer, with an enhanced encoder, a non-autoregressive decoder and an I&F module. To further tackle the challenges in non-autoregressive lipreading, we propose the NPD method for I&F and bring byte-pair encoding into our method. The details of our model and methods are described in the following subsections. (We introduce the visual front-end in section 4.2, as it varies from one dataset to another.)

3.1 Enhanced Encoder

The encoder of FastLR is composed of stacked self-attention and feed-forward layers, which are the same as those in Transformer [29] and the autoregressive lipreading model (TM-seq2seq [1]). We add an auxiliary autoregressive decoder, shown in the left panel of Figure 1, so that we can optimize the AR lipreading task together with FastLR, with one shared encoder, during the training stage. This transfers knowledge from the AR model to FastLR, which facilitates the optimization. Besides, we add a connectionist temporal classification (CTC) decoder with a CTC loss on the encoder to force monotonic alignments, which is a widely used technique in the speech recognition field. Both adjustments improve the feature representation ability of our encoder.
3.2 Integrate-and-Fire Module

To estimate the length of the output sequence and alleviate the problem of time correlation in the target sequence, we adopt the continuous integrate-and-fire (I&F) [10] module for FastLR. This is a soft and monotonic alignment which can be employed in an encoder-decoder sequence processing model. First, the encoder output hidden sequence h = (h_1, h_2, ..., h_m) is fed to a 1-dimensional convolutional layer followed by a fully connected layer with sigmoid activation. We thereby obtain the weight embedding sequence w = (w_1, w_2, ..., w_m), which represents the weight of the information carried in h. Second, the I&F module scans w and accumulates the weights from left to right until the sum reaches a threshold (we set it to 1.0), which means an acoustic boundary is detected. Third, I&F divides the weight w_i at this point into two parts, w_{i,1} and w_{i,2}: w_{i,1} fulfills the integration of the current embedding f_j to be fired, while w_{i,2} is used for the next integration of f_{j+1}. Then, I&F resets the accumulation and continues to scan the rest of w, beginning with w_{i,2}, for the next integration. This procedure is denoted "accumulate and detect". Finally, I&F multiplies each w_k (or w_{k,1}, w_{k,2}) in w by the corresponding h_k and integrates the products according to the detected boundaries. An example is shown in Figure 2.

3.3 Non-Autoregressive Decoder

Different from the Transformer decoder, the self-attention of FastLR's decoder can attend to the entire sequence, owing to the conditionally independent property of the NAR model. And we remove the inter-attention mechanism, since FastLR already has an alignment mechanism (I&F) between source and target. The decoder takes in the fired embedding sequence of I&F, f = (f_1, f_2, ..., f_n), and generates the text tokens y = (y_1, y_2, ..., y_n) in parallel during either the training or the inference stage.

3.4 Noisy Parallel Decoding for I&F

The absence of the AR decoding procedure makes it much more difficult for the model to tackle the inherent ambiguity problem in lipreading.
So, we design a novel NPD-for-I&F method to leverage the language information in a well-trained AR lipreading model. From section 3.2, it is not hard to see that ⌊S⌋ represents the length of the predicted sequence f (or y), where S is the total sum of w. Dong and Xu [10] propose a scaling strategy which multiplies w by the scalar S̃ / ∑_{i=1}^{m} w_i to generate w̃ = (w̃_1, w̃_2, ..., w̃_m), where S̃ is the length of the target label ỹ. By doing so, the total sum of w̃ equals S̃, and this teacher-forces I&F to predict f with the true length S̃, which benefits the cross-entropy training.

However, we do not stop at this point. Besides training, we also scale w during the inference stage to generate multiple candidates of the weight embedding with different length biases b̃. Given the beam size B, we set

w̃_b̃ = ((∑_{i=1}^{m} w_i + b̃) / (∑_{i=1}^{m} w_i)) · w,  where b̃ ∈ [−B, B] ∩ Z,   (2)

where w = (w_1, w_2, ..., w_m) is the output of the I&F module during inference, and the length bias b̃ is provided by the "Length Controller" module in Figure 1. Then, we utilize the re-scoring method used in Noisy Parallel Decoding (NPD), which is a common practice in non-autoregressive neural machine translation, to select the best sequence from these 2B + 1 candidates via an AR lipreading teacher:

w_NPD = argmax_{w̃_b̃} p_AR(G(x, w̃_b̃; θ) | x),   (3)

where p_AR(A | x) is the probability the autoregressive model assigns to a sequence A, G(x, w; θ) denotes the optimal generation of FastLR given a source sentence x and weight embedding w, and θ represents the parameters of the model. The selection process leverages the information in the language model (decoder) of the well-trained autoregressive lipreading teacher, which alleviates the ambiguity problem and gives a chance to adjust the weight embedding generated by the I&F module to predict a better sequence length.
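The length-biased candidate generation and teacher re-scoring described above can be sketched as follows. `decode` and `teacher_score` are stand-ins for the FastLR decoder and the AR teacher's sequence probability; both names, and the toy length-based decoder in the test, are illustrative only.

```python
import numpy as np

def npd_candidates(w, B):
    """Scale the weight sequence w so its total mass shifts by each
    integer length bias b in [-B, B], giving 2*B + 1 candidates."""
    S = float(np.sum(w))
    return [((S + b) / S) * np.asarray(w, dtype=float) for b in range(-B, B + 1)]

def npd_select(w, B, decode, teacher_score):
    """Decode every candidate (in the real system, in parallel; here a
    loop for clarity) and keep the sequence the AR teacher scores best."""
    candidates = [decode(w_b) for w_b in npd_candidates(w, B)]
    scores = [teacher_score(y) for y in candidates]
    return candidates[int(np.argmax(scores))]
```

With B = 4 this yields the 2B + 1 = 9 candidates of the NPD9 setting used in the experiments.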
Note that these candidates can be computed independently, which won't hurt the parallelizability (it only doubles the latency due to the selection process). The experiments demonstrate that the re-scored sequence is more accurate.

3.5 Byte-Pair Encoding

Byte-Pair Encoding [23] is widely used in the NMT [29] and ASR [10] fields, but rarely in lipreading tasks. BPE makes each token contain more language information and reduces the dependency among tokens compared with character-level encoding, which alleviates the problems of non-autoregressive generation discussed before. In this work, we tokenize each sentence with the Moses tokenizer and then use the BPE algorithm to segment each target word into sub-words.

Figure 2: An example to illustrate how the I&F module works. h represents the encoder output hidden sequence. In this case f_1 = w_1·h_1 + w_{2,1}·h_2 and f_2 = w_{2,2}·h_2 + w_3·h_3 + w_4·h_4 + w_5·h_5 + w_6·h_6 + w_{7,1}·h_7.

3.6 Loss Function

We optimize the CTC decoder with the CTC loss. CTC introduces a set of intermediate representation paths ϕ(y), termed CTC paths, for one target text sequence y.
Each CTC path is composed of scattered target text tokens and blanks, and can be reduced to the target text sequence by removing the repeated tokens and blanks. The likelihood of y can be calculated as the sum of the probabilities of all CTC paths corresponding to it:

P_ctc(y | x) = ∑_{c ∈ ϕ(y)} P_ctc(c | x).   (4)

Thus, the CTC loss can be formulated as:

L_ctc = − ∑_{(x,y) ∈ (X×Y)} log ∑_{c ∈ ϕ(y)} P_ctc(c | x),   (5)

where (X×Y) denotes the set of source video and target text sequence pairs in one batch. We optimize the auxiliary autoregressive task with the cross-entropy loss:

L_AR = − ∑_{(x,y) ∈ (X×Y)} log P_AR(y | x).   (6)

And most importantly, we optimize the main FastLR task with the cross-entropy loss and a sequence length loss:

L_FLR = ∑_{(x,y) ∈ (X×Y)} [ −log P_FLR(y | x) + (S̃_x − S_x)² ],   (7)

where S̃ and S are defined in section 3.4. The total loss function for training our model is:

L = λ_1·L_ctc + λ_2·L_AR + λ_3·L_FLR,   (8)

where λ_1, λ_2, λ_3 are hyperparameters to trade off the three losses.

(Moses tokenizer: https://github.com/moses-smt/mosesdecoder/blob/master/scripts/tokenizer/tokenizer.perl)

4 EXPERIMENTS AND RESULTS

4.1 Datasets
GRID.
The GRID dataset [9] consists of 34 subjects, each of whom utters 1,000 phrases. It is a clean dataset and easy to learn. We adopt the same split as Assael et al. [3], where 255 random sentences from each speaker are selected for evaluation. In order to better recognize lip movements, we transform the images into gray scale and crop the video frames to a fixed 100×50 region containing the mouth, using the Dlib face detector. Since the vocabulary of the GRID dataset is quite small and most words are simple, we do not apply Byte-Pair Encoding [23] on GRID, and just encode the target sequence at the character level.
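The preprocessing just described (grayscale conversion plus a fixed 100×50 mouth crop) can be sketched as below. In the real pipeline the crop is placed by the Dlib face detector; here the crop centre is a parameter defaulting to the frame centre, which is only a placeholder.

```python
import numpy as np

def preprocess_frames(frames, crop_h=50, crop_w=100, center=None):
    """Convert RGB video frames (T, H, W, 3) to gray scale and cut a fixed
    crop_w x crop_h mouth-region crop. A real pipeline would place
    `center` with a face/landmark detector; here it defaults to the
    frame centre for illustration."""
    # ITU-R BT.601 luma coefficients for RGB -> gray
    gray = frames @ np.array([0.299, 0.587, 0.114])
    T, H, W = gray.shape
    cy, cx = center if center is not None else (H // 2, W // 2)
    y0, x0 = cy - crop_h // 2, cx - crop_w // 2
    return gray[:, y0:y0 + crop_h, x0:x0 + crop_w]
```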
LRS2.
The LRS2 dataset contains sentences of up to 100 characters from BBC videos [2], with viewpoints ranging from frontal to profile. We adopt the original split of LRS2 for the train/dev/test sets, which contain 46k, 1,082 and 1,243 sentences respectively. We also make use of the pre-training set provided with LRS2, which contains 96k sentences, for pre-training. Following previous works [1, 2, 33], the input video frames are converted to gray scale and centrally cropped into 114×114 images. As for the text sentences, we split each word token into subwords using BPE [23] and set the vocabulary size to 1k, considering the vocabulary size of LRS2. The statistics of both datasets are listed in Table 1.
Table 1: The statistics on the GRID and LRS2 lip reading datasets. Utt: Utterance.

Dataset            Utt.   Word inst.   Vocab   Hours
GRID               33k    165k         51      27.5
LRS2 (Train-dev)   47k    337k         18k     29
4.2 Model Configuration

For the GRID dataset, we use a spatio-temporal CNN to extract visual features, following Torfi et al. [27] (https://github.com/astorfi/lip-reading-deeplearning). The visual front-end network is composed of four 3D convolution layers with 3D max pooling and ReLU, and two fully connected layers. The kernel size of the 3D convolutions and poolings is 3 × 3, and the hidden sizes of the fully connected layer and the output dense layer are both 256. We train this visual front-end together with our main model end-to-end on GRID, based on the implementation by Torfi et al. [27].

For the LRS2 dataset, we adopt the same structure as Afouras et al. [2], which applies a 3D convolution on the input frame sequence with a filter width of 5 frames, followed by a 2D ResNet that decreases the spatial dimensions progressively with depth. The network converts a T × H × W frame sequence into a T × H/32 × W/32 × 512 feature sequence, where T, H, W are the frame number, frame height and frame width respectively. It is worth noting that training the visual front-end together with the main model yields poor results on LRS2, as observed in previous works [1]. Thus, as Zhao et al. [33] do, we utilize the frozen visual front-end provided by Afouras et al. [1], which is pre-trained on the non-public dataset MV-LRS [8], to extract the visual features.

We adopt the Transformer [29] as the basic model structure for FastLR because it is parallelizable and achieves state-of-the-art accuracy in lipreading [1]. The model hidden size d_hidden, the number of encoder layers n_enc, the number of decoder layers n_dec and the number of attention heads n_head follow the TM-seq2seq configuration [1]. The CTC decoder consists of two fully-connected layers with ReLU activation and one fully-connected layer without activation; the hidden sizes of these fully-connected layers all equal d_hidden. The auxiliary decoder is an ordinary Transformer decoder with the same configuration as FastLR, which takes in the target text sequence shifted right by one step for teacher forcing.

4.3 Training

As mentioned in section 3.1, to boost the feature representation ability of the encoder, we add an auxiliary connectionist temporal classification (CTC) decoder and an autoregressive decoder to FastLR and optimize them together. We set λ_1 to 0.5 and λ_2, λ_3 to 1 in the warm-up stage, and set λ_1, λ_2 to 0 in the main stage.

Table 2: The training steps of FastLR for each dataset in each training stage.

Stage     GRID   LRS2
Warm-up   300k   55k
Main      160k   120k
During the inference stage, the auxiliary CTC decoder as well as the autoregressive decoder are thrown away. Given the beam size B = 4, FastLR generates 2B + 1 = 9 candidates for NPD selection. We measure recognition accuracy with the error rate:

ErrorRate = (S + D + I) / N,   (9)

where S, D, I and N are the numbers of substitutions, deletions, insertions and reference tokens (words or characters) respectively. When evaluating the latency, we run FastLR on 1 NVIDIA 1080Ti GPU in inference.

Table 3: The word error rate (WER) and character error rate (CER) on GRID.

Method                      WER     CER
Autoregressive models
LSTM [30]                   20.4%   /
LipNet [3]                  4.8%    1.9%
WAS [7]                     3.0%    /
Non-autoregressive models
NAR-LR (base)               25.8%   13.6%
FastLR (Ours)               4.5%
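The error rate of Eq. (9) is the token-level Levenshtein distance between hypothesis and reference, normalized by the reference length; a compact sketch:

```python
def error_rate(ref, hyp):
    """Token-level error rate (S + D + I) / N via Levenshtein distance.
    `ref` and `hyp` are token lists (words for WER, characters for CER)."""
    n, m = len(ref), len(hyp)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i                      # i deletions
    for j in range(m + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[n][m] / max(n, 1)
```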
Table 4: The word error rate (WER) and character error rate (CER) on LRS2. † denotes baselines from our reproduction.

Method                      WER     CER
Autoregressive models
WAS [7]                     70.4%   /
BLSTM+CTC [2]               76.5%   40.6%
FC-15 [2]                   64.8%   33.9%
LIBS [33]                   65.3%   45.5%
TM-seq2seq [1] †            61.7%
Non-autoregressive models
NAR-LR (base)               81.3%   57.9%
FastLR (Ours)               67.2%
4.4 Results

We conduct experiments with FastLR and compare it with an autoregressive lipreading baseline and several mainstream state-of-the-art AR lipreading models on the GRID and LRS2 datasets respectively. As for TM-seq2seq [1], it has the same Transformer settings as FastLR and works as the AR teacher for NPD selection. We also apply the CTC loss and the BPE technique to TM-seq2seq for a fair comparison.

The results on the two datasets are listed in Tables 3 and 4. We can see that: 1) WAS [7] and TM-seq2seq [1, 2] obtain the best autoregressive lipreading results on GRID and LRS2. Compared with them, FastLR only has a slight WER absolute increase of 1.5% and 5.5% respectively. 2) Moreover, on the GRID dataset, FastLR outperforms LipNet [3] by 0.3% WER and exceeds LSTM [30] by a notable margin; on the LRS2 dataset, FastLR achieves better WER scores than WAS and BLSTM+CTC [2] and keeps comparable performance with LIBS [33] and FC-15 [2]. In addition, compared with LIBS, we do not introduce any distillation method in the training stage, and compared with WAS and TM-seq2seq, we do not leverage information from other datasets beyond GRID and LRS2.

We also propose a baseline non-autoregressive lipreading model without the Integrate-and-Fire module, termed NAR-LR (base), and conduct experiments for comparison. As the results show, FastLR outperforms this NAR baseline distinctly. An overview of the design of NAR-LR (base) is shown in Figure 3.
Figure 3: The NAR-LR (base) model. It is also based on Transformer [29], but generates outputs in the non-autoregressive manner [13]. It sends a series of duplicated trainable tensors into the decoder to generate the target tokens. The repeat count of this trainable tensor is denoted "m". For training, m is set to the ground-truth length; for inference, we estimate it by a linear function of the input length, whose parameters are obtained with the least squares method on the training set. The auxiliary AR decoder is the same as FastLR's. The CTC decoder contains FC layers and a CTC loss.
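The least-squares length estimate used by NAR-LR (base) amounts to fitting a one-variable linear regression of target length on input length; a sketch with illustrative data:

```python
import numpy as np

def fit_length_predictor(src_lens, tgt_lens):
    """Fit m ~ a * src_len + b by least squares on (source, target)
    length pairs from the training set, as the NAR-LR baseline does."""
    a, b = np.polyfit(src_lens, tgt_lens, deg=1)
    # round to the nearest positive integer length at inference time
    return lambda src_len: max(1, int(round(a * src_len + b)))
```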
4.5 Speedup

In this section, we compare the average inference latency of FastLR with that of the autoregressive Transformer lipreading model. Then, we analyze the relationship between the speedup and the length of the predicted sequence.

(Our reproduction has weaker performance than the results reported in [1, 2], because we do not have access to MV-LRS, a non-public dataset containing individual word excerpts of frequent words used by [1, 2]. Thus, we do not adopt the curriculum learning strategy of Afouras et al. [2].)
The average latency is measured as the average time in seconds required to decode one sentence on the test set of the LRS2 dataset. We record the inference latency and the corresponding recognition accuracy of TM-seq2seq [1, 2], FastLR without NPD, and FastLR with NPD9, as listed in Table 5. The results show that FastLR speeds up the inference by 11.94× without NPD, and by 5.81× with NPD9 on average, compared with TM-seq2seq, which has a similar number of model parameters. Note that the latency is calculated excluding the computation cost of data pre-processing and the visual front-end.

Table 5: The comparison of average inference latency and corresponding recognition accuracy. The evaluation is conducted on a server with 1 NVIDIA 1080Ti GPU and 12 Intel Xeon CPUs. The batch size is set to 1. The average lengths of the generated sub-word sequences are all about 14.

Method             WER     Latency (s)   Speedup
TM-seq2seq [1]     61.7%   0.215         1.00×
FastLR (no NPD)    73.2%   0.018         11.94×
FastLR (NPD 9)     67.2%   0.037         5.81×

During inference, the autoregressive model generates the target tokens one by one, while the non-autoregressive model speeds up inference by increasing parallelization in the generation process. Thus, the longer the target sequence is, the larger the speedup rate is. We visualize the relationship between inference latency and the length of the predicted sub-word sequence in Figure 4. It can be seen that the inference latency increases distinctly with the predicted text length for TM-seq2seq, while it stays nearly constant and small for FastLR.

Then, we bucket the longest test sequences and calculate the average inference latency of TM-seq2seq and FastLR over this bucket to obtain the maximum speedup on the LRS2 test set. The results are 0.494s and 0.045s for TM-seq2seq and FastLR (NPD9) respectively, which shows that FastLR (NPD9) achieves a speedup of up to 10.97× on the LRS2 test set, thanks to parallel generation, which is insensitive to sequence length.

4.6 Ablation Studies and Analysis

In this section, we first conduct ablation experiments on LRS2 to verify the significance of all proposed methods in FastLR. The experiments are listed in Table 6. Then we visualize the encoder-decoder attention map of the well-trained AR model (TM-seq2seq) and the acoustic boundaries detected by the I&F module in FastLR to check whether the I&F module works well.
As shown in Table 6, the naive lipreading model with Integrate-and-Fire is not able to converge well, due to the difficulty of learning the weight embedding in the I&F module from the meaningless encoder hidden states. Thus, the autoregressive lipreading model works as the
Figure 4: Relationship between inference time (seconds) and predicted text length for (a) TM-seq2seq [1] and (b) FastLR (NPD9).

Table 6: The ablation studies on the LRS2 dataset. "Naive Model with I&F" is the naive lipreading model with only Integrate-and-Fire. "+Aux" means adding the auxiliary autoregressive task. We add our methods and evaluate their effectiveness progressively.
Model                       WER     CER
Naive Model with I&F        >1      75.2%
+Aux                        93.1%   64.9%
+Aux+BPE                    75.7%   52.7%
+Aux+BPE+CTC                73.2%   51.4%
+Aux+BPE+CTC+NPD (FastLR)   67.2%

auxiliary model to enhance the feature representation ability of the encoder, and it guides the non-autoregressive model with Integrate-and-Fire to learn the right alignments (weight embeddings). From this, the model with I&F begins to generate meaningful target sequences, with CER < 65% (Row 3).
BPE makes each token contain more language information and reduces the dependency among tokens compared with character-level encoding. In addition, from observation, the speech speed in the BBC videos is a bit fast, which causes one target token (a character, if without BPE) to correspond to only a few video frames. BPE compresses the target sequence, and this helps the Integrate-and-Fire module find the acoustic-level alignments more easily. From Table 6 (Row 4), it can be seen that BPE reduces the word error rate and character error rate to 75.7% and 52.7% respectively, which means BPE helps the model gain the ability to generate understandable sentences.
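To illustrate how BPE builds sub-word tokens, here is a toy sketch of the merge-learning loop on a made-up two-word corpus; the actual system learns roughly 1k sub-words on the LRS2 transcripts with a standard BPE implementation.

```python
from collections import Counter

def learn_bpe(words, num_merges):
    """Learn BPE merge rules on a word-frequency dict; each word is
    represented as a tuple of symbols ending with an end-of-word marker."""
    vocab = {tuple(w) + ('</w>',): c for w, c in words.items()}
    merges = []
    for _ in range(num_merges):
        # count adjacent symbol pairs, weighted by word frequency
        pairs = Counter()
        for sym, c in vocab.items():
            for a, b in zip(sym, sym[1:]):
                pairs[(a, b)] += c
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # apply the best merge everywhere in the vocabulary
        new_vocab = {}
        for sym, c in vocab.items():
            out, i = [], 0
            while i < len(sym):
                if i + 1 < len(sym) and (sym[i], sym[i + 1]) == best:
                    out.append(sym[i] + sym[i + 1]); i += 2
                else:
                    out.append(sym[i]); i += 1
            new_vocab[tuple(out)] = c
        vocab = new_vocab
    return merges, vocab
```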
The results show (Row 5) that adding the auxiliary connectionist temporal classification (CTC) decoder with the CTC loss further boosts the feature representation ability of the encoder, and causes a 2.5% absolute decrease in WER. At this point, the model attains considerable recognition accuracy compared with the traditional autoregressive method.
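The CTC likelihood that this auxiliary decoder maximizes sums over all blank-augmented paths, and is computed with the standard forward (alpha) recursion; a minimal NumPy sketch for a non-empty target (the real system uses a framework CTC loss, not this code):

```python
import numpy as np

def ctc_likelihood(log_probs, target, blank=0):
    """P_ctc(y|x): sum over all CTC paths via the forward recursion.
    log_probs: (T, V) per-frame log-probabilities; target: non-empty
    list of label ids."""
    T = log_probs.shape[0]
    # interleave blanks: y -> (blank, y1, blank, y2, ..., blank)
    ext = [blank]
    for t in target:
        ext += [t, blank]
    L = len(ext)
    alpha = np.full((T, L), -np.inf)
    alpha[0, 0] = log_probs[0, ext[0]]
    alpha[0, 1] = log_probs[0, ext[1]]
    for t in range(1, T):
        for s in range(L):
            cands = [alpha[t - 1, s]]                     # stay
            if s > 0:
                cands.append(alpha[t - 1, s - 1])         # advance
            if s > 1 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append(alpha[t - 1, s - 2])         # skip a blank
            alpha[t, s] = np.logaddexp.reduce(cands) + log_probs[t, ext[s]]
    # paths may end on the last label or the trailing blank
    return np.exp(np.logaddexp(alpha[T - 1, L - 1], alpha[T - 1, L - 2]))
```

For two frames with P(blank)=0.6, P(label)=0.4 then P(blank)=P(label)=0.5, the collapsing paths (blank,1), (1,blank) and (1,1) sum to 0.3 + 0.2 + 0.2 = 0.7.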
Table 6 (Row 6) shows that using NPD for I&F boosts performance effectively. We also study the effect of increasing the number of candidates for FastLR on the LRS2 dataset, as shown in Figure 5. It can be seen that the accuracy peaks when the number of candidates is set to 9. Finally, FastLR achieves considerable accuracy compared with the state-of-the-art autoregressive lipreading model.
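A hedged sketch of how NPD for I&F can work: perturb the firing weights with several scaling factors to obtain candidate outputs of different lengths, decode all candidates in parallel, then keep the one the auxiliary AR model scores highest. `nar_decode` and `ar_score` are stand-in callables, not the paper's actual interfaces, and the scale grid is illustrative.

```python
def npd_decode(weights, nar_decode, ar_score, num_candidates=9):
    """Rescore NAR candidates produced from perturbed I&F weights."""
    half = num_candidates // 2
    # scale factors centered on 1.0, e.g. 0.96, 0.97, ..., 1.04
    scales = [1.0 + 0.01 * k for k in range(-half, half + 1)]
    candidates = [nar_decode([w * s for w in weights]) for s in scales]
    return max(candidates, key=ar_score)

def toy_nar_decode(weights):
    # stand-in NAR decoder: one token per unit of accumulated weight
    return "x" * int(sum(weights))

def toy_ar_score(candidate):
    # stand-in AR scorer that happens to prefer 3-token outputs
    return -abs(len(candidate) - 3)

best = npd_decode([0.5] * 6, toy_nar_decode, toy_ar_score)
print(best)  # xxx
```

Since the AR model only rescores complete candidates rather than generating tokens one by one, the candidates can be evaluated in parallel and the speedup of NAR decoding is largely preserved.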
Figure 5: The effect of the number of candidates on WER and CER for the FastLR model.
We visualize the encoder-decoder attention map in Figure 6, which is obtained from the well-trained AR TM-seq2seq. The attention map illustrates the alignment between source video frames and the corresponding target sub-word sequence. The figure shows that the video frames between two horizontal red lines are roughly those the corresponding target token attends to. This means that the "accumulate and detect" part of the I&F module locates the acoustic boundaries well and makes a correct prediction of the sequence length.
Figure 6: An example visualization of the encoder-decoder attention map and the acoustic boundaries. The horizontal red lines represent the acoustic boundaries detected by the I&F module in FastLR, which split the video frames into discrete segments.
In this work, we developed FastLR, a non-autoregressive lipreading system with an Integrate-and-Fire module, which recognizes a source silent video and generates all the target text tokens in parallel. FastLR consists of a visual front-end, a visual feature encoder, and a text decoder for simultaneous generation. To bridge the accuracy gap between FastLR and the state-of-the-art autoregressive lipreading model, we introduce the I&F module to encode the continuous visual features into discrete token embeddings by locating the acoustic boundaries. In addition, we propose several methods, including an auxiliary AR task and CTC loss, to boost the feature representation ability of the encoder. Finally, we design NPD for I&F and bring Byte-Pair Encoding into lipreading; both methods alleviate the problem caused by the removal of the AR language model. Experiments on the GRID and LRS2 lipreading datasets show that FastLR outperforms the NAR-LR baseline and has only a slight WER increase compared with the state-of-the-art AR model, which demonstrates the effectiveness of our method for NAR lipreading.

In the future, we will continue to work on better approximating the true target distribution for the NAR lipreading task, and on designing more flexible policies to bridge the gap between AR and NAR models while keeping the fast speed of NAR generation.
ACKNOWLEDGMENTS
This work was supported in part by the National Key R&D Program of China (Grant No. 2018AAA0100603), the Zhejiang Natural Science Foundation (LR19F020006), the National Natural Science Foundation of China (Grant No. 61836002, No. U1611461, and No. 61751209), and the Fundamental Research Funds for the Central Universities (2020QNA5024). This work was also partially supported by the Language and Speech Innovation Lab of HUAWEI Cloud.
REFERENCES
[1] Triantafyllos Afouras, Joon Son Chung, Andrew Senior, Oriol Vinyals, and Andrew Zisserman. 2018. Deep audio-visual speech recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence (2018).
[2] Triantafyllos Afouras, Joon Son Chung, and Andrew Zisserman. 2018. Deep lip reading: a comparison of models and an online application. arXiv preprint arXiv:1806.06053 (2018).
[3] Yannis M Assael, Brendan Shillingford, Shimon Whiteson, and Nando De Freitas. 2016. LipNet: End-to-end sentence-level lipreading. arXiv preprint arXiv:1611.01599 (2016).
[4] Anthony N Burkitt. 2006. A review of the integrate-and-fire neuron model: I. Homogeneous synaptic input. Biological Cybernetics 95, 1 (2006), 1–19.
[5] Nanxin Chen, Shinji Watanabe, Jesús Villalba, and Najim Dehak. 2019. Non-Autoregressive Transformer Automatic Speech Recognition. arXiv preprint arXiv:1911.04908 (2019).
[6] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555 (2014).
[7] Joon Son Chung, Andrew Senior, Oriol Vinyals, and Andrew Zisserman. 2017. Lip reading sentences in the wild. In . IEEE, 3444–3453.
[8] Joon Son Chung and AP Zisserman. 2017. Lip reading in profile. (2017).
[9] Martin Cooke, Jon Barker, Stuart Cunningham, and Xu Shao. 2006. An audio-visual corpus for speech perception and automatic speech recognition. The Journal of the Acoustical Society of America.
[10] arXiv preprint arXiv:1905.11235 (2019).
[11] Marjan Ghazvininejad, Omer Levy, Yinhan Liu, and Luke Zettlemoyer. 2019. Mask-Predict: Parallel decoding of conditional masked language models. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 6114–6123.
[12] Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. 2006. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning. 369–376.
[13] Jiatao Gu, James Bradbury, Caiming Xiong, Victor OK Li, and Richard Socher. 2017. Non-autoregressive neural machine translation. arXiv preprint arXiv:1711.02281 (2017).
[14] Junliang Guo, Xu Tan, Di He, Tao Qin, Linli Xu, and Tie-Yan Liu. 2019. Non-autoregressive neural machine translation with enhanced decoder input. In AAAI, Vol. 33. 3723–3730.
[15] Jason Lee, Elman Mansimov, and Kyunghyun Cho. 2018. Deterministic Non-Autoregressive Neural Sequence Modeling by Iterative Refinement. In EMNLP. 1173–1182.
[16] Jinglin Liu, Yi Ren, Xu Tan, Chen Zhang, Tao Qin, Zhou Zhao, and Tie-Yan Liu. 2020. Task-Level Curriculum Learning for Non-Autoregressive Neural Machine Translation. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI-20. 3861–3867.
[17] Xuezhe Ma, Chunting Zhou, Xian Li, Graham Neubig, and Eduard Hovy. 2019. FlowSeq: Non-Autoregressive Conditional Sequence Generation with Generative Flow. In EMNLP-IJCNLP. 4273–4283.
[18] Aaron van den Oord, Yazhe Li, Igor Babuschkin, Karen Simonyan, Oriol Vinyals, Koray Kavukcuoglu, George van den Driessche, Edward Lockhart, Luis C Cobo, Florian Stimberg, et al. 2017. Parallel WaveNet: Fast high-fidelity speech synthesis. arXiv preprint arXiv:1711.10433 (2017).
[19] Stavros Petridis, Themos Stafylakis, Pingchuan Ma, Georgios Tzimiropoulos, and Maja Pantic. 2018. Audio-visual speech recognition with a hybrid CTC/attention architecture. In . IEEE, 513–520.
[20] Yi Ren, Chenxu Hu, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. 2020. FastSpeech 2: Fast and High-Quality End-to-End Text-to-Speech. arXiv preprint arXiv:2006.04558 (2020).
[21] Yi Ren, Jinglin Liu, Xu Tan, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. 2020. A Study of Non-autoregressive Model for Sequence Generation. arXiv preprint arXiv:2004.10454 (2020).
[22] Yi Ren, Yangjun Ruan, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. 2019. FastSpeech: Fast, robust and controllable text to speech. In Advances in Neural Information Processing Systems. 3165–3174.
[23] Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909 (2015).
[24] Brendan Shillingford, Yannis Assael, Matthew W Hoffman, Thomas Paine, Cían Hughes, Utsav Prabhu, Hank Liao, Hasim Sak, Kanishka Rao, Lorrayne Bennett, et al. 2018. Large-scale visual speech recognition. arXiv preprint arXiv:1807.05162 (2018).
[25] Themos Stafylakis and Georgios Tzimiropoulos. 2017. Combining residual networks with LSTMs for lipreading. arXiv preprint arXiv:1703.04105 (2017).
[26] Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems. 3104–3112.
[27] Amirsina Torfi, Seyed Mehdi Iranmanesh, Nasser Nasrabadi, and Jeremy Dawson. 2017. 3D convolutional neural networks for cross audio-visual matching recognition. IEEE Access.
[28] CoRR abs/1803.07416 (2018). http://arxiv.org/abs/1803.07416
[29] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems. 5998–6008.
[30] Michael Wand, Jan Koutník, and Jürgen Schmidhuber. 2016. Lipreading with long short-term memory. In . IEEE, 6115–6119.
[31] Yiren Wang, Fei Tian, Di He, Tao Qin, ChengXiang Zhai, and Tie-Yan Liu. 2019. Non-Autoregressive Machine Translation with Auxiliary Regularization. In AAAI.
[32] Bang Yang, Fenglin Liu, and Yuexian Zou. 2019. Non-Autoregressive Video Captioning with Iterative Refinement. arXiv preprint arXiv:1911.12018 (2019).
[33] Ya Zhao, Rui Xu, Xinchao Wang, Peng Hou, Haihong Tang, and Mingli Song. 2019. Hearing Lips: Improving Lip Reading by Distilling Speech Recognizers. arXiv preprint arXiv:1911.11502 (2019).