A Cascade Sequence-to-Sequence Model for Chinese Mandarin Lip Reading
Ya Zhao
Zhejiang Provincial Key Laboratory of Service Robots, Zhejiang University
[email protected]

Rui Xu
Zhejiang Provincial Key Laboratory of Service Robots, Zhejiang University
[email protected]

Mingli Song
Zhejiang Provincial Key Laboratory of Service Robots, Zhejiang University
[email protected]
ABSTRACT
Lip reading aims at decoding text from the movement of a speaker's mouth. In recent years, lip reading methods have made great progress for English, at both the word level and the sentence level. Unlike English, however, Chinese Mandarin is a tone-based language that relies on pitch to distinguish lexical or grammatical meaning, which significantly increases the ambiguity of the lip reading task. In this paper, we propose a Cascade Sequence-to-Sequence Model for Chinese Mandarin (CSSMCM) lip reading, which explicitly models tones when predicting sentences. Tones are modeled based on visual information and syntactic structure, and are then used, together with visual information and syntactic structure, to predict the sentence. In order to evaluate CSSMCM, a dataset called CMLR (Chinese Mandarin Lip Reading) is collected and released, consisting of over 100,000 natural sentences from the China Network Television website. When trained on the CMLR dataset, the proposed CSSMCM surpasses the performance of state-of-the-art lip reading frameworks, which confirms the effectiveness of explicitly modeling tones for Chinese Mandarin lip reading.
CCS CONCEPTS
• Computing methodologies → Machine translation; Computer vision; Neural networks.

KEYWORDS
lip reading, datasets, multimodal
ACM Reference Format:
Ya Zhao, Rui Xu, and Mingli Song. 2019. A Cascade Sequence-to-Sequence Model for Chinese Mandarin Lip Reading. In
ACM Multimedia Asia (MMAsia’19), December 15–18, 2019, Beijing, China.
ACM, New York, NY, USA, 6 pages. https://doi.org/10.1145/3338533.3366579
1 INTRODUCTION

Lip reading, also known as visual speech recognition, aims to predict the sentence being spoken, given a silent video of a talking face. In noisy environments, where speech recognition is difficult, visual
MMAsia '19, December 15–18, 2019, Beijing, China
© 2019 Association for Computing Machinery. ACM ISBN 978-1-4503-6841-4/19/12...$15.00. https://doi.org/10.1145/3338533.3366579

speech recognition offers an alternative way to understand speech. Besides, lip reading has practical potential for improved hearing aids, security, and silent dictation in public spaces. Lip reading is essentially a difficult problem, as most lip reading actuations, besides the lips and sometimes the tongue and teeth, are latent and ambiguous. Several seemingly identical lip movements can produce different words.

Thanks to the recent development of deep learning, English-based lip reading methods have made great progress, at both the word level [9, 13] and the sentence level [1, 8]. However, although Chinese Mandarin has more speakers than any other language, there is only a little work on Chinese Mandarin lip reading in the multimedia community. Yang et al. [14] present a naturally-distributed large-scale benchmark for Chinese Mandarin lip reading in the wild, named LRW-1000, which contains 1,000 classes with 718,018 samples from more than 2,000 individual speakers. Each class corresponds to the syllables of a Mandarin word composed of one or several Chinese characters. However, they perform only word classification for Chinese Mandarin lip reading, not recognition at the complete sentence level. LipCH-Net [15] is the first work aiming at sentence-level Chinese Mandarin lip reading. LipCH-Net is a two-step end-to-end architecture in which two deep neural network models perform Picture-to-Pinyin recognition (mouth motion pictures to pronunciations) and Pinyin-to-Hanzi recognition (pronunciations to texts) respectively. A joint optimization is then performed to improve overall performance.

Belonging to two different language families, English and Chinese Mandarin have many differences.
The most significant one might be this: Chinese Mandarin is a tone language, while English is not. Tone is the use of pitch in language to distinguish lexical or grammatical meaning, that is, to distinguish or to inflect words (see https://en.wikipedia.org/wiki/Tone_(linguistics)). Even if two words look the same on the face when pronounced, they can have different tones and thus different meanings. For example, "练习" (which means practice) and "联系" (which means contact) have different meanings but the same mouth movement. This increases ambiguity in lip reading, so tone is an important factor for Chinese Mandarin lip reading.

Based on the above considerations, in this paper we present CSSMCM, a sentence-level Chinese Mandarin lip reading network, which contains three sub-networks. As in [15], in the first sub-network, a pinyin sequence is predicted from the video. Unlike [15], which predicts pinyin characters from video, pinyin is taken as a whole in CSSMCM, i.e., as syllables. As is well known, Mandarin Chinese is a syllable-based language and syllables are its logical unit of pronunciation. Compared with pinyin characters, syllables are a longer linguistic unit and reduce the difficulty of symbol choices in the decoder of sequence-to-sequence attention-based models [17]. Chen et al. [6] find that there might be a relationship between the production of lexical tones and the visible movements of the neck, head, and mouth. Motivated by this observation, in the second sub-network, both the video and the pinyin sequence are used as input to predict tones. Then, in the third sub-network, the video, pinyin, and tone sequences work together to predict the Chinese character sequence.
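The ambiguity described above can be made concrete with a toy sketch. The syllable strings and the `strip_tones` helper below are illustrative assumptions, not the paper's code; they only show that the two example words share identical base syllables (and hence mouth movements) and differ only in tone.

```python
# Illustrative sketch: tone-annotated pinyin for two visually identical
# words. Tone digits (1-4, 0 for neutral) are appended to each syllable.

def strip_tones(syllables):
    """Remove trailing tone digits, keeping only the base pinyin."""
    return [s.rstrip("012345") for s in syllables]

# 练习 "practice" vs 联系 "contact": same syllables, different tones.
practice = ["lian4", "xi2"]
contact = ["lian2", "xi4"]

assert strip_tones(practice) == strip_tones(contact) == ["lian", "xi"]
assert practice != contact  # only the tones distinguish the two words
```

A purely visual model sees only the tone-stripped sequence, which is why CSSMCM models tones explicitly.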
Finally, the three sub-networks are jointly finetuned to improve overall performance. As there is no public sentence-level Chinese Mandarin lip reading dataset, we collect a new Chinese Mandarin Lip Reading dataset called CMLR, based on China Network Television broadcasts containing talking faces together with subtitles of what is said.

In summary, our major contributions are as follows.
• We argue that tone is an important factor for Chinese Mandarin lip reading, increasing the ambiguity compared with English lip reading. Based on this, a three-stage cascade network, CSSMCM, is proposed. Tones are inferred from video and syntactic structure, and are used to predict the sentence along with visual information and syntactic structure.
• We collect a Chinese Mandarin Lip Reading (CMLR) dataset, consisting of over 100,000 natural sentences from the national news program "News Broadcast". The dataset will be released as a resource for training and evaluation.
• Detailed experiments on the CMLR dataset show that explicitly modeling tones when predicting Chinese sentences yields a lower character error rate.
Table 1: Symbol definitions.

| Symbol | Definition |
|---|---|
| GRU_ve | GRU unit in the video encoder |
| GRU_pe, GRU_pd | GRU units in the pinyin encoder and pinyin decoder |
| GRU_te, GRU_td | GRU units in the tone encoder and tone decoder |
| GRU_yd | GRU unit in the character decoder |
| Attention^v_p | attention between the pinyin decoder and the video encoder; the superscript indicates the encoder and the subscript the decoder |
| x, y, p, t | video, character, pinyin, and tone sequences |
| i | timestep |
| h_ve, h_pe, h_te | video, pinyin, and tone encoder outputs |
| c^v, c^p, c^t | video, pinyin, and tone context vectors |

In this section, we present CSSMCM, a lip reading model for Chinese Mandarin. As mentioned in Section 1, pinyin and tone are both important for Chinese Mandarin lip reading. Pinyin represents how to pronounce a Chinese character and is related to mouth movement. Tone can alleviate the ambiguity of visemes (several speech sounds that look the same) to some extent and can be inferred from visible movements. Based on this, the lip reading task is defined as follows:
$P(y \mid x) = \sum_{p} \sum_{t} P(y \mid p, t, x)\, P(t \mid p, x)\, P(p \mid x),$  (1)

where the meanings of the symbols are given in Table 1. As shown in Equation (1), the whole problem is divided into three parts, corresponding to pinyin prediction, tone prediction, and character prediction. Each part is described in detail below.

The pinyin prediction sub-network transforms the video sequence into a pinyin sequence, corresponding to $P(p \mid x)$ in Equation (1). This sub-network is based on the sequence-to-sequence architecture with an attention mechanism [2]. We name the encoder the video encoder and the decoder the pinyin decoder, since the encoder processes the video sequence and the decoder predicts the pinyin sequence. The input video sequence is first fed into a VGG model [4] to extract visual features. The output of conv5 of VGG is passed through global average pooling [12] to get a 512-dim feature vector, which is then fed into the video encoder:

$(h_{ve})_i = \mathrm{GRU}_{ve}((h_{ve})_{i-1}, \mathrm{VGG}(x_i)).$  (2)

When predicting the pinyin sequence, at each timestep $i$, the video encoder outputs are attended to compute a context vector $c^v_i$:

$(h_{pd})_i = \mathrm{GRU}_{pd}((h_{pd})_{i-1}, p_{i-1}),$  (3)
$c^v_i = h_{ve} \cdot \mathrm{Attention}^v_p((h_{pd})_i, h_{ve}),$  (4)
$P(p_i \mid p_{<i}, x) = \mathrm{softmax}(\mathrm{MLP}((h_{pd})_i, c^v_i)).$  (5)

Figure 1: The tone prediction sub-network.

As shown in Equation (1), the tone prediction sub-network ($P(t \mid p, x)$) takes the video and pinyin sequences as inputs and predicts the corresponding tone sequence. This problem is also modeled as sequence-to-sequence learning; the corresponding model architecture is shown in Figure 1. In order to take both video and pinyin information into consideration when producing tones, a dual attention mechanism [8] is employed: two independent attention mechanisms are used for the video and pinyin sequences.
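One decoder step of Equations (3)–(5) can be sketched numerically. This is a minimal NumPy illustration, not the paper's implementation: the bilinear scoring form and the random weights are assumptions, since the paper does not specify the internals of $\mathrm{Attention}^v_p$.

```python
# Minimal sketch of one attention step: score every encoder output
# against the current decoder state, normalize with softmax, and take
# the weighted sum as the context vector (Equation (4)).
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attention_step(h_dec, h_enc, W):
    scores = h_enc @ (W @ h_dec)   # one score per encoder timestep
    alpha = softmax(scores)        # attention weights sum to 1
    return alpha @ h_enc, alpha    # context vector c^v_i, weights

rng = np.random.default_rng(0)
T, d = 6, 8                        # encoder timesteps, hidden size
h_enc = rng.normal(size=(T, d))    # video encoder outputs h_ve
h_dec = rng.normal(size=d)         # pinyin decoder state (h_pd)_i
W = rng.normal(size=(d, d))        # assumed bilinear scoring weights

context, alpha = attention_step(h_dec, h_enc, W)
assert context.shape == (d,) and abs(alpha.sum() - 1.0) < 1e-9
```

The context vector is then concatenated with the decoder state and mapped through an MLP and softmax to pinyin probabilities, as in Equation (5).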
Video context vectors $c^v_i$ and pinyin context vectors $c^p_i$ are fused when predicting a tone at each decoder step.
Figure 2: The character prediction sub-network.
The video encoder is the same as in Section 2.1, and the pinyin encoder is:

$(h_{pe})_i = \mathrm{GRU}_{pe}((h_{pe})_{i-1}, p_{i-1}).$  (6)

The tone decoder attends over both the video encoder outputs and the pinyin encoder outputs to compute context vectors, and then predicts tones:

$(h_{td})_i = \mathrm{GRU}_{td}((h_{td})_{i-1}, t_{i-1}),$  (7)
$c^v_i = h_{ve} \cdot \mathrm{Attention}^v_t((h_{td})_i, h_{ve}),$  (8)
$c^p_i = h_{pe} \cdot \mathrm{Attention}^p_t((h_{td})_i, h_{pe}),$  (9)
$P(t_i \mid t_{<i}, x, p) = \mathrm{softmax}(\mathrm{MLP}((h_{td})_i, c^v_i, c^p_i)).$  (10)

The character prediction sub-network corresponds to $P(y \mid p, t, x)$ in Equation (1). It considers the pinyin sequence, the tone sequence, and the video sequence when predicting Chinese characters. Similarly, we use an attention-based sequence-to-sequence architecture to model this distribution. Here the attention mechanism is extended to a triplet attention mechanism:

$(h_{cd})_i = \mathrm{GRU}_{cd}((h_{cd})_{i-1}, y_{i-1}),$  (11)
$c^v_i = h_{ve} \cdot \mathrm{Attention}^v_c((h_{cd})_i, h_{ve}),$  (12)
$c^p_i = h_{pe} \cdot \mathrm{Attention}^p_c((h_{cd})_i, h_{pe}),$  (13)
$c^t_i = h_{te} \cdot \mathrm{Attention}^t_c((h_{cd})_i, h_{te}),$  (14)
$P(c_i \mid c_{<i}, x, p, t) = \mathrm{softmax}(\mathrm{MLP}((h_{cd})_i, c^v_i, c^p_i, c^t_i)).$  (15)

For later use, the tone encoder is defined as:

$(h_{te})_i = \mathrm{GRU}_{te}((h_{te})_{i-1}, t_{i-1}).$  (16)

The architecture of the proposed approach is shown in Figure 3. For better display, the three attention mechanisms are not drawn in the figure. During the training of CSSMCM, the outputs of the pinyin decoder are fed into the pinyin encoder, and the outputs of the tone decoder into the tone encoder:

$(h_{pe})_i = \mathrm{GRU}_{pe}((h_{pe})_{i-1}, \mathrm{MLP}((h_{pd})_i, c^v_i)),$  (17)
$(h_{te})_i = \mathrm{GRU}_{te}((h_{te})_{i-1}, \mathrm{MLP}((h_{td})_i, c^v_i, c^p_i)).$  (18)
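The triplet fusion in Equation (15) can be sketched as follows. This is an illustrative NumPy sketch, not the paper's code: dot-product scoring and the random stand-in for the MLP are assumptions; the point is that three independently attended context vectors are combined with the decoder state before the softmax.

```python
# Sketch of triplet attention: the character decoder state attends over
# the video, pinyin, and tone encoder outputs separately, and the three
# context vectors are fused into one feature for character prediction.
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def context(h_dec, h_enc):
    # dot-product attention (assumed scoring form) over encoder outputs
    return softmax(h_enc @ h_dec) @ h_enc

rng = np.random.default_rng(1)
d, vocab = 8, 20
h_cd = rng.normal(size=d)                        # decoder state (h_cd)_i
h_ve, h_pe, h_te = (rng.normal(size=(5, d)) for _ in range(3))

fused = np.concatenate([h_cd, context(h_cd, h_ve),
                        context(h_cd, h_pe), context(h_cd, h_te)])
logits = rng.normal(size=(vocab, 4 * d)) @ fused  # stand-in for the MLP
probs = softmax(logits)                           # Equation (15)
assert probs.shape == (vocab,) and abs(probs.sum() - 1.0) < 1e-9
```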
Figure 3: The overall architecture of the CSSMCM network. The attention modules are omitted for the sake of simplicity.
We replace Equation (6) with Equation (17), and Equation (16) with Equation (18). The three sub-networks are then jointly trained, with the overall loss function defined as:

$L = L_p + L_t + L_c,$  (19)

where $L_p$, $L_t$, and $L_c$ stand for the losses of the pinyin, tone, and character prediction sub-networks respectively, as defined below:

$L_p = -\sum_i \log P(p_i \mid p_{<i}, x),$
$L_t = -\sum_i \log P(t_i \mid t_{<i}, x, p),$
$L_c = -\sum_i \log P(c_i \mid c_{<i}, x, p, t).$  (20)

To accelerate training and reduce overfitting, curriculum learning [8] is employed: sentences are grouped into subsets of length less than 11, 12–17, 18–23, and more than 24 Chinese characters. Scheduled sampling [3] is used to eliminate the discrepancy between training and inference; during training, the rate of sampling from the previous output is increased from 0.7 to 1. A greedy decoder is used for fast decoding.
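The curriculum buckets and the sampling schedule above can be sketched in a few lines. The bucket boundaries follow the text (the placement of the boundary lengths 11 and 24 is an assumption, since the text leaves them ambiguous), and the linear ramp is an assumed concrete choice; the paper only states that the rate moves from 0.7 to 1.

```python
# Sketch of the curriculum-learning buckets and a linear
# scheduled-sampling ramp (both hedged assumptions, see lead-in).

def length_bucket(n_chars):
    """Assign a sentence to one of four curriculum subsets by length."""
    if n_chars <= 11:
        return 0
    if n_chars <= 17:
        return 1
    if n_chars <= 23:
        return 2
    return 3

def sampling_rate(epoch, total_epochs):
    """Probability of feeding the model's own previous output back in,
    ramping linearly from 0.7 at the first epoch to 1.0 at the last."""
    return min(1.0, 0.7 + 0.3 * epoch / max(1, total_epochs - 1))

assert [length_bucket(n) for n in (8, 12, 20, 30)] == [0, 1, 2, 3]
assert sampling_rate(0, 10) == 0.7 and sampling_rate(9, 10) == 1.0
```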
In this section, a three-stage pipeline for generating the Chinese Mandarin Lip Reading (CMLR) dataset is described, covering video pre-processing, text acquisition, and data generation. This pipeline is similar to the method in [8], but considering the characteristics of our Chinese Mandarin dataset, we have optimized some steps to generate a better-quality lip reading dataset. The three stages are detailed below.
Video Pre-processing. First, episodes of the national news program "News Broadcast" recorded between June 2009 and June 2018 are obtained from the China Network Television website. Then HOG-based face detection is performed [11], followed by an open-source platform for face recognition and alignment. The video clip set of eleven different hosts who broadcast the news is captured. During the face detection step, frame skipping improves efficiency while preserving output quality.
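The frame-skipping idea can be sketched as follows. `detect_face` here is a hypothetical stand-in for the HOG-based detector; the skip interval of 5 is an assumed value for illustration, not one stated in the paper.

```python
# Sketch of frame-skipping during face detection: run the (expensive)
# detector only every `skip` frames and reuse the last bounding box on
# the frames in between.

def track_faces(frames, detect_face, skip=5):
    boxes = []
    last = None
    for i, frame in enumerate(frames):
        if i % skip == 0 or last is None:
            last = detect_face(frame)  # expensive HOG detection
        boxes.append(last)             # reuse the box on skipped frames
    return boxes

# Toy usage with a stub detector that records how often it is called.
calls = []
def fake_detector(frame):
    calls.append(frame)
    return (0, 0, 64, 128)

boxes = track_faces(list(range(12)), fake_detector, skip=5)
assert len(boxes) == 12 and len(calls) == 3  # detections at frames 0, 5, 10
```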
Text Acquisition. Since there are no subtitles or text annotations in the original "News Broadcast" program, FFmpeg tools (https://ffmpeg.org/) are used to extract the audio track of the video clip set.
Table 2: The CMLR dataset: division into training, validation, and test data, and the number of sentences, phrases, and characters in each partition.
Through ASR, the corresponding text annotation of the video clip set is obtained. However, there is some noise in these text annotations: English letters, Arabic numerals, and rare punctuation are deleted to get a purer Chinese Mandarin lip reading dataset.
Data Generation. The text annotation acquired in the previous step also contains timestamp information. Therefore, the video clip set is segmented according to this timestamp information, and the corresponding word, phrase, or sentence video segments of the text annotation are obtained. Since the text timestamp information may contain small errors, some adjustments are made to the start and end frames when extracting a video segment. It is worth noting that, through experiments, we found that using OpenCV (http://docs.opencv.org/2.4.13/modules/refman.html) [...]

Implementation Details. The input images are 64 × 128 in dimension. Lip frames are transformed into grayscale, and the VGG network takes every 5 lip frames as one input, moving 2 frames at each timestep. For all sub-networks, a two-layer bidirectional GRU [7] with a cell size of 256 is used for the encoder and a two-layer unidirectional GRU with a cell size of 512 for the decoder. For the character and pinyin vocabularies, we keep characters and pinyin that appear more than 20 times; [sos], [eos], and [pad] are also included in all three vocabularies. The final vocabulary size is 371 for the pinyin prediction sub-network, 8 for the tone prediction sub-network (four tones plus the neutral tone), and 1,779 for the character prediction sub-network.

The initial learning rate was 0.0001 and was decreased by 50% every time the training error did not improve for 4 epochs. CSSMCM is implemented using the PyTorch library and trained on a Quadro P5000 with 16GB memory. The total end-to-end model was trained for around 12 days.

We list here the compared methods and the evaluation protocol.
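The 5-frame/stride-2 windowing above determines how many VGG inputs a clip yields. The sketch below is illustrative; the closed-form window count is implied by, not stated in, the text.

```python
# Sketch of the VGG input windowing: every 5 grayscale lip frames form
# one input, advancing 2 frames per timestep.

def frame_windows(num_frames, size=5, stride=2):
    """Start indices of each window of `size` frames with `stride`."""
    return list(range(0, num_frames - size + 1, stride))

# e.g. 11 frames -> windows starting at frames 0, 2, 4, 6
assert frame_windows(11) == [0, 2, 4, 6]
assert len(frame_windows(75)) == (75 - 5) // 2 + 1  # 36 encoder timesteps
```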
Table 3: Detailed comparison between CSSMCM and other methods on the CMLR dataset. V, P, T, and C stand for video, pinyin, tone, and character. V2P denotes the transformation from video sequence to pinyin sequence; VP2T means the inputs are the video and pinyin sequences and the output is the tone sequence. OVERALL means combining the sub-networks and performing joint optimization.
| Models | sub-network | CER | PER | TER |
|---|---|---|---|---|
| WAS | - | 38.93% | - | - |
| LipCH-Net-seq | V2P | - | 27.96% | - |
| | P2C | 9.88% | - | - |
| | OVERALL | 34.07% | 39.52% | - |
| CSSMCM-w/o video | V2P | - | 27.96% | - |
| | P2T | - | - | 6.99% |
| | PT2C | 4.70% | - | - |
| | OVERALL | 42.23% | 46.67% | 13.14% |
| CSSMCM | V2P | - | 27.96% | - |
| | VP2T | - | - | 6.14% |
| | VPT2C | 3.90% | - | - |
| | OVERALL | 32.48% | 36.22% | 10.95% |
WAS: The architecture of [8] without the audio input. The decoder outputs a Chinese character at each timestep. Other settings remain unchanged from the original implementation.
LipCH-Net-seq: For a fair comparison, we use a sequence-to-sequence with attention framework to replace the Connectionist Temporal Classification (CTC) loss [10] used in LipCH-Net [15] when converting pictures to pinyin.
CSSMCM-w/o video: To evaluate the necessity of video information when predicting tones, the video stream is removed when predicting tones and Chinese characters. In other words, video is only used when predicting the pinyin sequence; the tone is predicted from the pinyin sequence alone, and tone and pinyin information then work together to predict Chinese characters.

We also tried to implement the LipNet architecture [1] to predict a Chinese character at each timestep. However, the model did not converge. The possible reasons lie in the way CTC loss works and in the differences between English and Chinese Mandarin. Compared to English, which only contains 26 letters, Chinese Mandarin contains thousands of characters. When CTC computes the loss, it first inserts a blank between every pair of characters in a sentence, which makes the number of blank labels far larger than that of any other Chinese character. Thus, when LipNet starts training, it predicts only the blank label; after a certain number of epochs, the "的" character occasionally appears, until the learning rate decays to close to zero.

For all experiments, Character Error Rate (CER) and Pinyin Error Rate (PER) are used as evaluation metrics. CER is defined as ErrorRate = (S + D + I)/N, where S is the number of substitutions, D the number of deletions, and I the number of insertions needed to get from the reference to the hypothesis, and N is the number of tokens in the reference. PER is calculated in the same way. Tone Error Rate (TER), used when analyzing CSSMCM, is calculated analogously.
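The CER definition above can be sketched with a standard Levenshtein alignment. This is a minimal reference implementation of the stated formula, not the paper's evaluation code; PER and TER apply the same routine to pinyin or tone token sequences.

```python
# Minimal sketch of ErrorRate = (S + D + I) / N via dynamic-programming
# edit distance between reference and hypothesis sequences.

def error_rate(reference, hypothesis):
    """Minimum substitutions + deletions + insertions, divided by N."""
    n, m = len(reference), len(hypothesis)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i  # delete all remaining reference tokens
    for j in range(m + 1):
        dist[0][j] = j  # insert all remaining hypothesis tokens
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = dist[i - 1][j - 1] + (reference[i - 1] != hypothesis[j - 1])
            dist[i][j] = min(sub, dist[i - 1][j] + 1, dist[i][j - 1] + 1)
    return dist[n][m] / n

# One substituted character out of eight -> CER = 0.125
assert error_rate("既让老百姓得实惠", "既让老百姓的实惠") == 0.125
```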
Table 4: Examples of sentences that CSSMCM correctly predicts while other methods do not. The pinyin and tone sequences corresponding to each Chinese character sentence are also displayed. GT stands for ground truth.
| Method | Chinese Character Sentence | Pinyin Sequence | Tone Sequence |
|---|---|---|---|
| GT | 既让老百姓得实惠 | ji rang lao bai xing de shi hui | 4 4 3 3 4 2 2 4 |
| WAS | 介项老百姓姓事会 | jie xiang lao bai xing xing shi hui | 4 4 3 3 4 4 4 4 |
| LipCH-Net-seq | 既让老百姓的吃贵 | ji rang lao bai xing de chi gui | 4 4 3 3 4 0 1 4 |
| CSSMCM | 既让老百姓得实惠 | ji rang lao bai xing de shi hui | 4 4 3 3 4 2 2 4 |
| GT | 有效应对当前半岛局势 | you xiao ying dui dang qian ban dao ju shi | 3 4 4 4 1 2 4 3 2 4 |
| WAS | 有效应对当天半岛趋势 | you xiao ying dui dang tian ban dao qu shi | 3 4 4 4 1 1 4 3 1 4 |
| LipCH-Net-seq | 有效应对党年半岛局势 | you xiao ying dui dang nian ban dao ju shi | 3 4 4 4 3 2 4 3 2 4 |
| CSSMCM | 有效应对当前半岛局势 | you xiao ying dui dang qian ban dao ju shi | 3 4 4 4 1 2 4 3 2 4 |

Table 5: Failure cases of CSSMCM.

| Method | Chinese Character Sentence | Pinyin Sequence |
|---|---|---|
| GT | 向全球价值链中高端迈进 | xiang quan qiu jia zhi lian zhong gao duan mai jin |
| CSSMCM | 向全球下试联中高端迈进 | xiang quan qiu xia shi lian zhong gao duan mai jin |
| GT | 随着我国医学科技的进步 | sui zhe wo guo yi xue ke ji de jin bu |
| CSSMCM | 随着我国一水科技的信步 | sui zhe wo guo yi shui ke ji de jin bu |
Table 3 shows a detailed comparison between the sub-networks of the different methods. Comparing P2T and VP2T, VP2T considers video information when predicting the tone sequence and achieves a lower error rate. This verifies the conjecture of [6] that the generation of tones is related to the motion of the head. In terms of overall performance, CSSMCM exceeds all the other architectures on the CMLR dataset and achieves a 32.48% character error rate. It is worth noting that CSSMCM-w/o video achieves the worst result (42.23% CER), even though its sub-networks perform well when trained separately. This may be due to the lack of supporting visual information and the accumulation of errors. CSSMCM, which uses tone information, performs better than LipCH-Net-seq, which does not. These comparisons show that tone is important for lip reading, and that visual information should be considered when predicting tones.

Table 4 shows some sentences generated by the different methods; the CSSMCM-w/o video architecture is not included due to its relatively low performance. These are sentences that other methods fail to predict but CSSMCM predicts correctly. The phrase "实惠" (which means affordable) in the first example has tones 2, 4 and its corresponding pinyin is shi, hui. WAS predicts it as "事会" (which means opportunity): although the pinyin prediction is correct, the tone is wrong. LipCH-Net-seq predicts "实惠" as "吃贵" (not a word), which shares the final "ui" and has the same corresponding mouth shapes. The same happens in the second example: "前", "天", and "年" have the same finals and mouth shapes, but different tones. This shows that when predicting characters with the same lip shape but different tones, other methods often fail, whereas CSSMCM can leverage tone information to predict correctly.

Apart from the above results, Table 5 also lists some failure cases of CSSMCM.
The characters that CSSMCM predicts wrongly are usually homophones of, or characters sharing a final with, the ground truth. In the first example, "价" and "下" have the same final, ia, while "一" and "医" in the second example are homophones. Unlike English, where one wrongly predicted character in a word has little effect on the understanding of the transcription, a wrongly predicted character in a Chinese word can greatly affect the understandability of the transcription. In the second example, CSSMCM mispredicts "医学" (which means medical) as "一水" (which means all). Although their first characters are pronounced the same, the meaning of the sentence changes from Now with the progress of medical science and technology in our country to It is now with the footsteps of China's Yishui Technology.

Figures 4 (a) and 4 (b) visualize the alignment of video frames and Chinese characters predicted by CSSMCM and WAS respectively. The ground truth sequence is "同时他还向媒体表示". Comparing the two, the diagonal trend of the video attention map produced by CSSMCM is more obvious. The video attention is also more focused where WAS predicts wrongly, i.e., the area corresponding to "还向". Although WAS mistakenly predicts "媒体" as "么体", the two have the same mouth shape, so the attention concentrates on the correct frames. It is interesting to note in Figure 5 that when predicting the i-th character, attention is concentrated on the corresponding timesteps, making prediction more accurate.

CONCLUSION

In this paper, we propose CSSMCM, a Cascade Sequence-to-Sequence Model for Chinese Mandarin lip reading. CSSMCM is designed to predict the pinyin sequence, tone sequence, and Chinese character sequence one by one. When predicting the tone sequence, a dual attention mechanism is used to consider the video sequence and