CMS-LSTM: Context-Embedding and Multi-Scale Spatiotemporal-Expression LSTM for Video Prediction
Zenghao Chai, Chun Yuan∗, Zhihui Lin, and Yunpeng Bai
∗Contact Author

Shenzhen International Graduate School, Tsinghua University, Shenzhen, China
Department of Computer Science and Technologies, Tsinghua University, Beijing, China
Peng Cheng Laboratory, Shenzhen, China
[email protected], [email protected], {lin-zh14,byp20}@mails.tsinghua.edu.cn

Abstract
Extracting variation and spatiotemporal features from limited frames remains an unsolved and challenging problem in video prediction. Inherent uncertainty among consecutive frames exacerbates the difficulty of long-term prediction. To tackle the problem, we focus on capturing context correlations and multi-scale spatiotemporal flows, then propose CMS-LSTM by integrating two effective and lightweight blocks, namely the Context-Embedding (CE) and Spatiotemporal-Expression (SE) blocks, into the ConvLSTM backbone. The CE block is designed for abundant context interactions, while the SE block focuses on multi-scale spatiotemporal expression in hidden states. The newly introduced blocks also facilitate other spatiotemporal models (e.g., PredRNN, SA-ConvLSTM) in producing representative implicit features for video prediction. Qualitative and quantitative experiments demonstrate the effectiveness and flexibility of our proposed method. We use fewer parameters to reach markedly state-of-the-art results on the Moving MNIST and TaxiBJ datasets on a number of metrics. All source code is available at https://github.com/czh-98/CMS-LSTM.
Introduction

Spatiotemporal predictive learning has become a challenging but essential field in computer vision. Video prediction is one of the hotspots in spatiotemporal learning, with broad research prospects. It has benefited, or could benefit, plenty of applications, e.g., meteorological prediction [Shi et al., 2015], traffic flow prediction [Xu et al., 2018; Zhang et al., 2017], and physical object movement [Lerer et al., 2016]. The core task and challenge of video prediction is predicting future sequences based on limited observed frames.

Nevertheless, video sequences contain inherently complex semantic features, whereas the certainty of frames is exceptionally fuzzy. Therefore, it is crucial but hard to extract abundant implicit context features to overcome the uncertainty. It is usually necessary to take both target overlap and scale changes into consideration, making video prediction more challenging.

Recent years have seen significant progress in video prediction. Numerous researchers have carried out in-depth research in spatiotemporal predictive learning and proposed a series of RNN-based [Werbos, 1990; Hochreiter and Schmidhuber, 1997] models, from the original ConvLSTM [Shi et al., 2015] used for precipitation nowcasting to methods that have since improved on ConvLSTM, such as PredRNN [Wang et al., 2017], PredRNN++ [Wang et al., 2018], MIM [Wang et al., 2019b], E3D-LSTM [Wang et al., 2019a], and SA-ConvLSTM [Lin et al., 2020]. These methods have achieved remarkable results in video prediction.

However, most previous work merely focuses on global spatiotemporal flows of the given frames in hidden states, resulting in extra parameters and ignorance of multi-scale variations between sequences. On the other hand, the input and context always perform independently in previous work; the relationship between the two is unidimensional. Namely, previous methods did not pay enough attention to the interaction of context. As models' depth and complexity increase, correlations between the current input and upper context decline as information flows among layers.

In this paper, to overcome the deficiency of the unidimensional relationship of context in previous work and to pay more attention to multi-scale spatiotemporal flows, we propose Context-Embedding and Multi-Scale Spatiotemporal-Expression LSTM (CMS-LSTM), an extension of ConvLSTM. Specifically, 1) the Context-Embedding (CE) block is designed to enhance the interactions and correlations between input and context, and 2) the Multi-Scale Spatiotemporal-Expression (SE) block is designed based on the attention mechanism to capture abundant spatiotemporal flows at different scales. The main contributions are as follows:

• Two effective blocks are designed, namely the Context-Embedding (CE) block and the Spatiotemporal-Expression (SE) block. The CE block maintains consistency and extracts further correlations between the current input and upper context. The SE block facilitates the expression of multi-scale dominant spatiotemporal flows and weakens the negligible ones simultaneously.

• To the best of our knowledge, the proposed CMS-LSTM innovatively integrates context interaction enhancement and a multi-scale spatiotemporal expression mechanism. It achieves state-of-the-art results on the Moving MNIST and TaxiBJ datasets on a number of metrics compared with previous models.

• Numerous detailed qualitative and quantitative experiments have demonstrated the importance of context interactions and multi-scale spatiotemporal flows in video prediction. The proposed CE block and SE block are portable and can be transplanted into other models.
Related Work

RNN [Werbos, 1990] and its improved structure LSTM [Hochreiter and Schmidhuber, 1997] have been extensively used in spatiotemporal predictive learning in recent years. ConvLSTM-based [Shi et al., 2015] models are a crucial branch of video prediction. PredRNN [Wang et al., 2017] and PredRNN++ [Wang et al., 2018] improved predictive performance by introducing an additional global memory cell and its reorganization. MIM [Wang et al., 2019b] further updated the memory cells with extra computation of non-stationary and stationary information for spatiotemporal expression. Moreover, E3D-LSTM [Wang et al., 2019a] designed new 3D-CNN flows accompanied by a self-attention module, as SA-ConvLSTM [Lin et al., 2020] did for video prediction.

On the one hand, the context in previous spatiotemporal predictive work is independent, ignoring the decrease of context correlation as spatiotemporal information transmits among layers. E.g., SA-ConvLSTM [Lin et al., 2020] designed an attention module for hidden states, regardless of the interaction between the previous input and upper context. However, extracting context correlations among state gates and enhancing the relationship between these gating units have always been a vital direction for improving LSTM. By extending the original LSTM, Mogrifier LSTM [Melis et al., 2020] interacted the current input with the upper context to maintain the correlation and extract implicit features by iterative calculation, demonstrating its effectiveness in multiple NLP tasks.

On the other hand, the previous models above merely focus on global spatiotemporal features, regardless of expression at multiple scales with different attention. However, spatiotemporal flows are essential for inference in video prediction. Convolution layers [Krizhevsky et al., 2017] can effectively focus on local features but lack the capacity to express the importance of spatiotemporal flows. Initially proposed and applied in NLP [Gehring et al., 2017], the self-attention mechanism [Vaswani et al., 2017] has been successfully extended to CV-related tasks [Chu et al., 2017; Xu and Saenko, 2016] and achieved impressive results. [Lin et al., 2020] first applied the self-attention mechanism to video prediction, proposed as SA-ConvLSTM, and achieved state-of-the-art results. Nevertheless, self-attention fails to capture spatiotemporal expression at different scales, which is indispensable in video prediction. To overcome this inherent weakness of self-attention, the multi-scale attention mechanism, widely used in image detection [Zhao and Wu, 2019], image restoration [Mei et al., 2020], and other fields, has shown great advantages in fine-grained feature extraction.
Method

Context-Embedding (CE) Block

Figure 1: The pipeline of the proposed CE block, where H_{t-1} and X_t are the previous state and current input, respectively. H and X are convolution layers that extract features of H_{t-1} and X̂_t, respectively. Ĥ_{t-1} and X̂_t are the outputs of the CE block, representing the previous state and current input after context embedding.

In previous RNN- and LSTM-based models, the input X_t and the previous state H_{t-1} enter the LSTM layers completely independently. In other words, the two interact in the LSTM in a unidimensional spatiotemporal state. Therefore, correlations between the current input and upper context tend to disappear as models become increasingly complex.

To explore the relationship between input and context and extract further correlational information between the two, inspired by previous work [Melis et al., 2020], we construct the CE block to correlate the two states by iterative interaction; the pipeline of the CE block is illustrated in Figure 1.

Formally, context correlations are extracted in the proposed CE block by the interaction mode of Formula 1:

$$
\begin{aligned}
\hat{X}_t &= 2 \times \sigma(W_H \star H_{t-1} + b_H) \circ X_t \\
\hat{H}_{t-1} &= 2 \times \sigma(W_X \star \hat{X}_t + b_X) \circ H_{t-1}
\end{aligned}
\tag{1}
$$

In the CE block, X_t and H_{t-1} are correlated by two convolution layers and the Hadamard product. To describe their association with richer interactions, we use stacked CE blocks to extract further correlational information. In addition, to minimize the extra parameters, the multi-layer stacked CE blocks share the same convolution layers.
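As a concrete illustration, below is a minimal PyTorch sketch of the CE block under our reading of Formula 1. The kernel size and the number of interaction rounds are illustrative assumptions; the paper specifies only that the stacked CE blocks (five layers in the final model) share the same convolution layers.

```python
import torch
import torch.nn as nn

class CEBlock(nn.Module):
    """Sketch of the Context-Embedding (CE) block (Formula 1).

    `kernel_size` and `rounds` are illustrative; the stacked CE blocks
    reuse the same two convolutions across every interaction round.
    """

    def __init__(self, channels: int, kernel_size: int = 3, rounds: int = 5):
        super().__init__()
        padding = kernel_size // 2
        # Shared convolutions, reused across all interaction rounds.
        self.conv_h = nn.Conv2d(channels, channels, kernel_size, padding=padding)
        self.conv_x = nn.Conv2d(channels, channels, kernel_size, padding=padding)
        self.rounds = rounds

    def forward(self, x_t: torch.Tensor, h_prev: torch.Tensor):
        for _ in range(self.rounds):
            # X̂_t = 2 · σ(W_H ⋆ H_{t-1} + b_H) ∘ X_t
            x_t = 2 * torch.sigmoid(self.conv_h(h_prev)) * x_t
            # Ĥ_{t-1} = 2 · σ(W_X ⋆ X̂_t + b_X) ∘ H_{t-1}
            h_prev = 2 * torch.sigmoid(self.conv_x(x_t)) * h_prev
        return x_t, h_prev
```

The factor of 2 keeps the expected scale of the gated states roughly unchanged, since the sigmoid output averages around 0.5.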
Multi-Scale Spatiotemporal-Expression (SE) Block

In this section, we emphasize the insufficiency of previous work in multi-scale spatiotemporal flow extraction and construct the SE block to maximally extract multi-scale implicit spatiotemporal flows, overcoming this weakness.

Previous RNN-based approaches mostly concentrate on modeling global spatiotemporal features and flows, regardless of multi-scale neighbor features among sequences. However, pixel-level and object-level changes between adjacent frames tend to occur in specific regions. Namely, these regions contain more implicit spatiotemporal flows than single-scale frames, showing the great necessity of modeling multi-scale spatiotemporal expression.

Figure 2: The architecture of the basic attention module, where H_t and C_t are the output state and memory state of the LSTM at a specific scale. K, Q, V are convolution layers that obtain Key, Query, and Value, respectively. Ĥ_t and Ĉ_t are the outputs of the basic attention module.

The self-attention mechanism [Vaswani et al., 2017] can effectively focus on the important parts of a given feature map. We construct the SE block based on the self-attention mechanism, with the pipeline illustrated in Figure 2. To reduce parameter consumption and improve efficiency as much as possible, we use a weight-shared attention module that shares the weights of K, Q, V for H_t and C_t and calculates them in parallel.

In Figure 2, H_t and C_t constitute the input of the basic attention module. Key, Query, and Value are calculated through the three convolution layers K, Q, V separately. Then, the attention map is obtained by applying Softmax to the multiplication of Query^T and Key, and the output is the multiplication of the attention map and Value:

$$
\hat{H}_t, \hat{C}_t = \mathrm{Softmax}\left(Query^{T} \times Key\right) \times Value
\tag{2}
$$

In the LSTM, H_t and C_t contain the spatiotemporal flows of the given frames, while multi-scale neighbors in adjacent frames contain more implicit features than single-scale ones. In other words, multi-scale regions allow extracting pixel-level and object-level spatiotemporal flows among contexts and contain more potential tendencies.
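The weight-shared attention module can be sketched as follows. This is an illustrative PyTorch reading of Formula 2, not the authors' exact implementation; in particular, the 1 × 1 kernel size of the K, Q, V convolutions is an assumption, since the exact size is elided in the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BasicAttention(nn.Module):
    """Sketch of the weight-shared basic attention module (Formula 2).

    The same K, Q, V convolutions are applied to both H_t and C_t, which
    are batched together so the two states are processed in parallel.
    """

    def __init__(self, channels: int):
        super().__init__()
        self.to_q = nn.Conv2d(channels, channels, kernel_size=1)
        self.to_k = nn.Conv2d(channels, channels, kernel_size=1)
        self.to_v = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, h_t: torch.Tensor, c_t: torch.Tensor):
        # Stack H_t and C_t along the batch axis to share the weights.
        z = torch.cat([h_t, c_t], dim=0)            # (2B, C, H, W)
        b, c, hh, ww = z.shape
        q = self.to_q(z).flatten(2)                 # (2B, C, H*W)
        k = self.to_k(z).flatten(2)
        v = self.to_v(z).flatten(2)
        # Attention map = Softmax(Query^T x Key), shape (2B, H*W, H*W).
        attn = F.softmax(torch.bmm(q.transpose(1, 2), k), dim=-1)
        # Output = attention-weighted aggregation of Value.
        out = torch.bmm(v, attn.transpose(1, 2)).view(b, c, hh, ww)
        h_out, c_out = out.chunk(2, dim=0)
        return h_out, c_out
```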
Inspired by previous work [Zhao and Wu, 2019; Mei et al., 2020; Chen and Shi, 2020], we construct the SE block for implicit feature expression in multi-scale neighbors to extract spatiotemporal flows. The architecture of the SE block is illustrated in Figure 3.

Figure 3: The pipeline of the proposed SE block, where H_t and C_t represent the outputs of the original ConvLSTM, convolution layers are used to extract features, and Ĥ_t and Ĉ_t represent the output state and memory state after multi-scale spatiotemporal flow extraction.

In the SE block, the extraction of multi-scale spatiotemporal flows can be divided into two parts.

Part 1: Multi-Scale Spatiotemporal Feature Expression. The spatiotemporal states H_t, C_t ∈ R^{C×H×W} are stacked into Z ∈ R^{C×H×W×2}. According to the segmentation rules R = {R_1, ..., R_n}, Z is divided into n multi-scale groups Z_1, ..., Z_n (three groups in Figure 3), and each Z_i, i ∈ [1, n], is stacked along the C channel to compose z_1, ..., z_n. Then the multi-scale implicit features ẑ_1, ..., ẑ_n are expressed by the attention mechanism of Figure 2. After that, the multi-scale implicit features are restored along the H and W channels and concatenated along the C channel to compose Ẑ. Ultimately, the feature maps A_H and A_C are calculated by a convolution layer taking Ẑ ∈ R^{nC×H×W×2} as input and are separated along the last channel.

Part 2: Spatiotemporal Implicit State Update. A_H and A_C are stacked along the C channel as the input of a convolution layer to obtain the multi-scale attention map A. The updated implicit state Ẑ_H is obtained by the summation of A and the convolution-processed implicit state Z_H, and is split into three parts: i_t, g_t, and o_t. The memory state Ĉ_t is then updated as follows:

$$
\begin{aligned}
i_t &= \sigma(W_{Ai} \star [A_H, A_C] + W_{hi} \star Z_H + b_i) \\
g_t &= \tanh(W_{Ag} \star [A_H, A_C] + W_{hg} \star Z_H + b_g) \\
\hat{C}_t &= (1 - i_t) \circ C_t + i_t \circ g_t
\end{aligned}
\tag{3}
$$

Then, the output state Ĥ_t is the Hadamard product of the output gate o_t and the updated memory state Ĉ_t, which can be formulated as follows:

$$
\begin{aligned}
o_t &= \sigma(W_{Ao} \star [A_H, A_C] + W_{ho} \star Z_H + b_o) \\
\hat{H}_t &= o_t \circ \hat{C}_t
\end{aligned}
\tag{4}
$$

CMS-LSTM

As mentioned above, our goals are to maintain the spatiotemporal consistency and correlations among frames in the LSTM layers, and to facilitate the expression of multi-scale dominant spatiotemporal flows while weakening the negligible ones. Therefore, CMS-LSTM is constructed by taking both context interactions and multi-scale spatiotemporal flows into consideration. The architecture of the proposed CMS-LSTM is illustrated in Figure 4.

Figure 4: The architecture of CMS-LSTM. H_{t-1} and C_{t-1} represent the output state and memory state at time t−1, X_t represents the input at time t, and Ĥ_t and Ĉ_t represent the outputs of CMS-LSTM, namely the output state and memory state at time t.

Formally, the calculation process of CMS-LSTM can be expressed as follows:

$$
\begin{aligned}
\hat{X}_t, \hat{H}_{t-1} &= \mathrm{CE}(\cdots \mathrm{CE}(X_t, H_{t-1})) \\
g_t &= \tanh(W_{xg} \star \hat{X}_t + W_{hg} \star \hat{H}_{t-1} + b_g) \\
i_t &= \sigma(W_{xi} \star \hat{X}_t + W_{hi} \star \hat{H}_{t-1} + b_i) \\
f_t &= \sigma(W_{xf} \star \hat{X}_t + W_{hf} \star \hat{H}_{t-1} + b_f) \\
C_t &= f_t \circ C_{t-1} + i_t \circ g_t \\
o_t &= \sigma(W_{xo} \star \hat{X}_t + W_{ho} \star \hat{H}_{t-1} + b_o) \\
H_t &= o_t \circ \tanh(C_t) \\
\hat{H}_t, \hat{C}_t &= \mathrm{SE}(H_t, C_t)
\end{aligned}
\tag{5}
$$

In Formula 5, CE and SE represent the CE and SE blocks introduced above, respectively. X̂_t and Ĥ_{t-1} represent the outputs of the 5-layer stacked CE blocks with intensive context interactions. Then, H_t and C_t are obtained through the LSTM gate operations, which at this point contain only limited global spatiotemporal flows. We thus adopt the 3-scale SE block to extract multi-scale features for further spatiotemporal flows among neighbors, obtaining the final outputs Ĥ_t and Ĉ_t of CMS-LSTM.
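Putting the pieces together, here is a minimal sketch of one CMS-LSTM cell following Formula 5, reusing the CEBlock and BasicAttention sketches above. The SEBlockSingleScale class is a single-scale stand-in for the full multi-scale SE block of Figure 3 (the grouping, per-scale attention, and gated update of Formulas 3 and 4 are omitted for brevity); all names and hyperparameters here are illustrative, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class SEBlockSingleScale(nn.Module):
    """Single-scale placeholder for the SE block: applies the basic
    attention module of Figure 2 to (H_t, C_t). The full SE block also
    splits the states into multi-scale groups and fuses the per-scale
    outputs through the gated update of Formulas 3-4."""

    def __init__(self, channels: int):
        super().__init__()
        self.attn = BasicAttention(channels)

    def forward(self, h_t, c_t):
        return self.attn(h_t, c_t)

class CMSLSTMCell(nn.Module):
    """Sketch of one CMS-LSTM cell (Formula 5)."""

    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        padding = kernel_size // 2
        self.ce = CEBlock(channels, kernel_size, rounds=5)
        # One convolution producing the four gates i, f, g, o at once.
        self.gates = nn.Conv2d(2 * channels, 4 * channels,
                               kernel_size, padding=padding)
        self.se = SEBlockSingleScale(channels)

    def forward(self, x_t, h_prev, c_prev):
        # Context embedding: iterative interaction of input and state.
        x_hat, h_hat = self.ce(x_t, h_prev)
        # Standard ConvLSTM gate operations on the embedded states.
        i, f, g, o = self.gates(torch.cat([x_hat, h_hat], dim=1)).chunk(4, dim=1)
        c_t = torch.sigmoid(f) * c_prev + torch.sigmoid(i) * torch.tanh(g)
        h_t = torch.sigmoid(o) * torch.tanh(c_t)
        # Spatiotemporal expression over the new states.
        return self.se(h_t, c_t)
```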
Experiments

Implementation Details

A PyTorch [Paszke et al., 2019] version of the proposed model is implemented, trained and tested on an RTX 2080Ti. For fair comparisons, the proposed model has the same architecture and a similar computation load compared with previous work ([Wang et al., 2017], etc.). We use the same 4-layer LSTM architecture with 64 hidden states. We set the mini-batch size to 8 and the initial learning rate to 0.001, and adopt scheduled sampling [Bengio et al., 2015] and layer normalization [Ba et al., 2016] during training. We use an L2 loss for Moving MNIST and an L1 + L2 loss for TaxiBJ, training the model with the AdamW [Loshchilov and Hutter, 2017] optimizer.
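For concreteness, here is a sketch of the optimizer and loss setup implied by this paragraph. The function names are ours, and the L2 / L1 + L2 split reflects our reading of the elided loss names in the text.

```python
import torch
import torch.nn.functional as F

def make_optimizer(model: torch.nn.Module) -> torch.optim.AdamW:
    # AdamW with the reported initial learning rate of 0.001.
    return torch.optim.AdamW(model.parameters(), lr=1e-3)

def prediction_loss(pred: torch.Tensor, target: torch.Tensor,
                    dataset: str) -> torch.Tensor:
    # L2 loss for Moving MNIST; L1 + L2 for TaxiBJ (our reading of the
    # elided subscripts in the text).
    l2 = F.mse_loss(pred, target)
    if dataset == "taxibj":
        return F.l1_loss(pred, target) + l2
    return l2
```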
Moving MNIST

Moving MNIST [Srivastava et al., 2015] is a commonly used dataset in video prediction, depicting the movement of two digits with constant velocity. Each sample contains 20 consecutive 64 × 64 frames, with 10 for input and 10 for prediction. Training sequences are generated randomly, while a fixed set of sequences is used for testing.
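The data layout can be summarized with a short shape check (an illustrative sketch; the actual loader and sequence counts follow the original dataset release):

```python
import torch

# One Moving MNIST sample: 20 grayscale frames of 64 x 64 pixels,
# split into 10 observed input frames and 10 prediction targets.
sequence = torch.rand(20, 1, 64, 64)   # (time, channel, height, width)
inputs, targets = sequence[:10], sequence[10:]
assert inputs.shape == (10, 1, 64, 64)
assert targets.shape == (10, 1, 64, 64)
```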
TaxiBJ

TaxiBJ [Xu et al., 2018] is a traffic flow dataset collected from a chaotic real-world environment, containing consecutive traffic flow images collected by the GPS monitors of taxicabs in Beijing. Each frame in the dataset is a 32 × 32 × 2 grid image, where the two channels represent the traffic flow entering and leaving the same district. Following previous work, we generate sequences for training and testing, using 4 known frames to predict the next 4 frames.

We compare the proposed model with previous SOTA methods having the same architecture on the Moving MNIST and TaxiBJ datasets quantitatively, and randomly select prediction results for qualitative comparisons to demonstrate our method's advantages and effectiveness. Other SOTA methods with different architectures and experiment settings are also quantitatively compared.
Results on Moving MNIST
We train for the same number of iterations as previous work ([Wang et al., 2017], etc.), and additionally train for more iterations for better performance. Quantitative and qualitative comparisons are shown in Table 1 and Figure 5, respectively. Peak signal-to-noise ratio (PSNR), structural similarity (SSIM), mean squared error (MSE), and mean absolute error (MAE) are used for quantitative comparisons. Performance improves as SSIM and PSNR increase and as MSE and MAE decrease.

Results in Table 1 demonstrate the superiority of our method on the Moving MNIST dataset in all of the above metrics, improving PSNR and SSIM and reducing MSE and MAE markedly compared with previous SOTA methods having the same architecture. Results in Figure 5 show that CMS-LSTM better captures the variations of the digits; in particular, it handles the trajectories of overlapping digits and maintains clarity over time. In contrast, the frames predicted by other methods appear blurry and fail to deal with overlapping digits.

Figure 5: Qualitative comparisons of previous SOTA models on the Moving MNIST test set. The output frames are shown at one-frame intervals. We magnify a local region of the prediction results for additional detailed comparison at the last frame.

Table 1: Quantitative comparisons of previous SOTA models on the Moving MNIST test set. All models predict 10 frames by observing 10 previous frames. We also train for more iterations for higher performance.

Models | #Params | PSNR↑ | ∆ | SSIM↑ | ∆ | MSE↓ | ∆ | MAE↓ | ∆
FC-LSTM [Srivastava et al., 2015] | - | - | - | 0.690 | - | 118.3 | - | 209.4 | -
DDPAE [Hsieh et al., 2018] | - | 21.170 | +1.567 | 0.922 | +0.232 | 38.9 | -79.4 | 90.7 | -118.7
CrevNet+ConvLSTM [Yu et al., 2020] | - | - | - | 0.928 | +0.238 | 38.5 | -79.8 | - | -
PhyDNet [Guen and Thome, 2020] | - | 23.120 | +3.517 | 0.947 | +0.257 | 24.4 | -93.9 | 70.3 | -139.1
PDE-Driven [Donà et al., 2021] | - | 21.760 | +2.157 | 0.909 | +0.219 | - | - | - | -
PredRNN [Wang et al., 2017] | 13.799 M | 19.603 | - | 0.867 | +0.177 | 56.8 | -61.5 | 126.1 | -83.3
PredRNN++ [Wang et al., 2018] | 13.237 M | 20.239 | +0.636 | 0.898 | +0.208 | 46.5 | -71.8 | 106.8 | -102.6
MIM* [Wang et al., 2019b] | 27.971 M | 20.678 | +1.075 | 0.910 | +0.220 | 44.2 | -74.1 | 101.1 | -108.3
E3D-LSTM [Wang et al., 2019a] | 38.696 M | 20.590 | +0.987 | 0.910 | +0.220 | 41.7 | -76.6 | 87.2 | -122.2
SA-ConvLSTM [Lin et al., 2020] | 10.471 M | 20.500 | +0.897 | 0.913 | +0.223 | 43.9 | -74.4 | 94.7 | -114.7
CMS-LSTM | 7.968 M | 21.955 | +2.352 | 0.931 | +0.241 | 33.6 | -84.7 | 73.1 | -136.3
CMS-LSTM (more iterations) | 7.968 M | 23.682 | +4.079 | 0.949 | +0.259 | 24.3 | -94.0 | 58.1 | -151.3

More specifically, we plot each metric's time-varying curve for the different models on the Moving MNIST dataset, as shown in Figure 6.

Figure 6: Frame-wise comparisons of the next 10 generated Moving MNIST frames, with panels (a) PSNR-Time, (b) SSIM-Time, (c) MSE-Time, and (d) MAE-Time. A slower decaying trend indicates better performance. The proposed CMS-LSTM is the highest-performing method over all timestamps in the forecasting horizon.

Results in Figure 6 not only show that our method outperforms all of the above methods in frame-wise prediction on these metrics, but also demonstrate the stability of our method in long-term prediction: CMS-LSTM shows the best results and the slowest performance decay over the forecasting horizon.
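For reference, here is a minimal sketch of how three of the four metrics can be computed for frames scaled to [0, 1]. Reduction conventions (per-pixel means vs. per-frame sums) vary across papers in this line, so absolute numbers depend on that choice; SSIM needs a windowed implementation (e.g., scikit-image) and is omitted.

```python
import math
import torch
import torch.nn.functional as F

def frame_metrics(pred: torch.Tensor, target: torch.Tensor):
    """MSE, MAE, and PSNR for a batch of frames scaled to [0, 1].

    A minimal sketch using per-pixel mean reductions; assumes mse > 0.
    """
    mse = F.mse_loss(pred, target).item()   # mean squared error
    mae = F.l1_loss(pred, target).item()    # mean absolute error
    psnr = 10.0 * math.log10(1.0 / mse)     # peak signal value of 1.0
    return mse, mae, psnr
```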
Results on TaxiBJ

We train the proposed model for the same number of iterations as previous methods ([Wang et al., 2017], etc.) for fair comparisons. Quantitative and qualitative comparisons are shown in Table 2 and Figure 7, respectively.

As shown in Table 2, we adopt the frame-wise MSE as the metric; smaller MSE indicates better performance. Compared with previous work, our method has the best performance and stability in traffic flow prediction, reducing the average MSE by 0.162 compared with the previous SOTA model (SA-ConvLSTM).

The visualized comparisons with previous methods in Figure 7 include both the predicted frames and their absolute differences from the ground truth. Brighter regions represent higher absolute errors; the proposed method shows the darkest difference maps among all methods, further indicating the superiority of our method in dealing with uncertain sequences.

Figure 7: Qualitative comparisons of previous SOTA models on the TaxiBJ test set. All models output the next 4 frames, accompanied by the absolute difference from the ground truth. Brighter regions represent higher absolute errors.

Table 2: Frame-wise MSE comparisons of previous SOTA models on the TaxiBJ test set. All models predict the next 4 frames (traffic conditions for the next 2 hours) from 4 historical traffic flow images.

Models | Frame1↓ | Frame2↓ | Frame3↓ | Frame4↓ | Average↓ | ∆
ST-ResNet [Zhang et al., 2017] | 0.460 | 0.571 | 0.670 | 0.762 | 0.618 | -
VPN [Kalchbrenner et al., 2017] | 0.427 | 0.548 | 0.645 | 0.721 | 0.585 | -0.033
FRNN [Oliu et al., 2018] | 0.331 | 0.416 | 0.518 | 0.619 | 0.471 | -0.147
PhyDNet [Guen and Thome, 2020] | - | - | - | - | 0.419 | -0.199
PDE-Driven [Donà et al., 2021] | - | - | - | - | 0.398 | -0.220
PredRNN [Wang et al., 2017] | 0.318 | 0.427 | 0.516 | 0.595 | 0.464 | -0.154
PredRNN++ [Wang et al., 2018] | 0.319 | 0.399 | 0.500 | 0.573 | 0.448 | -0.170
MIM* [Wang et al., 2019b] | 0.309 | 0.390 | 0.475 | 0.542 | 0.429 | -0.189
SA-ConvLSTM [Lin et al., 2020] | 0.269 | 0.356 | 0.426 | 0.507 | 0.390 | -0.228
CMS-LSTM | 0.162 | 0.203 | 0.254 | 0.294 | 0.228 | -0.390
Ablation Studies

To better illustrate the superiority of the proposed method, we conduct a series of ablation studies to verify the effectiveness of the CE block and SE block, which focus on the extraction of context interactions and multi-scale spatiotemporal flows, respectively. All experiments below use the same number of training iterations.

We verify the necessity of context interactions and multi-scale spatiotemporal flows by removing the CE block and SE block from CMS-LSTM, respectively, and then use different scales to illustrate the necessity of multi-scale spatiotemporal expression.

Besides, to test the portability of the CE block and SE block, we transplant the two blocks into the previous models PredRNN and SA-ConvLSTM. Specifically, we compare PredRNN [Wang et al., 2017] and SA-ConvLSTM [Lin et al., 2020] with and without the CE block and SE block under the same experiment settings, using the same metrics as before for quantitative comparisons on the Moving MNIST dataset; the results are shown in Table 3.

Results in Table 3 show the effectiveness of the CE block and SE block. The entire CMS-LSTM achieves the best performance compared with the original ConvLSTM. Comparing models with and without the CE block demonstrates the necessity of context interactions. Moreover, the multi-scale experiments further show the importance of spatiotemporal flow extraction at different scales.

Table 3: Ablation studies on the Moving MNIST dataset. Models with and without the CE block or SE block are tested sequentially in different backbones, as well as the SE block with different scales in ConvLSTM.
Models | PSNR↑ | ∆ | SSIM↑ | ∆ | MSE↓ | ∆ | MAE↓ | ∆
ConvLSTM | 18.523 | - | 0.877 | - | 70.4 | - | 115.9 | -
w CE, w/o SE | 21.189 | +2.666 | 0.918 | +0.041 | 39.1 | -31.3 | 82.8 | -33.1
w CE, w 1-scale SE | 21.708 | +3.185 | 0.927 | +0.050 | 35.1 | -35.3 | 76.3 | -39.6
w/o CE, w SE | 21.712 | +3.189 | 0.927 | +0.050 | 34.8 | -35.6 | 76.2 | -39.7
w CE, w 2-scale SE | 21.858 | +3.335 | 0.929 | +0.052 | 33.8 | -36.6 | 74.3 | -41.6
w CE, w SE | 21.955 | +3.432 | 0.931 | +0.054 | 33.6 | -36.8 | 73.1 | -42.8

PredRNN | 19.603 | - | 0.867 | - | 56.8 | - | 126.1 | -
w CE, w/o SE | 22.356 | +2.753 | 0.924 | +0.057 | 30.7 | -26.1 | 82.7 | -43.4
w/o CE, w SE | 22.761 | +3.158 | 0.931 | +0.064 | 28.7 | -28.1 | 76.9 | -49.2
w CE, w SE | 23.210 | +3.607 | 0.935 | +0.068 | 26.3 | -30.5 | 74.2 | -51.9

SA-ConvLSTM | 20.500 | - | 0.913 | - | 43.9 | - | 94.7 | -
w/o CE, w SE | 20.970 | +0.470 | 0.918 | +0.005 | 39.8 | -4.1 | 84.2 | -10.5
w CE, w/o SE | 22.591 | +2.091 | 0.929 | +0.016 | 27.3 | -16.6 | 79.0 | -15.7

Table 3 further verifies the portability of the CE block and SE block. With the transplanting of the CE block and SE block, the performance of previous models improves significantly, indicating that our blocks can be transplanted into other spatiotemporal predictive models.
Conclusion

This paper proposes effective and lightweight modules focused on context interactions and multi-scale spatiotemporal expression, named the CE block and SE block, and then constructs CMS-LSTM, an extension architecture of ConvLSTM. Qualitative and quantitative experiments demonstrate the superiority of our method in dealing with uncertainty and overlap in sequences, showing state-of-the-art performance on the Moving MNIST and TaxiBJ datasets.

Ablation studies further verify the effectiveness and flexibility of our method. The proposed CE block maintains spatiotemporal consistency among long sequences, and the SE block facilitates the expression of multi-scale dominant spatiotemporal flows while weakening the negligible ones. Both can be transplanted into other spatiotemporal predictive models to improve their performance markedly.

References

[Ba et al., 2016] Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
[Bengio et al., 2015] Samy Bengio, Oriol Vinyals, Navdeep Jaitly, et al. Scheduled sampling for sequence prediction with recurrent neural networks. In NeurIPS 2015, pages 1171–1179, 2015.
[Chen and Shi, 2020] Hao Chen and Zhenwei Shi. A spatial-temporal attention-based method and a new dataset for remote sensing image change detection. Remote Sensing, 12(10):1662, 2020.
[Chu et al., 2017] Xiao Chu, Wei Yang, Wanli Ouyang, et al. Multi-context attention for human pose estimation. In CVPR 2017, pages 5669–5678, 2017.
[Donà et al., 2021] Jérémie Donà, Jean-Yves Franceschi, Sylvain Lamprier, et al. PDE-driven spatiotemporal disentanglement. In ICLR 2021, 2021.
[Gehring et al., 2017] Jonas Gehring, Michael Auli, David Grangier, et al. Convolutional sequence to sequence learning. In ICML 2017, pages 1243–1252, 2017.
[Guen and Thome, 2020] Vincent Le Guen and Nicolas Thome. Disentangling physical dynamics from unknown factors for unsupervised video prediction. In CVPR 2020, pages 11471–11481, 2020.
[Hochreiter and Schmidhuber, 1997] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[Hsieh et al., 2018] Jun-Ting Hsieh, Bingbin Liu, De-An Huang, et al. Learning to decompose and disentangle representations for video prediction. In NeurIPS 2018, pages 515–524, 2018.
[Kalchbrenner et al., 2017] Nal Kalchbrenner, Aäron van den Oord, Karen Simonyan, et al. Video pixel networks. In ICML 2017, pages 1771–1779, 2017.
[Krizhevsky et al., 2017] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. Communications of the ACM, 60(6):84–90, 2017.
[Lerer et al., 2016] Adam Lerer, Sam Gross, and Rob Fergus. Learning physical intuition of block towers by example. In ICML 2016, pages 430–438, 2016.
[Lin et al., 2020] Zhihui Lin, Maomao Li, Zhuobin Zheng, et al. Self-attention ConvLSTM for spatiotemporal prediction. In AAAI 2020, pages 11531–11538, 2020.
[Loshchilov and Hutter, 2017] Ilya Loshchilov and Frank Hutter. Fixing weight decay regularization in Adam. CoRR, 2017.
[Mei et al., 2020] Yiqun Mei, Yuchen Fan, Yulun Zhang, et al. Pyramid attention networks for image restoration. arXiv preprint arXiv:2004.13824, 2020.
[Melis et al., 2020] Gábor Melis, Tomáš Kočiský, and Phil Blunsom. Mogrifier LSTM. In ICLR 2020, 2020.
[Oliu et al., 2018] Marc Oliu, Javier Selva, and Sergio Escalera. Folded recurrent neural networks for future video prediction. In ECCV 2018, pages 745–761, 2018.
[Paszke et al., 2019] Adam Paszke, Sam Gross, Francisco Massa, et al. PyTorch: An imperative style, high-performance deep learning library. In NeurIPS 2019, pages 8024–8035, 2019.
[Shi et al., 2015] Xingjian Shi, Zhourong Chen, Hao Wang, et al. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In NeurIPS 2015, pages 802–810, 2015.
[Srivastava et al., 2015] Nitish Srivastava, Elman Mansimov, and Ruslan Salakhutdinov. Unsupervised learning of video representations using LSTMs. In ICML 2015, pages 843–852, 2015.
[Vaswani et al., 2017] Ashish Vaswani, Noam Shazeer, Niki Parmar, et al. Attention is all you need. In NeurIPS 2017, pages 5998–6008, 2017.
[Wang et al., 2017] Yunbo Wang, Mingsheng Long, Jianmin Wang, et al. PredRNN: Recurrent neural networks for predictive learning using spatiotemporal LSTMs. In NeurIPS 2017, pages 879–888, 2017.
[Wang et al., 2018] Yunbo Wang, Zhifeng Gao, Mingsheng Long, et al. PredRNN++: Towards a resolution of the deep-in-time dilemma in spatiotemporal predictive learning. In ICML 2018, pages 5110–5119, 2018.
[Wang et al., 2019a] Yunbo Wang, Lu Jiang, Ming-Hsuan Yang, et al. Eidetic 3D LSTM: A model for video prediction and beyond. In ICLR 2019, 2019.
[Wang et al., 2019b] Yunbo Wang, Jianjin Zhang, Hongyu Zhu, et al. Memory in memory: A predictive neural network for learning higher-order non-stationarity from spatiotemporal dynamics. In CVPR 2019, pages 9154–9162, 2019.
[Werbos, 1990] Paul J. Werbos. Backpropagation through time: what it does and how to do it. Proceedings of the IEEE, 78(10):1550–1560, 1990.
[Xu and Saenko, 2016] Huijuan Xu and Kate Saenko. Ask, attend and answer: Exploring question-guided spatial attention for visual question answering. In ECCV 2016, pages 451–466, 2016.
[Xu et al., 2018] Ziru Xu, Yunbo Wang, Mingsheng Long, et al. PredCNN: Predictive learning with cascade convolutions. In IJCAI 2018, pages 2940–2947, 2018.
[Yu et al., 2020] Wei Yu, Yichao Lu, Steve Easterbrook, et al. Efficient and information-preserving future frame prediction and beyond. In ICLR 2020, 2020.
[Zhang et al., 2017] Junbo Zhang, Yu Zheng, and Dekang Qi. Deep spatio-temporal residual networks for citywide crowd flows prediction. In AAAI 2017, pages 1655–1661, 2017.
[Zhao and Wu, 2019] Ting Zhao and Xiangqian Wu. Pyramid feature attention network for saliency detection. In CVPR 2019, 2019.