Comparison of Spatiotemporal Networks for Learning Video Related Tasks
Logan Courtney
Department of Industrial and Enterprise Systems Engineering
University of Illinois
Urbana-Champaign, IL
[email protected]
Ramavarapu Sreenivas
Department of Industrial and Enterprise Systems Engineering
University of Illinois
Urbana-Champaign, IL
[email protected]
September 17, 2020

Abstract
Many methods for learning from video sequences involve temporally processing 2D CNN features from the individual frames or directly utilizing 3D convolutions within high-performing 2D CNN architectures. The focus typically remains on how to incorporate the temporal processing within an already stable spatial architecture. This work constructs an MNIST-based video dataset with parameters controlling relevant facets of common video-related tasks: classification, ordering, and speed estimation. Models trained on this dataset are shown to differ in key ways depending on the task and their use of 2D convolutions, 3D convolutions, or convolutional LSTMs. An empirical analysis indicates a complex, interdependent relationship between the spatial and temporal dimensions with design choices having a large impact on a network's ability to learn the appropriate spatiotemporal features.
1 Introduction

The convolution operation remains at the core of 2D spatial networks due to advantageous characteristics such as weight sharing (reduced parameters), local connectivity (reduced convergence time), and down-sampling (reduced computation). Base architectures have grown deeper and more complex over time and demonstrated improved performance on a variety of image-based tasks [1].

For natural language processing (NLP), techniques for handling 1D sequences differ in how the temporal information is consumed, with both convolutions [2] and LSTM networks [3] showing early success. LSTMs process elements one at a time with a form of memory, as opposed to convolutions operating on local neighborhoods. Although still occasionally used, these methods have largely been replaced by the Transformer [4] due to its non-recurrent structure and its ability to self-attend over all positions in the input sequence.

When it comes to video, the success of a deep network depends on its ability to learn both spatial and temporal features. Interpreting a video as a sequence of images, many video-based networks borrow techniques from the above fields of image processing and NLP.

The early stages of action recognition had relatively small datasets [5] insufficient for training novel spatiotemporal networks from scratch. This forced methods to creatively utilize 2D image networks pretrained on ImageNet [6]. [7] stacks LSTMs on top of a pretrained CNN. [8] inputs RGB frames for content and optical flow frames for motion to pretrained CNNs. The spatial and temporal dimensions are handled in two separate stages. As larger action recognition datasets became available, such as Kinetics [9][10][11], networks primarily use 3D convolutions to directly capture spatiotemporal features. [12] uses 3D versions of high-performing 2D CNNs. [13] uses a high-resolution, low frame rate model for spatial semantics and a low-resolution, high frame rate model for motion. [14] utilizes two spatiotemporal views to exploit different contexts. The focus remains on temporal modifications to popular 2D image networks without an emphasis on how the choice of the spatial architecture directly impacts what temporal modifications are necessary.

This issue becomes more apparent after viewing recent techniques for lipreading. [15] demonstrates an increase in performance when utilizing 3D convolutions for only the first layer of the network to process the short-term dynamics. The remainder of the spatial information is processed with 2D convolutions before a bidirectional LSTM to capture long-term context. [16] achieves comparable performance utilizing convolutional LSTM layers throughout the full network. Utilizing 3D convolutions throughout the full network in [17] results in their lowest performing model, indicating a non-negligible relationship between the spatial architecture and the temporal technique. Current state-of-the-art lipreading models [18][19] both utilize modified training schemes with additional unlabeled data without focusing on the spatiotemporal architecture.

This work takes a step backwards and attempts to understand how various video techniques operate on a more fundamental level. Multiple models are trained on a novel MNIST-based video dataset. By directly parameterizing the video sequence to control for visible spatiotemporal features and training on multiple video-related tasks, the performance of these techniques is shown to differ in key ways.
Section 2 begins with an overview of spatiotemporal receptive fields and resolution. Section 3 discusses the creation of the MNIST-based video dataset and its associated spatiotemporal tasks. Section 4 discusses the various models and their characteristics. Section 5 provides an interpretation of the results, followed by Section 6 wrapping up with the impact and future work.
2 Receptive Fields and Resolution

The Universal Approximation Theorem (UAT) [20] states that a single layer network with a finite number of neurons can approximate any continuous function arbitrarily closely. Given a video input x ∈ R^{h×w×c×T}, a single layer network with h_o hidden units is sufficient for all video-related tasks. Each element of the output is dependent on the full spatial and temporal dimensions. In practice, various techniques such as convolutions are used in place of fully connected networks due to the unknown number of hidden units h_o necessary, the large number of parameters, and the unproven ability to learn such functions.

Although the representation power of the network is reduced with the replacement of fully connected layers, this has still led to high-performing models on a wide variety of tasks. Convolutions constrain models to operate only on small, local neighborhoods of the input. The region of the input contributing to the output of a layer is called the receptive field. In order to learn features in the input outside of the receptive field of a single layer, model designers construct deep networks of many layers.

Spatial Receptive Field:
Spatial pooling and/or strided convolutions are typically used to reduce the height/width of the input frame as it passes deeper into the network. This significantly reduces the computation and greatly extends the spatial receptive field, allowing the small localized area of a particular layer to contain more spatially distant information from the original input.
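The interaction between kernel size and stride can be made concrete with a small helper (ours, not the paper's) that tracks how the receptive field grows along one dimension of a stack of 3-wide convolutions:

```python
def receptive_fields(kernel=3, strides=(2, 2, 2, 1, 2, 1, 1, 1)):
    """Receptive field after each layer of a stack of `kernel`-wide
    convolutions, along a single dimension. The default strides match the
    spatial strides of the residual blocks described later in the paper."""
    rf, jump, fields = 1, 1, []
    for s in strides:
        rf += (kernel - 1) * jump   # each layer sees (k-1)*jump more input positions
        jump *= s                   # striding spreads later samples further apart
        fields.append(rf)
    return fields

print(receptive_fields())                  # [3, 7, 15, 31, 47, 79, 111, 143]
print(receptive_fields(strides=(1,) * 8))  # [3, 5, 7, 9, 11, 13, 15, 17]
```

With the strided pattern, the receptive field reaches 143 after 8 layers; without striding, the same stack only reaches 17. This is exactly the difference behind the SRF and TRF columns of the tables later in the paper.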
Spatial Resolution:
This comes at the cost of reducing the spatial resolution. The exact location of small features must be encoded into the depth dimension, otherwise this information is lost. The relevance of this depends on the task. Image classification cares less about where the object appears within the image and more about whether or not it appears at all. Object detection requires the precise spatial information to pinpoint the object's location in the original image. Spatial pooling/strided convolutions have been necessary for every deep network. This is done not for the benefit of model representation but out of necessity for modern implementations. The GPU memory and computation requirements become prohibitive if these techniques are not employed due to the large input image dimensions.
Temporal Receptive Field:
Convolutions treat the temporal dimension in the same way as the spatial dimension. The temporal receptive field grows gradually as the input passes through the network. Only a few neighboring inputs are seen early in the network, and long-range information is handled deeper in the network. LSTMs, on the other hand, maintain an internal state with an additional self connection to pass the information along every timestep. Long-range dependencies can appear at any layer without forcing the network to delay this processing until a later stage.
Temporal Resolution:
Similar to the spatial dimension, strided convolutions can be used to increase the temporal receptive field at the cost of resolution. A problem like action recognition may employ such techniques since the task is more related to specifying what happens than when it happens. The computation is reduced without sacrificing performance since the destruction of the fine-grained temporal information is less relevant to the output. Techniques for lipreading have been shown to perform better when the temporal dimension is maintained, following the intuition that lipreading requires some degree of what happens as well as when it happens. The additional computation and smaller receptive field from maintaining the temporal dimension is less of a problem than in the spatial dimension, typically because most videos have roughly 30 frames per second (much smaller than the height/width of each frame) and the current tasks (like action recognition and lipreading) require a relatively short temporal receptive field of at most a few seconds.
Spatiotemporal Considerations:
As long as the output has a sufficiently large receptive field, networks seem to perform well. Detecting small objects in a 2D image may use a relatively small spatial receptive field. Full image classification typically uses receptive fields larger than the input image. Classifying single sentences in a document may require the context of a few neighboring sentences, while classifying the document may require the full text.

Most techniques for spatiotemporal related tasks simply borrow and combine the successful techniques for both of these sub-problems. However, individual consideration of the spatial and temporal dimensions remains insufficient for dealing with spatiotemporal features. These techniques result in a complex interaction which is shown here to have a significant impact on performance depending on the nature of the task.
Figure 1: Example input from MNIST-based video dataset with and without quadrant masking.
3 MNIST-Based Video Dataset

The MNIST dataset [21] contains 28 × 28 sized images of handwritten digits. These images are used to construct a video dataset (similar to [22]) meant to test the capability of various networks to learn three spatiotemporal related tasks: classification, ordering, and speed.

The images are placed in a random location of a larger frame of size 64 × 64. The digit moves around over the course of 48 frames based on a random pixel speed S = {1, 2, ..., 6} and a random direction selected at the start of the video. Collisions with walls cause the digit to reverse direction at the frame of impact. Only select frames are visible, making the digit appear to blink at a randomly selected rate V = {1, 2, ..., 12}. The digit is split into four quadrants as shown in the bottom of Figure 1, with a separate quadrant appearing every visible frame. The four quadrants appear in a randomly selected order, with the order repeating itself throughout the video.
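The construction above can be sketched in a few lines of NumPy. This is a minimal version under stated assumptions (the paper does not specify them): the first frame is visible, the starting direction is a random diagonal, and quadrant 1 is top-left with 2/3/4 following in row-major order.

```python
import numpy as np

def make_video(digit, speed, blink_rate, order, T=48, frame=64, rng=None):
    """Sketch of the dataset construction. `digit` is a 28x28 array, `speed`
    is S in pixels/frame, `blink_rate` is V, `order` is a quadrant sequence."""
    rng = np.random.default_rng(rng)
    d, h = digit.shape[0], digit.shape[0] // 2
    quads = {1: (slice(0, h), slice(0, h)), 2: (slice(0, h), slice(h, None)),
             3: (slice(h, None), slice(0, h)), 4: (slice(h, None), slice(h, None))}
    pos = rng.integers(0, frame - d, size=2).astype(float)  # random start location
    vel = speed * rng.choice([-1.0, 1.0], size=2)           # S px/frame, random diagonal
    video = np.zeros((T, frame, frame))
    shown = 0
    for t in range(T):
        if t % blink_rate == 0:                  # digit blinks at rate V
            masked = np.zeros_like(digit)
            q = quads[order[shown % 4]]          # one quadrant per visible frame,
            masked[q] = digit[q]                 # cycling through the chosen order
            r, c = int(pos[0]), int(pos[1])
            video[t, r:r + d, c:c + d] = masked
            shown += 1
        pos += vel                               # the digit moves even when hidden
        for a in range(2):                       # reverse direction on wall impact
            if pos[a] < 0 or pos[a] > frame - d:
                vel[a] = -vel[a]
                pos[a] = float(np.clip(pos[a], 0, frame - d))
    return video

video = make_video(np.ones((28, 28)), speed=4, blink_rate=12, order=(1, 2, 3, 4), rng=0)
```

With V = 12 over 48 frames, only four frames are visible, so each quadrant appears exactly once.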
Task 1 (Classification):
Each network will attempt to identify the digit present in the video. By presenting only a single quadrant every visible frame, the network must piece together information from individual frames. The rate of visible frames V and speed S alter how this information is presented, impacting performance based on a network's ability to handle various spatiotemporal features. 2D convolution networks have been shown to achieve high performance classifying the digits, providing a reasonable upper bound for expected performance on this task.
Task 2 (Ordering):
Each network will attempt to identify the order in which the quadrants appear from six possible sequences: (1,2,3,4), (1,2,4,3), (1,4,2,3), (1,4,3,2), (1,3,2,4), (1,3,4,2). The previous task classifies the digit regardless of the order the information is presented. At the output, the model needs only a summary of the total spatial information. For this task, the spatial information is an intermediate feature allowing the model to identify the visible quadrant at each frame, while the output only needs to contain the temporal information about the order of presentation.
Task 3 (Speed):
Each network will attempt to identify the pixel speed S. This task requires the network to maintain information related to both previous tasks as well as maintain high spatial and temporal resolution information. Calculating the speed of the centroid between visible frames is different depending on the quadrant order. Identifying the quadrant order is dependent on identifying the digit.

When V = 12, a quadrant is shown every 12 frames, requiring a minimum of 48 frames for the model to be presented with the full digit. The digit still moves with speed S even if the frame is not visible. S and V are limited to combinations resulting in distances less than 50 pixels between visible frames. That is, at the max speed S = 6, the max value for V is 8 (48 total pixels of movement). At a speed S = 4, V can achieve its max value of 12. Without this limit, aliasing can occur when the digit bounces off of a wall due to the video frame size of 64 × 64.

4 Models

All of the models are based on Residual Networks [23]. Each residual block has two paths. The first path contains a convolution with a 1×1 kernel followed by a specialized convolution (2D convolution, 3D convolution, convolutional LSTM, or 3D convolution with a temporal stride) followed by another 1×1 convolution. The second path is an open connection. Both paths are added together to form the output of the block. If the block uses a spatial stride, the open connection contains a 1×1 convolution with a stride of 2 to downsample the input to the appropriate size. The models contain 8 total residual blocks, each of which uses a different specialized convolution depending on the model.

Tables 1, 2, 3, and 4 specify the number of channels, the convolution stride, the input/output dimensions, and the spatial/temporal receptive fields (SRF/TRF). There are two models of different size for each type. The specialized convolutions operate on the temporal dimension in various ways, shown in Figures 2, 3, 4, and 5. Each output (red) represents a spatial feature map with the connections highlighting its dependencies on the intermediate feature maps (blue) and input frames (yellow). Unlike 1D temporal problems, the effect of the spatial dimension complicates the impact of these temporal differences since the spatial resolution and receptive field differ throughout the network.

Figure 2: The spatial dimension for each frame is processed individually.
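The block structure described above can be made concrete with a shape-level NumPy sketch. These are stand-ins of ours, not the paper's code: 1×1 convolutions are per-pixel matrix multiplications, the specialized convolution is left as a pluggable function, and striding is shown as plain subsampling.

```python
import numpy as np

def conv1x1(x, w):
    # x: (T, H, W, C_in), w: (C_in, C_out); a 1x1 convolution is a per-pixel matmul
    return x @ w

def residual_block(x, w_in, specialized, w_out, w_skip=None, stride=1):
    y = conv1x1(x, w_in)
    y = specialized(y)                # stand-in for the 2D/3D/ConvLSTM convolution
    if stride > 1:
        y = y[:, ::stride, ::stride]  # subsampling as a stand-in for a spatial stride
    y = conv1x1(y, w_out)
    skip = x if w_skip is None else conv1x1(x[:, ::stride, ::stride], w_skip)
    return y + skip                   # open connection added to the conv path

# Example: a strided block shaped like Res4_1 of the Conv2D model, with the
# specialized convolution stubbed out as the identity.
rng = np.random.default_rng(0)
x = rng.normal(size=(48, 8, 8, 64))   # (T, H, W, C)
out = residual_block(x, rng.normal(size=(64, 16)), lambda y: y,
                     rng.normal(size=(16, 128)),
                     w_skip=rng.normal(size=(64, 128)), stride=2)
print(out.shape)                      # (48, 4, 4, 128)
```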
Conv2D       Channels      Channels-XL   Stride   Input Size  Output Size  SRF      TRF
             (in,ch,out)   (in,ch,out)   (h,w,t)  (h,w,t)     (h,w,t)      (h,w)    (t)
Res1_1       (1,16,64)     (1,36,72)     (2,2,1)  64x64x48    32x32x48     3x3      1
Res2_1       (64,16,64)    (72,36,72)    (2,2,1)  32x32x48    16x16x48     7x7      1
Res3_1       (64,16,64)    (72,36,72)    (2,2,1)  16x16x48    8x8x48       15x15    1
Res3_2       (64,16,64)    (72,36,72)    (1,1,1)  8x8x48      8x8x48       31x31    1
Res4_1       (64,32,128)   (72,72,144)   (2,2,1)  8x8x48      4x4x48       47x47    1
Res4_2       (128,32,128)  (144,72,144)  (1,1,1)  4x4x48      4x4x48       79x79    1
Res4_3       (128,32,128)  (144,72,144)  (1,1,1)  4x4x48      4x4x48       111x111  1
Res4_4       (128,32,128)  (144,72,144)  (1,1,1)  4x4x48      4x4x48       143x143  1
SpatialPool  -             -             (4,4,1)  4x4x48      1x1x48       -        -
LSTM1        (128,128)     (144,144)     -        1x1x48      1x1x48       -        48
LSTM2        (128,128)     (144,144)     -        1x1x48      1x1x48       -        48

Table 1: The temporal receptive field (TRF) remains 1 until the LSTM layers.
A baseline network utilizes 2D convolutions followed by two LSTM layers. The spatial and temporal information is handled in two separate stages. The spatial information for a single frame is processed entirely, with the spatial dimension fully collapsed, before any sort of temporal processing is performed by the LSTM. This means the embedding output from the 2D convolutions must encode the relevant spatial information into the depth dimension.

Consider two sub-functions. The first function takes the input frame and outputs a two-dimensional vector containing the exact x and y pixel location of the digit. The second function calculates the distance between points divided by the number of frames between these points. Combining these functions would result in perfect speed prediction. It is not unreasonable to believe a 2D convolution network and an LSTM could learn these two respective functions, suggesting more advanced techniques are unneeded. However, this function may not be learnable via gradient descent with a reasonable network size.
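The two sub-functions can be written down directly. These are hypothetical helpers for illustration, not part of any model:

```python
import math

# One function reads off the digit's centroid in a visible frame; the other
# turns a pair of centroids into a speed estimate in pixels per frame.
def centroid(frame):
    pts = [(r, c) for r, row in enumerate(frame) for c, v in enumerate(row) if v]
    return (sum(r for r, _ in pts) / len(pts), sum(c for _, c in pts) / len(pts))

def speed(c0, c1, frames_apart):
    return math.dist(c0, c1) / frames_apart   # pixels moved per frame

# Centroids 24 pixels apart across a gap of V = 4 frames imply S = 6:
print(speed((10.0, 10.0), (10.0, 34.0), 4))   # 6.0
```

The composition is trivial to state; the question the paper raises is whether a CNN-plus-LSTM can actually discover it by gradient descent.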
Figure 3: The temporal receptive field grows as the network deepens.
Conv3D       Channels      Channels-XL   Stride   Input Size  Output Size  SRF      TRF
             (in,ch,out)   (in,ch,out)   (h,w,t)  (h,w,t)     (h,w,t)      (h,w)    (t)
Res1_1       (1,16,64)     (1,36,72)     (2,2,1)  64x64x48    32x32x48     3x3      3
Res2_1       (64,16,64)    (72,36,72)    (2,2,1)  32x32x48    16x16x48     7x7      5
Res3_1       (64,16,64)    (72,36,72)    (2,2,1)  16x16x48    8x8x48       15x15    7
Res3_2       (64,16,64)    (72,36,72)    (1,1,1)  8x8x48      8x8x48       31x31    9
Res4_1       (64,32,128)   (72,72,144)   (2,2,1)  8x8x48      4x4x48       47x47    11
Res4_2       (128,32,128)  (144,72,144)  (1,1,1)  4x4x48      4x4x48       79x79    13
Res4_3       (128,32,128)  (144,72,144)  (1,1,1)  4x4x48      4x4x48       111x111  15
Res4_4       (128,32,128)  (144,72,144)  (1,1,1)  4x4x48      4x4x48       143x143  17
SpatialPool  -             -             (4,4,1)  4x4x48      1x1x48       -        -
LSTM1        (128,128)     (144,144)     -        1x1x48      1x1x48       -        48
LSTM2        (128,128)     (144,144)     -        1x1x48      1x1x48       -        48

Table 2: The temporal receptive field does not grow fast enough to cover the full sequence of length 48 before the spatial dimension collapses.
A deep network of 3D convolutions, which has two LSTM layers at the top as before, processes the temporal and spatial information simultaneously. For short term dynamics, the 3D convolution is able to directly extract spatiotemporal features if they fall within its spatiotemporal receptive field. For long term dynamics, the model must encode the high-resolution spatial information into the depth dimension to be processed by a later layer. This is similar to the 2D convolution network but less extreme, as it can still handle some degree of temporal information. For example, speeds S ≥ 2 move the object outside of the spatial receptive field of the first layer, with visible frame rates V ≥ 2 moving the object outside of its temporal receptive field. For these large values, the model is forced to combine information from multiple frames much later in the network at a reduced spatial resolution.

The previous model is modified to use strided convolutions in the temporal dimension. The temporal receptive field increases more rapidly at the cost of decreased temporal resolution, in the same manner as the spatial resolution. For short term dynamics, the network is capable of dealing with spatiotemporal features sooner than the no-stride network but now must also encode the temporal location into the depth dimension. This can be both a benefit and a hindrance. A particular layer has a wider field of vision, allowing it to directly act on the relevant information before the spatial dimension gets reduced, but the model is directly acting on less precise information due to the reduced temporal resolution.
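Whether a given (S, V) pair is visible to a particular layer can be approximated with a quick check. This is our own simplification, not the paper's analysis code: consecutive visible frames are V frames and S·V pixels apart, and a layer can only relate them if both offsets fit within its receptive field radii.

```python
def falls_in_rf(S, V, srf, trf):
    # srf/trf are the (odd) receptive field widths; (x - 1) // 2 is the radius.
    return V <= (trf - 1) // 2 and S * V <= (srf - 1) // 2

# First layer of the Conv3D model (SRF 3x3, TRF 3): only adjacent pixels/frames.
print(falls_in_rf(S=1, V=1, srf=3, trf=3))   # True
print(falls_in_rf(S=2, V=1, srf=3, trf=3))   # False: the object moved 2 pixels
```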
Figure 4: The temporal receptive field grows more rapidly as the network deepens due to the temporal stride.

TS-Conv3D-XL  Channels  Channels-XL   Stride   Input Size  Output Size  SRF      TRF
              (in,ch,out) (in,ch,out) (h,w,t)  (h,w,t)     (h,w,t)      (h,w)    (t)
Res1_1        -         (1,36,72)     (2,2,2)  64x64x48    32x32x24     3x3      3
Res2_1        -         (72,36,72)    (2,2,2)  32x32x24    16x16x12     7x7      7
Res3_1        -         (72,36,72)    (2,2,2)  16x16x12    8x8x6        15x15    15
Res3_2        -         (72,36,72)    (1,1,1)  8x8x6       8x8x6        31x31    31
Res4_1        -         (72,72,144)   (2,2,2)  8x8x6       4x4x3        47x47    47
Res4_2        -         (144,72,144)  (1,1,1)  4x4x3       4x4x3        79x79    79
Res4_3        -         (144,72,144)  (1,1,1)  4x4x3       4x4x3        111x111  111
Res4_4        -         (144,72,144)  (1,1,1)  4x4x3       4x4x3        143x143  143
SpatialPool   -         -             (4,4,1)  4x4x3       1x1x3        -        -
LSTM1         -         (144,144)     -        1x1x3       1x1x3        -        48
LSTM2         -         (144,144)     -        1x1x3       1x1x3        -        48

Table 3: The temporal receptive field (TRF) grows at the same rate as the spatial receptive field (SRF).

A deep network of convolutional LSTMs processes the spatiotemporal features simultaneously. The hidden state for each layer is passed to each subsequent frame, implying each layer is capable of processing any temporal feature at all stages of spatial resolution and spatial receptive field. As the spatial resolution is decreased, the model can either encode the spatial information into the depth dimension or maintain this information in the hidden state to be processed at the next time step. Each layer having access to the full temporal information regardless of the network depth is an expected advantage.

Figure 5: All previous inputs are used in calculating the output.

ConvLSTM     Channels      Channels-XL   Stride   Input Size  Output Size  SRF      TRF
             (in,ch,out)   (in,ch,out)   (h,w,t)  (h,w,t)     (h,w,t)      (h,w)    (t)
Res1_1       (1,16,64)     (1,16,64)     (2,2,1)  64x64x48    32x32x48     3x3      48
Res2_1       (64,16,64)    (64,16,64)    (2,2,1)  32x32x48    16x16x48     7x7      48
Res3_1       (64,16,64)    (64,16,64)    (2,2,1)  16x16x48    8x8x48       15x15    48
Res3_2       (64,16,64)    (64,16,64)    (1,1,1)  8x8x48      8x8x48       31x31    48
Res4_1       (64,32,128)   (64,32,128)   (2,2,1)  8x8x48      4x4x48       47x47    48
Res4_2       (128,32,128)  (128,32,128)  (1,1,1)  4x4x48      4x4x48       79x79    48
Res4_3       (128,32,128)  (128,32,128)  (1,1,1)  4x4x48      4x4x48       111x111  48
Res4_4       (128,32,128)  (128,32,128)  (1,1,1)  4x4x48      4x4x48       143x143  48
SpatialPool  -             -             (4,4,1)  4x4x48      1x1x48       -        -

Table 4: The temporal receptive field (TRF) is large at all layers of the network.
5 Results

Table 5: Performance over the three tasks for all models trained. The metric for the third task is mean absolute error (MAE) of the pixel speed.
As shown in Table 5, there were slight variations in performance for the first two tasks. The large models all outperformed their smaller counterparts. With a magnitude of less than 1%, both convolutional LSTM models lagged behind the 2D/3D models in digit classification.

Excluding TS-Conv3D-XL, performance was largely unaffected by the input parameters. Speed S had no effect on either task. Large values for visible frames V caused a drop between 2% and 3% when classifying the sequence order compared to smaller values. The models have less information considering each quadrant is visible only once in these scenarios.

The TS-Conv3D-XL model is the only model to stand out. The drastic drop in performance for classifying the sequence order is almost entirely due to when V = 1 (70.8%) and V = 2 (85.4%). Performance for V ≥ 3 matches closely with Conv3D-XL. This difference must be attributed to the decrease in temporal resolution early in the network. For small values of V, there is no longer a unique temporal input for each individual quadrant after the first layer. The model is forced to encode this information into the depth dimension, whereas the Conv3D-XL model can delay this until later in the network.

This decrease in sequence order performance is most likely connected to the under-performance in digit classification. Unless a digit is recognizable from only 1-2 quadrants (such as the number 1), the model must either know the sequence order to appropriately piece together the parts or know the digit in order to identify the order. Digits 6, 8, and 9 have higher error rates. Three out of four quadrants of the digit 8 match with 6 or 9. For these small values of V, the model immediately had to encode two separate frames into a single output, making this distinction later in the network difficult. This difference in digit error rates is only seen in TS-Conv3D-XL.

Performance estimating the speed is far more varied between the models. Mean absolute error is used as the metric.
3D convolutions outperformed 2D convolutions, and convolutional LSTMs outperformed 3D convolutions. At first glance, this is rather uninsightful considering the difference in spatiotemporal receptive fields between these models was already known ahead of time. However, unlike the first two tasks, performance is heavily impacted by the parameters S and V. By comparing performance over the space of S and V, the variations provide insight into how these techniques operate.

Conv2D-XL vs. Conv3D-XL:
Although Conv3D-XL has a 10% smaller overall MAE than Conv2D-XL, the error difference is non-uniform over the space of S and V, with the relative performance between the models shown in Figure 7. Positive values indicate greater performance by Conv3D. For slow speeds and small values of V, the inputs fall into the spatiotemporal receptive field of the first three residual blocks, allowing the model to actually take advantage of the 3D convolution operation and significantly outperform Conv2D. The Conv2D model must instead correctly encode this spatial information for later processing by the LSTM layers.

For large values of V, performance is nearly identical between the models. Note the temporal receptive field for Conv3D is 9 after Res3_2 (see Table 2). These large values of V prevent the model from any temporal operations until after this layer, implying it operates like a 2D convolution. The spatial resolution has decreased by a factor of 8 at this point, making this encoding significantly more difficult and resulting in a sharp drop in performance.

Figure 6: Mean absolute error (MAE) over the space of V and S for all four models.
The region 4 ≤ V ≤ 7 and S = 6 is on the cusp of the spatiotemporal receptive field for Conv3D. Between visible frames, the object moves between 24 and 42 pixels. Although the input falls within the temporal receptive field of Res3, the input is occasionally outside of the spatial receptive field, causing the model to delay processing until after the spatial resolution has been reduced. These regions where the network operates as a hybrid of 3D convolutions and 2D convolutions seem to cause learning difficulties, resulting in performance worse than the model with only 2D convolutions.

Conv3D-XL vs. TS-Conv3D-XL:
Both models achieved comparable overall MAE, with the relative differences over the space of S and V shown in Figure 8. The temporal receptive field of TS-Conv3D increases more rapidly, allowing the model access to spatial information before the spatial resolution has decreased, at the cost of decreased temporal resolution.

For large values of V, the input falls into the temporal receptive field of TS-Conv3D before the spatial resolution has been reduced, resulting in an increase of performance over the Conv3D model. For small values of V, the input falls into the temporal receptive field of both models at a high spatial resolution yet with reduced temporal resolution for TS-Conv3D, hindering performance. Temporally strided convolutions can be both a benefit and a hindrance.

Conv3D-XL vs. ConvLSTM-XL:
The relative performance difference between Conv3D-XL and ConvLSTM-XL over the space of S and V is shown in Figure 9. ConvLSTM performed better for all inputs. Allowing each layer access to the full range of temporal information provides a clear advantage for estimating the speed. Additionally, the performance is far more uniform, without the model being biased towards particular regions unlike the Conv3D and TS-Conv3D models.

Although the use of convolutional LSTMs increases the temporal receptive field over 3D convolutions, it does not compensate for the full spatiotemporal receptive field. Performance still gradually drops as both S and V increase. Large values of these parameters result in very large object movements between visible frames. Convolutional LSTMs still utilize 2D convolutions for spatial operations, and these large movements can easily fall outside the spatial receptive field, delaying processing until a later stage after the spatial resolution has been reduced.

Figure 7: The Conv3D model performs significantly better than the Conv2D model for small values of V and S yet performs poorly on its spatiotemporal boundaries.

Figure 8: Both Conv3D models achieve the same MAE over the entire dataset but very different errors over the space of V and S.

Figure 9: The ConvLSTM model performs relatively better than the Conv3D model over the entire space of V and S.
Encoding Spatial Information:
When the input falls outside of the spatiotemporal receptive field of a particular 3D convolution layer, the network simply operates as a 2D convolution and encodes the spatial information into the depth dimension for a later stage. Although Conv2D performed the worst for estimating the speed, it works unexpectedly well, with 96.1% accuracy when the output is rounded. Encoding the spatial information is clearly doable but may be difficult to learn via gradient descent or may simply require a prohibitively large model. The models with a large depth dimension all outperformed their smaller counterparts yet still underperformed relative to the smaller convolutional LSTM model. Utilizing additional connections to allow access to as much of the spatiotemporal information as possible at all stages of processing is more effective than forcing the model to learn in a roundabout way such as forced encoding.
Differences Between Spatial and Temporal Resolution:
The underperformance of temporally strided 3D convolutions for digit classification and sequence order hints at the importance of maintaining the temporal dimension. Is this also true for the spatial dimension? Nearly all 2D image networks reduce the spatial resolution out of necessity. Not doing so would require very deep networks to achieve a large visual receptive field. This very deep network would also be operating on much higher resolution feature maps, resulting in a prohibitive amount of computation. If the information in these dimensions is the same, this may not be an issue. However, if they are different, is there a reason for treating the dimensions the same as is done by 3D convolutions?
Effective Use of 3D Convolutions:
Performance of 3D convolutions is heavily impacted by design choices. Conv3D and TS-Conv3D appear equal overall with this particular dataset, but that is simply by design. Both models learn a different spatiotemporal region. For real datasets, the distribution of spatiotemporal features is almost certainly non-uniform. This is the crux of the problem. Should the network be well designed to capture the appropriate spatiotemporal features, performance will be high. However, the scale of the important spatiotemporal features is frequently unknown ahead of time, requiring extensive experimentation.

This is further complicated after comparing the results of Conv2D-XL and Conv3D-XL. When an input falls on the spatiotemporal boundary, the 3D model can actually perform worse. A poorly designed 3D network could lead to more problems and create difficulties identifying the source of errors during this experimentation stage.
Overfitting with Convolutional LSTMs:
Convolutional LSTMs seem capable of learning a much larger spatiotemporal region than 3D convolutions. This also provides a greater capability to overfit. Without the ability to parameterize the spatiotemporal receptive field like 3D convolutions, it is not possible to directly control which spatiotemporal features the model focuses on. Without a significant amount of data for the model to learn which features to focus on, it could easily overfit to regions of the feature space with a smaller signal-to-noise ratio. 3D convolutions inherently filter information, which may explain their relatively high performance on a lot of real world datasets.
Impact of Task and Loss Function:
Although the greater capabilities of convolutional LSTMs were necessary for a task like speed estimation, the model underperformed for digit classification. There is a difference between learning a problem requiring spatiotemporal features and a temporal problem involving spatial features. Digit classification simply needs to know what happens and not when it happens. Although convolutional LSTMs have greater spatiotemporal capabilities, they may not learn those capabilities as easily as 3D convolutions. Difficulties in training LSTMs compared with convolutions have been frequently discussed in other fields like NLP.
Limitation of Experiments:
These drastic performance differences appear even in this relatively simple MNIST-based video dataset. The digits are in motion yet rigid, with no rotation. They are flat 2D objects, yet most videos contain 2D projections of 3D objects. It is easy to imagine many real datasets containing much more complex spatiotemporal frequencies along with widely varying scales, camera movement, color, and motion blur.
References

[1] Zewen Li, Wenjie Yang, Shouheng Peng, and Fan Liu. A survey of convolutional neural networks: Analysis, applications, and prospects. 2020.
[2] Yoon Kim. Convolutional neural networks for sentence classification. CoRR, abs/1408.5882, 2014.
[3] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9:1735–1780, 1997.
[4] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. CoRR, abs/1706.03762, 2017.
[5] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. CoRR, abs/1212.0402, 2012.
[6] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet large scale visual recognition challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
[7] Jeff Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. Long-term recurrent convolutional networks for visual recognition and description. CoRR, abs/1411.4389, 2014.
[8] Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. CoRR, abs/1406.2199, 2014.
[9] João Carreira, Eric Noland, Chloe Hillier, and Andrew Zisserman. A short note on the Kinetics-700 human action dataset. CoRR, abs/1907.06987, 2019.
[10] João Carreira, Eric Noland, Andras Banki-Horvath, Chloe Hillier, and Andrew Zisserman. A short note about Kinetics-600. CoRR, abs/1808.01340, 2018.
[11] Will Kay, João Carreira, Karen Simonyan, Brian Zhang, Chloe Hillier, Sudheendra Vijayanarasimhan, Fabio Viola, Tim Green, Trevor Back, Paul Natsev, Mustafa Suleyman, and Andrew Zisserman. The Kinetics human action video dataset. CoRR, abs/1705.06950, 2017.
[12] Kensho Hara, Hirokatsu Kataoka, and Yutaka Satoh. Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet? CoRR, abs/1711.09577, 2017.
[13] Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He. SlowFast networks for video recognition. CoRR, abs/1812.03982, 2018.
[14] Xianhang Li, Yali Wang, Zhipeng Zhou, and Yu Qiao. SmallBigNet: Integrating core and contextual views for video classification. In The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.
[15] Themos Stafylakis and Georgios Tzimiropoulos. Combining residual networks with LSTMs for lipreading. CoRR, abs/1703.04105, 2017.
[16] Logan Courtney and Ramavarapu Sreenivas. Using deep convolutional LSTM networks for learning spatiotemporal features. In Shivakumara Palaiahnakote, Gabriella Sanniti di Baja, Liang Wang, and Wei Qi Yan, editors, Pattern Recognition, pages 307–320, Cham, 2020. Springer International Publishing.
[17] Joon Son Chung and Andrew Zisserman. Lip reading in the wild. In Asian Conference on Computer Vision, 2016.
[18] Triantafyllos Afouras, Joon Son Chung, and Andrew Zisserman. ASR is all you need: Cross-modal distillation for lip reading, 2019.
[19] B. Xu, J. Wang, C. Lu, and Y. Guo. Watch to listen clearly: Visual speech enhancement driven multi-modality speech recognition. Pages 1626–1635, 2020.
[20] G. Cybenko. Approximation by superpositions of a sigmoidal function. 1989.
[21] Yann LeCun and Corinna Cortes. MNIST handwritten digit database. 2010.
[22] Nitish Srivastava, Elman Mansimov, and Ruslan Salakhutdinov. Unsupervised learning of video representations using LSTMs. CoRR, abs/1502.04681, 2015.
[23] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.