Learning to Sort Image Sequences via Accumulated Temporal Differences
Gagan Kanojia and Shanmuganathan Raman
Abstract—Consider a set of n images of a scene with dynamic objects captured with a static or a hand-held camera, and let the temporal order in which these images were captured be unknown. There are n! possibilities for the temporal order in which these images could have been captured. In this work, we tackle the problem of temporally sequencing an unordered set of images of a dynamic scene captured with a hand-held camera. We propose a convolutional block which captures the spatial information through a 2D convolution kernel and captures the temporal information by utilizing the differences present among the feature maps extracted from the input images. We evaluate the performance of the proposed approach on a dataset extracted from a standard action recognition dataset, UCF101, and show that the proposed approach outperforms the state-of-the-art methods by a significant margin. We also show that the network generalizes well: when trained on a dataset extracted from UCF101, a dataset meant for action recognition, it performs well when evaluated on a dataset extracted from the DAVIS dataset, a dataset meant for video object segmentation. All the codes and pretrained models will be made publicly available.

Index Terms—Image sequencing, convolutional neural networks.
I. INTRODUCTION
In today's world of digital photography, when a group of people attend an event like a sports match, they are likely to capture their moments of interest. These moments are generally short in duration and dynamic in nature, as they involve moving objects or moving people present in the scene. Analysis of a dynamic scene using still images has long been an active area of research in image processing, computer vision, and machine learning. However, the most common device for capturing such an event is the mobile phone, which is a hand-held device. Even when the images are captured with a single hand-held device, they are prone to misalignment due to reasons like handshake. This makes the problem even more challenging, because apart from dealing with the object motion, the analysis also has to deal with the camera motion. The temporal information of a dynamic scene is an important tool for its analysis and visualization. In case the images are obtained from sources like the internet, there may be no time stamp available, which makes the analysis of dynamic scenes extremely challenging. In [1], the authors showed that the temporal ordering plays an important role in recognizing several classes from the standard action recognition datasets [2], [3].

In the past few years, 2D convolutional neural networks (CNNs) have been dominating several domains of computer vision, like object recognition [4], single-image depth estimation [5], and semantic segmentation [6]. There are several 2D CNN architectures which are being fueled with large still-image datasets like ImageNet [7]. Apart from still images, it has also been shown that 2D CNNs perform quite well when applied to videos [8], [9], [10]: they are applied on the individual frames of the video to perform tasks such as action recognition. However, 2D CNNs fall short in exploiting the 3D structure present in the input. To cope with this issue, researchers moved to the 2.5D approach, which exploits the 3D structure while utilizing 2D convolution kernels [11], [12]. In the 2.5D approach, the network is provided with some higher-level information about the input apart from the RGB channels present in the images. For example, in the case of action recognition, the higher-level information could be optical flow, and in the case of dynamic object detection, it could be semantic maps.

Gagan Kanojia is with Electrical Engineering, Indian Institute of Technology Gandhinagar, Gandhinagar 382355, India (e-mail: [email protected]). Shanmuganathan Raman is with Electrical Engineering and Computer Science and Engineering, Indian Institute of Technology Gandhinagar, Gandhinagar 382355, India (e-mail: [email protected]). The work of Gagan Kanojia was supported by the TCS Research Fellowship. The work of Shanmuganathan Raman was supported by an IMPRINT-2 grant.
Problem Statement.
Consider a set of n images captured from a single or multiple hand-held uncalibrated cameras whose order of capture, i.e., the temporal order, is unknown. In this work, we tackle the challenging problem of image sequencing, in which we recover the unknown temporal order of the unordered set of images. Similar to [13], we formulate the problem of image sequencing as a classification problem in which the classes are all the possible permutations of the temporal order for the given sequence length. The objective is to map the given unordered image sequence to its corresponding permutation. For example, consider an unordered input image sequence $I = \{I_{i_1}, I_{i_2}, I_{i_3}, I_{i_4}, I_{i_5}\}$ of length 5 whose correct order is $\{I_1, I_2, I_3, I_4, I_5\}$. The input sequence can be in any of the $n!$ possible permutations. In this work, we consider the forward and the backward permutations as a single class, similar to [13], [14]. Hence, for a sequence of length 5, there are $5!/2 = 60$ classes. The objective is to map $I$ to its correct permutation.
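As a concrete illustration of this class structure, the following minimal Python sketch (our own hypothetical helper, not the authors' code) enumerates the n!/2 classes by identifying each permutation with its reverse:

```python
from itertools import permutations

def permutation_classes(n):
    """Enumerate the n!/2 classes: each class is a permutation of
    range(n) identified with its reverse, since the forward and the
    backward temporal order form a single class."""
    classes, seen = [], set()
    for p in permutations(range(n)):
        if p not in seen:
            seen.add(p)
            seen.add(p[::-1])
            classes.append(p)
    return classes

assert len(permutation_classes(5)) == 60  # 5!/2 = 60 classes
```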
Contributions. In this work, we propose a novel convolutional block for the task of image sequencing which extracts the spatial information using 2D convolution and the temporal information by exploiting the differences between the feature maps extracted from the input images. We do not provide any higher-level information, such as depth maps or semantic information, as input; we utilize only the raw RGB images. We use ResNet [4] as the back-bone architecture for the proposed convolutional block. We show that the proposed approach outperforms the state-of-the-art methods by a significant margin. The proposed approach can be used as a pre-processing step when the images of a dynamic scene are obtained with no time stamp from sources like the internet or a group of people [15], [16], [17], [18]. In [1], the authors have already shown that the action recognition accuracy drops for several classes when the frames are randomly shuffled. In [19], the authors have shown that even with 3 or 5 frames extracted from a video, a significant accuracy can be obtained for the task of action recognition. Also, in the case of dynamic object detection and/or removal, recent works have used around six images in the input set [15], [16], [17], [18]. Hence, we limit our experiments to up to six images, which is also consistent with the recent works in image sequencing [13], [14].

The major contributions of this work are as follows.
• We propose a novel convolutional block which captures spatial information by performing a 2D convolution and temporal information by exploiting differences between the feature maps extracted from the unordered input images.
• We show that motion plays a key role in image sequencing through the motion heat maps computed using the outputs of the proposed block.
• We show that the network learns to shift its focus to dynamic objects without being trained with any such supervision, by demonstrating the progression of motion heat maps along the depth of the network.
• We show that the network generalizes well by evaluating it on the dataset extracted from the DAVIS dataset, a dataset meant for video object segmentation, when the network is trained with the dataset extracted from UCF101, a dataset meant for action recognition.
• We outperform the state-of-the-art accuracy on the standard dataset used in previous works.

The rest of the paper is organized as follows. Section II discusses the previous works relevant to this work. Section III describes the proposed convolutional block in detail. Section IV discusses the network architecture and the experiments which show the effectiveness of the proposed method; it also discusses the ablation studies performed on the proposed convolutional block to justify the design choices. Section V provides the conclusion.
II. RELATED WORK
In the past few years, 2D CNNs have enjoyed a huge amount of attention and have shown very promising results in several computer vision tasks such as image classification, object detection, and image segmentation [4], [6]. They have been very successful in obtaining rich representations of still images. Many works extended 2D CNNs to operate on spatio-temporal data by extracting features from individual frames and then integrating the information along the temporal dimension [8], [9]. In [20], the authors study different approaches to incorporating the temporal dimension along with the spatial dimensions through a "slow fusion" model which extends the connectivity of the convolutional layers along the temporal dimension.

Fig. 1. Image sequencing. The figure demonstrates the task of image sequencing.

In many applications, the temporal structure of the input plays a very important role. In such cases, sequencing an unordered image sequence could help in better exploitation of the temporal information [21], [22], [23], [24]. In [22], the authors learn to predict the future actions of a person in an ego-centric video by performing two tasks related to temporal reasoning, one of which is the temporal ordering of two given short video snippets. In [24], the authors investigate whether a video is being played in the forward or the backward direction.

The problem of sequencing has been addressed in several scenarios, like temporal ordering of the events in news [25], photo album creation from a jumbled set of images [26], and estimating the 2D rotation applied to images to improve feature representations [27]. In [28], the authors learn video representations by learning to determine whether the input video is in the correct temporal order or not. In [29], instead of using only the images, the authors use image-caption pairs of an event of sequence length 5 and sort them to make a story.

The problem of sequencing images of a dynamic scene captured by a hand-held camera was first addressed by Basha et al. [30], [31]. In their work, in the case of multiple cameras, they assume that the cluster of images belonging to the same camera is known, along with the temporal order of the images captured from the same camera. They also assume that at least two images are captured from almost the same location. In [32], the authors deal with these assumptions by proposing a methodology in which, given a set of images captured from
multiple cameras, they first cluster the images captured with the same camera. After clustering, they sort the images temporally in their order of capture. However, these methods are not learning-based approaches, and they have not been evaluated on large datasets.

The recent works by Lee et al. [13] and Kanojia et al. [14] are the most relevant to ours. Lee et al. [13] propose a learning-based approach to sort an unordered image sequence. They formulate the task of sequencing as a multi-class classification problem in which the classes are all the possible permutations of the temporal order of the input unordered image set. However, they do not feed the images directly to the network: given the images, they extract image regions which exhibit large motion and then feed these regions to the network along with some pre-processing. Unlike Lee et al. [13], Kanojia et al. [14] directly feed the images into their proposed LSTM-based network. They formulate the task of image sequencing as a sequence-to-sequence mapping task, and their network maps each input image to its position in the ordered sequence.

The proposed approach uses only 2D convolution kernels. Since 2D convolution kernels fall short in capturing the temporal information, we adopt a 2.5D approach, in which the RGB channels of the image are appended with some higher-level information that captures the temporal structure of the input [11], [12]. In [12], the authors fuse the input image with its orthogonal views. In [11], the authors extend the dimension of the magnetic resonance volume along the RGB channels to exploit the 3D features. In the proposed approach, we extract the differences among the feature maps along the temporal direction and append them along the channels. Then, we perform 2D convolution to extract the 3D structure present in the input unordered image set.

Temporal differences have been explored in the area of action recognition [19]. Temporal differences can provide the rough locations of the non-rigid bodies performing the action in videos [33], [34]. In [19], the authors propose a motion filter in which they compute the differences only among the feature maps of adjacent frames. However, in our case, we do not have the information regarding the adjacency of the input frames. Also, in [19], the authors perform a 1D convolution on the feature differences and then add them to the previous map, while we adopt a 2.5D approach.

Fig. 2. The figure shows an illustration of the difference accumulator block.
III. PROPOSED CONVOLUTIONAL BLOCK
The proposed convolutional block has three parts: a 2D convolution kernel, a difference accumulator block (DAB), and a 2.5D convolutional block. Let the input feature map to the convolutional block be $F \in \mathbb{R}^{c \times n \times h \times w}$. Here, $c$ is the number of channels, $n$ corresponds to the number of images in the input set, and $h$ and $w$ are the height and width of the feature map $F$, respectively.

A. 2D Convolution
It has been shown in the early works [35], [36] that, in the initial layers, the 2D filters learn to capture information like edges and corners, while in the later layers they capture object-level information of the scene. The idea behind performing 2D convolution is to first obtain rich spatial representations of each image individually. In the proposed framework, we first obtain $F_c$ as shown in Eq. 1:

$F_c = F \ast h_c$  (1)

Here, $\ast$ stands for convolution and $h_c$ is the filter for 2D convolution with kernel size $1 \times k \times k$ and $c$ channels. The reason behind representing the kernel size of the filter $h_c$ with three dimensions is to indicate that we are dealing with multiple images; it is mentioned for the sake of clarity that we are not performing convolution along the temporal direction, as the kernel size along the temporal direction is 1. We apply $h_c$ to the feature maps corresponding to each image, which are provided in an unordered fashion, to obtain their spatial feature maps. Then, we pass $F_c$ through the proposed difference accumulator block (DAB) to obtain the temporal structure of the feature maps.
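For concreteness, a sketch of how such a per-image 2D convolution can be expressed in PyTorch as a 3D convolution with a temporal kernel extent of 1; the channel and spatial sizes below are illustrative placeholders:

```python
import torch
import torch.nn as nn

# 2D convolution applied independently to every image: a 3D convolution
# with a 1 x k x k kernel never mixes information across the n images.
c, n, k = 64, 5, 3
h_c = nn.Conv3d(c, c, kernel_size=(1, k, k), padding=(0, k // 2, k // 2))

F = torch.randn(2, c, n, 28, 28)  # (batch, channels, images, height, width)
F_c = h_c(F)                      # per-image spatial feature maps, same shape
```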
B. Difference Accumulator Block

The core idea behind the difference accumulator block (DAB) is to capture the 3D structure present in the input data. Since we want to find the temporal order, we exploit the changes occurring among the images. The changes can be due to the object motion or the camera motion. In the proposed DAB, we rely on the differences among the feature maps extracted from the images, i.e., the change in the spatial information at different time instances, to extract the necessary temporal information present in the feature maps. These differences can provide rough locations of the non-rigid bodies performing the action in the videos, which can help the network in performing image sequencing [33], [34]. In a general sense, DAB tries to capture the notion of how different the volume of feature maps at the current temporal location is in comparison to the volumes at the other temporal locations.

Let $F_c \in \mathbb{R}^{c \times n \times h \times w}$ be given as the input to DAB. Here, $c$ is the number of channels, $n$ corresponds to the number of images in the input set, and $h$ and $w$ are the height and width of the feature map $F_c$, respectively. We pass $F_c$ through DAB and accumulate the differences of the feature map corresponding to each image of the unordered sequence with the feature maps of the remaining images, i.e., the differences between the volumes along the second dimension of $F_c$. Let $\{F_c^1, F_c^2, F_c^3, \ldots, F_c^n\} \in \mathbb{R}^{c \times 1 \times h \times w}$ be the volumes along the temporal depth of $F_c$. The output of DAB, $F_s \in \mathbb{R}^{c \times n \times h \times w}$, is the concatenation of $\{F_s^1, F_s^2, F_s^3, \ldots, F_s^n\} \in \mathbb{R}^{c \times 1 \times h \times w}$ along the temporal depth, obtained as shown in Eq. 2:

$F_s^i = \sum_{k=i+1}^{n} (F_c^i - F_c^k), \quad \forall\, i = 1, 2, \ldots, n-1$  (2)

Here, $i$ is a location along the temporal dimension. For $i = n$, $F_s^n = F_c^n$. It can be observed that the range of $k$ starts from $i + 1$; this avoids symmetric computations. Fig. 2 shows an illustration of the proposed DAB.
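A minimal PyTorch sketch of Eq. 2 follows; it reflects our reading of the block, not the authors' released implementation:

```python
import torch

def dab(F_c: torch.Tensor) -> torch.Tensor:
    """Difference accumulator block, Eq. 2.

    F_c: (batch, c, n, h, w). For i < n, the i-th output slot accumulates
    the differences of image i's feature map with the maps of all images
    after it; the last slot is passed through unchanged (F_s^n = F_c^n).
    """
    n = F_c.shape[2]
    outs = []
    for i in range(n - 1):
        # sum_{k=i+1}^{n} (F_c^i - F_c^k), vectorized over k
        diffs = F_c[:, :, i:i + 1] - F_c[:, :, i + 1:]
        outs.append(diffs.sum(dim=2, keepdim=True))
    outs.append(F_c[:, :, n - 1:])    # F_s^n = F_c^n
    return torch.cat(outs, dim=2)     # (batch, c, n, h, w)
```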
C. 2.5D Convolutional Block

To capture the temporal information along with the spatial information, we adopt a 2.5D approach, in which the channels containing the spatial information are appended with some higher-level information that captures the temporal structure of the input [11], [12]. In our case, $F_s$ obtained from DAB contains essential information regarding the temporal structure of the input image set, while the feature maps $F_c$ obtained by applying the 2D convolution kernel contain only the spatial information. To exploit both the spatial and the temporal structure of the input, we concatenate $F_c$ and $F_s$ along the channels to obtain $F_{sc} \in \mathbb{R}^{2c \times n \times h \times w}$. Then, we pass $F_{sc}$ through a 2D convolution kernel to obtain the final output of the block.
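Putting the three parts together, a sketch of the complete block, reusing dab from above; the 1 × 1 fusing kernel and the channel widths are our assumptions, since the paper leaves them to the backbone configuration:

```python
import torch
import torch.nn as nn

class ProposedBlock(nn.Module):
    """Sketch of the proposed convolutional block: per-image 2D convolution,
    DAB, channel-wise concatenation, and a fusing 2D convolution."""

    def __init__(self, in_ch: int, out_ch: int, k: int = 3, stride: int = 1):
        super().__init__()
        self.spatial = nn.Conv3d(in_ch, out_ch, (1, k, k),
                                 stride=(1, stride, stride),
                                 padding=(0, k // 2, k // 2))
        # F_sc has 2 * out_ch channels after concatenation; fuse back to out_ch
        self.fuse = nn.Conv3d(2 * out_ch, out_ch, kernel_size=1)

    def forward(self, F: torch.Tensor) -> torch.Tensor:   # (b, c, n, h, w)
        F_c = self.spatial(F)                # spatial features per image
        F_s = dab(F_c)                       # accumulated temporal differences
        F_sc = torch.cat([F_c, F_s], dim=1)  # 2.5D: append temporal channels
        return self.fuse(F_sc)
```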
D. Forward/Backward Propagation

The forward and backward propagation through the proposed convolutional block is straightforward. The first component of the proposed convolutional block is a standard 2D convolution filter, through which gradients can be passed using standard backpropagation. The second component is DAB: it extracts the feature maps from the input feature maps by tensor slicing and then performs subtractions and additions to obtain the output, all of which can be done in a differentiable manner using standard deep learning libraries. Finally, the third component is again a convolution kernel, through which the gradients can be passed using standard backpropagation.
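This can be verified numerically on the DAB sketch above, e.g., with PyTorch's gradient checker:

```python
import torch

# Numerical gradient check of the DAB sketch on a tiny double-precision input
x = torch.randn(1, 2, 4, 3, 3, dtype=torch.double, requires_grad=True)
assert torch.autograd.gradcheck(dab, (x,))
```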
IV. EXPERIMENTS AND DISCUSSIONS
A. Datasets

1) UCF101:
UCF101 is a standard action recognition dataset which contains real-world action videos [37]. It contains 13,320 videos, out of which 9,537 are used for training and 3,783 for testing. It is a diverse dataset which covers 101 action categories. The videos have large camera motion variations, cluttered backgrounds, and varying illumination conditions. It has been used as a benchmark dataset in several works, such as [10], [38], [39]. Lee et al. [13] extract image sequences of lengths 3 and 4 from the training set of split-1 of UCF-101. To extract the image sequences, they estimate the optical flow in the videos and select the image sequences based on the magnitude of the optical flow. Kanojia et al. [14] extend their dataset by including image sequences of lengths 5 and 6: they obtain sequences of length 5 by extracting a frame to the left of the sequences of length 4, and sequences of length 6 by extracting frames to both the left and the right of the sequences of length 4. While extracting the frames, the authors made sure that the temporal spacing between the frames is consistent with the original sequence of length 4. We randomly split the datasets corresponding to each sequence length into training and testing sets containing 70% and 30% of the data, respectively. The videos of UCF101 cover 101 action categories which are grouped into 25 groups, where each group contains 4-7 videos. Videos belonging to the same group can contain common features.
Hence, while splitting, we made sure that the image sequences belonging to the same group fall into the same split, i.e., either the training set or the test set.
2) DAVIS:
The DAVIS dataset is a benchmark dataset in the area of video object segmentation [40]. It contains fifty video sequences with several challenging scenarios like occlusions, appearance variation, and motion blur. The videos are captured with a moving camera and contain single and multiple dynamic objects. We extract a dataset of evenly spaced image sequences of lengths 4 and 6 from the videos. We use this dataset to evaluate the generalization capability of the proposed approach for the task of image sequencing. We do not train the network on the sequences extracted from this dataset; we treat the whole dataset as the test set. We use the network trained on the dataset extracted from UCF101 to estimate the temporal order of the sequences extracted from the DAVIS dataset when provided in an unordered fashion.
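The exact extraction script is not given in the paper; under the assumption that frames is a list of decoded video frames, evenly spaced sequences can be extracted along the following lines:

```python
def extract_sequences(frames, length, step):
    """Return all evenly spaced image sequences of the given length,
    where `step` is the spacing (in frames) between consecutive images."""
    span = (length - 1) * step
    return [frames[start:start + span + 1:step]
            for start in range(len(frames) - span)]

# e.g., length-4 sequences with a spacing of 5 frames:
# sequences = extract_sequences(frames, length=4, step=5)
```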
B. Network Architecture
We use residual networks (ResNet) as the back-bone architecture to show the effectiveness of the proposed convolutional block [4]. Fig. 3 (a) and (b) show the basic and bottleneck blocks used in the ResNet architecture [4]. In each of the residual blocks, we replace the 2D convolution kernel with the proposed convolutional block (in green), as shown in Fig. 3 (c) and (d). We replace only the 2D convolution kernels while keeping the overall structure of the network intact. We perform the experiments with the 18-layer and 50-layer versions of the ResNet architecture. Similar to [14], we train a separate network for each sequence length. The input to the network is an unordered set of images of a dynamic scene captured with a hand-held camera. Similar to [13] and [14], the forward and the backward permutations are considered as a single class. Hence, the classification layer of the network has n!/2 classes, and each class corresponds to a permutation. The objective of the network is to map the input unordered image sequence to its corresponding permutation.

Fig. 3. (a) and (b) show the basic and bottleneck blocks used in the ResNet architecture [4]. (c) and (d) show the residual blocks in which the 2D convolution kernel is replaced by the proposed convolutional block (in green). Here, k, st, out ch, conv, BN, and ReLU stand for kernel size, stride, output channels, convolution, batch normalization, and rectified linear unit, respectively.
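As an illustration of this replacement, a sketch of a basic residual block built around ProposedBlock from Section III; batch normalization is applied over the 3D maps, and the exact normalization placement is our assumption based on Fig. 3:

```python
import torch.nn as nn

class BasicBlockOurs(nn.Module):
    """ResNet basic block with the 2D convolutions replaced by the
    proposed convolutional block; a sketch."""

    def __init__(self, ch: int):
        super().__init__()
        self.block1 = ProposedBlock(ch, ch)
        self.bn1 = nn.BatchNorm3d(ch)
        self.block2 = ProposedBlock(ch, ch)
        self.bn2 = nn.BatchNorm3d(ch)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.block1(x)))
        out = self.bn2(self.block2(out))
        return self.relu(out + x)      # identity shortcut
```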
C. Training

We train the networks with stochastic gradient descent (SGD) for the weight update, with a momentum of 0.9, a weight decay of 0.001, and an initial learning rate of 0.1. We reduce the learning rate by a factor of 0.1 when the validation loss saturates. We use a batch of 16 clips for all the networks. The data augmentation used in training the networks is the same as that used in [14]. Similar to [14], we perform random cropping on the input image sets: the clips are spatially resized such that the shorter edge is scaled to 136 pixels, and then we randomly crop a region of size 112 × 112. The size of each data sample is 3 × n × 112 × 112. Here, 3 is the number of channels, n is the number of images in the input sequence, and 112 is the spatial size of the clips. We normalize the frames by subtracting the mean values and dividing by the variance values of ImageNet [7]. The training sets for sequences of lengths 3, 4, 5, and 6 contain around 87.7K, 87.7K, 85K, and 83K image sequences, respectively. We train the networks by feeding all the permutations of the image sequences in a random order. For example, to train the network for sequence length 6, we feed the network with 83K × 6!/2 ≈ 29.9M unordered image sequences. Similarly, we test the networks with all the permutations of the image sequences in the test sets. We use categorical cross-entropy as the loss function.
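The stated optimization settings translate directly to code. In the sketch below, the scheduler choice and the exact ImageNet statistics are our assumptions, and model is a stand-in for the sorting network:

```python
import torch

model = torch.nn.Linear(10, 60)  # stand-in for the sorting network (60 = 5!/2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=0.001)
# drop the learning rate by a factor of 0.1 when the validation loss saturates
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.1)
criterion = torch.nn.CrossEntropyLoss()  # categorical cross-entropy

# per-channel ImageNet statistics commonly used for frame normalization
mean = torch.tensor([0.485, 0.456, 0.406]).view(3, 1, 1, 1)
std = torch.tensor([0.229, 0.224, 0.225]).view(3, 1, 1, 1)
# clip = (clip - mean) / std   # applied to a (3, n, h, w) clip
```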
D. Comparisons with the State-of-the-art Methods

Table I compares the test classification accuracy obtained on the datasets of unordered image sequences of different sequence lengths extracted from UCF-101 by the proposed approach with the state-of-the-art methods proposed by Lee et al. [13] and Kanojia et al. [14]. All the networks are trained from scratch. Table I shows the accuracy obtained using the proposed approach with ResNet (50 layers) as the backbone architecture. It can be observed that there is a significant improvement over the previous methods in terms of classification accuracy: the proposed approach outperforms the state-of-the-art method by Kanojia et al. [14] by a significant margin, and the margin grows as we move from the sequence length of 3 to 6. This shows that the proposed approach is better at handling longer sequences in comparison to Kanojia et al. [14] and Lee et al. [13]. Fig. 4 shows the results on four test sets of unordered image sequences extracted from the UCF-101 dataset obtained by the proposed network trained on the training set extracted from UCF-101.

Fig. 4. Motion heat maps. The figure shows five test sets (two columns each) of unordered image sequences extracted from the UCF-101 dataset which have been correctly classified by the proposed network trained on the training set extracted from UCF-101. In each set, the first column shows an unordered image set which is given as the input to the proposed network, and the second column shows the order of images provided by the proposed network as the output, along with the motion heat maps computed from the output of the last DAB of the network.

TABLE I
Comparison with the state-of-the-art. The table compares the test classification accuracy (in percentage) obtained on the datasets of unordered image sequences of different sequence lengths extracted from UCF-101 by the proposed approach, with ResNet (50 layers) as the backbone architecture, with the state-of-the-art methods proposed by Lee et al. [13] and Kanojia et al. [14].

  Sequence Length   Lee et al. [13]   Kanojia et al. [14]   Ours
  3                 63                67.18                 —
E. Ablation Study

1) Without DAB:
In this study, we verified the importance of DAB. The output of DAB is $F_s$, which contains the temporal structure of the input obtained by accumulating the differences between the features extracted from the input image set. To check its importance, during training we set $F_s = 0$, i.e., we fill zeros at all positions in $F_s$ in all the layers of the network, while keeping everything else exactly the same. We experimented with the image sequences of lengths 3 and 4 extracted from UCF-101, with ResNet (18 layers) as the back-bone. We observed that the network does not learn anything and gives an accuracy equivalent to the random probability, which is $1/(n!/2)$ for a sequence of length $n$. This shows that extracting the temporal structure is crucial for the task of image sequencing, and it confirms that DAB plays a significant role in the task.
2) Effect of Network Depth:
In this study, we observe the effect of the depth of the back-bone network on the sequences of lengths 3 and 4. We used ResNet with 18 layers and with 50 layers as the back-bone architectures for the proposed convolutional block, replacing the 2D convolution kernel in the residual blocks with the proposed convolutional block, and trained them on the datasets of image sequences of lengths 3 and 4 extracted from UCF-101. Table II compares the test classification accuracy obtained on the datasets of unordered image sequences of lengths 3 and 4 extracted from UCF-101 when trained with the networks of different depths. It can be observed in Table II that the deeper network (ResNet-50) performs better than the shallower network (ResNet-18).

TABLE II
Effect of network depth. The table shows the comparison of test classification accuracy (in percentage) obtained on the datasets of unordered image sequences of lengths 3 and 4 extracted from UCF-101 when trained with the networks of different depths.

  Sequence Length   Back-bone Network    Accuracy
  3                 ResNet (18 layers)   80.94
  3                 ResNet (50 layers)   —

TABLE III
Effect of sign. The table shows the comparison of the test classification accuracy (in percentage) obtained on the dataset of unordered image sequences extracted from UCF-101 by the proposed approach when the network is trained with the difference accumulator block comprising Eq. 2 versus when it is trained with the difference accumulator block comprising Eq. 3. ResNet (18 layers) is used as the back-bone architecture for this experiment.

                     Sequence Length   Accuracy
  Magnitude          4                 60.82
  Sign + Magnitude   4                 —
3) Effect of the Sign of Differences:
In this study, we observe the effect of the sign of the differences computed in DAB. To observe its effect, we modify Eq. 2 as shown in Eq. 3:

$F_s^i = \sum_{k=i+1}^{n} |F_c^i - F_c^k|, \quad \forall\, i = 1, 2, \ldots, n-1$  (3)

Here, $|\cdot|$ stands for the absolute value of the input, and $F_s^i$ is defined in Section III-B. Instead of accumulating the differences with their sign, we only accumulate their magnitudes. We train the network with ResNet (18 layers) as the back-bone on the dataset of unordered image sequences of length 4 extracted from UCF-101 with the modified DAB, i.e., with only the magnitudes of the differences. Table III compares the test classification accuracy obtained on this dataset when the network is trained with DAB comprising Eq. 2 against that obtained when it is trained with DAB comprising Eq. 3. It can be observed that the network performs significantly better with both the sign and the magnitude.

TABLE IV
Effect of varying the number of images for accumulating differences. The table shows the comparison of test classification accuracy (in percentage) obtained on the dataset of unordered image sequences of length 5 extracted from UCF-101 when trained with different values of m (Eq. 4).

  m   Back-bone Network    Accuracy
  0   ResNet (50 layers)   1.667
  1   ResNet (50 layers)   1.667
  2   ResNet (50 layers)   77.16
  n   ResNet (50 layers)   —

TABLE V
Generalizability. The table shows the classification accuracy (in percentage) obtained on the datasets of unordered image sequences of lengths 4 and 6 extracted from the DAVIS dataset using the proposed approach (ResNet as the backbone) when the network is trained on the dataset extracted from UCF-101. The first column shows the spacing (in terms of frames) between the images of the extracted image sequences in the original video. For each sequence length, i.e., 4 and 6, the first column shows the accuracy obtained on sets of unordered image sequences when they are extracted with the corresponding temporal spacing, and the second column shows the number of permutations of image sequences extracted from the DAVIS dataset used for obtaining the classification accuracy. #Perms stands for the number of all the permutations of image sequences.
4) Varying the number of images for accumulation of temporal differences:
In this study, we observe the effect of accumulating the differences of the feature map corresponding to each image with the feature maps of only a fixed number of images ahead of it in the input sequence. Let $\{F_c^1, F_c^2, F_c^3, \ldots, F_c^n\} \in \mathbb{R}^{c \times 1 \times h \times w}$ be the volumes along the temporal depth of $F_c$. In this case, the output of DAB, $F_s \in \mathbb{R}^{c \times n \times h \times w}$, is the concatenation of $\{F_s^1, F_s^2, F_s^3, \ldots, F_s^n\} \in \mathbb{R}^{c \times 1 \times h \times w}$ along the temporal depth, obtained as shown in Eq. 4:

$F_s^i = \sum_{k=i+1}^{\min(i+m,\,n)} (F_c^i - F_c^k), \quad \forall\, i = 1, 2, \ldots, n-1$  (4)

Here, $i$ is a location along the temporal dimension and $m < n$. For $i = n$, $F_s^n = F_c^n$. Table IV shows the effect of varying the number of images used for accumulating the temporal differences, i.e., the value of $m$ in Eq. 4, by comparing the test classification accuracy (in percentage) obtained on the dataset of unordered image sequences of length 5 extracted from UCF-101 when trained with different values of $m$. For a sequence of length 5, there are $5!/2 = 60$ classes. It can be observed that without temporal differences ($m = 0$), the network achieves an accuracy equivalent to the random probability of picking one class among 60, i.e., 0.0167. Even with the temporal differences computed among adjacent images only ($m = 1$), the network still does not learn anything. As we increase the number of images used for the accumulation of temporal differences, the test accuracy increases.
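A sketch of this restricted accumulation (Eq. 4), modifying the dab function of Section III; a sufficiently large m recovers Eq. 2:

```python
import torch

def dab_limited(F_c: torch.Tensor, m: int) -> torch.Tensor:
    """DAB variant of Eq. 4: each feature map accumulates differences with
    at most m feature maps ahead of it."""
    n = F_c.shape[2]
    outs = []
    for i in range(n - 1):
        stop = min(i + m, n - 1)           # 0-indexed inclusive upper bound
        if stop > i:
            diffs = F_c[:, :, i:i + 1] - F_c[:, :, i + 1:stop + 1]
            outs.append(diffs.sum(dim=2, keepdim=True))
        else:                              # m = 0: nothing is accumulated
            outs.append(torch.zeros_like(F_c[:, :, i:i + 1]))
    outs.append(F_c[:, :, n - 1:])         # F_s^n = F_c^n
    return torch.cat(outs, dim=2)
```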
5) Generalizability:
In this study, we evaluate the generalizability of the proposed approach. We want to verify that the proposed approach (ResNet-50 as the backbone) is learning the task of sequencing rather than somehow learning the distribution of the dataset. For this purpose, we use the dataset of image sequences extracted from the DAVIS dataset. We use the networks (ResNet-50 as the backbone) trained on the datasets of sequence lengths 4 and 6 extracted from UCF-101 to obtain the classification accuracy on the datasets of unordered image sequences of lengths 4 and 6 extracted from the DAVIS dataset.

Table V shows the classification accuracy obtained on the dataset of unordered image sequences extracted from the DAVIS dataset. The dataset is extracted in such a way that the images in a sequence are evenly spaced temporally. However, the temporal spacing between the images could affect the classification accuracy of the temporal ordering. Hence, we extract different test sets of image sequences from the DAVIS dataset by varying the number of frames skipped in the video while extracting the image sequences of lengths 4 and 6. Table V shows the variation of the classification accuracy as we change the temporal spacing between the images of a sequence. It can be seen that as we increase the temporal spacing of the image sequences during extraction, the classification accuracy decreases. This is because when the temporal spacing is large, the dynamic objects undergo large motion, which can lead to erroneous ordering. However, considering that we are evaluating the network on a dataset (in this case, DAVIS) which is different from the dataset it was trained with (in this case, UCF101), the obtained accuracy is considerably high. This shows that the proposed approach (ResNet-50 as the backbone) learns the task of image sequencing. Fig. 5 shows seven sets of unordered image sequences extracted from the DAVIS dataset which have been correctly classified by the proposed network trained on the dataset extracted from UCF-101. Fig. 5(a) shows the unordered image sets which are given as the input to the proposed network. Fig. 5(b) shows the order of the image sets provided by the proposed network as the output.

Fig. 5. Generalizability. The figure shows seven test sets (one column each) of unordered image sequences extracted from the DAVIS dataset which have been correctly classified by the proposed approach when the network is trained on the dataset extracted from UCF-101. Each column shows one image set. (a) shows the unordered image sets which are given as the input to the proposed network. (b) shows the order of images provided by the proposed approach as the output, along with the motion heat maps computed from the output of the last DAB of the network.
6) Motion Heat Maps:
We extract heat maps from the output of the last DAB of the network to understand the nature of the feature maps computed by DAB. DAB outputs the volumes of the accumulated temporal differences, i.e., $\{F_s^1, F_s^2, F_s^3, \ldots, F_s^n\}$, corresponding to each of the $n$ input images. We compute the heat map for the $i$-th image by averaging the absolute values of the feature maps belonging to $F_s^i$ along the channels. Figs. 4 and 5(b) show the motion heat maps obtained from the output of the last DAB of the network. It can be observed that DAB focuses on the regions with significant motion. This is significant since we have not provided any motion-related cues to the network.
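This computation amounts to a channel-wise reduction of the DAB output; a sketch:

```python
import torch

def motion_heat_maps(F_s: torch.Tensor) -> torch.Tensor:
    """Per-image motion heat map: average of the absolute values of the
    accumulated-difference maps along the channel dimension.
    F_s: (batch, c, n, h, w) -> heat maps of shape (batch, n, h, w)."""
    return F_s.abs().mean(dim=1)
```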
7) Progression of motion heat maps:
In this study, we observe the progression of the motion heat maps computed using the output of DAB along the depth of the network. For this experiment, we used ResNet (50 layers) as the backbone architecture, trained on the dataset of image sequences of length 6 extracted from UCF101. The ResNet architecture is a sequence of conv1, layer1, layer2, layer3, layer4, and fc, where conv1 is a convolution layer, fc is the fully connected layer, and layer1, layer2, layer3, and layer4 comprise 3, 4, 6, and 3 residual blocks, respectively [4]. We used the output of the DAB placed after conv1 and the outputs of the last DAB of layer1, layer2, layer3, and layer4 to demonstrate the progression of the motion heat maps. Fig. 6 shows this progression along the depth of the ResNet (50 layers) used as the backbone architecture for the proposed convolutional block. The first column shows one of the images of the input image sequence. The second column shows the output of the DAB after conv1 of ResNet [4]. The third, fourth, fifth, and sixth columns show the outputs of the last DAB of layer1, layer2, layer3, and layer4 of the ResNet architecture, respectively.

Fig. 6. Progression of motion heat maps. The figure shows the progression of the motion heat maps computed using the output of DAB along the depth of the ResNet (50 layers) which is used as the backbone architecture for the proposed convolutional block. The first column shows one of the images of the input image sequences. The second column shows the output of the DAB after conv1 of ResNet [4]. The third, fourth, fifth, and sixth columns show the outputs of the last DAB of layer1, layer2, layer3, and layer4 of the ResNet architecture, respectively.
V. CONCLUSION
In this work, we propose a novel convolutional block for the task of image sequencing. We use the residual network architecture as the back-bone for the proposed convolutional block [4]. We outperform the state-of-the-art methods on the standard dataset used in the previous works by a significant margin. Through experiments, we verify the significance of the proposed difference accumulator block (DAB), and we show that the sign of the differences of the feature maps holds important information. We also show that the proposed approach generalizes well by evaluating it on the DAVIS dataset, on which the network has not been trained. Generalizability has been a major concern in deep learning for a long time: networks trained on one dataset often do not perform well on datasets they have not been trained with, even when the task is the same and quite general, like the estimation of optical flow or semantic segmentation. The proposed approach is observed to overcome this issue for the task of image sequencing.
REFERENCES

[1] L. Sevilla-Lara, S. Zha, Z. Yan, V. Goswami, M. Feiszli, and L. Torresani, "Only time can tell: Discovering temporal data for temporal modeling," arXiv preprint arXiv:1907.08340, 2019.
[2] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev et al., "The kinetics human action video dataset," arXiv preprint arXiv:1705.06950, 2017.
[3] R. Goyal, S. E. Kahou, V. Michalski, J. Materzynska, S. Westphal, H. Kim, V. Haenel, I. Fruend, P. Yianilos, M. Mueller-Freitag et al., "The "something something" video database for learning and evaluating visual common sense," in Proceedings of the IEEE International Conference on Computer Vision, vol. 1, no. 4, 2017, p. 5.
[4] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770-778.
[5] T. Zhou, M. Brown, N. Snavely, and D. G. Lowe, "Unsupervised learning of depth and ego-motion from video," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1851-1858.
[6] K. He, G. Gkioxari, P. Dollár, and R. Girshick, "Mask R-CNN," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2961-2969.
[7] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2009, pp. 248-255.
[8] Z. Xu, Y. Yang, and A. G. Hauptmann, "A discriminative CNN video representation for event detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1798-1807.
[9] R. Girdhar, D. Ramanan, A. Gupta, J. Sivic, and B. Russell, "ActionVLAD: Learning spatio-temporal aggregation for action classification," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 971-980.
[10] D. Tran, J. Ray, Z. Shou, S.-F. Chang, and M. Paluri, "ConvNet architecture search for spatiotemporal feature learning," arXiv preprint arXiv:1708.05038, 2017.
[11] R. Alkadi, A. El-Baz, F. Taher, and N. Werghi, "A 2.5D deep learning-based approach for prostate cancer detection on T2-weighted magnetic resonance imaging," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 0-0.
[12] H. R. Roth, L. Lu, A. Seff, K. M. Cherry, J. Hoffman, S. Wang, J. Liu, E. Turkbey, and R. M. Summers, "A new 2.5D representation for lymph node detection using random sets of deep convolutional neural network observations," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2014, pp. 520-527.
[13] H.-Y. Lee, J.-B. Huang, M. Singh, and M.-H. Yang, "Unsupervised representation learning by sorting sequences," in IEEE International Conference on Computer Vision. IEEE, 2017, pp. 667-676.
[14] G. Kanojia and S. Raman, "DeepImSeq: Deep image sequencing for unsynchronized cameras," Pattern Recognition Letters, vol. 117, pp. 9-15, 2019.
[15] ——, "Simultaneous detection and removal of dynamic objects in multi-view images," in The IEEE Winter Conference on Applications of Computer Vision, 2020, pp. 1990-1999.
[16] ——, "Patch-based detection of dynamic objects in CrowdCam images," The Visual Computer, vol. 35, no. 4, pp. 521-534, 2019.
[17] N. Zarrabi, S. Avidan, and Y. Moses, "CrowdCam: Dynamic region segmentation," arXiv preprint arXiv:1811.11455, 2018.
[18] A. Dafni, Y. Moses, S. Avidan, and T. Dekel, "Detecting moving regions in CrowdCam images," Computer Vision and Image Understanding, vol. 160, pp. 36-44, 2017.
[19] M. Lee, S. Lee, S. Son, G. Park, and N. Kwak, "Motion feature network: Fixed motion filter for action recognition," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 387-403.
[20] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei, "Large-scale video classification with convolutional neural networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1725-1732.
[21] R. Baker, M. Dexter, T. E. Hardwicke, A. Goldstone, and Z. Kourtzi, "Learning to predict: Exposure to temporal sequences facilitates prediction of future events," Vision Research, vol. 99, pp. 124-133, 2014.
[22] Y. Zhou and T. L. Berg, "Temporal perception and prediction in ego-centric video," in IEEE International Conference on Computer Vision, 2015, pp. 4498-4506.
[23] T.-H. K. Huang, F. Ferraro, N. Mostafazadeh, I. Misra, A. Agrawal, J. Devlin, R. Girshick, X. He, P. Kohli, D. Batra et al., "Visual storytelling," in Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2016, pp. 1233-1239.
[24] L. C. Pickup, Z. Pan, D. Wei, Y. Shih, C. Zhang, A. Zisserman, B. Scholkopf, and W. T. Freeman, "Seeing the arrow of time," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 2035-2042.
[25] I. Mani and B. Schiffman, "Temporally anchoring and ordering events in news," Time and Event Recognition in Natural Language. John Benjamins, 2005.
[26] F. Sadeghi, J. R. Tena, A. Farhadi, and L. Sigal, "Learning to select and order vacation photographs," in Applications of Computer Vision (WACV), 2015 IEEE Winter Conference on. IEEE, 2015, pp. 510-517.
[27] M. Noroozi and P. Favaro, "Unsupervised learning of visual representations by solving jigsaw puzzles," in European Conference on Computer Vision. Springer, 2016, pp. 69-84.
[28] I. Misra, C. L. Zitnick, and M. Hebert, "Shuffle and learn: Unsupervised learning using temporal order verification," in European Conference on Computer Vision. Springer, 2016, pp. 527-544.
[29] H. Agrawal, A. Chandrasekaran, D. Batra, D. Parikh, and M. Bansal, "Sort story: Sorting jumbled images and captions into stories," arXiv preprint arXiv:1606.07493, 2016.
[30] T. Basha, Y. Moses, and S. Avidan, "Photo sequencing," in European Conference on Computer Vision. Springer, 2012, pp. 654-667.
[31] Y. Moses, S. Avidan et al., "Space-time tradeoffs in photo sequencing," in Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 977-984.
[32] G. Kanojia, S. R. Malireddi, S. C. Gullapally, and S. Raman, "Who shot the picture and when?" in International Symposium on Visual Computing. Springer, 2014, pp. 438-447.
[33] D. Park, C. L. Zitnick, D. Ramanan, and P. Dollár, "Exploring weak stabilization for motion feature extraction," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 2882-2889.
[34] H. Wang, M. M. Ullah, A. Klaser, I. Laptev, and C. Schmid, "Evaluation of local spatio-temporal features for action recognition," in BMVC 2009 - British Machine Vision Conference. BMVA Press, 2009, pp. 124-1.
[35] M. D. Zeiler and R. Fergus, "Visualizing and understanding convolutional networks," in European Conference on Computer Vision. Springer, 2014, pp. 818-833.
[36] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097-1105.
[37] K. Soomro, A. R. Zamir, and M. Shah, "UCF101: A dataset of 101 human actions classes from videos in the wild," arXiv preprint arXiv:1212.0402, 2012.
[38] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, "Learning spatiotemporal features with 3D convolutional networks," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 4489-4497.
[39] A. Diba, M. Fayyaz, V. Sharma, M. Mahdi Arzani, R. Yousefzadeh, J. Gall, and L. Van Gool, "Spatio-temporal channel correlation networks for action classification," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 284-299.
[40] F. Perazzi, J. Pont-Tuset, B. McWilliams, L. Van Gool, M. Gross, and A. Sorkine-Hornung, "A benchmark dataset and evaluation methodology for video object segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 724-732.