Learning Temporal Embeddings for Complex Video Analysis
Vignesh Ramanathan, Kevin Tang, Greg Mori and Li Fei-Fei
Stanford University, Simon Fraser University
{vigneshr, kdtang}@cs.stanford.edu, [email protected], [email protected]

Abstract
In this paper, we propose to learn temporal embeddings of video frames for complex video analysis. Large quantities of unlabeled video data can be easily obtained from the Internet. These videos possess the implicit weak label that they are sequences of temporally and semantically coherent images. We leverage this information to learn temporal embeddings for video frames by associating frames with the temporal context that they appear in. To do this, we propose a scheme for incorporating temporal context based on past and future frames in videos, and compare this to other contextual representations. In addition, we show how data augmentation using multi-resolution samples and hard negatives helps to significantly improve the quality of the learned embeddings. We evaluate various design decisions for learning temporal embeddings, and show that our embeddings can improve performance for multiple video tasks such as retrieval, classification, and temporal order recovery in unconstrained Internet video.
1. Introduction
Video data is plentiful and a ready source of information: what can we glean from watching massive quantities of videos? At a fine granularity, consecutive video frames are visually similar due to temporal coherence. At a coarser level, consecutive video frames are visually distinct but semantically coherent.

Learning from this semantic coherence present in video at the coarser level is the main focus of this paper. Purely from unlabeled video data, we aim to learn embeddings for video frames that capture semantic similarity by using the temporal structure in videos. The prospect of learning a generic embedding for video frames holds promise for a variety of applications ranging from generic retrieval and similarity measurement, video recommendation, to automatic content creation such as summarization or collaging. In this paper, we demonstrate the utility of our video frame embeddings for several tasks such as video retrieval, classification and temporal order recovery.

Figure 1. The temporal context of a video frame is crucial in determining its true semantic meaning. For instance, consider the above example where the embeddings of different semantic classes are shown in different colors. The middle frames from the two wedding videos correspond to the visually dissimilar classes of "church ceremony" and "court ceremony". However, by observing the similarity in their temporal contexts we expect them to be semantically closer. Our work leverages such powerful temporal context to learn semantically rich embeddings.

The idea of leveraging sequential data to learn embeddings in an unsupervised fashion is well explored in the Natural Language Processing (NLP) community. In particular, distributed word vector representations such as word2vec [20] have the unique ability to encode regularities and patterns surrounding words, using large amounts of unlabeled data. In the embedding space, this brings together words that may be very different, but which share similar contexts in different sentences. This is a desirable property we would like to extend to video frames as well, as shown in Fig. 1. We would like to have a representation for frames which captures the semantic context around the frame, beyond the visual similarity obtained from temporal coherence.

However, the task of embedding frames poses multiple challenges specific to the video domain: 1. Unlike words, the set of frames across all videos is not discrete, and quantizing the frames leads to a loss in information; 2. Temporally proximal frames within the same video are often visually similar and might not provide useful contextual information; 3. The correct representation of context surrounding a frame is not obvious in videos. The main contribution of our work is to propose a new ranking-loss-based embedding framework, along with a contextual representation specific to videos. We also develop a well-engineered data augmentation strategy to promote visual diversity among the context frames used for embedding.

We evaluate our learned embeddings on the standard tasks of video event retrieval and classification on the TRECVID MED 2011 [25] dataset, and compare to several recently published spatial and temporal video representations [5, 30]. Aside from semantic similarity, the learned embeddings capture valuable information in terms of the temporal context shared between frames. Hence, we also evaluate our embeddings on two related tasks: 1. temporal frame retrieval, and 2.
temporal order recovery in videos. Our embeddings improve performance on all tasks, and serve as a powerful representation for video frames.
2. Related Work
Video features.
Standard tasks in video such as classification and retrieval require a well-engineered feature representation, with many proposed in the literature [1, 6, 11, 18, 22, 23, 24, 26, 29, 37, 36]. Deep network features learned from spatial data [8, 12, 30] and temporal flow [30] have also shown comparable results. However, recent works in complex event recognition [38, 41] have shown that spatial Convolutional Neural Network (CNN) features learned from ImageNet [2] without fine-tuning on video, accompanied by suitable pooling and encoding strategies, achieve state-of-the-art performance. In contrast to these methods, which either propose handcrafted features or learn feature representations with a fully supervised objective from images or videos, we try to learn an embedding in an unsupervised fashion. Moreover, our learned features can be extended to other tasks beyond classification and retrieval.

There are several works which improve complex event recognition by combining multiple feature modalities [10, 21, 33]. Another related line of work is the use of sub-events defined manually [5], or clustered from data [17], to improve recognition. Similarly, Yang et al. used low-dimensional features from deep belief nets and sparse coding [39]. While these methods are targeted towards building features specifically for classification in limited settings, we propose a generic video frame representation which can capture semantic and temporal structure in videos.
Unsupervised learning in videos.
Learning features with unsupervised objectives has been a challenging task in the image and video domain [7, 19, 34]. Notably, [19] develops an Independent Subspace Analysis (ISA) model for feature learning using unlabeled video. Recent work from [4] also hints at a similar approach to exploit the slowness prior in videos. Also, recent attempts extend such autoencoder techniques for next frame prediction in videos [28, 32]. These methods try to capitalize on the temporal continuity in videos to learn an LSTM [40] representation for frame prediction. In contrast to these methods, which aim to provide a unified representation for a complete temporal sequence, our work provides a simple yet powerful representation for independent video frames and images.
Embedding models.
The idea of embedding words into a dense lower-dimensional vector space has been prevalent in the NLP community. The word2vec model [20] tries to learn embeddings such that words with similar contexts in sentences are closer to each other. A related idea in computer vision is the embedding of text in the semantic visual space attempted by [3, 15], based on large image datasets labeled with captions or class names. While these methods focus on different scenarios for embedding text, the aim of our work is to generate an embedding for video frames.
3. Our Method
Given a large collection of unlabeled videos, our goal is to leverage their temporal structure to learn an effective embedding for video frames. We wish to learn an embedding such that the context frames surrounding each target frame can determine the representation of the target frame, similar to the intuition from word2vec [20]. For example, in Fig. 1, context such as "crowd" and "cutting the cake" provides valuable information about the target "ceremony" frames that occur in between. This idea is fundamental to our embedding objective and helps in capturing semantic and temporal interactions in video.

While the idea of representing frames by embeddings is attractive, the extension from language to visual data is not straightforward. Unlike language, we do not have a natural, discrete vocabulary of words. This prevents us from using a softmax objective as in the case of word2vec [20]. Further, consecutive frames in videos often share visual similarity due to temporal coherence. Hence, a naive extension of [20] does not lead to good vector representations of frames.

To overcome the lack of discrete words, we use a ranking loss which explicitly compares multiple pairs of frames across all videos in the dataset. This ensures that the context in a video scores the target frame higher than others in the dataset. We also handle the problem of visually similar frames in temporally smooth videos through a carefully designed sampling mechanism. We obtain context frames by sampling the video at multiple temporal scales, and choose hard negatives from the same video.

Figure 2. Visualizations of the temporal context of frames used in: (a) our model (full), (b) our model (no future), and (c) our model (no temporal). Green boxes denote target frames, magenta boxes denote contextual frames, and red boxes denote negative frames.
We are given a collection of videos V, where each video v ∈ V is a sequence of frames {s_{v1}, ..., s_{vn}}. We wish to obtain an embedding f_{vj} for each frame s_{vj}. Let f_{vj} = f(s_{vj}; W_e) be the temporal embedding function which maps the frame s_{vj} to a vector. The model embedding parameters are given by W_e, and will be learned by our method. We embed the frames such that the context frames around the target frame predict the target frame better than other frames. The model is learned by minimizing the sum of objectives across all videos. Our embedding loss objective is shown below:

    J(W_e) = \sum_{v \in \mathcal{V}} \sum_{s_{vj} \in v,\, s_- \neq s_{vj}} \max\big(0,\, 1 - (f_{vj} - f_-) \cdot h_{vj}\big),    (1)

where f_- is the embedding of a negative frame s_-, and the context surrounding the frame s_{vj} is represented by the vector h_{vj}. Note that unlike the word vector embedding models in word2vec [20], we do not use an additional linear layer for softmax prediction on top of the context vector.

As we verify later in the experiments, the choice of context is crucial to learning good embeddings. A video frame at any time instant is semantically correlated with both past and future frames in the video. Hence, a natural choice for context representation would involve a window of frames centered around the target frame, similar to the skip-gram idea used in word2vec [20]. Along these lines, we propose a context representation given by the average of the frame embeddings around the target frame. Our context vector h_{vj} for a frame s_{vj} is:

    h_{vj} = \frac{1}{2T} \sum_{t=1}^{T} \big( f_{v(j+t)} + f_{v(j-t)} \big),    (2)

where T is the window size, and f_{vj} is the embedding of the frame s_{vj}. This embedding model is shown in Fig. 2(a). For negatives, we use frames from other videos as well as frames from the same video which are outside the temporal window, as explained in Sec. 3.4.

Two important characteristics of this context representation are that it 1. makes use of the temporal order in which frames occur and 2. considers contextual evidence from both past and future. In order to examine their effect on the quality of the learned embedding, we also consider two weaker variants of the context representation below.
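Before turning to these variants, the following minimal NumPy sketch evaluates the hinge term of Eq. (1) for a single target frame using the symmetric context of Eq. (2). It is illustrative only: the margin of 1 follows the equation as reconstructed here, frame embeddings are assumed pre-computed, and boundary handling for targets near the start or end of a video is ignored.

```python
import numpy as np

def context_vector(frame_embs, j, T):
    """Symmetric context of Eq. (2): mean of the T past and T future
    frame embeddings around target index j (boundaries not handled)."""
    window = [frame_embs[j + t] for t in range(1, T + 1)] + \
             [frame_embs[j - t] for t in range(1, T + 1)]
    return np.mean(window, axis=0)

def ranking_loss(frame_embs, j, neg_embs, T, margin=1.0):
    """Hinge terms of Eq. (1) for one target frame: the context vector should
    score the target embedding higher than every negative embedding."""
    h = context_vector(frame_embs, j, T)
    f_target = frame_embs[j]
    return sum(max(0.0, margin - np.dot(f_target - f_neg, h))
               for f_neg in neg_embs)
```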
Our model (no future). This one-sided contextual representation tries to predict the embedding of a frame in a video based only on the embeddings of frames from the past, as shown in Fig. 2(b). For a frame s_{vj}, the context h^{nofuture}_{vj} is given by:

    h^{nofuture}_{vj} = \frac{1}{T} \sum_{t=1}^{T} f_{v(j-t)},    (3)

where T is the window size.
Our model (no temporal). An even weaker variant of context representation is simple co-occurrence without temporal information. We also explore a contextual representation which completely neglects the temporal ordering of frames and treats a video as a bag of frames. The context h^{notemp}_{vj} for a target frame s_{vj} is sampled from the embeddings corresponding to all other frames in the same video:

    h^{notemp}_{vj} \in \{ f_{vk} \mid k \neq j \}.    (4)

This contextual representation is visualized in Fig. 2(c).

In the previous sections, we introduced a model for representing context, and now move on to discuss the embedding function f(s_{vj}; W_e). In practice, the embedding function can be a CNN built from the frame pixels, or any underlying image or video representation. However, following the recent success of ImageNet-trained CNN features for complex event videos [38, 41], we choose to learn an embedding on top of the fully connected fc6 layer feature representation obtained by passing the frame through a standard CNN [16] architecture. We use a simple model with a fully connected layer followed by a rectified linear unit (ReLU) and local response normalization (LRN) layer, with dropout regularization. In this architecture, the learned model parameters W_e correspond to the weights and bias of our affine layer.
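A minimal forward pass of this embedding function might look like the sketch below. The 4096-dimensional fc6 input matches the standard architecture of [16]; the output dimension is a placeholder, since the exact embedding dimension is not given here, and the LRN and dropout used during training are omitted.

```python
import numpy as np

class FrameEmbedder:
    """Affine layer + ReLU on top of pre-computed fc6 CNN features.
    W_e = (W, b) are the only parameters learned by the temporal objective."""

    def __init__(self, in_dim=4096, out_dim=256, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(scale=0.01, size=(out_dim, in_dim))
        self.b = np.zeros(out_dim)

    def __call__(self, fc6_feature):
        # f(s; W_e): LRN and dropout applied at training time are omitted here.
        return np.maximum(0.0, self.W @ fc6_feature + self.b)
```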
Figure 3. Multi-resolution sampling and hard negatives used in our full context model (T = 1). For a target frame (green), we sample context frames (magenta) at varying resolutions, as shown by the rows in this figure. We take hard negatives as examples in the same video that fall outside the context window (red).

We found that a careful strategy for sampling context frames and negatives is important to learning high-quality embeddings in our models. This helps both in handling the problem of temporal smoothness and in preventing the model from overfitting to less interesting video-specific properties.
Multi-resolution sampling.
Complex events progress at different paces within different videos. Densely sampling frames in slowly changing videos can lead to context windows comprised of frames that are visually very similar to the target frame. On the other hand, a sparse sampling of fast videos could lead to context windows composed only of disjoint frames from unrelated parts of the video. We overcome these problems through multi-resolution sampling, as shown in Fig. 3. For every target frame, we sample context frames at multiple temporal resolutions. This ensures a good trade-off between visual variety and semantic relatedness in the context windows.
Hard negatives.
The context frames, as well as the target to be scored, are chosen from the same video. This causes the model to cluster frames from the same video based on less interesting video-specific properties such as lighting, camera characteristics and background, without learning anything semantically meaningful. We avoid such problems by choosing hard negatives from within the same video as well. Empirically, this improves performance for all tasks. The negatives are chosen from outside the range of the context window within a video, as depicted in Fig. 3.
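One possible realization of the sampling strategy of Fig. 3 is sketched below: for each target frame we draw context windows at several temporal strides and pick hard negatives from the same video outside the current window. The strides and negative counts are illustrative assumptions, not values specified in the paper.

```python
import random

def sample_training_tuples(num_frames, j, T, strides=(1, 3, 5), n_neg=4):
    """For target index j in a video of num_frames frames, return a list of
    (context_indices, hard_negative_indices) at several temporal resolutions."""
    tuples = []
    for s in strides:
        ctx = [j + t * s for t in range(1, T + 1)] + \
              [j - t * s for t in range(1, T + 1)]
        ctx = [k for k in ctx if 0 <= k < num_frames]
        # Hard negatives: frames of the same video outside the context span.
        lo, hi = j - T * s, j + T * s
        outside = [k for k in range(num_frames) if k < lo or k > hi]
        negs = random.sample(outside, min(n_neg, len(outside)))
        tuples.append((ctx, negs))
    return tuples
```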
The context window size was set to T = 2, and the embedding dimension to . The learning rate was set to and gradually annealed in steps of . The training is typically completed within a day on 1 GPU with Caffe [9] for a dataset of approximately videos. All videos were first down-sampled to fps before training. The embedding code, as well as the learned models and video embeddings, will be made publicly available upon publication.
4. Experimental Setup
Our embeddings are aimed at capturing semantic and temporal interactions within complex events in a video, and thus we require a generic set of videos with a good variety of actions and sub-events within each video. Most standard datasets such as UCF-101 [31] and Sports-1M [12] are comprised of short video clips capturing a single sports action, making them unsuitable for our purpose. Fortunately, the TRECVID MED 2011 [25] dataset provides a large set of diverse videos collected directly from YouTube. More importantly, these videos are not simple single-clip videos; rather, they are complex events with rich interactions between various sub-events within the same video [5]. Specifically, we learn our embeddings on the complete MED11 DEV and TEST sets. A subset of videos from the DEV and TEST sets was used for validation. The DEV and TEST sets are typical random assortments of YouTube videos with minimal constraints.

We compare our embeddings against different video representations for three video tasks: video retrieval, complex event classification, and temporal order recovery. All experiments are performed on the MED11 event kit videos, which are completely disjoint from the training and validation videos used for learning our embeddings. The event kit is composed of event classes with approximately videos per event.

We stress that the embeddings are learned in a completely unsupervised setting and capture the temporal and semantic structure of the data. We do not tune them specifically to any event class, and only a small fraction of the DEV and TEST sets contain videos from each category. This is not unreasonable, since any large unlabeled video dataset is expected to contain a small fraction of videos from every event.
5. Video Retrieval
In retrieval tasks, we are given a query, and the goal is to retrieve a set of related examples from a database. We start by evaluating our embeddings on two types of retrieval tasks: event retrieval and temporal retrieval. The retrieval tasks help to evaluate the ability of our embeddings to group together videos belonging to the same semantic event class and frames that are temporally coherent.
In the event retrieval scenario, we are given a query video from the MED11 event kit and our goal is to retrieve videos that contain the same event from the remaining videos in the event kit. For each video in the event kit, we sort all other videos in the dataset based on their similarity to the query video using the cosine similarity metric, which we found to work best for all representations. We use Average Precision (AP) to measure the retrieval performance of each video and report the mean Average Precision (mAP) over all videos in Tab. 1. For all methods, we uniformly sample frames per video and represent the video as an average of the features extracted from them. The different baseline methods used for comparison are explained below:

• Two-stream pre-trained: We use the two-stream CNN from [30] pre-trained on the UCF-101 dataset. The models were used to extract spatial and temporal features from the video with a temporal stack size of 5.
• fc6 and fc7: Features extracted from the ReLU layers following the corresponding fully connected layers of a standard CNN model [16] pre-trained on ImageNet.
• Our model (no temporal): Our model trained with no temporal context (Fig. 2(c)).
• Our model (no future): Our model trained with no future context (Fig. 2(b)) but with multi-resolution sampling and hard negatives.
• Our model (no hard neg.): Our model trained without hard negatives from the same video.
• Our model: Our full model trained with multi-resolution sampling and hard negatives.

Table 1. Event retrieval results on the MED11 event kits.
Method                         mAP (%)
Two-stream pre-trained [30]    20.09
fc6                            20.08
fc7                            21.24
Our model (no temporal)        21.92
Our model (no future)          21.30
Our model (no hard neg.)       24.22
Our model                      25.07

We observe that our full model outperforms other representations for event retrieval. We note that, in contrast to most other representations trained on ImageNet, our model is capable of being trained with large quantities of unlabeled video, which is easy to obtain. This confirms our hypothesis that learning from unlabeled video data can improve feature representations. While the two-stream model also has the advantage of being trained specifically on a video dataset, we observe that the learned representations do not transfer favorably to the MED11 dataset, in contrast to fc7 and fc6 features trained on ImageNet. A similar observation was made in [38, 41], where simple CNN features trained on ImageNet consistently provided the best results.

Our embeddings capture the temporal regularities and patterns in videos without the need for expensive labels, which allows us to more effectively represent the semantic space of events. The performance gain of our full context model over the representation without temporal order shows the need for utilizing temporal information while learning the embeddings.
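The retrieval protocol above is straightforward to reproduce: each video is the average of its sampled frame features, queries are scored against the database with cosine similarity, and AP is averaged over all queries. The sketch below assumes pre-computed per-video frame features and binary relevance labels; the number of sampled frames per video is elided in the text.

```python
import numpy as np

def video_feature(frame_feats):
    """Represent a video as the average of its sampled frame features."""
    return np.mean(frame_feats, axis=0)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def average_precision(query_feat, db_feats, relevant):
    """AP for one query; relevant[i] is 1 if database video i shares the event."""
    scores = [cosine(query_feat, f) for f in db_feats]
    order = np.argsort(scores)[::-1]
    hits, ap = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if relevant[i]:
            hits += 1
            ap += hits / rank
    return ap / max(1, sum(relevant))
```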
Figure 4. t-SNE plot of the semantic space for (a) fc7 and (b) our embedding. The different colors correspond to different events.

Figure 5. t-SNE visualization of words from synopses describing MED11 event kit videos. Each word is represented by the average of our embeddings corresponding to the videos associated with the word. We show sample video frames for a subset of the words.
Visualizing the embedding space.
To gain a better qualitative understanding of our learned embedding space, we use t-SNE [35] to visualize the embeddings in a 2D space. In Fig. 4, we visualize the fc7 features and our embedded features by sampling a random set of videos from the event kits. The different colors in the graph correspond to the different event classes, as listed in the figure. Visually, we can see that certain event classes such as "Grooming an animal", "Changing a vehicle tire", and "Making a sandwich" enjoy better clustering in our embedded framework as opposed to the fc7 representation.

Figure 6. The retrieval results for fc7 (last two columns) and our embedding (middle two columns). The first column shows the query frame and event, while the top 2 frames retrieved from the remaining videos are shown in the middle two columns for our embedding, and in the last two columns for fc7. The incorrect frames are highlighted in red, and correct frames in green.

Another way to visualize this space is in terms of the actual words used to describe the events. Each video in the MED11 event kits is associated with a short synopsis describing the video. We represent each word from this synopsis collection by averaging the embeddings of the videos associated with that word. The features are then used to produce a t-SNE plot as shown in Fig. 5. We avoid noisy clustering due to simple co-occurrence of words by only plotting words which do not frequently co-occur in the same synopsis. We observe many interesting patterns. For instance, objects such as "river", "pond" and "ocean", which provide the same context for a "fishing" event, are clustered together. Similarly, crowded settings such as "bollywood", "military", and "carnival" are clustered together. This provides a visual clustering of the words based on shared semantic temporal context.
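The word-level visualization can be reproduced with a few lines: each synopsis word is represented by the mean embedding of the videos it describes, and the resulting vectors are projected with t-SNE. The sketch below uses scikit-learn's TSNE and assumes a word-to-video index built from the synopses (both the index and the parameter choices are assumptions).

```python
import numpy as np
from sklearn.manifold import TSNE

def word_points(word_to_videos, video_embs):
    """word_to_videos: dict word -> list of video ids; video_embs: dict id -> vector.
    Returns the word list and its 2-D t-SNE coordinates (assumes enough words
    for the default perplexity)."""
    words = sorted(word_to_videos)
    X = np.stack([np.mean([video_embs[v] for v in word_to_videos[w]], axis=0)
                  for w in words])
    coords = TSNE(n_components=2, random_state=0).fit_transform(X)
    return words, coords
```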
Event retrieval examples.
We visualize the top frames retrieved for a few query frames from the event kit videos in Fig. 6. The query frame is shown in the first column along with the event class corresponding to the video. The top 2 frames retrieved from other videos are shown in the middle two columns for our embedding and in the last two columns for fc7, for each query video.

We observe a few interesting examples where the query appears visually distinct from the results retrieved by our embedding. These can be explained by noting that the retrieved actions might co-occur in the same context as the query, which is captured by the temporal context in our model. For instance, the frame of a "bride near a car" retrieves frames of a "couple kissing". Similarly, the frame of "kneading dough" retrieves frames of "spreading butter".
In the temporal retrieval task, we test the ability of our embedding to capture the temporal structure in videos. We sample four frames from different time instants in a video and try to retrieve the frames in between the middle two frames. This is an interesting task which has potential for commercial applications such as ad placement in video search engines. For instance, the context at any time instant in a video can be used to retrieve the most suited video ad from a pool of video ads, to blend into the original video.

For this experiment, we use a subset of videos from the MED11 event kits which are at least seconds long. From each video, we uniformly sample context frames, positive frames from in between the middle two context frames, and negative distractors from the remaining segments of the video. In addition to the negative distractors from the same video, all frames from other videos are also treated as negative distractors. For each video, given the context frames, we evaluate our ability to retrieve the positive frames from this large pool of distractors.

We retrieve frames based on their cosine similarity to the average of the features extracted from the context frames, and use mean Average Precision (mAP) as before. We use the same baselines as in the event retrieval task. The results are shown in Tab. 2.

Table 2. Temporal retrieval results on the MED11 event kits.
Method                         mAP (%)
Two-stream pre-trained [30]    20.11
fc6                            19.27
fc7                            22.99
Our model (no temporal)        22.50
Our model (no future)          21.71
Our model (no hard neg.)       24.12
Our model                      26.74

Our embedding representation, which is trained to capture temporal structure in videos, is seen to outperform the other representations. This also shows its ability to capture long-term interactions between events occurring at different instants of a video.
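Concretely, the temporal retrieval query can be built and scored as in the short sketch below: the query is the average feature of the sampled context frames, and all candidates (in-video positives plus in-video and cross-video distractors) are ranked by cosine similarity before computing AP as in event retrieval. The exact counts of context, positive, and distractor frames are left unspecified here, as above.

```python
import numpy as np

def temporal_query(context_feats):
    """Query vector: average of the features of the sampled context frames."""
    return np.mean(context_feats, axis=0)

def rank_candidates(query, candidate_feats):
    """Return candidate indices sorted by decreasing cosine similarity to the query."""
    q = query / (np.linalg.norm(query) + 1e-8)
    sims = [float(np.dot(q, c / (np.linalg.norm(c) + 1e-8))) for c in candidate_feats]
    return list(np.argsort(sims)[::-1])
```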
Temporal retrieval examples. We visualize the top examples retrieved for a few temporal queries in Fig. 7. Here, we can see just how difficult this task is, as often frames that seem to be viable options for temporal retrieval are not part of the ground truth. For instance, in the "sandwich" example, our embedding wrongly retrieves frames of human hands to keep up with the temporal flow of the video.
6. Complex Event Classification
Figure 7. The retrieval results for our embedding model on the temporal retrieval task. The first and last 2 columns show the context frames sampled from each video, and the middle 3 columns show the top 3 frames retrieved by our embedding. The correctly retrieved frames are highlighted in green, and incorrect frames in red.

Table 3. Event classification results on the MED11 event kits.
Method                         mAP (%)
Two-stream fine-tuned [30]     62.99
ISA [19]                       55.87
Izadinia et al. [5] linear     62.63
Izadinia et al. [5] full       66.10
Raman. et al. [27]             66.39
fc6                            68.56
fc7                            69.17
Our model (no temporal)        69.57
Our model (no future)          69.22
Our model (no hard neg.)       69.81
Our model                      71.17

The complex event classification task on the MED11 event kits is one of the more challenging classification tasks. We follow the protocol of [5, 27] and use the same train/test splits. Since the goal of our work is to evaluate the effectiveness of video frame representations, we use a simple linear Support Vector Machine classifier for all methods.

Unlike the retrieval settings, we are provided labeled training instances in the event classification task. Thus, we fine-tune the last two layers of the two-stream model (pre-trained on UCF-101) on the training split of the event kits, and found this to perform better than the pre-trained model.

In addition to baselines from previous tasks, we also compare with [5], [19] and [27], with results shown in Tab. 3. Note that [5, 27] use a combination of multiple image and video features including SIFT, MFCC, ISA, and HOG3D. Further, they also use additional labels such as low-level events within each video. In Tab. 3, Izadinia et al. linear refers to the results without low-level event labels.

We observe that our method outperforms ISA [19], which is also an unsupervised neural network feature representation. Additionally, the CNN features trained on ImageNet seem to perform better than most previous feature representations, which is also consistent with the retrieval results and previous work [38, 41]. Among the methods, the two-stream model holds the advantage of being fine-tuned to the MED11 event kits. However, our performance gain can be attributed to the ability of our model to use large amounts of unlabeled data to learn a better representation.
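Since classification uses a simple linear SVM on top of the frame representations, the pipeline can be sketched with scikit-learn as below. Video-level features are again averages of frame features; the regularization constant is an arbitrary placeholder. For the mAP numbers in Tab. 3 one would additionally rank test videos per class, e.g. by the classifier's decision_function scores.

```python
import numpy as np
from sklearn.svm import LinearSVC

def classify_events(train_videos, train_labels, test_videos, C=1.0):
    """train_videos / test_videos: lists of per-video frame-feature arrays;
    train_labels: event ids. Returns predicted event ids for the test videos."""
    X_train = np.stack([np.mean(v, axis=0) for v in train_videos])
    X_test = np.stack([np.mean(v, axis=0) for v in test_videos])
    clf = LinearSVC(C=C).fit(X_train, train_labels)
    return clf.predict(X_test)
```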
Figure 9. An example of the temporal ordering recovered by (a) fc7 and (b) our method for a "Making a sandwich" video (HVC465414; our: 0.3030, fc7: 0.6061). The indexes of the frames already in the correct temporal order are shown in green, and others in red.
7. Temporal Order Recovery
An effective representation for video frames should be able not only to capture visual similarities, but also to preserve the structure between temporally coherent frames. This facilitates holistic video understanding tasks beyond classification and retrieval. With this in mind, we explore the video temporal order recovery task, which seeks to show how the temporal interactions between different parts of a complex event are inherently captured by our embedding.

In this task, we are given as input a jumbled sequence of frames belonging to a video, and our goal is to order the frames into the correct sequence. This has been previously explored in the context of photostreams [14], and has potential for use in applications such as album generation.
Solving the order recovery problem.
Since our goal is to evaluate the effectiveness of various feature representations for this task, we use a simple greedy technique to recover the temporal order. We assume that we are provided the first two frames in the video and proceed to retrieve the next frame (the third frame) from all other frames in the video. This is done by averaging the features of the first two frames and retrieving the closest frame in cosine similarity. We then greedily retrieve the fourth frame using the average of the second and third frames, and continue until all frames are retrieved.
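The greedy decoding just described is a direct transcription into code: starting from the first two frames, repeatedly average the last two chosen frames and pick the most cosine-similar remaining frame. The sketch below assumes pre-computed frame features and at least two frames, with the first two indices taken as the given starting frames.

```python
import numpy as np

def greedy_order(frame_feats):
    """Recover a frame order greedily; the first two frames are assumed known."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    order = [0, 1]
    remaining = set(range(2, len(frame_feats)))
    while remaining:
        query = np.mean([frame_feats[order[-2]], frame_feats[order[-1]]], axis=0)
        nxt = max(remaining, key=lambda k: cos(query, frame_feats[k]))
        order.append(nxt)
        remaining.remove(nxt)
    return order
```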
Figure 8. After querying the Internet for images of the "wedding" event, we cluster them into sub-events and temporally organize the clusters using our model. On the left, we show sample images crawled for the "wedding" event, and on the right the temporal order recovered by our model is visualized along with manual captions for the clusters.

Table 4. Video temporal order recovery results on the MED11 event kits evaluated using the Kendall tau distance (normalized to 0-100). Smaller distance indicates better performance. The 1.4k Videos column refers to the set of videos used in the temporal retrieval task, and the 1k Videos column refers to a further subset with the most visually dissimilar frames.
Method                      1.4k Videos    1k Videos
Random chance               50.00          50.00
Two-stream [30]             42.05          44.18
fc6                         42.43          43.33
fc7                         41.67          43.15
Our model (pairwise)        42.03          43.72
Our model (no future)       40.91          42.98
Our model (no hard neg.)    41.02          41.95
Our model                   40.41          41.13
In order to enable easy comparison across all videos, we sample the same number of frames from each video before scrambling them for the order recovery problem. An example comparing our embeddings to fc7 is shown in Fig. 9.

Evaluation.
We evaluate the performance on the order recovery problem using the Kendall tau [13] distance between the ground-truth sequence of frames and the sequence returned by the greedy method. The Kendall tau distance is a metric that counts the number of pairwise disagreements between two ranked lists; the larger the distance, the more dissimilar the lists. The performance of different features on this task is shown in Tab. 4, where the Kendall tau distance is normalized to the range 0-100.

Similar to the temporal retrieval setting, we use the subset of videos which are at least seconds long. These results are reported in the first column of the table. We observed that our performance was quite comparable to that of fc7 features for videos with visually similar frames, like those from the "parade" event, as they lack interesting temporal structure. Hence, we also report results on the subset of videos which had the most visually distinct frames. These results are shown in the second column of the table. We also evaluated human performance on this task on a random subset of videos, and found the Kendall tau distance to be around . This is on par with the performance of the automatic temporal order produced by our methods, and illustrates the difficulty of this task for humans as well.

We observe that our full context model trained with a temporal objective achieves the best Kendall tau distance. This improvement is more marked in the case of the 1k Videos with more visually distinct frames. This shows the ability of our model to bring together sequences of frames that should be temporally and semantically coherent.
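The metric itself is the fraction of frame pairs whose relative order disagrees between the recovered and ground-truth sequences, scaled here to 0-100 as in Tab. 4. A minimal implementation, assuming both sequences contain the same frame identifiers and at least two frames:

```python
from itertools import combinations

def kendall_tau_distance(predicted, groundtruth):
    """Percentage of discordant pairs between two orderings of the same items."""
    pos = {frame: i for i, frame in enumerate(groundtruth)}
    pairs = list(combinations(range(len(predicted)), 2))
    discordant = sum(1 for i, j in pairs
                     if pos[predicted[i]] > pos[predicted[j]])
    return 100.0 * discordant / len(pairs)
```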
Ordering actions on the Internet. Image search on the Internet has improved to the point where we can find relevant images with textual queries. Here, we wanted to investigate whether or not we could also temporally order images returned from the Internet for textual queries that involve complex events. To do this, we used query expansion on the "wedding" query, and crawled Google for a large set of images. Then, based on the queries, we clustered the images into sets of semantic clusters, and for each cluster, averaged our embedding features to obtain a representation for the cluster. With this representation, we then used our method to recover the temporal ordering of these clusters of images. In Fig. 8, we show the temporal ordering automatically recovered by our embedded features, and some example images from each cluster. Interestingly, the recovered order seems consistent with typical wedding scenarios.
8. Conclusion
In this paper, we presented a model to embed video frames. We treated videos as sequences of frames and embedded them in a way which captures the temporal context surrounding them. Our embeddings were learned from a large collection of unlabeled videos, and were shown to be effective for multiple video tasks. The learned embeddings performed better than other video frame representations for all tasks. The main thrust of our work is to push a framework for learning frame-level representations from large sets of unlabeled video, which can then be used for a wide range of generic video tasks.
References

[1] N. Dalal, B. Triggs, and C. Schmid. Human detection using oriented histograms of flow and appearance. In ECCV, 2006.
[2] J. Deng et al. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
[3] A. Frome et al. DeViSE: A deep visual-semantic embedding model. In NIPS, 2013.
[4] R. Goroshin, J. Bruna, J. Tompson, D. Eigen, and Y. LeCun. Unsupervised feature learning from temporal data. arXiv preprint arXiv:1504.02518, 2015.
[5] H. Izadinia and M. Shah. Recognizing complex events using large margin joint low-level event model. In ECCV, 2012.
[6] M. Jain, H. Jégou, and P. Bouthemy. Better exploiting motion for better action recognition. In CVPR, 2013.
[7] H. Jhuang, T. Serre, L. Wolf, and T. Poggio. A biologically inspired system for action recognition. In ICCV, 2007.
[8] S. Ji, W. Xu, M. Yang, and K. Yu. 3D convolutional neural networks for human action recognition. T-PAMI, 2013.
[9] Y. Jia et al. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
[10] L. Jiang, A. G. Hauptmann, and G. Xiang. Leveraging high-level and low-level features for multimedia event detection. In ACM International Conference on Multimedia, 2012.
[11] Y.-G. Jiang, Q. Dai, X. Xue, W. Liu, and C.-W. Ngo. Trajectory-based modeling of human actions with motion reference points. In ECCV, 2012.
[12] A. Karpathy et al. Large-scale video classification with convolutional neural networks. In CVPR, 2014.
[13] M. G. Kendall. A new measure of rank correlation. Biometrika, pages 81–93, 1938.
[14] G. Kim and E. P. Xing. Jointly aligning and segmenting multiple web photo streams for the inference of collective photo storylines. In CVPR, 2013.
[15] R. Kiros, R. Salakhutdinov, and R. S. Zemel. Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint arXiv:1411.2539, 2014.
[16] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[17] K.-T. Lai, D. Liu, M.-S. Chen, and S.-F. Chang. Recognizing complex events in videos by learning key static-dynamic evidences. In ECCV, 2014.
[18] I. Laptev, M. Marszalek, C. Schmid, and B. Rozenfeld. Learning realistic human actions from movies. In CVPR, 2008.
[19] Q. Le et al. Learning hierarchical invariant spatio-temporal features for action recognition with independent subspace analysis. In CVPR, 2011.
[20] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
[21] P. Natarajan, S. Wu, S. Vitaladevuni, X. Zhuang, S. Tsakalidis, U. Park, and R. Prasad. Multimodal feature fusion for robust event detection in web videos. In CVPR, 2012.
[22] J. C. Niebles, C.-W. Chen, and L. Fei-Fei. Modeling temporal structure of decomposable motion segments for activity classification. In ECCV, 2010.
[23] S. Oh et al. Multimedia event detection with multimodal feature fusion and temporal concept localization. Machine Vision and Applications, 25, 2014.
[24] D. Oneata, J. Verbeek, and C. Schmid. Action and event recognition with Fisher vectors on a compact feature set. In ICCV, 2013.
[25] P. Over et al. TRECVID 2014 – an overview of the goals, tasks, data, evaluation mechanisms and metrics. In Proceedings of TRECVID 2014, 2014.
[26] X. Peng, C. Zou, Y. Qiao, and Q. Peng. Action recognition with stacked Fisher vectors. In ECCV, 2014.
[27] V. Ramanathan et al. Video event understanding using natural language descriptions. In ICCV, 2013.
[28] M. Ranzato, A. Szlam, J. Bruna, M. Mathieu, R. Collobert, and S. Chopra. Video (language) modeling: a baseline for generative models of natural videos. arXiv preprint arXiv:1412.6604, 2014.
[29] S. Sadanand and J. J. Corso. Action bank: A high-level representation of activity in video. In CVPR, 2012.
[30] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[31] K. Soomro, A. R. Zamir, and M. Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
[32] N. Srivastava, E. Mansimov, and R. Salakhutdinov. Unsupervised learning of video representations using LSTMs. arXiv preprint arXiv:1502.04681, 2015.
[33] A. Tamrakar, S. Ali, Q. Yu, J. Liu, O. Javed, A. Divakaran, H. Cheng, and H. Sawhney. Evaluation of low-level features and their combinations for complex event detection in open source videos. In CVPR, 2012.
[34] G. Taylor, R. Fergus, Y. LeCun, and C. Bregler. Convolutional learning of spatiotemporal features. In ECCV, 2010.
[35] L. van der Maaten and G. Hinton. Visualizing data using t-SNE. JMLR, 9:2579–2605, 2008.
[36] H. Wang, A. Kläser, C. Schmid, and C.-L. Liu. Action recognition by dense trajectories. In CVPR, 2011.
[37] H. Wang, M. M. Ullah, A. Kläser, I. Laptev, and C. Schmid. Evaluation of local spatio-temporal features for action recognition. In BMVC, 2009.
[38] Z. Xu, Y. Yang, and A. G. Hauptmann. A discriminative CNN video representation for event detection. arXiv preprint arXiv:1411.4006v1, 2015.
[39] Y. Yang and M. Shah. Complex events detection using data-driven concepts. In ECCV, 2012.
[40] W. Zaremba, I. Sutskever, and O. Vinyals. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329, 2014.
[41] S. Zha, F. Luisier, W. Andrews, N. Srivastava, and R. Salakhutdinov. Exploiting image-trained CNN architectures for unconstrained video classification. arXiv preprint arXiv:1503.04144v2, 2015.