Temporal-Needle: A view and appearance invariant video descriptor
Michal Yarom and Michal Irani
The Weizmann Institute of Science, Israel
Abstract
The ability to detect similar actions across videos can be very useful for real-world applications in many fields. However, this task is still challenging for existing systems, since videos that present the same action can be taken from significantly different viewing directions, performed by different actors against different backgrounds, and under various video qualities. Video descriptors play a significant role in these systems. In this work we propose the "temporal-needle" descriptor, which captures the dynamic behavior while being invariant to viewpoint and appearance. The descriptor is computed over multiple temporal scales of the video, by computing the self-similarity of every patch through time in every temporal scale. The descriptor is computed for every pixel in the video. However, to find similar actions across videos, we consider only a small subset of the descriptors - the statistically significant ones. This allows us to find good correspondences across videos more efficiently. Using the descriptor, we were able to detect the same behavior across videos in a variety of scenarios. We demonstrate the use of the descriptor in tasks such as temporal and spatial alignment and action detection, and even show its potential for unsupervised clustering of videos into categories. In this work we handled only videos taken with stationary cameras, but the descriptor can be extended to handle moving cameras as well.
1 Introduction

Action analysis has drawn significant attention from the computer vision community. Although there has been progress in the past two decades, it is still considered a hard challenge, especially in unconstrained videos. The interest in the topic is motivated by the potential of many applications based on automatic video analysis, ranging from video retrieval, surveillance systems and human-machine interaction to sports video analysis.

The problem we address in this work is to develop a descriptor for action detection, which allows detecting the same action in different videos recorded from different viewpoints, possibly at different times and even by different sensing modalities. We consider two cases: a simple case where the videos capture the same scene, recorded simultaneously from different viewing directions, and the general case where they capture different scenes, i.e., the same action performed by different actors against different backgrounds. This gives the motivation to develop a descriptor which captures the dynamics in the video, while being invariant to appearance, viewpoint and scale, and insensitive to small temporal variations.

Various approaches have been proposed over the years for action detection and recognition, ranging from high-level representations of shapes and silhouettes to low-level appearance and motion estimation. Several early attempts used silhouettes to extract human motion, such as [1, 2, 3, 4].

A recent work by Ben-Artzi et al. [5] proposes to create a "Motion Barcode" for every pixel, capturing the existence/non-existence of motion in that pixel over time. Their method can determine whether two videos present the same event even if they are captured from significantly different viewing directions. However, their method is limited to videos that simultaneously capture the same dynamic scene, and it relies on background-foreground segmentation. In a more recent work by the same authors [6], they combine dynamic silhouette methods with their temporal signature to estimate the epipolar geometry between the cameras that capture the same event. The temporal signature is similar to the Motion Barcode, but instead of computing the signature for every pixel, they compute a signature for epipolar lines. This work is still limited to the same scene and relies on extracting silhouettes.

In recent years, most works have focused on local low-level representations, which are briefly reviewed below. For a more comprehensive survey the reader is referred to [7, 8]. One type of low-level representation first finds Space-Time Interest Points (STIP) [9]. The local information at these points is captured using one of several descriptors, such as HoG, HoF, SIFT [10, 11, 12] or a modified 3D version of them (e.g. [13]). The video is then represented with a Bag of Words technique [14]. This approach has proven effective for action recognition on challenging datasets (e.g. in [15]). In [16] a method for alignment of video sequences from different modalities is proposed. They demonstrate that by tracking the space-time interest points, they can estimate the temporal and spatial sequence-to-sequence alignment using trajectories. However, the main limitation of the STIP approach is that it relies on finding a suitable number of space-time interest points. Videos with subtle motion will not provide enough corresponding interest points across the videos. On the other hand, videos with large motions (e.g.,
videos of waves in the sea) will provide too many interest points, making it difficult to find reliable matchings.

In [17] a direct approach for sequence alignment is proposed, based on maximizing local space-time correlations. The algorithm is applied directly to the space-time intensity information, without finding space-time interest points. They were able to align challenging sequences with different appearances and time fluctuations. However, this approach is restricted to 2D parametric transformations, and cannot be applied to sequences taken from different viewpoints.

Shechtman and Irani [18] propose a self-similarity descriptor that correlates space-time local patches over different locations in space and time. The descriptor is invariant to color and texture, and to small scale variations. They proved its utility for action detection between videos with different appearances. However, under large scale variations (e.g., due to a zoom difference between the videos, such as in Fig. 8) or large viewpoint variations, the self-similarity descriptor will fail to detect the same action.

Junejo et al. [19] have shown that the temporal self-similarity matrices (SSM) of an action seen from different viewpoints are very similar. The temporal SSM can be used with different local descriptors. This method was successfully used for cross-view action recognition. However, it is restricted to a single action within the field of view.

Kliper-Gross et al. [20] developed a Motion Interchange Pattern (MIP) descriptor that creates a signature at every pixel and triplet of frames in the video (the previous, current and next frame). It encodes the motion by comparing the patch centered at the pixel coordinate in the current frame to 8 patches in each of the previous and next frames. This is an extension of the Local Trinary Patterns (LTP) descriptor [21]. The LTP descriptor only compares the patches in the previous and next frames at the same location relative to the patch in the current frame, i.e., 8 comparisons instead of the 64 in the MIP descriptor. By combining the MIP descriptor with a standard bag-of-words technique, they achieved impressive action recognition results on challenging benchmarks. However, their method is also restricted to a single action in the field of view.

In this work we introduce a new space-time descriptor, the "temporal-needle", which captures dynamics. The descriptor is invariant to changes in appearance, viewpoint and geometric transformations. The key to our work is creating a signature by computing self-similarity over time at the same pixel location, at multiple temporal scales. This creates a signature of the local repetitive behavior at every point in space and time. Inspired by the self-similarity and MIP descriptors, we compute the sum of squared differences (SSD) between small patches. However, in our descriptor the SSD is computed between a small spatial patch around a point in the current frame and the patches centered at the same spatial point in the neighboring frames. Unlike the MIP descriptor, we compute the descriptor over a larger number of frames, and at multiple temporal scales. By increasing the temporal support we can better represent the motion patterns of the action, while computing the descriptor at multiple temporal scales allows for subtle variations in the speed of the same action.

The descriptor is described in more detail in Sec. 2 and Sec. 3. We describe an efficient method for finding good correspondences across videos in Sec. 4.
We test the applicability of the temporal-needle descriptor on a variety of tasks. Sec. 5 demonstrates its use for temporal and spatial alignment of video sequences in a large variety of scenarios: videos taken with different types of sensors, videos with a wide baseline, videos with non-rigid motion and videos with a significant zoom difference. Sec. 6 shows its use for action detection; we were able to detect the same action in videos of real sports games. Sec. 7 presents its potential use for unsupervised video clustering.

2 Overview of Our Approach
Structure from motion (SfM) [22] and photo tourism [23] recover camera parameters and the fundamental matrix between images by finding correspondences between images of the same scene based on their appearance (usually using feature detectors such as SIFT [12] and SURF [24]).

We extend these ideas to videos. We want to compare and find correspondences between two videos V1(x,y,t) and V2(x,y,t) under the assumption that they capture the same (or a similar) action. The action takes place in the 4D space-time world (X,Y,Z,T). Videos V1 and V2 are the projections of the 4D action into 3D coordinate systems (x,y,t), defined by the internal and external parameters of the video cameras.

For simplicity, let's assume that the cameras are stationary and synchronized in time. Suppose that the videos show a dancer, and that his hand passes through the global coordinate (X,Y,Z) at some discrete times t1, t2, ..., tn. Since the camera parameters are fixed, we expect the dancer's hand to be projected onto some point (x1,y1) in V1 and onto some point (x2,y2) in V2 at the same times t1, t2, ..., tn. Namely, there will be strong self-similarity between the patch around (x1,y1) in V1 across the frames at times t1, t2, ..., tn, and, similarly, strong self-similarity between the patch around (x2,y2) in V2 across the frames at those same times. Therefore, by creating a signature of self-similarities for each spatial location over time in the videos V1 and V2, we will be able to recognize that (x1,y1) and (x2,y2) present the same behavior, and hence are corresponding dynamic points. This gives the intuition for why the temporal-needle descriptor is view-invariant. In Sec. 3 we explain in detail why the descriptor is view-invariant and appearance-invariant.

Similarly, we claim that the descriptor is invariant to scale. Let's assume that the dancer is captured from the same direction by two videos with a zoom ratio of 1:a. As assumed before, the hand of the dancer passes at time t1 through point (x1,y1) in V1 and through point (x2,y2) in V2. Let b denote the number of pixels the hand moves away from (x1,y1) in V1 between t1 and the next frame. Due to the zoom ratio between the videos, the number of pixels the hand moves away from (x2,y2) in V2 during the same time will be a·b. The number of pixels the hand moves in each video is different, but eventually, at a later time, it returns to point (x1,y1) in V1 and to point (x2,y2) in V2. Our descriptor does not estimate the motion; instead, it measures the self-similarity of a fixed spatial point through time. There will be strong self-similarity between the patch centered at (x1,y1) in V1 at time t1 and the patches at the same spatial location in every frame in which the hand passes through point (x1,y1) (in our example, at times t1, t2, ..., tn). Similarly, there will be strong self-similarity between the patches centered at (x2,y2) in V2 at the same times t1, t2, ..., tn. Thus, using our descriptor, we will be able to recognize that points (x1,y1) and (x2,y2) are corresponding dynamic points, despite the different zoom.

We compute the SSD (sum of squared differences) between a small spatial patch in the current frame and the patches located at the same spatial location in its neighboring frames. We measure the self-similarity in more than one temporal scale, by scaling down the video in time, thus generating down-sampled versions of the video {V_s}.
We were inspired to use temporal multi-scale versions of the video in our temporal-needle descriptor by the spatial Needle descriptor proposed by [25]. We have found that adding temporal scales is better than increasing the temporal length, because it captures more significant parts of the action while remaining insensitive to variations in the speed of the action, as explained in Sec. 3.

Fig. 5 illustrates the properties of the descriptor. It shows two corresponding descriptors extracted from videos of a tennis serve. The descriptors of the corresponding points are very similar, although the action is performed by different tennis players against different backgrounds, from significantly different viewpoints, and with slight speed variations between the players.

3 The Temporal-Needle Descriptor

In this section we introduce our multi-scale temporal-needle descriptor, which is computed for every pixel in the video. Let V1(x,y,t) and V2(x,y,t) be two videos that capture the same action. We would like to find good matching points between the two videos. Let p1 = (x1,y1,t1) be a point in the first video V1, and suppose that its matching point in the second video V2 is located at p2 = (x2,y2,t2). Although the videos present the same action, they can be very different: the actors, their clothes, their backgrounds, the illumination, and the viewpoint and zoom of the cameras can all change between the videos. Therefore, we are interested in finding good correspondences based on the dynamic behavior presented in the videos.

The temporal-needle signature d(x,y,t) for a point located at (x,y,t) is computed as follows: the video V is downscaled temporally to generate a multi-scale temporal pyramid {V_s}, by blurring with a Gaussian LPF along the temporal dimension and then sub-sampling the video V in time. For every point and every scale s (s = 1, 1/2, 1/4, ...) in the video pyramid V_s, we compute the SSD between the patch p centered at (x,y,t) and the patches located at the same spatial coordinates (x,y) in the Γ previous and Γ next frames at that scale. This creates a vector of length 2Γ per point (x,y,t) per scale s:

d_s(x,y,t) = [ ||p(x,y,t,s) − p(x,y,t−Γ,s)||², ..., ||p(x,y,t,s) − p(x,y,t+r,s)||², ..., ||p(x,y,t,s) − p(x,y,t+Γ,s)||² ]   (1)

where −Γ ≤ r ≤ Γ. The descriptor is a vector of length 2Γ (after omitting the 0 value at its center, obtained for r = 0). The full descriptor d(x,y,t) of a point (x,y,t) in the video V is obtained by concatenating the self-similarities corresponding to that point from all scales of the video pyramid V_s:

d(x,y,t) = [ d_1(x,y,t), d_{1/2}(x,y,t/2), d_{1/4}(x,y,t/4), ... ]   (2)

Finally, the descriptor is normalized, so that the sum of all its entries is 1:

d̂(x,y,t) = d(x,y,t) / max( Sum(d(x,y,t)), Sum_noise )   (3)

In the experiments we present, we sub-sampled the video in time by a factor of 0.5. The normalization divides by the maximum between the sum of the elements in the descriptor and Sum_noise, where Sum_noise equals the descriptor's length multiplied by a constant that represents the estimated noise variance. In our experiments the noise variance was set to the 30th percentile of all entry values, taken over all descriptors computed for that video.
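To make this computation concrete, below is a minimal NumPy sketch of Eqs. 1-3 for a single point. The function names and all parameter values (patch size, Γ, number of scales, the noise floor) are illustrative placeholders rather than the exact settings used in our experiments, and boundary handling is omitted:

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def temporal_pyramid(video, num_scales=3):
    """Temporal pyramid {V_s}: Gaussian blur along time, then keep every
    2nd frame (a temporal sub-sampling factor of 0.5 per level)."""
    levels = [video.astype(np.float64)]
    for _ in range(num_scales - 1):
        blurred = gaussian_filter1d(levels[-1], sigma=1.0, axis=0)
        levels.append(blurred[::2])
    return levels

def needle_at_scale(level, x, y, t, gamma=3, patch=3):
    """Eq. 1: SSD between the patch around (x, y) at frame t and the patches
    at the SAME spatial location in the gamma previous/next frames."""
    h = patch // 2
    ref = level[t, y - h:y + h + 1, x - h:x + h + 1]
    ssd = []
    for r in range(-gamma, gamma + 1):
        if r == 0:
            continue  # the zero entry at the center (r = 0) is omitted
        cur = level[t + r, y - h:y + h + 1, x - h:x + h + 1]
        ssd.append(np.sum((ref - cur) ** 2))
    return np.array(ssd)

def temporal_needle(video, x, y, t, gamma=3, num_scales=3, sum_noise=1e-3):
    """Eqs. 2-3: concatenate the per-scale vectors and normalize."""
    d = np.concatenate([needle_at_scale(lvl, x, y, t // (2 ** s), gamma)
                        for s, lvl in enumerate(temporal_pyramid(video, num_scales))])
    return d / max(d.sum(), sum_noise)

# Toy usage: a random grayscale video of 64 frames of 32x32 pixels.
video = np.random.rand(64, 32, 32)
print(temporal_needle(video, x=16, y=16, t=31).shape)  # 2*gamma*num_scales = (18,)
```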
Fig. 1 illustrates how we compute the descriptor in every temporal scale.

Figure 1: (a) The temporal-needle descriptor with 3 temporal levels. (b) The computation of the descriptor in every temporal scale: the SSD of the patch centered at (x,y,t) with the patches at the same spatial position in the Γ previous and Γ next frames.

Increasing the temporal support (2Γ+1) of the descriptor enables representing longer and more meaningful repetitive behaviors. On the other hand, the self-similarity pattern then needs to be the same over a larger number of frames across videos, which does not allow flexibility in the speed of the action. We next show that adding more temporal scales achieves both goals: increasing the temporal context, while allowing for flexibility in the speed of the action.

Let's assume that videos V1 and V2 capture two different people performing similar actions. Let (x1,y1,t1) in V1 and (x2,y2,t2) in V2 be two corresponding points. Denote by λ = 2Γ+1 the length of the temporal window captured by the descriptor at a single scale, and assume the descriptor is computed at L temporal scales, i.e., d_{V1}(x1,y1,t1) = [d^{s_1}_{V1}(x1,y1,t1), ..., d^{s_L}_{V1}(x1,y1,t1/s_L)] and d_{V2}(x2,y2,t2) = [d^{s_1}_{V2}(x2,y2,t2), ..., d^{s_L}_{V2}(x2,y2,t2/s_L)]. The descriptors d^{s_l}_{V1}(x1,y1,t1/s_l) and d^{s_l}_{V2}(x2,y2,t2/s_l) are computed on versions of V1 and V2 down-scaled in time by a factor of s_l. Let τ1 and τ2 be the temporal windows (of length λ) captured by the descriptors in the coarsest temporal scale of V1 and V2, and let T1 and T2 be the temporal windows in the original temporal scale of V1 and V2 which correspond to τ1 and τ2 in the coarse scale; namely, down-scaling T1 and T2 temporally by a factor of s_L results in τ1 and τ2, respectively: T1↓1/s_L = τ1 and T2↓1/s_L = τ2. Their length is |T1| = |T2| = Λ > λ.

Fig. 2 illustrates the correspondence between windows in the temporally down-scaled versions of the video and windows in the original temporal scale.

Figure 2: Illustration of the correspondence between temporal windows in scaled-down versions of the video and windows in the original temporal scale. τ_{s1}, τ_{s2} and τ_{s3} are temporal windows, of length λ, in 3 temporal scales of the video. The window τ_{s3}, in the coarsest temporal scale of the video, corresponds to the window T in the original temporal scale of the video. Similarly, the window τ_{s2} corresponds to a smaller window in the original temporal scale (marked by dashed lines).

Assume we compute the self-similarity descriptor d_{T1} of window T1 only at the finest scale, namely, the self-similarity of the central patch in T1 to all other patches at the same spatial location in all other frames of T1, and let d_{T2} be the corresponding descriptor of window T2. Let u(t) denote the "misalignment" between d_{T1} and d_{T2}. Such temporal deformations occur, for example, when the action in one video is performed at a different speed than in the other, as shown in Fig. 3. In this example V1 and V2 present a similar action performed at different speeds: the girl in V2 kicks faster than in V1 (by a speed ratio of 4/3), so at times t = ±9 in V2 the girl is in the same position as at times t = 12 and t = −12 in V1. The black arrows in Fig. 3(b) illustrate the deformation between the frames in T1 and T2, namely how much a frame in T2 needs to move relative to the "matching" frame in T1.

Figure 3: Example of temporal misalignment between videos presenting the same action at different speeds. (a) Video V1 and (b) Video V2 show a few frames from the action; the center frame of the kick is at time t = 0. The kick in V2 is faster than in V1 by a speed ratio of 4/3: the girl's position in V2 at times t = ±9 matches her position in V1 at times t = 12 and t = −12. The black arrows in (b) present the relative misalignment, u(t), between the frames in V2 and the frames in V1.

To simplify the analysis, we assume that the size of these temporal deformations (misalignments) between T1 and T2 grows at most linearly with the distance from the center frame t = 0, namely:

|u(t)| = α|t|   (4)

for some scalar α > 0. We next show that although the average temporal misalignment is large for the temporal supports T1 and T2, it is small for the corresponding temporal supports τ1 and τ2. Moreover, the average temporal misalignment is small for the entire temporal-needle descriptors d1(x1,y1,t1) and d2(x2,y2,t2).

For simplicity, we perform the computation of the average temporal misalignment in the continuum. Under the assumption of Eq. 4, the average temporal misalignment per frame in T1 and T2, whose temporal size is Λ, is:

AvgTempMisalignment(T1, T2) = (1/Λ) ∫_{−Λ/2}^{Λ/2} |u(t)| dt = (1/Λ) ∫_{−Λ/2}^{Λ/2} α|t| dt = αΛ/4   (5)

Had we computed the self-similarity descriptors directly on T1 and T2, this would be the average "misalignment" between the entries of their descriptors d_{T1} and d_{T2}. This, however, is not true for descriptors estimated on the down-scaled temporal windows τ1 and τ2. We claim that the average temporal misalignment of descriptors estimated on τ1 and τ2 is significantly smaller. When two videos are temporally scaled down by a factor of s_l, their relative temporal misalignments become s_l-times smaller. In our case, |u↓_{1/s_l}(t)| = (1/s_l)·|u(s_l·t)| = (α/s_l)·|s_l·t| = α|t|. Hence, the constant α remains the same in all temporal scales. Under the assumption of Eq. 4, the average temporal misalignment per frame (hence also per descriptor entry) in τ1 and τ2, whose temporal size is λ, is:

AvgTempMisalignment(τ1, τ2) = (1/λ) ∫_{−λ/2}^{λ/2} α|t| dt = αλ/4   (6)

i.e., AvgTempMisalignment(d^{s_L}_1(x1,y1,t1/s_L), d^{s_L}_2(x2,y2,t2/s_L)) = αλ/4. This holds for the descriptors in every temporal scale (the average "misalignment" is independent of the scale s_l). Therefore, the average temporal misalignment per entry between the temporal-needle descriptors d1(x1,y1,t1) and d2(x2,y2,t2) is:

AvgTempMisalignment(d1(x1,y1,t1), d2(x2,y2,t2)) = (1/L) Σ_{l=1}^{L} AvgTempMisalignment(d^{s_l}_1(x1,y1,t1/s_l), d^{s_l}_2(x2,y2,t2/s_l)) = αλ/4   (7)

For example, let's assume that the speed of the action in video V2 is faster than the speed of the action in video V1 by a factor of β ≥ 1; then α = |1 − β|. In most of our experiments the descriptor's single-scale window is λ = 7 (Γ = 3) with L = 3 temporal scales, so Λ = 4λ = 28. For β = 1.25 (hence α = 0.25), the average temporal misalignment between the entries of the single-scale self-similarity descriptors of the windows T1 and T2 is AvgTempMisalignment(T1, T2) = αΛ/4 ≈ 1.75 frames, whereas between the temporal-needle descriptors it is only AvgTempMisalignment(d1, d2) = αλ/4 ≈ 0.44 frames. Moreover, the maximal misalignment between the descriptors d_{T1} and d_{T2} of the windows T1 and T2 is maxMisalignment(T1, T2) = αΛ/2 = 3.5 frames, i.e., for most entries of d_{T1} and d_{T2} the misalignment is larger than one frame. The temporal-needle descriptors, although they capture information from the same temporal context, have a maximal misalignment of only maxMisalignment(d1, d2) = αΓ = 0.75 frames. The small misalignments lead to high similarity between the two temporal needles. Therefore, adding more temporal scales to the needle is equivalent to increasing the temporal context, with the advantage of being insensitive to small speed variations.
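A quick numeric check of this analysis, using the parameter values of the worked example above (all values are those of the example, not universal constants):

```python
import numpy as np

alpha = 0.25         # |u(t)| = alpha*|t|, e.g. for a speed factor beta = 1.25
gamma = 3            # temporal radius -> single-scale window of 2*gamma + 1 frames
lam = 2 * gamma + 1  # lambda: window length at every scale
Lam = 4 * lam        # Lambda: original-scale window behind the coarsest scale (s_L = 1/4)

def avg_misalignment(window_len):
    """Continuum average of |u(t)| = alpha*|t| over [-window/2, window/2]."""
    t = np.linspace(-window_len / 2, window_len / 2, 100001)
    return np.mean(alpha * np.abs(t))

print(avg_misalignment(Lam))  # ~ alpha*Lambda/4 = 1.75 frames (flat window T)
print(avg_misalignment(lam))  # ~ alpha*lambda/4 ~ 0.44 frames (needle, per entry)
print(alpha * gamma)          # 0.75: maximal per-entry needle misalignment
```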
We demonstrate the properties of the descriptor in a single temporal scale; since the same argument applies to every temporal scale, it applies to the entire descriptor.

Suppose that the two videos V1 and V2 capture the same action from different viewpoints (e.g., see Fig. 4). The action takes place in a 4D space-time world, and videos V1(x,y,t) and V2(x,y,t) are 3D projections of the action. We assume that the hand of the actor passes through some coordinate (X,Y,Z) at discrete times t1, t2, ..., tn. Since the positions of the cameras are fixed, this point is projected to some point (x1,y1) in V1 and to some point (x2,y2) in V2.

Figure 4: An example of the same action captured from two different viewpoints. Although the videos are taken from different views, the patch marked by a yellow rectangle captures the girl's hand in frames t−1, t and t+1. In the rest of the frames the patch captures the background (note that in V2 the background is darker than in V1).

For example, in Fig. 4 the patch centered at the point (x1,y1) in V1 and the patch centered at the point (x2,y2) in V2 capture the girl's hand in frames t−1, t, t+1, and the background in frames t−3, t−2, t+2, t+3. When we compute the descriptor d1(x1,y1,t) in V1 and the descriptor d2(x2,y2,t) in V2 with Γ = 3, we measure the SSD of the patch located at (x1,y1) (or (x2,y2)) in frame t against the patches at the same spatial location in the 3 previous and 3 next frames. The SSD between the patch in frame t and the patches in frames t−1 and t+1 is close to 0 in both videos, since the hand is captured in all three frames. Although V1 and V2 are taken from different viewpoints, in frames t−3, t−2, t+2 and t+3 the patch captures the background in both videos: in V1 the SSD between the patch in frame t and the patches in frames t−3, t−2, t+2 and t+3 equals some value α > 0, while in V2 it equals some value β > 0 (in general β ≠ α, since the backgrounds differ). Thus, d1(x1,y1,t) = [α, α, 0, 0, α, α] and d2(x2,y2,t) = [β, β, 0, 0, β, β]. We then normalize each descriptor by dividing each entry by the sum of the entries in that descriptor (we assume that this sum is larger than Sum_noise), and the normalized descriptors become identical: d̂1(x1,y1,t) = d̂2(x2,y2,t) = [1/4, 1/4, 0, 0, 1/4, 1/4].
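This step is easy to verify numerically. The following sketch, with arbitrary placeholder values for α and β, shows that the two descriptors of the example become identical after the normalization of Eq. 3:

```python
import numpy as np

def normalize(d, sum_noise=1e-6):
    return d / max(d.sum(), sum_noise)  # Eq. 3

alpha, beta = 7.0, 42.0  # arbitrary, different background SSD levels
d1 = np.array([alpha, alpha, 0.0, 0.0, alpha, alpha])  # Gamma = 3, center omitted
d2 = np.array([beta, beta, 0.0, 0.0, beta, beta])
print(normalize(d1))                              # [0.25 0.25 0. 0. 0.25 0.25]
print(np.allclose(normalize(d1), normalize(d2)))  # True: appearance cancels out
```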
The normalization compensates for the variation in the SSD between the patch in frame t and patches with different appearances in its neighboring frames.

Similarly, we claim that the descriptor is appearance-invariant. Suppose that the appearance of the girl in V1 is different than in V2 (e.g., when the scene is captured with different types of sensors). The descriptor is computed by measuring the self-similarity of a patch around a point through time, i.e., measuring how similar a patch around a point is to itself in different frames. Although the girls in the videos have different appearances, we only compare each video to itself. As in the example above, the final step of normalizing the descriptor makes it invariant to appearance.

Fig. 5 shows an example of corresponding temporal-needles in two videos of a tennis serve taken from different viewpoints. As can be seen, the tennis serve presented in both videos generates similar local patterns over time, although it was performed by two different players, with different clothes and different backgrounds, and the videos were taken from different viewpoints: the first video was taken from behind the player, while the second was taken from the side. The temporal-needle descriptor captures the local repetitive dynamics well, while being insensitive to differences in appearance and viewpoint.

4 Finding Reliable Correspondences

In principle, the temporal-needle descriptor can be computed for every pixel in the video, and then matched across videos. However, even short videos of a few seconds contain millions of pixels, most of which are background pixels that are not particularly informative for matching. We next suggest an approach for focusing the correspondence estimation on only a small subset of informative descriptors.
The distance between the descriptors of two corresponding video points must be small. However, this is not a sufficient condition to guarantee good correspondences. For example, points in the static background will have a uniform zero descriptor, and hence will match well to any other static point. We therefore wish to match only dynamic points, and preferably those that produce reliable, unambiguous matches. Thus, we would like to seek good matches between "informative" descriptors, namely, descriptors whose probability to appear at random is low. Let Q be a query video and R be a reference video; we would like to match descriptors from Q to R. We employ the notions of "saving in bits" and "informativeness" of descriptors, as defined in [26, 27].

To find the "informative" descriptors in Q, we denote by Pr(d|H) the probability of choosing the descriptor d at random. We estimate Pr(d|H) using a technique similar to the one presented in [27]. We generate a descriptor codebook H as follows: first we sample a small portion (typically 1%-5%, as long as it is more than 100,000 descriptors) of the descriptors from the two videos Q and R. These descriptors are then clustered into a few hundred clusters by applying K-means clustering. The centers (means) of these clusters form the codebook words of H. Descriptors that are very frequent in the videos will be well represented in the codebook, whereas unique/rare descriptors will not be represented well. Therefore, we approximate Pr(d|H) by:

Pr(d|H) = exp( −|Δ_d(H)| / σ )   (8)

where Δ_d(H) denotes the distance between the descriptor d and its closest word in the codebook H. If a descriptor d is far from the codebook, its estimated probability to appear at random is small, and hence it is flagged as an "informative" descriptor.

We further define the probability of finding a good match for descriptor d in the reference video R by Pr(d|R). We use the approximation:

Pr(d|R) = exp( −|Δ_d(R)| / σ )   (9)

where Δ_d(R) denotes the distance between the descriptor d and its NN (nearest neighbor) descriptor in the reference video R.

Figure 5: The temporal-needles of corresponding space-time points in two videos that present the same action. (a) presents 7 frames from a tennis serve by Roger Federer; we picked two points, p1(t1) and q1(t1), in the center frame t1. (b) displays 7 frames from a tennis serve by another player, where p2(t2) and q2(t2) are the corresponding points of p1(t1) and q1(t1), respectively. (c) displays the descriptors of these 4 points. The descriptors were computed with patch size 3x3 (smaller than the rectangles presented in (a) and (b)).

We define the likelihood that a match for descriptor d is a "reliable" one as follows:

Likelihood ratio(d) = Pr(d|R) / Pr(d|H)   (10)

namely, the ratio between Pr(d|R), the probability of finding a good match for descriptor d in the reference video R, and Pr(d|H), the probability of finding the descriptor at random. Thus, for example, if a descriptor d ∈ Q has a good match in the reference video R, then Pr(d|R) will be high. However, if d is a trivial descriptor, then Pr(d|H) will also be high, resulting in an overall low likelihood ratio. On the other hand, if d is informative, Pr(d|H) will be low, resulting in a high likelihood ratio.
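A minimal sketch of this scoring pipeline, assuming the descriptors are given as row-stacked NumPy arrays; the codebook size, the sample fraction and the brute-force NN search are illustrative simplifications, not our exact implementation:

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def nn_dist(a, b):
    """For each row of a: distance to its nearest row of b (brute force)."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.sqrt(d2.min(axis=1))

def saving_in_bits(query_desc, ref_desc, codebook_size=256, sample_frac=0.05, seed=0):
    """|Delta_d(H)| - |Delta_d(R)| per query descriptor (Eq. 11). The codebook
    H is K-means over a small sample of descriptors from both videos."""
    rng = np.random.default_rng(seed)
    pool = np.vstack([query_desc, ref_desc])
    n = max(1, int(sample_frac * len(pool)))
    sample = pool[rng.choice(len(pool), n, replace=False)]
    k = min(codebook_size, len(sample))
    codebook, _ = kmeans2(sample, k, minit='++', seed=seed)
    return nn_dist(query_desc, codebook) - nn_dist(query_desc, ref_desc)

# High-scoring descriptors are both rare (far from the codebook) and well
# matched in the reference video, i.e. reliable correspondences.
q, r = np.random.rand(500, 18), np.random.rand(800, 18)
reliable = np.argsort(saving_in_bits(q, r, codebook_size=32))[-50:]
```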
According to Shannon [28], the entropy of a random variable x, namely −p(x) log p(x), represents the number of bits required to code x. Therefore, taking the log of Eq. 10 and discarding constants yields:

"Saving in bits"(d) = |Δ_d(H)| − |Δ_d(R)|   (11)

Thus, a descriptor whose distance Δ_d(H) from the codebook H is large (i.e., a rare descriptor), and which found a good match in the reference video R (i.e., the distance Δ_d(R) is small), will result in a high "saving in bits". This indicates a reliable match. It is not hard to see that if a descriptor is not informative, or did not find a similar descriptor in the reference video, it will result in a low "saving in bits", i.e., an unreliable match.

In the next few sections we present several different applications of the temporal-needle descriptor, all based on finding reliable correspondences between videos: sequence-to-sequence alignment (Sec. 5), action detection (Sec. 6), and video clustering (Sec. 7).

5 Sequence-to-Sequence Alignment

Let V1(x,y,t) and V2(x,y,t) be two videos capturing the same dynamic scene. Let p1 = (x1,y1,t1) be a point in the first video V1 (namely, p1 is a point in frame t1 located at spatial coordinates (x1,y1)), and let p2 = (x2,y2,t2) be its matching point in the second video V2. We assume that the cameras are stationary (they can also move jointly, as long as the relative parameters between the cameras are fixed) and the scene is dynamic. We would like to find both the temporal and the spatial alignment between V1 and V2. Correspondences between the videos, both in space and in time, can be modeled with a small set of parameters, T_spatial and T_temporal, and our goal is to find these parameters.

5.1 Temporal Alignment

Videos V1 and V2 can be misaligned temporally if the cameras have different frame rates (which results in a scaling in time) and if the cameras are not synchronized (which results in an offset in time). Therefore, we model the temporal transformation between the two sequences as a 1D affine transformation in time: t2 = r·t1 + Δt, where r is the temporal scaling (e.g., the ratio between the frame rates of the videos) and Δt is the frame shift between them (which is not necessarily an integer number of frames). In most cases r is known, therefore we only compute the shift (although we can also compute r). Computing the time shift is done in two steps: first we compute a coarse integer frame shift, and then we refine it to a sub-frame shift.
(a) Computing an integer frame shift - Let S1 and S2 denote the informative descriptors detected in V1 and V2, respectively (see Sec. 4). We search for an integer Δt′ in the range [Δt_min, Δt_max] that maximizes the similarity of informative descriptors between frames t1 ∈ V1 and the corresponding frames t2 ∈ V2 such that t2 = r·t1 + Δt′. For each Δt′ in the range, we align the videos according to the current temporal shift. We consider only the frames that overlap between the two videos, as illustrated in Fig. 6. For every informative descriptor in S1 we seek its NN (nearest neighbor) in its corresponding frame (according to the current shift), and measure the SSD between the descriptors. We repeat this process for the descriptors in S2. We choose the integer frame shift Δt′ that minimizes the average error per descriptor.

(b) Computing the sub-frame shift - since the true Δt between the videos is not always an integer value, we search for a sub-frame shift α (|α| < 1) around the recovered integer shift, in small increments Δα. We generate V′1 with the sub-frame shift α by interpolating two consecutive frames as follows: let p(t) be a pixel in frame t; then p(t+α) = (1−α)·p(t) + α·p(t+1). We find the informative descriptors S′1 in V′1 and repeat the process of computing the average error per descriptor, but this time with a sub-frame shift of Δt = Δt′ + α. Eventually, we choose the sub-frame shift Δt with the minimum average error.
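A sketch of this two-step shift search, under simplifying assumptions: r = 1, a brute-force NN search, and per-frame descriptor sets passed in as dictionaries. The function and parameter names are ours, for illustration only:

```python
import numpy as np

def avg_nn_error(desc_a, desc_b):
    """Symmetric average NN error between two descriptor sets of shape (N, D)."""
    def one_way(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return d2.min(axis=1).sum(), len(a)
    e1, n1 = one_way(desc_a, desc_b)
    e2, n2 = one_way(desc_b, desc_a)
    return (e1 + e2) / max(n1 + n2, 1)

def integer_shift(frames1, frames2, shift_range):
    """frames1/frames2 map a frame index to its informative descriptors.
    Returns the integer shift minimizing the average error over the
    overlapping frames (temporal scaling r = 1 assumed)."""
    best_dt, best_err = None, np.inf
    for dt in shift_range:
        errs = [avg_nn_error(frames1[t], frames2[t + dt])
                for t in frames1 if (t + dt) in frames2]
        if errs and np.mean(errs) < best_err:
            best_dt, best_err = dt, np.mean(errs)
    return best_dt

def subframe_video(video, a):
    """Sub-frame shift 0 <= a < 1: p(t + a) = (1 - a)*p(t) + a*p(t + 1)."""
    return (1 - a) * video[:-1] + a * video[1:]
```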
Figure 6: Illustration of a temporal alignment with Δt′ = 3. In this case we compute the average error per descriptor over the 8 overlapping frames. The average error is computed by finding NNs in both directions, from V1 to V2 and from V2 to V1, and dividing the error by the number of informative descriptors in these frames.

5.2 Spatial Alignment

Given the estimated temporal alignment, we can proceed to estimate the spatial transformation between the two sequences using these frame correspondences. Let p1(t1) = (x1, y1, 1)^T denote the homogeneous coordinates of the spatial part of the point p1 = (x1,y1,t1) in video V1. Similarly, p2(t2) = (x2, y2, 1)^T denotes the homogeneous coordinates of the spatial part of its NN p2 = (x2,y2,t2) in V2 (in frame t2 = r·t1 + Δt).

We consider two cases: 2D parametric alignment, and a 3D transformation using epipolar geometry. For each of these cases the geometric transformation T_spatial is a different model, described in detail in Sec. 5.2.1 and Sec. 5.2.2. However, the process of finding the parameters of T_spatial is common to both cases. We first find good correspondences (NNs) between corresponding frames across the videos, as described in Sec. 4. We then apply a modified version of RANSAC, using the following steps:

1. Based on the known parameters of the temporal alignment, find good correspondences between corresponding frames in V1 and V2, as described in Sec. 4.
2. Choose at random a subset of pairs of point correspondences.
3. Estimate candidate parameters for T_spatial on the selected subset of points.
4. Compute the error score (averaged over all the corresponding descriptors) for the estimated spatial transformation T_spatial.
5. Repeat steps (2), (3) and (4) N times.
6. Choose the estimated spatial transformation which obtained the lowest error score.

5.2.1 2D Parametric Alignment

When the centers of the cameras are relatively close to each other (compared to their distance from the scene), or when the scene is planar, a 2D parametric transformation suffices to model the spatial transformation between the two video sequences. The most general 2D parametric transformation which models these cases is a 2D projective transformation (a homography). We used a more limited transformation in our algorithm, a 2D affine transformation:

p2(t2) = p2(r·t1 + Δt) = A·p1(t1),   where A = [ a1 a2 a3 ; a4 a5 a6 ]   (12)

In this case there are 6 unknown spatial parameters: T_spatial = [a1, a2, a3, a4, a5, a6]. To estimate a candidate affine transformation A (Step (3)), we choose 3 pairs of points at random in Step (2), since each pair of points contributes 2 equations and there are 6 unknown parameters. In Step (4) we measure the error of the model as follows: first we apply the estimated affine transformation to the points in the first video V1 to get their locations in the second video V2, as described in Eq. 12. Next, we compute the error between the descriptors in video V1 and the corresponding descriptors in V2:

err(A) = Σ_{p1(t1) ∈ P} || d_{V1}[p1(t1)] − d_{V2}[A·p1(t1)] ||²   (13)

where P is the set of points in V1 whose descriptors are informative, and A·p1(t1) is their location after applying the affine transformation. d_{V1}[·] and d_{V2}[·] denote the descriptors taken from videos V1 and V2, respectively.
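A sketch of this modified RANSAC for the affine case. The descriptor-error term of Eq. 13 is abstracted into a callback desc_err(A) supplied by the caller; names and the iteration budget are illustrative:

```python
import numpy as np

def fit_affine(src, dst):
    """Least-squares 2D affine A (2x3) from 3+ point pairs; src, dst: (N, 2)."""
    X = np.hstack([src, np.ones((len(src), 1))])  # homogeneous (N, 3)
    M, *_ = np.linalg.lstsq(X, dst, rcond=None)   # solves X @ M = dst, M: (3, 2)
    return M.T                                    # A: (2, 3)

def ransac_affine(pts1, pts2, desc_err, iters=1000, seed=0):
    """Sample 3 correspondences, fit A, score it with the descriptor error
    of Eq. 13 (desc_err(A), supplied by the caller); keep the lowest score."""
    rng = np.random.default_rng(seed)
    best_A, best_score = None, np.inf
    for _ in range(iters):
        idx = rng.choice(len(pts1), 3, replace=False)
        A = fit_affine(pts1[idx], pts2[idx])
        score = desc_err(A)
        if score < best_score:
            best_A, best_score = A, score
    return best_A
```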
We tested our method on several videos (the full videos are available in our accompanying material). Our temporal-needle descriptor consisted of 3x3 patches and 3 temporal scales (s = 1, 1/2, 1/4). We experimented with the following types of videos:

1. Videos with non-rigid motion - Fig. 7 shows an example of temporal and spatial alignment of two videos of flags waving in the wind. We first found that the frame shift is −31 frames, which means that the first video V1 is 31 frames behind the second video V2. It can be observed in (a) that −31 is the temporal shift which obtains the minimum average NN error per descriptor. In this case, the integer frame shift already gave a satisfactory temporal alignment. Columns (b) and (c) show frames 207, 211 and 214 in V1 and V2, before the temporal and spatial alignments. Column (d) shows the spatial and temporal misalignment in these frames (taking the green band from one video, and the red and blue bands from the other). Column (e) shows the result after alignment (using the same visualization). The "true color" observed in column (e) indicates that we were able to obtain a good alignment between the 2 sequences, both in time and in space.

2. Videos with significant zoom difference -
Fig. 8 shows an example of aligning two videos with a zoom ratio of 1:3. In this case we recovered only the spatial alignment, since the videos were already synchronized in time. Each row in Fig. 8 shows a different pair of frames. Column (c) shows the results after the spatial alignment (green from one sequence, red and blue from the other). The "true color" obtained in the overlapping regions indicates accurate alignment.

3.
Multi-sensor alignment - the temporal-needle descriptor is appearance-invariant. We tested this property by aligning videos that present the same scene captured with different sensors. We used a short part of a video from Youtube that presents a person walking, where the scene is captured with 3 types of sensors (a regular daylight camera, a camera with a night-vision device, and a thermal camera). All 3 multi-sensor videos were successfully aligned by our algorithm. Fig. 9 shows the alignment results for two of the videos: one from a regular daylight sensor with high gain, and the second from a thermal sensor. Each type of sensor provides different information: in the thermal sensor (Fig. 9a) we can see more details of the walking person, while in the daylight sensor (Fig. 9b) the details in the background are clear and the person is visible even when walking behind the window. The result of fusing the two videos is shown in Fig. 9c, capturing the details from both sequences.

4.
Alignment of similar actions (from different scenes) - Fig. 10 shows an example from a short video taken behind the scenes of the movie "Dawn Of The Planet of The Apes". The first video presents an actor, and the second video is a synthesized video of an ape, generated by imitating the movements of the actor. While there is no single global affine transformation between the videos (since the head of the ape was modified differently than its body), we were still able to compute the best affine transformation between the two sequences.

Figure 7: Results of temporal and spatial alignment of non-rigid motion. (a) shows the average NN error per descriptor over the integer frame shift Δt′. The minimum average error is obtained for Δt′ = −31, which means that the first video V1 is 31 frames behind the second video V2. Columns (b) and (c) show frames 207, 211 and 214 in V1 and V2, before the temporal and spatial alignments. Column (d) shows the spatial and temporal misalignment in these frames (taking the green band from one video, and the red and blue bands from the other). (e) shows the result after alignment (using the same visualization). The "true color" observed indicates that we were able to obtain a good alignment between the 2 sequences, both in time and in space.

5.2.2 3D Alignment Using Epipolar Geometry

When the centers of the cameras are located far from each other and the scene is not planar, there is observable parallax between the videos. In this 3D case the spatial relation between the videos is expressed by an unknown 3x3 fundamental matrix F:

p2(r·t1 + Δt)^T · F · p1(t1) = 0

We estimate F using an implementation of the normalized 8-point algorithm of Hartley and Zisserman [22] (pages 281-282). To measure the error score of a candidate fundamental matrix (step (4) of the RANSAC algorithm), we evaluate the first-order approximation of the geometric error (the Sampson distance) of the fit of the fundamental matrix with respect to the set of matched points, as needed by RANSAC (Hartley and Zisserman [22], page 287).

We tested our method on different examples. Fig. 11 shows an example of an extreme wide baseline between the cameras. The videos display a basketball game captured by two cameras facing each other; each camera is visible in the video recorded by the other camera. The recovered temporal shift was 23.7 frames. The epipole of the first video V1 should fall on the image of the second camera in V1, and vice versa: the epipole of the second video should fall on the image of the first camera in V2. Fig. 11 shows that the estimated epipole (marked by a yellow plus sign) is very close to the true epipole (the location of the other camera).

Figure 11: An extreme baseline example (cameras facing each other): temporal alignment and recovery of the 3D spatial transformation of videos that capture the same basketball game with a wide baseline. We first aligned the videos temporally; the recovered frame shift was 23.7 frames. (c) displays the average error as a function of the integer frame shift Δt′. (a) and (b) display a pair of corresponding frames after the temporal alignment, showing 3 pairs of points and the epipolar lines that correspond to these points. The estimated epipole is marked by a yellow plus sign, and the true epipole is located at the position of the other camera (which is visible in the frame). (d) and (e) are zoom-ins of the regions around the cameras in both videos. As can be seen, the true and the estimated epipoles are very close.
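For completeness, a compact sketch of the two standard ingredients used here, the normalized 8-point algorithm and the Sampson distance, following the textbook formulation of [22] (this is not our exact implementation):

```python
import numpy as np

def normalize_pts(pts):
    """Hartley normalization: zero mean, mean distance sqrt(2)."""
    c = pts.mean(0)
    s = np.sqrt(2) / np.mean(np.linalg.norm(pts - c, axis=1))
    T = np.array([[s, 0, -s * c[0]], [0, s, -s * c[1]], [0, 0, 1]])
    return (T @ np.column_stack([pts, np.ones(len(pts))]).T).T, T

def eight_point(p1, p2):
    """Normalized 8-point algorithm: p2^T F p1 = 0 for (N, 2) matches, N >= 8."""
    x1, T1 = normalize_pts(p1)
    x2, T2 = normalize_pts(p2)
    A = np.column_stack([x2[:, 0] * x1[:, 0], x2[:, 0] * x1[:, 1], x2[:, 0],
                         x2[:, 1] * x1[:, 0], x2[:, 1] * x1[:, 1], x2[:, 1],
                         x1[:, 0], x1[:, 1], np.ones(len(p1))])
    F = np.linalg.svd(A)[2][-1].reshape(3, 3)  # null vector of A
    U, S, Vt = np.linalg.svd(F)                # enforce rank 2
    F = U @ np.diag([S[0], S[1], 0]) @ Vt
    return T2.T @ F @ T1                       # undo the normalization

def sampson(F, p1, p2):
    """First-order geometric error, used to score RANSAC candidates."""
    x1 = np.column_stack([p1, np.ones(len(p1))]).T
    x2 = np.column_stack([p2, np.ones(len(p2))]).T
    Fx1, Ftx2 = F @ x1, F.T @ x2
    num = np.sum(x2 * Fx1, axis=0) ** 2
    den = Fx1[0] ** 2 + Fx1[1] ** 2 + Ftx2[0] ** 2 + Ftx2[1] ** 2
    return num / den
```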
Figure 8: Videos with a large zoom difference, showing a ball thrown from side to side. We present 3 representative frames; each row in (a) Zoom-out and (b) Zoom-in presents a different pair of frames. (c) After alignment: the results after the spatial alignment (taking the green band from one video, and the red and blue bands from the other). The "true color" obtained in the overlapping regions indicates accurate alignment.

6 Action Detection

Let R(x,y,t) be a reference video, and Q(x,y,t) be a template query video that contains a query dynamic behavior. By action detection we refer to the ability to detect the space-time location of the template Q in the video R; we would like to find where the action took place in R. The query and the reference videos do not have to be of the same spatial size or temporal length.
Detecting an action is conceptually similar to aligning videos both in space and in time. We can detect an action by finding good correspondences between statistically significant descriptors in the query video and the descriptors in the reference video. Unlike the alignment case, here we are only interested in finding good correspondences in a smaller space-time region, for the desired action, and not in the entire video. Moreover, the match may occur at multiple positions in the reference video (e.g., if the action is repeated several times). The process of action detection is done as follows:

1. We find the informative descriptors in the template video Q, as described in Sec. 4.

2. For each informative descriptor d in the action template, we search for 15 NNs (nearest neighbors) in the video R. Each NN of d votes for a frame in R as a candidate center frame of the action, in the same way that d relates to the center frame of the query Q. For example, suppose an informative descriptor d ∈ Q is α frames after the center frame of Q; then every one of its NNs will vote for the frame α frames before its own frame in R as a candidate center frame of the detected action.

3. We define a frame as a detected action center if its score is a local maximum with a value larger than twice the average frame score. If an action occurs more than once, we are able to detect all of its occurrences. In all the cases we experimented with, when the action was not present at all, the scores of all frames were almost the same (none exceeded twice the average score) and no frame was detected. However, we can envision cases with false alarms; in such cases our method will detect the most similar action.

4. Once we have detected the center frame, we can display the correspondences that were found. We can further run spatial alignment, as described in Sec. 5.2.1, to find the affine transformation which maximizes the similarity between the template and the reference video (this applies only when the actions in Q and R were taken from similar 3D viewpoints).
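A sketch of the voting scheme of steps 2-3, assuming the NN search has already produced (offset, reference-frame) pairs; the representation is ours, for illustration:

```python
import numpy as np

def detect_action_centers(votes, num_frames):
    """votes: (query_offset, ref_frame) pairs, one per NN of an informative
    query descriptor. A NN found at ref_frame, whose query descriptor lies
    'offset' frames after the query center, votes for ref_frame - offset."""
    score = np.zeros(num_frames)
    for offset, ref_frame in votes:
        center = ref_frame - offset
        if 0 <= center < num_frames:
            score[center] += 1
    thresh = 2 * score.mean()  # local maxima above twice the average score
    peaks = [t for t in range(1, num_frames - 1)
             if score[t] > thresh
             and score[t] >= score[t - 1] and score[t] >= score[t + 1]]
    return peaks, score
```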
Figure 9: Multi-sensor alignment. (a) shows 3 frames from the video taken with a thermal video camera; in the last frame we cannot see the person, since he is behind the window. (b) shows the same 3 frames taken with a regular daylight camera; we can see fewer details on the person, but the background is clear. (c) shows the fused sequence using the estimated affine transformation; it provides the details from both videos.

We tested our method on several videos; in some cases we chose the query to be one instance of an action that occurs multiple times in the reference video. Figs. 12 and 13 display the results of two experiments. Their videos, as well as additional examples, can be found in our accompanying material.

Experiment with videos of a modern ballet - we chose as the query Q a short video segment of a male dancer performing a specific move. Fig. 12(a) shows a few frames from the move. The reference video R is a video of a female dancer that contains many moves. In our experiment we detected the action correctly with no false alarms, as can be seen in Fig. 12(b). Fig. 12(c) shows the corresponding frames of the action that was detected. The two dancers are supposed to perform the same dance, but in fact there are small variations in their performance; for example, the position and height of the arms differ between the dancers. Also, the viewpoint of the camera relative to the dancer is not identical for the two dancers. The algorithm overcame these small variations and was able to detect that the dancers perform similar moves.
Experiment with videos of tennis games - Fig. 13 displays the results of detecting a tennis serve in two different games. We chose as the query a serve from one tennis game; Fig. 13(a) shows three frames from the query action. The reference video R is a longer segment from a different tennis game with different players. Fig. 13(b) shows the score of every frame in the reference video as a candidate action center. Our algorithm detected the action twice (marked by yellow and orange), and Fig. 13(c) shows the center frames of the detected actions. In tennis, serve and smash hits are very similar, therefore we refer to them as the same action; in this example the first detection (yellow frame) is a smash and the second is a serve. The algorithm detected the actions correctly, with no false alarms and no mis-detections.

Figure 10: Example of spatial alignment between two videos that present the same event. (a) Video V1 and (b) Video V2 show 3 corresponding frames, one video of an actor and one of an ape, performing the same action. While there is no single global affine transformation between the videos (since the head of the ape was modified differently than its body), we were still able to compute the best affine transformation between the two sequences. (c) Alignment result: taking the red and blue bands from the actor's video, and the green band from the ape's video. (d) Fused result: the fusion between the videos.

7 Video Clustering

"Clustering by Composition" [27] partitions a collection of images into clusters of similar image categories, based on affinities between the images in the collection. The affinity between two images is computed by finding large, non-trivial shared regions between the images. In our work we extend the algorithm to videos: combining it with the temporal-needle descriptor, we are able to automatically discover categories in a collection of unlabeled videos. Our method contains two steps: (1) building an affinity matrix that reflects the relations between the videos, based on shared space-time volumes between them; the affinities between videos are built on top of our temporal needle. (2) Partitioning the videos into clusters based on the affinity matrix, using N-cuts [29].

The main key to building the affinity matrix is the region growing algorithm. We slightly modified the region growing of [27] to space-time, to handle videos instead of images. For full details and proofs we refer the reader to "Clustering by Composition" [27], Sec. 4. Let R be a 3D (space-time) shared region (of unknown size and shape) between videos V1 and V2. Let F denote the number of frames in V1 and V2, and N the number of pixels per frame. Denote by R1 and R2 the instances of R in V1 and V2, respectively. The goal of the algorithm is to find for each descriptor d1 ∈ R1 its matching descriptor d2 ∈ R2. The algorithm is composed of two steps:

The sampling step - in this step, every descriptor d1 ∈ V1 randomly samples S positions (x,y,t) in V2 and chooses its best matching descriptor among them. The run time of this step is O(SNF).
The propagation step - each descriptor chooses between its best match from the sampling step and the matches proposed by its spatio-temporal neighbors. This is achieved by sweeping the video four times: two spatial sweeps (for each frame, once from top to bottom and once from bottom to top) and two temporal sweeps (once from beginning to end and once from end to beginning). The run time of this step is O(NF).

Time complexity - the overall running time of the algorithm is O(SNF), namely, linear in the size of the video. According to the theory in [27], in order to detect a region of size |R| with probability p ≥ (1 − δ), the required number of samples is S = (NF/|R|)·log(1/δ). For example, if we assume that the shared space-time region of the action covers 10% of the spatial size and 10% of the temporal size, the shared region is 1% of the entire video. Hence, for δ = 2% (probability of detection p ≥ 98%), S = 100·log 50 ≈ 390 samples per descriptor suffice.
Figure 12: Action detection in a dance video. We selected a single dance move performed by a male dancer as a template. (a) shows a few frames from the action. (b) shows the score assigned to every frame in the reference video, signifying how likely it is to be the center frame of the detected action. Our algorithm correctly detected the action, with no false alarms and no mis-detections. (c) shows the corresponding frames of the action that was detected.

Figure 13: Action detection in a sports video. We selected a tennis serve as an action template. (a) shows three frames from the action. (b) shows the score assigned to every frame in the reference video, signifying how likely it is to be the center frame of the detected action. (c) shows the center frames of two detected occurrences of the action. Our algorithm correctly detected the actions, with no false alarms and no mis-detections.

This region growing algorithm is used within a collaborative video clustering scheme. For full details we refer to "Clustering by Composition" [27], Sec. 5. We briefly describe the algorithm below.

Clustering algorithm: we start with uniform random sampling: each descriptor randomly samples S positions spread across the videos in the collection (where the number of samples accounts for the number of clusters C). At each iteration, shared space-time regions are found using the region growing algorithm, inducing connections between videos in the collection. These connections induce a sparse set of affinities between the videos. The sampling density distribution of each video is then updated according to the affinities. For example, if video V1 has high affinity with videos V2 and V3, then in the next iteration videos V2 and V3 will be encouraged to sample more in each other. This results in a "guided" sampling, exploiting the "wisdom of crowds of videos". Finally, after several such iterations, the resulting affinities are fed to the N-cut algorithm [29], to partition the videos into the desired C clusters.
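A sketch of the final partitioning step, using the standard spectral relaxation of N-cuts on a toy affinity matrix (the region-growing affinities are replaced here by a synthetic W):

```python
import numpy as np
from scipy.linalg import eigh
from scipy.cluster.vq import kmeans2

def ncut_partition(W, C):
    """Spectral relaxation of N-cuts [29]: embed each video with the smallest
    generalized eigenvectors of (D - W) v = lambda * D v, then cluster."""
    D = np.diag(W.sum(axis=1))
    _, vecs = eigh(D - W, D)       # ascending generalized eigenvalues
    embedding = vecs[:, 1:C + 1]   # skip the trivial constant eigenvector
    _, labels = kmeans2(embedding, C, minit='++', seed=0)
    return labels

# Toy affinity: two 4-video groups with strong in-group affinities.
W = np.full((8, 8), 0.05)
W[:4, :4] = W[4:, 4:] = 1.0
np.fill_diagonal(W, 0.0)
print(ncut_partition(W, C=2))  # e.g. [0 0 0 0 1 1 1 1]
```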
Time and memory complexity - let M be the number of videos in the collection and T the number of iterations. Each iteration runs the region growing algorithm for every video and re-estimates the sampling density distribution for the next iteration. Therefore, each iteration takes O(MNF), and overall the algorithm takes O(TMNF). However, since the affinity matrix should be sparse, the number of iterations is typically small, T ≈ log M; thus the runtime is O(MNF·log M). During the computation we hold the descriptors of all the videos in the collection in memory; therefore, the memory is O(MNF). The time and memory complexity lead to the main limitation of our algorithm for video clustering: the memory and run time are both proportional to the number of videos in the collection and to their size. This restricted us to using small collections of short videos in our experiments.

We tested our algorithm on two collections of videos that we created. Most of the videos were taken from Youtube, and some were taken from the UCF-Sports dataset [30]. We downloaded videos from Youtube because most action recognition datasets contain very short videos that focus on distinguishing single, simple actions, whereas we were interested in using longer and more complex videos, so that we can find interesting common space-time regions across videos in the same category.

(a) Judo and Karate collection - we collected 14 Karate and 14 Judo videos from Youtube. These two martial arts have similar spatial appearances, and vary in their specific unique movements; the unique movements are usually detected as the shared space-time regions between videos in the same cluster. The videos' spatial frame size is 360x480 and their length varies between 2 and 5 seconds. We ran our algorithm with only 5 iterations, and it correctly assigned 25/28 videos, resulting in 89.3% mean purity. The results are presented in Fig. 14.

(b) Skateboarding and Walking collection - this collection includes 15 skateboarding and 15 walking videos. Some of the videos were taken from the UCF-Sports dataset [30], but most of them were downloaded from Youtube. The spatial frame size varies between the videos, and their length is between 2 and 5 seconds. Fig. 15 displays the results: the algorithm correctly clustered 27 out of 30 videos, resulting in a mean purity of 90%.

Figure 14: Clustering results on a small collection of Karate and Judo videos we collected from Youtube. The collection contains 14 Karate and 14 Judo videos, all presented in the figure. Videos that were assigned to the wrong cluster are marked by a red rectangle. We correctly assigned 25 out of 28 videos, resulting in a mean purity of 89.3%.

Figure 15: Clustering results on a small collection of skateboarding and walking videos we collected from Youtube and the UCF-Sports dataset. The collection contains 30 videos: 15 skateboarding and 15 walking. One frame of each video is presented in the figure. The videos assigned to the wrong cluster are marked by a red rectangle. The algorithm correctly assigned 27 out of 30 videos, resulting in a 90% mean purity.
8 Conclusion

In this paper we presented the "Temporal-Needle", a video descriptor which captures dynamic behavior while being invariant both to appearance and to viewpoint. We showed how using this descriptor gives rise to the detection of the same dynamic behavior across videos in a variety of scenarios. In particular, we demonstrated the use of the descriptor in tasks such as sequence-to-sequence alignment under complex conditions, action detection, and video clustering for unsupervised discovery of video categories.
References

[1] Moshe Blank, Lena Gorelick, Eli Shechtman, Michal Irani, and Ronen Basri. Actions as space-time shapes. In The Tenth IEEE International Conference on Computer Vision (ICCV'05), pages 1395-1402, 2005.

[2] Aaron F. Bobick and James W. Davis. The recognition of human movement using temporal templates. IEEE Trans. Pattern Anal. Mach. Intell., 23(3):257-267, March 2001.

[3] Kong Man Cheung, Simon Baker, and Takeo Kanade. Shape-from-silhouette of articulated objects and its use for human body kinematics estimation and motion capture. In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, June 2003.

[4] Keith Forbes, Fred Nicolls, Gerhard de Jager, and Anthon Voigt. Shape-from-silhouette with two mirrors and an uncalibrated camera. In
Computer Vision - ECCV 2006, 9th European Conference on Computer Vision, Graz, Austria, May 7-13, 2006, Proceedings, Part II, pages 165-178, 2006.

[5] Gil Ben-Artzi, Michael Werman, and Shmuel Peleg. Event retrieval using motion barcodes. In Image Processing (ICIP), 2015 IEEE International Conference on, pages 2621-2625. IEEE, 2015.

[6] Gil Ben-Artzi, Michael Werman, and Shmuel Peleg. Epipolar geometry from temporal signatures and dynamic silhouettes. CoRR, abs/1506.07866, 2015.

[7] Ronald Poppe. A survey on vision-based human action recognition. Image Vision Comput., 28(6):976-990, June 2010.

[8] Guangchun Cheng, Yiwen Wan, Abdullah N. Saudagar, Kamesh Namuduri, and Bill P. Buckles. Advances in human action recognition: A survey. CoRR, abs/1501.05964, 2015.

[9] Ivan Laptev. On space-time interest points. Int. J. Comput. Vision, 64(2-3):107-123, September 2005.

[10] Navneet Dalal and Bill Triggs. Histograms of oriented gradients for human detection. In Cordelia Schmid, Stefano Soatto, and Carlo Tomasi, editors, International Conference on Computer Vision & Pattern Recognition, volume 2, pages 886-893, INRIA Rhône-Alpes, ZIRST-655, av. de l'Europe, Montbonnot-38334, June 2005.

[11] Navneet Dalal, Bill Triggs, and Cordelia Schmid. Human detection using oriented histograms of flow and appearance. In Proceedings of the 9th European Conference on Computer Vision - Volume Part II, ECCV'06, pages 428-441, Berlin, Heidelberg, 2006. Springer-Verlag.

[12] David G. Lowe. Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vision, 60(2):91-110, November 2004.

[13] Alexander Kläser, Marcin Marszałek, and Cordelia Schmid. A spatio-temporal descriptor based on 3d-gradients. In British Machine Vision Conference, pages 995-1004, September 2008.

[14] Svetlana Lazebnik, Cordelia Schmid, and Jean Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Volume 2, CVPR '06, pages 2169-2178, Washington, DC, USA, 2006. IEEE Computer Society.

[15] Jingen Liu, Yang Yang, Imran Saleemi, and Mubarak Shah. Learning semantic features for action recognition via diffusion maps.
Computer Vision and Image Understanding, 116(3):361-377, 2012.

[16] Yaron Caspi, Denis Simakov, and Michal Irani. Feature-based sequence-to-sequence matching. Int. J. Comput. Vision, 68(1):53-64, 2006.

[17] Yaron Ukrainitz and Michal Irani. Aligning sequences and actions by maximizing space-time correlations. In Proceedings of the 9th European Conference on Computer Vision - Volume Part III, ECCV'06, pages 538-550, Berlin, Heidelberg, 2006. Springer-Verlag.

[18] Eli Shechtman and Michal Irani. Matching local self-similarities across images and videos. In IEEE Conference on Computer Vision and Pattern Recognition 2007 (CVPR'07), June 2007.

[19] Imran N. Junejo, Emilie Dexter, Ivan Laptev, and Patrick Perez. View-independent action recognition from temporal self-similarities. IEEE Trans. Pattern Anal. Mach. Intell., 33(1):172-185, January 2011.

[20] Orit Kliper-Gross, Yaron Gurovich, Tal Hassner, and Lior Wolf. Motion interchange patterns for action recognition in unconstrained videos. In European Conference on Computer Vision (ECCV), October 2012.

[21] Lahav Yeffet and Lior Wolf. Local trinary patterns for human action recognition. In ICCV, pages 492-497. IEEE, 2009.

[22] R. I. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, ISBN: 0521540518, second edition, 2004.

[23] Noah Snavely, Steven M. Seitz, and Richard Szeliski. Photo tourism: Exploring photo collections in 3d. In ACM SIGGRAPH 2006 Papers, SIGGRAPH '06, pages 835-846, New York, NY, USA, 2006. ACM.

[24] Herbert Bay, Andreas Ess, Tinne Tuytelaars, and Luc Van Gool. Speeded-up robust features (SURF). Comput. Vis. Image Underst., 110(3):346-359, June 2008.

[25] Or Lotan and Michal Irani. Needle-match: Reliable patch matching under high uncertainty. In
The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.

[26] Oren Boiman and Michal Irani. Similarity by composition. In NIPS, pages 177-184. MIT Press, 2006.

[27] Alon Faktor and Michal Irani. Clustering by composition - unsupervised discovery of image categories. European Conference on Computer Vision (ECCV), October 2012.

[28] Claude E. Shannon. A Mathematical Theory of Communication. The Bell System Technical Journal, 27(3):379-423, 1948.

[29] Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. IEEE Trans. Pattern Anal. Mach. Intell., 22(8):888-905, August 2000.

[30] UCF-Sports dataset. http://crcv.ucf.edu/data/UCF_Sports_Action.php