Improving Human Activity Recognition Through Ranking and Re-ranking
Zhenzhong Lan, Shoou-I Yu, Alexander G. Hauptmann
Carnegie Mellon University
lanzhzh, iyu, [email protected]
Abstract
We propose two well-motivated ranking-based methods to enhance the performance of current state-of-the-art human activity recognition systems. First, as an improvement over the classic power normalization method, we propose a parameter-free ranking technique called rank normalization (RaN). RaN normalizes each dimension of the video features to address the sparse and bursty distribution problems of Fisher Vectors and VLAD. Second, inspired by curriculum learning, we introduce a training-free re-ranking technique called multi-class iterative re-ranking (MIR). MIR captures relationships among action classes by separating easy and typical videos from difficult ones and re-ranking the prediction scores of classifiers accordingly. We demonstrate that our methods significantly improve the performance of state-of-the-art motion features on six real-world datasets.
1. Introduction
It is a challenging task to recognize human activities in videos, especially unconstrained Internet videos with large visual diversity. A typical human activity recognition pipeline is composed of the following three steps: (i) extract local video descriptors (e.g., STIP [17], IDT [27], or deep learning features [28]); (ii) encode and pool the local descriptors into video representations (e.g., Fisher Vectors [24] or VLADs [1]); and (iii) classify the video representations (e.g., with an SVM). The basic components of the current state-of-the-art human activity recognition system, for example, are Improved Dense Trajectories (IDT), Fisher Vectors, and SVMs. In this paper, we introduce two ranking-based methods to improve the performance of this state-of-the-art pipeline.

The first proposed method tackles the sparse [24] and bursty [1] distribution problems of Fisher Vectors [24] and VLADs [1]. A common issue of Fisher Vector and VLAD encoding is that they generate high-dimensional sparse and bursty video representations whose similarity cannot be accurately assessed by linear measurements. The sparse distribution, i.e., most of the values in the representation are close to zero, comes from the fact that as the codebook size increases, fewer local descriptors are assigned to each codebook entry. A bursty distribution, i.e., the values of a few dimensions dominate the representation, occurs when repeated patterns in a video generate a few artificially large components in the representation. To deal with these problems, Perronnin et al.
[24] invented a simple Power Normalization (PN) method to disperse the video representations. PN uses an element-wise power operation to discount large values and increase small values of the video representations. As one of the most significant improvements in the past few years, this simple algorithm essentially makes Fisher Vectors and VLADs useful in practice, and has been widely applied by the research community to both handcrafted [17, 27] and deeply-learned features [28, 29]. However, PN can only alleviate the sparse and bursty distribution problems. It is often difficult to decide how much dispersal each task requires, and there is no theoretical justification for using the signed square root [24, 1] as a rule of thumb. Such being the case, we propose Rank Normalization (RaN), which is parameter-free and thoroughly addresses the sparse and bursty distribution problems. RaN ranks all video representations in a dataset along each dimension and uses the normalized rankings in place of the original video representations for the subsequent classification. We show that RaN also works in scenarios where not all data are available: an approximation method that requires only a subset of the video representations gives performance similar to the exact ranking.

The second method we suggest is inspired by curriculum learning, which mimics the human learning scheme and has become popular for image and video classification [3, 5, 11, 6]. Curriculum learning [3, 14] suggests distinguishing easy and typical samples from difficult ones and treating them separately. Traditional curriculum learning methods [3, 14, 2] rely on humans or extra resources to define data with difficulty levels (a curriculum). Instead, we use freely-available classifiers from other action classes to define the curriculum. We then re-rank the prediction results to promote easy videos and suppress difficult ones. Our proposed method, called Multi-class Iterative Re-ranking (MIR), is easy to implement and training-free.

In the remainder of this paper, we review relevant work on improving feature encoding, mostly associated with Fisher Vectors and VLADs. We also briefly introduce recent studies on capturing relationships among multiple action classes and on curriculum learning. We then describe our two proposed methods in detail and demonstrate their performance gains over the baseline approach (composed of IDT, Fisher Vectors, and SVMs) on the action recognition task, represented by the Hollywood2 and Olympic Sports datasets. Next, we combine the two methods and report results on both action recognition and event detection [15] tasks. A conclusion and a discussion of future work follow in the end.
2. Related Work
Features and encoding methods are the major sources of breakthroughs in conventional video representations. Among them, trajectory-based approaches [27, 12], especially Dense Trajectories (DT) and IDT [26, 27], together with Fisher Vector encoding, are the basis of current state-of-the-art algorithms.

Fisher Vectors and VLADs are very similar encoding methods [1], and both have been popular for image and video classification [27, 24, 1, 28, 19]. In the original scheme [23, 10] they either do not require post-processing [23] or use only L2 normalization [10]. Although L2 normalization can reduce the influence of background information and transform the linear kernel into an L2 similarity measurement [24], it does not disperse the data. As a result, the original Fisher Vector encoding method showed inconclusive results compared to other state-of-the-art encoding methods [23]. It is the introduction of PN [24, 1] that significantly improved the performance of these encoding methods and thus made them useful in practice. PN alleviates the sparse and bursty distribution problems of Fisher Vectors and VLADs. Later on, as a special design for VLAD, Arandjelovic and Zisserman [1] proposed intra-normalization to further reduce the bursty distribution problem of VLADs. Jégou and Chum [10] used PCA to decorrelate a low-dimensional representation and adopted multiple clusterings to reduce the quantization errors of VLADs. Nonetheless, these methods only alleviate the burstiness problem; the sparsity of the encoded descriptors still depends on that of the original data. Unlike the above approaches, RaN normalizes each dimension of the Fisher Vectors to a distribution close to uniform, regardless of how sparse the original data are.

Given the encoded features, state-of-the-art methods often use 'one versus rest' SVMs, which do not consider the relationships among action classes.
To model those relationships, Bergamo & Torresani [4] suggested a meta-class method for identifying related image classes based on misclassification errors from a validation set. Hou et al. [8] identified similar class pairs and grouped them together to train 'two versus rest' classifiers. By combining 'two versus rest' with 'one versus rest' classifiers, they observed significant improvements over baselines. Unlike the aforementioned approaches, which require training and modify the predictions only once, MIR is training-free and iteratively updates the prediction rankings given those from previous iterations.

Our MIR model is inspired by a learning paradigm called curriculum learning, proposed by Bengio et al. [3]. Curriculum learning rates the difficulty levels of classifying samples and uses the rating as a 'curriculum' to guide learning. From a human behavioral perspective, this way of learning is considered similar in principle to human learning [13]. Moreover, just like school curriculum design in everyday life, it relies on humans or other data resources to define the curriculum. Our method instead uses freely available classifier predictions from other action classes to help define the curriculum without human intervention.
3. Action Recognition Preliminaries
Problem formulation
Given a collection of short video clips that usually last a few seconds, the goal of an action recognition task is to classify them into actions of interest, such as running and kissing, solely based on the video content. By evaluating on this task, we dig deep into the characteristics of RaN and MIR.
Benchmark datasets
We rely on two widely used action recognition benchmark datasets: Hollywood2 [18] and Olympic Sports [20].

The Hollywood2 dataset [18] contains 12 action classes and 1707 video clips collected from 69 different Hollywood movies. The 12 action classes include answering a phone, driving a car, and standing up. Each video in this dataset may contain multiple actions. We use the clean training dataset and the standard split into training (823 samples) and test (884 samples) videos provided in [18]. The performance is evaluated by computing the average precision (AP) for each action class and reporting the mean AP (mAP) over all classes.

The Olympic Sports dataset [20] consists of athletes practicing 16 sports such as high-jump, pole-vault, and basketball lay-up. It has a total of 783 video clips. We use the standard split with 649 training clips and 134 test clips and report mAP as in [20] for comparison purposes.
Experimental settings
We follow the experimental settings in [27]. More specifically, we use IDT features extracted using 15-frame tracking with camera motion stabilization. PCA is utilized to reduce the dimensionality of the IDT descriptors by a factor of two. After reduction, the local descriptors are augmented with three-dimensional normalized location information [16]. Fisher Vector encoding maps the raw descriptors onto a Gaussian Mixture Model with 256 Gaussians, trained from a set of 256,000 randomly sampled data points. Classification is conducted by a 'one versus rest' linear SVM classifier with a fixed C = 100, as in [27]. For PN, we use the signed square root (α = 0.5) unless otherwise stated.
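For reference, the settings above can be collected into a single configuration object. The field names below are our own (the paper defines no such structure), and the values are those stated in the text; the helper reflects the standard Fisher Vector dimensionality of 2KD (mean and variance gradients for K Gaussians over D-dimensional descriptors).

```python
# Hypothetical configuration mirroring the experimental settings in the text.
IDT_PIPELINE = {
    "tracking_frames": 15,           # IDT 15-frame tracking
    "camera_stabilization": True,    # camera motion stabilization
    "pca_dim_factor": 2,             # PCA halves descriptor dimensionality
    "location_augmentation": 3,      # 3-D normalized location info [16]
    "gmm_components": 256,           # Fisher Vector GMM size
    "gmm_training_samples": 256_000, # randomly sampled training points
    "svm": {"type": "one-vs-rest linear", "C": 100},
    "pn_alpha": 0.5,                 # signed square root for PN
}

def fisher_vector_dim(descriptor_dim, k=IDT_PIPELINE["gmm_components"]):
    """Standard Fisher Vector size: 2 * K * D (mean and variance gradients)."""
    return 2 * k * descriptor_dim
```

For example, a 48-dimensional local descriptor encoded with 256 Gaussians yields a 24576-dimensional Fisher Vector per video.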
4. Rank Normalization (RaN)
As discussed in [24, 1], Fisher Vector encoding often generates high-dimensional sparse and bursty video representations, whose similarity cannot be accurately quantified by a linear kernel or an L2 similarity measurement. To address this problem, Perronnin et al. [24] introduced PN, which applies the following element-wise operation:

f(z) = sign(z) |z|^α, where 0 ≤ α ≤ 1.

PN can only alleviate the sparse and bursty distribution problems, and it is difficult to determine a good α for different tasks. To overcome this weakness, we propose RaN, which is parameter-free and handles the bursty distributions in a more fundamental way. RaN applies the following function to each dimension of the Fisher Vectors:

f(z) = rank(z) / N,

where rank(z) is z's position after sorting the values of that dimension across all N Fisher Vectors in the dataset. After RaN, the values in each dimension of the Fisher Vectors are spread out and have a distribution close to uniform.
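The two normalizations above can be sketched in a few lines of NumPy. This is an illustrative implementation, not the authors' code; the matrix layout (rows are videos, columns are Fisher Vector dimensions) and the variable names are our own.

```python
import numpy as np

def power_normalize(X, alpha=0.5):
    """Element-wise Power Normalization: f(z) = sign(z) * |z|^alpha."""
    return np.sign(X) * np.abs(X) ** alpha

def rank_normalize(X):
    """Rank Normalization (RaN): replace each value by its rank along
    its dimension (column), divided by the number of videos N."""
    N = X.shape[0]
    # argsort twice yields, per column, each value's 0-based rank.
    ranks = X.argsort(axis=0).argsort(axis=0)
    return (ranks + 1) / N  # normalized ranks in (0, 1]

# Toy Fisher Vectors for 4 videos with 3 dimensions.
fv = np.array([[0.0, -0.2, 0.01],
               [0.5,  0.0, 0.02],
               [0.1,  0.3, 0.00],
               [0.9, -0.1, 0.03]])
print(power_normalize(fv, alpha=0.5))
print(rank_normalize(fv))  # each column becomes a permutation of {0.25, 0.5, 0.75, 1.0}
```

Note how the output of `rank_normalize` depends only on the ordering within each dimension, which is why the result is close to uniform regardless of how sparse or bursty the input values are.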
To qualitatively characterize the differences between PN and RaN, we visualize their results from different perspectives using all Fisher Vectors in the Hollywood2 dataset. Results are shown in Figures 1, 2, 3, and 4. Figure 1 displays the distributions of Fisher Vector values and their standard deviations along each dimension. As evidenced by Figures 1a and 1b, the distributions of L2-normalized Fisher Vector values are indeed sparse and bursty. These sparse and bursty representations have cosine similarities that cluster around zero (Figure 2a). PN disperses the data and reduces the probability of zero cosine similarities (Figure 2b). However, the distributions after PN are still somewhat sparse (Figure 1c) and bursty (Figure 1d), and the cosine similarities remain centered around zero (Figure 2b). Conversely, RaN disperses the data to an extreme (Figures 1e, 1f) and completely removes zero cosine similarities (Figure 2c).

Figure 1: The effect of different normalization methods on the sparse and bursty distribution problems of Fisher Vectors. Plots on the left (1a, 1c, 1e) are the distributions of all the values in the Fisher Vectors across all videos in Hollywood2. On the right (1b, 1d, 1f) are the standard deviations of the values in each dimension. L2 normalization serves as a baseline method and is applied after PN and RaN.

In Figure 3, we also compare the normalization effects on the first dimension of the Fisher Vectors on the Hollywood2 dataset. The x-axis shows the original values and the y-axis the corresponding values after normalization. For better visualization, we normalize all the curves so that their values are between -1 and 1. Basically, PN returns the original values when α = 1 and becomes a step function as α → 0. When 0 < α < 1, PN carries out a mapping like a sigmoid activation function: it magnifies the values around zero and allows them to occupy more of the y-axis space. Through this magnification, PN assumes that values around zero are more important than their original magnitudes indicate. However, this assumption provides no information about how much magnification is sufficient. In contrast, RaN treats all Fisher Vector values as equally important in calculating the similarities between video representations.

Figure 2:
The cosine similarities of videos under different normalization methods.
The blue solid lines show the distributions of cosine similarities between positive samples, and the red dashed lines show the distributions of cosine similarities between positive and negative samples.

Figure 3:
Comparison of the effects of PN and RaN on the first dimension of the Fisher Vectors in Hollywood2.
For better visualization, we normalize all the curves so that their values are between -1 and 1.

Another interesting observation is that PN cannot dynamically adjust its mapping according to the distribution of values in different sign spaces, whereas RaN can. We see from Figure 3 that only a small subset of the data is negative, while in the PN case, data with negative values occupy as much space as data with positive values. Through its dynamic mapping, RaN does much better than PN at spreading out the data (Figure 4). The x-axis of Figure 4 shows the original distribution of L2 distances in the first dimension of the Fisher Vectors, and the y-axis shows the corresponding distances after normalization. Evidently, after PN, most of the L2 distances are still cluttered around zero, while after RaN, the distribution of L2 distances spreads out significantly and becomes much more separable.

Figure 4:
Comparison of L2 distances before and after normalization on the first dimension of the Fisher Vectors in Hollywood2.
In Table 1, we compare the results of different normalization methods on both the Hollywood2 and Olympic Sports datasets. As in Figure 1, L2 normalization serves as a baseline method and is applied after PN and RaN. First, PN improves the performance of the baseline method on both datasets by more than 6% absolute. These improvements are significant given the difficulty of the task. RaN further improves on PN by 1.6% on Hollywood2 and 2.9% on Olympic Sports. Compared to the baseline method, RaN gives a 9.8% absolute improvement on a difficult dataset like Hollywood2; on a relatively easy one like Olympic Sports, which has less room for improvement, RaN still gains more than 9% absolute. We also find that applying RaN to the local descriptors within each video and combining the rankings with the original descriptors (two-level RaN) can further improve the performance on Hollywood2. We hypothesize that this is related to removing the background distribution, as suggested in [30], but here we do not further investigate this connection.

                    Hollywood2   Olympic Sports
L2 Normalization      57.9%          83.0%
PN                    66.1%          89.4%
RaN                   67.7%          92.3%
Two-level RaN

Table 1:
Performance comparison of different normalization methods on the action recognition datasets.
L2 normalization is the baseline method and has also been applied after PN and RaN.

S    Hollywood2 (%)    Olympic Sports (%)

Table 2:
Comparison of different subset sizes S for RaN.
Each experiment is repeated 10 times, and the mean values and standard deviations are shown.

Figure 5 shows the per-class comparison with and without RaN. Most remarkably, RaN improves the baseline method on all 12 actions in the Hollywood2 dataset. On some of the hard classes, such as 'HandShake', RaN improves the baseline results by a large margin. A similar trend can be seen on the Olympic Sports dataset: compared to the baseline method, RaN either improves or delivers similar results on 15 out of 16 actions. These per-class performance comparisons show that our improvements are robust and significant.
The exact ranking in RaN requires comparing all the Fisher Vectors in a dataset, which may not be available in many application scenarios, such as online learning. Given the high dimensionality of the data, we conjecture that exact ranking may not be necessary. To test this hypothesis, we randomly choose a small subset of Fisher Vectors of size S as seed vectors and compare each Fisher Vector against the seed vectors to obtain an approximate ranking. S = 1 is equivalent to a binarization of the Fisher Vectors. We experiment with different S on the Hollywood2 and Olympic Sports datasets and repeat each experiment 10 times. The results are shown in Table 2. Surprisingly, a subset size S as small as 5 is good enough to achieve results similar to the exact ranking. We also observe better performance on Olympic Sports after using approximate ranking. These results show that fine-grained ranking information is not necessary for RaN.

Figure 5:
Per-class performance comparison of the baseline with and without RaN.
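The seed-vector approximation can be sketched as follows. This is our illustrative reading of the procedure, not the released code: each value's approximate rank is the count of seed vectors it exceeds in that dimension, normalized by S, so S = 1 reduces to a binarization of the Fisher Vectors.

```python
import numpy as np

def approximate_rank_normalize(X, S=5, rng=None):
    """Approximate RaN: rank each value against S randomly chosen seed
    vectors instead of against all N Fisher Vectors in the dataset."""
    rng = np.random.default_rng(rng)
    seeds = X[rng.choice(X.shape[0], size=S, replace=False)]  # (S, D)
    # For every entry, count how many seed values it exceeds per dimension,
    # then normalize the counts to [0, 1].
    counts = (X[:, None, :] > seeds[None, :, :]).sum(axis=1)
    return counts / S

# Toy data: 100 "videos" with 8-dimensional representations.
X = np.random.default_rng(0).normal(size=(100, 8))
approx = approximate_rank_normalize(X, S=5, rng=0)
```

With small S, each entry can only take S + 1 distinct values per dimension, which is exactly the coarse, quantized ranking the experiments suggest is sufficient.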
5. Multi-class Iterative Re-ranking (MIR)
As mentioned previously, traditional curriculum learning [3] often relies on humans or extra data sources to define the curriculum: easy and typical samples versus difficult ones. In this paper, we instead rely on the classifiers of other classes, which are freely available, to define the curriculum. This new way of defining the curriculum captures relationships among multiple human activity classes.

As depicted in Figure 6, we rank the videos from easy to difficult based on the classifier values. Easier videos have a much faster roll-off rate of sorted classification predictions. According to this ranking, we discover that videos exhibiting some combination of the following three scenarios are more likely to be ranked on the difficult end of the scale:

• Noisy background motions. If the background of a video contains noisy motions and the target action has a small and weak signal, the target action will be obscured. For example, the HandShake videos in Figure 6b are obscured by background motions and are much more difficult to detect than the HandShake videos in Figure 6a.

• Multiple actions. If the subject performs multiple actions, the classifiers will be confused because the video features represent a mixture of the actions. For example, the Kiss video in Figure 6b contains both Kiss and Hug actions (the latter appears in subsequent frames), while the Kiss examples in Figure 6a contain only the Kiss action itself.

• Ill-defined actions. If the target action is ill-defined by itself, the video will be more difficult to classify. For example, the 'SitUp' action is often encapsulated in the 'StandUp' action, which makes it harder to classify.

Figure 6: Illustration of easy and typical videos versus difficult ones. (a) Easy videos; the example actions are HandShake (left), Kiss (middle), and HugPerson (right). (b) Difficult videos; the example actions are HandShake (left), Kiss (middle), and SitUp (right). The bar charts show the sorted predictions of all the classifiers on the example videos, normalized so that the values are between 0 and 1. As shown, the predictions for typical easy videos often have one or two dominant classes, while the predictions for difficult videos have much smoother score distributions.

Based on the difficulty scale, we design MIR to re-rank the predictions of the classifiers. The intuition behind MIR is that if videos are more difficult to classify, their predictions are less reliable and their rank should be lowered; if videos are easy and typical, their predictions are more reliable and the videos should be ranked higher. The re-ranking algorithm is easy to implement and fast to run. As shown in Algorithm 1, given a score matrix P ∈ R^(N×K) that contains K classifiers' predictions on N videos, we update each score P_{i,j} iteratively by looking at the other classifiers' predictions on the same video and reducing the score using those predictions. The reduction is carried out by first sorting the other classifiers' predictions {P^(w)_{i,1}, ..., P^(w)_{i,K}} \ P^(w)_{i,j} in descending order and then subtracting a weighted sum of the sorted scores from P_{i,j}. We use exponentially decaying weights; the weighting coefficient β and the annealing parameter η are set to 1 and 0.5, respectively, throughout the paper.

Algorithm 1: Multi-class Iterative Re-ranking (MIR)
Input: the prediction scores for K classes, P ∈ R^(N×K); re-ranking annealing parameter η; re-ranking weighting coefficient β > 0; total iteration steps W.
Init: P^(0) = P
for w = 0, 1, ..., W − 1 do
    for each instance index i ∈ {1, ..., N} and class index j ∈ {1, ..., K}:
        Δ^(w)_{i,j} = sort({P^(w)_{i,1}, ..., P^(w)_{i,K}} \ P^(w)_{i,j}, descending)
        P^(w+1)_{i,j} = P^(w)_{i,j} − η^w · Σ_{r=1}^{K−1} e^{−βr} Δ^(w)_{i,j}(r)
end for
Output: P^(W)

As shown in Figure 7, MIR typically converges within 3 or 4 iterations. It improves more than 2% over the baseline method on both datasets. We also show the per-class comparison with and without MIR in Figure 8. For Hollywood2, MIR improves upon the baseline results on 11 out of 12 actions; for Olympic Sports, MIR improves or achieves similar results on 14 out of 16 actions. Unlike the improvements of RaN, the improvements of MIR are more evenly distributed among the action classes. These per-class performance comparisons again show that the improvements from MIR are robust and significant.

Figure 7: MIR performance versus the number of iterations.

Figure 8: Per-class performance comparison of the baseline with and without MIR.

Hollywood2 (%)              Olympic Sports (%)
Sapienza et al. [25]  59.6  Jain et al. [9]     83.2
Jain et al. [9]       62.5  Oneata et al. [21]  84.6
Oneata et al. [21]    63.3  Gaidon et al. [7]   85.5
Wang et al. [27]      64.3  Wang et al. [27]    91.1
Lan et al. [16]       68.0  Lan et al. [16]     91.4
Combined                    Combined

Table 3: Comparison of our results to the state of the art on the action recognition task. 'Combined' indicates applying both RaN and MIR to the baseline method.
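Algorithm 1 can be implemented in a few lines. The sketch below is our hedged NumPy reading of the update rule (exponentially decaying weights over the sorted competing scores, annealed by η^w); the parameter defaults follow the values stated in the text (β = 1, η = 0.5), and the toy scores are our own.

```python
import numpy as np

def mir(P, eta=0.5, beta=1.0, W=4):
    """Multi-class Iterative Re-ranking (MIR) over a score matrix
    P of shape (N videos, K classes)."""
    P = P.astype(float).copy()
    N, K = P.shape
    weights = np.exp(-beta * np.arange(1, K))  # e^{-beta*r} for r = 1..K-1
    for w in range(W):
        P_new = P.copy()
        for i in range(N):
            for j in range(K):
                # Competing scores: this video's predictions from the
                # other K-1 classifiers, sorted in descending order.
                others = np.sort(np.delete(P[i], j))[::-1]
                P_new[i, j] = P[i, j] - (eta ** w) * np.dot(weights, others)
        P = P_new
    return P

# Toy scores: video 0 is "easy" (one dominant class),
# video 1 is "difficult" (flat predictions across classes).
scores = np.array([[0.9, 0.1, 0.0],
                   [0.5, 0.4, 0.4]])
reranked = mir(scores, W=3)
```

On this toy input, the flat-scored "difficult" video is penalized more heavily than the "easy" one, so the gap between their top scores widens, which is exactly the re-ranking behavior the text describes.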
6. Combined Results: RaN and MIR
If we combine both proposed methods, we observe an even more prominent improvement over the baseline method on both the Hollywood2 and Olympic Sports datasets. A more detailed comparison is provided in Figure 9, from which we can see that our proposed methods together improve the baseline method on all the action classes in Hollywood2. Some of the hard classes, such as 'HandShake' and 'AnswerPhone', show especially large absolute improvements. For the Olympic Sports dataset, we observe a similar trend and obtain perfect predictions on 9 out of 16 classes.

In Table 3, we compare our combined results to the state-of-the-art performance on the Hollywood2 and Olympic Sports datasets. Note that although we list several of the most recent approaches for comparison purposes, most of them are not directly comparable to our results due to the use of different features and representations. The most comparable one is Wang & Schmid [27], on which we build our approaches. Sapienza et al. [25] explored ways to sub-sample and generate vocabularies for Dense Trajectory features. Jain et al. [9] incorporated a new motion descriptor. Oneata et al. [21] focused on testing spatial Fisher Vectors for multiple action and event tasks. Gaidon et al. [7] clustered low-level features into mid-level representations in a hierarchical way; the learned hierarchies of mid-level motion components are data-driven decompositions specific to each video. As can be seen, our combined method significantly outperforms these state-of-the-art methods.
Problem formulation
Given a collection of videos, the goal of an event detection task is to detect events of interest, such as birthday party and parade, solely based on the video content. The task is challenging due to complex actions and scenes. By evaluating on this task, we examine how RaN and MIR behave on difficult event detection tasks.

Figure 9: Per-class performance comparison of our combined results to the baseline method. 'Combined' indicates applying both RaN and MIR to the baseline method.
Benchmark datasets
TREC Video Retrieval Evaluation (TRECVID) Multimedia Event Detection (MED) [22] is a task organized by NIST (the National Institute of Standards and Technology) aimed at encouraging new technologies for detecting complex events such as having a birthday party and rock climbing. By 2014, NIST had built up a database containing 8000 hours of video and 40 events, which is by far the largest human-labeled event detection collection. MEDTEST13 and MEDTEST14 are two standard evaluation datasets released by NIST in 2013 and 2014, respectively. Each of them contains around 10 percent of the whole MED collection, has 20 events, and consists of two tasks, EK100 and EK10. EK100 has 100 positive training samples per event, while EK10 has only 10. Together, each dataset has 8000 training samples and 24000 testing samples.

              MEDTEST13        MEDTEST14
            EK10    EK100    EK10    EK100
Baseline    17.0    33.6     12.0    26.2
RaN

Table 4: Performance comparison on the MED task.
Experimental settings
The experimental settings are similar to those discussed in Section 3.
Results
Table 4 lists the overall mAP on all four datasets. The baseline method is a conventional IDT representation. First, on all four datasets, RaN notably improves the performance of the conventional IDT representation. These results demonstrate that RaN is robust across tasks with different difficulty levels. For MIR, in the EK10 scenario, where the baseline performance is unusually low, MIR hurts the performance due to inaccurate curriculum estimation; in the EK100 setting, which has a comparatively reasonable baseline performance, MIR achieves noticeable improvements, though not as large as on the action recognition tasks, where the baseline performances are much higher. These results reveal that for MIR to be useful, the baseline performance should be reasonably accurate. Finally, the combined results, though not always the best, improve upon the baseline method on both the EK100 and EK10 tasks. It is worth emphasizing that MED is a challenging task, so absolute performance improvements of this magnitude are significant.
7. Conclusions and Discussions
This paper has introduced two ranking-based methods to improve the performance of state-of-the-art action recognition systems. RaN ranks each dimension of the Fisher Vectors and uses the ranks to replace the original values. It improves on the classic PN method by addressing the sparse and bursty distribution problems of Fisher Vectors and VLADs. MIR iteratively uses the predictions of other classifiers to rank videos from easy to difficult and re-ranks the predictions accordingly. Together, these two methods significantly improve the performance of Fisher Vectors on six real-world datasets and set new state-of-the-art results on two benchmark action datasets, Hollywood2 and Olympic Sports. Future work will investigate how to use RaN for local descriptors. In addition, we would like to apply the proposed methods to VLAD and other video features, such as deep neural network features.

References

[1] R. Arandjelovic and A. Zisserman. All about VLAD. In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013.
[2] Y. Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1-127, 2009.
[3] Y. Bengio, J. Louradour, R. Collobert, and J. Weston. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning (ICML), 2009.
[4] A. Bergamo and L. Torresani. Meta-class features for large-scale object categorization on a budget. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
[5] X. Chen and A. Gupta. Webly supervised learning of convolutional networks. arXiv preprint arXiv:1505.01554, 2015.
[6] M. Cimpoi, S. Maji, and A. Vedaldi. Deep filter banks for texture recognition and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[7] A. Gaidon, Z. Harchaoui, and C. Schmid. Activity representation with motion hierarchies. International Journal of Computer Vision (IJCV), 107(3):219-238, 2014.
[8] R. Hou, A. R. Zamir, R. Sukthankar, and M. Shah. DaMN - discriminative and mutually nearest: Exploiting pairwise category proximity for video action recognition. In Proceedings of the European Conference on Computer Vision (ECCV), 2014.
[9] M. Jain, H. Jégou, and P. Bouthemy. Better exploiting motion for better action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013.
[10] H. Jégou and O. Chum. Negative evidences and co-occurences in image retrieval: The benefit of PCA and whitening. In Proceedings of the European Conference on Computer Vision (ECCV), pages 774-787. Springer, 2012.
[11] L. Jiang, D. Meng, Q. Zhao, S. Shan, and A. G. Hauptmann. Self-paced curriculum learning. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence (AAAI), 2015.
[12] Y.-G. Jiang, Q. Dai, X. Xue, W. Liu, and C.-W. Ngo. Trajectory-based modeling of human actions with motion reference points. In Proceedings of the European Conference on Computer Vision (ECCV), 2012.
[13] F. Khan, B. Mutlu, and X. Zhu. How do humans teach: On curriculum learning and teaching dimension. In Proceedings of Advances in Neural Information Processing Systems (NIPS), pages 1449-1457, 2011.
[14] M. P. Kumar, B. Packer, and D. Koller. Self-paced learning for latent variable models. In Proceedings of Advances in Neural Information Processing Systems (NIPS), pages 1189-1197, 2010.
[15] Z. Lan, L. Jiang, S.-I. Yu, S. Rawat, Y. Cai, C. Gao, S. Xu, H. Shen, X. Li, Y. Wang, et al. CMU-Informedia at TRECVID 2013 multimedia event detection. In Proceedings of the TRECVID 2013 Workshop, volume 1, page 5, 2013.
[16] Z. Lan, M. Lin, X. Li, A. G. Hauptmann, and B. Raj. Beyond Gaussian pyramid: Multi-skip feature stacking for action recognition.
Proceedings of the IEEE Conference onComputer Vision and Pattern Recognition (CVPR) , 2015. 3,7[17] I. Laptev. On space-time interest points.
International Jour-nal of Computer Vision (IJCV) , 64(2-3):107–123, 2005. 1[18] M. Marszalek, I. Laptev, and C. Schmid. Actions in context.In
Proceedings of the IEEE Conference on Computer Visionand Pattern Recognition (CVPR) , 2009. 2[19] I. Mironic˘a, B. Ionescu, J. Uijlings, and N. Sebe. Fisherkernel temporal variation-based relevance feedback for videoretrieval.
Computer Vision and Image Understanding , 2015.2[20] J. C. Niebles, C.-W. Chen, and L. Fei-Fei. Modeling tempo-ral structure of decomposable motion segments for activityclassification. In
Proceedings of European Conference onComputer Visions . 2010. 2[21] D. Oneata, J. Verbeek, C. Schmid, et al. Action and eventrecognition with fisher vectors on a compact feature set. In
Proceedings of the International Conference on ComputerVision and Pattern Recognition (ICCV) , 2013. 7[22] P. Over, G. Awad, J. Fiscus, and G. Sanders. Trecvid 2013–an introduction to the goals, tasks, data, evaluation mecha-nisms, and metrics. 2013. 8[23] F. Perronnin and C. Dance. Fisher kernels on visual vo-cabularies for image categorization. In
Proceedings of theIEEE Conference on Computer Vision and Pattern Recogni-tion (CVPR) , pages 1–8. IEEE, 2007. 2[24] F. Perronnin, J. S´anchez, and T. Mensink. Improving thefisher kernel for large-scale image classification. In
Proceed-ings of the on European Conferences on Computer Vision(ECCV) . 2010. 1, 2, 3[25] M. Sapienza, F. Cuzzolin, and P. H. Torr. Feature samplingand partitioning for visual vocabulary generation on large ac-tion classification datasets. arXiv preprint arXiv:1405.7545 ,2014. 7[26] H. Wang, A. Klaser, C. Schmid, and C.-L. Liu. Actionrecognition by dense trajectories. In
Proceedings of theIEEE Conference on Computer Vision and Pattern Recog-nition (CVPR) , 2011. 2[27] H. Wang, C. Schmid, et al. Action recognition with improvedtrajectories. In
Proceedings of the International Conferenceon Computer Vision and Pattern Recognition (ICCV) , 2013.1, 2, 3, 7[28] Z. Xu, Y. Yang, and A. G. Hauptmann. A discriminativecnn video representation for event detection. arXiv preprintarXiv:1411.4006 , 2014. 1, 2[29] S. Zha, F. Luisier, W. Andrews, N. Srivastava, andR. Salakhutdinov. Exploiting image-trained cnn architec-tures for unconstrained video classification. arXiv preprintarXiv:1503.04144 , 2015. 1[30] X. Zhang, Z. Li, L. Zhang, W.-Y. Ma, and H.-Y. Shum. Ef-ficient indexing for large scale visual search. In