Need for Speed: A Benchmark for Higher Frame Rate Object Tracking
Hamed Kiani Galoogahi, Ashton Fagg, Chen Huang, Deva Ramanan, Simon Lucey
Robotics Institute, Carnegie Mellon University        SAIVT Lab, Queensland University of Technology
∗ Kiani and Fagg are joint first authors. Email: [email protected] & [email protected]
Abstract
In this paper, we propose the first higher frame rate video dataset (called Need for Speed - NfS) and benchmark for visual object tracking. The dataset consists of 100 videos (380K frames) captured with now commonly available higher frame rate (240 FPS) cameras in real world scenarios. All frames are annotated with axis aligned bounding boxes and all sequences are manually labelled with nine visual attributes, such as occlusion, fast motion, background clutter, etc. Our benchmark provides an extensive evaluation of many recent and state-of-the-art trackers on higher frame rate sequences. We ranked each of these trackers according to their tracking accuracy and real-time performance. One of our surprising conclusions is that at higher frame rates, simple trackers such as correlation filters outperform complex methods based on deep networks. This suggests that for practical applications (such as robotics or embedded vision), one needs to carefully trade off the bandwidth constraints associated with higher frame rate acquisition, the computational cost of real-time analysis, and the required application accuracy. Our dataset and benchmark allow, for the first time (to our knowledge), systematic exploration of such issues, and will be made available to allow for further research in this space.
1. Introduction
Visual object tracking is a fundamental task in computer vision which has implications for a bevy of applications: surveillance, vehicle autonomy, video analysis, etc. The vision community has shown an increasing degree of interest in the problem, with recent methods becoming increasingly sophisticated and accurate [8, 13, 10, 1, 2]. However, most of these algorithms, and the datasets they have been evaluated upon [37, 36, 22], have been aimed at the canonical approximate frame rate of 30 Frames Per Second (FPS). Consumer devices with cameras such as smart phones, tablets, drones, and robots are increasingly coming with higher frame rate cameras (240 FPS now being standard on many smart phones, tablets, drones, etc.). The visual object tracking community is yet to adapt to the changing landscape of what "real-time" means and how faster frame rates affect the choice of tracking algorithm one should employ.

Figure 1. The effect of tracking higher frame rate videos. Top rows illustrate the robustness of tracking higher frame rate videos (240 FPS) versus lower frame rate videos (30 FPS) for a Correlation Filter (BACF with HOG) and a deep tracker (MDNet). Bottom rows show that, if higher frame rate videos are available, cheap CF trackers (Staple and BACF) can outperform complicated deep trackers (SFC and MDNet) in challenging situations such as fast motion, rotation, illumination change and cluttered background. Predicted bounding boxes of these methods are shown in different colors. HFR and LFR refer to Higher and Lower Frame Rate videos.

In recent years, significant attention has been paid to Correlation Filter (CF) based methods [3, 13, 8, 15, 16, 1] for visual tracking. The appeal of correlation filters is their efficiency: a discriminative tracker can be learned online from a single frame and adapted after each subsequent frame. This online adaptation process allows for impressive tracking performance from a relatively low capacity learner. Further, CFs take advantage of intrinsic computational efficiencies afforded by operating in the Fourier domain [20]. Some CF methods (such as [13, 3]) are able to operate at hundreds of frames per second on embedded devices.

More recently, however, the visual tracking community has started to focus upon improving reliability and robustness through advances in deep learning [2, 31, 33]. While such methods have been shown to work well, their use comes at a cost. First, extracting discriminative features from CNNs or applying deep tracking frameworks is computationally expensive. Some deep methods operate at only a fraction of a frame per second, or require a high-end GPU to achieve real-time performance. Second, training deep trackers can require a very large amount of data, as the learners are high capacity and are expensive to train.

It is well understood that the central artifact affecting a visual tracker's performance is its tolerance to appearance variation from frame to frame. Deep tracking methods have shown a remarkable aptitude for providing such tolerance, with the unfortunate drawback of a considerable computational footprint [31, 33]. In this paper we want to explore an alternate thesis. Specifically, if we actually increase the frame rate, thus reducing the amount of appearance variation per frame, could we get away with substantially simpler tracking algorithms (from a computational perspective)? Further, could these computational savings be dramatic enough to cover the obvious additional cost of having to process significantly more image frames per second? Due to the widespread availability of higher frame rate (e.g. 240 FPS) cameras on many consumer devices, we believe the time is ripe to explore such questions.

Inspired by the recent work of Handa et al. [11] in visual odometry, we believe that increasing the capture frame rate allows for a significant boost in tracking performance without the need for deep features or complex trackers.
We do not dismiss deep trackers or deep features; however, we show that under some circumstances they are not necessary. By trading tracker capacity for higher frame rates, as is possible on many consumer devices, we believe that more favourable runtime performance can be obtained, particularly on devices with resource constraints, while still obtaining competitive tracking accuracy.

Contributions:
In this paper, we present the Need for Speed (NfS) dataset and benchmark, the first benchmark (to our knowledge) for higher frame rate general object tracking using consumer devices. We use our dataset to evaluate numerous state-of-the-art tracking methods (both CFs and deep learning based methods). An exciting outcome of our work was the unexpected result that, if a sufficiently high frame rate can be attained, CFs with cheap hand-crafted features (e.g. HOG [5]) can outperform state-of-the-art deep trackers in terms of accuracy and computational efficiency.
2. Related Work
Standard tracking datasets such as OTB100 [37], VOT14 [17] and ALOV300 [30] have been widely used to evaluate current tracking methods in the literature. These datasets display annotated generic objects in real-world scenarios captured by low frame rate cameras (i.e. 24-30 FPS). The existing tracking datasets are briefly described below.
OTB50 and OTB100:
OTB50 [36] and OTB100 [37] belong to the Object Tracking Benchmark (OTB), with 50 and 100 sequences, respectively. OTB50 is a subset of OTB100, and both datasets are annotated with bounding boxes as well as 11 different attributes such as illumination variation, scale variation, occlusion and deformation.
Temple-Color 128 (TC128):
This dataset consists of 128 videos and was specifically designed for the evaluation of color-enhanced trackers. Similar to the OTBs, TC128 provides per frame bounding boxes and 11 per video attributes [22].
VOT14 and VOT15:
VOT14 [17] and VOT15 [18] consist of 25 and 30 challenging videos, respectively, which are mainly borrowed from OTB100. All videos are labelled with rotated bounding boxes rather than upright ones. Both datasets come with per frame attribute annotation.
ALOV300:
This dataset [29] contains 314 sequences mainly borrowed from the OTBs, the VOT challenges and TC128. Videos are labeled with 14 visual attributes such as low contrast, long duration, confusion, zooming camera, motion smoothness, moving camera and transparency.
UAV123:
This dataset [25] was recently created for Unmanned Aerial Vehicle (UAV) target tracking. There are 123 videos in this dataset: 115 captured by UAV cameras and 8 rendered by a UAV simulator, all annotated with bounding boxes and 12 attributes.
Table 1 compares the NfS dataset with these datasets, showing that NfS is the only dataset with higher frame rate videos, captured at 240 FPS. Moreover, in terms of number of frames, NfS is the largest dataset, with 380K frames, more than twice as many as ALOV300.
Recent trackers can be broadly divided into two categories: correlation filter (CF) trackers [1, 13, 7, 23, 9] and deep trackers [26, 2, 34, 31]. We briefly review each of these two categories below.

Table 1. Comparing NfS with other object tracking datasets.

                     UAV123  OTB50  OTB100  TC128  VOT14  VOT15  ALOV300  NfS
                      [25]    [36]   [37]    [22]   [17]   [18]    [29]
Capture frame rate     30      30      30     30     30     30      30    240
Correlation Filter Trackers:
The interest in employing CFs for visual tracking was ignited by the seminal MOSSE filter [3], with an impressive speed of ∼700 FPS and the capability of online adaptation. Thereafter, several works [13, 1, 6, 10, 24] were built upon MOSSE, showing notable improvement by learning CF trackers from more discriminative multi-channel features (e.g. HOG [5]) rather than pixel values. KCF [13] significantly improved MOSSE's accuracy by real-time learning of kernelized CF trackers on HOG features. Trackers such as Staple [1], LCT [24] and SAMF [10] were developed to improve KCF's robustness to object deformation and scale change. Kiani et al. [16] showed that learning such trackers in the frequency domain is highly affected by boundary effects, leading to suboptimal performance [8]. The CF with Limited Boundaries (CFLB) [16], the Spatially Regularized CF (SRDCF) [8] and the Background-Aware CF (BACF) [14] propose solutions to mitigate these boundary effects in the Fourier domain, with impressive results.
Recently, learning CF trackers from deep Convolutional Neural Network (CNN) features [28, 19] has offered superior results on several standard tracking datasets [9, 7, 23]. The central tenet of these approaches (such as HCF [23] and HDT [27]) is that, even with per frame online adaptation, hand-crafted features such as HOG are not discriminative enough to capture the visual difference between consecutive frames in low frame rate videos. Despite their notable improvement, the major drawback of such CF trackers is their intractable computational cost.
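To make this online adaptation concrete, the sketch below shows a MOSSE-style correlation filter update in the Fourier domain. The function names, the default adaptation rate and the regularisation constant are illustrative choices of ours rather than values taken from any cited tracker.

```python
import numpy as np

def update_filter(F, G, A_prev=None, B_prev=None, lr=0.025):
    """MOSSE-style online update of a correlation filter.

    F : 2-D FFT of the preprocessed target patch from the current frame.
    G : 2-D FFT of the desired (Gaussian) response centred on the target.
    lr: adaptation (learning) rate; larger values favour recent frames.
    Returns the running numerator A, denominator B and the filter H* = A / B.
    """
    if A_prev is None:                       # first frame: no history to blend
        A = G * np.conj(F)
        B = F * np.conj(F)
    else:                                    # running average over past frames
        A = lr * (G * np.conj(F)) + (1 - lr) * A_prev
        B = lr * (F * np.conj(F)) + (1 - lr) * B_prev
    H_conj = A / (B + 1e-5)                  # small constant avoids division by zero
    return A, B, H_conj

def localise(F_next, H_conj):
    """Correlate the filter with the next frame's patch (a product in the
    Fourier domain) and return the peak of the response map as (row, col)."""
    response = np.real(np.fft.ifft2(F_next * H_conj))
    return np.unravel_index(np.argmax(response), response.shape)
```

On higher frame rate video this same per-frame update is simply applied more often per second, which is why the adaptation rate itself must be rescaled (see Section 4).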
Deep Trackers:
Recent deep learning based trackers [35, 26, 33] represent a new paradigm in tracking. Instead of hand-crafted features, a deep network trained for a non-tracking task (such as object recognition [19]) is updated with video data for generic object tracking. Unlike the direct combination of deep features with traditional shallow methods, e.g. CFs [9, 23, 7, 27], these deep trackers aim to learn target-specific features from scratch for each new video. For example, MDNet [26] learns generic features on a large set of videos and updates the so-called domain-specific layers for unseen ones. The more related training set and the unified training and testing approach helped MDNet win first place in the VOT15 challenge. Wang et al. [33] proposed to use fully convolutional networks (FCNT) with a feature map selection mechanism to improve performance. However, such methods are computationally very expensive (even with a high-end GPU) due to the fine-tuning step required to adapt the network from a large number of example frames. Two high-speed deep trackers, GOTURN [12] and SFC [2], are able to run at 100 FPS and 75 FPS respectively on GPUs. Both of these methods train a Siamese network offline to predict motion between two frames (either using deep regression or a similarity comparison). At test time, the network is evaluated without any fine-tuning. Thus, these trackers are significantly less expensive because the only computational cost is the fixed feed-forward pass. For these trackers, however, we note two major drawbacks. First, their simplicity and fixed-model nature lead to high speed, but also lose the ability to update the appearance model online, which is often critical to account for drastic appearance changes. Second, on modern CPUs, their speed drops to no more than 3 FPS, which is too slow for practical use on devices with limited computational resources.
3. NfS Dataset
The NfS dataset consists of 100 higher frame rate videos captured at 240 FPS. We captured 75 videos using the iPhone 6 (and above) and the iPad Pro, which are capable of capturing 240 frames per second. We also included 25 sequences from YouTube, which were captured at 240 FPS from a variety of different devices. All 75 captured videos come with corresponding IMU and gyroscope raw data gathered during the video capturing process. Although we make no use of such data in this paper, we will make it publicly available for potential applications. The tracking targets include (but are not limited to) vehicles (bicycle, motorcycle, car), people, faces, animals (fish, bird, mammal, insect), aircraft (airplane, helicopter, drone), boats, and generic objects (e.g. sport ball, cup, bag, etc.). Each frame in NfS is annotated with an axis aligned bounding box using the VATIC toolbox [32]. Moreover, all videos are labeled with nine visual attributes, including occlusion (OCC), illumination variation (IV), scale variation (SV), object deformation (DEF), fast motion (FM), viewpoint change (VC), out of view (OV), background clutter (BC) and low resolution (LR). The distribution of these attributes for NfS is presented in Table 2. Example frames of the NfS dataset and a detailed description of each attribute are provided in the supplementary material.

Table 2. Distribution of visual attributes within the NfS dataset, showing the number of coincident attributes across all videos. Please refer to Section 3 for more details.

        IV    SV   OCC  DEF   FM   VC   OV   BC   LR
IV       -    39    23   12   33   17    9   16    8
SV      39     -    41   36   57   41   21   28    8
OCC     23    41     -   21   31   23   10   19    8
DEF     12    36    21    -   19   30    4   13    1
FM      33    57    31   19    -   31   20   24    4
VC      17    41    23   30   31    -   10   16    4
OV       9    21    10    4   20   10    -   10    -
BC      16    28    19   13   24   16   10    -    -
LR       8     8     8    1    4    4    -    -    -
4. Evaluation
Evaluated Algorithms:
We evaluated 15 recent trackers on the NfS dataset. We broadly categorise these trackers by their learning strategy and utilized features into three classes: CF trackers with hand-crafted features (BACF [14], SRDCF [8], Staple [1], DSST [6], KCF [13], LCT [24], SAMF [21] and CFLB [16]), CF trackers with deep features (HCF [23] and HDT [27]), and deep trackers (MDNet [26], SiameseFc [2], FCNT [33] and GOTURN [12]). We also included MEEM [38] in the evaluation as the state-of-the-art SVM-based tracker with hand-crafted features. All these trackers are detailed in the supplementary material in terms of learning strategy and feature representation.
Evaluation Methodology:
We use the success metric to evaluate all the trackers [36]. Success measures the intersection over union (IoU) of predicted and ground truth bounding boxes. The success plot shows the percentage of bounding boxes whose IoU is larger than a given threshold. We use the Area Under the Curve (AUC) of success plots to rank the trackers. We also compare all the trackers by their success rate at the conventional threshold of 0.50 (IoU > 0.50). Finally, we report the relative improvement of higher frame rate tracking, computed as improved accuracy / accuracy of lower frame rate tracking, where the improved accuracy is the difference between the accuracy (success rate at IoU > 0.50) of higher frame rate tracking and that of lower frame rate tracking.
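For concreteness, the metrics described above can be computed as in the following sketch. The (x, y, width, height) box format and the 101-point threshold grid are our own assumptions rather than requirements of the benchmark.

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes given as (x, y, w, h)."""
    xa, ya = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    xb = min(box_a[0] + box_a[2], box_b[0] + box_b[2])
    yb = min(box_a[1] + box_a[3], box_b[1] + box_b[3])
    inter = max(0.0, xb - xa) * max(0.0, yb - ya)
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0

def success_metrics(pred_boxes, gt_boxes, thresholds=np.linspace(0, 1, 101)):
    """Success plot (fraction of frames with IoU above each threshold),
    its AUC used to rank trackers, and the success rate at IoU > 0.50."""
    overlaps = np.array([iou(p, g) for p, g in zip(pred_boxes, gt_boxes)])
    curve = np.array([(overlaps > t).mean() for t in thresholds])
    return curve, curve.mean(), (overlaps > 0.5).mean()

def relative_improvement(acc_hfr, acc_lfr):
    """Relative improvement of higher over lower frame rate tracking."""
    return (acc_hfr - acc_lfr) / acc_lfr
```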
Tracking Scenarios:
To measure the effect of capture frame rate on tracking performance, we consider two different tracking scenarios. In the first scenario, we run each tracker over all frames of the higher frame rate videos (240 FPS) in the NfS dataset. The second scenario, on the other hand, involves tracking lower frame rate videos (30 FPS). Since all videos in the NfS dataset are captured by high frame rate cameras, and thus no 30 FPS video is available, we simply create a lower frame rate version of NfS by temporally sampling every 8th frame. In this case, we track the object over every 8th frame instead of all frames. This simply models the large visual difference between two consecutive sampled frames, as one may observe in a real lower frame rate video. However, the main issue with temporal sampling is that, since the videos are originally captured by higher frame rate cameras with very short exposure times, the motion blur caused by fast moving objects/cameras is significantly diminished. This would exclude the effect of motion blur from lower frame rate tracking. To address this concern and make the evaluation as realistic as possible, we simulate motion blur over the lower frame rate videos created by temporal sampling. We utilize a leading visual effects package (Adobe After Effects) to synthesize motion blur over the sampled frames. To verify the realism of the synthesized motion blur, Fig. 2 shows a real frame (of a checkerboard) captured by a 240 FPS camera, the same frame with synthesized motion blur, and a frame with real motion blur captured by a 30 FPS camera with identical extrinsic and intrinsic settings. To capture two sequences with different frame rates, we placed two iPhones with capture rates of 30 and 240 FPS side by side and captured the same scene simultaneously. Fig. 2 also shows two frames before and after adding synthesized motion blur.

Figure 2. Top) a frame captured by a high frame rate camera (240 FPS), the same frame with synthesized motion blur, and the same frame captured by a low frame rate camera (30 FPS) with real motion blur. Bottom) sampled frames with corresponding synthesized motion blur. Please refer to Tracking Scenarios for more details.
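The temporal sampling step can be reproduced as sketched below with OpenCV. Note that the paper synthesizes motion blur with Adobe After Effects; the second function is only a rough stand-in of ours that approximates blur by averaging the high frame rate frames spanned by each sampled frame, not the authors' pipeline.

```python
import cv2
import numpy as np

def subsample_240_to_30(src_path, dst_path, step=8):
    """Keep every `step`-th frame of a 240 FPS clip to emulate 30 FPS capture."""
    cap = cv2.VideoCapture(src_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    size = (int(cap.get(cv2.CAP_PROP_FRAME_WIDTH)),
            int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT)))
    out = cv2.VideoWriter(dst_path, cv2.VideoWriter_fourcc(*"mp4v"), fps / step, size)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            out.write(frame)
        idx += 1
    cap.release()
    out.release()

def approximate_blur(frames):
    """Average a window of consecutive 240 FPS frames to mimic the longer
    effective exposure of a 30 FPS camera (our approximation only)."""
    return np.mean(np.stack(frames).astype(np.float32), axis=0).astype(np.uint8)
```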
A unique characteristic of CF trackers is their inherent ability to update the tracking model online as new frames become available. The impact of each frame on the learning/updating process is controlled by a constant weight called the learning (or adaptation) rate [3]. Smaller rates increase the impact of older samples, while larger rates give higher weight to samples from more recent frames. Each CF tracker has its own learning rate, which was tuned for robust tracking on low frame rate sequences (30 FPS). To retain the robustness of such methods on higher frame rate videos (240 FPS), we approximately adjust their learning rates to LR_new = LR_old / 8. Since the number of frames in 240 FPS videos is 8 times greater than in 30 FPS sequences over a fixed period of time, dividing the learning rates by 8 keeps the balance between the CF's updating capacity and the smaller inter-frame variation in 240 FPS videos. A proof is provided in the supplementary material.

Here, we empirically demonstrate how adjusting the learning rates of CF trackers affects their tracking performance. Table 3 shows the tracking accuracy (success rate at IoU > 0.50) with the original learning rates LR_old (taken from the reference papers) versus the updated rates LR_new. The results show that adjusting the learning rates notably improves the accuracy of all the CF trackers. For some trackers such as BACF, SRDCF, DSST and HCF there is a substantial improvement, while for Staple, LCT and HDT the improvement is much smaller. This is most likely because of the complementary components of these trackers. Staple utilizes color scores per pixel, and LCT uses random ferns classifiers as additional detectors/trackers which are independent of their CF modules. Similarly, HDT employs the Hedge algorithm [4] as a multi-expert decision maker to merge hundreds of weak CF trackers into a strong tracker. Thus, updating their learning rates offers less improvement compared to trackers such as BACF and DSST that track solely with a single CF-based tracker.

Table 3. Evaluating the effect of updating the learning rate of each CF tracker on tracking higher frame rate videos (240 FPS). Accuracy is reported as success rate (%) at IoU > 0.50.

             BACF  SRDCF  Staple  LCT   DSST  SAMF  KCF   CFLB  HCF   HDT
Original LR  48.8  48.2   51.1    34.5  44.0  42.8  28.7  18.3  33.0  57.7
Updated LR   60.5  55.8   53.4    36.4  53.4  51.7  34.8  22.9  41.2  59.6

Figure 3. Comparing higher frame rate tracking (240 FPS) versus lower frame rate tracking (30 FPS) for each tracker. For higher frame rate tracking, CF trackers employ updated learning rates. The results of lower frame rate tracking are plotted for videos with and without motion blur (30 FPS-MB and 30 FPS-no MB). Results are reported as success rate (%) at IoU > 0.50.
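A minimal sketch of the learning rate rescaling LR_new = LR_old / 8 described above; the example rate of 0.02 is a hypothetical 30 FPS setting, not a value quoted from any specific tracker.

```python
def rescale_learning_rate(lr_reference, capture_fps=240.0, reference_fps=30.0):
    """Scale a CF adaptation rate tuned for `reference_fps` video by the ratio
    of frame rates, i.e. divide by 8 when going from 30 FPS to 240 FPS."""
    return lr_reference * (reference_fps / capture_fps)

lr_240 = rescale_learning_rate(0.02)   # hypothetical 30 FPS rate -> 0.0025 at 240 FPS
```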
Figure 3 compares tracking higher versus lower frame rate videos for each evaluated method. For lower frame rate tracking (30 FPS), results are reported both with and without motion blur. All CF trackers achieve a significant increase in performance (AUCs are improved > %) when tracking 240 FPS videos. This is because in higher frame rate video the appearance change between two adjacent frames is very small, which can be effectively learned by per frame CF online adaptation. Among deep trackers, FCNT achieved the most improvement (6%), since this tracker also fine-tunes every 20 frames using the most confident tracking results. The lowest improvement belongs to SFC and MDNet. These methods are trained offline and do not update a model or maintain a memory of past appearances [2]. Thus, tracking higher frame rate videos offers much smaller improvement to such trackers. When evaluating lower frame rate videos, only a slight performance drop is observed in the presence of motion blur, demonstrating that all trackers are reasonably robust to motion blur. The overall comparison of all trackers over the three tracking settings (higher frame rate tracking at 240 FPS, lower frame rate tracking with synthesized motion blur, 30 FPS MB, and lower frame rate tracking without motion blur, 30 FPS no MB) is presented in Fig. 4 (success plots) and Table 4 (AUCs and tracking speed).
Accuracy Comparison:
For lower frame rate tracking without motion blur, MDNet achieved the best performance, followed by SFC. HDT, which utilizes deep features within a CF framework, obtained the third rank, followed by FCNT. Almost the same ranking is observed for lower frame rate tracking with motion blur. Overall, deep trackers outperformed CF trackers for lower frame rate tracking. This is not surprising, as deep trackers have a high learning capacity and employ highly discriminative deep features which are able to handle the large variation between adjacent frames that is present in lower frame rate videos.

Surprisingly, the best accuracy for higher frame rate tracking is achieved by BACF (49.56), a CF tracker with HOG features, followed by HDT (47.80), a CF tracker with deep features. SRDCF (47.13), Staple (45.34), DSST (44.80) and SAMF (43.92) outperformed GOTURN (38.65) and obtained very competitive accuracy compared to the other deep trackers, including SFC (47.78), MDNet (47.34) and FCNT (46.94). This implies that when higher frame rate videos are available, the ability of CF trackers to adapt online is of greater benefit than the high learning capacity of deep trackers. The reasoning for this is intuitive: for higher frame rate video there is less appearance change between consecutive frames, which can be efficiently modeled by updating the tracking model at each frame, even using simple hand-crafted features.

Figure 4. Evaluating trackers over three tracking scenarios: (a) lower frame rate tracking with synthesized motion blur, (b) lower frame rate tracking without motion blur, and (c) higher frame rate tracking. AUCs are reported in brackets.

Table 4. Comparing trackers on three tracking scenarios: higher frame rate tracking (240 FPS), lower frame rate tracking with synthesized motion blur (30 FPS MB) and lower frame rate tracking without motion blur (30 FPS no MB). Results are reported as the AUC of success plots. We also show the speed of each tracker on CPUs and/or GPUs where applicable. The first, second, third, fourth and fifth highest AUCs/speeds are highlighted in color.

               BACF  SRDCF  Staple  LCT   DSST  SAMF  KCF    CFLB  HCF   HDT   MEEM  MDNet  SFC   FCNT  GOTURN
30 FPS-no MB   35.2  35.7   34.5    24.8  29.4  30.3  22.3   14.9  30.2  41.3  32.9  44.4   42.3  40.5  37.7
30 FPS-MB      34.0  35.1   33.2    23.7  28.0  29.2  21.7   14.2  29.5  40.3  29.6  42.9   40.1  39.7  33.4
240 FPS        49.5  47.1   45.3    34.3  44.8  43.9  33.3   19.9  39.5  47.8  37.5  47.3   47.7  46.9  38.6
Speed (CPU)    38.3  3.8    50.8    10.0  12.5  16.6  170.4  85.1  10.8  9.7   11.1  0.7    2.5   3.2   3.9
Speed (GPU)    -     -      -       -     -     -     -      -     -     43.1  -     2.6    48.2  51.8  155.3
Run-time Comparison:
The tracking speed of all evaluated methods, in FPS, is reported in Table 4. For the sake of fair comparison, we tested MATLAB implementations of all methods (including deep trackers) on a 2.7 GHz Intel Core i7 CPU with 16 GB RAM. We also report the speed of deep trackers on an NVIDIA GeForce GTX Titan X GPU to give a better sense of their run time when GPUs are available. On CPUs, all CF trackers achieved much higher speeds than all deep trackers because of their shallow architecture and efficient computation in the Fourier domain; the exception among CF trackers is SRDCF (3.8 FPS). On GPUs, however, deep trackers including GOTURN (155.3 FPS), FCNT (51.8 FPS) and SFC (48.2 FPS) run much faster than on CPUs, and their speed is comparable with many CF trackers running on CPUs, such as KCF (170.4 FPS), Staple (50.8 FPS) and BACF (38.3 FPS). For tracking lower frame rate videos, only BACF, Staple, KCF and CFLB can track at or above real time on CPUs, while GOTURN, FCNT and SFC offer real-time tracking of lower frame rate videos on GPUs. KCF (170.4 FPS) and GOTURN (155.3 FPS) are the only trackers which can track higher frame rate videos in near real time on CPUs and GPUs, respectively.
The attribute-based evaluation of all trackers over the three tracking settings is shown in Fig. 6 (success rate at IoU > 0.50). The failure case in Fig. 5 demonstrates the sensitivity of both CF-based and deep trackers to non-rigid deformation, even when they track higher frame rate videos. Fig. 6 shows that CF trackers with hand-crafted features outperformed all deep trackers as well as HDT for 6 attributes. More particularly, for illumination variation, occlusion, fast motion, out-of-view, background clutter and low resolution, CF trackers with hand-crafted features (such as BACF and SRDCF) achieved superior performance to all deep trackers and HDT. However, the deep tracker MDNet achieved the highest accuracy for scale variation (61.0), deformation (59.2) and viewpoint change (55.9), closely followed by BACF for scale variation (60.1) and HDT for deformation (57.6) and viewpoint change (54.8). The relative accuracy improvement of tracking higher frame rate versus lower frame rate videos (with motion blur) for each tracker and each attribute is reported in Table 5.

Figure 5. Rows (1-3) show the tracking performance of three trackers: a CF tracker with HOG (SRDCF, soccer ball), a CF tracker with deep features (HDT, pingpong) and a deep tracker (MDNet, tiger), comparing lower frame rate (green boxes) versus higher frame rate (red boxes) tracking. Ground truth is shown by blue boxes. The last row visualizes a failure case (fish) of higher frame rate tracking caused by non-rigid deformation for BACF, Staple, MDNet and SFC.
Table 5. Relative accuracy improvement (%) of high frame rate tracking versus low frame rate tracking for each attribute. Relative improvements of more than 50% are underlined in red.

          IV     SV    OCC   DEF   FM    VC    OV    BC    LR
BACF      82.1   44.5  50.8  11.7  62.1  26.1  39.6  47.2  118.3
SRDCF     28.4   19.1  40.7  11.3  43.9  36.9  13.0  59.3  47.2
Staple    106.0  37.1  42.7  10.3  55.9  22.5  33.8  11.6  66.9
LCT       74.5   22.4  60.7  16.3  48.7  36.2  18.3  94.8  127.8
DSST      119.4  58.4  57.9  12.1  84.5  41.1  36.0  65.3  54.4
SAMF      124.4  49.9  62.6  13.2  73.0  28.2  33.1  52.6  44.9
KCF       128.0  37.7  66.7  5.1   78.7  34.3  21.4  81.9  123.6
CFLB      113.2  66.7  52.3  5.4   79.1  14.0  42.3  89.3  200.4
HCF       50.1   23.1  62.8  16.2  39.0  26.0  30.6  71.8  134.5
HDT       29.2   19.1  12.5  4.3   26.2  16.6  19.8  13.8  32.5
MEEM      32.2   17.8  33.6  16.3  31.1  19.1  14.9  29.0  60.9
MDNet     14.1   8.3   9.7   1.9   16.5  13.9  4.1   16.9  10.6
SFC       35.4   20.1  20.6  7.9   26.7  16.6  15.7  22.1  27.8
FCNT      29.7   19.3  16.1  4.8   24.8  19.1  15.5  12.5  26.5
GOTURN    52.9   27.7  16.5  -3.8  38.3  18.2  34.9  37.2  -8.6

The results show that, first, compared to other attributes, all trackers achieve less improvement for the non-rigid deformation attribute, and second, the relative improvement of CF trackers is much higher than that of deep trackers.
In this paper, we introduce the first higher frame rate object tracking dataset and benchmark. We empirically evaluate the performance of state-of-the-art trackers with respect to two different capture frame rates (30 FPS vs. 240 FPS), and find the surprising result that at higher frame rates, simple trackers such as correlation filters trained on hand-crafted features (e.g. HOG) outperform complex trackers based on deep architectures. This suggests that computationally tractable methods such as cheap CF trackers, in conjunction with higher capture frame rate videos, can be utilized to effectively perform object tracking on devices with limited processing resources such as smart phones, tablets, drones, etc. As shown in Fig. 7, cheaper trackers on higher frame rate video (e.g. KCF and Staple) are competitive with many deep trackers on lower frame rate videos (such as HDT and FCNT).
Our results also suggest that traditional evaluation criteria that trade off accuracy versus speed (e.g., Fig. 7 in [18]) could paint an incomplete picture. This is because, up until now, accuracy has been measured without regard to the frame rate of the video. As we show, this dramatically underestimates the performance of high speed algorithms. In simple terms: the accuracy of a 240 FPS tracker cannot be truly appreciated until it is run on a 240 FPS video! From an embedded-vision perspective, we argue that the acquisition frame rate is a resource that should be explicitly traded off when designing systems, just as hardware is (GPU vs. CPU). Our new dataset allows, for the first time, exploration of such novel perspectives. Our dataset fills a need. The need for speed.
Figure 6. Attribute based evaluation over the nine attributes (background clutter, scale variation, occlusion, illumination variation, fast motion, out-of-view, deformation, viewpoint change and low resolution). Results are reported as success rate (%) at IoU > 0.50.
Figure 7. This plot shows the effect of resource availability (GPUs vs. CPUs) and the frame rate of the captured videos (lower vs. higher frame rate) on the top-10 evaluated trackers' accuracy and real-time performance. Real-time performance is computed as the ratio of each tracker's speed (FPS) to the frame rate of the target videos (30 vs. 240 FPS). The vertical line on the plot shows the boundary of being real time (the frame rate of the target video equals the tracker's speed). Trackers plotted to the left of the line are not able to track in real time (according to their tracking speed and the video frame rate). GPU results are highlighted in yellow.
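The real-time performance measure on the x-axis of Fig. 7 is simply the ratio described in the caption; the helper name below and the sample numbers (KCF's CPU speed from Table 4) are our own illustration.

```python
def realtime_performance(tracker_fps, video_fps):
    """Ratio of tracker speed to video frame rate; >= 1 means real-time tracking."""
    return tracker_fps / video_fps

# KCF at 170.4 FPS (CPU, Table 4) is just below real time on 240 FPS video:
print(realtime_performance(170.4, 240.0))   # ~0.71
print(realtime_performance(170.4, 30.0))    # ~5.68, comfortably real time at 30 FPS
```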
References
[1] L. Bertinetto, J. Valmadre, S. Golodetz, O. Miksik, and P. H. S. Torr. Staple: Complementary learners for real-time tracking. In CVPR, 2016.
[2] L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, and P. H. Torr. Fully-convolutional siamese networks for object tracking. In ECCV, pages 850-865, 2016.
[3] D. S. Bolme, J. R. Beveridge, B. A. Draper, and Y. M. Lui. Visual object tracking using adaptive correlation filters. In CVPR, pages 2544-2550, 2010.
[4] K. Chaudhuri, Y. Freund, and D. J. Hsu. A parameter-free hedging algorithm. In NIPS, pages 297-305, 2009.
[5] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In CVPR, volume 1, pages 886-893, 2005.
[6] M. Danelljan, G. Häger, F. Khan, and M. Felsberg. Accurate scale estimation for robust visual tracking. In BMVC, 2014.
[7] M. Danelljan, G. Hager, F. Shahbaz Khan, and M. Felsberg. Convolutional features for correlation filter based visual tracking. In CVPR Workshop, pages 58-66, 2015.
[8] M. Danelljan, G. Hager, F. Shahbaz Khan, and M. Felsberg. Learning spatially regularized correlation filters for visual tracking. In ICCV, pages 4310-4318, 2015.
[9] M. Danelljan, A. Robinson, F. S. Khan, and M. Felsberg. Beyond correlation filters: Learning continuous convolution operators for visual tracking. In ECCV, pages 472-488, 2016.
[10] M. Danelljan, F. Shahbaz Khan, M. Felsberg, and J. van de Weijer. Adaptive color attributes for real-time visual tracking. In CVPR, pages 1090-1097, 2014.
[11] A. Handa, R. A. Newcombe, A. Angeli, and A. J. Davison. Real-time camera tracking: When is high frame-rate best? In ECCV, pages 222-235, 2012.
[12] D. Held, S. Thrun, and S. Savarese. Learning to track at 100 fps with deep regression networks. In ECCV, 2016.
[13] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista. High-speed tracking with kernelized correlation filters. PAMI, 37(3):583-596, 2015.
[14] H. Kiani Galoogahi, A. Fagg, and S. Lucey. Learning background-aware correlation filters for visual tracking. arXiv preprint arXiv:1703.04590, 2017.
[15] H. Kiani Galoogahi, T. Sim, and S. Lucey. Multi-channel correlation filters. In ICCV, pages 3072-3079, 2013.
[16] H. Kiani Galoogahi, T. Sim, and S. Lucey. Correlation filters with limited boundaries. In CVPR, pages 4630-4638, 2015.
[17] M. Kristan, J. Matas, A. Leonardis, et al. The visual object tracking VOT2014 challenge results. In ECCV Workshop, 2014.
[18] M. Kristan, J. Matas, A. Leonardis, M. Felsberg, L. Cehovin, G. Fernandez, T. Vojir, G. Hager, G. Nebehay, and R. Pflugfelder. The visual object tracking VOT2015 challenge results. In CVPR Workshop, pages 1-23, 2015.
[19] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, pages 1097-1105, 2012.
[20] B. V. Kumar. Correlation Pattern Recognition. Cambridge University Press, 2005.
[21] Y. Li and J. Zhu. A scale adaptive kernel correlation filter tracker with feature integration. In ECCV, pages 254-265, 2014.
[22] P. Liang, E. Blasch, and H. Ling. Encoding color information for visual tracking: Algorithms and benchmark. TIP, 24(12):5630-5644, 2015.
[23] C. Ma, J.-B. Huang, X. Yang, and M.-H. Yang. Hierarchical convolutional features for visual tracking. In CVPR, pages 3074-3082, 2015.
[24] C. Ma, X. Yang, C. Zhang, and M.-H. Yang. Long-term correlation tracking. In CVPR, pages 5388-5396, 2015.
[25] M. Mueller, N. Smith, and B. Ghanem. A benchmark and simulator for UAV tracking. In ECCV, pages 445-461, 2016.
[26] H. Nam and B. Han. Learning multi-domain convolutional neural networks for visual tracking. arXiv preprint arXiv:1510.07945, 2015.
[27] Y. Qi, S. Zhang, L. Qin, H. Yao, Q. Huang, J. Lim, and M.-H. Yang. Hedged deep tracking. In CVPR, pages 4303-4311, 2016.
[28] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[29] A. W. Smeulders, D. M. Chu, R. Cucchiara, S. Calderara, A. Dehghan, and M. Shah. Visual tracking: An experimental survey. PAMI, 36(7):1442-1468, 2014.
[30] A. W. Smeulders, D. M. Chu, R. Cucchiara, S. Calderara, A. Dehghan, and M. Shah. Visual tracking: An experimental survey. PAMI, 36(7):1442-1468, 2014.
[31] R. Tao, E. Gavves, and A. W. Smeulders. Siamese instance search for tracking. arXiv preprint arXiv:1605.05863, 2016.
[32] C. Vondrick, D. Patterson, and D. Ramanan. Efficiently scaling up crowdsourced video annotation. IJCV, 101(1):184-204, 2013.
[33] L. Wang, W. Ouyang, X. Wang, and H. Lu. Visual tracking with fully convolutional networks. In ICCV, pages 3119-3127, 2015.
[34] L. Wang, W. Ouyang, X. Wang, and H. Lu. STCT: Sequentially training convolutional networks for visual tracking. In CVPR, 2016.
[35] N. Wang and D.-Y. Yeung. Learning a deep compact image representation for visual tracking. In NIPS, 2013.
[36] Y. Wu, J. Lim, and M.-H. Yang. Online object tracking: A benchmark. In CVPR, pages 2411-2418, 2013.
[37] Y. Wu, J. Lim, and M.-H. Yang. Object tracking benchmark. PAMI, 37(9):1834-1848, 2015.
[38] J. Zhang, S. Ma, and S. Sclaroff. MEEM: Robust tracking via multiple experts using entropy minimization. In ECCV, 2014.