Fully-Convolutional Siamese Networks for Object Tracking
Luca Bertinetto, Jack Valmadre, João F. Henriques, Andrea Vedaldi, Philip H. S. Torr
Department of Engineering Science, University of Oxford
{name.surname}@eng.ox.ac.uk

Abstract.
The problem of arbitrary object tracking has traditionally been tackled by learning a model of the object's appearance exclusively online, using as sole training data the video itself. Despite the success of these methods, their online-only approach inherently limits the richness of the model they can learn. Recently, several attempts have been made to exploit the expressive power of deep convolutional networks. However, when the object to track is not known beforehand, it is necessary to perform Stochastic Gradient Descent online to adapt the weights of the network, severely compromising the speed of the system. In this paper we equip a basic tracking algorithm with a novel fully-convolutional Siamese network trained end-to-end on the ILSVRC15 dataset for object detection in video. Our tracker operates at frame-rates beyond real-time and, despite its extreme simplicity, achieves state-of-the-art performance in multiple benchmarks.
Keywords: object-tracking, Siamese-network, similarity-learning, deep-learning
1 Introduction

We consider the problem of tracking an arbitrary object in video, where the object is identified solely by a rectangle in the first frame. Since the algorithm may be requested to track any arbitrary object, it is impossible to have already gathered data and trained a specific detector.

For several years, the most successful paradigm for this scenario has been to learn a model of the object's appearance in an online fashion using examples extracted from the video itself [1]. This owes in large part to the demonstrated ability of methods like TLD [2], Struck [3] and KCF [4]. However, a clear deficiency of using data derived exclusively from the current video is that only comparatively simple models can be learnt. While other problems in computer vision have seen an increasingly pervasive adoption of deep convolutional networks (conv-nets) trained from large supervised datasets, the scarcity of supervised data and the constraint of real-time operation prevent the naive application of deep learning within this paradigm of learning a detector per video.
The first two authors contributed equally, and are listed in alphabetical order.
Several recent works have aimed to overcome this limitation using a pre-trained deep conv-net that was learnt for a different but related task. These approaches either apply "shallow" methods (e.g. correlation filters) using the network's internal representation as features [5,6] or perform SGD (stochastic gradient descent) to fine-tune multiple layers of the network [7,8,9]. While the use of shallow methods does not take full advantage of the benefits of end-to-end learning, methods that apply SGD during tracking to achieve state-of-the-art results have not been able to operate in real-time.

We advocate an alternative approach in which a deep conv-net is trained to address a more general similarity learning problem in an initial offline phase, and then this function is simply evaluated online during tracking. The key contribution of this paper is to demonstrate that this approach achieves very competitive performance in modern tracking benchmarks at speeds that far exceed the frame-rate requirement. Specifically, we train a Siamese network to locate an exemplar image within a larger search image. A further contribution is a novel Siamese architecture that is fully-convolutional with respect to the search image: dense and efficient sliding-window evaluation is achieved with a bilinear layer that computes the cross-correlation of its two inputs.

We posit that the similarity learning approach has gone relatively neglected because the tracking community did not have access to vast labelled datasets. In fact, until recently the available datasets comprised only a few hundred annotated videos. However, we believe that the emergence of the ILSVRC dataset for object detection in video [10] (henceforth ImageNet Video) makes it possible to train such a model. Furthermore, the fairness of training and testing deep models for tracking using videos from the same domain is a point of controversy, as it has recently been prohibited by the VOT committee. We show that our model generalizes from the ImageNet Video domain to the ALOV/OTB/VOT [1,11,12] domain, enabling the videos of tracking benchmarks to be reserved for testing purposes.
2 Deep similarity learning for tracking

Learning to track arbitrary objects can be addressed using similarity learning. We propose to learn a function f(z, x) that compares an exemplar image z to a candidate image x of the same size and returns a high score if the two images depict the same object and a low score otherwise. To find the position of the object in a new image, we can then exhaustively test all possible locations and choose the candidate with the maximum similarity to the past appearance of the object. In experiments, we will simply use the initial appearance of the object as the exemplar. The function f will be learnt from a dataset of videos with labelled object trajectories.

Given their widespread success in computer vision [13,14,15,16], we will use a deep conv-net as the function f. Similarity learning with deep conv-nets is typically addressed using Siamese architectures [17,18,19]. Siamese networks apply an identical transformation φ to both inputs and then combine their representations using another function g, according to f(z, x) = g(φ(z), φ(x)). When the function g is a simple distance or similarity metric, the function φ can be considered an embedding. Deep Siamese conv-nets have previously been applied to tasks such as face verification [18,20,14], keypoint descriptor learning [19,21] and one-shot character recognition [22].

Fig. 1: Fully-convolutional Siamese architecture. Our architecture is fully-convolutional with respect to the search image x. The output is a scalar-valued score map whose dimension depends on the size of the search image. This enables the similarity function to be computed for all translated sub-windows within the search image in one evaluation. In this example, the red and blue pixels in the score map contain the similarities for the corresponding sub-windows. Best viewed in colour.

2.1 Fully-convolutional Siamese architecture

We propose a Siamese architecture which is fully-convolutional with respect to the candidate image x. We say that a function is fully-convolutional if it commutes with translation. To give a more precise definition, introducing L_τ to denote the translation operator (L_τ x)[u] = x[u − τ], a function h that maps signals to signals is fully-convolutional with integer stride k if

    h(L_{kτ} x) = L_τ h(x)   (1)

for any translation τ. (When x is a finite signal, this only need hold for the valid region of the output.)

The advantage of a fully-convolutional network is that, instead of a candidate image of the same size, we can provide as input to the network a much larger search image and it will compute the similarity at all translated sub-windows on a dense grid in a single evaluation. To achieve this, we use a convolutional embedding function φ and combine the resulting feature maps using a cross-correlation layer

    f(z, x) = φ(z) ∗ φ(x) + b,   (2)

where b denotes a signal which takes value b ∈ ℝ in every location. The output of this network is not a single score but rather a score map defined on a finite grid D ⊂ ℤ², as illustrated in Figure 1. Note that the output of the embedding function is a feature map with spatial support, as opposed to a plain vector. The same technique has been applied in contemporary work on stereo matching [23].
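To make eq. (2) concrete, note that the "convolution" routine of most deep learning libraries actually computes cross-correlation, so the score map can be obtained by using the exemplar embedding φ(z) as the kernel that is slid over the search embedding φ(x). The following is only an illustrative sketch (PyTorch is our assumption, not the authors' implementation; the feature sizes in the example anticipate the dimensions of Table 1):

```python
import torch
import torch.nn.functional as F

def score_map(z_feat, x_feat, bias=0.0):
    # z_feat: (1, C, Hz, Wz) exemplar embedding phi(z)
    # x_feat: (1, C, Hx, Wx) search embedding phi(x), with Hx >= Hz, Wx >= Wz
    # F.conv2d computes cross-correlation, so using phi(z) as the kernel
    # evaluates the inner product with every translated sub-window of phi(x).
    return F.conv2d(x_feat, z_feat) + bias  # shape (1, 1, Hx-Hz+1, Wx-Wz+1)

z_feat = torch.randn(1, 256, 6, 6)      # phi(z) for a 127x127 exemplar
x_feat = torch.randn(1, 256, 22, 22)    # phi(x) for a 255x255 search image
print(score_map(z_feat, x_feat).shape)  # torch.Size([1, 1, 17, 17])
```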
During tracking, we use a search image centred at the previous position of the target. The position of the maximum score relative to the centre of the score map, multiplied by the stride of the network, gives the displacement of the target from frame to frame. Multiple scales are searched in a single forward-pass by assembling a mini-batch of scaled images.

Combining feature maps using cross-correlation and evaluating the network once on the larger search image is mathematically equivalent to combining feature maps using the inner product and evaluating the network on each translated sub-window independently. However, the cross-correlation layer provides an incredibly simple method to implement this operation efficiently within the framework of existing conv-net libraries. While this is clearly useful during testing, it can also be exploited during training.

2.2 Training with large search images

We employ a discriminative approach, training the network on positive and negative pairs and adopting the logistic loss

    ℓ(y, v) = log(1 + exp(−yv)),   (3)

where v is the real-valued score of a single exemplar-candidate pair and y ∈ {+1, −1} is its ground-truth label. We exploit the fully-convolutional nature of our network during training by using pairs that comprise an exemplar image and a larger search image. This will produce a map of scores v : D → ℝ, effectively generating many examples per pair. We define the loss of a score map to be the mean of the individual losses

    L(y, v) = (1/|D|) Σ_{u ∈ D} ℓ(y[u], v[u]),   (4)

requiring a true label y[u] ∈ {+1, −1} for each position u ∈ D in the score map. The parameters θ of the conv-net are obtained by applying Stochastic Gradient Descent (SGD) to the problem

    arg min_θ E_{(z,x,y)} L(y, f(z, x; θ)).   (5)

Pairs are obtained from a dataset of annotated videos by extracting exemplar and search images that are centred on the target, as shown in Figure 2. The images are extracted from two frames of a video that both contain the object and are at most T frames apart. The class of the object is ignored during training. The scale of the object within each image is normalized without corrupting the aspect ratio of the image.

Fig. 2: Training pairs extracted from the same video: exemplar image and corresponding search image from the same video. When a sub-window extends beyond the extent of the image, the missing portions are filled with the mean RGB value.

The elements of the score map are considered to belong to a positive example if they are within radius R of the centre (accounting for the stride k of the network):

    y[u] = +1 if k‖u − c‖ ≤ R, and −1 otherwise.   (6)

The losses of the positive and negative examples in the score map are weighted to eliminate class imbalance.

Since our network is fully-convolutional, there is no risk that it learns a bias for the sub-window at the centre. We believe that it is effective to consider search images centred on the target because it is likely that the most difficult sub-windows, and those which have the most influence on the performance of the tracker, are those adjacent to the target.

Note that since the network is symmetric f(z, x) = f(x, z), it is in fact also fully-convolutional in the exemplar. While this allows us to use different size exemplar images for different objects in theory, we assume uniform sizes because it simplifies the mini-batch implementation. However, this assumption could be relaxed in the future.
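As an illustration of eqs. (3)–(6), the following numpy sketch builds the ±1 label map and evaluates a class-balanced logistic loss over a score map. The balancing scheme (equal total weight for positives and negatives) is one reasonable reading of the text rather than the authors' exact recipe, and the radius value in the example is an assumption:

```python
import numpy as np

def label_map(size, stride, radius):
    # Eq. (6): y[u] = +1 if the distance of u from the centre c, measured in
    # input pixels (i.e. multiplied by the network stride k), is within R.
    c = (size - 1) / 2.0
    ys, xs = np.mgrid[0:size, 0:size]
    dist = stride * np.sqrt((ys - c) ** 2 + (xs - c) ** 2)
    return np.where(dist <= radius, 1.0, -1.0)

def balanced_score_map_loss(v, y):
    # Eqs. (3)-(4): logistic loss over the score map, with positions
    # re-weighted so that positives and negatives contribute equally.
    per_elem = np.log1p(np.exp(-y * v))
    n_pos, n_neg = max((y > 0).sum(), 1), max((y < 0).sum(), 1)
    w = np.where(y > 0, 0.5 / n_pos, 0.5 / n_neg)
    return float((w * per_elem).sum())

y = label_map(size=17, stride=8, radius=16)  # radius R = 16 is an assumed value
v = np.random.randn(17, 17)                  # raw scores from the network
print(balanced_score_map_loss(v, y))
```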
2.3 ImageNet Video for tracking

The 2015 edition of the ImageNet Large Scale Visual Recognition Challenge [10] (ILSVRC) introduced the ImageNet Video dataset as part of the new object detection from video challenge. Participants are required to classify and locate objects from 30 different classes of animals and vehicles. Training and validation sets together contain almost 4500 videos, with a total of more than one million annotated frames. This number is particularly impressive if compared to the number of labelled sequences in VOT [12], ALOV [1] and OTB [11], which together total less than 500 videos. We believe that this dataset should be of extreme interest to the tracking community not only for its vast size, but also because it depicts scenes and objects different to those found in the canonical tracking benchmarks. For this reason, it can safely be used to train a deep model for tracking without over-fitting to the domain of videos used in these benchmarks.
2.4 Practical considerations

Dataset curation. During training, we adopt exemplar images that are 127×127 and search images that are 255×255 pixels. Images are scaled such that the bounding box, plus an added margin for context, has a fixed area. More precisely, if the tight bounding box has size (w, h) and the context margin is p, then the scale factor s is chosen such that the area of the scaled rectangle is equal to a constant:

    s(w + 2p) × s(h + 2p) = A.   (7)

We use the area of the exemplar images A = 127² and set the amount of context to be half of the mean dimension, p = (w + h)/4. Exemplar and search images for every frame are extracted offline to avoid image resizing during training. In a preliminary version of this work, we adopted a few heuristics to limit the number of frames from which to extract the training data. For the experiments of this paper, instead, we have used all of the videos in the dataset.
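A small sketch of the scaling in eq. (7), assuming numpy; it returns the factor s by which a frame must be resized so that the exemplar crop around a (w, h) target occupies exactly 127×127 pixels:

```python
import numpy as np

def exemplar_scale(w, h, A=127.0 ** 2):
    # Eq. (7): choose s such that s(w + 2p) * s(h + 2p) = A,
    # with context margin p = (w + h) / 4.
    p = (w + h) / 4.0
    return np.sqrt(A / ((w + 2 * p) * (h + 2 * p)))

s = exemplar_scale(50, 100)
print(s, 127.0 / s)  # scale factor, and the crop side length in original-image pixels
```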
Network architecture
The architecture that we adopt for the embedding function φ resembles the convolutional stage of the network of Krizhevsky et al. [16]. The dimensions of the parameters and activations are given in Table 1. Max-pooling is employed after the first two convolutional layers. ReLU non-linearities follow every convolutional layer except for conv5, the final layer. During training, batch normalization [24] is inserted immediately after every linear layer. The stride of the final representation is eight. An important aspect of the design is that no padding is introduced within the network. Although this is common practice in image classification, it violates the fully-convolutional property of eq. 1.
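Table 1 below gives the exact dimensions. Purely as an illustration, here is a sketch of an embedding with this structure; PyTorch is our assumption (the original implementation uses MatConvNet), and the grouped convolutions mirror the channel-map column of the table:

```python
import torch
import torch.nn as nn

class SiamFCEmbedding(nn.Module):
    # Convolutional embedding phi: no padding anywhere, total stride 8,
    # ReLU after every conv except conv5, batch-norm after every conv
    # during training (cf. the description above and Table 1).
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=11, stride=2),     # conv1
            nn.BatchNorm2d(96), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),          # pool1
            nn.Conv2d(96, 256, kernel_size=5, groups=2),    # conv2
            nn.BatchNorm2d(256), nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),          # pool2
            nn.Conv2d(256, 384, kernel_size=3),             # conv3
            nn.BatchNorm2d(384), nn.ReLU(inplace=True),
            nn.Conv2d(384, 384, kernel_size=3, groups=2),   # conv4
            nn.BatchNorm2d(384), nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3, groups=2),   # conv5
        )

    def forward(self, x):
        return self.features(x)

phi = SiamFCEmbedding().eval()
print(phi(torch.zeros(1, 3, 127, 127)).shape)  # exemplar -> (1, 256, 6, 6)
print(phi(torch.zeros(1, 3, 255, 255)).shape)  # search   -> (1, 256, 22, 22)
```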
Table 1: Architecture of the convolutional embedding function, which is similar to the convolutional stage of the network of Krizhevsky et al. [16]. The channel map property describes the number of output and input channels of each convolutional layer.

Layer | Support | Chan. map | Stride | Activation (exemplar) | Activation (search) | Chans.
      |         |           |        | 127×127               | 255×255             | ×3
conv1 | 11×11   | 96×3      | 2      | 59×59                 | 123×123             | ×96
pool1 | 3×3     |           | 2      | 29×29                 | 61×61               | ×96
conv2 | 5×5     | 256×48    | 1      | 25×25                 | 57×57               | ×256
pool2 | 3×3     |           | 2      | 12×12                 | 28×28               | ×256
conv3 | 3×3     | 384×256   | 1      | 10×10                 | 26×26               | ×384
conv4 | 3×3     | 384×192   | 1      | 8×8                   | 24×24               | ×384
conv5 | 3×3     | 256×192   | 1      | 6×6                   | 22×22               | ×256

Tracking algorithm. Since our purpose is to prove the efficacy of our fully-convolutional Siamese network and its generalization capability when trained on ImageNet Video, we use an extremely simplistic algorithm to perform tracking. Unlike more sophisticated trackers, we do not update a model or maintain a memory of past appearances, we do not incorporate additional cues such as optical flow or colour histograms, and we do not refine our prediction with bounding box regression. Yet, despite its simplicity, the tracking algorithm achieves surprisingly good results when equipped with our offline-learnt similarity metric.

Online, we do incorporate some elementary temporal constraints: we only search for the object within a region of approximately four times its previous size, and a cosine window is added to the score map to penalize large displacements. Tracking through scale space is achieved by processing several scaled versions of the search image. Any change in scale is penalized and updates of the current scale are damped.
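A rough numpy sketch of these two heuristics follows; the cosine-window mixing weight is an assumption (the text only says the window is added to the score map), while the scale damping factor of 0.35 is the value quoted in the implementation details of Sec. 4.1:

```python
import numpy as np

def penalize_displacement(score_map, window_influence=0.25):
    # Mix the score map with a normalized cosine (Hann) window centred on the
    # previous target position, penalizing large displacements.
    h, w = score_map.shape
    window = np.outer(np.hanning(h), np.hanning(w))
    window /= window.sum()
    return (1.0 - window_influence) * score_map + window_influence * window

def damp_scale(current_scale, best_scale, damping=0.35):
    # Update the scale by linear interpolation towards the best-scoring scale.
    return (1.0 - damping) * current_scale + damping * best_scale
```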
3 Related work

Several recent works have sought to train Recurrent Neural Networks (RNNs) for the problem of object tracking. Gan et al. [25] train an RNN to predict the absolute position of the target in each frame and Kahou et al. [26] similarly train an RNN for tracking using a differentiable attention mechanism. These methods have not yet demonstrated competitive results on modern benchmarks; however, this is certainly a promising avenue for future research. We remark that an interesting parallel can be drawn between this approach and ours, by interpreting a Siamese network as an unrolled RNN that is trained and evaluated on sequences of length two. Siamese networks could therefore serve as strong initialization for a recurrent model.

Denil et al. [27] track objects with a particle filter that uses a learnt distance metric to compare the current appearance to that of the first frame. However, their distance metric is vastly different to ours. Instead of comparing images of the entire object, they compute distances between fixations (foveated glimpses of small regions within the object's bounding box). To learn a distance metric, they train a Restricted Boltzmann Machine (RBM) and then use the Euclidean distance between hidden activations for two fixations. Although RBMs are unsupervised, they suggest training the RBM on random fixations within centred images of the object to detect. This must either be performed online or in an offline phase with knowledge of the object to track. While tracking an object,
they learn a stochastic policy for choosing fixations which is specific to that object, using uncertainty as a reward signal. Besides synthetic sequences of MNIST digits, this method has only been demonstrated qualitatively on problems of face and person tracking.

While it is infeasible to train a deep conv-net from scratch for each new video, several works have investigated the feasibility of fine-tuning from pre-trained parameters at test time. SO-DLT [7] and MDNet [9] both train a convolutional network for a similar detection task in an offline phase, then at test time use SGD to learn a detector with examples extracted from the video itself, as in the conventional tracking-as-detector-learning paradigm. These methods cannot operate at frame-rate due to the computational burden of evaluating forward and backward passes on many examples. An alternative way to leverage conv-nets for tracking is to apply traditional shallow methods using the internal representation of a pre-trained convolutional network as features. While trackers in this style such as DeepSRDCF [6], Ma et al. [5] and FCNT [8] have achieved strong results, they have been unable to achieve frame-rate operation due to the relatively high dimension of the conv-net representation.

Concurrently with our own work, some other authors have also proposed using conv-nets for object tracking by learning a function of pairs of images. Held et al. [28] introduce GOTURN, in which a conv-net is trained to regress directly from two images to the location in the second image of the object shown in the first image. Predicting a rectangle instead of a position has the advantage that changes in scale and aspect ratio can be handled without resorting to exhaustive evaluation. However, a disadvantage of their approach is that it does not possess intrinsic invariance to translation of the second image. This means that the network must be shown examples in all positions, which is achieved through considerable dataset augmentation. Chen et al. [29] train a network that maps an exemplar and a larger search region to a response map. However, their method also lacks invariance to translation of the second image since the final layers are fully-connected. Similarly to Held et al., this is inefficient because the training set must represent all translations of all objects. Their method is named YCNN for the Y shape of the network. Unlike our approach, they cannot adjust the size of the search region dynamically after training. Tao et al. [30] propose to train a Siamese network to identify candidate image locations that match the initial object appearance, dubbing their method SINT (Siamese INstance search Tracker). In contrast to our approach, they do not adopt an architecture which is fully-convolutional with respect to the search image. Instead, at test time, they sample bounding boxes uniformly on circles of varying radius, as in Struck [3]. Moreover, they incorporate optical flow and bounding box regression to improve the results. In order to improve the computational speed of their system, they employ Region of Interest (RoI) pooling to efficiently examine many overlapping sub-windows. Despite this optimization, at 2 frames per second the overall system is still far from being real-time.

All of the competitive methods above that train on video sequences (MDNet [9], SINT [30], GOTURN [28]) use training data belonging to the same
ALOV/OTB/VOT domain used by the benchmarks. This practice has been forbidden in the VOT challenge due to concerns about over-fitting to the scenes and objects in the benchmark. Thus an important contribution of our work is to demonstrate that a conv-net can be trained for effective object tracking without using videos from the same distribution as the testing set.
4 Experiments

4.1 Implementation details

Training. The parameters of the embedding function are found by minimizing eq. 5 with straightforward SGD using MatConvNet [31]. The initial values of the parameters follow a Gaussian distribution, scaled according to the improved Xavier method [32]. Training is performed over 50 epochs, each consisting of 50,000 sampled pairs (according to Sec. 2.2). The gradients for each iteration are estimated using mini-batches of size 8, and the learning rate is annealed geometrically at each epoch from 10⁻² to 10⁻⁵.
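For concreteness, the schedule described above corresponds to the following constants; a trivial sketch, with the geometric annealing implemented as log-linear interpolation between the two end-points:

```python
import numpy as np

EPOCHS = 50
PAIRS_PER_EPOCH = 50_000
BATCH_SIZE = 8

iters_per_epoch = PAIRS_PER_EPOCH // BATCH_SIZE   # 6250 SGD steps per epoch
learning_rates = np.logspace(-2, -5, num=EPOCHS)  # geometric annealing, one value per epoch
print(iters_per_epoch, learning_rates[0], learning_rates[-1])
```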
Tracking. As mentioned earlier, the online phase is deliberately minimalistic. The embedding φ(z) of the initial object appearance is computed once, and is compared convolutionally to sub-windows of the subsequent frames. We found that updating (the feature representation of) the exemplar online through simple strategies, such as linear interpolation, does not gain much performance and thus we keep it fixed. We found that upsampling the score map using bicubic interpolation, from 17×17 to 272×272, results in more accurate localization, since the original map is relatively coarse. To handle scale variations, we also search for the object over five scales 1.025^{−2,−1,0,1,2}, and update the scale by linear interpolation with a factor of 0.35 to provide damping.

In order to make our experimental results reproducible, we share training and tracking code, together with the scripts to generate the curated dataset. On a machine equipped with a single NVIDIA GeForce GTX Titan X and an Intel Core i7-4790K at 4.0 GHz, our full online tracking pipeline operates at 86 and 58 frames-per-second, when searching over 3 and 5 scales respectively.

4.2 Evaluation

We evaluate two variants of our simplistic tracker: SiamFC (Siamese Fully-Convolutional) and SiamFC-3s, which searches over 3 scales instead of 5.
4.3 The OTB-13 benchmark

The OTB-13 [11] benchmark considers the average per-frame success rate at different thresholds: a tracker is successful in a given frame if the intersection-over-union (IoU) between its estimate and the ground truth is above a certain threshold. Trackers are then compared in terms of the area under the curve of success rates for different values of this threshold. In addition to the trackers reported by [11], in Figure 3 we also compare against seven more recent state-of-the-art trackers presented at the major computer vision conferences and that can run at frame-rate speed: Staple [33], LCT [34], CCT [35], SCT4 [36], DLSSVM_NU [37], DSST [38] and KCFDP [39]. Given the nature of the sequences, for this benchmark only we convert 25% of the pairs to grayscale during training. All the other hyper-parameters (for training and tracking) are fixed.

Fig. 3: Success plots for OPE (one pass evaluation), TRE (temporal robustness evaluation) and SRE (spatial robustness evaluation) of the OTB-13 [11] benchmark. The results of CCT, SCT4 and KCFDP were only available for OPE at the time of writing. [Reported AUC scores — OPE: SiamFC (ours) 0.612, LCT (2015) 0.612, SiamFC_3s (ours) 0.608, CCT (2015) 0.605, Staple (2016) 0.600, SCT4 (2016) 0.595, KCFDP (2015) 0.581, DSST (2014) 0.554, DLSSVM_NU (2016) 0.550, SCM (2012) 0.499. TRE: SiamFC (ours) 0.621, SiamFC_3s (ours) 0.618, Staple (2016) 0.617, LCT (2015) 0.594, DLSSVM_NU (2016) 0.580, DSST (2014) 0.566, Struck (2011) 0.514, SCM (2012) 0.514, ASLA (2012) 0.485, CXT (2011) 0.463. SRE: SiamFC (ours) 0.554, SiamFC_3s (ours) 0.549, Staple (2016) 0.545, LCT (2015) 0.518, DLSSVM_NU (2016) 0.496, DSST (2014) 0.494, Struck (2011) 0.439, ASLA (2012) 0.421, SCM (2012) 0.420, TLD (2010) 0.402.]
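For reference, a small sketch of how the success curve and its area-under-the-curve summary can be computed from per-frame IoUs; the exact threshold grid used by the official toolkit may differ:

```python
import numpy as np

def success_curve(ious, thresholds=np.linspace(0.0, 1.0, 21)):
    # Success rate at each overlap threshold: the fraction of frames whose
    # IoU exceeds the threshold. The OTB summary score is the area under
    # this curve, here approximated by its mean over the threshold grid.
    ious = np.asarray(ious, dtype=float)
    success = np.array([(ious > t).mean() for t in thresholds])
    return success, float(success.mean())

curve, auc = success_curve([0.0, 0.3, 0.55, 0.7, 0.9])  # toy per-frame IoUs
print(auc)
```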
4.4 The VOT benchmarks

For our experiments, we use the latest stable version of the Visual Object Tracking (VOT) toolkit (tag vot2015-final), which evaluates trackers on sequences chosen from a pool of 356, selected so that seven different challenging situations are well represented. Many of the sequences were originally presented in other datasets (e.g. ALOV [1] and OTB [11]). Within the benchmark, trackers are automatically re-initialized five frames after failure, which is deemed to have occurred when the IoU between the estimated bounding box and the ground truth becomes zero.
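As a rough illustration of this protocol, a simplified sketch of the two quantities reported below, accuracy (average IoU) and robustness (number of failures); the official toolkit additionally handles re-initialization and burn-in frames, which are ignored here:

```python
import numpy as np

def accuracy_and_failures(ious):
    # A failure is a frame where the overlap with the ground truth drops to
    # zero; accuracy is the average IoU over the remaining frames.
    ious = np.asarray(ious, dtype=float)
    failures = int((ious == 0.0).sum())
    tracked = ious[ious > 0.0]
    accuracy = float(tracked.mean()) if tracked.size else 0.0
    return accuracy, failures

print(accuracy_and_failures([0.8, 0.6, 0.0, 0.7, 0.75]))  # -> (0.7125, 1)
```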
VOT-14 results.
We compare our method SiamFC (and the variant SiamFC-3s) against the best 10 trackers that participated in the 2014 edition of the VOT challenge [40]. We also include Staple [33] and GOTURN [28], two recent real-time trackers presented respectively at CVPR 2016 and ECCV 2016. Trackers are evaluated according to two measures of performance: accuracy and robustness. The former is calculated as the average IoU, while the latter is expressed in terms of the total number of failures. These give insight into the behaviour of a tracker. Figure 4 shows the Accuracy-Robustness plot, where the best trackers are closer to the top-right corner.

Fig. 4: VOT-14 Accuracy-robustness plot. Best trackers are closer to the top-right corner. [The plot (experiment baseline, mean over sequences) compares SiamFC (ours), SiamFC_3s (ours), Staple, GOTURN, ACAT, DGT, DSST, eASMS, HMMTxD, KCF, MCT, PLT_13, PLT_14 and SAMF.]

VOT-15 results.
We also compare our method against the 40 best participants in the 2015 edition [12]. In this case, the raw scores of accuracy and number of failures are used to compute the expected average overlap measure, which represents the average IoU with no re-initialization following a failure. Figure 5 illustrates the final ranking in terms of expected average overlap, while Table 2 reports scores and speed of the 15 highest ranked trackers of the challenge.
VOT-16 results.
At the time of writing, the results of the 2016 edition were not available. However, to facilitate an early comparison with our method, we report our scores. For SiamFC and SiamFC-3s we obtain, respectively, an overall expected overlap (average between the baseline and unsupervised experiments) of 0.3876 and 0.4051. Please note that these results are different from the VOT-16 report, as our entry in the challenge was a preliminary version of this work.

Despite its simplicity, our method improves over recent state-of-the-art real-time trackers (Figures 3 and 4). Moreover, it outperforms most of the best methods in the challenging VOT-15 benchmark, while being the only one that achieves frame-rate speed (Figure 5 and Table 2). These results demonstrate that the expressiveness of the similarity metric learnt by our fully-convolutional Siamese network on ImageNet Video alone is enough to achieve very strong results, comparable or superior to recent state-of-the-art methods, which often are several orders of magnitude slower. We believe that considerably higher performance could be obtained by augmenting the minimalist online tracking pipeline with the methods often adopted by the tracking community (e.g. model update, bounding-box regression, fine-tuning, memory).
Fig. 5: VOT-15 ranking in terms of expected average overlap. Only the best 40 results have been reported. [The top of the ranking reads MDNet, EBT, DeepSRDCF, SiamFC_3s (ours), SiamFC (ours), srdcf, sPST, LDP, ...; the plot also marks the VOT-15 winner (1 fps), the VOT-14 winner (24 fps), SiamFC_3s (86 fps) and SiamFC (58 fps).]

4.5 Dataset size

Table 3 illustrates how the size of the dataset used to train the Siamese network greatly influences the performance. The expected average overlap (measured on VOT-15) steadily improves from 0.168 to 0.274 when increasing the size of the dataset from 5% to 100%. This finding suggests that using a larger video dataset could increase the performance even further. In fact, even if 2 million supervised bounding boxes might seem a huge number, it should not be forgotten that they still belong to a relatively moderate number of videos, at least compared to the amount of data normally used to train conv-nets.
5 Conclusion

In this work, we depart from the traditional online learning methodology employed in tracking, and show an alternative approach that focuses on learning strong embeddings in an offline phase. Differently from their use in classification settings, we demonstrate that for tracking applications Siamese fully-convolutional deep networks have the ability to use the available data more efficiently. This is reflected both at test-time, by performing efficient spatial searches, and at training-time, where every sub-window effectively represents a useful sample with little extra cost. The experiments show that deep embeddings provide a naturally rich source of features for online trackers, and enable simplistic test-time strategies to perform well. We believe that this approach is complementary to more sophisticated online tracking methodologies, and expect future work to explore this relationship more thoroughly.
Table 2: Raw scores, overlap and reported speed for our proposed method and the best 15 performing trackers of the VOT-15 challenge. Where available, we compare with the speed reported by the authors, otherwise (*) we report the values from the VOT-15 results [12] in EFO units, which roughly correspond to fps (e.g. the speed of the NCC tracker is 140 fps and 160 EFO).
Tracker | Accuracy | # Failures | Overlap | Speed (fps)
SiamFC-3s (ours) | 0.5335 | 84 | 0.2889 | 86
SiamFC (ours) | 0.5240 | 87 | 0.2743 | 58
SRDCF [42] | 0.5260 | 71 | 0.2743 | 5
sPST [43] | 0.5230 | 85 | 0.2668 | 2
LDP [12] | 0.4688 | 78 | 0.2625 | 4*
SC-EBT [44] | 0.5171 | 103 | 0.2412 | –
NSAMF [45] | 0.5027 | 87 | 0.2376 | 5*
StruckMK [3] | 0.4442 | 90 | 0.2341 | 2
S3Tracker [46] | 0.5031 | 100 | 0.2292 | 14*
RAJSSC [12] | 0.5301 | 105 | 0.2262 | 2*
SumShift [46] | 0.4888 | 97 | 0.2233 | 17*
DAT [47] | 0.4705 | 113 | 0.2195 | 15
SO-DLT [7] | 0.5233 | 108 | 0.2190 | 5
Table 3: Effects of using increasing portions of the ImageNet Video dataset on tracker's performance.
Fig. 6: Snapshots of the simple tracker described in Section 2.4, equipped with our proposed fully-convolutional Siamese network trained from scratch on ImageNet Video. Our method does not perform any model update, so it uses only the first frame to compute φ(z). Nonetheless, it is surprisingly robust to a number of challenging situations like motion blur (row 2), drastic change of appearance (rows 1, 3 and 4), poor illumination (row 6) and scale change (row 6). On the other hand, our method is sensitive to scenes with confusion (row 5), arguably because the model is never updated and thus the cross-correlation gives high scores for all the windows that are similar to the first appearance of the target. All sequences come from the VOT-15 benchmark: gymnastics1, car1, fish3, iceskater1, marching, singer1. The snapshots have been taken at fixed frames (1, 50, 100 and 200) and the tracker is never re-initialized.

References
1. Smeulders, A.W.M., Chu, D.M., Cucchiara, R., Calderara, S., Dehghan, A., Shah, M.: Visual tracking: An experimental survey. PAMI (7) (2014) 1442–1468
2. Kalal, Z., Mikolajczyk, K., Matas, J.: Tracking-learning-detection. PAMI (7) (2012) 1409–1422
3. Hare, S., Saffari, A., Torr, P.H.S.: Struck: Structured output tracking with kernels. In: ICCV 2011, IEEE
4. Henriques, J.F., Caseiro, R., Martins, P., Batista, J.: High-speed tracking with kernelized correlation filters. PAMI (3) (2015) 583–596
5. Ma, C., Huang, J.B., Yang, X., Yang, M.H.: Hierarchical convolutional features for visual tracking. In: ICCV 2015
6. Danelljan, M., Hager, G., Khan, F., Felsberg, M.: Convolutional features for correlation filter based visual tracking. In: ICCV 2015 Workshop. (2015) 58–66
7. Wang, N., Li, S., Gupta, A., Yeung, D.Y.: Transferring rich feature hierarchies for robust visual tracking. arXiv CoRR (2015)
8. Wang, L., Ouyang, W., Wang, X., Lu, H.: Visual tracking with fully convolutional networks. In: ICCV 2015
9. Nam, H., Han, B.: Learning multi-domain convolutional neural networks for visual tracking. arXiv CoRR (2015)
10. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L.: ImageNet Large Scale Visual Recognition Challenge. IJCV (2015)
11. Wu, Y., Lim, J., Yang, M.H.: Online object tracking: A benchmark. In: CVPR 2013. (2013)
12. Kristan, M., Matas, J., Leonardis, A., Felsberg, M., Cehovin, L., Fernandez, G., Vojir, T., Hager, G., Nebehay, G., Pflugfelder, R.: The Visual Object Tracking VOT2015 Challenge results. In: ICCV 2015 Workshop. (2015) 1–23
13. Razavian, A., Azizpour, H., Sullivan, J., Carlsson, S.: CNN features off-the-shelf: An astounding baseline for recognition. In: CVPR 2014 Workshop
14. Parkhi, O.M., Vedaldi, A., Zisserman, A.: Deep face recognition. BMVC 2015 (2015)
15. Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., van der Smagt, P., Cremers, D., Brox, T.: FlowNet: Learning optical flow with convolutional networks. In: ICCV 2015
16. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: NIPS 2012
17. Bromley, J., Bentz, J.W., Bottou, L., Guyon, I., LeCun, Y., Moore, C., Säckinger, E., Shah, R.: Signature verification using a "Siamese" time delay neural network. International Journal of Pattern Recognition and Artificial Intelligence (1993)
18. Taigman, Y., Yang, M., Ranzato, M., Wolf, L.: DeepFace: Closing the gap to human-level performance in face verification. In: CVPR 2014. (2014) 1701–1708
19. Zagoruyko, S., Komodakis, N.: Learning to compare image patches via convolutional neural networks. In: CVPR 2015. (2015)
20. Schroff, F., Kalenichenko, D., Philbin, J.: FaceNet: A unified embedding for face recognition and clustering. In: CVPR 2015. (2015) 815–823
21. Simo-Serra, E., Trulls, E., Ferraz, L., Kokkinos, I., Fua, P., Moreno-Noguer, F.: Discriminative learning of deep convolutional feature point descriptors. In: ICCV 2015. (2015) 118–126
22. Koch, G., Zemel, R., Salakhutdinov, R.: Siamese neural networks for one-shot image recognition. In: ICML 2015 Deep Learning Workshop. (2015)
23. Luo, W., Schwing, A.G., Urtasun, R.: Efficient deep learning for stereo matching. In: CVPR 2016. (2016) 5695–5703
24. Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: ICML 2015. (2015) 448–456
25. Gan, Q., Guo, Q., Zhang, Z., Cho, K.: First step toward model-free, anonymous object tracking with recurrent neural networks. arXiv CoRR (2015)
26. Kahou, S.E., Michalski, V., Memisevic, R.: RATM: Recurrent Attentive Tracking Model. arXiv CoRR (2015)
27. Denil, M., Bazzani, L., Larochelle, H., de Freitas, N.: Learning where to attend with deep architectures for image tracking. Neural Computation (2012)
28. Held, D., Thrun, S., Savarese, S.: Learning to track at 100 fps with deep regression networks. arXiv CoRR (2016)
29. Chen, K., Tao, W.: Once for all: A two-flow convolutional neural network for visual tracking. arXiv CoRR (2016)
30. Tao, R., Gavves, E., Smeulders, A.W.M.: Siamese instance search for tracking. arXiv CoRR (2016)
31. Vedaldi, A., Lenc, K.: MatConvNet – Convolutional Neural Networks for MATLAB. (2015)
32. He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In: ICCV 2015. (2015)
33. Bertinetto, L., Valmadre, J., Golodetz, S., Miksik, O., Torr, P.H.S.: Staple: Complementary learners for real-time tracking. CVPR 2016 (2016)
34. Ma, C., Yang, X., Zhang, C., Yang, M.H.: Long-term correlation tracking. In: CVPR 2015
35. Zhu, G., Wang, J., Wu, Y., Lu, H.: Collaborative correlation tracking. In: BMVC 2015
36. Choi, J., Jin Chang, H., Jeong, J., Demiris, Y., Young Choi, J.: Visual tracking using attention-modulated disintegration and integration. In: CVPR 2016
37. Ning, J., Yang, J., Jiang, S., Zhang, L., Yang, M.H.: Object tracking via dual linear structured SVM and explicit feature map. In: CVPR 2016
38. Danelljan, M., Häger, G., Khan, F., Felsberg, M.: Accurate scale estimation for robust visual tracking. In: BMVC 2014
39. Huang, D., Luo, L., Wen, M., Chen, Z., Zhang, C.: Enable scale and aspect ratio adaptability in visual tracking with detection proposals
40. LIRIS, F.: The Visual Object Tracking VOT2014 challenge results
41. Zhu, G., Porikli, F., Li, H.: Tracking randomly moving objects on edge box proposals. arXiv CoRR (2015)
42. Danelljan, M., Hager, G., Shahbaz Khan, F., Felsberg, M.: Learning spatially regularized correlation filters for visual tracking. In: ICCV 2015. (2015) 4310–4318
43. Hua, Y., Alahari, K., Schmid, C.: Online object tracking with proposal selection. In: ICCV 2015. (2015) 3092–3100
44. Wang, N., Yeung, D.Y.: Ensemble-based tracking: Aggregating crowdsourced structured time series data. In: ICML 2014. (2014) 1107–1115
45. Li, Y., Zhu, J.: A scale adaptive kernel correlation filter tracker with feature integration. In: ECCV 2014 Workshops. (2014)
46. Li, A., Lin, M., Wu, Y., Yang, M.H., Yan, S.: NUS-PRO: A new visual tracking challenge. PAMI 38