Accurate Anchor Free Tracking
Shengyun Peng, College of Civil Engineering, Tongji University, Shanghai, P.R. China
Yunxuan Yu, Electrical and Computer Engineering Department, UCLA, Los Angeles, USA, [email protected]
Kun Wang, Electrical and Computer Engineering Department, UCLA, Los Angeles, USA, [email protected]
Lei He, Electrical and Computer Engineering Department, UCLA, Los Angeles, USA, [email protected]
Abstract
Visual object tracking is an important application of computer vision. Recently, Siamese-based trackers have achieved good accuracy. However, most Siamese-based trackers are not efficient, as they exhaustively search potential object locations to define anchors and then classify each anchor (i.e., a bounding box). This paper develops the first Anchor Free Siamese Network (AFSN). Specifically, a target object is defined by a bounding box center, tracking offset and object size. All three are regressed by a Siamese network with no additional classification or regional proposal, and performed once for each frame. We also tune the stride and receptive field of the Siamese network, and further perform ablation experiments to quantitatively illustrate the effectiveness of our AFSN. We evaluate AFSN using the five most commonly used benchmarks and compare to the best anchor-based trackers with source code available for each benchmark. AFSN is ×–× faster than these best anchor-based trackers. AFSN is also 5.97% to 12.4% more accurate in terms of all metrics for benchmark sets OTB2015, VOT2015, VOT2016, VOT2018 and TrackingNet, except that SiamRPN++ is 4% better than AFSN in terms of Expected Average Overlap (EAO) on VOT2018 (but SiamRPN++ is 3.9× slower).
1. Introduction
Video object tracking, which locates an arbitrary object in a changing video sequence, powers many computer vision topics such as automatic driving and pose estimation. Liu et al. [27] focus on the task of searching for a specific vehicle that appears in surveillance networks. Doellinger et al. [9] use tracking methods to predict local statistics about the direction of human motion. A core problem in tracking is how to locate an object accurately and efficiently in challenging scenarios like background clutter, occlusion, scale variation, illumination change, deformation and other variations [42].

Figure 1. Comparisons of our tracker with SiamRPN. AFSN is able to resist the interference of similar objects and illumination variation, and predicts a more precise bounding box than SiamRPN. When the ratio of length to width is abnormal, AFSN can still estimate the bounding box accurately.

Current trackers can be generally classified into two branches, i.e., generative and discriminative methods. Generative methods [18, 28, 29, 38, 46] consider tracking as a reconstruction problem and maintain a template set online to represent the moving target. Discriminative trackers like MOSSE [5], Struck [12], CSK [14] and KCF [15] learn a classifier between foreground and background [2, 1, 10]. Correlation filter (CF) trackers can update online with the current video due to their high efficiency. However, a clear deficiency of using data derived exclusively from the current video is that a comparatively simple model is learned. In contrast, trackers based on deep neural networks aim to make full use of the entire tracking dataset [31]. Siamese networks, which track an object through similarity comparison, have developed into various versions in the tracking community [4, 11, 25, 24, 13, 37, 39, 43, 40, 44].

Although these tracking approaches obtain balanced accuracy and speed, most of the successful Siamese trackers rely on anchors generated before tracking. A tracking object is represented by an axis-aligned bounding box that encompasses the object. The localization of the object becomes an image classification over an extensive number of potential anchors. Since this method needs to enumerate all possible object locations and regress a normalized distance for each prospective bounding box, it is inefficient. Furthermore, it restricts the ability to propose an accurate bounding box when the ratio of length to width is abnormal.

To deal with this challenge, we propose the first tracker without using an anchor. First, our Anchor Free Siamese Network (AFSN) represents an object with merely a center point, a tracking offset and the object size. Compared with anchor-based trackers, it has reduced complexity. Second, we model an object through the network inference result rather than modifying the object position and size with bounding boxes proposed in advance. Only one estimation is conducted for each frame, with no need to classify each potential anchor. The inference efficiency is improved significantly. Third, our method leverages all images in large-scale supervised tracking datasets. Clearly, using videos from various categories can largely improve robustness. Ablation experiments also demonstrate the effectiveness of our AFSN.

To further improve the tracking quality, we test different network backbones. We find that the accuracy drops severely when the network backbone grows deeper. This problem has also been discovered in SiamDW [45]. One reason is that these deeper and wider network architectures are mainly designed for image classification tasks, but are not necessarily optimal for tracking.
We also reveal that a bigger network stride improves the overlap area of receptive fields for two neighboring output score maps, but reduces tracking position precision, so the network stride needs to be optimized. In order to take full advantage of modern deep neural networks, in this paper we train 8 different backbones considering stride, receptive field, group convolution and kernel size. Section 4 gives a further analysis of the backbone design. The resulting AFSN outperforms the state-of-the-art tracker SiamRPN, as illustrated in Fig. 1.

We evaluate the proposed method using the most commonly used datasets, including OTB2015 [42], VOT2015 [22], VOT2016 [19], VOT2018 [20] and TrackingNet [30]. Our AFSN achieves leading performance on all 5 datasets. Compared with SiamRPN, AFSN increases the precision rate and success rate by 0.93% and 5.97% on OTB2015, respectively. In terms of EAO (expected average overlap), it outperforms SiamRPN on VOT2015, VOT2016 and VOT2018, achieving 0.381, 0.372 and 0.398, respectively. Meanwhile, AFSN runs at 136 FPS (frames per second) on a Titan Xp.

The contributions of this paper can be summarized in three folds as follows: 1) We propose the Anchor Free Siamese Network (AFSN), which is the first anchor-free end-to-end tracker trained with a large-scale dataset. 2) A quantitative analysis of the network architecture, especially receptive field and network stride, leads to the best network backbone for our AFSN. 3) Our tracker AFSN has balanced accuracy and robustness on five commonly used datasets.

The rest of the paper is organized as follows. Section 2 introduces related work. Section 3 presents our AFSN, and Section 4 optimizes the network backbone. Section 5 performs an experimental study, and Section 6 concludes the paper.
2. Related work
Recently, Siamese networks have drawn widespread attention in the tracking community because of their good accuracy and speed, and their capability to make full use of the tracking dataset during offline training. Basically, a Siamese network is used for comparing exemplar and instance image pairs, and exporting the final result as a score map. SINT [37] proposes learning a generic matching function for tracking, which can be applied to new tracking videos of previously unseen target objects. GOTURN [13] adapts the Siamese network to tracking and utilizes fully connected layers as fusion tensors. SiamFC [4] introduces the correlation operator. Dense and efficient sliding-window evaluation is achieved with a bilinear layer that computes the cross correlation of its two inputs, namely the instance branch and the exemplar branch. SiamRPN [25] integrates a popular detection technique, namely the region proposal network (RPN), with SiamFC. The tracker refines the proposals to avoid expensive multi-scale tests. SiamRPN++ [24] is a ResNet-driven Siamese tracker, which performs layer-wise and depth-wise aggregations. However, a tracking object is represented by an axis-aligned bounding box that encompasses the object. The localization of the object becomes an image classification over an extensive number of potential anchors. Since this method needs to enumerate all possible object locations and regress a normalized distance for all prospective bounding boxes, sliding-window based object trackers are inefficient.

Figure 2. Main framework of the Anchor Free Siamese Network (AFSN): on the left is the feature extraction subnetwork. Three branches, namely the classification branch, offset branch and scale branch, lie in the middle. These three branches are used for classifying foreground and background, eliminating the deviation, and predicting the object size. Then, depth-wise correlation is performed to obtain the final 2-layer score map, 2-layer tracking offset for the x and y axes, and 2-layer object size: width and height.

OTB2015 [42] constructs a benchmark dataset with 100 fully annotated sequences to facilitate performance evaluation. It is an extension of OTB2013 [41], which contains 50 representative video sequences. The VOT [19, 21, 20] datasets are constructed by a novel video clustering approach based on visual properties. The datasets are fully annotated; all the sequences are labelled per-frame with visual attributes to facilitate in-depth analysis. The OTB [41, 42], ALOV [35] and VOT [19, 21, 20] datasets represent the initial attempts to unify the testing data and performance measurements of generic object tracking. Recently, GOT-10k [17] has been proposed; it is larger than most tracking datasets and offers a much wider coverage of moving objects. Several competitive trackers (MDNet [31], SINT [37], GOTURN [13]) are trained on video sequences from the OTB and VOT datasets. However, this practice has been prohibited in the VOT challenge. Thus, we train our network with the GOT-10k dataset, which differs from the video sequences in the benchmarks. It is less likely for our model to over-fit the scenes and objects in the benchmarks.
Faster R-CNN [34] generates region proposals within the detection network. It samples fixed-shape anchors on a low-resolution image and classifies each into foreground or background. SiamRPN [25] adopts the RPN into the tracking scenario. The improved versions of SiamRPN, DaSiamRPN [48] and SiamRPN++ [24], are all successful trackers. However, the enumeration of a nearly exhaustive list of anchors is inefficient and requires extra post-processing. The tracking accuracy is also restricted by the pre-proposed fixed-shape bounding boxes.

Keypoint estimation has some great applications in detection. CornerNet [23] detects two bounding box corners as keypoints. ExtremeNet [32] predicts the four corners and center points for all objects. CenterNet [47] extracts a single center point per object without the need for post-processing. Since an anchor-free method generates the bounding box only once per inference, its simplicity can boost deep learning trackers. Different anchor-free trackers represent the object using different techniques: some represent the four corners, others represent the center point. This provides a variety of opportunities for using anchor-free methods in tracking. Anchor-free methods have many successful applications in detection because of their efficiency and great performance. However, they have not been fully exploited in tracking.
3. Siamese tracking without anchors
In this section, we describe the proposed AFSN framework in detail. As shown in Fig. 2, the AFSN consists of a Siamese network for feature extraction. Three branches, namely the classification branch, offset branch and scale branch, are used for classifying foreground and background, eliminating the deviation, and predicting the object size. Image pairs are fed into the proposed framework for end-to-end training.
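To make the data flow of Fig. 2 concrete, here is a minimal PyTorch sketch of the depth-wise correlation step that fuses the two branches. The function name, feature shapes and channel count are illustrative assumptions rather than the authors' released code; only the grouped-convolution trick is the standard way to implement depth-wise cross correlation.

```python
import torch
import torch.nn.functional as F

def depthwise_xcorr(exemplar_feat: torch.Tensor, instance_feat: torch.Tensor) -> torch.Tensor:
    """Depth-wise cross correlation: each channel of the exemplar feature
    is slid over the matching channel of the instance feature.

    exemplar_feat: (B, C, h, w) features of the template patch z
    instance_feat: (B, C, H, W) features of the search patch x
    returns:       (B, C, H - h + 1, W - w + 1) per-channel response
    """
    b, c, h, w = exemplar_feat.shape
    # Fold the batch into the channel axis and use grouped convolution,
    # so every (sample, channel) pair is correlated independently.
    x = instance_feat.reshape(1, b * c, instance_feat.shape[2], instance_feat.shape[3])
    kernel = exemplar_feat.reshape(b * c, 1, h, w)
    out = F.conv2d(x, kernel, groups=b * c)
    return out.reshape(b, c, out.shape[2], out.shape[3])

# Toy shapes (assumed, not from the paper): 256-channel features,
# 6x6 exemplar features vs. 22x22 instance features -> 17x17 maps,
# which the classification, offset and scale heads then consume.
z_feat = torch.randn(1, 256, 6, 6)
x_feat = torch.randn(1, 256, 22, 22)
response = depthwise_xcorr(z_feat, x_feat)  # (1, 256, 17, 17)
```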
Our aim is to represent the object with the bounding box center. Scale and tracking offset are regressed directly from image features at the center location. Let Y ∈ R^{W×H×2} be the output score map of the classification branch with width W and height H. Suppose Ŷ_{(x_i, y_j, k)} is the value at point (x_i, y_j) on the score map of the k-th frame. A prediction Ŷ_{(x_i, y_j, k)} = 1 corresponds to the tracking object center, while Ŷ_{(x_i, y_j, k)} = 0 is the background.

The classification labels Y are designed to represent various foreground objects. Therefore, the groundtruth keypoints are designed to obey a two-dimensional normal distribution. The mean value is the center of the bounding box. According to the three sigma rule [33], the probability of X falling away from its mean by more than 3 standard deviations is at most 5% if X obeys the normal distribution. Thus, the standard deviations in our label are one sixth of the width and height (σ_x = w/6, σ_y = h/6):

Y = \frac{1}{2\pi\sigma_x\sigma_y} \exp\left\{ -\left[ \frac{(x-\mu_x)^2}{2\sigma_x^2} + \frac{(y-\mu_y)^2}{2\sigma_y^2} \right] \right\}    (1)

The response value intensifies with the increase of the overlapping area between the exemplar and the instance. Hence, the score around the edge of the bounding box should be lower than at the center.

The training objective is a penalty-reduced pixel-wise logistic regression with focal loss [26]:

L_{cls} = -\frac{1}{N}\sum_{x,y,k} \begin{cases} \left(1-\hat{Y}_{xyk}\right)^{\alpha}\log\left(\hat{Y}_{xyk}\right) & Y_{xyk}=1 \\ \left(1-Y_{xyk}\right)^{\beta}\left(\hat{Y}_{xyk}\right)^{\alpha}\log\left(1-\hat{Y}_{xyk}\right) & \text{otherwise} \end{cases}    (2)

where α and β are hyper-parameters of the focal loss, and N is the number of frames in one epoch. We use α = 2 and β = 4 in all our experiments, following Law and Deng [23].

Since the input exemplar size, instance size and output score map are 127×127, 255×255 and 17×17 respectively, the stride of the network is 8. To eliminate the deviation and restore the gap, a tracking offset is added for each point on the score map. For the i-th point, the tracking offsets O_k = {(δx_k^{(i)}, δy_k^{(i)})}_{i=1}^{n} can be expressed as:

O_k = \left(\frac{x_k}{stride} - \hat{x}_k,\ \frac{y_k}{stride} - \hat{y}_k\right)    (3)

Then, the offset is trained with the L1 loss L_off. Each point on the score map is considered, which assists in locating the bounding box even if the center peak deviates from the groundtruth.

Estimating the size of an object is equivalent to regressing the object size S_k = (x_{2,k} − x_{1,k}, y_{2,k} − y_{1,k}) in each frame. To make sure the estimation falls in the positive region, we represent the size with α_k and β_k as:

S_k = \left(e^{\alpha_k}, e^{\beta_k}\right)    (4)

Then, a simple prediction is conducted. The L1 loss for the scale estimation at the bounding box center is:

L_{scl} = \frac{1}{N}\sum_k \left[ \left|\hat{\alpha}_{P_k}-\alpha_k\right| + \left|\hat{\beta}_{P_k}-\beta_k\right| \right]    (5)

Figure 3. The outputs of the proposed anchor free Siamese network: score map, tracking offset and object size.
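As a concrete reading of Eqs. 1 and 2, the NumPy sketch below builds the two-dimensional normal label and evaluates the penalty-reduced focal loss on a score map. It drops the 1/(2πσ_xσ_y) constant so the peak value is exactly 1 (as Ŷ = 1 at the center requires), and for simplicity takes σ directly in score-map units; the map size and clamping epsilon are our assumptions, not values from the paper.

```python
import numpy as np

def gaussian_label(map_hw, center, box_wh):
    """Eq. 1 label: 2-D normal with sigma_x = w/6, sigma_y = h/6, peaking
    at the bounding box center. The normalizing constant is dropped so
    the center value is exactly 1."""
    h_map, w_map = map_hw
    cx, cy = center
    sx, sy = box_wh[0] / 6.0, box_wh[1] / 6.0
    xs = np.arange(w_map)[None, :]
    ys = np.arange(h_map)[:, None]
    return np.exp(-((xs - cx) ** 2 / (2 * sx ** 2) + (ys - cy) ** 2 / (2 * sy ** 2)))

def focal_loss(pred, gt, alpha=2.0, beta=4.0, eps=1e-6):
    """Eq. 2: penalty-reduced pixel-wise focal loss with alpha=2, beta=4."""
    pred = np.clip(pred, eps, 1.0 - eps)
    pos = gt == 1.0  # exact peak locations
    pos_term = ((1 - pred) ** alpha * np.log(pred))[pos].sum()
    neg_term = (((1 - gt) ** beta) * (pred ** alpha) * np.log(1 - pred))[~pos].sum()
    return -(pos_term + neg_term)  # divide by N when summing over frames

label = gaussian_label((17, 17), center=(8, 8), box_wh=(40, 80))
loss = focal_loss(np.full((17, 17), 0.1), label)
```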
The feature extraction subnetwork is fully convolutional. The search for the optimum network architecture is presented in Section 4. Two branches compose the subnetwork. The template branch receives the exemplar patch (denoted as z). The search branch receives the full-scale instance patch (denoted as x). The two feature extraction branches share the same parameters, so the same types of features can be compared in the following network. Let L_t represent the translation operator (L_t x)[u] = x[u − t]; then full convolution with stride k can be defined as:

h(L_{kt}\,x) = L_t\,h(x)    (6)

The correlation operator is a batch processing function which compares the Euclidean distance or similarity metric between φ(z) and φ(x). Note that φ(z) and φ(x) denote the outputs of the template and search branches, respectively. Combining deep features in a higher dimension is equivalent to dense sampling around the bounding box and evaluating similarity after each feature extraction. However, the former method is more efficient due to the smaller scale of the higher-dimensional features. For convenience, let u(φ(z), φ(x)) denote the output of the correlation function.

Since no normalization for offset and scale is included, the overall loss function is:

loss = L_{cls} + \lambda_{off} L_{off} + \lambda_{scl} L_{scl}    (7)

where λ_off and λ_scl are two hyper-parameters to balance the three parts. In our experiments, we set λ_off = 0. and λ_scl = 4. Only this single network is used to predict the bounding box center Ŷ_k, tracking offset Ô_k and object size Ŝ_k.

Figure 4. Architectures of the designed backbone networks for AFSN. In this architecture list, k is the kernel size, s is the general convolution stride and group is the number of group convolutions. On the bottom is the success rate tested on the OTB2015 benchmark.

At inference time, the point with the highest response score is extracted from the score map. In order to avoid noise or sudden changes in the background, we also apply a Hanning window to the final score map. Suppose P̂_k = (x̂_k, ŷ_k) is the predicted center point in the k-th frame. Combining the regressed tracking offset Ô_k = (δx̂_k, δŷ_k) and the object size Ŝ_k = (ŵ_k, ĥ_k), the estimated bounding box can be expressed as:

\left(\hat{x}_k + \delta\hat{x}_k - \hat{w}_k/2,\ \hat{y}_k + \delta\hat{y}_k - \hat{h}_k/2,\ \hat{w}_k,\ \hat{h}_k\right)    (8)

A more explicit way of illustrating the Siamese outputs is shown in Figure 3.
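A minimal sketch of the inference step of Eq. 8, assuming a blending weight for the Hanning window (the paper only says a window is applied, not how strongly) and omitting the mapping from search-patch coordinates back to the full frame:

```python
import numpy as np

def decode_box(score, offset, size, stride=8, win_weight=0.3):
    """Eq. 8: combine the three head outputs into one (left, top, w, h) box.

    score:  (H, W)    classification response
    offset: (2, H, W) tracking offsets (dx, dy) from Eq. 3
    size:   (2, H, W) regressed width/height, already exp()-ed per Eq. 4
    win_weight is an assumed blending factor, not a value from the paper.
    """
    # Hanning window suppresses noise and sudden background changes.
    hann = np.outer(np.hanning(score.shape[0]), np.hanning(score.shape[1]))
    blended = (1 - win_weight) * score + win_weight * hann
    yi, xi = np.unravel_index(np.argmax(blended), blended.shape)
    dx, dy = offset[0, yi, xi], offset[1, yi, xi]
    w, h = size[0, yi, xi], size[1, yi, xi]
    # Invert Eq. 3: image coordinate = (grid coordinate + offset) * stride.
    x = (xi + dx) * stride
    y = (yi + dy) * stride
    return (x - w / 2, y - h / 2, w, h)
```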
4. Network backbone
This section presents the process of optimizing the network backbone for the proposed AFSN tracker. Stride, receptive field, group convolution and kernel size are the impact factors of different networks. For a faster network search process, the backbone networks are trained on 40% of the GOT-10k dataset [17] for 20 epochs.

With the sizes of the input images and the output score map, the stride of Siamese trackers can be calculated as:

stride = \frac{instance - exemplar}{scoremap - 1}    (9)

The aggregation of different kernel sizes controls the region of the receptive field. A larger receptive field provides greater image context, but shallow features like color and shape will be lost. A smaller receptive field focuses on several particular parts of objects, but it cannot capture the structure of target objects. From Eq. 9, we can find that if we increase the receptive field, the score map size will decrease because more information is contained in one convolution. Then, the stride will increase, leading to a larger gap between two exemplar images. The final tracking results are generally correct around the target object, but accurate localization and scale estimation cannot be achieved. If the receptive field decreases in order to capture detail features, the gap will also decrease. However, once the bounding box deviates from the tracking object, the tracker shows less robustness in relocating the object. Although the predicted scale will be more accurate in this scenario, the accuracy will not increase due to the poor robustness. In Siamese tracking, the template image is not updated online, which further decreases the accuracy. This is a common contradiction in Siamese networks, and it also explains why the backbone networks in Siamese trackers are relatively shallow.

To optimize and search for the best backbone network, 8 different backbones are trained, as shown in Fig. 4. The network stride affects the overlap area of the receptive fields for two neighboring output score maps. The proposed AFSN prefers a relatively small network stride, around 7 to 9 (AFSN1 vs. AFSN6 vs. AFSN8). In these cases, the depth of the shallow layers largely affects the success rate. Since shallow features like color and shape can apply to several similar background objects, there is no need to extract more shallow features. Therefore, the receptive field is best set at 70% to 80% of the exemplar image. Group convolution separates different channels into different kernels, which increases the robustness of the tracking (AFSN4 vs. AFSN7). It also decreases the computation amount. More channels extract more features and offer more similarity information to compare; the optimum channel number for deeper layers is 256 (AFSN1 vs. AFSN2 vs. AFSN3).
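Eq. 9 can be checked in one line; with the 255×255 instance, 127×127 exemplar and 17×17 score map quoted in Section 3, it yields the network stride of 8:

```python
def network_stride(instance: int, exemplar: int, score_map: int) -> float:
    """Eq. 9: total stride implied by the input and output resolutions."""
    return (instance - exemplar) / (score_map - 1)

# 255x255 instance, 127x127 exemplar, 17x17 score map -> stride 8.
assert network_stride(255, 127, 17) == 8.0
```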
5. Experiments
This section presents the results of our anchor free Siamese network on five challenging tracking datasets, i.e., OTB2015, VOT2015, VOT2016, VOT2018 and TrackingNet. All the tracking results are compared with the state-of-the-art trackers using their reported results to ensure a fair comparison.
The parameters of the proposed AFSN are trained from scratch, and the overall training objective optimizes the loss function in Eq. 7 with SGD. In total, 50 epochs are conducted, and the learning rate decreases in log space from − to −. Since the loss of the tracking offset occupies most of the loss in the first phase of training, we set a cut-off at − for the offset loss. We extract image pairs from the GOT-10k dataset for training, and test on the OTB and VOT datasets to verify the feasibility and efficiency of our model. The template image is cropped centered on the foreground object with a size of A × A:

(w + p) \times (h + p) = A^2,    (10)

where w, h are the target bounding box width and height, and p = (w + h)/2. For the template branch A is 127, and for the instance branch A is 255.

To investigate the impact of the anchor-free method, we change the training labels for SiamFC and SiamRPN without changing the inputs and hyper-parameters of the original models. SiamFC finds the best tracking scale by enumerating three potential scale ratios. We replace this with a new score map following the two-dimensional normal distribution; the size can then be directly predicted according to the response. By combining the original model with the newly designed label, the precision rate and the success rate on OTB2015 grow by 7.65% and 5.15%, as shown in Fig. 5 and 6. We also apply this approach to SiamRPN. The precision rate drops a little, mainly because the tracking offset is not included; the position estimation may drift due to the locations of the pre-proposed anchors. Even though no offset is incorporated, the success rate increases by 3.14%. These results demonstrate that the anchor-free design can improve tracking performance.

Figure 5. Ablation experiments: precision plot on OTB2015.
Figure 6. Ablation experiments: success plot on OTB2015.
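A small sketch of the crop rule of Eq. 10, under our reading that the context padding is p = (w + h)/2 (the exact form was garbled in extraction; this is the rule SiamRPN also uses):

```python
import math

def square_crop(w: float, h: float, out_a: int):
    """Eq. 10: side of the square crop around the target, plus the factor
    for resizing it to out_a x out_a (127 for the template branch,
    255 for the instance branch). Assumes context padding p = (w + h) / 2.
    """
    p = (w + h) / 2.0
    a = math.sqrt((w + p) * (h + p))  # solves (w + p)(h + p) = A^2
    return a, out_a / a

# A 40x80 target: crop a square of about 118.3 px, resize to 127x127.
side, scale = square_crop(40, 80, 127)
```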
OTB-2015 Benchmark
The standardized OTB benchmark provides a fair test of both accuracy and robustness. The benchmark [42] considers the precision plot and the success plot of one-pass evaluation (OPE). The precision plot reports the percentage of frames in which the estimated locations are within a given threshold distance of the target bounding box. A tracker is successful in a given frame if the intersection-over-union between its estimate and the groundtruth is above a certain threshold; the success plot reports this rate over a range of thresholds.

We compare our anchor free Siamese tracker on OTB2015 with the state-of-the-art trackers including SiamRPN [25], ACFN [6], Staple [3], SiamFC [4], CNN-SVM [16], DSST [7], CF2 [36], MOSSE [5], KCF [15] and CSK [14]. Figs. 7 and 8 show that our tracker produces leading results. Compared with the recent SiamRPN [25], the precision rate and success rate increase by 0.93% and 5.97%, respectively.
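For reference, the two OTB metrics in code form, assuming per-frame boxes stored as (x, y, w, h) rows; the 20-pixel and 0.5-IoU operating points are the benchmark's common defaults rather than values stated here:

```python
import numpy as np

def center_error(pred, gt):
    """Pixel distance between predicted and groundtruth box centers;
    boxes are arrays of (x, y, w, h) rows, one per frame."""
    pc = pred[:, :2] + pred[:, 2:] / 2
    gc = gt[:, :2] + gt[:, 2:] / 2
    return np.linalg.norm(pc - gc, axis=1)

def iou(pred, gt):
    """Per-frame intersection-over-union of (x, y, w, h) boxes."""
    x1 = np.maximum(pred[:, 0], gt[:, 0])
    y1 = np.maximum(pred[:, 1], gt[:, 1])
    x2 = np.minimum(pred[:, 0] + pred[:, 2], gt[:, 0] + gt[:, 2])
    y2 = np.minimum(pred[:, 1] + pred[:, 3], gt[:, 1] + gt[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    union = pred[:, 2] * pred[:, 3] + gt[:, 2] * gt[:, 3] - inter
    return inter / union

def precision_at(pred, gt, thresh=20.0):
    """Fraction of frames whose center error is within `thresh` pixels."""
    return float((center_error(pred, gt) <= thresh).mean())

def success_at(pred, gt, thresh=0.5):
    """Fraction of frames whose IoU exceeds `thresh`."""
    return float((iou(pred, gt) > thresh).mean())
```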
Figure 7. Precision plot of OTB2015.
Figure 8. Success plot of OTB2015.
The VOT2015 dataset consists of 60 sequences [22]. The overall performance is evaluated using Expected Average Overlap (EAO), which takes account of both accuracy and robustness. Besides, the speed is evaluated with a normalized speed, Equivalent Filter Operations (EFO).

We compare our AFSN with 10 state-of-the-art trackers. The results are reported in Tab. 1. SiamFC and SiamRPN are added to the comparison as our baselines. As shown in Fig. 9, our tracker ranks 1st in EAO. SiamFC is one of the top trackers on VOT2015, running at frame rates beyond real time with state-of-the-art performance. SiamRPN gains a 23% increase in EAO, and our AFSN achieves 0.381 in EAO, which is 9.2% higher than SiamRPN. AFSN also ranks 1st in accuracy, 2nd in EFO and 3rd in failure.

Figure 9. EAO score ranks on VOT2015.

Table 1. Results of the state-of-the-art trackers on VOT2015. Red, blue and green represent the 1st, 2nd and 3rd, respectively.

Tracker EAO Accuracy Failure EFO
DeepSRDCF 0.3181 0.56 1.0 0.38
EBT 0.313 0.45 1.02 1.76
SRDCF 0.2877 0.55 1.18 1.99
LDP 0.2785 0.49 1.3 4.36
sPST 0.2767 0.54 1.42 1.01
SC-EBT 0.2548 0.54 1.72 0.8
NSAMF 0.2536 0.53 1.29 5.47
Struck 0.2458 0.46 1.5 2.44
RAJSSC 0.242 0.57 1.75 2.12
S3Tracker 0.2403 0.52 1.67 14.27
SiamFC-3s 0.2904 0.54 1.42 8.68
SiamRPN 0.349 0.58 0.93 23.0
AFSN 0.381
The video sequences in VOT2016 are the same as in VOT2015, but the groundtruth bounding boxes are re-annotated. We compare our tracker to the top 20 trackers in the challenge. As shown in Fig. 10, AFSN outperforms all entries in the challenge. Tab. 2 gives details for several state-of-the-art trackers. AFSN achieves a 12.4% gain in EAO and 15.1% in accuracy compared with C-COT [8]. Our tracker also outperforms SiamRPN in EAO, accuracy and failure. Most prominently, our tracker operates at 136 FPS, which is × faster than C-COT.

Figure 10. Expected overlap scores in the VOT2016 challenge.

Table 2. Results of the published state-of-the-art trackers on VOT2016. Red, blue and green represent the 1st, 2nd and 3rd, respectively.

Tracker EAO Accuracy Failure EFO
C-COT 0.331 0.53 0.85 0.507
ECO-HC 0.322 0.53 1.08 15.13
Staple 0.2952 0.54 1.35 11.14
EBT 0.2913 0.47 0.9 3.011
MDNet 0.257 0.54 1.2 0.534
SiamRN 0.2766 0.55 1.37 5.44
SiamAN 0.2352 0.53 1.65 9.21
SiamRPN 0.3441 0.56 1.08 23.3
AFSN 0.372 0.61 1.04 20.6
The VOT2018 [20] dataset consists of 60 video sequences. The performance is also evaluated in terms of accuracy (average overlap in the course of successful tracking) and robustness (failure rate); EAO is the combination of these two measurements. Tab. 3 shows the comparison of our approach with the top 10 trackers in the VOT2018 challenge. Among the top trackers, our AFSN achieves the best EAO and accuracy while having competitive robustness. Although the recently released SiamRPN++ [24] achieves 0.414 in EAO, our AFSN operates about 3.9× faster (136 FPS vs. 35 FPS) than SiamRPN++ with only a 4% drop in EAO.
Figure 11. EAO score ranks on VOT2018.

Table 3. Results of the published state-of-the-art trackers on VOT2018. Red, blue and green represent the 1st, 2nd and 3rd, respectively.

Tracker EAO Accuracy Robustness
LADCF 0.389 0.503 0.159
MFT 0.385 0.505 0.140
DaSiamRPN 0.383 0.586 0.276
UPDT 0.378 0.536 0.184
RCO 0.376 0.507 0.155
DRT 0.356 0.519 0.201
DeepSTRCF 0.345 0.523 0.215
CPT 0.339 0.506 0.239
SASiamR 0.337 0.566 0.258
DLSTpp 0.325 0.543 0.224
AFSN 0.398 0.589 0.204
TrackingNet [30] is the first large-scale dataset and benchmark for object tracking in the wild. It provides more than 30K videos with more than 14 million dense bounding box annotations sampled from YouTube. The dataset covers a wide selection of object classes in broad and diverse contexts. The trackers are evaluated using an online evaluation server on a test set of 511 videos. The results for precision, normalized precision and success are reported in Tab. 4. MDNet achieves 0.565 and UPDT achieves 0.611 in terms of precision and success, respectively. Our AFSN ranks 1st, with relative gains of 7.4% and 7.2%.

Table 4. Comparison on the TrackingNet test set with the state-of-the-art trackers. Red, blue and green represent the 1st, 2nd and 3rd, respectively.

Tracker Precision Norm. precision Success
UPDT 0.557 0.702 0.611
MDNet 0.565 0.705 0.606
CFNet 0.533 0.654 0.578
SiamFC 0.533 0.666 0.571
DaSiamRPN 0.413 0.602 0.568
ECO 0.492 0.618 0.554
CSRDCF 0.480 0.622 0.534
SAMF 0.477 0.598 0.504
Staple 0.470 0.603 0.528
AFSN 0.607 0.738 0.655
6. Conclusions
This paper presents the first in-depth study of anchor free object tracking, called the Anchor Free Siamese Network (AFSN). Unlike traditional Siamese trackers, the proposed AFSN does not need to enumerate an exhaustive list of potential object locations and classify each anchor. A target object is characterized by a bounding box center, tracking offset and object size. All three are regressed by a Siamese network, performed one time per frame. We also optimize the Siamese network architecture for AFSN, and perform extensive ablation experiments to quantitatively illustrate the effectiveness of AFSN. We evaluate AFSN using the five most commonly used benchmarks and compare to the best anchor-based trackers with source code available for each benchmark. AFSN is ×–× faster than these best anchor-based trackers. AFSN is also 5.97% to 12.4% more accurate in terms of all metrics for benchmark sets OTB2015, VOT2015, VOT2016, VOT2018 and TrackingNet, except that SiamRPN++ is 4% better than AFSN in terms of Expected Average Overlap (EAO) on VOT2018 (but SiamRPN++ is 3.9× slower).

References
[1] S. Avidan. Support vector tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(8):1064–1072, Aug 2004.
[2] S. Avidan. Ensemble tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(2):261–271, Feb 2007.
[3] L. Bertinetto, J. Valmadre, S. Golodetz, O. Miksik, and P. H. Torr. Staple: Complementary learners for real-time tracking. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
[4] L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, and P. H. Torr. Fully-convolutional siamese networks for object tracking. In Computer Vision – ECCV 2016 Workshops, pages 850–865, Cham, 2016. Springer International Publishing.
[5] D. Bolme, J. Beveridge, B. Draper, and Y. Lui. Visual object tracking using adaptive correlation filters. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2544–2550, June 2010.
[6] J. Choi, H. J. Chang, S. Yun, T. Fischer, Y. Demiris, and J. Y. Choi. Attentional correlation filter network for adaptive visual tracking. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4828–4837, July 2017.
[7] M. Danelljan, G. Häger, and F. Khan. Accurate scale estimation for robust visual tracking. In British Machine Vision Conference, pages 1–11, 2014.
[8] M. Danelljan, A. Robinson, F. Khan, and M. Felsberg. Beyond correlation filters: Learning continuous convolution operators for visual tracking. In Computer Vision – ECCV 2016, volume 9909, pages 472–488, 2016.
[9] J. Doellinger, V. S. Prabhakaran, L. Fu, and M. Spies. Environment-aware multi-target tracking of pedestrians. IEEE Robotics and Automation Letters, 4(2):1831–1837, April 2019.
[10] H. Grabner, M. Grabner, and H. Bischof. Real-time tracking via on-line boosting. In British Machine Vision Conference, volume 1, pages 47–56, 2006.
[11] Q. Guo, W. Feng, C. Zhou, R. Huang, L. Wan, and S. Wang. Learning dynamic siamese network for visual object tracking. In The IEEE International Conference on Computer Vision (ICCV), pages 1781–1789, Oct 2017.
[12] S. Hare, A. Saffari, and P. H. S. Torr. Struck: Structured output tracking with kernels. In The IEEE International Conference on Computer Vision (ICCV), pages 263–270, Nov 2011.
[13] D. Held, S. Thrun, and S. Savarese. Learning to track at 100 fps with deep regression networks. In Computer Vision – ECCV 2016, pages 749–765, Cham, 2016. Springer International Publishing.
[14] J. Henriques, R. Caseiro, P. Martins, and J. Batista. Exploiting the circulant structure of tracking-by-detection with kernels. In Computer Vision – ECCV 2012, volume 7575, pages 702–715, 2012.
[15] J. F. Henriques, R. Caseiro, P. Martins, and J. Batista. High-speed tracking with kernelized correlation filters. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(3):583–596, March 2015.
[16] S. Hong, T. You, S. Kwak, and B. Han. Online tracking by learning discriminative saliency map with convolutional neural network. In Proceedings of the 32nd International Conference on Machine Learning (ICML), pages 597–606. JMLR.org, 2015.
[17] L. H. Huang, X. Zhao, and K. Q. Huang. GOT-10k: A large high-diversity benchmark for generic object tracking in the wild. ArXiv, abs/1810.11981, 2018.
[18] X. Jia, H. Lu, and M. Yang. Visual tracking via adaptive structural local sparse appearance model. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1822–1829, June 2012.
[19] M. Kristan, A. Leonardis, J. Matas, M. Felsberg, R. Pflugfelder, and L. Čehovin. The visual object tracking VOT2016 challenge results. In Computer Vision – ECCV 2016 Workshops, pages 777–823, Cham, 2016. Springer International Publishing.
[20] M. Kristan, A. Leonardis, J. Matas, M. Felsberg, R. Pflugfelder, L. Zajc, T. Vojíř, G. Bhat, and A. Lukežič. The sixth visual object tracking VOT2018 challenge results. In Computer Vision – ECCV 2018 Workshops, pages 3–53, Cham, 2019. Springer International Publishing.
[21] M. Kristan, A. Leonardis, J. Matas, M. Felsberg, R. Pflugfelder, L. Č. Zajc, T. Vojíř, G. Häger, A. Lukežič, A. Eldesokey, G. Fernández, and Á. García-Martín. The visual object tracking VOT2017 challenge results. In The IEEE International Conference on Computer Vision (ICCV) Workshops, pages 1949–1972, Oct 2017.
[22] M. Kristan, J. Matas, A. Leonardis, M. Felsberg, L. Čehovin, G. Fernández, T. Vojíř, G. Häger, G. Nebehay, and R. Pflugfelder. The visual object tracking VOT2015 challenge results. In The IEEE International Conference on Computer Vision (ICCV) Workshops, December 2015.
[23] H. Law and J. Deng. CornerNet: Detecting objects as paired keypoints. International Journal of Computer Vision, Aug 2019.
[24] B. Li, W. Wu, Q. Wang, F. Y. Zhang, J. L. Xing, and J. J. Yan. SiamRPN++: Evolution of siamese visual tracking with very deep networks. CoRR, abs/1812.11703, 2018.
[25] B. Li, J. J. Yan, W. Wu, Z. Zhu, and X. L. Hu. High performance visual tracking with siamese region proposal network. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
[26] T. Y. Lin, P. Goyal, R. Girshick, K. M. He, and P. Dollár. Focal loss for dense object detection. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
[27] X. C. Liu, H. D. Ma, and S. Q. Li. PVSS: A progressive vehicle search system for video surveillance networks. Journal of Computer Science and Technology, 34(3):634–644, May 2019.
[28] X. Mei and H. B. Ling. Robust visual tracking using L1 minimization. In The IEEE International Conference on Computer Vision (ICCV), pages 1436–1443, 2009.
[29] X. Mei, H. B. Ling, Y. Wu, E. Blasch, and L. Bai. Minimum error bounded efficient L1 tracker with occlusion detection. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1257–1264, 2011.
[30] M. Muller, A. Bibi, S. Giancola, S. Alsubaihi, and B. Ghanem. TrackingNet: A large-scale dataset and benchmark for object tracking in the wild. In The European Conference on Computer Vision (ECCV), September 2018.
[31] H. Nam and B. Han. Learning multi-domain convolutional neural networks for visual tracking. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4293–4302, June 2016.
[32] F. Nasse and G. A. Fink. A bottom-up approach for learning visual object detection models from unreliable sources. In Pattern Recognition, pages 488–497, Berlin, Heidelberg, 2012. Springer Berlin Heidelberg.
[33] F. Pukelsheim. The three sigma rule. The American Statistician, 48(2):88–91, 1994.
[34] S. Q. Ren, K. M. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems 28, pages 91–99. Curran Associates, Inc., 2015.
[35] A. W. M. Smeulders, D. M. Chu, R. Cucchiara, S. Calderara, A. Dehghan, and M. Shah. Visual tracking: An experimental survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(7):1442–1468, July 2014.
[36] C. Sun, D. Wang, H. C. Lu, and M. H. Yang. Correlation tracking via joint discrimination and reliability learning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
[37] R. Tao, E. Gavves, and A. Smeulders. Siamese instance search for tracking. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1420–1429, June 2016.
[38] D. Wang, H. Lu, and M. Yang. Online object tracking with sparse prototypes. IEEE Transactions on Image Processing, 22(1):314–325, Jan 2013.
[39] Q. Wang, Z. Teng, J. Xing, J. Gao, W. Hu, and S. Maybank. Learning attentions: Residual attentional siamese network for high performance online visual tracking. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4854–4863, June 2018.
[40] Q. Wang, M. D. Zhang, J. L. Xing, J. Gao, W. M. Hu, and S. Maybank. Do not lose the details: Reinforced representation learning for high performance visual tracking. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence (IJCAI-18), pages 985–991, July 2018.
[41] Y. Wu, J. Lim, and M. H. Yang. Online object tracking: A benchmark. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2013.
[42] Y. Wu, J. Lim, and M. Yang. Object tracking benchmark. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(9):1834–1848, Sep. 2015.
[43] X. C. Zhang, P. Ye, S. Y. Peng, J. Liu, K. Gong, and G. Xiao. SiamFT: An RGB-infrared fusion tracking method via fully convolutional siamese networks. IEEE Access, 7:122122–122133, 2019.
[44] X. C. Zhang, P. Ye, S. Y. Peng, J. Liu, and G. Xiao. DSiamMFT: An RGB-T fusion tracking method via dynamic siamese networks using multi-layer feature fusion. Signal Processing: Image Communication, 84:115756, 2020.
[45] Z. P. Zhang, H. W. Peng, and Q. Wang. Deeper and wider siamese networks for real-time visual tracking. CoRR, abs/1901.01660, 2019.
[46] W. Zhong, H. Lu, and M. Yang. Robust object tracking via sparse collaborative appearance model. IEEE Transactions on Image Processing, 23(5):2356–2368, May 2014.
[47] X. Y. Zhou, D. Q. Wang, and P. Krähenbühl. Objects as points. CoRR, abs/1904.07850, 2019.
[48] Z. Zhu, Q. Wang, B. Li, W. Wu, J. J. Yan, and W. M. Hu. Distractor-aware siamese networks for visual object tracking. In The European Conference on Computer Vision (ECCV), September 2018.