COMET: Context-Aware IoU-Guided Network for Small Object Tracking
Seyed Mojtaba Marvasti-Zadeh, Javad Khaghani, Hossein Ghanei-Yakhdan, Shohreh Kasaei, Li Cheng
University of Alberta, Edmonton, Canada: {mojtaba.marvasti,khaghani,lcheng5}@ualberta.ca
Yazd University, Yazd, Iran: [email protected]
Sharif University of Technology, Tehran, Iran: [email protected]
(Seyed Mojtaba Marvasti-Zadeh and Javad Khaghani contributed equally.)
Abstract.
We consider the problem of tracking an unknown small target in aerial videos captured from medium to high altitudes. This is a challenging problem, which becomes even more pronounced in unavoidable scenarios of drastic camera motion and high density. To address this problem, we introduce a context-aware IoU-guided tracker (COMET) that exploits a multitask two-stream network and an offline reference proposal generation strategy. The proposed network fully exploits target-related information through multi-scale feature learning and attention modules. The proposed strategy introduces an efficient sampling scheme to generalize the network on the target and its parts without imposing extra computational complexity during online tracking. These strategies contribute considerably to handling significant occlusions and viewpoint changes. Empirically, COMET outperforms the state of the art on a range of aerial view datasets that focus on tracking small objects. Specifically, COMET outperforms the celebrated ATOM tracker by an average margin of 6.2% (and 7%) in terms of precision (and success) on small object tracking datasets.

1 Introduction

Aerial object tracking in real-world scenarios [1,2,3] aims to accurately localize a model-free target while robustly estimating a fitted bounding box on the target region. Given the wide variety of applications [4,5], vision-based methods for flying robots demand robust aerial visual trackers [6,7]. Generally speaking, aerial visual tracking can be categorized into videos captured from low altitudes and videos captured from medium/high altitudes.
Fig. 1: Examples comparing low-altitude and medium/high-altitude aerial tracking. The first row represents the size of most targets in the UAV-123 [8] dataset, which was captured from 10~30 meters. Examples of small object tracking scenarios in the UAVDT [2], VisDrone-2019 [1], and Small-90 [15] datasets are shown in the last two rows. UAV-123 contains mostly large/medium-sized objects, whereas targets in the small object datasets occupy just a few pixels of a frame. Note that Small-90 incorporates small object videos from several datasets such as UAV-123, OTB, and TC-128 [16]. The focus of this work is on small/tiny object tracking.
Low-altitude aerial scenarios (e.g., UAV-123 [8]) look at medium or large objects in surveillance videos with limited viewing angles (similar to traditional tracking scenarios such as OTB [9,10] or VOT [11,12]). However, tracking a target in aerial videos captured from medium (30~70 meters) and high (>70 meters) altitudes has recently introduced extra challenges, including tiny objects, dense cluttered backgrounds, weather conditions, wide aerial views, severe camera/object motion, drastic camera rotation, and significant viewpoint change [13,1,2,14]. In most cases, it is arduous even for humans to track tiny objects in the presence of a complex background because of the limited number of pixels on the objects. Fig. 1 compares the two main categories of aerial visual tracking. Most objects in the first category (captured from low-altitude aerial views, 10~30 meters) are medium/large-sized and provide sufficient information for appearance modeling.
The second category aims to track targets with few pixels in complicated scenarios.

Recent state-of-the-art trackers cannot provide satisfactory results for small object tracking since strategies to handle its challenges have not been considered. Besides, although various approaches have been proposed for small object detection [17,18,19], only a limited number of methods focus on aerial view tracking. These trackers [20,21,22,23,24] are based on discriminative correlation filters (DCF), which have inherent limitations (e.g., the boundary effect problem), and their performance is not competitive with modern trackers, despite extracting deep features. Moreover, they cannot handle aspect ratio change, even though it is a critical characteristic for aerial view tracking. Therefore, the proposed method aims to narrow the gap between modern visual trackers and aerial ones.

Tracking small objects involves major difficulties, including the lack of sufficient target information to distinguish the target from background or distractors, a much larger space of possible locations (i.e., an accurate localization requirement), and limited knowledge from previous efforts. Motivated by these issues and by recent advances in small object detection, this paper proposes a Context-aware iOu-guided network for sMall objEct Tracking (COMET).
It exploits a multitask two-stream network to process target-relevant information at various scales and to focus on salient areas via attention modules. Given a rough estimation of the target location by an online classification network [25], the proposed network simultaneously predicts the intersection-over-union (IoU) and the normalized center location error (CLE) between the estimated bounding boxes (BBs) and the target. Moreover, an effective proposal generation strategy is proposed, which helps the network to learn contextual information. Using this strategy, the proposed network effectively exploits the representations of a target and its parts. It also leads to a better generalization of the proposed network in handling occlusion and viewpoint change for small object tracking from medium- and high-altitude aerial views. The contributions of this paper are twofold.
1) Offline Proposal Generation Strategy:
In offline training, the proposed method generates a limited number of high-quality proposals from the reference frame. The proposed strategy provides context information and helps the network to learn the target and its parts. Therefore, it successfully handles large occlusions and viewpoint changes in challenging aerial scenarios. Furthermore, it is used only in offline training and imposes no extra computational complexity on online tracking.
2) Multitask Two-Stream Network:
COMET utilizes a multitask two-stream network to deal with the challenges of small object tracking. First, the network fuses aggregated multi-scale spatial features with semantic ones to provide rich features. Second, it utilizes lightweight spatial and channel attention modules to focus on the information most relevant for small object tracking. Third, the network optimizes a proposed multitask loss function to consider both accuracy and robustness.

Extensive experimental analyses are performed to compare the proposed tracker with state-of-the-art methods on well-known aerial view benchmarks, namely UAVDT [2], VisDrone-2019 [1], Small-90 [15], and UAV-123 [8]. The results demonstrate the effectiveness of COMET for small object tracking purposes.

The rest of the paper is organized as follows. In Section 2, an overview of related work is briefly outlined. In Section 3 and Section 4, our approach and empirical evaluation are presented. Finally, the conclusion is summarized in Section 5.
2 Related Work

In this section, focusing on two-stream neural networks, modern visual trackers are briefly described. Also, aerial visual trackers and some small object detection methods are summarized.
2.1 Two-Stream Networks for Visual Tracking

Two-stream networks (a generalized form of Siamese neural networks (SNNs)) for visual tracking gained interest with generic object tracking using regression networks (GOTURN) [26], which utilizes offline training of a network without any online fine-tuning during tracking. This idea was continued by fully-convolutional Siamese networks (SiamFC) [27], which defined visual tracking as a general similarity learning problem to address the limited labeled data issue. To exploit both the efficiency of the correlation filter (CF) and CNN features, CFNet [28] provides a closed-form solution for end-to-end training of a CF layer. The work of [29] applies a triplet loss on exemplar, positive instance, and negative instance to strengthen the feedback of back-propagation and provide powerful features. These methods could not achieve performance competitive with well-known DCF methods (e.g., [30,31]) since they are prone to drift problems; however, they run at beyond real-time speed.
As the baseline of the well-known Siamese trackers (e.g., [32,33,34,35,36]), the Siamese region proposal network (SiamRPN) [37] formulates generic object tracking as local one-shot learning with bounding box refinement.
Distractor-aware Siamese RPNs (DaSiamRPN) [32] exploit semantic backgrounds, distractor suppression, and a local-to-global search region to learn robust features and address occlusion and out-of-view. To design deeper and wider networks, SiamDW [33] investigated various units and backbone networks to take full advantage of state-of-the-art network architectures.
The Siamese cascaded RPN (CRPN) [34] consists of multiple RPNs that perform stage-by-stage classification and localization. SiamRPN++ [36] proposes a ResNet-driven Siamese tracker that not only exploits layer-wise and depth-wise aggregations but also uses a spatial-aware sampling strategy to successfully train a deeper network. The SiamMask tracker [35] benefits from bounding box refinement and class-agnostic binary segmentation to improve the estimated target region.

Although the mentioned trackers provide both desirable performance and computational efficiency, they mostly do not consider background information and suffer from poor generalization due to the lack of online training and update strategies. The ATOM tracker [25] performs classification and target estimation tasks with the aid of an online classifier and an offline IoU-predictor, respectively. First, it discriminates the target from its background, and then an IoU-predictor refines the generated proposals around the estimated location. Similarly, and based on a model prediction network, the DiMP tracker [38] learns a robust target model by employing a discriminative loss function and an iterative optimization strategy with few steps.

Despite considerable achievements on surveillance videos, the performance of modern trackers drops dramatically on videos captured from medium- and high-altitude aerial views; the main reason is the lack of strategies to deal with small object tracking challenges. For instance, the limited information of a tiny target, the dense distribution of distractors, or significant viewpoint changes lead to tracking failures of conventional trackers.
2.2 Small Object Detection and Aerial View Tracking

In this subsection, recent advances in small object detection and also aerial view trackers are briefly described. Various approaches have been proposed to overcome the shortcomings of small object detection [17].
For instance, the single shot multi-box detector (SSD) [39] uses low-level features for small object detection and high-level ones for larger objects.
The deconvolutional single shot detector (DSSD) [40] increases the resolution of feature maps using deconvolution layers to consider context information for small object detection.
The multi-scale deconvolutional single shot detector for small objects (MDSSD) [41] utilizes several multi-scale deconvolution fusion modules to enhance small object detection performance. Also, [42] utilizes multi-scale feature concatenation and attention mechanisms to enhance small object detection using context information. The SCRDet method [43] introduces SF-Net for feature fusion and MDA-Net for highlighting object information using attention modules. Furthermore, other well-known detectors (e.g., YOLOv3 [44]) exploit similar ideas, such as multi-scale feature pyramid networks, to alleviate their poor accuracy on small objects.

On the other hand, the development of specific methods for small object tracking from aerial views is still in progress, and there are limited algorithms addressing the existing challenges. Current trackers are based on discriminative correlation filters (DCFs), which provide satisfactory computational complexity but have intrinsic limitations such as the inability to handle aspect ratio changes of targets. For instance, the aberrance repressed correlation filter (ARCF) [20] proposes a cropping matrix and regularization terms to restrict the alteration rate of the response map. To tackle boundary effects and improve tracking robustness, the boundary effect-aware visual tracker (BEVT) [21] penalizes the object regarding its location, learns background information, and compares the scores of consecutive response maps. The keyfilter-aware tracker [23] learns context information and avoids filter corruption by generating key-filters and enforcing a temporal restriction. To improve the quality of the training set, the time slot-based distillation algorithm (TSD) [22] adaptively scores historical samples by a cooperative energy minimization function. It also accelerates this process by discarding low-score samples. Finally, AutoTrack [24] aims to learn a spatio-temporal regularization term automatically. It exploits local-global response variation to focus on trustworthy target parts and to determine its learning rate. The results of these trackers are not competitive with state-of-the-art visual trackers (e.g., Siamese-based trackers [36,35], ATOM [25], and DiMP [38]). Therefore, the proposed method aims to narrow the gap between modern visual trackers and aerial view tracking methods by exploiting advances in small object detection.
3 Our Approach

A key motivation of COMET is to solve the issues discussed in Sec. 1 and Sec. 2 by adapting small object detection schemes into the network architecture for tracking purposes. A graphical overview of the proposed offline training and online tracking is shown in Fig. 2. The proposed framework mainly consists of an offline proposal generation strategy and a two-stream multitask network, which comprises lightweight modules for small object tracking. The proposed proposal generation strategy helps the network to learn a generalized target model and to handle occlusion and viewpoint change with the aid of context information.
Fig. 2: Overview of the proposed method in the offline training and online tracking phases. (Both phases pass a reference frame and a target frame through ResNet-50 backbones (blocks 3 and 4) and the proposed two-stream network with IoU and CLE heads; offline training uses the proposed offline proposal generator on the ground truth, while online tracking uses the online classifier of ATOM [25] and an online proposal generator.)

This strategy is applied only during offline training of the network to avoid extra computational burden in online tracking. This section presents an overview of the proposed method and a detailed description of the main contributions.
3.1 Offline Proposal Generation Strategy

The eventual goal of proposal generation strategies is to provide a set of candidate detection regions, which are possible locations of objects. There are various category-dependent strategies for proposal generation [45,39,46]. For instance, IoU-Net [46] augments the ground truth instead of using region proposal networks (RPNs) to provide better performance and robustness to the network. Also, ATOM [25] uses a proposal generation strategy similar to [46] with a modulation vector to integrate target-specific information into its network.

Motivated by IoU-Net [46] and ATOM [25], an offline proposal generation strategy is proposed to extract context information of the target from the reference frame. The ATOM tracker generates $N$ target proposals from the test frame ($P_{t+\zeta}$), given the target location in that frame ($G_{t+\zeta}$). In offline training, the target proposals are produced by jittering the ground-truth locations, whereas in online tracking the estimated locations obtained by a simple two-layer classification network are jittered instead. The test proposals are generated subject to $IoU_{G_{t+\zeta}} \triangleq IoU(G_{t+\zeta}, P_{t+\zeta}) \geq T_1$. Then, a network is trained to predict the IoU values ($IoU_{pred}$) between $P_{t+\zeta}$ and the object, given the BB of the target in the reference frame ($G_t$). Finally, the network designed in ATOM minimizes the mean square error between $IoU_{G_{t+\zeta}}$ and $IoU_{pred}$.

In this work, the proposed strategy also provides target patches with background supporters from the reference frame (denoted as $P_t$) to address the challenging problems of small object tracking. Besides $G_t$, the proposed method exploits $P_t$ only in offline training. Using context features and target parts assists the proposed network (Sec. 3.2) in handling occlusion and viewpoint change problems for small objects. For simplicity, the proposed offline proposal generation strategy is described together with the IoU-prediction process; however, the proposed network predicts both the IoU and the center location error (CLE) of the test proposals with respect to the target, simultaneously.
Algorithm 1: Offline Proposal Generation
Notation: bounding box B ($G_{t+\zeta}$ for a test frame or $G_t$ for a reference frame); IoU threshold T ($T_1$ for a test frame or $T_2$ for a reference frame); number of proposals N (N for a test frame or (N/2 - 1) for a reference frame); maximum iteration ($max_{ii}$); a Gaussian distribution with zero mean ($\mu = 0$) and randomly selected variance $\Sigma_r$; bounding box proposals generated by Gaussian jittering P ($P_{t+\zeta}$ for a test frame or $P_t$ for a reference frame).
Input: B, T, N, $\Sigma_r$, $max_{ii}$
Output: P
for i = 1 : N do
    ii = 0
    do
        P[i] = B + $\mathcal{N}(\mu, \Sigma_r)$
        ii = ii + 1
    while (IoU(B, P[i]) < T) and (ii < $max_{ii}$)
end
return P

An overview of the offline proposal generation process for IoU-prediction is shown in Algorithm 1. The proposed strategy generates (N/2 - 1) reference proposals $P_t$ that satisfy $IoU_{G_t} \triangleq IoU(G_t, P_t) \geq T_2$. Note that it considers $T_2 > T_1$ to prevent drift toward visual distractors. The proposed tracker exploits this information (especially in challenging scenarios involving occlusion and viewpoint change) to avoid confusion during target tracking. $P_t$ and $G_t$ are passed through the reference branch of the proposed network simultaneously (Sec. 3.2). In this work, an extended modulation vector is introduced to provide the representations of the target and its parts to the network; it is a set of modulation vectors in which each vector encodes the information of one reference proposal. To compute the IoU-prediction, the features of the test patch are modulated by the features of the target and its parts, i.e., the IoU-prediction of the N test proposals is computed per reference proposal. Thus, there are N/2 groups of N IoU-predictions; the extended modulation vector allows all N/2 groups of N IoU-predictions to be computed at once. Therefore, the network predicts N/2 groups of $IoU_{G_{t+\zeta}}$ values. During online tracking, COMET does not generate $P_t$ and uses only $G_t$ to predict a single group of IoU-predictions for the generated $P_{t+\zeta}$. Therefore, the proposed strategy imposes no extra computational complexity in online tracking.
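For concreteness, the following is a minimal sketch of the jittering loop in Algorithm 1, assuming axis-aligned boxes in (x, y, w, h) format; the iou() helper, the variance set, and the threshold value in the example call are illustrative placeholders rather than the exact settings used by COMET.

```python
import random
import numpy as np

def iou(a, b):
    """IoU of two (x, y, w, h) boxes; a small helper assumed for this sketch."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    iw = max(0.0, min(ax2, bx2) - max(a[0], b[0]))
    ih = max(0.0, min(ay2, by2) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def generate_proposals(box, num, iou_thresh, sigmas, max_iter=200):
    """Gaussian-jitter `box` until each proposal overlaps it by at least `iou_thresh` (Algorithm 1)."""
    proposals = []
    for _ in range(num):
        it = 0
        while True:
            sigma = random.choice(sigmas)                       # randomly selected variance (Sigma_r)
            cand = box + np.random.normal(0.0, sigma, size=4)   # zero-mean jitter of (x, y, w, h)
            it += 1
            if iou(box, cand) >= iou_thresh or it >= max_iter:  # accept or give up after max_iter tries
                break
        proposals.append(cand)
    return np.stack(proposals)

# Offline training example: N/2 - 1 = 7 reference proposals around a hypothetical small-target box.
# The 0.8 threshold and the sigma values are placeholders standing in for T2 and Sigma_r.
gt_ref = np.array([50.0, 40.0, 24.0, 18.0])
ref_proposals = generate_proposals(gt_ref, num=7, iou_thresh=0.8, sigmas=[2.0, 5.0])
```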
3.2 Multitask Two-Stream Network

Tracking small objects from aerial views involves extra difficulties, such as the low clarity of the target appearance, fast viewpoint changes, and drastic rotations, besides the existing tracking challenges. This part aims to design an architecture that handles the problems of small object tracking by considering recent advances in small object detection. Inspired by [46,25,43,47,48], a two-stream network is proposed (see Fig. 3), which consists of multi-scale processing and aggregation of features, fusion of hierarchical information, a spatial attention module, and a channel attention module. Moreover, the proposed network seeks to maximize the IoU between the estimated BBs and the object while minimizing their location distance. Hence, it exploits a multitask loss function, which is optimized to consider both the accuracy and robustness of the estimated BBs. In the following, the proposed architecture and the role of its main components are described.

Fig. 3: Overview of the proposed two-stream network. MSAF denotes the multi-scale aggregation and fusion module, which utilizes the InceptionV3 module in its top branch. (The reference and test branches use ResNet-50 block-3 and block-4 features; PrRoI pooling layers of size 3 x 3 (reference) and 5 x 5 (test), the extended modulation vector, and fully-connected layers produce the IoU and CLE predictions.)
The proposed network adopts ResNet-50 [49] to provide backbone features for the reference and test branches. Following small object detection methods, only the features from block 3 and block 4 of ResNet-50 are extracted, to exploit both spatial and semantic features while controlling the number of parameters [17,43]. Then, the proposed network employs a multi-scale aggregation and fusion module (MSAF). It processes the spatial information via the InceptionV3 module [50] to perform factorized asymmetric convolutions on target regions. This low-cost multi-scale processing helps the network to approximate optimal filters that are proper for small object tracking. Also, the semantic features are passed through convolution and deconvolution layers to be refined and resized for feature fusion. The resulting hierarchical information is fused by an element-wise addition of the spatial and semantic feature maps. After feature fusion, the number of channels is reduced by 1 x 1 convolutions. Note that a small target may occupy only about 0.01% of the pixels of a frame.
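The MSAF idea can be sketched in PyTorch as follows; the channel widths, the reduced two-branch Inception-style block, and the deconvolution configuration are simplifying assumptions for illustration, not the exact layers of COMET.

```python
import torch
import torch.nn as nn

class MSAF(nn.Module):
    """Simplified multi-scale aggregation and fusion: block-3 (spatial) + upsampled block-4 (semantic)."""
    def __init__(self, c3=1024, c4=2048, c_out=256):  # ResNet-50 block widths; c_out is a placeholder
        super().__init__()
        # Inception-style factorized asymmetric convolutions on the spatial (block-3) branch
        self.branch1 = nn.Conv2d(c3, c_out, kernel_size=1)
        self.branch3 = nn.Sequential(
            nn.Conv2d(c3, c_out, kernel_size=1),
            nn.Conv2d(c_out, c_out, kernel_size=(1, 3), padding=(0, 1)),
            nn.Conv2d(c_out, c_out, kernel_size=(3, 1), padding=(1, 0)))
        self.spatial_reduce = nn.Conv2d(2 * c_out, c_out, kernel_size=1)
        # Semantic (block-4) branch: refine, then deconvolve to the block-3 resolution
        self.semantic = nn.Sequential(
            nn.Conv2d(c4, c_out, kernel_size=3, padding=1),
            nn.ConvTranspose2d(c_out, c_out, kernel_size=2, stride=2))
        self.reduce = nn.Conv2d(c_out, c_out, kernel_size=1)  # post-fusion 1x1 channel reduction

    def forward(self, f3, f4):
        spatial = self.spatial_reduce(torch.cat([self.branch1(f3), self.branch3(f3)], dim=1))
        semantic = self.semantic(f4)              # upsampled to match f3 spatially
        return self.reduce(spatial + semantic)    # element-wise addition, then 1x1 conv

# f3: block-3 map (stride 16), f4: block-4 map (stride 32) of a 288x288 crop
f3, f4 = torch.randn(1, 1024, 18, 18), torch.randn(1, 2048, 9, 9)
fused = MSAF()(f3, f4)   # -> (1, 256, 18, 18)
```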
Next, the proposed network utilizes the bottleneck attention module (BAM) [48], which has a lightweight and simple architecture. It emphasizes target-related spatial and channel information and suppresses distractors and redundant information, which are common in aerial images [43]. The BAM includes a channel attention branch, a spatial attention branch, and an identity shortcut connection. In this work, SENet [51] is employed as the channel attention branch, which uses global average pooling (GAP) and a multi-layer perceptron to find the optimal combination of channels. The spatial attention branch utilizes dilated convolutions to increase the receptive field, which helps the network to consider context information for small object tracking. The spatial and channel attention modules answer where the critical features are located and what the relevant features are, respectively. Lastly, the identity shortcut connection helps with better gradient flow.
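A compact sketch of a BAM-style attention block [48] is given below, assuming an SE-like channel branch and a dilated-convolution spatial branch combined through a sigmoid gate and an identity shortcut; the reduction ratio and dilation value are placeholders.

```python
import torch
import torch.nn as nn

class BAMAttention(nn.Module):
    """BAM-style bottleneck attention: SE channel branch + dilated-conv spatial branch + identity shortcut."""
    def __init__(self, channels=256, reduction=16, dilation=4):
        super().__init__()
        self.channel = nn.Sequential(                       # SENet-style channel attention (GAP + MLP)
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1))
        self.spatial = nn.Sequential(                       # dilated convolutions enlarge the receptive field
            nn.Conv2d(channels, channels // reduction, 1),
            nn.Conv2d(channels // reduction, channels // reduction, 3, padding=dilation, dilation=dilation),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels // reduction, 3, padding=dilation, dilation=dilation),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, 1, 1))

    def forward(self, x):
        attn = torch.sigmoid(self.channel(x) + self.spatial(x))  # broadcast sum of channel and spatial maps
        return x + x * attn                                      # identity shortcut keeps gradient flow

refined = BAMAttention()(torch.randn(1, 256, 18, 18))
```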
After that, the proposed method generates proposals from the test frame. Also, in offline training, it uses the proposed proposal generation strategy to extract the BBs of the target and its parts from the reference frame. These generated BBs are applied to the resulting feature maps and fed into a precise region-of-interest (PrRoI) pooling layer [46], which is differentiable w.r.t. the BB coordinates. In the reference branch, a convolutional layer with a 3 x 3 kernel follows a 3 x 3 PrRoI pooling to produce the extended modulation vector, whereas the test proposals ($P_{t+\zeta}$) are applied to the features of the test branch and fed to a 5 x 5 PrRoI pooling; the modulated test features are then passed through fully-connected layers to predict the IoU and CLE of each proposal (see Fig. 3). The network is trained with the multitask loss $\mathcal{L}_{Net} = \mathcal{L}_{IoU} + \lambda\,\mathcal{L}_{CLE}$, where $\mathcal{L}_{IoU}$, $\mathcal{L}_{CLE}$, and $\lambda$ represent the loss function for the IoU-prediction head, the loss function for the CLE-prediction head, and a balancing hyper-parameter, respectively. Denoting the $i$-th IoU- and CLE-prediction values as $IoU^{(i)}$ and $CLE^{(i)}$, the loss functions are defined as

$$\mathcal{L}_{IoU} = \frac{1}{N}\sum_{i=1}^{N}\Big(IoU^{(i)}_{G_{t+\zeta}} - IoU^{(i)}_{pred}\Big)^{2}, \qquad (1)$$

$$\mathcal{L}_{CLE} = \frac{1}{N}\sum_{i=1}^{N}
\begin{cases}
\frac{1}{2}\Big(CLE^{(i)}_{G_{t+\zeta}} - CLE^{(i)}_{pred}\Big)^{2} & \text{if } \big|CLE^{(i)}_{G_{t+\zeta}} - CLE^{(i)}_{pred}\big| < 1,\\
\big|CLE^{(i)}_{G_{t+\zeta}} - CLE^{(i)}_{pred}\big| - \frac{1}{2} & \text{otherwise},
\end{cases} \qquad (2)$$

where $CLE_{G_{t+\zeta}} = (\Delta x_{G_{t+\zeta}}/width_{G_{t+\zeta}},\ \Delta y_{G_{t+\zeta}}/height_{G_{t+\zeta}})$ is the normalized distance between the center locations of $P_{t+\zeta}$ and $G_{t+\zeta}$; for example, $\Delta x_{G_{t+\zeta}}$ is calculated as $x_{G_{t+\zeta}} - x_{P_{t+\zeta}}$. Also, $CLE_{pred}$ (and $IoU_{pred}$) represents the predicted CLE (and the predicted IoU) between the BB estimations and the target, given an initial BB in the reference frame.
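Assuming per-proposal scalar IoU targets and normalized two-dimensional CLE targets, the multitask objective of Eqs. (1)-(2) can be written compactly with standard losses; λ = 4 follows the implementation details reported later, while the tensor shapes below are illustrative.

```python
import torch
import torch.nn.functional as F

def comet_loss(iou_pred, iou_gt, cle_pred, cle_gt, lam=4.0):
    """Multitask loss: mean-squared IoU error plus a lambda-weighted smooth-L1 center-error term."""
    loss_iou = F.mse_loss(iou_pred, iou_gt)         # Eq. (1), mean-reduced over proposals
    loss_cle = F.smooth_l1_loss(cle_pred, cle_gt)   # Eq. (2), Huber with threshold 1
    return loss_iou + lam * loss_cle

# Hypothetical batch of N = 16 proposals: predicted/target IoU and normalized (dx/w, dy/h) offsets
iou_pred, iou_gt = torch.rand(16), torch.rand(16)
cle_pred, cle_gt = torch.randn(16, 2), torch.randn(16, 2)
loss = comet_loss(iou_pred, iou_gt, cle_pred, cle_gt)
```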
In offline training, the proposed network optimizes this loss function to learn how to predict the target BB from the pattern of proposal generation.

In online tracking, the target BB from the first frame (similar to [37,35,36,25]) and the target proposals in the test frame are passed through the network. As a result, there is just one group of CLE-predictions as well as IoU-predictions, avoiding additional computational complexity. In this phase, the aim is to maximize the IoU-prediction of the test proposals using a gradient ascent algorithm and to minimize their CLE-prediction using a gradient descent algorithm. Algorithm 2 describes the process of online tracking in detail. It shows how the inputs are passed through the network and how the BB coordinates are updated based on scaled back-propagated gradients. While the IoU-gradients are scaled up with the BB sizes to optimize in a log-scaled domain (similar to [46]), only the x and y coordinates of the test BBs are scaled up for the CLE-gradients. This experimentally achieved better results compared to applying the IoU-gradient scaling to the CLE-gradients. The intuitive reason is that the network has learned the normalized location differences between the BB estimations and the target BB; that is, the CLE-prediction is responsible for accurate localization, whereas the IoU-prediction determines the BB aspect ratio. After refining the test proposals (N = 10 in the online phase) for n = 5 iterations, the proposed method selects the K = 3 best BBs based on their IoU-scores and uses the average of these predictions as the final target BB.

4 Empirical Evaluation

In this section, the proposed method is evaluated on state-of-the-art benchmarks for small object tracking from aerial views: VisDrone-2019-test-dev (35 videos) [1], UAVDT (50 videos) [2], and Small-90 (90 videos) [15]. Although the Small-90 dataset includes the challenging videos of the UAV-123 dataset with small objects, experimental results on the UAV-123 [8] dataset (a low-altitude UAV dataset, 10~30 meters) are also presented.
Generally speaking, the UAV-123 dataset lacks variety in small objects, camera motions, and real scenes [14]. Moreover, traditional tracking datasets do not contain challenges such as tiny objects, significant viewpoint changes, camera motion, and high density from aerial views. For instance, these datasets (e.g., OTB [10], VOT [11,12], etc.) mostly provide videos captured by fixed or moving car-based cameras with limited viewing angles. For these reasons, and given our focus on tracking small objects in videos captured from medium and high altitudes, the proposed tracker (COMET) is evaluated on the related benchmarks to demonstrate its motivation and effectiveness for small object tracking.

Experiments have been conducted three times, and the average results are reported. The proposed method is compared with state-of-the-art visual trackers in terms of precision, success, and normalized area-under-curve (AUC) metrics [10]. Note that all results have been produced by the standard benchmarks with default thresholds (i.e., 20 pixels for the precision scores and 0.5 for the success scores) for fair comparisons. In the following, implementation details, ablation analyses, and state-of-the-art comparisons are presented.
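As a reference for how the reported numbers are obtained, the per-frame metrics mentioned above (20-pixel precision and 0.5-IoU success) can be computed as in the small NumPy sketch below; it is a generic OTB-style implementation [10], not code from the benchmark toolkits themselves.

```python
import numpy as np

def precision_success(pred_boxes, gt_boxes, prec_thresh=20.0, succ_thresh=0.5):
    """Fraction of frames with center error <= 20 px and with IoU >= 0.5, for (x, y, w, h) boxes."""
    pred_boxes, gt_boxes = np.asarray(pred_boxes, float), np.asarray(gt_boxes, float)
    # center location error in pixels
    pc = pred_boxes[:, :2] + pred_boxes[:, 2:] / 2.0
    gc = gt_boxes[:, :2] + gt_boxes[:, 2:] / 2.0
    cle = np.linalg.norm(pc - gc, axis=1)
    # IoU between prediction and ground truth
    x1 = np.maximum(pred_boxes[:, 0], gt_boxes[:, 0])
    y1 = np.maximum(pred_boxes[:, 1], gt_boxes[:, 1])
    x2 = np.minimum(pred_boxes[:, 0] + pred_boxes[:, 2], gt_boxes[:, 0] + gt_boxes[:, 2])
    y2 = np.minimum(pred_boxes[:, 1] + pred_boxes[:, 3], gt_boxes[:, 1] + gt_boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    union = pred_boxes[:, 2] * pred_boxes[:, 3] + gt_boxes[:, 2] * gt_boxes[:, 3] - inter
    iou = inter / np.maximum(union, 1e-9)
    return (cle <= prec_thresh).mean(), (iou >= succ_thresh).mean()
```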
Algorithm 2: Online Tracking
Notation: input sequence (S); sequence length (T); current frame (t); rough estimation of the bounding box ($B^e_t$); generated test proposals ($B^p_t$); concatenated bounding boxes ($B^c_t$); bounding box prediction ($B^{pred}_t$); step size ($\beta$); number of refinements (n); online classification network ($Net_{ATOM_{online}}$); scale and center jittering (Jitt) with random factors; network predictions (IoU and CLE).
Input: $S = \{I_1, I_2, ..., I_T\}$, $B_1 = \{x_1, y_1, w_1, h_1\}$
Output: $B^{pred}_t$, $t \in \{2, ..., T\}$
for t = 1 : T do
    $B^e_t = Net_{ATOM_{online}}(I_t)$
    $B^p_t = Jitt(B^e_t)$
    $B^c_t = Concat(B^e_t, B^p_t)$
    for i = 1 : n do
        IoU, CLE = FeedForward($I_1$, $I_t$, $B_1$, $B^c_t$)
        $grad^{IoU}_{B^c_t} = [\partial IoU/\partial x,\ \partial IoU/\partial y,\ \partial IoU/\partial w,\ \partial IoU/\partial h]$
        $B^c_t \leftarrow B^c_t + \beta \times [\frac{\partial IoU}{\partial x} w,\ \frac{\partial IoU}{\partial y} h,\ \frac{\partial IoU}{\partial w} w,\ \frac{\partial IoU}{\partial h} h]$
        $grad^{CLE}_{B^c_t} = [\partial CLE/\partial x,\ \partial CLE/\partial y,\ \partial CLE/\partial w,\ \partial CLE/\partial h]$
        $B^c_t \leftarrow B^c_t - \beta \times [\frac{\partial CLE}{\partial x} w,\ \frac{\partial CLE}{\partial y} h,\ \frac{\partial CLE}{\partial w},\ \frac{\partial CLE}{\partial h}]$
    end
    $B^{K\times}_t \leftarrow$ select the K best $B^c_t$ w.r.t. IoU-scores
    $B^{pred}_t = Avg(B^{K\times}_t)$
end
return $B^{pred}_t$
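The refinement loop of Algorithm 2 can be sketched with automatic differentiation as below; `net` stands for any module that returns IoU and CLE predictions differentiable w.r.t. the boxes, the iteration count n = 5 and K = 3 follow the values stated above, and the step size β and the network interface are assumptions of this sketch.

```python
import torch

def refine_boxes(net, ref_feat, test_feat, boxes, steps=5, beta=0.25, k=3):
    """Sketch of Algorithm 2's refinement: gradient ascent on predicted IoU, descent on predicted CLE.
    `net(ref_feat, test_feat, boxes)` is assumed to return (iou, cle) for (x, y, w, h) boxes."""
    boxes = boxes.clone().detach()
    for _ in range(steps):
        boxes.requires_grad_(True)
        iou, cle = net(ref_feat, test_feat, boxes)
        g_iou, = torch.autograd.grad(iou.sum(), boxes, retain_graph=True)
        g_cle, = torch.autograd.grad(cle.sum(), boxes)
        with torch.no_grad():
            w, h = boxes[:, 2:3], boxes[:, 3:4]
            iou_scale = torch.cat([w, h, w, h], dim=1)   # IoU-gradients scaled by the box size (log-scale-like)
            cle_scale = torch.cat([w, h, torch.ones_like(w), torch.ones_like(h)], dim=1)  # only x, y scaled
            boxes = (boxes + beta * g_iou * iou_scale - beta * g_cle * cle_scale).detach()
    with torch.no_grad():
        iou, _ = net(ref_feat, test_feat, boxes)
        best = iou.topk(min(k, boxes.shape[0])).indices  # keep the K best proposals by IoU-score
    return boxes[best].mean(dim=0)                       # average them as the final target box
```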
4.1 Implementation Details

The proposed method uses ResNet-50 pre-trained on ImageNet [52] to extract backbone features. For offline proposal generation, the hyper-parameters are set to N = 16 test proposals and (N/2) = 8 reference boxes (seven reference proposals plus the reference ground truth), IoU thresholds $T_1$ and $T_2$ with $T_2 > T_1$, and $\lambda = 4$; image sample pairs are randomly selected from videos with a maximum gap of 50 frames ($\zeta = 50$). From the reference image, a square patch centered at the target is extracted with an area of 5 times the target region. Flipping and color jittering are used for data augmentation of the reference patch. To extract the search area, a patch (with an area of 5 times the test target scale) with some perturbation in position and scale is sampled from the test image. These cropped regions are then resized to a fixed size of 288 x 288. The maximum iteration $max_{ii}$ for proposal generation is 200 for reference proposals and 20 for test proposals. The weights of the backbone network are frozen, and the other weights are initialized using [53]. The training splits are extracted from the official training set (protocol II) of LaSOT [54], the training set of GOT-10K [55], NfS [56], and the training set of VisDrone-2019 [1]. Note that the Small-112 dataset [15] is a subset of the training set of VisDrone-2019. Moreover, the validation splits of the VisDrone-2019 and GOT-10K datasets have been used in the training phase. To train the network end-to-end, the ADAM optimizer [57] is used with weight decay and a step-wise decaying learning rate.
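A small sketch of the patch extraction described above (a square crop with five times the target area, resized to 288 x 288) is given below using OpenCV; the border-replication padding for targets near the image boundary is an assumption of this sketch.

```python
import cv2
import numpy as np

def crop_search_region(image, box, area_factor=5.0, out_size=288):
    """Square crop centered on the target with `area_factor` times the target area, resized to a fixed size."""
    x, y, w, h = box
    cx, cy = x + w / 2.0, y + h / 2.0
    side = int(round(np.sqrt(area_factor * w * h)))   # square side whose area is 5x the target area
    x1, y1 = int(round(cx - side / 2.0)), int(round(cy - side / 2.0))
    # pad so the crop never leaves the image, then cut the square and resize
    pad = max(0, -x1, -y1, x1 + side - image.shape[1], y1 + side - image.shape[0])
    padded = cv2.copyMakeBorder(image, pad, pad, pad, pad, cv2.BORDER_REPLICATE)
    crop = padded[y1 + pad : y1 + pad + side, x1 + pad : x1 + pad + side]
    return cv2.resize(crop, (out_size, out_size))
```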
4.2 Ablation Analysis

A systematic ablation study of the individual components of the proposed tracker has been conducted on the UAVDT dataset [14] (see Table 1). It includes three different versions of the proposed network: the networks without 1) the CLE-head, 2) the CLE-head and reference proposal generation, and 3) the CLE-head, reference proposal generation, and attention module, referred to as A1, A2, and A3, respectively. Moreover, two other feature fusion operations have been investigated, namely feature multiplication (A4) and feature concatenation (A5), compared to the element-wise addition of feature maps in the MSAF module (see Fig. 3).

Table 1: Ablation analysis of COMET regarding different components and feature fusion operations on the UAVDT dataset.

Metric      COMET   A1     A2     A3     A4     A5
Precision   88.7    87.2   85.2   83.6   88     85.3
Success     81      78     76.9   73.5   80.4   77.2
These experiments demonstrate the effectiveness of each component on tracking performance: the full method achieves 88.7% and 81% in terms of precision and success rates, respectively. According to these results, the attention module, reference proposal generation strategy, and CLE-head each improve the average of the success and precision rates. Furthermore, the element-wise addition of feature maps improves the results by 0.65% and 3.6% on average compared to feature multiplication (A4) and feature concatenation (A5), respectively; the benefit of feature addition has also been shown previously in other methods such as [25].
4.3 State-of-the-Art Comparison

For quantitative comparison, the proposed method (COMET) is compared with state-of-the-art visual tracking methods, including AutoTrack [24], ATOM [25], DiMP-50 [38], SiamRPN++ [36], SiamMask [35], DaSiamRPN [32], CREST [58], MDNet [59], PTAV [60], ECO [31], and MCPF [61], on aerial tracking datasets. Fig. 4 shows the achieved results in terms of precision and success plots [10]. According to these results, COMET outperforms the top-performing visual trackers on the three available challenging small object tracking datasets as well as on the UAV-123 dataset. For instance, COMET outperforms the SiamRPN++ and DiMP-50 trackers by 4.4% and 3.2% in terms of the average precision metric, and by 3.3% and 3% in terms of the average success metric over all datasets, respectively. Compared to the baseline ATOM tracker, COMET improves the average precision rate by about 10% on UAVDT and achieves consistent precision and success gains on the VisDrone-2019-test-dev and Small-90 datasets. Although COMET only slightly outperforms the ATOM tracker on UAV-123 (see Fig. 1), it achieves up to 6.2% and 7% improvements over ATOM in terms of the average precision and success metrics on the small object tracking datasets.
Fig. 4: Overall precision and success comparisons of the proposed method (COMET) with state-of-the-art tracking methods on the UAVDT, VisDrone-2019-test-dev, Small-90, and UAV-123 datasets.
Table 2: Average speed of state-of-the-art trackers on the UAVDT dataset.

              COMET   ATOM   SiamRPN++   DiMP-50   SiamMask   ECO
Speed (FPS)   24      30     32          33        42         35
These results are mainly owed to the proposed proposal generation strategy and the effective modules, which make the network focus on relevant information about the target (and its parts) and on context information. Furthermore, COMET runs at 24 frames per second (FPS), while the average speeds of the other trackers on the same machine are indicated in Table 2. This satisfactory speed originates from considering different proposal generation strategies for the offline and online procedures and from employing lightweight modules in the proposed architecture.

COMET has also been evaluated according to various attributes of small object tracking scenarios to investigate its strengths and weaknesses. Table 3 and Table 4 present the attribute-based comparison of visual trackers on the UAVDT and VisDrone datasets in terms of the average precision and AUC metrics; the overall AUC of the trackers is also shown in Table 4. The comparisons cover various attributes, including background clutter (BC), illumination variation (IV), scale variation (SV), camera motion (CM), object motion (OM), small object (SO), object blur (OB), large occlusion (LO), long-term tracking (LT), aspect ratio change (ARC), fast motion (FM), partial occlusion (POC), full occlusion (FOC), low resolution (LR), out-of-view (OV), similar objects (SOB), and viewpoint change (VC).
Table 3: Attribute-based comparison of state-of-the-art visual tracking methods in terms of the precision metric on the UAVDT dataset [the first and second best visual tracking methods are highlighted in color in the original].

Tracker      BC     CM     OM     SO     IV     OB     SV     LO     LT
COMET        83.8   86.1   90.6   90.9   88.5   87.7   90.2   79.6   96
ATOM         70.1   77.2   73.4   80.6   80.8   74.9   73     66     91.7
AutoTrack    61.1   66.5   63.1   80.9   76.8   73.2   63.6   52.1   87.8
SiamRPN++    74.9   75.9   80.4   83.5   89.7   89.4   80.1   66.6   84.9
SiamMask     71.6   76.7   77.8   86.7   86.4   86     77.3   60.1   93.8
DiMP-50      71.1   80.3   75.8   81.4   84.3   79     76.1   68.6   100
ECO          61.1   64.4   62.7   79.1   76.9   71     63.2   50.8   95.2
MDNet        63.6   69.6   66.8   78.4   76.4   72.4   68.5   54.7   93
CREST        56.2   62.1   55.8   74.2   69     65.6   56.7   49.7   71.2
PTAV         57.2   63.9   56.4   79.1   69.6   66.2   56.5   50.3   80.1
MCPF         51.2   59.2   55.3   74.5   73.1   73     55.1   42.5   74.1
Table 4: Attribute-based comparison of state-of-the-art visual tracking methods in terms of the AUC metric on the VisDrone-2019-test-dev dataset [the first and second best visual tracking methods are highlighted in color in the original].

Tracker     Overall  ARC    BC     CM     FM     FOC    IV     LR     OV     POC    SOB    SV     VC
COMET       64.5     64.2   43.4   62.6   64.9   56.7   65.5   41.8   75.9   62.1   42.8   65.8   70.4
ATOM        57.1     52.3   36.7   56.4   52.3   48.8   63.3   31.2   63     51.9   35.6   55.4   61.3
SiamRPN++   59.9     58.9   41.2   58.7   61.8   55.1   63.5   36.4   69.3   58.8   39.6   59.9   67.8
DiMP-50     60.8     54.5   40.6   60.6   62     55.8   63.6   32.7   62.4   56.8   39.8   59.7   66
SiamMask    58.1     57.8   38.5   57.2   60.8   49     56.6   46.5   67.5   52.9   37     59.4   65.1
ECO         55.9     56.5   38.3   54.2   52.1   46.8   59.9   36.6   61.5   51.6   38.6   52.7   62.4
Fig. 5: Qualitative comparison of the proposed COMET tracker with state-of-the-art tracking methods (SiamMask [35], ATOM [25], DiMP-50 [38], and SiamRPN++ [36]) on the S1202, S0602, and S0801 video sequences from the UAVDT dataset (top to bottom rows, respectively).

The results demonstrate the considerable performance of COMET compared to the state-of-the-art trackers. These tables also show that COMET successfully handles the occlusion and viewpoint change problems for small object tracking purposes.
Compared to DiMP-50, SiamRPN++, and SiamMask, COMET achieves improvements of up to 9.5% for the small object (SO) attribute, and of 4.4%, 2.6%, and 5.3% for the viewpoint change (VC) attribute, respectively. Meanwhile, it gains up to 12.1% improvement on average for the occlusion attributes (i.e., the average of FOC, POC, and LO) compared to these trackers. While the performance can still be improved for the IV, OB, LR, and LT attributes, COMET outperforms ATOM by margins of 7.7%, 12.8%, 10.6%, and 4.3% on these attributes, respectively.

The qualitative comparisons of the visual trackers are shown in Fig. 5, in which the videos have been selected for clarity. Further qualitative evaluations on various datasets and even on YouTube videos are provided in the supplementary material. According to the first row of Fig. 5, COMET successfully models small objects on-the-fly in complicated aerial view scenarios. Also, it provides promising results when the aspect ratio of the target changes significantly. Examples of out-of-view and occlusion are shown in the next rows of Fig. 5. By considering target parts and context information, COMET properly handles these problems despite the presence of potential distractors.
5 Conclusion

A context-aware IoU-guided tracker has been proposed that includes an offline reference proposal generation strategy and a two-stream multitask network. It aims to track small objects in videos captured from medium- and high-altitude aerial views. First, the introduced proposal generation strategy provides context information for the proposed network to learn the target and its parts. This strategy effectively helps the network to handle occlusion and viewpoint change in high-density videos with a broad view angle in which only some parts of the target are visible. Moreover, the proposed network exploits multi-scale feature aggregation and attention modules to learn multi-scale features and suppress visual distractors. Finally, the proposed multitask loss function accurately estimates the target region by maximizing the IoU and minimizing the CLE between the predicted box and the object. Experimental results on four state-of-the-art aerial view tracking datasets and the remarkable performance of the proposed tracker demonstrate the motivation and effectiveness of the proposed components for small object tracking purposes.
References
1. Du, D., Zhu, P., Wen, L., Bian, X., Ling, H., et al.: VisDrone-SOT2019: The vision meets drone single object tracking challenge results. In: Proc. ICCVW. (2019)
2. Du, D., Qi, Y., Yu, H., Yang, Y., Duan, K., Li, G., Zhang, W., Huang, Q., Tian, Q.: The unmanned aerial vehicle benchmark: Object detection and tracking. In: Proc. ECCV. (2018) 375-391
3. Marvasti-Zadeh, S.M., Cheng, L., Ghanei-Yakhdan, H., Kasaei, S.: Deep learning for visual tracking: A comprehensive survey (2019)
4. Bonatti, R., Ho, C., Wang, W., Choudhury, S., Scherer, S.: Towards a robust aerial cinematography platform: Localizing and tracking moving targets in unstructured environments. In: Proc. IROS. (2019) 229-236
5. Zhang, H., Wang, G., Lei, Z., Hwang, J.: Eye in the sky: Drone-based object tracking and 3D localization. In: Proc. Multimedia. (2019) 899-907
6. Du, D., Zhu, P., et al.: VisDrone-SOT2019: The vision meets drone single object tracking challenge results. In: Proc. ICCVW. (2019)
7. Zhu, P., Wen, L., Du, D., Bian, X., Hu, Q., Ling, H.: Vision meets drones: Past, present and future (2020)
8. Mueller, M., Smith, N., Ghanem, B.: A benchmark and simulator for UAV tracking. In: Proc. ECCV. (2016) 445-461
9. Wu, Y., Lim, J., Yang, M.H.: Online object tracking: A benchmark. In: Proc. IEEE CVPR. (2013) 2411-2418
10. Wu, Y., Lim, J., Yang, M.: Object tracking benchmark. IEEE Trans. Pattern Anal. Mach. Intell. (2015) 1834-1848
11. Kristan, M., Leonardis, A., Matas, J., Felsberg, M., Pflugfelder, R., et al.: The sixth visual object tracking VOT2018 challenge results. In: Proc. ECCVW. (2019) 3-53
12. Kristan, M., et al.: The seventh visual object tracking VOT2019 challenge results. In: Proc. ICCVW. (2019)
13. Zhu, P., Wen, L., Du, D., et al.: VisDrone-VDT2018: The vision meets drone video detection and tracking challenge results. In: Proc. ECCVW. (2018) 496-518
14. Yu, H., Li, G., Zhang, W., et al.: The unmanned aerial vehicle benchmark: Object detection, tracking and baseline. Int. J. Comput. Vis. (2019)
15. Liu, C., Ding, W., Yang, J., et al.: Aggregation signature for small object tracking. IEEE Trans. Image Process. (2020) 1738-1747
16. Liang, P., Blasch, E., Ling, H.: Encoding color information for visual tracking: Algorithms and benchmark. IEEE Trans. Image Process. (2015) 5630-5644
17. Tong, K., Wu, Y., Zhou, F.: Recent advances in small object detection based on deep learning: A review. Image and Vision Computing (2020)
18. LaLonde, R., Zhang, D., Shah, M.: ClusterNet: Detecting small objects in large scenes by exploiting spatio-temporal information. In: Proc. CVPR. (2018)
19. Bai, Y., Zhang, Y., Ding, M., Ghanem, B.: SOD-MTGAN: Small object detection via multi-task generative adversarial network. In: Proc. ECCV. (2018)
20. Huang, Z., Fu, C., Li, Y., Lin, F., Lu, P.: Learning aberrance repressed correlation filters for real-time UAV tracking. In: Proc. IEEE ICCV. (2019) 2891-2900
21. Fu, C., Huang, Z., Li, Y., Duan, R., Lu, P.: Boundary effect-aware visual tracking for UAV with online enhanced background learning and multi-frame consensus verification. In: Proc. IROS. (2019) 4415-4422
22. Li, F., Fu, C., Lin, F., Li, Y., Lu, P.: Training-set distillation for real-time UAV object tracking. In: Proc. ICRA. (2020) 1-7
23. Li, Y., Fu, C., Huang, Z., Zhang, Y., Pan, J.: Keyfilter-aware real-time UAV object tracking. In: Proc. ICRA. (2020)
24. Li, Y., Fu, C., Ding, F., Huang, Z., Lu, G.: AutoTrack: Towards high-performance visual tracking for UAV with automatic spatio-temporal regularization. In: Proc. IEEE CVPR. (2020)
25. Danelljan, M., Bhat, G., Khan, F.S., Felsberg, M.: ATOM: Accurate tracking by overlap maximization. In: Proc. CVPR. (2019)
26. Held, D., Thrun, S., Savarese, S.: Learning to track at 100 FPS with deep regression networks. In: Proc. ECCV. (2016) 749-765
27. Bertinetto, L., Valmadre, J., Henriques, J.F., Vedaldi, A., Torr, P.H.: Fully-convolutional Siamese networks for object tracking. In: Proc. ECCV. (2016) 850-865
28. Valmadre, J., Bertinetto, L., Henriques, J., Vedaldi, A., Torr, P.H.: End-to-end representation learning for correlation filter based tracking. In: Proc. IEEE CVPR. (2017) 5000-5008
29. Dong, X., Shen, J.: Triplet loss in Siamese network for object tracking. In: Proc. ECCV. (2018) 472-488
30. Danelljan, M., Robinson, A., Khan, F.S., Felsberg, M.: Beyond correlation filters: Learning continuous convolution operators for visual tracking. In: Proc. ECCV. (2016) 472-488
31. Danelljan, M., Bhat, G., Shahbaz Khan, F., Felsberg, M.: ECO: Efficient convolution operators for tracking. In: Proc. IEEE CVPR. (2017) 6931-6939
32. Zhu, Z., Wang, Q., Li, B., Wu, W., Yan, J., Hu, W.: Distractor-aware Siamese networks for visual object tracking. In: Proc. ECCV. (2018) 103-119
33. Zhang, Z., Peng, H.: Deeper and wider Siamese networks for real-time visual tracking (2019)
34. Fan, H., Ling, H.: Siamese cascaded region proposal networks for real-time visual tracking (2018)
35. Wang, Q., Zhang, L., Bertinetto, L., Hu, W., Torr, P.H.: Fast online object tracking and segmentation: A unifying approach. In: Proc. IEEE CVPR. (2019)
36. Li, B., Wu, W., Wang, Q., Zhang, F., Xing, J., Yan, J.: SiamRPN++: Evolution of Siamese visual tracking with very deep networks. In: Proc. IEEE CVPR. (2019)
37. Li, B., Yan, J., Wu, W., Zhu, Z., Hu, X.: High performance visual tracking with Siamese region proposal network. In: Proc. IEEE CVPR. (2018) 8971-8980
38. Bhat, G., Danelljan, M., Gool, L.V., Timofte, R.: Learning discriminative model prediction for tracking. In: Proc. IEEE ICCV. (2019)
39. Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C., Berg, A.C.: SSD: Single shot multibox detector. In: Proc. ECCV. (2016) 21-37
40. Fu, C., Liu, W., Ranga, A., Tyagi, A., Berg, A.: DSSD: Deconvolutional single shot detector (2017)
41. Cui, L., Ma, R., Lv, P., Jiang, X., Gao, Z., Zhou, B., Xu, M.: MDSSD: Multi-scale deconvolutional single shot detector for small objects (2018)
42. Lim, J., Astrid, M., Yoon, H., Lee, S.: Small object detection using context and attention (2019)
43. Yang, X., Yang, J., Yan, J., Zhang, Y., Zhang, T., Guo, Z., Sun, X., Fu, K.: SCRDet: Towards more robust detection for small, cluttered and rotated objects. In: Proc. IEEE ICCV. (2019)
44. Redmon, J., Farhadi, A.: YOLOv3: An incremental improvement (2018)
45. Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proc. IEEE CVPR. (2014) 580-587
46. Jiang, B., Luo, R., Mao, J., Xiao, T., Jiang, Y.: Acquisition of localization confidence for accurate object detection. In: Proc. IEEE ECCV. (2018) 816-832
47. Lim, J.S., Astrid, M., Yoon, H.J., Lee, S.I.: Small object detection using context and attention (2019)
48. Park, J., Woo, S., Lee, J.Y., Kweon, I.S.: BAM: Bottleneck attention module. In: Proc. BMVC. (2018) 147-161
49. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proc. IEEE CVPR. (2016) 770-778
50. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the inception architecture for computer vision. In: Proc. CVPR. (2016) 2818-2826
51. Hu, J., Shen, L., Albanie, S., Sun, G., Wu, E.: Squeeze-and-excitation networks. IEEE Trans. Pattern Anal. Mach. Intell. (2019)
52. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L.: ImageNet large scale visual recognition challenge. IJCV (2015) 211-252
53. He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In: Proc. ICCV. (2015) 1026-1034
54. Fan, H., Lin, L., Yang, F., Chu, P., Deng, G., Yu, S., Bai, H., Xu, Y., Liao, C., Ling, H.: LaSOT: A high-quality benchmark for large-scale single object tracking. In: Proc. IEEE CVPR. (2019)
55. Huang, L., Zhao, X., Huang, K.: GOT-10k: A large high-diversity benchmark for generic object tracking in the wild. IEEE Trans. Pattern Anal. Mach. Intell. (2019)
56. Galoogahi, H.K., Fagg, A., Huang, C., Ramanan, D., Lucey, S.: Need for speed: A benchmark for higher frame rate object tracking. In: Proc. IEEE ICCV. (2017) 1134-1143
57. Kingma, D.P., Ba, J.: ADAM: A method for stochastic optimization. In: Proc. ICLR. (2014)
58. Song, Y., Ma, C., Gong, L., Zhang, J., Lau, R.W., Yang, M.H.: CREST: Convolutional residual learning for visual tracking. In: Proc. ICCV. (2017) 2574-2583
59. Nam, H., Han, B.: Learning multi-domain convolutional neural networks for visual tracking. In: Proc. IEEE CVPR. (2016) 4293-4302
60. Fan, H., Ling, H.: Parallel tracking and verifying. IEEE Trans. Image Process. 28 (2019)
61. Zhang, T., Xu, C., Yang, M.H.: Multi-task correlation particle filter for robust object tracking. In: Proc. IEEE CVPR. (2017)