COMET: Context-Aware IoU-Guided Network for Small Object Tracking
Seyed Mojtaba Marvasti-Zadeh, Javad Khaghani, Hossein Ghanei-Yakhdan, Shohreh Kasaei, Li Cheng
University of Alberta, Edmonton, Canada: {mojtaba.marvasti,khaghani,lcheng5}@ualberta.ca
Yazd University, Yazd, Iran: [email protected]
Sharif University of Technology, Tehran, Iran: [email protected]
(Seyed Mojtaba Marvasti-Zadeh and Javad Khaghani contributed equally.)
Abstract.
We consider the problem of tracking an unknown small target in aerial videos captured from medium to high altitudes. This is a challenging problem, which becomes even more pronounced in unavoidable scenarios of drastic camera motion and high density. To address this problem, we introduce a context-aware IoU-guided tracker (COMET) that exploits a multitask two-stream network and an offline reference proposal generation strategy. The proposed network fully exploits target-related information through multi-scale feature learning and attention modules. The proposed strategy introduces an efficient sampling scheme to generalize the network on the target and its parts without imposing extra computational complexity during online tracking. These strategies contribute considerably to handling significant occlusions and viewpoint changes. Empirically, COMET outperforms the state of the art on a range of aerial view datasets that focus on tracking small objects. Specifically, COMET outperforms the celebrated ATOM tracker by an average margin of 6.2% (and 7%) in terms of precision (and success) on small object tracking datasets.

1 Introduction

Aerial object tracking in real-world scenarios [1,2,3] aims to accurately localize a model-free target while robustly estimating a fitted bounding box on the target region. Given the wide variety of applications [4,5], vision-based methods for flying robots demand robust aerial visual trackers [6,7]. Generally speaking, aerial visual tracking can be categorized into videos captured from low altitudes and videos captured from medium/high altitudes.
Fig. 1: Examples comparing low-altitude and medium/high-altitude aerial tracking. The first row represents the size of most targets in the UAV-123 [8] dataset, which was captured from 10~30 meters. Examples of small object tracking scenarios in the UAVDT [2], VisDrone-2019 [1], and Small-90 [15] datasets are shown in the last two rows. UAV-123 contains mostly large/medium-sized objects, whereas targets in the small object datasets occupy just a few pixels of a frame. Note that Small-90 incorporates small object videos from several datasets such as UAV-123, OTB, and TC-128 [16]. The focus of this work is on small/tiny object tracking.
Low-altitude aerial scenarios (e.g., UAV-123 [8]) look at medium or large objects in surveillance videos with limited viewing angles (similar to traditional tracking scenarios such as OTB [9,10] or VOT [11,12]). However, tracking a target in aerial videos captured from medium (30~70 meters) and high (>70 meters) altitudes has recently introduced extra challenges, including tiny objects, dense cluttered backgrounds, weather conditions, wide aerial views, severe camera/object motion, drastic camera rotation, and significant viewpoint change [13,1,2,14]. In most cases, it is arduous even for humans to track tiny objects in the presence of a complex background because of the limited number of pixels on the objects. Fig. 1 compares the two main categories of aerial visual tracking. Most objects in the first category (captured from low-altitude aerial views, 10~30 meters) are medium/large-sized and provide sufficient information for appearance modeling.
The second category aims to track targets with few pixels in complicated scenarios.

Recent state-of-the-art trackers cannot provide satisfactory results for small object tracking since strategies to handle its challenges have not been considered. Besides, although various approaches have been proposed for small object detection [17,18,19], only a limited number of methods focus on aerial view tracking. These trackers [20,21,22,23,24] are based on discriminative correlation filters (DCF), which have inherent limitations (e.g., the boundary effect problem), and their performance is not competitive with modern trackers, despite extracting deep features. Moreover, they cannot handle aspect ratio change, even though it is a critical characteristic for aerial view tracking. Therefore, the proposed method aims to narrow the gap between modern visual trackers and aerial ones.

Tracking small objects involves major difficulties, including the lack of sufficient target information to distinguish the target from background or distractors, a much larger space of possible locations (i.e., an accurate localization requirement), and limited knowledge from previous efforts. Motivated by these issues and by recent advances in small object detection, this paper proposes a Context-aware iOu-guided network for sMall objEct Tracking (COMET).
It exploits a multitask two-stream network to process target-relevant information at various scales and to focus on salient areas via attention modules. Given a rough estimation of the target location by an online classification network [25], the proposed network simultaneously predicts the intersection-over-union (IoU) and the normalized center location error (CLE) between the estimated bounding boxes (BBs) and the target. Moreover, an effective proposal generation strategy is proposed, which helps the network to learn contextual information. Using this strategy, the proposed network effectively exploits the representations of a target and its parts. It also leads to a better generalization of the proposed network in handling occlusion and viewpoint change for small object tracking from medium- and high-altitude aerial views. The contributions of this paper are twofold.
1) Offline Proposal Generation Strategy:
In offline training, the proposed method generates a limited number of high-quality proposals from the reference frame. The proposed strategy provides context information and helps the network to learn the target and its parts. Therefore, it successfully handles large occlusions and viewpoint changes in challenging aerial scenarios. Furthermore, it is used only in offline training and imposes no extra computational complexity on online tracking.
2) Multitask Two-Stream Network:
COMET utilizes a multitask two-stream network to deal with the challenges of small object tracking. First, the network fuses aggregated multi-scale spatial features with semantic ones to provide rich features. Second, it utilizes lightweight spatial and channel attention modules to focus on the information most relevant for small object tracking. Third, the network optimizes a proposed multitask loss function to consider both accuracy and robustness.

Extensive experimental analyses are performed to compare the proposed tracker with state-of-the-art methods on well-known aerial view benchmarks, namely UAVDT [2], VisDrone-2019 [1], Small-90 [15], and UAV-123 [8]. The results demonstrate the effectiveness of COMET for small object tracking purposes.

The rest of the paper is organized as follows. In Section 2, an overview of related work is briefly outlined. In Section 3 and Section 4, our approach and empirical evaluation are presented. Finally, the conclusion is summarized in Section 5.
2 Related Work

In this section, focusing on two-stream neural networks, modern visual trackers are briefly described. Also, aerial visual trackers and some small object detection methods are summarized.
2.1 Two-Stream Networks for Visual Tracking

Two-stream networks (a generalized form of Siamese neural networks (SNNs)) for visual tracking gained interest with generic object tracking using regression networks (GOTURN) [26], which utilizes offline training of a network without any online fine-tuning during tracking. This idea was continued by fully-convolutional Siamese networks (SiamFC) [27], which defined visual tracking as a general similarity learning problem to address the limited labeled data issue. To exploit both the efficiency of the correlation filter (CF) and CNN features, CFNet [28] provides a closed-form solution for end-to-end training of a CF layer. The work of [29] applies a triplet loss on exemplar, positive instance, and negative instance to strengthen the feedback of back-propagation and provide powerful features. These methods could not achieve performance competitive with well-known DCF methods (e.g., [30,31]) since they are prone to drift problems; however, they run at beyond real-time speed.
As the baseline of the well-known Siamese trackers (e.g., [32,33,34,35,36]), the Siamese region proposal network (SiamRPN) [37] formulates generic object tracking as local one-shot learning with bounding box refinement.
Distractor-aware Siamese RPNs (DaSiamRPN) [32] exploit semantic backgrounds, distractor suppression, and a local-to-global search region to learn robust features and address occlusion and out-of-view. To design deeper and wider networks, SiamDW [33] investigated various units and backbone networks to take full advantage of state-of-the-art network architectures.
The Siamese cascaded RPN (CRPN) [34] consists of multiple RPNs that perform stage-by-stage classification and localization. SiamRPN++ [36] proposes a ResNet-driven Siamese tracker that not only exploits layer-wise and depth-wise aggregations but also uses a spatial-aware sampling strategy to successfully train a deeper network. The SiamMask tracker [35] benefits from bounding box refinement and class-agnostic binary segmentation to improve the estimated target region.

Although the mentioned trackers provide both desirable performance and computational efficiency, they mostly do not consider background information and suffer from poor generalization due to the lack of online training and update strategies. The ATOM tracker [25] performs classification and target estimation tasks with the aid of an online classifier and an offline IoU-predictor, respectively. First, it discriminates the target from its background, and then an IoU-predictor refines the generated proposals around the estimated location. Similarly, and based on a model prediction network, the DiMP tracker [38] learns a robust target model by employing a discriminative loss function and an iterative optimization strategy with few steps.

Despite considerable achievements on surveillance videos, the performance of modern trackers drops dramatically on videos captured from medium- and high-altitude aerial views; the main reason is the lack of strategies to deal with small object tracking challenges. For instance, the limited information of a tiny target, the dense distribution of distractors, or significant viewpoint changes lead to tracking failures of conventional trackers.
2.2 Small Object Detection and Aerial View Tracking

In this subsection, recent advances in small object detection and also aerial view trackers are briefly described. Various approaches have been proposed to overcome the shortcomings of small object detection [17].
For instance, the single shot multi-box detector (SSD) [39] uses low-level features for small object detection and high-level ones for larger objects.
The deconvolutional single shot detector (DSSD) [40] increases the resolution of feature maps using deconvolution layers to consider context information for small object detection.
The multi-scale deconvolutional single shot detector for small objects (MDSSD) [41] utilizes several multi-scale deconvolution fusion modules to enhance small object detection performance. Also, [42] utilizes multi-scale feature concatenation and attention mechanisms to enhance small object detection using context information. The SCRDet method [43] introduces SF-Net for feature fusion and MDA-Net for highlighting object information using attention modules. Furthermore, other well-known detectors (e.g., YOLOv3 [44]) exploit similar ideas, such as multi-scale feature pyramid networks, to alleviate their poor accuracy on small objects.

On the other hand, the development of specific methods for small object tracking from aerial views is still in progress, and there are limited algorithms addressing the existing challenges. Current trackers are based on discriminative correlation filters (DCFs), which provide satisfactory computational complexity but have intrinsic limitations such as the inability to handle aspect ratio changes of targets. For instance, the aberrance repressed correlation filter (ARCF) [20] proposes a cropping matrix and regularization terms to restrict the alteration rate of the response map. To tackle boundary effects and improve tracking robustness, the boundary effect-aware visual tracker (BEVT) [21] penalizes the object regarding its location, learns background information, and compares the scores of consecutive response maps. The keyfilter-aware tracker [23] learns context information and avoids filter corruption by generating key-filters and enforcing a temporal restriction. To improve the quality of the training set, the time slot-based distillation algorithm (TSD) [22] adaptively scores historical samples by a cooperative energy minimization function. It also accelerates this process by discarding low-score samples. Finally, AutoTrack [24] aims to learn a spatio-temporal regularization term automatically. It exploits local-global response variation to focus on trustworthy target parts and to determine its learning rate. The results of these trackers are not competitive with state-of-the-art visual trackers (e.g., Siamese-based trackers [36,35], ATOM [25], and DiMP [38]). Therefore, the proposed method aims to narrow the gap between modern visual trackers and aerial view tracking methods by exploiting advances in small object detection.
3 Our Approach

A key motivation of COMET is to solve the issues discussed in Sec. 1 and Sec. 2 by adapting small object detection schemes into the network architecture for tracking purposes. A graphical overview of the proposed offline training and online tracking is shown in Fig. 2. The proposed framework mainly consists of an offline proposal generation strategy and a two-stream multitask network, which comprises lightweight modules for small object tracking. The proposed proposal generation strategy helps the network to learn a generalized target model and to handle occlusion and viewpoint change with the aid of context information.
Fig. 2: Overview of the proposed method in the offline training and online tracking phases. (Both phases pass a reference frame and a target frame through ResNet-50 backbones (blocks 3 and 4) and the proposed two-stream network with IoU and CLE heads; offline training uses the proposed offline proposal generator on the ground truth, while online tracking uses the online classifier of ATOM [25] and an online proposal generator.)

This strategy is applied only during offline training of the network to avoid extra computational burden in online tracking. This section presents an overview of the proposed method and a detailed description of the main contributions.
3.1 Offline Proposal Generation Strategy

The eventual goal of proposal generation strategies is to provide a set of candidate detection regions, which are possible locations of objects. There are various category-dependent strategies for proposal generation [45,39,46]. For instance, IoU-Net [46] augments the ground truth instead of using region proposal networks (RPNs) to provide better performance and robustness to the network. Also, ATOM [25] uses a proposal generation strategy similar to [46] with a modulation vector to integrate target-specific information into its network.

Motivated by IoU-Net [46] and ATOM [25], an offline proposal generation strategy is proposed to extract context information of the target from the reference frame. The ATOM tracker generates $N$ target proposals from the test frame ($P_{t+\zeta}$), given the target location in that frame ($G_{t+\zeta}$). In offline training, the target proposals are produced by jittering the ground-truth locations, whereas in online tracking the estimated locations obtained by a simple two-layer classification network are jittered instead. The test proposals are generated subject to $IoU_{G_{t+\zeta}} \triangleq IoU(G_{t+\zeta}, P_{t+\zeta}) \geq T_1$. Then, a network is trained to predict the IoU values ($IoU_{pred}$) between $P_{t+\zeta}$ and the object, given the BB of the target in the reference frame ($G_t$). Finally, the network designed in ATOM minimizes the mean square error between $IoU_{G_{t+\zeta}}$ and $IoU_{pred}$.

In this work, the proposed strategy also provides target patches with background supporters from the reference frame (denoted as $P_t$) to address the challenging problems of small object tracking. Besides $G_t$, the proposed method exploits $P_t$ only in offline training. Using context features and target parts assists the proposed network (Sec. 3.2) in handling occlusion and viewpoint change problems for small objects. For simplicity, the proposed offline proposal generation strategy is described together with the IoU-prediction process; however, the proposed network predicts both the IoU and the center location error (CLE) of the test proposals with respect to the target, simultaneously.
Algorithm 1: Offline Proposal Generation
Notation: bounding box B ($G_{t+\zeta}$ for a test frame or $G_t$ for a reference frame); IoU threshold T ($T_1$ for a test frame or $T_2$ for a reference frame); number of proposals N (N for a test frame or (N/2 - 1) for a reference frame); maximum iteration ($max_{ii}$); a Gaussian distribution with zero mean ($\mu = 0$) and randomly selected variance $\Sigma_r$; bounding box proposals generated by Gaussian jittering P ($P_{t+\zeta}$ for a test frame or $P_t$ for a reference frame).
Input: B, T, N, $\Sigma_r$, $max_{ii}$
Output: P
for i = 1 : N do
    ii = 0
    do
        P[i] = B + $\mathcal{N}(\mu, \Sigma_r)$
        ii = ii + 1
    while (IoU(B, P[i]) < T) and (ii < $max_{ii}$)
end
return P

An overview of the offline proposal generation process for IoU-prediction is shown in Algorithm 1. The proposed strategy generates (N/2 - 1) reference proposals $P_t$ that satisfy $IoU_{G_t} \triangleq IoU(G_t, P_t) \geq T_2$. Note that it considers $T_2 > T_1$ to prevent drift toward visual distractors. The proposed tracker exploits this information (especially in challenging scenarios involving occlusion and viewpoint change) to avoid confusion during target tracking. $P_t$ and $G_t$ are passed through the reference branch of the proposed network simultaneously (Sec. 3.2). In this work, an extended modulation vector is introduced to provide the representations of the target and its parts to the network; it is a set of modulation vectors in which each vector encodes the information of one reference proposal. To compute the IoU-prediction, the features of the test patch are modulated by the features of the target and its parts, i.e., the IoU-prediction of the N test proposals is computed per reference proposal. Thus, there are N/2 groups of N IoU-predictions; the extended modulation vector allows all N/2 groups of N IoU-predictions to be computed at once. Therefore, the network predicts N/2 groups of $IoU_{G_{t+\zeta}}$ values. During online tracking, COMET does not generate $P_t$ and uses only $G_t$ to predict a single group of IoU-predictions for the generated $P_{t+\zeta}$. Therefore, the proposed strategy imposes no extra computational complexity in online tracking.
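For concreteness, the following is a minimal sketch of the jittering loop in Algorithm 1, assuming axis-aligned boxes in (x, y, w, h) format; the iou() helper, the variance set, and the threshold value in the example call are illustrative placeholders rather than the exact settings used by COMET.

```python
import random
import numpy as np

def iou(a, b):
    """IoU of two (x, y, w, h) boxes; a small helper assumed for this sketch."""
    ax2, ay2 = a[0] + a[2], a[1] + a[3]
    bx2, by2 = b[0] + b[2], b[1] + b[3]
    iw = max(0.0, min(ax2, bx2) - max(a[0], b[0]))
    ih = max(0.0, min(ay2, by2) - max(a[1], b[1]))
    inter = iw * ih
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def generate_proposals(box, num, iou_thresh, sigmas, max_iter=200):
    """Gaussian-jitter `box` until each proposal overlaps it by at least `iou_thresh` (Algorithm 1)."""
    proposals = []
    for _ in range(num):
        it = 0
        while True:
            sigma = random.choice(sigmas)                       # randomly selected variance (Sigma_r)
            cand = box + np.random.normal(0.0, sigma, size=4)   # zero-mean jitter of (x, y, w, h)
            it += 1
            if iou(box, cand) >= iou_thresh or it >= max_iter:  # accept or give up after max_iter tries
                break
        proposals.append(cand)
    return np.stack(proposals)

# Offline training example: N/2 - 1 = 7 reference proposals around a hypothetical small-target box.
# The 0.8 threshold and the sigma values are placeholders standing in for T2 and Sigma_r.
gt_ref = np.array([50.0, 40.0, 24.0, 18.0])
ref_proposals = generate_proposals(gt_ref, num=7, iou_thresh=0.8, sigmas=[2.0, 5.0])
```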
3.2 Multitask Two-Stream Network

Tracking small objects from aerial views involves extra difficulties, such as the low clarity of the target appearance, fast viewpoint changes, and drastic rotations, besides the existing tracking challenges. This part aims to design an architecture that handles the problems of small object tracking by considering recent advances in small object detection. Inspired by [46,25,43,47,48], a two-stream network is proposed (see Fig. 3), which consists of multi-scale processing and aggregation of features, fusion of hierarchical information, a spatial attention module, and a channel attention module. Moreover, the proposed network seeks to maximize the IoU between the estimated BBs and the object while minimizing their location distance. Hence, it exploits a multitask loss function, which is optimized to consider both the accuracy and robustness of the estimated BBs. In the following, the proposed architecture and the role of its main components are described.

Fig. 3: Overview of the proposed two-stream network. MSAF denotes the multi-scale aggregation and fusion module, which utilizes the InceptionV3 module in its top branch. (The reference and test branches use ResNet-50 block-3 and block-4 features; PrRoI pooling layers of size 3 x 3 (reference) and 5 x 5 (test), the extended modulation vector, and fully-connected layers produce the IoU and CLE predictions.)
The proposed network adopts ResNet-50 [49] to provide backbone features for the reference and test branches. Following small object detection methods, only the features from block 3 and block 4 of ResNet-50 are extracted, to exploit both spatial and semantic features while controlling the number of parameters [17,43]. Then, the proposed network employs a multi-scale aggregation and fusion module (MSAF). It processes the spatial information via the InceptionV3 module [50] to perform factorized asymmetric convolutions on target regions. This low-cost multi-scale processing helps the network to approximate optimal filters that are proper for small object tracking. Also, the semantic features are passed through convolution and deconvolution layers to be refined and resized for feature fusion. The resulting hierarchical information is fused by an element-wise addition of the spatial and semantic feature maps. After feature fusion, the number of channels is reduced by 1 x 1 convolutions. Note that a small target may occupy only about 0.01% of the pixels of a frame.
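The MSAF idea can be sketched in PyTorch as follows; the channel widths, the reduced two-branch Inception-style block, and the deconvolution configuration are simplifying assumptions for illustration, not the exact layers of COMET.

```python
import torch
import torch.nn as nn

class MSAF(nn.Module):
    """Simplified multi-scale aggregation and fusion: block-3 (spatial) + upsampled block-4 (semantic)."""
    def __init__(self, c3=1024, c4=2048, c_out=256):  # ResNet-50 block widths; c_out is a placeholder
        super().__init__()
        # Inception-style factorized asymmetric convolutions on the spatial (block-3) branch
        self.branch1 = nn.Conv2d(c3, c_out, kernel_size=1)
        self.branch3 = nn.Sequential(
            nn.Conv2d(c3, c_out, kernel_size=1),
            nn.Conv2d(c_out, c_out, kernel_size=(1, 3), padding=(0, 1)),
            nn.Conv2d(c_out, c_out, kernel_size=(3, 1), padding=(1, 0)))
        self.spatial_reduce = nn.Conv2d(2 * c_out, c_out, kernel_size=1)
        # Semantic (block-4) branch: refine, then deconvolve to the block-3 resolution
        self.semantic = nn.Sequential(
            nn.Conv2d(c4, c_out, kernel_size=3, padding=1),
            nn.ConvTranspose2d(c_out, c_out, kernel_size=2, stride=2))
        self.reduce = nn.Conv2d(c_out, c_out, kernel_size=1)  # post-fusion 1x1 channel reduction

    def forward(self, f3, f4):
        spatial = self.spatial_reduce(torch.cat([self.branch1(f3), self.branch3(f3)], dim=1))
        semantic = self.semantic(f4)              # upsampled to match f3 spatially
        return self.reduce(spatial + semantic)    # element-wise addition, then 1x1 conv

# f3: block-3 map (stride 16), f4: block-4 map (stride 32) of a 288x288 crop
f3, f4 = torch.randn(1, 1024, 18, 18), torch.randn(1, 2048, 9, 9)
fused = MSAF()(f3, f4)   # -> (1, 256, 18, 18)
```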
Next, the proposed network utilizes the bottleneck attention module (BAM) [48], which has a lightweight and simple architecture. It emphasizes target-related spatial and channel information and suppresses distractors and redundant information, which are common in aerial images [43]. The BAM includes a channel attention branch, a spatial attention branch, and an identity shortcut connection. In this work, SENet [51] is employed as the channel attention branch, which uses global average pooling (GAP) and a multi-layer perceptron to find the optimal combination of channels. The spatial attention branch utilizes dilated convolutions to increase the receptive field, which helps the network to consider context information for small object tracking. The spatial and channel attention modules answer where the critical features are located and what the relevant features are, respectively. Lastly, the identity shortcut connection helps with better gradient flow.
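A compact sketch of a BAM-style attention block [48] is given below, assuming an SE-like channel branch and a dilated-convolution spatial branch combined through a sigmoid gate and an identity shortcut; the reduction ratio and dilation value are placeholders.

```python
import torch
import torch.nn as nn

class BAMAttention(nn.Module):
    """BAM-style bottleneck attention: SE channel branch + dilated-conv spatial branch + identity shortcut."""
    def __init__(self, channels=256, reduction=16, dilation=4):
        super().__init__()
        self.channel = nn.Sequential(                       # SENet-style channel attention (GAP + MLP)
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1))
        self.spatial = nn.Sequential(                       # dilated convolutions enlarge the receptive field
            nn.Conv2d(channels, channels // reduction, 1),
            nn.Conv2d(channels // reduction, channels // reduction, 3, padding=dilation, dilation=dilation),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels // reduction, 3, padding=dilation, dilation=dilation),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, 1, 1))

    def forward(self, x):
        attn = torch.sigmoid(self.channel(x) + self.spatial(x))  # broadcast sum of channel and spatial maps
        return x + x * attn                                      # identity shortcut keeps gradient flow

refined = BAMAttention()(torch.randn(1, 256, 18, 18))
```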
After that, the proposed method generates proposals from the test frame. Also, in offline training, it uses the proposed proposal generation strategy to extract the BBs of the target and its parts from the reference frame. These generated BBs are applied to the resulting feature maps and fed into a precise region-of-interest (PrRoI) pooling layer [46], which is differentiable w.r.t. the BB coordinates. In the reference branch, a convolutional layer with a 3 x 3 kernel follows a 3 x 3 PrRoI pooling to produce the extended modulation vector, whereas the test proposals ($P_{t+\zeta}$) are applied to the features of the test branch and fed to a 5 x 5 PrRoI pooling; the modulated test features are then passed through fully-connected layers to predict the IoU and CLE of each proposal (see Fig. 3). The network is trained with the multitask loss $\mathcal{L}_{Net} = \mathcal{L}_{IoU} + \lambda\,\mathcal{L}_{CLE}$, where $\mathcal{L}_{IoU}$, $\mathcal{L}_{CLE}$, and $\lambda$ represent the loss function for the IoU-prediction head, the loss function for the CLE-prediction head, and a balancing hyper-parameter, respectively. Denoting the $i$-th IoU- and CLE-prediction values as $IoU^{(i)}$ and $CLE^{(i)}$, the loss functions are defined as

$$\mathcal{L}_{IoU} = \frac{1}{N}\sum_{i=1}^{N}\Big(IoU^{(i)}_{G_{t+\zeta}} - IoU^{(i)}_{pred}\Big)^{2}, \qquad (1)$$

$$\mathcal{L}_{CLE} = \frac{1}{N}\sum_{i=1}^{N}
\begin{cases}
\frac{1}{2}\Big(CLE^{(i)}_{G_{t+\zeta}} - CLE^{(i)}_{pred}\Big)^{2} & \text{if } \big|CLE^{(i)}_{G_{t+\zeta}} - CLE^{(i)}_{pred}\big| < 1,\\
\big|CLE^{(i)}_{G_{t+\zeta}} - CLE^{(i)}_{pred}\big| - \frac{1}{2} & \text{otherwise},
\end{cases} \qquad (2)$$

where $CLE_{G_{t+\zeta}} = (\Delta x_{G_{t+\zeta}}/width_{G_{t+\zeta}},\ \Delta y_{G_{t+\zeta}}/height_{G_{t+\zeta}})$ is the normalized distance between the center locations of $P_{t+\zeta}$ and $G_{t+\zeta}$; for example, $\Delta x_{G_{t+\zeta}}$ is calculated as $x_{G_{t+\zeta}} - x_{P_{t+\zeta}}$. Also, $CLE_{pred}$ (and $IoU_{pred}$) represents the predicted CLE (and the predicted IoU) between the BB estimations and the target, given an initial BB in the reference frame.
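Assuming per-proposal scalar IoU targets and normalized two-dimensional CLE targets, the multitask objective of Eqs. (1)-(2) can be written compactly with standard losses; λ = 4 follows the implementation details reported later, while the tensor shapes below are illustrative.

```python
import torch
import torch.nn.functional as F

def comet_loss(iou_pred, iou_gt, cle_pred, cle_gt, lam=4.0):
    """Multitask loss: mean-squared IoU error plus a lambda-weighted smooth-L1 center-error term."""
    loss_iou = F.mse_loss(iou_pred, iou_gt)         # Eq. (1), mean-reduced over proposals
    loss_cle = F.smooth_l1_loss(cle_pred, cle_gt)   # Eq. (2), Huber with threshold 1
    return loss_iou + lam * loss_cle

# Hypothetical batch of N = 16 proposals: predicted/target IoU and normalized (dx/w, dy/h) offsets
iou_pred, iou_gt = torch.rand(16), torch.rand(16)
cle_pred, cle_gt = torch.randn(16, 2), torch.randn(16, 2)
loss = comet_loss(iou_pred, iou_gt, cle_pred, cle_gt)
```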
In offline training, the proposed network optimizes this loss function to learn how to predict the target BB from the pattern of proposal generation.

In online tracking, the target BB from the first frame (similar to [37,35,36,25]) and the target proposals in the test frame are passed through the network. As a result, there is just one group of CLE-predictions as well as IoU-predictions, avoiding additional computational complexity. In this phase, the aim is to maximize the IoU-prediction of the test proposals using a gradient ascent algorithm and to minimize their CLE-prediction using a gradient descent algorithm. Algorithm 2 describes the process of online tracking in detail. It shows how the inputs are passed through the network and how the BB coordinates are updated based on scaled back-propagated gradients. While the IoU-gradients are scaled up with the BB sizes to optimize in a log-scaled domain (similar to [46]), only the x and y coordinates of the test BBs are scaled up for the CLE-gradients. This experimentally achieved better results compared to applying the IoU-gradient scaling to the CLE-gradients. The intuitive reason is that the network has learned the normalized location differences between the BB estimations and the target BB; that is, the CLE-prediction is responsible for accurate localization, whereas the IoU-prediction determines the BB aspect ratio. After refining the test proposals (N = 10 in the online phase) for n = 5 iterations, the proposed method selects the K = 3 best BBs based on their IoU-scores and uses the average of these predictions as the final target BB.

4 Empirical Evaluation

In this section, the proposed method is evaluated on state-of-the-art benchmarks for small object tracking from aerial views: VisDrone-2019-test-dev (35 videos) [1], UAVDT (50 videos) [2], and Small-90 (90 videos) [15]. Although the Small-90 dataset includes the challenging videos of the UAV-123 dataset with small objects, experimental results on the UAV-123 [8] dataset (a low-altitude UAV dataset, 10~30 meters) are also presented.
Generally speaking, the UAV-123 dataset lacks variety in small objects, camera motions, and real scenes [14]. Moreover, traditional tracking datasets do not contain challenges such as tiny objects, significant viewpoint changes, camera motion, and high density from aerial views. For instance, these datasets (e.g., OTB [10], VOT [11,12], etc.) mostly provide videos captured by fixed or moving car-based cameras with limited viewing angles. For these reasons, and given our focus on tracking small objects in videos captured from medium and high altitudes, the proposed tracker (COMET) is evaluated on the related benchmarks to demonstrate its motivation and effectiveness for small object tracking.

Experiments have been conducted three times, and the average results are reported. The proposed method is compared with state-of-the-art visual trackers in terms of precision, success, and normalized area-under-curve (AUC) metrics [10]. Note that all results have been produced by the standard benchmarks with default thresholds (i.e., 20 pixels for the precision scores and 0.5 for the success scores) for fair comparisons. In the following, implementation details, ablation analyses, and state-of-the-art comparisons are presented.
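As a reference for how the reported numbers are obtained, the per-frame metrics mentioned above (20-pixel precision and 0.5-IoU success) can be computed as in the small NumPy sketch below; it is a generic OTB-style implementation [10], not code from the benchmark toolkits themselves.

```python
import numpy as np

def precision_success(pred_boxes, gt_boxes, prec_thresh=20.0, succ_thresh=0.5):
    """Fraction of frames with center error <= 20 px and with IoU >= 0.5, for (x, y, w, h) boxes."""
    pred_boxes, gt_boxes = np.asarray(pred_boxes, float), np.asarray(gt_boxes, float)
    # center location error in pixels
    pc = pred_boxes[:, :2] + pred_boxes[:, 2:] / 2.0
    gc = gt_boxes[:, :2] + gt_boxes[:, 2:] / 2.0
    cle = np.linalg.norm(pc - gc, axis=1)
    # IoU between prediction and ground truth
    x1 = np.maximum(pred_boxes[:, 0], gt_boxes[:, 0])
    y1 = np.maximum(pred_boxes[:, 1], gt_boxes[:, 1])
    x2 = np.minimum(pred_boxes[:, 0] + pred_boxes[:, 2], gt_boxes[:, 0] + gt_boxes[:, 2])
    y2 = np.minimum(pred_boxes[:, 1] + pred_boxes[:, 3], gt_boxes[:, 1] + gt_boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    union = pred_boxes[:, 2] * pred_boxes[:, 3] + gt_boxes[:, 2] * gt_boxes[:, 3] - inter
    iou = inter / np.maximum(union, 1e-9)
    return (cle <= prec_thresh).mean(), (iou >= succ_thresh).mean()
```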
Algorithm 2: Online Tracking
Notation: input sequence (S); sequence length (T); current frame (t); rough estimation of the bounding box ($B^e_t$); generated test proposals ($B^p_t$); concatenated bounding boxes ($B^c_t$); bounding box prediction ($B^{pred}_t$); step size ($\beta$); number of refinements (n); online classification network ($Net_{ATOM_{online}}$); scale and center jittering (Jitt) with random factors; network predictions (IoU and CLE).
Input: $S = \{I_1, I_2, ..., I_T\}$, $B_1 = \{x_1, y_1, w_1, h_1\}$
Output: $B^{pred}_t$, $t \in \{2, ..., T\}$
for t = 1 : T do
    $B^e_t = Net_{ATOM_{online}}(I_t)$
    $B^p_t = Jitt(B^e_t)$
    $B^c_t = Concat(B^e_t, B^p_t)$
    for i = 1 : n do
        IoU, CLE = FeedForward($I_1$, $I_t$, $B_1$, $B^c_t$)
        $grad^{IoU}_{B^c_t} = [\partial IoU/\partial x,\ \partial IoU/\partial y,\ \partial IoU/\partial w,\ \partial IoU/\partial h]$
        $B^c_t \leftarrow B^c_t + \beta \times [\frac{\partial IoU}{\partial x} w,\ \frac{\partial IoU}{\partial y} h,\ \frac{\partial IoU}{\partial w} w,\ \frac{\partial IoU}{\partial h} h]$
        $grad^{CLE}_{B^c_t} = [\partial CLE/\partial x,\ \partial CLE/\partial y,\ \partial CLE/\partial w,\ \partial CLE/\partial h]$
        $B^c_t \leftarrow B^c_t - \beta \times [\frac{\partial CLE}{\partial x} w,\ \frac{\partial CLE}{\partial y} h,\ \frac{\partial CLE}{\partial w},\ \frac{\partial CLE}{\partial h}]$
    end
    $B^{K\times}_t \leftarrow$ select the K best $B^c_t$ w.r.t. IoU-scores
    $B^{pred}_t = Avg(B^{K\times}_t)$
end
return $B^{pred}_t$
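The refinement loop of Algorithm 2 can be sketched with automatic differentiation as below; `net` stands for any module that returns IoU and CLE predictions differentiable w.r.t. the boxes, the iteration count n = 5 and K = 3 follow the values stated above, and the step size β and the network interface are assumptions of this sketch.

```python
import torch

def refine_boxes(net, ref_feat, test_feat, boxes, steps=5, beta=0.25, k=3):
    """Sketch of Algorithm 2's refinement: gradient ascent on predicted IoU, descent on predicted CLE.
    `net(ref_feat, test_feat, boxes)` is assumed to return (iou, cle) for (x, y, w, h) boxes."""
    boxes = boxes.clone().detach()
    for _ in range(steps):
        boxes.requires_grad_(True)
        iou, cle = net(ref_feat, test_feat, boxes)
        g_iou, = torch.autograd.grad(iou.sum(), boxes, retain_graph=True)
        g_cle, = torch.autograd.grad(cle.sum(), boxes)
        with torch.no_grad():
            w, h = boxes[:, 2:3], boxes[:, 3:4]
            iou_scale = torch.cat([w, h, w, h], dim=1)   # IoU-gradients scaled by the box size (log-scale-like)
            cle_scale = torch.cat([w, h, torch.ones_like(w), torch.ones_like(h)], dim=1)  # only x, y scaled
            boxes = (boxes + beta * g_iou * iou_scale - beta * g_cle * cle_scale).detach()
    with torch.no_grad():
        iou, _ = net(ref_feat, test_feat, boxes)
        best = iou.topk(min(k, boxes.shape[0])).indices  # keep the K best proposals by IoU-score
    return boxes[best].mean(dim=0)                       # average them as the final target box
```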
4.1 Implementation Details

The proposed method uses ResNet-50 pre-trained on ImageNet [52] to extract backbone features. For offline proposal generation, the hyper-parameters are set to N = 16 test proposals and (N/2) = 8 reference boxes (seven reference proposals plus the reference ground truth), IoU thresholds $T_1$ and $T_2$ with $T_2 > T_1$, and $\lambda = 4$; image sample pairs are randomly selected from videos with a maximum gap of 50 frames ($\zeta = 50$). From the reference image, a square patch centered at the target is extracted with an area of 5 times the target region. Flipping and color jittering are used for data augmentation of the reference patch. To extract the search area, a patch (with an area of 5 times the test target scale) with some perturbation in position and scale is sampled from the test image. These cropped regions are then resized to a fixed size of 288 x 288. The maximum iteration $max_{ii}$ for proposal generation is 200 for reference proposals and 20 for test proposals. The weights of the backbone network are frozen, and the other weights are initialized using [53]. The training splits are extracted from the official training set (protocol II) of LaSOT [54], the training set of GOT-10K [55], NfS [56], and the training set of VisDrone-2019 [1]. Note that the Small-112 dataset [15] is a subset of the training set of VisDrone-2019. Moreover, the validation splits of the VisDrone-2019 and GOT-10K datasets have been used in the training phase. To train the network end-to-end, the ADAM optimizer [57] is used with weight decay and a step-wise decaying learning rate.
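A small sketch of the patch extraction described above (a square crop with five times the target area, resized to 288 x 288) is given below using OpenCV; the border-replication padding for targets near the image boundary is an assumption of this sketch.

```python
import cv2
import numpy as np

def crop_search_region(image, box, area_factor=5.0, out_size=288):
    """Square crop centered on the target with `area_factor` times the target area, resized to a fixed size."""
    x, y, w, h = box
    cx, cy = x + w / 2.0, y + h / 2.0
    side = int(round(np.sqrt(area_factor * w * h)))   # square side whose area is 5x the target area
    x1, y1 = int(round(cx - side / 2.0)), int(round(cy - side / 2.0))
    # pad so the crop never leaves the image, then cut the square and resize
    pad = max(0, -x1, -y1, x1 + side - image.shape[1], y1 + side - image.shape[0])
    padded = cv2.copyMakeBorder(image, pad, pad, pad, pad, cv2.BORDER_REPLICATE)
    crop = padded[y1 + pad : y1 + pad + side, x1 + pad : x1 + pad + side]
    return cv2.resize(crop, (out_size, out_size))
```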
4.2 Ablation Analysis

A systematic ablation study of the individual components of the proposed tracker has been conducted on the UAVDT dataset [14] (see Table 1). It includes three different versions of the proposed network: the networks without 1) the CLE-head, 2) the CLE-head and reference proposal generation, and 3) the CLE-head, reference proposal generation, and attention module, referred to as A1, A2, and A3, respectively. Moreover, two other feature fusion operations have been investigated, namely feature multiplication (A4) and feature concatenation (A5), compared to the element-wise addition of feature maps in the MSAF module (see Fig. 3).

Table 1: Ablation analysis of COMET regarding different components and feature fusion operations on the UAVDT dataset.

Metric      COMET   A1     A2     A3     A4     A5
Precision   88.7    87.2   85.2   83.6   88     85.3
Success     81      78     76.9   73.5   80.4   77.2
These experiments demonstrate the effectiveness of each component on tracking performance: the full method achieves 88.7% and 81% in terms of precision and success rates, respectively. According to these results, the attention module, reference proposal generation strategy, and CLE-head each improve the average of the success and precision rates. Furthermore, the element-wise addition of feature maps improves the results by 0.65% and 3.6% on average compared to feature multiplication (A4) and feature concatenation (A5), respectively; the benefit of feature addition has also been shown previously in other methods such as [25].
4.3 State-of-the-Art Comparison

For quantitative comparison, the proposed method (COMET) is compared with state-of-the-art visual tracking methods, including AutoTrack [24], ATOM [25], DiMP-50 [38], SiamRPN++ [36], SiamMask [35], DaSiamRPN [32], CREST [58], MDNet [59], PTAV [60], ECO [31], and MCPF [61], on aerial tracking datasets. Fig. 4 shows the achieved results in terms of precision and success plots [10]. According to these results, COMET outperforms the top-performing visual trackers on the three available challenging small object tracking datasets as well as on the UAV-123 dataset. For instance, COMET outperforms the SiamRPN++ and DiMP-50 trackers by 4.4% and 3.2% in terms of the average precision metric, and by 3.3% and 3% in terms of the average success metric over all datasets, respectively. Compared to the baseline ATOM tracker, COMET improves the average precision rate by about 10% on UAVDT and achieves consistent precision and success gains on the VisDrone-2019-test-dev and Small-90 datasets. Although COMET only slightly outperforms the ATOM tracker on UAV-123 (see Fig. 1), it achieves up to 6.2% and 7% improvements over ATOM in terms of the average precision and success metrics on the small object tracking datasets.
Fig. 4: Overall precision and success comparisons of the proposed method (COMET) with state-of-the-art tracking methods on the UAVDT, VisDrone-2019-test-dev, Small-90, and UAV-123 datasets.
Table 2: Average speed of state-of-the-art trackers on the UAVDT dataset.

              COMET   ATOM   SiamRPN++   DiMP-50   SiamMask   ECO
Speed (FPS)   24      30     32          33        42         35
These results are mainly owed to the proposed proposal generation strategy and the effective modules, which make the network focus on relevant information about the target (and its parts) and on context information. Furthermore, COMET runs at 24 frames per second (FPS), while the average speeds of the other trackers on the same machine are indicated in Table 2. This satisfactory speed originates from considering different proposal generation strategies for the offline and online procedures and from employing lightweight modules in the proposed architecture.

COMET has also been evaluated according to various attributes of small object tracking scenarios to investigate its strengths and weaknesses. Table 3 and Table 4 present the attribute-based comparison of visual trackers on the UAVDT and VisDrone datasets in terms of the average precision and AUC metrics; the overall AUC of the trackers is also shown in Table 4. The comparisons cover various attributes, including background clutter (BC), illumination variation (IV), scale variation (SV), camera motion (CM), object motion (OM), small object (SO), object blur (OB), large occlusion (LO), long-term tracking (LT), aspect ratio change (ARC), fast motion (FM), partial occlusion (POC), full occlusion (FOC), low resolution (LR), out-of-view (OV), similar objects (SOB), and viewpoint change (VC).
Table 3: Attribute-based comparison of state-of-the-art visual tracking methods in terms of the precision metric on the UAVDT dataset [the first and second best visual tracking methods are highlighted in color in the original].

Tracker      BC     CM     OM     SO     IV     OB     SV     LO     LT
COMET        83.8   86.1   90.6   90.9   88.5   87.7   90.2   79.6   96
ATOM         70.1   77.2   73.4   80.6   80.8   74.9   73     66     91.7
AutoTrack    61.1   66.5   63.1   80.9   76.8   73.2   63.6   52.1   87.8
SiamRPN++    74.9   75.9   80.4   83.5   89.7   89.4   80.1   66.6   84.9
SiamMask     71.6   76.7   77.8   86.7   86.4   86     77.3   60.1   93.8
DiMP-50      71.1   80.3   75.8   81.4   84.3   79     76.1   68.6   100
ECO          61.1   64.4   62.7   79.1   76.9   71     63.2   50.8   95.2
MDNet        63.6   69.6   66.8   78.4   76.4   72.4   68.5   54.7   93
CREST        56.2   62.1   55.8   74.2   69     65.6   56.7   49.7   71.2
PTAV         57.2   63.9   56.4   79.1   69.6   66.2   56.5   50.3   80.1
MCPF         51.2   59.2   55.3   74.5   73.1   73     55.1   42.5   74.1
Table 4: Attribute-based comparison of state-of-the-art visual tracking methods in terms of the AUC metric on the VisDrone-2019-test-dev dataset [the first and second best visual tracking methods are highlighted in color in the original].

Tracker     Overall  ARC    BC     CM     FM     FOC    IV     LR     OV     POC    SOB    SV     VC
COMET       64.5     64.2   43.4   62.6   64.9   56.7   65.5   41.8   75.9   62.1   42.8   65.8   70.4
ATOM        57.1     52.3   36.7   56.4   52.3   48.8   63.3   31.2   63     51.9   35.6   55.4   61.3
SiamRPN++   59.9     58.9   41.2   58.7   61.8   55.1   63.5   36.4   69.3   58.8   39.6   59.9   67.8
DiMP-50     60.8     54.5   40.6   60.6   62     55.8   63.6   32.7   62.4   56.8   39.8   59.7   66
SiamMask    58.1     57.8   38.5   57.2   60.8   49     56.6   46.5   67.5   52.9   37     59.4   65.1
ECO         55.9     56.5   38.3   54.2   52.1   46.8   59.9   36.6   61.5   51.6   38.6   52.7   62.4
Fig. 5: Qualitative comparison of the proposed COMET tracker with state-of-the-art tracking methods (SiamMask [35], ATOM [25], DiMP-50 [38], and SiamRPN++ [36]) on the S1202, S0602, and S0801 video sequences from the UAVDT dataset (top to bottom rows, respectively).

The results demonstrate the considerable performance of COMET compared to the state-of-the-art trackers. These tables also show that COMET successfully handles the occlusion and viewpoint change problems for small object tracking purposes.
Compared to DiMP-50, SiamRPN++, and SiamMask, COMET achieves improvements of up to 9.5% for the small object (SO) attribute, and of 4.4%, 2.6%, and 5.3% for the viewpoint change (VC) attribute, respectively. Meanwhile, it gains up to 12.1% improvement on average for the occlusion attributes (i.e., the average of FOC, POC, and LO) compared to these trackers. While the performance can still be improved for the IV, OB, LR, and LT attributes, COMET outperforms ATOM by margins of 7.7%, 12.8%, 10.6%, and 4.3% on these attributes, respectively.

The qualitative comparisons of the visual trackers are shown in Fig. 5, in which the videos have been selected for clarity. Further qualitative evaluations on various datasets and even on YouTube videos are provided in the supplementary material. According to the first row of Fig. 5, COMET successfully models small objects on-the-fly in complicated aerial view scenarios. Also, it provides promising results when the aspect ratio of the target changes significantly. Examples of out-of-view and occlusion are shown in the next rows of Fig. 5. By considering target parts and context information, COMET properly handles these problems despite the presence of potential distractors.
5 Conclusion

A context-aware IoU-guided tracker has been proposed that includes an offline reference proposal generation strategy and a two-stream multitask network. It aims to track small objects in videos captured from medium- and high-altitude aerial views. First, the introduced proposal generation strategy provides context information for the proposed network to learn the target and its parts. This strategy effectively helps the network to handle occlusion and viewpoint change in high-density videos with a broad view angle in which only some parts of the target are visible. Moreover, the proposed network exploits multi-scale feature aggregation and attention modules to learn multi-scale features and suppress visual distractors. Finally, the proposed multitask loss function accurately estimates the target region by maximizing the IoU and minimizing the CLE between the predicted box and the object. Experimental results on four state-of-the-art aerial view tracking datasets and the remarkable performance of the proposed tracker demonstrate the motivation and effectiveness of the proposed components for small object tracking purposes.
References
1. Du, D., Zhu, P., Wen, L., Bian, X., Ling, H., et al.: VisDrone-SOT2019: The vision meets drone single object tracking challenge results. In: Proc. ICCVW. (2019)
2. Du, D., Qi, Y., Yu, H., Yang, Y., Duan, K., Li, G., Zhang, W., Huang, Q., Tian, Q.: The unmanned aerial vehicle benchmark: Object detection and tracking. In: Proc. ECCV. (2018) 375-391
3. Marvasti-Zadeh, S.M., Cheng, L., Ghanei-Yakhdan, H., Kasaei, S.: Deep learning for visual tracking: A comprehensive survey (2019)
4. Bonatti, R., Ho, C., Wang, W., Choudhury, S., Scherer, S.: Towards a robust aerial cinematography platform: Localizing and tracking moving targets in unstructured environments. In: Proc. IROS. (2019) 229-236
5. Zhang, H., Wang, G., Lei, Z., Hwang, J.: Eye in the sky: Drone-based object tracking and 3D localization. In: Proc. Multimedia. (2019) 899-907
6. Du, D., Zhu, P., et al.: VisDrone-SOT2019: The vision meets drone single object tracking challenge results. In: Proc. ICCVW. (2019)
7. Zhu, P., Wen, L., Du, D., Bian, X., Hu, Q., Ling, H.: Vision meets drones: Past, present and future (2020)
8. Mueller, M., Smith, N., Ghanem, B.: A benchmark and simulator for UAV tracking. In: Proc. ECCV. (2016) 445-461
9. Wu, Y., Lim, J., Yang, M.H.: Online object tracking: A benchmark. In: Proc. IEEE CVPR. (2013) 2411-2418
10. Wu, Y., Lim, J., Yang, M.: Object tracking benchmark. IEEE Trans. Pattern Anal. Mach. Intell. (2015) 1834-1848
11. Kristan, M., Leonardis, A., Matas, J., Felsberg, M., Pflugfelder, R., et al.: The sixth visual object tracking VOT2018 challenge results. In: Proc. ECCVW. (2019) 3-53
12. Kristan, M., et al.: The seventh visual object tracking VOT2019 challenge results. In: Proc. ICCVW. (2019)
13. Zhu, P., Wen, L., Du, D., et al.: VisDrone-VDT2018: The vision meets drone video detection and tracking challenge results. In: Proc. ECCVW. (2018) 496-518
14. Yu, H., Li, G., Zhang, W., et al.: The unmanned aerial vehicle benchmark: Object detection, tracking and baseline. Int. J. Comput. Vis. (2019)
15. Liu, C., Ding, W., Yang, J., et al.: Aggregation signature for small object tracking. IEEE Trans. Image Process. (2020) 1738-1747
16. Liang, P., Blasch, E., Ling, H.: Encoding color information for visual tracking: Algorithms and benchmark. IEEE Trans. Image Process. (2015) 5630-5644
17. Tong, K., Wu, Y., Zhou, F.: Recent advances in small object detection based on deep learning: A review. Image and Vision Computing (2020)
18. LaLonde, R., Zhang, D., Shah, M.: ClusterNet: Detecting small objects in large scenes by exploiting spatio-temporal information. In: Proc. CVPR. (2018)
19. Bai, Y., Zhang, Y., Ding, M., Ghanem, B.: SOD-MTGAN: Small object detection via multi-task generative adversarial network. In: Proc. ECCV. (2018)
20. Huang, Z., Fu, C., Li, Y., Lin, F., Lu, P.: Learning aberrance repressed correlation filters for real-time UAV tracking. In: Proc. IEEE ICCV. (2019) 2891-2900
21. Fu, C., Huang, Z., Li, Y., Duan, R., Lu, P.: Boundary effect-aware visual tracking for UAV with online enhanced background learning and multi-frame consensus verification. In: Proc. IROS. (2019) 4415-4422
22. Li, F., Fu, C., Lin, F., Li, Y., Lu, P.: Training-set distillation for real-time UAV object tracking. In: Proc. ICRA. (2020) 1-7
23. Li, Y., Fu, C., Huang, Z., Zhang, Y., Pan, J.: Keyfilter-aware real-time UAV object tracking. In: Proc. ICRA. (2020)
24. Li, Y., Fu, C., Ding, F., Huang, Z., Lu, G.: AutoTrack: Towards high-performance visual tracking for UAV with automatic spatio-temporal regularization. In: Proc. IEEE CVPR. (2020)
25. Danelljan, M., Bhat, G., Khan, F.S., Felsberg, M.: ATOM: Accurate tracking by overlap maximization. In: Proc. CVPR. (2019)
26. Held, D., Thrun, S., Savarese, S.: Learning to track at 100 FPS with deep regression networks. In: Proc. ECCV. (2016) 749-765
27. Bertinetto, L., Valmadre, J., Henriques, J.F., Vedaldi, A., Torr, P.H.: Fully-convolutional Siamese networks for object tracking. In: Proc. ECCV. (2016) 850-865
28. Valmadre, J., Bertinetto, L., Henriques, J., Vedaldi, A., Torr, P.H.: End-to-end representation learning for correlation filter based tracking. In: Proc. IEEE CVPR. (2017) 5000-5008
29. Dong, X., Shen, J.: Triplet loss in Siamese network for object tracking. In: Proc. ECCV. (2018) 472-488
30. Danelljan, M., Robinson, A., Khan, F.S., Felsberg, M.: Beyond correlation filters: Learning continuous convolution operators for visual tracking. In: Proc. ECCV. (2016) 472-488
31. Danelljan, M., Bhat, G., Shahbaz Khan, F., Felsberg, M.: ECO: Efficient convolution operators for tracking. In: Proc. IEEE CVPR. (2017) 6931-6939
32. Zhu, Z., Wang, Q., Li, B., Wu, W., Yan, J., Hu, W.: Distractor-aware Siamese networks for visual object tracking. In: Proc. ECCV. (2018) 103-119
33. Zhang, Z., Peng, H.: Deeper and wider Siamese networks for real-time visual tracking (2019)
34. Fan, H., Ling, H.: Siamese cascaded region proposal networks for real-time visual tracking (2018)
35. Wang, Q., Zhang, L., Bertinetto, L., Hu, W., Torr, P.H.: Fast online object tracking and segmentation: A unifying approach. In: Proc. IEEE CVPR. (2019)
36. Li, B., Wu, W., Wang, Q., Zhang, F., Xing, J., Yan, J.: SiamRPN++: Evolution of Siamese visual tracking with very deep networks. In: Proc. IEEE CVPR. (2019)
37. Li, B., Yan, J., Wu, W., Zhu, Z., Hu, X.: High performance visual tracking with Siamese region proposal network. In: Proc. IEEE CVPR. (2018) 8971-8980
38. Bhat, G., Danelljan, M., Gool, L.V., Timofte, R.: Learning discriminative model prediction for tracking. In: Proc. IEEE ICCV. (2019)
39. Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C., Berg, A.C.: SSD: Single shot multibox detector. In: Proc. ECCV. (2016) 21-37
40. Fu, C., Liu, W., Ranga, A., Tyagi, A., Berg, A.: DSSD: Deconvolutional single shot detector (2017)
41. Cui, L., Ma, R., Lv, P., Jiang, X., Gao, Z., Zhou, B., Xu, M.: MDSSD: Multi-scale deconvolutional single shot detector for small objects (2018)
42. Lim, J., Astrid, M., Yoon, H., Lee, S.: Small object detection using context and attention (2019)
43. Yang, X., Yang, J., Yan, J., Zhang, Y., Zhang, T., Guo, Z., Sun, X., Fu, K.: SCRDet: Towards more robust detection for small, cluttered and rotated objects. In: Proc. IEEE ICCV. (2019)
44. Redmon, J., Farhadi, A.: YOLOv3: An incremental improvement (2018)
45. Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proc. IEEE CVPR. (2014) 580-587
46. Jiang, B., Luo, R., Mao, J., Xiao, T., Jiang, Y.: Acquisition of localization confidence for accurate object detection. In: Proc. IEEE ECCV. (2018) 816-832
47. Lim, J.S., Astrid, M., Yoon, H.J., Lee, S.I.: Small object detection using context and attention (2019)
48. Park, J., Woo, S., Lee, J.Y., Kweon, I.S.: BAM: Bottleneck attention module. In: Proc. BMVC. (2018) 147-161
49. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proc. IEEE CVPR. (2016) 770-778
50. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the inception architecture for computer vision. In: Proc. CVPR. (2016) 2818-2826
51. Hu, J., Shen, L., Albanie, S., Sun, G., Wu, E.: Squeeze-and-excitation networks. IEEE Trans. Pattern Anal. Mach. Intell. (2019)
52. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L.: ImageNet large scale visual recognition challenge. IJCV (2015) 211-252
53. He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In: Proc. ICCV. (2015) 1026-1034
54. Fan, H., Lin, L., Yang, F., Chu, P., Deng, G., Yu, S., Bai, H., Xu, Y., Liao, C., Ling, H.: LaSOT: A high-quality benchmark for large-scale single object tracking. In: Proc. IEEE CVPR. (2019)
55. Huang, L., Zhao, X., Huang, K.: GOT-10k: A large high-diversity benchmark for generic object tracking in the wild. IEEE Trans. Pattern Anal. Mach. Intell. (2019)
56. Galoogahi, H.K., Fagg, A., Huang, C., Ramanan, D., Lucey, S.: Need for speed: A benchmark for higher frame rate object tracking. In: Proc. IEEE ICCV. (2017) 1134-1143
57. Kingma, D.P., Ba, J.: ADAM: A method for stochastic optimization. In: Proc. ICLR. (2014)
58. Song, Y., Ma, C., Gong, L., Zhang, J., Lau, R.W., Yang, M.H.: CREST: Convolutional residual learning for visual tracking. In: Proc. ICCV. (2017) 2574-2583
59. Nam, H., Han, B.: Learning multi-domain convolutional neural networks for visual tracking. In: Proc. IEEE CVPR. (2016) 4293-4302
60. Fan, H., Ling, H.: Parallel tracking and verifying. IEEE Trans. Image Process. 28 (2019)
61. Zhang, T., Xu, C., Yang, M.H.: Multi-task correlation particle filter for robust object tracking. In: Proc. IEEE CVPR. (2017)