Learning Reinforced Attentional Representation for End-to-End Visual Tracking
Peng Gao a,b, Qiquan Zhang a, Fei Wang a, Liyi Xiao a,b, Hamido Fujita c,d,e, Yan Zhang a

a School of Electronics and Information Engineering, Harbin Institute of Technology, Shenzhen, China
b School of Astronautics, Harbin Institute of Technology, Harbin, China
c Faculty of Information Technology, Ho Chi Minh City University of Technology (HUTECH), Ho Chi Minh City, Vietnam
d Andalusian Research Institute in Data Science and Computational Intelligence (DaSCI), University of Granada, Granada, Spain
e Faculty of Software and Information Science, Iwate Prefectural University, Iwate, Japan
Abstract
Although numerous tracking approaches have made tremendous advances in the last decade, achieving high-performance visual tracking remains a challenge. In this paper, we propose an end-to-end network model to learn reinforced attentional representations for accurate target object discrimination and localization. We utilize a novel hierarchical attentional module with long short-term memory and multi-layer perceptrons to leverage both inter- and intra-frame attention and effectively facilitate visual pattern emphasis. Moreover, we incorporate a contextual attentional correlation filter into the backbone network to make our model trainable in an end-to-end fashion. Our proposed approach not only takes full advantage of informative geometries and semantics, but also updates the correlation filters online without fine-tuning the backbone network, enabling adaptation to variations in the target object's appearance. Extensive experiments conducted on several popular benchmark datasets demonstrate that our proposed approach is effective and computationally efficient.

Keywords:
Visual tracking, reinforced representation, attentive learning, correlation filter
1. Introduction
Visual tracking is an essential and actively researched problem in the field of computer vision, with various real-world applications such as robotic services, smart surveillance systems, autonomous driving, and human-computer interaction. It refers to the automatic estimation of the trajectory of an arbitrary target object, usually specified by a bounding box in the first frame, as it moves around in subsequent video frames. Although considerable progress has been made in the last decade [1, 2], visual tracking is still commonly recognized as a very challenging task, partially due to numerous complicated real-world scenarios such as scale variations, fast motion, occlusions, and deformations.

One of the most successful tracking frameworks is the discriminative correlation filter (DCF) [3, 4, 5]. With the benefits of the fast Fourier transform, most DCF-based approaches can employ large numbers of cyclically shifted samples for training, and achieve high accuracy while running at impressive frame rates. Recent years have witnessed significant advances of the convolutional neural network (CNN) on many computer vision tasks such as image classification and object detection [6]. This is because a CNN can gradually proceed from learning finer-level geometries to coarse-level semantics of the target objects by transforming and enlarging the receptive fields at different convolutional layers [7]. Encouraged by these great successes, some DCF-based trackers resort to using pre-trained CNN models [8, 9, 10] instead of conventional handcrafted features [11, 3] for target object representation, and achieve favorable performance [12, 13]. Recently, record-breaking performance and efficiency have been achieved using Siamese matching networks [14, 15, 16] for visual tracking. In each frame, these trackers learn a similarity metric between the target template and the candidate patch from the current searching frame in an end-to-end fashion.

Figure 1: Visualization of deep feature maps from different convolutional layers of different CNN architectures, including AlexNet [8] (top row), VGG-19 [9] (middle row) and ResNet-50 [10] (bottom row). It is evident that low-level geometries from shallow layers, such as 'conv1' in AlexNet and 'conv1' in ResNet-50, retain fine-grained target-specific details, while high-level semantics from deep layers, such as 'conv5' in AlexNet, capture coarse-level category-specific semantics of the dinosaur.

Despite the significant progress mentioned above, existing CNN-based tracking approaches are still limited by several intractable obstacles. Most methods directly utilize off-the-shelf CNN models pre-trained on large-scale image classification datasets [6, 17] to obtain a generic representation of the target object [8, 9]. It is well acknowledged that different convolutional layers of CNNs, as shown in Fig. 1, encode different types of features [7]. Although features taken from the higher convolutional layers retain rich coarse-level category-specific semantics, they are ineffective for accurately localizing or estimating the scale of the target object. Conversely, features extracted from the lower convolutional layers maintain more fine-level geometries that capture target-specific spatial details and facilitate accurate localization of the target object, but they are insufficient to distinguish objects from non-objects with similar characteristics. With the aim of best exploiting deep features, some prior works [18, 19, 20, 12] have attempted to integrate the advantages of fine-level geometries and coarse-level semantics using multiple refinement strategies. Unfortunately, compared with state-of-the-art approaches [14, 16] that only employ the outputs of the last layers to represent the target objects, their performance still shows a notable gap. Combining features directly from multiple convolutional layers is thus not sufficient for representing target objects; such trackers also tend to underperform under challenging scenarios.

Figure 2: Visualization of feature channels in the last layer of the 'conv3', 'conv4' and 'conv5' stages in ResNet-50 [10]. Example frames are randomly picked from the Bolt, Lemming and Liquor sequences (shown from top to bottom on the left). We show the features extracted from 20 random channels of each stage from top to bottom on the right of the corresponding example frame. It is clear that only a few feature channels and regions contribute to target object representation, while others may serve as information redundancy. It is noteworthy that, for each example frame, the channels shown in the corresponding stage are the same.

Moreover, on deep feature maps, each feature channel corresponds to a particular type of visual pattern, whereas feature spatial regions represent object-specific details [21, 22]. We observe that deep features directly extracted from pre-trained CNN models treat every pixel equally along the channel-wise and spatial axes. Specifically, there is the possibility that only some of the features are closely related to the task of distinguishing specific target objects from background surroundings, while others may be redundant information that causes model drift and probably leads to tracking failures [23, 24], as illustrated in Fig. 2. Recently, the visual attention mechanism has brought remarkable progress to recent research and performs surprisingly well in many computer vision tasks [25, 26], owing to its ability to model contextual information. Although it is necessary to highlight useful features and suppress irrelevant information using attention mechanisms for visual tracking, some previous trackers [27, 28, 29] only take advantage of intra-frame attention to learn which semantic attributes to select from the proper visual patterns along the channel axis, and do not consider where to focus along the spatial axis, thus achieving inferior tracking results. Moreover, most existing CNN-based trackers implement their models with shallow networks such as AlexNet [8]; they cannot exploit the benefits of more powerful representations from deeper networks such as ResNet [10].

Notably, as the target objects could be anything, the pre-trained CNN models may be agnostic about some target objects not present in the training set. To ensure high-performance visual tracking, most trackers only employ the original deep features taken from the first frame to match candidate patches in subsequent frames [14, 16]. The characteristics of the target object are consistent within consecutive frames, and there exists a strong temporal relationship between the target object appearance and motion in video data [30, 31]. Using contexts from historical frames may enhance tracking accuracy and robustness under challenging scenarios such as occlusions and deformations.
The recurrent neural network (RNN), especially the long short-term memory (LSTM) [32], has achieved great success in many natural language processing (NLP) applications by saving attractive temporal cues and discarding irrelevant ones using gated memory components, thereby becoming suitable for exploring inter-frame attention during visual tracking. However, there are only limited approaches that employ such network models in visual tracking [33]. Most trackers ignore the inter-frame attention and can hardly capture appearance variations of the target objects well, which may lead to model drift. On the whole, how to take full advantage of inter- and intra-frame attention for visual tracking is a largely underexplored domain.

To address the above issues, we propose a unified end-to-end reinforced attentional Siamese network model, dubbed RAR, to pursue high-performance visual tracking. The framework of the proposed approach is shown in Fig. 3. As mentioned above, it has already been proven that tracking can benefit from leveraging deep feature hierarchies across multiple convolutional layers [18, 19]. Therefore, we use a carefully modified ResNet-50 as the backbone network, and take multi-level deep features from the last three convolutional stages to enhance the effectiveness of target object representation. We adopt the tracking-by-detection paradigm to trace target objects, and reformulate the tracking problem as a sequential inference task. To emphasize informative representations and suppress information redundancy, we design a hierarchical attention module for learning multiple visual attentions, which is composed of an inter-frame attention model and an intra-frame attention model.

Figure 3: The framework of the proposed tracking approach. Specifically, our approach contains three main components, i.e., a backbone network for deep feature extraction (detailed in Section 3.1), a hierarchical attention module for informative feature emphasis (detailed in Section 3.2), and a decision module for target object discrimination and localization (detailed in Section 3.3).

The inter-frame attention model is built upon convolutional LSTM units that can fully explore the temporal cues of the target object's appearance at different convolutional layers in consecutive frames [33, 34]. It can be decomposed into sequential blocks, each corresponding to a specific time slice. We then design an intra-frame attention model that consists of two multi-layer perceptrons (MLPs) along the channel-wise and spatial axes of the deep feature maps [35, 26]. Through the inter- and intra-frame attention, we can obtain significantly more powerful attentional representations. It is worth noting that both the inter- and intra-frame attention are obtained separately at different convolutional layers. Subsequently, the hierarchical attentional representations are merged to produce a refined one using a refinement model made of convolutional layers and element-wise additions, rather than exploiting them independently or combining them directly. Specifically, the refined output is generated by successively integrating attentional representations from the last layer with those from earlier layers in a stacked manner. With the refinement model, we can obtain stronger representations that maintain coherent target-specific geometries and semantics at a desirable resolution. Besides, we adopt the DCF to discriminate and locate target objects. Because the background context around target objects has a significant impact on tracking performance, a contextual attentional DCF is employed as the decision module to take global context into account and further eliminate unnecessary disturbance. To allow the whole network model to be trained end to end, the correlation operation is reformulated as a differentiable correlation layer [15, 36]. Thus, the contextual attentional DCF can be updated online without fine-tuning the network model to guide the adaptation of the target object's appearance model.

We summarize the main contributions of our work as follows:
1. An end-to-end reinforced attentional Siamese network model is proposed for high-performance visual tracking.
2. A hierarchical attention module is utilized to leverage both inter- and intra-frame attention at each convolutional layer to effectively highlight informative representations and suppress redundancy.
3. A contextual attentional correlation layer that can take global context into account and further emphasize interesting regions is incorporated into the backbone network.
4. Extensive and ablative experiments on four popular benchmark datasets, i.e., OTB-2013 [37], OTB-2015 [38], VOT-2016 [39] and VOT-2017 [40], demonstrate that our proposed tracker outperforms state-of-the-art approaches.

The rest of the paper is organized as follows. Section 2 briefly reviews related works. Section 3 illustrates the proposed tracking approach. Section 4 details experiments and discusses results. Section 5 concludes the paper.
2. Related works
Many real-world applications require visual tracking approaches with excellent effectiveness and efficiency. In this section, we briefly review tracking-by-detection methods based on the DCF and CNN, which are most related to our work. For other visual tracking methods, please refer to more comprehensive reviews [1, 2].

In the past few years, tracking approaches that train DCFs by exploiting the properties of circular correlation and performing operations in the Fourier frequency domain have played a dominant role in the visual tracking community, because of their superior computational efficiency and reasonably good accuracy. Several extensions have been proposed to considerably improve tracking performance using multi-dimensional features [11], nonlinear kernel correlation [3], robust scale estimation [4], and reduced boundary effects [41]. However, earlier DCF-based trackers take advantage of conventional handcrafted features [11, 3], and thus suffer from inadequate representation capability.

Recently, with the rapid progress in deep learning techniques, CNN-based trackers have achieved remarkable progress and become a trend in visual tracking. Some approaches incorporate CNN features into the DCF framework for tracking, and demonstrate outstanding accuracy and high efficiency. As previously noted, the finer-level features that detail the spatial information play a vital role in accurate localization, and the coarse-level features that characterize semantics play a pivotal role in robust discrimination. Therefore, it is necessary to design a specific feature refinement scheme before discrimination. HCF [18] extracts deep features from hierarchical convolutional layers and merges them using a fixed weight scheme. HDT [19] employs adaptive weights to combine the deep features from multiple layers. However, these trackers merely exploit the CNN for feature extraction, and then learn the filters separately to locate the target object. Therefore, their performance may be suboptimal. Some later works attempt to train a network model to perform both feature extraction and target object localization simultaneously. Both CFNet [15] and EDCF [36] unify the DCF as a differentiable correlation layer in a Siamese network model [14], and thus make it possible to learn powerful representations from end to end. These approaches have promoted the development of visual tracking and greatly improved tracking performance. Nevertheless, many deep features taken from pre-trained CNN models are irrelevant to the task of distinguishing the target object from the background. These disturbances significantly limit the performance of the abovementioned end-to-end tracking approaches.

Instead of exploiting vanilla deep features for visual tracking, methods using attention-weighted deep features alleviate the model drift problems caused by background noise. In fact, when tracking a target object, the tracker should merely focus on a much smaller subset of deep features which can effectively distinguish and locate the target object from the background. This implies that many deep features are irrelevant to representing the target object. Some works explore attention mechanisms to highlight useful information in visual tracking. CSRDCF [27] constructs a unique spatial reliability map to constrain filter learning. ACFN [42] establishes a unique attention mechanism to choose useful filters during tracking.
RASNet [28] and FlowTrack [29] further introduce an attention network similar to the architecture of SENet [25] to enhance the representation capabilities of the output features. Specifically, FlowTrack also clusters motion information to exploit historical cues. CCOT [20] takes previous frames into account during filter training to enhance its robustness. RTT [34] learns recurrent filters through an LSTM network to maintain the target object's appearance. Nonetheless, all these trackers take advantage of only one or two aspects of attention to refine the deep output features; exceedingly useful information in intermediate convolutional layers has not yet been fully explored.

Motivated by the above observations, we aim to achieve high-performance visual tracking by learning efficient representations and the DCF mutually in an end-to-end network. Our approach is related to but different from EDCF [36] and HCF [18]. The former proposes a fully convolutional encoder-decoder network model to jointly perform similarity measurement and the correlation operation on multi-level reinforced representations for multi-task tracking, but our approach additionally learns both inter- and intra-frame attention based on convolutional LSTM units and MLPs to emphasize useful features, and takes global context and temporal correlation into account to train and update the DCF. The latter utilizes hierarchical convolutional features for robust tracking. However, rather than using a fixed weight scheme to fuse features from different levels, we first perform attentional analysis on different convolutional layers separately, following which we merge the hierarchical attentional features using a refinement model for better target object representation.

3. The proposed approach

We propose a novel Siamese network model for jointly performing reinforced attentional representation learning and contextual attentional DCF training in an end-to-end fashion. Our network is based on the Siamese network architecture [14, 15], and takes an image patch pair (z, x) that comprises a target template patch z and a searching image patch x as input. The target template patch z represents the object of interest, which is usually centered at the target object position in the previous video frame, while x represents the searching region in the current video frame, which is centered around the target object position estimated in the previous video frame. We use the fully convolutional portion of ResNet-50 [10] as the backbone network, and partially modify its original architecture. Both inputs are processed using the same backbone network with learnable parameters ϕ, yielding two deep feature maps, ϕ(z) and ϕ(x). Then, we employ a hierarchical attention module, as proposed in Section 3.2, to obtain both the inter- and intra-frame attention of each deep feature hierarchy separately. Subsequently, the hierarchical attentional features are merged using a refinement model. The reinforced attentional representations of z and x are denoted as ϕ_a(z) and ϕ_a(x), respectively. The template reinforced attentional representation ϕ_a(z) is used to learn a contextual attentional DCF w by solving a ridge regression problem f in the Fourier domain [3],

    w = f(ϕ_a(z))                                                          (1)

The contextual attentional DCF w is then applied to compute the correlation response g of the searching image patch x as

    g(x) = w ⋆ ϕ_a(x)                                                      (2)

where ⋆ denotes the cross-correlation operation. The position of the maximum value of g corresponds to that of the target object. More details about this part are described in Section 3.3.
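For concreteness, the following is a minimal NumPy sketch (not the authors' released code) of the correlation step in Eq. 2: the learned filter is cross-correlated with the attentional representation of the searching patch in the Fourier domain, and the peak of the response gives the new target position. Function and variable names are illustrative, and cross-correlation is realized via the conjugate in the frequency domain.

```python
import numpy as np

def correlation_response(w, phi_x):
    """Eq. 2: g(x) = w * phi_a(x), evaluated per channel in the Fourier domain.

    w, phi_x : (H, W, C) real arrays; the response is summed over channels.
    """
    W_hat = np.fft.fft2(w, axes=(0, 1))
    X_hat = np.fft.fft2(phi_x, axes=(0, 1))
    # cross-correlation = inverse FFT of conj(W_hat) * X_hat, summed over channels
    g = np.real(np.fft.ifft2(np.conj(W_hat) * X_hat, axes=(0, 1))).sum(axis=2)
    return g

def locate_target(g):
    """The target is placed at the maximum of the correlation response."""
    dy, dx = np.unravel_index(np.argmax(g), g.shape)
    return dy, dx
```

The closed-form learning of the filter itself (Eq. 1) follows in Section 3.3.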
We introduce a hierarchical attention module to leverage both inter- and intra-frame attention. The inter-frame attention is exploited to perform robust inference in the current frame by capturing historical context information. The intra-frame attention along the channel-wise and spatial axes is employed to emphasize informative representations and suppress redundancy. As illustrated in Fig. 4, for an arbitrary object, the inter-frame attention tends to focus more on some key characteristics of the target object than on the surroundings in consecutive frames (the third picture in Fig. 4), while the intra-frame attention mainly concentrates on some critical regions to better represent the target object (the fourth picture in Fig. 4). The details of our hierarchical attention module, as shown in Fig. 5, are given below.

Figure 4: Visualization of feature and attention maps of one convolutional layer on the Lemming sequence: the original feature map, inter-frame attention, intra-frame attention, and the correlation response generated by the proposed network.
Inter-frame attention.
We formulate the tracking task as a sequential inference problem, and utilize a convolutional LSTM unit to model the temporal consistency of the target object's appearance. On the extracted feature map ϕ_t ∈ R^{W×H×C} in the current frame t, the inter-frame attention can be computed in the convolutional LSTM unit as follows:

    f_t, i_t, o_t = σ(W_h h_{t−1} + W_i ϕ_t)
    c̃_t = tanh(W_h h_{t−1} + W_i ϕ_t)
    c_t = f_t ⊙ c_{t−1} + i_t ⊙ c̃_t
    h_t = o_t ⊙ tanh(c_t)                                                  (3)

where ⊙ denotes the element-wise (Hadamard) product, and σ and tanh are the sigmoid and hyperbolic tangent activations, respectively. W_i and W_h are the kernel weights of the input layer and the hidden layer. The gates f_t, i_t, o_t and c̃_t indicate the forget, input, output and content gates, respectively. c_t denotes the cell state. h_t is the hidden state that is treated as the inter-frame attention. To facilitate the calculation of the intra-frame attention, h_t is fed into two fully convolutional layers to separately obtain the inter-frame attention along the channel-wise axis h_t^c ∈ R^{1×1×C} and along the spatial axis h_t^s ∈ R^{W×H×1},

    h_t^c = σ(W_hc h_t)
    h_t^s = σ(W_hs h_t)                                                    (4)

where W_hc and W_hs are the kernel weights of the two convolutional layers corresponding to h_t^c and h_t^s, respectively.
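As a concrete illustration, a minimal NumPy sketch of one inter-frame attention step (Eqs. 3-4) is given below. It simplifies the convolutional LSTM to per-pixel (1×1) convolutions, uses separate stacked gate weights (whereas the notation above shares W_i and W_h across gates), and assumes global average pooling when projecting h_t to the 1×1×C channel-wise attention; all names are illustrative rather than the authors' implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def interframe_attention_step(phi_t, h_prev, c_prev, Wi, Wh, Whc, Whs):
    """One step of Eqs. 3-4 with 1x1 (per-pixel) convolutions for brevity.

    phi_t          : (H, W, C) feature map of frame t
    h_prev, c_prev : (H, W, D) hidden and cell states from frame t-1
    Wi : (C, 4*D)  input-to-gates weights;  Wh : (D, 4*D) hidden-to-gates weights
    Whc: (D, C)    projects h_t to the channel-wise attention h_t^c (1x1xC)
    Whs: (D, 1)    projects h_t to the spatial attention h_t^s (HxWx1)
    """
    gates = phi_t @ Wi + h_prev @ Wh              # (H, W, 4*D)
    f, i, o, c_tilde = np.split(gates, 4, axis=-1)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)  # forget, input, output gates
    c_t = f * c_prev + i * np.tanh(c_tilde)       # cell state (Eq. 3)
    h_t = o * np.tanh(c_t)                        # hidden state = inter-frame attention
    # Eq. 4: global pooling assumed for the 1x1xC projection
    h_c = sigmoid(h_t.mean(axis=(0, 1)) @ Whc)    # channel-wise attention, (C,)
    h_s = sigmoid(h_t @ Whs)                      # spatial attention, (H, W, 1)
    return h_t, c_t, h_c, h_s
```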
Figure 5: Overview of the hierarchical attention module. It is worth noting that h_{t−1}^c is obtained based on the inter-frame attention h_{t−1} of the previous frame, and we use dashed lines to indicate the corresponding operations.

Intra-frame attention along the channel-wise axis.

We exploit the channel-wise intra-frame attention to emphasize meaningful visual patterns in the feature maps and boost the target object discrimination performance. Given the input feature ϕ_t ∈ R^{W×H×C} and the channel-wise inter-frame attention h_{t−1}^c of the previous frame, we first apply global average-pooling and max-pooling operations along the spatial axis to the input feature to generate two channel-wise context descriptors: AvgPool^c(ϕ_t) ∈ R^{1×1×C} and MaxPool^c(ϕ_t) ∈ R^{1×1×C}. Then, we combine and feed them into an MLP with sigmoid activation to obtain the channel-wise intra-frame attention Ψ_t^c ∈ R^{1×1×C} as follows:

    Φ_t^c = AvgPool^c(ϕ_t) ⊕ MaxPool^c(ϕ_t)
    Θ_t^c = tanh(W_cΦ Φ_t^c ⊕ W_ch h_{t−1}^c)
    Ψ_t^c = σ(W_co Θ_t^c)                                                  (5)

where σ indicates the sigmoid function and ⊕ denotes the element-wise addition. W_cΦ, W_ch and W_co are weights used to achieve a balance between the dimensions of the channel-wise descriptors and the channel-wise intra-frame attention.
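The channel-wise branch can be sketched as follows (illustrative NumPy only; the MLP hidden size is kept equal to C for brevity, whereas the paper balances the dimensions with W_cΦ, W_ch and W_co):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(phi_t, h_c_prev, W_phi, W_h, W_o):
    """Eq. 5: channel-wise intra-frame attention Psi_t^c.

    phi_t    : (H, W, C) input feature map
    h_c_prev : (C,)      channel-wise inter-frame attention from frame t-1
    W_phi, W_h, W_o : (C, C) weights of the MLP (dimensions kept equal for brevity)
    """
    avg_c = phi_t.mean(axis=(0, 1))          # AvgPool^c, (C,)
    max_c = phi_t.max(axis=(0, 1))           # MaxPool^c, (C,)
    Phi_c = avg_c + max_c                    # combine the two channel descriptors
    Theta_c = np.tanh(Phi_c @ W_phi + h_c_prev @ W_h)
    Psi_c = sigmoid(Theta_c @ W_o)           # (C,), broadcast over spatial positions
    return Psi_c
```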
Intra-frame attention along the spatial axis.

We utilize the spatial intra-frame attention to highlight target-specific details and enhance the capability for target object localization. Given the input feature ϕ_t ∈ R^{W×H×C} and the spatial inter-frame attention h_t^s in the current frame, we first combine two different pooled spatial context descriptors, AvgPool^s(ϕ_t) ∈ R^{W×H×1} and MaxPool^s(ϕ_t) ∈ R^{W×H×1}. Then, we feed the combination into an MLP with sigmoid activation to generate the spatial intra-frame attention Ψ_t^s ∈ R^{W×H×1},

    Φ_t^s = AvgPool^s(ϕ_t) ⊕ MaxPool^s(ϕ_t)
    Θ_t^s = tanh(W_sΦ Φ_t^s ⊕ W_sh h_t^s)
    Ψ_t^s = σ(W_so Θ_t^s)                                                  (6)

where W_sΦ, W_sh and W_so are the parameters for balancing the dimensions of Φ_t^s and Θ_t^s, σ represents the sigmoid function, and ⊕ denotes the element-wise addition.
Reinforced attentional representation.

The hierarchical attentional representation ϕ_t^a can be computed using both the inter- and intra-frame attention as follows:

    ϕ_t^a = ϕ_t ⊗ Ψ_t^s ⊗ Ψ_t^c ⊕ ϕ_t                                      (7)

where ⊕ and ⊗ indicate the element-wise addition and broadcasting multiplication, respectively. Finally, we merge those hierarchical attentional representations from coarse to fine to obtain the reinforced attentional representation using a refinement model, as shown in Fig. 6.

Figure 6: Structure of the refinement model.
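Analogously to the channel-wise branch, the spatial branch (Eq. 6) and the attention fusion with its residual connection (Eq. 7) can be sketched as follows; the per-position weights are scalars here purely for brevity, whereas the paper uses an MLP whose dimensions are balanced by W_sΦ, W_sh and W_so, and the names are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def spatial_attention(phi_t, h_s, w_phi, w_h, w_o):
    """Eq. 6: spatial intra-frame attention Psi_t^s of shape (H, W, 1).

    phi_t : (H, W, C) input feature map;  h_s : (H, W, 1) spatial inter-frame attention
    w_phi, w_h, w_o : scalar per-position weights (simplification of the MLP)
    """
    avg_s = phi_t.mean(axis=-1, keepdims=True)    # AvgPool^s, (H, W, 1)
    max_s = phi_t.max(axis=-1, keepdims=True)     # MaxPool^s, (H, W, 1)
    Phi_s = avg_s + max_s
    Theta_s = np.tanh(Phi_s * w_phi + h_s * w_h)
    return sigmoid(Theta_s * w_o)

def reinforced_representation(phi_t, Psi_s, Psi_c):
    """Eq. 7: phi_t^a = phi_t (x) Psi_t^s (x) Psi_t^c (+) phi_t (residual connection)."""
    return phi_t * Psi_s * Psi_c + phi_t          # Psi_c of shape (C,) broadcasts
```

The attentional maps produced this way at the conv3, conv4 and conv5 stages are then merged coarse-to-fine by the refinement model of Fig. 6.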
Unlike traditional DCF-based tracking approaches [3, 4, 12, 13], we make some essential modifications to the DCF to utilize the contextual attention in consecutive frames. We choose the context-aware correlation filter (CACF) [41] as the base of our decision module. Because the background around the target object may impact tracking performance, CACF takes the global contextual information into account and demonstrates outstanding discriminative capability. We crop a target template patch z_0 and k context template patches {z_i | i = 1, 2, ..., k} around z_0 from the target template z. Noteworthily, we use a set of target templates from T frames to learn a DCF that has a high response on the target template patch and a response close to zero on all context template patches,

    min_w Σ_{t=1}^{T} β_t ( ‖g(z_{0,t}) − y‖² + λ_1 ‖w‖² + λ_2 Σ_{i=1}^{k} ‖g(z_{i,t})‖² )                  (8)

where β_t ≥ 0 is the temporal weight of the target template of frame t, y is the desired correlation response designed as a Gaussian function centered at the target object position estimated in the previous video frame, λ_1 is a regularization parameter, and λ_2 controls how strongly the context patches regress to zero. Note that the objective in Eq. 8 is convex; it has a closed-form solution given by setting the gradient to zero [43] as follows:

    w = Σ_{t=1}^{T} β_t ( Z̄_t^T Z̄_t + λ_1 I )^{−1} Z̄_t^T ȳ                                                  (9)

where Z̄_t = [Z_{0,t}, √λ_2 Z_{1,t}, √λ_2 Z_{2,t}, ..., √λ_2 Z_{k,t}]^T is a feature matrix, Z_{0,t} and Z_{i,t} are circulant feature matrices [3] corresponding to z_{0,t} and z_{i,t}, respectively, and ȳ = [y, 0, 0, ..., 0]^T is the regression objective. For more details, please refer to [41]. The closed-form solution of Eq. 9 in the Fourier frequency domain can be obtained as follows:

    ŵ = Σ_{t=1}^{T} β_t ( ϕ̂(z_{0,t}) ⊙ ŷ ) / Σ_{t=1}^{T} β_t ( ϕ̂*(z_{0,t}) ⊙ ϕ̂(z_{0,t}) + λ_1 + λ_2 Σ_{i=1}^{k} ϕ̂*(z_{i,t}) ⊙ ϕ̂(z_{i,t}) )      (10)

where ⊙ denotes the Hadamard product, ŵ indicates the discrete Fourier transform F(w), and ϕ̂* represents the complex conjugate of ϕ̂.

Subsequently, the correlation response g in Eq. 2 can be calculated by performing an exhaustive matching of w over x in the Fourier domain as follows:

    g(x) = F^{−1}( ŵ ⊙ ϕ̂_a(x) )                                                                              (11)

where F^{−1}(·) denotes the inverse discrete Fourier transform. Finally, the current target object position can be identified by searching for the maximum value of g.

Notably, we formulate the contextual attentional DCF w as a differentiable correlation layer to achieve end-to-end training of the whole network and online updating of the filters. These capabilities further enhance the adaptability of our approach to variations in the target object appearance. Therefore, the network can be trained by minimizing the differences between the real response g and the desired response y of x. The loss function is formulated as follows:

    L = ‖g(x) − y‖²                                                                                          (12)

The back-propagation of the loss with respect to the searching and template patches is computed as follows:

    ∂L/∂ϕ(x) = F^{−1}( (ĝ(x) − ŷ) ⊙ ŵ )
    ∂L/∂ϕ(z) = F^{−1}( ((ĝ(x) − ŷ) ⊙ ẑ) ⊙ (ŷ − ŵ ⊙ ϕ̂(z)) / ( ϕ̂*(z) ⊙ ϕ̂(z) + λ_1 + λ_2 Σ_{i=1}^{k} ϕ̂*(z_i) ⊙ ϕ̂(z_i) ) )
    ∂L/∂ϕ(z_i) = F^{−1}( ((ŷ − ĝ(x)) ⊙ ẑ_i) ⊙ (ŵ ⊙ ϕ̂(z_i)) / ( ϕ̂*(z) ⊙ ϕ̂(z) + λ_1 + λ_2 Σ_{i=1}^{k} ϕ̂*(z_i) ⊙ ϕ̂(z_i) ) )      (13)

Once the back-propagation of the correlation layer is derived, our network can be trained end-to-end. The contextual attentional DCF w is incrementally updated during tracking as formulated in Eq. 10.
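A minimal single-channel NumPy sketch of the filter update (Eq. 10) and of the response computation (Eq. 11) is given below. It follows the equations as written above, omitting the multi-channel extension and any alternative conjugation conventions for brevity; all names are illustrative and the sketch is not the authors' implementation.

```python
import numpy as np

def learn_contextual_dcf(feat_targets, feat_contexts, y, betas, lam1, lam2):
    """Closed-form solution of Eq. 10 in the Fourier domain (single feature channel).

    feat_targets : list over frames t of (H, W) target-template features phi_a(z_{0,t})
    feat_contexts: list over frames t of lists of (H, W) context-patch features phi_a(z_{i,t})
    y            : (H, W) desired Gaussian response
    betas        : per-frame temporal weights beta_t
    """
    y_hat = np.fft.fft2(y)
    num = np.zeros_like(y_hat)
    den = np.zeros_like(y_hat)
    for beta, z0, ctxs in zip(betas, feat_targets, feat_contexts):
        z0_hat = np.fft.fft2(z0)
        num += beta * (z0_hat * y_hat)
        ctx_energy = sum(np.conj(np.fft.fft2(zi)) * np.fft.fft2(zi) for zi in ctxs)
        den += beta * (np.conj(z0_hat) * z0_hat + lam1 + lam2 * ctx_energy)
    return num / den                              # \hat{w}

def track(w_hat, feat_search):
    """Eq. 11: g(x) = F^{-1}(w_hat .* phi_hat_a(x)); the peak locates the target."""
    g = np.real(np.fft.ifft2(w_hat * np.fft.fft2(feat_search)))
    return np.unravel_index(np.argmax(g), g.shape)
```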
4. Experiments
In this section, we first present the implementation details of our proposed approach. Then, we compare the proposed approach with state-of-the-art trackers on four modern benchmark datasets, including OTB-2013 with 50 videos [37], OTB-2015 with 100 videos [38], and VOT-2016 [39] and VOT-2017 [40], each with 60 videos. Finally, we conduct ablation studies to investigate how the proposed components improve tracking performance.

We implement our proposed tracker in Python using MXNet [44] on an Amazon EC2 instance with an Intel Xeon E5 CPU @ 2.3 GHz, 61 GB RAM, and an NVIDIA Tesla K80 GPU with 12 GB VRAM. The average speed of the proposed tracker is 37 fps. We train the network with stochastic gradient descent (SGD) using a gradually decayed learning rate and weight decay, and the regularization parameters λ_1 and λ_2 of the contextual attentional DCF are kept fixed throughout the experiments. During training, the target template and searching candidates are cropped with a padding size of 2× from two frames picked randomly from the sequence of the same target object, and then resized to a standard input size of 225 × 225 × 3. Moreover, to deal with scale variations, we generate a proposal pyramid with three scales {a^s | s ∈ (⌊−(S−1)/2⌋, ..., ⌊(S−1)/2⌋), S = 3} times the previous target object size, where a > 1 is a fixed scale step.
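The scale search can be sketched as follows (illustrative NumPy; the scale step a is an assumed value, since only the number of scales S = 3 is stated above). Each crop is resized to the network input size, scored by the correlation response, and the scale whose response peak is highest is kept.

```python
import numpy as np

def scale_pyramid(frame, center, target_size, S=3, a=1.05):
    """Scale proposals {a^s | s = -(S-1)/2, ..., (S-1)/2} around the previous target size.

    frame       : (H, W, 3) image of the current frame
    center      : (cy, cx) previous target center
    target_size : (h, w) previous target size
    a           : assumed scale step (exact value not given in the text)
    Returns a list of (scale_factor, crop) pairs to be resized and scored.
    """
    cy, cx = center
    h, w = target_size
    crops = []
    for s in range(-(S - 1) // 2, (S - 1) // 2 + 1):
        f = a ** s
        hh, ww = int(round(h * f)), int(round(w * f))
        y0, x0 = max(0, cy - hh // 2), max(0, cx - ww // 2)
        crops.append((f, frame[y0:y0 + hh, x0:x0 + ww]))
    return crops
```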
Table 1: Architecture of the backbone network. More details of each building block are shown in brackets.

stage      output size   blocks                                        stride
input      255 × 255     –                                             –
conv1      127 × 127     7 × 7, 64                                     2
maxpool1   63 × 63       3 × 3                                         –
conv2      –             [1 × 1, 64; 3 × 3, 64; 1 × 1, 256] × 3        –
conv3      –             [1 × 1, 128; 3 × 3, 128; 1 × 1, 512] × 4      –
conv4      –             [1 × 1, 256; 3 × 3, 256; 1 × 1, 1024] × 6     –
conv5      –             [1 × 1, 512; 3 × 3, 512; 1 × 1, 2048] × 3     –
OTB-2013 [37] and OTB-2015 [38] are two popular visual tracking benchmark datasets. On these benchmarks, the RAR tracker is compared with recent real-time (≥ 25 fps) trackers, including DaSiamRPN [46], SiamTri [47], SA-Siam [48], SiamRPN [16], TRACA [49], EDCF [36], CACF [41], CFNet [15], SiamFC [14], and HCF [18]. We exploit two evaluation metrics, the distance precision (DP) and the overlap success rate (OSR), for comparison. The DP is defined as the percentage of frames in which the Euclidean distance between the estimated target position and the ground-truth is smaller than a preset threshold of 20 pixels, while the OSR is the ratio of successful frames whose bounding-box overlap exceeds a given threshold within the range [0, 1]; the area under the curve (AUC) of the success plot is exploited to rank the different trackers.
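Both metrics follow the standard OTB convention and can be computed from per-frame center errors and bounding-box overlaps, for example as in the short illustrative sketch below.

```python
import numpy as np

def distance_precision(center_errors, threshold=20.0):
    """DP: fraction of frames whose center location error is below `threshold` pixels."""
    e = np.asarray(center_errors, dtype=float)
    return float((e <= threshold).mean())

def success_auc(overlaps, num_thresholds=101):
    """AUC of the success plot: mean success rate over IoU thresholds in [0, 1]."""
    o = np.asarray(overlaps, dtype=float)
    thresholds = np.linspace(0.0, 1.0, num_thresholds)
    success = [(o > t).mean() for t in thresholds]
    return float(np.mean(success))
```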
The evaluation results are illustrated in Table 2. On the OTB-2013 benchmark dataset, the proposed tracker achieves the best AUC score, above 68%, among all the compared trackers, and obtains clear absolute gains over the competing Siamese trackers. Compared with TRACA, our RAR tracker obtains an absolute gain of 2.4% in the AUC score. This is because the proposed hierarchical attention strategy used in RAR can highlight informative representations and suppress redundancy better than the context-aware deep feature compression scheme employed by TRACA.
Table 2: Comparisons with recent real-time (≥ 25 fps) state-of-the-art tracking approaches on the OTB benchmarks using AUC and precision metrics. The best three scores are highlighted in red, blue and green fonts, respectively.

Trackers        OTB-2013 AUC   OTB-2013 DP   OTB-2015 AUC   OTB-2015 DP   Speed (FPS)
RAR (ours)      –              –             0.664          –             37
SiamTri [47]    0.615          0.815         0.590          0.781         –
HCF [18]        0.638          0.891         0.562          0.837         –
On the OTB-2015 benchmark dataset, our approach achieves the best AUC score of 66.4% and the second-best DP score, above 87%, among the compared real-time trackers, which verifies the effectiveness of our network architecture, as the performance of a tracking approach mainly depends on the discriminative capacity of the target object representation. As the baselines of our tracker, EDCF and HCF achieve AUC scores of 63.5% and 56.2% on the OTB-2015 benchmark, respectively; RAR outperforms them by 2.9% and 10.2% while still running efficiently at a real-time speed (37 fps).

For a comprehensive evaluation, our approach is also compared with state-of-the-art trackers, including DaSiamRPN [46], SiamRPN [16], SiamFC [14] and HCF [18], on the different attributes of the OTB-2015 benchmark dataset. The video sequences in OTB are annotated with 11 different attributes: illumination variation (IV), out-of-plane rotation (OPR), scale variation (SV), occlusion (OCC), deformation (DEF), motion blur (MB), fast motion (FM), in-plane rotation (IPR), out of view (OV), background clutter (BC) and low resolution (LR). The results are presented in terms of AUC and DP scores in Fig. 7. Although our approach performs worse on three attributes, IPR, OPR, and LR, it achieves impressive performance on the remaining eight attributes. The good performance of the proposed approach can be attributed to two reasons. First, both the inter- and intra-frame attention are effective for selecting more meaningful representations, which accounts for scale and appearance variations. Secondly, with the use of the contextual attentional DCF, the proposed approach can further tackle more complicated scenarios such as background clutters and heavy occlusions.
Figure 7: Performance evaluation of five trackers (RAR, DaSiamRPN, SiamRPN, SiamFC, HCF) on the OTB-2015 benchmark dataset with different attributes, in terms of overlap success (AUC) and distance precision. Each subset of sequences corresponds to one of the attributes. The number in brackets after each attribute acronym is the number of sequences in the corresponding subset.
Figure 8: Comparison of our proposed approach with the state-of-the-art trackers SiamRPN [16], SiamFC [14], and HCF [18] on three challenging video sequences (from top to bottom: carScale, david3, and MotorRolling).

Fig. 8 shows the comparisons of the proposed approach with the excellent trackers SiamRPN [16], SiamFC [14], and HCF [18] on three challenging video sequences from the OTB-2015 benchmark dataset. In the sequence carScale, the target object undergoes SV with FM. All the trackers except the proposed one cannot tackle SV desirably. Both SiamRPN and HCF concentrate on tracking a small part of the target object, while the bounding boxes generated by SiamFC are larger than the ground-truths. In contrast, the proposed approach can trace the target object well. In the sequence david3, the target object is partially occluded in a BC scene. SiamFC drifts quickly when OCC occurs, while the others are able to trace the target object correctly throughout the sequence. The target object in the sequence MotorRolling experiences varying illumination with rotations. Only SiamRPN and RAR can locate the target object accurately. We also show three tracking failure cases of our proposed approach in Fig. 9.
Figure 9: Failure cases on the Jump (IPR occurring at LR), Bird1 (LR) and Panda (OPR occurring at LR) sequences. Red boxes show our results, and the green ones are the ground-truths.

RAR fails in these sequences mainly because the intra-frame attention cannot learn more meaningful geometries and semantics when IPR or OPR occurs in an LR scene (this is different from the OCC case). Thereby, the inter-frame attention plays a dominant role, as it can still model the temporal coherence within consecutive frames, leading the attentional representation to focus on only some vital parts of the target object.

The VOT challenge is the largest annual competition in the field of visual tracking. We compare our tracker with several state-of-the-art trackers on the VOT-2016 [39] and VOT-2017 [40] challenge datasets, respectively. Following the evaluation protocol of VOT, we report the tracking performance in terms of expected average overlap (EAO) scores, as shown in Fig. 10.
The RAR tracker obtains EAO scores of 0.329 and 0.283 on these two datasets, and outperforms SiamFC [14] by absolute gains of more than 9% on both. Consequently, our approach exceeds the state-of-the-art bounds by large margins, and it can be considered a state-of-the-art tracker according to the definition of the VOT committee. All the results demonstrate the effectiveness and efficiency of our proposed tracking approach.

Figure 10: Expected average overlap plot on the VOT datasets. The horizontal dashed lines denote the state-of-the-art bounds according to the VOT committee. (a) EAO scores on VOT-2016: CCOT 0.331, RAR 0.329, TCNN 0.325, MLDF 0.311, Staple 0.295, DeepSRDCF 0.276, MDNet 0.257, SiamFC 0.235, KCF 0.192, DSST 0.181. (b) EAO scores on VOT-2017: LSART 0.323, CFCF 0.286, RAR 0.283, ECO 0.280, CCOT 0.267, ECOhc 0.238, SiamFC 0.188, Staple 0.169, KCF 0.135, DSST 0.079.

We first modify the backbone network and conduct an ablation study on the OTB benchmark datasets to reveal the effects of different configurations and parameters of our tracker, including different combinations of feature hierarchies, with or without deformable convolutions and network fine-tuning, and variations in the output feature size (spatial stride). The results are shown in Table 3.
Table 3: Ablation studies of different configurations of the network backbone (ResNet-50) on the OTB benchmark datasets using AUC and DP scores. C3, C4, and C5 represent the conv3, conv4, and conv5 stages, respectively; additional columns indicate whether deformable convolution is used and whether the backbone is fine-tuned offline. S represents the output spatial stride. The best values are highlighted in bold font.
We empirically discover that neither a single stage nor the combination of two stages achieves competitive performance. After refining the features obtained from all three stages, both the AUC and DP scores are steadily improved, with gains of 1.4% and 1.6% compared with the combination of conv4 and conv5 on the OTB-2015 benchmark dataset. This indicates that the feature hierarchies from conv3, conv4 and conv5 have complementary geometries and semantics useful for target object representation. Note that the tracking performance is boosted impressively when the spatial stride is reduced from 32 or 16 to 8. However, as the spatial stride is further reduced from 8 to 4, the performance drops severely. Theoretically, a larger network spatial stride corresponds to a larger receptive field of the neurons in the output stage. A larger receptive field can cover significantly more image context, but is insensitive for target object localization. On the other hand, a smaller receptive field may not be able to capture the target-specific semantics, which inevitably degrades the discriminative capability. This illustrates that the resolution of the feature hierarchies is crucial for target object localization and discrimination in visual tracking. We exploit deformable convolution [45] in our approach to adaptively model target object transformations and enlarge the receptive fields. Intriguingly, we observe that applying deformable convolution merely in the shallow layer, in the deeper layer, or in the combination of all the convolutional stages achieves only minor performance improvements over the baseline. That is to say, applying deformable convolution in conv4 and conv5 is sufficient to enhance the transformation modeling capability and the adaptability of receptive field learning. In addition, our study indicates that fine-tuning the network backbone is necessary, because it yields a great improvement in tracking performance.
Table 4: Ablation studies of several variations of our tracker on the OTB benchmark datasets using AUC and DP scores, covering the full RAR model and the RAR_VGG, RAR_ResNet, RAR_TDCF, RAR_NHF, RAR_NAA, RAR_NTA, RAR_NCA, and RAR_NSA variants. The best values are highlighted in bold font.
To investigate how each proposed component contributes to improving tracking performance, we then evaluate several variations of our approach on the OTB benchmark datasets, including the tracker incorporating the VGG-M network [9] as the backbone (RAR_VGG); the one deploying the original ResNet-50 network [10] as the backbone (RAR_ResNet); the tracker with the traditional DCF [4] (RAR_TDCF); the tracker without hierarchical convolutional features (RAR_NHF); the tracker not using any attention (RAR_NAA); and the trackers each removing a single type of attention (RAR_NTA means we do not use the inter-frame attention, RAR_NCA means we do not use the channel-wise inter- and intra-frame attention, and RAR_NSA means we do not use the spatial inter- and intra-frame attention). The detailed evaluation results are illustrated in Table 4.

Our full algorithm (RAR) outperforms all those variants. RAR achieves absolute gains of 2.8% and 4.7% in the AUC score compared with RAR_VGG and RAR_ResNet on the OTB-2015 benchmark dataset, respectively. Therefore, it has been proven that our modified backbone network learns more informative target object representations by enhancing the generalization capability. It is worth noting that RAR_ResNet underperforms RAR_VGG with a 1.9% drop. This performance degradation can be directly attributed to the fact that both the receptive field and the output stride of ResNet-50 are too large to capture more useful information, even though the architecture of ResNet-50 is deeper than that of VGG-M. To evaluate the impact of the hierarchical attention mechanism, we remove it and directly use the original deep features from the three stages to represent the target objects. This elimination causes remarkable performance drops, i.e., a degradation of 6.9% in the AUC score, from 0.664 to 0.595, on the OTB-2015 benchmark dataset. It clearly confirms the effectiveness of the combination of inter- and intra-frame attention for emphasizing meaningful representations and suppressing redundant information. Besides, by introducing the differentiable correlation layer, the AUC score is significantly increased, by 1.5%, compared with RAR_TDCF on the OTB-2013 benchmark dataset. This performance gain demonstrates the superiority of the proposed contextual attentional DCF. According to our ablation studies, every component in our approach contributes to improving tracking performance.

5. Conclusions
In this paper, we propose an end-to-end network model that jointly achieves hierarchical attentional representation learning and contextual attentional DCF training for high-performance visual tracking. Specifically, we introduce a hierarchical attention module to learn hierarchical attentional representations using both inter- and intra-frame attention at different convolutional layers, so as to emphasize informative representations and suppress redundant information. Moreover, a contextual attentional correlation layer is incorporated into the network to enhance the tracking performance for accurate target object discrimination and localization. Experimental results clearly demonstrate that our proposed tracker significantly outperforms most state-of-the-art trackers in terms of both accuracy and robustness at a speed above the real-time requirement. Although the proposed tracker has achieved competitive tracking performance, it can be further improved by utilizing multimodal representations, such as natural linguistic features, and more robust backbone networks, such as graph convolutional networks.

References

[1] A. W. Smeulders, D. M. Chu, R. Cucchiara, S. Calderara, A. Dehghan, M. Shah, Visual tracking: An experimental survey, IEEE Transactions on Pattern Analysis and Machine Intelligence 36 (7) (2014) 1442–1468.
[2] P. Li, D. Wang, L. Wang, H. Lu, Deep visual tracking: Review and experimental comparison, Pattern Recognition 76 (2018) 323–338.
[3] J. F. Henriques, R. Caseiro, P. Martins, J. Batista, High-speed tracking with kernelized correlation filters, IEEE Transactions on Pattern Analysis and Machine Intelligence 37 (3) (2015) 583–596.
[4] M. Danelljan, G. Häger, F. S. Khan, M. Felsberg, Discriminative scale space tracking, IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (8) (2017) 1561–1575.
[5] P. Gao, Y. Ma, C. Li, K. Song, Y. Zhang, F. Wang, L. Xiao, Adaptive object tracking with complementary models, IEICE Transactions on Information and Systems E101-D (11) (2018) 2849–2854.
[6] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, F.-F. Li, Imagenet large scale visual recognition challenge, International Journal of Computer Vision 115 (3) (2015) 211–252.
[7] J. Schmidhuber, Deep learning in neural networks: An overview, Neural Networks 61 (2015) 85–117.
[8] A. Krizhevsky, I. Sutskever, G. E. Hinton, Imagenet classification with deep convolutional neural networks, in: Annual Conference on Neural Information Processing Systems (NeurIPS), MIT Press, 2012, pp. 1097–1105.
[9] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv preprint arXiv:1409.1556v6.
[10] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2016, pp. 770–778.
[11] J. Henriques, R. Caseiro, P. Martins, J. Batista, Exploiting the circulant structure of tracking-by-detection with kernels, in: European Conference on Computer Vision (ECCV), Springer-Verlag, 2012, pp. 702–715.
[12] P. Gao, Y. Ma, K. Song, C. Li, F. Wang, L. Xiao, Y. Zhang, High performance visual tracking with circular and structural operators, Knowledge-Based Systems 161 (2018) 240–253.
[13] M. Danelljan, G. Bhat, S. F. Khan, M. Felsberg, Eco: Efficient convolution operators for tracking, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2017.
[14] L. Bertinetto, J. Valmadre, J. Henriques, A. Vedaldi, P. H. S. Torr, Fully-convolutional siamese networks for object tracking, in: European Conference on Computer Vision (ECCV), Springer-Verlag, 2016, pp. 850–865.
[15] J. Valmadre, L. Bertinetto, J. Henriques, A. Vedaldi, P. H. S. Torr, End-to-end representation learning for correlation filter based tracking, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2017, pp. 2805–2813.
[16] B. Li, J. Yan, W. Wu, Z. Zhu, X. Hu, High performance visual tracking with siamese region proposal network, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2018, pp. 8971–8980.
[17] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, C. L. Zitnick, Microsoft coco: Common objects in context, in: European Conference on Computer Vision (ECCV), Springer-Verlag, 2014, pp. 740–755.
[18] C. Ma, J.-B. Huang, X. Yang, M.-H. Yang, Hierarchical convolutional features for visual tracking, in: IEEE International Conference on Computer Vision (ICCV), IEEE, 2015, pp. 3074–3082.
[19] Y. Qi, S. Zhang, L. Qin, H. Yao, Q. Huang, J. Lim, M.-H. Yang, Hedged deep tracking, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2016, pp. 4303–4311.
[20] M. Danelljan, A. Robinson, F. S. Khan, M. Felsberg, Beyond correlation filters: Learning continuous convolution operators for visual tracking, in: European Conference on Computer Vision (ECCV), Springer-Verlag, 2016, pp. 472–488.
[21] X. Song, F. Feng, X. Han, X. Yang, W. Liu, L. Nie, Neural compatibility modeling with attentive knowledge distillation, in: International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), ACM, 2018, pp. 5–14.
[22] N. Zheng, X. Song, Z. Chen, L. Hu, D. Cao, L. Nie, Virtually trying on new clothing with arbitrary poses, in: ACM International Conference on Multimedia (ACMMM), ACM, 2019, pp. 266–274.
[23] Y. Ma, C. Yuan, P. Gao, F. Wang, Efficient multi-level correlating for visual tracking, in: Asian Conference on Computer Vision (ACCV), Lecture Notes in Computer Science 11365, Springer, 2018, pp. 452–465.
[24] P. Gao, Y. Ma, R. Yuan, L. Xiao, F. Wang, Learning cascaded siamese networks for high performance visual tracking, in: IEEE International Conference on Image Processing (ICIP), IEEE, 2019, pp. 3078–3082.
[25] J. Hu, L. Shen, G. Sun, Squeeze-and-excitation networks, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2018.
[26] S. Woo, J. Park, J.-Y. Lee, I. So Kweon, Cbam: Convolutional block attention module, in: European Conference on Computer Vision (ECCV), Springer-Verlag, 2018, pp. 3–19.
[27] A. Lukežič, T. Vojíř, L. Čehovin, J. Matas, M. Kristan, Discriminative correlation filter with channel and spatial reliability, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2017, pp. 4847–4856.
[28] Q. Wang, Z. Teng, J. Xing, J. Gao, W. Hu, S. Maybank, Learning attentions: residual attentional siamese network for high performance online visual tracking, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2018, pp. 4854–4863.
[29] Z. Zhu, W. Wu, W. Zou, J. Yan, End-to-end flow correlation tracking with spatial-temporal attention, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2018.
[30] Z. Cheng, Y. Ding, X. He, L. Zhu, X. Song, M. S. Kankanhalli, A3NCF: An adaptive aspect attention model for rating prediction, in: International Joint Conference on Artificial Intelligence (IJCAI), Morgan Kaufmann, 2018, pp. 3748–3754.
[31] T. Zhuo, Z. Cheng, M. Kankanhalli, Fast video object segmentation via mask transfer network, arXiv preprint arXiv:1908.10717.
[32] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Computation 9 (8) (1997) 1735–1780.
[33] B. Chen, P. Li, C. Sun, D. Wang, G. Yang, H. Lu, Multi attention module for visual tracking, Pattern Recognition 87 (2019) 80–93.
[34] T. Yang, A. B. Chan, Recurrent filter learning for visual tracking, in: IEEE International Conference on Computer Vision (ICCV), IEEE, 2017, pp. 2010–2019.
[35] P. Gao, R. Yuan, F. Wang, L. Xiao, H. Fujita, Y. Zhang, Siamese attentional keypoint network for high performance visual tracking, Knowledge-Based Systems.
[36] Q. Wang, M. Zhang, J. Xing, J. Gao, W. Hu, S. Maybank, Do not lose the details: reinforced representation learning for high performance visual tracking, in: International Joint Conference on Artificial Intelligence (IJCAI), Morgan Kaufmann, 2018, pp. 985–991.
[37] Y. Wu, J. Lim, M.-H. Yang, Online object tracking: A benchmark, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2013, pp. 2411–2418.
[38] Y. Wu, J. Lim, M.-H. Yang, Object tracking benchmark, IEEE Transactions on Pattern Analysis and Machine Intelligence 37 (9) (2015) 1834–1848.
[39] M. Kristan, A. Leonardis, J. Matas, M. Felsberg, R. Pflugfelder, L. Čehovin, et al., The visual object tracking vot2016 challenge results, in: European Conference on Computer Vision (ECCV), Springer-Verlag, 2016.
[40] M. Kristan, A. Leonardis, J. Matas, M. Felsberg, R. Pflugfelder, L. Čehovin Zajc, et al., The visual object tracking vot2017 challenge results, in: IEEE International Conference on Computer Vision (ICCV), IEEE, 2017.
[41] M. Mueller, N. Smith, B. Ghanem, Context-aware correlation filter tracking, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2017, pp. 1396–1404.
[42] J. Choi, H. J. Chang, S. Yun, T. Fischer, Y. Demiris, J. Y. Choi, Attentional correlation filter network for adaptive visual tracking, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2017.
[43] R. Rifkin, G. Yeo, T. Poggio, Regularized least-squares classification, Nato Science Series Sub Series III Computer and Systems Sciences 190 (2003) 131–154.
[44] T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, Z. Zhang, Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems, arXiv preprint arXiv:1512.01274.
[45] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, Y. Wei, Deformable convolutional networks, in: IEEE International Conference on Computer Vision (ICCV), IEEE, 2017, pp. 764–773.
[46] Z. Zhu, Q. Wang, B. Li, W. Wu, J. Yan, W. Hu, Distractor-aware siamese networks for visual object tracking, in: European Conference on Computer Vision (ECCV), Springer-Verlag, 2018, pp. 103–119.
[47] X. Dong, J. Shen, Triplet loss in siamese network for object tracking, in: European Conference on Computer Vision (ECCV), Springer-Verlag, 2018.
[48] A. He, C. Luo, X. Tian, W. Zeng, A twofold siamese network for real-time object tracking, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2018, pp. 4834–4843.
[49] J. Choi, H. Jin Chang, T. Fischer, S. Yun, K. Lee, J. Jeong, Y. Demiris, J. Young Choi, Context-aware deep feature compression for high-speed visual tracking, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2018, pp. 479–488.
[50] C. Sun, D. Wang, H. Lu, M.-H. Yang, Learning spatial-aware regressions for visual tracking, in: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2018, pp. 8962–8970.