AttributeNet: Attribute Enhanced Vehicle Re-Identification
Rodolfo Quispe a,c, Cuiling Lan b, Wenjun Zeng b, Helio Pedrini c

a Microsoft Corp., One Microsoft Way, Redmond, USA, 98052-6399
b Microsoft Research Asia, Beijing, China, 100080
c Institute of Computing, University of Campinas, Brazil, 13083-852

Email addresses: [email protected] (Rodolfo Quispe), [email protected] (Cuiling Lan), [email protected] (Wenjun Zeng), [email protected] (Helio Pedrini)
Abstract
Vehicle Re-Identification (V-ReID) is a critical task that associates the same vehicle across images from different camera viewpoints. Many works explore attribute clues to enhance V-ReID; however, there is usually a lack of effective interaction between the attribute-related modules and the final V-ReID objective. In this work, we propose a new method to efficiently explore discriminative information from vehicle attributes (e.g., color and type). We introduce AttributeNet (ANet) that jointly extracts identity-relevant features and attribute features. We enable the interaction by distilling the ReID-helpful attribute feature and adding it into the general ReID feature to increase the discrimination power. Moreover, we propose a constraint, named Amelioration Constraint (AC), which encourages the feature after adding attribute features onto the general ReID feature to be more discriminative than the original general ReID feature. We validate the effectiveness of our framework on three challenging datasets. Experimental results show that our method achieves state-of-the-art performance.

Keywords: Vehicle re-identification, attribute recognition, interaction, convolutional neural networks, information distillation
1. Introduction
Vehicle Re-Identification (V-ReID) aims to match/associate the same vehicle across images. It has many applications in vehicle tracking and retrieval, and has gained increasing attention in the computer vision community [1, 2]. This task is challenging due to drastic changes in viewpoint and illumination, resulting in small inter-class and large intra-class differences.

Recently, there is a trend to explore additional clues for better V-ReID, such as using semantic maps [3], attributes (e.g., type, color) [4, 5, 6, 7, 8], viewpoints [9], and vehicle parts [9, 10, 11]. In this work, we focus on the exploration of attributes to enhance the discrimination power of feature representations. Attributes are in general invariant to viewpoint changes and robust to environment alterations (see the examples in Figure 1).

Figure 1: Four images of vehicles used for V-ReID. The first and second images belong to the same vehicle; in this case, the color attribute can help overcome the illumination issue to match them. The third and fourth images belong to different vehicles with really similar appearance; in this case, vehicle brand or type can help differentiate them.

Most of the previous attribute-based works [8, 7, 5, 6, 4, 12, 13] share a common characteristic in their design: a global feature representation is extracted from an input image using a backbone network (e.g., ResNet [14]), and this feature is followed by two types of heads, one for re-identification (ReID) and the other for attribute recognition. We refer to this design as the Vanilla-Attribute Design (VAD) and illustrate a representative VAD based Network (VAN) in Figure 2. One direct way to use the VAD for V-ReID is to concatenate the embedding features generated from the backbone (i.e., the global feature) and the attribute-based modules [7, 12].

Figure 2: Illustration of the VAD based Network (VAN) for V-ReID. It is composed of a backbone network that learns to extract information from an input image and n branches to predict attributes based on attention modules. We use this VAN in our ANet as the first part of our framework.

VAD aims to drive the network to learn features that are discriminative for both V-ReID and attribute recognition, where the attributes are in general invariant to viewpoint and illumination changes. However, there is a lack of effective interaction between the attribute-based branches and the V-ReID branch: the attribute modules learn features for attribute recognition but are not explicitly designed to serve V-ReID. Wang et al. [6] explore attributes to generate attention masks, but these masks are used only to filter the information from the global feature instead of introducing the rich attribute representation into the final feature representation.

We propose AttributeNet (ANet) to enrich the interaction between the attribute features and the V-ReID feature. ANet is designed to distill attribute information and add it into the global representation (from the backbone) to generate more discriminative features. Figures 2 and 3 (with input feature maps obtained from the VAN as illustrated in Figure 2) present the proposed ANet. Particularly, we combine the feature maps of different attribute branches to have a unique and generic representation $G$ of all the attributes. We distill the helpful attribute feature from $G$ and compensate it onto the global V-ReID feature $F$ to have the final feature map $J$, where the spatially average pooled feature of $J$ is the final ReID feature for matching. Moreover, we introduce a new supervision objective, named Amelioration Constraint (AC), which encourages the compensated V-ReID feature $J$ to be more discriminative than the V-ReID feature $F$ before the compensation from the attribute feature.

Figure 3: Illustration of the Joint Module. Note that the network to extract the feature maps $F, A_1, \cdots, A_n$ is shown in Figure 2 and is not repeated here. We distill the helpful attribute feature from $G$ and compensate it onto the global V-ReID feature $F$ to have the final feature map $J$, where the spatially average pooled feature of $J$ is the final ReID feature for matching. Moreover, we introduce a new supervision objective, named Amelioration Constraint (AC), which encourages the compensated V-ReID feature $J$ to be more discriminative than the V-ReID feature $F$ before the compensation from the attribute feature.

The main contributions of this work are:
• We propose a new architecture, named ANet, for effective V-ReID, which enhances the interaction between the attribute-supervised modules and the V-ReID branch. This encourages the distilled attribute features to serve V-ReID.

• We introduce an Amelioration Constraint (AC), which encourages the attribute compensated feature to be more discriminative than the V-ReID feature before compensation.

Experiments on three challenging datasets demonstrate the effectiveness of our ANet, which outperforms baselines significantly and achieves state-of-the-art performance.
2. Related Work
For vehicle ReID, many approaches explore Generative Adversarial Networks (GANs) [15], graph networks (GNs) [11, 16], semantic parsing (SP) [3] and vehicle part detection (VPD) [17, 10] to improve the performance. Some of them tend to describe the vehicle details [15] and local regions [17, 10]. PRND [17] and PGAN [10] detect predefined regions (e.g., back mirrors, lights, wheels, etc.) and describe them with deep features. SAVER [15] modifies the input image by erasing the vehicle details using a GAN; this synthetic image is then combined with the input image to create a new version with the details visually enhanced for ReID. Some works aim to handle the drastic viewpoint changes [11, 3]. Liu et al. [11] describe each vehicle view based on semantic parsing and also encode the spatial relationship between views using GNs.

Some works exploit attribute information [7, 6, 18, 13, 12] or combine attributes with other clues [8, 5, 4]. Most of the previous attribute-based works use attribute information to regularize the feature learning [8, 7, 5, 6, 4, 12, 13]. In general, they regress the attribute classes from the backbone features, along with the ReID supervision based on the same backbone features. However, using separate heads for different tasks ignores the interaction between the two tasks, where the attribute branches should serve better ReID.

Our work explores attribute clues by enabling effective interaction between attribute regression and V-ReID. Different from previous methods, we distill helpful attribute information and compensate it into the ReID feature representation to obtain a more discriminative representation.
3. Proposed ANet
Our proposed ANet is designed to exploit attribute information for effective V-ReID. In previous works that use attributes, there is a lack of interaction between the global V-ReID head and the attribute regression heads, so the attribute feature information is not effectively exploited for V-ReID.

To address this issue, we propose ANet (as shown in Figures 2 and 3). It consists of two parts: the VAD based Network (VAN) and the Joint Module (JM). VAN is based on a backbone with two heads, one to learn global V-ReID features and the other to regress attributes. VAN outputs an initial V-ReID feature representation and multiple attribute features from the input image. Then, the JM distills V-ReID-helpful attribute information and compensates it into the global features. JM promotes the interaction between the attribute branches and the V-ReID branch. Furthermore, we propose an Amelioration Constraint (AC), which encourages the attribute compensated feature to be more discriminative than the original V-ReID feature before the compensation.

3.1. VAD based Network (VAN)

The VAD based Network (VAN), shown in Figure 2, aims to learn V-ReID features and regress attributes. This design is similar to previous works, where the attribute branches are expected to drive the learning of robust features, since the attributes are in general invariant to illumination, viewpoints, etc.
Backbone. A backbone network is used to extract a feature map $F(I) \in \mathbb{R}^{h \times w \times c}$ from an input image $I$, where $h$, $w$ and $c$ are the height, width and number of channels of $F(I)$, respectively. We follow previous works and use ResNet [14] as the backbone.
V-ReID Head/Branch. On top of the backbone feature $F(I)$, we append a spatial global average pooling (GAP) layer followed by a fully-connected (FC) layer to generate the V-ReID feature $f(I)$ as

$f(I) = W_f \, \mathrm{GAP}(F(I)) + b_f$,    (1)

where $W_f$ and $b_f$ denote the weights and bias of the FC layer used to reduce the dimension of the pooled feature, $W_f \in \mathbb{R}^{s_f \times c}$ and $b_f \in \mathbb{R}^{s_f}$, where $s_f$ is the predefined dimension of the output. $f(I)$ is supervised by a Triplet Loss $L_{tri}^{f}$ and a Cross Entropy Loss $L_{ID}^{f}$.
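To make this head concrete, the following is a minimal PyTorch sketch of Eq. (1); the module name, default dimensions and identity count (576 matches VeRi776's training IDs) are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class ReIDHead(nn.Module):
    """GAP + FC head of Eq. (1); f(I) feeds the triplet loss, the logits the cross entropy."""
    def __init__(self, c: int = 2048, s_f: int = 512, num_ids: int = 576):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)          # spatial global average pooling
        self.fc = nn.Linear(c, s_f)                 # W_f, b_f of Eq. (1)
        self.classifier = nn.Linear(s_f, num_ids)   # identity classifier for L_ID

    def forward(self, F_map: torch.Tensor):
        f = self.fc(self.gap(F_map).flatten(1))     # f(I) = W_f GAP(F(I)) + b_f
        return f, self.classifier(f)
```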
Attribute Heads/Branches. On top of the backbone feature $F(I)$, we add $n$ attribute branches for attribute classification, where $n$ is the number of available attributes in the training dataset, one branch per attribute. For the $i$-th attribute branch, we use a spatial and channel attention module to obtain the attribute-related feature $A_i(I) \in \mathbb{R}^{h \times w \times c}$ as

$A_i(I) = F(I) \cdot Att_i(F(I))$,    (2)

where $Att_i(I) \in \mathbb{R}^{c}$ denotes the response of the attention module.

To perform classification for the $i$-th attribute, we apply GAP and an FC layer to get a feature vector $a_i$ as

$a_i(I) = W_{a_i} \, \mathrm{GAP}(A_i(I)) + b_{a_i}$,    (3)

where $W_{a_i}$ and $b_{a_i}$ denote the weights and bias of the FC layer, $W_{a_i} \in \mathbb{R}^{s_a \times c}$ and $b_{a_i} \in \mathbb{R}^{s_a}$, where $s_a$ is the predefined size of the output. $a_i(I)$ is followed by a classifier with a cross entropy loss $L_{att}^{i}$ to recognize which class the image belongs to for the $i$-th attribute.

In summary, VAN is trained by minimizing the loss $L_{VAN}$:

$L_{VAN} = L_{tri}^{f} + L_{ID}^{f} + \lambda_A \sum_{i=1}^{n} L_{att}^{i}$,    (4)

where $\lambda_A$ is a hyper-parameter for balancing the importance of the V-ReID loss and the attribute-related losses.

3.2. Joint Module

The Joint Module (JM) is illustrated in Figure 3. JM aims to distill V-ReID-helpful information from the attribute features and compensate it to the V-ReID feature for the final feature matching. First, we merge the attribute feature maps from multiple branches to have a unified attribute feature map $G(I)$. Then, we distill discriminative V-ReID-helpful information from $G(I)$ and compensate it onto $F(I)$ to create a Joint Feature $J(I)$. To encourage a higher discriminative capability of the Joint Feature, we introduce an Amelioration Constraint (AC), which drives the distillation of discriminative information from $G(I)$ to enhance the original V-ReID feature $F(I)$. The JM promotes the interaction between the attribute and V-ReID information to improve the V-ReID performance.

Attribute Feature $G(I)$. To facilitate the distillation of helpful attribute features, we combine all the attribute feature maps $A_i(I)$, $i = 1, \cdots, n$, into a unified attribute feature map $G(I)$. We achieve this by summing the attribute feature maps, followed by a convolution layer with a residual connection:

$G(I) = \sum_{i=1}^{n} A_i(I) + \theta_A\!\left(\sum_{i=1}^{n} A_i(I)\right)$,    (5)

where $\theta_A$ is implemented by a $1 \times 1$ convolution followed by ReLU, i.e., $\theta_A(x) = \mathrm{ReLU}(W_A x)$, $W_A \in \mathbb{R}^{c \times c}$. We omit BN to simplify the notation.

For the combined attribute feature map $G(I)$, we add supervision from attributes to preserve the attribute information. Given $n$ attributes, let $m_i$ be the number of classes for the $i$-th attribute; there are in total $\prod_{i=1}^{n} m_i$ attribute patterns. We apply a GAP layer on $G(I)$ to get the feature vector $g(I)$. Then, a Triplet Loss $L_{tri}^{g}$ is used as supervision to pull together the features of the same attribute pattern and push apart the features of different attribute patterns. We name this supervision the Attribute-based Triplet Loss.

Joint Feature $J(I)$. To distill V-ReID-helpful attribute information from $G(I)$ to enhance $F(I)$, we use two convolution layers to obtain the distilled feature $G_{reid}(I)$:

$G_{reid}(I) = \theta_{g2}(\theta_{g1}(G(I)))$,    (6)

where $\theta_{g1}$ and $\theta_{g2}$ are implemented similarly to $\theta_A$ but with $3 \times 3$ convolutions, i.e., $\theta_{g1}(x) = \mathrm{ReLU}(W_{g1} x)$, $\theta_{g2}(x) = \mathrm{ReLU}(W_{g2} x)$, $W_{g1} \in \mathbb{R}^{c \times c}$ and $W_{g2} \in \mathbb{R}^{c \times c}$.

By adding $G_{reid}(I)$ onto the V-ReID feature $F(I)$, we obtain the Joint Feature $J(I)$ as

$J(I) = F(I) + G_{reid}(I)$.    (7)
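The PyTorch sketch below illustrates Eqs. (2)-(7): an SE-style channel attention per attribute branch and the Joint Module that merges the attribute maps into $G(I)$, distills $G_{reid}(I)$ and adds it onto $F(I)$. The class names are ours, and the $3 \times 3$ kernel size for $\theta_{g1}$/$\theta_{g2}$ is our reading of the (partially garbled) text; this is a sketch under those assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SEAttention(nn.Module):
    """Channel attention Att_i of Eq. (2): an SE block [23] with reduction ratio r."""
    def __init__(self, c: int, r: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(c, c // r), nn.ReLU(inplace=True),
            nn.Linear(c // r, c), nn.Sigmoid())

    def forward(self, F_map: torch.Tensor) -> torch.Tensor:
        w = self.mlp(F_map)                        # (B, c) channel responses
        return F_map * w[:, :, None, None]         # A_i(I) = F(I) . Att_i(F(I))

class JointModule(nn.Module):
    """Eqs. (5)-(7): merge attribute maps into G(I), distill G_reid(I), add onto F(I)."""
    def __init__(self, c: int):
        super().__init__()
        # theta_A: 1x1 convolution + ReLU (BN omitted, as in the paper's notation)
        self.theta_A = nn.Sequential(nn.Conv2d(c, c, 1), nn.ReLU(inplace=True))
        # theta_g1 / theta_g2: the distillation convolutions of Eq. (6)
        self.theta_g1 = nn.Sequential(nn.Conv2d(c, c, 3, padding=1), nn.ReLU(inplace=True))
        self.theta_g2 = nn.Sequential(nn.Conv2d(c, c, 3, padding=1), nn.ReLU(inplace=True))

    def forward(self, F_map: torch.Tensor, attr_maps: list):
        A_sum = torch.stack(attr_maps).sum(dim=0)  # sum_i A_i(I)
        G = A_sum + self.theta_A(A_sum)            # Eq. (5), residual form
        G_reid = self.theta_g2(self.theta_g1(G))   # Eq. (6)
        return G, F_map + G_reid                   # G(I) and J(I) = F(I) + G_reid(I), Eq. (7)
```

In this reading, $g(I)$ and $j(I)$ are obtained by applying GAP (and, for $j$, the FC layer of Eq. (8) below) to the two returned maps.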
$J(I)$ combines the V-ReID information from $F(I)$ and the relevant V-ReID-helpful information from the attributes $G(I)$. Similar to the supervision on $F(I)$, we add a Triplet Loss $L_{tri}^{j}$ and a Cross Entropy Loss $L_{ID}^{j}$ on the spatially average pooled feature $j(I)$, obtained as

$j(I) = W_j \, \mathrm{GAP}(J(I)) + b_j$,    (8)

where $W_j$ and $b_j$ represent the weights and bias of an FC layer, $W_j \in \mathbb{R}^{s_j \times c}$ and $b_j \in \mathbb{R}^{s_j}$, and $s_j$ is the predefined dimension of the output. JM is trained by minimizing $L_{JM}$:

$L_{JM} = L_{tri}^{j} + L_{ID}^{j} + \lambda_G L_{tri}^{g}$,    (9)

where $\lambda_G$ is a hyper-parameter balancing the importance of the compensated V-ReID loss and the attribute-related loss.

Finally, we can train the entire ANet end-to-end by minimizing $L$:

$L = L_{JM} + \lambda L_{VAN}$,    (10)

where $\lambda$ is a hyper-parameter to balance the importance of $L_{JM}$ and $L_{VAN}$.
Amelioration Constraint. To further boost the capabilities of the network, we define the Amelioration Constraint (AC). AC aims to explicitly encourage $j(I)$ to be more discriminative than $f(I)$. We apply AC separately for the cross entropy loss and the triplet loss.

AC for Cross Entropy Loss:
For image $I$, we define it as

$AC_{ID}(I) = \mathrm{softplus}(L_{ID}^{j}(I) - L_{ID}^{f}(I))$,    (11)

where $\mathrm{softplus}(\cdot) = \ln(1 + \exp(\cdot))$ is a monotonically increasing function that helps to reduce the optimization difficulty by avoiding negative values [19]. $L_{ID}^{f}(I)$ and $L_{ID}^{j}(I)$ represent the identity cross entropy loss with respect to the features $f(I)$ and $j(I)$, respectively. Minimizing $AC_{ID}(I)$ encourages the network to have a lower classification error for $j(I)$ than for $f(I)$.

AC for Triplet Loss:
We seek $j(I)$ to represent an enhanced version of $f(I)$, where $j(I)$ has a higher discriminative capability than $f(I)$. Thus, we encourage the feature distance $D(\cdot,\cdot)$ between an anchor sample/image $I$ and a positive sample $I^{+}$ to be smaller w.r.t. feature $j(\cdot)$ than w.r.t. feature $f(\cdot)$. Similarly, we encourage the feature distance between an anchor sample/image $I$ and a negative sample $I^{-}$ to be larger w.r.t. feature $j(\cdot)$ than w.r.t. feature $f(\cdot)$. Then, the AC for the triplet loss, $AC_{tri}$, is defined as

$AC_{tri}(I) = \mathrm{softplus}(D(j(I), j(I^{+})) - D(f(I), f(I^{+}))) + \mathrm{softplus}(D(f(I), f(I^{-})) - D(j(I), j(I^{-})))$.    (12)

We notice that training with $AC_{ID}$ and $AC_{tri}$ end-to-end leads to unstable learning. Thus, we train in two steps. In the first step, we minimize $L$. In the second step, we freeze the backbone (i.e., all operations before $f$) and minimize $L'$. Compared with $L$ in (10), the AC losses are enabled and the losses on feature $f$ are disabled in $L'$:

$L' = L + AC_{tri} + AC_{ID} - \lambda (L_{tri}^{f} + L_{ID}^{f})$.    (13)
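A minimal sketch of the two AC terms follows, assuming per-sample cross entropy values and batched feature vectors are already available; the function names and the Euclidean choice for $D(\cdot,\cdot)$ are our assumptions.

```python
import torch
import torch.nn.functional as F

def ac_id(ce_j: torch.Tensor, ce_f: torch.Tensor) -> torch.Tensor:
    """Eq. (11): penalize samples whose cross entropy on j(I) exceeds that on f(I)."""
    return F.softplus(ce_j - ce_f)

def ac_tri(j, j_pos, j_neg, f, f_pos, f_neg) -> torch.Tensor:
    """Eq. (12): positives should be closer, and negatives farther, under j than under f."""
    dist = lambda a, b: (a - b).pow(2).sum(dim=1).clamp(min=1e-12).sqrt()  # Euclidean D(.,.)
    return (F.softplus(dist(j, j_pos) - dist(f, f_pos)) +
            F.softplus(dist(f, f_neg) - dist(j, j_neg)))
```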
4. Experiments
In this section, we present the datasets used in our experiments, the implementation details, an ablation study and a comparison against the state of the art to validate our proposed method.
4.1. Datasets and Evaluation Protocol

We evaluate our vehicle re-identification method on three challenging benchmark datasets.

• VeRi776 [20]: It contains over 50,000 images of 776 vehicles with 20 camera views. It includes attribute labels for color and type. It considers 576 vehicles for training and 200 vehicles for testing.

• VeRi-Wild [21]: This is the largest vehicle re-identification dataset. It considers 174 camera views, 416,314 images and 40,671 IDs. It includes attribute labels for vehicle model, color and type. The testing set is divided into three sets with 3,000 (small), 5,000 (medium) and 10,000 (large) IDs. This is the most challenging dataset because the images were captured over a period of one month and include severe changes in background, illumination, viewpoint and occlusions.

• Vehicle-ID [13]: It includes 221,763 images of 26,267 vehicles, captured from either front or back views. The training set contains 110,178 images of 13,134 vehicles and the test set contains 111,585 images of 13,133 vehicles. The testing data is further divided into three sets with 200 (small), 1,600 (medium) and 2,400 (large) vehicles. Some images in this dataset have attribute labels for vehicle color and type, but not all of them.

For the first two datasets, the evaluation protocol is based on mean Average Precision (mAP) and the Cumulative Matching Curve (CMC) @1 (rank-1/R1) and @5 (rank-5/R5), as they have fixed gallery and query sets. For Vehicle-ID, we follow the protocol proposed by the authors of the dataset, which randomly chooses one image of each vehicle ID as gallery and the rest as query. The final R1 and R5 results are reported after repeating this process 10 times, as sketched below.
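The following is a small sketch (not the authors' evaluation code) of the Vehicle-ID protocol just described: in each of the 10 repetitions, one random image per ID forms the gallery and the remaining images serve as queries.

```python
import random
from collections import defaultdict

def vehicle_id_split(samples):
    """samples: list of (image_path, vehicle_id) pairs from one test scale."""
    by_id = defaultdict(list)
    for path, vid in samples:
        by_id[vid].append(path)
    gallery, query = [], []
    for vid, paths in by_id.items():
        g = random.choice(paths)                 # one random image per ID as gallery
        gallery.append((g, vid))
        query.extend((p, vid) for p in paths if p != g)
    return gallery, query

# A matcher then ranks gallery features for each query; averaging the resulting
# R1/R5 over 10 calls to vehicle_id_split gives the reported numbers.
```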
4.2. Implementation Details

We follow other works in the literature to implement the backbone for a fair comparison. We use a modified version of ResNet-50 [14] with Instance-Batch Normalization [22] and remove the last pooling layer to obtain the feature map $F(I)$ for an image $I$. Each attention module $Att_i(I)$ is based on SE [23] with a reduction ratio of 16.
For the FC layers, we set $s_a = 128$ and $s_f = s_j$; the loss weights are set as $\lambda = \lambda_A = \lambda_G$. For images without attribute annotations, we do not compute $L_{att}^{i}$ or $L_{tri}^{g}$. We found this works well since we use a batch size of 512 (4 images per ID) and the missing labels are alleviated by the other IDs in the batch. Note that these missing labels do not affect our $AC_{ID}$ and $AC_{tri}$, so ANet can still learn from those cases.

The input images are resized to 256 × 256 pixels and augmented by random horizontal flipping, random zooming and random input erasing [27, 28, 29, 30]. All models are trained on 8 V100 GPUs with NVLink for 210 epochs with Amsgrad. The initial learning rate is set to 0.0006 and is decayed by 0.1 at epochs 60, 120 and 150. The first learning step minimizes $L$ for the first 150 epochs; the second step optimizes $L'$ for 60 epochs. We use $n = 2$ attributes: color (e.g., red, yellow, gray, etc.) and type (e.g., sedan, truck, etc.). During testing, the feature vectors are L2-normalized for matching.
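The schedule above can be expressed compactly in PyTorch. This is only an illustrative sketch of the two-step recipe: the `anet` module, its `backbone` attribute and the `train_one_epoch` callable are placeholders for components not shown here.

```python
import torch

def train_anet(anet: torch.nn.Module, train_one_epoch) -> None:
    """Two-step schedule: epochs 0-149 minimize L; epochs 150-209 minimize L'
    with the backbone (everything before f) frozen."""
    optimizer = torch.optim.Adam(anet.parameters(), lr=6e-4, amsgrad=True)
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[60, 120, 150], gamma=0.1)   # x0.1 LR decay
    for epoch in range(210):
        if epoch == 150:
            for p in anet.backbone.parameters():           # assumes a `backbone` submodule
                p.requires_grad_(False)
        train_one_epoch(anet, optimizer, use_ac=(epoch >= 150))
        scheduler.step()
```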
4.3. Effectiveness of using Attributes on V-ReID

We first evaluate the effects of using attributes in V-ReID and show the comparisons in Table 1. Baseline denotes the scheme that generates the feature $f$ using only the backbone, without any attribute-related designs. VAN denotes the vanilla scheme that explores attributes as shown in Figure 2, using the same backbone as Baseline.
For our VAN, we can use the V-ReID feature $f(I)$ (i.e., VAN (f)), or the concatenation of $f(I)$ and the attribute features $a_i(I)$, $i = 1, \cdots, n$ (i.e., VAN (f⊕a)), in inference.

Table 1: Ablation study on the effectiveness of our designs. We indicate the feature vector used for testing using the symbol in parentheses.

                     VeRi776     Vehicle-ID                                VeRi-Wild
                                 Small        Medium       Large           Small        Medium       Large
Method               mAP   R1    R1    R5     R1    R5     R1    R5        mAP   R1     mAP   R1     mAP   R1
Baseline             78.1  96.1  81.3  94.4   77.7  90.6   75.8  88.5      78.1  94.6   72.2  92.5   64.0  88.7
VAN (f)              78.1  96.6  84.1  96.5   80.4  93.6   78.4  91.8      83.1  94.5   78.3  93.5   70.6  90.0
VAN (f⊕a)            77.3  96.5  81.5  95.0   78.5  92.0   76.3  89.6      81.9  94.1   76.9  93.1   69.2  89.4
ANet (j) w/o AC      79.8  96.9  85.0  96.7   80.9  94.1   79.0  91.8      84.6  96.1   79.9  94.4   72.9  91.5
ANet (j)             80.1  96.9  86.0  97.4   81.9  95.1   79.6  92.7      85.8  95.9   81.0  94.5   73.9  91.6

We can see that: 1) VAN (f), where the attributes regularize the feature learning, outperforms Baseline significantly on Vehicle-ID and VeRi-Wild. Specifically, using attributes improves rank-1 by 0.5% for VeRi776, rank-1 by 2.8% and rank-5 by 3.3% for Vehicle-ID, and mAP by 6.6% and rank-1 by 1.3% for VeRi-Wild; 2) VAN (f⊕a) has lower performance than VAN (f). This is because not all the attribute information $a_i(I)$ is equally important for V-ReID; allocating the relative contribution of each attribute is needed to obtain satisfactory results. Hence, how to distill task-oriented attribute information to efficiently benefit V-ReID is important, which is what our ANet aims to address.

We use VAN as our attribute-based baseline, which is similar to previous works exploiting vehicle attributes. However, previous works usually use simple FC layers instead of attention blocks for the attribute branches. Using attention facilitates the distillation of attribute features. As shown in Table 2, using attention outperforms using FC layers by 1.2% at rank-1 on Vehicle-ID, and by 1.4% and 1% in mAP on VeRi776 and VeRi-Wild, respectively.

Table 2: Comparison of implementation choices for the attribute branches of the attribute-based baseline. fc represents an implementation using fully connected layers and att an implementation using SE attention blocks. Results for Vehicle-ID and VeRi-Wild are reported on their small-scale test sets.

          VeRi776      Vehicle-ID   VeRi-Wild
Method    mAP   R1     R1    R5     mAP   R1
fc        76.7  95.8   83.3  96.0   82.1  94.3
att       78.1  96.6   84.1  96.5   83.1  94.5
4.4. Effectiveness of ANet

We propose ANet to distill attribute information for more effective V-ReID. Here we study the effectiveness of our Joint Module design and the AC losses. Table 1 shows the comparisons. We can see that: (i) our final scheme ANet (j) significantly outperforms the basic network VAN (f), by 2.0% in mAP on VeRi776, by 1.9%/1.5%/1.2% at rank-1 on the Small/Medium/Large scales of Vehicle-ID, and by 2.7%/2.7%/3.3% in mAP on the Small/Medium/Large scales of VeRi-Wild; (ii) our proposed AC losses, which encourage higher discrimination after the compensation of the distilled attribute feature than before, are very helpful to promote the distillation of discriminative information from the attribute feature for the V-ReID purpose. These results show that the interaction between the V-ReID and attribute features of VAN improves the network performance, thanks to the distillation of V-ReID-oriented attribute features.

To better understand the effects of ANet, we visualize the activation maps of $G(I)$ and $G_{reid}(I)$ and show some examples in Figure 4. $G(I)$ encodes generic features of the attributes, where the activations are flatter and do not have a special focus on the vehicle parts. In contrast, $G_{reid}(I)$ represents the portion of the information of $G(I)$ that is helpful for V-ReID; we can observe that its activation maps focus more on the vehicle.

Figure 4: Comparison of activation maps. The first row shows the input images, and the second and third rows their corresponding activation maps for $G(I)$ (attribute features) and $G_{reid}(I)$ (attribute features oriented to V-ReID), respectively. The first column is the query image; the second to sixth columns show the vehicles retrieved at rank-1 to rank-5.

4.5. Comparison with the State of the Art

We compare our method with approaches that also use attribute information [4, 32, 5, 6, 7, 8]. We also compare with the most recent approaches that leverage clues/techniques such as vehicle parsing maps [3], vehicle parts [10, 17], GANs [15], Teacher-Student (TS) distillation [33, 34], camera viewpoints [9, 34], and Graph Networks (GN) [16, 11]. HPGN creates a pyramid of spatial graph networks to explore the spatial significance of the backbone tensor. PCRNet studies the correlation between parsed vehicle parts through a graph network. VAnet [9] learns two metrics, for similar viewpoints and for different viewpoints, in two feature spaces.

We also compare against FastReid [35], a strong baseline network for re-identification that performs an extensive search of hyper-parameters and augmentation methods and uses several architecture design tricks to achieve excellent performance. We also implemented our design on top of it by taking it as our backbone, which we name ANet + FastReid. Note that the reported results of FastReid were obtained by running their released code.

Tables 3, 4 and 5 show the comparisons on VeRi776, Vehicle-ID, and VeRi-Wild, respectively.

Table 3: Comparison of our proposed method against the state of the art on VeRi776. The first and second best results are marked in bold and underlined, respectively.

Method                  Clues             mAP   R1    R5
PAMAL [31]              attributes        45.0  72.0  88.8
MADVR [32]              attributes        61.1  89.2  94.7
DF-CVTC [4]             attributes        61.0  91.3  95.7
PAMTRI [5]              attributes        71.8  92.8  96.9
AGNet [6]               attributes        71.5  95.6  96.5
SAN [7]                 attributes        72.5  93.3  97.1
StRDAN [8]              attributes        76.1  –     –
VAnet [9]               viewpoint         66.3  89.7  95.9
PRND [17]               veh. parts        74.3  94.3  98.6
UMTS [33]               TS                75.9  95.8  –
PCRNet [11]             GN + parsing      78.6  95.4  98.4
SAVER [15]              GAN               79.6  96.4  98.6
PVEN [3]                parsing           79.5  95.6  98.4
HPGN [16]               GN                80.1  96.7  –
VKD [34]                viewpoint + TS
FastReid [35]           backbone          81.0  97.1  98.3
ANet + FastReid (Ours)  attributes        81.2  96.8  98.4
VeRi776. Compared with attribute-based methods (first group in Table 3), our scheme
ANet + FastReid outperforms the best results in this group by 5.1% in mAP and by more than 1% at rank-1 and rank-5. Compared with the methods using other clues, ours achieves the best mAP.

Table 4: Comparison of our proposed method against the state of the art on Vehicle-ID. The first and second best results are marked in bold and underlined, respectively.
                                        Small         Medium        Large
Method                  Clues           R1    R5      R1    R5      R1    R5
PAMAL [31]              attributes      67.7  87.9    61.5  82.7    54.5  77.2
AGNet [6]               attributes      71.1  83.7    69.2  81.4    65.7  78.2
DF-CVTC [4]             attributes      75.2  88.1    72.1  84.3    70.4  82.1
SAN [7]                 attributes      79.7  94.3    78.4  91.3    75.6  88.3
PRND [17]               veh. parts      78.4  92.3    75.0  88.3    74.2  86.4
SAVER [15]              GAN             79.9  95.2    77.6  91.1    75.3  88.3
UMTS [33]               TS              80.9  –       78.8  –       76.1  –
PVEN [3]                parsing         84.7  97.0    80.6  94.5    77.8  92.0
PCRNet [11]             GN + parsing    86.6  –       79.9  –       77.3  –
Baseline                attributes      81.3  94.4    77.7  90.6    75.8  88.5
ANet (Ours)             attributes      86.0  97.4    81.9  95.1    79.6  92.7
FastReid [35]           backbone        85.5  97.4    81.8  95.3    79.9  93.8
ANet + FastReid (Ours)  attributes      87.9  97.8    82.8  96.2
Table 5: Comparison of our proposed method against the state of the art on VeRi-Wild. The first and second best results are marked in bold and underlined, respectively.

                                        Small               Medium              Large
Method                  Clues           mAP   R1    R5      mAP   R1    R5      mAP   R1    R5
UMTS [33]               TS              82.8  84.5  –       66.1  79.3  –       54.2  72.8  –
HPGN [16]               GN              80.4  91.3  –       75.1  88.2  –       65.0  82.6  –
PCRNet [11]             GN + parsing    81.2  92.5  –       75.3  89.6  –       67.1  85.0  –
SAVER [15]              GAN             80.9  94.5  98.1    75.3  92.7  97.4    67.7  89.5  95.8
PVEN [3]                parsing         82.5
Baseline                attributes      78.1  94.6  98.5    72.2  92.5  97.3    64.0  88.7  95.6
ANet (Ours)             attributes      85.8  95.9  99.0    81.0  94.5  98.1    73.9  91.6  96.7
FastReid [35]           backbone        84.8  95.7  98.9    80.0  94.5  98.1    73.2  91.5  96.7
ANet + FastReid (Ours)  attributes
Vehicle-ID. Our method outperforms the attribute-based methods (first group in Table 4) consistently. For rank-1, our scheme ANet + FastReid outperforms the best attribute-based method (SAN) by 8.2% and 4.4% on the small and medium scales, respectively. When compared with methods using other clues, ours achieves the best results on the large set and competitive performance on the other sets.

VeRi-Wild. Previous attribute-based methods have not yet reported results for this latest dataset. From Table 5, we can see that our schemes ANet and ANet + FastReid achieve the best performance in mAP. PVEN [3] is a method based on semantic parsing to describe each vehicle view and region. It has better results at rank-1/rank-5, but it is not as competitive as on the two previous datasets.

We observed that none of the existing methods consistently achieves the best results on all the datasets. This may be because different datasets pose different main challenges. Our proposed ANet shows a more consistent state-of-the-art performance on all the datasets, thanks to the generic capabilities of attributes for V-ReID.
5. Conclusions
In this work, we proposed ANet, a novel framework that leverages attribute information for vehicle re-identification. ANet addresses the lack of interaction between the V-ReID features and the attribute features found in previous methods. In particular, we encourage the network to distill task-oriented information from the attribute branches and compensate it into the global V-ReID feature to enhance the discrimination capability of the feature. Evaluation on three datasets shows the effectiveness of our method.

Acknowledgments
This work was done while the first author was affiliated with Microsoft Corp. We are thankful to Microsoft Research and the São Paulo Research Foundation (FAPESP).

References

[1] M. Naphade, Z. Tang, M.-C. Chang, D. C. Anastasiu, A. Sharma, R. Chellappa, S. Wang, P. Chakraborty, T. Huang, J.-N. Hwang, The 2019 AI City Challenge, in: IEEE Computer Vision and Pattern Recognition Conference Workshops, 2019, pp. 452–460.
[2] S. D. Khan, H. Ullah, A Survey of Advances in Vision-based Vehicle Re-Identification, Computer Vision and Image Understanding 182 (2019) 50–63.
[3] D. Meng, L. Li, X. Liu, Y. Li, S. Yang, Z.-J. Zha, X. Gao, S. Wang, Q. Huang, Parsing-based View-aware Embedding Network for Vehicle Re-Identification, in: IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 7103–7112.
[4] A. Zheng, X. Lin, C. Li, R. He, J. Tang, Attributes Guided Feature Learning for Vehicle Re-identification, arXiv preprint arXiv:1905.08997 (2019).
[5] Z. Tang, M. Naphade, S. Birchfield, J. Tremblay, W. Hodge, R. Kumar, S. Wang, X. Yang, PAMTRI: Pose-Aware Multi-Task Learning for Vehicle Re-Identification using Highly Randomized Synthetic Data, in: IEEE International Conference on Computer Vision, 2019, pp. 211–220.
[6] H. Wang, J. Peng, D. Chen, G. Jiang, T. Zhao, X. Fu, Attribute-guided Feature Learning Network for Vehicle Re-identification, arXiv preprint arXiv:2001.03872 (2020).
[7] J. Qian, W. Jiang, H. Luo, H. Yu, Stripe-based and Attribute-Aware Network: A Two-Branch Deep Model for Vehicle Re-Identification, Measurement Science and Technology (2020).
[8] S. Lee, E. Park, H. Yi, S. Hun Lee, StRDAN: Synthetic-to-Real Domain Adaptation Network for Vehicle Re-identification, in: IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2020, pp. 608–609.
[9] R. Chu, Y. Sun, Y. Li, Z. Liu, C. Zhang, Y. Wei, Vehicle Re-Identification with Viewpoint-aware Metric Learning, in: IEEE International Conference on Computer Vision, 2019, pp. 8282–8291.
[10] X. Zhang, R. Zhang, J. Cao, D. Gong, M. You, C. Shen, Part-Guided Attention Learning for Vehicle Re-identification, arXiv preprint arXiv:1909.06023 (2019).
[11] X. Liu, W. Liu, J. Zheng, C. Yan, T. Mei, Beyond the Parts: Learning Multi-view Cross-part Correlation for Vehicle Re-identification, ACM Multimedia (2020).
[12] X. Liu, S. Zhang, Q. Huang, W. Gao, RAM: A Region-Aware Deep Model for Vehicle Re-identification, in: IEEE International Conference on Multimedia and Expo, 2018, pp. 1–6.
[13] H. Liu, Y. Tian, Y. Yang, L. Pang, T. Huang, Deep Relative Distance Learning: Tell the Difference between Similar Vehicles, in: IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2167–2175.
[14] K. He, X. Zhang, S. Ren, J. Sun, Deep Residual Learning for Image Recognition, in: IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[15] P. Khorramshahi, N. Peri, J.-C. Chen, R. Chellappa, The Devil is in the Details: Self-Supervised Attention for Vehicle Re-Identification, IEEE European Conference on Computer Vision (2020).
[16] F. Shen, J. Zhu, X. Zhu, Y. Xie, J. Huang, Exploring Spatial Significance via Hybrid Pyramidal Graph Network for Vehicle Re-Identification, arXiv preprint arXiv:2005.14684 (2020).
[17] B. He, J. Li, Y. Zhao, Y. Tian, Part-Regularized Near-Duplicate Vehicle Re-Identification, in: IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 3997–4005.
[18] H. Guo, C. Zhao, Z. Liu, J. Wang, H. Lu, Learning Coarse-to-Fine Structured Feature Embedding for Vehicle Re-identification, in: Thirty-Second AAAI Conference on Artificial Intelligence, 2018, pp. 6853–6860.
[19] X. Jin, C. Lan, W. Zeng, Z. Chen, L. Zhang, Style Normalization and Restitution for Generalizable Person Re-Identification, in: IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 3143–3152.
[20] X. Liu, W. Liu, T. Mei, H. Ma, PROVID: Progressive and Multimodal Vehicle Reidentification for Large-Scale Urban Surveillance, IEEE Transactions on Multimedia 20 (2017) 645–658.
[21] Y. Lou, Y. Bai, J. Liu, S. Wang, L. Duan, VERI-Wild: A Large Dataset and a New Method for Vehicle Re-identification in the Wild, in: IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 3235–3243.
[22] X. Pan, P. Luo, J. Shi, X. Tang, Two at Once: Enhancing Learning and Generalization Capacities via IBN-Net, in: European Conference on Computer Vision, 2018, pp. 464–479.
[23] J. Hu, L. Shen, G. Sun, Squeeze-and-Excitation Networks, in: IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7132–7141.
[24] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, Z. Wojna, Rethinking the Inception Architecture for Computer Vision, in: IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2818–2826.
[25] A. Hermans, L. Beyer, B. Leibe, In Defense of the Triplet Loss for Person Re-Identification, arXiv preprint arXiv:1703.07737 (2017).
[26] H. Luo, Y. Gu, X. Liao, S. Lai, W. Jiang, Bag of Tricks and a Strong Baseline for Deep Person Re-identification, in: IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2019.
[27] G. Ghiasi, T.-Y. Lin, Q. V. Le, DropBlock: A Regularization Method for Convolutional Networks, in: Advances in Neural Information Processing Systems, 2018, pp. 10727–10737.
[28] K. Zhou, T. Xiang, Torchreid: A Library for Deep Learning Person Re-Identification in PyTorch, arXiv preprint arXiv:1910.10093 (2019).
[29] K. Zhou, Y. Yang, A. Cavallaro, T. Xiang, Omni-Scale Feature Learning for Person Re-Identification, in: International Conference on Computer Vision, 2019, pp. 3702–3712.
[30] K. Zhou, Y. Yang, A. Cavallaro, T. Xiang, Learning Generalisable Omni-Scale Representations for Person Re-Identification, arXiv preprint arXiv:1910.06827 (2019).
[31] S. Tumrani, Z. Deng, H. Lin, J. Shao, Partial Attention and Multi-Attribute Learning for Vehicle Re-Identification, Pattern Recognition Letters 138 (2020) 290–297.
[32] N. Jiang, Y. Xu, Z. Zhou, W. Wu, Multi-Attribute Driven Vehicle Re-Identification with Spatial-Temporal Re-ranking, in: 25th IEEE International Conference on Image Processing, 2018, pp. 858–862.
[33] X. Jin, C. Lan, W. Zeng, Z. Chen, Uncertainty-Aware Multi-Shot Knowledge Distillation for Image-Based Object Re-Identification, AAAI (2020).
[34] A. Porrello, L. Bergamini, S. Calderara, Robust Re-Identification by Multiple Views Knowledge Distillation, IEEE European Conference on Computer Vision (2020).
[35] L. He, X. Liao, W. Liu, X. Liu, P. Cheng, T. Mei, FastReID: A PyTorch Toolbox for General Instance Re-identification, arXiv preprint arXiv:2006.02631 (2020).