AttributeNet: Attribute Enhanced Vehicle Re-Identification
Rodolfo Quispe a,c, Cuiling Lan b, Wenjun Zeng b, Helio Pedrini c

a Microsoft Corp., One Microsoft Way, Redmond, USA, 98052-6399
b Microsoft Research Asia, Beijing, China, 100080
c Institute of Computing, University of Campinas, Brazil, 13083-852

Email addresses: [email protected] (Rodolfo Quispe), [email protected] (Cuiling Lan), [email protected] (Wenjun Zeng), [email protected] (Helio Pedrini)
Abstract
Vehicle Re-Identification (V-ReID) is a critical task that associates the same vehicle across images from different camera viewpoints. Many works explore attribute clues to enhance V-ReID; however, there is usually a lack of effective interaction between the attribute-related modules and the final V-ReID objective. In this work, we propose a new method to efficiently explore discriminative information from vehicle attributes (e.g., color and type). We introduce AttributeNet (ANet) that jointly extracts identity-relevant features and attribute features. We enable the interaction by distilling the ReID-helpful attribute feature and adding it into the general ReID feature to increase the discrimination power. Moreover, we propose a constraint, named Amelioration Constraint (AC), which encourages the feature after adding attribute features onto the general ReID feature to be more discriminative than the original general ReID feature. We validate the effectiveness of our framework on three challenging datasets. Experimental results show that our method achieves state-of-the-art performance.

Keywords: Vehicle re-identification, attribute recognition, interaction, convolutional neural networks, information distillation
1. Introduction
Vehicle Re-Identification (V-ReID) aims to match/associate the same vehicle across images. It has many applications in vehicle tracking and retrieval, and has gained increasing attention in the computer vision community [1, 2]. This task is challenging due to drastic changes in viewpoint and illumination, resulting in small inter-class and large intra-class differences.

Recently, there is a trend to explore additional clues for better V-ReID, such as using semantic maps [3], attributes (e.g., type, color) [4, 5, 6, 7, 8], viewpoints [9], and vehicle parts [9, 10, 11]. In this work, we focus on the exploration of attributes to enhance the discrimination power of feature representations. Attributes are in general invariant to viewpoint changes and robust to environment alterations (see the examples in Figure 1).

Figure 1: Four images of vehicles used for V-ReID. The first and second images belong to the same vehicle; in this case, the color attribute can help overcome the illumination issue to match them. The third and fourth images belong to different vehicles with really similar appearance; in this case, vehicle brand or type can help differentiate them.

Most of the previous attribute-based works [8, 7, 5, 6, 4, 12, 13] share a common characteristic in their design: a global feature representation is extracted from an input image using a backbone network (e.g., ResNet [14]), and this feature is followed by two types of heads, one for re-identification (ReID) and the other for attribute recognition. We refer to this design as the Vanilla-Attribute Design (VAD) and illustrate a representative VAD based Network (VAN) in Figure 2. One direct way to use the VAD for V-ReID is to concatenate the embedding features generated from the backbone (i.e., the global feature) and the attribute-based modules [7, 12].

Figure 2: Illustration of the VAD based Network (VAN) for V-ReID. It is composed of a backbone network that learns to extract information from an input image and n branches to predict attributes based on attention modules. We use this VAN in our ANet as the first part of our framework.

VAD aims to drive the network to learn features that are discriminative for both V-ReID and attribute recognition, where the attributes are in general invariant to viewpoint and illumination changes. However, there is a lack of effective interaction between the attribute-based branches and the V-ReID branch: the attribute modules learn features for attribute recognition but are not explicitly designed to serve V-ReID. Wang et al. [6] explore attributes to generate attention masks, but these masks are used only to filter the information from the global feature instead of introducing the rich attribute representation into the final feature representation.

We propose AttributeNet (ANet) to enrich the interaction between the attribute features and the V-ReID feature. ANet is designed to distill attribute information and add it into the global representation (from the backbone) to generate more discriminative features. Figures 2 and 3 (with input feature maps obtained from the VAN as illustrated in Figure 2) present the proposed ANet. Particularly, we combine the feature maps of different attribute branches to have a unique and generic representation $G$ of all the attributes. We distill the helpful attribute feature from $G$ and compensate it onto the global V-ReID feature $F$ to have the final feature map $J$, where the spatially average pooled feature of $J$ is the final ReID feature for matching. Moreover, we introduce a new supervision objective, named Amelioration Constraint (AC), which encourages the compensated V-ReID feature $J$ to be more discriminative than the V-ReID feature $F$ before the compensation from the attribute feature.

Figure 3: Illustration of the Joint Module. Note that the network to extract the feature maps $F, A_1, \cdots, A_n$ is shown in Figure 2 and is not repeated here. We distill the helpful attribute feature from $G$ and compensate it onto the global V-ReID feature $F$ to have the final feature map $J$, where the spatially average pooled feature of $J$ is the final ReID feature for matching. Moreover, we introduce a new supervision objective, named Amelioration Constraint (AC), which encourages the compensated V-ReID feature $J$ to be more discriminative than the V-ReID feature $F$ before the compensation from the attribute feature.

The main contributions of this work are:
• We propose a new architecture, named ANet, for effective V-ReID, which enhances the interaction between the attribute-supervised modules and the V-ReID branch. This encourages the distilled attribute features to serve V-ReID.

• We introduce an Amelioration Constraint (AC), which encourages the attribute compensated feature to be more discriminative than the V-ReID feature before compensation.

Experiments on three challenging datasets demonstrate the effectiveness of our ANet, which outperforms baselines significantly and achieves state-of-the-art performance.
2. Related Work
For vehicle ReID, many approaches explore Generative Adversarial Networks (GANs) [15], graph networks (GNs) [11, 16], semantic parsing (SP) [3] and vehicle part detection (VPD) [17, 10] to improve the performance. Some of them tend to describe the vehicle details [15] and local regions [17, 10]. PRND [17] and PGAN [10] detect predefined regions (e.g., back mirrors, lights, wheels, etc.) and describe them with deep features. SAVER [15] modifies the input image by erasing the vehicle details using a GAN; this synthetic image is then combined with the input image to create a new version with the details visually enhanced for ReID. Some works aim to handle the drastic viewpoint changes [11, 3]. Liu et al. [11] describe each vehicle view based on semantic parsing and also encode the spatial relationship between views using GNs.

Some works exploit attribute information [7, 6, 18, 13, 12] or combine attributes with other clues [8, 5, 4]. Most of the previous attribute-based works use attribute information to regularize the feature learning [8, 7, 5, 6, 4, 12, 13]. In general, they regress the attribute classes from the backbone features, along with the ReID supervision based on the same backbone features. However, using separate heads for different tasks ignores the interaction between the two tasks, where the attribute branches should serve better ReID.

Our work explores attribute clues by enabling effective interaction between attribute regression and V-ReID. Different from previous methods, we distill helpful attribute information and compensate it into the ReID feature representation to obtain a more discriminative representation.
3. Proposed ANet
Our proposed ANet is designed to exploit attribute information for effective V-ReID. In previous works that use attributes, there is a lack of interaction between the global V-ReID head and the attribute regression heads, so the attribute feature information is not effectively exploited for V-ReID.

To address this issue, we propose ANet (as shown in Figures 2 and 3). It consists of two parts: the VAD based Network (VAN) and the Joint Module (JM). VAN is based on a backbone with two heads, one to learn global V-ReID features and the other to regress attributes. VAN outputs an initial V-ReID feature representation and multiple attribute features from the input image. Then, the JM distills V-ReID-helpful attribute information and compensates it into the global features. JM promotes the interaction between the attribute branches and the V-ReID branch. Furthermore, we propose an Amelioration Constraint (AC), which encourages the attribute compensated feature to be more discriminative than the original V-ReID feature before the compensation.

3.1. VAD based Network (VAN)

The VAD based Network (VAN), shown in Figure 2, aims to learn V-ReID features and regress attributes. This design is similar to previous works, where the attribute branches are expected to drive the learning of robust features, since the attributes are in general invariant to illumination, viewpoints, etc.
Backbone. A backbone network is used to extract a feature map $F(I) \in \mathbb{R}^{h \times w \times c}$ from an input image $I$, where $h$, $w$ and $c$ are the height, width and number of channels of $F(I)$, respectively. We follow previous works and use ResNet [14] as the backbone.
V-ReID Head/Branch. On top of the backbone feature $F(I)$, we append a spatial global average pooling (GAP) layer followed by a fully-connected (FC) layer to generate the V-ReID feature $f(I)$ as

$f(I) = W_f \, \mathrm{GAP}(F(I)) + b_f$,    (1)

where $W_f$ and $b_f$ denote the weights and bias of the FC layer used to reduce the dimension of the pooled feature, $W_f \in \mathbb{R}^{s_f \times c}$ and $b_f \in \mathbb{R}^{s_f}$, where $s_f$ is the predefined dimension of the output. $f(I)$ is supervised by a Triplet Loss $L_{tri}^{f}$ and a Cross Entropy Loss $L_{ID}^{f}$.
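To make this head concrete, the following is a minimal PyTorch sketch of Eq. (1); the module name, default dimensions and identity count (576 matches VeRi776's training IDs) are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class ReIDHead(nn.Module):
    """GAP + FC head of Eq. (1); f(I) feeds the triplet loss, the logits the cross entropy."""
    def __init__(self, c: int = 2048, s_f: int = 512, num_ids: int = 576):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)          # spatial global average pooling
        self.fc = nn.Linear(c, s_f)                 # W_f, b_f of Eq. (1)
        self.classifier = nn.Linear(s_f, num_ids)   # identity classifier for L_ID

    def forward(self, F_map: torch.Tensor):
        f = self.fc(self.gap(F_map).flatten(1))     # f(I) = W_f GAP(F(I)) + b_f
        return f, self.classifier(f)
```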
Attribute Heads/Branches. On top of the backbone feature $F(I)$, we add $n$ attribute branches for attribute classification, where $n$ is the number of available attributes in the training dataset, one branch per attribute. For the $i$-th attribute branch, we use a spatial and channel attention module to obtain the attribute-related feature $A_i(I) \in \mathbb{R}^{h \times w \times c}$ as

$A_i(I) = F(I) \cdot Att_i(F(I))$,    (2)

where $Att_i(I) \in \mathbb{R}^{c}$ denotes the response of the attention module.

To perform classification for the $i$-th attribute, we apply GAP and an FC layer to get a feature vector $a_i$ as

$a_i(I) = W_{a_i} \, \mathrm{GAP}(A_i(I)) + b_{a_i}$,    (3)

where $W_{a_i}$ and $b_{a_i}$ denote the weights and bias of the FC layer, $W_{a_i} \in \mathbb{R}^{s_a \times c}$ and $b_{a_i} \in \mathbb{R}^{s_a}$, where $s_a$ is the predefined size of the output. $a_i(I)$ is followed by a classifier with a cross entropy loss $L_{att}^{i}$ to recognize which class the image belongs to for the $i$-th attribute.

In summary, VAN is trained by minimizing the loss $L_{VAN}$:

$L_{VAN} = L_{tri}^{f} + L_{ID}^{f} + \lambda_A \sum_{i=1}^{n} L_{att}^{i}$,    (4)

where $\lambda_A$ is a hyper-parameter for balancing the importance of the V-ReID loss and the attribute-related losses.

3.2. Joint Module

The Joint Module (JM) is illustrated in Figure 3. JM aims to distill V-ReID-helpful information from the attribute features and compensate it to the V-ReID feature for the final feature matching. First, we merge the attribute feature maps from multiple branches to have a unified attribute feature map $G(I)$. Then, we distill discriminative V-ReID-helpful information from $G(I)$ and compensate it onto $F(I)$ to create a Joint Feature $J(I)$. To encourage a higher discriminative capability of the Joint Feature, we introduce an Amelioration Constraint (AC), which drives the distillation of discriminative information from $G(I)$ to enhance the original V-ReID feature $F(I)$. The JM promotes the interaction between the attribute and V-ReID information to improve the V-ReID performance.

Attribute Feature $G(I)$. To facilitate the distillation of helpful attribute features, we combine all the attribute feature maps $A_i(I)$, $i = 1, \cdots, n$, into a unified attribute feature map $G(I)$. We achieve this by summing the attribute feature maps, followed by a convolution layer with a residual connection:

$G(I) = \sum_{i=1}^{n} A_i(I) + \theta_A\!\left(\sum_{i=1}^{n} A_i(I)\right)$,    (5)

where $\theta_A$ is implemented by a $1 \times 1$ convolution followed by ReLU, i.e., $\theta_A(x) = \mathrm{ReLU}(W_A x)$, $W_A \in \mathbb{R}^{c \times c}$. We omit BN to simplify the notation.

For the combined attribute feature map $G(I)$, we add supervision from attributes to preserve the attribute information. Given $n$ attributes, let $m_i$ be the number of classes for the $i$-th attribute; there are in total $\prod_{i=1}^{n} m_i$ attribute patterns. We apply a GAP layer on $G(I)$ to get the feature vector $g(I)$. Then, a Triplet Loss $L_{tri}^{g}$ is used as supervision to pull together the features of the same attribute pattern and push apart the features of different attribute patterns. We name this supervision the Attribute-based Triplet Loss.

Joint Feature $J(I)$. To distill V-ReID-helpful attribute information from $G(I)$ to enhance $F(I)$, we use two convolution layers to obtain the distilled feature $G_{reid}(I)$:

$G_{reid}(I) = \theta_{g2}(\theta_{g1}(G(I)))$,    (6)

where $\theta_{g1}$ and $\theta_{g2}$ are implemented similarly to $\theta_A$ but with $3 \times 3$ convolutions, i.e., $\theta_{g1}(x) = \mathrm{ReLU}(W_{g1} x)$, $\theta_{g2}(x) = \mathrm{ReLU}(W_{g2} x)$, $W_{g1} \in \mathbb{R}^{c \times c}$ and $W_{g2} \in \mathbb{R}^{c \times c}$.

By adding $G_{reid}(I)$ onto the V-ReID feature $F(I)$, we obtain the Joint Feature $J(I)$ as

$J(I) = F(I) + G_{reid}(I)$.    (7)
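The PyTorch sketch below illustrates Eqs. (2)-(7): an SE-style channel attention per attribute branch and the Joint Module that merges the attribute maps into $G(I)$, distills $G_{reid}(I)$ and adds it onto $F(I)$. The class names are ours, and the $3 \times 3$ kernel size for $\theta_{g1}$/$\theta_{g2}$ is our reading of the (partially garbled) text; this is a sketch under those assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SEAttention(nn.Module):
    """Channel attention Att_i of Eq. (2): an SE block [23] with reduction ratio r."""
    def __init__(self, c: int, r: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(c, c // r), nn.ReLU(inplace=True),
            nn.Linear(c // r, c), nn.Sigmoid())

    def forward(self, F_map: torch.Tensor) -> torch.Tensor:
        w = self.mlp(F_map)                        # (B, c) channel responses
        return F_map * w[:, :, None, None]         # A_i(I) = F(I) . Att_i(F(I))

class JointModule(nn.Module):
    """Eqs. (5)-(7): merge attribute maps into G(I), distill G_reid(I), add onto F(I)."""
    def __init__(self, c: int):
        super().__init__()
        # theta_A: 1x1 convolution + ReLU (BN omitted, as in the paper's notation)
        self.theta_A = nn.Sequential(nn.Conv2d(c, c, 1), nn.ReLU(inplace=True))
        # theta_g1 / theta_g2: the distillation convolutions of Eq. (6)
        self.theta_g1 = nn.Sequential(nn.Conv2d(c, c, 3, padding=1), nn.ReLU(inplace=True))
        self.theta_g2 = nn.Sequential(nn.Conv2d(c, c, 3, padding=1), nn.ReLU(inplace=True))

    def forward(self, F_map: torch.Tensor, attr_maps: list):
        A_sum = torch.stack(attr_maps).sum(dim=0)  # sum_i A_i(I)
        G = A_sum + self.theta_A(A_sum)            # Eq. (5), residual form
        G_reid = self.theta_g2(self.theta_g1(G))   # Eq. (6)
        return G, F_map + G_reid                   # G(I) and J(I) = F(I) + G_reid(I), Eq. (7)
```

In this reading, $g(I)$ and $j(I)$ are obtained by applying GAP (and, for $j$, the FC layer of Eq. (8) below) to the two returned maps.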
$J(I)$ combines the V-ReID information from $F(I)$ and the relevant V-ReID-helpful information from the attributes $G(I)$. Similar to the supervision on $F(I)$, we add a Triplet Loss $L_{tri}^{j}$ and a Cross Entropy Loss $L_{ID}^{j}$ on the spatially average pooled feature $j(I)$, obtained as

$j(I) = W_j \, \mathrm{GAP}(J(I)) + b_j$,    (8)

where $W_j$ and $b_j$ represent the weights and bias of an FC layer, $W_j \in \mathbb{R}^{s_j \times c}$ and $b_j \in \mathbb{R}^{s_j}$, and $s_j$ is the predefined dimension of the output. JM is trained by minimizing $L_{JM}$:

$L_{JM} = L_{tri}^{j} + L_{ID}^{j} + \lambda_G L_{tri}^{g}$,    (9)

where $\lambda_G$ is a hyper-parameter balancing the importance of the compensated V-ReID loss and the attribute-related loss.

Finally, we can train the entire ANet end-to-end by minimizing $L$:

$L = L_{JM} + \lambda L_{VAN}$,    (10)

where $\lambda$ is a hyper-parameter to balance the importance of $L_{JM}$ and $L_{VAN}$.
Amelioration Constraint. To further boost the capabilities of the network, we define the Amelioration Constraint (AC). AC aims to explicitly encourage $j(I)$ to be more discriminative than $f(I)$. We apply AC separately for the cross entropy loss and the triplet loss.

AC for Cross Entropy Loss:
For image $I$, we define it as

$AC_{ID}(I) = \mathrm{softplus}(L_{ID}^{j}(I) - L_{ID}^{f}(I))$,    (11)

where $\mathrm{softplus}(\cdot) = \ln(1 + \exp(\cdot))$ is a monotonically increasing function that helps to reduce the optimization difficulty by avoiding negative values [19]. $L_{ID}^{f}(I)$ and $L_{ID}^{j}(I)$ represent the identity cross entropy loss with respect to the features $f(I)$ and $j(I)$, respectively. Minimizing $AC_{ID}(I)$ encourages the network to have a lower classification error for $j(I)$ than for $f(I)$.

AC for Triplet Loss:
We seek $j(I)$ to represent an enhanced version of $f(I)$, where $j(I)$ has a higher discriminative capability than $f(I)$. Thus, we encourage the feature distance $D(\cdot,\cdot)$ between an anchor sample/image $I$ and a positive sample $I^{+}$ to be smaller w.r.t. feature $j(\cdot)$ than w.r.t. feature $f(\cdot)$. Similarly, we encourage the feature distance between an anchor sample/image $I$ and a negative sample $I^{-}$ to be larger w.r.t. feature $j(\cdot)$ than w.r.t. feature $f(\cdot)$. Then, the AC for the triplet loss, $AC_{tri}$, is defined as

$AC_{tri}(I) = \mathrm{softplus}(D(j(I), j(I^{+})) - D(f(I), f(I^{+}))) + \mathrm{softplus}(D(f(I), f(I^{-})) - D(j(I), j(I^{-})))$.    (12)

We notice that training with $AC_{ID}$ and $AC_{tri}$ end-to-end leads to unstable learning. Thus, we train in two steps. In the first step, we minimize $L$. In the second step, we freeze the backbone (i.e., all operations before $f$) and minimize $L'$. Compared with $L$ in (10), the AC losses are enabled and the losses on feature $f$ are disabled in $L'$:

$L' = L + AC_{tri} + AC_{ID} - \lambda (L_{tri}^{f} + L_{ID}^{f})$.    (13)
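A minimal sketch of the two AC terms follows, assuming per-sample cross entropy values and batched feature vectors are already available; the function names and the Euclidean choice for $D(\cdot,\cdot)$ are our assumptions.

```python
import torch
import torch.nn.functional as F

def ac_id(ce_j: torch.Tensor, ce_f: torch.Tensor) -> torch.Tensor:
    """Eq. (11): penalize samples whose cross entropy on j(I) exceeds that on f(I)."""
    return F.softplus(ce_j - ce_f)

def ac_tri(j, j_pos, j_neg, f, f_pos, f_neg) -> torch.Tensor:
    """Eq. (12): positives should be closer, and negatives farther, under j than under f."""
    dist = lambda a, b: (a - b).pow(2).sum(dim=1).clamp(min=1e-12).sqrt()  # Euclidean D(.,.)
    return (F.softplus(dist(j, j_pos) - dist(f, f_pos)) +
            F.softplus(dist(f, f_neg) - dist(j, j_neg)))
```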
4. Experiments
In this section, we present the datasets used in our experiments, the implementation details, an ablation study and a comparison against the state of the art to validate our proposed method.
4.1. Datasets and Evaluation Protocol

We evaluate our vehicle re-identification method on three challenging benchmark datasets.

• VeRi776 [20]: It contains over 50,000 images of 776 vehicles with 20 camera views. It includes attribute labels for color and type. It considers 576 vehicles for training and 200 vehicles for testing.

• VeRi-Wild [21]: This is the largest vehicle re-identification dataset. It considers 174 camera views, 416,314 images and 40,671 IDs. It includes attribute labels for vehicle model, color and type. The testing set is divided into three sets with 3,000 (small), 5,000 (medium) and 10,000 (large) IDs. This is the most challenging dataset because the images were captured over a period of one month and include severe changes in background, illumination, viewpoint and occlusions.

• Vehicle-ID [13]: It includes 221,763 images of 26,267 vehicles, captured from either front or back views. The training set contains 110,178 images of 13,134 vehicles and the test set contains 111,585 images of 13,133 vehicles. The testing data is further divided into three sets with 200 (small), 1,600 (medium) and 2,400 (large) vehicles. Some images in this dataset have attribute labels for vehicle color and type, but not all of them.

For the first two datasets, the evaluation protocol is based on mean Average Precision (mAP) and the Cumulative Matching Curve (CMC) @1 (rank-1/R1) and @5 (rank-5/R5), as they have fixed gallery and query sets. For Vehicle-ID, we follow the protocol proposed by the authors of the dataset, which randomly chooses one image of each vehicle ID as gallery and the rest as query. The final R1 and R5 results are reported after repeating this process 10 times, as sketched below.
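The following is a small sketch (not the authors' evaluation code) of the Vehicle-ID protocol just described: in each of the 10 repetitions, one random image per ID forms the gallery and the remaining images serve as queries.

```python
import random
from collections import defaultdict

def vehicle_id_split(samples):
    """samples: list of (image_path, vehicle_id) pairs from one test scale."""
    by_id = defaultdict(list)
    for path, vid in samples:
        by_id[vid].append(path)
    gallery, query = [], []
    for vid, paths in by_id.items():
        g = random.choice(paths)                 # one random image per ID as gallery
        gallery.append((g, vid))
        query.extend((p, vid) for p in paths if p != g)
    return gallery, query

# A matcher then ranks gallery features for each query; averaging the resulting
# R1/R5 over 10 calls to vehicle_id_split gives the reported numbers.
```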
4.2. Implementation Details

We follow other works in the literature to implement the backbone for a fair comparison. We use a modified version of ResNet-50 [14] with Instance-Batch Normalization [22] and remove the last pooling layer to obtain the feature map $F(I)$ for an image $I$. Each attention module $Att_i(I)$ is based on SE [23] with a reduction ratio of 16.
For the FC layers, we set $s_a = 128$ and $s_f = s_j$; the loss weights are set as $\lambda = \lambda_A = \lambda_G$. For images without attribute annotations, we do not compute $L_{att}^{i}$ or $L_{tri}^{g}$. We found this works well since we use a batch size of 512 (4 images per ID) and the missing labels are alleviated by the other IDs in the batch. Note that these missing labels do not affect our $AC_{ID}$ and $AC_{tri}$, so ANet can still learn from those cases.

The input images are resized to 256 × 256 pixels and augmented by random horizontal flipping, random zooming and random input erasing [27, 28, 29, 30]. All models are trained on 8 V100 GPUs with NVLink for 210 epochs with Amsgrad. The initial learning rate is set to 0.0006 and is decayed by 0.1 at epochs 60, 120 and 150. The first learning step minimizes $L$ for the first 150 epochs; the second step optimizes $L'$ for 60 epochs. We use $n = 2$ attributes: color (e.g., red, yellow, gray, etc.) and type (e.g., sedan, truck, etc.). During testing, the feature vectors are L2-normalized for matching.
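The schedule above can be expressed compactly in PyTorch. This is only an illustrative sketch of the two-step recipe: the `anet` module, its `backbone` attribute and the `train_one_epoch` callable are placeholders for components not shown here.

```python
import torch

def train_anet(anet: torch.nn.Module, train_one_epoch) -> None:
    """Two-step schedule: epochs 0-149 minimize L; epochs 150-209 minimize L'
    with the backbone (everything before f) frozen."""
    optimizer = torch.optim.Adam(anet.parameters(), lr=6e-4, amsgrad=True)
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[60, 120, 150], gamma=0.1)   # x0.1 LR decay
    for epoch in range(210):
        if epoch == 150:
            for p in anet.backbone.parameters():           # assumes a `backbone` submodule
                p.requires_grad_(False)
        train_one_epoch(anet, optimizer, use_ac=(epoch >= 150))
        scheduler.step()
```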
4.3. Effectiveness of using Attributes on V-ReID

We first evaluate the effects of using attributes in V-ReID and show the comparisons in Table 1. Baseline denotes the scheme that generates the feature $f$ using only the backbone, without any attribute-related designs. VAN denotes the vanilla scheme that explores attributes as shown in Figure 2, using the same backbone as Baseline.
For our VAN, we can use the V-ReID feature $f(I)$ (i.e., VAN (f)), or the concatenation of $f(I)$ and the attribute features $a_i(I)$, $i = 1, \cdots, n$ (i.e., VAN (f⊕a)), in inference.

Table 1: Ablation study on the effectiveness of our designs. We indicate the feature vector used for testing using the symbol in parentheses.

                     VeRi776     Vehicle-ID                                VeRi-Wild
                                 Small        Medium       Large           Small        Medium       Large
Method               mAP   R1    R1    R5     R1    R5     R1    R5        mAP   R1     mAP   R1     mAP   R1
Baseline             78.1  96.1  81.3  94.4   77.7  90.6   75.8  88.5      78.1  94.6   72.2  92.5   64.0  88.7
VAN (f)              78.1  96.6  84.1  96.5   80.4  93.6   78.4  91.8      83.1  94.5   78.3  93.5   70.6  90.0
VAN (f⊕a)            77.3  96.5  81.5  95.0   78.5  92.0   76.3  89.6      81.9  94.1   76.9  93.1   69.2  89.4
ANet (j) w/o AC      79.8  96.9  85.0  96.7   80.9  94.1   79.0  91.8      84.6  96.1   79.9  94.4   72.9  91.5
ANet (j)             80.1  96.9  86.0  97.4   81.9  95.1   79.6  92.7      85.8  95.9   81.0  94.5   73.9  91.6

We can see that: 1) VAN (f), where the attributes regularize the feature learning, outperforms Baseline significantly on Vehicle-ID and VeRi-Wild. Specifically, using attributes improves rank-1 by 0.5% for VeRi776, rank-1 by 2.8% and rank-5 by 3.3% for Vehicle-ID, and mAP by 6.6% and rank-1 by 1.3% for VeRi-Wild; 2) VAN (f⊕a) has lower performance than VAN (f). This is because not all the attribute information $a_i(I)$ is equally important for V-ReID; allocating the relative contribution of each attribute is needed to obtain satisfactory results. Hence, how to distill task-oriented attribute information to efficiently benefit V-ReID is important, which is what our ANet aims to address.

We use VAN as our attribute-based baseline, which is similar to previous works exploiting vehicle attributes. However, previous works usually use simple FC layers instead of attention blocks for the attribute branches. Using attention facilitates the distillation of attribute features. As shown in Table 2, using attention outperforms using FC layers by 1.2% at rank-1 on Vehicle-ID, and by 1.4% and 1% in mAP on VeRi776 and VeRi-Wild, respectively.

Table 2: Comparison of implementation choices for the attribute branches of the attribute-based baseline. fc represents an implementation using fully connected layers and att an implementation using SE attention blocks. Results for Vehicle-ID and VeRi-Wild are reported on their small-scale test sets.

          VeRi776      Vehicle-ID   VeRi-Wild
Method    mAP   R1     R1    R5     mAP   R1
fc        76.7  95.8   83.3  96.0   82.1  94.3
att       78.1  96.6   84.1  96.5   83.1  94.5
4.4. Effectiveness of ANet

We propose ANet to distill attribute information for more effective V-ReID. Here we study the effectiveness of our Joint Module design and the AC losses. Table 1 shows the comparisons. We can see that: (i) our final scheme ANet (j) significantly outperforms the basic network VAN (f), by 2.0% in mAP on VeRi776, by 1.9%/1.5%/1.2% at rank-1 on the Small/Medium/Large scales of Vehicle-ID, and by 2.7%/2.7%/3.3% in mAP on the Small/Medium/Large scales of VeRi-Wild; (ii) our proposed AC losses, which encourage higher discrimination after the compensation of the distilled attribute feature than before, are very helpful to promote the distillation of discriminative information from the attribute feature for the V-ReID purpose. These results show that the interaction between the V-ReID and attribute features of VAN improves the network performance, thanks to the distillation of V-ReID-oriented attribute features.

To better understand the effects of ANet, we visualize the activation maps of $G(I)$ and $G_{reid}(I)$ and show some examples in Figure 4. $G(I)$ encodes generic features of the attributes, where the activations are flatter and do not have a special focus on the vehicle parts. In contrast, $G_{reid}(I)$ represents the portion of the information of $G(I)$ that is helpful for V-ReID; we can observe that its activation maps focus more on the vehicle.

Figure 4: Comparison of activation maps. The first row shows the input images, and the second and third rows their corresponding activation maps for $G(I)$ (attribute features) and $G_{reid}(I)$ (attribute features oriented to V-ReID), respectively. The first column is the query image; the second to sixth columns show the vehicles retrieved at rank-1 to rank-5.

4.5. Comparison with the State of the Art

We compare our method with approaches that also use attribute information [4, 32, 5, 6, 7, 8]. We also compare with the most recent approaches that leverage clues/techniques such as vehicle parsing maps [3], vehicle parts [10, 17], GANs [15], Teacher-Student (TS) distillation [33, 34], camera viewpoints [9, 34], and Graph Networks (GN) [16, 11]. HPGN creates a pyramid of spatial graph networks to explore the spatial significance of the backbone tensor. PCRNet studies the correlation between parsed vehicle parts through a graph network. VAnet [9] learns two metrics, for similar viewpoints and for different viewpoints, in two feature spaces.

We also compare against FastReid [35], a strong baseline network for re-identification that performs an extensive search of hyper-parameters and augmentation methods and uses several architecture design tricks to achieve excellent performance. We also implemented our design on top of it by taking it as our backbone, which we name ANet + FastReid. Note that the reported results of FastReid were obtained by running their released code.

Tables 3, 4 and 5 show the comparisons on VeRi776, Vehicle-ID, and VeRi-Wild, respectively.

Table 3: Comparison of our proposed method against the state of the art on VeRi776. The first and second best results are marked in bold and underlined, respectively.

Method                  Clues             mAP   R1    R5
PAMAL [31]              attributes        45.0  72.0  88.8
MADVR [32]              attributes        61.1  89.2  94.7
DF-CVTC [4]             attributes        61.0  91.3  95.7
PAMTRI [5]              attributes        71.8  92.8  96.9
AGNet [6]               attributes        71.5  95.6  96.5
SAN [7]                 attributes        72.5  93.3  97.1
StRDAN [8]              attributes        76.1  –     –
VAnet [9]               viewpoint         66.3  89.7  95.9
PRND [17]               veh. parts        74.3  94.3  98.6
UMTS [33]               TS                75.9  95.8  –
PCRNet [11]             GN + parsing      78.6  95.4  98.4
SAVER [15]              GAN               79.6  96.4  98.6
PVEN [3]                parsing           79.5  95.6  98.4
HPGN [16]               GN                80.1  96.7  –
VKD [34]                viewpoint + TS
FastReid [35]           backbone          81.0  97.1  98.3
ANet + FastReid (Ours)  attributes        81.2  96.8  98.4
VeRi776. Compared with attribute-based methods (first group in Table 3), our scheme
ANet + FastReid outperforms the best results in this group by 5.1% in mAP and by more than 1% at rank-1 and rank-5. Compared with the methods using other clues, ours achieves the best mAP.

Table 4: Comparison of our proposed method against the state of the art on Vehicle-ID. The first and second best results are marked in bold and underlined, respectively.
                                        Small         Medium        Large
Method                  Clues           R1    R5      R1    R5      R1    R5
PAMAL [31]              attributes      67.7  87.9    61.5  82.7    54.5  77.2
AGNet [6]               attributes      71.1  83.7    69.2  81.4    65.7  78.2
DF-CVTC [4]             attributes      75.2  88.1    72.1  84.3    70.4  82.1
SAN [7]                 attributes      79.7  94.3    78.4  91.3    75.6  88.3
PRND [17]               veh. parts      78.4  92.3    75.0  88.3    74.2  86.4
SAVER [15]              GAN             79.9  95.2    77.6  91.1    75.3  88.3
UMTS [33]               TS              80.9  –       78.8  –       76.1  –
PVEN [3]                parsing         84.7  97.0    80.6  94.5    77.8  92.0
PCRNet [11]             GN + parsing    86.6  –       79.9  –       77.3  –
Baseline                attributes      81.3  94.4    77.7  90.6    75.8  88.5
ANet (Ours)             attributes      86.0  97.4    81.9  95.1    79.6  92.7
FastReid [35]           backbone        85.5  97.4    81.8  95.3    79.9  93.8
ANet + FastReid (Ours)  attributes      87.9  97.8    82.8  96.2
Table 5: Comparison of our proposed method against the state of the art on VeRi-Wild. The first and second best results are marked in bold and underlined, respectively.

                                        Small               Medium              Large
Method                  Clues           mAP   R1    R5      mAP   R1    R5      mAP   R1    R5
UMTS [33]               TS              82.8  84.5  –       66.1  79.3  –       54.2  72.8  –
HPGN [16]               GN              80.4  91.3  –       75.1  88.2  –       65.0  82.6  –
PCRNet [11]             GN + parsing    81.2  92.5  –       75.3  89.6  –       67.1  85.0  –
SAVER [15]              GAN             80.9  94.5  98.1    75.3  92.7  97.4    67.7  89.5  95.8
PVEN [3]                parsing         82.5
Baseline                attributes      78.1  94.6  98.5    72.2  92.5  97.3    64.0  88.7  95.6
ANet (Ours)             attributes      85.8  95.9  99.0    81.0  94.5  98.1    73.9  91.6  96.7
FastReid [35]           backbone        84.8  95.7  98.9    80.0  94.5  98.1    73.2  91.5  96.7
ANet + FastReid (Ours)  attributes
Vehicle-ID. Our method outperforms the attribute-based methods (first group in Table 4) consistently. For rank-1, our scheme ANet + FastReid outperforms the best attribute-based method (SAN) by 8.2% and 4.4% on the small and medium scales, respectively. When compared with methods using other clues, ours achieves the best results on the large set and competitive performance on the other sets.

VeRi-Wild. Previous attribute-based methods have not yet reported results for this latest dataset. From Table 5, we can see that our schemes ANet and ANet + FastReid achieve the best performance in mAP. PVEN [3] is a method based on semantic parsing to describe each vehicle view and region. It has better results at rank-1/rank-5, but it is not as competitive as on the two previous datasets.

We observed that none of the existing methods consistently achieves the best results on all the datasets. This may be because different datasets pose different main challenges. Our proposed ANet shows a more consistent state-of-the-art performance on all the datasets, thanks to the generic capabilities of attributes for V-ReID.
5. Conclusions
In this work, we proposed ANet, a novel framework that leverages attribute information for vehicle re-identification. ANet addresses the lack of interaction between the V-ReID features and the attribute features found in previous methods. In particular, we encourage the network to distill task-oriented information from the attribute branches and compensate it into the global V-ReID feature to enhance the discrimination capability of the feature. Evaluation on three datasets shows the effectiveness of our method.

Acknowledgments
This work was done while the first author was affiliated with Microsoft Corp. We are thankful to Microsoft Research and the São Paulo Research Foundation (FAPESP).

References

[1] M. Naphade, Z. Tang, M.-C. Chang, D. C. Anastasiu, A. Sharma, R. Chellappa, S. Wang, P. Chakraborty, T. Huang, J.-N. Hwang, The 2019 AI City Challenge, in: IEEE Computer Vision and Pattern Recognition Conference Workshops, 2019, pp. 452–460.
[2] S. D. Khan, H. Ullah, A Survey of Advances in Vision-based Vehicle Re-Identification, Computer Vision and Image Understanding 182 (2019) 50–63.
[3] D. Meng, L. Li, X. Liu, Y. Li, S. Yang, Z.-J. Zha, X. Gao, S. Wang, Q. Huang, Parsing-based View-aware Embedding Network for Vehicle Re-Identification, in: IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 7103–7112.
[4] A. Zheng, X. Lin, C. Li, R. He, J. Tang, Attributes Guided Feature Learning for Vehicle Re-identification, arXiv preprint arXiv:1905.08997 (2019).
[5] Z. Tang, M. Naphade, S. Birchfield, J. Tremblay, W. Hodge, R. Kumar, S. Wang, X. Yang, PAMTRI: Pose-Aware Multi-Task Learning for Vehicle Re-Identification using Highly Randomized Synthetic Data, in: IEEE International Conference on Computer Vision, 2019, pp. 211–220.
[6] H. Wang, J. Peng, D. Chen, G. Jiang, T. Zhao, X. Fu, Attribute-guided Feature Learning Network for Vehicle Re-identification, arXiv preprint arXiv:2001.03872 (2020).
[7] J. Qian, W. Jiang, H. Luo, H. Yu, Stripe-based and Attribute-Aware Network: A Two-Branch Deep Model for Vehicle Re-Identification, Measurement Science and Technology (2020).
[8] S. Lee, E. Park, H. Yi, S. Hun Lee, StRDAN: Synthetic-to-Real Domain Adaptation Network for Vehicle Re-identification, in: IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2020, pp. 608–609.
[9] R. Chu, Y. Sun, Y. Li, Z. Liu, C. Zhang, Y. Wei, Vehicle Re-Identification with Viewpoint-aware Metric Learning, in: IEEE International Conference on Computer Vision, 2019, pp. 8282–8291.
[10] X. Zhang, R. Zhang, J. Cao, D. Gong, M. You, C. Shen, Part-Guided Attention Learning for Vehicle Re-identification, arXiv preprint arXiv:1909.06023 (2019).
[11] X. Liu, W. Liu, J. Zheng, C. Yan, T. Mei, Beyond the Parts: Learning Multi-view Cross-part Correlation for Vehicle Re-identification, ACM Multimedia (2020).
[12] X. Liu, S. Zhang, Q. Huang, W. Gao, RAM: A Region-Aware Deep Model for Vehicle Re-identification, in: IEEE International Conference on Multimedia and Expo, 2018, pp. 1–6.
[13] H. Liu, Y. Tian, Y. Yang, L. Pang, T. Huang, Deep Relative Distance Learning: Tell the Difference between Similar Vehicles, in: IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2167–2175.
[14] K. He, X. Zhang, S. Ren, J. Sun, Deep Residual Learning for Image Recognition, in: IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[15] P. Khorramshahi, N. Peri, J.-C. Chen, R. Chellappa, The Devil is in the Details: Self-Supervised Attention for Vehicle Re-Identification, IEEE European Conference on Computer Vision (2020).
[16] F. Shen, J. Zhu, X. Zhu, Y. Xie, J. Huang, Exploring Spatial Significance via Hybrid Pyramidal Graph Network for Vehicle Re-Identification, arXiv preprint arXiv:2005.14684 (2020).
[17] B. He, J. Li, Y. Zhao, Y. Tian, Part-Regularized Near-Duplicate Vehicle Re-Identification, in: IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 3997–4005.
[18] H. Guo, C. Zhao, Z. Liu, J. Wang, H. Lu, Learning Coarse-to-Fine Structured Feature Embedding for Vehicle Re-identification, in: Thirty-Second AAAI Conference on Artificial Intelligence, 2018, pp. 6853–6860.
[19] X. Jin, C. Lan, W. Zeng, Z. Chen, L. Zhang, Style Normalization and Restitution for Generalizable Person Re-Identification, in: IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 3143–3152.
[20] X. Liu, W. Liu, T. Mei, H. Ma, PROVID: Progressive and Multimodal Vehicle Reidentification for Large-Scale Urban Surveillance, IEEE Transactions on Multimedia 20 (2017) 645–658.
[21] Y. Lou, Y. Bai, J. Liu, S. Wang, L. Duan, VERI-Wild: A Large Dataset and a New Method for Vehicle Re-identification in the Wild, in: IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 3235–3243.
[22] X. Pan, P. Luo, J. Shi, X. Tang, Two at Once: Enhancing Learning and Generalization Capacities via IBN-Net, in: European Conference on Computer Vision, 2018, pp. 464–479.
[23] J. Hu, L. Shen, G. Sun, Squeeze-and-Excitation Networks, in: IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7132–7141.
[24] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, Z. Wojna, Rethinking the Inception Architecture for Computer Vision, in: IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2818–2826.
[25] A. Hermans, L. Beyer, B. Leibe, In Defense of the Triplet Loss for Person Re-Identification, arXiv preprint arXiv:1703.07737 (2017).
[26] H. Luo, Y. Gu, X. Liao, S. Lai, W. Jiang, Bag of Tricks and a Strong Baseline for Deep Person Re-identification, in: IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2019.
[27] G. Ghiasi, T.-Y. Lin, Q. V. Le, DropBlock: A Regularization Method for Convolutional Networks, in: Advances in Neural Information Processing Systems, 2018, pp. 10727–10737.
[28] K. Zhou, T. Xiang, Torchreid: A Library for Deep Learning Person Re-Identification in PyTorch, arXiv preprint arXiv:1910.10093 (2019).
[29] K. Zhou, Y. Yang, A. Cavallaro, T. Xiang, Omni-Scale Feature Learning for Person Re-Identification, in: International Conference on Computer Vision, 2019, pp. 3702–3712.
[30] K. Zhou, Y. Yang, A. Cavallaro, T. Xiang, Learning Generalisable Omni-Scale Representations for Person Re-Identification, arXiv preprint arXiv:1910.06827 (2019).
[31] S. Tumrani, Z. Deng, H. Lin, J. Shao, Partial Attention and Multi-Attribute Learning for Vehicle Re-Identification, Pattern Recognition Letters 138 (2020) 290–297.
[32] N. Jiang, Y. Xu, Z. Zhou, W. Wu, Multi-Attribute Driven Vehicle Re-Identification with Spatial-Temporal Re-ranking, in: 25th IEEE International Conference on Image Processing, 2018, pp. 858–862.
[33] X. Jin, C. Lan, W. Zeng, Z. Chen, Uncertainty-Aware Multi-Shot Knowledge Distillation for Image-Based Object Re-Identification, AAAI (2020).
[34] A. Porrello, L. Bergamini, S. Calderara, Robust Re-Identification by Multiple Views Knowledge Distillation, IEEE European Conference on Computer Vision (2020).
[35] L. He, X. Liao, W. Liu, X. Liu, P. Cheng, T. Mei, FastReID: A PyTorch Toolbox for General Instance Re-identification, arXiv preprint arXiv:2006.02631 (2020).