Attributes Guided Feature Learning for Vehicle Re-identification
Aihua Zheng, Xianmin Lin, Chenglong Li, Ran He, and Jin Tang
Abstract—Vehicle Re-ID has recently attracted enthusiastic attention due to its potential applications in smart city and urban surveillance. However, it suffers from large intra-class variation caused by view variations and illumination changes, and from inter-class similarity, especially for different identities with similar appearance. To handle these issues, in this paper we propose a novel deep network architecture, which is guided by meaningful attributes including camera views, vehicle types and colors, for vehicle Re-ID. In particular, our network is trained end-to-end and contains three subnetworks of deep features embedded by the corresponding attributes (i.e., camera view, vehicle type and vehicle color). Moreover, to overcome the shortcoming of limited vehicle images of different views, we design a view-specified generative adversarial network to generate multi-view vehicle images. For network training, we annotate the view labels on the VeRi-776 dataset. Note that one can directly adopt the pre-trained view (as well as type and color) subnetwork on other datasets with only ID information, which demonstrates the generalization of our model. Extensive experiments on the benchmark datasets VeRi-776 and VehicleID suggest that the proposed approach achieves promising performance and yields a new state-of-the-art for vehicle Re-ID.
Index Terms—Vehicle Re-identification, Deep Features, Attributes, View-specified Generative Adversarial Network.
I. INTRODUCTION

VEHICLE re-identification (Re-ID) is a frontier and important research problem in computer vision with many potential applications, such as intelligent transportation, urban surveillance and security, since vehicles are the most important objects in urban surveillance. The aim of vehicle Re-ID is to identify the same vehicle across non-overlapping cameras. Although a license plate can uniquely identify a vehicle, it is scarcely recognizable in real-life surveillance due to motion blur, challenging camera views, low resolution, etc. Some researchers have explored spatial-temporal information [1], [2], [3], [4] to boost the performance of appearance-based vehicle Re-ID. However, it is difficult to obtain complete spatial-temporal information since a vehicle may only appear in a few of the cameras of a large-scale camera network. Therefore, the prevalent vehicle Re-ID methods still focus on appearance-based models.

Extensive works have been dedicated to person Re-ID in the past decade [5], [6], [7], [8], [9], [10], [11], [12], which focus on two mainstreams: (1) Appearance modelling [6], [7], [8], which develops robust feature descriptors to leverage the various changes and occlusions among different camera views.
A. Zheng, X. Lin, C. Li, and J. Tang are with the School of Computer Science and Technology, Anhui University, Hefei, 230601, China. R. He is with the Institute of Automation, Chinese Academy of Sciences, Beijing, 100190, China.
Fig. 1. Demonstration of two major challenges in vehicle Re-ID: 1) Intra-class difference: the same vehicle "ID1" appears totally different under Camera 1 and Camera 3 due to the view variance. 2) Inter-class similarity: the two different vehicles "ID1" under Camera 1 and "ID2" under Camera 2 have extremely similar visual appearance, especially for vehicles from the same manufacturers.
(2) Learning-based methods [5], [13], [14], [10], [11], [12], which learn a metric distance to mitigate the gap between low-level features and high-level semantics. Recently, deep neural networks have made marvellous progress on both feature learning [5], [11], [12] and metric learning [13], [14], [10] for person Re-ID. However, directly employing person Re-ID models for vehicle Re-ID obviously cannot guarantee satisfactory performance, since the appearance of pedestrians and vehicles varies in different manners under different viewpoints.

Although much progress has been made on vehicle Re-ID [15], [16], [3], [4], [17], [18], [2], vehicle Re-ID encounters more challenges in addition to the common challenges in person Re-ID such as occlusion and illumination. The first crucial challenge of vehicle Re-ID is the large intra-class variation caused by the viewpoint variation across different cameras, which has been widely explored in person Re-ID [19], [20], [21], [22]. This issue is even more challenging in vehicle Re-ID since most of the vehicle images under a certain camera are almost in the same viewpoint due to the rigid motion of the vehicles, as shown in Fig. 1. Unfortunately, directly employing methods from person Re-ID for vehicle Re-ID cannot guarantee satisfactory performance since the appearance distributes totally differently between persons and vehicles. Some vehicle Re-ID methods [23], [24] use adversarial learning schemes to generate multi-view images or features from a single image, and can thus address the challenge of view variation to some extent. But they might find it difficult to distinguish different vehicles with very similar appearance. Furthermore, they neglect attribute information, such as type and color, which would be critical cues for boosting the performance of vehicle Re-ID.

The second challenge is the high inter-class similarity, especially for different identities with similar appearance, as shown in Fig. 1. Incorporating attribute information suffices to generate a more discriminative representation for person Re-ID [11], [25], [12], [26]. Therefore, it is essential to learn deep features with the supervision of attributes in vehicle Re-ID, enforcing the same identity to have consistent attributes. Li et al. [16] introduce attribute recognition into the vehicle Re-ID framework and use extra semantic information to assist vehicle identification, especially for different identities with similar appearance. However, none of these methods handles both of the two challenges (intra-class difference and inter-class similarity) simultaneously, and the performance of vehicle Re-ID is thus limited.

In this paper, we attempt to handle the above issues in a unified deep convolutional framework to jointly learn Deep Feature representations guided by the meaningful attributes, including Camera Views, vehicle Types and Colors (DF-CVTC), for vehicle Re-ID. Attribute information has been successfully investigated as mid-level semantics to boost person Re-ID. It can also help vehicle Re-ID in challenging scenarios. First of all, the camera view is one of the key attributes and challenges in Re-ID. As shown in Fig. 2, the query vehicle image may have completely different views from its counterparts under other cameras, such as queries Q1 and Q2 and their right ranks marked as green solid boxes.
Second, vehicle types and colors, as the representative attributes of vehicles, play an important role in vehicle Re-ID, especially for different vehicles with similar appearance. As shown in Fig. 2, the wrong hits of queries Q3 and Q4, which present similar appearance, could be effectively evaded by the vehicle type attribute. Furthermore, vehicles with different colors may present similar shapes (such as the wrong hits at rank 1 and rank 2 of query Q1), similar overall appearance (such as the wrong hits of query Q4), or even similar color (such as query Q5 with white color while the wrong hits at ranks 2-4 are gray) due to illumination changes. Integrating the color attribute may relieve this inter-class similarity. These challenges motivate us to utilize the above attributes to help the classifier distinguish different vehicles with very similar appearance and also identify the same vehicles under different viewpoints.

Meanwhile, as we observed, most of the vehicle images under a certain camera are almost in the same viewpoint due to the rigid motion of vehicles, as shown in Fig. 1, and thus the number of vehicle images with different views is very limited, which brings a big challenge to training deep networks. To handle this issue, we design a view-specified generative adversarial network (VS-GAN) to generate multi-view vehicle images. It is worth noting that we jointly learn deep features, camera views, vehicle types and colors in an end-to-end framework. Finally, we annotate the view labels in the benchmark datasets for network training, which can be directly used for other datasets with only ID labels.

To our best knowledge, this is the first work to collaboratively learn deep features guided by three attributes simultaneously for vehicle Re-ID. In summary, this paper makes the following contributions to vehicle Re-ID and related applications:

• It proposes a unified attributes guided deep learning framework that jointly learns Deep Feature representations, Camera Views, vehicle Types and Colors (DF-CVTC) for vehicle Re-ID. These components are collaborative with each other, and thus boost the discrimination ability of the learnt representations.

• To enhance the diversity of view data, it develops a vehicle generation model, i.e., VS-GAN, to generate multi-view vehicle images. Together with the synthesized multi-view images and the original single-view images, our network can better mitigate the view difference caused by the cross-view cameras.

• In addition to the type and color labels, we annotate the view labels on the benchmark dataset of vehicle Re-ID, i.e., VeRi-776, for view predictor training, which can be easily employed for situations with only ID information in vehicle Re-ID. We will release the annotation information of the view labels to the public for free academic usage.

• Comprehensive experimental evaluations on two benchmark datasets, i.e., VeRi-776 and VehicleID, demonstrate the promising performance of the proposed method, which yields a new state-of-the-art for vehicle Re-ID.

II. RELATED WORK
A. Vehicle Re-ID
With great progress in person Re-ID [27], [28], [29], [30], [31], [32], vehicle Re-ID has gradually gained increasing attention recently since vehicles are the most important objects in urban surveillance. Liu et al. [15], [3], [4] released a benchmark dataset, VeRi-776, and considered the vehicle Re-ID task as a progressive recognition process using visual features, license plates and spatial-temporal information. Liu et al. [3] released another big surveillance-nature dataset (VehicleID) and designed a coupled clusters loss to measure the distance between two arbitrary similar vehicles. Zhang et al. [17] designed an improved triplet-wise training with a classification-oriented loss. Li et al. [16] integrated the identification, attribute recognition, verification and triplet tasks into a unified CNN framework. Liu et al. [33] proposed a coarse-to-fine ranking method consisting of a vehicle model classification loss, a coarse-grained ranking loss, a fine-grained ranking loss and a pairwise loss.

In addition to appearance information, Shen et al. [2] combined visual spatio-temporal path information for regularization, and Wang et al. [1] introduced spatial-temporal regularization into the proposed orientation invariant feature embedding module. However, the large intra-class variation caused by viewpoint variation and the high inter-class similarity, especially for different vehicles, have not been well solved in existing works.
Fig. 2. Benefits of camera views, types and colors on vehicle Re-ID. The brown dash box indicates five vehicles characterized by distinct camera views, types and colors, their visualized feature representations, and 2D feature projections of the images from the corresponding identities, where different colors represent different vehicles. The blue dash box demonstrates several ranking results of conventional vehicle Re-ID based on ResNet-50 [19], where the red and green solid boxes in the first 15 ranks indicate the wrong and right matchings respectively. The results show that extra semantics or attributes play a critical role in handling the challenges of vehicle Re-ID.
B. View-aware Re-ID
Viewpoint changes introduce a large portion of the intra-class variation in person Re-ID. Zhao et al. [20] proposed a novel method guided by human body regions for person Re-ID, which boosts the performance well. Wu et al. [34] proposed an approach called pose prior to make identification more robust to viewpoint. Zheng [35] introduced the PoseBox structure, which is generated through pose estimation followed by affine transformations.

This issue is even more crucial in vehicle Re-ID, since the viewpoints of the images are almost the same due to the rigid motion of the vehicles. Wang et al. [1] proposed orientation invariant feature embedding to address the influence of viewpoint variation. Prokaj et al. [36] presented a method based on pose estimation to deal with multiple viewpoints.
C. Attribute Embedded Re-ID
Attributes have been extensively investigated as mid-level semantic information to boost person Re-ID. Su et al. [11] introduced a low-rank attribute embedding into a multi-task learning framework for person Re-ID. Khamis et al. [12] jointly optimized the attribute classification loss and triplet loss for person Re-ID. Lin et al. [37] integrated the identification loss and attribute prediction into a simple ResNet framework and annotated the pedestrian attributes in two benchmark person Re-ID datasets, Market-1501 and DukeMTMC-reID. Su et al. [38] proposed a weakly supervised multi-type attribute learning framework based on the triplet loss by pre-training the attribute predictor on independent data. In contrast to the previous works focusing on image-based queries, Li et al. [39] and Yin et al. [40] investigated attribute-based queries for person retrieval and Re-ID tasks.

In vehicle Re-ID, Li et al. [16] introduced attribute recognition into the vehicle Re-ID framework together with the verification loss and triplet loss. However, how to incorporate view-aware identification and attribute recognition into a unified framework has still not been investigated.
D. GAN-based Re-ID
As one of the hottest research directions in the current deep learning field, GAN [41] has been intensely explored in image generation [42], data enhancement [43], style migration [27] and other aspects. Recently, some works have also started to develop GANs for person Re-ID. Zheng et al. [5] explored GANs to generate new unlabeled samples for data augmentation in person Re-ID. Zhong et al. [27] introduced a method named camera style (CamStyle), which can be viewed as a data augmentation approach that smooths the camera style disparities. Qian et al. [44] used a GAN to generate eight pre-defined poses for each image, which augments the data and addresses the viewpoint variation to some extent. Liu et al. [7] transferred various person pose instances from one dataset to another to improve the generalization ability of the model. Wei et al. [8], [9] proposed a GAN model to bridge the domain gap among different person Re-ID datasets.

Some researchers proposed to use GANs to generate multi-view images or features to relieve the view variation in vehicle Re-ID. Zhou et al. [23] designed a conditional generative network to obtain cross-view images from input view pairs to address the vehicle Re-ID task. Later on, Zhou et al. [24] proposed a Viewpoint-aware Attentive Multi-view Inference (VAMI) model to infer multi-view features from single-view image inputs. They used image pairs for training, while our method employs a classification CNN and jointly learns the deep features and the camera views, vehicle types and colors. By learning the view and attribute specified deep features, our method is superior to the above methods.

III. PROPOSED NETWORK ARCHITECTURE
In this paper, we propose a novel deep network architecture, which embeds attribute information, including camera views, vehicle types and colors, for vehicle Re-ID. We elaborate the proposed method in this section.
A. Architecture Overview
The overall architecture is demonstrated in Fig. 3. Our proposed architecture mainly consists of two parts: the view transform model and the vehicle Re-ID model. The view transform model consists of a view-specified GAN that generates multi-view vehicle images to relieve the view variations. The vehicle Re-ID model is composed of one backbone, three guiding subnetworks, and the embedding layers. We discuss these parts one by one for clarity.
B. Backbone
In our work, we adopt ResNet-50 as the baseline network for the backbone. One can also configure other networks such as Inception-v4, VGG16 and MobileNet architectures without limitation. As for ResNet-50, we use the first three residual blocks of ResNet-50 as our backbone as shown in Fig. 3, due to its compelling performance with deeper layers by residual learning. We denote the parameters of this network as $\theta_0$.
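To make this concrete, the following is a minimal sketch, assuming a torchvision ResNet-50 whose `layer1`-`layer3` stages are treated as Block-1 to Block-3; the exact cut point and input size used by the authors are not specified, so the details are illustrative only.

```python
import torch
import torch.nn as nn
from torchvision import models

def build_backbone():
    """Keep the stem and the first three residual stages of ResNet-50.

    The cut point is an assumption based on "the first three residual blocks";
    Block-1 (layer1) output feeds the attribute predictors and Block-3 (layer3)
    output feeds the view/type/color feature units (see Sec. III-C).
    """
    resnet = models.resnet50(pretrained=True)
    return nn.Sequential(
        resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool,
        resnet.layer1,   # Block-1
        resnet.layer2,   # Block-2
        resnet.layer3,   # Block-3
    )

if __name__ == "__main__":
    feats = build_backbone()(torch.randn(1, 3, 224, 224))
    print(feats.shape)   # torch.Size([1, 1024, 14, 14])
```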
C. Subnetworks

As shown in Fig. 3, each subnetwork consists of a predictor part and a feature extraction part, with inputs of the feature maps from Block-1 and Block-3 of the backbone, respectively. The predictor is composed of three convolutional (Conv) layers and one fully-connected (FC) layer, which outputs a probability distribution over the corresponding (view, type or color) values. The kernel sizes of the three Conv layers are ×, × and ×, respectively, and the strides for these kernels are 3, 2 and 1, respectively. We use ReLU activation in all three layers and add a batch normalization layer after each Conv layer. The resulting feature vector is fed into the following FC layer to predict the attribute scores via a $K$-way softmax. The feature extractor is composed of $K$ units, each of which is a Conv-net responsible for extracting high-level features corresponding to one of the $K$ view or attribute classes. We use Block-4 of ResNet-50 as the feature extractor.

The features from each specific feature extractor $E^{\Phi}$ can be formulated as
$$f_k^{\Phi} = E^{\Phi}(x; \alpha^{\Phi}) \qquad (1)$$
where $\Phi \in \{view, type, color\}$ and $k = 1, 2, \dots, K^{\Phi}$. $K^{\Phi}$ is the number of corresponding units, which also indicates the possible classes of each view or attribute, $x$ is an image, and $\alpha^{\Phi}$ denotes the parameters of $E^{\Phi}$.

The probability distribution $w^{\Phi}$ over the corresponding view or attribute values from the predictor network $P^{\Phi}$ is
$$w^{\Phi} = P^{\Phi}(x; \beta^{\Phi}) \qquad (2)$$
where $\beta^{\Phi}$ denotes the parameters of $P^{\Phi}$, which are learnt using the cross-entropy loss $L^{\Phi}$,
$$L^{\Phi} = -\sum_{k=1}^{K^{\Phi}} \log\big(w^{\Phi}(k)\big)\, q^{\Phi}(k) \qquad (3)$$
where $q^{\Phi}$ is a one-hot vector of the ground truth of the corresponding view or attribute values.

After progressively learning the three subnetworks, we achieve the specific feature maps via
$$F^{\Phi} = \big(f_1^{\Phi} \odot w_1^{\Phi}\big) \oplus \cdots \oplus \big(f_{K^{\Phi}}^{\Phi} \odot w_{K^{\Phi}}^{\Phi}\big) \qquad (4)$$
where $\oplus$ denotes the element-wise sum, and $\odot$ denotes the element-wise multiplication.

The joint deep features with camera view, type and color are achieved as the fusion of the feature maps of the three subnetworks,
$$F = F^{view} \oplus F^{type} \oplus F^{color} \qquad (5)$$
where $F$ is the fused deep feature containing the complementary view and attribute information. Next, we describe the details of each subnetwork as follows.
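The weighting scheme of Eqs. (1)-(5) can be sketched as follows; the channel sizes, the predictor's kernel sizes and the simple convolutional units below are illustrative assumptions rather than the authors' exact implementation (the paper uses Block-4 of ResNet-50 as the feature units).

```python
import torch
import torch.nn as nn

class AttributeSubnetwork(nn.Module):
    """One subnetwork (view, type or color): a predictor P and K feature units E."""

    def __init__(self, num_classes, pred_in_ch=256, feat_in_ch=1024, feat_out_ch=2048):
        super().__init__()
        # Predictor: three Conv+BN+ReLU layers (strides 3, 2, 1) and an FC layer.
        self.predictor = nn.Sequential(
            nn.Conv2d(pred_in_ch, 128, kernel_size=3, stride=3), nn.BatchNorm2d(128), nn.ReLU(),
            nn.Conv2d(128, 128, kernel_size=3, stride=2), nn.BatchNorm2d(128), nn.ReLU(),
            nn.Conv2d(128, 128, kernel_size=3, stride=1), nn.BatchNorm2d(128), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, num_classes),
        )
        # K feature units, one per view/attribute class (stand-ins for Block-4 copies).
        self.units = nn.ModuleList([
            nn.Sequential(nn.Conv2d(feat_in_ch, feat_out_ch, 3, padding=1), nn.ReLU())
            for _ in range(num_classes)
        ])

    def forward(self, block1_feat, block3_feat):
        w = torch.softmax(self.predictor(block1_feat), dim=1)     # Eq. (2)
        feats = [unit(block3_feat) for unit in self.units]        # Eq. (1)
        # Eq. (4): weight each unit's feature map by its score and sum them.
        fused = sum(w[:, k].view(-1, 1, 1, 1) * feats[k] for k in range(len(feats)))
        return fused, w

# Eq. (5): F = F_view + F_type + F_color (element-wise sum of the three subnetwork outputs).
```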
1) View subnetwork:
Viewpoint changes bring a crucial challenge for the Re-ID task. We use the view subnetwork to incorporate the view information into the Re-ID model. The view predictor predicts $K^{view}$-way softmax scores, which are used to weight the output of each corresponding view unit. In this paper, $K^{view} = 5$, indicating the five viewpoints front, front_side, side, rear_side and rear. For instance, for a training sample in the rear orientation to the camera, the corresponding view unit will be assigned a strong weight and updated strongly during back propagation.
Fig. 3. Overview of our DF-CVTC. The view transform model generates multi-view images based on a view-specified GAN. Both the original (blue box) and the generated images (red boxes) are fed into the vehicle Re-ID model, which consists of one backbone (the first three blocks of ResNet-50), three subnetworks and one embedding network.
2) Type subnetwork:
Type is useful to distinguish vehicles with similar appearance, which can relieve the inter-class similarity. In the same manner as the view subnetwork, we use the type subnetwork to learn attribute-specific deep features. The $K^{type}$ scores predicted by the type predictor are used to weight the output of each corresponding type unit in the type-specific feature extractor. In this paper, we set $K^{type} = 9$, indicating 9 types of vehicles: sedan, suv, van, hatchback, mpv, pickup, bus, truck and estate. Similar to the view-specific feature extractor, each type unit will learn a feature map specialized for one of the $K^{type}$ types.
3) Color subnetwork:
Color is another discriminative attribute of vehicles. Therefore, we analogously use the color subnetwork to learn color-specific features. The color predictor predicts the color scores of the vehicle, which are then used to weight each color unit. In our implementation, we set $K^{color} = 10$, denoting 10 colors of vehicles: yellow, orange, green, gray, red, blue, white, golden, brown and black. The color-specific feature extractor is designed in the same manner as in the view and type subnetworks.

D. Embedding Layers
The embedding layers consist of two FC layers (we denote the parameters of this embedding as $\theta_1$). They embed the fused feature $F$ in Eq. (5) into the higher-level joint deep feature $F^{joint}$, which is used for the final Re-ID task.

In order to train the Re-ID model, we add a softmax layer to the embedding network for ID classification. We use the cross-entropy loss $L^{id}$ for model training,
$$L^{id} = -\sum_{n=1}^{N} \log\big(p^{id}(n)\big)\, q^{id}(n) \qquad (6)$$
where $N$ is the number of vehicle IDs in the training set, $q^{id}$ is the one-hot ground truth of the ID label of the vehicle, and $p^{id}(n)$ is the predicted probability indicating the ID of the input vehicle image,
$$p^{id} = \mathrm{softmax}(F^{joint}). \qquad (7)$$

Fig. 4 demonstrates the effectiveness of the jointly learnt deep features of the proposed DF-CVTC. We can observe that the vehicle images of the same identity fall into the same cluster regardless of the different visible appearance caused by different camera views (as shown in Fig. 4 (a) and (b)) or illumination changes (as shown in Fig. 4 (c)).
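A minimal sketch of the embedding layers and the ID classification loss of Eqs. (6)-(7) could look as follows; the input and embedding dimensions are assumptions, while 576 matches the number of training identities in VeRi-776.

```python
import torch
import torch.nn as nn

class EmbeddingHead(nn.Module):
    """Two FC embedding layers plus a softmax ID classifier (dimensions assumed)."""

    def __init__(self, in_dim=2048, embed_dim=1024, num_ids=576):
        super().__init__()
        self.embed = nn.Sequential(
            nn.Linear(in_dim, embed_dim), nn.ReLU(),
            nn.Linear(embed_dim, embed_dim),
        )
        self.classifier = nn.Linear(embed_dim, num_ids)   # softmax layer for ID classification

    def forward(self, fused_feat):
        f_joint = self.embed(fused_feat)       # F_joint, used for the final Re-ID matching
        return f_joint, self.classifier(f_joint)

# Eq. (6) is then the usual cross-entropy over the N training identities:
head = EmbeddingHead(num_ids=576)              # 576 = number of training IDs in VeRi-776
fused = torch.randn(8, 2048)                   # globally pooled fused feature F (assumed shape)
ids = torch.randint(0, 576, (8,))
_, logits = head(fused)
loss_id = nn.CrossEntropyLoss()(logits, ids)
```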
E. View-specified GAN

Due to the rigid motion of the vehicles, the images under a certain camera are almost in a single viewpoint, which brings a big challenge to vehicle Re-ID in wild conditions. Therefore, we design a view-specified GAN to generate multi-view images. In this paper, we simply employ pix2pix [45] for its generality. The generation architecture is illustrated in Fig. 5.
Fig. 4. Demonstration of the DF-CVTC features. (a), (b) and (c) denote three vehicle pairs sampled from the VeRi-776 dataset under two distinct camera views and their corresponding learned probabilities of varying camera views, types and colors, where the visible appearances are distinct due to the different camera views or illumination changes. (d) illustrates the 2D feature projections of the vehicle images learnt by the proposed DF-CVTC. (e) represents the corresponding annotation categories of camera views, types and colors.
Fig. 5. The architecture of VS-GAN, based on the architecture of pix2pix [45]. For the input single-view vehicle image $V_i$ (front view as shown), VS-GAN aims to synthesize a vehicle image $V_j$ with the same view as the target vehicle image $R_i$ (front_side view as shown).

Specifically, given an input vehicle image $V_i$ and a target vehicle image $R_i$ with a different view, our VS-GAN aims to generate a new vehicle image $V_j$ with the same view as $R_i$. VS-GAN consists of a generator $G_v$, which learns a mapping conditional on the given target, and a discriminator $D_v$, which discriminates real data samples from the generated samples, such that the distribution of image $V_j$ is indistinguishable from the distribution of image $V_i$. The loss function can be expressed as
$$L(G_v, D_v) = \mathbb{E}_{V_i, R_i}\big[\log D_v(V_i, R_i)\big] + \mathbb{E}_{V_i, R_i}\big[\log\big(1 - D_v(V_i, G_v(V_i, R_i))\big)\big] + \lambda\, \mathbb{E}_{V_i, R_i}\big[\| V_i - G_v(V_i, R_i) \|_1\big] \qquad (8)$$
where $G_v$ tries to minimize this objective against an adversarial $D_v$ that tries to maximize it, the $\ell_1$ distance is used to encourage less blurring, and $\lambda$ is the weighting coefficient. Fig. 6 demonstrates several examples of synthesizing front-view vehicle images to the front_side view on the VeRi-776 dataset via VS-GAN. One can generate multi-view images by altering the target images with different views.
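As a hedged sketch of how the objective in Eq. (8) could be optimized, assuming pix2pix-style generator and discriminator modules (stand-ins here, with the discriminator outputting probabilities) and an assumed value for λ, which the paper does not specify:

```python
import torch
import torch.nn as nn

bce = nn.BCELoss()
l1 = nn.L1Loss()

def vsgan_losses(G, D, v_i, r_i, lam=100.0):
    """Generator/discriminator losses corresponding to Eq. (8).

    G and D are stand-in modules (the paper adopts the pix2pix architecture);
    lam is an assumed weighting coefficient.
    """
    fake = G(v_i, r_i)                                        # synthesized image V_j

    # Discriminator: real pair (V_i, R_i) -> 1, generated pair -> 0.
    d_real = D(v_i, r_i)
    d_fake = D(v_i, fake.detach())
    loss_d = bce(d_real, torch.ones_like(d_real)) + bce(d_fake, torch.zeros_like(d_fake))

    # Generator: fool the discriminator, plus the l1 term of Eq. (8) against V_i.
    d_gen = D(v_i, fake)
    loss_g = bce(d_gen, torch.ones_like(d_gen)) + lam * l1(fake, v_i)
    return loss_g, loss_d
```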
F. Difference from Previous Work

Our method is significantly different from [23], [24], [19] in the following aspects. 1) [23], [24] infer multi-view images or features using adversarial learning. However, they render vehicle Re-ID as a verification task, while our method employs a classification CNN to learn the deep features. Furthermore, our learnt features embed attribute information (type and color) in addition to the view information. 2) [19] incorporates both fine and coarse pose/view information to learn a feature representation and proposes a novel re-ranking method for person Re-ID, while our DF-CVTC further integrates the attribute information and jointly learns the deep features embedded by camera views, vehicle types and colors in an end-to-end framework.

IV. TRAINING DETAILS
A. Progressive Learning
We progressively learn the three subnetworks and fine-tune the Re-ID model, which achieves comparable performance to multi-task learning (minimizing the combination of the four losses). Furthermore, it significantly reduces the computational complexity.
1) View subnetwork training:
We fine-tune the backbone network pre-trained on ImageNet classification [46], and the rest of the Re-ID model is initialized from scratch. First, we minimize $L^{view}$ to obtain $\alpha^{view}$, then we minimize $L^{id}$ to obtain $\{\beta^{view}, \theta_1\}$ while fixing all the other parameters of the Re-ID model.
Fig. 6. Examples of synthesizing vehicle images from the front view to the front_side view on the VeRi-776 dataset via VS-GAN. The first and second rows show the vehicle images in the original front view and the synthesized front_side view, respectively.
2) Type subnetwork training:
We first minimize $L^{type}$ to obtain $\alpha^{type}$, and then minimize $L^{id}$ using $F^{view} \oplus F^{type}$ to obtain $\{\beta^{type}, \theta_1\}$ while fixing all the other parameters of the Re-ID model.
3) Color subnetwork training:
In the same manner, we first minimize $L^{color}$ to obtain $\alpha^{color}$, then minimize $L^{id}$ using $F^{view} \oplus F^{type} \oplus F^{color}$ to obtain $\{\beta^{color}, \theta_1\}$ while fixing all the other parameters of the Re-ID model.
4) Joint learning:
After training the three subnetworks, we fine-tune $\{\alpha^{\Phi}, \beta^{\Phi}, \theta_0, \theta_1\}$, $\Phi \in \{view, type, color\}$, of the whole Re-ID model by minimizing $L^{id}$ until convergence.
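The schedule above can be summarized by the following sketch, which only selects the parameter groups of each stage; the module names (`model.subnets`, `model.embedding`) are hypothetical and the actual optimization loops are omitted.

```python
import torch.nn as nn

def set_trainable(module: nn.Module, flag: bool) -> None:
    """Freeze or unfreeze a module by toggling requires_grad on its parameters."""
    for p in module.parameters():
        p.requires_grad = flag

def progressive_stage(model: nn.Module, attribute: str):
    """Parameter groups for one progressive stage (attribute in {'view','type','color'}).

    Each stage runs two optimisations in sequence: the attribute loss on the
    feature units (alpha), then L_id on the predictor and embedding (beta, theta_1),
    with everything else frozen.
    """
    set_trainable(model, False)
    subnet = model.subnets[attribute]
    set_trainable(subnet.units, True)                   # step (i): minimise L_attr
    units_params = list(subnet.units.parameters())
    set_trainable(subnet.predictor, True)               # step (ii): minimise L_id
    set_trainable(model.embedding, True)
    id_params = list(subnet.predictor.parameters()) + list(model.embedding.parameters())
    return units_params, id_params

# Final stage: unfreeze everything and fine-tune the whole model with L_id.
```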
B. Implementation Details

In practice we use a stochastic approximation of the objective since the training set is quite large. The training set is stochastically divided into mini-batches of 16 samples. The network performs forward propagation on the current mini-batch, followed by backpropagation to compute the gradients with the simple cross-entropy loss for updating the network parameters. We use the Adam optimizer at the recommended parameters with an initial learning rate of 0.0001 and a decay of 0.96 every epoch. With more passes over the training data, the model improves until it converges. To reduce overfitting, we artificially augment the data by performing random 2D translation following the same protocol as in [6]: for an original image of size $[W, H]$, we resize it to $[\,\cdot W, \cdot H\,]$. We also horizontally flip each image.
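A minimal sketch of this setup, assuming a 224×224 network input and an illustrative enlargement factor of 1.1 for the random-translation augmentation (neither value is stated explicitly above):

```python
import torch
from torchvision import transforms

# Random 2D translation via enlarge-then-crop, plus horizontal flip.
# 224x224 and the factor 1.1 are assumptions for illustration.
IMG_H, IMG_W = 224, 224
train_transform = transforms.Compose([
    transforms.Resize((int(1.1 * IMG_H), int(1.1 * IMG_W))),
    transforms.RandomCrop((IMG_H, IMG_W)),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

def make_optimizer(model):
    """Adam with the stated initial learning rate and a per-epoch decay of 0.96."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.96)
    return optimizer, scheduler   # call scheduler.step() once per epoch; batch size is 16
```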
V. EXPERIMENTS

We carry out a comprehensive evaluation of the proposed DF-CVTC against the state-of-the-art methods on two public vehicle Re-ID datasets, VeRi-776 [15] and VehicleID [3]. We use the Cumulative Matching Characteristics (CMC) curves and mAP to evaluate our results [3]. The type and color labels are available in VeRi-776; therefore, we annotate the view labels for network training. On VehicleID, we directly employ the view, type and color subnetworks pretrained on VeRi-776, and only ID labels are used.
A. State-of-the-art Methods
All the compared state-of-the-art methods are briefly introduced as follows:
(1) LOMO [14]. Local Maximal Occurrence (LOMO) proposes an effective feature representation against viewpoint changes for person Re-ID.
(2) BOW-CN [47]. Bag-of-Words based Color Name (CN) features.
(3) GoogLeNet [48]. Pre-trained on ImageNet [46] and fine-tuned on the CompCars dataset [48] to extract discriminative semantic feature representations.
(4) FACT [15]. Fusion of Attributes and Color features discriminates vehicles by jointly learning low-level color features and high-level semantic attributes, such as SIFT, Color Name and GoogLeNet features.
(5) FACT+Plate-SNN+STR [3]. FACT [15] with additional plate verification based on a Siamese Neural Network and spatio-temporal relations (STR).
(6) Siamese-Visual [2]. Only visual (appearance) information is used to compute the similarity between the input pair.
(7) Siamese-CNN+Path-LSTM [2]. Combines a Siamese CNN with a Path-LSTM, which estimates the validness score of the visual-spatio-temporal path.
(8) NuFACT [4]. The null-space-based Fusion of Attribute and Color features method, which integrates multi-level appearance features of vehicles, i.e., texture, color and high-level attribute features.
(9) VAMI [24]. VAMI transforms the single-view feature into a global feature representation that contains multi-view feature information, followed by distance metric learning in the global feature space.
(10) C2F-Rank [33]. C2F-Rank designs a coarse-to-fine ranking loss consisting of a vehicle model classification loss, a coarse-grained ranking loss, a fine-grained ranking loss and a pairwise loss.
B. Experiments on VeRi-776 Dataset

1) Setting:
The VeRi-776 dataset contains 776 identities collected by 20 cameras in a real-world traffic surveillance environment. The whole dataset is split into 576 identities with 37,778 images for training and 200 identities with 11,579 images for testing. An additional set of 1,678 images selected from the test identities is used as query images. In order to evaluate the view subnetwork, we annotate all the vehicle images in the VeRi-776 dataset with five viewpoints: front, front_side, side, rear_side and rear. We follow the evaluation protocol in [3]. We use the mean average precision (mAP) metric for evaluation: we first calculate the average precision for each query, and the mAP is then obtained as the mean of these average precisions. The cumulative match curve (CMC) metric is also used for evaluation: we first sort the Euclidean distances between each query and each gallery image in ascending order, and the CMC curve is then obtained by averaging over the sorted results. Note that only vehicles from non-overlapping cameras are counted during evaluation.
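For reference, a simplified sketch of this mAP/CMC protocol, assuming NumPy arrays for the IDs and camera labels and at least `max_rank` valid gallery images per query (production implementations typically also handle junk images):

```python
import numpy as np

def evaluate(dist, query_ids, gallery_ids, query_cams, gallery_cams, max_rank=50):
    """Compute mAP and the CMC curve from a (num_query x num_gallery) distance matrix.

    Gallery images that share the query's camera are excluded, and ranking is by
    ascending Euclidean distance, as described above.
    """
    all_ap, all_cmc = [], []
    for i in range(dist.shape[0]):
        order = np.argsort(dist[i])                      # ascending distance
        keep = gallery_cams[order] != query_cams[i]      # only non-overlapping cameras
        matches = (gallery_ids[order][keep] == query_ids[i]).astype(np.int32)
        if matches.sum() == 0:                           # no valid ground truth for this query
            continue
        cmc = matches.cumsum()
        cmc[cmc > 1] = 1                                 # CMC: 1 from the first hit onwards
        all_cmc.append(cmc[:max_rank])
        hit_positions = np.where(matches == 1)[0]
        precisions = [(k + 1) / (pos + 1) for k, pos in enumerate(hit_positions)]
        all_ap.append(np.mean(precisions))               # average precision for this query
    return float(np.mean(all_ap)), np.stack(all_cmc).mean(axis=0)
```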
2) Qualitative examples:
Fig. 7 shows qualitative examples of six ranking results of our DF-CVTC on the VeRi-776 dataset, from which we can observe that our method successfully hits the vehicles with large view variations with respect to the query, such as rank 2 and ranks 10-11 in Fig. 7 (a), rank 4 and ranks 8-9 in Fig. 7 (b), rank 8 in Fig. 7 (c), and rank 1 and rank 6 in Fig. 7 (d). The wrong hits generally result from the high inter-class similarity with homologous visual appearance, such as ranks 4-5, rank 9 and rank 12 in Fig. 7 (a), rank 10 in Fig. 7 (c), ranks 7-9 in Fig. 7 (e), and ranks 2-3 and rank 5 in Fig. 7 (f). Fig. 11 shows a further qualitative ranking result of our DF-CVTC on the VeRi-776 dataset: Fig. 11 (b) shows the view/attribute probabilities predicted by each subnetwork, and Fig. 11 (c) shows the ranking results. From Fig. 11, we can see that the ranking results improve as each subnetwork is introduced progressively.

Fig. 7. Examples of ranking results on the VeRi-776 dataset. The green and red boxes indicate the right matchings and the wrong matchings respectively.

Fig. 8. Examples of ranking results on the VehicleID dataset. The green boxes indicate the right matchings. Note that there is only one ground truth vehicle image in the gallery set of the VehicleID dataset.
3) Quantitative results:
Table I reports the performance of our approach compared with the published state-of-the-art methods on the VeRi-776 dataset, while Fig. 9 (a) shows the corresponding CMC curves. From these we can see that our DF-CVTC significantly surpasses the state-of-the-art. Note that we have not utilized any license plate or spatial-temporal information as in Siamese-CNN+Path-LSTM [2] and FACT+Plate-SNN+STR [3]. Even so, our method still achieves superior mAP and ranking accuracies by a large margin. We also investigate the contribution of the view, type and color subnetworks in our model. By progressively introducing these components, both mAP and ranking accuracies increase, verifying the clear contribution of each component.
C. Experiments on VehicleID Dataset

1) Setting:
The VehicleID dataset [3] consists of a training set with 110,178 images of 13,134 vehicles and a test set with 111,585 images of 13,133 vehicles. Following the protocol in [3], we test the VehicleID dataset in three distinct settings with different numbers of testing samples: 800, 1,600 and 2,400. Specifically, since some of the type and color information is missing and there are no view labels in this dataset, we adopt the view, type and color subnetworks pre-trained on the VeRi-776 dataset and fine-tune them during the Re-ID training, which in turn means one can easily apply our model to a dataset with only ID information. The mean average precision (mAP) and cumulative match curve (CMC) are used as the evaluation metrics in the same manner as for VeRi-776. The only difference is that we randomly select an image from the test data as the gallery, while the remaining test images are considered as queries (a sketch of this split is given below). The experimental results are averaged over 10 random trials.

TABLE I
COMPARISONS WITH STATE-OF-THE-ART RE-ID METHODS ON VERI-776 (IN %). THE TOP THREE RESULTS ARE HIGHLIGHTED IN RED, GREEN AND BLUE, RESPECTIVELY.
Method | mAP | rank 1 | rank 5 | reference
(1) LOMO [14] | 9.64 | 25.33 | 46.48 | CVPR 2015
(2) BOW-CN [47] | 12.20 | 33.91 | 53.69 | ICCV 2015
(3) GoogLeNet [48] | 17.89 | 52.32 | 72.17 | CVPR 2015
(4) FACT [15] | 18.49 | 50.95 | 73.48 | ICME 2016
(5) FACT+Plate-SNN+STR [3] | 27.70 | 61.44 | 78.78 | ECCV 2016
(6) Siamese-Visual [2] | 29.48 | 41.12 | 60.31 | ICCV 2017
(7) Siamese-CNN+Path-LSTM [2] |  |  |  | ICCV 2017
+view+type |  |  |  |
+view+type+color (DF-CVTC) |  |  |  |

TABLE II
COMPARISONS WITH STATE-OF-THE-ART RE-ID METHODS ON VEHICLEID (IN %). THE TOP THREE RESULTS ARE HIGHLIGHTED IN RED, GREEN AND BLUE, RESPECTIVELY.
Method | Test Size = 800 (mAP / rank 1 / rank 5) | Test Size = 1600 (mAP / rank 1 / rank 5) | Test Size = 2400 (mAP / rank 1 / rank 5) | reference
(1) LOMO [14] | - / 19.76 / 32.01 | - / 18.85 / 29.18 | - / 15.32 / 25.29 | CVPR 2015
(2) BOW-CN [47] | - / 13.14 / 22.69 | - / 12.94 / 21.09 | - / 10.20 / 17.89 | ICCV 2015
(3) GoogLeNet [48] | 46.20 / 47.88 / 67.18 | 44.00 / 43.40 / 63.86 | 38.10 / 38.27 / 59.39 | CVPR 2015
(4) FACT [15] | - / 49.53 / 68.07 | - / 44.59 / 64.57 | - / 39.92 / 60.32 | ICME 2016
(8) NuFACT [4] | - / 48.90 / 69.51 | - / 43.64 / 65.34 | - / 38.63 / 60.72 | TMM 2018
(9) VAMI [24] | - / 63.12 / 83.25 | - / 52.87 / 75.12 | - / 47.34 / 70.29 | CVPR 2018
(10) C2F-Rank [33] | 63.50 / 61.10 / 81.70 | 60.00 / 56.20 / 76.20 | 53.00 / 51.40 / 72.20 | AAAI 2018
ResNet-50 (baseline) | 70.50 / 67.75 / 79.13 | 68.48 / 65.79 / 76.64 | 66.19 / 63.45 / 74.70 |
+view |  |  |  |
+view+type |  |  |  |
+view+type+color (DF-CVTC) |  |  |  |
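A minimal sketch of this random gallery/query split, assuming one gallery image is drawn per identity as described in [3]:

```python
import random
from collections import defaultdict

def split_gallery_query(image_ids):
    """Random gallery/query split in the VehicleID style: one randomly chosen
    image per identity forms the gallery, all remaining images become queries.

    image_ids: iterable of (image_name, vehicle_id) pairs.
    """
    by_id = defaultdict(list)
    for img, vid in image_ids:
        by_id[vid].append(img)
    gallery, query = [], []
    for vid, imgs in by_id.items():
        g = random.choice(imgs)
        gallery.append((g, vid))
        query.extend((im, vid) for im in imgs if im != g)
    return gallery, query

# Final numbers are averaged over 10 such random splits.
```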
2) Qualitative examples:
Fig. 8 shows six ranking results of our DF-CVTC on VehicleID. From these we can observe that our method can successfully hit the right matching under large intra-class differences caused by illumination/color changes, such as Fig. 8 (a), (c) and (f), as well as viewpoint changes, such as Fig. 8 (b), (c), (d) and (f). The wrong hits at rank 1 in Fig. 8 (b) and (e) result from the inter-class similarity between vehicles; despite this, our method still hits the right matchings in the early ranks. Note that there is only one ground truth vehicle image in the gallery set of the VehicleID dataset.
3) Quantitative results:
Table II reports the performance of our method against the state-of-the-art methods on the VehicleID dataset, while Fig. 9 (b)(c)(d) shows the corresponding CMC curves for each test size. Clearly, our method significantly beats the existing state-of-the-art in mAP, rank 1 and rank 5. By introducing the view, type and color subnetworks progressively, the performance of our method is consistently improved. On the base model of ResNet-50, our DF-CVTC beats VAMI by a large margin; specifically, we increase rank 1 by 12.11%, 19.28% and 23.12% on the three different-scale test sets respectively. Our DF-CVTC also beats C2F-Rank by a large margin; specifically, we increase mAP by 14.53%, 14.87% and 20.15% on the three different-scale test sets respectively.
D. Ablation Study

1) Analysis on VS-GAN:
The designed VS-GAN suffices to generate vehicle images for the other four viewpoints, as shown in Fig. 3. Due to the computational complexity, we simply transferred 1,400 front-view vehicle images into front_side-view images for training data augmentation on the VeRi-776 dataset, as shown in Fig. 6. Fig. 10 shows the ablation study of VS-GAN. From it we can see that, by augmenting even only 1,400 synthetic multi-view images into the 37,729 training samples in total, VS-GAN can benefit the Re-ID model with its various components. Moreover, it further boosts the contribution of the view subnetwork in both mAP and rank 1, compared to the corresponding improvements of DF-CVTC without VS-GAN (see Fig. 10). We believe that more generated images with more viewpoints will further boost the performance.

Fig. 9. CMC curves on the VeRi-776 and VehicleID datasets compared with the state-of-the-art methods, where the curves of the variants of our method are plotted in red but with different markers.
Fig. 10. Ablation study of VS-GAN on the VeRi-776 dataset. (a) and (b) show the mAP and rank 1 scores of the proposed DF-CVTC and its variants without and with VS-GAN respectively. The digits on top of the last three bars for each metric indicate the degree of improvement obtained by progressively introducing view, type and color, compared with the first blue bar of the ResNet-50 baseline.
2) Analysis on backbones:
As mentioned in Section III-B, any other CNN architecture can be configured instead of ResNet-50 without limitation. We further evaluate three prevalent CNN architectures, Inception-v4, VGG16 and MobileNet, as the backbone respectively, while keeping the other parts of the proposed model unchanged. The results compared with the ResNet-50 backbone on the VeRi-776 dataset are reported in Table III. From it we can see that all four CNN counterparts achieve satisfactory performance. Specifically, Inception-v4 and MobileNet achieve competitive performance on all metrics. VGG16 works slightly worse than the other three architectures, but it is still competitive with the state-of-the-art methods, which demonstrates that the high performance of the proposed model is not solely due to the superiority of ResNet-50. Furthermore, by progressively introducing the view, type and color subnetworks, the performance of the corresponding variants based on all the backbones consistently improves, which verifies the contribution of the proposed joint learning model. Fig. 11 shows an example of ranking results of the proposed DF-CVTC for a query from the VeRi-776 dataset, obtained by progressively introducing the view, type and color subnetworks into the ResNet-50 backbone. We observe that: 1) by introducing the view subnetwork, the model can eliminate the wrong ranks with quite similar visible appearance, especially those with views similar to the query, such as rank 1 and rank 2 in Fig. 11 (c1); 2) by further introducing the type subnetwork, it can eliminate the wrong ranks with obviously distinct types, such as rank 2 and rank 9 in Fig. 11 (c2); 3) our full model DF-CVTC (Fig. 11 (c4)) hits the most right ranks by progressively introducing the three subnetworks.
E. Other Discussion
We further evaluate the attribute recognition on the 1,678 query vehicle images from the VeRi-776 dataset, comparing the recognition rates of the three subnetworks before and after fine-tuning with the Re-ID loss. We observe that the attributes guided feature learning model further benefits the attribute recognition rates, especially for camera views. Meanwhile, the recognition rate of vehicle type is also slightly improved. The recognition rate of the vehicle color attribute shows no significant change, mainly because most query images are of similar colors and the original recognition rate of the color attribute is nearly saturated.

Fig. 11. An example of ranking results of DF-CVTC on the ResNet-50 backbone by progressively introducing the view, type and color subnetworks on the VeRi-776 dataset. The green and red boxes indicate the right matchings and the wrong matchings respectively. The histograms denote the probability distributions learnt from the view, type and color subnetworks respectively.

TABLE III
ABLATION STUDY ON DIFFERENT BACKBONES WITH VARYING COMPONENTS ON THE VERI-776 DATASET (IN %). THE TOP THREE RESULTS ARE HIGHLIGHTED IN RED, GREEN AND BLUE, RESPECTIVELY.
Components: (a) view ✗, type ✗, color ✗ | (b) view ✓, type ✗, color ✗ | (c) view ✓, type ✓, color ✗ | (d) view ✓, type ✓, color ✓
Backbone | (a) mAP / rank 1 / rank 5 | (b) mAP / rank 1 / rank 5 | (c) mAP / rank 1 / rank 5 | (d) mAP / rank 1 / rank 5
ResNet-50 | 51.58 / 86.71 / 92.43 | 54.52 / 89.69 / 94.40 | 60.47 / 91.66 / 95.59 | 61.06 / 91.36 / 95.77
VGG16 | 42.35 / 77.77 / 88.14 | 44.17 / 80.63 / 89.57 | 45.43 / 81.17 / 90.35 | 45.62 / 81.76 / 91.12
MobileNet | 52.55 / 86.23 / 94.10 | 54.48 / 87.60 / 93.92 | 58.49 / 89.15 / 94.64 | 59.23 / 89.45 / 94.87
Inception-v4 | 49.78 / 84.62 / 91.90 | 52.74 / 87.66 / 93.68 | 59.49 / 89.27 / 94.76 | 60.50 / 89.51 / 95.47

VI. CONCLUSION
In this paper, we have proposed a novel end-to-end deep convolutional network to jointly learn deep features, camera views, types and colors for vehicle Re-ID. We expand the backbone of ResNet-50 with three consolidated subnetworks incorporating the view, type and color cues respectively. These three tasks benefit each other and learn an informative and discriminative representation for vehicle Re-ID. Furthermore, we have increased the diversity of views of the vehicle images via a view-specified generative adversarial network. By jointly learning the deep features, camera views, vehicle types and vehicle colors in a single unified framework, our method achieves superior performance compared with the state-of-the-art methods. Comprehensive evaluation on two benchmark datasets demonstrates the clear contribution of each subnetwork and the capability of learning informative representations for vehicle Re-ID.

ACKNOWLEDGEMENT
This work was partially supported by the National Natural Science Foundation of China (61502006, 61702002 and 61872005).
REFERENCES

[1] Z. Wang, L. Tang, X. Liu, Z. Yao, S. Yi, J. Shao, J. Yan, S. Wang, H. Li, and X. Wang, "Orientation invariant feature embedding and spatial temporal regularization for vehicle re-identification," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 379–387.
[2] Y. Shen, T. Xiao, H. Li, S. Yi, and X. Wang, "Learning deep neural networks for vehicle re-id with visual-spatio-temporal path proposals," in IEEE International Conference on Computer Vision (ICCV), 2017, pp. 1918–1927.
[3] X. Liu, W. Liu, T. Mei, and H. Ma, "A deep learning-based approach to progressive vehicle re-identification for urban surveillance," in European Conference on Computer Vision. Springer, 2016, pp. 869–884.
[4] ——, "Provid: Progressive and multimodal vehicle reidentification for large-scale urban surveillance," in IEEE Transactions on Multimedia, 2018, pp. 645–658.
[5] Z. Zheng, L. Zheng, and Y. Yang, "Unlabeled samples generated by GAN improve the person re-identification baseline in vitro," in IEEE International Conference on Computer Vision (ICCV), Oct 2017.
[6] W. Li, R. Zhao, T. Xiao, and X. Wang, "Deepreid: Deep filter pairing neural network for person re-identification," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014, pp. 152–159.
[7] J. Liu, B. Ni, Y. Yan, P. Zhou, S. Cheng, and J. Hu, "Pose transferrable person re-identification," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 4099–4108.
[8] L. Wei, S. Zhang, W. Gao, and Q. Tian, "Person transfer GAN to bridge domain gap for person re-identification," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 79–88.
[9] W. Deng, L. Zheng, Q. Ye, G. Kang, Y. Yang, and J. Jiao, "Image-image domain adaptation with preserved self-similarity and domain-dissimilarity for person reidentification," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 994–1003.
[10] P. Chen, X. Xu, and C. Deng, "Deep view-aware metric learning for person re-identification," in International Joint Conference on Artificial Intelligence (IJCAI), 2018, pp. 620–626.
[11] C. Su, F. Yang, S. Zhang, Q. Tian, L. S. Davis, and W. Gao, "Multi-task learning with low rank attribute embedding for person re-identification," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 3739–3747.
[12] S. Khamis, C.-H. Kuo, V. K. Singh, V. D. Shet, and L. S. Davis, "Joint learning for attribute-consistent person re-identification," in European Conference on Computer Vision. Springer, 2014, pp. 134–146.
[13] E. Ahmed, M. Jones, and T. K. Marks, "An improved deep learning architecture for person re-identification," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3908–3916.
[14] S. Liao, Y. Hu, X. Zhu, and S. Z. Li, "Person re-identification by local maximal occurrence representation and metric learning," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 2197–2206.
[15] X. Liu, W. Liu, H. Ma, and H. Fu, "Large-scale vehicle re-identification in urban surveillance videos," in IEEE International Conference on Multimedia and Expo (ICME), 2016, pp. 1–6.
[16] Y. Li, Y. Li, H. Yan, and J. Liu, "Deep joint discriminative learning for vehicle re-identification and retrieval," in IEEE International Conference on Image Processing (ICIP). IEEE, 2017, pp. 395–399.
[17] Y. Zhang, D. Liu, and Z.-J. Zha, "Improving triplet-wise training of convolutional neural network for vehicle re-identification," in IEEE International Conference on Multimedia and Expo (ICME), 2017, pp. 1386–1391.
[18] J. Zhu, Y. Du, Y. Hu, L. Zheng, and C. Cai, "Vrsdnet: Vehicle re-identification with a shortly and densely connected convolutional neural network," in Multimedia Tools and Applications, 2018, pp. 1–15.
[19] M. S. Sarfraz, A. Schumann, A. Eberle, and R. Stiefelhagen, "A pose-sensitive embedding for person re-identification with expanded cross neighborhood re-ranking," arXiv preprint arXiv:1711.10378, 2017.
[20] H. Zhao, M. Tian, S. Sun, J. Shao, J. Yan, S. Yi, X. Wang, and X. Tang, "Spindle net: Person re-identification with human body region guided feature decomposition and fusion," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1077–1085.
[21] L. Zhao, X. Li, Y. Zhuang, and J. Wang, "Deeply-learned part-aligned representations for person re-identification," in ICCV, 2017, pp. 3239–3248.
[22] C. Su, J. Li, S. Zhang, J. Xing, W. Gao, and Q. Tian, "Pose-driven deep convolutional model for person re-identification," in Computer Vision (ICCV), 2017 IEEE International Conference on. IEEE, 2017, pp. 3980–3989.
[23] Y. Zhou and L. Shao, "Cross-view GAN based vehicle generation for re-identification," in Proceedings of the British Machine Vision Conference (BMVC), 2017, pp. 1–12.
[24] ——, "Aware attentive multi-view inference for vehicle re-identification," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6489–6498.
[25] C. Su, S. Zhang, J. Xing, W. Gao, and Q. Tian, "Deep attributes driven multi-camera person re-identification," in European Conference on Computer Vision. Springer, 2016, pp. 475–491.
[26] R. Layne, T. M. Hospedales, S. Gong, and Q. Mary, "Person re-identification by attributes," in The British Machine Vision Conference (BMVC), 2012, p. 8.
[27] Z. Zhong, L. Zheng, Z. Zheng, S. Li, and Y. Yang, "Camera style adaptation for person re-identification," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 5157–5166.
[28] Z. Zheng, L. Zheng, and Y. Yang, "A discriminatively learned CNN embedding for person reidentification," ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), p. 13, 2017.
[29] Y.-C. Chen, X. Zhu, W.-S. Zheng, and J.-H. Lai, "Person re-identification by camera correlation aware feature augmentation," IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 392–408, 2018.
[30] A. Wu, W.-S. Zheng, and J.-H. Lai, "Robust depth-based person re-identification," IEEE Transactions on Image Processing, pp. 2588–2603, 2017.
[31] X. Zhu, B. Wu, D. Huang, and W.-S. Zheng, "Fast open-world person re-identification," IEEE Transactions on Image Processing, pp. 2286–2300, 2018.
[32] X. Wang, S. Zheng, R. Yang, B. Luo, and J. Tang, "Pedestrian attribute recognition: A survey," arXiv preprint arXiv:1901.07474, 2019.
[33] G. Haiyun, Z. Chaoyang, L. Zhiwei, W. Jinqiao, L. Hanqing et al., "Learning coarse-to-fine structured feature embedding for vehicle re-identification," in Association for the Advancement of Artificial Intelligence (AAAI), 2018, pp. 1–8.
[34] Z. Wu, Y. Li, and R. J. Radke, "Viewpoint invariant human re-identification in camera networks using pose priors and subject-discriminative features," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 5, pp. 1095–1108, 2015.
[35] L. Zheng, Y. Huang, H. Lu, and Y. Yang, "Pose invariant embedding for deep person re-identification," arXiv preprint arXiv:1701.07732, 2017.
[36] J. Prokaj and G. Medioni, "3-d model based vehicle recognition," in Applications of Computer Vision (WACV), 2009 Workshop on. IEEE, 2009, pp. 1–7.
[37] Y. Lin, L. Zheng, Z. Zheng, Y. Wu, and Y. Yang, "Improving person re-identification by attribute and identity learning," arXiv preprint arXiv:1703.07220, 2017.
[38] C. Su, S. Zhang, J. Xing, W. Gao, and Q. Tian, "Multi-type attributes driven multi-camera person re-identification," Pattern Recognition, pp. 77–89, 2018.
[39] S. Li, T. Xiao, H. Li, B. Zhou, D. Yue, and X. Wang, "Person search with natural language description," arXiv preprint arXiv:1702.05729, 2017.
[40] Z. Yin, W.-S. Zheng, A. Wu, H.-X. Yu, H. Wan, X. Guo, F. Huang, and J. Lai, "Adversarial attribute-image person re-identification," in International Joint Conference on Artificial Intelligence (IJCAI), 2018, pp. 1100–1106.
[41] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in International Conference on Neural Information Processing Systems (NIPS), 2014, pp. 2672–2680.
[42] Y. Li, L. Song, X. Wu, R. He, and T. Tan, "Anti-makeup: Learning a bi-level adversarial network for makeup-invariant face verification," 2018.
[43] R. Huang, S. Zhang, T. Li, and R. He, "Beyond face rotation: Global and local perception GAN for photorealistic and identity preserving frontal view synthesis," in IEEE International Conference on Computer Vision (ICCV), Oct 2017.
[44] X. Qian, Y. Fu, W. Wang, T. Xiang, Y. Wu, Y.-G. Jiang, and X. Xue, "Pose-normalized image generation for person re-identification," arXiv preprint arXiv:1712.02225, 2017.
[45] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, "Image-to-image translation with conditional adversarial networks," arXiv preprint, 2017.
[46] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[47] L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, and Q. Tian, "Scalable person re-identification: A benchmark," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 1116–1124.
[48] L. Yang, P. Luo, C. Change Loy, and X. Tang, "A large-scale car dataset for fine-grained categorization and verification," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.