Attributes Guided Feature Learning for Vehicle Re-identification
Aihua Zheng, Xianmin Lin, Chenglong Li, Ran He, and Jin Tang
Abstract—Vehicle Re-ID has recently attracted enthusiastic attention due to its potential applications in smart city and urban surveillance. However, it suffers from large intra-class variation caused by view variations and illumination changes, and from inter-class similarity, especially for different identities with similar appearance. To handle these issues, in this paper we propose a novel deep network architecture, which is guided by meaningful attributes including camera views, vehicle types and colors, for vehicle Re-ID. In particular, our network is trained end-to-end and contains three subnetworks of deep features embedded by the corresponding attributes (i.e., camera view, vehicle type and vehicle color). Moreover, to overcome the shortcoming of limited vehicle images of different views, we design a view-specified generative adversarial network to generate multi-view vehicle images. For network training, we annotate the view labels on the VeRi-776 dataset. Note that one can directly adopt the pre-trained view (as well as type and color) subnetwork on other datasets with only ID information, which demonstrates the generalization of our model. Extensive experiments on the benchmark datasets VeRi-776 and VehicleID suggest that the proposed approach achieves promising performance and yields a new state-of-the-art for vehicle Re-ID.
Index Terms—Vehicle Re-identification, Deep Features, Attributes, View-specified Generative Adversarial Network.
I. INTRODUCTION

VEHICLE re-identification (Re-ID) is a frontier and important research problem in computer vision with many potential applications, such as intelligent transportation, urban surveillance and security, since vehicles are the most important objects in urban surveillance. The aim of vehicle Re-ID is to identify the same vehicle across non-overlapping cameras. Although a license plate can uniquely identify a vehicle, it is scarcely recognizable in real-life surveillance due to motion blur, challenging camera views, low resolution, etc. Some researchers have explored spatial-temporal information [1], [2], [3], [4] to boost the performance of appearance-based vehicle Re-ID. However, it is difficult to obtain complete spatial-temporal information since a vehicle may only appear in a few of the cameras of a large-scale camera network. Therefore, the prevalent vehicle Re-ID methods still focus on appearance-based models.

Extensive works have been dedicated to person Re-ID in the past decade [5], [6], [7], [8], [9], [10], [11], [12], which focus on two mainstreams: (1) Appearance modelling [6], [7], [8], which develops robust feature descriptors to leverage the various changes and occlusions among different camera views.
A. Zheng, X. Lin, C. Li, and J. Tang are with the School of Computer Science and Technology, Anhui University, Hefei, 230601, China. R. He is with the Institute of Automation, Chinese Academy of Sciences, Beijing, 100190, China.
Fig. 1. Demonstration of two major challenges in vehicle Re-ID: 1) Intra-class difference: the same vehicle "ID1" appears totally different under Camera 1 and Camera 3 due to the view variance. 2) Inter-class similarity: the two different vehicles "ID1" under Camera 1 and "ID2" under Camera 2 have extremely similar visual appearance, especially for vehicles from the same manufacturers.
(2) Learning-based methods [5], [13], [14], [10], [11], [12], which learn a metric distance to mitigate the gap between low-level features and high-level semantics. Recently, deep neural networks have made marvellous progress on both feature learning [5], [11], [12] and metric learning [13], [14], [10] for person Re-ID. However, directly employing person Re-ID models for vehicle Re-ID obviously cannot guarantee satisfactory performance, since the appearance of pedestrians and vehicles varies in different manners under different viewpoints.

Although much progress has been made on vehicle Re-ID [15], [16], [3], [4], [17], [18], [2], vehicle Re-ID encounters more challenges in addition to the common challenges in person Re-ID such as occlusion and illumination. The first crucial challenge of vehicle Re-ID is the large intra-class variation caused by the viewpoint variation across different cameras, which has been widely explored in person Re-ID [19], [20], [21], [22]. This issue is even more challenging in vehicle Re-ID since most of the vehicle images under a certain camera are almost in the same viewpoint due to the rigid motion of the vehicles, as shown in Fig. 1. Unfortunately, directly employing methods from person Re-ID for vehicle Re-ID cannot guarantee satisfactory performance since the appearance distributes totally differently between persons and vehicles. Some vehicle Re-ID methods [23], [24] use adversarial learning schemes to generate multi-view images or features from a single image, and can thus address the challenge of view variation to some extent. But they might find it difficult to distinguish different vehicles with very similar appearance. Furthermore, they neglect attribute information, such as type and color, which would be critical cues for boosting the performance of vehicle Re-ID.

The second challenge is the high inter-class similarity, especially for different identities with similar appearance, as shown in Fig. 1. Incorporating attribute information suffices to generate a more discriminative representation for person Re-ID [11], [25], [12], [26]. Therefore, it is essential to learn deep features with the supervision of attributes in vehicle Re-ID, enforcing the same identity to have consistent attributes. Li et al. [16] introduce attribute recognition into the vehicle Re-ID framework and use extra semantic information to assist vehicle identification, especially for different identities with similar appearance. However, none of these methods handles both of the two challenges (intra-class difference and inter-class similarity) simultaneously, and the performance of vehicle Re-ID is thus limited.

In this paper, we attempt to handle the above issues in a unified deep convolutional framework to jointly learn Deep Feature representations guided by the meaningful attributes, including Camera Views, vehicle Types and Colors (DF-CVTC), for vehicle Re-ID. Attribute information has been successfully investigated as mid-level semantics to boost person Re-ID. It can also help vehicle Re-ID in challenging scenarios. First of all, the camera view is one of the key attributes and challenges in Re-ID. As shown in Fig. 2, the query vehicle image may have completely different views from its counterparts under other cameras, such as queries Q1 and Q2 and their right ranks marked as green solid boxes.
Second, vehicle types and colors, as the representative attributes of vehicles, play an important role in vehicle Re-ID, especially for different vehicles with similar appearance. As shown in Fig. 2, the wrong hits of queries Q3 and Q4, which present similar appearance, could be effectively evaded by the vehicle type attribute. Furthermore, vehicles with different colors may present similar shapes (such as the wrong hits at rank 1 and rank 2 of query Q1), similar overall appearance (such as the wrong hits of query Q4), or even similar color (such as query Q5 with white color while the wrong hits at ranks 2-4 are gray) due to illumination changes. Integrating the color attribute may relieve this inter-class similarity. These challenges motivate us to utilize the above attributes to help the classifier distinguish different vehicles with very similar appearance and also identify the same vehicles under different viewpoints.

Meanwhile, as we observed, most of the vehicle images under a certain camera are almost in the same viewpoint due to the rigid motion of vehicles, as shown in Fig. 1, and thus the number of vehicle images with different views is very limited, which brings a big challenge to training deep networks. To handle this issue, we design a view-specified generative adversarial network (VS-GAN) to generate multi-view vehicle images. It is worth noting that we jointly learn deep features, camera views, vehicle types and colors in an end-to-end framework. Finally, we annotate the view labels in the benchmark datasets for network training, which can be directly used for other datasets with only ID labels.

To our best knowledge, this is the first work to collaboratively learn deep features guided by three attributes simultaneously for vehicle Re-ID. In summary, this paper makes the following contributions to vehicle Re-ID and related applications:

• It proposes a unified attributes guided deep learning framework that jointly learns Deep Feature representations, Camera Views, vehicle Types and Colors (DF-CVTC) for vehicle Re-ID. These components are collaborative with each other, and thus boost the discrimination ability of the learnt representations.

• To enhance the diversity of view data, it develops a vehicle generation model, i.e., VS-GAN, to generate multi-view vehicle images. Together with the synthesized multi-view images and the original single-view images, our network can better mitigate the view difference caused by the cross-view cameras.

• In addition to the type and color labels, we annotate the view labels on the benchmark dataset of vehicle Re-ID, i.e., VeRi-776, for view predictor training, which can be easily employed for situations with only ID information in vehicle Re-ID. We will release the annotation information of the view labels to the public for free academic usage.

• Comprehensive experimental evaluations on two benchmark datasets, i.e., VeRi-776 and VehicleID, demonstrate the promising performance of the proposed method, which yields a new state-of-the-art for vehicle Re-ID.

II. RELATED WORK
A. Vehicle Re-ID
With great progress in person Re-ID [27], [28], [29], [30], [31], [32], vehicle Re-ID has gradually gained increasing attention recently since vehicles are the most important objects in urban surveillance. Liu et al. [15], [3], [4] released a benchmark dataset, VeRi-776, and considered the vehicle Re-ID task as a progressive recognition process using visual features, license plates and spatial-temporal information. Liu et al. [3] released another big surveillance-nature dataset (VehicleID) and designed a coupled clusters loss to measure the distance between two arbitrary similar vehicles. Zhang et al. [17] designed an improved triplet-wise training with a classification-oriented loss. Li et al. [16] integrated the identification, attribute recognition, verification and triplet tasks into a unified CNN framework. Liu et al. [33] proposed a coarse-to-fine ranking method consisting of a vehicle model classification loss, a coarse-grained ranking loss, a fine-grained ranking loss and a pairwise loss.

In addition to appearance information, Shen et al. [2] combined visual spatio-temporal path information for regularization, and Wang et al. [1] introduced spatial-temporal regularization into the proposed orientation invariant feature embedding module. However, the large intra-class variation caused by viewpoint variation and the high inter-class similarity, especially for different vehicles, have not been well solved in existing works.
Fig. 2. Benefits of camera views, types and colors on vehicle Re-ID. The brown dash box indicates five vehicles characterized by distinct camera views, types and colors, their visualized feature representations, and 2D feature projections of the images from the corresponding identities, where different colors represent different vehicles. The blue dash box demonstrates several ranking results of conventional vehicle Re-ID based on ResNet-50 [19], where the red and green solid boxes in the first 15 ranks indicate the wrong and right matchings respectively. The results show that extra semantics or attributes play a critical role in handling the challenges of vehicle Re-ID.
B. View-aware Re-ID
Viewpoint changes introduce a large portion of the intra-class variation in person Re-ID. Zhao et al. [20] proposed a novel method guided by human body regions for person Re-ID, which boosts the performance well. Wu et al. [34] proposed an approach called pose prior to make identification more robust to viewpoint. Zheng [35] introduced the PoseBox structure, which is generated through pose estimation followed by affine transformations.

This issue is even more crucial in vehicle Re-ID, since the viewpoints of the images are almost the same due to the rigid motion of the vehicles. Wang et al. [1] proposed orientation invariant feature embedding to address the influence of viewpoint variation. Prokaj et al. [36] presented a method based on pose estimation to deal with multiple viewpoints.
C. Attribute Embedded Re-ID
Attributes have been extensively investigated as mid-level semantic information to boost person Re-ID. Su et al. [11] introduced a low-rank attribute embedding into a multi-task learning framework for person Re-ID. Khamis et al. [12] jointly optimized the attribute classification loss and triplet loss for person Re-ID. Lin et al. [37] integrated the identification loss and attribute prediction into a simple ResNet framework and annotated the pedestrian attributes in two benchmark person Re-ID datasets, Market-1501 and DukeMTMC-reID. Su et al. [38] proposed a weakly supervised multi-type attribute learning framework based on the triplet loss by pre-training the attribute predictor on independent data. In contrast to the previous works focusing on image-based queries, Li et al. [39] and Yin et al. [40] investigated attribute-based queries for person retrieval and Re-ID tasks.

In vehicle Re-ID, Li et al. [16] introduced attribute recognition into the vehicle Re-ID framework together with the verification loss and triplet loss. However, how to incorporate view-aware identification and attribute recognition into a unified framework has still not been investigated.
D. GAN-based Re-ID
As one of the hottest research directions in the current deep learning field, GAN [41] has been intensely explored in image generation [42], data enhancement [43], style migration [27] and other aspects. Recently, some works have also started to develop GANs for person Re-ID. Zheng et al. [5] explored GANs to generate new unlabeled samples for data augmentation in person Re-ID. Zhong et al. [27] introduced a method named camera style (CamStyle), which can be viewed as a data augmentation approach that smooths the camera style disparities. Qian et al. [44] used a GAN to generate eight pre-defined poses for each image, which augments the data and addresses the viewpoint variation to some extent. Liu et al. [7] transferred various person pose instances from one dataset to another to improve the generalization ability of the model. Wei et al. [8], [9] proposed a GAN model to bridge the domain gap among different person Re-ID datasets.

Some researchers proposed to use GANs to generate multi-view images or features to relieve the view variation in vehicle Re-ID. Zhou et al. [23] designed a conditional generative network to obtain cross-view images from input view pairs to address the vehicle Re-ID task. Later on, Zhou et al. [24] proposed a Viewpoint-aware Attentive Multi-view Inference (VAMI) model to infer multi-view features from single-view image inputs. They used image pairs for training, while our method employs a classification CNN and jointly learns the deep features and the camera views, vehicle types and colors. By learning the view and attribute specified deep features, our method is superior to the above methods.

III. PROPOSED NETWORK ARCHITECTURE
In this paper, we propose a novel deep network architecture, which embeds attribute information, including camera views, vehicle types and colors, for vehicle Re-ID. We elaborate the proposed method in this section.
A. Architecture Overview
The overall architecture is demonstrated in Fig. 3. Our proposed architecture mainly consists of two parts: the view transform model and the vehicle Re-ID model. The view transform model consists of a view-specified GAN that generates multi-view vehicle images to relieve the view variations. The vehicle Re-ID model is composed of one backbone, three guiding subnetworks, and the embedding layers. We discuss these parts one by one for clarity.
B. Backbone
In our work, we adopt ResNet-50 as the baseline network for the backbone. One can also configure other networks such as Inception-v4, VGG16 and MobileNet architectures without limitation. As for ResNet-50, we use the first three residual blocks of ResNet-50 as our backbone as shown in Fig. 3, due to its compelling performance with deeper layers by residual learning. We denote the parameters of this network as $\theta_0$.
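To make this concrete, the following is a minimal sketch, assuming a torchvision ResNet-50 whose `layer1`-`layer3` stages are treated as Block-1 to Block-3; the exact cut point and input size used by the authors are not specified, so the details are illustrative only.

```python
import torch
import torch.nn as nn
from torchvision import models

def build_backbone():
    """Keep the stem and the first three residual stages of ResNet-50.

    The cut point is an assumption based on "the first three residual blocks";
    Block-1 (layer1) output feeds the attribute predictors and Block-3 (layer3)
    output feeds the view/type/color feature units (see Sec. III-C).
    """
    resnet = models.resnet50(pretrained=True)
    return nn.Sequential(
        resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool,
        resnet.layer1,   # Block-1
        resnet.layer2,   # Block-2
        resnet.layer3,   # Block-3
    )

if __name__ == "__main__":
    feats = build_backbone()(torch.randn(1, 3, 224, 224))
    print(feats.shape)   # torch.Size([1, 1024, 14, 14])
```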
C. Subnetworks

As shown in Fig. 3, each subnetwork consists of a predictor part and a feature extraction part, with inputs of the feature maps from Block-1 and Block-3 of the backbone, respectively. The predictor is composed of three convolutional (Conv) layers and one fully-connected (FC) layer, which outputs a probability distribution over the corresponding (view, type or color) values. The kernel sizes of the three Conv layers are ×, × and ×, respectively, and the strides for these kernels are 3, 2 and 1, respectively. We use ReLU activation in all three layers and add a batch normalization layer after each Conv layer. The resulting feature vector is fed into the following FC layer to predict the attribute scores via a $K$-way softmax. The feature extractor is composed of $K$ units, each of which is a Conv-net responsible for extracting high-level features corresponding to one of the $K$ view or attribute classes. We use Block-4 of ResNet-50 as the feature extractor.

The features from each specific feature extractor $E^{\Phi}$ can be formulated as
$$f_k^{\Phi} = E^{\Phi}(x; \alpha^{\Phi}) \qquad (1)$$
where $\Phi \in \{view, type, color\}$ and $k = 1, 2, \dots, K^{\Phi}$. $K^{\Phi}$ is the number of corresponding units, which also indicates the possible classes of each view or attribute, $x$ is an image, and $\alpha^{\Phi}$ denotes the parameters of $E^{\Phi}$.

The probability distribution $w^{\Phi}$ over the corresponding view or attribute values from the predictor network $P^{\Phi}$ is
$$w^{\Phi} = P^{\Phi}(x; \beta^{\Phi}) \qquad (2)$$
where $\beta^{\Phi}$ denotes the parameters of $P^{\Phi}$, which are learnt using the cross-entropy loss $L^{\Phi}$,
$$L^{\Phi} = -\sum_{k=1}^{K^{\Phi}} \log\big(w^{\Phi}(k)\big)\, q^{\Phi}(k) \qquad (3)$$
where $q^{\Phi}$ is a one-hot vector of the ground truth of the corresponding view or attribute values.

After progressively learning the three subnetworks, we achieve the specific feature maps via
$$F^{\Phi} = \big(f_1^{\Phi} \odot w_1^{\Phi}\big) \oplus \cdots \oplus \big(f_{K^{\Phi}}^{\Phi} \odot w_{K^{\Phi}}^{\Phi}\big) \qquad (4)$$
where $\oplus$ denotes the element-wise sum, and $\odot$ denotes the element-wise multiplication.

The joint deep features with camera view, type and color are achieved as the fusion of the feature maps of the three subnetworks,
$$F = F^{view} \oplus F^{type} \oplus F^{color} \qquad (5)$$
where $F$ is the fused deep feature containing the complementary view and attribute information. Next, we describe the details of each subnetwork as follows.
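The weighting scheme of Eqs. (1)-(5) can be sketched as follows; the channel sizes, the predictor's kernel sizes and the simple convolutional units below are illustrative assumptions rather than the authors' exact implementation (the paper uses Block-4 of ResNet-50 as the feature units).

```python
import torch
import torch.nn as nn

class AttributeSubnetwork(nn.Module):
    """One subnetwork (view, type or color): a predictor P and K feature units E."""

    def __init__(self, num_classes, pred_in_ch=256, feat_in_ch=1024, feat_out_ch=2048):
        super().__init__()
        # Predictor: three Conv+BN+ReLU layers (strides 3, 2, 1) and an FC layer.
        self.predictor = nn.Sequential(
            nn.Conv2d(pred_in_ch, 128, kernel_size=3, stride=3), nn.BatchNorm2d(128), nn.ReLU(),
            nn.Conv2d(128, 128, kernel_size=3, stride=2), nn.BatchNorm2d(128), nn.ReLU(),
            nn.Conv2d(128, 128, kernel_size=3, stride=1), nn.BatchNorm2d(128), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, num_classes),
        )
        # K feature units, one per view/attribute class (stand-ins for Block-4 copies).
        self.units = nn.ModuleList([
            nn.Sequential(nn.Conv2d(feat_in_ch, feat_out_ch, 3, padding=1), nn.ReLU())
            for _ in range(num_classes)
        ])

    def forward(self, block1_feat, block3_feat):
        w = torch.softmax(self.predictor(block1_feat), dim=1)     # Eq. (2)
        feats = [unit(block3_feat) for unit in self.units]        # Eq. (1)
        # Eq. (4): weight each unit's feature map by its score and sum them.
        fused = sum(w[:, k].view(-1, 1, 1, 1) * feats[k] for k in range(len(feats)))
        return fused, w

# Eq. (5): F = F_view + F_type + F_color (element-wise sum of the three subnetwork outputs).
```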
1) View subnetwork:
Viewpoint changes bring a crucial challenge for the Re-ID task. We use the view subnetwork to incorporate the view information into the Re-ID model. The view predictor predicts $K^{view}$-way softmax scores, which are used to weight the output of each corresponding view unit. In this paper, $K^{view} = 5$, indicating the five viewpoints front, front_side, side, rear_side and rear. For instance, for a training sample in the rear orientation to the camera, the corresponding view unit will be assigned a strong weight and updated strongly during back propagation.
Fig. 3. Overview of our DF-CVTC. The view transform model generates multi-view images based on a view-specified GAN. Both the original (blue box) and the generated images (red boxes) are fed into the vehicle Re-ID model, which consists of one backbone (the first three blocks of ResNet-50), three subnetworks and one embedding network.
2) Type subnetwork:
Type is useful to distinguish vehicles with similar appearance, which can relieve the inter-class similarity. In the same manner as the view subnetwork, we use the type subnetwork to learn attribute-specific deep features. The $K^{type}$ scores predicted by the type predictor are used to weight the output of each corresponding type unit in the type-specific feature extractor. In this paper, we set $K^{type} = 9$, indicating 9 types of vehicles: sedan, suv, van, hatchback, mpv, pickup, bus, truck and estate. Similar to the view-specific feature extractor, each type unit will learn a feature map specialized for one of the $K^{type}$ types.
3) Color subnetwork:
Color is another discriminative attribute of vehicles. Therefore, we analogously use the color subnetwork to learn color-specific features. The color predictor predicts the color scores of the vehicle, which are then used to weight each color unit. In our implementation, we set $K^{color} = 10$, denoting 10 colors of vehicles: yellow, orange, green, gray, red, blue, white, golden, brown and black. The color-specific feature extractor is designed in the same manner as in the view and type subnetworks.

D. Embedding Layers
The embedding layers consist of two FC layers (we denote the parameters of this embedding as $\theta_1$). They embed the fused feature $F$ in Eq. (5) into the higher-level joint deep feature $F^{joint}$, which is used for the final Re-ID task.

In order to train the Re-ID model, we add a softmax layer to the embedding network for ID classification. We use the cross-entropy loss $L^{id}$ for model training,
$$L^{id} = -\sum_{n=1}^{N} \log\big(p^{id}(n)\big)\, q^{id}(n) \qquad (6)$$
where $N$ is the number of vehicle IDs in the training set, $q^{id}$ is the one-hot ground truth of the ID label of the vehicle, and $p^{id}(n)$ is the predicted probability indicating the ID of the input vehicle image,
$$p^{id} = \mathrm{softmax}(F^{joint}). \qquad (7)$$

Fig. 4 demonstrates the effectiveness of the jointly learnt deep features of the proposed DF-CVTC. We can observe that the vehicle images of the same identity fall into the same cluster regardless of the different visible appearance caused by different camera views (as shown in Fig. 4 (a) and (b)) or illumination changes (as shown in Fig. 4 (c)).
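A minimal sketch of the embedding layers and the ID classification loss of Eqs. (6)-(7) could look as follows; the input and embedding dimensions are assumptions, while 576 matches the number of training identities in VeRi-776.

```python
import torch
import torch.nn as nn

class EmbeddingHead(nn.Module):
    """Two FC embedding layers plus a softmax ID classifier (dimensions assumed)."""

    def __init__(self, in_dim=2048, embed_dim=1024, num_ids=576):
        super().__init__()
        self.embed = nn.Sequential(
            nn.Linear(in_dim, embed_dim), nn.ReLU(),
            nn.Linear(embed_dim, embed_dim),
        )
        self.classifier = nn.Linear(embed_dim, num_ids)   # softmax layer for ID classification

    def forward(self, fused_feat):
        f_joint = self.embed(fused_feat)       # F_joint, used for the final Re-ID matching
        return f_joint, self.classifier(f_joint)

# Eq. (6) is then the usual cross-entropy over the N training identities:
head = EmbeddingHead(num_ids=576)              # 576 = number of training IDs in VeRi-776
fused = torch.randn(8, 2048)                   # globally pooled fused feature F (assumed shape)
ids = torch.randint(0, 576, (8,))
_, logits = head(fused)
loss_id = nn.CrossEntropyLoss()(logits, ids)
```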
E. View-specified GAN

Due to the rigid motion of the vehicles, the images under a certain camera are almost in a single viewpoint, which brings a big challenge to vehicle Re-ID in wild conditions. Therefore, we design a view-specified GAN to generate multi-view images. In this paper, we simply employ pix2pix [45] for its generality. The generation architecture is illustrated in Fig. 5.
Fig. 4. Demonstration of the DF-CVTC features. (a), (b) and (c) denote three vehicle pairs sampled from the VeRi-776 dataset under two distinct camera views and their corresponding learned probabilities of varying camera views, types and colors, where the visible appearances are distinct due to the different camera views or illumination changes. (d) illustrates the 2D feature projections of the vehicle images learnt by the proposed DF-CVTC. (e) represents the corresponding annotation categories of camera views, types and colors.
Fig. 5. The architecture of VS-GAN, based on the architecture of pix2pix [45]. For the input single-view vehicle image $V_i$ (front view as shown), VS-GAN aims to synthesize a vehicle image $V_j$ with the same view as the target vehicle image $R_i$ (front_side view as shown).

Specifically, given an input vehicle image $V_i$ and a target vehicle image $R_i$ with a different view, our VS-GAN aims to generate a new vehicle image $V_j$ with the same view as $R_i$. VS-GAN consists of a generator $G_v$, which learns a mapping conditional on the given target, and a discriminator $D_v$, which discriminates real data samples from the generated samples, such that the distribution of image $V_j$ is indistinguishable from the distribution of image $V_i$. The loss function can be expressed as
$$L(G_v, D_v) = \mathbb{E}_{V_i, R_i}\big[\log D_v(V_i, R_i)\big] + \mathbb{E}_{V_i, R_i}\big[\log\big(1 - D_v(V_i, G_v(V_i, R_i))\big)\big] + \lambda\, \mathbb{E}_{V_i, R_i}\big[\| V_i - G_v(V_i, R_i) \|_1\big] \qquad (8)$$
where $G_v$ tries to minimize this objective against an adversarial $D_v$ that tries to maximize it, the $\ell_1$ distance is used to encourage less blurring, and $\lambda$ is the weighting coefficient. Fig. 6 demonstrates several examples of synthesizing front-view vehicle images to the front_side view on the VeRi-776 dataset via VS-GAN. One can generate multi-view images by altering the target images with different views.
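As a hedged sketch of how the objective in Eq. (8) could be optimized, assuming pix2pix-style generator and discriminator modules (stand-ins here, with the discriminator outputting probabilities) and an assumed value for λ, which the paper does not specify:

```python
import torch
import torch.nn as nn

bce = nn.BCELoss()
l1 = nn.L1Loss()

def vsgan_losses(G, D, v_i, r_i, lam=100.0):
    """Generator/discriminator losses corresponding to Eq. (8).

    G and D are stand-in modules (the paper adopts the pix2pix architecture);
    lam is an assumed weighting coefficient.
    """
    fake = G(v_i, r_i)                                        # synthesized image V_j

    # Discriminator: real pair (V_i, R_i) -> 1, generated pair -> 0.
    d_real = D(v_i, r_i)
    d_fake = D(v_i, fake.detach())
    loss_d = bce(d_real, torch.ones_like(d_real)) + bce(d_fake, torch.zeros_like(d_fake))

    # Generator: fool the discriminator, plus the l1 term of Eq. (8) against V_i.
    d_gen = D(v_i, fake)
    loss_g = bce(d_gen, torch.ones_like(d_gen)) + lam * l1(fake, v_i)
    return loss_g, loss_d
```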
F. Difference from Previous Work

Our method is significantly different from [23], [24], [19] in the following aspects. 1) [23], [24] infer multi-view images or features using adversarial learning. However, they render vehicle Re-ID as a verification task, while our method employs a classification CNN to learn the deep features. Furthermore, our learnt features embed attribute information (type and color) in addition to the view information. 2) [19] incorporates both fine and coarse pose/view information to learn a feature representation and proposes a novel re-ranking method for person Re-ID, while our DF-CVTC further integrates the attribute information and jointly learns the deep features embedded by camera views, vehicle types and colors in an end-to-end framework.

IV. TRAINING DETAILS
A. Progressive Learning
We progressively learn the three subnetworks and fine-tune the Re-ID model, which achieves comparable performance to multi-task learning (minimizing the combination of the four losses). Furthermore, it significantly reduces the computational complexity.
1) View subnetwork training:
We fine-tune the backbone network pre-trained on ImageNet classification [46], and the rest of the Re-ID model is initialized from scratch. First, we minimize $L^{view}$ to obtain $\alpha^{view}$, then we minimize $L^{id}$ to obtain $\{\beta^{view}, \theta_1\}$ while fixing all the other parameters of the Re-ID model.
Fig. 6. Examples of synthesizing vehicle images from the front view to the front_side view on the VeRi-776 dataset via VS-GAN. The first and second rows show the vehicle images in the original front view and the synthesized front_side view, respectively.
2) Type subnetwork training:
We first minimize $L^{type}$ to obtain $\alpha^{type}$, and then minimize $L^{id}$ using $F^{view} \oplus F^{type}$ to obtain $\{\beta^{type}, \theta_1\}$ while fixing all the other parameters of the Re-ID model.
3) Color subnetwork training:
In the same manner, we first minimize $L^{color}$ to obtain $\alpha^{color}$, then minimize $L^{id}$ using $F^{view} \oplus F^{type} \oplus F^{color}$ to obtain $\{\beta^{color}, \theta_1\}$ while fixing all the other parameters of the Re-ID model.
4) Joint learning:
After training the three subnetworks, we fine-tune $\{\alpha^{\Phi}, \beta^{\Phi}, \theta_0, \theta_1\}$, $\Phi \in \{view, type, color\}$, of the whole Re-ID model by minimizing $L^{id}$ until convergence.
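The schedule above can be summarized by the following sketch, which only selects the parameter groups of each stage; the module names (`model.subnets`, `model.embedding`) are hypothetical and the actual optimization loops are omitted.

```python
import torch.nn as nn

def set_trainable(module: nn.Module, flag: bool) -> None:
    """Freeze or unfreeze a module by toggling requires_grad on its parameters."""
    for p in module.parameters():
        p.requires_grad = flag

def progressive_stage(model: nn.Module, attribute: str):
    """Parameter groups for one progressive stage (attribute in {'view','type','color'}).

    Each stage runs two optimisations in sequence: the attribute loss on the
    feature units (alpha), then L_id on the predictor and embedding (beta, theta_1),
    with everything else frozen.
    """
    set_trainable(model, False)
    subnet = model.subnets[attribute]
    set_trainable(subnet.units, True)                   # step (i): minimise L_attr
    units_params = list(subnet.units.parameters())
    set_trainable(subnet.predictor, True)               # step (ii): minimise L_id
    set_trainable(model.embedding, True)
    id_params = list(subnet.predictor.parameters()) + list(model.embedding.parameters())
    return units_params, id_params

# Final stage: unfreeze everything and fine-tune the whole model with L_id.
```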
B. Implementation Details

In practice we use a stochastic approximation of the objective since the training set is quite large. The training set is stochastically divided into mini-batches of 16 samples. The network performs forward propagation on the current mini-batch, followed by backpropagation to compute the gradients with the simple cross-entropy loss for updating the network parameters. We use the Adam optimizer at the recommended parameters with an initial learning rate of 0.0001 and a decay of 0.96 every epoch. With more passes over the training data, the model improves until it converges. To reduce overfitting, we artificially augment the data by performing random 2D translation following the same protocol as in [6]: for an original image of size $[W, H]$, we resize it to $[\,\cdot W, \cdot H\,]$. We also horizontally flip each image.
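A minimal sketch of this setup, assuming a 224×224 network input and an illustrative enlargement factor of 1.1 for the random-translation augmentation (neither value is stated explicitly above):

```python
import torch
from torchvision import transforms

# Random 2D translation via enlarge-then-crop, plus horizontal flip.
# 224x224 and the factor 1.1 are assumptions for illustration.
IMG_H, IMG_W = 224, 224
train_transform = transforms.Compose([
    transforms.Resize((int(1.1 * IMG_H), int(1.1 * IMG_W))),
    transforms.RandomCrop((IMG_H, IMG_W)),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

def make_optimizer(model):
    """Adam with the stated initial learning rate and a per-epoch decay of 0.96."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.96)
    return optimizer, scheduler   # call scheduler.step() once per epoch; batch size is 16
```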
V. EXPERIMENTS

We carry out a comprehensive evaluation of the proposed DF-CVTC against the state-of-the-art methods on two public vehicle Re-ID datasets, VeRi-776 [15] and VehicleID [3]. We use the Cumulative Matching Characteristics (CMC) curves and mAP to evaluate our results [3]. The type and color labels are available in VeRi-776; therefore, we annotate the view labels for network training. On VehicleID, we directly employ the view, type and color subnetworks pretrained on VeRi-776, and only ID labels are used.
A. State-of-the-art Methods
All the compared state-of-the-art methods are briefly introduced as follows:
(1) LOMO [14]. Local Maximal Occurrence (LOMO) proposes an effective feature representation against viewpoint changes for person Re-ID.
(2) BOW-CN [47]. Bag-of-Words based Color Name (CN) features.
(3) GoogLeNet [48]. Pre-trained on ImageNet [46] and fine-tuned on the CompCars dataset [48] to extract discriminative semantic feature representations.
(4) FACT [15]. Fusion of Attributes and Color features discriminates vehicles by jointly learning low-level color features and high-level semantic attributes, such as SIFT, Color Name and GoogLeNet features.
(5) FACT+Plate-SNN+STR [3]. FACT [15] with additional plate verification based on a Siamese Neural Network and spatio-temporal relations (STR).
(6) Siamese-Visual [2]. Only visual (appearance) information is used to compute the similarity between the input pair.
(7) Siamese-CNN+Path-LSTM [2]. Combines a Siamese CNN with a Path-LSTM, which estimates the validness score of the visual-spatio-temporal path.
(8) NuFACT [4]. The null-space-based Fusion of Attribute and Color features method, which integrates multi-level appearance features of vehicles, i.e., texture, color and high-level attribute features.
(9) VAMI [24]. VAMI transforms the single-view feature into a global feature representation that contains multi-view feature information, followed by distance metric learning in the global feature space.
(10) C2F-Rank [33]. C2F-Rank designs a coarse-to-fine ranking loss consisting of a vehicle model classification loss, a coarse-grained ranking loss, a fine-grained ranking loss and a pairwise loss.
B. Experiments on VeRi-776 Dataset

1) Setting:
The VeRi-776 dataset contains 776 identities collected by 20 cameras in a real-world traffic surveillance environment. The whole dataset is split into 576 identities with 37,778 images for training and 200 identities with 11,579 images for testing. An additional set of 1,678 images selected from the test identities is used as query images. In order to evaluate the view subnetwork, we annotate all the vehicle images in the VeRi-776 dataset with five viewpoints: front, front_side, side, rear_side and rear. We follow the evaluation protocol in [3]. We use the mean average precision (mAP) metric for evaluation: we first calculate the average precision for each query, and the mAP is then obtained as the mean of these average precisions. The cumulative match curve (CMC) metric is also used for evaluation: we first sort the Euclidean distances between each query and each gallery image in ascending order, and the CMC curve is then obtained by averaging over the sorted results. Note that only vehicles from non-overlapping cameras are counted during evaluation.
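For reference, a simplified sketch of this mAP/CMC protocol, assuming NumPy arrays for the IDs and camera labels and at least `max_rank` valid gallery images per query (production implementations typically also handle junk images):

```python
import numpy as np

def evaluate(dist, query_ids, gallery_ids, query_cams, gallery_cams, max_rank=50):
    """Compute mAP and the CMC curve from a (num_query x num_gallery) distance matrix.

    Gallery images that share the query's camera are excluded, and ranking is by
    ascending Euclidean distance, as described above.
    """
    all_ap, all_cmc = [], []
    for i in range(dist.shape[0]):
        order = np.argsort(dist[i])                      # ascending distance
        keep = gallery_cams[order] != query_cams[i]      # only non-overlapping cameras
        matches = (gallery_ids[order][keep] == query_ids[i]).astype(np.int32)
        if matches.sum() == 0:                           # no valid ground truth for this query
            continue
        cmc = matches.cumsum()
        cmc[cmc > 1] = 1                                 # CMC: 1 from the first hit onwards
        all_cmc.append(cmc[:max_rank])
        hit_positions = np.where(matches == 1)[0]
        precisions = [(k + 1) / (pos + 1) for k, pos in enumerate(hit_positions)]
        all_ap.append(np.mean(precisions))               # average precision for this query
    return float(np.mean(all_ap)), np.stack(all_cmc).mean(axis=0)
```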
2) Qualitative examples:
Fig. 7 shows qualitative examples of six ranking results of our DF-CVTC on the VeRi-776 dataset, from which we can observe that our method successfully hits the vehicles with large view variations with respect to the query, such as rank 2 and ranks 10-11 in Fig. 7 (a), rank 4 and ranks 8-9 in Fig. 7 (b), rank 8 in Fig. 7 (c), and rank 1 and rank 6 in Fig. 7 (d). The wrong hits generally result from the high inter-class similarity with homologous visual appearance, such as ranks 4-5, rank 9 and rank 12 in Fig. 7 (a), rank 10 in Fig. 7 (c), ranks 7-9 in Fig. 7 (e), and ranks 2-3 and rank 5 in Fig. 7 (f). Fig. 11 shows a further qualitative ranking result of our DF-CVTC on the VeRi-776 dataset: Fig. 11 (b) shows the view/attribute probabilities predicted by each subnetwork, and Fig. 11 (c) shows the ranking results. From Fig. 11, we can see that the ranking results improve as each subnetwork is introduced progressively.

Fig. 7. Examples of ranking results on the VeRi-776 dataset. The green and red boxes indicate the right matchings and the wrong matchings respectively.

Fig. 8. Examples of ranking results on the VehicleID dataset. The green boxes indicate the right matchings. Note that there is only one ground truth vehicle image in the gallery set of the VehicleID dataset.
3) Quantitative results:
Table I reports the performance of our approach compared with the published state-of-the-art methods on the VeRi-776 dataset, while Fig. 9 (a) shows the corresponding CMC curves. From these we can see that our DF-CVTC significantly surpasses the state-of-the-art. Note that we have not utilized any license plate or spatial-temporal information as in Siamese-CNN+Path-LSTM [2] and FACT+Plate-SNN+STR [3]. Even so, our method still achieves superior mAP and ranking accuracies by a large margin. We also investigate the contribution of the view, type and color subnetworks in our model. By progressively introducing these components, both mAP and ranking accuracies increase, verifying the clear contribution of each component.
C. Experiments on VehicleID Dataset

1) Setting:
The VehicleID dataset [3] consists of a training set with 110,178 images of 13,134 vehicles and a test set with 111,585 images of 13,133 vehicles. Following the protocol in [3], we test the VehicleID dataset in three distinct settings with different numbers of testing samples: 800, 1,600 and 2,400. Specifically, since some of the type and color information is missing and there are no view labels in this dataset, we adopt the view, type and color subnetworks pre-trained on the VeRi-776 dataset and fine-tune them during the Re-ID training, which in turn means one can easily apply our model to a dataset with only ID information. The mean average precision (mAP) and cumulative match curve (CMC) are used as the evaluation metrics in the same manner as for VeRi-776. The only difference is that we randomly select an image from the test data as the gallery, while the remaining test images are considered as queries (a sketch of this split is given below). The experimental results are averaged over 10 random trials.

TABLE I
COMPARISONS WITH STATE-OF-THE-ART RE-ID METHODS ON VERI-776 (IN %). THE TOP THREE RESULTS ARE HIGHLIGHTED IN RED, GREEN AND BLUE, RESPECTIVELY.
Method | mAP | rank 1 | rank 5 | reference
(1) LOMO [14] | 9.64 | 25.33 | 46.48 | CVPR 2015
(2) BOW-CN [47] | 12.20 | 33.91 | 53.69 | ICCV 2015
(3) GoogLeNet [48] | 17.89 | 52.32 | 72.17 | CVPR 2015
(4) FACT [15] | 18.49 | 50.95 | 73.48 | ICME 2016
(5) FACT+Plate-SNN+STR [3] | 27.70 | 61.44 | 78.78 | ECCV 2016
(6) Siamese-Visual [2] | 29.48 | 41.12 | 60.31 | ICCV 2017
(7) Siamese-CNN+Path-LSTM [2] |  |  |  | ICCV 2017
+view+type |  |  |  |
+view+type+color (DF-CVTC) |  |  |  |

TABLE II
COMPARISONS WITH STATE-OF-THE-ART RE-ID METHODS ON VEHICLEID (IN %). THE TOP THREE RESULTS ARE HIGHLIGHTED IN RED, GREEN AND BLUE, RESPECTIVELY.
Method | Test Size = 800 (mAP / rank 1 / rank 5) | Test Size = 1600 (mAP / rank 1 / rank 5) | Test Size = 2400 (mAP / rank 1 / rank 5) | reference
(1) LOMO [14] | - / 19.76 / 32.01 | - / 18.85 / 29.18 | - / 15.32 / 25.29 | CVPR 2015
(2) BOW-CN [47] | - / 13.14 / 22.69 | - / 12.94 / 21.09 | - / 10.20 / 17.89 | ICCV 2015
(3) GoogLeNet [48] | 46.20 / 47.88 / 67.18 | 44.00 / 43.40 / 63.86 | 38.10 / 38.27 / 59.39 | CVPR 2015
(4) FACT [15] | - / 49.53 / 68.07 | - / 44.59 / 64.57 | - / 39.92 / 60.32 | ICME 2016
(8) NuFACT [4] | - / 48.90 / 69.51 | - / 43.64 / 65.34 | - / 38.63 / 60.72 | TMM 2018
(9) VAMI [24] | - / 63.12 / 83.25 | - / 52.87 / 75.12 | - / 47.34 / 70.29 | CVPR 2018
(10) C2F-Rank [33] | 63.50 / 61.10 / 81.70 | 60.00 / 56.20 / 76.20 | 53.00 / 51.40 / 72.20 | AAAI 2018
ResNet-50 (baseline) | 70.50 / 67.75 / 79.13 | 68.48 / 65.79 / 76.64 | 66.19 / 63.45 / 74.70 |
+view |  |  |  |
+view+type |  |  |  |
+view+type+color (DF-CVTC) |  |  |  |
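A minimal sketch of this random gallery/query split, assuming one gallery image is drawn per identity as described in [3]:

```python
import random
from collections import defaultdict

def split_gallery_query(image_ids):
    """Random gallery/query split in the VehicleID style: one randomly chosen
    image per identity forms the gallery, all remaining images become queries.

    image_ids: iterable of (image_name, vehicle_id) pairs.
    """
    by_id = defaultdict(list)
    for img, vid in image_ids:
        by_id[vid].append(img)
    gallery, query = [], []
    for vid, imgs in by_id.items():
        g = random.choice(imgs)
        gallery.append((g, vid))
        query.extend((im, vid) for im in imgs if im != g)
    return gallery, query

# Final numbers are averaged over 10 such random splits.
```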
2) Qualitative examples:
Fig. 8 shows six ranking results of our DF-CVTC on VehicleID. From these we can observe that our method can successfully hit the right matching under large intra-class differences caused by illumination/color changes, such as Fig. 8 (a), (c) and (f), as well as viewpoint changes, such as Fig. 8 (b), (c), (d) and (f). The wrong hits at rank 1 in Fig. 8 (b) and (e) result from the inter-class similarity between vehicles; despite this, our method still hits the right matchings in the early ranks. Note that there is only one ground truth vehicle image in the gallery set of the VehicleID dataset.
3) Quantitative results:
Table II reports the performance of our method against the state-of-the-art methods on the VehicleID dataset, while Fig. 9 (b)(c)(d) shows the corresponding CMC curves for each test size. Clearly, our method significantly beats the existing state-of-the-art in mAP, rank 1 and rank 5. By introducing the view, type and color subnetworks progressively, the performance of our method is consistently improved. On the base model of ResNet-50, our DF-CVTC beats VAMI by a large margin; specifically, we increase rank 1 by 12.11%, 19.28% and 23.12% on the three different-scale test sets respectively. Our DF-CVTC also beats C2F-Rank by a large margin; specifically, we increase mAP by 14.53%, 14.87% and 20.15% on the three different-scale test sets respectively.
D. Ablation Study

1) Analysis on VS-GAN:
The designed VS-GAN suffices to generate vehicle images for the other four viewpoints, as shown in Fig. 3. Due to the computational complexity, we simply transferred 1,400 front-view vehicle images into front_side-view images for training data augmentation on the VeRi-776 dataset, as shown in Fig. 6. Fig. 10 shows the ablation study of VS-GAN. From it we can see that, by augmenting even only 1,400 synthetic multi-view images into the 37,729 training samples in total, VS-GAN can benefit the Re-ID model with its various components. Moreover, it further boosts the contribution of the view subnetwork in both mAP and rank 1, compared to the corresponding improvements of DF-CVTC without VS-GAN (see Fig. 10). We believe that more generated images with more viewpoints will further boost the performance.

Fig. 9. CMC curves on the VeRi-776 and VehicleID datasets compared with the state-of-the-art methods, where the curves of the variants of our method are plotted in red but with different markers.
Fig. 10. Ablation study of VS-GAN on the VeRi-776 dataset. (a) and (b) show the mAP and rank 1 scores of the proposed DF-CVTC and its variants without and with VS-GAN respectively. The digits on top of the last three bars for each metric indicate the degree of improvement obtained by progressively introducing view, type and color, compared with the first blue bar of the ResNet-50 baseline.
2) Analysis on backbones:
As mentioned in Section III-B, any other CNN architecture can be configured instead of ResNet-50 without limitation. We further evaluate three prevalent CNN architectures, Inception-v4, VGG16 and MobileNet, as the backbone respectively, while keeping the other parts of the proposed model unchanged. The results compared with the ResNet-50 backbone on the VeRi-776 dataset are reported in Table III. From it we can see that all four CNN counterparts achieve satisfactory performance. Specifically, Inception-v4 and MobileNet achieve competitive performance on all metrics. VGG16 works slightly worse than the other three architectures, but it is still competitive with the state-of-the-art methods, which demonstrates that the high performance of the proposed model is not solely due to the superiority of ResNet-50. Furthermore, by progressively introducing the view, type and color subnetworks, the performance of the corresponding variants based on all the backbones consistently improves, which verifies the contribution of the proposed joint learning model. Fig. 11 shows an example of ranking results of the proposed DF-CVTC for a query from the VeRi-776 dataset, obtained by progressively introducing the view, type and color subnetworks into the ResNet-50 backbone. We observe that: 1) by introducing the view subnetwork, the model can eliminate the wrong ranks with quite similar visible appearance, especially those with views similar to the query, such as rank 1 and rank 2 in Fig. 11 (c1); 2) by further introducing the type subnetwork, it can eliminate the wrong ranks with obviously distinct types, such as rank 2 and rank 9 in Fig. 11 (c2); 3) our full model DF-CVTC (Fig. 11 (c4)) hits the most right ranks by progressively introducing the three subnetworks.
E. Other Discussion
We further evaluate the attribute recognition on the 1,678 query vehicle images from the VeRi-776 dataset, comparing the recognition rates of the three subnetworks before and after fine-tuning with the Re-ID loss. We observe that the attributes guided feature learning model further benefits the attribute recognition rates, especially for camera views. Meanwhile, the recognition rate of vehicle type is also slightly improved. The recognition rate of the vehicle color attribute shows no significant change, mainly because most query images are of similar colors and the original recognition rate of the color attribute is nearly saturated.

Fig. 11. An example of ranking results of DF-CVTC on the ResNet-50 backbone by progressively introducing the view, type and color subnetworks on the VeRi-776 dataset. The green and red boxes indicate the right matchings and the wrong matchings respectively. The histograms denote the probability distributions learnt from the view, type and color subnetworks respectively.

TABLE III
ABLATION STUDY ON DIFFERENT BACKBONES WITH VARYING COMPONENTS ON THE VERI-776 DATASET (IN %). THE TOP THREE RESULTS ARE HIGHLIGHTED IN RED, GREEN AND BLUE, RESPECTIVELY.
Components: (a) view ✗, type ✗, color ✗ | (b) view ✓, type ✗, color ✗ | (c) view ✓, type ✓, color ✗ | (d) view ✓, type ✓, color ✓
Backbone | (a) mAP / rank 1 / rank 5 | (b) mAP / rank 1 / rank 5 | (c) mAP / rank 1 / rank 5 | (d) mAP / rank 1 / rank 5
ResNet-50 | 51.58 / 86.71 / 92.43 | 54.52 / 89.69 / 94.40 | 60.47 / 91.66 / 95.59 | 61.06 / 91.36 / 95.77
VGG16 | 42.35 / 77.77 / 88.14 | 44.17 / 80.63 / 89.57 | 45.43 / 81.17 / 90.35 | 45.62 / 81.76 / 91.12
MobileNet | 52.55 / 86.23 / 94.10 | 54.48 / 87.60 / 93.92 | 58.49 / 89.15 / 94.64 | 59.23 / 89.45 / 94.87
Inception-v4 | 49.78 / 84.62 / 91.90 | 52.74 / 87.66 / 93.68 | 59.49 / 89.27 / 94.76 | 60.50 / 89.51 / 95.47

VI. CONCLUSION
In this paper, we have proposed a novel end-to-end deep convolutional network to jointly learn deep features, camera views, types and colors for vehicle Re-ID. We expand the backbone of ResNet-50 with three consolidated subnetworks incorporating the view, type and color cues respectively. These three tasks benefit each other and learn an informative and discriminative representation for vehicle Re-ID. Furthermore, we have increased the diversity of views of the vehicle images via a view-specified generative adversarial network. By jointly learning the deep features, camera views, vehicle types and vehicle colors in a single unified framework, our method achieves superior performance compared with the state-of-the-art methods. Comprehensive evaluation on two benchmark datasets demonstrates the clear contribution of each subnetwork and the capability of learning informative representations for vehicle Re-ID.

ACKNOWLEDGEMENT
This work was partially supported by the National Natural Science Foundation of China (61502006, 61702002 and 61872005).
REFERENCES

[1] Z. Wang, L. Tang, X. Liu, Z. Yao, S. Yi, J. Shao, J. Yan, S. Wang, H. Li, and X. Wang, "Orientation invariant feature embedding and spatial temporal regularization for vehicle re-identification," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 379–387.
[2] Y. Shen, T. Xiao, H. Li, S. Yi, and X. Wang, "Learning deep neural networks for vehicle re-id with visual-spatio-temporal path proposals," in IEEE International Conference on Computer Vision (ICCV), 2017, pp. 1918–1927.
[3] X. Liu, W. Liu, T. Mei, and H. Ma, "A deep learning-based approach to progressive vehicle re-identification for urban surveillance," in European Conference on Computer Vision. Springer, 2016, pp. 869–884.
[4] ——, "Provid: Progressive and multimodal vehicle reidentification for large-scale urban surveillance," in IEEE Transactions on Multimedia, 2018, pp. 645–658.
[5] Z. Zheng, L. Zheng, and Y. Yang, "Unlabeled samples generated by GAN improve the person re-identification baseline in vitro," in IEEE International Conference on Computer Vision (ICCV), Oct 2017.
[6] W. Li, R. Zhao, T. Xiao, and X. Wang, "Deepreid: Deep filter pairing neural network for person re-identification," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014, pp. 152–159.
[7] J. Liu, B. Ni, Y. Yan, P. Zhou, S. Cheng, and J. Hu, "Pose transferrable person re-identification," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 4099–4108.
[8] L. Wei, S. Zhang, W. Gao, and Q. Tian, "Person transfer GAN to bridge domain gap for person re-identification," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 79–88.
[9] W. Deng, L. Zheng, Q. Ye, G. Kang, Y. Yang, and J. Jiao, "Image-image domain adaptation with preserved self-similarity and domain-dissimilarity for person reidentification," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 994–1003.
[10] P. Chen, X. Xu, and C. Deng, "Deep view-aware metric learning for person re-identification," in International Joint Conference on Artificial Intelligence (IJCAI), 2018, pp. 620–626.
[11] C. Su, F. Yang, S. Zhang, Q. Tian, L. S. Davis, and W. Gao, "Multi-task learning with low rank attribute embedding for person re-identification," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 3739–3747.
[12] S. Khamis, C.-H. Kuo, V. K. Singh, V. D. Shet, and L. S. Davis, "Joint learning for attribute-consistent person re-identification," in European Conference on Computer Vision. Springer, 2014, pp. 134–146.
[13] E. Ahmed, M. Jones, and T. K. Marks, "An improved deep learning architecture for person re-identification," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3908–3916.
[14] S. Liao, Y. Hu, X. Zhu, and S. Z. Li, "Person re-identification by local maximal occurrence representation and metric learning," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 2197–2206.
[15] X. Liu, W. Liu, H. Ma, and H. Fu, "Large-scale vehicle re-identification in urban surveillance videos," in IEEE International Conference on Multimedia and Expo (ICME), 2016, pp. 1–6.
[16] Y. Li, Y. Li, H. Yan, and J. Liu, "Deep joint discriminative learning for vehicle re-identification and retrieval," in IEEE International Conference on Image Processing (ICIP). IEEE, 2017, pp. 395–399.
[17] Y. Zhang, D. Liu, and Z.-J. Zha, "Improving triplet-wise training of convolutional neural network for vehicle re-identification," in IEEE International Conference on Multimedia and Expo (ICME), 2017, pp. 1386–1391.
[18] J. Zhu, Y. Du, Y. Hu, L. Zheng, and C. Cai, "Vrsdnet: Vehicle re-identification with a shortly and densely connected convolutional neural network," in Multimedia Tools and Applications, 2018, pp. 1–15.
[19] M. S. Sarfraz, A. Schumann, A. Eberle, and R. Stiefelhagen, "A pose-sensitive embedding for person re-identification with expanded cross neighborhood re-ranking," arXiv preprint arXiv:1711.10378, 2017.
[20] H. Zhao, M. Tian, S. Sun, J. Shao, J. Yan, S. Yi, X. Wang, and X. Tang, "Spindle net: Person re-identification with human body region guided feature decomposition and fusion," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1077–1085.
[21] L. Zhao, X. Li, Y. Zhuang, and J. Wang, "Deeply-learned part-aligned representations for person re-identification," in ICCV, 2017, pp. 3239–3248.
[22] C. Su, J. Li, S. Zhang, J. Xing, W. Gao, and Q. Tian, "Pose-driven deep convolutional model for person re-identification," in Computer Vision (ICCV), 2017 IEEE International Conference on. IEEE, 2017, pp. 3980–3989.
[23] Y. Zhou and L. Shao, "Cross-view GAN based vehicle generation for re-identification," in Proceedings of the British Machine Vision Conference (BMVC), 2017, pp. 1–12.
[24] ——, "Aware attentive multi-view inference for vehicle re-identification," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6489–6498.
[25] C. Su, S. Zhang, J. Xing, W. Gao, and Q. Tian, "Deep attributes driven multi-camera person re-identification," in European Conference on Computer Vision. Springer, 2016, pp. 475–491.
[26] R. Layne, T. M. Hospedales, S. Gong, and Q. Mary, "Person re-identification by attributes," in The British Machine Vision Conference (BMVC), 2012, p. 8.
[27] Z. Zhong, L. Zheng, Z. Zheng, S. Li, and Y. Yang, "Camera style adaptation for person re-identification," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 5157–5166.
[28] Z. Zheng, L. Zheng, and Y. Yang, "A discriminatively learned CNN embedding for person reidentification," ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), p. 13, 2017.
[29] Y.-C. Chen, X. Zhu, W.-S. Zheng, and J.-H. Lai, "Person re-identification by camera correlation aware feature augmentation," IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 392–408, 2018.
[30] A. Wu, W.-S. Zheng, and J.-H. Lai, "Robust depth-based person re-identification," IEEE Transactions on Image Processing, pp. 2588–2603, 2017.
[31] X. Zhu, B. Wu, D. Huang, and W.-S. Zheng, "Fast open-world person re-identification," IEEE Transactions on Image Processing, pp. 2286–2300, 2018.
[32] X. Wang, S. Zheng, R. Yang, B. Luo, and J. Tang, "Pedestrian attribute recognition: A survey," arXiv preprint arXiv:1901.07474, 2019.
[33] G. Haiyun, Z. Chaoyang, L. Zhiwei, W. Jinqiao, L. Hanqing et al., "Learning coarse-to-fine structured feature embedding for vehicle re-identification," in Association for the Advancement of Artificial Intelligence (AAAI), 2018, pp. 1–8.
[34] Z. Wu, Y. Li, and R. J. Radke, "Viewpoint invariant human re-identification in camera networks using pose priors and subject-discriminative features," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 5, pp. 1095–1108, 2015.
[35] L. Zheng, Y. Huang, H. Lu, and Y. Yang, "Pose invariant embedding for deep person re-identification," arXiv preprint arXiv:1701.07732, 2017.
[36] J. Prokaj and G. Medioni, "3-d model based vehicle recognition," in Applications of Computer Vision (WACV), 2009 Workshop on. IEEE, 2009, pp. 1–7.
[37] Y. Lin, L. Zheng, Z. Zheng, Y. Wu, and Y. Yang, "Improving person re-identification by attribute and identity learning," arXiv preprint arXiv:1703.07220, 2017.
[38] C. Su, S. Zhang, J. Xing, W. Gao, and Q. Tian, "Multi-type attributes driven multi-camera person re-identification," Pattern Recognition, pp. 77–89, 2018.
[39] S. Li, T. Xiao, H. Li, B. Zhou, D. Yue, and X. Wang, "Person search with natural language description," arXiv preprint arXiv:1702.05729, 2017.
[40] Z. Yin, W.-S. Zheng, A. Wu, H.-X. Yu, H. Wan, X. Guo, F. Huang, and J. Lai, "Adversarial attribute-image person re-identification," in International Joint Conference on Artificial Intelligence (IJCAI), 2018, pp. 1100–1106.
[41] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in International Conference on Neural Information Processing Systems (NIPS), 2014, pp. 2672–2680.
[42] Y. Li, L. Song, X. Wu, R. He, and T. Tan, "Anti-makeup: Learning a bi-level adversarial network for makeup-invariant face verification," 2018.
[43] R. Huang, S. Zhang, T. Li, and R. He, "Beyond face rotation: Global and local perception GAN for photorealistic and identity preserving frontal view synthesis," in IEEE International Conference on Computer Vision (ICCV), Oct 2017.
[44] X. Qian, Y. Fu, W. Wang, T. Xiang, Y. Wu, Y.-G. Jiang, and X. Xue, "Pose-normalized image generation for person re-identification," arXiv preprint arXiv:1712.02225, 2017.
[45] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, "Image-to-image translation with conditional adversarial networks," arXiv preprint, 2017.
[46] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[47] L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, and Q. Tian, "Scalable person re-identification: A benchmark," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 1116–1124.
[48] L. Yang, P. Luo, C. Change Loy, and X. Tang, "A large-scale car dataset for fine-grained categorization and verification," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.