HorNet: A Hierarchical Offshoot Recurrent Network for Improving Person Re-ID via Image Captioning
Shiyang Yan, Jun Xu, Yuai Liu, and Lin Xu∗
Nanjing Institute of Advanced Artificial Intelligence, Horizon
[email protected], {jun.xu, yuai.liu, lin01.xu}@horizon.ai
∗ Contact Author

Abstract
Person re-identification (re-ID) aims to recognize a person-of-interest across different cameras with notable appearance variance. Existing research has focused on the capability and robustness of visual representations. In this paper, instead, we propose a novel hierarchical offshoot recurrent network (HorNet) for improving person re-ID via image captioning. Image captions are semantically richer and more consistent than visual attributes, which could significantly alleviate the variance. We use the similarity preserving generative adversarial network (SPGAN) and an image captioner to fulfill domain transfer and language description generation. The proposed HorNet then learns the visual and language representations from both the images and the captions jointly, and thus enhances the performance of person re-ID. Extensive experiments are conducted on several benchmark datasets with or without image captions, i.e., CUHK03, Market-1501, and Duke-MTMC, demonstrating the superiority of the proposed method. Our method can generate and extract meaningful image captions while achieving state-of-the-art performance.
Introduction

Person re-identification (re-ID) has become increasingly popular in the modern computer vision community due to its great significance in the research and applications of visual surveillance. It aims at recognizing a person-of-interest (query) across different cameras. The most challenging problem in re-ID is how to accurately match persons under intensive appearance variance, such as human poses, camera viewpoints, and illumination conditions. Encouraged by the remarkable success of deep learning algorithms and the emergence of large-scale datasets, many advanced methods have been developed to relieve these vision-based difficulties and have made significant improvements in the community [Li et al., 2017a; Su et al., 2017; Chen et al., 2018].

Figure 1: Schematic illustration of the proposed framework for person re-ID. Both images and captions are utilized for spotting a person-of-interest across different cameras. For persons without captions, we first transfer all available images into a unified domain and then use an image captioner to generate high-quality language descriptions automatically. The HorNet simultaneously extracts the visual representation from a given image and the language description from the generated caption for the following person re-identification.

Recent years have witnessed that various auxiliary information, such as human poses [Su et al., 2017], person attributes [Schumann, 2017] and language descriptions [Chen et al., 2018], can significantly boost the performance of person re-ID. These serve as augmented feature representations for improving person re-ID. Notably, image captions can provide a comprehensive and detailed footprint of a specific person; they are semantically richer than visual attributes. More importantly, language descriptions of a particular person are often more consistent across different cameras (or views), which could alleviate the difficulty of appearance variance in the person re-ID task.

Two significant barriers exist in applying image captions to person re-ID. The first is the increased complexity of handling image captions. Language descriptions certainly contain much redundant and fuzzy information, which could pose a great challenge if not handled properly. Thus an effective learning approach for constructing a compact representation of language descriptions is of vital importance. The other is the lack of description annotations for the person re-ID task. Recently, [Li et al., 2017b] proposed CUHK-PEDES, which provides person images with annotated captions. The images in this dataset are collected from various person re-ID benchmark datasets such as CUHK01 [Li et al., 2012], CUHK03 [Li et al., 2014], Market-1501 [Zheng et al., 2015a], and others. However, the annotations are usually restricted to these datasets. In real-world applications, person images normally do not have paired language descriptions. Thus, a method for automatically generating high-quality semantic image captions for various real-world datasets is also urgently needed.

In this paper, we propose a novel hierarchical offshoot recurrent network (HorNet) for improving person re-ID via image captioning. Figure 1 illustrates the schematic of our framework for the person re-ID task. We first use the similarity preserving generative adversarial network (SPGAN) [Deng et al., 2018]
to transfer the real-world images into a unified domain, which can significantly enhance the quality of the descriptions generated by the following image captioner [Aneja et al., 2018]. Then both the images and the generated captions are used as input to the HorNet. The HorNet has two sub-networks to handle the input images and captions, respectively. For images, we utilize mainstream CNNs (i.e., ResNet-50) to extract visual features. For captions, we develop a two-layer LSTM module with a discrete binary gate at each time step. The gradient of the discrete gates is estimated using the Gumbel sigmoid [Jang et al., 2016]. This module dynamically controls the information flow from the lower layer to the upper layer via these gates. It selects the most relevant vocabularies (i.e., the correct or meaningful words), which are consistent with the input visual features. Consequently, HorNet can learn the visual representations from the given images and the language descriptions from the generated image captions jointly, and thus significantly enhance the performance of person re-ID. Finally, we verify the performance of our proposed method in two scenarios, i.e., person re-ID datasets with and without image captions. Experimental results on several widely used benchmark datasets, i.e., CUHK03, Market-1501, and Duke-MTMC, demonstrate the superiority of the proposed method. Our method can simultaneously learn the visual and language representations from both the images and the captions while achieving state-of-the-art recognition performance.

In a nutshell, our main contributions in the present work are threefold:
(1) We develop a new captioning module via image domain transfer and a captioner in the person re-ID system. It can generate high-quality language captions for given visual images.
(2) We propose a novel hierarchical offshoot recurrent network (HorNet) based on the generated image captions, which learns the visual and language representations jointly.
(3) We verify the superiority of our proposed method on the person re-ID task. State-of-the-art empirical results are achieved on three commonly used benchmark datasets.
Related Work

Early research works on person re-ID mainly focus on visual feature extraction. For instance, [Yi et al., 2014] split a pedestrian image into three horizontal parts and train three part-CNNs to extract features; the similarity between two images is then calculated based on the cosine distance between their features. [Chen et al., 2018] use triplet samples for training the network, considering not only samples of the same person but also samples of different people. [Liu et al., 2016] propose a multi-scale triplet CNN for person re-ID. Owing to recently released large-scale benchmark datasets, e.g., CUHK03 [Li et al., 2014] and Market-1501 [Zheng et al., 2015a], many researchers try to learn a deep model based on the identity loss for person re-ID. [Zheng et al., 2016] directly use a conventional fine-tuning approach and outperform many previous results. Also, recent research [Zheng et al., 2017a] proves that a discriminative loss, combined with the verification loss objective, is superior.

Several recent studies have endeavored to use auxiliary information to aid the feature representation for person re-ID. Some research [Su et al., 2017; Zhao et al., 2017] relies on extra information about the person's poses. They leverage human-part cues to alleviate pose variations and learn robust feature representations from both the global and local image regions. Another type of auxiliary information, person attributes, has been used in person re-ID [Lin et al., 2017]. However, these methods all rely on attribute annotations, which are normally hard to collect in real-world applications. [Schumann, 2017] uses automatically detected attributes and visual features for person re-ID, where the attribute detector is trained on another dataset that contains attribute annotations.

The relationship between visual representations and language descriptions has long been investigated. It has attracted much attention in tasks such as image captioning [Yan et al., 2018b] and visual question answering. Associating person images and their corresponding language descriptions for person search was proposed in [Li et al., 2017b]. Several research works employ language descriptions as complementary information, together with visual representations, for person re-ID. [Chen et al., 2018] exploit natural language descriptions as additional training supervision for effective visual features. [Yan et al., 2018a] propose to combine language descriptions and image features and fuse them for the person re-ID task. Previous language models encode sentences using either a Recurrent Neural Network (RNN) language encoder or a Convolutional Neural Network (CNN) encoder. Recent research [Bahdanau et al., 2014] employs the attention mechanism for these language models by looking over the entire sentence and assigning weights to each word independently. In particular, RNNs with attention have been widely applied in machine translation, image captioning, speech recognition, and question answering (QA). The attention mechanism allows the model to look over the entire sequence and pick up the most relevant information. Most previous attention mechanisms follow an approach similar to [Bahdanau et al., 2014], where the neural model assigns soft weights to the input tokens. Recently, [Ke et al., 2018]
proposed a Focused Hierarchical Encoder (FHE) for question answering (QA), which consists of multi-layer LSTMs with discrete binary gates between the layers. Our HorNet also utilizes discrete gates but with a very different mechanism and purpose: we aim to eliminate redundant information and incorrect language tokens, while they try to focus on the part of the sequence that answers the question.

Figure 2: The pipeline of the proposed method for extending the language descriptions to datasets without annotations. Specifically, we first train an image captioning model on the CUHK-PEDES dataset. Then we transfer the image style of the Duke-MTMC dataset to the CUHK-PEDES style. The transferred Duke-MTMC images with the CUHK-PEDES style are used to generate more precise language descriptions. Finally, we use the generated descriptions and the original Duke-MTMC images for person re-ID.
Language Description Generation

The image caption of a specific person is semantically rich and can provide complementary information for the visual representations. However, handcrafted descriptions of a person image are hard to collect due to annotation difficulties in real-world person re-ID applications. We propose a method to generalize the language descriptions accurately from a dataset with image captions to others without such captions. The whole scheme of our approach is illustrated in Figure 2. Given images with captions, i.e., the CUHK-PEDES dataset, we use SPGAN to transfer an arbitrary image into the CUHK-PEDES style. SPGAN was proposed to improve image-to-image domain adaptation by preserving both the self-similarity and the domain-dissimilarity for person re-ID. We utilize it in our case, as in [Deng et al., 2018], to transfer the image domain (or style) of the un-annotated datasets. Then we train an image captioner [Aneja et al., 2018] to generate image descriptions automatically on the transferred datasets. The visualization of the domain transfer process and the corresponding generated captions is illustrated in Figure 4. It is clear that the transferred images yield more accurate language descriptions. However, the generated sentences, which are based on the domain-translated images, still contain some incorrect keywords and redundant information. The proposed HorNet, which contains discrete binary gates, can select the language tokens most relevant to the visual features, and thus provides a good solution to this issue.
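To make the two-stage pipeline concrete, below is a minimal sketch of how a pre-trained SPGAN generator and a pre-trained image captioner could be chained to caption an unannotated re-ID dataset. The names and interfaces (`spgan_generator`, `captioner.sample`, `vocab`) are hypothetical placeholders, not the released interfaces of either project.

```python
import torch

# A minimal sketch of the description-generation pipeline described above.
# `spgan_generator` (domain transfer) and `captioner` (convolutional captioner)
# are assumed to be pre-trained models; their interfaces here are hypothetical.
@torch.no_grad()
def generate_captions(images, spgan_generator, captioner, vocab):
    """Translate each image into the CUHK-PEDES style, then caption it."""
    captions = []
    for img in images:                              # img: (3, H, W) tensor in [-1, 1]
        styled = spgan_generator(img.unsqueeze(0))  # image-to-image domain transfer
        token_ids = captioner.sample(styled)        # decoded word indices (list of int)
        words = [vocab[i] for i in token_ids
                 if vocab[i] not in ("<sos>", "<eos>", "<pad>")]
        captions.append(" ".join(words))
    return captions
```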
The Hierarchical Offshoot Recurrent Network (HorNet)

To facilitate the visual representations of a person in the person re-ID task, we propose the HorNet to learn the visual features and the corresponding language description jointly. The HorNet adds a branch to the CNN (i.e., ResNet-50) consisting of a two-layer LSTM with a discrete or continuous gate between the two layers at every time step. The lower-layer LSTM handles the input language while the upper-layer LSTM selects the relevant language features via the gates. Finally, the last hidden state of the upper-layer LSTM is concatenated with the visual features extracted via ResNet-50 to generate a compact representation. The objective function of the HorNet consists of two parts, the identification loss and the triplet loss, which are trained jointly to optimize the person re-ID model.

We present the pipeline of our proposed method in Figure 3. The input to the HorNet consists of two parts, i.e., the image and the corresponding language description. The description is processed by a two-layer LSTM model, whose bottom layer is a normal LSTM network that reasons over the sequential input of the language description.

Figure 3: The pipeline of the proposed method. We first use the domain transfer technique (i.e., SPGAN) to transfer all available training images into a unified domain (or style). This preprocessing significantly enhances the quality of the language descriptions generated by the following image captioner. In the structure of the HorNet, the information flow from the lower-layer LSTM to the upper-layer LSTM is controlled by discrete binary gates. The discrete binary gates are determined by the corresponding visual features and the hidden units from the lower LSTM layer. The gradient of the discrete gates is then estimated via the Gumbel sigmoid function. The red circles in the figure indicate that the gates are closed, while the yellow circles mean the gates are open. The final concatenated representation of visual and language features is employed for person re-ID through the ID loss and triplet loss objectives simultaneously.

More formally, let $D = (d_1, d_2, \dots, d_n)$ be the input description, $h_t$ the hidden state, and $c_t$ the LSTM cell state at time $t$. The term $E = (e_1, e_2, \dots, e_n)$, where $e_t = \mathrm{WordEmbedding}(d_t)$, $t = 1, 2, \dots, n$, denotes the word embedding of the input description. In our work, we use a linear embedding for the input language tokens. Hence, the bottom LSTM layer can be expressed as in Equation (1):

$$ h_t^l,\, c_t^l = \mathrm{LSTM}(e_t,\, h_{t-1}^l,\, c_{t-1}^l), \quad (1) $$

where the function $\mathrm{LSTM}$ denotes the compact form of the forward pass of an LSTM unit with a forget gate:
$$
\begin{aligned}
i_t^l &= \sigma(W_{xi}^l e_t + U_{hi}^l h_{t-1}^l + b_i^l),\\
f_t^l &= \sigma(W_{xf}^l e_t + U_{hf}^l h_{t-1}^l + b_f^l),\\
o_t^l &= \sigma(W_{xo}^l e_t + U_{ho}^l h_{t-1}^l + b_o^l),\\
g_t^l &= \sigma(W_{xc}^l e_t + U_{hc}^l h_{t-1}^l + b_c^l),\\
c_t^l &= f_t^l \cdot c_{t-1}^l + i_t^l \cdot g_t^l,\\
h_t^l &= o_t^l \cdot \phi(c_t^l),
\end{aligned}
\quad (2)
$$

where $e_t \in \mathbb{R}^d$ denotes the input vector, $f_t^l \in \mathbb{R}^h$ is the forget gate's activation, $i_t^l \in \mathbb{R}^h$ is the input gate's activation, $o_t^l \in \mathbb{R}^h$ is the output gate's activation, $c_t^l \in \mathbb{R}^h$ is the cell state vector, and $h_t^l \in \mathbb{R}^h$ is the hidden state of LSTM layer $l$. $W \in \mathbb{R}^{h \times d}$, $U \in \mathbb{R}^{h \times h}$ and $b \in \mathbb{R}^h$ are the weight matrices and bias vectors to be learned during training. The activation $\sigma$ is the sigmoid function and the operator $\cdot$ denotes the Hadamard (element-wise) product.
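As a reference for Equations (1)-(2), a minimal PyTorch sketch of the lower language layer could look as follows. The vocabulary size and the hidden/embedding dimension are placeholders (the paper sets the embedding and hidden sizes equal but does not fix them here), and the standard `nn.LSTMCell` is used as the LSTM unit.

```python
import torch
import torch.nn as nn

# A minimal sketch of the lower LSTM layer over word embeddings (Eqs. (1)-(2)).
# `vocab_size` and `hidden_dim` are placeholder hyper-parameters.
class LowerLanguageEncoder(nn.Module):
    def __init__(self, vocab_size, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_dim)  # linear word embedding e_t
        self.cell = nn.LSTMCell(hidden_dim, hidden_dim)    # standard LSTM unit of Eq. (2)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) integer word indices of a caption
        e = self.embed(token_ids)
        h = e.new_zeros(e.size(0), self.cell.hidden_size)
        c = e.new_zeros(e.size(0), self.cell.hidden_size)
        states = []
        for t in range(e.size(1)):                         # unrolled recurrence of Eq. (1)
            h, c = self.cell(e[:, t], (h, c))
            states.append(h)
        return torch.stack(states, dim=1)                  # (batch, seq_len, hidden_dim)
```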
The boundary gate controls the information flow from the lower layer to the upper layer. The boundary gate $z_t$ is estimated with the Gumbel sigmoid, which is derived directly from the Gumbel softmax proposed in [Jang et al., 2016]. The Gumbel softmax replaces the argmax in the Gumbel-Max trick with the following softmax function:

$$ \mathrm{Gumbel\_softmax}(\pi_i) = \frac{\exp((\log \pi_i + g_i)/\tau)}{\sum_{j=1}^{K} \exp((\log \pi_j + g_j)/\tau)}, $$

where $g_1, \dots, g_K$ are i.i.d. samples drawn from the $\mathrm{Gumbel}(0, 1)$ distribution and $\tau$ is the temperature parameter. $K$ indicates the dimension of the generated softmax vector (i.e., the number of categories).

To derive the Gumbel sigmoid, we first rewrite the sigmoid function as a softmax over two values, $\pi_i$ and $0$, as in Equation (3):

$$ \mathrm{sigm}(\pi_i) = \frac{1}{1 + \exp(-\pi_i)} = \frac{1}{1 + \exp(0 - \pi_i)} = \frac{1}{1 + \exp(0)/\exp(\pi_i)} = \frac{\exp(\pi_i)}{\exp(\pi_i) + \exp(0)}. \quad (3) $$

Hence, the Gumbel sigmoid can be written as in Equation (4):

$$ \mathrm{Gumbel\_sigmoid}(\pi_i) = \frac{\exp((\log \pi_i + g_i)/\tau)}{\exp((\log \pi_i + g_i)/\tau) + \exp(g'/\tau)}, \quad (4) $$

where $g_i$ and $g'$ are independently sampled from the $\mathrm{Gumbel}(0, 1)$ distribution.
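A small sketch of Equation (4) in PyTorch is given below. It uses the algebraically equivalent form $\mathrm{sigm}(((\log \pi_i + g_i) - g')/\tau)$ obtained from the two-way softmax above; the default temperature value and the straight-through hard variant (used later with Equation (7)) are our own assumptions rather than values stated in the paper.

```python
import torch

def gumbel_sigmoid(logits, tau=0.5, hard=False):
    """Soft (or straight-through hard) gate in [0, 1]; `logits` plays the role of log(pi_i)."""
    # Two independent Gumbel(0, 1) samples: one for pi_i, one for the implicit zero logit.
    g1 = -torch.log(-torch.log(torch.rand_like(logits) + 1e-20) + 1e-20)
    g2 = -torch.log(-torch.log(torch.rand_like(logits) + 1e-20) + 1e-20)
    soft = torch.sigmoid((logits + g1 - g2) / tau)   # Eq. (4), rewritten as a sigmoid
    if not hard:
        return soft
    hard_gate = (soft >= 0.5).float()                # discrete selection, cf. Eq. (7)
    return hard_gate + soft - soft.detach()          # straight-through gradient estimate
```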
Thus, the inputs to the upper-layer LSTM are the gated hidden units of the lower layer, which can be expressed as Equations (5) and (6). In our experiments, all the soft gates $z_t$ are estimated using the Gumbel sigmoid with a constant temperature $\tau$:

$$ z_t = \mathrm{Gumbel\_sigmoid}(\mathrm{Concat}(h_t^l, F)), \quad (5) $$

$$ h_t^{l+1},\, c_t^{l+1} = \mathrm{LSTM}(h_t^l \cdot z_t,\, h_{t-1}^{l+1},\, c_{t-1}^{l+1}), \quad (6) $$

where $F$ denotes the deep visual features of the image extracted via the CNN (i.e., ResNet-50) and $\mathrm{Concat}$ indicates the feature concatenation operation.

To obtain a discrete value (i.e., a hard language-token selection), we can also replace $z_t$ in Equation (5) with the hard gate $\widetilde{y}_i$:

$$ \widetilde{y}_i = \begin{cases} 1, & y_i \ge 0.5,\\ 0, & \text{otherwise}. \end{cases} \quad (7) $$

Finally, we forward the last hidden unit of the language branch and form a compact representation by concatenating it with the corresponding visual features $F$:

$$ f = \mathrm{Concat}(h_n^{l+1}, F). \quad (8) $$

Our loss function is the combination of the identification loss and the triplet loss objectives:

$$ L = L_{ID} + L_{Triplet}, \quad (9) $$

where $L_{ID}$ is a $K$-class cross-entropy loss parameterized by $\theta$. It treats each person ID as an individual class:

$$ L_{ID} = -\sum_{k=1}^{K} y_k \log \frac{\exp(\theta_k f)}{\sum_{j=1}^{K} \exp(\theta_j f)}, \quad (10) $$

and $L_{Triplet}$ denotes the triplet loss:

$$ L_{Triplet} = \max(\| f(x_a) - f(x_p) \| - \| f(x_a) - f(x_n) \| + \alpha,\, 0), \quad (11) $$

where $x_a$ is the anchor, $x_p$ is a positive example, and $x_n$ is a negative example. $\alpha$ denotes the margin.
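To tie Equations (5)-(11) together, here is a minimal sketch, under our own assumptions, of the gated language branch and the joint objective. It reuses the `LowerLanguageEncoder` and `gumbel_sigmoid` sketches above; the 2048-dimensional visual feature corresponds to ResNet-50, the 256-dimensional language summary follows the implementation details given in the experiments section, while the scalar gate projection, the batch-hard triplet mining and the margin value are placeholders rather than the paper's exact recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# A minimal sketch of the HorNet language branch (Eqs. (5)-(8)) and loss (Eqs. (9)-(11)).
class HorNetHead(nn.Module):
    def __init__(self, vocab_size, num_ids, hidden_dim=512, visual_dim=2048):
        super().__init__()
        self.lower = LowerLanguageEncoder(vocab_size, hidden_dim)
        self.upper = nn.LSTMCell(hidden_dim, hidden_dim)        # upper-layer LSTM
        self.gate = nn.Linear(hidden_dim + visual_dim, 1)       # scalar gate logit from Concat(h_t^l, F); learned projection assumed
        self.reduce = nn.Linear(hidden_dim, 256)                # 256-d language summary
        self.classifier = nn.Linear(256 + visual_dim, num_ids)  # head for the ID loss

    def forward(self, token_ids, visual_feat, hard_gates=False):
        lower_h = self.lower(token_ids)                          # (B, T, H)
        h = visual_feat.new_zeros(visual_feat.size(0), self.upper.hidden_size)
        c = torch.zeros_like(h)
        for t in range(lower_h.size(1)):
            gate_in = torch.cat([lower_h[:, t], visual_feat], dim=1)
            z_t = gumbel_sigmoid(self.gate(gate_in), hard=hard_gates)  # Eq. (5)/(7)
            h, c = self.upper(lower_h[:, t] * z_t, (h, c))             # Eq. (6)
        f = torch.cat([self.reduce(h), visual_feat], dim=1)            # Eq. (8), 2304-d
        return f, self.classifier(f)

def hornet_loss(logits, f, labels, margin=0.3):
    """L = L_ID + L_Triplet (Eqs. (9)-(11)); margin and mining strategy are assumptions."""
    id_loss = F.cross_entropy(logits, labels)                    # Eq. (10)
    dist = torch.cdist(f, f)                                     # pairwise L2 distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    hardest_pos = (dist * same.float()).max(dim=1).values        # hardest positive per anchor
    hardest_neg = (dist + same.float() * 1e6).min(dim=1).values  # hardest negative per anchor
    triplet = F.relu(hardest_pos - hardest_neg + margin).mean()  # Eq. (11)
    return id_loss + triplet
```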
Experiments

We evaluated the proposed methods on the person re-ID datasets CUHK03 [Li et al., 2014], Market-1501 [Zheng et al., 2015a] and Duke-MTMC [Ristani et al., 2016]. There are two types of experiments: with and without description annotations. For CUHK03 and Market-1501, the description annotations can be directly retrieved from the CUHK-PEDES dataset [Li et al., 2017b]; hence, we evaluated the proposed method for person re-ID by using these annotations. However, Duke-MTMC lacks language annotations, so we used an image captioner to generate language descriptions, which are then used to jointly optimize the proposed HorNet.

For the HorNet, the embedding dimension of the word vectors and the dimension of the LSTM hidden unit are set to the same value. Since we use a fully connected layer to process the last hidden unit of the upper-layer LSTM and reduce its dimension to 256, the final joint vision-and-language representation has 2304 dimensions (i.e., 256 + 2048), where 2048 is the dimension of the ResNet-50 visual features.
To first verify the effect of the language descriptions on person re-ID, we augmented two standard person re-ID datasets, CUHK03 and Market-1501, with annotations from the CUHK-PEDES dataset. The annotations of CUHK-PEDES were originally developed for cross-modal text-based person search.
Figure 4: Visualization of the domain transfer process and the corresponding generated captions. Red keywords indicate incorrect descriptions while green words indicate correct keywords.
Since the persons in Market-1501 and CUHK03 have many similar samples and only four images of each person in these two datasets have language descriptions, we annotated the remaining unannotated images in those datasets with the language information from the same ID, following the same protocol as [Chen et al., 2018].

We first evaluated the proposed method on the CUHK03 dataset, using the classical evaluation protocol [Li et al., 2014]. The baseline method, which uses the identification loss based on the ResNet-50 model, achieves 72.4% CMC top-1 accuracy on the CUHK03 detected images (Table 1). The CMC top-1 result is further raised by augmenting the language descriptions, and a similar phenomenon can be seen on the CUHK03 labeled images. We performed an ablation study to verify the effect of the different parts of HorNet; the experimental results are presented in Table 1. The proposed HorNet with the re-ranking technique achieves the best performance on the Market-1501, CUHK03 detected and CUHK03 labeled datasets.

The comparison with other state-of-the-art methods is listed in Table 2. Specifically, we compared the proposed HorNet with other methods that employ auxiliary information, including Deep Semantic Embedding (DSE) [Chang et al., 2018], ACRN [Schumann, 2017], Vision and Language (VL) [Yan et al., 2018a], and Image-Language Association (ILA) [Chen et al., 2018]. ACRN applies auxiliary attribute information to aid person re-ID, while DSE, VL and ILA employ external language descriptions. Among them, VL uses a vanilla LSTM or CNN language encoding model, which is less discriminative than our HorNet, since the proposed HorNet uses discrete gates to select useful information for person re-ID. ILA uses the same training and testing protocol but with a more complex model. Our model can also be combined with various metric learning techniques, including the re-ranking proposed in [Zhong et al., 2017]; we employed this re-ranking to post-process our features, with improved results. Overall, our HorNet performs much better than ILA under the CUHK03 classic evaluation protocol, achieving 95.0% and 97.1% CMC top-1 accuracy on the detected and labeled images respectively, a gain of over 4% compared with ILA (Table 2). We also conducted experiments on the Market-1501 dataset; the results are presented in Table 1 and Table 3.

Methods                        Market-1501 (mAP / top-1 / top-5 / top-10)   CUHK03 Detected (top-1 / top-5 / top-10 / top-20)   CUHK03 Labeled (top-1 / top-5 / top-10 / top-20)
Identification Loss            65.5 / 82.4 / 92.9 / 95.3                    72.4 / 89.0 / 93.0 / 96.0                            79.3 / 90.6 / 92.3 / 93.0
Identification + Triplet Loss  71.4 / 86.3 / 95.1

Table 1: Ablation study results on Market-1501, CUHK03 Detected, and CUHK03 Labeled datasets.
Methods                        CUHK03 (Detected) top-1 / top-5   CUHK03 (Labeled) top-1 / top-5
MSCAN [Li et al., 2017a]       68.0 / 91.2                       74.2 / 94.3
SSM [Bai et al., 2017a]        72.7 / 92.4                       76.6 / 94.6
k-rank [Zheng et al., 2017b]   58.5 / -                          61.6 / -
JLMT [Li et al., 2017b]        89.4 / 98.2                       91.5 / 99.0
Deep Person [Bai et al., 2017b] 89.4 / 98.2                      91.5 / 99.0
SVDNet [Sun et al., 2017]      81.8 / 95.2                       - / -
MuDeep [Qian et al., 2017]     75.6 / 94.4                       - / 76.9
DSE [Chang et al., 2018]       66.8 / 92.9                       - / -
ACRN [Schumann, 2017]          62.6 / 89.7                       - / -
VL [Yan et al., 2018a]         - / -                             81.8 / 98.1
ILA [Chen et al., 2018]        90.9 / 98.2                       92.5 / 98.8
HorNet + Rerank                95.0 / 98.8                       97.1 / 99.4
Table 2: Comparison with baselines on the CUHK03 dataset.
Methods                          mAP    top-1   top-5   top-10
MSCAN [Li et al., 2017a]         57.5   80.3    -       -
SSM [Bai et al., 2017a]          68.8   82.2    -       -
k-rank [Zheng et al., 2017b]     63.4   77.1    -       -
SVDNet [Sun et al., 2017]        62.1   82.3    -       -
DPLAR [Zhao et al., 2017]        63.4   81.0    -       -
PDC [Su et al., 2017]            63.4   84.1    -       -
JLMT [Li et al., 2017b]          65.5   85.1    -       -
D-person [Bai et al., 2017b]     79.6   92.3    -       -
TGP [Almazan et al., 2018]       81.2   92.2    -       -
DSE [Chang et al., 2018]         64.8   84.7    -       -
ACRN [Schumann, 2017]            62.6   83.6    -       -
ILA [Chen et al., 2018]          81.8   -       -       -
HorNet + Rerank                  85.8
Table 3: Comparison with baselines on the Market-1501 dataset.

A similar phenomenon to that on CUHK03 can be seen in Table 3, where HorNet with re-ranking reaches 85.8% mAP.

In a realistic person re-ID system, language annotations are rare and hard to obtain. Hence, we want to see whether automatically generated language descriptions can boost the performance of a person re-ID system. We chose a more challenging and realistic dataset, Duke-MTMC [Ristani et al., 2016], to verify this assumption. Firstly, we trained an image captioning model on the CUHK-PEDES dataset using the convolutional image captioning model, which has released code and good performance [Aneja et al., 2018]. We split the CUHK-PEDES images into two splits: 95% for training and 5% for validation. We used early stopping to train the image captioning model and achieved 35.4 BLEU-1, 22.4 BLEU-2, 15.0 BLEU-3, 9.9 BLEU-4, 22.3 METEOR, 34.2 ROUGE-L and 22.1 CIDEr on the validation set. Subsequently, we used the trained image captioning model to generate language descriptions for the Duke-MTMC dataset. However, we found that the generated descriptions are not discriminative enough, as shown in Figure 4: there are many incorrect or imprecise keywords in the language descriptions.
Methods                                                   mAP    top-1   top-5   top-10
BoW + Kissme [Zheng et al., 2015b]                        12.2   25.1    -       -
LOMO + XQDA [Liao et al., 2015]                           17.0   30.8    -       -
Verification + Identification [Zheng et al., 2017a]       49.3   68.9    -       -
PAN [Zheng et al., 2018]                                  51.5   71.6    -       -
PAN + Rerank [Zheng et al., 2018]                         66.7   75.9    -       -
FMN [Ding et al., 2017]                                   56.9   74.5    -       -
FMN + Rerank [Ding et al., 2017]                          72.8   79.5    -       -
D-person [Bai et al., 2017b]                              64.8   80.9    -       -
SVDNet [Sun et al., 2017]                                 56.8   76.7    -       -
APR [Lin et al., 2017]                                    51.9   71.0    -       -
ACRN [Schumann, 2017]                                     52.0   72.6    88.9    91.5
Resnet50 + BERT + Rerank [Devlin et al., 2018]            78.8   84.1    90.0    92.2
Identification Loss                                       54.6   72.5    84.4    88.7
Identification Loss + HorNet (Without Domain Transfer)    52.5   71.1    82.6    87.7
Identification Loss + HorNet (With Domain Transfer)       58.4   74.3    87.3    90.8
HorNet (With Domain Transfer)                             60.4   76.4    88.1    90.5
HorNet + Rerank                                           79.2   84.4    90.2    92.5
Table 4: Comparison with baselines on the Duke-MTMC dataset.

We also tested the performance obtained by augmenting the Duke-MTMC images with the generated descriptions, and the results turned out to be poor, even worse than the visual baseline, with only 52.5% mAP, as shown in Table 4. The cause of this phenomenon is the poor generalization capability of the image captioning model, especially when there is a domain gap between two diverse datasets. To alleviate this problem, we used SPGAN [Deng et al., 2018] to transfer the image style of Duke-MTMC to that of CUHK-PEDES. The resulting language descriptions are of much better quality, as presented in Figure 4. The results obtained by augmenting the transferred Duke-MTMC images with the generated language descriptions are much better than those provided by the simple visual features (58.4% and 60.4% mAP versus 54.6% for the identification-loss baseline, Table 4). To prove the superiority of the HorNet, we also replaced HorNet with BERT [Devlin et al., 2018], which leads to poorer performance. Furthermore, we applied the re-ranking of [Zhong et al., 2017] to boost the final recognition performance and achieved 79.2% mAP.

Conclusion

In this paper, we developed a language captioning module via image domain transfer and captioner techniques in a person re-ID system. It can generate high-quality language descriptions for visual images, which significantly compensate for the visual variance in person re-ID. We then proposed a novel hierarchical offshoot recurrent network (HorNet) for improving person re-ID via such an automatic image captioning module. It learns the visual and language representations from both the images and the generated captions, and thus enhances the performance. The experiments demonstrate promising results of our model on the CUHK03, Market-1501 and Duke-MTMC datasets. Future research includes a more robust language captioning module and advanced metric learning methods.

References

[Almazan et al., 2018] Jon Almazan, Bojana Gajic, et al. Re-id done right: towards good practices for person re-identification. arXiv:1801.05339, 2018.
[Aneja et al., 2018] Jyoti Aneja, Aditya Deshpande, et al. Convolutional image captioning. In CVPR, pages 5561–5570, 2018.
[Bahdanau et al., 2014] Dzmitry Bahdanau, Kyunghyun Cho, et al. Neural machine translation by jointly learning to align and translate. arXiv:1409.0473, 2014.
[Bai et al., 2017a] Song Bai, Xiang Bai, et al. Scalable person re-identification on supervised smoothed manifold. In CVPR, pages 2530–2539, 2017.
[Bai et al., 2017b] Xiang Bai, Mingkun Yang, et al. Deep-person: Learning discriminative deep features for person re-identification. arXiv:1711.10658, 2017.
[Chang et al., 2018] Yan-Shuo Chang, Ming-Yu Wang, et al. Joint deep semantic embedding and metric learning for person re-identification. PRL, 2018.
[Chen et al., 2018] Dapeng Chen, Hongsheng Li, et al. Improving deep visual representation for person re-identification by global and local image-language association. In ECCV, pages 54–70, 2018.
[Deng et al., 2018] Weijian Deng, Liang Zheng, et al. Image-image domain adaptation with preserved self-similarity and domain-dissimilarity for person re-identification. In CVPR, pages 994–1003, 2018.
[Devlin et al., 2018] Jacob Devlin, Ming-Wei Chang, et al. BERT: Pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805, 2018.
[Ding et al., 2017] Guodong Ding, Salman Khan, et al. Let features decide for themselves: Feature mask network for person re-identification. arXiv:1711.07155, 2017.
[Jang et al., 2016] Eric Jang, Shixiang Gu, et al. Categorical reparameterization with gumbel-softmax. arXiv:1611.01144, 2016.
[Ke et al., 2018] Nan Rosemary Ke, Konrad Zolna, et al. Focused hierarchical rnns for conditional sequence processing. arXiv:1806.04342, 2018.
[Li et al., 2012] Wei Li, Rui Zhao, et al. Human reidentification with transferred metric learning. In ACCV, pages 31–44, 2012.
[Li et al., 2014] Wei Li, Rui Zhao, et al. Deepreid: Deep filter pairing neural network for person re-identification. In CVPR, pages 152–159, 2014.
[Li et al., 2017a] Dangwei Li, Xiaotang Chen, et al. Learning deep context-aware features over body and latent parts for person re-id. In CVPR, pages 384–393, 2017.
[Li et al., 2017b] Wei Li, Xiatian Zhu, et al. Person re-identification by deep joint learning of multi-loss classification. arXiv:1705.04724, 2017.
[Liao et al., 2015] Shengcai Liao, Yang Hu, et al. Person re-id by local maximal occurrence representation and metric learning. In CVPR, pages 2197–2206, 2015.
[Lin et al., 2017] Yutian Lin, Liang Zheng, et al. Improving person re-identification by attribute and identity learning. arXiv:1703.07220, 2017.
[Liu et al., 2016] Jiawei Liu, Zheng-Jun Zha, et al. Multi-scale triplet cnn for person re-identification. In ACMMM, pages 192–196, 2016.
[Qian et al., 2017] Xuelin Qian, Yanwei Fu, et al. Multi-scale deep learning architectures for person re-identification. In ICCV, pages 5399–5408, 2017.
[Ristani et al., 2016] Ergys Ristani, Francesco Solera, et al. Performance measures and a data set for multi-target, multi-camera tracking. In ECCV, pages 17–35, 2016.
[Schumann, 2017] Arne Schumann. Person re-identification by deep learning attribute-complementary information. In CVPR, pages 1435–1443, 2017.
[Su et al., 2017] Chi Su, Jianing Li, et al. Pose-driven deep convolutional model for person re-identification. In ICCV, pages 3960–3969, 2017.
[Sun et al., 2017] Yifan Sun, Liang Zheng, et al. Svdnet for pedestrian retrieval. In ICCV, pages 3800–3808, 2017.
[Yan et al., 2018a] Fei Yan, Josef Kittler, et al. Person re-identification with vision and language. In ICPR, pages 2136–2141, 2018.
[Yan et al., 2018b] Shiyang Yan, Fangyu Wu, et al. Image captioning using adversarial networks and reinforcement learning. In ICPR, pages 248–253, 2018.
[Yi et al., 2014] Dong Yi, Zhen Lei, et al. Deep metric learning for person re-identification. In ICPR, pages 34–39, 2014.
[Zhao et al., 2017] Liming Zhao, Xi Li, et al. Deeply-learned part-aligned representations for person re-identification. In ICCV, pages 3219–3228, 2017.
[Zheng et al., 2015a] Liang Zheng, Liyue Shen, et al. Person re-identification meets image search. arXiv:1502.02171, 2015.
[Zheng et al., 2015b] Liang Zheng, Liyue Shen, et al. Scalable person re-identification: A benchmark. In ICCV, pages 1116–1124, 2015.
[Zheng et al., 2016] Liang Zheng, Yi Yang, et al. Person re-identification: Past, present and future. arXiv:1610.02984, 2016.
[Zheng et al., 2017a] Zhedong Zheng, Liang Zheng, et al. A discriminatively learned cnn embedding for person reidentification. TOMM, 14(1):13, 2017.
[Zheng et al., 2017b] Zhedong Zheng, Liang Zheng, et al. Unlabeled samples generated by gan improve the person re-identification baseline in vitro. In CVPR, pages 3754–3762, 2017.
[Zheng et al., 2018] Zhedong Zheng, Liang Zheng, et al. Pedestrian alignment network for large-scale person re-identification. TCSVT, 2018.
[Zhong et al., 2017] Zhun Zhong, Xi Li, et al. Re-ranking person re-identification with k-reciprocal encoding. In CVPR, 2017.