Person Image Generation with Semantic Attention Network for Person Re-identification

Meichen Liu (a), Kejun Wang (a,*), Juihang Ji (b) and Shuzhi Sam Ge (c)
(a) College of Automation, Harbin Engineering University, Harbin 150001, China
(b) College of Automation, Harbin Institute of Technology, Harbin 150001, China
(c) National University of Singapore, 117576, Singapore
ARTICLE INFO

Keywords: semantic parsing, pose transfer, image generation, person re-identification

ABSTRACT
Pose variation is one of the key factors that prevents a network from learning a robust person re-identification (Re-ID) model. To address this issue, we propose a novel pose-guided person image generation method, called the semantic attention network. The network consists of several semantic attention blocks, where each block attends to preserve and update the pose code and the clothing textures. The introduction of the binary segmentation mask and the semantic parsing is important for seamlessly stitching foreground and background in pose-guided image generation. Compared with other methods, our network characterizes body shape better while simultaneously keeping clothing attributes. Our synthesized images achieve better appearance and shape consistency with the original image. Experimental results show that our approach is competitive with respect to both quantitative and qualitative results on Market-1501 and DeepFashion. Furthermore, we conduct extensive evaluations by training person re-identification (Re-ID) systems with data augmented by the pose-transferred images. The experiments show that our approach can significantly enhance person Re-ID accuracy.
1. Introduction
Person re-identification (Re-ID) refers to the task of matching a specific person across multiple non-overlapping cameras. It has been receiving considerable attention in the computer vision community due to its various surveillance applications. With the strong learning capabilities of deep neural networks, recent person Re-ID methods [1, 38, 36] have made great progress. However, since the existing benchmarks contain a limited number of pose changes, pose variation remains one of the key factors which prevents the network from learning a robust Re-ID model. Motivated by the above discussion, in this paper we are interested in transferring a person from one pose to another given a condition image, as shown in Fig. 1. As a data augmentation tool, this can provide additional labeled samples for training discriminative methods.

The recent advent of generative adversarial networks [5, 33, 34] has provided powerful tools to achieve pose-guided image generation, and inspired much research in this field that uses the generated images to enhance the generalization ability of person Re-ID [27, 42, 31]. However, most approaches focus on the appearance and the body shape of the person in the pose transfer, where the details of the clothing texture are not fully considered. Due to the low resolution of surveillance cameras, person Re-ID usually relies on the colors and the clothing attributes to match people.

⋆ This document is the result of a research project funded by the National Science Foundation.
∗ Corresponding author
[email protected] (M. Liu); [email protected] (K. Wang); [email protected] (J. Ji); [email protected] (S.S. Ge)
http://homepage.hrbeu.edu.cn/web/wangkejun (K. Wang); https://robotics.nus.edu.sg/sge/ (S.S. Ge)
Figure 1: Some generated examples by our method based on different target poses. (Panels: condition image, target pose masks, generated images.)
These clothing attributes are significant for human visual perception. Ignoring non-rigid human body deformations and clothing shapes can result in compromised quality of the synthesized images. Therefore, how to simultaneously preserve the appearance and the clothing attributes of the condition image in the pose transfer becomes much more challenging. The key challenges of the pose transfer include the following two aspects: (1) Preserving clothing attributes is difficult in the generation process, particularly when only a partial observation of the person is given. (2) Due to the non-rigid nature of the human body, it is generally difficult for neural networks to transform spatially misaligned body parts.

To address these issues, we propose a Semantic Attention Network in this paper. In contrast to other methods [42, 31, 27], our network is divided into two pathways to capture the appearance features and the semantic parsing information of the original image. The network takes the representations of the human semantic parsing and the image as inputs. The pose transfer is carried out by a sequence of Semantic Attention Blocks (SABs). This scheme allows the generator to simultaneously update the appearance code and the clothing texture before producing the final output. Moreover, human semantic parsing naturally provides a foreground mask, which is important for seamlessly stitching foreground and background in pose-guided image generation.

Specifically, we design an attention mechanism to focus on the change of the clothing shape in each SAB. This attention mechanism allows the generator to better select the image regions whose information should be preserved or suppressed during the transfer. The outputs of the sequence of blocks are the updated appearance code and the updated semantic code, from which we reconstruct the pose-transferred image via a U-Net-like decoder [25]. Moreover, instead of the keypoint-based pose representation, we adopt the binary segmentation mask as the pose representation, as shown in Fig. 1. On the one hand, the pose mask has the capability of removing the background clutter at pixel level. On the other hand, the binary segmentation mask has only one channel, which considerably decreases the computational load and the number of parameters in the training process. The main contributions of this paper can be summarized as follows:

• We propose a semantic attention network to address the challenging task of the pose transfer. The proposed network further improves the quality of the generated image by using the semantic attention block (SAB). These blocks can effectively utilize the semantic parsing and the image features to smoothly guide the pose transfer process.

• We are the first to utilize the segmentation mask as the pose representation in the pose transfer. In contrast to the keypoint-based pose representation, the segmentation mask has the capability to remove the background clutter at pixel level, which makes the generator focus on the human body region and decreases the computational burden drastically.

• Our method has superior performance in both preserving body shape and keeping clothing attributes on challenging benchmarks. The generated images are utilized to enrich the pose variation of the person, and substantially augment the person dataset for the person re-identification application.
2. Related Work
With the development of generative adversarial networks (GANs) [22], image generation can be potentially applied in many fields such as image restoration [23, 32], image retrieval [30] and cross-domain image generation [16]. To address the pose variation and improve the generalization ability of Re-ID models, most of the existing methods focus on pose-guided image generation.

The early attempt at pose transfer was achieved in [20], which is composed of two-stage networks. Stage I aims to coarsely generate the image under the target pose, and Stage II adds more appearance details. To further improve this work, Ma et al. [21] disentangled and encoded three modes (the foreground, the background and the pose) into embedding features, and then decoded them back to the image. However, the spatial deformation between the source and the target is not fully considered in the aforementioned literature, which makes it difficult to handle situations with misaligned appearance. Therefore, recent works [2, 27, 42] attempted to transform the pixels of the original image to align with the generated image under the target pose. Balakrishnan et al. [2] proposed a method to separate the source image into different parts and reconstruct them according to the target pose. Siarohin et al. [27] introduced deformable skip connections that require extensive affine transformation computation. Zhu et al. [42] presented a pose transfer network that infers the regions of interest based on the human pose. In these methods, the appearances of the generated images are less realistic-looking, since the outputs are produced from highly compressed features. Instead, we utilize the semantic parsing map to guide the image synthesis with higher-level structure information.

Inspired by the text-to-image translation methods in [11, 13], Dong et al. [3] proposed a network, called Soft-Gated Warping-GAN, which first predicts the target segmentation map, and then estimates the transformations by a geometric matcher for rendering the textures. Han et al. [7] utilized a two-stream architecture to extract the appearance flow between the source and the target clothing regions for person image generation. Similarly, Song et al. [31] decomposed the conversion process into two stages: the semantic parsing transformation and the appearance generation. It is worth noting that an incorrectly predicted semantic parsing map in the above literature can affect the quality of the synthesized image.

Compared with the existing work, we introduce a semantic map with an attention mechanism to preserve texture synthesis between the corresponding semantic map areas in a supervised manner, and we generate the person appearance and the clothing texture simultaneously by using the proposed two-pathway network. Furthermore, most of these methods employ pose and keypoint estimation, whereas the binary segmentation mask [29] as the pose representation is much cheaper and more flexible. Therefore, we utilize the binary segmentation mask as the target pose and remove the background clutter of the original image.
3. The Proposed Approach
Notations.
We first define the notations used in this paper. Assume we have access to a set of $N$ images $I = \{I_i\}_{i=1}^{N}$, where $I_i \in \mathbb{R}^{H \times W}$ represents the $i$-th image in the dataset. $S_i$ and $P_i$ are the corresponding semantic map and binary segmentation mask of $I_i$, respectively.
Figure 2: Our overall framework for the proposed method. We forward the initial inputs into convolutional layers to obtain the encoding features $F_{p_s}$ and $F_{p_t}$, then update them by the semantic attention network (in gray). The decoder (in green) generates the fake image $\hat{I}_{p_t}$. We utilize the discriminator (in pink) to judge how likely it is that $\hat{I}_{p_t}$ contains the same person as $I_{p_s}$. The detailed illustration of the semantic attention block can be found in Fig. 3.

Figure 3: The details of the Semantic Attention Block. $\odot$ and $\oplus$ represent the element-wise multiply and the element-wise sum, respectively; ⓒ denotes the depth concatenation.

The semantic map $S_i$ is extracted with the Pascal-Person-Part parser [17]. We describe $S_i$ by a pixel-level one-hot encoding, i.e., $S_i \in \{0, 1\}^{L \times H \times W}$, where $L$ denotes the total number of semantic labels. The binary segmentation mask $P_i$ is obtained by Mask R-CNN [8], where $P_i \in \mathbb{R}^{H \times W}$, as shown in Fig. 1.
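To make the one-hot encoding concrete, the following minimal PyTorch sketch converts a parsing label map into the tensor $S_i$; the `label_map` here is a random stand-in for a parser output, not the authors' released code:

```python
import torch
import torch.nn.functional as F

L = 20                                       # total number of semantic labels
H, W = 128, 64                               # image size used for Market-1501

# Stand-in for a human-parsing result: an H x W map of integer labels in [0, L-1].
label_map = torch.randint(0, L, (H, W))

# Pixel-level one-hot encoding: S_i in {0, 1}^(L x H x W).
S_i = F.one_hot(label_map, num_classes=L)    # H x W x L
S_i = S_i.permute(2, 0, 1).float()           # L x H x W
print(S_i.shape)                             # torch.Size([20, 128, 64])
```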
Problem Formulation. In this paper, we propose the semantic attention network to address the challenging task of the pose transfer. The overview of the proposed method is shown in Fig. 2. Given a reference image $I_{p_s}$ under the source pose $p_s$, our method aims to transfer the pose of the person in $I_{p_s}$ from the source pose $p_s$ to the target pose $p_t$. The generated image $\hat{I}_{p_t}$ follows the clothing appearance of $I_{p_s}$ but under the pose $p_t$. More details of each component of the proposed network can be found in the following sections.

The generator aims to synthesize the appearance and texture of the output image $\hat{I}_{p_t}$ under the target pose $p_t$, guided by the reference image $I_{p_s}$ and the target semantic map $S_{p_t}$. Therefore, the initial input is divided into two pathways for encoding features, called the appearance pathway and the semantic pathway, respectively. The appearance pathway is used to extract the appearance details; it takes the reference image $I_{p_s}$, the pose mask $M_{p_s}$ and the semantic map $S_{p_s}$ as inputs. The other pathway is employed to update the clothing textures; it takes $M_{p_t}$ and $S_{p_t}$ as inputs. The inputs of each pathway are stacked along their depth axes before being encoded by convolutional layers, each followed by a batch normalization layer (BN) [12] and a leaky rectified linear unit (LeakyReLU) [35]. This encoding process mixes the corresponding poses and semantic maps, preserving their pixel-level information and capturing their dependencies.
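As an illustration of the two encoding pathways, a minimal sketch is given below; the exact layer counts and channel widths are our own assumptions, since the text does not fix them:

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch, stride=1):
    # Conv -> BatchNorm -> LeakyReLU, as described for the encoders.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(0.2, inplace=True),
    )

L = 20  # number of semantic labels
# Appearance pathway input: image (3) + pose mask (1) + semantic map (L), stacked on the depth axis.
appearance_encoder = nn.Sequential(conv_block(3 + 1 + L, 64), conv_block(64, 128, stride=2))
# Semantic pathway input: target pose mask (1) + target semantic map (L).
semantic_encoder = nn.Sequential(conv_block(1 + L, 64), conv_block(64, 128, stride=2))

I_ps = torch.randn(1, 3, 128, 64)   # reference image
M_ps = torch.randn(1, 1, 128, 64)   # source pose mask
S_ps = torch.randn(1, L, 128, 64)   # source semantic map
M_pt = torch.randn(1, 1, 128, 64)   # target pose mask
S_pt = torch.randn(1, L, 128, 64)   # target semantic map

F_ps = appearance_encoder(torch.cat([I_ps, M_ps, S_ps], dim=1))  # appearance code F_{p_s}
F_pt = semantic_encoder(torch.cat([M_pt, S_pt], dim=1))          # semantic code F_{p_t}
```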
To concentrate on the human body shape and the clothing texture transfer, we propose a semantic attention network (SAN), which consists of several semantic attention blocks that infer the regions of interest based on the target pose and the appearance. The specific details of the block are shown in Fig. 3. The encoding features $F_{p_s}$ and $F_{p_t}$ are simultaneously sent to the SAN. The SAN then updates these codes to obtain the final appearance code $F_{p_s}^{T}$ and semantic code $F_{p_t}^{T}$ through the sequence of blocks. We describe the detailed update process as follows.

Appearance Code Update. The pose transfer is about moving body patches from the reference location to the target location, which can cause large variations of the articulation and the clothing texture. To address this issue, an attention mask $M$ is used to evaluate the importance of each position. The encoding feature $F_{p_t}$ first goes through three convolutional layers with two normalization layers and LeakyReLU, and the semantic code $F_{p_t}^{t-1}$ is then mapped to values between 0 and 1 by an element-wise sigmoid function:

$M = \frac{1}{1 + e^{-x_{zij}}}$,  (1)

where $x_{zij}$ denotes the value of the $(i, j)$ element in the $z$-th channel of the updated semantic code. The appearance code $F_{p_s}^{t}$ preserves or suppresses information by multiplying the semantic code $F_{p_t}^{t-1}$ with the attention mask $M$. In addition, we adopt a residual connection to preserve the original image information in the pose transfer: the transformed image feature is added after the element-wise product. The process can be described as:

$F_{p_s}^{t} = M_t \odot F_{p_t}^{t-1} + F_{p_s}^{t-1}$,  (2)

where $M_t$ denotes the attention score of the $t$-th semantic attention block, and $\odot$ denotes the element-wise product.
Semantic Code Update. In the pose transfer, the semantic code $F_{p_t}^{t-1}$ needs to be updated by incorporating the new appearance code. This update evaluates which clothing patches can be preserved and how the texture should change under the new pose. Specifically, the encoding feature $F_{p_t}$ is first fed into three convolutional layers with two normalization layers and LeakyReLU to obtain the semantic code $F_{p_t}^{t}$. Then, $F_{p_t}^{t}$ is concatenated with the updated new appearance code $F_{p_s}^{t}$ along the depth axis. The update process can be expressed as:

$F_{p_t}^{t} = Conv(F_{p_t}^{t-1} \,\|\, F_{p_s}^{t})$,  (3)

where $\|$ denotes the concatenation of the updated codes. We forward the final updated codes $F_{p_s}^{T}$ and $F_{p_t}^{T}$ to $Conv_A$ and $Conv_S$, respectively, to obtain their content features simultaneously. $Conv_A$ consists of three down-sampling convolution subnetworks, where each subnetwork consists of convolutional layers with LeakyReLU and BatchNorm. The architecture of $Conv_S$ is similar to that of $Conv_A$.

The structure of the decoder is similar to U-Net [25]; it generates the image $\hat{I}_{p_t}$ under the target pose $p_t$ based on the encoded features. At each up-sampling step, the semantic features of $Conv_S$ are concatenated with the feature map of the same size in the corresponding channel of $Conv_A$, from which $N$ deconvolutional layers learn to assemble a more realistic-looking output image.

We employ a discriminator $D$ to help the generator network differentiate real and fake images. The discriminator $D$ takes $\hat{I}_{p_t}$ and $I_{p_t}$ as inputs, and is built from three residual blocks after two down-sampling convolutions. Our experiments have shown that if we use hard labels, the loss value of the discriminator quickly falls to zero. Accordingly, a sigmoid layer is employed at the end of the discriminator to obtain a probability output.
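Putting the two update rules of Eqs. (1)–(3) together, one semantic attention block could be sketched as follows; the channel width and kernel sizes are illustrative assumptions:

```python
import torch
import torch.nn as nn

class SemanticAttentionBlock(nn.Module):
    """One SAB: updates the appearance code (Eq. 2) and the semantic code (Eq. 3)."""

    def __init__(self, ch=128):
        super().__init__()
        # Three conv layers with two BN/LeakyReLU pairs in between, producing attention logits.
        self.mask_conv = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.LeakyReLU(0.2, True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.LeakyReLU(0.2, True),
            nn.Conv2d(ch, ch, 3, padding=1),
        )
        # Conv applied after concatenating the semantic code with the new appearance code.
        self.update_conv = nn.Conv2d(2 * ch, ch, 3, padding=1)

    def forward(self, F_ps, F_pt):
        M = torch.sigmoid(self.mask_conv(F_pt))       # Eq. (1): element-wise attention mask
        F_ps_new = M * F_pt + F_ps                    # Eq. (2): attend, then residual connection
        F_pt_new = self.update_conv(torch.cat([F_pt, F_ps_new], dim=1))  # Eq. (3)
        return F_ps_new, F_pt_new

# The SAN chains T such blocks (T = 5 in our experiments).
san = nn.ModuleList([SemanticAttentionBlock(128) for _ in range(5)])
```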
We train the network using a joint loss consisting of an adversarial loss, a reconstruction $\ell_1$ loss, and a perceptual loss.

Adversarial Loss.
The generative adversarial framework is employed to approximate the distribution of the ground truth $I_{p_t}$. The adversarial loss can be written as:

$\mathcal{L}_{adv} = \mathbb{E}_{(I_{p_s}, I_{p_t}) \sim \mathcal{P}}[\log D(I_{p_s}, I_{p_t})] + \mathbb{E}_{I_{p_t} \sim \mathcal{P},\, \hat{I}_{p_t} \sim \hat{\mathcal{P}}}[\log(1 - D(I_{p_t}, \hat{I}_{p_t}))]$,  (4)

where $\mathcal{P}$ and $\hat{\mathcal{P}}$ denote the distributions of the ground-truth images and the synthesized images, respectively, and $I_{p_t}$ represents the real target image under the pose $p_t$.

$\ell_1$ Loss.
The $\ell_1$ loss is the pixel-wise reconstruction error computed between the synthesized image $\hat{I}_{p_t}$ and the target ground-truth image $I_{p_t}$:

$\mathcal{L}_{\ell_1} = \| I_{p_t} - \hat{I}_{p_t} \|_1$.  (5)

Perceptual Loss.
To generate smoother and more realistic-looking person images, we also utilize the perceptual loss, which is the Euclidean distance between feature representations:

$\mathcal{L}_{perc} = \frac{1}{W_\rho H_\rho C_\rho} \sum_{x=1}^{W_\rho} \sum_{y=1}^{H_\rho} \sum_{z=1}^{C_\rho} \| \phi_\rho(\hat{I}_{p_t})_{x,y,z} - \phi_\rho(I_{p_t})_{x,y,z} \|$,  (6)

where $\phi_\rho$ is the activation map of the $\rho$-th layer of the VGG-19 model [28] pre-trained on ImageNet [26], and $W_\rho$, $H_\rho$ and $C_\rho$ denote the spatial width, height and depth of $\phi_\rho$, respectively.

The overall loss function of the model is:

$\mathcal{L}_{full} = \arg \min_G \max_D \; \alpha \mathcal{L}_{adv} + \beta \mathcal{L}_{\ell_1} + \gamma \mathcal{L}_{perc}$,  (7)

where $\mathcal{L}_{adv}$ denotes the adversarial loss, $\mathcal{L}_{\ell_1}$ denotes the pixel-wise $\ell_1$ loss and $\mathcal{L}_{perc}$ is the perceptual loss. $\alpha$, $\beta$ and $\gamma$ represent the weight factors of the three loss terms contributing to $\mathcal{L}_{full}$, respectively.
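A hedged sketch of the joint objective in Eqs. (4)–(7) is given below, using a pretrained VGG-19 for the perceptual term; the chosen VGG layer index and the non-saturating form of the generator's adversarial term are our assumptions:

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg19

vgg_features = vgg19(pretrained=True).features.eval()
for p in vgg_features.parameters():
    p.requires_grad_(False)

def perceptual_loss(fake, real, layer=21):
    # L2 distance between VGG-19 activations (Eq. 6); the layer index is an assumption.
    return F.mse_loss(vgg_features[:layer](fake), vgg_features[:layer](real))

def d_loss(D, I_ps, I_pt, I_fake):
    # Discriminator side of Eq. (4): real pairs scored high, fake pairs scored low.
    real = D(I_ps, I_pt)
    fake = D(I_pt, I_fake.detach())
    return -(torch.log(real + 1e-8) + torch.log(1.0 - fake + 1e-8)).mean()

def g_loss(D, I_pt, I_fake, alpha=10.0, beta=15.0, gamma=5.0):
    # Generator side of Eq. (7) with the Market-1501 weights; D outputs probabilities.
    adv = -torch.log(D(I_pt, I_fake) + 1e-8).mean()  # non-saturating adversarial term
    return alpha * adv + beta * F.l1_loss(I_fake, I_pt) + gamma * perceptual_loss(I_fake, I_pt)
```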
4. Experiments
In this section, we conduct extensive experiments to verify the design rationality and efficiency of the proposed network. The experiments demonstrate the superiority of our method in both objective quantitative scores and subjective visual realism.
Datasets.
We conduct experiments on the challenging person Re-ID dataset Market-1501 [39] and the In-shop Clothes Retrieval Benchmark of the DeepFashion dataset [19]. Market-1501 contains 32,668 low-resolution (128 × 64) images, which vary enormously in pose, viewpoint, background and illumination. DeepFashion contains 52,712 high-resolution (256 × 256) clothing images with various poses and appearances. We randomly collect 243,200 training pairs and 12,000 testing pairs for Market-1501, and 106,966 training pairs and 8,570 testing pairs for DeepFashion. The person identities of the training sets and the testing sets do not overlap, which better evaluates the generalization ability of the network.
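Each pair couples two images of the same identity under different poses; a simple illustrative pairing routine (our own sketch, not the released data-preparation code) might look as follows:

```python
import random
from collections import defaultdict

def build_pairs(image_ids, num_pairs, seed=0):
    """Randomly sample (source, target) image pairs sharing a person identity.

    image_ids: list of (image_path, person_id) tuples.
    """
    random.seed(seed)
    by_person = defaultdict(list)
    for path, pid in image_ids:
        by_person[pid].append(path)
    # Keep only identities with at least two images so a pose pair can be formed.
    candidates = [paths for paths in by_person.values() if len(paths) >= 2]
    pairs = []
    while len(pairs) < num_pairs:
        paths = random.choice(candidates)
        src, tgt = random.sample(paths, 2)   # two different poses of the same person
        pairs.append((src, tgt))
    return pairs
```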
Metrics.
In our experiments, we adopt the Learned Perceptual Image Patch Similarity (LPIPS) [37] and the Fréchet Inception Distance (FID) [10] for quantitative evaluation. LPIPS measures the reconstruction error between the synthesized images and the ground-truth images in the perceptual domain. We also use its masked version, mask-LPIPS, to reduce the background influence by masking it out. FID computes the Wasserstein-2 distance between the distributions of the synthesized images and the reference images.
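Both metrics are available off the shelf; for instance, LPIPS can be computed with the `lpips` PyPI package as sketched below (FID can be computed analogously with the `pytorch-fid` package, e.g. `python -m pytorch_fid real_dir fake_dir`):

```python
import torch
import lpips  # pip install lpips

# AlexNet-based LPIPS; inputs are RGB tensors scaled to [-1, 1].
loss_fn = lpips.LPIPS(net='alex')
fake = torch.randn(1, 3, 128, 64).clamp(-1, 1)   # stand-in for a generated image
real = torch.randn(1, 3, 128, 64).clamp(-1, 1)   # stand-in for the ground truth
dist = loss_fn(fake, real)                       # lower is better
print(dist.item())
```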
Our method is implemented in the popular PyTorch framework. For the person representation, the semantic maps are extracted with the Pascal-Person-Part parser [17]. We employ the semantic labels originally defined in [17] and set $L = 20$ (i.e., background, hat, hair, glove, sunglasses, upper-clothes, dress, coat, socks, pants, jumpsuits, scarf, skirt, face, left-arm, right-arm, left-leg, right-leg, left-shoe, right-shoe). We utilize 5 semantic attention blocks in the network for both datasets. For Market-1501, we directly employ the training set and the testing set at 128 × 64 resolution. The proposed method is trained using the Adam optimizer [14]. The learning rate is initially set to $10^{-4}$ and linearly decayed to 0 after 300 epochs. The batch size is set to 32. $\alpha$, $\beta$ and $\gamma$ are set to 10, 15 and 5. For DeepFashion, we utilize the training set and the testing set at 256 × 256 resolution. The learning rate is initially set to $10^{-4}$ and linearly decayed to 0 after 500 epochs. The batch size is set to 8. $\alpha$, $\beta$ and $\gamma$ are set to 15, 1 and 5.
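A sketch of the Market-1501 training configuration follows; the Adam momentum terms are common GAN defaults rather than values taken from this paper:

```python
import torch

generator = torch.nn.Conv2d(3, 3, 3, padding=1)  # placeholder for the full generator

# Adam with a 1e-4 initial learning rate; the betas are an assumption.
optimizer = torch.optim.Adam(generator.parameters(), lr=1e-4, betas=(0.5, 0.999))

# Linear decay of the learning rate to 0 over the 300 Market-1501 epochs.
total_epochs = 300
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda epoch: 1.0 - epoch / total_epochs)

for epoch in range(total_epochs):
    # ... one training epoch over batches of size 32 ...
    scheduler.step()
```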
Figure 5: Comparison with different methods on the Market-1501 dataset: Def-GAN [27], VU-Net [4], Intr-Flow [18] and Deep Image Spatial [24].
Figure 6: Comparison with different methods on the DeepFashion dataset: Def-GAN [27], VU-Net [4], Intr-Flow [18] and Deep Image Spatial [24].
Quantitative Results.
We compare our model with several state-of-the-art methods, including VU-Net [4], Def-GAN [27], Intr-Flow [18] and Deep Image Spatial [24]. The quantitative comparison results are shown in Tab. 1. For a fair comparison, we download the pre-trained models of the previous works and re-evaluate their quantitative performance on our testing set. In contrast to the other methods, our proposed method achieves the best results on both datasets, which means that our model can generate realistic results with fewer reconstruction errors. The images generated by our model have more realistic details and better body shape.

Table 1
Comparison of quantitative results on the Market-1501 and DeepFashion datasets.

Methods                 | Market-1501                | DeepFashion
                        | FID     LPIPS   Mask-LPIPS | FID     LPIPS
VU-Net [4]              | 20.144  0.3211  0.1747     | 23.667  0.2637
Def-GAN [27]            | 25.364  0.2994  0.1496     | 18.457  0.2330
Intr-Flow [18]          | 27.163  0.2888  0.1403     | 16.314  0.2131
Deep Image Spatial [24] | 19.751  0.2817  0.1482     | 10.573  0.2341
Our method              |                            |
Qualitative Comparison.
We evaluate the performance of our network on both the Market-1501 and the DeepFashion dataset. Some typical qualitative examples on Market-1501 are shown in Fig. 5, and examples on DeepFashion in Fig. 6. Our method achieves superior results in both the appearance and the clothing attributes. For Market-1501, note especially the cloth color in the first row, the detailed bag in the second row, and the clear clothing textures in the last two rows. For DeepFashion, note the clear collar edge of the T-shirt in the first row, the detailed clothes texture in the second and third rows, and the complete shoes and bag in the last row.
In this section, we present an ablation study to evaluate the contribution of each component. The discriminator architecture is the same for all the methods.
PercLoss.
The perceptual loss is effective in super-resolution and style transfer as well as in the pose transfer task. To evaluate the importance of the perceptual loss, we train a variant of the network using only the adversarial loss and the reconstruction $\ell_1$ loss.

Segmentation Mask.
The binary segmentation mask directly separates the background from the input images, which helps the generator and the discriminator focus on the human body region. To verify the effectiveness of the binary segmentation mask, we remove the segmentation mask from the initial inputs. The rest of the network is the same as the full pipeline.
Full pipeline.
This model uses our full proposed semantic attention framework.

Fig. 7 shows the images generated by the different variants of our method. We first analyze the perceptual loss function. In some cases, the improvement of Full over PercLoss is quite drastic, such as the drawing of the face. Moreover, without the guidance of the segmentation mask, the clothing texture can be confused with the background, and the network finds it difficult to handle the clothing attributes and the body shape at the same time. The introduction of the segmentation mask and the perceptual loss function improves the visual quality of the output image.
Figure 7: Qualitative results on the Market-1501 dataset with respect to different variants of our method.

In this section, we analyze the effectiveness of the semantic attention blocks by considering the impact of different numbers of attention blocks on the network. The semantic attention block aims to capture the appearance consistency and the semantic information simultaneously, so we can improve the ability of the generator by varying the number of SABs. Qualitative comparison results are shown in Fig. 8. As the number of SABs increases, the appearance of the person becomes clearer, and the clothing textures gradually match those of the target image. The overall quality of the image is refined from the foreground to the background.
Figure 8: Qualitative results with different numbers of SABs on the Market-1501 and DeepFashion datasets.

5. Application to person re-identification

Person re-identification (Re-ID) aims to match a person across non-overlapping cameras. This task has attracted considerable attention for its application in automatic video surveillance. Since existing benchmarks such as Market-1501 contain a limited number of pose changes, pose variation becomes one of the key factors which prevents the network from learning a robust Re-ID model. The experiments in this section are motivated by the value of generative methods as a data augmentation tool which provides additional labeled samples for training discriminative methods.

In this section, we utilize different person Re-ID approaches to evaluate the performance of our method in augmenting person Re-ID on Market-1501. Identification Embedding (IDE) [40] is an approach that regards Re-ID training as an image classification task. In the test phase, the identity of the person is assigned based on the feature representation, where each image feature is obtained from the classification layer of the network; each query image is associated with the identity of the closest image in the gallery. In our experiment, we employ different metrics to calculate the distance among the feature representations: the Euclidean distance [6], Cross-view Quadratic Discriminant Analysis (XQDA) and a Mahalanobis-based distance (KISSME [15]). Another approach, the Siamese network [41], predicts whether the identities of the two input images are the same. For all approaches, we adopt a pre-trained ResNet-50 [9] as the test benchmark.

We augment the Market-1501 dataset by a factor $\alpha$. For each person image, we select $\alpha - 1$ target poses to guide the generation network to synthesize new images. Each synthesized image is labeled with the identity of its reference image. We compare the performance of our model to several existing person image generators [4, 21, 27, 42] under the same setting for Re-ID data augmentation. Compared with the previous methods, our method achieves competitive results, as shown in Tab. ??. Our proposed method generates smoother and more natural human images and is more effective for the Re-ID task.
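A minimal sketch of this augmentation procedure is shown below; the `generator` and `pose_bank` interfaces are hypothetical placeholders:

```python
import random

def augment_dataset(dataset, generator, pose_bank, alpha=2):
    """Enlarge a Re-ID dataset by a factor alpha with pose-transferred images.

    dataset:   list of (image, person_id) pairs.
    pose_bank: pool of target pose masks / semantic maps to sample from.
    generator: trained pose-transfer network (hypothetical interface).
    """
    augmented = list(dataset)
    for image, pid in dataset:
        # Select alpha - 1 target poses for each reference image.
        for target_pose in random.sample(pose_bank, alpha - 1):
            fake = generator(image, target_pose)
            # Each synthesized image inherits the identity of its reference image.
            augmented.append((fake, pid))
    return augmented
```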
6. Conclusion
In this paper, we propose a semantic attention network for person image generation to deal with the challenging pose transfer. To address the complexity of learning the clothing attributes under different poses, we design an attention mechanism to capture the pose and the semantic parsing simultaneously. We take the binary segmentation mask and the human semantic map as the network inputs, which reduces the effect of background clutter and decreases the computational load. Compared with previous works, our model exhibits superior performance in both quantitative scores and visual realism. Moreover, the experimental results show that our proposed method can substantially alleviate the insufficient-training-data problem for person Re-ID. In the future, we will predict the pose mask and the semantic map of the desired pose to support unsupervised person image generation.
Acknowledgments
This work is supported by the National Natural Science Foundation of China (61573114) and the Fundamental Research Funds for the Central Universities (HEUCF160415).
References

[1] Bai, X., Yang, M., Huang, T., Dou, Z., Yu, R., Xu, Y., 2020. Deep-person: Learning discriminative deep features for person re-identification. Pattern Recognition 98, 107036.
[2] Balakrishnan, G., Zhao, A., Dalca, A.V., Durand, F., Guttag, J., 2018. Synthesizing images of humans in unseen poses, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8340–8348.
[3] Dong, H., Liang, X., Gong, K., Lai, H., Zhu, J., Yin, J., 2018. Soft-gated warping-gan for pose-guided person image synthesis, in: Advances in Neural Information Processing Systems, pp. 474–484.
[4] Esser, P., Sutter, E., Ommer, B., 2018. A variational u-net for conditional appearance and shape generation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8857–8866.
[5] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y., 2014. Generative adversarial nets, in: Advances in Neural Information Processing Systems, pp. 2672–2680.
[6] Gower, J.C., 1985. Properties of euclidean and non-euclidean distance matrices. Linear Algebra and its Applications 67, 81–97.
[7] Han, X., Hu, X., Huang, W., Scott, M.R., 2019. Clothflow: A flow-based model for clothed person generation, in: Proceedings of the IEEE International Conference on Computer Vision, pp. 10471–10480.
[8] He, K., Gkioxari, G., Dollár, P., Girshick, R., 2017. Mask r-cnn, in: Proceedings of the IEEE International Conference on Computer Vision, pp. 2961–2969.
[9] He, K., Zhang, X., Ren, S., Sun, J., 2016. Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
[10] Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S., 2017. Gans trained by a two time-scale update rule converge to a local nash equilibrium, in: Advances in Neural Information Processing Systems, pp. 6626–6637.
[11] Hong, S., Yang, D., Choi, J., Lee, H., 2018. Inferring semantic layout for hierarchical text-to-image synthesis, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7986–7994.
[12] Ioffe, S., Szegedy, C., 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.
[13] Johnson, J., Gupta, A., Fei-Fei, L., 2018. Image generation from scene graphs, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1219–1228.
[14] Kingma, D.P., Ba, J., 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[15] Koestinger, M., Hirzer, M., Wohlhart, P., Roth, P.M., Bischof, H., 2012. Large scale metric learning from equivalence constraints, in: 2012 IEEE Conference on Computer Vision and Pattern Recognition, IEEE. pp. 2288–2295.
[16] Li, D., Du, C., He, H., 2020. Semi-supervised cross-modal image generation with generative adversarial networks. Pattern Recognition 100, 107085.
[17] Li, P., Xu, Y., Wei, Y., Yang, Y., 2019a. Self-correction for human parsing. arXiv preprint arXiv:1910.09777.
[18] Li, Y., Huang, C., Loy, C.C., 2019b. Dense intrinsic appearance flow for human pose transfer, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3693–3702.
[19] Liu, Z., Luo, P., Qiu, S., Wang, X., Tang, X., 2016. Deepfashion: Powering robust clothes recognition and retrieval with rich annotations, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1096–1104.
[20] Ma, L., Jia, X., Sun, Q., Schiele, B., Tuytelaars, T., Van Gool, L., 2017. Pose guided person image generation, in: Advances in Neural Information Processing Systems, pp. 406–416.
[21] Ma, L., Sun, Q., Georgoulis, S., Van Gool, L., Schiele, B., Fritz, M., 2018. Disentangled person image generation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 99–108.
[22] Mirza, M., Osindero, S., 2014. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784.
[23] Pan, J., Dong, J., Liu, Y., Zhang, J., Ren, J., Tang, J., Tai, Y.W., Yang, M.H., 2020. Physics-based generative adversarial models for image restoration and beyond. IEEE Transactions on Pattern Analysis and Machine Intelligence.
[24] Ren, Y., Yu, X., Chen, J., Li, T.H., Li, G., 2020. Deep image spatial transformation for person image generation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7690–7699.
[25] Ronneberger, O., Fischer, P., Brox, T., 2015. U-net: Convolutional networks for biomedical image segmentation, in: International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer. pp. 234–241.
[26] Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al., 2015. Imagenet large scale visual recognition challenge. International Journal of Computer Vision 115, 211–252.
[27] Siarohin, A., Lathuilière, S., Sangineto, E., Sebe, N., 2019. Appearance and pose-conditioned human image generation using deformable gans. IEEE Transactions on Pattern Analysis and Machine Intelligence.
[28] Simonyan, K., Zisserman, A., 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
[29] Song, C., Huang, Y., Ouyang, W., Wang, L., 2018. Mask-guided contrastive attention model for person re-identification, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1179–1188.
[30] Song, J., He, T., Gao, L., Xu, X., Hanjalic, A., Shen, H.T., 2020a. Unified binary generative adversarial network for image retrieval and compression. International Journal of Computer Vision, 1–22.
[31] Song, S., Zhang, W., Liu, J., Mei, T., 2019. Unsupervised person image generation with semantic parsing transformation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2357–2366.
[32] Song, T.A., Chowdhury, S.R., Yang, F., Dutta, J., 2020b. Pet image super-resolution using generative adversarial networks. Neural Networks 125, 83–91.
[33] Xing, X., Han, T., Gao, R., Zhu, S.C., Wu, Y.N., 2019. Unsupervised disentangling of appearance and geometry by deformable generator network, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 10354–10363.
[34] Xing, X., Wu, T., Zhu, S.C., Wu, Y.N., 2020. Inducing hierarchical compositional model by sparsifying generator network, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14296–14305.
[35] Xu, B., Wang, N., Chen, T., Li, M., 2015. Empirical evaluation of rectified activations in convolutional network. arXiv preprint arXiv:1505.00853.
[36] Yao, H., Zhang, S., Hong, R., Zhang, Y., Xu, C., Tian, Q., 2019. Deep representation learning with part loss for person re-identification. IEEE Transactions on Image Processing 28, 2860–2871.
[37] Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O., 2018. The unreasonable effectiveness of deep features as a perceptual metric, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 586–595.
[38] Zheng, L., Huang, Y., Lu, H., Yang, Y., 2019. Pose-invariant embedding for deep person re-identification. IEEE Transactions on Image Processing 28, 4500–4509.
[39] Zheng, L., Shen, L., Tian, L., Wang, S., Wang, J., Tian, Q., 2015. Scalable person re-identification: A benchmark, in: Proceedings of the IEEE International Conference on Computer Vision, pp. 1116–1124.
[40] Zheng, L., Yang, Y., Hauptmann, A.G., 2016. Person re-identification: Past, present and future. arXiv preprint arXiv:1610.02984.
[41] Zheng, Z., Zheng, L., Yang, Y., 2017. A discriminatively learned cnn embedding for person reidentification. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) 14, 1–20.
[42] Zhu, Z., Huang, T., Shi, B., Yu, M., Wang, B., Bai, X., 2019. Progressive pose attention transfer for person image generation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2347–2356.