Structure-aware Person Image Generation with Pose Decomposition and Semantic Correlation
Jilin Tang, Yi Yuan, Tianjia Shao, Yong Liu, Mengmeng Wang, Kun Zhou
NetEase Fuxi AI Lab; State Key Lab of CAD&CG, Zhejiang University; Institute of Cyber-Systems and Control, Zhejiang University
{tangjilin, yuanyi}@corp.netease.com, {tjshao, mengmengwang}@zju.edu.cn, [email protected], [email protected]

Abstract
In this paper we tackle the problem of pose guided person image generation, which aims to transfer a person image from the source pose to a novel target pose while maintaining the source appearance. Given the inefficiency of standard CNNs in handling large spatial transformation, we propose a structure-aware flow based method for high-quality person image generation. Specifically, instead of learning the complex overall pose changes of the human body, we decompose the human body into different semantic parts (e.g., head, torso, and legs) and apply different networks to predict the flow fields for these parts separately. Moreover, we carefully design the network modules to effectively capture the local and global semantic correlations of features within and among the human parts respectively. Extensive experimental results show that our method can generate high-quality results under large pose discrepancy and outperforms state-of-the-art methods in both qualitative and quantitative comparisons.
Introduction
Pose guided person image generation (Ma et al. 2017), which aims to synthesize a realistic-looking person image in a target pose while preserving the source appearance details (as depicted in Figure 1), has attracted extensive attention due to its wide range of practical applications in image editing, image animation, person re-identification (ReID), and so on.

Figure 1: The generated person images in random target poses by our method.

Motivated by the development of Generative Adversarial Networks (GANs) in the image-to-image transformation task (Zhu et al. 2017), many researchers (Ma et al. 2017, 2018; Zhu et al. 2019; Men et al. 2020) attempted to tackle the person image generation problem within the framework of generative models. However, as CNNs are not good at tackling large spatial transformations (Ren et al. 2020), these generation-based models may fail to handle the feature misalignment caused by the spatial deformation between the source and target image, leading to appearance distortions. To deal with the feature misalignment, appearance flow based methods have recently been proposed (Ren et al. 2020; Liu et al. 2019; Han et al. 2019) to transform the source features to align them with the target pose, modeling the dense pixel-to-pixel correspondence between the source and target features. Specifically, the appearance flow based methods aim to calculate the 2D coordinate offsets (i.e., appearance flow fields) that indicate which positions in the source features should be sampled to reconstruct the corresponding target features. With such a flow mechanism, the existing flow based methods can synthesize target images with visually plausible appearances in most cases. However, it is still challenging to generate satisfying results when there are large pose discrepancies between the source and target images (see Figure 5 for example).

To tackle this challenge, we propose a structure-aware flow based method for high-quality person image generation. The key insight of our work is that incorporating structure information provides important priors to guide the network learning, and hence can effectively improve the results. First, we observe that the human body is composed of different parts with different motion complexities w.r.t. pose changes. Hence, instead of using a unified network to predict the overall appearance flow field of the human body, we decompose the human body into different semantic parts (e.g., head, torso, and legs) and employ different networks to estimate the flow fields for these parts separately. In this way, we not only reduce the difficulty of learning the complex overall pose changes, but can more precisely capture the pose change of each part with a specific network. Second, for close pixels belonging to each part of the human body, the appearance features are often semantically correlated. For example, the adjacent positions inside the arm should have similar appearances after being transformed to a new pose. To this end, compared to the existing methods which generate features at target positions independently with limited receptive fields, we introduce a hybrid dilated convolution block which is composed of sequential convolutional layers with different dilation rates (Yu and Koltun 2015; Chen et al. 2017; Li, Zhang, and Chen 2018) to effectively capture the short-range semantic correlations of local neighbors inside human parts by enlarging the receptive field of each position.
Third, semantic correlations also exist between the features of different human parts that are far away from each other, owing to the symmetry of the human body. For instance, the features of the left and right sleeves are often required to be consistent. Therefore, we design a lightweight yet effective non-local component named pyramid non-local block, which combines multi-scale pyramid pooling (He et al. 2015; Kim et al. 2018) with the standard non-local operation (Wang et al. 2018) to capture the long-range semantic correlations across different human part regions at different scales.

Technically, our network takes as input a source person image and a target pose, and synthesizes a new person image in the target pose while preserving the source appearance. The network architecture is composed of three modules. The part-based flow generation module divides the human joints into different parts, and deploys different models to predict local appearance flow fields and visibility maps of the different parts respectively. Then, the local warping module warps the source part features extracted from the source part images, so as to align them with the target pose while capturing the short-range semantic correlations of local neighbors within the parts via the hybrid dilated convolution block. Finally, the global fusion module aggregates the warped features of different parts into the global fusion features and further applies the pyramid non-local block to learn the long-range semantic correlations among different part regions, and finally outputs a synthesized person image.

The main contributions can be summarized as:
• We propose a structure-aware flow based framework for pose guided person image generation, which can synthesize high-quality person images even with large pose discrepancies between the source and target images.
• We decompose the task of learning the overall appearance flow field into learning different local flow fields for different semantic body parts, which can ease the learning and capture the pose change of each part more precisely.
• We carefully design the modules in our network to capture the local and global semantic correlations of features within and among human parts respectively.
Related Work

Pose guided person image generation can be regarded as a typical image-to-image transformation problem (Isola et al. 2017; Zhu et al. 2017) where the goal is to convert a source person image into a target person image conditioned on two constraints: (1) preserving the person appearance in the source image and (2) deforming the person pose into the target one.

Ma et al. (Ma et al. 2017) proposed a two-stage generative network named PG to synthesize person images in a coarse-to-fine way. Ma et al. (Ma et al. 2018) further improved the performance of PG by disentangling the foreground, background, and pose with a multi-branch network. However, both methods require a complicated staged training process and have a large computation burden. Zhu et al. (Zhu et al. 2019) proposed a progressive transfer network to deform a source image into the target image through a series of intermediate representations to avoid capturing the complex global manifold directly. However, the useful appearance information inevitably degrades during the sequential feature transfers, which may lead to blurry results lacking vivid appearance details. Esser et al. (Esser, Sutter, and Ommer 2018) combined the VAE (Kingma and Welling 2013) and U-Net (Ronneberger, Fischer, and Brox 2015) to model the interaction between appearance and shape. However, the common skip connections of U-Net cannot reliably deal with the feature misalignments between the source and target pose. To tackle this issue, Siarohin et al. (Siarohin et al. 2018) further proposed deformable skip connections to transform the local textures according to the local affine transformations of certain sub-parts. However, the degrees of freedom are limited (i.e., 6 for affine), which may produce inaccurate and unnatural transformations when there are large pose changes.

Recently, a few flow-based methods have been proposed to take advantage of the appearance flow (Zhou et al. 2016; Ren et al. 2019) to transform the source image to align it with the target pose. Han et al. (Han et al. 2019) introduced a three-stage framework named ClothFlow to model the appearance flow between source and target clothing regions in a cascaded manner. However, they warp the source image at the pixel level instead of the feature level, which needs an extra refinement network to handle the invisible contents. Li et al. (Li, Huang, and Loy 2019) leveraged the 3D human model to predict the appearance flow, and warped both the encoded features and the raw pixels of the source image. However, they require fitting the 3D human model to all images to obtain the annotations of appearance flows before training, which is too expensive and limits its application. Ren et al. (Ren et al. 2020) designed a global-flow local-attention framework to generate the appearance flow in an unsupervised way and transform the source image at the feature level reasonably. However, this method directly takes the overall source and target pose as input to predict the appearance flow of the whole human body, which may be unable to handle large discrepancies between the source and target pose reliably. Besides, this method produces features at each target position independently and does not consider the semantic correlations among target features at different locations.
The Proposed Method

Figure 2 illustrates the overall framework of our network. It mainly consists of three modules: the part-based flow generation module, the local warping module, and the global fusion module. In the following sections, we give a detailed description of each module.

Figure 2: Overview of the proposed method. It mainly consists of three modules: the part-based flow generation module, the local warping module, and the global fusion module.
Part-based Flow Generation Module
We first introduce a few notations. Let $P_s \in \mathbb{R}^{18 \times h \times w}$ and $P_t \in \mathbb{R}^{18 \times h \times w}$ represent the overall pose of the source image $I_s \in \mathbb{R}^{3 \times h \times w}$ and target image $I_t \in \mathbb{R}^{3 \times h \times w}$ respectively, where the 18 channels of the pose correspond to the heatmaps that encode the spatial locations of 18 human joints. The joints are extracted with OpenPose (Cao et al. 2017). As shown in Figure 2, our part-based flow generation module first decomposes the overall pose into different sub-poses by grouping the human joints into different parts based on the inherent connection relationship among them. Then, different sub-models $G^{local}_{flow} = \{G^{head}_{flow}, G^{torso}_{flow}, G^{leg}_{flow}\}$ are deployed to generate the local appearance flow fields and visibility maps of the corresponding human parts respectively. Specifically, let $P^{local}_s = \{P^{head}_s, P^{torso}_s, P^{leg}_s\}$ and $P^{local}_t = \{P^{head}_t, P^{torso}_t, P^{leg}_t\}$ denote the decomposed source and target sub-poses, where each sub-pose corresponds to a subset of the 18 heatmaps of human joints. The sub-models $G^{local}_{flow}$ take as input $P^{local}_s$ and $P^{local}_t$, and output the local appearance flow fields $W^{local}$ and visibility maps $V^{local}$:

$$W^{local}, V^{local} = G^{local}_{flow}(P^{local}_s, P^{local}_t), \quad (1)$$

where $W^{local} = \{W^{head}, W^{torso}, W^{leg}\}$ records the 2D coordinate offsets between the source and target features of the corresponding parts, and $V^{local} = \{V^{head}, V^{torso}, V^{leg}\}$ stores confidence values between 0 and 1 representing whether the information of certain target positions exists in the source features.
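To make the decomposition concrete, here is a minimal PyTorch sketch of a part-based flow generation module in the spirit of Eq. (1). The joint grouping, network depth, and channel widths below are illustrative assumptions, not the configuration used in the paper.

```python
import torch
import torch.nn as nn

# Hypothetical grouping of the 18 OpenPose joints into three body parts.
# The exact indices used by the authors are not given here; these are
# illustrative assumptions only.
PART_JOINTS = {
    "head":  [0, 1, 14, 15, 16, 17],   # nose, neck, eyes, ears
    "torso": [1, 2, 3, 4, 5, 6, 7],    # neck, shoulders, elbows, wrists
    "leg":   [8, 9, 10, 11, 12, 13],   # hips, knees, ankles
}

class PartFlowEstimator(nn.Module):
    """Predicts a 2-channel flow field and a 1-channel visibility map
    for one body part from its source and target sub-pose heatmaps."""
    def __init__(self, num_joints, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2 * num_joints, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, 3, 3, padding=1),  # 2 flow channels + 1 visibility channel
        )

    def forward(self, src_subpose, tgt_subpose):
        out = self.net(torch.cat([src_subpose, tgt_subpose], dim=1))
        flow = out[:, :2]                       # 2D coordinate offsets W
        visibility = torch.sigmoid(out[:, 2:])  # confidence in [0, 1], V
        return flow, visibility

class PartBasedFlowGeneration(nn.Module):
    """One independent flow estimator per semantic part, as in Eq. (1)."""
    def __init__(self):
        super().__init__()
        self.estimators = nn.ModuleDict({
            part: PartFlowEstimator(len(idx)) for part, idx in PART_JOINTS.items()
        })

    def forward(self, pose_s, pose_t):
        # pose_s, pose_t: (B, 18, h, w) joint heatmaps
        flows, vis = {}, {}
        for part, idx in PART_JOINTS.items():
            flows[part], vis[part] = self.estimators[part](pose_s[:, idx], pose_t[:, idx])
        return flows, vis
```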
Local Warping Module

The generated local appearance flow fields $W^{local}$ and visibility maps $V^{local}$ provide important guidance for understanding the spatial deformation of each part region between the source and target image, specifying which positions in the source features could be sampled to generate the corresponding target features. Therefore, our local warping module exploits this information to model the dense pixel-to-pixel correspondence between the source and target features. As shown in Figure 2, we first crop different part images from the source image, and encode them into the corresponding source part image features $F^{local}_s = \{F^{head}_s, F^{torso}_s, F^{leg}_s\}$. Then, under the guidance of the generated local appearance flow fields $W^{local}$, our local warping module warps $F^{local}_s$ to obtain the warped source features $F^{local}_{s,w} = \{F^{head}_{s,w}, F^{torso}_{s,w}, F^{leg}_{s,w}\}$ aligned with the target pose. Specifically, for each target position $p = (x, y)$ in the features $F^{local}_{s,w}$, a sampling position is allocated according to the coordinate offsets $\Delta p = (\Delta x, \Delta y)$ recorded in the flow fields $W^{local}$. The features at the target position are fetched from the corresponding sampling position in the source features by bilinear interpolation. Further details are available in our supplementary material. The procedure can be written as:

$$F^{local}_{s,w} = G_{warp}(F^{local}_s, W^{local}). \quad (2)$$

Considering that not all appearance information of the target image can be found in the source image due to the different visibilities of the source and target pose, we further take advantage of the generated local visibility maps $V^{local}$ to select the reasonable features between $F^{local}_{s,w}$ and the local target pose features $F^{local}_{pose} = \{F^{head}_{pose}, F^{torso}_{pose}, F^{leg}_{pose}\}$, which are encoded from the target sub-poses. The feature selection using visibility maps is defined as:

$$F^{local}_{s,w,v} = V^{local} \cdot F^{local}_{s,w} + (1 - V^{local}) \cdot F^{local}_{pose}, \quad (3)$$

where $F^{local}_{s,w,v} = \{F^{head}_{s,w,v}, F^{torso}_{s,w,v}, F^{leg}_{s,w,v}\}$ denotes the selected features for the different parts.

Figure 3: The local warping module. It warps the source features encoded from the corresponding part images to align them with the target pose while capturing the short-range semantic correlations of local neighbors within the parts.

At last, in order to perceive local semantic correlations inside human parts, as shown in Figure 3, we further introduce a hybrid dilated convolution block which is composed of sequential convolutional layers with different dilation rates to capture the short-range semantic correlations of local neighbors within parts by enlarging the receptive field of each position. Specifically, a dilated convolution with rate $r$ can be defined as:

$$y(m, n) = \sum_i \sum_j x(m + r \times i, n + r \times j)\, w(i, j), \quad (4)$$

where $y(m, n)$ is the output of the dilated convolution from input $x(m, n)$, and $w(i, j)$ is the filter weight. Let $G_{hdcb}$ represent the hybrid dilated convolution block. The final warped local image features of the different human parts $F^{local}_{warp} = \{F^{head}_{warp}, F^{torso}_{warp}, F^{leg}_{warp}\}$ can be obtained by:

$$F^{local}_{warp} = G_{hdcb}(F^{local}_{s,w,v}). \quad (5)$$
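The following PyTorch sketch illustrates the three operations of Eqs. (2)-(5): bilinear warping of source features guided by a flow field, visibility-based blending with the target pose features, and a hybrid dilated convolution block. The bilinear sampler is implemented with `F.grid_sample`; the dilation rates and block depth are assumptions for illustration rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def warp_by_flow(feat_s, flow):
    """Bilinearly sample source features at positions offset by the flow
    field (Eq. 2). feat_s: (B, C, H, W); flow: (B, 2, H, W) pixel offsets."""
    B, _, H, W = feat_s.shape
    # Base grid of target coordinates (x, y), one per spatial position.
    ys, xs = torch.meshgrid(torch.arange(H, device=feat_s.device),
                            torch.arange(W, device=feat_s.device), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().unsqueeze(0)   # (1, 2, H, W)
    # Add the predicted offsets and normalize to [-1, 1] for grid_sample.
    coords = grid + flow
    coords_x = 2.0 * coords[:, 0] / max(W - 1, 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / max(H - 1, 1) - 1.0
    norm_grid = torch.stack((coords_x, coords_y), dim=-1)       # (B, H, W, 2)
    return F.grid_sample(feat_s, norm_grid, mode="bilinear", align_corners=True)

def select_by_visibility(feat_warped, feat_pose, visibility):
    """Blend warped source features with target-pose features (Eq. 3)."""
    return visibility * feat_warped + (1.0 - visibility) * feat_pose

class HybridDilatedConvBlock(nn.Module):
    """Sequential convolutions with increasing dilation rates to enlarge the
    receptive field (Eqs. 4-5). The rates below are illustrative assumptions."""
    def __init__(self, channels, rates=(1, 2, 5)):
        super().__init__()
        layers = []
        for r in rates:
            layers += [nn.Conv2d(channels, channels, 3, padding=r, dilation=r),
                       nn.ReLU()]
        self.block = nn.Sequential(*layers)

    def forward(self, x):
        return self.block(x)
```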
Global Fusion Module

Let $F^{global}_{pose}$ denote the global target pose features encoded from the overall target pose $P_t$, which provide additional context as to where the different parts should be located in the target image. Concatenating the warped image features of the different parts $F^{local}_{warp}$ and the global target pose features $F^{global}_{pose}$ together as input, the global fusion module first aggregates these local part features into the preliminary global fusion features $F_{fusion}$:

$$F_{fusion} = G_{fusion}(F^{local}_{warp}, F^{global}_{pose}). \quad (6)$$

Due to the symmetry of the human body, there can also exist important semantic correlations between the features of different human parts separated by long distances. We therefore design a lightweight yet effective non-local component named pyramid non-local block, which incorporates multi-scale pyramid pooling with the standard non-local operation to capture such long-range semantic correlations across different human part regions at different scales. Specifically, as shown in Figure 4, given the preliminary global fusion features $F_{fusion}$, we first use multi-scale pyramid pooling to adaptively divide them into different part regions and select the most significant global representation for each region, producing hierarchical features with different sizes in parallel. Next, we apply the standard non-local operations on the pooled features at the different scales respectively to obtain the response at a target position by the weighted summation of features from all positions, where the weights are the pairwise relation values recorded in the generated relation maps (which are visualized in our experiments). Specifically, given the input features $x$, the relation maps $R$ are calculated by $R = \mathrm{softmax}(\theta(x)^{T}\phi(x))$, where $\theta(\cdot)$ and $\phi(\cdot)$ are two convolutional feature embeddings. Let $G_{pnb}$ denote the pyramid non-local block. The final global features $F_{global}$ are obtained via:

$$F_{global} = G_{pnb}(F_{fusion}). \quad (7)$$

Figure 4: The global fusion module. It aggregates the warped features of different parts into the global fusion features and captures the non-local semantic correlations among different human parts.

Finally, the target person image $\hat{I}_t$ is generated from the global features $F_{global}$ using a decoder network $Dec$ which contains a set of deconvolutional layers:

$$\hat{I}_t = Dec(F_{global}). \quad (8)$$
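Below is one possible PyTorch realization of the pyramid non-local block of Eq. (7): the fusion features are pooled to a few coarse resolutions, a standard non-local operation computes the relation map R = softmax(θ(x)^T φ(x)) at each scale, and the attended responses are merged back residually. The pool sizes, the use of max pooling for the per-region representation, and the channel reduction factor are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidNonLocalBlock(nn.Module):
    """Non-local attention applied to pyramid-pooled features (Eq. 7).
    Pool sizes and channel reduction are illustrative assumptions."""
    def __init__(self, channels, pool_sizes=(4, 8), reduction=2):
        super().__init__()
        inter = channels // reduction
        self.theta = nn.Conv2d(channels, inter, 1)
        self.phi = nn.Conv2d(channels, inter, 1)
        self.g = nn.Conv2d(channels, inter, 1)
        self.out = nn.Conv2d(inter, channels, 1)
        self.pool_sizes = pool_sizes

    def non_local(self, x):
        B, _, H, W = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)   # (B, HW, C')
        k = self.phi(x).flatten(2)                     # (B, C', HW)
        v = self.g(x).flatten(2).transpose(1, 2)       # (B, HW, C')
        relation = torch.softmax(q @ k, dim=-1)        # relation map R, (B, HW, HW)
        y = (relation @ v).transpose(1, 2).reshape(B, -1, H, W)
        return self.out(y)

    def forward(self, fusion):
        B, C, H, W = fusion.shape
        result = fusion
        for s in self.pool_sizes:
            # Max pooling as one reading of "select the most significant representation".
            pooled = F.adaptive_max_pool2d(fusion, s)
            attended = self.non_local(pooled)
            # Residual connection: upsample the attended response back to full size.
            result = result + F.interpolate(attended, size=(H, W),
                                            mode="bilinear", align_corners=False)
        return result
```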
Training

We train our model in two stages. First, without ground truth for the appearance flow fields and visibility maps, we train the part-based flow generation module in an unsupervised manner using the sampling correctness loss (Ren et al. 2019, 2020). Since our part-based flow generation module contains three sub-models corresponding to different parts, we train them together using the overall loss defined as:

$$\mathcal{L}_{sam} = \mathcal{L}^{head}_{sam} + \mathcal{L}^{torso}_{sam} + \mathcal{L}^{leg}_{sam}, \quad (9)$$

where $\mathcal{L}^{head}_{sam}$, $\mathcal{L}^{torso}_{sam}$, and $\mathcal{L}^{leg}_{sam}$ denote the sampling correctness loss for each part respectively. The sampling correctness loss constrains the appearance flow fields to sample positions with similar semantics by measuring the similarity between the warped source features and the ground truth target features. Refer to the supplementary material for details.
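As a rough illustration of what Eq. (9) optimizes, the snippet below computes a simplified cosine-similarity objective between warped source features and ground-truth target features for a single part. The actual sampling correctness loss of Ren et al. (2019, 2020) uses a relative similarity formulation detailed in the supplementary material, which is not reproduced here.

```python
import torch
import torch.nn.functional as F

def sampling_correctness_loss(warped_feat, target_feat, eps=1e-8):
    """Simplified stand-in for one term of Eq. (9): encourage the warped
    source features to match the ground-truth target features at every
    position via cosine similarity. Inputs: (B, C, H, W) feature maps."""
    cos = F.cosine_similarity(warped_feat, target_feat, dim=1, eps=eps)  # (B, H, W)
    return torch.mean(1.0 - cos)
```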
Then, with the pre-trained part-based flow generation module, we train our whole model in an end-to-end way. The full loss function is defined as:

$$\mathcal{L} = \lambda_{1}\mathcal{L}_{sam} + \lambda_{2}\mathcal{L}_{rec} + \lambda_{3}\mathcal{L}_{adv} + \lambda_{4}\mathcal{L}_{per} + \lambda_{5}\mathcal{L}_{sty}, \quad (10)$$

where $\mathcal{L}_{rec}$ denotes the reconstruction loss, which is formulated as the L1 distance between the generated target image $\hat{I}_t$ and the ground truth target image $I_t$,

$$\mathcal{L}_{rec} = \| I_t - \hat{I}_t \|_1. \quad (11)$$

$\mathcal{L}_{adv}$ represents the adversarial loss (Goodfellow et al. 2014), which uses the discriminator $D$ to promote the generator $G$ to synthesize the target image with sharp details,

$$\mathcal{L}_{adv} = \mathbb{E}[\log(1 - D(G(I_s, P_s, P_t)))] + \mathbb{E}[\log D(I_t)]. \quad (12)$$

$\mathcal{L}_{per}$ denotes the perceptual loss (Johnson, Alahi, and Fei-Fei 2016), formulated as the L1 distance between features extracted from specific layers of a pre-trained VGG network,

$$\mathcal{L}_{per} = \sum_i \| \phi_i(I_t) - \phi_i(\hat{I}_t) \|_1, \quad (13)$$

where $\phi_i$ is the feature map of the $i$-th layer of the VGG network pre-trained on ImageNet (Russakovsky et al. 2015). $\mathcal{L}_{sty}$ denotes the style loss (Johnson, Alahi, and Fei-Fei 2016), which uses the Gram matrix of features to calculate the style similarity between the images,

$$\mathcal{L}_{sty} = \sum_j \| G^{\phi}_j(I_t) - G^{\phi}_j(\hat{I}_t) \|_1, \quad (14)$$

where $G^{\phi}_j$ is the Gram matrix constructed from features $\phi_j$.
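For reference, the sketch below assembles the reconstruction, perceptual, and style terms of Eqs. (11), (13), and (14) in PyTorch (the adversarial and sampling correctness terms are omitted for brevity). The chosen VGG layers and loss weights are illustrative assumptions and not the values used in the paper.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class GenerationLosses(nn.Module):
    """Reconstruction, perceptual and style terms of Eq. (10). The VGG layer
    indices and loss weights below are illustrative assumptions."""
    def __init__(self, vgg_layers=(3, 8, 17, 26), w_rec=1.0, w_per=1.0, w_sty=100.0):
        super().__init__()
        vgg = models.vgg19(pretrained=True).features.eval()
        for p in vgg.parameters():
            p.requires_grad = False
        self.vgg = vgg
        self.vgg_layers = set(vgg_layers)
        self.w_rec, self.w_per, self.w_sty = w_rec, w_per, w_sty

    def vgg_feats(self, x):
        feats = []
        for i, layer in enumerate(self.vgg):
            x = layer(x)
            if i in self.vgg_layers:
                feats.append(x)
        return feats

    @staticmethod
    def gram(f):
        # Gram matrix of a (B, C, H, W) feature map, normalized by its size.
        b, c, h, w = f.shape
        f = f.reshape(b, c, h * w)
        return f @ f.transpose(1, 2) / (c * h * w)

    def forward(self, fake, real):
        l_rec = torch.mean(torch.abs(real - fake))                                   # Eq. (11)
        f_fake, f_real = self.vgg_feats(fake), self.vgg_feats(real)
        l_per = sum(torch.mean(torch.abs(a - b)) for a, b in zip(f_real, f_fake))    # Eq. (13)
        l_sty = sum(torch.mean(torch.abs(self.gram(a) - self.gram(b)))
                    for a, b in zip(f_real, f_fake))                                 # Eq. (14)
        return self.w_rec * l_rec + self.w_per * l_per + self.w_sty * l_sty
```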
Implementation Details.

Our model is implemented in the PyTorch framework using one NVIDIA GTX 1080Ti GPU with 11GB memory. We adopt the Adam optimizer (Kingma and Ba 2014) to train our model, and the learning rate is fixed to 0.001 in all experiments. For the Market-1501 dataset (Zheng et al. 2015), the batch size is set to 8. For the DeepFashion dataset (Liu et al. 2016), the batch size is 6.
Experiment

In this section, we perform extensive experiments to demonstrate the superiority of the proposed method over state-of-the-art methods. Furthermore, we conduct an ablation study to verify the contribution of each component in our model.
Datasets.
We conduct our experiments on the ReID dataset Market-1501 (Zheng et al. 2015) and the In-shop Clothes Retrieval Benchmark DeepFashion (Liu et al. 2016). The Market-1501 dataset contains 32,668 low-resolution images (128 × 64), which vary enormously in pose, background, and illumination. Meanwhile, the DeepFashion dataset contains 52,712 person images (256 × 256) with various appearances and poses. For a fair comparison, we split the two datasets following the same setting as in (Ren et al. 2020). Consequently, we pick 263,632 training pairs and 12,000 testing pairs for the Market-1501 dataset. For the DeepFashion dataset, we randomly select 101,966 pairs for training and 8,570 pairs for testing.
Metrics.

It remains an open problem to evaluate the quality of generated images reasonably. Following previous works (Siarohin et al. 2018; Zhu et al. 2019; Ren et al. 2020), we use common metrics such as Learned Perceptual Image Patch Similarity (LPIPS) (Zhang et al. 2018), Fréchet Inception Distance (FID) (Heusel et al. 2017), Structural Similarity (SSIM) (Wang et al. 2004), and Peak Signal-to-Noise Ratio (PSNR) to assess the quality of generated images quantitatively. Specifically, LPIPS and FID calculate the perceptual distance between the generated images and ground truth images in feature space, w.r.t. each pair of samples and the global distribution respectively. Meanwhile, SSIM and PSNR indicate the similarity between paired images in raw pixel space. For the Market-1501 dataset, we further calculate the masked versions of these metrics to exclude the interference of the backgrounds. Furthermore, considering that these quantitative metrics may not fully reflect the image quality (Ma et al. 2017), we perform a user study to qualitatively evaluate the quality of generated images.
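For readers who want to reproduce the pairwise metrics, the snippet below computes SSIM, PSNR, and LPIPS for one generated/ground-truth pair using common third-party packages (scikit-image and the `lpips` package); it is not the authors' evaluation script. FID is computed over whole image sets rather than per pair, e.g., with a dedicated tool, and is therefore not shown.

```python
import torch
import lpips
from skimage.metrics import structural_similarity, peak_signal_noise_ratio

# Perceptual metric from the `lpips` package (Zhang et al. 2018), AlexNet backbone.
lpips_fn = lpips.LPIPS(net="alex")

def evaluate_pair(generated, target):
    """generated, target: HxWx3 uint8 numpy arrays of the same size."""
    ssim = structural_similarity(target, generated, channel_axis=-1)
    psnr = peak_signal_noise_ratio(target, generated)
    # LPIPS expects NCHW tensors scaled to [-1, 1].
    to_tensor = lambda im: torch.from_numpy(im).permute(2, 0, 1)[None].float() / 127.5 - 1.0
    lpips_val = lpips_fn(to_tensor(generated), to_tensor(target)).item()
    return {"SSIM": ssim, "PSNR": psnr, "LPIPS": lpips_val}
```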
Quantitative Comparison.

As shown in Table 1, we compare our method with four state-of-the-art methods including VU-Net (Esser, Sutter, and Ommer 2018), Def-GAN (Siarohin et al. 2018), PATN (Zhu et al. 2019), and DIST (Ren et al. 2020) on the Market-1501 and DeepFashion datasets. Specifically, we download the pre-trained models of the state-of-the-art methods and evaluate their performance on the testing set directly. As we can see, our method outperforms the state-of-the-art methods in most metrics on both datasets, demonstrating the superiority of our model in generating high-quality person images.

Market-1501:
Model     FID↓     LPIPS↓   Mask-LPIPS↓   SSIM↑   Mask-SSIM↑   PSNR↑    Mask-PSNR↑
VU-Net    24.386   0.3211   0.1747        0.242   0.801        13.664   19.102
Def-GAN   29.035   0.2994   0.1496        0.276   0.793        14.391   20.425

DeepFashion:
Model     FID↓     LPIPS↓   SSIM↑   PSNR↑
VU-Net    13.836   0.2637   0.745   16.255
Def-GAN   26.283   0.2330

Table 1: Quantitative comparison with state-of-the-art methods on the Market-1501 and DeepFashion datasets. The first and second best results are bolded and underlined respectively.

Figure 5: Qualitative comparison with state-of-the-art methods on the DeepFashion (left) and Market-1501 (right) datasets.
Qualitative Comparison.
Figure 5 shows the qualitative comparison of different methods on the two datasets. All the results of the state-of-the-art methods are obtained by directly running the pre-trained models released by their authors. As we can see, for the challenging cases with large pose discrepancies (e.g., the first two rows on the left of Figure 5), the existing methods may produce results with heavy artifacts and appearance distortion. In contrast, for the DeepFashion dataset (Liu et al. 2016), our model can generate realistic images in arbitrary target poses, which not only reconstruct reasonable and consistent global appearances, but also preserve vivid local details such as the textures of clothes and hats. In particular, our model is able to produce more suitable appearance contents for target regions which are invisible in the source image, such as the legs and the backs of clothes (see the last three rows). For the Market-1501 dataset (Zheng et al. 2015), our model yields natural-looking images with sharp appearance details, whereas artifacts and blurs can be observed in the results of the other state-of-the-art methods. More results can be found in the supplementary material.
User Study.
We perform a user study to judge the realness and preference of the images generated by different methods. For realness, we recruit 30 participants to judge whether a given image is real or fake within a second. Following the setting of previous works (Ma et al. 2017; Siarohin et al. 2018; Zhu et al. 2019), for each method, 55 real images and 55 generated images are selected and shuffled randomly. Specifically, the first 10 images are used to warm up and the remaining 100 images are used for evaluation. For preference, in each group of comparison, a source image, a target pose, and 5 result images generated by different methods are displayed to the participants, and the participants are asked to pick the most reasonable one w.r.t. both the source appearance and the target pose. We enlist 30 participants to take part in the evaluation and each participant is asked to finish 30 groups of comparisons for each dataset. As shown in Table 2, our method outperforms the state-of-the-art methods in all subjective measurements on the two datasets, especially for the DeepFashion dataset (Liu et al. 2016) with higher resolution, verifying that the images generated by our model are more realistic and faithful.
           Market-1501          DeepFashion
Model      G2R↑      Prefer↑    G2R↑      Prefer↑
VU-Net     –         –          11.44     1.00
Def-GAN    41.03     10.00      5.23      1.44
PATN       38.03     14.00      10.93     2.22
DIST       47.37     23.11      38.30     28.89
Ours

Table 2: User study (%). G2R means the percentage of generated images rated as real w.r.t. all generated images. Prefer denotes the user preference for the most realistic result among different methods.
Ablation Study
We further perform the ablation study to analyze the contribution of each technical component in our method. We first introduce the variants, implemented by alternatively removing a corresponding component from our full model.

w/o the part-based decomposition (w/o Part). This model removes the part-based decomposition in our flow generation module, and directly estimates the whole flow field of the human body to warp the global source image features.

w/o the hybrid dilated convolution block (w/o HDCB). This model removes the hybrid dilated convolution block in our local warping module, and directly uses the selected part features to conduct the subsequent feature fusion.

w/o the pyramid non-local block (w/o PNB). This model removes the pyramid non-local block in our global fusion module, and simply takes the preliminary global fusion features as input to generate the final target images.

Full. This represents our full model.

Table 3 shows the quantitative results of the ablation study on the DeepFashion dataset (Liu et al. 2016). We can see that our full model achieves the best performance on all evaluation metrics except SSIM, and the removal of any component degrades the performance of the model.
           DeepFashion
Model      FID↓     LPIPS↓   SSIM↑   PSNR↑
w/o Part   13.736   0.2090   0.716   17.420
w/o PNB    9.302    0.1832   0.728   17.945
w/o HDCB   9.326    0.1829

Table 3: The quantitative results of the ablation study on the DeepFashion dataset. The best results are bolded.

The qualitative comparison of the different ablation models is demonstrated in Figure 6. We can see that, although the models w/o Part, w/o PNB, and w/o HDCB can generate target images with correct poses, they cannot preserve the human appearances in the source images very well. Specifically, there is heavy appearance distortion in the results produced by the model w/o Part, because of the difficulty of directly learning the overall flow fields of the human body under large pose discrepancies. The results generated by the model w/o PNB often suffer from inconsistency in the global human appearance, since it does not explicitly consider the long-range semantic correlations across different human parts. Besides, the images produced by the model w/o HDCB may lose some local appearance details because it cannot fully capture the short-range semantic correlations of local neighbors within a certain part. In contrast, our full model reconstructs the most realistic images, which not only possess consistent global appearance but also maintain vivid local details.

Figure 6: The qualitative comparison of ablation study.
Visualization of The Relation Map
To illustrate the effectiveness of our pyramid non-local block in capturing the global semantic correlations among different human parts, in Figure 7 we visualize the generated relation map, which represents the relation values of all patches w.r.t. a certain target patch. As we can see, for a target patch in a certain image region (e.g., shirt, pants, background), the patches with similar semantics usually have larger relation values w.r.t. this target patch, indicating that our pyramid non-local block can capture the non-local semantic correlations among different part regions effectively.

Figure 7: Visualization of the relation map w.r.t. a certain target patch marked by a red rectangle in the image.
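The visualized relation map can be extracted directly from the non-local operation. The sketch below shows one way to pull the row of R = softmax(θ(x)^T φ(x)) corresponding to a chosen target patch and display it as a spatial heatmap; the tensor shapes and plotting choices are illustrative assumptions.

```python
import torch
import matplotlib.pyplot as plt

def relation_map_for_patch(theta_x, phi_x, target_index, map_size):
    """Illustrative extraction of the relation values of all patches w.r.t.
    one target patch. theta_x, phi_x: embedded features of a pooled feature
    map, both of shape (1, C, S, S); map_size is the pooled spatial size S;
    target_index selects one of the S*S patches."""
    q = theta_x.flatten(2).transpose(1, 2)      # (1, S*S, C)
    k = phi_x.flatten(2)                        # (1, C, S*S)
    relation = torch.softmax(q @ k, dim=-1)     # (1, S*S, S*S), the relation map R
    row = relation[0, target_index].reshape(map_size, map_size)
    plt.imshow(row.detach().cpu().numpy(), cmap="viridis")
    plt.title("Relation values w.r.t. the selected target patch")
    plt.show()
```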
Person Image Generation in Random Poses

As shown in Figure 8, given the same source person image and a set of target poses selected randomly from the testing set, our model is able to generate target images with both vivid appearances and correct poses, demonstrating the versatility of our model.

Figure 8: The results of generated person images in random target poses on the DeepFashion dataset.
Conclusion
We present a structure-aware appearance flow based approach to generate realistic person images conditioned on source appearances and target poses. We decompose the task of learning the overall appearance flow field into learning different local flow fields for different human body parts, which simplifies the learning and models the pose change of each part more precisely. Besides, we carefully design different modules within our framework to capture the local and global semantic correlations of features inside and across human parts respectively. Both qualitative and quantitative results demonstrate the superiority of our proposed method over state-of-the-art methods. Moreover, the results of the ablation study and visualization verify the effectiveness of our designed modules.
Acknowledgments
This work is supported by the National Key R&D Program of China (2018YFB1004300).

References
Cao, Z.; Simon, T.; Wei, S.-E.; and Sheikh, Y. 2017. Realtime multi-person 2D pose estimation using part affinity fields. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 7291–7299.
Chen, L.-C.; Papandreou, G.; Schroff, F.; and Adam, H. 2017. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587.
Esser, P.; Sutter, E.; and Ommer, B. 2018. A variational U-Net for conditional appearance and shape generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 8857–8866.
Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2014. Generative adversarial nets. In Advances in Neural Information Processing Systems, 2672–2680.
Han, X.; Hu, X.; Huang, W.; and Scott, M. R. 2019. ClothFlow: A flow-based model for clothed person generation. In Proceedings of the IEEE International Conference on Computer Vision, 10471–10480.
He, K.; Zhang, X.; Ren, S.; and Sun, J. 2015. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Heusel, M.; Ramsauer, H.; Unterthiner, T.; Nessler, B.; and Hochreiter, S. 2017. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, 6626–6637.
Isola, P.; Zhu, J.-Y.; Zhou, T.; and Efros, A. A. 2017. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1125–1134.
Johnson, J.; Alahi, A.; and Fei-Fei, L. 2016. Perceptual losses for real-time style transfer and super-resolution. In European Conference on Computer Vision, 694–711. Springer.
Kim, S.-W.; Kook, H.-K.; Sun, J.-Y.; Kang, M.-C.; and Ko, S.-J. 2018. Parallel feature pyramid network for object detection. In Proceedings of the European Conference on Computer Vision (ECCV), 234–250.
Kingma, D. P.; and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Kingma, D. P.; and Welling, M. 2013. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.
Li, Y.; Huang, C.; and Loy, C. C. 2019. Dense intrinsic appearance flow for human pose transfer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3693–3702.
Li, Y.; Zhang, X.; and Chen, D. 2018. CSRNet: Dilated convolutional neural networks for understanding the highly congested scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1091–1100.
Liu, W.; Piao, Z.; Min, J.; Luo, W.; Ma, L.; and Gao, S. 2019. Liquid warping GAN: A unified framework for human motion imitation, appearance transfer and novel view synthesis. In Proceedings of the IEEE International Conference on Computer Vision, 5904–5913.
Liu, Z.; Luo, P.; Qiu, S.; Wang, X.; and Tang, X. 2016. DeepFashion: Powering robust clothes recognition and retrieval with rich annotations. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1096–1104.
Ma, L.; Jia, X.; Sun, Q.; Schiele, B.; Tuytelaars, T.; and Van Gool, L. 2017. Pose guided person image generation. In Advances in Neural Information Processing Systems, 406–416.
Ma, L.; Sun, Q.; Georgoulis, S.; Van Gool, L.; Schiele, B.; and Fritz, M. 2018. Disentangled person image generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 99–108.
Men, Y.; Mao, Y.; Jiang, Y.; Ma, W.-Y.; and Lian, Z. 2020. Controllable person image synthesis with attribute-decomposed GAN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 5084–5093.
Ren, Y.; Yu, X.; Chen, J.; Li, T. H.; and Li, G. 2020. Deep image spatial transformation for person image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 7690–7699.
Ren, Y.; Yu, X.; Zhang, R.; Li, T. H.; Liu, S.; and Li, G. 2019. StructureFlow: Image inpainting via structure-aware appearance flow. In Proceedings of the IEEE International Conference on Computer Vision, 181–190.
Ronneberger, O.; Fischer, P.; and Brox, T. 2015. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, 234–241. Springer.
Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. 2015. ImageNet large scale visual recognition challenge. International Journal of Computer Vision.
Siarohin, A.; Sangineto, E.; Lathuilière, S.; and Sebe, N. 2018. Deformable GANs for pose-based human image generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3408–3416.
Wang, X.; Girshick, R.; Gupta, A.; and He, K. 2018. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 7794–7803.
Wang, Z.; Bovik, A. C.; Sheikh, H. R.; and Simoncelli, E. P. 2004. Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing.
Yu, F.; and Koltun, V. 2015. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122.
Zhang, R.; Isola, P.; Efros, A. A.; Shechtman, E.; and Wang, O. 2018. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 586–595.
Zheng, L.; Shen, L.; Tian, L.; Wang, S.; Wang, J.; and Tian, Q. 2015. Scalable person re-identification: A benchmark. In Proceedings of the IEEE International Conference on Computer Vision, 1116–1124.
Zhou, T.; Tulsiani, S.; Sun, W.; Malik, J.; and Efros, A. A. 2016. View synthesis by appearance flow. In European Conference on Computer Vision, 286–301. Springer.
Zhu, J.-Y.; Park, T.; Isola, P.; and Efros, A. A. 2017. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, 2223–2232.
Zhu, Z.; Huang, T.; Shi, B.; Yu, M.; Wang, B.; and Bai, X. 2019. Progressive pose attention transfer for person image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.