Template-Free Try-on Image Synthesis via Semantic-guided Optimization
Chien-Lung Chou, Chieh-Yun Chen, Chia-Wei Hsieh, Hong-Han Shuai, Jiaying Liu, Wen-Huang Cheng
Abstract—The virtual try-on task is so attractive that it has drawn considerable attention in the field of computer vision. However, presenting three-dimensional (3D) physical characteristics (e.g., pleats and shadows) based on a 2D image is very challenging. Although there have been several previous studies on 2D-based virtual try-on work, most 1) required user-specified target poses that are not user-friendly and may not be the best for the target clothing, and 2) failed to address some problematic cases, including facial details, clothing wrinkles, and body occlusions. To address these two challenges, in this paper, we propose an innovative template-free try-on image synthesis (TF-TIS) network. TF-TIS first synthesizes the target pose according to the user-specified in-shop clothing. Afterward, given an in-shop clothing image, a user image, and a synthesized pose, we propose a novel model for synthesizing a human try-on image with the target clothing in the best fitting pose. Both the qualitative and quantitative experiments indicate that the proposed TF-TIS outperforms the state-of-the-art methods, especially for difficult cases.
Index Terms—Virtual try-on, image synthesis, pose transfer, semantic-guided learning, cross-modal learning
I. INTRODUCTION
SHOPPING in brick-and-mortar stores takes considerable time to purchase one satisfactory item of clothing because it usually requires entering a store to try on several clothing candidates. In contrast, shopping online is expected to be a much faster purchase journey because finding relevant products is facilitated by online search and recommendation technology. However, although the usage rate of online shopping is rapidly increasing, it is still overshadowed by brick-and-mortar stores because e-commerce platforms cannot provide sufficient information for consumers. Among the many promising approaches [1]–[8] that bridge the gap between online and offline shopping, virtual try-on is regarded as a key technology for the online fashion industry to burgeon.
To realize virtual try-on services, a recent line of studies has used clothing warping to transform the in-shop clothing and then paste the warped clothing onto user images [1], [9], which preserves the details of clothes, including patterns and decorative designs.
C.-L. Chou, C.-W. Hsieh, and H.-H. Shuai are with the Department of Electrical and Computer Engineering, National Chiao Tung University. E-mail: {chienlung.eed04, maggie1209.tem04, hhshuai}@nctu.edu.tw. C.-Y. Chen and W.-H. Cheng are with the Institute of Electronics, National Chiao Tung University. W.-H. Cheng is also with the Artificial Intelligence and Data Science Program, National Chung Hsing University. E-mail: {cychen.ee09g, whcheng}@nctu.edu.tw. Jiaying Liu is with the Wangxuan Institute of Computer Technology, Peking University. E-mail: [email protected].

Fig. 1. Examples of our template-free try-on image synthesis (TF-TIS) network, which takes only the in-shop clothing image and the user image as input, without a defined pose. Our goal is to generate a realistic try-on image according to the pose synthesized from the referential clothing image, which can reduce the cost of hiring photographers. No other work has achieved this.

Nonetheless, the quality of the results significantly decreases when an occlusion (e.g., the user's arm is in front of the chest, obscuring the garment in the source image) or a dramatic pose change (e.g., the limbs go from wide open to crossed) occurs. To solve these challenging cases, our previous work [10] introduced a semantic-guided method, which uses semantic parsing to learn the relationship between different poses. However, the clothing part of the virtual try-on results still has some artifacts (e.g., missing details, such as small buttons, and local inconsistency, such as distorted plaid), which are important for a try-on service. Moreover, although the state-of-the-art virtual try-on applications [11], [12] have demonstrated try-on results in arbitrary poses (including our previous work [10]), they require users to assign the target poses instead of directly recommending suitable poses based on the clothing style. Therefore, to create a convenient and practical virtual try-on service, a virtual try-on application that automatically synthesizes a suitable pose corresponding to the target clothing is desirable.
Based on the above observations, in this paper, we propose a novel virtual try-on network, namely, the template-free try-on image synthesis (TF-TIS) framework, for synthesizing high-quality try-on images with automatically synthesized poses. Fig. 1 illustrates examples of template-free virtual try-on. Given a source user image and an in-shop clothing, the goal is to first synthesize the target pose automatically, which is further leveraged to generate the try-on image. Fig. 2 presents the TF-TIS framework comprising four modules: 1) cloth2pose, which synthesizes a suitable pose from the in-shop clothing (Column 3 in Fig. 1), 2) the pose-guided parsing translator, which translates the source pose to a semantic segmentation according to the synthesized pose (Column 4 in Fig. 1), 3) segmentation region coloring, which renders the clothing and human information on the semantic segmentation (Column 5 in Fig. 1), and 4) salient region refinement, which polishes the important regions, such as faces and logos (the last column in Fig. 1).
Specifically, given an in-shop clothing for try-on, we first aim to synthesize a suitable corresponding pose represented as keypoints. One basic approach is to use the images of mannequins wearing the corresponding in-shop clothes as the target poses. However, some in-shop clothes may not have corresponding images.
Another approach is to cluster the in-shop clothes first and assign the most frequent pose in the cluster as the target pose. Nevertheless, such an approach highly depends on the clustering results, whereas rare or unseen clothes may not find appropriate poses. Therefore, we propose a novel cloth2pose network to directly learn the relationship between the in-shop garment and the target pose, which leverages the deep features from a pretrained model and then uses a regressor to fit the joint map (i.e., keypoints). The poses are specified by keypoints: each pose contains 18 keypoints, and each keypoint represents one human body joint. To the best of our knowledge, this is the first work to generate suitable poses for corresponding in-shop clothes.
Afterward, given a source user image and a synthesized pose, the goal is to synthesize a realistic try-on image. An intuitive method to tackle difficult cases of body occlusion or dramatic pose transfer is offering body parsing information to current try-on networks. However, this method is not compatible with existing try-on models because most previous try-on works have focused on directly warping the clothing item and pasting the warped clothing onto the users. Therefore, the pose-guided parsing translator is proposed by constructing a deep convolutional network that transforms a pose into a semantic segmentation form to guide the learning of the next stage. Semantic segmentation plays a critical role in solving difficult cases. For example, limb parsing provides information for solving dramatic pose changes, whereas limb and clothing parsing offer clues for addressing body occlusion issues.
Moreover, to present realistic try-on images to users, we color the transformed semantic segmentation with the appearance of the human and the clothes by using a conditional generative adversarial network (CGAN) in segmentation region coloring. Finally, salient region refinement focuses on the two salient regions for try-on services (i.e., face and clothing) and improves these regions with details to achieve better virtual try-on images. For clothing refinement, we construct a detail-retaining network, which adopts two encoders to extract the relatively important features and global and local discriminators to retain the consistency of images, especially clothing.
Our previous work is called FashionOn [10]. We have made several changes in this work, and the contributions are summarized as follows.
• We designed a new pose synthesis framework, which directly learns the relationship between in-shop clothes and try-on poses to synthesize a suitable try-on pose. The automatically synthesized poses facilitate a user-friendly platform without the extra effort of uploading a target pose and exhibit better virtual try-on results to attract customers. To the best of our knowledge, TF-TIS is the first virtual try-on network to provide a suitable pose for the corresponding clothing image.
• We redesigned the clothing refinement generator (Section III-D-2) to be composed of two distinct encoders, because the unsatisfying results were caused by sharing the same encoder parameters for the two input features of in-shop clothing and warped coarse clothing. One encoder encodes the warped coarse clothing, and we integrate it with the fine-grained features of the in-shop clothing extracted by the other encoder. In addition, we adopted a UNet-like architecture to avoid losing the warped coarse clothing information, such as shape and color.
• To enhance the consistency of the generated image, which improves picture quality, we proposed global and local discriminators for our ClothingGAN (Section III-D-2). With the local discriminator, the generator is forced to synthesize images with more natural details; with the global discriminator, the generator is forced to generate a realistic overall picture.
II. RELATED WORK
A. Virtual Try-on
Existing virtual try-on approaches can be roughly categorized into 3D-based methods (e.g., using 3D body shape) and 2D-based methods (e.g., using clothing warping). We first introduce these two approaches and then compare them with TF-TIS.
1) 3D-based Try-on:
To generate more realistic results, numerous approaches [13]–[15] have used users' 3D body shape measurements and 4D sequences (e.g., video) to offer more information. For example, with high-resolution videos, Pons-Moll et al. [13] first captured the geometry of clothing on a body to obtain a rough body mesh and then aligned the defined clothing templates to the garments of the input scans to generate more realistic and body-fitting clothes. Given the high cost of physics-based simulation to accurately drape a 3D garment on a 3D body, Gundogdu et al. [15] implemented 3D cloth draping using neural networks. Specifically, they used a PointNet-like model to derive the user information and encoded the garment meshes to obtain point-wise, patch-wise, and global features for the fitted garment estimation.
In summary, although 3D-based approaches can produce try-on videos, the collection of measurement data can be costly, requiring extensive manual labeling or expensive equipment. Therefore, many scholars have resorted to using rich 2D images, which can be easily found online, to achieve the virtual try-on task. Moreover, the proposed TF-TIS only requires a source image and an in-shop clothing to synthesize a try-on image with a suitable pose.
2) 2D-based Try-on:
To synthesize try-on images, it is necessary to transform the in-shop clothing to fit the user's pose. Therefore, spline-based approaches have been introduced to achieve this task. Among them, the thin plate spline (TPS) [16] has been widely adopted and predominates in the nonrigid transfer of images instead of direct generation using neural networks. For example, Han et al. [1] presented an image-based virtual try-on network (VITON) that warps in-shop clothes through TPS and cascades a refinement network to combine the warped-clothing details with the coarse-grained person image. However, some details are still missed by the refinement network [1]. To correct the deficiencies in [1], Wang et al. [9] constructed a two-stage framework, CP-VTON, combining the generated person with the warped clothes through a generated composition mask without adopting the refinement network. Moreover, Zheng et al. advanced CP-VTON [9] and proposed Virtually Trying on New Clothing with Arbitrary Poses (VTNCAP) [11] by adopting a bidirectional GAN and an attention mechanism, which take the place of the generated composition mask in CP-VTON, to focus more on the clothing region.
Nevertheless, these methods still neglect that the facial region is also an important factor in determining the quality of the virtual try-on result, and they cannot preserve detailed clothing information (e.g., pleats and shadows) that follows the human poses. In contrast, TF-TIS was developed as a semantic-segmentation-based method that avoids these issues. In addition, TF-TIS preserves the comprehensive details of the in-shop clothes (e.g., patterns and texture) and the realistic human appearance (e.g., hair color and facial features) in accordance with human poses and different body shapes. The visual comparison between TF-TIS and the other mentioned methods is illustrated in Fig. 3.
B. Pose Transfer
Research on human pose transfer [17]–[23] has trended recently because of its copious prospective applications. The process of human pose transfer comprises two stages: pose estimation and image generation. The first stage can be divided into two categories (i.e., keypoint estimation [24]–[26] and human semantic parsing [27]–[29]). For example, Hidalgo et al. [26] trained a single-stage network through multi-task learning to determine the keypoints of the whole body simultaneously. Gong et al. [28] used the part grouping network (PGN) to reformulate instance-level human parsing as two twinned sub-tasks that can be jointly learned and mutually refined.
For the second stage, with the advances of GANs, image generation has received considerable attention and has been widely adopted [18], [21], [23] to generate realistic images. Among the existing pose transfer research, most works constructed novel architectures and successfully transferred the pose of the given human image based on the human joint points. For instance, Ma et al. [17] separated this task into two stages: pose integration, which generates initial but blurry images, and image refinement, which refines images by training a refinement network in an adversarial way. Siarohin et al. [20] used geometric affine transformations to mitigate the misalignment problem between different poses. However, most of the previous works did not extend the application of pose transfer to virtual try-on. By infusing pose transfer, virtual try-on services provide consumers with more chances to see how they would look in new clothes from multiple viewpoints and induce them to buy clothes. Hence, by converting semantic segmentation, TF-TIS seamlessly integrates virtual try-on with pose transfer to generate multi-view try-on images for customers.
C. Cross-modal Learning
The association between different fields has been studied and exploited recently [30]–[37] (e.g., the cross-modal matching between audio and visual signals [30]–[32], and between image and text [33], [35], [36]). Castrejón et al. [38] used an identical network architecture with different weights as encoders to extract low-level features from different modalities (e.g., sketches and natural images) and then input them into a shared cross-modal representation network to learn the representation for scenes. Oh et al. [39] attempted to learn the voice–face correlation in a self-supervised manner (i.e., directly capturing the dominant facial traits of the person correlated with the input speech instead of synthesizing the face from attributes). Inspired by cross-modal learning, which can find hidden details in an inconspicuous part of the data or align the embedding from one domain to another, we use a similar concept to learn the correlation between clothes and human poses, so as to synthesize the virtual try-on image with a pose suitable for the corresponding clothing.
III. PROPOSED METHOD
As illustrated in Fig. 2, given an in-shop clothing image $C_t$ and a source user image $I_s$, the goal of TF-TIS is to generate the try-on image $I_g$ with an automatically synthesized suitable pose such that the personal appearance and the clothing texture are retained. To achieve this goal, we developed a four-stage framework in TF-TIS: (I) cloth2pose, which derives a suitable pose $P$ based on the in-shop clothing $C_t$ by exploiting the correlation between poses and clothes; (II) the pose-guided parsing translator, which transforms the body semantic segmentation $M_s$ into a new one, $M_g$, according to the derived pose; (III) segmentation region coloring, which takes $I_s$, $C_t$, and $M_g$ as input and synthesizes a coarse try-on image $I_g$ by rendering the personal appearance and clothing information into the segmentation regions; and (IV) salient region refinement, which refines the salient but blurry regions of the try-on result $I_g$ generated from the previous stage (i.e., FacialGAN refines facial regions and ClothingGAN refines clothing regions). The definition of each symbol is summarized in Table I.
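To make the data flow between the four stages concrete, the following is a minimal inference-time sketch, assuming each stage has been wrapped as a callable module; the function and argument names are illustrative and are not taken from a released implementation.

```python
# Illustrative composition of the four TF-TIS stages at inference time.
# Module names and signatures are assumptions for exposition only.
import torch

def tf_tis_inference(cloth2pose, parsing_translator, region_coloring,
                     facial_gan, clothing_gan, C_t, I_s, M_s_prime, M_c):
    """C_t: in-shop clothing image, I_s: source user image,
    M_s_prime: source parsing without the clothing channel,
    M_c: in-shop clothing mask."""
    # Stage I: synthesize a suitable pose (keypoint tensor) from the clothing.
    P = cloth2pose(C_t)
    # Stage II: translate the source parsing to the target parsing M_g.
    M_g = parsing_translator(torch.cat([M_s_prime, P, M_c], dim=1))
    # Stage III: render person appearance and clothing texture into M_g.
    I_g = region_coloring(C_t, I_s, M_g)
    # Stage IV: refine the two salient regions.
    d = facial_gan(I_g, I_s, M_g)       # residual facial details
    I_g = I_g + d
    C_r = clothing_gan(C_t, I_g, M_g)   # refined clothing region
    return I_g, C_r, P, M_g
```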
Fig. 2. Training overview. Stage I (cloth2pose) exploits the correlation of clothes and poses and synthesizes a pose from the in-shop clothes via sequential convolution blocks. Stage II (pose-guided parsing translator) transfers the human semantic segmentation to $M_g$ according to $M'_s$, $M_c$, and $P$. Stage III (segmentation region coloring) fills clothing information and the user's appearance into the segmentation to synthesize a realistic try-on image $I_g$. Stage IV (salient region refinement) consists of two parts: FacialGAN and ClothingGAN. FacialGAN generates high-frequency details as a residual output that is directly added to the facial region of $I_g$. ClothingGAN extracts the fine information from the in-shop clothing image and uses the features to recover the details of the refined clothing $C_r$.

A. Cloth2pose
A virtual try-on service usually requires three inputs [9], [11], [12]: 1) a user image, 2) an in-shop clothing image, and 3) a target pose. One potential improvement is to automatically generate the target pose according to the in-shop clothing, because it reduces the users' effort. Moreover, a suitable pose better demonstrates the in-shop clothing, which may stimulate consumption; for example, a plain T-shirt shown in a sideways pose can better show the muscle lines. To synthesize the target pose directly from in-shop clothing, cloth2pose uses pairs of in-shop clothes and mannequin photos from online shopping sites for training. Specifically, cloth2pose first derives the keypoints of the mannequin photos with existing models, e.g., [24], [40], [41]. In our experiments, we use the OpenPose model [24], a 2D pose estimation model pre-trained on large-scale human pose datasets (COCO [42] and MPII [43]); the following keypoints are used: nose, eyes, ears, neck, shoulders, elbows, wrists, hips, knees, and ankles. Let $x_k$ denote the 2D position of the $k$-th keypoint on the image $I_t$. Because it is difficult to regress the clothing features to a single point, we convert the keypoint position $x_k$ into a pose map $P_k$ by applying a 2D Gaussian distribution to each keypoint. The value at position $p \in \mathbb{R}^2$ in $P_k$ is defined as follows:
$$P_k(p) = \exp\left(-\frac{\|p - x_k\|_2^2}{\sigma^2}\right), \quad (1)$$
where $\sigma$ determines the spread of the peak. After constructing the 2D keypoint map for each keypoint, we stack all the 2D keypoint maps together as a keypoint tensor, denoted as $P$.
Afterward, cloth2pose extracts the features of the in-shop clothes by using the first 10 layers of VGG-19 [44], denoted as $\phi$. Let $C_t$ denote the image of the in-shop clothing. The clothing feature map $F$ is obtained as $\phi(C_t)$. Here, cloth2pose exploits a progressive refinement architecture, as illustrated in Fig. 2. Specifically, at the first block, the network produces a set of keypoint information only from the clothing feature map: $P_1 = \phi_1(F)$, where $\phi_1$ refers to the first convolutional block. For the succeeding convolutional blocks, we employ five convolutional layers with a larger kernel and two with a smaller kernel to generate the keypoint tensor, and each layer is followed by a ReLU. Each convolutional block takes the concatenation of $F$ and the prediction from the previous block as input to predict the refined keypoint tensor:
$$P_i = \phi_i(F, P_{i-1}), \quad \forall\, 2 \le i \le N_{cp}, \quad (2)$$
where $\phi_i$ represents the $i$-th convolutional block and $N_{cp}$ is the total number of convolutional blocks in cloth2pose.
An intuitive choice for the loss function is the L2 distance between the keypoint tensor extracted from the pose estimation model ($P$) and that estimated by cloth2pose ($P_{N_{cp}}$), i.e., $\|P_{N_{cp}} - P\|_2^2$. However, only using the L2 loss is likely to generate many responses at various locations for one joint. Given this condition, we employ a sparsity constraint to limit the number of candidates: if the model predicts several candidates for one keypoint, the non-joint area is penalized less by the L2 loss than by the L1 loss. The final loss is as follows:
$$\mathcal{L}_{cp} = \sum_{i=1}^{N_{cp}} \|P_i - P\|_2^2 + \lambda \|P_{N_{cp}}\|_1, \quad (3)$$
where $\lambda$ is the hyperparameter striking a balance between multiple candidates and keypoint vanishing. If the value of $\lambda$ is too high, the output $P_{N_{cp}}$ contains no candidates; conversely, if it is too low, the sparsity constraint becomes ineffective and the output still has more than one candidate.
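As a concrete illustration of Eqs. (1) and (3), the following PyTorch sketch builds the Gaussian keypoint maps and evaluates the cloth2pose loss; the tensor layout, the value of $\sigma$, and the weight $\lambda$ are the caller's choices rather than the paper's exact settings.

```python
# Sketch of Eq. (1) (Gaussian keypoint maps) and Eq. (3) (cloth2pose loss).
import torch

def keypoints_to_heatmaps(keypoints, height, width, sigma=6.0):
    """keypoints: (K, 2) tensor of (x, y) joint positions.
    Returns a (K, H, W) tensor with one Gaussian peak per joint (Eq. 1).
    sigma is an assumed value controlling the spread of each peak."""
    ys = torch.arange(height).view(1, height, 1).float()
    xs = torch.arange(width).view(1, 1, width).float()
    x_k = keypoints[:, 0].view(-1, 1, 1)
    y_k = keypoints[:, 1].view(-1, 1, 1)
    dist2 = (xs - x_k) ** 2 + (ys - y_k) ** 2
    return torch.exp(-dist2 / (sigma ** 2))

def cloth2pose_loss(predictions, P, lam):
    """predictions: list of keypoint tensors P_i from the refinement blocks.
    P: ground-truth keypoint tensor from the pose estimator.
    Squared L2 to the target at every block plus an L1 sparsity term on the
    final prediction (Eq. 3); lam is the balancing hyperparameter."""
    loss = sum(((P_i - P) ** 2).sum() for P_i in predictions)
    return loss + lam * predictions[-1].abs().sum()
```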
TABLE I
NOTATION TABLE
Symbol — Definition
$C_t$ — in-shop clothing image
$M_c$ — in-shop clothing mask
$C_d$ — detailed clothing representation
$C_w$ — warped-clothing representation
$C_r$ — refined clothing
$P$ — keypoint tensor (suitable pose)
$F$ — in-shop clothing feature map
$I_s$ — source user image
$I'_s$ — source user image without clothes
$I_t$ — target user image
$I_g$ — generated try-on image
$M_s$ — source body semantic segmentation
$M'_s$ — source body semantic segmentation without the clothing part
$M_g$ — generated body semantic segmentation
$M_t$ — target body semantic segmentation
$M^{fg}_i$ — foreground channels of $M_i$, $i \in \{s, g, t\}$
$M^{face}_i$ — facial channels of $M_i$, $i \in \{s, g, t\}$
$M^{clothing}_i$ — clothing channels of $M_i$, $i \in \{s, g, t\}$
$I^{face}_i$ — facial part of $I_i$, $i \in \{s, g, t\}$
$I^{clothing}_i$ — clothing part of $I_i$, $i \in \{s, g, t\}$
$d$ — high-frequency residual face details
$N_{cp}$ — number of convolution blocks in cloth2pose
$\otimes$ — pixel-wise multiplication
B. Pose-guided Parsing Translator

By showing the corresponding area of each body part explicitly, human body segmentation can be employed to synthesize realistic human images. Accordingly, the goal of the pose-guided parsing translator is to translate the source body semantic segmentation $M_s$ into the target body semantic segmentation $M_t$ according to the target pose $P$. We first use the PGN [28], which is pretrained on the Crowd Instance-level Human Parsing dataset, to produce semantic parsing labels. The labels contain 20 categories, including left hand, top clothes, and face. Afterward, to precisely map each item to its new position according to the pose $P$, we use one-hot encoding to constitute a 20-channel tensor $M \in \mathbb{R}^{20 \times W \times H}$, where each channel is a binary mask representing one category. Because the clothing channel of $M_s$ is unnecessary, we replace it with the original in-shop clothing mask $M_c$. This replacement supplies the in-shop clothing shape needed to realize the virtual try-on service.
Adapted from pix2pix [45], the pose-guided parsing translator consists of two downsampling layers, nine residual blocks, and two upsampling layers. Each residual block is composed of convolutional layers and a highway connection that concatenates the input and the output of the corresponding block. The objective of the translator $G_t$ adopts a CGAN as follows:
$$\mathcal{L}^{G_t}_{GAN}(G_t, D_t) = \mathbb{E}_{M_{in}, M_t}[\log D_t(M_{in}, M_t)] + \mathbb{E}_{M_{in}}[\log(1 - D_t(M_{in}, G_t(M_{in})))], \quad (4)$$
where $G_t$ minimizes the objective against $D_t$, which maximizes it (i.e., $\arg\min_{G_t}\max_{D_t}\mathcal{L}^{G_t}_{GAN}(G_t, D_t)$), and $M_{in}$ represents the concatenation of $M'_s$, $P$, and $M_c$.
To accurately classify each pixel into the corresponding channel, we integrate a pixel-wise binary cross-entropy loss for $G_t$, denoted as $\mathcal{L}^{G_t}_{BCE}$, with our CGAN objective, while the discriminator stays the same:
$$\mathcal{L}^{G_t}_{BCE}(G_t) = -\sum_{n_c}\Big[ M_t \log\big(G_t(M_{in})\big) + (1 - M_t)\log\big(1 - G_t(M_{in})\big) \Big], \quad (5)$$
where $n_c$ denotes the total number of channels of the human parsing masks. In summary, the objective of the pose-guided parsing translator is derived as follows:
$$\arg\min_{G_t}\max_{D_t}\ \mathcal{L}^{G_t}_{GAN}(G_t, D_t) + \lambda_{bce}\,\mathcal{L}^{G_t}_{BCE}(G_t). \quad (6)$$
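A minimal sketch of the generator side of Eqs. (4)–(6), written in the practical BCE form; it assumes the discriminator ends with a sigmoid and that the predicted parsing is already in [0, 1], and lambda_bce is a hyperparameter.

```python
# Sketch of the translator objective: adversarial term plus per-pixel BCE
# over the parsing channels (Eqs. 4-6). Network definitions are omitted.
import torch
import torch.nn.functional as F

def translator_g_loss(D_t, M_in, M_t, M_g, lambda_bce):
    """M_in: concatenated inputs (source parsing w/o clothes, pose, cloth mask).
    M_t: target one-hot parsing, M_g = G_t(M_in): predicted parsing."""
    # The generator wants D_t to label the translated parsing as real.
    d_fake = D_t(M_in, M_g)
    adv = F.binary_cross_entropy(d_fake, torch.ones_like(d_fake))
    # Pixel-wise BCE between predicted and target parsing maps (Eq. 5).
    bce = F.binary_cross_entropy(M_g, M_t)
    return adv + lambda_bce * bce
```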
C. Segmentation Region Coloring

Having obtained the target semantic segmentation $M_g = G_t(M_{in})$ from the previous stage, segmentation region coloring aims to synthesize a coarse try-on result by rendering information into the segmentation regions. Given the great success of GANs in various image generation tasks, we adopt the architecture of a CGAN [46] to synthesize the results. Specifically, we propose a coloring generator $G_c$ that renders the personal information into the body semantic segmentation $M_g$ according to $I_s$ and $C_t$ (i.e., the appearance of the source person and the in-shop clothing texture). Because it is difficult to obtain a large number of training images, we train our network to change the source person. To avoid supplying $G_c$ with the clothing information, we remove the clothing information from $I_s$. In other words, $G_c$ takes as input 1) the in-shop clothing $C_t \in \mathbb{R}^{3 \times W \times H}$, 2) the source person image without clothing information $I'_s \in \mathbb{R}^{3 \times W \times H}$, and 3) the target semantic segmentation $M_g \in \mathbb{R}^{20 \times W \times H}$.
Fig. 2 illustrates the architecture of TF-TIS. We adopt the UNet architecture with highway connections, which combine the input and the processed information; the highway connections are employed to avoid vanishing gradients [47]. Six residual blocks are implemented between the encoder and the decoder of $G_c$. In each residual block, two convolutional layers and a ReLU are stacked to integrate $M_g$, $I'_s$, and $C_t$ from small local regions to broader regions so that the appearance information of $I'_s$ and $C_t$ can be extracted.
Because the background information is less important and easily distracts the generator from synthesizing try-on images, we filter it out to force $G_c$ to concentrate on generating the correct human part of the image rather than the whole image. Specifically, the background of the generation result $I_g = G_c(C_t, I'_s, M_g)$ is filtered out with $M^{fg}_g$, and so is the ground truth $I_t$ with $M^{fg}_t$, where $M^{fg}_g$ and $M^{fg}_t$ represent $M_g$ and $M_t$ without the background channel, respectively. Afterward, the global structural information and other low-frequency features are obtained by calculating the L1 distance:
$$\mathcal{L}^{G_c}_{L1} = \sum_{W}\sum_{H}\big\| I_g \otimes M^{fg}_g - I_t \otimes M^{fg}_t \big\|_1, \quad (7)$$
where $\otimes$ represents pixel-wise multiplication.
For the discriminator, we construct the coloring discriminator $D_c$ against $G_c$ to distinguish two pairs: one including $I_t$ and $I_s$, and the other including $I_g$ and $I_s$. With the additional real image $I_s$, $D_c$ impels $G_c$ to generate more realistic images. Moreover, because this is a binary classification problem (i.e., the image is real or fake), we employ the binary cross-entropy loss as the GAN loss for the generated images:
$$\mathcal{L}^{G_c}_{GAN} = \mathcal{L}_{BCE}\big(D_c(G_c(C_t, I'_s, M_g), I_s), 1\big), \quad (8)$$
$$\mathcal{L}^{D_c}_{GAN} = \mathcal{L}_{BCE}\big(D_c(G_c(C_t, I'_s, M_g), I_s), 0\big) + \mathcal{L}_{BCE}\big(D_c(I_t, I_s), 1\big), \quad (9)$$
where $G_c$ attempts to deceive $D_c$ into recognizing the synthesized image as a real image; thus, the target of $\mathcal{L}_{BCE}$ in $\mathcal{L}^{G_c}_{GAN}$ is 1. In contrast, because $D_c$ must classify the generated and real images correctly, the targets of $\mathcal{L}_{BCE}$ in $\mathcal{L}^{D_c}_{GAN}$ are 0 and 1, respectively. In summary, the overall loss function of segmentation region coloring is as follows:
$$\mathcal{L}^{G_c} = \mathcal{L}^{G_c}_{GAN} + \lambda\,\mathcal{L}^{G_c}_{L1}. \quad (10)$$
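The foreground-masked L1 term of Eq. (7) can be written compactly as follows; this is a sketch under the assumption that the foreground masks are binary and broadcastable to the image shape.

```python
# Sketch of the foreground-masked L1 term in Eq. (7): the background is
# removed with the foreground channels of the parsing before comparison.
import torch

def masked_l1(I_g, I_t, M_fg_g, M_fg_t):
    """I_g: generated image, I_t: ground-truth target image,
    M_fg_g / M_fg_t: binary foreground masks (background channel dropped)."""
    return torch.abs(I_g * M_fg_g - I_t * M_fg_t).sum()
```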
D. Salient Region Refinement

Because users care most about the characteristics of the products, the performance of a virtual try-on service is highly dependent on the salient parts of the synthesized image, for example, the user (e.g., facial details or body shape), clothing features (e.g., a button or bow tie), and 3D physical effects (e.g., pleats and shadows). Hence, in the fourth stage, we propose two networks to refine the facial and clothing regions separately.
1) FacialGAN:
Modeling faces and hair is challenging but essential in synthesizing try-on images. To simplify this complicated task, our network generates residual face details instead of the whole face. Precisely, for the facial refinement network $G_{rf}$, we adapt the model of segmentation region coloring ($G_c$) to the facial refinement task by excluding the fully connected layer to avoid losing input details during compression. To force $G_{rf}$ to concentrate on facial details, $M^{face}_g$ and $M^{face}_s$ are introduced to extract the facial region from $I_g$ and $I_s$, respectively, where $M^{face}_g$ denotes the parsing channels representing the head (including the face, neck, and hair). As such, $G_{rf}$ generates the high-frequency details as the residual output $d = G_{rf}(I^{face}_g, I^{face}_s)$, where $I^{face}_g = I_g \otimes M^{face}_g$ and $I^{face}_s = I_s \otimes M^{face}_s$. After processing images through $G_{rf}$, the fine-tuned result is obtained by adding $d$ to $I_g$.
In addition, inspired by [48], [49], the perceptual loss is exploited to produce images that have a similar feature representation even though the pixel-wise accuracy is not high. Let $(d + I_g)^{face}$ and $I^{face}_t$ denote the regions within $M^{face}_g$ of $(d + I_g)$ and $I_t$, respectively. In addition to the pixel-wise loss $\|(d + I_g)^{face} - I^{face}_t\|$, we compute the perceptual loss by mapping both $(d + I_g)^{face}$ and $I^{face}_t$ into the perceptual feature space through different layers ($\phi_i$) of the VGG-19 model. This additional loss allows the model to better reconstruct details and edges:
$$\mathcal{L}^{G_{rf}}_{vgg}\big((d + I_g)^{face}, I^{face}_t\big) = \sum_{i}\lambda_i\,\big\| \phi_i\big((d + I_g)^{face}\big) - \phi_i\big(I^{face}_t\big) \big\|, \quad (11)$$
where $\phi_i$ represents the feature map retrieved from the $i$-th layer of the pretrained VGG-19 model [44]. Furthermore, as in the previous stages, we integrate the GAN loss as follows:
$$\mathcal{L}^{G_{rf}}_{GAN} = \mathcal{L}_{BCE}\big(D_{rf}\big((d + I_g)^{face}, I^{face}_s\big), 1\big), \quad (12)$$
$$\mathcal{L}^{D_{rf}}_{GAN} = \mathcal{L}_{BCE}\big(D_{rf}\big(I^{face}_s, (d + I_g)^{face}\big), 0\big) + \mathcal{L}_{BCE}\big(D_{rf}\big(I^{face}_s, I^{face}_t\big), 1\big). \quad (13)$$
The overall loss function of FacialGAN is as follows:
$$\mathcal{L}^{G_{rf}} = \lambda_{f_1}\,\mathcal{L}^{G_{rf}}_{GAN} + \lambda_{f_2}\,\mathcal{L}^{G_{rf}}_{vgg}\big((d + I_g)^{face}, I^{face}_t\big) + \lambda_{f_3}\sum_{W}\sum_{H}\big\|(d + I_g)^{face} - I^{face}_t\big\| + \lambda_{f_4}\sum_{W}\sum_{H}\big\|(d + I_g)\otimes M^{fg}_g - I_t \otimes M^{fg}_t\big\|, \quad (14)$$
where each $\lambda_{f_i}$ denotes the weight of the corresponding loss term.
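A sketch of the perceptual term in Eq. (11) using VGG-19 features; the layer indices, per-layer weights, and the use of an L1 feature distance are illustrative assumptions, and in practice the ImageNet-pretrained weights would be loaded.

```python
# Sketch of a VGG-19 perceptual loss in the spirit of Eq. (11).
import torch
import torchvision

class VGGPerceptualLoss(torch.nn.Module):
    def __init__(self, layer_ids=(3, 8, 17, 26), weights=(1.0, 1.0, 1.0, 1.0)):
        super().__init__()
        # weights=None builds an uninitialized VGG-19; load pretrained
        # ImageNet weights in practice, as the paper uses a pretrained model.
        vgg = torchvision.models.vgg19(weights=None).features.eval()
        for p in vgg.parameters():
            p.requires_grad = False
        self.vgg, self.layer_ids, self.weights = vgg, layer_ids, weights

    def forward(self, x, y):
        loss = 0.0
        for i, layer in enumerate(self.vgg):
            x, y = layer(x), layer(y)
            if i in self.layer_ids:
                w = self.weights[self.layer_ids.index(i)]
                # L1 feature distance is one common choice; the paper's exact
                # norm is not specified here.
                loss = loss + w * torch.abs(x - y).mean()
        return loss
```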
2) ClothingGAN:
Most state-of-the-art virtual try-on networks [1], [9], [11], [50] preserve detailed clothing information by fusing the pre-warped clothes onto the try-on images directly. However, these approaches encounter problems with limb occlusion or incorrectly warped clothing patterns. To solve these problems, in our previous work (FashionOn) [10], we implemented the virtual try-on framework by 1) transforming the human pose into the semantic segmentation form through $G_t$, 2) coloring the clothing textures and human appearance through $G_c$, and 3) processing images through refinement networks.
Although FashionOn restores most of the clothing information, some tiny but important details (e.g., the neckline or a button) are missing, and the generated images are not sufficiently realistic. Hence, we modify the previous clothing refinement generator and construct a new one ($G_{rc}$) that retrieves clothing features directly from the in-shop clothing $C_t$ and renders them into the clothing region of $I_g$. Inputting the concatenation of the in-shop clothing and the warped clothing into the Clothing UNet of our previous work [10] improved the details, but the generated clothing region still lacks fine details, such as the neckline and buttons. These unsatisfactory results are caused by sharing the same encoder parameters for the two input features of in-shop clothing and warped clothing. Moreover, the subtle difference in the details is neglected by the discriminator.
Based on these observations, the proposed ClothingGAN $G_{rc}$ contains four parts: (a) a detail encoder ($E_D$), (b) a warped-clothing encoder ($E_W$), (c) a decoder ($Dec$), and (d) a context discriminator ($D_{rc}$). The generator exploits the detailed information on the in-shop clothing and the warped clothing obtained from $E_D$ and $E_W$, respectively, which is then input into $Dec$ to generate an image of the refined clothing. Next, $D_{rc}$, which consists of a local and a global discriminator, differentiates whether the refined clothing is real or fake by comparing the local and global consistency with real images.
Detail Encoder ($E_D$). The objective of $E_D$ is to learn the detailed and previously neglected information (i.e., information missing from the previous stage) from an in-shop clothing image ($C_t$). To extract detailed visual features, we use seven convolutional layers, each followed by an instance normalization (IN) layer [51] and LeakyReLU [52] as the activation function; this is deeper than $E_W$ because detailed information, such as texture and logos, must be extracted from the original in-shop clothing. After training, $E_D$ can generate a detailed clothing representation, denoted as $C_d = E_D(C_t)$, which is further employed by the decoder to complement the details and synthesize the refined clothing.
Warped-Clothing Encoder ($E_W$). As depicted in Fig. 2, we use the UNet architecture to encode $I^{clothing}_g = I_g \otimes M^{clothing}_g$, where $M^{clothing}_g \in \mathbb{R}^{W \times H}$ is the clothing part of $M_g$. The encoder includes five downsampling convolutional layers with a kernel size of 5, and each layer is followed by an IN layer with LeakyReLU. Each layer of the UNet encoder is connected to the corresponding layer of the UNet decoder through highway connections to produce high-level features. Finally, we obtain the warped-clothing representation $C_w = E_W(I^{clothing}_g)$. In the following, we present how the outputs of $E_D$ and $E_W$ are further employed in the decoder network.
Decoder ($Dec$). To generate the refined clothing via the decoder, we concatenate the encoded features $C_d$ and $C_w$, obtained from $E_D$ and $E_W$ respectively, as input. From layer to layer in the decoder, we first gather the features from the previous layer and the precomputed feature maps of $E_W$ connected through a highway connection. Next, we upsample the feature map with a bicubic operation, after which a convolutional layer and a ReLU are applied. Using the highway connections with $E_W$ allows the network to align the detailed clothing features with the warped-clothing features obtained by the UNet encoder ($E_W$). In other words, the generator can be written as follows:
$$C_r = G_{rc}(C_t, I^{clothing}_g) = Dec\big(E_D(C_t), E_W(I^{clothing}_g)\big). \quad (15)$$
To bridge the difference between the refined clothing $C_r$ and the target clothing region $I^{clothing}_t = I_t \otimes M^{clothing}_t$, where $M^{clothing}_t$ represents the clothing channel of $M_t$, we introduce the L1 loss ($\mathcal{L}^{G_{rc}}_{L1}$) and the perceptual loss ($\mathcal{L}^{G_{rc}}_{vgg}$) to refine the clothing as follows:
$$\mathcal{L}^{G_{rc}}_{L1}(C_r, I^{clothing}_t) = \sum_{W}\sum_{H}\big\| C_r - I^{clothing}_t \big\|_1, \quad (16)$$
$$\mathcal{L}^{G_{rc}}_{vgg}(C_r, I^{clothing}_t) = \sum_{i}\lambda_i\,\big\| \phi_i(C_r) - \phi_i(I^{clothing}_t) \big\|, \quad (17)$$
where $\phi_i(C)$ represents the feature map of the clothing $C$ at the $i$-th layer of the VGG-19 model [44]. By exploiting the L1 loss instead of the L2 loss here, we alleviate the problem of blurry generated images. To further avoid misalignment, the refined clothing $C_r$ is composited into $I_g$, where the clothing region has been removed, to synthesize a refined human image $I^r_g = C_r \otimes M^{clothing}_g + I_g \otimes (1 - M^{clothing}_g)$. The parsing mask $M^{clothing}_g$ is used to select the clothing region, which facilitates excluding limbs in front of the clothing when fusing the clothing. The loss for the refined clothing try-on is defined as follows:
$$\mathcal{L}^{G_{rc}}_{fullbody}(I^r_g, I_t) = \sum_{W}\sum_{H}\big\| I^r_g - I_t \big\|. \quad (18)$$
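The compositing step described just before Eq. (18) amounts to a single masked blend; a minimal sketch:

```python
# Sketch of fusing the refined clothing C_r back into the coarse try-on
# image I_g using the clothing channel of the generated parsing.
import torch

def fuse_refined_clothing(C_r, I_g, M_clothing_g):
    """C_r: refined clothing from the decoder, I_g: coarse try-on image,
    M_clothing_g: binary clothing mask taken from the generated parsing."""
    return C_r * M_clothing_g + I_g * (1.0 - M_clothing_g)
```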
Context Discriminator. To make the refined clothing more realistic, we also employ the GAN loss $\mathcal{L}^{G_{rc}}_{GAN}$ by adopting the context discriminator, which comprises a global and a local discriminator that classify the refined clothing as real or fake by comparing the local and the global consistency with real images. Both discriminators are based on a convolutional network that compresses the images into small feature tensors. A fully connected layer is applied to the concatenation of the output feature tensors and predicts a value between 0 and 1, representing the probability that the refined clothing is real.
The global discriminator takes as input a bounding box of the clothing part cropped from the result and resized with bilinear interpolation. It consists of five two-stride convolutional layers with a kernel size of 5 and a fully connected layer that outputs a 1024-dimensional vector. The local discriminator follows a similar pattern, except that the last two convolutional layers are single-stride with a kernel size of 3 and the input is smaller; its input is generated by randomly sampling a patch from the bounding box and resizing it.
After deriving the outputs from the global and the local discriminators, we apply a fully connected layer followed by a sigmoid function to the concatenation of the two vectors (a 2048-dimensional vector). The output value ranges from 0 to 1, representing the probability that the refined clothing is real rather than generated. The GAN loss is defined as follows:
$$\mathcal{L}^{G_{rc}}_{GAN} = \mathcal{L}_{BCE}\big(D_{rc}(r^f_{local}, r^f_{global}), 1\big), \quad (19)$$
$$\mathcal{L}^{D_{rc}}_{GAN} = \mathcal{L}_{BCE}\big(D_{rc}(r^f_{local}, r^f_{global}), 0\big) + \mathcal{L}_{BCE}\big(D_{rc}(r^t_{local}, r^t_{global}), 1\big), \quad (20)$$
where $r$ is the resized result, the subscript $global$ or $local$ denotes the whole or sub-sampled result, respectively, and the superscript $t$ or $f$ indicates that the result is true or fake (generated), respectively. The overall loss function of ClothingGAN is defined as follows:
$$\mathcal{L}^{G_{rc}} = \lambda_{c_1}\,\mathcal{L}^{G_{rc}}_{vgg} + \lambda_{c_2}\,\mathcal{L}^{G_{rc}}_{L1} + \lambda_{c_3}\,\mathcal{L}^{G_{rc}}_{fullbody} + \lambda_{c_4}\,\mathcal{L}^{G_{rc}}_{GAN}, \quad (21)$$
where $\lambda_{c_i}$ ($i = 1, 2, 3, 4$) denotes the weight of the corresponding loss.
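A structural sketch of the context discriminator (global branch, local branch, and the fused prediction head); channel widths, crop handling, and layer counts are simplified assumptions rather than the exact configuration.

```python
# Sketch of a two-branch context discriminator with a shared prediction head.
import torch
import torch.nn as nn

def conv_stack(in_ch, n_layers, base=64):
    """A small strided convolutional feature extractor (assumed widths)."""
    layers, ch = [], in_ch
    for i in range(n_layers):
        out = base * (2 ** i)
        layers += [nn.Conv2d(ch, out, kernel_size=5, stride=2, padding=2),
                   nn.LeakyReLU(0.2, inplace=True)]
        ch = out
    return nn.Sequential(*layers), ch

class ContextDiscriminator(nn.Module):
    def __init__(self, in_ch=3, feat_dim=1024):
        super().__init__()
        self.global_net, _ = conv_stack(in_ch, 5)
        self.local_net, _ = conv_stack(in_ch, 4)
        # Each branch is projected to a 1024-d vector, as in the paper.
        self.global_fc = nn.LazyLinear(feat_dim)
        self.local_fc = nn.LazyLinear(feat_dim)
        # The concatenated 2048-d vector is mapped to a real/fake probability.
        self.head = nn.Sequential(nn.Linear(2 * feat_dim, 1), nn.Sigmoid())

    def forward(self, local_crop, global_crop):
        g = self.global_fc(self.global_net(global_crop).flatten(1))
        l = self.local_fc(self.local_net(local_crop).flatten(1))
        return self.head(torch.cat([l, g], dim=1))
```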
IV. EXPERIMENTS

The datasets and implementation details are described first. Afterward, we conduct qualitative and quantitative experiments against the state-of-the-art methods and our previous work FashionOn [10] to demonstrate the effectiveness of TF-TIS.
Fig. 3. Visual detail comparison.
To compare the details of the generated images between different models, we excluded the cloth2pose module from our network. The leftmost three columns are the inputs, and the rest of the columns are the outputs of the different models and their local enlargements. Our TF-TIS has the best performance regarding details, such as the neckline of polo shirts and the clothing pattern, and retains global and local consistency.
A. Dataset
To train and evaluate the proposed TF-TIS, a dataset containing two different poses and one clothing image for each person is required. Still, most of the existing datasets provide either only one pose for each person with the corresponding clothing image [1], [9] or multiple poses for each person but without clothing images [53]. Therefore, we collected a new large-scale dataset containing in-shop clothes with the corresponding images of mannequins wearing the in-shop clothes in two different poses. In addition, the DeepFashion dataset [53] is also adopted to broaden the diversity of the data. After removing the incomplete image pairs and wrapping one in-shop clothing and two human images into each triplet, the final set of triplets was created and randomly split into a training set and a testing set. B. Implementation Details
Cloth2pose.
We initialize the first 10 layers with that ofthe VGG-19 [44] and fine-tune them to generate a set ofclothing feature maps F from the information on the in-shopclothing. For the following convolutional blocks, each containsfive convolutional layers with a × kernel and two with a Please refer to the images in https://github.com/fashion-on/FashionOn.github.io. × kernel. Each layer is followed by a ReLU. In this stage,we apply N c p = 4 to the number of the convolutional blocks. Pose-guided Parsing Translator.
Based on the frameworkof ResNet, we implement two downsampling layers, nineresidual blocks, and followed by two upsampling layers.Specifically, we construct two single-stride convolutional lay-ers with a × kernel and one highway connection, combiningthe input and the output of each corresponding residual block. Segmentation Region Coloring.
The architecture is com-posed of the encoder and decoder with six residual blocksbetween them. Except for the last residual block and onefully connected layer, each block contains two single-strideconvolutional layers with a × kernel, one downsamplingtwo-stride convolutional layer with a × kernel. The numberof filters of all convolutional layers linearly increases anddecreases, respectively, for the encoder and decoder. Salient Region Refinement.
The generator of FacialGAN( G rf ) is similar to G c but without the fully connected layer.In addition, G rf has four residual blocks containing twoconvolutional layers and one downsampling convolutionallayer. For ClothingGAN, the generator ( G rc ) comprises twodifferent encoders and one decoder. The detail encoder ( E D )consists of four downsampling convolutional layers and threeconvolutional layers, and the warped-clothing encoder ( E W )consists of four downsampling convolutional layers and oneconvolutional layer. All downsampling convolutional layers Fig. 4.
Pose retrieval examples of cloth2pose . We queried our trainingdataset by comparing the features extracted via cloth2pose to all clothing fea-tures in the dataset. For each query, we present the top five retrieved samples.To focus on the pose information, we eliminate the human information, suchas skin or hair color. The leftmost two columns are input clothes and thetranslated parsing, generated via Stage II from the derived pose. Althoughsome examples are not like the query, it still shows that we could easily findresults visually close to the query. have a × kernel and a × stride, and other convolutionallayers have a × kernel and a × stride. Both kinds of con-volutional layers are followed by the IN layer and LeakyReLU.The decoder ( Dec ) consists of five × convolutional layers,and each layer is followed by one upsampling layer, one INlayer, and one ReLU.For the context discriminator ( D rc ), we adopt two dis-criminators: (1) the global discriminator, which consists offour downsampling convolutional layers and outputs a 1024-dimensional vector representing the global consistency, and (2)the local discriminator, which consists of three downsamplingconvolutional layers and outputs a 1024-dimensional vectorrepresenting the local consistency. A fully connected layer andsigmoid function are applied to the concatenation of the twovectors to differentiate whether the image is real or generated.We used Adam [54] with β = 0 . and β = 0 . asthe optimizer for all stages. The learning rates of the pose-guided parsing translator and the other stages are 2e-4 and2e-5, respectively. C. Qualitative Results
Several try-on results are depicted in Figs. 3, 5, 6, and 7.
1) Evaluation of Virtual Try-on:
As Fig. 3 reveals, we compare TF-TIS with the state-of-the-art clothing-warping-based method (VTNCAP [11]) and our previous work (Fash-
Fig. 5.
Qualitative results sampled from our testing dataset.
For every example (six images as a group), we show, from left to right: the input clothing, the generated segmentation with the pose synthesized by TF-TIS, the try-on result with the synthesized pose, the generated segmentation with the defined pose in our dataset, the try-on result with the defined pose, and the real try-on image. ionOn [10]), which adopts a coarse-to-fine strategy. In addition, because CP-VTON does not include pose transfer, we combine the state-of-the-art pose transfer method GFLA [55] with CP-VTON [9] as an additional baseline (GFLA+CP-VTON). The results indicate that all methods accomplish the task of virtual try-on with arbitrary poses. However, the results of VTNCAP and GFLA+CP-VTON contain some artifacts, while the results of FashionOn lose some details and local consistency. Several cases worth mentioning are listed below.
Neglecting Tiny but Essential Details.
Fig. 3 illustrates that the ClothingGAN ($G_{rc}$) does generate detailed information. From left to right, the results are from the state-of-the-art works (VTNCAP, GFLA+CP-VTON, and FashionOn) and the ablation studies for $G_{rc}$ in TF-TIS (TF-TIS without $G_{rc}$, TF-TIS without the local discriminator, and TF-TIS). The approaches without $G_{rc}$ (two encoders), including VTNCAP, GFLA+CP-VTON, and FashionOn, fail at the erroneous neckline and the small button, as revealed in Rows 1 and 3. The neckline and the small button on the clothing image are neglected by FashionOn because it uses only one encoder to extract the information from the concatenation of the in-shop clothing and the warped clothing, which degrades the focus on both images. In contrast, the local discriminator of TF-TIS discerns tiny clothing details, and the global discriminator is applied to retain the consistency of the entire image. As a result, TF-TIS generates the neckline and small button based on more comprehensive information about the warped clothing, yielding an appearance that is closer to the in-shop clothing images. Wrong Warping Pattern.
As depicted in Row 2 of Fig. 3, FashionOn and TF-TIS successfully resolve the wrong-warping-pattern problems of VTNCAP. Because warping clothes through TPS [16] only considers the deformation of clothes in two dimensions, the warped clothes are unrealistic. Although GFLA+CP-VTON preserves the neckline and the button and generates smooth plaid in Rows 1 and 2, it misses the shading and averages the clothing color in Row 4. In Row 6, GFLA+CP-VTON mistakenly places the red pocket on the right side. In contrast, we predict the warped-clothing mask based on the in-shop clothing mask and the warped body segmentation, which consider the correlation between body parts. Moreover, the proposed TF-TIS retains the consistency of the clothes, such as the pattern shape, which makes the plaid shirts more realistic, because we adopt global and local discriminators to discern the clothing details and to retain consistency.
Average Face.
VTNCAP often synthesizes an average face, as depicted in the fourth column of Fig. 3, because it simply uses the whole body as a mask and renders the human information into it. In contrast, we treat human parsing using 18 channels and render the information for each body part into the corresponding region, which is more specific for every part. Additionally, our work employs the FacialGAN to refine the facial part, making it more distinctive, instead of synthesizing average faces.
Clothing Color Degradation.
In the second, fourth, fifth, and sixth rows of Fig. 3, the clothing color of the results derived by VTNCAP deviates from the color of the in-shop clothing. In contrast, FashionOn and TF-TIS successfully preserve the color of the in-shop clothing, which is important in virtual try-on services.
Fig. 6. Visual comparison of AdaIN [56] and IN [51] for ClothingGAN.
Human Limbs Occlusion.
Rows 5 and 6 in Fig. 3 reveal that the proposed TF-TIS can solve the human limb occlusion problems of VTNCAP. Rather than simply warping the clothing through TPS, we simultaneously warp the clothing and the body segmentation, and then render the human appearance and the clothing information sequentially. Hence, $G_c$ can easily render the appearance based on the full semantic segmentation, preserving the natural correlation between clothes and humans. Dropping the Detailed Logo.
In Fig. 3, the rightmost two columns show the ablation study for the local discriminator within the context discriminator. Row 4 shows that the local discriminator helps generate the full logo: with the local discriminator (rightmost column), the "PARIS" logo is evident with almost all five characters, whereas without it, only three characters are generated.
Comparison of AdaIN and IN for $G_{rc}$. We replace the IN layers in the two encoders of ClothingGAN with adaptive instance normalization (AdaIN) to evaluate whether AdaIN helps preserve the clothing details (Fig. 6). With AdaIN, Equation (15) becomes:
$$C_r = Dec\big((1-\gamma)\,E_W(I^{clothing}_g) + \gamma\, AdaIN\big(E_W(I^{clothing}_g), E_D(C_t)\big)\big), \quad (22)$$
where $\gamma$ is a hyperparameter for the content–style trade-off. We used two values of $\gamma$ (including 0.75) to evaluate the difference and demonstrate the visual comparison in Fig. 6.
$$AdaIN(x, y) = \alpha(y)\left(\frac{x - \mu(x)}{\alpha(x)}\right) + \mu(y), \quad (23)$$
where $x$ represents the content input, $y$ is the style input, and $\alpha(y)$ denotes the standard deviation of $y$. AdaIN simply scales the normalized content input with $\alpha(y)$ and shifts it by $\mu(y)$. Fig. 6 reveals that AdaIN tends to generate global features for the clothing information and fails to generate robust details. For example, as presented in Row 4, AdaIN fails to synthesize a sharp edge for the suspenders. Moreover, as displayed in Row 5, AdaIN tends to generate blurry flowers. When increasing the hyperparameter $\gamma$ to include a higher proportion of features from $C_t$, $G_{rc}$ with AdaIN generates more robust but still blurrier results than with IN.
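For reference, a minimal implementation of the AdaIN operation in Eq. (23); the small epsilon is added for numerical stability and is an implementation detail, not part of the formulation above.

```python
# Sketch of AdaIN (Eq. 23): normalize the content features, then re-scale
# and shift them with the style statistics.
import torch

def adain(content, style, eps=1e-5):
    """content, style: feature maps of shape (N, C, H, W)."""
    c_mean = content.mean(dim=(2, 3), keepdim=True)
    c_std = content.std(dim=(2, 3), keepdim=True) + eps
    s_mean = style.mean(dim=(2, 3), keepdim=True)
    s_std = style.std(dim=(2, 3), keepdim=True)
    return s_std * (content - c_mean) / c_std + s_mean
```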
2) Evaluation of cloth2pose:
Because none of the previousresearch can generate the target poses according to the in-shop clothes, we evaluate the performance of cloth2pose bydetermining whether cloth2pose can learn the relationshipbetween the in-shop clothing and try-on pose. Specifically,in the testing phase, given the in-shop clothing, we use cloth2pose to derive the synthesized pose and generate thetranslated parsing (second column in Fig. 4). Afterward, wecompute the L2 distances between the in-shop clothing featureand all clothing features in the training dataset and retrieve topfive try-on poses results with the smallest clothing distance.The in-shop clothing features are extracted by using the first10 layers of the VGG-19 [44].Fig. 4 presents several examples. The retrieved results revealthat the synthesized poses are very close to some real poses inthe top five results (e.g., the fourth sample in Row 2, the firstsample in Row 5, and the first sample in Row 6). Moreover,our retrieved examples also demonstrate that different posesshould be synthesized in accordance with the in-shop clothingto better present the clothing. For example, T-shirts, like theclothes in Rows 1 to 3, are demonstrated in the front viewsto show the logo or with one hand in the pocket to showthe muscles. However, the camisole tops in Rows 4 to 6, Fig. 7. Qualitative results sampled from the testing dataset. are demonstrated with people standing sideways to show theirbody shapes, facing the right or left.Moreover, the qualitative results of our testing dataset arepresented in Fig. 5, and indicate that our model can synthesizea better pose to display clothing. For each example, we presentthe input clothing ( C t ), the user ( I s ), the translated human parsing with the synthesized pose via the cloth2pose moduleand the generated image, the human parsing with the definedpose and the generated image, and the ground truth image ofthe defined pose. Although appearing a little different from theimage with the defined pose, the cloth2pose results capturethe key information about the human, such as the direction TABLE II C OMPARISON OF THE VIRTUAL TRY - ON TESTING DATASET . W ERANDOMLY SAMPLED
DATA FROM THE TESTING DATASET . Method IS SSIM
VTNCAP [11] 2.5874 ± ± ± ± ± ± Real Data 3.2350 ± they face. Moreover, we synthesize suitable poses for clothes.For instance, 1) in Row 3, we derive the pose in the frontview to show the pattern of the clothing and 2) in Row 4 to5, we synthesize the sideways pose to show the upper armsand shoulders of people. Therefore, our model understands therelation between clothes and poses and can synthesize betterposes to present better try-on results, which induces users tobuy clothes. D. Quantitative Results
Because the structural similarity (SSIM) [57] and inceptionscore (IS) [58] are fairly standard metrics that focus on theoverall quality of the generated image instead of the pixel-wisecomparison, we calculated them for the reconstruction of thetry-on results in our dataset. The SSIM measures the similarityby comparing the generated images against the original imagesin the structural information, whereas IS provides scores toindicate whether the generated results are visually diverse andsemantically meaningful.Compared with the other virtual try-on systems (i.e., VT-NCAP, CP-VTON, GFLA+CP-VTON, and FashionOn), ourmethod outperforms them in terms of SSIM and IS, as revealedin Table II. Moreover, TF-TIS outperforms VTNCAP and CP-VTON in terms of IS by . and . , respectively. Ad-ditionally, the comparison in term of SSIM indicates that TF-TIS exceeds VTNCAP and CP-VTON by . and . ,respectively. Although TF-TIS only surpasses the results ofFashionOn within 1% in both metrics, the result complementsthe important details and the local and global consistency thatFashionOn lacks, as demonstrated in Fig 3. Runtime.
We evaluated the efficiency of the proposed TF-TIS by separately reporting the running time of the fourmodules. The results of the runtime were conducted on aNVIDIA 1080-Ti GPU and were averaged with 2000 randomlyselected image sets. The runtime of each module is as follows: cloth2pose (1.3 ms), pose-guided parsing translator (2.6 ms), segmentation region coloring (3.1 ms), and salient regionrefinement ( G rf : 1.9 ms, G rc : 2.6 ms). The results indicatethat the proposed TF-TIS not only reduces the cost of hiringphotographers but also provides a real-time try-on service forfashion e-commerce platforms.V. C ONCLUSION AND F UTURE WORK
In this paper, we present a part-level learning network (TF-TIS) for virtual try-on service with automatically synthesized poses. The previous work requires a user-specified targetpose for try-on. In contrast, TF-TIS precisely generates try-on images with the poses synthesized from the clothingcharacteristics, which better demonstrates the clothes. Theexperimental results indicate that TF-TIS significantly outper-forms the state-of-the-art virtual try-on approaches on variousclothing types, is better in term of being lifelike in appearance,and recommends poses that induce customers to buy clothes.Moreover, as shown in the experiments, TF-TIS capturesthe relation between clothes and poses to synthesize betterposes to present users with better try-on results. In addition,by proposing the global and the local discriminators in theclothing refinement network, TF-TIS retains consistency ofimages and preserves critical human information and clothingcharacteristics. Therefore, TF-TIS resolves many challengingproblems (e.g., generating tiny but essential details and pre-serving detailed logos). In the future, we plan to extend ourapproach to learn how different garment sizes deform on areal body in images using transfer training from 3D humanmodel methods. A
CKNOWLEDGMENTS
This work was supported in part by the Ministry of Sci-ence and Technology of Taiwan under Grants MOST-109-2221-E-009-114-MY3, MOST-109-2218-E-009-025, MOST-109-2221-E-009-097, MOST-109-2218-E-009-016, MOST-109-2223-E-009-002-MY3, MOST-109-2218-E-009-025 andMOST-109-2221-E-001-015, in part by the National NaturalScience Foundation of China under Grant 61772043, in part bythe Fundamental Research Funds for the Central Universities,and in part by the Beijing Natural Science Foundation underContract 4192025. R
REFERENCES
[1] X. Han, Z. Wu, Z. Wu, R. Yu, and L. S. Davis, "VITON: An image-based virtual try-on network," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[2] H.-J. Chen, K. M. Hui, S. Y. Wang, L.-W. Tsao, H.-H. Shuai, and W.-H. Cheng, "BeautyGlow: On-demand makeup transfer framework with reversible generative network," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[3] S. C. Hidayati, K.-L. Hua, W.-H. Cheng, and S.-W. Sun, "What are the fashion trends in New York?" in ACM International Conference on Multimedia (ACMMM) Grand Challenges, 2014.
[4] S. C. Hidayati, K.-L. Hua, Y. Tsao, H.-H. Shuai, J. Liu, and W.-H. Cheng, "Garment detectives: Discovering clothes and its genre in consumer photos," in IEEE Workshop on Artificial Intelligence for Art Creation (AIArt), 2019.
[5] B. Wu, T. Mei, W.-H. Cheng, and Y. Zhang, "Unfolding temporal dynamics: Predicting social media popularity using multi-scale temporal decomposition," in AAAI, 2016.
[6] S. C. Hidayati, T. W. Goh, J.-S. G. Chan, C.-C. Hsu, J. See, L.-K. Wong, K.-L. Hua, Y. Tsao, and W.-H. Cheng, "Dress with style: Learning style from joint deep embedding of clothing styles and body shapes," IEEE Transactions on Multimedia, vol. 23, pp. 365–377, 2021.
[7] L. Lo, C.-L. Liu, R.-A. Lin, B. Wu, H.-H. Shuai, and W.-H. Cheng, "Dressing for attention: Outfit based fashion popularity prediction," in , 2019.
[8] S. C. Hidayati, Y.-T. C. Cheng-Chun Hsu, K.-L. Hua, J. Fu, and W.-H. Cheng, "What dress fits me best? Fashion recommendation on the clothing style for personal body shape," in ACM International Conference on Multimedia, 2018.
[9] B. Wang, H. Zheng, X. Liang, Y. Chen, L. Lin, and M. M. Yang, "Toward characteristic-preserving image-based virtual try-on network," in European Conference on Computer Vision (ECCV), 2018.
[10] C.-W. Hsieh, C.-Y. Chen, C.-L. Chou, H.-H. Shuai, J. Liu, and W.-H. Cheng, "FashionOn: Semantic-guided image-based virtual try-on with detailed human and clothing information," in ACM International Conference on Multimedia (ACMMM), 2019.
[11] N. Zheng, X. Song, Z. Chen, L. Hu, D. Cao, and L. Nie, "Virtually trying on new clothing with arbitrary poses," in ACM International Conference on Multimedia (ACMMM), 2019.
[12] C.-W. Hsieh, C.-Y. Chen, C.-L. Chou, H.-H. Shuai, and W.-H. Cheng, "Fit-me: Image-based virtual try-on with arbitrary poses," in IEEE International Conference on Image Processing (ICIP), 2019.
[13] G. Pons-Moll, S. Pujades, S. Hu, and M. Black, "ClothCap: Seamless 4D clothing capture and retargeting," ACM Transactions on Graphics (TOG), 2017.
[14] T. Y. Wang, D. Ceylan, J. Popovic, and N. J. Mitra, "Learning a shared shape space for multimodal garment design," ACM Transactions on Graphics (TOG), 2018.
[15] E. Gundogdu, V. Constantin, A. Seifoddini, M. Dang, M. Salzmann, and P. Fua, "GarNet: A two-stream network for fast and accurate 3D cloth draping," in IEEE International Conference on Computer Vision (ICCV), 2019.
[16] S. J. Belongie, J. Malik, and J. Puzicha, "Shape matching and object recognition using shape contexts," IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2002.
[17] L. Ma, X. Jia, Q. Sun, B. Schiele, T. Tuytelaars, and L. Van Gool, "Pose guided person image generation," in Advances in Neural Information Processing Systems (NIPS), 2017.
[18] C. Si, W. Wang, L. Wang, and T. Tan, "Multistage adversarial losses for pose-based human image synthesis," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[19] A. Pumarola, A. Agudo, A. Sanfeliu, and F. Moreno-Noguer, "Unsupervised person image synthesis in arbitrary poses," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[20] A. Siarohin, E. Sangineto, S. Lathuilière, and N. Sebe, "Deformable GANs for pose-based human image generation," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[21] S. Song, W. Zhang, J. Liu, and T. Mei, "Unsupervised person image generation with semantic parsing transformation," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[22] Y. Li, C. Huang, and C. C. Loy, "Dense intrinsic appearance flow for human pose transfer," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[23] Z. Zhu, T. Huang, B. Shi, M. Yu, B. Wang, and X. Bai, "Progressive pose attention transfer for person image generation," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[24] Z. Cao, G. Hidalgo, T. Simon, S. Wei, and Y. Sheikh, "OpenPose: Realtime multi-person 2D pose estimation using part affinity fields," arXiv preprint arXiv:1812.08008, 2018.
[25] G. Rogez, P. Weinzaepfel, and C. Schmid, "LCR-Net++: Multi-person 2D and 3D pose detection in natural images," IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2019.
[26] G. Hidalgo, Y. Raaj, H. Idrees, D. Xiang, H. Joo, T. Simon, and Y. Sheikh, "Single-network whole-body pose estimation," in IEEE International Conference on Computer Vision (ICCV), 2019.
[27] M. M. Kalayeh, E. Basaran, M. Gokmen, M. E. Kamasak, and M. Shah, "Human semantic parsing for person re-identification," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[28] K. Gong, X. Liang, Y. Li, Y. Chen, M. Yang, and L. Lin, "Instance-level human parsing via part grouping network," in European Conference on Computer Vision (ECCV), 2018.
[29] J. Guo, Y. Yuan, L. Huang, C. Zhang, J.-G. Yao, and K. Han, "Beyond human parts: Dual part-aligned representations for person re-identification," in IEEE International Conference on Computer Vision (ICCV), 2019.
[30] A. Senocak, T.-H. Oh, J. Kim, M.-H. Yang, and I. S. Kweon, "Learning to localize sound source in visual scenes," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[31] A. Owens, P. Isola, J. McDermott, A. Torralba, E. H. Adelson, and W. T. Freeman, "Visually indicated sounds," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[32] A. Owens and A. A. Efros, "Audio-visual scene analysis with self-supervised multisensory features," in European Conference on Computer Vision (ECCV), 2018.
[33] Y. Zhang and H. Lu, "Deep cross-modal projection learning for image-text matching," in European Conference on Computer Vision (ECCV), 2018.
[34] J. Sanchez-Riera, K.-L. Hua, Y.-S. Hsiao, T. Lim, S. C. Hidayati, and W.-H. Cheng, "A comparative study of data fusion for RGB-D based visual recognition," Pattern Recognition Letters, vol. 73, pp. 1–6, 2016.
[35] S. Li, T. Xiao, H. Li, W. Yang, and X. Wang, "Identity-aware textual-visual matching with latent co-attention," in IEEE International Conference on Computer Vision (ICCV), 2017.
[36] Y. Liu, Y. Guo, E. M. Bakker, and M. S. Lew, "Learning a recurrent residual fusion network for multimodal matching," in IEEE International Conference on Computer Vision (ICCV), 2017.
[37] J. Sanchez-Riera, K. Srinivasan, K.-L. Hua, W.-H. Cheng, M. A. Hossain, and M. F. Alhamid, "Robust RGB-D hand tracking using deep learning priors," IEEE Transactions on Circuits and Systems for Video Technology, vol. 28, no. 9, pp. 2289–2301, 2018.
[38] L. Castrejón, Y. Aytar, C. Vondrick, H. Pirsiavash, and A. Torralba, "Learning aligned cross-modal representations from weakly aligned data," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[39] T.-H. Oh, T. Dekel, C. Kim, I. Mosseri, W. T. Freeman, M. Rubinstein, and W. Matusik, "Speech2Face: Learning the face behind a voice," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[40] Y.-C. Lin, M.-C. Hu, W.-H. Cheng, Y.-H. Hsieh, and H.-M. Chen, "Human action recognition and retrieval using sole depth information," in ACM International Conference on Multimedia (ACMMM), 2012.
[41] S. C. Hidayati, C.-W. You, W.-H. Cheng, and K.-L. Hua, "Learning and recognition of clothing genres from full-body images," IEEE Transactions on Cybernetics, vol. 48, no. 5, pp. 1647–1659, 2018.
[42] T.-Y. Lin, M. Maire, S. Belongie, L. Bourdev, R. Girshick, J. Hays, P. Perona, D. Ramanan, C. L. Zitnick, and P. Dollár, "Microsoft COCO: Common objects in context," in European Conference on Computer Vision (ECCV), 2014.
[43] M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele, "2D human pose estimation: New benchmark and state of the art analysis," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
[44] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," in International Conference on Learning Representations (ICLR), 2015.
[45] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, "Image-to-image translation with conditional adversarial networks," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[46] M. Mirza and S. Osindero, "Conditional generative adversarial nets," arXiv preprint, 2014.
[47] Y. Bengio, P. Y. Simard, and P. Frasconi, "Learning long-term dependencies with gradient descent is difficult," IEEE Transactions on Neural Networks (TNN), 1994.
[48] M. S. M. Sajjadi, B. Schölkopf, and M. Hirsch, "EnhanceNet: Single image super-resolution through automated texture synthesis," in IEEE International Conference on Computer Vision (ICCV), 2017.
[49] H.-X. Xie, L. Lo, H.-H. Shuai, and W.-H. Cheng, "AU-assisted graph attention convolutional network for micro-expression recognition," in ACM International Conference on Multimedia (ACMMM), 2020.
[50] Z. Wu, G. Lin, Q. Tao, and J. Cai, "M2E-Try On Net: Fashion from model to everyone," in ACM International Conference on Multimedia (ACMMM), 2019.
[51] D. Ulyanov, A. Vedaldi, and V. S. Lempitsky, "Improved texture networks: Maximizing quality and diversity in feed-forward stylization and texture synthesis," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[52] A. L. Maas, A. Y. Hannun, and A. Y. Ng, "Rectifier nonlinearities improve neural network acoustic models," in International Conference on Machine Learning Workshops (ICMLW), 2013.
[53] Z. Liu, P. Luo, S. Qiu, X. Wang, and X. Tang, "DeepFashion: Powering robust clothes recognition and retrieval with rich annotations," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[54] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in International Conference on Learning Representations (ICLR), 2015.
[55] Y. Ren, X. Yu, J. Chen, T. H. Li, and G. Li, "Deep image spatial transformation for person image generation," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[56] X. Huang and S. Belongie, "Arbitrary style transfer in real-time with adaptive instance normalization," in IEEE International Conference on Computer Vision (ICCV), 2017.
[57] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, "Image quality assessment: From error visibility to structural similarity," IEEE Transactions on Image Processing (TIP), 2004.
[58] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, X. Chen, and X. Chen, "Improved techniques for training GANs," in Advances in Neural Information Processing Systems (NIPS), 2016.
Chien-Lung Chou received the B.S. degree from the Department of Electrical and Computer Engineering, National Chiao Tung University (NCTU), Hsinchu, Taiwan, in 2019. He is now a master student in the Department of Electrical and Computer Engineering, University of Michigan. His research interests include artificial intelligence, deep learning, and computer vision.
Chieh-Yun Chen received the B.S. degree from the Department of Electrophysics, National Chiao Tung University (NCTU), Hsinchu, Taiwan, in 2020. She is now a master student in the Institute of Electronics, NCTU. Her research interests are mainly in artificial intelligence, deep learning, and computer vision.
Chia-Wei Hsieh received the B.S. degree from the Department of Electrical and Computer Engineering, National Chiao Tung University (NCTU), Hsinchu, Taiwan. She is now a master student in Electrical and Computer Engineering - Machine Learning and Data Science, University of California, San Diego (UCSD). Her interests include machine learning and computer vision.
Hong-Han Shuai received the B.S. degree from the Department of Electrical Engineering, National Taiwan University (NTU), Taipei, Taiwan, R.O.C., in 2007, the M.S. degree in computer science from NTU in 2009, and the Ph.D. degree from the Graduate Institute of Communication Engineering, NTU, in 2015. He is now an associate professor at NCTU. His research interests are in the areas of multimedia processing, machine learning, social network analysis, and data mining. His works have appeared in top-tier conferences such as MM, CVPR, AAAI, KDD, WWW, ICDM, CIKM, and VLDB, and top-tier journals such as TKDE, TMM, and JIOT. Moreover, he has served as a PC member for international conferences including MM, AAAI, IJCAI, and WWW, and as an invited reviewer for journals including TKDE, TMM, JVCI, and JIOT.
Jiaying Liu (M'10-SM'17) is currently an Associate Professor with the Wangxuan Institute of Computer Technology, Peking University. She received the Ph.D. degree (Hons.) in computer science from Peking University, Beijing, China, in 2010. She has authored over 100 technical articles in refereed journals and proceedings, and holds 42 granted patents. Her current research interests include multimedia signal processing, compression, and computer vision. Dr. Liu is a Senior Member of IEEE/CCF/CSIG. She was a Visiting Scholar with the University of Southern California, Los Angeles, from 2007 to 2008. She was a Visiting Researcher with Microsoft Research Asia in 2015, supported by the Star Track Young Faculties Award. She has served as a member of the Multimedia Systems & Applications Technical Committee (MSA-TC), the Visual Signal Processing and Communications Technical Committee (VSPC), and the Education and Outreach Technical Committee (EO-TC) in the IEEE Circuits and Systems Society, and as a member of the Image, Video, and Multimedia (IVM) Technical Committee in APSIPA. She has served as an Associate Editor for IEEE Trans. on Image Processing and Elsevier JVCI. She has also served as the Technical Program Chair of IEEE VCIP-2019/ACM ICMR-2021, the Publicity Chair of IEEE ICME-2020/ICIP-2019/VCIP-2018, and the Area Chair of ECCV-2020/ICCV-2019. She was an APSIPA Distinguished Lecturer (2016-2017).