Human Motion Transfer with 3D Constraints and Detail Enhancement
Yang-Tian Sun, Qian-Cheng Fu, Yue-Ren Jiang, Zitao Liu, Yu-Kun Lai, Hongbo Fu, Lin Gao
Abstract—We propose a new method for realistic human motion transfer using a generative adversarial network (GAN), which generates a motion video of a target character imitating actions of a source character, while maintaining high authenticity of the generated results. We tackle the problem by decoupling and recombining the posture information and appearance information of both the source and target characters. The innovation of our approach lies in the use of the projection of a reconstructed 3D human model as the condition of the GAN to better maintain the structural integrity of transfer results in different poses. We further introduce a detail enhancement net to enhance the details of transfer results by exploiting the details in real source frames. Extensive experiments show that our approach yields better results both qualitatively and quantitatively than the state-of-the-art methods.
Index Terms—Motion Transfer, Deep Learning, 3D Constraints, Detail Enhancement.
∗ Corresponding Author.
• Y.-T. Sun, Y.-R. Jiang and L. Gao are with the Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China. E-mail: [email protected], [email protected], [email protected]
• Q.-C. Fu is with the Department of Computer Science, Boston University. E-mail: [email protected]
• Z.-T. Liu is with TAL AI Lab, TAL Education Group, Beijing, China. E-mail: [email protected]
• Y.-K. Lai is with the Visual Computing Group, School of Computer Science and Informatics, Cardiff University, Wales, UK. E-mail: [email protected]
• H.-B. Fu is with the School of Creative Media, City University of Hong Kong. E-mail: [email protected]

INTRODUCTION

The problem of video-based human motion transfer is an interesting but challenging research problem. Given two monocular video clips, one for a source subject and the other for a target subject, the goal is to transfer the motion from the source person to the target, while maintaining the target person's appearance. Specifically, in the synthesized video, the subject should have the same motion as the source person and the same appearance as the target person (including clothes and background). To achieve this, it is essential to produce high-quality image-to-image translation of frames, while ensuring temporal coherence.

The difficulty of this problem lies in how to effectively decouple and recombine the posture information and appearance information of the source and target characters. Based on generative adversarial networks (GANs), a powerful tool for high-quality image-to-image translation, Chan et al. [1] proposed to first learn a mapping from a 2D pose to a subject image from the target video, and then use the pose of the source subject as the input to the learned mapping for video synthesis. However, due to the difference between the source and target poses, this approach often results in noticeable artifacts, especially for self-occlusion of body parts.

Observing that the self-occlusion issue is difficult to handle in the image domain, we propose to first reconstruct a 3D human model from a 2D image of both the source and target subjects, and then adjust the pose of the target human body to match the source (while maintaining the target person's body shape). An intrinsic geometric description of the deformed target is then projected back to 2D to form an image that reflects the 3D structure. This, along with the 2D pose figure extracted from the source image, is used as a constraint during GAN-based image-to-image translation, to effectively maintain the structural characteristics of the human body under different poses. In addition, previous methods [1], [2] only use the appearance of the target person in the training process of pose-to-image translation, and do not fully utilize the appearance of the source. When an input pose is very different from any poses seen during training, such solutions might lead to blurry results. Observing that the source video frame corresponding to the input pose might contain reusable rich details (especially for body parts such as hands, where the source and target subjects share some similarity), we intend to selectively transfer details from real source frames to the synthesized video frames. This is achieved by our detail enhancement network. Figure 1 shows representative motion transfer results with rich details.
Our problem may also be seen as an appearance transfer problem if viewed from a single-frame perspective. However, from a holistic perspective, our goal is to transfer the motion from the source image domain to the target. Therefore, we define our task as motion transfer in this paper.

We summarize our contributions as follows: 1) We propose to reconstruct a 3D human body with its shape from a target frame and its pose from a source frame, and project it to 2D to serve as the condition of a GAN-based network. This condition contains rich 3D information, including body shape, pose and occlusion, which helps maintain the structural characteristics of the human body in the generated images. 2) We introduce the detail enhancement net (DE-Net), which utilizes the information from real source frames to enhance details in the generated results. Extensive experiments show that our method outperforms the state-of-the-art methods, especially for challenging cases where the source and target have substantial differences.

Fig. 1. Given two monocular video clips, our method is able to transfer the motion of a source character (top) to a target character (middle), with realistic details (bottom).
RELATED WORK
Over the last decades, motion transfer has been extensively studied due to its potential for fast video content production. Some early solutions mainly revolved around realigning existing video footage according to the similarity to the desired pose [3], [4]. However, it is not an easy task to find an accurate similarity measure for different actions of different subjects. Several other approaches have attempted to address this problem in 3D, but they focus on the use of inverse kinematic solvers [5] and transferring motion between 3D skeletons [6], whereas we consider using a reconstructed 3D body mesh to guide motion transfer in the image domain, which provides much richer constraints.

Recently, the rapid advances of deep learning, especially generative adversarial networks (GANs) and their variations (e.g., cGAN [7], CoGAN [8], CycleGAN [9], DiscoGAN [10]), have provided a powerful tool for image-to-image translation, which has yielded impressive results across a wide spectrum of synthesis tasks and shows the ability to synthesize visually pleasing images from conditional labels. Pix2pix [11], based on a conditional GAN framework, is one of the pioneering works. CycleGAN [9] further presents the idea of a cycle consistency loss for learning to translate between two domains in the absence of paired images, and Recycle-GAN [12] combines both spatial and temporal constraints for video retargeting tasks. Pix2pixHD [13] introduces a multi-scale conditional GAN to synthesize high-resolution images using both global and local generators, and vid2vid [2] designs specific spatial and temporal adversarial constraints for video synthesis.

Based on these variants of GANs, many approaches [1], [14], [15], [16], [17] have been proposed for human motion transfer between two domains. The key idea of these approaches is to decouple the pose information from the input image and use it as the input of a GAN to generate a realistic output image. For example, in [14], the input image is separated into two parts, the foreground (or different body parts) and the background, and the final realistic image is generated by separately processing and then cross-fusing the two parts. Chan et al. [1] extract pose information with the off-the-shelf human pose detector OpenPose [18], [19], [20], [21] and use the pix2pixHD [13] framework together with a specialized Face GAN to learn a mapping from a 2D pose figure to an image. Neverova et al. [22] adopt a similar idea but use the estimation of DensePose [23] to guide image generation. Wang et al. go a step further and adopt both OpenPose and DensePose in [2]. However, due to the lack of 3D semantic information, these approaches are highly sensitive to problems such as self-occlusions.

To solve the above problems, it is natural to add 3D information to the condition of generative networks. There are many robust 3D human mesh reconstruction methods, such as [24], [25], [26], [27], which can reconstruct a 3D model with the corresponding pose from a single image or a video clip. Benefiting from these accurate and reliable 3D body reconstruction techniques, we can study the problem of human motion transfer from a new perspective. Liu et al. [28] present a novel warping strategy, which also uses the projection of 3D models to tackle motion transfer, appearance transfer and novel view synthesis within a unified model. However, due to the diversity of their network functionalities, it does not perform particularly well in the aspect of motion transfer.
Fig. 2. The architecture of our Motion Transfer Net (MT-Net) and Detail Enhancement Net (DE-Net). (a) MT-Net takes two images I_app and I_pose as inputs, and outputs a new image I_MT, which has the appearance of I_app and the pose of I_pose. (b) DE-Net takes image I_B as input, which is the blending of the raw transfer result I_MT, possibly with blurry artifacts, and the corresponding real frame I_pose with rich details; its aim is to generate an image I_DE in the target domain with the details enhanced.

METHOD
We aim to generate a new video of the target person imitating the character movements in the source video, while keeping the structural integrity and detail features of the target subject as much as possible. To accomplish this, we use the mesh projection containing 3D information as the condition for the GAN, and introduce a detail enhancement mechanism to improve the details.
We denote S = {S_i} as a set of source video frames, and T = {T_j} as a set of target frames. Our architecture can be divided into two modules, as shown in Figure 2: the Motion Transfer Net (MT-Net), responsible for motion transfer across the two domains, and the Detail Enhancement Net (DE-Net), used for the enhancement of details. More specifically, MT-Net takes two real video frames I_app and I_pose as input, and generates an output image I_MT that has the same appearance as I_app and the same pose as I_pose. DE-Net takes image I_B as input, which is the blending of the raw transfer result I_MT with blurred details and the corresponding real frame I_pose with rich details, and aims to generate an image I_DE in the target domain with the details enhanced. Our training pipeline is as follows:
Within-Domain Pre-Training of MT-Net. To stabilize the training process, we first pre-train MT-Net using within-domain samples. For the domain S, let I_app, I_pose ∈ S, and we obtain I_MT, which is the reconstructed source frame. For the domain T, let I_app, I_pose ∈ T, and we obtain the reconstructed target frame. This process initializes the MT-Net. Note that for each I_pose, I_app is randomly selected from the corresponding domain and fixed during the training process.
Training of DE-Net. After the pre-training of MT-Net, we let I_app ∈ S and I_pose ∈ T and generate the initial transferred image I_MT with the MT-Net, which is often blurred. We then calculate a blended image I_B, which is an average of I_MT and the corresponding real frame I_pose that contains clear details. We then train the DE-Net to selectively discern and generate details from the blended result, producing an output image I_DE with details enhanced that matches the target domain.
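To make this data flow concrete, the following minimal PyTorch-style sketch shows how a DE-Net training sample could be assembled; `mt_net` is a hypothetical handle for the trained MT-Net, and the per-pixel mean is assumed as the blending operator since the text describes the blend as an average:

```python
import torch

def make_de_training_sample(mt_net, I_app_src, I_pose_tgt):
    """Build one DE-Net training pair under the setting I_app in S, I_pose in T.

    I_app_src:  appearance frame from the source domain S.
    I_pose_tgt: pose frame from the target domain T; it is also the ground truth I_T.
    """
    with torch.no_grad():
        I_MT = mt_net(I_app_src, I_pose_tgt)   # initial transfer result, often blurred
    I_B = 0.5 * (I_MT + I_pose_tgt)            # blend blurry output with the sharp real frame
    return I_B, I_pose_tgt                     # DE-Net input and its supervision target
```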
Fig. 3. Architecture of MT-Net, which synthesizes a motion-transferred image with the appearance of Input1 and the pose of Input2.
Our transfer pipeline is as follows: let I_app ∈ T and I_pose ∈ S; we obtain the initial transfer result I_MT (with the source pose and target appearance) using MT-Net, and then obtain the final result I_DE with details enhanced using the DE-Net.

Note that the domains of I_app and I_pose for training DE-Net and for transfer are swapped, because in the training setting, I_MT has the appearance of the source and the pose of the target, and DE-Net aims to produce an image with the appearance and pose both in the target domain, so the ground truth of I_DE is available (it is exactly the corresponding I_pose ∈ T). This provides supervision for training I_DE to enhance details in the target domain (i.e., with the appearance of the target subject). Such supervision is not available if the transfer setting is used. Although both our method and Ma et al. [16] use two-stage pipelines, our method is essentially different: our novelty lies in the regularization with 3D constraints in the MT-Net and the use of source subject details in the DE-Net; see details in the remaining subsections.
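At transfer time the two networks are simply chained with the domains swapped, as summarized in the sketch below; `mt_net` and `de_net` are placeholder handles, and the source-to-target alignment discussed later (Figure 11) is omitted for brevity:

```python
import torch

def transfer_frame(mt_net, de_net, I_app_tgt, I_pose_src):
    """Transfer one source pose frame onto the target appearance."""
    with torch.no_grad():
        I_MT = mt_net(I_app_tgt, I_pose_src)   # target appearance, source pose (often blurred)
        I_B = 0.5 * (I_MT + I_pose_src)        # reuse details from the real source frame
        I_DE = de_net(I_B)                     # final, detail-enhanced result
    return I_DE
```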
As illustrated in Figure 3, the Motion Transfer Net consists of three parts: the Label Image Generator, which produces label images that encode human pose and shape information; the Appearance Encoder, which encodes the appearance of the input image; and the Pose GAN, which produces an output image with the given appearance and pose/shape constraints.
To maintain the structural integrity of the generated results and produce realistic images for actions involving self-occlusions, we utilize the 3D geometry information of the underlying subject to produce label images as the GAN condition to regularize the generative network. The architecture of our label image generator is shown in Figure 4.
3D human model reconstruction.
We first extract the 3D body shape β and pose θ for both the source and target videos using a state-of-the-art pre-trained 3D pose and shape estimator [27]. This leads to a 3D deformable mesh model including the details of the body, face and fingers. When transferring between two domains, the 3D human models also allow the generation of a 3D mesh with the pose from one domain and the shape from the other. The extracted deformable mesh sequences might exhibit temporal incoherence artifacts due to inevitable reconstruction errors. This can be alleviated by simply applying temporal smoothing to the mesh vertices, since our mesh sequences share the same connectivity.
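Because the reconstructed meshes share one connectivity, the temporal smoothing can operate directly on per-vertex trajectories. A minimal sketch follows; the centered moving average and its window size are assumptions, as the text does not specify the filter:

```python
import numpy as np

def smooth_mesh_sequence(vertex_seq, window=5):
    """Temporally smooth a mesh sequence of shape (num_frames, num_vertices, 3)."""
    smoothed = np.empty_like(vertex_seq)
    half = window // 2
    n = len(vertex_seq)
    for t in range(n):
        lo, hi = max(0, t - half), min(n, t + half + 1)
        smoothed[t] = vertex_seq[lo:hi].mean(axis=0)  # average each vertex over the window
    return smoothed
```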
Human model projection. We project the reconstructed 3D human model onto 2D to obtain a label image, which is used as the condition to guide the generator. The image should ideally contain intrinsic 3D information (invariant to pose changes) to guide the synthesis process, such that a particular color corresponds to a specific location on the human body. To achieve this, we extract the three non-trivial eigenvectors corresponding to the three smallest eigenvalues of the mesh Laplace matrix and treat them as a 3-channel color assigned to each vertex [29], which is projected to 2D to form a 3D constraint image.

Note that although additional 3D information is available, 3D meshes extracted from 2D images may occasionally contain artifacts due to the inherent ambiguity. Therefore, we also adopt OpenPose [18], [19], [20], [21] to extract a 2D pose figure as part of the condition, which is less informative but more robust in the 2D space. Our label image is therefore 6-channel after combining both the 2D and 3D constraints. We discuss the roles that these two conditions and their combination play separately in the ablation study.
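The per-vertex coloring can be sketched as follows. For simplicity the snippet uses a uniform graph Laplacian of the triangle mesh rather than the discrete Laplace operator of [29], and leaves out the camera projection step; it only illustrates how the three non-trivial eigenvectors become a 3-channel, pose-invariant vertex color:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh

def laplacian_vertex_colors(vertices, faces):
    """Return an (n, 3) array of intrinsic per-vertex colors in [0, 1]."""
    n = len(vertices)
    # Undirected edges of the triangle mesh -> symmetric 0/1 adjacency.
    i = np.concatenate([faces[:, 0], faces[:, 1], faces[:, 2]])
    j = np.concatenate([faces[:, 1], faces[:, 2], faces[:, 0]])
    A = sp.coo_matrix((np.ones(len(i)), (i, j)), shape=(n, n))
    A = ((A + A.T) > 0).astype(np.float64)
    L = sp.diags(np.asarray(A.sum(axis=1)).ravel()) - A      # graph Laplacian
    # Four smallest eigenpairs; a tiny diagonal shift keeps the shift-invert solve non-singular.
    vals, vecs = eigsh((L + 1e-8 * sp.identity(n)).tocsc(), k=4, sigma=0, which="LM")
    vecs = vecs[:, np.argsort(vals)]
    feat = vecs[:, 1:4]                                       # drop the trivial constant eigenvector
    return (feat - feat.min(0)) / (feat.max(0) - feat.min(0) + 1e-8)
```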
Fig. 4. Architecture of our label image generator. We reconstruct a 3D mesh of the transferred human body, assign the eigenvectors corresponding to the three smallest eigenvalues of its Laplace matrix as intrinsic features (visualized in RGB color), and project it to form a 3D constraint image, denoted as I_label.

We learn the mapping from label image sequences to realistic image sequences by training a conditional GAN, consisting of the Appearance Encoder and the Pose GAN. The design of the Pose GAN is similar to pix2pixHD [13]: it is composed of a generator G_pose and two multi-scale frame discriminators with the same architecture, for images in the source and target domains respectively. The two networks drive each other: the generator learns to synthesize more realistic images conditioned on the input to fool the discriminator, while the discriminator in turn learns to discern the "real" images (ground truth) from the "fake" (generated) images. The difference of our network from pix2pixHD is that the data we use to learn the mapping includes not only the target video, but also the source video. We achieve this by conditioning G_pose on both label images and appearance features, extracted by the Label Image Generator and the Appearance Encoder respectively. See Figure 5 for the generative network architecture. It is worth mentioning that, to address the poor temporal continuity caused by single-frame generation, adjacent frames are involved in training to improve temporal coherence, similar to [1].

Appearance Encoder E_app and Pose GAN Generator G_pose. As noted above, to make full use of the given data and meet the need of the subsequent detail enhancement, we train the generative network using data from both the source video and the target video. However, training two separate conditional GANs has a high overhead in computing resources and time. To simplify this process, we introduce an Appearance Encoder, and use label images containing the 2D/3D constraints together with appearance features to guide the Pose GAN to produce the reconstructed image (for within-domain input) or the initial transfer result I_MT (for cross-domain input). Note that for a new source subject video, our framework only needs to fine-tune the upsampling part of the Pose GAN to generate the new subject.

The Appearance Encoder is a fully convolutional network that extracts appearance features of the input image I_app, which are used as a condition for the Pose GAN. It takes randomly selected frames as input and outputs appearance features corresponding to that domain. The Pose GAN is the main part of MT-Net, and its generator consists of three submodules: Downsampling, ResNet blocks and Upsampling. It works on both the label images and the appearance features extracted by the Appearance Encoder, and synthesizes results with the corresponding pose and appearance. As shown in Figure 5, the output of the Appearance Encoder is added to the intermediate ResNet blocks in the generator.
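The injection of appearance features into the generator can be pictured with the PyTorch-style sketch below. Channel counts, the number of blocks, and adding the (spatially matched) appearance features before every ResNet block are illustrative assumptions rather than the exact released architecture:

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.InstanceNorm2d(ch), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.InstanceNorm2d(ch))

    def forward(self, x):
        return x + self.body(x)

class PoseGenerator(nn.Module):
    """Downsample -> ResNet blocks (appearance features added) -> Upsample."""
    def __init__(self, label_ch=6, app_ch=256, ch=256, n_blocks=6):
        super().__init__()
        self.down = nn.Sequential(                     # encodes two adjacent 6-channel label images
            nn.Conv2d(2 * label_ch, 64, 7, padding=3), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, ch, 3, stride=2, padding=1), nn.ReLU(inplace=True))
        self.blocks = nn.ModuleList([ResBlock(ch) for _ in range(n_blocks)])
        self.app_proj = nn.Conv2d(app_ch, ch, 1)       # match appearance features to the bottleneck
        self.up = nn.Sequential(
            nn.ConvTranspose2d(ch, 128, 3, stride=2, padding=1, output_padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(128, 64, 3, stride=2, padding=1, output_padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 3, 7, padding=3), nn.Tanh())

    def forward(self, label_cur, label_prev, app_feat):
        x = self.down(torch.cat([label_cur, label_prev], dim=1))
        a = self.app_proj(app_feat)                    # app_feat assumed at bottleneck resolution
        for blk in self.blocks:
            x = blk(x + a)                             # inject appearance at each ResNet block
        return self.up(x)
```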
Pose GAN Discriminator. We use the multi-scale discriminator presented in pix2pixHD [13]. Discriminators at different scales judge images at different levels of detail. In our method, we use two discriminators, D^S_pose and D^T_pose, to estimate the probability of the generated images belonging to the corresponding domain, each with 3 scales.
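A multi-scale discriminator in the spirit of pix2pixHD simply applies the same convolutional discriminator to progressively downsampled inputs; the layer sizes in this sketch are illustrative only:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleDiscriminator(nn.Module):
    """Scores an input (condition concatenated with frames) at several scales."""
    def __init__(self, in_ch, num_scales=3):
        super().__init__()
        def make_d():
            return nn.Sequential(
                nn.Conv2d(in_ch, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2, inplace=True),
                nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2, inplace=True),
                nn.Conv2d(128, 256, 4, stride=2, padding=1), nn.LeakyReLU(0.2, inplace=True),
                nn.Conv2d(256, 1, 4, padding=1))
        self.scales = nn.ModuleList([make_d() for _ in range(num_scales)])

    def forward(self, x):
        scores = []
        for d in self.scales:
            scores.append(d(x))                             # patch-level real/fake scores
            x = F.avg_pool2d(x, 3, stride=2, padding=1)     # downsample for the next scale
        return scores
```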
Temporal Smoothing. We use the temporal smoothing strategy in [1] to enhance the continuity between adjacent generated frames.
Fig. 5. Architecture of the generative network. Appearance image I_app is randomly selected and sent to the Appearance Encoder (denoted as E_app) to obtain appearance features. Adjacent label images I_label (current frame) and I'_label (previous frame) are sent to the Pose GAN (denoted as G_pose) together with the appearance features to generate the reconstructed image or initial transfer result I_MT.

The generation of the current frame is related not only to the current label image I_label, but also to the previous frame's label image I'_label. Therefore, letting d ∈ {S, T} denote the domain from which the training images are selected, our conditional GAN has the following objective:

$$\mathcal{L}^{d}_{MT}(E_{app}, G_{pose}, D^{d}_{pose}) = \mathbb{E}[\log D^{d}_{pose}(I_{label}, I'_{label}, I_{pose}, I'_{pose})] + \mathbb{E}[\log(1 - D^{d}_{pose}(I_{label}, I'_{label}, I_{MT}, I'_{MT}))]. \quad (1)$$

Here the discriminator D^d_pose takes a pair of adjacent images in domain d, and classifies them as real images (the current frame I_pose and previous frame I'_pose from the training set) or fake images (I_MT and the previous output I'_MT generated by the Pose GAN).

Through the first stage of training, we obtain initial transfer results with blurred details: the source-to-target transfer result I_{S→T} as well as the target-to-source transfer result I_{T→S}. We can thus construct paired data (I_S, I_{S→T}) and (I_T, I_{T→S}), as shown in Figure 6, where images in the same pair have the same pose but different appearances and different clarity of details, which motivates us to use a DE-Net to enhance the details from the blended image. Note that in different videos, subjects might have different builds or positions relative to the camera. In such cases, before sending the paired data to the DE-Net, we need to align the source frame with the target by applying the transformation calculated from the meshes reconstructed with the source and target parameters respectively. We demonstrate the effectiveness of this step in Figure 11, and more details are included in the supplementary material.
Fig. 6. Comparison of the source image I_S and the source-to-target transfer result I_{S→T}, as well as the target image I_T and the target-to-source transfer result I_{T→S}.

The purpose of our DE-Net is to generate clear details of target images from the blended image pair. It is a GAN whose generator G_DE is a U-net that synthesizes images in the target domain with clear details, as illustrated in Figure 7. The discriminator D_DE discerns the "real" images (ground truth) from the "fake" images (synthesized by G_DE).

In the training stage, we use the blended image of the pair (I_{T→S}, I_T) as input and train the DE-Net in a supervised manner with I_T as ground truth. The use of the mean blended image instead of concatenation prevents the output from overfitting to I_T. We optimize the DE-Net with the following objective:

$$\mathcal{L}_{DE}(G_{DE}, D_{DE}) = \mathbb{E}[\log D_{DE}(I_{label}, I'_{label}, I_{T})] + \mathbb{E}[\log(1 - D_{DE}(I_{label}, I'_{label}, I_{DE}))], \quad (2)$$

where I_label and I'_label are the label images used to generate I_{T→S}.

In the transfer stage, we use the source-to-target transfer result I_{S→T} and the corresponding source image I_S to obtain the enhanced transfer result.
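Both Eq. 1 and Eq. 2 are conditional adversarial objectives over (label, previous label, frame, previous frame) tuples. The sketch below shows how such a term could be evaluated; the binary cross-entropy (non-saturating) formulation and the channel-wise concatenation of the conditioning inputs are assumptions, as the paper does not spell out the exact GAN loss variant:

```python
import torch
import torch.nn.functional as F

def conditional_gan_losses(D, label, label_prev, real, real_prev, fake, fake_prev):
    """Discriminator and generator terms in the spirit of Eqs. 1-2 for one frame pair."""
    real_in = torch.cat([label, label_prev, real, real_prev], dim=1)
    fake_in = torch.cat([label, label_prev, fake, fake_prev], dim=1)
    d_real = D(real_in)
    d_fake = D(fake_in.detach())                        # generator is frozen for the D update
    loss_D = F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real)) \
           + F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake))
    d_gen = D(fake_in)                                  # gradients flow to the generator here
    loss_G = F.binary_cross_entropy_with_logits(d_gen, torch.ones_like(d_gen))
    return loss_D, loss_G
```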
Fig. 7. Architecture of the Detail Enhancement Net (DE-Net). The main part of our DE-Net is a U-net, which takes the mean of the paired data (I_{T→S}, I_T) or (I_S, I_{S→T}) as input, and synthesizes a target image I_DE with details enhanced.

The training of our network is divided into two stages. First we train the Motion Transfer Net, which consists of the Pose GAN and the Appearance Encoder. The full objective contains an adversarial loss, a perceptual loss and a discriminator feature-matching loss, and has the following form:

$$\min_{E_{app}, G_{pose}} \Big( \max_{D^{S}_{pose}} \mathcal{L}^{S}_{MT}(E_{app}, G_{pose}, D^{S}_{pose}) + \max_{D^{T}_{pose}} \mathcal{L}^{T}_{MT}(E_{app}, G_{pose}, D^{T}_{pose}) + \lambda_{P}\, \mathcal{L}_{P}(I_{MT}, I_{pose}) + \lambda_{FM}\, \mathcal{L}_{FM}((E_{app}, G_{pose}), D^{S}_{pose}) + \lambda_{FM}\, \mathcal{L}_{FM}((E_{app}, G_{pose}), D^{T}_{pose}) \Big). \quad (3)$$

Here, L^S_MT and L^T_MT are defined in Eq. 1. The perceptual loss L_P regularizes the generated result I_MT to be closer to the ground truth I_pose in the VGG-19 [30] feature space, and is defined as

$$\mathcal{L}_{P}(I_{1}, I_{2}) = \| \mathrm{VGG}(I_{1}) - \mathrm{VGG}(I_{2}) \|. \quad (4)$$

The discriminator feature-matching loss L_FM is presented in pix2pixHD [13] and similarly regularizes the output using intermediate results of the discriminator, calculated as

$$\mathcal{L}_{FM}(G, D_{k}) = \mathbb{E} \sum_{i=1}^{T} \frac{1}{N_{i}} \big[ \| D^{(i)}_{k}(s, x) - D^{(i)}_{k}(s, G(s)) \| \big], \quad (5)$$

where T is the number of layers, N_i is the number of elements in the i-th layer, and k is the index of the discriminator in the multi-scale architecture; s is the condition of the cGAN and x is the corresponding ground truth.

The DE-Net is optimized with the following objective:

$$\min_{G_{DE}} \Big( \max_{D_{DE}} \mathcal{L}_{DE}(G_{DE}, D_{DE}) + \lambda_{P}\, \mathcal{L}_{P}(I_{DE}, I_{pose}) + \lambda_{FM}\, \mathcal{L}_{FM}(G_{DE}, D_{DE}) \Big). \quad (6)$$

Here L_DE is defined in Eq. 2. The perceptual and discriminator feature-matching losses are defined in Eqs. 4 and 5.
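For reference, the non-adversarial terms of Eqs. 4 and 5 translate into a few lines of PyTorch; the VGG layer depth and the use of an L1 distance are assumptions made for this sketch, and a single feature layer stands in for a possibly multi-layer perceptual loss:

```python
import torch
import torch.nn as nn
import torchvision.models as models

class PerceptualLoss(nn.Module):
    """L_P of Eq. 4: distance between fixed VGG-19 features of two images."""
    def __init__(self, layer_idx=16):                   # assumed feature depth
        super().__init__()
        vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).features
        self.feat = nn.Sequential(*list(vgg.children())[:layer_idx]).eval()
        for p in self.feat.parameters():
            p.requires_grad_(False)                     # VGG acts as a fixed feature extractor

    def forward(self, generated, target):
        return torch.mean(torch.abs(self.feat(generated) - self.feat(target)))

def feature_matching_loss(real_feats, fake_feats):
    """L_FM of Eq. 5 for one discriminator: the lists hold its intermediate activations,
    evaluated on (condition, ground truth) and (condition, generated) respectively."""
    loss = 0.0
    for fr, ff in zip(real_feats, fake_feats):
        loss = loss + torch.mean(torch.abs(fr.detach() - ff))   # mean == (1/N_i) * L1 norm
    return loss
```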
EXPERIMENT

We compare our method with state-of-the-art methods and ablation variants, both quantitatively and qualitatively.
Dataset.
To verify the performance of our method, we collected three types of data: the dataset published by [1], 10 in-the-wild single-dancer videos from YouTube (including the data used by [2]), and 5 videos filmed by ourselves, of which 2 have an ordinary background and 3 a green screen. All videos are at × or × resolution. Each subject wears different clothes and performs different types of actions such as freestyle dancing and stretching exercises. To prepare for training and testing, we cut the start and end parts that contain no action, and crop and normalize each frame to the same size by simple scaling and translation.
Implementation details. We adopt a multi-stage training strategy using the Adam optimizer with a learning rate of 0.0001. In the first stage, we pre-train the MT-Net for 20 epochs. In the next stage, the parameters of MT-Net are fixed and the DE-Net is trained individually for 10 epochs. We set the hyperparameters λ_FM = 10 and λ_P = 5 for both stages. More details about MT-Net and DE-Net are given in the supplementary material.
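These settings map directly onto an optimizer configuration; the sketch below assumes PyTorch's Adam with its default momentum parameters, which the text does not specify, and the module handles passed in are hypothetical names:

```python
import torch

LAMBDA_FM, LAMBDA_P = 10.0, 5.0        # loss weights used in both stages
PRETRAIN_EPOCHS, DE_EPOCHS = 20, 10    # stage lengths stated above

def build_stage1_optimizer(appearance_encoder, pose_generator, lr=1e-4):
    """Stage 1: jointly optimize the Appearance Encoder and the Pose GAN generator."""
    params = list(appearance_encoder.parameters()) + list(pose_generator.parameters())
    return torch.optim.Adam(params, lr=lr)

def build_stage2_optimizer(mt_modules, de_net, lr=1e-4):
    """Stage 2: freeze every MT-Net module and optimize the DE-Net alone."""
    for module in mt_modules:
        for p in module.parameters():
            p.requires_grad_(False)
    return torch.optim.Adam(de_net.parameters(), lr=lr)
```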
Existing methods. We compare our performance with the existing state-of-the-art methods vid2vid [2], Everybody Dance Now [1] and Liquid Warping GAN [28], using their official implementations.
Evaluation Metrics.
We use objective metrics for quantitative evaluation under two different conditions. 1) To directly measure the quality of the generated images, we perform self-transfer, in which the source and target are the same subject, and then use SSIM [31] and the learning-based perceptual similarity (LPIPS) [32] to assess the similarity between the ground-truth and generated images. We split the frames of each subject into training and test sets at a ratio of 8:2 for this evaluation. 2) We also evaluate the performance of cross-subject transfer, where the source and target are different subjects, using the Inception Score (IS) [33] and the Fréchet Inception Distance (FID) [34] as metrics. Note that we compute the FID score between the original and generated target images, since no ground truth exists for comparison in this case. We exclude the green-screen data from the quantitative evaluation to focus on more challenging cases.

The metrics mentioned above are all based on single frames, which cannot reflect the smoothness of the generated image sequences. The effect of mesh filtering over time can be observed in the video results and is quantitatively measured by the user study.
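The frame-level metrics can be computed with standard packages. The snippet below assumes scikit-image for SSIM and the lpips package for the perceptual distance; these stand-ins may differ slightly from the exact implementations used for the reported numbers:

```python
import torch
import lpips
from skimage.metrics import structural_similarity

lpips_fn = lpips.LPIPS(net="alex")   # learned perceptual image patch similarity

def frame_metrics(generated, ground_truth):
    """generated / ground_truth: HxWx3 uint8 frames of the same subject (self-transfer)."""
    ssim = structural_similarity(generated, ground_truth, channel_axis=-1)
    to_tensor = lambda im: (torch.from_numpy(im).permute(2, 0, 1).float() / 127.5 - 1.0).unsqueeze(0)
    lp = lpips_fn(to_tensor(generated), to_tensor(ground_truth)).item()
    return ssim, lp
```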
Fig. 8. Transfer results. We show the generated frames of several subjects with different genders, races or builds. In each group, the top row shows the source subject and the bottom row shows the generated target subject.
Fig. 9. The effect of the number of source frames on transfer results. We show the inception score of the transfer results (y-axis) w.r.t. the ratio of the number of source frames to target frames (x-axis). For the inception score, higher is better.

Comparison results with the state of the art are reported in Table 1. Our method performs better than the others.
TABLE 1
Quantitative comparison with the state of the art (vid2vid, Chan et al., LW-GAN and ours) on the dance dataset. Metrics are averaged over all subjects: SSIM and LPIPS for self-transfer, IS and FID for cross-subject transfer. For SSIM and IS, higher is better; for LPIPS and FID, lower is better.

In this part, we perform an ablation study to verify the impact of each component of our model, including the use of 3D constraints ("3D") and the DE-Net ("DE"). Our full pipeline is indicated as "Full". When 3D is disabled, we use 2D pose figures as the default label images.

Table 2 shows the results of the ablation study. Our full framework clearly performs better than its variants, and both the 3D constraints and the DE-Net improve the results. Although there is no explicit 3D loss, the Laplacian projection of the 3D meshes encodes the shape and geometry information and serves as a condition of the Pose GAN, which plays an important role in generation, as shown by the comparison between MT(3D only) and MT. The score of MT+3D shows the complementarity of the 2D and 3D conditions for this task. The comparison of Full with MT+3D (or MT+DE with MT) demonstrates the contribution of the DE-Net. Furthermore, we observe that the self-transfer scores of MT and MT+3D (or MT+DE and Full) are similar. This is because the source and target subjects share the same body shape in self-transfer, which somewhat limits the effectiveness of the 3D constraints, whereas the cross-subject transfer scores demonstrate the important role 3D information plays in transfer between different subjects with different shapes.

Fig. 10. We compare with Chan et al. [1] and vid2vid [2] on the data published by [1] (a) and the data used in [2] (b) for the sake of fairness. Our method obtains more accurate results in details for occlusion actions such as side faces and bending fingers.

Fig. 11. Effects of the aligning transformation. The absence of the transformation makes the DE-Net unable to match the source and target characters accurately, resulting in vague results and body shape changes.

TABLE 2
Ablation study. For SSIM and IS, higher is better. For LPIPS and FID, lower is better.

                          MT      MT(3D only)   MT+3D    MT+DE    Full
Self-trans    SSIM        0.828   0.831         0.856    0.877
              LPIPS       0.064   0.063         0.058    0.043
Cross-trans   IS          3.178   3.265         3.325    3.224
              FID         58.62   56.68         55.77    57.56
We also conduct a user study to measure the human perceptual quality of the cross-subject transfer results. In our experiments, we compare videos generated by vid2vid, Chan et al., Liquid Warping GAN and our method. Specifically, we show volunteers a series of videos produced by each of the methods at the resolution of ×, and the volunteers are given unlimited time to respond. 50 distinct participants are involved, and each of them is asked to select: 1) the clearest result with rich details; 2) the most temporally stable result; and 3) the overall best result. As shown in Table 3, our method is more realistic, with richer details and better temporal stability than the other methods.

TABLE 3
User study. We report the percentage of participants' choices in three different aspects.

                          vid2vid   Chan et al.   LW-GAN   ours
Detail and clarity        22.2%     16.7%         7.07%
Temporal stability        24.7%     15.1%         9.60%
Overall feeling           26.3%     17.7%         8.08%
Fig. 12. Comparison with state-of-the-art methods. We show the generated results of vid2vid, Everybody Dance Now, Liquid Warping GAN and our method. Only our method reconstructs the wriggling hand and the smiling face.
Previous motion transfer methods such as [1] and [2] only use target frames at the training stage, so the quality of their generative models is not directly related to the number of source frames. In our method, however, source frames and target frames are both involved in training. Therefore, it is meaningful to explore the influence of the number of source frames on the generated results. We carry out experiments on all subjects and record the averaged evaluation of generated image quality with respect to the ratio of the number of source frames to target frames, with the number of target frames fixed, as shown in Figure 9. Note that we choose the inception score as the metric. It can be seen that the score converges when the ratio reaches around 0.5.
We visualize our generated results in Figure 8. It can be seen that our method successfully drives the motions of different targets with structural integrity and rich details, particularly in the face and hands. We also demonstrate that our method outperforms existing methods in Figure 12 and Figure 10. As illustrated in the first row of Figure 12, our method enhances the structural integrity of arms and legs, and avoids the missing hands that the other methods fail to generate. At the same time, our method also characterizes details of the generated results more accurately, such as the facial expression shown in the second row of Figure 12.
Fig. 13. Visual comparison for the ablation study. We show the generated results for the different settings of the ablation study. (a) Effect of the 3D constraints: with 3D constraints, the results are improved for occlusion movements such as bending legs (left) and in structural integrity such as neck synthesis (right). (b) Effect of the DE-Net: the DE-Net yields superior results in generating details in the face (left) and hand (right) areas.

Fig. 14. Failure cases. For each case, the source image is shown on the left and our transfer result on the right.
Figure 13 shows the advantage of using the 3D constraints and the DE-Net.
Although our model is able to synthesize motion-transferred images with high authenticity and detail, there are still several limitations. We show some failure cases with visual artifacts in Figure 14. In the left example, our model fails to eliminate the long hair of the source character in the result, while in the right example, an undesired part of the clothes appears in the generated image because of the loose source clothing. These failure cases are mainly attributed to abnormal movements of the source character, which cause large changes in the human body shape (e.g., perturbations in hair or clothing) and make the DE-Net fail to eliminate the extra details. Our future work will focus on improving the ability of the DE-Net to avoid the appearance of undesired details in the transfer results.
CONCLUSION
We have proposed a new approach to human motion transfer. It employs 3D body shape and pose constraints as a condition to regularize the generative adversarial learning framework, which is more expressive and complete than 2D. We also design an enhancement mechanism to reinforce the detail characteristics of the synthesized results using detailed information from real source frames. Extensive experiments show that our method outperforms existing methods both visually and quantitatively.

REFERENCES

[1] C. Chan, S. Ginosar, T. Zhou, and A. A. Efros, "Everybody Dance Now," 2018.
[2] T.-C. Wang, M.-Y. Liu, J.-Y. Zhu, G. Liu, A. Tao, J. Kautz, and B. Catanzaro, "Video-to-video synthesis," 2018.
[3] C. Bregler, M. Covell, and M. Slaney, "Video rewrite: driving visual speech with audio," in SIGGRAPH, 1997.
[4] A. A. Efros, A. C. Berg, G. Mori, and J. Malik, "Recognizing action at a distance," in IEEE International Conference on Computer Vision, Nice, France, 2003, pp. 726–733.
[5] J. Lee and S. Y. Shin, "A hierarchical approach to interactive motion editing for human-like figures," in Proceedings of the 26th Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH 1999, Los Angeles, CA, USA, August 8-13, 1999, W. N. Waggenspack, Ed. ACM, 1999, pp. 39–48. [Online]. Available: https://doi.org/10.1145/311535.311539
[6] C. Hecker, B. Raabe, R. W. Enslow, J. DeWeese, J. Maynard, and K. van Prooijen, "Real-time motion retargeting to highly varied user-created morphologies," in SIGGRAPH 2008, 2008.
[7] M. Mirza and S. Osindero, "Conditional generative adversarial nets," arXiv preprint arXiv:1411.1784, 2014.
[8] M. Liu and O. Tuzel, "Coupled generative adversarial networks," in Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, D. D. Lee, M. Sugiyama, U. von Luxburg, I. Guyon, and R. Garnett, Eds., 2016, pp. 469–477. [Online]. Available: http://papers.nips.cc/paper/6544-coupled-generative-adversarial-networks
[9] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, "Unpaired image-to-image translation using cycle-consistent adversarial networks," in Computer Vision (ICCV), 2017 IEEE International Conference on, 2017.
[10] T. Kim, M. Cha, H. Kim, J. K. Lee, and J. Kim, "Learning to discover cross-domain relations with generative adversarial networks," CoRR, vol. abs/1703.05192, 2017. [Online]. Available: http://arxiv.org/abs/1703.05192
[11] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, "Image-to-image translation with conditional adversarial networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1125–1134.
[12] A. Bansal, S. Ma, D. Ramanan, and Y. Sheikh, "Recycle-GAN: Unsupervised video retargeting," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 119–135.
[13] T.-C. Wang, M.-Y. Liu, J.-Y. Zhu, A. Tao, J. Kautz, and B. Catanzaro, "High-resolution image synthesis and semantic manipulation with conditional GANs," 2017.
[14] G. Balakrishnan, A. Zhao, A. V. Dalca, F. Durand, and J. V. Guttag, "Synthesizing images of humans in unseen poses," pp. 8340–8348, 2018.
[15] P. Esser, E. Sutter, and B. Ommer, "A variational U-Net for conditional appearance and shape generation," 2018.
[16] L. Ma, X. Jia, Q. Sun, B. Schiele, T. Tuytelaars, and L. Van Gool, "Pose guided person image generation," in Advances in Neural Information Processing Systems, 2017, pp. 405–415.
[17] L. Ma, Q. Sun, S. Georgoulis, L. Van Gool, B. Schiele, and M. Fritz, "Disentangled person image generation," in IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[18] Z. Cao, G. Hidalgo, T. Simon, S.-E. Wei, and Y. Sheikh, "OpenPose: realtime multi-person 2D pose estimation using Part Affinity Fields," arXiv preprint arXiv:1812.08008, 2018.
[19] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh, "Realtime multi-person 2D pose estimation using part affinity fields," in CVPR, 2017.
[20] T. Simon, H. Joo, I. Matthews, and Y. Sheikh, "Hand keypoint detection in single images using multiview bootstrapping," in CVPR, 2017.
[21] S.-E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh, "Convolutional pose machines," in CVPR, 2016.
[22] N. Neverova, R. Alp Guler, and I. Kokkinos, "Dense pose transfer," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 123–138.
[23] R. A. Güler, N. Neverova, and I. Kokkinos, "DensePose: Dense human pose estimation in the wild," 2018.
[24] H. Joo, T. Simon, and Y. Sheikh, "Total capture: A 3D deformation model for tracking faces, hands, and bodies," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[25] A. Kanazawa, M. J. Black, D. W. Jacobs, and J. Malik, "End-to-end recovery of human shape and pose," in Computer Vision and Pattern Recognition (CVPR), 2018.
[26] G. Pavlakos, V. Choutas, N. Ghorbani, T. Bolkart, A. A. A. Osman, D. Tzionas, and M. J. Black, "Expressive body capture: 3D hands, face, and body from a single image," in Proceedings IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2019.
[27] D. Xiang, H. Joo, and Y. Sheikh, "Monocular total capture: Posing face, body, and hands in the wild," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019.
[28] W. Liu, Z. Piao, M. Jie, W. Luo, L. Huang, and S. Gao, "Liquid warping GAN: A unified framework for human motion imitation, appearance transfer and novel view synthesis," in The IEEE International Conference on Computer Vision (ICCV), 2019.
[29] M. Meyer, M. Desbrun, P. Schröder, and A. H. Barr, "Discrete differential-geometry operators for triangulated 2-manifolds," in Visualization and Mathematics III. Springer, 2003, pp. 35–57.
[30] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[31] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, "Image quality assessment: from error visibility to structural similarity," IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, 2004.
[32] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, "The unreasonable effectiveness of deep features as a perceptual metric," in CVPR, 2018.
[33] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen, "Improved techniques for training GANs," in Advances in Neural Information Processing Systems, 2016, pp. 2234–2242.
[34] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, "GANs trained by a two time-scale update rule converge to a local Nash equilibrium," in Advances in Neural Information Processing Systems, 2017.