COMPUTER GRAPHICS forum
Volume xx ( ), Number z, pp. 1–11

Constructing Human Motion Manifold with Sequential Networks
Deok-Kyeong Jang and Sung-Hee Lee
Korea Advanced Institute of Science and Technology (KAIST)
Abstract
This paper presents a novel recurrent neural network-based method to construct a latent motion manifold that can represent a wide range of human motions in a long sequence. We introduce several new components to increase the spatial and temporal coverage in motion space while retaining the details of motion capture data. These include new regularization terms for the motion manifold, a combination of two complementary decoders for predicting joint rotations and joint velocities, and the addition of a forward kinematics layer to consider both joint rotation and position errors. In addition, we propose a set of loss terms that improve the overall quality of the motion manifold from various aspects, such as the capability of reconstructing not only the motion but also the latent manifold vector, and the naturalness of the motion through an adversarial loss. These components contribute to creating a compact and versatile motion manifold that allows for creating new motions by performing random sampling and algebraic operations, such as interpolation and analogy, in the latent motion manifold.
CCS Concepts
• Computing methodologies → Dimensionality reduction and manifold learning; Neural networks; Motion processing;
1. Introduction
Constructing a latent space for human motion is an important problem as it has a wide range of applications such as motion recognition, prediction, interpolation, and synthesis. Ideal motion spaces should be compact, in the sense that random sampling in the space leads to plausible motions, and comprehensive, so as to generate a wide range of human motions. In addition, a locally linear arrangement of the semantically related hidden vectors would benefit motion synthesis, e.g., by simple algebraic operations.

However, constructing a compact and versatile motion space and extracting valid motions from it remains a challenging problem because the body parts of the human body are highly correlated in general actions, and the joints are constrained to satisfy the bone lengths and the range of movement. The high dimensionality of the joint space adds additional difficulty to this problem.

In this paper, we present a novel framework to construct a latent motion manifold and to produce various human motions from the motion manifold. In order to embrace the temporal characteristics of human motion, our model is based on the sequence-to-sequence model. Unsupervised sequence-to-sequence models have been shown to be effective by previous studies on motion prediction [MBR17, PGA18]. Based on these studies, we develop several novel technical contributions to achieve a compact yet versatile latent motion manifold and a motion generation method, as follows.

First, our model is characterized by the combination of one encoder and two decoders. Given a motion manifold vector, one decoder learns to generate the joint rotations while the other learns to output joint rotation velocities. As will be discussed later, the joint rotation decoder has the advantage of reconstructing long term motions better. In comparison, the joint velocity decoder has the advantage of improving the continuity of the motion.
By complementing each other, our two-decoder model shows a higher reconstruction accuracy than that of a single-decoder model.

Second, unlike previous studies that deal with only either joint angles or joint positions, by adding a forward kinematics (FK) layer [VYCL18], our joint angle-based human representation achieves the advantage of satisfying bone-length constraints and simplifying joint limit representation. By additionally considering the joint positions computed by the FK layer while training, our method reduces the joint position error, which is visually more perceptible than the joint angle error.

Lastly, we introduce several loss functions, each of which contributes to enhancing the quality of the motion manifold in different aspects. A reconstruction loss reduces the difference between the reconstructed motion and the input motion and thus allows the manifold to synthesize motion content and details observed in the training motion dataset. A regularizer loss improves the distribution quality of the motion manifold and thus enables random sampling and interpolation on the manifold. In addition, an adversarial loss increases the naturalness of the motions generated from the motion manifold.

In this paper we show that, based on these technical contributions, our method allows for various practical applications such as random generation of motions, motion interpolation, motion denoising, and motion analogy, as will be shown in Sec. 5. The capability of our method is demonstrated by comparison with other approaches, such as the seq2seq model [MBR17] and the convolution model [HSKJ15, HSK16].

Figure 1:
Examples of motion interpolation on the latent motion manifold generated by our method. The first and last columns are snapshots of two input motions, and the intermediate columns show the snapshots of four individual motions obtained by linear interpolation on the motion manifold.
The remaining part of this paper proceeds as follows: After reviewing previous studies related to our work in Sec. 2, we present our method and loss functions in detail in Sec. 3. Section 4 details the data pre-processing, and Sec. 5 reports a number of experiments performed to verify the effectiveness of our method. Section 6 discusses the limitations of our work and future research directions, and concludes the paper. Our code and networks are available at https://github.com/DK-Jang/human_motion_manifold.
2. Related work
Researchers have developed several methods to construct motion manifolds that generate natural human motions, but compared with studies on manifold learning for other data such as images, research on motion data is scarce. Linear methods such as PCA can model human motion only in a local region. Chai et al. [CH05] apply local PCA to produce a motion manifold that includes a certain range of human motion, and apply it to synthesize movements from low dimensional inputs such as the positions of end effectors. Lawrence [Law04] uses the Gaussian Process Latent Variable Model (GPLVM) to find a low dimensional latent space for high dimensional motion data. Taylor et al. [THR07] propose a modified Restricted Boltzmann Machine that is able to deal with the temporal coherency of motion data. Lee et al. [LWB∗10] propose the motion fields method, a novel representation of motion data, which allows for creating human motion responsive to arbitrary external disturbances. Recently, with the development of deep learning technology, a method of constructing a motion manifold by using a Convolutional Neural Network (CNN)-based encoder was introduced by Holden et al. [HSKJ15, HSK16]. Butepage et al. [BBKK17] compare a number of deep learning frameworks for modeling human motion data.

Our method for constructing the motion manifold is based on previous studies on sequence learning for motion, which predict the joint position sequences of a 3D human body given past motions. Martinez et al. [MBR17] develop a novel sequence-to-sequence encoder-decoder model that predicts human motion given a short duration of past motion. The presented result is impressive but has a few limitations: sometimes implausible motions such as foot sliding are generated, and the initial pose of the predicted motion is somewhat discontinuous from the input motion.

Pavllo et al. [PGA18] selectively use a joint rotation-based loss for short term prediction and a joint position-based loss for long term prediction. The latter includes forward kinematics to compute the joint positions. However, the basic sequence-to-sequence model can only predict short term motions and has limitations in predicting non-trivial, long term motions. In addition, a loss function that minimizes only the prediction error does not guarantee a compact and versatile motion manifold. Our method solves these problems by jointly considering joint rotation and position errors in the loss function and by adding regularization to the motion manifold.

In a broader perspective, our work is related to studies on recognizing and generating human motion, which remains a challenging research topic due to the high dimensionality and dynamic nature of human motion.
Wu and Shao [WS14] propose a hierarchical dynamic framework that extracts top-level skeletal joint features and uses the learned representation to infer the probability of emissions, which in turn is used to infer motion sequences. Du et al. [DWW15] and Wang et al. [WW17] use recurrent neural networks (RNNs) to model temporal motion sequences and propose hierarchical structures for action recognition. With regard to motion synthesis, Mittelman et al. [MKSL14] propose a new class of Recurrent Temporal Restricted Boltzmann Machine (RTRBM). The structured RTRBM uses explicit graphs to model the dependency structure, which improves the quality of motion synthesis. Fragkiadaki et al. [FLFM15] propose the Encoder-Recurrent-Decoder (ERD) that combines representation learning with learning temporal dynamics for recognition and prediction of human body pose in videos and motion capture. Jain et al. [JZSS16] propose a structural RNN for combining the power of high-level spatio-temporal graphs.
3. Method
This section details our framework. After defining the notations used in this paper, we explain the structure of the network and the design of the loss function for training.

We denote the human motion set by Q and the corresponding random variable by Q. A motion with a time range of [t, t+∆t−1] is written as Q_{t:(t+∆t−1)} = [q_t, ..., q_{t+∆t−1}], where q_t denotes the pose at time t. A pose is represented with a set of joint angles written in exponential coordinates, i.e., q_t = [q_{t,i,x}, q_{t,i,y}, q_{t,i,z}]_{i=1}^{n_joint}, where (q_{t,i,x}, q_{t,i,y}, q_{t,i,z}) are the three components of the exponential coordinates and n_joint is the number of joints. Therefore, the dimension of a human motion is Q ∈ R^{∆t × n_joint × 3}. Lastly, p_t is the pose represented with the joint positions at time t corresponding to q_t, and P_{t:(t+∆t−1)} = [p_t, ..., p_{t+∆t−1}]. P is also a random variable of the motion set Q.

We construct the motion manifold in an end-to-end unsupervised way using a network of sequential networks, with an objective to minimize the difference between the ground truth motion space distribution and the reconstructed motion space distribution extracted from the latent motion manifold. To this end, we develop a sequential model that consists of RNNs with Gated Recurrent Units (GRUs). Our model has a sequence-to-sequence structure [MBR17], which is often used in machine translation. This RNN structure is effective for maintaining the temporal coherency of motion, and in our study it is trained to generate a fixed length of motion (150 frames). As shown in Fig. 2, our model includes the combination of one encoder and two decoders with a regularizer. The encoder takes the source motion as an input and maps it to the latent motion space. The regularizer encourages the encoded motion distribution to approximate a prior distribution.
The two decoders are designed to map the latent motion space to joint angles and joint velocities, respectively. Details of our model are given next.
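Before describing the network components, the motion representation above can be made concrete with a short sketch. The sizes follow the paper (150-frame clips; 17 joints after the joint reduction of Sec. 4), but the array contents here are placeholders:

```python
import numpy as np

# 150 frames (Delta t), 17 joints, 3 exponential coordinates per joint.
DELTA_T, N_JOINT = 150, 17

motion = np.zeros((DELTA_T, N_JOINT, 3))   # Q in R^{Delta t x n_joint x 3}

# The pose q_t at frame t, flattened into a single vector as fed to the
# recurrent encoder one frame at a time:
q_t = motion[0].reshape(-1)
assert motion.shape == (150, 17, 3)
assert q_t.shape == (N_JOINT * 3,)
```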
The encoder consists of a GRU and one linear layer, and Fig. 2 shows the unrolled schematic diagram of the encoder. The ∆t poses [q_t, ..., q_{t+∆t−1}] of a motion are input to the GRU sequentially. The GRU encodes the current frame while being conditioned by the previous frames through their hidden representation. Specifically, the pose q_i in the i-th frame is encoded as follows:

h^Enc_i = GRU_{W_Enc}(h^Enc_{i−1}, q_i),   (1)

where h^Enc_i is the hidden state at frame i, and W_Enc ∈ R^{n_joint × d_h} are the training parameters with d_h being the hidden dimension of the GRU. After the final pose of the input motion is read, one linear layer with parameter W_c ∈ R^{d_h × d_m} receives h^Enc_{t+∆t−1} and compresses it to produce the d_m-dimensional code Z ∈ Z, where Z denotes the motion manifold. It is worth mentioning that this compression brings the benefit of denoising the input data. Now the encoder mapping Enc : Q → Z is complete.

We adopt the Wasserstein regularizer for matching the distribution E_Z := E_{P_Q}[E(Z|Q)] of the motion manifold to the desired prior distribution P_Z. Unlike the variational auto-encoder [RMW14], sequential networks trained with the Wasserstein regularizer allow non-random encoders to deterministically map inputs to the latent codes, and this helps randomly sampled or interpolated points in the motion manifold correspond to plausible motions. Refer to [TBGS17] for more details about the Wasserstein regularizer.

Our decoder model consists of two kinds: one decoder learns the joint rotations and the other learns joint rotational velocities, as shown in Fig. 2. Both decoders are based on the GRU, while the connection structures of the two are different. Unlike the rotation decoder, the velocity decoder adds a residual connection between the input and the output to construct the joint rotations. Each decoder then generates the reconstructed joint angle sequence in reverse temporal order, as suggested by [SMS15].
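The encoder recursion of Eq. (1) followed by the linear compression can be sketched as follows. This is a minimal numpy stand-in, not the trained model: the GRU weights, hidden size d_h, and manifold size d_m are hypothetical, and bias terms are omitted for brevity:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h, d_m = 51, 16, 8   # 17 joints x 3 coords; d_h, d_m are hypothetical

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical GRU parameters standing in for W_Enc, one matrix per gate.
W = {k: 0.1 * rng.standard_normal((d_h, d_h + d_in)) for k in ("z", "r", "n")}

def gru_step(h_prev, q_i):
    """One step of Eq. (1): h_i = GRU_{W_Enc}(h_{i-1}, q_i)."""
    hx = np.concatenate([h_prev, q_i])
    z = sigmoid(W["z"] @ hx)                                   # update gate
    r = sigmoid(W["r"] @ hx)                                   # reset gate
    n = np.tanh(W["n"] @ np.concatenate([r * h_prev, q_i]))    # candidate state
    return (1.0 - z) * n + z * h_prev

# Encode a 150-frame clip pose by pose, then compress the final hidden state
# with the linear layer W_c to obtain the manifold code Z.
motion = rng.standard_normal((150, d_in))
h = np.zeros(d_h)
for q_i in motion:
    h = gru_step(h, q_i)
W_c = 0.1 * rng.standard_normal((d_h, d_m))
z_code = h @ W_c
assert z_code.shape == (d_m,)
```

The last hidden state summarizes the whole clip, so the single linear layer suffices to project it onto the d_m-dimensional manifold.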
The decoders are trained simultaneously with backpropagation. This dual decoder model is based on the idea of [SMS15]. By combining the two decoders, we can alleviate the limitations of the individual decoder models. The rotation decoder shows strength when reconstructing long term motions because it learns the joint angles themselves. Conversely, it may cause pose discontinuity between frames. The velocity decoder has the advantage of reconstructing continuous human motion as it outputs the difference between consecutive rotations, which is usually small and easier to learn. However, training velocities tends to be unstable over a long-term sequence because the longer the motion is, the more error is accumulated. As our two decoders have contrasting strengths and weaknesses, when combined, they complement each other in synergy.

Unlike previous studies on motion prediction, recognition, and manifolds [BBKK17, MBR17, HSKJ15, FLFM15, PGA18], in which either only the joint rotations or only the joint positions are used, our model considers both the joint rotations and positions in the motion reconstruction loss term L_R (see Eq. 8). A loss on joint angles has the advantage of preventing errors such as inconsistent bone lengths or deviation from the human range of motion, and thus learning with a joint angle loss can generate plausible motions. However, rotation prediction is often paired with a loss that averages errors over joints by giving each joint the same weight. Ignoring the varying influence of different joints on the reconstructed motion can yield large errors in the important joints and degrade the quality of the generated poses.

The joint position loss minimizes the averaged position errors over 3D points, which better reflects perceptual differences between poses. To combine both joint rotations and positions in the motion reconstruction loss L_R, we add a forward kinematics (FK) layer that computes the joint positions from the joint rotations.
This allows for calculating the loss between the joint positions of the target motion and the reconstructed motion. The FK module is valid for network training because its output is differentiable with respect to the joint rotations.

Finally, our method reconstructs the motion in the reverse order of the input sequence. Reversing the target sequence has an advantage in learning in that the first output frame of the decoder needs only to match the last input frame of the encoder, which allows for a continuous transition of hidden space vectors from the encoder to the decoders. Refer to [SMS15] for a theoretical background on this approach. Details of our decoders are explained next.

Joint Rotation Decoder
The unfolded schematic diagram of the joint rotation decoder is shown in the upper row in Fig. 2.

Figure 2: Structure of our sequential networks for constructing the motion manifold.

It first transforms an element of the motion manifold z ∈ Z to a d_h-dimensional hidden space vector with a linear layer of parameter W_re ∈ R^{d_m × d_h}. Then, conditioned by the hidden space vector representing the future frames, the GRU and a linear layer output the reconstructed pose q̂^r_i at the i-th frame given its next pose q̂^r_{i+1}:

h^{Dec_r}_i = GRU_{W_{Dec_r}}(h^{Dec_r}_{i+1}, q̂^r_{i+1}),   (2)
q̂^r_i = W_{r_o}^T h^{Dec_r}_i,   (3)

where W_{Dec_r} ∈ R^{n_joint × d_h} is the learning parameter of the GRU and W_{r_o} ∈ R^{d_h × n_joint} is the parameter of the linear layer.

Note that, as mentioned earlier, the decoder uses the reversed input motion as the target motion, so the reconstruction is performed in the order of Q̂_{(t+∆t−1):t} = [q̂_{t+∆t−1}, ..., q̂_t]. Unlike the encoder, the decoder uses the reconstructed result of the previous frame as the input [MBR17, LZX∗]. The initial input q̂^r_{t+∆t} to the GRU is set to zero because there is no reconstruction result of a previous frame. The reconstructed joint rotations are used to calculate the angle loss with respect to the target motion, and are also used to calculate the positions p̂^r_i through the FK layer:

p̂^r_i = ForwardKinematics(q̂^r_i).   (4)

After the last pose q̂_t is generated, the joint rotation decoder mapping Dec_r : Z → Q is complete.

Joint Velocity Decoder
The joint velocity decoder has a similar structure to the joint rotation decoder. The main difference is that it has a residual connection to generate q̂^v_i:

h^{Dec_v}_i = GRU_{W_{Dec_v}}(h^{Dec_v}_{i+1}, q̂^v_{i+1}),   (5)
q̂^v_i = W_{v_o}^T h^{Dec_v}_i + q̂^v_{i+1},   (6)
p̂^v_i = ForwardKinematics(q̂^v_i),   (7)

where W_{Dec_v} ∈ R^{n_joint × d_h} and W_{v_o} are the learning parameters. This residual network learns the difference between the current frame pose q̂^v_i and the previous frame pose q̂^v_{i+1}. Therefore, the model predicts the angle difference, or velocity, and integrates it over time. After the last pose is generated, the joint velocity decoder mapping Dec_v : Z → Q is complete.

We model a number of loss functions, each of which contributes to enhancing the quality of the motion generated from the motion manifold from a different perspective. To reduce the reconstruction error, we employ two kinds of loss functions: a motion reconstruction loss L_R that encourages a motion to be reconstructed after going through the encoder and decoder, and a manifold reconstruction loss L_M that helps a latent vector be reconstructed after going through the decoder and encoder. In addition, we include a Wasserstein loss L_W that penalizes the discrepancy between P_Z and the distribution E_Z induced by the encoder, and an adversarial loss L_G to achieve more natural motions from the motion manifold. Figure 3 shows an overview of our loss functions.

Figure 3: Each loss term is evaluated from the data processed in the network pipeline shown with black arrows. Red arrows indicate the data used for the individual loss terms.
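The FK layer used in Eqs. (4) and (7) can be sketched as follows. This is a toy illustration, not the paper's implementation: the skeleton (a 3-joint chain with hypothetical unit bone offsets) and the joint ordering are invented for the example; only the math (Rodrigues' formula plus recursive accumulation along the parent chain) is the standard FK computation:

```python
import numpy as np

def expmap_to_matrix(w):
    """Rodrigues' formula: exponential coordinates -> 3x3 rotation matrix."""
    theta = np.linalg.norm(w)
    if theta < 1e-8:
        return np.eye(3)
    k = w / theta
    K = np.array([[0., -k[2], k[1]], [k[2], 0., -k[0]], [-k[1], k[0], 0.]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def forward_kinematics(rotations, offsets, parents):
    """Joint positions from per-joint rotations, accumulated root-to-leaf.
    Bone lengths (the offsets) are fixed, so they are preserved by construction."""
    n = len(parents)
    glob_R = [None] * n
    pos = np.zeros((n, 3))
    for j in range(n):
        R = expmap_to_matrix(rotations[j])
        if parents[j] < 0:            # root joint
            glob_R[j] = R
        else:
            p = parents[j]
            glob_R[j] = glob_R[p] @ R
            pos[j] = pos[p] + glob_R[p] @ offsets[j]
    return pos

# Toy 3-joint chain with unit bone offsets along y.
parents = [-1, 0, 1]
offsets = np.array([[0., 0., 0.], [0., 1., 0.], [0., 1., 0.]])
rest = forward_kinematics(np.zeros((3, 3)), offsets, parents)
assert np.allclose(rest[2], [0., 2., 0.])
```

Every operation here (matrix products, sin/cos) is differentiable in the joint rotations away from the zero-angle branch, which is what makes the FK layer usable inside the training loss.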
Motion reconstruction loss
The motion reconstruction loss penalizes the difference between the motion and the reconstructed motion, which is obtained by encoding the motion followed by decoding it. Specifically, we measure the discrepancy of both the joint rotation angles q and the joint positions p as follows:

L_R = L_ang + w_p L_pos   (8)
L_ang = ∑_{i=1}^{n_joint} ‖q̂^r_i − q_i‖ + ‖q̂^v_i − q_i‖   (9)
L_pos = ∑_{i=1}^{n_joint} ‖p̂^r_i − p_i‖ + ‖p̂^v_i − p_i‖,   (10)

where ‖·‖ is the Euclidean norm and w_p (= 5 in our experiment) is the weight of the position error.

Manifold reconstruction loss
A latent code sampled from the latent distribution should be reconstructed after decoding and encoding. The manifold reconstruction loss encourages this reciprocal mapping between the motions and the manifold space. To this end, we apply a loss similar to [LTH∗]: we take Z from the encoded motion sequences and reconstruct it with Ẑ_r = Enc(Dec_r(Z)) and Ẑ_v = Enc(Dec_v(Z)), where Z = Enc(Q_{t:(t+∆t−1)}):

L_M = ‖Ẑ_r − Z‖ + ‖Ẑ_v − Z‖.   (11)

Wasserstein regularizer loss
In order to make the manifold space have a particular desired prior distribution, so that we can efficiently sample from the distribution, we use the Wasserstein regularizer, which penalizes the deviation of the distribution E_Z of the latent manifold from the desired prior distribution P_Z:

L_W = MMD_k(P_Z, E_Z),   (12)

where P_Z(Z) = N(Z; 0, σ_z · I_d) is modeled as a multivariate normal distribution, with σ_z decided through validation. We use the maximum mean discrepancy MMD_k to measure the divergence between the two distributions, with the inverse multi-quadratics kernel k(x, y) = C / (C + ‖x − y‖²), where C = Z_dim σ_z. The values of σ_z and Z_dim are set through validation.

Adversarial loss
Finally, we employ the least squares generative adversarial network (LSGAN) to match the distribution of the generated motions to the real motion data distribution, i.e., to promote motions generated by our model to be indistinguishable from real motions:

L_D = ∑_{Q̂_{t:(t+∆t−1)}} [D(Q̂_{t:(t+∆t−1)})]² + ∑_{Q_{t:(t+∆t−1)}} [D(Q_{t:(t+∆t−1)}) − 1]²   (13)
L_G = ∑_{Q̂_{t:(t+∆t−1)}} [D(Q̂_{t:(t+∆t−1)}) − 1]²,   (14)

where the discriminator D tries to distinguish between the reconstructed motions and the real motions. The discriminator is then used to help our decoders generate realistic motions.

Total loss
We jointly train the encoder, joint rotation decoder, joint velocity decoder, and discriminator to optimize the total objective function, which is a weighted sum of the reconstruction losses, the Wasserstein regularizer loss, and the adversarial loss. The total objective function of the manifold network is:

min_{Enc, Dec_r, Dec_v} L(Enc, Dec_r, Dec_v) = L_R + λ_M L_M + λ_W L_W + λ_G L_G   (15)

and the discriminator loss is:

min_D L(D) = λ_G L_D,   (16)

where the weighting parameters λ_M, λ_W, and λ_G are determined through validation.
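How the loss terms combine can be sketched as follows. The weight values here are hypothetical placeholders (the paper selects λ_M, λ_W, λ_G by validation), and the arrays stand in for network outputs; only w_p = 5 and the least-squares GAN form follow the text:

```python
import numpy as np

W_P = 5.0                                  # position-error weight from Eq. (8)
LAM_M, LAM_W, LAM_G = 0.5, 0.1, 0.001      # hypothetical; chosen by validation

def recon_loss(q_r, q_v, q, p_r, p_v, p):
    """L_R = L_ang + w_p * L_pos (Eqs. 8-10): per-joint Euclidean errors of both
    the rotation decoder (r) and the velocity decoder (v), summed over joints."""
    l_ang = np.sum(np.linalg.norm(q_r - q, axis=-1) + np.linalg.norm(q_v - q, axis=-1))
    l_pos = np.sum(np.linalg.norm(p_r - p, axis=-1) + np.linalg.norm(p_v - p, axis=-1))
    return l_ang + W_P * l_pos

def lsgan_losses(d_fake, d_real):
    """Eqs. (13)-(14): least-squares GAN losses from discriminator scores."""
    l_d = np.sum(d_fake ** 2) + np.sum((d_real - 1.0) ** 2)
    l_g = np.sum((d_fake - 1.0) ** 2)
    return l_d, l_g

def total_objective(l_r, l_m, l_w, l_g):
    """Eq. (15): the weighted sum minimized over Enc, Dec_r, and Dec_v."""
    return l_r + LAM_M * l_m + LAM_W * l_w + LAM_G * l_g

# Sanity checks: perfect reconstruction gives L_R = 0; a discriminator that
# scores fakes 0 and reals 1 gives L_D = 0.
q = np.ones((17, 3)); p = np.ones((17, 3))
assert recon_loss(q, q, q, p, p, p) == 0.0
l_d, l_g = lsgan_losses(np.zeros(4), np.ones(4))
assert l_d == 0.0 and l_g == 4.0
```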
4. Data pre-processing
We tested our method with the H3.6M dataset. Every motion in the dataset has the same skeletal structure. All the poses are represented with the position and orientation of the root and the joint rotations expressed in exponential coordinates. For training, motion clips of 150 frames are randomly selected from the input motion sequence and used to learn the motion manifold. The root position in the transverse plane is removed, and the other data are normalized for better performance. Below we explain how the motion dataset is processed.
H3.6M dataset
The H3.6M dataset [IPOS14] consists of 15 activities, such as walking, smoking, discussion, taking pictures, and phoning, performed by 7 subjects. We reduce the 32 joints in the original data to 17 joints by removing redundant joints, as done by [MBR17], and configured all data to have a frame rate of 25 Hz. Therefore, a 150-frame motion applied to our model covers 6 seconds. The activities of subject S5 were used as the test data, and those of the remaining subjects S1, S6, S7, S8, S9, and S11 were used as the training data. Some motion data contain noise such as joint popping, but they were used without noise removal.
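The clip extraction and normalization above can be sketched as follows. This is a sketch under stated assumptions, not the paper's pipeline: the 50 Hz source rate, the column layout (root position in columns 0–2, y up, so the transverse plane is x/z), and the per-clip normalization are our guesses for illustration:

```python
import numpy as np

def preprocess(motion, src_fps=50, dst_fps=25, clip_len=150, rng=None):
    """Downsample to 25 Hz, cut a random 150-frame clip (6 s), remove the
    transverse-plane root translation, and normalize. Assumes columns 0..2
    hold the root position with y up (hypothetical layout)."""
    rng = rng or np.random.default_rng(0)
    motion = motion[::src_fps // dst_fps]                # e.g. 50 Hz -> 25 Hz
    start = rng.integers(0, len(motion) - clip_len + 1)
    clip = motion[start:start + clip_len].copy()
    clip[:, [0, 2]] -= clip[0, [0, 2]]                   # drop absolute x/z root position
    mean, std = clip.mean(0), clip.std(0) + 1e-8
    return (clip - mean) / std                           # normalize for training

# A random-walk sequence standing in for a raw 50 Hz motion (54 channels).
seq = np.cumsum(np.random.default_rng(1).standard_normal((400, 54)), axis=0)
clip = preprocess(seq)
assert clip.shape == (150, 54)
assert np.allclose(clip.mean(0), 0.0, atol=1e-6)
```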
5. Experimental Results
We perform several experiments to evaluate the performance of our method. First, we compare the reconstruction accuracy of the proposed model with its own variations, with some components ablated, as well as with the sequence-to-sequence model proposed by [MBR17]. Next, we test random sampling, motion interpolation via the motion manifold, and motion denoising, followed by an experiment on motion analogies. For these tests, we use the joint rotation decoder to generate motions. We qualitatively compare the results of motion interpolation and motion analogies with those of [HSKJ15]†. All experiments were conducted with test sets not included in the training set. The supplemental video shows the resulting motions from the experiments.

† [HSKJ15] is not compared with ours with respect to the reconstruction quality as it deals only with joint positions and not joint angles.

We assess the accuracy of the reconstructed motion Q̂ with respect to the input motion Q, as well as the accuracy of the reconstructed motion manifold vector ẑ with respect to the motion manifold vector z obtained by encoding a motion. The results are provided in Table 1. Generally, the reconstruction accuracy and the data generation quality of a manifold conflict with each other to some degree. As our purpose is to achieve a motion manifold that supports not only motion reconstruction but also motion generation, it is important to strike a balance among various performance measures, and our method should not be evaluated only by the reconstruction accuracy. This trade-off will be discussed in Sec. 5.1.1.

The sequence-to-sequence model (Seq2seq) compared with ours is based on [MBR17]. The only difference is that a fully connected
layer of 64 dimensions is implemented between the encoder and the decoder to construct a motion manifold.

For the ablation study, we prepare a set of variations of our model. The most basic model, denoted S, has only the joint rotation decoder with the reconstruction and Wasserstein regularizer losses, without the FK layer in the network. The next model, D, is the dual decoder model obtained by adding the velocity decoder. From the dual model, we make variations by incrementally accumulating the FK layer (DK), the adversarial loss (DKG), and the manifold reconstruction loss (DKGM, our method). The last variation, DKGMZ, is made by concatenating the manifold vector to the decoder input, i.e., [q̂_{i+1}, Z] is used instead of q̂_{i+1} in Eqs. 3 and 5. The idea of this last variation is to prevent the decoder from forgetting the motion manifold vector. All variations have the same network weight dimensions and hyper-parameters as our model. The supplemental material includes details of implementing the compared models. All models are trained with datasets that include all action categories.

The accuracy of the motion reconstruction is evaluated for both the joint rotation decoder (Dec_r) and the joint velocity decoder (Dec_v). Both the Euclidean distances of the joint angle errors (L_ang, also denoted as E_r) and the joint position errors (L_pos or E_p) are used for each decoder for the reconstruction loss. As for the reconstruction quality of the motion manifold vector, we measure the Euclidean norm (E_z) of the difference between the motion manifold vector z obtained by encoding a motion sequence and the reconstructed vector ẑ_r obtained by sequentially decoding z and encoding it.

Figure 4:
Ground truth motions (green) and reconstruction results (coral) of our method from the H3.6M dataset.
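The per-interval error reporting used in Table 1 can be sketched as follows. The interval split (150 frames into 5 equal chunks) is our reading of the protocol, and the error sequence here is a placeholder:

```python
import numpy as np

def per_interval_errors(err_per_frame, n_intervals=5):
    """Split a per-frame error sequence (e.g., E_r or E_p over 150 frames)
    into equal intervals and average within each interval."""
    chunks = np.split(np.asarray(err_per_frame, dtype=float), n_intervals)
    return np.array([c.mean() for c in chunks])

# A synthetic, linearly growing per-frame error to show the interval means.
errs = per_interval_errors(np.arange(150.0))
assert errs.shape == (5,)
assert np.allclose(errs, [14.5, 44.5, 74.5, 104.5, 134.5])
```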
Table 1 shows the reconstruction errors of our method and others for the datasets containing all action categories (15 actions in the H3.6M dataset). The reported errors are the average over 30 motions randomly selected from a test dataset. The total of 150 frames is divided into 5 intervals, and the errors (E_r, E_p) are measured for each interval to investigate the temporal characteristics. The lowest and the next lowest errors are marked in bold and with underline, respectively.

Table 1: Reconstruction errors of joint angles (E_r) and joint positions (E_p) at sample time frames, and the reconstruction error of the manifold vector (E_z). The error is measured with respect to the general actions (all the actions in the DB) in the H3.6M dataset.

We first compare with respect to the E_r and E_p errors. Comparing S and D, the latter has lower E_r and E_p errors, which suggests that the joint rotation and velocity decoders complement each other to reduce the errors. Comparing D and DK, the latter reduces the E_p error significantly while only mildly sacrificing the E_r error. DKG has lower E_r and E_p errors than DK, but higher errors than D and S. This shows that the adversarial loss slightly reduces the reconstruction error. However, it turns out that the adversarial loss helps reconstruct the original behaviors, as will be discussed in Sec. 5.1.2. Examining the errors of DKGM and DKGMZ, we can see that adding the manifold reconstruction loss does not significantly affect the reconstruction errors, while explicitly feeding the manifold vector to the decoder helps reduce the errors.

Next, we examine the manifold reconstruction error, E_z (= L_M). Comparing D and S, it is remarkable that D reduces the E_z error even without any manifold-related loss term. However, adding the FK layer to reduce the joint position error slightly increases E_z for the velocity decoder, while it decreases for the rotation decoder. Comparing DK and DKG, we can see that the adversarial loss has a negligible effect on the manifold reconstruction error. Subsequently, DKGM reduces E_z slightly by adding the manifold reconstruction error, and DKGMZ achieves the lowest E_z error by explicitly feeding the manifold vector to the decoder.

Seq2seq [MBR17] shows a lower E_r than our model, but its E_p is higher. In addition, our model shows better E_z errors with respect to the rotation decoder. Figure 4 visualizes the reconstruction results of our model over time in comparison with the ground truth input motion.

This experiment examines the effect of different settings of the weight λ_W for the regularization on the reconstruction errors (L_ang and L_pos) and on the motion manifold (L_M) on the test set. We employed the D model for this experiment to exclude the effect of other loss terms. Figure 5 (a) and (b) show that the joint reconstruction errors decrease as λ_W becomes smaller, which makes the D model closer to a pure autoencoder, sacrificing the ability to enforce a prior over the motion manifold space while obtaining a better reconstruction loss. For the same reason, Fig. 5 (c) shows that the motion manifold reconstruction error L_M decreases as λ_W becomes larger. As our goal is to obtain an effective motion manifold that is able to generate realistic motions, it is important to find a suitable set of weight parameters that compromises among the different qualities.

Here we discuss the effects on motion quality of the adversarial loss (Sec. 3.3) and of explicitly feeding the motion manifold vector to the decoders. First, Table 1 shows that DKG decreases E_r from DK only slightly. However, Fig. 6 shows that DK cannot properly reconstruct the original motion, reconstructing only the posing motion from the original motion of posing while walking. In contrast, DKG improves the overall motion quality by better reconstructing the behaviors in the original motion. Comparing our method (DKGM) and DKGMZ, the latter results in lower E_r and E_p than our method, as shown in Table 1. However, Fig. 6 reveals that DKGMZ fails to capture the walking motion. We conjecture that directly feeding the manifold vector to the decoder reduces the reconstruction loss by explicitly retaining the motion manifold vector, but tends to converge to a mean pose. In contrast, our method successfully reconstructs the original posing-while-walking behavior. This observation suggests that, while the joint reconstruction error is an important indicator of motion quality, it may not appropriately assess motion quality in terms of reconstructing the original behaviors.
To verify whether the latent motion manifold can create meaningful motions, we randomly sampled from P_Z and decoded the samples to obtain motions. We extracted 30 random samples from the motion manifold learned with the H3.6M dataset. Figure 7 shows the results of random sampling from P_Z; one can see that our method can create various actions, including sitting, crossing the legs, and resting on a wall. This result suggests that our motion manifold and decoder can create a wide range of plausible behaviors.

To examine the importance of the WAE, we experimented with random sampling by replacing the WAE regularizer with a simple norm loss ‖z‖. Motions sampled with this method, as shown in Fig. 7 (right), often show unnatural poses and extreme joint rotations. This experiment shows that the WAE regularizer not only helps achieve the desired motion manifold distribution but also improves the quality of motion sampling.

Figure 5: Reconstruction errors of (a) joint angle (L_ang), (b) joint position (L_pos), and (c) manifold (L_M) with respect to λ_W over training steps, for the H3.6M dataset.

Figure 6: Reconstruction results of different loss combinations for a posing while walking motion. The supplementary video includes the full motions.

Figure 7: Results of randomly sampling motions from the motion manifold P_Z.

We can interpolate two different motions by encoding them into the latent motion manifold and then performing linear interpolation between the encoded motion manifold vectors. The resulting interpolated motion created by our method is not just a frame-by-frame interpolation, but may contain a meaningful transition between the input motions. For example, interpolating a sitting down motion and a photo taking motion creates a hand raising motion to prepare to take a picture from a sitting posture. When waiting and smoking motions are interpolated, an interesting motion is created in which the character seems tired of waiting and starts to smoke. The capability of creating such meaningful motions is due to the Wasserstein regularizer, which shortens the distance between the encoded vectors by matching the motion manifold to the multivariate normal prior. Figure 1 and the supplemental video show the interpolated motions.

Figure 8 compares our model with [HSKJ15] with respect to interpolation. See the supplementary material for the implementation of [HSKJ15]. For the interpolation from sitting to walking (top) and from sitting down to taking photo (bottom), our model shows a natural transition between the two motions, while [HSKJ15] creates a somewhat averaged motion between them.
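Both operations, random sampling from the prior and linear interpolation between encoded vectors, reduce to a few lines once a decoder is available. The sketch below uses NumPy with a stand-in linear decoder; the matrix `W`, the dimensions, and the function names are our placeholders, not the paper's trained networks.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in linear decoder from the 64-d manifold to a (frames, pose) motion.
# In the paper this would be the trained joint-rotation/velocity decoder.
Z_DIM, FRAMES, POSE_DIM = 64, 30, 54
W = rng.standard_normal((Z_DIM, FRAMES * POSE_DIM)) * 0.01

def decode(z):
    """Map a manifold vector to a motion clip (placeholder for the decoder)."""
    return (z @ W).reshape(FRAMES, POSE_DIM)

# Random sampling: the WAE regularizer matches the manifold to a
# multivariate normal prior, so new motions come from decoding z ~ N(0, I).
z_sample = rng.standard_normal(Z_DIM)
motion = decode(z_sample)

# Interpolation: encode two motions (here, two stand-in manifold vectors),
# linearly blend the vectors, and decode each blend.
z_a = rng.standard_normal(Z_DIM)
z_b = rng.standard_normal(Z_DIM)
blend = [decode((1.0 - t) * z_a + t * z_b) for t in np.linspace(0.0, 1.0, 5)]
```

Because the blending happens in the latent space rather than per frame, the decoded intermediate clips can contain transitions that neither input motion performs.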
Figure 8: Interpolation from sitting to walking (top) and from sitting down to taking photo (bottom), made by our model (left) and [HSKJ15] (right).
Our motion model can denoise motion data by projecting it onto the latent motion manifold and decoding the resulting manifold vector to obtain a reconstructed motion. Since the motion manifold is constructed only from human motion capture data, any element in the manifold is likely to be decoded into a natural motion. Therefore, a denoising effect occurs when noisy motion data is projected onto the motion manifold. We test the denoising capability of our method in a similar manner as [HSKJ15]: we generate noise-corrupted motion by randomly setting joint angles to zero with a probability of 0.5, which makes half of the joint angle information meaningless. Figure 9 shows the denoised results, which are quite similar to the ground truth motions.

Figure 9: Denoising experiment. Three poses are shown from the noise-corrupted motion (orange), the motion denoised by our method (coral), and the ground truth motion (green). Two motions (top and bottom) are shown.
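The mechanism can be illustrated with a toy linear manifold: if motions lie in a low-dimensional subspace, projecting a corrupted motion onto that subspace removes the noise component orthogonal to it. The subspace, dimensions, and least-squares "encoder" below are our own stand-ins, not the paper's networks.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy linear "manifold": motions lie in an 8-d subspace of a 90-d pose space.
Z_DIM, POSE_DIM, FRAMES = 8, 90, 60
W = rng.standard_normal((Z_DIM, POSE_DIM))       # manifold basis
P = np.linalg.pinv(W) @ W                        # projector onto the subspace

clean = rng.standard_normal((FRAMES, Z_DIM)) @ W  # motion on the manifold
mask = rng.random(clean.shape) > 0.5              # zero ~half the joint values
noisy = clean * mask

# "Encode then decode": project the corrupted motion back onto the manifold.
denoised = noisy @ P

err_noisy = np.linalg.norm(noisy - clean)
err_denoised = np.linalg.norm(denoised - clean)
assert err_denoised < err_noisy                  # projection shrinks the error
```

In the paper the projection is nonlinear (encoder followed by decoder), but the intuition is the same: the manifold only contains natural motions, so the off-manifold part of the corruption cannot survive the round trip.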
Through motion analogy, we can understand how our model organizes the motion manifold to represent the features of actions. Details about analogy can be found in [Whi16]. We perform algebraic operations with the latent vectors encoded from different motions and explore how the model organizes the latent space to represent motions. Figure 10 (a) shows that subtracting the motion manifold vector of a "sitting down" motion from a "taking photo with sitting down" motion creates a vector representing a "taking photo" motion. The character is standing because the zero vector in our motion manifold corresponds to an idle standing motion. Subsequently, when an encoded "walking" motion manifold vector is added, the result becomes a vector for a "taking photo with walking" motion. Figure 10 (b) shows a similar analogy among "walking", "smoking with walking", and "sitting" motions.

Figure 11 shows the result of performing the same analogies with [HSKJ15]. Figure 11 (top) shows taking photo (left) and taking photo with walking (right), corresponding to Fig. 10 (a), and Fig. 11 (bottom) shows smoking and smoking with sitting, to compare with Fig. 10 (b). One can see that the motion manifold obtained with [HSKJ15] does not support analogy operations on the motion manifold.
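The analogy above amounts to plain vector arithmetic in the latent space. A minimal sketch, with random stand-in vectors in place of the trained encoder's outputs:

```python
import numpy as np

rng = np.random.default_rng(2)

# Stand-in 64-d manifold vectors; in the paper these are produced by the
# trained encoder from the corresponding motion clips.
z_photo_sit = rng.standard_normal(64)   # "taking photo with sitting down"
z_sit = rng.standard_normal(64)         # "sitting down"
z_walk = rng.standard_normal(64)        # "walking"

# Subtracting "sitting down" isolates a "taking photo" direction; because
# the zero vector decodes to idle standing, this vector alone decodes to
# taking a photo while standing.
z_photo = z_photo_sit - z_sit

# Adding "walking" composes a "taking photo with walking" vector,
# which is then passed to the decoder.
z_photo_walk = z_photo + z_walk
```

Whether the decoded result is semantically meaningful depends entirely on how the manifold is organized, which is what this experiment probes.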
6. Conclusion and future work
In this paper, we presented a novel sequential network for constructing a latent motion manifold for modeling human motion. The main contributions of our method are the combined decoder for joint rotations and joint velocities, and the consideration of both joint rotations and positions by adding the FK layer to both decoders, which improves the reconstruction accuracy. In addition, we composed a set of loss functions, each of which contributes, from a different aspect, to enhancing the quality of the motions generated from the motion manifold space. The capabilities of our model have been examined through various experiments, such as random sampling, motion interpolation, denoising, and motion analogy.

Figure 10: Motion analogy experiments performing arithmetic operations in the motion manifold. (a) Motion analogy among "Walking with posing", "Walking" and "Sitting" actions. (b) Motion analogy among "Smoking with sitting", "Sitting" and "Walking" actions.

Figure 11: Motion analogy experiment with [HSKJ15].

Our method has several limitations. First, as a sequence-to-sequence framework, the performance of our model degrades if it is trained to produce motions longer than 10 seconds. The supplementary video shows randomly generated motions with our network trained to learn 300 frames (approx. 13 seconds); the resulting motions tend to lose details. This limitation may be alleviated by employing an attention mechanism [LPM15, BCB14]. Second, the encoded motions tend to be smoothed in the process of matching the latent motion manifold to the prior distribution through the regularizer. For example, motions that contain frequent hand shaking, such as the "walking with dog" or "discussion" motions in the H3.6M dataset, lose fine details when reconstructed. Overcoming these limitations will be important future work.

We only considered joint rotations in the encoder, but incorporating additional information, such as joint positions and velocities, may be beneficial to achieve better motion quality. In addition, in the process of learning a motion manifold, loss terms that check the validity of motions, such as joint limits, velocity limits, and foot sliding, are not needed, as all input motion data are considered valid. However, when an actual motion is sampled from the manifold and applied to an environment, such criteria may need to be checked.

Most studies on motion space learning have focused on representing a wide range of motion categories with a compact representation.
In fact, the range of motion categories is only one aspect of the variedness of human motions. Even a single motion category such as walking exhibits widely different styles depending on gender, body scale, emotion, and personality. Developing a motion manifold that can generate stylistic variations of motion is another important future research direction.

Acknowledgement
This work was supported by the Giga Korea Project (GK17P0200) and the Basic Science Research Program (NRF-2020R1A2C2011541) funded by the Ministry of Science and ICT, Korea.
References

[BBKK17] Bütepage J., Black M. J., Kragic D., Kjellström H.: Deep representation learning for human motion prediction and classification. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017), IEEE.

[BCB14] Bahdanau D., Cho K., Bengio Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014).

[BVJS15] Bengio S., Vinyals O., Jaitly N., Shazeer N.: Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems (2015), pp. 1171-1179.

[CH05] Chai J., Hodgins J. K.: Performance animation from low-dimensional control signals. In ACM Transactions on Graphics (TOG) (2005), vol. 24, ACM, pp. 686-696.

[DWW15] Du Y., Wang W., Wang L.: Hierarchical recurrent neural network for skeleton based action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015), pp. 1110-1118.

[FLFM15] Fragkiadaki K., Levine S., Felsen P., Malik J.: Recurrent network models for human dynamics. In Proceedings of the IEEE International Conference on Computer Vision (2015), pp. 4346-4354.

[HSK16] Holden D., Saito J., Komura T.: A deep learning framework for character motion synthesis and editing. ACM Transactions on Graphics (TOG) 35, 4 (2016), 138.

[HSKJ15] Holden D., Saito J., Komura T., Joyce T.: Learning motion manifolds with convolutional autoencoders. In SIGGRAPH Asia 2015 Technical Briefs (2015), ACM, p. 18.

[IPOS14] Ionescu C., Papava D., Olaru V., Sminchisescu C.: Human3.6M: Large scale datasets and predictive methods for 3D human sensing in natural environments. IEEE Transactions on Pattern Analysis and Machine Intelligence 36, 7 (2014), 1325-1339.

[JZSS16] Jain A., Zamir A. R., Savarese S., Saxena A.: Structural-RNN: Deep learning on spatio-temporal graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016), pp. 5308-5317.

[Law04] Lawrence N. D.: Gaussian process latent variable models for visualisation of high dimensional data. In Advances in Neural Information Processing Systems (2004), pp. 329-336.

[LPM15] Luong M.-T., Pham H., Manning C. D.: Effective approaches to attention-based neural machine translation. arXiv preprint arXiv:1508.04025 (2015).

[LTH*18] Lee H.-Y., Tseng H.-Y., Huang J.-B., Singh M., Yang M.-H.: Diverse image-to-image translation via disentangled representations. In Proceedings of the European Conference on Computer Vision (ECCV) (2018), pp. 35-51.

[LWB*10] Lee Y., Wampler K., Bernstein G., Popović J., Popović Z.: Motion fields for interactive character locomotion. In ACM Transactions on Graphics (TOG) (2010), vol. 29, ACM, p. 138.

[LZX*17] Li Z., Zhou Y., Xiao S., He C., Huang Z., Li H.: Auto-conditioned recurrent networks for extended complex human motion synthesis. arXiv preprint arXiv:1707.05363 (2017).

[MBR17] Martinez J., Black M. J., Romero J.: On human motion prediction using recurrent neural networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017), IEEE, pp. 4674-4683.

[MKSL14] Mittelman R., Kuipers B., Savarese S., Lee H.: Structured recurrent temporal restricted Boltzmann machines. In International Conference on Machine Learning (2014), pp. 1647-1655.

[PGA18] Pavllo D., Grangier D., Auli M.: QuaterNet: A quaternion-based recurrent model for human motion. arXiv preprint arXiv:1805.06485 (2018).

[RMW14] Rezende D. J., Mohamed S., Wierstra D.: Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082 (2014).

[SMS15] Srivastava N., Mansimov E., Salakhudinov R.: Unsupervised learning of video representations using LSTMs. In International Conference on Machine Learning (2015), pp. 843-852.

[TBGS17] Tolstikhin I., Bousquet O., Gelly S., Schoelkopf B.: Wasserstein auto-encoders. arXiv preprint arXiv:1711.01558 (2017).

[THR07] Taylor G. W., Hinton G. E., Roweis S. T.: Modeling human motion using binary latent variables. In Advances in Neural Information Processing Systems (2007), pp. 1345-1352.

[VYCL18] Villegas R., Yang J., Ceylan D., Lee H.: Neural kinematic networks for unsupervised motion retargetting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018), pp. 8639-8648.

[Whi16] White T.: Sampling generative networks. arXiv preprint arXiv:1609.04468 (2016).

[WS14] Wu D., Shao L.: Leveraging hierarchical parametric networks for skeletal joints based action segmentation and recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2014), pp. 724-731.

[WW17] Wang H., Wang L.: Modeling temporal dynamics and spatial configurations of actions using two-stream recurrent neural networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017).

Supplementary Material
A. Network structures and experimental setup
The following describes the details of the network structures of the models used for comparison and the hyperparameters of each model. For all models, we use a batch size of 30 for the H3.6M dataset. The number of training epochs is 500. The dimension of the motion manifold is 64.
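For quick reference, the shared hyperparameters above can be collected into a single configuration; the key names are ours, not taken from the authors' code.

```python
# Shared training hyperparameters stated in the text above.
# Key names are illustrative, not from the authors' implementation.
TRAIN_CONFIG = {
    "dataset": "H3.6M",
    "batch_size": 30,
    "epochs": 500,
    "manifold_dim": 64,
}
```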
A.1. Sequence-to-sequence model
The seq2seq model used in our comparison has a similar structure to that in [MBR17]. The only difference is that a fully connected layer for motion manifold generation exists between the motion manifold and the encoder/decoder. The encoder and decoder are implemented with a 1-layer Gated Recurrent Unit (GRU) with a 1024-dimensional hidden state. The decoder includes a residual network.
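A minimal PyTorch sketch of such an encoder/decoder pair is given below. The pose dimension (54), teacher forcing in the decoder, and all layer names are our assumptions for illustration, not the authors' code; the GRU size (1024) and manifold size (64) follow the text.

```python
import torch
import torch.nn as nn

class Seq2SeqAE(nn.Module):
    """Sketch of a seq2seq autoencoder in the spirit of Eq. (17):
    GRU encoder -> FC -> z, then FC -> residual GRU decoder -> FC."""
    def __init__(self, pose_dim=54, hidden=1024, z_dim=64):
        super().__init__()
        self.enc_gru = nn.GRU(pose_dim, hidden, batch_first=True)
        self.enc_fc = nn.Linear(hidden, z_dim)    # FC to the motion manifold
        self.dec_fc = nn.Linear(z_dim, hidden)    # FC from the manifold
        self.dec_gru = nn.GRU(pose_dim, hidden, batch_first=True)
        self.out_fc = nn.Linear(hidden, pose_dim)

    def forward(self, q):                    # q: (batch, T, pose_dim)
        _, h = self.enc_gru(q)               # h: (1, batch, hidden)
        z = self.enc_fc(h[-1])               # z: (batch, z_dim)
        h0 = self.dec_fc(z).unsqueeze(0)     # decoder state initialized from z
        out, _ = self.dec_gru(q, h0)         # teacher-forced for brevity
        q_hat = q + self.out_fc(out)         # residual connection
        return q_hat.flip(1), z              # Flip reverses the time axis
```

A forward pass on a batch of two 16-frame clips returns a reconstruction of the same shape and a 64-dimensional manifold vector per clip.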
Enc: Q_{t:(t+Δt−1)} ∈ Q → GRU → FC → Z ∈ Z
Dec: Z ∈ Z → FC → Res[GRU → FC] → Flip → Q̂_{t:(t+Δt−1)} ∈ Q,    (17)

where Res and FC denote the residual network and the fully connected layer, respectively.

For training, we use the Adam optimizer with a learning rate of 0.001 and a decay rate of 0.999 per training step, and clip the gradients to scale 1. The loss function is the same as our motion reconstruction loss.

A.2. Convolutional model
The convolutional model has the same structure as [HSKJ15, HSK16]. Unlike our model and the seq2seq model, the convolutional model uses joint positions P for training. Both the encoder and the decoder use one 1D convolutional layer with a temporal filter of width 15 and 256 hidden units. The encoder passes P_{t:(t+Δt−1)} through the convolutional layer, max pooling, ReLU, and finally dropout to map to the motion manifold space. The decoder generates P̂_{t:(t+Δt−1)} by passing the motion manifold vector Z through the inverse convolutional layer after upsampling.

Enc: P_{t:(t+Δt−1)} ∈ Q → Conv → MaxPool → ReLU → Dropout → Z ∈ Z
Dec: Z ∈ Z → UpSampling → InConv → P̂_{t:(t+Δt−1)} ∈ Q,    (18)

We use the Adam optimizer with a learning rate of 0.001 and a learning rate decay of 0.999. The loss function is the mean squared error between the reconstructed motion P̂_{t:(t+Δt−1)} and the ground truth motion P_{t:(t+Δt−1)}.

A.3. Our model
Our encoder and decoder architectures are detailed in Sec. 3. Here we describe the network weights and parameters used for training. Both the encoder and the decoder include one layer of GRU with a cell size of 1024 and dropout of 0.2. Our model has three fully connected layers: one from the encoder to the motion manifold, one from the motion manifold to the decoder, and one from the decoder to the output motion.

For the discriminator network used for the adversarial loss, we use a total of four 1D convolutional layers. The first three layers (Layers 1-3) have a kernel size of 4, a stride of 2, reflect padding of 1, and LeakyReLU activation with a leak of 0.2. Layer 4 has a kernel size of 1, a stride of 1, and ReLU activation. Layers 1 and 4 have 32 and 1 units, respectively; Layers 2 and 3 have 64 and 128 units, respectively, with batch normalization. For training, we use the Adam optimizer with a learning rate of 0.001 for both the motion manifold networks and the discriminator, and clip the GRU gradients by the global norm of 1.
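Given this layer specification, the temporal extent at each discriminator layer follows the standard 1D convolution output-length formula, (L + 2·pad − kernel) / stride + 1. The helper below (function names are ours) traces (length, channels) through the four layers; with kernel 4, stride 2, and padding 1, each of the first three layers halves the sequence length, and the kernel-1 fourth layer preserves it.

```python
def conv1d_out_len(length, kernel, stride, pad):
    """Standard 1D convolution output-length formula."""
    return (length + 2 * pad - kernel) // stride + 1

def discriminator_shapes(T):
    """Trace (length, channels) through the 4-layer discriminator described
    above: Layers 1-3 use kernel 4, stride 2, padding 1; Layer 4 uses
    kernel 1, stride 1. Channel counts 32/64/128/1 follow the text."""
    specs = [(32, 4, 2, 1), (64, 4, 2, 1), (128, 4, 2, 1), (1, 1, 1, 0)]
    shapes, length = [], T
    for channels, kernel, stride, pad in specs:
        length = conv1d_out_len(length, kernel, stride, pad)
        shapes.append((length, channels))
    return shapes
```

For a 32-frame input, `discriminator_shapes(32)` yields lengths 16, 8, 4, and 4 across the four layers.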