Learning to dance: A graph convolutional adversarial network to generate realistic dance motions from audio
João P. Ferreira (UFMG), Thiago M. Coutinho (UFMG), Thiago L. Gomes (UFOP), José F. Neto (UFMG), Rafael Azevedo (UFMG), Renato Martins (INRIA), Erickson R. Nascimento (UFMG)

Preprint to appear at Elsevier Computers and Graphics 2020. The official publication is available at https://doi.org/10.1016/j.cag.2020.09.009.
Abstract
Synthesizing human motion through learning techniques is becoming an increasingly popular approach to alleviating the requirement of new data capture to produce animations. Learning to move naturally from music, i.e., to dance, is one of the more complex motions humans often perform effortlessly. Each dance movement is unique, yet such movements maintain the core characteristics of the dance style. Most approaches addressing this problem with classical convolutional and recursive neural models undergo training and variability issues due to the non-Euclidean geometry of the motion manifold structure. In this paper, we design a novel method based on graph convolutional networks to tackle the problem of automatic dance generation from audio information. Our method uses an adversarial learning scheme conditioned on the input music audios to create natural motions preserving the key movements of different music styles. We evaluate our method with three quantitative metrics of generative methods and a user study. The results suggest that the proposed GCN model outperforms the state-of-the-art dance generation method conditioned on music in different experiments. Moreover, our graph-convolutional approach is simpler, easier to train, and capable of generating more realistic motion styles regarding qualitative and different quantitative metrics. It also presented a visual movement perceptual quality comparable to real motion data. The dataset and project are publicly available.
1. Introduction
One of the enduring grand challenges in Computer Graphics is to provide plausible animations to virtual avatars. Humans have a large set of different movements when performing activities such as walking, running, jumping, or dancing. Over the past several decades, modeling such movements has been delegated to motion capture systems. Despite remarkable results achieved by highly skilled artists using captured motion data, human motion has a rich spatiotemporal distribution with an endless variety of different motions. Moreover, human motion is affected by complex situation-aware aspects, including auditory perception, physical conditions such as the person's age and gender, and cultural background.

Synthesizing motions through learning techniques is becoming an increasingly popular approach to alleviating the requirement of capturing new real motion data to produce animations. Motion synthesis has been applied to a myriad of applications such as graphic animation for entertainment, robotics, and multimodal graphic rendering engines with human crowds [21], to name a few. The movements of each human being can be considered unique, having their own particularities, yet such movements preserve the characteristics of the motion style (e.g., walking, jumping, or dancing), and we are often capable of identifying the style effortlessly. When animating a virtual avatar, the ultimate goal is not only retargeting a movement from a real human to a virtual character but embodying motions that resemble the original human motion. In other words, a crucial step to achieve plausible animation is to learn the motion distribution and then draw samples (i.e., new motions) from it. For instance, a challenging human movement is dancing, where the animator does not aim to create avatars that mimic real poses but to produce a set of poses that match the music's choreography, while preserving the quality of being individual.

In this paper, we address the problem of synthesizing dance movements from music using adversarial training and a graph convolutional network architecture (GCN). Dancing is a representative and challenging human motion. Dancing is more than just performing pre-defined and organized locomotor movements; it comprises steps and sequences of self-expression. In dance moves, both the particularities of the dancer and the characteristics of the movement play an essential role in recognizing the dance style. Thus, a central challenge in our work is to synthesize a set of poses taking into account three main aspects: first, the motion must be plausible, i.e., a blind evaluation should present similar results when compared to real motions; second, the synthesized motion must retain all the characteristics present in a typical performance of the music's choreography; third, each new set of poses should not be strictly equal to another set; in other words, when generating a movement for a new avatar, we must retain the quality of being individual. Figure 1 illustrates our methodology.

Creating motions from sound relates to the paradigm of embodied music cognition. It couples perception and action, physical environmental conditions, and subjective user experiences (cultural heritage) [30]. Therefore, synthesizing realistic human motions regarding embodied motion aspects remains a challenging and active research field [13, 55].
Modeling distributions over movements is a powerful tool that can provide a large variety of motions while not removing the individual characteristics of each sample that is drawn. Furthermore, by conditioning these distributions, for instance, using an audio signal like music, we can select a sub-population of movements that match the input signal. Generative models have demonstrated impressive results in learning data distributions. These models have been improved over the decades through machine learning advances that broadened the understanding of learning models from data. In particular, advances in deep learning techniques yielded an unprecedented combination of effective and abundant techniques able to predict and generate data. The result was an explosion of highly accurate results in tasks of different fields, felt first and foremost in the Computer Vision community. From high accuracy scores in image classification using convolutional neural networks (CNN) to photo-realistic image generation with generative adversarial networks (GAN) [16], the Computer Vision field has benefited from several improvements in deep learning methods. Both the Computer Vision and Computer Graphics fields also achieved significant advances in processing multimodal data present in the scene by using several types of sensors. These advances are assigned to the recent rise of learning approaches, especially convolutional neural networks. These approaches have also been explored to synthesize data from multimodal sources, and audio is one of the sources achieving the most impressive results, as in the work presented by [9].

Figure 1. Our approach is composed of three main steps: First, given a music sound as input, we classify the sound according to the dance style; Second, we generate a temporally coherent latent vector to condition the motion generation, i.e., the spatial and temporal position of the joints that define the motion; Third, a generative model based on a graph convolutional neural network is trained in an adversarial manner to generate the sequences of human poses. To exemplify an application scenario, we render animations of virtual characters performing the motion generated by our method.

Most recently, networks operating on graphs have emerged as promising and effective approaches to deal with problems whose structure is known a priori. A representative approach is the work of Kipf and Welling [26], where a convolutional architecture that operates directly on graph-structured data is used in a semi-supervised classification task. Since graphs are natural representations for the human skeleton, several approaches using GCNs have been proposed in the literature to estimate and generate human motion. Yan et al. [55], for instance, presented a framework based on GCNs that generates a set of skeleton poses by sampling random vectors from a Gaussian process (GP). Despite being able to create sets of poses that mimic a person's movements, the framework does not provide any control over the motion generation. As stated, our methodology also synthesizes human movements using GCNs, but unlike Yan et al.'s work, we can control the style of the movement using audio data while preserving the plausibility of the final motions.
We argue that the movements of a human skeleton, which has a graph-structured model, follow complex sequences of poses that are temporally related, and that this set of defined and organized movements can be better modeled by a graph convolutional network trained in an adversarial regime.

In this context, we propose an architecture that manages audio data to synthesize motion. Our method starts by encoding a sound signal to extract the music style using a CNN architecture. The music style and a spatial-temporal latent vector are used to condition a GCN architecture that is trained in an adversarial regime to predict 2D human body joint positions over time. Experiments with a user study and quantitative metrics showed that our approach outperforms the state-of-the-art method and provides plausible movements while maintaining the characteristics of different dance styles.

The contributions of this paper can be summarized as follows:

• A new conditional GCN architecture to synthesize human motion based on auditory data. In our method, we push adversarial learning further to provide multimodal data learning with temporal dependence;

• A novel multimodal dataset with paired audio, motion data, and videos of people dancing to different music styles.
2. Related Work
Sound and Motion
Recently, we have witnessed an overwhelming growth of new approaches dealing with the tasks of transferring motion style and building animations of people from sounds. For example, Bregler et al. [4] create videos of a subject saying a phrase they did not speak originally by reordering the mouth images in the training input video to match the phoneme sequence of the new audio track. In the same direction, Weiss [52] applied a data-driven multimodal approach to produce a 2D video-realistic audio-visual "Talking Head", using F0 and Mel-Cepstrum coefficients as acoustic features to model the speech. Aiming to synthesize human motion according to music characteristics such as rhythm, speed, and intensity, Shiratori and Ikeuchi [42] established keyposes according to changes in the rhythm and in the performer's hands, feet, and center of mass. Then, they used music and motion feature vectors to select candidate motion segments that match the music and motion intensity. Despite the impressive results, the method fails when the keyposes are in fast segments of the music.

Cudeiro et al. [9] presented an encoder-decoder network that uses audio features extracted from DeepSpeech [19]. The network generates realistic 3D facial animations conditioned on subject labels to learn different individual speaking styles. To deform the human face mesh, Cudeiro et al. encode the audio features in a low-dimensional embedding space. Although their model is capable of generalizing facial mesh results to unseen subjects, they reported that the final animations were distant from the captured natural sequences. Moreover, the introduction of a new style is cumbersome since it requires a collection of 4D scans paired with audios. Ginosar et al. [13] enable translation from speech to gesture, generating arm and hand movements by mapping audio to pose. They used adversarial training, where a U-Net architecture transforms the encoded audio input into a temporal sequence of 2D poses. In order to produce more realistic results, the discriminator is conditioned on the differences between each pair of subsequently generated poses. However, their method is subject-specific and does not generalize to other speakers.

The work most closely related to ours is the approach proposed by Lee et al. [29]. The authors use a complex architecture to synthesize dance movements (expressed as a sequence of 2D poses) given a piece of input music. Their architecture is based on an elaborate decomposition-to-composition framework trained with an adversarial learning scheme. Our graph-convolutional based approach, in its turn, is simpler, easier to train, and generates more realistic motion styles regarding qualitative and different quantitative metrics.
Generative Graph Convolutional Networks
Since the seminal work of Goodfellow et al. [16], generative adversarial networks (GAN) have been successfully applied to a myriad of hard problems, notably the synthesis of new information, such as images [25], motion [6], and pose estimation [7], to name a few. Mirza and Osindero [36] proposed Conditional GANs (cGAN), which provide some guidance for the data generation. Reed et al. [41] synthesize realistic images from text, demonstrating that cGANs can also be used to tackle multimodal problems. Graph Convolutional Networks (GCN) recently emerged as a powerful tool for learning from data by leveraging geometric properties that are embedded beyond n-dimensional Euclidean vector spaces, such as graphs and simplicial complexes. In our context, conversely to classical CNNs, GCNs can model the motion manifold space structure [22, 56, 55]. Yan et al. [56] applied GCNs to model human movements and classify actions. After extracting 2D human poses for each frame of the input video, the skeletons are processed by a Spatial-Temporal Graph Convolutional Network (ST-GCN). Yan et al. proceeded to exploit the representation power of GCNs and presented the Convolutional Sequence Generation Network (CSGN) [55]. By sampling correlated latent vectors from a Gaussian process and using temporal convolutions, the CSGN architecture was capable of generating temporally coherent, long human body action sequences as skeleton graphs. Our method takes one step further than [56, 55]. It generates human skeletal-based graph motion sequences conditioned on acoustic data, i.e., music. By conditioning the movement distributions, our method learns not only to create plausible human motion, but also the signature movements of music styles from different domains.

Estimating and Forecasting Human Pose
Motion synthesis and motion analysis problems have benefited from the improvements in the accuracy of human pose estimation methods. Human pose estimation from images, in its turn, greatly benefited from the recent emergence of large datasets [32, 1, 18] with annotated positions of joints, and of dense correspondences from 2D images to 3D human shapes [5, 31, 54, 18, 28, 24, 27]. This large amount of annotated data has made possible important milestones towards predicting and modeling human motions [51, 17, 12, 11, 48]. The recent trend of time-series prediction with recurrent neural networks (RNN) became popular in several frameworks for human motion prediction [11, 35, 12]. Nevertheless, the pose error accumulation in the predictions mostly limits them to a short range of future frames [17]. Gui et al. [17] proposed to overcome this issue by applying adversarial training with two global recurrent discriminators that simultaneously validate the sequence-level plausibility of the prediction and its coherence with the input sequence. Wang et al. [48] proposed a network architecture to model the spatial and temporal variability of motions through a spatial component for feature extraction. Yet, these RNN models are known to be difficult to train and computationally cumbersome [37]. As also noted by [29], motions generated by RNNs tend to collapse to certain poses regardless of the inputs.
Transferring Style and Human Motion
Synthesizing motion with a specific movement style has been studied in a large body of prior works [44, 39, 50, 6, 15]. Most methods formulate the problem as transferring a specific motion style to an input motion [53, 44], or transferring the motion from one character to another, commonly referred to as motion retargeting [14, 8, 46]. Recent approaches explored deep reinforcement learning to model physics-based locomotion with a specific style [38, 33, 39]. Another active research direction is transferring motion from video to video [50, 6, 15]. However, the generation of stylistic motion from audio is less explored, and it is still a challenging research field. Villegas et al. [47] presented a video generation method based on high-level structure extraction, conditioning the creation of new frames on how this structure evolves in time, therefore preventing pixel-wise error prediction accumulation. Their approach was employed for long-term video prediction of humans performing actions by using 2D human poses as high-level structures.

Wang et al. [49] discussed how adversarial learning could be used to generate human motion using a sequence of autoencoders. The authors focused on three tasks: motion synthesis, conditional motion synthesis, and motion style transfer. Like our work, their framework enables conditional movement generation according to a style label parameterization, but there is no multimodality associated with it. Jang et al. [23] presented a method inspired by sequence-to-sequence models to generate a motion manifold. As a significant drawback, the performance of their method decreases when creating movements longer than 10 s, which makes the method inappropriate for generating long sequences. Our approach, on the other hand, can create long movement sequences conditioned on different music styles by taking advantage of the adversarial GCN's power to generate new long, yet recognizable, motion sequences.
3. Methodology
Our method has been designed to synthesize a sequence of 2D human poses resembling a human dancing according to a music style. Specifically, we aim to estimate a motion $M$ that provides the best fit for a given input music audio. $M$ is a sequence of $N$ human body poses defined as:

$M = [P_1, P_2, \cdots, P_N] \in \mathbb{R}^{N \times 25 \times 2}$,   (1)

where $P_t = [J_1, J_2, \cdots, J_{25}]$ is a graph representing the body pose in frame $t$ and $J_i \in \mathbb{R}^2$ are the 2D image coordinates of the $i$-th node of this graph (see Figure 2).

Our approach consists of three main components, outlined in Figure 3. We start by training a 1D-CNN classifier to define the input music style. Then, the result of the classification is combined with a spatial-temporal correlated latent vector generated by a Gaussian process (GP). The GP allows us to sample points of Gaussian noise from a distribution over functions, with a correlation between the points sampled for each function. Thus, we can draw points from functions with different frequencies. This variation in the signal frequency enables our model to infer which skeleton joints are responsible for more prolonged movements and to explore a large variety of poses. The latent vector aims at maintaining the spatial coherence of the motion for each joint over time. At last, we perform the human motion generation from the latent vector. In the training phase of the generator, we use the latent vector to feed a graph convolutional network that is trained in an adversarial regime on the dance style defined by an oracle algorithm. In the test phase, we replace the oracle with the 1D-CNN classifier. Thus, our approach has two training stages: i) the training of the audio classifier to be used in the test phase, and ii) the GCN training with an adversarial regime that uses the music style to condition the motion generation.
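To make the notation concrete, the following is a minimal sketch of the motion representation in Equation (1). It assumes the 25-joint OpenPose BODY_25 layout for the skeleton graph; the edge list is an illustrative fragment, not the exact topology used in our implementation.

```python
import numpy as np

N = 64          # number of poses in a motion sample M (Eq. 1)
J = 25          # assumed OpenPose BODY_25 joint count
M = np.zeros((N, J, 2), dtype=np.float32)   # M in R^{N x 25 x 2}

# Each pose P_t is a graph signal over the skeleton: J joints with (x, y)
# coordinates, connected by skeletal edges (parent, child). A few
# illustrative BODY_25-style edges (neck-nose, shoulders, arms):
edges = [(1, 0), (1, 2), (2, 3), (3, 4), (1, 5), (5, 6), (6, 7)]

P_t = M[0]      # P_t: the (25, 2) body pose at frame t = 0
```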
Our motion generation is conditioned by a latent vector that encodes information from the music style. In this context, we used the SoundNet [3] architecture as the backbone of a one-dimensional CNN. The 1D-CNN receives a sound in waveform and outputs the most likely music style considering three classes. The classifier is trained on a dataset composed of music files divided into three music-dance styles: Ballet, Salsa, and Michael Jackson (MJ). To find the best hyperparameters, we ran cross-validation and kept the best model to predict the music style that conditions the generator. Different from works [2, 20] that require 2D pre-processed sound spectrograms, our architecture is one-dimensional and works directly on the waveform.

Figure 2. Motion and skeleton notations. In our method, we used a skeleton with 25 2D joints.

In order to create movements that follow the music style, while keeping particularities of the motion and being temporally coherent, we build a latent vector that combines the extracted music style with a spatiotemporally correlated signal from a Gaussian process. It is noteworthy that our latent vector differs from the work of Yan et al. [55], since we condition our latent space using the information provided by the audio classification. The information used to condition the motion generation, and to create our latent space, is a trainable dense feature vector representation of each music style. The dense music style vector representation works as a categorical dictionary, which maps a dance style class to a higher-dimensional space.

Then, we combine temporally coherent random noise with the music style representation in order to generate coherent motions over time. Thus, the final latent vector is the result of concatenating the dense trainable representation of the audio class with the coherent temporal signal in the dimension of the features. This concatenation plays a key role in the capability of our method to generate synthetic motions with more than one dancing style when the audio is a mix of different music styles. In other words, unlike a vanilla conditional generative model, whose conditioning is limited to one class, we can condition over several classes over time.

The coherent temporal signals are sampled from a zero-mean Gaussian process with a Radial Basis Function (RBF) kernel [40] to enforce a temporal relationship among the $N$ frames. The process with covariance function $\kappa$ is given by $z_t^{(c)} \sim \mathcal{GP}(0, \kappa)$, where $z_t^{(c)}$ is the $c$-th component of $z_t$. The signal comprises $C$ functions of temporally coherent values. This provides a signal with shape $(C, T, V)$, where $C$ is interpreted as the channels (features) of our graph, $T$ is related to the length of the sequence we want to generate, and $V$ is the spatial dimension of our graph signal. The covariance function $\kappa$ is defined as:

$\kappa(t, t') = \exp\left(-\frac{|t - t'|^2}{\sigma_c^2}\right)$.   (2)

In our tests, we used $C = 512$, $T = 4$, $V = 1$, and $\sigma_c = \sigma\left(\frac{c_i}{C}\right)$, where $\sigma = 200$ was chosen empirically and $c_i$ takes every value from 1 to $C$.

The final tensor representing the latent vector has size $(2C, T, V)$, where $C$ and $T$ are the same as in the coherent temporal signal. Note that the length of the final sequence is proportional to the $T$ used in the creation of the latent vector; the final motion, after propagation through the motion generator, will have $N$ frames. Thus, we can generate samples for any FPS and length by changing the dimensions of the latent vector. Moreover, as the dimensions of the channels condition the learning, we can change the conditioning dance style over time.

The Gaussian process generates our random noise $z$, and the dense representation of the dance style is the variable $y$ used to condition our model. The combination of both is used as input for the generator.

To generate realistic movements, we use a graph convolutional neural network (GCN) trained with an adversarial strategy.
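To make the latent-vector construction concrete, the following is a minimal NumPy sketch that samples the temporally coherent noise from a zero-mean GP with the RBF kernel of Equation (2) and concatenates it with a dense style code along the channels, under the stated settings (C = 512, T = 4, V = 1, σ = 200). The style_embedding table is a hypothetical stand-in for the trainable dense representation.

```python
import numpy as np

C, T, V = 512, 4, 1      # channels, temporal length, graph vertices
SIGMA = 200.0            # empirically chosen base bandwidth

def sample_coherent_noise(rng: np.random.Generator) -> np.ndarray:
    """Sample z with shape (C, T, V) from a zero-mean GP with RBF kernel."""
    t = np.arange(T, dtype=np.float64)
    z = np.empty((C, T, V))
    for c in range(C):
        sigma_c = SIGMA * (c + 1) / C          # per-channel bandwidth sigma_c
        # kappa(t, t') = exp(-|t - t'|^2 / sigma_c^2), Equation (2)
        K = np.exp(-((t[:, None] - t[None, :]) ** 2) / sigma_c ** 2)
        z[c] = rng.multivariate_normal(np.zeros(T), K)[:, None]
    return z

# Hypothetical trainable dictionary mapping each style to a C-dim dense code.
style_embedding = {s: np.random.randn(C) for s in ("ballet", "mj", "salsa")}

def build_latent(style: str, rng: np.random.Generator) -> np.ndarray:
    """Concatenate style code and GP noise along channels -> (2C, T, V)."""
    z = sample_coherent_noise(rng)
    y = np.tile(style_embedding[style][:, None, None], (1, T, V))
    return np.concatenate([y, z], axis=0)

latent = build_latent("ballet", np.random.default_rng(0))
assert latent.shape == (2 * C, T, V)
```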
The key idea in adversarial conditional training is to learn the data distribution while two networks compete against each other in a minimax game. In our case, the motion generator G seeks to create motion samples similar to those in the motion training set, while the motion discriminator D tries to distinguish generated motion samples (fake) from real motions of the training dataset (real). Figure 3 illustrates the training scheme.
Generator

The architecture of our generator G is mainly composed of three types of layers: temporal and spatial upsampling operations, and graph convolutions. When using GCNs, one challenge that appears in adversarial training is the requirement of upsampling the latent vector in the spatial and temporal dimensions to fit the motion space $M$ (Equation 1).

The temporal upsampling layer consists of transposed 2D convolutions that double the time dimension, regardless of the input shape of each layer. Inspired by Yan et al. [55], we also included in our architecture a spatial upsampling layer. This layer operates using an aggregation function defined by an adjacency matrix $A^{\omega}$ that maps a graph $S(V, E)$ with $V$ vertices and $E$ edges to a bigger graph $S'(V', E')$ (see Figure 4). The network can learn the values of $A^{\omega}$ that lead to a good upsampling of the graph by assigning a different importance to each neighbor of the new set of vertices.

Figure 3. Motion synthesis conditioned on the music style. (a) GCN Motion Generator G; (b) GCN Motion Discriminator D; and (c) an overview of the adversarial training regime.

Figure 4. Graph scheme for upsampling and downsampling operations.

The first spatial upsampling layer starts from a graph with one vertex and then increases to a graph with three vertices. When creating the new vertices, the features $f_j$ from the initial graph $S$ are aggregated by $A^{\omega}$ as follows:

$f'_i = \sum_{k,j} A^{\omega}_{kij} f_j$,   (3)

where $f'_i$ contains the features of the vertices in the new graph $S'$ and $k$ indicates the geodesic distance between vertex $j$ and vertex $i$ in the graph $S'$.

In the first layer of the generator, we have one node containing the latent features (half from the Gaussian process and half from the audio representation). The features of the subsequent layers are computed by the operations of upsampling and aggregation. The last layer outputs a graph with 25 nodes containing the $(x, y)$ coordinates of each skeleton joint. For instance, in Figure 4 from right to left, we can see the upsampling operation, where we move from a graph with one vertex $S$ to a new graph containing three vertices $S'$. The aggregation function in $A^{\omega}$ is represented by the red links connecting the vertices between the graphs and the topology of graph $S'$. When $k = 0$, vertex $v_0$ is directly mapped to vertex $v'_0$ (i.e., the distance between $v_0$ and $v'_0$ is 0), all values are zeros except for $i = 0$, $j = 0$, and then $f'_0 = A^{\omega}_{0,0,0} f_0$. Following the example, when $k = 1$, we have $f'_1 = A^{\omega}_{1,1,0} f_0$ and $f'_2 = A^{\omega}_{1,2,0} f_0$.

After applying the temporal and spatial upsampling operations, our generator uses the graph convolutional layers defined by Yan et al. [56]. These layers are responsible for creating the spatio-temporal relationships between the graphs. The final architecture comprises repeated sets of temporal, spatial, and convolutional layers: first, temporal upsampling for a graph with one vertex, followed by an upsampling from one vertex to three vertices, then one convolutional graph operation. We repeat these three operations to upsample from three vertices to an intermediate graph, and finally to the 25 vertices that represent the final pose. Figure 3-(a) shows this GCN architecture.
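The spatial aggregation of Equation (3) amounts to a tensor contraction over the hop index k and the source vertices j. A minimal PyTorch sketch, assuming a trainable tensor A^ω of shape (K, V', V) and graph signals shaped (batch, channels, time, vertices), could look as follows; the shapes in the usage example are illustrative.

```python
import torch
import torch.nn as nn

class SpatialUpsample(nn.Module):
    """Maps a graph signal from v_in to v_out vertices via Eq. (3):
    f'_i = sum_{k,j} A[k, i, j] * f_j, with k indexing geodesic distance."""

    def __init__(self, v_in: int, v_out: int, max_hops: int = 2):
        super().__init__()
        # Trainable aggregation tensor A^w: one (v_out x v_in) map per hop k.
        self.A = nn.Parameter(torch.randn(max_hops, v_out, v_in) * 0.1)

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        # f: (batch, channels, time, v_in) -> (batch, channels, time, v_out)
        return torch.einsum('kij,bctj->bcti', self.A, f)

up = SpatialUpsample(v_in=1, v_out=3)   # first layer: 1 -> 3 vertices
f = torch.randn(8, 512, 4, 1)           # latent graph signal
print(up(f).shape)                      # torch.Size([8, 512, 4, 3])
```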
Discriminator

The discriminator D has the same architecture used by the generator, but with downsampling layers instead of upsampling layers. Thus, all transposed 2D convolutions are converted to standard 2D convolutions, and the spatial downsampling layers follow the same procedure as the upsampling operations but use an aggregation matrix $B^{\phi}$ with trainable weights $\phi$, different from the weights learned by the generator. Since the aggregation is performed from a large graph $S'$ to a smaller one $S$, the final aggregation is given by

$f_i = \sum_{k,j} B^{\phi}_{kij} f'_j$.   (4)

In the discriminator network, the feature vectors are assigned to each node as follows: the first layer contains a graph with 25 nodes, where the feature vectors are composed of the $(x, y)$ coordinates in a normalized space and the class of the input motion. In the subsequent layers, the features of each node are computed by the operations of downsampling and aggregation. The last layer contains only one node, which outputs the classification of the input data as fake or real. Figure 3-(b) illustrates the discriminator architecture.
Adversarial training

Given the motion generator and discriminator, our conditional adversarial network aims at minimizing the binary cross-entropy loss:

$\mathcal{L}_{cGAN}(G, D) = \min_G \max_D \left( \mathbb{E}_{x \sim p_{data}(x)}[\log D(x|y)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z|y)))] \right)$,   (5)

where the generator aims to maximize the error of the discriminator, while the discriminator aims to minimize the fake-real classification error shown in Equation 5. In particular, in our problem, $p_{data}$ represents the distribution of real motion samples, $x = M_{\tau}$ is a real sample from $p_{data}$, with $\tau \in [0, D_{size}]$ and $D_{size}$ the number of real samples in the dataset. Figure 3-(c) shows a concise overview of the steps in our adversarial training.

The latent vector, which is used by the generator to synthesize the fake samples $x'$, is represented by the variable $z$, the coherent temporal signal. The dense representation of the dance style is determined by $y$, and $p_z$ is the distribution of all possible temporally coherent latent vectors generated by the Gaussian process. The data used by the generator G in the training stage is the pair of the temporally coherent latent vector $z$ with a real motion sample $x$, plus the value of $y$ given by the music classifier that infers the dance style of the audio.

To improve the generated motion results, we use a motion reconstruction loss term applying the $L_1$ distance over all skeletons of the $N$ motion frames as follows:

$\mathcal{L}_{rec} = \frac{1}{N} \sum_{t=1}^{N} L_{pose}(P_t, P'_t)$,   (6)

with $P_t \in M$ being the generated pose and $P'_t \in M'$ a real pose from the training set, extracted with OpenPose [5]. The pose distance is computed as the average joint distance $L_{pose} = \frac{1}{25}\sum_{i} |J_i - J'_i|$, following the notation shown in Equation 1.

Thus, our final loss is a weighted sum of the motion reconstruction and cGAN discriminator losses, given by

$\mathcal{L} = \mathcal{L}_{cGAN} + \lambda \mathcal{L}_{rec}$,   (7)

where $\lambda$ weights the reconstruction term. The $\lambda$ value was chosen empirically and was fixed throughout the training stage. The initial guess for the magnitude of $\lambda$ followed the values chosen by Wang et al. [50].

We apply a cubic-spline interpolation to the final motion to remove eventual high-frequency artifacts from the generated motion frames $M$.
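A condensed sketch of one training step under this objective is given below. It realizes Equation (5) with binary cross-entropy on discriminator logits and Equation (6) with a mean L1 pose term, using the λ = 100 weight from the text; the generator G, discriminator D, and optimizers are assumed to be provided, and the exact update schedule is an illustrative choice.

```python
import torch
import torch.nn.functional as F

LAMBDA_REC = 100.0  # weight of the reconstruction term in Eq. (7)

def train_step(G, D, g_opt, d_opt, real_motion, z, y):
    """One adversarial step. real_motion: (B, 2, N, V) graph signals;
    z: coherent latent noise; y: dance-style code. D returns logits."""
    # --- Discriminator: minimize the fake-real classification error ---
    fake_motion = G(z, y).detach()
    d_real = D(real_motion, y)
    d_fake = D(fake_motion, y)
    d_loss = (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real))
              + F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # --- Generator: fool D (Eq. 5) and reconstruct poses (Eqs. 6-7) ---
    fake_motion = G(z, y)
    d_out = D(fake_motion, y)
    g_adv = F.binary_cross_entropy_with_logits(d_out, torch.ones_like(d_out))
    l_rec = torch.mean(torch.abs(fake_motion - real_motion))  # mean L1 pose term
    g_loss = g_adv + LAMBDA_REC * l_rec
    g_opt.zero_grad(); g_loss.backward(); g_opt.step()
    return d_loss.item(), g_loss.item()
```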
4. Audio-Visual Dance Dataset
We build a new dataset composed of paired videos of people dancing to different music styles. The dataset is used to train and evaluate methodologies for motion generation from audio. We split the samples into training and evaluation sets that contain multimodal data for three music/dance styles: Ballet, Michael Jackson, and Salsa. These two sets are composed of two data types: visual data from carefully selected parts of publicly available videos of dancers performing representative movements of the music style, and audio data from the styles we are training. Figure 5 shows some data samples of our dataset.

In order to collect meaningful audio information, several playlists from YouTube were chosen with the name of the style/singer as a search query. The audios were extracted from the resulting videos of the search and resampled to the standard audio frequency of 16 kHz. For the visual data, we started by collecting videos that matched the music style and had representative moves. Each video was manually cropped into parts of interest by selecting representative moves for each dance style present in our dataset. Then, we standardized the motion rate throughout the dataset and converted all videos to a common frame rate (FPS), maintaining a constant relationship between the number of frames and the speed of the actors' movements. We annotate the 2D human joint poses for each video by estimating the poses with OpenPose [5]. Each motion sample is defined as a set of 2D human poses of consecutive frames.

Figure 5. Video samples of the multimodal dataset with carefully annotated audio and 2D human motions of different dance styles (Ballet, Salsa, Michael Jackson).

To improve the quality of the estimated poses in the dataset, we handled the misdetection of joints by exploiting the body dynamics in the video. Since abrupt motions of the joints are not expected within a short interval of frames, we recreate a missing joint and apply the transformation chain of its parent joint. In other words, we infer the position of a missing child joint by making it follow its parent's movement over time. Thus, we can keep frames with a misdetected joint in our dataset.
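The sketch below illustrates one way to realize this rule: a joint missed by the detector in frame t is re-created from its last valid position and advanced by its parent joint's frame-to-frame displacement. This is a simplified, translation-only version of the parent transformation chain described above, and the parent table shown is illustrative.

```python
import numpy as np

# Illustrative parent table: PARENT[i] is the parent joint of joint i.
PARENT = {4: 3, 3: 2, 2: 1, 7: 6, 6: 5, 5: 1}

def fill_missing_joints(motion, valid):
    """motion: (N, J, 2) joint positions; valid: (N, J) bool detection mask.
    Re-creates each missing joint by following its parent's motion;
    mutates `valid` in place to mark filled joints as usable."""
    motion = motion.copy()
    N, J, _ = motion.shape
    for t in range(1, N):
        for j in range(J):
            if valid[t, j] or j not in PARENT:
                continue
            p = PARENT[j]
            # Displacement of the parent joint between frames t-1 and t.
            delta = motion[t, p] - motion[t - 1, p]
            # The child joint follows its parent's movement over time.
            motion[t, j] = motion[t - 1, j] + delta
            valid[t, j] = True
    return motion
```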
We also performed motion data augmentation to increase the variability and number of motion samples. We used the Gaussian process described in Section 3.2 to add temporally coherent noise over time to the joints lying on the legs and arms. We also performed temporal shifts (strides) to create new motion samples, as sketched below. For the training set, we collected samples and applied both the temporally coherent Gaussian noise and a temporal shift. For the evaluation set, we collected samples and applied only the temporal shift, using a different stride for Michael Jackson because of its lower number of samples (see Table 1). The temporal Gaussian noise was not applied to the evaluation set. The statistics of our dataset are shown in Table 1. The resulting audio-visual dataset contains thousands of coherent video, audio, and motion samples that represent characteristic movements for the considered dance styles.

We also performed evaluations with the same architecture and hyperparameters but without data augmentation; the performance on the Fréchet Inception Distance (FID) metric was worse than when using data augmentation. Moreover, we observed that the motions did not present variability, the dance styles were not well pictured, and, in the worst cases, body movements were difficult to notice.
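A sketch of the augmentation loop, reusing a GP noise sample as in Section 3; the limb-joint indices, noise scale, and shift stride are illustrative parameters rather than the values used to build the dataset.

```python
import numpy as np

LIMB_JOINTS = [2, 3, 4, 5, 6, 7, 9, 10, 11, 12, 13, 14]  # assumed arm/leg ids

def augment(motion, gp_noise, stride, noise_scale=0.01):
    """motion: (N, J, 2) full clip. Yields temporally shifted windows, each
    with temporally coherent GP noise added to the arm and leg joints.
    gp_noise: (window, len(LIMB_JOINTS), 2), coherent over time."""
    window = 64                       # training sample size N
    N = motion.shape[0]
    for start in range(0, N - window + 1, stride):
        sample = motion[start:start + window].copy()
        sample[:, LIMB_JOINTS] += noise_scale * gp_noise
        yield sample
```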
5. Experiments and Results
To assess our method, we conduct several experiments evaluating different aspects of motion synthesis from audio information. We also compared our method to the state-of-the-art technique proposed by Lee et al. [29], hereinafter referred to as D2M. We chose to compare our method to D2M since other methods have major drawbacks that make a comparison with our method unsuitable, such as the different skeleton structure in [13]. Unfortunately, due to the lack of some components in the publicly available implementation of D2M, a few adjustments were required in their audio preprocessing step. We standardized the input audio data by selecting the maximum length of the audio divisible by 28, defined as L, and reshaping it to a tensor whose dimensions match the input of their architecture.

The experiments are as follows: i) We performed a perceptual user study using a blind evaluation, with users trying to identify the dance style of the dance moves. For a generated dance video, we ask the user to choose which style (Ballet, Michael Jackson (MJ), or Salsa) the avatar in the video is dancing; ii) Aside from the user study, we also evaluated our approach on quantitative metrics commonly adopted in the evaluation of generative models, such as the Fréchet Inception Distance (FID), GAN-train, and GAN-test [43].

Table 1. Statistics of our dataset. The bold values are the numbers of samples used in the experiments.

                       | Training Dataset           | Evaluation Dataset
Setup                  | Ballet  MJ   Salsa  Total  | Ballet  MJ   Salsa  Total
w/o Data Augmentation  | 16      26   27     69     | 73      30   126    229
w/ Data Augmentation   | 525     966  861    2,352  | 134     102  235    471
Audio and Poses Preprocessing
Our one-dimensional audio CNN was trained with the Adam optimizer. Similar to [45], we preprocessed the input music audio using a µ-law non-linear transformation to reduce noise from audio inputs not appropriately recorded. We performed cross-validation to choose the best hyperparameters.

In order to handle the different shapes of the actors and to reduce the effect of translations in the 2D joint poses, we normalized the motion data used during the adversarial GCN training. We managed changes beyond body shape and translation, such as actors lying on the floor or bending forward, by selecting as scaling factor the diagonal of the bounding box encapsulating all 2D body joints $P_t$ of the frame. The normalized poses are given by:

$\bar{J}_i = \frac{1}{\delta}\left(J_i - \left(\frac{\Delta u}{2}, \frac{\Delta v}{2}\right)\right) + 0.5$,   (8)

where $\delta = \sqrt{(\Delta u)^2 + (\Delta v)^2}$, and $(\Delta u, \Delta v)$ are the differences between the right-top and left-bottom positions of the bounding box of the skeleton in the image coordinates $(u, v)$.
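A direct implementation of Equation (8), interpreting the subtracted offset as the bounding-box center of the frame's joints; a minimal sketch, not our exact preprocessing code.

```python
import numpy as np

def normalize_pose(P):
    """P: (J, 2) joint coordinates (u, v). Returns the normalized pose of
    Eq. (8): centered on the bounding-box center, scaled by its diagonal,
    and shifted so the skeleton lies around 0.5."""
    top_right = P.max(axis=0)           # (u_max, v_max)
    bottom_left = P.min(axis=0)         # (u_min, v_min)
    du, dv = top_right - bottom_left    # bounding-box extents (du, dv)
    delta = np.sqrt(du ** 2 + dv ** 2)  # diagonal used as scaling factor
    center = bottom_left + np.array([du / 2.0, dv / 2.0])
    return (P - center) / delta + 0.5
```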
Training

We trained our GCN adversarial model until additional epochs produced only slight improvements in the resulting motions. In our experiments, we select N = 64 frames, roughly corresponding to motions of three seconds at the dataset frame rate. We selected 64 frames as the size of our samples to follow a setup similar to the one presented in [13]. Moreover, [29] also adopted motion samples of 64 frames. In general, the motion sample size is a power of 2 because of the nature of the conventional convolutional layers adopted in both [13, 29]. However, it is worth noting that our method can synthesize longer motion sequences. We train on batches of motion sets of N frames each. We optimized the generator with the Adam optimizer and the discriminator with Stochastic Gradient Descent (SGD). We used λ = 100 in Equation 7. Dropout layers were used on both the generator and the discriminator to prevent overfitting.
Avatar Animations

As an application of our formulation, we animate three virtual avatars using the motions generated for different music styles. The image-to-image translation technique vid2vid [50] was selected to synthesize the videos. We trained vid2vid to generate new images for these avatars, following the multi-resolution protocol described in [50]. For inference, we feed vid2vid with the output of our GCN. We highlight that any motion style transfer method could be used with few adaptations, for instance, the works of [6, 15].
We conducted a perceptual study and collected the age, gender, Computer Vision/Machine Learning experience, and familiarity with different dance styles of each user. Figure 6 shows the profiles of the participants. The perceptual study was composed of randomly sorted tests. For each test, the user watches a video (with no sound) synthesized by vid2vid using a generated set of poses. Then we asked them to associate the motion performed in the synthesized video with one of the audio classes: Ballet, Michael Jackson, or Salsa. In each question, the users were supposed to listen to one audio sample of each class to help them classify the video. The set of questions was composed of videos of movements generated by our approach, videos generated by D2M [29], and videos of real movements extracted from our training dataset. We applied the same transformations to all data, and every video had an avatar performing the motion with a skeleton of approximately the same dimensions. We split the videos equally between the three dance styles.

From Table 2 and Figure 6, we draw the following observations: first, our method achieved a motion perceptual performance similar to the one obtained from real data. Second, our method outperformed the D2M method by a large margin. Thus, we argue that our method was capable of generating realistic movement samples, taking into account the two following aspects: i) our performance is similar to the results from real motion data in a blind study; ii)
users show higher accuracy in categorizing our generated motion. Furthermore, as far as the quality of a movement being individual is concerned, Figures 7 and 8 show that our method was also able to generate samples with motion variability.

We ran two statistical tests, the Difficulty Index and the Item Discrimination Index, to test the validity of the questions in our study. The Difficulty Index measures how easy an item is to answer, by determining the proportion of users who answered the question correctly, i.e., the accuracy. The Item Discrimination Index, on the other hand, measures how well a given test question can differentiate users who mastered the motion style classification from those who have not. Our methodology analysis was based on the guidelines described by Luger and Bowles [34]. Table 2 shows the average values of the indexes for all questions in the study. One can clearly observe that our method's questions had a higher difficulty index value, which means it was easier for the participants to answer them correctly, in some cases even easier than for the real motion data. Regarding the discrimination index, we point out that the questions cannot be considered good enough to separate the ability levels of those who took the test, since items with low discrimination index values are not considered good selectors [10]. These results suggest that our method and the videos obtained from real sequences look natural to most users, while the videos generated by [29] were confusing.

Figure 6. The plots a), b), c), and d) show the profile distribution of the participants of our user study; the plots e), f), g), and h) show the results of the study. In the semi-circle plots, each stacked bar represents one user evaluation, and the colors of each stacked bar indicate the dance styles (Ballet = yellow, Michael Jackson (MJ) = blue, and Salsa = purple). e) Results for all users who fully answered our study; f) Results for the top-scoring and bottom-scoring user groups; g) Results for the top-scoring users; h) Results for the worst-scoring users.

Table 2. Quantitative metrics from the user perceptual study: average Difficulty Index and Item Discrimination Index (both better closer to 1) for D2M, our method, and real motion data, per dance style (Ballet, MJ, Salsa) and on average.
For a more detailed performance assessment regarding the similarity between the learned distributions and the real ones, we use the commonly adopted Fréchet Inception Distance (FID). We computed the FID values using motion features extracted from the action recognition ST-GCN model presented in [56], similar to the metric used in [55, 29]. We trained the ST-GCN model several times using the same set of hyperparameters; the trained models achieved high accuracy scores for almost all training trials. The data used to train the feature vector extractor was not used to train any of the methods evaluated in this paper. Table 3 shows the results for the FID metric.

We also computed the GAN-Train and GAN-Test metrics, two well-known GAN evaluation metrics [43]. To compute the GAN-Train metric, we trained the ST-GCN model on a set composed of dance motion samples generated by our method, and on another set with motions generated by D2M. Then, we tested the models on the evaluation set (real samples). The GAN-Test values were obtained by training the same classifier on the evaluation set and testing it on the sets of generated motions. For each metric, we ran several training rounds and report the average accuracy with the standard deviation in Table 3. Our method achieved superior performance compared to D2M.

We can also note that the generator performs better for some dance styles. Since some motions are more complicated than others, our generator is better at synthesizing the less complicated motions related to a particular dance style's audio class. For instance, the Michael Jackson style contains a richer set of motions, with the skeleton joints rotating and translating in a variety of configurations. The Ballet style, on the other hand, is composed of fewer poses and is consequently easier to synthesize.

Figure 7. Results of our approach in comparison to D2M [29] for Ballet, the dance style shared by both methods.

Table 3. Quantitative evaluation according to the FID (better closer to 0), GAN-Train (better closer to 1), and GAN-Test (better closer to 1) metrics for D2M, our method, and real data, reported as mean ± standard deviation per dance style (Ballet, MJ, Salsa) and on average.
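For reference, the FID between two feature sets compares Gaussian fits of their distributions, FID = ||μ_r − μ_g||² + Tr(Σ_r + Σ_g − 2(Σ_r Σ_g)^{1/2}). The sketch below is a generic implementation of the metric over ST-GCN feature matrices (one row per sample), not our exact evaluation code.

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to two feature sets,
    each of shape (num_samples, feature_dim)."""
    mu_r, mu_g = feats_real.mean(0), feats_gen.mean(0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):        # discard tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```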
Figures 7, 8, and 9 show some qualitative results. We can notice that the sequences generated by D2M present some characteristics clearly inherent to the dance style, but they are not present along the whole sequence. For instance, in Figure 7, one can see that the last generated skeleton/frame looks like a spin, usually seen in ballet performances, but the previous poses do not indicate any correlation to this dance style. Conversely, our method generates poses commonly associated with ballet movements, such as rotating the torso with stretched arms.

Figure 8 shows that, for all three dance styles, the movement signature was preserved. Moreover, Experiment 1 in Figure 9 demonstrates that our method is highly responsive to audio style changes, since our classifier acts sequentially on subsequent music portions. This enables it to generate videos where the performer executes movements from different styles. Together, these results show that our method holds the ability to create highly discriminative and plausible dance movements. Notice that, qualitatively, we outperformed D2M for all dance styles, including the Ballet style, which D2M was specifically designed to address. Experiment 2 in Figure 9 also shows that our method can generate different sequences from a given input music. Since our model is conditioned on the music style from the audio classification pipeline, and not on the music itself, our method exhibits the capacity of generating varied motions while still preserving the learned motion signature of each dance style.

Figure 8. Qualitative results using audio sequences with different styles. In each sequence, the first row shows the input audio; the second row, the sequence of skeletons generated with our method; and the third row, the animation of an avatar by vid2vid using our skeletons.

Figure 9. Experiment 1 shows the ability of our method to generate different sequences with smooth transitions from one given input audio composed of different music styles. Experiment 2 illustrates that our method can generate different sequences from a given input music.
6. Conclusion
In this paper, we proposed a new method for synthesizing human motion from music. Unlike previous methods, we use graph convolutional networks trained in an adversarial regime to address the problem. We use audio data to condition the motion generation and produce realistic human movements with respect to a dance style. We achieved superior qualitative and quantitative performance compared to the state of the art: our method outperformed Dancing to Music in terms of the FID, GAN-Train, and GAN-Test metrics. We also conducted a user study, which showed that our method received scores similar to real dance movements, which was not observed for the competitors.

Moreover, we presented a new dataset with audio and visual data, carefully collected to train and evaluate algorithms designed to synthesize human motion in dance scenarios. Our method and the dataset are one step towards fostering new approaches for generating human motions.

As future work, we intend to extend our method to infer 3D human motions, which will allow us to use the generated movements in different animation frameworks. We also plan to increase the dataset size by adding more dance styles.
Acknowledgements
The authors thank CAPES, CNPq, and FAPEMIG for funding this work. We also thank NVIDIA for the donation of a Titan XP GPU used in this research.
References

[1] M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele. 2D human pose estimation: New benchmark and state of the art analysis. In CVPR, 2014.
[2] Relja Arandjelovic and Andrew Zisserman. Objects that sound. In ECCV, 2018.
[3] Yusuf Aytar, Carl Vondrick, and Antonio Torralba. SoundNet: Learning sound representations from unlabeled video. In NIPS, 2016.
[4] Christoph Bregler, Michele Covell, and Malcolm Slaney. Video rewrite: Driving visual speech with audio. In SIGGRAPH, 1997.
[5] Z. Cao, G. Hidalgo Martinez, T. Simon, S. Wei, and Y. A. Sheikh. OpenPose: Realtime multi-person 2D pose estimation using part affinity fields. TPAMI, pages 1–1, 2019.
[6] C. Chan, S. Ginosar, T. Zhou, and A. Efros. Everybody dance now. In ICCV, 2019.
[7] Y. Chen, C. Shen, X. Wei, L. Liu, and J. Yang. Adversarial PoseNet: A structure-aware convolutional network for human pose estimation. In ICCV, 2017.
[8] Kwang-Jin Choi and Hyeong-Seok Ko. On-line motion retargeting. JVCA, 11:223–235, 2000.
[9] Daniel Cudeiro, Timo Bolkart, Cassidy Laidlaw, Anurag Ranjan, and Michael J. Black. Capture, learning, and synthesis of 3D speaking styles. In CVPR, 2019.
[10] R. L. Ebel and D. A. Frisbie. Essentials of Educational Measurement. Prentice Hall, 1991.
[11] K. Fragkiadaki, S. Levine, P. Felsen, and J. Malik. Recurrent network models for human dynamics. In ICCV, 2015.
[12] P. Ghosh, J. Song, E. Aksan, and O. Hilliges. Learning human motion models for long-term predictions. In 3DV, 2017.
[13] S. Ginosar, A. Bar, G. Kohavi, C. Chan, A. Owens, and J. Malik. Learning individual styles of conversational gesture. In CVPR, 2019.
[14] Michael Gleicher. Retargetting motion to new characters. In SIGGRAPH, 1998.
[15] T. L. Gomes, R. Martins, J. Ferreira, and E. R. Nascimento. Do As I Do: Transferring human motion and appearance between monocular videos with spatial and temporal constraints. In WACV, 2020.
[16] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, 2014.
[17] Liang-Yan Gui, Yu-Xiong Wang, Xiaodan Liang, and José M. F. Moura. Adversarial geometry-aware human motion prediction. In ECCV, 2018.
[18] R. A. Güler, N. Neverova, and I. Kokkinos. DensePose: Dense human pose estimation in the wild. In CVPR, 2018.
[19] Awni Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen, Ryan Prenger, Sanjeev Satheesh, Shubho Sengupta, Adam Coates, and Andrew Y. Ng. Deep Speech: Scaling up end-to-end speech recognition. arXiv, 2014.
[20] Shawn Hershey, Sourish Chaudhuri, Daniel P. W. Ellis, Jort F. Gemmeke, Aren Jansen, Channing Moore, Manoj Plakal, Devin Platt, Rif A. Saurous, Bryan Seybold, Malcolm Slaney, Ron Weiss, and Kevin Wilson. CNN architectures for large-scale audio classification. In ICASSP, 2017.
[21] Katsushi Ikeuchi, Zhaoyuan Ma, Zengqiang Yan, Shunsuke Kudoh, and Minako Nakamura. Describing upper-body motions based on labanotation for learning-from-observation robots. IJCV, 126(12):1415–1429, 2018.
[22] A. Jain, A. R. Zamir, S. Savarese, and A. Saxena. Structural-RNN: Deep learning on spatio-temporal graphs. In CVPR, 2016.
[23] Deok-Kyeong Jang and Sung-Hee Lee. Constructing human motion manifold with sequential networks. Computer Graphics Forum, 2020.
[24] A. Kanazawa, J. Y. Zhang, P. Felsen, and J. Malik. Learning 3D human dynamics from video. In CVPR, 2019.
[25] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. In ICLR, 2018.
[26] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In ICLR, 2017.
[27] Muhammed Kocabas, Nikos Athanasiou, and Michael J. Black. VIBE: Video inference for human body pose and shape estimation. In CVPR, 2020.
[28] N. Kolotouros, G. Pavlakos, M. Black, and K. Daniilidis. Learning to reconstruct 3D human pose and shape via model-fitting in the loop. In ICCV, 2019.
[29] Hsin-Ying Lee, Xiaodong Yang, Ming-Yu Liu, Ting-Chun Wang, Yu-Ding Lu, Ming-Hsuan Yang, and Jan Kautz. Dancing to music. In NIPS, 2019.
[30] Marc Leman. The role of embodiment in the perception of music. Empirical Musicology Review, 9(3-4):236–246, 2014.
[31] J. Li, C. Wang, H. Zhu, Y. Mao, H. Fang, and C. Lu. CrowdPose: Efficient crowded scenes pose estimation and a new benchmark. In CVPR, 2019.
[32] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
[33] Libin Liu and Jessica Hodgins. Learning basketball dribbling skills using trajectory optimization and deep reinforcement learning. ACM Trans. Graph., 37(4), 2018.
[34] Sarah K. K. Luger and Jeff Bowles. Comparative methods and analysis for creating high-quality question sets from crowdsourced data. In FLAIRS. AAAI, 2016.
[35] J. Martinez, M. J. Black, and J. Romero. On human motion prediction using recurrent neural networks. In CVPR, 2017.
[36] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv, 2014.
[37] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. In ICML, 2013.
[38] Xue Bin Peng, Glen Berseth, KangKang Yin, and Michiel van de Panne. DeepLoco: Dynamic locomotion skills using hierarchical deep reinforcement learning. ACM Trans. Graph., 36(4), 2017.
[39] Xue Bin Peng, Angjoo Kanazawa, Jitendra Malik, Pieter Abbeel, and Sergey Levine. SFV: Reinforcement learning of physical skills from videos. ACM Trans. Graph., 37(6), 2018.
[40] Carl Edward Rasmussen. Gaussian processes in machine learning. In Summer School on Machine Learning. Springer, 2003.
[41] Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. Generative adversarial text to image synthesis. In ICML, 2016.
[42] Takaaki Shiratori and Katsushi Ikeuchi. Synthesis of dance performance based on analyses of human motion and music. I&MT, 3(4):834–847, 2008.
[43] Konstantin Shmelkov, Cordelia Schmid, and Karteek Alahari. How good is my GAN? In ECCV, 2018.
[44] Harrison Jesse Smith, Chen Cao, Michael Neff, and Yingying Wang. Efficient neural networks for real-time motion style transfer. Proceedings of the ACM on Computer Graphics and Interactive Techniques, 2(2):1–17, 2019.
[45] Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. WaveNet: A generative model for raw audio. In ISCA Speech Synthesis Workshop, 2016.
[46] Ruben Villegas, Jimei Yang, Duygu Ceylan, and Honglak Lee. Neural kinematic networks for unsupervised motion retargetting. In CVPR, 2018.
[47] Ruben Villegas, Jimei Yang, Yuliang Zou, Sungryull Sohn, Xunyu Lin, and Honglak Lee. Learning to generate long-term future via hierarchical prediction. In ICML, 2017.
[48] H. Wang, E. S. L. Ho, H. P. H. Shum, and Z. Zhu. Spatio-temporal manifold learning for human motions via long-horizon modeling. IEEE TVCG, pages 1–1, 2019.
[49] Qi Wang, Thierry Artières, Mickael Chen, and Ludovic Denoyer. Adversarial learning for modeling human motion. The Visual Computer, 36(1):141–160, 2020.
[50] Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Guilin Liu, Andrew Tao, Jan Kautz, and Bryan Catanzaro. Video-to-video synthesis. In NIPS, 2018.
[51] X. Wang, Q. Chen, and W. Wang. 3D human motion editing and synthesis: a survey. Computational and Mathematical Methods in Medicine, 2014:104535, 2014.
[52] Christian Weiss. FSM and k-nearest-neighbor for corpus based video-realistic audio-visual synthesis. In INTERSPEECH, 2005.
[53] Shihong Xia, Congyi Wang, Jinxiang Chai, and Jessica Hodgins. Realtime style transfer for unlabeled heterogeneous human motion. ACM Trans. Graph., 34(4):1–10, 2015.
[54] Yuliang Xiu, Jiefeng Li, Haoyu Wang, Yinghong Fang, and Cewu Lu. Pose Flow: Efficient online pose tracking. In BMVC, 2018.
[55] S. Yan, Z. Li, Y. Xiong, H. Yan, and D. Lin. Convolutional sequence generation for skeleton-based action synthesis. In ICCV, 2019.
[56] Sijie Yan, Yuanjun Xiong, and Dahua Lin. Spatial temporal graph convolutional networks for skeleton-based action recognition. In AAAI, 2018.