Semi-Supervised Learning for In-Game Expert-Level Music-to-Dance Translation
Yinglin Duan,∗ Tianyang Shi,∗ Zhengxia Zou, Jia Qin, Yifei Zhao, Yi Yuan,† Jie Hou, Xiang Wen, Changjie Fan
NetEase Fuxi AI Lab; University of Michigan, Ann Arbor; Zhejiang University
∗ These authors contributed equally to this work. † Corresponding author: [email protected]
Abstract
Music-to-dance translation is a brand-new and powerful feature in recent role-playing games. Players can now let their characters dance along with specified music clips and even generate fan-made dance videos. Previous works on this topic treat music-to-dance as a supervised motion generation problem over time-series data. However, these methods suffer from limited training data pairs and the degradation of movements. This paper provides a new perspective on the task: we re-formulate the translation problem as a piece-wise dance phrase retrieval problem based on choreography theory. With such a design, players are allowed to further edit the dance movements on top of our generation, whereas other regression-based methods ignore such user interactivity. Considering that dance motion capture is an expensive and time-consuming procedure that requires the assistance of professional dancers, we train our method under a semi-supervised learning framework with a large collected unlabeled dataset (20x larger than the labeled data). A co-ascent mechanism is introduced to improve the robustness of our network. Using this unlabeled dataset, we also introduce self-supervised pre-training so that the translator can understand the melody, rhythm, and other components of music phrases. We show that the pre-training significantly improves the translation accuracy compared with training from scratch. Experimental results suggest that our method not only generalizes well over various styles of music but also succeeds in expert-level choreography for game players.
Introduction

Music-dance is a very popular feature in many Role-Playing Games (RPGs), where players can control their characters to dance with the music (e.g., "Just Dance" and "FINAL FANTASY XIV"). Recent games like "Heaven mobile" (http://tym.163.com/) further enrich this feature by providing various instruments and pre-defined dance movements, so that players can edit vivid music-dances and share them on their social networks. However, editing and customizing music and dance require considerable expertise, and for players without experience in this area, choreography for game characters is a very difficult task. Even for a team experienced in music-dance, the entire production period, from the early capture of dance movements to the late software synthesis, usually takes several days. In this paper, we investigate an interesting problem called "music-to-dance translation", which aims to automatically generate dance movements for game characters according to player-uploaded music.

Figure 1: An overview of our method: We propose a method for music-to-dance translation based on player-uploaded music. We frame the translation as a dance retrieval problem, where we first segment the music into music phrases and then assign proper dance phrases one by one.

Recently, music-to-dance translation has drawn increasing research attention due to its wide applications in the game industry and virtual reality. Deep learning based methods have shown great potential in this task (Alemi, Françoise, and Pasquier 2017; Tang, Mao, and Jia 2018; Ren et al. 2020). However, these methods are difficult to apply to in-game, expert-level music-to-dance applications, for three reasons. First, in choreography theory, dance movements are typically expressed through the "strength", "speed", and "amplitude" of the human body, while the movements generated by previous methods are mostly based on amplitude alone, so the generation lacks a sense of strength. Second, most previous methods are designed to be trained in a fully supervised fashion and require a large amount of motion data captured in advance; however, capturing dance motions is usually expensive, time-consuming, and requires the assistance of professional dancers. Finally, previous methods cannot provide players with an interactive experience.

To solve the above problems, we propose a novel method for generating high-quality music-dances. We symbolize the dance movements and re-formulate music-to-dance translation as a phrase-wise dance phrase retrieval problem. Different from dance generative models that directly generate dance movements from the music, we consider the dance movements as a set of semantic fragments according to choreography theory and arrange these phrases for the music fragments one by one. To map music phrases to dance moves, we build an encoder-decoder network that takes in the Mel Spectrogram of a music phrase and predicts the corresponding index of the dance phrase.
As a temporal prediction problem, we introduce "transition priors" of the dance phrases based on a first-order Markov model to improve context reasoning: the transition matrices are used to re-scale the probabilities of the predicted results and yield smoother, more consistent generation results.

Considering the high cost of building large-scale dance movement datasets, we take advantage of semi-supervised learning (Oliver et al. 2018) to improve the robustness and generalization ability of our method. We extend our method to a large unlabeled music dataset (20x larger than our labeled one). We first train our method on this unlabeled music dataset with self-supervised pretext tasks, enforcing the network to reconstruct the music phrases as well as their melody and rhythm from the latent representations. The model can thus be pre-trained to learn a good representation of music phrases from the designed pretext tasks without human annotations. After pre-training, we fine-tune the model on a labeled subset. Since the transition matrices initially learned on the labeled data are half-baked, we propose a co-ascent mechanism to jointly refine the transition priors of the movements and improve the prediction accuracy. Specifically, we use the transition matrices to correct the prediction results, i.e., to generate pseudo-labels (Lee 2013) on the large unlabeled dataset, and then iteratively update the matrices and train our networks on the corrected labels. With the help of semi-supervised learning, our method generalizes better to in-the-wild music data. Such scalability is neither considered nor supported in previous methods.

Our contributions are summarized as follows:

• We propose a new music-to-dance translation method based on semi-supervised learning. We extend our method to a larger unlabeled music dataset and explore the effectiveness of self-supervised pre-training in our task. We show that by pre-training the model on the unlabeled dataset and then fine-tuning it on a labeled subset, the music-to-dance translation accuracy can be greatly improved compared with training solely on the labeled subset from scratch.

• We introduce a co-ascent mechanism that makes full use of the latent structure of the unlabeled data during fine-tuning. We consider the "transition priors" of the dance phrases and design a self-correction method to generate pseudo-labels for unlabeled data. To the best of our knowledge, few works incorporate such a mechanism in this task.

• Different from previous methods where the dance movements are directly generated from the music, we symbolize the dance movements and re-formulate music-to-dance translation as a phrase-wise music-to-dance retrieval problem guided by music-dance domain knowledge. With such a design, players can optionally edit the dance moves on top of the generation results according to their preference, an interactivity that was ignored in previous methods.
Related Work

Music-to-dance has been an emerging research hot-spot in recent years. As a cross-modality generation problem, it requires high consistency between the music and the generated dance in artistic conception. Early works usually adopt statistical models to achieve this goal (Shiratori, Nakazawa, and Ikeuchi 2006; Ofli et al. 2008, 2011; Fan, Xu, and Geng 2011; Lee, Lee, and Park 2013). With the development of deep learning, artistic consistency can now be achieved by building supervised deep learning models (Alemi, Françoise, and Pasquier 2017; Tang, Jia, and Mao 2018; Lee et al. 2019). For example, Alemi et al. first propose GrooveNet for real-time music-driven dance movement generation (Alemi, Françoise, and Pasquier 2017); in their method, the Factored Conditional Restricted Boltzmann Machine (FCRBM) is reformulated under a recurrent neural network framework and predicts the current motion-capture frame from the current music features and the historical frames. Tang et al. further propose an LSTM-based auto-encoder named "AniDance" to regress motions from acoustic features (Tang, Mao, and Jia 2018; Tang, Jia, and Mao 2018); an extractor is first used to reduce the dimension of the acoustic features, and a predictor then translates the reduced features to motions. Lee et al. propose a decomposition-to-composition framework for music-to-dance generation (Lee et al. 2019), where a VAE models dance units and a Generative Adversarial Network (GAN) organizes the dance units based on the input music. Ren et al. integrate a local temporal discriminator and a global content discriminator to help generate coherent dance sequences from a noisy dataset, and then use pose-to-appearance mapping to generate human dance videos (Ren et al. 2020). However, all of the above methods directly generate the dance movements from the music, which inevitably leads to motion degradation and cannot yet meet the requirements of expert-level music-to-dance translation. In this paper, different from previous methods, we symbolize the dance movements and re-formulate music-to-dance generation as a retrieval problem to avoid the degradation problem. Players can therefore obtain high-quality dance movements arranged for their input music.

Figure 2: An overview of our method. Our method consists of a music encoder E and a dance phrase predictor T. We also introduce three decoders for self-supervised pre-training. In the pre-training stage, we train our encoder on a large unlabeled music dataset with three pretext losses: a spectrogram reconstruction loss L_spe, a melody prediction loss L_mld, and a rhythm prediction loss L_rym. In the fine-tuning/inference stage, we train the predictor T on a labeled dance-music dataset to translate the input music phrases to dance phrases.

Semi-supervised learning forms a challenging but important branch of machine learning (Gammerman, Vovk, and Vapnik 2013; Joachims 1999, 2003; Zhu, Ghahramani, and Lafferty 2003; Bengio, Delalleau, and Le Roux 2006) that combines a small amount of labeled data with a large amount of unlabeled data during training to improve prediction. In recent years, various methods have been proposed in this field (Oliver et al. 2018).
Consistency regularization methods aim at building a low-dimensional manifold for unlabeled data; this group of methods includes the Π-Model (Laine and Aila 2016; Sajjadi, Javanmardi, and Tasdizen 2016), Mean Teacher (Tarvainen and Valpola 2017), and Virtual Adversarial Training (Miyato et al. 2018). Entropy-based methods encourage networks to be more confident, i.e., low-entropy, on all examples by introducing entropy minimization losses (Grandvalet and Bengio 2005; Pereyra et al. 2017).
Pseudo-labeling is another simple but widely used strategy in semi-supervised learning: the model provides probabilistic results for the unlabeled data, and pseudo-labels with sufficiently large confidence are adopted as targets to further train the model (Lee 2013). In the deep learning era, semi-supervised learning has been used to solve various computer vision tasks, including image classification (Li et al. 2019; Yalniz et al. 2019), semantic segmentation (Papandreou et al. 2015; Kalluri et al. 2019), and object detection (Jeong et al. 2019). It has also been widely used in the multimedia field, e.g., music analysis (Song, Zhang, and Xiang 2007; Poria et al. 2013; Li and Ogihara 2004) and image understanding (Li et al. 2019; Papandreou et al. 2015). In this work, we combine music-dance domain knowledge with the idea proposed by Lee et al. and use pseudo-labels to extend our method to a large unlabeled music dataset.
Method

In this paper, we propose a simple but efficient semi-supervised learning method for music-to-dance translation. Fig. 2 shows an overview of our method, which consists of a music feature encoder, a dance phrase predictor, and several decoders. The encoder is a ResNet50-based (He et al. 2016) convolutional network trained to encode the Mel Spectrogram of music phrases into music embeddings. The predictor is an attention-based fully connected network that takes in the embeddings and predicts dance phrases. The decoders are specifically designed for the pre-training task and are not involved during inference.

Given a piece of music (e.g., a pop song), we first segment the music into several phrases. We then pre-train our encoder with self-supervised losses on a large unlabeled music dataset and fine-tune the predictor on labeled music data to assign dance phrases based on the input features. We further design and incorporate a co-ascent mechanism to make full use of the unlabeled data and improve the translation. A rough sketch of the encoder-predictor pipeline is given below.
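The sketch wires a ResNet50 backbone to a phrase classifier. It is a simplified stand-in, not the paper's exact implementation: the real encoder and predictor follow the appendix (Tables 3 and 6), and the single-channel input trick, module names, and the plain two-layer predictor here are our assumptions.

```python
# Simplified sketch of the encoder-predictor retrieval pipeline.
# Assumptions: 1-channel Mel input, a stock torchvision ResNet50
# (the paper's encoder is a modified ResNet50, see Appendix Table 3),
# and a plain MLP in place of the residual attention predictor.
import torch
import torch.nn as nn
from torchvision.models import resnet50

K = 1101  # size of the dance phrase library (from the dataset section)

class MusicEncoder(nn.Module):
    def __init__(self, embed_dim=512):
        super().__init__()
        backbone = resnet50()
        # Accept a 1-channel Mel Spectrogram "image" instead of RGB.
        backbone.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2,
                                   padding=3, bias=False)
        # Map pooled features to a 512-d music embedding.
        backbone.fc = nn.Linear(backbone.fc.in_features, embed_dim)
        self.backbone = backbone

    def forward(self, mel):           # mel: (B, 1, 128, 128)
        return self.backbone(mel)     # (B, 512)

class DancePhrasePredictor(nn.Module):
    def __init__(self, embed_dim=512, num_phrases=K):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(embed_dim, embed_dim), nn.ReLU(),
                                 nn.Linear(embed_dim, num_phrases))

    def forward(self, u):             # u: (B, 512) music embedding
        return self.net(u)            # (B, K) logits over dance phrases

encoder, predictor = MusicEncoder(), DancePhrasePredictor()
probs = predictor(encoder(torch.randn(2, 1, 128, 128))).softmax(dim=-1)
```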
Music Phrase Segmentation

In choreography, a music phrase is a segment of the music containing a complete semantic-level structure, and the dance phrases within a music phrase usually represent similar conceptions. We thus define music phrases as the basic processing units of our retrieval model. Considering that music comes in various time signatures and that a music phrase may therefore consist of a varying number of beats (e.g., up to 24 beats, depending on the time signature), we design the following three steps to obtain the segmentation of music phrases under various beats, as shown in Fig. 3:

Figure 3: The processing pipeline of music phrase segmentation. We first segment music into fragments, then extract features from the music fragments, and finally slice music phrases based on these musical features.

• Long fragment segmentation: First, we analyze the music structure using spectral clustering and segment the music into long fragments. This step is implemented based on librosa (McFee et al. 2015).
• Rhythm feature detection: Second, we extract beats and onsets using librosa, and extract the main melody with a deep learning method (Hsieh, Su, and Yang 2019).
• Merging: Finally, we merge the above features and segment the music into a set of music phrases: we detect and slice the breaking points of a piece of music judging by the melody and onsets around a beat (a librosa-based sketch follows this list).
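The sketch below approximates the three steps with librosa only: agglomerative segmentation stands in for the spectral-clustering step, the deep-learning melody extractor (Hsieh, Su, and Yang 2019) is omitted, and a simple nearest-beat snapping rule stands in for the merging step. The file name and fragment count are hypothetical.

```python
# Approximate version of the phrase segmentation pipeline, librosa only.
import librosa
import numpy as np

y, sr = librosa.load("song.wav")  # hypothetical input file

# Step 1: long fragment segmentation from the music structure.
# (agglomerative segmentation stands in for the spectral clustering step)
chroma = librosa.feature.chroma_cqt(y=y, sr=sr)
bounds = librosa.segment.agglomerative(chroma, 8)  # 8 fragments, illustrative
bound_times = librosa.frames_to_time(bounds, sr=sr)

# Step 2: rhythm feature detection (beats and onsets).
tempo, beat_frames = librosa.beat.beat_track(y=y, sr=sr)
beat_times = librosa.frames_to_time(beat_frames, sr=sr)
onset_times = librosa.frames_to_time(
    librosa.onset.onset_detect(y=y, sr=sr), sr=sr)

# Step 3: merging - slice phrases at breaking points around beats.
# The full method also consults the main melody and onsets; here we
# simply snap each structural boundary to its nearest beat.
def phrase_edges(bound_times, beat_times):
    snapped = [beat_times[np.argmin(np.abs(beat_times - b))]
               for b in bound_times]
    return sorted(set(snapped))

edges = phrase_edges(bound_times, beat_times)
```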
The training of our method consists of two stages. In the first stage, we pre-train the encoder on a large unlabeled dataset (music without dance movements) with self-supervised pretext losses. In the second stage, we fix the encoder and fine-tune the predictor on a labeled dataset (music phrases with corresponding dance movements). Considering that choreography requires the concordance of music and dance in rhythm and melody, we design three pretext tasks for the pre-training: a spectrogram reconstruction task, a melody prediction task, and a rhythm prediction task. The pre-training is performed solely on the music data without any human annotations.
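For concreteness, the input preparation for these pretext tasks might look like the sketch below: each phrase is converted to a Mel Spectrogram with librosa and resized to a square 128x128 "image" (the size implied by the appendix tables). The log-scaling and the use of OpenCV for resizing are our assumptions, not details stated in the paper.

```python
# Sketch of pretext-task input preparation for a single music phrase.
import cv2          # assumed available for 2D resizing
import librosa
import numpy as np

def phrase_to_mel_image(y, sr, size=128):
    """Convert a 1D phrase waveform into a square Mel Spectrogram image."""
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=size)
    mel_db = librosa.power_to_db(mel, ref=np.max)   # log-scale (assumed)
    return cv2.resize(mel_db.astype(np.float32), (size, size))
```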
Spectrogram reconstruction. We compute the Mel Spectrogram of an input music phrase, converting the 1D music signal to a 2D "image" using librosa (McFee et al. 2015). We then feed the spectrogram to our ResNet encoder E to produce a set of low-dimensional feature embeddings. Because we expect the embeddings to contain all the information of the input music phrase, we introduce a decoder $D_1$ to upsample the features and restore the spectrogram, forcing the Mel Spectrogram before the encoder and after the decoder to remain unchanged. We define the reconstruction loss as follows:

$\mathcal{L}_{spe}(E, D_1) = \lVert D_1(E(\mathrm{Mel}(x))) - \mathrm{Mel}(x) \rVert, \quad (1)$

where $x$ is the music phrase and $\mathrm{Mel}(x)$ is its Mel Spectrogram. The decoder $D_1$ has a structure similar to the generative network of DCGAN (Radford, Metz, and Chintala 2015), with 8 transposed 2D-convolution layers.

Melody prediction. The main melody defines the pitch contours of polyphonic music and can be used in high-level tasks such as song identification (Serra, Gómez, and Herrera 2010) and music genre classification (Salamon, Rocha, and Gómez 2012). Different from the previous method (Tang, Mao, and Jia 2018) that uses the vanilla melody, we use the main melody extracted by a deep learning method (Hsieh, Su, and Yang 2019) to improve robustness. We define the prediction loss as follows:

$\mathcal{L}_{mld}(E, D_2) = \lVert D_2(E(\mathrm{Mel}(x))) - \mathrm{Melody}(x) \rVert, \quad (2)$

where $D_2$ is a decoder with 5 transposed 1D-convolution layers that regresses the melody from the embeddings, and $\mathrm{Melody}(x)$ is the pre-computed target melody of the music phrase $x$.

Rhythm. We define another prediction head to predict the rhythm from the music embeddings. The prediction loss is defined as follows:

$\mathcal{L}_{rym}(E, D_3) = \mathrm{BCELoss}(D_3(E(\mathrm{Mel}(x))), \mathrm{Rythm}(x)), \quad (3)$

where BCELoss denotes the binary cross-entropy loss, $D_3$ is a rhythm decoder with a structure similar to $D_2$ but producing binary output, and $\mathrm{Rythm}(x)$ is the target rhythm of the music phrase $x$, pre-computed based on librosa (McFee et al. 2015) and the main melody.
Final pre-training loss. By combining the loss terms (1), (2), and (3), we define the final pre-training loss as follows:

$\mathcal{L}_{pre\text{-}tr}(E, D_1, D_2, D_3) = \beta_1 \mathcal{L}_{spe} + \beta_2 \mathcal{L}_{mld} + \beta_3 \mathcal{L}_{rym}, \quad (4)$

where $\beta_1$, $\beta_2$, and $\beta_3$ are weights that balance the loss terms. We train the encoder E and the decoders ($D_1$, $D_2$, $D_3$) to minimize the above loss function. After the pre-training, we remove the decoders and keep only the weights of the encoder for further fine-tuning on music-dance data pairs.

Dance phrase predictor. We build an attention-based multilayer perceptron as our dance phrase predictor T, which consists of three residual attention blocks and two fully connected (FC) layers. In each block, we make a simple modification of the squeeze-and-excitation block in SENet (Hu, Shen, and Sun 2018) to apply it to an FC layer (the global pooling layer is thus removed). T is trained to predict the index of a proper dance phrase. For each music phrase, we define the prediction loss as the cross-entropy between the predicted probability distribution and the K possible dance phrases captured in the dance library:

$\mathcal{L}_{pred} = -\sum_{i=1}^{K} \hat{y}_p^{(i)} \log\big(F_{pred}(u)^{(i)}\big), \quad (5)$

where $[\hat{y}_p^{(1)}, \dots, \hat{y}_p^{(K)}]$ is the one-hot ground-truth vector of the prediction, $F_{pred}(u)^{(i)}$ is the predicted probability of the $i$-th kind of dance phrase, and $u = E(\mathrm{Mel}(x))$ is the music embedding from the encoder E. We train the encoder and predictor from the self-supervised pre-trained initialization; during training, we fix the encoder E and only update the predictor T for faster convergence.

Once the above retrieval model is built, music-to-dance translation essentially becomes a phrase-wise retrieval problem. Considering that building a large-scale dance phrase dataset is very expensive, we introduce a co-ascent learning mechanism to migrate our learning process to unlabeled data; this mechanism also improves the prediction through context reasoning. A sketch of the objectives defined above is given below.
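The sketch below computes Eqs. (1)-(5) in PyTorch, with the decoders passed in as black boxes. The reconstruction norm is assumed to be L1 (the norm type is not legible in this copy), the rhythm head is written with a logits-based BCE for numerical stability (the paper applies a sigmoid followed by BCE), and the default weights follow the training details (β1 = β2 = 1, β3 = 10).

```python
# Hedged sketch of the pre-training loss (Eq. 4) and retrieval loss (Eq. 5).
import torch
import torch.nn.functional as F

def pretrain_loss(encoder, dec_spe, dec_mld, dec_rym,
                  mel, melody, rhythm, betas=(1.0, 1.0, 10.0)):
    u = encoder(mel)                                  # music embedding
    l_spe = F.l1_loss(dec_spe(u), mel)                # Eq. (1), L1 norm assumed
    l_mld = F.l1_loss(dec_mld(u), melody)             # Eq. (2), L1 norm assumed
    l_rym = F.binary_cross_entropy_with_logits(       # Eq. (3); the paper uses
        dec_rym(u), rhythm)                           # sigmoid + BCE instead
    return betas[0] * l_spe + betas[1] * l_mld + betas[2] * l_rym  # Eq. (4)

def prediction_loss(logits, target_idx):
    """Eq. (5): cross-entropy between predicted phrase logits and labels."""
    return F.cross_entropy(logits, target_idx)
```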
Transition matrix. Inspired by the N-gram (Brown et al. 1992) that has been widely used in natural language processing, we introduce a dance phrase transition matrix $M \in \mathbb{R}^{K \times K}$ to capture the transition probability between two adjacent dance phrases. This matrix can be seen as having a similar meaning to the probability transition matrix of a first-order Markov process. During the inference stage, we use this matrix to re-scale the prediction results of the current phrase based on the history predictions. The re-scaling of the predicted class probability can be written as follows:

$P(d_t \mid u_t, d_{t-1}) = P(d_t \mid u_t)\, P(d_t \mid d_{t-1}) = F_{pred}(u_t)\, M(d_{t-1} \to d_t), \quad (6)$

where $d_t$ is the dance phrase at time step $t$, $P(d_t \mid u_t, d_{t-1})$ is the re-scaled result, $F_{pred}(u_t)$ is the raw prediction of the prediction head $F_{pred}$, and $M(d_{t-1} \to d_t)$ is the transition probability between the two dance phrases from step $t-1$ to $t$.

Co-ascent learning. Pseudo-labeling (Lee 2013) is a simple but effective strategy that has been widely used in semi-supervised learning methods. In our method, we first train the networks on a small labeled dataset and then apply the weak model to all unlabeled data (music without dances) to predict the corresponding labels. The dataset with both true labels and pseudo-labels is again used to train the network to enhance the decision boundary. During the pseudo-labeling process, we also apply the transition matrix M to correct the predictions of our network, and the corrected labels are in turn used to update the transition matrix.

Figure 4: The pipeline of the proposed co-ascent learning. We further train our predictor in a semi-supervised manner, where the proposed transition matrix is integrated to correct the pseudo-labels and is also jointly updated.

The update of the transition matrix is performed based on the product of the confidences of two pseudo-labeled music phrases:

$M_{k+1}(d_{t-1} \to d_t) = M_k(d_{t-1} \to d_t) + P(d_{t-1})\, P(d_t), \quad (7)$

where $M_{k+1}$ is the transition matrix after the $k$-th update using the pseudo-labels, and $P(d_t)$ is the prediction confidence of the dance phrase at time step $t$. Since the transition matrix and the networks can mutually improve each other through Eq. 6 and Eq. 7, we refer to this mechanism as co-ascent learning; a minimal sketch of the two rules follows.
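In code, the re-scaling of Eq. (6) and the update of Eq. (7) reduce to a few lines. The renormalization step below is our addition, to keep the re-scaled scores a valid distribution.

```python
# Sketch of the transition-prior re-scaling (Eq. 6) and matrix update (Eq. 7).
import numpy as np

def rescale(probs_t, prev_idx, M):
    """Multiply raw predictions F_pred(u_t) by the transition row of M."""
    p = probs_t * M[prev_idx]        # Eq. (6): elementwise product
    return p / p.sum()               # renormalize (our addition)

def update_transition(M, prev_idx, cur_idx, conf_prev, conf_cur):
    """Accumulate the product of pseudo-label confidences into M (Eq. 7)."""
    M[prev_idx, cur_idx] += conf_prev * conf_cur
    return M
```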
Training details. In our method, we adopt the Mel Spectrogram as the input music feature rather than Mel-frequency cepstral coefficients (MFCCs) because it retains more of the original music information, and we aim to learn a better representation of music to replace manual features (i.e., MFCCs (Logan et al. 2000)). The input Mel Spectrogram is resized to 128 × 128 before being fed into the encoder E, and the melody and rhythm are also resized accordingly (to a length of 128). The dimension of the music embeddings produced by the encoder is set to 512. For the detailed network configuration and the co-ascent learning pipeline, please refer to our Appendix.

In the pre-training stage, we use the Adam optimizer (Kingma and Ba 2014) to train our model with a learning rate of $10^{-…}$. The learning rate decays by 0.1 every 50 epochs, and the training stops at 200 epochs. We set the loss coefficients $\beta_1 = \beta_2 = 1$ and $\beta_3 = 10$. In the supervised fine-tuning stage, we train our translator by SGD with a learning rate of $10^{-…}$, momentum $0.…$, weight decay $…$, and a max-epoch number of $…$. In the co-ascent stage, we set the learning rate to $10^{-…}$, update the pseudo-labels every $…$ epochs, initialize the transition matrix M based on the style of the dance phrases (i.e., similar dance moves are allowed to transfer), and further clip the range of M within $[0.…, …]$ to improve stability. Other configurations are kept the same as in the supervised fine-tuning stage.

Blending of dance phrases. Considering that the dance moves in adjacent phrases are not always able to connect end to end, we use a common technique called blending to smooth the movements when switching from one dance move to another.

Experiments

We test our method on the music-dance creation platform of a role-playing game named "Heaven mobile". We build two datasets for our task:
Labeled Dance-Music Dataset. For this dataset, we first recorded 1,101 different dance phrases using motion-capture devices (Vicon V16 cameras); five professional dancers took part in the motion capture for one month. We then collected about 600 songs (∼33 hours) of different genres that are suitable for choreography. We segmented these songs into about 16,773 music phrases and invited six experts to arrange dance phrases for these music phrases song by song (multiple different music phrases may correspond to the same kind of dance phrase). For performance evaluation, we split this dataset into a training set (90%) and a test set (10%).

Unlabeled Music Dataset. In addition to the labeled dataset, we also collected an unlabeled dataset that is 20x larger than the labeled one. It consists of about 10k songs in various styles (∼686 hours). We segmented each song of this dataset into music phrases, and finally 293,579 music phrases were extracted and orderly packaged.
Figure 5: Comparisons between our method (shown in game) and previous methods on the music "Sorry".

Fig. 5 shows a group of translation results produced by our method and previous state-of-the-art methods on the music "Sorry" (also used in the previous work (Ren et al. 2020)). The music-dance video generated by our method not only accurately captures the rhythm of the song but also contains rich musical feeling and movement strength.
The ablation experiments are conducted to verify the importance of each component in our network. We evaluate five configurations of our method:

Group I: A ResNet-50 encoder alone, initialized with ImageNet pre-trained weights.
Group II: A ResNet-50 encoder initialized with the weights trained under self-supervised learning.
Group III: We fix the encoder trained with the self-supervised losses and fine-tune the attention-based predictor on the labeled dataset.
Group IV: We further balance the labeled dataset on top of Group III.
Group V: We apply co-ascent learning on top of Group IV.

The results are listed in Table 1.

Table 1: The experimental results of the ablation studies (a higher score indicates better performance).
Group | Self-Supervised | Attention | Balance | Co-Ascent | Top1 | Top5 | Top10
I | × | × | × | × | 16.3% | 20.5% | 23.…%
II | ✓ | × | × | × | 18.5% | 19.3% | 22.…%
III | ✓ | ✓ | × | × | 23.1% | 23.7% | 25.…%
IV | ✓ | ✓ | ✓ | × | …% | …% | …%
V | ✓ | ✓ | ✓ | ✓ | …% | 24.8% | 26.…%

Table 2: The experimental results of the subjective evaluation (closer to Rank 1 represents better performance).
Method | Group 1 | Group 2 | Unseen | Average Ranking
Dancing to Music (Lee et al. 2019) | ….± 0.33 | 3.00 ± 0.00 | 3.00 ± 0.00 | 2.… ± …
Dance Video Synthesis (Ren et al. 2020) | ….± 0.49 | 1.… ± 0.40 | 1.… ± 0.42 | 1.… ± …
Ours | ….± 0.46 | 1.… ± 0.40 | 1.… ± 0.42 | 1.… ± …
From Table 1, we can see that our full implementation (Group V) achieves a significant improvement over the baselines. Self-supervised learning (Group III) has a noticeable impact on our results (+6.8% Top1 over Group I), while using only the self-supervised pre-trained weights may lead to overfitting on the small dataset (+2.2% Top1 over Group I). Besides, co-ascent learning also shows an improvement on Top1 (+0.5%); although the scores are somewhat incremental, we find that co-ascent learning produces predictions with a much more consistent style.

Since the predictor faces a 1000-way classification problem and choreography can be very flexible, dance phrases are often exchangeable. In other words, a higher index accuracy in this task does not necessarily indicate better performance (it may even indicate overfitting on the proxy task). To better evaluate the quality of the generated dance phrases, we further conduct subjective evaluations. In this experiment, we first collect three groups of music: 1) music used in the previous method (Ren et al. 2020), 2) music from our unlabeled test set, and 3) music of unseen styles outside our dataset. Note that none of this music appears in our training dataset. We then generate dance videos with three methods: our full implementation and two previous state-of-the-art methods (Ren et al. 2020; Lee et al. 2019).

For each group of results, we invite nine certified dance teachers (with more than 10 years of dancing experience) and nine professional dancers (with ∼… years of experience) to rank the results of the three methods. The result videos are randomly segmented into sets of 30s clips. The experts were asked to ignore differences in the appearance of the character models and focus on the concordance of music and dance and the continuity of the dance phrases. The statistics of the ratings for the different video groups are listed in Table 2. The experts agree that our method generates expert-level dance videos in terms of the fluency and strength of the dance movements. The superiority of our method is twofold: 1) previous methods focus more on generating short sequences (< …).

Although we achieve noticeable improvements over previous methods, our method still has limitations. The first limitation is that since the encoder takes resized square inputs, it drops absolute rhythmic information and may fail on very smooth music. The second limitation is that since the blending method used in this work is linear, the transition between two dance phrases may cause model clipping under large movement changes. We will focus on these problems in our future work.
Conclusion

In this paper, we propose a new method for automatic music-to-dance translation. We re-formulate music-to-dance translation as a semi-supervised dance movement retrieval problem based on choreography theory. We also build a new music-dance dataset that consists of over 16k music phrases labeled with dance movements and 300k unlabeled ones. We design a self-supervised pre-training method and a co-ascent learning pipeline to fully exploit the information in the unlabeled music data. Experimental results on our dataset suggest that our method can generate expert-level music-dances, and the ablation studies confirm the effectiveness of the core designs of our method.
References
Alemi, O.; Françoise, J.; and Pasquier, P. 2017. GrooveNet: Real-time music-driven dance movement generation using artificial neural networks.
Brown, P. F.; Della Pietra, V. J.; de Souza, P. V.; Lai, J. C.; and Mercer, R. L. 1992. Class-based n-gram models of natural language. Computational Linguistics.
Fan, R.; Xu, S.; and Geng, W. 2011. Example-based automatic music-driven conventional dance motion synthesis. IEEE Transactions on Visualization and Computer Graphics.
Gammerman, A.; Vovk, V.; and Vapnik, V. 2013. Learning by transduction. arXiv preprint arXiv:1301.7375.
Grandvalet, Y.; and Bengio, Y. 2005. Semi-supervised learning by entropy minimization. In Advances in Neural Information Processing Systems, 529–536.
He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778.
Hsieh, T.-H.; Su, L.; and Yang, Y.-H. 2019. A streamlined encoder/decoder architecture for melody extraction. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 156–160. IEEE.
Hu, J.; Shen, L.; and Sun, G. 2018. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 7132–7141.
Jeong, J.; Lee, S.; Kim, J.; and Kwak, N. 2019. Consistency-based semi-supervised learning for object detection. In Advances in Neural Information Processing Systems, 10758–10767.
Joachims, T. 1999. Transductive inference for text classification using support vector machines. In ICML, volume 99, 200–209.
Joachims, T. 2003. Transductive learning via spectral graph partitioning. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), 290–297.
Kalluri, T.; Varma, G.; Chandraker, M.; and Jawahar, C. 2019. Universal semi-supervised semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision, 5259–5270.
Kingma, D. P.; and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Laine, S.; and Aila, T. 2016. Temporal ensembling for semi-supervised learning. arXiv preprint arXiv:1610.02242.
Lee, D.-H. 2013. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on Challenges in Representation Learning, ICML, volume 3, 2.
Lee, H.-Y.; Yang, X.; Liu, M.-Y.; Wang, T.-C.; Lu, Y.-D.; Yang, M.-H.; and Kautz, J. 2019. Dancing to music. In Advances in Neural Information Processing Systems, 3586–3596.
Lee, M.; Lee, K.; and Park, J. 2013. Music similarity-based approach to generating dance motion sequence. Multimedia Tools and Applications.
Li, T.; and Ogihara, M. 2004. In Proceedings of the Thirteenth ACM International Conference on Information and Knowledge Management, 152–153.
Li, X.; Sun, Q.; Liu, Y.; Zhou, Q.; Zheng, S.; Chua, T.-S.; and Schiele, B. 2019. Learning to self-train for semi-supervised few-shot classification. In Advances in Neural Information Processing Systems, 10276–10286.
Logan, B.; et al. 2000. Mel frequency cepstral coefficients for music modeling. In ISMIR, volume 270, 1–11.
McFee, B.; Raffel, C.; Liang, D.; Ellis, D. P.; McVicar, M.; Battenberg, E.; and Nieto, O. 2015. librosa: Audio and music signal analysis in Python. In Proceedings of the 14th Python in Science Conference, volume 8.
Miyato, T.; Maeda, S.-i.; Koyama, M.; and Ishii, S. 2018. Virtual adversarial training: A regularization method for supervised and semi-supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Ofli, F.; et al. 2008. An audio-driven dancing avatar. Journal on Multimodal User Interfaces.
Ofli, F.; Erzin, E.; Yemez, Y.; and Tekalp, A. M. 2011. Learn2Dance: Learning statistical music-to-dance mappings for choreography synthesis. IEEE Transactions on Multimedia.
Oliver, A.; Odena, A.; Raffel, C.; Cubuk, E. D.; and Goodfellow, I. 2018. Realistic evaluation of deep semi-supervised learning algorithms. In Advances in Neural Information Processing Systems, 3235–3246.
Papandreou, G.; Chen, L.-C.; Murphy, K. P.; and Yuille, A. L. 2015. Weakly- and semi-supervised learning of a deep convolutional network for semantic image segmentation. In Proceedings of the IEEE International Conference on Computer Vision, 1742–1750.
Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; Desmaison, A.; Kopf, A.; Yang, E.; DeVito, Z.; Raison, M.; Tejani, A.; Chilamkurthy, S.; Steiner, B.; Fang, L.; Bai, J.; and Chintala, S. 2019. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, 8024–8035. Curran Associates, Inc. URL http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf.
Pereyra, G.; Tucker, G.; Chorowski, J.; Kaiser, Ł.; and Hinton, G. 2017. Regularizing neural networks by penalizing confident output distributions. arXiv preprint arXiv:1701.06548.
Poria, S.; Gelbukh, A.; Hussain, A.; Bandyopadhyay, S.; and Howard, N. 2013. Music genre classification: A semi-supervised approach. In Mexican Conference on Pattern Recognition, 254–263. Springer.
Radford, A.; Metz, L.; and Chintala, S. 2015. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434.
Ren, X.; Li, H.; Huang, Z.; and Chen, Q. 2020. Self-supervised dance video synthesis conditioned on music.
Sajjadi, M.; Javanmardi, M.; and Tasdizen, T. 2016. Regularization with stochastic transformations and perturbations for deep semi-supervised learning. In Advances in Neural Information Processing Systems, 1163–1171.
Salamon, J.; Rocha, B.; and Gómez, E. 2012. Musical genre classification using melody features extracted from polyphonic music signals. In …, 81–84. IEEE.
Serra, J.; Gómez, E.; and Herrera, P. 2010. Audio cover song identification and similarity: Background, approaches, evaluation, and beyond. In Advances in Music Information Retrieval, 307–332. Springer.
Shiratori, T.; Nakazawa, A.; and Ikeuchi, K. 2006. Dancing-to-music character animation. In Computer Graphics Forum, volume 25, 449–458. Wiley Online Library.
Song, Y.; Zhang, C.; and Xiang, S. 2007. Semi-supervised music genre classification. In …, volume 2, II–729. IEEE.
Tang, T.; Jia, J.; and Mao, H. 2018. Dance with melody: An LSTM-autoencoder approach to music-oriented dance synthesis. In Proceedings of the 26th ACM International Conference on Multimedia, 1598–1606.
Tang, T.; Mao, H.; and Jia, J. 2018. AniDance: Real-time dance motion synthesize to the song. In Proceedings of the 26th ACM International Conference on Multimedia, 1237–1239.
Tarvainen, A.; and Valpola, H. 2017. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In Advances in Neural Information Processing Systems, 1195–1204.
Yalniz, I. Z.; Jégou, H.; Chen, K.; Paluri, M.; and Mahajan, D. 2019. Billion-scale semi-supervised learning for image classification. arXiv preprint arXiv:1905.00546.
Zhu, X.; Ghahramani, Z.; and Lafferty, J. D. 2003. Semi-supervised learning using Gaussian fields and harmonic functions. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), 912–919.
Appendix
A.1 Details of network configuration
In this section, we list the configurations of all networks mentioned in our main paper, i.e., the encoder E, the predictor T, the 2D decoder $D_1$, and the two 1D decoders $D_2$ and $D_3$. Our networks are implemented under the PyTorch deep learning framework (Paszke et al. 2019).

The configuration of Encoder E. A detailed configuration of the ResNet50-based encoder E is listed in Table 3. The input size of the encoder E is 128 × 128 pixels; the Mel Spectrogram is therefore resized along the time dimension. The outputs of E contain a temporal feature $f_t$ (from the feature layer) and an embedding $u$ (from the embedding layer); their shapes follow the last two rows of Table 3. Specifically, in a c×w×w/s convolution/deconvolution layer, c denotes the number of filters, w×w the filter size, and s the filter stride. In a w×w/s MaxPool layer, w denotes the pooling window size and s the pooling stride. In an n/s Bottleneck block (He et al. 2016), n denotes the number of planes and s the block's stride. In an (h, w) AdaptiveAvgPool2d layer, h and w denote the output height and width, and "None" means the size is the same as the input.

Table 3: A detailed configuration of the Encoder E.
Layer | Component | Configuration | Feature Size
Conv 1 | Conv2d + BN2d + ReLU | 64x7x7 / 2 | 64x64
MaxPool | MaxPool | 3x3 / 2 | 32x32
Conv 2 | 3 x Bottleneck | 64 / 1 | 32x32
Conv 3 | 4 x Bottleneck | 128 / 2 | 16x16
Conv 4 | 6 x Bottleneck | 256 / 2 | 8x8
Conv 5 | 3 x Bottleneck | 512 / 2 | 4x4
Conv 6 | Conv2d | 2048x1x1 / 1 | 4x4
feature | AdaptiveAvgPool2d | (None, 1) | 4x1
embedding | AdaptiveAvgPool2d | (1, None) | 1x1

The configuration of Decoder $D_1$. A detailed configuration of Decoder $D_1$ is listed in Table 4. The input of $D_1$ is the embedding u of length 512, and the output is the reconstructed Mel Spectrogram of 128 × 128 pixels.

Table 4: A detailed configuration of the Decoder $D_1$.
Layer | Component | Configuration | Feature Size
Layer 1 | ConvTranspose2d + BN2d + ReLU | 512x4x4 / 1 | 4x4
Layer 2 | ConvTranspose2d + BN2d + ReLU | 512x4x4 / 2 | 8x8
Layer 3 | ConvTranspose2d + BN2d + ReLU | 256x4x4 / 2 | 16x16
Layer 4 | ConvTranspose2d + BN2d + ReLU | 256x4x4 / 2 | 32x32
Layer 5 | ConvTranspose2d + BN2d + ReLU | 128x3x3 / 1 | 32x32
Layer 6 | ConvTranspose2d + BN2d + ReLU | 128x4x4 / 2 | 64x64
Layer 7 | ConvTranspose2d + BN2d + ReLU | 64x3x3 / 1 | 64x64
Layer 8 | ConvTranspose2d | 1x4x4 / 2 | 128x128

The configuration of Decoders $D_2$ and $D_3$. Detailed configurations of the Decoders $D_2$ and $D_3$ are listed in Table 5. The input of $D_2$ and $D_3$ is the temporal feature $f_t$, and the output is the reconstructed main melody or rhythm of length 128. Since rhythm prediction can be considered a binary classification problem, we further add a sigmoid function at the end of the decoder for this task. Similar to the tables above, in a c×w/s 1D convolution/deconvolution layer, c denotes the number of filters, w the filter length, and s the filter stride.

Table 5: A detailed configuration of the Decoders $D_2$ and $D_3$.
Layer | Component | Configuration | Feature Length
Layer 1 | ConvTranspose1d + ReLU | 512x2 / 2 | 8
Layer 2 | ConvTranspose1d + ReLU | 256x2 / 2 | 16
Layer 3 | ConvTranspose1d + ReLU | 128x2 / 2 | 32
Layer 4 | ConvTranspose1d + ReLU | 64x2 / 2 | 64
Layer 5 | ConvTranspose1d + ReLU | 32x2 / 2 | 128
Output | Conv1d | 1x1 / 1 | 128
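As a cross-check of Table 5, the sketch below builds the 1D decoder in PyTorch; each transposed convolution doubles the temporal length from 4 to 128. The input channel count (2048, taken from Conv 6 in Table 3) and the exact layer grouping are our assumptions.

```python
# Illustrative PyTorch counterpart of the 1D decoders in Table 5.
import torch
import torch.nn as nn

class Decoder1D(nn.Module):
    def __init__(self, in_ch=2048, rhythm=False):  # in_ch assumed from Conv 6
        super().__init__()
        layers, c_prev = [], in_ch
        for c in (512, 256, 128, 64, 32):           # Table 5, Layers 1-5
            layers += [nn.ConvTranspose1d(c_prev, c, kernel_size=2, stride=2),
                       nn.ReLU()]
            c_prev = c
        layers += [nn.Conv1d(c_prev, 1, kernel_size=1)]  # output layer
        if rhythm:
            layers += [nn.Sigmoid()]                # binary rhythm output
        self.net = nn.Sequential(*layers)

    def forward(self, f_t):                         # f_t: (B, in_ch, 4)
        return self.net(f_t)                        # (B, 1, 128)

print(Decoder1D()(torch.randn(1, 2048, 4)).shape)   # torch.Size([1, 1, 128])
```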
The configuration of Predictor T. In our predictor, we adopt three residual attention blocks and two fully connected layers; the detailed configuration is shown in Table 6. The (n, m) of a Linear and a "Res-Att" layer denotes that the input and output channel numbers are n and m, respectively, and K is the number of output dance phrases. Following ResNet (He et al. 2016) and SENet (Hu, Shen, and Sun 2018), the four fully connected layers in our residual attention blocks (as shown in Fig. 6) are orderly set to "Linear(512, 1024)", "Linear(1024, 512)", "Linear(512, 16)", and "Linear(16, 512)".

Figure 6: The details of the residual attention blocks (Res-Att).

Table 6: A detailed configuration of the Predictor T.
Layer | Component | Configuration | Feature Channel
Layer 1 | Linear | (512, 512) | 512
Layer 2 | Res-Att | (512, 512) | 512
Layer 3 | Res-Att | (512, 512) | 512
Layer 4 | Res-Att | (512, 512) | 512
Output | Linear | (512, K) | K
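A sketch of one residual attention block under the FC sizes listed above (a 512→1024→512 trunk and a 512→16→512 squeeze-and-excitation gate). The placement of the activations, the gating input, and the residual wiring are our assumptions about Fig. 6.

```python
# Sketch of a Res-Att block: an SE-style gate applied to an FC trunk.
import torch
import torch.nn as nn

class ResAtt(nn.Module):
    def __init__(self, dim=512, hidden=1024, squeeze=16):
        super().__init__()
        # FC trunk: Linear(512, 1024) -> Linear(1024, 512)
        self.fc = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(),
                                nn.Linear(hidden, dim))
        # SE gate: Linear(512, 16) -> Linear(16, 512), sigmoid-activated
        self.gate = nn.Sequential(nn.Linear(dim, squeeze), nn.ReLU(),
                                  nn.Linear(squeeze, dim), nn.Sigmoid())

    def forward(self, x):
        h = self.fc(x)
        return x + h * self.gate(h)   # gated residual connection (assumed)

print(ResAtt()(torch.randn(2, 512)).shape)  # torch.Size([2, 512])
```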
A.2 Details of co-ascent learning

In this section, we give a detailed description of our co-ascent learning method, which notably improves the performance of our music-to-dance translation. The algorithm flow of co-ascent learning is shown in Alg. 1.

Algorithm 1: Co-ascent learning algorithm
Data: Labeled dataset $D_l$ with K kinds of dance phrases; unlabeled dataset $D_u$.
Init: Fix the encoder E and initialize the predictor T by training T on $D_l$. Calculate the transition matrix M on $D_l$ based on the style of the dance phrases. Set the threshold $\tau = 0.…$ and the momentum parameter $\alpha = 0.…$.
Var: epoch id $k = 0$.
while not all samples in $D_u$ are labeled do
    Run E and $T_k$ on $D_u$ and obtain the set P of output probability vectors over the K classes;
    for each pair of temporally adjacent dance phrases $d_{t-1}$, $d_t$ with probabilities $P(d_{t-1})$, $P(d_t)$ in $D_u$, P do
        Update $P(d_t)$: $P(d_t) \leftarrow P(d_t)\, M_k(d_{t-1} \to d_t)$;
    Get pseudo-labels L based on the re-scaled $P(d_t)$;
    Initialize $D'_u$, a subset of $D_u$, as an empty set;
    for each dance movement d, label l, confidence P(d) in $D_u$, L, P do
        if $P(d) > \tau$ then push d and l into $D'_u$;
    Fine-tune the network $T_k$ on $D_l + D'_u$ to obtain $T_{k+1}$;
    Initialize $M_{k+1}$ with a zero matrix;
    for each pair of temporally adjacent phrases $d_{t-1}$, $d_t$ with Top-1 confidences $P(d_{t-1})$, $P(d_t)$ in $D_u$, P do
        $M_{k+1}(d_{t-1} \to d_t) = M_k(d_{t-1} \to d_t) + P(d_{t-1})\, P(d_t)$;
    Update $M_{k+1}$ with momentum: $M_{k+1} \leftarrow \alpha M_k + (1-\alpha) M_{k+1}$;
    Update the epoch id: $k = k + 1$.
Result: Output the optimized $T^\star$ and $M^\star$.

A.3 Music and Dance Style Distribution on Datasets