Towards Explaining Expressive Qualities in Piano Recordings: Transfer of Explanatory Features via Acoustic Domain Adaptation
Shreyan Chowdhury and Gerhard Widmer
Institute of Computational Perception and LIT AI Lab, Johannes Kepler University Linz, Austria
ABSTRACT
Emotion and expressivity in music have been topics of considerable interest in the field of music information retrieval. In recent years, mid-level perceptual features have been suggested as means to explain computational predictions of musical emotion. We find that the diversity of musical styles and genres in the available dataset for learning these features is not sufficient for models to generalise well to specialised acoustic domains such as solo piano music. In this work, we show that by utilising unsupervised domain adaptation together with receptive-field regularised deep neural networks, it is possible to significantly improve generalisation to this domain. Additionally, we demonstrate that our domain-adapted models can better predict and explain expressive qualities in classical piano performances, as perceived and described by human listeners.
Index Terms — Music, expressivity, domain adaptation
1. INTRODUCTION
Domain mismatch – a discrepancy between the kind of data available for training a classifier and the data on which it should then operate – is an important real-world problem, also in the field of acoustic recognition. For instance, the DCASE 2019 and 2020 challenges had dedicated tasks on Acoustic Scene Classification with multiple/mismatched recording devices (e.g., http://dcase.community/challenge2019/task-acoustic-scene-classification-results-b). The machine learning answer to this problem is research on effective methods for transfer learning and (supervised and unsupervised) domain adaptation.

The work presented in this paper is motivated by a particularly difficult acoustic transfer problem involving a complex musical phenomenon. In a large project, we aim at studying the elusive concept of expressivity in music with computational and, specifically, machine learning methods. One aspect of that is the art of expressive performance: the subtle, continuous shaping of musical parameters such as tempo, timing, dynamics, and articulation by experienced musicians while playing a piece, in this way imbuing the piece with particular expressive and emotional qualities [1]. The Con Espressione Game was a large-scale data collection effort we set up in order to obtain personal descriptions of perceived expressive qualities, with the goal of studying human perception and characterisation of expressive aspects in performances of the same pieces by different artists [2].

In analysing this data, we are now interested in seeing whether these subjective characterisations of expressive qualities are consistent and systematic enough for a machine to be able to predict them – at least partially – from the audio recordings. Moreover, we aim at obtaining musical insights: we want interpretable models that point to specific musical qualities that might underlie perceived expressive qualities. A set of musical descriptors that seems particularly suited to this was proposed in [3], where mid-level musical features were described that are intuitively understandable to the average music listener, and a corresponding human-annotated set of music recordings was published (see next section). In [4], we had shown how such mid-level features, predicted from audio via trained classifiers, can be exploited to provide intuitive explanations in the context of emotion and mood recognition in general (non-classical) music. This was extended to a two-level explanation scheme in [5], which permitted tracing the mid-level explanations back to properties of the acoustic signal. There is reason to believe that some of these features may also hold predictive and explanatory power for expressive aspects in piano performance.

This is where a severe mismatch problem arises: there is no annotated ground-truth data available for training mid-level feature extractors in classical piano music, and obtaining such data would be extremely cumbersome. At the same time, recordings of solo piano music are very different, musically and acoustically, from the kind of rock and pop music contained in the available mid-level training dataset. It is thus likely that a classifier trained on the latter will not generalise well to our piano recordings. (Note that we cannot test this directly, as we have no mid-level feature ground truth for the Con Espressione performances; we will use the few piano recordings in the mid-level dataset as our domain adaptation test set, see Section 4.)

In this paper, we present several steps to bridge this domain mismatch through architecture choice and unsupervised domain adaptation techniques, and show that they are effective in generalising a model to solo piano recordings. In a final step, we try the adapted classifier on the Con Espressione recordings, testing whether domain adaptation improves the predictability of expressive qualities from mid-level features predicted from audio, and identifying those features that seem to have specific predictive and explanatory power.
2. MID-LEVEL FEATURES AND THE CON ESPRESSIONE DATA

2.1. The Mid-level Features Dataset
Seven mid-level musical features were proposed in [3], viz. melodiousness, articulation, rhythmic complexity, rhythmic stability, dissonance, tonal stability, and modality (or "minorness"). To approximate these perceptual features for a set of audio clips, the authors took a data-driven approach: ratings from listeners were modelled into the final feature values that were made available as labels in the associated dataset (which we call the Mid-level Features Dataset) along with the audio clips. The labels for each feature are continuous values between 1 and 10 (the learning task for our models will thus be a regression task). The exact questions asked to the listeners for rating each perceptual feature can be found in [3]. The audio clips chosen for the dataset come from different sources such as jamendo.com, magnatune.com, and the Soundtracks dataset [6]. There are a total of 5,000 clips of 15 seconds each in the dataset.

2.2. The Con Espressione Game Dataset
In the Con Espressione Game, participants listened to extracts from recordings of selected solo piano pieces (by composers such as Bach, Mozart, Beethoven, Schumann, Liszt, Brahms) by a variety of different famous pianists (for details, see [2]) and were asked to describe, in free-text format, the expressive character of each performance. Typical characterisations that came up were adjectives like "cold", "playful", "dynamic", "passionate", "gentle", "romantic", "mechanical", "delicate", etc. From these textual descriptors, the authors obtained, by statistical analysis of the occurrence matrix of the descriptors, four underlying continuous expressive dimensions along which the performances can be placed. These are the (numeric) target dimensions that we wish to predict via the route of mid-level features predicted from the audio recordings.

The central challenge in this is that the Mid-level Features Dataset [3], consisting mainly of pop, rock, hip-hop, jazz, electronic, and film soundtrack music, is vastly different, in sound and musical style, from the music of the Con Espressione dataset. This results in what is known as a covariate shift [7] between the training and the testing data. In the following section, we describe a deliberate choice of training architecture that results in better generalisability of the trained models, and then present a two-step method to further adapt the model to our domain of choice.
3. MID-LEVEL FEATURE LEARNING VIA DOMAIN ADAPTATION (DA)
In the following sections, target domain refers to solo piano performance audio and source domain refers to all other musical audio (the non-piano audio clips in the Mid-level Features Dataset).
3.1. Receptive-Field Regularised ResNet

As a first step towards improving out-of-domain generalisation of mid-level feature prediction, we switch from the VGG-ish network of [4] to the Receptive-Field Regularised ResNet (RF-ResNet) originally introduced in [8] for acoustic scene classification and later shown to work well for music information retrieval tasks as well [9]. The rationale behind this is that the smaller receptive field of the RF-ResNet prevents overfitting, particularly when the training data is limited in quantity. The architecture differs from a regular ResNet [10] by reducing the kernel sizes of several convolutional layers and removing some max pooling layers. Our RF-ResNet consists of three stages, with three residual blocks in the first stage and one residual block each in the second and third stages. The last stage consists of only 1-by-1 convolutional layers. There are two max pooling layers in the first stage between the convolutional blocks, and one average pooling layer after the third stage before a final 1-by-1 convolutional feed-forward layer. The output is a seven-dimensional vector whose elements correspond to the predictions of the seven mid-level features.
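To make the stage/block layout described above concrete, the following is a minimal PyTorch sketch of a receptive-field-limited ResNet with a seven-dimensional regression output. It is not the exact configuration of [8]; channel widths, kernel sizes, and the stem layer are illustrative assumptions.

```python
# Minimal sketch (not the exact configuration of [8]): a small ResNet whose
# reduced kernel sizes and sparse pooling keep the receptive field limited.
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, ch, k=3):
        super().__init__()
        self.conv1 = nn.Conv2d(ch, ch, k, padding=k // 2, bias=False)
        self.bn1 = nn.BatchNorm2d(ch)
        self.conv2 = nn.Conv2d(ch, ch, 1, bias=False)   # 1x1 conv limits receptive-field growth
        self.bn2 = nn.BatchNorm2d(ch)

    def forward(self, x):
        h = torch.relu(self.bn1(self.conv1(x)))
        h = self.bn2(self.conv2(h))
        return torch.relu(x + h)

class RFResNet(nn.Module):
    def __init__(self, n_outputs=7, ch=64):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(1, ch, 5, stride=2, padding=2), nn.BatchNorm2d(ch), nn.ReLU())
        # stage 1: three residual blocks with two max-pooling layers between them
        self.stage1 = nn.Sequential(
            ResBlock(ch), nn.MaxPool2d(2), ResBlock(ch), nn.MaxPool2d(2), ResBlock(ch))
        self.stage2 = ResBlock(ch)          # stage 2: one residual block
        self.stage3 = ResBlock(ch, k=1)     # stage 3: one block using only 1x1 convolutions
        # average pooling after stage 3, then a final 1x1 convolutional feed-forward layer
        self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Conv2d(ch, n_outputs, 1))

    def forward(self, spec):                # spec: (batch, 1, n_bands, n_frames)
        h = self.stage3(self.stage2(self.stage1(self.stem(spec))))
        return self.head(h).flatten(1)      # (batch, 7) mid-level feature predictions

model = RFResNet()
print(model(torch.randn(2, 1, 149, 400)).shape)   # torch.Size([2, 7])
```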
3.2. Unsupervised Adversarial Domain Adaptation

We adopt the reverse-gradient method introduced in [11], which achieves domain invariance by adversarially training a domain discriminator attached to the network being adapted, using a gradient reversal layer. The procedure requires a large unlabelled dataset of the target domain in addition to the labelled source data. The discriminator tries to learn discriminative features of the two domains, but due to the gradient reversal layer between it and the feature-extracting part of the network, the model learns to extract domain-invariant features from the inputs.

This adaptation procedure is applied to the RF-ResNet described above. Since our target domain of interest is solo piano performance music, we use audio from the MAESTRO dataset [12] as our unlabelled data source. It contains more than 200 hours of recorded piano performances. During training, each batch that the model sees contains an equal number of labelled source data points and unlabelled target data points. The regressor/classifier head of the model tries to predict the source labels while the discriminator head predicts the domain for each data point in the batch. The combined loss of the two heads is then backpropagated, reversing the gradient after the discriminator during the backward pass.
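A minimal sketch of the gradient-reversal mechanism of [11] is given below, assuming a PyTorch setup. The feature extractor, regression head, and domain discriminator passed in stand in for the corresponding parts of the RF-ResNet; their shapes and the loss weighting are illustrative assumptions.

```python
# Sketch of gradient reversal [11]: the forward pass is the identity, the backward
# pass flips the gradient sign, so the feature extractor learns to fool the domain
# discriminator while the discriminator itself is trained normally.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

def train_step(features, regressor, discriminator, batch, lam=1.0):
    """One adversarial DA step on a batch with equal source/target halves."""
    x_src, y_src, x_tgt = batch                       # labelled source, unlabelled target
    f_src, f_tgt = features(x_src), features(x_tgt)
    reg_loss = nn.functional.mse_loss(regressor(f_src), y_src)   # mid-level regression on source only
    f_all = torch.cat([f_src, f_tgt])
    d_labels = torch.cat([torch.zeros(len(f_src)), torch.ones(len(f_tgt))]).long()
    d_logits = discriminator(GradReverse.apply(f_all, lam))
    dom_loss = nn.functional.cross_entropy(d_logits, d_labels)
    return reg_loss + dom_loss                        # backprop reverses the gradient after the discriminator

# toy usage with stand-in linear modules instead of the RF-ResNet parts
feat = nn.Sequential(nn.Linear(100, 32), nn.ReLU())
reg, disc = nn.Linear(32, 7), nn.Linear(32, 2)
loss = train_step(feat, reg, disc, (torch.randn(8, 100), torch.randn(8, 7), torch.randn(8, 100)))
loss.backward()
```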
3.3. Teacher-Student Training Scheme

As a final step, we refine our domain adaptation using a teacher-student training scheme tailored to our scenario (see Fig. 1). We train multiple domain-adaptive models using the unsupervised DA method of Section 3.2 and use these as teacher models that are eventually used to assign pseudo-labels to our unlabelled MAESTRO dataset. Before the pseudo-labelling step, we select the best-performing teacher models on the validation set. Even though the validation set contains data from the source domain, this step ensures that models with relatively lower variance are used as teachers. This helps filter out the particularly poorly adapted models from the previous step, which may occur due to the inherently less stable nature of adversarial training methods [13].

After selecting a number of teacher models (in our experiments, we used four), we label a randomly selected subset of our unlabelled dataset using predictions aggregated by taking the average. This pseudo-labelled dataset is combined with the original labelled source dataset to train the student model. We observed that the performance on the test set increased until the pseudo-labelled dataset was about 10% of the labelled source dataset in size, after which it saturated.

The teacher-student scheme allows the collective "knowledge" of an ensemble of adapted networks to be distilled into a single student network. The idea of knowledge distillation, originally introduced for model compression in [14], has previously been used for domain adaptation in a supervised setting in [15]. The distillation process functions as a regulariser, resulting in a student model with better generalisability than any of the individual teacher models alone. Additionally, it can be thought of as a stabilisation step that helps filter out adversarially adapted models resulting from non-optimal convergence.
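As a sketch of the pseudo-labelling step, assuming each teacher is a trained model mapping a batch of inputs to seven mid-level values; all names, shapes, and the toy data below are placeholders, not the actual pipeline.

```python
# Sketch of the teacher-student step: average the predictions of several
# domain-adapted teachers over unlabelled target clips, then train a student
# on the union of labelled source data and pseudo-labelled target data.
import torch

@torch.no_grad()
def pseudo_label(teachers, target_clips):
    """Average the teachers' predictions to obtain pseudo-labels for target-domain clips."""
    preds = torch.stack([t(target_clips) for t in teachers])   # (n_teachers, n_clips, 7)
    return preds.mean(dim=0)                                    # (n_clips, 7)

def build_student_training_set(source_x, source_y, teachers, target_clips):
    """Combine labelled source data with pseudo-labelled target data."""
    pseudo_y = pseudo_label(teachers, target_clips)
    return torch.cat([source_x, target_clips]), torch.cat([source_y, pseudo_y])

# toy usage with stand-in "teachers" (each maps a batch to 7 outputs)
teachers = [torch.nn.Linear(100, 7) for _ in range(4)]
x_src, y_src = torch.randn(500, 100), torch.randn(500, 7)
x_tgt = torch.randn(50, 100)                                    # roughly 10% of the source size
x_student, y_student = build_student_training_set(x_src, y_src, teachers, x_tgt)
print(x_student.shape, y_student.shape)                         # (550, 100), (550, 7)
```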
4. EXPERIMENTAL RESULTS
Since we have no ground-truth labels for our real data of interest (the classical piano music) to evaluate the domain adaptation experiments, we created a ("piano"/target) test set manually by selecting clips from the Mid-level Features Dataset containing only solo piano. This resulted in a set of 79 piano clips out of the total of 5,000. The other 4,921 clips ("non-piano"/source) were split into training (90%), validation (2%), and test (8%) sets such that the artists in these sets are mutually exclusive (following [3]). The validation set is used to tune hyperparameters and for early stopping. Our code is available at https://gitlab.cp.jku.at/shreyan/midlevel_da.

The inputs to all our models are log-filtered spectrograms (149 bands) of 15-second audio clips sampled at 22.05 kHz, with a window size of 2048 samples and a hop length of 704 samples.
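A sketch of this input pipeline is shown below. The paper's preprocessing uses a log-filtered spectrogram; here a 149-band mel filterbank and log compression stand in as an approximation, and the function name and defaults are assumptions for illustration.

```python
# Sketch of the input pipeline: 15-second clips at 22.05 kHz, window 2048, hop 704,
# 149 frequency bands. A mel filterbank approximates the log-spaced filterbank.
import librosa
import numpy as np

def clip_to_spectrogram(path, sr=22050, n_fft=2048, hop=704, n_bands=149, duration=15.0):
    """Load one audio clip and compute a 149-band log-magnitude spectrogram."""
    y, _ = librosa.load(path, sr=sr, duration=duration)
    spec = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                          hop_length=hop, n_mels=n_bands)
    return np.log1p(spec)          # shape: (149, n_frames)
```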
Fig. 1. Teacher-Student training scheme for unsupervised domain adaptation: teacher models are trained with unsupervised DA, selected on validation-set performance, and used to pseudo-label the unlabelled MAESTRO data; the student model is then trained on the combined labelled and pseudo-labelled data.

Recordings from the MAESTRO dataset are split into 15-second segments, and a random subset of size equal to the mid-level training set is sampled on each run. During the pseudo-labelling stage, a random subset of 500 segments is sampled.

We observe (Fig. 2) that each of the steps described in the previous section results in an improvement in performance on the "piano" test set without compromising performance on the "non-piano" one. In fact, we see a slight improvement in the non-piano metric upon introducing DA. This could be due to the presence of some data points similar to the target domain – for instance excerpts from piano concertos, which are not included in the "piano" test set.
Fig. 2. Performance of mid-level feature models on non-piano and piano test sets.

Fig. 3. Mean discrepancy between piano and non-piano sets.

To investigate our results further, we look at the discrepancy between the source and target domains in the representation space, since it is known that the performance of a model on the target domain is bounded by this discrepancy [7]. We use the method given in [16] to compute the empirical distributional discrepancy between domains for a trained model $\phi$, given as $D(S', T'; \phi)$ in Eq. (1):

$$D(S', T'; \phi) = \left\| \frac{1}{m} \sum_{x \in S'} \phi(x) \;-\; \frac{1}{n} \sum_{x \in T'} \phi(x) \right\| \qquad (1)$$

where $S'$ is a population sample of size $m$ from the source domain and $T'$ is a population sample of size $n$ from the target domain. We observe that the discrepancy decreases with each step (Fig. 3), justifying our three-step approach and explaining the improvement in performance.
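The quantity in Eq. (1) is simply the norm of the difference between the mean embeddings of the two domains. A minimal sketch, assuming `phi` maps a batch of inputs to embedding vectors (e.g., penultimate-layer activations):

```python
# Sketch of the empirical domain discrepancy of Eq. (1) [16]: the norm of the
# difference between mean feature embeddings of source and target samples.
import torch

@torch.no_grad()
def domain_discrepancy(phi, source_x, target_x):
    """phi maps inputs to embeddings; returns D(S', T'; phi)."""
    mu_src = phi(source_x).mean(dim=0)   # (1/m) * sum over the source sample
    mu_tgt = phi(target_x).mean(dim=0)   # (1/n) * sum over the target sample
    return torch.linalg.norm(mu_src - mu_tgt)

# toy usage with a stand-in embedding function
phi = torch.nn.Linear(100, 32)
print(domain_discrepancy(phi, torch.randn(200, 100), torch.randn(79, 100)))
```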
5. PUTTING IT TO THE TEST
As a final step, we now return to our real target domain of interest and briefly investigate whether our domain-adapted models can indeed predict better mid-level features for modelling the expressive descriptor embeddings of the Con Espressione dataset. We do this by predicting the average mid-level features (over time) for each performance using our models and training a simple linear regression model on these features to fit the four embedding dimensions. Even though this is a very abstract task, for a variety of reasons – the noisy and varied nature of the human descriptions; the weak nature of the numeric dimensions gained from these; the complex and subjective nature of expressive music performance – it can be seen (Table 1) that the features predicted using domain-adapted models give comparatively better R² scores for all four dimensions.

                 Dim 1   Dim 2   Dim 3   Dim 4
  VGG-ish         0.35    0.10    0.22    0.32
  RF-ResNet       0.36    0.07    0.28    0.33
  RF-ResNet DA      -       -       -       -

Table 1. Coefficient of determination (R² score) of description embedding dimensions of the Con Espressione game using a linear regressor trained on predicted mid-level features.

Taking a closer look at Dimension 1 – the one that came out most clearly in the statistical analysis of the user responses and was characterised by descriptions like "hectic" and "agitated" (as opposed to, e.g., "calm" and "tender"; see [2]) – and looking at the individual mid-level features (see Table 2), we find that, first of all, the predicted features that show a strong correlation with this dimension do indeed make sense: one would expect articulated ways of playing (e.g., with strong staccato) and rhythmically complex or uneven playing to be associated with an impression of musical agitation. What is more, after domain adaptation, the set of explanatory features grows, now also including perceived dissonance as a positive, and perceived melodiousness of playing as a negative factor – which again makes musical sense and testifies to the potential of domain adaptation for transferring explanatory acoustic and musical features.

  RF-ResNet               RF-ResNet DA+TS
  Feature          r      Feature            r
  articulation   0.47     melodiousness      -

Table 2. Pearson's correlation (r) of mid-level features with the first description embedding dimension, with (right) and without (left) domain adaptation. Only features with a significant p-value and sufficiently large |r| are selected. This dimension has positive loadings for words like "hectic" and "irregular", and negative loadings for words like "sad", "gentle", and "tender".
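The evaluation described above can be sketched as follows, assuming arrays of time-averaged mid-level predictions and the four embedding dimensions per performance; array names, shapes, and the in-sample fitting shown here are illustrative assumptions rather than the exact experimental protocol.

```python
# Sketch of the Section 5 evaluation: fit a linear regressor from predicted
# mid-level features to the four expressive dimensions, report R^2 per dimension,
# and list per-feature Pearson correlations with dimension 1 (significant only).
import numpy as np
from scipy.stats import pearsonr
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

def evaluate(midlevel_preds, dimensions, feature_names):
    """midlevel_preds: (n_performances, 7); dimensions: (n_performances, 4)."""
    reg = LinearRegression().fit(midlevel_preds, dimensions)
    r2 = r2_score(dimensions, reg.predict(midlevel_preds), multioutput="raw_values")
    print("R^2 per dimension:", np.round(r2, 2))
    for i, name in enumerate(feature_names):
        r, p = pearsonr(midlevel_preds[:, i], dimensions[:, 0])   # correlation with dimension 1
        if p < 0.05:
            print(f"{name}: r = {r:.2f}")

features = ["melodiousness", "articulation", "rhythmic complexity", "rhythmic stability",
            "dissonance", "tonal stability", "modality"]
rng = np.random.default_rng(0)
evaluate(rng.normal(size=(45, 7)), rng.normal(size=(45, 4)), features)   # dummy data
```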
6. CONCLUSION
In this paper, we presented a three-step approach to adapt mid-level models to recordings of solo piano performances. We significantly improved the performance of these models on piano audio by using a receptive-field regularised network and performing unsupervised domain adaptation via a teacher-student training scheme. We also demonstrated improved prediction of meaningful perceptual features corresponding to expressive dimensions. We conclude that this route of domain adaptation shows potential for the more general task of adapting models to specific genres or musical styles.
7. ACKNOWLEDGMENT
This work is supported by the European Research Council (ERC) under the EU's Horizon 2020 research & innovation programme under grant agreement No. 670035 ("Con Espressione"), and by the Federal State of Upper Austria (LIT AI Lab).

8. REFERENCES

[1] Carlos E. Cancino-Chacón, Maarten Grachten, Werner Goebl, and Gerhard Widmer, "Computational Models of Expressive Music Performance: A Comprehensive and Critical Review," Frontiers in Digital Humanities, vol. 5, p. 25, 2018.

[2] Carlos Cancino-Chacón, Silvan Peter, Shreyan Chowdhury, Anna Aljanaki, and Gerhard Widmer, "On the Characterization of Expressive Performance in Classical Music: First Results of the Con Espressione Game," in Proceedings of the 21st International Society for Music Information Retrieval Conference (ISMIR), 2020.

[3] Anna Aljanaki and Mohammad Soleymani, "A Data-driven Approach to Mid-level Perceptual Musical Feature Modeling," in Proceedings of the 19th International Society for Music Information Retrieval Conference (ISMIR), 2018, pp. 615–621.

[4] Shreyan Chowdhury, Andreu Vall, Verena Haunschmid, and Gerhard Widmer, "Towards Explainable Music Emotion Recognition: The Route via Mid-level Features," in Proceedings of the 20th International Society for Music Information Retrieval Conference (ISMIR), Delft, The Netherlands, 2019.

[5] Verena Haunschmid, Shreyan Chowdhury, and Gerhard Widmer, "Two-level Explanations in Music Emotion Recognition," in International Conference on Machine Learning (ICML), Machine Learning for Music Discovery Workshop, 2019.

[6] Tuomas Eerola and Jonna K. Vuoskoski, "A Comparison of the Discrete and Dimensional Models of Emotion in Music," Psychology of Music, vol. 39, no. 1, pp. 18–49, 2011.

[7] Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan, "A Theory of Learning from Different Domains," Machine Learning, vol. 79, no. 1-2, pp. 151–175, 2010.

[8] Khaled Koutini, Hamid Eghbal-Zadeh, Matthias Dorfer, and Gerhard Widmer, "The Receptive Field as a Regularizer in Deep Convolutional Neural Networks for Acoustic Scene Classification," IEEE, 2019, pp. 1–5.

[9] Khaled Koutini, Shreyan Chowdhury, Verena Haunschmid, Hamid Eghbal-zadeh, and Gerhard Widmer, "Emotion and Theme Recognition in Music with Frequency-Aware RF-Regularized CNNs," in Multimedia Evaluation Benchmark (MediaEval) 2019 Workshop, 2019.

[10] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, "Deep Residual Learning for Image Recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.

[11] Yaroslav Ganin and Victor Lempitsky, "Unsupervised Domain Adaptation by Backpropagation," in International Conference on Machine Learning (ICML). PMLR, 2015, pp. 1180–1189.

[12] Curtis Hawthorne, Andriy Stasyuk, Adam Roberts, Ian Simon, Cheng-Zhi Anna Huang, Sander Dieleman, Erich Elsen, Jesse Engel, and Douglas Eck, "Enabling Factorized Piano Music Modeling and Generation with the MAESTRO Dataset," in International Conference on Learning Representations, 2019.

[13] Tong Che, Yanran Li, Athul Paul Jacob, Yoshua Bengio, and Wenjie Li, "Mode Regularized Generative Adversarial Networks," arXiv preprint arXiv:1612.02136, 2016.

[14] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean, "Distilling the Knowledge in a Neural Network," arXiv preprint arXiv:1503.02531, 2015.

[15] Taichi Asami, Ryo Masumura, Yoshikazu Yamaguchi, Hirokazu Masataki, and Yushi Aono, "Domain Adaptation of DNN Acoustic Models using Knowledge Distillation," IEEE, 2017, pp. 5185–5189.

[16] Yu Sun, Eric Tzeng, Trevor Darrell, and Alexei A. Efros, "Unsupervised Domain Adaptation through Self-Supervision," arXiv preprint arXiv:1909.11825, 2019.