ASMD: an automatic framework for compiling multimodal datasets with audio and scores
Federico Simonetta Stavros Ntalampiras Federico Avanzini
University of Milan, Department of Computer Science, LIM - Music Informatics Laboratory
[name].[surname]@unimi.it
ABSTRACT
This paper describes an open-source Python framework for handling datasets for music processing tasks, built with the aim of improving the reproducibility of research projects in music computing and assessing the generalization abilities of machine learning models. The framework enables the automatic download and installation of several commonly used datasets for multimodal music processing. Specifically, we provide a Python API to access the datasets through Boolean set operations based on particular attributes, such as intersections and unions of composers, instruments, and so on. The framework is designed to ease the inclusion of new datasets and the respective ground-truth annotations so that one can build, convert, and extend one's own collection as well as distribute it by means of a compliant format to take advantage of the API. All code and ground-truth are released under suitable open licenses.
1. INTRODUCTION
A recent trend in computer science is the adoption of multimodal strategies for increasing the effectiveness of algorithmic solutions in several domains [1–5]. This comes as a natural consequence of the a) ever-increasing availability of computational resources, which are now able to deal with big data, and b) popularity of machine learning algorithms, the performance of which is boosted as more data (including multimodal data) becomes available. As a result, machine learning technologies are now employed in novel and unexplored ways.

In the context of music information processing, several tasks still pose unsolved challenges to the research community, and multimodal approaches could provide a promising path. The fields of multimodal music processing and multimodal music representation have already been investigated in previous works [6, 7].

Two issues that are more and more debated in several research fields are the ability to reproduce published results [8, 9] and to generalize the resulting models [10].
Reproducibility is associated with the differences occurring in various implementations of the same method. As an example, one issue is related to the different data formats used in music and in the available datasets, which might cause trouble in the translation between representation formats and, consequently, in the reproducibility of research. The generalization problem, instead, is due, among other factors, to the need for large and well-annotated datasets for training effective models. In particular, the whole field of music information processing has only a limited number of large datasets, which could be much more useful if they could be merged together. Music itself, moreover, is particularly affected by the difficulty of creating accurate annotations to evaluate and train models, often hindering the collection of large datasets and causing low generalization ability.

With these three keywords in mind (multimodality, reproducibility and generalization), we have built ASMD to help researchers in the standardization of music ground-truth annotations and in the distribution of datasets. ASMD is the acronym for Audio-Score Meta-Dataset and provides a framework for describing, converting, and accessing a single dataset which includes various datasets – hence the expression Meta-Dataset; it was born as a side-project of research on audio-to-score alignment and, consequently, it contains audio recordings and music scores for all the data included in the official release – hence the Audio-Score part. However, we have endeavoured to make ASMD able to include any contribution from anyone. ASMD is available under free licenses.

A similar effort is represented by mirdata [11], a Python package for downloading and using common MIR datasets. However, our work is more focused on multimodality and tries to keep the entire framework easily extensible and modular.

In the following sections, we describe a) the design principles, b) the implementation details, c) a few use cases and d) possible future work.
2. DESIGN PRINCIPLES AND SPECIFICATIONS
In this section we present the principles which guided the design of the framework. Code is available at https://framagit.org/sapo/asmd/ and documentation at https://asmd.readthedocs.io/.

Throughout this paper, we are going to use the word annotation to refer to any music-related data complementing an audio recording. For instance, common types of annotations are music notes, f0 for each audio frame, beat positions, separated audio tracks, etc.

With generalization, we mean the ability of including different datasets, which are distributed with various formats and annotation types, in the model generation process. This is an important issue especially during the conversion procedure: since we aim at distributing a conversion script to recreate annotations from scratch for the sake of reproducibility, we need to be able to handle various storage systems – e.g. file name patterns, directory structures, etc. – and file formats – e.g. MIDI, CSV, MusicXML, ad-hoc formats, etc. Also, our ground-truth format should be generic enough to represent all the information contained in the available ground-truths and, at the same time, it should allow handling datasets with different ground-truth types – i.e. one dataset could provide aligned notes and f0, while another one could provide aligned notes and beat-tracking annotations, and both should be completely accessible.

Modularity refers to the re-use of parts of the framework in different contexts. Modularity is important during both the addition of new datasets and the usage of the API. To ease the conversion between ground-truth file formats, the user should be able to re-use existing utilities to include additional datasets. Moreover, the user should be allowed to use only some parts of the datasets and the corresponding annotations.
The purpose of the framework is to create a tool to help the standardization of music information processing research. Consequently, we aimed for a framework that is completely open to new additions: it should be easy for the user to add new datasets without editing sources from the framework itself. Also, it should be easy to convert from existing formats in order to take advantage of the API and to be able to merge existing datasets. Finally, the framework should provide a usable format to add new annotations so that new datasets can be natively created with the incorporated tools.
Since the framework aims at merging multiple datasets, we wanted to add the ability to perform set operations over datasets. As an example, within the context of automatic music transcription research, several large datasets of piano music exist [12–14], but only a few, considerably smaller ones are available for other instruments [15–19]. Consequently, a useful feature of the framework is the ability to select only some songs from multiple datasets based on particular attributes, such as the instrument involved, the number of instruments, the composer or the type of ground-truth available for that song.
A common issue with distributing music recordings and annotations is copyright. Today, most of the datasets typically used for music information processing are released under Creative Commons licenses, but there are many exceptions of datasets released under closed terms [16, 20] or not released at all because of copyright restrictions [21]. To overcome this problem, we wanted all datasets to be downloadable from their official sources, in order to avoid any form of redistribution. Nonetheless, all the annotations that we produced are redistributable under a Creative Commons license.
Besides the effort to produce a general framework for music processing experiments, this project was born as a utility while conducting research addressing the audio-to-score alignment problem. The underlying idea is that various scores and large amounts of audio are available to end-users, thus trained models could easily take advantage of such multimodality (i.e. the ability of the model to exploit both scores and audio). The main problem is the availability of data for training the models: there is an abundance of aligned data, but without the corresponding scores; on the other hand, when scores are available, aligned performances are almost invariably missing. Thus, the choice of the datasets currently included has mainly been focused on datasets providing audio, symbolic scores and alignment annotations. However, since datasets fitting all these requirements are quite rare, we wanted to augment the available data to increase the amount of alignment data usable in our research.
3. IMPLEMENTATION DETAILS
This section details the implementation satisfying the design principles outlined in Section 2. Figure 1 depicts the structure of the overall framework and the interactions between its modules.
The entire framework is based on a small but fundamental JSON file, loaded by the API and by the installation script to get the path where files are installed. Moreover, the user can optionally set a custom directory where downloaded files are decompressed, in case hard-disk space is a critical issue. Once the installation path is found, the script looks for the existing directories in that path to discover which datasets are already installed and skips them. The API, instead, uses the information about the installation directory to decouple the definition of each single dataset from the directory structure of the user: a user can have the same dataset installed in multiple directories, or use the same dataset from different datasets.json files without interfering with the API.

Figure 1. Block diagram of the proposed framework: the API interacts with definitions and datasets.json; the former contain references to the actual sound recording files and annotations, while the latter contains references to the dataset root path.
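To make the role of this configuration file concrete, the following minimal sketch writes and reads such a file; the key names install_dir and decompress_path are illustrative placeholders and are not guaranteed to match the actual ASMD schema.

# Hypothetical sketch of the small configuration file described above.
# The keys "install_dir" and "decompress_path" are illustrative placeholders,
# not necessarily the actual ASMD field names.
import json

config = {
    "install_dir": "/data/asmd_datasets",       # where the datasets are installed
    "decompress_path": "/tmp/asmd_decompress"   # optional custom decompression dir
}

with open("datasets.json", "w") as f:
    json.dump(config, f, indent=2)

# both the installation script and the API would read the path back like this
with open("datasets.json") as f:
    install_dir = json.load(f)["install_dir"]
print(install_dir)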
In the context of this framework, a dataset definition is essentially a JSON file which contains a generic description of a dataset. Definitions are built by using a pre-defined schema allowing the specification of various information useful for the installation of the dataset and for the usage of the API – e.g. for filtering the dataset. If any piece of information is not available for a dataset, the value unknown can be used. Examples of information contained in definitions are:

• ensemble: whether the dataset contains solo instrument or ensemble music pieces;
• instruments: a list of instruments that are used in the dataset;
• sources: if source-separated tracks are available, their format can be added here;
• recording: the format of the audio recordings;
• install: field containing all the information needed for installing the dataset: URL for downloading, shell commands for post-processing data, and so on;
• ground-truth: field associated to each type of ground-truth supported by the framework, indicating whether the specific annotation type is available or not – see Sec. 3.3;
• songs: a list of songs with meta-data, such as the composer name and the instruments used in each song, together with the list of paths to the audio recordings and to the annotations.

Once a dataset has been described with this schema, its definition can be used out-of-the-box by simply specifying to the API the path of its folder, possibly containing other dataset definitions; a toy example of a complete definition is sketched below. All the paths specified in a definition must be relative to the installation directory, as described in Sec. 3.1.

For the sake of generalization, we had to deal with a wide heterogeneity in path management among datasets. For instance, Bach10 [18] provides a different annotation file for each instrument in a song; in such a case we list all the annotation files for each song and leave to the API the task of reassembling them. PHENICX [16], instead, only provides source-separated tracks, and thus we list all of them to reference the mixed track; again, we leave to the API the task of mixing them. In general, we have kept the following principle: if a list of paths is provided where one would logically expect a single path – such as for mixed tracks or annotation files – it is intended that the files in the list should be "merged", whatever this means for that specific file type. For instance, if multiple audio recordings are provided instead of only one, it is assumed that the mixed track is derivable by adding (and normalizing) all listed tracks; if multiple annotation files are provided, it is assumed that each annotation file refers to a different instrument.

Annotations are provided in a custom compressed JSON format and stored in the same directory as the audio track they refer to. In fact, annotation files can be stored anywhere, and their path must be provided in the dataset definition relative to the installation path defined in datasets.json. Moreover, one annotation file must be provided for each instrument of the track; if multiple instruments refer to the same annotations – e.g. first and second violins – a single annotation file can be used, but in the dataset definition file its path should be repeated once for each instrument referring to it.

Multiple types of annotations are available, but not all of them are provided for all the datasets in the official collection. In the dataset definition, the types of annotations available should be declared.
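As an illustration of the definition fields listed above, the following sketch builds a toy definition as a Python dictionary and saves it as JSON; all values (URLs, paths, song entries) are invented for the example, and the exact official schema may differ in detail.

# Toy dataset definition using the fields described in the text.
# All values are invented; the official ASMD schema may differ in detail.
import json

definition = {
    "ensemble": False,                      # solo-instrument dataset
    "instruments": ["piano"],
    "sources": {"format": "unknown"},       # no source-separated tracks
    "recording": {"format": ["wav"]},
    "install": {
        "url": "https://example.org/toy_dataset.zip",  # hypothetical URL
        "post-process": ["unzip toy_dataset.zip"]
    },
    "ground_truth": {"precise_alignment": 1, "f0": 0},
    "songs": [
        {
            "composer": "Mozart",
            "instruments": ["piano"],
            "recording": {"path": ["toy_dataset/song1.wav"]},
            "ground_truth": ["toy_dataset/song1_piano.json.gz"]
        }
    ]
}

with open("toy_dataset.json", "w") as f:
    json.dump(definition, f, indent=2)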
In our implementation, we use three different levels to describe the availability and reliability of each annotation type:

• annotation type not available;
• annotation type available and manually or mechanically annotated: this type of annotation has been added by a domain expert or by some mechanical transducer – e.g. a Disklavier;
• annotation type available and algorithmically annotated: this type of annotation has been added by exploiting a state-of-the-art algorithm.

The types of annotations currently supported are:

1. precise alignment: onset and offset times in seconds, pitches, velocities and note names for each note played in the recording, taking into account asynchronies inside chords;
2. broad alignment: same as precise alignment, but the alignment does not consider asynchronies inside chords;
3. non-aligned notes: same as precise alignment, but not aligned (see Sec. 3.4 for more information);
4. f0: the f0 of the instrument for each audio frame in the corresponding track;
5. beats non-aligned: time instants of beats in the non-aligned data;
6. instrument: the General MIDI program number associated with the instrument, starting from 0, with the value 128 indicating a drum kit.

import audioscoredataset as asd

d = asd.Dataset()
d.filter(instruments=['piano'], ensemble=False, composer='Mozart',
         ground_truth=['precise_alignment'])
audio_array, sources_array, ground_truth_array = d.get_item(1)
audio_array = d.get_mix(2)
source_array = d.get_source(2)
ground_truth_list = d.get_gts(2)
mat = d.get_score(2, score_type=['precise_alignment'])
mat = d.get_pianoroll(2, score_type=['non_aligned'])

def processing(i, dataset, *args, **kwargs):
    # process the i-th item of the dataset
    mat = dataset.get_score(i, score_type=['precise_alignment'])
    pass

d.filter(instruments=['violin']).parallel(processing, n_jobs=-1)

Listing 1: Example of usage for the official definitions

As described in Section 2.6, this project originated from music alignment research. One problem is the lack of large datasets containing audio recordings, aligned notes and symbolic non-aligned scores. The approach that we used to overcome this problem is to statistically analyze the available manual annotations and to augment the data by approximating them through a statistical model. To prevent biases, we also replaced the manual annotations with the approximated ones.

For now, the statistical analysis is simple: we compute the mean and the standard deviation of onsets and offsets for each piece. Then, we store the histogram of the standardized onsets and offsets of each note; we also store histograms of the mean and standard deviation values of each piece. To create new misaligned data, we choose a standardized value for each note, accompanied by a mean and a standard deviation for each piece, using the corresponding histograms; with these data, we can compute a non-standardized value for each note. Note that the histograms are initially normalized so that they satisfy certain given constraints: in the distributed code, both the standardized values and the standard deviations are normalized to fixed maximum values.

An additional problem is due to the fact that the time units in the aligned data are seconds, while those in the scores are note lengths – e.g. breve, semibreve and so on. Usually, one translates a note length to seconds by using the BPM; however, in some scores the BPM annotation is unavailable or is not reliable. Hence, during the statistical analysis, we always consider the tempo to be a fixed, unusual BPM value, in an attempt to minimize its overall influence.
If we had used a common BPM value instead, songs with a BPM close to that value would have biased the analysis. Moreover, models trained using the produced alignment annotations are ensured to be BPM-independent. Note that one can still try to derive BPM information by performing BPM estimation on the audio [22, 23], a process whose outcome highly depends on the algorithm's precision.

import audioscoredataset as asd

d = asd.Dataset(['path/to/directory/containing/custom/definitions',
                 'path/to/the/official/definitions/'])
d.filter(instruments=['piano'], ensemble=False, composer='Mozart',
         ground_truth=['precise_alignment'])

Listing 2: Example of usage for custom definitions
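Returning to the misaligned-data generation described above, the following toy sketch illustrates the histogram-based idea on a single piece: onsets are standardized, a histogram is built, new standardized values are sampled from it and then de-standardized. This is only a simplified illustration and not the actual ASMD implementation, which collects histograms over many pieces, samples per-piece means and standard deviations from their own histograms, and applies the normalization constraints mentioned above.

# Toy illustration of the histogram-based misalignment idea
# (not the actual ASMD implementation).
import numpy as np

def build_histogram(values, bins=20):
    counts, edges = np.histogram(values, bins=bins)
    return counts / counts.sum(), edges

def sample_from_histogram(probs, edges, size, rng):
    # pick bins according to their probability, then sample uniformly inside them
    idx = rng.choice(len(probs), p=probs, size=size)
    return rng.uniform(edges[idx], edges[idx + 1])

rng = np.random.default_rng(42)

# aligned onsets (seconds) of one piece, e.g. from a manual annotation
onsets = np.array([0.0, 0.52, 1.01, 1.49, 2.03, 2.55])
mean, std = onsets.mean(), onsets.std()
standardized = (onsets - mean) / std

# in ASMD the histograms are collected over many pieces; a single piece
# is used here only to show the mechanics
probs, edges = build_histogram(standardized)
new_standardized = sample_from_histogram(probs, edges, len(onsets), rng)

# de-standardize with a mean and standard deviation (in ASMD these are
# themselves sampled from per-piece histograms)
misaligned_onsets = np.sort(new_standardized * std + mean)
print(misaligned_onsets)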
The framework is complemented with a Python API written in Cython (https://cython.org/). In particular, it allows loading various dataset definitions besides the official ones. The API provides methods to retrieve audio and annotations in various structures, such as a matrix of notes similar to the one used by the MATLAB MIDI Toolbox [24], or piano-rolls. Thanks to the API, one can also filter the loaded datasets' songs based on the original dataset, active instruments, ensemble or solo instrumentation, composer, available annotation types, etc.

Moreover, since the API basically consists of a class representing a large dataset, it is very easy to extend it in order to use it in conjunction with the PyTorch or TensorFlow frameworks for training neural network models. In Sec. 4 we provide an example of this specific functionality.
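As a rough illustration of the kind of structures mentioned above, the following generic sketch converts a note matrix with columns [pitch, onset, offset] (in seconds) into a binary piano-roll; the column layout and conventions actually used by ASMD may differ.

# Generic illustration (not the ASMD internal format) of converting a
# note matrix with columns [pitch, onset, offset] in seconds into a
# piano-roll with a 10 ms time resolution.
import numpy as np

def notes_to_pianoroll(notes, resolution=0.01, n_pitches=128):
    """notes: array of shape (n_notes, 3) with rows [pitch, onset, offset]."""
    n_frames = int(np.ceil(notes[:, 2].max() / resolution)) + 1
    roll = np.zeros((n_pitches, n_frames))
    for pitch, onset, offset in notes:
        start = int(round(onset / resolution))
        end = int(round(offset / resolution))
        roll[int(pitch), start:end] = 1
    return roll

notes = np.array([[60, 0.0, 0.5], [64, 0.5, 1.0], [67, 1.0, 1.5]])
roll = notes_to_pianoroll(notes)
print(roll.shape)  # (128, 151)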
To give users the ability to write their own definitions without having to edit the framework code, we designed a conversion procedure as follows:

1. the creator can use the already developed conversion tools for the most common file formats (MIDI, Sonic Visualiser, etc.);
2. the creator can still write an ad-hoc function which converts a file from the original format to the ASMD one; in this case, the creator has to decorate the conversion function with a special decorator provided by ASMD;
3. the creator adds the needed conversion function to the install section of the dataset definition;
4. the user can run the conversion script for only a specific dataset or for all of them.

All the technical details are available in the official documentation.
4. USE CASES
This section demonstrates the efficacy of the ASMD framework through four different use cases.

To use the API, the user should carry out the following steps:

• import audioscoredataset;
• create an audioscoredataset.Dataset object, giving the path of the datasets.json file as an argument to the constructor;
• use the filter method on the object to filter data according to their needs (conveniently, it is also possible to re-filter the data at a later stage, without reloading datasets.json);
• retrieve elements by calling the get_item method or similar ones.

After the execution of the filter method, the Dataset instance will contain a field paths representing the list of the paths to the files requested by the user. Listing 1 shows an example of such an operation.
Whenever the user wishes to apply customized definitions, they simply need to provide the list of directories to the Dataset constructor, as shown in Listing 2.
Integrating ASMD with PyTorch is straightforward. The user has to inherit from both the PyTorch and the ASMD Dataset classes and to implement the __getitem__ method. Listing 3 shows such an example.
To add new definitions enabling users to download datasets, a user should also provide a conversion function. Listing 4 is an example of how one can write their own conversion function. However, conversion functions for the most common file types – such as MIDI and Sonic Visualiser – are already provided by the framework.
import torch
import audioscoredataset as asd
from torch.utils.data import Dataset as TorchDataset

class MyDataset(asd.Dataset, TorchDataset):

    def __init__(self, *args, **kwargs):
        super().__init__(['path/to/definitions'])
        self.filter(instruments=['piano'])

    def __getitem__(self, i):
        return torch.tensor(self.get_score(i))

    def another_awesome_method(self, *args, **kwargs):
        print("Hello, world!")

for i, mat in enumerate(MyDataset()):
    pass  # use mat here

Listing 3: Example for using ASMD inside PyTorch

import csv
from copy import deepcopy

from audioscoredataset.convert_from_file import convert, prototype_gt

@convert(['.myext'])
def function_which_converts(filename, *args, **kwargs):
    # key of the annotation level to fill in prototype_gt (e.g. 'precise_alignment')
    alignment = 'precise_alignment'
    out = deepcopy(prototype_gt)
    data = csv.reader(open(filename), delimiter=',')
    for row in data:
        out[alignment]["onsets"].append(float(row[0]))
        out[alignment]["offsets"].append(float(row[0]) + float(row[2]))
        out[alignment]["pitches"].append(int(row[1]))
    return out

Listing 4: Example for writing a custom conversion function

5. CONCLUSIONS

Future work will focus on the enhancement of the conversion and installation procedures, as well as on the definition of standards for music annotations. In addition, multimodal music processing often requires the processing of annotation types not included in this version of the framework, which could instead be handled in a future release. Some annotation types could be stored in standalone formats, and users should be able to distribute annotations focusing only on a specific ground-truth kind, thus enhancing the distributed infrastructure of ASMD.

Studying the user experience of the framework should also be a priority: for instance, users could be allowed to choose datasets also based on an estimate of the download time, since for some datasets this is a significant issue. The labels used in the annotation format are also relevant to ease the usage of the framework by new users, especially in a multidisciplinary field such as sound and music computing.

This paper presented a new framework for multimodal music processing. We hope that our efforts in easing the development of multimodal machine learning approaches for music information processing will be useful to the sound and music computing community. We are completely aware that, for a truly general and usable framework, the participation of various and different perspectives is needed, and we are therefore open to any contribution towards the creation of utilities that allow training and testing multimodal models, ensuring reasonable generalization ability and reliable reproducibility of scientific results.
Acknowledgments
We would like to thank all the people who worked on the datasets used in the ASMD framework: Bach10 [18], Maestro [12], MusicNet [15], PHENICX Anechoic [16], SMD [13], Traditional Flute Dataset [17], TRIOS [19] and Vienna 4x22 Piano Corpus [14].
6. REFERENCES

[1] T. Baltrušaitis, C. Ahuja, and L.-P. Morency, "Multimodal machine learning: A survey and taxonomy," IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 41, no. 2, pp. 423–443, 2018.
[2] S. Ntalampiras, F. Avanzini, and L. A. Ludovico, "Fusing acoustic and electroencephalographic modalities for user-independent emotion prediction," 2019, pp. 36–41.
[3] S. Ntalampiras, D. Arsić, M. Hofmann, M. Andersson, and T. Ganchev, "PROMETHEUS: heterogeneous sensor database in support of research on human behavioral patterns in unrestricted environments," Signal, Image and Video Processing, vol. 8, no. 7, pp. 1211–1231, 2012.
[4] P. K. Atrey, M. A. Hossain, A. E. Saddik, and M. S. Kankanhalli, "Multimodal fusion for multimedia analysis: A survey," J. of Multimedia Systems, vol. 16, no. 6, pp. 345–379, 2010.
[5] M. Minsky, "Logical versus analogical or symbolic versus connectionist or neat versus scruffy," AI Magazine, vol. 12, no. 2, pp. 34–51, 1991.
[6] F. Simonetta, S. Ntalampiras, and F. Avanzini, "Multimodal music information processing and retrieval: Survey and future challenges," 2019, pp. 10–18.
[7] L. A. Ludovico, A. Baratè, F. Simonetta, and D. A. Mauro, "On the adoption of standard encoding formats to ensure interoperability of music digital archives: The IEEE 1599 format," 2019, pp. 20–24.
[8] M. Baker, "1,500 scientists lift the lid on reproducibility," Nature News, vol. 533, no. 7604, p. 452, 2016.
[9] M. Hutson, "Artificial intelligence faces reproducibility crisis," Science, vol. 359, no. 6377, pp. 725–726, 2018.
[10] Y. S. Abu-Mostafa, M. Magdon-Ismail, and H.-T. Lin, Learning From Data. AMLBook, 2012.
[11] R. M. Bittner, M. Fuentes, D. Rubinstein, A. Jansson, K. Choi, and T. Kell, "mirdata: Software for reproducible usage of datasets," 2019.
[12] C. Hawthorne, A. Stasyuk, A. Roberts, I. Simon, C.-Z. A. Huang, S. Dieleman, E. Elsen, J. H. Engel, and D. Eck, "Enabling factorized piano music modeling and generation with the MAESTRO dataset," 2019.
[13] M. Müller, V. Konz, W. Bogler, and V. Arifi-Müller, "Saarland music data (SMD)," 2011.
[14] W. Goebl, "The Vienna 4x22 piano corpus," 1999. [Online]. Available: http://dx.doi.org/10.21939/4X22
[15] J. Thickstun, Z. Harchaoui, D. P. Foster, and S. M. Kakade, "Invariances and data augmentation for supervised music transcription," 2018, pp. 2241–2245.
[16] M. Miron, J. J. Carabias-Orti, J. J. Bosch, E. Gómez, and J. Janer, "Score-informed source separation for multichannel orchestral recordings," J. Electrical and Computer Engineering.
[18] J. Sel. Topics Signal Processing, vol. 5, no. 6, pp. 1205–1215, 2011.
[19] J. Fritsch, "The TRIOS score-aligned multitrack recordings dataset," 2012. [Online]. Available: https://c4dm.eecs.qmul.ac.uk/rdr/handle/123456789/27
[20] F. Simonetta, F. Carnovalini, N. Orio, and A. Rodà, "Symbolic music similarity through a graph-based representation," 2018.
[21] F. Simonetta, C. Cancino-Chacón, S. Ntalampiras, and G. Widmer, "A convolutional approach to melody line identification in symbolic scores," 2019.
[22] H. Schreiber and M. Müller, "A single-step approach to musical tempo estimation using a convolutional neural network," 2018, pp. 98–105.
[23] S. Böck, F. Krebs, and G. Widmer, "Accurate tempo estimation based on recurrent neural networks and resonating comb filters," 2015, pp. 625–631.
[24] T. Eerola and P. Toiviainen, "MIR in Matlab: The MIDI Toolbox," in 5th Int. Conf. on Music Information Retrieval (ISMIR).