The MIDI Degradation Toolkit: Symbolic Music Augmentation and Correction
Andrew McLeod
EPFL [email protected]
James Owers
University of Edinburgh [email protected]
Kazuyoshi Yoshii
Kyoto University [email protected]
ABSTRACT
In this paper, we introduce the MIDI Degradation Toolkit (MDTK), containing functions which take as input a musical excerpt (a set of notes with pitch, onset time, and duration), and return a "degraded" version of that excerpt with some error (or errors) introduced. Using the toolkit, we create the Altered and Corrupted MIDI Excerpts dataset version 1.0 (ACME v1.0), and propose four tasks of increasing difficulty to detect, classify, locate, and correct the degradations. We hypothesize that models trained for these tasks can be useful in (for example) improving automatic music transcription performance if applied as a post-processing step. To that end, MDTK includes a script that measures the distribution of different types of errors in a transcription, and creates a degraded dataset with similar properties. MDTK's degradations can also be applied dynamically to a dataset during training (with or without the above script), generating novel degraded excerpts each epoch. MDTK could also be used to test the robustness of any system designed to take MIDI (or similar) data as input (e.g. systems designed for voice separation, metrical alignment, or chord detection) to such transcription errors or otherwise noisy data. The toolkit and dataset are both publicly available online, and we encourage contribution and feedback from the community.
1. INTRODUCTION
Music language models (MLMs) have been the subject of much research in recent years. In the most general terms, their goal is to learn the structure of a typical piece of music, usually in symbolic form, as either a piano roll or a (monophonic or polyphonic) sequence of notes. Such models can be designed either as a stand-alone system (i.e. to perform a specific task such as voice separation, metrical alignment, or chord detection), or as part of an automatic music transcription (AMT) system along with an acoustic model.

In AMT systems, MLMs have thus far led to only small increases in performance compared to state-of-the-art acoustic models by themselves [9]. One possible reason is that such MLMs are typically run at the frame level, rather than at the note level or the beat level [20]. (It is debatable whether frame-based models should be called "language" models at all, since they do not work at a step related to the language, e.g. musical notes or beats, but rather at the frames of the acoustic model; that distinction, however, is not the focus of this work.) Regardless, even beat- or note-level MLMs have not led to very large improvements by themselves (e.g. [19, 21]). One approach to solving this issue has been proposed in [22], where a separate "blending model" is used to combine the acoustic model with the MLM. The blending model leads to a small but significant increase in performance over using the acoustic model only.

Another possible reason for their minimal improvement is that such MLMs are not directly trained to solve the task at hand, namely correcting errors produced by the acoustic model. That is, they are not discriminative models taking data with errors as input and producing the correct transcription as output. Instead, they are typically trained to model the distribution of clean (usually MIDI) data, and used to alter the probabilistic predictions of the acoustic model. The integration of such an MLM into an AMT system usually involves searching through a large space of possible output transcriptions. One potential solution to this problem (at least when using an RNN-based MLM) is to train the model with scheduled sampling [3], which uses its own (noisy) outputs during training, teaching it to recover from such mistakes. In fact, the MLM from [22] is trained using scheduled sampling. However, this training strategy is only designed to allow the MLM to recover after a mistake, rather than to recognize and correct a mistake directly.

Training a discriminative model which "cleans" the output of an acoustic model is only feasible in the presence of a dataset mapping degraded data to clean data. Whilst this dataset could be produced by running an acoustic model on a dataset mapping audio to the correct transcription, such datasets are small relative to the amount of clean MIDI data available elsewhere. Our MDTK package allows the user to take any clean data and degrade it to have musical errors of their choosing. The pool of clean MIDI data is many orders of magnitude larger than that which maps audio to transcription data. For example, MAESTRO [7] has aligned MIDI and audio data of ~1,300 performances totalling ~200 hours. In comparison, the Lakh MIDI Dataset [14] comprises ~175,000 MIDI files totalling ~9,000 hours (the true number of usable files is slightly smaller, as some of these MIDI files are known to be corrupted). This is over 40 times the size, and additionally spans diverse genres. Using a dataset such as Lakh MIDI, MDTK allows for the creation of datasets large enough to make the direct discriminative task feasible.
In addition, as we will discuss later in Section 2.2, there is no need to restrict learning capability by explicitly creating a degraded dataset: MDTK's Degrader objects can be used to degrade clean input dynamically when loading it into the model, thus providing on-the-fly data augmentation and enabling the model to be trained on a degraded dataset which is essentially unlimited in size.

This process is analogous to performing learned data augmentation: MDTK makes the discriminative task of correcting errors feasible by increasing the effective size of the dataset. Data augmentation has proved essential in other fields. In [5], the authors advocate the automated application of data augmentation for the ImageNet task [6], a classification task for image data. They find that by automatically tuning the type of data augmentation they apply for each task, they can attain a significant improvement over the state of the art. In [1], the authors explicitly investigate the effects of generating augmented data in low-data regimes, advocating the use of learned generators (essentially what MDTK's Degrader objects are) using GANs. Finally, in [17], the authors address their low-data regime for environmental sound classification by using data augmentation, finding that augmentations such as pitch shifting and time stretching lead to a 6 percentage point boost in classification accuracy. MDTK enables similar data augmentation techniques to be performed easily for AMT.

For non-AMT tasks, standalone MLMs typically take MIDI files as input and output some alignment or label, depending on the task. To our knowledge, the robustness of these MLMs to noisy or incorrect data is rarely, if ever, analysed. This is not necessarily an important factor when clean MIDI files are used as input, but when such a MIDI file is the result of a noisy process such as AMT or Optical Music Recognition (OMR; e.g. [18]), a model's robustness to noise becomes an important piece of information.

We propose that both of these shortcomings (poor AMT post-processing, and the unanalysed robustness of MLMs to noise) can be addressed using excerpts of music to which noise is added. In an AMT system, a post-processing model which is trained directly to identify and correct similar noise should be better able to correct noisy acoustic model outputs than a generic MLM. Likewise, the robustness of a standalone MLM to noisy input can be analyzed with such noisy data, allowing the MLM to be evaluated for its potential usefulness in downstream tasks such as those involved in creating a complete piece of sheet music given an audio signal.

In this paper, we introduce the MIDI Degradation Toolkit (MDTK), a set of tools to easily introduce controlled noise into excerpts automatically extracted from a set of MIDI files. MDTK is similar to the Audio Degradation Toolbox [11] for audio, but to the authors' knowledge, ours is the first toolkit of its kind for MIDI data. The controlled noise includes (1) shifting the pitch of a note; (2) lengthening, shortening, or shifting a note in time; (3) adding or removing a note; and (4) splitting or joining notes.

We also introduce the Altered and Corrupted MIDI Excerpts dataset version 1.0 (ACME v1.0), containing MIDI excerpts which have been degraded (and some which have not) using the toolkit, and four new tasks of increasing difficulty: to (1) detect whether each excerpt has been degraded; (2) if so, classify what degradation has been applied and (3) locate where a degradation has taken place; and (4) recover the original excerpt.

We present a simple baseline model for each task and analyse its performance. These baselines are provided as an easy starting point for researchers wanting to attempt our proposed tasks or post-process their own AMT data. We provide evaluation metrics for assessment and postulate that, if high performance were achieved, we would be able to improve AMT output using models trained for these tasks. We can easily swap out ACME v1.0 for a dataset matching the errors of a specific AMT system using a provided script.
2. THE TOOLKIT
The MIDI Degradation Toolkit (MDTK) is a Python package, installable with pip, which can be used to introduce errors to MIDI excerpts. The code is released open source under an MIT License, and is available online. We encourage feedback and contribution from the community in its continued development.

Internally, MDTK stores each excerpt as a set of notes in a Pandas [12] DataFrame with columns pitch (MIDI pitch, with C4 = 60), onset (the onset time of the note, in milliseconds), track, and dur (the duration of the note, in milliseconds), all integers. It contains functionality to load an excerpt from a MIDI file (using pretty_midi [15]), as well as to read from and write to a CSV file.
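As a concrete illustration of this note-level format, the following sketch builds such a DataFrame directly with pandas; MDTK's own MIDI and CSV loading helpers are not shown, since their exact names are not spelled out here.

    import pandas as pd

    # A toy excerpt in the note-level format described above: one row per note,
    # with integer columns onset (ms), track, pitch (MIDI number, C4 = 60),
    # and dur (ms).
    excerpt = pd.DataFrame({
        "onset": [0, 500, 1000],
        "track": [0, 0, 0],
        "pitch": [60, 64, 67],
        "dur":   [500, 500, 1000],
    })

    # The same format round-trips through CSV, which is how excerpts are
    # stored on disk in the dataset described in Section 3.
    excerpt.to_csv("excerpt.csv", index=False)
    print(pd.read_csv("excerpt.csv"))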
Each degradation provided in MDTK takes as input a pandas DataFrame of an excerpt of music, and returns a DataFrame with the given degradation. Some degradations (e.g. removing a note from an empty excerpt) are not always possible. In such cases, a warning is printed and None is returned. Care is also taken to ensure that no overlaps on the same pitch are introduced by a degradation. There are a total of 8 degradations in MDTK, each of which is described below.

The pitch_shift degradation changes the pitch of a random note. By default, the new pitch is chosen uniformly at random from all possible pitches (a minimum and maximum pitch can be given, and the valid range defaults to 21–108 inclusive). It can also be drawn from a weighted distribution of intervals around the original pitch, for example to emphasize octave errors from overtones. We also include a flag to force the new pitch to align with the pitch of some other note in the excerpt, to reduce out-of-key shifts, if desired.

Three degradations shift a random note in time in some way: onset_shift changes the note's onset time, leaving its offset time unchanged; offset_shift changes the note's offset time, leaving its onset time unchanged; and time_shift changes the note's onset and offset times by the same amount, leaving its duration unchanged. For all of these degradations, care is taken to ensure that the shifted note does not lie outside the excerpt's initial time range. A minimum and maximum resulting duration can be specified, as well as a minimum and maximum shift amount. We also include flags to align some combination of the shifted note's onset or duration with those of other notes from the excerpt, ensuring the note lies on some metrical grid, if desired.

Two degradations can be used to either add a random note to an excerpt (add_note) or remove a random note from an excerpt (remove_note). Flags to align an added note's pitch, onset, or duration to those of existing notes are included.

Two degradations can be used either to split a note into multiple shorter consecutive notes or to combine consecutive notes at the same pitch into a single longer note. Specifically, split_note will cut a random note into some number of consecutive notes of shorter duration (the first of which begins at the original note's onset time and the last of which ends at the original note's offset time). By default the note is split into two shorter notes, but this, as well as a minimum allowable duration for the resulting notes, can be set with a parameter. Similarly, join_notes takes two or more consecutive notes at the same pitch (with a maximum allowable gap, set with a parameter, allowed between them), and joins them into a single note with onset time equal to that of the first note and offset time equal to that of the last.
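The degradation functions share a common interface (DataFrame in, DataFrame or None out). The sketch below shows how one of them might be called; the module path mdtk.degradations is an assumption about the package layout, and no keyword arguments beyond the excerpt itself are shown.

    import pandas as pd
    from mdtk import degradations  # assumed module path

    excerpt = pd.DataFrame({
        "onset": [0, 500, 1000],
        "track": [0, 0, 0],
        "pitch": [60, 64, 67],
        "dur":   [500, 500, 1000],
    })

    # Shift the pitch of one randomly chosen note. If a degradation cannot be
    # applied (e.g. remove_note on an empty excerpt), the function warns and
    # returns None, so the result should be checked before use.
    degraded = degradations.pitch_shift(excerpt)
    if degraded is not None:
        print(degraded)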
MDTK includes the Degrader class, which can be used to degrade excerpts dynamically. When instantiating a Degrader object, the proportion of excerpts that should remain undegraded is set with a parameter (which can be 0). The probability of each degradation being performed on an excerpt (if it is to be degraded) can also be set at this time. Then, each time Degrader.degrade(excerpt) is called, a randomly degraded version of the input excerpt is generated according to the proportions set during object creation. The Degrader class can be easily inserted into any model training procedure in order to dynamically create new degraded excerpts during each epoch, dramatically increasing the amount of data available for training.
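A minimal sketch of that pattern follows. The import path and the use of constructor defaults are assumptions; the text only specifies that the undegraded proportion and the per-degradation probabilities are constructor parameters, and that Degrader.degrade(excerpt) produces one randomly degraded copy.

    import pandas as pd
    from mdtk.degradations import Degrader  # assumed import path

    train_excerpts = [
        pd.DataFrame({"onset": [0, 500], "track": [0, 0],
                      "pitch": [60, 64], "dur": [500, 500]}),
        # ... the rest of the clean training set ...
    ]

    # Default proportions; the undegraded fraction and per-degradation
    # probabilities can be set via constructor parameters (names omitted here).
    degrader = Degrader()

    for epoch in range(3):
        for clean in train_excerpts:
            noisy = degrader.degrade(clean)  # a fresh degradation every epoch
            # A real setup would now take a training step on the
            # (noisy input, clean target) pair, e.g. model.train_step(noisy, clean).
            print(epoch, len(noisy))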
MDTK includes a measure_errors.py script, which can be used to estimate the types of errors (specifically, as degradations) typical of a particular AMT system, given a set of transcriptions and ground truths from that system. Note that there is no unique set of degradations which reproduces the errors that a transcription system has made (e.g., any shift degradation can be trivially replaced by a remove_note and an add_note). We make no claim that the degradations found by the script correspond to the exact causes of the errors made by the AMT system; rather, only that the distribution of degradations produces excerpts with similar properties to those transcribed by that system. Nonetheless, the script finds what we believe is a reasonable set of degradations to have produced those errors, using a simple heuristic-based approach. Notes are first matched as correct if possible (same pitch, and onset and offset within a changeable threshold), and the remaining notes are checked for the various degradations in the following order: (1) join_notes and split_note, either of which may include an additional offset_shift or onset_shift; (2) offset_shift, if the pitch and onset time match; (3) onset_shift, if the pitch and offset time match; (4) time_shift, but only if the transcribed note overlaps the position of the corresponding ground truth note; and (5) pitch_shift, which must match in onset time, although an additional offset_shift can be added. Finally, any remaining unmatched notes are counted as add_note and remove_note.

The output of the script is a JSON file containing the estimated proportion of each degradation in the given set of transcriptions. It does not yet include values for the various degradation parameters (though this is planned for a future update to MDTK). This output file can be used, for example, to create a custom-tuned, static, degraded dataset for training a model. However, the two tools can also be combined in powerful ways. By passing this JSON file to the Degrader constructor, a Degrader object can be instantiated that generates degradations exactly matching the estimated proportions. This could then be used to train a model to correct the errors of that specific AMT system using a relatively small amount of raw data.
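A sketch of that combination follows. Only the script name, its JSON output, and the Degrader class come from the text; the file name and the way the JSON is handed to the constructor are assumptions about the interface.

    import json
    from mdtk.degradations import Degrader  # assumed import path

    # measure_errors.py, run beforehand on an AMT system's transcriptions and
    # their ground truths, writes the estimated proportion of each degradation
    # to a JSON file (assumed here to have been saved as "errors.json").
    with open("errors.json") as f:
        degradation_proportions = json.load(f)

    # The text states that this JSON can be passed to the Degrader constructor
    # so that generated degradations match the measured proportions; unpacking
    # it as keyword arguments is an assumption about that interface.
    degrader = Degrader(**degradation_proportions)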
3. THE DATASET

3.1 ACME version 1.0
The Altered and Corrupted MIDI Excerpts dataset v1.0 (ACME v1.0) is a dataset of 5-second excerpts with degradations implemented by MDTK. It is not intended to emulate the errors of any specific AMT system, but rather to serve as a starting point for the modelling tasks we introduce below.

The dataset is taken from two sources: (1) the piano-midi dataset, which contains 328 MIDI files of pseudo-live performance piano pieces of various styles (generally classical; the files are quantized and beat-aligned, but their tempo curves were manually edited by their creator to sound more like live performance); and (2) the 22,194 primes from the small, medium, and large sections of the monophonic and polyphonic Patterns for Prediction Development Datasets (PPDD-Sep2018), which contain excerpts drawn randomly from the Lakh MIDI Dataset (LMD) [14].

We remove track information, flattening each excerpt to a single track, which simplifies the modelling tasks (the use of tracks is not standard across our different data sources; analysis of multi-track MIDI files will be addressed in future work). We then fix any pair of overlapping notes of the same pitch by cutting the first note at the onset time of the second. We additionally set the offset time of the second note to the maximum of the original offset times of the two notes, such that no sustain is removed (a sketch of this rule is given at the end of this section).

Once this pre-processing is complete, we select a ~5-second excerpt from each piece by choosing a random note and all notes beginning in the subsequent 5 seconds, requiring that at least 10 notes be present. The excerpt ends when the last held note is released. This duration is approximately 2 bars for most songs, so it is small enough for the models proposed in Section 4 to train quickly. We degrade a fixed proportion of the excerpts, selecting the degradation uniformly at random from the set of 8 defined degradations, and leave the remainder undegraded. For ACME v1.0, we use default parameter settings for all degradations, although we intend to investigate the effect of different settings in future work (and future releases of ACME datasets).

The excerpts and degraded excerpts are split randomly into training, validation, and test sets, creating the official splits for ACME v1.0. The canonical form is available online as a set of CSV files. Additionally, the MDTK package includes the make_dataset.py script which we used to create the dataset from scratch, including the automatic downloading of the raw data, and which thus serves as a record of how the dataset was created.

The make_dataset.py script can also be used to generate an ACME-style dataset from a user-provided set of MIDI or CSV files. The user can specify custom sizes for the excerpts, a custom distribution of the various degradations, as well as custom parameters for each. The script can be given the JSON output of the measure_errors.py script in order to match the properties of the generated dataset with those measured from an AMT result. Alternatively, a user can simply choose to degrade individual excerpts from their own training set by calling MDTK during the training process, either manually or randomly using the Degrader class.
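The same-pitch overlap fix described earlier in this section is simple enough to sketch directly in pandas; the following is an illustrative re-implementation of the rule, not the exact code used to build ACME v1.0.

    import pandas as pd

    def fix_same_pitch_overlaps(excerpt: pd.DataFrame) -> pd.DataFrame:
        """Cut the first of two overlapping same-pitch notes at the onset of
        the second, extending the second so that no sustain is removed."""
        notes = excerpt.sort_values(["pitch", "onset"]).reset_index(drop=True)
        for i in range(len(notes) - 1):
            first, second = notes.loc[i], notes.loc[i + 1]
            if first["pitch"] != second["pitch"]:
                continue
            first_off = first["onset"] + first["dur"]
            second_off = second["onset"] + second["dur"]
            if first_off > second["onset"]:  # same-pitch overlap
                # Truncate the first note at the second note's onset time...
                notes.loc[i, "dur"] = second["onset"] - first["onset"]
                # ...and give the second note the later of the two offsets.
                notes.loc[i + 1, "dur"] = max(first_off, second_off) - second["onset"]
        return notes

    excerpt = pd.DataFrame({"onset": [0, 250], "track": [0, 0],
                            "pitch": [60, 60], "dur": [500, 100]})
    print(fix_same_pitch_overlaps(excerpt))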
4. PROPOSED TASKS

4.1 Motivation
These tasks are performed on ACME v1.0, and are proposed in lieu of taking existing AMT systems and measuring their improvement when trained with the assistance of MDTK. It is proposed that the output of arbitrary AMT systems could be improved with models that can solve these tasks. For instance, we could use a model trained to classify the error contained within a given excerpt to call out for human intervention. We could also train models to perform the actual fix; however, we show that, with the models we have chosen for our baseline, this problem is far from solved.
Figure 1. Example piano rolls of a clean excerpt (left) being degraded with pitch_shift (right), including the labels for Error Location (top right).
We propose four tasks of increasing difficulty. Figure 1 shows a simple toy data point which has been pitch shifted (changed note in red). We will use it as an example when necessary throughout this section. We should note that the tasks we introduce here are not in any way trivial, but represent significant steps towards successful AMT post-processing.
1. Error Detection: detect whether a given excerpt has been degraded. This is a binary classification task with a skewed distribution: most excerpts are degraded (the positive class), and the rest are not degraded (the negative class). We evaluate performance using F-measure but, since the negative class is the minority, for the purposes of F-measure evaluation we treat undegraded excerpts as the positives. Thus, a model which always outputs "degraded" achieves a "reverse F-measure" of 0 (with precision and recall both 0) rather than the high F-measure it would achieve under the standard definition (where its recall would be 1 and its precision equal to the proportion of degraded excerpts).
2. Error Classification: specify which degradation (if any) was performed on each excerpt. This is a multi-class classification problem, and since ACME v1.0 contains a uniform distribution of each class, we evaluate performance using accuracy and a confusion matrix to show specific error tendencies for each degradation.
3. Error Location: assign a binary label to each (40 ms) frame of input identifying whether it contains an error, i.e. whether this frame contains a degradation. We evaluate performance using the standard F-measure. The labels for this task are shown in the top right of Figure 1.
4. Error Correction: output the original, un-degraded version of each excerpt. In Figure 1, a model is given the degraded excerpt (right) and expected to output the original excerpt (left). For this task, we define our own metric, helpfulness (H), based on two F-measures proposed by [2]: frame-based F-measure with 40 ms frames, and note-based onset-only F-measure. We use the mir_eval [16] implementation of note-based F-measure (with 50 ms onset tolerance) to evaluate both the given excerpt and the system's output compared to the original excerpt. We take the average of the two F-measures for each excerpt, which we denote F_g (for the given excerpt) and F_c (for the system's corrected output). If F_g = 1 (the given excerpt was not degraded), H = F_c. If the given excerpt was degraded (F_g < 1), H is calculated as in Equation (1):

H = \begin{cases} 1 - \frac{1}{2}\,\frac{1 - F_c}{1 - F_g}, & F_c \geq F_g \\ \frac{1}{2}\,\frac{F_c}{F_g}, & F_c < F_g \end{cases}    (1)

An intuition for this calculation is as follows: H = 0.5 represents an output which is exactly as accurate as the given excerpt (the error correction system has neither helped nor hurt), and H scales linearly up to 1 and down to 0 from there. For example, H = 0.75 represents an output which is in some sense twice as accurate as the given excerpt (its error, 1 − F_c, is half of the given excerpt's error, 1 − F_g). Similarly, values of H below 0.5 indicate an output which is less accurate than the given excerpt, scaling linearly down to 0 as F_c approaches 0.
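Written out as code, Equation (1) becomes the short function below; this is a reference sketch of the metric as defined above (the function name and the example values are ours), assuming the two F-measures have already been computed.

    def helpfulness(f_given: float, f_corrected: float) -> float:
        """Helpfulness H of a correction system's output, given the average
        F-measure of the degraded input (F_g) and of the corrected output
        (F_c), both measured against the original clean excerpt."""
        if f_given == 1.0:                  # input was not degraded
            return f_corrected
        if f_corrected >= f_given:          # output at least as good as input
            return 1.0 - 0.5 * (1.0 - f_corrected) / (1.0 - f_given)
        return 0.5 * f_corrected / f_given  # output worse than input

    print(helpfulness(0.8, 0.8))  # 0.5: neither helped nor hurt
    print(helpfulness(0.8, 0.9))  # 0.75: the output has half the input's error
    print(helpfulness(0.8, 1.0))  # 1.0: perfect correction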
For input into our baseline models, we first quantize each excerpt onto 40 ms frames, rounding note onsets and offsets to the nearest frame. We use two different input formats for our baseline models, and provide data conversion and loading functions for each of them.

The command format is based on the one designed by [13]. Each excerpt is converted into a sequence of one-hot vectors representing commands from a pre-defined vocabulary of 356 commands: note_on(p), note_off(p), and shift(t), with p ∈ [0, 127] and t ∈ [1, 100] (128 + 128 + 100 = 356). The note on and off commands represent note onsets and offsets at the current frame, and the shift command skips t frames. Longer shifts are represented by multiple shift commands.

The piano roll format represents each excerpt as two binary piano rolls: one representing pitch presence in each frame, and another representing pitch onsets in each frame. These two piano rolls are concatenated together frame-wise to form the model's input.

The descriptions of the models provided in this paper are brief. For code which fully defines the models, as well as the code used to train and evaluate them, see the repository. Our choice of models was relatively arbitrary; they are easy to implement with existing open source packages and easy to improve upon.

Our baseline for Error Detection uses the command format as input. It consists of an embedding layer of size 128, followed by a basic Long Short-Term Memory (LSTM) network [8]. Dropout is applied to the final LSTM state's output, which is then passed to a fully-connected layer of size 2 with softmax activation, resulting in a single output for each input sequence.

Our Error Classification baseline uses the same design, but with output dimensionality 9 for the final layer (one for each degradation plus one for no degradation).
For Error Location, we use the piano roll format. We first feed the input frames into a bi-directional LSTM (Bi-LSTM), and send the output of each Bi-LSTM state (with dropout) into 3 linear layers, each with dropout and ELU activation. These are each fed into a final fully-connected layer of size 2 with softmax activation, resulting in one output per input frame.

For Error Correction, we also use the piano roll format, and base our network on a basic Encoder-Decoder structure [4], where both the encoder and the decoder are Bi-LSTMs. The input is passed directly into the encoder Bi-LSTM, and the output at each frame is passed through a single fully connected layer with dropout. This sequence is input into the decoder Bi-LSTM, each output of which is fed into 4 linear layers which output a vector of the same length as the input.

The models were trained using the Adam optimizer [10], and a grid search was performed for weight decay, learning rate, LSTM hidden-unit size, and linear layer sizes (for full details, see the code). The model with the lowest validation loss on each task is used as the baseline.

To gauge the difficulty of each task, we compare each of the baseline models to a simple rule-based approach. Like our baseline models, the rule-based systems output probability values in [0, 1]. For Error Detection, the rule-based system returns a fixed probability that each data point is degraded. For Error Classification, the rule-based system outputs a fixed probability for each class. For Error Location, the rule-based system outputs a constant probability that each frame has been degraded, set to the proportion of frames that are degraded in the training set. Finally, for Error Correction, we calculate p(1|0) and p(1|1) from the training set and have the system output these values for each cell in a given piano roll; that is, the empirical probabilities that a cell containing a 0 or a 1 (respectively) in a degraded piano roll maps to a 1 in the corresponding cell of the clean piano roll.

The results for each task on the ACME v1.0 test set are shown in Table 1.

Task                  Model       Loss    Metric
Error Detection       Rule-based  0.466   0.000
                      Baseline
Error Classification  Rule-based
                      Baseline
Error Location        Rule-based  0.404   0.000
                      Baseline
Error Correction      Rule-based
                      Baseline    0.693   0.000

Table 1. Loss and evaluation metric for the baseline and rule-based models for each task on the ACME v1.0 test set. Each task's metric is different, as explained in the text.

From the losses, it is clear that the baseline models have learned something, since all of their losses are lower than the rule-based losses except in Error Correction. However, from the metrics, it is also clear that there is much room for improvement on each of the proposed tasks (as we would hope).

For Error Detection, the baseline predicts 1 (degraded) for every data point, just like the rule-based system, likely because of the skew of the training data. As a simple attempt to overcome this tendency, we trained another model identical to the baseline which weights the loss of each data point inversely proportionally to that label's frequency in the training set. This results in a model with greater overall loss (as expected), but which outputs some 0s, achieving a reverse F-measure of 0.155. Overcoming the skew of the dataset may prove to be a challenge for this task.

Figure 2. Left: Confusion matrix showing the distribution of the baseline Error Classification model's classifications, normalized by true label. Rows show the true label, and columns show the predicted label. Right: The baseline Error Location model's F-measure for each degradation type.

For Error Classification, the baseline achieves an accuracy greater than that of the rule-based system. The baseline's confusion matrix is shown in Figure 2 (left), where rows represent the ground truth label and columns represent its output. This shows error tendencies, and (more importantly) gives an idea of the general difficulty of detecting each degradation. Here, it can be seen that the maximum point in each column is always on the diagonal, showing that the model does seem to have learned something sensible. It performs well on the add_note degradation. Pitch shift, time shift, and remove note seem to be the most difficult, while join_notes is a common target for false positives. We are interested to see whether the above trends continue in future work on Error Classification, and intend to further investigate their causes.

The Error Location baseline outperforms the rule-based system in terms of both loss and F-measure by wide margins. It achieves this F-measure with a precision of 0.844 and a recall of 0.381, so although it rarely guesses that a frame has been degraded, it is usually correct when it does. Figure 2 (right) presents the baseline's F-measure split by degradation type, which shows the model performing best on add_note, but also well for onset and offset shifts (with high precision for all three).
It is slightly worse with pitch and time shifts, and performs poorly on the other degradations (the value for "none" will always be 0, since it has no positives). Given the relative success of this model compared to the other tasks' baselines, pre-training a model for this task before continuing to train it for another task might be an avenue for improved performance. Another strategy could be to use a model trained for this task as an attention mechanism for some of the other tasks.

Error Correction is clearly the most difficult task of the four, and the baseline model's performance reflects this. Although its loss is similar to that of the rule-based system, its helpfulness lags clearly behind. The rule-based model's strategy of (essentially) reproducing the input turns out to be a strong baseline. Our baseline, on the other hand, almost always outputs empty piano rolls, no matter the input. The difficulty of this task might require a more modular approach than the presented end-to-end baseline, perhaps combining the results of models from tasks 2 and 3 with a system designed to correct a specific degradation affecting a specific set of frames.
5. CONCLUSION
In this paper, we have introduced the MIDI Degradation Toolkit (MDTK), which contains tools to "degrade" (introduce errors to) MIDI excerpts. The toolkit is publicly available online under an MIT License, and we encourage contributions and feedback from the community. Using MDTK, we have created the Altered and Corrupted MIDI Excerpts v1.0 (ACME v1.0) dataset and include in MDTK a tool to create custom ACME-style datasets with different settings or data. We have proposed a set of four new tasks of increasing difficulty involving such datasets: Error Detection, Classification, Location, and Correction, and designed evaluation metrics and scripts for each of them. We also designed and presented simple models to be used as a baseline for each, which show that the proposed tasks are non-trivial, and may require innovative solutions.

The toolkit is ready to be used for improving Automatic Music Transcription (AMT). To do so, a user can:

1. use measure_errors.py to analyse the types of errors made by an AMT system or acoustic model;
2. instantiate a Degrader with the configuration produced by measure_errors.py, which can generate unlimited data matching the errors made by the system from step (1);
3. train a discriminative model using data generated by the Degrader;
4. apply that model to the output of the system from step (1) and evaluate the difference in performance.

As performance on the proposed tasks modelling ACME v1.0 improves, we intend to introduce ACME v2.0 with additional features such as multi-track excerpts, a track-based degradation, longer excerpts, multiple degradations per excerpt, and various parameter settings for the degradations. We also intend to analyze the effect of adding noise on MLM performance.
6. ACKNOWLEDGEMENTS
Authors 1 and 2 contributed equally to this work. This work is supported in part by JST ACCEL No. JPMJAC1602 and JSPS KAKENHI Nos. 16H01744 and 19H04137.

7. REFERENCES

[1] Antreas Antoniou, Amos Storkey, and Harrison Edwards. Data augmentation generative adversarial networks, 2017.
[2] Mert Bay, Andreas F. Ehmann, and J. Stephen Downie. Evaluation of multiple-F0 estimation and tracking systems. In International Society for Music Information Retrieval Conference (ISMIR), pages 315–320, 2009.
[3] Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems, pages 1171–1179, 2015.
[4] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1724–1734, 2014.
[5] Ekin D. Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V. Le. AutoAugment: Learning augmentation strategies from data. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[6] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
[7] Curtis Hawthorne, Andriy Stasyuk, Adam Roberts, Ian Simon, Cheng-Zhi Anna Huang, Sander Dieleman, Erich Elsen, Jesse Engel, and Douglas Eck. Enabling factorized piano music modeling and generation with the MAESTRO dataset. In International Conference on Learning Representations, 2019.
[8] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.
[9] Rainer Kelz, Matthias Dorfer, Filip Korzeniowski, Sebastian Böck, Andreas Arzt, and Gerhard Widmer. On the potential of simple framewise approaches to piano transcription. In International Society for Music Information Retrieval Conference (ISMIR), pages 475–481, 2016.
[10] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference for Learning Representations, 2015.
[11] Matthias Mauch and Sebastian Ewert. The audio degradation toolbox and its application to robustness evaluation. In International Society for Music Information Retrieval Conference (ISMIR), 2013.
[12] Wes McKinney. Data structures for statistical computing in Python. In Stéfan van der Walt and Jarrod Millman, editors, Proceedings of the 9th Python in Science Conference, pages 51–56, 2010.
[13] Sageev Oore, Ian Simon, Sander Dieleman, Douglas Eck, and Karen Simonyan. This time with feeling: Learning expressive musical performance. Neural Computing and Applications, 32(4):955–967, 2020.
[14] Colin Raffel. Learning-Based Methods for Comparing Sequences, with Applications to Audio-to-MIDI Alignment and Matching. PhD thesis, Columbia University, 2016.
[15] Colin Raffel and Daniel P. W. Ellis. Intuitive analysis, creation and manipulation of MIDI data with pretty_midi. In International Society for Music Information Retrieval Conference Late Breaking and Demo Papers, pages 84–93, 2014.
[16] Colin Raffel, Brian McFee, Eric J. Humphrey, Justin Salamon, Oriol Nieto, Dawen Liang, and Dan P. W. Ellis. mir_eval: A transparent implementation of common MIR metrics. In International Society for Music Information Retrieval Conference (ISMIR), 2014.
[17] Justin Salamon and Juan P. Bello. Deep convolutional neural networks and data augmentation for environmental sound classification. IEEE Signal Processing Letters, 24(3):279–283, 2017.
[18] Eelco van der Wel and Karen Ullrich. Optical music recognition with convolutional sequence-to-sequence models. In International Society for Music Information Retrieval Conference (ISMIR), pages 731–737, 2017.
[19] Qi Wang, Ruohua Zhou, and Yonghong Yan. Polyphonic piano transcription with a note-based music language model. Applied Sciences, 8(3), 2018.
[20] Adrien Ycart and Emmanouil Benetos. A study on LSTM networks for polyphonic music sequence modelling. In International Society for Music Information Retrieval Conference (ISMIR), pages 421–427, 2017.
[21] Adrien Ycart and Emmanouil Benetos. Polyphonic music sequence transduction with meter-constrained LSTM networks. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 386–390, 2018.
[22] Adrien Ycart, Andrew McLeod, Emmanouil Benetos, and Kazuyoshi Yoshii. Blending acoustic and language model predictions for automatic music transcription. In International Society for Music Information Retrieval Conference (ISMIR), 2019.