A New Dataset for Amateur Vocal Percussion Analysis
Alejandro Delgado
Roli Ltd., London, England, [email protected]
SKoT McDonald
Roli Ltd., London, England, [email protected]
Ning Xu
Roli Ltd., London, England, [email protected]
Mark Sandler
Queen Mary University of London, London, England, [email protected]
ABSTRACT
The imitation of percussive instruments via the human voice is a natural way for us to communicate rhythmic ideas and, for this reason, it attracts the interest of music makers. Specifically, the automatic mapping of these vocal imitations to their emulated instruments would allow creators to realistically prototype rhythms in a faster way. The contribution of this study is two-fold. Firstly, a new Amateur Vocal Percussion (AVP) dataset is introduced to investigate how people with little or no experience in beatboxing approach the task of vocal percussion. The end goal of this analysis is to help mapping algorithms generalise better between subjects and achieve higher performance. The dataset comprises a total of 9780 utterances recorded by 28 participants, with fully annotated onsets and labels (kick drum, snare drum, closed hi-hat and opened hi-hat). Secondly, we conducted baseline experiments on audio onset detection with the recorded dataset, comparing the performance of four state-of-the-art algorithms in a vocal percussion context.
CCS CONCEPTS
• Applied computing → Sound and music computing.
KEYWORDS
dataset, beatbox, vocal, imitation, percussion
ACM Reference Format:
Alejandro Delgado, SKoT McDonald, Ning Xu, and Mark Sandler. 2019. A New Dataset for Amateur Vocal Percussion Analysis. In Audio Mostly (AM'19), September 18–20, 2019, Nottingham, United Kingdom. ACM, New York, NY, USA, 7 pages. https://doi.org/10.1145/3356590.3356844
1 INTRODUCTION
Music Information Retrieval (MIR), which uses Digital Signal Processing (DSP) and machine learning techniques to analyse music recordings, has had a growing influence on the music industry in
recent decades. Results in tasks like music genre recognition, chord estimation and source separation [5] bring optimism to the field in this regard. Some of the main sub-disciplines in MIR are also receiving a significant amount of attention from independent musicians. Automatic music transcription, for instance, enables artists to learn and compose musical pieces more comfortably, while source separation helps them practice more efficiently by deriving individual instrumental tracks from the original audio mix.
In this context of innovation, a set of musical sounds worth exploring is that of vocalised percussion. It includes vocal utterances that are articulated so as to communicate a rhythmic idea, usually by imitating the sound of percussive instruments like those featured in a drum set. As such, these sounds and their dynamics could be mapped to real drum samples from a sound library to create a realistic drum loop in seconds, saving composers time and effort when prototyping rhythms without requiring formal musical knowledge.
However, despite the possibilities that these rhythmic exploration tools offer, they are seldom used today. This could be due to several factors, ranging from the limited commercial spread of the already available applications to the insufficient precision of their algorithms. As a response to the current situation, we have recorded a dataset of vocal imitations of percussion instruments, which could help researchers to shed light on the problem and head towards a reliable and easy way of creating rhythmic patterns.
A literature review of the work in vocalised percussion is presented in Section 2, alongside a list of the main datasets available in the public domain. In Section 3, we introduce the Amateur Vocal Percussion (AVP) dataset, laying out its main features and detailing its production process. In Section 4, we perform baseline experiments on onset detection, where we compare the performance of four well-established algorithms in the context of vocal percussion. General conclusions are drawn in Section 5.
Between the recording of a vocal percussion performance and its mapping to a drum pattern, there are usually two main steps: onset detection and utterance classification. The former aims at localising the utterances in time, while the latter deals with their association with percussion instruments. There are two main frameworks in vocal percussion analysis for classification purposes.
Match and adapt tries to find the most similar spectrum to the query one in a database of rhythmic performances [17], while onset-wise separation divides the audio file into utterances and analyses them separately [8]. Both of them use MIR analytic routines, based on frame-wise extraction and statistical aggregation of DSP descriptors (spectral centroid, zero-crossing rate, mel-frequency cepstral coefficients...) and machine learning techniques (support vector machines, decision trees...). Studies are usually carried out focusing on three classes of imitated sounds: kick drum, snare drum and hi-hat.
In most cases, a modest vocal percussion dataset is recorded for evaluation purposes, which could be oriented to general rhythmic vocal percussion [8] [21] [19] [7] or just beatboxing [11] [22]. Some of the mentioned works target user-specific vocal percussion analysis rather than general methods that could work for everyone, as vocal imitation styles seem to change significantly from person to person [9] [19]. Nevertheless, the idea of a universal model for vocal percussion classification may also be plausible, as some studies have suggested that humans, when given the task of imitating non-vocal environmental sounds, naturally give realism to these imitations by focusing on certain characteristic perceptual features of the original sound queries [14].
There are four widely adopted vocal percussion datasets in the public domain. The first one is the
Beatbox dataset by Stowell et al. [22]. It gathered experienced beatboxers to record 14 audio files (one per participant) with a mean duration of 47 seconds, resulting in a total of 7460 annotated utterances. Both typical drum sounds and beatbox-specific ones were labelled, with the audio files recorded under several environmental conditions (different microphones, equipment, noise levels...). The authors also discovered that a better classification performance could be achieved by delaying the start of the first analysis frame around 23 milliseconds from the actual onset, effectively discarding the transient information of the sounds. The second one is the
Live Vocalised Transcription (LVT) dataset by Ramires et al. [19]. It was recorded by 20 participants (from experienced to amateur beatboxers) using three microphones with different noise levels and sound qualities. Two files per participant are provided, one of them containing the imitation of a simple drum loop and the other a free rhythmic improvisation. A software plugin for drum transcription by vocal percussion was presented afterwards [18]. The third dataset, by Mehrabi et al. [16], is aimed at both transcribing the instruments being imitated and recognising their specific model (e.g. type of snare). In it, 14 participants with experience in music production are asked to imitate 30 percussion sounds, including different kick drums, snare drums, hi-hats, toms and cymbals. Finally, the fourth dataset is the
Vocal Imitation Set by Kim et al. [12]. It is directed at vocal imitation of sound events in general, containing a total of 11242 crowd-sourced imitations of 302 different classes. It features imitations of several percussion instruments from approximately twenty participants, including opened hi-hat, snare drum and kick drum.
As audio recording quality is becoming better and more accessible with time, aspiring musicians are also moving away from recording studios to a new workspace consisting of just a relatively quiet room and a laptop. Enabling independent artists to prototype rhythmic ideas without musical knowledge and in an immediate way would make the creative workflow in this setting more natural, quick and spontaneous. This is the main reason the Amateur Vocal Percussion (AVP) dataset was created. All audio files and annotations can be found at https://doi.org/10.5281/zenodo.3250230

Table 1: Dataset content summary (utterances)

Instrument - 'label'     Personal   Fixed   Improvisation
Kick Drum - 'kd'         799        818     1201
Snare Drum - 'sd'        813        839     811
Closed Hi-Hat - 'hhc'    799        833     673
Opened Hi-Hat - 'hho'    816        830     548
The AVP dataset is characterised by the following attributes:
• 28 participants
• A total of 9780 utterances in 280 audio files
• Annotated onsets and labels (kick, snare, closed hi-hat and opened hi-hat)
• Recorded with one microphone (MacBook Pro's built-in mic)
The materials used to record the dataset were a MacBook Pro laptop, GarageBand software and a closed room of approximately 40 m². The experiment took an average of 15 minutes per participant.
The first step of the process was a small survey about the role that music played in the participant's life. This informal one-minute questionnaire was oriented towards relaxing the participants and getting them ready for the task. Right after this, a standard loop featuring kick and snare was presented and the participant was asked to both reproduce it vocally and write down the onomatopoeias of the sounds he/she used on a notebook page. This performance was intended as a first contact with the process and it was not recorded, while the annotation of the onomatopoeias was done to facilitate the recall of the imitations and to better stick to them. As it turned out to be hard for participants to imitate complex beats with hi-hats included in the preliminary tests, five isolated utterances of both closed hi-hat and opened hi-hat sounds were presented instead. The participants decided their imitations and wrote down their onomatopoeias in the same way as with the kick drum and snare drum.
Once the participants were familiarised with the task, they were asked to sit naturally in front of the computer, as they would usually do, and the recording of the dataset started. There were two recording modalities taking place: personal and fixed. For the personal modality, participants used their own vocal imitations to record around twenty-five utterances of kick drum, snare drum, closed hi-hat and opened hi-hat sounds in four separate audio files. The utterances followed a simple rhythmic loop (one crotchet and two quavers) and a 90-BPM metronome track was provided to the participants through headphones while recording. An audio file of improvised vocal percussion featuring all or most instruments was recorded afterwards. For the fixed modality, the procedure above was repeated once more, but now four specific sounds were given to the subjects. These fixed imitations were based on speech syllables so as to feel natural for participants to reproduce, and their timbral characteristics were intended to mimic the imitated percussion instruments. A "pm" syllable would correspond to the kick drum, "ta" to the snare drum, "ti" to the closed hi-hat and "chi" to the opened hi-hat. The articulation of these utterances, as they were performed in a percussive setting, generally resulted in brighter transient signals and more inharmonic steady-state signals compared to usual speech utterances.
Once the raw audio files were recorded, two post-processing stages took place to prepare them for analysis: trimming silent regions at the beginning and the end of each file to reduce their size, and removing in-between passages where the participants made accidental mistakes or notable pauses. The annotation of the files was manually carried out by the author right after this cleaning process, using Sonic Visualiser [3] to write down both onset locations and class labels. As illustrated in Table 1, the tag 'kd' was used for kick drum, 'sd' for snare drum, 'hhc' for closed hi-hat and 'hho' for opened hi-hat.
The choice of the exact starting point of a sound event is generally considered to be dependent on the task at hand and, thus, a matter of convenience. For instance, if the goal is preparing an utterance with a long transient to be used as a sound triggered by a MIDI message, the onset would sometimes be preferably placed near the point of maximum energy in the signal. In our case, we decided to place the onsets at the very beginning of the utterance, where the percussive transient starts to build the sound. This is because our primary goal is to classify the utterance, and its transient region could be informative as well.
An important note regarding the annotation process is that, on a few occasions, two utterances were very close to each other or even appeared to merge into one sound. A joint approach of waveform visualisation, spectrogram visualisation and listening at quarter speed was employed to resolve the ambiguity in these cases. An example of how these specific annotations were approached is illustrated in Figure 1.

Figure 1: Annotation of two onsets, corresponding to two /s/ phonemes. Figure 1a) displays the utterances with their onsets marked in red. The waveform is plotted in blue and the magnitude spectrogram in green. Figures 1b) and 1c) are close-ups of the first and the second onset respectively.
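For completeness, a minimal sketch of how one annotation file could be read in Python is given below. It assumes a plain CSV layout with an onset time in seconds followed by one of the four class labels, which is an assumption about the export format rather than a documented property of the AVP release.

```python
import csv

def load_annotations(path):
    """Read one annotation file as (onset_time_in_seconds, label) pairs.

    Assumes a CSV-style export (e.g. from Sonic Visualiser) where each row is
    "<onset time>,<label>" and the label is one of 'kd', 'sd', 'hhc', 'hho'.
    The actual layout of the released files may differ.
    """
    with open(path, newline="") as f:
        return [(float(row[0]), row[1].strip())
                for row in csv.reader(f) if len(row) >= 2]
```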
A list of observations worth commenting on was made while recording and listening back to the audio files in the AVP dataset. Some of them are the following:
• A small set of recorded utterances exhibit a form of double percussion (such as "suc" or "brrr"), making it challenging for the onset detector to output the events as a whole without splitting them.
• Various participants made mistakes on a few occasions when recording the improvisation files, using other speech-like phonemes like "cha" instead of "ta" when imitating the snare drum or "dm" instead of "pm" when imitating the kick drum. These passages were omitted in the final version of the dataset.
• Several participants reported that the given fixed sounds, despite their resemblance to the original drum sounds, felt unnatural for them to perform.
• Participants generally improvised non-complex and predictable loops, which could make the classification routines benefit from a rhythmic pattern analyser.
Finally, there were three participants whose utterances within the personal dataset were either not consistent with each other, unintelligible or practically indistinguishable from the rest. The audio files and annotations pertaining to these cases are stored in the "Discarded" folder. Despite this, they could still be used for classification purposes as long as the improvisation files, where the ambiguities occur, are excluded.

Figure 2: Piece of waveform pertaining to a single vocal percussion utterance. One can appreciate the plosive phoneme /t/ and the vowel phoneme /a/ composing the "ta" syllable as distinct components in the audio file.
The problem of onset detection is a recurrent one in the MIR literature. Most of its subfields, in one way or another, need to face it at some point when trying to make music analysis fully automatic. Vocal percussion resembles speech when it comes to the articulation of utterances, making the job slightly more challenging than with regular percussion, which exhibits short and well-defined attacks. Vocal percussion utterances are usually composed of one fricative phoneme that is sometimes followed by a vowel phoneme. This is illustrated in Figure 2.
Here we describe a baseline study on onset detection for vocal percussion using the AVP dataset. It aims at localising the time instants when the utterances begin, up to a tolerance margin of 50 ms [5]. This section is merely intended to be a brief analysis of the task, provoking further discussion and laying down foundations for future work.
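As a concrete reference for how such an evaluation can be scored, the sketch below compares a list of predicted onsets against the annotated ones using the 50 ms tolerance mentioned above. The greedy one-to-one matching is a simplification of standard MIR evaluation practice, not necessarily the exact procedure used in this study.

```python
def onset_f1(predicted, reference, tolerance=0.05):
    """F1-Score for onset detection: a prediction counts as a true positive if it
    lies within +/- `tolerance` seconds of a not-yet-matched reference onset."""
    unmatched = sorted(reference)
    tp = 0
    for p in sorted(predicted):
        for i, r in enumerate(unmatched):
            if abs(p - r) <= tolerance:
                del unmatched[i]   # each reference onset can be matched only once
                tp += 1
                break
    fp = len(predicted) - tp
    fn = len(reference) - tp
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```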
In this section, we present the onset detection methods that are evaluated in this study. We first give a quick overview of the spectral flux descriptor and its relevance to the task. Then, we take a look at four state-of-the-art algorithms for audio onset detection. The first two, Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN), are based on deep learning methods, while the other two, High-Frequency Content (HFC) and Complex, are based on traditional signal processing routines.
Although the spectral flux descriptor is not used here for onset detection itself but rather to refine the locations of the detected onsets, all four algorithms investigated here are either based on or inspired by it. For this reason, we describe it before we introduce the rest of the methods, considering it a theoretical reference point.
The spectral flux is a frame-wise measure of how quickly the magnitude in each frequency bin of the spectrum changes over time [4]. The sharper the change in a particular region, the more probable it is that a percussive onset is occurring there. If x is the audio waveform and X is its Fourier transform, the spectral flux is expressed in the following way:

SF(n) = \sum_{k=-N/2}^{N/2-1} H\big(|X(n, k)| - |X(n-1, k)|\big)    (1)

where k is a frequency bin of the n-th frame and H(x) = (x + |x|)/2 is called the half-wave rectifier function, which restricts the spectral flux to only output positive changes in time, i.e., when the flux increases. The higher a peak in the spectral flux function, the more likely an onset is taking place at that location.
CNNs [13] are feed-forward artificial neural networks composed of convolutional layers. The neurons in these layers compose a set of small local filter kernels that analyse the input, creating several feature maps as a result. A convolutional layer can be followed by a pooling layer, which subsamples these resulting feature maps following a certain set of rules, or by a fully connected layer to introduce non-linearities in the process. These last layers are especially useful for classification purposes.
CNNs are especially popular in computer vision, where they achieve high accuracies when detecting edges in images. The main idea behind their use for audio onset detection is that they can effectively detect edges in magnitude spectrograms as well, which usually correspond to audio onsets. In this way, the usual input for the CNN is composed of three 80-band mel spectrograms with constant hop size but different window sizes, so as to have both high frequency and time resolution. The details of the original network's architecture can be found in [20]. It was discovered in the same study that the CNN, like spectral flux based techniques, computes spectral differences over time and uses them to estimate the likelihood of an onset happening in a certain time region. The implementation of the CNN-based onset detection method, which this study follows, is contained in Madmom's CNNOnsetProcessor [2].
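To make the descriptor of Eq. (1) concrete before moving on to the remaining methods, the following numpy sketch computes a half-wave rectified spectral flux curve from an audio file. The use of librosa for loading and the STFT, as well as the frame and hop sizes, are illustrative assumptions rather than the exact settings of this study.

```python
import numpy as np
import librosa

def spectral_flux(path, n_fft=1024, hop_length=441):
    """Half-wave rectified spectral flux, one value per STFT frame (cf. Eq. 1)."""
    y, sr = librosa.load(path, sr=44100)
    X = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop_length))  # |X(n, k)|
    diff = np.diff(X, axis=1)                                        # |X(n, k)| - |X(n-1, k)|
    return np.sum(np.maximum(diff, 0.0), axis=0)                     # H(x) = (x + |x|)/2, summed over k
```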
The functioning of an RNN, in contrast with CNNs, is dependent on the information acquired from past inputs, allowing these networks to model complex data sequences in a non-linear way. Long Short-Term Memory (LSTM) units in RNNs [10] allow the networks to better handle dependencies in time and avoid the exploding/vanishing gradient problem.
When spectrograms are fed to RNNs, these networks can detect and model changes in time by contrasting observations from current frames with information from previous ones. Their routines, then, would be analogous to those of the spectral flux onset detector, with RNNs taking information from more past frames into account. The original RNN model for onset detection can be found in [1]. This time, the authors use three Bark spectrograms with constant hop size and different window sizes as inputs, and the resulting model is also optimised to handle less data than other onset detection algorithms. This makes it suitable for real-time onset detection, using unidirectional (causal) RNNs without LSTM cells. The implementation of the RNN-based onset detection method, which this study follows, is contained in Madmom's RNNOnsetProcessor [2]. We use the non-real-time version, with bidirectional RNNs.
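A minimal sketch of how the two madmom processors can be run on a recording is shown below. The peak-picking thresholds follow the values reported later in this section, while the frame rate and the choice of OnsetPeakPickingProcessor for peak picking are assumptions about a typical madmom setup rather than the exact pipeline of this study.

```python
from madmom.features.onsets import (CNNOnsetProcessor, RNNOnsetProcessor,
                                    OnsetPeakPickingProcessor)

def detect_onsets(path, model="cnn"):
    """Return onset times (in seconds) detected by madmom's CNN or RNN model."""
    if model == "cnn":
        activation = CNNOnsetProcessor()(path)                     # frame-wise onset probabilities
        picker = OnsetPeakPickingProcessor(threshold=0.55, fps=100)
    else:
        activation = RNNOnsetProcessor()(path)
        picker = OnsetPeakPickingProcessor(threshold=0.30, fps=100)
    return picker(activation)

# Example usage on one recording (hypothetical file name):
# onsets = detect_onsets("participant_01_kd_personal.wav", model="cnn")
```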
In general, most of the energy of a sound's steady-state part is located in the lower frequencies, while the higher frequencies are more prominent in its transient part. When a percussive onset occurs, the sudden increase in energy tends to be particularly pronounced at higher frequencies, indicating the beginning of its transient.
The HFC method for onset detection [15] exploits this insight by making use of the high-frequency content descriptor, i.e., the summation of the bin magnitudes of the spectrum multiplied by their own bin indices. This can be formalised in the following way:

HFC = \sum_{i=0}^{N-1} i \, |X(i)|    (2)

for an STFT spectrum X with bin indices [0 ... N-1]. The result, which is linearly biased towards high frequencies, is then fed to a detection function that returns the predicted onset locations. This detection function is based on both the high-frequency energy flux between two adjacent frames and the normalised high-frequency content of the current frame. The implementation that this study follows is included in Aubio's AubioOnset [2]. It has also been reported that the HFC method works especially well for vocal percussion [19].
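For reference, the descriptor of Eq. (2) reduces to a one-line weighted sum per STFT frame; the sketch below assumes the frame's complex spectrum is already available.

```python
import numpy as np

def high_frequency_content(frame_spectrum):
    """HFC of one STFT frame: bin magnitudes weighted by their bin index (cf. Eq. 2)."""
    magnitudes = np.abs(np.asarray(frame_spectrum))
    return float(np.sum(np.arange(len(magnitudes)) * magnitudes))
```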
The phases of the partials in the steady state of a sound can be accurately predicted from past frames, as their frequencies and amplitudes remain constant in this region. In the case of its transient part, however, these phases tend to evolve in a non-linear, unpredictable way. We can use this particularity of the transient's STFT phase spectrum to locate it within the waveform.
The Complex method for onset detection effectively combines energy-based and phase-based approaches by detecting onsets in the complex domain [6]. The complex detection function is then pre-processed with a weighted moving average and fed to a peak-picking algorithm that outputs the predicted onset locations. The implementation that this study follows is included in Aubio's AubioOnset [2].
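Both the HFC and the Complex detectors are available through aubio's Python bindings, and the sketch below runs either of them over a file. The 512-sample window (roughly 11 ms at 44.1 kHz) and the threshold defaults follow the settings reported later in this section, but the exact way AubioOnset was invoked here is not detailed, so this should be read as an approximation.

```python
import aubio

def aubio_onsets(path, method="hfc", threshold=0.8, win_s=512, hop_s=256):
    """Return onset times (in seconds) using aubio's 'hfc' or 'complex' method."""
    src = aubio.source(path, 0, hop_s)                      # 0 keeps the file's native sample rate
    detector = aubio.onset(method, win_s, hop_s, src.samplerate)
    detector.set_threshold(threshold)
    onsets = []
    while True:
        samples, read = src()
        if detector(samples):                               # non-zero when an onset is detected
            onsets.append(detector.get_last_s())
        if read < hop_s:                                    # end of file
            break
    return onsets

# onsets = aubio_onsets("take.wav", method="complex", threshold=0.7)
```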
We carry out three different studies to retrieve the parameters that produce the best performance for each algorithm in a vocal percussion context. Grid search is performed for each parameter using ten linearly spaced values. We test the algorithms on all the utterances in the AVP dataset and the accuracy of each run is measured using the F1-Score.
The first analysis aims at retrieving the best values for the method-specific parameters. These parameters are the peak-picking threshold of the onset probability functions for all methods and the frame size for the HFC and Complex ones. It was found that the best results for each algorithm are reached by defining a threshold of 0.55 for CNN, 0.30 for RNN, 0.8 for HFC and 0.7 for Complex. A frame size of 11 ms gave the best results for both the HFC and the Complex methods.
The second study explores how a minimum separation between onsets could further benefit the algorithms' performance. This means that if two predicted onsets happen to be separated in time by less than a certain threshold, the second onset is automatically discarded. This is relevant to our case for two reasons. The first is that two vocal percussion utterances seldom or never occur simultaneously, i.e., we are dealing with a monophonic transcription problem; therefore, there will always exist a minimum separation between onsets. The second reason is that these algorithms can easily confuse the transient of a vowel phoneme with the onset of a new utterance, lowering their precision score significantly. This inconvenience could be avoided with the inclusion of a fixed minimum separation between onsets, as the onset from the vowel phoneme could then be effectively discarded.
Results for this second study are displayed in Figure 3. The main observation is that both the CNN and RNN methods benefit from the inclusion of a minimum onset separation, reaching the peak of their performance at the 90 ms mark, whilst the HFC and the Complex methods show a detriment in performance when this separation is applied. A reason for this phenomenon could be that both the HFC and the Complex algorithms are based on distinctive features of percussive transients, namely the high-frequency content and the predictability of the phase spectrum respectively. For these methods, the peaks in the onset probability function corresponding to plosive phonemes (see Figure 2) would be appreciably higher than the peaks relative to the vowel phonemes. Hence, setting an appropriate value for the peak-picking threshold would be enough to separate these types of phonemes, and a minimum onset separation would only make the algorithm discard correct onsets in situations where two real onsets are significantly close to each other. Deep learning based algorithms, on the other hand, are optimised to detect musical onsets in general and do not explicitly rely on these characteristics of percussive transients, which makes them more likely to detect the onsets of the vowel phonemes as well. Thus, setting a minimum onset separation is a simple yet effective way for them to ignore onsets coming from vowel phonemes, at the expense of discarding correct onsets that are close in time.
Finally, we attempt to refine these onsets in time using the spectral flux descriptor. The goal here is to relocate each predicted onset to a neighbouring point which is closer to the real onset. More technically, it consists in minimising the mean absolute deviation of the predicted onsets with respect to the real onsets without significantly affecting the performance (F1-Score). We approach this task by centring an analysis window on the predicted onset and then finding the maximum of the spectral flux function in that region, where the real onset is more likely to be. We optimised the length of this analysis window by selecting the value that achieved the best balance in the performance vs. precision trade-off. Specifically, we set a maximum allowed drop in performance of 1% of the original F1-Score, considering only the parameter values that did not make the algorithms exceed that limit.
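A minimal sketch of these two post-processing steps, i.e. discarding onsets that fall too close to the previously kept one and snapping each remaining onset to a nearby spectral-flux maximum, is given below. The 90 ms separation follows the value found above for the deep learning methods, while the 20 ms search window is an illustrative assumption rather than the optimised length.

```python
import numpy as np

def enforce_min_separation(onsets, min_sep=0.09):
    """Keep an onset only if it is at least `min_sep` seconds after the last kept one."""
    kept = []
    for t in sorted(onsets):
        if not kept or t - kept[-1] >= min_sep:
            kept.append(t)
    return kept

def refine_with_spectral_flux(onsets, flux, hop_time, window=0.02):
    """Move each onset to the spectral-flux maximum inside a window centred on it."""
    half = int(round(window / (2.0 * hop_time)))
    refined = []
    for t in onsets:
        n = min(int(round(t / hop_time)), len(flux) - 1)    # frame index of the predicted onset
        lo, hi = max(0, n - half), min(len(flux), n + half + 1)
        refined.append((lo + int(np.argmax(flux[lo:hi]))) * hop_time)
    return refined
```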
The final results are gathered in Table 2. The HFC and the Complex methods perform the best when it comes to accuracy, and the onsets derived from the deep learning methods are the closest in time to the real onsets. Results also show that the CNN and RNN methods are slow compared to HFC and Complex, which are 38 and 27 times faster respectively.

Figure 3: Effect of different minimum separation values on the performance of the four onset detection algorithms.

Table 2: Final results of the onset detection study for vocal percussion. The '-SF' suffix indicates the inclusion of spectral flux refinement in the process. From left to right, the columns display the F1-Score, the mean absolute deviation of the predicted onsets with respect to the real ones, and the duration of the analysis for all onsets using the same laptop.

Method        F1-Score   Deviation   Duration
CNN                                  267 s
RNN
HFC
Complex       0.95       21 ms       10 s
CNN-SF                               -
RNN-SF
HFC-SF
Complex-SF    0.94       11 ms       -
5 CONCLUSIONS
In this piece of work, we have presented the AVP dataset, comprising 9780 utterances of vocal percussion recorded by 28 participants. It aims at improving algorithms for drum pattern query by vocal imitation, so that independent musicians can accelerate their creative routines by sketching realistic percussion rhythms on the fly. Baseline experiments on utterance onset detection were carried out in the final section. It was shown that methods relying on specific DSP features outperformed deep learning based techniques in the context of vocal percussion, laying the groundwork for novel approaches to improve upon.
ACKNOWLEDGMENTS
This project has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 765068.
A warm and special thank you to the people that participated in the creation of this dataset.
REFERENCES
[1] Sebastian Böck, Andreas Arzt, Florian Krebs, and Markus Schedl. 2012. Online real-time onset detection with recurrent neural networks. In Proceedings of the 15th International Conference on Digital Audio Effects (DAFx-12), York, UK.
[2] Sebastian Böck, Filip Korzeniowski, Jan Schlüter, Florian Krebs, and Gerhard Widmer. 2016. madmom: a new Python Audio and Music Signal Processing Library. arXiv:1605.07008 (2016). https://doi.org/10.1145/2964284.2973795
[3] Chris Cannam, Christian Landone, and Mark Sandler. 2010. Sonic Visualiser: An open source application for viewing, analysing, and annotating music audio files. In Proceedings of the 18th ACM International Conference on Multimedia. ACM, 1467–1468. https://doi.org/10.1145/1873951.1874248
[4] Simon Dixon. 2006. Onset Detection Revisited. In Proceedings of the 9th International Conference on Digital Audio Effects (DAFx-06), Montreal, Canada.
[5] J. Stephen Downie and Yun Hao. 2018. MIREX 2018 Evaluation Results. (2018).
[6] Chris Duxbury, Juan Pablo Bello, Mike Davies, and Mark Sandler. 2003. Complex domain onset detection for musical signals. In Proceedings of the 6th International Conference on Digital Audio Effects (DAFx-03), London, UK.
[7] Olivier Gillet and Gaël Richard. 2005. Drum Loops Retrieval from Spoken Queries. Journal of Intelligent Information Systems 24, 2-3 (March 2005). https://doi.org/10.1007/s10844-005-0321-9
[8] Amaury Hazan. 2005. Towards automatic transcription of expressive oral percussive performances. In Proceedings of the 10th International Conference on Intelligent User Interfaces (IUI '05). ACM Press, San Diego, California, USA. https://doi.org/10.1145/1040830.1040904
[9] Kyle Hipke, Michael Toomim, Rebecca Fiebrink, and James Fogarty. 2014. BeatBox: end-user interactive definition and training of recognizers for percussive vocalizations. In Proceedings of the 2014 International Working Conference on Advanced Visual Interfaces (AVI '14). ACM Press, Como, Italy. https://doi.org/10.1145/2598153.2598189
[10] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9, 8 (1997), 1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735
[11] Ajay Kapur, Manj Benning, and George Tzanetakis. 2004. Query-By-Beat-Boxing: Music Retrieval for the DJ. In Proceedings of the International Conference on Music Information Retrieval (ISMIR).
[12] Bongjun Kim, Madhav Ghei, Bryan Pardo, and Zhiyao Duan. 2018. Vocal Imitation Set: a dataset of vocally imitated sound events using the AudioSet ontology. In Proceedings of the Detection and Classification of Acoustic Scenes and Events 2018 Workshop (DCASE2018) (Nov. 2018). https://doi.org/10.5281/zenodo.1340763
[13] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems. 1097–1105. https://doi.org/10.1145/3065386
[14] Guillaume Lemaitre, Olivier Houix, Frédéric Voisin, Nicolas Misdariis, and Patrick Susini. 2016. Vocal Imitations of Non-Vocal Sounds. PLOS ONE 11, 12 (Dec. 2016). https://doi.org/10.1371/journal.pone.0168167
[15] Paul Masri. 1996. Computer Modelling of Sound for Transformation and Synthesis of Musical Signals. University of Bristol (1996).
[16] Adib Mehrabi, Keunwoo Choi, Simon Dixon, and Mark Sandler. 2018. Similarity measures for vocal-based drum sample retrieval using deep convolutional auto-encoders. arXiv:1802.05178 (Feb. 2018). https://doi.org/10.1109/icassp.2018.8461566
[17] Tomoyasu Nakano, Jun Ogata, Masataka Goto, and Yuzuru Hiraga. 2004. A Drum Pattern Retrieval Method by Voice Percussion. In Proceedings of the International Conference on Music Information Retrieval (ISMIR).
[18] António Ramires, Rui Penha, and Matthew E. P. Davies. 2018. User Specific Adaptation in Automatic Transcription of Vocalised Percussion. arXiv:1811.02406 (2018).
[19] António Filipe Santana Ramires. 2017. Automatic Transcription of Drums and Vocalised Percussion. Universidade do Porto (2017).
[20] Jan Schlüter and Sebastian Böck. 2014. Improved Musical Onset Detection with Convolutional Neural Networks. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2014). https://doi.org/10.1109/ICASSP.2014.6854953
[21] Elliot Sinyor, Cory McKay, Rebecca Fiebrink, Daniel McEnnis, and Ichiro Fujinaga. 2005. Beatbox classification using ACE. In Proceedings of the International Conference on Music Information Retrieval.
[22] Dan Stowell and Mark D. Plumbley. 2010. Delayed decision-making in real-time beatbox percussion classification. Journal of New Music Research 39, 3 (2010).