User Specific Adaptation in Automatic Transcription of Vocalised Percussion
António Ramires [email protected]  Rui Penha [email protected]  Matthew E. P. Davies [email protected]
Sound and Music Computing Group, INESC TEC, Rua Dr. Roberto Frias, s/n, 4200-465 Porto, Portugal
Faculty of Engineering, University of Porto, Rua Dr. Roberto Frias, s/n, 4200-465 Porto, Portugal
Abstract
The goal of this work is to develop an application that enables music producers to use their voice to create drum patterns when composing in Digital Audio Workstations (DAWs). An easy-to-use and user-oriented system capable of automatically transcribing vocalisations of percussion sounds, called LVT - Live Vocalised Transcription, is presented. LVT is developed as a Max for Live device which follows the "segment-and-classify" methodology for drum transcription, and includes three modules: i) an onset detector to segment events in time; ii) a module that extracts relevant features from the audio content; and iii) a machine-learning component that implements the k-Nearest Neighbours (kNN) algorithm for the classification of vocalised drum timbres.

Due to the wide differences in vocalisations from distinct users for the same drum sound, a user-specific approach to vocalised transcription is proposed. In this perspective, a given end-user trains the algorithm with their own vocalisations for each drum sound before inputting their desired pattern into the DAW. The user adaptation is achieved via a new Max external which implements Sequential Forward Selection (SFS) for choosing the most relevant features for a given set of input drum sounds.

The evaluation of LVT addresses two objectives: first, to investigate the improvement in performance with user-specific training, and second, to assess whether LVT can provide an optimised workflow for music production in Ableton Live when compared to existing drum transcription algorithms. The obtained results demonstrate that both objectives are met.
The growth in computers' processing power, and the consequent possibility of real-time Digital Signal Processing (DSP) for audio, led to the appearance of Digital Audio Workstations (DAWs), making the creation of computer music available to the general public. Following these advances, many new instruments and interfaces for creating electronic music have surfaced. With changes in music culture, music production and how musicians work with their instruments have also changed. In other words, the ability to invent and reinvent the way music is produced is key to progress. Consequently, new proposals are necessary, such as the design of new techniques for music composition.

Within the genre of Electronic Music, sequencing drum patterns plays a critical role. However, inputting drum patterns into DAWs often requires high technical skill on the part of the user, either by physically performing the patterns by tapping them on MIDI drum pads, or by manually entering events via music editing software. For non-expert users both options can be very challenging, and can thus present a barrier to entry. However, the voice is an important and powerful instrument of rhythm production, and can be used to express or "perform" drum patterns in a very intuitive way - so-called "beatboxing". To leverage this concept within a computational system, our goal is a system that helps users (both expert musicians and amateur enthusiasts) input the rhythm patterns they have in mind into a sequencer via the automatic transcription of vocalised percussion. Our proposed tool is beneficial both from the perspective of workflow optimisation (by providing accurate real-time transcriptions) and as a means to encourage users to engage with technology in the pursuit of creative activities. From a technical standpoint, we seek to build on state-of-the-art techniques from the domain of music information retrieval (MIR) for drum transcription [2, 4], actively targeted towards end-users and real-world music content production scenarios.

This work is derived from the MSc dissertation of António Ramires, conducted in the Department of Electrical and Computer Engineering of the Faculty of Engineering, University of Porto.
A vocalised drum transcription software, LVT, which can be trained with the user's own vocalisations, is proposed. LVT is developed as a Max for Live project – a visual programming environment, based on Max 7, which allows users to build instruments and effects for use within the Ableton Live DAW.

To develop LVT, a dataset of vocalised percussion was compiled. A group of 20 participants (11 male, 9 female) were asked to record two short vocalised percussion tracks: one identical for all participants, and the other an improvised pattern. These input percussion tracks were recorded three times: on a low-quality laptop microphone, on an iPad microphone, and using a studio-quality microphone (AKG c4000b). All recorded audio tracks were manually annotated using Sonic Visualiser, a free application for viewing and analysing the contents of music audio files. The participants spanned a wide range of experience in beatboxing (from beatboxing experts to those who had never vocalised drum patterns before), and covered a wide age range. Thus, we consider the annotated dataset to be representative of a wide range of potential users of the system, and highly heterogeneous in terms of the types of drum sounds.

Our proposed vocalised percussion transcription system was developed following a user-specific approach. LVT follows the "segment and classify" method for drum transcription [2] and integrates three main elements: i) an onset detector – to identify when each drum sound occurs; ii) a component that extracts features for each event; and iii) a machine learning component to classify the drum sounds. In the Max for Live environment, onset detection was performed with AubioOnset~ (https://aubio.org/manpages/latest/aubioonset.1.html). Feature extraction was performed in real-time using existing Max objects: Zsa.mfcc~ – to characterise the timbre; Zsa.descriptors [3] – to provide spectral centroid, spread, slope, decrease and rolloff features; and the zerox~ object – to calculate the zero crossing rate and the number of zero crossings. The machine learning component is trained with the user's preferred vocalisations, and the features which give the best results for the provided input are selected. This is achieved using the Sequential Forward Selection (SFS) method [5] along with a k-Nearest Neighbours (kNN) classification algorithm, with the most significant features selected according to the accuracy obtained when testing on the training data (in our case, the annotated improvised patterns from each participant). SFS works by selecting the most significant feature according to a specific criterion (in this case the classification accuracy) and adding it to an initially empty set, until there are no improvements or no features remain. The kNN algorithm was implemented using timbreID [1], and a new external for Max was developed to implement the SFS.

A user interface was created in Max for Live to facilitate the use of the application by end-users. A screenshot of the interface of LVT is shown in Fig. 1. It demonstrates the user-specific training stage – where a user inputs a set number of the drum timbres they intend to use, after which their vocalised percussion is transcribed and rendered as a MIDI file for subsequent synthesis.

Figure 1: User interface of the LVT device.
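As a concrete offline illustration of this segment-and-classify front end, the Python sketch below segments a recording at detected onsets and computes one feature vector per event. It is a minimal sketch under stated assumptions, not the LVT implementation: librosa stands in for the Max objects named above (AubioOnset~, Zsa.mfcc~, Zsa.descriptors, zerox~), the 100 ms per-event analysis window is an illustrative choice, and the spectral slope and decrease descriptors have no direct librosa equivalent and are omitted here.

import numpy as np
import librosa

def extract_events(path, win=0.1):
    """Segment a recording at detected onsets and return one feature
    vector per event, plus the onset times in seconds."""
    y, sr = librosa.load(path, sr=44100, mono=True)
    onset_times = librosa.onset.onset_detect(y=y, sr=sr, units='time')
    features = []
    for t in onset_times:
        start = int(t * sr)
        seg = y[start:start + int(win * sr)]   # short window after the onset
        if seg.size == 0:
            continue
        # 13 MFCCs averaged over the window, standing in for Zsa.mfcc~
        mfcc = librosa.feature.mfcc(y=seg, sr=sr, n_mfcc=13).mean(axis=1)
        # spectral shape descriptors, standing in for Zsa.descriptors
        centroid = librosa.feature.spectral_centroid(y=seg, sr=sr).mean()
        spread = librosa.feature.spectral_bandwidth(y=seg, sr=sr).mean()
        rolloff = librosa.feature.spectral_rolloff(y=seg, sr=sr).mean()
        # zero crossing rate, standing in for zerox~
        zcr = librosa.feature.zero_crossing_rate(seg).mean()
        features.append(np.hstack([mfcc, centroid, spread, rolloff, zcr]))
    return np.array(features), onset_times

Averaging the frame-wise descriptors over the short window yields one fixed-length vector per vocalised event, which is the form of input the classification stage expects.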
To operate LVT, a user loads the device in Ableton Live and then vocalises the set of desired drum sounds they intend to use, e.g. five kick sounds followed by five snare sounds, followed by five hi-hat sounds. Once the expected number of drum sounds has been detected, the SFS algorithm identifies the subset of features which best separates the drum sounds for the user. After training, the user can then vocalise rhythmic patterns which are automatically converted from audio to a MIDI representation in the DAW for later synthesis and editing.
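This user-specific training step can be sketched as follows, with scikit-learn's kNN standing in for the timbreID-based classifier and the new SFS Max external. This is a minimal sketch: it scores candidate feature subsets by cross-validation, whereas the paper reports selection by the accuracy obtained on the training data itself.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def sfs_knn(X, y, k=3, cv=5):
    """Greedy SFS [5]: repeatedly add the single feature that most
    improves kNN accuracy; stop when no remaining feature helps.
    X has one row per training vocalisation, y the drum labels."""
    selected, best_acc = [], 0.0
    remaining = list(range(X.shape[1]))
    while remaining:
        # score every candidate feature added to the current set
        scores = [(cross_val_score(KNeighborsClassifier(n_neighbors=k),
                                   X[:, selected + [f]], y, cv=cv).mean(), f)
                  for f in remaining]
        acc, f = max(scores)
        if acc <= best_acc:        # no improvement: stop, as in SFS
            break
        best_acc = acc
        selected.append(f)
        remaining.remove(f)
    return selected, best_acc

Once sfs_knn has chosen a feature subset, a kNN classifier fitted on X[:, selected] labels each incoming event from the segmentation stage, and the labelled onset times can then be rendered as MIDI notes.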
Table 1: Number of operations and F-measure for the AKG microphone.

                 Edit Operations          F-measure
          Modify   Add   Remove    Kick    Snare   Hi-hat
Ableton     33      12    296      0.518   0.470   0.297
LDT         52      24    206      0.538   0.204   0.419
LVT         39       7     15      0.914   0.691   0.802
The evaluation of LVT was designed to serve two purposes: first, to understand how a user-specific trained system performs against state-of-the-art drum transcription systems (which have been optimised over large datasets without any user-specific training), and second, to explore how LVT could improve a producer's workflow. We compared LVT against two existing drum transcription algorithms: LDT [4], and Ableton Live's built-in "Convert Drums to MIDI" function. As validation data we used the non-improvised vocalised patterns from our annotated dataset.

To compare the accuracy of the systems we use the F-measure of the transcriptions. Then, to investigate how our system could improve a producer's workflow, the "effort" required to reach an accurate transcription was calculated by counting the number of editing operations needed to obtain the desired pattern. These operations are: to modify, to add, or to remove a MIDI note.

Table 1 summarises the results obtained from counting the total number of operations needed to obtain the desired pattern for the testing data recorded on the studio-quality AKG c4000b microphone, together with the corresponding F-measure per vocalised drum sound, for the three drum transcription systems. The results demonstrate that, for the studio-quality microphone, the vocalised drum transcription accuracy of LVT is substantially higher than that of the other systems, and far fewer modifications were required to obtain the desired patterns when editing the automatic transcriptions.

To see the effect of user-specific training on the performance of LVT, an example is provided where LVT is trained on one user and tested on another – and vice-versa. When LVT is trained with a different person with different vocalisations, the accuracy of the transcription decreases, as shown in Fig. 2. In the upper part of each screenshot is the transcription of a user's input when trained with their own vocalisations, while the bottom part corresponds to the transcription when trained with the other user. As can be seen, without the user-specific training, many misclassifications occur.

Figure 2: (top) First user vocalisations trained with the second user. (bottom) Second user vocalisations trained with the first user.

From these results, we infer that LVT can provide a transcription closer to the ground truth than the existing state-of-the-art systems, as shown by the higher F-measure. In addition to LVT being trained per individual user, these results may also derive from the fact that LVT does not try to detect polyphonic events (more than one drum vocalisation at the same time) as the other systems do. Furthermore, LVT does not detect as many events as the other systems, which has a strong influence on the number of false positives, and hence on the F-measure. The number of operations needed to achieve the desired transcription, presented in Table 1, shows that end-users of the system do not have to perform as many actions when producing music, which has a positive impact on their workflow, leaving more time for creative experimentation.
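For reference, the sketch below shows one way the per-class F-measure reported in Table 1 can be computed: each predicted event is matched greedily to an unused ground-truth event of the same label within a tolerance window. The 50 ms tolerance is an assumption on our part (a common choice in onset evaluation), not a value stated in the paper.

def f_measure(pred, truth, label, tol=0.05):
    """Per-class F-measure; pred and truth are lists of (time_sec, label).
    tol is the matching tolerance in seconds (assumed, see above)."""
    p = sorted(t for t, l in pred if l == label)
    g = sorted(t for t, l in truth if l == label)
    used, hits = set(), 0
    for t in p:
        # match to the first unused ground-truth event within tolerance
        for i, gt in enumerate(g):
            if i not in used and abs(t - gt) <= tol:
                used.add(i)
                hits += 1
                break
    precision = hits / len(p) if p else 0.0
    recall = hits / len(g) if g else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)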
In this paper, we have presented LVT – a new interface for assistive music content creation. LVT allows Ableton Live users to sequence MIDI patterns that can be used for designing and performing rhythms with their voice. Existing state-of-the-art systems, including the one already in Ableton Live, are not able to transcribe vocalised percussion as effectively, because these tools are trained on general recorded drum sounds which are typically not vocalised. Indeed, because different people vocalise drum sounds in different ways, LVT explicitly seeks to model and capture this behaviour via user-specific training. Our evaluation shows LVT to be very effective for a wide range of users and vocalisations, outperforming existing systems. Furthermore, we believe LVT can be applied to any kind of arbitrary non-pitched percussive sounds – provided that the training sound types are sufficiently different from one another, and can thus be well separated in the audio feature space using SFS.

LVT is implemented as a Max for Live device, and thus fully integrates into Ableton Live, allowing users of all ability levels to experiment with music sequencing driven by their own personal percussion vocalisations within an easy-to-use graphical user interface.
Acknowledgements

This work is financed by the ERDF - European Regional Development Fund through the Operational Programme for Competitiveness and Internationalisation - COMPETE 2020 Programme within project «POCI-01-0145-FEDER-006961», and by National Funds through the FCT - Fundação para a Ciência e a Tecnologia (Portuguese Foundation for Science and Technology) as part of project UID/EEA/50014/2013.

Project TEC4Growth - Pervasive Intelligence, Enhancers and Proofs of Concept with Industrial Impact / NORTE-01-0145-FEDER-000020 is financed by the North Portugal Regional Operational Programme (NORTE 2020), under the PORTUGAL 2020 Partnership Agreement, and through the European Regional Development Fund (ERDF).
References

[1] W. Brent. A timbre analysis and classification toolkit for Pure Data. In Proc. of ICMC, pages 224-229, 2010.
[2] O. Gillet and G. Richard. Transcription and separation of drum signals from polyphonic music. IEEE Transactions on Audio, Speech, and Language Processing, 16(3):529-540, March 2008.
[3] M. Malt and E. Jourdan. Zsa.descriptors: a library for real-time descriptors analysis. In Proc. of 5th SMC Conference, pages 134-137, 2008.
[4] M. Miron, M. E. P. Davies, and F. Gouyon. An open-source drum transcription system for Pure Data and Max MSP. In Proc. of ICASSP, pages 221-225, May 2013.
[5] A. W. Whitney. A direct method of nonparametric measurement selection. IEEE Transactions on Computers, C-20(9):1100-1103, September 1971.