Downbeat Tracking with Tempo-Invariant Convolutional Neural Networks
Bruno Di Giorgi
Apple Inc. [email protected]
Matthias Mauch
Apple Inc. [email protected]
Mark Levy
Apple Inc. [email protected]
ABSTRACT
The human ability to track musical downbeats is robust to changes in tempo, and it extends to tempi never previously encountered. We propose a deterministic time-warping operation that enables this skill in a convolutional neural network (CNN) by allowing the network to learn rhythmic patterns independently of tempo. Unlike conventional deep learning approaches, which learn rhythmic patterns at the tempi present in the training dataset, the patterns learned in our model are tempo-invariant, leading to better tempo generalisation and more efficient usage of the network capacity.

We test the generalisation property on a synthetic dataset created by rendering the Groove MIDI Dataset using FluidSynth, split into a training set containing the original performances and a test set containing tempo-scaled versions rendered with different SoundFonts (test-time augmentation). The proposed model generalises nearly perfectly to unseen tempi (F-measure of 0.89 on both training and test sets), whereas a comparable conventional CNN achieves similar accuracy only for the training set (0.89) and drops to 0.54 on the test set. The generalisation advantage of the proposed model extends to real music, as shown by results on the GTZAN and Ballroom datasets.
1. INTRODUCTION
Human musicians easily identify the downbeat (the first beat of each bar) in a piece of music and will effortlessly adjust to a variety of tempi, even ones never before encountered. This ability is the likely result of patterns and tempi being processed at distinct locations in the human brain [1].

We argue that factorising rhythm into tempo and tempo-invariant rhythmic patterns is desirable for a machine-learned downbeat detection system as much as it is for the human brain. First, factorised representations generally reduce the number of parameters that need to be learned. Second, having disentangled tempo from pattern we can
transfer information learned for one tempo to all others, eliminating the need for training datasets to cover all combinations of tempo and pattern.

Identifying invariances to disentangle representations has proven useful in other domains [2]: translation invariance was the main motivation behind CNNs [3], since the identity of a face should not depend on its position in an image. Similarly, voices retain many of their characteristics as pitch and level change, which can be exploited to predict pitch [4] and vocal activity [5]. Crucially, methods exploiting such invariances don't only generalise better than non-invariant models, they also perform better overall.

Some beat and downbeat trackers first estimate tempo (or make use of a tempo oracle) and use the pre-calculated tempo information in the final tracking step [6–15]. Doing so disentangles tempo and tempo-independent representations at the cost of propagating errors from the tempo estimation step to the final result. It is therefore desirable to estimate tempo and phase simultaneously [16–20], which however leads to a much larger parameter space. Factorising this space to make it amenable for machine learning is the core aim of this paper.

In recent years, many beat and downbeat tracking methods changed their front-end audio processing from hand-engineered onset detection functions towards beat-activation signals generated by neural networks [21–23]. Deep learning architectures such as convolutional and recurrent neural networks are trained to directly classify the beat and downbeat frames, and therefore the resulting signal is usually cleaner.

By extending the receptive field to several seconds, such architectures are able to identify rhythmic patterns at longer time scales, a prerequisite for predicting the downbeat. But conventional CNN implementations learn rhythmic patterns separately for each tempo, which introduces two problems. First, since datasets are biased towards mid-tempo songs, it introduces a tempo bias that no post-processing stage can correct. Second, it stores similar rhythms redundantly, once for every relevant tempo, i.e. it makes inefficient use of network capacity. Our proposed approach resolves these issues by learning rhythmic patterns that apply to all tempi.

The two technical contributions are as follows:

1. the introduction of a scale-invariant convolutional layer that learns temporal patterns irrespective of their scale;

2. the application of the scale-invariant convolutional layer to CNN-based downbeat tracking to explicitly learn tempo-invariant rhythmic patterns.

Similar approaches to achieve scale-invariant CNNs have been developed in the field of computer vision [24, 25], while no previous application exists for musical signal analysis, to the best of our knowledge.

We demonstrate that the proposed method generalises better over unseen tempi and requires lower capacity with respect to a standard CNN-based downbeat tracker. The method also achieves good results against academic test sets.
2. MODEL
The proposed downbeat tracking model has two components: a neural network to estimate the joint probability of downbeat presence and tempo for each time frame, using tempo-invariant convolution, and a hidden Markov model (HMM) to infer a globally optimal sequence of downbeat locations from the probability estimate.

We discuss the proposed scale-invariant convolution in Sec. 2.1 and its tempo-invariant application in Sec. 2.2. The entire neural network is described in Sec. 2.3 and the post-processing HMM in Sec. 2.4.
2.1 Scale-Invariant Convolution

In order to achieve scale invariance we generalise the conventional convolutional neural network layer.
We explain this first in terms of a one-dimensional input tensor $x \in \mathbb{R}^{N}$ and only one kernel $h \in \mathbb{R}^{N^*}$, and later generalise the explanation to multiple channels in Sec. 2.1.2. Conventional convolutional layers convolve $x$ with $h$ to obtain the output tensor $y \in \mathbb{R}^{N - N^* + 1}$:

$$y = x * h, \qquad (1)$$

where $*$ refers to the discrete convolution operation. Here, the kernel $h$ is updated directly during back-propagation, and there is no concept of scale. Any two patterns that are identical in all but scale (e.g. one is a "stretched" version of the other) cannot be represented by the same kernel.

To address this shortcoming, we factorise the kernel representation into scale and pattern by parametrising the kernel as the dot product $h_j = \langle \psi_j, k \rangle$ between a fixed scaling tensor $\psi_j \in \mathbb{R}^{N^* \times M}$ and a scale-invariant pattern $k \in \mathbb{R}^{M}$. Only the pattern is updated during network training, and the scaling tensor, corresponding to $S$ scaling matrices, is pre-calculated (Sec. 2.1.3). The operation adds an explicit scale dimension to the convolution output:

$$y_j = x * h_j = x * \langle \psi_j, k \rangle. \qquad (2)$$

The convolution kernel is thus factorised into a constant scaling tensor $\psi$ and trainable weights $k$ that learn a scale-invariant pattern. A representation of a scale-invariant convolution is shown in Figure 1.
Figure 1. The figure shows a representation of the standard and scale-invariant convolution operations, with input/output channel dimensions removed for simplicity. In order to achieve scale invariance, we parametrise the kernel as the dot product of two tensors $\psi$ and $k$, where $\psi$ is a deterministic scaling tensor and $k$ is the trained part that will learn scale-invariant patterns. The resulting kernel $h$ contains multiple scaled versions of $k$.

variable              single-channel                        multi-channel
signal $x$            $\mathbb{R}^{N}$                      $\mathbb{R}^{N \times C_x}$
patterns $k$          $\mathbb{R}^{M}$                      $\mathbb{R}^{M \times C_x \times H}$
kernel $h$            $\mathbb{R}^{N^* \times S}$           $\mathbb{R}^{N^* \times C_x \times S \times H}$
output $y$            $\mathbb{R}^{(N - N^* + 1) \times S}$  $\mathbb{R}^{(N - N^* + 1) \times S \times H}$
scaling tensor $\psi$  $\mathbb{R}^{N^* \times M \times S}$
scale indices         $j = 0, \ldots, S - 1$

Table 1. Variables and dimensions.
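To make the factorisation concrete, the following is a minimal single-channel sketch of Eqs. (2) and (5) in NumPy. It assumes the scaling tensor `psi` has already been pre-computed (Sec. 2.1.3); the function names and shapes are illustrative, not the exact implementation used in the paper.

```python
import numpy as np

def scale_invariant_conv1d(x, k, psi):
    """Eq. (2): scale-invariant convolution, single input channel, one kernel.

    x:   input signal, shape (N,)
    k:   trainable scale-invariant pattern, shape (M,)
    psi: constant scaling tensor, shape (N_star, M, S)
    Returns y with an explicit scale axis, shape (N - N_star + 1, S).
    """
    N_star, M, S = psi.shape
    # h[:, j] = <psi_j, k> is the pattern k stretched to scale j
    h = np.einsum("nms,m->ns", psi, k)
    return np.stack(
        [np.convolve(x, h[:, j], mode="valid") for j in range(S)], axis=1
    )

def scale_wise_conv1d(x_scaled, k, psi):
    """Eq. (5): deeper layers receive an input that already carries a scale axis
    (shape (N, S)) and are convolved scale-wise with the matching kernel."""
    h = np.einsum("nms,m->ns", psi, k)
    S = psi.shape[2]
    return np.stack(
        [np.convolve(x_scaled[:, j], h[:, j], mode="valid") for j in range(S)], axis=1
    )
```

Only `k` would be updated by back-propagation; `psi` stays fixed, so the same learned pattern is reused at every scale.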
Usually the input to the convolutional layer has $C_x > 1$ input channels and there are $H > 1$ kernels. The formulas in Section 2.1 can easily be extended by the channel dimension, as illustrated in Table 1.

The scaling tensor $\psi$ contains $S$ scaling matrices from size $M$ to $s_j M$, where $s_j$ are the scale factors:

$$\psi_{n,m,j} = \int_{\tilde{s}} \int_{\tilde{n}} \delta(\tilde{n} - \tilde{s} m)\, \kappa_n(n - \tilde{n})\, \kappa_s(s_j - \tilde{s})\, d\tilde{n}\, d\tilde{s}, \qquad (3)$$

where $\delta$ is the Dirac delta function and $\kappa_n$, $\kappa_s$ are defined as follows:

$$\kappa_n(d) = \frac{\sin(\pi d)}{\pi d}, \qquad \kappa_s(d) = \alpha \cos^2\!\left(\frac{\alpha \pi d}{2}\right) H(1 - \alpha |d|),$$

where $H$ is the Heaviside step function. The inner integral can be interpreted as computing a resampling matrix for a given scale factor and the outer integral as smoothing along the scale dimension, with the parameter $\alpha$ of the function $\kappa_s$ controlling the amount of smoothing applied. The size $N^*$ of the scaling tensor $\psi$ (and the resulting convolutional kernel $h$) is derived from the most stretched version of $k$:

$$N^* = \max_j s_j M. \qquad (4)$$

After the first scale-invariant layer, the tensor has an additional dimension representing scale. In order to add further scale-invariant convolutional layers without losing scale invariance, subsequent operations are applied scale-wise:

$$y_j = x_j * \langle \psi_j, k \rangle. \qquad (5)$$

The only difference with Eq. (2) is that the input tensor $x$ of Eq. (5) already contains $S$ scales, hence the added subscript $j$.

2.2 Tempo-Invariant Convolution

In the context of the downbeat tracking task, tempo behaves as a scale factor and the tempo-invariant patterns are rhythmic patterns. We construct the sequence of scale factors $s$ as

$$s_j = \frac{r \tau_j B}{M}, \qquad \tau_j = \tau_0\, 2^{j/T}, \qquad (6)$$

where $\tau_j$ are the beat periods, $r$ is the frame rate of the input feature, $B$ is the number of beats spanned by the convolution kernel factor $k$, $\tau_0$ is the shortest beat period, and $T$ is the desired number of tempo samples per octave. The matrix $k$ has a simple interpretation as a set of rhythm fragments in musical time, with $M$ samples spanning $B$ beats.

To mimic our perception of tempo, the scale factors in Eq. (6) are log-spaced, therefore the integral in Eq. (3) becomes

$$\psi_{n,m,j} = \int_{\tilde{j}} \int_{\tilde{n}} \delta(\tilde{n} - s_{\tilde{j}} m)\, \kappa_n(n - \tilde{n})\, \kappa_s(j - \tilde{j})\, d\tilde{n}\, d\tilde{j}, \qquad (7)$$

where the parameter $\alpha$ of the function $\kappa_s$ is set to a constant value. A representation of the scaling tensor used in the tempo-invariant convolution is shown in Figure 2.

2.3 Network

The tempo-invariant network (Fig. 3) is a fully convolutional deep neural network, where the layers are conceptually divided into two groups. The first group of layers are regular one-dimensional convolutional layers and act as onset detectors. The receptive field is constrained in order to preserve the tempo-invariance property of the model: if even short rhythmic fragments were learned at a specific tempo, the invariance assumption would be violated. We limit the maximum size of the receptive field to 0.25 seconds, i.e. the period of a beat at 240 BPM.
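As a sketch of how the multi-channel case of Table 1 can be implemented as a trainable layer, the following PyTorch module folds the scale axis into the output channels of a single `conv1d` call. It is an illustration under the assumptions above (a pre-computed `psi`, arbitrary channel counts), not the authors' reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaleInvariantConv1d(nn.Module):
    """Scale-invariant convolution: (batch, C_x, N) -> (batch, H, S, N - N* + 1)."""

    def __init__(self, psi, in_channels, out_channels):
        super().__init__()
        # psi: (N_star, M, S), constant, pre-computed as in Eq. (3)/(7)
        self.register_buffer("psi", psi)
        m = psi.shape[1]
        # k: (H, C_x, M) trainable tempo-invariant patterns
        self.k = nn.Parameter(0.01 * torch.randn(out_channels, in_channels, m))

    def forward(self, x):
        n_star, _, s = self.psi.shape
        # h = <psi, k> : (H, C_x, N_star, S)
        h = torch.einsum("nms,hcm->hcns", self.psi, self.k)
        out_ch, in_ch = h.shape[:2]
        # fold the scale axis into the output channels and run one ordinary convolution
        weight = h.permute(0, 3, 1, 2).reshape(out_ch * s, in_ch, n_star)
        y = F.conv1d(x, weight)                     # (batch, H*S, N - N_star + 1)
        return y.view(x.shape[0], out_ch, s, -1)    # split H and S back apart
```

Deeper tempo-invariant layers would apply the same idea scale-wise (Eq. (5)), e.g. via grouped convolutions, so that the scale axis introduced by the first layer is preserved.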
Figure 2. The scaling tensor $\psi$ is a sparse 3-dimensional constant tensor. In the figure $\psi$ is represented as a cube where the 0 bins are rendered transparent. $\psi$ transforms the rhythm patterns contained in the kernel $k$ from musical time (e.g. 16th notes) to listening time (e.g. frames) over multiple scales.
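A sketch of how the scale factors of Eq. (6) and the scaling tensor of Eq. (7) could be pre-computed (NumPy, with the integrals discretised); the exact form and normalisation of the raised-cosine window $\kappa_s$ and the value of $\alpha$ used here are assumptions:

```python
import numpy as np

def tempo_scale_factors(r, B, M, tau_0, T, S):
    """Eq. (6): log-spaced beat periods tau_j and kernel scale factors s_j."""
    j = np.arange(S)
    tau = tau_0 * 2.0 ** (j / T)     # T tempo samples per octave
    s = r * tau * B / M              # maps M pattern samples onto B beats at frame rate r
    return s, tau

def kappa_n(d):
    """Sinc resampling kernel along (listening) time."""
    return np.sinc(d)                # np.sinc(d) = sin(pi*d) / (pi*d)

def kappa_s(d, alpha=1.0):
    """Raised-cosine smoothing across scale indices (assumed parametrisation)."""
    return alpha * np.cos(alpha * np.pi * d / 2.0) ** 2 * (alpha * np.abs(d) < 1.0)

def scaling_tensor(s, M, alpha=1.0):
    """Eq. (7), discretised: psi[n, m, j] resamples an M-sample pattern to scale s_j,
    smoothed over neighbouring scale indices."""
    S = len(s)
    N_star = int(np.ceil(s.max() * M))           # Eq. (4)
    psi = np.zeros((N_star, M, S))
    n = np.arange(N_star)[:, None]
    m = np.arange(M)[None, :]
    for j in range(S):
        for jt in range(S):
            w = kappa_s(j - jt, alpha)
            if w > 0.0:
                # the Dirac delta collapses the inner integral to a sinc resampling matrix
                psi[:, :, j] += w * kappa_n(n - s[jt] * m)
    return psi

# Example with illustrative values: 50 fps features, 4-beat patterns of 64 samples,
# 25 tempo bins with 8 samples per octave, starting from a 0.25 s beat period.
s, tau = tempo_scale_factors(r=50, B=4, M=64, tau_0=0.25, T=8, S=25)
psi = scaling_tensor(s, M=64)
```

In practice most of the weight concentrates near the resampling diagonal of each scaling matrix, consistent with the sparse cube shown in Figure 2.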
Figure 3. A global view of the neural network. The first group of layers are regular convolutional layers and act as onset detectors. They have a small receptive field, in order to focus on acoustic features and avoid learning rhythmic patterns, which will be learned by the successive tempo-invariant layers. The output tensor represents joint probabilities of downbeat presence $\mathcal{D}$ and tempo $\tau$.

The second group is a stack of tempo-invariant convolutional layers (as described in Secs. 2.1 and 2.2). The receptive field is measured in musical time, with each layer spanning $B = 4$ beats. The last layer outputs only one channel, producing a 2-dimensional (frame and tempo) output tensor.

The activations of the last layer represent the scores (logits) of having a downbeat at a specific tempo. An additional constant zero bin is concatenated to these activations for each frame to model the score of having no downbeat; this bin can be kept constant because the other output values will adapt automatically. After applying the softmax, the output $o$ represents the joint probability of the downbeat presence $\mathcal{D}$ at a specific tempo $\tau$:

$$o_j = \begin{cases} p(\mathcal{D}, \tau_j) & j = 0, \ldots, S - 1 \\ p(\neg\mathcal{D}) & j = S \end{cases} \qquad (8)$$

The categorical cross-entropy loss is then applied frame-wise, with a weighting scheme that balances the loss contribution of downbeat versus non-downbeat frames by reducing the loss of non-downbeat frames. The target tensors are generated from the downbeat annotations by spreading the downbeat locations to the neighbouring time frames and tempi, using a rectangular window for time and a raised cosine window ($1/T$ octaves wide) for tempo. The network is trained with stochastic gradient descent using RMSprop, early stopping and learning rate reduction when the validation loss reaches a plateau.

2.4 Post-Processing HMM

In order to transform the output activations of the network into a sequence of downbeat locations, we use a frame-wise HMM with the state space of [26]. In its original form, this post-processing method uses a network activation that only encodes the beat probability at each position. In the proposed tempo-invariant neural network the output activation models the joint probability of downbeat presence and tempo, enabling a more explicit connection to the post-processing HMM via a slightly modified observation model:

$$P(o_j \mid q) = \begin{cases} c(\tau_j, \tau_q)\, o_j & q \in \mathcal{D},\ j < S \\ o_S / (\sigma S) & q \in \neg\mathcal{D} \end{cases} \qquad (9)$$

where $q$ is the state variable having tempo $\tau_q$, $\mathcal{D}$ is the set of downbeat states, $c(\tau_j, \tau_q)$ is the interpolation coefficient from the tempi modeled by the network ($\tau_j$) to the tempi modeled by the HMM ($\tau_q$), and $\sigma$ approximates the proportion of non-downbeat to downbeat states ($|\neg\mathcal{D}| / |\mathcal{D}|$).
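A minimal sketch of the output head (Eq. (8)) and the modified observation model (Eq. (9)), assuming NumPy arrays and a simple linear interpolation for the coefficient $c(\tau_j, \tau_q)$; the actual interpolation scheme and the HMM machinery of [26] are not reproduced here:

```python
import numpy as np

def joint_downbeat_tempo_probs(logits):
    """Eq. (8): per-frame softmax over S tempo bins plus one constant 'no downbeat' bin.

    logits: (num_frames, S) downbeat scores per tempo.
    Returns o of shape (num_frames, S + 1): o[:, j] = p(D, tau_j), o[:, S] = p(not D).
    """
    z = np.concatenate([logits, np.zeros((logits.shape[0], 1))], axis=1)
    z -= z.max(axis=1, keepdims=True)                # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def observation_likelihood(o_t, tau_net, tau_q, is_downbeat_state, sigma):
    """Eq. (9): observation model for one frame and one HMM state q.

    o_t:     (S + 1,) output of joint_downbeat_tempo_probs for this frame
    tau_net: (S,) increasing beat periods modelled by the network
    tau_q:   beat period of state q
    sigma:   approximate ratio of non-downbeat to downbeat states
    """
    S = len(tau_net)
    if is_downbeat_state:
        # c(tau_j, tau_q) realised here as linear interpolation on the network tempo grid
        return np.interp(tau_q, tau_net, o_t[:S])
    return o_t[S] / (sigma * S)
```

The downbeat sequence is then obtained by decoding over the state space of [26], with these values plugged in as the observation probabilities.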
3. EXPERIMENTS
In this section we describe the two experiments conducted in order to test the tempo-invariance property of the proposed architecture with respect to a regular CNN. The first experiment, described in Sec. 3.1, uses a synthetic dataset of drum MIDI recordings. The second experiment, outlined in Sec. 3.2, evaluates the potential of the proposed algorithm on real music.
3.1 Synthetic Data

We test the robustness of our model by training a regular CNN and a tempo-invariant CNN on a tempo-biased training dataset and evaluating on a tempo-unbiased test set. In order to control the tempo distribution of the dataset, we start with a set of MIDI drum patterns from the magenta-groove dataset [27], randomly selecting bars from each of the eval-sessions, resulting in a pool of drum patterns. These rhythms were then synthesised at 27 scaled tempi, with scale factors $\varepsilon_i$ spaced logarithmically (as powers of 2) and symmetric around the original tempo of the recording ($\varepsilon_0 = 1$). Each track starts with a short silence, the duration of which is randomly chosen within a bar length, after which the rhythm is repeated 4 times.
[Figure 4, panel (a): F-measure vs. relative tempo change (scale) for the inv, noinv and noinv_aug models.]
[Figure 4, panel (b): test-set F-measure vs. absolute tempo (BPM), shown against the distribution of training samples.]

Figure 4. Tempo invariance experiment using a dataset of time-scaled versions of a set of drum patterns. The scale factors $\varepsilon_i$ are logarithmically spaced around the original tempo. A tempo-invariant CNN (inv) and a standard CNN (noinv) are trained on the non-scaled versions (scale = 0) and tested on all others. A standard CNN trained on scales [−1, 1] (noinv_aug) simulates the effect of data augmentation. Figure (a) shows that the invariant model is able to generalise on seen patterns at unseen tempi. Figure (b) shows the effect of the tempo-biased training set: for non-invariant models the benefit is localised, while the invariant model distributes the rhythmic information across the entire tempo spectrum.

Audio samples are rendered using FluidSynth with a set of combinations of SoundFonts (https://github.com/FluidSynth/fluidsynth/wiki/SoundFont) and instruments. The synthesised audio is pre-processed to obtain a log-amplitude mel-spectrogram at $r = 50$ frames per second. The tempo-biased training set contains the original tempi (scale factor $\varepsilon = 1$), while the tempo-unbiased test set contains all scaled versions. The two sets were rendered with different SoundFonts.

We compared a tempo-invariant architecture (inv) with a regular CNN (noinv). The hyper-parameter configurations are shown in Table 2 and were selected by maximising the accuracy on the validation set. The results of the experiment are shown in Fig. 4 in terms of F-measure, using the standard distance threshold of 70 ms on both sides of the annotated downbeats [28].
[Table 2: layer configuration (number of layers × output channels) for the inv and noinv architectures. Group 1: standard CNN layers (both models); group 2: TI-CNN layers (inv) / dil-CNN layers (noinv). Total trainable parameters: 60k and 80k.]
Table 2. Architectures used in the experiment. Groups of layers are expressed as (number of layers × output channels). All layers in group 1 have kernel size equal to 3 frames. dil-CNN is a stack of dilated convolution layers with kernel size equal to 7 frames and exponentially increasing dilation factors. The specific hyper-parameters of the tempo-invariant network TI-CNN are configured as follows: $T = 8$, $\tau_0 = 0.25$ s (the period of a beat at 240 BPM), $S = 25$, $M = 64$, $B = 4$. ReLU non-linearities are used in both architectures.

Despite the tempo bias of the training set, the accuracy of the proposed tempo-invariant architecture is approximately constant across the tempo spectrum. Conversely, the non-invariant CNN performs better on the tempi that are present in the training and validation set. Specifically, Fig. 4a shows that the two architectures perform equally well on the training set containing the rhythms at their original tempo (scale equal to 0 in the figure), while the accuracy of the non-invariant network drops for the scaled versions. A different view of the same results in Fig. 4b highlights how the test set accuracy depends on the scaled tempo. The accuracy of the regular CNN peaks around the tempi that are present in the training set, showing that the contribution of the training samples is localised in tempo. The proposed architecture performs better (even at the tempi that are present in the training set) because it efficiently distributes the benefit of all training samples over all tempi.

In order to simulate the effect of data augmentation on the non-invariant model, we also trained an instance of the non-invariant model (noinv_aug) including two scaled versions ($\varepsilon_i$ with $|i| \leq 1$) in the training set. As shown in the figure, data augmentation improves generalisation, but exhibits similar tempo dependency effects.

3.2 Real Music

In this experiment we used real music recordings. We trained on an internal dataset (1368 excerpts from a variety of genres, summing up to 10 hours of music) and the RWC dataset [29] (Popular, Genre and Jazz subsets), and tested on the Ballroom [30, 31] and GTZAN [32] datasets. With respect to the previous experiment we used the same input features, but larger networks (in terms of number of channels, layers and convolution kernel sizes, optimised on the validation set) because of the higher amount of information contained in fully arranged recordings, with the inv model having fewer trainable parameters than noinv.

[Figure 5: F-measure on the train, validation and test splits for the inv and noinv models.]

Figure 5. Results of the experiment on music data in terms of F-measure. Track scores are used to compute the average and the confidence intervals at 95% (using bootstrapping). The proposed tempo-invariant architecture is able to better generalise over unseen data with respect to its standard CNN counterpart.

The results in Fig. 5 show that the proposed tempo-invariant architecture performs worse on the training set, but better on the validation and test set, with the comparisons on the train and test sets being statistically significant. Here the tempo-invariant architecture seems to act as a regulariser, allocating the network capacity to learning patterns that generalise better on unseen data, instead of fitting to the training set.
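For reference, a minimal sketch of the downbeat F-measure used above, assuming a greedy one-to-one matching and a ±70 ms tolerance [28]; libraries such as mir_eval provide equivalent, more thoroughly tested implementations:

```python
import numpy as np

def downbeat_f_measure(estimated, annotated, tolerance=0.07):
    """F-measure with a +/- tolerance window (in seconds) around each annotation.

    Each estimate may match at most one annotation and vice versa."""
    estimated = np.sort(np.asarray(estimated, dtype=float))
    annotated = np.sort(np.asarray(annotated, dtype=float))
    if len(estimated) == 0 or len(annotated) == 0:
        return 0.0
    matched = np.zeros(len(annotated), dtype=bool)
    hits = 0
    for e in estimated:
        free = np.where(~matched)[0]
        if len(free) == 0:
            break
        # closest still-unmatched annotation to this estimate
        i = free[np.argmin(np.abs(annotated[free] - e))]
        if abs(annotated[i] - e) <= tolerance:
            matched[i] = True
            hits += 1
    precision = hits / len(estimated)
    recall = hits / len(annotated)
    return 0.0 if hits == 0 else 2 * precision * recall / (precision + recall)
```

Per-track scores computed this way are averaged to produce the numbers in Figures 4 and 5 (Figure 5 additionally reports bootstrapped 95% confidence intervals).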
4. DISCUSSION
Since musicians are relentlessly creative, previously unseen rhythmic patterns keep being invented, much like "out-of-vocabulary" words in natural language processing [33]. As a result, the generalisation power of tempo-invariant approaches is likely to remain useful. Once tuned for optimal input representation and network capacity, we expect tempo-invariant models to have an edge particularly on new, non-public test datasets.

Disentangling timbral pattern and tempo may also be useful for tasks such as auto-tagging: models can learn that some classes have a single precise tempo (e.g. ballroom dances [30]), some have varying tempi within a range (e.g. broader genres or moods), and others still are completely invariant to tempo (e.g. instrumentation).
5. CONCLUSIONS
We introduced a scale-invariant convolution layer and used it as the main component of our tempo-invariant neural network architecture for downbeat tracking. We experimented on drum grooves and real music data, showing that the proposed architecture generalises to unseen tempi by design and achieves higher accuracy with lower capacity compared to a standard CNN.
6. REFERENCES

[1] M. Thaut, P. Trimarchi, and L. Parsons, "Human brain basis of musical rhythm perception: common and distinct neural substrates for meter, tempo, and pattern," Brain Sciences, vol. 4, no. 2, pp. 428–452, 2014.

[2] I. Higgins, D. Amos, D. Pfau, S. Racaniere, L. Matthey, D. Rezende, and A. Lerchner, "Towards a definition of disentangled representations," arXiv preprint arXiv:1812.02230, 2018.

[3] Y. LeCun, Y. Bengio et al., "Convolutional networks for images, speech, and time series," The Handbook of Brain Theory and Neural Networks, vol. 3361, no. 10, 1995.

[4] R. M. Bittner, B. McFee, J. Salamon, P. Li, and J. P. Bello, "Deep salience representations for F0 estimation in polyphonic music," in Proc. of the International Society for Music Information Retrieval Conference (ISMIR), 2017, pp. 63–70.

[5] J. Schlüter and B. Lehner, "Zero-mean convolutions for level-invariant singing voice detection," in Proc. of the International Society for Music Information Retrieval Conference (ISMIR), 2018, pp. 321–326.

[6] M. E. Davies and M. D. Plumbley, "Beat tracking with a two state model [music applications]," in Proc. of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 3. IEEE, 2005, pp. iii–241.

[7] D. P. Ellis, "Beat tracking by dynamic programming," Journal of New Music Research, vol. 36, no. 1, pp. 51–60, 2007.

[8] A. P. Klapuri, A. J. Eronen, and J. T. Astola, "Analysis of the meter of acoustic musical signals," IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 1, pp. 342–355, 2005.

[9] N. Degara, E. A. Rúa, A. Pena, S. Torres-Guijarro, M. E. Davies, and M. D. Plumbley, "Reliability-informed beat tracking of musical signals," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 1, pp. 290–301, 2011.

[10] B. Di Giorgi, M. Zanoni, S. Böck, and A. Sarti, "Multipath beat tracking," Journal of the Audio Engineering Society, vol. 64, no. 7/8, pp. 493–502, 2016.

[11] H. Papadopoulos and G. Peeters, "Joint estimation of chords and downbeats from an audio signal," IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 1, pp. 138–152, 2010.

[12] F. Krebs, S. Böck, M. Dorfer, and G. Widmer, "Downbeat tracking using beat synchronous features with recurrent neural networks," in Proc. of the International Society for Music Information Retrieval Conference (ISMIR), 2016, pp. 129–135.

[13] M. E. Davies and M. D. Plumbley, "A spectral difference approach to downbeat extraction in musical audio," in Proc. of the European Signal Processing Conference (EUSIPCO). IEEE, 2006, pp. 1–4.

[14] S. Durand, J. P. Bello, B. David, and G. Richard, "Downbeat tracking with multiple features and deep neural networks," in Proc. of the International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, pp. 409–413.

[15] ——, "Feature adapted convolutional neural networks for downbeat tracking," in Proc. of the International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2016, pp. 296–300.

[16] M. Goto, "An audio-based real-time beat tracking system for music with or without drum-sounds," Journal of New Music Research, vol. 30, no. 2, pp. 159–171, 2001.

[17] S. Dixon, "Automatic extraction of tempo and beat from expressive performances," Journal of New Music Research, vol. 30, no. 1, pp. 39–58, 2001.

[18] D. Eck, "Beat tracking using an autocorrelation phase matrix," in Proc. of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 4. IEEE, 2007, pp. IV–1313.

[19] M. Goto, K. Yoshii, H. Fujihara, M. Mauch, and T. Nakano, "Songle: A web service for active music listening improved by user contributions," in Proc. of the International Society for Music Information Retrieval Conference (ISMIR), 2011, pp. 311–316.

[20] F. Krebs, A. Holzapfel, A. T. Cemgil, and G. Widmer, "Inferring metrical structure in music using particle filters," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 23, no. 5, pp. 817–827, 2015.

[21] S. Böck and M. Schedl, "Enhanced beat tracking with context-aware neural networks," in Proc. of the International Conference on Digital Audio Effects (DAFx), 2011, pp. 135–139.

[22] S. Böck, F. Krebs, and G. Widmer, "Joint beat and downbeat tracking with recurrent neural networks," in Proc. of the International Society for Music Information Retrieval Conference (ISMIR), 2016, pp. 255–261.

[23] F. Korzeniowski, S. Böck, and G. Widmer, "Probabilistic extraction of beat positions from a beat activation function," in Proc. of the International Society for Music Information Retrieval Conference (ISMIR), 2014, pp. 513–518.

[24] Y. Xu, T. Xiao, J. Zhang, K. Yang, and Z. Zhang, "Scale-invariant convolutional neural networks," arXiv preprint arXiv:1411.6369, 2014.

[25] A. Kanazawa, A. Sharma, and D. Jacobs, "Locally scale-invariant convolutional neural networks," in Deep Learning and Representation Learning Workshop, Neural Information Processing Systems (NIPS), 2014.

[26] F. Krebs, S. Böck, and G. Widmer, "An efficient state-space model for joint tempo and meter tracking," in Proc. of the International Society for Music Information Retrieval Conference (ISMIR), 2015, pp. 72–78.

[27] J. Gillick, A. Roberts, J. Engel, D. Eck, and D. Bamman, "Learning to groove with inverse sequence transformations," in Proc. of the International Conference on Machine Learning (ICML), 2019.

[28] M. E. Davies, N. Degara, and M. D. Plumbley, "Evaluation methods for musical audio beat tracking algorithms," Queen Mary University of London, Centre for Digital Music, Tech. Rep. C4DM-TR-09-06, 2009.

[29] M. Goto, H. Hashiguchi, T. Nishimura, and R. Oka, "RWC music database: Popular, classical and jazz music databases," in Proc. of the International Society for Music Information Retrieval Conference (ISMIR), vol. 2, 2002, pp. 287–288.

[30] F. Gouyon, A. Klapuri, S. Dixon, M. Alonso, G. Tzanetakis, C. Uhle, and P. Cano, "An experimental comparison of audio tempo induction algorithms," IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 5, pp. 1832–1844, 2006.

[31] F. Krebs, S. Böck, and G. Widmer, "Rhythmic pattern modeling for beat and downbeat tracking in musical audio," in Proc. of the International Society for Music Information Retrieval Conference (ISMIR), 2013, pp. 227–232.

[32] G. Tzanetakis and P. Cook, "Musical genre classification of audio signals," IEEE Transactions on Speech and Audio Processing, vol. 10, no. 5, pp. 293–302, 2002.

[33] T. Schick and H. Schütze, "Learning semantic representations for novel words: Leveraging both form and context," in Proc. of the AAAI Conference on Artificial Intelligence, 2019.