Deep convolutional networks on the pitch spiral for musical instrument recognition
Vincent Lostanlen and Carmine-Emanuele Cella
École normale supérieure, PSL Research University, CNRS, Paris, France
ABSTRACT
Musical performance combines a wide range of pitches, nuances, and expressive techniques. Audio-based classification of musical instruments thus requires building signal representations that are invariant to such transformations. This article investigates the construction of learned convolutional architectures for instrument recognition, given a limited amount of annotated training data. In this context, we benchmark three different weight sharing strategies for deep convolutional networks in the time-frequency domain: temporal kernels; time-frequency kernels; and a linear combination of time-frequency kernels which are one octave apart, akin to a Shepard pitch spiral. We provide an acoustical interpretation of these strategies within the source-filter framework of quasi-harmonic sounds with a fixed spectral envelope, which are archetypal of musical notes. The best classification accuracy is obtained by hybridizing all three convolutional layers into a single deep learning architecture.
1. INTRODUCTION
Among the cognitive attributes of musical tones, pitch is distinguished by a combination of three properties. First, it is relative: ordering pitches from low to high gives rise to intervals and melodic patterns. Secondly, it is intensive: multiple pitches heard simultaneously produce a chord, not a single unified tone – contrary to loudness, which adds up with the number of sources. Thirdly, it does not depend on instrumentation: this makes possible the transcription of polyphonic music under a single symbolic system [5].

Tuning auditory filters to a perceptual scale of pitches provides a time-frequency representation of music signals that satisfies the first two of these properties. It is thus a starting point for a wide range of MIR applications, which can be separated into two categories: pitch-relative (e.g. chord estimation [13]) and pitch-invariant (e.g. instrument recognition [9]). Both aim at disentangling pitch from timbral content as independent factors of variability, a goal that is made possible by the third aforementioned property.
This work is supported by the ERC InvariantClass grant 320959. The source code to reproduce figures and experiments is freely available at .

© Vincent Lostanlen and Carmine-Emanuele Cella. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Vincent Lostanlen and Carmine-Emanuele Cella. "Deep convolutional networks on the pitch spiral for musical instrument recognition", 17th International Society for Music Information Retrieval Conference, 2016.
This is pursued by extracting mid-level features on top of the spectrogram, be they engineered or learned from training data. Both approaches have their limitations: a "bag-of-features" lacks flexibility to represent fine-grained class boundaries, whereas a purely learned pipeline often leads to uninterpretable overfitting, especially in MIR, where the quantity of thoroughly annotated data is relatively small.

In this article, we strive to integrate domain-specific knowledge about musical pitch into a deep learning framework, in an effort towards bridging the gap between feature engineering and feature learning.

Section 2 reviews the related work on feature learning for signal-based music classification. Section 3 demonstrates that pitch is the major factor of variability among musical notes of a given instrument, if described by their mel-frequency cepstra. Section 4 presents a typical deep learning architecture for spectrogram-based classification, consisting of two convolutional layers in the time-frequency domain and one densely connected layer. Section 5 introduces alternative convolutional architectures for learning mid-level features, along time and along a Shepard pitch spiral, as well as aggregation of multiple models in the deepest layers. Section 6 discusses the effectiveness of the presented systems on a challenging dataset for musical instrument recognition.
2. RELATED WORK
Spurred by the growth of annotated datasets and the democratization of high-performance computing, feature learning has enjoyed a renewed interest in recent years within the MIR community, both in supervised and unsupervised settings. Whereas unsupervised learning (e.g. k-means [25], Gaussian mixtures [14]) is employed to fit the distribution of the data with few parameters of relatively low abstraction and high dimensionality, state-of-the-art supervised learning consists of a deep composition of multiple nonlinear transformations, jointly optimized to predict class labels, and whose behaviour tends to gain in abstraction as depth increases [27].

As compared to other deep learning techniques for audio processing, convolutional networks strike a balance between learning capacity and robustness. The convolutional structure of learned transformations is derived from the assumption that the input signal, be it a one-dimensional waveform or a two-dimensional spectrogram, is stationary, which means that content is independent of location. Moreover, the most informative dependencies between signal coefficients are assumed to be concentrated in temporal or spectrotemporal neighborhoods. Under such hypotheses, linear transformations can be learned efficiently by limiting their support to a small kernel which is convolved over the whole input. This method, known as weight sharing, decreases the number of parameters of each feature map while increasing the amount of data on which kernels are trained.

By design, convolutional networks seem well adapted to instrument recognition, as this task does not require a precise timing of the activation function, and is thus essentially a challenge of temporal integration [9, 14]. Furthermore, it benefits from an unequivocal ground truth, and may be simplified to a single-label classification problem by extracting individual stems from a multitrack dataset [2]. As such, it is often used as a test bed for the development of new algorithms [17, 18], as well as in computational studies in music cognition [20, 21]. Some other applications of deep convolutional networks include onset detection [23], transcription [24], chord recognition [13], genre classification [3], downbeat tracking [8], boundary detection [26], and recommendation [27].

Interestingly, many research teams in MIR have converged to employ the same architecture, consisting of two convolutional layers and two densely connected layers [7, 13, 15, 17, 18, 23, 26], and this article makes no exception. However, there is no clear consensus regarding the weight sharing strategies that should be applied to musical audio streams: convolutions in time and convolutions in time-frequency coexist in the recent literature. A promising paradigm [6, 8], at the intersection of feature engineering and feature learning, is to extract temporal or spectrotemporal descriptors of various low-level modalities, train specific convolutional layers on each modality to learn mid-level features, and hybridize information at the top level. Recognizing that this idea has been successfully applied to large-scale artist recognition [6] as well as downbeat tracking [8], we aim to proceed in a comparable way for instrument recognition.
3. HOW INVARIANT IS THE MEL-FREQUENCY CEPSTRUM?
The mel scale is a quasi-logarithmic function of acoustic frequency designed such that perceptually similar pitch intervals appear equal in width over the full hearing range. This section shows that engineering transposition-invariant features from the mel scale does not suffice to build pitch invariants for complex sounds, thus motivating further inquiry.

The time-frequency domain produced by a constant-Q filter bank tuned to the mel scale is covariant with respect to pitch transposition of pure tones. As a result, a chromatic scale played at constant speed would draw parallel, diagonal lines, each of them corresponding to a different partial wave. However, the physics of musical instruments constrain these partial waves to bear a negligible energy if their frequencies are beyond the range of acoustic resonance.
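For illustration, a constant-Q representation such as the one shown in Figure 1 below can be computed with the librosa package (also used for our experiments in Section 4). The following is a minimal sketch; the audio file name and the hop length are placeholders rather than the exact settings used for the figure.

```python
# Minimal sketch: constant-Q spectrogram with Q = 12 bins per octave, starting
# at A1 (55 Hz), mapped to decibels. File name and hop length are placeholders.
import numpy as np
import librosa

y, sr = librosa.load("tuba_chromatic_scale.wav", sr=None)  # hypothetical file
C = librosa.cqt(y, sr=sr,
                fmin=librosa.note_to_hz("A1"),  # 55 Hz
                n_bins=96, bins_per_octave=12,  # 8 octaves, Q = 12
                hop_length=512)
C_db = librosa.amplitude_to_db(np.abs(C))       # logarithmic loudness mapping
print(C_db.shape)                                # (frequency bins, time frames)
```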
Figure 1: Constant-Q spectrogram of a chromatic scale played by a tuba. Although the harmonic partials shift progressively, the spectral envelope remains unchanged, as revealed by the presence of a fixed cutoff frequency. See text for details.

As shown in Figure 1, the constant-Q spectrogram of a tuba chromatic scale exhibits a fixed cutoff frequency, which delineates the support of its spectral envelope. This elementary observation implies that realistic pitch changes cannot be modeled by translating a rigid spectral template along the log-frequency axis. The same property is verified for a wide class of instruments, especially brass and woodwinds. As a consequence, the construction of powerful invariants to musical pitch is not amenable to delocalized operations on the mel-frequency spectrum, such as the discrete cosine transform (DCT) which leads to the mel-frequency cepstral coefficients (MFCC), often used in audio classification [9, 14].

To validate the above claim, we have extracted the MFCC of 1116 individual notes from the RWC dataset [10], as played by 6 instruments, with 32 pitches, 3 nuances, and 2 performers and manufacturers. When more than 32 pitches were available (e.g. piano), we selected a contiguous subset of 32 pitches in the middle register. Following a well-established rule [9, 14], the MFCC were defined as the 12 lowest nonzero "quefrencies" among the DCT coefficients extracted from a filter bank of 40 mel-frequency bands. We then computed the distribution of squared Euclidean distances between musical notes in the 12-dimensional space of MFCC features.

Figure 2 summarizes our results. We found that restricting a cluster to one nuance, one performer, or one manufacturer hardly reduces intra-class distances. This suggests that MFCC are fairly successful in building representations that are invariant to such factors of variability. In contrast, the cluster corresponding to each instrument shrinks if decomposed into a mixture of same-pitch clusters, sometimes by an order of magnitude. In other words, most of the variance in an instrument cluster of mel-frequency cepstra is due to pitch transposition.

Keeping fewer than 12 coefficients certainly improves invariance, yet at the cost of inter-class discriminability, and vice versa. This experiment shows that the mel-frequency cepstrum is perfectible in terms of the invariance-discriminability tradeoff, and that there remains a lot to be gained by feature learning in this area.

Figure 2: Distributions of squared Euclidean distances among various MFCC clusters in the RWC dataset. Whisker ends denote lower and upper deciles. See text for details.
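The experiment above can be approximated with off-the-shelf tools. The sketch below, which assumes a hypothetical list of (audio file, pitch) pairs for a single instrument, computes 12 MFCC per note from a 40-band mel filter bank and compares the mean squared Euclidean distance of the whole instrument cluster with that of its same-pitch sub-clusters; it illustrates the protocol and is not the exact code used for Figure 2.

```python
# Sketch of the MFCC distance experiment; file names and metadata are placeholders.
import itertools
from collections import defaultdict
import numpy as np
import librosa

def mfcc_of_note(path):
    """12 lowest nonzero quefrencies from a 40-band mel filter bank, time-averaged."""
    y, sr = librosa.load(path, sr=None)
    m = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_mels=40)
    return m[1:].mean(axis=1)  # drop the 0th (energy) coefficient

def mean_sq_dist(vectors):
    """Mean squared Euclidean distance over all pairs in a cluster."""
    return np.mean([np.sum((a - b) ** 2)
                    for a, b in itertools.combinations(vectors, 2)])

# Hypothetical metadata: (audio path, pitch) for every note of one instrument.
notes = [("tuba_A2_mf.wav", "A2"), ("tuba_A2_ff.wav", "A2"),
         ("tuba_B2_mf.wav", "B2"), ("tuba_B2_ff.wav", "B2")]

by_pitch = defaultdict(list)
for path, pitch in notes:
    by_pitch[pitch].append(mfcc_of_note(path))

all_feats = [f for group in by_pitch.values() for f in group]
print("whole-instrument cluster :", mean_sq_dist(all_feats))
print("same-pitch sub-clusters  :",
      np.mean([mean_sq_dist(g) for g in by_pitch.values() if len(g) > 1]))
```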
4. DEEP CONVOLUTIONAL NETWORKS
A deep learning system for classification is built by stacking multiple layers of weakly nonlinear transformations, whose parameters are optimized such that the top-level layer fits a training set of labeled examples. This section introduces a typical deep learning architecture for audio classification and describes the functioning of each layer.

Each layer in a convolutional network typically consists of the composition of three operations: two-dimensional convolutions, application of a pointwise nonlinearity, and local pooling. The deep feed-forward network made of two convolutional layers and two densely connected layers, on which our experiments are conducted, has become a de facto standard in the MIR community [7, 13, 15, 17, 18, 23, 26]. This ubiquity in the literature suggests that a four-layer network with two convolutional layers is well adapted to supervised audio classification problems of moderate size.

The input of our system is a constant-Q spectrogram, which is very comparable to a mel-frequency spectrogram. We used the implementation from the librosa package [19] with $Q = 12$ filters per octave, center frequencies ranging from A1 (55 Hz) to A9 (14 kHz), and a hop size of 23 ms. Furthermore, we applied nonlinear perceptual weighting of loudness in order to reduce the dynamic range between the fundamental partial and its upper harmonics. A 3-second sound excerpt is represented by a time-frequency matrix $x_1[t, k_1]$ of width $T = 128$ samples and height $K_1 = 96$ frequency bands.

A convolutional operator is defined as a family $W_2[\tau, \kappa_1, k_2]$ of $K_2$ two-dimensional filters, whose impulse responses are all constrained to have width $\Delta t$ and height $\Delta k_1$. Element-wise biases $b_2[k_2]$ are added to the convolutions, resulting in the three-way tensor
\[
y_2[t, k_1, k_2] = b_2[k_2] + \big( W_2 \overset{t,k_1}{\ast} x_1 \big)[t, k_1, k_2]
= b_2[k_2] + \sum_{0 \le \tau < \Delta t} \, \sum_{0 \le \kappa_1 < \Delta k_1} W_2[\tau, \kappa_1, k_2] \, x_1[t - \tau, k_1 - \kappa_1]. \tag{1}
\]
The pointwise nonlinearity we have chosen is the rectified linear unit (ReLU), with a small rectifying slope $\alpha$ for negative inputs:
\[
y_2^{+}[t, k_1, k_2] =
\begin{cases}
\alpha \, y_2[t, k_1, k_2] & \text{if } y_2[t, k_1, k_2] < 0, \\
y_2[t, k_1, k_2] & \text{if } y_2[t, k_1, k_2] \ge 0.
\end{cases} \tag{2}
\]
The pooling step consists in retaining the maximal activation among neighboring units in the time-frequency domain $(t, k_1)$, over non-overlapping rectangles of width $\Delta t$ and height $\Delta k_1$:
\[
x_2[t, k_1, k_2] = \max_{\substack{0 \le \tau < \Delta t \\ 0 \le \kappa_1 < \Delta k_1}} \Big\{ y_2^{+}[t - \tau, k_1 - \kappa_1, k_2] \Big\}. \tag{3}
\]
The hidden units in $x_2$ are in turn fed to a second layer of convolutions, ReLU, and pooling. Observe that the corresponding convolutional operator $W_3[\tau, \kappa_1, k_2, k_3]$ performs a linear combination of time-frequency feature maps in $x_2$ along the variable $k_2$:
\[
y_3[t, k_1, k_3] = \sum_{k_2} \Big( b_3[k_2, k_3] + \big( W_3 \overset{t,k_1}{\ast} x_2 \big)[t, k_1, k_2, k_3] \Big). \tag{4}
\]
Tensors $y_3^{+}$ and $x_3$ are derived from $y_3$ by ReLU and pooling, with formulae similar to Eqs. (2) and (3). The third layer consists of the linear projection of $x_3$, viewed as a vector of the flattened index $(t, k_1, k_3)$, over $K_4$ units:
\[
y_4[k_4] = b_4[k_4] + \sum_{t, k_1, k_3} W_4[t, k_1, k_3, k_4] \, x_3[t, k_1, k_3]. \tag{5}
\]
We apply a ReLU to $y_4$, yielding $x_4[k_4] = y_4^{+}[k_4]$. Finally, we project $x_4$ onto a layer of output units $y_5$ that should represent instrument activations:
\[
y_5[k_5] = \sum_{k_4} W_5[k_4, k_5] \, x_4[k_4]. \tag{6}
\]
The final transformation is a softmax nonlinearity, which ensures that output coefficients are non-negative and sum to one, hence can be fit to a probability distribution:
\[
x_5[k_5] = \frac{\exp y_5[k_5]}{\sum_{\kappa_5} \exp y_5[\kappa_5]}. \tag{7}
\]
Given a training set of spectrogram-instrument pairs $(x_1, k)$, all weights in the network are iteratively updated to minimize the stochastic cross-entropy loss $L(x_5, k) = -\log x_5[k]$ over shuffled mini-batches with uniform class distribution. The pairs $(x_1, k)$ are extracted on the fly by selecting non-silent regions at random within a dataset of single-instrument audio recordings. Each 3-second spectrogram $x_1[t, k_1]$ within a batch is globally normalized such that the whole batch has zero mean and unit variance. At training time, a random dropout of 50% is applied to the activations of $x_3$ and $x_4$. The learning rate policy for each scalar weight in the network is Adam [16], a state-of-the-art online optimizer for gradient-based learning. Mini-batch training is stopped after the average training loss has stopped decreasing over one full epoch. The architecture is built using the Keras library [4] and trained on a graphics processing unit within minutes.

Figure 3: A two-dimensional deep convolutional network trained on constant-Q spectrograms. See text for details.
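For concreteness, the sketch below assembles a network of this shape with the modern Keras API (the original implementation used an earlier Keras version). The numbers of feature maps, kernel sizes, pooling sizes, and the leaky-ReLU slope are placeholders chosen for illustration, not the exact hyperparameters of the paper.

```python
# Sketch of a two-conv + two-dense network over constant-Q spectrograms.
# Filter counts, kernel sizes, pooling sizes, and the 0.3 slope are illustrative only.
from tensorflow import keras
from tensorflow.keras import layers

T, K1 = 128, 96      # time frames and frequency bands of the input (Section 4)
n_classes = 8        # K5: one output unit per instrument

model = keras.Sequential([
    keras.Input(shape=(T, K1, 1)),
    # First layer: time-frequency convolution, leaky ReLU (Eq. 2), max-pooling (Eq. 3).
    layers.Conv2D(32, kernel_size=(5, 5), padding="same"),
    layers.LeakyReLU(0.3),
    layers.MaxPooling2D(pool_size=(4, 3)),
    # Second layer: convolution mixing feature maps along k2 (Eq. 4), ReLU, pooling.
    layers.Conv2D(32, kernel_size=(5, 5), padding="same"),
    layers.LeakyReLU(0.3),
    layers.MaxPooling2D(pool_size=(4, 3)),
    # Densely connected layers (Eqs. 5-7), with 50% dropout as described in the text.
    layers.Flatten(),
    layers.Dropout(0.5),
    layers.Dense(64),
    layers.LeakyReLU(0.3),
    layers.Dropout(0.5),
    layers.Dense(n_classes, activation="softmax"),
])

# Cross-entropy loss minimized with the Adam optimizer, as described above.
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```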
5. IMPROVED WEIGHT SHARING STRATEGIES
Although a dataset of music signals is unquestionably stationary over the time dimension – at least at the scale of a few seconds – it cannot be taken for granted that all frequency bands of a constant-Q spectrogram would have the same local statistics [12]. In this section, we introduce two alternative architectures to address the nonstationarity of music on the log-frequency axis, while still leveraging the efficiency of convolutional representations.

Many objections can be raised against the stationarity assumption among local neighborhoods in mel frequency. One of the most compelling is derived from the classical source-filter model of sound production. The filter, which carries the overall spectral envelope, is affected by intensity and playing style, but not by pitch. Conversely, the source, which consists of a pseudo-periodic wave, is transposed in frequency under the action of pitch. In order to extract the discriminative information present in both terms, it is first necessary to disentangle the contributions of source and filter in the constant-Q spectrogram. Yet, this can only be achieved by exploiting long-range correlations in frequency, such as harmonic and formant structures. Besides, the harmonic comb created by the Fourier series of the source makes an irregular pattern on the log-frequency axis which is hard to characterize by local statistics.
Facing nonstationary constant-Q spectra, the most conservative workaround is to increase the height $\Delta \kappa_1$ of each convolutional kernel up to the total number of bins $K_1$ in the spectrogram. As a result, $W_2$ and $W_3$ are no longer transposed over adjacent frequency bands, since convolutions are merely performed over the time variable. The definition of $y_2[t, k_1, k_2]$ rewrites as
\[
y_2[t, k_1, k_2] = b_2[k_2] + \big( W_2 \overset{t}{\ast} x_1 \big)[t, k_1, k_2]
= b_2[k_2] + \sum_{0 \le \tau < \Delta t} W_2[\tau, k_1, k_2] \, x_1[t - \tau, k_1], \tag{8}
\]
and similarly for $y_3[t, k_1, k_3]$. While this approach is theoretically capable of encoding pitch invariants, it is prone to early specialization of low-level features, thus not fully taking advantage of the network depth.

However, the situation is improved if the feature maps are restricted to the highest frequencies in the constant-Q spectrum. It should be observed that, around the n-th partial of a quasi-harmonic sound, the distance in log-frequency between neighboring partials decays like 1/n, and the unevenness between those distances decays like 1/n². Consequently, at the topmost octaves of the constant-Q spectrum, where n is equal to or greater than Q, the partials appear close to each other and almost evenly spaced. Furthermore, due to the logarithmic compression of loudness, the polynomial decay of the spectral envelope is linearized: thus, at high frequencies, transposed pitches have similar spectra up to some additive bias. The combination of these two phenomena implies that the correlation between constant-Q spectra of different pitches is greater towards high frequencies, and that the learning of polyvalent feature maps becomes tractable. In our experiments, the one-dimensional convolutions over the time variable range from A6 (1.76 kHz) to A9 (14 kHz).

The weight sharing strategy presented above exploits the facts that, at high frequencies, quasi-harmonic partials are numerous, and that the amount of energy within a frequency band is independent of pitch. At low frequencies, we claim that the harmonic comb is sparse and covariant with respect to pitch shift. Observe that, for any two distinct partials taken at random between 1 and n, the probability that they are in octave relation is slightly above 1/n. Thus, for n relatively low, the structure of harmonic sounds is well described by merely measuring correlations between partials one octave apart. This idea consists in rolling up the log-frequency axis into a Shepard pitch spiral, such that octave intervals correspond to full turns, hence aligning all coefficients of the form $x_1[t, k_1 + Q \times j]$ for $j \in \mathbb{Z}$ onto the same radius of the spiral. Therefore, correlations between power-of-two harmonics are revealed by the octave variable j.

To implement a convolutional network on the pitch spiral, we crop the constant-Q spectrogram in log-frequency into $J = 3$ half-overlapping bands whose height equals $2Q$, that is two octaves. Each feature map in the first layer, indexed by $k_2$, results from the sum of convolutions between a time-frequency kernel and a band, thus emulating a linear combination in the pitch spiral with a 3-d tensor $W_2[\tau, \kappa_1, j, k_2]$ at fixed $k_2$. The definition of $y_2[t, k_1, k_2]$ rewrites as
\[
y_2[t, k_1, k_2] = b_2[k_2] + \sum_{\tau, \kappa_1, j} W_2[\tau, \kappa_1, j, k_2] \, x_1[t - \tau, k_1 - \kappa_1 - Q j]. \tag{9}
\]
The above is different from training a two-dimensional kernel on a time-chroma-octave tensor, since it does not suffer from artifacts at octave boundaries.

The linear combinations of frequency bands that are one octave apart, as proposed here, bear a resemblance to engineered features for musical instrument recognition [22], such as tristimulus, empirical inharmonicity, harmonic spectral deviation, odd-to-even harmonic energy ratio, as well as octave band signal intensities (OBSI) [14].

Guaranteeing that the partial index n remains low is achieved by restricting the pitch spiral to its lowest frequencies. This operation also partially circumvents the problem of the fixed spectral envelope in musical sounds, thus improving the validity of the stationarity assumption. In our experiments, the pitch spiral ranges from A2 (110 Hz) to A6 (1.76 kHz).

In summary, the classical two-dimensional convolutions make a stationarity assumption among frequency neighborhoods. This approach gives a coarse approximation of the spectral envelope. Resorting to one-dimensional convolutions allows to disregard nonstationarity, but does not yield a pitch-invariant representation per se: thus, we only apply them at the topmost frequencies, i.e. where the invariance-to-stationarity ratio in the data is already favorable. Conversely, two-dimensional convolutions on the pitch spiral address the invariant representation of sparse, transposition-covariant spectra: as such, they are best suited to the lowest frequencies, i.e. where partials are further apart and pitch changes can be approximated by log-frequency translations. The next section reports experiments on instrument recognition that capitalize on these considerations.
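As a concrete illustration of Eq. (9), the spiral convolution can be implemented by cropping the low-frequency part of the spectrogram into half-overlapping two-octave bands, stacking them along a channel (octave) axis, and applying an ordinary two-dimensional convolution to the stack; the channel mixing then realizes the sum over the octave variable j. The sketch below uses placeholder shapes and random data, not the paper's exact configuration.

```python
# Rolling up the log-frequency axis into a pitch spiral by stacking
# half-overlapping two-octave bands as channels. Shapes are placeholders.
import numpy as np

Q, J = 12, 3                      # bins per octave; octaves along the spiral radius
T = 128                           # time frames
x1 = np.random.randn(T, 4 * Q)    # low-frequency crop: 4 octaves (e.g. A2 to A6)

# Band j covers bins [Q*j, Q*j + 2*Q): two octaves, shifted by one octave each.
bands = np.stack([x1[:, Q * j: Q * j + 2 * Q] for j in range(J)], axis=-1)
print(bands.shape)                # (128, 24, 3): time x frequency x octave

# A 2-d convolution over (time, frequency) with J input channels now combines
# coefficients that are exactly one octave apart, e.g. with Keras:
#   keras.layers.Conv2D(32, kernel_size=(5, 3))(bands[np.newaxis, ...])
```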
                training            test
              minutes  tracks   minutes  tracks
piano            58      28        44      15
violin           51      14        49      22
dist. guitar     15      14        17      11
female singer    10      11        19      12
clarinet         10       7        13      18
flute             7       5        53      29
trumpet           4       6         7      27
tenor sax.        3       3         6       5
total           158      88       208     139

Table 1: Quantity of data in the training set (left) and test set (right). The training set is derived from MedleyDB. The test set is derived from MedleyDB for distorted electric guitar and female singer, and from [14] for other instruments.
6. APPLICATIONS
The proposed algorithms are trained on a subset of MedleyDB v1.1 [2], a dataset of 122 multitracks annotated with instrument activations. We extracted the monophonic stems corresponding to a selection of eight pitched instruments (see Table 1). Stems with leaking instruments in the background were discarded.

The evaluation set consists of 126 recordings of solo music collected by Joder et al. [14], supplemented with 23 stems of electric guitar and female voice from MedleyDB. In doing so, guitarists and vocalists were put exclusively either in the training set or in the test set, to prevent any artist bias. We discarded recordings with extended instrumental techniques, since they are extremely rare in MedleyDB. Constant-Q spectrograms from the evaluation set were split into half-overlapping, 3-second excerpts.

For the two-dimensional convolutional network, each of the two layers consists of kernels of width and height , followed by a max-pooling of width and height . Expressed in physical units, the supports of the kernels are respectively equal to 116 ms and 580 ms in time, and semitones in frequency. For the one-dimensional convolutional network, each of the two layers consists of kernels of width , followed by a max-pooling of width . Observe that the temporal supports match those of the two-dimensional convolutional network. For the convolutional network on the pitch spiral, the first layer consists of kernels of width , height semitones, and a radial length of octaves in the spiral. The max-pooling operator and the second layer are the same as in the two-dimensional convolutional network.

In addition to the three architectures above, we build hybrid networks implementing more than one of the weight sharing strategies presented above. In all architectures, the densely connected layers have $K_4 = 64$ hidden units and $K_5 = 8$ output units.

In order to compare the results against shallow classifiers, we also extracted a typical "bag-of-features" over half-overlapping, 3-second excerpts in the training set. These features consist of the means and standard deviations of spectral shape descriptors, i.e. centroid, bandwidth, skewness, and rolloff; the mean and standard deviation of the zero-crossing rate in the time domain; and the means of the MFCC as well as of their first and second derivatives. We trained a random forest of decision trees on the resulting feature vector, with balanced class probability.

Table 2: Test set accuracies for all presented architectures.

Results are summarized in Table 2. First of all, the bag-of-features approach presents large accuracy variations between classes, due to the imbalance of available training data. In contrast, most convolutional models, especially hybrid ones, show less correlation between the amount of training data in a class and the accuracy on that class. This suggests that convolutional networks are able to learn polyvalent mid-level features that can be reused at test time to discriminate rarer classes.

Furthermore, 2-d convolutions outperform the other non-hybrid weight sharing strategies. However, a class with broadband temporal modulations, namely the distorted electric guitar, is best classified with 1-d convolutions.

Hybridizing 2-d with either 1-d or spiral convolutions provides consistent, albeit small, improvements with respect to 2-d alone. The best overall accuracy is reached by the full hybridization of all three weight sharing strategies, owing to a performance boost for the rarest classes.

The accuracy gain from combining multiple models could simply be the result of a greater number of parameters. To refute this hypothesis, we train a 2-d convolutional network with more kernels, so as to match the budget of the full hybrid model, i.e. about 150k parameters. The performance is certainly increased, but not up to the level of the hybrid models involving 2-d convolutions, which have fewer parameters. Increasing the number of kernels even more causes the accuracy to level out and the variance between trials to increase.
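A shallow baseline of this kind can be sketched with librosa and scikit-learn as follows; the descriptor list follows the text (spectral skewness, which librosa does not provide directly, is omitted here), and the feature dimensionality and forest size are placeholders.

```python
# Sketch of the bag-of-features baseline: summary statistics of spectral
# descriptors and MFCC, classified by a random forest with balanced classes.
import numpy as np
import librosa
from sklearn.ensemble import RandomForestClassifier

def bag_of_features(y, sr):
    """Means / standard deviations of frame-wise descriptors for one 3-s excerpt."""
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr)
    bandwidth = librosa.feature.spectral_bandwidth(y=y, sr=sr)
    rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr)
    zcr = librosa.feature.zero_crossing_rate(y)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    stats = []
    for d in (centroid, bandwidth, rolloff, zcr):
        stats += [d.mean(), d.std()]
    # Means of the MFCC and of their first and second time derivatives.
    for m in (mfcc, librosa.feature.delta(mfcc), librosa.feature.delta(mfcc, order=2)):
        stats += list(m.mean(axis=1))
    return np.array(stats)

# X: one feature vector per 3-second excerpt; labels: instrument indices.
# (Both are placeholders standing in for the actual training data.)
# X = np.stack([bag_of_features(y, sr) for y, sr in excerpts])
clf = RandomForestClassifier(n_estimators=100, class_weight="balanced")
# clf.fit(X, labels); predictions = clf.predict(X_test)
```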
Running the same experiments with broader frequency ranges for the 1-d and spiral convolutions often led to degraded performance; these results are thus not reported.
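To make the hybridization explicit, the sketch below merges the three branches described above (2-d, 1-d at high frequencies, spiral at low frequencies) before the densely connected layers, using the Keras functional API. Branch widths, kernel sizes, and the frequency split points are illustrative placeholders, not the trained models' settings.

```python
# Sketch of a hybrid network: three weight sharing strategies merged at the top.
from tensorflow import keras
from tensorflow.keras import layers

T, K1, n_classes = 128, 96, 8
inp = keras.Input(shape=(T, K1, 1))

# 2-d branch: time-frequency kernels over the whole spectrogram.
b2d = layers.Conv2D(32, (5, 5), activation="relu")(inp)
b2d = layers.Flatten()(layers.MaxPooling2D((4, 3))(b2d))

# 1-d branch: full-height kernels (time-only convolution) on the highest octaves.
high = layers.Cropping2D(((0, 0), (60, 0)))(inp)            # keep topmost 36 bins
b1d = layers.Conv2D(32, (5, 36), activation="relu")(high)
b1d = layers.Flatten()(layers.MaxPooling2D((4, 1))(b1d))

# Spiral branch: half-overlapping two-octave bands of the lowest 4 octaves,
# stacked as channels so that the convolution mixes octave-separated bins.
low = layers.Cropping2D(((0, 0), (0, 48)))(inp)             # keep lowest 48 bins
bands = layers.Concatenate(axis=-1)(
    [layers.Cropping2D(((0, 0), (12 * j, 24 - 12 * j)))(low) for j in range(3)])
bsp = layers.Conv2D(32, (5, 3), activation="relu")(bands)
bsp = layers.Flatten()(layers.MaxPooling2D((4, 3))(bsp))

# Hybridize the three mid-level representations before the dense layers.
merged = layers.Concatenate()([b2d, b1d, bsp])
hidden = layers.Dense(64, activation="relu")(merged)
out = layers.Dense(n_classes, activation="softmax")(hidden)
model = keras.Model(inp, out)
model.compile(optimizer="adam", loss="categorical_crossentropy")
```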
7. CONCLUSIONS
Understanding the influence of pitch in audio streams is paramount to the design of an efficient system for automated classification, tagging, and similarity retrieval in music. We have presented deep learning methods to address pitch invariance while preserving good timbral discriminability. They consist in training a feed-forward convolutional network over the constant-Q spectrogram, with three different weight sharing strategies according to the type of input: along time at high frequencies (above 1.76 kHz), on a Shepard pitch spiral at low frequencies (below 1.76 kHz), and in time-frequency over both high and low frequencies.

A possible improvement of the presented architecture would be to place a third convolutional layer in the time domain before performing long-term max-pooling, hence modelling the joint dynamics of the three mid-level feature maps. Future work will investigate the association of the presented weight sharing strategies with recent advances in deep learning for music informatics, such as data augmentation [18], multiscale representations [1, 11], and adversarial training [15].
8. REFERENCES

[1] Joakim Andén, Vincent Lostanlen, and Stéphane Mallat. Joint time-frequency scattering for audio classification. In Proc. MLSP, 2015.
[2] Rachel Bittner, Justin Salamon, Mike Tierney, Matthias Mauch, Chris Cannam, and Juan Bello. MedleyDB: a multitrack dataset for annotation-intensive MIR research. In Proc. ISMIR, 2014.
[3] Keunwoo Choi, George Fazekas, Mark Sandler, and Jeonghee Kim. Auralisation of deep convolutional neural networks: listening to learned features. In Proc. ISMIR, 2015.
[4] François Chollet. Keras: a deep learning library for Theano and TensorFlow, 2015.
[5] Alain de Cheveigné. Pitch perception. In Oxford Handbook of Auditory Science: Hearing, chapter 4, pages 71–104. Oxford University Press, 2005.
[6] Sander Dieleman, Philémon Brakel, and Benjamin Schrauwen. Audio-based music classification with a pretrained convolutional network. In Proc. ISMIR, 2011.
[7] Sander Dieleman and Benjamin Schrauwen. End-to-end learning for music audio. In Proc. ICASSP, 2014.
[8] Simon Durand, Juan P. Bello, Bertrand David, and Gaël Richard. Feature-adapted convolutional neural networks for downbeat tracking. In Proc. ICASSP, 2016.
[9] Antti Eronen and Anssi Klapuri. Musical instrument recognition using cepstral coefficients and temporal features. In Proc. ICASSP, 2000.
[10] Masataka Goto, Hiroki Hashiguchi, Takuichi Nishimura, and Ryuichi Oka. RWC music database: music genre database and musical instrument sound database. In Proc. ISMIR, 2003.
[11] Philippe Hamel, Yoshua Bengio, and Douglas Eck. Building musically-relevant audio features through multiple timescale representations. In Proc. ISMIR, 2012.
[12] Eric J. Humphrey, Juan P. Bello, and Yann LeCun. Feature learning and deep architectures: new directions for music informatics. JIIS, 41(3):461–481, 2013.
[13] Eric J. Humphrey, Taemin Cho, and Juan P. Bello. Learning a robust tonnetz-space transform for automatic chord recognition. In Proc. ICASSP, 2012.
[14] Cyril Joder, Slim Essid, and Gaël Richard. Temporal integration for audio classification with application to musical instrument classification. IEEE TASLP, 17(1):174–186, 2009.
[15] Corey Kereliuk, Bob L. Sturm, and Jan Larsen. Deep learning and music adversaries. IEEE Trans. Multimedia, 17(11):2059–2071, 2015.
[16] Diederik P. Kingma and Jimmy Lei Ba. Adam: a method for stochastic optimization. In Proc. ICML, 2015.
[17] Peter Li, Jiyuan Qian, and Tian Wang. Automatic instrument recognition in polyphonic music using convolutional neural networks. arXiv preprint, 1511.05520, 2015.
[18] Brian McFee, Eric J. Humphrey, and Juan P. Bello. A software framework for musical data augmentation. In Proc. ISMIR, 2015.
[19] Brian McFee, Matt McVicar, Colin Raffel, Dawen Liang, Oriol Nieto, Eric Battenberg, Josh Moore, Dan Ellis, Ryuichi Yamamoto, Rachel Bittner, Douglas Repetto, Petr Viktorin, João Felipe Santos, and Adrian Holovaty. librosa: 0.4.1. Zenodo, 10.5281/zenodo.18369, October 2015.
[20] Michael J. Newton and Leslie S. Smith. A neurally inspired musical instrument classification system based upon the sound onset. JASA, 131(6):4785, 2012.
[21] Kailash Patil, Daniel Pressnitzer, Shihab Shamma, and Mounya Elhilali. Music in our ears: the biological bases of musical timbre perception. PLoS Comput. Biol., 8(11):e1002759, 2012.
[22] Geoffroy Peeters. A large set of audio features for sound description (similarity and classification) in the CUIDADO project. Technical report, Ircam, 2004.
[23] Jan Schlüter and Sebastian Böck. Improved musical onset detection with convolutional neural networks. In Proc. ICASSP, 2014.
[24] Siddharth Sigtia, Emmanouil Benetos, and Simon Dixon. An end-to-end neural network for polyphonic music transcription. arXiv preprint, 1508.01774, 2015.
[25] Dan Stowell and Mark D. Plumbley. Automatic large-scale classification of bird sounds is strongly improved by unsupervised feature learning. PeerJ, 2:e488, 2014.
[26] Karen Ullrich, Jan Schlüter, and Thomas Grill. Boundary detection in music structure analysis using convolutional neural networks. In Proc. ISMIR, 2014.
[27] Aaron van den Oord, Sander Dieleman, and Benjamin Schrauwen. Deep content-based music recommendation. In Proc. NIPS, 2013.