Music Boundary Detection using Convolutional Neural Networks: A comparative analysis of combined input features
Carlos Hernandez-Olivan, Jose R. Beltran, David Diaz-Guerra
Abstract—The analysis of the structure of musical pieces is a task that remains a challenge for Artificial Intelligence, especially in the field of Deep Learning. It requires prior identification of the structural boundaries of the music pieces. This structural boundary analysis has recently been studied with unsupervised methods and end-to-end techniques such as Convolutional Neural Networks (CNN), using Mel-Scaled Log-magnitude Spectrogram features (MLS), Self-Similarity Matrices (SSM) or Self-Similarity Lag Matrices (SSLM) as inputs and trained with human annotations. Several studies have been published, divided into unsupervised and end-to-end methods, in which pre-processing is done in different ways, using different distance metrics and audio characteristics, so a generalized pre-processing method to compute model inputs is missing. The objective of this work is to establish a general pre-processing method for these inputs by comparing the inputs calculated from different pooling strategies, distance metrics and audio characteristics, also taking into account the computing time needed to obtain them. We also establish the most effective combination of inputs to be delivered to the CNN, in order to establish the most efficient way to extract the boundaries of the structure of the music pieces. With an adequate combination of input matrices and pooling strategies we obtain an F-measure of 0.411 that outperforms the current one obtained under the same conditions.

I. INTRODUCTION
Music Information Retrieval (MIR) is the interdisciplinary science of retrieving information from music. MIR is a field of research that faces different tasks in automatic music analysis, such as pitch tracking, chord estimation, score alignment or music structure detection.

This paper deals with the issue of structure detection in musical pieces. In particular, we address the comparison of different methods of boundary detection between musical parts by means of Convolutional Neural Networks. One of the most active scientific communities of reference in MIR is the Music Information Retrieval Evaluation eXchange (MIREX), a community which every year holds the International Society for Music Information Retrieval Conference (ISMIR). Algorithms are submitted to be tested on MIREX's datasets within the different MIR tasks. Most of the previous results analyzed and compared in this work have been presented in different MIREX campaigns.

Musical analysis [1], viewed from the point of view of music theory, is a discipline which studies the musical structure in order to reach a general and thorough comprehension of the music. It is a very complex task because the analysis of musical pieces from the same period or from different periods in the history of music can be carried out in many different ways. The concept of musical analysis has evolved throughout history and has included different types of analysis, such as formal analysis, harmonic analysis and stylistic analysis. In this work we study the formal analysis, in other words, the structural analysis of musical pieces.

The automatic structural analysis of music is a very complex challenge that has been studied in recent years, but it has not yet been solved with a quality that surpasses the analysis performed by musicians or specialists. This kind of analysis is only a part of musical analysis, and it involves musical aspects [2] such as harmony, timbre and tempo, and segmentation principles such as repetition, homogeneity and novelty. The benefit of solving this task comes from the importance of understanding how music is formed: the narrative, the performance, the particularities of the composition techniques of an author or period, in other words, the pillars of music composition. This automatic music analysis can be approached starting from different music representations, such as the score of the piece, a MIDI file or the raw audio file.

In music, form refers to the structure of a musical piece, which consists of dividing music into small units, starting with motifs, then phrases and finally sections, which express a musical idea. Boundary detection is the first step of musical form analysis, and it must be done before naming the different segments depending on the similarity between them. This last step is named Labelling or Clustering. This task, translated to the most common genre in MIREX datasets, the pop genre, would be the detection and extraction of the chorus, verse or introduction of the corresponding song. Detecting the boundaries of music pieces consists of identifying the transitions where these parts begin and end, a task that professional musicians do almost automatically by reading a score. This detection of the boundaries in a musical piece is based on the Audio Onset Detection task, which is the first step for several higher-level music analysis tasks such as beat detection, tempo estimation and transcription.

The automatic analysis of musical forms studies the musical form by segmenting music signals. It is important to say that there is no single rule or defined method to analyze music, so, even though the most typical musical forms like the Sonata, the Minuet and others have their respective structures, there is a lot of music that can be analyzed in different ways.

As mentioned above, after identifying the boundaries of a piece of music, the labelling phase must be carried out. This phase is the main objective of the structural analysis of music and, regardless of the way the analysis is performed (unsupervised methods or supervised neural networks), the first step to be carried out is boundary detection.

This problem can be approached with different techniques that have in common the need of pre-processing the audio files in order to extract the desired audio features, and then applying unsupervised or supervised neural network methods. There are several studies where this pre-processing step is made in different ways, so there is not yet a generalized input pre-processing method. The current best-performing supervised neural network methods use CNNs trained with human annotations. The inputs to the CNN are Mel-Scaled Log-magnitude Spectrograms (MLSs) [3], Self-Similarity Lag Matrices (SSLMs) in combination with the MLSs [4], and also combinations of these matrices with chromas [5].

One of the limits of these methods is that the analysis and results obtained depend largely on the database annotator, so there could be inconsistencies between different annotators when analyzing the same piece. These methods are limited by the quality of the labels given by the annotators, so they cannot outperform them.

The paper is structured as follows: Section II presents an overview of the related work and previous studies on which this work is based; the Self-Similarity Matrices and the used datasets are also presented. In Section III, the pre-processing method of the matrices which will be the inputs to the NN is explained. Section IV introduces the database used for training, validating and testing, and the labelling process. Section V shows the NN structure and the thresholding and peak-picking strategies, and Section VI describes the metrics used to test the model and exposes the results of the experiments and their comparison with previous studies. Finally, Section VII presents the conclusions and a proposal for future lines of work.

C. Hernandez-Olivan, J. R. Beltran and D. Diaz-Guerra are with the Department of Electronic Engineering and Communications, University of Zaragoza, Zaragoza, Spain.
https://musicinformationretrieval.com/index.html

II. RELATED WORK
Several studies have been carried out in the field of structure recognition in music since Foote introduced the self-similarity matrix in 1999 [6] and, later, in 2003, Goto derived from it the self-similarity lag matrix [7]. Before that, studies were based on processing spectrograms [8], but in recent years it has been demonstrated that SSMs and SSLMs calculated from audio features, used along with spectrograms, yield better results.
A. Unsupervised Methods
It is difficult to draw a line between works which try to extract only the boundaries of music pieces and those which try to cluster the different parts of the structure, because the principal idea of unsupervised methods is to extract the musical structure of music pieces rather than the boundaries alone. We therefore describe previous work in both areas, which belong to the same task in MIREX's campaigns, the Music Structure Segmentation task.

These methods can be summarized in three approaches, according to Paulus et al. [23], based on: novelty, homogeneity and repetition. These approaches are computed with unsupervised and supervised Machine Learning algorithms such as genetic algorithms (fitness functions), Hidden Markov Models (HMM), K-means, Linear Discriminant Analysis (LDA), Decision Stumps or checkerboard-like kernels.
The novelty-based approach consists of detecting the transitions between contrasting parts. This approach is well served by checkerboard-like kernel methods. These methods are based on the construction of a 2D kernel which is applied to the similarity matrix in order to measure the self-similarity on either side of the center point of the similarity matrix; this measure is high when both regions are self-similar. The cross-similarity between the two regions is then calculated. The difference between these two measures estimates the novelty of the signal at the center point. This was introduced by Foote in 2000 [24], who used this method to extract the segment boundaries of audio tracks using the similarity matrix as input and then calculating the correlations with the proper kernel. These methods have evolved over the years; for example, Kaiser and Peeters in 2013 [25] used multiple-temporal-scale kernels, proposing a fusion of the novelty and repetition approaches.
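As an illustration of the checkerboard-kernel idea, the following minimal Python sketch computes Foote's novelty curve from a precomputed SSM; the kernel size and the Gaussian taper width are assumptions, not values from the works cited above.

```python
import numpy as np

def checkerboard_kernel(m, taper=True):
    """2m x 2m kernel: +1 on the self-similarity quadrants, -1 on the
    cross-similarity quadrants, optionally tapered with a Gaussian."""
    axis = np.arange(-m, m) + 0.5
    kernel = np.sign(np.outer(axis, axis))       # checkerboard of +1 / -1
    if taper:
        g = np.exp(-((axis / (0.5 * m)) ** 2))
        kernel *= np.outer(g, g)                 # smooth the kernel edges
    return kernel

def novelty_curve(ssm, m=16):
    """Correlate the kernel along the main diagonal of the SSM; peaks of
    the resulting curve indicate candidate segment boundaries."""
    n = ssm.shape[0]
    kernel = checkerboard_kernel(m)
    padded = np.pad(ssm, m, mode="constant")
    return np.array([np.sum(kernel * padded[i:i + 2 * m, i:i + 2 * m])
                     for i in range(n)])
```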
The homogeneity-based approach is based on the identification of sections that are consistent with respect to their musical properties. These methods used Hidden Markov Models, as Logan and Chu did in 2000 [26], Aucouturier and Sandler in 2001 [27] or Levy and Sandler in 2008 [28]. Hidden Markov Models (HMM) are based on augmenting Markov chains and seek statistical models that reflect the structure of the data, so they can learn the segmentation of the music signal. These methods used audio feature vectors such as MFCCs as inputs and, later on, in 2019, self-similarity matrices in the work of Tralie and McFee [29], where different frame-level features such as MFCCs, chromas, the CREMA chord estimation model [30] and tempograms were combined using Similarity Network Fusion (SNF) to later compute the segmentation and clustering with the L-measure method.
The repetition-based approach refers to the determination of recurring patterns. These methods apply a clustering algorithm to the Self-Similarity or Self-Similarity Lag Matrices. They are more applicable for labelling the structural parts of music pieces than for the precise segmentation required by boundary detection. Lu et al. in 2004 [31] and Paulus and Klapuri in 2006 [32] are examples of these techniques. Segmentation with a repetition approach was done by Turnbull et al. in 2007 [19], who combined temporally-local audio features in an AdaBoost approach with a Boosted Decision Stump (BDS). Later on, McFee and Ellis [13] in 2014 used a Linear Discriminant Analysis, Fisher's linear discriminant, which simultaneously maximizes the distance between class centroids and minimizes the variance of each class individually. In 2019, McCallum [33] used unsupervised training of deep feature embeddings with Convolutional Neural Networks (CNNs) for music segmentation. These techniques did not show a significant improvement in the results with respect to the classic unsupervised algorithms described previously.

To conclude, we can state that unsupervised algorithms are very efficient at the labelling (clustering) part, but not at the boundary detection task, which is better performed by the supervised neural networks that came up in 2014 and are described in the next section.
TABLE I: Results of boundary detection of previous studies for the "Full Structure" and "Segmentation" tasks. Only the best-performing algorithm of each year in terms of F-measure is shown, evaluated on the MIREX09, RWC-A, RWC-B and SALAMI testing databases. The table groups unsupervised methods and supervised neural networks; the Method column included Fitness function, HMM, Viterbi, Novelty measure, Fisher's Linear Discriminant, Checkerboard-like kernel and Linear Discriminant Analysis. [Per-row years, authors, inputs and F-measures are not recoverable from the source.]
TABLE II: Results of previous works in the boundary detection task. Columns: Year, Authors [Ref.], Input, Method, Train Set, and F-measure (F1) on the MIREX09, RWC-A, RWC-B and SALAMI testing databases. The table is divided into Unsupervised Methods and Supervised Neural Networks. Recoverable rows: Boosted Decision Stump, F = 0.378; Sargent et al. [20] (2011), MFCCs and chromas, Viterbi, F = 0.356. [Remaining rows and values are not recoverable from the source.]
B. Supervised Neural Networks
The term end-to-end learning refers to architectures that go from a pre-processed input to the desired output [34]. These models learn from data, so they can generalize much better than unsupervised methods and require less manual pre-processing. A general block diagram of this method is presented in Fig. 1.

Previous studies used Mel-Scaled Log-magnitude Spectrograms (MLS) as the inputs of CNNs for boundary detection [3]. This method was based on the Audio Onset Detection MIREX task, which consists of finding the starting points of all musically relevant events in an audio signal, and in particular on the algorithm presented in the 2013 MIREX campaign [21]. Onset detection in audio signals consists of detecting events in music signals, specifically the beginning of a music note. It can be interpreted as a computer vision problem, like edge detection, but applied to spectrograms instead of images with different textures.

Later on, in 2015, Grill and Schlüter improved their previous work by adding SSLMs to the input, which yielded better results [4], and the addition of SSLMs with different lag factors to the input of the CNN [5] outperformed this method, giving the best result to date.

In Tables I and II, a recap of the results of all previous work on boundary detection with unsupervised methods and supervised neural networks is presented. Results and "Code" names in Table I have been extracted from MIREX campaigns of different years. It must be said that the results obtained with unsupervised methods in Table I are not as high as the results obtained with supervised neural networks because their goal was not boundary detection (segmentation) itself but full structure identification (labelling).
Fig. 1: General scheme of supervised neural networks.

C. Self-Similarity Matrices (SSM)
The Self-Similarity Matrix [2] is a tool used not only in music structure analysis but also in time series analysis tasks. Its homogeneous regions represent the structural elements of music, which leads this matrix, and its combination with spectrograms, to be the input of almost every model described in Sections II-A and II-B. In this work this matrix is important because music is in itself self-similar, that is, it is formed by similar time series.

Self-Similarity Matrices have been employed under the name of Recurrence Plots for the analysis of dynamical systems [35], but their introduction to the music domain was done by Foote [6] in 1999 and, since then, different techniques for computing these matrices have appeared, which highlight different aspects of the audio features from which the SSM is formed. The SSM relies on the concept of self-similarity. The self-similarity is measured by a similarity function $s$ which is applied to the audio feature representation. As an example [6], the similarity between two feature vectors derived from audio windows $y$ of length $N$ can be expressed as in Eq. 1. The result is an $N$-square matrix $\mathrm{SSM} \in \mathbb{R}^{N \times N}$, $N$ being the time dimension:

$$\mathrm{SSM}(n, m) = s(y_n, y_m) \qquad (1)$$

where $n, m \in [1, \dots, N]$.

The similarity function is obtained by calculating a distance between the two feature vectors mentioned before. In the literature, this distance is usually the Euclidean distance $\delta_{eucl}$ or the cosine distance $\delta_{cos}$:

$$\delta_{eucl}(u, v) = \lVert u - v \rVert \qquad (2)$$

$$\delta_{cos}(u, v) = 1 - \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert} \qquad (3)$$

where $u$ and $v$ are time series vectors.

Self-Similarity Matrices can be computed from different audio feature representations, such as MFCCs or chromas, depending on the properties we need to capture from the music, and they can also be obtained by combining different frame-level audio features [29]. MFCCs are more related to instrumentation and timbre, whereas chromas better capture the beat, tempo and rhythmic information. Once the similarity function has been applied to all pairs of vectors of the audio feature representation and the SSM has been calculated, we can filter the SSM by applying thresholding, smoothing or transposition-invariance techniques, so we can enhance the paths, emphasize the diagonal information and obtain a representation of the SSM that is more adequate to visualize and analyze the structure. The SSM can also be obtained with other techniques, such as the clustering method proposed by Serrà et al. in [36], where the SSM is obtained by applying the k-nn algorithm. These matrices can then be smoothed or post-processed in order to apply the unsupervised clustering algorithms described in Section II-A; in the work of Serrà et al. [36], for example, the resulting SSM is convolved with a Gaussian filter in order to cluster the different parts of musical pieces.

After Foote defined the SSM in 1999, Goto [7] defined in 2003 a variant of the SSM which is called the Self-Similarity Lag Matrix (SSLM). The dimensions of this matrix are not $N \times N$ but $N \times L$, $L$ being the lag factor. With this representation it is possible to plot the relations between past events and their repetitions in the future. The SSLM is a non-square matrix: $\mathrm{SSLM} \in \mathbb{R}^{N \times L}$. Some libraries calculate this SSLM after computing the SSM or the recurrence plot, as in Eq. 4:

$$\mathrm{SSLM}(i, j) = \mathrm{SSM}(k + 1, j) \qquad (4)$$
with $i, j = 1, \dots, N$ and $k = i + j - N$.

The choice of the type of audio feature representation for computing the SSMs or SSLMs, and the choice between SSMs and SSLMs, is one of the most important steps when solving a MIR task, and it has to be studied depending on the issue we want to face.
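As a quick way to experiment with these representations, the sketch below builds an affinity SSM from MFCC features and converts it to a lag representation with librosa [44]. This is only an illustration of Eqs. 1-4 with library defaults, not the pre-processing pipeline of Section III, and the file name is hypothetical.

```python
import librosa

# Load audio and compute frame-level features (MFCCs here; chroma also works)
y, sr = librosa.load("song.wav", sr=44100)          # hypothetical file name
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)

# SSM (Eq. 1): pairwise affinities between feature frames;
# metric="cosine" corresponds to Eq. 3, metric="euclidean" to Eq. 2
ssm = librosa.segment.recurrence_matrix(mfcc, mode="affinity",
                                        metric="cosine", sym=True)

# Lag representation (Eq. 4): one axis becomes time, the other lag
sslm = librosa.segment.recurrence_to_lag(ssm, pad=False)
```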
D. Datasets

Previous works have been tested in the annual Music Information Retrieval Evaluation eXchange (MIREX [37]). MIREX is a framework for evaluating music information retrieval algorithms. The evaluation tasks are defined by the research community under the coordination of the International Music Information Retrieval Systems Evaluation Laboratory at the University of Illinois at Urbana-Champaign [38]. MIREX's testing database collection contains the MIREX09, MIREX10 and MIREX12 datasets.

The first dataset of the MIREX structure segmentation task was the MIREX09 dataset, consisting of Beatles songs plus another smaller dataset (http://ifs.tuwien.ac.at/mir/audiosegmentation.html). The Beatles dataset has two annotation versions: one is the Paulus Beatles or Beatles-TUT dataset, and the second is the Isophonic Beatles or Beatles-ISO dataset (http://isophonics.net/content/reference-annotations). The second MIREX dataset was MIREX10, formed by the RWC dataset [39]. This dataset has two annotation versions: RWC-A, of the QUAERO project (http://musicdata.gforge.inria.fr), which is the one corresponding to MIREX10, and RWC-B [40] (http://staff.aist.go.jp/m.goto/RWC-MDB/AIST-Annotation), which is the original annotated version whose annotation guidelines were established by [41].

A few years later, the MIREX12 dataset provided a greater variety of songs than MIREX10 [42]. MIREX12 is formed by the "Structural Analysis of Large Amounts of Music Information" (SALAMI, https://ddmal.music.mcgill.ca/research/SALAMI/) dataset, which has evolved into its more recent version, the SALAMI 2.0 database. An analysis of the MIREX structure segmentation task was published in 2012 [43]. Our work uses the available SALAMI 2.0 dataset.

III. AUDIO PROCESSING
This work is based on the previous works of Ullrich, Schlüter and Grill [3], [4], who explain the procedure for obtaining the SSLMs from MFCC features. We extend these works by calculating the SSLMs from chroma features and also applying the Euclidean distance, in order to provide a comparison and establish the best-performing input to the NN model. The code for computing these matrices is available at https://github.com/carlosholivan/SelfSimilarityMatrices.
A. Mel Spectrogram
The first step to process the audio files and extract their different features is to compute the Short-Time Fourier Transform (STFT) with a Hanning window of 46 ms (2048 samples at a 44.1 kHz sample rate) and an overlap of 50%. Then, a mel-scaled filterbank of 80 triangular filters from 80 Hz to 16 kHz is applied, and the magnitudes are scaled logarithmically to obtain the mel-spectrogram (MLS) of the audio file. We have used the librosa library [44], so the mel-spectrogram can be computed directly from these parameters. The MLS is max-pooled by a factor of p = 6 to give the Neural Network a manageable input size. The size of the MLS matrix is $[P, N]$, with $P$ being the number of frequency bins (equal to the number of triangular filters) and $N$ the number of time frames. We name each MLS frame $x_i$, with $i = 1 \dots N$.
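A minimal sketch of this step with librosa, using the parameters above (the file name is hypothetical):

```python
import librosa

y, sr = librosa.load("song.wav", sr=44100)       # hypothetical file name
n_fft = 2048                                     # 46 ms Hanning window at 44.1 kHz
hop = n_fft // 2                                 # 50% overlap

mls = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft, hop_length=hop,
                                     n_mels=80, fmin=80, fmax=16000)
mls = librosa.power_to_db(mls)                   # logarithmic magnitude scaling

# Max-pool the time axis by p = 6 (trimming so the length divides evenly)
p = 6
n = mls.shape[1] // p * p
mls_pooled = mls[:, :n].reshape(mls.shape[0], -1, p).max(axis=2)
```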
B. Self-Similarity Lag Matrix from MFCCs

The method that we use to generate the SSLMs is the same method that Grill and Schlüter used in [4] and [5], which in turn derives from Serrà et al. [45]. The first step, after computing the mel-spectrogram frames $x_i$, is to pad, at the beginning of the mel-spectrogram, a vector $\Phi_i$ with a white-noise constant value of -70 dB, of $L$ lag seconds in the time dimension and $P$ mel-bands in the frequency dimension:

$$\check{x}_i = \Phi_i \,\Vert\, x_i \qquad (5)$$

where $\Phi_i$ is a vector of shape $[L, P]$.

Then, a max-pooling of factor $p$ is done in the time dimension, as shown in Eq. 6:

$$x'_i = \max_{j = 1 \dots p} \check{x}_{(i-1)p + j} \qquad (6)$$

After that, we apply a Discrete Cosine Transform of Type II to each frame, omitting the first element:

$$\tilde{X}_i = \mathrm{DCT}^{(\mathrm{II})}_{2 \dots P}(x'_i) \qquad (7)$$

where $P$ is the number of mel-bands.

Now we stack the time frames by a factor $m$, so we obtain the time series in Eq. 8. The resulting $\hat{X}_i$ vector has dimensions $[(P - 1) \cdot m, (N + L)/p]$, where $N$ is the number of time frames before the max-pooling and $L$ the lag factor in frames.

$$\hat{X}_i = [\tilde{X}_i^T \,\Vert\, \cdots \,\Vert\, \tilde{X}_{i+m}^T]^T \qquad (8)$$

The final SSLM matrix is obtained by calculating a distance between the vectors $\hat{X}_i$. Two different distance metrics have been implemented, Euclidean and cosine, which allows us to compare them and deduce which SSLM performs better. The distance between two vectors $\hat{X}_i$ and $\hat{X}_{i-l}$ using the distance metric $\delta$ is

$$D_{i,l} = \delta(\hat{X}_i, \hat{X}_{i-l}), \qquad l = 1 \dots \left\lfloor \frac{L}{p} \right\rfloor \qquad (9)$$

where $\delta$ is a distance metric as defined in Eqs. 2 and 3. Then, we compute an equalization factor $\varepsilon_{i,l}$ as a quantile $\kappa$ of the distances $\delta(\hat{X}_i, \hat{X}_{i-j})$ for $j = 1 \dots \lfloor L/p \rfloor$:

$$\varepsilon_{i,l} = Q_\kappa \left( D_{i,1}, \dots, D_{i,\lfloor L/p \rfloor}, D_{i-l,1}, \dots, D_{i-l,\lfloor L/p \rfloor} \right) \qquad (10)$$

We now remove the $L/p$ lag bins at the beginning of the time dimension of the distance matrix $D_{i,l}$ and of the equalization factor $\varepsilon_{i,l}$, and we apply Eq. 6 with max-pooling factor $p$. Finally, we obtain the SSLM by applying Eq. 11:

$$R_{i,l} = \sigma \left( \frac{D_{i,l}}{\varepsilon_{i,l}} \right) \qquad (11)$$

where $\sigma(x) = e^{-x}$.

Once the SSLM has been obtained, we pad $\gamma$ = 50 time frames of pink noise at the beginning and end of the SSLM and MLS matrices, and we normalize each frequency band of the MLS, and each lag band of the SSLMs, to zero mean and unit variance. Note also that if some time frames have identical values, the cosine distance gives a NaN (not-a-number) value; we avoid this by converting all these NaN values to zero as the last step of the SSLM computation.
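The following numpy sketch traces Eqs. 5-11 for the variant with a single max-pooling of factor 6 at the start. The stacking factor m and the quantile kappa are assumed values, since they are not recoverable from Table III, and the exponential mapping follows Eq. 11 as reconstructed above.

```python
import numpy as np
from scipy.fftpack import dct
from scipy.spatial.distance import cosine

def sslm_from_mls(mls_db, sr=44100, hop=1024, p=6, lag_s=14, m=2, kappa=0.1):
    """Sketch of Eqs. 5-11; m and kappa are assumed values."""
    P, N = mls_db.shape
    L = int(lag_s * sr / hop)                    # lag in (un-pooled) frames
    # Eq. 5: pad L frames of a constant -70 dB floor at the beginning
    padded = np.hstack([np.full((P, L), -70.0), mls_db])
    # Eq. 6: max-pool the time axis by a factor p
    n = padded.shape[1] // p * p
    pooled = padded[:, :n].reshape(P, -1, p).max(axis=2)
    # Eq. 7: DCT of type II per frame, omitting the first coefficient
    x = dct(pooled, type=2, axis=0, norm="ortho")[1:, :]
    # Eq. 8: stack m consecutive frames
    T = x.shape[1] - m + 1
    emb = np.vstack([x[:, i:i + T] for i in range(m)])
    lag = L // p
    # Eq. 9: cosine distance between each frame and its l-lagged predecessor
    D = np.zeros((T, lag))
    for i in range(lag, T):
        for l in range(1, lag + 1):
            D[i, l - 1] = cosine(emb[:, i], emb[:, i - l])
    # Eq. 10: kappa-quantile equalization; Eq. 11: distances -> similarities
    R = np.zeros_like(D)
    for i in range(lag, T):
        for l in range(1, lag + 1):
            eps = np.quantile(np.concatenate([D[i], D[i - l]]), kappa)
            R[i, l - 1] = np.exp(-D[i, l - 1] / max(eps, 1e-9))
    return R[lag:].T                             # drop the padded lag bins
```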
C. Self-Similarity Lag Matrix from Chromas

The process of computing the SSLM from chroma features is similar to the method explained in Section III-B. The difference is that, instead of starting by padding the mel-spectrogram in Eq. 5, we pad the STFT, and after applying the max-pooling in Eq. 6 we compute the chroma filters instead of the DCT in Eq. 7. The rest of the process is the same as described in Section III-B.

All the values of the parameters used to obtain the Self-Similarity Matrices are summarized in Table III. In addition to the Euclidean and cosine metrics, and the MFCC and chroma audio features, we compare two pooling strategies. The first one is to apply the max-pooling of factor p = 6 described in Eq. 6 to the STFT and to the mel-spectrogram for the SSLMs from chromas and MFCCs, respectively. The other pooling strategy is the one shown in Fig. 2. We denote these pooling variants as 6pool and 2pool3. The total time for processing all the SSLMs (MFCCs and cosine distance) was a factor of 4 faster for 6pool than for 2pool3.

The general schema of the pre-processing block is depicted in Fig. 2.

Fig. 2: General block diagram of the pre-processing block in Fig. 1. The * mark in the max-pooling blocks refers to the two variants considered in this work: 2pool3 is the one shown in the scheme, and 6pool is computed by applying a max-pooling of factor 6 in the first * block and removing the second * block of the scheme.

IV. DATASET
The algorithm was trained, validated and tested on a subset of the Structural Analysis of Large Amounts of Music Information (SALAMI) dataset [46]. The SALAMI dataset contains 1048 doubly-annotated pieces, from which we could obtain 1006.

TABLE III: Parameter final values.

Parameter        | Symbol | Value   | Units
sampling rate    | sr     | 44100   | Hz
window length    | w      | 46      | ms
overlap          | -      | 50      | %
hop length       | h      | 23      | ms
lag              | L      | 14      | s
pooling factors  | p1, p2, p | 2, 3, 6 | -
stacking factor  | m      | [not recoverable] | -
quantile         | kappa  | [not recoverable] | -
padding          | gamma  | 50      | frames
A. Labelling Process
As explained in [3], it is necessary to transform the labels of the SALAMI text files into Gaussian functions so that the Neural Network can be trained correctly. We first set the mean values of the Gaussian functions by transforming the labels, given in seconds, into time frames as shown in Eq. 12, constructing a vector $\mu_i$ of dimension equal to the number of labels in the text file. In Eq. 12, $label_i$ are the labels in seconds extracted from the SALAMI "functions" text files, and $p_1$, $p_2$, $h$, $sr$ and $\gamma$ are defined in Table III:

$$\mu_i = \frac{label_i}{p_1 \, p_2 \, h} + \gamma \qquad (12)$$

Then we transform these labels into a frame-level target curve by applying a Gaussian function with standard deviation $\sigma$ = 0.1 and mean equal to each $\mu_i$ of Eq. 12. In Eq. 13, a Gaussian with standard deviation $\sigma$ is centered at each label position, evaluated at each time frame position $y'_i$:

$$\mathrm{gaussian\_labels}_i = g(y'_i, \mu_i, \sigma) \qquad (13)$$

with

$$g(x, \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \, e^{-\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^2}$$

where $y'_i$ runs over the pooled and padded time-frame positions.

To train the model, we removed the first label from each text file, due to the proximity of the first two labels in almost all files and the uselessness of the Neural Network identifying the beginning of the file. It is also worth mentioning that we resampled all the songs in the SALAMI database to a single sampling rate of 44100 Hz, as shown in Table III.
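A sketch of this label-smoothing step follows. Two details are assumptions on our part: sigma is interpreted in seconds, and each Gaussian is peak-normalized to 1 so the target curve is a valid probability-like signal for the network.

```python
import numpy as np

def gaussian_label_curve(labels_s, n_frames, sr=44100, hop=1024,
                         p1=2, p2=3, gamma=50, sigma=0.1):
    """Place a Gaussian bump (peak 1, assumed) at every annotated boundary,
    expressed in pooled-and-padded frame positions (Eqs. 12-13)."""
    fps = sr / (hop * p1 * p2)                   # pooled frames per second
    mu = np.asarray(labels_s) * fps + gamma      # Eq. 12 (offset by padding)
    t = np.arange(n_frames)
    curve = np.zeros(n_frames)
    for m in mu:                                 # Eq. 13, peak-normalized
        curve = np.maximum(curve,
                           np.exp(-0.5 * ((t - m) / (sigma * fps)) ** 2))
    return curve
```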
V. MODEL

The model developed in this work for boundary detection is shown in Fig. 3. Once the matrices of the pre-processing step are obtained, they are padded and normalized to form the input of a Convolutional Neural Network (CNN). The obtained predictions are post-processed with a peak-picking and thresholding algorithm to obtain the final predictions.

Fig. 3: General block diagram of the Neural Network block in Fig. 1.
A. Convolutional Neural Network
The model proposed in this paper is nearly the same as the models proposed in [3] and [4], so that we can compare the results and contrast different input strategies as Cohen-Hadria and Peeters [22] did. The model is a CNN whose relevant parameters are shown in Table IV. The difference between this model and the ones proposed in [3] and [4] is that our final two layers are not dense layers but one-dimensional convolutional layers: we do not crop the inputs to obtain a single probability value at the output; instead, we give the Neural Network the whole matrix (so we use a batch size of 1 for training) and obtain a time prediction curve at the output. The general schema of the CNN is shown in Fig. 4.

TABLE IV: CNN architecture parameters of the schema presented in Fig. 4.
Layer | Parameters
Convolution 1 + Leaky ReLU | output feature maps: 32; kernel size: 5 x 7; stride: 1 x 1; padding: (5-1)/2 x (7-1)/2
Max-Pooling | kernel size: 5 x 3; stride: 5 x 1; padding: 1 x 1
Convolution 2 + Leaky ReLU | output feature maps: 64; kernel size: 3 x 5; stride: 1 x 1; padding: (3-1)/2 x (5-1)*3/2; dilation: 1 x 3
Convolution 3 + Leaky ReLU | output feature maps: 128; kernel size: 1 x 1; stride: 1 x 1; padding: 0 x 0
Convolution 4 | output feature maps: 1; kernel size: 1 x 1; stride: 1 x 1; padding: 0 x 0
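A PyTorch sketch of Table IV follows. How the frequency axis is collapsed before the final output is not fully specified in the text, so the reduction used here (a mean over the remaining frequency bins) and the number of input channels are assumptions.

```python
import torch
import torch.nn as nn

class BoundaryCNN(nn.Module):
    """Sketch of the CNN in Table IV; input channels and the final
    frequency-axis reduction are assumptions."""
    def __init__(self, in_channels=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, 32, kernel_size=(5, 7),
                               stride=1, padding=(2, 3))
        self.pool = nn.MaxPool2d(kernel_size=(5, 3), stride=(5, 1),
                                 padding=(1, 1))
        self.conv2 = nn.Conv2d(32, 64, kernel_size=(3, 5), stride=1,
                               padding=(1, 6), dilation=(1, 3))
        # 1x1 convolutions replace the dense layers of [3], [4], so the
        # network outputs one boundary score per time frame
        self.conv3 = nn.Conv2d(64, 128, kernel_size=1)
        self.conv4 = nn.Conv2d(128, 1, kernel_size=1)
        self.act = nn.LeakyReLU()

    def forward(self, x):                 # x: [batch, channels, bins, frames]
        x = self.pool(self.act(self.conv1(x)))
        x = self.act(self.conv2(x))
        x = self.act(self.conv3(x))
        x = self.conv4(x)                 # [batch, 1, bins', frames]
        return x.mean(dim=2).squeeze(1)   # [batch, frames] logits
```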
B. Training Parameters
We trained our CNN with Binary Cross-Entropy (BCEWithLogitsLoss in PyTorch [47]) as the loss function, which includes a sigmoid activation function at the end of the Neural Network, a learning rate of 0.001 and the Adam optimizer [48]. We used batches of size 1 because we provide the network with whole songs of different durations. The models were trained on an Nvidia GTX 980 Ti GPU, and we used TensorboardX [49] to graph the loss and F-score of training and validation in real time.
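A minimal sketch of this training setup; the data loader and the tensor shapes are assumptions.

```python
import torch
import torch.nn as nn

model = BoundaryCNN()                        # sketch from the previous listing
criterion = nn.BCEWithLogitsLoss()           # sigmoid applied inside the loss
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

for inputs, target_curve in train_loader:    # assumed loader; batch size 1,
    optimizer.zero_grad()                    # one whole song per batch
    logits = model(inputs)                   # [1, frames] boundary logits
    loss = criterion(logits, target_curve)   # target: Gaussian label curve
    loss.backward()
    optimizer.step()
```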
C. Peak-Picking
Peak-picking consists of selecting the peaks of the output signal of the CNN that will be identified as boundaries of the different parts of the song. A boundary in the output signal is considered valid when no other boundary is detected within 6 seconds. The application of a threshold helps us discard boundary values that are not higher than an optimum threshold, which has been determined by testing the pieces for each threshold value in [0, 1]; the optimum threshold is the one for which the F-score is highest. When training different inputs, the threshold may vary. We set a threshold of 0.205 for the MLS-only input, as shown in Fig. 5, and the thresholds for the rest of our input variants are shown in Table VI. From the optimum threshold calculation we can observe that almost all optimum threshold values for each input variant lie in [0.2, 0.3]. Fig. 5 shows the Recall, Precision and F-score values (see Section VI-A) of the testing dataset evaluated for each possible threshold value. Note that the Precision curve should increase until the threshold reaches a value of 1 but, as the peaks of our output signals do not reach this value, we observe a decrease of the Precision when the threshold exceeds 0.7.

Fig. 4: Schema of the Convolutional Neural Network implemented. The main parameters are presented in Table IV.

Fig. 5: Threshold calculation on the test set after 180 epochs of training with only the MLS input.
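A sketch of the peak-picking step described above, using scipy's peak picker; the frame-rate constants follow Table III, the 6-second separation rule is applied through the distance argument, and the gamma padding offset is ignored here for simplicity.

```python
import numpy as np
from scipy.signal import find_peaks

def pick_boundaries(pred_curve, threshold=0.205, sr=44100, hop=1024,
                    pool=6, min_sep_s=6.0):
    """Keep peaks of the CNN output above the optimum threshold, with no
    other boundary allowed within 6 seconds."""
    fps = sr / (hop * pool)                        # pooled frames per second
    peaks, _ = find_peaks(pred_curve, height=threshold,
                          distance=max(1, int(min_sep_s * fps)))
    return peaks / fps                             # boundary times in seconds
```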
VI. EXPERIMENTS AND RESULTS
A. Evaluation Metrics
MIREX campaigns use two evaluation measures: the Median Deviation and the Hit Rate. The Hit Rate (also called F-score or F-measure) is denoted by $F_\beta$, where $\beta = 1$ is the value most frequently used in previous works. Nieto et al. [50] proposed a value of $\beta$ = 0.58, but $F_1$ remains the most used metric in MIREX works; we later give our results for both $\beta$ values. The Hit Rate is normally evaluated for ±0.5 s and ±3 s time-window tolerances, but in recent works most results are given only for the ±0.5 s tolerance, which is the most restrictive one. We test our model with the MIREX algorithm [51], which gives us the Precision, Recall and F-measure parameters:

$$\mathrm{Precision:} \quad P = \frac{TP}{TP + FP} \qquad (14)$$

$$\mathrm{Recall:} \quad R = \frac{TP}{TP + FN} \qquad (15)$$

$$\mathrm{F\!-\!measure:} \quad F_\beta = \frac{(1 + \beta^2)\, P \cdot R}{\beta^2 \, P + R} \qquad (16)$$

where:

• TP: True Positives. Estimated events of a given class that start and end at the same temporal positions as reference events of the same class, taking into account a tolerance time-window.

• FP: False Positives. Estimated events of a given class that start and end at temporal positions where no reference event of the same class does, taking into account a tolerance time-window.

• FN: False Negatives. Reference events of a given class that start and end at temporal positions where no estimated event of the same class does, taking into account a tolerance time-window.
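Ref. [51] is the mir_eval library, whose boundary detection scorer implements Eqs. 14-16. A usage sketch with hypothetical boundary times:

```python
import numpy as np
import mir_eval

# Hypothetical reference and estimated boundary times, in seconds
ref_bounds = np.array([0.0, 12.3, 45.6, 80.2])
est_bounds = np.array([0.0, 11.9, 46.1, 79.8])

# mir_eval scores segmentations, so boundaries are wrapped into intervals
ref_intervals = mir_eval.util.boundaries_to_intervals(ref_bounds)
est_intervals = mir_eval.util.boundaries_to_intervals(est_bounds)

# Hit Rate at the strict +-0.5 s tolerance (Eqs. 14-16 with beta = 1)
P, R, F = mir_eval.segment.detection(ref_intervals, est_intervals,
                                     window=0.5, beta=1.0)
print(f"P={P:.3f} R={R:.3f} F={F:.3f}")
```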
B. Results
TABLE V: Results of boundary estimation according to different pooling strategies, distances and audio features, for a ±0.5 s tolerance and a threshold of 0.205.

Tolerance: ±0.5 s, Threshold: 0.205
Input | Epochs | P | R | F1
MLS | 180 | 0.501 | 0.359 | 0.389
SSLM (MFCCs, Euclidean, 6pool) | 180 | 0.472 | 0.318 | 0.361
SSLM (MFCCs, cosine, 6pool) | 180 | 0.477 | 0.311 | 0.355
SSLM (chromas, Euclidean) | 180 | 0.560 | 0.228 | 0.297
SSLM (chromas, cosine) | 180 | 0.508 | 0.254 | 0.312
SSLM (MFCCs, Euclidean, 2pool3) | 120 | 0.422 | 0.369 | 0.375
SSLM (MFCCs, cosine, 2pool3) | 120 | 0.418 | 0.354 | 0.366
Previous works:
MLS [3] | - | 0.555 | 0.458 | 0.465
SSLM (MFCCs, cosine) [4] | - | - | - | 0.430
1) Pooling Strategy: We first train the Neural Network with each input matrix (see Fig. 3) separately, in order to know which input performs better. We train the MLS and the SSLMs obtained from MFCCs and chromas, applying Euclidean and cosine distances, and we also give the results for both of the pooling strategies mentioned before: 6pool (lower resolution) and 2pool3 (higher resolution). As mentioned in Section IV, we removed the first label of the SALAMI text files, corresponding to the 0.0 s label. Results in terms of F-score, Precision and Recall are shown in Table V. Note that the results shown from previous works used a different threshold value.

In Fig. 6 we show an example of the boundary detection results for some of our input variants on the MLS and SSLMs.
TABLE VI: Results of boundary estimation with tolerance ±0.5 s and the optimum threshold, in terms of F-score, Precision and Recall. Note that results from previous works did not use the same threshold value.

Tolerance: ±0.5 s
Input | Train Database | Epochs | Thresh. | P | R | F1 (std) | F0.58
MLS + SSLM (MFCCs, Euclidean) | SALAMI | 140 | 0.24 | 0.441 | 0.415 | 0.402 (0.163) | 0.414
MLS + SSLM (MFCCs, cosine) | SALAMI | 140 | 0.24 | 0.428 | 0.407 | 0.396 (0.158) | 0.404
MLS + (SSLM (MFCCs, Euclidean) + SSLM (chromas, Euclidean)) | SALAMI | 100 | 0.24 | 0.465 | 0.400 | 0.407 (0.160) | 0.419
MLS + (SSLM (MFCCs, cosine) + SSLM (chromas, cosine)) | SALAMI | 100 | 0.24 | 0.444 | 0.416 | 0.404 (0.166) | 0.417
MLS + (SSLM (MFCCs, Euclidean) + SSLM (MFCCs, cosine)) | SALAMI | 100 | 0.24 | 0.445 | 0.421 | 0.409 (0.173) | 0.416
MLS + (SSLM (chromas, Euclidean) + SSLM (chromas, cosine)) | SALAMI | 100 | 0.24 | 0.457 | 0.396 | 0.400 (0.157) | 0.420
MLS + (SSLM (chromas, Euclidean) + SSLM (chromas, cosine) + SSLM (MFCCs, Euclidean) + SSLM (MFCCs, cosine)) | SALAMI | 100 | 0.26 | 0.526 | 0.374 | 0.411 (0.169) | -
Previous works:
MLS + SSLM (MFCCs, cosine) [4] (2015) | Private | - | - | 0.646 | 0.484 | 0.523 | 0.596
MLS + SSLM (MFCCs, cosine) [22] (2017) | SALAMI | - | - | 0.279 | 0.300 | 0.273 (0.132) | -
MLS + (SSLM (MFCCs, cosine) + SSLM (chromas, cosine)) [22] (2017) | SALAMI | - | - | 0.470 | 0.225 | 0.291 (0.120) | -

We obtained lower results than [4] but higher results than [22], who tried to re-implement [4]. The reason for this difference could be that the database used by Grill and Schlüter [4] to train their model had 733 non-public pieces. Cohen-Hadria and Peeters [22], as in our work, trained their model only with pieces from the SALAMI database, so our results can be compared with theirs, since we trained, validated and tested our Neural Network with the same database (although they had 732 SALAMI pieces and we had 1006).

In view of the results in Table V, we can affirm that doing a max-pooling of 2, then computing the SSLMs and doing another max-pooling of 3 afterwards (2pool3) improves the results, but only slightly. This procedure not only takes much more time to compute the SSLMs, but the training also takes much longer, and it does not yield a sufficiently better F-score to justify the cost.
2) Input Combinations: With the highest-scoring inputs in Table V, we build combinations of them as in [4] and later [22]. A summary of our results can be found in Table VI. The input combination that performed best in [22] was MLS + (SSLM (MFCCs, cosine) + SSLM (chromas, cosine)), for which F = 0.291. We overcome that result for the same combination of inputs, obtaining F = 0.404. In spite of that, and of the statement in [4] that the cosine distance performs better than the Euclidean one, we found that the best-performing input combination is MLS + (SSLM (chromas, Euclidean) + SSLM (chromas, cosine) + SSLM (MFCCs, Euclidean) + SSLM (MFCCs, cosine)), for which F = 0.411. There is not a huge improvement in the F-measure, but it is still our best result.
In this work we have developed a comparative study to establish the best-performing way to compute the inputs of a Convolutional Neural Network that identifies boundaries in music pieces, by combining diverse methods of generating the SSLMs. We demonstrate that computing a max-pooling of factor 6 at the beginning of the process not only takes much less time, but also makes the training of the Neural Network faster, and it does not affect the results as much as might be thought. We also give the best-performing combination of SSLM inputs, outperforming the results given in [22]. Although we could not reach the results of [3] and [4] with nearly the same model, we outperform the results of [22], who also tried to re-implement the model described in the previous literature. It must be highlighted that [3] and [4] had at their disposal a private dataset of 733 pieces that they used for training the model, whereas in this paper the model has been trained only with the publicly available SALAMI 2.0 dataset; our results may be lower because of that, but they outperform other works that also trained their models with only the publicly available SALAMI 2.0 database.

As mentioned, the results obtained in this work improve those presented previously when the database used is the same. However, the accuracy in obtaining the boundaries of musical pieces is still relatively low and, to some extent, difficult to use. This makes it necessary, on the one hand, to continue studying different methods that allow a correct structural analysis of music and, on the other hand, to obtain databases that are properly labeled and contain a high number of musical pieces. In any case, the results obtained are promising and allow us to adequately set out the bases for future work.
Fig. 6: Boundary predictions using the CNN on different inputs, obtained from "Live at LaBoca on 2007-09-28" by DayDrug, corresponding to song 1358 of the SALAMI 2.0 database. (a) CNN predictions on the MLS. (b) CNN predictions on the SSLM calculated with MFCCs and Euclidean distance with 2pool3 (best-performing SSLM input in terms of F-measure); in this case F = 0.486 for a ±0.5 s tolerance and F = 0.686 for a ±3 s tolerance. (c) CNN predictions on MLS + (SSLM (MFCCs, Euclidean) + SSLM (MFCCs, cosine)); in this case F = 0.75. The ground truth from the SALAMI annotations are the Gaussians in red, the model prediction is the white curve and the threshold is the horizontal yellow line. Note that the predictions have been rescaled in order to plot them on the MLS and SSLM images. All these images have been padded according to what is explained in the previous paragraphs and then normalized to zero mean and unit variance.

REFERENCES

[1] N. Cook, A Guide to Musical Analysis. Oxford University Press, USA, 1994.
[2] M. Müller, Fundamentals of Music Processing: Audio, Analysis, Algorithms, Applications. Springer, 2015.
[3] K. Ullrich, J. Schlüter, and T. Grill, "Boundary detection in music structure analysis using convolutional neural networks," in ISMIR, pp. 417–422, 2014.
[4] T. Grill and J. Schlüter, "Music boundary detection using neural networks on spectrograms and self-similarity lag matrices," in Proc. 23rd European Signal Processing Conference (EUSIPCO), pp. 1296–1300, IEEE, 2015.
[5] T. Grill and J. Schlüter, "Music boundary detection using neural networks on combined features and two-level annotations," in ISMIR, pp. 531–537, 2015.
[6] J. Foote, "Visualizing music and audio using self-similarity," in Proceedings of the Seventh ACM International Conference on Multimedia (Part 1), pp. 77–80, 1999.
[7] M. Goto, "A chorus-section detecting method for musical audio signals," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 5, pp. V-437, IEEE, 2003.
[8] T. Zhang and C.-C. J. Kuo, "Heuristic approach for generic audio data segmentation and annotation," in Proceedings of the Seventh ACM International Conference on Multimedia (Part 1), pp. 67–76, 1999.
[9] J. Paulus and A. Klapuri, "Music structure analysis with a probabilistic fitness function in MIREX2009," in Proc. of the Fifth Annual Music Information Retrieval Evaluation eXchange, Kobe, Japan, Oct. 2009. Extended abstract.
[10] M. Mauch, K. C. Noland, and S. Dixon, "Using musical structure to enhance automatic chord transcription," in ISMIR, pp. 231–236, 2009.
[11] G. Sargent, S. A. Raczynski, F. Bimbot, E. Vincent, and S. Sagayama, "A music structure inference algorithm based on symbolic data analysis," MIREX - ISMIR 2011, Oct. 2011. Poster.
[12] F. Kaiser, T. Sikora, and G. Peeters, "MIREX 2012 music structural segmentation task: IRCAMstructure submission," Proceedings of the Music Information Retrieval Evaluation eXchange (MIREX), 2012.
[13] B. McFee and D. Ellis, "DP1, MP1, MP2 entries for MIREX 2013 structural segmentation and beat tracking," Proceedings of the Music Information Retrieval Evaluation eXchange (MIREX), 2013.
[14] O. Nieto and J. P. Bello, "MIREX 2014 entry: 2D Fourier magnitude coefficients," Proceedings of the Music Information Retrieval Evaluation eXchange (MIREX), 2014.
[15] C. Cannam, E. Benetos, M. Mauch, M. E. Davies, S. Dixon, C. Landone, K. Noland, and D. Stowell, "MIREX 2015: Vamp plugins from the Centre for Digital Music," Proceedings of the Music Information Retrieval Evaluation eXchange (MIREX), 2015.
[16] O. Nieto, "MIREX: MSAF v0.1.0 submission," 2016.
[17] J. Schlüter, K. Ullrich, and T. Grill, "Structural segmentation with convolutional neural networks MIREX submission," Tenth running of the Music Information Retrieval Evaluation eXchange (MIREX 2014), 2014.
[18] T. Grill and J. Schlüter, "Structural segmentation with convolutional neural networks MIREX submission," Proceedings of the Music Information Retrieval Evaluation eXchange (MIREX), p. 3, 2015.
[19] D. Turnbull, G. R. Lanckriet, E. Pampalk, and M. Goto, "A supervised approach for detecting boundaries in music using difference features and boosting," in ISMIR, pp. 51–54, 2007.
[20] G. Sargent, F. Bimbot, and E. Vincent, "A regularity-constrained Viterbi algorithm and its application to the structural segmentation of songs," 2011.
[21] J. Schlüter and S. Böck, "Improved musical onset detection with convolutional neural networks," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6979–6983, IEEE, 2014.
[22] A. Cohen-Hadria and G. Peeters, "Music structure boundaries estimation using multiple self-similarity matrices as input depth of convolutional neural networks," in Audio Engineering Society Conference: 2017 AES International Conference on Semantic Audio, Audio Engineering Society, 2017.
[23] J. Paulus, M. Müller, and A. Klapuri, "State of the art report: Audio-based music structure analysis," in ISMIR, pp. 625–636, Utrecht, 2010.
[24] J. Foote, "Automatic audio segmentation using a measure of audio novelty," in Proc. IEEE International Conference on Multimedia and Expo (ICME), vol. 1, pp. 452–455, IEEE, 2000.
[25] F. Kaiser and G. Peeters, "Multiple hypotheses at multiple scales for audio novelty computation within music," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 231–235, IEEE, 2013.
[26] B. Logan and S. Chu, "Music summarization using key phrases," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 2, pp. II749–II752, IEEE, 2000.
[27] J.-J. Aucouturier and M. Sandler, "Segmentation of musical signals using hidden Markov models," in Proc. 110th Convention of the Audio Engineering Society, 2001.
[28] M. Levy and M. Sandler, "Structural segmentation of musical audio by constrained clustering," IEEE Transactions on Audio, Speech, and Language Processing, vol. 16, no. 2, pp. 318–326, 2008.
[29] C. J. Tralie and B. McFee, "Enhanced hierarchical music structure annotations via feature level similarity fusion," in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 201–205, IEEE, 2019.
[30] B. McFee and J. P. Bello, "Structured training for large-vocabulary chord recognition," in ISMIR, pp. 188–194, 2017.
[31] L. Lu, M. Wang, and H.-J. Zhang, "Repeating pattern discovery and structure analysis from acoustic music data," in Proceedings of the 6th ACM SIGMM International Workshop on Multimedia Information Retrieval, pp. 275–282, 2004.
[32] J. Paulus and A. Klapuri, "Music structure analysis by finding repeated parts," in Proceedings of the 1st ACM Workshop on Audio and Music Computing Multimedia, pp. 59–68, 2006.
[33] M. C. McCallum, "Unsupervised learning of deep features for music segmentation," in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 346–350, IEEE, 2019.
[34] S. Dieleman and B. Schrauwen, "End-to-end learning for music audio," in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6964–6968, IEEE, 2014.
[35] J.-P. Eckmann, S. O. Kamphorst, and D. Ruelle, "Recurrence plots of dynamical systems," Europhysics Letters, vol. 5, pp. 973–977, 1987.
[36] J. Serrà, M. Müller, P. Grosche, and J. L. Arcos, "Unsupervised music structure annotation by time series structure features and segment similarity," IEEE Transactions on Multimedia, vol. 16, no. 5, pp. 1229–1240, 2014.
[37] J. S. Downie, A. F. Ehmann, M. Bay, and M. C. Jones, "The Music Information Retrieval Evaluation eXchange: Some observations and insights," in Advances in Music Information Retrieval, pp. 93–115, Springer, 2010.
[38] J. S. Downie, "The Music Information Retrieval Evaluation eXchange (2005–2007): A window into music information retrieval research," Acoustical Science and Technology, vol. 29, no. 4, pp. 247–255, 2008.
[39] M. Goto et al., "Development of the RWC music database," in Proceedings of the 18th International Congress on Acoustics (ICA 2004), vol. 1, pp. 553–556, 2004.
[40] M. Goto et al., "AIST annotation for the RWC music database," in ISMIR, pp. 359–360, 2006.
[41] F. Bimbot, E. Deruty, G. Sargent, and E. Vincent, "Methodology and conventions for the latent semiotic annotation of music structure," 2012.
[42] A. F. Ehmann, M. Bay, J. S. Downie, I. Fujinaga, and D. De Roure, "Music structure segmentation algorithm evaluation: Expanding on MIREX 2010 analyses and datasets," in ISMIR, pp. 561–566, 2011.
[43] J. B. Smith and E. Chew, "A meta-analysis of the MIREX structure segmentation task," in Proc. of the 14th International Society for Music Information Retrieval Conference, Curitiba, Brazil, vol. 16, pp. 45–47, 2013.
[44] B. McFee, C. Raffel, D. Liang, D. P. Ellis, M. McVicar, E. Battenberg, and O. Nieto, "librosa: Audio and music signal analysis in Python," in Proceedings of the 14th Python in Science Conference, vol. 8, 2015.
[45] J. Serrà, M. Müller, P. Grosche, and J. L. Arcos, "Unsupervised detection of music boundaries by time series structure features," in Twenty-Sixth AAAI Conference on Artificial Intelligence, 2012.
[46] J. B. L. Smith, J. A. Burgoyne, I. Fujinaga, D. De Roure, and J. S. Downie, "Design and creation of a large-scale database of structural annotations," in ISMIR, vol. 11, pp. 555–560, Miami, FL, 2011.
[47] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, "Automatic differentiation in PyTorch," 2017.
[48] D. P. Kingma and J. L. Ba, "Adam: A method for stochastic optimization," in ICLR: International Conference on Learning Representations, 2015.
[49] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, "TensorFlow: Large-scale machine learning on heterogeneous systems," 2015. Software available from tensorflow.org.
[50] O. Nieto, M. M. Farbood, T. Jehan, and J. P. Bello, "Perceptual analysis of the F-measure for evaluating section boundaries in music," in Proceedings of the 15th International Society for Music Information Retrieval Conference (ISMIR 2014), pp. 265–270, 2014.
[51] C. Raffel, B. McFee, E. J. Humphrey, J. Salamon, O. Nieto, D. Liang, and D. P. Ellis, "mir_eval: A transparent implementation of common MIR metrics," in Proceedings of the 15th International Society for Music Information Retrieval Conference (ISMIR 2014), 2014.