Deep Convolutional and Recurrent Networks for Polyphonic Instrument Classification from Monophonic Raw Audio Waveforms

Kleanthis Avramidis∗, Agelos Kratimenos∗, Christos Garoufis, Athanasia Zlatintsi and Petros Maragos
School of ECE, National Technical University of Athens, 15773 Athens, Greece
Robot Perception and Interaction Unit, Athena Research Center, 15125 Maroussi, Greece
[email protected], [email protected], cgaroufi[email protected], {nzlat, maragos}@cs.ntua.gr
∗ The first two authors contributed equally.

ABSTRACT
Sound Event Detection and Audio Classification tasks are traditionally addressed through time-frequency representations of audio signals such as spectrograms. However, the emergence of deep neural networks as efficient feature extractors has enabled the direct use of audio signals for classification purposes. In this paper, we attempt to recognize musical instruments in polyphonic audio by only feeding their raw waveforms into deep learning models. Various recurrent and convolutional architectures incorporating residual connections are examined and parameterized in order to build end-to-end classifiers with low computational cost and only minimal preprocessing. We obtain competitive classification scores and useful instrument-wise insight through the IRMAS test set, utilizing a parallel CNN-BiGRU model with multiple residual connections, while maintaining a significantly reduced number of trainable parameters.
Index Terms — Raw Waveforms, End-to-End Learning, Polyphonic Music, Instrument Classification
1. INTRODUCTION
Waveforms are abstract representations of sound waves and, when recorded, they constitute convoluted signals that incorporate noise from the complexity of the recorded sound event, the acoustic scene, as well as the recording equipment. Complex sound events such as spoken dialogues or simultaneously playing musical instruments (i.e. polyphonic music) make it challenging to extract meaningful information. Thus, audio classification tasks traditionally discard direct waveform modeling in favor of richer time-frequency feature representations [1]. In fields like Speech Recognition [2] and Music Information Retrieval [3], such methods take advantage of the discriminative information of the signals' spectra, which is aligned with the human auditory system.

In Instrument Classification in particular, there is strong intuition towards utilizing frequency-related representations, since musical notes and instruments are closely associated with specific frequency events. Thus, most research works in the field incorporate spectrograms in their analysis. It is, however, challenging and computationally expensive to design specialized feature representations for each different recognition task, especially when contemporary deep learning models emerge as strong feature extractors for end-to-end classification. In this paper we address this challenge by parameterizing deep recurrent and convolutional networks to model raw audio waveforms efficiently.
Fig. 1. Intermediate activation of the Residual FCN Model (Sec. 3.2) for a 1-sec piano sample.

Our analysis concentrates on handling the high input dimensionality of the waveforms, mining temporal features and preserving their low-level spatial locality, while reducing the computational cost of the process. We propose a lightweight end-to-end classifier for Instrument Classification that shows comparable performance to state-of-the-art spectrogram-based architectures, including our previous work on the task [4].

The rest of the paper is organized as follows: Sec. 2 provides a review of related research in Audio Signal Processing using raw waveforms, as well as in Instrument Classification. The architectures used throughout our experiments are analyzed in Sec. 3. Sec. 4 describes the experimental setup, the dataset and the evaluation methods, whereas in Sec. 5 we discuss the results of our experiments. Finally, in Sec. 6 we present our conclusions as well as propose further directions for future work.
2. RELATED WORK
Deep neural networks, which have achieved state-of-the-art performance in audio recognition [5] by operating directly in the time domain, have blurred the line between representation learning and predictive modeling. Convolutional networks in particular have shown competitive performance, in some cases matching that of classical spectrogram-based models [6]. In speech analysis and synthesis, WaveNet [7] is a benchmark model fed with audio waveforms, whereas [8] achieves robust representation learning by utilizing WaveNet Autoencoders that use waveforms directly as input. In Music Information Retrieval, a number of works have attempted to acquire high-level features like melody and pitch [9], while waveform-based architectures have also recorded competitive results in music [10] and speech [11] source separation.

As far as Instrument Classification is concerned, until recently the majority of works utilized time-frequency representations and datasets of solo recordings or excerpt-level annotations (e.g. IRMAS [12], MedleyDB [13]). While traditional research, partially due to the challenge of labeling multi-instrumental music, focused upon monophonic audio [14], recent studies address polyphonic tasks, relying on the efficiency of deep learning models. Specific points of focus include the investigation of the optimal input temporal resolution [15, 16] and the design of the convolutional filters involved [17], while we have also experimented with sophisticated augmentation methods, attempting to isolate timbre-like characteristics [4].

Fig. 2. The DCNN, FCN and RFCN architectures used in the experimental evaluation.
3. ARCHITECTURES

3.1. Recurrent Networks
Recurrent neural networks (RNN) have been widely used in waveform modeling and classification, thanks to their ability to model long-range temporal dependencies. In our baseline network we employ the Bidirectional Gated Recurrent Unit (BiGRU). GRU architectures have shown comparable performance to Long Short-Term Memory (LSTM) units in processing audio sequences [18], while having a less complex structure and lower computational cost. Moreover, Bidirectional units consider both past and future audio context, which intuitively assists our task.
Table 1. BiGRU Architecture Configurations.
Number of Layers    Number of Units
1                   128
1                   256
2                   128, 64

Fig. 3. The utilized CNN cell structure.

Regarding this recurrent module, we experiment with the optimal number of layers and utilized units. Specifically, we trained 3 different networks: one with 1 BiGRU layer of 128 units, one with 1 BiGRU layer of 256 units and one with 2 BiGRU layers of 128 and 64 units respectively, as shown in Table 1. We used a fully connected layer to produce the models' output, as well as a Dropout layer right before it, to enhance the models' ability to generalize. In order to effectively reduce the high dimensionality of the waveforms, we apply max pooling at the input level, with a pool size of 3 samples.
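The paper does not state an implementation framework; the following PyTorch sketch illustrates the best Table 1 configuration (2 BiGRU layers with 128 and 64 units) with the input max pooling of size 3, a Dropout layer and a dense sigmoid output. The dropout rate and the use of the last time step as the sequence summary are assumptions.

import torch
import torch.nn as nn

class BiGRUClassifier(nn.Module):
    """Baseline recurrent classifier (Sec. 3.1), assumed PyTorch implementation."""
    def __init__(self, n_classes=11, dropout=0.3):
        super().__init__()
        self.pool = nn.MaxPool1d(kernel_size=3)       # 22050 samples -> 7350 time steps
        self.gru1 = nn.GRU(1, 128, batch_first=True, bidirectional=True)
        self.gru2 = nn.GRU(256, 64, batch_first=True, bidirectional=True)
        self.drop = nn.Dropout(dropout)               # assumed rate
        self.fc = nn.Linear(2 * 64, n_classes)

    def forward(self, x):                             # x: (batch, 22050) raw waveform
        x = self.pool(x.unsqueeze(1)).transpose(1, 2) # (batch, 7350, 1)
        x, _ = self.gru1(x)
        x, _ = self.gru2(x)                           # (batch, 7350, 128)
        x = self.drop(x[:, -1, :])                    # last time step as summary (assumption)
        return torch.sigmoid(self.fc(x))              # 11 independent instrument scores

model = BiGRUClassifier()
scores = model(torch.randn(4, 22050))                 # -> (4, 11)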
3.2. Convolutional Networks

Convolutional Neural Networks (CNN) traditionally operate on images [19] or, in Audio Classification, on time-frequency representations like spectrograms [15], being an extremely coherent feature extraction method for Deep Learning. However, they can provide useful results when applied to 1D signals as well [20]. We base our CNN on the architecture used in [4], which yielded strong results on the IRMAS Dataset. In order to adjust to the different input form, we substitute 2D with 1D Convolutional layers and fine-tune the pooling parameters to handle the high input dimensionality. The CNN cell structure is composed of 2 stacked convolutional layers with the same number of filters, a Batch Normalization layer that enhances the training efficiency [21], a Leaky ReLU activation and a max pooling layer with kernels of varied length (Fig. 3). We place 5 CNN cells in a row with 16, 32, 64, 64 and 32 filters respectively, using kernel and pooling sizes equal to 3 for the first three layers and 5 for the rest. This module is followed by two fully connected layers (denoted as DCNN), which substantially increases the number of its trainable parameters. We thus experiment with removing all the fully connected layers to form a Fully Convolutional Network (FCN). The 11-D output vector is then estimated through an additional convolutional layer of unit kernel, followed by Global Average Pooling. The adoption of FCNs in modeling raw audio waveforms sharply reduces the number of parameters that we need to train, while it can force the network to learn meaningful features in its hidden layers, which keep their temporal locality throughout the architecture [20]. The final configuration is a Residual FCN (RFCN), where we embed skip connections into the previous model, as shown in Fig. 2. Through residual connections the model is able to propagate low-level features throughout the network.

Table 2. Results for the Recurrent Networks discussed in Sec. 3.1, subject to the number of GRU layers and the number of units.

Table 3. Results for the Densely Connected Neural Network (DCNN), the Fully Convolutional Network (FCN) and the Residual FCN (RFCN) discussed in Sec. 3.2.
Models    F1-micro %    F1-macro %    LRAP %
DCNN      55.32
FCN
RFCN
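A minimal PyTorch sketch of the CNN cell of Fig. 3 and the resulting RFCN is given below. Filter counts, kernel and pooling sizes are those listed above; the same-length padding and the way the skip paths are projected to match channels and temporal resolution (1x1 convolutions with pooling) are assumptions, since Fig. 2 does not fix these details.

import torch
import torch.nn as nn

def cnn_cell(in_ch, out_ch, size):
    """One CNN cell (Fig. 3): two stacked 1D convolutions with the same number
    of filters, Batch Normalization, Leaky ReLU and max pooling."""
    return nn.Sequential(
        nn.Conv1d(in_ch, out_ch, kernel_size=size, padding=size // 2),
        nn.Conv1d(out_ch, out_ch, kernel_size=size, padding=size // 2),
        nn.BatchNorm1d(out_ch),
        nn.LeakyReLU(0.01),
        nn.MaxPool1d(size),
    )

class RFCN(nn.Module):
    """Residual FCN (Sec. 3.2): five cells with 16/32/64/64/32 filters
    (kernel and pool sizes 3, 3, 3, 5, 5), a unit-kernel convolution down to
    11 classes and Global Average Pooling. Skip projections are assumptions."""
    def __init__(self, n_classes=11):
        super().__init__()
        self.c1 = cnn_cell(1, 16, 3)
        self.c2 = cnn_cell(16, 32, 3)
        self.c3 = cnn_cell(32, 64, 3)
        self.c4 = cnn_cell(64, 64, 5)
        self.c5 = cnn_cell(64, 32, 5)
        # assumed skip projections: match channels and temporal resolution
        self.skip13 = nn.Sequential(nn.Conv1d(16, 64, 1), nn.MaxPool1d(9))
        self.skip24 = nn.Sequential(nn.Conv1d(32, 64, 1), nn.MaxPool1d(15))
        self.head = nn.Conv1d(32, n_classes, kernel_size=1)   # unit-kernel conv

    def forward(self, x):                       # x: (batch, 1, 22050)
        h1 = self.c1(x)
        h2 = self.c2(h1)
        h3 = self.c3(h2) + self.skip13(h1)      # 1st cell -> 3rd cell
        h4 = self.c4(h3) + self.skip24(h2)      # 2nd cell -> 4th cell
        h5 = self.c5(h4)
        y = self.head(h5).mean(dim=-1)          # Global Average Pooling over time
        return torch.sigmoid(y)                 # (batch, 11)

Dropping the FCN head (self.head plus the pooling) and appending two dense layers instead recovers the DCNN variant described above.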
3.3. Combined Convolutional-Recurrent Networks

The utilized feature extraction and prediction methods are capable of learning different types of features. It has been demonstrated [5] that convolutional nets concentrate on spatial features and, in the context of waveforms, on temporally local correlations, while recurrent ones are useful in modeling longer-term temporal structure. We can therefore expect that a combined Convolutional-Recurrent Neural Network (CRNN) would further enhance the performance of the highlighted architectures, so we attach the best performing BiGRU model to our RFCN model. In order to keep the temporal resolution of the RNN at a feasible magnitude, we experiment by embedding it after the 2nd, the 3rd and the 4th CNN cell, as well as by feeding the CNN output to the recurrent units. Specifically, the embedded model takes the output of the corresponding CNN cell, and its output is reduced to the number of classes through an additional convolution with a unit kernel, followed by Global Average Pooling. The two 11-D vectors are then averaged before the Sigmoid activation. In this way we empirically search for the optimal way of integrating the recurrent model's information into a robust classifier.
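Building on the assumed RFCN sketch above, the following shows how the BiGRU branch could be attached after the 3rd CNN cell, reduced to 11 classes with a unit-kernel convolution and Global Average Pooling, and averaged with the convolutional logits before the sigmoid; the class name and exact wiring are illustrative, not the authors' code.

import torch
import torch.nn as nn

class CRNN3(nn.Module):
    """Combined model (Sec. 3.3), BiGRU branch after the 3rd CNN cell.
    Reuses the RFCN sketch defined above; hypothetical naming."""
    def __init__(self, n_classes=11):
        super().__init__()
        self.backbone = RFCN(n_classes)
        self.gru1 = nn.GRU(64, 128, batch_first=True, bidirectional=True)
        self.gru2 = nn.GRU(256, 64, batch_first=True, bidirectional=True)
        self.rnn_head = nn.Conv1d(2 * 64, n_classes, kernel_size=1)

    def forward(self, x):                                   # x: (batch, 1, 22050)
        b = self.backbone
        h1 = b.c1(x); h2 = b.c2(h1)
        h3 = b.c3(h2) + b.skip13(h1)                        # branch point: 3rd CNN cell
        h4 = b.c4(h3) + b.skip24(h2)
        cnn_logits = b.head(b.c5(h4)).mean(dim=-1)          # (batch, 11)
        r, _ = self.gru1(h3.transpose(1, 2))                # BiGRU over the cell-3 feature map
        r, _ = self.gru2(r)
        rnn_logits = self.rnn_head(r.transpose(1, 2)).mean(dim=-1)
        return torch.sigmoid((cnn_logits + rnn_logits) / 2) # average the two 11-D vectors

model = CRNN3()
out = model(torch.randn(2, 1, 22050))                        # -> (2, 11)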
4. EXPERIMENTAL SETUP

4.1. Dataset & Training
The IRMAS dataset [12] is used to train and test our models, as it has been extensively researched for the task of Instrument Classification. IRMAS is divided into a training set containing 6705 audio segments of 3 seconds each and a testing set containing polyphonic tracks of various lengths. Each of the 3-sec training snippets is annotated with exactly 1 out of the 11 available predominant instruments, while the polyphonic tracks in the testing set contain 2–4 instruments. We choose to cut each track into 1-sec segments, since this temporal resolution increases the data volume and helps the model generalize better, as indicated in [15, 16]. Each waveform is then downsampled to 22.05 kHz, downmixed to mono and normalized by its root-mean-square energy. Since we are interested in classifying only raw waveforms, no further pre-processing is applied.
Table 4. Results for the combined CNN and RNN networks discussed in Sec. 3.3. The subscript denotes the layer in which the latter was connected to the former.

The training data are then partitioned into 5 subsets to perform cross-validation for each of the above-mentioned architectures. All networks were trained using binary cross-entropy loss, since the task is modeled as multi-class (11 instruments) and multi-label (instruments can co-play). The Adam optimizer [22] is used to optimize the loss function, with an initial learning rate of 0.001 and a 10% decay rate per 4 epochs of non-decreasing validation loss. The batch size is set to 64 after fine-tuning. We also perform Early Stopping by monitoring the validation loss with a patience of 7 epochs.
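The preprocessing of Sec. 4.1 maps to a few lines of standard Python; the sketch below assumes librosa for loading and resampling, and details such as the small epsilon in the normalization are implementation choices not given in the paper.

import librosa
import numpy as np

def preprocess(path, sr=22050, seg_dur=1.0):
    """Load audio downmixed to mono at 22.05 kHz, split it into non-overlapping
    1-sec segments and normalize each segment by its RMS energy (Sec. 4.1)."""
    wav, _ = librosa.load(path, sr=sr, mono=True)   # resample + downmix
    seg_len = int(sr * seg_dur)
    segments = []
    for i in range(len(wav) // seg_len):
        seg = wav[i * seg_len:(i + 1) * seg_len]
        rms = np.sqrt(np.mean(seg ** 2)) + 1e-8     # avoid division by zero
        segments.append(seg / rms)
    return np.stack(segments)                        # (n_segments, 22050)

The remaining training configuration also corresponds to standard components: binary cross-entropy loss, Adam with a 0.001 learning rate, a scheduler that decays the rate by 10% after 4 epochs of non-decreasing validation loss, batch size 64 and early stopping with patience 7.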
4.2. Evaluation

Each model is evaluated on the IRMAS test set, consisting of 2355 polyphonic music tracks ranging from 5 to 20 sec in duration. During the evaluation process, we partition each track into 1-sec segments, compute the frame-level predictions and then average them in order to extract a single track-level prediction. This method produces reliable results because each labeled instrument is always active for the whole duration of the track. For this polyphonic classification task we utilize two metrics. The first one is the F1 Score, which is widely used in many related studies [12, 16] and provides a balanced view of multi-class performance. In order to calculate an overall score, we compute the per-instrument scores and aggregate them at both micro and macro scales. The second one is Label Ranking Average Precision (LRAP), a rank-based metric proposed in [23]. LRAP is suitable for multi-label evaluation as it is threshold-independent and measures the classifier's ability to assign higher scores to the correct labels associated with each sample.
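The evaluation protocol can be sketched with scikit-learn; the segment-score averaging follows the description above, while the 0.5 decision threshold used for the F1 scores is an assumption (the paper does not state it), and LRAP is computed directly from the averaged scores.

import numpy as np
from sklearn.metrics import f1_score, label_ranking_average_precision_score

def track_prediction(segment_scores, threshold=0.5):
    """segment_scores: per-segment sigmoid outputs of one track, shape (n_segments, 11).
    Returns the averaged track-level scores and thresholded binary predictions."""
    track_scores = segment_scores.mean(axis=0)      # average the 1-sec predictions
    return track_scores, (track_scores >= threshold).astype(int)

def summarize(Y_true, Y_score, Y_pred):
    """Y_true, Y_score, Y_pred: (n_tracks, 11) arrays over the whole test set."""
    return {
        "f1_micro": f1_score(Y_true, Y_pred, average="micro"),
        "f1_macro": f1_score(Y_true, Y_pred, average="macro"),
        "lrap": label_ranking_average_precision_score(Y_true, Y_score),
    }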
5. RESULTS AND DISCUSSION

5.1. Architecture Comparison
Table 2 shows the accuracy scores for the recurrent neural models proposed in Sec. 3.1. It is clear that a simple recurrent network cannot sufficiently decode the information included in a waveform. Still, the best model emerges from a combination of two BiGRU layers with 128 and 64 units, respectively, which by far outperforms the single GRU layer. Further experiments show that adding a CNN cell before the recurrent units significantly improves the classification, indicating the efficiency of the models described in Sec. 3.3. On the other hand, as we see in Table 3, 1D convolutional models are capable of extracting the most discriminative features from raw waveforms, almost as well as 2D convolutional models that work on spectrogram inputs [16, 4]. Furthermore, removing the dense layers not only reduces the number of trainable parameters, and thus the training time, but also increases the accuracy substantially. We argue that, in the absence of a dense layer, the network generalizes better upon the information from the spatial processing. Additionally, connecting the output of the first CNN cell with the third one, and the output of the second CNN cell with the fourth, slightly increases the results with no additional cost in parameters or training time. Other types of residual connections, though, do not seem to yield consistently improved performance.

Fig. 4. Instrument-wise performance of the proposed model and the monophonic model of [4] in terms of F1-score.
Table 5. Comparison of our work with previous performances on the IRMAS Dataset.
Models                   F1-micro    F1-macro    LRAP
Bosch et al. [12]        0.503       0.432       –
Pons et al. [17]         0.589       0.516       –
Han et al. [16]          0.602       0.503       –
Kratimenos et al. [4]
To optimally combine the temporal information extracted from the RNN and the spatial characteristics already drawn from the residual network, we additionally assess the performance of the combined networks described in Sec. 3.3. Simply averaging the RNN and CNN model outputs, though, lowers the classification accuracy, something we attribute to the inadequate standalone performance of the BiGRU (see Table 2). We thus inserted the RNN model at various locations within the RFCN architecture, in search of optimal accuracy. From the scores reported in Table 4, we notice that there is no observed improvement in model performance as far as the LRAP metric is concerned. However, there is a steady increase in F1 scores, of about 2% and 4% at the micro and macro scales, respectively. It should be mentioned at this point that the combined models consist of a significantly larger number of parameters than the fully convolutional ones, while the DCNN is the model with the most parameters, despite its lower performance.

The optimal architecture is constructed by embedding the BiGRU module after the 3rd CNN cell of the residual fully convolutional model and by averaging the two models' outputs in the end. The proposed model yields an LRAP score close to 75% and F1 scores comparable to the literature [16, 4, 17] on the IRMAS Dataset. Specifically, F1-micro surpasses most studies on the task, while we observe dominant performance on the more competitive F1-macro score, in which case the model even surpasses the performance achieved in our previous work with the use of complex data augmentation [4]. These results are obtained with a significantly reduced number of trainable parameters, low training and testing time and minimal pre-processing. In order to emphasize this, we attempt to train the model utilized in [4] with a significantly reduced number of parameters, without altering the architecture. Specifically, we cut the number of filters and the final dense layer of the model. The results (Table 6) show that this network falls behind the proposed one by 6% in LRAP and by 8% on average in F1 scores.
Table 6. Comparison between the proposed architecture and the state-of-the-art after parameter reduction.
Models         F1-micro    F1-macro    LRAP     # Params
Proposed
[4] Reduced    0.520       0.458       0.689    1.20M
5.2. Instrument-wise Evaluation

To get a more thorough insight into the waveform characteristics of different instruments and how much these can assist the task of Instrument Classification, we examine the class-wise performance in terms of the F1 metric. The results are visualized in Fig. 4, along with the corresponding results obtained from Constant-Q Transform (CQT) spectrogram modeling in our previous work [4]. We clearly observe that brass instruments are recognized much better using raw waveforms, compared to CQT representations. Specifically, clarinet, flute, saxophone and trumpet achieve 14%, 7%, 13% and 9% increases in F1-score, respectively, when waveforms are considered. On the other hand, predominant instruments, i.e. electric/acoustic guitar, piano/organ and the human voice, are distinguished better through processing their CQT representation, with the highest difference observed for piano, at a 10% change in F1-score. Apart from that, the absolute performance of the instruments largely resembles the findings of the CQT-based study.
6. CONCLUSIONS
In this paper we attempt to perform polyphonic instrument classification from monophonic music data using their raw audio waveforms. We experiment with various architectures that are favourable towards waveforms, like Fully Convolutional and Residual Nets, and we also attempt to embed a recurrent module into the optimal architecture so as to fuse their discriminating information. A residual FCN-BiGRU model with a total of 1 million parameters outperforms the state-of-the-art model, which utilizes CQT spectrograms and holds 24 million parameters, in the F1-macro metric by 4%, while it is comparable in the F1-micro and LRAP metrics. A more thorough experiment on the performance of each instrument independently shows that brass instruments are identified more easily through waveforms, while predominant instruments, like the piano or the electric guitar, benefit more from time-frequency representations. Future work should therefore deal with alternative methods to exploit a recurrent neural network fed with raw waveforms, as well as with methods to enhance the recognition performance of predominant instruments. Finally, additional experiments should address how noise, which is highly incorporated within waveform signals, affects the model's capabilities.

7. REFERENCES

[1] C. C. Chatterjee, M. Mulimani, and S. G. Koolagudi, "Polyphonic sound event detection using transposed convolutional recurrent neural network," in Proc. IEEE Int'l Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2020.
[2] B. Puterka and J. Kacur, "Time window analysis for automatic speech emotion recognition," in Proc. Int'l Symposium ELMAR, 2018.
[3] S. Gururani, M. Sharma, and A. Lerch, "An attention mechanism for musical instrument recognition," in Proc. Int'l Society for Music Information Retrieval Conference (ISMIR).
[4] A. Kratimenos, K. Avramidis, C. Garoufis, A. Zlatintsi, and P. Maragos, "Augmentation methods on monophonic audio for instrument classification in polyphonic music," in Proc. European Signal Processing Conference (EUSIPCO), 2020.
[5] S. H. Bae, I. Choi, and N. S. Kim, "Acoustic scene classification using parallel combination of LSTM and CNN," in Proc. Detection and Classification of Acoustic Scenes and Events 2016 Workshop (DCASE), 2016.
[6] T. Sainath, R. J. Weiss, A. Senior, K. Wilson, and O. Vinyals, "Learning the speech front-end with raw waveform CLDNNs," in Proc. INTERSPEECH, 2015.
[7] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. W. Senior, and K. Kavukcuoglu, "WaveNet: A generative model for raw audio," in Proc. ISCA Speech Synthesis Workshop, Sept. 2016.
[8] J. Chorowski, R. J. Weiss, S. Bengio, and A. van den Oord, "Unsupervised speech representation learning using WaveNet autoencoders," IEEE/ACM Trans. on Audio, Speech, and Language Processing, vol. 27, no. 12, pp. 2041–2053, 2019.
[9] J. W. Kim, J. Salamon, P. Li, and J. P. Bello, "CREPE: A convolutional representation for pitch estimation," CoRR, vol. abs/1802.06182, 2018.
[10] D. Stoller, S. Ewert, and S. Dixon, "Wave-U-Net: A multi-scale neural network for end-to-end audio source separation," in Proc. Int'l Society for Music Information Retrieval Conference (ISMIR), 2018.
[11] Y. Luo and N. Mesgarani, "Conv-TasNet: Surpassing ideal time-frequency magnitude masking for speech separation," IEEE/ACM Trans. Audio Speech Lang. Process., vol. 27, no. 8, pp. 1256–1266, 2019.
[12] J. J. Bosch, J. Janer, F. Fuhrmann, and P. Herrera, "A comparison of sound segregation techniques for predominant instrument recognition in musical audio signals," in Proc. Int'l Society for Music Information Retrieval Conference (ISMIR), 2012.
[13] R. M. Bittner, J. Salamon, M. Tierney, M. Mauch, C. Cannam, and J. Bello, "MedleyDB: A multitrack dataset for annotation-intensive MIR research," in Proc. Int'l Society for Music Information Retrieval Conference (ISMIR), 2014.
[14] P. Herrera-Boyer, G. Peeters, and S. Dubnov, "Automatic classification of musical instrument sounds," Journal of New Music Research, vol. 32, no. 1, pp. 3–21, 2003.
[15] S. Gururani, C. Summers, and A. Lerch, "Instrument activity detection in polyphonic music using deep neural networks," in Proc. Int'l Society for Music Information Retrieval Conference (ISMIR), 2018.
[16] Y. Han, J. Kim, and K. Lee, "Deep convolutional neural networks for predominant instrument recognition in polyphonic music," IEEE/ACM Trans. on Audio, Speech, and Language Processing, vol. 25, pp. 208–221, 2017.
[17] J. Pons, O. Slizovskaia, R. Gong, E. Gómez, and X. Serra, "Timbre analysis of music audio signals with convolutional neural networks," pp. 2744–2748, 2017.
[18] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, "Empirical evaluation of gated recurrent neural networks on sequence modeling," in Proc. NIPS 2014 Workshop on Deep Learning, Dec. 2014.
[19] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," CoRR, vol. abs/1409.1556, 2015.
[20] W. Dai, C. Dai, S. Qu, J. Li, and S. Das, "Very deep convolutional neural networks for raw waveforms," in Proc. IEEE Int'l Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2017.
[21] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in Proc. Int'l Conf. on Machine Learning (ICML), vol. 37, Jul. 2015.
[22] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in Proc. Int'l Conf. on Learning Representations (ICLR), May 2015.
[23] R. E. Schapire and Y. Singer, "BoosTexter: A boosting-based system for text categorization," Machine Learning, vol. 39, pp. 135–168, 2000.